Student Project Proposal 2024

Samuel Pastva edited this page Jun 12, 2024 · 5 revisions

Python API

Currently, BBM data must be manually downloaded and managed by anyone who intends to use them. In the bioinformatics community, it is common practice to prepare wrappers/packages that allow easier access to datasets (for example, see CellCollective API, Dorothea, or PathwayCommons). Our goal is to develop a Python package with similar functionality for BBM. This Python package will be integrated into the CoLoMoTo Jupyter notebook, which is the industry standard environment for analysis of logic-based biological models.

In particular, the Python package should be at least able to:

  • Retrieve latest versions of models from the online repository.
  • Retrieve historical model versions based on the BBM "edition".
  • Retrieve models in different formats (bnet, aeon, sbml, or already parsed AEON BooleanNetwork object).
  • For models with inputs (source nodes), provide versions with fixed inputs, free inputs, or randomly sampled inputs.
  • Cache the models locally to reduce bandwidth requirements.
  • Retrieve individual models (by ID), or in bulk.
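
The caching and bulk-retrieval requirements above could be approached roughly as follows. This is only a minimal sketch: the function names `load_model` and `load_models`, the `Fetcher` signature, and the on-disk layout (one file per edition, model ID, and format) are all assumptions made for illustration, not a design that has been decided. The actual download is passed in as a callable, so the sketch does not assume any particular repository URL.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Hypothetical signature: (model_id, edition, fmt) -> model source text.
Fetcher = Callable[[str, str, str], str]


def load_model(model_id: str, edition: str, fmt: str,
               cache_dir: Path, fetch: Fetcher) -> str:
    """Return the model text, calling `fetch` only on a cache miss."""
    # One file per (edition, model, format), so that different
    # editions of the same model never overwrite each other.
    path = cache_dir / edition / f"{model_id}.{fmt}"
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(model_id, edition, fmt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text


def load_models(model_ids: Iterable[str], edition: str, fmt: str,
                cache_dir: Path, fetch: Fetcher) -> Dict[str, str]:
    """Bulk variant: a dictionary from model ID to model text."""
    return {m: load_model(m, edition, fmt, cache_dir, fetch)
            for m in model_ids}
```

Keying the cache by edition keeps historical versions reproducible, at the cost of storing a model once per edition even when it did not change.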

Furthermore, in relation to the other project goals, we would also like to consider the following integrations:

  • Retrieve pre-computed results for known analysis workflows (attractors, trap spaces, etc.) in a suitable format.
  • Search for model instances based on annotation data or variable names.

Other considerations: While the R language is also widely used in bioinformatics, as of late, it appears to be rarely used in logic-based modeling. As such, we currently only intend to provide a Python API for BBM, but may consider other wrapper libraries in the future based on language popularity.

BMA model format

AEON currently supports three major Boolean network formats: bnet, sbml, and our own aeon format. All of these work by storing Boolean expressions of the network update functions. There are other less common formats, like boolean-net, that also use Boolean expressions and are reasonably easy to translate into the supported formats. As such, we do not consider these a priority at the moment. However, there is also the BMA format, which is fairly old at this point, but uses a substantially different paradigm to represent logic-based models: rational arithmetic expressions. Consequently, it has been rather difficult to translate between BMA and other common formats. Importantly, BMA still appears to be in use and new models are still being published in this format, so it makes sense to consider it. Finally, some preliminary work has been done by the trap-mvn Python package to build an independent interpreter of the BMA expression language, which suggests that this task is realizable.

Overall, the implementation is expected to be broken down into the following subtasks:

  • Introduce an internal data model for the BMA files, including (a) support for additional data items like layouts and regulations; (b) parser for the arithmetic expressions; (c) both XML and JSON serialization.
  • Re-implement the expression interpreter of trap-mvn in order to construct a logic table for each model update function.
  • Use binary decision diagrams or a similar mechanism to encode the logic table into Boolean expressions.
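
The last two subtasks can be illustrated on a toy scale. The sketch below assumes a purely Boolean model, uses a plain Python callable as a stand-in for a parsed BMA arithmetic expression, and assumes that the arithmetic result is rounded to the nearest integer and clamped to the range 0..1 (this rounding rule is an assumption of the sketch and would need to be checked against the BMA semantics). Instead of binary decision diagrams, it emits a naive disjunctive normal form with one term per satisfying row, which is correct but far from minimal. The names `logic_table` and `to_dnf` are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Valuation = Dict[str, int]
Row = Tuple[Tuple[int, ...], int]


def logic_table(inputs: Sequence[str],
                target: Callable[[Valuation], float]) -> List[Row]:
    """Evaluate the arithmetic target function on every Boolean input."""
    table = []
    for values in product((0, 1), repeat=len(inputs)):
        raw = target(dict(zip(inputs, values)))
        # Assumption: round to the nearest integer, then clamp to {0, 1}.
        level = min(1, max(0, math.floor(raw + 0.5)))
        table.append((values, level))
    return table


def to_dnf(inputs: Sequence[str], table: List[Row]) -> str:
    """Encode the logic table as a (non-minimal) DNF expression."""
    if not inputs:
        # A function without regulators is a constant.
        return "true" if table[0][1] else "false"
    terms = []
    for values, level in table:
        if level == 1:
            literals = [n if v else f"!{n}" for n, v in zip(inputs, values)]
            terms.append("(" + " & ".join(literals) + ")")
    return " | ".join(terms) if terms else "false"


# Toy example: one activator A and one inhibitor B.
inputs = ["A", "B"]
table = logic_table(inputs, lambda v: v["A"] - v["B"])
expression = to_dnf(inputs, table)  # "(A & !B)"
```

Because the table has one row per input valuation, this approach is only practical for update functions with a small number of regulators, which is the motivation for using decision diagrams in the actual implementation.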

Other considerations: BMA currently supports both Boolean and multi-valued models. In theory, the process described above can be easily adapted to also incorporate Booleanization of multi-valued models (in BBM, this is currently facilitated by the bioLQM package). However, the initial version should focus on Boolean models and we may consider multi-valued models as a secondary goal if we have enough time.

BBM model catalog website

An online presentation of the BBM project was developed as part of a recent bachelor thesis. This presentation is currently deployed in a test mode at bbm.unsigned-short.com. The initial component of this goal is to finalize this website and deploy it on faculty servers. Subsequently, the website needs to be improved and updated to reflect the other improvements achieved as part of this project.

As such, this goal can be broken down into the following tasks:

  • Review the existing code of the website and prepare it for deployment on faculty servers, including data synchronization with the model repository.
  • Make the website's data model more granular and structured (currently, essentially the full BBM dataset needs to be downloaded for the website to function). Data items to be considered in the restructuring: list of model variables, list of model interactions, metadata/annotations for each variable/regulation, global model metadata, structured metadata about the source publication (currently only a LaTeX string is stored). Note that this may require cooperation with the Python API (or similar) to extract the relevant model metadata.
  • Update the presentation of each model to incorporate these new data items. Currently, only the model "readme" is shown on each detail page. It is to be decided whether this metadata will be appended to the readme as well, or whether it will only be provided on the website.
  • Improve the search feature of the website: Currently, there are three separate search options: by name, by year, and by journal. The new full-text search should instead consider all aspects of the model (variables, interactions, annotations, full publication entry). Furthermore, the reason why a particular model is considered a match should be communicated to the user.
  • Subsequently, the search feature should also provide a certain degree of semantic similarity search on model variables. Here, the exact similarity measure is yet to be determined, but we will likely consider existing atlases of known gene names, such as the one provided by Gene Expression Omnibus.
  • Finally, integration between BBM and the AEON frontend will be developed. In particular, we will explore (a) an "Open in AEON" button on each model page; (b) an "Open from BBM" menu in AEON itself where the known models will be listed. It is to be determined whether only one or both solutions will be implemented.
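
The full-text search requirement, including reporting why a model matched, can be sketched as follows. The record layout (a dictionary from field name to a list of strings), the field names, and the function name `search` are assumptions for illustration, and the toy records in the usage note are invented placeholders rather than BBM data. Matching is a case-insensitive substring test, which is a deliberate simplification of a real full-text index.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: field name -> list of text values.
ModelRecord = Dict[str, List[str]]


def search(models: Dict[str, ModelRecord],
           query: str) -> Dict[str, List[Tuple[str, str]]]:
    """Return matching model IDs together with the reasons for the match.

    Each reason is a (field, value) pair, so the user interface can show
    that a model matched, for example, through one of its variables.
    """
    needle = query.lower()
    hits = {}
    for model_id, fields in models.items():
        reasons = [(field, value)
                   for field, values in fields.items()
                   for value in values
                   if needle in value.lower()]
        if reasons:
            hits[model_id] = reasons
    return hits
```

For example, with the placeholder records `{"model-a": {"variables": ["p53", "Mdm2"]}, "model-b": {"annotations": ["part of the p53 pathway"]}}`, the query `"P53"` matches both models, the first through a variable name and the second through an annotation.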

Furthermore, in relation to the other project goals, we would also like to consider the following:

  • Linking to or presenting attached case study notebooks.
  • Presenting pre-computed analysis results such as strongly connected components or trap spaces.

Automated model meta-analysis

Currently, BBM collects very little information about the structure of the actual models. However, important information about the nature of logic-based models can be obtained by studying the dataset collectively. As such, we aim to explore useful structural properties that can be readily computed for each model and presented in an automatically updated report. Currently, it is expected that the AEON Python package will be used to compute these properties and compile the report, but additional packages may be introduced if necessary.

Structural properties for consideration (not all will necessarily be included):

  • Model inputs and outputs.
  • Distribution of SCCs in the interaction graph.
  • In-degree and out-degree distribution.
  • Monotonicity and essentiality of interactions.
  • Presence of canalizing or nested canalizing functions.
  • Function complexity in terms of prime implicants (or some other pseudo-minimal implicants measure if prime implicants cannot be computed).
  • Feed forward loops and redundant interaction chains.
  • Positive/negative feedback vertex set and other properties of network cycles (independent cycles, cycle length, etc.).
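
Several of the listed properties depend only on the interaction graph. The following sketch computes inputs, outputs, degree distributions, and SCC sizes for a graph given as a list of (regulator, target) pairs. It is written in plain Python for illustration and does not use the AEON package; the function names are hypothetical, and inputs are assumed here to be variables with no incoming interaction (a convention that may differ from formats where inputs carry an identity self-loop). The SCC computation uses Kosaraju's two-pass algorithm in an iterative form.

```python
from collections import Counter, defaultdict


def degree_report(variables, edges):
    """Inputs, outputs, and degree distributions of the interaction graph."""
    in_deg = Counter({v: 0 for v in variables})
    out_deg = Counter({v: 0 for v in variables})
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return {
        # Assumption: an input has no regulators, an output regulates nothing.
        "inputs": sorted(v for v in variables if in_deg[v] == 0),
        "outputs": sorted(v for v in variables if out_deg[v] == 0),
        "in_degree_distribution": dict(Counter(in_deg.values())),
        "out_degree_distribution": dict(Counter(out_deg.values())),
    }


def scc_sizes(variables, edges):
    """Sizes of strongly connected components (Kosaraju, iterative)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for source, target in edges:
        succ[source].append(target)
        pred[target].append(source)
    # Pass 1: depth-first search recording the finishing order.
    order, seen = [], set()
    for root in variables:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()
    # Pass 2: search the reversed graph in decreasing finishing order.
    sizes, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        todo, size = [root], 0
        while todo:
            node = todo.pop()
            size += 1
            for previous in pred[node]:
                if previous not in assigned:
                    assigned.add(previous)
                    todo.append(previous)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

Both functions run in time linear in the number of variables and interactions, which is in line with computing the report on the fly.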

Overall, the analysis should be fast enough to be compiled on-the-fly during each validation pass to generate an up-to-date meta-report of the whole dataset.
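
Among the properties listed above, monotonicity and essentiality of an interaction can be checked directly on the update function. The sketch below does this by brute force over all valuations of the remaining regulators, so it is exponential in the number of regulators and only suitable for small functions; a symbolic representation would be needed to keep the on-the-fly report fast for large ones. The function name `classify_regulator` and the returned labels are hypothetical, and the update function is represented as a plain Python callable.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def classify_regulator(inputs: Sequence[str],
                       function: Callable[[Dict[str, bool]], bool],
                       regulator: str) -> str:
    """Classify how `regulator` influences a Boolean update function."""
    others = [v for v in inputs if v != regulator]
    can_increase = can_decrease = False
    for values in product((False, True), repeat=len(others)):
        point = dict(zip(others, values))
        low = bool(function({**point, regulator: False}))
        high = bool(function({**point, regulator: True}))
        if high and not low:
            can_increase = True   # switching the regulator on raises the output
        if low and not high:
            can_decrease = True   # switching the regulator on lowers the output
    if not (can_increase or can_decrease):
        return "non-essential"
    if can_increase and can_decrease:
        return "non-monotonic"
    return "activation" if can_increase else "inhibition"
```

For the function `A and not B` declared over the inputs A, B, and C, this classifies A as an activation, B as an inhibition, and C as non-essential.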

Reproducible case-study notebooks

For some of the included models, known results or case studies are available. Currently, BBM does not consider these in any particular way. In the ideal scenario, such case studies are distributed in the form of reproducible Jupyter notebooks. Furthermore, BBM also includes some models in a "repaired" state, where metadata has been updated to correct some common consistency issues. This repair process can also be captured by a Jupyter notebook.

In this goal, we thus aim to standardize a mechanism to include Jupyter notebook files with a model. These files are either reproducible (they can be executed in the CoLoMoTo Docker environment), or not (the process requires non-standard dependencies or an excessive amount of resources). Overall, re-running case studies should not be a part of the standard CI/CD pipeline, but should be performed at least for each BBM edition. As such, it is important to provide a mechanism that will enable re-running all case studies.
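
One possible shape of such a mechanism is a per-model manifest that flags each notebook as reproducible or not, plus a script that re-executes only the reproducible ones. The sketch below is an illustration under assumptions: the manifest fields (`model`, `notebook`, `reproducible`) and the function name `rerun_commands` are hypothetical. It only builds the command lines, using the `jupyter nbconvert --to notebook --execute --inplace` invocation, and leaves running them (for example inside the Docker environment) to the caller.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def rerun_commands(case_studies: List[Dict]) -> List[List[str]]:
    """Build one nbconvert command per reproducible case-study notebook."""
    commands = []
    for entry in case_studies:
        # Notebooks flagged as non-reproducible are kept for reference only.
        if not entry["reproducible"]:
            continue
        notebook = PurePosixPath(entry["model"]) / entry["notebook"]
        commands.append([
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute", "--inplace",
            str(notebook),
        ])
    return commands
```

Each command can then be passed to `subprocess.run(..., check=True)`, so that a failing notebook is reported when a new edition is prepared.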

Finally, case studies also need to be incorporated into the model presentation in the repository as well as on the model website.

SCC decomposition in BBM/AEON

Among the most important dynamical properties of logic-based models are the strongly connected components (SCCs) of the state-transition graph, especially the bottom SCCs, the so-called attractors. Furthermore, these components can be partitioned by the trap spaces of the network into a so-called succession diagram that reveals the gradual commitment of the model towards a particular long-term behavior. These are complex problems, as the size of the state-transition graph is exponential in the number of network variables. Consequently, they are often solved using symbolic or automated reasoning methods.
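
To make the notion of attractors as bottom SCCs concrete, the following sketch enumerates the state-transition graph of a tiny network explicitly. It assumes asynchronous update semantics (one variable changes per transition) and represents update functions as Python callables; all names are hypothetical. A state lies in a bottom SCC exactly when every state reachable from it can reach it back, which is what the code tests. This explicit approach visits all 2^n states and is therefore only an illustration of the problem, not a substitute for symbolic algorithms.

```python
from itertools import product


def successors(state, names, update):
    """Asynchronous successors: update one variable at a time."""
    env = dict(zip(names, state))
    result = []
    for i, name in enumerate(names):
        new_value = bool(update[name](env))
        if new_value != state[i]:
            result.append(state[:i] + (new_value,) + state[i + 1:])
    return result


def reachable(start, names, update):
    """All states reachable from `start`, including `start` itself."""
    seen, todo = {start}, [start]
    while todo:
        for nxt in successors(todo.pop(), names, update):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return frozenset(seen)


def attractors(names, update):
    """Bottom SCCs of the explicit asynchronous state-transition graph."""
    found = set()
    for state in product((False, True), repeat=len(names)):
        forward = reachable(state, names, update)
        # Bottom SCC test: everything reachable must be able to return.
        if all(state in reachable(other, names, update) for other in forward):
            found.add(forward)
    return found


# Toy networks: a positive and a negative two-variable feedback loop.
positive_loop = {"A": lambda s: s["B"], "B": lambda s: s["A"]}
negative_loop = {"A": lambda s: not s["B"], "B": lambda s: s["A"]}
```

In these toy networks, the positive loop has two fixed-point attractors (both variables off, or both on), while the negative loop has a single cyclic attractor that covers all four states.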

Currently, AEON only supports computation of the network attractors. However, algorithms that facilitate full symbolic SCC decomposition do exist. Furthermore, there are packages that identify trap spaces of a network, including the succession diagram.

As such, as part of this goal, we aim to:

  • Implement the current state-of-the-art symbolic SCC decomposition algorithm(s) and evaluate them on the BBM dataset.
  • Incorporate the knowledge obtained from the succession diagram to accelerate SCC detection by pre-partitioning the network state space into independent blocks.
  • Incorporate these methods into the AEON package to make them available to the BBM analysis pipelines (see below).

Distribution of validated analysis data

BBM is often used as a means for benchmarking and validating new tools or methods. Such use cases can be significantly accelerated by providing pre-computed and validated results of common analytical tasks, such as attractor or trap space detection. These tasks are often computationally intensive and may require additional commitment to study a particular tool in order to identify settings that achieve the best performance. Furthermore, these are problems where multiple independent tools can be used to cross-validate the tool output.

In this goal, we aim to alleviate this problem by:

  • Identifying common dynamical analysis tasks with shareable, machine-readable results (attractors, trap spaces, etc.).
  • Proposing automated workflow(s) that perform this analysis and produce a reproducible artifact with the results.
  • Ideally, providing these workflows for multiple tools where applicable.
  • Ideally, having these workflows also produce a reproducible Jupyter notebook for each model where the problem can be solved within reasonable means, so that the notebook can be included in the case study section of the dataset.
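
Cross-validation between tools could work by normalizing each tool's output into an order-independent form and grouping the tools by the result they report. The sketch below assumes that each tool's result is a collection of attractors and that each attractor is a collection of states encoded as strings; this encoding, the tool names in the usage note, and the function names are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List

Result = FrozenSet[FrozenSet[str]]


def normalize(attractors: Iterable[Iterable[str]]) -> Result:
    """Order-independent form: a set of attractors, each a set of states."""
    return frozenset(frozenset(attractor) for attractor in attractors)


def agreement_groups(results_by_tool: Dict[str, Iterable[Iterable[str]]]
                     ) -> List[List[str]]:
    """Group tools that reported exactly the same result.

    A single group means all tools agree; several groups mean the
    result should not be published as validated.
    """
    groups = defaultdict(list)
    for tool, result in results_by_tool.items():
        groups[normalize(result)].append(tool)
    return sorted(sorted(group) for group in groups.values())
```

For instance, if hypothetical tools `tool-a` and `tool-b` report the same attractors in a different order and `tool-c` reports one attractor fewer, the function returns two groups: one with `tool-a` and `tool-b`, and one with `tool-c`.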

These problems are often hard to solve. As such, we should adopt a system where the results may not be necessarily available for each model, and do not need to be part of the CI/CD pipeline. However, similar to the case studies, they need to be reproducible and are expected to be re-run at least for each BBM edition.

Incorporation of recent models

This goal is very straightforward. Currently, there are about 50 known publications with models that could be included in BBM (these are listed in the BBM issue tracker). Furthermore, we should cross-reference BBM with other recent meta-reviews to ensure we are not missing any models identified by these. Finally, support for new model formats should enable us to include additional models that could not be incorporated before.