cmungall/human-cell-atlas

human-cell-atlas

EXPERIMENTAL translation of HCA

Caveat: this schema is entirely constructed via an automated import of the HCA json schema.

  • there may be parts missing
  • the direct mapping may not utilize key parts of LinkML

Website

The above is generated entirely from the schema, which comes from the json schema; as such it may be sparse on details.

This also uses the older linkml documentation framework, which doesn't show all of the schema.

Schema

How this was made

This was created using schema-automator

It uses the following HCA-specific extensions:

  • mapping of user_friendly to linkml:title
  • mapping HCA ontology extensions to dynamic enums

The following modifications were made:

  • Changed “10x” to “S10x” (because otherwise this creates awkward incompatibilities between the generated python classes and the schema)
  • Modified hca/system/links.json to avoid name clashes with SupplementaryFile
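The "10x" rename is needed because a Python identifier cannot begin with a digit, so a class or attribute generated from that name would not be valid Python. A minimal sketch of the check, where the helper name `is_safe_python_name` is hypothetical and not part of schema-automator or LinkML:

```python
import keyword


def is_safe_python_name(name: str) -> bool:
    """Return True if `name` can be used as-is as a Python class or attribute name."""
    # str.isidentifier() rejects names that start with a digit;
    # keyword.iskeyword() rejects reserved words such as "class".
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_safe_python_name("10x"))   # False
print(is_safe_python_name("S10x"))  # True
```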

Treatment of Links

I need to figure out exactly how the system/links schema is used in HCA. Currently it doesn't "connect up" to the rest of the schema.

It seems that some kind of extra-schema information is required

Ontology Enums

All plain json enums are mapped to LinkML enums. Note that we elected not to inline these, so there are a lot of "trivial" enums with one value where the intent is to restrict the value of a field.
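As an illustration of that mapping, here is a minimal sketch of how a plain JSON schema enum could become a standalone LinkML enum definition. The function name, the enum name, and the example property are assumptions made for illustration; this is not the schema-automator implementation.

```python
def json_enum_to_linkml(enum_name: str, json_property: dict) -> dict:
    """Build a LinkML-style enum definition (as a plain dict) from a JSON schema property.

    Hypothetical helper: only the `enum` keyword of the property is read.
    """
    values = json_property.get("enum", [])
    # Each JSON enum value becomes one permissible value with no extra metadata.
    return {enum_name: {"permissible_values": {str(v): {} for v in values}}}


# A "trivial" enum: a single value used only to pin a field to a constant.
# The property content here is an invented example.
prop = {"type": "string", "enum": ["biomaterial"]}
print(json_enum_to_linkml("SchemaTypeEnum", prop))
# {'SchemaTypeEnum': {'permissible_values': {'biomaterial': {}}}}
```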

In future, the permissible values could be mapped to ontology terms, but this info isn't in the schema.

HCA also uses a JSON schema extension for ontology enums; these are converted to LinkML dynamic enums, as shown in the examples below.

Examples

LinkML:

  DevelopmentStageOntology_ontology_options:
    include:
    - reachable_from:
        source_ontology: obo:efo
        source_nodes:
        - EFO:0000399
        - HsapDv:0000000
        - UBERON:0000105
        relationship_types:
        - rdfs:subClassOf
        is_direct: false
        include_self: false
    - reachable_from:
        source_ontology: obo:hcao
        source_nodes:
        - EFO:0000399
        - HsapDv:0000000
        - UBERON:0000105
        relationship_types:
        - rdfs:subClassOf
        is_direct: false
        include_self: false

from:

"ontology": {
            "description": "An ontology term identifier in the form prefix:accession.",
            "type": "string",
            "graph_restriction":  {
                "ontologies" : ["obo:efo", "obo:hcao"],
                "classes": ["EFO:0000399", "HsapDv:0000000", "UBERON:0000105"],
                "relations": ["rdfs:subClassOf"],
                "direct": false,
                "include_self": false
            },

Note that the mapping is not quite direct. A separate query is generated in linkml for each input ontology, and the input seed classes are repeated in each query (include takes the union of all subqueries).

I believe the semantics are the same as in the source, although some ontology/seed combinations will presumably yield empty sets.

The more natural way to author this in linkml would be to make the classes specific to each subquery.
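The per-ontology mapping described above can be sketched as follows. This is a rough illustration only, assuming the graph_restriction has exactly the keys shown in the JSON example; the function name is hypothetical and this is not the actual schema-automator code.

```python
def graph_restriction_to_dynamic_enum(restriction: dict) -> dict:
    """Turn an HCA graph_restriction dict into a LinkML-style dynamic enum dict."""
    queries = []
    # One reachable_from query per ontology; the seed classes are repeated in each.
    for ontology in restriction["ontologies"]:
        queries.append(
            {
                "reachable_from": {
                    "source_ontology": ontology,
                    "source_nodes": list(restriction["classes"]),
                    "relationship_types": list(restriction["relations"]),
                    "is_direct": restriction["direct"],
                    "include_self": restriction["include_self"],
                }
            }
        )
    # `include` is interpreted as the union of the subqueries.
    return {"include": queries}


restriction = {
    "ontologies": ["obo:efo", "obo:hcao"],
    "classes": ["EFO:0000399", "HsapDv:0000000", "UBERON:0000105"],
    "relations": ["rdfs:subClassOf"],
    "direct": False,
    "include_self": False,
}
enum_def = graph_restriction_to_dynamic_enum(restriction)
print([q["reachable_from"]["source_ontology"] for q in enum_def["include"]])
# ['obo:efo', 'obo:hcao']
```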

Materialized Ontology Enums

See value set toolkit

To expand value sets:

poetry run sh utils/expand-value-sets.sh

This materializes the value set queries, so that:

  • normal non-extended json-schema tooling can use them
  • query results can be versioned alongside releases

These are included alongside as <NAME>.expanded.yaml
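To make "materializing" a query concrete, here is a toy sketch of expanding a reachable_from query over an in-memory subclass graph. It is not how the value set toolkit is implemented; the function name, the graph, and the identifiers (X:1, Y:9, and so on) are invented for illustration, and the handling of include_self is a simplification.

```python
from collections import deque


def expand_reachable_from(subclass_of, source_nodes, is_direct=False, include_self=False):
    """Return the sorted descendants of `source_nodes` in a child -> parents mapping."""
    # Invert the edges so descendants can be walked from each seed.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    found = set()
    queue = deque((node, 0) for node in source_nodes)
    while queue:
        node, depth = queue.popleft()
        if is_direct and depth >= 1:
            continue  # direct queries stop after the immediate children
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                queue.append((child, depth + 1))
    if include_self:
        # Simplification: seeds are added even if the graph does not contain them.
        found.update(source_nodes)
    return sorted(found)


toy = {"X:2": ["X:1"], "X:3": ["X:2"], "X:4": ["X:1"]}
print(expand_reachable_from(toy, ["X:1"]))                  # ['X:2', 'X:3', 'X:4']
print(expand_reachable_from(toy, ["X:1"], is_direct=True))  # ['X:2', 'X:4']
print(expand_reachable_from(toy, ["Y:9"]))                  # []
```

The last call shows how a seed that is absent from the queried ontology produces an empty expansion.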

File sizes:

| Value Set | Expanded | File Size |
|---|---|---|
| enrichment_ontology | enrichment_ontology expanded | 4.0K |
| organ_ontology | organ_ontology expanded | 1.5M |
| cell_cycle_ontology | cell_cycle_ontology expanded | 8.0K |
| biological_macromolecule_ontology | biological_macromolecule_ontology expanded | 12K |
| sequencing_ontology | sequencing_ontology expanded | 60K |
| protocol_type_ontology | protocol_type_ontology expanded | 16K |
| species_ontology | species_ontology expanded | 215M |
| development_stage_ontology | development_stage_ontology expanded | 64K |
| target_pathway_ontology | target_pathway_ontology expanded | 108K |
| disease_ontology | disease_ontology expanded | 4.8M |
| strain_ontology | strain_ontology expanded | 16K |
| file_content_ontology | file_content_ontology expanded | 512K |
| library_construction_ontology | library_construction_ontology expanded | 12K |
| contributor_role_ontology | contributor_role_ontology expanded | 24K |
| mass_unit_ontology | mass_unit_ontology expanded | 8.0K |
| cell_type_ontology | cell_type_ontology expanded | 316K |
| library_amplification_ontology | library_amplification_ontology expanded | 4.0K |
| microscopy_ontology | microscopy_ontology expanded | 8.0K |
| ethnicity_ontology | ethnicity_ontology expanded | 36K |
| organ_part_ontology | organ_part_ontology expanded | 1.5M |
| treatment_method_ontology | treatment_method_ontology expanded | 296K |
| process_type_ontology | process_type_ontology expanded | 84K |
| time_unit_ontology | time_unit_ontology expanded | 4.0K |
| file_format_ontology | file_format_ontology expanded | 4.0K |
| instrument_ontology | instrument_ontology expanded | 12K |
| cellular_component_ontology | cellular_component_ontology expanded | 480K |
| length_unit_ontology | length_unit_ontology expanded | 8.0K |

Note in particular that the expanded species value set is nearly a quarter of a gigabyte (215M).

Some of the expanded sets may be empty due to a mismatch in how HCA and OAK use CURIEs for EDAM.

Repository Structure

Developer Documentation

Use the `make` command to generate project artefacts:
  • make all: make everything
  • make deploy: deploys site

Credits

This project was made with linkml-project-cookiecutter.