Using QLever for PubChem
Install the qlever script following the instructions at https://github.com/ad-freiburg/qlever-control (this takes only a few minutes; there is no need to compile anything). Make sure that the qlever script is on your PATH and that you are in a fresh directory with no other content. Then do:
qlever setup-config pubchem
qlever get-data
qlever index
qlever start
qlever ui
The get-data command downloads the data and fixes it (in several of the IRIs, forbidden characters are not properly percent-encoded). This takes around 5 hours on an AMD Ryzen 9 with 16 cores and requires about 250 GB of disk space. The index command builds the index data structures needed by QLever. This also takes around 5 hours and requires around 1.5 TB of disk space. The start command starts the server, which is then up in a matter of seconds. The ui command starts the UI, which looks just like the UI of the public QLever SPARQL endpoint for PubChem at https://qlever.cs.uni-freiburg.de/pubchem. See the Qleverfile (created by qlever setup-config pubchem) for a more detailed description of some of the peculiarities of the PubChem dataset.
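Since get-data and index both run for hours and need a lot of disk space, it can be worth checking the free space before starting. The following is a minimal sketch, assuming a POSIX shell with df and awk. The helpers free_gb and need_gb are hypothetical (they are not part of the qlever script), and the threshold of 1750 assumes that the 250 GB for the download and the 1.5 TB for the index add up; adjust it to your setup.

```shell
# free_gb DIR: print the free space (in whole GiB) of the filesystem holding DIR.
# Hypothetical helper, not part of the qlever script.
free_gb() {
  # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
  # column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# need_gb DIR GB: succeed if DIR has at least GB gigabytes free.
need_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: download (about 250 GB) plus index (about 1.5 TB) share one disk.
REQUIRED_GB=1750
if ! need_gb . "$REQUIRED_GB"; then
  echo "Warning: less than $REQUIRED_GB GB free in $(pwd)" >&2
fi
```

The check only prints a warning; it does not stop you from running qlever get-data or qlever index.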
PubChem makes heavy use of alphanumeric identifiers like sio:CHEMINF_000339 (molecular entity name) or obo:CHEBI_15365 (acetylsalicylic acid) for its predicates and entities. The labels for these identifiers are not part of the PubChem datasets. We recommend adding them to the data by downloading the respective ontologies. Here is a command to do that:
mkdir -p rdf.ontologies
cut -d, -f3,4 <<EOT | while IFS=, read -r URL NAME; do echo "Downloading $URL -> $NAME ..."; curl --location --silent --remote-time --output "rdf.ontologies/$NAME" "$URL"; done
BAO - BioAssay Ontology,bao,http://www.bioassayontology.org/bao/bao_complete.owl,bao.rdf
BFO - Basic Formal Ontology,bfo,http://purl.obolibrary.org/obo/bfo.owl,bfo.rdf
BioPAX - biological pathway data,bp,http://www.biopax.org/release/biopax-level3.owl,bio-pax.rdf
CHEMINF - Chemical Information Ontology,cheminf,http://purl.obolibrary.org/obo/cheminf.owl,cheminf.rdf
ChEBI - Chemical Entities of Biological Interest,chebi,http://purl.obolibrary.org/obo/chebi.owl,chebi.rdf
CiTO,cito,http://purl.org/spar/cito.nt,cito.nt
DCMI Terms,dcterms,https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.nt,dcterms.nt
FaBiO,fabio,http://purl.org/spar/fabio.nt,fabio.nt
GO - Gene Ontology,go,http://purl.obolibrary.org/obo/go.owl,go.rdf
IAO - Information Artifact Ontology,iao,http://purl.obolibrary.org/obo/iao.owl,iao.rdf
NCIt,ncit,http://purl.obolibrary.org/obo/ncit.owl,ncit.rdf
NDF-RT,ndfrt,https://data.bioontology.org/ontologies/NDF-RT/submissions/1/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb,ndfrt.rdf
OBI - Ontology for Biomedical Investigations,obi,http://purl.obolibrary.org/obo/obi.owl,obi.rdf
OWL,owl,http://www.w3.org/2002/07/owl,owl.ttl
PDBo,pdbo,http://rdf.wwpdb.org/schema/pdbx-v40.owl,pdbo.rdf
PR - PRotein Ontology (PRO),pr,http://purl.obolibrary.org/obo/pr.owl,pr.rdf
RDF Schema,rdfs,https://www.w3.org/2000/01/rdf-schema,rdf-schema.ttl,rdfs.ttl
RDF,rdf,http://www.w3.org/1999/02/22-rdf-syntax-ns,22-rdf-syntax-ns.ttl,rdf.ttl
RO - Relation Ontology,ro,http://purl.obolibrary.org/obo/ro.owl,ro.rdf
SIO - Semanticscience Integrated Ontology,sio,http://semanticscience.org/ontology/sio.owl,sio.rdf
SKOS,skos,http://www.w3.org/TR/skos-reference/skos.rdf,skos.rdf
SO - Sequence types and features ontology,so,http://purl.obolibrary.org/obo/so.owl,so.rdf
UO - Units of measurement ontology,uo,http://purl.obolibrary.org/obo/uo.owl,uo.rdf
EOT
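Because curl runs with --silent, a failed download is easy to miss. The sketch below lists files in rdf.ontologies that ended up empty. The helper name list_empty_files and the idea of checking by file size are assumptions for illustration; a non-empty file can still contain an error page, so this is only a first sanity check.

```shell
# list_empty_files DIR: print every regular file below DIR that has size zero.
# Hypothetical helper, not part of the qlever script.
list_empty_files() {
  find "$1" -type f -size 0
}

# Report suspicious downloads, if the directory exists.
if [ -d rdf.ontologies ]; then
  list_empty_files rdf.ontologies
fi
```

If this prints any file names, rerun the download for the corresponding lines.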
The PubChem data is about three central kinds of entities:
- A compound is an abstract chemical structure, for example: compound:CID2244 (acetylsalicylic acid)
- A substance is a concrete materialization of a compound, for example: substance:SID24890623 (a particular edition of Aspirin)
- A bioassay is an analytical method for measuring the effect of a substance on living matter
TLDR: There is no "canonical" name, neither for compounds nor for substances; each compound can have many substances; each substance can have many different kinds of names; each substance can even have multiple names of the same kind; some compounds are related to entities from other ontologies
Compounds are related to substances via the predicate sio:CHEMINF_000477 (has normalized counterpart), for example substance:SID24890623 sio:CHEMINF_000477 compound:CID2244.
For each substance, there are different kinds of names, for example, sio:CHEMINF_000339 (molecular entity name), sio:CHEMINF_000476 (chemical database identifier), or sio:CHEMINF_000561 (drug trade name). That way, even a single compound can have hundreds of names and synonyms, for example https://qlever.cs.uni-freiburg.de/pubchem/PAlJvI (all names/synonyms of Diclofenac) or https://qlever.cs.uni-freiburg.de/pubchem/7TwZLX (same, grouped by kind of name/synonym).
To get a particular kind of name of a particular substance, use the pattern substance:SID24890623 sio:SIO_000008 [ rdf:type sio:CHEMINF_000339 ; sio:SIO_000300 ?name ], where the intermediate node is called a "synonym".
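The two patterns above (substance to compound, and substance to name) can be combined into one query. The following is a sketch under stated assumptions: the PREFIX declarations use the usual namespace IRIs for these prefixes, QLEVER_ENDPOINT is a hypothetical environment variable that you set to the URL of your own server, and run_query is a hypothetical helper. The query is only sent if QLEVER_ENDPOINT is set.

```shell
# SPARQL query: molecular entity names (sio:CHEMINF_000339) of all substances
# whose normalized counterpart is compound:CID2244.
# Assumption: the prefix IRIs below are the usual ones for these prefixes.
NAMES_QUERY=$(cat <<'EOF'
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?substance ?name WHERE {
  ?substance sio:CHEMINF_000477 compound:CID2244 .
  ?substance sio:SIO_000008 [ rdf:type sio:CHEMINF_000339 ; sio:SIO_000300 ?name ]
}
LIMIT 20
EOF
)

# run_query QUERY: POST the query to $QLEVER_ENDPOINT and print the result as TSV.
# QLEVER_ENDPOINT is a hypothetical variable, not something qlever sets for you.
run_query() {
  curl --silent --fail "$QLEVER_ENDPOINT" \
    -H "Accept: text/tab-separated-values" \
    --data-urlencode "query=$1"
}

# Only contact a server if an endpoint has been configured.
if [ -n "${QLEVER_ENDPOINT:-}" ]; then
  run_query "$NAMES_QUERY"
fi
```

Replacing sio:CHEMINF_000339 with another kind of name, such as sio:CHEMINF_000561, selects a different kind of synonym.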
Some compounds are related to entities from other ontologies via rdf:type or skos:closeMatch. For example, compound:CID2244 rdf:type obo:CHEBI_15365 (where obo:CHEBI_15365 is the identifier for acetylsalicylic acid in ChEBI, Chemical Entities of Biological Interest) or compound:CID2244 skos:closeMatch wd:Q18216 (where wd:Q18216 is the identifier for Aspirin in Wikidata).
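To see which entities from other ontologies a compound is linked to, both predicates can be queried at once. This is a sketch, again assuming the usual namespace IRIs for the prefixes and a hypothetical QLEVER_ENDPOINT environment variable holding the URL of your server; nothing is sent unless that variable is set.

```shell
# SPARQL query: everything compound:CID2244 is linked to via rdf:type or skos:closeMatch.
# Assumption: the prefix IRIs below are the usual ones for these prefixes.
LINKS_QUERY=$(cat <<'EOF'
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?predicate ?entity WHERE {
  VALUES ?predicate { rdf:type skos:closeMatch }
  compound:CID2244 ?predicate ?entity
}
EOF
)

# QLEVER_ENDPOINT is a hypothetical variable; the query is sent only if it is set.
if [ -n "${QLEVER_ENDPOINT:-}" ]; then
  curl --silent --fail "$QLEVER_ENDPOINT" \
    -H "Accept: text/tab-separated-values" \
    --data-urlencode "query=$LINKS_QUERY"
fi
```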
TLDR: Most properties in PubChem are not expressed via a single predicate, but via multiple predicates and entities
The various chemical properties of a compound are expressed via the generic predicate sio:SIO_000008 (has attribute) and a mediator node. For example, molecular weight is expressed as follows, using the specific sio:CHEMINF_000334 (molecular weight) and the generic sio:SIO_000300 (has value):
?compound sio:SIO_000008 [
  rdf:type sio:CHEMINF_000334 ;
  sio:SIO_000300 ?value ]
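Embedded in a complete query for one compound, the pattern looks as follows. This is a sketch, assuming the usual namespace IRIs for the prefixes; QLEVER_ENDPOINT is a hypothetical environment variable for the URL of your server, and the query is sent only if it is set.

```shell
# SPARQL query: molecular weight of compound:CID2244 via the mediator node.
# Assumption: the prefix IRIs below are the usual ones for these prefixes.
WEIGHT_QUERY=$(cat <<'EOF'
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?value WHERE {
  compound:CID2244 sio:SIO_000008 [
    rdf:type sio:CHEMINF_000334 ;
    sio:SIO_000300 ?value ]
}
EOF
)

# QLEVER_ENDPOINT is a hypothetical variable; the query is sent only if it is set.
if [ -n "${QLEVER_ENDPOINT:-}" ]; then
  curl --silent --fail "$QLEVER_ENDPOINT" \
    -H "Accept: text/tab-separated-values" \
    --data-urlencode "query=$WEIGHT_QUERY"
fi
```

Other properties follow the same shape: only the class in the rdf:type triple changes.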