Import DrugBank-DrugCentral mappings #112

Open: wants to merge 4 commits into base: master

@cthoyt cthoyt commented Nov 17, 2022

This PR imports the 3,960 mappings between molecules in DrugBank and DrugCentral that were predicted through exact string matches, manually reviewed by @caufieldjh, and stored in http://kg-hub-public-data.s3.amazonaws.com/frozen_incoming_data/drug-id-maps-0.2.sssom.tsv. Some notes:

  1. DrugCentral itself only provides CAS mappings in addition to structures as SMILES/InChI.
  2. This file contains novel mappings between DrugCentral and several other controlled vocabularies (ChEBI, ChEMBL Compound, Therapeutic Target Database Drugs, PharmGKB). This PR starts with DrugBank as a proof of concept. DrugBank does not contain any primary mappings to DrugCentral, as far as I can tell. The import script can be trivially updated to import some or all of the rest.
  3. Some provenance information about which files were used to generate these mappings didn't seem strictly necessary and is not propagated through to the Biomappings file.
  4. I used pyobo to add in missing labels.
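
As a rough illustration of the import step described in the notes above, the sketch below reads SSSOM-style TSV rows and keeps only those that pair a DrugBank subject with a DrugCentral object. The column names follow the SSSOM convention (`subject_id`, `predicate_id`, `object_id`); the prefix spellings, the sample rows, and the helper name `select_pairs` are assumptions for illustration and are not taken from the actual import script.

```python
import csv
import io

# Hypothetical sample in SSSOM TSV layout. The real file may spell the
# prefixes differently and carries more columns than shown here.
SAMPLE = (
    "# metadata comment lines may precede the header\n"
    "subject_id\tsubject_label\tpredicate_id\tobject_id\tobject_label\n"
    "DRUGBANK:DB00001\tLepirudin\tskos:exactMatch\tDrugCentral:2995\tlepirudin\n"
    "CHEBI:EXAMPLE\tplaceholder\tskos:exactMatch\tDrugCentral:2995\tlepirudin\n"
    "DRUGBANK:DB00014\tGoserelin\tskos:exactMatch\tDrugCentral:1327\tgoserelin\n"
)


def select_pairs(handle, subject_prefix, object_prefix):
    """Yield (subject_id, predicate, object_id) for rows matching both prefixes."""
    # Skip the commented metadata block before handing lines to the CSV reader.
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        s_prefix, _, s_id = row["subject_id"].partition(":")
        o_prefix, _, o_id = row["object_id"].partition(":")
        # Compare prefixes case-insensitively, since spellings vary by source.
        if s_prefix.lower() == subject_prefix and o_prefix.lower() == object_prefix:
            yield s_id, row["predicate_id"], o_id


pairs = list(select_pairs(io.StringIO(SAMPLE), "drugbank", "drugcentral"))
print(pairs)
```

Passing a different prefix pair to `select_pairs` would select one of the other vocabularies in the same file.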

Update: this PR now filters out DrugCentral-DrugBank mappings that are already available, as determined by querying DrugCentral's PostgreSQL database.

@cthoyt cthoyt changed the title Imprort DrugBank-DrugCentral mappings Import DrugBank-DrugCentral mappings Nov 17, 2022
@cthoyt cthoyt marked this pull request as ready for review November 17, 2022 12:46
@cthoyt cthoyt requested a review from bgyori November 17, 2022 12:46
@caufieldjh

Thanks @cthoyt !
Let me know if/when the ingests hit any snags so I can fix the table

bgyori commented Nov 17, 2022

Hi @cthoyt and @caufieldjh, thanks for working on this! Generally, Biomappings only includes mappings that aren't provided by any of the primary sources. I spot-checked 10 entries from the new additions to mappings.tsv and, in each case, found that DrugCentral already lists the given DrugBank ID on the drug's landing page. For instance, taking the first new entry:

drugbank DB00001 Lepirudin skos:exactMatch drugcentral 2995 lepirudin

https://drugcentral.org/drugcard/2995 provides:
[image: screenshot from the DrugCentral drug card]

Ideally, these existing mappings would be filtered out and only novel/missing mappings added.
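
A minimal sketch of that filtering step, assuming the primary source's cross-references have already been loaded into a set of `(drugbank_id, drugcentral_id)` pairs. The function name `filter_novel` and the second candidate pair are hypothetical placeholders; how the known pairs are obtained (web page, database dump) is left open here.

```python
def filter_novel(candidates, known_xrefs):
    """Keep candidate (drugbank_id, drugcentral_id) pairs absent from known_xrefs."""
    known = set(known_xrefs)
    return [pair for pair in candidates if pair not in known]


# The first pair is the example from this thread; the second is a placeholder.
candidates = [("DB00001", "2995"), ("DB_PLACEHOLDER", "0")]

# Assumption: pairs already provided by the primary source.
known_xrefs = {("DB00001", "2995")}

novel = filter_novel(candidates, known_xrefs)
print(novel)
```

Only pairs missing from the primary source survive, which is the set Biomappings is meant to hold.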

@@ -2822,6 +2822,671 @@ doid DOID:8850 salivary gland cancer skos:exactMatch mesh D012468 Salivary Gland
doid DOID:9335 scotoma skos:exactMatch mesh D012607 Scotoma manually_reviewed orcid:0000-0001-9439-5346
doid DOID:9383 iridocyclitis skos:exactMatch mesh D015863 Iridocyclitis manually_reviewed orcid:0000-0001-9439-5346
doid DOID:9675 pulmonary emphysema skos:exactMatch mesh D011656 Pulmonary Emphysema manually_reviewed orcid:0000-0001-9439-5346
drugbank DB00001 Lepirudin skos:exactMatch drugcentral 2995 lepirudin manually_reviewed orcid:0000-0001-5705-7831
Contributor

The above is definitely on the DrugCentral website, so it's odd that it didn't get filtered out. The same goes for a couple of others that I randomly checked below.

Member Author

I also ran into this for most (if not all?) that I checked. I'm not sure why they aren't in the database's xrefs table even though they appear on the site.

drugbank DB00013 Urokinase skos:exactMatch drugcentral 5109 urokinase manually_reviewed orcid:0000-0001-5705-7831
drugbank DB00014 Goserelin skos:exactMatch drugcentral 1327 goserelin manually_reviewed orcid:0000-0001-5705-7831
drugbank DB00016 Erythropoietin skos:exactMatch drugcentral 5160 epoetin beta manually_reviewed orcid:0000-0001-5705-7831
drugbank DB00016 Erythropoietin skos:exactMatch drugcentral 5170 epoetin zeta manually_reviewed orcid:0000-0001-5705-7831
Contributor

The above two rows could be problematic: they assign skos:exactMatch from one DrugBank entry to two separate DrugCentral entries. Mappings that are not one-to-one should either be removed or use a different relation type.

Member Author

The next step will be to identify one-to-many and many-to-one mappings and flag those for more curation (or just filter them out).

Yep, those should probably be skos:narrowMatch
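
The one-to-many check discussed in this thread could be sketched roughly as follows. The sample pairs are rows from the diff above; the function name `split_by_cardinality` is hypothetical, and the decision about what to do with flagged rows (re-curate, change the relation, or drop) is deliberately left out.

```python
from collections import defaultdict


def split_by_cardinality(mappings):
    """Separate one-to-one pairs from pairs whose source or target recurs."""
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for source, target in mappings:
        targets_of[source].add(target)
        sources_of[target].add(source)

    one_to_one, flagged = [], []
    for source, target in mappings:
        # A pair is one-to-one only if neither side appears in another pair.
        if len(targets_of[source]) == 1 and len(sources_of[target]) == 1:
            one_to_one.append((source, target))
        else:
            flagged.append((source, target))
    return one_to_one, flagged


# (drugbank_id, drugcentral_id) pairs taken from the diff in this PR.
rows = [("DB00014", "1327"), ("DB00016", "5160"), ("DB00016", "5170")]
one_to_one, flagged = split_by_cardinality(rows)
print(one_to_one)
print(flagged)
```

Here both DB00016 rows end up in `flagged`, since the same DrugBank entry points at two DrugCentral entries.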

caufieldjh commented Nov 18, 2022

For these sources, here are the relationships for mapping availability:

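| Source/target | CHEBI | CHEMBL | DrugBank | DrugCentral | PharmGKB.drug | ttd.drug |
|---|---|---|---|---|---|---|
| CHEBI to | | Yes | No | No | No | No |
| CHEMBL to | Yes | | Yes | Yes | Yes | No |
| DrugBank to | Yes | Yes | | No | Yes | Yes |
| DrugCentral to | Yes | Yes | Yes | | No | No |
| PharmGKB.drug to | Yes | No | Yes | No | | Yes |
| ttd.drug to | Yes | No | No | No | No | |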

where "Yes" means at least some mappings are available.
It doesn't account for completeness, or for the fact that many of these are one-to-many or many-to-one, since sources vary in size and specificity.
