Skip to content

Commit

Permalink
Add rc3 model (#31)
Browse files Browse the repository at this point in the history
* add rc3

* add docs

* update tests

* update readme

* add nagl pin

* add rc3 to TOC
  • Loading branch information
lilyminium committed Jul 28, 2024
1 parent a4db042 commit 7579994
Show file tree
Hide file tree
Showing 7 changed files with 229 additions and 1 deletion.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ These will come sorted according to semantic versioning, where the latest releas
- `v0.0.1-alpha.1`: a pre-production model to use for experimentation. We do *not* recommend using this model to assign charges in scientific work.
- `v0.1.0-rc1`: a pre-production model to use to assign AM1-BCC partial charges.
- `v0.1.0-rc2`: a pre-production model to use to assign AM1-BCC partial charges.
- `v0.1.0-rc3`: a pre-production model to use to assign AM1-BCC partial charges.


#### Acknowledgements
Expand Down
2 changes: 1 addition & 1 deletion devtools/conda-envs/test_env.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ dependencies:

# Testing
- pytest
- openff-nagl-base >=0.3.0
- openff-nagl-base >=0.4.0

- pip:
- importlib_resources
2 changes: 2 additions & 0 deletions docs/models/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,7 @@ Overview <self>
openff-gnn-am1bcc-0.0.1-alpha.1/index.md
openff-gnn-am1bcc-0.1.0-rc1/index.md
openff-gnn-am1bcc-0.1.0-rc2/index.md
openff-gnn-am1bcc-0.1.0-rc3/index.md


:::
212 changes: 212 additions & 0 deletions docs/models/openff-gnn-am1bcc-0.1.0-rc3/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
# AM1-BCC GNN v0.1.0-rc2

This model is trained to produce AM1-BCC charges similar to OpenEye's implementation.
The neural network model is identical to the the previously released
[GNN v0.1.0-rc2 model](https://openff-nagl-models.readthedocs.io/en/latest/models/openff-gnn-am1bcc-0.1.0-rc2/index.html).

It adds a small lookup table of charges, containing 13939 molecules with three or fewer heavy atoms.

## Lookup table

Molecules were generated by enumerating possible SMILES with the following options:

* All elements in H, C, N, O, F, P, S, Cl, Br, I
* With charges ranging from -4 to +4
* With single, double, triple, or aromatic bonds
* With linear chains or 3-membered rings (i.e. X, X~X, X~X~X, or X1~X~X~1)

All SMILES were first validated by loading with the OpenFF Toolkit to remove radicals
and invalid structures.

Charges were assigned using two backends: OpenEye and AmberTools. OpenEye Quacpac charges
are used preferentially in the lookup table. If OpenEye charges could not be generated
(e.g. due to conformer generation errors or invalid BCCs), AmberTools sqm charges were used.
If multiple molecules mapped onto the same InChI
due to the lack of bond order and formal charge in InChIs, molecules with the least complex
bond orders and formal charges were used (as calculated by total bond order magnitude and formal charge).

For example, the molecules `[N-:1]1[O+:2]=[P:3]1` and `[N:1]1=[P:3][O:2]1` both have
the InChI `InChI=1S/NOP/c1-2-3-1`, but different bond orders and formal charges.
In this example, only the *second* molecule is used to generate charges for the lookup table.



### OpenEye charges

These charges were generated through the OpenFF Toolkit API with the "am1bccelf10" method.

**Versions**:

* openff-toolkit: 0.15.2
* openeye-toolkits: 2023.2.3


### AmberTools charges

These charges were generated through the OpenFF Toolkit API with the "am1bcc" method.

**Versions**:

* openff-toolkit: 0.15.2
* ambertools: 23.3
* rdkit: 2023.09.4 (used for conformer generation)


## Training

The model was trained to three targets:

| Target | Weight | Denominator |
|-------------|--------|-------------|
| Charge RMSE | 1 | 0.02 |
| Dipole RMSE | 1 | 0.1 |
| Weighted ESP RMSE | 1 | 0.001 |


## Domain

The model is applicable to molecules with:

- elements C, O, H, N, S, F, Br, Cl, I, P
- atoms with between 1 and 6 bonds

It is *not* trained on molecules with the following bond patterns:

- `[#15:1]#[*:2]`
- `[#15:1]-[#53:2]`
- `[#15:1]:[*:2]`
- `[#15:1]=[!#6&!#7&!#8&!#16:2]`
- `[#16:1]#[*:2]`
- `[#16:1]-[#35,#53:2]`
- `[#16:1]=[!#15&!#6&!#7&!#8:2]`
- `[#17:1]#,:,=[*:2]`
- `[#17:1]-[!#15&!#16!#6&!#7&!#8:2]`
- `[#1:1]#,:,=[*:2]`
- `[#1:1]-[#1:2]`
- `[#35:1]#,:,=[*:2]`
- `[#35:1]-[!#15&!#6&!#7&!#8:2]`
- `[#53:1]#,:,=[*:2]`
- `[#53:1]-[!#53&!#6&!#7&!#8:2]`
- `[#7:1]#[!#7&!#6:2]`
- `[#8:1]#[*:2]`
- `[#9:1]#,:,=[*:2]`
- `[#9:1]-[!#15&!#16!#6&!#7&!#8:2]`

## Datasets

### Sources

This model was trained, validated, and tested on the following datasets:

| | Training | Validation | Test | Source |
|------------------------------------------|----------|------------|-------|-------------------------------------------------------------|
| Enamine Discovery Diversity Set (DDS-10) | 6499 | 2524 | | [Enamine DDS-10][enamine-10240] |
| Enamine Discovery Diversity Set (DDS-50) | 35731 | 8072 | | [Enamine DDS-50][enamine-50240] |
| Diverse ChEMBL training molecules | 31185 | 8564 | | [Bleiziffer, Schaller, and Riniker, 2018][riniker] |
| Diverse ZINC training molecules | 78933 | 15267 | | [Bleiziffer, Schaller, and Riniker, 2018][riniker] |
| NCI 250K | 85322 | 22513 | | [Release 4 File Series - May 2012][nci-250k] |
| PDB | 10844 | 4653 | | [Ligand Expo][pdb] |
| Peptides (1-4 AA) | 22690 | | | Generated from RDKit through combinations of FASTA codes |
| Small ChEMBL molecules (< 200 Da) | 39909 | | 17104 | Curated from the [ChEMBL][ChEMBL] database. |
| OpenFF Industry Benchmark Set (v1.1) | | | 13671 | [QCArchive Submission][qca-openff-benchmark] |
| SPICE | | | 14652 | [SPICE dataset][spice] |
| Peptides with PTMs | | | 6073 | Peptide chains with common post-translational modifications |
| Total | 271204 | 61593 | 34396 | |


[enamine-10240]: https://enamine.net/compound-libraries/diversity-libraries/dds-10560
[enamine-50240]: https://enamine.net/compound-libraries/diversity-libraries/dds-50240
[riniker]: https://doi.org/10.1021/acs.jcim.7b00663
[nci-250k]: https://cactus.nci.nih.gov/download/nci/
[pdb]: http://ligand-expo.rcsb.org/dictionaries/Components-smiles-stereo-oe.smi
[qca-openff-benchmark]: https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-06-04-OpenFF-Industry-Benchmark-Season-1-v1.1
[spice]: https://doi.org/10.1038/s41597-022-01882-6
[ChEMBL]: https://www.ebi.ac.uk/chembl/

### Characterisation

#### Elements

| Element | Training | Validation | Test |
|---------|----------|------------|-------|
| H | 310828 | 61582 | 52820 |
| C | 311043 | 61592 | 52968 |
| N | 278130 | 54368 | 47393 |
| O | 272095 | 56774 | 44682 |
| F | 32945 | 7026 | 7296 |
| P | 5422 | 1522 | 3103 |
| S | 82938 | 17168 | 12226 |
| Cl | 34440 | 7682 | 5838 |
| Br | 15813 | 2486 | 1482 |
| I | 2986 | 412 | 239 |

#### Total charges

| Charge | Training | Validation | Test |
|--------|----------|------------|-------|
| -6 | 4 | 0 | 2 |
| -5 | 7 | 0 | 14 |
| -4 | 209 | 27 | 154 |
| -3 | 455 | 66 | 1040 |
| -2 | 3747 | 902 | 1333 |
| -1 | 25826 | 7553 | 3441 |
| 0 | 216930 | 41211 | 39450 |
| 1 | 56354 | 10692 | 7118 |
| 2 | 6999 | 1080 | 471 |
| 3 | 508 | 55 | 21 |
| 4 | 70 | 6 | 1 |
| 5 | 1 | 1 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 |
| 8 | 2 | 0 | 0 |

#### Number of atoms

| # Atoms | Training | Validation | Test |
|---------|----------|------------|-------|
| 000-009 | 363 | 7 | 319 |
| 010-019 | 15322 | 421 | 4342 |
| 020-029 | 63828 | 6925 | 15237 |
| 030-039 | 103631 | 25430 | 9189 |
| 040-049 | 86267 | 22697 | 11165 |
| 050-059 | 24808 | 5747 | 4343 |
| 060-069 | 8631 | 347 | 700 |
| 070-079 | 6100 | 19 | 103 |
| 080-089 | 1948 | 0 | 60 |
| 090-099 | 215 | 0 | 647 |
| 100-109 | 0 | 0 | 2104 |
| 110-119 | 0 | 0 | 1833 |
| 120-129 | 0 | 0 | 873 |
| 130-139 | 0 | 0 | 977 |
| 140-149 | 0 | 0 | 702 |
| 150-159 | 0 | 0 | 430 |
| 160-169 | 0 | 0 | 21 |

#### Number of heavy atoms

| # Atoms | Training | Validation | Test |
|---------|----------|------------|-------|
| 000-009 | 6147 | 44 | 3056 |
| 010-019 | 131447 | 18303 | 21270 |
| 020-029 | 156412 | 43246 | 16915 |
| 030-039 | 14183 | 0 | 4114 |
| 040-049 | 2859 | 0 | 743 |
| 050-059 | 65 | 0 | 3417 |
| 060-069 | 0 | 0 | 3330 |
| 070-079 | 0 | 0 | 198 |
| 080-089 | 0 | 0 | 2 |



## Acknowledgements

We gratefully acknowledge the following people and organisations who helped us with the data used for training, validation, and testing:

- Enamine
- [Patrick Bleiziffer, Kay Schaller, and Sereina Riniker][riniker]
- The Open NCI Database
- The Ligand Expo and Protein Data Bank (PDB)
- MolSSI
- Peter Eastman, Pavan Kumar Behara, David L. Dotson, Raimondas Galvelis, John E. Herr, Josh T. Horton, Yuezhi Mao, John D. Chodera, Benjamin P. Pritchard, Yuanqing Wang, Gianni De Fabritiis, and Thomas E. Markland. "SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials." Scientific Data 10, 11 (2023). https://doi.org/10.1038/s41597-022-01882-6
- RDKit
- the ChEMBL database
Binary file not shown.
12 changes: 12 additions & 0 deletions openff/nagl_models/tests/test_model_performance.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,18 @@
[-0.101634, 0.133269, -0.605404,
0.043411, 0.043411, 0.043411,
0.019863, 0.019863, 0.40381 ]
),
(
"openff-gnn-am1bcc-0.1.0-rc.2.pt",
[-0.111519, 0.133221, -0.60017,
0.044064, 0.044064, 0.044064,
0.022841, 0.022841, 0.400592,]
),
(
"openff-gnn-am1bcc-0.1.0-rc.3.pt",
[-0.09629, 0.13245, -0.60293,
0.04465, 0.04465, 0.04465,
0.01728, 0.01728, 0.39826]
)
]
)
Expand Down
1 change: 1 addition & 0 deletions openff/nagl_models/tests/test_openff_nagl_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ def test_get_models_by_type():
"openff-gnn-am1bcc-0.0.1-alpha.1",
"openff-gnn-am1bcc-0.1.0-rc.1",
"openff-gnn-am1bcc-0.1.0-rc.2",
"openff-gnn-am1bcc-0.1.0-rc.3",
]

assert all_model_stems == expected_stems
Expand Down

0 comments on commit 7579994

Please sign in to comment.