Quantum Machine Learning for Drug-like Molecules: QMugs & DelFTa

Clemens Isert
Β Β·Β Deep Learning for Drug Design & Chemistry

Quantum mechanical (QM) methods offer a fundamental description of physical and chemical systems. [1] As such, they are interesting from a drug-design perspective, given that electronic structure is responsible for a wide range of drug-relevant properties. [2-4]

The various computational approaches for QM calculations differ in their accuracy and computational cost, and one generally needs to make a tradeoff between these two objectives:

Since accurate QM calculations are associated with high computational costs and this cost grows exponentially with the number of atoms in the system under study, using QM for drug design is difficult from a computational cost and throughput perspective. Quantum machine learning (QML), also termed "atomistic ML" or "machine learning potentials" aims to predict QM properties using machine learning (ML) algorithms to drastically increase the speed at which QM properties can be obtained. [6]

(Note that the term "quantum machine learning" is also used to describe ML on quantum computers)

To train the ML methods to predict QM properties, we need reliable datasets of drug-like molecules. The QM9 dataset [7] , which has routinely been used to train QML methods, contains QM properties for molecules with up to 9 heavy atoms - quite different to drug candidates which are generally much larger and more complex molecules.

The QMugs dataset has been developed to overcome this limitation. This blog post offers a brief tutorial on how to use the QMugs dataset [8], containing QM properties of ~2M drug-like molecules, to train your own models on a subset of chemical space relevant to drug design. We'll also discuss the open-source DelFTa toolbox [9] which was trained on QMugs and which can predict QM properties in an easy and user-friendly fashion.


PART 1 – QMugs

To get a detailed description of the generation and exact content of the QMugs dataset, please take a look at the original publication. In this blog post, we'll focus more on a hands-on tutorial on how to use the dataset. It contains a wide range of QM properties, ranging from molecular properties (e.g., total & formation energy) to atomic- (e.g., polarizability, dispersion coefficients, charges) and bond-level (e.g., bond orders) properties. Many properties are available both on the density functional level of theory (DFT) and using the semi-empirical GFN2-xTB [10] method.

To download the full dataset, you can use the following link: online repository. Depending on which properties you want to learn and which representation you're interested in working with, pick the files which are relevant for your use case:

  • structures.tar.gz: SDF format files for all compounds in the QMugs dataset including their 3D geometry; with molecular, atomic, and bond properties in the file. Can be easily loaded using e.g., RDKit [11] (see example below)

  • summary.csv: Overview of all compounds, good to get a general overview of the dataset & distribution of individual properties

  • wfns: Serialized wavefunction files which you can open with e.g., Psi4 [12], if you want to predict wavefunctions, orbitals, or electron densities.

  • vibspectra.tar.gz: Computed vibrational (IR/Raman) spectra for all compounds.

After downloading the structures.tar.gz file and extracting it (tar -xvf structures.tar.gz), you can access molecular properties for any specific molecule as follows (adapt the ChEMBL-ID and conformer number):

from rdkit import Chem
mol = next(Chem.SDMolSupplier("./structures/CHEMBL1071/conf_00.sdf", removeHs=False))
props = mol.GetPropsAsDict()
dft_formation_energy = props["DFT:FORMATION_ENERGY"]
print(dft_formation_energy)
>> -7.028001099502603 # in Hartree

By replacing the DFT:FORMATION_ENERGY with the desired property, you can obtain the endpoint(s) you’re interested in. Other examples include:

for many properties (including those shown above), you can replace DFT by GFN2 in the key to obtain the semi-empirical results instead. Again, a detailed overview of all available properties and their units is given in the paper (Table 2), and more extensive explanations on how to load atomic and bond properties is specified in the README.MD in the repository.

We'll add a bit of guidance on how to use the dataset to build your own model for QM property prediction - specifically, we recommend you take care of two QMugs-specific aspects when building your model to ensure you don't get test set leakage, resulting in an overestimation of model accuracy.

  1. QMugs contains three conformers per molecule (with one ChEMBL-ID). We recommend you put all three conformers in the same dataset split (e.g., the training set). Having one or two conformers in the training set and the remaining one(s) in the test set will make it unrealistically easy for the model to make predictions.

  2. In part due to the SMILES preprocessing, there are a few compounds which, although they have different ChEMBL-IDs, have the same SMILES. These compounds are indicated in summary.csv via the column nonunique_smiles. We recommend placing compounds with the same SMILES in the same dataset split as well.

Lastly, we note that you can use the ChEMBL database [13] to connect the QM properties from QMugs to any bioactivity annotations (such as binding affinity etc.) that you are interested in.


PART 2 - DelFTa

The DelFTa toolbox is an open-source, user-friendly toolbox trained on the QMugs dataset. Basically, it provides you with a quick and convenient way to obtain an ML-based approximation for DFT-computed quantities. For full details, see again the relevant publication. It can either be used in a Ξ”-learning fashion (predicting a correction to the semi-empirical GFN2-xTB [10] calculation) or in a direct-learning fashion (directly predicting the desired DFT output). Rather than discussing the details of the 3D message-passing neural network we're using for this task, we'll again focus on a tutorial for using the toolbox. You can find more tutorial notebooks in the project's GitHub repo.

DelFTa can be easily installed via conda:

conda install delfta -c delfta -c pytorch -c rusty1s -c conda-forge

Let's start with our first prediction for a simple molecule (the "Hello, World!" of QML). First, import the relevant functions/classes and generate a molecule object from a SMILES using OpenBabel [14]. (Note that you can easily convert between RDKit and OpenBabel molecules)

from openbabel.pybel import readstring
from delfta.calculator import DelftaCalculator
smiles = "O=C(C)Oc1ccccc1C(=O)O" # aspirin
mol = readstring("smi", smiles)

Next, we create an instance of the DelftaCalculator class. The key options here are delta (if False, the model will do direct learning otherwise Β Ξ”-learning, defaults to True) and xtbopt (True if you want to use GFN2-xTB [10] for geometry optimization of the input structure, defaults to False). See this Tutorial for a more in-depth look at other options. Since we only specified the input molecule via SMILES (so no 3D information), DelFTa internally uses OpenBabel and the MMFF94 force field [15] to generate a conformer. With xbtopt=False, this is not further optimized using GFN2-xTB, though you can change this if you want. Of course, you could also directly supply an existing geometry via e.g., SDF or XYZ files.

calc_delta = DelftaCalculator(delta=True, xtbopt=False, return_optmols=True)
predictions_delta_mmff94, opt_mols_mmff94 = calc_delta.predict(mol)
opt_mol_mmff94 = opt_mols_mmff94[0]

Output:

2022/08/25 10:25:53 AM | DelFTa | INFO: Now running xTB…
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00,  9.72it/s]
2022/08/25 10:25:53 AM | DelFTa | INFO: Now running network for model wbo_delta…
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 36.28it/s]
2022/08/25 10:25:53 AM | DelFTa | INFO: Now running network for model charges_delta…
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 36.61it/s]
2022/08/25 10:25:53 AM | DelFTa | INFO: Now running network for model single_energy_delta…
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 36.08it/s]
2022/08/25 10:25:53 AM | DelFTa | INFO: Now running network for model multitask_delta…
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 37.14it/s]

Let’s show the predictions (omitting charges and wbo (Wiberg bond orders) for brevity):

import pandas as pd
df = pd.DataFrame(predictions_delta_mmff94)[["E_form", "E_homo", "E_lumo", "E_gap", "dipole"]]
df

The molecular properties are provided as an array, so that you can access properties when providing a list of input molecules.

Since we specified return_optmols=True, we also get the optimized molecule back for visualization/further analysis.

Let's visualize the molecule:

import py3Dmol # conda install -c conda-forge py3dmol
size=(500,500)
view = py3Dmol.view()
view.removeAllModels()
view.addModel(opt_mol_mmff94.write("sdf"), "sdf")
charges = predictions_delta_mmff94["charges"][0]
for atom, charge in zip(opt_mol_mmff94.atoms, charges): 
    view.addLabel(f"{charge:.2f}", {"position":{"x":atom.coords[0],"y":atom.coords[1],"z":atom.coords[2]}, "fontSize": 12})
view.setStyle({'stick':{}})
view.zoomTo()
view.show()

You might notice that the optimized molecules are returned as a list (and we pick the first and only element: opt_mol_mmff94 = opt_mols_mmff94[0]). This is because the calculator can operate on a list of molecules as well – and we highly recommend you run the calculator this way rather than calling it repeatedly with individual molecules. The reason is simple: if you provide all molecules upfront, we can batch molecules together when running them through the network which will make the computation faster.

We hope that DelFTa can be useful in your own projects. If you run into any issues or notice bugs, please open an issue in the GitHub repo.


Author Bio

Clemens is a third-year PhD student in the Molecular Design Laboratory of Prof. Gisbert Schneider at ETH Zurich and recently completed an internship in the Computer-Assisted Drug Design group at Novartis. After his undergraduate studies in Chemical Engineering, he first got into cheminformatics and machine learning during his master thesis at MIT working with Connor Coley and Klavs F. Jensen. Clemens’ research interests lie in exploring how quantum mechanical calculations can be used in the drug discovery process.


References

[1] Β  Β Feynman Lectures on Physics, https://www.feynmanlectures.caltech.edu (accessed 20.10.21)

[2] Khrenova et al., J. Chem. Theory Comput. 2010, 6, 2293– 2302.

[3] Freccero et al., Chem. Eur. J. 2008, 14, 653–663.

[4] Rupp et al., PLoS Comput. Biol. 2014, 10, e1003400.

[5] von Lilienfeld, 31st Conference on Neural Information Processing Systems, 2017.

[6] von Lilienfeld et al., Nature Rev. Chem. 2020, 4, 347-358.

[7] Ramakrishnan et al., Sci. Data 2014, 1, 1–7.

[8] Isert et al., Sci. Data 2022, 9, 273.

[9] Atz et al., Phys. Chem. Chem. Phys., 2022, 24, 10775-10783.

[10] Β Bannwarth et al, J. Chem. Theory Comput. 2019, 15, 1652–1671.

[11] Landrum, RDKit: Open-Source Cheminformatics. http://www.rdkit.org (accessed 11.08.22).

[12] Β Smith et al., J. Chem. Phys. 2020, 152, 184108.

[13] Β Mendez, et al., Nucleic Acids Res. 2019, 47, D930–D940.

[14] Β http://openbabel.org (accessed 18.08.22)

[15] Β Tosco et al., J. Cheminformatics, 2014, 6, 37.