Integrated Molecular Modeling and Machine Learning for Drug Design

Nobody is a fan of getting sick. Small-molecule drugs can treat disease by modulating the biological processes that drive it. But finding molecules that have the desired effect on their target while also behaving well in the body (ADME/T: absorption, distribution, metabolism, excretion, and toxicity) is a significant challenge. Researching and developing a new drug is long and expensive, and computational approaches have become an indispensable part of reducing these costs.

In Professor Yingkai Zhang’s lab, we're working to integrate molecular modeling and machine learning to design better modulators. We'll summarize those recent efforts for you here and point you to tutorials on how to use the tools we've developed for drug-related applications. We'll discuss:

- Pocket-guided rational design approaches to target protein−protein interactions.

- Protein−ligand docking with machine learning-based scoring functions.

- Molecular representation learning with deep graph neural networks.

If you want more after reading this post, check out our perspective article "Integrated Molecular Modeling and Machine Learning for Drug Design" and the accompanying source code on GitHub.

Computational Tools at All Stages of Drug Discovery and Development

Modern therapeutic development involves several interconnected stages, and multiple iterations are usually required to bring a new drug to market; the full set of steps can be visualized as a Biologics 4D Map. Here’s a simplified “pipeline” of therapeutic development where computational tools (in the green boxes) play important roles:

Hit identification involves finding compounds with the desired bioactivity. Computational tools for this stage fall into ligand-based methods and structure-based approaches:

  • Ligand-based methods use prior knowledge of known actives to predict new compounds with similar bioactivities. Similarity search, pharmacophore modeling, and quantitative structure−activity relationships are common tools for ligand-based screening (see the similarity-search sketch after this list).

  • Structure-based methods such as molecular docking and molecular dynamics simulations predict the binding pose and affinity between a protein and a ligand. Binding site prediction tools identify regions of the protein surface where a ligand can bind to produce the desired effect (activation, inhibition, or modulation). If the structure of the protein has not been solved experimentally, homology modeling and ab initio protein structure prediction can be used to model it.
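
To make the ligand-based idea concrete, here is a minimal similarity-search sketch using RDKit. The SMILES strings and the 0.3 Tanimoto cutoff are arbitrary placeholders for illustration, not recommendations from our work:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Known active (query) and a small candidate library -- placeholder SMILES
query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
library = [
    "CC(=O)Nc1ccc(O)cc1",              # acetaminophen
    "OC(=O)c1ccccc1O",                 # salicylic acid
    "CCN(CC)CCNC(=O)c1ccc(N)cc1",      # procainamide
]

# Morgan (ECFP4-like) fingerprint of the query
query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)

for smi in library:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(query_fp, fp)
    # Keep candidates above an (arbitrary) similarity threshold
    print(f"{smi:35s} Tanimoto = {sim:.2f}", "(keep)" if sim >= 0.3 else "(discard)")
```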

Hit-to-lead and lead optimization involve producing more potent and selective compounds with desirable physicochemical properties. It is estimated that over 40% of drug candidates are withdrawn in preclinical tests due to ADME/T concerns. Therefore, computational tools that can predict the ADME/T properties of small molecules offer valuable insights into the drug discovery process.
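
As a very simple illustration of property-based filtering, the sketch below counts Lipinski rule-of-five violations with RDKit. This is only a crude developability check, not an ADME/T predictor like the ML models discussed later in this post:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def rule_of_five_flags(smiles: str) -> dict:
    """Count Lipinski rule-of-five violations as a crude developability check."""
    mol = Chem.MolFromSmiles(smiles)
    props = {
        "MolWt": Descriptors.MolWt(mol),     # <= 500
        "LogP": Crippen.MolLogP(mol),        # <= 5
        "HBD": Lipinski.NumHDonors(mol),     # <= 5
        "HBA": Lipinski.NumHAcceptors(mol),  # <= 10
    }
    props["violations"] = sum([
        props["MolWt"] > 500,
        props["LogP"] > 5,
        props["HBD"] > 5,
        props["HBA"] > 10,
    ])
    return props

print(rule_of_five_flags("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, as a placeholder example
```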

Pocket-Guided Rational Design Approaches to Target Protein-Protein Interactions

In modern approaches to structure-based drug design, the focus is on analyzing concave regions of biomolecular surfaces. These regions range from well-defined small-molecule binding sites to extensive protein−protein interaction (PPI) interfaces. Understanding these concave surfaces helps in many ways, from lead optimization and enhanced ligand screening to the identification of previously undiscovered druggable sites.

In 2015, our lab developed a Python program named AlphaSpace, an alpha-sphere-based method that builds on the popular fpocket method and leverages concepts from fragment-based drug discovery for PPI analysis. AlphaSpace uses a tuned clustering procedure to demarcate pockets that accommodate, on average, one fragment each, and it defines alpha-atoms: theoretical ligand atoms whose properties reflect a docked fragment. It represents the entire PPI surface by retaining shallower fragment-centric pockets during evaluation and suggests targetable pockets near the binding partner for ligand optimization.

More recently, we extended the method with AlphaSpace 2.0, adding β-clustering: a pseudomolecular representation of the concave site composed of β-atoms. This β-cluster representation allows direct pocket-to-ligand shape comparison and can be used to guide ligand optimization.
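
To illustrate the fragment-centric idea, here is a toy sketch (not the actual AlphaSpace implementation — see the GitHub repository for the real tool): given a set of alpha-sphere centers from a pocket region, cluster them with a distance cutoff so each cluster is roughly fragment-sized, then take cluster centroids as pseudoatoms summarizing each pocket.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy alpha-sphere centers (x, y, z in Angstroms) -- in practice these come
# from an alpha-shape/Voronoi analysis of the protein surface.
centers = np.array([
    [0.0, 0.0, 0.0], [1.2, 0.3, 0.1], [0.8, 1.0, 0.4],   # pocket A
    [8.0, 0.5, 0.2], [8.9, 1.1, 0.0], [8.4, 0.2, 0.9],   # pocket B
])

# Average-linkage clustering with a distance cutoff chosen (arbitrarily here)
# to yield roughly fragment-sized groups.
Z = linkage(centers, method="average")
labels = fcluster(Z, t=4.0, criterion="distance")

# Cluster centroids play a role analogous to pseudoatoms for each pocket.
for k in sorted(set(labels)):
    members = centers[labels == k]
    print(f"pocket {k}: {len(members)} spheres, centroid {members.mean(axis=0)}")
```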

The AlphaSpace workflow for optimizing a ligand to bind a protein begins with a selected PDB structure and follows the purple arrows until the optimization is validated experimentally; the additions introduced by AlphaSpace 2.0 follow the orange arrows.

In practice, AlphaSpace has been used to guide the optimization of minimal protein mimetics that target PPIs and to analyze protein structures for targetable binding sites. For example, our binding pocket analysis guided a ~10-fold improvement in the binding affinity of a p53 mimetic.

Aside from protein−protein interaction analysis, AlphaSpace has also helped design small-molecule inhibitors and understand their binding mechanisms. For example, our collaborators detected a cryptic pocket in a metastable state of apo striatal-enriched protein tyrosine phosphatase (STEP) from MD simulations, leading to 11 novel inhibitors (IC50 = 9.7−47.7 μM).

Protein−Ligand Docking with Machine Learning-based Scoring Functions

Molecular docking methods are commonly used to assess large libraries of molecules in structure-based virtual screening. They follow the lock-and-key model of drug action and predict noncovalent interactions between the target and small molecules. Molecular docking programs rely on a scoring function to evaluate the binding affinity and to optimize the binding orientation between protein and ligand. Because the scoring function determines how accurately the program predicts ligand binding, developing a robust, efficient, and accurate scoring function is a critical component of molecular docking. Check out this review by Chao et al. for a more in-depth discussion of scoring functions.
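
As a concrete example, here is a minimal docking run using the AutoDock Vina Python bindings (Vina 1.2+). The file names and box parameters are placeholders, and the receptor and ligand must already be prepared in PDBQT format:

```python
from vina import Vina  # AutoDock Vina 1.2+ Python bindings

v = Vina(sf_name="vina")  # classical Vina scoring function

# Placeholder input files: a prepared receptor and ligand in PDBQT format
v.set_receptor("receptor.pdbqt")
v.set_ligand_from_file("ligand.pdbqt")

# Define the search box around the binding site (placeholder coordinates, Angstroms)
v.compute_vina_maps(center=[10.0, 12.5, -3.0], box_size=[20, 20, 20])

# Generate and score binding poses
v.dock(exhaustiveness=8, n_poses=5)
print(v.energies(n_poses=5))             # predicted binding energies (kcal/mol)
v.write_poses("docked_poses.pdbqt", n_poses=5)
```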

Classical scoring functions use a linear combination of force-field terms or interaction descriptors to evaluate the binding affinity between proteins and ligands.

Machine learning (ML)-based scoring functions have garnered increasing attention thanks to the rapid growth of experimental data and advances in computational infrastructure. RF-score was the first ML-based scoring function to outperform classical scoring functions in scoring power, and SFCscoreRF achieves significantly higher scoring power on common benchmarks. Unfortunately, both methods perform much worse than classical scoring functions on docking and screening tasks.

To address this problem, we applied a Δ-ML strategy: an ML model is trained to predict the difference (correction) between the experimental binding affinity and the affinity computed by a classical scoring function. The final predicted score is obtained by adding this ML correction to the classical score.
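
Schematically, the Δ-ML idea looks like the sketch below, using a random forest from scikit-learn on synthetic, made-up data; the actual features, models, and training sets behind our scoring functions are described in the papers and GitHub repos linked in this section:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins:
#   X            -- protein-ligand interaction features (n_samples, n_features)
#   y_classical  -- affinities from a classical scoring function (e.g., Vina)
#   y_exp        -- experimental binding affinities (e.g., pKd)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y_classical = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=500)
y_exp = y_classical + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=500)

# Delta-ML: learn only the correction (residual) on top of the classical score
residual = y_exp - y_classical
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, residual)

# Final prediction = classical score + ML correction
X_new = rng.normal(size=(5, 20))
y_classical_new = 0.8 * X_new[:, 0]
y_pred = y_classical_new + model.predict(X_new)
print(y_pred)
```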

The first model to use the Δ-ML strategy was ΔVinaRF20, a random forest model that predicts corrections to the Vina score. Successive iterations improved upon it: ΔVinaXGB adds features derived from explicit water molecules, metal ions, and ligand conformational stability, and ΔLin_F9XGB builds on a classical scoring function named Lin_F9 with improved feature selection and an engineered training set. ΔLin_F9XGB achieves superior scoring, ranking, and screening performance on the CASF-2016 benchmark. There’s a tutorial on how to run ΔLin_F9XGB on GitHub.

Molecular Representation Learning with Deep Graph Neural Networks

Exploring Chemical Space Efficiently with Force Field-Optimized 3D Geometry

Deep learning (DL) models rely heavily on their training dataset to learn good representations of the vast chemical space. QM9 is a popular training dataset of ~130,000 molecules whose 3D geometries were optimized with density functional theory (DFT), a quantum mechanical modeling method that requires significant amounts of computation. The Merck Molecular Force Field (MMFF) is far less resource-intensive, so it lets us explore chemical space efficiently. MMFF-optimized geometry has given us good results in property prediction, even matching DFT geometry for molecular energy prediction. We extended QM9 by constructing the Frag20 data set, which provides both MMFF- and DFT-optimized geometries for over half a million molecules and allows direct comparison between the two methods. Models trained on this dataset achieve better than or close to chemical accuracy (1 kcal/mol) on multiple test sets. The Frag20 data set processing script is available on GitHub and the processed data set is available here.

We developed sPhysNet-MT-ens5 for predicting experimental hydration free energy and octanol/water partition coefficient (logP) from MMFF-optimized geometry, achieving state-of-the-art results on the FreeSolv and PHYSPROP datasets. A tutorial is available on GitHub: you only need to provide the SMILES of the query molecule, and the scripts handle conformer generation, force field optimization, and property prediction.
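
The first two of those steps look roughly like this in RDKit. This is a minimal sketch with a placeholder molecule; the actual tutorial scripts on GitHub handle edge cases and the sPhysNet-MT-ens5 prediction step, which we don't reproduce here:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CCO"  # placeholder query molecule (ethanol)
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))

# 1) Conformer generation with the ETKDGv3 distance-geometry method
params = AllChem.ETKDGv3()
params.randomSeed = 42
AllChem.EmbedMolecule(mol, params)

# 2) Geometry optimization with the MMFF94 force field
AllChem.MMFFOptimizeMolecule(mol)

# The MMFF-optimized 3D coordinates would then be fed to the trained model
conf = mol.GetConformer()
for atom in mol.GetAtoms():
    pos = conf.GetAtomPosition(atom.GetIdx())
    print(atom.GetSymbol(), round(pos.x, 3), round(pos.y, 3), round(pos.z, 3))
```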

Model performance comparison between sPhysNet-MT and other state-of-the-art models on experimental hydration free energy and logP prediction. Each model is trained on 50 random splits, and the error bars indicate the standard deviation.


Enhancing Downstream Tasks through Transfer Learning with Pretrained Molecular Modeling Data

In the early stages of drug discovery, data scarcity is the most common issue for DL models predicting experimental physicochemical properties. DL models rely on the size and quality of the training data, and they tend to overfit on small data sets, leading to poor generalizability and performance. We have been working to develop datasets that prepare models for a range of downstream tasks.

We first developed Frag20-Aqsol-100k as a pretraining set: 100k diverse molecules sampled from the Frag20 and CSD20 datasets. Pretraining on it and then fine-tuning on the FreeSolv data set yields better representations, and better performance, than training on FreeSolv alone. We then built Frag20-solv-678k, which contains 678,916 conformations and their calculated energetics in the gas, water, and octanol phases. Pretraining on this dataset and fine-tuning on FreeSolv (experimental hydration free energy) and PHYSPROP (experimental logP) produced a model that achieves state-of-the-art performance on both tasks simultaneously.
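
The pretrain-then-fine-tune recipe looks schematically like this PyTorch sketch, with a toy MLP standing in for the actual molecular model and random tensors standing in for the Frag20-style and FreeSolv-style data:

```python
import torch
import torch.nn as nn

# Toy network standing in for the molecular model (shared encoder + prediction head)
class PropertyModel(nn.Module):
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.head(self.encoder(x))

def train(model, x, y, lr, steps):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# 1) "Pretrain" on a large calculated dataset (random stand-in data)
x_calc, y_calc = torch.randn(1000, 64), torch.randn(1000, 1)
model = PropertyModel()
train(model, x_calc, y_calc, lr=1e-3, steps=100)
torch.save(model.state_dict(), "pretrained.pt")

# 2) Fine-tune on a small experimental dataset (random stand-in data, smaller LR)
x_exp, y_exp = torch.randn(64, 64), torch.randn(64, 1)
finetuned = PropertyModel()
finetuned.load_state_dict(torch.load("pretrained.pt"))
print("fine-tune loss:", train(finetuned, x_exp, y_exp, lr=1e-4, steps=50))
```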

Improving Molecular Representations through Mutual Learning with Multitask Training

Training on multiple tasks can yield a model that learns better representations, especially when the tasks are closely related. We’ve developed a large chemical language model called T5Chem that relies on the transformer architecture for multitask reaction prediction. It handles five reaction-related tasks: 1) forward reaction prediction, 2) reactant prediction, 3) reagent prediction, 4) reaction yield prediction, and 5) reaction type classification. The model is competitive with models trained solely on individual tasks, and we’ve provided instructions on getting started with it on GitHub.

We’ve also developed a graph neural network called sPhysNet-MT that predicts electronic energies in three phases (gas, water, and octanol) as well as the transfer free energies between those phases. In our tests, the model achieves chemical accuracy on all tasks. The GitHub repository has all the code necessary to train and evaluate the model from scratch.
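
The multitask setup can be sketched as a shared encoder with one output head per task, trained on the sum of the per-task losses. Again, this is a toy MLP stand-in with random data, not the actual sPhysNet-MT architecture:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with one regression head per task."""
    def __init__(self, in_dim=64, hidden=128, tasks=("E_gas", "E_water", "E_octanol")):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in tasks})

    def forward(self, x):
        h = self.encoder(x)
        return {t: head(h) for t, head in self.heads.items()}

model = MultiTaskModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Random tensors stand in for molecular features and per-phase energy labels
x = torch.randn(32, 64)
targets = {t: torch.randn(32, 1) for t in model.heads}

for _ in range(20):
    opt.zero_grad()
    preds = model(x)
    # Sum of per-task losses; transfer free energies are differences of phase energies
    loss = sum(loss_fn(preds[t], targets[t]) for t in targets)
    loss.backward()
    opt.step()

print({t: round(float(loss_fn(model(x)[t], targets[t])), 4) for t in targets})
```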

Conclusion

We’ve been working hard to develop better datasets, proxies, and models for molecular modeling and drug design - we couldn’t cover all of our efforts in this blog post. To dig deeper, start with our perspective article Integrated Molecular Modeling and Machine Learning for Drug Design. After that, please feel free to reach out to us with any questions!
