The ability to control the design process and incorporate prior knowledge is crucial for tailoring molecules to specific binding sites. MolSnapper is a novel tool for conditioning diffusion models in structure-based drug design by integrating expert knowledge through 3D pharmacophores. Our method produces molecules better tailored to fit specific binding sites, achieving high 3D similarity to the reference ligands. It also yields approximately twice as many valid molecules as alternative methods. In this blog post, I’ll tell you more about the problem we’re trying to solve, why we chose the approach we did, and how to use the model.
The Challenge of Structure-based Drug Design
Drug design is a complex and expensive process, often taking years, if not decades, and costing billions of dollars. Computational methods, especially deep learning, are often seen as a way to make this process easier. We dream of an AI tool doing everything and finding the perfect drug, but that's still far off.
In drug discovery, a key stage is finding a target protein and identifying molecules that can interact with it. Target proteins typically have binding sites or pockets where other molecules, known as ligands, can bind.
In the field of structure-based drug design (SBDD), the objective is to generate ligands that bind specifically to target proteins in precise 3D conformations. But sometimes, deep learning models generate molecules that don't make sense chemically or physically (as shown in the figure below) and fail to form hydrogen bond interactions with the target protein. Molecules that are not valid cannot be synthesized in a real lab. Moreover, if they fail to interact with the target protein, they can't function as drugs.
This may be partly because evaluation metrics do not capture validity, for example when generated molecules are assessed only as 2D graphs. It is also challenging to learn interactions without enough structural data, which is expensive to acquire.
Until we see progress in methodologies and data accessibility, a practical approach is to integrate guidance within the models themselves. This means using flexible models that can incorporate expert knowledge, such as which regions of the protein pocket are important for binding, giving researchers more control. The aim is to achieve the desired protein interactions effectively.
Introducing MolSnapper
In our quest to address these challenges, we introduce MolSnapper, a tool designed to condition diffusion models for SBDD by integrating expert knowledge. Diffusion models are the current state-of-the-art deep learning models for molecular generation. Our motivation is to condition models trained on molecule datasets, which are larger and more representative of drug molecule space than datasets containing only protein-molecule interactions. By exploiting these datasets and integrating expert knowledge, MolSnapper aims to generate molecules that are not only plausible and valid but also capable of forming hydrogen bond interactions similar to those observed in ground truth ligands.
How It Works:
MolSnapper allows users to select reference points, chosen by experts such as biochemists or medicinal chemists, to guide atom movements toward specific positions. This, coupled with clash prevention mechanisms, encourages the creation of viable molecules that interact with the target protein as intended.
MolSnapper builds upon the pretrained MolDiff model, originally trained on the GEOM-Drug dataset for molecule generation. MolDiff is among the top-performing models for molecule generation and was the first to diffuse the molecule's bonds, resulting in molecules with improved validity and synthetic accessibility. Instead of retraining MolDiff, we adapt its reverse generation process to generate molecules within pockets.
The process begins by selecting reference points, typically chosen by domain experts, which significantly influence the final outcome. Different selections can be tested for the same target to explore various possibilities. In our evaluation, these reference points are represented by 3D pharmacophores, which capture the chemical interactions that are crucial for ligand binding to macromolecular targets. At each step of the reverse process, atoms are guided towards these points, gradually securing their positions. Additionally, a clash guidance function prevents potential clashes with protein atoms. Finally, we require atoms placed at hydrogen-bond donor points to be nitrogen (N) and atoms placed at acceptor points to be oxygen (O).
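To make the idea of position and clash guidance concrete, here is a minimal toy sketch in plain Python. It is not MolSnapper's implementation: the function name guide_positions, the pull and clash_dist parameters, and the update rule are all assumptions chosen for illustration, and clash_rate here only borrows the name of the command-line flag without claiming to match its real semantics. It also leaves out the diffusion model itself and only shows the two nudges applied to atom coordinates.

```python
import math

def guide_positions(atoms, ref_points, protein_atoms,
                    pull=0.2, clash_rate=0.1, clash_dist=2.0):
    """One illustrative guidance step (hypothetical, not MolSnapper's code).

    atoms         : list of [x, y, z] ligand atom positions
    ref_points    : dict {atom_index: (x, y, z)} for the guided atoms
    protein_atoms : list of (x, y, z) pocket atom positions
    """
    new_atoms = []
    for i, pos in enumerate(atoms):
        pos = list(pos)
        # Pull a guided atom a fraction of the way toward its reference point.
        if i in ref_points:
            target = ref_points[i]
            pos = [p + pull * (t - p) for p, t in zip(pos, target)]
        # Push the atom away from any protein atom closer than clash_dist.
        for prot in protein_atoms:
            d = math.dist(pos, prot)
            if 0.0 < d < clash_dist:
                scale = clash_rate * (clash_dist - d) / d
                pos = [p + scale * (p - q) for p, q in zip(pos, prot)]
        new_atoms.append(pos)
    return new_atoms

# Toy usage with made-up coordinates: atom 0 is guided toward a reference
# point, atom 1 starts too close to a protein atom and is pushed away.
atoms = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
refs = {0: (1.0, 0.0, 0.0)}
protein = [(5.5, 0.0, 0.0)]
for _ in range(50):
    atoms = guide_positions(atoms, refs, protein)
```

In this toy version the guided atom ends up at its reference point and the clashing atom settles just inside the clash distance. In the real method these nudges are interleaved with the denoising steps of the pretrained model, so the rest of the molecule can adapt around the fixed points.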
Example:
Start with your target protein: You begin with the target protein you're interested in designing ligands for.
Choose reference points: Select reference points based on known binding interactions. These can come from a known ligand that binds to the protein, fragments that have shown binding affinity, or other molecules of interest. For example, you can use PyMOL to identify positions on the protein where interactions occur. Save these reference points to an SDF file. (Note: If you have a reference ligand, the project’s code can automatically extract pharmacophores from it.)
Use sample_single_pocket.py to generate molecules: Run the script with the reference points chosen earlier. For instance:
python scripts/sample_single_pocket.py --outdir ./outputs --config ./configs/sample/sample_MolDiff.yml --batch_size 32 --pocket_path ./data/example_1h00/processed_pocket_1h00.pkl --sdf_path ./data/example_1h00/ref_points.sdf --use_pharma False --clash_rate 0.1
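If you want to try several sets of reference points for the same pocket, a small wrapper can build one command per SDF file. The sketch below only reuses the flags and paths from the command above; the helper name build_sample_command, the two ref_points_*.sdf file names, and the per-run output directories are hypothetical placeholders, not files or options shipped with the project.

```python
import shlex

def build_sample_command(pocket_path, sdf_path, outdir="./outputs",
                         batch_size=32, clash_rate=0.1, use_pharma=False):
    """Assemble the argument list for scripts/sample_single_pocket.py."""
    return [
        "python", "scripts/sample_single_pocket.py",
        "--outdir", outdir,
        "--config", "./configs/sample/sample_MolDiff.yml",
        "--batch_size", str(batch_size),
        "--pocket_path", pocket_path,
        "--sdf_path", sdf_path,
        "--use_pharma", str(use_pharma),
        "--clash_rate", str(clash_rate),
    ]

pocket = "./data/example_1h00/processed_pocket_1h00.pkl"
# Hypothetical alternative reference-point files for the same target.
sdf_files = ["./data/example_1h00/ref_points_a.sdf",
             "./data/example_1h00/ref_points_b.sdf"]

commands = [
    build_sample_command(pocket, sdf, outdir=f"./outputs/run_{i}")
    for i, sdf in enumerate(sdf_files)
]
for cmd in commands:
    # Print the command; to execute it, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each run in its own output directory makes it easier to compare how different reference-point choices change the generated molecules.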
Evaluating generated molecules
There isn't a single way to evaluate the generated molecules. Our focus is on measuring the 3D similarity between the generated ligands and the ground truth ligand. A higher similarity score means a generated molecule more closely resembles the real ligand. We don't expect every generated molecule to match the ground truth, but achieving high similarity with some of them is considered a success.
We begin by identifying the subset of ligands with the highest similarity scores, referred to as the 'Top x'. Once these subsets are determined, we calculate the other metrics specifically for these groups. It's important to note that we only evaluate ligands that pass the PoseBusters checks, ensuring they are chemically and physically valid and do not clash with the protein.
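The 'Top x' procedure can be sketched in a few lines of Python. This is an illustration under assumptions rather than the evaluation code from the paper: the record fields (similarity, valid, sa), the function name summarize_top_x, and the numbers are invented for the example, and the validity flag stands in for the outcome of the PoseBusters checks.

```python
def summarize_top_x(records, x, metric_keys):
    """Keep valid ligands, rank them by similarity, average metrics over the top x."""
    # Only ligands that passed the validity checks are considered at all.
    valid = [r for r in records if r["valid"]]
    # Rank by 3D similarity to the ground truth ligand, highest first.
    top = sorted(valid, key=lambda r: r["similarity"], reverse=True)[:x]
    if not top:
        return {}
    return {k: sum(r[k] for r in top) / len(top) for k in metric_keys}

# Made-up records: the most similar ligand is invalid, so it is excluded.
records = [
    {"similarity": 0.90, "valid": True, "sa": 0.8},
    {"similarity": 0.95, "valid": False, "sa": 0.9},
    {"similarity": 0.60, "valid": True, "sa": 0.6},
    {"similarity": 0.40, "valid": True, "sa": 0.2},
]
summary = summarize_top_x(records, x=2, metric_keys=["similarity", "sa"])
```

Filtering before ranking matters: a molecule that looks very similar to the ground truth but fails the validity checks never enters the top subset, so it cannot inflate the reported metrics.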
When compared to other conditioning methods without pocket-specific training, our approach succeeds in generating twice as many valid ligands with better 3D similarity. Evaluating this subset reveals that MolSnapper-generated ligands have higher synthetic accessibility and better preservation of the original hydrogen bonds.
Next, we compared MolSnapper against conditioning methods with pocket-specific training. These models were trained specifically on protein-ligand interactions, with training data distributed similarly to the test set. In this comparison, MolSnapper performed comparably while producing more valid molecules and achieving higher synthetic accessibility scores.
These results highlight the effectiveness of MolSnapper's approach, indicating that more complex models requiring training on pocket data do not necessarily provide advantages over our conditioning methods.
More results can be found in the paper.
Conclusion
In summary, MolSnapper is a method to condition diffusion models with 3D pharmacophoric constraints, facilitating controlled molecular generation. It makes it easy to use prior knowledge to improve molecule design. By building on a model trained on a broader molecule space and conditioning it for generation within pockets, rather than training solely on protein-molecule sets, MolSnapper achieves a more accurate representation of real molecules. We hope you’ll give it a try yourself, and reach out if you have any more questions.
Paper: https://www.biorxiv.org/content/10.1101/2024.03.28.586278