Introduction
Recently, I was involved in a ligand-based virtual screening campaign. There came a point where we had established baselines, and we wanted to explore more sophisticated approaches – perhaps by stacking some SOTA molecular prediction models. I dedicated over a month to studying research papers. However, the more I read, the more confused I became. I started questioning the effectiveness of machine learning for novel molecules: does it genuinely work, or does it merely memorize training data? I ended up not trusting any theoretical work that claimed impressive results on benchmarks.
My confusion stemmed mainly from a discrepancy between how models are selected for use and how they are actually applied in practice. I observed two common scenarios:
In virtual screening studies, researchers often apply molecular property models to novel molecules. However, they select these models based on their performance with non-novel molecules that are highly similar to the training set.
Molecular property prediction models are also employed as oracles to guide generative models. They are applied iteratively during optimization, a process that demands sensitivity: the ability to discern minor modifications. Yet this critical ability is not the basis for model selection. It is rarely tested, and its evaluation is absent from papers on molecular generation through step-by-step optimization.
These observations led me to develop a new benchmark. I tested a range of models, created a library, and published the results at the NeurIPS 2023 Datasets and Benchmarks Track. In this post, I will explain how to better select molecular property prediction models depending on your specific task.
Key Points:
Yes, predictive models do generalize and can predict novel molecules more accurately than mere chance.
Yes, predictive models are capable of distinguishing minor modifications and can effectively guide an optimization process.
However, the optimal models for these tasks vary. It's crucial to perform the correct model selection to choose the most suitable one.
The Lo-Hi Splitter library will assist you in selecting the best models for your specific needs.
Why are we interested in generalizability and sensitivity to minor modifications? These qualities are important in two practical scenarios of ligand-based drug discovery.
Hit Identification
Hit identification (Hi) is an early stage in the drug discovery pipeline, where we search for molecules with desired properties, such as the ability to inhibit a target. In virtual screening, a model is trained to predict properties and then applied to assess a large chemical library, selecting the most promising molecules.
One significant challenge is that the top-ranked molecules often turn out to be the same as those in the training set. This similarity is a consequence of the models being slightly overfitted. To address this, researchers sometimes prioritize the novelty of molecules, filtering their libraries by Tanimoto similarity to the training set and thereby excluding overly similar molecules, a practice followed in a number of published screening studies. While hit identification does not always focus on novelty, here I specifically focus on the scenario where generalization to novel molecules is the primary concern.
Lead Optimization
The next step is Lead optimization (Lo). Once a hit is found, we aim to optimize certain properties, such as logP or blood-brain barrier penetration. Since similar molecules generally have similar properties, we often examine similar molecules with minor modifications. In such cases, it's beneficial to have a machine learning model for the optimized property that is sensitive to these minor modifications. This allows us to use the model to guide the optimization process.
This aspect is closely related to the field of goal-directed molecular generation, often formulated as the problem of searching for molecules with maximal activity within the ε-neighborhood of a known hit, within a particular scaffold, or as an optimization process in latent space.
The molecular generation challenge requires an "oracle" – a predictive model that guides the generative process step by step. Interestingly, it's rarely tested whether these oracles can distinguish small modifications, nor are they typically selected specifically for this capability.
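To make the "oracle" role concrete, here is a minimal sketch of a single greedy optimization step. Both predict_property and propose_analogues are hypothetical placeholders for a trained property model and a generator of close analogues; they are not parts of any existing tool:

```python
# A minimal sketch of one oracle-guided optimization step. Both helpers are
# hypothetical placeholders: predict_property stands for any trained property
# model, propose_analogues for any source of close analogues of a hit.
def greedy_optimization_step(hit_smiles, predict_property, propose_analogues):
    """Return the analogue of the hit that the oracle scores highest."""
    candidates = propose_analogues(hit_smiles)  # minor modifications of the hit
    # The oracle only helps if it ranks these near-identical molecules correctly,
    # which is exactly the ability the Lo benchmark measures.
    scored = [(predict_property(smiles), smiles) for smiles in candidates]
    best_score, best_smiles = max(scored)
    return best_smiles, best_score
```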
Novelty
I've mentioned novelty several times, but how do we measure it? What makes a molecule novel, and what constitutes a minor modification of an existing one?
The answer, frankly, is subjective. When confronted with the notion of novelty, it's best to consider your goals and think about how to formalize them. There are several ways to approach this.
One perspective is to consider novel molecules as those that can be patented. This viewpoint matters from an intellectual property standpoint, but it's not practically feasible as a selection criterion. Checking patentability is a daunting task (there are companies that charge for this service), largely because patents are written in complex legal language, which makes it currently impossible to automatically check a library of, say, a billion molecules.
Another approach might be to tie novelty to property change: ideally, we would find a similarity threshold that separates molecules likely to share a property from those likely to differ in it. However, such thresholds vary significantly from property to property, making it difficult to identify a universally applicable standard.
"Similarity is in the eye of the beholder." There's a fascinating intersection between medicinal chemistry and psychology focused on disentangling and formalizing chemists' intuition. Within this field, there's a study specifically about molecular similarity. This study replicated the formal procedure of the EMA, involving 100 pairs of molecules presented to 143 experts who were asked whether they considered the molecules similar. Interestingly, about half of the experts considered a pair different enough when the Tanimoto similarity using ECFP4 fingerprints dropped below 0.4.
While the Tanimoto similarity threshold has its biases, it is a straightforward and quick method that seems to align with human intuition. Moreover, there are practical and theoretical studies that use a 0.4 Tanimoto similarity as a benchmark for novelty. Therefore, I adopted this threshold for my work.
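Concretely, here is what that check might look like. This is a minimal sketch assuming RDKit; ECFP4 corresponds to a Morgan fingerprint with radius 2, and the function names are mine:

```python
# A minimal sketch of the novelty check used in this post, assuming RDKit.
# ECFP4 corresponds to a Morgan fingerprint with radius 2; names are illustrative.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles, n_bits=2048):
    """ECFP4 bit fingerprint of a molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)

def is_novel(query_smiles, train_smiles, threshold=0.4):
    """True if no training molecule has Tanimoto similarity >= threshold to the query."""
    query_fp = ecfp4(query_smiles)
    train_fps = [ecfp4(s) for s in train_smiles]
    nearest = max(DataStructs.BulkTanimotoSimilarity(query_fp, train_fps))
    return nearest < threshold
```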
Modern Benchmarks Test Neither Lo nor Hi
I sought a benchmark that captures either generalizability or sensitivity to minor modifications but found none.
Let's plot a histogram showing, for each test molecule, its similarity to the nearest molecule in the training set. I added a red line to mark the novelty threshold. Here's the standard ESOL benchmark, with a canonical random split:
Notice that most molecules in the test set have very similar counterparts in the training set. There's even a peak at 1.0, indicating that four test molecules have exact copies in the train set!
This benchmark fails to test model generalization, as it focuses on very similar molecules. Its metrics are also challenging to interpret for the Lo task. Would you prefer a model with better ESOL metrics for optimization?
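If you want to reproduce this diagnostic for your own split, here is a minimal sketch, reusing the ecfp4 helper from the snippet above and assuming matplotlib:

```python
# A sketch for plotting each test molecule's highest Tanimoto similarity to the
# training set. Reuses the ecfp4 helper defined earlier; matplotlib is assumed.
import matplotlib.pyplot as plt
from rdkit import DataStructs

def plot_nearest_neighbor_similarity(train_smiles, test_smiles, threshold=0.4):
    """Histogram of nearest-neighbor similarities between test and train molecules."""
    train_fps = [ecfp4(s) for s in train_smiles]
    nearest = [
        max(DataStructs.BulkTanimotoSimilarity(ecfp4(s), train_fps))
        for s in test_smiles
    ]
    plt.hist(nearest, bins=50)
    plt.axvline(threshold, color="red")  # the novelty threshold
    plt.xlabel("Nearest-neighbor Tanimoto similarity to the training set")
    plt.ylabel("Number of test molecules")
    plt.show()
```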
One might hope that scaffold splitting would be more effective:
“As we are more interested in discovering new categories of HIV inhibitors, scaffold splitting [...] is recommended.” [source]
Let's examine the well-known HIV dataset, with a canonical scaffold split:
Still, 56% of the test set molecules have very similar counterparts in the training set, making this benchmark inappropriate for the Hi task. It's also unsuitable for selecting models for the Lo task due to its binary labels.
You might consider other datasets, exotic out-of-distribution benchmarks, or activity cliff datasets. I encourage you to check out the paper; I've likely tested them and found them unsuitable for modeling Hi or Lo scenarios (though yes, activity cliffs do resemble Lo).
Hi Split
Since I couldn't find an appropriate benchmark for testing what I wanted, I decided to create my own. How can we implement a Hi split?
A straightforward method is to perform scaffold splitting and then remove from the test set any molecules that are too similar to the training set. Given that our datasets are already quite small, it would be unfortunate to lose more data points, especially when each can cost around $1,000. So, is there a way to minimize the number of removed molecules?
Imagine our dataset as a graph, where each molecule corresponds to a vertex, and two vertices are connected if their respective molecules are very similar. Clearly, entire connected components can be assigned to either the training set or the test set independently.
However, this alone isn't sufficient, because in real-life datasets about 95% of the molecules belong to the same giant connected component.
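To make the graph picture concrete, here is a sketch that builds such a similarity graph with NetworkX, reusing the ecfp4 helper from above; the 0.4 edge threshold is the novelty threshold from earlier:

```python
# A sketch of the graph view of a dataset: vertices are molecules, and edges
# connect pairs with ECFP4 Tanimoto similarity >= 0.4. Assumes NetworkX and
# the ecfp4 helper defined earlier.
import networkx as nx
from rdkit import DataStructs

def similarity_graph(smiles_list, threshold=0.4):
    fps = [ecfp4(s) for s in smiles_list]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(smiles_list)))
    for i in range(len(fps) - 1):
        # Compare molecule i against all later molecules in one call.
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
        for offset, sim in enumerate(sims):
            if sim >= threshold:
                graph.add_edge(i, i + 1 + offset)
    return graph

# Size of the giant connected component:
# max(len(c) for c in nx.connected_components(similarity_graph(smiles_list)))
```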
Our objective is to remove the minimal number of vertices from a connected graph to split it into two parts, while respecting certain constraints on the sizes of these parts. This is because we aim to have a training set comprising about 90% of the dataset and a test set making up the remaining 10%.
Fortunately, this problem has been studied in computer science for the last twenty years: it is known as Minimal Vertex k-Cut and can be solved with linear programming. I'll skip the mathematical details here (they are available in the paper), but I will include the formulation to make this post more impressive; it's worth noting that this is actually a pretty simple and neat technique:
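What follows is a sketch of a standard assignment-style integer-programming formulation for this kind of vertex cut; the exact formulation in the paper may differ in details. The binary variable $x_{v,i}$ equals 1 if vertex $v$ is kept and assigned to part $i$; vertices assigned to no part are removed, so maximizing the number of kept vertices is the same as minimizing the cut:

$$
\begin{aligned}
\max_{x}\; & \sum_{i=1}^{k} \sum_{v \in V} x_{v,i} \\
\text{s.t.}\; & \sum_{i=1}^{k} x_{v,i} \le 1 && \forall v \in V \\
& x_{u,i} + x_{v,j} \le 1 && \forall (u,v) \in E,\; i \ne j \\
& \ell_i \le \sum_{v \in V} x_{v,i} \le u_i && \forall i \in \{1,\dots,k\} \\
& x_{v,i} \in \{0,1\} && \forall v \in V,\; i \in \{1,\dots,k\}
\end{aligned}
$$

The first constraint keeps each vertex in at most one part, the second forbids edges between different parts (so similar molecules cannot be split between train and test), and the third bounds the part sizes, e.g. roughly 90% and 10% of the dataset.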
In practice, the papers on Minimal Vertex k-Cut (and there aren't many) test their algorithms on small graphs, the largest containing about 300 vertices. When I applied the approach to the HIV dataset, with roughly 40,000 vertices, it didn't finish overnight on 16 logical cores. To resolve this, I added Butina clustering as a preprocessing step, so that the algorithm works on a coarsened graph in which each vertex represents several molecules. This reduced the computation time to just a few minutes.
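For flavor, here is a minimal sketch of that preprocessing step with RDKit's Butina implementation, again reusing the ecfp4 helper from above. The 0.6 distance cutoff (distance = 1 - similarity) mirrors the 0.4 similarity threshold; the exact parameters used for the benchmark are in the paper.

```python
# A sketch of the Butina preprocessing step: cluster molecules by fingerprint
# distance so the k-cut runs on a coarsened graph of clusters rather than on
# individual molecules. Assumes RDKit and the ecfp4 helper defined earlier.
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina

def butina_clusters(smiles_list, cutoff=0.6):
    """Return tuples of molecule indices, one tuple per Butina cluster."""
    fps = [ecfp4(s) for s in smiles_list]
    # Flat lower-triangular list of pairwise distances, as ClusterData expects.
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
```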
Lo Split
The Lo split is more straightforward. It creates a test set consisting of clusters of similar molecules, and models are tested on their ability to correctly rank the molecules within each cluster, measured by Spearman correlation.
There are a few technical details that I've omitted here, which you can find in the paper. In brief: I moved exactly one molecule from each cluster to the training set to imitate a known hit. Additionally, the algorithm ensures that the variation of the property within a cluster exceeds experimental noise, making it meaningful to rank the molecules.
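The evaluation itself boils down to averaging per-cluster rank correlations. Here is a minimal sketch assuming scipy; the cluster data layout is an illustration, not the library's API:

```python
# A sketch of the Lo evaluation. `clusters` is assumed to map each test cluster
# to a pair (true_values, predicted_values); this layout is illustrative only.
import numpy as np
from scipy.stats import spearmanr

def lo_score(clusters):
    """Average Spearman correlation between true and predicted values per cluster."""
    correlations = []
    for true_values, predicted_values in clusters.values():
        rho, _ = spearmanr(true_values, predicted_values)
        correlations.append(rho)
    return float(np.mean(correlations))
```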
Results
I prepared seven datasets: four for the Hi task and three for the Lo task. They cover activity against a GPCR, activity against a kinase, a phenotypic HIV screen, and solubility data. For each dataset, I created three folds (which is why the paper solves k-cut rather than 2-cut) to test for variability in the data, and meticulously optimized the hyperparameters of the different models. Here are the results:
Yes, the models performed better than a random baseline (though far from perfect), indicating their ability to generalize to novel molecules and distinguish minor modifications. Interestingly, while Chemprop emerged as the best for Hi tasks, SVM on ECFP4 fingerprints outperformed the others in Lo tasks. This finding is somewhat surprising, and the reason behind it is not clear. It might be due to the limited expressivity of GNNs, but binary fingerprints are limited too. Regardless, this result aligns with previous work on activity cliffs.
If you have questions, refer to the paper. It contains more technical details and additional experiments that I couldn't cover here. Also, if you're looking to replicate this analysis, there's no need to delve into the algorithm for Minimal Vertex k-Cut: I've already implemented it in the Lo-Hi Splitter library, so you can perform both Hi and Lo splits for your datasets with just a couple of lines of Python.
Takeaway
Different models are better suited for different tasks. If your goal is to find novel molecules, select models that generalize well. If your goal is to optimize molecules step-by-step, choose models that can distinguish minor modifications. Use the Lo-Hi Splitter library to do this quickly.
Links
Thread on Twitter: https://twitter.com/ZdarovaAll/status/1712085059073605929
Paper: https://arxiv.org/abs/2310.06399
Lo-Hi Splitter library: https://github.com/SteshinSS/lohi_splitter