Exposing the Limitations of Molecular Machine Learning with Activity Cliffs

Jonathan Hsu · Product @ Valence Labs

Predicting the properties of molecules from their molecular structure is one of the central goals of chemistry. In principle, all properties of a molecule can be determined from its structure. Unfortunately, molecules are quite complicated, to say the least – especially when they interact with other molecules. Hence, machine learning has become instrumental for molecular property prediction. Molecular machine learning holds a lot of potential to speed up drug discovery and will help us answer fundamental questions about the behaviour of molecules. However, before we get there, we still have much to learn about how to build good models.

One important topic that has been left relatively unexplored is how machine learning models behave in the presence of activity cliffs. An activity cliff occurs when a small change in molecular structure results in a drastic change in bioactivity (or another molecular property, see Fig. 1). The term was coined by Gerald Maggiora [1] and refers to sudden changes, or ‘cliffs’, in the structure-activity landscape.

We know that for many drugs, a tiny change in molecular structure can make a huge impact. Knowing which structural changes strongly affect bioactivity can tell you a lot about how a molecule interacts with its designated target (e.g., a specific protein involved in a disease). At the same time, it is well known that activity cliffs are troublesome for machine learning models to predict. For this reason, these extreme scenarios make good test cases for molecular property prediction models. Moreover, highly similar molecules are very common in the commercial libraries used for prospective applications like drug screening. Models that cannot distinguish the effects of small molecular changes are therefore probably not your best option in prospective settings. Still, even though activity cliffs are important in molecular data, we don’t know when, why, and how machine learning models tend to fail in their presence. We therefore set out to illuminate the failure modes of common ‘out-of-the-box’ methods for bioactivity prediction in the presence of activity cliffs.

Defining activity cliffs

One of the first difficulties in quantifying the effects of activity cliffs is defining them in the first place. Because we wanted to sketch a general picture that applies across the board, we combined three types of molecular similarity (substructure, scaffold, and SMILES similarity) into a well-rounded definition of activity cliffs. Details can be found in our paper [2], and a sketch of the three similarity measures is shown below.
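To give a feel for these three similarity measures, here is a minimal sketch using RDKit and python-Levenshtein. The fingerprint settings and helper names are illustrative, not our exact configuration; the precise measures and thresholds we used are in the paper [2].

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
import Levenshtein  # pip install python-Levenshtein

def substructure_similarity(smi_a, smi_b):
    # Tanimoto similarity on extended-connectivity (Morgan) fingerprints
    mol_a, mol_b = Chem.MolFromSmiles(smi_a), Chem.MolFromSmiles(smi_b)
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, radius=2, nBits=1024)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, radius=2, nBits=1024)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def scaffold_similarity(smi_a, smi_b):
    # Tanimoto similarity between the Murcko scaffolds of two molecules
    scaf_a = MurckoScaffold.MurckoScaffoldSmiles(smi_a)
    scaf_b = MurckoScaffold.MurckoScaffoldSmiles(smi_b)
    return substructure_similarity(scaf_a, scaf_b)

def smiles_similarity(smi_a, smi_b):
    # Normalised Levenshtein similarity between the raw SMILES strings
    dist = Levenshtein.distance(smi_a, smi_b)
    return 1.0 - dist / max(len(smi_a), len(smi_b))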

The data

Because we wanted to measure the general effects of activity cliffs on model performance, we composed 30 different datasets from ChEMBL [3], covering a wide range of training scenarios and target proteins. We thoroughly cleaned and curated this data to minimize noise and ‘fake’ activity cliffs. To maintain a similar distribution of molecules between the train and test data, we clustered molecules by structure and split them into train (80%) and test (20%) sets with random stratified splitting on activity cliff status. In other words, we enforced similar proportions of activity cliff compounds in the train and test sets.
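Schematically, such a split could look like the sketch below. This is simplified: it stratifies on activity cliff status only, whereas we additionally clustered molecules by structure first; is_cliff is an assumed precomputed boolean array.

import numpy as np
from sklearn.model_selection import train_test_split

# smiles: list of SMILES strings; y: bioactivity values (array);
# is_cliff: boolean array flagging activity cliff compounds
smiles, y, is_cliff = ..., ..., ...

# Stratify on cliff status so train and test contain
# similar proportions of activity cliff compounds
idx = np.arange(len(smiles))
train_idx, test_idx = train_test_split(
    idx, test_size=0.2, stratify=is_cliff, random_state=42
)
smiles_train = [smiles[i] for i in train_idx]
smiles_test = [smiles[i] for i in test_idx]
y_train, y_test = y[train_idx], y[test_idx]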

Bioactivity predictions with ‘classical’ machine learning methods

Most approaches for bioactivity prediction – especially those used for prospective applications – are built with well-proven algorithms coupled with molecular descriptors representing molecules. We considered some commonly used ‘traditional’ approaches of this kind to estimate the effects of activity cliffs. We also wanted to see if deep learning would be better suited for activity cliffs, so next to the ‘traditional’ methods we evaluated a handful of common deep learning approaches for molecular machine learning, which come in two flavours: 1) deep learning on molecular graphs and 2) deep learning on molecular sequences like SMILES strings.
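As an illustration, a minimal descriptor-based baseline of this kind could look as follows, reusing the train/test split from the sketch above. The fingerprint settings and model choice are illustrative, not the exact configurations we benchmarked.

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def ecfp(smiles, radius=2, n_bits=2048):
    # Encode a molecule as a binary Morgan/ECFP fingerprint vector
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Featurize molecules, then fit a classical regressor on the descriptors
x_train = np.stack([ecfp(s) for s in smiles_train])
x_test = np.stack([ecfp(s) for s in smiles_test])

model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(x_train, y_train)
y_hat = model.predict(x_test)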

To evaluate their performance, we simply calculated the root mean square error (RMSE) on the test set. Because we’re interested in performance on activity cliff molecules in particular, we also calculated the RMSEcliff on the subset of molecules that are considered ‘activity cliff molecules’ according to our definition [2].
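Continuing the sketch above, computing both metrics is straightforward once you have a boolean mask flagging the activity cliff molecules in the test set (cliff_mask below is an assumed input; MoleculeACE, introduced later, can compute it for you):

import numpy as np

def rmse(y_true, y_pred):
    # Root mean square error between true and predicted bioactivities
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# cliff_mask: assumed boolean array marking test molecules that are
# part of at least one activity cliff pair (per the definition in [2])
cliff_mask = ...

rmse_all = rmse(y_test, y_hat)
rmse_cliff = rmse(np.asarray(y_test)[cliff_mask], np.asarray(y_hat)[cliff_mask])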

Looking at the performance of 24 distinct machine learning approaches on activity cliff molecules, a clear pattern emerges. To our surprise, the classical methods (only the best is shown in Fig. 2) outperformed the deep learning approaches that used ‘direct’ molecular representations. Most deep learning methods performed in the same ballpark: graph-based models performed the worst, closely followed by CNNs and transformers operating on molecular strings; only LSTMs seemed to perform decently. When zooming in on activity cliff molecules versus all molecules, all methods showed similar difficulties in predicting bioactivity on activity cliffs. Neither molecular descriptor-, graph-, nor sequence-based approaches managed to bridge the gap between RMSE and RMSEcliff.

Failure modes

In search of an explanation for this behaviour, we tried to pinpoint the failure modes of all 720 traditional and deep learning models; after all, the discrepancy between RMSE and RMSEcliff was highly heterogeneous. As a start, we did not find a relationship between performance on activity cliffs and the number of molecules in the training set. The type of macromolecular target did not impact performance either. However, for some datasets we saw very strong relationships between RMSE and RMSEcliff (e.g., Fig. 3a). After some searching, we found that this relationship depends on the number of molecules in your dataset (Fig. 3c). In other words, for very small datasets, the overall performance of a model tells you little about its performance on activity cliff molecules. In contrast, if your dataset contains ample molecules (say, more than 1000 or 1500), overall performance becomes a good proxy for activity cliff performance.

Because we see activity cliffs as an extreme case of regular molecular property prediction, we think that if your models’ performance on activity cliff molecules approximates their overall performance, your dataset is probably big (and good) enough to learn a proper structure-activity relationship from.

Nevertheless, even with larger datasets, you should expect a substantial performance drop on activity cliff molecules. It is therefore probably a good idea to measure not only the overall performance of your model, but also its performance on activity cliff molecules, especially if you plan to apply your model to prospective tasks.

To do so, we developed MoleculeACE (Activity Cliff Estimation) as a nifty Python tool. It contains all datasets used in this study and allows you to easily measure your models’ performance on activity cliffs (with our definition by default or your own custom definition) using your own or our pre-curated bioactivity data. You can find it at: https://github.com/molML/MoleculeACE. An example is shown below:

from MoleculeACE import calc_rmse, calc_cliff_rmse

# Use your own data
x_train, y_train, smiles_train, x_test, y_test, smiles_test = ...

# Train your own model
model = ...
y_hat = model.predict(x_test)

# Evaluate your model on activity cliff compounds
rmse = calc_rmse(y_test, y_hat)
# Provide the predicted and true test values, the train labels, and the
# train and test molecules; activity cliffs are calculated on the fly
rmse_cliff = calc_cliff_rmse(y_test_pred=y_hat, y_test=y_test, smiles_test=smiles_test, y_train=y_train, smiles_train=smiles_train)

print(f"rmse: {rmse}\nrmse_cliff: {rmse_cliff}")

Takeaways

Activity cliffs are difficult to predict across the spectrum of machine learning approaches. After comparing well over 700 different machine learning models for bioactivity prediction, we found that a model’s performance on activity cliff compounds is dataset-dependent, especially for deep learning methods and in low-data scenarios. Although the overall prediction error often approximates the performance on activity cliffs (especially for larger datasets), ‘islands’ of poor activity cliff performance exist independent of the chosen model and approach. These results highlight the importance of evaluating machine learning models on activity cliffs alongside regular performance metrics.

While deep learning on molecules has shown its potential, it is very clear that in realistic use cases, feature engineering still wins. We have some work to do: on the learning side, but probably even more on the molecular representation and data side.


Author Bio

Derek van Tilborg is a PhD student in the molecular machine learning team of Francesca Grisoni at the Eindhoven University of Technology (Chemical Biology, Dept. Biomedical Engineering). Having an academic background in both biomedical sciences and bioinformatics, he has developed a passion for artificial intelligence in the realm of drug discovery. Derek works on improving how machine learning approaches are applied to drug screening data to bridge the gap between computational methods and preclinical experiments. His research interests are focussed on graph neural networks and active learning.


References

[1] Maggiora, G. M. On outliers and activity cliffs – why QSAR often disappoints. J. Chem. Inf. Model. 46, 1535 (2006).

[2] van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. J. Chem. Inf. Model. (2022) doi:10.1021/acs.jcim.2c01073.

[3] Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).