
A Tutorial on Encoding and Generating Small Molecules with COATI

COATI: What, Why, How?

On the tArray platform, small molecules are immobilized onto nanoscale beads and used to measure binding affinity.

At Terray Therapeutics, we use our experimental drug discovery platform to collect hundreds of millions of high-quality target-ligand binding affinity measurements every month. We do this by designing libraries of millions of molecules every week, using combinatorial chemistry to synthesize barcoded libraries of those small molecules on nanoscale beads that are then immobilized in a silicon chip. Each chip is the size of a nickel and holds 32 million individual beads. The identity of the molecule on each bead is read via its barcode, and then the barcodes are removed. Finally, these spatially encoded, ultra-dense microarray chips are screened against a therapeutic target of interest, providing 32 million individual, quantitative measurements of binding between a small molecule and that target - in less than 4 minutes!

Over the past 12 months, we’ve collected over 50 billion individual measurements, which map to over 2 billion unique target-ligand binding affinity values (each value is the average of roughly 25 measurements). This database, and the iterative cycle that grows it, is an incredible resource for training machine learning models to predict target-ligand interactions. Naturally, this data is useful for identifying hits as well as creating Structure-Activity Relationship (SAR) maps for hit-to-lead optimization. From a machine learning perspective, having access to 2 billion unique target-ligand interactions (and counting) allows us to train powerful models that can predict whether virtual molecules are likely to interact with our targets of interest.

We explore the "activity surface" of a therapeutic target using a combination of platform data and generative design.

Our goal is to dramatically improve the cost, time, and success rate of small molecule drug development by combining this rich set of data with generative design methods - i.e., automatically suggesting molecules that have a high probability of binding to our targets of interest, in addition to possessing drug-like physicochemical properties. Through our platform, we gather information about which chemical properties and motifs result in potent molecules, and we want to be able both to provide actionable insights about proposed compounds and to propose new compounds on the basis of our data.

There is a long list of methods that take in molecular structures and produce new “ideas” - at a very simple level, you can apply reaction-based SMARTS transforms to generate useful suggestions for medicinal chemists. We want to approach this problem in a far more systematic way - leveraging decades of research into numerical optimization methods that traverse a vector space to find optimal solutions. In this way, we frame molecular design as a classic engineering design problem - given these constraints (and models), find a solution that optimizes our objective function. To do that, however, we need a method that can encode molecules into vectorized representations and, most importantly, decode those representations back into chemical structures that can be made in the lab. Our goal is to produce practical suggestions for medicinal chemists - so it’s very important that these methods generate molecules that actually help them in their search for better compounds.

COATI provides both a molecule-to-vector encoding as well as decoding ability.

We’d also like this representation to handle multiple ways of representing a molecule - for example, a SMILES string, or a 3D conformation describing the relative positions and identities of its atoms. This provides useful information to the encoder, and enables more forward-thinking methods that leverage structural information.

Introducing COATI

An actual coati! Source: National Geographic

To address these needs, we created COATI (Contrastive Optimization for Accelerated Therapeutic Inference) - a multimodal model inspired by recent research into joint image-language representation. We fit this model using the training setup shown below - leveraging over 100 million conformer-SMILES string pairs to learn a multimodal chemical encoder-decoder with contrastive and autoregressive losses. We use a “capping” procedure inspired by ClipCap to “inject” representations into the transformer, in order to train it to decode to SMILES strings. We find that this representation performs competitively with other learned representations on a variety of ADME prediction tasks from TDC, and also performs well on a small fraction of the binding affinity data we’ve generated at Terray. Check out our paper (preprint here) if you’re interested in more details about the model and how we trained it.

A schematic of COATI's training - autoregressive and contrastive losses are jointly minimized.
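To make the objective concrete, here’s a toy sketch of how the two losses combine - this is illustrative only (random tensors stand in for the real E(3)-GNN and transformer outputs), not our actual training code:

import torch
import torch.nn.functional as F

batch, dim, vocab, seq_len = 8, 256, 64, 32

# Stand-ins for the conformer and SMILES embeddings produced by the two encoders.
conformer_emb = F.normalize(torch.randn(batch, dim), dim=-1)
smiles_emb = F.normalize(torch.randn(batch, dim), dim=-1)

# CLIP-style contrastive loss: matching conformer/SMILES pairs sit on the diagonal.
logits = conformer_emb @ smiles_emb.T / 0.07  # 0.07 is a common temperature default
targets = torch.arange(batch)
contrastive_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Autoregressive loss: next-token cross-entropy over the SMILES token sequence.
token_logits = torch.randn(batch, seq_len, vocab)  # stand-in for decoder outputs
token_targets = torch.randint(vocab, (batch, seq_len))
autoregressive_loss = F.cross_entropy(token_logits.reshape(-1, vocab), token_targets.reshape(-1))

# The two losses are jointly minimized during training.
loss = contrastive_loss + autoregressive_loss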

We’ve also open-sourced some pre-trained models, so you can use them right now to encode, store, and decode molecules! This tutorial is also available as a Jupyter notebook here. You’ll need to install the COATI code from GitHub here - simply clone the repo and run pip install . from the root directory.

Molecular Encoding

In order to use a COATI model, you’ll first need to load a pre-trained model checkpoint. We provide a few of these on our public S3 bucket, as well as a helper function to handle the construction of the model and the tokenizer:

import torch
from coati.models.io.coati import load_e3gnn_smiles_clip_e2e

# Model parameters are pulled from the URL and stored in a local models/ dir.
encoder, tokenizer = load_e3gnn_smiles_clip_e2e(
    freeze=True,
    device=torch.device("cuda:0"),
    # model parameters to load.
    doc_url="s3://terray-public/models/barlow_closed.pkl",
)

This loads a COATI encoder along with a tokenizer - in much the same way that a Large Language Model (LLM) converts sequences of words or characters into numerical tokens, we tokenize the sequence of characters in a SMILES string. Of course, the tokenizer isn’t needed if you’re encoding a conformer - but for this tutorial we’ll focus on encoding SMILES strings.
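If you’re curious what those tokens look like, you can call the tokenizer directly. Here’s a minimal peek - note that the tokenize_text method and the [SMILES]/[STOP] special tokens are what embed_smiles uses internally at the time of writing, but check the repo if your version’s API differs:

# Tokenize a small SMILES string (ethanol). The special tokens mark the start
# and end of the sequence; pad=False skips padding to a fixed length.
# (This assumes the TrieTokenizer's tokenize_text method - see embed_smiles
# for the canonical usage.)
token_ids = tokenizer.tokenize_text("[SMILES]CCO[STOP]", pad=False)
print(token_ids)  # a list of integer token ids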

To encode a single SMILES string (in this case, olutasidenib) into a vector, simply run:

from coati.generative.coati_purifications import embed_smiles
from rdkit import Chem

smiles = Chem.CanonSmiles("C[C@H](NC1=CC=C(C#N)N(C)C1=O)C1=CC2=C(NC1=O)C=CC(Cl)=C2")
vector = embed_smiles(smiles, encoder, tokenizer)

This produces a 256-dimensional vector encoding the identity of the input molecule. In practice, you’ll probably be working with batches of molecules, so we’ve included another helper function that handles encoding a list of SMILES strings:

from coati.generative.coati_purifications import embed_smiles_batch 

smiles_batch = [smiles, Chem.CanonSmiles("O=C(C)Oc1ccccc1C(=O)O")]
vecs = embed_smiles_batch(smiles_batch, encoder, tokenizer)

So, what can you do with these embeddings? A useful property of COATI embeddings is that they’re decodable - you can feed them back into the model and (usually) get the same molecule. Decoding is stochastic, and the more complicated the molecule, the more difficult it is to decode. For typical drug-like molecules, however, it works really well! 

reconstructed_smiles = encoder.hclip_to_2d_batch(
    h_clip=vector.unsqueeze(0),
    tokenizer=tokenizer,
    noise_scale=0.0,
)

COATI models autoencode well in drug-like chemical space - this is olutasidenib after encoding and decoding.
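As a quick sanity check on the round trip, you can canonicalize the decoded string and compare it to the input - decoding is stochastic, so a match isn’t guaranteed, but it usually holds for drug-like molecules:

# hclip_to_2d_batch returns a list of decoded SMILES strings; canonicalize
# the first one and compare it to the (already canonical) input.
decoded = Chem.CanonSmiles(reconstructed_smiles[0])
print(decoded == smiles)  # usually True for drug-like inputs like this one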

As you might have guessed, these embeddings are also useful for predicting molecular properties. We like using the DUE regressor from this paper, because it provides useful uncertainty quantification - you can check out our more detailed example notebook here. However, you can use any model you want. Here’s a random forest from sklearn, trained to predict the quantitative estimate of drug-likeness (QED) computed by RDKit:

# prepare ChEMBL data
from coati.common.s3 import cache_read
from rdkit.Chem import QED
import random
import pickle

# Load the ChEMBL dataset SMILES strings.
with cache_read("s3://terray-public/datasets/chembl_canonical_smiles.pkl", "rb") as f:
    chembl_canonical_smiles = pickle.loads(f.read(), encoding="UTF-8")

# For this example, we use a small random subset of the data.
random.seed(42)
random.shuffle(chembl_canonical_smiles)
chembl_subset = [
    {"smiles": s, "source": "chembl_mols", "qed": QED.qed(Chem.MolFromSmiles(s))}
    for s in chembl_canonical_smiles[:10_000]
]

Now that we’ve prepared the data, just add COATI embeddings:

# add embeddings! 
from coati.common.util import batch_indexable

for batch in batch_indexable(chembl_subset, 1024):
    batch_smi = [x["smiles"] for x in batch]
    batch_vecs = embed_smiles_batch(batch_smi, encoder, tokenizer)
    for i, entry in enumerate(batch):
        entry["coati_embed"] = batch_vecs[i, :].cpu().numpy()

And train a random forest!

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import numpy as np

train_entries, test_entries = train_test_split(chembl_subset, test_size=0.2, random_state=42)

# train a simple model to predict QED from the embeddings!
train_x = np.vstack([x["coati_embed"] for x in train_entries])
train_y = np.array([x["qed"] for x in train_entries])

test_x = np.vstack([x["coati_embed"] for x in test_entries])
test_y = np.array([x["qed"] for x in test_entries])

model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(train_x, train_y)

preds = model.predict(test_x)
print(f"Test R^2: {r2_score(test_y, preds):.2f}")

We get an R^2 of around 0.77 on this toy problem. Not bad! In practice, we find that models trained using COATI embeddings are competitive with or superior to models trained on traditional cheminformatics fingerprints. Give it a try with other datasets!
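For comparison, you can train the same random forest on a traditional fingerprint - here’s a minimal baseline using RDKit Morgan fingerprints (radius 2, 2048 bits) on the same split. Your exact numbers will vary with the random subset:

from rdkit.Chem import AllChem

def morgan_fp(smi):
    # 2048-bit Morgan fingerprint with radius 2 (ECFP4-like).
    mol = Chem.MolFromSmiles(smi)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))

fp_train_x = np.vstack([morgan_fp(x["smiles"]) for x in train_entries])
fp_test_x = np.vstack([morgan_fp(x["smiles"]) for x in test_entries])

fp_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
fp_model.fit(fp_train_x, train_y)
print(f"Fingerprint baseline R^2: {r2_score(test_y, fp_model.predict(fp_test_x)):.2f}")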

Generating New Molecules

Encoding, decoding, and regressing are all useful tasks, but what we really want to do here is generate new ideas rather than relying on existing molecules. Here’s a simple walkthrough of how to generate molecules that are “nearby” a reference point in chemical space. We’ve also illustrated another, more complex method that uses COATI to “navigate” an SAR surface using a metadynamics-inspired optimization approach - you can check that out here! For now, though, let’s keep it simple.

We take the input encoding we made earlier and simply add normally distributed noise to it - this “fuzzes” the vector, moving it somewhere else in chemical space. As long as we don’t move too far away, the chemical space manifold learned by COATI decodes the perturbed vector into molecules that are closely related to the input:

num_attempts = 30
# noise is added as an isotropic Gaussian with std=noise_scale.
nearby_smiles = encoder.hclip_to_2d_batch(
    h_clip=vector.unsqueeze(0).repeat(num_attempts, 1),
    tokenizer=tokenizer,
    noise_scale=0.2,
)
unique_valid_smiles = {smi for smi in nearby_smiles if Chem.MolFromSmiles(smi)}

Generative samples around the vector encoding for olutasidenib, with noise_scale = 0.2.
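If you want to render the samples yourself, RDKit’s grid drawing works directly on the decoded strings (in a notebook, the image displays inline):

from rdkit.Chem import Draw

# Render the unique, valid samples in a grid.
mols = [Chem.MolFromSmiles(smi) for smi in sorted(unique_valid_smiles)]
Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(250, 250))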

We can adjust this noise threshold to get molecules that are “farther away” from the input - you’ll also note that we get more unique results. Set noise_scale equal to 0.4, and you get the molecules below. Note that output SMILES strings are not guaranteed to be valid - they typically are, but as you ramp up the noise, it becomes likelier that you produce something that doesn’t decode to a valid SMILES.

Generative samples around the vector encoding for olutasidenib, with noise_scale = 0.4. Note the additional diversity and more extreme changes.
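To see this tradeoff directly, you can sweep noise_scale and track how many decodes are valid and unique - a quick sketch using the same decoding call as above:

# Sweep the noise scale and report validity/uniqueness of the decoded SMILES.
for noise_scale in [0.1, 0.2, 0.4, 0.6]:
    samples = encoder.hclip_to_2d_batch(
        h_clip=vector.unsqueeze(0).repeat(num_attempts, 1),
        tokenizer=tokenizer,
        noise_scale=noise_scale,
    )
    valid = [smi for smi in samples if Chem.MolFromSmiles(smi)]
    print(f"noise_scale={noise_scale}: {len(valid)}/{len(samples)} valid, {len(set(valid))} unique")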

If you want to learn more about how to use the metadynamics-inspired optimization in practice, check out the examples on our GitHub! We provide an example of using it with a trained regressor that predicts binding to carbonic anhydrase - the optimization “navigates” the response surface defined by the regressor (which is, in turn, defined by our experimental observations) to produce the COATI vectors of molecules that are likely to bind.

The Future 

We’re excited to release our COATI into the wild, and to see what the community will be able to use it for. We hope that this tutorial will help interested users get started! We’re actively working on improving these models with better encoders and more data, and using them to progress molecules towards the clinic.
