Machine learning has been shown to accelerate the traditionally time-consuming drug discovery process across the entire pipeline, from target discovery and early drug discovery through clinical trials and manufacturing, all the way to post-market surveillance.
There are a large number of tasks that can be tackled by machine learning but are not yet on the radar of the machine learning community. Where can we find machine learning-ready drug discovery tasks and datasets that are under-explored but meaningful?
In this blog post, we will introduce an open-source project called Therapeutics Data Commons (TDC), a machine learning infrastructure for therapeutic development. It is something I co-created with amazing collaborators such as Tianfan Fu, Wenhao Gao, and Marinka Zitnik. In the next 10 minutes, you will learn how to access more than 50 meaningful ML-ready datasets across the entire drug discovery pipeline using a few lines of code!
How to access datasets in TDC
TDC covers a wide range of therapeutic tasks with varying data structures. Thus, we organize it into a three-layer hierarchy. At the first layer, we broadly divide the tasks into three distinct machine learning problems:
Single-instance prediction 'single_pred': Prediction of a property given an individual biomedical entity.
Multi-instance prediction 'multi_pred': Prediction of a property given multiple biomedical entities.
Generation 'generation': Generation of a new biomedical entity.
The second layer is the task. Each therapeutic task falls into one of the machine learning problems above. We create a data loader class for every task, which inherits from the base problem data loader.
The last layer is the dataset; each task consists of many datasets. As most datasets within a task share the same data structure, the dataset name is used as a function input to the task data loader.
Given this hierarchy, all datasets in TDC can be accessed with the following syntax.
Suppose a dataset X is from therapeutic task Y with machine learning problem Z. To obtain the data and splits, simply type:
from tdc.Z import Y
data = Y(name = 'X')
split = data.get_split()
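For instance, to instantiate this template for the ADME task (a single-instance prediction problem) with the Caco2_Wang dataset, the calls look like this (a minimal concrete example of the pattern; any other dataset name listed in the TDC documentation works the same way):
from tdc.single_pred import ADME
data = ADME(name = 'Caco2_Wang')
split = data.get_split()  # dictionary with 'train', 'valid', and 'test' DataFrames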
TDC datasets for small molecule activity modeling
The activity of a small-molecule drug is measured by its binding affinity with the target protein. Given a new target protein, the very first step is to screen a set of potential compounds and measure their activities. Traditionally, affinities are gauged through high-throughput screening (HTS) wet-lab experiments. However, these experiments are very expensive and are thus restricted in their ability to search over a large set of candidates.
Predicting drug-target interactions
The drug-target interaction (DTI) prediction task aims to predict the interaction activity score in silico, given only the accessible compound/protein structural information. Machine learning models that accurately predict affinities can not only save pharmaceutical research costs by reducing the amount of high-throughput screening but also enlarge the search space and avoid missing potential candidates.
TDC includes several DTI datasets, including the largest one, BindingDB. Note that BindingDB is a collection of many assays. Since different assays use different units, TDC separates them into distinct datasets. Specifically, it has four datasets with Kd, IC50, EC50, and Ki as the units. For the sake of this tutorial, we load Kd here (although IC50 has a much larger number of data points, ~1 million):
from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd', print_stats = True)
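Once loaded, the dataset can be split for model development. A minimal sketch; the columns follow the usual TDC convention of pairing a drug SMILES string and a protein amino-acid sequence with an affinity label Y:
split = data.get_split()
train = split['train']
# inspect the paired compound/protein representations and the affinity label
print(train.columns)
print(train.head(1))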
Generate novel molecules
As the entire chemical space is far too large to screen for each target, HTS is restricted to an existing molecule library, and many novel drug candidates are therefore missed. Machine learning models that can generate novel molecules obeying pre-defined optimal properties circumvent this problem and yield new classes of candidates. In molecule generation, a machine learning model first learns molecular characteristics from a large set of molecules, and the generated molecules are evaluated through oracles. From the learned distribution, we can then obtain novel candidates. Typical molecule generation therefore requires an oracle function, and TDC provides a diverse range of oracles. We will delve into them another time, but for now, here is an example of an oracle that scores molecules by their predicted activity against the GSK3B target:
from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CCOC1=CC(=C(C=C1C=CC(=O)O)Br)OCC', 'CC(=O)OC1=CC=CC=C1C(=O)O'])
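The oracle simply returns one score per input SMILES, so it can be dropped into any screening or generation loop. A minimal sketch that ranks a small, purely illustrative candidate list (aspirin and caffeine) by the predicted GSK3B score:
# hypothetical candidate list for illustration only
candidates = ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C']
scores = oracle(candidates)
ranked = sorted(zip(candidates, scores), key = lambda x: x[1], reverse = True)
print(ranked[0])  # highest-scoring candidate and its score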
Predict drug responses
The previous datasets assume a one-drug-fits-all-patients paradigm, whereas, in reality, different patients respond differently to the same drug, especially in oncology, where patient genomics is a deciding factor in a drug's effectiveness. The number of combinations of available drugs and cell-line genomic profiles is very large, and testing each combination in the wet lab is prohibitively expensive. A machine learning model that can accurately predict a drug's response across various cell lines in silico can thus make the combination search feasible and significantly reduce the experimental burden. The fast prediction speed also allows us to screen a large set of drugs and avoid missing potentially potent ones.
In TDC, we include the Genomics of Drug Sensitivity in Cancer (GDSC) dataset, which measures drug response in various cancer cell lines. The dataset also includes the SMILES string for each drug and the gene expression profile for each cell line. There are two versions of GDSC, where GDSC2 uses improved experimental procedures. To access the data:
from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC2')
data.get_data().head(2)
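As with other multi-instance datasets, ready-made splits are available for training a response model. A minimal sketch; each row pairs a drug with a cell-line genomics profile and a response label, following TDC's multi_pred format:
split = data.get_split()
train = split['train']
print(train.shape)
print(train.columns)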
Another important trend is drug combinations (cocktails). Drug combinations can achieve synergistic effects and improve treatment outcomes. In TDC, we include one drug synergy dataset, OncoPolyPharmacology, which contains experimental results of drug-pair combination responses on various cancer cell lines. You can obtain it via:
from tdc.multi_pred import DrugSyn
data = DrugSyn(name = 'OncoPolyPharmacology')
data.get_data().head(2)
TDC datasets for optimizing small molecule efficacy and safety
After a compound is found to have high affinity to the disease target, it needs to be optimized significantly to have desirable properties. A small-molecule drug is a chemical: it needs to travel from the site of administration (e.g., oral) to the site of action (e.g., a tissue), then decompose and exit the body. To do that safely and efficaciously, the chemical is required to have numerous ideal absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. A poor ADMET profile is the most prominent reason for failure in clinical trials. Thus, early and accurate ADMET profiling during the discovery stage is a necessary condition for the successful development of a small-molecule candidate. This task aims to predict various kinds of ADMET properties accurately given a drug candidate's structural information. TDC curates a large number of ADMET tasks from various databases, and they are all useful endpoints that medicinal chemists care a lot about. You can find more information here and here.
As an example, to access each dataset, you can simply call:
from tdc.single_pred import ADME
data = ADME(name = 'Pgb_Broccatelli')
data.get_data().head(2)
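For ADMET property prediction it is common to evaluate on a scaffold split, so that test molecules come from chemotypes unseen during training rather than a random holdout. TDC supports this directly; a minimal sketch:
# scaffold split groups molecules by their core scaffold for a harder generalization test
split = data.get_split(method = 'scaffold')
print(len(split['train']), len(split['valid']), len(split['test']))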
In addition to individual efficacy and safety, drugs can clash with each other and cause adverse effects, i.e., drug-drug interactions (DDIs). This becomes more and more important as more people take combinations of drugs for various diseases, and it is impossible to screen all combinations in the wet lab, especially higher-order ones. In TDC, we include the DrugBank and TWOSIDES datasets for DDI. To access the DrugBank dataset, simply type:
from tdc.multi_pred import DDI
data = DDI(name = 'DrugBank')
data.get_data().head(2)
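The polypharmacy side-effect dataset TWOSIDES mentioned above is served by the same DDI loader; a minimal sketch:
data_twosides = DDI(name = 'TWOSIDES')
data_twosides.get_data().head(2)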
TDC datasets for small molecule manufacturing
After discovering a potential drug candidate, a big portion of drug development is manufacturing, that is, how to make the drug candidate from basic reactants and catalysts.
TDC currently includes four tasks in this stage. The first is reaction outcome prediction, where one wants to predict the reaction outcome given the reactants. Predicting the products of a chemical reaction is a fundamental problem in organic chemistry, and it is quite challenging for many complex organic reactions. Conventional empirical methods that rely on experimentation require intensive manual work by an experienced chemist and are time-consuming and expensive. Reaction outcome prediction aims to automate the process. TDC parses the full USPTO dataset and obtains 1,939,253 reactions. You can load the data via:
from tdc.generation import Reaction
data = Reaction(name = 'USPTO')
data.get_data().head(2)
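Like the other generation datasets, USPTO comes with ready-made splits for model development. A minimal sketch; each entry maps reactant SMILES to product SMILES, following the TDC generation format:
split = data.get_split()
print(split['train'].head(1))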
Retrosynthesis
In addition to forward synthesis, a realistic scenario is one where you have the product and want to know which reactants can generate it. This is called retrosynthesis. Retrosynthesis planning helps chemists design synthetic routes to target molecules, and computational retrosynthetic analysis tools can greatly assist chemists in designing routes to novel molecules, while machine learning-based methods can significantly save time and cost. Using the same USPTO dataset above and flipping the input and output, we obtain the retrosynthesis dataset. A popular smaller dataset is USPTO-50K, which is widely used in the ML community. TDC also includes it:
from tdc.generation import RetroSyn
data = RetroSyn(name = 'USPTO-50K')
data.get_data().head(2)
Predicting catalyst type
In addition to reaction prediction, it is also important to predict the reaction conditions. One such condition is the catalyst, which increases the rate of a chemical reaction. Conventionally, chemists design and synthesize catalysts by trial and error guided by chemical intuition, which is usually time-consuming and costly. Machine learning models can automate and accelerate the process, help us understand the catalytic mechanism, and provide insight into novel catalyst design. Given the reactants and products, we want to predict the catalyst type. Here is how to access the catalyst prediction dataset in TDC:
from tdc.multi_pred import Catalyst
data = Catalyst(name = 'USPTO-Catalyst')
data.get_data().head(2)
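Catalyst prediction is a multi-class classification problem over reactant-product pairs. A quick way to inspect the class balance with pandas, assuming the label column follows TDC's usual 'Y' naming:
df = data.get_data()
# count how many reactions use each catalyst type
print(df['Y'].value_counts().head(10))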
Predicting reaction yield
Another important factor in drug manufacturing is yield. Many factors during a reaction can lead to a suboptimal reactant-to-product conversion rate, i.e., yield. To maximize the synthesis efficiency of target products, an accurate prediction of reaction yield can help chemists plan ahead and switch to alternative reaction routes, thereby avoiding hours and materials invested in wet-lab experiments and reducing the number of attempts. TDC includes two yield datasets. One is mined from USPTO, but recent research from Schwaller et al. argues that the USPTO yield data is a bit too noisy. We thus also include another dataset used in Schwaller et al., Buchwald-Hartwig. You can obtain it via:
from tdc.single_pred import Yields
data = Yields(name = 'Buchwald-Hartwig')
data.get_data().head(2)
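Yield prediction is a regression task, and TDC ships metric evaluators so the same evaluation code can be reused across datasets. A minimal sketch using mean absolute error on placeholder values standing in for held-out yields and model predictions:
from tdc import Evaluator
evaluator = Evaluator(name = 'MAE')
# y_true and y_pred are placeholders, not real model output
y_true, y_pred = [0.42, 0.77, 0.13], [0.40, 0.70, 0.20]
print(evaluator(y_true, y_pred))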
TDC datasets for biologics activity modeling
While small-molecule drugs are effective in treating many diseases, there is an increasing trend toward developing biologics, including gene therapies, cell therapies, antibodies, and so on. How ML can facilitate their development is relatively under-explored. In TDC, we also include a handful of important tasks that can help biologics discovery.
Immunotherapy is an important paradigm of therapeutics. It has gained a lot of interest in recent years because of its promise in treating various cancers with fewer side effects than small-molecule compounds. One big part of immunotherapy is monoclonal antibody therapy. An antibody binds to an antigen, and together they serve as a target marker for the human immune system to attack the marked cells/proteins. TDC provides a task to predict the affinity between an antibody and an antigen:
from tdc.multi_pred import AntibodyAff
data = AntibodyAff(name = 'Protein_SAbDab')
data.get_data().head(2)
Predicting TCR-epitope affinity
A more specific antibody-antigen affinity is TCR-epitope affinity. T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation, and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificities is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence. An accurate model can help design TCR receptors for effective immunotherapy. It can also unlock a patient's TCR repertoire, which reflects their immune history and could inform about past and current infectious diseases, vaccine effectiveness, or autoimmune reactions. TDC has a TCR-epitope prediction dataset, which you can readily access via:
from tdc.multi_pred import TCREpitopeBinding
data = TCREpitopeBinding(name = 'weber', path = './data')
split = data.get_split()
Predicting MHC-peptide binding affinity
Similar to the mechanism of monoclonal antibody therapy, the major histocompatibility complex (MHC) can bind to peptides and display them at the cell surface, where the human immune system (T-cells) can recognize and eliminate the marked cells. There are several classes of MHC (MHC I, II, and III) owing to structural differences. Thus, it is important to predict MHC-peptide binding affinity. TDC provides two datasets for this. You can obtain them via:
from tdc.multi_pred import PeptideMHC
data = PeptideMHC(name = 'MHC1_IEDB-IMGT_Nielsen')
data.get_data().head(2)
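The second (MHC class II) dataset is loaded the same way by swapping the name argument; the name below is the one listed in the TDC documentation at the time of writing, so double-check it against the docs:
data_mhc2 = PeptideMHC(name = 'MHC2_IEDB_Jensen')
data_mhc2.get_data().head(2)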
In collaboration with Graphein, you can also obtain 3D protein representations in TDC with:
from tdc.single_pred import Develop
data = Develop(name = 'SAbDab_Chen')
split = data.get_split()
gr = data.graphin(graph = 'distance', node_feature = ['amino_acid_one_hot'], distance_threshold = 6)
Predicting gene editing outcomes
Gene editing offers a powerful new avenue of research for tackling intractable illnesses that are infeasible to treat using conventional approaches. However, since many human genetic variants associated with disease arise from insertions and deletions, it is critical to be able to better predict gene editing outcomes to ensure efficacy and avoid unwanted pathogenic mutations.
CRISPR-Cas9 is a gene editing technology that allows targeted deletion or modification of specific regions of the DNA within an organism. This is achieved by designing a guide RNA sequence that binds upstream of the target site which is then cleaved through a Cas9-mediated double-stranded DNA break. The cell responds by employing DNA repair mechanisms (such as non-homologous end joining) that result in heterogeneous outcomes including gene insertion or deletion mutations (indels) of varying lengths and frequencies. This task aims to predict the repair outcome given a DNA sequence:
from tdc.utils import retrieve_label_name_list
from tdc.single_pred import CRISPROutcome
label_list = retrieve_label_name_list('Leenay')
data = CRISPROutcome(name = 'Leenay', label_name = label_list[0])
split = data.get_split()
What's next?
We hope this was a helpful overview of what's possible with TDC! In our upcoming blog posts, we'll be providing a similar tutorial focused on predicting molecular properties and generating novel, synthesizable compounds using TDC.
Author Bio
Kexin is a 2nd-year CS PhD student at Stanford, advised by Prof. Jure Leskovec. His research focuses on algorithmic challenges arising from machine learning adoption in biomedicine.
Previously, he worked with Prof. Marinka Zitnik, Dr. Cao Xiao, and Prof. Jimeng Sun. He has spent time researching at Pfizer, IQVIA, Dana-Farber, Flatiron Health, and Rockefeller University. He did his undergrad at NYU in Math, CS, and Studio Art, and his Master's at Harvard in Health Data Science.