Andres M Bran
 · PhD Student - EPFL

Transformers and Large Language Models for Chemistry and Drug Discovery

Drug development is a complex and time-consuming process with multiple stages, including molecular design, screening, and various phases of testing. A key bottleneck is chemical synthesis, which remains costly and inefficient. This blog post delves into recent advances in methods inspired by natural language processing (NLP) that aim to accelerate this critical stage, and with it drug discovery as a whole. We will explore retrosynthesis and property prediction, as well as the emergence of autonomous chemistry agents. Join us as we unravel the key insights that enabled researchers to harness NLP algorithms and apply them effectively in the realm of chemistry!

Transformers and NLP

The Transformer architecture has been a game-changer in the field of NLP. It now powers all sorts of language modeling tasks, such as summarization and translation. Its main components are self-attention layers, which excel at learning how different parts of a sequence relate to each other. In short, Transformers are great at connecting information within a tokenized sequence.
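For the curious, here's a minimal NumPy sketch of that core operation, (single-head) scaled dot-product self-attention. The dimensions and random weights are illustrative stand-ins, not any particular model's:

```python
import numpy as np

# Toy single-head scaled dot-product self-attention. The random weight
# matrices are purely illustrative; a trained Transformer learns them.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                        # 5 tokens, 16-dim embeddings
x = rng.normal(size=(seq_len, d_model))         # embedded input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v             # queries, keys, values
scores = q @ k.T / np.sqrt(d_model)             # pairwise token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
out = weights @ v                               # each token aggregates context
print(weights.round(2))                         # row i: where token i attends
```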

Now, what makes the Transformer powerful is how adaptable it is: once data is turned into a sequence of tokens, it can be applied to plenty of scientific tasks. Bioinformatics is a prime example. DNA and proteins are naturally represented as sequences of nucleotides and amino acids, and AlphaFold2, which accurately predicts the 3D structure of proteins, has solved a major problem in biochemistry and become one of the hottest tools in biology right now [1, 8].

As we will discuss further, this transformative approach is now being applied to the field of organic chemistry, opening up exciting new possibilities for research and discovery.

Modeling the language of Organic Chemistry

The evolution of NLP methods in chemistry follows a fascinating path composed of three phases (see Figure 1), beginning with analogies between chemical language and natural language. This comparison enabled the creation of Transformer models capable of reading and generating molecular structures, useful for tasks like retrosynthesis and property prediction.

As the field evolved, additional forms of data were incorporated into models, unlocking unprecedented possibilities. One group trained a model to elucidate molecular structures from IR spectra, while another trained a model to generate molecules from linguistic descriptions. Today, chemists are not just drawing inspiration from NLP methods; they are also using them as a direct resource for building language-based applications in chemistry.

Figure 1. Advances in natural language processing have inspired applications in chemistry. Over time, the gap between chemical language and natural language is being closed by including additional modalities. The most recent works present general task solvers for chemistry, capable of chemical reasoning and automated synthesis, among other tasks.

Phase 1: Molecular Transformers

Turning chemical tasks into text sequences, along with the rise of open datasets and benchmarks, allowed chemists to train Transformers that take molecules as input and return molecules as output: Molecular Transformers. This marked a revolution in the field, which began by tackling key chemical challenges crucial to drug development, like predicting reaction products and retrosynthetic planning.

However, adapting these methods wasn't straightforward: the diverse structural motifs in organic molecules make them hard to view as sequences. The way around this was to use linear string representations of molecules and reactions, such as SMILES and SELFIES, along with field-specific tokenization methods [10].

Figure 2. Chemists adapted concepts from NLP in field-specific ways, e.g. for atomic tokenization.
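To make this concrete, here's a minimal sketch of atom-level SMILES tokenization. The regular expression follows the one popularized by the Molecular Transformer work [10]; the example molecule is DEET, an insect repellent that reappears later in this post. Tokenized reactant and product strings like this form the source/target pairs that such sequence-to-sequence models train on.

```python
import re

# Atom-level SMILES tokenization, using the regex popularized by the
# Molecular Transformer [10]. Multi-character atoms (Cl, Br), bracketed
# atoms like [nH] or [O-], and ring-closure digits become single tokens.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_REGEX.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization lost characters"
    return tokens

print(tokenize_smiles("CCN(CC)C(=O)c1cccc(C)c1"))  # DEET
# ['C', 'C', 'N', '(', 'C', 'C', ')', 'C', '(', '=', 'O', ')',
#  'c', '1', 'c', 'c', 'c', 'c', '(', 'C', ')', 'c', '1']
```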

These insights set the stage for the first uses of Transformers in chemistry. They led to powerful retrosynthesis systems [11, 13], and new ways to explore the chemical space [9, 12].

Phase 2: Articulating chemical language and other modalities

Although powerful, Molecular Transformers ignore a lot of information that lies beyond molecular representations. Chemical reactions also involve many other types of data, or modalities, including spectra from analytical techniques and linguistic descriptions that provide details and explanations of molecular processes.

Researchers have explored these connections by training Transformers on specific types and combinations of data. For example, they have used these models to elucidate molecular structures from IR [2] and NMR [3] spectra. Other key applications include predicting experimental steps from reaction SMILES [14], and molecular captioning, where the aim is to describe a molecule in words.
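For a flavor of the reaction-to-procedure task, here's an illustrative input/output pair. The reaction (an amide coupling that yields DEET) is real, but the action strings are paraphrased assumptions, not verbatim labels from the dataset of [14]:

```python
# Illustrative input/output pair for procedure prediction in the style
# of [14]: a reaction SMILES maps to a sequence of synthesis actions.
reaction = "Cc1cccc(C(=O)Cl)c1.CCNCC>>CCN(CC)C(=O)c1cccc(C)c1"
actions = [
    "ADD m-toluoyl chloride",
    "ADD diethylamine dropwise at 0 °C",
    "STIR for 2 h at room temperature",
    "EXTRACT with dichloromethane; CONCENTRATE",
]
```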

These applications mark the second step in the evolution of NLP methods in chemistry (see Figure 1 middle). They gradually bring in more modalities and forms of language, but they're not yet quite as versatile as the latest NLP models.

Phase 3: Towards general chemistry-language models

Recent breakthroughs in Large Language Models (LLMs) have taken the connection between chemistry and language to a whole new level. Scaling models with massive datasets and compute, along with paradigms like fine-tuning and in-context learning, has made it possible to model chemistry directly in natural language, a more flexible way of describing the scientific process and its results.

For example, fine-tuned LLMs have shown impressive capabilities, like predicting molecular properties in zero- and few-shot settings [7]. These models sometimes even outperform task-specific, expert-designed models, especially in the low-data regime that is so common in chemistry. The likely reason is that they learn from large text corpora, which lets them transfer knowledge to new tasks more efficiently.
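To see what this looks like in practice, here's a minimal sketch of few-shot property prediction framed as text. The prompt template and the tiny labeled set are illustrative assumptions, loosely in the spirit of [7] rather than its exact format:

```python
# Toy few-shot prompt for property prediction as text. The labels below
# are illustrative; real work uses curated experimental datasets.
FEW_SHOT_EXAMPLES = [
    ("CCO", "soluble"),         # ethanol
    ("c1ccccc1", "insoluble"),  # benzene
    ("CC(=O)O", "soluble"),     # acetic acid
]

def build_prompt(query_smiles: str) -> str:
    """Serialize labeled examples plus a query into a single text prompt."""
    lines = ["Classify each molecule as soluble or insoluble in water."]
    for smiles, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Molecule: {smiles}\nAnswer: {label}")
    lines.append(f"Molecule: {query_smiles}\nAnswer:")
    return "\n\n".join(lines)

# The LLM's completion is parsed back into a label; no task-specific
# architecture or descriptor engineering is needed.
print(build_prompt("CCN(CC)C(=O)c1cccc(C)c1"))
```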

LLMs have also been used as agents, taking advantage of their reasoning and tool-using capabilities. These agents can provide a text-modulated interface for complex chemical tasks that require planning and analysis [4], and do so by using multiple tools in a goal-oriented way. For example, ChemCrow [5] was given the task of synthesizing an insect repellent. With access to a robotic chemistry lab among its tools, ChemCrow identified a target molecule with the desired function, planned the synthesis, and figured out the details of the synthesis process within the robotic facility. The result was the successful production of DEET, a widely-used insect repellent, marking a big milestone in the use of LLMs in chemistry.

Figure 3. An example of the potential of LLM agents in chemistry. Here, an agent was given multiple tools, including a robotic platform, and was told to synthesize an insect repellent. This resulted in the automatic synthesis of DEET, without human intervention even in the planning stages.
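For a flavor of how such an agent operates, here's a minimal sketch of a tool-using loop. The tool names, the action format, and the scripted stand-in LLM are hypothetical; ChemCrow's actual tools and prompting are far more elaborate [5]:

```python
from typing import Callable

# Hypothetical tools the agent can call by name; real agents wrap things
# like synthesis planners and robotic platforms behind such interfaces.
TOOLS: dict[str, Callable[[str], str]] = {
    "plan_synthesis": lambda target: f"proposed route to {target}: ...",
    "run_robot":      lambda step:   f"executed on platform: {step}",
}

def agent(llm: Callable[[str], str], task: str, max_steps: int = 10) -> str:
    """Ask the LLM for an action, run the tool, feed the observation back."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        decision = llm(transcript)             # e.g. "plan_synthesis: DEET"
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        name, _, arg = decision.partition(":")
        observation = TOOLS[name.strip()](arg.strip())
        transcript += f"Action: {decision}\nObservation: {observation}\n"
    return "step budget exhausted"

# A scripted stand-in LLM shows the loop mechanics end to end:
script = iter(["plan_synthesis: DEET",
               "run_robot: amide coupling of m-toluoyl chloride",
               "FINAL: DEET synthesized"])
print(agent(lambda _: next(script), "Synthesize an insect repellent"))
```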

This last step marks the third phase and brings us full circle: chemistry can now be modeled just like any other language! We're back to the methods that inspired the original applications, letting researchers tap into even more of the breakthroughs NLP has produced over the years.

We've got an exciting future ahead of us! We look forward to the novel discoveries that these techniques will facilitate in chemistry and drug discovery. To learn more, please check out our original preprint [6].


References

1. Akdel M, Pires DEV, Pardo EP, Jänes J, Zalevsky AO, Mészáros B, Bryant P, Good LL, Laskowski RA, Pozzati G, Shenoy A, Zhu W, Kundrotas P, Serra VR, Rodrigues CHM, Dunham AS, Burke D, Borkakoti N, Velankar S, Frost A, Basquin J, Lindorff-Larsen K, Bateman A, Kajava AV, Valencia A, Ovchinnikov S, Durairaj J, Ascher DB, Thornton JM, Davey NE, Stein A, Elofsson A, Croll TI, Beltrao P (2022) A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 29:1056–1067. doi: 10.1038/s41594-022-00849-w

2. Alberts M, Laino T, Vaucher AC (2023) Leveraging Infrared Spectroscopy for Automated Structure Elucidation

3. Alberts M, Zipoli F, Vaucher AC (2023) Learning the Language of NMR: Structure Elucidation from NMR spectra using Transformer Models

4. Boiko DA, MacKnight R, Gomes G (2023) Emergent autonomous scientific research capabilities of large language models. doi: 10.48550/ARXIV.2304.05332

5. Bran AM, Cox S, White AD, Schwaller P (2023) ChemCrow: Augmenting large-language models with chemistry tools

6. Bran AM, Schwaller P (2023) Transformers and Large Language Models for Chemistry and Drug Discovery

7. Jablonka KM, Schwaller P, Ortega-Guerrero A, Smit B (2023) Leveraging Large Language Models for Predictive Chemistry

8. Jones DT, Thornton JM (2022) The impact of AlphaFold2 one year on. Nat Methods 19:15–20. doi: 10.1038/s41592-021-01365-3

9. Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E (2020) Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 25:689–705. doi: 10.1016/j.drudis.2020.01.020

10. Schwaller P, Laino T, Gaudin T, Bolgar P, Hunter CA, Bekas C, Lee AA (2019) Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent Sci 5:1572–1583. doi: 10.1021/acscentsci.9b00576

11. Schwaller P, Petraglia R, Zullo V, Nair VH, Haeuselmann RA, Pisoni R, Bekas C, Iuliano A, Laino T (2020) Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem Sci 11:3316–3325. doi: 10.1039/C9SC05704H

12. Schwaller P, Probst D, Vaucher AC, Nair VH, Kreutter D, Laino T, Reymond J-L (2021) Mapping the space of chemical reactions using attention-based neural networks. Nat Mach Intell 3:144–152. doi: 10.1038/s42256-020-00284-w

13. Tu Z, Coley CW (2021) Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction

14. Vaucher AC, Schwaller P, Geluykens J, Nair VH, Iuliano A, Laino T (2021) Inferring experimental procedures from text-based representations of chemical reactions. Nat Commun 12:2573. doi: 10.1038/s41467-021-22951-1
