Senior ML Scientist & Group Leader, Prescient Design, Genentech

Neural Scaling of Deep Chemical Models

It is no surprise to deep learning researchers that scaling compute, model size, and dataset size is a reliable way to improve the performance of neural networks. But back in 2021, this wasn’t obvious to most people, and it certainly wasn’t clear what impact scaling would have (if any) on deep learning for chemistry and drug discovery. In this post, I’ll give an overview of our paper Neural Scaling of Deep Chemical Models, published in Nature Machine Intelligence: what motivated the work, what we found, what the takeaways and future directions are, and how our work fits into the bigger picture of ML for drug discovery.

The scaling problem in scientific machine learning

While many researchers find it deeply unsatisfying that scaling is such a powerful approach in deep learning, I believe that this attitude is unscientific (1). Scaling behavior is simply an empirical fact about some approaches in ML, and science is, at least in part, about observing things and understanding them. If your latest, greatest algorithmic innovation is going to be washed away and obsoleted by scaling, wouldn’t you want to know? If the gains from the time you spend hyperparameter tuning, tweaking model architectures, and “being clever” are eclipsed by scaling, isn’t that worth knowing? And if the outcomes from your research and innovation aren’t achievable by scaling, don’t you want to prove that and share it far and wide with the research community? 

(1) (There are reasonable objections to scaling on the grounds of such research being inaccessible outside of big tech, which is why we introduce Training Performance Estimation in our paper, to reduce the computational cost of scaling experiments. My lab is also working on ways to better understand and lower the resource demands of scaling performant deep learning models for chemistry and biology.)

If you’re doing any flavor of deep learning, you’re always competing against two baselines: non-deep-learning methods, and the simplest, most scalable deep learning approach made bigger. To be clear, if a research direction involves new ideas or applications, it does not have to be “state-of-the-art” (SOTA) to be interesting. An idea can be beautiful or elegant and worth pursuing purely for those reasons. There is an overemphasis in the ML community on SOTA results, and an underemphasis on actually understanding how new ML methods and approaches perform compared to “classical” baselines and to scaling simple algorithms.

The problem in chemical deep learning, then, is to understand the effects of scaling ML models: to find out where existing methods may be sufficient, given enough compute and data, and where scaling doesn’t fit the bill. In our paper, we considered two different model classes and application areas: 1) autoregressive (GPT-style) language models trained via self-supervised learning on string representations of small molecules; and 2) graph neural networks (with varying levels of physical priors baked in) trained via supervised learning on atomistic configurations of molecules labeled with quantum mechanical energies and forces. We wanted to know whether chemical language models (ChemGPT) and GNN force fields (SchNet, PaiNN, SpookyNet, and Allegro) would exhibit the same trends in training loss that are observed in large language models (LLMs), how physical priors like equivariance affect scaling behavior, and what the advantages and limitations of scaling models for chemistry are.
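To make the first setup concrete, here is a minimal sketch (not code from the paper) of how a small molecule becomes the kind of token sequence a GPT-style chemical language model is trained on. It assumes the open-source selfies package, and the vocabulary is built on the fly purely for illustration:

```python
# Minimal sketch: turn a molecule into a token-id sequence for an
# autoregressive chemical language model. Assumes `pip install selfies`;
# the vocabulary here is toy/illustrative, not the one used in the paper.
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin, as a SMILES string
selfies_str = sf.encoder(smiles)                 # robust string representation
tokens = list(sf.split_selfies(selfies_str))     # e.g. ['[C]', '[C]', ...]

# Map tokens to integer ids so they can feed a GPT-style next-token model.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(tokens[:5], token_ids[:5])
```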

Understanding the advantages and limitations of scaling

We scaled models over many orders of magnitude of model and dataset sizes, beyond anything that had been tried when the work was first completed. We showed scaling behavior of ChemGPT beyond 1B parameters and 10 million molecules (300 million tokens). Because this was the first investigation of neural scaling of chemical models (and we had a limited, academic training budget), we spread that budget across many different model and dataset sizes and trained models for multiple epochs. This gave us an interesting and more complete picture of the interplay between model and dataset size, but it also means that our largest 1.2B-parameter ChemGPT model is undertrained with respect to dataset size. While we do see diminishing returns in improvements to pre-training loss, as expected for neural scaling, we did not observe any indication that we had hit a fundamental limit to scaling ChemGPT.

Figure: ChemGPT is pre-trained on up to 10 million molecules (300 million tokens) from PubChem. Performance improvements are seen for models up to 1 billion non-embedding parameters, and continued improvements are observed with increasing pre-training dataset size.
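To make the “neural scaling” idea concrete, here is a small, self-contained sketch of the kind of power-law fit used in scaling analyses, loss ≈ a·N^(−α) + L∞, where N is dataset (or model) size. The data points below are made up for illustration and are not results from the paper:

```python
# Sketch of fitting an empirical power-law scaling curve with scipy.
# The (size, loss) pairs are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, alpha, l_inf):
    # loss(n) = a * n^(-alpha) + l_inf (irreducible loss)
    return a * n ** (-alpha) + l_inf

sizes = np.array([1e4, 1e5, 1e6, 1e7])          # e.g. number of training molecules
losses = np.array([1.30, 0.95, 0.72, 0.58])     # hypothetical validation losses

(a, alpha, l_inf), _ = curve_fit(scaling_law, sizes, losses,
                                 p0=(10.0, 0.2, 0.3), maxfev=10000)
print(f"scaling exponent alpha ~ {alpha:.3f}, irreducible loss ~ {l_inf:.3f}")
```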

Likewise, for GNNs we observed neural scaling behavior over many orders of magnitude of dataset size (on the largest single dataset of energies and forces for small molecules available at the time, the ANI-1x dataset). Model size is trickier to define and scale in a predictable way for GNNs, so we measure model capacity (depth × width), and find that higher-capacity models do indeed show improved loss. Interestingly, our neural scaling experiments allowed us to quantify how scaling behavior changes with equivariance (a property related to the symmetries present in molecules that can be built into model architectures). Many papers have shown that equivariant GNNs perform better than invariant ones on small datasets across many chemical tasks, which is intuitive because equivariance is a strong inductive bias. We showed that equivariance also fundamentally improves the scaling behavior of GNN force fields over many orders of magnitude; that is, equivariant GNNs improve faster (have greater sample efficiency) as they are trained on more data. Again, we did not see any evidence that we had hit a limit for scaling existing architectures with respect to dataset size.
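As a toy illustration of what equivariance means here (a sketch, not any of the architectures from the paper): rotation-invariant features such as interatomic distances do not change when a molecule is rotated, while equivariant outputs such as forces rotate along with it, i.e. f(Rx) = Rf(x):

```python
# Toy demonstration of invariance vs. equivariance under rotation.
import numpy as np
from scipy.spatial.transform import Rotation

positions = np.random.randn(5, 3)          # toy "atomic" coordinates
R = Rotation.random().as_matrix()          # a random 3D rotation
rotated = positions @ R.T

# Invariant feature: pairwise distances are unchanged by the rotation.
dists = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
rot_dists = np.linalg.norm(rotated[:, None] - rotated[None, :], axis=-1)
assert np.allclose(dists, rot_dists)

def toy_force_model(pos):
    # A trivially equivariant "force field": each atom feels the sum of
    # displacement vectors to every other atom.
    return (pos[:, None, :] - pos[None, :, :]).sum(axis=1)

# Equivariance: predicting on the rotated molecule equals rotating the prediction.
assert np.allclose(toy_force_model(rotated), toy_force_model(positions) @ R.T)
```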

Now, how about downstream task performance? ChemGPT can be used to autoregressively generate molecular strings, or for representation learning. We didn’t see any evidence that scaling fundamentally improves molecule generation or molecular representation learning. Since our work first appeared, fantastic follow-up work has shown that pre-trained models, including ChemGPT, do not consistently outperform simpler baselines when modeling structure-property relationships. For GNN interatomic potentials, the results are more straightforward: these models are extremely good at fitting the training data, equivariant models scale better than invariant ones (although non-equivariant models can catch up, given enough data), and dataset size, quality, and diversity seem to be the main bottlenecks to achieving arbitrary levels of performance from these models.
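For the representation-learning use case, the workflow is roughly: tokenize a molecule string, run it through the pre-trained model, and pool the hidden states into a fixed-size embedding that a downstream property model can consume. Here is a hedged sketch using the Hugging Face transformers API; the checkpoint name is a placeholder (see the links at the end of this post for the real ones), and mean-pooling is just one common choice, not the paper's prescribed method:

```python
# Sketch: molecular embeddings from a pre-trained chemical language model.
# The model id below is a placeholder, not a real repository name.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "your-org/chem-lm-checkpoint"   # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

selfies_str = "[C][C][O]"                  # ethanol in SELFIES notation
inputs = tokenizer(selfies_str, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (1, seq_len, d_model)
embedding = hidden.mean(dim=1).squeeze(0)         # (d_model,) molecular embedding
```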

Takeaways, future directions, and the big picture

The main takeaways from our work are:

  1. Large chemical language models and graph neural network interatomic potentials exhibit empirical power-law “neural scaling” behavior over many orders of magnitude of model and dataset sizes. Our neural scaling results can guide future research by illuminating where the most meaningful improvements can be made: data curation/acquisition, training procedure, or model architecture. 

  2. Physics-based priors, inductive biases, known empirical relationships, and other forms of scientific knowledge can fundamentally alter scaling behavior and lead to persistent performance improvements, but these benefits diminish with increasing scale.

  3. Scaling is not (yet) a panacea in deep learning, nor in chemistry or scientific deep learning. However, it is important to understand how algorithmic and architectural choices affect performance in different regimes of model and dataset size. We shouldn’t declare victory after evaluating only in extremely resource-poor or resource-rich regimes (where resource = {data, model capacity, compute}).

Our work exposes many interesting research directions, some of which are already well underway since the first appearance of our paper. Some open questions and related work that might spark the reader’s own research ideas:

  1. Are representations from pre-trained models useful for modeling structure-property relationships of small molecules? Do we need different (more chemically aware) pre-training tasks, better fine-tuning approaches, or scalable methods that go beyond traditional string and graph representations of molecules?

  2. Is an isolated molecule the right level of abstraction for chemical ML, or do we need to include additional context (e.g., from natural language descriptions and LLMs, or incorporating more of the biochemical environment)?

  3. What is enabled if we achieve goals like better molecular generation, representation learning, and fast, differentiable interatomic potentials?

Zooming out, you might ask, what does any of this mean for drug discovery? There is plenty of writing from seasoned drug hunters arguing that molecular generation, predictive modeling, and physics-based computational design are not bottlenecks for drug discovery; instead, they point to clinical trial failure rates, largely due to potency/efficacy (you’re working on the wrong target) or toxicity (the dose of your drug required to reach efficacy is deadly). But think about those problems from a scientific machine learning perspective: useful ML models should allow you to make better decisions. What decisions do you actually get to make as a drug hunter? You decide what biomolecule to target (and therefore what pathway, what disease area, what indication, etc.), and what molecule to target it with. Everything else is downstream of these two extraordinarily complicated decisions (2).

(2) (Sure, there’s also a lot of cool stuff you can do with ML for design of experiments for clinical trials, etc., but those directions boil down to getting better data to support your conclusions about the (bio)molecules you chose.) 

If you talk to the experienced drug hunters, you’ll learn about these real bottleneck problems, and probably come away with the impression that the ML community has done a whole lot of nothing about them. But if you talk to the right people in the right organizations, you’ll learn that we are already making serious progress on these fronts. You can respect the amazing complexity of chemistry and biology, and also understand that ML is a toolbox for learning from data to model complex systems. You have to understand that research progress on hard problems proceeds iteratively, by breaking down tough problems into tractable ones. And you have to understand that scaling is exponential.

Paper, code, contact

For many more details and results, see: Frey NC, Soklaski R, Axelrod S, Samsi S, Gomez-Bombarelli R, Coley C, Gadepally V. Neural scaling of deep chemical models. Nature Machine Intelligence (2023).

For code used to perform the experiments, see the LitMatter repo on GitHub. To access ChemGPT, use the MolFeat or ROGI-XD libraries. Neural force field code is available here. Pre-trained ChemGPT model checkpoints are available on the HuggingFace Hub, and pre-trained model checkpoints for PaiNN and Allegro are available through Figshare. All experiments reported in the paper were performed on the MIT SuperCloud cluster, and we gratefully acknowledge the MIT SuperCloud team and Lincoln Laboratory Supercomputing Center for providing HPC and consultation resources that contributed to the research results reported within the paper.

To get in touch with me, reach out through Twitter, LinkedIn, or email.
