Benchmarking DNA language models with BEND

This work has been accepted at ICLR 2024.

DNA language models may help us decode the genome, but they would work better if we used datasets and benchmarks that better reflect real-world use. In this blog post, we will tell you about our benchmarking effort BEND, share some key takeaways, and explain why we think it is a step in the right direction towards developing more useful models.

Can language models capture genomic patterns?

How much could be discovered and achieved if we had the ability to read and interpret the genetic code of ourselves and the species we share the planet with? So far, one of the most promising tools for deciphering the language of genomes appears to be deep learning.

Over the last three years, protein language models, developed in an adjacent field, have been hugely successful at solving prediction tasks on proteins. Language modeling techniques can be very powerful for leveraging vast amounts of unlabeled data to learn representations that can be used in downstream tasks. This idea seems just as fitting for DNA: as with proteins, we have an ever-growing number of sequences (genomes) readily available, but experimental characterization and functional annotation remain a bottleneck.

Motivated by this, a growing number of DNA LMs have been proposed, starting with DNABERT in 2020. DNA LMs have now been scaled up to 2.5B parameters (Nucleotide Transformer), and innovative architectures such as HyenaDNA that overcome a transformer’s quadratic complexity have been proposed. However, regardless of model architecture and size, there are unique fundamental challenges in applying language modeling methods to DNA: genomes are extremely large, there is no naturally defined boundary for functional units, and functional elements are distributed extremely sparsely throughout the genome.

Genomic DNA is extremely long, containing millions of base pairs in a single chromosome. Functional elements such as genes (exons and introns), promoters and enhancers are distributed only sparsely throughout the genome.

This has led to great heterogeneity in how genomes are processed for pre-training, followed by even greater heterogeneity in the definition of downstream tasks for evaluation. As we became interested in leveraging DNA LMs for prediction tasks, we set out to evaluate how well the currently available DNA LMs actually capture important biological patterns. We were particularly concerned with incorporating the unique challenges of length, sparsity and variability mentioned above. This led us to establish the BEND benchmark, a set of downstream datasets and tasks that meaningfully assess the performance of DNA LM representations. We wanted to think about the long-term applications of DNA LMs, so we included tasks that are very challenging for current models but that should be captured by good representations. This was particularly important to us, because without rigorous and appropriate ways of evaluating models we won't know how to shape and develop the next generation of models.

Establishing biologically meaningful benchmark tasks

Across most recently published DNA LMs, we have observed that establishing good downstream tasks remains a key challenge. Often, task selection seems to be driven by what data is publicly available in ML-ready format from previous studies, rather than being tailored to evaluating the real-world utility of a DNA LM. At the same time, many benchmark sets are presented alongside novel models, which may bias the reported performance. Specifically, we encountered the following issues in previous tasks:

  • Datasets are small, containing only a few thousand samples. Although this is unavoidable for certain tasks, for some aspects of DNA function we have access to large-scale sequencing-based experiments that characterize the whole genome at once, providing millions of observations; we should use them!

  • Datasets are balanced, with negatives sampled in a fixed ratio to positives. In an actual genome, we expect severe label imbalance due to the sparsity of functional regions.

  • Datasets are short, being defined on very local features with only a few hundred bp of flanking context. This prevents us from evaluating whether DNA LMs have learned to exploit context over longer ranges.

  • Datasets have no biological meaning: distinguishing, for example, DNA from different species does not directly relate to any specific aspect of DNA function.

To overcome these limitations, we established a collection of tasks on the human genome: two long-range tasks (gene finding and enhancer annotation), three large-scale tasks using genome-scale data (histone modification, CpG methylation and chromatin accessibility prediction), and two zero-shot tasks for noncoding variant effect prediction (expression variants and disease variants). A key benefit of this selection is that dedicated expert methods already exist for each task: a testament to their relevance in genomics research, and highly useful for putting performance levels into absolute context rather than just comparing DNA LMs against each other.

Overview of the tasks included in the benchmark.
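
To make the two zero-shot variant tasks more concrete, here is a minimal sketch of how a noncoding single-nucleotide variant could be scored without any supervised training, by embedding the reference and the variant sequence and comparing the representations at the edited position. The embed function, the cosine-distance score and all names are assumptions chosen for illustration; they are not necessarily the exact procedure used in BEND.

```python
# Hedged sketch: zero-shot variant effect scoring with a DNA LM.
# Assumes `embed(sequence)` returns one embedding vector per nucleotide
# (shape: [sequence_length, hidden_dim]). The cosine-distance score is one
# common choice used for illustration, not necessarily BEND's exact metric.
import torch.nn.functional as F


def variant_effect_score(embed, ref_seq: str, alt_base: str, pos: int) -> float:
    """Score a single-nucleotide variant by how much the LM embedding changes."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_emb = embed(ref_seq)   # (L, D) per-nucleotide embeddings, reference allele
    alt_emb = embed(alt_seq)   # (L, D) per-nucleotide embeddings, variant allele
    # Larger distance at the edited position = larger predicted effect.
    return 1.0 - F.cosine_similarity(ref_emb[pos], alt_emb[pos], dim=0).item()
```

The intuition is that a variant which barely perturbs the learned representation is predicted to have little effect, whereas a large shift suggests a functional consequence. The supervised tasks instead feed the same per-nucleotide embeddings to a small CNN, as described in the next section.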

Benchmarking DNA LMs

The current workflow and reasoning behind BEND are as follows:

We are primarily interested in benchmarking the representations learned by the DNA LMs, rather than fine-tuning LMs on a given task. This allows us to directly reason about which features the LMs learned during pre-training, rather than evaluating how effectively they can be tuned on a task. Given the heterogeneity in model architecture and size, a fair fine-tuning comparison would require extensive screens over hyperparameters and tuning strategies, and might conflate the effect of pre-training with the general inductive bias of the architecture for a task.

We therefore use DNA LMs as embedders, processing input sequences into LM representations. These representations then serve as input to a simple two-layer CNN that we train with supervision on the task data. This CNN has only a very limited receptive field, since we want to evaluate whether the embeddings themselves already capture long-range features. In total, we benchmarked 10 publicly available LMs and 2 small baseline LMs that we trained for comparison. The approach also makes it straightforward to benchmark new DNA LMs on the tasks: all you need to do is implement an Embedder class that handles sequence tokenization, chunking, embedding, upsampling and special token removal according to what the specific LM needs, and add the embedder to the YAML configuration. Training runs can then be started using the new featurization; a rough sketch of this setup is shown below.
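
As an illustration of this workflow, the sketch below shows (i) a wrapper that turns a Hugging Face-style DNA LM into a per-nucleotide embedder and (ii) a small two-layer CNN probe with a deliberately narrow receptive field. All class names, method names, hidden sizes and kernel widths here are assumptions made for illustration; they do not mirror BEND's exact interface or hyperparameters, so please consult the repository for the real base class.

```python
# Illustrative sketch only: names, preprocessing steps and hyperparameters are
# assumptions, not BEND's actual API. The embedder wraps a Hugging Face-style
# DNA LM and returns one embedding vector per nucleotide; the probe is a small
# two-layer CNN with a narrow receptive field, trained with supervision.
import torch
from transformers import AutoModel, AutoTokenizer


class MyDNALMEmbedder:
    def __init__(self, model_name: str, chunk_size: int = 512, device: str = "cpu"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device).eval()
        self.chunk_size = chunk_size   # respect the LM's maximum context length
        self.device = device

    @torch.no_grad()
    def embed(self, sequence: str) -> torch.Tensor:
        pieces = []
        # Chunk long sequences so each piece fits the model's context window.
        for start in range(0, len(sequence), self.chunk_size):
            chunk = sequence[start:start + self.chunk_size]
            tokens = self.tokenizer(chunk, return_tensors="pt").to(self.device)
            hidden = self.model(**tokens).last_hidden_state[0]   # (n_tokens, dim)
            hidden = hidden[1:-1]              # drop special tokens (e.g. CLS/SEP)
            # If the tokenizer works on k-mers, upsample to one vector per base.
            if hidden.shape[0] != len(chunk):
                factor = max(1, len(chunk) // hidden.shape[0])
                hidden = hidden.repeat_interleave(factor, dim=0)[: len(chunk)]
            pieces.append(hidden)
        return torch.cat(pieces, dim=0)        # (sequence_length, hidden_dim)


class TwoLayerCNNProbe(torch.nn.Module):
    """Supervised probe with a deliberately limited receptive field."""

    def __init__(self, in_dim: int, n_classes: int, hidden: int = 64, kernel: int = 3):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(in_dim, hidden, kernel_size=kernel, padding=kernel // 2),
            torch.nn.GELU(),
            torch.nn.Conv1d(hidden, n_classes, kernel_size=kernel, padding=kernel // 2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> per-position logits (batch, seq_len, n_classes)
        return self.net(x.transpose(1, 2)).transpose(1, 2)
```

Registering such an embedder under a new key in the YAML configuration is then what makes the new featurization available to the training runs.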

Takeaways

Performance of all models on the BEND benchmark collection. (NT=Nucleotide Transformer)

Currently, it is challenging to make conclusive statements about the most effective way to set up a DNA LM, since pre-training procedures, datasets and model architectures all vary between different works.

However, there are some key insights: 

DNA LMs show clear promise: comparing the pre-trained ResNet LM embeddings with either the fully supervised baseline ResNet or Basset/DeepSEA, it is clear that even the relatively simple ResNet architecture, pre-trained on only one genome, provides representations good enough to be competitive across a range of tasks.

Multispecies training, as done by the NT-MS model, does seem to aid performance across many tasks, enabling LMs to also capture motifs that occur only sparsely in the pre-training data, such as splice sites (detailed results are in the appendix of the paper).

Sparse long-range tasks are currently out of reach: tasks such as enhancer annotation are clearly still beyond the capabilities of current LMs, and pre-training LMs on longer contexts does not necessarily yield improvements on this task over larger, shorter-range LMs.

Outlook

So can we say what’s next for DNA LMs? On the one hand, we were very surprised by NT-MS’s performance on splice sites, indicating that pre-training on whole genomes allows models to learn good features for extremely sparse functional regions. On the other hand, reasoning over very long contexts is still hard, and this might be an inherent limitation of language modeling pre-training objectives (a feature at 50 kb distance probably has only very limited relevance for a local token reconstruction objective). Although the results of the current benchmark might not seem overwhelming, DNA language modeling is very new, and it remains to be seen what interesting developments will arise out of the unique challenges posed by genomic data when the right evaluation tasks are used to shape the evolution of the field!

On the benchmarking side, a key future objective is to extend BEND to multiple species. It would be particularly interesting to evaluate how DNA LM representations help with generalizing between different species, for example by training on one and evaluating on another, since we obtain genome sequences from novel species at a far greater rate than we obtain labels for those genomes.

If you’re interested in learning more about our project, or about DNA language modeling and benchmarking in general, you can check out our ICLR paper and GitHub repo, where we are very open to receiving questions, comments and issues.
