
SaProt: Protein Language Modeling with Structure-aware Vocabulary

Protein structures are often considered more informative than sequences because they directly determine a protein's function. Thanks to the breakthrough brought by AlphaFold2 (AF2), a huge number of predicted structures is now available. How can we best use these structures? By training the next generation of protein language models! As in Natural Language Processing (NLP) or Computer Vision (CV), a powerful base model can be deployed for various downstream tasks in the protein field, e.g. mutational effect prediction or protein-protein interaction prediction.

In our paper, we use Foldseek to process protein structures. Foldseek provides a VQ-VAE-based encoder that maps a protein structure to a structural sequence of the same length as the residue sequence. Since the encoder was trained to recover inter-residue distances and angles from its discrete tokens, the structural sequence represents the 3D structure well.
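If you want to produce such a structural sequence yourself, Foldseek can emit the 3Di tokens for a structure directly. The snippet below is a minimal sketch assuming a local Foldseek installation and its `structureto3didescriptor` subcommand; the exact flags and output columns may differ between Foldseek releases, so check the Foldseek documentation for your version.

```python
import subprocess
import tempfile
from pathlib import Path


def get_3di_sequence(pdb_path: str) -> tuple[str, str]:
    """Return (amino-acid sequence, 3Di structural sequence) for a structure file.

    Sketch only: assumes Foldseek is on PATH and that `structureto3didescriptor`
    writes a TSV whose columns include the residue sequence and the 3Di sequence
    (the column layout may vary between Foldseek versions).
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_tsv = Path(tmp) / "out.tsv"
        subprocess.run(
            ["foldseek", "structureto3didescriptor", pdb_path, str(out_tsv)],
            check=True,
        )
        # One line per chain: name, AA sequence, 3Di sequence, descriptors, ...
        fields = out_tsv.read_text().splitlines()[0].split("\t")
        aa_seq, struct_seq = fields[1], fields[2]
    # Lower-casing the 3Di letters keeps them visually distinct from residues.
    return aa_seq, struct_seq.lower()


if __name__ == "__main__":
    aa_seq, struct_seq = get_3di_sequence("example.pdb")
    assert len(aa_seq) == len(struct_seq)  # one structure token per residue
    print(aa_seq[:10], struct_seq[:10])
```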

By encoding structures as one-dimensional discrete tokens, we can combine them with the amino acids (there are 20 standard amino acids in nature) to form a Structure-aware Vocabulary, embedding structural information directly into the model inputs and enhancing the model's representational capability. Our pre-training uses the largest set of protein structures currently available (about 40 million) and ran on 64 A100s for 3 months, resulting in SaProt, an open-source model with 650M parameters. Experimental results show that our model outperforms previous sequence- and structure-based models on a variety of protein tasks.
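The released checkpoints can be loaded with the Hugging Face `transformers` library, since SaProt shares the ESM-2 architecture. The snippet below is a minimal sketch that assumes the 650M AF2 checkpoint id on the Hub; see the SaProt repository for the authoritative loading code and up-to-date model ids.

```python
from transformers import EsmForMaskedLM, EsmTokenizer

# Checkpoint id assumed here; a local path to downloaded weights also works.
model_id = "westlake-repl/SaProt_650M_AF2"
tokenizer = EsmTokenizer.from_pretrained(model_id)
model = EsmForMaskedLM.from_pretrained(model_id)

# A structure-aware sequence interleaves residue and Foldseek 3Di letters,
# e.g. "Md" = residue M in local structure d (see the next section).
seq = "MdEvVqQpLdIr"
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, number of SA tokens + special tokens, vocab size)
```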

How and why to combine structure and amino acid sequences

As mentioned earlier, we first encode the protein with Foldseek's structure vocabulary, where each token represents a distinct local structure, generating a one-dimensional structural sequence of the same length as the amino acid sequence. We then use a simple but effective approach: we compute the Cartesian product (i.e., all pairwise combinations) of the structure vocabulary and the amino acid vocabulary to form a new structure-aware vocabulary.

This way, for each position in the protein, the residue type and its corresponding local structure are combined into a single element of the new vocabulary, achieving our goal: the model takes into account both the sequence and the structure information of the protein.
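The vocabulary construction itself is only a few lines of code. The sketch below (illustrative, not the exact code from our repository) builds the combined token set with a Cartesian product and converts a residue sequence plus its Foldseek 3Di sequence into structure-aware tokens; the "#" placeholder stands for an unknown or masked residue/structure at a position.

```python
from itertools import product

# 20 standard amino acids plus "#", a placeholder for unknown or masked residues.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["#"]
# Foldseek's 3Di structural alphabet also has 20 letters (written lower-case here
# to keep the two halves of a token visually distinct), plus the same placeholder.
FOLDSEEK_3DI = list("acdefghiklmnpqrstvwy") + ["#"]

# Cartesian product: every (residue, local structure) pair becomes one token,
# giving 21 * 21 = 441 structure-aware tokens (special tokens such as <cls>,
# <pad> and <eos> would be added on top of these).
STRUCTURE_AWARE_VOCAB = [aa + st for aa, st in product(AMINO_ACIDS, FOLDSEEK_3DI)]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(STRUCTURE_AWARE_VOCAB)}


def to_structure_aware_tokens(aa_seq: str, struct_seq: str) -> list[str]:
    """Zip a residue sequence with its equal-length 3Di sequence."""
    assert len(aa_seq) == len(struct_seq)
    return [aa + st for aa, st in zip(aa_seq, struct_seq.lower())]


# Example with a short fragment and a hypothetical 3Di annotation.
tokens = to_structure_aware_tokens("MKTV", "dpvq")
print(tokens)                            # ['Md', 'Kp', 'Tv', 'Vq']
print([TOKEN_TO_ID[t] for t in tokens])  # integer ids that are fed to the model
```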

This is followed by classical Masked Language Modeling (MLM) pre-training with a BERT-style architecture (see our paper for more details on training).
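For readers unfamiliar with MLM, the corruption step looks roughly like the sketch below. It shows one masking variant for structure-aware tokens, where the residue half of a token is hidden while its structure half is kept, so the model has to recover the residue from both sequence context and local structure; the 15% ratio is the usual BERT default, not necessarily our exact setting.

```python
import random

MASK_RATIO = 0.15  # standard BERT-style masking ratio (assumption)


def mask_structure_aware_tokens(tokens: list[str]) -> tuple[list[str], dict[int, str]]:
    """Corrupt a list of structure-aware tokens for MLM training.

    For each selected position, the amino-acid letter is replaced by '#'
    while the Foldseek 3Di letter is kept (e.g. 'Md' -> '#d'); the original
    token is stored as the prediction target for that position.
    """
    corrupted = list(tokens)
    labels: dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if random.random() < MASK_RATIO:
            labels[i] = tok              # the model is trained to predict this token
            corrupted[i] = "#" + tok[1]  # hide the residue, keep the local structure
    return corrupted, labels


corrupted, labels = mask_structure_aware_tokens(["Md", "Kp", "Tv", "Vq"])
print(corrupted, labels)
```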

Pre-training with structure-aware vocabulary decreases overfitting

Why do we need to encode structures in this way? Let's look at the pre-training loss curves for different ways of encoding structure:

On the left and in the middle of the figure are two classical ways of modeling protein structures: encoding the structural information as a bias added to the transformer's attention map (e.g., Evoformer, Uni-Mol), and modeling the spatial relationships of the protein with graph neural networks (e.g., MIF, GearNet). The loss plots show that when these two approaches are pre-trained on AF2 structures with the MLM objective, the model overfits very quickly: the loss keeps decreasing on AF2 structures, but on real PDB structures it stops decreasing or even rises.

We think this is because the structures predicted by AF2 carry hidden patterns. Since the first two approaches model the 3D coordinates of the protein directly, these hidden patterns are easily picked up by the model, resulting in information leakage: the model can meet the training objective without actually understanding the evolutionary relationships. In contrast, our structure-aware vocabulary discards the coordinate values by encoding protein structures into one-dimensional structural sequences while preserving as much structural information as possible, so the model can use this information efficiently without being affected by the hidden patterns.

Awareness of protein structure

Our model gained strong representational capabilities by training on 40 million protein structures. A natural question is: how can we verify that the model actually learned more structural information, rather than simply being trained better? We therefore tested SaProt and ESM-2 on the contact prediction task, freezing the backbone of each model and training only a classification head on top (a sketch of this probing setup is shown below).
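Concretely, the probing setup looks something like the following PyTorch sketch: the pre-trained encoder is frozen and only a small pairwise classification head is optimized. The head design, names and dimensions here are illustrative placeholders rather than our exact training code.

```python
import torch
import torch.nn as nn


class ContactHead(nn.Module):
    """Tiny classification head for residue-residue contact prediction.

    Takes per-residue embeddings from a frozen encoder, builds a symmetric
    feature for every (i, j) residue pair, and predicts a contact logit.
    """

    def __init__(self, embed_dim: int = 1280):  # 1280 = hidden size of a 650M ESM-2-style encoder
        super().__init__()
        self.classifier = nn.Linear(2 * embed_dim, 1)

    def forward(self, residue_emb: torch.Tensor) -> torch.Tensor:
        # residue_emb: (batch, length, embed_dim)
        L = residue_emb.size(1)
        a = residue_emb.unsqueeze(2).expand(-1, -1, L, -1)  # (B, L, L, D)
        b = residue_emb.unsqueeze(1).expand(-1, L, -1, -1)  # (B, L, L, D)
        pair = torch.cat([a + b, (a - b).abs()], dim=-1)    # symmetric pair features
        return self.classifier(pair).squeeze(-1)            # (B, L, L) contact logits


def build_optimizer(backbone: nn.Module, head: ContactHead) -> torch.optim.Optimizer:
    """Freeze the backbone so that only the head's parameters are trained."""
    for p in backbone.parameters():
        p.requires_grad = False
    return torch.optim.Adam(head.parameters(), lr=1e-4)
```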

The results show that SaProt greatly outperforms ESM-2, which indicates that SaProt captures very rich structural information, enabling excellent results on structure-related tasks. We also visualized embeddings of alpha-helix proteins and beta-sheet proteins from the SCOPe database, with the following results:

The SaProt visualization clearly separates the alpha proteins from the beta proteins, whereas the ESM-2 visualization mixes the two classes together, demonstrating our model's strong ability to perceive structural differences.
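This kind of visualization is easy to reproduce with standard tools. Below is a minimal sketch that projects protein-level embeddings (e.g. mean-pooled per-residue representations) to 2D with t-SNE and colors them by SCOPe class; the pooling choice and t-SNE parameters are illustrative assumptions, not necessarily the exact settings we used.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_scope_classes(embeddings: np.ndarray, labels: np.ndarray) -> None:
    """Project protein-level embeddings to 2D and color them by SCOPe class.

    embeddings: (n_proteins, embed_dim), e.g. mean-pooled per-residue
        representations from SaProt or ESM-2.
    labels: (n_proteins,), 0 for all-alpha proteins, 1 for all-beta proteins.
    """
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    for cls, name in [(0, "all-alpha"), (1, "all-beta")]:
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=name)
    plt.legend()
    plt.title("Protein embeddings colored by SCOPe class")
    plt.show()
```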

Conclusion

In our work, we developed a model that uses a structure-aware vocabulary and performs well on a range of downstream tasks while avoiding overfitting. We believe that expanding the size of the structure vocabulary and/or scaling up the model could further improve the results, but we hope we have shown how useful and powerful this approach is!

Check out our paper for more details, and please reach out if you have any questions!
