Tackling protein language model training

There have been a few papers recently addressing the huge resources needed for training protein language models (PLMs) and proposing solutions.

Nathan Frey and a group at Genentech presented "Cramming Protein Language Model Training in 24 GPU Hours", training a model in a single day that performed comparably on protein fitness landscape inference to ESM-2 3B, which was trained for >15,000x more GPU hours.
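For context on what "fitness landscape inference" usually means in practice: a masked protein language model can score a point mutation by masking the mutated position and comparing the probability it assigns to the mutant versus the wild-type residue. The sketch below is illustrative only, using a small public ESM-2 checkpoint via HuggingFace; the sequence, mutation, and checkpoint are arbitrary examples, not anything from the paper.

```python
# Masked-marginal mutation scoring with an ESM-style masked LM (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"  # small public ESM-2 checkpoint, chosen for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def mutation_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """log p(mut) - log p(wt) at a masked position; higher = predicted more fit."""
    assert sequence[pos] == wt, "wild-type residue mismatch"
    enc = tokenizer(sequence, return_tensors="pt")
    # +1 offset: the ESM tokenizer prepends a <cls> token before the sequence.
    enc["input_ids"][0, pos + 1] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Example: score a hypothetical A4G mutation in a toy sequence.
print(mutation_score("MKTAYIAKQR", pos=3, wt="A", mut="G"))
```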

Xingyi Cheng and a group at BioMap Research looked at "Training Compute-Optimal Protein Language Models", training over 300 models with different combinations of parameters and tokens to investigate the relationships between model size, number of training tokens, and training objective.
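The underlying question in these scaling studies follows the Chinchilla framework: for a fixed compute budget C ≈ 6·N·D (N parameters, D training tokens), how should N and D be split to minimize loss? The toy sketch below shows that allocation for a loss curve of the form L(N, D) = E + A/N^α + B/D^β; all constants are made-up placeholders, not the values fitted in the paper.

```python
# Toy Chinchilla-style allocation: minimize L(N, D) = E + A/N**alpha + B/D**beta
# subject to C = 6*N*D. All constants are illustrative placeholders only.
import numpy as np

E, A, B = 1.7, 400.0, 1500.0      # hypothetical irreducible loss and scale terms
alpha, beta = 0.34, 0.28          # hypothetical fitted exponents

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_split(C, n_grid=10_000):
    """Grid-search the parameter count N that minimizes loss at fixed compute C."""
    N = np.logspace(6, 12, n_grid)   # candidate sizes from 1M to 1T parameters
    D = C / (6.0 * N)                # tokens implied by the budget C ~ 6*N*D
    i = np.argmin(loss(N, D))
    return N[i], D[i]

for C in [1e19, 1e21, 1e23]:         # example FLOP budgets
    N_opt, D_opt = best_split(C)
    print(f"C={C:.0e}: N*~{N_opt:.2e} params, D*~{D_opt:.2e} tokens")
```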

Another group, this one at Nostrum Biodiscovery S.L., did a similar scaling study in "Are Protein Language Models Compute Optimal?". Training a 35M-parameter model on a reduced token set with a single pass over the data, they achieved perplexity comparable to much larger models such as ESM-2 (15B) and xTrimoPGLM (100B).
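As a reminder of the metric being compared: perplexity is just the exponential of the mean per-token cross-entropy on a held-out set, so lower is better and values are directly comparable across models evaluated on the same data. A minimal sketch with placeholder tensors:

```python
# Perplexity = exp(mean per-token cross-entropy). Logits and labels here are
# random placeholders; in practice they come from a model and a held-out set.
import torch
import torch.nn.functional as F

vocab_size, n_tokens = 33, 1000                      # ~33 tokens in an ESM-style vocabulary
logits = torch.randn(n_tokens, vocab_size)           # placeholder model outputs
labels = torch.randint(0, vocab_size, (n_tokens,))   # placeholder target tokens

nll = F.cross_entropy(logits, labels, reduction="mean")  # mean negative log-likelihood
perplexity = torch.exp(nll)
print(f"mean NLL = {nll.item():.3f}, perplexity = {perplexity.item():.2f}")
```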