Kind of an interesting concept: how far can we reduce the alphabet of amino acids without losing too much information? If you just represent all polar amino acids by one letter, for example, can you still get decent predictions?
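To make the idea concrete, here's a rough Python sketch of what a reduced-alphabet encoding could look like. The five-group split below is my own illustrative choice (hydrophobic / polar / positive / negative / special), not the specific encodings the authors used.

```python
# Collapse the 20 standard amino acids into a handful of physicochemical groups
# and rewrite sequences in that reduced alphabet. This grouping is illustrative
# only, not the paper's actual scheme.
REDUCED_ALPHABET = {
    # hydrophobic -> "h"
    "A": "h", "V": "h", "L": "h", "I": "h", "M": "h", "F": "h", "W": "h",
    # polar, uncharged -> "p"
    "S": "p", "T": "p", "N": "p", "Q": "p", "Y": "p", "C": "p",
    # positively charged -> "+"
    "K": "+", "R": "+", "H": "+",
    # negatively charged -> "-"
    "D": "-", "E": "-",
    # conformationally special (glycine, proline) -> "s"
    "G": "s", "P": "s",
}

def reduce_sequence(seq: str) -> str:
    """Map a one-letter protein sequence onto the reduced alphabet."""
    return "".join(REDUCED_ALPHABET.get(aa, "X") for aa in seq.upper())

print(reduce_sequence("MKTAYIAKQR"))  # prints "h+phphh+p+"
```

A model pretrained on sequences rewritten this way only ever sees five tokens instead of twenty, which is the trade-off being probed here.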
This kind of approach is potentially useful for simplifying the protein sequence space while still capturing structural information. The authors pretrained PLMs from scratch with encodings imposed by reduced amino acid alphabets. Unsurprisingly, models trained on the full alphabet did better than models trained on the other encodings, but some cut-down alphabets still did decently well across tasks. They also found that predicting structures with these cut-down alphabets boosted lDDT-Cα scores but not overall structural accuracy.
So, still a way to go, but interesting to think about nonetheless.