Protein language models are biased by unequal sequence sampling across the tree of life

Likelihoods from protein language models (pLMs) have been shown to correlate well with protein fitness (catalytic activity, stability, binding affinity, etc.). This study suggests that pLMs carry a species bias that affects the likelihoods they output. The authors found that, depending on the pLM, between 26% and 69% of the variance in likelihood could be explained by species identity after controlling for protein type. The bias seems to be linked to the makeup of the sequence databases, where model organisms are overrepresented, which makes sense.
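To make "variance explained by species after controlling for protein type" concrete, here is a minimal sketch of one way such a number could be computed. It is not the study's actual method, which isn't described here. It assumes each sequence comes with a protein family label, a species label, and a pLM log-likelihood. The function name and the toy numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def species_variance_explained(records):
    """Fraction of likelihood variance attributable to species.

    records: iterable of (protein_family, species, log_likelihood).
    Illustrative only: centers scores within each protein family to
    remove the protein-type effect, then computes the share of the
    remaining variance captured by per-species means (eta squared).
    """
    records = list(records)

    # Step 1: control for protein type by centering within each family.
    by_family = defaultdict(list)
    for family, _, ll in records:
        by_family[family].append(ll)
    family_mean = {f: mean(v) for f, v in by_family.items()}
    residuals = [(sp, ll - family_mean[f]) for f, sp, ll in records]

    # Step 2: between-species sum of squares over total sum of squares.
    by_species = defaultdict(list)
    for sp, r in residuals:
        by_species[sp].append(r)
    grand = mean(r for _, r in residuals)
    ss_total = sum((r - grand) ** 2 for _, r in residuals)
    ss_between = sum(
        len(v) * (mean(v) - grand) ** 2 for v in by_species.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Toy data (hypothetical log-likelihoods, not from the study).
toy = [
    ("kinase", "human", -1.0),
    ("kinase", "thermophile", -3.0),
    ("protease", "human", -2.0),
    ("protease", "thermophile", -4.0),
]
fraction = species_variance_explained(toy)
```

In this toy set the thermophile scores lower than the human sequence by the same amount in both families, so once the family means are removed, species accounts for all of the remaining variance. Real data would be unbalanced across species and families, where a proper mixed-effects or regression model would be a better fit than this simple centering.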

So why does this matter? If we're designing proteins for humans, mice, or E. coli (model organisms), it might not matter as much, but the authors found that the unique adaptations of extremophiles (organisms that can tolerate extreme environments like hot springs) might be minimized after pLM likelihood-based design. Something to keep in mind depending on your use case!
