The Continuous Language of Protein Structure

Just as language is composed of sublexical tokens that combine to form words, sentences, and paragraphs, protein backbones are composed of sub-structural elements that combine to form helices, sheets, folds, domains, and chains. Autoregressive language models operate on discrete tokens, whereas protein structure is inherently continuous, and generative approaches to protein design have borrowed more from image generation than from language modeling. But autoregressive models do not inherently require their inputs and outputs to be discrete. Here we describe a generative autoregressive language model over the continuous space of protein backbones, in which the distribution over the placement of each successive amino acid is conditioned on all preceding residues and can be sampled from, one residue after another. We show that this approach can learn to sample diverse and realistic protein chains, opening a new potential avenue for in silico protein design.
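The sampling scheme described above admits a compact illustration. The sketch below is a hypothetical toy, not the model from the paper: a causal Transformer reads the residues placed so far and outputs the mean and log-variance of a Gaussian over the next residue's backbone coordinates, and a chain is generated by sampling one residue after another. Representing each residue as a single 3-D coordinate, the diagonal-Gaussian output head, the omission of positional encodings, and all layer sizes are simplifying assumptions made for brevity.

```python
# Minimal sketch of autoregressive generation over continuous coordinates.
# Untrained; it only illustrates the conditioning and sampling loop.
import torch
import torch.nn as nn


class ContinuousAutoregressiveBackbone(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(3, d_model)                      # embed one 3-D coordinate
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))   # "begin chain" token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)   # causal via attention mask
        self.to_mean = nn.Linear(d_model, 3)                    # mean of next position
        self.to_logvar = nn.Linear(d_model, 3)                  # log-variance of next position

    def forward(self, coords: torch.Tensor):
        # coords: (batch, n_placed, 3) residues placed so far (may be empty).
        b = coords.shape[0]
        h = torch.cat([self.start.expand(b, -1, -1), self.embed(coords)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(h.shape[1])
        h = self.decoder(h, mask=mask)
        last = h[:, -1]                              # summary of all preceding residues
        return self.to_mean(last), self.to_logvar(last)

    @torch.no_grad()
    def sample(self, n_residues: int) -> torch.Tensor:
        # Place residues one after another, each conditioned on all previous ones.
        coords = torch.zeros(1, 0, 3)
        for _ in range(n_residues):
            mean, logvar = self(coords)
            nxt = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
            coords = torch.cat([coords, nxt.unsqueeze(1)], dim=1)
        return coords.squeeze(0)                     # (n_residues, 3)


backbone = ContinuousAutoregressiveBackbone()
print(backbone.sample(64).shape)                     # torch.Size([64, 3])
```

In practice one would represent each residue by its full backbone (e.g. N, CA, C atoms or a rigid-body frame), use a richer output distribution than a single diagonal Gaussian, and train by maximizing the likelihood of observed structures under the per-residue distributions; those choices are assumptions beyond what the abstract states.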