Nature is the most complete pharmacy. If you take a stroll in the park, you can see hundreds of species engaged in very old chemical warfare: hemlock using poison to avoid being eaten, fungi colonizing new territory by spreading penicillin, ants using pheromones to communicate…
We have been making use of this pharmacy since the beginning of time, with a very limited understanding of why certain remedies worked and why others didn't. As with many other things, our understanding of the chemistry of drugs has taken an impressive leap in the last century: where someone with a chronic illness like diabetes would once have died, they can now live a perfectly normal life with the help of insulin!
Insulin is a very special type of drug called a peptide. Peptides are really interesting weapons in the chemical arsenal of nature: they are much bigger than traditional drugs like penicillin (as you can see below), and they can serve as new antibiotics or anticancer drugs.
We've gotten quite good at making peptides (even designing them from scratch), but it can be hard to know what function (a.k.a. bioactivity) they'll have. Developing tools for predicting their bioactivity can help us better understand how to develop new drugs and how peptide biology works.
In this blog post I'll tell you about AutoPeptideML, a solid tool that automates key steps in the development cycle of peptide bioactivity predictors, leading to more trustworthy model evaluation and enabling researchers without a computational background to develop new predictive models. The first step in knowing a peptide's bioactivity is determining which bioactivity it has (is it antibacterial, or signaling, or none at all?). Here, we're treating bioactivity prediction as a classification problem: a peptide either has a specific bioactivity or it doesn't.
The tool can be used through different interfaces, for users with different levels of technical know-how: a web server GUI (which you can use here: peptide.ucd.ie/AutoPeptideML), a CLI application (with instructions on our GitHub project: here), and a Python API. Throughout this post we'll explore the different steps of the workflow and the flexibility that the Python API offers. I'll walk you through dataset preparation, partitioning, mathematically representing the peptides, and hyperparameter optimization. As an example, we will use the Antibacterial dataset from this paper. To download it, go to the following link: dataset, and download AB_positives.csv.
AutoPeptideML workflow
For more detailed information about the effect of the different design choices, take a look at the original publication.
Part 1: Dataset preparation
The first problem faced when building peptide bioactivity predictors is the lack of ground-truth negative data. There are a fair number of repositories with peptide sequences known to have certain biological properties, but it is rare for them to contain sequences known to be negative. Weakly labeled negatives have to be found and introduced into the dataset.
How do you define a negative? Different researchers have tried different answers: peptides not known to have the bioactivity of interest, random protein fragments, or peptides with a different, unrelated bioactivity. We decided on an intermediate approach: sampling peptides from a database with multiple bioactivities. The idea is to maximise the diversity of the negative samples while keeping them constrained to bioactive peptides, so as to minimise the number of confounding variables that the model could exploit and that have no direct bearing on the bioactivity we are trying to predict. Other steps in this process include cleaning sequences with non-canonical residues.
We use a curated version of the Peptipedia Database (see the original database) to search for bioactive peptides to use as negatives. In our example, we are interested in antibacterial peptides, so we exclude antibacterial peptides from the search for negatives, so as not to introduce false negatives into the dataset.
from autopeptideml import AutoPeptideML
import pandas as pd
# The AutoPeptideML class wraps all the methods for building a new predictor
apml = AutoPeptideML(verbose=False, threads=4, seed=42)
# Load dataset with only positive samples from FASTA format
df_pos = apml.curate_dataset('antibacterial_pos.fasta')
# Search for negative peptides and exclude antibacterial peptides from the search
df = apml.autosearch_negatives(df_pos, positive_tags=['Antibacterial/antibiotic'],
                               proportion=1.0)
The proportion key defines how many negative peptides we sample from the database. proportion=1.0 indicates that the final dataset will have a 1:1 ratio of negatives to positives; proportion=2.0 would indicate 2:1, and so on.
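Before moving on, it can be worth sanity-checking the resulting class balance. A minimal sketch, assuming the combined dataframe stores the binary labels in a column named Y (the column name is an assumption; inspect df.columns to confirm it in your version of the library):

# Count positives (1) and negatives (0) in the combined dataset
# NOTE: 'Y' is an assumed label column name, check df.columns to confirm
print(df['Y'].value_counts())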
Part 2: Dataset partitioning
Dataset partitioning refers to the step of dividing the data into training/validation/evaluation subsets. Standard ML dataset partitioning works under the assumption that samples are independent and identically distributed (i.i.d.), which allows the dataset to be partitioned by simple random sampling. However, biochemical data is not i.i.d.: similar molecules tend to share similar properties, and the experimental process through which the data was generated is inherently biased. For this reason, we use an alternative known as similarity partitioning, which generates independent training and evaluation sets by ensuring that no peptide in the training set has a similarity above a threshold to any peptide in the evaluation set.
The threshold depends on how the data was generated and how strict you want to be. The strictest threshold in common use is 30% sequence identity, so we recommend it in most situations. Another strategy, which for simplicity we are not using here but which can be quite informative, is to repeat the training/evaluation cycles with increasing thresholds (30, 40, 50, …) to get a sense of how the model performs in different scenarios.
The training set is then subdivided into 10 training/validation subsets for cross-validation. The objective of the validation set is to test whether the model is learning from the training data at all. For this purpose, similarity partitioning is not necessary (remember that similarity partitioning estimates how the model will behave on new, unseen data). We therefore recommend random sampling, which, though it may overestimate the performance of the model, is sufficient for determining whether the model is learning. That said, you can also use similarity partitioning in this step if the model is struggling to generalize or you suspect overfitting.
# Train/test partition
datasets = apml.train_test_partition(df, threshold=0.3, test_size=0.2,
                                     alignment='mmseqs+prefilter')
# Train/validation folds
folds = apml.train_val_partition(datasets['train'], method='random',
                                 threshold=None, n_folds=10, alignment=None)
The alignment algorithms supported are mmseqs+prefilter, mmseqs, and needle. The first two use a local alignment algorithm; they differ in that the first conducts a prior k-mer filtering step, making it much faster. needle refers to the Needleman-Wunsch algorithm, which performs a global alignment. In general, global alignment is more accurate than local alignment, but also much more computationally expensive.
train_val_partition also accepts graphpart as a method, which generates the folds using a similarity partitioning algorithm (more details in the original publication).
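If you wanted to try the threshold sweep mentioned earlier, it could look something like the sketch below, which simply reuses train_test_partition with increasing thresholds (the list of values is illustrative):

# Sketch: repeat the partitioning at increasingly permissive identity thresholds
for threshold in [0.3, 0.4, 0.5]:
    datasets = apml.train_test_partition(df, threshold=threshold, test_size=0.2,
                                         alignment='mmseqs+prefilter')
    # ...train and evaluate a model on each partition to see how performance
    # changes as more train/test similarity is allowed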
Part 3: Mathematical representation of the peptides
The next step is to provide a mathematical representation of the peptides. Traditional methods rely on analytical featurization, including information about peptide composition, physicochemical properties, or evolutionary information (check this review if you want to learn more). However, the adequate combination of features is dataset-dependent and there is no clear standard methodology. We bypass the problem of performing feature selection by leveraging pre-trained Protein Language Models (PLMs). Currently, AutoPeptideML supports: ESM1b, ESM2-8M, ESM2-35M, ESM2-150M, ESM2-650M, Prot-BERT, ProtT5-XL-Uniref50, and ProstT5.
We use the PLMs only to generate a vector describing each peptide (by average pooling the individual token representations). Our experiments indicate that, under this setup, there is no significant difference between the models across datasets, though some perform better than others on individual datasets.
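To make the pooling step concrete, here is a minimal sketch of how such a per-peptide vector can be computed with the HuggingFace transformers library. It illustrates the idea rather than reproducing AutoPeptideML's internal code, and uses the public facebook/esm2_t6_8M_UR50D checkpoint (ESM2-8M):

import torch
from transformers import AutoTokenizer, AutoModel

# Load the ESM2-8M checkpoint from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('facebook/esm2_t6_8M_UR50D')
model = AutoModel.from_pretrained('facebook/esm2_t6_8M_UR50D')

sequence = 'GIGKFLHSAKKFGKAFVGEIMNS'  # magainin 2, an antibacterial peptide
inputs = tokenizer(sequence, return_tensors='pt')
with torch.no_grad():
    token_reps = model(**inputs).last_hidden_state  # (1, n_tokens, 320)

# Average pooling: collapse the per-token representations into a single
# fixed-length vector describing the whole peptide (special tokens are
# included here for simplicity)
peptide_rep = token_reps.mean(dim=1).squeeze(0)  # shape: (320,)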
To use these models, we introduce a wrapper class, RepresentationEngine, that loads the models and tokenizers from HuggingFace and uses them to compute the representations.
# The RepresentationEngine wrapper ships with the autopeptideml package
# (the exact import path may differ between versions)
from autopeptideml.utils.embeddings import RepresentationEngine

# Prepare the PLM (the second argument is the batch size)
re = RepresentationEngine('esm2-8m', 12)
# Calculate a vector representation for every peptide in the dataset
id2rep = apml.compute_representations(datasets, re)
Part 4: Hyperparameter optimisation
AutoPeptideML uses the training/validation folds to perform model selection and hyperparameter optimization through Bayesian optimization leveraging the implementation in the Optuna library. The hyperparameter space can be defined using a dictionary:
hpo_space = {
    "trials": 100,
    "model_selection": [
        {
            "model": "K-Nearest Neighbours",
            "optimization_metric": "test_matthews_corrcoef",
            "hyperparameter-space": [
                {
                    "name": "n_neighbors", "type": "int",
                    "min": 1, "max": 30, "log": "False"
                },
                {
                    "name": "weights", "type": "categorical",
                    "values": ["uniform", "distance"]
                }
            ]
        }
    ]
}
Several models can be included within the model_selection list; here only one is included for clarity. An alternative configuration is to substitute the key model_selection with ensemble. In this case, HPO will be performed for each model independently and the final model will be a soft-voting ensemble of the individual models.
hpo_space = {
    "ensemble": [
        {
            "model": "K-Nearest Neighbours",
            "trials": 30,
            "optimization_metric": "test_matthews_corrcoef",
            "hyperparameter-space": [
                {
                    "name": "n_neighbors", "type": "int",
                    "min": 1, "max": 30, "log": "False"
                },
                {
                    "name": "weights", "type": "categorical",
                    "values": ["uniform", "distance"]
                }
            ]
        },
        {
            "model": "RFC",
            "trials": 30,
            "optimization_metric": "test_matthews_corrcoef",
            "hyperparameter-space": [
                {
                    "name": "max_depth", "type": "int",
                    "min": 2, "max": 20, "log": "False"
                },
                {
                    "name": "n_estimators", "type": "int",
                    "min": 10, "max": 100, "log": "False"
                }
            ]
        }
    ]
}
The names of the parameters correspond to the scikit-learn implementation of each algorithm; check the corresponding documentation to get a better sense of which parameters you may want to include, their meaning, and their usual values.
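If you are unsure which parameter names a given model accepts, scikit-learn itself can list them for you:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# get_params() returns every tunable hyperparameter with its default value
print(KNeighborsClassifier().get_params())    # includes 'n_neighbors', 'weights', ...
print(RandomForestClassifier().get_params())  # includes 'max_depth', 'n_estimators', ...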
Finally, train and evaluate the model:
model = apml.hpo_train(hpo_space, datasets['train'], id2rep, folds)
results = apml.evaluate_model(model, datasets['test'], id2rep)
The evaluate_model method will generate plots describing model performance, such as the Receiver Operating Characteristic curve, the Precision-Recall curve, and the calibration curve. It also generates a PDF summary with a guide on how to interpret the results. You can see an example below.
Conclusion
Although AutoPeptideML was designed for a non-technical audience, it offers a lot of flexibility for more advanced users, and we hope it can be useful in your own projects. If you run into any issues, notice bugs, or miss any features, please open an issue in the GitHub repository.
Author bio
Raúl is a second-year industrial PhD student, both in the ShieldsLab under the supervision of Prof. Denis Shields and at IBM Research Dublin under the supervision of Thanh Lam Hoang; he also belongs to the SFI Centre for Research Training in Genomics Data Science. After his undergraduate studies in Biotechnology, he specialized in machine learning for biochemistry during his master's thesis, working with Jean-Didier Maréchal. Raúl's interests lie in exploring how to evaluate and improve the generalization capabilities of Foundation Models for Protein Representation Learning, and in developing tools that leverage existing models to facilitate the development of new predictive models.