
Improving developability of biologics through better solubility prediction

Poor developability - The curse of biologics

Inadequate solubility is a mundane yet troublesome issue that affects protein-based drugs, also known as biologics. With CamSol-PTM we are one step closer to solving it.

Solubility is difficult to measure experimentally, especially in high throughput, so it’s often left for late-stage experiments. Companies and laboratories around the globe are trying to address this problem by developing in silico approaches that can screen millions of compounds in mere minutes. These approaches require understanding how the target behaves, modeling the drug, investigating binding, and understanding the underlying physicochemical properties. For some of these aspects, machine learning (ML) has been applied successfully, with AlphaFold and RoseTTAFold as prime examples, while for others, such as the prediction of physicochemical properties, there’s still room for breakthroughs.

Solubility prediction for proteins and peptides has a long history, with methods ranging from correlations with thermal stability to calculations based on physicochemical properties and ML-based approaches. While ML-based approaches are increasingly at the forefront of many problems that hamper drug development, solubility prediction has not seen major improvements because there is a severe lack of high-quality data on which to train machine learning algorithms. This is the main reason the most successful solubility predictors are still based on calculations of physicochemical properties.

CamSol can help!

We developed CamSol with the goal of providing a computational alternative to experimental assays to measure solubility. We consider this goal achieved, since CamSol is on par with state-of-the-art experimental methods in terms of performance, as shown below.

CamSol provides results that are effectively indistinguishable from those of experimental solubility assays, as its predictions correlate with experiments as much as different experimental assays correlate with each other. The experimental assays listed in the figure are CIC: cross-interaction chromatography; SMAC: stand-up monolayer adsorption chromatography; AC-SINS: affinity-capture self-interaction nanoparticle spectroscopy; HIC: hydrophobic interaction chromatography; Tm: melting temperature.

CamSol combines the intrinsic physicochemical properties of amino acids to calculate a solubility score for proteins and peptides. It does not require prior knowledge of the structure, which makes it especially powerful in early drug design phases where structures are not readily available. While it can be used to predict the solubility of any protein, it is particularly accurate at predicting the change in solubility upon small (even single) mutations, which greatly aids antibody design.

Over the past few years, we have developed CamSol further to extend its capabilities and fill gaps in the in silico drug development landscape. In early 2023, we introduced CamSol-pH, a new version of CamSol that incorporates the strong effects that changes in formulation pH have on protein solubility.

In late 2023, we introduced CamSol-PTM, a solubility predictor that can handle non-natural amino acids and, to our knowledge, the first method of its kind. Certain unique functionalities of non-natural amino acids are useful for ensuring that biologics meet developability criteria, so it is imperative to have software that can accurately predict their effects on solubility. This entails predicting the effects of standard post-translational modifications such as phosphorylation, but also those of completely new functionalities such as cyclohexyl residues.

Non-natural amino acids can be amino acids modified by typical post-translational modifications such as phosphorylation or acetylation (upper panel), but they also encompass residues with completely new functional groups (lower panel).

How does CamSol-PTM work?

For each amino acid, CamSol calculates a solubility score based on its hydrophobicity, charge and structural propensity (how likely it is to be part of an alpha-helix or beta-sheet). Together, these scores make up a solubility profile of the protein. The profile is then smoothed to account for the effects of neighboring residues, and two further corrections are applied for patterns known to affect aggregation behavior.

An overview of the workflow of CamSol-PTM. The hydrophobicity, charge and structural propensities are calculated for each non-natural amino acid and then fed into the CamSol framework which uses these values to predict the solubility of the protein. Figure taken from Oeller et al. Nat Commun 2023.
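To make the idea of a smoothed per-residue profile more concrete, here is a minimal Python sketch. It is not the CamSol implementation: the Kyte-Doolittle hydropathy scale stands in for CamSol's own hydrophobicity values, the charges are a crude neutral-pH approximation, the weights and the window size of 7 are hypothetical, and the structural propensity term and the two pattern corrections are left out entirely.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values: a published scale used here as a stand-in,
# NOT the hydrophobicity table used by CamSol.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude side-chain charges at neutral pH (assumption; histidine treated as neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}

# Hypothetical weights, chosen only for illustration.
W_HYDROPHOBICITY = -1.0  # hydrophobic residues lower the score
W_CHARGE = 2.0           # charged residues raise the score


def residue_scores(sequence):
    """Return one unsmoothed score per residue (higher = more soluble)."""
    return [
        W_HYDROPHOBICITY * KD_HYDROPATHY[aa] + W_CHARGE * abs(CHARGE.get(aa, 0.0))
        for aa in sequence.upper()
    ]


def smooth(values, window=7):
    """Centered moving average; the window is truncated at both termini."""
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def solubility_profile(sequence, window=7):
    """Smoothed per-residue profile, so each position reflects its neighbors."""
    return smooth(residue_scores(sequence), window)


def overall_score(sequence, window=7):
    """Collapse the profile into a single number by averaging it."""
    return mean(solubility_profile(sequence, window))


if __name__ == "__main__":
    wild_type = "AVILAVILAV"
    mutant = "AVILKVILAV"  # single A-to-K substitution at position 5
    print(overall_score(wild_type), overall_score(mutant))
```

Comparing `overall_score` for a sequence and a single-point mutant, as in the last lines, mirrors the mutation-ranking use case described earlier, although the numbers from this toy model carry no quantitative meaning.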

Since the original CamSol framework works well, we kept and expanded it for CamSol-PTM. The physicochemical property values mentioned above are known for the standard amino acids, but not for most non-natural amino acids, so they have to be predicted.

For hydrophobicity and charge, many accurate predictors are available. We decided to use the pIChemiSt suite to calculate pKa values and the CrippenTool from RDKit to predict hydrophobicity. For the secondary structure propensities, we developed our own predictors based on molecular weight, polar surface area, the number of hydrogen-bond donors and acceptors, and the number of rotatable bonds. The combination of these predictors made it possible to accurately predict the properties of non-natural amino acids for use in our CamSol framework.
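The descriptors named above can all be computed from a SMILES string with RDKit. The sketch below shows one way to do so; it is an illustration, not the CamSol-PTM code. It assumes that the Crippen hydrophobicity estimate corresponds to `rdkit.Chem.Crippen.MolLogP`, it uses the free amino acid phosphoserine as input (how CamSol-PTM represents an in-chain residue is not covered here), and the function name `describe_residue` is hypothetical. The pKa step with pIChemiSt and the propensity predictors themselves are not shown.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors


def describe_residue(smiles):
    """Compute the descriptors mentioned in the text for one SMILES string.

    Hypothetical helper for illustration; not part of CamSol-PTM.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        # Crippen logP estimate (assumed to be the hydrophobicity measure meant above)
        "crippen_logp": Crippen.MolLogP(mol),
        "molecular_weight": Descriptors.MolWt(mol),
        # Topological polar surface area
        "polar_surface_area": rdMolDescriptors.CalcTPSA(mol),
        "h_bond_donors": Lipinski.NumHDonors(mol),
        "h_bond_acceptors": Lipinski.NumHAcceptors(mol),
        "rotatable_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
    }


# Free L-phosphoserine, written as a SMILES string.
PHOSPHOSERINE = "N[C@@H](COP(=O)(O)O)C(=O)O"

if __name__ == "__main__":
    for name, value in describe_residue(PHOSPHOSERINE).items():
        print(f"{name}: {value}")
```

A predictor of secondary structure propensity would take a dictionary like this one as its input features.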

If you want more detail about how CamSol or CamSol-PTM works, take a look at our papers!

Can I use CamSol-PTM for any kind of non-natural amino acid?

After confirming the results of our predictions on an initial set of non-natural amino acids that covered a broad range of chemical properties, we set out to automate and streamline the software. We ended up with a version that needs just two inputs: the sequence and the SMILES code for any non-natural amino acid that is part of the protein. By providing these inputs, the user can predict the solubility of proteins containing any kind of non-natural amino acid. We do not (yet) recommend using it for very large non-natural amino acids, such as glycosylations or lipids, as the method has not yet been validated for these kinds of residues.

CamSol is free for academics!

CamSol-PTM (like all versions of CamSol) can be licensed for industry users, but it is free for academic users. It can be accessed via a web server at https://www-cohsoftware.ch.cam.ac.uk following a free user registration. It is straightforward to use, as only the sequence of the protein and the SMILES code for each non-natural amino acid are required.

Looking forward, CamSol-PTM was not built on deep learning models because of the lack of high-quality data, but it would benefit greatly from them. Pharmaceutical companies often possess a wealth of proprietary data from their own testing, and these data could be crucial for moving these methods to the next level.

Overall, with CamSol-PTM we add a powerful method to the toolbox of early drug discovery to accelerate drug development.

__________________________________________________________________________________

Marc Oeller is a postdoctoral fellow at the Department of Proteomics and Signal Transduction at the Max Planck Institute of Biochemistry in Munich, Germany. During his PhD at the University of Cambridge in the group of Michele Vendruscolo he developed software that predicts the solubility of proteins and peptides.

__________________________________________________________________________________

Further reading:

Sequence-based prediction of the intrinsic solubility of peptides containing non-natural amino acids. Nat Commun 2023.

The CamSol method of rational design of protein mutants with enhanced solubility. J Mol Biol 2015.

Rapid and accurate in silico solubility screening of a monoclonal antibody library. Sci Rep 2017.

Sequence-based prediction of pH-dependent protein solubility using CamSol. Briefings in Bioinformatics 2023.

In Vitro and in Silico Assessment of the Developability of a Designed Monoclonal Antibody Library. mAbs 2019.

Assessment of Therapeutic Antibody Developability by Combinations of In Vitro and In Silico Methods. Therapeutic Antibodies 2022.

Automated Optimisation of Solubility and Conformational Stability of Antibodies and Proteins. Nat Commun 2023.

Protein Solubility Predictions Using the CamSol Method in the Study of Protein Homeostasis. Cold Spring Harbor Perspectives in Biology 2019.
