Nikolaos Meimetis · PhD at MIT BE

AutoTransOP: translating omics signatures without orthologue requirements using deep learning

Animal and in vitro culture models have been essential in developing and evaluating human therapeutics and vaccines. Although these models have helped with understanding the mechanisms of diseases in many studies, they don’t fully capture human biology, contributing to numerous failed clinical trials.

There's obviously a translation gap between our model systems and human biology.

To try to bridge that gap, we built a deep learning framework called AutoTransOP, recently published in npj Systems Biology and Applications. It enables us to translate and make predictions across biological contexts (e.g. predicting human gene expression from mouse data) and to propose perturbations that make one system more predictive of another’s behavior. The model also offers biological interpretability, which is key for getting stakeholders like clinicians on board.

In this post I will explain the bigger picture of the problem we are trying to solve, how AutoTransOP is built and how it works, and the framework’s different capabilities and limitations.

Lost in Translation

The ethical and practical constraints on human subject research have led to the development of in vitro (such as cell lines, organoids, etc.) and in vivo (animal) models for studying disease mechanisms, potential therapies, and vaccination modalities.

Unfortunately, observations made in these models don’t always translate well to humans in the clinic. A treatment proven to be safe and efficacious in cellular models and animals isn’t necessarily safe and efficacious in humans, due to functional divergence in orthologous biomolecules, or even the absence of such orthologues (e.g. the antibodies produced by different species against similar antigens can be substantially different in structure). Even within the same species, the transcriptional response to chemical stimuli can be cell type-specific due to distinct genetic profiles.

Many statistical and Machine Learning (ML) models have been built to find similarities between species and experimental models, but most of them focus on direct correlations between analogous features across species, despite known species differences. To tackle this challenge, Brubaker et al. developed TransCompR, which maps human data into the principal component space of data from another species to enable translation. However, this model still requires homologues or comparable molecular features between species, and it is a linear method. We now understand, however, that the relationships between species, and biology in general, tend to be non-linear. Deep Learning (DL) models, such as autoencoders, can approximate the non-linear relationships underlying different biological systems and species.

A flexible framework for omics translation without orthologue requirements

The AutoTransOP framework.

AutoTransOP uses ideas from autoencoder-based language translation models and the compositional perturbation autoencoder (CPA) to map samples to a global cross-systems space, where the distance between samples coming from the same condition (e.g. drug + dose + time point) is minimized and their mutual information maximized. Like all autoencoder models, AutoTransOP is also trained to minimize the reconstruction error of the input data.
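To make the objective concrete, here is a minimal PyTorch sketch (my own simplification, not the published training code) of a loss of this kind: per-system reconstruction error plus a term that pulls same-condition samples together in the shared latent space. The cosine-similarity alignment below stands in for the distance-minimization and mutual-information terms used in the actual framework.

```python
import torch
import torch.nn.functional as F

def cross_system_loss(x_a, x_b, recon_a, recon_b, z_a, z_b, lambda_align=1.0):
    """x_a / x_b: profiles of the same condition measured in systems A and B.
    recon_*: decoder outputs; z_*: latent embeddings from each system's encoder."""
    # Standard autoencoder objective: reconstruct each system's own input.
    recon_loss = F.mse_loss(recon_a, x_a) + F.mse_loss(recon_b, x_b)
    # Pull embeddings of the same condition together in the global latent space
    # (a simple stand-in for the distance / mutual-information terms).
    align_loss = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()
    return recon_loss + lambda_align * align_loss
```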

Just as two languages may not have matching letters, omics profiles from different biological systems may lack common biomolecules, or share biomolecules with different functionalities. In that sense, AutoTransOP is like a “translator” that maps two languages into a third, global language, and then enables translation between them.

AutoTransOP consists of separate Artificial Neural Network (ANN) encoders and decoders for each biological system, which share the same global latent space. This separation removes the need for a 1-1 mapping between the features of the two systems. Unlike traditional autoencoders, the goal of the framework isn’t to construct a latent space that captures all the information of the perturbations, but to create a global space that captures mostly information about conditions and stimuli, while filtering out as much system-specific information as possible to enable translation of perturbations.

We implemented two main variations of the framework. The first (AutoTransOP v1) uses a single global latent space, while the second (AutoTransOP v2, shown here) incorporates the idea from CPA of two separate latent spaces: (1) a global latent space and (2) a composed latent space, in which the system-specific biological information (which species, which cell type) is recovered through a trainable vector representing that effect. The first variation is the simplest and potentially requires less data, but its latent space does not capture all of the information, as usually happens with autoencoders. The second variation addresses this by separating the latent spaces, although the separation is linear, adds more parameters to fit, and requires a more complex training procedure.
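To illustrate the overall layout, below is a minimal PyTorch sketch of the two-encoder/two-decoder architecture with a shared global latent space and a CPA-style trainable system vector (the v2 idea). The layer sizes, the additive composition of the system vector, and the translate helper are my own assumptions for illustration, not the published implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class CrossSystemAutoencoder(nn.Module):
    def __init__(self, dim_a, dim_b, latent_dim=128):
        super().__init__()
        # One encoder/decoder pair per biological system; no shared features needed.
        self.enc = nn.ModuleDict({"A": mlp(dim_a, latent_dim),
                                  "B": mlp(dim_b, latent_dim)})
        self.dec = nn.ModuleDict({"A": mlp(latent_dim, dim_a),
                                  "B": mlp(latent_dim, dim_b)})
        # One trainable vector per system, mimicking the v2 composed latent space.
        self.system_vec = nn.ParameterDict(
            {s: nn.Parameter(torch.zeros(latent_dim)) for s in ("A", "B")})

    def encode_global(self, x, system):
        # Remove the system-specific effect to land in the global latent space.
        return self.enc[system](x) - self.system_vec[system]

    def translate(self, x, source, target):
        # Encode with the source system's encoder, decode with the target's decoder.
        z_global = self.encode_global(x, source)
        return self.dec[target](z_global + self.system_vec[target])
```

Translation then amounts to encoding with one system’s encoder and decoding with the other’s, e.g. a call like model.translate(x_mouse, "A", "B") for a hypothetical mouse-to-human mapping.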

AutoTransOP also includes a few complementary classification tasks that condition the latent space; more details can be found in our publication.
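The publication has the details of those tasks, but as a general illustration, conditioning a latent space is often done by attaching a small classifier head trained jointly with the autoencoders; the sketch below is hypothetical and not the exact heads used in AutoTransOP.

```python
import torch.nn as nn

class LatentClassifier(nn.Module):
    """Hypothetical head predicting a label (e.g. condition or system) from z."""
    def __init__(self, latent_dim=128, n_classes=10):
        super().__init__()
        self.head = nn.Linear(latent_dim, n_classes)

    def forward(self, z):
        # Returns logits; a cross-entropy term on these would be added to the total loss.
        return self.head(z)
```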

AutoTransOP comparison to published approaches when the orthologue requirement is satisfied

When translating between species, we can typically only consider features (such as genes) that exist in both species, or that can be mapped to each other as features with a similar function. Many models (such as the previously mentioned TransCompR) have this requirement just to be applicable at all. However, this discards information from features that do not exist in both species (like letters that exist in only one of two alphabets from different languages). Using these features is thus more challenging (most existing approaches cannot even be applied). Before interrogating AutoTransOP on this task, we first compared it against existing orthologue-based approaches in cases where they can be used.

Models’ performance in reconstructing and translating gene expression profiles between A375 and HT29, the two cell lines sharing the most perturbations in the L1000 dataset, using only the 978 measured landmark genes.

The first thing we wanted to do was to see whether we could translate the gene expression response to a drug between different cell lines within the same species (before looking into different species). Our model did at least as well as orthologue-based approaches [DeepCellState (DCS), FIT, and TransCompR], which require the same genes to have been measured in the two cell lines (and cannot even be applied when different features are measured). In terms of translation, all of our framework’s variations provide a statistically significant increase in performance over direct translation across all metrics; the framework outperforms FIT and the original DCS method, and performs similarly to TransCompR.

Next, before moving to case studies with no 1-1 mapping between features, we wanted to test AutoTransOP’s performance in translating between uncorrelated genes of the same condition within a single cell line, which corresponds to an artificial translation scenario with different features, and the model performs exceptionally well. To do that, a translation model is repeatedly trained using data from one cell line, where one of the autoencoders in the framework reconstructs half of the landmark genes and the other autoencoder reconstructs the rest (a sketch of this setup is shown after the figure caption below). We repeated this across 16 different cell lines. This can perhaps even be considered a gene imputation capability, but more importantly, it indicates the potential AutoTransOP holds for translation without orthologue requirements.

Performance when using different input genes for the same condition in the L1000 dataset. We validate how well AutoTransOP can predict the hidden half of the genes in an experiment using the other half. We also randomly shuffle genes when training models, in order to build randomized models that serve as a baseline. We observe significantly higher performance than this baseline, meaning that AutoTransOP can translate between uncorrelated, non-matching features.
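For readers curious about the mechanics of this gene-split evaluation, here is a rough sketch of how the landmark genes could be divided into two disjoint halves that are then treated as two separate "systems"; the function name and the samples-by-genes DataFrame layout are assumptions for illustration, not taken from our actual code.

```python
import random
import pandas as pd

def split_landmark_genes(expr: pd.DataFrame, seed: int = 0):
    """expr: samples x landmark-genes expression matrix for one cell line."""
    genes = list(expr.columns)
    random.Random(seed).shuffle(genes)   # random, disjoint split of the 978 genes
    half = len(genes) // 2
    system_a = expr[genes[:half]]        # one half acts as "system A"
    system_b = expr[genes[half:]]        # the other half acts as "system B"
    return system_a, system_b
```

Each autoencoder of the framework is then trained on one of the two halves, and translation is evaluated from one half to the other.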

Identifying features that are important for translation

Once we know that the model is trustworthy in translating between biological systems, one question that comes up is why are the two systems different to begin with? What features does this non-linear model consider as important to perform this translation? And what are the changes that need to occur to push one system closer to another? 

To answer these questions we used an integrated gradients approach, where we identify the importance of each input feature in one biological system (e.g. each gene in one cell line) in predicting the value of each output feature in another biological system (e.g. each gene in another cell line). This feature importance can be used with approaches based on Gene Set Enrichment Analysis (GSEA) to identify enriched transcription factors, pathways, or other gene sets, that would be important to perform the translation.
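As a rough illustration of this attribution step, here is a self-contained integrated-gradients sketch for a single output feature of a translation model. The model_translate function, the zero baseline, and the step count are assumptions for illustration, not the exact settings of our analysis.

```python
import torch

def integrated_gradients(model_translate, x, target_idx, baseline=None, steps=50):
    """x: 1-D tensor of source-system features (e.g. genes in one cell line).
    Returns one attribution score per input feature for output `target_idx`."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    # Points along the straight-line path from the baseline to the actual input.
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # shape (steps, 1)
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    # Gradient of the chosen output feature w.r.t. the inputs, along the path.
    out = model_translate(path)[:, target_idx].sum()
    grads = torch.autograd.grad(out, path)[0]               # shape (steps, n_inputs)
    avg_grad = grads.mean(dim=0)                             # Riemann-sum average
    return (x - baseline) * avg_grad                         # attribution per input gene
```

The resulting per-gene scores can then be ranked and fed into GSEA-style enrichment against, for example, transcription factor regulons or pathway gene sets, as described above.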

Top significantly enriched TFs, based on the importance scores of genes, for translating the same condition between two cell lines. Enrichment means that many of the genes a TF controls are found to be important. The Venn diagrams represent the overlap of important TFs derived from the two directions of translation.

Conceptually, in this way AutoTransOP can propose perturbations to transform one biological system into another. For example, when translating PC3 to HA1E (and vice versa), since one is cancerous and the other is not, we would expect to observe predominantly TFs whose activities are known to be regulated in cancer. This makes sense from a drug development perspective: to push a cancer cell line toward a non-cancerous state, TFs associated with cancer should be regulated or targeted. Indeed, looking at the top 16 TFs when translating PC3 to HA1E, we observe TFs such as E2F2, MYC, FOXM1, RELA, JUN, FOSM, and even TP53, which is often a target of anti-cancer therapeutics, among others. When translating from HA1E to PC3, we identify similar TFs plus some tissue-specific TFs (e.g. VDR).

All these demonstrate the ability of AutoTransOP to propose tissue- or disease-specific perturbations to translate between two biological systems.

Inter-species translation to predict protection against HIV with no 1-1 mapping and limited data

It would be useful to predict how well a vaccine will protect a person, and the need to aid and speed up vaccine development by enabling translation between animal models and humans became even more obvious with the recent COVID-19 pandemic. Serology data measure the antibody response in the blood, but predicting protection against HIV in humans from SHIV readouts in non-human primates (NHPs) is a hard problem because there is no 1-1 mapping of features.

After making sure that AutoTransOP can reconstruct the serological profiles and predict protection and vaccination status with good performance, we tried to interrogate which NHP features are predictive of human protection against HIV (details on the full process in the publication).

Functional grouping of NHP features predictive of protection-associated human features. In the top nightingale rose plot, NHP features are categorized by antigenic target. In the bottom nightingale rose plot, NHP features are categorized by serological feature type. On the right, a network visualization shows the associations between specific NHP and human serological features related to human protection.

The important note here is that while these features are potentially predictive of human protection, they aren’t necessarily associated with NHP protection. The top human features identified are generally related to V1V2-specific IgG titers, while the top (human-predictive) NHP features span a wide range of feature types, including Fc receptor binding, interferon gamma (IFNg) ELISpots, and IgG titers. Generally, we can try to interpret these features by classifying NHP features by antigenic target or feature type (left of Figure 7), but we acknowledge that actual validation can only be achieved by follow-up experiments.

Conclusions and Takeaways

AutoTransOP is a non-linear deep learning method that enables the translation of omics profiles across a wide range of conditions and tasks, without any orthologue requirement. Even in cases where orthologues do exist, it outperforms, or performs as well as, already published approaches. It also allows the translation of gene expression profiles at the single-cell level.

Finally, and perhaps most importantly, we demonstrated that it is possible to get some level of biological interpretability by interrogating feature importance, which is useful for proposing perturbations that move one biological system closer to another, or for identifying preclinical biomarkers of therapeutic efficacy. This could even support therapeutic discovery, as the framework proposes perturbations to move a biological system toward a desirable state.

The code to implement the model and reproduce the whole study is publicly available on GitHub. We hope you enjoyed this post, and that you will try improvements and extensions of AutoTransOP in the future! Please feel free to reach out with any questions or feedback!
