Guadalupe Gonzalez
 · Senior ML Scientist @ Prescient Design | Genentech | Roche

PDGrapher: Combinatorial prediction of therapeutic perturbations using causally-inspired neural networks

Imagine a world where finding a disease treatment is less about combing through a giant library of drugs in search of perturbagens (agents capable of shifting our biological systems into different states) that might reverse disease, and more about crafting the perfect key for a specific lock. Instead of searching endlessly through candidate drugs, hoping to find one that counteracts a disease's effects, we directly engineer the solution. Enter PDGrapher. It uses causal inference and graph deep learning to bypass the traditional search and directly predict the perturbagens most likely to shift diseased cells back to health. This approach is a leap forward: it's not just about making better guesses from a predefined library; it's about changing the way we discover drug candidates.

In contrast with existing work, PDGrapher performs direct prediction of drug leads in the form of their building blocks (targeted genes).

Framing the problem

One of the most challenging parts is defining the problem formulation. We need to mathematically represent the search for “perturbagens (represented as sets of targeted genes) that shift gene expression – a measure of phenotypic state – from a diseased to a treated state”. The transition is the key element here: we don’t want to predict perturbagens that produce a specific response; we want to predict perturbagens that create a transition from an initial state to a desired final state. We find the answer in the concept of optimal intervention design from the causal learning field.

To simplify the problem formulation, we assume no unobserved confounders. This assumption is strict and, in most cases, infeasible to test; relaxing it is something future work could address. Under this assumption, we formulate our problem using a causal model, where genes are the nodes of a causal graph and edges represent their causal relationships.

Using this causal model and optimal intervention design, we formulate our goal as predicting the set of genes that a perturbagen should target to shift node states from diseased to treated. In practice, this translates to conditional probabilities over the causal graph (more details in the paper!). As a graph deep learning person, I immediately think “GNNs!”. We then translate our problem formulation from the causal learning domain to the representation learning domain (where GNNs live). To solve this problem, we introduce PDGrapher, a causally-inspired GNN model designed to predict arbitrary perturbagens as sets of therapeutic genes (target genes) that can shift gene expression from a diseased to a desired treated state.
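As a rough way to picture the goal, here is a sketch in shorthand notation chosen for this post (the symbols are illustrative assumptions; the precise formulation is in the paper). Let V be the set of genes in the causal graph, x^d and x^t the diseased and treated expression states, and U a candidate set of target genes. The goal is then, roughly, to pick the U that makes the treated state most likely after intervening on U in a diseased cell:

```latex
% Sketch in illustrative notation; see the paper for the precise formulation.
U^{*} \;=\; \arg\max_{U \subseteq V} \; P\!\left(x^{t} \,\middle|\, \mathrm{do}(U),\; x^{d}\right)
```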

Given paired diseased and treated gene expression samples, PDGrapher predicts the set of therapeutic target genes that shift cell gene expression from diseased to treated.

We have one last challenge: there is no ground-truth gene-gene causal graph! Given that neural networks have incredibly high representation power, we assume that this will compensate for a potentially noisy and incomplete causal graph, and so we use a protein-protein interaction (PPI) network as an approximation of the causal graph.
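To make the idea of using a PPI network as the graph a bit more concrete, here is a minimal sketch of turning a list of interaction pairs into node indices and adjacency lists that a GNN could consume. This is not code from the PDGrapher repo: the function name, the placeholder gene names, and the toy edges are all invented for illustration, and treating interactions as undirected is an assumption.

```python
from collections import defaultdict

def build_gene_graph(ppi_edges):
    """Map gene names to node indices and build undirected adjacency lists."""
    genes = sorted({gene for edge in ppi_edges for gene in edge})
    index = {gene: i for i, gene in enumerate(genes)}
    neighbors = defaultdict(set)
    for gene_a, gene_b in ppi_edges:
        if gene_a == gene_b:
            continue  # ignore self-interactions in this sketch
        neighbors[index[gene_a]].add(index[gene_b])
        neighbors[index[gene_b]].add(index[gene_a])
    adjacency = {i: sorted(neighbors[i]) for i in range(len(genes))}
    return index, adjacency

# Placeholder gene names and edges, invented purely for illustration.
toy_ppi = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_D", "GENE_B")]
index, adjacency = build_gene_graph(toy_ppi)
print(index)
print(adjacency)
```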

Evaluating PDGrapher

We built datasets comprising gene expression measurements from healthy, diseased, and treated cell lines to study disease and treatment interventions. We have four datasets in total, which we named Genetic-PPI-Lung, Genetic-PPI-Breast, Chemical-PPI-Lung, and Chemical-PPI-Breast. You can find more details on datasets and baselines in the paper. Using 5-fold cross-validation, we evaluate PDGrapher and baseline methods in two settings: held-out folds that contain novel samples (random split), and a more challenging setting where held-out folds contain novel samples from a cancer type that PDGrapher has never encountered before (leave-cell-line-out split).
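The difference between the two splitting settings can be sketched in a few lines. This is a simplified illustration, not the evaluation code used for the paper: the function names and cell-line labels are hypothetical, and holding out exactly one cell line at a time is an assumption of the sketch.

```python
import random

def random_split_folds(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k held-out folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [sorted(order[i::k]) for i in range(k)]

def leave_cell_line_out(cell_lines, held_out):
    """Put every sample from `held_out` in the test set, the rest in training."""
    train = [i for i, line in enumerate(cell_lines) if line != held_out]
    test = [i for i, line in enumerate(cell_lines) if line == held_out]
    return train, test

# Hypothetical cell-line labels, one per sample.
cell_lines = ["line_1"] * 4 + ["line_2"] * 3 + ["line_3"] * 3
folds = random_split_folds(len(cell_lines), k=5)
train_idx, test_idx = leave_cell_line_out(cell_lines, held_out="line_2")
```

In the random split, samples from the same cell line can land in both training and test folds; in the leave-cell-line-out split, the model never sees the held-out cell line during training.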

We evaluate PDGrapher on two genetic datasets (A) and two chemical datasets (B)

PDGrapher efficiently predicts genetic and chemical perturbagens to shift cells from diseased to treated states. 

Given pairs of diseased and treated samples, PDGrapher is trained to output a ranking of genes, with the top-ranked genes identified as candidate therapeutic targets to shift the gene expression phenotype from a diseased to a treated state in each sample. In held-out folds that contain novel samples, PDGrapher ranks ground-truth therapeutic targets up to 34% higher in chemical intervention datasets and 16% higher in genetic intervention datasets than existing methods (Table 1). PDGrapher maintains robust performance even in held-out folds containing novel samples from a previously unseen disease (Table 2). Although the ranking of therapeutic targets is not perfect, these results are particularly exciting because they can help narrow the search from the full space of genes to a much smaller subset of candidate therapeutic leads!
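One simple way to quantify how high the ground-truth targets sit in a predicted ranking is a normalized rank score. The sketch below is an illustrative assumption and not necessarily the exact metric reported in the paper; the function name, gene names, and scores are made up.

```python
def mean_normalized_rank(scores, true_targets):
    """Average of 1 - position / (N - 1) over the ground-truth targets.

    1.0 means a target sits at the very top of the ranking, 0.0 at the bottom.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    position = {gene: i for i, gene in enumerate(ranked)}
    n = len(ranked)
    values = [1 - position[gene] / (n - 1) for gene in true_targets]
    return sum(values) / len(values)

# Made-up scores for five placeholder genes.
scores = {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.5, "GENE_D": 0.3, "GENE_E": 0.2}
print(mean_normalized_rank(scores, {"GENE_C"}))  # GENE_C is ranked second of five -> 0.75
```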

Because perturbagens target multiple genes, we measure the fraction of samples in the test set for which we obtain a partially accurate prediction, where at least one of the predicted gene targets corresponds to an actual gene target. PDGrapher consistently provides accurate predictions for more samples in the test set than baselines (Tables 1, 2), which supports the idea that even if the ranking of therapeutic targets is not perfect, PDGrapher provides useful gene sets.
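The "partially accurate prediction" idea can be sketched as follows. The cutoff used here, taking as many top-ranked genes as there are ground-truth targets for each sample, is an assumption of this sketch, as are the function name and the toy data.

```python
def fraction_partially_correct(ranked_predictions, true_targets):
    """Fraction of samples whose top-ranked genes hit at least one true target."""
    hits = 0
    for ranked, truth in zip(ranked_predictions, true_targets):
        top_k = set(ranked[:len(truth)])  # assumed cutoff: k = number of true targets
        if top_k & set(truth):
            hits += 1
    return hits / len(ranked_predictions)

# Three made-up samples: a ranked gene list and the true target set for each.
ranked_predictions = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_B", "GENE_A", "GENE_C"],
    ["GENE_C", "GENE_A", "GENE_B"],
]
true_targets = [{"GENE_A", "GENE_C"}, {"GENE_C"}, {"GENE_C"}]
fraction = fraction_partially_correct(ranked_predictions, true_targets)
print(fraction)  # samples 1 and 3 count as hits, sample 2 does not
```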

Table 1: PDGrapher does an incredibly good job at predicting the genes we’d need to target to shift cells from diseased to treated for unseen samples
Table 2: PDGrapher also does a great job at predicting the genes we’d need to target to shift cells from diseased to treated for unseen samples and cell lines!

We also find that in chemical datasets, candidate therapeutic targets predicted by PDGrapher are closer to ground-truth therapeutic targets in the gene-gene interaction network than what would be expected by chance. This implies that PDGrapher not only identifies relevant gene targets but does so in a way that reflects the underlying biological and network-based relationships, suggesting that its predictions are rooted in the inherent structure of the gene interaction network that governs gene similarity.

PDGrapher predicts genes that are closer in the network to ground-truth therapeutic genes than would be expected by chance, for the Chemical-PPI-Lung (A) and Chemical-PPI-Breast (B) datasets in the random splitting setting and in the leave-cell-line-out splitting setting (C, D).
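To illustrate what "closer than expected by chance" can mean, here is a minimal sketch that compares the shortest-path distance from predicted genes to ground-truth genes against the same distance for randomly drawn genes. It is a simplified stand-in for the analysis in the paper: the function names, the toy path graph, and the choice of random-gene baseline are all assumptions.

```python
import random
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distance (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def mean_distance_to_truth(adjacency, predicted, truth):
    """Mean over predicted genes of the distance to the closest true target."""
    per_truth = [bfs_distances(adjacency, t) for t in truth]
    closest = [min(d.get(p, float("inf")) for d in per_truth) for p in predicted]
    return sum(closest) / len(closest)

def chance_distance(adjacency, n_predicted, truth, trials=1000, seed=0):
    """Same statistic, averaged over random gene sets of the same size."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    draws = [mean_distance_to_truth(adjacency, rng.sample(nodes, n_predicted), truth)
             for _ in range(trials)]
    return sum(draws) / len(draws)

# Toy connected graph: a path 0 - 1 - 2 - 3 - 4 - 5 (not real PPI data).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
observed = mean_distance_to_truth(adjacency, predicted=[1], truth=[0])
expected = chance_distance(adjacency, n_predicted=1, truth=[0])
print(observed, expected)
```

If the observed distance is clearly smaller than the chance value, the predictions sit closer to the true targets than random genes would.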

We also explore PDGrapher’s ability to illuminate mechanisms of action of therapeutic perturbagens. We visualize predicted therapeutic targets for Raloxifene and Sertindole and their interaction communities in the Chemical-PPI-Lung dataset. This reveals PDGrapher’s accuracy in predicting known and potentially novel targets for both drugs. Raloxifene’s analysis highlights PDGrapher’s ability to predict its established targets (ESR1, ESR2) and suggests novel targets (SHBG, PDE5A) that align with known physiological effects, offering insights into Raloxifene’s broader impact on estrogen-related pathways. Similarly, for Sertindole, PDGrapher accurately predicts its primary targets and suggests additional genes (HTR1A, BRAF, HOXC6), enriching our understanding of its mechanism in modulating GPCR signaling pathways. 

(A, B) We visualize ground-truth and predicted therapeutic targets for Raloxifene (A) and Sertindole (B) in Chemical-PPI-Lung using Gephi with ForceAtlas embedding. We highlight in different colors distinct communities identified by Gephi’s modularity algorithm.

Conclusions

Here, we aimed to increase the versatility of ML approaches to phenotype-driven lead discovery by turning the problem on its head: we introduced a causally-grounded problem formulation to find perturbagens that shift systems from diseased to treated states. To solve this problem, we introduced PDGrapher, a causally-inspired GNN model that predicts therapeutic perturbagens. All in all, our results show that PDGrapher predicts the right therapeutic genes to shift cells from diseased to treated states in more samples than baselines, and that its predictions follow principles known to govern gene-gene similarities and can provide useful insights into drugs’ mechanisms of action. If you don’t love GNNs or the way we tackled this problem, you can still find value in the task formulation and set out to solve it yourself with your preferred algorithmic approach. I can’t finish this without mentioning PDGrapher’s limitations, including the assumption of no unobserved confounders and the approximation of the causal graph, both fascinating avenues to explore in the future!

Miscellaneous: practical considerations when using PDGrapher in your research

PDGrapher’s repo comes with two examples to run the model. To get started, just clone and set up the repo and datasets, install PDGrapher with pip install -e and follow the examples in the repo!

I foresee three potential ways of using PDGrapher in your research:

(1) If your datasets’ distributions match what PDGrapher was trained on, you can re-use our trained models (specifically, the perturbation discovery module of PDGrapher) to predict therapeutic targets.

(2) If your datasets come from a different distribution, you can retrain PDGrapher directly, provided your phenotypic dataset is gene expression and your perturbagens are represented as gene targets, or make minor adaptations to fit your phenotypic and perturbagen representations. Bear in mind that you will need perturbational datasets to do this, that is, datasets with initial phenotypic states, perturbagens (represented as their constituent elements, e.g., genes targeted), and treated phenotypic states. PDGrapher needs data on phenotypic transitions to learn how to predict perturbagens; once trained, it can be used to predict perturbagens that shift systems from diseased to treated (healthy) states.

(3) Our work puts forward a general problem formulation for predicting arbitrary perturbagens. Beyond re-training or adapting the model, it therefore defines a new task and opens the way for new methods addressing this fundamental problem in phenotype-driven lead discovery.
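The perturbational dataset described in (2) could be represented along these lines. This is a hypothetical container for illustration only, not PDGrapher's actual data format or API: the class, field, and function names are assumptions, and the expression values are made up. Check the repo for the format the model really expects.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class PerturbationSample:
    diseased: List[float]  # expression per gene before the perturbation
    treated: List[float]   # expression per gene after the perturbation
    targets: Set[str]      # genes targeted by the perturbagen

def target_labels(sample, gene_order):
    """Encode the targeted genes as a 0/1 vector aligned with `gene_order`."""
    if not (len(sample.diseased) == len(sample.treated) == len(gene_order)):
        raise ValueError("expression vectors must have one value per gene")
    unknown = sample.targets - set(gene_order)
    if unknown:
        raise ValueError(f"targets not in gene_order: {sorted(unknown)}")
    return [1 if gene in sample.targets else 0 for gene in gene_order]

# Placeholder genes and made-up expression values.
gene_order = ["GENE_A", "GENE_B", "GENE_C"]
sample = PerturbationSample(
    diseased=[2.0, 0.5, 1.0],
    treated=[1.0, 0.5, 1.5],
    targets={"GENE_A", "GENE_C"},
)
labels = target_labels(sample, gene_order)
print(labels)
```

Each sample pairs an initial state with a treated state and the set of genes that the perturbagen targeted, which is the supervision signal a model needs to learn from phenotypic transitions.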

https://github.com/mims-harvard/PDGrapher/tree/main?tab=readme-ov-file#data

Project website: https://zitniklab.hms.harvard.edu/projects/PDGrapher/

__________________________________________________________________________________

Guadalupe Gonzalez is a Senior ML Scientist in the Frontier Research team at Prescient Design, Genentech. Her expertise lies at the intersection of graph deep learning, causal inference, and drug discovery. At Prescient, her focus is on (causal) graph deep learning for drug discovery, from the small-scale (e.g., proteins) to the large-scale (e.g., patient data) systems. She previously completed a PhD at Imperial College in graph deep learning for lead discovery advised by Michael Bronstein and Kirill Veselkov. 

Twitter: https://twitter.com/justguadaa

LinkedIn: https://www.linkedin.com/in/guadalupe-gonzalezp/
