Mapping the gene space at single-cell resolution with gene signal pattern analysis

Single-cell genomics has revolutionized our ability to investigate cellular heterogeneity and response. Understanding how cellular diversity and behavior are programmed at the level of an individual cell helps us answer big questions in biology, from cancer to development, and has exciting implications for therapeutic discovery. While the field has focused on building cell atlases to map cell-cell relationships, it has been unclear how (or why!) to map gene-gene relationships. In our recent work, “Mapping the gene space at single-cell resolution with gene signal pattern analysis”, we motivate and showcase our approach to learning gene embeddings from single-cell data. With these embeddings, we can unlock a broad array of downstream analyses and hopefully better understand cellular and molecular behavior within and across cell types.

Modeling single-cell data on a manifold

When exploring single-cell RNA sequencing (scRNA-seq) data, we usually think of each “observation” as a cell and each “feature” as a gene measurement, with thousands of genes measured per cell. Mapping cell-cell relationships in a low-dimensional space is a useful way to uncover interesting behavior at the resolution of a single cell.

Unfortunately, it’s not straightforward to understand this biology: single-cell sequencing data is high-dimensional, noisy, and sparse. Sparsity in single-cell data is due to a technical artifact called “dropout”, where only a fraction of transcripts in each cell is measured. One common assumption to overcome these artifacts is that cells lie on a manifold, a construct representing a locally Euclidean, smoothly varying space.

One way to model single-cell data on a manifold is through a process called data diffusion. First, we construct a cell-cell graph by (1) calculating the distances between all cells, and (2) converting those distances to affinities with a kernel, such as a Gaussian kernel, and storing them in an affinity matrix A. This captures the local neighborhood structure between cells. To learn global structure, we construct the diffusion operator, P, by row-normalizing A so that each entry is the probability of moving from one cell to another in a single step. Then, we take a random walk by powering the diffusion operator, which denoises the data and recovers the underlying manifold.

Recovering the underlying manifold from single-cell measurements.
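
The diffusion steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names, the fixed kernel bandwidth `sigma`, the number of steps `t`, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def diffusion_operator(X, sigma=1.0):
    """Build a row-stochastic diffusion operator P from a cells-by-genes matrix X.

    sigma is a hypothetical fixed Gaussian bandwidth chosen for illustration.
    """
    # (1) Pairwise squared Euclidean distances between all cells.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # (2) Gaussian kernel turns distances into affinities (matrix A).
    A = np.exp(-d2 / (2.0 * sigma**2))
    # Row-normalize so each row is a probability distribution over cells.
    return A / A.sum(axis=1, keepdims=True)

def diffuse(X, P, t=3):
    """Take a t-step random walk: each cell becomes a weighted average of its neighbors."""
    return np.linalg.matrix_power(P, t) @ X

# Toy data: 50 "cells" with 5 "genes" (random numbers, for shape only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
P = diffusion_operator(X, sigma=2.0)
X_denoised = diffuse(X, P, t=3)
```

In practice the bandwidth is often chosen adaptively per cell and the graph is sparsified to nearest neighbors; the dense version here is only meant to make the three steps (distances, kernel, row normalization) explicit.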

In our work, we leverage this manifold structure in an entirely new way: to map the gene space.

Defining genes as signals on the cell-cell graph

Mapping gene-gene relationships helps us understand gene pathways and coordination events that guide cellular behavior. However, pointwise metrics of gene-gene coexpression, such as correlation, are unreliable for single-cell analysis because of dropout. That is, two related genes may not be detected in the same cell.

Our key insight was that we can frame gene measurements as signals on the cell-cell graph. In graph signal processing, signals are functions defined on the nodes of the graph. With this framing, we can calculate gene-gene distances with respect to the underlying manifold.

In the example below, we can use the cell-cell graph to determine that gene signal a is closer to gene signal c than to gene signal e, even though a overlaps with neither c nor e. This would not be possible without framing genes as graph signals!

Gene signals on cell-cell graph and hypothetical embedding of signals.
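
To make this concrete, here is a small toy example of the idea. It is a sketch under simplifying assumptions, not GSPA itself: the "cell-cell graph" is a hypothetical path of 9 cells, each gene signal is an indicator on a single cell, and signals are compared after plain diffusion smoothing rather than with the wavelet-based embedding described later.

```python
import numpy as np

n_cells = 9
# Hypothetical cell-cell graph: a path where each cell connects to itself and its neighbors.
A = np.eye(n_cells) + np.eye(n_cells, k=1) + np.eye(n_cells, k=-1)
P = A / A.sum(axis=1, keepdims=True)  # row-normalized diffusion operator

def indicator(i):
    """Gene signal expressed in exactly one cell (an extreme case of non-overlap)."""
    s = np.zeros(n_cells)
    s[i] = 1.0
    return s

a, c, e = indicator(0), indicator(2), indicator(8)

# Pointwise comparison: a overlaps with neither c nor e, so both distances are identical.
raw_ac = np.linalg.norm(a - c)
raw_ae = np.linalg.norm(a - e)

# Graph-aware comparison: smooth each signal over the graph before comparing.
P2 = np.linalg.matrix_power(P, 2)
diff_ac = np.linalg.norm(P2 @ a - P2 @ c)
diff_ae = np.linalg.norm(P2 @ a - P2 @ e)

print(raw_ac == raw_ae)   # the pointwise metric cannot tell c and e apart
print(diff_ac < diff_ae)  # on the graph, a is closer to c than to e
```

The pointwise distance treats c and e as equally far from a, while the graph-aware distance recovers that a and c sit in nearby cells.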

We define this as one desired property of gene embeddings — preserving local and global distances between signals based on the cell-cell graph. We also want gene embeddings to be denoising and flexibly defined for downstream analysis. To satisfy these properties, we developed Gene Signal Pattern Analysis (GSPA).

Gene Signal Pattern Analysis (GSPA)

Gene Signal Pattern Analysis first constructs a cell-cell graph and calculates the diffusion operator to recover the cellular manifold. Next, we compute diffusion wavelets to capture the multiscale structure of the cell-cell graph (see paper for details!). By applying wavelets at different scales to each node in the graph, we build a large wavelet dictionary. Finally, we project the gene signal matrix onto the cell wavelet dictionary and learn a reduced representation via an autoencoder. The resulting representation captures denoised and multiscale gene-gene relationships, encoding biologically relevant patterns for downstream analysis.
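
The pipeline above can be sketched roughly as follows. Several choices here are assumptions made for illustration and may differ from the paper: the wavelets use one common dyadic construction (I - P, then P^(2^(j-1)) - P^(2^j)), the number of scales `J` is arbitrary, and a truncated SVD stands in for the autoencoder so that the example needs only NumPy. The function names are hypothetical.

```python
import numpy as np

def wavelet_dictionary(P, J=3):
    """Stack dyadic diffusion wavelets side by side: shape (n_cells, n_cells * (J + 1))."""
    n = P.shape[0]
    wavelets = [np.eye(n) - P]          # finest scale: I - P
    P_pow = P.copy()                    # holds P^(2^(j-1))
    for _ in range(J):
        P_next = P_pow @ P_pow          # P^(2^j)
        wavelets.append(P_pow - P_next) # band between two diffusion scales
        P_pow = P_next
    return np.hstack(wavelets)

def embed_genes(signals, P, J=3, n_components=2):
    """Project gene signals (genes x cells) onto the dictionary, then reduce.

    NOTE: truncated SVD is a stand-in for the autoencoder described in the text.
    """
    coeffs = signals @ wavelet_dictionary(P, J)   # genes x (n_cells * (J + 1))
    coeffs = coeffs - coeffs.mean(axis=0)         # center before the SVD
    U, S, _ = np.linalg.svd(coeffs, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Toy usage: 9 cells on a path graph, 6 random gene signals.
rng = np.random.default_rng(0)
n_cells = 9
A = np.eye(n_cells) + np.eye(n_cells, k=1) + np.eye(n_cells, k=-1)
P = A / A.sum(axis=1, keepdims=True)
signals = rng.random((6, n_cells))

W = wavelet_dictionary(P, J=3)
embedding = embed_genes(signals, P, J=3, n_components=2)
```

Each wavelet is a difference of two diffusion scales, so it responds to how a signal changes between a smaller and a larger neighborhood, which is what gives the dictionary its multiscale character.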

Comparing against alternative gene mapping strategies on simulated data

To evaluate this approach, we compared GSPA against eight baselines from graph representation learning and graph signal processing. Our approach best preserves gene-gene coexpression and, importantly, results in meaningful gene embeddings.

For example, for a simulated dataset with three branches, we construct a gene embedding and cluster genes into 7 “modules”, i.e., groups of related or coexpressing genes. We then visualize the module enrichment on the cell branches. Gene module 0 is enriched on branch 1, gene module 2 is enriched on branch 2, and gene module 1, which embeds between gene modules 0 and 2, is enriched on both branch 1 and branch 2. This highlights that we can characterize genes in an interpretable, yet cluster- and pseudotime-independent, manner.

Gene embedding and module enrichment on simulated dataset with three branches.
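
As a rough illustration of what “module enrichment on the cell branches” can mean, the sketch below scores each module in each cell and then averages over the cells of a branch. The scoring rule (mean expression of the module's genes) is an assumption made for this example and may not match how enrichment is computed in the paper; the function name, the module labels, and the toy matrix are all hypothetical.

```python
import numpy as np

def module_enrichment(expr, gene_modules, cell_groups):
    """Average per-cell module score within each cell group.

    expr: cells x genes matrix; gene_modules: module label per gene;
    cell_groups: group (e.g. branch) label per cell.
    """
    modules = np.unique(gene_modules)
    groups = np.unique(cell_groups)
    out = np.zeros((len(modules), len(groups)))
    for i, m in enumerate(modules):
        # Per-cell score: mean expression over the genes in module m.
        score = expr[:, gene_modules == m].mean(axis=1)
        for j, g in enumerate(groups):
            out[i, j] = score[cell_groups == g].mean()
    return modules, groups, out

# Toy data: 6 cells on two branches, 4 genes in two modules.
expr = np.array([
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
], dtype=float)
gene_modules = np.array([0, 0, 1, 1])
cell_groups = np.array(["branch1"] * 3 + ["branch2"] * 3)

modules, groups, enrichment = module_enrichment(expr, gene_modules, cell_groups)
print(enrichment)  # rows are modules, columns are branches
```

Here module 0 scores 2.0 on branch 1 and 0.0 on branch 2, and module 1 scores 0.0 and 4.0, mirroring the branch-specific pattern described above.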

We also use GSPA to define differential localization, a metric to find highly informative gene patterns, and we show that GSPA best preserves gene localization in the same simulated datasets.

Revealing gene-gene coexpression relationships

Simulated experiments helped us establish the validity of the approach, but to gain biological insight with GSPA, we turned to a newly generated dataset of T cells. T cells are a critical part of the immune system and are known to transition into a range of subtypes, but the gene signaling patterns that characterize these transitions are not fully known. With the lab of Dr. Nikhil Joshi, we investigated CD8+ T cells in response to acute and chronic infection at three timepoints.

Embedding of cells in six experimental conditions, colored by key marker genes.

Mapping the gene space from this data shows that our approach can accurately capture gene modules matching known T cell states, including those specific to a particular context or timepoint. For example, we identified a gene module containing known markers for naivety and memory (e.g. Sell, Tcf7, Ccr7). This module is enriched in both acute and chronic infection at the early timepoint (naive cells), and only in acute infection at the late timepoint (memory cells).

Gene embedding colored by gene modules, network of gene module 1, and enrichment of gene module 1 across conditions.

We could also characterize genes with respect to how localized they are on the manifold. In contrast to traditional differential expression, differential localization deprioritizes genes that are expressed ubiquitously, like Rps20, as they are less likely to explain cellular variation and decision making. Instead, it prioritizes genes that are specifically expressed in a particular cell population, including S1pr5 and Tox.

Gene embedding colored by localization score, comparison of localization score to clustering-based differential gene expression.

Mapping patient samples and predicting outcomes

Gene embeddings are not only useful for characterizing gene-gene relationships. They can also be used to compare patient samples toward personalized response and outcome prediction.

Given different single-cell datasets for each patient, we concatenate the samples and build a shared cell-cell graph, which we use to build our wavelet dictionary. Then, since we know which cells are from which patient, we can split the wavelet dictionary into patient-specific dictionaries and learn patient-specific gene embeddings. Flattening these gene embeddings gives us patient vectors where the entries of the vector are associated with multiscale representations of genes.

Patient-specific gene embeddings using Gene Signal Pattern Analysis.
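
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the dictionary `W` uses only two wavelet scales, the kernel bandwidth is arbitrary, the data are random, and a shared truncated SVD basis stands in for the learned embedding so that patient vectors live in a common coordinate system. The function name `patient_vectors` is hypothetical.

```python
import numpy as np

def patient_vectors(signals, W, patient_ids, n_components=2):
    """Flattened patient-specific gene embeddings.

    signals: genes x cells (all patients concatenated); W: cells x atoms wavelet
    dictionary built on the shared graph; patient_ids: patient label per cell.
    """
    patients = np.unique(patient_ids)
    # Split the dictionary by rows (cells) and project each patient's gene signals.
    coeffs = [signals[:, patient_ids == p] @ W[patient_ids == p, :] for p in patients]
    # Shared low-dimensional basis (SVD stand-in for the learned reduction).
    _, _, Vt = np.linalg.svd(np.vstack(coeffs), full_matrices=False)
    basis = Vt[:n_components].T                      # atoms x n_components
    # One row per patient: genes x n_components, flattened.
    vectors = np.stack([(c @ basis).ravel() for c in coeffs])
    return patients, vectors

# Toy data: 5 genes, 12 cells from 3 patients (4 cells each).
rng = np.random.default_rng(0)
n_genes, n_cells = 5, 12
signals = rng.poisson(1.0, size=(n_genes, n_cells)).astype(float)
patient_ids = np.repeat(["p1", "p2", "p3"], 4)

# Shared cell-cell graph over the concatenated cells (Gaussian kernel, row-normalized).
X = signals.T
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
A = np.exp(-d2 / (2.0 * 2.0**2))
P = A / A.sum(axis=1, keepdims=True)
W = np.hstack([np.eye(n_cells) - P, P - P @ P])      # two-scale dictionary

patients, vectors = patient_vectors(signals, W, patient_ids)
print(vectors.shape)  # (number of patients, n_genes * n_components)
```

Because every block of entries in a patient vector belongs to a specific gene, a classifier trained on these vectors can be read back in terms of genes, which is what makes the downstream prediction interpretable.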

Compared to baseline methods, this approach best classifies responders versus non-responders to immunotherapy among 48 melanoma samples. Furthermore, because the entries of each vector are associated with genes, the classification is easy to interpret. When we look at the genes most predictive of response, including GPR183 and IL7R, we see an enrichment for progenitor function, which is known to be the primary target of immunotherapy. On the other hand, genes associated with non-response, including GZMA, are terminal differentiation markers, associated with T cells that are unable to reacquire significant function and are thus non-responsive to the immunotherapy.

Patient manifold learned with GSPA reveals biomarkers related to response prediction.

Concluding remarks

The main takeaway here is that considering genes as signals on the cell-cell graph allows us to preserve gene-gene relationships, which supports a whole host of downstream tasks for gaining a deeper understanding of single-cell biology. We think this approach is also broadly useful for graph feature and signal representation learning, where extracting rich feature representations can reveal interesting patterns worth investigating further. We hope you read our full work and consider using our approach to analyze your single-cell data.