Hassaan Maan
· PhD Candidate at University of Toronto/Vector Institute

Characterizing the impacts of dataset imbalance on single-cell data integration

This work has been published in Nature Biotechnology (https://rdcu.be/dz2RE). This blog post was drafted with the help of Shawn Whitfield.

Conducting biomedical research is never cheap, but larger studies involving tens to hundreds of patient samples tend to run up the cost of sequencing very quickly. But what if, despite this extreme resource commitment, many of the biological findings are lost, or worse, misinterpreted? In this blog post, we'll discuss how this can happen when integrating single-cell sequencing data, and what can be done to prevent it.

Data integration in a single-cell setting

Over the past decade, single-cell sequencing data has become vital to biological research. Increasingly, multi-sample and/or multi-modal single-cell sequencing assays of similar tissue samples are performed, and a fundamental challenge in these settings is to learn a meaningful joint representation of the data. Batch-effects are technical differences between experiments that lead to variation in measurements that is not due to any underlying biological cause. In the multi-sample setting, the joint representation should be invariant to batch-effects. Multi-modal measurements account for the fact that the complete biological picture cannot be summarized by one measurement type (e.g. RNA). In the multi-modal setting, the learned joint representation should contain the salient information from each of the modalities measured, as well as any overlapping information.
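To make the idea of a batch-effect concrete, here is a minimal simulation sketch (all names and numbers are illustrative, not from the paper): two samples share the same two cell-types, but the second carries a purely technical additive shift. A good joint representation would mix the two samples while keeping the cell-types separated.

import numpy as np

rng = np.random.default_rng(0)

# Two cell-types with distinct mean expression across 10 genes
def simulate_sample(n_per_type, batch_offset, noise=0.5):
    t_cells = rng.normal(1.0 + batch_offset, noise, size=(n_per_type, 10))
    b_cells = rng.normal(4.0 + batch_offset, noise, size=(n_per_type, 10))
    labels = np.array(["T cell"] * n_per_type + ["B cell"] * n_per_type)
    return np.concatenate([t_cells, b_cells]), labels

# Identical biology in both samples, but sample 2 carries a purely
# technical shift - this is the batch-effect integration should remove
sample_1, labels_1 = simulate_sample(100, batch_offset=0.0)
sample_2, labels_2 = simulate_sample(100, batch_offset=2.0)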

To learn these joint representations, data integration techniques specific to single-cell sequencing data have been developed (see this great review on data integration in single-cell and associated techniques). Regardless of whether we are considering multi-sample or multi-modal integration, it is critical that the joint representations retain as much of the relevant biological signal as possible. However, some recent studies have shown that integration techniques struggle in imbalanced scenarios, where imbalance refers to differences in the number of cells, the cell-types present, and/or the cell-type proportions between the samples being integrated.

Why should we care about this problem?

If integration techniques struggle, this could be a big problem for researchers who use single-cell sequencing data. In the case of multi-sample analysis, we want to ensure that any batch-effects that might mimic biological differences are removed before we analyze the data, which is precisely the problem integration solves. But if integration techniques struggle or overcorrect, as in the case of imbalanced data, we might end up losing biological information.

Imbalanced datasets occur very frequently in biological research. For example, in multi-sample developmental data, samples measured across different timepoints will be imbalanced with respect to cell-types, because of factors such as the depletion of upstream populations and the emergence of new progenitors at later timepoints. In cancer research, multiple samples are often synonymous with multiple patient biopsies, and because of cancer heterogeneity between patients (e.g. different subtypes), there will be marked differences in the cell-types present in these samples.

There have been a myriad of single-cell studies (a database tracking these studies indicates it's close to 2000 now, but this is a significant underestimate), many of which involve these imbalanced scenarios. If integration was performed haphazardly in these scenarios, the results may not be reproducible due to altered biological signal. We set out to characterize the degree to which dataset imbalance affects the downstream results of single-cell integration, and to introduce potential solutions.

Iniquitate and its key results

Experimental overview and key contributions. a) The Iniquitate pipeline, which was used to benchmark integration experiments in balanced and imbalanced scenarios, and assess the impacts on the learned representation and downstream results. b) The two properties of multi-sample data distributions which contribute to loss of biological signal. c) Balanced clustering metrics and their utility in benchmarking imbalanced integration scenarios.

Determining whether or not imbalance affects single-cell integration results is challenging with real datasets, which always contain some degree of imbalance. We therefore took two samples of peripheral blood mononuclear cells (PBMCs) processed using different technologies, which introduces a batch-effect, and created a dataset in which the two samples are balanced. We did this by selecting only expertly annotated cell-types shared between the samples, and then downsampling each cell-type to equivalent numbers between samples.
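As a rough sketch of this balancing step (the dataframe layout and column names here are hypothetical, not the paper's actual code), assuming per-cell metadata with 'sample' and 'cell_type' columns:

import pandas as pd

def balance_samples(obs: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    # Keep only cell-types annotated in every sample
    n_samples = obs["sample"].nunique()
    per_type = obs.groupby("cell_type")["sample"].nunique()
    obs = obs[obs["cell_type"].isin(per_type[per_type == n_samples].index)]

    # Downsample every (sample, cell-type) group to the smallest group size
    min_n = obs.groupby(["sample", "cell_type"]).size().min()
    return (
        obs.groupby(["sample", "cell_type"], group_keys=False)
        .apply(lambda g: g.sample(n=min_n, random_state=seed))
    )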

We then went on to 'perturb' the degree of imbalance between these balanced samples, and compared integration results before (balanced) and after (imbalanced) perturbation. We tested a litany of downstream analyses post-integration (e.g. clustering, trajectory inference), and regardless of the analysis tested, we found statistically significant loss of biological signal in the imbalanced scenarios compared to their balanced counterparts. The pipeline that allows for all of the aforementioned dataset imbalance experimentation and statistical computation was termed Iniquitate (code for reproducibility and custom analysis using Iniquitate can be found at https://github.com/hsmaan/Iniquitate).
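A perturbation in this sense can be as simple as downsampling or ablating one cell-type in one sample. A hypothetical sketch (not the Iniquitate implementation itself), reusing the metadata layout from the previous snippet:

import numpy as np
import pandas as pd

def perturb_imbalance(obs: pd.DataFrame, cell_type: str, sample: str,
                      keep_frac: float, seed: int = 0) -> pd.DataFrame:
    # Downsample one cell-type in one sample to a fraction of its cells;
    # keep_frac = 0.0 ablates that cell-type from the sample entirely
    rng = np.random.default_rng(seed)
    target = obs.index[(obs["cell_type"] == cell_type) & (obs["sample"] == sample)]
    n_drop = len(target) - int(keep_frac * len(target))
    return obs.drop(index=rng.choice(target, size=n_drop, replace=False))

Integration is then re-run on the perturbed data, and the downstream results are compared against the balanced baseline.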

Further, we discovered that it is not only the imbalance itself: cell-type similarity also drives loss of biological signal. For example, imbalanced cell-types that are very closely related to other cell-types across samples (e.g. transcriptomically, in the case of single-cell RNA-seq) are more likely to be affected in integration settings. This is an important observation, as key research questions often involve highly similar cell-types, e.g. subsets of T-cell populations.
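One quick way to see which cell-types are at risk is to measure how similar their expression (or embedding) centroids are. Here is a small sketch of our own illustration using cosine similarity:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def celltype_similarity(X: np.ndarray, labels: np.ndarray):
    # Cosine similarity between cell-type centroids; pairs with high
    # similarity are most at risk of being conflated during integration
    types = np.unique(labels)
    centroids = np.stack([X[labels == t].mean(axis=0) for t in types])
    return types, 1 - squareform(pdist(centroids, metric="cosine"))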

Better benchmarking of single-cell integration

As the results of our analysis were method-agnostic (no method showed good performance in these settings), we approached solutions from two standpoints, one of which was benchmarking.

Benchmarking involves learning an integrated representation, clustering that representation in an unsupervised manner, and then comparing the unsupervised clusters to ground-truth annotations using clustering metrics. This gives a sense of how "biologically accurate" the integration results are, and how well the methods remove batch-effects and learn joint modality representations.
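In code, that benchmarking loop might look like the following sketch, where integrated_embedding and ground_truth are placeholders standing in for the outputs of an actual integration run:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Placeholders: an integrated embedding (n_cells x n_dims) and its
# ground-truth cell-type annotations
integrated_embedding = rng.normal(size=(300, 10))
ground_truth = rng.integers(0, 3, size=300)

# Cluster the representation in an unsupervised manner, then compare the
# clusters against the annotations with a clustering metric
clusters = KMeans(n_clusters=3, random_state=0).fit_predict(integrated_embedding)
print(adjusted_rand_score(ground_truth, clusters))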

There have been many single-cell integration benchmarking papers, but they have relied on metrics that do not take cell-type balance into account, and hence do not reflect the key results of this study.

Given this, we've developed new metrics that weigh each ground-truth cell-type equally (e.g. the Balanced Adjusted Rand Index and Balanced Adjusted Mutual Information) (https://github.com/hsmaan/balanced-clustering). This offers a different viewpoint on integration results, one in which cell-type prevalence does not dominate the score. The reworked metrics are not specific to single-cell research and can be used in similar representation learning scenarios across domains. Here's an example on simulated data showing how the balanced clustering metrics can be used:

import numpy as np
from sklearn.cluster import KMeans

# return_metrics prints both the standard and balanced version of each
# clustering metric; the individual balanced metrics (e.g.
# balanced_adjusted_rand_index) can also be imported directly
from balanced_clustering import return_metrics

# Reproducibility comes from the explicitly seeded generators below
# (default_rng and KMeans random_state), so no global seed is needed

# Sample three classes from well-separated Gaussian distributions with
# varying standard deviations and class sizes
c_1 = np.random.default_rng(seed = 0).normal(loc = 0, scale = 0.5, size = (500, 2))
c_2 = np.random.default_rng(seed = 1).normal(loc = -2, scale = 0.1, size = (20, 2))
c_3 = np.random.default_rng(seed = 2).normal(loc = 3, scale = 1, size = (500, 2))
classes = np.concatenate(
	[np.repeat("A", len(c_1)), np.repeat("B", len(c_2)), np.repeat("C", len(c_3))]
)

# Perform k-means clustering with k = 2 - this misclusters the smallest class
cluster_arr = np.concatenate([c_1, c_2, c_3])
kmeans_res = KMeans(n_clusters = 2, random_state = 42).fit_predict(X = cluster_arr)

# Return and print balanced and imbalanced comparisons
return_metrics(
	class_arr = classes, cluster_arr = kmeans_res,
)

"""
ARI imbalanced: 0.915 ARI balanced: 0.5434
AMI imbalanced: 0.8671 AMI balanced: 0.686
Homogeneity imbalanced: 0.8204 Homogeneity balanced: 0.5402
Completeness imbalanced: 0.9198 Completeness balanced: 0.941
V-measure imbalanced: 0.8673 V-measure balanced: 0.6864
"""

# As we can see, the balanced clustering metrics penalize the resulting scores
# much more severely due to the mis-clustering of the smallest (least
# prevalent) class - a result that is missed by the base metrics.

Guidelines for imbalanced integration

Methodologically, imbalanced integration is a difficult problem to solve, as the combined properties of imbalance and cell-type similarity must both be addressed. Although this is a problem we are continuing to work on, for this paper we chose a different approach: a series of guidelines for imbalanced integration (https://github.com/hsmaan/Iniquitate/tree/main/docs):

Imbalanced integration guidelines. These guidelines describe an iterative process for end-users that allows for systematic assessment of dataset imbalance, integration tuning, and quantification of preservation of biological signal and batch-mixing.

These guidelines offer users of integration techniques an iterative process in which diagnosis of imbalance, tuning of methods, and balancing of batch-correction against biological signal conservation are repeated until a desired outcome is reached.

Diagnosing imbalance involves unsupervised clustering and the potential use of a reference atlas (Pre-integration stage in the figure). There are currently no complete methodological solutions to imbalanced integration, but our guidelines offer suggestions for tuning techniques to better preserve biological signal, typically by sacrificing some extent of batch/sample-mixing (Integration stage in the figure). Finally, we offer ways to assess the degree to which biological signal is conserved and the degree of batch-mixing, and, based on this tradeoff, to repeat the guideline steps (Post-integration stage in the figure). A complete walkthrough of the guidelines in R is available at https://github.com/hsmaan/Iniquitate/blob/main/docs/guidelines.pdf.
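To illustrate the tradeoff the guidelines ask you to monitor, here is a deliberately crude, runnable toy (the correction scheme and both metrics are our own simplifications, not the guidelines' actual tooling). A single strength parameter trades batch-mixing against biological separation, and we sweep it the way the iterative loop would:

import numpy as np

rng = np.random.default_rng(0)

# Two batches sharing two cell-types, but with imbalanced proportions
# and an additive technical shift on batch 2
def make_batch(n_type0, n_type1, offset):
    types = np.array([0] * n_type0 + [1] * n_type1)
    X = rng.normal(types[:, None] * 3.0 + offset, 0.5, size=(len(types), 5))
    return X, types

X1, t1 = make_batch(100, 100, offset=0.0)  # balanced batch
X2, t2 = make_batch(20, 180, offset=2.0)   # imbalanced batch

def correct(X1, X2, alpha):
    # 'Integrate' by removing a fraction alpha of the observed mean shift;
    # under imbalance, that shift conflates technical and biological
    # differences, so stronger correction also removes biology
    shift = X2.mean(axis=0) - X1.mean(axis=0)
    return X1, X2 - alpha * shift

def batch_mixing(Y1, Y2):
    # Crude mixing proxy: higher when the batch means coincide
    return 1.0 / (1.0 + np.linalg.norm(Y1.mean(axis=0) - Y2.mean(axis=0)))

def bio_conservation(Y, types):
    # Crude conservation proxy: distance between cell-type centroids
    return np.linalg.norm(Y[types == 0].mean(axis=0) - Y[types == 1].mean(axis=0))

# Sweep the correction strength as the iterative loop would, watching
# biological separation shrink as batch-mixing improves
types_all = np.concatenate([t1, t2])
for alpha in np.linspace(0.0, 1.0, 6):
    Y1, Y2 = correct(X1, X2, alpha)
    Y = np.concatenate([Y1, Y2])
    print(f"alpha={alpha:.1f}  mixing={batch_mixing(Y1, Y2):.2f}  "
          f"bio={bio_conservation(Y, types_all):.2f}")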

Overall implications and key takeaways

Single-cell integration is often considered a 'solved' problem, but our work showed that significant challenges still remain. As it is critical to conserve biological signal when learning joint representations of samples and modalities, this work highlights a blind spot within the single-cell research community.

Although this work is heavily domain-specific, here are some key takeaways for integration and representation learning of biological data in general:

  • Dataset imbalance is a key factor, as integration is often contingent upon defining "anchors" between datasets using various modeling techniques. Classes that are not highly prevalent are at risk of being misassigned or ignored in anchors between datasets, especially when inter-class similarity is high (see the sketch after this list).

  • Metrics used to quantify the validity of a learned representation can often be misleading; you should also consider the downstream analyses the representation will be used for, and quantify the validity of results in that setting as well. A great example of this is the GLUE benchmark from natural language processing.

  • We should further scrutinize methodological claims by authors of techniques (e.g. integration methods); often, the initial experiments in the original papers are not enough. Comprehensive benchmarking papers are crucial in this respect and can offer clarity to end-users who typically encounter difficulty in picking a method/framework (this is especially true in single-cell analysis, as the field is saturated with methods).

  • Similarly, for novel datasets we should apply methods that are appropriate. If a method was mostly tested on balanced toy data, haphazardly applying it to a complex, highly imbalanced, newly generated dataset is not a great idea. In this scenario, gradually testing the method on data of increasing complexity and imbalance might be a better approach (e.g. downsampling the larger dataset, or creating balanced setups based on the annotations).

  • No method is a silver bullet in these challenging scenarios (and it may be unlikely that one could be developed). In this paper we tested methods that claim to handle imbalance, but none of them demonstrated the preservation of biological signal we were looking for. Often a tradeoff has to be made: some biological information is lost, or some technical effect is introduced, for the sake of accounting for the other factor. This is likely true in other similar representation learning scenarios.
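As a toy illustration of the first point above (our own sketch, not any particular method's implementation), consider mutual nearest neighbours, a common anchoring scheme, applied to two batches in which a rare cell-type closely resembles a prevalent one:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Each batch has a prevalent type around 0 and a rare, similar type
# around 1.5, with a small technical offset on batch 2
def make_batch(n_common, n_rare, offset):
    X = np.concatenate([
        rng.normal(0.0 + offset, 0.5, size=(n_common, 2)),
        rng.normal(1.5 + offset, 0.5, size=(n_rare, 2)),
    ])
    labels = np.array(["common"] * n_common + ["rare"] * n_rare)
    return X, labels

X1, y1 = make_batch(300, 10, offset=0.0)
X2, y2 = make_batch(300, 10, offset=0.3)

# Mutual nearest neighbours across batches serve as anchors
k = 20
nn12 = NearestNeighbors(n_neighbors=k).fit(X2).kneighbors(X1, return_distance=False)
nn21 = NearestNeighbors(n_neighbors=k).fit(X1).kneighbors(X2, return_distance=False)
anchors = [(i, j) for i in range(len(X1)) for j in nn12[i] if i in nn21[j]]

# How often do anchors pair the rare type with itself vs. the wrong type?
rare_rare = sum(y1[i] == "rare" and y2[j] == "rare" for i, j in anchors)
cross_type = sum(y1[i] != y2[j] for i, j in anchors)
print(f"{len(anchors)} anchors, {rare_rare} rare-rare, {cross_type} cross-type")

In toy runs like this, many of the rare cells' mutual neighbours end up being the similar prevalent type, so cross-type anchors appear and the rare population risks being smeared into the prevalent one during correction.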

This work was done jointly with members of the Bo Wang (https://wanglab.ai/) and Kieran Campbell (https://www.camlab.ca/) labs, including Lin Zhang, Chengxin Yu and Michael Geuenich. Bo Wang and Kieran Campbell funded the project and provided supervision.

If you have any thoughts about this work, please feel free to reach out at hassaan.maan@mail.utoronto.ca.
