Tue, Feb 13, 4:00pm

Integrating single-cell data with substantial batch effects

Abstract: The combined analysis of multiple datasets provides insights that cannot be obtained from individual datasets. Examples include combining public perturbation screens that were generated with different technologies (e.g., single-cell or single-nucleus RNA sequencing) or that used different in vitro models (e.g., cell lines or organoids); assessing how well preclinical models, such as mice, mimic human biology; and comparing drug responses of individual cell types across tissues (e.g., circulating and tissue-resident immune cells). However, these datasets differ in both biological and technical factors, which complicates their direct comparison. Multiple methods for integrating single-cell data have therefore been developed, the most popular of which are based on conditional variational autoencoders (cVAEs). These methods, however, were initially designed to integrate datasets that are more closely aligned biologically and technically, such as samples of the same patient tissue generated with matching protocols across institutions. Consequently, existing methods struggle to integrate datasets with more substantial batch effects, such as those in the examples above.

Here, we address these challenges by introducing and comparing a series of cVAE regularization constraints. The two commonly used strategies for increasing batch correction in cVAEs, namely tuning the KL regularization strength and adversarial learning, both suffer from a substantial loss of biological information. Instead, we demonstrate that replacing the commonly used standard normal prior with a VampPrior not only improves the preservation of biological variation but also, unexpectedly, enhances batch correction. Furthermore, our implementation of a latent cycle-consistency loss preserves biological information better than adversarial learning when removing substantial batch effects. Based on these findings, we propose a new model that combines the VampPrior and the cycle-consistency loss. We showcase its efficacy in improving downstream interpretation of cell states and biological conditions after integrating datasets with substantial batch effects.

To facilitate adoption of the proposed model, we have incorporated it into the scvi-tools package as an external model named sysVI. In the future, these regularization techniques could also be incorporated into other cVAE-based models to enhance their integration performance.

Previous Talks