
CellPLM: Pre-training of Cell Language Model Beyond Single Cells

The paper associated with this blog has been accepted at ICLR 2024.

Code associated with the paper is available in the CellPLM GitHub repo.

_____

Recent advancements in next-generation sequencing technologies, like single-cell RNA sequencing (scRNA-seq), have generated enormous amounts of data. This has sparked interest in developing large-scale pre-trained models for single-cell analysis, such as scGPT and Geneformer. Inspired by the success of large language models from the text domain, these existing pre-trained models treat genes as 'words' (tokens) and cells as 'sentences', using transformers (the architecture behind ChatGPT and the 'T' in its name) to predict one 'word' (a gene's expression) given the other 'words' (the other genes' expression) within a sentence (a cell). Due to their similarity to language modeling in natural language processing, we call them "gene language models".

Despite their success in uncovering gene-gene interactions within cells, these gene language models largely neglect the interactions between cells. In fact, cell-cell interactions are crucial in understanding and treating diseases like cancer. For example, cancer cells can 'talk' to neighboring immune cells in a way that dampens the immune response against them. Luckily, with the emergence of spatially-resolved transcriptomic (SRT) data, we can now measure gene expression in situ while keeping cells' relative positions intact. To reveal a more comprehensive story in complex biological systems, we introduce a novel single-Cell Pre-trained Language Model (CellPLM) that accounts for not only gene-gene interactions but also cell-cell relations. Importantly, we leverage SRT data alongside scRNA-seq data during pre-training. This makes CellPLM the first pre-trained model that can leverage the cellular positional information provided by SRT data, which enriches CellPLM's perception of cell-cell interactions.

CellPLM is the first pre-trained transformer framework that effectively encodes inter-cell relations, integrates spatially-resolved transcriptomic data, and applies a well-justified prior distribution. Our experiments demonstrate that CellPLM consistently surpasses both pre-trained and non-pre-trained methods across various downstream tasks, including cell-type annotation, denoising and imputation, zero-shot cell embedding, and gene perturbation prediction.

Why and how to encode cell-cell relationships

Encoding important biological information from single-cell data is a crucial objective of single-cell foundation models. Although cell-cell relationships are not explicitly measured in single-cell assays, they are an important aspect of biology and can be mined from the data. We are particularly interested in encoding two key types of inter-cell relations in the model, with the idea that they reflect important biological mechanisms underlying the data:

  • Cell Lineage Information: Cells within the same or a similar lineage (cell type) provide crucial supplementary information for denoising and identifying cell states. For example, B cells will have similar gene expression patterns even if they are physically far apart from each other.

  • Cell-Cell Communications: These interactions are vital in determining cell development and states. One example is that cancer cells can overexpress certain proteins, known as immune checkpoints, to suppress the activity of surrounding immune cells. While existing methods explore these communications at the cell-type or cluster level, CellPLM aims to decipher the intricate intercellular "language" at the single-cell level.

An illustration of the difference between the language models of existing single-cell pre-trained models (left) and CellPLM (right). Existing pre-trained models only consider conditional probabilities between gene expressions within the same cell, while in CellPLM, the gene expression distribution is also conditioned on other cells.

To capture these intercellular relationships in a pre-trained model, we introduce the concept of a “cell language model”. In contrast to the "gene language models" (e.g., scGPT, Geneformer) described above, our proposed CellPLM treats cells (rather than genes) as tokens, and considers both the observed genes in a cell and genes from other cells, capturing a more comprehensive view of cellular interactions.
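To make the contrast concrete, the schematic below sketches how the two views tokenize the same expression matrix. The shapes, sizes, and embedding layer are illustrative assumptions, not the actual tokenizers of scGPT, Geneformer, or CellPLM.

import torch

n_cells, n_genes, d_model = 100, 2000, 512
expr = torch.rand(n_cells, n_genes)          # cells x genes expression matrix

# Gene language model: each cell is a "sentence" whose tokens are genes,
# so attention only mixes information across genes *within* one cell.
gene_tokens = expr[0].unsqueeze(-1)          # (n_genes, 1), later embedded per gene

# Cell language model (CellPLM): the whole set of cells is the "sentence";
# each cell's expression vector becomes a single token, so attention
# also mixes information *across cells*.
cell_tokens = torch.nn.Linear(n_genes, d_model)(expr)   # (n_cells, d_model)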

To implement a cell language model, we mask the majority of genes in randomly selected cells and train the model to recover them. This masking strategy makes it hard for the model to recover the missing gene expression values from the masked cells themselves. Instead, the model is encouraged to impute those "noisy" masked values from relevant "clean" cells, similar to masked language modeling (MLM) in natural language processing. In this way, the model is trained to identify the interactions between cells and to generate denoised representations.
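As a rough illustration of this masking strategy, the sketch below corrupts an expression matrix at the cell level. The masking rates and the mean-squared-error reconstruction loss are placeholder assumptions, not CellPLM's exact pre-training recipe.

import torch

def mask_cells(expr, cell_rate=0.25, gene_rate=0.9):
    """Hide most genes in a random subset of cells (rates are illustrative)."""
    n_cells, n_genes = expr.shape
    selected = torch.rand(n_cells, 1) < cell_rate        # cells chosen for masking
    hidden = torch.rand(n_cells, n_genes) < gene_rate    # genes hidden within those cells
    mask = selected & hidden
    corrupted = expr.masked_fill(mask, 0.0)              # zero out masked entries
    return corrupted, mask

# During pre-training, the model sees `corrupted` and is trained to recover
# the original values at the masked positions, e.g. with a reconstruction loss:
# loss = ((model(corrupted) - expr)[mask] ** 2).mean()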

Empowering language modeling with latent distribution and batch removal

An illustration of the pre-training framework of CellPLM. CellPLM is pre-trained with cell-level masked language modeling tasks on scRNA-seq and spatial transcriptomic data. The model consists of four modules: a gene expression embedder, a transformer encoder, a Gaussian mixture latent space, and a batch-aware decoder.

CellPLM is pre-trained through a cell-level masked language modeling task on extensive scRNA-seq and SRT data (over 10M cells in total). The model architecture is made up of four integral modules: a gene expression embedder, a transformer encoder, a Gaussian mixture latent space, and a batch-aware decoder. The gene expression embedder extracts intracellular information and is extendable to datasets with different gene sets. The encoder module leverages the transformer to extract intercellular context information for each cell. In particular, for SRT data, a spatial positional embedding is added to each cell. The output of the encoder is then fed to the latent space module, where each cell is embedded as a distribution in the latent space with a Gaussian mixture prior. Last, latent variables sampled from each cell's latent distribution are passed to a batch-aware decoder module and used to reconstruct the masked features. Notably, the batch-aware decoder helps remove batch effects from the latent space, a design whose effectiveness has been empirically validated by prior studies such as scVI.
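As a rough sketch of how these four modules fit together, the simplified module below mirrors the description above. The layer sizes, class names, the plain Gaussian sampling step (standing in for the Gaussian mixture prior), and the batch-conditioning scheme are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class CellPLMSketch(nn.Module):
    """Illustrative sketch of the four modules described above."""
    def __init__(self, n_genes, d_model=512, n_batches=16):
        super().__init__()
        # 1) Gene expression embedder: projects each cell's expression vector
        self.gene_embedder = nn.Linear(n_genes, d_model)
        # 2) Transformer encoder over cells (cells are the tokens)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # Spatial positional embedding for SRT data (x, y coordinates)
        self.pos_embedder = nn.Linear(2, d_model)
        # 3) Latent space: per-cell Gaussian parameters (mixture prior omitted here)
        self.to_mu = nn.Linear(d_model, d_model)
        self.to_logvar = nn.Linear(d_model, d_model)
        # 4) Batch-aware decoder: conditions reconstruction on a batch label
        self.batch_embedder = nn.Embedding(n_batches, d_model)
        self.decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_genes))

    def forward(self, expr, coords=None, batch_ids=None):
        h = self.gene_embedder(expr)                 # (n_cells, d_model)
        if coords is not None:                       # add spatial positions for SRT data
            h = h + self.pos_embedder(coords)
        h = self.encoder(h.unsqueeze(0)).squeeze(0)  # cells attend to other cells
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample latent variables
        if batch_ids is not None:                    # batch-aware decoding
            z = z + self.batch_embedder(batch_ids)
        return self.decoder(z)                       # reconstruct (masked) expression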

Efficient fine-tuning and inference with CellPLM package

In order to provide a convenient interface for users, we introduce the CellPLM package. It can be easily installed via pip install cellplm, and it provides several pipeline modules for fine-tuning and inference on customized AnnData datasets. For example, the cell type annotation pipeline can be used to quickly adapt a pre-trained CellPLM model to a cell type annotation task. Specifically, we initialize a pipeline by specifying a pre-trained checkpoint:

from CellPLM.pipeline.cell_type_annotation import CellTypeAnnotationPipeline

pipeline = CellTypeAnnotationPipeline(
    pretrain_prefix=prefix_of_the_checkpoint,   # name (prefix) of the pre-trained checkpoint
    pretrain_directory=path_to_the_checkpoint,  # directory containing the checkpoint files
)

With CellPLM pipelines, fine-tuning, inference, and evaluation can all be carried out through scikit-learn-like protocols (i.e., the fit, predict, and score methods). For example, calling the fit method automatically fine-tunes the model on a given AnnData dataset.

pipeline.fit(
    data,                        # an AnnData object
    pipeline_config,             # optional config dictionary for additional arguments
    label_fields=['celltype'],   # column in .obs that contains cell type labels
)

After fine-tuning, the pipeline can annotate new datasets via the predict method. The predicted labels will be returned as a torch tensor.

pipeline.predict(
    data,              # an AnnData object
    pipeline_config,   # optional config dictionary for additional arguments
)
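The pipeline also exposes a score method for evaluation, as noted above. The call below simply mirrors the pattern of fit as an illustrative assumption; the exact arguments are best checked in the package documentation.

pipeline.score(
    data,                        # an AnnData object with ground-truth labels
    pipeline_config,             # optional config dictionary for additional arguments
    label_fields=['celltype'],   # column in .obs holding the true cell type labels
)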

In addition to the cell type annotation pipeline, we currently provide two other pipelines, for cell embedding and spatial imputation, which follow the same interface (see the sketch below). We plan to optimize the pipelines for industrial applications and to support more downstream tasks in the future.
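For instance, the cell embedding pipeline can be used in much the same way. The import path and argument names below follow the naming convention of the annotation pipeline and should be verified against the GitHub repo.

# Illustrative usage only; verify the import path and arguments in the CellPLM repo.
from CellPLM.pipeline.cell_embedding import CellEmbeddingPipeline

embedding_pipeline = CellEmbeddingPipeline(
    pretrain_prefix=prefix_of_the_checkpoint,
    pretrain_directory=path_to_the_checkpoint,
)
embeddings = embedding_pipeline.predict(data)   # zero-shot cell embeddings for an AnnData object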

Conclusion

In this work, we propose the cell language model, a novel paradigm of single-cell pre-trained model, which aligns well with the fundamental characteristics of single-cell data. This has led to CellPLM, the first pre-trained transformer framework that encodes inter-cell relations, leverages spatially-resolved transcriptomic data and adopts a reasonable prior distribution. Our experiments on various downstream tasks demonstrate the power of CellPLM. 

The performance of CellPLM on cell type annotation tasks is shown in the table.

For more empirical results, please refer to our paper. For more instructions on the package, check out our official GitHub repo. Feel free to reach out if you have any questions!
