Semi-supervised integration of single-cell transcriptomics data

Abstract Batch effects in single-cell RNA-seq data pose a significant challenge for comparative analyses across samples, individuals, and conditions. Although batch effect correction methods are routinely applied, data integration often leads to overcorrection and can result in the loss of biologica...

Bibliographic Details
Main Authors: Massimo Andreatta, Léonard Hérault, Paul Gueguen, David Gfeller, Ariel J. Berenstein, Santiago J. Carmona
Format: Article
Language: English
Published: Nature Portfolio 2024-01-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-024-45240-z
collection DOAJ
description Abstract Batch effects in single-cell RNA-seq data pose a significant challenge for comparative analyses across samples, individuals, and conditions. Although batch effect correction methods are routinely applied, data integration often leads to overcorrection and can result in the loss of biological variability. In this work we present STACAS, a batch correction method for scRNA-seq that leverages prior knowledge on cell types to preserve biological variability upon integration. Through an open-source benchmark, we show that semi-supervised STACAS outperforms state-of-the-art unsupervised methods, as well as supervised methods such as scANVI and scGen. STACAS scales well to large datasets and is robust to incomplete and imprecise input cell type labels, which are commonly encountered in real-life integration tasks. We argue that the incorporation of prior cell type information should be a common practice in single-cell data integration, and we provide a flexible framework for semi-supervised batch effect correction.
first_indexed 2024-03-07T14:53:56Z
format Article
id doaj.art-d35b657fcd2e4fe8aefe46b618827fe8
institution Directory Open Access Journal
issn 2041-1723
language English
last_indexed 2024-03-07T14:53:56Z
publishDate 2024-01-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling doaj.art-d35b657fcd2e4fe8aefe46b618827fe8
2024-03-05T19:33:33Z
eng
Nature Portfolio
Nature Communications
2041-1723
2024-01-01
Volume 15, issue 1, pages 1-13
10.1038/s41467-024-45240-z
Semi-supervised integration of single-cell transcriptomics data
Massimo Andreatta (0): Department of Oncology, Lausanne Branch, Ludwig Institute for Cancer Research, CHUV and University of Lausanne
Léonard Hérault (1): Department of Oncology, Lausanne Branch, Ludwig Institute for Cancer Research, CHUV and University of Lausanne
Paul Gueguen (2): Department of Oncology, Lausanne Branch, Ludwig Institute for Cancer Research, CHUV and University of Lausanne
David Gfeller (3): Department of Oncology, Lausanne Branch, Ludwig Institute for Cancer Research, CHUV and University of Lausanne
Ariel J. Berenstein (4): Laboratorio de Biología Molecular, División Patología, Instituto Multidisciplinario de Investigaciones en Patologías Pediátricas (IMIPP), CONICET-GCBA
Santiago J. Carmona (5): Department of Oncology, Lausanne Branch, Ludwig Institute for Cancer Research, CHUV and University of Lausanne
title Semi-supervised integration of single-cell transcriptomics data
url https://doi.org/10.1038/s41467-024-45240-z