Identification of Interpretable Clusters and Associated Signatures in Breast Cancer Single-Cell Data: A Topic Modeling Approach

Bibliographic Details
Main Authors: Gabriele Malagoli, Filippo Valle, Emmanuel Barillot, Michele Caselle, Loredana Martignetti
Format: Article
Language: English
Published: MDPI AG 2024-03-01
Series: Cancers
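Subjects: topic modeling; hierarchical stochastic block modeling; single-cell RNA-seq; long non-coding RNAs; breast cancer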
Online Access: https://www.mdpi.com/2072-6694/16/7/1350
author Gabriele Malagoli
Filippo Valle
Emmanuel Barillot
Michele Caselle
Loredana Martignetti
collection DOAJ
description Topic modeling is a popular technique in machine learning and natural language processing, where a corpus of text documents is classified into themes or topics using word frequency analysis. This approach has proven successful in various biological data analysis applications, such as predicting cancer subtypes with high accuracy and identifying genes, enhancers, and stable cell types simultaneously from sparse single-cell epigenomics data. The advantage of using a topic model is that it not only serves as a clustering algorithm, but it can also explain clustering results by providing word probability distributions over topics. Our study proposes a novel topic modeling approach for clustering single cells and detecting topics (gene signatures) in single-cell datasets that measure multiple omics simultaneously. We applied this approach to examine the transcriptional heterogeneity of luminal and triple-negative breast cancer cells using patient-derived xenograft models with acquired resistance to chemotherapy and targeted therapy. Through this approach, we identified protein-coding genes and long non-coding RNAs (lncRNAs) that group thousands of cells into biologically similar clusters, accurately distinguishing drug-sensitive and -resistant breast cancer types. In comparison to standard state-of-the-art clustering analyses, our approach offers an optimal partitioning of genes into topics and cells into clusters simultaneously, producing easily interpretable clustering outcomes. Additionally, we demonstrate that an integrative clustering approach, which combines the information from mRNAs and lncRNAs treated as disjoint omics layers, enhances the accuracy of cell classification.
first_indexed 2024-04-24T10:47:20Z
format Article
id doaj.art-0b25ef683fe7460eb07cf09323e053ca
institution Directory of Open Access Journals
issn 2072-6694
language English
last_indexed 2024-04-24T10:47:20Z
publishDate 2024-03-01
publisher MDPI AG
record_format Article
series Cancers
spelling doaj.art-0b25ef683fe7460eb07cf09323e053ca
2024-04-12T13:16:07Z
eng
MDPI AG
Cancers
2072-6694
2024-03-01
Volume 16, issue 7, article 1350
DOI: 10.3390/cancers16071350
Identification of Interpretable Clusters and Associated Signatures in Breast Cancer Single-Cell Data: A Topic Modeling Approach
Gabriele Malagoli (Institut Curie, Inserm U900, Mines ParisTech, PSL Research University, 75248 Paris, France)
Filippo Valle (Physics Department, University of Turin and INFN, 10125 Turin, Italy)
Emmanuel Barillot (Institut Curie, Inserm U900, Mines ParisTech, PSL Research University, 75248 Paris, France)
Michele Caselle (Physics Department, University of Turin and INFN, 10125 Turin, Italy)
Loredana Martignetti (Institut Curie, Inserm U900, Mines ParisTech, PSL Research University, 75248 Paris, France)
https://www.mdpi.com/2072-6694/16/7/1350
title Identification of Interpretable Clusters and Associated Signatures in Breast Cancer Single-Cell Data: A Topic Modeling Approach
topic topic modeling
hierarchical stochastic block modeling
single-cell RNA-seq
long non-coding RNAs
breast cancer
url https://www.mdpi.com/2072-6694/16/7/1350