A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data

Bibliographic Details
Main Authors: Anjali Silva, Steven J. Rothstein, Paul D. McNicholas, Sanjeena Subedi
Format: Article
Language: English
Published: BMC 2019-07-01
Series: BMC Bioinformatics
Online Access: http://link.springer.com/article/10.1186/s12859-019-2916-0
Collection: DOAJ
Description: Abstract
Background: High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. Analyzing the network itself or the interplay between genes in this type of data continues to present many challenges. As data visualization techniques become cumbersome for higher dimensions and unconvincing when there is no clear separation between homogeneous subgroups within the data, cluster analysis provides an intuitive alternative. The aim of applying mixture model-based clustering in this context is to discover groups of co-expressed genes, which can shed light on biological functions and pathways of gene products.
Results: A mixture of multivariate Poisson-log normal (MPLN) model is developed for clustering of high-throughput transcriptome sequencing data. Parameter estimation is carried out using a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm, and information criteria are used for model selection.
Conclusions: The mixture of MPLN model is able to fit a wide range of correlation and overdispersion situations, and is suited for modeling multivariate count data from RNA sequencing studies. All scripts used for implementing the method can be found at https://github.com/anjalisilva/MPLNClust.
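As a rough illustration of the kind of data the abstract refers to, the following Python sketch simulates counts from a mixture of multivariate Poisson-log normal components: a latent Gaussian vector is drawn for each observation and exponentiated to give Poisson rates. It is only a sketch under stated assumptions. The function name `sample_mpln_mixture` and all parameter values are hypothetical, no normalization for library size is included, and it is not the authors' implementation (their scripts are at the MPLNClust repository linked above).

```python
import numpy as np


def sample_mpln_mixture(n, weights, means, covs, seed=0):
    """Draw n count vectors from a mixture of Poisson-log normal components.

    Hypothetical helper for illustration only: component g has latent
    theta ~ N(means[g], covs[g]) and counts Y_j ~ Poisson(exp(theta_j)).
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Pick a component label for every observation.
    labels = rng.choice(len(weights), size=n, p=weights)
    d = len(means[0])
    counts = np.empty((n, d), dtype=np.int64)
    for g in range(len(weights)):
        idx = np.flatnonzero(labels == g)
        # Latent log-rates carry the correlation structure.
        theta = rng.multivariate_normal(means[g], covs[g], size=idx.size)
        # Conditional on theta, the counts are independent Poisson draws.
        counts[idx] = rng.poisson(np.exp(theta))
    return counts, labels


# Assumed example parameters: two components, three dimensions.
weights = [0.6, 0.4]
means = [[1.0, 2.0, 1.5], [4.0, 3.0, 5.0]]
covs = [
    [[0.5, 0.3, 0.0], [0.3, 0.5, 0.0], [0.0, 0.0, 0.4]],
    [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2]],
]
counts, labels = sample_mpln_mixture(2000, weights, means, covs)

# Within a component, the sample variance exceeds the sample mean
# (overdispersion relative to a plain Poisson model).
comp0 = counts[labels == 0]
print(comp0.mean(axis=0), comp0.var(axis=0))
```

Because the Gaussian layer sits on the log scale, each marginal variance is larger than its mean, and off-diagonal covariance entries induce correlation between count dimensions, which is the behavior the Conclusions attribute to the model.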
Record ID: doaj.art-ee72178a189a44eabfde4c056f076cdf
Citation: BMC Bioinformatics, vol. 20, no. 1, pp. 1-11 (2019-07-01). ISSN 1471-2105. DOI: 10.1186/s12859-019-2916-0
Author Affiliations:
Anjali Silva: Department of Mathematics and Statistics, University of Guelph
Steven J. Rothstein: Department of Molecular and Cellular Biology, University of Guelph
Paul D. McNicholas: Department of Mathematics and Statistics, McMaster University
Sanjeena Subedi: Department of Mathematical Sciences, Binghamton University
Keywords: Clustering; RNA sequencing; Discrete data; Multivariate Poisson-log normal distribution; Markov chain Monte Carlo; Co-expression networks