Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model

Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified with 16S rRNA sequencing, which yields data that are usually skewed and heavy-tailed with excess zeros. Clustering methods are usefu...


Bibliographic Details
Main Authors:	Dongyang Yang, Wei Xu
Format:	Article
Language:	English
Published:	MDPI AG 2020-10-01
Series:	Microorganisms
Subjects:	clustering microbiome unsupervised learning high-dimension
Online Access:	https://www.mdpi.com/2076-2607/8/10/1612

_version_	1797550362982350848
id	doaj.art-03a97458f4224fbcaa9fb056e18b9ab4
collection	DOAJ
institution	Directory Open Access Journal
record_format	Article
first_indexed	2024-03-10T15:28:09Z
last_indexed	2024-03-10T15:28:09Z
issn	2076-2607
doi	10.3390/microorganisms8101612
affiliation	Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada (both authors)
description	Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified with 16S rRNA sequencing, which yields data that are usually skewed and heavy-tailed with excess zeros. Clustering methods are useful in personalized medicine because they identify subgroups for patient stratification. However, there is currently no standardized clustering method for complex microbiome sequencing data. We propose a clustering algorithm with a specific beta diversity measure that addresses the presence-absence bias encountered in sparse count data and effectively measures the distances between samples for stratification. The distance measure is derived from a parametric mixture model that produces sample-specific distributions conditional on the observed operational taxonomic unit (OTU) counts and the estimated mixture weights. The method can provide accurate estimates of the true zero proportions and thus yields a precise beta diversity measure. Extensive simulation studies suggest that the proposed method substantially improves clustering compared with some widely used distance measures when a large proportion of zeros is present. The proposed algorithm was applied to a human gut microbiome study of Parkinson's disease to identify distinct microbiome states with biological interpretations.
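
The description above outlines a distance-based workflow: compute a beta diversity distance between every pair of samples, then cluster the samples on that distance matrix. The sketch below illustrates only that general workflow and is not the authors' method. It assumes Bray-Curtis dissimilarity as a stand-in distance (the record does not name the measures used for comparison, and the proposed mixture-model distance is not reproduced here), a simple k-medoids routine with farthest-first initialization, and a made-up toy OTU count table. All function names and data are hypothetical.

```python
# Illustrative sketch only: Bray-Curtis + k-medoids on toy OTU counts.
# This is NOT the mixture-model distance described in the record.

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0

def distance_matrix(samples):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = bray_curtis(samples[i], samples[j])
    return d

def k_medoids(d, k, max_iter=100):
    """Cluster on a precomputed distance matrix; returns (labels, medoids)."""
    n = len(d)
    # Farthest-first initialization: start at sample 0, then repeatedly add
    # the sample whose nearest chosen medoid is farthest away.
    medoids = [0]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(d[i][m] for m in medoids)))

    def assign(meds):
        # Label each sample with the index of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep medoid of empty cluster
                continue
            # New medoid: member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(d[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids

# Hypothetical toy table: rows are samples, columns are OTU counts.
# Zeros dominate, as is typical for sparse microbiome data.
samples = [
    [30, 12, 0, 0, 0],
    [25, 0, 3, 0, 0],
    [40, 9, 0, 0, 1],
    [0, 0, 0, 22, 15],
    [0, 1, 0, 30, 0],
    [0, 0, 2, 18, 20],
]

d = distance_matrix(samples)
labels, medoids = k_medoids(d, k=2)
print("cluster labels:", labels)
print("medoid samples:", medoids)
```

Because the clustering step takes only a precomputed distance matrix, a different beta diversity measure could be substituted for `bray_curtis` without changing `k_medoids`.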


Similar Items