Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model

Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified with 16S rRNA sequencing, which yields count data that are usually skewed and heavy-tailed with excess zeros. Clustering methods are usefu...

Bibliographic Details
Main Authors: Dongyang Yang, Wei Xu
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Microorganisms
Subjects: clustering, microbiome, unsupervised learning, high-dimension
Online Access: https://www.mdpi.com/2076-2607/8/10/1612
_version_ 1797550362982350848
author Dongyang Yang
Wei Xu
author_facet Dongyang Yang
Wei Xu
author_sort Dongyang Yang
collection DOAJ
description Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified with 16S rRNA sequencing, which yields count data that are usually skewed and heavy-tailed with excess zeros. Clustering methods are useful in personalized medicine because they identify subgroups for patient stratification. However, there is currently no standardized clustering method for complex microbiome sequencing data. We propose a clustering algorithm with a specific beta diversity measure that addresses the presence-absence bias encountered in sparse count data and effectively measures the distances between samples for stratification. The distance measure is derived from a parametric mixture model that produces sample-specific distributions conditional on the observed operational taxonomic unit (OTU) counts and the estimated mixture weights. The method can provide accurate estimates of the true zero proportions and thus construct a precise beta diversity measure. Extensive simulation studies suggest that the proposed method substantially improves clustering compared with some widely used distance measures when a large proportion of zeros is present. The proposed algorithm was applied to a human gut microbiome study of Parkinson’s disease to identify distinct microbiome states with biological interpretations.
first_indexed 2024-03-10T15:28:09Z
format Article
id doaj.art-03a97458f4224fbcaa9fb056e18b9ab4
institution Directory of Open Access Journals
issn 2076-2607
language English
last_indexed 2024-03-10T15:28:09Z
publishDate 2020-10-01
publisher MDPI AG
record_format Article
series Microorganisms
spelling doaj.art-03a97458f4224fbcaa9fb056e18b9ab4
2023-11-20T17:49:42Z
eng
MDPI AG
Microorganisms
2076-2607
2020-10-01
8
10
1612
10.3390/microorganisms8101612
Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
Dongyang Yang
Wei Xu
Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
https://www.mdpi.com/2076-2607/8/10/1612
clustering
microbiome
unsupervised learning
high-dimension
title Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
title_full Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
title_fullStr Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
title_full_unstemmed Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
title_short Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
title_sort clustering on human microbiome sequencing data a distance based unsupervised learning model
topic clustering
microbiome
unsupervised learning
high-dimension
url https://www.mdpi.com/2076-2607/8/10/1612