Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model
Main Authors: | Dongyang Yang, Wei Xu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-10-01 |
Series: | Microorganisms |
Subjects: | clustering; microbiome; unsupervised learning; high-dimension |
Online Access: | https://www.mdpi.com/2076-2607/8/10/1612 |
_version_ | 1797550362982350848 |
---|---|
author | Dongyang Yang Wei Xu |
author_sort | Dongyang Yang |
collection | DOAJ |
description | Modeling and analyzing the human microbiome allows assessment of the microbial community and its impact on human health. Microbiome composition can be quantified with 16S rRNA technology as sequencing data, which are usually skewed and heavy-tailed with excess zeros. Clustering methods are useful in personalized medicine because they identify subgroups for patient stratification. However, there is currently no standardized clustering method for complex microbiome sequencing data. We propose a clustering algorithm with a specific beta diversity measure that addresses the presence-absence bias encountered in sparse count data and effectively measures sample distances for sample stratification. The distance measure used for clustering is derived from a parametric mixture model that produces sample-specific distributions conditional on the observed operational taxonomic unit (OTU) counts and the estimated mixture weights. The method can provide accurate estimates of the true zero proportions and thus construct a precise beta diversity measure. Extensive simulation studies suggest that the proposed method achieves substantial clustering improvement over some widely used distance measures when a large proportion of zeros is present. The proposed algorithm was applied to a human gut microbiome study on Parkinson’s disease to identify distinct microbiome states with biological interpretations. |
first_indexed | 2024-03-10T15:28:09Z |
format | Article |
id | doaj.art-03a97458f4224fbcaa9fb056e18b9ab4 |
institution | Directory of Open Access Journals |
issn | 2076-2607 |
language | English |
last_indexed | 2024-03-10T15:28:09Z |
publishDate | 2020-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Microorganisms |
title | Clustering on Human Microbiome Sequencing Data: A Distance-Based Unsupervised Learning Model |
topic | clustering; microbiome; unsupervised learning; high-dimension |
url | https://www.mdpi.com/2076-2607/8/10/1612 |
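
The description refers to a presence-absence bias in beta diversity measures on sparse count data. The sketch below is not the authors' mixture-model distance; it only illustrates that bias with two widely used measures, the Bray-Curtis dissimilarity and the presence-absence Jaccard distance. The OTU counts and the names `samples` and `distance_table` are hypothetical, chosen for illustration.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def jaccard(u, v):
    """Presence-absence Jaccard distance between two count vectors."""
    both = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1 - both / either if either else 0.0


# Hypothetical OTU counts (one list per sample, one entry per OTU).
# s1 and s2 share the same dominant OTUs and differ only in
# singleton counts that could plausibly be sampling zeros.
samples = {
    "s1": [120, 30, 1, 0, 0],
    "s2": [115, 28, 0, 1, 0],
    "s3": [0, 2, 90, 60, 15],
}


def distance_table(data, dist):
    """Pairwise distances keyed by (sample_a, sample_b)."""
    return {(a, b): dist(data[a], data[b]) for a, b in combinations(data, 2)}


bc = distance_table(samples, bray_curtis)
jc = distance_table(samples, jaccard)

for pair in bc:
    print(pair, f"Bray-Curtis={bc[pair]:.3f}", f"Jaccard={jc[pair]:.3f}")
```

In this toy data the Jaccard distance between s1 and s2 is driven entirely by the singleton counts, so it is almost as large as the distance between s1 and s3, while Bray-Curtis separates the two cases clearly. A distance that models which zeros are true absences, as the description proposes, targets this kind of distortion.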