<i>hcapca</i>: Automated Hierarchical Clustering and Principal Component Analysis of Large Metabolomic Datasets in R

Microbial natural product discovery programs face two main challenges today: rapidly prioritizing strains for discovering new molecules and avoiding the rediscovery of already known molecules. Typically, these problems have been tackled using biological assays to identify promising strains and techn...


Bibliographic Details
Main Authors: Shaurya Chanana, Chris S. Thomas, Fan Zhang, Scott R. Rajski, Tim S. Bugni
Format: Article
Language: English
Published: MDPI AG 2020-07-01
Series: Metabolites
Subjects: metabolites, genomics, PCA, HCA, dendrogram, variance
Online Access: https://www.mdpi.com/2218-1989/10/7/297
collection DOAJ
description Microbial natural product discovery programs face two main challenges today: rapidly prioritizing strains for discovering new molecules and avoiding the rediscovery of already known molecules. Typically, these problems have been tackled using biological assays to identify promising strains and techniques that model variance in a dataset, such as PCA, to highlight novel chemistry. While these tools have been successful in the past, datasets are becoming much larger and require a new approach. Because PCA models depend on the members of the group being modeled, large datasets with many members make it difficult to model the variance in the data accurately. Our tool, <b><i>hcapca</i></b>, first groups strains based on the similarity of their chemical composition and then applies PCA to the smaller sub-groups, yielding more robust PCA models. This allows for scalable chemical comparisons among hundreds of strains with thousands of molecular features. As a proof of concept, we applied our open-source tool to a dataset with 1046 LC-MS profiles of marine-invertebrate-associated bacteria and discovered three new analogs of an established anticancer agent from one promising strain.
first_indexed 2024-03-10T18:20:14Z
format Article
id doaj.art-d46e9349703744f2a32b048eb32c96e0
institution Directory Open Access Journal
issn 2218-1989
language English
last_indexed 2024-03-10T18:20:14Z
publishDate 2020-07-01
publisher MDPI AG
record_format Article
series Metabolites
spelling doaj.art-d46e9349703744f2a32b048eb32c96e0
2023-11-20T07:26:56Z
eng
MDPI AG
Metabolites
2218-1989
2020-07-01
Volume 10, Issue 7, Article 297
DOI 10.3390/metabo10070297
<i>hcapca</i>: Automated Hierarchical Clustering and Principal Component Analysis of Large Metabolomic Datasets in R
Shaurya Chanana, Chris S. Thomas, Fan Zhang, Scott R. Rajski, Tim S. Bugni
Pharmaceutical Sciences Division, School of Pharmacy, University of Wisconsin, Madison, WI 53705, USA (all authors)
https://www.mdpi.com/2218-1989/10/7/297
metabolites, genomics, PCA, HCA, dendrogram, variance
topic metabolites
genomics
PCA
HCA
dendrogram
variance
url https://www.mdpi.com/2218-1989/10/7/297
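The description in this record outlines a two-stage idea: group samples by chemical similarity first, then fit PCA inside each smaller group. The following is a minimal, illustrative sketch of that idea in plain Python and is not the hcapca implementation (hcapca itself is written in R). Everything here is an assumption made for illustration: the function names, the use of Euclidean distance with average linkage, the fixed number of clusters, and the power-iteration estimate of the first principal component.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_clusters(rows, n_clusters):
    """Agglomerative clustering (assumed: average linkage, Euclidean distance).

    Repeatedly merges the two closest clusters until n_clusters remain and
    returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(euclidean(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((p, q) for p in range(len(clusters))
                 for q in range(p + 1, len(clusters)))
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so position a is unaffected
    return clusters


def first_principal_component(rows, iterations=200):
    """Estimate PC1 by power iteration on the sample covariance matrix.

    Needs at least two rows. Returns (unit direction, fraction of total
    variance explained). The fixed start vector is a simplification; it
    would fail if it happened to be orthogonal to the true PC1.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    start = [1.0 / (k + 1) for k in range(d)]
    start_norm = math.sqrt(sum(x * x for x in start))
    v = [x / start_norm for x in start]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance at all in this group
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    return v, (eigenvalue / total if total else 0.0)


def cluster_then_pca(rows, n_clusters):
    """Cluster the rows, then fit PC1 separately inside each cluster."""
    results = []
    for members in average_linkage_clusters(rows, n_clusters):
        sub = [rows[i] for i in members]
        direction, explained = first_principal_component(sub)
        results.append((sorted(members), direction, explained))
    return results


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of four samples each.
    profiles = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0],
                [100, 100, 100], [100, 101, 100], [100, 102, 100], [100, 103, 100]]
    for members, direction, explained in cluster_then_pca(profiles, 2):
        print(members, [round(x, 3) for x in direction], round(explained, 3))
```

On toy data like this, a single PCA over all samples is dominated by the separation between the two groups, while the per-group fits expose the feature that varies within each group. That is the motivation the description gives for clustering before PCA; how the real tool decides when and where to split is not covered by this sketch.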