Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics

Computational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent due to the limited number of authentic reference materials available for traditional library building. These so-called “reference-free metabolomics” me...

Full description

Bibliographic Details
Main Authors:	Felicity F. Nielson, Bill Kay, Stephen J. Young, Sean M. Colby, Ryan S. Renslow, Thomas O. Metz
Format:	Article
Language:	English
Published:	MDPI AG 2023-01-01
Series:	Metabolites
Subjects:	conformer downselection graph metabolomics molecule Monte Carlo
Online Access:	https://www.mdpi.com/2218-1989/13/1/105

_version_	1797438789733318656
author	Felicity F. Nielson Bill Kay Stephen J. Young Sean M. Colby Ryan S. Renslow Thomas O. Metz
author_facet	Felicity F. Nielson Bill Kay Stephen J. Young Sean M. Colby Ryan S. Renslow Thomas O. Metz
author_sort	Felicity F. Nielson
collection	DOAJ
description	Computational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent due to the limited number of authentic reference materials available for traditional library building. These so-called “reference-free metabolomics” methods require sampling sets of molecular conformers in order to produce high accuracy property predictions. Due to the computational cost of the subsequent calculations for each conformer, there is a need to sample the most relevant subset and avoid repeating calculations on conformers that are nearly identical. The goal of this study is to introduce a heuristic method of finding the most dissimilar conformers from a larger population in order to help speed up reference-free calculation methods and maintain a high property prediction accuracy. Finding the set of the <i>n</i> items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either <i>n</i> or the population size grows large. Because there exists a pairwise relationship between each item and all other items in the population, finding the <i>set</i> of the <i>n</i> most dissimilar items is different than simply sorting an array of numbers. For instance, if you have a set of the most dissimilar <i>n</i> = 4 items, one or more of the items from <i>n</i> = 4 might not be in the set <i>n</i> = 5. An exact solution would have to search all possible combinations of size <i>n</i> in the population exhaustively. We present an open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the <i>n</i> most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for SDS to find the set of <i>n</i> most dissimilar conformers, our method is not only orders of magnitude faster, but it is also more accurate than running Monte Carlo for 1,000,000 iterations, each searching for set sizes <i>n</i> = 3–7 out of a population of 50,000. We also benchmark SDS against the exact solution for example small populations, showing that SDS produces a solution close to the exact solution in these instances. Using theoretical approaches, we also demonstrate the constraints of the greedy algorithm and its efficacy as a ratio to the exact solution.
first_indexed	2024-03-09T11:42:22Z
format	Article
id	doaj.art-41a3b71695bf483995f34029fc39cc4b
institution	Directory Open Access Journal
issn	2218-1989
language	English
last_indexed	2024-03-09T11:42:22Z
publishDate	2023-01-01
publisher	MDPI AG
record_format	Article
series	Metabolites
spelling	doaj.art-41a3b71695bf483995f34029fc39cc4b2023-11-30T23:29:03ZengMDPI AGMetabolites2218-19892023-01-0113110510.3390/metabo13010105Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free MetabolomicsFelicity F. Nielson0Bill Kay1Stephen J. Young2Sean M. Colby3Ryan S. Renslow4Thomas O. Metz5Pacific Northwest National Laboratory, Biological Sciences Division, Richland, WA 99354, USAPacific Northwest National Laboratory, Advanced Computing, Mathematics, and Data Division, Richland, WA 99354, USAPacific Northwest National Laboratory, Advanced Computing, Mathematics, and Data Division, Richland, WA 99354, USAPacific Northwest National Laboratory, Biological Sciences Division, Richland, WA 99354, USAPacific Northwest National Laboratory, Biological Sciences Division, Richland, WA 99354, USAPacific Northwest National Laboratory, Biological Sciences Division, Richland, WA 99354, USAComputational methods for creating in silico libraries of molecular descriptors (e.g., collision cross sections) are becoming increasingly prevalent due to the limited number of authentic reference materials available for traditional library building. These so-called “reference-free metabolomics” methods require sampling sets of molecular conformers in order to produce high accuracy property predictions. Due to the computational cost of the subsequent calculations for each conformer, there is a need to sample the most relevant subset and avoid repeating calculations on conformers that are nearly identical. The goal of this study is to introduce a heuristic method of finding the most dissimilar conformers from a larger population in order to help speed up reference-free calculation methods and maintain a high property prediction accuracy. Finding the set of the <i>n</i> items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either <i>n</i> or the population size grows large. Because there exists a pairwise relationship between each item and all other items in the population, finding the <i>set</i> of the <i>n</i> most dissimilar items is different than simply sorting an array of numbers. For instance, if you have a set of the most dissimilar <i>n</i> = 4 items, one or more of the items from <i>n</i> = 4 might not be in the set <i>n</i> = 5. An exact solution would have to search all possible combinations of size <i>n</i> in the population exhaustively. We present an open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the <i>n</i> most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for SDS to find the set of <i>n</i> most dissimilar conformers, our method is not only orders of magnitude faster, but it is also more accurate than running Monte Carlo for 1,000,000 iterations, each searching for set sizes <i>n</i> = 3–7 out of a population of 50,000. We also benchmark SDS against the exact solution for example small populations, showing that SDS produces a solution close to the exact solution in these instances. Using theoretical approaches, we also demonstrate the constraints of the greedy algorithm and its efficacy as a ratio to the exact solution.https://www.mdpi.com/2218-1989/13/1/105conformerdownselectiongraphmetabolomicsmoleculeMonte Carlo
spellingShingle	Felicity F. Nielson Bill Kay Stephen J. Young Sean M. Colby Ryan S. Renslow Thomas O. Metz Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics Metabolites conformer downselection graph metabolomics molecule Monte Carlo
title	Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
title_full	Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
title_fullStr	Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
title_full_unstemmed	Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
title_short	Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics
title_sort	similarity downselection finding the i n i most dissimilar molecular conformers for reference free metabolomics
topic	conformer downselection graph metabolomics molecule Monte Carlo
url	https://www.mdpi.com/2218-1989/13/1/105
work_keys_str_mv	AT felicityfnielson similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics AT billkay similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics AT stephenjyoung similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics AT seanmcolby similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics AT ryansrenslow similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics AT thomasometz similaritydownselectionfindingtheinimostdissimilarmolecularconformersforreferencefreemetabolomics

Similarity Downselection: Finding the <i>n</i> Most Dissimilar Molecular Conformers for Reference-Free Metabolomics

Similar Items