Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life

Abstract In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When traine...


Bibliographic Details
Main Authors: Logan Hallee, Bohdan B. Khomtchouk
Format: Article
Language: English
Published: Nature Portfolio 2023-02-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-023-28965-7
author Logan Hallee
Bohdan B. Khomtchouk
collection DOAJ
description Abstract In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement.
first_indexed 2024-04-10T15:44:50Z
format Article
id doaj.art-fd3fb3e673384f4bb048337f1b230de7
institution Directory Open Access Journal
issn 2045-2322
language English
last_indexed 2024-04-10T15:44:50Z
publishDate 2023-02-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj.art-fd3fb3e673384f4bb048337f1b230de7 | 2023-02-12T12:12:29Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2023-02-01 | 13 | 1 | 1 | 14 | 10.1038/s41598-023-28965-7 | Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life | Logan Hallee 0 | Bohdan B. Khomtchouk 1 | Center for Bioinformatics and Computational Biology, University of Delaware | Department of BioHealth Informatics, Center for Computational Biology and Bioinformatics, Indiana University | Abstract In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement. | https://doi.org/10.1038/s41598-023-28965-7
title Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life
url https://doi.org/10.1038/s41598-023-28965-7