Evaluating Plant Gene Models Using Machine Learning

Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident...


Bibliographic Details
Main Authors: Shriprabha R. Upadhyaya, Philipp E. Bayer, Cassandria G. Tay Fernandez, Jakob Petereit, Jacqueline Batley, Mohammed Bennamoun, Farid Boussaid, David Edwards
Format: Article
Language: English
Published: MDPI AG 2022-06-01
Series: Plants
Subjects: gene models, pea, machine learning, XGBoost, SHAP
Online Access: https://www.mdpi.com/2223-7747/11/12/1619
_version_ 1797483159380557824
author Shriprabha R. Upadhyaya
Philipp E. Bayer
Cassandria G. Tay Fernandez
Jakob Petereit
Jacqueline Batley
Mohammed Bennamoun
Farid Boussaid
David Edwards
author_sort Shriprabha R. Upadhyaya
collection DOAJ
description Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or that belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boosting (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91–0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein-based and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes.
first_indexed 2024-03-09T22:42:59Z
format Article
id doaj.art-dadb75edadb849c8b3a521c62cc90b1e
institution Directory Open Access Journal
issn 2223-7747
language English
last_indexed 2024-03-09T22:42:59Z
publishDate 2022-06-01
publisher MDPI AG
record_format Article
series Plants
spelling doaj.art-dadb75edadb849c8b3a521c62cc90b1e; 2023-11-23T18:35:30Z; eng; MDPI AG; Plants; 2223-7747; 2022-06-01; 11; 12; 1619; 10.3390/plants11121619; Evaluating Plant Gene Models Using Machine Learning
Shriprabha R. Upadhyaya (0): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
Philipp E. Bayer (1): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
Cassandria G. Tay Fernandez (2): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
Jakob Petereit (3): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
Jacqueline Batley (4): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
Mohammed Bennamoun (5): Department of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6000, Australia
Farid Boussaid (6): Department of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, WA 6000, Australia
David Edwards (7): School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia
https://www.mdpi.com/2223-7747/11/12/1619
gene models; pea; machine learning; XGBoost; SHAP
title Evaluating Plant Gene Models Using Machine Learning
topic gene models
pea
machine learning
XGBoost
SHAP
url https://www.mdpi.com/2223-7747/11/12/1619
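The description above mentions sequence-based features and reports both accuracy and an F1 score. As a rough illustration only, the following Python sketch shows what two simple sequence features and the two metrics could look like. This is not the authors' Truegene code: the function names and the choice of GC content and amino acid fractions as features are assumptions made for illustration, since the record does not list the 14 gene-based and 41 protein-based features actually used.

```python
from collections import Counter

# Hypothetical feature and function names, chosen for illustration only.
# The record does not say which features Truegene computes.


def gc_content(nucleotides: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = nucleotides.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def amino_acid_fractions(protein: str) -> dict:
    """Relative frequency of each amino acid in a protein sequence."""
    seq = protein.upper()
    if not seq:
        return {}
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels (1 = likely real gene model)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        return 0.0, 0.0
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(amino_acid_fractions("MMKK"))
    print(accuracy_and_f1([1, 1, 1, 0], [1, 1, 0, 0]))
```

Feature values of this kind would form one row per gene model in the table passed to a classifier. F1 is the harmonic mean of precision and recall on the positive class, so it ignores true negatives and need not equal accuracy.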