Evaluating Plant Gene Models Using Machine Learning
Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident...
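Main Authors: | Shriprabha R. Upadhyaya, Philipp E. Bayer, Cassandria G. Tay Fernandez, Jakob Petereit, Jacqueline Batley, Mohammed Bennamoun, Farid Boussaid, David Edwards |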
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-06-01 |
Series: | Plants |
Subjects: | gene models, pea, machine learning, XGBoost, SHAP |
Online Access: | https://www.mdpi.com/2223-7747/11/12/1619 |
_version_ | 1797483159380557824 |
---|---|
author | Shriprabha R. Upadhyaya, Philipp E. Bayer, Cassandria G. Tay Fernandez, Jakob Petereit, Jacqueline Batley, Mohammed Bennamoun, Farid Boussaid, David Edwards |
author_sort | Shriprabha R. Upadhyaya |
collection | DOAJ |
description | Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or that belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction, we have developed Truegene, a machine learning approach to classify potential low-confidence gene models using 14 gene-based and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high-confidence) and non-conserved (low-confidence) annotated genes from the published *Pisum sativum* Cameor genome. These features were used to train eXtreme Gradient Boost (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F1 score of 0.91–0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein-based and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes. |
first_indexed | 2024-03-09T22:42:59Z |
format | Article |
id | doaj.art-dadb75edadb849c8b3a521c62cc90b1e |
institution | Directory Open Access Journal |
issn | 2223-7747 |
language | English |
last_indexed | 2024-03-09T22:42:59Z |
publishDate | 2022-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Plants |
spelling | doaj.art-dadb75edadb849c8b3a521c62cc90b1e; 2023-11-23T18:35:30Z; eng; MDPI AG; Plants; ISSN 2223-7747; 2022-06-01; volume 11, issue 12, article 1619; DOI 10.3390/plants11121619; Evaluating Plant Gene Models Using Machine Learning. Authors and affiliations: Shriprabha R. Upadhyaya, Philipp E. Bayer, Cassandria G. Tay Fernandez, Jakob Petereit, Jacqueline Batley and David Edwards (School of Biological Sciences, University of Western Australia, Perth, WA 6000, Australia); Mohammed Bennamoun (Department of Computer Science and Software Engineering, University of Western Australia, Perth, WA 6000, Australia); Farid Boussaid (Department of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, WA 6000, Australia). Keywords: gene models; pea; machine learning; XGBoost; SHAP. https://www.mdpi.com/2223-7747/11/12/1619 |
title | Evaluating Plant Gene Models Using Machine Learning |
topic | gene models; pea; machine learning; XGBoost; SHAP |
url | https://www.mdpi.com/2223-7747/11/12/1619 |
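
The description in this record mentions nucleotide and amino acid sequence-based features and reports accuracy and F1 score. The following is a minimal Python sketch of what such calculations might look like. It does not reproduce Truegene or train an XGBoost model: the record does not list the 14 gene-based and 41 protein-based features, so the features shown here (sequence length, GC content, amino acid fractions) and all function names are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gene_features(cds: str) -> dict:
    """Hypothetical nucleotide-level features for one coding sequence.

    These two features are illustrative assumptions, not the feature
    set used in the article.
    """
    seq = cds.upper()
    counts = Counter(seq)
    length = len(seq)
    gc_content = (counts["G"] + counts["C"]) / length if length else 0.0
    return {"length": length, "gc_content": gc_content}


def protein_features(protein: str) -> dict:
    """Hypothetical protein-level features: length and residue fractions."""
    seq = protein.upper().rstrip("*")  # drop a trailing stop symbol if present
    counts = Counter(seq)
    length = len(seq)
    features = {"protein_length": length}
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts[aa] / length if length else 0.0
    return features


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = high confidence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy inputs, not data from the Pisum sativum annotation.
    print(gene_features("ATGGCC"))
    print(protein_features("MAA*")["frac_A"])
    print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a full pipeline, per-gene feature dictionaries like these would be assembled into a table and passed to a classifier; the F1 score is the harmonic mean of precision and recall, which is why it can differ from plain accuracy when the two classes are unbalanced.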