A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis

This study proposes a hybrid gene selection method to identify and predict key genes in Arabidopsis associated with various stresses (including salt, heat, cold, high-light, and flagellin), aiming to enhance crop tolerance. An open-source microarray dataset (GSE41935) comprising 207 samples and 30,3...

Bibliographic Details
Main Authors: Leyla Nazari, Vida Ghotbi, Mohammad Nadimi, Jitendra Paliwal
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Algorithms
Subjects: LASSO; information gain; ReliefF; classifiers; random forest
Online Access: https://www.mdpi.com/1999-4893/16/9/407
_version_ 1797581672940568576
author Leyla Nazari
Vida Ghotbi
Mohammad Nadimi
Jitendra Paliwal
author_facet Leyla Nazari
Vida Ghotbi
Mohammad Nadimi
Jitendra Paliwal
author_sort Leyla Nazari
collection DOAJ
description This study proposes a hybrid gene selection method to identify and predict key genes in Arabidopsis associated with various stresses (including salt, heat, cold, high-light, and flagellin), aiming to enhance crop tolerance. An open-source microarray dataset (GSE41935) comprising 207 samples and 30,380 genes was analyzed using several machine learning tools, including the synthetic minority oversampling technique (SMOTE), information gain (IG), ReliefF, and the least absolute shrinkage and selection operator (LASSO), along with various classifiers (BayesNet, logistic regression, multilayer perceptron, sequential minimal optimization (SMO), and random forest). We identified 439 differentially expressed genes (DEGs), of which only three were down-regulated (AT3G20810, AT1G31680, and AT1G30250). The top 20 genes selected by IG and by ReliefF were each evaluated with the classifiers listed above on the task of separating stressed from non-stressed samples. Random forest outperformed the other classifiers, with accuracies of 97.91% and 98.51% for the IG-selected and ReliefF-selected genes, respectively. Additionally, LASSO regression identified 42 genes from all 30,380 genes. The top 20 genes from each feature selection method were compared to identify three shared genes (AT5G44050, AT2G47180, and AT1G70700), which formed a three-gene signature. The efficiency of these three genes was evaluated using the random forest and XGBoost algorithms. Further validation was performed using an independent RNA-seq dataset and random forest. These gene signatures can be exploited in plant breeding to improve stress tolerance in a variety of crops.
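The core idea in the description, ranking genes with more than one filter and keeping those that the rankings share, can be illustrated with a minimal sketch. This is not the authors' pipeline: it runs on synthetic toy data rather than GSE41935, uses a median split to discretize expression for information gain, substitutes basic two-class Relief for ReliefF, and omits SMOTE, LASSO, and the classifiers. All function names and parameters here are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG of one feature after a median split (an assumed discretization)."""
    cut = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < cut]
    high = [y for v, y in zip(values, labels) if v >= cut]
    n = len(labels)
    conditional = sum(len(p) / n * entropy(p) for p in (low, high) if p)
    return entropy(labels) - conditional


def relief_scores(X, y):
    """Basic two-class Relief: one nearest hit and one nearest miss per sample."""
    n, d = len(X), len(X[0])
    # Per-feature ranges, used to scale differences into [0, 1].
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def dist(a, b):
        return sum(abs(a[j] - b[j]) / spans[j] for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: dist(X[i], X[k]))
        miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward features that differ across classes, penalize within-class spread.
            diff_miss = abs(X[i][j] - X[miss][j]) / spans[j]
            diff_hit = abs(X[i][j] - X[hit][j]) / spans[j]
            weights[j] += (diff_miss - diff_hit) / n
    return weights


def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def make_toy_data(n=60, d=30, informative=(0, 1, 2), seed=0):
    """Synthetic stand-in for expression data: only `informative` columns shift with class."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]  # 0 = non-stressed, 1 = stressed (toy labels)
    X = [[rng.gauss(4.0 * y[i] if j in informative else 0.0, 1.0) for j in range(d)]
         for i in range(n)]
    return X, y


X, y = make_toy_data()
ig = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
relief = relief_scores(X, y)
shared = top_k(ig, 5) & top_k(relief, 5)  # features ranked highly by both filters
print(sorted(shared))
```

Intersecting the rankings keeps only features that two differently motivated filters agree on, which is the role the three-gene signature plays in the description; on real data the cutoff (20 in the record, 5 in this toy example) is a choice that would need its own validation.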
first_indexed 2024-03-10T23:07:53Z
format Article
id doaj.art-c9a8af345bd54bd5ade70facc4bb79b9
institution Directory Open Access Journal
issn 1999-4893
language English
last_indexed 2024-03-10T23:07:53Z
publishDate 2023-08-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj.art-c9a8af345bd54bd5ade70facc4bb79b9
2023-11-19T09:12:42Z
eng
MDPI AG
Algorithms
1999-4893
2023-08-01
16(9):407
10.3390/a16090407
A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
Leyla Nazari (Crop and Horticultural Science Research Department, Fars Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Shiraz 71558-63511, Iran)
Vida Ghotbi (Agricultural Research, Education and Extension Organization (AREEO), Seed and Plant Improvement Institute, Karaj 31359-33151, Iran)
Mohammad Nadimi (Department of Biosystems Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada)
Jitendra Paliwal (Department of Biosystems Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada)
title A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
title_full A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
title_fullStr A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
title_full_unstemmed A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
title_short A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
title_sort novel machine learning approach to predict stress responsive genes in arabidopsis
topic LASSO
information gain
ReliefF
classifiers
random forest
url https://www.mdpi.com/1999-4893/16/9/407