A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis
This study proposes a hybrid gene selection method to identify and predict key genes in Arabidopsis associated with various stresses (including salt, heat, cold, high-light, and flagellin), aiming to enhance crop tolerance. An open-source microarray dataset (GSE41935) comprising 207 samples and 30,3...
Main Authors: | Leyla Nazari, Vida Ghotbi, Mohammad Nadimi, Jitendra Paliwal |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-08-01 |
Series: | Algorithms |
Subjects: | LASSO; information gain; ReliefF; classifiers; random forest |
Online Access: | https://www.mdpi.com/1999-4893/16/9/407 |
_version_ | 1797581672940568576 |
---|---|
author | Leyla Nazari; Vida Ghotbi; Mohammad Nadimi; Jitendra Paliwal |
author_facet | Leyla Nazari; Vida Ghotbi; Mohammad Nadimi; Jitendra Paliwal |
author_sort | Leyla Nazari |
collection | DOAJ |
description | This study proposes a hybrid gene selection method to identify and predict key genes in Arabidopsis associated with various stresses (including salt, heat, cold, high-light, and flagellin), aiming to enhance crop tolerance. An open-source microarray dataset (GSE41935) comprising 207 samples and 30,380 genes was analyzed using several machine learning tools, including the synthetic minority oversampling technique (SMOTE), information gain (IG), ReliefF, and least absolute shrinkage and selection operator (LASSO), along with various classifiers (BayesNet, logistic, multilayer perceptron, sequential minimal optimization (SMO), and random forest). We identified 439 differentially expressed genes (DEGs), of which only three were down-regulated (AT3G20810, AT1G31680, and AT1G30250). The performance of the top 20 genes selected by IG and ReliefF was evaluated using the classifiers mentioned above to classify stressed versus non-stressed samples. The random forest algorithm outperformed the other algorithms, with accuracies of 97.91% and 98.51% for IG and ReliefF, respectively. Additionally, 42 genes were identified from all 30,380 genes using LASSO regression. The top 20 genes from each feature selection method were compared to determine three common genes (AT5G44050, AT2G47180, and AT1G70700), which formed a three-gene signature. The efficiency of these three genes was evaluated using random forest and XGBoost algorithms. Further validation was performed using an independent RNA-seq dataset and random forest. These gene signatures can be exploited in plant breeding to improve stress tolerance in a variety of crops. |
first_indexed | 2024-03-10T23:07:53Z |
format | Article |
id | doaj.art-c9a8af345bd54bd5ade70facc4bb79b9 |
institution | Directory Open Access Journal |
issn | 1999-4893 |
language | English |
last_indexed | 2024-03-10T23:07:53Z |
publishDate | 2023-08-01 |
publisher | MDPI AG |
record_format | Article |
series | Algorithms |
spelling | doaj.art-c9a8af345bd54bd5ade70facc4bb79b9; 2023-11-19T09:12:42Z; eng; MDPI AG; Algorithms; 1999-4893; 2023-08-01; 16(9), 407; 10.3390/a16090407; A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis; Leyla Nazari (0), Vida Ghotbi (1), Mohammad Nadimi (2), Jitendra Paliwal (3); (0) Crop and Horticultural Science Research Department, Fars Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Shiraz 71558-63511, Iran; (1) Agricultural Research, Education and Extension Organization (AREEO), Seed and Plant Improvement Institute, Karaj 31359-33151, Iran; (2) Department of Biosystems Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada; (3) Department of Biosystems Engineering, University of Manitoba, Winnipeg, MB R3T 5V6, Canada; https://www.mdpi.com/1999-4893/16/9/407; LASSO; information gain; ReliefF; classifiers; random forest |
title | A Novel Machine-Learning Approach to Predict Stress-Responsive Genes in Arabidopsis |
topic | LASSO; information gain; ReliefF; classifiers; random forest |
url | https://www.mdpi.com/1999-4893/16/9/407 |
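
The description in this record outlines a hybrid selection step: genes are ranked by several feature selection methods, the top-ranked genes from each are kept, and the genes common to all lists form the signature. The following is a minimal Python sketch of that idea only, not the authors' implementation. It assumes a median-split information gain, and it uses a simple class-mean gap as a stand-in second ranking (it is not ReliefF or LASSO). The toy expression values and the gene names `geneA` to `geneD` are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting the samples at the median expression value."""
    cut = median(values)
    low = [y for v, y in zip(values, labels) if v <= cut]
    high = [y for v, y in zip(values, labels) if v > cut]
    remainder = sum(len(part) / len(labels) * entropy(part) for part in (low, high) if part)
    return entropy(labels) - remainder


def mean_gap(values, labels):
    """Stand-in second score: absolute difference between the two class means."""
    stressed = [v for v, y in zip(values, labels) if y == 1]
    control = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(stressed) / len(stressed) - sum(control) / len(control))


def top_k(expression, labels, score, k):
    """Names of the k genes with the highest score."""
    ranked = sorted(expression, key=lambda g: score(expression[g], labels), reverse=True)
    return ranked[:k]


def common_genes(*gene_lists):
    """Genes present in every ranked list, sorted by name."""
    return sorted(set.intersection(*(set(genes) for genes in gene_lists)))


# Toy data (hypothetical): 1 = stressed sample, 0 = non-stressed sample.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "geneA": [9.1, 8.7, 9.4, 8.9, 2.1, 2.5, 1.9, 2.2],  # separates the classes
    "geneB": [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 4.8, 5.2],  # uninformative
    "geneC": [7.0, 7.4, 6.8, 7.1, 3.0, 3.3, 2.9, 3.1],  # separates the classes
    "geneD": [4.0, 6.0, 4.2, 6.1, 4.1, 5.9, 4.3, 6.2],  # varies, but not with stress
}

by_gain = top_k(expression, labels, information_gain, k=2)
by_gap = top_k(expression, labels, mean_gap, k=2)
signature = common_genes(by_gain, by_gap)
print(signature)
```

On real data the same intersection would be taken over the top 20 genes per method, as the description states, and the resulting signature would then be passed to a classifier such as random forest for evaluation.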