Machine Learning Framework for the Prediction of Alzheimer’s Disease Using Gene Expression Data Based on Efficient Gene Selection

In recent years, much research has focused on using machine learning (ML) for disease prediction based on gene expression (GE) data. While many diseases have received considerable attention, some, including Alzheimer’s disease (AD), have not, perhaps because of a shortage of data. The present work...


Bibliographic Details
Main Authors: Aliaa El-Gawady, Mohamed A. Makhlouf, BenBella S. Tawfik, Hamed Nassar
Format: Article
Language: English
Published: MDPI AG 2022-02-01
Series: Symmetry
Subjects: Alzheimer’s disease, gene expression, machine learning, gene selection, classification
Online Access: https://www.mdpi.com/2073-8994/14/3/491
_version_ 1797441617206968320
author Aliaa El-Gawady
Mohamed A. Makhlouf
BenBella S. Tawfik
Hamed Nassar
collection DOAJ
description In recent years, much research has focused on using machine learning (ML) for disease prediction based on gene expression (GE) data. While many diseases have received considerable attention, some, including Alzheimer’s disease (AD), have not, perhaps because of a shortage of data. The present work is intended to fill this gap by introducing a symmetric framework to predict AD from GE data, with the aim of producing the most accurate prediction using the smallest number of genes. The framework works in four stages after it receives a training dataset: pre-processing, gene selection (GS), classification, and AD prediction. The symmetry of the model is manifested in all of its stages. In the pre-processing stage, all gene columns in the training dataset are pre-processed identically. In the GS stage, the same user-defined filter metrics are invoked on every gene individually, and so are the same user-defined wrapper metrics. In the classification stage, a number of user-defined ML models are applied identically using the minimal set of genes selected in the preceding stage. The core of the proposed framework is a GS algorithm that we designed to nominate eight subsets of the original set of genes provided in the training dataset. Exploring the eight subsets, the algorithm selects the best one to describe AD, and also the best ML model to predict the disease using this subset. For credible results, the framework calculates performance metrics using repeated stratified k-fold cross-validation (CV). To evaluate the framework, we used an AD dataset of 1157 cases and 39,280 genes, obtained by combining a number of smaller public datasets. The cases were split into two partitions: 1000 for training/testing, using 10-fold CV repeated 30 times, and 157 for validation. In the training/testing phase, the framework identified only 1058 genes as the most relevant and the support vector machine (SVM) model as the most accurate with these genes. In the final validation, we used the 157 cases that were never seen by the SVM classifier. We evaluated the classifier with six metrics, obtaining 0.97, 0.97, 0.98, 0.945, 0.972, and 0.975 for the sensitivity (recall), specificity, precision, kappa index, AUC, and accuracy, respectively.
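The six validation metrics named in the description can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`confusion_counts`, `roc_auc`, `six_metrics`) and the toy labels and scores are hypothetical, chosen only to show how each metric is derived from a binary classifier's output (1 = AD, 0 = control). It assumes both classes are present and that chance agreement is below 1, so no division by zero occurs.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = AD and 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of cases, which is
    acceptable for a small validation set.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def six_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                scores: Sequence[float]) -> Dict[str, float]:
    """Compute sensitivity, specificity, precision, kappa, AUC, and accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "kappa": kappa,
        "auc": roc_auc(y_true, scores),
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical toy data: 4 AD cases and 4 controls.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]
    for name, value in six_metrics(y_true, y_pred, scores).items():
        print(f"{name}: {value:.4f}")
```

In practice, a library such as scikit-learn offers equivalents of these metrics; the pure-Python form is used here only to keep the definitions visible.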
first_indexed 2024-03-09T12:25:42Z
format Article
id doaj.art-59d52bc1bf7c4bd68e383a23c0e11547
institution Directory Open Access Journal
issn 2073-8994
language English
last_indexed 2024-03-09T12:25:42Z
publishDate 2022-02-01
publisher MDPI AG
record_format Article
series Symmetry
spelling doaj.art-59d52bc1bf7c4bd68e383a23c0e11547
2023-11-30T22:35:22Z
eng
MDPI AG
Symmetry
2073-8994
2022-02-01
14
3
491
10.3390/sym14030491
Machine Learning Framework for the Prediction of Alzheimer’s Disease Using Gene Expression Data Based on Efficient Gene Selection
Aliaa El-Gawady, Faculty of Computers and Informatics, Suez Canal University, Ismailia 41522, Egypt
Mohamed A. Makhlouf, Faculty of Computers and Informatics, Suez Canal University, Ismailia 41522, Egypt
BenBella S. Tawfik, Faculty of Computers and Informatics, Suez Canal University, Ismailia 41522, Egypt
Hamed Nassar, Faculty of Computers and Informatics, Suez Canal University, Ismailia 41522, Egypt
https://www.mdpi.com/2073-8994/14/3/491
Alzheimer’s disease
gene expression
machine learning
gene selection
classification
title Machine Learning Framework for the Prediction of Alzheimer’s Disease Using Gene Expression Data Based on Efficient Gene Selection
topic Alzheimer’s disease
gene expression
machine learning
gene selection
classification
url https://www.mdpi.com/2073-8994/14/3/491