The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies

Abstract: Machine learning (ML) methods for uncovering single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can be used to predict disease outcomes are increasingly used in genetic research. Two issues with the use of ML models are finding the correct meth...


Bibliographic Details
Main Authors:	R. Onur Öztornaci, Hamzah Syed, Andrew P. Morris, Bahar Taşdelen
Format:	Article
Language:	English
Published:	SpringerOpen 2023-11-01
Series:	Journal of Big Data
Subjects:	Machine learning; Class imbalanced methods; GWAS; ULSAM study
Online Access:	https://doi.org/10.1186/s40537-023-00853-x

_version_	1827603464705802240
author	R. Onur Öztornaci, Hamzah Syed, Andrew P. Morris, Bahar Taşdelen
author_sort	R. Onur Öztornaci
collection	DOAJ
description	Abstract: Machine learning (ML) methods for uncovering single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can be used to predict disease outcomes are increasingly used in genetic research. Two issues with the use of ML models are finding the correct method for dealing with imbalanced data and data training. This article compares three ML models for identifying SNPs that predict type 2 diabetes (T2D) status, using support vector machine SMOTE (SVM SMOTE), the adaptive synthetic sampling approach (ADASYN), and random undersampling (RUS) on GWAS data from elderly male participants (165 cases and 951 controls) from the Uppsala Longitudinal Study of Adult Men (ULSAM). SMOTE, SVM SMOTE, ADASYN, and RUS were also applied to SNPs selected by the clumping method. The analysis was performed using three different ML models: (i) support vector machine (SVM), (ii) multilayer perceptron (MLP), and (iii) random forests (RF). The accuracy of the case–control classification was compared between these three models. The best-performing combination was MLP with SMOTE (97% accuracy). Both RF and SVM also achieved accuracies above 90%. Overall, the methods used against imbalanced data were found to improve the prediction accuracy of all three ML algorithms.
first_indexed	2024-03-09T05:40:00Z
format	Article
id	doaj.art-4b6d27f55b894c45b3ebdea0af8933f6
institution	Directory of Open Access Journals
issn	2196-1115
language	English
last_indexed	2024-03-09T05:40:00Z
publishDate	2023-11-01
publisher	SpringerOpen
record_format	Article
series	Journal of Big Data
spelling	doaj.art-4b6d27f55b894c45b3ebdea0af8933f6; 2023-12-03T12:25:44Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2023-11-01; 101128; 10.1186/s40537-023-00853-x; The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies; R. Onur Öztornaci (0), Hamzah Syed (1), Andrew P. Morris (2), Bahar Taşdelen (3); (0) Koç University Research Centre for Translational Medicine, Koç University; (1) Koç University Research Centre for Translational Medicine, Koç University; (2) Division of Musculoskeletal and Dermatological Sciences, University of Manchester; (3) Faculty of Medicine, Department of Biostatistics and Medical Informatics, Mersin University; https://doi.org/10.1186/s40537-023-00853-x; Machine learning; Class imbalanced methods; GWAS; ULSAM study
title	The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies
topic	Machine learning; Class imbalanced methods; GWAS; ULSAM study
url	https://doi.org/10.1186/s40537-023-00853-x


Similar Items