Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods

The existence of missing values reduces the amount of knowledge learned by machine learning models in the training stage, thus negatively affecting classification accuracy. To address this challenge, we introduce the use of Support Vector Machine (SVM) regression for imputing the missing valu...


Bibliographic Details
Main Authors:	Ashokkumar Palanivinayagam, Robertas Damaševičius
Format:	Article
Language:	English
Published:	MDPI AG 2023-02-01
Series:	Information
Subjects:	diabetes classification; missing values; data imputation; false rate reduction; two-level classification
Online Access:	https://www.mdpi.com/2078-2489/14/2/92

collection	DOAJ
description	The existence of missing values reduces the amount of knowledge learned by machine learning models in the training stage, thus negatively affecting classification accuracy. To address this challenge, we introduce the use of Support Vector Machine (SVM) regression for imputing the missing values. Additionally, we propose a two-level classification process to reduce the number of false classifications. Our evaluation of the proposed method was conducted using the PIMA Indian dataset for diabetes classification. We compared the performance of five different machine learning models: Naive Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Random Forest (RF), and Linear Regression (LR). The results of our experiments show that the SVM classifier achieved the highest accuracy of 94.89%. The RF classifier had the highest precision (98.80%) and the SVM classifier had the highest recall (85.48%). The NB model had the highest F1-Score (95.59%). Our proposed method provides a promising solution for detecting diabetes at an early stage by addressing the issue of missing values in the dataset. Our results show that the use of SVM regression and a two-level classification process can notably improve the performance of machine learning models for diabetes classification. This work provides a valuable contribution to the field of diabetes research and highlights the importance of addressing missing values in machine learning applications.
first_indexed	2024-03-11T08:39:16Z
id	doaj.art-0098f5b92655401ea972e1b9af2fb183
institution	Directory of Open Access Journals
issn	2078-2489
last_indexed	2024-03-11T08:39:16Z
doi	10.3390/info14020092
citation	Information, vol. 14, no. 2, article 92 (2023-02-01)
affiliations	Ashokkumar Palanivinayagam: Sri Ramachandra Faculty of Engineering and Technology, Sri Ramachandra Institute of Higher Education and Research, Chennai 600116, India; Robertas Damaševičius: Department of Applied Informatics, Vytautas Magnus University, 44404 Kaunas, Lithuania


Similar Items