Cyclical hybrid imputation technique for missing values in data sets

Bibliographic Details
Main Authors: Kurban Kotan, Serdar Kırışoğlu
Format: Article
Language: English
Published: Nature Portfolio 2025-02-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-90964-7
Description
Summary: Handling missing data is the most important first step of the preprocessing phase, because incorrect imputation of missing data increases the error in the modeling phase and reduces the prediction performance of the model. In health applications, models with a higher success rate are necessarily preferred. When data are missing, the performance of machine learning models can vary with the amount of data in the data set. A high rate of missing data reduces the amount of complete data available and therefore affects the accuracy and reliability of analysis and modeling studies. Estimating missing values precisely, close to their real values, yields a significant, visible performance increase in the subsequent modeling phase. When missing values are imputed with an artificial intelligence model rather than a random method, the model trained on the resulting data is more accurate than a model trained on data filled with classical methods such as mean and mode. In this study, we propose a new algorithm, tested on many data sets, to address the problems of missing data imputation. The algorithm aims to impute missing values more effectively by applying row-based and column-based imputation techniques together and cyclically: the column-based techniques account for individual missing values, and the row-based techniques account for the overall data structure. With some row-based and column-based imputation techniques, the proposed algorithm achieved 100% accuracy on the 3 different data sets used in the study, and it achieved higher accuracy than other imputation techniques.
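The idea of alternating column-based and row-based imputation can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given in this record. It assumes a numeric table with None marking missing cells, and it assumes a single-predictor linear fit as the column-based step and a k-nearest-rows average as the row-based step. All function names and parameters (cyclical_hybrid_impute, k, max_cycles, tol) are hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus its R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:          # constant column: fall back to the mean
        return 0.0, my, 0.0
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def column_step(filled, missing, original):
    """Column-based pass: predict each incomplete column from its best single predictor."""
    updated = [row[:] for row in filled]
    n_cols = len(filled[0])
    for j in sorted({j for _, j in missing}):
        # Fit only on rows where column j was actually observed.
        train = [i for i in range(len(filled)) if original[i][j] is not None]
        y = [filled[i][j] for i in train]
        best = None
        for p in range(n_cols):
            if p == j:
                continue
            slope, intercept, r2 = fit_line([filled[i][p] for i in train], y)
            if best is None or r2 > best[0]:
                best = (r2, p, slope, intercept)
        if best is None:              # single-column table: nothing to predict from
            continue
        _, p, slope, intercept = best
        for i, mj in missing:
            if mj == j:
                updated[i][j] = intercept + slope * filled[i][p]
    return updated

def row_step(filled, missing, k):
    """Row-based pass: average each missing cell over the k most similar rows."""
    updated = [row[:] for row in filled]
    for i in sorted({i for i, _ in missing}):
        # Distances use the current estimates, including previously imputed cells.
        ranked = sorted((math.dist(filled[i], filled[r]), r)
                        for r in range(len(filled)) if r != i)
        neighbours = [r for _, r in ranked[:k]]
        for mi, j in missing:
            if mi == i:
                updated[i][j] = sum(filled[r][j] for r in neighbours) / len(neighbours)
    return updated

def cyclical_hybrid_impute(data, k=2, max_cycles=10, tol=1e-6):
    """Seed with column means, then alternate column and row passes until stable."""
    missing = [(i, j) for i, row in enumerate(data)
               for j, v in enumerate(row) if v is None]
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed))
    filled = [[means[j] if v is None else v for j, v in enumerate(row)]
              for row in data]
    for _ in range(max_cycles):
        previous = filled
        filled = row_step(column_step(filled, missing, data), missing, k)
        change = max((abs(filled[i][j] - previous[i][j]) for i, j in missing),
                     default=0.0)
        if change < tol:              # estimates stopped moving: stop cycling
            break
    return filled

# Toy table (hypothetical data): the second column is twice the first.
table = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0], [10.0, 20.0]]
print(cyclical_hybrid_impute(table))
```

Only the cells that were originally missing are ever updated; observed values pass through unchanged. In practice the two passes could be swapped for any column-based and row-based imputers, which is the kind of pairing the abstract describes.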
ISSN: 2045-2322