Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

Background: When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute this missing value include statistical methods such as mean, expectation-maximization, and multiple imputations by chained equations (MICE) a...

Bibliographic Details
Main Authors:	Hansle Gwon, Imjin Ahn, Yunha Kim, Hee Jun Kang, Hyeram Seo, Ha Na Cho, Heejung Choi, Tae Joon Jun, Young-Hak Kim
Format:	Article
Language:	English
Published:	JMIR Publications 2021-10-01
Series:	JMIR Public Health and Surveillance
Online Access:	https://publichealth.jmir.org/2021/10/e30824

author	Hansle Gwon Imjin Ahn Yunha Kim Hee Jun Kang Hyeram Seo Ha Na Cho Heejung Choi Tae Joon Jun Young-Hak Kim
collection	DOAJ
description	Background: When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute missing values include statistical methods such as mean, expectation-maximization, and multiple imputations by chained equations (MICE) as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree. Objective: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field where training data are scarce. Methods: In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the data in which the missing value is validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until the condition is satisfied. The most important part of this process is how to evaluate the accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model. Results: In self-training using random forest (RF), mean squared error was up to 12% lower than pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations. Conclusions: Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
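The self-training loop described under Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pure-Python k-nearest-neighbor regressor in place of random forest, a fixed batch size, and a simple acceptance rule (keep a pseudolabeled batch only if mean squared error on a held-out validation set does not increase) standing in for the quantile-error criterion named in the title, which this record does not spell out. All function and parameter names (`knn_predict`, `mse`, `self_train`, `make_rows`, `batch`, `max_rounds`) are hypothetical.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Predict the mean target of the k nearest training rows (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def mse(train_X, train_y, val_X, val_y, k=3):
    """Mean squared error on the validation rows of a k-NN model fit on the training rows."""
    errs = [(knn_predict(train_X, train_y, x, k) - y) ** 2 for x, y in zip(val_X, val_y)]
    return sum(errs) / len(errs)


def self_train(X, y, val_X, val_y, incomplete_X, batch=5, max_rounds=10, k=3):
    """Grow the complete set with pseudolabeled rows.

    Assumed acceptance rule: a batch is kept only if validation MSE does not
    get worse; rejected batches are discarded. The loop stops when the pool of
    incomplete rows is empty or max_rounds is reached.
    """
    X, y = list(X), list(y)
    pool = list(incomplete_X)
    best = mse(X, y, val_X, val_y, k)
    for _ in range(max_rounds):
        if not pool:
            break
        candidates, pool = pool[:batch], pool[batch:]
        # Pseudolabeling: use the model's prediction as if it were the true value.
        pseudo = [knn_predict(X, y, x, k) for x in candidates]
        trial = mse(X + candidates, y + pseudo, val_X, val_y, k)
        if trial <= best:
            X, y, best = X + candidates, y + pseudo, trial
    return X, y, best


def make_rows(n, rng):
    """Synthetic rows: two observed features and a target y = 2a + b + noise."""
    X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(n)]
    y = [2 * a + b + rng.gauss(0, 0.05) for a, b in X]
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_rows(30, rng)
    val_X, val_y = make_rows(20, rng)
    incomplete_X, _ = make_rows(40, rng)  # targets treated as missing
    before = mse(X, y, val_X, val_y)
    X2, y2, after = self_train(X, y, val_X, val_y, incomplete_X)
    print(len(X), len(X2), before, after)
```

Because a batch is accepted only when it does not raise validation error, the returned error can never exceed the starting error in this sketch; whether it actually improves depends on the data and on how pseudolabels are evaluated, which is the point the abstract emphasizes.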
first_indexed	2024-03-12T13:01:19Z
format	Article
id	doaj.art-4dc0083227d6405da23e362e542c175f
institution	Directory Open Access Journal
issn	2369-2960
language	English
last_indexed	2024-03-12T13:01:19Z
publishDate	2021-10-01
publisher	JMIR Publications
record_format	Article
series	JMIR Public Health and Surveillance
title	Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study
url	https://publichealth.jmir.org/2021/10/e30824
