Summary: | Missing values are a common problem in any dataset, and their occurrence
reduces data completeness. The problem also lowers
data quality and negatively affects the results of data analysis. Existing cold deck
imputation copes with this problem by selecting a replacement value from a pool
of donors identified in other data sources during the imputation process. In
comparison to other imputation methods, existing cold deck imputation carries less
risk of model misspecification and preserves the data distribution in the dataset.
Nevertheless, existing cold deck imputation is limited because the chance of
finding a trusted, plausible donor is small when only a single data source is used in
each imputation process. The availability of various web data sources today
alleviates this limitation. However, values from multiple web data sources
commonly conflict with each other, so adopting existing cold deck imputation with
multiple web donors is not a practical solution: no trust score is measured for the
conflicting values, which makes it difficult to select the most plausible
value during the imputation process. This research concentrates on improving data
completeness by imputing missing values using trust-based cold deck
imputation.
Trust Based Cold Deck Missing Values Imputation with Multiple Web Donor is
presented in this research. The proposed method takes advantage of multiple web
donors from web data sources to increase the chance of finding the
most plausible values to impute missing values. The plausible values are selected
based on a novel trust score, which is computed from the accuracy
score and reliability score of the web donor.
The performance of the proposed method is evaluated by running a prediction
model on the imputed dataset. A number of experiments are carried out to quantify
the accuracy of the prediction model, Root Mean Squared Error (RMSE), and the
F-Measure. The results demonstrate that the proposed method improves the
performance of existing cold deck imputation. Additionally, the results are
compared with those of other imputation methods: K-Nearest Neighbor (KNN),
Mean Imputation (AVG), Case Deletion (IGN), Predictive Mean Matching (PMM),
and MissForest. The results show that RMSE, prediction accuracy, and F-Measure
improve when the prediction model is trained with datasets imputed
using the proposed method. This research contributes to the improvement of data
quality, especially in the information systems (IS) and database fields, where good
data quality benefits data analysis performance.
|