Missing Value Imputation for PM10 Concentration in Sabah using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm

Missing data in large datasets hampers subsequent analysis. The Nearest Neighbour Method (NNM) and the Expectation-Maximization (EM) algorithm are two of the most widely used methods for filling in missing data. This research therefore compares the two methods by imputing missing data...


Bibliographic Details
Main Authors: Muhammad Izzuddin Rumaling, Fuei Pien Chee, Jedol Dayou, Jackson Hian Wui Chang, Steven Soon Kai Kong, Justin Sentian
Format: Article
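Language: English
Published: Springer 2020-03-01
Series: Asian Journal of Atmospheric Environment
Subjects: particulate matter; missing data; nearest neighbour method; expectation maximization algorithm; performance indicators
Online Access: http://www.asianjae.org/_common/do.php?a=full&b=11&bidx=1922&aidx=23538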
author Muhammad Izzuddin Rumaling
Fuei Pien Chee
Jedol Dayou
Jackson Hian Wui Chang
Steven Soon Kai Kong
Justin Sentian
collection DOAJ
description Missing data in large datasets hampers subsequent analysis. The Nearest Neighbour Method (NNM) and the Expectation-Maximization (EM) algorithm are two of the most widely used methods for filling in missing data. This research therefore compares the two methods by imputing missing air quality data from five monitoring stations (CA0030, CA0039, CA0042, CA0049, CA0050) in Sabah, Malaysia. PM10 (particulate matter with an aerodynamic diameter below 10 microns) datasets covering 2003-2007 (Part A) and 2008-2012 (Part B) are used. To make performance evaluation possible, missing data are introduced into the datasets at five levels (5%, 10%, 15%, 25% and 40%) and then imputed using both NNM and the EM algorithm. The performance of the two imputation methods is evaluated using performance indicators (RMSE, MAE, IOA, COD) and regression analysis. On both measures, NNM performs better than EM for stations CA0039, CA0042 and CA0049, which may be because the air quality data are missing at random (MAR). This is not the case for CA0050 and Part B of CA0030, possibly because of fluctuations that NNM could not detect. Accuracy evaluation using the Mean Absolute Percentage Error (MAPE) shows that NNM is the more accurate imputation method in most cases.
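The evaluation procedure described above (mask known values, impute them, score the imputed values against the originals) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the temporal variant of NNM, in which each gap is filled with the observed value closest in time, and it assumes IOA refers to Willmott's index of agreement; the record specifies neither. The function names and the synthetic series are hypothetical, no real PM10 data are used, and COD and the EM algorithm are left out.

```python
import numpy as np

def mask_at_random(values, fraction, rng):
    """Return a float copy of `values` with about `fraction` of entries set to NaN."""
    masked = np.asarray(values, dtype=float).copy()
    n_missing = int(round(fraction * masked.size))
    idx = rng.choice(masked.size, size=n_missing, replace=False)
    masked[idx] = np.nan
    return masked

def nearest_neighbour_impute(series):
    """Fill each NaN with the observed value closest in time.

    Assumed variant: ties between two equally distant neighbours go to the
    earlier observation. At least one observed value is required.
    """
    filled = series.copy()
    observed = np.flatnonzero(~np.isnan(series))
    missing = np.flatnonzero(np.isnan(series))
    # Locate each missing index among the sorted observed indices.
    pos = np.searchsorted(observed, missing)
    left = observed[np.clip(pos - 1, 0, observed.size - 1)]
    right = observed[np.clip(pos, 0, observed.size - 1)]
    use_left = np.abs(missing - left) <= np.abs(right - missing)
    filled[missing] = np.where(use_left, series[left], series[right])
    return filled

def rmse(obs, pred):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mae(obs, pred):
    return float(np.mean(np.abs(pred - obs)))

def mape(obs, pred):
    # Percentage error; assumes no observed value is zero.
    return float(100 * np.mean(np.abs((pred - obs) / obs)))

def index_of_agreement(obs, pred):
    # Willmott's index of agreement (assumed definition of IOA).
    mean_obs = obs.mean()
    denom = np.sum((np.abs(pred - mean_obs) + np.abs(obs - mean_obs)) ** 2)
    return float(1 - np.sum((pred - obs) ** 2) / denom)

# Synthetic stand-in for an hourly concentration series (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(1000)
true = 50 + 10 * np.sin(t / 24) + rng.normal(0, 2, t.size)

for level in (0.05, 0.10, 0.15, 0.25, 0.40):
    masked = mask_at_random(true, level, rng)
    filled = nearest_neighbour_impute(masked)
    gaps = np.isnan(masked)  # score only the values that were imputed
    print(f"{level:.0%}: RMSE={rmse(true[gaps], filled[gaps]):.2f} "
          f"MAE={mae(true[gaps], filled[gaps]):.2f} "
          f"IOA={index_of_agreement(true[gaps], filled[gaps]):.3f} "
          f"MAPE={mape(true[gaps], filled[gaps]):.2f}%")
```

Scoring only the masked positions keeps the untouched observations from inflating the indicators. An EM-based imputer would be evaluated by passing the same masked series through it and reusing the same scoring functions.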
first_indexed 2024-03-12T11:08:27Z
format Article
id doaj.art-4c9baa52449b476483ec74ee627b4095
institution Directory Open Access Journal
issn 1976-6912
2287-1160
language English
last_indexed 2024-03-12T11:08:27Z
publishDate 2020-03-01
publisher Springer
record_format Article
series Asian Journal of Atmospheric Environment
spelling doaj.art-4c9baa52449b476483ec74ee627b4095 (2023-09-02T03:27:58Z)
Language: eng
Publisher: Springer
Journal: Asian Journal of Atmospheric Environment, ISSN 1976-6912, 2287-1160
Published: 2020-03-01, vol. 14, no. 1, pp. 62-72
DOI: 10.5572/ajae.2020.14.1.062
Title: Missing Value Imputation for PM10 Concentration in Sabah using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm
Authors and affiliations:
Muhammad Izzuddin Rumaling: Faculty of Science and Natural Resources (FSNR), Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Fuei Pien Chee (https://orcid.org/0000-0002-9782-5572): Faculty of Science and Natural Resources (FSNR), Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Jedol Dayou: Faculty of Science and Natural Resources (FSNR), Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Jackson Hian Wui Chang: Preparatory Centre for Science and Technology, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
Steven Soon Kai Kong: Cloud and Aerosol Laboratory, Department of Atmospheric Science, National Central University, Taoyuan, Taiwan (ROC)
Justin Sentian: Climate Change Research Group (CCRG), FSNR, Universiti Malaysia Sabah, Kota Kinabalu, Sabah, Malaysia
URL: http://www.asianjae.org/_common/do.php?a=full&b=11&bidx=1922&aidx=23538
Keywords: particulate matter; missing data; nearest neighbour method; expectation maximization algorithm; performance indicators
title Missing Value Imputation for PM10 Concentration in Sabah using Nearest Neighbour Method (NNM) and Expectation-Maximization (EM) Algorithm
topic particulate matter
missing data
nearest neighbour method
expectation maximization algorithm
performance indicators
url http://www.asianjae.org/_common/do.php?a=full&b=11&bidx=1922&aidx=23538