Methods of handling missing data with reference to rainfall in Peninsular Malaysia



Bibliographic Details
Main Author: Ho, Ming Kang
Format: Thesis
Language: English
Published: 2014
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/78077/1/HoMingKangPFS2014.pdf
_version_ 1796862780885696512
author Ho, Ming Kang
author_facet Ho, Ming Kang
author_sort Ho, Ming Kang
collection ePrints
description Missing data is an issue often discussed among hydrologists in Malaysia. Various imputation methods have been introduced to reduce bias and improve the accuracy of statistical analysis. However, the performance of these methods suffers if the reason the data are missing is not identified. This study therefore objectively investigates why data are missing, known as the missingness mechanism, and selects the best model to impute the missing rainfall data. A model combining expectation maximization and logit (EM-Logit) is proposed and applied to simulated data with missing values characterised as missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). In addition, homogeneous rainfall data coupled with temperature and humidity data from Damansara and Kelantan are used to validate the proposed model. The results indicate that the model is able to identify the type of missingness mechanism that leads to data being missing. The model also identifies MNAR as the missingness mechanism that best describes the missing rainfall data in both study areas. A two-step approach is therefore proposed for imputation. The first step classifies each rainfall event as a wet or dry day using a weighted-average algorithm; in the second step, the missing values on days classified as wet are estimated by a self-organizing map (SOM). The two-step approach, known as the Probability Density Function Preserving Approach with SOM (PDSOM), is then compared with the SOM model alone and with a Multilayer Perceptron (MLP). Based on the mean absolute error (MAE) and root mean square error (RMSE) criteria, and on a comparison of the statistical properties of the imputed data with those of the rainfall data, PDSOM is found to perform better than SOM and MLP. The missing rainfall data from 1996 to 2004 at the two stations (Damansara and Kelantan) are also used to validate the performance of PDSOM by comparing the estimated mean and variance of the rainfall data with missing values imputed by PDSOM. The imputed values fall within the confidence intervals constructed from the observed rainfall data. PDSOM is shown to preserve well the mean and variance of the missing rainfall data, as well as the number of rainfall events in Damansara and Kelantan. Thus, PDSOM can serve as an alternative imputation model for rainfall data with missing values.
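The three missingness mechanisms and the two error criteria named in the description can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic rainfall and humidity series, the thresholds, the missing rate, and the function names (make_mask, score_mean_imputation) are hypothetical, and a plain mean fill stands in for the imputation step. It is not the thesis's EM-Logit or PDSOM method.

```python
import math
import random


def mae(observed, imputed):
    """Mean absolute error between paired observed and imputed values."""
    return sum(abs(o - i) for o, i in zip(observed, imputed)) / len(observed)


def rmse(observed, imputed):
    """Root mean square error between paired observed and imputed values."""
    squared = sum((o - i) ** 2 for o, i in zip(observed, imputed))
    return math.sqrt(squared / len(observed))


def make_mask(rain, humidity, mechanism, rng, rate=0.2):
    """Return booleans, True where a rainfall value is hidden.

    The thresholds (80 % humidity, 10 mm rain) are illustrative assumptions.
    """
    mask = []
    for r, h in zip(rain, humidity):
        if mechanism == "MCAR":
            p = rate  # unrelated to any variable
        elif mechanism == "MAR":
            p = 2 * rate if h > 80 else rate / 2  # depends on an observed covariate
        elif mechanism == "MNAR":
            p = 2 * rate if r > 10 else rate / 2  # depends on the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p)
    return mask


def score_mean_imputation(rain, mask):
    """Fill hidden values with the mean of the visible ones; return (MAE, RMSE)."""
    visible = [r for r, m in zip(rain, mask) if not m]
    hidden = [r for r, m in zip(rain, mask) if m]
    if not visible or not hidden:
        raise ValueError("need both visible and hidden values")
    fill = sum(visible) / len(visible)
    imputed = [fill] * len(hidden)
    return mae(hidden, imputed), rmse(hidden, imputed)


# Synthetic daily series (assumed): about 40 % wet days with exponential rain depths.
rng = random.Random(0)
humidity = [rng.uniform(60, 100) for _ in range(500)]
rain = [rng.expovariate(1 / 8) if rng.random() < 0.4 else 0.0 for _ in range(500)]

results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = make_mask(rain, humidity, mechanism, rng)
    results[mechanism] = score_mean_imputation(rain, mask)
    print(mechanism, results[mechanism])
```

Under MNAR the hidden values differ systematically from the visible ones, so a fill based only on visible data tends to be biased; this is one way to see why the description stresses identifying the mechanism before choosing an imputation model.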
first_indexed 2024-03-05T20:16:47Z
format Thesis
id utm.eprints-78077
institution Universiti Teknologi Malaysia - ePrints
language English
last_indexed 2024-03-05T20:16:47Z
publishDate 2014
record_format dspace
spelling utm.eprints-78077 2018-07-23T06:06:01Z http://eprints.utm.my/78077/ Methods of handling missing data with reference to rainfall in Peninsular Malaysia Ho, Ming Kang QA Mathematics 2014-09 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/78077/1/HoMingKangPFS2014.pdf Ho, Ming Kang (2014) Methods of handling missing data with reference to rainfall in Peninsular Malaysia. PhD thesis, Universiti Teknologi Malaysia, Faculty of Science. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98266
spellingShingle QA Mathematics
Ho, Ming Kang
Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title_full Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title_fullStr Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title_full_unstemmed Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title_short Methods of handling missing data with reference to rainfall in Peninsular Malaysia
title_sort methods of handling missing data with reference to rainfall in peninsular malaysia
topic QA Mathematics
url http://eprints.utm.my/78077/1/HoMingKangPFS2014.pdf
work_keys_str_mv AT homingkang methodsofhandlingmissingdatawithreferencetorainfallinpeninsularmalaysia