Assessing the impact of missing data on water quality index estimation: a machine learning approach
Abstract Despite the regulations and controls implemented worldwide by governments and institutions to ensure the availability and quality of water resources, many water sources remain susceptible to contamination. This contamination poses significant risks to human health and can lead to substantia...
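Main Author: | David Sierra-Porta |
---|---|
Format: | Article |
Language: | English |
Published: | Springer 2024-03-01 |
Series: | Discover Water |
Subjects: | Water quality; Imputation methods; Machine learning; Data mining; Process improvement |
Online Access: | https://doi.org/10.1007/s43832-024-00068-y |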
_version_ | 1797258984111996928 |
---|---|
author | David Sierra-Porta |
collection | DOAJ |
description | Abstract Despite the regulations and controls implemented worldwide by governments and institutions to ensure the availability and quality of water resources, many water sources remain susceptible to contamination. This contamination poses significant risks to human health and can lead to substantial economic losses. One of the challenges in this context is the presence of missing or incomplete data, which can arise from various factors such as the methodology used or the expertise of personnel involved in sample collection and analysis. The existence of such data gaps hampers the accurate analysis that can be conducted. To address this issue and estimate a water quality index from the available samples, it is crucial to handle missing information appropriately to avoid biased calculations. This study focuses on the application of machine learning methods for imputing missing data in water samples. Furthermore, it quantifies the performance of different models based on the distribution of the obtained data. By applying 10 distinct methods to a sample of water quality data, the most effective approaches, namely Bayesian Ridge, Gradient Boosting, Ridge, Support Vector Machine, and Theil-Sen regressors, were identified. The selection of these models was based on the evaluation of two estimation error metrics: average percent bias (PBIAS) and Kling-Gupta Efficiency statistic (KGEss). The respective metric values for the aforementioned methods are as follows: $$\langle \hbox{PBIAS}\rangle_{0.5} = 14.665, 19.555, 14.300, 15.380, 15.920$$ and $$\langle \hbox{KGEss}\rangle_{0.5} = 0.670, 0.585, 0.655, 0.620, 0.595$$. The results obtained from these models have been utilized to establish unbiased relationships among physical, chemical, and biological parameters based on the information retrieved through the applied imputation methods. |
first_indexed | 2024-04-24T23:02:13Z |
format | Article |
id | doaj.art-0d6684aa96e0437cafc328eca6ecf485 |
institution | Directory Open Access Journal |
issn | 2730-647X |
language | English |
last_indexed | 2024-04-24T23:02:13Z |
publishDate | 2024-03-01 |
publisher | Springer |
record_format | Article |
series | Discover Water |
spelling | doaj.art-0d6684aa96e0437cafc328eca6ecf485; 2024-03-17T12:37:47Z; eng; Springer; Discover Water; 2730-647X; 2024-03-01; 41120; 10.1007/s43832-024-00068-y; Assessing the impact of missing data on water quality index estimation: a machine learning approach; David Sierra-Porta (Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar); https://doi.org/10.1007/s43832-024-00068-y; Water quality; Imputation methods; Machine learning; Data mining; Process improvement |
title | Assessing the impact of missing data on water quality index estimation: a machine learning approach |
topic | Water quality; Imputation methods; Machine learning; Data mining; Process improvement |
url | https://doi.org/10.1007/s43832-024-00068-y |
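
The abstract in this record ranks imputation models by percent bias (PBIAS) and a Kling-Gupta Efficiency skill score (KGEss) but does not give formulas. The sketch below shows one common way to compute both for a set of held-out values and their imputed estimates. It is an illustration under stated assumptions, not the article's implementation: the sign convention for PBIAS (imputed minus observed), the use of the mean-value benchmark KGE = 1 - sqrt(2) for the skill score, and the function names `pbias`, `kge`, and `kge_skill_score` are all assumptions introduced here.

```python
import math


def pbias(observed, imputed):
    """Percent bias, assuming the convention 100 * sum(imputed - observed) / sum(observed)."""
    return 100.0 * sum(i - o for o, i in zip(observed, imputed)) / sum(observed)


def kge(observed, imputed):
    """Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    # Population standard deviations and covariance.
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    sd_i = math.sqrt(sum((i - mean_i) ** 2 for i in imputed) / n)
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed)) / n
    r = cov / (sd_o * sd_i)   # linear correlation
    alpha = sd_i / sd_o       # variability ratio
    beta = mean_i / mean_o    # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def kge_skill_score(observed, imputed, benchmark=1.0 - math.sqrt(2.0)):
    """Skill score relative to a benchmark KGE (assumed: predicting the observed mean)."""
    return (kge(observed, imputed) - benchmark) / (1.0 - benchmark)


# Hypothetical held-out measurements and their imputed estimates (made-up values).
observed = [7.1, 6.8, 7.4, 8.0, 6.5]
imputed = [o * 1.1 for o in observed]
scores = (pbias(observed, imputed), kge(observed, imputed), kge_skill_score(observed, imputed))
```

With these definitions a perfect imputation gives PBIAS = 0 and KGE = KGEss = 1, so lower PBIAS magnitude and higher KGEss indicate a better imputer, which matches how the abstract uses the two metrics.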