When Considering More Elements: Attribute Correlation in Unsupervised Data Cleaning under Blocking

In banks, governments, and internet companies, the increasing demand for data in various information systems and the continually shortening cycle of data collection and update can give rise to a variety of data quality issues in a database. As data scales expand, methods such as p...

Bibliographic Details
Main Authors: Pei Li, Chaofan Dai, Wenqian Wang
Format: Article
Language: English
Published: MDPI AG 2019-04-01
Series: Symmetry
Subjects: data quality; unsupervised data cleaning; attribute correlation; data blocking; machine learning
Online Access: https://www.mdpi.com/2073-8994/11/4/575
author Pei Li
Chaofan Dai
Wenqian Wang
collection DOAJ
description In banks, governments, and internet companies, the increasing demand for data in various information systems and the continually shortening cycle of data collection and update can give rise to a variety of data quality issues in a database. As data scales expand, methods such as pre-specifying business rules or introducing expert experience into the repair process are no longer applicable to some information systems that require rapid responses. In this paper, we divided data cleaning into supervised and unsupervised forms according to whether there were interventions in the repair process, and put forward a new dimension suitable for unsupervised cleaning. For weak logic errors in unsupervised data cleaning, we proposed an attribute correlation-based framework (ACB-Framework) under blocking and designed three different data blocking methods to reduce the time complexity and to test the impact of clustering accuracy on data cleaning. The experiments showed that the blocking methods could effectively reduce the repair time while maintaining repair validity. Moreover, we concluded that blocking methods with excessively high clustering accuracy tended to put tuples with the same elements into one data block, which reduced the cleaning ability. In summary, the ACB-Framework with blocking reduces the time cost and needs neither the guidance of domain knowledge nor interventions in repair, so it can be applied in information systems requiring rapid responses, such as internet web pages, network servers, and sensor information acquisition.
first_indexed 2024-04-13T08:31:22Z
format Article
id doaj.art-a0133fa98c17426a96065c445371399e
institution Directory Open Access Journal
issn 2073-8994
language English
last_indexed 2024-04-13T08:31:22Z
publishDate 2019-04-01
publisher MDPI AG
record_format Article
series Symmetry
spelling doaj.art-a0133fa98c17426a96065c445371399e
2022-12-22T02:54:15Z
eng
MDPI AG
Symmetry
2073-8994
2019-04-01
vol. 11, no. 4, art. 575
10.3390/sym11040575
sym11040575
When Considering More Elements: Attribute Correlation in Unsupervised Data Cleaning under Blocking
Pei Li (0)
Chaofan Dai (1)
Wenqian Wang (2)
(0) Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China
(1) Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China
(2) Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China
https://www.mdpi.com/2073-8994/11/4/575
data quality
unsupervised data cleaning
attribute correlation
data blocking
machine learning
title When Considering More Elements: Attribute Correlation in Unsupervised Data Cleaning under Blocking
topic data quality
unsupervised data cleaning
attribute correlation
data blocking
machine learning
url https://www.mdpi.com/2073-8994/11/4/575
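
Illustrative note on blocking. The description above states that data blocking is used to reduce the time complexity of repair. The sketch below is not the authors' ACB-Framework and does not reproduce any of their three blocking methods; it only shows, under simple assumptions, why restricting comparisons to tuples inside the same block shrinks the amount of pairwise work. The function name candidate_pairs, the toy relation ROWS, and the choice of the first attribute as the blocking key are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(tuples, block_key=None):
    """Return the index pairs a pairwise repair step would have to compare.

    With block_key=None every tuple is paired with every other tuple.
    Otherwise only tuples that share the same block key are paired.
    """
    if block_key is None:
        return list(combinations(range(len(tuples)), 2))

    # Group tuple indices by their (hypothetical) blocking key.
    blocks = defaultdict(list)
    for index, row in enumerate(tuples):
        blocks[block_key(row)].append(index)

    # Only compare tuples that landed in the same block.
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical toy relation: (city, zip_code, state)
ROWS = [
    ("Boston", "02108", "MA"),
    ("Boston", "02108", "MA"),
    ("Boston", "02109", "MA"),
    ("Austin", "73301", "TX"),
    ("Austin", "73301", "TX"),
    ("Austin", "73344", "TX"),
]

all_pairs = candidate_pairs(ROWS)
blocked_pairs = candidate_pairs(ROWS, block_key=lambda row: row[0])
print(len(all_pairs), len(blocked_pairs))  # prints "15 6"
```

With six tuples, comparing everything against everything needs 15 pairs, while two blocks of three tuples need 6. The example says nothing about repair quality; the record's own description notes that overly precise blocking can reduce the cleaning ability.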