Pattern functional dependencies for data cleaning

Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like “1980-“ might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) func...

Full description

Bibliographic Details
Main Authors:	Qahtan, Abdulhakim, Tang, Nan, Ouzzani, Mourad, Cao, Yang, Stonebraker, Michael
Format:	Article
Language:	English
Published:	VLDB Endowment 2021
Online Access:	https://hdl.handle.net/1721.1/133951

_version_	1826200185377652736
author	Qahtan, Abdulhakim Tang, Nan Ouzzani, Mourad Cao, Yang Stonebraker, Michael
author_facet	Qahtan, Abdulhakim Tang, Nan Ouzzani, Mourad Cao, Yang Stonebraker, Michael
author_sort	Qahtan, Abdulhakim
collection	MIT
description	Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like “1980-“ might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as“F-9-107“, “F“ is sufficient to determine the finance department. Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs.
first_indexed	2024-09-23T11:32:28Z
format	Article
id	mit-1721.1/133951
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T11:32:28Z
publishDate	2021
publisher	VLDB Endowment
record_format	dspace
spelling	mit-1721.1/1339512021-10-28T04:15:00Z Pattern functional dependencies for data cleaning Qahtan, Abdulhakim Tang, Nan Ouzzani, Mourad Cao, Yang Stonebraker, Michael Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like “1980-“ might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex- and IC-based theories to capture data dependencies involving partial attribute values. For example, in an employee ID such as“F-9-107“, “F“ is sufficient to determine the finance department. Inspired by the above observation, we propose a novel class of ICs, called pattern functional dependencies (PFDs), to model fine-grained data dependencies gleaned from partial attribute values. These dependencies cannot be modeled using traditional ICs, such as (conditional) functional dependencies, which work on entire attribute values. We also present a set of axioms for the inference of PFDs, analogous to Armstrong's axioms for FDs, and study the complexity of consistency and implication analysis of PFDs. Moreover, we devise an effective algorithm to automatically discover PFDs even in the presence of errors in the data. Our extensive experiments on 15 real-world datasets show that our approach can effectively discover valid and useful PFDs over dirty data, which can then be used to detect data errors that are hard to capture by other types of ICs. 2021-10-27T19:57:20Z 2021-10-27T19:57:20Z 2020 2021-09-24T13:57:25Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/133951 en 10.14778/3377369.3377377 Proceedings of the VLDB Endowment Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/ application/pdf VLDB Endowment VLDB Endowment
spellingShingle	Qahtan, Abdulhakim Tang, Nan Ouzzani, Mourad Cao, Yang Stonebraker, Michael Pattern functional dependencies for data cleaning
title	Pattern functional dependencies for data cleaning
title_full	Pattern functional dependencies for data cleaning
title_fullStr	Pattern functional dependencies for data cleaning
title_full_unstemmed	Pattern functional dependencies for data cleaning
title_short	Pattern functional dependencies for data cleaning
title_sort	pattern functional dependencies for data cleaning
url	https://hdl.handle.net/1721.1/133951
work_keys_str_mv	AT qahtanabdulhakim patternfunctionaldependenciesfordatacleaning AT tangnan patternfunctionaldependenciesfordatacleaning AT ouzzanimourad patternfunctionaldependenciesfordatacleaning AT caoyang patternfunctionaldependenciesfordatacleaning AT stonebrakermichael patternfunctionaldependenciesfordatacleaning

Pattern functional dependencies for data cleaning

Similar Items