Profiling relational data: a survey

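Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.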
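
The profiling tasks named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: the function names, the `SAMPLE_ROWS` data, and the choice to treat `None` as the null marker are all assumptions made for illustration, and the checks are naive full scans rather than any of the algorithms the survey reviews.

```python
from collections import Counter

# Hypothetical sample relation; None stands in for a null value.
SAMPLE_ROWS = [
    {"zip": "10115", "city": "Berlin", "name": "Ada"},
    {"zip": "10115", "city": "Berlin", "name": None},
    {"zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"zip": "14482", "city": "Potsdam", "name": "Ada"},
]


def column_profile(rows, column):
    """Single-column statistics: null count, distinct count, top value."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1),
    }


def is_unique(rows, columns):
    """Unique column combination: no two rows share values on `columns`."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)


def holds_fd(rows, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False
        seen.setdefault(key, row[rhs])
    return True


def holds_ind(rows_a, col_a, rows_b, col_b):
    """Inclusion dependency: every non-null col_a value appears in col_b."""
    values_b = {row[col_b] for row in rows_b}
    return all(row[col_a] in values_b for row in rows_a if row[col_a] is not None)
```

On the sample rows, `holds_fd(SAMPLE_ROWS, ["zip"], "city")` holds while `is_unique(SAMPLE_ROWS, ["zip"])` does not. Discovering all such dependencies in a real dataset requires searching over column combinations, which is where the more involved techniques covered by the survey come in.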

Bibliographic Details
Main Authors: Abedjan, Ziawasch; Golab, Lukasz; Naumann, Felix
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: Springer Berlin Heidelberg 2016
Online Access: http://hdl.handle.net/1721.1/106176
https://orcid.org/0000-0003-3483-0523
collection MIT
description Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
first_indexed 2024-09-23T16:07:25Z
format Article
id mit-1721.1/106176
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T16:07:25Z
publishDate 2016
publisher Springer Berlin Heidelberg
record_format dspace
spelling mit-1721.1/106176 2022-10-02T06:30:19Z
2016-12-29T19:39:40Z 2016-12-29T19:39:40Z 2015-06 2015-05 2016-08-18T15:28:35Z
Article http://purl.org/eprint/type/JournalArticle
1066-8888 0949-877X
Abedjan, Ziawasch, Lukasz Golab, and Felix Naumann. “Profiling Relational Data: A Survey.” The VLDB Journal 24.4 (2015): 557–581.
http://dx.doi.org/10.1007/s00778-015-0389-y
The VLDB Journal
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
Springer-Verlag Berlin Heidelberg
application/pdf