Towards automatically linking data elements

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.

Bibliographic Details
Main Author: Xiao, Katharine (Katharine J.)
Other Authors: Kalyan Veeramachaneni.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2018
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/113450
Record: mit-1721.1/113450 (2019-04-12T13:54:05Z)
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Statement of Responsibility: by Katharine Xiao.
Degree: M. Eng.
Dates: 2018-02-08T15:58:13Z; 2017
Identifier: 1020178875
Extent: 92 pages
File Format: application/pdf
Notes: This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 91-92).
Rights: MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582

Abstract:
When presented with a new dataset, human data scientists explore it in order to identify salient properties of the data elements, identify relationships between entities, and write processing software that makes use of those relationships accordingly. While there has been progress on automatically processing the data to generate features or models, most automation systems rely on receiving a data model that has all the meta information about the data, including salient properties and relationships. In this thesis, we present a first version of our system, called ADEL (Automatic Data Elements Linking). Given a collection of files, this system generates a relational data schema and identifies other salient properties. It detects the type of each data field, which describes not only the programmatic data type but also the context in which the data originated, through a method called Type Detection. For each file, it identifies the field that uniquely describes each row in it, also known as a Primary Key. Then, it discovers relationships between different data entities with Relationship Discovery, and discovers any implicit constraints in the data through Hard Constraint Discovery.

We pose two of these four problems as learning problems. To evaluate our algorithms, we compare the results of each to a set of manual annotations. For Type Detection, we saw a maximum error of 7%, with an average error of 2.2% across all datasets. For Primary Key Detection, we classified all existing primary keys correctly, and had one false positive across five datasets. For Relationship Discovery, we saw an average error of 5.6%. (Our results are limited by the small number of manual annotations we currently possess.) We then feed the output of our system into existing semi-automated data science software systems: the Deep Feature Synthesis (DFS) algorithm, which generates features for predictive models, and the Synthetic Data Vault (SDV), which generates a hierarchical graphical model. When ADEL's data annotations are fed into DFS, it produces similar or higher predictive accuracy in three of four problems, and when they are provided to SDV, it is able to generate synthetic data with no constraint violations.
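
Illustrative note (not part of the catalog record): the abstract describes a primary key as the field that uniquely describes each row, and Relationship Discovery as finding links between data entities. The sketch below shows one naive way such checks could look. It is not ADEL's method, which the abstract says treats some of these tasks as learning problems. The function names `candidate_primary_keys` and `could_reference`, and the sample `customers` and `orders` tables, are hypothetical and invented for this example.

```python
from typing import Dict, List

# Assumption: a table is a list of rows, each row a dict keyed by field name.
Table = List[Dict[str, object]]


def candidate_primary_keys(rows: Table) -> List[str]:
    """Return fields whose values are non-null and unique across all rows."""
    if not rows:
        return []
    candidates = []
    for field in rows[0]:
        values = [row.get(field) for row in rows]
        if None in values:
            continue  # a key must be present in every row
        if len(set(values)) == len(values):
            candidates.append(field)
    return candidates


def could_reference(child: Table, child_field: str,
                    parent: Table, parent_key: str) -> bool:
    """True if every non-null child value also appears in the parent key column."""
    parent_values = {row[parent_key] for row in parent}
    child_values = {row[child_field] for row in child
                    if row.get(child_field) is not None}
    return bool(child_values) and child_values <= parent_values


# Hypothetical sample data, invented for illustration only.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Ann"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]

print(candidate_primary_keys(customers))
print(candidate_primary_keys(orders))
print(could_reference(orders, "customer_id", customers, "customer_id"))
```

A uniqueness check like this can only propose candidates: a column may be unique by coincidence in a small file, which is one plausible reason a system would want more evidence than uniqueness alone before declaring a key.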