Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery

Bibliographic Details
Main Authors: Castro Fernandez, Raul, Mansour, Essam, Qahtan, Abdulhakim A., Elmagarmid, Ahmed, Ilyas, Ihab, Madden, Samuel, Ouzzani, Mourad, Stonebraker, Michael, Tang, Nan
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: IEEE 2021
Online Access: https://hdl.handle.net/1721.1/137849
author Castro Fernandez, Raul
Mansour, Essam
Qahtan, Abdulhakim A.
Elmagarmid, Ahmed
Ilyas, Ihab
Madden, Samuel
Ouzzani, Mourad
Stonebraker, Michael
Tang, Nan
author2 Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
collection MIT
description © 2018 IEEE. Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent group, a technique to combine word embeddings that works better than other state of the art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links.
first_indexed 2024-09-23T11:06:39Z
format Article
id mit-1721.1/137849
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T11:06:39Z
publishDate 2021
publisher IEEE
record_format dspace
spelling mit-1721.1/137849 2023-04-07T19:59:00Z
2021-11-09T12:48:12Z 2021-11-09T12:48:12Z 2018-04 2019-06-18T17:15:24Z
Article http://purl.org/eprint/type/ConferencePaper
https://hdl.handle.net/1721.1/137849
Castro Fernandez, Raul, Mansour, Essam, Qahtan, Abdulhakim A., Elmagarmid, Ahmed, Ilyas, Ihab et al. 2018. "Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery."
en
10.1109/icde.2018.00093
Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
application/pdf
IEEE website
title Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery
url https://hdl.handle.net/1721.1/137849