Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery
© 2018 IEEE. Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravate this problem. Similar to how we navigate the Web, we propose to identify se...
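Main Authors: | Castro Fernandez, Raul; Mansour, Essam; Qahtan, Abdulhakim A.; Elmagarmid, Ahmed; Ilyas, Ihab; Madden, Samuel; Ouzzani, Mourad; Stonebraker, Michael; Tang, Nan |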
---|---|
Other Authors: | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
Format: | Article |
Language: | English |
Published: | IEEE 2021 |
Online Access: | https://hdl.handle.net/1721.1/137849 |
_version_ | 1826198551574609920 |
---|---|
author | Castro Fernandez, Raul Mansour, Essam Qahtan, Abdulhakim A. Elmagarmid, Ahmed Ilyas, Ihab Madden, Samuel Ouzzani, Mourad Stonebraker, Michael Tang, Nan |
author2 | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
collection | MIT |
description | © 2018 IEEE. Employees who spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas, aggravate this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become available to analysts. The main challenge is how to find pairs of objects that are semantically related. We propose SEMPROP, a DAG of different components that find links based on syntactic and semantic similarities. SEMPROP is commanded by a semantic matcher which leverages word embeddings to find objects that are semantically related. We introduce coherent groups, a technique to combine word embeddings that works better than other state-of-the-art combination alternatives. We implement SEMPROP as part of Aurum, a data discovery system we are building, and conduct user studies, real deployments, and a quantitative evaluation to understand the benefits of links for data discovery tasks, as well as the benefits of SEMPROP and coherent groups to find those links. |
first_indexed | 2024-09-23T11:06:39Z |
format | Article |
id | mit-1721.1/137849 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2024-09-23T11:06:39Z |
publishDate | 2021 |
publisher | IEEE |
record_format | dspace |
spelling | mit-1721.1/137849 2023-04-07T19:59:00Z 2021-11-09T12:48:12Z 2021-11-09T12:48:12Z 2018-04 2019-06-18T17:15:24Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/137849 Castro Fernandez, Raul, Mansour, Essam, Qahtan, Abdulhakim A., Elmagarmid, Ahmed, Ilyas, Ihab et al. 2018. "Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery." en 10.1109/icde.2018.00093 Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf IEEE website |
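
The description in this record says that SEMPROP's semantic matcher uses word embeddings to find semantically related objects and that embeddings for several words are combined. The record does not define how that combination works, so the following is only a minimal sketch of one plausible approach (average pairwise cosine similarity compared against a threshold), not the authors' coherent-group method. The toy vectors, the function names, and the `threshold` value are all hypothetical and chosen purely for illustration.

```python
import math
from itertools import combinations

# Made-up 3-dimensional vectors standing in for real word embeddings.
# They are assumptions for illustration, not values from any trained model.
TOY_EMBEDDINGS = {
    "salary": [0.9, 0.1, 0.0],
    "wage": [0.8, 0.2, 0.1],
    "income": [0.85, 0.15, 0.05],
    "protein": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_pairwise_similarity(words, embeddings):
    """Mean cosine similarity over all pairs of words that have an embedding.

    Returns 0.0 when fewer than two of the words are known.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def is_related(words, embeddings, threshold=0.8):
    """Hypothetical decision rule: words count as semantically related
    when their average pairwise similarity reaches the threshold."""
    return average_pairwise_similarity(words, embeddings) >= threshold


if __name__ == "__main__":
    print(is_related(["salary", "wage", "income"], TOY_EMBEDDINGS))
    print(is_related(["salary", "protein"], TOY_EMBEDDINGS))
```

In a setting like the one the description outlines, the words might come from table and attribute names on one side and ontology terms on the other. A real deployment would load pretrained embeddings rather than hand-written vectors and would need a threshold tuned on labeled data.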