Approximate string joins with abbreviations

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.

Bibliographic Details
Main Author: Tao, Wenbo, Ph. D. Massachusetts Institute of Technology
Other Authors: Michael Stonebraker.
Format: Thesis
Language: English
Published: Massachusetts Institute of Technology 2018
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/118039
Record ID: mit-1721.1/118039 (2022-06-27T17:04:38Z)
Alternative Title: ASJ with abbreviations
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: S.M.
Statement of Responsibility: by Wenbo Tao.
Notes: Cataloged from PDF version of thesis. Includes bibliographical references (pages 61-63).
Abstract: String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this thesis, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable due to the exponential time complexity. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an unsupervised approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.
Date Available: 2018-09-17T15:54:53Z
Date Issued: 2018
Other Identifier: 1051459082
Extent: 63 pages
File Format: application/pdf
Rights: MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582