Natural language search of structured documents
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Main Author: | Oney, Stephen W |
---|---|
Other Authors: | Deb K. Roy. |
Format: | Thesis |
Language: | eng |
Published: | Massachusetts Institute of Technology, 2009 |
Subjects: | Electrical Engineering and Computer Science. |
Online Access: | http://hdl.handle.net/1721.1/46009 |
_version_ | 1826216579715563520 |
---|---|
author | Oney, Stephen W |
author2 | Deb K. Roy. |
author_facet | Deb K. Roy. Oney, Stephen W |
author_sort | Oney, Stephen W |
collection | MIT |
description | Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. |
first_indexed | 2024-09-23T16:49:49Z |
format | Thesis |
id | mit-1721.1/46009 |
institution | Massachusetts Institute of Technology |
language | eng |
last_indexed | 2024-09-23T16:49:49Z |
publishDate | 2009 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/46009 2019-04-11T14:27:25Z Natural language search of structured documents Oney, Stephen W Deb K. Roy. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (leaves 45-47). This thesis focuses on techniques with which natural language can be used to search for specific elements in a structured document, such as an XML file. The goal is to create a system capable of being trained to identify features of a written English sentence describing (in natural language) part of an XML document that help identify the sections of said document which were discussed. In particular, this thesis will revolve around the problem of searching through XML documents, each of which describes the play-by-play events of a baseball game. These events are collected from Major League Baseball games between 2004 and 2008, and contain information detailing the outcome of every pitch thrown. My techniques are trained and tested on written (newspaper) summaries of these games, which often refer to specific game events and statistics. The choice of these training data makes the task much more complex in two ways. First, these summaries come from multiple authors. Each of these authors has a distinct writing style, which uses language in a unique and often complex way. Second, large portions of these summaries discuss facts outside of the context of the play-by-play events of the XML documents. Training the system with these portions of the summary can create a problem due to sparse data, which has the potential to reduce the effectiveness of the system.
The end result is the creation of a system capable of building classifiers for natural language search of these XML documents. This system is able to overcome the two aforementioned problems, as well as several more subtle challenges. In addition, several limitations of alternative, strictly feature-based classifiers are illustrated, and applications of this research to related problems (outside of baseball and sports) are discussed. by Stephen W. Oney. M.Eng. 2009-06-30T16:59:39Z 2009-06-30T16:59:39Z 2008 2008 Thesis http://hdl.handle.net/1721.1/46009 355696468 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 48 leaves application/pdf Massachusetts Institute of Technology |
spellingShingle | Electrical Engineering and Computer Science. Oney, Stephen W Natural language search of structured documents |
title | Natural language search of structured documents |
title_full | Natural language search of structured documents |
title_fullStr | Natural language search of structured documents |
title_full_unstemmed | Natural language search of structured documents |
title_short | Natural language search of structured documents |
title_sort | natural language search of structured documents |
topic | Electrical Engineering and Computer Science. |
url | http://hdl.handle.net/1721.1/46009 |
work_keys_str_mv | AT oneystephenw naturallanguagesearchofstructureddocuments |