Natural language search of structured documents

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.

Bibliographic Details
Main Author: Oney, Stephen W
Other Authors: Deb K. Roy.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2009
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/46009
_version_ 1826216579715563520
author Oney, Stephen W
author2 Deb K. Roy.
author_facet Deb K. Roy.
Oney, Stephen W
author_sort Oney, Stephen W
collection MIT
description Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
first_indexed 2024-09-23T16:49:49Z
format Thesis
id mit-1721.1/46009
institution Massachusetts Institute of Technology
language eng
last_indexed 2024-09-23T16:49:49Z
publishDate 2009
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/46009 2019-04-11T14:27:25Z Natural language search of structured documents Oney, Stephen W Deb K. Roy. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (leaves 45-47). This thesis focuses on techniques with which natural language can be used to search for specific elements in a structured document, such as an XML file. The goal is to create a system that can be trained to identify the features of a written English sentence describing part of an XML document that help identify which sections of that document are being discussed. In particular, this thesis will revolve around the problem of searching through XML documents, each of which describes the play-by-play events of a baseball game. These events were collected from Major League Baseball games played between 2004 and 2008 and detail the outcome of every pitch thrown. My techniques are trained and tested on written (newspaper) summaries of these games, which often refer to specific game events and statistics. The choice of these training data makes the task much more complex in two ways. First, these summaries come from multiple authors. Each of these authors has a distinct writing style, which uses language in a unique and often complex way. Second, large portions of these summaries discuss facts outside the context of the play-by-play events in the XML documents. Training the system on these portions of the summaries can create a sparse-data problem, which can reduce the effectiveness of the system.
The end result is a system capable of building classifiers for natural language search of these XML documents. This system is able to overcome the two aforementioned problems, as well as several more subtle challenges. In addition, several limitations of alternative, strictly feature-based classifiers are illustrated, and applications of this research to related problems (outside of baseball and sports) are discussed. by Stephen W. Oney. M.Eng. 2009-06-30T16:59:39Z 2009-06-30T16:59:39Z 2008 2008 Thesis http://hdl.handle.net/1721.1/46009 355696468 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 48 leaves application/pdf Massachusetts Institute of Technology
spellingShingle Electrical Engineering and Computer Science.
Oney, Stephen W
Natural language search of structured documents
title Natural language search of structured documents
title_sort natural language search of structured documents
topic Electrical Engineering and Computer Science.
url http://hdl.handle.net/1721.1/46009
work_keys_str_mv AT oneystephenw naturallanguagesearchofstructureddocuments