Detecting and parsing embedded lightweight structures

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.

Bibliographic Details
Main Author: Rha, Philip
Other Authors: Miller, Rob
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2006
Subjects: Electrical Engineering and Computer Science
Online Access: http://hdl.handle.net/1721.1/33349
Notes: Includes bibliographical references (p. 71-72).
Physical Description: 72 p.
Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582
Abstract:

Text documents, web pages, and source code all contain language structures that can be parsed with corresponding parsers. Some documents, such as JSP pages, Java tutorial pages, and Java source code, often have language structures nested within another language structure. Although separate parsers exist for the outer and the inner language structures, neither is suited to parsing the embedded structures in the context of the document. This thesis presents a new technique for selectively applying existing parsers to intelligently transformed document content.

The task of parsing these embedded structures can be broken into two phases: detecting the embedded structures and parsing them. To detect embedded structures, we take advantage of the fact that any given language has natural boundaries within which embedded structures can appear. We use these natural boundaries to narrow the search space for embedded structures. We further reduce the search space through statistical analysis of token frequency for different language types. By combining natural boundaries with token frequency analysis, we can, for any given document, generate a set of regions that have a high probability of being embedded structures.

To parse the embedded structures, the text of a region must often be transformed into a form that is readable by the intended parser. Our approach provides a systematic way to transform the document content into a form appropriate for the embedded structure parser using simple replacement rules. Combining region detection with these replacement rules, we successfully parse a range of documents with embedded structures using existing parsers.
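
The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the implementation from the thesis. It assumes an HTML page in which `<pre>` elements act as the natural boundaries, a small hand-picked set of Java-like tokens in place of a real token frequency model, and two replacement rules (strip inline tags, decode HTML entities). The names `BOUNDARY`, `JAVA_TOKENS`, `java_score`, `transform`, and `detect_regions`, and the threshold value, are all hypothetical.

```python
import html
import re

# Assumption: in an HTML page, embedded code appears inside <pre> elements.
BOUNDARY = re.compile(r"<pre[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)

# Assumption: a hand-picked stand-in for tokens that are frequent in Java.
JAVA_TOKENS = {"{", "}", "(", ")", ";", "=", "public", "class",
               "static", "void", "int", "return", "new"}

# Identifiers, integer literals, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def java_score(text):
    """Fraction of tokens in `text` that belong to JAVA_TOKENS."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in JAVA_TOKENS)
    return hits / len(tokens)


def transform(region):
    """Apply simple replacement rules so a Java parser could read the text."""
    # Rule 1: remove inline markup such as <b>...</b>.
    without_tags = re.sub(r"</?[A-Za-z][^>]*>", "", region)
    # Rule 2: decode entities such as &lt; after tags are gone.
    return html.unescape(without_tags)


def detect_regions(document, threshold=0.3):
    """Return transformed text of bounded regions that look like Java."""
    regions = []
    for match in BOUNDARY.finditer(document):
        candidate = match.group(1)
        # Score the raw region first, then transform it for parsing.
        if java_score(candidate) >= threshold:
            regions.append(transform(candidate))
    return regions


page = (
    "<p>This tutorial prints a greeting to the console.</p>"
    "<pre>public class Hello { static int n = 1; }</pre>"
    "<pre>Please read the next chapter for more details</pre>"
)
regions = detect_regions(page)
```

In this sketch only the first `<pre>` block passes the threshold, so `regions` holds a single string that could then be handed to an existing Java parser. A realistic detector would need boundaries and token statistics derived from actual documents rather than the fixed values assumed here.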