Semi-supervised learning for natural language

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.

Bibliographic Details
Main Author:	Liang, Percy
Other Authors:	Michael Collins.
Format:	Thesis
Language:	eng
Published:	Massachusetts Institute of Technology 2006
Subjects:	Electrical Engineering and Computer Science.
Online Access:	http://hdl.handle.net/1721.1/33296

_version_	1826209980847489024
author	Liang, Percy
author2	Michael Collins.
author_facet	Michael Collins. Liang, Percy
author_sort	Liang, Percy
collection	MIT
description	Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
first_indexed	2024-09-23T14:38:20Z
format	Thesis
id	mit-1721.1/33296
institution	Massachusetts Institute of Technology
language	eng
last_indexed	2024-09-23T14:38:20Z
publishDate	2006
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/33296 2019-04-11T03:13:27Z Semi-supervised learning for natural language Liang, Percy Michael Collins. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 75-82). Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available "for free" in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks: named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using the Perceptron algorithm. We also compare Markov and semi-Markov models on the two segmentation tasks. Our results show that features derived from unlabeled data substantially improve performance, both in terms of reducing the amount of labeled data needed to achieve a certain performance level and in terms of reducing the error using a fixed amount of labeled data. We find that sometimes semi-Markov models can also improve performance over Markov models. by Percy Liang. M.Eng. 2006-07-13T15:13:19Z 2006-07-13T15:13:19Z 2005 2005 Thesis http://hdl.handle.net/1721.1/33296 62278990 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 86 p. 4273216 bytes 4277241 bytes application/pdf Massachusetts Institute of Technology
spellingShingle	Electrical Engineering and Computer Science. Liang, Percy Semi-supervised learning for natural language
title	Semi-supervised learning for natural language
title_full	Semi-supervised learning for natural language
title_fullStr	Semi-supervised learning for natural language
title_full_unstemmed	Semi-supervised learning for natural language
title_short	Semi-supervised learning for natural language
title_sort	semi supervised learning for natural language
topic	Electrical Engineering and Computer Science.
url	http://hdl.handle.net/1721.1/33296
work_keys_str_mv	AT liangpercy semisupervisedlearningfornaturallanguage
