Selecting training samples from large and noisy corpora for efficient text classification

59 p.

Bibliographic Details
Main Author:	Wong, Daji
Other Authors:	Manoranjan Dash
Format:	Thesis
Published:	2011
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Online Access:	http://hdl.handle.net/10356/47535

_version_	1824456295483179008
author	Wong, Daji
author2	Manoranjan Dash
author_facet	Manoranjan Dash Wong, Daji
author_sort	Wong, Daji
collection	NTU
description	59 p.
first_indexed	2025-02-19T03:51:50Z
format	Thesis
id	ntu-10356/47535
institution	Nanyang Technological University
last_indexed	2025-02-19T03:51:50Z
publishDate	2011
record_format	dspace
spelling	ntu-10356/47535 2019-12-10T13:02:26Z Selecting training samples from large and noisy corpora for efficient text classification Wong, Daji Manoranjan Dash Wee Kim Wee School of Communication and Information DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing 59 p. In this thesis, an algorithm is presented that selects samples of documents for training text classifiers. Often the number of documents is very large and the documents are noisy. For both efficiency and accuracy, one needs good samples, not just blind samples such as those produced by simple random sampling. The proposed algorithm is far superior to simple random sampling, both for small sampling ratios and in the presence of noise. It is based on the simple observation that the terms in the set of training sample documents should have approximately the same document frequency as in the whole set (not including the test set). Master of Science (Information Studies) 2011-12-27T08:36:21Z 2011-12-27T08:36:21Z 2009 2009 Thesis http://hdl.handle.net/10356/47535 Nanyang Technological University application/pdf
spellingShingle	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing Wong, Daji Selecting training samples from large and noisy corpora for efficient text classification
title	Selecting training samples from large and noisy corpora for efficient text classification
title_full	Selecting training samples from large and noisy corpora for efficient text classification
title_fullStr	Selecting training samples from large and noisy corpora for efficient text classification
title_full_unstemmed	Selecting training samples from large and noisy corpora for efficient text classification
title_short	Selecting training samples from large and noisy corpora for efficient text classification
title_sort	selecting training samples from large and noisy corpora for efficient text classification
topic	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
url	http://hdl.handle.net/10356/47535
work_keys_str_mv	AT wongdaji selectingtrainingsamplesfromlargeandnoisycorporaforefficienttextclassification
