Efficient text classification

As the digital age advances, data volumes and document sizes have been growing rapidly, so a more efficient and accurate method of sampling data for training text classifiers is required. Good samples are needed, not just the blind samples produced by Simple Random Sampling; we therefore experimented with a new...

Bibliographic Details
Main Author: Tan, Cheryl Qian Ru.
Other Authors: Manoranjan Dash
Format: Final Year Project (FYP)
Language: English
Published: 2010
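Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing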
Online Access: http://hdl.handle.net/10356/39727
_version_ 1826111423464341504
author Tan, Cheryl Qian Ru.
author2 Manoranjan Dash
author_facet Manoranjan Dash
Tan, Cheryl Qian Ru.
author_sort Tan, Cheryl Qian Ru.
collection NTU
description As the digital age advances, data volumes and document sizes have been growing rapidly, so a more efficient and accurate method of sampling data for training text classifiers is required. Good samples are needed, not just the blind samples produced by Simple Random Sampling (SRS); we therefore experimented with a newly proposed sampling algorithm, CONCISE. CONCISE is a novel algorithm for selecting training documents for text classification, and experiments showed that it works particularly well at small sampling ratios. Experiments were conducted on the 20 Newsgroups corpus and the Reuters-21578 document set using two classifiers, SVM and Naïve Bayes. CONCISE was compared with SRS in every experiment, and the results showed that its accuracy is consistent regardless of which classifier is used. CONCISE achieved higher accuracy than SRS at every sampling ratio tested. CONCISE requires more running time, but this cost is small compared with the gain in accuracy.
first_indexed 2024-10-01T02:50:28Z
format Final Year Project (FYP)
id ntu-10356/39727
institution Nanyang Technological University
language English
last_indexed 2024-10-01T02:50:28Z
publishDate 2010
record_format dspace
spelling ntu-10356/39727 2023-03-03T20:47:47Z Efficient text classification Tan, Cheryl Qian Ru. Manoranjan Dash School of Computer Engineering Centre for Advanced Information Systems DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing Bachelor of Engineering (Computer Science) 2010-06-03T06:38:30Z 2010-06-03T06:38:30Z 2010 2010 Final Year Project (FYP) http://hdl.handle.net/10356/39727 en Nanyang Technological University 57 p. application/pdf
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
Tan, Cheryl Qian Ru.
Efficient text classification
title Efficient text classification
title_full Efficient text classification
title_fullStr Efficient text classification
title_full_unstemmed Efficient text classification
title_short Efficient text classification
title_sort efficient text classification
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
url http://hdl.handle.net/10356/39727
work_keys_str_mv AT tancherylqianru efficienttextclassification