Document representations for classification of short web-page descriptions

Motivated by applying Text Categorization to the classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts,...

Bibliographic Details
Main Authors: Radovanović Miloš, Ivanović Mirjana
Format: Article
Language: English
Published: University of Belgrade 2008-01-01
Series: Yugoslav Journal of Operations Research
Subjects: text categorization, document representation, machine learning
Online Access: http://www.doiserbia.nb.rs/img/doi/0354-0243/2008/0354-02430801123R.pdf
_version_ 1830282957294665728
author Radovanović Miloš
Ivanović Mirjana
author_facet Radovanović Miloš
Ivanović Mirjana
author_sort Radovanović Miloš
collection DOAJ
description Motivated by applying Text Categorization to the classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, which are short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web page. Different transformations of the input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance as measured by the classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but rather on describing the effect of every individual transformation on classification, together with their mutual relationships.
first_indexed 2024-12-19T02:48:20Z
format Article
id doaj.art-51a641bc24474d278a0a10de1bdeeb64
institution Directory Open Access Journal
issn 0354-0243
1820-743X
language English
last_indexed 2024-12-19T02:48:20Z
publishDate 2008-01-01
publisher University of Belgrade
record_format Article
series Yugoslav Journal of Operations Research
spelling doaj.art-51a641bc24474d278a0a10de1bdeeb64 2022-12-21T20:38:46Z eng University of Belgrade, Yugoslav Journal of Operations Research, ISSN 0354-0243, 1820-743X, 2008-01-01, vol. 18, no. 1, pp. 123-138, doi:10.2298/YJOR0801123R
title Document representations for classification of short web-page descriptions
topic text categorization
document representation
machine learning
url http://www.doiserbia.nb.rs/img/doi/0354-0243/2008/0354-02430801123R.pdf
work_keys_str_mv AT radovanovicmilos documentrepresentationsforclassificationofshortwebpagedescriptions
AT ivanovicmirjana documentrepresentationsforclassificationofshortwebpagedescriptions
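
The description above names logtf, idf and normalization as bag-of-words transformations, and F1 and F2 as evaluation metrics, but this record does not give their formulas. The following is a minimal sketch under common textbook conventions, which are assumptions here and not confirmed by the record: logtf as log(1 + tf), idf as log(N / df), normalization as scaling to unit Euclidean length, and F-beta as the weighted harmonic mean of precision and recall. All function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Raw term-frequency vectors (one Counter per document), whitespace-tokenized."""
    return [Counter(doc.lower().split()) for doc in docs]


def logtf(vec):
    """Dampen raw counts. Assumed definition: log(1 + tf)."""
    return {term: math.log(1 + tf) for term, tf in vec.items()}


def idf_weights(vecs):
    """Inverse document frequency. Assumed definition: log(N / df)."""
    n_docs = len(vecs)
    doc_freq = Counter(term for vec in vecs for term in vec)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def apply_idf(vec, idf):
    """Multiply each term weight by that term's idf."""
    return {term: weight * idf[term] for term, weight in vec.items()}


def normalize(vec):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 gives F2 (recall weighted more heavily)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    docs = ["web search results", "web page topics"]
    vecs = bag_of_words(docs)
    idf = idf_weights(vecs)
    for vec in vecs:
        print(normalize(apply_idf(logtf(vec), idf)))
    print(f_beta(0.5, 1.0, beta=1), f_beta(0.5, 1.0, beta=2))
```

Each transformation is a separate function so that it can be switched on or off independently, which mirrors the study's focus on the effect of individual transformations, not on one fixed pipeline.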