Using Statistical Properties to Enhance Text Categorization

Statistical properties extracted from text are useful in many areas. Knowing who authored some text or knowing the category of a text is among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It...


Bibliographic Details
Main Authors: Rached Zantout, Ziad Osman
Format: Article
Language: English
Published: International Institute of Informatics and Cybernetics 2015-06-01
Series: Journal of Systemics, Cybernetics and Informatics
Subjects: Statistical Properties, text categorization, text mining, data mining
Online Access: http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf
collection DOAJ
description Statistical properties extracted from text are useful in many areas. Knowing who authored some text or knowing the category of a text is among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It is observed that the properties do not depend on the corpus nor on its size. Several interesting properties are identified which enable minimizing the training set for an intelligent categorization system. Aside from text categorization, the properties can be used to compare the information content between different corpora. The properties can also be used to compare the rate of new information content between different corpora.
id doaj.art-d0675f6c595248649a80078e1721efa5
institution Directory Open Access Journal
issn 1690-4524
topic Statistical Properties
text categorization
text mining
data mining