Using Statistical Properties to Enhance Text Categorization

Bibliographic Details
Main Authors: Rached Zantout, Ziad Osman
Format: Article
Language: English
Published: International Institute of Informatics and Cybernetics 2015-06-01
Series: Journal of Systemics, Cybernetics and Informatics
Online Access: http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf
Description
Summary: Statistical properties extracted from text are useful in many areas; identifying the author of a text and determining its category are among the uses of such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. The properties are observed to depend neither on the corpus nor on its size. Several properties are identified that make it possible to minimize the training set for an intelligent categorization system. Beyond text categorization, the properties can be used to compare both the information content and the rate of new information content between different corpora.
ISSN: 1690-4524