Using Statistical Properties to Enhance Text Categorization
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | International Institute of Informatics and Cybernetics, 2015-06-01 |
Series: | Journal of Systemics, Cybernetics and Informatics |
Subjects: | |
Online Access: | http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf |
Summary: | Statistical properties extracted from text are useful in many areas, including identifying the author of a text and determining its category. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. The properties are observed to depend neither on the corpus nor on its size. Several properties are identified that make it possible to minimize the training set for an intelligent categorization system. Beyond text categorization, the properties can be used to compare both the information content and the rate of new information content between different corpora. |
---|---|
ISSN: | 1690-4524 |