Using Statistical Properties to Enhance Text Categorization
Statistical properties extracted from text are useful in many areas. Knowing who authored some text or knowing the category of a text is among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It...
Main Authors: | Rached Zantout, Ziad Osman |
---|---|
Format: | Article |
Language: | English |
Published: | International Institute of Informatics and Cybernetics, 2015-06-01 |
Series: | Journal of Systemics, Cybernetics and Informatics |
Subjects: | Statistical Properties, text categorization, text mining, data mining |
Online Access: | http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf |
_version_ | 1817990875525939200 |
---|---|
author | Rached Zantout, Ziad Osman |
author_facet | Rached Zantout, Ziad Osman |
author_sort | Rached Zantout |
collection | DOAJ |
description | Statistical properties extracted from text are useful in many areas. Knowing who authored some text or knowing the category of a text is among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It is observed that the properties do not depend on the corpus nor on its size. Several interesting properties are identified which enable minimizing the training set for an intelligent categorization system. Aside from text categorization, the properties can be used to compare the information content between different corpora. The properties can also be used to compare the rate of new information content between different corpora. |
first_indexed | 2024-04-14T01:05:47Z |
format | Article |
id | doaj.art-d0675f6c595248649a80078e1721efa5 |
institution | Directory Open Access Journal |
issn | 1690-4524 |
language | English |
last_indexed | 2024-04-14T01:05:47Z |
publishDate | 2015-06-01 |
publisher | International Institute of Informatics and Cybernetics |
record_format | Article |
series | Journal of Systemics, Cybernetics and Informatics |
spelling | doaj.art-d0675f6c595248649a80078e1721efa5 2022-12-22T02:21:16Z eng International Institute of Informatics and Cybernetics Journal of Systemics, Cybernetics and Informatics 1690-4524 2015-06-01 1336874 Using Statistical Properties to Enhance Text Categorization Rached Zantout 0 Ziad Osman 1 Statistical properties extracted from text are useful in many areas. Knowing who authored some text or knowing the category of a text is among the uses of collecting such statistics. In this paper, language-independent properties of text are studied using two categorized corpora of news articles. It is observed that the properties do not depend on the corpus nor on its size. Several interesting properties are identified which enable minimizing the training set for an intelligent categorization system. Aside from text categorization, the properties can be used to compare the information content between different corpora. The properties can also be used to compare the rate of new information content between different corpora. http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf Statistical Properties, text categorization, text mining, data mining |
title | Using Statistical Properties to Enhance Text Categorization |
topic | Statistical Properties, text categorization, text mining, data mining |
url | http://www.iiisci.org/Journal/CV$/sci/pdfs/SA098US15.pdf |