Binned Term Count: An Alternative to Term Frequency for Text Categorization

In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This g...

Bibliographic Details
Main Authors: Farhan Shehzad, Abdur Rehman, Kashif Javed, Khalid A. Alnowibet, Haroon A. Babri, Hafiz Tayyab Rauf
Format: Article
Language: English
Published: MDPI AG 2022-11-01
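Series: Mathematics
Subjects: term frequency; term weighting schemes; bag-of-words model; feature representation; text classification
Online Access: https://www.mdpi.com/2227-7390/10/21/4124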
_version_ 1797467306738057216
author Farhan Shehzad
Abdur Rehman
Kashif Javed
Khalid A. Alnowibet
Haroon A. Babri
Hafiz Tayyab Rauf
collection DOAJ
description In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This gives us term frequency (TF), which in conjunction with inverse document frequency (IDF) became the most commonly used term weighting scheme to capture the importance of a term in a document and corpus. However, normalization may cause the term frequency of a term in a related document to become equal to or smaller than its term frequency in an unrelated document, thus perturbing a term’s strength from its true worth. In this paper, we solve this problem by introducing a non-linear mapping of term frequency. This alternative to TF is called binned term count (BTC). The newly proposed term frequency factor trims large term counts before normalization, thus moderating the normalization effect on large documents. To investigate the effectiveness of BTC, we compare it against the original TF and its more recently proposed alternative named modified term frequency (MTF). In our experiments, each of these term frequency factors (BTC, TF, and MTF) is combined with four well-known collection frequency factors (IDF, RF, IGM, and MONO), and the performance of each of the resulting term weighting schemes is evaluated on three standard datasets (Reuters (R8-21578), 20-Newsgroups, and WebKB) using support vector machines and K-nearest neighbor classifiers. To determine whether BTC is statistically better than TF and MTF, we have applied the paired two-sided t-test on the macro F1 results. Overall, BTC is found to be statistically significantly better than TF and MTF in 52% of cases. Furthermore, the highest macro F1 value on the three datasets was achieved by BTC-based term weighting schemes.
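The normalization issue described in the abstract can be illustrated with a small sketch. The record does not give the actual BTC bin boundaries or its normalizer, so the bin edges `HYPOTHETICAL_EDGES`, the helper names, and the choice to normalize by the sum of binned counts are all assumptions made for illustration, and the two toy documents are invented. This is not the authors' implementation.

```python
from collections import Counter


def term_frequency(tokens):
    """Plain TF: each raw term count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# ASSUMPTION: these bin edges are invented for illustration only;
# they are not taken from the paper.
HYPOTHETICAL_EDGES = (1, 2, 4, 8)


def binned_count(count, edges=HYPOTHETICAL_EDGES):
    """Map a raw count to the number of bin edges it reaches.

    Large counts are capped at len(edges), which is one simple way
    to trim them before normalization.
    """
    return sum(1 for edge in edges if count >= edge)


def binned_term_frequency(tokens, edges=HYPOTHETICAL_EDGES):
    """Bin every raw count first, then normalize.

    ASSUMPTION: the normalizer is the sum of binned counts.
    """
    counts = Counter(tokens)
    binned = {term: binned_count(count, edges) for term, count in counts.items()}
    total = sum(binned.values())
    return {term: value / total for term, value in binned.items()}


# Toy documents (made up): a long on-topic one and a short off-topic one.
long_doc = ["svm"] * 6 + ["the"] * 30 + ["kernel"] * 12 + ["margin"] * 12
short_doc = ["svm"] + ["filler%d" % i for i in range(9)]

for name, doc in (("long", long_doc), ("short", short_doc)):
    print(name, term_frequency(doc)["svm"], binned_term_frequency(doc)["svm"])
```

In this toy case, plain TF assigns "svm" the same weight in both documents (6 of 60 tokens versus 1 of 10), whereas the binned variant weights it more heavily in the long, on-topic document. That outcome depends on the invented bin edges and documents and should not be read as a result from the paper.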
first_indexed 2024-03-09T18:51:45Z
format Article
id doaj.art-46a2a54952fc41b28c67152b44aca42a
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-09T18:51:45Z
publishDate 2022-11-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-46a2a54952fc41b28c67152b44aca42a
2023-11-24T05:45:15Z
eng
MDPI AG
Mathematics
2227-7390
2022-11-01
10
21
4124
10.3390/math10214124
Binned Term Count: An Alternative to Term Frequency for Text Categorization
Farhan Shehzad (Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan)
Abdur Rehman (Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan)
Kashif Javed (Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan)
Khalid A. Alnowibet (Statistics and Operations Research Department, College of Science, King Saud University, Riyadh 11451, Saudi Arabia)
Haroon A. Babri (Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan)
Hafiz Tayyab Rauf (Centre for Smart Systems, AI and Cybersecurity, Staffordshire University, Stoke-on-Trent ST4 2DE, UK)
https://www.mdpi.com/2227-7390/10/21/4124
term frequency
term weighting schemes
bag-of-words model
feature representation
text classification
title Binned Term Count: An Alternative to Term Frequency for Text Categorization
topic term frequency
term weighting schemes
bag-of-words model
feature representation
text classification
url https://www.mdpi.com/2227-7390/10/21/4124