Improving Term Weighting Schemes for Short Text Classification in Vector Space Model
Short text is one of the predominant forms of communication with unique characteristics such as short length, high sparsity, and lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term w...
Main Authors: | Surender Singh Samant, N. L. Bhanu Murthy, Aruna Malapati |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2019-01-01 |
Series: | IEEE Access |
Subjects: | Text classification, text categorization, term weighting |
Online Access: | https://ieeexplore.ieee.org/document/8903254/ |
_version_ | 1818728034421702656 |
---|---|
author | Surender Singh Samant; N. L. Bhanu Murthy; Aruna Malapati |
author_facet | Surender Singh Samant; N. L. Bhanu Murthy; Aruna Malapati |
author_sort | Surender Singh Samant |
collection | DOAJ |
description | Short text is one of the predominant forms of communication with unique characteristics such as short length, high sparsity, and lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term weighting is an important pre-processing step for text classification in the vector space model. In this paper, we propose three modifications to existing state-of-the-art term weighting schemes: ifn-tp-icf, RFR and modOR and a new term weighting scheme: ifn-modRF. We compare the proposed schemes with ten existing unsupervised and supervised schemes using three datasets of informally written short text: a self-labelled dataset of real-world events from Twitter, a Yahoo! questions dataset and a dataset of product reviews. Based on the experimental results using three popular classifiers, we observe that the proposed scheme ifn-modRF achieves the best F1-scores on the Twitter dataset, while the proposed modification modOR is a consistent performer with the best scores in most of the experiments. The proposed modification ifn-tp-icf also outperforms the original scheme in most experiments. |
first_indexed | 2024-12-17T22:23:34Z |
format | Article |
id | doaj.art-a4707f135579404fb2218dfd4239f35b |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-17T22:23:34Z |
publishDate | 2019-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-a4707f135579404fb2218dfd4239f35b; 2022-12-21T21:30:24Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 166578; 166592; 10.1109/ACCESS.2019.2953918; 8903254; Improving Term Weighting Schemes for Short Text Classification in Vector Space Model; Surender Singh Samant [0] https://orcid.org/0000-0001-8619-3779; N. L. Bhanu Murthy [1] https://orcid.org/0000-0002-9187-1869; Aruna Malapati [2] https://orcid.org/0000-0001-7275-378X; Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani—Hyderabad Campus, Hyderabad, India; Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani—Hyderabad Campus, Hyderabad, India; Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani—Hyderabad Campus, Hyderabad, India; Short text is one of the predominant forms of communication with unique characteristics such as short length, high sparsity, and lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term weighting is an important pre-processing step for text classification in the vector space model. In this paper, we propose three modifications to existing state-of-the-art term weighting schemes: ifn-tp-icf, RFR and modOR and a new term weighting scheme: ifn-modRF. We compare the proposed schemes with ten existing unsupervised and supervised schemes using three datasets of informally written short text: a self-labelled dataset of real-world events from Twitter, a Yahoo! questions dataset and a dataset of product reviews. Based on the experimental results using three popular classifiers, we observe that the proposed scheme ifn-modRF achieves the best F1-scores on the Twitter dataset, while the proposed modification modOR is a consistent performer with the best scores in most of the experiments. The proposed modification ifn-tp-icf also outperforms the original scheme in most experiments.; https://ieeexplore.ieee.org/document/8903254/; Text classification; text categorization; term weighting; twitter |
spellingShingle | Surender Singh Samant N. L. Bhanu Murthy Aruna Malapati Improving Term Weighting Schemes for Short Text Classification in Vector Space Model IEEE Access Text classification text categorization term weighting |
title | Improving Term Weighting Schemes for Short Text Classification in Vector Space Model |
title_full | Improving Term Weighting Schemes for Short Text Classification in Vector Space Model |
title_fullStr | Improving Term Weighting Schemes for Short Text Classification in Vector Space Model |
title_full_unstemmed | Improving Term Weighting Schemes for Short Text Classification in Vector Space Model |
title_short | Improving Term Weighting Schemes for Short Text Classification in Vector Space Model |
title_sort | improving term weighting schemes for short text classification in vector space model |
topic | Text classification, text categorization, term weighting |
url | https://ieeexplore.ieee.org/document/8903254/ |
work_keys_str_mv | AT surendersinghsamant improvingtermweightingschemesforshorttextclassificationinvectorspacemodel AT nlbhanumurthy improvingtermweightingschemesforshorttextclassificationinvectorspacemodel AT arunamalapati improvingtermweightingschemesforshorttextclassificationinvectorspacemodel |
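
The description above calls term weighting a pre-processing step for text classification in the vector space model. As a rough illustration of what such a step does, the sketch below computes plain tf-idf weights, a standard unsupervised baseline. It is not one of the schemes proposed in the article (ifn-tp-icf, RFR, modOR, ifn-modRF), whose formulas this record does not give. The function name `tf_idf` and the toy short texts are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper for illustration: weight = raw term frequency
    times log(N / document frequency), i.e. textbook tf-idf.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy corpus of short texts (invented for this example).
docs = [
    "flood warning issued downtown".split(),
    "flood relief camp opened".split(),
    "phone battery drains fast".split(),
]
weights = tf_idf(docs)
```

Each dict in `weights` is a sparse vector for one short text. A term found in fewer documents, such as "phone", gets a higher weight than one shared across documents, such as "flood". Supervised schemes of the kind the article compares additionally use class labels when computing weights.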