Improving Term Weighting Schemes for Short Text Classification in Vector Space Model

Short text is one of the predominant forms of communication with unique characteristics such as short length, high sparsity, and lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term w...


Bibliographic Details
Main Authors: Surender Singh Samant, N. L. Bhanu Murthy, Aruna Malapati
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/8903254/
collection DOAJ
description Short text is one of the predominant forms of communication with unique characteristics such as short length, high sparsity, and lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term weighting is an important pre-processing step for text classification in the vector space model. In this paper, we propose three modifications to existing state-of-the-art term weighting schemes (ifn-tp-icf, RFR, and modOR) and a new term weighting scheme, ifn-modRF. We compare the proposed schemes with ten existing unsupervised and supervised schemes using three datasets of informally written short text: a self-labelled dataset of real-world events from Twitter, a Yahoo! questions dataset, and a dataset of product reviews. Based on the experimental results using three popular classifiers, we observe that the proposed scheme ifn-modRF achieves the best F1-scores on the Twitter dataset, while the proposed modification modOR is a consistent performer with the best scores in most of the experiments. The proposed modification ifn-tp-icf also outperforms the original scheme in most experiments.
id doaj.art-a4707f135579404fb2218dfd4239f35b
institution Directory Open Access Journal
issn 2169-3536
doi 10.1109/ACCESS.2019.2953918
volume 7
pages 166578-166592
orcid Surender Singh Samant: https://orcid.org/0000-0001-8619-3779
N. L. Bhanu Murthy: https://orcid.org/0000-0002-9187-1869
Aruna Malapati: https://orcid.org/0000-0001-7275-378X
affiliation Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, India (all three authors)
topic Text classification
text categorization
term weighting
Twitter