A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks in text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different a...


Bibliographic Details
Main Authors: S. Momtazi, A. Rahbar, D. Salami, I. Khanijazani
Format: Article
Language: English
Published: Shahrood University of Technology 2019-07-01
Series: Journal of Artificial Intelligence and Data Mining
Subjects: text mining; semantic representation; topic modeling; neural document embedding
Online Access: http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf
author S. Momtazi
A. Rahbar
D. Salami
I. Khanijazani
collection DOAJ
description Text clustering and classification are two main tasks in text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use semantic models for document vector representation. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they behave differently and capture the semantic features of texts from different perspectives. We then propose a hybrid approach to document vector representation that benefits from the advantages of both models. The experimental results on the 20 Newsgroups dataset show the superiority of the proposed model over each of the baselines on both text clustering and text classification. Compared to the best baseline model, we achieve a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification.
first_indexed 2024-12-10T20:33:37Z
format Article
id doaj.art-87f47e286691477e93e82f08ad779440
institution Directory Open Access Journal
issn 2322-5211
2322-4444
language English
last_indexed 2024-12-10T20:33:37Z
publishDate 2019-07-01
publisher Shahrood University of Technology
record_format Article
series Journal of Artificial Intelligence and Data Mining
spelling doaj.art-87f47e286691477e93e82f08ad779440
2022-12-22T01:34:36Z
eng
Shahrood University of Technology
Journal of Artificial Intelligence and Data Mining
ISSN 2322-5211, 2322-4444
2019-07-01
vol. 7, no. 3, pp. 443-450
DOI 10.22044/jadm.2019.7400.1876
article ID 1457
A Joint Semantic Vector Representation Model for Text Clustering and Classification
S. Momtazi, A. Rahbar, D. Salami, I. Khanijazani (all: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran)
http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf
text mining; semantic representation; topic modeling; neural document embedding
title A Joint Semantic Vector Representation Model for Text Clustering and Classification
topic text mining
semantic representation
topic modeling
neural document embedding
url http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf