A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different a...
Main Authors: | S. Momtazi, A. Rahbar, D. Salami, I. Khanijazani |
---|---|
Format: | Article |
Language: | English |
Published: | Shahrood University of Technology, 2019-07-01 |
Series: | Journal of Artificial Intelligence and Data Mining |
Subjects: | text mining, semantic representation, topic modeling, neural document embedding |
Online Access: | http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf |
_version_ | 1818499743503876096 |
---|---|
author | S. Momtazi, A. Rahbar, D. Salami, I. Khanijazani |
collection | DOAJ |
description | Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use semantic models for document vector representation. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they behave differently and capture the semantic features of texts from different perspectives. We then propose a hybrid approach to document vector representation that benefits from the advantages of both models. Experimental results on the 20 Newsgroups dataset show that the proposed model outperforms each of the baselines on both text clustering and text classification. We achieved a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification compared to the best baseline model. |
first_indexed | 2024-12-10T20:33:37Z |
format | Article |
id | doaj.art-87f47e286691477e93e82f08ad779440 |
institution | Directory of Open Access Journals |
issn | 2322-5211, 2322-4444 |
language | English |
last_indexed | 2024-12-10T20:33:37Z |
publishDate | 2019-07-01 |
publisher | Shahrood University of Technology |
record_format | Article |
series | Journal of Artificial Intelligence and Data Mining |
spelling | doaj.art-87f47e286691477e93e82f08ad779440; 2022-12-22T01:34:36Z; eng; Shahrood University of Technology; Journal of Artificial Intelligence and Data Mining; ISSN 2322-5211, 2322-4444; 2019-07-01; vol. 7, no. 3, pp. 443-450; doi:10.22044/jadm.2019.7400.1876; article 1457; A Joint Semantic Vector Representation Model for Text Clustering and Classification; S. Momtazi, A. Rahbar, D. Salami, I. Khanijazani (all: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran); http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf; text mining, semantic representation, topic modeling, neural document embedding |
title | A Joint Semantic Vector Representation Model for Text Clustering and Classification |
topic | text mining, semantic representation, topic modeling, neural document embedding |
url | http://jad.shahroodut.ac.ir/article_1457_3c18f0bbd3b123b7d78dbfcbfafb8824.pdf |
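
The description mentions a hybrid document representation that draws on both LDA topic modeling and doc2vec embeddings, but this record does not say how the two are combined. The following is a minimal sketch of one plausible reading, assuming the hybrid is a simple concatenation of the L2-normalized LDA topic distribution and the L2-normalized doc2vec vector. The function names (`l2_normalize`, `joint_vector`, `cosine`) and all numeric values are hypothetical and for illustration only; they are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def joint_vector(topic_dist, doc_embedding):
    """Assumed combination rule: concatenate the two normalized views of a document."""
    return l2_normalize(topic_dist) + l2_normalize(doc_embedding)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Hypothetical precomputed features for three documents.
# In practice these would come from a trained LDA model (topic proportions)
# and a trained doc2vec model (dense embeddings).
lda_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
}
doc2vec_emb = {
    "doc_a": [0.5, -0.1, 0.3, 0.8],
    "doc_b": [0.4, 0.0, 0.2, 0.9],
    "doc_c": [-0.6, 0.7, 0.1, -0.2],
}

# Build the joint representation: 3 topic dimensions + 4 embedding dimensions.
joint = {
    name: joint_vector(lda_topics[name], doc2vec_emb[name])
    for name in lda_topics
}

sim_ab = cosine(joint["doc_a"], joint["doc_b"])
sim_ac = cosine(joint["doc_a"], joint["doc_c"])
print(f"sim(doc_a, doc_b) = {sim_ab:.3f}")
print(f"sim(doc_a, doc_c) = {sim_ac:.3f}")
```

Normalizing each part before concatenation keeps either view from dominating distance computations purely because of scale. The joint vectors can then be passed to any standard clustering or classification algorithm in place of TF-IDF features.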