GAE-Based Document Embedding Method for Clustering

Bibliographic Details
Main Authors: Sungwon Jung, Sangmin Ka
Format: Article
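Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Document embedding; text embedding; document clustering; graph autoencoder; graph CNN; autoencoder
Online Access: https://ieeexplore.ieee.org/document/9982435/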
author Sungwon Jung
Sangmin Ka
collection DOAJ
description Several document embedding methods for clustering based on deep neural networks have been proposed recently. However, existing methods either generate document embeddings that depend on a given number of document clusters or generate embeddings that ignore the high similarity between documents belonging to the same cluster. In this paper, we propose a new document embedding method for clustering that uses a graph autoencoder. To this end, we construct an undirected, weighted, sparse graph from a set of documents, in which each document is represented by a node and every weighted edge connects two nodes with high cosine similarity. We then apply the proposed graph autoencoder to the graph to compute node embedding vectors, and each node embedding vector is used as a document embedding vector. This paper presents in-depth experimental analyses of the proposed method. Experimental results on various real document data sets demonstrate that the proposed approach yields a significant performance improvement over existing document embedding methods.
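The graph-construction step named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: this record does not say how documents are vectorized or how "high" similarity is decided, so the bag-of-words vectors, the `threshold` value, and the names `cosine` and `build_similarity_graph` are all assumptions made for illustration. The graph autoencoder step is not shown.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_similarity_graph(docs, threshold=0.5):
    """Build an undirected weighted graph as {(i, j): weight} with i < j.

    Each document is a node; an edge is kept only when the cosine
    similarity of its two end nodes reaches `threshold` (an assumed
    cutoff), which keeps the graph sparse.
    """
    # Assumption: plain whitespace-tokenized term counts stand in for
    # whatever document representation the paper actually uses.
    vectors = [Counter(d.lower().split()) for d in docs]
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            weight = cosine(vectors[i], vectors[j])
            if weight >= threshold:
                edges[(i, j)] = weight
    return edges


docs = [
    "graph autoencoder for document clustering",
    "document clustering with a graph autoencoder",
    "weather forecast for seoul",
]
graph = build_similarity_graph(docs)
```

Here only the first two documents share enough terms to be linked, so the third remains an isolated node. The pairwise loop is quadratic in the number of documents and is meant only to make the idea concrete.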
first_indexed 2024-04-12T01:20:48Z
format Article
id doaj.art-3ebb880c9ffd4f068bd4f8e75d0ceeb9
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-12T01:20:48Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-3ebb880c9ffd4f068bd4f8e75d0ceeb9 | 2022-12-22T03:53:48Z | eng | IEEE | IEEE Access | 2169-3536 | 2022-01-01 | 10 | 130089 | 130096 | 10.1109/ACCESS.2022.3228548 | 9982435 | GAE-Based Document Embedding Method for Clustering | Sungwon Jung (0), https://orcid.org/0000-0002-5332-5947 | Sangmin Ka (1), https://orcid.org/0000-0001-7572-022X | Department of Computer Science and Engineering, Sogang University, Seoul, South Korea | Department of Computer Science and Engineering, Sogang University, Seoul, South Korea | https://ieeexplore.ieee.org/document/9982435/ | Document embedding | text embedding | document clustering | graph autoencoder | graph CNN | autoencoder
title GAE-Based Document Embedding Method for Clustering
topic Document embedding
text embedding
document clustering
graph autoencoder
graph CNN
autoencoder
url https://ieeexplore.ieee.org/document/9982435/