Supervised Paragraph Vector: Distributed Representations of Words, Documents and Class Labels

While bag-of-words was the traditional method of deriving document representations, it suffers from high dimensionality and sparsity. Recently, many methods for obtaining lower-dimensional and densely distributed representations have been proposed. Paragraph Vector is one such algorithm, which ex...

Bibliographic Details
Main Authors: Eunjeong L. Park, Sungzoon Cho, Pilsung Kang
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
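Subjects: Class label, distributed representations, representation learning, document embedding, word embedding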
Online Access: https://ieeexplore.ieee.org/document/8653834/
author Eunjeong L. Park
Sungzoon Cho
Pilsung Kang
collection DOAJ
description While bag-of-words was the traditional method of deriving document representations, it suffers from high dimensionality and sparsity. Recently, many methods for obtaining lower-dimensional and densely distributed representations have been proposed. Paragraph Vector is one such algorithm, which extends the word2vec algorithm by treating the paragraph as an additional word. However, it generates a single representation for all tasks, whereas different tasks may require different representations. In this paper, we propose Supervised Paragraph Vector, a task-specific variant of Paragraph Vector for situations where class labels exist. Essentially, Supervised Paragraph Vector uses class labels along with words and documents and obtains the corresponding representations with respect to the particular classification task. To demonstrate the benefits of the proposed algorithm, three performance criteria are used: interpretability, discriminative power, and computational efficiency. To test interpretability, we find the words that are closest to and farthest from each class vector and demonstrate that such words are closely related to the corresponding class. We also use principal component analysis to visualize all words, documents, and class labels at the same time and show that our method effectively displays the related words and documents for each class label. To evaluate discriminative power and computational efficiency, we perform document classification on four commonly used datasets with various classifiers and achieve classification accuracies comparable to those of bag-of-words and Paragraph Vector.
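The interpretability check described above (finding the words closest to and farthest from a class vector) can be sketched as follows. This is only an illustration, not the authors' implementation: it assumes that word and class vectors already live in the same embedding space and that closeness is measured by cosine similarity, which the abstract does not specify. The function names, the toy vocabulary, and the two-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_words_for_class(class_vector, word_vectors):
    """Sort words from closest to farthest relative to a class vector.

    word_vectors maps each word to its embedding (a list of floats).
    Returns a list of (word, similarity) pairs, most similar first.
    """
    scored = [
        (word, cosine_similarity(class_vector, vector))
        for word, vector in word_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings, made up for illustration (not from the paper).
word_vectors = {
    "excellent": [0.9, 0.1],
    "boring": [-0.7, 0.2],
    "plot": [0.1, 0.9],
    "awful": [-0.9, -0.1],
}
positive_class_vector = [1.0, 0.0]  # assumed vector for a "positive" class label

ranking = rank_words_for_class(positive_class_vector, word_vectors)
closest_word = ranking[0][0]
farthest_word = ranking[-1][0]
print("closest:", closest_word, "| farthest:", farthest_word)
```

With real embeddings, the head and tail of `ranking` would be the candidate "close" and "far" words to inspect for each class label.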
first_indexed 2024-12-16T15:59:30Z
format Article
id doaj.art-9c771438c0fa4df1b506e18cbea44d4b
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-16T15:59:30Z
publishDate 2019-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-9c771438c0fa4df1b506e18cbea44d4b
2022-12-21T22:25:30Z
eng
IEEE
IEEE Access
2169-3536
2019-01-01
Volume 7, pages 29051-29064
DOI 10.1109/ACCESS.2019.2901933
8653834
Supervised Paragraph Vector: Distributed Representations of Words, Documents and Class Labels
Eunjeong L. Park (NAVER, Seongnam, South Korea)
Sungzoon Cho (Department of Industrial Engineering, Seoul National University, Seoul, South Korea)
Pilsung Kang (School of Industrial Management Engineering, Korea University, Seoul, South Korea; https://orcid.org/0000-0001-7663-3937)
https://ieeexplore.ieee.org/document/8653834/
title Supervised Paragraph Vector: Distributed Representations of Words, Documents and Class Labels
topic Class label
distributed representations
representation learning
document embedding
word embedding
url https://ieeexplore.ieee.org/document/8653834/