RoBERTa-Based Keyword Extraction from Small Number of Korean Documents

Keyword extraction is the task of identifying essential words in a lengthy document. This process is primarily executed through supervised keyword extraction. In instances where the dataset is limited in size, a classification-based approach is typically employed. Therefore, this paper introduces a...

Bibliographic Details
Main Authors:	So-Eon Kim, Jun-Beom Lee, Gyu-Min Park, Seok-Man Sohn, Seong-Bae Park
Format:	Article
Language:	English
Published:	MDPI AG 2023-11-01
Series:	Electronics
Subjects:	keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset
Online Access:	https://www.mdpi.com/2079-9292/12/22/4560

collection	DOAJ
description	Keyword extraction is the task of identifying essential words in a lengthy document. It is performed primarily through supervised learning, and when the dataset is small, a classification-based approach is typically employed. This paper therefore introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability that each token in the document is a keyword, and the decision rule determines from these probabilities whether each token is a keyword. However, training the proposed model with a small dataset presents two challenges. One is the case in which none of the tokens in a document is a keyword, and the other is that a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. Experiments show that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data. The efficacy of the heuristics is further validated through an ablation study. In summary, the proposed heuristics are effective for developing a supervised keyword extractor with a small dataset.
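The estimator-plus-decision-rule pipeline described in the abstract can be illustrated with a minimal Python sketch. The record does not state the paper's actual decision rule or its two heuristics, so everything below is an assumption made for illustration: the function name `decide_keywords`, the 0.5 threshold, the mean-probability rule for words split into several tokens, and the fallback that keeps the best-scoring word when nothing passes the threshold. The per-token probabilities are taken as given; in practice they would come from a classifier head on top of RoBERTa.

```python
from collections import defaultdict


def decide_keywords(tokens, word_ids, probs, threshold=0.5):
    """Return the words judged to be keywords (illustrative sketch only).

    tokens    : subword strings, e.g. produced by a RoBERTa tokenizer
    word_ids  : index of the word each token belongs to
    probs     : per-token keyword probability from the keyword estimator
    threshold : hypothetical cut-off, not a value taken from the paper
    """
    # Group tokens by the word they belong to.
    words = defaultdict(list)
    for tok, wid, p in zip(tokens, word_ids, probs):
        words[wid].append((tok, p))

    # Assumed word-level rule: score a word by the mean probability of its
    # tokens, so a word mixing keyword and non-keyword tokens gets one label.
    scores = {
        wid: sum(p for _, p in parts) / len(parts)
        for wid, parts in words.items()
    }
    selected = [wid for wid, s in scores.items() if s >= threshold]

    # Assumed fallback: if no word passes, keep the single best-scoring word
    # so that the document is never left without a keyword.
    if not selected and scores:
        selected = [max(scores, key=scores.get)]

    # Rebuild each selected word from its tokens, in document order.
    return ["".join(tok for tok, _ in words[wid]) for wid in sorted(selected)]
```

For example, calling `decide_keywords(["key", "word", "the"], [0, 0, 1], [0.9, 0.3, 0.1])` groups the first two tokens into one word and labels that word as a whole, which is the situation the abstract's second challenge refers to.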
first_indexed	2024-03-09T16:52:34Z
format	Article
id	doaj.art-2cc4739a1b8944c1ac433ab49bab7e91
institution	Directory Open Access Journal
issn	2079-9292
language	English
last_indexed	2024-03-09T16:52:34Z
publishDate	2023-11-01
publisher	MDPI AG
record_format	Article
series	Electronics
spelling	doaj.art-2cc4739a1b8944c1ac433ab49bab7e91; 2023-11-24T14:38:59Z; eng; MDPI AG; Electronics; 2079-9292; 2023-11-01; 12; 22; 4560; 10.3390/electronics12224560; RoBERTa-Based Keyword Extraction from Small Number of Korean Documents; So-Eon Kim (School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea); Jun-Beom Lee (School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea); Gyu-Min Park (School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea); Seok-Man Sohn (Korea Electric Power Research Institute, Daejeon 34056, Republic of Korea); Seong-Bae Park (School of Computing, Kyung Hee University, Yongin 17104, Republic of Korea); https://www.mdpi.com/2079-9292/12/22/4560; keyword extraction; sequence labeling; post-processing; RoBERTa; learning with small dataset

Similar Items