Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers

With the accelerating development of science and technology, the number of academic papers published in various fields is increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers who develop new technologies by identifying knowledge regarding th...

Bibliographic Details
Main Authors:	Hyesoo Kong, Hwamook Yoon, Jaewook Seol, Mihwan Hyun, Hyejin Lee, Soonyoung Kim, Wonjun Choi
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	BERT, corpus construction, metadata extraction, transfer learning
Online Access:	https://ieeexplore.ieee.org/document/10003205/

_version_	1797959601360994304
author	Hyesoo Kong, Hwamook Yoon, Jaewook Seol, Mihwan Hyun, Hyejin Lee, Soonyoung Kim, Wonjun Choi
author_sort	Hyesoo Kong
collection	DOAJ
description	With the accelerating development of science and technology, the number of academic papers published in various fields is increasing rapidly. Academic papers, especially in science and technology fields, are a crucial medium for researchers who develop new technologies by identifying knowledge regarding the latest technological trends and conduct derivative studies in science and technology. Therefore, the continual collection of extensive academic papers, structuring of metadata, and construction of databases are significant tasks. However, research on automatic metadata extraction from Korean papers is not currently being actively conducted owing to insufficient Korean training data. We automatically constructed the largest labeled corpus in South Korea to date from 315,320 PDF papers belonging to 503 Korean academic journals; this labeled corpus can be used to train models that automatically extract 12 metadata types from PDF papers. The labeled corpus is available at https://doi.org/10.23057/48. Moreover, we developed an inspection process and guidelines for the automatically constructed data and performed a full inspection of the validation and testing data. The reliability of the inspected data was verified by measuring inter-annotator agreement. Using our corpus, we trained and evaluated a BERT-based transfer learning model to verify its reliability. Furthermore, we proposed new training methods that can improve metadata extraction performance for Korean papers, and through these methods we developed the KorSciBERT-ME-J and KorSciBERT-ME-J+C models. KorSciBERT-ME-J showed the highest performance, with an F1 score of 99.36%, as well as robust performance in automatic metadata extraction from Korean academic papers in various formats.
first_indexed	2024-04-11T00:35:01Z
format	Article
id	doaj.art-859f68b7f15c45ff830a8a75b6c301db
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-04-11T00:35:01Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-859f68b7f15c45ff830a8a75b6c301db | 2023-01-07T00:00:36Z | eng | IEEE | IEEE Access | 2169-3536 | 2023-01-01 | 11 | 825 | 838 | 10.1109/ACCESS.2022.3233228 | 10003205 | Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers | Hyesoo Kong (0; https://orcid.org/0000-0002-3742-7433), Hwamook Yoon (1), Jaewook Seol (2), Mihwan Hyun (3), Hyejin Lee (4), Soonyoung Kim (5), Wonjun Choi (6; https://orcid.org/0000-0002-9711-0091) | Digital Curation Center, Korea Institute of Science and Technology Information, Daejeon, Republic of Korea (authors 0-6) | https://ieeexplore.ieee.org/document/10003205/ | BERT | corpus construction | metadata extraction | transfer learning
title	Annotated Open Corpus Construction and BERT-Based Approach for Automatic Metadata Extraction From Korean Academic Papers
topic	BERT, corpus construction, metadata extraction, transfer learning
url	https://ieeexplore.ieee.org/document/10003205/
