Domain-Specific Language Model Pre-Training for Korean Tax Law Classification

Owing to frequent amendments and growing complexity, most taxpayers lack the knowledge of tax laws they need in everyday life. To use tax counseling services on the internet, a person must first select the category of tax law corresponding to their question; however, a layperson without prior knowledge of tax laws may not know which category to select in the first place. A model that automatically classifies the categories of tax laws is therefore needed. BERT-based models are frequently used for text classification, but they are generally trained on open-domain text and often show degraded performance on domain-specific terminology such as that of tax law. Furthermore, because BERT is a large-scale model, a significant amount of time is required to train it. To address these issues, this study proposes Korean tax law-BERT (KTL-BERT) for the automatic classification of the categories of tax questions. For KTL-BERT, a new language model was pre-trained from scratch, based on DistilRoBERTa and using a static masking method, and was then fine-tuned to classify five categories of tax law. A total of 327,735 tax law questions were used to verify the performance of the proposed KTL-BERT. The F1-score of KTL-BERT was approximately 91.06%, higher than that of the benchmark models by approximately 1.07%-15.46%, and its training speed was approximately 0.89%-56.07% higher.
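
Since the abstract outlines a concrete two-stage recipe (pre-training a DistilRoBERTa-sized masked language model from scratch on a domain corpus, then fine-tuning it as a five-category classifier), a minimal sketch may help orient readers. The snippet below uses the Hugging Face transformers library; it is not the authors' code, and the stand-in tokenizer, vocabulary size, and hyperparameters are assumptions.

```python
# Minimal sketch (assumed, not the authors' released code) of the two-stage
# recipe described in the abstract: (1) pre-train a DistilRoBERTa-sized masked
# language model from scratch on a Korean tax-law corpus, (2) fine-tune it to
# classify tax questions into five categories. Names and hyperparameters are
# illustrative assumptions.
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    TrainingArguments,
)

# Stand-in tokenizer; in practice a tokenizer trained on the Korean tax-law
# corpus itself would replace it.
tokenizer = RobertaTokenizerFast.from_pretrained("distilroberta-base")

# Stage 1: masked-LM pre-training from scratch with a DistilRoBERTa-sized
# configuration (6 layers, i.e. half the depth of RoBERTa-base).
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,   # would match the domain tokenizer
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=514,       # RoBERTa position-embedding convention
)
mlm_model = RobertaForMaskedLM(config)

# This collator masks 15% of tokens as each batch is built (dynamic masking).
# To emulate the static masking the paper applies, the corpus would instead be
# masked once during preprocessing and the fixed masked copies reused.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="ktl-bert-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)
# Trainer(model=mlm_model, args=args, data_collator=collator,
#         train_dataset=tokenized_tax_corpus).train()

# Stage 2: reload the pre-trained checkpoint with a 5-way classification head
# and fine-tune on the labeled tax questions, e.g.:
# clf = RobertaForSequenceClassification.from_pretrained("ktl-bert-mlm", num_labels=5)
```

The 6-layer DistilRoBERTa-sized configuration, rather than a full 12-layer BERT, is presumably where the reported training-speed advantage comes from; the classification stage simply reloads the stage-1 checkpoint with a 5-way head.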

Bibliographic Details
Main Authors: Yeong Hyeon Gu, Xianghua Piao (ORCID: 0000-0002-2859-1661), Helin Yin, Dong Jin (ORCID: 0000-0003-1131-6396), Ri Zheng (ORCID: 0000-0002-9419-068X), Seong Joon Yoo
Author Affiliation: Department of Computer Science and Engineering, Sejong University, Seoul, South Korea (all authors)
Format: Article
Language: English
Published: IEEE, 2022-01-01
Series: IEEE Access, Vol. 10, pp. 46342-46353
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3164098
Subjects: BERT; domain-specific; Korean tax law; pre-trained language model; text classification
Online Access:https://ieeexplore.ieee.org/document/9745941/