Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain

Abstract: Chinese word segmentation (CWS) is the foundational work of geological report text mining and has an important influence on various tasks, such as named entity recognition and relation extraction. In recent years, the accuracy of the domain‐general CWS model has been limited by the domain a...


Bibliographic Details
Main Authors: Wenjia Li, Kai Ma, Qinjun Qiu, Liang Wu, Zhong Xie, Sanfeng Li, Siqiong Chen
Format: Article
Language: English
Published: American Geophysical Union (AGU) 2021-06-01
Series: Earth and Space Science
Subjects: geological report; Chinese word segmentation; self‐learning; BERT; domain ontology
Online Access: https://doi.org/10.1029/2021EA001673
_version_ 1818589023526977536
author Wenjia Li
Kai Ma
Qinjun Qiu
Liang Wu
Zhong Xie
Sanfeng Li
Siqiong Chen
author_sort Wenjia Li
collection DOAJ
description Abstract: Chinese word segmentation (CWS) is the foundation of geological report text mining and strongly influences downstream tasks such as named entity recognition and relation extraction. In recent years, the accuracy of domain‐general CWS models has been limited by the domain and scale of the training corpus, particularly for Chinese geological texts. Training these CWS models also requires a large amount of manually annotated data, which takes considerable time and effort. When these existing models and methods are applied directly to the geoscience domain, segmentation accuracy drops dramatically. To address this problem, we pretrain Bidirectional Encoder Representations from Transformers (BERT), which can leverage unlabeled domain‐specific knowledge, on unlabeled Chinese geological text and then feed its output into a bidirectional long short‐term memory and conditional random field (BiLSTM‐CRF) model to extract text features. Finally, the predicted tags are decoded by the CRF. The experimental results show that the F1 score of the proposed model reaches 96.2% on the constructed test set of geological texts. Additionally, experiments show that the proposed model achieves performance comparable to that of other state‐of‐the‐art models, and that the proposed cyclic self‐learning strategy can be extended to other domains.
first_indexed 2024-12-16T09:34:03Z
format Article
id doaj.art-00f1352c496a4881a391cab2aa586abd
institution Directory Open Access Journal
issn 2333-5084
language English
last_indexed 2024-12-16T09:34:03Z
publishDate 2021-06-01
publisher American Geophysical Union (AGU)
record_format Article
series Earth and Space Science
spelling doaj.art-00f1352c496a4881a391cab2aa586abd; 2022-12-21T22:36:27Z; eng; American Geophysical Union (AGU); Earth and Space Science; 2333-5084; 2021-06-01; 8; 6; n/a; n/a; 10.1029/2021EA001673
Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain
Wenjia Li (0); Kai Ma (1); Qinjun Qiu (2); Liang Wu (3); Zhong Xie (4); Sanfeng Li (5); Siqiong Chen (6)
(0) National Engineering Research Center for GIS, Wuhan, China
(1) College of Computer and Information Technology, China Three Gorges University, Yichang, China
(2) National Engineering Research Center for GIS, Wuhan, China
(3) National Engineering Research Center for GIS, Wuhan, China
(4) National Engineering Research Center for GIS, Wuhan, China
(5) Wuhan Zondy Cyber Science & Technology Co. Ltd., Wuhan, China
(6) National Engineering Research Center for GIS, Wuhan, China
https://doi.org/10.1029/2021EA001673
geological report; Chinese word segmentation; self‐learning; BERT; domain ontology
title Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain
title_sort chinese word segmentation based on self learning model and geological knowledge for the geoscience domain
topic geological report
Chinese word segmentation
self‐learning
BERT
domain ontology
url https://doi.org/10.1029/2021EA001673
work_keys_str_mv AT wenjiali chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT kaima chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT qinjunqiu chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT liangwu chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT zhongxie chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT sanfengli chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
AT siqiongchen chinesewordsegmentationbasedonselflearningmodelandgeologicalknowledgeforthegeosciencedomain
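
Note on the abstract: it states that predicted tags are decoded into a segmentation and that the result is scored by F1. The sketch below illustrates one common way such a step might look. It is not the authors' implementation: the B/M/E/S tag scheme, the function names, and the example sentence are all assumptions introduced here for illustration, and the record does not say which tag set or scoring script the paper uses.

```python
# Hypothetical sketch: the B/M/E/S tag scheme and every name below are
# assumptions for illustration; none of them come from the paper.

def tags_to_words(chars, tags):
    """Group characters into words from per-character B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts, so flush any unfinished word first.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            # The word ends on this character.
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def word_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level precision, recall, and F1 over exactly matching spans."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    chars = list("花岗岩分布广泛")  # made-up example sentence
    gold = tags_to_words(chars, list("BMEBEBE"))
    pred = tags_to_words(chars, list("BESBEBE"))
    print(gold, pred, segmentation_f1(gold, pred))
```

A word counts as correct only when both of its boundaries match the reference, so one misplaced boundary inside a domain term lowers precision and recall together.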