Automatic Information Extraction in the Third-Generation Semiconductor Materials Domain Based on DKNet and MANet

Third-generation semiconductor materials (TGSMs) are a frontier scientific domain in which researchers must consult extensive literature for entity information on materials, devices, preparation methods, and experimental performance, and sort out the complex relations among them. However,...

Bibliographic Details
Main Authors: Xiaobo Jiang, Kun He, Borui Yang
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
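Subjects: Automatic information extraction; entity recognition; relation extraction; third-generation semiconductor materials; gated information fusion; multi-aspect attention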
Online Access: https://ieeexplore.ieee.org/document/9733892/
author Xiaobo Jiang
Kun He
Borui Yang
collection DOAJ
description Third-generation semiconductor materials (TGSMs) are a frontier scientific domain in which researchers must consult extensive literature for entity information on materials, devices, preparation methods, and experimental performance, and sort out the complex relations among them. However, the explosion of relevant papers has far exceeded researchers’ reading capacity. In this article, automatic information extraction for the TGSM field is performed using entity recognition (ER) and relation extraction (RE) techniques. First, corpora for ER and RE in this field are created. Second, to handle the complexity of the entities, a neural network using domain knowledge (DKNet) is proposed to improve ER performance. It uses the keyword sequence of each entity type as prior knowledge, adds a dedicated embedding to encode entity categories, and then combines the prior knowledge and encoded vectors with the context through a gated information fusion module to assist recognition. To address the dependence of entity relations on indicative words, a multi-aspect attention-based network model (MANet) is proposed to strengthen attention to relation-indicative words, thereby improving RE performance. Finally, F1 scores of 74.5 and 85.9 were achieved on the created ER and RE test sets, respectively, outperforming other advanced models by 3.4 to 10.1 points, which is the best performance achieved for automatic information extraction in the TGSM field.
first_indexed 2024-04-12T22:17:10Z
format Article
id doaj.art-3330347bbeba413f9d01762d3d3b4d4f
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-12T22:17:10Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-3330347bbeba413f9d01762d3d3b4d4f
2022-12-22T03:14:30Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
Volume 10, pages 29367-29376
DOI 10.1109/ACCESS.2022.3159338
IEEE document 9733892
Xiaobo Jiang (0), https://orcid.org/0000-0003-4865-8613
Kun He (1), https://orcid.org/0000-0002-2951-6102
Borui Yang (2)
(0) School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
(1) School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
(2) School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
title Automatic Information Extraction in the Third-Generation Semiconductor Materials Domain Based on DKNet and MANet
topic Automatic information extraction
entity recognition
relation extraction
third-generation semiconductor materials
gated information fusion
multi-aspect attention
url https://ieeexplore.ieee.org/document/9733892/