CHARM: An Improved Method for Chinese Precoding and Character-Level Embedding

The numerical transformation of text is a key step in natural language processing, and the word embedding model is its most representative technique. However, word embedding models represent out-of-vocabulary and low-frequency words poorly, a shortcoming that character-level embedding models compensate for. Most Chinese character-level models use individual character features such as strokes, radicals, and pinyin in isolation, or exploit only shallow correlations between a few features, while the inherent correlations among pronunciation, glyph, stroke order, and word frequency are not fully utilized. Based on statistical analyses of various features of Chinese characters, this paper proposes a precoding method built on the Character Helix Alternative Representation Model (CHARM), which realizes a reversible mapping from Chinese characters or words to English-like sequences. The advantage of this method is verified on three tasks: text classification, named entity recognition, and machine translation. Experimental results on several test sets show that the model performs well and that its output can serve as a character-level replacement corpus for the original Chinese text.
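
To make the idea of precoding concrete, the following is a minimal Python sketch of a reversible character-to-sequence mapping. It is not the authors' CHARM model: the lookup table, the code format (pinyin, Kangxi radical index, and stroke count), and all names are illustrative assumptions only; CHARM itself derives its codes from statistical analyses of character features.

# Illustrative only: a toy reversible mapping from Chinese characters to
# English-like codes, loosely in the spirit of precoding. The table below is a
# hypothetical stand-in; the real CHARM codes are constructed differently.
CHAR_TABLE = {
    "猫": ("mao1", 94, 11),   # (pinyin with tone, Kangxi radical index, stroke count)
    "狗": ("gou3", 94, 8),
    "鱼": ("yu2", 195, 8),
}

# Forward and inverse code books; reversibility requires every code to be unique.
ENCODE = {ch: f"{py}r{rad}s{n}" for ch, (py, rad, n) in CHAR_TABLE.items()}
DECODE = {code: ch for ch, code in ENCODE.items()}
assert len(DECODE) == len(ENCODE), "codes must be unique for a reversible mapping"

def encode(text):
    # Map each known character to its code; leave unknown symbols unchanged.
    return " ".join(ENCODE.get(ch, ch) for ch in text)

def decode(seq):
    # Invert encode() token by token.
    return "".join(DECODE.get(tok, tok) for tok in seq.split(" "))

if __name__ == "__main__":
    sentence = "猫狗鱼"
    coded = encode(sentence)          # "mao1r94s11 gou3r94s8 yu2r195s8"
    assert decode(coded) == sentence  # round trip demonstrates reversibility
    print(coded)

Per the abstract, the actual CHARM codes also reflect stroke order and word frequency, which a full implementation would fold into the code format.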

Bibliographic Details
Main Authors: Xiaoming Fan, Tuo Shi, Jiayan Cai, Binjun Wang
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects: Chinese text; character level; representation; precoding
Online Access: https://ieeexplore.ieee.org/document/9536576/
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3112190 (IEEE document 9536576)
Volume/Pages: IEEE Access, vol. 9, pp. 129539-129551
Collection: DOAJ (Directory of Open Access Journals)
Author Affiliations:
Xiaoming Fan (ORCID 0000-0003-2933-8884), School of Information and Cyber Security, People’s Public Security University of China, Beijing, China
Tuo Shi, Beijing Police College, Beijing, China
Jiayan Cai (ORCID 0000-0002-9276-7385), Beijing Police College, Beijing, China
Binjun Wang, School of Information and Cyber Security, People’s Public Security University of China, Beijing, China