MLSL-Spell: Chinese Spelling Check Based on Multi-Label Annotation

Chinese spelling errors are commonplace in our daily lives, which might be caused by input methods, optical character recognition, or speech recognition. Due to Chinese characters’ phonetic and visual similarities, the Chinese spelling check (CSC) is a very challenging task. However, the existing CS...

Full description

Bibliographic Details
Main Authors:	Liming Jiang, Xingfa Shen, Qingbiao Zhao, Jian Yao
Format:	Article
Language:	English
Published:	MDPI AG 2024-03-01
Series:	Applied Sciences
Subjects:	Chinese spelling check (CSC) sequence labeling Masked Language Model (MLM) XGBoost
Online Access:	https://www.mdpi.com/2076-3417/14/6/2541

Description
Summary:	Chinese spelling errors are commonplace in our daily lives, which might be caused by input methods, optical character recognition, or speech recognition. Due to Chinese characters’ phonetic and visual similarities, the Chinese spelling check (CSC) is a very challenging task. However, the existing CSC solutions cannot achieve good spelling check performance since they often fail to fully extract the contextual information and Pinyin information. In this paper, we propose a novel CSC framework based on multi-label annotation (MLSL-Spell), consisting of two basic phases: spelling detection and correction. In the spelling detection phase, MLSL-Spell uses the fusion vectors of both character-based pre-trained context vectors and Pinyin vectors and adopts the sequence labeling method to explicitly label the type of misspelled characters. In the spelling correction phase, MLSL-Spell uses Masked Language Mode (MLM) model to generate candidate characters, then performs corresponding screenings according to the error types, and finally screens out the correct characters through the XGBoost classifier. Experiments show that the MLSL-Spell model outperforms the benchmark model. On SIGHAN 2013 dataset, the spelling detection F1 score of MLSL-Spell is 18.3% higher than that of the pointer network (PN) model, and the spelling correction F1 score is 10.9% higher. On SIGHAN 2015 dataset, the spelling detection F1 score of MLSL-Spell is 11% higher than that of Bert and 15.7% higher than that of the PN model. And the spelling correction F1 of MLSL-Spell score is 6.8% higher than that of PN model.
ISSN:	2076-3417

MLSL-Spell: Chinese Spelling Check Based on Multi-Label Annotation

Similar Items