Sub-Character Tokenization for Chinese Pretrained Language Models

Abstract
Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.
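
The abstract describes a two-step pipeline: transliterate each character by glyph or pronunciation, then learn a sub-word vocabulary over the encoded text. The sketch below illustrates the pronunciation-based variant only and is a toy under stated assumptions, not the authors' released implementation (that is at the GitHub link above): it assumes the third-party packages pypinyin and HuggingFace tokenizers, substitutes plain BPE for whatever sub-word segmenter the paper uses, drops tone information, and the "_" separator and vocabulary size are arbitrary illustrative choices.

```python
# Minimal sketch of pronunciation-based SubChar tokenization (assumptions only;
# not the paper's released code). Requires: pypinyin, tokenizers.
from pypinyin import lazy_pinyin
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def encode_pronunciation(text: str, sep: str = "_") -> str:
    """Transliterate each Chinese character into toneless pinyin and join the
    syllables with a separator so character boundaries remain visible."""
    return sep.join(lazy_pinyin(text))


# Homophone robustness: characters sharing a pronunciation map to the same
# transliteration, so a homophone typo yields an identical encoded sequence.
assert encode_pronunciation("他们") == encode_pronunciation("她们")  # both "ta_men"

# Vocabulary construction: learn a sub-word vocabulary over the *encoded* text.
# The underscore separator is a word character, so the Whitespace pre-tokenizer
# keeps each encoded sentence as a single unit and BPE merges can span several
# original characters, which is how the encoded inputs end up shorter.
toy_corpus = ["今天天气很好", "我们去公园散步", "他喜欢自然语言处理"]
encoded_corpus = [encode_pronunciation(line) for line in toy_corpus]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(encoded_corpus, trainer)

# Tokenizing new input: encode to pinyin first, then apply the learned merges.
print(tokenizer.encode(encode_pronunciation("我们去公园")).tokens)
```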

Bibliographic Details
Main Authors: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun
Affiliations: NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China (Si, Zhang, Chen, Qi, X. Wang, Z. Liu, Sun); Huawei Noah’s Ark Lab, Hong Kong, China (Y. Wang, Q. Liu)
Format: Article
Language: English
Published: The MIT Press, 2023-05-01
Series: Transactions of the Association for Computational Linguistics, vol. 11, pp. 469–487
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00560
Collection: Directory of Open Access Journals (DOAJ)
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00560/116047/Sub-Character-Tokenization-for-Chinese-Pretrained