Sub-Character Tokenization for Chinese Pretrained Language Models
Main Authors: | Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun |
Format: | Article |
Language: | English |
Published: | The MIT Press, 2023-05-01 |
Series: | Transactions of the Association for Computational Linguistics |
Online Access: | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00560/116047/Sub-Character-Tokenization-for-Chinese-Pretrained |
collection | DOAJ |
description |
Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work. |
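The pipeline described above (transliterate each character by glyph or pronunciation, then learn a sub-word vocabulary over the encoded text) can be illustrated with a small toy example. The Python sketch below shows the pronunciation-based variant; the `PINYIN` map, the `#` boundary marker, and the hand-picked `VOCAB` are illustrative assumptions made for this sketch, not the paper's actual resources (the released tokenizers at the linked repository presumably cover the full character set and construct their vocabularies from the encoded corpus).

```python
# Minimal sketch of pronunciation-based sub-character tokenization.
# The PINYIN map, the "#" boundary marker, and VOCAB below are toy
# stand-ins chosen for this illustration; they are not the resources
# released with the paper.

# Toy pronunciation table: character -> pinyin with a tone number.
# 他 ("he") and 她 ("she") are homophones, so both map to "ta1".
PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "是": "shi4",
    "学": "xue2",
    "生": "sheng1",
}

# Toy sub-word vocabulary over the transliterated text, mixing whole
# syllables and smaller fragments.
VOCAB = ["ta1#", "shi4#", "xue2#", "sheng1#", "sh", "eng", "1#", "#"]


def encode(text: str) -> str:
    """Transliterate each character and append a boundary marker."""
    return "".join(PINYIN.get(ch, ch) + "#" for ch in text)


def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation of the encoded string."""
    encoded = encode(text)
    tokens, i = [], 0
    while i < len(encoded):
        # Pick the longest vocabulary entry matching at position i,
        # falling back to a single symbol for out-of-vocabulary input.
        piece = max(
            (v for v in VOCAB if encoded.startswith(v, i)),
            key=len,
            default=encoded[i],
        )
        tokens.append(piece)
        i += len(piece)
    return tokens


if __name__ == "__main__":
    # A homophone typo (她 written as 他) leaves the tokens unchanged.
    print(tokenize("他是学生"))  # ['ta1#', 'shi4#', 'xue2#', 'sheng1#']
    print(tokenize("她是学生"))  # identical output
```

Running the script prints the same token sequence for both inputs, which is the homophone-robustness property described above. With a vocabulary learned from data rather than the toy `VOCAB`, frequent syllable sequences can additionally be merged into single tokens, which is presumably where the shorter sequences mentioned in the abstract come from.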
issn | 2307-387X |
spelling | Transactions of the Association for Computational Linguistics, vol. 11, pp. 469–487, The MIT Press, 2023-05-01. DOI: 10.1162/tacl_a_00560. Affiliations: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, and Maosong Sun are with the NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China; Yasheng Wang and Qun Liu are with Huawei Noah’s Ark Lab, Hong Kong, China. |