Sub-Character Tokenization for Chinese Pretrained Language Models
Abstract: Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information...
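For readers unfamiliar with the idea named in the title, the sketch below illustrates one possible form of sub-character encoding: romanizing each Chinese character into pinyin units before applying a standard subword tokenizer. This is only an illustration under stated assumptions (it uses the third-party pypinyin package and is not the paper's exact pipeline, which also considers glyph-based encodings).

```python
# Illustrative sketch only: map each Chinese character to a pinyin
# sub-character unit; a subword tokenizer (e.g., BPE or unigram) could
# then be trained over these unit sequences instead of raw characters.
# Assumes the third-party pypinyin package; not the paper's exact method.
from pypinyin import lazy_pinyin

def to_subcharacter_units(text: str) -> list[str]:
    """Encode each character as a romanized (pinyin) unit."""
    return lazy_pinyin(text)

sentence = "预训练语言模型"
print(to_subcharacter_units(sentence))
# ['yu', 'xun', 'lian', 'yu', 'yan', 'mo', 'xing']
```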
Main Authors: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun
Format: Article
Language: English
Published: The MIT Press, 2023-05-01
Series: Transactions of the Association for Computational Linguistics
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00560/116047/Sub-Character-Tokenization-for-Chinese-Pretrained
Similar Items
- An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
  by: Hofmann, V, et al.
  Published: (2022)
- Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks
  by: Zhang, Zhengyan, et al.
  Published: (2024)
- Token and part-of-speech fusion for pretraining of transformers with application in automatic cyberbullying detection
  by: Nor Saiful Azam Bin Nor Azmi, et al.
  Published: (2025-03-01)
- Constructing Chinese taxonomy trees from understanding and generative pretrained language models
  by: Jianyu Guo, et al.
  Published: (2024-10-01)
- Geographic adaptation of pretrained language models
  by: Hofmann, V, et al.
  Published: (2024)