Sub-Character Tokenization for Chinese Pretrained Language Models
Abstract: Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system, where additional linguistic informati…
Main Authors:
Format: Article
Language: English
Published: The MIT Press, 2023-05-01
Series: Transactions of the Association for Computational Linguistics
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00560/116047/Sub-Character-Tokenization-for-Chinese-Pretrained