Sub-Character Tokenization for Chinese Pretrained Language Models
Abstract: Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information...
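For readers unfamiliar with the idea named in the title, the sketch below illustrates one possible form of sub-character encoding: romanizing each Chinese character into pinyin units before applying a standard subword tokenizer. This is only an illustration under stated assumptions (it uses the third-party pypinyin package and is not the paper's exact pipeline, which also considers glyph-based encodings).

```python
# Illustrative sketch only: map each Chinese character to a pinyin
# sub-character unit; a subword tokenizer (e.g., BPE or unigram) could
# then be trained over these unit sequences instead of raw characters.
# Assumes the third-party pypinyin package; not the paper's exact method.
from pypinyin import lazy_pinyin

def to_subcharacter_units(text: str) -> list[str]:
    """Encode each character as a romanized (pinyin) unit."""
    return lazy_pinyin(text)

sentence = "预训练语言模型"
print(to_subcharacter_units(sentence))
# ['yu', 'xun', 'lian', 'yu', 'yan', 'mo', 'xing']
```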
Main Authors: Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun
Format: Article
Language: English
Published: The MIT Press, 2023-05-01
Series: Transactions of the Association for Computational Linguistics
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00560/116047/Sub-Character-Tokenization-for-Chinese-Pretrained
Similar Items
- An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
  by: Hofmann, V, et al.
  Published: (2022)
- Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-level Backdoor Attacks
  by: Zhang, Zhengyan, et al.
  Published: (2024)
- Token and part-of-speech fusion for pretraining of transformers with application in automatic cyberbullying detection
  by: Nor Saiful Azam Bin Nor Azmi, et al.
  Published: (2025-03-01)
- Constructing Chinese taxonomy trees from understanding and generative pretrained language models
  by: Jianyu Guo, et al.
  Published: (2024-10-01)
- Geographic adaptation of pretrained language models
  by: Hofmann, V, et al.
  Published: (2024)