Summary: | Tibetan language models (TLMs) are central to Tibetan natural language processing. In this paper, we first observe that, unlike widely used languages, Tibetan contains many morphological verbs that rarely appear in natural sentences but play a key role in accurate text prediction. Because morphological verbs influence the tense and semantics of a sentence, a TLM needs to account for them. Existing methods usually ignore this property, which makes traditional training strategies less effective for building accurate and robust TLMs. Hence, we propose a morphological verb-aware TLM that combines offline learning via a character frequency reweighting strategy with online tuning of discriminative weights conditioned on morphological verbs. Compared with state-of-the-art methods, our method reduces both the perplexity and the character error rate on text prediction and automatic speech recognition (ASR) tasks.
|