Pre-trained transformer-based language models for Sundanese
Abstract: The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
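The pipeline the abstract describes (take a pre-trained Transformer, then fine-tune it on a downstream Sundanese text classification task) can be sketched with the Hugging Face transformers library. This is a minimal illustration, not the authors' published code: the model identifier below is a placeholder for one of the released Sundanese checkpoints (or for a multilingual baseline such as bert-base-multilingual-cased), and a two-label classification setup is assumed.

```python
# Minimal sketch (not the authors' code) of fine-tuning a pre-trained
# Transformer for Sundanese text classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder identifier: substitute one of the released Sundanese models,
# or a multilingual baseline such as "bert-base-multilingual-cased".
model_name = "path/to/sundanese-transformer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # assumed binary classification task
)

# Forward pass on one Sundanese sentence; an actual fine-tuning run would
# wrap this in a training loop (e.g., transformers.Trainer) over labeled data.
inputs = tokenizer("Wilujeng enjing, dunya!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, num_labels) -> torch.Size([1, 2])
```

The comparison the abstract reports would repeat this same fine-tuning procedure for each monolingual and multilingual checkpoint on the same labeled Sundanese dataset.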
Main Authors: | Wilson Wongso, Henry Lucky, Derwin Suhartono |
---|---|
Author Affiliation: | Computer Science Department, School of Computer Science, Bina Nusantara University (all three authors) |
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2022-04-01 |
Series: | Journal of Big Data |
ISSN: | 2196-1115 |
Collection: | DOAJ (Directory of Open Access Journals) |
DOAJ ID: | doaj.art-bea6f1e67800483b88b90f3ed3244d14 |
Subjects: | Sundanese Language; Transformers; Natural Language Understanding; Low-resource Language |
Online Access: | https://doi.org/10.1186/s40537-022-00590-7 |