Pre-trained transformer-based language models for Sundanese
Abstract: The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
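The pipeline the abstract describes (take a pre-trained Transformer, then fine-tune it on a downstream Sundanese text classification task) can be sketched with the Hugging Face transformers library. This is a minimal illustration, not the authors' published code: the model identifier below is a placeholder for one of the released Sundanese checkpoints (or for a multilingual baseline such as bert-base-multilingual-cased), and a two-label classification setup is assumed.

```python
# Minimal sketch (not the authors' code) of fine-tuning a pre-trained
# Transformer for Sundanese text classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder identifier: substitute one of the released Sundanese models,
# or a multilingual baseline such as "bert-base-multilingual-cased".
model_name = "path/to/sundanese-transformer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # assumed binary classification task
)

# Forward pass on one Sundanese sentence; an actual fine-tuning run would
# wrap this in a training loop (e.g., transformers.Trainer) over labeled data.
inputs = tokenizer("Wilujeng enjing, dunya!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, num_labels) -> torch.Size([1, 2])
```

The comparison the abstract reports would repeat this same fine-tuning procedure for each monolingual and multilingual checkpoint on the same labeled Sundanese dataset.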
Main Authors: | Wilson Wongso, Henry Lucky, Derwin Suhartono |
---|---|
Author Affiliation: | Computer Science Department, School of Computer Science, Bina Nusantara University (all three authors) |
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2022-04-01 |
Series: | Journal of Big Data |
ISSN: | 2196-1115 |
Collection: | DOAJ (Directory of Open Access Journals) |
DOAJ ID: | doaj.art-bea6f1e67800483b88b90f3ed3244d14 |
Subjects: | Sundanese Language; Transformers; Natural Language Understanding; Low-resource Language |
Online Access: | https://doi.org/10.1186/s40537-022-00590-7 |