Sequence-to-sequence pretraining for a less-resourced Slovenian language

Introduction: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modeling introduced in BERT, the T5 model introduced a more general training objective, namely sequence-to-sequence transformation, which more naturally fits text generation tasks. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages.

Methods: We trained two different-sized T5-type sequence-to-sequence models for the morphologically rich Slovene language, which has much fewer resources. We analyzed the behavior of the new models on 11 tasks: eight classification tasks (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task) and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, the multilingual mBART-50 model, and four encoder BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, the trilingual Croatian-Slovene-English BERT, and the monolingual Slovene RoBERTa model (SloBERTa).

Results: On the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, model size matters, and there is currently not enough training data in Slovene for successful pretraining of large models.

Discussion: While the results are obtained on Slovene, we believe they may generalize to other less-resourced languages for which such models will be built. We make the training and evaluation code, as well as the trained models, publicly available.

Bibliographic Details
Main Authors: Matej Ulčar, Marko Robnik-Šikonja
Format: Article
Language: English
Published: Frontiers Media S.A., 2023-03-01
Series: Frontiers in Artificial Intelligence
ISSN: 2624-8212
DOI: 10.3389/frai.2023.932519
Subjects: natural language processing; pretrained language models; sequence-to-sequence models; transformers; T5 model; Slovene
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2023.932519/full
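The abstract notes that the trained SloT5 models and the evaluation code are publicly available. As a rough illustration of the sequence-to-sequence setup described above, the sketch below loads such a T5-type model with the Hugging Face transformers library and generates Slovene output text from Slovene input; the model identifier used here is an assumption for illustration only and is not given in this record, so the authors' repository should be consulted for the actual names.

# Minimal sketch, assuming a released Slovene T5 checkpoint is published on the
# Hugging Face hub; "cjvt/t5-sl-small" is a hypothetical identifier for the smaller model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "cjvt/t5-sl-small"  # assumed model identifier, not taken from this record
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Sequence-to-sequence inference: the encoder reads the input text and the decoder
# generates new text, which is why one model type covers generative tasks such as
# summarization or text simplification after fine-tuning.
inputs = tokenizer("Primer slovenskega besedila za povzemanje.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))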