Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati

This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and the official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingua...

Bibliographic Details
Main Authors: Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer
Format: Article
Language: English
Published: Elsevier 2024-06-01
Series: Data in Brief
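Subjects: Natural Language Processing; Human Language Technology; Machine translation; Language corpora; Under-resourced languages; South African languages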
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924002944
author Tanja Gaustad
Cindy A. McKellar
Martin J. Puttkammer
collection DOAJ
description This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and the official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingual data for Siswati and was developed for use as training data for machine translation systems, specifically the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of Natural Language Processing (NLP) core technologies for Siswati. In addition, the data lends itself to corpus linguistic studies. The article describes how the data was collected, what types of text it contains, and what clean-up was done. It also provides an overview of the number of words contained in the datasets.
first_indexed 2024-04-24T11:38:46Z
format Article
id doaj.art-c96fa087c91740859792dba75c8220f0
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-04-24T11:38:46Z
publishDate 2024-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-c96fa087c91740859792dba75c8220f0 2024-04-10T04:29:05Z eng Elsevier Data in Brief 2352-3409 2024-06-01 54 110325
Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati
Tanja Gaustad (0), Cindy A. McKellar (1), Martin J. Puttkammer (2)
(0) Corresponding author; Centre for Text Technology, North-West University, South Africa
(1) Centre for Text Technology, North-West University, South Africa
(2) Centre for Text Technology, North-West University, South Africa
This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and the official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingual data for Siswati and was developed for use as training data for machine translation systems, specifically the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of Natural Language Processing (NLP) core technologies for Siswati. In addition, the data lends itself to corpus linguistic studies. The article describes how the data was collected, what types of text it contains, and what clean-up was done. It also provides an overview of the number of words contained in the datasets.
http://www.sciencedirect.com/science/article/pii/S2352340924002944
Natural Language Processing; Human Language Technology; Machine translation; Language corpora; Under-resourced languages; South African languages
title Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati
topic Natural Language Processing
Human Language Technology
Machine translation
Language corpora
Under-resourced languages
South African languages
url http://www.sciencedirect.com/science/article/pii/S2352340924002944