Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati
This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and an official language of Eswatini (together with English). The dataset contains parallel textual data for English and Siswati as well as monolingua...
Main Authors: | Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-06-01 |
Series: | Data in Brief |
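Subjects: | Natural Language Processing; Human Language Technology; Machine translation; Language corpora; Under-resourced languages; South African languages |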
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924002944 |
_version_ | 1797215985220976640 |
---|---|
author | Tanja Gaustad; Cindy A. McKellar; Martin J. Puttkammer |
author_sort | Tanja Gaustad |
collection | DOAJ |
description | This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and an official language of Eswatini (together with English). The dataset contains parallel textual data for English and Siswati as well as monolingual data for Siswati. It was developed as training data for machine translation systems, specifically for the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of core Natural Language Processing (NLP) technologies for Siswati. In addition, the data lends itself to corpus linguistic studies. The article describes how the data was collected, what types of text it contains, and what clean-up was done. It also provides an overview of the number of words contained in the datasets. |
first_indexed | 2024-04-24T11:38:46Z |
format | Article |
id | doaj.art-c96fa087c91740859792dba75c8220f0 |
institution | Directory of Open Access Journals |
issn | 2352-3409 |
language | English |
last_indexed | 2024-04-24T11:38:46Z |
publishDate | 2024-06-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-c96fa087c91740859792dba75c8220f0; 2024-04-10T04:29:05Z; eng; Elsevier; Data in Brief; 2352-3409; 2024-06-01; 54; 110325; Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati; Tanja Gaustad (corresponding author; Centre for Text Technology, North-West University, South Africa); Cindy A. McKellar (Centre for Text Technology, North-West University, South Africa); Martin J. Puttkammer (Centre for Text Technology, North-West University, South Africa); This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and an official language of Eswatini (together with English). The dataset contains parallel textual data for English and Siswati as well as monolingual data for Siswati. It was developed as training data for machine translation systems, specifically for the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of core Natural Language Processing (NLP) technologies for Siswati. In addition, the data lends itself to corpus linguistic studies. The article describes how the data was collected, what types of text it contains, and what clean-up was done. It also provides an overview of the number of words contained in the datasets.; http://www.sciencedirect.com/science/article/pii/S2352340924002944; Natural Language Processing; Human Language Technology; Machine translation; Language corpora; Under-resourced languages; South African languages |
title | Dataset for Siswati: Parallel textual data for English and Siswati and monolingual textual data for Siswati |
topic | Natural Language Processing; Human Language Technology; Machine translation; Language corpora; Under-resourced languages; South African languages |
url | http://www.sciencedirect.com/science/article/pii/S2352340924002944 |