Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation

In this study, we present improvements in N-best rescoring of code-switched speech achieved by n-gram augmentation as well as optimised pretraining of long short-term memory (LSTM) language models with larger corpora of out-of-domain monolingual text. Our investigation specifically considers the impact of the way in which multiple monolingual datasets are interleaved prior to being presented as input to a language model. In addition, we consider the application of large pretrained transformer-based architectures, and present the first investigation employing these models in English-Bantu code-switched speech recognition. Our experimental evaluation is performed on an under-resourced corpus of code-switched speech comprising four bilingual code-switched sub-corpora, each containing a Bantu language (isiZulu, isiXhosa, Sesotho, or Setswana) and English. We find in our experiments that, by combining n-gram augmentation with the optimised pretraining strategy, speech recognition errors are reduced for each individual bilingual pair by 3.51% absolute on average over the four corpora. Importantly, we find that even speech recognition at language boundaries improves by 1.14%, even though the additional data is monolingual. Utilising the augmented n-grams for lattice generation, we then contrast these improvements with those achieved after fine-tuning pretrained transformer-based models such as distilled GPT-2 and M-BERT. We find that, even though these language models have not been trained on any of our target languages, they can improve speech recognition performance even in zero-shot settings. After fine-tuning on in-domain data, these large architectures offer further improvements, achieving a 4.45% absolute decrease in overall speech recognition errors and a 3.52% improvement over language boundaries. Finally, a combination of the optimised LSTM and fine-tuned BERT models achieves a further gain of 0.47% absolute on average for three of the four language pairs compared to M-BERT. We conclude that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.
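
The abstract above describes rescoring first-pass N-best hypotheses with a neural language model, including a distilled GPT-2. The sketch below is only an illustration of that general idea, not the authors' implementation: the model name "distilgpt2", the interpolation weight, and the toy hypothesis list with made-up first-pass scores are all assumptions, and in the paper the neural score is combined with separate acoustic and (augmented) n-gram scores from lattice-based decoding rather than a single pre-combined score.

```python
# Minimal sketch of N-best rescoring with a pretrained causal LM.
# Assumptions: "distilgpt2" as the rescoring model, a single combined
# first-pass log score per hypothesis, and an illustrative weight lam.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()


def lm_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == inputs, the returned loss is the mean
        # cross-entropy over the predicted (shifted) tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)


def rescore(nbest, lam=0.5):
    """Pick the hypothesis maximising a log-linear mix of the
    first-pass score and the neural LM score."""
    return max(
        nbest,
        key=lambda h: (1 - lam) * h["first_pass"] + lam * lm_log_prob(h["text"]),
    )


# Toy 3-best list; the first-pass log scores are invented for illustration.
nbest = [
    {"text": "the weather is nice today", "first_pass": -42.1},
    {"text": "the whether is nice to day", "first_pass": -41.8},
    {"text": "the weather is nice to day", "first_pass": -42.5},
]
print(rescore(nbest)["text"])
```

In the same spirit, a masked LM such as M-BERT would be scored with a pseudo-log-likelihood (masking one token at a time) rather than the causal loss used here; the interpolation step stays the same.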

Bibliographic Details
Main Authors: Joshua Jansen van Vüren, Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, Stellenbosch 7600, South Africa)
Format: Article
Language: English
Published: MDPI AG 2022-09-01
Series: Languages
ISSN: 2226-471X
DOI: 10.3390/languages7030236
Subjects: code-switching; automatic speech recognition; low resource languages; language modelling
Online Access: https://www.mdpi.com/2226-471X/7/3/236