Improving N-Best Rescoring in Under-Resourced Code-Switched Speech Recognition Using Pretraining and Data Augmentation

In this study, we present improvements in N-best rescoring of code-switched speech achieved by n-gram augmentation as well as optimised pretraining of long short-term memory (LSTM) language models with larger corpora of out-of-domain monolingual text. Our investigation specifically considers the impact of the way in which multiple monolingual datasets are interleaved prior to being presented as input to a language model. In addition, we consider the application of large pretrained transformer-based architectures, and present the first investigation employing these models in English-Bantu code-switched speech recognition. Our experimental evaluation is performed on an under-resourced corpus of code-switched speech comprising four bilingual code-switched sub-corpora, each containing a Bantu language (isiZulu, isiXhosa, Sesotho, or Setswana) and English. We find in our experiments that, by combining n-gram augmentation with the optimised pretraining strategy, speech recognition errors are reduced for each individual bilingual pair by 3.51% absolute on average over the four corpora. Importantly, we find that even speech recognition at language boundaries improves by 1.14%, even though the additional data is monolingual. Utilising the augmented n-grams for lattice generation, we then contrast these improvements with those achieved after fine-tuning pretrained transformer-based models such as distilled GPT-2 and M-BERT. We find that, even though these language models have not been trained on any of our target languages, they can improve speech recognition performance even in zero-shot settings. After fine-tuning on in-domain data, these large architectures offer further improvements, achieving a 4.45% absolute decrease in overall speech recognition errors and a 3.52% improvement over language boundaries. Finally, a combination of the optimised LSTM and fine-tuned BERT models achieves a further gain of 0.47% absolute on average for three of the four language pairs compared to M-BERT. We conclude that the careful optimisation of the pretraining strategy used for neural network language models can offer worthwhile improvements in speech recognition accuracy even at language switches, and that much larger state-of-the-art architectures such as GPT-2 and M-BERT promise even further gains.
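
The abstract above describes rescoring first-pass N-best hypotheses with a neural language model, including a distilled GPT-2. The sketch below is only an illustration of that general idea, not the authors' implementation: the model name "distilgpt2", the interpolation weight, and the toy hypothesis list with made-up first-pass scores are all assumptions, and in the paper the neural score is combined with separate acoustic and (augmented) n-gram scores from lattice-based decoding rather than a single pre-combined score.

```python
# Minimal sketch of N-best rescoring with a pretrained causal LM.
# Assumptions: "distilgpt2" as the rescoring model, a single combined
# first-pass log score per hypothesis, and an illustrative weight lam.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()


def lm_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == inputs, the returned loss is the mean
        # cross-entropy over the predicted (shifted) tokens.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)


def rescore(nbest, lam=0.5):
    """Pick the hypothesis maximising a log-linear mix of the
    first-pass score and the neural LM score."""
    return max(
        nbest,
        key=lambda h: (1 - lam) * h["first_pass"] + lam * lm_log_prob(h["text"]),
    )


# Toy 3-best list; the first-pass log scores are invented for illustration.
nbest = [
    {"text": "the weather is nice today", "first_pass": -42.1},
    {"text": "the whether is nice to day", "first_pass": -41.8},
    {"text": "the weather is nice to day", "first_pass": -42.5},
]
print(rescore(nbest)["text"])
```

In the same spirit, a masked LM such as M-BERT would be scored with a pseudo-log-likelihood (masking one token at a time) rather than the causal loss used here; the interpolation step stays the same.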

Bibliographic Details
Main Authors: Joshua Jansen van Vüren, Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, Stellenbosch 7600, South Africa)
Format: Article
Language: English
Published: MDPI AG 2022-09-01
Series: Languages
ISSN: 2226-471X
DOI: 10.3390/languages7030236
Subjects: code-switching; automatic speech recognition; low resource languages; language modelling
Online Access: https://www.mdpi.com/2226-471X/7/3/236