Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning
Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words. This paper presents a deep learning ASR system optimization and evaluation for the European Portuguese lan...
Main Authors: | Eduardo Medeiros, Leonel Corado, Luís Rato, Paulo Quaresma, Pedro Salgueiro |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-04-01 |
Series: | Future Internet |
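Subjects: | machine learning, deep learning, deep neural networks, speech-to-text, automatic speech recognition, NVIDIA NeMo |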
Online Access: | https://www.mdpi.com/1999-5903/15/5/159 |
_version_ | 1827741200867655680 |
---|---|
author | Eduardo Medeiros; Leonel Corado; Luís Rato; Paulo Quaresma; Pedro Salgueiro |
author_sort | Eduardo Medeiros |
collection | DOAJ |
description | Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words. This paper presents a deep learning ASR system optimization and evaluation for the European Portuguese language. We present a pipeline composed of several stages for data acquisition, analysis, pre-processing, model creation, and evaluation. A transfer learning approach is proposed considering an English language-optimized model as starting point; a target composed of European Portuguese; and the contribution to the transfer process by a source from a different domain consisting of a multiple-variant Portuguese language dataset, essentially composed of Brazilian Portuguese. A domain adaptation was investigated between European Portuguese and mixed (mostly Brazilian) Portuguese. The proposed optimization evaluation used the NVIDIA NeMo framework implementing the QuartzNet15×5 architecture based on 1D time-channel separable convolutions. Following this transfer learning data-centric approach, the model was optimized, achieving a state-of-the-art word error rate (WER) of 0.0503. |
first_indexed | 2024-03-11T03:42:21Z |
format | Article |
id | doaj.art-26321c3692624b5f8095b7b4d265b1b3 |
institution | Directory Open Access Journal |
issn | 1999-5903 |
language | English |
last_indexed | 2024-03-11T03:42:21Z |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Future Internet |
spelling | doaj.art-26321c3692624b5f8095b7b4d265b1b3; 2023-11-18T01:26:53Z; eng; MDPI AG; Future Internet; ISSN 1999-5903; 2023-04-01; vol. 15, no. 5, art. 159; DOI 10.3390/fi15050159; Eduardo Medeiros, Leonel Corado, Luís Rato, Paulo Quaresma, Pedro Salgueiro (all: Escola de Ciências e Tecnologia, Universidade de Évora, 7000-671 Évora, Portugal) |
title | Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning |
topic | machine learning; deep learning; deep neural networks; speech-to-text; automatic speech recognition; NVIDIA NeMo |
url | https://www.mdpi.com/1999-5903/15/5/159 |
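
The description above reports a word error rate (WER) of 0.0503. As a reading aid, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions and are not taken from the paper or from NVIDIA NeMo; the paper's own text normalization and scoring code may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("o gato dorme aqui", "o gato come aqui"))
```

Under this definition, a WER of 0.0503 corresponds to roughly five word errors per hundred reference words.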