Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning

Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the respective sequence of words. This paper presents a deep learning ASR system optimization and evaluation for the European Portuguese lan...


Bibliographic Details
Main Authors: Eduardo Medeiros, Leonel Corado, Luís Rato, Paulo Quaresma, Pedro Salgueiro
Format: Article
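Language: English
Published: MDPI AG 2023-04-01
Series: Future Internet
Subjects: machine learning; deep learning; deep neural networks; speech-to-text; automatic speech recognition; NVIDIA NeMo
Online Access: https://www.mdpi.com/1999-5903/15/5/159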
author Eduardo Medeiros
Leonel Corado
Luís Rato
Paulo Quaresma
Pedro Salgueiro
collection DOAJ
description Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the corresponding sequence of words. This paper presents the optimization and evaluation of a deep learning ASR system for the European Portuguese language. We present a pipeline composed of several stages: data acquisition, analysis, pre-processing, model creation, and evaluation. A transfer learning approach is proposed that takes a model optimized for English as its starting point and European Portuguese as its target, with an additional contribution to the transfer process from a source in a different domain: a multiple-variant Portuguese dataset composed mainly of Brazilian Portuguese. Domain adaptation between European Portuguese and mixed (mostly Brazilian) Portuguese was investigated. The optimization and evaluation used the NVIDIA NeMo framework implementing the QuartzNet15×5 architecture, which is based on 1D time-channel separable convolutions. Following this data-centric transfer learning approach, the model was optimized, achieving a state-of-the-art word error rate (WER) of 0.0503.
first_indexed 2024-03-11T03:42:21Z
format Article
id doaj.art-26321c3692624b5f8095b7b4d265b1b3
institution Directory Open Access Journal
issn 1999-5903
language English
last_indexed 2024-03-11T03:42:21Z
publishDate 2023-04-01
publisher MDPI AG
record_format Article
series Future Internet
spelling doaj.art-26321c3692624b5f8095b7b4d265b1b3; 2023-11-18T01:26:53Z; eng; MDPI AG; Future Internet; ISSN 1999-5903; 2023-04-01; volume 15, issue 5, article 159; DOI 10.3390/fi15050159
Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning
Eduardo Medeiros, Leonel Corado, Luís Rato, Paulo Quaresma, Pedro Salgueiro (all authors: Escola de Ciências e Tecnologia, Universidade de Évora, 7000-671 Évora, Portugal)
https://www.mdpi.com/1999-5903/15/5/159
title Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning
topic machine learning
deep learning
deep neural networks
speech-to-text
automatic speech recognition
NVIDIA NeMo
url https://www.mdpi.com/1999-5903/15/5/159