Comparing Speaker Adaptation Methods for Visual Speech Recognition for Continuous Spanish

Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. However, although remarkable results have recently been reached in the field, this task remains an open research problem due to different challenges, such as visual ambiguities, the int...


Bibliographic Details
Main Authors:	David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Applied Sciences
Subjects:	visual speech recognition; speaker adaptation; fine-tuning; Adapters; Spanish language; end-to-end architectures
Online Access:	https://www.mdpi.com/2076-3417/13/11/6521

collection	DOAJ
description	Visual speech recognition (VSR) is a challenging task that aims to interpret speech based solely on lip movements. Although remarkable results have recently been reached in the field, the task remains an open research problem due to several challenges, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, these challenges can be alleviated when the task is approached from a speaker-dependent perspective. Our work focuses on the adaptation of end-to-end VSR systems to a specific speaker. Hence, we propose two different adaptation methods: one based on the conventional fine-tuning technique and one based on the so-called Adapters. We conduct a comparative study in terms of performance while considering deployment aspects such as training time and storage cost. Results on the Spanish LIP-RTVE database show that both methods obtain recognition rates comparable to the state of the art, even when only a limited amount of training data is available. Although it incurs a deterioration in performance, the Adapters-based method is a more scalable and efficient solution, reducing training time and storage cost by up to 80%.
first_indexed	2024-03-11T03:11:35Z
id	doaj.art-8285a445ac6748b9a21025c5c8721e39
institution	Directory of Open Access Journals
issn	2076-3417
last_indexed	2024-03-11T03:11:35Z
spelling	doaj.art-8285a445ac6748b9a21025c5c8721e39; 2023-11-18T07:33:14Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-05-01; vol. 13, no. 11, art. 6521; doi:10.3390/app13116521; David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos; affiliation (both authors): Pattern Recognition and Human Language Technologies Research Center, Universitat Politècnica de València, Camino de Vera, s/n, 46022 València, Spain


Similar Items