Speech recognition models are strong lip-readers

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, allowing a pre-trained ASR model to directly perform lip-reading. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. We achieve an impressive gain of 5.7 WER in the low data regime on the LRS3 benchmark over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos.
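The core training recipe in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation: the mapping network, the per-frame interface to the ASR model, and all names and shapes are simplified assumptions made for illustration. It only shows the central idea of a trainable lip-to-speech mapping feeding a frozen ASR model, trained by backpropagating the cross-entropy loss on the text labels through the frozen model.

# Minimal training sketch (not the authors' code). A trainable visual front-end maps a
# lip-feature sequence into the input space of a frozen, pre-trained ASR model, and the
# cross-entropy loss on the text labels is backpropagated through the frozen model into
# the front-end. Module names, shapes, and the toy per-frame "ASR model" are assumptions.
import torch
import torch.nn as nn

class LipToSpeechMapper(nn.Module):
    """Hypothetical cross-modal mapping: lip-frame features -> ASR-style speech features."""
    def __init__(self, lip_dim=512, speech_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lip_dim, 1024), nn.GELU(), nn.Linear(1024, speech_dim))

    def forward(self, lip_feats):          # (batch, time, lip_dim)
        return self.net(lip_feats)         # (batch, time, speech_dim)

def training_step(mapper, frozen_asr, lip_feats, text_tokens, optimizer):
    """One optimisation step: only the mapper is updated, the ASR model stays frozen."""
    pseudo_speech = mapper(lip_feats)                      # lip sequence -> speech-like sequence
    logits = frozen_asr(pseudo_speech)                     # (batch, time, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),               # flatten batch and time
        text_tokens.reshape(-1),                           # ground-truth text labels
    )
    optimizer.zero_grad()
    loss.backward()    # gradients flow *through* the frozen ASR model into the mapper
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    vocab = 1000
    # Toy stand-in for a pre-trained ASR model; its weights are frozen below.
    frozen_asr = nn.Sequential(nn.Linear(80, 256), nn.GELU(), nn.Linear(256, vocab))
    for p in frozen_asr.parameters():
        p.requires_grad_(False)
    mapper = LipToSpeechMapper()
    optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)
    lip_feats = torch.randn(2, 50, 512)                    # dummy lip-feature sequences
    text_tokens = torch.randint(0, vocab, (2, 50))         # dummy frame-aligned token labels
    print(training_step(mapper, frozen_asr, lip_feats, text_tokens, optimizer))

In this toy example the text labels are frame-aligned purely for simplicity; with a real sequence-to-sequence ASR model such as Whisper, the loss would instead be computed on the decoder's text predictions, with the pre-trained model kept entirely frozen as described in the abstract.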

Bibliographic Details
Main Authors: Prajwal, KR; Afouras, T; Zisserman, A
Format: Conference item
Language: English
Published: ISCA 2024