Speech recognition models are strong lip-readers
In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, allowing a pre-trained ASR model to directly perform lip-reading. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. We achieve an impressive gain of 5.7 WER in the low data regime on the LRS3 benchmark over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos.
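To make the training signal described in the abstract concrete, below is a minimal sketch (not the authors' implementation): a hypothetical visual front-end maps lip frames to Whisper-compatible speech features, and the cross-entropy loss on the text labels is backpropagated through the frozen ASR model into that front-end. The `LipToSpeechFeatures` module, the HuggingFace `transformers` Whisper interface, and all tensor shapes are assumptions for illustration only.

```python
# Sketch: train a small visual front-end to emit "speech" features that a
# frozen, pre-trained Whisper model can transcribe. Only the front-end is
# updated; the ASR model is used as a fixed decoder of the learned features.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration


class LipToSpeechFeatures(nn.Module):
    """Hypothetical video encoder: lip frames -> pseudo log-mel features."""

    def __init__(self, n_mels=80, target_len=3000):
        super().__init__()
        self.target_len = target_len
        self.backbone = nn.Sequential(            # placeholder CNN over frames
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # keep only the time axis
        )
        self.proj = nn.Linear(64, n_mels)

    def forward(self, frames):                     # frames: (B, 3, T, H, W)
        x = self.backbone(frames).squeeze(-1).squeeze(-1)   # (B, 64, T)
        x = self.proj(x.transpose(1, 2)).transpose(1, 2)    # (B, n_mels, T)
        # Whisper expects a fixed 30 s feature length (3000 frames).
        return F.interpolate(x, size=self.target_len)


asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
asr.requires_grad_(False)                          # ASR model stays frozen
front_end = LipToSpeechFeatures()
opt = torch.optim.AdamW(front_end.parameters(), lr=1e-4)

frames = torch.randn(2, 3, 75, 96, 96)             # dummy 3 s lip crops
labels = torch.randint(0, asr.config.vocab_size, (2, 12))  # dummy token ids

opt.zero_grad()
features = front_end(frames)                        # (B, 80, 3000)
out = asr(input_features=features, labels=labels)   # cross-entropy on text
out.loss.backward()                                  # gradients reach only the front-end
opt.step()
```

Because the ASR weights never receive gradient updates, all of the adaptation capacity lives in the visual front-end; the pre-trained recognizer simply decodes the learned speech-like features into text.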
Main Authors: | Prajwal, KR; Afouras, T; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | ISCA, 2024 |
author | Prajwal, KR; Afouras, T; Zisserman, A |
---|---|
collection | OXFORD |
description | In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, allowing a pre-trained ASR model to directly perform lip-reading. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. We achieve an impressive gain of 5.7 WER in the low data regime on the LRS3 benchmark over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos. |
format | Conference item |
id | oxford-uuid:176ddc51-6a08-4497-b63c-8f9bb38329a5 |
institution | University of Oxford |
language | English |
publishDate | 2024 |
publisher | ISCA |
title | Speech recognition models are strong lip-readers |