Speech recognition models are strong lip-readers
In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence ...
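The abstract only sketches the approach, so the snippet below is a rough, hypothetical illustration (not the authors' released code) of the general idea: a small trainable adapter maps per-frame lip features into the embedding space expected by a frozen, pre-trained ASR encoder-decoder. All module names, dimensions, and the adapter architecture here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LipToSpeechAdapter(nn.Module):
    """Hypothetical adapter: maps lip-sequence features to ASR-compatible embeddings.

    The pre-trained ASR model (e.g. a Whisper-style encoder-decoder) is assumed to
    stay frozen; only this adapter would be trained.
    """

    def __init__(self, lip_dim: int = 512, asr_dim: int = 768, n_layers: int = 4):
        super().__init__()
        # Project lip features into the ASR model's embedding dimension.
        self.proj = nn.Linear(lip_dim, asr_dim)
        # Model temporal context over the video frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=asr_dim, nhead=8, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, lip_feats: torch.Tensor) -> torch.Tensor:
        # lip_feats: (batch, video_frames, lip_dim)
        x = self.proj(lip_feats)
        # Output: (batch, video_frames, asr_dim), to be consumed by the frozen ASR model.
        return self.temporal(x)


if __name__ == "__main__":
    adapter = LipToSpeechAdapter()
    dummy_lips = torch.randn(2, 75, 512)   # 2 clips, 75 video frames, 512-d lip features
    speech_like = adapter(dummy_lips)      # (2, 75, 768)
    print(speech_like.shape)
```

In this kind of setup, the adapter is typically supervised so that its outputs match the features the ASR model would have produced from the corresponding audio, which is what lets the frozen speech model transcribe from video alone.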
| Main authors: | Prajwal, KR, Afouras, T, Zisserman, A |
|---|---|
| Format: | Conference item |
| Language: | English |
| Published: | ISCA, 2024 |
Similar Items
- Sub-word level lip reading with visual attention
  by: Prajwal, KR, et al.
  Published: (2022)
- My lips are concealed: audio-visual speech enhancement through obstructions
  by: Afouras, T, et al.
  Published: (2019)
- Deep lip reading: a comparison of models and an online application
  by: Afouras, T, et al.
  Published: (2018)
- Deep audio-visual speech recognition
  by: Afouras, T, et al.
  Published: (2018)
- ASR is all you need: cross-modal distillation for lip reading
  by: Afouras, T, et al.
  Published: (2020)