Speech recognition models are strong lip-readers

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequenc...

Bibliographic Details
Main Authors: Prajwal, KR; Afouras, T; Zisserman, A
Format: Conference item
Language: English
Published: ISCA 2024
