Speech recognition models are strong lip-readers

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequenc...

Bibliographic Details
Main Authors: Prajwal, KR; Afouras, T; Zisserman, A
Format: Conference item
Language: English
Published: ISCA 2024