Speech recognition models are strong lip-readers

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequenc...

Bibliographic Details
Main Authors:	Prajwal, KR; Afouras, T; Zisserman, A
Format:	Conference item
Language:	English
Published:	ISCA 2024
