Speech recognition models are strong lip-readers
In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequenc...
Main Authors: | , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | ISCA 2024 |