Self-Supervised Audio-Visual Speech Diarization and Recognition

Many real-world use cases of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address...

Bibliographic Details
Main Author:	Wongprommoon, Arun
Other Authors:	Glass, James
Format:	Thesis
Published:	Massachusetts Institute of Technology 2024
Online Access:	https://hdl.handle.net/1721.1/156767

Similar Items