Self-Supervised Audio-Visual Speech Diarization and Recognition

Many real-world use cases of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address...

Bibliographic Details
Main Author: Wongprommoon, Arun
Other Authors: Glass, James
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156767