Self-Supervised Audio-Visual Speech Diarization and Recognition

Bibliographic Details
Main Author: Wongprommoon, Arun
Other Authors: Glass, James
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156767
Description
Summary: Many real-world use cases of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address the diarization problem while maintaining word recognition accuracy. The proposed Audio-Visual Cocktail (AVC) HuBERT model expands the video input dimensions, lengthens the feature size, and adds projection layers that split the output into per-speaker streams. A complementary synthesized dataset, LRS3Mix, is constructed by mixing audio and video samples from LRS3 at varying overlap thresholds; it is used to train the model, whose weights are transferred from AV-HuBERT. Evaluating several versions of AVC-HuBERT with word error rate (WER) metrics that capture both recognition and diarization performance shows that the method improves diarization, albeit with a small trade-off in word recognition. Augmenting the synthesized mixed dataset with the original clean single-speaker dataset boosts recognition ability, and a similar gain is observed as the dataset size increases.
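
The summary describes the architectural change only at a high level. As an illustration of the general idea of per-speaker projection heads on a shared encoder output (not the thesis's actual code; the class name, dimensions, and two-speaker setting are assumptions), a minimal PyTorch sketch:

import torch
import torch.nn as nn

class PerSpeakerProjection(nn.Module):
    """Split a shared encoder output into one logit stream per speaker.

    Illustrative only: the encoder stays shared, and separate linear
    projections produce one output sequence per speaker, as the abstract
    describes. All sizes and names here are assumptions.
    """

    def __init__(self, encoder_dim=1024, vocab_size=1000, num_speakers=2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(encoder_dim, vocab_size) for _ in range(num_speakers)
        )

    def forward(self, encoder_out):
        # encoder_out: (batch, time, encoder_dim)
        # returns: (num_speakers, batch, time, vocab_size)
        return torch.stack([head(encoder_out) for head in self.heads])

# Shape check with random features standing in for encoder output.
feats = torch.randn(4, 250, 1024)
print(PerSpeakerProjection()(feats).shape)  # torch.Size([2, 4, 250, 1000])

Branching only at the projection layer, while keeping the encoder shared, is consistent with the summary's note that the model's weights are transferred directly from AV-HuBERT.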
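
The LRS3Mix construction is likewise described only in outline. A sketch of one plausible mixing convention for the audio side of a two-speaker pair, where a fraction `overlap` of the second waveform coincides with the tail of the first (the function name and the exact overlap convention are assumptions, not the thesis's recipe; the thesis also mixes the corresponding video):

import numpy as np

def mix_pair(wav_a, wav_b, overlap):
    """Overlay two mono waveforms; `overlap` in [0, 1] controls how much
    of wav_b coincides with the tail of wav_a (0 = concatenation)."""
    assert 0.0 <= overlap <= 1.0
    n_overlap = int(round(overlap * len(wav_b)))
    start = max(len(wav_a) - n_overlap, 0)   # sample where wav_b begins
    mix = np.zeros(max(len(wav_a), start + len(wav_b)), dtype=np.float32)
    mix[:len(wav_a)] += wav_a
    mix[start:start + len(wav_b)] += wav_b
    return mix

# Two fake 1-second utterances at 16 kHz mixed at 50% overlap -> 1.5 s total.
sr = 16000
a = 0.1 * np.random.randn(sr).astype(np.float32)
b = 0.1 * np.random.randn(sr).astype(np.float32)
print(len(mix_pair(a, b, overlap=0.5)) / sr)  # 1.5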
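
The abstract mentions several WER metrics without defining them. For a two-speaker output, one natural measure of joint recognition-and-diarization quality is a permutation-invariant WER: hypotheses are scored against references under the speaker assignment that minimizes total error. The sketch below is an assumption about what such a metric could look like, not the thesis's definition:

from itertools import permutations

def word_errors(ref, hyp):
    """Levenshtein distance between word sequences (subs + dels + ins)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rw != hw))   # substitution/match
    return d[len(h)]

def pi_wer(refs, hyps):
    """Permutation-invariant WER over per-speaker transcripts."""
    n_ref = sum(len(r.split()) for r in refs)
    best = min(sum(word_errors(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps))
    return best / max(n_ref, 1)

refs = ["hello world", "good morning everyone"]
hyps = ["good morning everyone", "hello word"]
print(pi_wer(refs, hyps))  # 0.2: one substitution over five reference words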