Self-Supervised Audio-Visual Speech Diarization and Recognition

Bibliographic Details
Main Author: Wongprommoon, Arun
Other Authors: Glass, James
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156767
collection MIT
description Many real-world applications of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve both video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address the diarization problem while maintaining word recognition accuracy. The proposed Audio-Visual Cocktail (AVC) HuBERT model extends the video input dimensions, lengthens the feature size, and adds projection layers that split the outputs into the corresponding speakers. A complementary synthesized dataset, LRS3Mix, is constructed by mixing audio and video samples from LRS3 at varying overlap thresholds; it is used to train the model, whose weights are transferred from AV-HuBERT. Evaluating several versions of AVC-HuBERT on word error rate (WER) metrics that capture both recognition and diarization performance shows that the method improves diarization, albeit with a small tradeoff in word recognition. Augmenting the synthesized mixed dataset with the original clean single-speaker dataset boosts recognition ability, and the same effect is observed as the dataset size increases.
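The core of the dataset construction described in the abstract is overlaying two single-speaker clips at a chosen overlap ratio. The sketch below is a hypothetical illustration of that idea, not the thesis's actual LRS3Mix pipeline; the function name `mix_clips` and the `overlap` parameter are invented for the example.

```python
import numpy as np

def mix_clips(a: np.ndarray, b: np.ndarray, overlap: float) -> np.ndarray:
    """Overlay two mono waveforms so that `b` begins once a fraction
    (1 - overlap) of `a` has played; `overlap` in [0, 1] controls how
    much of the two signals coincide (0 = back-to-back, 1 = fully mixed)."""
    assert 0.0 <= overlap <= 1.0
    # Start offset of speaker b relative to the start of speaker a.
    offset = int(round(len(a) * (1.0 - overlap)))
    total = max(len(a), offset + len(b))
    mixed = np.zeros(total, dtype=np.float32)
    mixed[: len(a)] += a
    mixed[offset : offset + len(b)] += b
    return mixed

# Two synthetic 1-second "clips" at 16 kHz, mixed with 50% overlap.
sr = 16_000
a = np.random.default_rng(0).standard_normal(sr).astype(np.float32)
b = np.random.default_rng(1).standard_normal(sr).astype(np.float32)
m = mix_clips(a, b, overlap=0.5)  # 1.5 s: speakers overlap in the middle third
```

Sweeping `overlap` over a range of values would then yield mixtures at the "varying overlap thresholds" the abstract mentions, with the single-speaker transcripts retained as per-speaker targets.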
id mit-1721.1/156767
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Thesis Date: 2024-05
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s); https://creativecommons.org/licenses/by-nc-nd/4.0/
File Format: application/pdf