Self-Supervised Audio-Visual Speech Diarization and Recognition

Bibliographic Details
Main Author: Wongprommoon, Arun
Other Authors: Glass, James
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156767
collection MIT
description Many real-world applications of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve both video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address the diarization problem while maintaining word recognition accuracy. The proposed Audio-Visual Cocktail (AVC) HuBERT model extends the video input dimensions, lengthens the feature size, and adds projection layers that split the outputs into the corresponding speakers. A complementary synthesized dataset, LRS3Mix, is constructed by mixing audio and video samples from LRS3 at varying overlap thresholds; it is used to train the model, whose weights are transferred from AV-HuBERT. Evaluating several versions of AVC-HuBERT on word error rate (WER) metrics that capture both recognition and diarization performance shows that the method improves diarization, albeit with a small tradeoff in word recognition. Augmenting the synthesized mixed dataset with the original clean single-speaker dataset boosts recognition ability, and the same effect is observed as the dataset size increases.
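The core of the dataset construction described in the abstract is overlaying two single-speaker clips at a chosen overlap ratio. The sketch below is a hypothetical illustration of that idea, not the thesis's actual LRS3Mix pipeline; the function name `mix_clips` and the `overlap` parameter are invented for the example.

```python
import numpy as np

def mix_clips(a: np.ndarray, b: np.ndarray, overlap: float) -> np.ndarray:
    """Overlay two mono waveforms so that `b` begins once a fraction
    (1 - overlap) of `a` has played; `overlap` in [0, 1] controls how
    much of the two signals coincide (0 = back-to-back, 1 = fully mixed)."""
    assert 0.0 <= overlap <= 1.0
    # Start offset of speaker b relative to the start of speaker a.
    offset = int(round(len(a) * (1.0 - overlap)))
    total = max(len(a), offset + len(b))
    mixed = np.zeros(total, dtype=np.float32)
    mixed[: len(a)] += a
    mixed[offset : offset + len(b)] += b
    return mixed

# Two synthetic 1-second "clips" at 16 kHz, mixed with 50% overlap.
sr = 16_000
a = np.random.default_rng(0).standard_normal(sr).astype(np.float32)
b = np.random.default_rng(1).standard_normal(sr).astype(np.float32)
m = mix_clips(a, b, overlap=0.5)  # 1.5 s: speakers overlap in the middle third
```

Sweeping `overlap` over a range of values would then yield mixtures at the "varying overlap thresholds" the abstract mentions, with the single-speaker transcripts retained as per-speaker targets.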
id mit-1721.1/156767
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Thesis Date: 2024-05
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s); https://creativecommons.org/licenses/by-nc-nd/4.0/
File Format: application/pdf