Self-Supervised Audio-Visual Speech Diarization and Recognition
Many real-world use cases of automatic speech recognition (ASR), such as TV broadcasts and video conferences, involve video and multiple speakers. However, state-of-the-art end-to-end multimodal ASR models generally do not support diarization. This thesis extends one such model, AV-HuBERT, to address the diarization problem while maintaining word recognition accuracy. The proposed Audio-Visual Cocktail (AVC) HuBERT model extends the video input dimensions, lengthens the feature size, and adds projection layers that split the outputs into the corresponding speakers. A complementary synthetic dataset, LRS3Mix, is constructed by mixing audio and video samples from LRS3 at varying overlap thresholds; it is used to train the model, whose weights are transferred from AV-HuBERT. Several word error rate (WER) metrics, computed to measure the recognition and diarization performance of several AVC-HuBERT variants, demonstrate that the method improves diarization with only a small tradeoff in word recognition. Augmenting the synthesized mixed dataset with the original clean single-speaker dataset boosts recognition ability, and the same effect is observed as the dataset size increases.
Main Author: | Wongprommoon, Arun |
---|---|
Other Authors: | Glass, James; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Format: | Thesis (M.Eng.) |
Published: | Massachusetts Institute of Technology, 2024 |
Rights: | Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s) |
Online Access: | https://hdl.handle.net/1721.1/156767 |
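To make the data-synthesis step in the abstract concrete, here is a minimal sketch of mixing two single-speaker clips into an overlapped two-speaker sample at a randomly drawn overlap ratio, in the spirit of LRS3Mix. It is not the thesis code: the function name `mix_pair`, the uniform draw, and the random-noise stand-ins for real LRS3 waveforms are illustrative assumptions (the paired video streams would be composited analogously).

```python
# Sketch only: overlap two single-speaker waveforms into one mixed clip.
import numpy as np

def mix_pair(wav_a: np.ndarray, wav_b: np.ndarray,
             overlap_ratio: float) -> np.ndarray:
    """Overlap wav_b with the tail of wav_a by `overlap_ratio` of wav_b's length.

    overlap_ratio = 0.0 -> the clips are simply concatenated (no overlap);
    overlap_ratio = 1.0 -> wav_b lies entirely inside wav_a's tail.
    """
    overlap = int(len(wav_b) * overlap_ratio)
    overlap = min(overlap, len(wav_a))           # cannot overlap more than wav_a
    total = len(wav_a) + len(wav_b) - overlap    # length of the mixed clip
    mixed = np.zeros(total, dtype=np.float32)
    mixed[:len(wav_a)] += wav_a                  # speaker A starts at t = 0
    start_b = len(wav_a) - overlap               # speaker B starts in A's tail
    mixed[start_b:start_b + len(wav_b)] += wav_b
    return mixed

# Example: draw the overlap ratio per pair, capped at some threshold.
rng = np.random.default_rng(0)
a = rng.standard_normal(16000 * 3).astype(np.float32)  # 3 s stand-in clip
b = rng.standard_normal(16000 * 2).astype(np.float32)  # 2 s stand-in clip
sample = mix_pair(a, b, overlap_ratio=rng.uniform(0.0, 0.5))
```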
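The architectural change described in the abstract, projection layers that split a shared encoder output into per-speaker streams, might look roughly like the PyTorch module below. This is a sketch of the general pattern under stated assumptions, not the actual AVC-HuBERT implementation; `MultiSpeakerHead`, the 1024-dimensional feature size, and the two-speaker default are illustrative.

```python
# Sketch only: per-speaker projection heads on top of a shared AV encoder.
from typing import List

import torch
import torch.nn as nn

class MultiSpeakerHead(nn.Module):
    def __init__(self, feature_dim: int, vocab_size: int, num_speakers: int = 2):
        super().__init__()
        # One projection per speaker, all applied to the same encoder features.
        self.heads = nn.ModuleList(
            nn.Linear(feature_dim, vocab_size) for _ in range(num_speakers)
        )

    def forward(self, features: torch.Tensor) -> List[torch.Tensor]:
        # features: (batch, time, feature_dim) from the shared encoder
        return [head(features) for head in self.heads]

# Example shapes: a 1024-d encoder feature projected to two token streams.
head = MultiSpeakerHead(feature_dim=1024, vocab_size=1000, num_speakers=2)
logits_per_speaker = head(torch.randn(4, 250, 1024))  # two (4, 250, 1000) tensors
```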
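The abstract reports several WER metrics that jointly capture recognition and diarization quality but does not define them. One plausible variant for two-speaker output is a permutation-invariant WER, which scores the hypotheses under the speaker assignment that matches the references best; the sketch below uses the jiwer library, and the helper name and toy strings are assumptions, not the thesis's evaluation code.

```python
# Sketch only: WER under the best speaker permutation for multi-speaker output.
from itertools import permutations

import jiwer  # pip install jiwer; a common WER library, used here for brevity

def permutation_invariant_wer(refs: list, hyps: list) -> float:
    """Average WER under the speaker assignment that scores best."""
    best = float("inf")
    for perm in permutations(range(len(hyps))):
        total = sum(jiwer.wer(refs[i], hyps[p]) for i, p in enumerate(perm))
        best = min(best, total / len(refs))
    return best

refs = ["the cat sat on the mat", "speech is hard to separate"]
hyps = ["speech is hard to separate", "the cat sat on a mat"]
print(permutation_invariant_wer(refs, hyps))  # pairs each hypothesis to its best reference
```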