Learning Audio-Video Language Representations

Bibliographic Details
Main Author: Rouditchenko, Andrew
Other Authors: Glass, James
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/139024
_version_ 1826190587408154624
author Rouditchenko, Andrew
author2 Glass, James
author_facet Glass, James
Rouditchenko, Andrew
author_sort Rouditchenko, Andrew
collection MIT
description Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to the reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools.
first_indexed 2024-09-23T08:42:39Z
format Thesis
id mit-1721.1/139024
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T08:42:39Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/139024 2022-01-15T04:05:35Z Learning Audio-Video Language Representations Rouditchenko, Andrew Glass, James Harwath, David Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to the reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools. M.Eng. 2022-01-14T14:45:17Z 2022-01-14T14:45:17Z 2021-06 2021-06-17T20:14:11.951Z Thesis https://hdl.handle.net/1721.1/139024 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Rouditchenko, Andrew
Learning Audio-Video Language Representations
title Learning Audio-Video Language Representations
title_full Learning Audio-Video Language Representations
title_fullStr Learning Audio-Video Language Representations
title_full_unstemmed Learning Audio-Video Language Representations
title_short Learning Audio-Video Language Representations
title_sort learning audio video language representations
url https://hdl.handle.net/1721.1/139024
work_keys_str_mv AT rouditchenkoandrew learningaudiovideolanguagerepresentations