Learning Audio-Video Language Representations

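Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to the reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools.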

Bibliographic Details
Main Author:	Rouditchenko, Andrew
Other Authors:	Glass, James
Format:	Thesis
Published:	Massachusetts Institute of Technology 2022
Online Access:	https://hdl.handle.net/1721.1/139024

author	Rouditchenko, Andrew
author2	Glass, James
collection	MIT
description	Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to the reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools.
first_indexed	2024-09-23T08:42:39Z
format	Thesis
id	mit-1721.1/139024
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T08:42:39Z
publishDate	2022
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/139024 2022-01-15T04:05:35Z Learning Audio-Video Language Representations Rouditchenko, Andrew Glass, James Harwath, David Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2022-01-14T14:45:17Z 2022-01-14T14:45:17Z 2021-06 2021-06-17T20:14:11.951Z Thesis https://hdl.handle.net/1721.1/139024 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title	Learning Audio-Video Language Representations
url	https://hdl.handle.net/1721.1/139024

