Learning Audio-Video Language Representations
Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to its reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools.
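The retrieval setup the abstract describes, ranking video clips by their similarity to an audio query in a shared embedding space, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the learned audio and video encoders are stood in for by correlated random vectors, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned encoders: in the thesis-style setup, an audio encoder
# and a video encoder map paired clips into a shared space. Here row i of each
# matrix represents the same underlying clip, so the pairs are correlated.
audio_emb = rng.normal(size=(5, 16))
video_emb = audio_emb + 0.1 * rng.normal(size=(5, 16))

def l2_normalize(x):
    """Scale each embedding to unit length so dot products are cosines."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_audio, video_bank):
    """Rank video clips by cosine similarity to a spoken/audio query."""
    q = l2_normalize(query_audio)
    v = l2_normalize(video_bank)
    scores = v @ q          # cosine similarity against every clip
    return np.argsort(-scores)  # indices of clips, best match first

# Query with the audio of clip 2; its paired video should rank first.
ranking = retrieve(audio_emb[2], video_emb)
print(ranking[0])
```

In the actual model such a space would be learned with a self-supervised contrastive objective that pulls each clip's audio and video embeddings together, which is what makes transcript-free audio-to-video search possible.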
| Main Author: | Rouditchenko, Andrew |
|---|---|
| Other Authors: | Glass, James; Harwath, David |
| Format: | Thesis |
| Published: | Massachusetts Institute of Technology, 2022 |
| Online Access: | https://hdl.handle.net/1721.1/139024 |
| _version_ | 1826190587408154624 |
|---|---|
author | Rouditchenko, Andrew |
author2 | Glass, James |
author_facet | Glass, James; Rouditchenko, Andrew |
author_sort | Rouditchenko, Andrew |
collection | MIT |
description | Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to its reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools. |
first_indexed | 2024-09-23T08:42:39Z |
format | Thesis |
id | mit-1721.1/139024 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T08:42:39Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/139024 2022-01-15T04:05:35Z Learning Audio-Video Language Representations Rouditchenko, Andrew Glass, James Harwath, David Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Automatic speech recognition has seen recent advancements powered by machine learning, but it is still only available for a small fraction of the more than 7,000 languages spoken worldwide due to its reliance on manually annotated speech data. Unlabeled multi-modal data, such as videos, are now increasingly available in many different languages and provide opportunities to scale speech technologies. In this thesis, we introduce models and datasets for learning visually grounded spoken language from raw audio in videos. We propose a self-supervised audio-video model that learns from the English narration naturally present in instructional videos to relate spoken words and sounds to visual content. Our model can recognize spoken words and natural sounds in audio queries to retrieve relevant visual clips, supporting its application to video search directly using audio and spoken queries, without needing to transcribe speech to text. We further demonstrate that our model can learn multilingual audio-video representations and can successfully perform retrieval on Japanese videos. Since our approach only requires audio-visual data without transcripts, we believe it is a promising direction to enable novel speech processing tools. M.Eng. 2022-01-14T14:45:17Z 2022-01-14T14:45:17Z 2021-06 2021-06-17T20:14:11.951Z Thesis https://hdl.handle.net/1721.1/139024 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Rouditchenko, Andrew; Learning Audio-Video Language Representations |
title | Learning Audio-Video Language Representations |
title_full | Learning Audio-Video Language Representations |
title_fullStr | Learning Audio-Video Language Representations |
title_full_unstemmed | Learning Audio-Video Language Representations |
title_short | Learning Audio-Video Language Representations |
title_sort | learning audio video language representations |
url | https://hdl.handle.net/1721.1/139024 |
work_keys_str_mv | AT rouditchenkoandrew learningaudiovideolanguagerepresentations |