Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors.
Main Authors: | Harwath, David F., Recasens, Adria, Suris Coll-Vinent, Didac, Chuang, Galen, Torralba, Antonio, Glass, James R |
---|---|
Other Authors: | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
Format: | Book |
Language: | English |
Published: | Springer International Publishing, 2020 |
Online Access: | https://hdl.handle.net/1721.1/123476 |
_version_ | 1811068703978029056 |
---|---|
author | Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R |
author2 | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
author_facet | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R |
author_sort | Harwath, David F. |
collection | MIT |
description | In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors. Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning |
first_indexed | 2024-09-23T07:59:52Z |
format | Book |
id | mit-1721.1/123476 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2024-09-23T07:59:52Z |
publishDate | 2020 |
publisher | Springer International Publishing |
record_format | dspace |
spelling | mit-1721.1/1234762022-09-30T01:34:48Z Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors. Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning 2020-01-20T17:03:22Z 2020-01-20T17:03:22Z 2018-10-06 2018-04-04 2019-07-11T17:10:06Z Book http://purl.org/eprint/type/ConferencePaper 9783030012304 9783030012311 0302-9743 1611-3349 https://hdl.handle.net/1721.1/123476 Harwath, David et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." Computer Vision – ECCV 2018, September 8–14, 2018, Munich, Germany, edited by V. Ferrari et al., Springer, 2018 en http://dx.doi.org/10.1007/978-3-030-01231-1_40 Computer Vision – ECCV 2018 Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Springer International Publishing arXiv |
spellingShingle | Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_full | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_fullStr | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_full_unstemmed | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_short | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_sort | jointly discovering visual objects and spoken words from raw sensory input |
url | https://hdl.handle.net/1721.1/123476 |
work_keys_str_mv | AT harwathdavidf jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT recasensadria jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT suriscollvinentdidac jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT chuanggalen jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT torralbaantonio jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT glassjamesr jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput |