Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets, demonstrating that our models implicitly learn semantically-coupled object and word detectors.

Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning
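The abstract describes localizations that emerge from a retrieval objective over raw pixels and waveforms, without labels or alignments. As a rough illustration of how a shared embedding over image regions and audio frames can produce such localizations, the sketch below computes a region-by-frame similarity tensor from two hypothetical encoder outputs. The encoder shapes, the dot-product similarity, and the max/mean pooling are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch, NOT the paper's exact model: shows how a retrieval score
# over image regions and audio frames also yields a localization signal.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs (random stand-ins for real CNN features):
#   an image encoder producing an (H, W, D) spatial feature map, and
#   an audio encoder producing a (T, D) frame sequence from the waveform.
H, W, T, D = 14, 14, 128, 512
image_feats = rng.standard_normal((H, W, D))
audio_feats = rng.standard_normal((T, D))

# Similarity of every audio frame to every image region; peaks in this
# (T, H, W) tensor indicate where a spoken segment matches an image region.
matchmap = np.einsum("hwd,td->thw", image_feats, audio_feats)

# Pool into a single image-caption similarity score, e.g. max over
# space then mean over time (one plausible pooling choice).
similarity = matchmap.max(axis=(1, 2)).mean()
print(similarity)
```

Training such pooled similarities with a ranking loss that scores matched image-caption pairs above mismatched ones is one standard way to realize the image-audio retrieval task the abstract mentions; localization then falls out of inspecting the similarity tensor, with no supervision beyond the pairing itself.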

Bibliographic Details
Main Authors: Harwath, David F.; Recasens, Adria; Suris Coll-Vinent, Didac; Chuang, Galen; Torralba, Antonio; Glass, James R
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Book (conference paper)
Language: English
Published: Springer International Publishing, 2020
Citation: Harwath, David et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." Computer Vision – ECCV 2018, September 8–14, 2018, Munich, Germany, edited by V. Ferrari et al., Springer, 2018
DOI: http://dx.doi.org/10.1007/978-3-030-01231-1_40
ISBN: 9783030012304; 9783030012311
ISSN: 0302-9743; 1611-3349
License: Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)
Online Access: https://hdl.handle.net/1721.1/123476