Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors.
Main Authors: | Harwath, David F., Recasens, Adria, Suris Coll-Vinent, Didac, Chuang, Galen, Torralba, Antonio, Glass, James R |
---|---|
Other Authors: | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
Format: | Book |
Language: | English |
Published: | Springer International Publishing, 2020 |
Online Access: | https://hdl.handle.net/1721.1/123476 |
_version_ | 1811068703978029056 |
---|---|
author | Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R |
author2 | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
author_facet | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R |
author_sort | Harwath, David F. |
collection | MIT |
description | In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors. Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning |
first_indexed | 2024-09-23T07:59:52Z |
format | Book |
id | mit-1721.1/123476 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2024-09-23T07:59:52Z |
publishDate | 2020 |
publisher | Springer International Publishing |
record_format | dspace |
spelling | mit-1721.1/1234762022-09-30T01:34:48Z Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly on the image pixels and speech waveform, and do not rely on any conventional supervision in the form of labels, segmentations, or alignments between the modalities during training. We perform analysis using the Places 205 and ADE20k datasets demonstrating that our models implicitly learn semantically-coupled object and word detectors. Keywords: vision and language; sound; speech; convolutional networks; multimodal learning; unsupervised learning 2020-01-20T17:03:22Z 2020-01-20T17:03:22Z 2018-10-06 2018-04-04 2019-07-11T17:10:06Z Book http://purl.org/eprint/type/ConferencePaper 9783030012304 9783030012311 0302-9743 1611-3349 https://hdl.handle.net/1721.1/123476 Harwath, David et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." Computer Vision – ECCV 2018, September 8–14, 2018, Munich, Germany, edited by V. Ferrari et al., Springer, 2018 en http://dx.doi.org/10.1007/978-3-030-01231-1_40 Computer Vision – ECCV 2018 Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Springer International Publishing arXiv |
spellingShingle | Harwath, David F. Recasens, Adria Suris Coll-Vinent, Didac Chuang, Galen Torralba, Antonio Glass, James R Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_full | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_fullStr | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_full_unstemmed | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_short | Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input |
title_sort | jointly discovering visual objects and spoken words from raw sensory input |
url | https://hdl.handle.net/1721.1/123476 |
work_keys_str_mv | AT harwathdavidf jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT recasensadria jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT suriscollvinentdidac jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT chuanggalen jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT torralbaantonio jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput AT glassjamesr jointlydiscoveringvisualobjectsandspokenwordsfromrawsensoryinput |