Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learne...

Bibliographic Details
Main Authors: Harwath, David F.; Recasens, Adria; Suris Coll-Vinent, Didac; Chuang, Galen; Torralba, Antonio; Glass, James R.
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Book
Language: English
Published: Springer International Publishing 2020
Online Access: https://hdl.handle.net/1721.1/123476