Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned...
Format: Book
Language: English
Published: Springer International Publishing, 2020
Online Access: https://hdl.handle.net/1721.1/123476