Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned...
Format: Book
Language: English
Published: Springer International Publishing, 2020
Online Access: https://hdl.handle.net/1721.1/123476