Ambient Sound Provides Supervision for Visual Learning

The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
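The supervisory signal described in the abstract is a fixed-length statistical summary of the audio accompanying a video frame. As a purely illustrative sketch (a hypothetical stand-in — the paper's actual target is a richer set of cochleagram-based sound-texture statistics), such a summary might look like:

```python
import math

def sound_summary(waveform):
    """Toy fixed-length statistical summary of an audio clip:
    mean, standard deviation, mean absolute amplitude, and
    zero-crossing rate. Illustrative stand-in only -- the paper's
    target is a richer set of sound-texture statistics."""
    n = len(waveform)
    mean = sum(waveform) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in waveform) / n)
    mean_abs = sum(abs(x) for x in waveform) / n
    # Fraction of adjacent sample pairs that change sign
    zcr = sum(1 for a, b in zip(waveform, waveform[1:]) if a * b < 0) / (n - 1)
    return [mean, std, mean_abs, zcr]

# A convolutional network would then be trained to predict (or to
# classify a clustering of) this vector from the paired video frame,
# with no human labels involved.
print(sound_summary([0.0, 0.5, -0.5, 0.25, -0.25, 0.1]))
```

Because the target vector is computed automatically from the soundtrack, every frame of web video comes with "free" supervision — this is what lets the method scale without manual annotation.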

Bibliographic Details
Main Authors: Owens, Andrew Hale, Wu, Jiajun, McDermott, Joshua H., Freeman, William T., Torralba, Antonio
Other Authors: Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: en_US
Published: Springer-Verlag 2017
Online Access: http://hdl.handle.net/1721.1/111172
https://orcid.org/0000-0001-9020-9593
https://orcid.org/0000-0002-4176-343X
https://orcid.org/0000-0002-3965-2503
https://orcid.org/0000-0002-2231-7995
https://orcid.org/0000-0003-4915-0256
Additional Details

Citation: Owens, Andrew, et al. "Ambient Sound Provides Supervision for Visual Learning." Lecture Notes in Computer Science 9905 (September 2016): 801–816.
Journal: Lecture Notes in Computer Science
Type: Conference Paper
DOI: http://dx.doi.org/10.1007/978-3-319-46448-0_48
Handle: http://hdl.handle.net/1721.1/111172
ISBN: 978-3-319-46447-3; 978-3-319-46448-0
ISSN: 0302-9743; 1611-3349
Date Issued: September 2016
Departments: Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences; Department of Electrical Engineering and Computer Science
Sponsorship: National Science Foundation (U.S.) (Grants 1524817, 1447476, 1212849)
Copyright: © 2016 Springer International Publishing AG
License: Creative Commons Attribution-Noncommercial-Share Alike, http://creativecommons.org/licenses/by-nc-sa/4.0/
Publisher: Springer-Verlag
Source: arXiv
File Format: application/pdf