Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
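The method sketched in the abstract pairs each video frame with a statistical summary of its accompanying audio and trains a convolutional network to predict that summary from the frame alone. The snippet below is a minimal illustration of that setup, not the authors' implementation: the summary statistics (per-band mean and standard deviation of a log-mel spectrogram), the tiny network, and the regression loss are placeholder assumptions; the paper's actual sound features, architecture, and training objective may differ.

```python
# Minimal sketch of frame-to-sound-statistics training, assuming PyTorch/torchaudio.
# The summary statistics, network, and loss below are illustrative placeholders,
# not the paper's actual design.
import torch
import torch.nn as nn
import torchaudio


def sound_summary(waveform, sample_rate=16000, n_mels=32):
    """Statistical summary of a clip: mean and std of each log-mel band over time."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels)(waveform)   # (1, n_mels, frames)
    log_mel = torch.log(mel + 1e-6)
    return torch.cat([log_mel.mean(dim=-1), log_mel.std(dim=-1)], dim=-1).squeeze(0)


class FrameToSound(nn.Module):
    """Small CNN that predicts the audio summary from a single video frame."""

    def __init__(self, out_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, out_dim)

    def forward(self, frame):
        return self.head(self.features(frame).flatten(1))


# One training step: the frame is the input; the paired clip's statistics are the
# target, so no human labels are involved.
model, loss_fn = FrameToSound(), nn.MSELoss()
frames = torch.randn(8, 3, 128, 128)            # batch of video frames
clips = torch.randn(8, 1, 16000)                # paired one-second audio clips
targets = torch.stack([sound_summary(c) for c in clips])
loss = loss_fn(model(frames), targets)
loss.backward()
```

The point the abstract emphasizes is that the target comes for free from the video's own soundtrack, so the visual features are learned without manual annotation and can then be evaluated on recognition tasks.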
Main Authors: | Owens, Andrew Hale; Wu, Jiajun; McDermott, Joshua H.; Freeman, William T.; Torralba, Antonio |
Other Authors: | Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences |
Format: | Article |
Language: | en_US |
Published: | Springer-Verlag, 2017 |
Online Access: | http://hdl.handle.net/1721.1/111172 https://orcid.org/0000-0001-9020-9593 https://orcid.org/0000-0002-4176-343X https://orcid.org/0000-0002-3965-2503 https://orcid.org/0000-0002-2231-7995 https://orcid.org/0000-0003-4915-0256 |
author | Owens, Andrew Hale Wu, Jiajun McDermott, Joshua H. Freeman, William T. Torralba, Antonio |
author2 | Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences |
collection | MIT |
description | The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. |
format | Article |
id | mit-1721.1/111172 |
institution | Massachusetts Institute of Technology |
language | en_US |
publishDate | 2017 |
publisher | Springer-Verlag |
record_format | dspace |
departments | Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
sponsorship | National Science Foundation (U.S.) (Grant 1524817); National Science Foundation (U.S.) (Grant 1447476); National Science Foundation (U.S.) (Grant 1212849)
date issued | 2016-09
date available | 2017-09-12T13:32:52Z
type | Article; http://purl.org/eprint/type/ConferencePaper
isbn | 978-3-319-46447-3; 978-3-319-46448-0
issn | 0302-9743; 1611-3349
citation | Owens, Andrew, et al. “Ambient Sound Provides Supervision for Visual Learning.” Lecture Notes in Computer Science 9905 (September 2016): 801–816. © 2016 Springer International Publishing AG
journal | Lecture Notes in Computer Science
doi | http://dx.doi.org/10.1007/978-3-319-46448-0_48
rights | Creative Commons Attribution-Noncommercial-Share Alike; http://creativecommons.org/licenses/by-nc-sa/4.0/
file format | application/pdf
source | arXiv
title | Ambient Sound Provides Supervision for Visual Learning |