Crossmodal attentive skill learner: learning in Atari and beyond with audio–video inputs

Abstract: This paper introduces the Crossmodal Attentive Skill Learner (CASL), integrated with the recently-introduced Asynchronous Advantage Option-Critic architecture [Harb et al. in When waiting is not an option: learning options with a deliberation cost. arXiv preprint arXiv:1709.04571, 2017] to enable hierarchical reinforcement learning across multiple sensory inputs. Agents trained using our approach learn to attend to their various sensory modalities (e.g., audio, video) at the appropriate moments, thereby executing actions based on multiple sensory streams without reliance on supervisory data. We demonstrate empirically that the sensory attention mechanism anticipates and identifies useful latent features, while filtering irrelevant sensor modalities during execution. Further, we provide concrete examples in which the approach not only improves performance in a single task, but also accelerates transfer to new tasks. We modify the Arcade Learning Environment [Bellemare et al. in J Artif Intell Res 47:253–279, 2013] to support audio queries (ALE-audio code available at https://github.com/shayegano/Arcade-Learning-Environment), and conduct evaluations of crossmodal learning in the Atari 2600 games H.E.R.O. and Amidar. Finally, building on the recent work of Babaeizadeh et al. [in: International conference on learning representations (ICLR), 2017], we open-source a fast hybrid CPU–GPU implementation of CASL (CASL code available at https://github.com/shayegano/CASL).

Bibliographic Details
Main Authors: Kim, Dong-Ki; Omidshafiei, Shayegan; Pazis, Jason; How, Jonathan P
Format: Article
Language: English
Published in: Autonomous Agents and Multi-Agent Systems, 34(1):16, 13 January 2020
Publisher: Springer US
DOI: https://doi.org/10.1007/s10458-019-09439-5
Online Access: https://hdl.handle.net/1721.1/131879
License: Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)