Connecting Touch and Vision via Cross-Modal Prediction

© 2019 IEEE. Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the t...

Bibliographic Details
Main Authors:	Li, Yunzhu, Zhu, Jun-Yan, Tedrake, Russ, Torralba, Antonio
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Language:	English
Published:	Institute of Electrical and Electronics Engineers (IEEE) 2021
Online Access:	https://hdl.handle.net/1721.1/137632

collection	MIT
description	© 2019 IEEE. Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: While our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model.
first_indexed	2024-09-23T11:43:07Z
format	Article
id	mit-1721.1/137632
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T11:43:07Z
publishDate	2021
publisher	Institute of Electrical and Electronics Engineers (IEEE)
record_format	dspace
spelling	mit-1721.1/137632 2023-04-14T15:56:04Z 2021-11-08T12:42:36Z 2021-11-08T12:42:36Z 2019-06 2021-01-27T17:48:43Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/137632 Li, Yunzhu, Zhu, Jun-Yan, Tedrake, Russ and Torralba, Antonio. 2019. "Connecting Touch and Vision via Cross-Modal Prediction." Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June. en 10.1109/CVPR.2019.01086 Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Institute of Electrical and Electronics Engineers (IEEE) MIT web domain

Similar Items