Vision-Based Efficient Robotic Manipulation with a Dual-Streaming Compact Convolutional Transformer

Learning efficient robotic manipulation from visual observations remains a significant challenge in reinforcement learning (RL). Although pairing an RL policy with a convolutional neural network (CNN) visual encoder achieves high efficiency and success rates, the method's general performance across multiple tasks is still limited by the efficacy of the encoder. Meanwhile, the increasing cost of optimizing the encoder for general performance can erode the efficiency advantage of the original policy. Building on the attention mechanism, we design a robotic manipulation method that significantly improves the policy's general performance across tasks by combining a lightweight Transformer-based visual encoder with unsupervised learning and data augmentation. Our encoder matches the performance of the original Transformer with much less data, keeping training efficient and strengthening general multi-task performance. Furthermore, when combining third-person and egocentric views to assimilate global and local visual information, we experimentally demonstrate that the master view outperforms the alternative third-person views on general robotic manipulation tasks. In extensive experiments on tasks from the OpenAI Gym Fetch environment, notably the Push task, our method achieves a 92% success rate, versus baselines of 65%, 78% for the CNN encoder, and 81% for the ViT encoder, while requiring fewer training steps.

Bibliographic Details
Main Authors: Hao Guo, Meichao Song, Zhen Ding, Chunzhi Yi, Feng Jiang
Affiliations: Hao Guo, Meichao Song, Zhen Ding, and Feng Jiang: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; Chunzhi Yi: School of Medicine and Health, Harbin Institute of Technology, Harbin 150001, China
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Sensors, Vol. 23, Iss. 1, Article 515
ISSN: 1424-8220
DOI: 10.3390/s23010515
Subjects: bio-inspired design and control of robots; robotics; reinforcement learning; vision transformer
Online Access: https://www.mdpi.com/1424-8220/23/1/515
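
The abstract describes an encoder that feeds an RL policy by fusing a third-person "master" view with an egocentric view through a compact convolutional Transformer (CCT). As a rough illustration only, here is a minimal PyTorch sketch of what such a dual-stream encoder could look like; the module names, dimensions, and the token-concatenation fusion are assumptions of this sketch, not the authors' published implementation.

import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    # CCT-style tokenizer: a small convolutional stack replaces the ViT patch
    # embedding, giving the Transformer a convolutional inductive bias.
    def __init__(self, in_ch=3, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        f = self.net(x)                         # (B, dim, H', W')
        return f.flatten(2).transpose(1, 2)     # (B, H'*W', dim) token sequence

class DualStreamCCT(nn.Module):
    # Hypothetical fusion: tokenize each camera view separately, concatenate
    # the two token streams, encode them jointly, then apply CCT-style
    # sequence pooling (an attention-weighted mean) instead of a class token.
    def __init__(self, dim=128, depth=4, heads=4):
        super().__init__()
        self.tok_third = ConvTokenizer(dim=dim)   # third-person "master" view
        self.tok_ego = ConvTokenizer(dim=dim)     # egocentric view
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.seq_pool = nn.Linear(dim, 1)

    def forward(self, third, ego):
        tokens = torch.cat([self.tok_third(third), self.tok_ego(ego)], dim=1)
        z = self.encoder(tokens)                      # (B, N, dim)
        w = torch.softmax(self.seq_pool(z), dim=1)    # (B, N, 1) pooling weights
        return (w.transpose(1, 2) @ z).squeeze(1)     # (B, dim) latent for the policy

# Example: two 84x84 RGB views -> one 128-d state embedding for the RL policy.
encoder = DualStreamCCT()
latent = encoder(torch.randn(2, 3, 84, 84), torch.randn(2, 3, 84, 84))

Concatenating tokens before a shared encoder lets self-attention relate global (third-person) and local (egocentric) features directly; the paper's actual fusion strategy may differ.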
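
The abstract also attributes the encoder's data efficiency to unsupervised learning and data augmentation. A common recipe in pixel-based RL (e.g., CURL and RAD) pairs random cropping with an InfoNCE contrastive objective; the sketch below assumes that recipe purely for illustration, since the record does not spell out the paper's exact objective. The encoder variable reuses the DualStreamCCT instance from the previous sketch.

import torch
import torch.nn.functional as F

def random_crop(imgs, out=76):
    # RAD-style augmentation; crops the whole batch at a single random offset
    # for brevity (per-image offsets are the more common choice).
    _, _, h, w = imgs.shape
    top = torch.randint(0, h - out + 1, (1,)).item()
    left = torch.randint(0, w - out + 1, (1,)).item()
    return imgs[:, :, top:top + out, left:left + out]

def curl_loss(z_anchor, z_pos, W):
    # InfoNCE with a bilinear similarity, as in CURL: matching pairs sit on
    # the diagonal of the logits; every other batch entry is a negative.
    logits = z_anchor @ W @ z_pos.t()                         # (B, B)
    labels = torch.arange(z_anchor.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Example: two independently cropped copies of the same observation pair form
# a positive pair (CURL would encode the second copy with a momentum encoder).
third, ego = torch.randn(8, 3, 84, 84), torch.randn(8, 3, 84, 84)
z_a = encoder(random_crop(third), random_crop(ego))   # encoder: DualStreamCCT above
z_p = encoder(random_crop(third), random_crop(ego))
loss = curl_loss(z_a, z_p, torch.eye(128))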