Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments


Bibliographic Details
Main Authors: Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, Aleksandr I. Panov
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control
Online Access: https://ieeexplore.ieee.org/document/10323309/
collection DOAJ
description In this work, we propose and investigate an original approach to using a pre-trained multimodal transformer with a specialized architecture for controlling a robotic agent in a language-instructed object manipulation task; we refer to the resulting model as RozumFormer. The model is based on a bimodal (text-image) transformer architecture originally trained on tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapt the model to robotic manipulation by organizing the input as a single sequence of text, image, and action tokens. We demonstrate that such a model adapts well to new tasks and that fine-tuning yields better results than training from scratch in both simulated and real environments. To transfer the model from the simulator to a real robot, new datasets were collected and annotated. In addition, experiments controlling the agent in a visual environment with reinforcement learning show that fine-tuning on a mixed dataset that includes examples from the original visual-linguistic tasks only slightly decreases performance on those tasks, simplifying the addition of tasks from another domain.
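The unified token-sequence layout the abstract describes (text, image, and action tokens flattened into one transformer input) can be illustrated with a small, hypothetical sketch. All names, vocabulary sizes, offsets, and special tokens below are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of a unified text-image-action token sequence,
# in the spirit of the layout described in the abstract.
# Vocabulary sizes, offsets, and special tokens are assumptions.

TEXT_VOCAB = 32000   # assumed subword vocabulary size
IMG_VOCAB = 8192     # assumed discrete image-token codebook size
ACT_BINS = 256       # assumed per-dimension action discretization

IMG_OFFSET = TEXT_VOCAB               # image ids placed after text ids
ACT_OFFSET = TEXT_VOCAB + IMG_VOCAB   # action ids placed after image ids

BOS, SEP = 0, 1  # assumed special tokens

def build_sequence(text_ids, image_ids, action_bins):
    """Flatten instruction, observation, and action tokens into one sequence."""
    seq = [BOS]
    seq += text_ids                                   # language instruction
    seq.append(SEP)
    seq += [IMG_OFFSET + t for t in image_ids]        # image observation
    seq.append(SEP)
    seq += [ACT_OFFSET + b for b in action_bins]      # actions to predict
    return seq

seq = build_sequence([5, 17, 9], [3, 4], [128, 7])
```

Mapping each modality into a disjoint id range lets a single embedding table and output head serve all three token types, which is one common way such bimodal models are extended with actions.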
format Article
id doaj.art-a1dbf8dab5934c439c2f37832a1dd9bb
institution Directory Open Access Journal
issn 2169-3536
language English
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2023.3334791
volume 11
pages 130548-130559
affiliation Aleksei Staroverov: Artificial Intelligence Research Institute (AIRI), Moscow, Russia
affiliation Andrey S. Gorodetsky: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia (ORCID: https://orcid.org/0009-0007-0763-9455)
affiliation Andrei S. Krishtopik: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
affiliation Uliana A. Izmesteva: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
affiliation Dmitry A. Yudin: Artificial Intelligence Research Institute (AIRI), Moscow, Russia (ORCID: https://orcid.org/0000-0002-1407-2633)
affiliation Alexey K. Kovalev: Artificial Intelligence Research Institute (AIRI), Moscow, Russia
affiliation Aleksandr I. Panov: Artificial Intelligence Research Institute (AIRI), Moscow, Russia (ORCID: https://orcid.org/0000-0002-9747-3837)
topic Action generation
bimodal transformer models
intelligent agent
robotic manipulator arm control
url https://ieeexplore.ieee.org/document/10323309/