Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments


Bibliographic Details
Main Authors: Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, Aleksandr I. Panov
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control
Online Access: https://ieeexplore.ieee.org/document/10323309/
collection DOAJ
description In this work, we propose and investigate an original approach to using a pre-trained multimodal transformer with a specialized architecture for controlling a robotic agent in a language-instructed object manipulation task; we refer to the resulting model as RozumFormer. The model is based on a bimodal (text-image) transformer architecture originally trained on tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, and text-to-image generation. We adapt the model to robotic manipulation by organizing the input as a single sequence of text, image, and action tokens. We demonstrate that such a model adapts well to new tasks and that fine-tuning yields better results than training from scratch in both simulated and real environments. To transfer the model from the simulator to a real robot, new datasets were collected and annotated. In addition, experiments controlling the agent in a visual environment with reinforcement learning show that fine-tuning on a mixed dataset that includes examples from the original visual-linguistic tasks only slightly decreases performance on those tasks, simplifying the addition of tasks from another domain.
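The unified token-sequence layout the abstract describes (text, image, and action tokens flattened into one transformer input) can be illustrated with a small, hypothetical sketch. All names, vocabulary sizes, offsets, and special tokens below are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of a unified text-image-action token sequence,
# in the spirit of the layout described in the abstract.
# Vocabulary sizes, offsets, and special tokens are assumptions.

TEXT_VOCAB = 32000   # assumed subword vocabulary size
IMG_VOCAB = 8192     # assumed discrete image-token codebook size
ACT_BINS = 256       # assumed per-dimension action discretization

IMG_OFFSET = TEXT_VOCAB               # image ids placed after text ids
ACT_OFFSET = TEXT_VOCAB + IMG_VOCAB   # action ids placed after image ids

BOS, SEP = 0, 1  # assumed special tokens

def build_sequence(text_ids, image_ids, action_bins):
    """Flatten instruction, observation, and action tokens into one sequence."""
    seq = [BOS]
    seq += text_ids                                   # language instruction
    seq.append(SEP)
    seq += [IMG_OFFSET + t for t in image_ids]        # image observation
    seq.append(SEP)
    seq += [ACT_OFFSET + b for b in action_bins]      # actions to predict
    return seq

seq = build_sequence([5, 17, 9], [3, 4], [128, 7])
```

Mapping each modality into a disjoint id range lets a single embedding table and output head serve all three token types, which is one common way such bimodal models are extended with actions.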
format Article
id doaj.art-a1dbf8dab5934c439c2f37832a1dd9bb
institution Directory Open Access Journal
issn 2169-3536
language English
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2023.3334791
volume 11
pages 130548-130559
affiliation Aleksei Staroverov: Artificial Intelligence Research Institute (AIRI), Moscow, Russia
affiliation Andrey S. Gorodetsky: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia (ORCID: https://orcid.org/0009-0007-0763-9455)
affiliation Andrei S. Krishtopik: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
affiliation Uliana A. Izmesteva: Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia
affiliation Dmitry A. Yudin: Artificial Intelligence Research Institute (AIRI), Moscow, Russia (ORCID: https://orcid.org/0000-0002-1407-2633)
affiliation Alexey K. Kovalev: Artificial Intelligence Research Institute (AIRI), Moscow, Russia
affiliation Aleksandr I. Panov: Artificial Intelligence Research Institute (AIRI), Moscow, Russia (ORCID: https://orcid.org/0000-0002-9747-3837)
topic Action generation
bimodal transformer models
intelligent agent
robotic manipulator arm control
url https://ieeexplore.ieee.org/document/10323309/