Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments
In this work, we propose and investigate an original approach to using a pre-trained multimodal transformer of a specialized architecture for controlling a robotic agent in an object manipulation task based on language instruction, which we refer to as RozumFormer. Our model is based on a bimodal (text-image) transformer architecture originally trained for solving tasks that use one or both modalities.
Main Authors: | Aleksei Staroverov, Andrey S. Gorodetsky, Andrei S. Krishtopik, Uliana A. Izmesteva, Dmitry A. Yudin, Alexey K. Kovalev, Aleksandr I. Panov |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control |
Online Access: | https://ieeexplore.ieee.org/document/10323309/ |
_version_ | 1797473526604627968 |
---|---|
author | Aleksei Staroverov; Andrey S. Gorodetsky; Andrei S. Krishtopik; Uliana A. Izmesteva; Dmitry A. Yudin; Alexey K. Kovalev; Aleksandr I. Panov |
author_facet | Aleksei Staroverov; Andrey S. Gorodetsky; Andrei S. Krishtopik; Uliana A. Izmesteva; Dmitry A. Yudin; Alexey K. Kovalev; Aleksandr I. Panov |
author_sort | Aleksei Staroverov |
collection | DOAJ |
description | In this work, we propose and investigate an original approach to using a pre-trained multimodal transformer of a specialized architecture for controlling a robotic agent in an object manipulation task based on language instruction, which we refer to as RozumFormer. Our model is based on a bimodal (text-image) transformer architecture originally trained for solving tasks that use one or both modalities, such as language modeling, visual question answering, image captioning, text recognition, text-to-image generation, etc. The discussed model has been adapted for robotic manipulation tasks by organizing the input sequence of tokens in a particular way, consisting of tokens for text, images, and actions. We have demonstrated that such a model adapts well to new tasks and shows better results with fine-tuning than complete training in simulation and real environments. To transfer the model from the simulator to a real robot, new datasets were collected and annotated. In addition, experiments controlling the agent in a visual environment using reinforcement learning have shown that fine-tuning the model with a mixed dataset that includes examples from the initial visual-linguistic tasks only slightly decreases performance on these tasks, simplifying the addition of tasks from another domain. |
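The description above states that the model was adapted to manipulation by organizing its input as a single sequence of text, image, and action tokens. A minimal, purely illustrative sketch of such a layout follows; the function name, separator token, and token IDs are assumptions for illustration, not details from the article:

```python
# Illustrative sketch (hypothetical): flattening text, image, and action
# token streams into the single input sequence a bimodal transformer
# consumes. The separator ID and ordering are assumptions, not the
# article's actual implementation.

def build_input_sequence(text_tokens, image_tokens, action_tokens, sep_id=0):
    """Concatenate the three modality streams, separated by sep_id."""
    return text_tokens + [sep_id] + image_tokens + [sep_id] + action_tokens

# Example: two text tokens, three image-patch tokens, one action token.
seq = build_input_sequence([5, 7], [101, 102, 103], [201])
print(seq)  # [5, 7, 0, 101, 102, 103, 0, 201]
```

At inference, a model trained on such sequences would autoregressively generate the trailing action tokens conditioned on the instruction and image prefix.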
first_indexed | 2024-03-09T20:15:44Z |
format | Article |
id | doaj.art-a1dbf8dab5934c439c2f37832a1dd9bb |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-09T20:15:44Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-a1dbf8dab5934c439c2f37832a1dd9bb (2023-11-24T00:00:52Z). English. IEEE, IEEE Access, ISSN 2169-3536, 2023-01-01, vol. 11, pp. 130548-130559. DOI: 10.1109/ACCESS.2023.3334791. Document 10323309. Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments. Authors: Aleksei Staroverov (Artificial Intelligence Research Institute (AIRI), Moscow, Russia); Andrey S. Gorodetsky (ORCID 0009-0007-0763-9455; Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia); Andrei S. Krishtopik (Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia); Uliana A. Izmesteva (Center of Cognitive Modeling, Moscow Institute of Physics and Technology, Dolgoprudny, Russia); Dmitry A. Yudin (ORCID 0000-0002-1407-2633; AIRI, Moscow, Russia); Alexey K. Kovalev (AIRI, Moscow, Russia); Aleksandr I. Panov (ORCID 0000-0002-9747-3837; AIRI, Moscow, Russia). Abstract: see the description field above. https://ieeexplore.ieee.org/document/10323309/ Keywords: action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control |
spellingShingle | Aleksei Staroverov; Andrey S. Gorodetsky; Andrei S. Krishtopik; Uliana A. Izmesteva; Dmitry A. Yudin; Alexey K. Kovalev; Aleksandr I. Panov. Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments. IEEE Access. Action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control |
title | Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments |
title_full | Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments |
title_fullStr | Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments |
title_full_unstemmed | Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments |
title_short | Fine-Tuning Multimodal Transformer Models for Generating Actions in Virtual and Real Environments |
title_sort | fine tuning multimodal transformer models for generating actions in virtual and real environments |
topic | Action generation; bimodal transformer models; intelligent agent; robotic manipulator arm control |
url | https://ieeexplore.ieee.org/document/10323309/ |
work_keys_str_mv | AT alekseistaroverov finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT andreysgorodetsky finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT andreiskrishtopik finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT ulianaaizmesteva finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT dmitryayudin finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT alexeykkovalev finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments AT aleksandripanov finetuningmultimodaltransformermodelsforgeneratingactionsinvirtualandrealenvironments |