Building a Language Conditioned System for 6-DoF Tabletop Manipulation


Bibliographic Details
Main Author: Parakh, Meenal
Other Authors: Agrawal, Pulkit
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/152838
Description: We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting or rearrangement, and can learn new skills. It consists primarily of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact through carefully designed interfaces, which are also crucial contributions of this work. We further evaluate the perception and execution parts of the system and showcase performance on example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or for fine-tuning. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that, unlike traditional approaches, incorporates vast common-sense knowledge. These large models have been trained on billions of data points and internet-scale data, allowing zero-shot use in our system without large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Date: 2023-09
Rights: In Copyright - Educational Use Permitted; Copyright retained by author(s)
Rights URI: https://rightsstatements.org/page/InC-EDU/1.0/
File Format: application/pdf