Building a Language Conditioned System for 6-DoF Tabletop Manipulation

We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting and rearrangement, and can learn new skills. The system primarily consists of th...

Bibliographic Details
Main Author:	Parakh, Meenal
Other Authors:	Agrawal, Pulkit
Format:	Thesis
Published:	Massachusetts Institute of Technology 2023
Online Access:	https://hdl.handle.net/1721.1/152838

_version_	1826196558015627264
author	Parakh, Meenal
author2	Agrawal, Pulkit
author_facet	Agrawal, Pulkit Parakh, Meenal
author_sort	Parakh, Meenal
collection	MIT
description	We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting and rearrangement, and can learn new skills. The system primarily consists of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact through carefully designed interfaces, which are also crucial contributions of this work. We further evaluate the perception and execution parts of the system and showcase performance on example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or to fine-tune one. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that incorporates vast common-sense knowledge, unlike traditional approaches. These large models have been trained on billions of data points and internet-scale data, allowing zero-shot use in our system without large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics.
first_indexed	2024-09-23T10:29:24Z
format	Thesis
id	mit-1721.1/152838
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T10:29:24Z
publishDate	2023
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/152838 2023-11-03T03:49:02Z Building a Language Conditioned System for 6-DoF Tabletop Manipulation Parakh, Meenal Agrawal, Pulkit Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The system can perform everyday pick-and-place tasks, such as sorting and rearrangement, and can learn new skills. The system primarily consists of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact through carefully designed interfaces, which are also crucial contributions of this work. We further evaluate the perception and execution parts of the system and showcase performance on example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or to fine-tune one. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that incorporates vast common-sense knowledge, unlike traditional approaches. These large models have been trained on billions of data points and internet-scale data, allowing zero-shot use in our system without large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics. M.Eng. 2023-11-02T20:21:01Z 2023-11-02T20:21:01Z 2023-09 2023-10-03T18:21:18.358Z Thesis https://hdl.handle.net/1721.1/152838 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle	Parakh, Meenal Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title	Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title_full	Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title_fullStr	Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title_full_unstemmed	Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title_short	Building a Language Conditioned System for 6-DoF Tabletop Manipulation
title_sort	building a language conditioned system for 6 dof tabletop manipulation
url	https://hdl.handle.net/1721.1/152838
work_keys_str_mv	AT parakhmeenal buildingalanguageconditionedsystemfor6doftabletopmanipulation
