Building a Language Conditioned System for 6-DoF Tabletop Manipulation
We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The tasks that the system can perform include everyday pick-and-place tasks, such as sorting or rearrangement, and the system can also learn new skills. The system primarily consists of th...
Main Author: | Parakh, Meenal |
---|---|
Other Authors: | Agrawal, Pulkit |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/152838 |
_version_ | 1826196558015627264 |
---|---|
author | Parakh, Meenal |
author2 | Agrawal, Pulkit |
author_facet | Agrawal, Pulkit Parakh, Meenal |
author_sort | Parakh, Meenal |
collection | MIT |
description | We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The tasks that the system can perform include everyday pick-and-place tasks, such as sorting or rearrangement, and the system can also learn new skills. The system primarily consists of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact with each other through carefully designed interfaces, which are also crucial contributions of this work. We further evaluate different parts of the system, belonging to perception and execution, and showcase performance on some example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or for fine-tuning. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that incorporates vast common-sense knowledge, as opposed to traditional approaches. These large models have been trained on billions of data points and internet-scale data, allowing for zero-shot application in our system with no need for large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics. |
first_indexed | 2024-09-23T10:29:24Z |
format | Thesis |
id | mit-1721.1/152838 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T10:29:24Z |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/152838 2023-11-03T03:49:02Z Building a Language Conditioned System for 6-DoF Tabletop Manipulation Parakh, Meenal Agrawal, Pulkit Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science We present a full-stack modular system for solving tabletop manipulation tasks from natural language task descriptions. The tasks that the system can perform include everyday pick-and-place tasks, such as sorting or rearrangement, and the system can also learn new skills. The system primarily consists of three components: perception, planning, and execution, each of which exploits recent advances in large machine learning models developed for particular tasks. The three components interact with each other through carefully designed interfaces, which are also crucial contributions of this work. We further evaluate different parts of the system, belonging to perception and execution, and showcase performance on some example tasks, both in the real world and in simulation. The main advantage of a modular system is that no training data is required, either to train an end-to-end model or for fine-tuning. Further, recent advances in large models such as Segment Anything and GPT-4 make it possible to construct a modular system that incorporates vast common-sense knowledge, as opposed to traditional approaches. These large models have been trained on billions of data points and internet-scale data, allowing for zero-shot application in our system with no need for large-scale data collection. Building such modular systems has the potential to minimize the labor and time spent on data collection in robotics. M.Eng. 2023-11-02T20:21:01Z 2023-11-02T20:21:01Z 2023-09 2023-10-03T18:21:18.358Z Thesis https://hdl.handle.net/1721.1/152838 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Parakh, Meenal Building a Language Conditioned System for 6-DoF Tabletop Manipulation |
title | Building a Language Conditioned System for 6-DoF Tabletop Manipulation |
title_sort | building a language conditioned system for 6 dof tabletop manipulation |
url | https://hdl.handle.net/1721.1/152838 |
work_keys_str_mv | AT parakhmeenal buildingalanguageconditionedsystemfor6doftabletopmanipulation |