Composing Visual Relations with Composable Diffusion Models
Humans are able to build complex representations of our world – representing the world as compositional combinations of both objects and their interdependent relations. Recent work in text-guided diffusion models has produced impressive results in generating photorealistic images, but such models o...
Main Author: | Wei, Megan |
---|---|
Other Authors: | Tenenbaum, Joshua B. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/151470 |
_version_ | 1826197183720849408 |
---|---|
author | Wei, Megan |
author2 | Tenenbaum, Joshua B. |
author_facet | Tenenbaum, Joshua B. Wei, Megan |
author_sort | Wei, Megan |
collection | MIT |
description | Humans are able to build complex representations of our world – representing the world as compositional combinations of both objects and their interdependent relations. Recent work in text-guided diffusion models has produced impressive results in generating photorealistic images, but such models often fail to capture spatial relationships between objects, and often generate scenes in which individual specified relations are captured incorrectly. An underlying cause is that such models are not explicitly compositional – when given a relational text description such as "fork on plate" or "plate on fork", models regress to generating previously seen images, and only generate images with a fork on a plate. We propose an approach to more accurately capture relations by decomposing the image probability density as a hierarchical product between a lifted density representing abstract relations between objects and individual densities representing each object. We illustrate how this approach is simple to implement in practice and enables us to scale to accurately capture relations between objects across simulated and realistic scenes. |
first_indexed | 2024-09-23T10:43:46Z |
format | Thesis |
id | mit-1721.1/151470 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T10:43:46Z |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/151470 2023-08-01T04:02:20Z Wei, Megan Tenenbaum, Joshua B. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2023-07-31T19:42:19Z 2023-07-31T19:42:19Z 2023-06 2023-06-06T16:34:40.489Z Thesis https://hdl.handle.net/1721.1/151470 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
title | Composing Visual Relations with Composable Diffusion Models |
url | https://hdl.handle.net/1721.1/151470 |
work_keys_str_mv | AT weimegan composingvisualrelationswithcomposablediffusionmodels |
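
The abstract describes decomposing the image density as a product of a relation density and per-object densities. A minimal sketch of why such a product composes easily, assuming toy Gaussian factors in place of learned diffusion models: the log of a product is a sum of logs, so the score (gradient of the log density) of the product is the sum of the factor scores. Everything named here (`gaussian_score`, `composed_score`, `find_mode`, the example means and variances) is hypothetical and is not taken from the thesis. The noise-free gradient ascent below stands in for a diffusion sampler only to keep the example deterministic.

```python
import numpy as np

def gaussian_score(x, mean, var):
    """Score (gradient of the log density) of an isotropic Gaussian N(mean, var * I)."""
    return -(x - mean) / var

def composed_score(x, relation, objects):
    """Score of the unnormalized product p_rel(x) * prod_i p_obj_i(x).

    Because log of a product is a sum of logs, the factor scores add.
    """
    total = gaussian_score(x, *relation)
    for obj in objects:
        total = total + gaussian_score(x, *obj)
    return total

def find_mode(relation, objects, steps=500, step_size=0.05):
    """Follow the composed score by plain gradient ascent (no noise, for determinism)."""
    x = np.zeros(2)
    for _ in range(steps):
        x = x + step_size * composed_score(x, relation, objects)
    return x

# Hypothetical toy factors: (mean, variance) pairs standing in for learned densities.
relation = (np.array([1.0, 0.0]), 1.0)
objects = [(np.array([0.0, 2.0]), 0.5), (np.array([-1.0, 1.0]), 2.0)]
mode = find_mode(relation, objects)
```

For Gaussian factors the product is itself Gaussian, so `mode` should land on the precision-weighted average of the factor means. With learned models there is no such closed form, and the summed score would instead drive an iterative sampler.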