Composing Visual Relations with Composable Diffusion Models

Bibliographic Details
Main Author: Wei, Megan
Other Authors: Tenenbaum, Joshua B.
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/151470
author Wei, Megan
author2 Tenenbaum, Joshua B.
collection MIT
description Humans are able to build complex representations of our world – representing the world as compositional combinations of both objects and their interdependent relations. Recent work in text-guided diffusion models has produced impressive results in generating photorealistic images, but such models often fail to capture spatial relationships between objects and will often generate scenes where individual specified relations are incorrectly captured. An underlying cause is that such models are not explicitly compositional – when given a relational text description such as "fork on plate" or "plate on fork", models will regress to generating previously seen images and will only generate images with a fork on a plate. We propose an approach to more accurately capture relations by decomposing the image probability density as a hierarchical product between a lifted density representing abstract relations between objects and individual densities representing each object. We illustrate how this approach is simple to implement in practice and enables us to scale to accurately capture relations between objects across simulated and realistic scenes.
first_indexed 2024-09-23T10:43:46Z
format Thesis
id mit-1721.1/151470
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T10:43:46Z
publishDate 2023
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/151470 2023-08-01T04:02:20Z Composing Visual Relations with Composable Diffusion Models Wei, Megan Tenenbaum, Joshua B. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2023-07-31T19:42:19Z 2023-07-31T19:42:19Z 2023-06 2023-06-06T16:34:40.489Z Thesis https://hdl.handle.net/1721.1/151470 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title Composing Visual Relations with Composable Diffusion Models
url https://hdl.handle.net/1721.1/151470
work_keys_str_mv AT weimegan composingvisualrelationswithcomposablediffusionmodels
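
The description in this record proposes factoring the image density into a product of a relation density and per-object densities. A general property of such products is that the log-densities add, so the score (gradient of the log-density) of the product is the sum of the factor scores. The sketch below illustrates only that property on a toy problem and is not the thesis's implementation: the 1-D Gaussians stand in for learned densities, and the names `gaussian_score`, `composed_score`, and `factors` are hypothetical.

```python
def gaussian_score(x, mean, std):
    """Score (d/dx of the log-density) of a 1-D Gaussian at x."""
    return -(x - mean) / std**2


def composed_score(x, factors):
    """Score of an unnormalized product of densities: the sum of the factor scores."""
    return sum(gaussian_score(x, mean, std) for mean, std in factors)


# Assumed toy factors as (mean, std): one "relation" density, two "object" densities.
factors = [(0.0, 1.0), (2.0, 0.5), (1.0, 2.0)]

# Gradient ascent on the composed log-density converges to the mode of the product.
x = 0.0
for _ in range(2000):
    x += 0.01 * composed_score(x, factors)

# Closed-form check: a product of Gaussians peaks at the precision-weighted mean.
total_precision = sum(1.0 / std**2 for _, std in factors)
expected_mode = sum(mean / std**2 for mean, std in factors) / total_precision

print(f"ascent result: {x:.4f}, closed form: {expected_mode:.4f}")
```

The ascent result and the closed-form mode agree, which shows that the composed score steers samples toward points every factor considers likely. A diffusion model would replace each analytic score with a learned network output evaluated at each noise level.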