Learning multimodal VAEs through mutual supervision

Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the...


Bibliographic Details
Main Authors:	Joy, T, Shi, Y, Torr, PHS, Rainforth, T, Schmon, SM, Siddharth, N
Format:	Conference item
Language:	English
Published:	OpenReview 2022

collection	OXFORD
description	Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing, something that most existing approaches either cannot handle, or do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image–image) and CUB (image–text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.
id	oxford-uuid:a96008a5-7f0e-4b7a-ae75-032c1f0fc72e
institution	University of Oxford


Similar Items