Learning multimodal VAEs through mutual supervision

Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing, something that most existing approaches either cannot handle or handle only to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image–image) and CUB (image–text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.
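
As a point of reference for the "explicit products, mixtures, or other such factorisations" mentioned in the abstract, the sketch below shows a product-of-experts (PoE) fusion of per-modality Gaussian posteriors of the kind used by prior multimodal VAEs. It is not the MEME method described in this record; the function and variable names (e.g. `poe_gaussian`) are purely illustrative assumptions.

```python
# Minimal numpy sketch (illustrative only): explicit product-of-experts fusion
# of per-modality Gaussian posteriors in the recognition model, the kind of
# combination the abstract attributes to prior work. MEME avoids this step.
import numpy as np

def poe_gaussian(mus, logvars):
    """Product of Gaussian experts plus a standard-normal prior expert.

    Precisions add; the joint mean is the precision-weighted average of means.
    mus, logvars: per-modality posterior parameters, each of shape (d,).
    Returns the joint (mu, logvar).
    """
    d = mus[0].shape[0]
    all_mus = [np.zeros(d)] + list(mus)                 # prior expert N(0, I)
    all_prec = [np.ones(d)] + [np.exp(-lv) for lv in logvars]
    prec = np.sum(all_prec, axis=0)                     # precisions add under a product
    mu = np.sum([p * m for p, m in zip(all_prec, all_mus)], axis=0) / prec
    return mu, -np.log(prec)                            # variance = 1 / precision

# Hypothetical usage: fuse image- and text-encoder posteriors over a 4-d latent.
mu_img, lv_img = np.array([0.5, -1.0, 0.0, 2.0]), np.zeros(4)
mu_txt, lv_txt = np.array([0.3, -0.8, 0.1, 1.5]), np.full(4, 0.5)
mu_joint, lv_joint = poe_gaussian([mu_img, mu_txt], [lv_img, lv_txt])
print(mu_joint, lv_joint)
```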

Bibliographic Details
Main Authors: Joy, T, Shi, Y, Torr, PHS, Rainforth, T, Schmon, SM, Siddharth, N
Format: Conference item
Language: English
Published: OpenReview, 2022
Institution: University of Oxford
Identifier: oxford-uuid:a96008a5-7f0e-4b7a-ae75-032c1f0fc72e