Expectation-maximization contrastive learning for compact video-and-language representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that our EMCL learns more discriminative video-and-language representations than previous methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches, either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing method.
Main Authors: | Jin, P; Huang, J; Liu, F; Wu, X; Ge, S; Song, G; Clifton, DA; Chen, J |
Format: | Conference item |
Language: | English |
Published: | Curran Associates, 2023 |
author | Jin, P; Huang, J; Liu, F; Wu, X; Ge, S; Song, G; Clifton, DA; Chen, J |
collection | OXFORD |
description | Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that our EMCL learns more discriminative video-and-language representations than previous methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches, either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing method. |
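The EM-based feature decomposition the abstract describes can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `emcl_sketch`, the number of bases `K`, the temperature `lam`, and the iteration count are all assumptions made here for the sketch. It alternates an E-step (soft-assigning each feature to the bases) with an M-step (re-estimating the bases as responsibility-weighted means), then reconstructs the features as linear combinations of the learned bases, which bounds the rank of the output by `K`:

```python
import numpy as np

def emcl_sketch(X, K=4, iters=9, lam=1.0):
    """Illustrative EM-style basis finding over pooled features X of shape (N, D).

    Hypothetical sketch: K bases, temperature lam, and the update rules
    here are assumptions, not the paper's exact formulation.
    Returns a low-rank reconstruction of X (rank at most K).
    """
    rng = np.random.default_rng(0)
    mu = rng.normal(size=(K, X.shape[1]))          # initial bases (K, D)
    mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities via row-wise softmax over feature-basis similarity
        logits = X @ mu.T / lam                     # (N, K)
        Z = np.exp(logits - logits.max(axis=1, keepdims=True))
        Z /= Z.sum(axis=1, keepdims=True)
        # M-step: each basis becomes the responsibility-weighted mean of the features
        mu = (Z / Z.sum(axis=0, keepdims=True)).T @ X   # (K, D)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True)
    # Compact representation: every row of X expressed as a combination of K bases
    return Z @ mu                                   # (N, D), rank <= K
```

In this sketch the reconstruction `Z @ mu` would replace the raw features before the contrastive loss, so that similarities are computed in the compact K-basis subspace rather than the full D-dimensional space.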
format | Conference item |
id | oxford-uuid:9567dbe2-e142-42dd-a050-03f52571b446 |
institution | University of Oxford |
language | English |
publishDate | 2023 |
publisher | Curran Associates |
record_format | dspace |
title | Expectation-maximization contrastive learning for compact video-and-language representations |