Expectation-maximization contrastive learning for compact video-and-language representations

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the...

Bibliographic Details
Main Authors:	Jin, P; Huang, J; Liu, F; Wu, X; Ge, S; Song, G; Clifton, DA; Chen, J
Format:	Conference item
Language:	English
Published:	Curran Associates 2023

Similar Items