Expectation-maximization contrastive learning for compact video-and-language representations

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are not often optimal, and the...

Bibliographic Details
Main Authors: Jin, P, Huang, J, Liu, F, Wu, X, Ge, S, Song, G, Clifton, DA, Chen, J
Format: Conference item
Language: English
Published: Curran Associates 2023