Expectation-maximization contrastive learning for compact video-and-language representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are not often optimal, and the...
Main Authors: | , , , , , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: |
Curran Associates
2023
|