Expectation-maximization contrastive learning for compact video-and-language representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the...
Main Authors: | Jin, P, Huang, J, Liu, F, Wu, X, Ge, S, Song, G, Clifton, DA, Chen, J |
---|---|
Format: | Conference item |
Language: | English |
Published: | Curran Associates, 2023 |
Similar Items
- Expectation-Maximization via Pretext-Invariant Representations
  by: Chingis Oinar, et al.
  Published: (2023-01-01)
- Self-supervised contrastive video-speech representation learning for ultrasound
  by: Jiao, J, et al.
  Published: (2020)
- Gender Representation of Flouting Maxim in Classroom Interaction Videos on Youtube
  by: Oktazsya Marjelina Lorenza, et al.
  Published: (2023-02-01)
- Learning Audio-Video Language Representations
  by: Rouditchenko, Andrew
  Published: (2022)
- Stochastic expectation maximization with variance reduction
  by: Chen, J, et al.
  Published: (2018)