Self-supervised learning of audio-visual objects from video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the au...

Bibliographic details
Main authors: Afouras, T, Owens, A, Chung, JS, Zisserman, A
Material type: Conference item
Language: English
Published: Springer 2020