Self-supervised learning of audio-visual objects from video
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the au...
Main authors: | , , , |
---|---|
Item type: | Conference item |
Language: | English |
Published: | Springer 2020 |