Self-supervised learning of audio-visual objects from video
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the au...
Main authors: | , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | Springer, 2020 |