Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
Localizing audio-visual events in video requires jointly judging the visual and audio components. To integrate multimodal information, existing methods model cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in se...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-12-01 |
Series: | Applied Sciences |
Online Access: | https://www.mdpi.com/2076-3417/12/24/12622 |