Summary: | This report explores improving video grounding with automatically generated captions, addressing the challenge posed by sparse annotations in video datasets. We took inspiration from the PCNet model, which uses caption-guided attention to fuse captions with feature maps and thereby provide prior knowledge for training; the captions are generated by the dense video captioning model PDVC and selected by the Non-Prompt Caption Suppression (NPCS) algorithm. Our model is also inspired by the 2D-TAN model, which uses a 2D temporal map to capture the temporal relations between moments. We built our modified model on the open-source 2D-TAN codebase and evaluated it on several popular datasets. Although our approach does not surpass the accuracy reported for 2D-TAN and PCNet, it improves on some other baselines. This study underlines the potential of automatically generated captions to enrich video grounding models, as well as some limitations of the approach, paving the way for more effective multimedia content understanding.
|