Text-conditioned resampler for long form video understanding
In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LL...
Main Authors: | , , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | Springer 2024 |