Text-conditioned resampler for long form video understanding

In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LL...

Bibliographic Details
Main Authors:	Korbar, B, Xian, Y, Tonioni, A, Zisserman, A, Tombari, F
Format:	Conference item
Language:	English
Published:	Springer 2024

Similar Items