Use what you have: Video retrieval using representations from collaborative experts

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets ‘in the wild’ vary widely in their degree of specificity, with some queries describing ‘specific details’ such as the names of famous identities, content from speech, or text visible on screen. Our goal is to condense the multi-modal, extremely high-dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this, we exploit existing knowledge in the form of pre-trained semantic embeddings, which include ‘general’ features such as motion, appearance, and scene features from visual content, together with more ‘specific’ cues from automatic speech recognition (ASR) and optical character recognition (OCR), which may not always be available but allow for more fine-grained disambiguation when present. We propose a collaborative experts model to aggregate information effectively from these different pre-trained experts. The effectiveness of our approach is demonstrated empirically, setting new state-of-the-art performance on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet, while using fewer parameters than prior work. Code and data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
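
For a concrete picture of the approach described in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation (the official code is at the URL above): each pre-extracted expert feature is projected into a joint space shared with the query embedding, per-expert similarities are computed, and a text-conditioned mixture combines them while renormalising over whichever experts (e.g. ASR/OCR) a given video actually has. All class, argument, and dimension names here (SimpleCollaborativeAggregator, joint_dim, availability) are illustrative assumptions.

    # Minimal sketch of text-conditioned expert aggregation for
    # text-video retrieval. NOT the authors' implementation; names
    # and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleCollaborativeAggregator(nn.Module):
        def __init__(self, expert_dims, text_dim, joint_dim=256):
            super().__init__()
            self.experts = list(expert_dims)
            # One projection head per expert stream (motion, appearance,
            # scene, ASR, OCR, ...) into a shared joint space.
            self.video_proj = nn.ModuleDict(
                {n: nn.Linear(d, joint_dim) for n, d in expert_dims.items()})
            # The query is projected once per expert so that similarities
            # can be computed expert-by-expert and then mixed.
            self.text_proj = nn.ModuleDict(
                {n: nn.Linear(text_dim, joint_dim) for n in expert_dims})
            # Scalar mixture logits predicted from the query embedding.
            self.mixer = nn.Linear(text_dim, len(expert_dims))

        def forward(self, video_feats, text_feat, availability):
            # video_feats: {name: (B, dim)}; availability: {name: (B,)}
            # in {0., 1.} marking which videos have that expert (ASR/OCR
            # may be missing); text_feat: (B, text_dim).
            sims, masks = [], []
            for name in self.experts:
                v = F.normalize(self.video_proj[name](video_feats[name]), dim=-1)
                t = F.normalize(self.text_proj[name](text_feat), dim=-1)
                sims.append(t @ v.t())              # (B_text, B_video) cosine
                masks.append(availability[name])    # (B_video,)
            sims = torch.stack(sims, dim=-1)        # (B, B, n_experts)
            mask = torch.stack(masks, dim=-1).unsqueeze(0)  # (1, B, n_experts)
            logits = self.mixer(text_feat).unsqueeze(1)     # (B, 1, n_experts)
            # log(0) = -inf zeroes out missing experts after the softmax,
            # renormalising the mixture over the experts each video has
            # (visual experts are assumed always present).
            weights = (logits + mask.log()).softmax(dim=-1)
            return (sims * weights).sum(dim=-1)     # (B, B) retrieval scores

A toy usage with random features, again purely illustrative:

    # Two expert streams; the second video lacks ASR, so its score
    # relies on appearance alone.
    experts = {"appearance": 2048, "asr": 300}
    model = SimpleCollaborativeAggregator(experts, text_dim=768)
    vids = {"appearance": torch.randn(4, 2048), "asr": torch.randn(4, 300)}
    avail = {"appearance": torch.ones(4), "asr": torch.tensor([1., 0., 1., 1.])}
    scores = model(vids, torch.randn(4, 768), avail)  # (4, 4) query-video scores

The paper's own aggregation is richer (its collaborative gating lets each expert's features be modulated by the others before fusion), but the availability-aware mixture above captures the ‘use what you have’ behaviour the title refers to.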

Bibliographic Details
Main Authors: Liu, Y, Albanie, S, Nagrani, A, Zisserman, A
Format: Conference item
Language: English
Published: British Machine Vision Association 2020
Collection: OXFORD
ID: oxford-uuid:502da19a-2a9c-45f4-95f0-ee09ecf77340
Institution: University of Oxford