Use what you have: Video retrieval using representations from collaborative experts

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets ‘in the wild’ vary widely in their degree of specificity, with some queries describing ‘specific details’ such as the names of famous identities, content from speech, or text visible on screen. Our goal is to condense the multi-modal, extremely high-dimensional information from videos into a single, compact video representation for the task of video retrieval using free-form text queries, where the degree of specificity is open-ended. For this, we exploit existing knowledge in the form of pre-trained semantic embeddings, which include ‘general’ features such as motion, appearance, and scene features from visual content, together with more ‘specific’ cues from automatic speech recognition (ASR) and optical character recognition (OCR), which may not always be available but allow for more fine-grained disambiguation when present. We propose a collaborative experts model to aggregate information effectively from these different pre-trained experts. The effectiveness of our approach is demonstrated empirically, setting new state-of-the-art performance on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet, while using fewer parameters than prior work. Code and data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
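
For a concrete picture of the approach described in the abstract, the following is a minimal PyTorch sketch, not the authors' implementation (the official code is at the URL above): each pre-extracted expert feature is projected into a joint space shared with the query embedding, per-expert similarities are computed, and a text-conditioned mixture combines them while renormalising over whichever experts (e.g. ASR/OCR) a given video actually has. All class, argument, and dimension names here (SimpleCollaborativeAggregator, joint_dim, availability) are illustrative assumptions.

    # Minimal sketch of text-conditioned expert aggregation for
    # text-video retrieval. NOT the authors' implementation; names
    # and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleCollaborativeAggregator(nn.Module):
        def __init__(self, expert_dims, text_dim, joint_dim=256):
            super().__init__()
            self.experts = list(expert_dims)
            # One projection head per expert stream (motion, appearance,
            # scene, ASR, OCR, ...) into a shared joint space.
            self.video_proj = nn.ModuleDict(
                {n: nn.Linear(d, joint_dim) for n, d in expert_dims.items()})
            # The query is projected once per expert so that similarities
            # can be computed expert-by-expert and then mixed.
            self.text_proj = nn.ModuleDict(
                {n: nn.Linear(text_dim, joint_dim) for n in expert_dims})
            # Scalar mixture logits predicted from the query embedding.
            self.mixer = nn.Linear(text_dim, len(expert_dims))

        def forward(self, video_feats, text_feat, availability):
            # video_feats: {name: (B, dim)}; availability: {name: (B,)}
            # in {0., 1.} marking which videos have that expert (ASR/OCR
            # may be missing); text_feat: (B, text_dim).
            sims, masks = [], []
            for name in self.experts:
                v = F.normalize(self.video_proj[name](video_feats[name]), dim=-1)
                t = F.normalize(self.text_proj[name](text_feat), dim=-1)
                sims.append(t @ v.t())              # (B_text, B_video) cosine
                masks.append(availability[name])    # (B_video,)
            sims = torch.stack(sims, dim=-1)        # (B, B, n_experts)
            mask = torch.stack(masks, dim=-1).unsqueeze(0)  # (1, B, n_experts)
            logits = self.mixer(text_feat).unsqueeze(1)     # (B, 1, n_experts)
            # log(0) = -inf zeroes out missing experts after the softmax,
            # renormalising the mixture over the experts each video has
            # (visual experts are assumed always present).
            weights = (logits + mask.log()).softmax(dim=-1)
            return (sims * weights).sum(dim=-1)     # (B, B) retrieval scores

A toy usage with random features, again purely illustrative:

    # Two expert streams; the second video lacks ASR, so its score
    # relies on appearance alone.
    experts = {"appearance": 2048, "asr": 300}
    model = SimpleCollaborativeAggregator(experts, text_dim=768)
    vids = {"appearance": torch.randn(4, 2048), "asr": torch.randn(4, 300)}
    avail = {"appearance": torch.ones(4), "asr": torch.tensor([1., 0., 1., 1.])}
    scores = model(vids, torch.randn(4, 768), avail)  # (4, 4) query-video scores

The paper's own aggregation is richer (its collaborative gating lets each expert's features be modulated by the others before fusion), but the availability-aware mixture above captures the ‘use what you have’ behaviour the title refers to.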

Bibliographic Details
Main Authors: Liu, Y, Albanie, S, Nagrani, A, Zisserman, A
Format: Conference item
Language: English
Published: British Machine Vision Association 2020
Collection: OXFORD
ID: oxford-uuid:502da19a-2a9c-45f4-95f0-ee09ecf77340
Institution: University of Oxford