Personalised CLIP or: how to find your vacation videos

Bibliographic Details
Main Authors: Korbar, B, Zisserman, A
Format: Conference item
Language: English
Published: British Machine Vision Association 2022
Description
Summary: In this paper, our goal is a person-centric model capable of retrieving the image or video corresponding to a personalized compound query from a large set of images or videos. Specifically, given a query consisting of an image of a person's face and a text scene description or action description, we retrieve images or video clips corresponding to this compound query. We make three contributions: (1) we propose a model that is able to retrieve images or videos given a personalized compound query; we achieve this by building on a pre-trained CLIP vision-text model that has compound, but general, query capabilities, and provide a mechanism to personalize it to the target person specified by their face; (2) we share a new Celebrities in Action dataset of movies with automatically generated annotations for identities, locations, and actions that can be used for evaluation of the compound-retrieval task; (3) we evaluate our model's performance on two datasets: Celebrities in Places, for compound queries of a celebrity and a scene description, and our new Celebrities in Action dataset, for compound queries of a celebrity and an action description. We demonstrate the flexibility of the model with free-form queries and compare to previous methods.
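
To make the compound-query setting concrete, the sketch below shows one simple way such a query could be scored: a CLIP text embedding ranks frames by scene/action similarity, and a separate face-similarity term personalizes the ranking to the query identity. This is only an illustrative baseline under stated assumptions, not the paper's actual personalization mechanism; the face_embed helper, the per-frame face crops, and the weighted-sum fusion are all hypothetical placeholders.

```python
# Minimal sketch of compound (face + text) retrieval with CLIP.
# Assumptions: `face_embed` is a stand-in for any off-the-shelf face
# encoder, one face crop is available per frame, and the two similarity
# scores are fused with a simple weighted sum.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def face_embed(image: Image.Image) -> torch.Tensor:
    """Hypothetical face encoder; should return an L2-normalized embedding."""
    raise NotImplementedError("plug in a real face-recognition embedder here")


def score_frames(query_face: Image.Image, query_text: str,
                 frames: list[Image.Image], frame_faces: list[Image.Image],
                 alpha: float = 0.5) -> torch.Tensor:
    """Score frames by a weighted sum of text-scene and face similarity."""
    with torch.no_grad():
        # Text branch: embed the scene/action description with CLIP.
        tokens = clip.tokenize([query_text]).to(device)
        t = model.encode_text(tokens)
        t = t / t.norm(dim=-1, keepdim=True)

        # Image branch: embed each candidate frame with CLIP.
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        v = model.encode_image(imgs)
        v = v / v.norm(dim=-1, keepdim=True)
        scene_sim = (v @ t.T).squeeze(-1)  # cosine similarity, one score per frame

        # Face branch: compare the query face to a face crop from each frame.
        q = face_embed(query_face)
        f = torch.stack([face_embed(crop) for crop in frame_faces])
        face_sim = f @ q  # cosine similarity of unit vectors

    # Fuse the two scores; alpha trades off identity vs. scene/action match.
    return alpha * face_sim + (1 - alpha) * scene_sim
```

Frames can then be ranked by the returned scores, with the top-scoring frames (or the clips containing them) retrieved for the compound query.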