Summary: | The aim of large-scale specific-object image retrieval systems is to instantly
find all images in a database that contain the query object. Current systems, for
example Google Goggles, concentrate on querying using a single view of an object, e.g. a
photo a user takes with their mobile phone, in order to answer the question “what is this?”.
Here we consider the somewhat converse problem of finding all images of an object given
that the user knows what they are looking for; so the input modality is text, not an image.
This problem is useful in a number of settings, for example media production teams are
interested in searching internal databases for images or video footage to accompany news
reports and newspaper articles.
Given a textual query (e.g. “coca cola bottle”), our approach is to first obtain multiple
images of the queried object using textual Google image search. These images are then
used to visually query the target database to discover images containing the object of
interest. We compare a number of different methods for combining the multiple query
images, including discriminative learning. We show that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of
the queried object.
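To make the combination step concrete, below is a minimal Python sketch of one simple fusion strategy: each query image obtained from the image search is scored against the database independently, and the per-database-image scores are merged by taking the maximum over queries. All names here are hypothetical, and this shows only one of the possible combination methods; the discriminative-learning variant is not sketched.

    from collections import defaultdict

    def fuse_max(per_query_scores):
        """Combine retrieval scores from multiple query images by taking,
        for each database image, the maximum similarity over all queries.
        per_query_scores is a list of dicts: {db_image_id: similarity}."""
        fused = defaultdict(float)
        for scores in per_query_scores:
            for img_id, s in scores.items():
                fused[img_id] = max(fused[img_id], s)
        # Rank database images by fused score, best first.
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

    # Hypothetical usage: three query images retrieved for a textual query,
    # each already scored against the target database by a visual search engine.
    if __name__ == "__main__":
        ranked = fuse_max([
            {"db_001": 0.91, "db_002": 0.10},
            {"db_001": 0.40, "db_003": 0.75},
            {"db_002": 0.55, "db_003": 0.20},
        ])
        print(ranked)  # [('db_001', 0.91), ('db_003', 0.75), ('db_002', 0.55)]

Max fusion is attractive when the query images cover different viewpoints of the object, since a database image only needs to match one of them well to rank highly; averaging, by contrast, rewards images that match many views at once.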
The system is evaluated quantitatively on the standard Oxford Buildings benchmark
dataset, where it achieves very high retrieval performance, and qualitatively on the
TrecVid 2011 known-item search dataset.
|