Matching words and pictures

Full description

We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and with particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held-out data and present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. We can use annotation performance as a proxy measure, but accurate measurement requires hand-labeled data, and thus must occur on a smaller scale. We show results using both an annotation proxy and manually labeled data.
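The region-naming approach described above treats recognition as translation between two discrete vocabularies. As a concrete illustration only (the sketch below is not the authors' code), the following Python fragment shows an IBM-Model-1-style EM loop estimating translation probabilities p(word | blob), assuming segmented regions have already been vector-quantized into discrete "blob" tokens as in the paper's setup. The function names, uniform initialization, and fixed iteration count are assumptions made for the sketch.

    # Minimal sketch of a translation-style word/region model.
    # Assumed names and simplified setup; not the authors' implementation.
    from collections import defaultdict

    def train_translation_model(corpus, n_iters=20):
        """corpus: list of (blobs, words) pairs, one pair per annotated
        image, where blobs and words are lists of discrete tokens."""
        # Start from a uniform table t(w | b); only pairs that co-occur
        # in some image ever receive probability mass.
        t = defaultdict(lambda: 1.0)
        for _ in range(n_iters):
            count = defaultdict(float)  # expected counts c(w, b)
            total = defaultdict(float)  # expected counts c(b)
            for blobs, words in corpus:
                for w in words:
                    # E-step: each word is explained by a mixture over the
                    # image's blobs, in proportion to the current t(w | b).
                    z = sum(t[(w, b)] for b in blobs)
                    for b in blobs:
                        p = t[(w, b)] / z
                        count[(w, b)] += p
                        total[b] += p
            # M-step: renormalize expected counts into the new table.
            t = defaultdict(float,
                            {(w, b): c / total[b]
                             for (w, b), c in count.items()})
        return t

    def annotate(blobs, t, vocab, top_k=5):
        """Auto-annotation: rank vocabulary words by their total
        translation probability over the image's blobs."""
        scores = {w: sum(t[(w, b)] for b in blobs) for w in vocab}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

Given a held-out image's blobs, annotate scores each vocabulary word by summing its translation probabilities over the blobs; comparing the top-scoring words against the held-out keywords gives a simple annotation measure in the spirit of, though cruder than, the measures the paper develops.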

Bibliographic details
Main authors: Barnard, K; Duygulu, P; Forsyth, D; de Freitas, N; Blei, D; Jordan, M
Format: Journal article
Published / Created: 2003
Institution: University of Oxford
Department: Department of Computer Science
Collection: OXFORD
Record id: oxford-uuid:0ffa90fe-4c47-49ca-a13a-66856af15835