Seeing What You’re Told: Sentence-Guided Activity Recognition In Video

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium, not only for top-down and bottom-up integration, bu...


Bibliographic Details
Main Authors:	Siddharth, Narayanaswamy; Barbu, Andrei; Siskind, Jeffrey Mark
Format:	Technical Report
Language:	en_US
Published:	Center for Brains, Minds and Machines (CBMM), arXiv 2015
Subjects:	Computer vision; Machine Learning; Computer Language
Online Access:	http://hdl.handle.net/1721.1/100169

collection	MIT
description	We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium not only for top-down and bottom-up integration but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole sentential descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by leveraging the framework in different manners.
first_indexed	2024-09-23T07:58:53Z
id	mit-1721.1/100169
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T07:58:53Z
record_format	dspace
spelling	mit-1721.1/100169
	2019-04-09T15:55:25Z
	This research was supported, in part, by ARL, under Cooperative Agreement Number W911NF-10-2-0060, and the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216.
	2015-12-10T17:56:01Z
	2014-05-29
	Technical Report
	Working Paper
	Other
	arXiv:1308.4189v2
	CBMM Memo Series;006
	Attribution-NonCommercial 3.0 United States
	http://creativecommons.org/licenses/by-nc/3.0/us/
	application/pdf


Similar Items