Seeing What You’re Told: Sentence-Guided Activity Recognition In Video

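We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, thereby providing a medium not only for top-down and bottom-up integration but also for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole sentential descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity videos: sentence-guided focus of attention, generation of sentential descriptions of video, and query-based video search, simply by using the framework in different ways.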
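The abstract's final claim is that one framework serves three tasks just by being used in different ways. The Python sketch below is a hypothetical illustration of that idea, not the authors' implementation. It assumes that each frame is a dict mapping detection ids to (x, y) positions, reduces a sentence to (predicate, role, role) triples, and invents three spatial predicates (`left_of`, `right_of`, `near`). All function names are illustrative. A single score is maximized over different variables: over role-to-detection bindings for focus of attention, over candidate sentences for description, and over videos for search. It deliberately ignores the grammar, nouns, adjectives, verbs, and temporal dynamics that the abstract describes.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical encoding: each frame maps a detection id to its (x, y) centre,
# and every frame is assumed to contain the same detection ids.
Point = Tuple[float, float]
Frame = Dict[str, Point]
Video = List[Frame]
# Hypothetical encoding of a sentence as (predicate, role, role) triples.
Sentence = List[Tuple[str, str, str]]

# Hypothetical per-frame spatial predicates standing in for prepositions.
PREDICATES: Dict[str, Callable[[Point, Point], bool]] = {
    "left_of": lambda a, b: a[0] < b[0],
    "right_of": lambda a, b: a[0] > b[0],
    "near": lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]) < 2.0,
}


def roles(sentence: Sentence) -> List[str]:
    """Distinct participant roles, in order of first mention."""
    seen: List[str] = []
    for _, r1, r2 in sentence:
        for r in (r1, r2):
            if r not in seen:
                seen.append(r)
    return seen


def score(video: Video, sentence: Sentence, binding: Dict[str, str]) -> int:
    """Number of (frame, predicate) pairs that hold under a role binding."""
    return sum(
        PREDICATES[name](frame[binding[r1]], frame[binding[r2]])
        for frame in video
        for name, r1, r2 in sentence
    )


def focus(video: Video, sentence: Sentence) -> Tuple[int, Dict[str, str]]:
    """Task 1: find which detections the sentence is about (max over bindings)."""
    rs = roles(sentence)
    best: Tuple[int, Dict[str, str]] = (-1, {})
    for combo in permutations(sorted(video[0]), len(rs)):
        binding = dict(zip(rs, combo))
        s = score(video, sentence, binding)
        if s > best[0]:
            best = (s, binding)
    return best


def describe(video: Video, candidates: Sequence[Sentence]) -> Sentence:
    """Task 2: choose the candidate sentence that best fits the video."""
    return max(candidates, key=lambda s: focus(video, s)[0])


def search(videos: Sequence[Video], query: Sentence) -> List[int]:
    """Task 3: rank video indices by how well they match a query sentence."""
    return sorted(range(len(videos)), key=lambda i: focus(videos[i], query)[0], reverse=True)


if __name__ == "__main__":
    # Detection "p" stays to the left of detection "q" in both frames.
    clip = [{"p": (0.0, 0.0), "q": (5.0, 0.0)}, {"p": (1.0, 0.0), "q": (4.0, 0.0)}]
    query = [("left_of", "agent", "patient")]
    print(focus(clip, query))  # (2, {'agent': 'p', 'patient': 'q'})
```

Because the binding is left free here, `left_of` and `right_of` can fit a clip equally well by swapping roles. In a fuller model, nouns and adjectives would constrain which detection may fill each role.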


Bibliographic Details
Main Authors: Siddharth, Narayanaswamy, Barbu, Andrei, Siskind, Jeffrey Mark
Format: Technical Report
Language: en_US
Published: Center for Brains, Minds and Machines (CBMM), arXiv, 2015
Subjects: Computer vision, Machine Learning, Computer Language
Online Access: http://hdl.handle.net/1721.1/100169
Institution: Massachusetts Institute of Technology
Series: CBMM Memo Series; 006
Other identifier: arXiv:1308.4189v2
Type: Technical Report; Working Paper; Other
Date issued: 2014-05-29
Date accessioned: 2015-12-10
Rights: Attribution-NonCommercial 3.0 United States (http://creativecommons.org/licenses/by-nc/3.0/us/)
File format: application/pdf
Funding: This research was supported, in part, by ARL, under Cooperative Agreement Number W911NF-10-2-0060, and the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216.