Spatiotemporal interpretation features in the recognition of dynamic images

Objects and their parts can be visually recognized and localized from purely spatial information in static images and also from purely temporal information as in the perception of biological motion. Cortical regions have been identified, which appear to specialize in visual recognition based on eith...

Bibliographic Details
Main Authors:	Ben-Yosef, Guy, Kreiman, Gabriel, Ullman, Shimon
Format:	Technical Report
Language:	en_US
Published:	Center for Brains, Minds and Machines (CBMM) 2018
Online Access:	http://hdl.handle.net/1721.1/119248

_version_	1826202047234441216
author	Ben-Yosef, Guy Kreiman, Gabriel Ullman, Shimon
author_facet	Ben-Yosef, Guy Kreiman, Gabriel Ullman, Shimon
author_sort	Ben-Yosef, Guy
collection	MIT
description	Objects and their parts can be visually recognized and localized from purely spatial information in static images and also from purely temporal information as in the perception of biological motion. Cortical regions have been identified, which appear to specialize in visual recognition based on either static or dynamic cues, but the mechanisms by which spatial and temporal information is integrated are only poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by the identification of minimal spatiotemporal configurations: these are short videos in which objects and their parts, along with an action being performed, can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art computational models for recognition from dynamic images based on deep 2D and 3D convolutional networks cannot replicate human recognition in these configurations. Action recognition in minimal spatiotemporal configurations is invariably accompanied by full human interpretation of the internal components of the image and their inter-relations. We hypothesize that this gap is due to mechanisms for a full spatiotemporal interpretation process, which in human vision is an integral part of recognizing dynamic events, but is not sufficiently represented in current DNNs.
first_indexed	2024-09-23T12:01:02Z
format	Technical Report
id	mit-1721.1/119248
institution	Massachusetts Institute of Technology
language	en_US
last_indexed	2024-09-23T12:01:02Z
publishDate	2018
publisher	Center for Brains, Minds and Machines (CBMM)
record_format	dspace
spelling	mit-1721.1/119248 2019-09-12T18:29:34Z Spatiotemporal interpretation features in the recognition of dynamic images Ben-Yosef, Guy Kreiman, Gabriel Ullman, Shimon Objects and their parts can be visually recognized and localized from purely spatial information in static images and also from purely temporal information as in the perception of biological motion. Cortical regions have been identified, which appear to specialize in visual recognition based on either static or dynamic cues, but the mechanisms by which spatial and temporal information is integrated are only poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by the identification of minimal spatiotemporal configurations: these are short videos in which objects and their parts, along with an action being performed, can be reliably recognized, but any reduction in either space or time makes them unrecognizable. State-of-the-art computational models for recognition from dynamic images based on deep 2D and 3D convolutional networks cannot replicate human recognition in these configurations. Action recognition in minimal spatiotemporal configurations is invariably accompanied by full human interpretation of the internal components of the image and their inter-relations. We hypothesize that this gap is due to mechanisms for a full spatiotemporal interpretation process, which in human vision is an integral part of recognizing dynamic events, but is not sufficiently represented in current DNNs. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. 2018-11-21T19:36:07Z 2018-11-21T19:36:07Z 2018-11-21 Technical Report Working Paper Other http://hdl.handle.net/1721.1/119248 en_US CBMM Memo Series;094 application/pdf Center for Brains, Minds and Machines (CBMM)
spellingShingle	Ben-Yosef, Guy Kreiman, Gabriel Ullman, Shimon Spatiotemporal interpretation features in the recognition of dynamic images
title	Spatiotemporal interpretation features in the recognition of dynamic images
title_full	Spatiotemporal interpretation features in the recognition of dynamic images
title_fullStr	Spatiotemporal interpretation features in the recognition of dynamic images
title_full_unstemmed	Spatiotemporal interpretation features in the recognition of dynamic images
title_short	Spatiotemporal interpretation features in the recognition of dynamic images
title_sort	spatiotemporal interpretation features in the recognition of dynamic images
url	http://hdl.handle.net/1721.1/119248
work_keys_str_mv	AT benyosefguy spatiotemporalinterpretationfeaturesintherecognitionofdynamicimages AT kreimangabriel spatiotemporalinterpretationfeaturesintherecognitionofdynamicimages AT ullmanshimon spatiotemporalinterpretationfeaturesintherecognitionofdynamicimages

Similar Items