Video action transformer network

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state of the art by a significant margin using only raw RGB frames as input.
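The abstract describes the model's central mechanism: a class-agnostic query derived from a detected person attends over spatiotemporal context features from the surrounding clip. The following is a minimal, hypothetical sketch of such an attention unit in PyTorch; the class name, dimensions, and tensor layout are illustrative assumptions and not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the idea in the abstract:
# a Transformer-style unit in which a person-specific, class-agnostic query
# attends over flattened spatiotemporal context features.
import torch
import torch.nn as nn


class ActionTransformerUnitSketch(nn.Module):
    """One attention unit: a person query attends to spatiotemporal context."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, person_query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # person_query: (B, 1, dim)      -- query from one person's box feature
        # context:      (B, T*H*W, dim)  -- flattened video feature map
        attended, _ = self.attn(person_query, context, context)
        x = self.norm1(person_query + attended)  # residual connection + norm
        x = self.norm2(x + self.ffn(x))          # position-wise feed-forward
        return x                                 # updated person representation


if __name__ == "__main__":
    B, dim = 2, 128
    query = torch.randn(B, 1, dim)               # e.g. RoI-pooled person feature
    context = torch.randn(B, 16 * 14 * 14, dim)  # e.g. clip feature map, flattened
    out = ActionTransformerUnitSketch(dim)(query, context)
    print(out.shape)  # torch.Size([2, 1, 128]); would feed an action classifier
```

In this reading, the attention weights over the context are what would let the model pick up on other people's actions, hands, and faces without extra supervision.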

Bibliographic Details
Main Authors: Girdhar, R; Carreira, J; Doersch, C; Zisserman, A
Format: Conference item
Language: English
Published: IEEE 2020
Institution: University of Oxford
Collection: OXFORD
Record ID: oxford-uuid:32753878-ac96-4455-8586-c1793ea50c5e