Video action transformer network
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action - all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
Main Authors: Girdhar, R; Carreira, J; Doersch, C; Zisserman, A
Format: Conference item
Language: English
Published: IEEE, 2020
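The abstract describes the model at the level of mechanism: a high-resolution, person-specific, class-agnostic query attends, Transformer-style, over spatiotemporal context features, and the attended representation is classified into action labels. The PyTorch sketch below is only an illustration of that idea, not the authors' implementation; the class name `ActionTransformerHeadSketch`, the feature dimension, the number of layers and heads, and the residual/normalization layout are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn


class ActionTransformerHeadSketch(nn.Module):
    """Illustrative sketch (not the paper's released code): a person-specific
    query attends over flattened spatiotemporal context features, and the
    attended representation is classified into action classes."""

    def __init__(self, feat_dim=512, num_heads=8, num_layers=3, num_classes=80):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(feat_dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(feat_dim),
                "ffn": nn.Sequential(
                    nn.Linear(feat_dim, 2 * feat_dim),
                    nn.ReLU(),
                    nn.Linear(2 * feat_dim, feat_dim),
                ),
                "norm2": nn.LayerNorm(feat_dim),
            })
            for _ in range(num_layers)
        ])
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, person_query, context_feats):
        # person_query:  (B, 1, D)      -- e.g. an RoI-pooled feature for one person box
        # context_feats: (B, T*H*W, D)  -- flattened spatiotemporal trunk features
        q = person_query
        for layer in self.layers:
            attended, _ = layer["attn"](q, context_feats, context_feats)
            q = layer["norm1"](q + attended)           # residual connection + layer norm
            q = layer["norm2"](q + layer["ffn"](q))    # position-wise feed-forward block
        return self.classifier(q.squeeze(1))           # (B, num_classes) action logits


if __name__ == "__main__":
    B, D, T, H, W = 2, 512, 4, 16, 16
    context = torch.randn(B, T * H * W, D)   # stand-in for 3D-CNN trunk features
    query = torch.randn(B, 1, D)             # stand-in for the person's RoI feature
    logits = ActionTransformerHeadSketch()(query, context)
    print(logits.shape)                      # torch.Size([2, 80])
```

In this sketch the query plays the role of a pooled feature for the person box and the key/value tensor stands in for a flattened 3D-CNN feature map; the 80-way output matches the number of atomic action classes defined in AVA, but all other hyperparameters are placeholders.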
author | Girdhar, R; Carreira, J; Doersch, C; Zisserman, A |
collection | OXFORD |
description | We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action - all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input. |
format | Conference item |
id | oxford-uuid:32753878-ac96-4455-8586-c1793ea50c5e |
institution | University of Oxford |
language | English |
publishDate | 2020 |
publisher | IEEE |
record_format | dspace |
title | Video action transformer network |