Video action transformer network
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution,...
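The abstract's core mechanism is attention from a per-person query onto the clip's spatiotemporal features. Below is a minimal, hypothetical PyTorch sketch of that idea; it is not the authors' code, and all names (`ActionTransformerHeadSketch`, `person_feat`, `context_feats`) and dimensions are our own assumptions for illustration.

```python
# Minimal sketch (assumption, not the paper's implementation): a query derived
# from a person's pooled region feature attends, Transformer-style, over the
# flattened spatiotemporal context features of the whole clip.
import torch
import torch.nn as nn

class ActionTransformerHeadSketch(nn.Module):
    """Hypothetical single attention unit: person query vs. clip context."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, person_feat, context_feats):
        # person_feat:   (B, 1, D)      pooled feature for the person of interest
        # context_feats: (B, T*H*W, D)  flattened spatiotemporal feature map
        q = self.norm(person_feat)
        out, _ = self.attn(q, context_feats, context_feats)
        q = person_feat + out            # residual update of the person query
        return q + self.ffn(self.norm(q))

# Usage on dummy shapes: 2 clips, 64 frames x 16 x 16 context locations, D=128
head = ActionTransformerHeadSketch()
person = torch.randn(2, 1, 128)
context = torch.randn(2, 64 * 16 * 16, 128)
print(head(person, context).shape)  # torch.Size([2, 1, 128])
```

The refined query can then be fed to a classifier over action labels; stacking several such units lets the query accumulate context from multiple attention rounds.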
Main Authors: Girdhar, R; Carreira, J; Doersch, C; Zisserman, A
Format: Conference item
Language: English
Published: IEEE, 2020
Similar Items
- Massively parallel video networks
  by: Carreira, J, et al. Published: (2018)
- Two-stream convolutional networks for action recognition in videos
  by: Simonyan, K, et al. Published: (2014)
- Convolutional two-stream network fusion for video action recognition
  by: Feichtenhofer, C, et al. Published: (2016)
- Input-level inductive biases for 3D reconstruction
  by: Yifan, W, et al. Published: (2022)
- Learning from one continuous video stream
  by: Carreira, J, et al. Published: (2024)