ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is...

Bibliographic Details
Main Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9915576/