Background-Aware Robust Context Learning for Weakly-Supervised Temporal Action Localization

Bibliographic Details
Main Authors: Jinah Kim, Jungchan Cho
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9797701/
Description
Summary: Weakly supervised temporal action localization (WTAL) aims to localize the temporal intervals of actions in an untrimmed video using only video-level action labels. Although learning the background is an important issue in WTAL, most previous studies have not modeled the background effectively. In this study, we propose a novel method for robustly separating contexts, e.g., action-like background, from the foreground to localize action intervals more accurately. First, we detect background segments based on their probabilities to minimize the impact of background estimation errors. Second, we define the entropy boundary of the foreground and the positive distance between this boundary and the background entropy. The background probability and entropy boundary allow the segment-level classifier to robustly learn the background. Third, we improve the performance of the overall actionness model based on a consensus of the RGB and flow features. The results of extensive experiments demonstrate that the proposed method learns the context separately from the action, consequently achieving new state-of-the-art results on the THUMOS-14 and ActivityNet-1.2 benchmarks. We also confirm that feature adaptation helps overcome the limitations of a pretrained feature extractor on datasets that contain a large amount of background, such as THUMOS-14.
ISSN: 2169-3536
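
The summary mentions selecting background segments by probability and measuring a positive distance between a foreground entropy boundary and the background entropy. The following is only a rough illustrative sketch of that general idea, not the authors' formulation: the function names, the 0.5 threshold, the choice of the maximum foreground entropy as the boundary, and the hinge-style penalty are all assumptions made for this example.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of each probability vector."""
    return -(probs * np.log(probs + eps)).sum(axis=axis)


def background_entropy_penalty(logits, bg_prob, bg_threshold=0.5):
    """Hypothetical hinge-style penalty on background entropy.

    logits:  (T, C) segment-level action class scores.
    bg_prob: (T,) estimated probability that each segment is background.

    Assumptions (not taken from the paper):
      * segments with bg_prob >= bg_threshold are treated as background;
      * the entropy boundary is the largest foreground entropy;
      * the penalty is the mean positive gap between that boundary and
        each background segment's entropy.
    """
    h = entropy(softmax(logits))
    is_bg = bg_prob >= bg_threshold
    if not is_bg.any() or is_bg.all():
        return 0.0  # nothing to compare against
    boundary = h[~is_bg].max()
    gap = np.maximum(0.0, boundary - h[is_bg])
    return float(gap.mean())


# Four segments, three action classes.
logits = np.array([
    [10.0, 0.0, 0.0],   # confident prediction (low entropy)
    [0.0, 10.0, 0.0],   # confident prediction (low entropy)
    [0.0, 0.0, 0.0],    # uniform prediction (high entropy)
    [5.0, 0.0, 0.0],    # fairly confident prediction
])
bg_prob = np.array([0.1, 0.2, 0.9, 0.8])
penalty = background_entropy_penalty(logits, bg_prob)
```

In this sketch the penalty is zero whenever every background segment already has higher entropy than all foreground segments, and grows as background predictions become as confident as foreground ones.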