Helping hands: an object-aware ego-centric video recognition model

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic labels of the objects using paired captions when available. At inference time the model only requires RGB frames as input, and is able to track and ground objects (although it has not been trained explicitly for this).

We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as to ground the words in the associated text descriptions.

Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
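The abstract describes the training signal at a high level: alongside the usual video-text objective, a decoder is asked to predict hand boxes, object boxes, and object labels whenever paired captions (or pseudo-labels) make such supervision available, while inference needs only RGB frames. The sketch below illustrates that auxiliary multi-task idea; it is not the authors' implementation, and every name and design choice here (the ObjectAwareDecoder module, the query count, a plain L1 + cross-entropy loss instead of set-based matching) is an assumption for illustration only.

```python
# Minimal sketch (not the paper's released code) of an object-aware decoder
# that predicts boxes and semantic labels from spatio-temporal video tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectAwareDecoder(nn.Module):
    def __init__(self, dim=512, num_queries=8, num_classes=300):
        super().__init__()
        # Learnable queries, one per predicted hand/object slot (hypothetical count).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)            # (cx, cy, w, h) per query
        self.cls_head = nn.Linear(dim, num_classes)  # semantic label per query

    def forward(self, video_tokens):
        # video_tokens: (B, T*P, dim) features from a video encoder.
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out = self.decoder(q, video_tokens)
        return self.box_head(out).sigmoid(), self.cls_head(out)


def auxiliary_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    # Simplified objective: box regression + label classification, applied only
    # when annotations/pseudo-labels exist. A real system would use set-based
    # matching between predictions and targets; that is omitted here.
    box_loss = F.l1_loss(pred_boxes, gt_boxes)                       # gt_boxes: (B, Q, 4)
    cls_loss = F.cross_entropy(pred_logits.flatten(0, 1),            # gt_labels: (B, Q)
                               gt_labels.flatten())
    return box_loss + cls_loss


# Example with dummy tensors:
# feats = torch.randn(2, 64, 512)            # (batch, tokens, dim)
# boxes, logits = ObjectAwareDecoder()(feats)
```

At test time such a decoder can be dropped, or its outputs used for tracking and grounding, which is consistent with the abstract's claim that the model only needs RGB frames at inference.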

Bibliographic Details
Main Authors: Zhang, C; Gupta, A; Zisserman, A
Format: Conference item
Language: English
Published: IEEE, 2024
Collection: OXFORD
Record ID: oxford-uuid:c2996582-1602-4643-ab41-31c81aea67e5
Institution: University of Oxford