Is an object-centric video representation beneficial for transfer?
The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture.
Main Authors: | Zhang, C; Gupta, A; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | Springer, 2023 |
_version_ | 1797110368491798528 |
---|---|
author | Zhang, C Gupta, A Zisserman, A |
collection | OXFORD |
description | The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory ‘modalities’ of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets (SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens), we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification. |
first_indexed | 2024-03-07T07:54:00Z |
format | Conference item |
id | oxford-uuid:15807a37-40dc-478f-8388-8bd958622bc7 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-07T07:54:00Z |
publishDate | 2023 |
publisher | Springer |
record_format | dspace |
title | Is an object-centric video representation beneficial for transfer? |