TIM: a time interval machine for audio-visual action recognition


Bibliographic Details
Main Authors: Chalk, J, Huh, J, Kazakos, E, Zisserman, A, Damen, D
Format: Conference item
Language: English
Published: IEEE 2024
_version_ 1811139184027500544
author Chalk, J
Huh, J
Kazakos, E
Zisserman, A
Damen, D
collection OXFORD
description <p>Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action.</p> <p>We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM.</p>
first_indexed 2024-09-25T04:02:03Z
format Conference item
id oxford-uuid:9b1cd459-fa33-46ca-b2a6-7911d7e9b408
institution University of Oxford
language English
last_indexed 2024-09-25T04:02:03Z
publishDate 2024
publisher IEEE
record_format dspace
spelling oxford-uuid:9b1cd459-fa33-46ca-b2a6-7911d7e9b408
2024-04-24T12:23:44Z
TIM: a time interval machine for audio-visual action recognition
Conference item
http://purl.org/coar/resource_type/c_5794
uuid:9b1cd459-fa33-46ca-b2a6-7911d7e9b408
English
Symplectic Elements
IEEE
2024
Chalk, J
Huh, J
Kazakos, E
Zisserman, A
Damen, D
title TIM: a time interval machine for audio-visual action recognition