Sparse in space and time: audio-visual synchronisation with trainable selectors


Bibliographic Details
Main Authors: Iashin, V, Xie, W, Rahtu, E, Zisserman, A
Format: Conference item
Language: English
Published: British Machine Vision Association 2022
author Iashin, V
Xie, W
Rahtu, E
Zisserman, A
collection OXFORD
description The objective of this paper is audio-visual synchronisation of general videos ‘in the wild’. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is ‘sparse in space and time’. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs ‘selectors’ to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync
first_indexed 2024-03-07T07:28:51Z
format Conference item
id oxford-uuid:6c87a73f-f968-45b6-9e46-483300220142
institution University of Oxford
language English
last_indexed 2024-03-07T07:28:51Z
publishDate 2022
publisher British Machine Vision Association
record_format dspace
spelling oxford-uuid:6c87a73f-f968-45b6-9e46-483300220142; 2022-12-16T16:41:25Z; Sparse in space and time: audio-visual synchronisation with trainable selectors; Conference item; http://purl.org/coar/resource_type/c_5794; uuid:6c87a73f-f968-45b6-9e46-483300220142; English; Symplectic Elements; British Machine Vision Association; 2022; Iashin, V; Xie, W; Rahtu, E; Zisserman, A
title Sparse in space and time: audio-visual synchronisation with trainable selectors