Audio-visual synchronisation in the wild

In this paper, we consider the problem of audio-visual synchronisation applied to videos "in the wild" (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis of the curated dataset and define an evaluation metric for open-domain audio-visual synchronisation. We apply our method to the standard lip-reading speech benchmarks LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state of the art by a significant margin.
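
The abstract describes transformer-based models that assess the temporal alignment between audio and visual streams. As a rough illustration of how such a synchronisation model can be framed, the sketch below poses the task as classifying the temporal offset of the audio track relative to the video. This is a minimal sketch under assumed names and hyper-parameters (AVSyncScorer, the feature dimensions, the offset range are all invented for illustration); it is not the architecture evaluated in the paper, and it omits the memory-reduction techniques that the paper focuses on.

import torch
import torch.nn as nn

class AVSyncScorer(nn.Module):
    """Toy transformer that classifies the audio-visual temporal offset."""

    def __init__(self, audio_dim=512, visual_dim=512, d_model=256,
                 nhead=4, num_layers=3, num_offsets=31, max_len=512):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Learned modality and positional embeddings; without positions the
        # transformer could not perceive a temporal shift at all.
        self.modality_emb = nn.Parameter(torch.randn(2, d_model) * 0.02)
        self.pos_emb = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)
        # A CLS-style token aggregates the joint sequence for classification.
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # One logit per candidate offset bin (e.g. -15 .. +15 video frames).
        self.head = nn.Linear(d_model, num_offsets)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, audio_dim); visual_feats: (B, Tv, visual_dim).
        a = self.audio_proj(audio_feats) + self.modality_emb[0]
        v = self.visual_proj(visual_feats) + self.modality_emb[1]
        cls = self.cls_token.expand(a.size(0), -1, -1)
        x = torch.cat([cls, a, v], dim=1)          # (B, 1 + Ta + Tv, d_model)
        x = x + self.pos_emb[:, :x.size(1)]
        x = self.encoder(x)
        return self.head(x[:, 0])                  # (B, num_offsets)

# Usage on dummy features for two clips of 50 frames each:
model = AVSyncScorer()
logits = model(torch.randn(2, 50, 512), torch.randn(2, 50, 512))
predicted_bin = logits.argmax(dim=-1)              # index of the offset bin

Note that concatenating the two feature sequences and attending jointly makes attention cost grow with the combined sequence length; for the arbitrary-length, memory-efficient setting the abstract mentions, one would presumably need a more economical attention scheme than this dense baseline.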


Bibliographic Details
Main Authors: Chen, H.; Xie, W.; Afouras, T.; Nagrani, A.; Vedaldi, A.; Zisserman, A.
Format: Conference item
Language: English
Published: British Machine Vision Association, 2021