Audio-visual synchronisation in the wild
In this paper, we consider the problem of audio-visual synchronisation applied to videos "in-the-wild" (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
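The general setup the abstract describes can be illustrated with a minimal sketch: embed the visual and audio streams, let a transformer attend jointly over both token sequences (which may have arbitrary lengths), and classify the temporal offset between them. The name `AVSyncNet`, the feature dimensions, and the 31-way offset head below are assumptions for illustration only, not the architecture proposed in the paper; positional encodings and the paper's memory-saving variants are omitted.

```python
# Illustrative sketch only: a generic transformer-based audio-visual
# synchronisation model. Names, dimensions, and the discrete-offset head
# are assumptions for illustration, not the model proposed in the paper.
import torch
import torch.nn as nn

class AVSyncNet(nn.Module):
    def __init__(self, dim=512, num_offsets=31, num_layers=3):
        super().__init__()
        # Project per-frame visual and per-window audio features to a shared dim.
        self.vis_proj = nn.Linear(2048, dim)   # e.g. CNN frame features
        self.aud_proj = nn.Linear(1024, dim)   # e.g. spectrogram features
        # Learned embeddings marking which modality each token comes from.
        self.modality = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Classify the audio-visual shift into discrete offsets
        # (e.g. -15..+15 frames -> 31 classes).
        self.head = nn.Linear(dim, num_offsets)

    def forward(self, vis, aud):
        # vis: (B, Tv, 2048), aud: (B, Ta, 1024); Tv and Ta are arbitrary.
        v = self.vis_proj(vis) + self.modality.weight[0]
        a = self.aud_proj(aud) + self.modality.weight[1]
        enc = self.encoder(torch.cat([v, a], dim=1))  # joint sequence
        return self.head(enc.mean(dim=1))             # logits over offsets

model = AVSyncNet()
logits = model(torch.randn(2, 25, 2048), torch.randn(2, 50, 1024))
print(logits.shape)  # torch.Size([2, 31])
```

Framing synchronisation as classification over discrete candidate offsets is one common way to make accuracy-within-tolerance usable as an evaluation metric, in the spirit of the open-domain metric the abstract mentions.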
Main Authors: | Chen, H; Xie, W; Afouras, T; Nagrani, A; Vedaldi, A; Zisserman, A
---|---
Format: | Conference item
Language: | English
Published: | British Machine Vision Association, 2021
author | Chen, H; Xie, W; Afouras, T; Nagrani, A; Vedaldi, A; Zisserman, A
---|---
collection | OXFORD |
description | In this paper, we consider the problem of audio-visual synchronisation applied to videos "in-the-wild" (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
format | Conference item |
id | oxford-uuid:dcef3c67-e8aa-451f-bbfd-9ceb8753a729 |
institution | University of Oxford |
language | English |
publishDate | 2021 |
publisher | British Machine Vision Association |
record_format | dspace |
title | Audio-visual synchronisation in the wild |