AutoAD III: the prequel – back to the pixels


Bibliographic Details
Main Authors: Han, T, Bain, M, Nagrani, A, Varol, G, Xie, W, Zisserman, A
Format: Conference item
Language: English
Published: IEEE 2024
Description: Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by the use of performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well matched to human performance. Taken together, we improve the state of the art on AD generation.
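The abstract's contribution (ii) describes a Q-former-style bridge: a small set of learned query tokens cross-attends over frozen per-frame visual features, compressing a variable-length video into a fixed number of prefix tokens for a frozen language model. A minimal single-head NumPy sketch of that cross-attention idea follows; all names and dimensions (`n_frames`, `n_query`, `d`, the weight matrices) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_bridge(frame_feats, queries, Wq, Wk, Wv):
    """Single-head cross-attention: learned queries attend over frozen
    frame features, yielding a fixed-length set of tokens for the LLM."""
    Q = queries @ Wq            # (n_query, d)
    K = frame_feats @ Wk        # (n_frames, d)
    V = frame_feats @ Wv        # (n_frames, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V             # (n_query, d): prefix tokens for the LLM

d = 8                                            # hypothetical feature width
n_frames, n_query = 16, 4                        # hypothetical sizes
frame_feats = rng.normal(size=(n_frames, d))     # stand-in for frozen visual encoder output
queries = rng.normal(size=(n_query, d))          # learned query tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

prefix = qformer_bridge(frame_feats, queries, Wq, Wk, Wv)
print(prefix.shape)  # (4, 8)
```

The key property this illustrates is that the output size depends only on the number of query tokens, not on the number of input frames, which is what lets the bridge feed videos of any length into a fixed-context language model.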
Record ID: oxford-uuid:e0a32d8c-45ae-4381-a5ba-f39e8a522c62
Institution: University of Oxford