AutoAD III: the prequel – back to the pixels
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) we propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these; the datasets will be publicly released; (ii) we develop a Q-former-based architecture that ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) we provide new evaluation metrics for benchmarking AD quality that are well matched to human performance. Taken together, we improve the state of the art on AD generation.
Main Authors: | Han, T; Bain, M; Nagrani, A; Varol, G; Xie, W; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |
Institution: | University of Oxford |
Collection: | OXFORD |
Record ID: | oxford-uuid:e0a32d8c-45ae-4381-a5ba-f39e8a522c62 |
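To make contribution (ii) of the abstract concrete, the sketch below illustrates the general Q-former bridging pattern it describes: a small set of learnable query tokens cross-attends to features from a frozen pre-trained visual encoder and is projected into the embedding space of a frozen large language model, which then decodes the AD text. This is a minimal sketch of the pattern only; the class name, dimensions, layer counts, and wiring are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a Q-former-style bridge (all names/sizes are assumptions,
# not the paper's implementation): learnable queries cross-attend to frozen
# visual features and emit "soft prompt" tokens for a frozen LLM.
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    def __init__(self, vis_dim=1024, hidden_dim=768, llm_dim=4096,
                 num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens, contextualized by the video features below.
        self.queries = nn.Parameter(torch.randn(1, num_queries, hidden_dim) * 0.02)
        # Project frozen-encoder features to the Q-former width.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        # A transformer decoder gives self-attention over the queries plus
        # cross-attention into the visual tokens (passed as "memory").
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Project to the LLM's embedding width so the outputs can be
        # prepended to text embeddings as a soft prefix.
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, vis_feats):
        # vis_feats: (batch, num_frames * tokens_per_frame, vis_dim),
        # e.g. patch tokens from a frozen pre-trained visual encoder.
        memory = self.vis_proj(vis_feats)
        q = self.queries.expand(vis_feats.size(0), -1, -1)
        out = self.decoder(tgt=q, memory=memory)
        return self.llm_proj(out)

# Toy usage: 8 frames x 16 patch tokens of 1024-d features -> 32 prefix tokens.
bridge = QFormerBridge()
video_tokens = torch.randn(2, 8 * 16, 1024)
prefix = bridge(video_tokens)
print(prefix.shape)  # torch.Size([2, 32, 4096])
```

The appeal of this design, as the abstract notes, is that only the small bridge is trained while the visual encoder and language model stay frozen with their pre-trained weights.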