Summary: The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and train only a mapping network that bridges the two models for visually-conditioned text generation. To obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, and the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets where visual or contextual information is unavailable, e.g., text-only AD without movies, or visual captioning datasets without context; (iii) we improve on the currently available AD datasets by removing label noise from the MAD dataset and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.
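To make the "mapping network that bridges the two models" concrete, the following is a minimal sketch, not the authors' implementation: it assumes a ClipCap-style setup in which a frozen CLIP image embedding is projected by a small trainable Transformer into a sequence of prefix embeddings that are prepended to a frozen GPT-2's input. The dimensions, prefix length, and layer count are illustrative assumptions.

```python
# Hypothetical sketch of a CLIP-to-GPT mapping network (assumed ClipCap-style design,
# not the paper's released code). Only this module would be trained; CLIP and GPT-2
# remain frozen.
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Maps one CLIP image embedding to `prefix_len` GPT-2-sized prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10, num_layers=4):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # Expand the single CLIP vector into prefix_len embedding slots.
        self.proj = nn.Linear(clip_dim, prefix_len * gpt_dim)
        # Refine the prefix slots with a small Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=gpt_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, clip_feat):                       # (B, clip_dim)
        prefix = self.proj(clip_feat)                    # (B, prefix_len * gpt_dim)
        prefix = prefix.view(-1, self.prefix_len, self.gpt_dim)
        return self.encoder(prefix)                      # (B, prefix_len, gpt_dim)


# Assumed usage: prepend the visual prefix to GPT-2's token embeddings and train
# the mapping network with the standard next-token cross-entropy on the AD text.
# from transformers import GPT2LMHeadModel
# gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")        # kept frozen
# text_emb = gpt2.transformer.wte(token_ids)            # (B, T, 768)
# inputs = torch.cat([MappingNetwork()(clip_feat), text_emb], dim=1)
# out = gpt2(inputs_embeds=inputs)
```

In this kind of design, freezing the foundation models keeps the number of trainable parameters small, which is what makes training feasible on the limited AD data described above; the context signals (previous AD, subtitles) would enter as additional text tokens alongside the visual prefix.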