You said that?: Synthesising talking faces from audio

We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time.

Full description

Bibliographic details
Main authors: Jamaludin, A, Chung, JS, Zisserman, A
Material type: Journal article
Language: English
Published: Springer 2019
author Jamaludin, A
Chung, JS
Zisserman, A
collection OXFORD
description We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.
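As a rough illustration of the architecture outlined in the description above, the following is a minimal PyTorch sketch, not the authors' implementation: all module names, layer sizes, and input dimensions are illustrative assumptions. It shows an encoder-decoder CNN that fuses an identity embedding from a still face image with an audio embedding and decodes a lip-synced face frame.

# Minimal sketch, not the authors' implementation: an encoder-decoder CNN that
# fuses an identity embedding (from a still face image) with an audio embedding
# and decodes a lip-synced face frame. All dimensions are illustrative.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    # Encodes a short window of audio features (e.g. an MFCC map) into a vector.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, audio):                  # audio: (B, 1, freq, time)
        return self.fc(self.conv(audio).flatten(1))


class IdentityEncoder(nn.Module):
    # Encodes a still image of the target face into an identity vector.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, embed_dim)

    def forward(self, face):                   # face: (B, 3, 112, 112)
        return self.fc(self.conv(face).flatten(1))


class FrameDecoder(nn.Module):
    # Decodes the joint (audio + identity) embedding into one 112x112 face frame.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * embed_dim, 512 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # 14x14
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 28x28
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 56x56
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 112x112
        )

    def forward(self, joint):
        return self.deconv(self.fc(joint).view(-1, 512, 7, 7))


class TalkingFaceGenerator(nn.Module):
    # Joint embedding of face identity and audio, decoded to a synthesised frame.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.audio_enc = AudioEncoder(embed_dim)
        self.id_enc = IdentityEncoder(embed_dim)
        self.decoder = FrameDecoder(embed_dim)

    def forward(self, audio, face):
        joint = torch.cat([self.audio_enc(audio), self.id_enc(face)], dim=1)
        return self.decoder(joint)


# Example: synthesise one frame from a short audio window and one still image.
model = TalkingFaceGenerator()
frame = model(torch.randn(1, 1, 12, 35), torch.rand(1, 3, 112, 112))
print(frame.shape)  # torch.Size([1, 3, 112, 112])

As the description notes, the paper's model is trained on unlabelled video using cross-modal self-supervision and extended with a multi-stream CNN for blending the generated face back into the source frame; the sketch above covers only the generation forward pass.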
first_indexed 2024-03-07T02:05:06Z
format Journal article
id oxford-uuid:9eb7b560-a535-4643-93c8-6875cd3d8c31
institution University of Oxford
language English
last_indexed 2024-03-07T02:05:06Z
publishDate 2019
publisher Springer
record_format dspace
title You said that?: Synthesising talking faces from audio