Learning to lip read words by watching videos

Bibliographic Details
Main Authors: Chung, J.; Zisserman, A.
Format: Journal article
Published: Elsevier, 2018

Description

Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing works in this area have focussed on recognising a small number of utterances in controlled environments (e.g. digits and alphabets), partially due to the shortage of suitable datasets.

We make three novel contributions. First, we develop a pipeline for fully automated data collection from TV broadcasts; with this we have generated a dataset with over a million word instances, spoken by over a thousand different people. Second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data, and we apply this network to the tasks of audio-to-video synchronisation and active speaker detection. Third, we train convolutional and recurrent networks that are able to effectively learn and recognise hundreds of words from this large-scale dataset.

In lip reading and in speaker detection, we demonstrate results that exceed the current state-of-the-art on public benchmark datasets.
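
The second contribution is the most concrete architecturally: a two-stream convolutional network whose audio and visual branches are trained, without manual labels, to map synchronised sound and mouth motion to nearby points in a shared embedding space. Below is a minimal sketch of that idea. The framework (PyTorch), the layer sizes, the input shapes (five stacked greyscale mouth crops; a 13x20 MFCC patch), and the exact contrastive loss form are illustrative assumptions, not the authors' published architecture; in practice, negative pairs would be generated by temporally shifting the audio against the video.

# Illustrative sketch only (assumed PyTorch; layer sizes are not the paper's):
# two conv streams embed mouth frames and MFCC audio into one space, and a
# contrastive loss pulls synchronised pairs together and pushes shifted pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEmbedder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: 5 consecutive greyscale mouth crops stacked as channels.
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: an MFCC patch covering the same ~0.2 s window.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        return self.visual(frames), self.audio(mfcc)

def contrastive_loss(v, a, same, margin=1.0):
    # same = 1 for genuinely synchronised pairs, 0 for shifted (negative) ones.
    d = F.pairwise_distance(v, a)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy usage: random tensors stand in for real mouth crops and MFCCs.
model = TwoStreamEmbedder()
frames = torch.randn(8, 5, 112, 112)    # batch of 5-frame mouth regions
mfcc = torch.randn(8, 1, 13, 20)        # matching audio features
same = torch.randint(0, 2, (8,)).float()
v, a = model(frames, mfcc)
contrastive_loss(v, a, same).backward()

Because the in-sync/out-of-sync supervision signal is generated automatically, an objective of this kind can be trained directly on unlabelled broadcast video, which is what lets the automated data-collection pipeline of the first contribution feed it at scale.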