Learning to lip read words by watching videos
Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing works in this area have focussed on trying to recognise a small number of utterances in controlled environments (e.g. digits and alphabets), partially due to the shortage of suitable datasets.

We make three novel contributions. First, we develop a pipeline for fully automated data collection from TV broadcasts; with this we have generated a dataset with over a million word instances, spoken by over a thousand different people. Second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data; we apply this network to the tasks of audio-to-video synchronisation and active speaker detection. Third, we train convolutional and recurrent networks that are able to effectively learn and recognise hundreds of words from this large-scale dataset.

In both lip reading and speaker detection, we demonstrate results that exceed the current state of the art on public benchmark datasets.
Main Authors: | Chung, J; Zisserman, A |
---|---|
Format: | Journal article |
Published: | Elsevier, 2018 |
Institution: | University of Oxford |
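The second contribution in the abstract is a two-stream convolutional network that learns a joint embedding between audio and mouth motion from unlabelled video. As a rough illustration only, below is a minimal PyTorch-style sketch of such a two-stream embedding trained with a contrastive loss; the layer sizes, input shapes (MFCC windows and stacked grayscale mouth crops), and the loss itself are assumptions made for exposition, not the authors' published architecture.

```python
# Minimal sketch of a two-stream audio-visual embedding network.
# All layer sizes, input shapes, and the contrastive loss are illustrative
# assumptions; they are NOT the architecture published in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStream(nn.Module):
    """Maps a short window of audio features (assumed: 13 MFCCs x 20 frames)
    to a unit-norm embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                      # x: (B, 1, 13, 20)
        h = self.net(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

class VideoStream(nn.Module):
    """Maps a short stack of grayscale mouth crops (assumed: 5 frames of
    112 x 112 pixels, stacked as channels) to a unit-norm embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):                      # x: (B, 5, 112, 112)
        h = self.net(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)

def contrastive_loss(a, v, label, margin=0.5):
    """label = 1 for synchronised audio/video pairs, 0 for temporally
    shifted (negative) pairs; pulls positives together, pushes negatives
    apart up to the margin."""
    d = F.pairwise_distance(a, v)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()
```

One plausible use at inference time, matching the applications named in the abstract: slide the audio window relative to the video and take the offset with the smallest embedding distance as the audio-to-video synchronisation; a face whose distance stays large at every offset can be flagged as a non-speaker for active speaker detection.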