Learning to lip read words by watching videos

Bibliographic Details
Main Authors: Chung, J.; Zisserman, A.
Format: Journal article
Published: Elsevier, 2018

Description

Our aim is to recognise the words being spoken by a talking face, given only the video but not the audio. Existing works in this area have focussed on recognising a small number of utterances in controlled environments (e.g. digits and alphabets), partially due to the shortage of suitable datasets.

We make three novel contributions. First, we develop a pipeline for fully automated data collection from TV broadcasts; with this we have generated a dataset with over a million word instances, spoken by over a thousand different people. Second, we develop a two-stream convolutional neural network that learns a joint embedding between the sound and the mouth motions from unlabelled data, and we apply this network to the tasks of audio-to-video synchronisation and active speaker detection. Third, we train convolutional and recurrent networks that are able to effectively learn and recognise hundreds of words from this large-scale dataset.

In lip reading and in speaker detection, we demonstrate results that exceed the current state-of-the-art on public benchmark datasets.
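
The second contribution is the most concrete architecturally: a two-stream convolutional network whose audio and visual branches are trained, without manual labels, to map synchronised sound and mouth motion to nearby points in a shared embedding space. Below is a minimal sketch of that idea. The framework (PyTorch), the layer sizes, the input shapes (five stacked greyscale mouth crops; a 13x20 MFCC patch), and the exact contrastive loss form are illustrative assumptions, not the authors' published architecture; in practice, negative pairs would be generated by temporally shifting the audio against the video.

# Illustrative sketch only (assumed PyTorch; layer sizes are not the paper's):
# two conv streams embed mouth frames and MFCC audio into one space, and a
# contrastive loss pulls synchronised pairs together and pushes shifted pairs apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamEmbedder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: 5 consecutive greyscale mouth crops stacked as channels.
        self.visual = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Audio stream: an MFCC patch covering the same ~0.2 s window.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, frames, mfcc):
        return self.visual(frames), self.audio(mfcc)

def contrastive_loss(v, a, same, margin=1.0):
    # same = 1 for genuinely synchronised pairs, 0 for shifted (negative) ones.
    d = F.pairwise_distance(v, a)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy usage: random tensors stand in for real mouth crops and MFCCs.
model = TwoStreamEmbedder()
frames = torch.randn(8, 5, 112, 112)    # batch of 5-frame mouth regions
mfcc = torch.randn(8, 1, 13, 20)        # matching audio features
same = torch.randint(0, 2, (8,)).float()
v, a = model(frames, mfcc)
contrastive_loss(v, a, same).backward()

Because the in-sync/out-of-sync supervision signal is generated automatically, an objective of this kind can be trained directly on unlabelled broadcast video, which is what lets the automated data-collection pipeline of the first contribution feed it at scale.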