Now you're speaking my language: visual language identification
The goal of this work is to train models that can identify a spoken language just by interpreting the speaker's lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
Main Authors: | Afouras, T; Chung, JS; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | ISCA Archive, 2020 |
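For a concrete picture of the task, the pipeline the abstract describes — per-frame visual features from mouth crops, a sequence model over time, and utterance-level aggregation feeding a 14-way language classifier — can be sketched as below. This is a minimal illustration in PyTorch under assumed details: the `VisualLangID` class name, layer sizes, the GRU sequence model, and mean pooling are placeholders for exposition, not the architecture the paper actually compares and selects.

```python
# Minimal sketch of a visual language-identification pipeline (assumed
# PyTorch setup). All architectural choices here are illustrative
# assumptions, not the authors' design.
import torch
import torch.nn as nn

NUM_LANGUAGES = 14  # the paper discriminates among 14 languages

class VisualLangID(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Spatiotemporal frontend: turns a lip-region clip (B, 1, T, H, W)
        # into a sequence of per-frame features.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool away spatial dims
        )
        self.proj = nn.Linear(64, feat_dim)
        # Sequence model over time; a GRU is an arbitrary stand-in for the
        # designs the paper compares.
        self.temporal = nn.GRU(feat_dim, feat_dim,
                               batch_first=True, bidirectional=True)
        # Utterance-level aggregation (mean pooling) into a classifier.
        self.classifier = nn.Linear(2 * feat_dim, NUM_LANGUAGES)

    def forward(self, clips):
        # clips: (B, 1, T, H, W) grayscale mouth crops
        x = self.frontend(clips)           # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1)      # (B, 64, T)
        x = self.proj(x.transpose(1, 2))   # (B, T, feat_dim)
        x, _ = self.temporal(x)            # (B, T, 2 * feat_dim)
        x = x.mean(dim=1)                  # aggregate over the utterance
        return self.classifier(x)          # (B, NUM_LANGUAGES) logits

# Example: score a batch of two 75-frame, 96x96 mouth-crop clips.
model = VisualLangID()
logits = model(torch.randn(2, 1, 75, 96, 96))
print(logits.shape)  # torch.Size([2, 14])
```

Contribution (ii) of the paper concerns exactly the `temporal` and pooling stages: swapping the GRU for a temporal convolution or self-attention, or mean pooling for an attentive pooling layer, gives the kind of design variants such a comparison would cover.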