Now you're speaking my language: visual language identification

Full description

The goal of this work is to train models that can identify a spoken language just by interpreting the speaker’s lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.

Bibliographic Details
Main Authors: Afouras, T, Chung, JS, Zisserman, A
Format: Conference item
Language: English
Published: ISCA Archive 2020
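
As an illustration of the kind of pipeline the description refers to, below is a minimal PyTorch sketch of a visual language-identification model: a spatio-temporal frontend over mouth-region crops, a recurrent layer for sequence modelling, mean pooling over time for utterance-level aggregation, and a 14-way language classifier. The module name VisualLangID, the layer sizes, and the choice of a GRU with mean pooling are illustrative assumptions, not the architecture reported in the paper.

# Minimal, illustrative sketch (not the authors' released code): classify the
# spoken language from silent mouth-crop video in three stages:
# (a) per-frame visual features, (b) temporal sequence modelling,
# (c) utterance-level aggregation followed by a 14-way classifier.
import torch
import torch.nn as nn


class VisualLangID(nn.Module):
    def __init__(self, num_languages: int = 14, feat_dim: int = 256):
        super().__init__()
        # Spatio-temporal frontend over grayscale mouth crops:
        # (batch, 1, time, H, W) -> one feature vector per frame.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, feat_dim, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool away spatial dims, keep the time axis
        )
        # Sequence modelling over the frame features (a bidirectional GRU here;
        # the paper compares several designs for this stage).
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True, bidirectional=True)
        # Utterance-level aggregation is mean pooling over time in this sketch,
        # again just one of several possible choices.
        self.classifier = nn.Linear(2 * feat_dim, num_languages)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, time, H, W) mouth-region crops
        feats = self.frontend(video)             # (batch, feat_dim, time, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)    # (batch, feat_dim, time)
        feats = feats.transpose(1, 2)            # (batch, time, feat_dim)
        seq, _ = self.temporal(feats)            # (batch, time, 2 * feat_dim)
        utterance = seq.mean(dim=1)              # utterance-level aggregation
        return self.classifier(utterance)        # (batch, num_languages) logits


if __name__ == "__main__":
    model = VisualLangID()
    clip = torch.randn(2, 1, 25, 96, 96)         # two 1-second clips at 25 fps
    logits = model(clip)
    print(logits.shape)                          # torch.Size([2, 14])

Swapping the GRU for a temporal convolution or a self-attention encoder, or replacing the mean pool with an attentive pooling layer, gives the kind of sequence-modelling and aggregation comparison mentioned in contribution (ii); the specific alternatives listed here are assumptions, not a summary of the paper's experiments.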