Now you're speaking my language: visual language identification
The goal of this work is to train models that can identify a spoken language just by interpreting the speaker's lip movements. Our contributions are the following: (i) we show that models can learn to discriminate among 14 different languages using only visual speech information; (ii) we compare different designs in sequence modelling and utterance-level aggregation in order to determine the best architecture for this task; (iii) we investigate the factors that contribute discriminative cues and show that our model indeed solves the problem by finding temporal patterns in mouth movements and not by exploiting spurious correlations. We demonstrate this further by evaluating our models on challenging examples from bilingual speakers.
Main Authors: | Afouras, T; Chung, JS; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | ISCA Archive, 2020 |
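For a concrete picture of the task, the pipeline the abstract describes — per-frame visual features from mouth crops, a sequence model over time, and utterance-level aggregation feeding a 14-way language classifier — can be sketched as below. This is a minimal illustration in PyTorch under assumed details: the `VisualLangID` class name, layer sizes, the GRU sequence model, and mean pooling are placeholders for exposition, not the architecture the paper actually compares and selects.

```python
# Minimal sketch of a visual language-identification pipeline (assumed
# PyTorch setup). All architectural choices here are illustrative
# assumptions, not the authors' design.
import torch
import torch.nn as nn

NUM_LANGUAGES = 14  # the paper discriminates among 14 languages

class VisualLangID(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Spatiotemporal frontend: turns a lip-region clip (B, 1, T, H, W)
        # into a sequence of per-frame features.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool away spatial dims
        )
        self.proj = nn.Linear(64, feat_dim)
        # Sequence model over time; a GRU is an arbitrary stand-in for the
        # designs the paper compares.
        self.temporal = nn.GRU(feat_dim, feat_dim,
                               batch_first=True, bidirectional=True)
        # Utterance-level aggregation (mean pooling) into a classifier.
        self.classifier = nn.Linear(2 * feat_dim, NUM_LANGUAGES)

    def forward(self, clips):
        # clips: (B, 1, T, H, W) grayscale mouth crops
        x = self.frontend(clips)           # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1)      # (B, 64, T)
        x = self.proj(x.transpose(1, 2))   # (B, T, feat_dim)
        x, _ = self.temporal(x)            # (B, T, 2 * feat_dim)
        x = x.mean(dim=1)                  # aggregate over the utterance
        return self.classifier(x)          # (B, NUM_LANGUAGES) logits

# Example: score a batch of two 75-frame, 96x96 mouth-crop clips.
model = VisualLangID()
logits = model(torch.randn(2, 1, 75, 96, 96))
print(logits.shape)  # torch.Size([2, 14])
```

Contribution (ii) of the paper concerns exactly the `temporal` and pooling stages: swapping the GRU for a temporal convolution or self-attention, or mean pooling for an attentive pooling layer, gives the kind of design variants such a comparison would cover.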