Emotion recognition in speech using cross-modal transfer in the wild

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio...

Full description

Bibliographic Details
Main Authors: Albanie, S, Nagrani, A, Vedaldi, A, Zisserman, A
Format: Internet publication
Language: English
Published: arXiv 2018
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
Institution: University of Oxford
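The cross-modal distillation idea in the description above can be sketched in a few lines: a "teacher" produces soft emotion distributions from one modality, and a "student" operating on another modality is trained to match them, so no labelled audio is ever needed. The snippet below is an illustrative toy sketch, not the authors' implementation: the linear student, random stand-in features, and the plain KL-divergence loss are all assumptions made for brevity.

```python
# Toy sketch of cross-modal distillation (hypothetical, not the paper's code):
# the teacher's soft labels (in the paper, from face frames of the same clip)
# supervise a student that only sees audio features.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # Mean KL(p || q) between teacher (p) and student (q) distributions.
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

n_samples, n_audio_feats, n_emotions = 64, 16, 8

# Stand-ins for real data: audio features and the teacher's soft emotion labels.
audio = rng.normal(size=(n_samples, n_audio_feats))
teacher_probs = softmax(rng.normal(size=(n_samples, n_emotions)))

# A linear student trained by gradient descent on the distillation loss.
W = np.zeros((n_audio_feats, n_emotions))
lr = 0.5
losses = []
for _ in range(200):
    student_probs = softmax(audio @ W)
    losses.append(kl_divergence(teacher_probs, student_probs))
    # For soft-target cross-entropy, the gradient w.r.t. the logits
    # is simply (student - teacher), averaged over the batch.
    grad_logits = (student_probs - teacher_probs) / n_samples
    W -= lr * (audio.T @ grad_logits)

# The student's output distribution moves toward the teacher's soft labels,
# i.e. the distillation loss shrinks over training.
```

In the paper the student is a deep network on raw speech rather than a linear map, but the supervision signal is the same: soft expression labels distilled across modalities.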