Emotion recognition in speech using cross-modal transfer in the wild

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio...

Full description

Bibliographic Details
Main Authors: Albanie, S, Nagrani, A, Vedaldi, A, Zisserman, A
Format: Internet publication
Language: English
Published: arXiv 2018
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
Institution: University of Oxford
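The cross-modal distillation idea in the description above can be sketched in a few lines: a "teacher" produces soft emotion distributions from one modality, and a "student" operating on another modality is trained to match them, so no labelled audio is ever needed. The snippet below is an illustrative toy sketch, not the authors' implementation: the linear student, random stand-in features, and the plain KL-divergence loss are all assumptions made for brevity.

```python
# Toy sketch of cross-modal distillation (hypothetical, not the paper's code):
# the teacher's soft labels (in the paper, from face frames of the same clip)
# supervise a student that only sees audio features.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # Mean KL(p || q) between teacher (p) and student (q) distributions.
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

n_samples, n_audio_feats, n_emotions = 64, 16, 8

# Stand-ins for real data: audio features and the teacher's soft emotion labels.
audio = rng.normal(size=(n_samples, n_audio_feats))
teacher_probs = softmax(rng.normal(size=(n_samples, n_emotions)))

# A linear student trained by gradient descent on the distillation loss.
W = np.zeros((n_audio_feats, n_emotions))
lr = 0.5
losses = []
for _ in range(200):
    student_probs = softmax(audio @ W)
    losses.append(kl_divergence(teacher_probs, student_probs))
    # For soft-target cross-entropy, the gradient w.r.t. the logits
    # is simply (student - teacher), averaged over the batch.
    grad_logits = (student_probs - teacher_probs) / n_samples
    W -= lr * (audio.T @ grad_logits)

# The student's output distribution moves toward the teacher's soft labels,
# i.e. the distillation loss shrinks over training.
```

In the paper the student is a deep network on raw speech rather than a linear map, but the supervision signal is the same: soft expression labels distilled across modalities.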