Self-attention transfer networks for speech emotion recognition

Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative repres...


Bibliographic Details
Main Authors:	Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Shihuang Sun, Haishuai Wang, Jianhua Tao, Björn W. Schuller
Format:	Article
Language:	English
Published:	KeAi Communications Co., Ltd. 2021-02-01
Series:	Virtual Reality & Intelligent Hardware
Subjects:	Speech emotion recognition; Attention transfer; Self-attention; Temporal convolutional neural networks (TCNs)
Online Access:	http://www.sciencedirect.com/science/article/pii/S2096579620301145

author	Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Shihuang Sun, Haishuai Wang, Jianhua Tao, Björn W. Schuller
author_sort	Ziping Zhao
collection	DOAJ
description	Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the wider application of data-hungry techniques (e.g., deep neural networks). To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we use the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism, based on a self-attention algorithm, to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, speech recognition, and then transfers this knowledge into SER. Evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrates the effectiveness of the proposed model.
first_indexed	2024-12-19T12:19:12Z
format	Article
id	doaj.art-b09f9ee55f354c169a77b6f5b8f185c2
institution	Directory Open Access Journal
issn	2096-5796
language	English
last_indexed	2024-12-19T12:19:12Z
publishDate	2021-02-01
publisher	KeAi Communications Co., Ltd.
record_format	Article
series	Virtual Reality & Intelligent Hardware
spelling	doaj.art-b09f9ee55f354c169a77b6f5b8f185c2; 2022-12-21T20:21:51Z; eng; KeAi Communications Co., Ltd.; Virtual Reality & Intelligent Hardware; 2096-5796; 2021-02-01; volume 3, issue 1, pages 43-54; Self-attention transfer networks for speech emotion recognition
	Ziping Zhao: College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
	Zhongtian Bao: College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
	Zixing Zhang: GLAM -- Group on Language, Audio & Music, Imperial College London, UK
	Nicholas Cummins: Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; Department of Biostatistics and Health Informatics, IoPPN, King’s College London, London, UK
	Shihuang Sun: Department of Computer Science and Engineering, Fairfield University, USA
	Haishuai Wang: Department of Computer Science and Engineering, Fairfield University, USA
	Jianhua Tao (corresponding author): National Laboratory of Pattern Recognition, CASIA, Beijing, China
	Björn W. Schuller: College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China; GLAM -- Group on Language, Audio & Music, Imperial College London, UK; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
title	Self-attention transfer networks for speech emotion recognition
topic	Speech emotion recognition; Attention transfer; Self-attention; Temporal convolutional neural networks (TCNs)
url	http://www.sciencedirect.com/science/article/pii/S2096579620301145


Similar Items