Self-attention transfer networks for speech emotion recognition

Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative repres...


Bibliographic Details
Main Authors: Ziping Zhao, Zhongtian Bao, Zixing Zhang, Nicholas Cummins, Shihuang Sun, Haishuai Wang, Jianhua Tao, Björn W. Schuller
Format: Article
Language: English
Published: KeAi Communications Co., Ltd. 2021-02-01
Series: Virtual Reality & Intelligent Hardware
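Subjects: Speech emotion recognition; Attention transfer; Self-attention; Temporal convolutional neural networks (TCNs)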
Online Access: http://www.sciencedirect.com/science/article/pii/S2096579620301145
_version_ 1818871204043292672
author Ziping Zhao
Zhongtian Bao
Zixing Zhang
Nicholas Cummins
Shihuang Sun
Haishuai Wang
Jianhua Tao
Björn W. Schuller
author_facet Ziping Zhao
Zhongtian Bao
Zixing Zhang
Nicholas Cummins
Shihuang Sun
Haishuai Wang
Jianhua Tao
Björn W. Schuller
author_sort Ziping Zhao
collection DOAJ
description Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the wider application of techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we apply the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism, based on a self-attention algorithm, to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, speech recognition, and then transfers this knowledge to SER. Evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database demonstrates the effectiveness of the proposed model.
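The description names two building blocks that are easy to illustrate: a log-Mel input stacked with deltas and delta-deltas, and self-attention over time frames. The following is a minimal NumPy sketch of those two ideas only, not the authors' SATN. The frame count (100), number of Mel bands (40), projection size (16), the random placeholder input, and the function names `add_deltas` and `self_attention` are all assumptions made for illustration. `np.gradient` is used as a simplified stand-in for the regression-window deltas common in speech toolkits.

```python
import numpy as np


def add_deltas(log_mel):
    """Stack static, delta, and delta-delta features.

    log_mel has shape (frames, mels); the result has shape (3, frames, mels).
    np.gradient is a simple finite-difference approximation (an assumption here),
    not the regression-based delta used by many speech toolkits.
    """
    delta = np.gradient(log_mel, axis=0)   # first-order change along time
    delta2 = np.gradient(delta, axis=0)    # second-order change along time
    return np.stack([log_mel, delta, delta2], axis=0)


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over frames; x has shape (frames, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
log_mel = rng.standard_normal((100, 40))  # placeholder for a real log-Mel spectrogram
features = add_deltas(log_mel)            # three-channel input: static, delta, delta-delta

# Hypothetical projection matrices; in a trained model these would be learned.
w_q, w_k, w_v = (rng.standard_normal((40, 16)) for _ in range(3))
context, weights = self_attention(log_mel, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one frame attends to every other frame, which is the kind of long-range dependency the description refers to. In an attention-transfer setting, attention learned on a source task would be reused for the target task, which this sketch does not attempt.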
first_indexed 2024-12-19T12:19:12Z
format Article
id doaj.art-b09f9ee55f354c169a77b6f5b8f185c2
institution Directory Open Access Journal
issn 2096-5796
language English
last_indexed 2024-12-19T12:19:12Z
publishDate 2021-02-01
publisher KeAi Communications Co., Ltd.
record_format Article
series Virtual Reality & Intelligent Hardware
spelling doaj.art-b09f9ee55f354c169a77b6f5b8f185c2
2022-12-21T20:21:51Z
eng
KeAi Communications Co., Ltd.
Virtual Reality & Intelligent Hardware
2096-5796
2021-02-01
3(1): 43-54
Self-attention transfer networks for speech emotion recognition
Ziping Zhao [0]
Zhongtian Bao [1]
Zixing Zhang [2]
Nicholas Cummins [3]
Shihuang Sun [4]
Haishuai Wang [5]
Jianhua Tao [6]
Björn W. Schuller [7]
[0] College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
[1] College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
[2] GLAM -- Group on Language, Audio & Music, Imperial College London, UK
[3] Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; Department of Biostatistics and Health Informatics, IoPPN, King’s College London, London, UK
[4] Department of Computer Science and Engineering, Fairfield University, USA
[5] Department of Computer Science and Engineering, Fairfield University, USA
[6] National Laboratory of Pattern Recognition, CASIA, Beijing, China; Corresponding author.
[7] College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China; GLAM -- Group on Language, Audio & Music, Imperial College London, UK; Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany
Background: A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck that impedes the wider application of techniques such as deep neural networks. To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Here, we apply the log-Mel spectrogram with deltas and delta-deltas as input. Moreover, given that emotions are time-dependent, we apply Temporal Convolutional Neural Networks (TCNs) to model the variations in emotions. We further introduce an attention transfer mechanism, based on a self-attention algorithm, to learn long-term dependencies. The Self-Attention Transfer Network (SATN) in our proposed approach takes advantage of attention autoencoders to learn attention from a source task, speech recognition, and then transfers this knowledge to SER. Evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database demonstrates the effectiveness of the proposed model.
http://www.sciencedirect.com/science/article/pii/S2096579620301145
Speech emotion recognition
Attention transfer
Self-attention
Temporal convolutional neural networks (TCNs)
spellingShingle Ziping Zhao
Zhongtian Bao
Zixing Zhang
Nicholas Cummins
Shihuang Sun
Haishuai Wang
Jianhua Tao
Björn W. Schuller
Self-attention transfer networks for speech emotion recognition
Virtual Reality & Intelligent Hardware
Speech emotion recognition
Attention transfer
Self-attention
Temporal convolutional neural networks (TCNs)
title Self-attention transfer networks for speech emotion recognition
title_full Self-attention transfer networks for speech emotion recognition
title_fullStr Self-attention transfer networks for speech emotion recognition
title_full_unstemmed Self-attention transfer networks for speech emotion recognition
title_short Self-attention transfer networks for speech emotion recognition
title_sort self attention transfer networks for speech emotion recognition
topic Speech emotion recognition
Attention transfer
Self-attention
Temporal convolutional neural networks (TCNs)
url http://www.sciencedirect.com/science/article/pii/S2096579620301145
work_keys_str_mv AT zipingzhao selfattentiontransfernetworksforspeechemotionrecognition
AT zhongtianbao selfattentiontransfernetworksforspeechemotionrecognition
AT zixingzhang selfattentiontransfernetworksforspeechemotionrecognition
AT nicholascummins selfattentiontransfernetworksforspeechemotionrecognition
AT shihuangsun selfattentiontransfernetworksforspeechemotionrecognition
AT haishuaiwang selfattentiontransfernetworksforspeechemotionrecognition
AT jianhuatao selfattentiontransfernetworksforspeechemotionrecognition
AT bjornwschuller selfattentiontransfernetworksforspeechemotionrecognition