Fusion-ConvBERT: Parallel Convolution and BERT Fusion for Speech Emotion Recognition

Speech emotion recognition predicts the emotional state of a speaker from the person's speech. It adds an element for creating more natural human–computer interactions. Earlier studies on emotion recognition were based primarily on handcrafted features and manual labels. With the advent of deep learning, there have been efforts to apply deep-network-based approaches to emotion recognition. Because deep learning automatically extracts salient features correlated with speaker emotion, it offers certain advantages over handcrafted-feature-based methods. Applying deep networks to emotion recognition poses challenges, however, because the data required to train them properly are often lacking. There is therefore a need for a new deep-learning-based approach that exploits the information available in a given speech signal to the maximum extent possible. The proposed method, called "Fusion-ConvBERT", is a parallel fusion model consisting of bidirectional encoder representations from transformers (BERT) and convolutional neural networks (CNNs). Extensive experiments were conducted on the proposed model using the EMO-DB and Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotion corpora, and the proposed method outperformed state-of-the-art techniques in most of the test configurations.
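
The abstract describes Fusion-ConvBERT only at a high level, as a parallel fusion of a BERT-style transformer encoder and CNNs. The sketch below illustrates that general idea in PyTorch; it is not the paper's actual architecture. The log-mel spectrogram input, layer sizes, fusion-by-concatenation step, and class count are all illustrative assumptions; see the article at the URL below for the real configuration.

```python
# Illustrative sketch of a parallel CNN + transformer-encoder fusion for
# speech emotion recognition. All architectural details here are assumptions
# made for illustration, not the published Fusion-ConvBERT configuration.
import torch
import torch.nn as nn


class ParallelFusionSER(nn.Module):
    def __init__(self, n_mels: int = 64, d_model: int = 128, n_classes: int = 7):
        super().__init__()
        # CNN branch: treats the log-mel spectrogram as a 2-D "image" and
        # extracts local time-frequency (spatial) features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 64, 1, 1)
        )
        # Transformer branch: treats each spectrogram frame as a token and
        # models long-range temporal context (a stand-in for the BERT encoder).
        self.frame_proj = nn.Linear(n_mels, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Parallel (late) fusion: concatenate both branch embeddings, classify.
        self.classifier = nn.Linear(64 + d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_mels) log-mel spectrogram
        cnn_feat = self.cnn(spec.unsqueeze(1)).flatten(1)    # (batch, 64)
        tokens = self.frame_proj(spec)                       # (batch, frames, d_model)
        temporal = self.transformer(tokens).mean(dim=1)      # (batch, d_model)
        fused = torch.cat([cnn_feat, temporal], dim=1)       # parallel fusion
        return self.classifier(fused)                        # emotion logits


# Example: a batch of 8 utterances, 300 frames each, 64 mel bins,
# and 7 emotion classes (as in EMO-DB).
logits = ParallelFusionSER()(torch.randn(8, 300, 64))
print(logits.shape)  # torch.Size([8, 7])
```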

Bibliographic Details
Main Authors: Sanghyun Lee (Department of Electronics and Electrical Engineering, Korea University, Seoul 136-713, Korea), David K. Han (Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA), Hanseok Ko (Department of Electronics and Electrical Engineering, Korea University, Seoul 136-713, Korea)
Format: Article
Language: English
Published: MDPI AG, 2020-11-01
Series: Sensors, Vol. 20, Issue 22 (2020), Article 6688
DOI: 10.3390/s20226688
ISSN: 1424-8220
Subjects: speech emotion recognition; bidirectional encoder representations from transformers (BERT); convolutional neural networks (CNNs); transformer; representation; spatiotemporal representation
Online Access: https://www.mdpi.com/1424-8220/20/22/6688