Affective Latent Representation of Acoustic and Lexical Features for Emotion Recognition
In this paper, we propose a novel emotion recognition method based on the underlying emotional characteristics extracted from a conditional adversarial auto-encoder (CAAE), in which both acoustic and lexical features are used as inputs. The acoustic features are generated by calculating statistical...
Main Authors: | Eesung Kim, Hyungchan Song, Jong Won Shin
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-05-01
Series: | Sensors |
Subjects: | emotion recognition; conditional adversarial autoencoder; latent representation
Online Access: | https://www.mdpi.com/1424-8220/20/9/2614 |
author | Eesung Kim; Hyungchan Song; Jong Won Shin
collection | DOAJ |
description | In this paper, we propose a novel emotion recognition method based on the underlying emotional characteristics extracted by a conditional adversarial auto-encoder (CAAE), in which both acoustic and lexical features are used as inputs. The acoustic features are generated by calculating statistical functionals of low-level descriptors and by a deep neural network (DNN). These acoustic features are concatenated with three types of lexical features extracted from the text: a sparse representation, a distributed representation, and affective lexicon-based dimensions. Two-dimensional latent representations similar to vectors in the valence-arousal space are obtained by the CAAE and can be directly mapped into the emotional classes without the need for a sophisticated classifier. In contrast to a previous approach that applied a CAAE to acoustic features only, the proposed approach enhances emotion recognition performance because the combined acoustic and lexical features provide sufficient discriminative power. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus showed that our method outperformed the previously reported best results on the same corpus, achieving 76.72% unweighted average recall. |
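To make the pipeline sketched in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of a conditional adversarial autoencoder that maps concatenated acoustic and lexical features to a two-dimensional latent space. The feature dimensions, layer sizes, number of emotion classes, and Gaussian prior are assumptions chosen for illustration, not the settings reported in the paper.

```python
# Illustrative sketch only: a conditional adversarial autoencoder (CAAE) mapping
# concatenated acoustic + lexical features to a 2-D (valence/arousal-like) latent.
# Dimensions, layer widths, class count, and the Gaussian prior are assumptions.
import torch
import torch.nn as nn

ACOUSTIC_DIM, LEXICAL_DIM, LATENT_DIM, NUM_CLASSES = 384, 300, 2, 4

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ACOUSTIC_DIM + LEXICAL_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM))          # 2-D latent representation
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACOUSTIC_DIM + LEXICAL_DIM))
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores latent codes against a class-conditioned prior (adversarial regularization)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, z, onehot_label):
        return self.net(torch.cat([z, onehot_label], dim=-1))

# Forward pass on a dummy batch: concatenate acoustic and lexical features,
# encode to a 2-D latent, reconstruct, and score the latent against the prior.
acoustic = torch.randn(8, ACOUSTIC_DIM)   # e.g., functionals of LLDs / DNN features
lexical = torch.randn(8, LEXICAL_DIM)     # e.g., sparse / distributed / lexicon features
labels = torch.nn.functional.one_hot(
    torch.randint(0, NUM_CLASSES, (8,)), NUM_CLASSES).float()

enc, dec, disc = Encoder(), Decoder(), Discriminator()
z = enc(torch.cat([acoustic, lexical], dim=-1))
recon = dec(z)
score_fake = disc(z, labels)              # encoder output, conditioned on emotion label
prior_z = torch.randn(8, LATENT_DIM)      # assumed Gaussian prior; the paper may differ
score_real = disc(prior_z, labels)
```

Because the latent space is constrained to two dimensions, class decisions can be read off directly from the latent coordinates, which is the point the abstract makes about not needing a sophisticated classifier.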
format | Article |
id | doaj.art-ab299de6e82b40b18004eee76b471ccb |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
publishDate | 2020-05-01 |
publisher | MDPI AG |
series | Sensors |
doi | 10.3390/s20092614
citation | Sensors, Vol. 20, Iss. 9, Art. No. 2614 (2020-05-01)
affiliation (Eesung Kim) | AI R&D Team, Kakao Enterprise, 235, Pangyoyeok-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13494, Korea
affiliation (Hyungchan Song) | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju 61005, Korea
affiliation (Jong Won Shin) | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju 61005, Korea
title | Affective Latent Representation of Acoustic and Lexical Features for Emotion Recognition |
topic | emotion recognition; conditional adversarial autoencoder; latent representation
url | https://www.mdpi.com/1424-8220/20/9/2614 |