Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition

Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction, or in understanding the affective state of users in tasks where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized through several modalities, such as facial expressions, speech, or physiological parameters (e.g., electroencephalograms, electrocardiograms). However, measuring these modalities may be difficult, obtrusive, or may require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work, we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained on raw speech information. In contrast to traditional machine learning approaches, CNNs identify the important features of the input themselves, making hand-crafted feature engineering optional for many tasks. In this paper, no features are required beyond spectrogram representations; hand-crafted features were extracted only to validate our method. Moreover, the approach requires no linguistic model and is not specific to any particular language. We evaluate the proposed approach on cross-language datasets and demonstrate that it outperforms traditional approaches based on hand-crafted features.
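
The abstract describes a two-stage pipeline: raw speech is rendered as a spectrogram image, and a CNN classifies that image directly, learning its own features rather than relying on hand-crafted audio descriptors. The sketch below illustrates the idea only; the window length, network architecture, number of emotion classes, and all function names are assumptions made for illustration, not details taken from the paper.

import numpy as np
import torch
import torch.nn as nn
from scipy.signal import spectrogram

def speech_to_spectrogram(signal: np.ndarray, fs: int = 16000) -> np.ndarray:
    """Convert a raw speech waveform to a log-magnitude spectrogram."""
    # Assumed short-time analysis: 40 ms windows with 50% overlap.
    nperseg = int(0.040 * fs)
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    return np.log(sxx + 1e-10)  # log-compress, avoiding log(0)

class EmotionCNN(nn.Module):
    """Hypothetical CNN that treats the spectrogram as a 1-channel image."""
    def __init__(self, n_classes: int = 4):  # assumed number of emotion classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output for variable-length input
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        return self.classifier(h.flatten(1))  # one score per emotion class

# Usage: one second of random noise stands in for a real utterance.
spec = speech_to_spectrogram(np.random.randn(16000))
x = torch.from_numpy(spec).float()[None, None]  # (batch, channel, freq, time)
logits = EmotionCNN()(x)

The adaptive pooling layer is one simple way to cope with utterances of different durations; cropping fixed-length spectrogram segments, as is common in spectrogram-based classifiers, would serve the same purpose.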

Bibliographic Details
Main Authors: Michalis Papakostas, Evaggelos Spyrou, Theodoros Giannakopoulos, Giorgos Siantikos, Dimitrios Sgouropoulos, Phivos Mylonas, Fillia Makedon
Format: Article
Language: English
Published: MDPI AG, 2017-06-01
Series: Computation, Vol. 5, Issue 2, Article 26
DOI: 10.3390/computation5020026
ISSN: 2079-3197
Subjects: emotion recognition; convolutional neural networks; spectrograms
Online Access: http://www.mdpi.com/2079-3197/5/2/26

Author Affiliations:
Michalis Papakostas: Computer Science and Engineering Department, University of Texas at Arlington, Arlington, TX 76019, USA
Evaggelos Spyrou: Institute of Informatics and Telecommunications, National Center for Scientific Research—“Demokritos”, Athens 15310, Greece
Theodoros Giannakopoulos: Institute of Informatics and Telecommunications, National Center for Scientific Research—“Demokritos”, Athens 15310, Greece
Giorgos Siantikos: Institute of Informatics and Telecommunications, National Center for Scientific Research—“Demokritos”, Athens 15310, Greece
Dimitrios Sgouropoulos: Institute of Informatics and Telecommunications, National Center for Scientific Research—“Demokritos”, Athens 15310, Greece
Phivos Mylonas: Department of Informatics, Ionian University, Corfu 49100, Greece
Fillia Makedon: Computer Science and Engineering Department, University of Texas at Arlington, Arlington, TX 76019, USA