On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy...

Bibliographic Details
Main Authors: Juraj Kacur, Boris Puterka, Jarmila Pavlovicova, Milos Oravec
Format: Article
Language: English
Published: MDPI AG 2021-03-01
Series: Sensors
Subjects: windows; frequency scales; spectrograms; psychoacoustic filter banks; LPC; cepstral features
Online Access: https://www.mdpi.com/1424-8220/21/5/1888
collection DOAJ
description Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning regarding the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, to what extent, etc. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods, e.g., time characteristics (segmentation, window types, and the lengths and overlaps of classification regions), frequency ranges, frequency scales, processing of whole speech (spectrograms), vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, the paired t-test, and rank and Pearson correlations. The results revealed several settings in a 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms carrying both vocal tract and excitation information also scored well. It was found that even basic processing such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
first_indexed 2024-03-09T04:59:09Z
format Article
id doaj.art-7e67c19cad9c452aab601d2da9fb8344
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-09T04:59:09Z
publishDate 2021-03-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-7e67c19cad9c452aab601d2da9fb8344
2023-12-03T13:02:17Z
eng
MDPI AG
Sensors
1424-8220
2021-03-01
21
5
1888
10.3390/s21051888
On the Speech Properties and Feature Extraction Methods in Speech Emotion Recognition
Juraj Kacur
Boris Puterka
Jarmila Pavlovicova
Milos Oravec
Institute of Multimedia Information and Communication Technologies, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
Institute of Robotics and Cybernetics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
Institute of Robotics and Cybernetics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
Institute of Computer Science and Mathematics, Faculty of Electrical Engineering and Information Technology, Slovak University of Technology in Bratislava, 2412 Bratislava, Slovakia
https://www.mdpi.com/1424-8220/21/5/1888
windows
frequency scales
spectrograms
psychoacoustic filter banks
LPC
cepstral features
topic windows
frequency scales
spectrograms
psychoacoustic filter banks
LPC
cepstral features