XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition Using Deep Learning

Speech is a powerful means of expressing thoughts, emotions, and perspectives. However, accurately determining the emotions conveyed through speech remains a challenging task. Existing manual methods for analyzing speech to recognize emotions are prone to errors, limiting our understanding and respo...


Bibliographic Details
Main Authors:	Raheel Ahmad, Arshad Iqbal, Muhammad Mohsin Jadoon, Naveed Ahmad, Yasir Javed
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Machine learning; deep learning; speech emotion recognition (SER); random forest (RF); logistic regression (LR); decision tree (DT)
Online Access:	https://ieeexplore.ieee.org/document/10466764/

author	Raheel Ahmad; Arshad Iqbal; Muhammad Mohsin Jadoon; Naveed Ahmad; Yasir Javed
author_sort	Raheel Ahmad
collection	DOAJ
description	Speech is a powerful means of expressing thoughts, emotions, and perspectives. However, accurately determining the emotions conveyed through speech remains a challenging task. Existing manual methods for analyzing speech to recognize emotions are prone to errors, limiting our understanding of and response to individuals’ emotional states. To address diverse accents, an automated system capable of real-time emotion prediction from human speech is needed. This paper introduces a speech emotion recognition (SER) system that leverages supervised learning techniques to tackle cross-accent diversity. Distinctively, the system extracts a comprehensive set of nine speech features for refined speech signal processing and recognition: Zero Crossing Rate, Mel Spectrum, Pitch, Root Mean Square values, Mel Frequency Cepstral Coefficients, chroma-stft, and three spectral features (Centroid, Contrast, and Roll-off). Seven machine learning models are employed (Random Forest, Logistic Regression, Decision Tree, Support Vector Machines, Gaussian Naive Bayes, K-Nearest Neighbors, and ensemble learning), together with four individual and hybrid deep learning models, including Long Short-Term Memory (LSTM) and 1-Dimensional Convolutional Neural Network (1D-CNN) with stratified cross-validation. Audio samples from diverse English regions are combined to train the models. The performance evaluation indicates that the Random Forest-based feature selection model achieves the highest accuracy among the conventional machine learning models, up to 76%, while the 1D-CNN model with stratified cross-validation reaches up to 99% accuracy. The proposed framework improves cross-accent emotion recognition accuracy to as much as 86.3%, 89.87%, 90.27%, and 84.96%, by margins of 14.71%, 10.15%, 9.6%, and 16.52%, respectively.
first_indexed	2024-04-24T18:55:43Z
format	Article
id	doaj.art-e7281fbe1d6040cfaf194128b18621d6
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-04-24T18:55:43Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-e7281fbe1d6040cfaf194128b18621d6; 2024-03-26T17:44:11Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12; pp. 41125-41142; DOI 10.1109/ACCESS.2024.3376379; article no. 10466764; XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition Using Deep Learning; Raheel Ahmad [0] (https://orcid.org/0009-0002-1294-2903); Arshad Iqbal [1] (https://orcid.org/0000-0002-7189-5564); Muhammad Mohsin Jadoon [2] (https://orcid.org/0000-0002-8242-4581); Naveed Ahmad [3] (https://orcid.org/0000-0003-2941-9780); Yasir Javed [4] (https://orcid.org/0000-0002-6311-027X); affiliations in record order: [0] Sino-Pak Center for Artificial Intelligence (SPCAI), Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Mang, Haripur, Pakistan; [1] Sino-Pak Center for Artificial Intelligence (SPCAI), Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Mang, Haripur, Pakistan; [2] Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia; [3] Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia; [4] Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia
title	XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition Using Deep Learning
topic	Machine learning; deep learning; speech emotion recognition (SER); random forest (RF); logistic regression (LR); decision tree (DT)
url	https://ieeexplore.ieee.org/document/10466764/
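
The description names Zero Crossing Rate and Root Mean Square values among the extracted features and says the models are evaluated with stratified cross-validation. As a rough illustration only, the following pure-Python sketch computes those two features per frame and builds stratified folds. The function names, the framing parameters, and the round-robin fold assignment are assumptions made for this example; the record does not say how the authors implemented either step.

```python
import math
from collections import defaultdict


def frames(signal, frame_len, hop):
    """Yield successive frames of frame_len samples, advancing by hop samples."""
    for start in range(0, len(signal) - frame_len + 1, hop):
        yield signal[start:start + frame_len]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0 counts as non-negative)."""
    if len(frame) < 2:
        raise ValueError("a frame needs at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root mean square amplitude of one frame."""
    if not frame:
        raise ValueError("a frame needs at least one sample")
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions.

    Indices of each class are dealt round-robin across the folds, so every
    fold receives roughly 1/k of each emotion class. No shuffling is done
    here (an assumption to keep the sketch deterministic).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)
    return folds
```

In use, each fold in turn would serve as the held-out set while the remaining folds train the model; per-frame feature values would typically be summarized (for example by their mean) before being passed to a classifier, which is again an assumption rather than something the record states.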


Similar Items