Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion


Bibliographic Details
Main Authors: Badriyya B. Al-onazi, Muhammad Asif Nauman, Rashid Jahangir, Muhmmad Mohsin Malik, Eman H. Alkhammash, Ahmed M. Elshewey
Format: Article
Language: English
Published: MDPI AG 2022-09-01
Series: Applied Sciences
Subjects: multilingual; transformer; SER; speech emotion recognition; Arabic vocal emotion; artificial intelligence
Online Access: https://www.mdpi.com/2076-3417/12/18/9188
author Badriyya B. Al-onazi
Muhammad Asif Nauman
Rashid Jahangir
Muhmmad Mohsin Malik
Eman H. Alkhammash
Ahmed M. Elshewey
collection DOAJ
description In recent years, data science has been applied in a variety of real-life domains such as human-computer interaction, computer gaming, mobile services, and emotion evaluation. Among this wide range of applications, speech emotion recognition (SER) is an emerging and challenging research topic. For SER, recent studies used handcrafted features that produce good results in controlled settings but fail to maintain accuracy in complex scenarios. Later, deep learning techniques that automatically learn features from speech signals were adopted for SER. Deep learning-based SER techniques improve accuracy, yet significant gaps remain in the reported methods; in particular, studies using lightweight CNNs failed to learn optimal features from composite acoustic signals. This study proposes a novel SER model to overcome these limitations, focusing on Arabic vocal emotions, which have received relatively little research attention. The proposed model performs data augmentation before feature extraction, and the 273 derived features are fed to a transformer model for emotion recognition. The model was evaluated on four datasets: BAVED, EMO-DB, SAVEE, and EMOVO. The experimental findings demonstrate the robust performance of the proposed model compared to existing techniques, with accuracies of 95.2%, 93.4%, 85.1%, and 91.7% on the BAVED, EMO-DB, SAVEE, and EMOVO datasets, respectively. The highest accuracy was obtained on the BAVED dataset, indicating that the proposed model is well suited to Arabic vocal emotions.
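The sketch below is a minimal, hypothetical illustration of the kind of pipeline the abstract describes (waveform augmentation, handcrafted acoustic feature extraction, and a transformer classifier). The library choices (librosa, PyTorch), the specific augmentations and features, and all hyperparameters are assumptions for illustration only; they do not reproduce the authors' exact 273-feature configuration or training setup.

# Hypothetical sketch of an augmentation -> feature extraction -> transformer pipeline.
# Not the authors' implementation; libraries and settings are assumptions.
import numpy as np
import librosa
import torch
import torch.nn as nn

def augment(y, sr):
    """Return simple augmented copies of a waveform (additive noise, pitch shift)."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [y, noisy, shifted]

def extract_features(y, sr):
    """Concatenate frame-averaged MFCC, chroma, and mel features into one vector.
    Yields 40 + 12 + 128 = 180 dimensions here; the paper's 273-feature set differs."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)
    mel = librosa.feature.melspectrogram(y=y, sr=sr).mean(axis=1)
    return np.concatenate([mfcc, chroma, mel])

class SERTransformer(nn.Module):
    """Small transformer encoder that classifies a single feature vector."""
    def __init__(self, n_features, n_classes, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, n_features)
        z = self.proj(x).unsqueeze(1)      # treat the feature vector as one token
        z = self.encoder(z).squeeze(1)
        return self.head(z)                # (batch, n_classes) emotion logits

# Example usage (assumes a local "sample.wav" file):
# y, sr = librosa.load("sample.wav", sr=16000)
# feats = np.stack([extract_features(a, sr) for a in augment(y, sr)])
# logits = SERTransformer(feats.shape[1], n_classes=7)(torch.tensor(feats, dtype=torch.float32))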
format Article
id doaj.art-97b4929867bc4d81aeff390896839751
institution Directory Open Access Journal
issn 2076-3417
language English
publishDate 2022-09-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling Applied Sciences, vol. 12, no. 18, article 9188, MDPI AG, 2022-09-01, ISSN 2076-3417, doi:10.3390/app12189188
affiliation Badriyya B. Al-onazi: Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
affiliation Muhammad Asif Nauman: Department of Computer Science, University of Engineering and Technology, Lahore 54890, Pakistan
affiliation Rashid Jahangir: Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari 61100, Pakistan
affiliation Muhmmad Mohsin Malik: Department of Interdisciplinary Field, National University of Medical Sciences, Rawalpindi 46000, Pakistan
affiliation Eman H. Alkhammash: Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
affiliation Ahmed M. Elshewey: Department of Computer Science, Faculty of Computers and Information, Suez University, Suez, Egypt
title Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion
topic multilingual
transformer
SER
speech emotion recognition
Arabic vocal emotion
artificial intelligence
url https://www.mdpi.com/2076-3417/12/18/9188