A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition

Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In rea...

Bibliographic Details
Main Authors:	Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, Charalampos Dimoulas
Format:	Article
Language:	English
Published:	Elsevier 2021-12-01
Series:	Machine Learning with Applications
Subjects:	Speech Emotion Recognition; Transfer learning; Crowdsourcing; Convolutional neural networks; VGGish
Online Access:	http://www.sciencedirect.com/science/article/pii/S2666827021000669

author	Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, Charalampos Dimoulas
collection	DOAJ
description	Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain large datasets from a specific speaker for training deep learning models. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used to create personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.
first_indexed	2024-12-22T20:37:15Z
format	Article
id	doaj.art-6c8a5cd387664057ab7709b60c52cb55
institution	Directory of Open Access Journals
issn	2666-8270
language	English
last_indexed	2024-12-22T20:37:15Z
publishDate	2021-12-01
publisher	Elsevier
record_format	Article
series	Machine Learning with Applications
spelling	doaj.art-6c8a5cd387664057ab7709b60c52cb55; 2022-12-21T18:13:26Z; eng; Elsevier; Machine Learning with Applications; 2666-8270; 2021-12-01; 6; 100132; A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition; Nikolaos Vryzas (Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece; corresponding author); Lazaros Vrysis (Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece); Rigas Kotsakis (Multidisciplinary Media and Mediated Communication (M3C) Research Group, International Hellenic University, Greece); Charalampos Dimoulas (Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece); http://www.sciencedirect.com/science/article/pii/S2666827021000669; Speech Emotion Recognition; Transfer learning; Crowdsourcing; Convolutional neural networks; VGGish
title	A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
topic	Speech Emotion Recognition; Transfer learning; Crowdsourcing; Convolutional neural networks; VGGish
url	http://www.sciencedirect.com/science/article/pii/S2666827021000669

Similar Items