A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition

Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In rea...

Bibliographic Details
Main Authors: Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, Charalampos Dimoulas
Format: Article
Language: English
Published: Elsevier 2021-12-01
Series: Machine Learning with Applications
Subjects:
Online Access: http://www.sciencedirect.com/science/article/pii/S2666827021000669
_version_ 1819174329439485952
author Nikolaos Vryzas
Lazaros Vrysis
Rigas Kotsakis
Charalampos Dimoulas
author_facet Nikolaos Vryzas
Lazaros Vrysis
Rigas Kotsakis
Charalampos Dimoulas
author_sort Nikolaos Vryzas
collection DOAJ
description Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain large datasets from a specific speaker for training deep learning models. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used to create personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.
first_indexed 2024-12-22T20:37:15Z
format Article
id doaj.art-6c8a5cd387664057ab7709b60c52cb55
institution Directory Open Access Journal
issn 2666-8270
language English
last_indexed 2024-12-22T20:37:15Z
publishDate 2021-12-01
publisher Elsevier
record_format Article
series Machine Learning with Applications
spelling doaj.art-6c8a5cd387664057ab7709b60c52cb55
2022-12-21T18:13:26Z
eng
Elsevier
Machine Learning with Applications
2666-8270
2021-12-01
6
100132
A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
Nikolaos Vryzas [0]
Lazaros Vrysis [1]
Rigas Kotsakis [2]
Charalampos Dimoulas [3]
[0] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece; Corresponding author.
[1] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece
[2] Multidisciplinary Media and Mediated Communication (M3C) Research Group, International Hellenic University, Greece
[3] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece
Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain large datasets from a specific speaker for training deep learning models. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used to create personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.
http://www.sciencedirect.com/science/article/pii/S2666827021000669
Speech Emotion Recognition
Transfer learning
Crowdsourcing
Convolutional neural networks
VGGish
spellingShingle Nikolaos Vryzas
Lazaros Vrysis
Rigas Kotsakis
Charalampos Dimoulas
A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
Machine Learning with Applications
Speech Emotion Recognition
Transfer learning
Crowdsourcing
Convolutional neural networks
VGGish
title A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
title_full A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
title_fullStr A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
title_full_unstemmed A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
title_short A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
title_sort web crowdsourcing framework for transfer learning and personalized speech emotion recognition
topic Speech Emotion Recognition
Transfer learning
Crowdsourcing
Convolutional neural networks
VGGish
url http://www.sciencedirect.com/science/article/pii/S2666827021000669