A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition
Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In rea...
Main Authors: | Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, Charalampos Dimoulas |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2021-12-01 |
Series: | Machine Learning with Applications |
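Subjects: | Speech Emotion Recognition, Transfer learning, Crowdsourcing, Convolutional neural networks, VGGish |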
Online Access: | http://www.sciencedirect.com/science/article/pii/S2666827021000669 |
_version_ | 1819174329439485952 |
---|---|
author | Nikolaos Vryzas Lazaros Vrysis Rigas Kotsakis Charalampos Dimoulas |
author_facet | Nikolaos Vryzas Lazaros Vrysis Rigas Kotsakis Charalampos Dimoulas |
author_sort | Nikolaos Vryzas |
collection | DOAJ |
description | Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain large datasets from a specific speaker for training deep learning models. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used for the creation of personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated. |
first_indexed | 2024-12-22T20:37:15Z |
format | Article |
id | doaj.art-6c8a5cd387664057ab7709b60c52cb55 |
institution | Directory Open Access Journal |
issn | 2666-8270 |
language | English |
last_indexed | 2024-12-22T20:37:15Z |
publishDate | 2021-12-01 |
publisher | Elsevier |
record_format | Article |
series | Machine Learning with Applications |
spelling | doaj.art-6c8a5cd387664057ab7709b60c52cb55; 2022-12-21T18:13:26Z; eng; Elsevier; Machine Learning with Applications; 2666-8270; 2021-12-01; 6; 100132; A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition; Nikolaos Vryzas [0]; Lazaros Vrysis [1]; Rigas Kotsakis [2]; Charalampos Dimoulas [3]; [0] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece (corresponding author); [1] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece; [2] Multidisciplinary Media and Mediated Communication (M3C) Research Group, International Hellenic University, Greece; [3] Multidisciplinary Media and Mediated Communication (M3C) Research Group, Aristotle University of Thessaloniki, Pavillion 1, TIF-HELEXPO, 546 36, Greece; Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain large datasets from a specific speaker for training deep learning models. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used for the creation of personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.; http://www.sciencedirect.com/science/article/pii/S2666827021000669; Speech Emotion Recognition; Transfer learning; Crowdsourcing; Convolutional neural networks; VGGish |
spellingShingle | Nikolaos Vryzas Lazaros Vrysis Rigas Kotsakis Charalampos Dimoulas A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition Machine Learning with Applications Speech Emotion Recognition Transfer learning Crowdsourcing Convolutional neural networks VGGish |
title | A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition |
title_full | A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition |
title_fullStr | A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition |
title_full_unstemmed | A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition |
title_short | A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition |
title_sort | web crowdsourcing framework for transfer learning and personalized speech emotion recognition |
topic | Speech Emotion Recognition Transfer learning Crowdsourcing Convolutional neural networks VGGish |
url | http://www.sciencedirect.com/science/article/pii/S2666827021000669 |