A web crowdsourcing framework for transfer learning and personalized Speech Emotion Recognition

Bibliographic Details
Main Authors: Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, Charalampos Dimoulas
Format: Article
Language: English
Published: Elsevier 2021-12-01
Series: Machine Learning with Applications
Online Access: http://www.sciencedirect.com/science/article/pii/S2666827021000669
Description
Summary: Speech Emotion Recognition (SER) is an important part of Affective Computing and emotionally aware Human–Computer Interaction. Emotional expression may vary depending on the language, culture, and the speaker’s personality and vocal attributes. Speaker-adaptive systems can address this issue. In real-world applications, it is not feasible to obtain big datasets for deep learning model training from a specific speaker. This paper proposes a transfer learning approach for personalized SER based on convolutional neural networks. A CNN is trained on a multi-user dataset for generalization and is then fine-tuned on a small speaker-specific dataset. A VGGish model, pre-trained on a large-scale dataset for audio event recognition, is also evaluated for the task. This comparison highlights the significance of network capacity, dataset length, and domain-relativity for transfer learning. To enhance the applicability of this approach in real-world conditions, a web crowdsourcing application is implemented. An online platform is provided where contributors can follow a standard procedure to record and submit annotated utterances of emotional speech. The recordings are validated and added to the publicly available AESDD dataset of emotional speech. The platform can be used for the creation of personalized emotional speech datasets for speaker-adaptive SER, following the transfer learning strategies that have been evaluated.
ISSN: 2666-8270
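
The pre-train-then-fine-tune strategy described in the summary can be illustrated with a toy sketch. This is not the authors' implementation: a small logistic-regression model stands in for the CNN, the data are synthetic two-dimensional points rather than audio features, and every name here (`make_data`, `train`, `multi_user`, `speaker`, the `offset` parameter, the learning rates and epoch counts) is a hypothetical choice made for illustration. The only idea it is meant to show is that weights learned on a large multi-user set are reused as the starting point for a short, lower-learning-rate training run on a small speaker-specific set.

```python
import math
import random


def predict(w, x):
    """Sigmoid of a linear score; the last entry of w is the bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, data):
    """Mean binary cross-entropy; eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(w, x)
        total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / len(data)


def train(w, data, lr, epochs):
    """Full-batch gradient descent starting from the given weights."""
    w = list(w)  # copy so the caller's weights are left untouched
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err  # bias gradient
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def make_data(rng, n, offset):
    """Synthetic stand-in for labelled utterances (assumption, not real audio).

    The offset shifts the decision boundary, mimicking a speaker whose
    emotional expression differs from the multi-user average.
    """
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        data.append((x, 1 if x[0] + x[1] > offset else 0))
    return data


rng = random.Random(0)
multi_user = make_data(rng, 500, offset=0.0)  # large, generic dataset
speaker = make_data(rng, 20, offset=1.0)      # small, speaker-specific dataset

# Step 1: pre-train from scratch on the multi-user data.
pretrained = train([0.0, 0.0, 0.0], multi_user, lr=0.5, epochs=200)

# Step 2: fine-tune, starting from the pre-trained weights, on the speaker data.
fine_tuned = train(pretrained, speaker, lr=0.05, epochs=100)

print("speaker loss before fine-tuning:", loss(pretrained, speaker))
print("speaker loss after fine-tuning: ", loss(fine_tuned, speaker))
```

In a real setting the linear model would be replaced by a CNN (or a pre-trained network such as VGGish) and the synthetic points by spectrogram features, but the two-step structure stays the same.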