End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems enjoy high performance on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely “in-the-wild”, data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, which surpasses the first-runner-up performance.
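The abstract describes a per-modality CNN whose frame-level embeddings are modeled temporally by an LSTM-RNN. The following is a minimal illustrative sketch of that kind of pipeline, not the authors' implementation: the backbone layers, dimensions, and the seven-class output are assumptions made here for concreteness.

```python
# Illustrative CNN -> LSTM sketch (assumed architecture, not the paper's code).
import torch
import torch.nn as nn

class CnnLstmEmotionModel(nn.Module):
    def __init__(self, num_classes: int = 7, embed_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        # Placeholder CNN backbone producing one embedding per frame;
        # the paper's actual (pre-trained) backbone may differ.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # LSTM models the temporal dynamics of the per-frame embeddings.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)      # (batch, time, hidden_dim)
        return self.head(out[:, -1])   # class logits from the last time step

logits = CnnLstmEmotionModel()(torch.randn(2, 16, 3, 64, 64))  # shape (2, 7)
```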
| Main Authors: | Denis Dresvyanskiy, Elena Ryumina, Heysem Kaya, Maxim Markitantov, Alexey Karpov, Wolfgang Minker |
| --- | --- |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2022-01-01 |
| Series: | Multimodal Technologies and Interaction |
| Subjects: | affective computing; emotion recognition; deep learning architectures; face processing; multimodal fusion; multimodal representations |
| Online Access: | https://www.mdpi.com/2414-4088/6/2/11 |
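The abstract also names a weighted score fusion scheme for combining the unimodal models' predictions. Below is a minimal sketch of that idea, assuming per-model class-probability matrices; the weights, class count, and score values are illustrative, not taken from the paper.

```python
# Weighted score fusion sketch (illustrative weights and scores).
import numpy as np

def weighted_score_fusion(scores: list, weights: list) -> np.ndarray:
    """Combine per-model score matrices of shape (n_samples, n_classes)
    via a convex combination, then take the fused class decision."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize so the weights sum to 1
    fused = sum(wi * s for wi, s in zip(w, scores))
    return fused.argmax(axis=1)           # fused class predictions

# Hypothetical unimodal outputs for two samples over three classes.
audio = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
video = np.array([[0.1, 0.7, 0.2], [0.5, 0.2, 0.3]])
print(weighted_score_fusion([audio, video], [0.4, 0.6]))  # -> [1 0]
```

The weights act as per-modality trust factors; in challenge systems they are typically tuned on a validation set rather than fixed a priori.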
author | Denis Dresvyanskiy; Elena Ryumina; Heysem Kaya; Maxim Markitantov; Alexey Karpov; Wolfgang Minker
collection | DOAJ |
description | As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems enjoy high performance on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely “in-the-wild”, data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, which surpasses the first-runner-up performance.
format | Article |
id | doaj.art-741287731d734da4bb44263004d3f2a5 |
institution | Directory Open Access Journal |
issn | 2414-4088 |
language | English |
publishDate | 2022-01-01 |
publisher | MDPI AG |
series | Multimodal Technologies and Interaction |
doi | 10.3390/mti6020011
citation | Multimodal Technologies and Interaction, vol. 6, no. 2, article 11, 2022-01-01
author_affiliations | Denis Dresvyanskiy and Wolfgang Minker: Dialogue Group, Institute of Communications Engineering, Ulm University, 89081 Ulm, Germany; Elena Ryumina, Maxim Markitantov, and Alexey Karpov: St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia; Heysem Kaya: Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, The Netherlands
title | End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild |
topic | affective computing; emotion recognition; deep learning architectures; face processing; multimodal fusion; multimodal representations
url | https://www.mdpi.com/2414-4088/6/2/11 |