End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild


Bibliographic Details
Main Authors: Denis Dresvyanskiy, Elena Ryumina, Heysem Kaya, Maxim Markitantov, Alexey Karpov, Wolfgang Minker
Format: Article
Language: English
Published: MDPI AG 2022-01-01
Series: Multimodal Technologies and Interaction
Subjects:
Online Access: https://www.mdpi.com/2414-4088/6/2/11
_version_ 1827653508768202752
author Denis Dresvyanskiy
Elena Ryumina
Heysem Kaya
Maxim Markitantov
Alexey Karpov
Wolfgang Minker
author_facet Denis Dresvyanskiy
Elena Ryumina
Heysem Kaya
Maxim Markitantov
Alexey Karpov
Wolfgang Minker
author_sort Denis Dresvyanskiy
collection DOAJ
description As emotions play a central role in human communication, automatic emotion recognition has attracted increasing attention in the last two decades. While multimodal systems achieve high performance on lab-controlled data, they are still far from providing ecological validity on non-lab-controlled, namely “in-the-wild”, data. This work investigates audiovisual deep learning approaches to the in-the-wild emotion recognition problem. Inspired by the outstanding performance of end-to-end and transfer learning techniques, we explored the effectiveness of architectures in which a modality-specific Convolutional Neural Network (CNN) is followed by a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), using the AffWild2 dataset under the Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. We deployed unimodal end-to-end and transfer learning approaches within a multimodal fusion system, which generated final predictions using a weighted score fusion scheme. Exploiting the proposed deep-learning-based multimodal system, we reached a test set challenge performance measure of 48.1% on the ABAW 2020 Facial Expressions challenge, surpassing the first-runner-up performance.
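The decision-level combination named in the abstract — a weighted score fusion of the unimodal predictions — can be sketched as below. This is a minimal illustration of the general technique only: the function name, the toy score matrices, and the weight values are assumptions for the example, not the paper's actual models or tuned weights.

```python
import numpy as np

def weighted_score_fusion(scores, weights):
    """Fuse per-modality class-score matrices with scalar weights.

    scores:  list of (n_samples, n_classes) arrays, one per modality.
    weights: list of non-negative scalars, one per modality.
    Returns the fused class predictions (argmax of the weighted sum).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused.argmax(axis=1)

# Toy example: two modalities (audio, video), 3 samples, 2 classes.
audio = np.array([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]])
video = np.array([[0.2, 0.8], [0.2, 0.8], [0.9, 0.1]])
preds = weighted_score_fusion([audio, video], weights=[0.4, 0.6])
```

In a setup like the one described, the per-modality scores would come from the CNN–LSTM branches, and the weights would typically be chosen on a validation set.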
first_indexed 2024-03-09T21:19:48Z
format Article
id doaj.art-741287731d734da4bb44263004d3f2a5
institution Directory Open Access Journal
issn 2414-4088
language English
last_indexed 2024-03-09T21:19:48Z
publishDate 2022-01-01
publisher MDPI AG
record_format Article
series Multimodal Technologies and Interaction
spelling doaj.art-741287731d734da4bb44263004d3f2a5 2023-11-23T21:24:38Z
eng | MDPI AG | Multimodal Technologies and Interaction | ISSN 2414-4088 | 2022-01-01 | Vol. 6, No. 2, Article 11 | DOI: 10.3390/mti6020011
End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
Denis Dresvyanskiy (Dialogue Group, Institute of Communications Engineering, Ulm University, 89081 Ulm, Germany)
Elena Ryumina (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia)
Heysem Kaya (Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, The Netherlands)
Maxim Markitantov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, SPC RAS, 199178 St. Petersburg, Russia)
Alexey Karpov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, SPC RAS, 199178 St. Petersburg, Russia)
Wolfgang Minker (Dialogue Group, Institute of Communications Engineering, Ulm University, 89081 Ulm, Germany)
https://www.mdpi.com/2414-4088/6/2/11
Keywords: affective computing; emotion recognition; deep learning architectures; face processing; multimodal fusion; multimodal representations
spellingShingle Denis Dresvyanskiy
Elena Ryumina
Heysem Kaya
Maxim Markitantov
Alexey Karpov
Wolfgang Minker
End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
Multimodal Technologies and Interaction
affective computing
emotion recognition
deep learning architectures
face processing
multimodal fusion
multimodal representations
title End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
title_full End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
title_fullStr End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
title_full_unstemmed End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
title_short End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
title_sort end to end modeling and transfer learning for audiovisual emotion recognition in the wild
topic affective computing
emotion recognition
deep learning architectures
face processing
multimodal fusion
multimodal representations
url https://www.mdpi.com/2414-4088/6/2/11
work_keys_str_mv AT denisdresvyanskiy endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild
AT elenaryumina endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild
AT heysemkaya endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild
AT maximmarkitantov endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild
AT alexeykarpov endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild
AT wolfgangminker endtoendmodelingandtransferlearningforaudiovisualemotionrecognitioninthewild