Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired when emotions are expressed. It is crucial for affect-related human-machine interaction systems, as it enables machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from the audio and visual modalities. Although progress has been made by previous works, most of them ignore the common information between audio and visual data during feature learning, which may limit performance, since the two modalities are highly correlated in terms of their emotional content. To address this issue, we propose a deep learning approach that efficiently utilizes common information for audio-visual emotion recognition via correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from the audio and visual data, respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained with a joint loss combining (i) a correlation loss based on Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between the audio data, the visual data, and the corresponding emotion labels, and (ii) a classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. Experimental results on the eNTERFACE’05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.

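For a concrete picture of the training objective described above, the following is a minimal, hypothetical PyTorch sketch (not the authors' code) of a joint loss pairing an HGR-style correlation term with a classification term. The HGR maximal correlation between X and Y is the supremum of E[f(X)g(Y)] over zero-mean, unit-variance feature functions f and g; this sketch enforces those constraints by centering plus a soft covariance penalty and, for brevity, correlates only the audio and visual features, whereas the paper also involves the emotion labels. All network shapes, the number of emotion classes, and the weight alpha are illustrative assumptions.

```python
# Hypothetical sketch of the joint loss described in the abstract
# (illustrative only; not the authors' implementation).
import torch
import torch.nn as nn

def hgr_correlation_loss(f, g):
    """Negative HGR-style correlation between feature batches f and g.

    f, g: (batch, dim) tensors from the audio and visual networks.
    The HGR constraints (zero mean, identity covariance) are enforced
    by centering plus a soft penalty on the feature covariances.
    """
    n, d = f.shape
    f = f - f.mean(dim=0, keepdim=True)            # zero-mean features
    g = g - g.mean(dim=0, keepdim=True)
    corr = (f * g).sum() / n                       # ~ E[f(X)^T g(Y)]
    cov_f = f.T @ f / (n - 1)                      # feature covariances
    cov_g = g.T @ g / (n - 1)
    eye = torch.eye(d, device=f.device)
    penalty = ((cov_f - eye) ** 2).sum() + ((cov_g - eye) ** 2).sum()
    return -corr + penalty                         # maximize corr, keep cov ~ I

# Stand-in networks; all input/feature sizes are assumptions.
audio_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
visual_net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 32))
fusion_net = nn.Linear(64, 6)                      # 6 emotion classes (assumed)
ce = nn.CrossEntropyLoss()

def joint_loss(audio_x, visual_x, labels, alpha=0.1):
    f = audio_net(audio_x)                         # audio features
    g = visual_net(visual_x)                       # visual features
    logits = fusion_net(torch.cat([f, g], dim=1))  # fused prediction
    # classification loss plus weighted correlation loss (alpha assumed)
    return ce(logits, labels) + alpha * hgr_correlation_loss(f, g)

# Toy usage with random data:
audio_x = torch.randn(8, 128)
visual_x = torch.randn(8, 256)
labels = torch.randint(0, 6, (8,))
joint_loss(audio_x, visual_x, labels).backward()
```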
Bibliographic Details
Main Authors: Fei Ma, Wei Zhang, Yang Li, Shao-Lun Huang, Lin Zhang
Affiliation: Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen 518055, China (all authors)
Format: Article
Language: English
Published: MDPI AG, 2020-10-01
Series: Applied Sciences, Vol. 10, Issue 20, Article 7239
ISSN: 2076-3417
DOI: 10.3390/app10207239
Subjects: audio-visual emotion recognition; common information; HGR maximal correlation; semi-supervised learning
Online Access: https://www.mdpi.com/2076-3417/10/20/7239