Deep emotion recognition based on audio–visual correlation
Human emotion recognition has been studied by means of unimodal channels over the last decade. However, efforts continue to answer open questions about how different modalities can complement each other. This study proposes a multimodal approach using three‐dimensional (3D) convolutional neural networks...
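The record's description (see the full abstract in the fields below) outlines a two-pipeline design: k‐means clustering picks representative frames for the master modality, the slave modality is attuned to the same temporal window, and the outputs of the two 3D CNN pipelines are fused. The snippet below is a minimal sketch of the keyframe-selection and score-fusion steps only; the function names, the nearest-to-centre selection rule, the value of k, and the equal-weight fusion are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch only -- not the authors' released code. It illustrates the
# two steps named in the abstract: k-means keyframe selection for the master
# (video) pipeline and score-level fusion of the two pipelines' outputs.
import numpy as np
from sklearn.cluster import KMeans


def select_keyframes(frames: np.ndarray, k: int = 9, seed: int = 0) -> np.ndarray:
    """Cluster flattened frames with k-means and return, in temporal order,
    the index of the frame nearest to each cluster centre.
    (The nearest-to-centre rule and k=9 are assumptions, not from the paper.)"""
    X = frames.reshape(len(frames), -1).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])
    return np.sort(np.asarray(picks))


def fuse_scores(p_master: np.ndarray, p_slave: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Late (score-level) fusion of the two pipelines' class probabilities;
    equal weighting is an assumption made for illustration."""
    return w * p_master + (1.0 - w) * p_slave


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((60, 64, 64))          # 60 grey-scale frames (toy data)
    key_idx = select_keyframes(video, k=9)    # master pipeline: video keyframes
    # The slave (audio) stream would be cropped/attuned to the same time span
    # before entering its own 3D CNN; the CNN outputs here are placeholders.
    p_video = rng.dirichlet(np.ones(6))       # 6 emotion classes
    p_audio = rng.dirichlet(np.ones(6))
    print("keyframes:", key_idx,
          "predicted class:", int(fuse_scores(p_video, p_audio).argmax()))
```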
Main Authors: | Noushin Hajarolasvadi, Hasan Demirel |
Format: | Article |
Language: | English |
Published: | Wiley, 2020-10-01 |
Series: | IET Computer Vision |
Subjects: | temporal data alignment; slave pipeline; k‐means clustering; 3D CNN architectures; temporal domain; unimodal channels |
Online Access: | https://doi.org/10.1049/iet-cvi.2020.0013 |
_version_ | 1797684639771394048 |
author | Noushin Hajarolasvadi; Hasan Demirel |
author_facet | Noushin Hajarolasvadi; Hasan Demirel |
author_sort | Noushin Hajarolasvadi |
collection | DOAJ |
description | Human emotion recognition has been studied by means of unimodal channels over the last decade. However, efforts continue to answer open questions about how different modalities can complement each other. This study proposes a multimodal approach using three‐dimensional (3D) convolutional neural networks (CNNs) to model human emotion through a modality‐referenced system while investigating answers to these questions. The proposed modality‐referenced system selects the input data based on one of the modalities, regarded as the reference or master. The other modality, referred to as the slave, simply adjusts or attunes itself to the master in the temporal domain. In this context, the authors developed three multimodal emotion recognition systems, namely a video‐referenced system, an audio‐referenced system, and an audio–visual‐referenced system, to explore the congruence impact of the audio and video modalities on each other. Two pipelines of 3D CNN architectures are employed, where k‐means clustering is used in the master pipeline and the slave pipeline adapts itself in a temporal sense. The outputs of the two pipelines are fused to improve recognition performance. In addition, canonical correlation analysis and t‐distributed stochastic neighbour embedding are used to validate the experiments. Results show that temporal alignment of the data between the two modalities improves the recognition performance significantly. |
first_indexed | 2024-03-12T00:32:38Z |
format | Article |
id | doaj.art-ded5ce8b1c97472d93b103c80476b275 |
institution | Directory Open Access Journal |
issn | 1751-9632; 1751-9640 |
language | English |
last_indexed | 2024-03-12T00:32:38Z |
publishDate | 2020-10-01 |
publisher | Wiley |
record_format | Article |
series | IET Computer Vision |
spelling | doaj.art-ded5ce8b1c97472d93b103c80476b275; 2023-09-15T10:11:27Z; eng; Wiley; IET Computer Vision; 1751-9632; 1751-9640; 2020-10-01; vol. 14, iss. 7, pp. 517–527; 10.1049/iet-cvi.2020.0013; Deep emotion recognition based on audio–visual correlation; Noushin Hajarolasvadi and Hasan Demirel (both: Department of Electrical and Electronic Engineering, Eastern Mediterranean University, Turkey, 10 via Mersin, Nicosia 99628, Cyprus); Human emotion recognition has been studied by means of unimodal channels over the last decade. However, efforts continue to answer open questions about how different modalities can complement each other. This study proposes a multimodal approach using three‐dimensional (3D) convolutional neural networks (CNNs) to model human emotion through a modality‐referenced system while investigating answers to these questions. The proposed modality‐referenced system selects the input data based on one of the modalities, regarded as the reference or master. The other modality, referred to as the slave, simply adjusts or attunes itself to the master in the temporal domain. In this context, the authors developed three multimodal emotion recognition systems, namely a video‐referenced system, an audio‐referenced system, and an audio–visual‐referenced system, to explore the congruence impact of the audio and video modalities on each other. Two pipelines of 3D CNN architectures are employed, where k‐means clustering is used in the master pipeline and the slave pipeline adapts itself in a temporal sense. The outputs of the two pipelines are fused to improve recognition performance. In addition, canonical correlation analysis and t‐distributed stochastic neighbour embedding are used to validate the experiments. Results show that temporal alignment of the data between the two modalities improves the recognition performance significantly.; https://doi.org/10.1049/iet-cvi.2020.0013; temporal data alignment; slave pipeline; k‐means clustering; 3D CNN architectures; temporal domain; unimodal channels |
spellingShingle | Noushin Hajarolasvadi; Hasan Demirel; Deep emotion recognition based on audio–visual correlation; IET Computer Vision; temporal data alignment; slave pipeline; k‐means clustering; 3D CNN architectures; temporal domain; unimodal channels |
title | Deep emotion recognition based on audio–visual correlation |
title_full | Deep emotion recognition based on audio–visual correlation |
title_fullStr | Deep emotion recognition based on audio–visual correlation |
title_full_unstemmed | Deep emotion recognition based on audio–visual correlation |
title_short | Deep emotion recognition based on audio–visual correlation |
title_sort | deep emotion recognition based on audio visual correlation |
topic | temporal data alignment; slave pipeline; k‐means clustering; 3D CNN architectures; temporal domain; unimodal channels |
url | https://doi.org/10.1049/iet-cvi.2020.0013 |
work_keys_str_mv | AT noushinhajarolasvadi deepemotionrecognitionbasedonaudiovisualcorrelation AT hasandemirel deepemotionrecognitionbasedonaudiovisualcorrelation |