ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition

Abstract In Talentino, HR-Solution analyzes candidates’ profiles and conducts interviews. Artificial intelligence is used to analyze the video interviews and recognize the candidate’s expressions during the interview. This paper introduces ViTCN, a combination of Vision Transformer (ViT) and Tempora...

Bibliographic Details
Main Authors:	Kamal Zakieldin, Radwa Khattab, Ehab Ibrahim, Esraa Arafat, Nehal Ahmed, Elsayed Hemayed
Format:	Article
Language:	English
Published:	Springer 2024-03-01
Series:	International Journal of Computational Intelligence Systems
Subjects:	Emotion-recognition Computer-vision Deep-learning Vision-transformer Temporal-convolution-network
Online Access:	https://doi.org/10.1007/s44196-024-00436-5

_version_	1827300945960108032
author	Kamal Zakieldin Radwa Khattab Ehab Ibrahim Esraa Arafat Nehal Ahmed Elsayed Hemayed
author_sort	Kamal Zakieldin
collection	DOAJ
description	Abstract In Talentino, HR-Solution analyzes candidates’ profiles and conducts interviews. Artificial intelligence is used to analyze the video interviews and recognize the candidate’s expressions during the interview. This paper introduces ViTCN, a combination of Vision Transformer (ViT) and Temporal Convolution Network (TCN), as a novel architecture for detecting and interpreting human emotions and expressions. Human expression recognition contributes widely to the development of human-computer interaction, and a machine’s understanding of human emotions in the real world will contribute considerably to life in the future. Earlier emotion recognition identified emotions from a single frame (image-based) without considering the sequence of frames. The proposed architecture uses a series of frames to identify the emotional expression across the combined sequence over time. The study demonstrates the potential of this method as a viable option for identifying facial expressions during interviews, which could inform hiring decisions. For situations with limited computational resources, the proposed architecture interprets human facial expressions with a single model and a single GPU. The proposed architecture was validated on the widely used controlled data sets CK+ and MMI, the challenging DAiSEE data set, and the challenging in-the-wild data sets DFEW and AFFWild2. The experimental results show that the proposed method outperforms existing methods on DFEW, AFFWild2, MMI, and DAiSEE. It outperformed other top-performing solutions by accuracy margins of 4.29% on DFEW, 14.41% on AFFWild2, and 7.74% on MMI, and achieved comparable results on the CK+ data set.
first_indexed	2024-04-24T16:13:35Z
format	Article
id	doaj.art-b55c1777378643238ca644b9eff51340
institution	Directory Open Access Journal
issn	1875-6883
language	English
last_indexed	2024-04-24T16:13:35Z
publishDate	2024-03-01
publisher	Springer
record_format	Article
series	International Journal of Computational Intelligence Systems
spelling	doaj.art-b55c1777378643238ca644b9eff51340 | 2024-03-31T11:34:45Z | eng | Springer | International Journal of Computational Intelligence Systems | 1875-6883 | 2024-03-01 | 171120 | 10.1007/s44196-024-00436-5 | ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition | Kamal Zakieldin (Talentino), Radwa Khattab (Talentino), Ehab Ibrahim (Talentino), Esraa Arafat (Talentino), Nehal Ahmed (Ahram Canadian University), Elsayed Hemayed (CIE, Zewail City of Science and Technology) | https://doi.org/10.1007/s44196-024-00436-5 | Emotion-recognition, Computer-vision, Deep-learning, Vision-transformer, Temporal-convolution-network
title	ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition
topic	Emotion-recognition Computer-vision Deep-learning Vision-transformer Temporal-convolution-network
url	https://doi.org/10.1007/s44196-024-00436-5

ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition
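
The abstract describes combining per-frame features with a temporal convolution over a sequence of frames. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes each frame has already been reduced to an embedding vector (standing in for ViT output), and every name, kernel, and weight here (`causal_dilated_conv`, `temporal_pool`, `classify_clip`, `class_weights`) is hypothetical.

```python
from typing import List

Vector = List[float]


def causal_dilated_conv(frames: List[Vector], kernel: List[float], dilation: int) -> List[Vector]:
    """Mix each frame embedding with earlier frames only, spaced by `dilation`."""
    out: List[Vector] = []
    for t in range(len(frames)):
        mixed = [0.0] * len(frames[t])
        for k, weight in enumerate(kernel):
            src = t - k * dilation
            if src < 0:
                continue  # frames before the clip start act as zero padding
            for d, value in enumerate(frames[src]):
                mixed[d] += weight * value
        out.append(mixed)
    return out


def temporal_pool(frames: List[Vector]) -> Vector:
    """Average over time to obtain one clip-level vector."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def classify_clip(frame_embeddings: List[Vector], class_weights: List[Vector]) -> int:
    """Two stacked temporal layers, pooling, then a linear score per class."""
    # Assumed kernels and dilations; a real TCN would learn these weights.
    hidden = causal_dilated_conv(frame_embeddings, kernel=[0.5, 0.5], dilation=1)
    hidden = causal_dilated_conv(hidden, kernel=[0.5, 0.5], dilation=2)
    clip = temporal_pool(hidden)
    scores = [sum(w * x for w, x in zip(row, clip)) for row in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Four toy "frame embeddings" and two toy classes.
    clip = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print(classify_clip(clip, class_weights=[[0.0, 1.0], [1.0, 0.0]]))
```

Stacking layers with growing dilation lets later frames draw on a longer history without enlarging the kernel, which is the usual motivation for temporal convolution over frame sequences.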

Similar Items