ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition


Bibliographic Details
Main Authors: Kamal Zakieldin, Radwa Khattab, Ehab Ibrahim, Esraa Arafat, Nehal Ahmed, Elsayed Hemayed
Format: Article
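Language: English
Published: Springer 2024-03-01
Series: International Journal of Computational Intelligence Systems
Subjects: Emotion-recognition, Computer-vision, Deep-learning, Vision-transformer, Temporal-convolution-network
Online Access: https://doi.org/10.1007/s44196-024-00436-5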
collection DOAJ
description Abstract In Talentino, HR-Solution analyzes candidates’ profiles and conducts interviews. Artificial intelligence is used to analyze the video interviews and recognize the candidate’s expressions during the interview. This paper introduces ViTCN, a combination of Vision Transformer (ViT) and Temporal Convolution Network (TCN), as a novel architecture for detecting and interpreting human emotions and expressions. Human expression recognition contributes widely to the development of human-computer interaction, and a machine’s understanding of human emotions in the real world will contribute considerably to life in the future. Earlier emotion recognition identified emotions from a single frame (image-based) without considering the sequence of frames. The proposed architecture uses a series of frames to identify the emotional expression across the combined sequence over time. The study demonstrates the potential of this method as a viable option for identifying facial expressions during interviews, which could inform hiring decisions. For situations with limited computational resources, the proposed architecture offers a powerful solution for interpreting human facial expressions with a single model and a single GPU. The proposed architecture was validated on the widely used controlled data sets CK+ and MMI, the challenging DAiSEE data set, and the challenging in-the-wild data sets DFEW and AFFWild2. The experimental results show that the proposed method outperforms existing methods on DFEW, AFFWild2, MMI, and DAiSEE. It outperformed other top-performing solutions by accuracy margins of 4.29% on DFEW, 14.41% on AFFWild2, and 7.74% on MMI. It also achieved comparable results on the CK+ data set.
first_indexed 2024-04-24T16:13:35Z
id doaj.art-b55c1777378643238ca644b9eff51340
institution Directory Open Access Journal
issn 1875-6883
last_indexed 2024-04-24T16:13:35Z
affiliation Kamal Zakieldin: Talentino
Radwa Khattab: Talentino
Ehab Ibrahim: Talentino
Esraa Arafat: Talentino
Nehal Ahmed: Ahram Canadian University
Elsayed Hemayed: CIE, Zewail City of Science and Technology
citation International Journal of Computational Intelligence Systems, vol. 17, no. 1, pp. 1-20, 2024-03-01
topic Emotion-recognition
Computer-vision
Deep-learning
Vision-transformer
Temporal-convolution-network