A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition

Bibliographic Details
Main Authors: Xianzhang Pan, Guoliang Ying, Guodong Chen, Hongming Li, Wenshu Li
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
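Subjects: Video-based facial expression recognition; CNNs; deep temporal-spatial features; optical flow; LSTM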
Online Access: https://ieeexplore.ieee.org/document/8674456/
collection DOAJ
description Video-based facial expression recognition is a long-standing problem owing to a gap between visual features and emotions, difficulties in tracking the subtle movement of muscles and limited datasets. The key to solving this problem is to exploit effective features characterizing facial expression to perform facial expression recognition. We propose an effective framework to solve these problems. In our work, both spatial information and temporal information are utilized through the aggregation layer of a framework that fuses two state-of-the-art stream networks. We investigate different strategies for pooling across spatial information and temporal information. We find that it is effective to pool jointly across spatial information and temporal information for video-based facial expression recognition. Our framework is end-to-end trainable for whole-video recognition. In addressing the problem of facial recognition, the main contribution of this project is the design of a novel, trainable deep neural network framework that fuses spatial information and temporal information of video according to CNNs and LSTMs for pattern recognition. The experimental results on two public datasets, i.e., the RML and eNTERFACE05 databases, show that our framework outperforms previous state-of-the-art frameworks.
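The abstract reports that pooling jointly across spatial and temporal information is effective for video-based facial expression recognition, but it gives no implementation details. As an illustration only, the Python sketch below contrasts the two ideas on a toy single-channel feature volume: `max_pool_2d_per_frame` pools each frame on its own, while `max_pool_3d` pools over blocks that span several neighbouring frames as well as neighbouring pixels. Both function names, the `volume[t][h][w]` layout, the choice of max pooling, and the window sizes are hypothetical assumptions, not the authors' design.

```python
from typing import List

# Hypothetical layout (illustrative, not the paper's code): a single-channel
# feature volume indexed as volume[t][h][w], i.e. T frames of H x W maps,
# such as one channel of a CNN's last convolutional layer.
Volume = List[List[List[float]]]


def max_pool_2d_per_frame(volume: Volume, k: int) -> Volume:
    """Spatial-only pooling: every frame is max-pooled on its own with a
    k x k window and stride k, so the number of frames T is unchanged."""
    pooled_frames = []
    for frame in volume:
        height, width = len(frame), len(frame[0])
        pooled_frames.append([
            [max(frame[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, width - k + 1, k)]
            for i in range(0, height - k + 1, k)
        ])
    return pooled_frames


def max_pool_3d(volume: Volume, kt: int, k: int) -> Volume:
    """Joint spatio-temporal pooling: each output cell is the max over a
    kt x k x k block covering kt neighbouring frames and a k x k patch,
    with strides (kt, k, k), so time is downsampled as well as space."""
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [
        [
            [max(volume[t + dt][i + di][j + dj]
                 for dt in range(kt) for di in range(k) for dj in range(k))
             for j in range(0, width - k + 1, k)]
            for i in range(0, height - k + 1, k)
        ]
        for t in range(0, frames - kt + 1, kt)
    ]


# Toy 4-frame, 4 x 4 volume whose value 100*t + 10*h + w encodes its position.
vol = [[[float(100 * t + 10 * h + w) for w in range(4)] for h in range(4)]
       for t in range(4)]

spatial = max_pool_2d_per_frame(vol, k=2)  # 4 frames, each 2 x 2
joint = max_pool_3d(vol, kt=2, k=2)        # 2 temporal slots, each 2 x 2
print(len(spatial), len(joint))            # prints: 4 2
```

The spatial-only variant keeps one pooled map per frame, whereas the joint variant merges each pair of neighbouring frames into a single output. The abstract does not state which pooling operator, window sizes, or strides the authors use, nor how the two stream networks are fused before pooling.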
first_indexed 2024-12-23T23:34:15Z
id doaj.art-43ca07dedc0e442bad5d39a2b4ece2cf
institution Directory of Open Access Journals
issn 2169-3536
last_indexed 2024-12-23T23:34:15Z
volume 7
pages 48807-48815
doi 10.1109/ACCESS.2019.2907271
affiliation Xianzhang Pan (ORCID: https://orcid.org/0000-0003-0469-7178): Institute of Intelligent Information Processing, Taizhou University, Taizhou, China
Guoliang Ying: Information Technology Center, Taizhou University, Taizhou, China
Guodong Chen: Electronics and Information Engineering College, Taizhou University, Taizhou, China
Hongming Li: Information Technology Center, Taizhou University, Taizhou, China
Wenshu Li: College of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, China
topic Video-based facial expression recognition
CNNs
deep temporal-spatial features
optical flow
LSTM