Human pose estimation and action recognition based on monocular video inputs

This thesis presents the research work for the PhD program that focuses on investigating the main processes in human motion analysis from a single video camera, which include: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active...

Bibliographic Details
Main Author:	Leong, Mei Chee
Other Authors:	Lee Yong Tsui
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2020
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access:	https://hdl.handle.net/10356/136596

author	Leong, Mei Chee
author2	Lee Yong Tsui
author_facet	Lee Yong Tsui; Leong, Mei Chee
author_sort	Leong, Mei Chee
collection	NTU
description	This thesis presents PhD research investigating the main processes in human motion analysis from a single video camera, which include: 1) 3D human pose estimation, 2) motion reconstruction, and 3) action recognition. Motion capture and analysis is an active research area in computer vision with a wide range of applications. However, recovering human motion from monocular video poses major challenges because depth information is lacking. Single-frame pose estimation and tracking methods have difficulty recovering from tracking failures and are prone to motion jitter. To address these limitations of per-frame pose estimation and tracking, we propose to directly estimate a sequence of poses from a stack of consecutive frames. We use an example-based method with dense spatio-temporal features to find the best-matching poses and then perform interpolation to achieve smooth motion reconstruction. For the action recognition task, we use a learning-based method, specifically deep learning with a Convolutional Neural Network (CNN), to learn spatio-temporal features effectively and identify different action classes in large video datasets. In the initial study, a number of experiments were conducted to evaluate the effectiveness of configurations in our architecture, followed by an extended study on deeper models. Lastly, we developed a generalized architecture that fuses 1D, 2D and 3D convolution layers and can be adopted in existing CNN models while retaining the network’s learning properties. Our empirical studies demonstrated the advantages of our architecture over its corresponding 3D CNN models: 1) a 16–30% improvement in prediction accuracy, 2) effective spatio-temporal learning, and 3) lower computational cost.
The future goal of this project is to link all the main processes in this research work into a full-pipeline human motion analysis system that can be applied in real-life settings such as healthcare or sports analysis.
first_indexed	2024-10-01T06:43:07Z
format	Thesis-Doctor of Philosophy
id	ntu-10356/136596
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T06:43:07Z
publishDate	2020
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/136596 2020-11-01T04:57:44Z Human pose estimation and action recognition based on monocular video inputs Leong, Mei Chee Lee Yong Tsui Lin Feng Interdisciplinary Graduate School (IGS) mytlee@ntu.edu.sg, asflin@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2020-01-06T05:41:11Z 2020-01-06T05:41:11Z 2019 Thesis-Doctor of Philosophy Leong, M. C. (2019). Human pose estimation and action recognition based on monocular video inputs. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/136596 10.32657/10356/136596 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
title	Human pose estimation and action recognition based on monocular video inputs
topic	Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision; Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
url	https://hdl.handle.net/10356/136596

Similar Items