Human action recognition by embedding silhouettes and visual words


Bibliographic Details
Main Author: Saghafi Khadem, Behrouz
Other Authors: Deepu Rajan
Format: Thesis
Language: English
Published: 2013
Subjects:
Online Access:https://hdl.handle.net/10356/54952
description With the availability of cheap video recording devices, fast internet access and huge storage, the corpus of accessible video has grown tremendously over the last few years. Processing these videos for end-user tasks such as video retrieval, human-computer interaction (HCI) and biometrics requires automatic understanding of the video content. Human action recognition is one aspect of video understanding that is useful in surveillance, behavioral analysis and HCI. Although this problem has been studied for many years, challenges remain, including cluttered backgrounds, intra-class variance, inter-class similarity and occlusion. In this thesis, we propose three methods for action recognition. First, we propose a novel embedding for learning the manifold of human actions that is optimal with respect to the spatio-temporal correlation distance (SCD) between sequences. Action sequences can be compared based on distances between frames; however, comparison based on between-sequence distance is more efficient and effective. In particular, our proposed embedding minimizes the sum of distances between intra-class sequences while maximizing the sum of distances between inter-class sequences. Action sequences are represented by key postures chosen equidistantly from a semantic period of the action. The projected sequences are compared using SCD or the Hausdorff distance in a nearest-neighbor framework. The method not only outperforms other dimension-reduction methods but is comparable to the state of the art on three public datasets. Moreover, it is robust, to a large extent, to additive noise, occlusion, shape deformation and changes in viewpoint. Second, we propose an approach for introducing semantic relations into the bag-of-words framework for recognizing human actions. In the standard bag-of-words framework, features are clustered based on their appearance rather than their semantic relations.
We exploit latent semantic models such as LSA and pLSA, as well as Canonical Correlation Analysis (CCA), to find a subspace in which visual words are distributed more semantically. We project the visual words into this space and apply k-means to obtain semantically meaningful clusters, which serve as a semantic visual vocabulary and lead to more discriminative histograms for recognizing actions. The proposed method gives promising results on the challenging KTH action dataset. Finally, we introduce a novel method for combining information from multiple viewpoints. Spatio-temporal features are extracted from each viewpoint and used in a bag-of-words framework. Two codebooks of different sizes are used to form the histograms. The similarity between the computed histograms is captured by the histogram-intersection kernel (HIK) as well as an RBF kernel on the chi-square distance. The resulting kernels are linearly combined using weights learned through an optimization process. For greater efficiency, a separate set of optimal weights is calculated for each binary SVM classifier. The proposed method not only combines multiple views efficiently but also models the action in multiple spaces using the same features, thereby increasing performance. Several experiments demonstrate the effectiveness of the framework and its constituent parts. We obtain state-of-the-art accuracy of 95.8% on the challenging IXMAS multi-view dataset.
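The first method's objective — shrinking intra-class distances while spreading inter-class ones — can be sketched with a Fisher-style (LDA-like) linear embedding. This is only an illustration of that objective under the assumption of per-frame feature vectors; the thesis's actual embedding is optimized for the SCD between whole sequences, and `fisher_style_embedding` is a hypothetical stand-in, not the proposed method.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_style_embedding(X, y, dim=2):
    """Learn a linear projection that shrinks within-class scatter and
    spreads between-class scatter (an LDA-like stand-in for the thesis's
    SCD-optimal embedding, which works on sequence-level distances)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu).reshape(-1, 1)
        Sb += len(Xc) * diff @ diff.T
    # generalized eigenproblem: directions maximizing Sb relative to Sw
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))  # small ridge keeps Sw positive definite
    return vecs[:, np.argsort(vals)[::-1][:dim]]  # top-`dim` directions
```

Projecting data with the returned matrix (`X @ W`) pulls same-class points together and pushes different-class points apart, which is the geometric effect the abstract describes.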
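The second method's semantic vocabulary can be sketched as: factor a word-by-video count matrix with truncated SVD (the LSA step), then run k-means on the visual words in the latent space so that words co-occurring in the same videos fall into the same cluster. A minimal sketch under those assumptions — the toy matrix and `semantic_vocabulary` are illustrative, and the thesis additionally explores pLSA and CCA:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def semantic_vocabulary(counts, n_topics=2, n_clusters=2, seed=0):
    """LSA-style sketch: SVD of the word-by-video count matrix gives a
    latent topic space; k-means there groups visual words that co-occur,
    yielding a semantic visual vocabulary."""
    U, S, Vt = np.linalg.svd(counts, full_matrices=False)
    words_latent = U[:, :n_topics] * S[:n_topics]  # visual words in topic space
    _, labels = kmeans2(words_latent, n_clusters, seed=seed, minit='++')
    return labels  # semantic cluster id per visual word
```

Histograms are then built over these cluster ids instead of raw appearance clusters, which is what makes them more discriminative in the abstract's sense.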
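The third method's kernel combination can be sketched as a weighted sum of a histogram-intersection kernel and an RBF kernel on the chi-square distance. In this sketch the weights are fixed constants for illustration; in the thesis they are learned by an optimization, separately for each binary SVM.

```python
import numpy as np

def hik(H1, H2):
    """Histogram-intersection kernel between rows of H1 and rows of H2."""
    return np.array([[np.minimum(h1, h2).sum() for h2 in H2] for h1 in H1])

def chi2_rbf(H1, H2, gamma=1.0):
    """RBF kernel on the chi-square distance between histograms."""
    eps = 1e-10  # avoids division by zero in empty bins
    D = np.array([[0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()
                   for h2 in H2] for h1 in H1])
    return np.exp(-gamma * D)

def combined_kernel(H1, H2, w=(0.5, 0.5)):
    # linear combination; the thesis learns w per binary SVM classifier
    return w[0] * hik(H1, H2) + w[1] * chi2_rbf(H1, H2)
```

The combined Gram matrix can be passed directly to an SVM with a precomputed kernel; since both base kernels are positive definite, any non-negative weighting keeps the combination a valid kernel.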
id ntu-10356/54952
institution Nanyang Technological University
school School of Computer Engineering
centre Centre for Multimedia and Network Technology
degree DOCTOR OF PHILOSOPHY (SCE)
accessioned 2013-11-08T04:50:14Z
citation Saghafi Khadem, B. (2013). Human action recognition by embedding silhouettes and visual words. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/54952
doi 10.32657/10356/54952
physical 131 p. application/pdf
title Human action recognition by embedding silhouettes and visual words
topic DRNTU::Engineering::Computer science and engineering::Computer applications::Computer-aided engineering