Human Interaction Classification in Sliding Video Windows Using Skeleton Data Tracking and Feature Extraction

A “long short-term memory” (LSTM)-based human activity classifier is presented for skeleton data estimated in video frames. A strong feature engineering step precedes the deep neural network processing. The video was analyzed in short-time chunks created by a sliding window. A fixed number of video frames was selected for every chunk and human skeletons were estimated using dedicated software, such as OpenPose or HRNet. The skeleton data for a given window were collected, analyzed, and corrected where necessary. A knowledge-aware feature extraction from the corrected skeletons was performed. A deep network model was trained and applied for two-person interaction classification. Three network architectures were developed (single-, double- and triple-channel LSTM networks) and experimentally evaluated on the interaction subset of the “NTU RGB+D” data set. The best-performing model achieved an interaction classification accuracy of 96%. This performance was compared with the best reported solutions for this data set, based on “adaptive graph convolutional networks” (AGCN) and “3D convolutional networks” (e.g., OpenConv3D). The sliding-window strategy was cross-validated on the “UT-Interaction” data set, which contains long video clips with many changing interactions. We concluded that a two-step approach to skeleton-based human activity classification (a skeleton feature engineering step followed by a deep neural network model) represents a practical tradeoff between accuracy and computational complexity, due to an early correction of imperfect skeleton data and a knowledge-aware extraction of relational features from the skeletons.
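The abstract describes a pipeline rather than code, so the sketch below is only a minimal, hypothetical illustration in Python/PyTorch of the kind of processing it outlines: sliding-window chunking of per-frame skeletons, simple relational ("knowledge-aware") features for a two-person scene, and a single-channel LSTM classifier. The window length, stride, joint layout, feature set and all names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a sliding-window,
# skeleton-feature, single-channel LSTM pipeline for two-person interactions.
import numpy as np
import torch
import torch.nn as nn

NUM_JOINTS = 25      # e.g., the NTU RGB+D skeleton layout (assumption)
WINDOW_LEN = 30      # frames per sliding-window chunk (assumption)
WINDOW_STRIDE = 10   # frames between consecutive window starts (assumption)


def sliding_windows(skeletons: np.ndarray) -> np.ndarray:
    """Split a long clip into fixed-length chunks.

    skeletons: (T, 2, NUM_JOINTS, 2) array -- T frames, 2 persons, 2-D joints.
    Returns (num_windows, WINDOW_LEN, 2, NUM_JOINTS, 2).
    """
    starts = range(0, max(1, len(skeletons) - WINDOW_LEN + 1), WINDOW_STRIDE)
    return np.stack([skeletons[s:s + WINDOW_LEN] for s in starts])


def relational_features(window: np.ndarray) -> np.ndarray:
    """Toy relational features: per-frame distances between the two persons'
    corresponding joints, plus each joint's offset from a reference joint of
    its own skeleton (index 0, by assumption). Output: (WINDOW_LEN, feat_dim)."""
    p1, p2 = window[:, 0], window[:, 1]                    # (T, J, 2) each
    cross_dist = np.linalg.norm(p1 - p2, axis=-1)          # (T, J)
    rel1 = (p1 - p1[:, :1]).reshape(len(window), -1)       # joints rel. to reference
    rel2 = (p2 - p2[:, :1]).reshape(len(window), -1)
    return np.concatenate([cross_dist, rel1, rel2], axis=-1)


class InteractionLSTM(nn.Module):
    """Single-channel LSTM classifier over per-frame feature vectors."""

    def __init__(self, feat_dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (B, T, hidden)
        return self.head(out[:, -1])   # classify from the last time step


if __name__ == "__main__":
    clip = np.random.rand(120, 2, NUM_JOINTS, 2).astype(np.float32)  # fake skeletons
    windows = sliding_windows(clip)
    feats = np.stack([relational_features(w) for w in windows])
    model = InteractionLSTM(feat_dim=feats.shape[-1],
                            num_classes=11)  # 11 two-person classes in NTU RGB+D 60
    logits = model(torch.from_numpy(feats).float())
    print(logits.shape)                # (num_windows, 11)
```

In the paper's setting, the per-window class scores would be produced for each position of the sliding window, which is what allows long clips with several changing interactions (as in UT-Interaction) to be labeled segment by segment.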

Bibliographic Details
Main Authors: Sebastian Puchała, Włodzimierz Kasprzak, Paweł Piwowarski (Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warszawa, Poland)
Format: Article
Language: English
Published: MDPI AG, 2023-07-01
Series: Sensors
ISSN: 1424-8220
DOI: 10.3390/s23146279
Subjects: human interaction videos; LSTM; preliminary skeleton features; skeleton tracking; sliding window; many-interaction videos
Online Access: https://www.mdpi.com/1424-8220/23/14/6279