Human Interaction Classification in Sliding Video Windows Using Skeleton Data Tracking and Feature Extraction
Main Authors: | Sebastian Puchała, Włodzimierz Kasprzak, Paweł Piwowarski |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-07-01 |
Series: | Sensors |
Subjects: | human interaction videos; LSTM; preliminary skeleton features; skeleton tracking; sliding window; many-interaction videos |
Online Access: | https://www.mdpi.com/1424-8220/23/14/6279 |
author | Sebastian Puchała, Włodzimierz Kasprzak, Paweł Piwowarski |
collection | DOAJ |
description | A “long short-term memory” (LSTM)-based human activity classifier is presented for skeleton data estimated in video frames. A strong feature engineering step precedes the deep neural network processing. The video was analyzed in short-time chunks created by a sliding window. A fixed number of video frames was selected for every chunk and human skeletons were estimated using dedicated software, such as OpenPose or HRNet. The skeleton data for a given window were collected, analyzed, and eventually corrected. A knowledge-aware feature extraction from the corrected skeletons was performed. A deep network model was trained and applied for two-person interaction classification. Three network architectures were developed—single-, double- and triple-channel LSTM networks—and were experimentally evaluated on the interaction subset of the “NTU RGB+D” data set. The most efficient model achieved an interaction classification accuracy of 96%. This performance was compared with the best reported solutions for this set, based on “adaptive graph convolutional networks” (AGCN) and “3D convolutional networks” (e.g., OpenConv3D). The sliding-window strategy was cross-validated on the “UT-Interaction” data set, containing long video clips with many changing interactions. We concluded that a two-step approach to skeleton-based human activity classification (a skeleton feature engineering step followed by a deep neural network model) represents a practical tradeoff between accuracy and computational complexity, due to an early correction of imperfect skeleton data and a knowledge-aware extraction of relational features from the skeletons. |
format | Article |
id | doaj.art-186834ea482e468e8a6bb9c74a50b507 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
publishDate | 2023-07-01 |
publisher | MDPI AG |
series | Sensors |
doi | 10.3390/s23146279 |
citation | Sensors 23(14): 6279, 2023-07-01 |
affiliation | Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warszawa, Poland (all three authors) |
title | Human Interaction Classification in Sliding Video Windows Using Skeleton Data Tracking and Feature Extraction |
topic | human interaction videos; LSTM; preliminary skeleton features; skeleton tracking; sliding window; many-interaction videos |
url | https://www.mdpi.com/1424-8220/23/14/6279 |
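The abstract describes a two-step, sliding-window pipeline: fixed-length chunks of per-frame skeletons are cleaned, turned into relational features, and classified by an LSTM. The sketch below is a minimal illustration of that general idea, not the published implementation: the window length, stride, joint layout, feature set (person-centred coordinates plus inter-person joint distances), hidden size, and class count are assumptions, and only a single-channel LSTM variant is shown.

```python
# Minimal sketch (assumed parameters, not the authors' code): sliding-window
# chunking of per-frame skeletons, toy relational features, single-channel LSTM.
import torch
import torch.nn as nn

WINDOW = 30   # frames per chunk (assumed)
STRIDE = 10   # sliding-window step (assumed)
JOINTS = 25   # joints per skeleton (e.g., NTU RGB+D layout)
PERSONS = 2   # two-person interactions

def sliding_windows(skeletons: torch.Tensor):
    """skeletons: (T, PERSONS, JOINTS, 2) 2D joint coordinates per frame.
    Yields fixed-length chunks of WINDOW frames."""
    T = skeletons.shape[0]
    for start in range(0, T - WINDOW + 1, STRIDE):
        yield skeletons[start:start + WINDOW]

def relational_features(chunk: torch.Tensor) -> torch.Tensor:
    """Toy 'knowledge-aware' features for one window: joint positions relative
    to a reference joint of each person plus inter-person joint distances.
    Returns a (WINDOW, feature_dim) tensor."""
    ref = chunk[:, :, 0:1, :]                        # assume index 0 is a hip/torso joint
    rel = chunk - ref                                # person-centred coordinates
    inter = torch.linalg.norm(chunk[:, 0] - chunk[:, 1], dim=-1)  # (WINDOW, JOINTS)
    return torch.cat([rel.flatten(1), inter], dim=1)

class WindowLSTMClassifier(nn.Module):
    """Single-channel LSTM over the per-frame feature vectors of one window."""
    def __init__(self, feature_dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                            # x: (batch, WINDOW, feature_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])                 # classify from the last time step

# Usage example on random data standing in for estimated skeletons.
video_skeletons = torch.randn(120, PERSONS, JOINTS, 2)          # 120-frame clip
chunks = [relational_features(c) for c in sliding_windows(video_skeletons)]
batch = torch.stack(chunks)                                      # (num_windows, WINDOW, feat)
model = WindowLSTMClassifier(feature_dim=batch.shape[-1],
                             num_classes=11)                     # 11 mutual-action classes in NTU RGB+D
logits = model(batch)                                            # one prediction per window
```

Scoring every window independently is what lets a single model follow long clips in which the interaction changes over time, which is the scenario the abstract reports for the cross-validation on the UT-Interaction clips.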