Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video

Bibliographic Details
Main Authors: Altaf Hussain, Samee Ullah Khan, Noman Khan, Waseem Ullah, Ahmed Alkhayyat, Meshal Alharbi, Sung Wook Baik
Format: Article
Language: English
Published: Elsevier 2024-03-01
Series: Alexandria Engineering Journal
Subjects: Activity Recognition; Video Classification; Surveillance System; Lowlight Image Enhancement; Dual Stream Network; Transformer Network
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016823009936
description Nowadays, surveillance cameras are installed in most public places to help control crime and ensure urban safety and security. However, automating Human Activity Recognition (HAR) with computer vision techniques faces several challenges, such as low lighting, complex spatiotemporal features, cluttered backgrounds, and inefficient use of surveillance system resources. Existing HAR attempts design straightforward networks that analyze either spatial or motion patterns, resulting in limited performance, while dual-stream methods rely entirely on Convolutional Neural Networks (CNNs), which are inadequate for learning long-range temporal information. To overcome these challenges, this paper proposes an optimized dual-stream framework for HAR that consists of three main steps. First, a shots segmentation module is introduced to use surveillance system resources efficiently: it enhances the lowlight video stream and then detects salient frames that contain humans. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD), which contains both normal and different levels of lowlight data, to recognize humans in complex, uncertain environments. Next, a dual-stream approach is used for feature extraction to learn HAR from both contextual and motion information. The first stream freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select discriminative contextual information. In the second stream, ViT features are fused with the intermediate encoder layers of the FlowNet2 optical flow model to extract a robust motion feature vector. Finally, a two-stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the large feature vector for accurate HAR. To assess the strength of the proposed framework, extensive experiments are conducted on real-world surveillance scenarios and various benchmark HAR datasets, achieving accuracies of 78.6285%, 96.0151%, and 98.875% on HMDB51, UCF101, and YouTube Action, respectively. Our results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods and delivers superior performance in HAR, providing accurate and reliable recognition of human activities in surveillance systems.
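A minimal sketch of how the dual-stream sequence model summarized above might be wired together, assuming per-frame context features from a frozen ViT-B/16 and motion features from an optical-flow encoder. The layer sizes, module names, and the random tensors used as stand-in features are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: parallel BiLSTMs per stream, multi-head attention,
# and late fusion, roughly following the abstract's description.
import torch
import torch.nn as nn

class DualStreamHAR(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, num_classes=51, num_heads=4):
        super().__init__()
        # One BiLSTM per stream, run in parallel over the frame sequence.
        self.context_lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.motion_lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over each stream's sequence output.
        self.context_attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        self.motion_attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        # Late fusion: concatenate pooled stream descriptors, then classify.
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, context_feats, motion_feats):
        # context_feats, motion_feats: (batch, frames, feat_dim) per-frame features,
        # e.g. from a frozen ViT-B/16 and an optical-flow encoder respectively.
        c, _ = self.context_lstm(context_feats)
        m, _ = self.motion_lstm(motion_feats)
        c, _ = self.context_attn(c, c, c)
        m, _ = self.motion_attn(m, m, m)
        fused = torch.cat([c.mean(dim=1), m.mean(dim=1)], dim=-1)  # late fusion
        return self.classifier(fused)

if __name__ == "__main__":
    model = DualStreamHAR()
    ctx = torch.randn(2, 16, 768)   # stand-in for frozen ViT-B/16 frame features
    mot = torch.randn(2, 16, 768)   # stand-in for FlowNet2-derived motion features
    print(model(ctx, mot).shape)    # torch.Size([2, 51])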
issn 1110-0168
affiliations Altaf Hussain, Samee Ullah Khan, Noman Khan, Waseem Ullah, Sung Wook Baik: Sejong University, Seoul 143-747, Republic of Korea; Ahmed Alkhayyat: Islamic University, 54001 Najaf, Iraq; Meshal Alharbi: Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj 11942, Saudi Arabia (corresponding author: Sung Wook Baik)
citation Alexandria Engineering Journal, vol. 91 (March 2024), pp. 632-647