Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video

Bibliographic Details
Main Authors: Altaf Hussain, Samee Ullah Khan, Noman Khan, Waseem Ullah, Ahmed Alkhayyat, Meshal Alharbi, Sung Wook Baik
Format: Article
Language: English
Published: Elsevier 2024-03-01
Series: Alexandria Engineering Journal
Subjects: Activity Recognition; Video Classification; Surveillance System; Lowlight Image Enhancement; Dual Stream Network; Transformer Network
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016823009936
description Nowadays, surveillance cameras are installed in most public places to help control crime and ensure urban safety and security. However, automating Human Activity Recognition (HAR) with computer vision techniques faces several challenges, such as low lighting, complex spatiotemporal features, cluttered backgrounds, and inefficient use of surveillance system resources. Existing HAR attempts design straightforward networks that analyze either spatial or motion patterns, resulting in limited performance, while dual-stream methods rely entirely on Convolutional Neural Networks (CNNs), which are inadequate for learning long-range temporal information. To overcome these challenges, this paper proposes an optimized dual-stream framework for HAR that consists of three main steps. First, a shots segmentation module is introduced to use surveillance system resources efficiently: it enhances the lowlight video stream and then detects salient frames that contain humans. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD), which contains both normal and different levels of lowlight data, to recognize humans in complex, uncertain environments. Next, a dual-stream approach is used for feature extraction to learn HAR from both contextual and motion information. The first stream freezes the learned weights of the backbone Vision Transformer (ViT) B-16 model to select discriminative contextual information. In the second stream, ViT features are fused with the intermediate encoder layers of the FlowNet2 optical flow model to extract a robust motion feature vector. Finally, a two-stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to optimize the large feature vector for accurate HAR. To assess the strength of the proposed framework, extensive experiments are conducted on real-world surveillance scenarios and various benchmark HAR datasets, achieving accuracies of 78.6285%, 96.0151%, and 98.875% on HMDB51, UCF101, and YouTube Action, respectively. Our results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods and delivers superior performance in HAR, providing accurate and reliable recognition of human activities in surveillance systems.
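A minimal sketch of how the dual-stream sequence model summarized above might be wired together, assuming per-frame context features from a frozen ViT-B/16 and motion features from an optical-flow encoder. The layer sizes, module names, and the random tensors used as stand-in features are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: parallel BiLSTMs per stream, multi-head attention,
# and late fusion, roughly following the abstract's description.
import torch
import torch.nn as nn

class DualStreamHAR(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, num_classes=51, num_heads=4):
        super().__init__()
        # One BiLSTM per stream, run in parallel over the frame sequence.
        self.context_lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.motion_lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over each stream's sequence output.
        self.context_attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        self.motion_attn = nn.MultiheadAttention(2 * hidden, num_heads, batch_first=True)
        # Late fusion: concatenate pooled stream descriptors, then classify.
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, context_feats, motion_feats):
        # context_feats, motion_feats: (batch, frames, feat_dim) per-frame features,
        # e.g. from a frozen ViT-B/16 and an optical-flow encoder respectively.
        c, _ = self.context_lstm(context_feats)
        m, _ = self.motion_lstm(motion_feats)
        c, _ = self.context_attn(c, c, c)
        m, _ = self.motion_attn(m, m, m)
        fused = torch.cat([c.mean(dim=1), m.mean(dim=1)], dim=-1)  # late fusion
        return self.classifier(fused)

if __name__ == "__main__":
    model = DualStreamHAR()
    ctx = torch.randn(2, 16, 768)   # stand-in for frozen ViT-B/16 frame features
    mot = torch.randn(2, 16, 768)   # stand-in for FlowNet2-derived motion features
    print(model(ctx, mot).shape)    # torch.Size([2, 51])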
issn 1110-0168
affiliations Altaf Hussain, Samee Ullah Khan, Noman Khan, Waseem Ullah, Sung Wook Baik: Sejong University, Seoul 143-747, Republic of Korea; Ahmed Alkhayyat: Islamic University, 54001 Najaf, Iraq; Meshal Alharbi: Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj 11942, Saudi Arabia (corresponding author: Sung Wook Baik)
citation Alexandria Engineering Journal, vol. 91 (March 2024), pp. 632-647