Natural Language Description of Videos for Smart Surveillance

Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to record video footage. Though useful, these cameras produce a massive volume of video. The major challenge faced by security a...

Bibliographic Details
Main Authors:	Aniqa Dilawari, Muhammad Usman Ghani Khan, Yasser D. Al-Otaibi, Zahoor-ur Rehman, Atta-ur Rahman, Yunyoung Nam
Format:	Article
Language:	English
Published:	MDPI AG 2021-04-01
Series:	Applied Sciences
Subjects:	CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance
Online Access:	https://www.mdpi.com/2076-3417/11/9/3730

_version_	1827694563411623936
author	Aniqa Dilawari; Muhammad Usman Ghani Khan; Yasser D. Al-Otaibi; Zahoor-ur Rehman; Atta-ur Rahman; Yunyoung Nam
author_sort	Aniqa Dilawari
collection	DOAJ
description	Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to record video footage. Though useful, these cameras produce a massive volume of video. The major challenge faced by security agencies is the effort of analyzing the surveillance video data collected and generated daily. Problems related to these videos are twofold: (1) understanding the contents of video streams, and (2) converting the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework for a surveillance dataset. This framework is based on the multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base Visual Geometry Group (VGG)-16 model. Tasks include scene recognition, action recognition, object recognition, and human-face-specific feature recognition. Experimental results on the TRECVid, UET Video Surveillance (UETVS), and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods, with METEOR (Metric for Evaluation of Translation with Explicit ORdering) scores of 33.9%, 34.3%, and 31.2%, respectively. Our results show that the framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions.
first_indexed	2024-03-10T12:08:40Z
format	Article
id	doaj.art-7e935fa8c88042af8a03bab54bf78767
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-10T12:08:40Z
publishDate	2021-04-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-7e935fa8c88042af8a03bab54bf78767 | 2023-11-21T16:25:05Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2021-04-01 | 11 | 9 | 3730 | 10.3390/app11093730 | Natural Language Description of Videos for Smart Surveillance
	Aniqa Dilawari [0]; Muhammad Usman Ghani Khan [1]; Yasser D. Al-Otaibi [2]; Zahoor-ur Rehman [3]; Atta-ur Rahman [4]; Yunyoung Nam [5]
	Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan
	Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan
	Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah 21911, Saudi Arabia
	Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock 43600, Pakistan
	Department of Computer Science, College of Computer and Information Technology, Imam Abdulrahman bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
	Department of Computer Science and Engineering, Soonchunhyang University, Asan 31538, Korea
	https://www.mdpi.com/2076-3417/11/9/3730
	CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance
title	Natural Language Description of Videos for Smart Surveillance
title_sort	natural language description of videos for smart surveillance
topic	CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance
url	https://www.mdpi.com/2076-3417/11/9/3730
