Natural Language Description of Videos for Smart Surveillance

Bibliographic Details
Main Authors: Aniqa Dilawari, Muhammad Usman Ghani Khan, Yasser D. Al-Otaibi, Zahoor-ur Rehman, Atta-ur Rahman, Yunyoung Nam
Format: Article
Language: English
Published: MDPI AG 2021-04-01
Series: Applied Sciences
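Subjects: CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance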
Online Access: https://www.mdpi.com/2076-3417/11/9/3730
author Aniqa Dilawari
Muhammad Usman Ghani Khan
Yasser D. Al-Otaibi
Zahoor-ur Rehman
Atta-ur Rahman
Yunyoung Nam
collection DOAJ
description Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to capture video footage. Though useful, these cameras produce video in massive volumes. The major challenge faced by security agencies is the effort of analyzing the surveillance video data collected and generated daily. The problems posed by these videos are twofold: (1) understanding the contents of video streams, and (2) converting the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework for surveillance data. The framework is based on multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base Visual Geometry Group (VGG)-16 model. Tasks include scene recognition, action recognition, object recognition, and recognition of human face-specific features. Experimental results on the TRECViD, UET Video Surveillance (UETVS), and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods, achieving METEOR (Metric for Evaluation of Translation with Explicit ORdering) scores of 33.9%, 34.3%, and 31.2%, respectively. Our results show that the framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions.
first_indexed 2024-03-10T12:08:40Z
format Article
id doaj.art-7e935fa8c88042af8a03bab54bf78767
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T12:08:40Z
publishDate 2021-04-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-7e935fa8c88042af8a03bab54bf78767; 2023-11-21T16:25:05Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-04-01; volume 11, issue 9, article 3730; DOI 10.3390/app11093730
Natural Language Description of Videos for Smart Surveillance
Aniqa Dilawari (0): Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan
Muhammad Usman Ghani Khan (1): Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan
Yasser D. Al-Otaibi (2): Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah 21911, Saudi Arabia
Zahoor-ur Rehman (3): Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock 43600, Pakistan
Atta-ur Rahman (4): Department of Computer Science, College of Computer and Information Technology, Imam Abdulrahman bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
Yunyoung Nam (5): Department of Computer Science and Engineering, Soonchunhyang University, Asan 31538, Korea
title Natural Language Description of Videos for Smart Surveillance
topic CNN
multitask feature learning
bidirectional long short-term memory (LSTM)
TRECVid 2007/2008
video captioning
smart surveillance
url https://www.mdpi.com/2076-3417/11/9/3730