Natural Language Description of Videos for Smart Surveillance
Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to capture video footage. Though useful, these cameras produce a massive volume of video. The major challenge faced by security a...
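Main Authors: | Aniqa Dilawari, Muhammad Usman Ghani Khan, Yasser D. Al-Otaibi, Zahoor-ur Rehman, Atta-ur Rahman, Yunyoung Nam |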
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-04-01 |
Series: | Applied Sciences |
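Subjects: | CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance |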
Online Access: | https://www.mdpi.com/2076-3417/11/9/3730 |
_version_ | 1827694563411623936 |
---|---|
author | Aniqa Dilawari; Muhammad Usman Ghani Khan; Yasser D. Al-Otaibi; Zahoor-ur Rehman; Atta-ur Rahman; Yunyoung Nam |
author_facet | Aniqa Dilawari; Muhammad Usman Ghani Khan; Yasser D. Al-Otaibi; Zahoor-ur Rehman; Atta-ur Rahman; Yunyoung Nam |
author_sort | Aniqa Dilawari |
collection | DOAJ |
description | Since the September 11 attacks, security and surveillance measures have changed across the globe. Surveillance cameras are now installed almost everywhere to capture video footage. Though useful, these cameras produce a massive volume of video. The major challenge faced by security agencies is the effort of analyzing the surveillance video data collected and generated daily. The problems related to these videos are twofold: (1) understanding the contents of video streams, and (2) converting the video contents to condensed formats, such as textual interpretations and summaries, to save storage space. In this paper, we propose a video description framework for surveillance data. The framework is based on the multitask learning of high-level features (HLFs) using a convolutional neural network (CNN) and on natural language generation (NLG) through bidirectional recurrent networks. For each specific task, a parallel pipeline is derived from the base Visual Geometry Group (VGG)-16 model. Tasks include scene recognition, action recognition, object recognition, and recognition of human face-specific features. Experimental results on the TRECViD, UET Video Surveillance (UETVS) and AGRIINTRUSION datasets show that the model outperforms state-of-the-art methods by a METEOR (Metric for Evaluation of Translation with Explicit ORdering) score of 33.9%, 34.3%, and 31.2%, respectively. Our results show that the framework has distinct advantages over traditional rule-based models for the recognition and generation of natural language descriptions. |
first_indexed | 2024-03-10T12:08:40Z |
format | Article |
id | doaj.art-7e935fa8c88042af8a03bab54bf78767 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T12:08:40Z |
publishDate | 2021-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-7e935fa8c88042af8a03bab54bf78767; 2023-11-21T16:25:05Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2021-04-01; 11; 9; 3730; 10.3390/app11093730; Natural Language Description of Videos for Smart Surveillance; Aniqa Dilawari (0); Muhammad Usman Ghani Khan (1); Yasser D. Al-Otaibi (2); Zahoor-ur Rehman (3); Atta-ur Rahman (4); Yunyoung Nam (5); (0) Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan; (1) Department of Computer Science, University of Engineering & Technology, Lahore 54890, Pakistan; (2) Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah 21911, Saudi Arabia; (3) Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock 43600, Pakistan; (4) Department of Computer Science, College of Computer and Information Technology, Imam Abdulrahman bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia; (5) Department of Computer Science and Engineering, Soonchunhyang University, Asan 31538, Korea; https://www.mdpi.com/2076-3417/11/9/3730; CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance |
spellingShingle | Aniqa Dilawari; Muhammad Usman Ghani Khan; Yasser D. Al-Otaibi; Zahoor-ur Rehman; Atta-ur Rahman; Yunyoung Nam; Natural Language Description of Videos for Smart Surveillance; Applied Sciences; CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance |
title | Natural Language Description of Videos for Smart Surveillance |
title_full | Natural Language Description of Videos for Smart Surveillance |
title_fullStr | Natural Language Description of Videos for Smart Surveillance |
title_full_unstemmed | Natural Language Description of Videos for Smart Surveillance |
title_short | Natural Language Description of Videos for Smart Surveillance |
title_sort | natural language description of videos for smart surveillance |
topic | CNN; multitask feature learning; bidirectional long short-term memory (LSTM); TRECVid 2007/2008; video captioning; smart surveillance |
url | https://www.mdpi.com/2076-3417/11/9/3730 |
work_keys_str_mv | AT aniqadilawari naturallanguagedescriptionofvideosforsmartsurveillance AT muhammadusmanghanikhan naturallanguagedescriptionofvideosforsmartsurveillance AT yasserdalotaibi naturallanguagedescriptionofvideosforsmartsurveillance AT zahoorurrehman naturallanguagedescriptionofvideosforsmartsurveillance AT attaurrahman naturallanguagedescriptionofvideosforsmartsurveillance AT yunyoungnam naturallanguagedescriptionofvideosforsmartsurveillance |
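
The abstract describes parallel task pipelines derived from one shared base model. A minimal sketch of that general idea (one shared feature extractor, several task-specific heads) might look like the following. It is not the authors' implementation: the chunk-averaging extractor stands in for VGG-16, and the class counts, feature size, random weights, and all function names are hypothetical values chosen only for illustration.

```python
import math
import random

# Hypothetical class counts for the four tasks named in the abstract.
TASKS = {"scene": 3, "action": 4, "object": 5, "face": 2}
FEATURE_DIM = 8  # assumed toy size; a real VGG-16 feature vector is far larger


def shared_features(frame):
    """Stand-in for the shared base network: average the frame in FEATURE_DIM chunks."""
    step = len(frame) // FEATURE_DIM
    return [sum(frame[i * step:(i + 1) * step]) / step for i in range(FEATURE_DIM)]


def make_head(num_classes, rng):
    """One task-specific linear layer with random (untrained) weights."""
    return [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_heads(frame, heads):
    """Compute the shared features once, then feed them to every task head."""
    feats = shared_features(frame)
    return {
        task: softmax([sum(w * f for w, f in zip(row, feats)) for row in weights])
        for task, weights in heads.items()
    }


rng = random.Random(0)
heads = {task: make_head(n, rng) for task, n in TASKS.items()}
frame = [rng.random() for _ in range(64)]  # toy "frame" of 64 pixel values
predictions = run_heads(frame, heads)
```

Here `predictions` maps each task name to a probability distribution over that task's classes. The design point is that the costly feature extraction runs once per frame and is reused by every head.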