Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

In violence recognition, accuracy suffers from time-axis misalignment and semantic deviation between the visual and auditory information in multimedia. This paper therefore proposes an auditory-visual information fusion method based on autoencoder mapping. First, a fe...

Bibliographic Details
Main Authors: Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu
Format: Article
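Language: English
Published: MDPI AG 2021-10-01
Series: Electronics
Subjects: violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM
Online Access: https://www.mdpi.com/2079-9292/10/21/2654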
author Jiu Lou
Decheng Zuo
Zhan Zhang
Hongwei Liu
collection DOAJ
description In violence recognition, accuracy suffers from time-axis misalignment and semantic deviation between the visual and auditory information in multimedia. This paper therefore proposes an auditory-visual information fusion method based on autoencoder mapping. First, a feature extraction model based on the CNN-LSTM framework is established, and each multimedia segment is used as a whole input, which addresses the time-axis misalignment of visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which addresses audiovisual semantic deviation and fuses visual and auditory information at the level of segment features. Finally, the whole network is used to identify violence. The experimental results show that the method makes good use of the complementarity between modalities: compared with single-modality information, the multimodal method achieves better results.
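The shared-subspace step described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the layer sizes, the single-layer tanh encoders, the contrastive form of the semantic-correspondence term, and all function names are assumptions made for illustration, and random vectors stand in for the segment-level features a CNN-LSTM extractor would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(in_dim, code_dim):
    """Weights of a one-layer encoder/decoder pair (sizes are hypothetical)."""
    return {
        "enc": rng.normal(0.0, 0.1, size=(in_dim, code_dim)),
        "dec": rng.normal(0.0, 0.1, size=(code_dim, in_dim)),
    }

def encode(params, x):
    # Map one modality's segment features into the shared subspace.
    return np.tanh(x @ params["enc"])

def decode(params, z):
    # Map shared-subspace codes back to the modality's feature space.
    return z @ params["dec"]

def fusion_loss(vis_ae, aud_ae, vis, aud, same_semantics, margin=1.0):
    """Reconstruction loss plus an assumed contrastive correspondence term.

    same_semantics[i] is 1 when the visual and auditory parts of segment i
    agree semantically, and 0 otherwise.
    """
    z_v, z_a = encode(vis_ae, vis), encode(aud_ae, aud)
    recon = (np.mean((decode(vis_ae, z_v) - vis) ** 2)
             + np.mean((decode(aud_ae, z_a) - aud) ** 2))
    dist = np.linalg.norm(z_v - z_a, axis=1)
    # Pull matching pairs together; push mismatched pairs at least `margin` apart.
    corr = np.mean(same_semantics * dist ** 2
                   + (1 - same_semantics) * np.maximum(0.0, margin - dist) ** 2)
    # Segment-level fusion: concatenate the two shared-subspace codes.
    fused = np.concatenate([z_v, z_a], axis=1)
    return recon + corr, fused

# Stand-in segment features: 4 segments, 128-d visual, 64-d auditory.
vis = rng.normal(size=(4, 128))
aud = rng.normal(size=(4, 64))
same_semantics = np.array([1.0, 1.0, 0.0, 1.0])

vis_ae = init_autoencoder(128, 32)
aud_ae = init_autoencoder(64, 32)
loss, fused = fusion_loss(vis_ae, aud_ae, vis, aud, same_semantics)
print(fused.shape)  # (4, 64)
```

In a full pipeline the fused segment vector would feed a violence classifier and the loss would be minimized by gradient descent; both steps are omitted here to keep the sketch short.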
first_indexed 2024-03-10T06:04:31Z
format Article
id doaj.art-4ba6c819a8f547abb7858b9008326632
institution Directory Open Access Journal
issn 2079-9292
language English
last_indexed 2024-03-10T06:04:31Z
publishDate 2021-10-01
publisher MDPI AG
record_format Article
series Electronics
spelling doaj.art-4ba6c819a8f547abb7858b9008326632; 2023-11-22T20:38:53Z; eng; MDPI AG; Electronics; ISSN 2079-9292; 2021-10-01; vol. 10, no. 21, art. 2654; DOI 10.3390/electronics10212654
Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping
Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu (all: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China)
https://www.mdpi.com/2079-9292/10/21/2654
Keywords: violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM
title Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping
topic violence recognition
auditory-visual fusion
autoencoder mapping
shared semantic subspaces
CNN-LSTM
url https://www.mdpi.com/2079-9292/10/21/2654