Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping

Bibliographic Details
Main Authors:	Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu
Format:	Article
Language:	English
Published:	MDPI AG 2021-10-01
Series:	Electronics
Subjects:	violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM
Online Access:	https://www.mdpi.com/2079-9292/10/21/2654

collection	DOAJ
description	Violence recognition loses accuracy when the visual and auditory streams of a multimedia clip are misaligned on the time axis or diverge semantically. This paper proposes an auditory-visual fusion method based on autoencoder mapping. First, a feature extraction model built on the CNN-LSTM framework takes each multimedia segment as a whole input, which resolves the time-axis misalignment between visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which resolves the audiovisual semantic deviation and fuses visual and auditory information at the level of segment features. Finally, the whole network is used to identify violence. The experimental results show that the method makes good use of the complementarity between modalities, and that the multimodal method achieves better results than single-modality information.
first_indexed	2024-03-10T06:04:31Z
id	doaj.art-4ba6c819a8f547abb7858b9008326632
institution	Directory of Open Access Journals
issn	2079-9292
last_indexed	2024-03-10T06:04:31Z
doi	10.3390/electronics10212654
volume	10
issue	21
article_number	2654
affiliation	School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (all four authors)
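
The fusion step outlined in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the record gives no architecture details, so the one-layer linear autoencoders, the margin-based correspondence term, the concatenation used for fusion, and every name below (LinearAutoencoder, fusion_loss, margin) are assumptions made for illustration. The input vectors stand in for segment-level features that a CNN-LSTM extractor would produce.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class LinearAutoencoder:
    """Hypothetical one-layer linear autoencoder for a single modality."""

    def __init__(self, in_dim, shared_dim, rng):
        self.enc = rand_matrix(shared_dim, in_dim, rng)  # feature -> shared subspace
        self.dec = rand_matrix(in_dim, shared_dim, rng)  # shared subspace -> feature

    def encode(self, x):
        return matvec(self.enc, x)

    def decode(self, z):
        return matvec(self.dec, z)


def fusion_loss(visual_ae, audio_ae, v_feat, a_feat, same_semantics, margin=1.0):
    """Return (loss, fused) for one segment.

    loss  = reconstruction error of both modalities + correspondence term
    fused = concatenation of the two shared-subspace codes (an assumed choice)
    """
    z_v = visual_ae.encode(v_feat)
    z_a = audio_ae.encode(a_feat)
    recon = (sq_dist(visual_ae.decode(z_v), v_feat)
             + sq_dist(audio_ae.decode(z_a), a_feat))
    d = sq_dist(z_v, z_a)
    # Assumed correspondence term: pull matching pairs together,
    # push mismatched pairs at least `margin` apart.
    corr = d if same_semantics else max(0.0, margin - d)
    return recon + corr, z_v + z_a


rng = random.Random(0)
visual_ae = LinearAutoencoder(in_dim=8, shared_dim=4, rng=rng)
audio_ae = LinearAutoencoder(in_dim=6, shared_dim=4, rng=rng)
v_feat = [rng.random() for _ in range(8)]  # stand-in for a visual segment feature
a_feat = [rng.random() for _ in range(6)]  # stand-in for an auditory segment feature
loss, fused = fusion_loss(visual_ae, audio_ae, v_feat, a_feat, same_semantics=True)
loss_mismatch, _ = fusion_loss(visual_ae, audio_ae, v_feat, a_feat, same_semantics=False)
```

In a full system the fused vector would feed a classifier that labels the segment as violent or non-violent, and the weights would be trained by minimizing the loss over labeled segments rather than left at their random initial values.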

Similar Items