Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping
Violence recognition loses accuracy when the visual and auditory streams of a multimedia clip are misaligned on the time axis and deviate semantically. This paper therefore proposes an auditory-visual information fusion method based on autoencoder mapping. First, a fe...
Main Authors: | Jiu Lou, Decheng Zuo, Zhan Zhang, Hongwei Liu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-10-01 |
Series: | Electronics |
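Subjects: | violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM |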
Online Access: | https://www.mdpi.com/2079-9292/10/21/2654 |
_version_ | 1797512632277663744 |
---|---|
author | Jiu Lou; Decheng Zuo; Zhan Zhang; Hongwei Liu |
author_facet | Jiu Lou; Decheng Zuo; Zhan Zhang; Hongwei Liu |
author_sort | Jiu Lou |
collection | DOAJ |
description | Violence recognition loses accuracy when the visual and auditory streams of a multimedia clip are misaligned on the time axis and deviate semantically. This paper therefore proposes an auditory-visual information fusion method based on autoencoder mapping. First, a feature extraction model based on the CNN-LSTM framework is established, and each multimedia segment is taken as a whole input, which solves the time-axis misalignment between visual and auditory information. Then, a shared semantic subspace is constructed with an autoencoder mapping model and optimized by semantic correspondence, which solves the audiovisual semantic deviation and fuses visual and auditory information at the level of segment features. Finally, the whole network is used to identify violence. The experimental results show that the method makes good use of the complementarity between modalities: compared with single-modality information, the multimodal method can achieve better results. |
first_indexed | 2024-03-10T06:04:31Z |
format | Article |
id | doaj.art-4ba6c819a8f547abb7858b9008326632 |
institution | Directory of Open Access Journals |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-10T06:04:31Z |
publishDate | 2021-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
title | Violence Recognition Based on Auditory-Visual Fusion of Autoencoder Mapping |
topic | violence recognition; auditory-visual fusion; autoencoder mapping; shared semantic subspaces; CNN-LSTM |
url | https://www.mdpi.com/2079-9292/10/21/2654 |
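
The abstract describes mapping segment-level visual and auditory features into a shared semantic subspace with an autoencoder and optimizing that subspace by semantic correspondence. The following is a minimal sketch of that idea, not the authors' implementation. It is a forward pass only, with untrained random weights. The feature dimensions, the single-layer `tanh` encoders, the contrastive form of `correspondence_loss`, and fusion by concatenation are all assumptions chosen for illustration; the record does not specify any of them, and the CNN-LSTM feature extractor is replaced by random vectors.

```python
import math
import random

def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights; a trained model would learn these instead."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def encode(layer, x):
    """Map a modality feature into the shared subspace (assumed tanh layer)."""
    weights, bias = layer
    return [math.tanh(a) for a in linear(weights, bias, x)]

def decode(layer, z):
    """Reconstruct the modality feature from its shared-subspace code."""
    weights, bias = layer
    return linear(weights, bias, z)

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def correspondence_loss(z_visual, z_audio, same_semantics, margin=1.0):
    """Assumed contrastive form: pull matching pairs together,
    push non-matching pairs at least `margin` apart."""
    d2 = squared_distance(z_visual, z_audio)
    if same_semantics:
        return d2
    return max(0.0, margin - math.sqrt(d2)) ** 2

def fuse(z_visual, z_audio):
    """Segment-level fusion by concatenation (an assumption)."""
    return z_visual + z_audio

rng = random.Random(0)
VISUAL_DIM, AUDIO_DIM, SHARED_DIM = 8, 6, 4  # hypothetical sizes

visual_encoder = init_layer(SHARED_DIM, VISUAL_DIM, rng)
audio_encoder = init_layer(SHARED_DIM, AUDIO_DIM, rng)
visual_decoder = init_layer(VISUAL_DIM, SHARED_DIM, rng)

# Stand-ins for the segment-level features a CNN-LSTM extractor would emit.
visual_feature = [rng.random() for _ in range(VISUAL_DIM)]
audio_feature = [rng.random() for _ in range(AUDIO_DIM)]

z_v = encode(visual_encoder, visual_feature)
z_a = encode(audio_encoder, audio_feature)

# Autoencoder term: how well the shared code reconstructs its input.
reconstruction_loss = squared_distance(decode(visual_decoder, z_v), visual_feature)

# Correspondence term for a pair labeled as semantically matching.
loss_match = correspondence_loss(z_v, z_a, same_semantics=True)

# Fused vector that a downstream violence classifier would consume.
fused = fuse(z_v, z_a)
```

In a trained system the reconstruction and correspondence terms would be minimized jointly, so that both modalities land near each other in the shared subspace when they carry the same meaning. The fused vector would then feed a classifier, which this sketch omits.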