Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms

Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-at...

Bibliographic Details
Main Authors:	Chunyan Zeng, Shixiong Feng, Dongliang Zhu, Zhifeng Wang
Format:	Article
Language:	English
Published:	MDPI AG 2023-04-01
Series:	Entropy
Subjects:	audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks
Online Access:	https://www.mdpi.com/1099-4300/25/4/626

author	Chunyan Zeng, Shixiong Feng, Dongliang Zhu, Zhifeng Wang
author_facet	Chunyan Zeng, Shixiong Feng, Dongliang Zhu, Zhifeng Wang
author_sort	Chunyan Zeng
collection	DOAJ
description	Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-attention mechanisms to tackle this problem. In the deep feature extraction stage of recording devices, a two-branch network based on residual dense temporal convolution networks (RD-TCNs) and convolutional neural networks (CNNs) is constructed. The spatial probability distribution features of audio signals are employed as inputs to the branch of the CNN for spatial representation learning, and the temporal spectral features of audio signals are fed into the branch of the RD-TCN network for temporal representation learning. This achieves simultaneous learning of long-term and short-term features to obtain an accurate representation of device-related information. In the spatiotemporal feature fusion stage, three attention mechanisms—temporal, spatial, and branch attention mechanisms—are designed to capture spatiotemporal weights and achieve effective deep feature fusion. The proposed framework achieves state-of-the-art performance on the benchmark CCNU_Mobile dataset, reaching an accuracy of 97.6% for the identification of 45 recording devices, with a significant reduction in training time compared to other models.
first_indexed	2024-03-11T05:03:37Z
format	Article
id	doaj.art-efdc887eaf3a400b8961ac5716aaa680
institution	Directory Open Access Journal
issn	1099-4300
language	English
last_indexed	2024-03-11T05:03:37Z
publishDate	2023-04-01
publisher	MDPI AG
record_format	Article
series	Entropy
spelling	doaj.art-efdc887eaf3a400b8961ac5716aaa680; 2023-11-17T19:08:45Z; eng; MDPI AG; Entropy; ISSN 1099-4300; 2023-04-01; vol. 25, no. 4, art. 626; DOI 10.3390/e25040626; Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms; Chunyan Zeng (0), Shixiong Feng (1), Dongliang Zhu (2), Zhifeng Wang (3); affiliations in listed order: Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China; Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China; National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; Department of Digital Media Technology, Central China Normal University, Wuhan 430079, China; https://www.mdpi.com/1099-4300/25/4/626; audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks
title	Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms
topic	audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks
url	https://www.mdpi.com/1099-4300/25/4/626
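
The description in this record mentions a branch attention mechanism that weights the temporal (RD-TCN) and spatial (CNN) branch features before fusion, but the record does not say how those weights are computed. The following is only a minimal sketch of one common way to do softmax-weighted fusion of two branches, not the authors' implementation. The function names, the scoring vector `score_vec`, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_attention_fuse(temporal_feat, spatial_feat, score_vec):
    """Fuse two equal-length branch features with softmax branch weights.

    score_vec is a hypothetical learned scoring vector (an assumption,
    not taken from the article): each branch is scored by its dot
    product with score_vec, and the two scores are normalised by softmax.
    """
    if not (len(temporal_feat) == len(spatial_feat) == len(score_vec)):
        raise ValueError("all vectors must have the same length")
    scores = [
        sum(f * w for f, w in zip(feat, score_vec))
        for feat in (temporal_feat, spatial_feat)
    ]
    w_t, w_s = softmax(scores)
    # Weighted sum of the two branches, element by element.
    fused = [w_t * t + w_s * s for t, s in zip(temporal_feat, spatial_feat)]
    return fused, (w_t, w_s)


# Toy vectors standing in for the RD-TCN and CNN branch outputs.
temporal = [1.0, 0.0, 2.0]
spatial = [0.0, 1.0, 2.0]
fused, weights = branch_attention_fuse(temporal, spatial, [0.5, 0.5, 0.5])
```

With the toy inputs above both branches receive the same score, so the weights are equal and the fused vector is the element-wise mean. In a trained model the scoring parameters would be learned jointly with the two branches.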

Similar Items