Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms

Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-at...


Bibliographic Details
Main Authors: Chunyan Zeng, Shixiong Feng, Dongliang Zhu, Zhifeng Wang
Format: Article
Language: English
Published: MDPI AG 2023-04-01
Series: Entropy
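Subjects: audio forensics, spatiotemporal representation learning, attention mechanism, temporal convolution networks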
Online Access: https://www.mdpi.com/1099-4300/25/4/626
collection DOAJ
description Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-attention mechanisms to tackle this problem. In the deep feature extraction stage of recording devices, a two-branch network based on residual dense temporal convolution networks (RD-TCNs) and convolutional neural networks (CNNs) is constructed. The spatial probability distribution features of audio signals are employed as inputs to the branch of the CNN for spatial representation learning, and the temporal spectral features of audio signals are fed into the branch of the RD-TCN network for temporal representation learning. This achieves simultaneous learning of long-term and short-term features to obtain an accurate representation of device-related information. In the spatiotemporal feature fusion stage, three attention mechanisms—temporal, spatial, and branch attention mechanisms—are designed to capture spatiotemporal weights and achieve effective deep feature fusion. The proposed framework achieves state-of-the-art performance on the benchmark CCNU_Mobile dataset, reaching an accuracy of 97.6% for the identification of 45 recording devices, with a significant reduction in training time compared to other models.
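The branch attention step mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `branch_attention_fuse`, the scoring vector `w`, and the toy feature values are all hypothetical, and the record does not say how the attention scores are actually computed. The sketch assumes each branch yields one fixed-length embedding, scores each embedding with a dot product, and fuses the two with softmax weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_attention_fuse(spatial_feat, temporal_feat, w):
    """Fuse two branch embeddings with attention weights over the branches.

    spatial_feat  -- embedding from the CNN branch (hypothetical values)
    temporal_feat -- embedding from the RD-TCN branch (hypothetical values)
    w             -- scoring vector; an assumption standing in for learned parameters
    """
    branches = [spatial_feat, temporal_feat]
    # One scalar score per branch: dot product with the scoring vector.
    scores = [sum(wi * xi for wi, xi in zip(w, feat)) for feat in branches]
    # Softmax turns the two scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the branch embeddings, element by element.
    fused = [
        sum(a * feat[i] for a, feat in zip(weights, branches))
        for i in range(len(w))
    ]
    return fused, weights


# Toy inputs, chosen only for illustration.
spatial = [0.2, -1.0, 0.5, 0.3]
temporal = [1.1, 0.4, -0.7, 0.0]
w = [0.5, 0.5, 0.5, 0.5]
fused, weights = branch_attention_fuse(spatial, temporal, w)
```

In a trained model the scoring parameters would be learned jointly with both branches; here they are fixed so the weighting is easy to inspect.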
first_indexed 2024-03-11T05:03:37Z
format Article
id doaj.art-efdc887eaf3a400b8961ac5716aaa680
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-11T05:03:37Z
publishDate 2023-04-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-efdc887eaf3a400b8961ac5716aaa680; 2023-11-17T19:08:45Z; eng; MDPI AG; Entropy; 1099-4300; 2023-04-01; volume 25, issue 4, article 626; 10.3390/e25040626
Chunyan Zeng (0): Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China
Shixiong Feng (1): Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China
Dongliang Zhu (2): National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China
Zhifeng Wang (3): Department of Digital Media Technology, Central China Normal University, Wuhan 430079, China
topic audio forensics
spatiotemporal representation learning
attention mechanism
temporal convolution networks
url https://www.mdpi.com/1099-4300/25/4/626