Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms
Source acquisition device identification from recorded audio aims to identify the source recording device by analyzing the intrinsic characteristics of audio, which is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-at...
Main Authors: | Chunyan Zeng, Shixiong Feng, Dongliang Zhu, Zhifeng Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-04-01 |
Series: | Entropy |
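Subjects: | audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks |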
Online Access: | https://www.mdpi.com/1099-4300/25/4/626 |
_version_ | 1827745192398028800 |
---|---|
author | Chunyan Zeng; Shixiong Feng; Dongliang Zhu; Zhifeng Wang |
author_facet | Chunyan Zeng; Shixiong Feng; Dongliang Zhu; Zhifeng Wang |
author_sort | Chunyan Zeng |
collection | DOAJ |
description | Source acquisition device identification from recorded audio aims to identify the recording device by analyzing intrinsic characteristics of the audio, and it is a challenging problem in audio forensics. In this paper, we propose a spatiotemporal representation learning framework with multi-attention mechanisms to tackle this problem. In the deep feature extraction stage, a two-branch network is constructed from residual dense temporal convolution networks (RD-TCNs) and convolutional neural networks (CNNs). The spatial probability distribution features of audio signals are fed into the CNN branch for spatial representation learning, and the temporal spectral features of audio signals are fed into the RD-TCN branch for temporal representation learning. This achieves simultaneous learning of long-term and short-term features to obtain an accurate representation of device-related information. In the spatiotemporal feature fusion stage, three attention mechanisms (temporal, spatial, and branch attention) are designed to capture spatiotemporal weights and achieve effective deep feature fusion. The proposed framework achieves state-of-the-art performance on the benchmark CCNU_Mobile dataset, reaching an accuracy of 97.6% for the identification of 45 recording devices, with a significant reduction in training time compared to other models. |
first_indexed | 2024-03-11T05:03:37Z |
format | Article |
id | doaj.art-efdc887eaf3a400b8961ac5716aaa680 |
institution | Directory of Open Access Journals |
issn | 1099-4300 |
language | English |
last_indexed | 2024-03-11T05:03:37Z |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-efdc887eaf3a400b8961ac5716aaa680; 2023-11-17T19:08:45Z; eng; MDPI AG; Entropy; 1099-4300; 2023-04-01; 25; 4; 626; 10.3390/e25040626; Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms; Chunyan Zeng (0), Shixiong Feng (1), Dongliang Zhu (2), Zhifeng Wang (3); (0) Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China; (1) Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China; (2) National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; (3) Department of Digital Media Technology, Central China Normal University, Wuhan 430079, China; https://www.mdpi.com/1099-4300/25/4/626; audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks |
title | Source Acquisition Device Identification from Recorded Audio Based on Spatiotemporal Representation Learning with Multi-Attention Mechanisms |
topic | audio forensics; spatiotemporal representation learning; attention mechanism; temporal convolution networks |
url | https://www.mdpi.com/1099-4300/25/4/626 |
work_keys_str_mv | AT chunyanzeng sourceacquisitiondeviceidentificationfromrecordedaudiobasedonspatiotemporalrepresentationlearningwithmultiattentionmechanisms AT shixiongfeng sourceacquisitiondeviceidentificationfromrecordedaudiobasedonspatiotemporalrepresentationlearningwithmultiattentionmechanisms AT dongliangzhu sourceacquisitiondeviceidentificationfromrecordedaudiobasedonspatiotemporalrepresentationlearningwithmultiattentionmechanisms AT zhifengwang sourceacquisitiondeviceidentificationfromrecordedaudiobasedonspatiotemporalrepresentationlearningwithmultiattentionmechanisms |
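
Illustrative note on the description field: the abstract states that branch attention weights the outputs of the CNN (spatial) and RD-TCN (temporal) branches before fusion, but this record does not say how those weights are computed. The sketch below is only one plausible reading, not the authors' implementation. It assumes each branch yields a fixed-length embedding, scores each embedding with a hypothetical scoring vector (`w_spatial`, `w_temporal`, standing in for learned parameters), and normalizes the two scores with a softmax. All function and parameter names are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_attention_fuse(spatial_feat, temporal_feat, w_spatial, w_temporal):
    """Fuse two equal-length branch embeddings with softmax branch weights.

    w_spatial and w_temporal are hypothetical scoring vectors (stand-ins for
    learned parameters); the catalog record does not describe the real scoring.
    """
    if len(spatial_feat) != len(temporal_feat):
        raise ValueError("branch embeddings must have the same length")
    # One scalar score per branch: dot product with its scoring vector.
    s_score = sum(f * w for f, w in zip(spatial_feat, w_spatial))
    t_score = sum(f * w for f, w in zip(temporal_feat, w_temporal))
    # Normalize the two scores so the branch weights sum to 1.
    alpha_s, alpha_t = softmax([s_score, t_score])
    # Weighted element-wise sum of the two branch embeddings.
    fused = [alpha_s * s + alpha_t * t
             for s, t in zip(spatial_feat, temporal_feat)]
    return fused, (alpha_s, alpha_t)


if __name__ == "__main__":
    fused, weights = branch_attention_fuse(
        [1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]
    )
    print(fused, weights)
```

With zero scoring vectors both branches receive equal weight, so the fused embedding is the plain average of the two inputs; a larger score for one branch shifts the fused vector toward that branch.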