Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization

Abstract The goal of sound event detection and localization (SELD) is to identify each individual sound event class and its activity time from a piece of audio, while estimating its spatial location at the time of activity. Conformer combines the advantages of convolutional layers and Transformer, w...

Bibliographic Details
Main Authors: Yuting Zhou, Hongjie Wan
Format: Article
Language: English
Published: SpringerOpen 2023-06-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
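Subjects: Sound event detection and localization; Conformer; Attention mechanism; Multi-task learning; Soft parameter sharing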
Online Access: https://doi.org/10.1186/s13636-023-00292-9
author Yuting Zhou
Hongjie Wan
collection DOAJ
description Abstract The goal of sound event detection and localization (SELD) is to identify each sound event class and its activity time in an audio recording while estimating the event's spatial location during that activity. The Conformer combines the advantages of convolutional layers and the Transformer and is effective in tasks such as speech recognition. However, its high performance relies on a complex network structure and a large amount of computation. For the SELD task, we propose an encoder with a simpler network structure, called the dual-branch attention module (DBAM). The module is derived from the Conformer and uses two parallel branches, attention and convolution, to model both global and local contextual information. We also fuse the low-level and high-level features of the localization task. In addition, we add soft parameter sharing to the joint SELD network so that it can efficiently exploit the potential relationship between the two subtasks, sound event detection (SED) and direction-of-arrival (DOA) estimation. The proposed method can effectively detect two sound events that overlap in time. Experiments on the open DCASE 2020 Task 3 dataset show that the proposed method achieves better SELD performance than the baseline model. Furthermore, ablation experiments verify the effectiveness of the dual-branch attention module and of soft parameter sharing.
first_indexed 2024-03-13T01:53:52Z
format Article
id doaj.art-d61bc4c192794481983ba08b224b34df
institution Directory Open Access Journal
issn 1687-4722
language English
last_indexed 2024-03-13T01:53:52Z
publishDate 2023-06-01
publisher SpringerOpen
record_format Article
series EURASIP Journal on Audio, Speech, and Music Processing
spelling doaj.art-d61bc4c192794481983ba08b224b34df
2023-07-02T11:22:15Z
eng
SpringerOpen
EURASIP Journal on Audio, Speech, and Music Processing
1687-4722
2023-06-01
volume 2023, issue 1, pages 1-15
10.1186/s13636-023-00292-9
Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization
Yuting Zhou (Information Engineering Dept, Beijing University of Chemical Technology)
Hongjie Wan (Information Engineering Dept, Beijing University of Chemical Technology)
https://doi.org/10.1186/s13636-023-00292-9
Sound event detection and localization
Conformer
Attention mechanism
Multi-task learning
Soft parameter sharing
title Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization
topic Sound event detection and localization
Conformer
Attention mechanism
Multi-task learning
Soft parameter sharing
url https://doi.org/10.1186/s13636-023-00292-9
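
Illustrative sketch (not part of the catalog record): the abstract describes an encoder block with two parallel branches, attention for global context and convolution for local context, plus soft parameter sharing between the SED and DOA subtasks. The NumPy code below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The record does not give layer sizes, the number of heads, how the branches are merged, or the form of the sharing, so the single attention head, the depthwise kernel of width 3, the residual sum, and the cross-stitch style mixing matrix `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_branch(x, wq, wk, wv):
    """Single-head self-attention over time: each frame attends to all frames."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def conv_branch(x, kernel):
    """Depthwise 1-D convolution over time with zero padding (output length = input length).

    kernel has shape (width, channels) with odd width. As in most deep
    learning libraries, the kernel is applied without flipping.
    """
    width = kernel.shape[0]
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (padded[t:t + width] * kernel).sum(axis=0)
    return out


def dual_branch_block(x, params):
    """Feed the same input to both branches and merge them with a residual sum (assumed)."""
    global_ctx = attention_branch(x, params["wq"], params["wk"], params["wv"])
    local_ctx = conv_branch(x, params["kernel"])
    return x + global_ctx + local_ctx


def cross_stitch(sed, doa, alpha):
    """Soft parameter sharing (assumed cross-stitch form): each task keeps its
    own features and mixes in a weighted share of the other task's features."""
    mixed_sed = alpha[0, 0] * sed + alpha[0, 1] * doa
    mixed_doa = alpha[1, 0] * sed + alpha[1, 1] * doa
    return mixed_sed, mixed_doa


def init_params(rng, channels, width=3):
    """Random weights for one block; real weights would be learned."""
    scale = 1.0 / np.sqrt(channels)
    return {
        "wq": rng.normal(0.0, scale, (channels, channels)),
        "wk": rng.normal(0.0, scale, (channels, channels)),
        "wv": rng.normal(0.0, scale, (channels, channels)),
        "kernel": rng.normal(0.0, scale, (width, channels)),
    }


rng = np.random.default_rng(0)
frames, channels = 10, 8                      # hypothetical feature map: time x channels
x = rng.normal(size=(frames, channels))

# Separate weights per task; only the mixing step couples the two branches.
sed_feat = dual_branch_block(x, init_params(rng, channels))
doa_feat = dual_branch_block(x, init_params(rng, channels))
alpha = np.array([[0.9, 0.1], [0.1, 0.9]])    # would be learnable in practice
sed_feat, doa_feat = cross_stitch(sed_feat, doa_feat, alpha)
print(sed_feat.shape, doa_feat.shape)
```

With `alpha` set to the identity matrix the two task branches are fully independent, and with all entries equal to 0.5 they receive identical features; values in between give the "soft" sharing the abstract refers to.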