Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET

Bibliographic Details
Main Authors: Jinjia Wang, Jing Xia, Qian Yang, Yuzhen Zhang
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Online Access:https://ieeexplore.ieee.org/document/9000951/

Abstract
The most common methods for sound event detection are the conventional convolutional neural network (CNN), the convolutional recurrent neural network (CRNN), and their variants. However, the pooling operation in a CNN discards the location information of the target object. We remove the pooling operation, retaining only the ReLU and convolution operations, and impose the strong dictionary constraints and the penalty-function prior constraints of multi-layer convolutional sparse coding (ML-CSC). We propose an iterative deep neural network, the unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET), obtained by unfolding the multi-layer local block coordinate descent (ML-LoBCoD) algorithm, which extends the local block coordinate descent (LoBCoD) algorithm. The ML-LoBCoD-NET extracts features that differ from those of a CNN. More importantly, for the weakly-supervised sound event detection task, we propose the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines MRNN-Att with a CRNN to fuse the two kinds of features. Furthermore, for the semi-supervised sound event detection task, we propose the MRNN-Att mean teacher model (MRNN-Att-MT) and the MCRNN-Att mean teacher model (MCRNN-Att-MT), which use the MRNN-Att and MCRNN-Att networks, respectively, as the student model. These models were evaluated on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 4 dataset. The MRNN-Att-MT achieved an F1 score of 22.83% on the development set, 8.77 percentage points above the baseline system, and 15.68% on the evaluation set, 4.88 percentage points above the baseline. The MCRNN-Att-MT achieved an F1 score of 20.35% on the development set, 6.29 percentage points above the baseline, and 14.56% on the evaluation set, 3.76 percentage points above the baseline.
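
The abstract names the mean teacher model without describing it. The sketch below shows the generic mean teacher scheme of Tarvainen and Valpola: the teacher's weights are an exponential moving average (EMA) of the student's, labeled clips contribute a supervised loss, and every clip, labeled or unlabeled, contributes a consistency loss between student and teacher predictions. It is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not report their EMA decay, loss functions, or weighting schedule, so `alpha`, `max_weight`, `rampup_steps`, the MSE consistency term, the exp(-5(1 - t)^2) ramp-up, and all function names here are hypothetical, and plain Python lists stand in for network weights.

```python
import math
from typing import Dict, List, Optional

# Hypothetical stand-in for network weights: layer name -> flat list of floats.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> Params:
    """Return new teacher weights as an exponential moving average of the student's.

    alpha (the EMA decay) is an assumed value, not one reported in the paper.
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def bce(probs: List[float], labels: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over event classes (multi-label clip tags)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def consistency(student_probs: List[float], teacher_probs: List[float]) -> float:
    """Mean squared error between student and teacher class probabilities."""
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


def rampup(step: int, rampup_steps: int) -> float:
    """Ramp-up weight exp(-5 * (1 - t)^2), with t = step / rampup_steps clipped to [0, 1]."""
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)


def mean_teacher_loss(student_probs: List[float], teacher_probs: List[float],
                      labels: Optional[List[float]], step: int,
                      max_weight: float = 2.0, rampup_steps: int = 1000) -> float:
    """Student objective for one clip (the student would be MRNN-Att or MCRNN-Att).

    Labeled clips (labels is not None) add a supervised BCE term; every clip,
    labeled or not, adds the ramped consistency term. In a real framework,
    teacher_probs would be treated as a constant target (no gradient).
    """
    supervised = 0.0 if labels is None else bce(student_probs, labels)
    weight = max_weight * rampup(step, rampup_steps)
    return supervised + weight * consistency(student_probs, teacher_probs)
```

For example, `ema_update({"fc": [0.0, 0.0]}, {"fc": [1.0, -1.0]}, alpha=0.9)` moves the teacher to approximately `[0.1, -0.1]`. In a training loop, the student is updated by gradient descent on `mean_teacher_loss`, the teacher receives no gradient, and `ema_update` runs after each student step, so the teacher changes slowly and provides smoother targets for the unlabeled clips.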

ISSN: 2169-3536
Collection: Directory of Open Access Journals (DOAJ), record ID doaj.art-ea46451d61154bd7a2436ec7180e6e82
Volume: 8, pages 38032-38044 (2020)
DOI: 10.1109/ACCESS.2020.2974479
Affiliation (all authors): School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
ORCID: Jinjia Wang 0000-0002-2210-5570; Jing Xia 0000-0001-5245-562X; Qian Yang 0000-0002-6552-7482; Yuzhen Zhang 0000-0001-7655-4470
Keywords: Sound event detection; weakly-supervised learning; semi-supervised learning; mean teacher model; multi-layer local block coordinate descent; convolutional recurrent neural network