Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET
Among the most common methods for sound event detection are the conventional convolutional neural network (CNN), the convolutional recurrent neural network (CRNN), and their variants. However, the pooling operation of the CNN has the disadvantage of losing the location information of the target object....
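
The claim about pooling can be illustrated with a toy example. The sketch below is not taken from the paper: `max_pool_1d` and the two activation sequences are hypothetical, chosen only to show that non-overlapping max pooling returns the same output whether a peak sits at the start or the end of a pooling window.

```python
def max_pool_1d(x, size):
    """Non-overlapping max pooling over a 1-D sequence (hypothetical helper)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Made-up activations: the same peak at frame 0 versus frame 3.
early = [9, 0, 0, 0, 1, 1, 1, 1]
late = [0, 0, 0, 9, 1, 1, 1, 1]

pooled_early = max_pool_1d(early, 4)
pooled_late = max_pool_1d(late, 4)

# Both inputs pool to the same values, so the frame at which the
# peak occurred inside the window can no longer be recovered.
print(pooled_early, pooled_late)
```

For sound event detection this matters because event onsets and offsets are defined by frame position, which is presumably why the authors avoid pooling.
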
Main Authors: | Jinjia Wang, Jing Xia, Qian Yang, Yuzhen Zhang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
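Subjects: | Sound event detection; weakly-supervised learning; semi-supervised learning; mean teacher model; multi-layer local block coordinate descent; convolutional recurrent neural network |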
Online Access: | https://ieeexplore.ieee.org/document/9000951/ |
_version_ | 1818616914638798848 |
---|---|
author | Jinjia Wang Jing Xia Qian Yang Yuzhen Zhang |
author_facet | Jinjia Wang Jing Xia Qian Yang Yuzhen Zhang |
author_sort | Jinjia Wang |
collection | DOAJ |
description | Among the most common methods for sound event detection are the conventional convolutional neural network (CNN), the convolutional recurrent neural network (CRNN), and their variants. However, the pooling operation of the CNN has the disadvantage of losing the location information of the target object. We remove the pooling operation while retaining the ReLU and convolution operations, and we apply the strong dictionary constraints and the penalty-function prior constraints of multi-layer convolutional sparse coding (ML-CSC). We propose an iterative deep neural network, the unfolded multi-layer local block coordinate descent network (ML-LoBCoD-NET), driven by the multi-layer local block coordinate descent (ML-LoBCoD) algorithm, which extends the local block coordinate descent (LoBCoD) algorithm. The ML-LoBCoD-NET can extract features different from those of the CNN. More importantly, for the weakly-supervised sound event detection task, we propose the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines the MRNN-Att and CRNN networks to fuse their different features. Furthermore, for the semi-supervised sound event detection task, we propose the MRNN-Att mean teacher model (MRNN-Att-MT) and the MCRNN-Att mean teacher model (MCRNN-Att-MT), in which the MRNN-Att and MCRNN-Att networks, respectively, serve as the student model. These models were tested on the dataset of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 4. The MRNN-Att-MT achieved an F1 score of 22.83% on the development set, 8.77 percentage points higher than the baseline system, and 15.68% on the evaluation set, 4.88 percentage points higher than the baseline system. The MCRNN-Att-MT achieved an F1 score of 20.35% on the development set, 6.29 percentage points higher than the baseline system, and 14.56% on the evaluation set, 3.76 percentage points higher than the baseline system. |
first_indexed | 2024-12-16T16:57:22Z |
format | Article |
id | doaj.art-ea46451d61154bd7a2436ec7180e6e82 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-16T16:57:22Z |
publishDate | 2020-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-ea46451d61154bd7a2436ec7180e6e82; 2022-12-21T22:23:51Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 38032; 38044; 10.1109/ACCESS.2020.2974479; 9000951; Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET; Jinjia Wang (https://orcid.org/0000-0002-2210-5570); Jing Xia (https://orcid.org/0000-0001-5245-562X); Qian Yang (https://orcid.org/0000-0002-6552-7482); Yuzhen Zhang (https://orcid.org/0000-0001-7655-4470); School of Information Science and Engineering, Yanshan University, Qinhuangdao, China (all four authors); https://ieeexplore.ieee.org/document/9000951/; Sound event detection; weakly-supervised learning; semi-supervised learning; mean teacher model; multi-layer local block coordinate descent; convolutional recurrent neural network |
spellingShingle | Jinjia Wang Jing Xia Qian Yang Yuzhen Zhang Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET IEEE Access Sound event detection weakly-supervised learning semi-supervised learning mean teacher model multi-layer local block coordinate descent convolutional recurrent neural network |
title | Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET |
title_full | Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET |
title_fullStr | Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET |
title_full_unstemmed | Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET |
title_short | Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET |
title_sort | research on semi supervised sound event detection based on mean teacher models using ml lobcod net |
topic | Sound event detection; weakly-supervised learning; semi-supervised learning; mean teacher model; multi-layer local block coordinate descent; convolutional recurrent neural network |
url | https://ieeexplore.ieee.org/document/9000951/ |
work_keys_str_mv | AT jinjiawang researchonsemisupervisedsoundeventdetectionbasedonmeanteachermodelsusingmllobcodnet AT jingxia researchonsemisupervisedsoundeventdetectionbasedonmeanteachermodelsusingmllobcodnet AT qianyang researchonsemisupervisedsoundeventdetectionbasedonmeanteachermodelsusingmllobcodnet AT yuzhenzhang researchonsemisupervisedsoundeventdetectionbasedonmeanteachermodelsusingmllobcodnet |
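
The description names mean teacher models with the MRNN-Att and MCRNN-Att networks as students, but this record does not give the training details. As a minimal sketch of the standard mean teacher formulation (an assumption here, not the authors' confirmed setup), the teacher's weights track an exponential moving average of the student's weights, and a consistency term penalises disagreement between the two models' predictions. The function names, the plain-list weight representation, and the values of `alpha` are all hypothetical.

```python
def ema_update(teacher, student, alpha=0.999):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


# Toy weights and predictions, made up for illustration only.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 1.0]
new_teacher = ema_update(teacher_weights, student_weights, alpha=0.9)

loss = consistency_loss([0.8, 0.2], [0.6, 0.4])
print(new_teacher, loss)
```

In this formulation only the student is trained by gradient descent; the teacher is updated by the averaging step after each training step, which is what allows unlabeled clips to contribute through the consistency term.
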