Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection

Audio event detection (AED) is a task of recognizing the types of audio events in an audio stream and estimating their temporal positions. AED is typically based on fully supervised approaches, requiring strong labels including both the presence and temporal position of each audio event. However, fu...

Bibliographic Details
Main Authors:	Inkyu Choi, Soo Hyun Bae, Nam Soo Kim
Format:	Article
Language:	English
Published:	MDPI AG 2019-06-01
Series:	Applied Sciences
Subjects:	audio event detection; weakly supervised learning; convolutional neural network; structured prediction; conditional random field
Online Access:	https://www.mdpi.com/2076-3417/9/11/2302

author	Inkyu Choi, Soo Hyun Bae, Nam Soo Kim
collection	DOAJ
description	Audio event detection (AED) is a task of recognizing the types of audio events in an audio stream and estimating their temporal positions. AED is typically based on fully supervised approaches, requiring strong labels including both the presence and temporal position of each audio event. However, fully supervised datasets are not easily available due to the heavy cost of human annotation. Recently, weakly supervised approaches for AED have been proposed, utilizing large scale datasets with weak labels including only the occurrence of events in recordings. In this work, we introduce a deep convolutional neural network (CNN) model called DSNet based on densely connected convolution networks (DenseNets) and squeeze-and-excitation networks (SENets) for weakly supervised training of AED. DSNet alleviates the vanishing-gradient problem and strengthens feature propagation and models interdependencies between channels. We also propose a structured prediction method for weakly supervised AED. We apply a recurrent neural network (RNN) based framework and a prediction smoothness cost function to consider long-term contextual information with reduced error propagation. In post-processing, conditional random fields (CRFs) are applied to take into account the dependency between segments and delineate the borders of audio events precisely. We evaluated our proposed models on the DCASE 2017 task 4 dataset and obtained state-of-the-art results on both audio tagging and event detection tasks.
first_indexed	2024-12-24T11:03:36Z
format	Article
id	doaj.art-32a8c26516c04ef6ac5ae45c0830268b
institution	Directory Open Access Journal
issn	2076-3417
language	English
last_indexed	2024-12-24T11:03:36Z
publishDate	2019-06-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-32a8c26516c04ef6ac5ae45c0830268b | 2022-12-21T16:58:39Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2019-06-01 | 9 | 11 | 2302 | 10.3390/app9112302 | app9112302 | Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection | Inkyu Choi; Soo Hyun Bae; Nam Soo Kim | Department of Electrical and Computer Engineering and INMC, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea (listed for all three authors) | https://www.mdpi.com/2076-3417/9/11/2302 | audio event detection; weakly supervised learning; convolutional neural network; structured prediction; conditional random field
title	Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
topic	audio event detection; weakly supervised learning; convolutional neural network; structured prediction; conditional random field
url	https://www.mdpi.com/2076-3417/9/11/2302

Similar Items