Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

Abstract Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependenc...


Bibliographic Details
Main Authors: Diego de Benito-Gorron, Alicia Lozano-Diez, Doroteo T. Toledano, Joaquin Gonzalez-Rodriguez
Format: Article
Language: English
Published: SpringerOpen 2019-06-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
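Subjects: Acoustic event detection; Speech activity detection; Music activity detection; Neural networks; Convolutional networks; LSTM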
Online Access: http://link.springer.com/article/10.1186/s13636-019-0152-1
_version_ 1818524109508706304
author Diego de Benito-Gorron
Alicia Lozano-Diez
Doroteo T. Toledano
Joaquin Gonzalez-Rodriguez
author_sort Diego de Benito-Gorron
collection DOAJ
description Abstract: Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments are taken from YouTube videos and are represented as mel-spectrograms. We propose and compare two approaches. The first is to train two different neural networks, one for speech detection and another for music detection. The second is to train a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional, and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out to identify which events in the ontology are the most harmful to the performance of the models, showing some difficult scenarios for the detection of music and speech.
first_indexed 2024-12-11T05:53:05Z
format Article
id doaj.art-b982cad8e8874ccda8e178833d78cb69
institution Directory Open Access Journal
issn 1687-4722
language English
last_indexed 2024-12-11T05:53:05Z
publishDate 2019-06-01
publisher SpringerOpen
record_format Article
series EURASIP Journal on Audio, Speech, and Music Processing
spelling doaj.art-b982cad8e8874ccda8e178833d78cb69 | 2022-12-22T01:18:46Z | eng | SpringerOpen | EURASIP Journal on Audio, Speech, and Music Processing | 1687-4722 | 2019-06-01 | 2019 | 1 | 1 | 18 | 10.1186/s13636-019-0152-1
Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
Diego de Benito-Gorron [0] | Alicia Lozano-Diez [1] | Doroteo T. Toledano [2] | Joaquin Gonzalez-Rodriguez [3]
[0]-[3] AUDIAS (Audio, Data Intelligence and Speech) - Universidad Autonoma de Madrid
http://link.springer.com/article/10.1186/s13636-019-0152-1
Acoustic event detection | Speech activity detection | Music activity detection | Neural networks | Convolutional networks | LSTM
title Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset
title_sort exploring convolutional recurrent and hybrid deep neural networks for speech and music detection in a large audio dataset
topic Acoustic event detection
Speech activity detection
Music activity detection
Neural networks
Convolutional networks
LSTM
url http://link.springer.com/article/10.1186/s13636-019-0152-1