A Survey of Audio Classification Using Deep Learning

Bibliographic Details
Main Authors: Khalid Zaman, Melike Sah, Cem Direkoglu, Masashi Unoki
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access, vol. 11, pp. 106620-106649
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3318015
Subjects: Audio, speech, music, emotion, noise, classification
Online Access: https://ieeexplore.ieee.org/document/10258355/
Description:
Deep learning can be applied to audio signal classification in a variety of ways: it can detect and classify different types of audio signals, such as speech, music, and environmental sounds. Deep learning models learn complex patterns in audio signals and, when trained on large datasets, achieve high accuracy. To employ deep learning for audio classification, the audio signal must first be represented in a suitable form, using techniques such as spectrograms, Mel-frequency cepstral coefficients (MFCCs), linear predictive coding, or wavelet decomposition. Once represented in such a form, the signal can be fed into a deep learning model. We provide an extensive survey of current deep learning models used for a variety of audio classification tasks. In particular, we focus on works published under five deep neural network architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), autoencoders, Transformers, and hybrid models (combinations of deep learning architectures, and deep learning models combined with traditional classifiers). CNNs can classify audio signals into categories such as speech, music, and environmental sounds; they are also used for speech recognition, speaker identification, and emotion recognition. RNNs are widely used for audio classification and audio segmentation: RNN models capture the temporal patterns of audio signals and can classify audio segments into different categories. Another approach uses autoencoders to learn features of audio signals before classifying them. Transformers are also well suited to audio classification; in particular, temporal and frequency features can be extracted to identify the characteristics of the audio signals.
Finally, hybrid models for audio classification either combine different deep learning architectures (e.g., CNN-RNN) or combine deep learning models with traditional machine learning techniques (e.g., CNN with a Support Vector Machine). These hybrid models exploit the strengths of the different architectures while avoiding their weaknesses. Existing literature under each category of deep learning is summarized and compared in detail.
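The representation step the description refers to, converting a raw waveform into a spectrogram before feeding it to a deep model, can be sketched in a few lines of NumPy. This is an illustrative example, not code from the survey; the frame length, hop size, and the synthetic 1 kHz test tone are arbitrary choices for demonstration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a framed short-time Fourier transform."""
    window = np.hanning(frame_len)           # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin (DC .. Nyquist).
    return np.abs(np.fft.rfft(frames, axis=1))

# A one-second 1 kHz tone sampled at 16 kHz stands in for a real audio clip.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

The resulting time-frequency matrix is the kind of 2-D input a CNN consumes directly, which is why spectrogram-style representations dominate the works surveyed.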
Author Affiliations:
Khalid Zaman (ORCID 0009-0004-0809-7537): Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan
Melike Sah (ORCID 0000-0003-3869-7205): Computer Engineering Department, Cyprus International University, Nicosia, North Cyprus, Turkey
Cem Direkoglu (ORCID 0000-0001-7709-4082): Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Guzelyurt, Turkey
Masashi Unoki (ORCID 0000-0002-6605-2052): Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan