A Survey of Audio Classification Using Deep Learning

Bibliographic Details
Main Authors: Khalid Zaman, Melike Sah, Cem Direkoglu, Masashi Unoki
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access, vol. 11, pp. 106620-106649
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3318015
Subjects: Audio, speech, music, emotion, noise, classification
Online Access: https://ieeexplore.ieee.org/document/10258355/
Description:
Deep learning can be applied to audio signal classification in a variety of ways: it can detect and classify different types of audio signals, such as speech, music, and environmental sounds. Deep learning models learn complex patterns in audio signals and, when trained on large datasets, achieve high accuracy. To employ deep learning for audio classification, the audio signal must first be represented in a suitable form, using techniques such as spectrograms, Mel-frequency cepstral coefficients (MFCCs), linear predictive coding, or wavelet decomposition. Once represented in such a form, the signal can be fed into a deep learning model. We provide an extensive survey of current deep learning models used for a variety of audio classification tasks. In particular, we focus on works published under five deep neural network architectures: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), autoencoders, Transformers, and hybrid models (combinations of deep learning architectures, and deep learning models combined with traditional classifiers). CNNs can classify audio signals into categories such as speech, music, and environmental sounds; they are also used for speech recognition, speaker identification, and emotion recognition. RNNs are widely used for audio classification and audio segmentation: RNN models capture the temporal patterns of audio signals and can classify audio segments into different categories. Another approach uses autoencoders to learn features of audio signals before classifying them. Transformers are also well suited to audio classification; in particular, temporal and frequency features can be extracted to identify the characteristics of the audio signals.
Finally, hybrid models for audio classification either combine different deep learning architectures (e.g., CNN-RNN) or combine deep learning models with traditional machine learning techniques (e.g., CNN with a Support Vector Machine). These hybrid models exploit the strengths of the different architectures while avoiding their weaknesses. Existing literature under each category of deep learning is summarized and compared in detail.
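The representation step the description refers to, converting a raw waveform into a spectrogram before feeding it to a deep model, can be sketched in a few lines of NumPy. This is an illustrative example, not code from the survey; the frame length, hop size, and the synthetic 1 kHz test tone are arbitrary choices for demonstration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a framed short-time Fourier transform."""
    window = np.hanning(frame_len)           # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One row per frame, one column per frequency bin (DC .. Nyquist).
    return np.abs(np.fft.rfft(frames, axis=1))

# A one-second 1 kHz tone sampled at 16 kHz stands in for a real audio clip.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (124, 129): 124 frames, 129 frequency bins
```

The resulting time-frequency matrix is the kind of 2-D input a CNN consumes directly, which is why spectrogram-style representations dominate the works surveyed.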
Author Affiliations:
Khalid Zaman (ORCID 0009-0004-0809-7537): Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan
Melike Sah (ORCID 0000-0003-3869-7205): Computer Engineering Department, Cyprus International University, Nicosia, North Cyprus, Turkey
Cem Direkoglu (ORCID 0000-0001-7709-4082): Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Guzelyurt, Turkey
Masashi Unoki (ORCID 0000-0002-6605-2052): Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan