A Survey of Audio Classification Using Deep Learning
Deep learning can be used for audio signal classification in a variety of ways. It can be used to detect and classify various types of audio signals such as speech, music, and environmental sounds. Deep learning models are able to learn complex patterns in audio signals and can be trained on large datasets to achieve high accuracy.
Main Authors: | Khalid Zaman, Melike Sah, Cem Direkoglu, Masashi Unoki |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Audio; speech; music; emotion; noise; classification |
Online Access: | https://ieeexplore.ieee.org/document/10258355/ |
_version_ | 1827796597164998656 |
---|---|
author | Khalid Zaman; Melike Sah; Cem Direkoglu; Masashi Unoki |
author_facet | Khalid Zaman; Melike Sah; Cem Direkoglu; Masashi Unoki |
author_sort | Khalid Zaman |
collection | DOAJ |
description | Deep learning can be used for audio signal classification in a variety of ways. It can be used to detect and classify various types of audio signals such as speech, music, and environmental sounds. Deep learning models are able to learn complex patterns in audio signals and can be trained on large datasets to achieve high accuracy. To employ deep learning for audio signal classification, the audio signal must first be represented in a suitable form. This can be done using signal representation techniques such as spectrograms, Mel-frequency cepstral coefficients (MFCCs), linear predictive coding, and wavelet decomposition. Once the audio signal is represented in a suitable form, it can then be fed into a deep learning model. Various deep learning models can be utilized for audio classification. We provide an extensive survey of current deep learning models used for a variety of audio classification tasks. In particular, we focus on works published under five different deep neural network architectures, namely Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), autoencoders, Transformers, and hybrid models (hybrid deep learning models and hybrid deep learning models with traditional classifiers). CNNs can be used to classify audio signals into different categories such as speech, music, and environmental sounds; they can also be used for speech recognition, speaker identification, and emotion recognition. RNNs are widely used for audio classification and audio segmentation; RNN models can capture the temporal patterns of audio signals and classify audio segments into different categories. Another approach is to use autoencoders to learn features of audio signals and then classify the signals into different categories. Transformers are also well suited to audio classification; in particular, temporal and frequency features can be extracted to identify the characteristics of the audio signals. 
Finally, hybrid models for audio classification either combine different deep learning architectures (e.g., CNN-RNN) or combine deep learning models with traditional machine learning techniques (e.g., CNN-Support Vector Machine). These hybrid models take advantage of the strengths of different architectures while avoiding their weaknesses. Existing literature under each category of deep learning is summarized and compared in detail. |
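The representation step the abstract describes (converting a raw waveform into a time-frequency form such as a spectrogram before feeding it to a network) can be sketched with a minimal NumPy short-time Fourier transform. This is an illustrative sketch, not code from the survey; the frame length, hop size, and 440 Hz test tone are arbitrary choices for the example.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> real FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # shape: (n_frames, frame_len // 2 + 1) -- time on axis 0, frequency on axis 1
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 8 kHz as a toy input.
sr = 8000
t = np.arange(sr) / sr
spec = stft_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (61, 129)
```

Each row of `spec` is one time frame; a deep model for audio classification typically consumes this 2-D array (or a mel-scaled/MFCC variant of it) as an image-like input. The energy peak lands near bin 14, since the bin width is 8000 / 256 = 31.25 Hz and 440 / 31.25 ≈ 14.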
first_indexed | 2024-03-11T19:09:14Z |
format | Article |
id | doaj.art-7991b4deb4004596a3e463f9af1e762d |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-11T19:09:14Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-7991b4deb4004596a3e463f9af1e762d; 2023-10-09T23:01:47Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; Vol. 11, pp. 106620–106649; DOI 10.1109/ACCESS.2023.3318015; IEEE document 10258355; A Survey of Audio Classification Using Deep Learning; Khalid Zaman (https://orcid.org/0009-0004-0809-7537; Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan); Melike Sah (https://orcid.org/0000-0003-3869-7205; Computer Engineering Department, Cyprus International University, Nicosia, North Cyprus, Turkey); Cem Direkoglu (https://orcid.org/0000-0001-7709-4082; Electrical and Electronics Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanli, Guzelyurt, Turkey); Masashi Unoki (https://orcid.org/0000-0002-6605-2052; Graduate School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology, Nomi, Japan); abstract as given in the description field above; https://ieeexplore.ieee.org/document/10258355/; Audio; speech; music; emotion; noise; classification |
spellingShingle | Khalid Zaman; Melike Sah; Cem Direkoglu; Masashi Unoki; A Survey of Audio Classification Using Deep Learning; IEEE Access; Audio; speech; music; emotion; noise; classification |
title | A Survey of Audio Classification Using Deep Learning |
title_full | A Survey of Audio Classification Using Deep Learning |
title_fullStr | A Survey of Audio Classification Using Deep Learning |
title_full_unstemmed | A Survey of Audio Classification Using Deep Learning |
title_short | A Survey of Audio Classification Using Deep Learning |
title_sort | survey of audio classification using deep learning |
topic | Audio; speech; music; emotion; noise; classification |
url | https://ieeexplore.ieee.org/document/10258355/ |
work_keys_str_mv | AT khalidzaman asurveyofaudioclassificationusingdeeplearning AT melikesah asurveyofaudioclassificationusingdeeplearning AT cemdirekoglu asurveyofaudioclassificationusingdeeplearning AT masashiunoki asurveyofaudioclassificationusingdeeplearning AT khalidzaman surveyofaudioclassificationusingdeeplearning AT melikesah surveyofaudioclassificationusingdeeplearning AT cemdirekoglu surveyofaudioclassificationusingdeeplearning AT masashiunoki surveyofaudioclassificationusingdeeplearning |