Development of audio-visual speech recognition using deep-learning technique

Deep learning is an artificial intelligence (AI) technique that simulates human learning behavior. Audio-visual speech recognition is important for helping the listener truly understand the emotions behind spoken words. In this thesis, two different deep learning models, Convolutional Neural Network...

Bibliographic Details
Main Authors: How, Chun Kit, Mohd Khairuddin, Ismail, Mohd Razman, Mohd Azraai, Anwar, P. P. Abdul Majeed, Mohd Isa, Wan Hasbullah
Format: Article
Language: English
Published: Penerbit UMP 2022
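Subjects: TJ Mechanical engineering and machinery; TK Electrical engineering. Electronics Nuclear engineering; TS Manufactures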
Online Access: http://umpir.ump.edu.my/id/eprint/37244/1/Development%20of%20audio%20visual%20speech%20recognition.pdf
author How, Chun Kit
Mohd Khairuddin, Ismail
Mohd Razman, Mohd Azraai
Anwar, P. P. Abdul Majeed
Mohd Isa, Wan Hasbullah
collection UMP
description Deep learning is an artificial intelligence (AI) technique that simulates human learning behavior. Audio-visual speech recognition is important for helping the listener truly understand the emotions behind spoken words. In this thesis, two deep learning models, a Convolutional Neural Network (CNN) and a Deep Neural Network (DNN), were developed to recognize the emotion of speech from the dataset. The PyTorch framework with the torchaudio library was used. Both models were given the same training, validation, testing, and augmented datasets. Training stops when the training loop reaches ten epochs or when the validation loss has not improved for five epochs. The highest accuracy and lowest loss of the CNN model on the training dataset are 76.50% and 0.006029 respectively, while the DNN model achieved 75.42% and 0.086643 respectively. Both models were evaluated using a confusion matrix. In conclusion, the CNN model performs better than the DNN model, but it needs improvement because its accuracy on the testing dataset is low and its loss is high.
first_indexed 2024-03-06T13:05:20Z
format Article
id UMPir37244
institution Universiti Malaysia Pahang
language English
last_indexed 2024-03-06T13:05:20Z
publishDate 2022
publisher Penerbit UMP
record_format dspace
spelling UMPir37244 2023-03-09T03:50:23Z http://umpir.ump.edu.my/id/eprint/37244/ Development of audio-visual speech recognition using deep-learning technique How, Chun Kit Mohd Khairuddin, Ismail Mohd Razman, Mohd Azraai Anwar, P. P. Abdul Majeed Mohd Isa, Wan Hasbullah TJ Mechanical engineering and machinery TK Electrical engineering. Electronics Nuclear engineering TS Manufactures Penerbit UMP 2022-06 Article PeerReviewed pdf en cc_by_nc_4 http://umpir.ump.edu.my/id/eprint/37244/1/Development%20of%20audio%20visual%20speech%20recognition.pdf How, Chun Kit and Mohd Khairuddin, Ismail and Mohd Razman, Mohd Azraai and Anwar, P. P. Abdul Majeed and Mohd Isa, Wan Hasbullah (2022) Development of audio-visual speech recognition using deep-learning technique. Mekatronika - Journal of Intelligent Manufacturing & Mechatronics, 4 (1). pp. 88-95. ISSN 2637-0883.
(Published) https://doi.org/10.15282/mekatronika.v4i1.8625
title Development of audio-visual speech recognition using deep-learning technique
topic TJ Mechanical engineering and machinery
TK Electrical engineering. Electronics Nuclear engineering
TS Manufactures
url http://umpir.ump.edu.my/id/eprint/37244/1/Development%20of%20audio%20visual%20speech%20recognition.pdf