Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study
Automatic Speech Recognition (ASR) systems are notorious for their poor performance in adverse conditions, leading to high sensitivity and low robustness. Due to the costly and time-consuming nature of creating extensive speech databases, addressing the issue of low robustness has become a promine...
Main Authors: | GALIC, J., GROZDIC, D. |
---|---|
Format: | Article |
Language: | English |
Published: | Stefan cel Mare University of Suceava, 2023-08-01 |
Series: | Advances in Electrical and Computer Engineering |
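Subjects: | artificial neural networks, audio databases, automatic speech recognition, hidden Markov models, support vector machines |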
Online Access: | http://dx.doi.org/10.4316/AECE.2023.03001 |
_version_ | 1797725130107912192 |
---|---|
author | GALIC, J.; GROZDIC, D. |
author_facet | GALIC, J.; GROZDIC, D. |
author_sort | GALIC, J. |
collection | DOAJ |
description | Automatic Speech Recognition (ASR) systems are notorious for their poor performance in adverse conditions, leading to high sensitivity and low robustness. Due to the costly and time-consuming nature of creating extensive speech databases, addressing the issue of low robustness has become a prominent area of research, focusing on the synthetic generation of speech data using pre-existing natural speech. This paper examines the impact of standard data augmentation techniques, including pitch shift, time stretch, volume control, and their combination, on the accuracy of isolated-word ASR systems. The performance of three machine learning models, namely Hidden Markov Models (HMM), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN), is analyzed on two Serbian corpora of isolated words. The Whi-Spe speech database in neutral phonation is utilized for augmentation and training, and a specifically developed Python-based software tool is employed for the augmentation process in this research study. The conducted experiments demonstrate a statistically significant reduction in the Word Error Rate (WER) for the CNN-based recognizer on both testing datasets, achieved through a single augmentation technique based on pitch-shifting. |
first_indexed | 2024-03-12T10:26:43Z |
format | Article |
id | doaj.art-61e2ace3470249b2a1b89d68dedc0070 |
institution | Directory Open Access Journal |
issn | 1582-7445 1844-7600 |
language | English |
last_indexed | 2024-03-12T10:26:43Z |
publishDate | 2023-08-01 |
publisher | Stefan cel Mare University of Suceava |
record_format | Article |
series | Advances in Electrical and Computer Engineering |
spelling | doaj.art-61e2ace3470249b2a1b89d68dedc0070 2023-09-02T09:41:55Z eng Stefan cel Mare University of Suceava Advances in Electrical and Computer Engineering 1582-7445 1844-7600 2023-08-01 233312 10.4316/AECE.2023.03001 Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study GALIC, J. GROZDIC, D. Automatic Speech Recognition (ASR) systems are notorious for their poor performance in adverse conditions, leading to high sensitivity and low robustness. Due to the costly and time-consuming nature of creating extensive speech databases, addressing the issue of low robustness has become a prominent area of research, focusing on the synthetic generation of speech data using pre-existing natural speech. This paper examines the impact of standard data augmentation techniques, including pitch shift, time stretch, volume control, and their combination, on the accuracy of isolated-word ASR systems. The performance of three machine learning models, namely Hidden Markov Models (HMM), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN), is analyzed on two Serbian corpora of isolated words. The Whi-Spe speech database in neutral phonation is utilized for augmentation and training, and a specifically developed Python-based software tool is employed for the augmentation process in this research study. The conducted experiments demonstrate a statistically significant reduction in the Word Error Rate (WER) for the CNN-based recognizer on both testing datasets, achieved through a single augmentation technique based on pitch-shifting. http://dx.doi.org/10.4316/AECE.2023.03001 artificial neural networks audio databases automatic speech recognition hidden markov models support vector machines |
spellingShingle | GALIC, J. GROZDIC, D. Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study Advances in Electrical and Computer Engineering artificial neural networks audio databases automatic speech recognition hidden markov models support vector machines |
title | Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study |
topic | artificial neural networks; audio databases; automatic speech recognition; hidden Markov models; support vector machines |
url | http://dx.doi.org/10.4316/AECE.2023.03001 |
work_keys_str_mv | AT galicj exploringtheimpactofdataaugmentationtechniquesonautomaticspeechrecognitionsystemdevelopmentacomparativestudy AT grozdicd exploringtheimpactofdataaugmentationtechniquesonautomaticspeechrecognitionsystemdevelopmentacomparativestudy |
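
The description above names volume control as one of the augmentation techniques and Word Error Rate (WER) as the evaluation metric. The following is a minimal sketch of both ideas, not the authors' Python tool, which this record does not describe. The function names `apply_gain_db` and `isolated_word_wer`, the decibel-based gain, and the toy word lists are all assumptions made for illustration. The WER function further assumes that each test utterance contains exactly one word and produces exactly one hypothesis, so the rate reduces to the percentage of misrecognized words.

```python
import math


def apply_gain_db(samples, gain_db):
    """Scale float samples (expected in [-1.0, 1.0]) by gain_db decibels.

    Hypothetical helper: values pushed outside [-1.0, 1.0] are clipped.
    """
    factor = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a dB change
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


def isolated_word_wer(references, hypotheses):
    """Return the WER in percent for one-word-per-utterance recognition.

    Assumes one reference word and one hypothesis per utterance, so every
    error is a substitution and WER = 100 * errors / number of words.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    if not references:
        raise ValueError("at least one utterance is required")
    errors = sum(ref != hyp for ref, hyp in zip(references, hypotheses))
    return 100.0 * errors / len(references)


# Toy signal (assumed): 10 ms of a 440 Hz tone sampled at 16 kHz.
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(160)]

# Two augmented copies of the same utterance at different loudness levels.
quieter = apply_gain_db(tone, -6.0)
louder = apply_gain_db(tone, 6.0)

# Toy recognition result (assumed labels): one of four words is wrong.
refs = ["jedan", "dva", "tri", "pet"]
hyps = ["jedan", "dva", "tri", "sest"]
print(f"WER: {isolated_word_wer(refs, hyps):.1f}%")
```

Pitch shifting and time stretching need a signal-processing library or a phase-vocoder implementation, so they are left out of this sketch.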