Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition
Recent advances in self-supervised learning have allowed automatic speech recognition (ASR) systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled data needed by their predecessors. Nevertheless, while such models achieve SOTA results in matched train/test scenarios, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain adaptation have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices; thus, model compression tools, such as knowledge distillation, are needed. In this paper, we propose three innovations on top of the existing DistilHuBERT distillation recipe: optimizing the prediction heads, employing a targeted data augmentation method for different environmental scenarios, and employing a real-time environment estimator to choose between compressed models at inference. Experiments with the LibriSpeech dataset, corrupted with varying noise types and reverberation levels, show the proposed method outperforming several benchmark methods, both original and compressed, by as much as 48.4% and 89.2% in word error reduction rate under extremely noisy and reverberant conditions, respectively, while reducing the number of parameters by 50%. Thus, the proposed method is well suited for resource-constrained edge speech recognition applications.
| Main Authors: | Arthur Pimentel, Heitor R. Guimarães, Anderson Avila, Tiago H. Falk |
|---|---|
| Affiliation: | Institut National de la Recherche Scientifique (INRS-EMT), Université du Québec, Montreal, QC H5A 1K6, Canada |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2023-11-01 |
| Series: | Applied Sciences |
| ISSN: | 2076-3417 |
| DOI: | 10.3390/app132312571 |
| Subjects: | automatic speech recognition; knowledge distillation; self-supervised learning; modulation spectrum; context awareness |
| Online Access: | https://www.mdpi.com/2076-3417/13/23/12571 |
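The abstract's third innovation, a real-time environment estimator that routes each utterance to a condition-specific compressed model at inference time, can be illustrated with a minimal sketch. Note that the paper relies on a modulation-spectrum-based estimator; the energy-based SNR proxy, the thresholds, and the model names below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' implementation): environment-aware
# selection among compressed ASR models at inference time. A crude energy-based
# SNR proxy stands in for the paper's modulation-spectrum environment estimator.
import numpy as np

def estimate_snr_db(waveform: np.ndarray, frame_len: int = 400) -> float:
    """Rough SNR proxy: loudest-frame energy vs. quietest-frame energy, in dB."""
    n_frames = len(waveform) // frame_len
    if n_frames < 2:
        return 30.0  # too short to estimate; assume clean
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sort(frames.var(axis=1) + 1e-12)
    noise_floor = energies[: max(1, n_frames // 10)].mean()    # quietest 10% ~ noise
    speech_level = energies[-max(1, n_frames // 10):].mean()   # loudest 10% ~ speech
    return 10.0 * np.log10(speech_level / noise_floor)

def select_model(waveform: np.ndarray, models: dict):
    """Pick the distilled model trained for the estimated acoustic condition."""
    snr_db = estimate_snr_db(waveform)
    if snr_db < 5.0:        # extremely noisy
        return models["noisy"]
    elif snr_db < 15.0:     # moderately degraded
        return models["mixed"]
    return models["clean"]  # clean / matched conditions

# Hypothetical usage, assuming three compressed models are already loaded:
# model = select_model(audio, {"clean": m_clean, "mixed": m_mixed, "noisy": m_noisy})
# transcript = model.transcribe(audio)
```

In the paper, the estimator also distinguishes reverberant conditions; in practice the decision thresholds would be calibrated on held-out data for each deployment environment.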