Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal....
Main Authors: | Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-01-01 |
Series: | Sensors |
Subjects: | joint training; speech enhancement; intent classification |
Online Access: | https://www.mdpi.com/1424-8220/22/1/374 |
_version_ | 1827667561774317568 |
---|---|
author | Mohamed Nabih Ali Daniele Falavigna Alessio Brutti |
author_facet | Mohamed Nabih Ali Daniele Falavigna Alessio Brutti |
author_sort | Mohamed Nabih Ali |
collection | DOAJ |
description | Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases speech quality from an intelligibility perspective, it tends to introduce distortions which deteriorate the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, which optimize a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches, which rely on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shedding light on and providing insight into the most promising training approaches. |
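The combined loss function mentioned in the description can be sketched as a weighted sum of an enhancement objective on the raw waveform and a classification objective on the intent logits. The sketch below is a plain-Python illustration under assumptions of our own: the function name `combined_loss`, the weighting parameter `alpha`, and the choice of time-domain MSE plus cross-entropy are illustrative, not the paper's exact formulation.

```python
import math

def combined_loss(enhanced, clean, class_logits, target_class, alpha=0.5):
    """Weighted sum of an enhancement loss and a classification loss.

    Illustrative sketch: time-domain MSE between the enhanced and clean
    waveforms, plus cross-entropy on the intent logits. `alpha` balances
    the two objectives; the 0.5 default is an assumption.
    """
    # Enhancement term: mean squared error on raw waveform samples.
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    # Classification term: cross-entropy via a numerically stable log-sum-exp.
    m = max(class_logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in class_logits))
    ce = log_z - class_logits[target_class]
    return alpha * mse + (1.0 - alpha) * ce
```

Setting `alpha` to 1.0 recovers enhancement-only training, while 0.0 recovers classifier-only training; intermediate values let the back-end gradient guide the front-end, which is the point of the joint strategies studied in the paper.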
first_indexed | 2024-03-10T03:20:03Z |
format | Article |
id | doaj.art-5084c273c0bb427aabe1d176ba50ec46 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
language | English |
last_indexed | 2024-03-10T03:20:03Z |
publishDate | 2022-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
spelling | doaj.art-5084c273c0bb427aabe1d176ba50ec46 (indexed 2023-11-23T12:21:19Z) |
citation | Sensors 22(1): 374, 2022-01-01. DOI: 10.3390/s22010374 |
affiliations | Mohamed Nabih Ali: Information Engineering and Computer Science School, University of Trento, 38121 Trento, Italy; Daniele Falavigna: Fondazione Bruno Kessler, 38121 Trento, Italy; Alessio Brutti: Fondazione Bruno Kessler, 38121 Trento, Italy |
spellingShingle | Mohamed Nabih Ali Daniele Falavigna Alessio Brutti Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models Sensors joint training speech enhancement intent classification |
title | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_full | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_fullStr | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_full_unstemmed | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_short | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_sort | time domain joint training strategies of speech enhancement and intent classification neural models |
topic | joint training speech enhancement intent classification |
url | https://www.mdpi.com/1424-8220/22/1/374 |
work_keys_str_mv | AT mohamednabihali timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels AT danielefalavigna timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels AT alessiobrutti timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels |