Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement

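Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data is, though, much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent well both speech and noise components of the noisy signals. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e. RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method.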

Bibliographic Details
Main Authors:	Katerina Zmolikova, Michael Syskind Pedersen, Jesper Jensen
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Open Journal of Signal Processing
Subjects:	Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation
Online Access:	https://ieeexplore.ieee.org/document/10360251/

_version_	1827390756885626880
author	Katerina Zmolikova; Michael Syskind Pedersen; Jesper Jensen
author_facet	Katerina Zmolikova; Michael Syskind Pedersen; Jesper Jensen
author_sort	Katerina Zmolikova
collection	DOAJ
description	Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data is, though, much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent well both speech and noise components of the noisy signals. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e. RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method.
first_indexed	2024-03-08T16:57:24Z
format	Article
id	doaj.art-f88a0b968a154c59b6cdf281db57c9cc
institution	Directory Open Access Journal
issn	2644-1322
language	English
last_indexed	2024-03-08T16:57:24Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Open Journal of Signal Processing
spelling	doaj.art-f88a0b968a154c59b6cdf281db57c9cc | 2024-01-05T00:05:03Z | eng | IEEE | IEEE Open Journal of Signal Processing | 2644-1322 | 2024-01-01 | 5 | 274 | 283 | 10.1109/OJSP.2023.3343343 | 10360251 | Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement | Katerina Zmolikova (0), https://orcid.org/0000-0003-4438-8580 | Michael Syskind Pedersen (1), https://orcid.org/0000-0002-2202-3583 | Jesper Jensen (2), https://orcid.org/0000-0003-1478-622X | (0) Demant A/S, Smorum, Denmark | (1) Demant A/S, Smorum, Denmark | (2) Demant A/S, Smorum, Denmark | Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data is, though, much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent well both speech and noise components of the noisy signals. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e. RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method. | https://ieeexplore.ieee.org/document/10360251/ | Masked spectrogram prediction | speech enhancement | unsupervised domain adaptation
spellingShingle	Katerina Zmolikova | Michael Syskind Pedersen | Jesper Jensen | Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement | IEEE Open Journal of Signal Processing | Masked spectrogram prediction | speech enhancement | unsupervised domain adaptation
title	Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement
title_sort	masked spectrogram prediction for unsupervised domain adaptation in speech enhancement
topic	Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation
url	https://ieeexplore.ieee.org/document/10360251/
work_keys_str_mv	AT katerinazmolikova maskedspectrogrampredictionforunsuperviseddomainadaptationinspeechenhancement AT michaelsyskindpedersen maskedspectrogrampredictionforunsuperviseddomainadaptationinspeechenhancement AT jesperjensen maskedspectrogrampredictionforunsuperviseddomainadaptationinspeechenhancement

Similar Items