Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement

Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in rea...

Bibliographic Details
Main Authors: Katerina Zmolikova, Michael Syskind Pedersen, Jesper Jensen
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Open Journal of Signal Processing
Subjects: Masked spectrogram prediction, speech enhancement, unsupervised domain adaptation
Online Access: https://ieeexplore.ieee.org/document/10360251/
author Katerina Zmolikova
Michael Syskind Pedersen
Jesper Jensen
collection DOAJ
description Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data is, though, much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent well both speech and noise components of the noisy signals. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e. RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method.
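The masked spectrogram prediction idea mentioned in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the mask ratio, the zero-filling of hidden bins, and the `mean_fill_predictor` stand-in (used here in place of a neural network) are all assumptions chosen for illustration. The sketch only shows the general pattern of hiding regions of a spectrogram and scoring a prediction on the hidden regions alone.

```python
import numpy as np


def random_patch_mask(n_freq, n_frames, patch=(8, 8), mask_ratio=0.5, rng=None):
    """Boolean mask (True = hidden) made of rectangular time-frequency patches.

    Patch size and mask ratio are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    pf, pt = patch
    grid_f = -(-n_freq // pf)    # ceiling division
    grid_t = -(-n_frames // pt)
    n_patches = grid_f * grid_t
    n_masked = int(round(mask_ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_f, grid_t)
    # Expand each grid cell to a full patch, then crop to the spectrogram size.
    full = grid.repeat(pf, axis=0).repeat(pt, axis=1)
    return full[:n_freq, :n_frames]


def mean_fill_predictor(masked_spec, mask):
    """Hypothetical stand-in for a network: fill hidden bins with the
    per-frequency mean of the visible bins."""
    visible = ~mask
    counts = visible.sum(axis=1, keepdims=True)
    sums = np.where(visible, masked_spec, 0.0).sum(axis=1, keepdims=True)
    row_mean = sums / np.maximum(counts, 1)
    return np.where(mask, row_mean, masked_spec)


def masked_mse(prediction, target, mask):
    """Mean squared error computed only over the hidden bins."""
    if not mask.any():
        raise ValueError("mask hides no bins")
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))


# Toy example on a random "spectrogram" (frequency bins x frames).
rng = np.random.default_rng(0)
spec = rng.random((64, 100))
mask = random_patch_mask(64, 100, rng=rng)
masked_input = np.where(mask, 0.0, spec)   # hide the masked bins
prediction = mean_fill_predictor(masked_input, mask)
loss = masked_mse(prediction, spec, mask)
```

In a real system the stand-in predictor would be a trainable model, and the loss above would be minimized over noisy-only recordings; how that objective is combined with supervised enhancement training is described in the article itself, not here.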
first_indexed 2024-03-08T16:57:24Z
format Article
id doaj.art-f88a0b968a154c59b6cdf281db57c9cc
institution Directory Open Access Journal
issn 2644-1322
language English
last_indexed 2024-03-08T16:57:24Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Open Journal of Signal Processing
spelling doaj.art-f88a0b968a154c59b6cdf281db57c9cc
2024-01-05T00:05:03Z
eng
IEEE
IEEE Open Journal of Signal Processing
ISSN 2644-1322
2024-01-01
Volume 5, pages 274-283
DOI 10.1109/OJSP.2023.3343343
IEEE document 10360251
Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement
Katerina Zmolikova (https://orcid.org/0000-0003-4438-8580), Demant A/S, Smorum, Denmark
Michael Syskind Pedersen (https://orcid.org/0000-0002-2202-3583), Demant A/S, Smorum, Denmark
Jesper Jensen (https://orcid.org/0000-0003-1478-622X), Demant A/S, Smorum, Denmark
https://ieeexplore.ieee.org/document/10360251/
Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation
title Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement
topic Masked spectrogram prediction
speech enhancement
unsupervised domain adaptation
url https://ieeexplore.ieee.org/document/10360251/