Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement
Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in rea...
Main Authors: | Katerina Zmolikova, Michael Syskind Pedersen, Jesper Jensen |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Open Journal of Signal Processing |
Subjects: | Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation |
Online Access: | https://ieeexplore.ieee.org/document/10360251/ |
_version_ | 1827390756885626880 |
---|---|
author | Katerina Zmolikova; Michael Syskind Pedersen; Jesper Jensen |
author_sort | Katerina Zmolikova |
collection | DOAJ |
description | Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e. situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data is, though, much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent well both speech and noise components of the noisy signals. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e. RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method. |
first_indexed | 2024-03-08T16:57:24Z |
format | Article |
id | doaj.art-f88a0b968a154c59b6cdf281db57c9cc |
institution | Directory Open Access Journal |
issn | 2644-1322 |
language | English |
last_indexed | 2024-03-08T16:57:24Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Signal Processing |
spelling | doaj.art-f88a0b968a154c59b6cdf281db57c9cc; 2024-01-05T00:05:03Z; eng; IEEE; IEEE Open Journal of Signal Processing; 2644-1322; 2024-01-01; vol. 5, pp. 274-283; 10.1109/OJSP.2023.3343343; 10360251; Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement; Katerina Zmolikova (https://orcid.org/0000-0003-4438-8580), Michael Syskind Pedersen (https://orcid.org/0000-0002-2202-3583), Jesper Jensen (https://orcid.org/0000-0003-1478-622X); Demant A/S, Smorum, Denmark; https://ieeexplore.ieee.org/document/10360251/; Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation |
title | Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement |
topic | Masked spectrogram prediction; speech enhancement; unsupervised domain adaptation |
url | https://ieeexplore.ieee.org/document/10360251/ |