Channel and temporal-frequency attention UNet for monaural speech enhancement

Abstract The presence of noise and reverberation significantly impedes speech clarity and intelligibility. To mitigate these effects, numerous deep learning-based network models have been proposed for speech enhancement tasks aimed at improving speech quality. In this study, we propose a monaural speech enhancement model called the channel and temporal-frequency attention UNet (CTFUNet). CTFUNet takes the noisy spectrum as input and produces a complex ideal ratio mask (cIRM) as output. To improve the speech enhancement performance of CTFUNet, we employ multi-scale temporal-frequency processing to extract input speech spectrum features. We also utilize multi-conv head channel attention and residual channel attention to capture temporal-frequency and channel features. Moreover, we introduce the channel temporal-frequency skip connection to alleviate information loss between down-sampling and up-sampling. On the blind test set of the first deep noise suppression challenge, our proposed CTFUNet achieves better denoising performance than the challenge's winning models and more recent models. Furthermore, our model outperforms recent models such as Uformer and MTFAA in both denoising and dereverberation performance.
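The abstract states that the model's training target is a complex ideal ratio mask (cIRM), i.e., a complex-valued mask M such that multiplying it with the noisy spectrum recovers the clean spectrum. As background only (this sketch follows the standard cIRM definition, not code from the paper; the variable names `Y` and `S` for the noisy and clean STFT spectra are illustrative assumptions):

```python
import numpy as np

def cirm(Y: np.ndarray, S: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Complex ideal ratio mask: M with M * Y ~= S, element-wise.

    Y, S: complex STFT spectra of the noisy and clean signals.
    eps: small constant to avoid division by zero in silent bins.
    """
    Yr, Yi = Y.real, Y.imag
    Sr, Si = S.real, S.imag
    denom = Yr**2 + Yi**2 + eps          # |Y|^2
    # M = S * conj(Y) / |Y|^2, written out in real/imaginary parts
    Mr = (Yr * Sr + Yi * Si) / denom
    Mi = (Yr * Si - Yi * Sr) / denom
    return Mr + 1j * Mi
```

At inference time a network predicting such a mask enhances speech via `S_hat = M_pred * Y`, followed by an inverse STFT.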

Bibliographic Details
Main Authors: Shiyun Xu, Zehua Zhang, Mingjiang Wang
Affiliation: Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology
Format: Article
Language: English
Published: SpringerOpen, 2023-08-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
ISSN: 1687-4722
Subjects: Speech enhancement; Neural network; Denoising; Dereverberation
Online Access: https://doi.org/10.1186/s13636-023-00295-6