Channel and temporal-frequency attention UNet for monaural speech enhancement
Abstract: The presence of noise and reverberation significantly impedes speech clarity and intelligibility. To mitigate these effects, numerous deep learning-based network models have been proposed for speech enhancement tasks aimed at improving speech quality. In this study, we propose a monaural speech enhancement model called the channel and temporal-frequency attention UNet (CTFUNet). CTFUNet takes the noisy spectrum as input and produces a complex ideal ratio mask (cIRM) as output. To improve the speech enhancement performance of CTFUNet, we employ multi-scale temporal-frequency processing to extract input speech spectrum features. We also utilize multi-conv head channel attention and residual channel attention to capture temporal-frequency and channel features. Moreover, we introduce the channel temporal-frequency skip connection to alleviate information loss between down-sampling and up-sampling. On the blind test set of the first deep noise suppression challenge, the proposed CTFUNet delivers better denoising performance than both the challenge champions and more recent models. Furthermore, our model outperforms recent models such as Uformer and MTFAA in both denoising and dereverberation performance.
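The cIRM named in the abstract is a standard training target in the speech-enhancement literature; this record does not define it, but conventionally, for a noisy STFT $Y = Y_r + jY_i$ and a clean STFT $S = S_r + jS_i$, the mask $M = M_r + jM_i$ is the complex ratio satisfying $S = M \cdot Y$ under complex multiplication:

$$
M_r = \frac{Y_r S_r + Y_i S_i}{Y_r^2 + Y_i^2}, \qquad
M_i = \frac{Y_r S_i - Y_i S_r}{Y_r^2 + Y_i^2}.
$$

Applying the estimated mask then recovers the enhanced spectrum as $\hat{S} = (M_r Y_r - M_i Y_i) + j(M_r Y_i + M_i Y_r)$.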
Main Authors: | Shiyun Xu, Zehua Zhang, Mingjiang Wang |
---|---|
Affiliation: | Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology |
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2023-08-01 |
Series: | EURASIP Journal on Audio, Speech, and Music Processing |
ISSN: | 1687-4722 |
Subjects: | Speech enhancement; Neural network; Denoising; Dereverberation |
DOAJ Record: | doaj.art-aa821468bacd4c88914b46d2a6f15c99 |
Online Access: | https://doi.org/10.1186/s13636-023-00295-6 |
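The record does not include the authors' implementation. As a rough illustration only, a squeeze-and-excitation-style channel attention block with a residual connection — a common construction that the abstract's "residual channel attention" plausibly resembles — can be sketched in PyTorch as below. All names and hyperparameters here are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualChannelAttention(nn.Module):
    """Hypothetical sketch: squeeze-and-excitation channel attention with a
    residual connection, applied to a (batch, channels, time, frequency)
    feature map. Not the paper's actual block."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze over time and frequency
        self.fc = nn.Sequential(                 # excitation: per-channel gates in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x + x * w                         # residual: re-weighted features added back

# Usage: gate a 64-channel temporal-frequency feature map
block = ResidualChannelAttention(64)
y = block(torch.randn(2, 64, 100, 257))          # (batch, channels, frames, bins)
print(y.shape)                                   # torch.Size([2, 64, 100, 257])
```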