Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network
The Conformer has shown impressive performance for speech enhancement by exploiting local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-Mixer and gMLP have demonstrated co...
Main Authors: | Hyungchan Song, Minseung Kim, Jong Won Shin |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2022-01-01 |
Series: | IEEE Access |
Subjects: | Speech enhancement; local and global information; low computational complexity |
Online Access: | https://ieeexplore.ieee.org/document/9945958/ |
_version_ | 1798017635643817984 |
---|---|
author | Hyungchan Song Minseung Kim Jong Won Shin |
author_facet | Hyungchan Song Minseung Kim Jong Won Shin |
author_sort | Hyungchan Song |
collection | DOAJ |
description | The Conformer has shown impressive performance for speech enhancement by exploiting local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-Mixer and gMLP have demonstrated comparable performance with much less computational complexity in the computer vision area. These models showed that all-MLP architectures may perform as well as more advanced structures, but the nature of the MLP limits the application of these architectures to variable-length inputs such as speech and audio. In this paper, we propose the cgMLP-SE model, a gMLP-based architecture with convolutional token-mixing modules and a squeeze-and-excitation network that utilizes both local and global contextual information, as in the Conformer. Specifically, the token-mixing modules in gMLP are replaced by convolutional layers, squeeze-and-excitation network-based gating is applied on top of the convolutional gating module, and additional feed-forward layers are added to make the cgMLP-SE module a macaron-like structure sandwiched between feed-forward layers, like a Conformer block. Experimental results on the TIMIT-DNS noise dataset and the Voice Bank-DEMAND dataset showed that the proposed method achieved speech quality and intelligibility similar to those of the Conformer with a smaller model size and lower computational complexity. |
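The description above outlines the cgMLP-SE block: a macaron structure of two half-step feed-forward layers around a convolutional token-mixing module with squeeze-and-excitation (SE) gating on top of the convolutional gating. A minimal NumPy sketch of one block's forward pass is given below; the layer sizes, depthwise kernel width (3), activations, and parameter names here are hypothetical illustrations, not the paper's actual configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) axis.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv1d(x, kernel):
    # x: (T, C), kernel: (K, C); per-channel "same" convolution along time.
    K, _ = kernel.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + K] * kernel).sum(axis=0)
    return out

def se_gate(x, w1, w2):
    # Squeeze-and-excitation: average-pool over time (global context),
    # two fully connected layers, sigmoid channel-wise rescaling.
    z = x.mean(axis=0)                                          # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))   # excite
    return x * s                                                # broadcast scale

def cgmlp_se_block(x, p):
    # Macaron structure: half-step FFN, conv token mixing + SE gating, half-step FFN.
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff1_w"], 0.0) @ p["ff1_v"])
    u = layer_norm(x) @ p["proj_in"]            # channel expansion
    a, b = np.split(u, 2, axis=-1)              # split for the gating branch
    b = depthwise_conv1d(b, p["dw_kernel"])     # convolutional token mixing (local)
    b = se_gate(b, p["se_w1"], p["se_w2"])      # SE gating on top of conv gating (global)
    x = x + (a * b) @ p["proj_out"]             # gated output projection
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff2_w"], 0.0) @ p["ff2_v"])
    return x
```

Because token mixing is a convolution and the SE path pools over time, the block accepts any sequence length T, which is the property the abstract highlights as missing from plain gMLP's fixed-size spatial projections.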
first_indexed | 2024-04-11T16:10:30Z |
format | Article |
id | doaj.art-29f4768d2a2c47e89fdc5742d854d6b4 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-11T16:10:30Z |
publishDate | 2022-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-29f4768d2a2c47e89fdc5742d854d6b4 | 2022-12-22T04:14:42Z | eng | IEEE | IEEE Access | 2169-3536 | 2022-01-01 | vol. 10, pp. 119283–119289 | doi:10.1109/ACCESS.2022.3221440 | document 9945958 | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network | Hyungchan Song (https://orcid.org/0000-0001-7847-1118), Minseung Kim (https://orcid.org/0000-0002-2270-9382), Jong Won Shin (https://orcid.org/0000-0002-8910-0264) | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea | [abstract as in the description field] | https://ieeexplore.ieee.org/document/9945958/ | Speech enhancement; local and global information; low computational complexity |
spellingShingle | Hyungchan Song Minseung Kim Jong Won Shin Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network IEEE Access Speech enhancement local and global information low computational complexity |
title | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_full | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_fullStr | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_full_unstemmed | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_short | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_sort | speech enhancement using mlp based architecture with convolutional token mixing module and squeeze and excitation network |
topic | Speech enhancement local and global information low computational complexity |
url | https://ieeexplore.ieee.org/document/9945958/ |
work_keys_str_mv | AT hyungchansong speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork AT minseungkim speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork AT jongwonshin speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork |