Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network

The Conformer has shown impressive performance for speech enhancement by exploiting both local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as the MLP-Mixer and gMLP have demonstrated comparable performance with much lower computational complexity in the computer vision area. These models showed that all-MLP architectures may perform as well as more advanced structures, but the nature of the MLP limits the application of these architectures to variable-length inputs such as speech and audio. In this paper, we propose the cgMLP-SE model, a gMLP-based architecture with convolutional token-mixing modules and a squeeze-and-excitation network that utilizes both local and global contextual information, as the Conformer does. Specifically, the token-mixing modules in the gMLP are replaced by convolutional layers, squeeze-and-excitation network-based gating is applied on top of the convolutional gating module, and additional feed-forward layers are added so that the cgMLP-SE module becomes a macaron-like structure sandwiched by feed-forward layers like a Conformer block. Experimental results on the TIMIT-DNS noise dataset and the Voice Bank-DEMAND dataset showed that the proposed method achieved speech quality and intelligibility similar to those of the Conformer with a smaller model size and lower computational complexity.
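As a rough illustration of the block structure described in the abstract, below is a minimal PyTorch-style sketch of a cgMLP-SE-like module. This is not the authors' implementation: the layer sizes, kernel width, SE reduction ratio, half-step residual weights, and the exact placement of the SE gate are assumptions inferred from the abstract alone.

```python
# Hypothetical sketch of a cgMLP-SE-style block; inputs are (batch, time, channels).
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Squeeze-and-excitation: average over time, bottleneck MLP, sigmoid channel weights."""
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, T, C)
        s = x.mean(dim=1)                        # squeeze over the time axis -> (B, C)
        return x * self.fc(s).unsqueeze(1)       # excite: per-channel gate

class ConvGatingUnit(nn.Module):
    """gMLP spatial gating with the token-mixing linear layer replaced by a
    depthwise 1-D convolution along time, plus SE gating on top."""
    def __init__(self, dim, kernel_size=15):     # kernel width is an assumption
        super().__init__()
        half = dim // 2
        self.norm = nn.LayerNorm(half)
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        self.se = SEGate(half)

    def forward(self, x):                        # x: (B, T, dim)
        a, b = x.chunk(2, dim=-1)                # split channels for gating
        b = self.norm(b)
        b = self.conv(b.transpose(1, 2)).transpose(1, 2)  # convolutional token mixing
        b = self.se(b)                           # SE gating on the conv gating branch
        return a * b

class CgMLPSEBlock(nn.Module):
    """Macaron structure: half-step FFN, conv-gated MLP core with SE, half-step FFN."""
    def __init__(self, dim, expansion=4):        # expansion factor is an assumption
        super().__init__()
        hidden = dim * expansion
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.gate = ConvGatingUnit(hidden)
        self.proj_out = nn.Linear(hidden // 2, dim)

    def forward(self, x):                        # x: (B, T, dim)
        x = x + 0.5 * self.ffn1(x)               # first half-step feed-forward
        y = self.act(self.proj_in(self.norm(x)))
        x = x + self.proj_out(self.gate(y))      # cgMLP-SE core
        x = x + 0.5 * self.ffn2(x)               # second half-step feed-forward
        return x

# Example: a stack of such blocks processes spectral features of arbitrary length.
# block = CgMLPSEBlock(dim=256); out = block(torch.randn(2, 100, 256))
```

Note the design point the abstract highlights: because the depthwise convolution, unlike the original gMLP token-mixing matrix, does not fix the sequence dimension, such a block can handle variable-length speech and audio inputs.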


Bibliographic Details
Main Authors: Hyungchan Song (ORCID: 0000-0001-7847-1118), Minseung Kim (ORCID: 0000-0002-2270-9382), Jong Won Shin (ORCID: 0000-0002-8910-0264)
Affiliation: School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea
Format: Article
Language: English
Published: IEEE, 2022-01-01
Series: IEEE Access, vol. 10, pp. 119283-119289
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3221440
Subjects: Speech enhancement; local and global information; low computational complexity
Online Access: https://ieeexplore.ieee.org/document/9945958/