Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network
The Conformer has shown impressive performance for speech enhancement by exploiting local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-Mixer and gMLP have demonstrated co...
Main Authors: | Hyungchan Song, Minseung Kim, Jong Won Shin |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2022-01-01 |
Series: | IEEE Access |
Subjects: | Speech enhancement; local and global information; low computational complexity |
Online Access: | https://ieeexplore.ieee.org/document/9945958/ |
_version_ | 1798017635643817984 |
---|---|
author | Hyungchan Song Minseung Kim Jong Won Shin |
author_facet | Hyungchan Song Minseung Kim Jong Won Shin |
author_sort | Hyungchan Song |
collection | DOAJ |
description | The Conformer has shown impressive performance for speech enhancement by exploiting local and global contextual information, although it requires high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-Mixer and gMLP have demonstrated comparable performance with much less computational complexity in the computer vision area. These models showed that all-MLP architectures may perform as well as more advanced structures, but the nature of the MLP limits the application of these architectures to variable-length inputs such as speech and audio. In this paper, we propose the cgMLP-SE model, a gMLP-based architecture with convolutional token-mixing modules and a squeeze-and-excitation network that utilizes both local and global contextual information, as in the Conformer. Specifically, the token-mixing modules in gMLP are replaced by convolutional layers, squeeze-and-excitation network-based gating is applied on top of the convolutional gating module, and additional feed-forward layers are added to make the cgMLP-SE module a macaron-like structure sandwiched between feed-forward layers, like a Conformer block. Experimental results on the TIMIT-DNS noise dataset and the Voice Bank-DEMAND dataset showed that the proposed method achieved speech quality and intelligibility similar to those of the Conformer with a smaller model size and lower computational complexity. |
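The description above outlines the cgMLP-SE block: a macaron structure of two half-step feed-forward layers around a convolutional token-mixing module with squeeze-and-excitation (SE) gating on top of the convolutional gating. A minimal NumPy sketch of one block's forward pass is given below; the layer sizes, depthwise kernel width (3), activations, and parameter names here are hypothetical illustrations, not the paper's actual configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) axis.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv1d(x, kernel):
    # x: (T, C), kernel: (K, C); per-channel "same" convolution along time.
    K, _ = kernel.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + K] * kernel).sum(axis=0)
    return out

def se_gate(x, w1, w2):
    # Squeeze-and-excitation: average-pool over time (global context),
    # two fully connected layers, sigmoid channel-wise rescaling.
    z = x.mean(axis=0)                                          # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ w1, 0.0) @ w2)))   # excite
    return x * s                                                # broadcast scale

def cgmlp_se_block(x, p):
    # Macaron structure: half-step FFN, conv token mixing + SE gating, half-step FFN.
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff1_w"], 0.0) @ p["ff1_v"])
    u = layer_norm(x) @ p["proj_in"]            # channel expansion
    a, b = np.split(u, 2, axis=-1)              # split for the gating branch
    b = depthwise_conv1d(b, p["dw_kernel"])     # convolutional token mixing (local)
    b = se_gate(b, p["se_w1"], p["se_w2"])      # SE gating on top of conv gating (global)
    x = x + (a * b) @ p["proj_out"]             # gated output projection
    x = x + 0.5 * (np.maximum(layer_norm(x) @ p["ff2_w"], 0.0) @ p["ff2_v"])
    return x
```

Because token mixing is a convolution and the SE path pools over time, the block accepts any sequence length T, which is the property the abstract highlights as missing from plain gMLP's fixed-size spatial projections.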
first_indexed | 2024-04-11T16:10:30Z |
format | Article |
id | doaj.art-29f4768d2a2c47e89fdc5742d854d6b4 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-11T16:10:30Z |
publishDate | 2022-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-29f4768d2a2c47e89fdc5742d854d6b4 | 2022-12-22T04:14:42Z | eng | IEEE | IEEE Access | 2169-3536 | 2022-01-01 | vol. 10, pp. 119283–119289 | doi:10.1109/ACCESS.2022.3221440 | document 9945958 | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network | Hyungchan Song (https://orcid.org/0000-0001-7847-1118), Minseung Kim (https://orcid.org/0000-0002-2270-9382), Jong Won Shin (https://orcid.org/0000-0002-8910-0264) | School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea | [abstract as in the description field] | https://ieeexplore.ieee.org/document/9945958/ | Speech enhancement; local and global information; low computational complexity |
spellingShingle | Hyungchan Song Minseung Kim Jong Won Shin Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network IEEE Access Speech enhancement local and global information low computational complexity |
title | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_full | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_fullStr | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_full_unstemmed | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_short | Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network |
title_sort | speech enhancement using mlp based architecture with convolutional token mixing module and squeeze and excitation network |
topic | Speech enhancement local and global information low computational complexity |
url | https://ieeexplore.ieee.org/document/9945958/ |
work_keys_str_mv | AT hyungchansong speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork AT minseungkim speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork AT jongwonshin speechenhancementusingmlpbasedarchitecturewithconvolutionaltokenmixingmoduleandsqueezeandexcitationnetwork |