Speech Enhancement Using MLP-Based Architecture With Convolutional Token Mixing Module and Squeeze-and-Excitation Network

Bibliographic Details
Main Authors:	Hyungchan Song, Minseung Kim, Jong Won Shin
Format:	Article
Language:	English
Published:	IEEE 2022-01-01
Series:	IEEE Access
Subjects:	Speech enhancement; local and global information; low computational complexity
Online Access:	https://ieeexplore.ieee.org/document/9945958/

Description
Summary:	The Conformer has shown impressive performance for speech enhancement by exploiting both local and global contextual information, but it has high computational complexity and many parameters. Recently, multi-layer perceptron (MLP)-based models such as MLP-Mixer and gMLP have demonstrated comparable performance with much lower computational complexity in computer vision. These models showed that all-MLP architectures may perform as well as more advanced structures, but the nature of the MLP limits their application to variable-length inputs such as speech and audio. In this paper, we propose the cgMLP-SE model, a gMLP-based architecture with convolutional token-mixing modules and a squeeze-and-excitation network that utilizes both local and global contextual information, as in the Conformer. Specifically, the token-mixing modules in gMLP are replaced by convolutional layers, squeeze-and-excitation network-based gating is applied on top of the convolutional gating module, and additional feed-forward layers are added so that the cgMLP-SE module has a macaron-like structure sandwiched between feed-forward layers, like a Conformer block. Experimental results on the TIMIT-DNS noise dataset and the Voice Bank-DEMAND dataset showed that the proposed method achieved speech quality and intelligibility similar to the Conformer with a smaller model size and lower computational complexity.
ISSN:	2169-3536
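
Illustrative sketch (not from the paper): the summary describes replacing gMLP's token mixing with convolution and adding squeeze-and-excitation (SE) gating. The dependency-free Python below is one possible reading of that idea, shown only to make the variable-length point concrete. All function names, the depthwise kernel, the channel split, and the reduction ratio are assumptions; normalization and the surrounding macaron-style feed-forward layers are omitted.

```python
import math
import random


def depthwise_conv1d(x, kernels):
    """Mix information along time with one kernel per channel.

    x: list of T frames, each a list of C channel values.
    kernels: C lists of K taps. Zero padding keeps the output at T frames,
    so the same weights apply to any sequence length.
    """
    T, C = len(x), len(x[0])
    K = len(kernels[0])
    pad = K // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            acc = 0.0
            for k in range(K):
                s = t + k - pad
                if 0 <= s < T:
                    acc += kernels[c][k] * x[s][c]
            out[t][c] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def se_gate(x, w1, w2):
    """Squeeze-and-excitation: average over time, bottleneck, per-channel scale."""
    T, C = len(x), len(x[0])
    z = [sum(x[t][c] for t in range(T)) / T for c in range(C)]  # squeeze
    h = [max(0.0, sum(w1[j][c] * z[c] for c in range(C))) for j in range(len(w1))]
    s = [sigmoid(sum(w2[c][j] * h[j] for j in range(len(h)))) for c in range(C)]
    return [[x[t][c] * s[c] for c in range(C)] for t in range(T)]


def conv_gating_unit(x, kernels, w1, w2):
    """Split channels into u and v, mix v along time, gate u with it, then apply SE."""
    half = len(x[0]) // 2
    u = [frame[:half] for frame in x]
    v = depthwise_conv1d([frame[half:] for frame in x], kernels)
    gated = [[a * b for a, b in zip(fu, fv)] for fu, fv in zip(u, v)]
    return se_gate(gated, w1, w2)


def make_params(channels, kernel_size=3, reduction=2, seed=0):
    """Random illustrative weights; sizes here are assumptions, not the paper's."""
    rng = random.Random(seed)
    half = channels // 2
    hidden = max(1, half // reduction)
    kernels = [[rng.uniform(-1, 1) for _ in range(kernel_size)] for _ in range(half)]
    w1 = [[rng.uniform(-1, 1) for _ in range(half)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(half)]
    return kernels, w1, w2


if __name__ == "__main__":
    params = make_params(channels=8)
    for T in (5, 12):  # two different sequence lengths, same weights
        x = [[0.1 * (t + c) for c in range(8)] for t in range(T)]
        y = conv_gating_unit(x, *params)
        print(T, len(y), len(y[0]))
```

Because the convolution weights do not depend on the number of frames, the same parameters process both sequence lengths, which a fixed-size token-mixing matrix over the time axis could not do.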

Similar Items