Towards audio codec-based speech separation

Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose the novel task of audio codec-based SS, in which SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MACs while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.
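The abstract describes performing separation on the compressed embedding sequence of a neural audio codec rather than on raw waveforms. As a rough, hypothetical sketch of that data flow (the actual Codecformer and NAC are learned networks; here a random linear projection stands in for the codec and fixed dummy masks for the separator), the pipeline might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def codec_encode(wav, W):
    # Frame the waveform and project each frame into a lower-dimensional
    # embedding -- a stand-in for a NAC encoder's compressed representation.
    frames = wav.reshape(-1, W.shape[0])          # (T, hop)
    return frames @ W                             # (T, dim), dim < hop

def codec_decode(emb, W):
    # Map embeddings back to waveform frames via the pseudo-inverse
    # projection -- a stand-in for a NAC decoder.
    return (emb @ np.linalg.pinv(W)).reshape(-1)  # (T * hop,)

def separate(emb, n_spk=2):
    # A real separator would predict one mask per speaker over the
    # embedding sequence; dummy uniform masks just illustrate the shape.
    masks = [np.full(emb.shape, 1.0 / n_spk) for _ in range(n_spk)]
    return [m * emb for m in masks]

hop, dim = 64, 32
W = rng.standard_normal((hop, dim))
mix = rng.standard_normal(hop * 50)               # 50 frames of "mixture"

emb = codec_encode(mix, W)                        # separation happens here,
est = [codec_decode(e, W) for e in separate(emb)] # in the compressed domain
```

The compute saving claimed in the abstract comes from the separator operating on the short embedding sequence (here 50 frames of 32 dims) instead of the full-rate waveform.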


Bibliographic Details
Main Authors: Yip, Jia Qi, Zhao, Shengkui, Ng, Dianwen, Chng, Eng Siong, Ma, Bin
Other Authors: Interdisciplinary Graduate School (IGS)
Format: Conference Paper
Language: English
Published: 2024
Subjects: Computer and Information Science; Speech separation; Audio codec; Resource efficient; Neural audio compression
Online Access:https://hdl.handle.net/10356/178451
https://interspeech2024.org/
Institution: Nanyang Technological University
Conference: Interspeech 2024
Project: Alibaba-NTU Singapore JRI
Citation: Yip, J. Q., Zhao, S., Ng, D., Chng, E. S. & Ma, B. (2024). Towards audio codec-based speech separation. Interspeech 2024, 2190-2194.
DOI: 10.21437/Interspeech.2024
ISSN: 2958-1796
Version: Submitted/Accepted version
Funding: This research is supported by the RIE2025 Industry Alignment Fund–Industry Collaboration Projects (IAF-ICP) (Award I2301E0026), administered by A*STAR, as well as supported by Alibaba Group and NTU Singapore.
Rights: © 2024 ISCA (International Speech Communication Association). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://www.isca-archive.org/index.html.