Towards audio codec-based speech separation

Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose the novel task of audio codec-based SS, in which SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MACs while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.
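The abstract describes performing separation on the compressed embedding sequence of a neural audio codec rather than on raw waveforms. As a rough, hypothetical sketch of that data flow (the actual Codecformer and NAC are learned networks; here a random linear projection stands in for the codec and fixed dummy masks for the separator), the pipeline might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def codec_encode(wav, W):
    # Frame the waveform and project each frame into a lower-dimensional
    # embedding -- a stand-in for a NAC encoder's compressed representation.
    frames = wav.reshape(-1, W.shape[0])          # (T, hop)
    return frames @ W                             # (T, dim), dim < hop

def codec_decode(emb, W):
    # Map embeddings back to waveform frames via the pseudo-inverse
    # projection -- a stand-in for a NAC decoder.
    return (emb @ np.linalg.pinv(W)).reshape(-1)  # (T * hop,)

def separate(emb, n_spk=2):
    # A real separator would predict one mask per speaker over the
    # embedding sequence; dummy uniform masks just illustrate the shape.
    masks = [np.full(emb.shape, 1.0 / n_spk) for _ in range(n_spk)]
    return [m * emb for m in masks]

hop, dim = 64, 32
W = rng.standard_normal((hop, dim))
mix = rng.standard_normal(hop * 50)               # 50 frames of "mixture"

emb = codec_encode(mix, W)                        # separation happens here,
est = [codec_decode(e, W) for e in separate(emb)] # in the compressed domain
```

The compute saving claimed in the abstract comes from the separator operating on the short embedding sequence (here 50 frames of 32 dims) instead of the full-rate waveform.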


Bibliographic Details
Main Authors: Yip, Jia Qi, Zhao, Shengkui, Ng, Dianwen, Chng, Eng Siong, Ma, Bin
Other Authors: Interdisciplinary Graduate School (IGS)
Format: Conference Paper
Language: English
Published: 2024
Subjects: Computer and Information Science; Speech separation; Audio codec; Resource efficient; Neural audio compression
Online Access:https://hdl.handle.net/10356/178451
https://interspeech2024.org/
Institution: Nanyang Technological University
Conference: Interspeech 2024
Project: Alibaba-NTU Singapore JRI
Citation: Yip, J. Q., Zhao, S., Ng, D., Chng, E. S. & Ma, B. (2024). Towards audio codec-based speech separation. Interspeech 2024, 2190-2194.
DOI: 10.21437/Interspeech.2024
ISSN: 2958-1796
Version: Submitted/Accepted version
Funding: This research is supported by the RIE2025 Industry Alignment Fund–Industry Collaboration Projects (IAF-ICP) (Award I2301E0026), administered by A*STAR, as well as supported by Alibaba Group and NTU Singapore.
Rights: © 2024 ISCA (International Speech Communication Association). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://www.isca-archive.org/index.html.