Towards audio codec-based speech separation
Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can...
Main Authors: | Yip, Jia Qi; Zhao, Shengkui; Ng, Dianwen; Chng, Eng Siong; Ma, Bin |
Other Authors: | Interdisciplinary Graduate School (IGS) |
Format: | Conference Paper |
Language: | English |
Published: | 2024 |
Subjects: | Computer and Information Science; Speech separation; Audio codec; Resource efficient; Neural audio compression |
Online Access: | https://hdl.handle.net/10356/178451 https://interspeech2024.org/ |
_version_ | 1826112183765827584 |
author | Yip, Jia Qi Zhao, Shengkui Ng, Dianwen Chng, Eng Siong Ma, Bin |
author2 | Interdisciplinary Graduate School (IGS) |
author_facet | Interdisciplinary Graduate School (IGS) Yip, Jia Qi Zhao, Shengkui Ng, Dianwen Chng, Eng Siong Ma, Bin |
author_sort | Yip, Jia Qi |
collection | NTU |
description | Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose the novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and introduce a new model, Codecformer, to address it. At inference, Codecformer achieves a 52x reduction in MACs while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios. |
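The description above can be sketched at a high level: a frozen codec encoder maps the mixture waveform to a compact embedding sequence, separation is performed on those embeddings rather than on the waveform, and the codec decoder reconstructs one waveform per speaker. The sketch below is purely illustrative — the random-matrix "encoder", "decoder", and sigmoid-mask "separator", along with the hop size and dimensions, are hypothetical stand-ins, not the actual Codecformer or any real NAC.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 100      # embedding dim and number of codec frames (hypothetical)
HOP = 160          # samples per codec frame (hypothetical)

W_enc = rng.standard_normal((D, HOP)) * 0.1   # stand-in for a frozen codec encoder
W_dec = rng.standard_normal((HOP, D)) * 0.1   # stand-in for the codec decoder

def encode(wav):
    # Frame the waveform and project each frame into the codec embedding space.
    frames = wav.reshape(T, HOP)              # (T, HOP)
    return frames @ W_enc.T                   # (T, D)

def separate(emb, n_spk=2):
    # Toy separator: apply a per-speaker sigmoid mask to the embedding sequence.
    # Codecformer's actual separator is a trained network; this is only shape-level.
    masks = [1.0 / (1.0 + np.exp(-(emb @ rng.standard_normal((D, D)))))
             for _ in range(n_spk)]
    return [emb * m for m in masks]

def decode(emb):
    # Project masked embeddings back to waveform samples.
    return (emb @ W_dec.T).reshape(-1)        # (T * HOP,)

mixture = rng.standard_normal(T * HOP)
streams = [decode(e) for e in separate(encode(mixture))]
print(len(streams), streams[0].shape)
```

The point of the sketch is the data flow: because the separator only ever sees the (T, D) embedding sequence instead of the raw 16 kHz waveform, its compute scales with the codec frame rate, which is where the paper's reported MAC reduction comes from.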
first_indexed | 2024-10-01T03:02:49Z |
format | Conference Paper |
id | ntu-10356/178451 |
institution | Nanyang Technological University |
language | English |
last_indexed | 2024-10-01T03:02:49Z |
publishDate | 2024 |
record_format | dspace |
spelling | ntu-10356/1784512024-09-16T02:11:51Z Towards audio codec-based speech separation Yip, Jia Qi Zhao, Shengkui Ng, Dianwen Chng, Eng Siong Ma, Bin Interdisciplinary Graduate School (IGS) Interspeech 2024 Alibaba-NTU Singapore JRI Computer and Information Science Speech separation Audio codec Resource efficient Neural audio compression Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose a novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MAC while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios. Agency for Science, Technology and Research (A*STAR) Nanyang Technological University Submitted/Accepted version This research is supported by the RIE2025 Industry Alignment Fund–Industry Collaboration Projects (IAF-ICP) (Award I2301E0026), administered by A*STAR, as well as supported by Alibaba Group and NTU Singapore. 2024-06-20T06:54:28Z 2024-06-20T06:54:28Z 2024 Conference Paper Yip, J. Q., Zhao, S., Ng, D., Chng, E. S. & Ma, B. (2024). Towards audio codec-based speech separation. Interspeech 2024, 2190-2194. 
https://dx.doi.org/10.21437/Interspeech.2024 2958-1796 https://hdl.handle.net/10356/178451 10.21437/Interspeech.2024 https://interspeech2024.org/ 2190 2194 en I2301E0026 © 2024 ISCA (International Speech Communication Association). All rights reserved. This article may be downloaded for personal use only. Any other use requires prior permission of the copyright holder. The Version of Record is available online at https://www.isca-archive.org/index.html. application/pdf |
spellingShingle | Computer and Information Science Speech separation Audio codec Resource efficient Neural audio compression Yip, Jia Qi Zhao, Shengkui Ng, Dianwen Chng, Eng Siong Ma, Bin Towards audio codec-based speech separation |
title | Towards audio codec-based speech separation |
title_full | Towards audio codec-based speech separation |
title_fullStr | Towards audio codec-based speech separation |
title_full_unstemmed | Towards audio codec-based speech separation |
title_short | Towards audio codec-based speech separation |
title_sort | towards audio codec based speech separation |
topic | Computer and Information Science Speech separation Audio codec Resource efficient Neural audio compression |
url | https://hdl.handle.net/10356/178451 https://interspeech2024.org/ |
work_keys_str_mv | AT yipjiaqi towardsaudiocodecbasedspeechseparation AT zhaoshengkui towardsaudiocodecbasedspeechseparation AT ngdianwen towardsaudiocodecbasedspeechseparation AT chngengsiong towardsaudiocodecbasedspeechseparation AT mabin towardsaudiocodecbasedspeechseparation |