Att-TasNet: Attending to Encodings in Time-Domain Audio Speech Separation of Noisy, Reverberant Speech Mixtures

Separation of speech mixtures in noisy and reverberant environments remains a challenging task for state-of-the-art speech separation systems. Time-domain audio speech separation networks (TasNets) are among the most commonly used network architectures for this task. TasNet models have demonstrated...

Bibliographic Details
Main Authors:	William Ravenscroft, Stefan Goetze, Thomas Hain
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2022-05-01
Series:	Frontiers in Signal Processing
Subjects:	tasnet speech separation speech enhancement encoder decoder attention
Online Access:	https://www.frontiersin.org/articles/10.3389/frsip.2022.856968/full

_version_	1828823076725325824
author	William Ravenscroft Stefan Goetze Thomas Hain
collection	DOAJ
description	Separation of speech mixtures in noisy and reverberant environments remains a challenging task for state-of-the-art speech separation systems. Time-domain audio speech separation networks (TasNets) are among the most commonly used network architectures for this task. TasNet models have demonstrated strong performance on typical speech separation baselines where speech is not contaminated with noise. When additive or convolutive noise is present, the performance of speech separation degrades significantly. TasNets are typically constructed of an encoder network, a mask estimation network and a decoder network. The design of these networks puts the majority of the onus for enhancing the signal on the mask estimation network when used without any pre-processing of the input data or post-processing of the separation network output data. Use of multihead attention (MHA) is proposed in this work as an additional layer in the encoder and decoder to help the separation network attend to encoded features that are relevant to the target speakers and conversely suppress noisy disturbances in the encoded features. As shown in this work, incorporating MHA mechanisms into the encoder network in particular leads to a consistent performance improvement across numerous quality and intelligibility metrics on a variety of acoustic conditions using the WHAMR corpus, a dataset of noisy reverberant speech mixtures. The use of MHA is also investigated in the decoder network, where it is demonstrated that smaller performance improvements are consistently gained within specific model configurations. The best performing MHA models yield a mean 0.6 dB scale-invariant signal-to-distortion ratio (SISDR) improvement on noisy reverberant mixtures over a baseline 1D convolution encoder. A mean 1 dB SISDR improvement is observed on clean speech mixtures.
first_indexed	2024-12-12T13:28:05Z
format	Article
id	doaj.art-02f17f0adaa14f3cbd21b78e7075c7b9
institution	Directory Open Access Journal
issn	2673-8198
language	English
last_indexed	2024-12-12T13:28:05Z
publishDate	2022-05-01
publisher	Frontiers Media S.A.
record_format	Article
series	Frontiers in Signal Processing
title	Att-TasNet: Attending to Encodings in Time-Domain Audio Speech Separation of Noisy, Reverberant Speech Mixtures
topic	tasnet speech separation speech enhancement encoder decoder attention
url	https://www.frontiersin.org/articles/10.3389/frsip.2022.856968/full
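
The abstract in this record reports its headline results as improvements in SISDR, measured in dB. For readers unfamiliar with the metric, the following is a minimal sketch of how a scale-invariant signal-to-distortion ratio is commonly computed. It is not taken from the article: the function name si_sdr, the eps stabiliser and the mean-removal step are illustrative assumptions, and the authors' evaluation code may differ.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Hypothetical helper for illustration; eps and the zero-mean step
    are assumptions, not details taken from the catalogued article.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean of each signal (a common convention, assumed here).
    mean_est = sum(estimate) / n
    mean_ref = sum(reference) / n
    est = [x - mean_est for x in estimate]
    ref = [x - mean_ref for x in reference]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever the scaled reference does not explain counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals: a sine "reference" and two estimates with different error levels.
reference = [math.sin(0.1 * i) for i in range(200)]
rough = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
clean = [r + 0.01 * math.cos(0.37 * i) for i, r in enumerate(reference)]

improvement_db = si_sdr(clean, reference) - si_sdr(rough, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to. An improvement figure such as those quoted in the abstract is a difference between two such scores.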
