Improving Impulse Audio Source Separation using Generative Adversarial Networks for Phase Generation

This thesis explored separating impulse noise from a desired signal, for the purposes of hearing protection for soldiers and musicians. An evaluation of current techniques in source separation, such as matrix demixing methods (Independent Component Analysis, Independent Vector Analysis), and masking...

Bibliographic Details
Main Author: Piercy, Phoebe K.
Other Authors: Lang, Jeffrey H.
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/138956
_version_ 1811086806654910464
author Piercy, Phoebe K.
author2 Lang, Jeffrey H.
author_facet Lang, Jeffrey H.
Piercy, Phoebe K.
author_sort Piercy, Phoebe K.
collection MIT
description This thesis explored separating impulse noise from a desired signal, for the purposes of hearing protection for soldiers and musicians. An evaluation of current techniques in source separation, such as matrix demixing methods (Independent Component Analysis, Independent Vector Analysis) and masking methods (Ideal Ratio Mask, Ideal Binary Mask), amongst others, concluded that Time-Frequency masking of the noisy signal spectrogram was the best candidate audio separation method for dynamic soundscapes such as tactical fields and music. We followed with an experimental investigation of the role of phase in Time-Frequency masking, finding its importance to the intelligibility of speech to be paramount. In particular, the construction of a Complex Ideal Ratio Mask (cIRM), altering both magnitude and phase information in the spectrogram, was identified as the most promising method of impulse source separation, with separated speech intelligibility comparable to clean speech. This motivated us to develop a method to generate an approximation of the cIRM, but without prior source information. As such, the growing use of neural networks as a tool in source separation and phase estimation was presented and evaluated. Experiments were conducted to evaluate the potential of Generative Adversarial Networks (GANs), often used in image transformation, in generating the phase of the cIRM, with human test subjects to evaluate whether intelligibility of separated speech was improved. The GAN showed promise in generating phase-like results, although imperfect transformation resulted in an audible quality decrease, suggesting that the approach was unlikely to produce the natural sound required by musicians. However, for the tactical case, where intelligibility is valued over quality, consonant reconstruction and improved impulse attenuation were observed using our GAN-estimated cIRM. This improvement was reflected in an increase in the signal-to-noise ratio as compared to clean speech, and a decrease in the same metric compared to the impulse noise, demonstrating the improved clean speech contributions and the reduction in impulse noise contributions in the separated output. These results show the potential, with better resources, for GAN-generated phase to be used to improve intelligibility during audio source separation of impulse noise from speech, and motivate further exploration on this topic.
first_indexed 2024-09-23T13:34:58Z
format Thesis
id mit-1721.1/138956
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T13:34:58Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/138956 2022-01-15T03:26:16Z Improving Impulse Audio Source Separation using Generative Adversarial Networks for Phase Generation Piercy, Phoebe K. Lang, Jeffrey H. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2022-01-14T14:40:41Z 2022-01-14T14:40:41Z 2021-06 2021-06-17T20:14:04.111Z Thesis https://hdl.handle.net/1721.1/138956 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Piercy, Phoebe K.
Improving Impulse Audio Source Separation using Generative Adversarial Networks for Phase Generation
title Improving Impulse Audio Source Separation using Generative Adversarial Networks for Phase Generation
title_sort improving impulse audio source separation using generative adversarial networks for phase generation
url https://hdl.handle.net/1721.1/138956
work_keys_str_mv AT piercyphoebek improvingimpulseaudiosourceseparationusinggenerativeadversarialnetworksforphasegeneration
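
Illustrative note on the masks named in the description: the sketch below is not taken from the thesis. It is a minimal NumPy example, under stated assumptions, of why a complex ratio mask (which alters magnitude and phase) can recover a target spectrogram more closely than a magnitude-only ratio mask (which keeps the noisy phase). The test signals, the frame and hop sizes, the helper names stft and rel_error, and the particular IRM formula (one common definition among several) are all assumptions chosen for illustration. Both masks here are "ideal", meaning they are computed from the known clean source; the GAN-based estimation described in the abstract is not reproduced.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    return np.array([np.fft.rfft(window * x[s:s + frame]) for s in starts])

# Synthetic stand-ins (assumptions): a steady tone plays the role of the
# desired signal, and a 5 ms burst of noise plays the role of the impulse.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
impulse = np.zeros_like(t)
impulse[4000:4040] = rng.normal(0.0, 2.0, 40)
noisy = clean + impulse

S, N, Y = stft(clean), stft(impulse), stft(noisy)
eps = 1e-8  # guards against division by zero in empty bins

# Magnitude-only ideal ratio mask (one common definition): real-valued,
# so the masked output keeps the phase of the noisy mixture.
irm = np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + eps))

# Complex ideal ratio mask: the complex quotient S / Y, written with a
# regularised denominator. It rescales magnitude and rotates phase.
cirm = S * np.conj(Y) / (np.abs(Y) ** 2 + eps)

est_irm = irm * Y
est_cirm = cirm * Y

def rel_error(estimate):
    """Relative spectrogram error against the clean target."""
    return np.linalg.norm(estimate - S) / np.linalg.norm(S)

print("IRM  relative error:", rel_error(est_irm))
print("cIRM relative error:", rel_error(est_cirm))
```

In this toy setting the complex mask reproduces the clean spectrogram almost exactly, while the magnitude-only mask leaves a residual in the frames that contain the burst. That gap is consistent with the abstract's emphasis on phase, but it says nothing about how well a learned, non-ideal mask would perform.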