Summary: | Speech is the primary way humans communicate. Speech enhancement algorithms estimate speech from received signals. Although conventional approaches can achieve accurate estimates under low noise conditions, their performance degrades as the signal-to-noise ratio (SNR) decreases. This thesis introduces four novel deep learning (DL) methods for low SNR scenarios.
The first work proposes a convolutional neural network (CNN) that estimates speech signals from the received signal by learning the features of noise and speech. In contrast to existing single-channel deep neural networks (DNNs), the proposed Small Model on Low SNR (SMoLnet) better exploits high-frequency-resolution signals while remaining parameter efficient. A high frequency resolution is effective at low SNR because it exposes more frequency bins with locally higher SNR. However, the filters in the convolutional layers of a CNN extract local input features and therefore have a limited receptive field over the high-resolution frequency representation of the received signal. Although lengthening the convolutional filters increases both the number of local features extracted and the receptive field, it also increases the number of parameters in the network. To overcome this issue, the convolution filters are exponentially dilated so that the receptive field after each layer is double that of the layer before. By doing so, the final layer can leverage a large receptive field that encapsulates the large number of frequency bins provided by high-resolution frequency features.
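As a rough illustration of this dilation scheme, the sketch below stacks frequency-axis convolutions whose dilation doubles at every layer; the layer count, channel width, and kernel size are assumptions for illustration only, not the published SMoLnet hyperparameters.

```python
# Minimal sketch (assumption: PyTorch, 1-D convolutions over frequency bins;
# hyperparameters below are illustrative, not SMoLnet's published values).
import torch.nn as nn

class DilatedStack(nn.Module):
    """Stack of 1-D convolutions whose dilation doubles at every layer,
    so the receptive field roughly doubles without adding parameters."""
    def __init__(self, channels=32, kernel_size=3, num_layers=8):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 2 ** i                       # 1, 2, 4, 8, ...
            layers.append(nn.Conv1d(channels, channels, kernel_size,
                                    dilation=dilation,
                                    padding=dilation * (kernel_size - 1) // 2))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):                           # x: (batch, channels, freq_bins)
        return self.net(x)

# Receptive field after L layers with kernel size k and dilations 2^0..2^(L-1):
# 1 + (k - 1) * (2^L - 1), e.g. 511 frequency bins for k = 3 and L = 8.
```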
The second work proposes a transfer learning framework that leverages pre-trained single-channel neural networks to improve the training of multichannel neural networks for low SNR scenarios with scarce training data. The framework consists of a newly formulated multichannel DNN based on the U-net architecture with exponentially dilated layers, followed by a pre-trained single-channel neural network. The multichannel DNN leverages the spectro-spatial features of a high-resolution frequency input to produce enhanced features for the subsequent pre-trained single-channel DNN. Because the spatial information in multichannel data depends on the sensor array configuration (such as the number of sensors, sensor spacing, sensor arrangement, and sensor mismatch) and on the source locations, and because publicly available multichannel datasets are scarce compared with single-channel ones, a U-net-like architecture is employed for faster convergence. In doing so, the proposed architecture achieves good performance on a publicly available multichannel dataset while using only 10% of the training data.
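A minimal sketch of this transfer-learning composition is given below, assuming a PyTorch-style implementation; the module names are hypothetical placeholders, and freezing the pre-trained backend is only one possible choice (fine-tuning it is another).

```python
# Minimal sketch of the transfer-learning idea (assumption: PyTorch; the names
# multichannel_frontend and pretrained_single_channel are hypothetical
# placeholders, not the thesis' actual module names).
import torch.nn as nn

class TransferEnhancer(nn.Module):
    def __init__(self, multichannel_frontend: nn.Module,
                 pretrained_single_channel: nn.Module):
        super().__init__()
        self.frontend = multichannel_frontend        # trained from scratch
        self.backend = pretrained_single_channel     # pre-trained weights
        for p in self.backend.parameters():          # freezing is one option;
            p.requires_grad = False                  # fine-tuning is another

    def forward(self, multichannel_stft):
        # Fuse spectro-spatial features into a single enhanced channel,
        # then refine it with the pre-trained single-channel network.
        enhanced_single = self.frontend(multichannel_stft)
        return self.backend(enhanced_single)
```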
The third work proposes a multichannel speech enhancement framework based on time-varying neural beamformers and a multichannel DNN under low SNR conditions. In contrast to existing DL-with-beamforming approaches, this approach requires neither prior knowledge of the source direction nor a large number of estimated frames. The proposed recurrent neural beamformer (R-NBF) achieves multichannel speech enhancement with the speech sample spatial covariance matrix (SCM) through a feedback connection. An analysis framework based on a first-order Taylor approximation with Wirtinger calculus is also presented. The proposed R-NBF architecture was validated using real signals recorded from a hexacopter hovering away from a speaker in an open field. Despite such adverse noise conditions, it achieves significantly improved speech intelligibility and reduced background noise.
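One plausible reading of the SCM feedback loop is sketched below for a single frequency bin; the mask network, smoothing factor, and eigenvector-based weight rule are illustrative assumptions rather than the exact R-NBF formulation.

```python
# Minimal sketch of a time-varying beamformer with an SCM feedback loop
# (assumption: NumPy; mask_net, alpha, and the principal-eigenvector weight
# rule are illustrative, not the thesis' exact R-NBF design).
import numpy as np

def beamform_with_feedback(stft_frames, mask_net, alpha=0.9, eps=1e-6):
    """stft_frames: (num_frames, num_mics) complex STFT of one frequency bin."""
    num_mics = stft_frames.shape[1]
    scm_speech = np.eye(num_mics, dtype=complex) * eps    # running speech SCM
    outputs = []
    for y in stft_frames:                                  # y: (num_mics,)
        # Feedback: the current speech SCM estimate conditions the mask network.
        mask = mask_net(y, scm_speech)                     # scalar speech presence
        scm_speech = (alpha * scm_speech
                      + (1 - alpha) * mask * np.outer(y, y.conj()))
        # Simple time-varying weights from the running SCM (principal eigenvector).
        _, eigvecs = np.linalg.eigh(scm_speech)
        w = eigvecs[:, -1]
        outputs.append(np.vdot(w, y))                      # w^H y
    return np.array(outputs)
```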
The fourth work proposes a gridless direction-of-arrival (DOA) estimation method using DL for narrow-band signals in low SNR scenarios with a practical array and a limited number of snapshots. More specifically, a complex CNN with a newly formulated complex phasor normalization is proposed. The proposed approach demonstrates robustness to unseen array imperfections by learning localized phase-to-sensor relationships from the complex feature maps for SNRs as low as −5 dB.
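A minimal sketch of one way such a phasor normalization could be realized is given below, assuming each complex entry is scaled to unit magnitude so that only the inter-sensor phase structure remains; the thesis' exact normalization may differ.

```python
# Minimal sketch of complex phasor normalization (assumption: NumPy; the exact
# normalization in the thesis may differ from this magnitude-only scaling).
import numpy as np

def phasor_normalize(snapshots, eps=1e-12):
    """snapshots: (num_sensors, num_snapshots) complex array samples.
    Returns unit-magnitude (phasor) entries, preserving inter-sensor phase."""
    return snapshots / (np.abs(snapshots) + eps)

# The normalized complex maps can then feed a complex-valued CNN that learns
# localized phase-to-sensor relationships for gridless DOA estimation.
```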
|