Non-reference speech quality assessment based on deep learning

Bibliographic Details
Main Author: Fang, Xuhui
Other Authors: Tan Yap Peng
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/164956
Description
Summary: Speech quality evaluation is an important technique in speech processing and is widely used in mobile communications, the Internet, public safety, digital entertainment, consumer electronics, and other fields. Early work relied solely on subjective speech quality assessment, which demands substantial human resources, annotated data, and time. Objective speech quality evaluation methods therefore gradually became popular. Reference-based speech quality assessment models require the clean original speech signal, which is often difficult to obtain in practice. As a result, non-reference speech quality assessment has received increasing attention, especially in recent years. Many researchers have integrated deep learning into non-reference speech quality assessment, leading to major progress in the field. However, existing deep learning-based speech quality evaluation methods still have limitations, such as insufficient accuracy and a large number of parameters. To address these limitations, this dissertation studies non-reference speech quality evaluation based on deep learning. The main research is summarized below:

(1) To address the insufficient accuracy of existing speech quality assessment, this dissertation proposes improvements from several perspectives. A BiLSTM (Bidirectional Long Short-Term Memory) network is used to improve the time-dependency model, exploiting BiLSTM's ability to learn speech context information effectively. On this basis, a Squeeze-and-Excitation (SE) module is added, which learns the correlation between channels in the feature map to obtain channel attention weights and uses them to recalibrate the feature map. In addition, a custom loss function based on the signal loss ratio is used to improve model fitting, which further improves the evaluation performance of the model. Experimental results show the effectiveness of this method.

(2) To address the large number of parameters in existing speech quality evaluation models, we propose a low-complexity speech quality evaluation method based on depthwise residual convolution and a Bidirectional Gated Recurrent Unit (BiGRU), the SE-DSResBGRU-NRSQA model. The model reduces the number of parameters by using BiGRU and depthwise separable convolution, structures the convolutional part around the main architecture of the residual network (ResNet), and uses shallow feature information through direct mapping to improve evaluation performance. SE modules are then added to learn the importance of different channels, so that the input information is exploited effectively and the evaluation performance of the system improves. Experimental results show that the proposed method achieves good speech quality evaluation with a relatively small number of parameters.
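
The parameter savings that item (2) attributes to depthwise separable convolution and BiGRU can be illustrated with simple arithmetic. The sketch below is not the SE-DSResBGRU-NRSQA model itself: the function names and the layer sizes (64 input channels, 128 output channels, a 3x3 kernel, 128 input features, 64 hidden units) are assumptions chosen only for illustration, and the recurrent count uses the textbook convention of one bias vector per gate, which some frameworks do not follow exactly.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard k x k 2-D convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=True):
    """Parameter count of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def recurrent_params(input_size, hidden_size, gates, bidirectional=True):
    """Parameter count of a gated recurrent layer.

    Assumption: each gate has input weights, recurrent weights,
    and a single bias vector (textbook formulation).
    """
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    directions = 2 if bidirectional else 1
    return gates * per_gate * directions


def lstm_params(input_size, hidden_size, bidirectional=True):
    """LSTM: input, forget, and output gates plus the cell candidate (4 blocks)."""
    return recurrent_params(input_size, hidden_size, 4, bidirectional)


def gru_params(input_size, hidden_size, bidirectional=True):
    """GRU: reset and update gates plus the hidden candidate (3 blocks)."""
    return recurrent_params(input_size, hidden_size, 3, bidirectional)


if __name__ == "__main__":
    # Hypothetical layer sizes, not taken from the thesis.
    print("standard conv      :", conv2d_params(64, 128, 3))
    print("depthwise separable:", depthwise_separable_params(64, 128, 3))
    print("BiLSTM             :", lstm_params(128, 64))
    print("BiGRU              :", gru_params(128, 64))
```

Under these assumed sizes the separable convolution needs roughly one eighth of the parameters of the standard convolution, and the BiGRU needs three quarters of the parameters of a BiLSTM of the same width. The actual savings in the proposed model depend on its real layer configuration, which the summary does not state.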