Noise robust voice activity detection

Voice activity detection (VAD) is a fundamental task in various speech-related applications, such as speech coding, speaker diarization and speech recognition. It is often defined as the problem of distinguishing speech from silence/noise. A typical VAD system consists of two core parts: a fe...


Bibliographic Details
Main Author: Pham, Chau Khoa.
Other Authors: Chng Eng Siong
Format: Thesis
Language: English
Published: 2013
Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Online Access: http://hdl.handle.net/10356/52255
_version_ 1811691670663593984
author Pham, Chau Khoa.
author2 Chng Eng Siong
author_facet Chng Eng Siong
Pham, Chau Khoa.
author_sort Pham, Chau Khoa.
collection NTU
description Voice activity detection (VAD) is a fundamental task in various speech-related applications, such as speech coding, speaker diarization and speech recognition. It is often defined as the problem of distinguishing speech from silence/noise. A typical VAD system consists of two core parts: a feature extraction stage and a speech/non-speech decision mechanism. The first part extracts a set of parameters from the signal, which are used by the second part to make the final speech/non-speech decision, based on a set of decision rules. Most VAD features proposed in the literature exploit the discriminative characteristics of speech in different domains, which can be divided into five categories: energy-based features, spectral-domain features, cepstral-domain features, harmonicity-based features, and long-term features. Energy-based features are simple and can be easily implemented in hardware. Spectral-domain and cepstral-domain features are more noise robust at low SNRs, as they benefit from a wide class of filtering and speech analysis techniques in these domains. When SNR is around 0 dB, or when the background noise contains complex acoustical events, features relying on the harmonic structure of voiced speech, as well as ones that exploit the long-term variability of speech, appear to be more robust. Next, the second part of VAD decides the speech or non-speech class for each signal segment. Existing decision-making mechanisms can be divided into three categories: thresholding, statistical modelling and machine learning. The first one is the simplest, yet sufficient in many cases where the features employed possess good discriminative power. The latter two can work well at high SNRs, but their performance declines quickly at lower SNRs. In order to derive a state-of-the-art VAD algorithm, a comparative study has been carried out in this thesis to evaluate different VAD techniques.
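The two-part structure described above (feature extraction followed by a speech/non-speech decision) can be sketched as follows. This is a minimal illustration only, not the thesis's method: it assumes a frame log-energy feature with a fixed threshold, and the function names, frame length, toy signal and threshold value are all hypothetical choices made for the example.

```python
import math
import random


def frame_log_energy(signal, frame_len=160):
    """Feature extraction: log energy (dB) of each non-overlapping frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(10.0 * math.log10(energy + 1e-12))
    return feats


def threshold_decision(features, threshold_db):
    """Decision mechanism: a frame is labelled speech if its feature exceeds the threshold."""
    return [f > threshold_db for f in features]


# Toy input (assumption): low-level noise, then a louder 200 Hz tone
# standing in for voiced speech, sampled at 8 kHz.
rng = random.Random(0)
fs = 8000
noise = [rng.gauss(0.0, 0.01) for _ in range(800)]
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) + rng.gauss(0.0, 0.01)
        for n in range(800)]
signal = noise + tone

features = frame_log_energy(signal)
decisions = threshold_decision(features, threshold_db=-25.0)
print(decisions)
```

Keeping the feature and the decision rule in separate functions mirrors the split in the abstract: either part can be swapped (for example, a harmonicity feature or a statistical model) without touching the other.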
Traditionally, VAD algorithms are evaluated as a holistic system, from which it is hard to analyse whether performance gain is achieved from a new feature or a new decision mechanism. In this report, the author examines the use of P_e, the probability of error of two given distributions, to measure the performance of a VAD feature separately from other modules in the system. The metric represents the discriminative power of a feature when used for classifying speech and non-speech. The result is a fairer comparison and a more compact performance representation. This allows a deeper analysis of VAD features, which reveals interesting trends across different SNRs. Secondly, a new approach to VAD is proposed in this report, which tackles the cases where SNR can be lower than 0 dB and the background might contain complex audible events. The proposed idea exploits the sub-regions of the noisy speech spectrum that still retain sufficient harmonic structure of human voiced speech. This allows a more robust feature, based on the local harmonicity of the spectral autocorrelation of the voiced speech, to be derived, which can reliably detect heavily corrupted voiced speech segments. Experimental results showed a significant improvement over a recently proposed method in the same category.
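As a rough illustration of how a probability of error between two distributions can score a feature on its own, the sketch below estimates P_e from histograms of feature values for speech and non-speech frames. The abstract does not give the exact definition used in the thesis, so this assumes the usual equal-prior Bayes error, P_e = 0.5 * sum over bins of min(p_speech, p_noise); the function name, bin count and synthetic Gaussian feature values are hypothetical.

```python
import random


def probability_of_error(speech_vals, noise_vals, n_bins=50):
    """Histogram estimate of the equal-prior Bayes error of a 1-D feature.

    Returns a value between 0 (perfectly separable) and 0.5 (indistinguishable).
    """
    lo = min(min(speech_vals), min(noise_vals))
    hi = max(max(speech_vals), max(noise_vals))
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def hist(vals):
        counts = [0] * n_bins
        for v in vals:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(vals) for c in counts]

    p_s, p_n = hist(speech_vals), hist(noise_vals)
    return 0.5 * sum(min(a, b) for a, b in zip(p_s, p_n))


# Synthetic feature values (assumption): noise frames centred at 0,
# speech frames centred at 5 (well separated) or at 1 (heavily overlapping).
rng = random.Random(1)
noise_feat = [rng.gauss(0.0, 1.0) for _ in range(2000)]
speech_far = [rng.gauss(5.0, 1.0) for _ in range(2000)]
speech_near = [rng.gauss(1.0, 1.0) for _ in range(2000)]

pe_far = probability_of_error(speech_far, noise_feat)
pe_near = probability_of_error(speech_near, noise_feat)
print(pe_far, pe_near)
```

A lower value indicates a more discriminative feature, independent of whichever decision mechanism is later attached to it, so the same function could be evaluated per SNR level to compare features.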
first_indexed 2024-10-01T06:23:35Z
format Thesis
id ntu-10356/52255
institution Nanyang Technological University
language English
last_indexed 2024-10-01T06:23:35Z
publishDate 2013
record_format dspace
spelling ntu-10356/52255 2023-03-04T00:34:09Z Noise robust voice activity detection Pham, Chau Khoa. Chng Eng Siong School of Computer Engineering Parallel and Distributed Computing Centre DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition Master of Engineering (SCE) 2013-04-26T03:42:20Z 2013-04-26T03:42:20Z 2013 2013 Thesis http://hdl.handle.net/10356/52255 en 82 p. application/pdf
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
Pham, Chau Khoa.
Noise robust voice activity detection
title Noise robust voice activity detection
title_sort noise robust voice activity detection
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
url http://hdl.handle.net/10356/52255
work_keys_str_mv AT phamchaukhoa noiserobustvoiceactivitydetection