Innovative Method for Unsupervised Voice Activity Detection and Classification of Audio Segments

An accurate and noise-robust voice activity detection (VAD) system can be widely used for emerging speech technologies in the fields of audio forensics, wireless communication, and speech recognition. However, in real-life applications, a sufficient amount of data, or of human-annotated data, to train s...


Bibliographic Details
Main Authors:	Zulfiqar Ali, Muhammad Talha
Format:	Article
Language:	English
Published:	IEEE 2018-01-01
Series:	IEEE Access
Subjects:	Voiced and unvoiced segmentation; fractal dimension; Katz algorithm; TIMIT database; KSU speech database
Online Access:	https://ieeexplore.ieee.org/document/8290827/

_version_	1818558655579029504
author	Zulfiqar Ali; Muhammad Talha
author_facet	Zulfiqar Ali; Muhammad Talha
author_sort	Zulfiqar Ali
collection	DOAJ
description	An accurate and noise-robust voice activity detection (VAD) system can be widely used for emerging speech technologies in the fields of audio forensics, wireless communication, and speech recognition. However, in real-life applications, a sufficient amount of data, or of human-annotated data, to train such a system may not be available, so a supervised VAD system cannot be used in such situations. In this paper, an unsupervised method for VAD is proposed to label the speech-presence and speech-absence segments of an audio recording. To make the proposed method efficient and computationally fast, it is implemented using long-term features computed with the Katz algorithm for fractal dimension estimation. Two databases in different languages are used to evaluate the performance of the proposed method: the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, which is in English, and the King Saud University (KSU) speech database, which is in Arabic. TIMIT is recorded in only one environment, whereas the KSU speech database is recorded in distinct environments using various recording systems that contain sound cards of different qualities and models. The evaluation suggested that the proposed method labels voiced and unvoiced segments reliably in both clean and noisy audio.
first_indexed	2024-12-14T00:15:07Z
format	Article
id	doaj.art-381f37b9b0594c5a89ea04451dba558f
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-12-14T00:15:07Z
publishDate	2018-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-381f37b9b0594c5a89ea04451dba558f | 2022-12-21T23:25:35Z | eng | IEEE | IEEE Access | 2169-3536 | 2018-01-01 | vol. 6, pp. 15494-15504 | 10.1109/ACCESS.2018.2805845 | 8290827 | Innovative Method for Unsupervised Voice Activity Detection and Classification of Audio Segments | Zulfiqar Ali (https://orcid.org/0000-0002-1599-1287), Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia | Muhammad Talha (https://orcid.org/0000-0002-4246-2524), Deanship of Scientific Research, King Saud University, Riyadh, Saudi Arabia | https://ieeexplore.ieee.org/document/8290827/ | Voiced and unvoiced segmentation; fractal dimension; Katz algorithm; TIMIT database; KSU speech database
title	Innovative Method for Unsupervised Voice Activity Detection and Classification of Audio Segments
topic	Voiced and unvoiced segmentation; fractal dimension; Katz algorithm; TIMIT database; KSU speech database
url	https://ieeexplore.ieee.org/document/8290827/
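
The description in this record says the method relies on long-term features computed with the Katz algorithm for fractal dimension estimation. As a rough illustration only, the sketch below computes a Katz fractal dimension per frame of a signal. It is not the authors' implementation: the function names katz_fd and frame_features, the frame length, the hop size, and the choice to treat each frame as a planar curve of (sample index, amplitude) points are all assumptions, and the record does not state how the features are turned into speech-presence and speech-absence labels, so no decision rule is shown.

```python
import math


def katz_fd(frame):
    """Katz fractal dimension of one frame (hypothetical helper).

    The frame is treated as a planar curve through the points (i, frame[i]).
    FD = log10(n) / (log10(n) + log10(d / L)), where L is the total curve
    length, d is the largest distance from the first point to any other
    point, and n is the number of steps along the curve.
    """
    if len(frame) < 3:
        raise ValueError("need at least 3 samples per frame")
    points = [(float(i), float(x)) for i, x in enumerate(frame)]
    # L: sum of Euclidean distances between successive points
    total = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    # d: largest Euclidean distance from the first point to any later point
    extent = max(math.dist(points[0], p) for p in points[1:])
    n = len(points) - 1
    denom = math.log10(n) + math.log10(extent / total)
    if abs(denom) < 1e-12:
        # Degenerate curve (for example a single large spike): undefined
        return math.nan
    return math.log10(n) / denom


def frame_features(signal, frame_len, hop):
    """Katz FD for each full frame of `signal` (assumed framing scheme)."""
    last_start = len(signal) - frame_len
    return [
        katz_fd(signal[start:start + frame_len])
        for start in range(0, last_start + 1, hop)
    ]


if __name__ == "__main__":
    # A smooth ramp followed by a rapidly alternating stretch
    demo = [0, 1, 2, 3, 4, 0, 1, 0, 1, 0]
    print(frame_features(demo, frame_len=5, hop=5))
```

In this sketch a straight-line frame gives a value of 1 and a more irregular frame gives a larger value, which is the property that makes the measure usable as a per-frame feature. Any threshold or clustering step applied to these values would be a further assumption.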


Similar Items