Using Voice Activity Detection and Deep Neural Networks with Hybrid Speech Feature Extraction for Deceptive Speech Detection
In this work, we first propose a deep neural network (DNN) system for the automatic detection of speech in audio signals, otherwise known as voice activity detection (VAD). Several DNN types were investigated, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), with the best performance being obtained for the latter. Additional postprocessing techniques, i.e., hysteretic thresholding, minimum duration filtering, and bilateral extension, were employed in order to boost performance. The systems were trained and tested using several data subsets of the CENSREC-1-C database, with different simulated ambient noise conditions, and additional testing was performed on a different CENSREC-1-C data subset containing actual ambient noise, as well as on a subset of the TIMIT database. An accuracy of up to 99.13% was obtained for the CENSREC-1-C datasets, and 97.60% for the TIMIT dataset. We proceed to show how the final VAD system can be adapted and employed within an utterance-level deceptive speech detection (DSD) processing pipeline. The best DSD performance is achieved by a novel hybrid CNN-MLP network leveraging a fusion of algorithmically and automatically extracted speech features, and reaches an unweighted accuracy (UA) of 63.7% on the RLDD database, and 62.4% on the RODeCAR database.
Main Authors: | Serban Mihalache, Dragos Burileanu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-02-01 |
Series: | Sensors |
Subjects: | deceptive speech detection; deep neural networks; RODeCAR; voice activity detection |
Online Access: | https://www.mdpi.com/1424-8220/22/3/1228 |
author | Serban Mihalache Dragos Burileanu |
collection | DOAJ |
description | In this work, we first propose a deep neural network (DNN) system for the automatic detection of speech in audio signals, otherwise known as voice activity detection (VAD). Several DNN types were investigated, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), with the best performance being obtained for the latter. Additional postprocessing techniques, i.e., hysteretic thresholding, minimum duration filtering, and bilateral extension, were employed in order to boost performance. The systems were trained and tested using several data subsets of the CENSREC-1-C database, with different simulated ambient noise conditions, and additional testing was performed on a different CENSREC-1-C data subset containing actual ambient noise, as well as on a subset of the TIMIT database. An accuracy of up to 99.13% was obtained for the CENSREC-1-C datasets, and 97.60% for the TIMIT dataset. We proceed to show how the final VAD system can be adapted and employed within an utterance-level deceptive speech detection (DSD) processing pipeline. The best DSD performance is achieved by a novel hybrid CNN-MLP network leveraging a fusion of algorithmically and automatically extracted speech features, and reaches an unweighted accuracy (UA) of 63.7% on the RLDD database, and 62.4% on the RODeCAR database. |
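The abstract names three VAD postprocessing steps: hysteretic thresholding, minimum duration filtering, and bilateral extension. A minimal sketch of what these steps typically do, assuming per-frame speech scores in [0, 1]; all function names, thresholds, and frame counts here are illustrative assumptions, not the authors' exact implementation:

```python
# Illustrative sketch of three common VAD postprocessing steps.
# Parameters (t_hi, t_lo, min_len, pad) are assumed values for the example.

def hysteretic_threshold(scores, t_hi=0.7, t_lo=0.4):
    """A frame turns ON when its score reaches t_hi and stays ON
    while the score remains at or above t_lo (hysteresis)."""
    active, flags = False, []
    for s in scores:
        if not active and s >= t_hi:
            active = True
        elif active and s < t_lo:
            active = False
        flags.append(active)
    return flags

def min_duration_filter(flags, min_len=3):
    """Drop speech runs shorter than min_len frames."""
    out, i, n = [False] * len(flags), 0, len(flags)
    while i < n:
        if flags[i]:
            j = i
            while j < n and flags[j]:
                j += 1
            if j - i >= min_len:          # keep only long-enough runs
                out[i:j] = [True] * (j - i)
            i = j
        else:
            i += 1
    return out

def bilateral_extension(flags, pad=2):
    """Extend every retained speech segment by pad frames on each side."""
    out = list(flags)
    for i, f in enumerate(flags):
        if f:
            lo, hi = max(0, i - pad), min(len(flags), i + pad + 1)
            out[lo:hi] = [True] * (hi - lo)
    return out

# Chained as in a typical VAD pipeline: threshold, filter, then extend.
scores = [0.1, 0.8, 0.8, 0.5, 0.2, 0.9, 0.1]
decisions = bilateral_extension(
    min_duration_filter(hysteretic_threshold(scores)), pad=1)
```

In this toy run the isolated high-score frame at index 5 is removed by the minimum-duration filter, while the three-frame run starting at index 1 survives and is padded on both sides.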
id | doaj.art-55ace2ba63c04d22940118b728f2dd28 |
institution | Directory Open Access Journal |
issn | 1424-8220 |
doi | 10.3390/s22031228 |
citation | Sensors, vol. 22, no. 3, art. 1228, 2022-02-01 |
author affiliations | Speech and Dialogue Research Laboratory, University “Politehnica” of Bucharest, 060042 Bucharest, Romania (both authors) |
topic | deceptive speech detection deep neural networks RODeCAR voice activity detection |