Detecting synthetic speech using long term magnitude and phase information

Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. Such signals pose a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice and deceive the SV system. To address this challenge, we study the...

Bibliographic Details
Main Authors:	Tian, Xiaohai; Du, Steven; Xiao, Xiong; Xu, Haihua; Chng, Eng Siong; Li, Haizhou
Other Authors:	School of Computer Science and Engineering
Format:	Conference Paper
Language:	English
Published:	2018
Subjects:	Spoofing Attack; Voice Conversion; DRNTU::Engineering::Computer science and engineering
Online Access:	https://hdl.handle.net/10356/89638 http://hdl.handle.net/10220/47055

collection	NTU
description	Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. Such signals pose a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice and deceive the SV system. To address this challenge, we study the detection of synthetic speech using long-term magnitude and phase information of speech. As most TTS and VC techniques use vocoders for speech analysis and synthesis, we focus on differentiating speech signals generated by vocoders from natural speech. Log magnitude spectrum and two phase-based features, instantaneous frequency derivation and modified group delay, were studied in this work. We conducted experiments on the CMU-ARCTIC database using various speech features and a neural network classifier. During training, synthetic speech detection is formulated as a 2-class classification problem, and the neural network is trained to differentiate synthetic speech from natural speech. During testing, the posterior scores generated by the neural network are used to detect synthetic speech. The synthetic speech used in training and testing is generated by different types of vocoders and VC methods. Experimental results show that long-term information up to 0.3 s is important for synthetic speech detection. In addition, the high-dimensional log magnitude spectrum features significantly outperform the low-dimensional MFCC features, showing that it is important to retain detailed spectral information for detecting synthetic speech. Furthermore, the two phase-based features are found to perform well and to be complementary to the log magnitude spectrum features. The fusion of these features produces an equal error rate (EER) of 0.09%.
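The equal error rate quoted in the abstract is the operating point at which the rate of missed synthetic utterances equals the rate of natural utterances falsely flagged as synthetic. The record does not say how the authors computed it, so the following is only a minimal sketch of one common way to estimate an EER from detector scores. The function name `equal_error_rate`, the convention that higher scores mean "synthetic", and the toy scores are all assumptions made for illustration, not data or code from the paper.

```python
import numpy as np


def equal_error_rate(synthetic_scores, natural_scores):
    """Estimate (eer, threshold), assuming higher scores mean 'synthetic'.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    synthetic_scores = np.asarray(synthetic_scores, dtype=float)
    natural_scores = np.asarray(natural_scores, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([synthetic_scores, natural_scores]))

    # Miss rate: synthetic utterances scoring below the threshold.
    miss = np.array([(synthetic_scores < t).mean() for t in thresholds])
    # False alarm rate: natural utterances scoring at or above the threshold.
    false_alarm = np.array([(natural_scores >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(miss - false_alarm)))
    return (miss[idx] + false_alarm[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Made-up posterior scores, purely to exercise the function.
    toy_synthetic = [0.9, 0.6, 0.3, 0.8]
    toy_natural = [0.1, 0.4, 0.7, 0.2]
    eer, threshold = equal_error_rate(toy_synthetic, toy_natural)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With only a handful of scores the estimate is coarse, because the two error curves rarely cross exactly at an observed score. Evaluations on large test sets often interpolate between neighbouring thresholds instead of taking the closest point.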
institution	Nanyang Technological University
spelling	ntu-10356/89638 2020-03-07T11:48:46Z
Conference:	2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP)
NTU-UBC Research Centre of Excellence in Active Living for the Elderly; Temasek Laboratories; NRF (Natl Research Foundation, S’pore)
Accepted version; 2018-12-18T06:25:40Z; 2019-12-06T17:30:02Z; 2015-07-01; 187524; en; 5 p.; application/pdf
Citation:	Tian, X., Du, S., Xiao, X., Xu, H., Chng, E. S., & Li, H. (2015). Detecting synthetic speech using long term magnitude and phase information. 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 611-615. doi:10.1109/ChinaSIP.2015.7230476
Rights:	© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [http://dx.doi.org/10.1109/ChinaSIP.2015.7230476].

Similar Items