Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings

In recent years, a variety of studies have investigated methods for verifying users with a short randomized pass-phrase, driven by the increasing demand for voice-based authentication systems. In this paper, we propose a novel technique for extracting an i-vector-like feature based on...


Bibliographic Details
Main Authors: Woo Hyun Kang, Nam Soo Kim
Format: Article
Language: English
Published: MDPI AG 2019-10-01
Series: Sensors
Subjects:
Online Access: https://www.mdpi.com/1424-8220/19/21/4709
_version_ 1817990869925494784
author Woo Hyun Kang
Nam Soo Kim
author_facet Woo Hyun Kang
Nam Soo Kim
author_sort Woo Hyun Kang
collection DOAJ
description In recent years, a variety of studies have investigated methods for verifying users with a short randomized pass-phrase, driven by the increasing demand for voice-based authentication systems. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler (KL) divergence regularization adopted in VAE training, the proposed ALI-based feature extractor exploits a joint discriminator to ensure that the generated latent variable and the GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods on the TIDIGITS dataset. Experimental results show that the proposed method represents the uncertainty caused by short utterance duration better than the VAE-based method. Furthermore, the proposed approach achieves strong performance when combined with the standard i-vector framework.
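The Baum–Welch statistics and GMM supervector that the abstract builds on can be sketched as follows. This is a minimal toy illustration with NumPy, assuming a diagonal-covariance GMM; the function names and the centering convention are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def baum_welch_stats(frames, means, covs, weights):
    """Zeroth- and first-order Baum-Welch statistics of an utterance
    under a diagonal-covariance GMM.
    frames: (T, D) acoustic features; means/covs: (C, D); weights: (C,)."""
    T, D = frames.shape
    C = means.shape[0]
    log_post = np.zeros((T, C))
    for c in range(C):
        diff = frames - means[c]
        # log of weighted diagonal-Gaussian density for component c
        log_post[:, c] = (np.log(weights[c])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * covs[c]))
                          - 0.5 * np.sum(diff ** 2 / covs[c], axis=1))
    # normalize in the log domain to get frame-level posteriors
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    N = post.sum(axis=0)       # zeroth-order stats, shape (C,)
    F = post.T @ frames        # first-order stats, shape (C, D)
    return N, F

def centered_supervector(N, F, means):
    """Stack the posterior-weighted mean shifts of all components
    into a single (C*D,) supervector."""
    shifts = F / np.maximum(N[:, None], 1e-8) - means
    return shifts.reshape(-1)
```

In the paper's setup, statistics like `(N, F)` form the input from which the extractor generates the GMM supervector; here they are simply computed in closed form to show the shapes involved.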
first_indexed 2024-04-14T01:04:46Z
format Article
id doaj.art-c3f4dfedea4945ec9df2adf204542370
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-04-14T01:04:46Z
publishDate 2019-10-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-c3f4dfedea4945ec9df2adf204542370
updated 2022-12-22T02:21:17Z
language eng
publisher MDPI AG
series Sensors
issn 1424-8220
publishDate 2019-10-01
volume 19
issue 21
article_number 4709
doi 10.3390/s19214709
affiliation Woo Hyun Kang: Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea
affiliation Nam Soo Kim: Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea
url https://www.mdpi.com/1424-8220/19/21/4709
spellingShingle Woo Hyun Kang
Nam Soo Kim
Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
Sensors
speech embedding
deep learning
speaker recognition
unsupervised representation learning
title Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
title_full Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
title_fullStr Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
title_full_unstemmed Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
title_short Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings
title_sort adversarially learned total variability embedding for speaker recognition with random digit strings
topic speech embedding
deep learning
speaker recognition
unsupervised representation learning
url https://www.mdpi.com/1424-8220/19/21/4709
work_keys_str_mv AT woohyunkang adversariallylearnedtotalvariabilityembeddingforspeakerrecognitionwithrandomdigitstrings
AT namsookim adversariallylearnedtotalvariabilityembeddingforspeakerrecognitionwithrandomdigitstrings