A Deep Neural Network Model for Speaker Identification
Speaker identification is a classification task that aims to identify a subject from given sequential time-series data. Since the speech signal is a continuous one-dimensional time series, most current research methods are based on convolutional neural networks (CNN) or recurrent neural net...
Main Authors: | Feng Ye, Jun Yang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-04-01 |
Series: | Applied Sciences |
Subjects: | speaker identification, speaker recognition, recurrent neural network, spectrogram, two-dimensional convolutional neural network, gated recurrent unit |
Online Access: | https://www.mdpi.com/2076-3417/11/8/3603 |
_version_ | 1827694908017737728 |
---|---|
author | Feng Ye, Jun Yang |
author_facet | Feng Ye, Jun Yang |
author_sort | Feng Ye |
collection | DOAJ |
description | Speaker identification is a classification task that aims to identify a subject from given sequential time-series data. Since the speech signal is a continuous one-dimensional time series, most current research methods are based on convolutional neural networks (CNN) or recurrent neural networks (RNN). These methods perform well in many tasks, but there has been little attempt to combine the two network models for the speaker identification task. The spectrogram of a speech signal contains the spatial features of the voiceprint (which correspond to the voice spectrum), and a CNN is effective for extracting these spatial features (which corresponds to modeling spectral correlations in acoustic features). At the same time, the speech signal is a time series, and a deep RNN can represent long utterances better than a shallow network. Considering the advantage of the gated recurrent unit (GRU) over the traditional RNN in processing segmented sequence data, we decided to use stacked GRU layers in our model for frame-level feature extraction. In this paper, we propose a deep neural network (DNN) model based on a two-dimensional convolutional neural network (2-D CNN) and the gated recurrent unit (GRU) for speaker identification. In the network design, the convolutional layers extract voiceprint features and reduce dimensionality in both the time and frequency domains, allowing faster computation in the subsequent GRU layers; the stacked GRU recurrent layers then learn the speaker's acoustic features. During this research, we also evaluated several other neural network structures, including a 2-D CNN, a deep RNN, and a deep LSTM. All of these network models were evaluated on the Aishell-1 speech dataset. The experimental results showed that our proposed DNN model, which we call deep GRU, achieved a high recognition accuracy of 98.96%, and they demonstrate the effectiveness of the proposed deep GRU network model compared with the other models for speaker identification. With further optimization, this method could be applied to other tasks similar to speaker identification. (A rough code sketch of the architecture described here follows the record below.) |
first_indexed | 2024-03-10T12:15:18Z |
format | Article |
id | doaj.art-48f51371e0c94468b0dc332cec84e4b8 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T12:15:18Z |
publishDate | 2021-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | Feng Ye; Jun Yang (Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China). A Deep Neural Network Model for Speaker Identification. Applied Sciences, vol. 11, no. 8, art. 3603, 2021-04-01. MDPI AG. ISSN 2076-3417. doi:10.3390/app11083603. https://www.mdpi.com/2076-3417/11/8/3603 |
spellingShingle | Feng Ye; Jun Yang; A Deep Neural Network Model for Speaker Identification; Applied Sciences; speaker identification; speaker recognition; recurrent neural network; spectrogram; two-dimensional convolutional neural network; gated recurrent unit |
title | A Deep Neural Network Model for Speaker Identification |
title_full | A Deep Neural Network Model for Speaker Identification |
title_fullStr | A Deep Neural Network Model for Speaker Identification |
title_full_unstemmed | A Deep Neural Network Model for Speaker Identification |
title_short | A Deep Neural Network Model for Speaker Identification |
title_sort | deep neural network model for speaker identification |
topic | speaker identification; speaker recognition; recurrent neural network; spectrogram; two-dimensional convolutional neural network; gated recurrent unit |
url | https://www.mdpi.com/2076-3417/11/8/3603 |
work_keys_str_mv | AT fengye adeepneuralnetworkmodelforspeakeridentification AT junyang adeepneuralnetworkmodelforspeakeridentification AT fengye deepneuralnetworkmodelforspeakeridentification AT junyang deepneuralnetworkmodelforspeakeridentification |
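The abstract above outlines the architecture: 2-D convolutional layers extract voiceprint features from the spectrogram and shrink it along both the time and frequency axes, stacked GRU layers model the resulting frame sequence, and a classifier maps the final state to a speaker identity. The following is only a minimal sketch of that pipeline; the layer counts, channel widths, hidden size, mel-bin count, and speaker count are assumptions for illustration, not the configuration published in the article.

```python
# Hypothetical sketch of a 2-D CNN + stacked-GRU speaker-identification model.
# Layer sizes and the 80-bin log-mel input are assumptions, not the authors' setup.
import torch
import torch.nn as nn

class DeepGRU(nn.Module):
    def __init__(self, n_speakers: int, n_mels: int = 80):
        super().__init__()
        # 2-D convolutions extract spectro-temporal (voiceprint) features and,
        # via stride 2, quarter both the time and frequency axes so the GRU
        # layers see a shorter, lower-dimensional frame sequence.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        freq_out = n_mels // 4            # frequency bins left after two stride-2 convs
        self.gru = nn.GRU(
            input_size=64 * freq_out,     # channels * reduced frequency bins per frame
            hidden_size=256,
            num_layers=3,                 # stacked GRU layers for frame-level features
            batch_first=True,
        )
        self.classifier = nn.Linear(256, n_speakers)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, time, n_mels) log-mel spectrogram
        x = self.cnn(spec)                               # (batch, 64, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # frame sequence for the GRU
        out, _ = self.gru(x)                             # (batch, t, 256)
        return self.classifier(out[:, -1])               # logits over speaker identities

model = DeepGRU(n_speakers=400)                # Aishell-1 contains around 400 speakers
logits = model(torch.randn(2, 1, 200, 80))     # dummy batch of two spectrograms
print(logits.shape)                            # torch.Size([2, 400])
```

Classifying from the GRU state at the last frame is just one simple pooling choice; averaging the GRU outputs over all frames is an equally common alternative for utterance-level speaker classification.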