Speaker Recognition Based on the Joint Loss Function

Bibliographic Details
Main Authors: Tengteng Feng, Houbin Fan, Fengpei Ge, Shuxin Cao, Chunyan Liang
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Electronics
Subjects:
Online Access: https://www.mdpi.com/2079-9292/12/16/3447
Description
Summary: The statistical pyramid dense time-delay neural network (SPD-TDNN) model handles imbalanced training data poorly, is prone to overfitting, and generalizes weakly. To address these problems, we propose a method based on a joint loss function and an improved statistical pyramid dense time-delay neural network (JLF-ISPD-TDNN). The method improves on the SPD-TDNN model and uses a joint loss function to combine the advantages of the cross-entropy loss and a contrastive learning loss. By minimizing the distance between speech embeddings from the same speaker and maximizing the distance between speech embeddings from different speakers, the model achieves better generalization and a more robust speaker feature representation. We evaluated the proposed method using the equal error rate (EER) and the minimum detection cost function (minDCF). The experimental results show that the EER and minDCF on the Aishell-1 dataset reached 1.02% and 0.1221%, respectively. Therefore, using the joint loss function in the improved SPD-TDNN model can significantly enhance the model's speaker recognition performance.
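Illustration: the summary describes a joint loss that adds an embedding-distance term to the cross-entropy loss. The record does not give the exact formulation, so the following is only a minimal sketch under stated assumptions: a generic margin-based pairwise contrastive term, Euclidean distance, and a simple weighted sum. The names `joint_loss`, `weight`, and `margin` are hypothetical and are not taken from the article.

```python
import math
from itertools import combinations


def cross_entropy(logits, label):
    """Softmax cross-entropy for one utterance, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(embeddings, labels, margin=1.0):
    """Pairwise contrastive term averaged over all pairs in the batch.

    Same-speaker pairs are penalized by their squared distance;
    different-speaker pairs are penalized only when closer than `margin`.
    """
    terms = []
    for i, j in combinations(range(len(labels)), 2):
        d = euclidean(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            terms.append(d ** 2)
        else:
            terms.append(max(0.0, margin - d) ** 2)
    return sum(terms) / len(terms) if terms else 0.0


def joint_loss(logits, embeddings, labels, weight=0.5, margin=1.0):
    """Mean cross-entropy plus a weighted contrastive term (assumed form)."""
    ce = sum(cross_entropy(z, y) for z, y in zip(logits, labels)) / len(labels)
    return ce + weight * contrastive_loss(embeddings, labels, margin)


# Toy batch: two utterances from speaker 0, one from speaker 1.
toy_logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.8]]
toy_embeddings = [[0.0, 0.0], [0.0, 0.1], [2.0, 0.0]]
toy_labels = [0, 0, 1]
print(joint_loss(toy_logits, toy_embeddings, toy_labels))
```

In this sketch the cross-entropy term drives speaker classification, while the contrastive term pulls same-speaker embeddings together and pushes different-speaker embeddings at least `margin` apart; how the article balances the two terms is not stated in this record.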
ISSN: 2079-9292