Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System
One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation.
Main Authors: | Soonshin Seo, Ji-Hwan Kim |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-10-01 |
Series: | Electronics |
Subjects: | text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding |
Online Access: | https://www.mdpi.com/2079-9292/9/10/1706 |
_version_ | 1797550634408345600 |
---|---|
author | Soonshin Seo; Ji-Hwan Kim |
author_facet | Soonshin Seo; Ji-Hwan Kim |
author_sort | Soonshin Seo |
collection | DOAJ |
description | One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set the ResNet with the scaled channel width and layer depth as a baseline. To control the variability in the training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is used on a recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively). |
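The abstract describes three building blocks: self-attentive pooling to aggregate features from multiple layers, a feature recalibration layer built from fully-connected layers with nonlinear activations, and deep length (L2) normalization of the resulting embedding. The following is a minimal pure-Python sketch of those three ideas; the toy dimensions and fixed weights are illustrative assumptions, not the authors' trained model.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def self_attentive_pool(frames, w):
    """Self-attentive pooling: score each layer-level feature vector
    (here score = dot(frame, w)), softmax the scores, and return the
    attention-weighted sum of the vectors."""
    scores = softmax([sum(fi * wi for fi, wi in zip(f, w)) for f in frames])
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(scores, frames)) for d in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recalibrate(feature, gate_w):
    """Feature recalibration: channel-wise sigmoid gating, standing in
    for the fully-connected + nonlinear-activation layers in the paper."""
    gates = [sigmoid(g * f) for g, f in zip(gate_w, feature)]
    return [g * f for g, f in zip(gates, feature)]

def length_normalize(feature):
    """Deep length normalization: scale the feature to unit L2 norm."""
    norm = math.sqrt(sum(f * f for f in feature)) or 1.0
    return [f / norm for f in feature]

# Toy usage: three 4-dim "features" standing in for different layers.
frames = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], [0.2, 0.2, 0.2, 0.2]]
pooled = self_attentive_pool(frames, w=[0.5, -0.5, 0.5, -0.5])
embed = length_normalize(recalibrate(pooled, gate_w=[1.0, 1.0, 1.0, 1.0]))
print(round(sum(v * v for v in embed), 6))  # unit norm -> 1.0
```

In the actual system these operations are applied inside a scaled-width ResNet with dropout and batch normalization during training; the sketch only shows the order of the three stages named in the abstract.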
first_indexed | 2024-03-10T15:32:07Z |
format | Article |
id | doaj.art-0416d5034ae74d72a10542d5b3c1c9ac |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-10T15:32:07Z |
publishDate | 2020-10-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
spelling | doaj.art-0416d5034ae74d72a10542d5b3c1c9ac | 2023-11-20T17:28:30Z | eng | MDPI AG | Electronics | 2079-9292 | 2020-10-01 | 9 (10): 1706 | 10.3390/electronics9101706 | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System | Soonshin Seo; Ji-Hwan Kim | Department of Computer Science and Engineering, Sogang University, Seoul 04107, Korea (both authors) | (abstract as in the description field above) | https://www.mdpi.com/2079-9292/9/10/1706 | text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding |
spellingShingle | Soonshin Seo; Ji-Hwan Kim | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System | Electronics | text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding |
title | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System |
title_full | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System |
title_fullStr | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System |
title_full_unstemmed | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System |
title_short | Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System |
title_sort | self attentive multi layer aggregation with feature recalibration and deep length normalization for text independent speaker verification system |
topic | text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding |
url | https://www.mdpi.com/2079-9292/9/10/1706 |
work_keys_str_mv | AT soonshinseo selfattentivemultilayeraggregationwithfeaturerecalibrationanddeeplengthnormalizationfortextindependentspeakerverificationsystem AT jihwankim selfattentivemultilayeraggregationwithfeaturerecalibrationanddeeplengthnormalizationfortextindependentspeakerverificationsystem |