Attention-Based Temporal-Frequency Aggregation for Speaker Verification

Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems through their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is a key component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods pool the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method that focuses on the key frequency bands contributing most to the utterance-level representation. Furthermore, two more effective temporal-frequency aggregation methods are derived by combining it with existing temporal aggregation methods. These two methods capture the speaker-dependent information in both the time and frequency domains of the frame-level features, thereby improving the discriminability of the speaker embedding. In addition, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results show that the CNN-based SV system using the temporal-frequency aggregation method achieves an equal error rate of 5.96% on VoxCeleb, outperforming state-of-the-art baseline models.
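
The abstract gives only a high-level description of the aggregation scheme, so the sketch below is one plausible reading of attention over frequency bands, not the authors' implementation. All names (`FreqAttentionPool`, `hidden_dim`) and the choice of a scoring MLP with softmax normalization are illustrative assumptions; the paper's exact attention formulation and its two combined temporal-frequency variants may differ.

```python
# Minimal PyTorch sketch (not the authors' code): attention over the
# frequency axis of CNN feature maps, followed by simple temporal pooling.
import torch
import torch.nn as nn


class FreqAttentionPool(nn.Module):
    """Collapse the frequency axis with learned attention weights.

    Input:  (batch, channels, freq, time) frame-level CNN features.
    Output: (batch, channels, time) frequency-aggregated features.
    """

    def __init__(self, channels: int, hidden_dim: int = 128):
        super().__init__()
        # Small scoring MLP; layer sizes are illustrative assumptions.
        self.score = nn.Sequential(
            nn.Linear(channels, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Treat each (freq, time) cell as a token with `channels` features.
        tokens = x.permute(0, 2, 3, 1)           # (b, f, t, c)
        scores = self.score(tokens).squeeze(-1)  # (b, f, t)
        # Softmax over frequency, so informative bands get larger weights.
        weights = torch.softmax(scores, dim=1)   # (b, f, t)
        # Weighted sum collapses the frequency axis.
        return (x * weights.unsqueeze(1)).sum(dim=2)  # (b, c, t)


# Usage: frequency aggregation first, then a temporal aggregation step
# (here plain mean pooling) yields the utterance-level embedding.
feats = torch.randn(4, 256, 40, 200)         # (batch, channels, freq, time)
frame_level = FreqAttentionPool(256)(feats)  # (4, 256, 200)
embedding = frame_level.mean(dim=-1)         # (4, 256)
```

Swapping the final mean for attentive or statistics pooling over time would mirror the paper's idea of pairing the frequency attention with an existing temporal aggregation method.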

Bibliographic Details
Main Authors: Meng Wang, Dazheng Feng, Tingting Su, Mohan Chen (all: National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China)
Format: Article
Language: English
Published: MDPI AG, 2022-03-01
Series: Sensors, Vol. 22, Iss. 6, Art. 2147
ISSN: 1424-8220
DOI: 10.3390/s22062147
Collection: DOAJ (Directory of Open Access Journals)
Subjects: convolutional neural networks; speaker verification; temporal-frequency aggregation; self-attention
Online Access: https://www.mdpi.com/1424-8220/22/6/2147