Utterance-level aggregation for speaker recognition in the wild

The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial.
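The abstract describes a dictionary-based NetVLAD/GhostVLAD layer that aggregates frame-level features from the trunk network across time. As an illustration only, below is a minimal PyTorch sketch of a GhostVLAD-style aggregation layer (NetVLAD with extra "ghost" clusters whose residuals are discarded). The class name, feature dimension, cluster counts, and the (batch, channels, freq, time) tensor layout are assumptions made for this sketch, not the configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """NetVLAD-style aggregation with 'ghost' clusters whose residuals
    are dropped from the output (a sketch, not the authors' code)."""

    def __init__(self, feat_dim=512, num_clusters=8, num_ghost=2):
        super().__init__()
        self.K = num_clusters
        total = num_clusters + num_ghost
        # 1x1 conv produces per-position soft-assignment logits over all clusters.
        self.assign = nn.Conv2d(feat_dim, total, kernel_size=1)
        # Learnable cluster centres; ghost centres are trained but discarded below.
        self.centers = nn.Parameter(0.01 * torch.randn(total, feat_dim))

    def forward(self, x):
        # x: (batch, feat_dim, freq, time) frame-level feature map from the trunk.
        soft = F.softmax(self.assign(x), dim=1).flatten(2)    # (B, K+G, N)
        feats = x.flatten(2)                                   # (B, D, N)
        # V[k, d] = sum_n a[k, n] * (feats[d, n] - centers[k, d])
        vlad = torch.einsum('bkn,bdn->bkd', soft, feats)
        vlad = vlad - soft.sum(dim=-1, keepdim=True) * self.centers
        vlad = F.normalize(vlad[:, :self.K], dim=2)            # drop ghosts, intra-normalise
        return F.normalize(vlad.flatten(1), dim=1)             # (B, K * D) utterance embedding

For example, GhostVLAD()(torch.randn(4, 512, 1, 300)) returns a (4, 4096) utterance-level embedding, and the time axis can be any length, which is the fixed-size-output property such an aggregation layer is meant to provide for variable-length utterances.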


Bibliographic Details
Main Authors: Xie, W, Nagrani, A, Chung, J, Zisserman, A
Format: Conference item
Published: IEEE 2019
author Xie, W
Nagrani, A
Chung, J
Zisserman, A
collection OXFORD
description The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial.
format Conference item
id oxford-uuid:7ab74cff-6c8d-4fdd-b024-4bd297a37e8d
institution University of Oxford
publishDate 2019
publisher IEEE
title Utterance-level aggregation for speaker recognition in the wild