Utterance-level aggregation for speaker recognition in the wild
The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial.
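The dictionary-based aggregation the abstract refers to can be illustrated with a minimal NumPy sketch of GhostVLAD-style pooling: frame-level features are soft-assigned to learned cluster centers, residuals are accumulated per cluster, and "ghost" clusters absorb noisy frames but are dropped from the output, yielding a fixed-length utterance descriptor regardless of duration. All names, shapes, and the choice of NumPy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ghostvlad(features, centers, assign_w, assign_b, n_ghost=2):
    """GhostVLAD-style aggregation of T frame-level features (T, D).

    centers:  (K + n_ghost, D) cluster centers, ghost clusters last
    assign_w: (D, K + n_ghost) soft-assignment weights
    assign_b: (K + n_ghost,)   soft-assignment biases
    Returns a fixed-length (K * D,) utterance descriptor.
    """
    # Soft-assign each frame to every cluster (softmax over clusters).
    logits = features @ assign_w + assign_b                 # (T, K+G)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                       # (T, K+G)

    # Accumulate weighted residuals: V[k] = sum_t a[t, k] * (x_t - c_k).
    residuals = features[:, None, :] - centers[None, :, :]  # (T, K+G, D)
    V = (a[:, :, None] * residuals).sum(axis=0)             # (K+G, D)

    # Drop the ghost clusters: they soak up uninformative frames
    # during assignment but do not contribute to the descriptor.
    V = V[: centers.shape[0] - n_ghost]                     # (K, D)

    # Intra-normalize per cluster, flatten, then L2-normalize overall.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the temporal sum runs over however many frames are present, utterances of any length map to the same-sized descriptor, which is what makes this layer suitable for variable-length `in the wild' input.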
Main authors: Xie, W; Nagrani, A; Chung, J; Zisserman, A
Format: Conference item
Published: IEEE, 2019
author | Xie, W; Nagrani, A; Chung, J; Zisserman, A |
collection | OXFORD |
description | The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial. |
format | Conference item |
id | oxford-uuid:7ab74cff-6c8d-4fdd-b024-4bd297a37e8d |
institution | University of Oxford |
publishDate | 2019 |
publisher | IEEE |
title | Utterance-level aggregation for speaker recognition in the wild |