Utterance-level aggregation for speaker recognition in the wild
The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial.
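The dictionary-based aggregation the abstract refers to can be illustrated with a minimal NumPy sketch of GhostVLAD-style pooling: frame-level features are soft-assigned to learned cluster centers, residuals are accumulated per cluster, and "ghost" clusters absorb noisy frames but are dropped from the output, yielding a fixed-length utterance descriptor regardless of duration. All names, shapes, and the choice of NumPy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ghostvlad(features, centers, assign_w, assign_b, n_ghost=2):
    """GhostVLAD-style aggregation of T frame-level features (T, D).

    centers:  (K + n_ghost, D) cluster centers, ghost clusters last
    assign_w: (D, K + n_ghost) soft-assignment weights
    assign_b: (K + n_ghost,)   soft-assignment biases
    Returns a fixed-length (K * D,) utterance descriptor.
    """
    # Soft-assign each frame to every cluster (softmax over clusters).
    logits = features @ assign_w + assign_b                 # (T, K+G)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                       # (T, K+G)

    # Accumulate weighted residuals: V[k] = sum_t a[t, k] * (x_t - c_k).
    residuals = features[:, None, :] - centers[None, :, :]  # (T, K+G, D)
    V = (a[:, :, None] * residuals).sum(axis=0)             # (K+G, D)

    # Drop the ghost clusters: they soak up uninformative frames
    # during assignment but do not contribute to the descriptor.
    V = V[: centers.shape[0] - n_ghost]                     # (K, D)

    # Intra-normalize per cluster, flatten, then L2-normalize overall.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because the temporal sum runs over however many frames are present, utterances of any length map to the same-sized descriptor, which is what makes this layer suitable for variable-length `in the wild' input.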
Main authors: Xie, W; Nagrani, A; Chung, J; Zisserman, A
Format: Conference item
Published: IEEE, 2019
author | Xie, W; Nagrani, A; Chung, J; Zisserman, A |
collection | OXFORD |
description | The objective of this paper is speaker recognition `in the wild' - where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a `thin-ResNet' trunk architecture, and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state of the art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for `in the wild' data, a longer length is beneficial. |
format | Conference item |
id | oxford-uuid:7ab74cff-6c8d-4fdd-b024-4bd297a37e8d |
institution | University of Oxford |
publishDate | 2019 |
publisher | IEEE |
title | Utterance-level aggregation for speaker recognition in the wild |