Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to te...

Bibliographic Details
Main Authors: Nagrani, A; Chung, JS; Albanie, S; Zisserman, A
Format: Conference item
Language: English
Published: IEEE 2020