Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to te...


Bibliographic Details
Main Authors:	Nagrani, A; Chung, JS; Albanie, S; Zisserman, A
Format:	Conference item
Language:	English
Published:	IEEE 2020

