Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to te...
Main Authors: | , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: |
IEEE
2020
|