Reading to listen at the cocktail party: multi-modal speech separation
The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
Main Authors: | Rahimi, A; Afouras, T; Zisserman, A |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2022 |
id | oxford-uuid:0eced52d-a9a8-4659-8be6-8816d46f0b6d |
institution | University of Oxford |
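
For readers who want a concrete picture of the fusion pattern the abstract describes, below is a minimal PyTorch sketch of a waveform-domain separator conditioned on a second modality via cross-attention. It is an illustrative assumption, not the paper's actual model: the class and module names (`CrossModalSeparator`, `mask_head`), the learned conv front end, and all layer sizes are hypothetical choices made only to show the general idea of fusing audio frames with a conditioning stream.

```python
# A minimal, hypothetical sketch of cross-modal conditioning for waveform-domain
# speech separation. This is NOT the authors' architecture; the conv
# encoder/decoder, the cross-attention fusion, and all sizes are assumptions.
import torch
import torch.nn as nn


class CrossModalSeparator(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2, kernel=16, stride=8):
        super().__init__()
        # Learned 1D conv encoder/decoder operating directly on the raw
        # waveform (in place of an STFT front end).
        self.encoder = nn.Conv1d(1, dim, kernel_size=kernel, stride=stride)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=kernel, stride=stride)
        # Transformer layers over the encoded audio frames.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.audio_transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Cross-attention: audio frames query the conditioning stream
        # (lip-movement features, a face embedding, or text embeddings).
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, mixture, condition):
        # mixture:   (batch, samples)      raw waveform of the mixed speech
        # condition: (batch, steps, dim)   embedding sequence for the cue; a
        #                                  static cue (e.g. identity) has steps == 1
        feats = self.encoder(mixture.unsqueeze(1)).transpose(1, 2)   # (B, T, dim)
        feats = self.audio_transformer(feats)
        # Fuse: each audio frame attends over the conditioning sequence, so
        # synchronous (lip) and asynchronous (text/identity) cues share one path.
        fused, _ = self.cross_attn(query=feats, key=condition, value=condition)
        mask = self.mask_head(feats + fused)                         # (B, T, dim)
        est = self.decoder((feats * mask).transpose(1, 2))           # (B, 1, samples)
        return est.squeeze(1)


if __name__ == "__main__":
    model = CrossModalSeparator()
    mix = torch.randn(2, 16000)      # 1 s of 16 kHz mixed audio
    cond = torch.randn(2, 25, 256)   # e.g. 25 visual frames, or token embeddings
    print(model(mix, cond).shape)    # torch.Size([2, 16000])
```

Routing every cue, whether lip-movement frames, a repeated identity vector, or text-token embeddings, through the same key/value interface is one plausible way to realise the "synchronous or asynchronous" conditioning the abstract refers to, since the audio frames need not align one-to-one with the conditioning steps.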