Reading to listen at the cocktail party: multi-modal speech separation

The goal of this paper is speech separation and enhancement in multi-speaker and noisy environments using a combination of different modalities. Previous works have shown good performance when conditioning on temporal or static visual evidence such as synchronised lip movements or face identity. In this paper, we present a unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues. To that end we make the following contributions: (i) we design a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain; (ii) we propose conditioning on the textual content of a sentence alone or in combination with visual information; (iii) we demonstrate the robustness of our model to audio-visual synchronisation offsets; and (iv) we obtain state-of-the-art performance on the well-established benchmark datasets LRS2 and LRS3.
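
The abstract's central technical idea, fusing encoded raw-waveform audio with a conditioning stream such as text or lip-movement embeddings through attention, can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical example of cross-attention conditioning for waveform-domain mask-based separation; the class name, layer sizes, and overall layout are assumptions of mine and do not reproduce the paper's actual architecture.

```python
# Minimal, hypothetical PyTorch sketch (not the paper's actual model):
# encode the raw-waveform mixture with a learned 1-D conv bank, let the
# audio frames cross-attend to a conditioning sequence (e.g. text or lip
# embeddings), predict a multiplicative mask, and decode back to a waveform.
import torch
import torch.nn as nn

class CrossModalSeparator(nn.Module):
    def __init__(self, dim=256, n_heads=4, kernel=16, stride=8):
        super().__init__()
        # Learned analysis/synthesis filterbanks over the raw waveform.
        self.encoder = nn.Conv1d(1, dim, kernel_size=kernel, stride=stride)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=kernel, stride=stride)
        # Cross-attention: audio frames (queries) attend to the condition.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mask = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, mixture, cond):
        # mixture: (batch, samples) raw audio; cond: (batch, T_cond, dim).
        feats = self.encoder(mixture.unsqueeze(1)).transpose(1, 2)  # (B, T, dim)
        att, _ = self.attn(query=self.norm1(feats), key=cond, value=cond)
        x = feats + att                      # residual fusion of the condition
        x = x + self.ffn(self.norm2(x))      # pre-norm Transformer-style FFN
        masked = feats * self.mask(x)        # mask the encoded mixture
        return self.decoder(masked.transpose(1, 2)).squeeze(1)

# Toy usage: a 1-second 16 kHz mixture conditioned on a made-up
# 20-step, 256-dim embedding sequence (text or visual features).
model = CrossModalSeparator()
mixture = torch.randn(2, 16000)
condition = torch.randn(2, 20, 256)
estimate = model(mixture, condition)
print(estimate.shape)  # torch.Size([2, 16000])
```

Because the conditioning sequence is consumed only through cross-attention, the same sketch would accept either a text-embedding sequence or a visual-feature sequence as `cond`, which mirrors the paper's aim of handling synchronous and asynchronous cues within one framework.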

Bibliographic Details
Main Authors: Rahimi, A, Afouras, T, Zisserman, A
Format: Conference item
Language: English
Published: IEEE 2022
Collection: OXFORD
ID: oxford-uuid:0eced52d-a9a8-4659-8be6-8816d46f0b6d
Institution: University of Oxford