Modeling Long-Term Multimodal Representations for Active Speaker Detection With Spatio-Positional Encoder

In this study, we present an end-to-end framework for active speaker detection to achieve robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations based on the audio and visual signals of a single candidate. Firstly, we propose a spatio-positional encoder to effectively address the problem of false detections caused by indiscernible faces in a video frame. Secondly, we present an efficient multimodal approach that models the long-term temporal contextual interactions between audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves the robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.
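The abstract names a spatio-positional encoder as the component that suppresses false detections from indiscernible faces, but this record does not describe its design. Below is a minimal, hypothetical sketch of one plausible reading, assuming the encoder maps a candidate face's normalized bounding box (center, width, height) to an embedding via the standard sinusoidal recipe; every name and shape here is an illustrative assumption, not the authors' implementation.

import math
import torch
import torch.nn as nn

class SpatioPositionalEncoder(nn.Module):
    """Hypothetical sketch: embed a normalized face bounding box.

    Assumes input boxes of shape (batch, 4) holding (cx, cy, w, h)
    in [0, 1]; the paper's actual encoder may differ.
    """

    def __init__(self, d_model: int = 128, temperature: float = 10000.0):
        super().__init__()
        # 4 coordinates, each encoded with sin and cos at d_model/8 frequencies.
        assert d_model % 8 == 0, "d_model must split evenly over 4 box coords"
        self.d_model = d_model
        self.temperature = temperature
        # Small learned projection so the encoding can adapt to the visual stream.
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        num_freqs = self.d_model // 8
        freqs = torch.arange(num_freqs, device=boxes.device, dtype=boxes.dtype)
        freqs = self.temperature ** (freqs / num_freqs)  # geometric frequency ladder
        # (batch, 4, num_freqs): each box coordinate scaled by each frequency.
        angles = boxes.unsqueeze(-1) * 2 * math.pi / freqs
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, 4, num_freqs*2)
        return self.proj(enc.flatten(1))  # (batch, d_model)

# Usage: embed a candidate face near the frame's left edge.
encoder = SpatioPositionalEncoder(d_model=128)
box = torch.tensor([[0.1, 0.5, 0.15, 0.3]])  # (cx, cy, w, h), normalized
emb = encoder(box)  # shape: (1, 128)

Such an embedding could then be added to the candidate's visual features before the long-term audio-visual temporal modeling the abstract describes; that fusion step is likewise an assumption here.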

Bibliographic Details
Main Authors: Minyoung Kyoung (ORCID: https://orcid.org/0009-0006-2814-9386), Hwa Jeon Song (ORCID: https://orcid.org/0000-0002-8216-4812)
Affiliation: Electronics and Telecommunications Research Institute (ETRI), Daejeon, Republic of Korea
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access, vol. 11, pp. 116561-116569
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3325474
Subjects: Active speaker detection; audio-visual; multimodal representations; multi-speaker
Online Access: https://ieeexplore.ieee.org/document/10287283/