Modeling Long-Term Multimodal Representations for Active Speaker Detection With Spatio-Positional Encoder
In this study, we present an end-to-end framework for active speaker detection to achieve robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations based on the audio and visual signals of a single candidate. Firstly, we propose a spatio-positional encoder to effectively address the problem of false detections caused by indiscernible faces in a video frame. Secondly, we present an efficient multimodal approach that models the long-term temporal contextual interactions between audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves the robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.

Main Authors: | Minyoung Kyoung (ORCID: 0009-0006-2814-9386); Hwa Jeon Song (ORCID: 0000-0002-8216-4812) |
---|---|
Affiliation: | Electronics and Telecommunications Research Institute (ETRI), Daejeon, Republic of Korea |
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access, vol. 11, pp. 116561-116569 |
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2023.3325474 |
Subjects: | Active speaker detection; audio-visual; multimodal representations; multi-speaker |
Collection: | DOAJ (record: doaj.art-a7ea5b0d83074c6c8bd508db958b69cb) |
Online Access: | https://ieeexplore.ieee.org/document/10287283/ |
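The abstract's first contribution is a spatio-positional encoder that suppresses false detections from indiscernible faces. The paper itself is behind the link above, so the following is only a minimal sketch of the general idea, assuming the encoder embeds each candidate face's normalized bounding-box position and size into the visual feature space; the module name, the (cx, cy, w, h) input format, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: embed a candidate face's location/size in the frame so a
# model can discount small or peripheral (often indiscernible) faces.
# The (cx, cy, w, h) box input and all dimensions are assumptions.
import torch
import torch.nn as nn

class SpatioPositionalEncoder(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        # Small MLP lifting the 4-d normalized box into the feature space.
        self.proj = nn.Sequential(
            nn.Linear(4, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, time, 4) -- per-frame (cx, cy, w, h) in [0, 1].
        # Output is meant to be added to per-frame visual features.
        return self.proj(boxes)  # (batch, time, d_model)

# Usage: encode one candidate's boxes across a 200-frame clip.
enc = SpatioPositionalEncoder()
pos = enc(torch.rand(2, 200, 4))
print(pos.shape)  # torch.Size([2, 200, 128])
```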
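The second contribution models long-term temporal interactions between the audio and visual streams of a single candidate. Below is a minimal cross-attention sketch of that kind of fusion, assuming transformer-style layers; the fusion scheme, layer counts, dimensions, and the per-frame classification head are illustrative assumptions rather than the paper's architecture.

```python
# Hedged sketch: visual features of one candidate attend to the audio
# sequence, then a temporal encoder models long-range context before a
# per-frame speaking/not-speaking head. Fusion scheme and sizes are assumed.
import torch
import torch.nn as nn

class LongTermAVFusion(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # per-frame active-speaker logit

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual, audio: (batch, time, d_model) from upstream encoders.
        fused, _ = self.cross_attn(visual, audio, audio)  # visual queries audio
        fused = self.temporal(fused)                      # long-term context
        return self.head(fused).squeeze(-1)               # (batch, time) logits

# Usage on dummy features for an 8 s clip at 25 fps (200 frames).
model = LongTermAVFusion()
logits = model(torch.randn(2, 200, 128), torch.randn(2, 200, 128))
print(logits.shape)  # torch.Size([2, 200])
```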