Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.

Bibliographic Details
Main Authors: Rahul Sharma, Shrikanth Narayanan
Format: Article
Language: English
Published: IEEE 2023-01-01
Series:IEEE Open Journal of Signal Processing
Subjects: Active speaker detection; character identity; cross-modal; speaker recognition
Online Access: https://ieeexplore.ieee.org/document/10102534/
author Rahul Sharma
Shrikanth Narayanan
collection DOAJ
description Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets–the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows)–we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
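The abstract reports that a simple late fusion of the two independent score streams (audio-visual activity and cross-modal identity association) improves active speaker detection. A minimal sketch of such score-level late fusion is shown below; the `weight` parameter and the weighted-average rule are illustrative assumptions, not the paper's exact fusion formulation.

```python
import numpy as np

def late_fusion(av_activity_scores, identity_scores, weight=0.5):
    """Combine per-face scores from the two approaches by weighted averaging.

    `weight` balances the audio-visual activity score against the
    cross-modal identity score. This is a hypothetical fusion rule for
    illustration; the paper may use a different combination.
    """
    a = np.asarray(av_activity_scores, dtype=float)
    b = np.asarray(identity_scores, dtype=float)
    return weight * a + (1.0 - weight) * b

# Example: scores for three candidate faces in one video frame.
fused = late_fusion([0.9, 0.2, 0.4], [0.6, 0.3, 0.8])
active = int(np.argmax(fused))  # index of the predicted active speaker
```

With equal weighting, the fused scores here are [0.75, 0.25, 0.6], so face 0 is selected as the active speaker; in practice the weight would be tuned on held-out data.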
format Article
id doaj.art-6e8a93eb3ebc445587bc29df925306a6
institution Directory Open Access Journal
issn 2644-1322
language English
publishDate 2023-01-01
publisher IEEE
series IEEE Open Journal of Signal Processing
doi 10.1109/OJSP.2023.3267269
article_number 10102534
volume 4
pages 225-232
author_orcid Rahul Sharma https://orcid.org/0000-0003-1697-3897
author_orcid Shrikanth Narayanan https://orcid.org/0000-0002-1052-6204
affiliation University of Southern California, Los Angeles, CA, USA (both authors)
title Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
topic Active speaker detection
character identity
cross-modal
speaker recognition
url https://ieeexplore.ieee.org/document/10102534/