Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances the active speaker detection performance.

Bibliographic Details
Main Authors: Rahul Sharma, Shrikanth Narayanan
Format: Article
Language: English
Published: IEEE 2023-01-01
Series:IEEE Open Journal of Signal Processing
Subjects: Active speaker detection; character identity; cross-modal; speaker recognition
Online Access: https://ieeexplore.ieee.org/document/10102534/
author Rahul Sharma
Shrikanth Narayanan
collection DOAJ
description Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets–the AVA active speaker (movies) and Visual Person Clustering Dataset (TV shows)–we show that a simple late fusion of the two approaches enhances the active speaker detection performance.
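The abstract reports that a simple late fusion of the two independent score streams (audio-visual activity and cross-modal identity association) improves active speaker detection. A minimal sketch of such score-level late fusion is shown below; the `weight` parameter and the weighted-average rule are illustrative assumptions, not the paper's exact fusion formulation.

```python
import numpy as np

def late_fusion(av_activity_scores, identity_scores, weight=0.5):
    """Combine per-face scores from the two approaches by weighted averaging.

    `weight` balances the audio-visual activity score against the
    cross-modal identity score. This is a hypothetical fusion rule for
    illustration; the paper may use a different combination.
    """
    a = np.asarray(av_activity_scores, dtype=float)
    b = np.asarray(identity_scores, dtype=float)
    return weight * a + (1.0 - weight) * b

# Example: scores for three candidate faces in one video frame.
fused = late_fusion([0.9, 0.2, 0.4], [0.6, 0.3, 0.8])
active = int(np.argmax(fused))  # index of the predicted active speaker
```

With equal weighting, the fused scores here are [0.75, 0.25, 0.6], so face 0 is selected as the active speaker; in practice the weight would be tuned on held-out data.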
format Article
id doaj.art-6e8a93eb3ebc445587bc29df925306a6
institution Directory Open Access Journal
issn 2644-1322
language English
publishDate 2023-01-01
publisher IEEE
series IEEE Open Journal of Signal Processing
doi 10.1109/OJSP.2023.3267269
article_number 10102534
volume 4
pages 225-232
author_orcid Rahul Sharma https://orcid.org/0000-0003-1697-3897
author_orcid Shrikanth Narayanan https://orcid.org/0000-0002-1052-6204
affiliation University of Southern California, Los Angeles, CA, USA (both authors)
title Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
topic Active speaker detection
character identity
cross-modal
speaker recognition
url https://ieeexplore.ieee.org/document/10102534/