Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Each approach has its limitations: audio-visual activity models get confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker-identity-based methods are limited to videos with enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework that guides the speakers' cross-modal identity association with audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances active speaker detection performance.
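The abstract reports that a simple late fusion of the two score streams (audio-visual activity and cross-modal identity association) improves detection, but does not specify the fusion rule. The weighted-average sketch below, including all function and variable names, is therefore a hypothetical illustration of score-level late fusion per face track, not the paper's actual formulation.

```python
# Hypothetical sketch of score-level late fusion for active speaker
# detection. The weighted average and all names here are assumptions
# for illustration; the paper's fusion rule may differ.
from typing import Dict

def fuse_scores(
    av_activity: Dict[str, float],     # face-track id -> audio-visual activity score in [0, 1]
    identity_assoc: Dict[str, float],  # face-track id -> cross-modal identity score in [0, 1]
    weight: float = 0.5,               # relative weight given to the activity stream
) -> Dict[str, float]:
    """Combine two independently produced speaking scores per face track."""
    fused = {}
    # Only fuse tracks scored by both models; sort for deterministic output.
    for track in sorted(av_activity.keys() & identity_assoc.keys()):
        fused[track] = weight * av_activity[track] + (1.0 - weight) * identity_assoc[track]
    return fused

# Example: an activity model fooled by chewing on track "b" is pulled
# down by a disagreeing identity-association score.
print(fuse_scores({"a": 0.9, "b": 0.8}, {"a": 0.85, "b": 0.2}))
# {'a': 0.875, 'b': 0.5}
```

This reflects the complementarity argument in the abstract: where one information source is unreliable, the other stream can correct the combined score without retraining either model.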
Main Authors: | Rahul Sharma, Shrikanth Narayanan |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Open Journal of Signal Processing |
Subjects: | Active speaker detection; character identity; cross-modal; speaker recognition |
Online Access: | https://ieeexplore.ieee.org/document/10102534/ |
_version_ | 1797832768295534592 |
---|---|
author | Rahul Sharma; Shrikanth Narayanan |
author_facet | Rahul Sharma; Shrikanth Narayanan |
author_sort | Rahul Sharma |
collection | DOAJ |
description | Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Each approach has its limitations: audio-visual activity models get confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker-identity-based methods are limited to videos with enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework that guides the speakers' cross-modal identity association with audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances active speaker detection performance. |
first_indexed | 2024-04-09T14:14:08Z |
format | Article |
id | doaj.art-6e8a93eb3ebc445587bc29df925306a6 |
institution | Directory Open Access Journal |
issn | 2644-1322 |
language | English |
last_indexed | 2024-04-09T14:14:08Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Open Journal of Signal Processing |
spelling | doaj.art-6e8a93eb3ebc445587bc29df925306a6 | 2023-05-05T23:00:32Z | eng | IEEE | IEEE Open Journal of Signal Processing | ISSN 2644-1322 | 2023-01-01 | Vol. 4, pp. 225-232 | DOI 10.1109/OJSP.2023.3267269 | IEEE document 10102534 | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection | Rahul Sharma (https://orcid.org/0000-0003-1697-3897), University of Southern California, Los Angeles, CA, USA; Shrikanth Narayanan (https://orcid.org/0000-0002-1052-6204), University of Southern California, Los Angeles, CA, USA | Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information for deriving such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. Each approach has its limitations: audio-visual activity models get confused by other frequently occurring vocal activities, such as laughing and chewing, while speaker-identity-based methods are limited to videos with enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework that guides the speakers' cross-modal identity association with audio-visual activity for active speaker detection. Through experiments on entertainment media videos from two benchmark datasets, the AVA active speaker dataset (movies) and the Visual Person Clustering Dataset (TV shows), we show that a simple late fusion of the two approaches enhances active speaker detection performance. | https://ieeexplore.ieee.org/document/10102534/ | Active speaker detection; character identity; cross-modal; speaker recognition |
spellingShingle | Rahul Sharma; Shrikanth Narayanan | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection | IEEE Open Journal of Signal Processing | Active speaker detection; character identity; cross-modal; speaker recognition |
title | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection |
title_full | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection |
title_fullStr | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection |
title_full_unstemmed | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection |
title_short | Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection |
title_sort | audio visual activity guided cross modal identity association for active speaker detection |
topic | Active speaker detection; character identity; cross-modal; speaker recognition |
url | https://ieeexplore.ieee.org/document/10102534/ |
work_keys_str_mv | AT rahulsharma audiovisualactivityguidedcrossmodalidentityassociationforactivespeakerdetection AT shrikanthnarayanan audiovisualactivityguidedcrossmodalidentityassociationforactivespeakerdetection |