Masked multi-center angular margin loss for language recognition

Abstract Language recognition based on embedding aims to maximize inter-class variance and minimize intra-class variance. Previous researches are limited to the training constraint of a single centroid, which cannot accurately describe the overall geometric characteristics of the embedding space. In...

Full description

Bibliographic Details
Main Authors: Minghang Ju, Yanyan Xu, Dengfeng Ke, Kaile Su
Format: Article
Language:English
Published: SpringerOpen 2022-07-01
Series:EURASIP Journal on Audio, Speech, and Music Processing
Subjects:
Online Access:https://doi.org/10.1186/s13636-022-00249-4
Description
Summary:Abstract Language recognition based on embedding aims to maximize inter-class variance and minimize intra-class variance. Previous researches are limited to the training constraint of a single centroid, which cannot accurately describe the overall geometric characteristics of the embedding space. In this paper, we propose a novel masked multi-center angular margin (MMAM) loss method from the perspective of multiple centroids, resulting in a better overall performance. Specifically, numerous global centers are used to jointly approximate entities of each class. To capture the local neighbor relationship effectively, a small number of centers are adapted to construct the similarity relationship between these centers and each entity. Furthermore, we use a new reverse label propagation algorithm to adjust neighbor relations according to the ground truth labels to learn a discriminative metric space in the classification process. Finally, an additive angular margin is added, which understands more discriminative language embeddings by simultaneously enhancing intra-class compactness and inter-class discrepancy. Experiments are conducted on the APSIPA 2017 Oriental Language Recognition (AP17-OLR) corpus. We compare the proposed MMAM method with seven state-of-the-art baselines and verify that our method has 26.2% and 31.3% relative improvements in the equal error rate (EER) and C avg respectively in the full-length test (“full-length” means the average duration of the utterances is longer than 5 s). Also, there are 31.2% and 29.3% relative improvements in the 3-s test and 14% and 14.8% relative improvements in the 1-s test.
ISSN:1687-4722