Deep bottleneck features for spoken language identification.

Bibliographic Details
Main Authors: Bing Jiang, Yan Song, Si Wei, Jun-Hua Liu, Ian Vince McLoughlin, Li-Rong Dai
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2014-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24983963/pdf/?tool=EBI
Collection: DOAJ
Description: A key problem in spoken language identification (LID) is to design effective representations that are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short-duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore, they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF-based i-vector representation for each speech utterance. Results on the NIST Language Recognition Evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an equal error rate (EER) of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system is proposed.
ISSN: 1932-6203
Volume/Issue: 9(7), e100795
DOI: 10.1371/journal.pone.0100795
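
The description above reports results as equal error rate (EER), the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. The following is a minimal sketch of how such a figure can be computed from detection scores. The function name `equal_error_rate` and the toy scores are assumptions made for illustration; they are not taken from the article, and this is not the NIST scoring procedure.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where miss and false-alarm rates are closest.

    Hypothetical helper for illustration only. A trial is accepted when its
    score is greater than or equal to the threshold.
    """
    # Candidate thresholds: every distinct score that occurs in either set.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # Miss rate: fraction of target trials rejected at this threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: fraction of non-target trials accepted.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        # Keep the first threshold with the smallest gap between the two rates.
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2.0, t)
    return best[1], best[2]


# Toy scores (made up): higher means "more likely the claimed language".
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]

eer, thr = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

With real evaluation data the two error curves rarely cross exactly at an observed score, so practical scoring tools usually interpolate between neighbouring thresholds; the sketch simply averages the two rates at the closest point.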