Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step required for developing an ASR system for a language with few argument speech documents available. Turkish is a language with limited r...

Bibliographic Details
Main Authors:	Huseyin Polat, Saadin Oyucu
Format:	Article
Language:	English
Published:	MDPI AG 2020-02-01
Series:	Symmetry
Subjects:	automatic speech recognition; speech corpus; text corpus; data acquisition; multi-layer neural network; natural language processing
Online Access:	https://www.mdpi.com/2073-8994/12/2/290

author	Huseyin Polat, Saadin Oyucu
collection	DOAJ
description	To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step required for developing an ASR system for a language with few argument speech documents available. Turkish is a language with limited resources for ASR. Therefore, development of a symmetric Turkish transcribed speech corpus, in line with the corpora of high-resource languages, is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed corpus preparation techniques for collecting Turkish speech data. In the presented approach, three different methods were used. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for the speech utterances obtained from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to the Grand National Assembly of Turkey session records (videotext). We also provide initial speech recognition results for Turkish acoustic models based on artificial neural networks and Gaussian mixture models. To train the models, the newly collected corpus and other existing corpora published by the Linguistic Data Consortium were used. In light of the test results on the other existing corpora, the current study showed the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident with increased verified data size, compensating for the status of Turkish as a low-resource language. The importance of the corpus and language model to the success of a Turkish ASR system is shown as a basis for further studies.
first_indexed	2024-04-11T12:16:38Z
format	Article
id	doaj.art-c93f1f67d9244c63b99105be7316b3e5
institution	Directory Open Access Journal
issn	2073-8994
language	English
last_indexed	2024-04-11T12:16:38Z
publishDate	2020-02-01
publisher	MDPI AG
record_format	Article
series	Symmetry
spelling	doaj.art-c93f1f67d9244c63b99105be7316b3e5; 2022-12-22T04:24:18Z; eng; MDPI AG; Symmetry; ISSN 2073-8994; 2020-02-01; vol. 12, no. 2, article 290; DOI 10.3390/sym12020290; Huseyin Polat and Saadin Oyucu, both of the Department of Computer Engineering, Faculty of Technology, Gazi University, 06560 Ankara, Turkey
title	Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results
topic	automatic speech recognition; speech corpus; text corpus; data acquisition; multi-layer neural network; natural language processing
url	https://www.mdpi.com/2073-8994/12/2/290

Similar Items