RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION

Bibliographic Details
Main Author:	Olesia Barkovska
Format:	Article
Language:	English
Published:	Kharkiv National University of Radio Electronics 2023-01-01
Series:	Сучасний стан наукових досліджень та технологій в промисловості
Subjects:	STT, text, processing, summary, audiofile, model
Online Access:	https://itssi-journal.com/index.php/ittsi/article/view/351

_version_	1797962908077916160
author	Olesia Barkovska
author_facet	Olesia Barkovska
author_sort	Olesia Barkovska
collection	DOAJ
description	The subject of the article is the module that converts a speaker’s speech into text within the proposed model of automatic speech annotation, which has become increasingly popular in Ukraine over the last two years because communication and education, as well as workshops, interviews and discussions of urgent issues, have actively moved online. Furthermore, users of personal educational platforms cannot always join online meetings on time for various reasons (a blackout, for example), which explains the need to save speakers’ presentations as audio files. The goal of the work is to eliminate false or corrupt data when converting the audio sequence into the corresponding text for further semantic analysis. To achieve this goal, the following tasks were solved: a generalized model for summarizing incoming audio data was proposed; existing STT models (for turning audio data into text) were analyzed; the ability of the STT module to operate in Ukrainian was studied; and the efficiency and timing of the STT module for English and Ukrainian were evaluated. The proposed model of automatic speech annotation has two major functional modules: a speech-to-text (STT) module and a summarization (SUM) module. For the STT module, the following models of linguistic text analysis were researched and improved: wav2vec2-xls-r-1bz for English and the Ukrainian STT model (wav2vec2-xls-r-1b-uk-with-lm) for Ukrainian. Artificial neural networks were used as the mathematical apparatus in the models under consideration. The following results were obtained: the word error rate (WER) was reduced by almost 1.5 times, which affects the quality of word recognition from audio and may lead to higher-quality output text.
To estimate the timing of the STT module, three English and Ukrainian audio recordings of different lengths (5 s, ~60 s and ~240 s) were analyzed. The results showed a clear trend: the computational power of the NVIDIA Tesla T4 graphics accelerator speeds up production of the output file for the longest recording. Conclusions: using a deep neural network for noise reduction in the input file is justified, as it improves the WER metric by almost 25%, while greater graphics processor computing power and a larger number of stream processors provide acceleration only for large input audio files. The author’s further research will focus on the efficiency of methods for the module that summarizes the obtained text.
first_indexed	2024-04-11T01:20:07Z
format	Article
id	doaj.art-f90806a906c44e4ab309cea592ccf883
institution	Directory Open Access Journal
issn	2522-9818 2524-2296
language	English
last_indexed	2024-04-11T01:20:07Z
publishDate	2023-01-01
publisher	Kharkiv National University of Radio Electronics
record_format	Article
series	Сучасний стан наукових досліджень та технологій в промисловості
spelling	doaj.art-f90806a906c44e4ab309cea592ccf883 2023-01-03T13:04:19Z eng Kharkiv National University of Radio Electronics Сучасний стан наукових досліджень та технологій в промисловості 2522-9818 2524-2296 2023-01-01 4 (22) 10.30837/ITSSI.2022.22.005 RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION Olesia Barkovska (Kharkiv National University of Radio Electronics) https://itssi-journal.com/index.php/ittsi/article/view/351 STT text processing summary audiofile model
spellingShingle	Olesia Barkovska RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION Сучасний стан наукових досліджень та технологій в промисловості STT text processing summary audiofile model
title	RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION
title_full	RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION
title_fullStr	RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION
title_full_unstemmed	RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION
title_short	RESEARCH INTO SPEECH-TO-TEXT TRANFROMATION MODULE IN THE PROPOSED MODEL OF A SPEAKER’S AUTOMATIC SPEECH ANNOTATION
title_sort	research into speech to text tranfromation module in the proposed model of a speaker s automatic speech annotation
topic	STT text processing summary audiofile model
url	https://itssi-journal.com/index.php/ittsi/article/view/351
work_keys_str_mv	AT olesiabarkovska researchintospeechtotexttranfromationmoduleintheproposedmodelofaspeakersautomaticspeechannotation
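
Note on the WER metric named in the description: the record does not say how the author computed it, so the following is only a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; not the article's implementation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value is better, so a model improvement shows up as a decrease in this score.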

Similar Items