Chinese Text De-Colloquialization Technique Based on Back-Translation Strategy and End-to-End Learning

Bibliographic Details
Main Authors: Hongkai Liu, Zhonglin Ye, Haixing Zhao, Yanlin Yang
Format: Article
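Language: English
Published: MDPI AG 2023-09-01
Series: Applied Sciences
Subjects: neural machine translation; Chinese texts; de-colloquialism; deep learning; natural language processing
Online Access: https://www.mdpi.com/2076-3417/13/19/10818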
collection DOAJ
description With the growth of the Internet, the volume of textual information has increased significantly. When people write formal texts, however, they often carry over colloquial habits, which diminishes the professionalism and formality of the text. Existing research on Chinese text correction focuses primarily on misspelt characters that are visually or phonetically similar and on obvious grammatical errors such as redundancy, omission, and incorrect word order. Little research addresses text that is colloquial in expression yet free of apparent grammatical errors or misspelt characters. This article proposes a novel technique that uses deep learning to transform colloquial expressions directly into formal written expressions. First, a parallel corpus of written and spoken language is constructed using a back-translation strategy. Then, an end-to-end learning mechanism based on neural machine translation is applied, with colloquial text as the source language and written text as the target language, so that the model transforms colloquial text directly into text with a formal style. Finally, the approach is evaluated using the bilingual evaluation understudy (BLEU) metric and manual assessment. The experimental results indicate that the proposed technique performs well on the de-colloquialization of Chinese text. The paper makes two contributions. It proposes an automated method for collecting a substitute for manually annotated parallel corpora of spoken and written language, which saves time and reduces the manual cost of constructing the dataset. It also applies end-to-end learning techniques from neural machine translation to de-colloquialization, so that the trained model generates written language directly and flexibly from spoken-language input. Together these present a novel solution for the de-colloquialization of Chinese text.
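The two mechanical steps named in the description, pairing sentences through back-translation and scoring output with BLEU, can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several things are assumptions made only for illustration: the function names, the `to_pivot` and `from_pivot` translator callables (hypothetical placeholders for any machine translation system), the choice to treat the round-tripped sentence as the source and the original written sentence as the target (the record does not say which side comes from the round trip), and character-level tokens for BLEU with a single reference and no smoothing.

```python
from collections import Counter
from math import exp, log
from typing import Callable, Iterable, List, Tuple


def build_parallel_pairs(
    written_sentences: Iterable[str],
    to_pivot: Callable[[str], str],    # hypothetical translator into a pivot language
    from_pivot: Callable[[str], str],  # hypothetical translator back from the pivot
) -> List[Tuple[str, str]]:
    """Round-trip each sentence through a pivot language and pair it with the original."""
    pairs = []
    for written in written_sentences:
        round_trip = from_pivot(to_pivot(written))
        # Drop empty or unchanged round trips: they carry no style contrast.
        if round_trip and round_trip != written:
            pairs.append((round_trip, written))  # assumed order: (source, target)
    return pairs


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: one reference, character tokens, no smoothing."""
    cand, ref = list(candidate), list(reference)
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(log(overlap / total))
    # Brevity penalty: penalize candidates that are shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
    return penalty * exp(sum(log_precisions) / max_n)
```

With real translators plugged in, `build_parallel_pairs` would yield training pairs for a sequence-to-sequence model, and `bleu` would compare each model output against its written-language reference. Corpus-level BLEU and smoothing for short sentences are deliberately left out to keep the sketch small.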
first_indexed 2024-03-10T21:49:39Z
id doaj.art-8216858b289b420cad729cc9683927f9
institution Directory Open Access Journal
issn 2076-3417
last_indexed 2024-03-10T21:49:39Z
spelling Record doaj.art-8216858b289b420cad729cc9683927f9 (2023-11-19T14:04:38Z; eng). Applied Sciences, MDPI AG, ISSN 2076-3417, vol. 13, no. 19, article 10818, 2023-09-01. DOI: 10.3390/app131910818. Authors: Hongkai Liu, Zhonglin Ye, Haixing Zhao, Yanlin Yang (all four: College of Computer, Qinghai Normal University, Xining 810008, China).
topic neural machine translation
Chinese texts
de-colloquialism
deep learning
natural language processing