Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision

Bilingual lexicon extraction is useful, especially for low-resource languages that can draw on high-resource languages. Uyghur is a morphologically derivative (agglutinative) language, and its language resources are scarce and noisy. Moreover, it is difficult to find a bilingual resource to utilize the linguisti...

Bibliographic Details
Main Authors:	Anwar Aysa, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla
Format:	Article
Language:	English
Published:	MDPI AG 2022-03-01
Series:	Information
Subjects:	bilingual dictionary; seed dictionary; cross-language word embedding
Online Access:	https://www.mdpi.com/2078-2489/13/4/175

_version_	1797434513495687168
author	Anwar Aysa, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla
author_facet	Anwar Aysa, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla
author_sort	Anwar Aysa
collection	DOAJ
description	Bilingual lexicon extraction is useful, especially for low-resource languages that can draw on high-resource languages. Uyghur is a morphologically derivative (agglutinative) language, and its language resources are scarce and noisy. Moreover, it is difficult to find bilingual resources that let it exploit the linguistic knowledge of high-resource languages such as Chinese or English. There is little related research on unsupervised extraction for the Chinese-Uyghur language pair, and existing methods mainly focus on term extraction from translated parallel corpora. Accordingly, unsupervised knowledge extraction methods are effective, especially for low-resource languages. This paper proposes a method to extract a Chinese-Uyghur bilingual dictionary by combining the inter-word relationship matrix mapped by neural-network cross-language word embedding vectors. A seed dictionary serves as a weak supervision signal. A small Chinese-Uyghur parallel data resource is used to map the multilingual word vectors into a unified vector space. Because the word units of the two languages do not align well, stems are used as the main linguistic units. The strong inter-word semantic relationships captured by word vectors are used to associate Chinese and Uyghur semantic information. Two retrieval criteria, nearest neighbor retrieval and cross-domain similarity local scaling (CSLS), are used to compute similarity and extract bilingual dictionaries. The experimental results show that the proposed Chinese-Uyghur bilingual dictionary extraction method reaches an accuracy of 65.06%. This method helps to improve Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translation.
first_indexed	2024-03-09T10:34:25Z
format	Article
id	doaj.art-ad3318c94593407fbc16463e524c0de7
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-03-09T10:34:25Z
publishDate	2022-03-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-ad3318c94593407fbc16463e524c0de7 | 2023-12-01T21:05:15Z | eng | MDPI AG | Information | 2078-2489 | 2022-03-01 | 13 | 4 | 175 | 10.3390/info13040175 | Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision | Anwar Aysa, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla (all: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China) | https://www.mdpi.com/2078-2489/13/4/175 | bilingual dictionary; seed dictionary; cross-language word embedding
title	Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision
title_sort	chinese uyghur bilingual lexicon extraction based on weak supervision
topic	bilingual dictionary; seed dictionary; cross-language word embedding
url	https://www.mdpi.com/2078-2489/13/4/175
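
The description field above outlines a pipeline: a seed dictionary maps two embedding spaces into one, and translations are then retrieved by nearest neighbor or by cross-domain similarity local scaling (CSLS). The following is a minimal sketch of that general idea, not the authors' implementation. It rests on several assumptions: the record does not say how the mapping is learned, so an orthogonal Procrustes solution is swapped in as one common choice; the function names (`normalize`, `learn_mapping`, `nn_retrieve`, `csls_retrieve`), the neighborhood size `k`, and the synthetic random vectors standing in for Chinese and Uyghur stem embeddings are all hypothetical.

```python
import numpy as np


def normalize(X):
    """L2-normalize rows so that dot products equal cosine similarities."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def learn_mapping(src_seed, tgt_seed):
    """Orthogonal W minimizing ||src_seed @ W - tgt_seed||_F (Procrustes).

    Assumption: the source record does not name the mapping method.
    """
    U, _, Vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return U @ Vt


def nn_retrieve(src, tgt):
    """Nearest-neighbor retrieval: best target index for each source row."""
    return (src @ tgt.T).argmax(axis=1)


def csls_retrieve(src, tgt, k=10):
    """CSLS retrieval: score(x, y) = 2*cos(x, y) - r_x - r_y."""
    sims = src @ tgt.T
    k = min(k, *sims.shape)
    # Mean similarity of each source word to its k nearest target words.
    r_x = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words
    # (computed over the query rows only, a simplification).
    r_y = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    return (2 * sims - r_x[:, None] - r_y[None, :]).argmax(axis=1)


# Synthetic stand-in data: "target" vectors and a rotated copy as "source".
rng = np.random.default_rng(0)
n_words, dim, n_seed = 60, 20, 30
tgt_vecs = normalize(rng.normal(size=(n_words, dim)))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ rotation.T  # so src_vecs @ rotation == tgt_vecs

# The first n_seed aligned rows play the role of the seed dictionary.
W = learn_mapping(src_vecs[:n_seed], tgt_vecs[:n_seed])
mapped = normalize(src_vecs @ W)

truth = np.arange(n_words)
print("NN accuracy:  ", (nn_retrieve(mapped, tgt_vecs) == truth).mean())
print("CSLS accuracy:", (csls_retrieve(mapped, tgt_vecs, k=5) == truth).mean())
```

On real embeddings the two spaces are not exact rotations of each other, so the accuracies printed for this toy data say nothing about the 65.06% figure reported in the record. CSLS subtracts each word's average neighborhood similarity, which is intended to penalize "hub" words that are close to many queries.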

Similar Items