A study on the classification of stylistic and formal features in English based on corpus data testing

Traditional algorithms that combine statistics and rules do not measure the internal cohesion of words, and the N-gram algorithm places no limit on the length N, which produces a large number of invalid word strings, consumes time, and reduces experimental efficiency. Therefore,...

Bibliographic Details
Main Author:	Shuhui Li
Format:	Article
Language:	English
Published:	PeerJ Inc. 2023-04-01
Series:	PeerJ Computer Science
Subjects:	N-gram algorithm; English; Neologisms; Corpus; PMI
Online Access:	https://peerj.com/articles/cs-1297.pdf

collection	DOAJ
description	Traditional algorithms that combine statistics and rules do not measure the internal cohesion of words, and the N-gram algorithm places no limit on the length N, which produces a large number of invalid word strings, consumes time, and reduces experimental efficiency. Therefore, this article first constructs a Chinese neologism corpus, adopts improved multi-PMI, and sets a double threshold to filter new words. Branch entropy is used to calculate the probabilities between words. Finally, the N-gram algorithm is used to segment the preprocessed corpus. Multi-word mutual information and a double mutual information threshold are used to identify new words and improve recognition accuracy. Experimental results show that the proposed algorithm improves accuracy, recall, and F-measure by 7%, 3%, and 5%, respectively, which can promote the sharing of language information resources so that people can intuitively and accurately obtain language information services from the internet.
first_indexed	2024-04-09T15:38:36Z
id	doaj.art-41d7eb77f4af40bd95a2748fb319419b
institution	Directory Open Access Journal
issn	2376-5992
spelling	doaj.art-41d7eb77f4af40bd95a2748fb319419b; 2023-04-27T15:05:04Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2023-04-01; vol. 9; e1297; doi:10.7717/peerj-cs.1297; Shuhui Li, School of Foreign Studies, South China Agricultural University, Guangzhou, Guangdong, China
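
The description above mentions bounding the N-gram length, filtering candidates with PMI under a double threshold, and using branch entropy. The following Python sketch illustrates how those pieces could fit together. It is not the article's implementation: the function names, the threshold values, the use of the minimum PMI over all two-way splits as a stand-in for "multi-PMI", and the rule that candidates between the two thresholds are decided by branch entropy are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n; capping N limits invalid strings."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def min_split_pmi(gram, counts, total):
    """Lowest PMI over all two-way splits of the n-gram (assumed multi-word PMI).

    Probabilities are approximated as count / total for every n-gram length.
    """
    p_whole = counts[gram] / total
    return min(
        math.log2(p_whole / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def branch_entropy(gram, tokens, side):
    """Entropy of the tokens adjacent to the n-gram on the 'left' or 'right'."""
    n = len(gram)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            j = i - 1 if side == "left" else i + n
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    seen = sum(neighbours.values())
    if seen == 0:
        return 0.0
    return -sum(c / seen * math.log2(c / seen) for c in neighbours.values())


def candidate_words(tokens, max_n=3, low=1.0, high=3.0, min_entropy=0.5):
    """Assumed double-threshold rule: accept above `high`, reject below `low`,
    and let branch entropy decide the candidates in between."""
    counts = ngram_counts(tokens, max_n)
    total = len(tokens)
    accepted = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < 2:
            continue  # skip single tokens and strings seen only once
        pmi = min_split_pmi(gram, counts, total)
        if pmi >= high:
            accepted.append(gram)
        elif pmi >= low:
            left = branch_entropy(gram, tokens, "left")
            right = branch_entropy(gram, tokens, "right")
            if min(left, right) >= min_entropy:
                accepted.append(gram)
    return accepted
```

For example, `candidate_words("x a b y z a b w".split())` keeps only the pair `("a", "b")`, because it recurs with varied neighbours on both sides. For Chinese text the tokens would typically be individual characters rather than space-separated words.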

Similar Items