A study on the classification of stylistic and formal features in English based on corpus data testing

Traditional algorithms that combine statistics and rules do not measure the internal cohesion of words, and the N-gram algorithm places no limit on N, so it generates many invalid word strings, which costs time and reduces experimental efficiency. Therefore,...
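
The cost of an unbounded N can be illustrated with a minimal sketch. It is not the article's implementation: the function name `count_ngrams`, the parameter `max_n`, and the toy token sequence are assumptions made for illustration only.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count every contiguous n-gram with 2 <= n <= max_n (hypothetical helper)."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy sequence of six tokens (illustrative data, not from the article).
tokens = list("abcabc")

# Capping N keeps the candidate set small.
bounded = count_ngrams(tokens, max_n=3)

# Letting N grow to the full sequence length adds long strings that occur once.
unbounded = count_ngrams(tokens, max_n=len(tokens))

print(len(bounded), len(unbounded))
```

Most of the extra candidates produced without a cap occur only once, which is the kind of invalid word string the summary refers to.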


Bibliographic Details
Main Author: Shuhui Li
Format: Article
Language: English
Published: PeerJ Inc. 2023-04-01
Series: PeerJ Computer Science
Subjects: N-gram algorithm; English; Neologisms; Corpus; PMI
Online Access: https://peerj.com/articles/cs-1297.pdf
collection DOAJ
description Traditional algorithms that combine statistics and rules do not measure the internal cohesion of words, and the N-gram algorithm places no limit on N, so it generates many invalid word strings, which costs time and reduces experimental efficiency. This article therefore first constructs a Chinese neologism corpus, adopts an improved multi-word pointwise mutual information (multi-PMI) measure, and sets a double threshold to filter new words. Branch entropy is used to calculate the probabilities between words. Finally, the N-gram algorithm is used to segment the preprocessed corpus. Multi-word mutual information and a double mutual information threshold are used to identify new words and improve recognition accuracy. Experimental results show that the proposed algorithm improves accuracy, recall and F-measure by 7%, 3% and 5%, respectively, which can promote the sharing of language information resources so that people can intuitively and accurately obtain language information services from the internet.
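
The filtering steps named in the description can be sketched roughly as follows. The record does not define the improved multi-PMI or say how the two thresholds interact, so everything here is an assumption: multi-word PMI is taken as the minimum PMI over all binary splits of a candidate, candidates above the upper threshold are accepted, those below the lower threshold are rejected, and those in between are decided by branch entropy. All names (`multi_pmi`, `branch_entropy`, `is_new_word`, `low`, `high`, `min_entropy`) are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count all contiguous n-grams with 1 <= n <= max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def multi_pmi(gram, counts, total):
    """Minimum PMI over all binary splits of gram (assumed definition).

    Probabilities are approximated as count / total for every n-gram length.
    """
    p_whole = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)

def entropy(counter):
    """Shannon entropy (bits) of the distribution given by a Counter."""
    s = sum(counter.values())
    if s == 0:
        return 0.0
    return -sum(v / s * math.log2(v / s) for v in counter.values())

def branch_entropy(gram, tokens):
    """Smaller of the left- and right-neighbour entropies of gram."""
    n = len(gram)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1
    return min(entropy(left), entropy(right))

def is_new_word(gram, tokens, counts, total, low, high, min_entropy):
    """Double-threshold decision (assumed rule, see the note above)."""
    pmi = multi_pmi(gram, counts, total)
    if pmi >= high:
        return True
    if pmi < low:
        return False
    return branch_entropy(gram, tokens) >= min_entropy

# Toy data (illustrative only): "a b" recurs with varying neighbours.
tokens = "a b c a b d a b e".split()
counts = ngram_counts(tokens, max_n=3)
total = len(tokens)
print(is_new_word(("a", "b"), tokens, counts, total,
                  low=1.0, high=3.0, min_entropy=0.5))
```

Taking the minimum over splits means a candidate is only as cohesive as its weakest internal boundary; other aggregations (mean, or a single normalised ratio) would be equally plausible readings of "multi-PMI".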
first_indexed 2024-04-09T15:38:36Z
id doaj.art-41d7eb77f4af40bd95a2748fb319419b
institution Directory Open Access Journal
issn 2376-5992
last_indexed 2024-04-09T15:38:36Z
spelling doaj.art-41d7eb77f4af40bd95a2748fb319419b | 2023-04-27T15:05:04Z | eng | PeerJ Inc. | PeerJ Computer Science | 2376-5992 | 2023-04-01 | 9 | e1297 | 10.7717/peerj-cs.1297 | A study on the classification of stylistic and formal features in English based on corpus data testing | Shuhui Li (School of Foreign Studies, South China Agricultural University, Guangzhou, Guangdong, China) | https://peerj.com/articles/cs-1297.pdf | N-gram algorithm; English; Neologisms; Corpus; PMI
topic N-gram algorithm
English
Neologisms
Corpus
PMI