A study on the classification of stylistic and formal features in English based on corpus data testing
The traditional statistical and rule combination algorithm lacks the determination of the inner cohesion of words, and the N-gram algorithm does not limit the length of N, which will produce a large number of invalid word strings, consume time and reduce the efficiency of the experiment. Therefore,...
Main Author: | Shuhui Li |
---|---|
Format: | Article |
Language: | English |
Published: | PeerJ Inc. 2023-04-01 |
Series: | PeerJ Computer Science |
Subjects: | N-gram algorithm English Neologisms Corpus PMI |
Online Access: | https://peerj.com/articles/cs-1297.pdf |
_version_ | 1827958573889486848 |
---|---|
author | Shuhui Li |
collection | DOAJ |
description | The traditional algorithm that combines statistics and rules does not measure the inner cohesion of words, and the N-gram algorithm places no limit on the length N, which produces a large number of invalid word strings, consumes time and reduces the efficiency of the experiment. Therefore, this article first constructs a Chinese neologism corpus, adopts an improved multi-PMI, and sets a double threshold to filter new words. Branch entropy is used to calculate the probabilities between words. Finally, the N-gram algorithm is used to segment the preprocessed corpus. Multi-word mutual information and a double mutual information threshold are used to identify new words and improve their recognition accuracy. Experimental results show that the proposed algorithm improves accuracy, recall and F-measure by 7%, 3% and 5% respectively, which can promote the sharing of language information resources so that people can intuitively and accurately obtain language information services from the internet. |
first_indexed | 2024-04-09T15:38:36Z |
format | Article |
id | doaj.art-41d7eb77f4af40bd95a2748fb319419b |
institution | Directory Open Access Journal |
issn | 2376-5992 |
language | English |
last_indexed | 2024-04-09T15:38:36Z |
publishDate | 2023-04-01 |
publisher | PeerJ Inc. |
record_format | Article |
series | PeerJ Computer Science |
spelling | doaj.art-41d7eb77f4af40bd95a2748fb319419b; 2023-04-27T15:05:04Z; eng; PeerJ Inc.; PeerJ Computer Science; 2376-5992; 2023-04-01; 9; e1297; 10.7717/peerj-cs.1297; A study on the classification of stylistic and formal features in English based on corpus data testing; Shuhui Li (School of Foreign Studies, South China Agricultural University, Guangzhou, Guangdong, China); The traditional statistical and rule combination algorithm lacks the determination of the inner cohesion of words, and the N-gram algorithm does not limit the length of N, which will produce a large number of invalid word strings, consume time and reduce the efficiency of the experiment. Therefore, this article first constructs a Chinese neologism corpus, adopts improved multi-PMI, and sets a double threshold to filter new words. Branch entropy is used to calculate the probabilities between words. Finally, the N-gram algorithm is used to segment the preprocessed corpus. We use multi-word mutual information and a double mutual information threshold to identify new words and improve their recognition accuracy. Experimental results show that the algorithm proposed in this article has been improved in accuracy, recall and F measures value by 7%, 3% and 5% respectively, which can promote the sharing of language information resources so that people can intuitively and accurately obtain language information services from the internet.; https://peerj.com/articles/cs-1297.pdf; N-gram algorithm English Neologisms Corpus PMI |
title | A study on the classification of stylistic and formal features in English based on corpus data testing |
topic | N-gram algorithm English Neologisms Corpus PMI |
url | https://peerj.com/articles/cs-1297.pdf |
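The description in this record names four ingredients: N-grams with a bounded length, multi-word PMI, a double threshold, and branch entropy. The following is a minimal sketch of how those pieces might fit together, not the article's implementation. The function names, the "minimum over split points" definition of multi-word PMI, the accept/check/reject reading of the double threshold, and the toy token list are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count all n-grams up to max_n (bounding N avoids runaway candidates)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def multi_pmi(gram, counts, total):
    """Assumed multi-word PMI: the weakest PMI over every binary split.

    Probabilities are approximated as count / total tokens.
    A unigram has no split, so this returns math.inf for it.
    """
    p_gram = counts[gram] / total
    weakest = math.inf
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        weakest = min(weakest, math.log2(p_gram / (p_left * p_right)))
    return weakest


def branch_entropy(gram, tokens, side):
    """Entropy of the tokens adjacent to gram on the given side."""
    n = len(gram)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            j = i - 1 if side == "left" else i + n
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    seen = sum(neighbours.values())
    if seen == 0:
        return 0.0
    return -sum(c / seen * math.log2(c / seen) for c in neighbours.values())


def double_threshold(score, low, high):
    """Assumed reading of the double threshold: accept, reject, or defer."""
    if score >= high:
        return "accept"
    if score < low:
        return "reject"
    return "check"  # borderline: decide with branch entropy instead


# Hypothetical toy corpus, already tokenized.
tokens = "a b c a b d a b e x y".split()
counts = ngram_counts(tokens, max_n=3)
pmi_ab = multi_pmi(("a", "b"), counts, len(tokens))
right_h = branch_entropy(("a", "b"), tokens, "right")
left_h = branch_entropy(("a", "b"), tokens, "left")
```

A candidate with high PMI and high entropy on both sides is internally cohesive and appears in varied contexts, which is the usual argument for treating it as a word. The threshold values passed to `double_threshold` would need tuning on real data.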