Automatic construction of real‐world‐based typing‐error test dataset


Bibliographic Details
Main Authors: Jung‐Hun Lee, Hyuk‐Chul Kwon
Format: Article
Language: English
Published: Wiley 2022-07-01
Series: Electronics Letters
Online Access: https://doi.org/10.1049/ell2.12515
Description
Summary: In this study, we aim to automatically construct a test dataset for evaluating the performance of spelling error correction systems. The Google Web 1T corpus, which includes data on 10 quadrillion phrases, is used for this purpose, so the error words in the test dataset are ones actually produced by real web users. There are seven types of error words. To obtain an error word, we search for the set of words that co-occur with the surrounding context (within a 3-gram range) of the position where the error is to be generated. In this search, we exclude candidates with large edit distances, because they make it exceedingly difficult to recover the original word. To select the final error word from the word set, the context probability of each candidate is calculated using 3-grams, and a word with a high value is chosen. In the experiment, performance was measured for two systems currently in service (Grammarly and MS Word) and for a recently announced spelling error correction system (NeuSpell). The highest overall performance was an F1 score of 56%, indicating the need for further research on spelling error correction.
ISSN: 0013-5194, 1350-911X
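
The candidate-selection procedure outlined in the summary (context search, edit-distance filter, context-probability ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy 3-gram counts, the `max_distance` threshold of 2, and the use of relative 3-gram frequency as the "context probability" are all assumptions made for illustration, and the handling of the seven error types is not modelled.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_error_word(trigram_counts, left, correct, right, max_distance=2):
    """Return (error_word, context_probability) or None.

    trigram_counts maps (w1, w2, w3) -> count. Candidates are the words
    seen in the middle slot of the context (left, _, right), excluding
    the correct word and anything farther than max_distance edits away.
    """
    # All words observed in the same 3-gram context as the correct word.
    in_context = {
        mid: count
        for (w1, mid, w3), count in trigram_counts.items()
        if w1 == left and w3 == right
    }
    # Drop the correct word and candidates that are too far from it.
    candidates = {
        word: count
        for word, count in in_context.items()
        if word != correct and edit_distance(word, correct) <= max_distance
    }
    if not candidates:
        return None
    total = sum(in_context.values())
    best = max(candidates, key=candidates.get)
    # Assumed definition: relative frequency of the word in this context.
    return best, candidates[best] / total


# Hypothetical counts, invented for this example (not Web 1T data).
toy_counts = {
    ("a", "great", "deal"): 900,
    ("a", "grate", "deal"): 40,
    ("a", "graet", "deal"): 10,
    ("a", "good", "deal"): 300,
    ("a", "big", "deal"): 500,
    ("the", "grate", "deal"): 5,
}

result = pick_error_word(toy_counts, "a", "great", "deal")
print(result)
```

With these toy counts, "good" and "big" share the context but are discarded by the edit-distance filter, which mirrors the summary's point that distant candidates make the original word too hard to recover.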