Comparison of Evaluation Metrics for Short Story Generation
Main Authors: | Ponrudee Netisopakul (ORCID: 0000-0003-1409-9729), Usanisa Taoto |
---|---|
Affiliation: | School of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Lat Krabang, Bangkok, Thailand (both authors) |
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access, Vol. 11, pp. 140253–140269 |
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2023.3337095 |
Subjects: | Natural language processing; neural networks; text processing; text analysis |
Collection: | DOAJ (Directory of Open Access Journals) |
Online Access: | https://ieeexplore.ieee.org/document/10329351/ |
Abstract: This study aimed to analyze the correlation among different automatic evaluation metrics for short story generation. In the study, texts were generated from short stories using different language models: the N-gram model, the Continuous Bag-of-Words (CBOW) model, the Gated Recurrent Unit (GRU) model, and the Generative Pre-trained Transformer 2 (GPT-2) model. All models were trained on short Aesop’s fables. The quality of the generated text was measured with various metrics: perplexity, BLEU score, the number of grammatical errors, Self-BLEU score, ROUGE score, BERTScore, and Word Mover’s Distance (WMD). The correlation analysis of the evaluation metrics revealed four groups of correlated metrics. First, perplexity and grammatical errors were moderately correlated. Second, BLEU, ROUGE, and BERTScore were highly correlated. Third, WMD was negatively correlated with BLEU, ROUGE, and BERTScore. Finally, Self-BLEU, which measures text diversity within the model, did not correlate with the other metrics. In conclusion, to evaluate text generation, a combination of metrics should be used to measure the different aspects of the generated text.
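
To make the pipeline in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of two of the metrics it names, BLEU against a reference text and Self-BLEU for within-model diversity, followed by a rank correlation between per-text scores of two metrics. It assumes the `nltk` and `scipy` packages are installed; the toy stories, the second metric's values, and the choice of Spearman's rho are illustrative placeholders, since the abstract does not specify which correlation coefficient the study used.

```python
# Hedged sketch of the paper's metric-and-correlation setup (illustrative only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

smooth = SmoothingFunction().method1  # avoids zero BLEU on short texts


def bleu(references, hypothesis):
    """BLEU of one tokenized generated text against tokenized references."""
    return sentence_bleu(references, hypothesis, smoothing_function=smooth)


def self_bleu(texts):
    """Average BLEU of each text against the remaining texts in the set.
    Higher values mean the model's outputs resemble each other, i.e. lower
    diversity -- consistent with the abstract's description of Self-BLEU."""
    scores = []
    for i, hyp in enumerate(texts):
        refs = [t for j, t in enumerate(texts) if j != i]
        scores.append(bleu(refs, hyp))
    return sum(scores) / len(scores)


# Toy generated "stories" (tokenized) and one reference fable sentence.
generated = [
    "the fox saw the grapes and walked away".split(),
    "a fox wanted grapes but could not reach them".split(),
    "the lion spared the mouse and was repaid".split(),
]
reference = "a hungry fox could not reach the grapes".split()

bleu_scores = [bleu([reference], g) for g in generated]
print("BLEU per text:", bleu_scores)
print("Self-BLEU:", self_bleu(generated))

# Given per-text scores for two metrics, the correlation analysis reduces
# to a coefficient between the two score columns; hypothetical values here.
other_metric = [0.31, 0.42, 0.08]
rho, p = spearmanr(bleu_scores, other_metric)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```

Note the structural difference the sketch makes visible: BLEU compares each output to external references, while Self-BLEU compares outputs to each other, which is why the abstract reports it measuring diversity within the model rather than quality, and why it need not correlate with the reference-based metrics.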