Comparison of Evaluation Metrics for Short Story Generation

This study analyzed the correlations among automatic evaluation metrics for short story generation. Texts were generated with four language models: an N-gram model, a Continuous Bag-of-Words (CBOW) model, a Gated Recurrent Unit (GRU) model, and a Generative Pre-trained Transformer 2 (GPT-2) model, all trained on short Aesop's fables. The quality of the generated text was measured with perplexity, BLEU score, the number of grammatical errors, Self-BLEU score, ROUGE score, BERTScore, and Word Mover's Distance (WMD). A correlation analysis of these metrics revealed four groups. First, perplexity and the number of grammatical errors were moderately correlated. Second, BLEU, ROUGE, and BERTScore were highly correlated. Third, WMD was negatively correlated with BLEU, ROUGE, and BERTScore. Finally, Self-BLEU, which measures text diversity within a model, did not correlate with any of the other metrics. The study concludes that evaluating text generation requires a combination of metrics, each measuring a different aspect of the generated text.
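
The abstract names both reference-based similarity metrics (BLEU, ROUGE, BERTScore) and a diversity metric: Self-BLEU scores each generated text against the model's other outputs. As a rough illustration only, the sketch below computes BLEU and Self-BLEU with NLTK on placeholder texts; the story data and whitespace tokenization are assumptions, not the authors' actual pipeline.

    # Minimal sketch of BLEU and Self-BLEU, assuming whitespace-tokenized
    # placeholder texts; the paper's actual preprocessing is not given here.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = ["the fox praised the crow and stole the cheese".split()]
    generated = [
        "the fox flattered the crow and took the cheese".split(),
        "a lion spared the mouse and was later freed by it".split(),
    ]

    smooth = SmoothingFunction().method1  # avoids zero scores on short texts

    # BLEU: each generated text scored against the human reference(s).
    bleu = [sentence_bleu(references, hyp, smoothing_function=smooth)
            for hyp in generated]

    # Self-BLEU: each generated text scored against the *other* generated
    # texts; higher values mean the model repeats itself (lower diversity).
    self_bleu = [
        sentence_bleu([h for j, h in enumerate(generated) if j != i],
                      hyp, smoothing_function=smooth)
        for i, hyp in enumerate(generated)
    ]

    print("BLEU:", bleu)
    print("Self-BLEU:", self_bleu)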

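The four metric groups reported in the abstract come from a pairwise correlation analysis over per-story metric scores. A minimal sketch of that step, assuming the scores have already been computed (the values below are illustrative, and the record does not state which correlation coefficient the authors used; Pearson via NumPy is assumed here):

    # Pairwise correlation over per-story metric scores (hypothetical data).
    import numpy as np

    # Each array holds one metric's score for the same set of stories.
    metrics = {
        "BLEU":      np.array([0.21, 0.35, 0.28, 0.40]),
        "ROUGE-L":   np.array([0.30, 0.44, 0.37, 0.51]),
        "BERTScore": np.array([0.82, 0.88, 0.85, 0.90]),
        "WMD":       np.array([1.10, 0.82, 0.95, 0.70]),  # distance: lower is closer
    }

    names = list(metrics)
    corr = np.corrcoef(np.vstack([metrics[n] for n in names]))

    for i, a in enumerate(names):
        for b, r in zip(names[i + 1:], corr[i, i + 1:]):
            print(f"{a} vs {b}: r = {r:+.2f}")

With real scores, the pattern reported in the paper would appear as strongly positive coefficients among BLEU, ROUGE, and BERTScore and negative ones between WMD and that group, since WMD is a distance rather than a similarity.
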
Bibliographic Details
Main Authors: Ponrudee Netisopakul (ORCID: 0000-0003-1409-9729), Usanisa Taoto
Affiliation: School of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Lat Krabang, Bangkok, Thailand
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access, Vol. 11, pp. 140253-140269
DOI: 10.1109/ACCESS.2023.3337095
ISSN: 2169-3536
Collection: Directory of Open Access Journals (DOAJ)
Subjects: Natural language processing; neural networks; text processing; text analysis
Online Access:https://ieeexplore.ieee.org/document/10329351/