RoSummary: Control Tokens for Romanian News Summarization

Bibliographic Details
Main Authors: Mihai Alexandru Niculescu, Stefan Ruseti, Mihai Dascalu
Format: Article
Language: English
Published: MDPI AG 2022-12-01
Series: Algorithms
Online Access: https://www.mdpi.com/1999-4893/15/12/472
Description
Summary: Significant progress has been achieved in text generation due to recent developments in neural architectures; nevertheless, this task remains challenging, especially for low-resource languages. This study is centered on developing a model for abstractive summarization in Romanian. A corresponding dataset for summarization is introduced, followed by multiple models based on the Romanian GPT-2, on top of which control tokens were considered to specify characteristics for the generated text, namely: counts of sentences and words, token ratio, and n-gram overlap. These are special tokens defined in the prompt received by the model to indicate traits for the text to be generated. The initial model without any control tokens was assessed using BERTScore (F1 = 73.43%) and ROUGE (ROUGE-L accuracy = 34.67%). Control tokens improved the overall BERTScore to 75.42% using <LexOverlap>, while the model was influenced more by the second token specified in the prompt when performing various combinations of tokens. Six raters performed human evaluations of 45 summaries generated with different models and decoding methods. The generated texts were all grammatically correct and consistent in most cases, while the evaluations were promising in terms of main idea coverage, details, and cohesion. Paraphrasing still requires improvements, as the models mostly repeat information from the reference text. In addition, we showcase an exploratory analysis of the generated summaries using one or two specific control tokens.
ISSN:1999-4893