CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning

Text summarization is the process of shortening the text so that it conveys the key points. Several text summarization methods and benchmark corpora are available for languages like English. A significant hurdle in developing and evaluating existing or new text summarization methods is the unavailab...

Bibliographic Details
Main Authors:	Muhammad Humayoun, Naheed Akhtar
Format:	Article
Language:	English
Published:	Elsevier 2022-11-01
Series:	Intelligent Systems with Applications
Subjects:	Natural language processing; Automatic text summarization; Single document summarization; Extraction based summarization; Extracts; Urdu summary corpus
Online Access:	http://www.sciencedirect.com/science/article/pii/S2667305322000667

_version_	1817969190984744960
author	Muhammad Humayoun, Naheed Akhtar
author_facet	Muhammad Humayoun, Naheed Akhtar
author_sort	Muhammad Humayoun
collection	DOAJ
description	Text summarization is the process of shortening the text so that it conveys the key points. Several text summarization methods and benchmark corpora are available for languages like English. A significant hurdle in developing and evaluating existing or new text summarization methods is the unavailability of standardized benchmark corpora, especially for South Asian languages. Among other things, a reference corpus enables researchers to compare existing state-of-the-art methods. Our study addresses this gap by developing a benchmark corpus for Urdu, a widely spoken yet under-resourced language. The reported corpus contains 161 documents from the newswire domain with manually written extractive summaries. We also perform several experiments on the corpus to show how it can be used to develop, evaluate, and compare text summarization systems using a supervised learning approach for the Urdu language. Our results show that state-of-the-art classifiers are good candidates for Urdu text summarization when supervised learning techniques are employed. Also, a radical word segmentation technique such as fixed-length segmentation outperforms all other settings (Sentence Match F1 = 57%, ROUGE-2 F1 = 64.4%). On the basic preprocessing of Urdu texts, we observe that tokenizing words on spaces is a reliable approach until proper word segmentation tools for Urdu are mature enough. On the word similarity features needed for supervised learning, we observe that radical stemming such as ultra-stemming with lengths 1 and 2 works better than the existing stemming and lemmatization tools for Urdu. Finally, the artificially generated datasets do not significantly improve results compared to the original data.
first_indexed	2024-04-13T20:18:12Z
format	Article
id	doaj.art-680d3f5e20d3403593efd31098f82dca
institution	Directory of Open Access Journals
issn	2667-3053
language	English
last_indexed	2024-04-13T20:18:12Z
publishDate	2022-11-01
publisher	Elsevier
record_format	Article
series	Intelligent Systems with Applications
spelling	doaj.art-680d3f5e20d3403593efd31098f82dca; 2022-12-22T02:31:37Z; eng; Elsevier; Intelligent Systems with Applications; 2667-3053; 2022-11-01; 16; 200129; CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning; Muhammad Humayoun (Corresponding author; Computer Information Science Division, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates); Naheed Akhtar (Department of Computer Science, University of Education, Lahore, Pakistan); Text summarization is the process of shortening the text so that it conveys the key points. Several text summarization methods and benchmark corpora are available for languages like English. A significant hurdle in developing and evaluating existing or new text summarization methods is the unavailability of standardized benchmark corpora, especially for South Asian languages. Among other things, a reference corpus enables researchers to compare existing state-of-the-art methods. Our study addresses this gap by developing a benchmark corpus for Urdu, a widely spoken yet under-resourced language. The reported corpus contains 161 documents from the newswire domain with manually written extractive summaries. We also perform several experiments on the corpus to show how it can be used to develop, evaluate, and compare text summarization systems using a supervised learning approach for the Urdu language. Our results show that state-of-the-art classifiers are good candidates for Urdu text summarization when supervised learning techniques are employed. Also, a radical word segmentation technique such as fixed-length segmentation outperforms all other settings (Sentence Match F1 = 57%, ROUGE-2 F1 = 64.4%). On the basic preprocessing of Urdu texts, we observe that tokenizing words on spaces is a reliable approach until proper word segmentation tools for Urdu are mature enough. On the word similarity features needed for supervised learning, we observe that radical stemming such as ultra-stemming with lengths 1 and 2 works better than the existing stemming and lemmatization tools for Urdu. Finally, the artificially generated datasets do not significantly improve results compared to the original data.; http://www.sciencedirect.com/science/article/pii/S2667305322000667; Natural language processing; Automatic text summarization; Single document summarization; Extraction based summarization; Extracts; Urdu summary corpus
spellingShingle	Muhammad Humayoun; Naheed Akhtar; CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning; Intelligent Systems with Applications; Natural language processing; Automatic text summarization; Single document summarization; Extraction based summarization; Extracts; Urdu summary corpus
title	CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning
title_full	CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning
title_fullStr	CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning
title_full_unstemmed	CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning
title_short	CORPURES: Benchmark corpus for Urdu extractive summaries and experiments using supervised learning
title_sort	corpures benchmark corpus for urdu extractive summaries and experiments using supervised learning
topic	Natural language processing; Automatic text summarization; Single document summarization; Extraction based summarization; Extracts; Urdu summary corpus
url	http://www.sciencedirect.com/science/article/pii/S2667305322000667
work_keys_str_mv	AT muhammadhumayoun corpuresbenchmarkcorpusforurduextractivesummariesandexperimentsusingsupervisedlearning AT naheedakhtar corpuresbenchmarkcorpusforurduextractivesummariesandexperimentsusingsupervisedlearning

Similar Items