Measuring Short Text Reuse for the Urdu Language

Text reuse occurs when one borrows the text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is easily and readily available, making it simpler to reuse but difficult to detect. As a result, automatic detection of text reuse has attracted t...

Bibliographic Details
Main Authors:	Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, Iqra Muneer
Format:	Article
Language:	English
Published:	IEEE 2018-01-01
Series:	IEEE Access
Subjects:	Urdu text reuse detection; Urdu corpus; natural language processing
Online Access:	https://ieeexplore.ieee.org/document/8118088/

author	Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, Iqra Muneer
author_sort	Sara Sameen
collection	DOAJ
description	Text reuse occurs when one borrows text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is readily available, making text simpler to reuse but reuse more difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this paper, we propose one such resource for a significantly under-resourced language, Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu short text reuse corpus contains 2684 short Urdu text pairs, manually labeled as verbatim (496), paraphrased (1329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods in binary and multi-class classification settings with a set of four classifiers. The results show that character n-gram overlap with the J48 classifier outperforms other methods for the Urdu short text reuse detection task.
first_indexed	2024-12-13T13:23:55Z
format	Article
id	doaj.art-47bd7af9d3314083a28218e3aeee2200
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-12-13T13:23:55Z
publishDate	2018-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-47bd7af9d3314083a28218e3aeee2200; 2022-12-21T23:44:21Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2018-01-01; vol. 6, pp. 7412-7421; DOI 10.1109/ACCESS.2017.2776842; IEEE document 8118088; Measuring Short Text Reuse for the Urdu Language; Sara Sameen [0], Muhammad Sharjeel [1] (https://orcid.org/0000-0003-3361-4335), Rao Muhammad Adeel Nawab [2], Paul Rayson [3], Iqra Muneer [4]; [0] Department of Examinations, Virtual University of Pakistan, Lahore, Pakistan; [1] School of Computing and Communications, Lancaster University, Lancaster, U.K.; [2] Department of Computer Science, COMSATS Institute of Information Technology, Lahore, Pakistan; [3] School of Computing and Communications, Lancaster University, Lancaster, U.K.; [4] Department of Computer Science, Rachna College of Engineering and Technology, Gujranwala, Pakistan; https://ieeexplore.ieee.org/document/8118088/; Urdu text reuse detection; Urdu corpus; natural language processing
title	Measuring Short Text Reuse for the Urdu Language
topic	Urdu text reuse detection; Urdu corpus; natural language processing
url	https://ieeexplore.ieee.org/document/8118088/

Similar Items