A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu

Plagiarism detection (PD) is a process of identifying instances where someone has presented another person's work or ideas as their own. Plagiarism detection is categorized into two types (i) Intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a sin...

Bibliographic Details
Main Authors:	Muhammad Haseeb, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, Adnan Abid
Format:	Article
Language:	English
Published:	Elsevier 2024-02-01
Series:	Data in Brief
Subjects:	Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340923009186

_version_	1797317212856385536
author	Muhammad Haseeb, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, Adnan Abid
author_sort	Muhammad Haseeb
collection	DOAJ
description	Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. Plagiarism detection falls into two types. (i) Intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a single document, aiming to identify portions of the text that may have been copied or paraphrased from elsewhere within the same document. Author clustering, closely related to intrinsic plagiarism detection, groups documents by their stylistic and linguistic characteristics to identify common authors or sources within a given dataset. (ii) Extrinsic plagiarism detection compares a suspicious document against a set of external source documents, seeking shared phrases, sentences, or paragraphs, which is often referred to as text reuse or verbatim copying. Detecting plagiarism in documents is a long-established NLP task with contributions across multiple applications. Much research has been conducted for English and other languages, but Urdu needs considerably more attention, especially in intrinsic plagiarism detection. The major reason is that Urdu is a low-resource language and no high-quality benchmark corpus is available for intrinsic plagiarism detection in Urdu. This study presents a high-quality benchmark corpus comprising 10,872 documents. The corpus is structured at two granularity levels: sentence level and paragraph level. The dataset serves multiple purposes, facilitating intrinsic plagiarism detection, verbatim text reuse identification, and author clustering in Urdu. It is also significant for natural language processing researchers and practitioners because it facilitates the development of specialized plagiarism detection models tailored to Urdu. These models can play a vital role in education and publishing by improving the accuracy of plagiarism detection, addressing a gap in the ability to identify copied content in Urdu writing.
first_indexed	2024-03-08T03:30:40Z
format	Article
id	doaj.art-c6034d48f8ee417e83ff87b4acc90a65
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-03-08T03:30:40Z
publishDate	2024-02-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-c6034d48f8ee417e83ff87b4acc90a65 | 2024-02-11T05:10:21Z | eng | Elsevier | Data in Brief | 2352-3409 | 2024-02-01 | 52 | 109857 | A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu | Muhammad Haseeb (0), Muhammad Faraz Manzoor (1), Muhammad Shoaib Farooq (2), Uzma Farooq (3), Adnan Abid (4) | (0) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (1) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (2) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (3) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (4) Department of Data Science, Faculty of Computing and Information Technology, University of the Punjab, Pakistan; Corresponding authors. | http://www.sciencedirect.com/science/article/pii/S2352340923009186 | Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language
title	A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu
topic	Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language
url	http://www.sciencedirect.com/science/article/pii/S2352340923009186

Similar Items