A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu

Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. Plagiarism detection is categorized into two types. (i) Intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a sin...

Bibliographic Details
Main Authors: Muhammad Haseeb, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, Adnan Abid
Format: Article
Language: English
Published: Elsevier 2024-02-01
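Series: Data in Brief
Subjects: Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340923009186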
author Muhammad Haseeb
Muhammad Faraz Manzoor
Muhammad Shoaib Farooq
Uzma Farooq
Adnan Abid
collection DOAJ
description Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. Plagiarism detection is categorized into two types. (i) Intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a single document, aiming to identify instances where portions of the text may have been copied or paraphrased from elsewhere within the same document. Author clustering, closely related to intrinsic plagiarism detection, involves grouping documents based on their stylistic and linguistic characteristics to identify common authors or sources within a given dataset. (ii) Extrinsic plagiarism detection, on the other hand, compares a suspicious document against a set of external source documents, seeking shared phrases, sentences, or paragraphs between them; this is often referred to as text reuse or verbatim copying. Detecting plagiarism in documents is a long-established task in NLP, with notable contributions across multiple applications. Much research has been conducted for English and other languages, but Urdu still needs considerable attention, especially in intrinsic plagiarism detection. The major reason is that Urdu is a low-resource language, and no high-quality benchmark corpus is available for intrinsic plagiarism detection in Urdu. This study presents a high-quality benchmark corpus comprising 10,872 documents. The corpus is structured at two granularity levels: sentence level and paragraph level. The dataset serves multiple purposes, facilitating intrinsic plagiarism detection, verbatim text reuse identification, and author clustering in Urdu. It is also significant for natural language processing researchers and practitioners, as it facilitates the development of specialized plagiarism detection models tailored to Urdu. These models can play a vital role in education and publishing by improving the accuracy of plagiarism detection, addressing a gap and enhancing the overall ability to identify copied content in Urdu writing.
first_indexed 2024-03-08T03:30:40Z
format Article
id doaj.art-c6034d48f8ee417e83ff87b4acc90a65
institution Directory of Open Access Journals
issn 2352-3409
language English
last_indexed 2024-03-08T03:30:40Z
publishDate 2024-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-c6034d48f8ee417e83ff87b4acc90a65
2024-02-11T05:10:21Z
eng
Elsevier
Data in Brief
2352-3409
2024-02-01
Volume 52, article 109857
Muhammad Haseeb: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Muhammad Faraz Manzoor: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Muhammad Shoaib Farooq: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Uzma Farooq: Department of Computer Science, University of Management and Technology, Lahore, Pakistan
Adnan Abid: Department of Data Science, Faculty of Computing and Information Technology, University of the Punjab, Pakistan; Corresponding authors.
title A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu
topic Plagiarism detection
Intrinsic plagiarism
Stylometry features
Sentence
Paragraph
Urdu language
url http://www.sciencedirect.com/science/article/pii/S2352340923009186