A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu
Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. Plagiarism detection is categorized into two types: (i) intrinsic plagiarism detection primarily concerns the assessment of authorship consistency within a sin...
Main Authors: | Muhammad Haseeb, Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Uzma Farooq, Adnan Abid |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-02-01 |
Series: | Data in Brief |
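Subjects: | Plagiarism detection, Intrinsic plagiarism, Stylometry features, Sentence, Paragraph, Urdu language |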
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340923009186 |
_version_ | 1797317212856385536 |
---|---|
author | Muhammad Haseeb; Muhammad Faraz Manzoor; Muhammad Shoaib Farooq; Uzma Farooq; Adnan Abid |
author_facet | Muhammad Haseeb; Muhammad Faraz Manzoor; Muhammad Shoaib Farooq; Uzma Farooq; Adnan Abid |
author_sort | Muhammad Haseeb |
collection | DOAJ |
description | Plagiarism detection (PD) is the process of identifying instances where someone has presented another person's work or ideas as their own. It is categorized into two types. (i) Intrinsic plagiarism detection assesses authorship consistency within a single document, aiming to identify portions of the text that may have been copied or paraphrased from elsewhere within the same document. Author clustering, closely related to intrinsic plagiarism detection, groups documents by their stylistic and linguistic characteristics to identify common authors or sources within a given dataset. (ii) Extrinsic plagiarism detection compares a suspicious document against a set of external source documents, seeking shared phrases, sentences, or paragraphs, which is often referred to as text reuse or verbatim copying. Plagiarism detection is a long-established task in NLP with notable contributions across multiple applications. Much research has been conducted for English and other languages, but Urdu needs far more attention, especially in intrinsic plagiarism detection. The main reason is that Urdu is a low-resource language, and no high-quality benchmark corpus is available for intrinsic plagiarism detection in Urdu. This study presents a high-quality benchmark corpus comprising 10,872 documents, structured at two granularity levels: sentence level and paragraph level. The dataset supports intrinsic plagiarism detection, verbatim text reuse identification, and author clustering in Urdu. It is also significant for natural language processing researchers and practitioners because it facilitates the development of plagiarism detection models tailored to Urdu. Such models can play a vital role in education and publishing by improving the accuracy of plagiarism detection and the ability to identify copied content in Urdu writing. |
first_indexed | 2024-03-08T03:30:40Z |
format | Article |
id | doaj.art-c6034d48f8ee417e83ff87b4acc90a65 |
institution | Directory of Open Access Journals |
issn | 2352-3409 |
language | English |
last_indexed | 2024-03-08T03:30:40Z |
publishDate | 2024-02-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-c6034d48f8ee417e83ff87b4acc90a65; 2024-02-11T05:10:21Z; eng; Elsevier; Data in Brief; 2352-3409; 2024-02-01; 52; 109857; A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu; Muhammad Haseeb (0), Muhammad Faraz Manzoor (1), Muhammad Shoaib Farooq (2), Uzma Farooq (3), Adnan Abid (4); (0) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (1) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (2) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (3) Department of Computer Science, University of Management and Technology, Lahore, Pakistan; (4) Department of Data Science, Faculty of Computing and Information Technology, University of the Punjab, Pakistan; Corresponding authors.; http://www.sciencedirect.com/science/article/pii/S2352340923009186; Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language |
title | A versatile dataset for intrinsic plagiarism detection, text reuse analysis, and author clustering in Urdu |
title_sort | versatile dataset for intrinsic plagiarism detection text reuse analysis and author clustering in urdu |
topic | Plagiarism detection; Intrinsic plagiarism; Stylometry features; Sentence; Paragraph; Urdu language |
url | http://www.sciencedirect.com/science/article/pii/S2352340923009186 |