Uncovering highly obfuscated plagiarism cases using fuzzy semantic-based similarity model

Highly obfuscated plagiarism cases contain unseen and obfuscated texts, which pose difficulties when using existing plagiarism detection methods. A fuzzy semantic-based similarity model for uncovering obfuscated plagiarism is presented and compared with five state-of-the-art baselines. Semantic rela...

Bibliographic Details
Main Authors:	Salha M. Alzahrani, Naomie Salim, Vasile Palade
Format:	Article
Language:	English
Published:	Elsevier 2015-07-01
Series:	Journal of King Saud University: Computer and Information Sciences
Subjects:	Feature extraction Fuzzy similarity Obfuscation Plagiarism detection Semantic similarity
Online Access:	http://www.sciencedirect.com/science/article/pii/S1319157815000361

author	Salha M. Alzahrani Naomie Salim Vasile Palade
collection	DOAJ
description	Highly obfuscated plagiarism cases contain unseen and obfuscated texts, which pose difficulties for existing plagiarism detection methods. A fuzzy semantic-based similarity model for uncovering obfuscated plagiarism is presented and compared with five state-of-the-art baselines. Semantic relatedness between words is studied based on part-of-speech (POS) tags and WordNet-based similarity measures. Fuzzy-based rules, which implement the semantic relatedness between words as a membership function of a fuzzy set, are introduced to assess the semantic distance between short source and suspicious texts. To minimize the number of false positives and false negatives, a learning method that combines a permission threshold and a variation threshold is used to decide true plagiarism cases. The proposed model and the baselines are evaluated on 99,033 ground-truth annotated cases extracted from different datasets, including 11,621 (11.7%) handmade paraphrases, 54,815 (55.4%) artificial plagiarism cases, and 32,578 (32.9%) plagiarism-free cases. We conduct extensive experimental verification, including a study of the effects of different segmentation schemes and parameter settings. Results are assessed using precision, recall, F-measure and granularity on stratified 10-fold cross-validation data. Statistical analysis using paired t-tests shows that the differences between the proposed approach and the baselines are statistically significant, which demonstrates the competence of the fuzzy semantic-based model in detecting plagiarism cases beyond literal plagiarism. Additionally, an analysis of variance (ANOVA) test shows the effectiveness of the different segmentation schemes used with the proposed approach.
first_indexed	2024-04-14T06:21:57Z
format	Article
id	doaj.art-e05720b229a74f72896633adca2582e4
institution	Directory of Open Access Journals
issn	1319-1578
language	English
last_indexed	2024-04-14T06:21:57Z
publishDate	2015-07-01
publisher	Elsevier
record_format	Article
series	Journal of King Saud University: Computer and Information Sciences
spelling	doaj.art-e05720b229a74f72896633adca2582e4; 2022-12-22T02:07:59Z; eng; Elsevier; Journal of King Saud University: Computer and Information Sciences; ISSN 1319-1578; 2015-07-01; vol. 27, no. 3, pp. 248-268; doi:10.1016/j.jksuci.2014.12.001; Salha M. Alzahrani (College of Computers and Information Technology (CIT), Taif University, Taif, Saudi Arabia); Naomie Salim (Faculty of Computer Science and Information Systems, University of Technology Malaysia, Johor, Malaysia); Vasile Palade (Department of Computer Science, University of Oxford, UK)
title	Uncovering highly obfuscated plagiarism cases using fuzzy semantic-based similarity model
topic	Feature extraction Fuzzy similarity Obfuscation Plagiarism detection Semantic similarity
url	http://www.sciencedirect.com/science/article/pii/S1319157815000361
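
The description above mentions word relatedness used as a fuzzy membership function and a decision rule that combines a permission threshold with a variation threshold, but this record does not define either. The following is a minimal sketch under stated assumptions, not the authors' implementation: the toy relatedness table stands in for a WordNet-based measure, the directed similarity is a plain average of best-match memberships, the two-threshold rule follows one common formulation from fuzzy-set copy detection, and all names and default values are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical toy table standing in for a WordNet-based relatedness measure.
TOY_RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def toy_relatedness(a: str, b: str) -> float:
    """Return a relatedness degree in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset((a, b)), 0.0)


def directed_similarity(
    src: Sequence[str],
    sus: Sequence[str],
    relatedness: Callable[[str, str], float] = toy_relatedness,
) -> float:
    """Average, over the words of `src`, of the best relatedness to any word of `sus`.

    Each word's best match is treated as its membership degree in the fuzzy
    set induced by `sus` (an assumption made for illustration).
    """
    if not src or not sus:
        return 0.0
    return sum(max(relatedness(w, v) for v in sus) for w in src) / len(src)


def is_candidate_case(
    src: Sequence[str],
    sus: Sequence[str],
    permission: float = 0.65,  # assumed value, not taken from the paper
    variation: float = 0.15,   # assumed value, not taken from the paper
) -> bool:
    """Flag a pair when both directed similarities are high and close to each other."""
    forward = directed_similarity(src, sus)
    backward = directed_similarity(sus, src)
    return min(forward, backward) >= permission and abs(forward - backward) <= variation
```

Checking both directions is a design choice in this sketch: a one-word text fully contained in a longer one scores 1.0 in one direction but low in the other, so the pair is not flagged.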

Similar Items