An Empirical Study on Authorship Verification for Low Resource Language Using Hyper-Tuned CNN Approach


Bibliographic Details
Main Authors:	Talha Farooq Khan, Waheed Anwar, Humera Arshad, Syed Naseem Abbas
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Authorship verification; low resource language; natural language processing; deep learning
Online Access:	https://ieeexplore.ieee.org/document/10196421/

author	Talha Farooq Khan Waheed Anwar Humera Arshad Syed Naseem Abbas
collection	DOAJ
description	Authorship verification determines whether a given text was written by a particular author by analyzing distinct aspects of the writer’s style, such as vocabulary, syntax, and punctuation. It has gained significant research attention in domains including intellectual property rights, plagiarism detection, cybercrime investigations, copyright infringement, and forensics. While extensive studies have been conducted on many languages, including Western European languages such as Italian and Spanish and Asian languages such as Bengali and Chinese, authorship verification in Urdu has received comparatively little attention, despite its status as a prominent South Asian language. This limitation can be attributed to the intricate and distinctive morphology of Urdu, which calls for language-specific methodologies; methods developed for other languages cannot be applied directly. To bridge this gap, we propose an approach for authorship verification in Urdu that uses Convolutional Neural Networks (CNNs) tuned with three optimizers: ADAM, SGD, and RMSProp. To support this approach, we have curated a new corpus called UAVC-22, specifically tailored for Urdu authorship verification. This corpus offers enhanced robustness in terms of authors’ classes and unique words. We developed nine authorship verification models using three text embedding techniques, namely Word2Vec, GloVe, and FastText. We also performed a comparative analysis with traditional machine learning models, such as Support Vector Machines (SVM) and Random Forest, to assess the efficacy of the CNN-based approach. The optimized CNN-ADAM model with FastText achieved the highest accuracy, 98%, on the Urdu dataset UAVC-22.
first_indexed	2024-03-12T15:32:28Z
format	Article
id	doaj.art-72671f0915b0435c8c692403f3ce40a4
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-03-12T15:32:28Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-72671f0915b0435c8c692403f3ce40a4 | 2023-08-09T23:00:20Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2023-01-01 | vol. 11, pp. 80403-80415 | DOI 10.1109/ACCESS.2023.3299565 | document 10196421 | An Empirical Study on Authorship Verification for Low Resource Language Using Hyper-Tuned CNN Approach | Talha Farooq Khan (https://orcid.org/0000-0002-6230-6406); Waheed Anwar (https://orcid.org/0000-0002-2374-6951); Humera Arshad; Syed Naseem Abbas | all four authors: Department of Computer Science, Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Pakistan | https://ieeexplore.ieee.org/document/10196421/ | Authorship verification; low resource language; natural language processing; deep learning
title	An Empirical Study on Authorship Verification for Low Resource Language Using Hyper-Tuned CNN Approach
topic	Authorship verification; low resource language; natural language processing; deep learning
url	https://ieeexplore.ieee.org/document/10196421/
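
The description mentions nine models, three embeddings, and three optimizers, but does not spell out how the nine are composed. One plausible reading, assumed here and not confirmed by the record, is that each embedding is paired with each optimizer. The sketch below only enumerates that hypothetical grid; the label format and the function name `model_grid` are illustrative, and no training is performed.

```python
from itertools import product

# Names are taken from the description; pairing every embedding with
# every optimizer is an ASSUMPTION about how the nine models arise.
EMBEDDINGS = ["Word2Vec", "GloVe", "FastText"]
OPTIMIZERS = ["ADAM", "SGD", "RMSProp"]


def model_grid():
    """Return one hypothetical label per (embedding, optimizer) pair."""
    return [f"CNN-{opt} + {emb}" for emb, opt in product(EMBEDDINGS, OPTIMIZERS)]


if __name__ == "__main__":
    configs = model_grid()
    print(f"{len(configs)} configurations")
    for label in configs:
        print(label)
```

Under this reading, the best-performing model named in the description would correspond to the label `CNN-ADAM + FastText`.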
