Active Learning for News Article’s Authorship Identification

Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying unknown text authors. To this end, a system has t...

Bibliographic Details
Main Authors: Sidra Abbas, Shtwai Alsubai, Gabriel Avelino Sampedro, Mideth Abisado, Ahmad S. Almadhor, Natalia Kryvinska, Monji Mohamed Zaidi
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Active learning; authorship identification; text analysis; machine learning; news articles
Online Access: https://ieeexplore.ieee.org/document/10235957/
collection DOAJ
description Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying the authors of unknown texts. To this end, a system has to be developed that accurately determines the author of an unknown text, given a group of writing samples. Active learning is utilized in this study because it iteratively selects the most informative samples to include in the training set, which enables more accurate authorship identification with fewer examples. This makes it useful for analyzing the rising amount of anonymous content and identifying the authors of unknown texts. This study proposes a novel approach that utilizes active learning (AL) based machine learning models, namely Logistic Regression (AL-LR), Random Forest (AL-RF), XGBoost (AL-XGB), and Multilayer Perceptron (AL-MLP), for authorship identification. The proposed approach extracts characteristic features of each writer using Term Frequency-Inverse Document Frequency (TF-IDF). The selected dataset, “All the news,” is divided into three subsets: Article 1, Article 2, and Article 3. We have restricted the dataset’s scope and selected the top 50 authors for our experimentation. The experimental outcomes reveal that the proposed AL-XGB model achieves superior performance on Article 1 of the “All the news” dataset. Further, the AL-LR model performed well on Article 2, and the AL-MLP performed well on Article 3. The results suggest using the proposed approach for authorship identification.
first_indexed 2024-03-12T00:43:09Z
format Article
id doaj.art-64418a814e554efba98675461e32a4fb
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-12T00:43:09Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-64418a814e554efba98675461e32a4fb; 2023-09-14T23:00:56Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; volume 11, pages 98415-98426; DOI 10.1109/ACCESS.2023.3310813; document 10235957
Active Learning for News Article’s Authorship Identification
Sidra Abbas, https://orcid.org/0009-0001-0117-4390, Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
Shtwai Alsubai, https://orcid.org/0000-0002-6584-7400, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
Gabriel Avelino Sampedro, https://orcid.org/0000-0003-2354-4409, Faculty of Information and Communication Studies, University of the Philippines Open University, Los Baños, Philippines
Mideth Abisado, College of Computing and Information Technologies, National University, Manila, Philippines
Ahmad S. Almadhor, https://orcid.org/0000-0002-8665-1669, Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia
Natalia Kryvinska, https://orcid.org/0000-0003-3678-9229, Information Systems Department, Faculty of Management, Comenius University Bratislava, Bratislava, Slovakia
Monji Mohamed Zaidi, https://orcid.org/0000-0001-9237-1279, Department of Electrical Engineering, College of Engineering, King Khalid University, Abha, Saudi Arabia
title Active Learning for News Article’s Authorship Identification
topic Active learning
authorship identification
text analysis
machine learning
news articles
url https://ieeexplore.ieee.org/document/10235957/
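
The description in this record summarizes a pool-based active learning loop over TF-IDF features. The following is a minimal sketch of that general idea, not the authors' implementation. It uses only the Python standard library, and several parts are assumptions made for illustration: a nearest-centroid classifier stands in for the paper's LR, RF, XGBoost, and MLP models; margin sampling is assumed as the query strategy, since the record does not name one; and the six toy sentences and the author names "alice" and "bob" are hypothetical, not taken from the "All the news" dataset.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF keeps every weight strictly positive.
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine similarity for unit vectors)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def fit_centroids(vectors, labels, labelled):
    """Stand-in classifier: one unit-length centroid per author (assumption)."""
    sums = {}
    for i in labelled:
        centroid = sums.setdefault(labels[i], Counter())
        for t, w in vectors[i].items():
            centroid[t] += w
    model = {}
    for author, centroid in sums.items():
        norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
        model[author] = {t: w / norm for t, w in centroid.items()}
    return model


def predict(vec, model):
    """Return the author whose centroid is most similar to the document."""
    return max(model, key=lambda author: cosine(vec, model[author]))


def margin(vec, model):
    """Gap between the two best author scores; a small gap means uncertainty."""
    scores = sorted((cosine(vec, c) for c in model.values()), reverse=True)
    return scores[0] - scores[1] if len(scores) > 1 else scores[0]


def active_learning(vectors, labels, seed, budget):
    """Repeatedly label the pool document the current model is least sure about."""
    labelled = list(seed)
    pool = [i for i in range(len(vectors)) if i not in labelled]
    for _ in range(min(budget, len(pool))):
        model = fit_centroids(vectors, labels, labelled)
        query = min(pool, key=lambda i: margin(vectors[i], model))
        pool.remove(query)
        labelled.append(query)  # the oracle reveals labels[query]
    return fit_centroids(vectors, labels, labelled), labelled


# Hypothetical toy corpus: two authors, three short texts each.
docs = [
    "markets rallied as investors cheered strong earnings",
    "the senate passed the budget bill after long debate",
    "investors sold shares as earnings missed forecasts",
    "lawmakers debate the new bill in the senate",
    "earnings season lifted markets and shares",
    "the budget debate divided lawmakers",
]
authors = ["alice", "bob", "alice", "bob", "alice", "bob"]

vectors = tfidf_vectors(docs)
model, labelled = active_learning(vectors, authors, seed=[0, 1], budget=2)
print("labelled indices:", labelled)
print("prediction for doc 0:", predict(vectors[0], model))
```

Here the loop starts from one labelled text per author and spends a budget of two label queries. In a setting like the one the description outlines, the stand-in classifier would be replaced by the chosen model and the budget and seed set would be tuned to the dataset.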