Active Learning for News Article’s Authorship Identification
Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying unknown text authors. To this end, a system has t...
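Main Authors: | Sidra Abbas, Shtwai Alsubai, Gabriel Avelino Sampedro, Mideth Abisado, Ahmad S. Almadhor, Natalia Kryvinska, Monji Mohamed Zaidi |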
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | Active learning, authorship identification, text analysis, machine learning, news articles |
Online Access: | https://ieeexplore.ieee.org/document/10235957/ |
_version_ | 1827818171710570496 |
---|---|
author | Sidra Abbas, Shtwai Alsubai, Gabriel Avelino Sampedro, Mideth Abisado, Ahmad S. Almadhor, Natalia Kryvinska, Monji Mohamed Zaidi |
author_facet | Sidra Abbas, Shtwai Alsubai, Gabriel Avelino Sampedro, Mideth Abisado, Ahmad S. Almadhor, Natalia Kryvinska, Monji Mohamed Zaidi |
author_sort | Sidra Abbas |
collection | DOAJ |
description | Over time, the amount of textual data has increased drastically, especially due to the publication of articles. As a consequence, there has been a rise in anonymous content. Research is being conducted to determine alternative methods for identifying unknown text authors. To this end, a system has to be developed to accurately determine the author of unknown texts, given a group of writing samples. Active Learning is utilized in this study because it iteratively selects the most informative samples to include in the training set, which enables a more precise and accurate authorship identification approach with fewer examples. This makes it useful for analyzing the rising amount of anonymous content and identifying unknown text authors. This study proposes a novel approach that utilizes active learning (AL) based machine learning models, namely Logistic Regression (AL-LR), Random Forest (AL-RF), XGBoost (AL-XGB), and Multilayer Perceptron (AL-MLP), for authorship identification. The proposed approach extracts valuable characteristics of the writer using Term Frequency-Inverse Document Frequency (TF-IDF). The comprehensive dataset selected for this study, “All the news,” is divided into three subsets: Article 1, Article 2, and Article 3. We have restricted the dataset’s scope and selected the top 50 authors for our experimentation. The experimental outcomes reveal that the proposed AL-XGB model achieves superior performance on Article 1 of the “All the news” dataset. Further, the AL-LR model performed well on Article 2, and the AL-MLP model performed well on Article 3. The results suggest using the proposed approach for authorship identification. |
first_indexed | 2024-03-12T00:43:09Z |
format | Article |
id | doaj.art-64418a814e554efba98675461e32a4fb |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T00:43:09Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-64418a814e554efba98675461e32a4fb; 2023-09-14T23:00:56Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 98415-98426; DOI 10.1109/ACCESS.2023.3310813; document 10235957; Active Learning for News Article’s Authorship Identification; Authors: Sidra Abbas (https://orcid.org/0009-0001-0117-4390), Shtwai Alsubai (https://orcid.org/0000-0002-6584-7400), Gabriel Avelino Sampedro (https://orcid.org/0000-0003-2354-4409), Mideth Abisado, Ahmad S. Almadhor (https://orcid.org/0000-0002-8665-1669), Natalia Kryvinska (https://orcid.org/0000-0003-3678-9229), Monji Mohamed Zaidi (https://orcid.org/0000-0001-9237-1279); Affiliations, in listed order: Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan; College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia; Faculty of Information and Communication Studies, University of the Philippines Open University, Los Baños, Philippines; College of Computing and Information Technologies, National University, Manila, Philippines; Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia; Information Systems Department, Faculty of Management, Comenius University Bratislava, Bratislava, Slovakia; Department of Electrical Engineering, College of Engineering, King Khalid University, Abha, Saudi Arabia; https://ieeexplore.ieee.org/document/10235957/; Keywords: Active learning, authorship identification, text analysis, machine learning, news articles |
title | Active Learning for News Article’s Authorship Identification |
topic | Active learning, authorship identification, text analysis, machine learning, news articles |
url | https://ieeexplore.ieee.org/document/10235957/ |
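
The description above outlines an active learning loop over TF-IDF features: the model repeatedly picks the most informative unlabelled article, obtains its author label, and retrains. The following is a minimal sketch of that idea, not the authors' implementation. It assumes pool-based margin (uncertainty) sampling, which the record does not specify, and it swaps in a simple nearest-centroid classifier with cosine similarity in place of the paper's LR, RF, XGBoost, and MLP models so that it runs on the standard library alone. All function names and the toy articles are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one sparse, L2-normalised TF-IDF dict per document (smoothed IDF)."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def centroids(vectors, labels, idx):
    """Sum the labelled vectors per author (an unnormalised centroid)."""
    cents = {}
    for i in idx:
        cent = cents.setdefault(labels[i], Counter())
        for t, w in vectors[i].items():
            cent[t] += w
    return cents


def scores(vec, cents):
    """Cosine similarity between a unit-length vector and each author centroid."""
    out = {}
    for author, cent in cents.items():
        norm = math.sqrt(sum(w * w for w in cent.values())) or 1.0
        out[author] = sum(w * cent.get(t, 0.0) for t, w in vec.items()) / norm
    return out


def predict(vec, cents):
    """Return the author whose centroid is most similar to vec."""
    s = scores(vec, cents)
    return max(s, key=s.get)


def active_learning(vectors, oracle_labels, seed_idx, rounds):
    """Grow the labelled set by querying the smallest-margin pool item each round."""
    labelled = list(seed_idx)
    pool = [i for i in range(len(vectors)) if i not in labelled]
    for _ in range(rounds):
        if not pool:
            break
        cents = centroids(vectors, oracle_labels, labelled)

        def margin(i):
            # Gap between the two best author scores; small gap = uncertain.
            s = sorted(scores(vectors[i], cents).values(), reverse=True)
            return s[0] - s[1] if len(s) > 1 else s[0]

        query = min(pool, key=margin)
        pool.remove(query)
        labelled.append(query)  # the oracle reveals oracle_labels[query]
    return labelled, centroids(vectors, oracle_labels, labelled)


# Toy stand-in for a news corpus; authors and texts are invented.
docs = [
    "markets rallied as stocks and bonds rose",
    "the striker scored a late goal in the match",
    "stocks fell while bond markets steadied",
    "the keeper saved a penalty in the match",
    "bonds and stocks traded flat",
    "a late goal won the match",
]
authors = ["alice", "bob", "alice", "bob", "alice", "bob"]

vectors = tfidf(docs)
labelled, cents = active_learning(vectors, authors, seed_idx=[0, 1], rounds=2)
print(labelled, predict(vectors[4], cents), predict(vectors[5], cents))
```

In this sketch the oracle is simply the list of known authors; in practice the label would come from an annotator, and the centroid classifier could be replaced by any model that exposes per-class scores.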