FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION

The Internet is anonymous, which allows posting under a false name, on behalf of others, or simply anonymously. Thus individuals and criminal or terrorist organizations can use the Internet for criminal purposes, hiding their identity to avoid prosecution. Existing approaches and algorithms for author iden...


Bibliographic Details
Main Author: A. A. Vorobeva
Format: Article
Language: English
Published: Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) 2016-03-01
Series: Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
Subjects: web author identification; authorship attribution; computational linguistics; information security
Online Access: http://ntv.ifmo.ru/file/article/15186.pdf
collection DOAJ
description The Internet is anonymous, which allows posting under a false name, on behalf of others, or simply anonymously. Thus individuals and criminal or terrorist organizations can use the Internet for criminal purposes, hiding their identity to avoid prosecution. Existing approaches and algorithms for identifying the authors of Russian-language web posts are not effective. Developing proven methods, techniques, and tools for author identification is an extremely important and challenging task. In this work, an algorithm and software for authorship identification of web posts were developed. During the study, the effectiveness of several classification and feature selection algorithms was tested. The algorithm includes the following steps: 1) feature extraction; 2) feature discretization; 3) feature selection with the Relief-F algorithm, which proved the most effective (it finds the feature set with the most discriminating power for each set of candidate authors and maximizes the accuracy of author identification); 4) author identification with a model based on the Random Forest algorithm. Random Forest and Relief-F are used here for the first time to identify the author of a short Russian-language text. An important step in author attribution is data preprocessing, namely discretization of continuous features; it had not previously been applied to improve the efficiency of author identification. The software outputs the top q authors with the highest probabilities of authorship. This approach is helpful for manual analysis in forensic linguistics, where the developed tool is used to narrow the set of candidate authors. In experiments with 10 candidate authors, the real author appeared in the top 3 in 90.02% of cases and in first place in 70.5% of cases.
first_indexed 2024-12-22T13:53:32Z
format Article
id doaj.art-f5855ecbc7e04bd0bf21044395b41013
institution Directory Open Access Journal
issn 2226-1494
2500-0373
language English
last_indexed 2024-12-22T13:53:32Z
publishDate 2016-03-01
publisher Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
record_format Article
series Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
doi 10.17586/2226-1494-2016-16-2-295-302
volume 16
issue 2
pages 295-302
title FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION
topic web author identification
authorship attribution
computational linguistics
information security
url http://ntv.ifmo.ru/file/article/15186.pdf
work_keys_str_mv AT aavorobeva forensiclinguisticsautomaticwebauthoridentification
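
Illustrative sketch (not part of the catalog record). The steps listed in the description can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation. The four features in extract_features are hypothetical stand-ins for the article's feature set. discretize uses equal-width binning, which is only one possible discretization. relief_weights is a simplified Relief with a single nearest hit and miss per sample, whereas Relief-F averages over k neighbours per class. The Random Forest step is omitted: top_q simply assumes that some classifier has already produced a probability per candidate author.

```python
def extract_features(text):
    """Step 1: toy stylometric features (hypothetical, not the article's set)."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        sum(len(w) for w in words) / n_words,            # mean token length
        sum(text.count(p) for p in ",.!?;:") / n_chars,  # punctuation rate
        sum(c.isupper() for c in text) / n_chars,        # uppercase rate
        len({w.lower() for w in words}) / n_words,       # type-token ratio
    ]


def discretize(column, n_bins=3):
    """Step 2: equal-width binning of one continuous feature column."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # constant column: everything in bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def _diff(a, b):
    """Per-feature 0/1 mismatch between two discretized samples."""
    return [int(u != v) for u, v in zip(a, b)]


def relief_weights(X, y):
    """Step 3: simplified Relief (one nearest hit, one nearest miss).

    Features that differ on the nearest sample of another author gain
    weight; features that differ on the nearest sample of the same
    author lose weight.
    """
    weights = [0.0] * len(X[0])
    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, (x, c) in enumerate(zip(X, y)) if c == yi and j != i]
        misses = [x for x, c in zip(X, y) if c != yi]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda x: sum(_diff(xi, x)))
        miss = min(misses, key=lambda x: sum(_diff(xi, x)))
        for f, (dh, dm) in enumerate(zip(_diff(xi, hit), _diff(xi, miss))):
            weights[f] += (dm - dh) / len(X)
    return weights


def top_q(probabilities, q=3):
    """Return the q candidate authors with the highest probability.

    `probabilities` maps author -> probability and is assumed to come
    from a trained classifier (a Random Forest in the article).
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [author for author, _ in ranked[:q]]


def top_q_accuracy(predictions, true_authors, q=3):
    """Share of texts whose real author is among the top q candidates."""
    found = sum(t in top_q(p, q) for p, t in zip(predictions, true_authors))
    return found / len(true_authors)
```

In this sketch, features with the largest Relief weights would be kept before training the classifier. top_q_accuracy with q=3 and q=1 corresponds to the two kinds of figures reported in the description (top 3 and first place).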