FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION

The Internet is anonymous, which allows posting under a false name, on behalf of others, or simply anonymously. Thus, individuals and criminal or terrorist organizations can use the Internet for criminal purposes, hiding their identity to avoid prosecution. Existing approaches and algorithms for author iden...

Bibliographic Details
Main Author:	A. A. Vorobeva
Format:	Article
Language:	English
Published:	Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) 2016-03-01
Series:	Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
Subjects:	web author identification; authorship attribution; computational linguistics; information security
Online Access:	http://ntv.ifmo.ru/file/article/15186.pdf

author	A. A. Vorobeva
collection	DOAJ
description	The Internet is anonymous, which allows posting under a false name, on behalf of others, or simply anonymously. Thus, individuals and criminal or terrorist organizations can use the Internet for criminal purposes, hiding their identity to avoid prosecution. Existing approaches and algorithms for identifying the authors of web posts in Russian are not effective. Developing proven methods, techniques, and tools for author identification is an extremely important and challenging task. In this work, an algorithm and software for authorship identification of web posts were developed. During the study, the effectiveness of several classification and feature selection algorithms was tested. The algorithm includes the following steps: 1) feature extraction; 2) feature discretization; 3) feature selection with the Relief-f algorithm, the most effective of those tested (to find the feature set with the most discriminating power for each set of candidate authors and maximize the accuracy of author identification); 4) author identification with a model based on the Random Forest algorithm. Random Forest and Relief-f are used here for the first time to identify the author of a short text in Russian. An important step of author attribution is data preprocessing, namely discretization of continuous features; it had not previously been applied to improve the efficiency of author identification. The software outputs the top q authors with the highest probabilities of authorship. This approach is helpful for manual analysis in forensic linguistics, where the developed tool is used to narrow the set of candidate authors. In experiments with 10 candidate authors, the real author appeared in the top 3 in 90.02% of cases and in first place in 70.5% of cases.
format	Article
id	doaj.art-f5855ecbc7e04bd0bf21044395b41013
institution	Directory Open Access Journal
issn	2226-1494 2500-0373
language	English
publishDate	2016-03-01
publisher	Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
series	Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
doi	10.17586/2226-1494-2016-16-2-295-302
citation	Vol. 16, No. 2, pp. 295-302
title	FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION
topic	web author identification; authorship attribution; computational linguistics; information security
url	http://ntv.ifmo.ru/file/article/15186.pdf
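
The description in this record mentions discretizing continuous features, outputting the top q candidate authors, and reporting how often the real author lands in the top 3 or in first place. The sketch below illustrates those three ideas in plain Python. It is not the article's implementation: the equal-width binning scheme, the function names (`equal_width_bins`, `top_q`, `top_q_accuracy`), and the toy probabilities are all assumptions made for illustration, and the Relief-f and Random Forest steps are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map continuous feature values to bin indices 0..n_bins-1.

    Equal-width binning is an assumption; the record does not say
    which discretization scheme the article uses.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: everything falls in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def top_q(author_probs: Dict[str, float], q: int) -> List[Tuple[str, float]]:
    """Return the q candidate authors with the highest probability."""
    ranked = sorted(author_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:q]


def top_q_accuracy(predictions: Sequence[Dict[str, float]],
                   true_authors: Sequence[str], q: int) -> float:
    """Fraction of texts whose real author is among the top q candidates."""
    hits = sum(
        true in {name for name, _ in top_q(probs, q)}
        for probs, true in zip(predictions, true_authors)
    )
    return hits / len(true_authors)


if __name__ == "__main__":
    # Hypothetical per-author probabilities for two texts (toy data).
    preds = [
        {"a": 0.2, "b": 0.5, "c": 0.3},
        {"a": 0.6, "b": 0.3, "c": 0.1},
    ]
    print(equal_width_bins([1.0, 2.0, 3.0, 4.0], 2))
    print(top_q(preds[0], 2))
    print(top_q_accuracy(preds, ["c", "c"], q=2))
```

In a setting like the one described, the per-author probabilities would come from the trained classifier, and `top_q_accuracy` with q = 3 and q = 1 corresponds to the kind of top-3 and first-place figures quoted in the description.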
