FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION
The Internet allows anonymity: users can post under a false name, on behalf of others, or with no name at all. Individuals and criminal or terrorist organizations can therefore use the Internet for criminal purposes while hiding their identity to avoid prosecution. Existing approaches and algorithms for author iden...
Main Author: | A. A. Vorobeva |
---|---|
Format: | Article |
Language: | English |
Published: | Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University), 2016-03-01 |
Series: | Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki |
Subjects: | web author identification; authorship attribution; computational linguistics; information security |
Online Access: | http://ntv.ifmo.ru/file/article/15186.pdf |
_version_ | 1819148930762407936 |
---|---|
author | A. A. Vorobeva |
author_facet | A. A. Vorobeva |
author_sort | A. A. Vorobeva |
collection | DOAJ |
description | The Internet allows anonymity: users can post under a false name, on behalf of others, or with no name at all. Individuals and criminal or terrorist organizations can therefore use the Internet for criminal purposes while hiding their identity to avoid prosecution. Existing approaches and algorithms for identifying the authors of Russian-language web posts are not effective. Developing proven methods, techniques and tools for author identification is an extremely important and challenging task. In this work, an algorithm and software for authorship identification of web posts were developed. During the study, the effectiveness of several classification and feature selection algorithms was tested. The algorithm comprises four steps: 1) feature extraction; 2) feature discretization; 3) feature selection with the Relief-f algorithm, found to be the most effective (to find, for each set of candidate authors, the feature set with the most discriminating power and so maximize the accuracy of author identification); 4) author identification with a model based on the Random Forest algorithm. Random Forest and Relief-f are used here for the first time to identify the author of a short Russian-language text. An important step of author attribution is data preprocessing, namely the discretization of continuous features, which had not previously been applied to improve the efficiency of author identification. The software outputs the top q authors with the highest probabilities of authorship. This approach is helpful for manual analysis in forensic linguistics, where the developed tool is used to narrow the set of candidate authors. In experiments with 10 candidate authors, the real author appeared in the top 3 in 90.02% of cases and in first place in 70.5% of cases. |
first_indexed | 2024-12-22T13:53:32Z |
format | Article |
id | doaj.art-f5855ecbc7e04bd0bf21044395b41013 |
institution | Directory Open Access Journal |
issn | 2226-1494 2500-0373 |
language | English |
last_indexed | 2024-12-22T13:53:32Z |
publishDate | 2016-03-01 |
publisher | Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) |
record_format | Article |
series | Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki |
spelling | doaj.art-f5855ecbc7e04bd0bf21044395b41013; 2022-12-21T18:23:37Z; eng; Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University); Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki; 2226-1494; 2500-0373; 2016-03-01; vol. 16, no. 2, pp. 295-302; 10.17586/2226-1494-2016-16-2-295-302; FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION; A. A. Vorobeva; http://ntv.ifmo.ru/file/article/15186.pdf; web author identification; authorship attribution; computational linguistics; information security |
spellingShingle | A. A. Vorobeva FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki web author identification authorship attribution computational linguistics information security |
title | FORENSIC LINGUISTICS: AUTOMATIC WEB AUTHOR IDENTIFICATION |
title_sort | forensic linguistics automatic web author identification |
topic | web author identification; authorship attribution; computational linguistics; information security |
url | http://ntv.ifmo.ru/file/article/15186.pdf |
work_keys_str_mv | AT aavorobeva forensiclinguisticsautomaticwebauthoridentification |
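
The description above mentions two steps that are easy to illustrate in isolation: discretizing continuous features and reporting the top q candidate authors by probability. The following is a minimal Python sketch of those two ideas only, not the article's implementation. The record does not say which discretization method was used, so equal-width binning is an assumption here; the function names (`equal_width_bins`, `top_q_authors`, `top_q_hit_rate`) and the probability values are hypothetical. In the described pipeline the probabilities would come from a Random Forest model trained on Relief-f-selected features, which this sketch does not attempt to reproduce.

```python
from typing import Dict, List, Sequence, Tuple


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each continuous value to a bin index in [0, n_bins - 1].

    Equal-width binning is an assumed method; the source record does not
    specify how features were discretized.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if len(values) == 0:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value lands in the last bin instead of overflowing.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def top_q_authors(probabilities: Dict[str, float], q: int) -> List[Tuple[str, float]]:
    """Return the q candidates with the highest authorship probability.

    Ties are broken alphabetically by author name so the result is stable.
    """
    ranked = sorted(probabilities.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:q]


def top_q_hit_rate(cases: Sequence[Tuple[str, Dict[str, float]]], q: int) -> float:
    """Fraction of cases in which the true author appears among the top q."""
    if len(cases) == 0:
        raise ValueError("cases must not be empty")
    hits = 0
    for true_author, probabilities in cases:
        shortlist = {name for name, _ in top_q_authors(probabilities, q)}
        if true_author in shortlist:
            hits += 1
    return hits / len(cases)


if __name__ == "__main__":
    # Hypothetical feature values (e.g. average sentence length per post).
    print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4))

    # Hypothetical classifier output for one anonymous post.
    scores = {"author_a": 0.2, "author_b": 0.5, "author_c": 0.3}
    print(top_q_authors(scores, q=2))
```

A hit rate computed with `top_q_hit_rate` at q = 3 and q = 1 corresponds to the kind of figures quoted in the description (real author in the top 3 versus in first place), which is why returning a shortlist rather than a single name suits manual forensic review.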