Examining the performance of classification algorithms for imbalanced data sets in web author identification

Bibliographic Details
Main Author: Alisa A. Vorobeva
Format: Article
Language: English
Published: FRUCT 2016-04-01
Series: Proceedings of the XXth Conference of Open Innovations Association FRUCT
Online Access: https://fruct.org/publications/fruct18/files/Vor.pdf
Description
Summary: Individuals, criminals, and even terrorist organizations can use web communication for criminal purposes, and they try to hide their identity to avoid prosecution. To increase the level of safety on the Web, author (or web-user) identification and authentication procedures must be improved. In web author identification, imbalanced data sets arise frequently: the number of texts by one author significantly exceeds the number by others. This is a common situation on the modern web (social networks, blogs, emails, etc.). Author identification is a kind of classification task. To develop methods, techniques, and tools for web author identification, the performance of classification algorithms on imbalanced data sets has to be examined. In this work, several modern classification algorithms were tested on data sets with various levels of class imbalance and different numbers of available web posts. The best accuracy in all experiments was achieved with the Random Forest algorithm.
ISSN: 2305-7254, 2343-0737
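
The summary describes testing classifiers on data sets with varying levels of class imbalance. The sketch below is not the paper's procedure; it only illustrates, under stated assumptions, how such a data set might be built and why plain accuracy can mislead on it. The corpus is synthetic, and the names `make_imbalanced`, `majority_n`, and `ratio` are hypothetical and introduced here for illustration.

```python
import random
from collections import Counter

def make_imbalanced(posts_by_author, majority_author, majority_n, ratio, seed=0):
    """Sample majority_n posts for one author and majority_n // ratio
    posts for every other author (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)
    minority_n = max(1, majority_n // ratio)
    sample = []
    for author, posts in posts_by_author.items():
        n = majority_n if author == majority_author else minority_n
        for post in rng.sample(posts, n):
            sample.append((post, author))
    return sample

def accuracy(y_true, y_pred):
    """Fraction of posts attributed to the correct author."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-author recall, so each author counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Synthetic placeholder corpus: 3 authors with 100 posts each (assumption).
posts_by_author = {
    a: [f"{a}-post-{i}" for i in range(100)]
    for a in ("author_a", "author_b", "author_c")
}

# 8:1 imbalance: 80 posts for author_a, 10 for each of the others.
data = make_imbalanced(posts_by_author, "author_a", majority_n=80, ratio=8)
y_true = [author for _, author in data]

# Trivial baseline that always predicts the majority author.
y_pred = ["author_a"] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

On this synthetic split the always-majority baseline reaches high plain accuracy while recovering only one of three authors, which is one reason an imbalance-aware metric may be worth reporting alongside accuracy when comparing classifiers.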