DYNAMIC FEATURE SELECTION FOR WEB USER IDENTIFICATION ON LINGUISTIC AND STYLISTIC FEATURES OF ONLINE TEXTS

The paper deals with the identification and authentication of web users participating in Internet information processes (based on features of online texts). In digital forensics, web user identification based on various linguistic features can be used to discover the identity of individuals, criminals or...

Bibliographic Details
Main Author: A. A. Vorobeva
Format: Article
Language: English
Published: Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University) 2017-01-01
Series: Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
Subjects: web user identification; forensic linguistics; information security
Online Access: http://ntv.ifmo.ru/file/article/16414.pdf
_version_ 1819018934558392320
author A. A. Vorobeva
author_facet A. A. Vorobeva
author_sort A. A. Vorobeva
collection DOAJ
description The paper deals with the identification and authentication of web users participating in Internet information processes, based on features of their online texts. In digital forensics, web user identification based on various linguistic features can be used to discover the identity of individuals, criminals or terrorists who use the Internet to commit cybercrimes. The Internet can serve as a tool in different types of cybercrime (fraud and identity theft, harassment and anonymous threats, terrorist or extremist statements, distribution of illegal content, and information warfare). Linguistic identification of web users is a kind of biometric identification; it can be used to narrow down the suspects, identify a criminal and prosecute them. The feature set includes various linguistic and stylistic features extracted from online texts. We propose dynamic feature selection for each web user identification task. Selection is based on calculating the Manhattan distance to the k nearest neighbors (the Relief-f algorithm). This approach improves identification accuracy and minimizes the number of features. Experiments were carried out on several datasets with different levels of class imbalance. The results showed that feature relevance varies across different sets of web users (probable authors of a given text); selecting features for each set of web users improves identification accuracy by 4% on average, which is approximately 1% higher than with a static set of features. The proposed approach is most effective for a small number of training samples (messages) per user.
first_indexed 2024-12-21T03:27:18Z
format Article
id doaj.art-812c35f16b8e49fea8ff223c80e754fa
institution Directory Open Access Journal
issn 2226-1494
2500-0373
language English
last_indexed 2024-12-21T03:27:18Z
publishDate 2017-01-01
publisher Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
record_format Article
series Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
spelling doaj.art-812c35f16b8e49fea8ff223c80e754fa
2022-12-21T19:17:34Z
eng
Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University)
Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki
2226-1494
2500-0373
2017-01-01
volume 17, issue 1, pages 117-128
10.17586/2226-1494-2017-17-1-117-128
DYNAMIC FEATURE SELECTION FOR WEB USER IDENTIFICATION ON LINGUISTIC AND STYLISTIC FEATURES OF ONLINE TEXTS
A. A. Vorobeva (assistant, ITMO University, Saint Petersburg, 197101, Russian Federation)
http://ntv.ifmo.ru/file/article/16414.pdf
web user identification
forensic linguistics
information security
title DYNAMIC FEATURE SELECTION FOR WEB USER IDENTIFICATION ON LINGUISTIC AND STYLISTIC FEATURES OF ONLINE TEXTS
title_sort dynamic feature selection for web user identification on linguistic and stylistic features of online texts
topic web user identification
forensic linguistics
information security
url http://ntv.ifmo.ru/file/article/16414.pdf
work_keys_str_mv AT aavorobeva dynamicfeatureselectionforwebuseridentificationonlinguisticandstylisticfeaturesofonlinetexts
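
The description above says that features are selected by computing the Manhattan distance to the k nearest neighbors (Relief-f). The following is a minimal generic sketch of that kind of scoring in plain Python, not the author's implementation. The function names relieff and select_features, the range-based scaling of feature differences, and the choice to visit every sample once are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def relieff(X, y, k=3):
    """Return one relevance weight per feature (higher = more relevant).

    X is a list of numeric feature vectors, y the list of class labels
    (here: candidate authors). Hypothetical helper, for illustration only.
    """
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale each difference into [0, 1].
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]

    def diff(a, b):
        return [abs(a[j] - b[j]) / span[j] for j in range(d)]

    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    weights = [0.0] * d

    for i in range(n):
        # Group all other samples by class, with their Manhattan distance
        # (sum of scaled per-feature differences) to sample i.
        by_class = defaultdict(list)
        for t in range(n):
            if t != i:
                dv = diff(X[i], X[t])
                by_class[y[t]].append((sum(dv), dv))
        for c, neighbours in by_class.items():
            neighbours.sort(key=lambda pair: pair[0])
            nearest = neighbours[:k]
            if c == y[i]:
                scale = -1.0  # nearest hits: differing values lower the weight
            else:
                # nearest misses: weighted by the prior of the other class
                scale = prior[c] / (1.0 - prior[y[i]])
            for _, dv in nearest:
                for j in range(d):
                    weights[j] += scale * dv[j] / (n * len(nearest))
    return weights


def select_features(X, y, k=3, top=2):
    """Indices of the `top` highest-weighted features, best first."""
    w = relieff(X, y, k)
    return sorted(range(len(w)), key=lambda j: w[j], reverse=True)[:top]
```

In a dynamic setting such as the one the description outlines, a routine like this would be rerun for each set of candidate authors, so the retained feature subset can differ from task to task.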