Automatically identifying the function and intent of posts in underground forums

Abstract The automatic classification of posts from hacking-related online forums is of potential value for the understanding of user behaviour in social networks relating to cybercrime. We designed annotation schema to label forum posts for three properties: post type, author intent, and addressee....


Bibliographic Details
Main Authors:	Andrew Caines, Sergio Pastrana, Alice Hutchings, Paula J. Buttery
Format:	Article
Language:	English
Published:	BMC 2018-11-01
Series:	Crime Science
Subjects:	Underground forums; Cybercrime; Deviant behaviour; Machine learning; Natural language processing
Online Access:	http://link.springer.com/article/10.1186/s40163-018-0094-4

_version_	1811212622446460928
author	Andrew Caines, Sergio Pastrana, Alice Hutchings, Paula J. Buttery
collection	DOAJ
description	Abstract The automatic classification of posts from hacking-related online forums is of potential value for the understanding of user behaviour in social networks relating to cybercrime. We designed annotation schema to label forum posts for three properties: post type, author intent, and addressee. The post type indicates whether the text is a question, a comment, and so on. The author’s intent in writing the post could be positive, negative, moderating discussion, showing gratitude to another user, etc. The addressee of a post tends to be a general audience (e.g. other forum users) or individual users who have already contributed to a threaded discussion. We manually annotated a sample of posts and returned substantial agreement for post type and addressee, and fair agreement for author intent. We trained rule-based (logical) and machine learning (statistical) classification models to predict these labels automatically, and found that a hybrid logical–statistical model performs best for post type and author intent, whereas a purely statistical model is best for addressee. We discuss potential applications for this data, including the analysis of thread conversations in forum data and the identification of key actors within social networks.
first_indexed	2024-04-12T05:32:15Z
format	Article
id	doaj.art-365f9d4aea90451ca4b94aff5dfe8309
institution	Directory Open Access Journal
issn	2193-7680
language	English
last_indexed	2024-04-12T05:32:15Z
publishDate	2018-11-01
publisher	BMC
record_format	Article
series	Crime Science
spelling	doaj.art-365f9d4aea90451ca4b94aff5dfe8309 2022-12-22T03:46:00Z eng BMC Crime Science 2193-7680 2018-11-01 71114 10.1186/s40163-018-0094-4 Automatically identifying the function and intent of posts in underground forums. Andrew Caines (Natural Language & Information Processing, Department of Computer Science & Technology, University of Cambridge); Sergio Pastrana (Cambridge Cybercrime Centre, Department of Computer Science & Technology, University of Cambridge); Alice Hutchings (Cambridge Cybercrime Centre, Department of Computer Science & Technology, University of Cambridge); Paula J. Buttery (Natural Language & Information Processing, Department of Computer Science & Technology, University of Cambridge). http://link.springer.com/article/10.1186/s40163-018-0094-4 Underground forums; Cybercrime; Deviant behaviour; Machine learning; Natural language processing
title	Automatically identifying the function and intent of posts in underground forums
topic	Underground forums; Cybercrime; Deviant behaviour; Machine learning; Natural language processing
url	http://link.springer.com/article/10.1186/s40163-018-0094-4


Similar Items