First Steps towards Data-Driven Adversarial Deduplication

In traditional databases, the entity resolution problem (which is also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that suc...

Bibliographic Details
Main Authors:	Jose N. Paredes, Gerardo I. Simari, Maria Vanina Martinez, Marcelo A. Falappa
Format:	Article
Language:	English
Published:	MDPI AG 2018-07-01
Series:	Information
Subjects:	adversarial deduplication; machine learning classifiers; cyber threat intelligence
Online Access:	http://www.mdpi.com/2078-2489/9/8/189

_version_	1828839724057362432
author	Jose N. Paredes; Gerardo I. Simari; Maria Vanina Martinez; Marcelo A. Falappa
author_sort	Jose N. Paredes
collection	DOAJ
description	In traditional databases, the entity resolution problem (which is also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual objects appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets arising from collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.
first_indexed	2024-12-12T19:27:23Z
format	Article
id	doaj.art-68a91f75c62d4fb78dc8efaa5ff4c7dd
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-12-12T19:27:23Z
publishDate	2018-07-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-68a91f75c62d4fb78dc8efaa5ff4c7dd; 2022-12-22T00:14:28Z; eng; MDPI AG; Information; 2078-2489; 2018-07-01; 9; 8; 189; 10.3390/info9080189; info9080189; First Steps towards Data-Driven Adversarial Deduplication; Jose N. Paredes (0); Gerardo I. Simari (1); Maria Vanina Martinez (2); Marcelo A. Falappa (3); (0) Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina; (1) Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina; (2) Department of Computer Science, Universidad de Buenos Aires (UBA), C1428EGA Ciudad Autonoma de Buenos Aires, Argentina; (3) Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina; http://www.mdpi.com/2078-2489/9/8/189; adversarial deduplication; machine learning classifiers; cyber threat intelligence
title	First Steps towards Data-Driven Adversarial Deduplication
title_sort	first steps towards data driven adversarial deduplication
topic	adversarial deduplication; machine learning classifiers; cyber threat intelligence
url	http://www.mdpi.com/2078-2489/9/8/189

Similar Items