Open Access Article
Information 2018, 9(8), 189

First Steps towards Data-Driven Adversarial Deduplication

Department of Computer Science and Engineering, Universidad Nacional del Sur (UNS), 8000 Bahia Blanca, Argentina
Institute for Computer Science and Engineering (CONICET–UNS), 8000 Bahia Blanca, Argentina
School of Computing, Informatics, and Decision Systems Engineering (CIDSE), Arizona State University, Tempe, AZ 85281, USA
Department of Computer Science, Universidad de Buenos Aires (UBA), C1428EGA Ciudad Autonoma de Buenos Aires, Argentina
Institute for Computer Science Research (CONICET–UBA), C1428EGA Ciudad Autonoma de Buenos Aires, Argentina
These authors contributed equally to this work.
Author to whom correspondence should be addressed.
Received: 26 June 2018 / Revised: 20 July 2018 / Accepted: 23 July 2018 / Published: 27 July 2018
(This article belongs to the Special Issue Darkweb Cyber Threat Intelligence Mining)


In traditional databases, the entity resolution problem (also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. In both theory and practice, it is widely assumed that such sets of virtual objects arise from clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address the problem under the assumption that the duplication is instead caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets whose participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to be able to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. We study this problem via examples drawn from real-world data on malicious hacker forums and markets, obtained through collaborations with a cyber threat intelligence company that focuses on understanding this kind of behavior. We argue that it is very difficult, if not impossible, to find ground truth data on which to build solutions to this problem, and we develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities in fighting cyber threats.
Keywords: adversarial deduplication; machine learning classifiers; cyber threat intelligence
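The abstract describes training classifiers on text to surface potential duplicate entities, without giving details here. As a rough illustration only, the sketch below uses a deliberately simple stand-in, a nearest-centroid classifier over character trigrams, which is not claimed to be the classifier or feature set used in the paper. All function names, the two thresholds (`min_similarity`, `min_share`), and the toy posts are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(posts_by_user):
    """Build one summed n-gram profile (centroid) per known user."""
    profiles = {}
    for user, posts in posts_by_user.items():
        profile = Counter()
        for post in posts:
            profile.update(char_ngrams(post))
        profiles[user] = profile
    return profiles


def predict(profiles, post):
    """Return (most_similar_known_user, similarity) for a single post."""
    grams = char_ngrams(post)
    scored = ((user, cosine(grams, prof)) for user, prof in profiles.items())
    return max(scored, key=lambda pair: pair[1])


def flag_potential_duplicates(profiles, posts_by_handle,
                              min_similarity=0.3, min_share=0.6):
    """Flag (handle, known_user, share) when enough of a handle's posts
    are attributed to the same known user with sufficient similarity.
    Both thresholds are arbitrary assumptions, not values from the paper."""
    flagged = []
    for handle, posts in posts_by_handle.items():
        votes = Counter()
        for post in posts:
            user, score = predict(profiles, post)
            if score >= min_similarity:
                votes[user] += 1
        if not votes:
            continue
        known, count = votes.most_common(1)[0]
        share = count / len(posts)
        if share >= min_share:
            flagged.append((handle, known, share))
    return flagged


# Invented toy data; real forum posts would be far noisier.
known_posts = {
    "vendor_a": ["selling fresh exploit kit, escrow accepted",
                 "exploit kit updated, escrow accepted as always"],
    "vendor_b": ["looking for carding tutorials pls help",
                 "anyone got carding tutorials pls"],
}
unseen_posts = {
    "new_handle": ["fresh exploit kit for sale, escrow accepted",
                   "escrow accepted, exploit kit updated today"],
    "other": ["zzzz qqqq xxxx"],
}

profiles = train_profiles(known_posts)
flagged = flag_potential_duplicates(profiles, unseen_posts)
for handle, known, share in flagged:
    print(f"{handle} resembles {known} (share of posts: {share:.2f})")
```

Because ground truth is scarce in this setting, the output of a sketch like this is best read as a ranked list of candidate pairs for a human analyst to review, not as a confirmed identity match.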


This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Share & Cite This Article

MDPI and ACS Style

Paredes, J.N.; Simari, G.I.; Martinez, M.V.; Falappa, M.A. First Steps towards Data-Driven Adversarial Deduplication. Information 2018, 9, 189.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Information, EISSN 2078-2489. Published by MDPI AG, Basel, Switzerland.