Electronics
  • Feature Paper
  • Article
  • Open Access

16 May 2020

A Comparative Analysis of Cyber-Threat Intelligence Sources, Formats and Languages

1
School of Computing, Electronics and Mathematics, Faculty of Science and Engineering, Plymouth University, Plymouth PL4 8AA, UK
2
School of Computing, Faculty of Technology, University of Portsmouth, Portsmouth PO1 2UP, UK
3
School of Economics and Technology, Faculty of Informatics and Telecommunications, University of Peloponnese, 22131 Tripolis, Greece
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Cybersecurity Services Design

Abstract

The sharing of cyber-threat intelligence is an essential part of multi-layered tools used to protect systems and organisations from various threats. Structured standards, such as STIX, TAXII and CybOX, were introduced to provide a common means of sharing cyber-threat intelligence and have been subsequently much-heralded as the de facto industry standards. In this paper, we investigate the landscape of the available formats and languages, along with the publicly available sources of threat feeds, how these are implemented and their suitability for providing rich cyber-threat intelligence. We also analyse a sample of cyber-threat intelligence feeds, the type of data they provide and the issues found in aggregating and sharing the data. Moreover, the type of data supported by various formats and languages is correlated with the data needs for several use cases related to typical security operations. The main conclusions drawn by our analysis suggest that many of the standards have a poor level of adoption and implementation, with providers opting for custom or traditional simple formats.

1. Introduction

With the advent of the Internet of things (IoT), there has been an unprecedented increase of cyber-attacks, which have evolved and become more sophisticated. Adversaries now use a vast set of tools and tactics to attack their victims with their motivations ranging from intelligence collection to data destruction or financial gain. Understanding the attacker has become more complicated and even more important as this knowledge, if transformed into actionable information, can be used to adapt networks’ defences in an automated manner to better protect the network against possible threats. Cyber-threat intelligence (CTI) focuses on the capabilities, motivations and goals of an adversary and how these could be achieved. Intelligence is the information and knowledge gained about an adversary through observation and analysis; intelligence is not just data, but the outcome of an analysis and must be actionable to meet the needs of current defensive systems that have to deal with and respond to cyber-attacks. Amongst others, examples of CTI include indicators (system artefacts or observables associated with an attack), security alerts, incident reports and threat intelligence, along with any other relevant information on recommended (or vulnerable) security tool configurations [1,2].
The efficient sharing of CTI is at the core of cyber-threat detection and prevention, as it allows building multi-layer automated tools with sophisticated and effective defensive capabilities that continuously analyse the vast amounts of the heterogeneous CTI related to attackers’ tactics, techniques and procedures (TTPs), indicators of ongoing incidents, etc. [3,4]. Given the numerous architectures, products and systems being used as sources of data for information sharing mechanisms, standardised and structured representations of CTI are required to allow a satisfying interoperability level across the various stakeholders [2]. Therefore, considerable effort has been invested during the last decade in standardising the data formats and exchange protocols related to CTI, including recent efforts aimed at promoting CTI for “things” [5]; the making security measurable (MSM) initiative constitutes the most prominent effort toward improving CTI sharing among the various stakeholders [6].
The analysis carried out in this paper considers prominent representatives of CTI formats and languages that have been proposed and further studied in the literature, such as the structured threat information expression (STIX) [7], trusted automated exchange of indicator information (TAXII) [8,9] and cyber observable expression (CybOX) [10]. Among the paper’s goals are to explore the capabilities of the available formats and languages and their capacity to convey various CTI types, to correlate their features with the degree to which they are used by the vast number of CTI sources and to correlate their capabilities with the needs of the typical security use cases in which they are to be used. The above (and other) standardised formats and languages were believed to be the answer to the problem of not having common mechanisms for sharing cyber-threat intelligence. According to [11], STIX is the de facto standard for describing threat intelligence. In a literature review of STIX, TAXII and CybOX, several issues were identified that should be addressed to allow their wide adoption; these include:
  • The headline standards of STIX, TAXII and CybOX have been superseded.
  • The apparent acceptance and utilisation of the standards appeared lower than expected.
  • Much of the body of knowledge found in the literature is outdated mainly due to the rapid change and development of the CTI formats and use.
To address the above issues and provide a state-of-the-art view of the CTI formats, use cases and implementations, the publicly available sources of CTI that share such data were researched along with any related formats and languages.
The organisation of the paper is as follows. We first provide a quick overview of the literature and the current state-of-the-art in Section 2, to have a knowledge base and an informed perspective on the findings and issues encountered. This is followed by Sections 3, 4 and 5, which investigate CTI sources and formats and present the main results of our analysis. We conclude in Section 6.

3. CTI Sources

This section presents several CTI sources that have been examined, which are characterised as being internally sourced, externally sourced observables or feeds, and external open-source intelligence [1,28,29]. It is important to highlight that the examination of the CTI sources was carried out by installing and using the tools provided by the manufacturers, as well as by reading and analysing their documentation and various other online resources.

3.1. Internally Sourced

The CTI obtained from internal sources comprises observable events that have happened on an organisation’s internal network and hosts (referred to as threat indicators in [30]). It can provide indicators about threats having breached the security perimeter, having broken the internal access control rules, having infected a system, or having attempted to get access to a restricted system. Statistical data provide a baseline of the normal behaviour so that any abnormality can be highlighted and investigated; possible sources are given in Table 1. More details about internal CTI sources are provided below.
Table 1. Internal sources of cyber-threat intelligence.
System logs and events. Such information is widely available on devices and applications; it can be easily forwarded to a central facility using tools such as Syslog or Windows event forwarding (WEF). As only certain log messages and events apply to CTI, any central logging system, e.g., a security information and event management (SIEM) system, should apply filters and rule-sets to extract CTI.
Network events. Network devices, such as routers, switches and firewalls, support the simple network management protocol (SNMP), which can be used to send (in near real-time) event messages, known as SNMP traps, to a central server for processing. SNMP traps can be configured for a variety of CTI events in the internal network (e.g., connections requested, login events occurring, etc.).
Network utilisation and traffic profiles. These may indicate abnormal behaviour, such as untrusted or excessive traffic from a client or between clients. Statistics are available in many forms, from simple counters in SNMP and Remote MONitoring (RMON) to detailed IP and protocol data from NetFlow and similarly equipped switches and probes.
Boundary security devices. In addition to the above events, proprietary boundary security devices, such as network intrusion detection systems (NIDS) and web application firewalls (WAF), may have their own application-specific management console that also feeds security events to a SIEM. An example of an alert generated by Suricata NIDS in JSON format is provided below in Listing 1.
Listing 1. Example of CTI (alert) obtained from Suricata.
{
  "timestamp": "2009-11-24T21:27:09.534255",
  "event_type": "alert",
  "src_ip": "192.168.2.7",
  "src_port": 1041,
  "dest_ip": "X.X.250.50",
  "dest_port": 80,
  "proto": "TCP",
  "alert": {
    "action": "allowed",
    "gid": 1,
    "signature_id": 2001999,
    "rev": 9,
    "signature": "ET MALWARE BTGrab.com Spyware Downloading Ads",
    "category": "A Network Trojan was detected",
    "severity": 1
  }
}
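As a minimal sketch of how a central facility might filter such records, the Python snippet below parses an alert shaped like Listing 1 and keeps only alert events at or above a chosen severity. The helper names is_actionable_alert and summarise, and the threshold of 2, are illustrative assumptions rather than part of Suricata or of the tooling used in this study; the sketch also assumes that lower severity numbers denote more severe alerts.

```python
import json

# Sample record mirroring Listing 1 (the masked destination IP is kept as-is).
SAMPLE_ALERT = """
{
  "timestamp": "2009-11-24T21:27:09.534255",
  "event_type": "alert",
  "src_ip": "192.168.2.7",
  "src_port": 1041,
  "dest_ip": "X.X.250.50",
  "dest_port": 80,
  "proto": "TCP",
  "alert": {
    "action": "allowed",
    "gid": 1,
    "signature_id": 2001999,
    "rev": 9,
    "signature": "ET MALWARE BTGrab.com Spyware Downloading Ads",
    "category": "A Network Trojan was detected",
    "severity": 1
  }
}
"""


def is_actionable_alert(event: dict, max_severity: int = 2) -> bool:
    """Return True for alert events whose severity number is <= max_severity.

    Hypothetical filter rule: the threshold is an assumption for illustration.
    """
    if event.get("event_type") != "alert":
        return False
    severity = event.get("alert", {}).get("severity")
    return isinstance(severity, int) and severity <= max_severity


def summarise(event: dict) -> dict:
    """Reduce an alert to the few fields a rule-set might act upon."""
    alert = event["alert"]
    return {
        "when": event["timestamp"],
        "source": f'{event["src_ip"]}:{event["src_port"]}',
        "signature_id": alert["signature_id"],
        "category": alert["category"],
    }


event = json.loads(SAMPLE_ALERT)
if is_actionable_alert(event):
    print(summarise(event))
```

Keeping the filter separate from the summary step means the same rule can be reused for other event types that carry a severity field.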
Anti-virus systems. Corporate anti-virus systems report malware events back to a central console, allowing a comprehensive coverage for the hosts within an organisation; as with boundary devices, this may also feed security events to a SIEM.
Human. An organisation’s staff is often the quickest to recognise that something is wrong; the ability to rapidly spot and report events is something that can be achieved through user awareness and continuous professional security training programs.
Forensic. This CTI includes artefacts gathered from the investigation following a security incident and can be used to bolster security defences. The analysis of infected systems and log files can provide details about the tactics, techniques and procedures (TTPs) used during the attack.

3.2. Externally Sourced Observables

Locating, identifying and analysing the externally sourced observables or feeds formed the bulk of the research that was conducted in this work [30]. A selection of open and free to use sources of CTI was identified along with the formats and languages used, with an emphasis on sources using the STIX/TAXII standard. These community, open-source IoCs and observables typically consist of the observed malicious sources or data, e.g., IP address, domain, URL, file names and hashes. The principal use case is to explore this information to create rule sets for firewalls, network-based and host-based intrusion detection and prevention systems (IDPS), SIEM systems, etc., to block (or alert on seeing) the observable or a matching indicator.
To obtain samples of CTI data, the STIX sources identified as using the TAXII 1.x transport protocol were accessed with the Cabby TAXII client [31], while a simple Python script was written using the CTI TAXII client [32] for TAXII 2.x sources. Other simpler formats, such as text, CSV, JSON, etc., were accessed using a standard web browser or the Linux wget command to review the fields included. The CTI feeds and their respective formats were analysed and compared. Wherever available, the format documentation was downloaded from the source or authoring organisation to allow for a deep understanding of the format used and to contribute to the research and analysis of the formats and languages. Over 275 feeds were identified from the CTI sources, where the first 125 of these (all based on the STIX standard) were selected for analysis; the remaining >150 feeds identified were stored for future analysis. Table 2 shows the quantity and format of the 125 selected feeds obtained from each CTI source; where a feed supports multiple formats, the most complex one was chosen. The formats and languages listed in Table 2 are further examined below (with certain indicative examples) and also discussed later in the paper.
Table 2. CTI Sources’ Formats Used.
Among the above sources, abuse.ch makes several CTI feeds available through projects, such as MalwareBazaar and URLhaus, for sharing information about malware samples along with URLs being used for malware distribution, or the SSL Blacklist that provides information to detect malicious SSL connections and digital certificates used by botnet command and control (C&C) servers. The feeds provided by abuse.ch are comprehensive and are used and re-transmitted by several other providers. A typical example of the CTI shared (with the SHA1 fingerprints of the aforementioned certificates) in a CSV format is shown below in Listing 2.
Listing 2. Example of CTI obtained from abuse.ch.
################################################################
# abuse.ch SSLBL SSL Certificate Blacklist (SHA1 Fingerprints)   #
# Last updated: 2020-05-03 06:46:48 UTC             #
#                              #
# Terms Of Use: https://sslbl.abuse.ch/blacklist/          #
# For questions please contact sslbl [at] abuse.ch         #
################################################################
#
# Listingdate,SHA1,Listingreason
2020-05-03 06:46:48,081cf50a56f59be9b1f9504858a225b80f233cb2,IcedID C&C
2020-05-02 07:48:30,19cf21e6326b6125b023c53df23b74060f4e786e,IcedID C&C
2020-05-02 07:41:15,e5d49e0b12012e40498cc991ae586b3ce05bf2f6,IcedID C&C
2020-05-01 18:01:48,8644711545fc8d1ba02fd4e4424290a06815c320,Adwind C&C
2020-05-01 17:59:19,20373e4d4d11ba0e839378737ee9fc49cb164bbd,ServHelper C&C
...
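A feed in this shape can be consumed with very little code. The following is a minimal sketch, assuming the three-column layout shown in Listing 2 (listing date, SHA1, listing reason) and comment lines starting with "#"; the function name parse_sslbl and the dictionary keys are hypothetical and chosen for illustration only.

```python
import csv
from datetime import datetime

# A few rows copied from Listing 2, including the commented header line.
SAMPLE_FEED = """\
# Listingdate,SHA1,Listingreason
2020-05-03 06:46:48,081cf50a56f59be9b1f9504858a225b80f233cb2,IcedID C&C
2020-05-02 07:48:30,19cf21e6326b6125b023c53df23b74060f4e786e,IcedID C&C
2020-05-01 18:01:48,8644711545fc8d1ba02fd4e4424290a06815c320,Adwind C&C
"""


def parse_sslbl(text):
    """Parse SSLBL-style CSV text into a list of dictionaries.

    Lines starting with '#' and blank lines are treated as comments.
    """
    data_lines = (
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )
    records = []
    for listed, sha1, reason in csv.reader(data_lines):
        records.append({
            "listed": datetime.strptime(listed, "%Y-%m-%d %H:%M:%S"),
            "sha1": sha1.lower(),
            "reason": reason,
        })
    return records


records = parse_sslbl(SAMPLE_FEED)
# Index by fingerprint so a certificate seen on the wire can be checked quickly.
blocklist = {record["sha1"]: record["reason"] for record in records}
print(blocklist.get("8644711545fc8d1ba02fd4e4424290a06815c320"))  # Adwind C&C
```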
Another CTI provider is the service blocklist.de that takes reports from numerous active servers that use fail2ban and similar abuse blocking applications. The lists may be obtained through a direct download or via an API and are single-column text files that contain IP addresses; moreover, such information can be obtained by the DNS real-time blackhole list (RBL), which provides a simple DNS query response mechanism to determine the state of an individual IP address, as in the example that is shown in Listing 3.
Listing 3. Example of CTI obtained via blocklist.de with DNSRBL.
query:
host -t any 112.220.10.1.bl.blocklist.de
response:
112.220.10.1.bl.blocklist.de has address 127.0.0.21
112.220.10.1.bl.blocklist.de descriptive text "Infected System (Service: bruteforcelogin, Last-Attack: 1588509427), see http://www.blocklist.de/en/view.html?ip=1.10.220.112"
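The query name in Listing 3 is the IPv4 address with its octets reversed, followed by the list's DNS zone. The sketch below builds that name and interprets an answer, without performing any network lookup. The function names are hypothetical, and the check that a listed address resolves into 127.0.0.0/8 reflects the usual DNSBL convention, which is assumed here to apply to the zone in question.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed_answer(answer: str) -> bool:
    """Interpret an A-record answer: listed entries resolve into 127.0.0.0/8."""
    return ipaddress.IPv4Address(answer) in ipaddress.IPv4Network("127.0.0.0/8")


print(dnsbl_query_name("1.10.220.112", "bl.blocklist.de"))
# 112.220.10.1.bl.blocklist.de
print(is_listed_answer("127.0.0.21"))
# True
```

The resulting name can be passed to any DNS resolver, such as the host command shown in the listing.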
The lists of IP addresses available for download from blocklist.de can also be protocol-specific (e.g., for SSH, FTP, IMAP and SIP), or can target bots or other attacks, such as the above brute-force attack against a web login; no metadata or other enrichment is provided. Similar information is also provided by Spamhaus, which is a well-known CTI source providing lists of IP address ranges that are involved in sending spam emails (SBL advisory), are compromised by malware and other exploits (XBL advisory), or belong to domains having low reputation (DBL advisory), amongst others. Further to the above, a subset of the SBL list is provided via the don’t route or peer (DROP) list that can be used by firewalls and routers to drop malicious traffic; an example is given below in Listing 4.
Listing 4. Example of CTI obtained from Spamhaus.
; Spamhaus DROP List 2020/04/30 - (c) 2020 The Spamhaus Project
; https://www.spamhaus.org/drop/drop.txt
; Last-Modified: Thu, 30 Apr 2020 14:23:20 GMT
; Expires: Thu, 30 Apr 2020 15:41:23 GMT
1.10.16.0/20 ; SBL256894
1.19.0.0/16 ; SBL434604
1.32.128.0/18 ; SBL286275
2.56.255.0/24 ; SBL444288
2.59.151.0/24 ; SBL444170
...
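To illustrate how a firewall or router helper might use such a list, the sketch below parses DROP-style text (a CIDR range, a semicolon and an SBL reference per line, with comment lines starting with ";") and checks whether an address falls inside any listed range. The function names parse_drop and lookup are hypothetical; a linear scan is used for clarity rather than speed.

```python
import ipaddress

# A few rows copied from Listing 4, including one comment line.
SAMPLE_DROP = """\
; Spamhaus DROP List 2020/04/30 - (c) 2020 The Spamhaus Project
1.10.16.0/20 ; SBL256894
1.19.0.0/16 ; SBL434604
2.56.255.0/24 ; SBL444288
"""


def parse_drop(text):
    """Return a list of (network, sbl_reference) tuples from DROP-style text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        cidr, _, reference = line.partition(";")
        entries.append((ipaddress.ip_network(cidr.strip()), reference.strip()))
    return entries


def lookup(ip, entries):
    """Return the SBL reference of the first range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for network, reference in entries:
        if addr in network:
            return reference
    return None


entries = parse_drop(SAMPLE_DROP)
print(lookup("1.10.20.5", entries))  # SBL256894 (inside 1.10.16.0/20)
print(lookup("8.8.8.8", entries))    # None
```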
On the other hand, the CTI provided by Anomali Limo follows the STIX 2.x standard and is delivered by means of the STAXX open source platform and the Limo TAXII feed. The compliance with the STIX 2.x format is somewhat lazy, since many of the indicators’ metadata are presented in the description field. Several collections are available, providing details about ransomware, cyber-crime, emerging threats (compromised or C&C servers), malware domains, phishing URLs, etc., but some of the feeds are re-transmissions of other sources (e.g., from abuse.ch).

3.3. External Open-Source Intelligence

For this type of CTI, we concentrated on open sources of threat intelligence (OSINT) from publicly available sources that contributed to building and understanding the threat landscape; these tend to be human-readable (and more strategic, as highlighted in [30]) rather than machine-readable, and they are often unstructured. Typical examples are an announcement of a large data leak compromising user data that could be used to access other systems or in phishing attacks, or geopolitical tensions that may increase the risk of cyber-attack. Table 3 provides a brief list and description of the CTI sources that were identified.
Table 3. Externally sourced intelligence.
A wealth of CTI information was available in plentiful supply from news feeds, alerts, antivirus (AV) vendors, etc. In most cases, it was also available in RSS format, which is machine-readable; however, the news or alert content typically contains a link redirecting to a free-format web page that does not easily lend itself to automated consumption and understanding, despite the considerable advances in the areas of natural language processing (NLP) and artificial intelligence (AI). Typical examples of such sources include CERT-EU, Schneier on security, Krebs on security, and SANS institute, amongst others.
Advisories and vulnerability alerts are sources having a standardised CTI format, in many cases using the common vulnerabilities and exposures (CVE) and common weakness enumeration (CWE), as well as the common vulnerability reporting framework (CVRF), which is reviewed next. This information is typically associated with a severity measure in the format of the common vulnerability scoring system (CVSS) and is also linked with the systems affected by the vulnerability through the common platform enumeration (CPE), therefore greatly helping in the dissemination of threat intelligence, but with some limitations. Typical examples of such sources include the national vulnerability database (NVD), Cisco security advisories, Microsoft security portal, Oracle security advisories, Red Hat security advisories, SecurityFocus, etc. In contrast to the previous type of external OSINT sources, these contain (or can readily generate) actionable security information. For example, NVD’s data feeds, apart from incorporating the CVSS string (giving granular information about a vulnerability’s preconditions and impact), also include labels for any external references, such as exploit, patch, mitigation, technical description and product, which can direct tools automating the extraction of actionable information. An example from NVD’s feed in JSON format is provided in Listing 5.
Listing 5. Example of CTI obtained from NVD (truncated/simplified for illustration purposes).
{
  "cve" : {
    "CVE_data_meta" : {
      "ID" : "CVE-2020-0001"
    },
    "problemtype" : {
      "value" : "CWE-269"
    },
    "references" : [ {
      "url" : "https://source.android.com/security/bulletin/2020-01-01",
      "tags" : [ "Vendor Advisory" ]
    } ],
    /* vulnerability description */
  },
  "configurations" : {
    "cpe_match" : [ {
      "vulnerable" : true,
      "cpe23Uri" : "cpe:2.3:o:google:android:10.0:*:*:*:*:*:*:*"
    } ]
  },
  "impact" : {
    "cvssV3" : {
      "version" : "3.1",
      "vectorString" : "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H",
      "attackVector" : "LOCAL",
      "attackComplexity" : "LOW",
      "privilegesRequired" : "LOW",
      "userInteraction" : "NONE",
      "scope" : "UNCHANGED",
      "confidentialityImpact" : "HIGH",
      "integrityImpact" : "HIGH",
      "availabilityImpact" : "HIGH",
      "baseScore" : 7.8,
      "baseSeverity" : "HIGH"
    },
    "exploitabilityScore" : 1.8,
    "impactScore" : 5.9
  }
}
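The scores in Listing 5 can be reproduced from the vector string alone, which shows how much actionable detail the CVSS string carries. The sketch below applies the CVSS v3.1 base-score equations for the Scope Unchanged case only; the metric weights and the roundup rule are those of the CVSS v3.1 specification as recalled here and should be checked against it before any real use. The function names are hypothetical.

```python
import math

# CVSS v3.1 metric weights (PR values are those for Scope Unchanged only).
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": CIA, "I": CIA, "A": CIA,
}


def roundup(value):
    """Round up to one decimal place, avoiding floating-point artefacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def parse_vector(vector):
    """Split 'CVSS:3.1/AV:L/...' into a {metric: value} dictionary."""
    prefix, *pairs = vector.split("/")
    if not prefix.startswith("CVSS:3."):
        raise ValueError("not a CVSS v3.x vector")
    return dict(pair.split(":", 1) for pair in pairs)


def score(vector):
    """Compute base, impact and exploitability scores (Scope Unchanged only)."""
    metrics = parse_vector(vector)
    if metrics["S"] != "U":
        raise NotImplementedError("this sketch covers Scope Unchanged only")
    w = {name: table[metrics[name]] for name, table in WEIGHTS.items()}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    base = 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))
    return {
        "base": base,
        "impact": round(impact, 1),
        "exploitability": round(exploitability, 1),
    }


print(score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H"))
# {'base': 7.8, 'impact': 5.9, 'exploitability': 1.8}
```

For the vector in Listing 5, the impact term is 6.42 × (1 − 0.44³) ≈ 5.87 and the exploitability term is 8.22 × 0.55 × 0.77 × 0.62 × 0.85 ≈ 1.83, whose sum rounds up to the listed base score of 7.8.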
The dark web search focused on finding intelligence, tools and services that are not available on the surface web. Our analysis was conducted using a TOR browser running on a disposable virtual machine to provide some insulation from malicious content. The speed and reliability of connections to .onion sites hampered and frustrated progress. Access to several forums was granted by using anonymised email addresses but it was quite limited without first having gained trust in the community.

4. CTI Formats and Languages

Many CTI formats were identified from CTI sources and the literature; these were selected for further analysis based on their popularity in the literature or the source feeds. Where available, the original specifications, documents, schemas, etc., were examined by installing the right tools and applications. Samples of the formats were identified either from the CTI sources under investigation or the literature. The formats and languages have been classified into four main categories:
  • Standards that have been specifically published for representing the CTI
  • Custom application-specific or vendor-specific formats
  • Commonly used standards that were not designed for representing the CTI
  • Legacy formats, commonly referred to in the literature, but no longer being supported or used
A brief overview of the ones selected for further analysis is provided in the following subsections.

4.1. CTI Standards

STIX is one of our principal research subjects; it is a rich and extensive XML format that was first released in 2012 [33], with the minor revision 1.2 being released in 2015. The aim of STIX was to be a flexible and expressive language for representing cyber information. Where existing formats were used, e.g., MAEC [34], the objective was to integrate rather than duplicate them [7]. This provided a highly flexible format that ultimately led to its downfall, as the nested structures present in the XML documents became too complex and difficult to parse. STIX 1.2 was superseded by release 2.0 and, in 2017, by release 2.1. TAXII is the preferred, but not compulsory, transport mechanism for STIX [35]; there are different versions of TAXII for each release of STIX, which are not compatible with each other.
CybOX provides STIX 1.x the means to express cyber observables, events and other properties [10]. With the advent of STIX 2.1, CybOX has been integrated and is now part of the STIX standard. The principal differences between STIX 2.x and STIX 1.x are in the serialisation from XML to JSON that was designed to make the protocol more lightweight and much simpler for programmers [35]. The structure in STIX 2.x is flat rather than nested, with STIX domain objects (SDO) defined at the top level of the document to simplify parsing and storage; the relationship between the SDOs is accommodated by the introduction of a STIX relationship object (SRO) [36]. The CybOX objects have become cyber observable objects in STIX 2.x (under CybOX 3.0 release [37]) along with MAEC, therefore considerably decreasing complexity. Such changes were accompanied by a change in the management of the STIX project, which moved from MITRE to the OASIS CTI technical committee [38]. The MAEC 5.0 standard was designed for characterising malware using attributes such as behaviours, artefacts and relationships between malware samples [34,39]. This latest release was updated in line with STIX 2.x to maintain compatibility using the same cyber observable objects and JSON serialisation.
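The flat structure described above can be sketched with plain Python dictionaries: two SDOs (an indicator and a malware object) sit at the top level of a bundle, and a separate SRO links them by identifier. This is a minimal sketch assuming the STIX 2.1 property names; the IP address (taken from a documentation range), the malware name ExampleBot, the timestamp and the helper names are all illustrative assumptions, and no STIX library is used.

```python
import json
import uuid

TIMESTAMP = "2020-05-03T06:46:48.000Z"  # fixed, illustrative timestamp


def stix_id(object_type):
    """Build an identifier of the form '<type>--<UUID>'."""
    return f"{object_type}--{uuid.uuid4()}"


def common(object_type):
    """Properties shared by the SDOs and SROs in this sketch."""
    return {
        "type": object_type,
        "spec_version": "2.1",
        "id": stix_id(object_type),
        "created": TIMESTAMP,
        "modified": TIMESTAMP,
    }


# Two STIX domain objects (SDOs), both defined at the top level.
indicator = {
    **common("indicator"),
    "pattern": "[ipv4-addr:value = '198.51.100.7']",  # hypothetical observable
    "pattern_type": "stix",
    "valid_from": TIMESTAMP,
}
malware = {**common("malware"), "name": "ExampleBot", "is_family": False}

# A STIX relationship object (SRO) links the two SDOs by their identifiers.
relationship = {
    **common("relationship"),
    "relationship_type": "indicates",
    "source_ref": indicator["id"],
    "target_ref": malware["id"],
}

# The bundle is only a container; the SRO carries the actual association.
bundle = {
    "type": "bundle",
    "id": stix_id("bundle"),
    "objects": [indicator, malware, relationship],
}
print(json.dumps(bundle, indent=2))
```

Because nothing is nested, a consumer can store each object independently and resolve relationships later by identifier.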
CVRF is another standard, whose format is machine-readable, aimed at the submission and distribution of vulnerability advisories and reports [40]. The utilisation of CVRF by MITRE’s CVE repository, the principal registry of vulnerabilities and exposures, along with active support and feeds from vendors such as Cisco, Oracle and Red Hat, is expected to help establish CVRF as the de facto standard for the distribution of vulnerabilities and security advisories.

4.2. Application and Vendor Specific Formats

CESNET operates a large network infrastructure providing service to higher education and research establishments throughout the Czech Republic; it created the intrusion detection extensible alert (IDEA) to overcome the complexities of other CTI formats [41]. IDEA aims at sharing CTI data that vary in nature; thus, it has to be flexible and extensible while staying simple. The MISP format is the native protocol for communication between the MISP platform instances [42]; this JSON format is highly extensible and widely used by the MISP platform. The collective intelligence framework (CIF) is another widely used CTI aggregation and sharing platform that provides its own JSON format for sharing CTI [43]. Finally, IDS/IPS rules are a long-lived CTI format that can be directly consumed by IDS/IPS applications such as Snort [44] and Suricata [45].

4.3. Commonly Used Standards

These formats were never designed or intended for use as a CTI sharing medium; despite this, the DNS block list (DNSBL), DNS real-time black hole list (DNSRBL) and Text/CSV are the oldest and most widely used formats identified. More precisely, DNSBL and DNSRBL are not downloadable lists of CTI host IPs [46]. Instead, they provide a rapid and efficient DNS-based request/response protocol to determine if an IP or domain exists on a blacklist or whitelist. This is likely one of the oldest methods used to get useful CTI information and is typically used by e-mail spam and malware filters.
Really simple syndication (RSS) is a lightweight XML format that is designed for the distribution of news items [47]. This format has been adopted by several sources for the distribution of CTI, with detailed data available from a central repository. On the other hand, Text/CSV is the simplest and most widely used format of all the CTI source feeds sampled, appearing either as a single-column text list of IPs or URLs (e.g., in the case of black lists) or as rich, multi-IoC comma- or tab-separated values; such feeds provide all the data in the most efficient and compact manner of any format.

4.4. Legacy Formats

The analysis of the final three CTI formats that we noted from the literature was curtailed because they showed no current development, had no active support or were not identified in any of the CTI source feeds examined.
Originally created by Mandiant Inc., under openioc.org, the OpenIOC format was designed to provide a common methodology and format for describing host-based or network-based indicators of compromise [48]. The legacy Mandiant resources and/or tools are available on GitHub, but there is currently no apparent activity [49]. The IODEF format was introduced by the Internet Engineering Task Force in RFC 5070 [50]; its current version 2 is described in RFC 7970. It is an XML-based format for exchanging CTI that is reported in the literature, but no evidence was identified about its current support, despite the second version’s activity in 2016. Finally, the open threat partner exchange (OpenTPX) is an open-source and well-documented JSON format designed for sharing CTI [51]; no feeds were identified and there is no apparent evidence of updates since 2015.

5. Analysis

This section is mainly focused on externally sourced CTI feeds found in Section 3 and Section 4. These sources are discussed after a brief analysis of the other CTI sources from our research.

5.1. Internally Sourced CTI

The CTI from internal sources appears to have quite comprehensive coverage from the HIDS, SIEM and antivirus software provisions available; the majority of these were commercial offerings. CTI obtained from network activity such as network traffic flows, DNS requests, DHCP, ARP, etc. (excluding NIDS) does not appear to be widely utilised, and no further analysis was carried out to determine the effectiveness of current solutions on this type of CTI.

5.2. External Open Source Intelligence

The CTI examined from external open-source intelligence (OSINT) showed a very different context compared to the machine-readable sources and formats. The analysis and application of this CTI is predominantly a manual process of converting human-readable CTI into machine-actionable formats, although some of it was available, with some limitations, in machine-readable formats such as RSS and CVRF. Advances in natural language processing and AI offer significant opportunity in this area. The availability and structure of vulnerabilities and exposures through the CVE standard is well known and widely used [39], but the main drawback of this system is the limited applicability of the information available in a standard format. It should be noted that some vendors provided CVE feeds (e.g., [52,53,54]) that were quite comprehensive in stating the applicable software versions. The consistency and quality of the CTI that was identified from the dark web was found to be poor and mired in unsavoury content, mostly due to the lack of indexing and the controlled access to forums and credible resources. As much of the malicious activity originates from those who inhabit the dark web, it cannot be ignored as a potential source of intelligence.

5.3. CTI Source Feeds, Formats and Languages

The analysis carried out on the CTI source feeds revealed several different types of formats including single-column text feeds, multi-column, rich CSV feeds and more complex formats such as STIX and RSS. Many of these feeds, particularly those available in the more complex formats, were found to be retransmissions of simpler plain text feeds from other CTI sources. Examination of the feeds for evidence of originality (instead of being retransmissions) was not always possible. It is worth noting that some sources were found to be informative, giving details of how or where the CTI data were obtained and, in some cases, how agents could be downloaded, etc. A selection of sources, typically CSV or RSS feeds, provided web portal interfaces to search and examine the CTI data in greater depth. Figure 1 gives an overview of the originality for the threat feeds examined.
Figure 1. CTI source originality.
In the retransmission of CTI data, we found that some original source data can be lost or corrupted. This was typically attributed to poor formatting; to dates having been replaced, thereby misrepresenting the freshness of the data; and to retransmitted or aggregated data appearing as a shadow sighting and giving false significance to the threat. We also observed a common practice of splitting the rich array of CTI types associated with a threat into separate, un-associated types, e.g., IP, domain, etc., diminishing the value of the original cohesive dataset.
In Figure 2, we illustrate the range of CTI types that were represented in the analysed CTI source feeds. IP addresses were the most common type, followed by the description of the threat or malware type and the URLs. From our analysis of the formats we knew that the rich intelligence source feeds could provide a more comprehensive dataset than that available from a simple block list. We compared how many of the sources using complex data formats provided rich CTI feeds. Here, we define rich as the CTI having more than two types represented in the feed, otherwise we consider it as being sparse. Our results are represented below.
Figure 2. CTI types represented.
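The rich/sparse criterion defined above (rich meaning more than two CTI types represented in a feed) is simple enough to state as code. The sketch below assumes each feed record has already been normalised into a dictionary keyed by CTI type; the field names and sample values are hypothetical, and the function is an illustration of the definition rather than the procedure actually used in the analysis.

```python
def cti_types(records):
    """Collect the CTI types that carry a non-empty value in any record."""
    return {
        cti_type
        for record in records
        for cti_type, value in record.items()
        if value
    }


def classify_feed(records):
    """Label a feed 'rich' if more than two CTI types are represented."""
    return "rich" if len(cti_types(records)) > 2 else "sparse"


# Hypothetical, already-normalised samples (documentation-range addresses).
block_list = [{"ip": "198.51.100.7"}, {"ip": "203.0.113.9"}]
malware_feed = [
    {"ip": "198.51.100.7", "url": "http://example.com/payload", "md5": ""},
    {"ip": "", "url": "", "md5": "d41d8cd98f00b204e9800998ecf8427e"},
]

print(classify_feed(block_list))    # sparse
print(classify_feed(malware_feed))  # rich
```

Counting types across the whole feed, rather than per record, matches the feed-level wording of the definition; a per-record count would be a stricter alternative.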
As highlighted in Figure 3, the capability of STIX to represent complex and rich CTI is somewhat underutilised, with most samples containing only sparse CTI. We carried out further analysis of the STIX 1.x format and compared the efficiency found in retransmitted CTI feeds. For example, a single entry <item> in the RSS Malc0de database feed [55] consumed 307 bytes. In contrast, the STIX 1.1 feed representing the indicators of the same single entry from PickUpSTIX [56] consumed 18,153 bytes, roughly 59 times as much. Thus, it is clear that the XML serialisation came with significant overhead and complexity.
Figure 3. Rich vs. sparse CTI.
From the documentation of STIX 2.x, it is known that it can provide a more succinct representation than its 1.x predecessors. We still found that only half of the feeds analysed contained rich CTI data. A common approach taken was to put data in the description or title attributes rather than add additional observable objects or indicators to the feed. We refer to this as the lazy implementation of STIX format. We did note that the STIX feeds containing original content tend to be richer and much better implemented than those simply retransmitting data from other sources.
Complexity was one of the prime reasons for moving from STIX 1.x to 2.x, and the need for keeping things simple is also stated as a goal in the MISP, CIF and IDEA formats. When analysing complex CTI represented in MISP and STIX 2.x documentation, the strength of the formats to cross reference CTI comes to the fore. However, when we compare this to implementations of simpler but still rich CTI, e.g., IPs, file names, file hashes and URLs that are indicators for a strain of malware, without the need for TTPs, sequences of events, actor identities, etc., we see that the simpler formats can better express these.
To further examine how the use of the STIX versions varied between the providers, a common original source was chosen that was retransmitted by both STIX 1.x and 2.x sources. For our comparisons, the abuse.ch ransomware tracker feed was used [57]. The STIX 1.1 feed was sourced from PickUpSTIX [58], which contains better source metadata compared with the Anomali Limo feed [59].
STIX 1.x and 2.x have similar capabilities for representing data complexity, as can easily be seen from Table 4. It was concluded that the Limo source appears to have a somewhat lazy implementation, and further analysis was conducted on the STIX 2.x sources to reveal whether this is a common practice. For this, sixteen samples of TAXII collections were examined from three STIX 2.x source providers to compare how well they utilised the capabilities of the format and structure. The observed data or indicator objects were analysed for containing multiple IoC types in the file (e.g., IP, URL, MD5, etc.); multiple IoCs in either observed-data.objects or indicator.pattern objects; and examples of rich content, e.g., multiple IoCs, related objects, etc.
Table 4. STIX, PickUpSTIX and Limo metadata comparison.
Our results in Table 5 indicate that the analysed STIX 2.x samples gained only a little advantage from using the STIX format.
Table 5. STIX 2.x Feature Use.
From the CTI samples identified in our research, many simpler formats such as CSV and RSS had grouped indicators for a given threat with a common label or tag. STIX uses a combination of observed data structures, indicator patterns and relationships. The STIX bundle object is only a container and does not imply any relationship between the objects contained therein; a relationship object is required to represent this, using the UUIDs of the related objects, along with its own UUID, markings, originator, etc. This can result in a complex document to represent a collection of CTI related to a single threat. This is an area in which the MISP format excels; the sharing of data between MISP instances is threat-centric. Here, a single event file contains all the CTI for a threat; UUIDs are used to cross-reference and form relationships the same as STIX; and the attribute array structures are similar to STIX observables. However, the relationships are embedded with no additional objects or complexity required.
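The structural difference described above can be sketched with plain Python dictionaries. The example is a simplified illustration rather than schema-valid output: it omits several properties that the STIX 2.x and MISP specifications require or recommend (e.g., labels, version markers, timestamps on MISP attributes), and the function names, malware name and IP addresses are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def stix_id(object_type):
    """Build a STIX-style identifier of the form '<type>--<UUID>'."""
    return f"{object_type}--{uuid.uuid4()}"


def build_stix_bundle(malware_name, ip_addresses):
    """Link each IP indicator to the malware with an explicit relationship object."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    malware = {"type": "malware", "id": stix_id("malware"),
               "created": now, "modified": now, "name": malware_name}
    objects = [malware]
    for ip in ip_addresses:
        indicator = {"type": "indicator", "id": stix_id("indicator"),
                     "created": now, "modified": now, "valid_from": now,
                     "pattern": f"[ipv4-addr:value = '{ip}']"}
        # The bundle itself implies nothing, so the link needs its own object.
        relationship = {"type": "relationship", "id": stix_id("relationship"),
                        "created": now, "modified": now,
                        "relationship_type": "indicates",
                        "source_ref": indicator["id"],
                        "target_ref": malware["id"]}
        objects += [indicator, relationship]
    return {"type": "bundle", "id": stix_id("bundle"), "objects": objects}


def build_misp_event(malware_name, ip_addresses):
    """Embed all attributes in one threat-centric event; membership is the link."""
    return {"Event": {
        "uuid": str(uuid.uuid4()),
        "info": malware_name,
        "Attribute": [{"uuid": str(uuid.uuid4()), "type": "ip-dst",
                       "category": "Network activity", "value": ip}
                      for ip in ip_addresses],
    }}


ips = ["198.51.100.7", "198.51.100.8"]
print(len(build_stix_bundle("ExampleLocker", ips)["objects"]))        # 5 objects
print(len(build_misp_event("ExampleLocker", ips)["Event"]["Attribute"]))  # 2 attributes
```

For n indicators the STIX sketch produces 2n + 1 top-level objects, whereas the MISP-style event holds n embedded attributes, which reflects the additional structure needed to express the same association.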
We find a similar situation with STIX markings when compared to MISP tags. In STIX, a marking definition is typically a global object and the indicator objects reference these directly. MISP, which has a rich tag and taxonomy implementation, embeds the tag objects directly. This is very simple but creates the potential for inconsistency between versions of the same tag. As the name suggests, universally unique identifiers (UUIDs), as defined in RFC 4122 [60], provide unique IDs. Several of the CTI formats examined use these to identify and reference CTI data, markings, relationships and more.
CVRF was found to be a rich format that can meet the need to share vulnerabilities; the addition of a revision history within the vulnerability structure would provide a clearer versioning of individual vulnerabilities. The biggest weaknesses observed were the limited compliance from major influencers, the dilution of the format by multiple, equally suitable alternatives, and the insufficient provision of target data and remediations in a consistent and standardised manner. As noted above, there is good vendor support for identifying the applicability of a vulnerability and remediations.
MAEC has good support from Sandbox providers, although there is a dilution from the use of older versions and the widespread availability of platform-specific API formats. MAEC 5.0 leverages STIX 2.x cyber observables, types and languages, but there is no evidence of reciprocal support with no facility to reference or include MAEC content in STIX 2.x, as was available in STIX 1.x.
The platform or API custom formats such as MISP, IDEA, CIF, etc., were used enthusiastically, and they were found to be better suited to their given use case and able to represent the CTI observables and indicators in a succinct yet comprehensive manner. The MISP format has grown from real-world use; the MISP project cites over 6000 installations of the MISP platform, illustrating the wide support in both community and government organisations.
In Table 6, Table 7 and Table 8, the various CTI formats and languages that were researched and analysed are compared to determine how well they are able to convey CTI for different use cases. The criteria are applied to the representation of a single, complete cyber observable, where a single observable can be an event, indicator or similar such single entry, line or item in a list or structure. For example, a single observable could be the CTI indicating the presence of a malware compromise, together with the source of the infection (IP, domain, URL, file, hash, etc.), the destination or target (IP, hostname, domain, vulnerability, etc.) and the threat details (malware name, family, type, etc.). The test applies to dedicated fields or columns that are machine readable and unambiguous; the inclusion of CTI data fields in general-purpose descriptions is ignored.
Table 6. Format and Languages, Assessment Criteria.
Table 7. Formats and languages, use case and features.
Table 8. Typical use case and example CTI.
In Table 7, the formats and languages are graded on how well the test criteria have been met as per the following key: a blank means that the criterion or feature is neither met nor supported; the ‘’ symbol means that the feature is partially supported and some but not all criteria are met; the ‘🞭’ symbol means that the criteria are met or the feature is supported in a satisfactory manner; and the ‘’ symbol means that the requirement criteria and feature requirements are exceeded. Table 8 below describes some very typical example use cases and examples of the types of CTI that those use cases may consume.
From the analysis of the various use cases, CTI formats and sampled feeds, it became clear that some formats were better suited to representing CTI for a given use case, e.g., due to being simpler or richer. This is illustrated in Table 9, where the available formats and languages are correlated against the security use cases according to the information that is given in Table 7.
Table 9. Formats and languages suitability per use case.
For each use case, the format or language achieving the highest suitability score is shown in boldface, with the score ranging from 0 (lowest) to 1 (highest). The expression used for computing the suitability score $s(f,u)$ of any format or language $f$ against some use case $u$ is given by
$$ s(f,u) = \frac{1}{n(u)} \, \#\{\, a \in c(u) : f(a) \text{ covers } u(a) \,\} $$
where the set $c(u)$ is comprised of the criteria/features being applicable for the use case $u$, whose number is $n(u)$. $f(a)$ and $u(a)$ denote the level at which the criterion/feature $a$ is supported and required, respectively. The ordering ‘’, ‘🞭’, ‘’ of the symbols in increasing support of features allows us to determine if the needs of a particular use case are being met. Let us take the email blocklist use case as an example, that is we have $u = \text{email blocklist}$ in the above expression. According to Table 7, this use case requires the features
$$ c(u) = \{\text{Blocklist},\ \text{IPv4 address},\ \text{IPv6 address},\ \text{Email address},\ \text{Domain},\ \text{Complexity}\} $$
and hence $n(u) = 6$. It is immediately seen in Table 7 that the STIX 1.x protocol can adequately support only four out of six features and therefore for $f = \text{STIX 1.x}$ we get $s(f,u) = 4/6 \approx 0.67$, which is also depicted in Table 9. Regarding the two features not accounted for in STIX 1.x, namely Blocklist and Complexity, we see that the former is not supported while the latter implies that the protocol is unnecessarily complex in the way that the information is provided (as stated in the assessment criteria of Table 6). It is interesting to note that the IDEA format (followed by STIX 1.x and MISP) is found to be the most suitable for the majority of the use cases considered, whereas it is located among the next most suitable formats and languages for the remaining ones, something that clearly justifies its design goals. On the other hand, Table 9 shows that the use case of “Firewall/Router ACL” is the one that most formats and languages can largely support.
The direction of the information flow is also a factor in the original design and use of several of the formats that were examined. Table 10 shows the flow direction and the formats noted as most suitable.
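The suitability score can be sketched in a few lines of code. The sketch assumes that "covers" means the supported level is at least the required level, and it encodes the four grades of Table 7 as integers; both the level names and the per-feature values in the example are hypothetical stand-ins chosen only to reproduce the 4/6 result for STIX 1.x, not a transcription of Table 7.

```python
# Hypothetical numeric encoding of the grades, in increasing order of support.
LEVELS = {"none": 0, "partial": 1, "met": 2, "exceeded": 3}


def suitability(format_support, use_case_needs):
    """Fraction of the use case's features whose required level is covered.

    format_support maps feature -> level offered by the format, f(a).
    use_case_needs maps feature -> level required by the use case, u(a).
    A missing feature in format_support is treated as not supported.
    """
    covered = sum(
        1
        for feature, required in use_case_needs.items()
        if LEVELS[format_support.get(feature, "none")] >= LEVELS[required]
    )
    return covered / len(use_case_needs)


# Email blocklist use case: six features, each assumed to be required at 'met'.
email_blocklist = {
    "Blocklist": "met", "IPv4 address": "met", "IPv6 address": "met",
    "Email address": "met", "Domain": "met", "Complexity": "met",
}
# Illustrative STIX 1.x levels: Blocklist unsupported, Complexity only partial.
stix_1x = {
    "Blocklist": "none", "IPv4 address": "met", "IPv6 address": "met",
    "Email address": "met", "Domain": "met", "Complexity": "partial",
}

print(round(suitability(stix_1x, email_blocklist), 2))  # 0.67
```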
Table 10. Format suitability.
CTI data from sensors or detection mechanisms tend to be specific to the source type or detection mechanism used. IDEA is a custom format designed to transport CTI from sensors to a central system. MAEC is quite popular with honeypot providers. CTI collection and aggregation systems, and the extraction of data from them, are best served by the formats that provide the best fit for the data being shared or extracted. Examples are a simple CSV for bulk IP data; CVRF for vulnerabilities; and STIX, MISP and custom JSON formats for a rich representation of CTI. The format used to distribute CTI to cyber protective systems or devices needs to be one that can be directly consumed, e.g., IDS rule sets, IP/domain lists, MD5 signatures, etc. When examining the suitability of the various formats, and given the original use case and design criteria for the formats, the results are as we expected. This does not make any one format better than any other; suitability depends on the use and the requirements.
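As an illustration of converting rich CTI into a directly consumable form, the sketch below flattens a list of attribute records into a plain IP block list. The attribute layout (`type` and `value` keys with `ip-src`/`ip-dst` types) is modelled loosely on MISP-style attributes and is an assumption for this example, as is the function name.

```python
import ipaddress


def to_ip_blocklist(attributes):
    """Extract valid, de-duplicated IP addresses as newline-separated text."""
    addresses = set()
    for attr in attributes:
        if attr.get("type") not in ("ip-src", "ip-dst"):
            continue  # domains, hashes, etc. do not belong in an IP list
        try:
            addresses.add(ipaddress.ip_address(attr["value"]))
        except ValueError:
            continue  # skip malformed entries rather than pass them on
    # Sort IPv4 before IPv6, then numerically, for a stable output.
    ordered = sorted(addresses, key=lambda ip: (ip.version, int(ip)))
    return "\n".join(str(ip) for ip in ordered)


sample = [
    {"type": "ip-dst", "value": "198.51.100.7"},
    {"type": "ip-dst", "value": "198.51.100.7"},   # duplicate
    {"type": "domain", "value": "example.com"},    # not an IP type
    {"type": "ip-src", "value": "999.1.1.1"},      # malformed
    {"type": "ip-src", "value": "2001:db8::1"},
]
print(to_ip_blocklist(sample))
```

The threat context carried by the richer structure is deliberately discarded here, which is the trade-off made whenever CTI is reduced to a list that a firewall or router can consume.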

6. Conclusions

Through research and analysis, it quickly became apparent that the quantity of CTI sources and formats is vast. As noted above, more than half of the threat intelligence feeds sampled from these sources were either retransmitted or of unknown origin. The support for STIX is apparent in many platforms and the consensus from the research would suggest it has industry and community support. However, its use is not widespread and it is often poorly implemented. The trend is to use API or platform-specific formats that are a better fit with the given use case.
The question of which format to use depends on the use case; the creation, coding and use of custom JSON formats is a quick and simple way to meet the requirements of a specific use case, or there may be a preference to adhere to existing standards. Our recommendation would be to use the best fit; the evidence from the research has shown that even the producers and key supporters of standards still produce their own, lightweight, custom JSON formats, regardless of the time scales, processes and ratification needed by standards.
Our recommendation on the distribution and sharing of CTI is to follow the best practice, where applicable, with the common descriptors and conventions in the language. It was found that relying on the IDEA format (and possibly MISP or STIX) might constitute a best practice for the majority of the security use cases considered due to its ability to meet their CTI needs. In addition, most of the formats are capable of supporting access control services being offered by means of a firewall or router.
Many of the issues we encountered with the quality and the distribution of CTI could be reduced by including the origin and freshness/timestamp data in feeds and by keeping threat data complete. Clearly, the vast number of CTI sources offers an opportunity for further research into assessing and improving the quality of CTI feeds. Where resources are constrained, e.g., in IoT devices, better association between the threat and target surface could provide focused CTI able to more effectively protect these devices.

Author Contributions

A.R., S.S., and N.K. contributed equally. The authors read and approved the final manuscript as well as the author order. All authors have read and agreed to the published version of the manuscript.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 786698. This work reflects the authors’ view and the agency is not responsible for any use that may be made of the information it contains.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Roberts, S.J.; Brown, R. Intelligence-Driven Incident Response; O’Reilly Media: Sebastopol, CA, USA, 2017. [Google Scholar]
  2. Menges, F.; Sperl, C.; Pernul, G. Unifying cyber threat intelligence. In Trust, Privacy and Security in Digital Business (TrustBus), Lecture Notes in Computer Science; Springer: Berlin, Germany, 2019; Volume 11711, pp. 161–175. [Google Scholar]
  3. Poputa–Clean, P. SANS Institute, Automated Defense, Using Threat Intelligence to Augment Security. Available online: https://www.sans.org/reading–room/whitepapers/threats/automated–defense–threat–intelligence–augment–35692 (accessed on 3 April 2020).
  4. Appala, S.; Cam–Winget, N.; McGrew, D.A.; Verma, J. An actionable threat intelligence system using a publish–subscribe communications model. In Proceedings of the 2nd ACM Workshop on Information Sharing and Collaborative Security, Denver, CO, USA, 12–16 October 2015; pp. 61–70. [Google Scholar]
  5. Wagner, T.D. Cyber Threat Intelligence for “Things”. In Proceedings of the 2019 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (Cyber SA), Oxford, UK, 3–4 June 2019; pp. 1–2. [Google Scholar]
  6. MITRE Corp. Making Security Measurable. 2018. Available online: https://msm.mitre.org/ (accessed on 3 April 2020).
  7. Barnum, S. Standardizing cyber threat intelligence information with the Structured Threat Information eXpression (STIX). 2014. Available online: http://www.standardscoordination.org/sites/default/files/docs/STIX_Whitepaper_v1.1.pdf (accessed on 3 April 2020).
  8. Connolly, J.; Davidson, M.; Richard, M.; Skorupka, C. Trusted Automated eXchange of Indicator Information (TAXII™). 2012. Available online: http://taxii.mitre.org/about/documents/Introduction_to_TAXII_White_Paper_November_2012.pdf (accessed on 3 April 2020).
  9. OASIS Open Introduction to TAXII. 2018. Available online: https://oasis–open.github.io/cti–documentation/taxii/intro.html (accessed on 3 April 2020).
  10. MITRE Corp. Cyber Observable eXpression (CybOX™) Archive Website. 2017. Available online: http://cyboxproject.github.io/ (accessed on 3 April 2020).
  11. Sauerwein, C.; Sillaber, C.; Mussmann, A.; Breu, R. Threat Intelligence Sharing Platforms: An Exploratory Study of Software Vendors and Research Perspectives. In Proceedings of the 13th International Conference on Wirtschaftsinformatik, St. Gallen, Switzerland, 12–15 February 2017. [Google Scholar]
  12. Zrahia, A. Threat intelligence sharing between cybersecurity vendors: Network, dyadic, and agent views. J. Cybersecur. 2018, 4, 1–16. [Google Scholar] [CrossRef]
  13. Brown, S.; Gommers, J.; Serrano, O. From Cyber Security Information Sharing to Threat Management. In Proceedings of the 2nd ACM Workshop on Information Sharing and Collaborative Security, Denver, CO, USA, 12–16 October 2015; pp. 43–49. [Google Scholar]
  14. Liu, R.; Zhao, Z.; Sun, C.; Yang, X.; Gong, X.; Zhang, J. A Research and Analysis Method of Open Source Threat Intelligence Data. In Proceedings of the 3rd International Conference of Pioneering Computer Scientists, Engineers and Educators (ICPCSEE), Changsha, China, 22–24 September 2017; Part I, Communications in Computer and Information Science. Springer: Berlin, Germany, 2017; Volume 727, pp. 352–363. [Google Scholar]
  15. Sauerwein, C.; Pekaric, I.; Felderer, M.; Breu, R. An analysis and classification of public information security data sources used in research and practice. Comput. Secur. 2019, 82, 140–155. [Google Scholar] [CrossRef]
  16. Abu, M.; Selamat, S.; Ariffin, A.; Yusof, R. Cyber Threat Intelligence—Issue and Challenges. Indones. J. Electr. Eng. Comput. Sci. 2018, 10, 371–379. [Google Scholar]
  17. Pala, A.; Zhuang, J. Information sharing in cybersecurity: A review. Decis. Anal. 2019, 16, 1–25. [Google Scholar] [CrossRef]
  18. Tounsi, W.; Rais, H. A survey on technical threat intelligence in the age of sophisticated cyber attacks. Comput. Secur. 2018, 72, 212–233. [Google Scholar] [CrossRef]
  19. Menges, F.; Pernul, G. A comparative analysis of incident reporting formats. Comput. Secur. 2018, 73, 87–101. [Google Scholar] [CrossRef]
  20. Mavroeidis, V.; Bromander, S. Cyber threat intelligence model: An evaluation of taxonomies, sharing standards, and ontologies within cyber threat intelligence. In Proceedings of the 2017 European Intelligence and Security Informatics Conference (EISIC), Athens, Greece, 11–13 September 2017; pp. 91–98. [Google Scholar]
  21. Burger, E.W.; Goodman, M.D.; Kampanakis, P.; Zhu, K.A. Taxonomy model for cyber threat intelligence information exchange technologies. In Proceedings of the ACM Workshop on Information Sharing & Collaborative Security (WISCS), Scottsdale, AZ, USA, 3 November 2014; pp. 51–60. [Google Scholar] [CrossRef]
  22. Asgarli, E.; Burger, E. Semantic ontologies for cyber threat sharing standards. In Proceedings of the 2016 IEEE Symposium on Technologies for Homeland Security (HST), Waltham, MA, USA, 10–11 May 2016; pp. 1–6. [Google Scholar]
  23. Serrano, O.; Dandurand, L.; Brown, S. On the Design of a Cyber Security Data Sharing System. In Proceedings of the 2014 ACM Workshop on Information Sharing & Collaborative Security, Scottsdale, AZ, USA, 3 November 2014; pp. 61–69. [Google Scholar]
  24. Sullivan, C.; Burger, E. “In the public interest”: The privacy implications of international business-to-business sharing of cyber-threat intelligence. Comput. Law Secur. Rev. 2017, 33, 14–29. [Google Scholar] [CrossRef]
  25. Wagner, T.D.; Mahbub, K.; Palomar, E.; Abdallah, A.E. Cyber threat intelligence sharing: Survey and research directions. Comput. Secur. 2019, 87, 101589. [Google Scholar] [CrossRef]
  26. Zibak, A.; Simpson, A. Cyber threat information sharing: Perceived benefits and barriers. In Proceedings of the 14th International Conference on Availability, Reliability and Security, Canterbury, UK, 26–29 August 2019; pp. 1–9. [Google Scholar] [CrossRef]
  27. Wagner, C.; Dulaunoy, A.; Wagener, G.; Iklody, A. MISP: The Design and Implementation of a Collaborative Threat Intelligence Sharing Platform. In Proceedings of the 2016 ACM on Workshop on Information Sharing and Collaborative Security, Vienna, Austria, 24 October 2016. [Google Scholar] [CrossRef]
  28. Skopik, F. Collaborative Cyber Threat Intelligence: Detecting and Responding to Advanced Cyber Attacks at National Level; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  29. Farnham, G. Tools and Standards for Cyber Threat Intelligence Projects; SANS Institute InfoSec Reading Room: Bethesda, MD, USA, 2013. [Google Scholar]
  30. Friedman, J.; Bouchard, M. Definitive Guide to Cyber Threat Intelligence; CyberEdge: Annapolis, MD, USA, 2015. [Google Scholar]
  31. EclecticIQ. Cabby—TAXII Client Implementation. 2018. Available online: https://github.com/EclecticIQ/cabby (accessed on 3 April 2020).
  32. OASIS Open. OASIS TC Open Repository: TAXII 2 Client Library Written in Python. 2018. Available online: https://github.com/oasis–open/cti–taxii–client (accessed on 3 April 2020).
  33. MITRE Corp. The MITRE Corporation. 2018. Available online: https://www.mitre.org/ (accessed on 3 April 2020).
  34. MITRE Corp. About MAEC. 2018. Available online: http://maecproject.github.io/about–maec/ (accessed on 3 April 2020).
  35. OASIS Open. Introduction to STIX. 2018. Available online: https://oasis–open.github.io/cti–documentation/ (accessed on 3 April 2020).
  36. OASIS. Introduction to STIX. 2018. Available online: https://oasis–open.github.io/cti–documentation/stix/intro (accessed on 3 April 2020).
  37. OASIS. OASIS CTI CybOX Subcommittee. 2018. Available online: https://www.oasis–open.org/committees/tc_home.php?wg_abbrev=cti–cybox (accessed on 3 April 2020).
  38. OASIS. OASIS Cyber Threat Intelligence (CTI) TC. 2017. Available online: https://www.oasis–open.org/committees/tc_home.php?wg_abbrev=cti (accessed on 3 April 2020).
  39. MITRE Corp. CVE—Common Vulnerabilities and Exposures. 2018. Available online: http://cve.mitre.org/index.html (accessed on 3 April 2020).
  40. OASIS Open. CSAF Common Vulnerability Reporting Framework (CVRF) Version 1.2. 2017. Available online: https://docs.oasis-open.org/csaf/csaf-cvrf/v1.2/cs01/csaf-cvrf-v1.2-cs01.html (accessed on 3 April 2020).
  41. CESNET. Intrusion Detection Extensible Alert. 2018. Available online: https://www.cesnet.cz/en/index (accessed on 3 April 2020).
  42. CIRCL. Malware Information Sharing Platform MISP—A Threat Sharing Platform. 2018. Available online: https://www.circl.lu/services/misp–malware–information–sharing–platform/ (accessed on 3 April 2020).
  43. CSIRT Gadgets LLC. CSIRT Wiki, Getting Started—Welcome to the CSIRTG–EX Software Development Kit. 2018. Available online: https://github.com/csirtgadgets/csirtg/wiki (accessed on 3 April 2020).
  44. Cisco. Snort. 2018. Available online: https://snort.org/ (accessed on 3 April 2020).
  45. OISF. Suricata Open Source IDS / IPS / NSM engine. 2018. Available online: https://suricata–ids.org/ (accessed on 3 April 2020).
  46. Spamhaus. Understanding DNSBL Filtering. 2018. Available online: https://www.spamhaus.org/whitepapers/dnsbl_function/ (accessed on 3 April 2020).
  47. Winer, D. RSS 2.0 Specification. Available online: https://cyber.harvard.edu/rss/rss.html (accessed on 3 April 2020).
  48. FireEye, Inc. Free Security Software—IOC Tools (Indicator of Compromise). Available online: https://www.fireeye.com/services/freeware.html (accessed on 3 April 2020).
  49. Mandiant. GitHub Repository. Available online: https://github.com/mandiant (accessed on 3 April 2020).
  50. Danyliw, R. Internet Engineering Task Force (IETF), RFC 7970. The Incident Object Description Exchange Format Version 2. Available online: https://tools.ietf.org/html/rfc7970 (accessed on 3 April 2020).
  51. Lookingglass. Welcome to the OpenTPX Project! Available online: https://opentpx.org/ (accessed on 3 April 2020).
  52. Cisco Security Alerts. Available online: https://tools.cisco.com/security/center/cvrf_20.xml. (accessed on 3 April 2020).
  53. Oracle Security & Patch Update Advisories. Available online: http://www.oracle.com/ocom/groups/public/@otn/documents/webcontent/1932662.xml. (accessed on 3 April 2020).
  54. Red Hat Security Advisories. Available online: https://www.redhat.com/security/data/cvrf/ (accessed on 3 April 2020).
  55. Malc0de Database. Available online: http://malc0de.com/database/ (accessed on 3 April 2020).
  56. NC4 Soltra. Connecting to PickupSTIX. Available online: https://www.soltra.com/en/documentation/ctx–soltra–edge/connecting–to–pickupstix/ (accessed on 3 April 2020).
  57. Abuse.Ch. Ransomware Tracker. 2016. Available online: https://ransomwaretracker.abuse.ch/tracker/ (accessed on 3 April 2020).
  58. NC4 / Soltra LLC, PickUpStix. Available online: https://www.soltra.com/en/documentation/ctx–soltra–edge/connecting–to–pickupstix/ (accessed on 3 April 2020).
  59. Anomali, Limo—Free Intel Feed. Available online: https://www.anomali.com/platform/limo (accessed on 3 April 2020).
  60. Leach, P.; Mealling, M.; Salz, R. RFC4122, A Universally Unique IDentifier (UUID) URN Namespace. Available online: https://tools.ietf.org/html/rfc4122 (accessed on 3 April 2020).
