Attack Analysis Framework for Cyber-Attack and Defense Test Platform

In 2012, Google first proposed the knowledge graph and applied it in the field of intelligent searching. Subsequently, knowledge graphs have been used for in-depth association analysis in different fields. In recent years, composite attacks have been discovered through association analysis in the field of cyber security. This paper proposes an attack analysis framework for cyber-attack and defense test platforms, which stores prior knowledge in a cyber security knowledge graph and attack rule base as data that can be understood by a computer, sets the time interval of analysis on the Spark framework, and then mines attack chains from massive data with spatiotemporal constraints, so as to achieve the balance between automated analysis and real-time accurate performance. The experimental results show that the analysis accuracy depends on the completeness of the cyber security knowledge graph and the precision of the detection results from security equipment. With the rational expectation about more exposure of attacks and faster upgrade of security equipment, it is necessary and meaningful to constantly improve the cyber security knowledge graph in the attack analysis framework.


Introduction
The knowledge graph was proposed by Google in 2012 and successfully applied to search engines afterwards [1]. The advantage of applying the knowledge graph in the field of intelligent searching is to exchange space for time. In recent years, the knowledge graph has also been used to analyze the relationships between entities to discover and analyze problems in certain areas, such as finance and intelligence. In the field of network security, computer networks naturally have the characteristics of knowledge graphs. A computer network is composed of multiple computing nodes, and there are network connection relationships among these computing nodes, so a computer network can form a knowledge graph.
A knowledge graph constructed based on a computer network is called a cyber security knowledge graph (CSKG), and the CSKG includes two parts: a security knowledge graph (SeKG) and a scene knowledge graph (ScKG). The SeKG includes known information about vulnerabilities, attacks, assets and the relationships among them. The above information can be obtained from various vulnerability and attack analysis websites and can be updated gradually. The ScKG includes network node information and network connectivity information involved in specific attacks. Generally, the SeKG is the core graph and the ScKG is the extended graph.
The data source of the CSKG is known cyber security knowledge, which can be transformed into different types of attack information, such as attack source, attack mode, attack object and vulnerability information. The data source of the ScKG is all the available information about the current attack.
In the field of cyber security, the mainstream research direction is anomaly detection based on machine learning. However, anomaly detection can only report whether behavior is normal or abnormal; it does not describe the specific state of the current network or node, which can only be presented by situational awareness. Situational awareness is a way to improve the ability to discover, identify, understand, analyze and respond to security threats from a global perspective based on security big data, and is usually classified into detection, analysis, prediction and defense. This paper focuses on analysis; prediction and defense are outside its scope.
Attack analysis, as treated in this paper, presupposes an understanding of the general steps of testing cyber-attacks together with security analysis experience, and draws on the detection results of security devices and the characteristics of attacks. It involves a great deal of manual work, and because experience varies from analyst to analyst, so do the time and accuracy of the analysis. This raises a question: can the prior knowledge of security analysts be stored in a computer according to certain rules, and can cyber-attacks then be analyzed automatically with the help of computer algorithms and programs?
Based on the above ideas, this paper proposes an attack analysis framework for cyber-attack and defense test platforms. First, a CSKG comprising the SeKG and the ScKG is built, and a threat element association base and a composite attack rule base are set up. Second, the multi-source heterogeneous input data are fused and threat elements are extracted. The extracted threat elements are matched in sequence against the threat element association base. If a match succeeds, association statistics are computed for the threat elements before they are matched against the CSKG; if it fails, the threat elements are matched against the CSKG directly. Threat elements that match the CSKG return single-step attacks, while those that do not are added to the CSKG. Because false positives occur, the ScKG is used to determine whether each single-step attack is effective, and ineffective single-step attacks are deleted. Finally, association analysis (using the composite attack rule base and space-time attribute constraints) links the effective single-step attacks that belong to the same composite attack and outputs them in the form of attack chains.

Related Work
The concept of ontology [2] was first applied in the field of philosophy. The German scholar Studer gave the definition of ontology in 1998: an ontology is a formal specification of shared conceptual models [3]. An ontology is a conceptual model that abstracts objective facts from the objective world. There are many ontology construction methods, such as the IDEF-5 (Integrated Computer Aided Manufacturing DEFinition-5) method, the TOVE (TOronto Virtual Enterprise) modeling method, the Methontology method, the Skeletal Methodology method, the seven-step method and the construction of a domain ontology based on a thesaurus. Most current ontology construction is manual, and ontologies cannot yet be constructed automatically. Comez-Perez et al. [4] proposed a classification method to organize ontologies and express the ontology model by using concepts, relationships, functions, axioms and examples. Guarino et al. [5] subdivided an ontology into top-level, domain, task, method and application ontologies according to the degree of domain dependency. Wang et al. [6] established an attack ontology model according to the classification of attacks, and strictly defined the logical relationships and hierarchical structure of the ontology in the ontology model, thereby achieving effective reuse of attack knowledge. Wang et al. [6] proposed a threat intelligence ontology model, which could not only extract entities and entity relations, but also provide visual display. Tao et al. [7] proposed an ontology construction model that used servers, network devices and security devices as nodes, business data flow as the relationship connecting two nodes, and the direction of business data flow as the direction of the relationship. Wang et al. [8] constructed a network security knowledge graph, parallelized the association analysis algorithm and realized a distributed association analysis system.
Electronics 2020, 9, 1413
The need for comprehensive analysis of cyber-attacks has drawn considerable attention to cyber security event association analysis technology. In recent years, many authors have summarized the cyber security event association analysis methods. Yan et al. [9] summarized the different technical methods of security event association analysis based on attribute characteristics, logic reasoning, probability statistics and machine learning. Attribute-based security event association analysis commonly uses the finite state machine and rule-based association analysis methods. Mastani et al. [10] proposed an association analysis model based on the state machine; Forgy et al. [11] proposed the RETE algorithm, which has been the most effective algorithm based on forward chain reasoning; and Gu et al. [12] described the principle of the rule engine and improved the RETE algorithm. The common methods of security event association analysis based on logical reasoning include case inference and model inference. Esmaili et al. [13] proposed an association technology based on case inference, and Chen et al. [14] proposed an association technology based on model inference. The common methods of security event association analysis based on probability and statistics include dependency graph-based association analysis, the Bayesian network model and the Markov model. Rubin et al. [15] proposed a dependency graph-based association analysis technology. The common methods of security event association analysis based on machine learning include the artificial neural network (ANN), the support vector machine (SVM), and so on. In addition, open source association analysis tools mainly include Swatch [16], SEC [17], OSSEC [18], OSSIM [19], Drools [20], Esper [21], and so on. In terms of attack scenario restoration, Peng et al. [22] proposed an alert correlation method based on the causality of attacks. Zhang et al. [23] proposed a real-time alert correlation analysis method based on an attack plan graph, which improved the attack scenario.
The rest of this paper is organized in the following way: Section 3 introduces the relevant content of the attack analysis framework, Section 4 analyzes the experimental results, and Section 5 summarizes the advantages and disadvantages of the method proposed in this paper, and points out the direction of future work.

Attack Analysis Framework
At present, research into cyber-attacks is carried out in the form of simulations. Thus, the cyber-attack and defense test platform came into being. It integrates all kinds of common equipment in the field of cyber security attack and defense, and uses professional attack and defense methods to provide simulation scenarios.
The cyber-attack and defense test platform provides attack and defense drills and a variety of collection and detection systems. The collection systems can implement terminal collection, traffic collection, honeypot collection, log collection, IDS alert collection, vulnerability scanning, virus scanning, and so on. The detection systems mainly perform preliminary detection on the collected data. The attack analysis framework proposed in this paper builds on this platform: the preliminary detection results are the input of the framework, which analyzes the current cyber status with the help of the CSKG and the association analysis method. The workflow of the framework is shown in Figure 1. The core of the framework comprises the cyber security knowledge graph module and the association analysis module.

Cyber Security Knowledge Graph Construction
In this paper, the steps to construct the cyber security knowledge graph are divided into ontology construction, ontology data model construction, self-defined reasoning rules determination and entity extraction.

Security Knowledge Ontology Model
This paper uses the five-tuple model proposed by Jia Yan et al. [24] to construct a security knowledge graph. The five-tuple model includes entities, attributes of entities, relationships between entities, and reasoning rules. We use SeKG (security knowledge graph) to represent the security knowledge graph, SO (security ontology) to represent the security knowledge ontology, SI (security instance) to represent the security instance, SOP (security object properties) to represent the relationships between the security ontologies, SDP (security data properties) to represent the properties of the security instance and SRR (security reasoning rule) to represent the reasoning rules of security knowledge. The formal expression is as follows: $SeKG = \langle SO, SI, SOP, SDP, SRR \rangle$.
(1) $SO = \{SO_i \mid i = 1, \ldots, n\}$. The security knowledge ontology is a set of concepts summarized and abstracted from security knowledge, usually divided into multi-level ontologies. For example, asset is a first-level ontology that is divided into two second-level ontologies: software and hardware. Software is divided into third-level ontologies, including operating system, application software, etc. Hardware is divided into third-level ontologies, including PC, server, switch, etc. Operating system is divided into fourth-level ontologies, including the Windows operating system and the Linux operating system. The Linux operating system is divided into fifth-level ontologies, including the Ubuntu, SUSE, Debian, Redhat and CentOS operating systems. Similarly, vulnerability is a first-level ontology that is divided into second-level ontologies, including DDoS vulnerability, privilege escalation vulnerability, buffer overflow vulnerability, etc. The same applies to the ontology division of attacks, Trojans, worms, snort alerts and security events.
(3) $SDP = \{\langle SI_i, Pro_{ij}, value_j \rangle\}$. The data attributes of the security instance, such as the version number of the operating system, and the discovery time, update time and hazard level of the vulnerability.
(4) $SOP = \{\langle SO_i, Rcc, SO_j \rangle \mid \langle SO_i, Rci, SI_i \rangle \mid \langle SI_i, Rcc, SI_j \rangle\}$. The object properties are the relationships between ontologies, between ontologies and instances, and between security instances. For example, the relationship between multi-level ontologies is subClassOf, such as (asset, subClassOf, hardware), (asset, subClassOf, software); the relationships between different ontologies include hasExit and exploit, such as (Windows operating system, hasExit, buffer overflow vulnerability), (buffer overflow attack, exploit, buffer overflow vulnerability). After adding instances to the ontology, the relationship between ontology and instance is instanceOf, such as (CVE-2019-1010298, instanceOf, buffer overflow vulnerability), and so on.
(5) $SRR = \{\langle SI_i, newR_{ij}, SI_j \rangle \mid \langle SO_i, newR_{ij}, SI_j \rangle \mid \langle SI_i, Pro_{ij}, newValue_j \rangle\}$. Based on the SeKG, reasoning rules can be used to infer new attributes of the security instance and new relationships between security instances.
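As an illustration of how such triples might be held in memory and queried, the following is a minimal sketch, not the storage used in the paper. The triples are the examples given above, written in the order the text writes them; the index layout and the helper `attacks_for_instance` are assumptions introduced here.

```python
from collections import defaultdict

# Example triples taken from the text, in the order the paper writes them.
TRIPLES = [
    ("asset", "subClassOf", "hardware"),
    ("asset", "subClassOf", "software"),
    ("Windows operating system", "hasExit", "buffer overflow vulnerability"),
    ("buffer overflow attack", "exploit", "buffer overflow vulnerability"),
    ("CVE-2019-1010298", "instanceOf", "buffer overflow vulnerability"),
]


def build_index(triples):
    """Index triples by (subject, relation) and by (relation, object)."""
    by_subject = defaultdict(set)
    by_object = defaultdict(set)
    for subject, relation, obj in triples:
        by_subject[(subject, relation)].add(obj)
        by_object[(relation, obj)].add(subject)
    return by_subject, by_object


by_subject, by_object = build_index(TRIPLES)


def attacks_for_instance(instance):
    """Hypothetical helper: follow instanceOf to the ontology, then
    collect every attack ontology that exploits that ontology."""
    attacks = set()
    for ontology in by_subject[(instance, "instanceOf")]:
        attacks |= by_object[("exploit", ontology)]
    return attacks
```

Looking up `attacks_for_instance("CVE-2019-1010298")` walks from the instance to the buffer overflow vulnerability ontology and from there to the attack that exploits it, which is the kind of association the SeKG is meant to make available.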

Ontology-Instance Data Model
The security knowledge used in this paper mainly comes from six aspects: vulnerability database, virus database, snort alert, preliminary detection results, log information and attack classification. This paper is mainly concerned with the relationships between vulnerabilities, viruses, snort alerts, preliminary detection results, log information and attacks. We classify the relationships into three types: 1:1, n:1 and 1:n.
In particular, log information can be incorporated into security events. Instances of security event and snort alert are directly stored in the corresponding model based on prior knowledge. For certain instances of vulnerability, Trojan and worm, the entity-relationship needs to be extracted directly, and for other instances, the entity needs to be extracted and then added to the corresponding model. Although Trojans and worms are both viruses, they have specific classifications and different relationships with attacks, so they are treated as separate ontologies in this paper.
(1) 1:1 data model. Take vulnerability as an example. If all instances in each category of vulnerability are associated with the same attack, as shown in Figure 2a, then the 1:1 ontology model can be constructed between vulnerability and attack. If not, as shown in Figure 2b, CVE5, the instance in blue, differs from the others because it connects not only to attack2 but also to attack6; it needs to be extracted directly, and the rest of the instances are used to construct the ontology model between vulnerability and attack. Construct the attack ontology, security event ontology, vulnerability ontology, Trojan ontology, worm ontology, snort alert ontology and secondary ontology, determine the relationships between secondary ontologies, and then add instances to the secondary ontologies respectively. The data model is shown in Figure 3.
(2) n:1 data model. In the exceptional case shown in Figure 4b, CVE11, the instance in blue, differs from the others because it connects to both attack6 and attack2; it needs to be extracted directly, and the rest of the instances are used to construct the ontology model between vulnerability and attack. Construct the attack ontology, vulnerability ontology, Trojan ontology, worm ontology and secondary ontology, determine the relationships between secondary ontologies, and then add instances to the secondary ontologies respectively. The data model is shown in Figure 5.
(3) 1:n data model. If the instances are associated as shown in Figure 6a, then the 1:n ontology model can be constructed between vulnerability, Trojan, worm and attack. If not, as shown in Figure 6b, T-3 and W-1, the instances in blue, differ from the others because they connect not only to attack10 but also to attack1; they need to be extracted directly, and the rest of the instances are used to construct the ontology model between vulnerability, Trojan, worm and attack. Construct the attack ontology, security event ontology, vulnerability ontology, Trojan ontology, worm ontology, snort alert ontology and secondary ontology, determine the relationships between secondary ontologies, and then add instances to the secondary ontologies respectively. The data model is shown in Figure 7.
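The separation of regular instances from those that must be extracted directly can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how the exceptional instances are identified, so the sketch assumes that, within each category, the most common set of associated attacks defines the ontology-level relationship and every instance that deviates from it is an exception. The function name `split_instances` and its inputs are hypothetical.

```python
from collections import Counter, defaultdict


def split_instances(instance_attacks, instance_category):
    """Return (model, direct).

    model  maps each category to the attack set shared by most instances.
    direct maps each deviating instance to its own attack set, so that
    its entity-relationship can be extracted directly.
    Assumption: the majority attack set defines the category-level link.
    """
    by_category = defaultdict(list)
    for instance, category in instance_category.items():
        by_category[category].append(instance)

    model, direct = {}, {}
    for category, instances in by_category.items():
        counts = Counter(frozenset(instance_attacks[i]) for i in instances)
        common, _ = counts.most_common(1)[0]
        model[category] = set(common)
        for i in instances:
            if frozenset(instance_attacks[i]) != common:
                direct[i] = set(instance_attacks[i])
    return model, direct
```

For example, with three hypothetical vulnerabilities in one category, where CVE3 and CVE4 link only to attack2 and CVE5 links to attack2 and attack6, the category is modeled as linking to attack2 and CVE5 is returned for direct extraction, mirroring the situation described for Figure 2b.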

Self-Defined Reasoning Rules
(1) Fixed reasoning rules. The entities obtained by machine learning and manual recognition are added to the 1:1 data model, 1:n data model and n:1 data model according to the entity classification and relationship classification, and the following reasoning rule is applied: (instance 1, belongs to, ontology 1), (instance 2, belongs to, ontology 2), (ontology 1, relationship 1, ontology 2) -> (instance 1, relationship 1, instance 2). Instances of vulnerability, Trojan, worm, snort alert and security event are then associated with types of attack, and stored in the cyber security knowledge graph as cyber security knowledge. For convenience, we convert the 1:n and n:1 data models into the 1:1 model for reasoning. Examples of reasoning rules are shown in Table 1.
rule: (?m NamedIndividualof ?n), (?p NamedIndividualof ?q), (?n r1 ?q) -> (?m r1 ?p)
rule: (?m NamedIndividualof ?n), (?p NamedIndividualof ?q), (?n r2 ?q) -> (?m r2 ?p)
(2) Specific reasoning rules. When new knowledge emerges, it is necessary to associate the new knowledge with the existing knowledge to enrich the cyber security knowledge graph. Applying the reasoning rule (instance 1, relationship 1, instance 2), (instance 2, belongs to, ontology 2), (instance 3, belongs to, ontology 3), (ontology 2, relationship 2, ontology 3) -> (instance 1, relationship 2, instance 3), instances of new vulnerabilities, new Trojans or new worms can be associated with types of attack, and stored in the cyber security knowledge graph. Examples of reasoning rules are shown in Table 2.
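The fixed reasoning rule can be illustrated with a small forward-inference sketch. It is not the rule engine used in the paper; it only applies the pattern (?m NamedIndividualof ?n), (?p NamedIndividualof ?q), (?n r ?q) -> (?m r ?p) to a set of triples. The function name `apply_fixed_rule` and the instance name `attack-001` are hypothetical.

```python
from collections import defaultdict


def apply_fixed_rule(triples, membership="NamedIndividualof"):
    """Lift every ontology-level relationship to the instance level.

    For each triple (n, r, q) between two ontologies, emit (m, r, p)
    for every instance m of n and every instance p of q.
    Only triples that are not already present are returned.
    """
    instances_of = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == membership:
            instances_of[obj].add(subject)

    inferred = set()
    for n, relation, q in triples:
        if relation == membership:
            continue
        for m in instances_of.get(n, ()):
            for p in instances_of.get(q, ()):
                inferred.add((m, relation, p))
    return inferred - set(triples)


facts = {
    ("buffer overflow attack", "exploit", "buffer overflow vulnerability"),
    ("attack-001", "NamedIndividualof", "buffer overflow attack"),
    ("CVE-2019-1010298", "NamedIndividualof", "buffer overflow vulnerability"),
}
new_facts = apply_fixed_rule(facts)
```

Here the ontology-level triple (buffer overflow attack, exploit, buffer overflow vulnerability) yields the instance-level triple (attack-001, exploit, CVE-2019-1010298). The specific reasoning rule could be handled along the same lines by additionally following an existing instance-level relationship.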

Named Entity Recognition Based on CRF
In the field of cyber security, although the sources of the corpus are diverse, they share the same syntactic description, so machine learning methods can be used for entity and entity-relationship recognition. Taking the vulnerability base and virus base as examples, the markup method of ontology construction-information extraction is shown in Figure 8, and the markup method of direct information extraction is shown in Figure 9. We divide the data into a training data set and a test data set, use word2vec to convert words to vectors, and combine CNN, BiLSTM and CRF to carry out entity recognition and relationship classification simultaneously. For an instance identified by the first markup method, triples need to be manually extracted and added to the cyber security knowledge graph. For an instance identified by the second markup method, the instance needs to be manually added to the corresponding data model, and after the self-defined reasoning, triples can be generated and added to the cyber security knowledge graph.
Due to the small data set, the best recognition result is 97.6%. Manual verification is required to ensure that the knowledge in the cyber security knowledge graph is 100% correct. The method and improvement of machine learning are not the focus of this paper. The purpose of adopting machine learning in this paper is to reduce manual operation as much as possible and to realize semi-automatic construction of the cyber security knowledge graph.

Scene Knowledge Graph
The scene knowledge graph is composed of the details of the multiple simulated attacks in the scene, including node IP, software and hardware, existing vulnerabilities, backdoors, standby state, network state, open services and ports, and so on.
An ontology model is also needed to construct the scene knowledge graph, and the focus of scene knowledge is on data attributes. We construct the node, software, hardware, backdoor, standby state, network state, port, service and vulnerability ontologies, determine the data attributes for each ontology, and then add instances and their data attributes according to the ontology model. The data model is shown in Figure 10.
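As a rough illustration of what one instance in such a data model might look like, the sketch below stores the data attributes of a single node and flattens them into triples. The class name, field names, predicate names and state values are all hypothetical; they are not taken from Figure 10.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node instance of a scene knowledge graph (hypothetical schema)."""
    ip: str
    software: list[str] = field(default_factory=list)
    hardware: list[str] = field(default_factory=list)
    vulnerabilities: list[str] = field(default_factory=list)
    backdoors: list[str] = field(default_factory=list)
    standby_state: str = "running"      # assumed values: "running" / "shutdown"
    network_state: str = "connected"    # assumed values: "connected" / "isolated"
    open_ports: list[int] = field(default_factory=list)
    services: list[str] = field(default_factory=list)


def to_triples(node: SceneNode) -> list[tuple[str, str, str]]:
    """Flatten the data attributes of a node into (subject, predicate, object) triples."""
    triples = [
        (node.ip, "standbyState", node.standby_state),
        (node.ip, "networkState", node.network_state),
    ]
    # Multi-valued attributes produce one triple per value.
    multi_valued = {
        "hasSoftware": node.software,
        "hasHardware": node.hardware,
        "hasVulnerability": node.vulnerabilities,
        "hasBackdoor": node.backdoors,
        "hasOpenPort": node.open_ports,
        "hasService": node.services,
    }
    for predicate, values in multi_valued.items():
        triples.extend((node.ip, predicate, str(value)) for value in values)
    return triples


# Example instance; the IP address is a documentation address, not one from the test bed.
example_node = SceneNode(
    ip="192.0.2.10",
    software=["nginx"],
    vulnerabilities=["CVE-2013-3525"],
    open_ports=[80],
)
example_triples = to_triples(example_node)
```

Keeping the standby state and network state as plain attributes of the node makes it easy to filter out attacks against nodes that are shut down or unreachable.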

Threat Element Association Statistics
After data fusion and threat element extraction, we can extract vulnerabilities, Trojans, worms, Snort alerts and security events from the input data. Vulnerabilities, Trojans, worms, and some Snort alerts and security events can be matched directly with the security knowledge graph, but other Snort alerts and security events fall into two special cases that require association statistics first:
1. For the same IP, within a certain time interval, count the number of times the same Snort alert occurs. If the count reaches the set threshold, the attack corresponding to that Snort alert is considered to have occurred. The same applies to security events.
2. For the same IP, within a certain time interval, multiple different Snort alerts are reported in turn. Count the number of alerts; if it reaches the set threshold, the attack corresponding to these Snort alerts is considered to have occurred. The same applies to security events.

Set Frequency Threshold
If the number of times an alert has been counted is greater than or equal to the set threshold, the alert information is matched with the knowledge graph and the attack associated with it is returned. If the count is less than the set threshold, the statistical results are cached and participate in the calculation of the next time window. The same is true for security events.
The calculation formula of the alert information is shown below.
where $key_i$ stands for a distinct IP and $alert_k$ stands for a distinct alert of the same IP in the time period $(t_0, t_1)$.
The calculation formula of the security event is shown below.
where $key_i$ stands for a distinct IP and $se_k$ stands for a distinct security event of the same IP in the time period $(t_0, t_1)$.
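The frequency-threshold rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the `(ip, alert_name)` key layout and the alert name used in the example are assumptions, and the input is assumed to be already restricted to one time window.

```python
from collections import Counter


def frequency_threshold(alerts, threshold, cache=None):
    """Count repeats of the same alert per IP within one time window.

    alerts:    iterable of (ip, alert_name) pairs observed in the window.
    threshold: minimum count at which the alert is matched with the knowledge graph.
    cache:     Counter of below-threshold counts carried over from earlier windows.

    Returns (triggered, new_cache): `triggered` is the set of (ip, alert_name)
    keys whose count reached the threshold, `new_cache` holds the remaining
    counts for the next window.
    """
    counts = Counter(cache or {})
    counts.update(alerts)
    triggered = {key for key, n in counts.items() if n >= threshold}
    new_cache = Counter({key: n for key, n in counts.items() if key not in triggered})
    return triggered, new_cache


# Two repeats in the first window stay below the threshold and are cached;
# a third repeat in the next window triggers the match.
key = ("192.0.2.10", "ssh_login_failed")   # hypothetical alert name
first_triggered, first_cache = frequency_threshold([key, key], threshold=3)
second_triggered, second_cache = frequency_threshold([key], threshold=3, cache=first_cache)
```

The same function can be reused for security events by passing `(ip, event_name)` pairs instead of alerts.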

Set Number Threshold
If the counted number of alerts is greater than or equal to the set threshold, the last alert is matched with the knowledge graph and the attack associated with it is returned. If the counted number of alerts is less than the set threshold, the statistical results are cached and participate in the calculation of the next time window. The same is true for security events.
The calculation formula of the alert information is shown below.
where $key_i$ stands for a distinct IP and $alert_1, alert_2, \ldots, alert_k$ represent the multiple alerts corresponding to the same IP in the time period $(t_0, t_1)$.
The calculation formula of the security event is shown below.
where $key_i$ stands for a distinct IP and $se_1, se_2, \ldots, se_k$ represent the multiple security events corresponding to the same IP in the time period $(t_0, t_1)$.
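A comparable sketch for the number threshold is given below. It assumes that "number of alerts" means the number of distinct alerts reported for an IP, in time order, and that the alert passed on to the knowledge graph is the most recent newly seen one. The function name and the tuple layout are hypothetical.

```python
def number_threshold(alerts, threshold, cache=None):
    """Count distinct alerts per IP within one time window.

    alerts:    list of (timestamp, ip, alert_name) tuples observed in the window.
    threshold: minimum number of distinct alerts for an IP to trigger a match.
    cache:     dict ip -> ordered list of alert names carried over from earlier windows.

    Returns (triggered, new_cache): `triggered` maps each IP that reached the
    threshold to its last distinct alert, `new_cache` keeps the others.
    """
    seen = {ip: list(names) for ip, names in (cache or {}).items()}
    for _, ip, name in sorted(alerts):          # process in time order
        names = seen.setdefault(ip, [])
        if name not in names:
            names.append(name)
    triggered = {ip: names[-1] for ip, names in seen.items() if len(names) >= threshold}
    new_cache = {ip: names for ip, names in seen.items() if ip not in triggered}
    return triggered, new_cache


# Three different alerts for one IP reach the threshold; the other IP is cached.
sample_alerts = [
    (3, "192.0.2.10", "alert_c"),
    (1, "192.0.2.10", "alert_a"),
    (2, "192.0.2.10", "alert_b"),
    (1, "192.0.2.11", "alert_a"),
]
number_triggered, number_cache = number_threshold(sample_alerts, threshold=3)
```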

Attack Rule Base
Any action that attempts to breach the integrity, confidentiality or availability of a resource is called an attack. When an attack has an independent purpose that cannot be decomposed further, it is a single-step attack. However, a successful attack process often contains multiple single-step attacks, and is then called a composite attack. Figure 11a is an example of a single-step attack, while Figure 11b,c are examples of composite attacks. The scene in Figure 11b contains only one target, while the scene in Figure 11c exploits several nodes.
Figure 11. Examples of a single-step attack and composite attacks.
Composite attacks can be regarded as combinations of multiple single-step attacks, but single-step attacks can be composed into a composite attack only by following certain rules. For example, if many attacks occur on the same node, as shown in Figure 12, how can we determine which attacks target node B? This paper solves this problem by establishing an attack rule base.
The framework models used to describe attacks mainly include the Kill Chain, the ATT&CK model and the NSA/CSS technical threat framework. These framework models summarize the steps of composite attacks. Based on these models, this paper summarizes general rules of composite attacks, some of which are shown in Figure 13. Rules can be added to or modified in the attack rule base as analysis progresses.
Even if multiple single-step attacks against the same node conform to the attack rules, they cannot necessarily be judged to be one composite attack, as shown in Figure 14. The attacks on node B conform exactly to attack rule 2, but they actually belong to five different composite attacks. Composite attacks therefore cannot be analyzed accurately by the attack rule base alone. This paper solves this problem by setting spatiotemporal constraints.
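To make the idea of a rule base concrete, the sketch below treats a rule as an ordered list of attack stages and checks whether the stages occur, in order, among the single-step attacks observed on a node. The rule names and the rules themselves are hypothetical; the stage names are borrowed loosely from the Kill Chain, and the rules of Figure 13 are not reproduced here.

```python
# Hypothetical rule base: each rule is an ordered list of attack stages.
ATTACK_RULES = {
    "rule_a": ["reconnaissance", "exploitation", "installation"],
    "rule_b": ["exploitation", "command_and_control"],
}


def matches_rule(observed, rule):
    """Return True if the stages of `rule` occur in `observed` in the same order.

    Other stages may be interleaved, so this is an ordered-subsequence test.
    """
    remaining = iter(observed)
    # `stage in remaining` advances the iterator, so order is enforced.
    return all(stage in remaining for stage in rule)


def candidate_rules(observed, rules=ATTACK_RULES):
    """List the names of all rules that the observed stage sequence satisfies."""
    return [name for name, stages in rules.items() if matches_rule(observed, stages)]


full_sequence = ["reconnaissance", "exploitation", "installation", "command_and_control"]
reversed_sequence = ["installation", "exploitation"]
```

A match of this kind only yields a candidate: as the Figure 14 example shows, attacks that satisfy a rule may still belong to different composite attacks, so rule matching needs to be combined with further constraints.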

Spatiotemporal Constraints
The duration of each composite attack is different. We cannot know the duration in advance, and it is impossible to set the analysis interval for each composite attack separately. As shown in Figure 15a, if the duration of the attack is shorter than the analysis time, the attack has already occurred and caused damage before the analysis is completed. As shown in Figure 15b, if the duration of the attack is longer than the analysis time, the analysis finishes before the attack does, which means we cannot obtain the full attack chain. To handle both situations, we set the analysis time interval and the time interval offset on Spark, and iteratively output the results of each analysis to ensure the accuracy of the composite attack analysis.
Figure 15. Examples of analysis time.
Spark [25] is good at processing stream data. Its basic principle is to divide stream data into small time slices (a few seconds each) and process each slice in a way similar to batch processing. The attack analysis framework uses Spark to process data in parallel, and uses time windows with increasing time offsets to output the analysis results iteratively, thereby reducing the false negative and false positive rates of the analysis results while keeping the analysis response time close to real time.
The process of composite attack analysis is as follows:
① Set the analysis time window and offset on the Spark framework, and set the alert base and the security event base that require statistical association.
② In a time window, after data fusion and threat element extraction, the input data is first matched against the alert base and the security event base set in step ①. If the match fails, the data is matched directly with the security knowledge graph. If the match succeeds, the data is matched with the security knowledge graph after statistical association.
③ The single-step attacks returned by the previous step are filtered with the scene knowledge graph to remove invalid attacks, leaving the effective single-step attacks.
④ The effective single-step attacks are matched with the attack rule base, subject to constraints on time and IP propagation.
⑤ After the analysis of the input data within a time window is completed, the analysis results are cached and output as intermediate results.
With the offset of the time window, the results of each analysis are iteratively output. When the attack ends, the complete attack chain will be output. Intermediate results and final results are output in the form of an attack chain. A schematic of the attack chain generation is shown in Figure 16.
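The windowed, iterative output can be illustrated with a plain-Python sketch that does not use the Spark API. It is a simplification under stated assumptions: the `Attack` record, the function names and the `max_gap` parameter are hypothetical, and the spatial constraint is reduced to "the next attack either starts from the node the chain last reached or repeats the same source and target".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    time: float   # detection time in seconds
    src: str      # attacking node
    dst: str      # attacked node
    name: str     # single-step attack label


def windows(start, end, size, offset):
    """Yield (t0, t1) windows of length `size`, each shifted by `offset`."""
    t0 = start
    while t0 < end:
        yield t0, min(t0 + size, end)
        t0 += offset


def extend_chains(chains, new_attacks, max_gap):
    """Append each new attack to the first chain it can continue, else start a new chain."""
    chains = [list(chain) for chain in chains]   # do not mutate earlier results
    for attack in sorted(new_attacks, key=lambda a: a.time):
        for chain in chains:
            last = chain[-1]
            propagates = attack.src == last.dst
            same_pair = (attack.src, attack.dst) == (last.src, last.dst)
            in_time = 0 <= attack.time - last.time <= max_gap
            if (propagates or same_pair) and in_time:
                chain.append(attack)
                break
        else:
            chains.append([attack])
    return chains


def analyze(attacks, start, end, size, offset, max_gap):
    """Yield the cached attack chains after each window as intermediate results."""
    chains, seen = [], set()
    for t0, t1 in windows(start, end, size, offset):
        batch = [a for a in attacks if t0 <= a.time < t1 and a not in seen]
        seen.update(batch)   # overlapping windows must not process an attack twice
        chains = extend_chains(chains, batch, max_gap)
        yield (t0, t1), chains


sample = [
    Attack(60, "A", "B", "scan"),
    Attack(400, "A", "B", "exploit"),
    Attack(700, "B", "C", "trojan"),
    Attack(720, "X", "Y", "scan"),
]
results = list(analyze(sample, start=0, end=900, size=300, offset=300, max_gap=600))
```

With a 15 min observation period and 5 min windows, the chain from A to C grows by one step per window, while the unrelated attack from X to Y starts its own chain. In Spark Streaming, the window length and the slide interval of a windowed stream would play the roles of `size` and `offset`.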

Evaluation Mechanism
The attack analysis framework for the cyber-attack and defense test platform proposed in this paper is based on the cyber security knowledge graph. The core of the analysis is to first match a threat element with the cyber security knowledge graph and return a single-step attack result, and then generate possible composite attack chains within the constraints of the attack rule base and the spatiotemporal properties.
Some vulnerabilities, Trojans and worms can correspond to multiple attacks, but in an actual simulation they correspond to only one attack at a time. Therefore, when analysis is based on the cyber security knowledge graph, the number of analysis results will be equal to or greater than the number of simulated attacks. In order to verify the effectiveness of this method, we save the simulated attack information in advance, match the analysis results with it, and use the matched results as the basis of the verification method.
At present, there is no general evaluation mechanism to verify the effectiveness of this method. Four cases may affect the results of attack analysis:
1. simulated attacks are all successful, and detection results are all correct;
2. simulated attacks are partially successful, and detection results are all correct;
3. simulated attacks are all successful, and detection results are partially correct;
4. simulated attacks are partially successful, and detection results are partially correct.
The theoretical attack analysis results for these four cases are shown in Table 3.
In order to quantify the above indicators, this paper proposes the concept of the efficiency of attack analysis. The number of simulated attacks is denoted as F1, the number of single-step attacks obtained by matching with the security knowledge graph is denoted as F2, the number of effective single-step attacks obtained by matching with the scene knowledge graph is denoted as F3, the number of invalid attacks is denoted as F4, the number of attacks obtained by association analysis is denoted as F5, the number of analyzed attacks that successfully match the simulated attacks is denoted as F6, and the efficiency of the attack analysis is denoted as R. The formula is as follows:

Test and Result Analysis
The simulation of the composite attacks is completed on the cyber-attack and defense test platform by running an attack script, which sets the path of the attack. The platform provides test node images. In a node image, software and hardware can be installed arbitrarily, and the network state, the standby state and the open state of ports and services can be set arbitrarily. To simulate an attack, we deploy the attack script, generate a virtual machine image, start the virtual machine, execute the attack script and run a command to complete the attack simulation. At the same time, the platform provides collection and detection systems, such as vulnerability scanning, virus detection, terminal collection, mail collection, honeypot collection, computing node data collection, file detection, sandbox detection analysis, network detection and honeypot terminal behavior detection.
The experiment is divided into two parts. The first part is to verify the validity of self-defined reasoning during the expanding of the knowledge graph. The second part is to verify the feasibility of the analysis framework and the evaluation mechanism. The first part uses the comparison of analysis results of known and unknown vulnerabilities to verify the validity of the self-defined reasoning. The second part simulates all possibilities in the occurrence of attacks and detections, and uses the analysis results to verify the feasibility of the evaluation mechanism. The test deployment is shown in Figure 17. In this test, 3 attacks are simulated, the duration of these attacks is 15 min and the analysis time window is set to 5 min.

Verify the Validity of Self-Defined Reasoning Rules
It is known that CVE-2019-1010153 is not stored in the security knowledge graph, whereas the relationship between CVE-2019-1010153 and CVE-2013-3525 is known. It is assumed that all simulated attacks are successful and the detection results are all correct. The results obtained with the original knowledge graph and with the expanded knowledge graph are compared in Table 4 (see Section 3.1 to Section 3.3 for detailed analysis).

Verify the Feasibility of the Analysis Framework
The test is divided into four cases, each of which uses the extended security knowledge graph:
1. The simulated attacks are all successful and the detection results are all correct.
2. The simulated attacks are partially successful and the detection results are all correct. The standby state of the node with IP 183.146.1.6 is set to shut down to simulate a failed attack.
3. The simulated attacks are all successful and the detection results are partially correct. The Trojan information of the node with IP 183.146.2.8 is lost.
4. The simulated attacks are partially successful and the detection results are partially correct. The standby state of the node with IP 183.146.1.6 is set to shut down to simulate a failed attack, and the Trojan information of the node with IP 183.146.2.8 is lost.
The effective results of the attack analysis under the different cases are shown in Table 5 (see Section 3.1 to Section 3.3 for detailed analysis).
The test results show that the richer the security knowledge graph, the higher the efficiency of the attack analysis, and the fewer the false positives and false negatives in the detection results, the higher the efficiency of the attack analysis. When new knowledge appears that has not yet been added to the cyber security knowledge graph and the attack rule base, a complete attack chain cannot be analyzed even if the corresponding threat is detected. In the simulation environment, the analysis framework proposed in this paper is feasible. However, when the detection results are wrong, the complete attack chain cannot be analyzed. Future work will mainly focus on how to develop self-defined reasoning rules comprehensively, and how to complete the attack chain with the help of the scene knowledge graph and the attack rule base.

Conclusions
The core of this paper is to apply a cyber security knowledge graph, which is divided into a security knowledge graph and a scene knowledge graph, to attack analysis. The security knowledge graph is constructed and expanded semi-automatically, and the scene knowledge graph is constructed manually. Using the cyber security knowledge graph, the attack rule base and spatiotemporal constraints, composite attack chains are mined from multiple single-step attacks.
At present, the construction of the cyber security knowledge graph and the attack rule base still requires a lot of manual operation, and the flexibility is poor. One goal of future work is to achieve more automated construction. Since the analysis may not always generate a complete attack chain, this paper proposes the efficiency of attack analysis as an evaluation measure. The efficiency of attack analysis increases with the accuracy of the detection results, and a further focus of future work is how to improve it. The framework proposed in this paper is only applicable to the analysis of attacks for which prior knowledge can be stated in a cyber security knowledge graph and an attack rule base, and is not designed for the discovery of unknown attacks.