A Hierarchical Security Event Correlation Model for Real-Time Threat Detection and Response

: An intrusion detection system (IDS) perform postcompromise detection of security breaches whenever preventive measures such as firewalls do not avert an attack. However, these systems raise a vast number of alerts that must be analyzed and triaged by security analysts. This process is largely manual, tedious, and time-consuming. Alert correlation is a technique that reduces the number of intrusion alerts by aggregating alerts that are similar in some way. However, the correlation is performed outside the IDS through third-party systems and tools, after the IDS has already generated a high volume of alerts. These third-party systems add to the complexity of security operations. In this paper, we build on the highly researched area of alert and event correlation by developing a novel hierarchical event correlation model that promises to reduce the number of alerts issued by an intrusion detection system. This is achieved by correlating the events before the IDS classifies them. The proposed model takes the best features from similarity and graph-based correlation techniques to deliver an ensemble capability not possible by either approach separately. Further, we propose a correlation process for events rather than alerts as is the case in the current art. We further develop our own correlation and clustering algorithm which is tailor-made to the correlation and clustering of network event data. The model is implemented as a proof of concept with experiments run on standard intrusion detection sets. The correlation achieves an 87% data reduction through aggregation, producing nearly 21,000 clusters in about 30 s.


INTRODUCTION
Amoroso [1], defines alert correlation as ". . . the interpretation, combination and analysis of information from all available sources about target system activity for purposes of intrusion detection and response."The goal of correlation is to reduce the number of events by aggregating and fusing those events that are related in some way.Additionally, through the correlation, it is possible to discover multi-step and complex attacks whose information is carried in separate events.These events may also be generated by separate devices at various times and locations.From an intrusion detection perspective, correlation can be performed on either events or alerts.Alerts and events are related but fundamentally different concepts.The event is the record of what happened.The alert is a message that is raised when an interesting event has been encountered.The interestingness of an event depends on the application domain.From the IDMEF [2], an alert is a message that is generated by a tool when it encounters an event that it was configured to look out for.Most of the active research is around alert correlation rather than event correlation.Event correlation is carried out before an IDS makes a detection decision.Alert correlation is performed outside the IDS after alerts have already been raised, by a third-party system, tool or application.This paper deals with the correlation of the actual events before analysis is performed by an IDS.We hypothesize that since the output of the correlation is a reduced data set due to the aggregation, the input to the correlation component of an IDS are few correlated events., consequently, the number of alerts should also reduced.

RESEARCH BACKGROUND
Generally, correlation techniques are categorized as (1) similarity-based, (2) sequential, and (3) case−based [7].Similarity−based techniques correlate two events if they are similar by comparing their attributes or time.Typically, some measure of distance is used to decide the similarity.Common distance functions in use are the Euclidean distance, Mahalanobis, Minkowski and Manhattan.To determine if a new alert should be included in a cluster, its distance to the cluster is compared to some pre-determined threshold.If the distance is above the threshold, it is correlated and included in the cluster.These techniques are known for their simplicity and performance in aggregation and reduction of the data set [3].However, sequences in the events cannot be captured.Additionally, similarity-based techniques cannot be used to detect complex and multi-step attacks.On the other hand, sequence-based techniques correlate events that have a causal relationship with each other.One event is considered the pre-condition while the other is the consequence.The strength of these techniques is their scalability and the ability to detect even previously unknown attacks.Case-based methods require the existence of a database of earlier cases which can be correlated with a new alert to make a classification and prediction.Several of these approaches have been used by researchers to address different correlation problems.What is clear is that there is no single approach that is universally applicable to every problem.The problem being addressed by this paper can be characterized by the following requirements; (1) Correlation must be done on the events, rather than alerts.This approach is taken to deal with the issue of the high volume of alerts at the source, rather than after the fact, (2) The correlation must enable real-time threat detection.This requirement means that approaches that require long time or large data sets to build models and statistics do not apply as those approaches will not fit satisfy the response times demanded by a real-time application.
Based on the problem defined and the approaches reviewed, this paper develops an event correlation model for real-time intrusion detection based on a hierarchy of two approaches (A) Similarity correlation method is used to reduce the size of the data set and form clusters, and (B) Graph Correlation is employed to discover cluster interconnections, establish communication sequence and enable visual analytics.The target architecture is a distributed IDS where data collection components stream events to the correlation unit in real-time.The architecture is described in detail in the rest of this paper.

RELATED WORKS
This work is grounded in an area of expansive research works in cybersecurity on intrusion alert correlation which aims at reducing the volume of alerts presented to cyber security analysts for ease of prioritisation and response.The Correlation process will ingest alerts from several distributed IDSes, analyse them, and construct more compact reports that shows the high-level view of the security status of a communications network as whole [4].Various research works have been carried out over the years on this topic across different application domains.In this section, we review only a select few of the available recent and pioneering research, focusing on those whose approaches we have adopted in the listed contributions.

HIERARCHICAL CORRELATION
Different correlation techniques work best with different and specific types of correlation problems, data, and industries.Even within the same industry, each technique depends on the type of data available and the type of intrusion that is being pursued for detection.Hierarchical correlation is an approach that is based on this realization that no single approach is fit for all purposes.Using this approach, more than one technique is applied to the correlation problem with the reasoning that the combined approaches will achieve a far greater result than the individual approaches.The science, then, is to determine which combination of approaches work best for the problem at hand.There has been limited research in hierarchical correlation approaches.One of the pioneering works often referenced by researchers is the work by Cuppens and Miege [5], in which they performed correlation of alerts from several distributed IDSes.The first level performed clustering of alerts at a local level, while the second level correlated these clusters at global level.However, they used the same correlation method, namely;correlation rules using the LAMBDA language, at both levels of the hierarchy.Rule based correlation, though simple and efficient, can not be directly used to detect new attacks whose footprints may not yet have been captured by any existing rule.The hierarchy used in the approach by Cuppens relies on the fact that the first level captures local events while the second level correlates the local clusters to uncover global patterns.In the model developed by Tian et al., [6], Intrusion Detection Systems were distributed geographically in a network and their generated alerts were correlated at two levels.The first level performed local correlation using graphs to aggregate local alerts and discover local intrusion attempts.The generated graphs were then fed to a central correlation unit that further aggregated and correlated the graphs from the local correlation units to discover global intrusions.While this work achieves the detection of both local and global alerts,the authors also use the same technique for the correlation at both levels of the hierarchy.By using the same technique, the inherent weaknesses of the chosen approach are maintained throughout the correlation process flow.For example, graph-based correlations generate a lot of false positives [7], a limitation which will not necessarily be resolved even if the system is implemented in a hierarchy.The system in [8] firstly aggregates various intrusion alerts using Splunk, and then an automated program ingests the aggregated alerts and clusters them using text analysis and visualized using a sunburst diagram.This work increases the operational complexity of security operations in that it increases the number of systems that need to be used for the analysis of security events.Additional to the IDS, the security analysts must deal with the SIEM (Splunk) as well as the custom automated correlation application developed.

REAL TIME CORRELATION
Firstly, it is necessary to define real time data.One such definition is given by [9], where real time data is defined as '.. information that is delivered immediately after collection.There is no delay in the timeliness of the information provided ...' Others have defined real-time data as data that is available when needed.Approaches that have been seen in research to optimize data collection for real time performance include direct acquisition [11], event streaming [12]- [14] and the use of in-memory storage [15].The key issue is to reduce the latency in the delivery of the event to the application that needs it.The latency is often caused by architectures that introduce too many hops such as aggregation points for the data, as well as access times due the data being stored on disk or databases.In terms of real-time correlation and detection, the most common approach is to correlate all events that can fit into some defined time window.The challenge is to decide the optimal duration for the time window that will make the data available to the consumer application when needed.This question usually depends on some expert knowledge of the system and the type of data.If the time window is large, the time to analyse batched events and eventually react if there is an attack might be too long to satisfy the needs of a real-time system.This technique is seen in [17] and [18].In [18], they follow the real-time algorithm from [18] for real-time processing of alerts.Incoming alerts are firstly grouped into batches that are sorted according to their creation time.Within the batch, the alerts are correlated according to a configurable time window, and each window is passed on to the analytics component as soon as it has been processed.Figure 1 below shows the real-time processing based on time windows.Zhang et al. [28] designed a detection system that operates in two phases.Initially, they mine frequent item sets for both normal and attack classes using Nested Sliding Windows.Then frequent item sets of incoming real time network traffic flows Figure 1.Time-Window processing in batches [18] are compared with these two classes.If the incoming flow is detected as anomalous, it is then passed on to the classification component, which uses a combination of Deep Belief Networks and Support Real-time processing is achieved by having the data processed from memory rather than from disc or databases.This significantly reduces the latency that is usually incurred due to disk access.The data is typically loaded into data structures in memory from where the correlation and analysis are run.Additionally, other researchers have used in-memory databases such as Oracle in-memory [20].

CORRELATION PROCESSES
The survey by Salah et al., [7] is one of the most referenced works on alert correlation techniques.They reviewed several works, and they summarised the alert correlation processes to propose a process model that comprises (1) Alert preprocessing, (2) Alert Reduction, (3) Alert Correlation, and (4) Alert Prioritization.Each of these process blocks expands to other sub-processes.For example, pre-processing constitutes normalisation and feature construction, while alert reduction is made up of filtering and reduction.The issue with correlation processes is that until now there is no standard approach as researchers tend to follow a process that is convenient and proper for the problem they are trying to solve, and usually the approach is targeted on a specific part of the entire process.The process model by [4] is supposed to address the end-to-end process, hence it is more elaborate with up to 10 steps.However most, if not all the steps in that process are covered by the consolidated model by Salah et al.Most recently, [21] proposed a process model that is specific for the identification of multi-step attacks.This process model has three main steps (1) Alert clustering (2) Context Supplementation, and (3) Attack Interconnection.This process model assumes all other necessary and preceding steps such as feature selection, normalisation and fusion have already occurred.

CORRELATION AND CLUSTERING ALGORITHMS
Clustering algorithms can be categorized in several ways.One way is whether the algorithm is optimised for numerical or categorical data.Depending on the features selected, network events can be clustered using either category.For example, [22] and [23] used a numerical-based approach, by calculating the entropy of the different IP header fields to cluster the data within an observation duration.By analysing the entropy values, certain attack types can be reasoned.As an example, during the scan phase of an attack, few attacking IP addresses attack many ports on a few targets.The entropy value of the destination port addresses would be expected to be low for a portscan attack.In terms of the categorical features of a network event, researchers have often turned to clustering using the CLIQUE [24] algorithm and its variants.According to the recent comprehensive survey on clustering algorithms performed by Xu and Wunsch [25], the Clique clustering algorithm can handle high dimensional data, with a computational complexity that is quadratic with the data dimensions and linear with the number of objects.This means that when the data volume gets higher, such as is the case with network event data, the clique-based algorithm tends to get slower.The features of a network event defined in our paper are only 5 dimensional, therefore the Clique algorithm should perform well in a deterministic way with respect to the dimensionality of the data space.The volume, however, is huge as network traces from packet sniffers can grow very huge in a small space of time.Additionally, the first step of clique is Apriori-like, in which it builds singleton clusters, i.e. clusters of 1 frequent itemset.It then gradually combines the clusters into bigger partitions of increasing combinations of the feature space before deciding which of those partitions form cluster.While this algorithm has successfully been used to identify clusters, the computational complexity of mining frequent itemsets is often cited as its biggest weakness [26].In terms of mining patterns from event logs, Vaarandi [27] proposed an algorithm that performs data summaries in the first run in the same manner as CLIQUE.In the second step, candidate clusters are created in the cluster table, and in the final round data points are allocated to a matching cluster candidate if it exists, otherwise a new entry is made.This algorithm is suitable for event logs of textual data, where the feature space cannot be figured out beforehand.For real-time clustering, Ma et al [17] initiate a memory queue Q, into which they deposit an incoming stream of alerts.All alerts that match a set of mined features are fused into a hyper alert which has features that are a combination of the features of the constituent elementary alerts.They use the temporal similarity to purge a hyper-alert from the queue if the incoming alert exceeds a given temporal threshold.The hyper-alerts are further stored into a database for further offline analytics.Their approach is tailored to the real-time clustering of alerts, and extracts features that are present in typical alert messages, but not in the raw events as is the case with this research.Our paper concurs with the conclusion by Xu and Wunsch [25] that there is no clustering algorithm that is universally suitable for all problems.With that in mind, we present our own developed correlation and clustering algorithm that considers the uniqueness of the problem at hand.The clustering algorithm takes into consideration the characteristics of network event data to be clustered.Some of the characteristics are summarised below.

•
The Data space dimensionality is small and fixed.Precisely, the event log entry as defined is a 5-tuple record.

•
Each event can belong to candidate clusters that can be defined by the set of predefined mining rules.

•
The data of the mined features are categorical.

GRAPH CORRELATION
Graph correlation presents the alert sequences in an acyclic graph for ease of visualization.The nodes of the graph represent the alerts, while the vertices usually stand for the temporal relationships between the alerts.Graph-based correlation models the events into a graph G = (V, E) where the vertex V represents a set of nodes, and each node (v ∈ V) represents the low-level n-tuple event with n mined features or attributes.The edge (e(vi, vj) ∈ E) represents the connection between the two nodes vi, vj, showing that there is a correlation between the nodes vi and vj.Typically, this relationship also shows that event vi precedes event vj.The graph is weighted, where the weight of each edge corresponds to the correlation strength between the event nodes.Consider the events recorded in Table 1.
Figure 2 below illustrates the resulting meta-event constructed after performing graph correlation.The nodes are labeled with the event IDs from table 1 for ease of reference.The edge weights correspond to the number of features that have matched the similarity rule, while the arrow direction shows the temporal sequence of the correlated events.
Graphs have been used in prior research to construct attack scenarios from intrusion alerts.Some of the works include showing causal relationships among alerts [28] and uncovering multi-step attack scenarios [29].Proximity graphs are used in [30] to produce

DRAWBACKS AND LIMITATIONS
Firstly, we observed that most works in correlation research relates to alerts.While this has proved to be beneficial, alert correlation is performed outside of the IDS by third party systems and tools.The IDS itself continues to raise a high volume of alerts, despite that the general architecture of an IDS includes a correlation unit.This is seen in all the correlation process models where the input is a stream of IDS alerts.Most organisations do not have the automated tools or third-party systems such as Security Information and Event Management (SIEM) systems to correlate the alerts and it remains the daunting task of the security analyst to manually sift through them and filter out redundant and irrelevant alerts.This introduces considerable delay to threat response which normally requires real time action.Secondly, in the reviewed research on hierarchical correlation we observed two prevailing approaches.The first one, as in the case of [6], the same correlation technique is used at both levels of the hierarchy.For example, in this case they used graph correlation at the local and global levels.Sequence based techniques in general are known for high false positives, so by using the same technique they can detect local and global intrusions, without necessarily dealing with the issue of high false positive rates inherent in the technique itself.In general, using the same correlation approach at different levels of hierarchy might enhance the strength of the technique without necessarily eliminating its weakness.We have observed that in the literature, hierarchical detection methods are used to build intrusion detection systems by combining signature and anomaly-based techniques [32] to achieve a better result, but we have not seen similar approaches for correlation problems.The second approach is where an external system is used together with a custom automated tool, such as in [8], where they rely on a third-party system, Splunk, to do the aggregation.This increases operational complexity as the security team must master the operations of the IDS, the SIEM (Splunk), and the custom correlation tool, just to accomplish the seemingly simple task of alert analysis.In terms of real time correlation and detection solutions, we observe that the real time component only starts from the correlation part of the processes.Cyber Security resilience requires a process that addresses the requirements of real time analytics end to end, starting from the time the event is generated, through correlation, to detection.Additionally, most of the works follow the lambda architecture [13], directly or indirectly, where an offline component builds up a historical data set and correlations, and then these are used for the correlations of new alerts.We find that this approach is not the best fit for pure real time analytics as the time needed to build a good and representative the offline component is quite significant.During that time, the system will not perform optimally with detection methods that are based on machine learning or statistical.The challenge with in-memory processing is the volatility of RAM which may result in the loss of the event data.In-memory databases, while they achieve the access throughput requirements of a real time application, may not be a scalable solution in a direct data acquisition architecture as it would require every event source to have the database.Other options such as caching also have the same volatility challenge, further, the cache does not always result in a hit.

RESEARCH CONTRIBUTION
It is no longer a question of whether an organization will be attacked or not.It is rather a question of when and how severe the damage will be.Organizations are therefore forever seeking solutions that can enhance their business continuity in the face of adversarial actions.Timely detection and response are key in today's hostile cyber environment.The high volume of alerts is counterproductive to the efforts of security analysts in trying to speedily triage alerts, prioritize them and execute response actions.Complexity is a security antidote that ought to be avoided by implementing streamlined security operations processes and systems.This research tries to bridge the gap in the current intrusion alert correlation approaches as it aims to resolve the following issues.

•
Reduce the actual volume of alerts generated by an IDS, by correlating the raw events themselves, rather than correlate alerts that have already been raised.

•
Reduce the time taken to respond to threats, because of the reduced number of alerts that need to be analyzed.

•
Enhance the capability of Intrusion Detection Systems, by improving their correlation unit using a hierarchy of correlation techniques that takes advantage of their cumulative strengths.
The contributions arising out of this research are the following. 1.
An event correlation process: All correlation processed reviewed relate to intrusion alerts.Some of the process steps are not relevant to the correlation of raw events.The process model proposed in this research assembles only those process steps that relate to event correlation.

2.
A conceptual hierarchical event correlation model: The model concentrates on the correlation component of an intrusion detection system and combines similarity and graph-based techniques in an ensemble.The first level of the hierarchy performs aggregation, data reduction, and clustering so that the events ingested by the graph correlation algorithm are significantly reduced.The graph correlation then performs cluster interconnections and visualization to reveal communication patterns and visual analytics in real-time.
The outcome of the research is a model that when implemented results in a streamlined process and system that enables real-time threat detection and response.

PROPOSED APPROACH 2.1. EVENT CORRELATION PROCESS
The proposed process model is motivated by the alert correlation processes in [7] and [21], and optimizes them by considering only those steps that are relevant for event correlation.The proposed process is shown in figure 3 below.The details of the process

HIERARCHICAL EVENT CORRELATION MODEL
The hierarchical event correlation model comprises two levels of correlation.The first level performs the aggregation, clustering, and context supplementation steps of the process model, while the second level performs cluster interconnection.The input to the model is a stream of raw events direct from the event generator, and not an IDS or any type of sensor.The output from the first level correlation is clusters of events representing network sessions between pairs of communicating hosts.These clusters become the input to the second level.The output from the second level is graph data structures which can be used by analytics systems such as intrusion detection systems.The high-level view of the proposed event correlation model is shown in Figure 4 below.The details of each step of the

Level 1 -Similarity-based correlation
The first level is aimed at the aggregation, clustering and context supplementation steps of the event correlation process defined.This model uses the similarity of event attributes for the correlation.The input to the first level is the stream of events.The output is a set of clusters that have been processed and have had contextual information added and carried through the process.

Event Aggregation
The purpose of aggregation is to summarize events that convey the same knowledge within some defined observation window so that one aggregate event is sufficient to represent all the transactions recorded in the individual events.This model defines an event as an n-tuple e = [attr_1, attr_2, . . ., attr_n], where each attr_i ∈ e represents an extracted attribute or feature out of all the possible features of the event.The aggregated event E, is the fusion of all such events having all similar features, except the timestamp.Since one aggregate event can represent several individual events, the result of aggregation is fewer events with no loss of information.Which features to mine to form the event will depend on the attacks that need to be detected by the model.
In the sample entries in table 2 above, all events 1 -4 record the same fact, that is, the host 192.168.204.69 is accessing the service on udp port 67(DHCP) running on the target host 192.168.204.1.The only difference in these events is the timestamp.Ignoring the timestamp, these events can be represented by the aggregate event E shown in Table 3 below.
In the aggregate event, all the features are the same, except the number of bytes (bytes), which is extracted from the IP header length field of the IP header.The count and packets (PKTS) are calculated fields which track the number of packets that make up the aggregate Event.In order to maintain the information from the elementary events in the aggregate, the bytes attribute of the aggregate is the sum total of the individual byte fields from the constituent events.This is done so that there is no loss of information from the individual events.In this way, analysis and detection can be made based on this aggregate in the same way it would have been done based on the individual events.The output of the aggregation step is an Aggregated Event Set AES comprising a set of the aggregate events E as defined by the following equation.

Event Clustering
Before the events can be clustered, association rules are defined to mine data from the AES that represent the fundamental analytics for network communications.Each aggregate event E in AES has n candidate correlations, where n is the number of features that make up the aggregate event.To mine the correlation candidates, a minimum of n association rules is created for each candidate cluster.The general form of a rule R is given below.
A support value, sup dictates the minimum frequent itemsets to be considered for the correlation, while n is the number of all the features of the Aggregated event.This allows the creation of candidate correlations at varying levels of correlations with different values of k.In the implementation of this model, the network event log entry is defined by the 5-tuple below.

NetL = [SIP, Sport, DIP, Dport, Protocol]
( Based on the network event log, the mining association rules below are defined.
Shared correlations, SC, indicate that the same host is communicating on the same applications.Correlation with variance level 1, VC1 is achieved when the same source host talks to the same target host on the same application but using different protocols.Correlation with variance level 2, VC2, is when the same and the same target are communicating on the same application using the same source ports, while the destination ports or protocol could be different.In VC3, the same source is accessing the same target host and destination port but could be using different source ports.When only the source and destination IP addresses are similar, this is mined by the VC4 rule.A careful inspection of these rules should reveal that they represent the foundational analytics for analysing network invents.Other types of analytics can be derived from various correlations or interconnection of the clusters created.For example, the SC represents normal traffic in general terms.However, if the number of these correlations and the volume of the traffic keeps increasing in a time series analysis, there could be a DoS attack inflight.Similarly, an increase in VC2 correlations could signal a probe attack, as the source is sending so many packets to the destination from different source ports.The data mined by these rules is put into a correlation table, CorrTable, which is an in-memory hash table of candidate clusters for each aggregate event that satisfy the minimum support, sup.This correlation table is passed onto the clustering algorithm which allocates the data points by picking the best fitting of the candidate clusters for each event.
Event clustering groups similar events into a cluster C such that each cluster has some contextual meaning using some clustering algorithm.Typically, a measure of similarity is defined which is used to decide whether two events should be put in the same cluster.The idea is that events within the same cluster should generally be given the same analytic interpretation or detection treatment.Different criteria can be used to contextualise the clusters for network events, such as clusters of source addresses, destination addresses, applications, and others.The output from this stage is a set of clusters C = [C1, C2, ..., Cn] such that each cluster Ci ∈ C has a unique semantical relevance.In an implementation of this model, the clustering is performed in an own developed clustering algorithm, which is shown in the pseudocode for algorithm 1 below.The algorithm takes the batched-up event streams as input.Hash tables for the correlation and cluster tables are initialised on line 3. Lines 5 -8 mine data from the record based on the current rule index.The hash for the current rule data is created in line 9 and used to perform a lookup on line 10.From line 10 -18, a new meta-event is created if the look up returned no match.The meta event is then inserted into the correlation table on line 18.If there is a match, lines 19 -24 update the meta event.The cluster table is a subset of the correlation table for all meta-events whose

Context Supplementation
Context supplementation is used to provide some descriptive information to the clusters to help analysts understand better what the cluster is representing.The type of information that can be used is flexible, depending on the use case of the cluster.In this model, context supplementation is performed during both levels of the hierarchical correlation.During the first phase, at a minimum, each cluster was given two descriptive attributes (1) Count, and (2) label.The count is a number that indicates how many elementary events were fused together to create the meta-event, while the label l ∈ L =[SC, VC1, VC2, VC3, VC4] .The interpretation and significance of these labels correspond to the descriptions provided for the corresponding mining rules given in the clustering section above.This allows analytics to be performed for flows of a given label.

Level 2 -Graph Correlation
The clusters from the first level are fed into the second level, for more correlation using graphs.This level establishes cluster interconnections, communication patterns and enables visual analytics.The input is the set of clusters of events from level-1 correlation above, while output is a series of interconnection and communication graph flows.The figure 6 below details the level 2 correlation block.The set M of nodes is the set of meta-events for all communication flows between the same source and destination hosts within the observation window.The meta-events are the clusters formed during the level-1 correlation described above.For two nodes m1, m2 ∈ M, an edge e ∈ E = (m1, m2) is created that connects m1 to m2 if the source and destination are similar for both meta events.What constitutes source and destination is left to the implementation, according to the features mined.Possible examples could be source and destination IP addresses, source and destination port numbers, source, and destination mac addresses, or any combination of source and destination features.Additionally, the edge is formed if meta event m1 occurred before meta event m2 according to the time stamps for the events.By sequencing the events, analytics that are based on the sequence of events can be executed by examining the data structure of Ghyper.In general, an edge in Ghyper is created if all the conditions below are satisfied.m1.src == m2.src (9) m1.dst == m2.dst (10) m1.ts < m2.ts (11) The objective is to interconnect related clusters and create a set of clusters of the hyper-events H = [c0, c1, . . ., cn].such that each cluster Ci ∈ C represents a collection of meta-events from the previous step with some similar attributes.This establishes multisession communication correlations among the single-session clusters established in stage 1 for the same communication pair.Ghyper is the first interconnection step of the process.Additionally, this model defines the Communications Flow Graph Gflow.
Definition 2: A communications flow graph, GFlow = (M, E) is a Directed Acyclic Graph (DAG), that shows the flow of communications in the network during the observation window.The set M of nodes is the set of meta-events from the level-1 correlation.For two nodes m1, m2 ∈ M, an edge e ∈ E = [m1, m2] is created that connects m1 to m2 if the destination of m1 is the source m2.An edge in Gflow is formed if the condition in the The principle is that an edge is formed if traffic is flowing from m1 to m2.Consider the communications in Table 4 below.
The above communications could be represented by the following Gflow shown in Figure 7 Gflow provides important analytics.A node with multiple outgoing edges, for A node with multiple incoming edges, such as node C above, could be a popular service, such as DNS or DHCP, but it could also be a node under a DDoS attack.The weakness of graph correlation is that if the data is too big, the run time could be long.Network events are classified as big fast data, owing to their huge volume.If this big data were to be presented to a graph correlation algorithm, the time taken to complete the correlation would be too long and defeat the whole idea of real time analytics.In the implementation of this model, the source and destination are taken to be the source and destination IP address.Equations 9-12 are then interpreted to the implementation equations below to implement Ghyper.Equation 12defines a generic source of events, leaving it to the specific implementation to specify what the source is.Possible sources could be source IP addresses, source port numbers, source mac addresses or any combination of the various source options, In this experiment, the source is the source IP address of network packets.In the same manner, the destination in this experiment is the destination IP address of network packets.This results in the fine tuning of equations 9 and 10 into equations 13 and 14.Further,equation 12 is fine tuned to the following equation.By aggregating and correlating the events in the first level, the resulting input to the graph correlation algorithm is very small and manageable, allowing the algorithm to complete in real time.It can therefore be observed that the two approaches in the correlation hierarchy are complimentary and deliver an outcome that is impossible to achieve with either one of them in isolation, with a performance acceptable for real time applications.Using these two defined graphs, a variety of analytics can be developed.Most importantly, the detection component of the model can base its analysis on these graph data structures.

TEST-BED AND DATA SETS
One of the biggest challenges facing intrusion detection research is the availability of realistic datasets.Due to various privacy concerns, organisations are often unwilling to provide real network traffic data for research.Consequently, researchers often use synthetic, or datasets created through simulations.The survey by [31] lists and analyses the various publicly available data sets for intrusion detection research.One such dataset, which is the most often used to evaluate Intrusion Detection Systems was created by the Defence Advanced Research Project Agency (DARPA) in 1998 and is often referred to as the DARPA/KDD99 Dataset.While this data set is faulted for its age and lack of modern attack traces, we found it sufficient for these research goals as our focus was correlation of intrusion events rather than evaluation of IDS detection performance.Specifically, we used data from the 5th week of the DARPA'99 intrusion data set for Friday 09-04-1999.The PCAP files for this day were parsed by a custom script written in python, which parsed the file and reconstructed the packets, extracted the features, normalised into JSON, and streamed to the correlation application in real time through Kafka.The correlation application ingested the stream of the arriving individual events and performed the correlation through the process defined in the developed model.The environment for the experiment was set up as in figure 7 below.A windows 11 machine was set up as the data collector.This machine The ready records were passed to another custom python script which was set as a producer of Kafka events which was streaming each event as it got parsed.A Linux server machine was set up in the public cloud.The server was equipped with 64GB of RAM, 2TB of hard drive space, 8 processors with two threads per core.On this server, three virtual machines were set up.The first machine implemented the correlation and detection component of the experiments.A python script was developed as a Kafka consumer, to read live events as they arrived in the Kafka server.The events then were passed on the analytics application, running inside the same virtual machine.The analytics application then performed the correlation.At the end of each correlation iteration, the raw events and the completed analytics were sent to MongoDB database which was set up in a separate virtual machine.The third virtual machine was running snort IDS.This was used to process the same original packet capture file to check for intrusions, for comparison with the performance of the developed detector.

RESULTS AND DISCUSSION
The Correlation model developed in this paper firstly aggregates all events into shared correlations, and then runs the first level of the correlation using association rule mining .After mining the data, cluster allocation of the data points is done through the own developed clustering algorithm.The data is input from the live streams arriving in Apache Kafka.Records are collected for a configurable batch size.In this experiment the batch size was set at 10,000 records.The batch size needs to be optimised such that the events corrected during that period do not exceed a reasonable time for detecting events in real time.After inspecting the data set, it was figured out that on average 10000 records contained between 7-and 10-minutes' worth of events.The other consideration is that a small batch size will cause the correlation algorithm to run too often, which might not be necessary as intrusions may not be detected in exceedingly small numbers of events.Table 5 summarizes the results of the correlation.Each line of the summary shows the number of events that were acquired by the data collection, the result of the aggregation, the number of clusters that were identified by the clustering algorithm, the number of hyperevents that resulted from cluster interconnections and the run time for the correlation iteration.The iterations were run with increasing numbers of events in steps of 10,000 up to 100,000 events.Each of these iterations can be interrogated for a breakdown of the clusters.For the first 10,000 events from table 5 above the breakdown is shown in table 6 below.
The minimum support was chosen to be 1.This was done so that probe attacks should be captured as most probes will not correlate at higher levels of support.Usually, when an attacker is scanning a network, they will send only one packet per probed port or service.From Table 6 above, of the 1811 clusters, 1261 were unique flows.There were 57 flows that had the same source port number (VC2), while 404 flows had the same destination port number (VC3).The 89 flows had only the source and destination IP addresses in common, while the rest of the features were all different.The distribution of the correlations is plotted in the pie chart in Figure 8 below.Different analytics can be performed on the clusters to uncover interesting patterns of communications contained within the dataset.The meta events created with the graph correlation initially corelate all VCx between each source and destination pair revealing the correlated details of all the flows between these communicating systems.In the dataset analysed, the analytics realised a total of 126 meta-events.The table below shows the details of the top ten hyperevents when sorted by the number of packets between them.The analytic allows the user to specify the sort order (number of packets or bytes) as well as the range (top n) to produce.This allows a security analyst to quickly inspect suspect communications.For example, in the analytics  Ghyper interconnects all clusters with the same source and destination IP address pairs.From the 1811 clusters described in Table 5, the output of Ghyper correlation is shown in Figure 9 below.The result showed that there are 126 clusters with shared connection sources in the dataset.Each cluster in the Gyhper graph above represents a correlation of all VCx meta events between source and destination communication pairs within the measurement time window.Zooming into each cluster shows the individual meta events (VCx) that make up the cluster.To investigate the flow of communication between the pairs, the Gflow analytic was used.The Gflow shows the communications paths in the data set from each source to various destinations.The graph in figure 10 below shows the Gflow that was generated from the data set in this experiment for the first 10,000 events.

Figure 11. First 10K cluster Interconnections
The nodes in Gflow represent different communicating sources, identified by their source IP addresses.The edges are created between two communicating hosts, with the direction of the arrow showing the direction of the traffic flow.This analysis can provide a quick visual analysis of the network.Where a node has many outgoing edges, it shows a high talker.Many incoming edges would be indicative of a popular service being available on that host, but it could also signal a potential DDoS attack in progress.One of the appealing elements of graph correlation is the visualization.In the graph above, the red colour was used to flag out all nodes that have more than five outgoing connections.More outgoing connections could signify a sweep attack if the trace contains probe traffic.This helps the analyst to investigate nodes for further details.The parameter for flagging is configurable in the analytics.According to the Gflow above, 10 hosts had at least 5 outgoing connections to different targets in the data set under analysis.
The correlation performance was tested by measuring the time of the entire correlation process (aggregation, level 1 correlation, clustering, and level 2 graph correlation).From the results shown earlier in Table 7, the charts in figure 11 below show the correlation performance for the 100,000 events.The correlation run time is plotted against the number of clusters and hyperevents.As can be seen from the charts above, correlation process run time increases with increasing number of events, clusters, and hyper events.To measure the performance of the correlation algorithm, the same approach as above was used, however, the time was measured only for the clustering algorithm part of the correlation process.The algorithm run time performance is plotted in Figure 12 below.
The results show that in general the run time of the algorithm increases linearly with the size of the input.For an input of size n, the performance observed was O(n).

CONCLUSIONS
This paper explored a novel approach to solving the problem of the high volume of alerts raised by Intrusion Detection Systems.The main contribution is a hierarchical event correlation model that ingests raw events from the source and outputs correlated graph data structures to the detection part of the IDS.We have shown that by correlating the raw events rather than alerts, the system reduces the complexity of security operations since it can be integrated into an IDS as a core correlation component, this contrasts with other approaches seen in prior art which rely on third-party systems to correlate alerts that have already been raised.We see an implementation of this model being integrated into a distributed IDS, where the data collection, correlation, and detection components are spread into different nodes.We have shown that no intrusion information is lost during the correlation process.The input to the detection component is a significantly reduced set of aggregated events.Based on this observation, the IDS should detect threats just as it would do based on the elementary events.This approach enables the IDS itself to raise fewer alerts, which enables fast and real-time response to threats.The proposed model proves that by combining two complementary correlation techniques, the overall system is more efficient and effective.Our model employs similarity and graph-based correlation techniques in a hierarchy.The similarity-based correlation is performed at the first level of the hierarchy to achieve aggregation, data reduction and clustering.Graph-based correlation is performed at the second level to interconnect related clusters, and communication patterns, and provide event visualization for ease of analysis.By combining these two approaches, the system benefits from their complementary strengths to achieve an integrated capability.Additionally, this research has contributed to security events correlation process.The process was used in the development of the proposed model to show its applicability.It is a process that is tailored to the correlation of raw events, rather than alerts.In future, we intend to develop an integrated correlation-based intrusion detection component as proof of concept, to consume the correlated data, and analyse and detect threats.This work is in progress and already at an advanced level.

Figure 4 .
Figure 4. Hierarchical Event Correlation Model process are covered in the subsequent sections for each level of the correlation hierarchy.

Figure 7 .
Figure 7. Conceptual Gflow example node A in the above graph could indicate a high talker, or an intruder could be carrying out attacks such as sweep probe on hosts B and C.A node with multiple incoming edges, such as node C above, could be a popular service, such as DNS or DHCP, but it could also be a node under a DDoS attack.The weakness of graph correlation is that if the data is too big, the run time could be long.Network events are classified as big fast data, owing to their huge volume.If this big data were to be presented to a graph correlation algorithm, the time taken to complete the correlation would be too long and defeat the whole idea of real time analytics.In the implementation of this model, the source and destination are taken to be the source and destination IP address.Equations 9-12 are then interpreted to the implementation equations below to implement Ghyper.Equation12defines a generic source of events, leaving it to the specific implementation to specify what the source is.Possible sources could be source IP addresses, source port numbers, source mac addresses or any combination of the various source options, In this experiment, the source is the source IP address of network packets.In the same manner, the destination in this experiment is the destination IP address of network packets.This results in the fine tuning of equations 9 and 10 into equations 13 and 14.Further,equation 12 is fine tuned to the following equation.By aggregating and correlating the events in the first level, the resulting input to the graph correlation algorithm is very small and manageable, allowing the algorithm to complete in real time.It can therefore be observed that the two approaches in the correlation hierarchy are complimentary and deliver an outcome that is impossible to achieve with either one of them in isolation, with a performance acceptable for real time applications.Using these two defined graphs, a variety of analytics can be developed.Most importantly, the detection component of the model can base its analysis on these graph data structures.

Figure 8 .
Figure 8. Experimental Set up implemented the data collection component of the experiments.Two custom applications were developed using python 3.11.The parser was responsible for reading packet captures from the data set, reconstructing each packet, preprocessing, and normalising the records.The ready records were passed to another custom python script which was set as a producer of Kafka events which was streaming each event as it got parsed.A Linux server machine was set up in the public cloud.The server was equipped with 64GB of RAM, 2TB of hard drive space, 8 processors with two threads per core.On this server, three virtual machines were set up.The first machine implemented the correlation and detection component of the experiments.A python script was developed as a Kafka consumer, to read live events as they arrived in the Kafka server.The events then were passed on the analytics application, running inside the same virtual machine.The analytics application then performed the correlation.At the end of each correlation iteration, the raw events and the completed analytics were sent to MongoDB database which was set up in a separate virtual machine.The third virtual machine was running snort IDS.This was used to process the same original packet capture file to check for intrusions, for comparison with the performance of the developed detector.

Figure 9 .
Figure 9. summary of clusters for the first 10K events

Table 1 .
Sample Network Event Log Entries

Table 2 .
Individual Event Log Entries

Table 5 .
Analytic of top ten communication flows

Table 6 .
Cluster summary for the first 10K events