Graphical Representation of UWF-ZeekData22 Using Memgraph

Abstract: This work uses Memgraph, an open-source graph data platform, to analyze, visualize, and apply graph machine learning techniques to detect cybersecurity attack tactics in a newly created Zeek Conn log dataset, UWF-ZeekData22, generated in The University of West Florida's cyber simulation environment. The dataset is transformed to a representative graph, and the graph properties studied in this paper are PageRank, degree, bridges, weakly connected components, node and edge cardinality, and path length. Node classification, implemented using Memgraph's graph neural networks, is used to predict the connection between IP addresses and ports as a form of attack tactic or non-attack tactic in the MITRE framework. Multi-class classification is performed using the attack tactics, and three different graph neural network models are compared. Using only three graph features (in-degree, out-degree, and PageRank), Memgraph's GATJK model performs the best, with a source node classification accuracy of 98.51% and a destination node classification accuracy of 97.85%.


Introduction
Addressing cybersecurity attacks in corporate networks is crucial, particularly considering the limits on corporate manpower and budgets. Many organizations, especially smaller ones, may not have dedicated cybersecurity teams or the financial means to invest in robust security systems. A cyberattack occurs every 39 s [1], and in 2021, 43% of the attacks impacted small businesses [1], demonstrat-
Memgraph 2.8.0 is a freely available graph analytics platform used to characterize and visualize graphs and perform graph machine learning through its various imported libraries [2]. Memgraph is very similar in function to Neo4j [5,6], with a key difference being that Neo4j is written in Java and stores data on disk, whereas Memgraph is written in C/C++ and stores data in memory. This makes Memgraph more performant but limits the size of the data loaded to the amount of available RAM on the local machine. Both platforms have built-in data science libraries: GDS for Neo4j and MAGE for Memgraph. This research uses MAGE's built-in algorithms to create GNNs implemented in Torch and the Deep Graph Library (DGL) to classify network attack tactics, create visualizations of connections, and query the graph for supplemental information.
The rest of this paper is organized as follows. Section 2 presents the related works; Section 3 presents the data; Section 4 explains the data preprocessing; Section 5 explains Memgraph; Section 6 presents graph characterization or properties of UWF-ZeekData22; Section 7 presents graph visualizations of UWF-ZeekData22; Section 8 presents node classification results using graph neural networks; Section 9 presents the conclusions; and finally, Section 10 outlines future works.

Related Works
Javorník et al. [7] introduce the concepts of multi-step and multi-target attacks through Complex Privilege-Exploit Attack Graphs and Bayesian Privilege Attack Graphs. These structures model causal relationships among trust levels and guide decision-making. The system calculates the resilience of mission configurations using Bayesian networks, considering the distribution of critical privileges and the attacker's position. A case study of a medical information system demonstrates the use of the decision support system, highlighting the challenges of decision-making when the most resilient configurations conflict with user demands.
Jacob et al. [8] discuss the problem of cybersecurity and the increasing need for improved detection of cyber-attacks on software applications using microservice architectures. The authors propose a graph-based anomaly detection approach to identify irregular microservice traffic caused by cyber-attacks. They use a graph convolutional neural network to capture spatial and temporal dynamics within the application's tracing data, which consists of the sequence of API calls between microservices. The authors also introduce a diffusion convolutional recurrent neural network for traffic forecasting and the detection of anomalies in microservice traffic.
Wei et al. [9] approach the problem of cyber threats with DeepHunter, a GNN-based approach that can robustly match provenance data against known attack behaviors. This approach aims to actively search for attack behavior in an organization's information system using indicators of compromise (IOCs). The DeepHunter GNN is able to capture the relationships between IOCs to provide better detection of advanced persistent threats whose attack behaviors differ from previously known attacks. This solution is reported to perform better than the comparable Poirot system.
Hagheshenas et al. [10] frame the problem for application in smart grids, using a "Temporal graph neural network framework" (TGNN) for the detection and localization of false data injection and ramp attacks on the system state in smart grids. Both forms of attack manipulate sensor data to disrupt a grid's functionality. Examples of such data could be "voltage, current, power injections… status information of breakers and switches". The researchers found promising results in classifying these forms of attack with a TGNN that models both the topological structure of the smart grids and the "temporal measurements at each bus of the system".
Other works on graph databases have been conducted by [11-14]. Ref. [11] looked at how anomalies could be detected in node-labeled directed weighted graphs. Ref. [12] looked at the graph similarity model using the minimum description length principle. Ref. [13] presented graph database analysis using Neo4j. Ref. [14] looked at attack graph analysis using temporal factors associated with vulnerabilities.
Although different forms of graph architectures have been used in the past, the uniqueness of our work lies in the use of Memgraph characterizations, visualizations, and node classification using multi-class graph neural networks. Moreover, this work uses Zeek Conn logs from the newly created UWF-ZeekData22 dataset.

The Data
The dataset used in this work, UWF-ZeekData22 [3], was generated in the cyber simulation environment at The University of West Florida (UWF). The Zeek Conn log dataset is labeled using the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework [15]. The MITRE ATT&CK framework, based on a foundation of threat models that determine adversary tactics, contains 14 tactics to date, and several techniques and sub-techniques. UWF-ZeekData22 contains 10 tactics, as shown in Table 1, with a total of 9,280,869 attack tactic records and 9,281,599 benign records [4]. A breakdown of this dataset's tactics is also presented in [3]. This research, however, focuses on the top three tactics by count: Reconnaissance (TA0043) [16], Discovery (TA0007) [17], and Credential Access (TA0006) [18].
Adversaries using Reconnaissance are actively and passively looking for ways to gather information about the system as a whole so that they can plan attacks. Adversaries performing Discovery tactics are gathering information about the internals of the system and environment so that they can plan their attacks. Adversaries using Credential Access tactics are looking for ways to steal credentials such as user IDs and passwords.

Data Preprocessing
To create this graph network in Memgraph, the dataset was preprocessed. The preprocessing flow is presented in Figure 1. The goal of the preprocessing was to transform the log data into representative node and edge CSV files that could be ingested into Memgraph. The first preprocessing step was to remove the tactics that had very few occurrences, since these low-frequency tactics contain too little data for any meaningful graph analysis. Only the Reconnaissance, Discovery, and Credential Access tactics were kept from the UWF-ZeekData22 dataset.
The second preprocessing step was to combine the four source/destination IP address/port columns into two address columns. The address values become the node labels, formatted as IP address-port. Ports are not modeled as separate nodes because port 80 on one IP address is not the same endpoint as port 80 on a different IP address.
Once the addresses were in the proper format, the distinct addresses were collected and joined to the edges on the address. The tactics_src and tactics_dest column values were parsed from the edges and added as labels to the nodes CSV file.
The source dataset contained 24 columns. Since we are concerned with the structure and pattern of the connections made in order to identify attack tactics under the MITRE ATT&CK framework, we removed all columns except the addresses and tactic labels that define the graph structure. Therefore, only the datetime, source address, destination address, and tactic columns were kept. This significantly reduced the size of the dataset to be processed. The resulting dataset was written to the edges CSV file, and the nodes and edges CSV files were then ready to be ingested into Memgraph. The columns in the original dataset and in the nodes and edges datasets are presented in Figure 2. Note the presence of address ID columns in the nodes and edges datasets. This creates a relational dataset that Memgraph uses to define the edges between the source and destination nodes.
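The preprocessing steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the input column names (`src_ip`, `src_port`, `dest_ip`, `dest_port`, `label_tactic`, `datetime`), the output column names, the `IP:port` address format, and the decision to keep benign rows labeled `none` alongside the three tactics are all assumptions made for the example.

```python
import csv

# Assumption: tactic labels kept after filtering; "none" marks benign traffic.
KEEP_TACTICS = {"Reconnaissance", "Discovery", "Credential Access", "none"}


def build_graph_tables(rows, keep=KEEP_TACTICS):
    """Convert connection records (dicts) into node and edge tables."""
    node_ids = {}  # address -> integer id, assigned in order of first appearance
    edges = []
    for row in rows:
        tactic = row["label_tactic"]
        if tactic not in keep:
            continue  # step 1: drop low-frequency tactics
        # Step 2: merge IP and port into a single address per endpoint.
        src = f'{row["src_ip"]}:{row["src_port"]}'
        dest = f'{row["dest_ip"]}:{row["dest_port"]}'
        for address in (src, dest):
            node_ids.setdefault(address, len(node_ids))
        # Steps 3-4: keep only datetime, addresses (with ids), and tactic.
        edges.append({
            "datetime": row["datetime"],
            "src_id": node_ids[src],
            "dest_id": node_ids[dest],
            "src": src,
            "dest": dest,
            "tactic": tactic,
        })
    nodes = [{"id": i, "address": a} for a, i in node_ids.items()]
    return nodes, edges


def write_csv(path, rows):
    """Write a non-empty list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `build_graph_tables` on the parsed log rows and passing each result to `write_csv` would yield the two files to ingest. Because ids are assigned on first appearance, the same address always maps to the same node regardless of how many connections it takes part in.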

Memgraph
Memgraph is a high-performance, in-memory graph data platform designed for ingesting, querying, and visualizing large-scale graph data. Memgraph efficiently processes graph queries and traverses large networks of nodes and edges using the Cypher query language [2]. It is freely available and can be run inside a Docker container, which makes setup easy.

Ingestion into Memgraph
The first step is to create the graph database. This is accomplished using the code in Figure 3, which does the following:
1. Loads the nodes dataset from disk.
2. For each row in the dataset, creates a Memgraph node. The node contains an address attribute.
3. Creates an index on the address attribute. This speeds up edge creation.
4. Loads the edges dataset from disk.
5. For each row in the edges dataset:
   a. Obtains the source and destination address nodes using the MATCH clause.
   b. Creates an edge between the two nodes.
   c. Sets the tactic attribute on the edge.
Once the nodes and edges have been created (Figure 3), the networks can be retrieved using variations of the query shown in Figure 4. The query in Figure 4 uses two useful clauses:
1. A WHERE clause for filtering by tactic.
2. A LIMIT clause for returning a subset of nodes.
Since the Reconnaissance tactic has many nodes, obtaining a full result set not only takes time but also does not offer any additional insight into the structure. Experimentation showed that setting the limit to about 5000 nodes gave an adequate representation of the graph's structure.
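The ingestion steps and the filtered query described above might look roughly like the following. The sketch builds Cypher strings in Python; the node label `Address`, the relationship type `CONNECTS`, the CSV paths, and the CSV column names (`address`, `src`, `dest`, `tactic`) are hypothetical choices for illustration and are not taken from the paper's Figures 3 and 4.

```python
def ingestion_queries(nodes_csv="/import/nodes.csv", edges_csv="/import/edges.csv"):
    """Return Cypher statements that load nodes, index them, and load edges."""
    return [
        # Steps 1-2: one node per row, carrying the address attribute.
        f'LOAD CSV FROM "{nodes_csv}" WITH HEADER AS row '
        'CREATE (:Address {address: row.address});',
        # Step 3: index the address so the MATCH in step 5 is a fast lookup.
        'CREATE INDEX ON :Address(address);',
        # Steps 4-5: match both endpoints, create the edge, set its tactic.
        f'LOAD CSV FROM "{edges_csv}" WITH HEADER AS row '
        'MATCH (s:Address {address: row.src}), (d:Address {address: row.dest}) '
        'CREATE (s)-[c:CONNECTS]->(d) SET c.tactic = row.tactic;',
    ]


def tactic_subgraph_query(tactic, limit=5000):
    """Return a Cypher query that filters edges by tactic and caps the result.

    Note: LIMIT caps the number of returned rows (one per matched edge).
    The tactic is interpolated for readability only; real code should pass
    it as a query parameter instead.
    """
    return (
        'MATCH (s:Address)-[c:CONNECTS]->(d:Address) '
        f'WHERE c.tactic = "{tactic}" '
        f'RETURN s, c, d LIMIT {int(limit)};'
    )
```

The statements can be pasted into Memgraph Lab or sent through any Bolt-compatible client. Creating the index before loading the edges matters because every edge row triggers two node lookups by address.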

Graph Characterization
Graphs are complex structures whose topology can vary depending on the number of nodes, the number of edges, and whether the edges are directed. Mathematically, a graph is a representation of a set of objects (nodes) connected by links (edges). In the context of this work, a node represents an individual IP address. An edge represents the connection between two nodes: a source IP address and a destination IP address. The edge carries the attack tactic that is used. In other words, the graph represents an attack modality from a source computer (the attack initiator) to the target computer (the attack victim). Having a source and a target node implies direction; hence, the edge is "directed", and the graph is a "directed graph". The degree of a node in a graph refers to the number of edges that are incident to that node. In a directed graph, the degree can be split into two separate measures: the in-degree (number of incoming edges) and the out-degree (number of outgoing edges).
Hence, when analyzing a graph, the following characteristics or properties are of interest: (i) PageRank; (ii) degree; (iii) bridges; (iv) weakly connected components; (v) node and edge cardinality; (vi) path length. These characteristics, obtained using Cypher queries, are discussed next.

PageRank
Google created the PageRank algorithm to distinguish recognizable and relevant web pages from lesser-known pages. The algorithm uses the web's link structure to calculate PageRank scores for each document on the web [19,20]. The algorithm incorporates the concept of a "random surfer" in the following fashion:
1. The random web surfer starts on a random web page. This surfer follows the links on the page and clicks randomly on one of the links.
2. After clicking a link, the surfer is now on the new page and repeats the process by randomly clicking on one of the links on that page.
3. This process continues indefinitely, with the random surfer randomly clicking on links and moving from page to page.
The PageRank algorithm calculates the probabilities of the random surfer being on a specific page at any given time. These probabilities are represented as the PageRank scores for each page. Pages that are frequently visited or are linked to by other important pages tend to have higher probabilities and therefore higher PageRank scores.
The random surfer concept can be extended to model the behavior of a cyber attacker targeting machines on a network. Using the PageRank algorithm, one can calculate the probability that a random cyber attacker will attack a particular machine by randomly choosing machines on the network. The graph data are a historical record of attacks. The PageRank score evaluates the likelihood of another attack using past attacks.
Memgraph natively implements PageRank using a straightforward calculation. The iterative nature of the algorithm allows Memgraph to parallelize the score computation across multiple processor cores [20]. The execution times of running PageRank for the benign data (None), Reconnaissance, and Discovery using Memgraph are presented in Table 2. The Cypher query in Figure 5 was used to generate the PageRank scores for the tactics of interest shown in subsequent sections.
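To make the random-surfer idea concrete, the following is a minimal power-iteration sketch of PageRank on a list of directed edges. It is not Memgraph's implementation; the damping factor of 0.85 is the conventional choice and is an assumption here, as is the handling of nodes without outgoing edges (their score is spread uniformly over all nodes).

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Approximate PageRank scores for a directed edge list [(src, dst), ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}  # surfer starts on a random node
    for _ in range(iterations):
        # Score held by dead-end nodes is redistributed uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    new_rank[dst] += share  # follow one outgoing link at random
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

In the attack-graph reading, an address that is targeted by many sources, or by sources that are themselves frequently targeted, accumulates a higher score. Repeated connections between the same pair of addresses count once per edge in this sketch.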

Degree
Degree is the second category of interest when analyzing the graph. We analyze the degree centrality, the degree distribution, and the degree averages [21,22].

Degree Centrality
Degree centrality measures the number of edges connected to a node, normalized by the number of nodes in the graph, such that degree centrality = degree/(number of nodes − 1). This measure is useful for viewing the connectedness of nodes based on the tactic being used. For directed graphs, the user specifies the "in" or "out" parameter for the type of degree, or the method defaults to "undirected." In-degree centrality is the degree centrality value where degree is the number of incoming edges. Likewise, out-degree centrality is the degree centrality value where degree is the number of outgoing edges. The query in Figure 6 sets the in- and out-degrees as a property of each node. In-degree centrality and out-degree centrality are shown for the Reconnaissance tactic.
Table 5 shows that for UWF-ZeekData22, there is no significant difference in the in-degree values between the Reconnaissance and None (benign data) tactics. They appear to be extremely similar in distribution. Table 6 shows the out-degree centrality for the Reconnaissance and None (benign) nodes. These results show that the connectedness of the Reconnaissance nodes is significantly greater than that of the normal "none" tactic nodes, even though the number of edges is similarly in the millions. This outlier may be a strong indicator of an attack using the Reconnaissance tactic.
Degree averages show the differences between tactics and their connectedness. These values are highly correlated with the degree centrality and provide more raw data but are not divided by the number of nodes. Table 7 shows the out-degree averages for each of the tactics (Reconnaissance, Credential Access, and Discovery) as well as for benign data.
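The degree measures discussed above follow directly from the edge list. The sketch below is an illustration of the stated formula, degree/(number of nodes − 1), not the query in Figure 6; the function and key names are hypothetical.

```python
from collections import Counter


def degree_measures(edges):
    """Per-node in/out degree and degree centrality for a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    denominator = max(len(nodes) - 1, 1)  # avoid division by zero for 1 node
    return {
        node: {
            "in_degree": in_degree[node],
            "out_degree": out_degree[node],
            "in_centrality": in_degree[node] / denominator,
            "out_centrality": out_degree[node] / denominator,
        }
        for node in nodes
    }


def average_out_degree(edges):
    """Average number of outgoing edges per node (edges / nodes)."""
    nodes = {n for edge in edges for n in edge}
    return len(edges) / len(nodes) if nodes else 0.0
```

Running `degree_measures` separately on the edges of each tactic gives the per-tactic comparison described in the text. Note that with repeated connections between the same pair of addresses, a centrality value can exceed 1, because each edge is counted.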

Bridges
Bridges are the third category of interest when analyzing the graph. A bridge is a single edge that connects two subgraphs and whose removal would leave the two subgraphs disconnected. Looking at bridges can provide additional information about the structure of our graphs. The query in Figure 7 yields the bridge count, and Table 8 shows the bridge count by tactic for UWF-ZeekData22. The number of bridges in None compared to Reconnaissance supports the idea that the Reconnaissance tactic is heavily connected between nodes.
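The definition of a bridge can be illustrated with a small depth-first search in the style of Tarjan's bridge-finding algorithm. This is a sketch under the assumption that edge direction is ignored when looking for bridges; it is not the code behind Figure 7. Parallel edges between the same two addresses are tracked by edge id, so a repeated connection is never reported as a bridge.

```python
def find_bridges(edges):
    """Return the edges whose removal disconnects the undirected graph."""
    adjacency = {}
    for edge_id, (u, v) in enumerate(edges):
        adjacency.setdefault(u, []).append((v, edge_id))
        adjacency.setdefault(v, []).append((u, edge_id))

    discovered, low = {}, {}  # discovery time and lowest reachable time
    bridges = []
    timer = 0
    for root in adjacency:
        if root in discovered:
            continue
        discovered[root] = low[root] = timer
        timer += 1
        # Explicit stack instead of recursion: (node, edge used to enter, iterator)
        stack = [(root, -1, iter(adjacency[root]))]
        while stack:
            node, entry_edge, neighbors = stack[-1]
            descended = False
            for nxt, edge_id in neighbors:
                if edge_id == entry_edge:
                    continue  # do not walk back along the entering edge
                if nxt in discovered:
                    low[node] = min(low[node], discovered[nxt])
                else:
                    discovered[nxt] = low[nxt] = timer
                    timer += 1
                    stack.append((nxt, edge_id, iter(adjacency[nxt])))
                    descended = True
                    break
            if not descended:
                stack.pop()
                if stack:
                    parent = stack[-1][0]
                    low[parent] = min(low[parent], low[node])
                    if low[node] > discovered[parent]:
                        bridges.append(edges[entry_edge])  # no way around it
    return bridges
```

A subgraph with many bridges is tree-like: most addresses hang off a single connection. A densely interconnected subgraph has few bridges relative to its edge count, because most edges lie on a cycle.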

Weakly Connected Components
Weakly connected components are the fourth category of interest when analyzing the graph. Given a directed graph, a weakly connected component (WCC) is a subgraph of the original graph where all vertices are connected to each other by some path, ignoring the direction of edges [23]. These values give an idea of how many disparate subgraphs exist by tactic. Table 9 shows that the Reconnaissance tactic has many disparate components, whereas None (the benign data) has very few disparate components, even though both have a similar number of connections in the millions.
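Counting weakly connected components amounts to grouping nodes that are linked by any chain of edges, regardless of direction. A minimal union-find sketch, given here only to illustrate the definition:

```python
def weakly_connected_components(edges):
    """Group the nodes of a directed edge list into weakly connected components."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # direction is deliberately ignored

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())
```

The number of sets returned for the edges of one tactic corresponds to the per-tactic component count discussed above.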

Node and Edge Cardinality
Node and edge cardinality is the fifth category of interest when analyzing the graph. Figure 8 shows the code snippet of the Cypher queries returning the number of nodes and edges, respectively. The data for UWF-ZeekData22 resulted in a graph with 262,963 nodes and 18,562,438 directed edges (edges from the source to the target address). Table 10 shows the node and edge cardinality for the tactics of interest as given by Figure 9.

Path Length
Path length is the sixth category of interest when analyzing the graph. It is valuable to know if there are nodes that are more than one hop away from a source node. The query in Figure 10 returns paths that are 2 to 5 edges long from source to destination, where the source is a source of the Reconnaissance tactic, limited to 1000 paths. Figure 11 shows a source node of the Reconnaissance tactic, with address "143.88.2.10:41562", that connects to intermediate address "143.88.7.12:3", then to address "143.88.2.10:3", and then to many destination addresses. Figure 13 shows a source node of the Discovery tactic, with address "143.88.7.10:55262", that connects to intermediate address "143.88.2.10:3" and then to many destination addresses for dataset UWF-ZeekData22. The paths shown in Figures 11-13 are the result of running the query in Figure 10 and represent the way that different tactic types connect to addresses, with a focus on connections that pass through an intermediate IP address/port number before connecting to their destination nodes. The source nodes of Credential Access do not have any paths of length greater than 1 for dataset UWF-ZeekData22; hence, they have not been displayed.
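The multi-hop search described above can be approximated outside the database as follows. This sketch is not the query in Figure 10: it enumerates simple paths (no repeated node), whereas a Cypher variable-length pattern forbids repeated edges rather than repeated nodes, so the two can differ on graphs with cycles. The defaults mirror the 2 to 5 hop range and the 1000-path limit mentioned in the text.

```python
def multi_hop_paths(edges, source, min_hops=2, max_hops=5, limit=1000):
    """Enumerate simple directed paths from source with min_hops..max_hops edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)

    results = []
    stack = [[source]]  # depth-first search over partial paths
    while stack and len(results) < limit:
        path = stack.pop()
        hops = len(path) - 1
        if hops >= min_hops:
            results.append(path)
        if hops < max_hops:
            # Reverse-sorted push so neighbors are explored in sorted order.
            for nxt in sorted(adjacency.get(path[-1], ()), reverse=True):
                if nxt not in path:
                    stack.append(path + [nxt])
    return results
```

An empty result for a given source means that every connection from that address ends after a single hop, which is the situation reported for the Credential Access sources.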

Graph Visualizations
Figure 14 presents graphs representing the whole graph or sub-graphs of the three major attack tactics as well as benign data from the UWF-ZeekData22 dataset. The graphs help visualize the way different tactics connect between nodes. For instance, the graph with connections using the tactic "Discovery" shows a central node connecting to most other nodes, and some nodes connecting over an intermediate node in the path. Looking at the "none" data, connections show a node connected to thousands of other nodes, with a low degree of connections. In contrast, the nodes for the Reconnaissance tactic show a very high degree of interconnectedness. The degree of connections for Reconnaissance appears to be a strong indicator that the node belongs to that tactic.
The code snippet of Figure 15 was used to create Figure 16, showing 10,000 node connections. The graph shows a large collection of nodes that connect among a set of central nodes. These central nodes appear to connect either directly or through intermediate nodes, with a low degree of connection from the central nodes to the outer nodes.

Node Classification Using Graph Neural Networks
In this work, node classification, performed using graph neural networks (GNNs), is used to predict the connection between IP addresses and ports as a form of attack or non-attack in the MITRE framework of UWF-ZeekData22; that is, node classification is used to classify a node as a source or a destination of an attack tactic and to identify the IP address-port combination that is the source of an attack for the different attack tactics (hence, multi-class classification). In Memgraph, since node classification is conducted using GNNs, there are three main layer types available for selection [24]:
1. Graph attention networks jumping knowledge (GATJK).
2. Graph attention networks (GAT, GATv2).
3. "GraphSAGE" (inductive representation learning on large graphs).
The following parameters are provided to train the models: the hidden feature layers, layer type, learning rate, number of epochs, training/testing split ratio, and weight decay. In this work, node classification is performed using the torch open-source machine learning library.
The layer type defines the type of layers in the GNN, such as GATJK, GAT, or GraphSAGE. The learning rate defines how much the GNN adjusts its weights at each learning step to minimize loss. The number of epochs defines the number of times that a model passes through the training dataset; that is, if the number of epochs equals five, the model will pass through the whole dataset five times. The split ratio defines the ratio between training and testing data. Weight decay adds a loss penalty to weights that grow too large, such that a single weight does not monopolize the result. Table 11 presents an example of node classification training parameters.
GAT and GATJK use the torch geometric library to apply graph attention network layers in addition to jumping knowledge. GAT creates a matrix of weights embedded on each node for the neighbors of the node and for the node itself. These standard weight values are supplemented by the "attention coefficient." This attention coefficient characterizes the importance of the relationship between a node and its neighboring node, or how much attention to give the neighboring node. This concept is used in language models as well. The attention coefficient, as described in [25], is calculated by an additional single-layer neural network that takes in the weights of two nodes, applies a "LeakyReLU" activation function, and normalizes the results using "softmax." The final weights are then multiplied by the attention coefficient, adjusting the weights according to which neighboring nodes require the most attention per the additional single-layer neural network [26]. These calculations are performed for node pairs in parallel, allowing for additional performance gain [25]. In the context of Memgraph's "GATJK" model, the torch GAT model is given the parameter jk ("Jumping Knowledge") with the string value "max", which performs a final linear transformation to transform node embeddings to the expected output feature dimensionality [27].
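The attention coefficient described above can be written out for a single node and its neighbors. The sketch below follows the general GAT recipe (a single-layer scoring network, LeakyReLU, then softmax over the neighborhood) in plain Python; it is not the torch geometric implementation. The inputs are assumed to be node features that have already been multiplied by the shared weight matrix, the attention vector `a` is a hypothetical learned parameter, and the LeakyReLU slope of 0.2 is an assumption.

```python
import math


def leaky_relu(value, negative_slope=0.2):
    """LeakyReLU activation: pass positives, scale negatives."""
    return value if value > 0 else negative_slope * value


def attention_coefficients(h_node, h_neighbors, a):
    """Softmax-normalized attention of one node over its neighbors.

    h_node: transformed feature vector of the node (list of floats).
    h_neighbors: transformed feature vectors of its neighbors.
    a: attention vector of length 2 * len(h_node).
    """
    scores = []
    for h_other in h_neighbors:
        concatenated = h_node + h_other  # [W h_i || W h_j]
        raw = sum(w * x for w, x in zip(a, concatenated))
        scores.append(leaky_relu(raw))
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(h_neighbors, alphas):
    """Attention-weighted sum of the neighbor feature vectors."""
    size = len(h_neighbors[0])
    return [
        sum(alpha * h[k] for alpha, h in zip(alphas, h_neighbors))
        for k in range(size)
    ]
```

Because the coefficients for one node sum to 1, `aggregate` returns a weighted average of the neighbor features, with neighbors that score higher under `a` contributing more to the node's new embedding.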
Jumping knowledge can be applied on top of GAT, GraphSAGE, and other graph neural network types. It makes a model more agnostic to specific graph structures: because jumping knowledge influences how neighbor information is aggregated into the embeddings, a neural network trained on one graph might work just as well on another graph with a different structure [28].
GraphSAGE is another graph neural network model, which focuses on creating a general inductive framework that leverages node feature information [29]. The general inductive approach allows GraphSAGE to be used across different graphs without losing as much information because, instead of training individual embeddings for each node, it learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood [29].
Node classification required additional preprocessing as well as feature selection, which are presented next.

Preprocessing for Node Classification
Additional preprocessing was required for node classification on the UWF-ZeekData22 dataset. First, an attack tactic label was required for each node in the graph. The labeling was separated into two classifications: source and destination. This was undertaken to classify a node as a source of an attack or a destination of an attack based on its features, as predicted by the trained model.
Table 12 shows the tactic labels that were assigned to each node for both the src_class (source) and dest_class (destination) node properties, converting categorical strings to integer values for the node classification algorithm. "None" is a benign connection, whereas "no_conn" means there is no connection for this node. Hence, a node that is a source of a connection but has no incoming connection will have dest_class = 7, as it is not a destination of any connections. A node can be a source or destination of multiple tactics, and labels 5 and 6 denote such values. The labels in Table 12 would be considered "multi-class" labels. In addition to this, binary labels were created for each tactic for both source and destination. Table 13 shows the node properties.

Feature Selection for Node Classification
As per the earlier analysis and visualizations that show the differences in degree and centrality of the various nodes by tactic, in-degree, out-degree, and PageRank were selected as graph features for node classification. The benefit of choosing these three features is that they model the structure and layout of the graph without requiring highly specific network log data, while providing information about the amount of traffic and the unique connections being made from each node. Node classification is performed purely based on the connectivity characteristics of the nodes to determine the underlying intention. In-degree is most relevant to nodes that receive network traffic (destination nodes), out-degree is most relevant to nodes that send traffic (source nodes), and PageRank scores the nodes that are most relevant and central to the graph. These graph features were strongly correlated with the classification of a node. Figure 18 presents the code snippet used for feature selection.

Node Classification Results and Discussion
Node classification was performed for the different layer types: GATJK, GraphSAGE, and GATv2. GATJK was initially run using the default parameters. Then, experimentation with different learning rates and numbers of epochs produced a set of optimal parameters that were used to run the other two layer types: GraphSAGE and GATv2. Finally, the three models were compared using the optimal parameters. Since this is a multi-class model, the results presented reflect all classes; that is, classes 1-7 of Table 12.

GATJK
GATJK was first run using the default model. Based on the results of the default model, the learning rates and epochs were adjusted in order to obtain the best results.

Default Model
The initial set of parameters for creating the model for source node classification was as follows: {hidden features: [16,16]; layer type: "GATJK"; learning rate: 0.1; weight decay: 0.0005; split ratio: 0.8; epochs: 100}. The results are shown in Figure 19.
As can be seen from Figure 19, the validation accuracy, F1 score, precision, and recall have a similar trajectory. Of note, the accuracy values are very close to recall; hence, they are difficult to see. For source node classification with the learning rate at 0.1, the accuracy rises steadily until it plummets, and this cycle repeats throughout the epochs. The chart shows how the accuracy, F1 score, and recall fall dramatically in a few select training epochs, although the performance is generally very high. This may be a sign that the learning rate is too high. It also might be because there are multiple attack tactics (since this is being run as a multi-class classifier), and the distribution of the attack tactics is not even. The performance of the model continues to rise until the 25th epoch, after which it drops and rises dramatically. Figure 20 shows the GATJK default source node classification loss. The objective of the model is to minimize loss. The training loss was low and more or less stable with the default parameters; however, the validation loss was inconsistent and quite high in many cases. Next, experimentation was conducted with the learning rate.
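The four validation metrics tracked in these charts can be computed for a multi-class problem as sketched below. The paper does not state how per-class scores are averaged, so the support-weighted average used here is an assumption. Under that assumption, weighted recall is mathematically identical to accuracy, which would be consistent with the observation that the two curves are hard to tell apart.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / total     # class share of the evaluation set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With a heavily imbalanced label distribution, weighted averages are dominated by the majority class, so a model can score highly overall while still misclassifying rare tactics. Per-class scores are worth inspecting alongside the aggregate.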

Varying the learning rates
In the source and destination classification models, the learning rate was reduced to 0.001, and the layers were adjusted to [8,8]. The following parameters were used: {hidden features: [8,8]; layer type: "GATJK"; learning rate: 0.001; weight decay: 0.0005; split ratio: 0.8; epochs: 100}. The results are shown in Figure 21.
As noted from Figure 21, the range of performance is much narrower: the peak stays high, and the lows remain above 0.945 for every measured value, as opposed to approaching 0.2 in our previous attempt. By lowering the learning rate, we can lower losses by an order of magnitude and maintain high metrics across all measures. The performance stops improving at around 10 epochs, so that is likely the optimum number of epochs to avoid overfitting the model to our dataset. Figure 23 presents the loss for the lower learning rate of 0.001. There is significant improvement in loss over epochs with the lower learning rate. Interestingly, the dips in metrics occur at approximately the same points, i.e., 30 to 35 epochs and 45 to 50 epochs, as shown in Figures 22 and 23.

Varying Epochs
The training parameters were adjusted to the following: {hidden features: [8,8]; layer type: "GATJK"; learning rate: 0.001; weight decay 0.0005; split ratio: 0.8; epochs: 5}.Reducing the epochs from 100 to 5 stops the oscillation around the optimum metrics, and this stops once the metrics have arrived or are close to the local maximum.This avoids overfitting the model to the data.This is reflected in the loss charts, which continually drop and flatten out, no longer rising again.
Figure 24 shows the results for the following source node classification metrics: accuracy, F1 score, precision, and recall.The initial epoch produces a high starting correct classification for all metrics.As epochs continue, the metrics flatten out to a plateau between 0.98 and 0.99.Table 14 summarizes the best results of accuracy, F1, precision, and recall for the source node classification as well as destination node classification using GATJK by adjusting learning rates and epochs for the respective nodes.GraphSAGE was performed with the optimal parameters obtained from the GATJK model.The GraphSAGE model used the following parameters: {hidden features: [8,8]; layer type: "SAGE"; learning rate: 0.0005; weight decay 0.0005; split ratio: 0.8; epochs: 10}.With similar parameters set for GraphSAGE, its performance was very comparable to GATJK, with metrics all lying within the 0.95 to 0.99 range for both source node classification and destination node classification, as shown in Figure 28.After epoch 6, the metrics stop trending higher and oscillates close to 0.96, with the highest metrics across accuracy, F1 score, precision, and recall falling between 0.96 and 0.98.   
Figure 30 shows the metrics for GraphSAGE destination node classification.The values peak at around epoch 7, nearing 0.99, before falling again.The peak performance appears to be slightly higher than GATJK, which peaked at around 0.985.The changes between epochs are more unpredictable as compared to source node classification.Table 15 summarizes the best results of accuracy, F1, precision, and recall for the source node classification as well as the destination node classification using GraphSAGE by adjusting learning rates and epochs for the respective nodes.GATv2 was also started with the optimal parameters obtained from the GATJK and GraphSAGE models.Using the same parameters as GATJK and GraphSAGE, the parameters were set to the following values: {hidden features: [8,8]; layer type: "GATv2"; learning rate: 0.0005; weight decay 0.0005; split ratio: 0.8; epochs: 10}.
Figure 32 shows the resulting metrics for source node classification using GATv2. The results are very poor except for precision, which stays in the 0.8 to 0.9 range. Accuracy, F1 score, and recall stay below 0.6 after the first epoch. The results indicate that the other GNNs are better solutions for node classification. Figure 33 shows that although the losses trend downwards, the minimum losses are still much higher than the GATJK and GraphSAGE source node classification losses. The minimum training loss is still higher than 0.3. Figure 34 shows that GATv2 destination node classification performs relatively well compared to source node classification. The metrics stay close to 0.9 throughout, with no drastic changes over epochs. Table 16 summarizes the best results of accuracy, F1, precision, and recall for the source node classification as well as the destination node classification using GATv2 by adjusting learning rates and epochs for the respective nodes.
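The four reported metrics can diverge, as the GATv2 source results show (high precision with low accuracy, F1 score, and recall). The sketch below computes them for a multi-class problem. The paper does not state how the per-class scores are averaged, so macro-averaging is an assumption here, and the labels are made-up illustrative values rather than results from the dataset.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Illustrative labels only: one Discovery node is misclassified as Reconnaissance.
y_true = ["Reconnaissance"] * 4 + ["Discovery"] * 2 + ["None"] * 2
y_pred = ["Reconnaissance"] * 4 + ["Discovery", "Reconnaissance"] + ["None"] * 2
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```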

Summary for Node Classification
The GNN models classify an IP address-port combination as a source or destination of an attack or benign connection tactic type using three graph-related features (in-degree, out-degree, and PageRank), and their performance depends on the algorithm, or layer type, chosen. GraphSAGE and GATJK attained favorable results for both source and destination node classification, with the optimized models' lowest metrics being greater than 0.95; GATv2 did not. This supports the idea that the graph structure and graph-related features representing network connections can indicate the attack tactic being conducted, as replicated in a controlled cyber environment.
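To make the three node features concrete, the following sketch derives in-degree, out-degree, and PageRank from a small directed edge list. It is a plain power-iteration illustration, not Memgraph's implementation; the damping factor of 0.85, the iteration count, and the IP address-port strings are all assumptions chosen for the example.

```python
from collections import defaultdict


def degree_features(edges):
    """Return {node: (in_degree, out_degree)}; parallel edges are counted each time."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    return {node: (in_deg[node], out_deg[node]) for node in nodes}


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank held by dangling nodes is spread uniformly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical connections: each node is a concatenated "IP:port" address.
edges = [
    ("10.0.0.1:50001", "10.0.0.9:53"),
    ("10.0.0.2:50002", "10.0.0.9:53"),
    ("10.0.0.3:50003", "10.0.0.9:53"),
    ("10.0.0.1:50001", "10.0.0.8:22"),
]
degrees = degree_features(edges)
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, degrees[node], round(ranks[node], 4))
```

In this toy graph the address that receives the most connections also receives the highest PageRank, which is the kind of signal the classifiers consume.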

Limitations of this Study
A limiting factor in our dataset is that the number of Reconnaissance instances far outweighs that of the other tactic types. That said, based on the results, GNNs are a promising candidate for identifying bad actors in a network.

Conclusions
In this research, the UWF-ZeekData22 dataset, which consists of network logs generated by Zeek, a passive open-source network traffic analyzer, was viewed within a graph framework to explore and describe the graph structure of different MITRE ATT&CK tactics, and machine learning was performed using graph neural networks to classify nodes as sources or destinations of an attack tactic.
Preprocessing to prepare the data for ingestion into Memgraph was conducted in Jupyter Notebooks. Unnecessary columns were removed, and additional node labels and addresses were edited to fit the format of a graph. The connections were set as edges with concatenated IP addresses and ports as the single address for each node. Both the nodes and edges were labeled for the attack tactic.
Memgraph generated a graph representation of UWF-ZeekData22, with which several graph properties were extracted, such as PageRank, degree, bridges, weakly connected components, node and edge cardinality, and path length. These properties further described the graphs that were also visualized for different tactics, providing a different view.
Graph neural network models were generated for the UWF-ZeekData22 dataset to perform node classification; that is, to label a node as a source or destination node for the correct tactic under the MITRE ATT&CK framework. Through various means of testing and tweaking the models, favorable results were obtained with training parameters: {hidden features: [8,8]; layer type: "GATJK"; learning rate: 0.001; weight decay: 0.0005; split ratio: 0.8; epochs: 5}. Reducing the epochs to 5 stopped the oscillation around the optimum metrics, which prevented overfitting the model. Of Memgraph's three GNN classifiers, GATJK gave the best results for both source node classification and destination node classification using only three graph features: in-degree, out-degree, and PageRank. For GATJK, all metrics (accuracy, F1 score, precision, and recall) were above 98.3% and 97.7% for source and destination node classification, respectively. There was a significant improvement in loss over epochs with a lower learning rate of 0.001. The performance of GATv2 was the weakest of the three models.
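The epoch count above was reached by testing and tweaking. One common way to automate that choice is patience-based early stopping on the validation loss, sketched below. This is a swapped-in illustration rather than the procedure used in this study; the `patience` setting and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_epoch, best_loss


# Hypothetical validation losses that improve, then oscillate after epoch 5.
val_losses = [0.20, 0.12, 0.08, 0.06, 0.05, 0.07, 0.06, 0.09, 0.05, 0.08]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)
```

With these illustrative values, training would stop once two consecutive epochs fail to beat the best validation loss, so the later oscillation is never trained through.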
Multi-classification had not been conducted on this set of data previously, but these results are better than some previous results obtained for binary classification using classical machine learning classifiers like SVM, naïve Bayes, and logistic regression in [30].

Future Works
The use of graph neural networks as an AI/ML intrusion detection system using live, real-time, "temporal" GNNs presents an exciting potential use-case that requires further research and exploration. As it currently stands, the Zeek logs provide a recording of events that have already happened and are useful for exploring past incidents. GNNs provide unique insight by framing connectivity from a graph perspective, and as research in GNNs is still ongoing, the use of GNNs as a solution for cybersecurity will continue to evolve and improve.

Figure 2. Attributes before and after preprocessing.

Figure 3. Cypher query: creating the nodes and edges.

Figure 5. Cypher query: generating the PageRank scores.

6.1.2. PageRank Scores for the Reconnaissance Tactic
Table 3 presents the PageRank scores for the top 10 IP addresses for the Reconnaissance tactic. As per Table 3, the IP address 143.88.5.1:53 in UWF-ZeekData22 is the most likely to be attacked using the Reconnaissance tactic. Based on these results, more resources should be allocated to protecting 143.88.5.1:53.

Figure 7. Cypher query for count of bridges.

Figure 8. Cypher query returning the number of nodes and edges.

Figure 9. Node and edge cardinality for tactics of interest.

Figure 10. Returns paths of length 2 to 5 edges away.

Figure 11. A node of the Reconnaissance tactic to destination addresses.

Figure 12 shows a source node of None (benign traffic), with path lengths of 2.

Figure 12. Node of None (benign traffic) with path length of 2.

Figure 13. Nodes of the Discovery tactic to destination addresses.

Figure 15. Showing 10,000 nodes of the benign data.

Figure 17 is a close-up of Figure 16, offering a view of the connections at node address "199.7.91.13:53". The node is connected to many other nodes, with one or two edges for most connections.

Figure 17. Highlight of a highly connected node of "none" (benign data).

Figure 22 shows that training loss is lowered to the 0.05-0.06 range. The validation loss fluctuates between 0.04 and 0.12.

Figure 25 shows the loss from source node classification using GATJK. The loss from training data continually slopes down, approaching 0.05. Validation loss also follows the

Figure 26 shows accuracy, precision, recall, and F1 score for destination node classification. All metrics hit a local maximum at epoch 3, before slowly falling again. This contrasts with the source classification, which continued to improve beyond epoch 3. In terms of performance, destination node classification performed quite close to the source node classification, with both ranging between 0.98 and 0.99 at their highest metric values.

Figure 27 shows the training and validation loss for destination node classification. Once again, the loss plateaus at epoch 5, indicating that it is approaching a local minimum at that point. The losses are slightly higher than the source classification losses, but both are comparably low, being less than 0.1.

Figure 29 shows GraphSAGE source node classification loss. The training loss plateaus starting from epoch 3 until it is almost flat at a value close to 0.1. The validation loss similarly plateaus, starting from epoch 3, and then peaks again at epoch 7. The validation loss is higher than the training loss, maintaining a value closer to 0.2.

Figure 31 shows GraphSAGE destination classification losses. The training loss trends continually downwards. The validation loss peaks at epoch 3 and then rises again at epoch 8. The overall loss is slightly higher in destination node classification than in source node classification.

Figure 35 shows that GATv2 destination node classification training loss and validation loss trend towards 0.2. Again, the destination node classification loss minimums are better than the source node classification losses.

Table 4 shows that IP address 143.88.2.12:22 in UWF-ZeekData22 is the most likely to be attacked using the Discovery tactic. Based on these results, more resources should be allocated to protecting 143.88.2.12:22.

Table 5. In-degree centrality as tactics subgraphs.

Table 8. Bridge counts by tactic.

Table 11. Example of node classification training parameters.