Network Traffic Anomaly Detection via Deep Learning

Abstract: Network intrusion detection is a key pillar of the sustainability and normal operation of information systems. Complex threat patterns and malicious actors are able to cause severe damage to cyber-systems. In this work, we propose novel Deep Learning (DL) formulations for detecting threats and alerts in network logs acquired by pfSense, an open-source software that acts as a firewall on the FreeBSD operating system. pfSense integrates several powerful security services, such as a firewall, URL filtering, and virtual private networking, among others. The main goal of this study is to analyse the logs acquired by a local installation of the pfSense software, in order to provide a powerful and efficient solution that controls traffic flow based on patterns learnt automatically via the proposed DL architectures. For this purpose, we exploit Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs) to construct robust multi-class classifiers, able to assign each new network log instance that reaches our system to its corresponding category. The performance of our scheme is evaluated by conducting several quantitative experiments and by comparing it to state-of-the-art formulations.


Introduction
Over recent years, the spread of connectivity and the extended internet usage due to the pandemic have both increased the vulnerability to cyber-threats and grown the sophistication of cyber-attacks. As a result, the cyber-security defensive shield of information systems is further stressed. Nowadays, modern enterprises generate enormous amounts of data, which brings several opportunities concerning their exploitation, but also poses severe challenges regarding their protection. In this direction, network logs are considered an immensely rich source of information that can be exploited for many purposes, such as estimating the current operational state of the network, and investigating and preventing potentially malicious activities. The above indicate the necessity of further improving cyber-security defences by providing more effective cyber-security procedures. Cyber-security protection is the ability to identify any malicious behaviour or potential vulnerability that poses a threat which can be exploited by adversaries. For this purpose, relevant countermeasures and mitigation actions have to be applied to the affected assets in order to overcome potential risks.
Network log management corresponds to the collection, manipulation, analysis and reporting of large volume and velocity data, such as event-logs, audit records, audit trails, etc. Additionally, log management evaluates the firewall capabilities in enhancing network security, facilitates the early detection of possible vulnerabilities, and minimizes the resolution time of any suspicious action that corrupts the normal operation of information systems. On this direction, network traffic analysis enables the immediate and deep understanding of several significant network metrics, including the network type, origin, size, destination of variant packets, and the uploading/downloading speed, among others. Consequently, network traffic analysis provides significant assistance towards the identification of malicious packets and actors within the traffic flow. However, with the increased penetration rates that are reported on modern information systems, sophisticated cybersecurity protection technologies like Machine Learning (ML) and Artificial Intelligence (AI) can be incorporated into the loop.
In this paper, we design and develop an innovative network-traffic monitoring and anomaly detection scheme that is able to upload, monitor and analyse network logs utilizing the pfSense software [1,2], as illustrated in Figure 1. pfSense is a customized FreeBSD (Berkeley Software Distribution) distribution that is oriented to be used as a firewall and router. The software includes multiple features, among others: firewall, routing, QoS differentiation, NAT, redundancy, load balancing, VPN, reporting and monitoring, real-time information, and a captive portal. It is fully prepared for high-throughput scenarios (over 500 Mbps). pfSense uses a single XML file, called config.xml, which stores the configuration of all services available in the pfSense system. For each new instance that reaches our system, we use Apache Spark Streaming [3] as the tool for real-time processing of the monitored logs. Packets from the input data are stored into Apache Kafka topics and are ingested into Spark Streaming using the Kafka consumer API for Spark, synthesizing discretized streams. After we parse the input network log instances, we perform the subsequent analysis steps, including the design and evaluation of semi-supervised Deep Learning (DL) anomaly detection techniques, with the main goal of classifying the different Event types that reach our system. Specifically, the key contributions of this paper are summarized as follows:
• The design and development of an innovative network monitoring and analysis scheme based on the pfSense software;
• The construction and pre-processing of a novel network intrusion dataset composed of Suricata logs collected from the pfSense software;
• The design and exploitation of challenging Deep Learning (DL) schemes for semi-supervised network intrusion/anomaly detection.
To the best of our knowledge, this work is among the first to propose variant deep feature learning architectures for the problem of anomaly detection on network logs acquired by the pfSense firewall, with the main goal of Event type classification. The proposed scheme can be efficiently extended, with minor modifications regarding the data pre-processing, to detect multiple threat types from variant information systems. The rest of the paper is structured as follows: Section 2 describes the state of the art concerning network intrusion detection systems, along with the most recent Machine Learning/Deep Learning approaches reported in the literature; Section 3 demonstrates the main formulation of this work, including the proposed architectures: the Convolutional Neural Networks and the Long Short-Term Memory networks for the multi-class Suricata pfSense log anomaly detection; Section 4 presents the complete experimental setup, including the data acquisition, the dataset creation, the achieved performance of the proposed schemes and their comparison; finally, Section 5 concludes this work and highlights our main future directions.

Network Intrusion Detection Systems
Nowadays, the complexity of cyber-attacks and malicious events has grown tremendously. For this purpose, the design and exploitation of robust Intrusion Detection Systems (IDSs) [4][5][6] becomes a fundamental entity towards the protection and sustainability of Information and Communication Technology (ICT) infrastructures. IDSs provide the means to detect malicious cyber-activities early and to design efficient mitigation actions against them. Recently, numerous interesting intrusion detection approaches have been proposed in the related literature [7,8]. Generally, network intrusion detection methodologies are divided into two main categories: (i) the signature-based [9,10], and (ii) the anomaly detection-based approaches [11,12]. Signature-based intrusion detection approaches focus on detecting pre-defined malicious threats, for instance scanning network traffic for a series of malicious packet sequences or bytes. The key benefit of these approaches lies in the fact that signatures can be efficiently extracted, under the constraint that precise prior knowledge regarding their patterns or structure exists. A characteristic example is described in [13,14], where the authors propose a signature-based IDS for the detection of DoS and routing attacks in IoT networks, involving centralised and distributed components. Specifically, the detection module operates on the main router, while other lightweight modules are deployed in the network in close proximity to the IoT devices for traffic monitoring and reporting purposes. Moreover, pattern matching in signature-based approaches is implemented efficiently on modern systems that minimize the amount of power required to perform the matching for a particular rule set [15]. For instance, if the monitored system communicates only via DNS, ICMP and SMTP, all other signatures can be ignored.
Despite the advantages of signature-based techniques, they also confront several limitations, since they only detect attacks whose signatures have previously been recorded. Consequently, these techniques perform well only when they work with pre-defined behavioral patterns, while they demonstrate poor performance when they either deal with novel attacks, or when users employ advanced technologies such as payload encoders and encrypted data channels [16].
On the other hand, anomaly detection techniques exploit numerous statistical, data-mining and Machine Learning (ML)-based techniques to automatically detect any kind of attack [17]. ML formulations have been widely utilized for network intrusion detection applications [18], and they can be discriminated into shallow ML and Deep Learning (DL) according to the involved network architecture. Generally, DL architectures present more robust representation ability compared to shallow architectures. Additionally, ML-based anomaly detection approaches can be further discriminated into supervised [19][20][21], semi-supervised [22][23][24] or unsupervised [25][26][27][28][29]. Supervised classification approaches require proper training datasets that include all the anomalous examples along with their corresponding labels. The main drawback of these techniques is that they achieve proper detection performance only when they exploit a large number of labeled training data. Nevertheless, real network traffic data usually lack truly labeled datasets. On the other hand, semi-supervised anomaly detection approaches provide accurate estimates of the probability distribution of the normal and malicious classes, extracted from a sufficiently large amount of collected log/network measurements [30]. These techniques are differentiated from the supervised approaches in the sense that partial or no information regarding the properties of the anomaly is presented. Finally, the unsupervised anomaly detection methodologies aim to automatically discriminate normal from abnormal network behavior without exploiting labelled data. In such scenarios, clustering-based techniques [31][32][33] proved to be a convenient solution; however, due to the enormous and extremely heterogeneous structure of network traffic measurements, more sophisticated techniques, relying on DNNs, are considered [34].
Finally, generative models have proved to be an intelligent choice for the unsupervised network anomaly detection problem [35].

Embedding
Recently, several interesting techniques have been reported in the literature that handle time-series observations with text data using deep learning architectures [36,37]. In this work, we propose two DL schemes for the problem of multi-class event classification/anomaly detection of pfSense Suricata logs, both of which consider a word embedding layer as their input layer. Embedding has proved to be a highly efficient choice in DL, not only for handling categorical data, but generally for big-volume datasets [38]. Embedding acts as a trainable layer in which categorical data are encoded via dense vectors of real values with reduced dimension [39,40]. Specifically, embedding stores the input information of n words (i.e., categorical data) into an (n × m) matrix, where m ≪ n, i.e., each word is stored into a (1 × m) vector. The textual input to the network corresponds to a one-dimensional vector, with each element consisting of an integer identifier of the corresponding word; consequently, multiple occurrences of a word are represented by the same identifier. Additionally, using distance measures, words with similar meanings are clustered together, facilitating the analysis of similar words during the training and testing phases.
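For illustration, the embedding lookup described above can be sketched in a few lines; the toy vocabulary, dimensions and random initialisation below are our own assumptions, whereas in the actual model the (n × m) matrix is a trainable layer:

```python
import numpy as np

# Toy vocabulary: each word maps to one integer identifier, so
# multiple occurrences of a word share the same id.
vocab = {"alert": 0, "anomaly": 1, "tcp": 2, "dns": 3}
n, m = len(vocab), 5          # n words, embedding dimension m (m << n in practice)

rng = np.random.default_rng(0)
E = rng.normal(size=(n, m))   # the (n x m) embedding matrix, trainable in the model

tokens = ["alert", "tcp", "alert"]
ids = np.array([vocab[t] for t in tokens])  # 1D vector of integer identifiers
vectors = E[ids]                            # each word becomes a (1 x m) dense vector

print(vectors.shape)          # (3, 5)
```

Because repeated words map to the same row of E, their embeddings are identical, which is exactly the identifier-sharing property described above.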

Long-Short Term Networks (LSTM)
Long Short-Term Memory (LSTM) [41] is a variation of Recurrent Neural Networks (RNNs) [42] that considers the long-term dependencies between data nodes. Each LSTM consists of a chain of repeating neural-network modules that process an input data sequence, and utilizes memory blocks to replace the traditional neurons in the hidden layer. Each module consists of three control gates and four neural network layers interacting with each other. The gates control the cell state (i.e., the top horizontal line of the module that runs straight down the entire chain) by deciding the extent to which old information should be erased (the "forget gate"), the new information that should be memorized (the "input gate"), and, last, the current contents that should be communicated (the "output gate"). Each gate is composed of a sigmoid-activated neural network layer and a pointwise multiplication operation. The sigmoid layers output numbers in the interval [0, 1], where zero denotes that no information passes, while 1 corresponds to all information being exposed. From an analytical point of view, we consider a memory cell C_t, an input gate i_t, an output gate o_t, and a forget gate f_t. At time instance t, x_t corresponds to the input and h_t to the hidden state, while Ĉ_t is the candidate state of the memory cell, which determines how much of the input is received into the cell state. The calculations for each gate, input candidate, hidden state, and cell state are summarized as follows:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f),
i_t = σ(W_i x_t + U_i h_{t−1} + b_i),
o_t = σ(W_o x_t + U_o h_{t−1} + b_o),
Ĉ_t = tanh(W_c x_t + U_c h_{t−1} + b_c),
C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t,
h_t = o_t ⊙ tanh(C_t),

where C_t ∈ R^m denotes the state vector, x_t ∈ R^d is the input vector, and h_t ∈ R^m is the output vector. Additionally, σ and tanh stand for the sigmoid and tanh activation functions, respectively, while i_t, f_t and o_t correspond to the input, forget, and output gates, respectively, and ⊙ denotes the element-wise (Hadamard) product. LSTM neural networks and their variants are regarded as a powerful tool in DL, and have been widely used in modelling time-series prediction and classification problems.
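For the reader's convenience, the gate computations can be traced step by step in plain NumPy; the following is a minimal single-cell sketch with randomly initialised weights and toy dimensions of our own choosing (no peephole connections), not the trained architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM time step; W: (4m, d), U: (4m, m), b: (4m,)."""
    z = W @ x_t + U @ h_prev + b       # all four pre-activations at once
    m = h_prev.shape[0]
    f = sigmoid(z[0:m])                # forget gate: what to erase
    i = sigmoid(z[m:2*m])              # input gate: what to memorize
    o = sigmoid(z[2*m:3*m])            # output gate: what to expose
    C_hat = np.tanh(z[3*m:4*m])        # candidate cell state
    C_t = f * C_prev + i * C_hat       # new cell state
    h_t = o * np.tanh(C_t)             # new hidden state
    return h_t, C_t

d, m = 8, 4                            # toy input and hidden dimensions
rng = np.random.default_rng(1)
W = rng.normal(size=(4*m, d))
U = rng.normal(size=(4*m, m))
b = np.zeros(4*m)

h, C = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, d)):    # a length-5 input sequence
    h, C = lstm_step(x_t, h, C, W, U, b)
print(h.shape)                         # (4,)
```

Stacking the four weight matrices into one (4m × d) matrix is a common implementation trick; it computes exactly the same six equations as above.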
In our previous work, we provided the complete analysis and background regarding the LSTM formulations [43], when applied on the problem of autonomous fault detection on Critical Energy Infrastructures. Additionally, over the last few years multiple LSTM approaches have been proposed in the literature that tackle the problems of anomaly detection and network intrusion detection [44,45].

Convolutional Neural Network
Convolutional Neural Networks (CNNs) [46] are biologically inspired feed-forward deep neural networks, traditionally used for two-dimensional signals, such as imaging and video data [47]. Over the last few decades, CNNs have been considered the state-of-the-art approach for multiple image classification [48] and detection problems [49]. Nevertheless, their success in tackling 2D problems imposed the demand to also use them for 1D applications, such as time-series prediction and classification problems. Recently, multiple literature techniques use 1D-CNNs for anomaly detection applications [50][51][52]. In this formulation, due to the tremendously complex structure of our data, conventional deep learning formulations are not able to fully represent and encode the input information. In contrast, the CNN scheme, which is composed of one or more convolution and pooling layer(s) followed by one or more FCN layer(s), builds a powerful architecture, able to encode high-level features and extract all the valuable information that will facilitate the subsequent classification task.
Consequently, similar to 2D CNNs, 1D CNN layers process the raw signals and automatically extract features that can be further used for any classification task, implemented by the network's fully-connected/dense layers. The key differences from traditional multilayer perceptron models (MLPs) include: (i) 1D CNNs perform 1D convolutions on the input data with pre-defined kernels, (ii) the kernel size affects the connectivity of the previous 1D convolution layers with the current layer, and (iii) the same weights are used for all connections. A significant component of a CNN concerns the non-linear activation function which is applied to the outputs of the convolution operations. The most prominent non-linear functions that have been utilized in the related state of the art are the sigmoid, the tanh, and the Rectified Linear Unit (ReLU), which preserves the non-negative components while the negative values are set to zero, according to: σ(x) = max(0, x) [53]. Moreover, the activations of each convolutional layer are introduced into a pooling layer that produces down-sampled versions of the input vector. There are numerous types of pooling operators, with max-pooling and average pooling being the most convenient choices that reduce the model's training computational complexity. Finally, the last layer of a traditional CNN architecture corresponds to a fully-connected layer that assigns each output unit to a certain class.
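The 1D convolution, ReLU activation and max-pooling chain described above can be sketched as follows (the toy signal and kernel values are our own illustrative choices):

```python
import numpy as np

def relu(x):
    """ReLU: keeps non-negative components, zeros out the negatives."""
    return np.maximum(0, x)

def conv1d_valid(x, w, b):
    """'Valid' 1D convolution: input length n, kernel m -> output length n - m + 1."""
    n, m = len(x), len(w)
    return np.array([x[i:i + m] @ w + b for i in range(n - m + 1)])

def maxpool1d(x, size=2):
    """Non-overlapping max-pooling that down-samples the feature vector."""
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([0.2, -1.0, 3.0, 0.5, -0.3, 1.2, 0.0, 2.1])  # toy signal, n = 8
w, b = np.array([1.0, -0.5, 0.25]), 0.1                    # one kernel, m = 3

feat = maxpool1d(relu(conv1d_valid(x, w, b)), size=2)
print(feat.shape)   # (3,): n - m + 1 = 6 activations pooled down to 3
```

In a real layer, many such kernels run in parallel (one feature map per filter), and their weights are learnt rather than fixed.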

Proposed Formulation
This paper provides a modification of our previous work [54], designing a formulation that uploads, monitors and analyses network logs utilizing the pfSense software [1]. For each new network log that arrives at our system, Apache Spark Streaming [3] is used for its monitoring in (semi) real-time conditions. To be more accurate, after we parse the input, raw pfSense network measurements, we perform several intermediate data pre-processing and cleansing steps. Characteristic examples include several proper internal data checks, such as handling missing values and the existence of inappropriate characters. Additionally, since we deal with multivariate time-series data, we separate the numeric from the categorical variables to encode them properly. For the categorical variables, Natural Language Processing (NLP)-based data pre-processing procedures [55] have been applied, while all numeric variables were normalized to unit l1-norm. After the data pre-processing part, we proceed to the analysis procedure regarding the development of DL models for network intrusion detection/multi-class event classification techniques. To achieve this, we divide the analysis into two distinct phases:
• Offline: representing the model training process;
• Online: corresponding to the decision making, which does not involve model re-training.
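As a sketch of the numeric pre-processing step, assuming hypothetical attribute names, one common reading of the l1 normalization (per-record, as in scikit-learn's Normalizer) is:

```python
import numpy as np

# Hypothetical toy records; the real pfSense logs carry different attributes.
records = [
    {"bytes": 1200, "packets": 10, "protocol": "TCP"},
    {"bytes": 300,  "packets": 2,  "protocol": "UDP"},
]

# Separate numeric from categorical variables so each can be encoded properly.
numeric = np.array([[r["bytes"], r["packets"]] for r in records], dtype=float)
categorical = [r["protocol"] for r in records]   # later fed to the embedding layer

# Unit l1-norm: divide each numeric row by the sum of its absolute values.
l1 = np.abs(numeric).sum(axis=1, keepdims=True)
numeric_l1 = numeric / l1
print(numeric_l1.sum(axis=1))   # each row now sums to 1.0
```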
During the offline, training stage, we store the learnt models, in order to be able to distribute them for the decision-making stage. To achieve this, we design two characteristic DL formulations, capable of detecting whether specific logs contain anomalies, such as alerts or threats. The following paragraphs are dedicated to the multi-class classification problem of the "Event" type of the Suricata logs that were extracted from the pfSense software, using: (i) the Long Short-Term Memory (LSTM) neural network approach, and (ii) the one-dimensional Convolutional Neural Network (CNN) method.

Long Short Term Memory (LSTM) Approach
The first technique exploited for the problem of multi-class classification/anomaly detection relies on the intelligent scheme of the Long Short-Term Memory (LSTM) neural networks. As explicitly described in the background Section (i.e., Section 2), LSTM neural networks are widely used for both multivariate time-series analysis and Natural Language Processing (NLP)/text-classification problems. Consequently, their structure is highly convenient for our use-case, since the network measurements that we exploit comprise a combination of time-series data instances that include significant text information. In the following paragraphs, we provide a detailed analysis of the mathematical formulation of the LSTM as it is applied to our scenario. In our formulation, we consider as {X_t}_{t=1}^{N} the input data sequence, where N stands for the total number of examples. The first hidden layer is an embedding layer, in order to encode the text information properly and provide more robust and faithful representations of our data. Thus, for each X_t we consider: X_t = [x_t^1, x_t^2, ..., x_t^k], where x_t^k ∈ R^K. Each i-th vector of the sequence X_t can be encoded via the proposed LSTM architecture as:

h_t^i = σ(U h_t^{i−1} + W x_t^i),

where U ∈ R^{m×m}, W ∈ R^{m×k} are the RNN (i.e., Recurrent Neural Network) coefficient weight matrices, and h ∈ R^m stands for the state vector. For each X_t the output is formulated as:

z_t^i = f(x_t^i; φ),

where z_t^i corresponds to the output vector of the i-th input unit, and φ denotes the set of imposed parameters. After each LSTM layer, a dropout layer is applied, which preserves the most characteristic parts of the learnt features. The final, fully connected layer is responsible for the classification task, and is selected to be activated using the softmax function [56] as:

softmax(z)_j = exp(z_j) / Σ_{c=1}^{K} exp(z_c), j = 1, ..., K.

Finally, a state-of-the-art back-propagation algorithm [57] is exploited to learn the model's trainable parameters.
For the reader's reference, in back-propagation the model's parameters are updated via the alternating minimization of the cost function with respect to each parameter:

φ ← φ − η ∂L(D; φ)/∂φ,

where we denote by D the training dataset and by η the learning rate. To extract the final prediction, we calculate the maximum value of the output probabilities:

ŷ = argmax_j softmax(z)_j.

In Figure 2, we depict the main diagram of the proposed LSTM-based formulation for the multi-class event type classification of the Suricata pfSense logs. Specifically, we illustrate the high-level architecture of the proposed scheme, in which multiple LSTM layers can be imposed. However, to reduce the computational complexity of our model, we may use only one LSTM layer and still achieve significant performance.
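The softmax activation and the maximum-probability decision can be illustrated as follows (the logit values below are arbitrary placeholders, one per <Event> class):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the class logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.2, 0.3, 2.5, -0.7, 0.0, 0.9])  # one logit per <Event> class
probs = softmax(logits)                              # probabilities summing to 1
prediction = int(np.argmax(probs))                   # class with maximum probability
print(prediction)                                    # 2
```

Subtracting the maximum logit before exponentiation leaves the result unchanged while avoiding overflow for large logits.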

One Dimensional Convolutional Neural Network (CNN) Approach
Since convolution operators automatically extract features and patterns from complex data structures, in the second approach of this study we propose a 1D-CNN architecture, as illustrated in Figure 3. Specifically, in this scheme we aim to exploit the key insight of CNNs: while in traditional, dense neural network schemes the activation of each hidden unit is computed by considering the whole input vector with certain weights, in CNNs only the most representative parts of the input vectors are utilized to compute the activation term of each hidden unit. Our proposed 1D-CNN scheme considers the input time-series sequence {X_t}_{t=1}^{N}, and first passes it through an embedding layer composed of 50 hidden units. Each input observation X_t = [x_t^1, x_t^2, ..., x_t^k], x_t^k ∈ R^K, is passed through a convolution operation of the following form:

h = σ(w ∗ x + b),

where b denotes the bias parameter, σ corresponds to a non-linear activation function, w ∈ R^{m×m} corresponds to the filter operator, and finally the output of the l-th hidden layer for a square region (n × n) is formulated as h ∈ R^{(n−m+1)×(n−m+1)}. In our formulation, we use 32 convolution filters, with a kernel size of 3 and 8 input features. The convolution filters allow us to extract different features on the first convolution layer of our network. Each filter contains its own weights with the pre-defined kernel size, composing the length of the input matrix. Moreover, each convolutional layer (i.e., 1DConv) is followed by a max-pooling operator (i.e., MaxPool) that down-samples the spatial size of the output, by selecting the maximum value of the region to which the filter operator is applied. In our scheme, we used 1D max-pooling with a pooling size of 2 and a stride of 1. Additionally, after each max-pooling operator, a dropout layer is imposed in order to provide higher generalization for unseen data, by dropping out a percentage of outputs from the previous layer.
All convolutional layers are activated using the ReLU function. After the last pair of 1DConv-MaxPool layers, a flatten layer is imposed, followed by a stack of fully-connected/dense layers. Generally, the majority of the learning parameters are included in the fully-connected layers. The last layer is a softmax-activated layer, which is responsible for assigning the probabilities to the corresponding classes, and thus for the classification task.

Experimental Setup
In this section, we provide the dataset description, including the data acquisition process, the experimental scenario that was exploited, the evaluation metrics, and finally the quantitative performance of the proposed architecture.

pfSense Software
In this work, we exploit data extracted through the pfSense software [1,58], which acts as a firewall service, also enabling the Suricata [59] intrusion detection service using Snort detection rules. pfSense is a customized FreeBSD (Berkeley Software Distribution) distribution, oriented to be used as a firewall and router. It started as a fork of the m0n0wall project, which was mainly directed towards embedded hardware installations [2]. In contrast, pfSense is more focused on full PC installations. The pfSense platform includes many base features and can be extended with the package system, including "one touch" installations. pfSense stands as a viable replacement for commercial firewalling/routing packages, including multiple features that can be found in commercial products such as SonicWall and WatchGuard, among others. Among the most significant features of pfSense are the firewall, routing, QoS differentiation, NAT, redundancy, load balancing, VPN, reporting and monitoring, real-time information, and a captive portal. It is fully prepared for high-throughput scenarios (over 500 Mbps), as long as high-end server-class hardware is used. Logs in the pfSense software contain recent events and messages from daemons. These messages can be stored locally on a limited basis, or forwarded to a central logging server for long-term storage, better reporting, and alerting, among others. pfSense logs are contained in a binary circular log format called clog. These files have a fixed size and do not grow; consequently, the logs only hold a certain amount of entries, and as new log entries are added to the system, the oldest entries are removed. Additionally, logs can be copied to another server with syslog, where they may be permanently retained or rotated with less frequency. On normal installations, where logs are kept on disk, they are retained across reboots.
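The fixed-size, circular behaviour of clog can be illustrated with a bounded buffer; this is only a Python analogy for the eviction behaviour, not the clog binary format itself:

```python
from collections import deque

# A bounded buffer mimics clog's behaviour: a fixed capacity, with each new
# entry silently evicting the oldest one once the log is full.
log = deque(maxlen=3)
for entry in ["event-1", "event-2", "event-3", "event-4"]:
    log.append(entry)

print(list(log))   # ['event-2', 'event-3', 'event-4']
```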

Dataset Description
Suricata is a high-performance and multi-tasking network intrusion detection (IDS), prevention (IPS) and security monitoring engine, designed to protect networks, as well as to collect and store information about any incoming signals [59]. Additionally, it is open source and owned by a community-run non-profit foundation, the Open Information Security Foundation (OISF). The operation of the Suricata attack detector is based on the analysis of signatures and heuristics, and its convenience is due to the open access to its source code. This approach allows users to customize the system performance for individual tasks. The main settings include rules referring to traffic analysis, filters that limit the output of an alert, and notifications regarding active ports and networks, among others. Consequently, Suricata stands as an intelligent and highly flexible IDS/IPS solution that can be easily extended or modified depending upon the type of the attack, making it as effective as possible. Information and communication systems capture and store information about suspicious activity. Users are able to edit the settings of the attack detector, to specify internal and external networks, address ranges of the different servers, and the utilised ports, among others. Additionally, in Suricata's default settings, the traffic to external interfaces is also analysed. Suricata supports multiple output formats, including: line-based alerts, EVE JSON, HTTP logs, packet logs, and verbose alerts. In our study, we exploit the line-based alert log file, and a characteristic example of the extracted logs is illustrated in Figure 4. The attributes extracted from the Suricata line-based alert logs are: {<Date>, <Source>, <Suricata_id>, <Flow_id>, <Event>, <Input_IP_address>, <Destination_IP_address>, <Protocol>, <Description>, <Classification>}.
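As an illustration of parsing the line-based alert output, the sketch below assumes Suricata's default fast.log layout; the exact attribute layout in our deployment is the one shown in Figure 4, and both the sample line and the regular expression here are illustrative assumptions rather than our production parser:

```python
import re

# A fast.log-style sample line (illustrative, not taken from our dataset).
LINE = ('02/24/2022-11:35:12.000000  [**] [1:2100498:7] '
        'GPL ATTACK_RESPONSE id check returned root [**] '
        '[Classification: Potentially Bad Traffic] [Priority: 2] '
        '{TCP} 192.168.1.10:80 -> 10.0.0.5:51234')

PATTERN = re.compile(
    r'^(?P<date>\S+)\s+\[\*\*\]\s+\[(?P<suricata_id>[\d:]+)\]\s+'
    r'(?P<description>.*?)\s+\[\*\*\]\s+'
    r'\[Classification:\s*(?P<classification>[^\]]*)\]\s+\[Priority:\s*\d+\]\s+'
    r'\{(?P<protocol>\w+)\}\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)$')

record = PATTERN.match(LINE).groupdict()
print(record["protocol"], record["classification"])   # TCP Potentially Bad Traffic
```

Named groups keep the extracted attributes aligned with the field names used in the subsequent analysis.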
We consider as our training set the attributes {<Date>, <Source>, <Suricata_id>, <Flow_id>, <Input_IP_address>, <Destination_IP_address>, <Protocol>, <Description>, <Classification>}, along with their corresponding labels, which are included in the <Event> attribute. The <Event> attribute indicates the type of the instance.
Regarding the {<Classification>} attribute, we observe that there exist 52 distinct classification occurrences. Regarding the {<Event>} attribute, 6 distinct types are reported. However, we observed that several data records appear with either a corrupted or missing {<Event>} attribute. For this purpose, the main goal of this study is to classify the events with respect to the {<Event>} attribute, in order to also cover the cases in which data instances appear with missing information in this field. Consequently, during the testing phase, examples with similar attributes are presented, and the task is to determine the precise value of the <Event> category. The main goal is to identify the "alert" and "anomaly" cases, and to provide the precise <Event> type, since we are not confronting a binary classification problem. Figure 5 presents two characteristic examples of "alert" and "anomaly" network logs.
Figure 5. Characteristic logs that correspond to "alert" and "anomaly" <Event> types.
Additionally, in Figure 6 we demonstrate the distribution of the {<Event>} categories. We observe that the majority of cases among our approximately 700,000 network instances in total are marked as "alert", with a much smaller number of instances marked as "anomaly". However, since the number of occurrences among the various classes is not balanced, we should incorporate this remark into our models as well.
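One common way to incorporate this imbalance, sketched below with hypothetical per-class counts mirroring the skew of Figure 6, is to weight each class inversely to its frequency:

```python
# Hypothetical per-class counts (illustrative only): most of the ~700,000
# instances are "alert", far fewer are "anomaly" or the remaining events.
counts = {"alert": 650_000, "anomaly": 30_000, "dns": 12_000,
          "http": 5_000, "tls": 2_000, "flow": 1_000}

total, k = sum(counts.values()), len(counts)

# Balanced heuristic: weight inversely proportional to class frequency,
# so that rare classes contribute more to the loss during training.
weights = {c: total / (k * n) for c, n in counts.items()}
print(round(weights["alert"], 3), round(weights["flow"], 1))   # 0.179 116.7
```

These weights would then be passed to the loss during training so the frequent "alert" class does not dominate the gradient updates.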

Evaluation Metrics
The most significant indicator that quantitatively evaluates the performance of the proposed architectures is the Accuracy (AC) metric, defined as follows:

AC = (TP + TN) / (TP + TN + FP + FN),

where True Positives (TP) indicate the number of anomaly measurements that are identified as anomaly, False Positives (FP) denote the number of normal records that are identified as anomaly, True Negatives (TN) correspond to the number of normal records that are identified as normal, and False Negatives (FN) denote the number of anomaly measurements that are characterized as normal [60]. Additionally, we have selected the Area Under the Receiver Operating Characteristic (ROC) Curve (ROC-AUC) score as an evaluation metric to determine the degree of separability among the different categories, since it measures the classification performance of each model per class. The ROC curve illustrates the ratio between the True Positive Rate (TPR), i.e., TPR = TP / (TP + FN), and the False Positive Rate (FPR), i.e., FPR = FP / (FP + TN). Using this metric, we evaluate how the model behaves while distinguishing between the variant anomaly or normal statuses. ROC-AUC scores that are close to 1.0 indicate highly robust models that can perfectly discriminate the different classes. Moreover, we exploit the Precision, Recall, and F1-score metrics, defined as:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-score = 2 · (Precision · Recall) / (Precision + Recall).

A high score on the Precision metric indicates a lower False Positive Rate, i.e., fewer fault-free data incorrectly marked as faulty. On the other hand, a high score on the Recall metric demonstrates a low ratio of False Negatives, and thus prevents missed event detections. Finally, the F1-score provides the harmonic mean of Precision and Recall, capturing these two measures in a single metric.
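A short worked example of the metrics above, computed from arbitrary confusion counts:

```python
# Arbitrary illustrative confusion counts.
TP, FP, TN, FN = 80, 10, 100, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)                    # also the True Positive Rate (TPR)
fpr       = FP / (FP + TN)                    # False Positive Rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.857 0.889 0.8 0.842
```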
Finally, regarding the loss function, since we confront a multi-class classification problem, the categorical cross-entropy function [61] is exploited, defined as: L = − Σ_{k=1}^{K} y_true,k · log(y_prediction,k), where y_true stands for the ground truth and y_prediction for the predicted values, while K denotes the total number of classes.
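For a single one-hot encoded sample, the categorical cross-entropy above can be sketched in a few lines (illustrative only; the symbol names follow the definition above):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy: -sum_k y_true[k] * log(y_pred[k]).
    y_true is one-hot over K classes, y_pred a probability vector."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0, 0, 1, 0, 0, 0])                  # ground truth: class 3 of K=6
y_pred = np.array([0.05, 0.05, 0.70, 0.05, 0.05, 0.10])  # softmax output
print(categorical_cross_entropy(y_true, y_pred))       # -log(0.70) ≈ 0.3567
```

Because y_true is one-hot, only the predicted probability of the correct class contributes to the loss, which is why confident correct predictions drive the loss toward zero.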

Experimental Results
In the following paragraphs we demonstrate the evaluation results obtained using the proposed formulations: (i) the LSTM neural network architecture, and (ii) the 1D-CNN scheme, for the 6-<Event> category multi-class classification problem of the pfSense Suricata network logs. Regarding the dataset split, for both architectures we follow an 80-20% split ratio, considering 80% of our dataset for the training phase and the remaining 20% for the testing phase. In order to successfully validate the proposed deep learning architectures, we extract a small sample of our initial training set as the validation set, so as to provide an unbiased evaluation of the model fit on the training dataset while tuning the models' hyper-parameters. The validation split ratio used during the models' training process was also set to 0.2. Moreover, since we are dealing with real-world data that are highly imbalanced, we are significantly cautious about avoiding over-fitting scenarios. For this purpose, we set class weights for our imbalanced classes [62]. Using this strategy, the proposed models avoid over-fitting and are able to generalize better for all trained scenarios, even for classes that contain a limited number of training examples.
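One common way to derive such class weights is the "balanced" inverse-frequency heuristic, w_c = n_samples / (n_classes · count_c); a minimal sketch under that assumption, not necessarily the exact weighting used in this work:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical imbalanced Event labels
labels = ["alert"] * 8 + ["anomaly"] * 2
print(balanced_class_weights(labels))  # the rarer class receives the larger weight
```

Passing such a weight dictionary to the training loop scales each sample's loss by its class weight, so errors on rare classes such as "anomaly" are penalized more heavily.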
Regarding the LSTM neural network architecture, we include one embedding layer composed of 50 hidden units, followed by a Dropout layer, one LSTM layer with 50 hidden nodes, followed by another Dropout layer and two Dense layers with 10 and 6 hidden units, respectively. The final layer is a fully connected Dense layer, activated with the softmax function, and is responsible for the classification task. Specifically, the output layer (i.e., classification layer) assigns the corresponding probabilities to the output classes (i.e., the classification levels) that we investigate. The maximum value of these probabilities indicates the corresponding class. The hyper-parameter values were selected via a thorough validation approach, in which we chose the best-performing parameters for our architecture. The batch size was set to 512, while the number of training epochs was set to 50. We observe that within the interval of 50 epochs we achieve highly robust results, including convergence in terms of the reduction of the loss function, and in terms of achieving a high Accuracy ratio. Finally, the dropout ratio was set to 0.3 for all layers.
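To illustrate the recurrent computation inside such an LSTM layer, one forward step of a standard LSTM cell can be written as below; this is a generic sketch with the dimensions from the text, not the framework implementation used here:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: the four gates (input i, forget f, output o,
    candidate g) are stacked in W (input weights), U (recurrent
    weights), and b (biases)."""
    z = W @ x + U @ h + b                      # shape (4 * hidden,)
    hdim = h.shape[0]
    i = 1 / (1 + np.exp(-z[0 * hdim:1 * hdim]))  # input gate
    f = 1 / (1 + np.exp(-z[1 * hdim:2 * hdim]))  # forget gate
    o = 1 / (1 + np.exp(-z[2 * hdim:3 * hdim]))  # output gate
    g = np.tanh(z[3 * hdim:4 * hdim])            # candidate cell state
    c_new = f * c + i * g                        # update cell memory
    h_new = o * np.tanh(c_new)                   # emit hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
emb, hidden = 50, 50                        # embedding / hidden sizes from the text
W = rng.normal(size=(4 * hidden, emb)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x in rng.normal(size=(10, emb)):        # unroll over a 10-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)                              # final hidden state fed to the Dense layers
```

The final hidden state h is what the subsequent Dense layers consume to produce the 6-way softmax output.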
In Figure 7, we illustrate the LSTM neural network architecture's accuracy and loss function with respect to the number of epochs. The total accuracy achieved via the proposed method for the training and validation phases is 97.65% and 97.29%, respectively, while the total loss error is 0.0055 and 0.0069 for the training and validation phases, respectively. We observe that the extracted values for both the accuracy and the loss function illustrate a significantly robust architecture that can be generalized and tested successfully on new network Suricata logs with a similar structure. Regarding the testing phase, the proposed scheme achieves 97.27% accuracy and 0.0066 loss on the 20% of Suricata logs that remained for testing. Consequently, the proposed scheme is able to generalize with high accuracy on new/"unseen" network observations.

Regarding the 1D-CNN scheme, we consider one embedding layer composed of 50 hidden units, followed by a sequence of 1D-convolutional layers with filter size 5, batch size 256, and 32 filters. Specifically, the input data sequence passes through the first layer, which is an embedding layer composed of 50 hidden units. After the embedding layer, we consider the convolution operators, followed by the max-pooling operators that reduce the tensors' dimensions. The 1D-CNN constructs a tensor of shape (1,8,32), which is then passed to a max-pooling layer of filter size 2 and stride 2. The pooling layer reduces the tensor shape to (1,4,32). After the max-pooling layer, a spatial dropout layer is introduced with a 0.5 dropout ratio, followed by a flatten layer and two dense layers composed of 64 and 6 hidden nodes, respectively.
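The shape arithmetic of the convolution and pooling stages can be verified with a small numpy sketch; "valid" convolution and non-overlapping pooling are assumed, and the 12-token input length is hypothetical:

```python
import numpy as np

def conv1d_valid(x, kernels):
    """'Valid' 1D convolution: x has shape (length, in_ch),
    kernels has shape (n_filters, ksize, in_ch)."""
    n_filters, ksize, _ = kernels.shape
    out_len = x.shape[0] - ksize + 1
    out = np.empty((out_len, n_filters))
    for t in range(out_len):
        window = x[t:t + ksize]                    # (ksize, in_ch)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def maxpool1d(x, pool=2):
    """Non-overlapping max pooling (stride == pool size)."""
    out_len = x.shape[0] // pool
    return x[:out_len * pool].reshape(out_len, pool, -1).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 50))                           # 12 tokens, 50-dim embedding
feat = conv1d_valid(x, rng.normal(size=(32, 5, 50)))    # 32 filters of size 5
pooled = maxpool1d(feat, pool=2)
print(feat.shape, pooled.shape)                         # (8, 32) (4, 32)
```

Ignoring the leading batch dimension, these shapes reproduce the (8,32) convolution output and the (4,32) pooled tensor described above: 12 − 5 + 1 = 8 positions after the valid convolution, halved to 4 by the size-2 pooling.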
The dense layer with 64 hidden units is activated using the ReLU function, while the last layer, which is composed of 6 hidden nodes, is activated with the softmax function and is responsible for the classification task. The optimizer exploited in this scenario is Adam, with a learning rate of 0.001. Moreover, Figure 8 demonstrates the proposed 1D-CNN model's accuracy and loss function with respect to the number of epochs. The achieved accuracy for the training phase is 97.12%, while for the validation phase it is 96.34%. The total loss error is 0.0065 and 0.0069 for the training and validation phases, respectively. Regarding the testing phase, the proposed scheme achieves 97.24% accuracy and 0.0131 loss on the 20% of Suricata logs that were extracted out of the initial dataset for testing. Figure 9 illustrates the AUC-ROC curves constructed via the proposed LSTM and 1D-CNN architectures, concerning the classification among the different Event-type categories of pfSense Suricata logs. These figures illustrate the trade-off between the false positive and true positive rates for each separate class. Regarding the LSTM scheme, the observed per-class AUC scores are: DNS: 0.9997, Alert: 0.9982, Anomaly: 0.9979, HTTP: 0.9982, TLS: 0.9966, SNMP: 0.9999. Moreover, regarding the 1D-CNN formulation, the per-class AUC scores are: DNS: 0.9998, Alert: 0.9976, Anomaly: 0.9976, HTTP: 0.9979, TLS: 0.9969, SNMP: 0.9999. Consequently, we observe that both the LSTM and the 1D-CNN techniques provide high-quality estimations of each separate Event-type category. Comparing the two proposed schemes, we observe their high and robust performance on the pfSense multi-class classification/anomaly detection problem.
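For reference, the one-vs-rest ROC-AUC for a single class can be computed directly from the classifier scores via the Mann-Whitney rank statistic; a minimal sketch with hypothetical scores, not the evaluation code used here:

```python
def roc_auc(scores, labels):
    """ROC-AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive sample receives a higher score than a
    randomly chosen negative one (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical softmax scores for one class, one-vs-rest labels
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0]
print(roc_auc(scores, labels))
```

A score of 1.0 means every positive instance outranks every negative one, which is why the near-1.0 values above indicate almost perfect class separability.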
Additionally, Figure 10 demonstrates the normalized confusion matrices achieved via the proposed LSTM and 1D-CNN techniques for the 6-Event-type category classification of Suricata logs. We observe that both architectures provide a high detection rate among the different categories, regarding the models' predicted labels against the ground-truth labels. All the above examples demonstrate the high performance of our proposed deep learning formulations in representing real-world data and learning robust features from them, which significantly facilitates the problem of network anomaly detection, from the perspective of multi-class classification of pfSense Suricata logs. Furthermore, in order to validate the superiority of the proposed architectures, we compare our results with state-of-the-art classifiers, namely: (i) the traditional ML-based Random Forest classifier [63], and (ii) the Multi-Layer Perceptron (MLP) fully connected neural network. Concerning the Random Forest classifier, we selected 20 estimators, while the max-depth parameter was set to 12 in order to achieve proper classification results. Regarding the MLP multi-class classifier, we selected a four intermediate-layer architecture composed of 512, 256, 128, and 64 hidden nodes, respectively. The last layer, which is composed of 6 nodes, is activated with the softmax function and is responsible for the multi-class classification task of the different Event-type categories. In terms of classification accuracy on the validation set, we observe that the highest score for the pfSense Suricata logs multi-class Event-type classification is achieved by the LSTM scheme with an accuracy of 97.29%, followed by the 1D-CNN scheme with 96.34%, then by the MLP model with 87.14%, and finally by the Random Forest with 83.13%. Concluding, Table 1 depicts the Precision, Recall, and F1-score metrics achieved via the comparable schemes for the various <Event> classes of pfSense logs.
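A row-normalized confusion matrix of the kind shown in such figures divides each row by the number of true samples of that class, so the diagonal holds the per-class recall; a small sketch with hypothetical class indices:

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Rows: ground-truth class; columns: predicted class.
    Each row is divided by that class's total sample count."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.where(row_sums == 0, 1, row_sums)  # guard empty rows

y_true = [0, 0, 0, 1, 1, 2]   # hypothetical class indices (e.g., alert/anomaly/dns)
y_pred = [0, 1, 0, 1, 1, 2]
cm = normalized_confusion_matrix(y_true, y_pred, n_classes=3)
print(cm.diagonal())          # per-class recall on the diagonal
```

Row normalization makes the matrix readable under class imbalance: a rare class that is mostly misclassified shows a dim diagonal cell even though its absolute error count is small.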
We observe that our proposed multi-class classification schemes achieve highly accurate results for all classes, exceeding 90% in all metrics in every case, validating our claim that the specific schemes are highly effective in accurately predicting the Event type of pfSense Suricata network measurements.

Conclusions
This paper examines the performance of two efficient and robust deep neural network architectures, namely the LSTM and the 1D-CNN, for the challenging pfSense Suricata anomaly detection/multi-class classification problem. Both developed formulations were trained on multivariate time-series data, including significant text information, in order to classify between the different "Event" states of the pfSense Suricata log instances. The proposed techniques present high-quality results regarding the classification accuracy, the reduction of the loss function error, and the evaluation metrics, compared to state-of-the-art classifiers. The superiority of the proposed architectures lies in the fact that they present high detection accuracy on real-world data, and gives us the insight that, with slight data pre-processing steps and modifications, these schemes are able to tackle any kind of network anomaly/intrusion detection application.

Funding: This work was partially funded by the H2020 PHOENIX project, contract no. 832989, within the H2020 Framework Program of the European Commission.