1. Introduction
Optical burst switching (OBS) is an emerging network architecture that combines the advantages of circuit switching and packet switching. OBS is preferable because it demonstrates lower latency than optical packet switched (OPS) networks [1]. Therefore, OBS has become one of the fundamental technologies in optical IP networks [2]. It plays a crucial role in various environments, including software-defined networks (SDN), wireless sensor networks (WSN), and the Internet of Things (IoT) [3]. In IP networks, different types of cyber-attacks have emerged as significant threats, drawing researchers’ attention. These include cyber-attacks in long-range wide area networks (LoRaWAN) [4] and SDN [5]. Along with the 15.2% growth of the optical networks industry from 2013 to 2018 [6], cyber-attacks on optical networks are a growing concern, as these networks are critical infrastructure that supports many aspects of modern society, including finance, healthcare, transportation, and communication. According to the Thales Data Threat Report 2020, 26% of surveyed global organizations were breached in 2019 [6]. As such, it is important for organizations to implement robust cybersecurity measures to protect their optical networks from cyber threats.
Several cyber-attack countermeasures for optical networks have been proposed in the form of OBS network intrusion detection systems (IDSs), as the network is prone to flooding attacks, which lead to low bandwidth utilization, degraded network performance, denial of service (DoS), and high data loss rates. Because these attacks involve many rules and exceptions, machine learning approaches are a natural fit for detecting them. One effective method uses the deep convolution neural network (DCNN) model [7] to detect these attacks early on, with better reported performance than traditional models such as naive Bayes, SVM, and KNN given the limited number of samples in the dataset. Chawathe [8] discusses how OBS networks can be vulnerable to denial of service attacks due to the separation of control information from primary data, and proposes a solution using a monitoring method evaluated on a public dataset. Liu et al. [9] develop a combination of particle swarm optimization and support vector machine (PSO-SVM) to detect intrusion in OBS networks and use the UCI and NCENTs datasets to show that this model performs better than traditional machine learning models. It is an effective and efficient method for detecting burst header packet (BHP) flooding attacks in OBS networks. Furthermore, several other methods have been developed for BHP flooding attacks in OBS networks, including a decision tree (DT) with selected features by Almaslukh [10], ant colony optimization (ACO) with a support vector machine (SVM) by Seddik et al. [11], and a decision forest classifier (DFC) with flower search optimization (FSO) by Panda et al. [12]. In addition, our research group has conducted multiple studies on machine learning methods for cyber-attacks in different types of networks, including wireless sensor networks [13], websites [14], and the IoT [15].
Motivated by the works of Rajab et al. [16,17] on flooding attack prevention in OBS networks, the main contribution of this paper is the development of a third-order distance for k-nearest neighbors (KNN3O) for flooding attack detection. In addition, this paper compares different types of machine learning methods for flooding attack detection. The remainder of this paper is organized as follows: Section 2 presents the proposed method and the comparative methods. Experimental results are discussed in Section 3. Lastly, Section 4 concludes this paper.
2. Methodologies
This section describes the methods used in this paper. First, the vulnerabilities of OBS networks, including node hijacking and flooding attacks, are presented. Several machine learning methods to detect intrusion are described. The last sub-section discusses the experimental setup.
2.1. OBS Networks Vulnerability and Intrusion Detection
The OBS transmission approach allows data to be managed in the optical domain while the more complex control headers are processed electronically. The procedure involves taking in incoming client data traffic at the OBS network ingress (edge) node and constructing a data burst (DB). A BHP, which contains the DB packet information, such as the offset time, arrival time, burst length, etc., is then sent over a dedicated (out-of-band) wavelength division multiplexing (WDM) channel before the DB. The BHP is delivered before the DB with a particular time difference referred to as the offset time, which is used to configure the path for the core switches to process DBs and allocate the essential resources. At each intermediate node, the BHP must undergo an optical-electronic-optical (O-E-O) conversion and is electronically processed to reserve the resources required by the incoming data burst in the optical domain. OBS data bursts come in various lengths and include different types of traffic, such as optical packets, IP packets, and ATM cells. The edge node sends the data as bursts, which are taken apart at the receiving edge router. This is illustrated in Figure 1.
This study is centered around the BHP flooding attack, which belongs to the denial of service (DoS) attack class. This attack is designed to hinder (legitimate) regular BHP allocation of crucial resources in the intermediate core switch. Similar to the traditional DoS attack aimed at the TCP protocol, such as SYN flooding, which inundates a victim host with a vast number of SYN requests without finalizing the connection setup and prevents it from accepting genuine connection requests, the BHP flooding attack also involves flooding the network with fake BHPs to seize necessary resources and prevent legitimate BHPs from reserving them. This article aims to thwart this type of attack through the application of machine learning techniques.
In a similar vein, a BHP flooding attack takes place when a hijacked or attacker node overwhelms the network by sending numerous BHPs without corresponding DBs. As soon as the WDM channels are allocated by a core switch for incoming BHPs, these channels’ states change from vacant to busy. The process of fake BHPs attacking a core switch is illustrated by Figure 2. The core switch starts to allocate new WDM channels to each fake BHP. As a result, legitimate BHPs are unable to reserve the required intermediate core switch resources. When a regular DB is received without any available vacant WDM channels, the core switch discards the DB, and the allocated channels stay busy, awaiting unidentifiable bursts that may never arrive.
This paper uses the security model developed by Rajab et al. [17], as shown in Figure 3, to defend the OBS network from BHP flooding attacks by analyzing the behavior of each node to counter harmful BHPs that exploit network resources. The developed model has multiple benefits, including easy implementation through software modification and integration into existing core switch infrastructure. It can also be deployed gradually to improve security. The model uses a sliding range window to classify all ingress nodes as Blocked, Trusted, or Suspicious based on their observed performance. The node category changes over time, with a node classified as Suspicious when a predetermined number of corresponding DBs are not sent on time and as Blocked when the packet dropping rate increases. If there is a BHP flooding attack, the classifier adds compromised nodes to the blocked list. However, nodes can improve their state by increasing their throughput and decreasing their packet dropping rate. The Trusted window, corresponding to one second, is divided into 10 slots, whereas the Blocked and Suspicious windows have 20 slots to scrutinize the node behavior in detail. The core switches that put a node in the Blocked category will not pass on its BHPs, but the node can be removed from the Blocked class if it stops transmitting fake BHPs and starts transmitting legitimate DBs.
2.2. K-Nearest Neighbors and Its Enhancement
K-nearest neighbors (KNN) is a machine learning algorithm used for classification tasks [18]. It finds the $k$ closest data points in the feature space to a given query point and then classifies the query point based on the labels of its $k$ nearest neighbors.
Mathematically, the KNN algorithm can be described as follows: Let $\mathcal{X}$ be the feature space and $\mathcal{Y}$ be the label space. Given a query point $x_q \in \mathcal{X}$ representing the input data for the model, this algorithm finds the $k$ nearest data points $x_1, \ldots, x_k$ in $\mathcal{X}$ to $x_q$ based on the Euclidean distance [19] metric, which is given by:
$$d(x_q, x_i) = \lVert x_q - x_i \rVert$$
where $d(x_q, x_i)$ is the Euclidean distance between two points $x_q$ and $x_i$. The Euclidean norm $\lVert x \rVert$ of an $n$-dimensional point $x$ is given by:
$$\lVert x \rVert = \sqrt{\sum_{j=1}^{n} x_j^2}$$
where $x_j$ is the element of vector $x$ at dimension $j$. Once the $k$ nearest neighbors of $x_q$ are identified, the algorithm determines the class of $x_q$ based on the labels of its $k$ nearest neighbors. For classification tasks, KNN selects the class label with the highest frequency among the $k$ nearest neighbors.
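The neighbor search and majority vote described above can be sketched in a few lines of Python. This is only a minimal illustration: the experiments in this paper were run in Matlab, and the function names, the toy points, and the tie-breaking rule (the label seen first among the neighbors wins) are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    # Rank training indices by distance to the query and keep the k closest.
    order = sorted(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Majority vote; on a tie, the label seen first among the neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-feature data (hypothetical, not taken from the OBS dataset).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["NB", "NB", "NB", "Block", "Block", "Block"]

print(knn_predict(train_x, train_y, (0.15, 0.15), k=3))
print(knn_predict(train_x, train_y, (0.95, 1.00), k=3))
```

Passing the distance as a parameter keeps the classifier unchanged when a different metric is substituted for the Euclidean one.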
Alternatively, a combination of KNN with cosine distance (KNNC) is commonly used. Cosine distance [20] is given by:
$$d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
where the dot operator $\cdot$ denotes the inner product.
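As a small illustration, the cosine distance can be computed as one minus the cosine similarity of the two vectors, which is the commonly used definition and is assumed here. The function name and the handling of zero vectors (raising an error) are choices made for this sketch only.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length sequences."""
    dot = sum(ai * bi for ai, bi in zip(a, b))       # inner product
    norm_a = math.sqrt(sum(ai * ai for ai in a))     # Euclidean norm of a
    norm_b = math.sqrt(sum(bi * bi for bi in b))     # Euclidean norm of b
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance((1.0, 0.0), (0.0, 1.0)))   # orthogonal vectors
print(cosine_distance((1.0, 2.0), (2.0, 4.0)))   # same direction, different length
```

Because the measure depends only on direction, two points with the same feature proportions but different magnitudes are treated as nearly identical, unlike under the Euclidean distance.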
This paper proposes a new class of distance using a third-order exponent. The proposed combination of KNN with the third-order distance (KNN3O) is expected to increase the classification model’s accuracy. Mathematically, the third-order distance function $d_3$ can be written as:
$$d_3(x, y) = \sum_{i=1}^{n} |x_i - y_i|^3$$
where $x$ and $y$ are two n-dimensional vectors representing two data points, and $|\cdot|$ denotes the absolute value.
Compared to the Euclidean distance, the $d_3$ function is more sensitive to differences between values because the third power amplifies the difference between them. This means that the differences in a single dimension can have a large impact on the overall distance between the two points.
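The effect of the third power on neighbor ranking can be illustrated with a short Python sketch. It assumes that the third-order distance is the sum of the cubed absolute coordinate differences; this form is an assumption inferred from the description above, and the function names and sample points are hypothetical.

```python
import math

def euclidean(a, b):
    """Standard Euclidean distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def third_order_distance(a, b):
    """Assumed form: sum of cubed absolute coordinate differences."""
    return sum(abs(ai - bi) ** 3 for ai, bi in zip(a, b))

query = (0.0, 0.0)
spread = (2.0, 2.0)   # moderate difference in both dimensions
spiky = (2.7, 0.0)    # large difference concentrated in one dimension

for name, dist in (("euclidean", euclidean), ("third-order", third_order_distance)):
    print(name, dist(query, spread), dist(query, spiky))
```

Under the Euclidean distance the point with the single large deviation is the nearer one, whereas under the assumed third-order distance the ranking is reversed, because the cubed term penalizes the large single-dimension difference more heavily.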
Theorem 1. The third-order distance function $d_3$ is more sensitive to differences than the Euclidean distance.
Proof. Let us consider the two-dimensional case, where $x$ and $y$ are two vectors with elements $x_1$, $x_2$ and $y_1$, $y_2$, respectively. Then, the Euclidean distance between $x$ and $y$ is given by:
$$d_E(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$
The $d_3$ function between $x$ and $y$ is given by:
$$d_3(x, y) = |x_1 - y_1|^3 + |x_2 - y_2|^3$$
The partial derivative of $d_E$ with respect to the difference between the first coordinates is given by:
$$\frac{\partial d_E}{\partial (x_1 - y_1)} = \frac{x_1 - y_1}{\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}}$$
Whereas the counterpart of $d_3$ is given by
$$\frac{\partial d_3}{\partial (x_1 - y_1)} = 3 (x_1 - y_1)^2 \, \mathrm{sign}(x_1 - y_1)$$
where “sign” denotes the sign function.
It can be noticed that the partial derivative of $d_3$ with respect to $x_1 - y_1$ is larger in magnitude than the partial derivative of $d_E$ with respect to the same quantity once the coordinate difference is sufficiently large; near the point where $x_1 = y_1$, both derivatives are close to zero. This means that the $d_3$ function is more sensitive to differences between the first coordinates than the Euclidean distance. □
Figure 4 shows the comparison between the two partial derivatives, where the partial derivative of $d_3$ shows more significant values than that of the standard distance. This means that the $d_3$ function can better distinguish between points that are close together in Euclidean space, which can be beneficial for KNN classification.
2.3. Multi-Layer Perceptron for Classification
Multi-layer perceptron (MLP) [21] is a feedforward neural network that consists of an input layer, single or multiple hidden layers, and an output layer. The input layer takes the input data, and each neuron in the input layer represents one feature of the input. The hidden layers perform non-linear transformations on the input, and the output layer produces the final output.
Activation functions introduce non-linearity in the neural network. In this study, the hidden layers used the rectified linear unit (ReLU) activation function, which is defined as:
$$f(x) = \max(0, x)$$
The output layer of MLPs used for classification problems usually employs the softmax activation function, which converts the output of each neuron in the output layer into a probability. The softmax function is defined as:
$$\mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$
where $z_j$ is the input to the jth neuron in the output layer and $K$ is the number of output neurons.
The cost function measures the difference between the predicted output and the true value. In this study, cross-entropy is used as the cost function:
$$J = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \log \hat{y}_{nk}$$
where N is the total number of samples, $y_{nk}$ is a one-hot encoding of the true label of the n-th sample and class k, and $\hat{y}_{nk}$ is the predicted probability for the n-th sample and class k.
This paper uses the limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS)-based backpropagation algorithm to train the network. The error is propagated back through the network, and the weights are updated to minimize the cost function. For a single sample, the derivative of the cost function with respect to the input $z_j$ of the output layer is:
$$\frac{\partial J}{\partial z_j} = \hat{y}_j - y_j$$
where $y_j$ is the true label for the jth neuron in the output layer, and $\hat{y}_j$ is the predicted probability for the jth neuron in the output layer.
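The relationship between the softmax output, the cross-entropy cost, and its gradient can be checked numerically. The sketch below assumes the standard result that, for one sample, the gradient of the cross-entropy with respect to the output-layer inputs equals the predicted probabilities minus the one-hot label, and it compares that expression with a central finite difference. All function names and the example values are illustrative.

```python
import math

def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def softmax(z):
    """Softmax over a list of output-layer inputs."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y_onehot):
    """Cross-entropy of one sample, given output-layer inputs z."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, softmax(z)))

def analytic_gradient(z, y_onehot):
    """Predicted probability minus true label, per output neuron."""
    return [p - y for p, y in zip(softmax(z), y_onehot)]

def numeric_gradient(z, y_onehot, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for j in range(len(z)):
        up, down = list(z), list(z)
        up[j] += eps
        down[j] -= eps
        grad.append((cross_entropy(up, y_onehot) - cross_entropy(down, y_onehot)) / (2 * eps))
    return grad

z = [1.5, -0.3, 0.2, 0.8]        # hypothetical inputs to four output neurons
y = [0.0, 0.0, 1.0, 0.0]         # one-hot label for the third class
print(analytic_gradient(z, y))
print(numeric_gradient(z, y))
```

The two printed gradients should agree to several decimal places, which is a convenient sanity check when implementing backpropagation by hand.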
2.4. Support Vector Machine
This paper uses a support vector machine (SVM) [22] with error-correcting output codes (ECOC) for multiclass classification. Given a training set of $N$ samples, each with $d$ features and $K$ possible classes, the goal of multiclass classification is to learn a function $f$ that maps each input sample $x$ to one of the $K$ possible classes.
To convert the $K$-class problem into a set of binary classification problems, a coding matrix $M$ is first created, which is a $K \times t$ binary matrix. Each column of $M$ corresponds to a binary classifier and each row corresponds to a class. The entries in each column indicate which classes are included in the positive class for that binary classifier. The number of columns, $t$, is determined by the desired trade-off between the number of classifiers and accuracy.
The binary classifiers are trained independently, one for each column of $M$. The binary classifier for column $j$ is trained to distinguish between the positive class defined by column $j$ of $M$ and all the other classes. The training data for each binary classifier consist of the original data set, but with the labels modified according to column $j$ of $M$.
To train each binary classifier, the following SVM optimization problem must be solved:
$$\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y_i (w^{\top} x_i + b)\right)$$
where $w$ is the weight vector for the binary classifier, $b$ is the bias term, $x_i$ is the $i$th sample, $y_i$ is the corresponding binary classifier label for the $i$th sample, and $C$ is a regularization parameter. The objective is to obtain the optimal hyperplane that maximizes the separation between the two classes. The hinge loss term imposes a penalty when a sample is misclassified or falls inside the margin.
The interior-point method can be used to solve this optimization problem by first converting it into a form that can be solved using an unconstrained optimization algorithm. This is achieved by introducing a logarithmic barrier function that penalizes solutions that violate the constraints of the problem. The barrier function is defined as:
where the second term ensures that x is positive, to prevent the logarithm from being undefined.
The interior-point method then minimizes the objective function subject to the barrier function by solving the following problem:
where μ is a positive parameter that controls the trade-off between the objective function and the barrier function.
The interior-point method solves this problem iteratively by first choosing an initial feasible point, and then solving a sequence of barrier problems with increasing values of μ. At each iteration, the interior-point method computes the gradient and Hessian of the objective function subject to the barrier function, and then updates the decision variables (w, b) using a Newton-like method. The interior-point method also uses a step-size parameter to control the rate at which the decision variables move towards the boundary of the feasible region.
The interior-point method continues iterating until the solution converges to a point that satisfies the constraints within a predefined tolerance. Once a solution is found, the decision variables (w, b) can be used to define the hyperplane that divides the data points into two classes with a maximum margin. The support vectors are the data points that lie on the margin or violate the margin constraint, and their corresponding Lagrange multipliers can be used to calculate their importance in defining the hyperplane.
To classify a new input sample $x$, each binary classifier produces a score indicating the confidence that $x$ belongs to the positive class for that classifier. The scores for all classifiers are concatenated to form a vector $s$ of length $t$. A decoding matrix $D$ is used to map the vector $s$ back to a $K$-dimensional output vector. Each row of $D$ corresponds to a class label, and the entries in each row indicate which binary classifiers vote for that class. The output class for $x$ is the class corresponding to the row of $D$ with the highest score. By combining binary SVM classifiers with a coding matrix and a decoding matrix, SVMs can be extended to handle problems with multiple classes while still maintaining high accuracy. The error-correcting technique improves the robustness of the classifier by compensating for errors made by individual binary classifiers.
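The decoding step can be sketched as follows. The example assumes a code with entries +1 and -1 in which rows are classes and columns are binary classifiers, and it decodes by picking the class whose code word agrees most with the classifier scores (the inner product of the row and the score vector). The seven-column code for four classes and the score values are hypothetical and are not the coding design used in the experiments.

```python
# Hypothetical exhaustive code for 4 classes: rows are classes, columns are
# binary classifiers; +1 marks the positive class, -1 the negative class.
CODING_MATRIX = [
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, -1, -1, -1, +1, +1, +1],
    [-1, -1, +1, +1, -1, -1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
]

def ecoc_decode(coding_matrix, scores):
    """Return the index of the class whose code word best agrees with the scores."""
    agreement = [sum(m * s for m, s in zip(row, scores)) for row in coding_matrix]
    return max(range(len(agreement)), key=agreement.__getitem__)

# Scores for a sample of class 2 in which the first classifier is wrong:
# it outputs a positive score although the code word for class 2 has -1 there.
scores = [0.4, -0.9, 0.8, 0.7, -0.6, -0.8, 0.9]
print(ecoc_decode(CODING_MATRIX, scores))
```

Any two rows of this code differ in four positions, so a single mistaken binary classifier still leaves the correct row as the best match, which is the error-correcting behavior described above.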
2.5. Naive Bayes Classifier
The naive Bayes classifier (NBC) [23] for multiclass classification is an extension of the binary classifier. Let X be a d-dimensional feature vector and y be a class label taking one of K possible values. The goal is to predict the class label y given the feature vector X. The NBC assumes that the features $X_1, \ldots, X_d$ are conditionally independent given the class label y.
Mathematically, the NBC computes the posterior probability of the class label y given the feature vector X as:
$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$
where $P(X \mid y)$ is the probability of observing the feature vector $X$ given the class label $y$, $P(y)$ denotes the prior probability of the class label y, and $P(X)$ is the marginal likelihood of the feature vector $X$.
The NBC estimates the likelihood and prior probabilities from the training data. For a given class label y, the likelihood is modeled as a multivariate Gaussian distribution:
$$P(X \mid y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2} (X - \mu_y)^{\top} \Sigma_y^{-1} (X - \mu_y)\right)$$
where $\mu_y$ is the mean vector and $\Sigma_y$ is the covariance matrix of the training samples with class label $y$. Under the conditional independence assumption, $\Sigma_y$ is diagonal.
The prior probability of the class label y is estimated as the frequency of the class label in the training data:
$$P(y) = \frac{N_y}{N}$$
where $N_y$ is the number of training samples with class label y and N is the total number of training samples.
To classify a new feature vector X, the NBC computes the posterior probability for each class label y and assigns the label with the highest probability:
$$\hat{y} = \arg\max_{y} P(y \mid X)$$
where $\hat{y}$ is the predicted class label for X.
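A compact Python sketch of this procedure is given below. It assumes a diagonal covariance (one variance per feature and class, which follows from the independence assumption), works with log-probabilities to avoid underflow, and adds a small variance floor for numerical safety. The floor value, the function names, and the toy samples are assumptions of the example rather than settings from the experiments.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels, var_floor=1e-9):
    """Estimate log-prior, per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[y] = (math.log(n / len(samples)), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior (up to a constant)."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=lambda y: log_posterior(model[y]))

# Toy two-feature samples (hypothetical, not from the OBS dataset).
samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.4), (2.8, 0.6)]
labels = ["NB", "NB", "NB", "Block", "Block", "Block"]
model = fit_gaussian_nb(samples, labels)
print(predict_gaussian_nb(model, (1.1, 2.0)))
print(predict_gaussian_nb(model, (3.1, 0.5)))
```

The marginal likelihood of the feature vector is omitted because it is the same for every class and therefore does not change which class attains the maximum.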
2.6. Decision Tree
This paper also uses a decision tree (DT) [24] model to perform the classification. The mathematical representation of DT consists of training data (X) and a class label (Y). The training data is a matrix of size N-by-P, where N, P, and Y denote the number of observations, the number of predictor variables, and a vector of true class labels of size N-by-1, respectively.
The tree is constructed by recursively partitioning the data into subsets based on the predictor variables, using a splitting criterion based on the Gini index. The DT is grown until the stopping criteria, such as a minimum number of observations per leaf or a maximum tree depth, are met.
The Gini index is a measure of impurity used in decision tree learning. It is based on the concept of Gini impurity, which is a measure of the probability of misclassification.
The Gini index for a binary split is defined as:
$$G = 1 - p_1^2 - p_2^2$$
where $p_1$ is the proportion of samples in the left child node that belong to the first class, and $p_2$ is the proportion of samples in the left child node that belong to the second class.
The Gini index for a multiclass split is defined as:
$$G = 1 - \sum_{i=1}^{K} p_i^2$$
where $p_i$ is the proportion of samples in the ith class in the left child node.
The best split is the one that minimizes the impurity in the resulting child nodes. The decision tree algorithm evaluates all possible splits and selects the one that results in the greatest reduction in the Gini index. This process is repeated recursively until a stopping criterion is met, such as reaching a maximum tree depth or a minimum number of samples in a leaf node.
The output of DT is a trained model which is a binary tree structure consisting of decision nodes and leaf nodes. Each decision node specifies a test on one of the predictor variables, and each leaf node assigns a class label to the observations that reach it based on the majority class of the training samples in that node. Once the DT model is trained, it can be used to predict the class labels of new observations.
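The split selection can be illustrated for a single numeric predictor. The sketch below computes the Gini impurity of a set of labels as one minus the sum of squared class proportions and searches the midpoints between consecutive distinct values for the threshold with the lowest size-weighted impurity of the two children. The function names and the toy data are illustrative, and a full tree would repeat this search over every predictor and recurse on each child.

```python
from collections import Counter

def gini_impurity(labels):
    """One minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    n = len(values)
    best = (None, gini_impurity(labels))  # no split: impurity of the parent node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

print(gini_impurity(["NB", "NB", "Block", "Block"]))
print(best_threshold([1, 2, 3, 4], ["NB", "NB", "Block", "Block"]))
```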
2.7. Discriminant Analysis Classifier
The discriminant analysis classifier (DAC) model [25] covers two related methods. Linear discriminant analysis is a linear classification method that assumes that the predictors have a multivariate normal distribution and that the class covariances are equal. Quadratic discriminant analysis is a similar method but does not assume equal class covariances.
Mathematically, given a set of input data X of size $N \times p$, where N is the number of observations and p is the number of predictors, and a corresponding response variable Y of size $N \times 1$, where Y contains the categorical labels for each observation, DAC finds the linear or quadratic discriminant function that best classifies the observations into K classes.
For the quadratic case, the discriminant function for each class $k$ is defined as:
$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
where $\mu_k$ is the mean vector for class $k$, $\Sigma_k$ is the covariance matrix for class $k$, $\pi_k$ is the prior probability of class $k$, and $|\Sigma_k|$ is the determinant of the covariance matrix for class k.
This paper uses pseudo quadratic, which estimates the coefficients of the quadratic discriminant function by maximizing the likelihood function of the data, given the parameters of the model. The likelihood function is a measure of how well the model fits the data, and the maximum likelihood estimates of the model parameters are those that maximize the likelihood function.
The use of pseudo quadratic can improve the performance of the discriminant analysis model when the assumptions of separate covariance matrices are violated. This method adjusts the covariance matrix of the predictor variables to have a more quadratic form, which can better capture the non-linear relationships among the predictor variables and improve the accuracy of the classification model.
2.8. Experiment Setup
This section provides details on the dataset and the experimental configuration used to evaluate the proposed method. To evaluate the performance of the classifier, we use the dataset provided by [9], which was generated with a modified NCTUns network simulator. The simulation topology in Figure 5 included a single legitimate sender (1), a single receiver (14), a single attacker (13), eight core switches (3–10), two ingress edge routers (2 and 11), and a single egress edge router (12). The attacker node was placed near the receiver to highlight its impact and increase the probability of detection. Only one attacker and one legitimate ingress node were used in the experiments, as the focus was on testing the classifier against BHP flooding attacks. In the original simulation, the eight core switches serve to simulate the complexities of an OBS network, and the attacker node can be attached to any core switch. However, the attacker node is placed near the receiver to expose its flooding effect rather than core network congestion effects. Ten trace files with increasing User Datagram Protocol (UDP) traffic load rates were created for the legitimate sender’s traffic, starting at 0.1 Gbps and increasing by 0.1 Gbps up to 1 Gbps. For each legitimate traffic load rate, the network is tested with three different attack schemes, namely lightweight, medium, and powerful, corresponding to attack traffic load rates of 0.2 Gbps, 0.5 Gbps, and 1 Gbps, respectively. The simulation parameters are summarized in Table 1. All machine learning methods are implemented using Matlab 2023a.
The simulation produces 1075 samples, with each sample having the attributes given in Table 2.
The original dataset contains 21 input attributes. However, in our experiments, the 20th attribute, which contains nodes’ initial classification labels, is removed from the model inputs to allow the model to learn only from numerical input. The 21st attribute, representing the percentage of flood per node, is also removed to increase the detection difficulty. In addition to that, the target class label is one of four classes of nodes, namely No Block (NB), NB-Wait, NB-No Block, and Block.
All the methods in this study used a 5-fold cross-validation approach (k = 5), where in each fold 80% of the data was used for training and 20% was used for testing. This approach allowed the performance of each method to be evaluated on multiple train/test splits, which helped to ensure the reliability of the results. By dividing the data into training and testing sets, the models were able to learn from the training data, and their ability to generalize was measured on the testing data. This also allowed for the identification of any overfitting or underfitting issues that could affect the accuracy of the models.
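The fold construction can be sketched in Python as follows. The shuffling seed and the round-robin assignment of shuffled indices to folds are assumptions of this example; the partitioning routine actually used in the Matlab experiments is not specified here. With the 1075 samples of the dataset, every fold holds 215 test samples (20%) and 860 training samples (80%).

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(1075, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Each sample appears in exactly one test fold, so the five test sets together cover the whole dataset without overlap.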
This paper uses accuracy, precision, recall, F1, and specificity, which are commonly used metrics to evaluate the performance of classification models. Accuracy represents the ratio of correctly classified instances among all samples. Mathematically, accuracy is defined as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where true positive (TP) and true negative (TN) represent the number of samples correctly classified as positive and negative, respectively. Conversely, false positive (FP) and false negative (FN) are the number of instances incorrectly classified as positive and negative, respectively.
Precision is the ratio of correctly classified positive instances among all samples classified as positive. Mathematically, precision is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall or TP rate is the proportion of correctly classified positive instances among all actual positive samples. Mathematically, recall is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1 score is the harmonic mean of recall and precision, which provides a single value that balances recall and precision. Mathematically, F1 score is defined as:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Specificity is a statistical measure that describes how well a binary classifier can identify true negative cases, or the ratio of actual negatives that are correctly identified by the classifier. It is calculated as the proportion of TN predictions over the sum of TN and FP predictions, expressed as:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
In other words, specificity tells us how good the classifier is at avoiding false positives or how often it correctly identifies cases that are negative.
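For a multiclass problem, these quantities can be derived from a confusion matrix by treating one class as positive and all remaining classes as negative. The sketch below follows that one-vs-rest convention, which is an assumption of this example, since the averaging scheme used to aggregate the per-class values is not restated here. The function name and the three-class confusion matrix are hypothetical.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class index cls.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                    # true cls, predicted otherwise
    fp = sum(row[cls] for row in confusion) - tp     # other class, predicted cls
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity}

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [50, 2, 0],
    [3, 45, 2],
    [0, 1, 47],
]
print(per_class_metrics(confusion, 0))
```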
3. Results and Discussion
This section provides the results of the simulation of all methods. First, Table 3 contains a summary of all the experiments. A detailed sample confusion matrix of each method is then provided to show the performance of the method.
Table 3 summarizes the results, with all methods benchmarked in relation to accuracy, precision, recall, F1 score, and specificity for all trials (k = 1…5), along with the mean. The discussion focuses on comparing the highest and lowest scores for each metric, as well as the average performance and the variation among the methods.
The highest accuracy score achieved by KNN3O is impressive at 0.993, indicating that it correctly classified 99.3% of the test data. On the other hand, DAC’s accuracy score of 0.54024 is considerably lower, suggesting that it only correctly classified 54% of the test data. This reveals a significant disparity in performance between the two methods. The precision scores further highlight KNN3O’s superiority, with a precision score of 0.993, indicating high accuracy in classifying positive examples. In contrast, DAC, MLP, and NBC precision scores below 0.8 show that they are imprecise and prone to errors.
Similarly, KNN3O dominates the recall scores with a high score of 0.993, indicating its sensitivity and accuracy in classifying positive examples. DAC, on the other hand, has the lowest recall score of 0.66826. The F1-score, which balances precision and recall, shows KNN3O’s strength with a score of 0.993, indicating a good balance between the two. DAC’s F1-score of 0.588076 suggests a poor balance between precision and recall. Regarding specificity, KNN3O stands out with the highest score of 0.99666, indicating its reliability in correctly classifying negative examples. On the contrary, DAC’s low specificity score of 0.843448 indicates its proneness to false positives.
Based on the analysis of all the metrics, KNN3O emerges as the best method for this problem, exhibiting the highest scores in all five metrics. It is closely followed by KNN with the standard Euclidean distance and KNNC. The proposed KNN3O achieved 100% accuracy in four trials and 96.5% in a single trial, resulting in an average accuracy of 99.3%. In comparison, KNN and KNNC achieved 100% accuracy in only three trials, with average accuracies of 99% and 98.7%, respectively. This indicates that the proposed distance function effectively enhances the performance of the standard KNN. DAC is identified as the worst-performing method. Other methods like NBC, MLP, SVM, and DT fall somewhere in between.
Detailed confusion matrices for all methods at k = 1 are shown in Figure 6. It can be noticed that the proposed distance combined with the KNN (KNN3O) perfectly detects the attacks and classifies the other types of traffic.