Article

Using Generalized Entropies and OC-SVM with Mahalanobis Kernel for Detection and Classification of Anomalies in Network Traffic

by Jayro Santiago-Paz *, Deni Torres-Roman, Angel Figueroa-Ypiña and Jesus Argaez-Xool
CINVESTAV, Campus Guadalajara, Av. del Bosque 1145, Col. El Bajio, Zapopan 45019, Mexico
* Author to whom correspondence should be addressed.
Entropy 2015, 17(9), 6239-6257; https://doi.org/10.3390/e17096239
Submission received: 8 May 2015 / Revised: 20 August 2015 / Accepted: 2 September 2015 / Published: 8 September 2015

Abstract

Network anomaly detection and classification is an important open issue in network security. Several approaches and systems based on different mathematical tools have been studied and developed, among them the Anomaly-Network Intrusion Detection System (A-NIDS), which monitors network traffic and compares it against an established baseline of a “normal” traffic profile. It is therefore necessary to characterize “normal” Internet traffic. This paper presents an approach for anomaly detection and classification based on the Shannon, Rényi and Tsallis entropies of selected features, and on the construction of regions from entropy data employing the Mahalanobis distance (MD) and the One Class Support Vector Machine (OC-SVM) with different kernels (Radial Basis Function (RBF) and Mahalanobis Kernel (MK)) for “normal” and abnormal traffic. Regular and non-regular regions built from “normal” traffic profiles allow anomaly detection, while classification is performed under the assumption that the regions corresponding to the attack classes have been previously characterized. Although this approach allows the use of as many features as required, only four well-known significant features were selected in our case. In order to evaluate our approach, two different data sets were used: one set of real traffic obtained from an Academic Local Area Network (LAN), and a subset of the 1998 MIT-DARPA set. For these data sets, a true positive rate of up to 99.35%, a true negative rate of up to 99.83% and a false negative rate of about 0.16% were obtained. Experimental results show that certain q-values of the generalized entropies and the use of OC-SVM with the RBF kernel improve the detection rate in the detection stage, while the novel inclusion of the MK kernel in OC-SVM and the k-temporal nearest neighbors algorithm improve accuracy in classification. In addition, the results show that, using the Box-Cox transformation, the Mahalanobis distance yielded high detection rates with an efficient computation time, while OC-SVM achieved slightly higher detection rates but is more computationally expensive.


1. Introduction

The detection and prevention of attacks and malicious activities have led to the development of technologies and devices designed to provide a certain degree of security. Among the first technologies for countering attacks launched against computer networks were Network Intrusion Detection Systems (NIDS). NIDS are classified into two groups: Signature-NIDS, which use a database of attack signatures, and Anomaly-NIDS (A-NIDS), which classify traffic as normal or abnormal in order to decide whether an attack has occurred.
A-NIDS, also known in the literature as behavior-based systems, use a model of normal inputs in order to detect security events. They try to establish a “normal profile”, or anomaly-free profile, of system or network behavior using network features or variables, e.g., source and destination IP addresses and ports, packet size, number of flows, and number of packets.
For anomaly detection [1], traffic variables can be employed directly or through functions of these variables, e.g., the entropy. Entropy-based approaches for anomaly detection are appealing, since they provide more information about the structure of anomalies than traditional traffic volume analysis [2]. Entropy is used to capture the degree of dispersal or concentration of the distributions of different traffic features [3,4]. The attractiveness of entropy metrics stems from their ability to condense an entire feature distribution into a single number while retaining important information about the overall state of the distribution. A sequence of packets from network traffic is captured, network features are selected, and the entropy of each of these features is calculated. Anomaly detection is then performed on the estimated entropy values: a profile of “normal” traffic is generated, and data that deviate from this profile are considered anomalies. In [5], starting from an entropy matrix $H$ of normal traffic without outlier filtering, an ellipsoidal region based on the Mahalanobis distance was defined.
An improvement to [5] was proposed in [6], where the algorithm uses the Mahalanobis distance to exclude outliers, and an ellipsoidal region is generated by calculating the parameters $\{\bar{x}, \gamma, \lambda, L_T\}$, where $\bar{x}$ is the mean vector of the $H$ matrix, $\gamma$ and $\lambda$ are the eigenvectors and eigenvalues of the covariance matrix of $H$, and $L_T$ is the limit of the Mahalanobis distance for $H$ [7]. In both works, network traffic behavior was characterized by regular ellipsoidal regions.
This paper proposes defining non-regular regions from training traces, i.e., “normal” traffic, through OC-SVM, which contains parameters that adjust the region to the training traces. Figure 1 shows different defined regions for the case of two variables. In other works (see [8,9]), the RBF kernel was used. However, this work proposes using the Mahalanobis kernel, which in general showed higher accuracy in classification than other methods.
Figure 1. Different regions based on different methods and metrics.
This paper is organized as follows: Section 2 gives an overview of related work in the area of network anomaly detection. Section 3 introduces the mathematical background, including different entropy estimators, distance metrics, and OC-SVM. Section 4 states the problem and the proposed methods associated with the definition of a region in the space $\mathbb{R}^p$ that characterizes the entropy behavior of the $p$ intrinsic variables associated with the traces. Section 5 presents the experiments carried out to define regions and to detect and classify anomalies employing two different types of data sets. Section 6 presents a discussion of the experimental results. Finally, Section 7 outlines the conclusions.

2. Related Work

Works dedicated to anomaly detection systems employ different features and entropy as a measure of dispersion, uncertainty, or randomness in order to detect changes in network traffic, which allows anomaly detection. Wagner et al. [3] justify the use of entropy, saying, “The connection between entropy and worm propagation is that worm traffic is more uniform or structured than normal traffic in some respects and more random in others.” Xu et al. [4] propose a method based on the construction of a 3-dimensional feature space by reporting the contents of Shannon entropy of four intrinsic characteristics of the traffic (source and destination IP address, source and destination ports) as a mechanism for detecting intrusions. Nychis et al. [10] consider two types of distribution based on flow-header and behavioral features. They concluded that the port and address distributions are strongly correlated, both in their entropy time series and in their detection capabilities.
Some authors [11,12,13,14] have utilized generalized entropies (Tsallis and Rényi entropy), showing advantages over Shannon entropy by adapting the $q$ parameter in order to improve the detection of anomalies. Ziviani et al. [11] investigated Tsallis entropy in the context of DoS attack detection and found empirically that a value of $q$ around 0.9 provides a high detection rate for this attack. On the other hand, Tellenbach et al. [12] utilized the set $q \in \{-3\} \cup \{-2, -1.75, \ldots, 1.75, 2\}$ in order to detect DDoS and scanning attacks. Ma et al. [13] used Tsallis entropy and the Lyapunov exponent, with chaotic analysis of the entropy of source and destination IPs, to detect DDoS attacks, employing $q = 1.1$. Bhuyan et al. [14] used generalized entropy to describe characteristics of network traffic data and as an appropriate metric to facilitate building an effective model for detecting both low-rate and high-rate DDoS attacks, for $q \in \{1, 2, 3, \ldots, 15\}$.
At the classification stage, different techniques are used. In [5], the authors detected anomalies using regular regions obtained from “normal” network traffic through Mahalanobis distance (i.e., hyper-ellipsoids). In [8], Li et al. proposed the OC-SVM method for construction of non-regular regions, using the RBF kernel, and considering that “the normal data set is much larger than the abnormal.” Zhang et al. [9] detected anomalies using the OC-SVM detector with RBF kernel.
Defining regular and non-regular regions in the feature space in order to detect and classify anomalies in network traffic using entropy, Mahalanobis distance, and OC-SVM, this paper proposes:
  • the use of Mahalanobis distance for construction of decision regions,
  • the novel inclusion of the MK in OC-SVM to improve classification with respect to the RBF kernel,
  • the refinement of classification via the k-nn algorithm in the temporal sense.
In addition, the Box-Cox transformation was used to transform non-Gaussian distributed data into data with an approximately Gaussian distribution, thereby fulfilling the Gaussianity requirement of the Mahalanobis distance.

3. Mathematical Background

3.1. Entropy Estimators

Let $X$ be a random variable (r.v.) that takes values in the set $\{x_1, x_2, \ldots, x_M\}$, let $p_i := P(X = x_i)$ be the probability of occurrence of $x_i$, and let $M$ be the cardinality of the finite set; the Shannon entropy is then:
$$\hat{H}_S(P) = -\sum_{i=1}^{M} p_i \log p_i. \tag{1}$$
Based on the Shannon entropy [15], Rényi [16] and Tsallis [17] defined generalized entropies, which are related to the q-deformed algebra:
$$\hat{H}_R(P, q) = \frac{1}{1-q} \log \sum_{i=1}^{M} p_i^{q} \tag{2}$$
and
$$\hat{H}_T(P, q) = \frac{1}{q-1} \left( 1 - \sum_{i=1}^{M} p_i^{q} \right), \tag{3}$$
where $P$ is a probability distribution. When $q \to 1$, the generalized entropies reduce to the Shannon entropy.
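As a brief check of the limiting statement above, the following standard derivation (a sketch added for illustration, not taken from the original paper) applies L'Hôpital's rule in $q$ and uses $\sum_i p_i = 1$:

$$\lim_{q \to 1} \hat{H}_T(P, q) = \lim_{q \to 1} \frac{1 - \sum_{i=1}^{M} p_i^{q}}{q - 1} = \lim_{q \to 1} \left( -\sum_{i=1}^{M} p_i^{q} \log p_i \right) = -\sum_{i=1}^{M} p_i \log p_i,$$

$$\lim_{q \to 1} \hat{H}_R(P, q) = \lim_{q \to 1} \frac{\log \sum_{i=1}^{M} p_i^{q}}{1 - q} = \lim_{q \to 1} \left( -\frac{\sum_{i=1}^{M} p_i^{q} \log p_i}{\sum_{i=1}^{M} p_i^{q}} \right) = -\sum_{i=1}^{M} p_i \log p_i.$$

Both limits coincide with the Shannon entropy of Equation (1).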
In order to compare the changes of entropy at different times, the entropy is normalized, i.e.,
$$\bar{H} = \frac{\hat{H}}{\hat{H}_{max}}, \tag{4}$$
where the maximum value of the Rényi entropy for an observation vector of size $L$ is given by
$$\hat{H}^{R}_{max}(P, q) = \log(L), \tag{5}$$
while the maximum of the Tsallis entropy is given by
$$\hat{H}^{T}_{max}(P, q) = \frac{1 - L^{1-q}}{q - 1}. \tag{6}$$
The parameter $q$ in Equations (2) and (3) makes the entropy more or less sensitive to certain events within the distribution, thus modifying the entropy values and, consequently, the entropy behavior. In addition, for a specific event with probability $p$, selecting an appropriate value of $q$ increases (or decreases) the entropy value with respect to the Shannon entropy; see Figure 2.
Figure 2. Entropy estimators (Shannon, Rényi, and Tsallis) for a random variable (r.v.) $X$ with probabilities $P_x = \{p, (1-p)\}$.
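The estimators and normalizations of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, probabilities are estimated as relative frequencies within one window, and the Shannon entropy is normalized by $\log(L)$, which is an assumption here since the text states the maximum only for the Rényi and Tsallis cases.

```python
import math
from collections import Counter


def relative_frequencies(values):
    """Estimate p_i = n_i / L from the values observed in one window."""
    total = len(values)
    return [n / total for n in Counter(values).values()]


def shannon(p):
    """Shannon entropy, Equation (1)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi(p, q):
    """Renyi entropy, Equation (2); falls back to Shannon at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon(p)
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)


def tsallis(p, q):
    """Tsallis entropy, Equation (3); falls back to Shannon at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon(p)
    return (1.0 - sum(pi ** q for pi in p if pi > 0)) / (q - 1.0)


def normalized_entropy(values, kind="shannon", q=1.0):
    """Normalized entropy of one window of L observations, Equation (4)."""
    L = len(values)
    if L < 2:
        return 0.0  # a single observation carries no dispersion
    p = relative_frequencies(values)
    if kind == "shannon":
        return shannon(p) / math.log(L)  # assumed maximum: log(L)
    if kind == "renyi":
        return renyi(p, q) / math.log(L)  # Equation (5)
    if kind == "tsallis":
        if abs(q - 1.0) < 1e-12:
            return shannon(p) / math.log(L)
        h_max = (1.0 - L ** (1.0 - q)) / (q - 1.0)  # Equation (6)
        return tsallis(p, q) / h_max
    raise ValueError("unknown entropy kind: " + kind)
```

A window in which every value is distinct gives a normalized entropy of 1 for all three estimators, while a window holding a single repeated value gives 0.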

3.2. Feature Space

Let $X_t^i$, $i = 1, 2, \ldots, p$, be features or random variables of some phenomenon under study, and $\mathbb{R}^p$ a $p$-dimensional feature space, i.e., the space where our variables live. When the phenomenon is observed during a time period $T$, $N$ observations are collected. These observations can be studied one by one or by group. In our case, the $N$ observations are partitioned into $m$ sequences or windows of length $L$. For each sequence or time window, a functional $f(\cdot)$ is applied. As our purpose is the study of network traffic and the randomness of the features, we employ the entropy as $f(\cdot)$, which maps the set of values of a sequence in $\mathbb{R}^p$ into a point in $\mathbb{R}^p$.
Let $X_j \in \mathbb{R}^p$, $j = 1, 2, \ldots, N$, be the vectors associated with the $p$ features, and $H_i$, $i = 1, \ldots, m$, the entropies associated with the $X_j$ in each sequence. Therefore, we have $X_{N \times p}$, a matrix representing the observations, and $H_{m \times p}$, the matrix of the entropies of the $m$ sequences:
$$X_{N \times p} = \begin{pmatrix} X_1^1 & X_1^2 & \cdots & X_1^p \\ X_2^1 & X_2^2 & \cdots & X_2^p \\ \vdots & \vdots & \ddots & \vdots \\ X_N^1 & X_N^2 & \cdots & X_N^p \end{pmatrix} \xrightarrow{\; f(\cdot) \;} H_{m \times p} = \begin{pmatrix} \bar{H}(X_1^1) & \bar{H}(X_1^2) & \cdots & \bar{H}(X_1^p) \\ \bar{H}(X_2^1) & \bar{H}(X_2^2) & \cdots & \bar{H}(X_2^p) \\ \vdots & \vdots & \ddots & \vdots \\ \bar{H}(X_m^1) & \bar{H}(X_m^2) & \cdots & \bar{H}(X_m^p) \end{pmatrix} \tag{7}$$
Each row of the $H_{m \times p}$ matrix represents a point in the $p$-dimensional feature space, and the $m$ points generate a cloud that characterizes the behavior of the $p$ variables of the phenomenon under study. The entropy values are normalized, $\bar{H}(X_i^p) \in [0, 1]$, in order to allow comparisons between the variables.
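The mapping from the observation matrix to the entropy matrix can be illustrated with a short sketch. It is hedged in several ways: `entropy_matrix` and `normalized_shannon` are hypothetical names, the windows are taken as non-overlapping, an incomplete trailing window is simply dropped (an assumption), and the normalized Shannon entropy stands in for whichever estimator is chosen.

```python
import math
from collections import Counter


def normalized_shannon(values):
    """Normalized Shannon entropy of one window (stand-in for f)."""
    L = len(values)
    if L < 2:
        return 0.0
    h = -sum((n / L) * math.log(n / L) for n in Counter(values).values())
    return h / math.log(L)


def entropy_matrix(observations, L, entropy_fn=normalized_shannon):
    """Map N observations of p features to an m x p entropy matrix.

    observations: list of N tuples, one value per feature.
    L: window length; m = N // L windows are produced.
    """
    p = len(observations[0])
    m = len(observations) // L  # assumption: drop an incomplete last window
    H = []
    for j in range(m):
        window = observations[j * L:(j + 1) * L]
        # One entropy value per feature column of this window.
        H.append([entropy_fn([obs[i] for obs in window]) for i in range(p)])
    return H
```

Each row of the returned list is one point of the cloud in the $p$-dimensional feature space.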

3.3. Mahalanobis Distance

The Mahalanobis distance is defined as [18]: $d^2 = (x - \mu)^\top C^{-1} (x - \mu)$, where $x \in \mathbb{R}^p$ is the sample vector, $\mu \in \mathbb{R}^p$ denotes the theoretical mean vector, and $C \in \mathbb{R}^{p \times p}$ denotes the theoretical covariance matrix.
An unbiased sample covariance matrix is
$$S = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top, \tag{8}$$
where the sample mean is
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \tag{9}$$
Thus, the Mahalanobis distance using Equations (8) and (9) is given by:
$$d^2 = (x - \bar{x})^\top S^{-1} (x - \bar{x}). \tag{10}$$
One basic assumption preceding any discussion of the distribution properties of the Mahalanobis distance is that the $p$-multivariate observations involved are the result of a random sampling of a $p$-variate Gaussian population having a mean vector $\mu$ and a covariance matrix $C$. As $\mu$ and $C$ are theoretical values, for a data set containing the samples $x_1, x_2, \ldots, x_N$, $\bar{x}$ and $S$ are their respective estimates, and the distribution of $d_i^2(\bar{x}_i, S)$ is given by:
$$d_i^2(\bar{x}_i, S) \sim \frac{(N-1)^2}{N}\, \beta\left[\alpha,\, p/2,\, (N-p-1)/2\right], \tag{11}$$
where $\beta[\alpha, p/2, (N-p-1)/2]$ represents a beta distribution with a level of confidence $\alpha$ and parameters $p/2$ and $(N-p-1)/2$, $N$ is the number of samples, and $p$ is the number of variables; see [7,19].
If the data set does not follow a Gaussian distribution, a method for transforming non-Gaussian distributed data into data with an approximately Gaussian distribution should be employed. In this paper, the Box-Cox transformation [20] was used. This transformation is a family of power expressions, $y(z) = \frac{x^{z} - 1}{z}$ for $z \neq 0$ and $y(z) = \log(x)$ for $z = 0$, where $z$ is the transformation parameter that maximizes the log-likelihood function.
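Equations (8) to (10) and the Box-Cox family can be sketched with the Python standard library alone. The helper names below are hypothetical, the linear solve replaces an explicit inverse of $S$ for numerical convenience, and the Box-Cox parameter $z$ is taken as given rather than fitted by maximum likelihood; the beta-based limit of Equation (11) is not computed here.

```python
import math


def sample_mean(X):
    """Sample mean vector, Equation (9)."""
    n = len(X)
    return [sum(row[k] for row in X) / n for k in range(len(X[0]))]


def sample_covariance(X):
    """Unbiased sample covariance matrix, Equation (8)."""
    n, p = len(X), len(X[0])
    mean = sample_mean(X)
    S = [[0.0] * p for _ in range(p)]
    for row in X:
        d = [row[k] - mean[k] for k in range(p)]
        for a in range(p):
            for b in range(p):
                S[a][b] += d[a] * d[b]
    return [[S[a][b] / (n - 1) for b in range(p)] for a in range(p)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def mahalanobis_sq(x, mean, S):
    """Squared Mahalanobis distance, Equation (10)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * zi for di, zi in zip(d, solve(S, d)))


def box_cox(x, z):
    """Box-Cox transform of a positive value x for a given parameter z."""
    return math.log(x) if z == 0 else (x ** z - 1.0) / z
```

For the four points (0,0), (2,0), (0,2), (2,2), the mean is (1,1) and the covariance is diagonal with entries 4/3, so the point (3,1) lies at squared distance 3 from the mean.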

3.4. One Class Support Vector Machine and Mahalanobis Kernel

OC-SVM maps input data $x_1, \ldots, x_N \in A$ (a certain set) into a high-dimensional space $F$ (via the kernel $k(x, y)$) and finds the maximal margin hyperplane that best separates the training data from the origin (see Figure 3).
Figure 3. Illustration of One Class Support Vector Machine idea.
Theoretical fundamentals of SVM and OC-SVM were established in [21,22,23,24]. In order to separate the data from the origin, the following quadratic program must be solved [21]:
$$\min_{w \in F,\; b \in \mathbb{R},\; \xi \in \mathbb{R}^N} \; \frac{1}{2} \|w\|^2 + \frac{1}{\nu N} \sum_{i}^{N} \xi_i - b \tag{12}$$
subject to $(w \cdot \varphi(x_i)) \geq b - \xi_i$, $\xi_i \geq 0$, $\nu \in (0, 1]$, where $w$ is the normal vector, $\varphi$ is a map function $A \to F$, $b$ is the bias, $\xi_i$ are nonnegative slack variables, $\nu$ is the outlier control parameter, and $k(x, y) = (\varphi(x) \cdot \varphi(y))$. Moreover, the decision function is given by $f(x) = \mathrm{sgn}\left((w \cdot \varphi(x)) - b\right)$.
By applying the kernel function and the Lagrange multipliers ($\alpha_i$) to the original quadratic program, the solution of Equation (12) yields the decision function:
$$f(x) = \mathrm{sgn}\left( \sum_{i}^{N} \alpha_i k(x_i, x) - b \right), \tag{13}$$
where $w = \sum_i \alpha_i \varphi(x_i)$ and $\sum_i \alpha_i = 1$.
In this work, we used the Mahalanobis kernel (MK), which is defined as $K(x, y) = \exp\left(-(x - y)^\top C (x - y)\right)$, where $C$ is a positive definite matrix. The Mahalanobis kernel is an extension of the Radial Basis Function (RBF) kernel. Namely, by setting $C = \eta I$, where $\eta > 0$ is a parameter for decision boundary control and $I$ is the identity matrix, we obtain the RBF kernel:
$$\exp\left(-\eta \|x - y\|^2\right). \tag{14}$$
The Mahalanobis kernel approximation [25] used in this work is:
$$k(x, y) = \exp\left(-\frac{\eta}{p} (x - y)^\top S^{-1} (x - y)\right), \tag{15}$$
where $p$ is the number of variables, and $S$ is defined by Equation (8).
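The relation between the two kernels can be made concrete with a small sketch. The function names are hypothetical, and the inverse covariance matrix `S_inv` is assumed to have been computed beforehand; the code only evaluates the kernel expressions given above.

```python
import math


def quadratic_form(d, C):
    """Compute d^T C d for a vector d and a square matrix C."""
    n = len(d)
    return sum(d[a] * C[a][b] * d[b] for a in range(n) for b in range(n))


def mahalanobis_kernel(x, y, C):
    """K(x, y) = exp(-(x - y)^T C (x - y)) for a positive definite C."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return math.exp(-quadratic_form(d, C))


def rbf_kernel(x, y, eta):
    """RBF kernel of Equation (14)."""
    return math.exp(-eta * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mahalanobis_kernel_approx(x, y, S_inv, eta):
    """Equation (15): C is taken as (eta / p) * S^{-1}."""
    p = len(x)
    C = [[eta / p * value for value in row] for row in S_inv]
    return mahalanobis_kernel(x, y, C)
```

With $C = \eta I$, `mahalanobis_kernel` returns the same value as `rbf_kernel`, which is the reduction described in the text.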

4. Problem Statement

Let $\Omega$ be an Internet traffic data trace, called here the Ω-trace, and $p$ the number of random variables $X_i$ representing the traffic features. It is known that the temporal behavior of these variables in the case of “normal” traffic differs from that when there are attacks. On the other hand, in order to characterize these behaviors, entropy can be used, and then instead of studying the traffic features directly, their temporal entropy behaviors $H_i(t)$ will be studied. We have the following behaviors:
  • if the Ω-trace was obtained during “normal” network traffic and the outlier exclusion was performed, it will be called a β-trace,
  • if the Ω-trace was obtained during a period containing “normal” traffic plus one or more attacks, it will be called a ψ-trace.
The main problem is to find a region, $R_N$ or $R_A$, in the feature space $\mathbb{R}^p$ characterizing the temporal behavior of the entropy of the $p$ intrinsic variables associated with a class determined by the traces, i.e.,
  • if the Ω-trace is a β-trace, then a region $R_N$ (“normal” traffic) can be constructed and it will serve to detect the anomalies,
  • if the Ω-trace is a ψ-trace, then a region $R_A$ (abnormal traffic) can be constructed and will serve to classify the anomalies of this class.
Our approach to defining the “normal” regions $R_N$ or abnormal regions $R_A$ in the feature space uses the Mahalanobis distance to construct regular regions (i.e., hyper-ellipsoids) and OC-SVM for non-regular regions.
Figure 4 shows the general architecture of the proposed method, which is composed of three parts: training, detection, and classification. Feature extraction, windowing, entropy calculation, and the Box-Cox transformation (for non-Gaussian data) are performed in the training and detection stages. In the training stage, the different regions in the feature space are defined and the decision functions are obtained. In the detection stage, the “normal” regions $R_N$ and the decision functions are used to detect anomalies in the current traffic. Finally, the anomaly is classified through the defined regions $R_A$ of known classes.
Figure 4. General architecture of the proposed method.

4.1. Algorithm for the Construction of Decision Regions

4.1.1. Training Stage

An Ω-trace is divided into $m$ non-overlapping slots of $L$ packets each. Next, normalized entropy estimates, by means of Equations (1)–(3), of each of the $p$ variables for every $j$-slot of size $L$ are obtained, using the relative frequencies $\hat{p}_i = \frac{n_i}{L}$, where $n_i$ is the number of times that the $i$-element appears in the $j$-slot. Then, the matrix $H \in \mathbb{R}^{m \times p}$ is built as follows:
$$H_{m \times p} = \begin{pmatrix} \bar{H}(X_1^1) & \bar{H}(X_1^2) & \cdots & \bar{H}(X_1^p) \\ \bar{H}(X_2^1) & \bar{H}(X_2^2) & \cdots & \bar{H}(X_2^p) \\ \vdots & \vdots & \ddots & \vdots \\ \bar{H}(X_m^1) & \bar{H}(X_m^2) & \cdots & \bar{H}(X_m^p) \end{pmatrix} \tag{16}$$
where $\bar{H}(X_j^p)$ represents the normalized entropy estimate of the $p$ variable of each $j$-slot obtained from the Ω-trace. The $H$ matrices are the inputs of the algorithms for constructing the regions.
Algorithm for constructing regions based on the Mahalanobis distance (MD) method
  • Verify that the columns of the H matrix follow a Gaussian distribution. If the data are non-Gaussian, then a transformation is performed so that the new data approximately follow a distribution of this type. In this paper, the Box-Cox transformation was employed.
  • Perform the exclusion of outliers of the $H$ matrix. The limit $L_T$ for the Mahalanobis distance is calculated through Equation (11).
  • Calculate the mean vector $\bar{x} = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p\}$, where the $i$-element is the mean of the $i$-column of the $H$ matrix; see Equation (9).
  • Calculate the covariance matrix $S$ of the $H$ matrix. As the $S$ matrix is positive definite and Hermitian, all its eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$ are real and positive, and its eigenvectors $\gamma_1, \gamma_2, \ldots, \gamma_p$ form a set of orthogonal basis vectors that span the $p$-dimensional vector space.
  • Solve the matrix equation $S\gamma = \lambda\gamma$ according to a specific algorithm in order to obtain the eigenvalues $\lambda_i$ and eigenvectors $\gamma_i$ of $S$.
  • Finally, define a hyper-ellipsoidal region obtained from the $H$ matrix by means of $\{L_T, \bar{x}, \gamma, \lambda\}$.
Algorithm for constructing regions based on the OC-SVM method
  • Verify that the columns of the H matrix follow a Gaussian distribution. If the data are non-Gaussian, then a transformation is performed so that the new data approximately follow a distribution of this type.
  • Perform the exclusion of outliers of the $H$ matrix. The limit $L_T$ for the Mahalanobis distance is calculated through Equation (11).
  • Solve Equation (12) via the Sequential Minimal Optimization (SMO) algorithm [26], using two different kernel functions: RBF and MK. Considering the $H$ matrix as input data, the entropy support vectors $x_i$ and the constants $\alpha_i$ and $b$ are obtained.
The algorithm for constructing regions based on MD or OC-SVM allows regions $R_N$ to be defined if the trace contains “normal” traffic, or regions $R_A$ if the trace contains abnormal traffic.

4.1.2. Detection Stage

  • In the current traffic, a $j$-slot of size $L$ packets is captured, the $p$ features or variables associated with each packet are extracted, and their entropies are estimated. With these values, the input vector $h_j$ is built as follows:
    $$h_j = \left( \bar{H}(X_j^1), \bar{H}(X_j^2), \ldots, \bar{H}(X_j^p) \right). \tag{17}$$
  • The decision function for the MD region is given by Equation (10). If $d_j^2(h_j) \leq L_T$, then the $j$-slot is considered “normal”; otherwise, it is an anomaly.
  • The decision function for OC-SVM is expressed by Equation (13). If the decision function maps $h_j$ to $+1$, then $h_j$ is considered “normal”; otherwise, it is an anomaly.
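The two detection rules can be sketched as follows. This is an illustration only: the function names are hypothetical, the squared distance and the limit $L_T$ are assumed to be available from the training stage, the support vectors, multipliers and bias in the usage example are made-up values rather than trained ones, and a score of exactly zero is mapped to $+1$ by assumption.

```python
import math


def md_is_normal(d2, L_T):
    """MD rule: the slot is "normal" when d_j^2(h_j) <= L_T."""
    return d2 <= L_T


def ocsvm_decision(h, support_vectors, alphas, b, kernel):
    """Decision function of Equation (13): +1 is "normal", -1 an anomaly."""
    score = sum(a * kernel(sv, h) for sv, a in zip(support_vectors, alphas)) - b
    return 1 if score >= 0 else -1  # assumption: sgn(0) is treated as +1


def rbf(x, y, eta=10.0):
    """RBF kernel used here only to exercise the decision function."""
    return math.exp(-eta * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical trained model: one support vector, alpha = 1, b = 0.5.
support_vectors = [[0.5, 0.5]]
alphas = [1.0]
b = 0.5

near = ocsvm_decision([0.5, 0.5], support_vectors, alphas, b, rbf)
far = ocsvm_decision([0.0, 0.0], support_vectors, alphas, b, rbf)
```

In this toy setting the entropy point that coincides with the support vector is labeled normal, while the distant point falls outside the region and is flagged as an anomaly.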

4.1.3. Anomaly Classification Stage

If $h_j$ (Equation (17)) is outside the “normal” region, i.e., $h_j \notin R_N$, but $h_j \in R_A$, then the behavior is abnormal and the vector will be classified. Here, $h_j$ is evaluated with all decision functions defined in the training stage.
If $h_j$ is outside all the defined regions, or $h_j$ is located in two or more regions, then classification is refined through a criterion based on the k-temporal nearest neighbors algorithm in order to ensure that a point does or does not belong to a specific class.
The principle of the k-temporal nearest neighbors algorithm is that, given $h_j$ and its $k$ temporal successors $h_r$, $r = j+1, j+2, \ldots, j+k$, $h_j$ is classified by majority vote among these $k$ temporal nearest neighbors. If $h_j$ and its $k$ temporal successors all lie outside every defined region, then a new attack class may have been found.
Figure 5 shows an example of the algorithm considering two regions and $k = 2$ temporal nearest neighbors. In Figure 5a, point $h_j$ is classified in the $R_A$ region, while in Figure 5b, point $h_j$ is classified in the $R_B$ region.
Figure 5. Use of the k-temporal nearest neighbors algorithm in the classification stage. (a) $h_j$ is outside all the defined regions; (b) $h_j$ belongs to two or more regions.
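The refinement step can be sketched as a vote over temporal successors. Several details are assumptions made for this illustration and are not specified in the text: each slot carries a single region label or `None` when it is ambiguous or outside every region, unlabeled successors do not vote, and a tie between regions leaves the point unclassified. The function name is hypothetical.

```python
from collections import Counter


def classify_by_temporal_neighbors(labels, j, k):
    """Classify slot j from the region labels of its k temporal successors.

    labels: per-slot region label (e.g. "R_A"), or None when the slot is
            outside all regions or inside several of them.
    Returns the winning label, or None when no successor votes or the
    vote is tied (assumed policy).
    """
    successors = labels[j + 1:j + 1 + k]
    votes = Counter(label for label in successors if label is not None)
    if not votes:
        return None  # candidate for a previously unseen class
    top_label, top_count = votes.most_common(1)[0]
    tied = [label for label, count in votes.items() if count == top_count]
    return top_label if len(tied) == 1 else None
```

For example, with labels `[None, "R_A", "R_A", "R_B"]` and $k = 2$, the undecided slot 0 is assigned to `"R_A"` because both of its successors lie in that region.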

5. Experiments and Results

5.1. Our Data Sets

We evaluated our approach by analyzing its performance over two different experimental databases. The first is from an Academic LAN [27], and is composed of traffic data traces collected over seven days. One trace contains “normal” traffic ($\beta_1$), and four traces are formed with “normal” traffic plus traffic generated by four real attacks: a port scan ($\psi_1$) and three worms: Blaster ($\psi_2$), Sasser ($\psi_3$), and Welchia ($\psi_4$). The second is a subset of the 1998 MIT-DARPA set [28] (a public benchmark set for testing NIDS), and is composed of one training trace ($\beta_2$) that was collected over five days of “normal” behavior of the network and four traces containing the traffic generated by Smurf ($\psi_5$), Neptune ($\psi_6$), Pod ($\psi_7$), and portsweep ($\psi_8$) attacks.
The $\beta_1$-trace is composed of “normal” traffic captured over six days. Only one day of traffic is used in the training stage, and the rest is used for testing. A similar procedure is employed for the MIT-DARPA $\beta_2$-trace. In the case of anomalous traces, a portion of each ψ-trace was used for training, and the complete traces were used for testing.

5.2. Traffic Features

According to Section 3.2, the selected features are extracted from the header of each network traffic packet and represented as random variables $X_r$, $r = 1, \ldots, p$. For our experiments, where attacks generate deviations from the typical behavior of IP and port addresses, four random variables were selected: $X_1$, source IP addresses; $X_2$, destination IP addresses; $X_3$, source port addresses; and $X_4$, destination port addresses. The temporal behavior of these features was studied via their entropies $h_{X_p}$ for normal and abnormal traffic.
An Ω-trace is divided into $m$ non-overlapping slots of $L$ packets each. For each $i$-slot, the normalized entropy of each $p$ variable, $\bar{H}(X_i^p)$, was obtained and the entropy vectors $h_{X_p} = \left( \bar{H}(X_1^p), \bar{H}(X_2^p), \ldots, \bar{H}(X_m^p) \right)$ were constructed. Then, the following matrices were formed as inputs of the algorithms: $H_{Ip} = (h_{X_1}\; h_{X_2}) \in \mathbb{R}^{m \times 2}$, $H_{Pt} = (h_{X_3}\; h_{X_4})$, $H_{IpDPt} = (h_{X_1}\; h_{X_2}\; h_{X_4})$, $H_{IpSPt} = (h_{X_1}\; h_{X_2}\; h_{X_3}) \in \mathbb{R}^{m \times 3}$, and $H_{IpPt} = (h_{X_1}\; h_{X_2}\; h_{X_3}\; h_{X_4})$. For the estimation of the generalized entropies, the selected q-values are $\{0.01, 0.5, 1.5, 2, 10\}$.

5.3. The Classifier Metrics

The classifier is a mapping from instances to predicted classes, e.g., in two-class classification problems, each instance (an entropy point in our case) is mapped to one element of the set $\{+1, -1\}$ of positive and negative class labels [29]. Given a classifier and an instance, there are four possible outcomes: $TN$ is the number of correct predictions that an instance is negative, $FP$ is the number of incorrect predictions that an instance is positive, $FN$ is the number of incorrect predictions that an instance is negative, and $TP$ is the number of correct predictions that an instance is positive. With these entries, the following statistics are computed [30]:
  • The accuracy (AC) is the proportion of the total number of predictions that were correct: $AC = \frac{TN + TP}{TN + FP + FN + TP}$.
  • The sensitivity, detection rate, or true positive rate (TPR) is the proportion of positive cases that were correctly identified: $TPR = \frac{TP}{FN + TP}$.
  • The specificity or true negative rate (TNR) is defined as the proportion of negative cases that were classified correctly: $TNR = \frac{TN}{TN + FP}$.
  • The false negative rate (FNR) is the proportion of positive cases that were incorrectly classified as negative: $FNR = \frac{FN}{FN + TP}$.
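The four statistics above translate directly into code. The sketch below uses a hypothetical function name and made-up counts in the usage example; it only restates the formulas and does not reproduce any result from the paper.

```python
def classifier_metrics(tn, fp, fn, tp):
    """Compute AC, TPR, TNR and FNR from the four outcome counts."""
    return {
        "AC": (tn + tp) / (tn + fp + fn + tp),
        "TPR": tp / (fn + tp),
        "TNR": tn / (tn + fp),
        "FNR": fn / (fn + tp),
    }


# Illustrative counts only: 100 negative and 100 positive instances.
metrics = classifier_metrics(tn=90, fp=10, fn=5, tp=95)
```

Since TPR and FNR share the denominator $FN + TP$, they always sum to 1, which is a quick sanity check on any reported pair.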

5.4. Detection of Anomalies in Network Traffic

As noted above, anomaly-free traces were divided into $m$ non-overlapping slots of $L$ packets (in our case, $L = 32$). This size was chosen according to the shortest attacks contained in the test traces (around 30 packets), ensuring at least one slot with malicious traffic.
For the input matrices $H_{Ip}$, $H_{Pt}$, $H_{IpSPt}$, $H_{IpDPt}$, and $H_{IpPt}$, ellipsoids were found through the Mahalanobis distance, and non-regular regions were found through OC-SVM with the Radial Basis Function (RBF) kernel and the Mahalanobis kernel (MK). The performance of OC-SVM was evaluated for different combinations of the parameters $\eta$ and $\nu$ (see Equations (12), (14), and (15)) in a k-fold cross-validation process with $k = 5$. For the implementation of OC-SVM, the LIBSVM library [31] was used.
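Table 1. True positive and negative rates (%) using Tsallis entropy with $q = 0.01$ for different input matrices.

LAN

| Input | Region | ν | η | # SV | $\beta_1$ | $\psi_1$ | $\psi_2$ | $\psi_3$ | $\psi_4$ |
|---|---|---|---|---|---|---|---|---|---|
| $H^{T}_{Ip}$ | MK | 0.1 | 0.01 | 167 | 91.29 | 100 | 99.37 | 81.64 | 85.97 |
| | RBF | 9.4 | 0.01 | 178 | 95.78 | 100 | 99.24 | 75.56 | 85.46 |
| | MD | α = 0.9995 | | | 98.32 | 100 | 99.43 | 66.43 | 57.58 |
| $H^{T}_{Pt}$ | MK | 0.2 | 0.01 | 172 | 92.61 | 88.88 | 84.57 | 61.18 | 94.09 |
| | RBF | 9.4 | 0.01 | 194 | 92.36 | 88.88 | 83.34 | 61.14 | 91.07 |
| | MD | α = 0.9995 | | | 98.36 | 77.77 | 75.92 | 60.66 | 69.02 |
| $H^{T}_{IpSPt}$ | MK | 0.12 | 0.01 | 196 | 94.96 | 100 | 99.59 | 73.73 | 98.91 |
| | RBF | 9.4 | 0.01 | 226 | 96.77 | 100 | 99.73 | 48.85 | 99.66 |
| | MD | α = 0.9995 | | | 98.13 | 100 | 99.52 | 65.89 | 97.98 |
| $H^{T}_{IpDPt}$ | MK | 0.2 | 0.01 | 206 | 93.62 | 100 | 99.47 | 87.12 | 99.59 |
| | RBF | 10.6 | 0.01 | 232 | 96.74 | 100 | 99.48 | 87.93 | 99.39 |
| | MD | α = 0.9995 | | | 98.23 | 100 | 99.55 | 69.54 | 99.25 |
| $H^{T}_{IpPt}$ | MK | 0.12 | 0.01 | 206 | 95.23 | 100 | 99.65 | 79.75 | 99.59 |
| | RBF | 10.4 | 0.01 | 291 | 96.23 | 100 | 99.76 | 86.38 | 99.75 |
| | MD | α = 0.9995 | | | 97.94 | 100 | 99.62 | 69.01 | 99.23 |

MIT-DARPA

| Input | Region | ν | η | # SV | $\beta_2$ | $\psi_5$ | $\psi_6$ | $\psi_7$ | $\psi_8$ |
|---|---|---|---|---|---|---|---|---|---|
| $H^{T}_{Ip}$ | MK | 0.03 | 0.001 | 6 | 99.98 | 99.91 | 0.0 | 92.85 | 22.22 |
| | RBF | 25 | 0.001 | 12 | 99.96 | 99.91 | 0.0 | 92.85 | 22.22 |
| | MD | α = 0.99995 | | | 99.98 | 99.91 | 0.0 | 92.85 | 22.22 |
| $H^{T}_{Pt}$ | MK | 0.03 | 0.001 | 9 | 99.76 | 99.39 | 100 | 92.85 | 88.88 |
| | RBF | 25 | 0.001 | 17 | 99.76 | 99.82 | 100 | 92.85 | 88.88 |
| | MD | α = 0.99995 | | | 99.58 | 99.39 | 100 | 92.85 | 100 |
| $H^{T}_{IpSPt}$ | MK | 0.05 | 0.001 | 10 | 99.89 | 99.91 | 100 | 92.85 | 44.44 |
| | RBF | 25 | 0.001 | 34 | 99.82 | 99.91 | 100 | 92.85 | 66.66 |
| | MD | α = 0.99995 | | | 99.82 | 99.91 | 100 | 92.85 | 66.66 |
| $H^{T}_{IpDPt}$ | MK | 0.05 | 0.001 | 12 | 99.87 | 99.91 | 0.0 | 92.85 | 88.88 |
| | RBF | 25 | 0.001 | 30 | 99.84 | 99.91 | 0.0 | 92.85 | 100 |
| | MD | α = 0.99995 | | | 99.81 | 99.91 | 0.0 | 92.85 | 100 |
| $H^{T}_{IpPt}$ | MK | 0.05 | 0.001 | 14 | 99.82 | 99.91 | 100 | 92.85 | 88.88 |
| | RBF | 25 | 0.001 | 58 | 99.56 | 99.91 | 100 | 92.85 | 100 |
| | MD | α = 0.99995 | | | 99.61 | 99.91 | 100 | 92.85 | 100 |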
The regions found are used to detect anomalies in network traffic. Therefore, traces containing traffic generated by different anomalies were used. Each test trace was divided into slots of size $L$, and the entropy estimates for each selected variable were obtained. For each $i$-slot, the Mahalanobis distance was computed by Equation (10). Likewise, each $i$-slot was analyzed with the OC-SVM decision function, Equation (13), to determine whether or not it belongs to the non-regular region.
Results for anomaly detection on the LAN and MIT-DARPA traces using Tsallis entropy of the features with $q = 0.01$, by means of the ellipsoidal (MD) and non-regular (OC-SVM) regions, are displayed in Table 1. Additionally, the values of $\alpha$, $\eta$ and $\nu$ (see Equations (11), (12), (14), and (15)) are shown. The true negative rate for the attack $\psi_6$ is 0 or 100, as it is contained in only one slot.

5.5. Classification of Worm Attacks

Each ψ-trace was divided into $m$ non-overlapping slots of size $L$. For each $i$-slot, $i = 1, \ldots, m$, the entropy estimate $\bar{H}(X_i^r)$ of the four selected variables was obtained. Next, the $H_{Ip}$, $H_{Pt}$, $H_{IpSPt}$, $H_{IpDPt}$, and $H_{IpPt}$ matrices were formed. With these matrices, the regions using the Mahalanobis distance and OC-SVM with the RBF and MK kernels were defined. Figure 6 shows the ellipses and non-regular regions defined in the feature space of IP addresses, $\mathbb{R}^2$, for each anomalous trace from the LAN and MIT-DARPA traces. In Table 2, the selected values of the OC-SVM parameters for the construction of non-regular regions are shown.
We assume that every entropy point outside the normal region is an anomaly; however, not every anomaly belongs to a specific attack class. If a point is an anomaly but the majority of its temporal neighbors are normal, then it is considered normal as well. If a point is an anomaly and the majority of its temporal neighbors belong to a specific anomaly class, then it belongs to this class. Therefore, results were obtained using the k-temporal nearest neighbors algorithm, as in [6].
Table 2. Parameters of OC-SVM for classification of LAN and MIT-DARPA traces with Tsallis entropy, q = 0.01 .
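Traces: LAN and MIT-DARPA (ψ1, ψ2, ψ3, ψ4, ψ5, ψ8). Each cell lists ν / η / # SV (number of support vectors).

| Matrix | Kernel | ψ1 | ψ2 | ψ3 | ψ4 | ψ5 | ψ8 |
|---|---|---|---|---|---|---|---|
| $H_{Ip}^{T}$ | MK | 0.01 / 0.6 / 3 | 0.01 / 0.7 / 28 | 0.01 / 0.9 / 163 | 0.01 / 0.9 / 115 | 0.0001 / 0.1 / 2 | 0.01 / 0.08 / 2 |
| $H_{Ip}^{T}$ | RBF | 0.001 / 8 / 3 | 0.01 / 13 / 29 | 0.01 / 10 / 162 | 0.001 / 15 / 33 | 0.0005 / 3 / 2 | 0.01 / 25 / 2 |
| $H_{Pt}^{T}$ | MK | 0.01 / 0.6 / 5 | 0.01 / 0.7 / 25 | 0.01 / 0.9 / 35 | 0.01 / 0.9 / 59 | 0.0001 / 0.1 / 3 | 0.01 / 0.08 / 3 |
| $H_{Pt}^{T}$ | RBF | 0.001 / 8 / 6 | 0.01 / 13 / 48 | 0.01 / 10 / 42 | 0.001 / 15 / 21 | 0.0005 / 3 / 2 | 0.01 / 25 / 3 |
| $H_{IpSPt}^{T}$ | MK | 0.01 / 0.6 / 6 | 0.01 / 0.7 / 42 | 0.01 / 0.9 / 189 | 0.01 / 0.9 / 133 | 0.0001 / 0.1 / 2 | 0.01 / 0.08 / 2 |
| $H_{IpSPt}^{T}$ | RBF | 0.001 / 8 / 6 | 0.01 / 13 / 66 | 0.01 / 10 / 195 | 0.001 / 15 / 173 | 0.0005 / 3 / 2 | 0.01 / 25 / 3 |
| $H_{IpDPt}^{T}$ | MK | 0.01 / 0.6 / 8 | 0.01 / 0.7 / 34 | 0.01 / 0.9 / 173 | 0.01 / 0.9 / 136 | 0.0001 / 0.1 / 2 | 0.01 / 0.08 / 2 |
| $H_{IpDPt}^{T}$ | RBF | 0.001 / 8 / 6 | 0.01 / 13 / 49 | 0.01 / 10 / 179 | 0.001 / 15 / 85 | 0.0005 / 3 / 2 | 0.01 / 25 / 5 |
| $H_{IpPt}^{T}$ | MK | 0.01 / 0.6 / 8 | 0.01 / 0.7 / 47 | 0.01 / 0.9 / 193 | 0.01 / 0.9 / 148 | 0.0001 / 0.1 / 2 | 0.01 / 0.08 / 4 |
| $H_{IpPt}^{T}$ | RBF | 0.001 / 8 / 5 | 0.01 / 13 / 115 | 0.01 / 10 / 217 | 0.001 / 15 / 283 | 0.0005 / 3 / 2 | 0.01 / 25 / 9 |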
Figure 6. Worm attack regions in 2D space. (a) Worm attack regions from LAN traces in 2D space ( L = 32 ); (b) Worm attack regions from MIT-DARPA traces in 2D space ( L = 32 ).
Figure 7 shows the impact of the k-value of the k-temporal nearest neighbors algorithm on the classification of LAN traces, using Tsallis entropy of the IP address and port variables with q = 0.01. TPR values are results of the classifiers trained with β-traces, and TNR values are results of the classifiers trained with ψ-traces.
Figure 7. Impact of the k-value of k-temporal nearest neighbors algorithm on the classification.

6. Discussion of the Experimental Results

Our approach, see Figure 4, based on mathematical tools such as Mahalanobis distance, covariance matrix, OC-SVM, and the k-temporal nearest neighbors algorithm allows the construction of different regions (regular and non-regular), which encompass the behaviors of the four selected features. These regions allow:
  • the classification of an entropy vector as normal or abnormal, and
  • the classification of an abnormal entropy vector based on known attacks.
The effect of the number of features (input matrices) on the true positive and true negative rates is shown in Table 1. Although in general more variables yield better results, an exception occurred in trace ψ3, where the use of three variables was better than four.
For anomalous ψ-traces, experimental results show that the true negative rate for q < 1 is higher than the results for q > 1 . Figure 8 shows the behavior of the true negative rate using four variables for different q values of Tsallis entropy using OC-SVM with RBF kernel.
The runtime of the OC-SVM decision function, see Equation (13), is determined by the number of support vectors ($x_i$). In this regard, the Mahalanobis kernel has a smaller number of support vectors than the RBF kernel in the MIT-DARPA traces. For LAN traces, the kernel that uses fewer support vectors is RBF.
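To illustrate why the runtime grows with the number of support vectors, the following sketch evaluates a one-class SVM decision function in the standard form of Schölkopf et al. [21], $f(x) = \sum_i \alpha_i k(x_i, x) - \rho$, with an RBF kernel. Equation (13) is not reproduced here, so the parameter names `alphas`, `rho`, and `gamma` are assumptions for this sketch, and the numbers are made up.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def ocsvm_decision(x, support_vectors, alphas, rho, gamma):
    """Return sum_i alpha_i * k(x_i, x) - rho.

    A non-negative value places x inside the learned region. The loop performs
    exactly one kernel evaluation per support vector, so the cost is linear in
    the number of support vectors.
    """
    score = sum(
        alpha * rbf_kernel(sv, x, gamma)
        for sv, alpha in zip(support_vectors, alphas)
    )
    return score - rho


# Made-up model with two support vectors in a 2-D entropy space.
svs = [(0.0, 0.0), (1.0, 1.0)]
alphas = [0.5, 0.5]
print(ocsvm_decision((0.5, 0.5), svs, alphas, rho=0.3, gamma=1.0) >= 0)
print(ocsvm_decision((5.0, 5.0), svs, alphas, rho=0.3, gamma=1.0) >= 0)
```

A model with 2 support vectors therefore needs 2 kernel evaluations per slot, while one with 283 needs 283, which is the trade-off visible in the # SV column of Table 2.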
When a sequence of anomalies occurs in network traffic, the entropy values begin to move away from the “normal” region toward a new region. This transient state affects classification when few neighbors ($k \le 2$) of the k-temporal nearest neighbors algorithm are selected. Choosing a larger k-value mitigates the effect of this transient, and therefore the classification rate stabilizes. Table 3 shows that when the number of neighbors is increased, the classification accuracy for the LAN traces increases as well. Using the k-temporal nearest neighbors method, classification is improved; however, classification is performed k slots later. Experimental results showed that for values of k between 3 and 5, the classification accuracy reaches a steady state, and the delay time is not significant.
Figure 8. True negative rate for different values of q parameter of Tsallis entropy using OC-SVM with RBF kernel.
Table 3. Accuracy of the classification of LAN and MIT-DARPA traces vs. different k-values of k-temporal nearest neighbors, using q = 0.01 in LAN traces, and q = 0.5 in MIT-DARPA traces for generalized entropies.
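| Traces | k | OC-SVM MK $H_{IpPt}^{T}$ | OC-SVM MK $H_{IpPt}^{R}$ | OC-SVM MK $H_{IpPt}^{S}$ | OC-SVM RBF $H_{IpPt}^{T}$ | OC-SVM RBF $H_{IpPt}^{R}$ | OC-SVM RBF $H_{IpPt}^{S}$ | MD $H_{IpPt}^{T}$ | MD $H_{IpPt}^{R}$ | MD $H_{IpPt}^{S}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| LAN | 0 | 32.4027 | 61.3592 | 38.9434 | 33.8506 | 66.0973 | 40.3794 | 30.1646 | 55.5757 | 33.7279 |
| LAN | 1 | 98.7503 | 98.8099 | 97.9229 | 98.7698 | 98.6869 | 98.9292 | 96.9534 | 96.1043 | 95.4036 |
| LAN | 3 | 99.1611 | 99.1210 | 98.2794 | 99.1871 | 99.0668 | 99.0142 | 98.0583 | 97.3197 | 96.3232 |
| LAN | 5 | 99.3079 | 99.2017 | 98.3141 | 99.2695 | 99.1844 | 99.0906 | 98.4566 | 97.7738 | 96.6380 |
| LAN | 7 | 99.2917 | 99.2695 | 98.3856 | 99.2716 | 99.2066 | 99.0771 | 98.6311 | 98.1488 | 96.8542 |
| MIT-DARPA | 0 | 24.2665 | 25.6156 | 23.8716 | 14.1701 | 8.7415 | 9.9645 | 86.7067 | 92.3433 | 93.2046 |
| MIT-DARPA | 1 | 99.9595 | 98.6699 | 99.9167 | 99.2338 | 99.9690 | 99.7382 | 99.9952 | 99.9976 | 99.9976 |
| MIT-DARPA | 3 | 99.7430 | 99.6359 | 99.9357 | 99.8025 | 99.7596 | 99.7644 | 99.9952 | 99.9976 | 99.9976 |
| MIT-DARPA | 5 | 99.6573 | 99.6097 | 99.9310 | 99.8524 | 99.6264 | 99.6811 | 99.9928 | 99.9952 | 99.9952 |
| MIT-DARPA | 7 | 99.4670 | 99.6407 | 99.9286 | 99.8477 | 99.3908 | 99.4694 | 99.9904 | 99.9928 | 99.9928 |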
Considering packet sizes of 60 bytes in a 100 Mb/s network, the time required to capture a slot of 32 packets is $\frac{32 \times 60 \times 8\ \text{bits}}{100\ \text{Mb/s}} = 153.6\ \mu\text{s}$. Using a PC with an Intel Core i7 at 3.4 GHz and 16 GB of RAM, a C implementation of the proposed method using MD, including the decision function, required computation times of no more than 5 μs. Since this is well below the time needed to capture one slot, the proposed method can be implemented in real time.
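The timing argument above can be checked with a few lines of arithmetic. This is a small sketch using the figures quoted in the text (32 packets, 60 bytes, 100 Mb/s, 5 μs); the function name is illustrative.

```python
def slot_capture_time_us(packets, packet_bytes, link_bps):
    """Time in microseconds to receive `packets` packets of `packet_bytes` bytes."""
    bits = packets * packet_bytes * 8
    return bits / link_bps * 1e6


capture_us = slot_capture_time_us(packets=32, packet_bytes=60, link_bps=100e6)
decision_us = 5.0  # upper bound on the computation time reported in the text

print(f"capture time per slot: {capture_us:.1f} us")
print(f"headroom factor: {capture_us / decision_us:.1f}x")
```

With these numbers the decision takes roughly one thirtieth of the slot capture time, which is the margin behind the real-time claim.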

7. Conclusions

In this paper, an approach was proposed for detecting and classifying Internet traffic anomalies using the entropy of selected features, Mahalanobis distance, and OC-SVM with two kernels: RBF and Mahalanobis kernel. Regular and non-regular regions were built with “normal” traffic from training data. For the detection of an anomaly, computation times on the order of a few μs were obtained, which makes the approach suitable for real-time implementations.
In the detection stage, the highest true positive and true negative rates for all traces (99.35% for “normal” traffic and up to 99.83% for anomalous traffic) were obtained using the generalized entropies (particularly Tsallis entropy) with q = 0.01 and OC-SVM with RBF kernel. However, the choice of an optimal q is not addressed in this work.
In the classification stage:
  • For Academic LAN traces, the highest accuracy (99.30%) was obtained using Tsallis entropy with q = 0.01, OC-SVM with Mahalanobis kernel, and k = 5 for the k-temporal nearest neighbor algorithm.
  • For MIT-DARPA traces, the highest accuracy (99.99%) was obtained using the MD method, Rényi entropy with q = 0.5, and $k \ge 1$ for the k-temporal nearest neighbor algorithm.

Open Issues

For different networks, the larger the slot size, the more the entropy behaviors differ. In the near future, this behavior should be studied with more, and more recent, traces in order to determine whether a model learned from one network can be used in a different network.
In order to enhance our proposed approach, other classification techniques such as multi-class SVM should be studied.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive criticism, which helped to improve the presentation of this paper significantly.

Author Contributions

Jayro Santiago-Paz conceived the approach, designed and performed the experiments, and wrote the initial version of the manuscript, under the direction of his supervisor Deni Torres-Roman. Angel Figueroa-Ypiña implemented a part of OC-SVM stage. Jesus Argaez-Xool cooperated in the writing and revision of the manuscript. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41.
  2. Lakhina, A.; Crovella, M.; Diot, C. Mining Anomalies Using Traffic Feature Distributions. In Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, PA, USA, 22–26 August 2005; Volume 35, pp. 217–228.
  3. Wagner, A.; Plattner, B. Entropy Based Worm and Anomaly Detection in Fast IP Networks. In Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, Linköping, Sweden, 13–15 June 2005; pp. 172–177.
  4. Xu, K.; Zhang, Z.L.; Bhattacharyya, S. Profiling Internet Backbone Traffic: Behavior Models and Applications. In Proceedings of the 2005 conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, PA, USA, 22–26 August 2005; Volume 35, pp. 169–180.
  5. Santiago-Paz, J.; Torres-Roman, D.; Velarde-Alvarado, P. Detecting anomalies in network traffic using Entropy and Mahalanobis distance. In Proceedings of the 2012 22nd International Conference on Electrical Communications and Computers (CONIELECOMP), Cholula, Mexico, 27–29 February 2012; pp. 86–91.
  6. Santiago-Paz, J.; Torres-Roman, D. Characterization of worm attacks using entropy, Mahalanobis distance and K-nearest neighbors. In Proceedings of the 2014 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, Mexico, 26–28 February 2014; pp. 200–205.
  7. Mason, R.L.; Young, J.C. Multivariate Statistical Process Control with Industrial Applications; SIAM: Philadelphia, PA, USA, 2002; Volume 9.
  8. Li, K.L.; Huang, H.K.; Tian, S.F.; Xu, W. Improving one-class SVM for anomaly detection. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics, Xi’an, China, 2–5 November 2003; Volume 5, pp. 3077–3081.
  9. Zhang, R.; Zhang, S.; Lan, Y.; Jiang, J. Network anomaly detection using one class support vector machine. In Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), Hong Kong, China, 19–21 March 2008.
  10. Nychis, G.; Sekar, V.; Andersen, D.G.; Kim, H.; Zhang, H. An Empirical Evaluation of Entropy-based Traffic Anomaly Detection. In Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, Vouliagmeni, Greece, 20–22 October 2008; ACM: New York, NY, USA, 2008; pp. 151–156.
  11. Ziviani, A.; Gomes, A.T.A.; Monsores, M.L.; Rodrigues, P.S. Network anomaly detection using nonextensive entropy. IEEE Commun. Lett. 2007, 11, 1034–1036.
  12. Tellenbach, B.; Burkhart, M.; Schatzmann, D.; Gugelmann, D.; Sornette, D. Accurate Network Anomaly Classification with Generalized Entropy Metrics. Comput. Netw. 2011, 55, 3485–3502.
  13. Ma, X.; Chen, Y. DDoS Detection method based on chaos analysis of network traffic entropy. Commun. Lett. IEEE 2014, 18, 114–117.
  14. Bhuyan, M.H.; Bhattacharyya, D.; Kalita, J. An empirical evaluation of information metrics for low-rate and high-rate DDoS attack detection. Pattern Recognit. Lett. 2015, 51, 1–7.
  15. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  16. Rényi, A. Probability Theory; North-Holland Series in Applied Mathematics and Mechanics; Elsevier: Amsterdam, The Netherlands, 1970.
  17. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  18. Mahalanobis, P.C. On the Generalised Distance in Statistics; Proceedings of the National Institute of Science: Calcutta, India, 1936; Volume 2, pp. 49–55.
  19. Tracy, N.D. Multivariate control charts for individual observations. J. Qual. Technol. 1992, 24, 88–95.
  20. Box, G.E.P.; Cox, D.R. An Analysis of Transformations. J. R. Stat. Soc. B Stat. Methodol. 1964, 26, 211–252.
  21. Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.C.; Smola, A.J.; Williamson, R.C. Estimating the Support of a High-Dimensional Distribution. Neural Comput. 2001, 13, 1443–1471.
  22. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; ACM: New York, NY, USA, 1992; pp. 144–152.
  23. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
  24. Schölkopf, B.; Burges, C.J.C.; Smola, A.J. Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999.
  25. Abe, S. Training of Support Vector Machines with Mahalanobis Kernels. In Artificial Neural Networks: Formal Models and Their Applications—ICANN 2005; Lecture Notes in Computer Science; Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S., Eds.; Springer: Berlin, Germany, 2005; Volume 3697, pp. 571–576.
  26. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines; Technical Report MSR-TR-98-14; Microsoft Research: Redmond, WA, USA, 1998.
  27. Velarde-Alvarado, P.; Vargas-Rosales, C.; Torres-Román, D.; Martinez-Herrera, A. Entropy-based profiles for intrusion detection in LAN traffic. Adv. Artif. Intell. 2008, 40, 119–130.
  28. Kendall, K. A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems; Technical Report, DTIC Document; Massachusetts Institute of Technology: Cambridge, MA, USA, 1999.
  29. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  30. Kohavi, R.; Provost, F. Glossary of Terms. J. Mach. Learn. 1998, 30, 271–274.
  31. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2.

Share and Cite

MDPI and ACS Style

Santiago-Paz, J.; Torres-Roman, D.; Figueroa-Ypiña, A.; Argaez-Xool, J. Using Generalized Entropies and OC-SVM with Mahalanobis Kernel for Detection and Classification of Anomalies in Network Traffic. Entropy 2015, 17, 6239-6257. https://doi.org/10.3390/e17096239
