A Big Network Traffic Data Fusion Approach Based on Fisher and Deep Auto-Encoder

Abstract: Data fusion is usually performed prior to classification in order to reduce the input space. These dimensionality reduction techniques help to decrease the complexity of the classification model and thus improve classification performance. Traditional supervised methods demand labeled samples, while most current network traffic data is unlabeled. Better learners can therefore be built by using both labeled and unlabeled data than by using either alone. In this paper, a novel network traffic data fusion approach based on Fisher and deep auto-encoder (DFA-F-DAE) is proposed to reduce the data dimensions and the complexity of computation. The experimental results show that the DFA-F-DAE improves the generalization ability of three classification algorithms (J48, back propagation neural network (BPNN), and support vector machine (SVM)) by data dimensionality reduction. We found that the DFA-F-DAE remarkably improves the efficiency of big network traffic classification.


Introduction
Nowadays, to enhance network security, a variety of security devices are used, such as firewalls, intrusion detection systems (IDS), intrusion prevention systems (IPS), antivirus software, security audits, etc. Though all kinds of monitoring approaches and reporting mechanisms provide big data for network management personnel, the lack of effective network traffic data fusion has become a stumbling block to solving various issues in network security situation awareness (NSSA). In such circumstances, research on data fusion as one of the next-generation security solutions has substantial academic and practical value.
Data fusion in NSSA aims to effectively eliminate the redundancy of big network traffic data by feature extraction, classification, and integration, so that network management personnel can achieve situational awareness quickly. Therefore, how to build a suitable data fusion algorithm is one of the important issues in NSSA. Feature extraction is the key to the data fusion algorithm because its performance directly affects the result of fusion. Feature extraction, as a preprocessing method to overcome the curse of dimensionality, aims at extracting from big data a few features that can represent the original data, by analyzing its internal characteristics. Classic methods include principal components analysis (PCA) [1], linear discriminant analysis (LDA) [2], the Fisher score [3], etc.
In 2006, a significant breakthrough in the effective training of deep architectures [4] came with the unsupervised greedy layer-wise pre-training algorithm followed by supervised fine-tuning. Since then, denoising auto-encoders [5], convolutional neural networks [6], deep belief networks [7], and other deep learning models have been put forward as well. Currently, deep learning theory has been successfully applied to a variety of real-world applications, including face/image recognition, voice search, speech-to-text transcription, spam filtering (anomaly detection), e-commerce fraud detection, regression, and other machine learning fields.
In this paper, a novel network traffic data fusion approach based on Fisher and deep auto-encoder (DFA-F-DAE) is proposed to reduce the data dimensions and the complexity of computation, which is helpful for handling big network traffic data effectively. The experimental results indicate that the proposed approach improves the generalization ability of the classification algorithms by data dimensionality reduction. Furthermore, it can reduce the redundancy of big network traffic data. Under the premise of ensuring classification accuracy, the DFA-F-DAE reduces the time complexity of classification.
The rest of this paper is organized as follows. Section 2 describes related works. Section 3 reviews the concepts of the Fisher score and the deep auto-encoder. In Section 4, the data fusion approach based on Fisher and the deep auto-encoder is proposed. The experimental results and discussion are covered in Section 5. Finally, the conclusion and future work are presented in Section 6.

Related Work
Network security issues are becoming more prominent with each passing day and have become a key research topic that needs to be dealt with urgently [8]. In 1999, Bass proposed the concept of NSSA [9]. Its main goal is to obtain macro-level information from multiple sources of network security information by extraction, refinement, and fusion; this information can then help administrators deal with various kinds of security problems in the network. Soon after, Bass proposed a framework for intrusion detection based on multi-sensor data fusion, and pointed out that next-generation network management systems and intrusion detection systems will interact in a unified model, fusing data into information to help network administrators make decisions. Since the objects of NSSA are mostly data, research on data fusion in NSSA [10,11] has gradually become a development trend.
Data fusion technology dates from the 1970s, when it was mainly employed in the military area. As the technology developed rapidly, data fusion gradually extended to civilian areas, and has been widely employed in urban mapping [12], forest-related studies [13], oil slick detection and characterization [14], disaster management [15], remote sensing [16], and other fields. Accordingly, all sorts of data fusion approaches have been proposed. Li et al. [17] proposed a trust-based fusion mechanism, MCMR, which considers historical and temporal correlation and draws up a situation trust awareness rule based on historical trust and current data correlation. Papadopoulos et al. [18] used a data fusion method to present SIES, a scheme that solves exact SUM queries through a combination of homomorphic encryption and secret sharing. A distributed data fusion technique for multi-sensor multi-target tracking is provided by Akselrod et al. [19]. A few examples that introduce Fisher into data fusion are as follows. Zeng et al. [20] proposed a sensor fusion framework for adaptive activity recognition with dynamic heterogeneous sensors, and incorporated popular feature transformation algorithms, e.g., marginal Fisher's analysis and maximum mutual information, into the proposed framework. Chen et al. [21] introduced the finite mixture of Von Mises-Fisher (VMF) distributions for observations that are invariant to actions of a spherical symmetry group; the approach reduced the computation time by a factor of 2. Wang [3] described an interpolation family that generalizes the Fisher scoring method and proposed a general Monte Carlo approach to dimensionality reduction.
Recently, deep learning has attracted wide attention again since an efficient layer-wise unsupervised learning strategy was proposed to pre-train this kind of deep architecture. Deep learning focuses on the deep structure of neural networks, with the purpose of realizing a machine that has cognitive capabilities similar to those of the human brain. In 2006, Hinton et al. proposed deep belief nets (DBN) [4], which were composed of multiple logistic belief neural networks and one restricted Boltzmann machine (RBM). In recent years, deep learning has been successfully applied to various applications, such as dimensionality reduction, object recognition, and natural language processing. For example, Bu et al. [22] proposed to fuse the different modality data of 3D shapes into a deep learning framework, which combined intrinsic and extrinsic features to provide complementary information for better discriminability, mining the deep correlations of the different modalities. Gu et al. [23] used the quasi-Newton method, the conjugate gradient method, and the Levenberg-Marquardt algorithm to improve the traditional BP neural network algorithm, and eventually obtained converged data as well as improved traffic flow accuracy. Furthermore, in [24], speech features were used as input to a pre-trained DBN in order to extract bottleneck (BN) features, and the DBN hybrid system outperformed the BN system. Although varieties of deep learning algorithms have been applied in the field of data fusion, there are only a few studies of auto-encoder algorithms in this field. Felix et al. [25] investigated blind feature space de-reverberation and deep recurrent de-noising auto-encoders (DAE) in an early fusion scheme; they then proposed early feature-level fusion with model-based spectral de-reverberation and showed that this further improves performance. A sparse auto-encoder (SAE) has proven to be an effective way of dimension reduction and data reconstruction in practice [26].

Fisher Score
The Fisher score [27] is a classical algorithm widely used in statistics, pattern recognition, and machine learning. It is a well-known method to establish a linear transformation that maximizes the ratio of between-class scatter to average within-class scatter in the lower-dimensional space. In statistical pattern recognition, the Fisher score is used to reduce the dimension of a given statistical model by searching for a transform. F is the class-to-class variation of the detected signal divided by the sum of the within-class variations of the signal, and F is defined as follows [28]:

$$F = \frac{\sigma_{between}}{\sigma_{within}} \qquad (1)$$

where $\sigma_{between}$ is the class-to-class variation and $\sigma_{within}$ is the within-class variation:

$$\sigma_{between} = \frac{1}{k-1}\sum_{i=1}^{k} n_i \left( \overline{x}_i - \overline{x} \right)^2 \qquad (2)$$

where $n_i$ is the number of measurements in the $i$th class, $\overline{x}_i$ is the mean of the $i$th class, $\overline{x}$ is the overall mean, and $k$ is the number of classes;

$$\sigma_{within} = \frac{1}{N-k}\sum_{j=1}^{k}\sum_{i=1}^{n_j} \left( x_{ij} - \overline{x}_j \right)^2 \qquad (3)$$

where $x_{ij}$ is the $i$th measurement of the $j$th class, and $N$ is the total number of sample profiles.
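As an illustration, the ratio of between-class to within-class variation described above can be computed per feature with a short numpy sketch (the function name and the per-feature vectorization are ours, not from the paper):

```python
import numpy as np

def fisher_score(x, y):
    """Per-feature Fisher score: between-class variation over within-class variation.

    x : (N, d) feature matrix; y : (N,) integer or string class labels.
    sigma_between averages class-mean deviations weighted by class size;
    sigma_within pools squared deviations from each class mean.
    Returns one score per feature (higher = more discriminative).
    """
    classes = np.unique(y)
    k, N = len(classes), len(y)
    overall_mean = x.mean(axis=0)
    between = np.zeros(x.shape[1])
    within = np.zeros(x.shape[1])
    for c in classes:
        xc = x[y == c]
        class_mean = xc.mean(axis=0)
        between += len(xc) * (class_mean - overall_mean) ** 2
        within += ((xc - class_mean) ** 2).sum(axis=0)
    sigma_between = between / (k - 1)
    sigma_within = within / (N - k)
    return sigma_between / sigma_within
```

Features can then be ranked by their score and the top-ranked subset retained, which is how the Fisher branch of the fusion approach filters features.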

Deep Auto-Encoder
An auto-encoder (AE) is a specialized neural network composed of three layers: an input layer, a hidden layer (so called because its values are not observed in the training set), and an output layer. The output of the second layer acts as a compact representation, or "code", for the input data. The function of an AE is much like that of principal component analysis (PCA), but an AE works in a non-linear fashion. Auto-encoders are unsupervised learning algorithms that attempt to reconstruct the visible-layer data in the reconstruction layer. The idea of the AE has been extended to several variants such as the deep AE [29], sparse AE, denoising AE [5], and contractive AE [30]. All of these ideas have been formalized and successfully applied to various applications, and have become an important part of deep learning. An AE is shown in Figure 1. Suppose a set of unlabeled training samples $x = (x_1, x_2, \cdots, x_i), i \in (1, 2, \cdots, n)$, where $x_i \in \mathbb{R}^n$. The AE neural network is an unsupervised learning algorithm that utilizes backpropagation, setting the target values to be equal to the inputs; this means $y_i = x_i$. The AE attempts to learn a function $h_{W,b}(x) \approx x$, i.e., an approximation to the identity function. In Figure 1, the circles labeled "+1" are called bias units, and correspond to the intercept term.
In our scheme, we choose $f(\cdot)$ to be the sigmoid function $f(z) = \frac{1}{1+e^{-z}}$. $l$ denotes the number of layers in our network. $W^{(l)}_{ij}$ denotes the parameter (or weight) associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. Also, $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. $s_l$ denotes the number of nodes in layer $l$ (not counting the bias unit). $a^{(l)}_i$ denotes the activation (output value) of unit $i$ in layer $l$. For $l = 1$, we also use $a^{(1)}_i = x_i$ to denote the $i$th input. Given a fixed setting of the parameters $(W, b)$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. Particularly, the calculation is given by

$$a^{(2)}_j = f\Big(\sum_{k=1}^{s_1} W^{(1)}_{jk} x_k + b^{(1)}_j\Big) \qquad (4)$$

$$h_{W,b}(x)_i = f\Big(\sum_{j=1}^{m} W^{(2)}_{ij} a^{(2)}_j + b^{(2)}_i\Big) \qquad (5)$$

where $m$ is the number of hidden nodes. Suppose we are given a fixed training set of $n$ training examples. The overall cost function is defined as

$$J(W,b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\,\big\| h_{W,b}(x_i) - y_i \big\|^2 + \frac{\lambda}{2}\sum_{l}\sum_{i}\sum_{j} \big( W^{(l)}_{ji} \big)^2 \qquad (6)$$

where $\lambda$ is the weight decay parameter. The first term in the definition of $J(W,b)$ is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights and helps to prevent overfitting.
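The cost described above (average sum-of-squares reconstruction error plus a weight decay term) can be sketched for a three-layer AE in numpy; the function names and array shapes are our own assumptions, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    """The activation f(z) = 1 / (1 + exp(-z)) chosen in the scheme."""
    return 1.0 / (1.0 + np.exp(-z))

def ae_cost(W1, b1, W2, b2, X, lam):
    """Cost J(W, b) of a three-layer auto-encoder on inputs X of shape (n, d).

    Targets equal inputs (y_i = x_i). First term: average sum-of-squares
    reconstruction error. Second term: weight decay over all connection
    weights, scaled by the weight decay parameter lam.
    """
    A2 = sigmoid(X @ W1.T + b1)   # hidden activations a^(2)
    H = sigmoid(A2 @ W2.T + b2)   # reconstruction h_{W,b}(x)
    n = X.shape[0]
    err = 0.5 * np.sum((H - X) ** 2) / n
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return err + decay
```

With all-zero parameters the reconstruction is a constant 0.5 for every component, so inputs equal to 0.5 give zero reconstruction error; any other input incurs a positive cost.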

Fine-Tune
In order to minimize $J(W, b)$ as a function of $W$ and $b$, we initialize every parameter $W^{(1)}_{ij}$ and every $b^{(1)}_i$ to a small random value close to 0. Then we use a fine-tuning algorithm, for instance, batch gradient descent (BGD). Gradient descent can lead to local optima because $J(W, b)$ is a non-convex function; however, it usually works quite well in practice. Finally, note that it is important to initialize the parameters randomly rather than to all 0's: random initialization serves the purpose of symmetry breaking, since otherwise all hidden units would learn the same function of the input.
One iteration of gradient descent updates the parameters $W^{(l)}, b^{(l)}$ as follows. Firstly, compute the error terms $\delta^{(l)}$ of each layer by backpropagation. Secondly, compute the desired partial derivatives $\nabla_{W^{(l)}} J(W,b)$ and $\nabla_{b^{(l)}} J(W,b)$. Thirdly, update $\Delta W^{(l)}, \Delta b^{(l)}$:

$$\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b), \qquad \Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b)$$

Finally, reset $W^{(l)}, b^{(l)}$:

$$W^{(l)} := W^{(l)} - \alpha\, \Delta W^{(l)}, \qquad b^{(l)} := b^{(l)} - \alpha\, \Delta b^{(l)}$$

where $\alpha$ is the learning rate.
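The steps above can be sketched as a single batch-gradient-descent iteration for the three-layer AE with sigmoid activations; this is an illustrative sketch under our own naming and shape conventions, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bgd_step(W1, b1, W2, b2, X, lam, alpha):
    """One batch-gradient-descent update of the auto-encoder parameters.

    Mirrors the steps in the text: compute error terms by backpropagation,
    accumulate the partial derivatives over the whole batch X (n, d), then
    update each W, b with learning rate alpha and weight decay lam.
    """
    n = X.shape[0]
    # forward pass
    A2 = sigmoid(X @ W1.T + b1)
    H = sigmoid(A2 @ W2.T + b2)
    # error terms (delta) for the output and hidden layers
    d3 = (H - X) * H * (1 - H)
    d2 = (d3 @ W2) * A2 * (1 - A2)
    # accumulated partial derivatives over the batch, plus weight decay
    gW2 = d3.T @ A2 / n + lam * W2
    gb2 = d3.mean(axis=0)
    gW1 = d2.T @ X / n + lam * W1
    gb1 = d2.mean(axis=0)
    # parameter update with learning rate alpha
    return (W1 - alpha * gW1, b1 - alpha * gb1,
            W2 - alpha * gW2, b2 - alpha * gb2)
```

Iterating this step until the cost falls below the threshold θ reproduces the training loop used later in the DAE branch of the fusion approach.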

Data Fusion Approach Based on Fisher and Deep Auto-Encoder (DFA-F-DAE)
Machine-learning methods are generally divided into two categories: supervised and unsupervised. In supervised methods, the training data is fully labeled and the goal is to find a mapping from input features to output classes. On the contrary, unsupervised methods devote themselves to discovering patterns in unlabeled data such that traffic with similar characteristics is grouped without any prior guidance from class labels. Unsupervised methods need to be further transformed into a classifier for the online classifying stage. In general, supervised methods are more precise than unsupervised ones. In turn, unsupervised methods have some significant advantages, such as the elimination of the requirement for fully labeled training data sets and the ability to discover hidden classes that might represent unknown applications. Furthermore, unlabeled data is cheap to obtain, whereas labeling it requires experts and special devices, so it is not practical to deal with it using traditional feature extraction methods alone. Therefore, in this paper, combining the robustness of a traditional feature extraction method (Fisher) with the unsupervised learning advantages of the deep auto-encoder, we propose a novel network traffic data fusion approach. In particular, the Fisher score, as a high-efficiency filter-based supervised feature selection method, evaluates and sorts the features by the internal properties of each single feature, according to the feature selection criteria of minimum intra-cluster distance and maximum inter-cluster distance. The architecture of DFA-F-DAE is shown in Figure 2.
The DFA-F-DAE aims to fuse network traffic data by two approaches (Fisher and DAE). The details are below.
Fisher:
I. Input a small labeled sample set.
II. Use Formula (1) to compute F and weight the features based on F.
III. Order the features based on the weight.
IV. Build the filter of the feature f1 and get feature subset A1.

DAE:
I. Initialize the parameters of each layer and build the AE model.
II. Input a large number of unlabeled samples.
III. Set a threshold value θ, then compute the cost function according to Formula (6).
IV. If J(W, b) ≤ θ, the process continues; if J(W, b) > θ, reset the parameters of each layer until J(W, b) ≤ θ.
V. Build the filter of the feature f2 and get feature subset A2.

In the end:
Merge A1 and A2.
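The final merge step can be sketched as a union of the two ranked feature subsets. The text does not specify how the trained DAE yields a per-feature relevance score, so scores_dae is treated here as an assumed input (hidden-weight magnitude is one plausible choice), and the function name is our own:

```python
import numpy as np

def dfa_f_dae_select(scores_fisher, scores_dae, m1, m2):
    """Merge the Fisher and DAE feature subsets.

    scores_fisher : per-feature Fisher scores from the labeled branch.
    scores_dae    : assumed per-feature relevance from the trained DAE
                    (not spelled out in the text; an assumption here).
    Returns the sorted union A1 ∪ A2 of the top-m1 and top-m2 feature indices.
    """
    A1 = set(np.argsort(scores_fisher)[::-1][:m1])  # filter f1 -> subset A1
    A2 = set(np.argsort(scores_dae)[::-1][:m2])     # filter f2 -> subset A2
    return sorted(A1 | A2)
```

The union keeps a feature if either branch judges it relevant, so the labeled and unlabeled views of the traffic data complement each other.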

The Experiment Design and the Result Analysis
Below we present the dataset, the experimental environment, and the experimental results. Classification performance is used as the evaluation criterion in our experiments.

Dataset
The 1999 DARPA IDS data set [31], KDD99 for short, is a well-known standard network security dataset, which was collected at MIT Lincoln Labs. The attack types are divided into four categories: (1) Denial-of-Service (DoS); (2) Surveillance or Probe (Probe): surveillance and other probing; (3) User to Root (U2R): unauthorized access to local superuser (root) privileges; (4) Remote to Local (R2L): unauthorized access from a remote machine. The experiment used repeated sampling to randomly extract 10,000 flows from KDD99 as the train-set. The test-set contains 500,000 flows. Since the Normal type is easily misclassified, we increased the proportion of the Normal type in the test-set, and the other types were randomly selected. The composition of the data set is shown in Table 1.
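The sampling scheme above can be sketched as follows; the weight given to the Normal class and all helper names are assumptions, since the text does not state the exact proportions:

```python
import random

def build_datasets(flows, n_train=10_000, n_test=500_000,
                   normal_weight=2.0, seed=42):
    """Draw train/test flows from a labeled pool by repeated random sampling.

    flows : list of (features, label) pairs, e.g. parsed KDD99 records.
    The train-set is drawn uniformly at random with replacement (matching
    'repeated sampling'); the test-set over-weights the easily-misclassified
    Normal class by normal_weight (the exact proportion is an assumption).
    """
    rng = random.Random(seed)
    train = [rng.choice(flows) for _ in range(n_train)]
    weights = [normal_weight if label == "normal" else 1.0
               for _, label in flows]
    test = rng.choices(flows, weights=weights, k=n_test)
    return train, test
```

With normal_weight greater than 1, Normal flows appear in the test-set more often than their share of the pool, while the attack types remain randomly selected.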

Experimental Environment
Experimental environment: Matlab version 8.0.0 (The MathWorks Inc., Natick, MA, USA) and Weka version 3.7.13 (University of Waikato, New Zealand) were used as the tools for data processing and analysis in the experiments. The node configuration information is shown in Table 2.

Experimental Results
In order to verify the validity of the proposed DFA-F-DAE, our experimental evaluation considers two standards of classification: one is classification accuracy, which reflects the effectiveness of classification, while the other is classification time, which reflects the efficiency of classification.

Classification Accuracy under Different Dimensionalities
In order to choose proper dimensionalities for the DFA-F-DAE, we measured the performance of three classification algorithms (J48, BPNN, and SVM) under different dimensionalities. From Table 3, it can be seen that the classification times of the three algorithms after data fusion all show a sharp decline compared with the classification times before data fusion. This is because the DFA-F-DAE reduces the dimensionality and thereby reduces the time complexity of classification. Note that the classification times of BPNN and SVM decreased more sharply than that of J48, since the dimensionality of the test-set has a great influence on nonlinear computation, and BPNN and SVM require a large amount of nonlinear computation. Obviously, the DFA-F-DAE remarkably improves the efficiency of big network traffic classification.

Conclusions
In recent years, a few methods for data fusion utilizing machine learning approaches have been proposed, such as Dempster-Shafer (D-S) evidence theory, principal components analysis (PCA), etc. Although these methods have shown promising potential and robustness, there are still several challenges, such as the curse of dimensionality, because datasets are often of high dimension. The architecture of the DFA-F-DAE has proven useful to overcome this drawback by reducing dimensionality and generalization error. The experimental study shows that the proposed architecture outperforms traditional methods in terms of classification time and classification accuracy. Our future work will study the influence of the DFA-F-DAE on classification results; another interesting research topic is to realize data fusion of big data with MapReduce.

Figure 1. The architecture of the auto-encoder.


Figure 2. The architecture of DFA-F-DAE.

Table 1. The composition of the data set.

Table 2. Node configuration information.