A New Method of Fuzzy Support Vector Machine Algorithm for Intrusion Detection

Abstract: Because SVM is sensitive to noises and outliers in system call sequence data, this paper presents a new fuzzy support vector machine algorithm based on SVDD. In our algorithm, the noises and outliers are identified by a hypersphere of minimum volume that contains as many samples as possible. The fuzzy membership is defined by considering not only the relation between a sample and the hyperplane, but also the relation between samples. For each sample inside the hypersphere, the fuzzy membership function is a linear function of the distance between the sample and the hyperplane: the greater the distance, the greater the weight coefficient. For each sample outside the hypersphere, the membership function is an exponential function of the distance between the sample and the hyperplane: the greater the distance, the smaller the weight coefficient. Compared with the traditional fuzzy membership definition based on the relation between a sample and its cluster center, our method effectively distinguishes noises and outliers from support vectors and assigns them appropriate weight coefficients even when they are distributed on the boundary between the positive and negative classes. The experiments show that the proposed fuzzy support vector machine is more robust than the support vector machine and than fuzzy support vector machines based on the distance between a sample and its cluster center.


Introduction
Intrusion detection systems (IDS) are essential to information security. IDS can be divided into signature-based IDS and anomaly-based IDS [1]. Both are based on pattern detection. Signature-based IDS match system behavior against known attacks and therefore cannot detect zero-day attacks. Anomaly-based methods construct a model of normal behavior from prior knowledge and measure the deviation of the current behavior from it [2]. The advantage of the anomaly-based method is its ability to detect new attacks.
A system call is a function built into the operating system kernel that an application requests. A system call sequence is a detailed record of the system calls occurring on a host. The behavior of an application can be described in terms of its sequence of system calls, and this sequence is easy to obtain in real time. Therefore, system call sequence data is often used as audit data for the analysis and classification of malicious processes.
Current anomaly detection methods are based on traditional statistics, which studies asymptotic theory; that is, its limit properties hold only as the number of samples approaches infinity. The system call algorithms used in anomaly-based IDS therefore need a large amount of data. In intrusion detection systems, however, the observed samples are limited or even scarce, and anomaly data in particular is very limited. This does not satisfy the preconditions of detection methods based on traditional statistics, so the false alarm rate and missing rate are high. Therefore, we classify system calls using an SVM-based algorithm, which is based on statistical learning theory. Statistical learning theory makes the SVM-based classifier depend only on a small number of support vectors (SVs), which is very helpful for training classifiers with insufficient data. Like other algorithms, the SVM-based algorithm has an inherent shortcoming: it is sensitive to noises and outliers. The fuzzy support vector machine (FSVM) was proposed to distinguish noises near the boundary from SVs. Because there are no uniform guiding principles for its design, the existing FSVM-based algorithms are inconsistent with reality when classifying system call sequences. Therefore, the purpose of our paper is to construct a fuzzy support vector machine that is suitable for classifying system call sequences.
Support vector machine [3][4][5][6], a machine learning method based on statistical learning theory, uses the dual form to solve high-dimensional problems, makes the classifier depend only on a small number of support vectors, implements the structural risk minimization principle of statistical learning theory, and addresses the problems of nonlinearity and local minima. A system call sequence can be converted into a vector in a high-dimensional space by counting the frequency of short system call sequences of a certain length. Therefore, anomaly detection can be carried out based on SVM.
Noises and outliers are always present in practical engineering applications because of statistical methods, human error and other factors. These noises and outliers do not satisfy the precondition that all samples are independent and identically distributed. Noises and outliers near the boundary play the same role as SVs in constructing the optimal classification hyperplane. To solve this problem, researchers proposed the fuzzy support vector machine (FSVM), in which different weights are assigned to different samples so that they contribute differently to the optimal classification hyperplane. To eliminate the influence of noises and outliers, small weights are given to these samples. The design of the membership function is the key to the whole fuzzy algorithm, yet there is no uniform criterion to follow. At present, there are many ways to construct a membership function. Most of them are based on the distance between a sample and its cluster center: the closer the sample is to the cluster center, the greater the weight coefficient. The noises and the SVs distributed on the boundary are both far from the cluster center, so they are all given small weight coefficients. This is quite different from the objective situation, in which SVs should be given greater weight coefficients.
To solve the above problem, our paper proposes a new fuzzy support vector machine method based on support vector data description (SVDD). The contributions of our work are as follows:

• Our paper proposes a new fuzzy membership function, based on SVDD, which can effectively distinguish the noises from the SVs distributed on the boundary. SVs are given larger weight coefficients while noises are given smaller coefficients. In this way, our method avoids the imperfection of FSVM based on the distance between a sample and its cluster center. Such a construction of the fuzzy membership function is more in line with reality.

• The proposed method replaces each cluster center with the hyperplane that passes through the cluster center and takes the line connecting the two cluster centers as its normal vector. This is in accordance with the geometric principle of SVM, namely that two hyperplanes with maximum spacing are used to separate the training samples. Therefore, using the in-class hyperplane to replace the cluster center better approximates the actual situation.

• Our method is more efficient, especially for anomaly-based IDS with high real-time requirements. In our method, the noises and outliers are identified by a sphere of minimum volume that contains as many samples as possible. Some non-contributing vectors are eliminated by pre-extracting the boundary vector set containing the support vectors. This reduces the number of training samples and speeds up training. It is important that an IDS speeds up detection as much as possible by reducing computation and storage.
The remainder of the paper is organized as follows. Section 2 reviews previous work based on system calls, and Section 3 describes the proposed method. Section 4 presents the experiments and evaluations. Section 5 gives some conclusions.

Previous Work
Anomaly detection can be studied through system call sequences from different angles. These methods focus primarily on data processing, data representation and other feature selections derived from system call sequences. In this section, some anomaly-based IDS methods based on system calls are discussed.
As the original system call trace data is large, preprocessing and feature selection help to obtain typical features and to avoid the influence of irrelevant and redundant features on detection rate and processing cost [7,8]. Methods commonly used for natural language processing are used to preprocess system call traces. The n-gram method constructs the system call database of normal behavior with a sliding window of a single length or of multiple lengths [9,10]. Aron Laszka et al. [11,12] reported that the optimal n-gram is the 6-gram on the UNM dataset and the 7-gram on the ADFA-LD dataset. Suaad et al. [13] further showed that, on a dataset collected from a virtual machine, the 6-gram has the advantage in time efficiency and the 10-gram in detection rate. Feature selection methods reduce redundancy and irrelevance by selecting the features of interest. These methods help to reduce computation time and storage requirements, to understand the data without noise, to avoid over-fitting, and to increase the accuracy rate. Feature selection can be divided into the wrapper approach [14], the filter approach [15] and the hybrid approach [16] according to how it relates to the learning algorithm. The wrapper approach selects the best features based on the feedback given by the accuracy rate. The filter approach evaluates the attributes using statistical properties of the data rather than the feedback of a learning algorithm. The wrapper approach achieves better classification performance than the filter approach at the expense of costly computation, while the filter approach is better suited to high-dimensional data. The hybrid approach combines the filter and wrapper approaches to balance their advantages and disadvantages.
The enumerating-sequences-based methods [17][18][19] are simple and efficient to implement because they remove system call parameters. During the database construction stage, normal behaviors are represented by short system call sequences. During the monitoring stage, the short sequences of the testing data are extracted and checked against the database. These methods require constructing, updating and maintaining a normal database for each individual program [20,21]. MurmurHash [22][23][24] is used with the Bloom-filter-based method, which gives it an advantage over STIDE in memory occupation, search speed and privacy preservation. Although the Bloom-filter-based method is simple and effective, it is limited by false positives.
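The two stages of an enumerating-sequences method can be illustrated with a minimal sketch. It is not the implementation of any cited system: the window length `k`, the function names and the mismatch-rate score are assumptions made for illustration, and the traces are toy data.

```python
def short_sequences(trace, k):
    """Return every contiguous window of length k in a system call trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_normal_database(normal_traces, k):
    """Database construction stage: collect the short sequences of normal traces."""
    database = set()
    for trace in normal_traces:
        database.update(short_sequences(trace, k))
    return database


def mismatch_rate(trace, database, k):
    """Monitoring stage: fraction of windows that are absent from the database."""
    windows = short_sequences(trace, k)
    if not windows:
        return 0.0
    misses = sum(1 for window in windows if window not in database)
    return misses / len(windows)


# Toy traces (hypothetical system call numbers).
normal = [[4, 2, 5, 66, 5, 6], [4, 2, 5, 66, 5, 2]]
db = build_normal_database(normal, k=3)
print(mismatch_rate([4, 2, 5, 66, 5, 6], db, k=3))   # 0.0, every window is known
print(mismatch_rate([4, 2, 99, 66, 5, 6], db, k=3))  # 0.75, three of four windows contain 99
```

A set gives exact membership tests; a Bloom filter would trade that exactness for lower memory, which is where the false positives mentioned above come from.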
The system call sequence can be represented as a vector. Qing et al. extracted a minimized set of rules to define normal behavior and detected anomalous behavior based on rough sets [25,26]. Pandit predefined a workflow and added a knowledge base of workflows [27][28][29]. A search engine is then applied to discover the hidden knowledge [30]. The drawback of the rule-based approach [31][32][33] is that, since the rules are derived from small-scale datasets, they must be constantly updated. Matej et al. [34] presented a new data collection tool for Windows PCs. Unlike previous distributed data collection systems, this system uses fewer resources thanks to its host and client structure. The system shows good performance in a real test environment. However, the work is at a preliminary stage and much remains to be done.
IDS plays an important role in the network. Qiuhua et al. [35] proposed a classification algorithm based on data clustering and data reduction, using the mini batch K-Means algorithm in the training stage and cluster sorting in the detection stage. Experiments indicated that the computational complexity was reduced significantly while the accuracy remained high. However, the implementation of this classification method is complex.
In recent years, neural networks have made remarkable achievements in computer vision [36][37][38] and natural language processing [39][40][41]. Researchers have also tried to use neural networks to process system call sequences [42][43][44][45][46]. AnRAD [47] performs probabilistic inference with a self-structuring confabulation network. The network continuously refines its knowledge base and is capable of fast incremental learning. Sheraz et al. [48] implemented intrusion detection in a real environment based on a convolutional neural network. To accelerate the training process, multiple GPUs must be deployed on a physical host. The challenge for solutions based on neural networks is that they are costly and space-consuming as the amount of data increases.
Ambusaidi et al. [49] showed that their method provides more critical features for the least squares support vector machine, achieving a better detection rate and lower computation cost. Gideon [50] applied a semantic structure to system calls. This approach facilitates the representation of software behavior and obtains excellent results on the UNM dataset and the KD98 dataset. Wael et al. [51] presented a heterogeneous detector consisting of sequence time-delay embedding, a hidden Markov model [52][53][54] and a one-class support vector machine. In addition to giving satisfactory results, the heterogeneous detector is also reliable. Michael et al. [55] detected features of the system in the hypervisor. Their experiments demonstrated a detection accuracy of 90% and the ability to detect DoS attacks. Frequency-based algorithms can be realized at a lower computation cost by reducing the dimension of the frequency vectors [56]. SVM is sensitive to noises and outliers [57][58][59][60]. FSVM uses fuzzy theory to reduce the influence of noises or outliers on the classification hyperplane [61][62][63][64][65][66]. Lin et al. [67] proposed a method based on the relation between samples and their cluster center. Zhang et al. [68] proposed a new FSVM method after considering the imperfection of the distance-based algorithm. However, in reducing the influence of noises or outliers on the hyperplane, the above methods also reduce the effect of SVs on it.

Methodology Based on SVDD
Due to human error, random error and other factors in the data collection process, the sample set contains a small number of noise samples. Noise samples have a great influence on the construction of the optimal classification hyperplane: they make it deviate from the optimal position, reduce the generalization ability of the classifier, and degrade the classification result. FSVM assigns different weight coefficients to different samples so that each sample has a different effect on the optimal classification hyperplane.
Figure 1 shows an overview of our algorithm, which consists of three steps. The first step is to obtain the positive and negative minimum hyperspheres from the training data using SVDD. By finding the minimum-volume hypersphere containing a sample set, the target samples are included in the hypersphere as far as possible, and the non-target samples are excluded from it. The second step is to calculate the distance between the samples and the hyperplanes. The position of a sample is determined by comparing its distance to the center with the radius. In the third step, samples inside and outside the hyperspheres are assigned memberships by different functions. Our algorithm is described in detail below.

Analysis
The definition of the fuzzy membership function is the key to the FSVM algorithm. There are many definitions of fuzzy membership functions, but there are no general guidelines. Traditional fuzzy membership functions based on the distance between a sample and its cluster center are not effective at distinguishing noises or outliers from SVs. Distance is sometimes not the only criterion for judging whether a sample is normal. As shown in Figure 2, point A on the left, in Figure 2a, has a high probability of being a valid sample, whereas point A on the right, in Figure 2b, has a high probability of being a noise or outlier. It is therefore not enough to rely only on linear functions of distance. The relative position of samples should be considered; that is, the fuzzy membership function must take the affinity between samples into account. As shown in Figure 3, the distance between B and the cluster center is not the same as the distance between C and the cluster center. Since these two points have the same distance to the classification hyperplane, they contribute equally to the hyperplane. Compared with points B and C, point D is further away from the cluster center but closer to the classification hyperplane, so point D contributes the most to the hyperplane. This must be taken into account when designing membership functions. The optimal classification hyperplane and its nearby support vectors are far away from the cluster center. Under the traditional definition, the closer a sample is to the cluster center, the greater its weight coefficient. The noises and the SVs distributed on the boundary are both far from the cluster center, so they are all given small weight coefficients. This is quite different from the objective situation, in which SVs should be given greater weight coefficients.
The optimal hyperplane of the standard SVM is determined by the SVs. The geometric principle of SVM is to use two hyperplanes with maximum spacing to separate the training samples as far as possible in the original space or the feature space. Therefore, using the in-class hyperplane to replace the center better approximates the actual situation.
Finding the optimal classification hyperplane of SVM is usually converted into solving a quadratic programming problem. However, the complexity of solving the quadratic programming problem increases significantly with the number of samples. When the sample size is large, the traditional fuzzy support vector machine needs a large amount of memory to store and compute the kernel function matrix. Therefore, by selecting the boundary vectors containing the SVs in advance, the number of training samples and the size of the quadratic programming problem can be reduced. This has practical significance for improving training speed and accuracy.

Support Vector Data Description
The task of one-class classification is to distinguish the target samples from the non-target samples. The boundary is constructed as a hypersphere around the target samples. By finding the minimum-volume hypersphere containing the sample set, the target samples are included in the hypersphere as far as possible, and the non-target samples are excluded from it.
For ease of description, let $\phi: \mathbb{R}^n \to H$ denote the mapping from the input space to the high-dimensional space. Assume that $T = \{x_i,\ i = 1, 2, \dots, l\}$ contains $l$ data objects and that the hypersphere is described by its center $a$ and radius $R$. The minimum hypersphere can be obtained by solving the following quadratic programming problem (1):

$$\min_{R,\,a,\,\xi}\ R^2 + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \dots, l \tag{1}$$

where $\|\cdot\|$ is the Euclidean norm, $\xi_i$ are slack variables, and $C$ is the trade-off between the volume of the hypersphere and the errors. Lagrangian multipliers $\alpha_i$ and $\beta_i$ are introduced to construct the Lagrangian function shown in Equation (2).
The center $a$ and radius $R$ can be obtained from the solution of the dual problem and the KKT conditions; they are calculated according to Formulas (3) and (4), respectively.
According to (5), if the Euclidean distance between the point $\phi(x)$ and the center $a$ is less than the radius $R$, the point is regarded as normal. Conversely, it is regarded as a noise point when the Euclidean distance is greater than the radius $R$.
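The displayed forms of Equations (2) to (5) are not reproduced in this text. For orientation, the block below sketches the textbook SVDD derivation using the symbols defined above. It assumes the usual primal (minimize $R^2 + C\sum_i \xi_i$ subject to $\|\phi(x_i) - a\|^2 \le R^2 + \xi_i$, $\xi_i \ge 0$) and a kernel $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$; the exact notation of the original equations may differ.

```latex
% Lagrangian of the SVDD primal, with multipliers \alpha_i \ge 0 and \beta_i \ge 0
L(R, a, \xi, \alpha, \beta) = R^2 + C\sum_{i=1}^{l}\xi_i
  - \sum_{i=1}^{l}\alpha_i\left(R^2 + \xi_i - \|\phi(x_i) - a\|^2\right)
  - \sum_{i=1}^{l}\beta_i\xi_i

% Setting the partial derivatives with respect to R, a and \xi_i to zero
\sum_{i=1}^{l}\alpha_i = 1, \qquad
a = \sum_{i=1}^{l}\alpha_i\,\phi(x_i), \qquad
C - \alpha_i - \beta_i = 0

% Radius from any boundary support vector x_k with 0 < \alpha_k < C
R^2 = K(x_k, x_k) - 2\sum_{i=1}^{l}\alpha_i K(x_i, x_k)
  + \sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j K(x_i, x_j)

% A point x is accepted as normal when it lies inside the hypersphere
\|\phi(x) - a\|^2 \le R^2
```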

Design Fuzzy Membership Function
According to the basic principle of the support vector machine, the optimal classification hyperplane is determined by the support vectors. If each class of a two-class classification problem is regarded as a convex set, then these support vectors lie on the relative boundary of the two convex sets, far away from the centers of the two classes.
Suppose that $T = \{(x_i, y_i),\ i = 1, 2, \dots, l\}$ contains $l$ data objects, and let $l^+$ and $l^-$ denote the number of positive samples $x_i^+,\ i = 1, 2, \dots, l^+$, and the number of negative samples $x_i^-,\ i = 1, 2, \dots, l^-$, respectively. Let $a^+$ and $a^-$ denote the centers of the minimum hyperspheres of the positive and negative samples, and $R^+$ and $R^-$ the corresponding radii. Then $a^+$ and $a^-$ are always located at the geometric centers if the normal vector of the hyperplane is established by maximizing the sum of the distances [69] from the two centers to the hyperplane. As shown in Figure 4, the sum of the distances satisfies $d = \|a^+ - A\| + \|a^- - B\| \le \|a^+ - O\| + \|a^- - O\| = \|a^+ - a^-\|$. That is, the distance reaches its maximum $d = \|a^+ - a^-\|$ when the hyperplane and the vector $a^+ - a^-$ are perpendicular to each other. For the given data, the two hyperspheres can be intersecting, tangent or separated. The separated and tangent cases can be treated as one case. Although in the intersecting case the parameters of the radial basis kernel function can always be chosen to make the two hyperspheres separate, this causes overfitting. Therefore, our paper reduces the above three cases to two: separation and intersection.
As shown in Figures 5 and 6, the normal vector of the hyperplane is $w = a^+ - a^-$ according to the principle of maximum distance. The two hyperplanes that pass through $a^+$ and $a^-$ with normal vector $w$ are $H^+: w^T(x - a^+) = 0$ and $H^-: w^T(x - a^-) = 0$. The optimal classification hyperplane is determined only by the SVs. Therefore, samples can be screened in advance, and those that may become support vectors can be selected as the new training samples, which simplifies the quadratic programming computation and improves the training speed. The optimal classification hyperplane lies between the positive and negative hypersphere centers, so the positive and negative samples between $H^+$ and $H^-$ can be selected as the new training sample set. In the two figures, the shaded parts inside the hyperspheres are the normal samples: the positive class is marked by orange slashes and the negative class by blue slashes. The samples marked "+" and "*" that belong to the new training sample set but lie outside the hyperspheres represent noises or outliers.
If $l^+_{new}$ and $l^-_{new}$ are the numbers of positive and negative samples in the new sample set, then the distance between each sample and the hyperplane of its class is calculated according to (6),
where $H^+: w^T(x - a^+) = 0$ and $H^-: w^T(x - a^-) = 0$ are the two hyperplanes that pass through $a^+$ and $a^-$, and $w = a^+ - a^-$ is their normal vector. The distance between each sample and the center of its hypersphere is calculated according to (7).
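For the linear case, the screening step and the point-to-hyperplane distance can be sketched as follows. This is an illustration under assumptions, not a transcription of Equation (6): it uses the standard distance from a point to a hyperplane, the function names are hypothetical, and the two centers are toy values that must be distinct.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def subtract(u, v):
    return [ui - vi for ui, vi in zip(u, v)]


def signed_offset(x, center, w):
    """Signed distance from x to the hyperplane w^T (x - center) = 0."""
    return dot(w, subtract(x, center)) / math.sqrt(dot(w, w))


def between_hyperplanes(x, a_pos, a_neg):
    """True if x lies in the slab bounded by H+ (through a_pos) and H- (through a_neg)."""
    w = subtract(a_pos, a_neg)  # normal vector w = a+ - a-, assumes a_pos != a_neg
    return signed_offset(x, a_pos, w) <= 0.0 <= signed_offset(x, a_neg, w)


# Toy centers: the slab between H+ and H- is 0 <= x[0] <= 4.
a_pos, a_neg = [4.0, 0.0], [0.0, 0.0]
w = subtract(a_pos, a_neg)
print(between_hyperplanes([1.0, 3.0], a_pos, a_neg))  # True, kept for training
print(between_hyperplanes([5.0, 0.0], a_pos, a_neg))  # False, screened out
print(abs(signed_offset([1.0, 3.0], a_pos, w)))       # 3.0, distance to H+
```

Only the samples for which `between_hyperplanes` is true would enter the reduced training set; the absolute offset is the quantity that a membership function of the distance to the in-class hyperplane would consume.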
Therefore, membership functions of both positive and negative sample points are constructed according to (8) and (9).
The minimum value of the membership function inside the hypersphere is 0.4, and the maximum value of the membership function outside the hypersphere is 0.4. Inside the hypersphere, the membership value of a sample increases with the distance between the sample and the hyperplane. Given $p \ge 2$, the bigger $p$ is, the faster $s_i^+$ and $s_i^-$ decay. Similarly, for nonlinear cases, the mapping function $\phi(x)$ is introduced through the kernel function $K(x_i, x_j)$ to map the data to a high-dimensional space. The normal vector is $w = a^+ - a^-$ by the rule of the maximum sum of distances from the two hypersphere centers to the separation hyperplane. The hyperplanes of the two classes with $w$ as the normal vector and passing through $a^+$ and $a^-$ respectively are $H^+: w^T(\phi(x) - a^+) = 0$ and $H^-: w^T(\phi(x) - a^-) = 0$. If $l^+_{new}$ and $l^-_{new}$ are the numbers of positive and negative samples in the new sample set, then the distance between each sample and the hyperplane of its class is calculated according to (10).
The distance between each sample and the center of the hypersphere of its class is calculated according to (11).
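In the feature space the center is never formed explicitly, so the distance to it has to be written in terms of kernel values. The sketch below assumes the standard SVDD expansion of the center, $a = \sum_i \alpha_i \phi(x_i)$ with $\sum_i \alpha_i = 1$, and a Gaussian kernel; the function names and the toy multipliers are illustrative, and the expression is the usual kernel expansion rather than a transcription of Equation (11).

```python
import math


def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))


def squared_distance_to_center(x, support, alphas, sigma):
    """||phi(x) - a||^2 for a = sum_i alpha_i phi(x_i), using kernel values only."""
    k_xx = gaussian_kernel(x, x, sigma)
    cross = sum(a * gaussian_kernel(s, x, sigma) for a, s in zip(alphas, support))
    center_norm = sum(
        ai * aj * gaussian_kernel(si, sj, sigma)
        for ai, si in zip(alphas, support)
        for aj, sj in zip(alphas, support)
    )
    return k_xx - 2.0 * cross + center_norm


# Toy case: one support vector with alpha = 1, so the center is phi([0, 0]).
support, alphas = [[0.0, 0.0]], [1.0]
print(squared_distance_to_center([0.0, 0.0], support, alphas, sigma=1.0))  # 0.0
print(squared_distance_to_center([1.0, 0.0], support, alphas, sigma=1.0))  # 2 - 2*exp(-0.5)
```

Comparing this value with $R^2$ decides whether a sample is treated as lying inside or outside the hypersphere.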
Therefore, membership functions of both positive and negative sample points are constructed according to (12) and (13).
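The displayed membership equations are not reproduced in this text, so the following sketch is not those equations. It is one hypothetical pair of functions chosen only to satisfy the properties stated in this section: inside the hypersphere the membership is linear in the distance to the in-class hyperplane, increasing, with minimum 0.4; outside it decays exponentially in that distance, with maximum 0.4, and decays faster for larger $p \ge 2$. The normalizer `d_max` and the upper value 1.0 are assumptions.

```python
import math


def membership(d_hyperplane, d_center, radius, d_max, p=2.0):
    """Hypothetical fuzzy membership with the qualitative shape described in the text.

    d_hyperplane: distance from the sample to the in-class hyperplane
    d_center:     distance from the sample to the hypersphere center
    radius:       hypersphere radius R
    d_max:        assumed normalizer, the largest in-sphere distance to the hyperplane
    """
    if p < 2.0:
        raise ValueError("the text requires p >= 2")
    if d_max <= 0.0:
        raise ValueError("d_max must be positive")
    if d_center <= radius:
        # Inside the hypersphere: linear, rising from 0.4 towards 1.0 (assumed cap).
        return 0.4 + 0.6 * min(d_hyperplane / d_max, 1.0)
    # Outside the hypersphere: exponential decay, starting from 0.4.
    return 0.4 * math.exp(-p * d_hyperplane)


inside_near = membership(0.0, 0.5, radius=1.0, d_max=2.0)
inside_far = membership(2.0, 0.5, radius=1.0, d_max=2.0)
outside_p2 = membership(1.0, 1.5, radius=1.0, d_max=2.0, p=2.0)
outside_p4 = membership(1.0, 1.5, radius=1.0, d_max=2.0, p=4.0)
print(inside_near, inside_far, outside_p2, outside_p4)
```

With this shape, samples between the in-class hyperplane and the boundary (candidate SVs) receive the larger weights, while samples outside the hypersphere (candidate noises or outliers) never exceed 0.4.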

Experimental Evaluation
This section evaluates our method in terms of detection performance, overhead and the impact of parameters. To test the performance of our algorithm, it is compared not only with the SVM-based algorithms SVM, FSVM1 [66], FSVM2 [65] and FSVM3 [68], but also with other algorithms.

The Experimental Data
To facilitate a performance comparison with similar studies, the system call datasets UNM_sendmail and UNM_live_lpr published by the University of New Mexico are used in this section [70]. The UNM_sendmail dataset consists of 346 normal traces and 25 abnormal traces. The abnormal data contains sunsendmailcp (sccp) intrusions and decode intrusions. The sccp intrusions enable a local user to obtain root access by using special command options to make sendmail append an e-mail message to a file, and the decode intrusions enable remote users to make changes to certain files on the local system. The UNM_live_lpr dataset covers 15 months of activity and consists of 4298 normal traces and 1003 abnormal traces. The abnormal data contains lprcp symbolic link intrusions, which take advantage of a vulnerability of the lpr program to control files on the host computer and tamper with their contents.
A trace consists of a sequence of system calls in chronological order. The meaning of a trace varies from program to program. Each trace file lists pairs of numbers: the first number is the process identity (PID) of the executing process, and the second number is the specific system call (SC). The child processes forked by a parent process are traced individually. We take UNM_sendmail as an example of the format of the experimental data. The data consists of data units, each of which consists of a PID and an SC. In the experiment, the data is segmented according to the PID, and the SCs that follow the same PID are arranged together in chronological order. In the UNM_sendmail example, the numbers 8840, 8843 and 6545 are PIDs and the numbers 4, 2, 5, 66, . . ., 2 are SCs. The lists of system calls issued by 8840, 8843 and 6545 are denoted as R 8840 = (4, 2, 5, 66, 5, . . ., 6), R 8843 = (115, 15, 99, 120, . . ., 17) and R 6545 = (2, 55, . . ., 2) respectively.
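The segmentation by PID can be sketched as below. The input lines are illustrative only: they are built from the PIDs and the first few system calls quoted above, interleaved arbitrarily, and are not the actual content of a UNM trace file. The function name and the whitespace-separated "PID SC" layout are assumptions.

```python
def group_by_pid(lines):
    """Group system calls by PID, keeping chronological order within each process."""
    traces = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pid, syscall = int(parts[0]), int(parts[1])
        traces.setdefault(pid, []).append(syscall)
    return traces


# Illustrative "PID SC" pairs, not real file content.
sample_lines = [
    "8840 4", "8840 2", "8843 115", "8840 5", "8843 15",
    "6545 2", "8840 66", "8843 99", "6545 55",
]
traces = group_by_pid(sample_lines)
print(traces[8840])  # [4, 2, 5, 66]
print(traces[8843])  # [115, 15, 99]
print(traces[6545])  # [2, 55]
```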
A system call sequence is represented as a vector composed of the frequencies of its short system call sequences. Let $n$ be the number of traces and $m$ the total number of distinct short system call sequences. The $i$th element of the vector is the frequency with which the short sequence numbered $i$ occurs in the system call sequence of the process. The vector corresponding to the $j$th trace is denoted by $(f_{1j}, \dots, f_{mj})$. That is, the samples are represented as $(X_j, y_j),\ j = 1, \dots, n$, where $X_j = (f_{1j}, \dots, f_{mj})$ and $y_j \in \{+1, -1\}$.
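A minimal sketch of this representation follows. It assumes that "frequency" means a raw occurrence count, that short sequences are numbered in order of first appearance, and that the window moves one call at a time; the function names and toy traces are illustrative.

```python
from collections import Counter


def short_sequences(trace, k):
    """All contiguous windows of length k, with a step length of 1."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_vocabulary(traces, k):
    """Number the distinct short sequences in order of first appearance."""
    vocabulary = {}
    for trace in traces:
        for seq in short_sequences(trace, k):
            vocabulary.setdefault(seq, len(vocabulary))
    return vocabulary


def frequency_vector(trace, vocabulary, k):
    """Vector (f_1, ..., f_m): how often each numbered short sequence occurs."""
    counts = Counter(short_sequences(trace, k))
    vector = [0] * len(vocabulary)
    for seq, count in counts.items():
        if seq in vocabulary:
            vector[vocabulary[seq]] = count
    return vector


# Toy traces with k = 2 for readability.
toy_traces = [[4, 2, 5, 4, 2, 5], [4, 2, 6]]
vocab = build_vocabulary(toy_traces, k=2)
print(frequency_vector(toy_traces[0], vocab, k=2))  # [2, 2, 1, 0]
print(frequency_vector(toy_traces[1], vocab, k=2))  # [1, 0, 0, 1]
```

Each vector, paired with its label in {+1, -1}, is one training sample for the classifier.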

The Experimental Performance
The experimental data of the UNM_live_lpr process consists of 4298 normal traces and 1003 abnormal traces. The 1003 abnormal traces contain lprcp attacks that control and tamper with host files. The experimental data of the UNM_sendmail process consists of 346 normal traces and 25 abnormal traces, including 20 sccp attacks that used e-mail to obtain root directory information and 5 decode attacks that remotely modified local files.
A false alarm occurs when the normal behavior of a program is judged as abnormal. If $L_N$ normal traces participate in the evaluation and $N_{FAR}$ of them are misjudged as abnormal, then the false alarm rate is $N_{FAR}/L_N$. The detection rate is the proportion of detected abnormal traces in the total number of abnormal traces. If $L_{AN}$ abnormal traces participate in the test and $N_{HR}$ of them are detected, then the detection rate is $N_{HR}/L_{AN}$. The missing rate is the proportion of unrecognized abnormal traces in the total number of abnormal traces. If $L_{AN}$ abnormal traces participate in the evaluation and $N_M$ of them cannot be detected, then the missing rate is $N_M/L_{AN}$. $DR = \rho_{HR}(1 - \rho_{FAR})$ is used as the comprehensive detection formula, in which $\rho_{HR}$ and $\rho_{FAR}$ stand for the detection rate and the false alarm rate respectively.
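The evaluation measures can be written as a short sketch. The function names are illustrative, and $\rho_{FAR}$ is read here as the false alarm rate, as its subscript indicates. Plugging in the 84.61% detection rate and 4.51% false alarm rate reported for our algorithm in this section gives about 80.79%, which matches one of the comprehensive detection rates quoted later in the section.

```python
def false_alarm_rate(n_far, l_n):
    """Share of normal traces misjudged as abnormal: N_FAR / L_N."""
    return n_far / l_n


def detection_rate(n_hr, l_an):
    """Share of abnormal traces that are detected: N_HR / L_AN."""
    return n_hr / l_an


def missing_rate(n_m, l_an):
    """Share of abnormal traces that are not detected: N_M / L_AN."""
    return n_m / l_an


def comprehensive_detection_rate(rho_hr, rho_far):
    """DR = rho_HR * (1 - rho_FAR)."""
    return rho_hr * (1.0 - rho_far)


print(round(comprehensive_detection_rate(0.8461, 0.0451), 4))  # 0.8079
```

Because the comprehensive rate multiplies the detection rate by the share of normal traces that are not flagged, a method cannot score well by raising alarms indiscriminately.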
The Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ is used in all algorithms during training. $\sigma_1$ and $\sigma_2$ can be selected according to the maximum error rate allowed in the target set. $C_1$ and $C_2$ can be adjusted so that the upper bounds of the positive and negative error rates are equal. The grid search method is adopted to select the optimal parameters. The search range of the parameter $C$ is $\{2^{-24}, 2^{-23}, \dots, 2^{23}, 2^{24}\}$. The parameters $\sigma$ and $\beta$ both have a search range of $\{2^{-24}, 2^{-23}, \dots, 2^{23}, 2^{24}\}$. The step length and the short sequence length are set to 1 and 6 respectively.
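The kernel and the parameter grid can be sketched as follows. The `score` callable is a placeholder for whatever validation measure is used (for example, the comprehensive detection rate on held-out traces); it, the function names and the toy scoring function are assumptions, and only $C$ and $\sigma$ are searched here.

```python
import math
from itertools import product


def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))


# Search range {2^-24, 2^-23, ..., 2^23, 2^24} for both C and sigma.
c_grid = [2.0 ** e for e in range(-24, 25)]
sigma_grid = [2.0 ** e for e in range(-24, 25)]


def grid_search(score, c_values, sigma_values):
    """Return the (C, sigma) pair with the highest score."""
    return max(product(c_values, sigma_values), key=lambda pair: score(*pair))


def toy_score(c, sigma):
    """Stand-in for a validation score; peaks at C = 2^3 and sigma = 2^-1."""
    return -abs(math.log2(c) - 3) - abs(math.log2(sigma) + 1)


best_c, best_sigma = grid_search(toy_score, c_grid, sigma_grid)
print(len(c_grid), best_c, best_sigma)  # 49 8.0 0.5
```

The grid has 49 values per parameter, so an exhaustive search over two parameters evaluates 2401 combinations; each evaluation in a real run would involve training and validating a classifier.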
Table 1 summarizes the comparison of detection performance between our method and the other eight methods on UNM_live_lpr. The detection rates of SVM, FSVM1, FSVM2, FSVM3, and our algorithm are 61.53%, 76.92%, 76.92%, 83.21%, and 84.61% respectively. The missing rates of SVM, FSVM1, FSVM2, FSVM3, and our algorithm are 38.47%, 23.08%, 23.08%, 16.23%, and 15.39% respectively. The false alarm rates of SVM, FSVM1, FSVM2, FSVM3, and our algorithm are 10.14%, 9.57%, 8.17%, 6.92%, and 4.51% respectively. Among all the algorithms, ours has the highest detection rate and the lowest false alarm rate and missing rate. As shown in Table 1, SVM has the lowest detection rate, 79.88%, among the five algorithms. Because they construct a fuzzy membership function, the other four algorithms treat the contributions of different samples to the objective function differently, which raises their average detection rate to 87.03%.
The FSVM1 algorithm designs a fuzzy membership function based on a linear function of the distance between the sample and its cluster center: the larger the distance, the smaller the coefficient. However, this membership design is sometimes unable to distinguish abnormal points effectively. The FSVM2 algorithm considers not only the distance used in the FSVM1 algorithm but also the position relation between samples. It determines the membership function according to the position of the sample relative to the hypersphere. When a sample is located inside the hypersphere, the membership function is a linear function of the distance, and the coefficient of the sample decreases as the distance increases. When a sample is located outside the hypersphere, it is regarded as an abnormal sample, and its membership is given by a different function. The FSVM3 algorithm, like the FSVM2 algorithm, takes account of the distance between the sample and its cluster center and of the affinity between samples. The difference between them is that the FSVM3 algorithm is based on the SVDD algorithm and introduces two different parameters to control the affinity between positive and negative samples.
The standard SVM treats all samples equally during training, so that every sample contributes equally to constructing the optimal classification hyperplane. As a result, when the training samples contain abnormal samples, the classification hyperplane obtained is not the optimal one. Our algorithm follows the principle of the maximum sum of distances from the hypersphere centers to the hyperplane, replaces the cluster center with the in-class hyperplane, and designs the membership function according to the distance from the sample to the hyperplane inside the hypersphere. Therefore, it can be seen from the table that the detection rate increases successively as the membership function is improved: from 83.71% for the FSVM1 algorithm to 85.56% for the FSVM2 algorithm, then to 86.20% for the WCS-FSVM algorithm, and finally to 92.63% for our algorithm.
The comprehensive detection rate gives a more realistic picture of detection performance. The comprehensive detection rates of SVM, FSVM1, FSVM2, and FSVM3 are 72.97%, 77.59%, 78.56%, and 80.23%, respectively. As shown in Figure 7, our algorithm has the highest comprehensive detection rate, 86.92%, among the five algorithms. At the same time, it is also evident from the data in Table 2 that the detection rate, false alarm rate, and missing rate on UNM_sendmail data are all lower than those on UNM_live_lpr data. The reason is the amount of data. The UNM_sendmail process data consists of 346 normal tracks and 25 abnormal tracks, far less experimental data than the 4298 normal tracks and 1003 abnormal tracks of the UNM_live_lpr process. Therefore, there is sufficient data for training on the UNM_live_lpr process, which is conducive to pre-extracting relative boundary vectors containing enough support vectors and ensuring sufficient data parameters. In the experiments on the UNM_live_lpr and UNM_sendmail processes, SVM uses the same error penalty factor for all samples. It will have a negative ...
As shown in Figures 9 and 10, when k = 6, the comprehensive detection rates of our algorithm on UNM_live_lpr and UNM_sendmail are 86.92% and 80.79%, respectively, which are higher than the comprehensive detection rates for the other k values on the same data. When k increases from 3 to 6, the comprehensive detection rate increases gradually. When k increases from 6 to 8, the comprehensive detection rate decreases gradually. This is consistent with Professor Forrest's conclusion on the selection of the short sequence length, and with Professor Wenke Lee's conclusion, from the perspective of information theory, that the best system call short sequence length is 6 or 7.
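The role of the short sequence length k can be made concrete with a sliding-window sketch. It assumes that short sequences are obtained by moving a window of length k over a system call trace one call at a time; the function name `short_sequences` and the toy trace are hypothetical and are not taken from the UNM data.

```python
from collections import Counter


def short_sequences(trace, k=6):
    """Return every length-k window of a system call trace, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A trace shorter than k yields no window at all.
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


# Hypothetical trace of system call numbers.
trace = [4, 2, 66, 66, 4, 138, 66, 5]
windows = short_sequences(trace, k=6)
pattern_counts = Counter(windows)  # frequency of each short sequence pattern
```

A trace of n calls yields n - k + 1 windows, so increasing k produces fewer and more specific patterns, while decreasing k produces more patterns that each carry less ordering information.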
When the short sequence is shorter than k = 6, the timing relation between short sequence patterns is lost. When the short sequence is longer, the short sequence pattern loses local information. Either kind of information loss degrades the comprehensive detection performance. Therefore, the short sequence length is set to 6, at which the detection performance of our algorithm is best.
The Compression method and the Sequence Matching method consider the detection problem from the aspects of data reversibility and similarity matching degree, respectively, and fail to consider the sequence characteristics between system calls. Therefore, they are the two algorithms with the worst comprehensive detection performance. The IPMA method and the Hybrid Markov method are complicated because they investigate the transition characteristics of system calls one by one. Although the ρ_FAR values of the IPMA method and the Hybrid Markov method are 7.58% and 14.26%, their ρ_MR values are 47.37% and 40.22%, respectively. The ρ_MR of the Bayes 1-step Markov method is 33.3% and its ρ_FAR is 4.1%. Its comprehensive detection performance value is high, so its detection performance is good. Since the Uniqueness method looks at the frequency of rare system calls, its ρ_MR is 55.3% and its ρ_FAR is 2.3%. However, this requires statistics for all the system calls that are used. N.bayes is in essence a Naive Bayesian method; it has good noise tolerance and fast calculation, but its false alarm rate is too high. Even so, its comprehensive detection performance is good. The Closeness method extracts user behavior patterns from the perspective of combination. It shows good detection performance under different closeness thresholds, but it ignores the characteristics of attack intensity. That is, the attacker will complete the attack task in the shortest possible time and try to behave normally the rest of the time.
As shown in Figure 11, our algorithm has the highest comprehensive detection rate, 86.92%, among the nine algorithms. Among all the algorithms, our algorithm has the highest detection rate and the lowest false alarm rate and missing rate. The comprehensive detection rate gives a more realistic picture of detection performance. The comprehensive detection rates of N.bayes, Uniqueness, Hybrid Markov, Bayes 1-step Markov, IPMA, Sequence Matching, Compression, Closeness, and our algorithm are 62%, 36.91%, 39.13%, 60.03%, 41.63%, 34.82%, 28.47%, 73.22%, and 86.92%, respectively.
Table 6 summarizes the comparison of detection performance between our method and the other eight methods on UNM_sendmail. The detection rates of N.bayes, Uniqueness, Hybrid Markov, Bayes 1-step Markov, IPMA, Sequence Matching, Compression, Closeness, and our algorithm are 63.88%, 37.4%, 45.3%, 67.3%, 40.63%, 35.9%, 29.7%, 75.2%, and 84.61%, respectively. The missing rates of N.bayes, Uniqueness, Hybrid Markov, Bayes 1-step Markov, IPMA, Sequence Matching, Compression, Closeness, and our algorithm are 32.12%, 60.3%, 40.44%, 30.8%, 42.37%, 60.2%, 50.5%, 20.1%, and 15.39%, respectively. The false alarm rates of N.bayes, Uniqueness, Hybrid Markov, Bayes 1-step Markov, IPMA, Sequence Matching, Compression, Closeness, and our algorithm are 4%, 2.30%, 14.26%, 1.9%, 17%, 3.9%, 19.8%, 4.7%, and 4.51%, respectively.
Similarly, the Compression method and the Sequence Matching method pay attention to reversibility and similarity matching degree, respectively, and fail to consider the sequence characteristics between system calls. The IPMA method and the Hybrid Markov method are complicated because they investigate the transition characteristics of system calls one by one. Although the ρ_FAR values of the IPMA method and the Hybrid Markov method are 17% and 14.26%, their ρ_MR values are 42.37% and 40.44%, respectively. The ρ_MR of the Bayes 1-step Markov method is 30.8%, and its ρ_FAR is 1.9%. Its comprehensive detection performance value is high, so its detection performance is good. Since the Uniqueness method looks at the frequency of rare system calls, its ρ_MR is 60.3% and its ρ_FAR is 2.3%. However, this requires statistics for all the system calls that are used. N.bayes is in essence a Naive Bayesian method; it has good noise tolerance and fast calculation, but its false alarm rate is too high. Even so, its comprehensive detection performance is good. The Closeness method extracts user behavior patterns from the perspective of combination. It shows good detection performance under specific closeness thresholds, but it ignores the characteristics of attack intensity. That is, the attacker will complete the attack task in the shortest possible time and try to behave normally the rest of the time.
As shown in Figure 12, our algorithm has the highest comprehensive detection rate, 80.79%, among the nine algorithms. Among all the algorithms, our algorithm has the highest detection rate and the lowest missing rate, while its false alarm rate of 4.51% is higher than those of the Bayes 1-step Markov (1.9%), Uniqueness (2.30%), Sequence Matching (3.9%), and N.bayes (4%) methods. The comprehensive detection rate gives a more realistic picture of detection performance. The comprehensive detection rates of N.bayes, Uniqueness, Hybrid Markov, Bayes 1-step Markov, IPMA, Sequence Matching, Compression, Closeness, and our algorithm are 61.32%, 36.54%, 38.84%, 60.02%, 33.72%, 34.49%, 23.81%, 71.67%, and 80.79%, respectively.
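The three rates reported in the comparisons above can be computed from confusion counts, as in the sketch below. The definitions are the conventional ones and are an assumption here, since this section does not restate them; they are at least consistent with the reported detection and missing rates of most methods summing to 100%. The counts in the example are hypothetical, and the comprehensive detection rate is omitted because its formula is not given in this section.

```python
def detection_metrics(tp, fn, fp, tn):
    """Detection rate, missing rate, and false alarm rate, in percent.

    tp: abnormal traces flagged as abnormal
    fn: abnormal traces missed
    fp: normal traces flagged as abnormal
    tn: normal traces accepted as normal
    """
    abnormal = tp + fn
    normal = fp + tn
    if abnormal == 0 or normal == 0:
        raise ValueError("need at least one abnormal and one normal trace")
    return {
        "DR": 100.0 * tp / abnormal,   # detection rate
        "MR": 100.0 * fn / abnormal,   # missing rate = 100 - DR
        "FAR": 100.0 * fp / normal,    # false alarm rate
    }


# Hypothetical counts, not taken from the UNM experiments.
metrics = detection_metrics(tp=90, fn=10, fp=5, tn=95)
```

Because the detection and missing rates share the same denominator, only one of them and the false alarm rate carry independent information about a detector.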

Conclusions
To address the defects of FSVM, this paper presents a new FSVM algorithm based on SVDD. In our method, noises and outliers are identified by a hypersphere of minimum volume that contains as many samples as possible. The definition of fuzzy membership considers not only the position of a sample relative to the hypersphere, but also the distance between the sample and the hyperplane. For samples inside the hypersphere, the fuzzy membership function is a linear function of the distance: the greater the distance, the greater the coefficient. For samples outside the hypersphere, the fuzzy membership function is an exponential function of the distance: the greater the distance, the smaller the coefficient. Compared with the FSVM based on the relation between a sample and its cluster center, our algorithm effectively distinguishes noises and outliers from the other samples. The experiments show that our FSVM is more robust than the SVM and the FSVM based on the distance between a sample and its cluster center. Our algorithm is suitable for system call sequence data and can contribute to accurate prediction.

Figure 1. The overview of the proposed algorithm.

Figure 2. The difference of the affinity among samples in two different classes.

Figure 3. The membership function design based on the position of samples and the hypersphere.

Figure 4. Maximize the sum of the distances.

Figure 5. The case of separation.

Figure 6. The case of intersection.

Figure 9. The comprehensive detection rate of different k on UNM_live_lpr data.

Figure 10. The comprehensive detection rate of different k on UNM_sendmail data.

Figure 11. The DR comparison of different algorithms on UNM_live_lpr.

Figure 12. The DR comparison of different algorithms on UNM_sendmail.

Table 1. Performance comparison on data set UNM_live_lpr.

Table 3. Detection performance of different k on UNM_live_lpr data.

Table 4. Detection performance of different k on UNM_sendmail data.

Table 5. Comparison between different algorithms on UNM_live_lpr.

Table 6. Comparison between different algorithms on UNM_sendmail.