An Entropy-Based Network Anomaly Detection Method

Data mining is an interdisciplinary subfield of computer science involving methods at the intersection of artificial intelligence, machine learning and statistics. One of the data mining tasks is anomaly detection, which is the analysis of large quantities of data to identify items, events or observations which do not conform to an expected pattern. Anomaly detection is applicable in a variety of domains, e.g., fraud detection, fault detection and system health monitoring, but this article focuses on the application of anomaly detection in the field of network intrusion detection. The main goal of the article is to prove that an entropy-based approach is suitable for detecting modern botnet-like malware based on anomalous patterns in the network. This aim is achieved by realization of the following points: (i) preparation of a concept of an original entropy-based network anomaly detection method, (ii) implementation of the method, (iii) preparation of an original dataset, (iv) evaluation of the method.


Introduction
The first anomaly detection method for intrusion detection was proposed almost 40 years ago [1]. Today, network anomaly detection is a very broad and heavily explored subject, but the problem of finding a generic method for a wide range of network anomalies is still unsolved. Widely used intrusion detection systems are ineffective against modern malicious software (malware). Such systems mostly make use of the common signature-based (or misuse-based) technique. This approach is known for its shortcomings [2][3][4][5]. Signatures describe only illegal patterns in network traffic, so prior knowledge is required [2]. Signature-based solutions do not cope with evasion techniques and not yet known attacks (0-days) [3]. Moreover, they are unable to detect a specific attack until a rule for the corresponding vulnerability is created, tested, released and deployed, which usually takes a long time [4,5]. Therefore, proper network anomaly detection, as one of the possible solutions to complement signature-based solutions, is essential. Recently, entropy-based methods which rely on network feature distributions have been of great interest [6][7][8][9][10][11]. It is crucial to check if the entropy-based approach is efficient in detecting anomalous network activities caused by modern botnet-like malware [12]. A botnet is a group of infected hosts (bots) controlled by Command and Control (C&C) servers operated by cyber-criminals. The amount of such malware, as well as the level of its sophistication, increases each year [13]. Damage from this type of malware can take many serious forms, including loss of important data, reputation or money. Moreover, nowadays such malware is also used in warfare to conduct sabotage and espionage [14]. The entropy-based approach to detecting anomalies caused by botnet-like malware in local networks has not yet been investigated. Entropy-based methods proposed in the past, e.g., [8,10,15], deal with massive spreads of old types of worms (not botnet-like) or different types of Distributed Denial of Service (DDoS) attacks in high-speed networks. In this article we propose an effective entropy-based method for detection and categorization of network anomalies that indicate the existence of botnet-like malware in local networks. These types of anomalies are often very small and hidden in the network traffic volume expressed by the number of flows, packets or bytes, so their detection with popular solutions and methods which rely mostly on traffic volume changes, e.g., [16][17][18][19], is highly difficult.
The main goal of this article is to prove that the entropy-based approach is suitable for detecting modern botnet-like malware based on anomalous network patterns. The aim was achieved by realization of the following points: (i) preparation of a concept of an original entropy-based network anomaly detection method, (ii) implementation of the method, (iii) preparation of an original dataset, (iv) evaluation of the method.
These steps are discussed in detail in the further part of the article, which is organized as follows:
• Section 2 reviews related work in the area of network anomaly detection.
• Section 3 introduces the definition of Shannon entropy and describes the Renyi and Tsallis generalizations. A brief overview as well as a comparison of entropy measures are provided.
• Section 4 presents the architecture of the proposed method. A detailed specification as well as results of the implementation are given.
• Section 5 refers to the dataset developed in order to evaluate the performance of the proposed method.
• Section 6 presents results of the verification of the method.
• Section 7 finishes this article, providing conclusions and a short summary. It also outlines further work.

Related Work
This section reviews related work in the area of network anomaly detection. The section starts with a general overview of the latest advances in this broad subject. Then, anomaly detection techniques that are closely related to the approach proposed in this article are presented in more detail and commented on.

Closely Related Work
In this paragraph a closer look is taken at works strictly related to the approach proposed in this article. An analysis of detection methods based on summarizing feature distributions via entropy, histograms and sketches is provided. Special attention is devoted to the methods employing different forms of entropy. Some comments on the gaps we noticed are given. The section starts with a comparison of the feature distributions approach to the older but still more popular detection via counters.

Detection via Counters
In the past, anomalies were treated as deviations in the traffic volume. Simple counters like the number of flows, packets (total, forwarded, fragmented, discarded) and bytes (per packet, per second) were used. These counters can be derived from network devices via the Simple Network Management Protocol (SNMP) [49] or NetFlow [50].
Barford et al. [17] presented wavelet analysis to distinguish between predictable and anomalous traffic volume changes using a very basic set of counters from NetFlow and SNMP data. They used a rather advanced signal analysis technique combined with very simple metrics, i.e., the number of flows, packets and bytes. The authors reported some positive results in detection of high-volume anomalies like network failures, bandwidth floods and flash crowds.
Kim et al. [18] proposed a method where many different DDoS attacks [51,52] are described in terms of traffic patterns in flow characteristics. In particular, the authors focused on counters like the number of flows, packets and bytes, the flow and packet sizes, the average flow size and the number of packets per flow. In the presented TCP SYN flood example the following pattern was applied: a large number of flows, yet a small number of small packets and no constraints on the bandwidth and the total amount of packets. This pattern differs significantly from the one generated for ICMP/UDP flooding attacks, where high bandwidth consumption and a large number of packets are involved. Although the authors reported some good results, they also mentioned that common legitimate peer-to-peer (P2P) traffic [53] may result in some false alarms in their approach.
A threshold-based detector measuring the deviation from a mean value, present in a traffic collection algorithm for frequent collection of SNMP data, was proposed by Lee et al. [54]. To assess the algorithm, the authors examined how it impacts detection of volume anomalies. Only some minor differences were reported in comparison to the original traffic collection algorithm.
Casas et al. [55] introduced an anomaly detection algorithm based on SNMP data which deals with abrupt and large traffic changes. The authors proposed a novel linear parsimonious model for anomaly-free network flows. This model makes it possible to treat the legitimate traffic as a nuisance parameter, to remove it from the detection problem and to detect the anomalies in the residuals. The authors reported that with this approach they slightly improved on a previously introduced approach based on PCA in terms of false alarms.
Many commercial and open source solutions that rely on SNMP or NetFlow counters are available on the market, e.g., NFSen [16], NtopNg [19], Plixer Scrutinizer [56], Paessler PRTG [57], and Solarwinds Network Traffic Analyzer [58]. All of them provide more or less the same functionality, i.e., browsing and filtering network data, reporting and alerting. Several commercial solutions, e.g., Invea-Tech FlowMon [59] or AKMA Labs FlowMatrix [60], offer some advanced anomaly detection methods which mostly rely on a predefined set of rules for detection of undesirable behavior patterns and some simple long-term network behavior profiles in terms of services, traffic volume and communication sides.
Concluding this subsection, we noticed that although there are many methods that rely on counters, their use is limited. The problem with a counter-based approach is that it is strictly connected with the traffic volume. Nowadays many anomalous network activities, such as low-rate DDoS [61,62], stealth scanning or botnet-like worm propagation and communication, do not result in a substantial traffic volume change. The counter-based methods presented above handle large and abrupt traffic changes such as bandwidth flooding attacks or flash crowds well, but a large group of anomalies which do not cause changes of volume remains undetected. Moreover, there is also a practical issue connected with counters, reported by Brauckhoff et al. [63], who stated that packet sampling, used by many routers to save resources when collecting data, can influence counter-based anomaly detection metrics but does not significantly affect the distribution of traffic features.

Detection via Feature Distributions
Network anomaly detection via traffic feature distributions is becoming more and more popular. Several feature distributions, i.e., header-based (addresses, ports, flags), volume-based (host or service specific percentage of flows, packets and bytes) and behavior-based (in/out connections for a particular host), have been suggested in the past [8,15,64]. However, it is unclear which feature distributions perform best. Nychis in [8], based on his results of pairwise correlation, reported dependencies between addresses and ports and recommended the use of volume-based and behavior-based feature distributions. In contrast, Tellenbach in [15] found no correlation among header-based features.
In this article, original results of feature correlation are presented and some interesting conclusions are given in Section 6.
Shannon Entropy
Entropy as a measure of uncertainty can be used to summarize feature distributions in a compact form, i.e., a single number. Many forms of entropy exist, but only a few have been applied to network anomaly detection. The most popular is the well-known Shannon entropy [65,66]. Application of Shannon measures like relative entropy and conditional entropy to conduct network anomaly detection was proposed by Lee and Xiang [67]. Also, Lakhina et al. [64] made use of Shannon entropy to sum up the feature distributions of network flows. By using unsupervised learning, the authors showed that anomalies can be successfully clustered. Wagner and Plattner [7] made use of the Kolmogorov Complexity (related to Shannon entropy) [68,69] in order to detect worms in network traffic. Their work mostly focuses on implementation aspects and scalability and does not propose any specific analysis techniques. The authors reported that their method is able to detect worm outbreaks and massive scanning activities in near real time. Ranjan et al. [70] suggested another worm detection algorithm which measures Shannon entropy ratios for traffic feature pairs and issues an alarm on sudden changes. Gu et al. [71] made use of Shannon maximum entropy estimation to estimate the network baseline distribution and to give a multi-dimensional view of network traffic. The authors claim that with their approach they were able to distinguish anomalies that change the traffic either abruptly or slowly.
Generalized Entropy
Besides Shannon entropy, some generalizations of entropy have recently been introduced in the context of network anomaly detection. Eimann in [6,72,73] reported some positive results of using T-entropy [74] for intrusion detection based on packet analysis. T-entropy can be estimated from a string complexity measure called T-complexity. String complexity is the minimum number of steps required to construct a given string. In contrast to entropy, where probabilities (estimated from frequencies) can be permuted, in a complexity-based approach the order matters. A string is compressed with some algorithm and the output length is used to estimate the complexity. Finally, the complexity becomes an estimate for the entropy. Because the sequence of events is crucial in this approach, it suits fine-grained methods of network data analysis like full packet or packet header inspection. However, this type of inspection is not scalable in the context of network speed. Some details about T-entropy are also presented in our paper [75]. Parameterized generalizations of entropy have also been recently reported as very promising. The Shannon entropy assumes a tradeoff between contributions from the main mass of the distribution and the tail. With the parameterized Tsallis [76][77][78] or Renyi [79,80] entropy, one can control this tradeoff. In general, if the parameter denoted as α has a positive value, it exposes the main mass; if the value is negative, it refers to the tail. Ziviani et al. [81] investigated Tsallis entropy in the context of the best value of the α parameter in DoS attack detection. They found that an α-value around 0.9 is the best for detecting such attacks. Shafiq et al. [82] did the same for port scan anomalies caused by malware. They reported that for scan anomalies an α-value around 0.5 is the best choice. A comparative study of the use of Shannon, Renyi and Tsallis entropy for attribute selection, aimed at obtaining an optimal attribute subset which increases the detection capability of decision tree and k-means classifiers, was presented by Lima et al. [83]. The experimental results demonstrate that the performance of the models built with smaller subsets of attributes is comparable and sometimes better than that associated with the complete set of attributes for the DoS and scan attack categories. The authors found that, for the DoS category, Renyi entropy with an α-value around 0.5 and Tsallis entropy with an α-value around 1.2 are the best for the decision tree classifier. We believe that the proper choice of the α-value depends either on the anomaly or on the legitimate traffic used as a baseline, or on both, since none of the authors mentioned above reported similar results. Thus, goals such as finding the proper value of the entropy parameter in order to improve detection of a particular group of anomalies remain unachieved. Some authors, e.g., Tellenbach et al. [9,15,84], employed a set of α-values in their methods. The authors proposed the Traffic Entropy Telescope prototype based on Tsallis entropy, capable of detecting a broad spectrum of anomalies in backbone traffic, including fast-spreading worms (not that common nowadays), scans and different forms of DoS/DDoS attacks. Although Tsallis entropy seems to be more popular than Renyi entropy in the context of network anomaly detection, the latter was also successfully applied in detection of different anomalies. Examples are the work by Yang et al. [10], who employed Renyi entropy for early detection of low-rate DDoS attacks, and Kopylova et al. [11], who reported positive results of using Renyi conditional entropy in detection of selected worms. We believe that with parameterized entropy some limitations of Shannon entropy, caused by its small descriptive capability [9], which results in little ability to detect typical small or low-rate anomalies, can be overcome. Moreover, we think that with a properly chosen spectrum of α-values this detection will be accurate in terms of low false alarms and a high detection rate. In this article we present original results of our research on the most suitable set of α-values as well as original results of research on the most suitable entropy type.
Other Techniques
Apart from entropy, some other feature distribution summarization techniques are successfully used in the context of network anomaly detection [85], namely sketches and histograms. Soule et al. [45] proposed a flow classification method based on modeling network flow histograms using Dirichlet Mixture Processes for random distributions. The authors validated their model against three synthetic test cases and achieved almost 100% classification accuracy. In [46], Stoecklin et al. introduced a two-layered sketch anomaly detection technique. The first layer models typical values of different feature components (e.g., the typical number of flows connecting to a specific port) and the second layer evaluates the differences between the observed feature distribution and the corresponding model. The authors claim that the main strength of their method is the construction of fine-grained models that capture the details of feature distributions, instead of summarizing them into an entropy value. A more general approach was presented by Kind et al. [18]. In their method histogram-based baselines were constructed from the distributions of selected essential network traffic features like addresses and ports. This work was augmented by Brauckhoff et al. in [47], who applied association rule mining in order to identify flows representing anomalous network traffic. The main problem with non-entropic feature distribution summarization techniques is proper tuning [9]. The performance of detection depends to a great extent on the accuracy of the bin size. This may be difficult to manage while network traffic changes.

Entropy
This chapter introduces the theoretical fundamentals of entropy. It starts with a brief overview of Shannon entropy. Next, the parameterized generalizations are presented. Finally, a comparison of entropy measures is provided.

Shannon Entropy
The definition of entropy as a measure of disorder comes from thermodynamics and was proposed in the early 1850s by Clausius [86]. In 1948 Shannon [65] adopted entropy to information theory. In information theory, entropy is a measure of the uncertainty associated with a random variable. The more random the variable, the bigger the entropy and, in contrast, the greater the certainty of the variable, the smaller the entropy. For a probability distribution $p(X = x_i)$ of a discrete random variable $X$, the Shannon entropy is defined as:
$$H_s(X) = \sum_{i=1}^{n} p(x_i) \log_a \frac{1}{p(x_i)}$$
where $X$ is the feature that can take values $\{x_1 \ldots x_n\}$ and $p(x_i)$ is the probability mass function of outcome $x_i$. The entropy of $X$ can also be interpreted as the expected value of $\log_a \frac{1}{p(X)}$, where $X$ is drawn according to the probability mass function $p(x)$. Depending on the base of the logarithm, different units can be used: bits ($a = 2$), nats ($a = e$) or hartleys ($a = 10$). For the purpose of network anomaly detection, sampled probabilities estimated from the number of occurrences of $x_i$ in a time window $t$ are typically used. The value of entropy depends on randomness (it attains its maximum when the probability $p(x_i)$ is equal for every $x_i$) but also on the value of $n$. In order to measure randomness only, normalized forms have to be employed. For example, an entropy value can be divided by $n$ or by the maximum entropy defined as $\log_a(n)$. Properties of Shannon entropy are discussed in detail in [87,88].
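The estimation of sampled probabilities from occurrence counts and the normalization by the maximum entropy can be illustrated with a minimal Python sketch. The function names and the list of destination ports are hypothetical and serve only as an illustration; they are not taken from the implementation described in this article.

```python
import math
from collections import Counter

def shannon_entropy(counts, base=2.0):
    """Shannon entropy of the empirical distribution given by occurrence counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(p * math.log(1.0 / p, base) for p in probs)

def normalized_shannon_entropy(counts, base=2.0):
    """Entropy divided by the maximum entropy log_a(n), so that it lies in [0, 1]."""
    n = sum(1 for c in counts if c > 0)
    if n <= 1:
        return 0.0  # a single outcome carries no uncertainty
    return shannon_entropy(counts, base) / math.log(n, base)

# Hypothetical destination ports observed within one time window (assumption).
ports = [80, 80, 443, 22, 80, 443, 53, 80]
counts = list(Counter(ports).values())

print(f"entropy    = {shannon_entropy(counts):.3f} bits")
print(f"normalized = {normalized_shannon_entropy(counts):.3f}")
```

Dividing by the maximum entropy removes the dependence on the number of distinct values, so windows with different numbers of observed ports can be compared.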
If not only the degree of uncertainty is important but also the extent of changes between the assumed and observed distributions, denoted as $q$ and $p$ respectively, the relative entropy, also known as the Kullback-Leibler divergence [89], can be used:
$$D_{KL}(p \| q) = \sum_{i=1}^{n} p(x_i) \log_a \frac{p(x_i)}{q(x_i)}$$
This definition is not symmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ unless $p = q$.
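The asymmetry of the relative entropy can be checked numerically. The following Python sketch is only an illustration; the two distributions are made-up values, and the handling of zero probabilities (returning infinity when the assumed distribution assigns zero probability to an observed outcome) is a common convention assumed here, not something specified in this article.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf   # p has mass where q has none (assumed convention)
        total += pi * math.log(pi / qi, base)
    return total

assumed = [0.5, 0.3, 0.2]   # hypothetical baseline distribution q
observed = [0.8, 0.1, 0.1]  # hypothetical observed distribution p

print(f"D_KL(p||q) = {kl_divergence(observed, assumed):.4f}")
print(f"D_KL(q||p) = {kl_divergence(assumed, observed):.4f}")
```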
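To measure how much uncertainty remains in $X$ after observing $Y$, where $Y$ takes values $\{y_1 \ldots y_m\}$, the conditional entropy (or equivocation) [90] may be employed:
$$H_s(X \mid Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log_a p(x_i \mid y_j)$$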

Parameterized Entropy
The Shannon entropy assumes a tradeoff between contributions from the main mass of the distribution and the tail [91]. To control this tradeoff, two parameterized generalizations of Shannon entropy were proposed, by Renyi (1970s) [79] and Tsallis (late 1980s) [76] respectively. If the parameter denoted as α (or q) has a positive value, it exposes the main mass (the concentration of events that occur often); if the value is negative, it refers to the tail (the dispersion caused by rare events).
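Both parameterized entropies (Renyi and Tsallis) derive from the Kolmogorov-Nagumo generalization of an average [92,93]:
$$\langle X \rangle_{\varphi} = \varphi^{-1}\left(\sum_{i=1}^{n} p(x_i)\, \varphi(x_i)\right)$$
where $\varphi$ is a function which satisfies the postulate of additivity (only affine or exponential functions satisfy this) and $\varphi^{-1}$ is the inverse function. Due to affine transformations $\varphi(x_i) \rightarrow \gamma(x_i) = a\varphi(x_i) + b$ (where $a$ and $b$ are numbers), the inverse function is expressed as:
$$\gamma^{-1}(x_i) = \varphi^{-1}\left(\frac{x_i - b}{a}\right)$$
Renyi entropy can be obtained from the Shannon entropy by applying this average with an exponential function $\varphi$ [93]. After transformation, a well-known form of Renyi entropy is obtained:
$$H_{R\alpha}(X) = \frac{1}{1-\alpha} \log_a \left(\sum_{i=1}^{n} p(x_i)^{\alpha}\right) \qquad (8)$$
The Renyi entropy satisfies the same postulates as the Shannon entropy and there is the following relation between these two:
$$\lim_{\alpha \to 1} H_{R\alpha}(X) = H_s(X)$$
Tsallis proposed a different function $\varphi$. After transformation, a well-known form of Tsallis entropy is as follows:
$$H_{T\alpha}(X) = \frac{1}{1-\alpha} \left(\sum_{i=1}^{n} p(x_i)^{\alpha} - 1\right) \qquad (12)$$
As one can see, this entropy is non-logarithmic. There is the following relation between the Shannon and the Tsallis entropy:
$$\lim_{\alpha \to 1} H_{T\alpha}(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) = \ln(a)\, H_s(X) \qquad (14)$$
Moreover, Tsallis entropy is non-extensive, i.e., it satisfies only the pseudo-additivity criterion. For independent discrete random variables $X$, $Y$:
$$H_{T\alpha}(X, Y) = H_{T\alpha}(X) + H_{T\alpha}(Y) + (1-\alpha)\, H_{T\alpha}(X)\, H_{T\alpha}(Y)$$
It means that, for nonzero entropies:
$$H_{T\alpha}(X, Y) < H_{T\alpha}(X) + H_{T\alpha}(Y) \quad \text{for } \alpha > 1$$
and
$$H_{T\alpha}(X, Y) > H_{T\alpha}(X) + H_{T\alpha}(Y) \quad \text{for } \alpha < 1$$
To summarize, both parameterized entropies (Renyi and Tsallis):
• expose concentration for α > 1 and dispersion for α < 1,
• converge to Shannon entropy for α → 1.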
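The convergence of both parameterized entropies to Shannon entropy and the pseudo-additivity of Tsallis entropy can be verified numerically. The Python sketch below uses the standard forms of Renyi and Tsallis entropy with base-2 logarithms for Shannon and Renyi; the function names and the example distributions are hypothetical, and the special case for α = 1 is an assumption made here to avoid division by zero.

```python
import math

def shannon(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def renyi(probs, alpha):
    """Renyi entropy in bits; alpha = 1 falls back to Shannon entropy."""
    if alpha == 1:
        return shannon(probs)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)

def tsallis(probs, alpha):
    """Tsallis entropy (non-logarithmic); alpha = 1 gives Shannon entropy in nats."""
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (sum(p ** alpha for p in probs if p > 0) - 1) / (1 - alpha)

# Hypothetical distributions of two independent variables X and Y.
p_x = [0.7, 0.2, 0.1]
p_y = [0.6, 0.4]
joint = [px * py for px in p_x for py in p_y]

# Convergence for alpha -> 1; Tsallis is rescaled from nats to bits.
near_one = 1.0001
print("Shannon:", shannon(p_x))
print("Renyi  :", renyi(p_x, near_one))
print("Tsallis:", tsallis(p_x, near_one) / math.log(2))

# Pseudo-additivity of Tsallis entropy for independent X and Y.
alpha = 2
lhs = tsallis(joint, alpha)
rhs = (tsallis(p_x, alpha) + tsallis(p_y, alpha)
       + (1 - alpha) * tsallis(p_x, alpha) * tsallis(p_y, alpha))
print("pseudo-additivity:", lhs, rhs)
```

For α = 2 the joint Tsallis entropy is smaller than the sum of the marginal entropies, while Renyi entropy of the same joint distribution remains additive.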

Comparison
In order to understand, compare and successfully apply parameterized entropies in our approach, some experiments were conducted.
Firstly, a comparison of Shannon, Renyi and Tsallis entropy of a binomial probability distribution was performed. Then we compared the entropies calculated for a uniform distribution to check how they depend on the number of equal probabilities and on the α-values. Next, the impact of rare and frequent events on the value of entropy for different α-values was examined.

Binomial Distribution
Shannon, Renyi and Tsallis entropy for a binomial probability distribution, where the probability of success is p and the probability of failure is 1 − p, are depicted in Figures 1, 2 and 3. As one can see, both Renyi and Tsallis entropy converge to the Shannon entropy for α → 1. (Note: according to Equation (14), Tsallis entropy needs to be multiplied by 1/ln 2 to obtain a curve similar to the Shannon one for α → 1.) Tsallis entropy is much more sensitive than Renyi entropy for negative α-values and less sensitive for positive α-values. Moreover, the maximum of Tsallis entropy changes for different α-values, while the maximum of Renyi entropy is always equal to 1.

Uniform Distribution
Shannon, Renyi and Tsallis entropy for a uniform probability distribution are depicted in Figure 4. For this distribution the maximum entropy (the case when all probabilities are equal) is calculated for different n representing the number of equal probabilities. As one can see, in contrast to Shannon and Renyi entropy, the value of Tsallis entropy depends not only on n but also on the value of α.
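The behavior for a uniform distribution can be reproduced with a short Python sketch. It assumes the standard forms of Renyi entropy (base-2 logarithm) and Tsallis entropy; the helper names and the chosen values of n and α are illustrative only.

```python
import math

def renyi(probs, alpha):
    """Renyi entropy in bits; alpha = 1 falls back to Shannon entropy."""
    if alpha == 1:
        return -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)

def tsallis(probs, alpha):
    """Tsallis entropy; alpha = 1 gives Shannon entropy in nats."""
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (sum(p ** alpha for p in probs if p > 0) - 1) / (1 - alpha)

def uniform(n):
    """Uniform distribution over n outcomes."""
    return [1.0 / n] * n

for n in (2, 8, 64):
    for alpha in (-2, 0.5, 2):
        print(f"n={n:3d} alpha={alpha:+.1f} "
              f"Renyi={renyi(uniform(n), alpha):8.3f} "
              f"Tsallis={tsallis(uniform(n), alpha):12.3f}")
```

For a uniform distribution Renyi entropy reduces to log2(n) for every α, whereas Tsallis entropy equals (n^(1−α) − 1)/(1 − α) and therefore varies with α.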

Impact of Frequent and Rare Events
Example 1. Let's assume a discrete random variable X = IP addresses observed in the network within the last 1 min, X = {"10.1.0.1", "10.1.0.2", "10.1.0.3", "10.1.0.4", "10.1.0.5"}. Suppose the following numbers of occurrences for the subsequent IP addresses: Freq = {96, 1, 1, 1, 1}. Based on the frequencies, let's estimate the probability distribution of X (see Table 1). Let's examine the impact of a frequent event, p(X = "10.1.0.1") = 0.96, and a rare event, p(X = "10.1.0.2") = 0.01, on the Renyi and Tsallis entropy when α = −2 and α = 2 are used. In order to measure the impact of these events, we can check the results of the exponential expression $p(x_i)^{\alpha}$ existing in both the Renyi and Tsallis formulas in Equation (8) and Equation (12). The results are presented in Table 2. As one can see, the impact of frequent events (expressed by $p(x_i) = 0.96$) on the entropy is greater than the impact of rare events (expressed by $p(x_i) = 0.01$) when positive α-values are used and, conversely, the impact of rare events is greater than that of frequent events when negative α-values are used.
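The values of the exponential expression from Example 1 can be recomputed with a few lines of Python. The sketch uses the frequencies given in the example; the helper name power_term is hypothetical.

```python
# Occurrence counts from Example 1 (five IP addresses, 100 observations in total).
freq = {
    "10.1.0.1": 96,
    "10.1.0.2": 1,
    "10.1.0.3": 1,
    "10.1.0.4": 1,
    "10.1.0.5": 1,
}
total = sum(freq.values())
prob = {ip: count / total for ip, count in freq.items()}

def power_term(p, alpha):
    """The term p(x_i)^alpha shared by the Renyi and Tsallis formulas."""
    return p ** alpha

for alpha in (-2, 2):
    frequent = power_term(prob["10.1.0.1"], alpha)  # p = 0.96
    rare = power_term(prob["10.1.0.2"], alpha)      # p = 0.01
    dominant = "rare" if rare > frequent else "frequent"
    print(f"alpha={alpha:+d}: frequent={frequent:.4f}, rare={rare:.4f}, "
          f"dominant event: {dominant}")
```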

Anode: Entropy-Based Network Anomaly Detector
This chapter is focused on the proposed method and its implementation named Anode. Firstly, the operating principle is presented. Then, results of the implementation are given.
In order to verify if the entropy-based approach is suitable for detecting modern botnet-like malware based on anomalous network patterns, an entropy-based network anomaly detector named Anode has been proposed. The operating principle of Anode is presented in Figure 5.

Architecture
The architecture of Anode is presented in Figure 5. Anode analyzes network data captured by NetFlow probes. Typical probes like routers or dedicated probes, e.g., Softflowd [94], connected to TAPs [95] or SPAN ports [95] on switches are assumed. Flows are analyzed within fixed time intervals (every 5 min by default). Bidirectional flows [96] are used since, according to some works (e.g., [8]), unidirectional flows may entail biased results. Collected flows are recorded in a relational database and then analyzed. In order to limit the area of search for anomalies, filters per direction, protocol and subnet are provided. Next, depending on the mode, Tsallis or Renyi entropy of positive and negative α-values is calculated for the traffic feature distributions presented in Table 3. (Note: the Shannon version of our method internally uses Renyi entropy with α set to 1.) There are two phases in our approach: training and detection. In the training phase a profile of legitimate traffic is built and a model for classification is prepared. In the detection phase current observations are compared with the model. Initially, during the training phase, a dynamic profile is built using min and max entropy values within a sliding time window for every (feature, α) pair. Thus, we can reflect traffic changes during the day but at the same time provide a margin for some minor differences, e.g., small delays between the profile and the current traffic. A way of building a profile based on entropy values is presented in Figure 6.
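The idea of a dynamic min/max profile can be sketched in Python as follows. This is only an illustration under stated assumptions: the window is assumed to be centered on each interval and its size (half_window) is a hypothetical parameter, and the entropy series is made-up data for a single (feature, α) pair; none of this is taken from the Anode implementation.

```python
def build_profile(entropy_series, half_window=1):
    """Return a list of (min, max) pairs, one per time interval.

    For interval t the bounds are taken over the training entropy values in
    the intervals t - half_window .. t + half_window (clipped at the ends).
    The centered window is an assumption made for this sketch.
    """
    n = len(entropy_series)
    profile = []
    for t in range(n):
        start = max(0, t - half_window)
        stop = min(n, t + half_window + 1)
        window = entropy_series[start:stop]
        profile.append((min(window), max(window)))
    return profile

# Hypothetical entropy values of one feature for consecutive 5-min intervals.
training_entropy = [1.0, 2.0, 1.5, 3.0]
for t, (lo, hi) in enumerate(build_profile(training_entropy)):
    print(f"interval {t}: min={lo}, max={hi}")
```

Taking the bounds over neighboring intervals is what provides the margin for small time shifts between the profile and the current traffic.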
In the detection phase, the observed entropy is compared with the min and max values stored in the profile according to a rule that yields a relative value $r_\alpha(x_i)$ and thereby defines the anomaly threshold. Values $r_\alpha(x_i) < 0$ or $r_\alpha(x_i) > 1$ indicate abnormal concentration or dispersion. Such abnormal dispersion or concentration for different feature distributions is characteristic of anomalies. For example, during a port scan, a high dispersion in port numbers and a high concentration in addresses should be observed. Detection is based on the relative value of entropy with respect to the distance between min and max. The coefficient k in the formula determines a margin for the min and max boundaries and may be used for tuning purposes. A high value of k, e.g., k = 2, limits the number of false alarms (alarms where no anomaly has taken place), while a low value (k = 1) increases the detection rate (the percentage of anomalies correctly detected). Some other approaches to thresholding, based on standard deviation (mean ± 2sdev) and median absolute deviation (median ± 2mad) [97], have also been taken into consideration, but empirical results proved that the proposed rule is the best choice. The detection is based on the results from all feature distributions presented in Table 3. Classification is based on popular classifiers (decision trees, Bayes nets [98], rules and functions) employed in Weka [99]. Extraction of anomaly details is also assumed: related ports and addresses are obtained by looking into the top contributors to the entropy value.
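A relative-value check of this kind can be illustrated with the Python sketch below. It should be read as an assumption-laden illustration and not as the rule used in Anode: here the coefficient k is assumed to widen the [min, max] band symmetrically around its midpoint, so that k = 1 reduces to the plain relative position (H − min)/(max − min). The function names and the numeric values are hypothetical.

```python
import math

def relative_entropy_value(observed, profile_min, profile_max, k=1.0):
    """Relative position of the observed entropy inside a widened band.

    ASSUMPTION (not the article's formula): the band [min, max] is scaled by
    k around its midpoint. Values below 0 or above 1 fall outside the band.
    """
    center = (profile_min + profile_max) / 2.0
    half_width = k * (profile_max - profile_min) / 2.0
    if half_width == 0.0:
        # Degenerate profile: any deviation is outside the band.
        if observed == center:
            return 0.5
        return math.inf if observed > center else -math.inf
    lower = center - half_width
    return (observed - lower) / (2.0 * half_width)

def check(observed, profile_min, profile_max, k=1.0):
    """Label an observation as 'below', 'above' or 'normal' relative to the band."""
    r = relative_entropy_value(observed, profile_min, profile_max, k)
    if r < 0:
        return "below"
    if r > 1:
        return "above"
    return "normal"

# Hypothetical profile bounds and observation for one (feature, alpha) pair.
print(check(2.2, 1.0, 2.0, k=1.0))  # narrow band: flagged
print(check(2.2, 1.0, 2.0, k=2.0))  # wider band: tolerated
```

Under this assumption a larger k tolerates larger deviations, which matches the described effect of fewer false alarms at the cost of a lower detection rate.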

Implementation
A proof of concept implementation of Anode has been developed in the Microsoft .NET environment in the C# language. Currently it allows detecting anomalies in an off-line mode. All experiments presented in this article have been conducted with this implementation. Our software produces Weka arff files based on entropy calculations for each network feature. Recorded NetFlow data (e.g., whole day traffic) has to be captured and labeled in advance. Classification performance is evaluated with Weka (ten-fold cross-validation mode) based on the provided arff files.
Currently Anode is also a component of the anomaly detection and security event data correlation system developed in the SECOR project [100]. The final implementation in SECOR has been developed in the JAVA WSO2 (http://wso2.com) environment. This implementation allows on-line detection and classification of anomalies based on NetFlow reports coming from probes deployed in the network. SECOR is not limited to network anomaly detection, e.g., the PRONTO module [101,102], developed by another team of the project, detects obfuscated malware at infected hosts.

Dataset
This chapter presents the dataset developed to evaluate the proposed method. This dataset is based on real legitimate traffic and synthetic anomalies. It consists of labeled flows which are stored in a relational database. The chapter starts with the origin of the idea. Next, details concerning legitimate and anomalous traffic are presented.

Origin of the Idea
One of the main problems in network anomaly detection is the lack of realistic and publicly available datasets for evaluation purposes. The most valuable are real network traces, but because of privacy issues they are rarely published even though some anonymization techniques exist. Another problem with real traces is proper labeling, which in many cases has to be done manually. An alternative approach is to use synthetic datasets. To build such a dataset, deep domain knowledge and appropriate methods and tools are required in order to obtain realistic data. Most authors do not disclose the self-crafted traces used for evaluation of their methods. Real traffic traces can be found in some publicly available repositories like Internet Traffic Archive [103], LBNL/ICSI Enterprise Tracing [104], SimpleWeb [105], Caida [106], MOME [107], WITS [108] and UMASS [109]. Unfortunately, these traces are usually old, often unlabeled and not dedicated to anomaly detection. The lack of contemporary anomalies, e.g., traces of botnet activity, in the available datasets calls their timeliness into question. According to recent reports provided by cyber security organizations [13,[110][111][112], botnets are one of the most sophisticated and popular types of cybercrime today. Therefore, anomalies connected with botnet-like malware should be included in contemporary datasets and researchers should address these anomalies in their methods. The number of datasets containing modern malware traces is limited. Worth mentioning are ISOT [113] and CTU-13 [114]. The first one is a mixture of malicious and non-malicious datasets. Regrettably, only one host in this dataset is infected with botnet-like malware. The second dataset, which has just been made public, is much richer and consists of traces of several bots, namely Neris, Rbot, Sogou, Murlo and Menti. Unfortunately, as this dataset has appeared only recently, we had no chance to use it in our studies. An interesting flow-based traffic dataset has recently been made publicly available by Sperotto et al. [115]. This set is based on data collected from a real honeypot (monitored trap) featuring HTTP, SSH and FTP services. The authors gathered about 14 million malicious flows, but most of them referred to the activity of web and network scanners. Some details about particular anomalies in this dataset are also presented in our research [116]. Instead of tracking anomalies caused by modern malware, some authors still make use of the very old and criticized DARPA [117] dataset and its modified versions, namely KDD99 [118] and NSL-KDD [119], to evaluate their methods. Besides the strong criticism by McHugh [120], Mahoney et al. [121] and Thomas [122] for being unrealistic and not balanced, nowadays the DARPA datasets are simply out of date in the context of network services and attacks. According to Brauckhoff et al. [123], a realistic simulation of legitimate traffic is largely an unsolved problem today and one of the solutions is combining generated anomalies with real, legitimate traffic traces. In [123] and then in [124], Brauckhoff et al. introduced the FLAME tool for injection of hand-crafted anomalies into a given background traffic trace. This tool is freely available, but the current distribution does not include any models reflecting anomalies. Another interesting concept was introduced by Shiravi et al. [125]. The authors proposed to describe network traffic (not only flows) by a set of so-called α and β profiles which can subsequently be used to generate a synthetic dataset. The α-profiles consist of actions which should be executed to generate a given event in the network (such as an attack), while in β-profiles certain entities (packet sizes, number of packets per flow) are represented by a statistical model. Regrettably, this tool is not freely available.
All things considered, the effort to build our own dataset was undertaken due to:
• the limited availability of such datasets;
• the lack of proper labeling in shared datasets;
• the fact that most of the available datasets are obsolete in terms of legitimate traffic and anomalies;
• the absence of realistic data in synthetic datasets;
• the small number of datasets with flows (conversion from packets is needed, labels are lost);
• the incompleteness of data (narrow range of anomalies, lack of anomalies related to botnet-like malware).

Legitimate Traffic
Firstly, one week of legitimate traffic from a medium-size network connected to the Internet was captured. This was accomplished using open source software: Softflowd [94] and NfDump [16]. Because the daily profile of each working day in this traffic is similar (except for some minor differences on Monday morning and Friday afternoon), a one-day profiling approach was chosen. From the whole traffic it was enough to extract two days (Tuesday, Wednesday) in order to build the dataset. The first day is reserved for training (only legitimate traffic) and the second day for detection (legitimate traffic + injected anomalies). The profile of this 2-day traffic, expressed by the number of flows (before any injection of anomalies), is depicted in Figure 7. Time t is on the x axis (5 min fixed time window) and the number of flows is on the y axis (log scale). The working day starts around 7 a.m. and finishes around 4 p.m. The volume of the traffic expressed by the number of flows is similar for both days.
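The flow-count profile described above can be reproduced from exported flow records with a simple binning step. The following is a minimal sketch, assuming flow start times are available as Unix timestamps in seconds; the function name and the input format are illustrative and are not taken from Softflowd or NfDump.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute fixed time window, as used in the text


def flows_per_window(flow_start_times, window_seconds=WINDOW_SECONDS):
    """Count flows per fixed time window.

    flow_start_times: iterable of flow start times (Unix seconds); this
    input format is an assumption made for the sketch.
    Returns a dict mapping window start (Unix seconds) -> number of flows,
    ordered by window start.
    """
    counts = Counter()
    for ts in flow_start_times:
        # Align each timestamp to the beginning of its window.
        window_start = int(ts) - int(ts) % window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))
```

Plotting the resulting counts on a logarithmic y axis gives a profile of the kind the text describes for the two selected days.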
In the next step, different scenarios of malicious network activities were implemented. Synthetic anomalies typical of botnet-like network behavior were generated and then injected into the legitimate traffic. To produce the synthetic anomaly traces, a dedicated tool was developed in Python. More details about the tool and the generation process can be found in our research [126].

Scenario 1
In this scenario, small and slow SSH brute force, port scan, SSH network scan and TCP SYN flood DDoS anomalies in different variants were generated. These anomalies do not form any realistic traces of malware, but detection and proper classification of such a set of anomalies is crucial because they are typical of the network behavior of botnet-like malware. The main characteristics of the generated anomalies are presented in Table 4. The generated anomalies were mixed with the legitimate traffic from Day2 (Wednesday) in the way presented in Figure 8. Anomalies are not injected into the traffic from Day1 (Tuesday), as it is intended for the profile of legitimate traffic. As one can see, each anomaly is injected every 15 min, mainly during working hours. After injection, only a few anomalies are visible in the volume expressed by the number of flows, as depicted in Figure 9.

Scenario 2
In this scenario, we prepared a much more realistic sequence of modern botnet-like malware network behavior. The subsequent stages look as follows:
1. One of the hosts in the local network gets infected with botnet-like malware. In order to propagate via the network, it starts scanning its neighbors. The malware is looking for hosts running Remote Desktop Protocol (RDP) services. RDP is a proprietary protocol developed by Microsoft, which provides a user with a graphical interface to connect to another computer over a network. RDP servers are built into Windows operating systems. By default, the server listens on TCP/UDP port 3389.
2. Hosts serving Remote Desktop services are attacked with a dictionary attack (similar to the technique found in the MORTO worm [127]).
3. After a successful dictionary attack, vulnerable machines are infected and become members of the botnet.
4. Peer-to-peer communication based on the UDP transport protocol is established among the infected hosts.
5. On command from the C&C server, botnet members start a low-rate Distributed Denial of Service attack called Slowloris [128] on an external HTTP server. After a few minutes the server is blocked.
Main characteristics of generated anomalies are presented in Table 5. Anomalies generated for the scenario were mixed with the legitimate traffic from Day2 (Wednesday) in the way presented in Figure 10.

Scenario 3
In this scenario, we prepared another realistic sequence of modern botnet-like malware network behavior. The subsequent stages look as follows:
1. One of the local hosts, which is infected with modern botnet malware, starts scanning its neighbors in order to propagate via the network. It uses a network propagation mechanism similar to the one employed in the Stuxnet worm [129,130]. The malware is looking for hosts with open TCP and UDP ports reserved for Microsoft Remote Procedure Call (RPC). In Windows, RPC is an interprocess communication mechanism that enables data exchange and invocation of functionality residing in a different process, locally or via the network. The list of ports used to initiate a connection with RPC is as follows: UDP: 135, 137, 138, 445; TCP: 135, 139, 445, 593.
2. Hosts with open RPC ports are attacked with specially crafted RPC requests.
3. After a successful attack, vulnerable machines are infected and become members of the botnet.
4. A direct communication to a single C&C server is established on each infected host.
5. On command from the C&C server, botnet members start a DDoS amplification attack based on the Network Time Protocol (NTP). This attack is targeted at an external server. Botnet members send packets with a forged source IP address (set to the one used by the victim). Because the source IP address is forged, the remote server replies and sends data to the victim. Moreover, the attack is amplified via NTP: attackers send a small (234 bytes) packet "from" a forged source IP address with a command to get a list of interacting machines, and the NTP server sends a large (up to 200 times bigger) reply to the victim. As a result, attackers turn a small amount of bandwidth coming from a few machines into a significant traffic load hitting the victim. More details regarding NTP amplification in DDoS attacks can be found in [131].
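The amplification effect in the last stage can be made concrete with a short calculation. The sketch below uses the two figures given in the text (a 234-byte request and a reply up to 200 times bigger); the number of bots and the request rate are hypothetical values chosen only for illustration.

```python
REQUEST_BYTES = 234          # size of the forged request, from the text
AMPLIFICATION_FACTOR = 200   # upper bound on reply size vs. request size, from the text


def amplified_load_bps(bots, requests_per_second,
                       request_bytes=REQUEST_BYTES,
                       factor=AMPLIFICATION_FACTOR):
    """Return (attacker_bps, victim_bps) in bits per second.

    attacker_bps: total bandwidth spent by all bots on forged requests.
    victim_bps:   upper bound on the reply traffic reflected to the victim.
    """
    attacker_bps = bots * requests_per_second * request_bytes * 8
    victim_bps = attacker_bps * factor
    return attacker_bps, victim_bps


# Hypothetical example: 5 bots, each sending 10 forged requests per second.
attacker, victim = amplified_load_bps(bots=5, requests_per_second=10)
```

With these assumed values the bots together spend under 0.1 Mbit/s, while the victim may receive up to roughly 18.7 Mbit/s, which is why the attack remains low and slow on the attacker side.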
The main characteristics of the generated anomalies are presented in Table 6. Anomalies generated for the scenario were mixed with the legitimate traffic from Day2 (Wednesday) in the way presented in Figure 12. As one can see, the whole scenario, which consists of four anomalies, is injected every hour during working time. Similarly to Scenario 2, anomalies in this scenario are low and slow and they represent only a small fraction of the total traffic. After injection, none of them is visible in the volume expressed by the number of flows, as depicted in Figure 13.

Verification of the Approach
This section presents the verification of the proposed method. The aim of the verification is to check whether the proposed method is able to detect network anomalies and categorize them. Firstly, results of correlation tests performed in order to find the proper set of α-values and network features are presented. Next, the performance of the Tsallis, Renyi, Shannon and volume-based versions of the method is evaluated. Finally, conclusions are given.

Correlation
Firstly, correlation tests for various α-values and for various feature distributions were performed. This is important, as strong correlation suggests that some results are closely related to each other, and thus it may be sufficient to restrict the scope of α-values and network features without impairing the validity of the method.
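In the experiments, Pearson and Spearman [132] correlation coefficients were used. For a sample of discrete random variables X, Y of size n, the formula for the Pearson coefficient is defined as:
$$\mathrm{corr}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the sample means. The formula for the Spearman coefficient for a sample of discrete random variables X, Y is defined as:
$$\rho(X, Y) = \mathrm{corr}(RX, RY)$$
where corr is the Pearson correlation coefficient for a sample, RX are the ranks of X, and RY are the ranks of Y.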
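Both coefficients can be computed with a few lines of code. The sketch below is a plain-Python illustration using the standard sample definitions, with average ranks for ties; the function names are illustrative, and this is not the tooling used in the experiments.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Applied to two entropy timeseries (one value per time window), `pearson` measures linear dependence, while `spearman` captures any monotonic relation between them.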
The results of correlation between entropy timeseries for different α-values are presented in Table 7. This table shows the pairwise Tsallis α correlation scores from the range −1..1, where absolute values in the intervals 0.9 to 1, 0.7 to 0.9, 0.5 to 0.7 and 0 to 0.5 denote, respectively, strong, medium, weak, and no correlation. The sign determines whether the correlation is positive (no sign) or negative (−). The presented values (see Table 7) are averages of the scores from 15 different feature distributions. Only results based on Tsallis entropy are presented, as those obtained for the Renyi entropy were similar.
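The interpretation scale above can be written as a small helper. This is only a sketch: the text does not state which side of each boundary is inclusive, so treating the lower bound of each interval as inclusive is an assumption, as is the function name.

```python
def correlation_strength(score):
    """Map a correlation score in [-1, 1] to (strength, direction).

    Assumption: the lower bound of each interval is inclusive,
    e.g. exactly 0.9 counts as strong and exactly 0.7 as medium.
    """
    magnitude = abs(score)
    if magnitude > 1:
        raise ValueError("correlation score must lie in [-1, 1]")
    if magnitude >= 0.9:
        strength = "strong"
    elif magnitude >= 0.7:
        strength = "medium"
    elif magnitude >= 0.5:
        strength = "weak"
    else:
        strength = "none"
    direction = "negative" if score < 0 else "positive"
    return strength, direction
```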
Table 7. Results of linear and rank correlation of α.
It should be noted that there is a strong positive linear (Pearson) and rank (Spearman) correlation among negative α-values and a strong positive correlation among α-values higher than 1. For α = 0 there is some small positive correlation with negative values. For α = 1 (Shannon) there is a medium correlation with α = 2 and α = 3. These results suggest that it is sufficient to use α-values from the range −2..2 to obtain different and distinctive sensitivity levels of entropy.
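To illustrate why different α-values yield different sensitivity levels, the sketch below computes Shannon, Renyi and Tsallis entropy using their standard textbook definitions with the natural logarithm. These definitions and the function names are assumptions made for illustration; the exact formulation and the normalization used in the method are those given earlier in the article.

```python
from math import log


def shannon(p):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(pi * log(pi) for pi in p if pi > 0)


def renyi(p, alpha):
    """Renyi entropy: log(sum p_i^alpha) / (1 - alpha); Shannon at alpha = 1."""
    if alpha == 1:
        return shannon(p)
    # Zero probabilities are skipped so that negative alpha stays defined.
    return log(sum(pi ** alpha for pi in p if pi > 0)) / (1 - alpha)


def tsallis(p, alpha):
    """Tsallis entropy: (sum p_i^alpha - 1) / (1 - alpha); Shannon at alpha = 1."""
    if alpha == 1:
        return shannon(p)
    return (sum(pi ** alpha for pi in p if pi > 0) - 1) / (1 - alpha)


ALPHAS = (-2, -1, 0, 1, 2)  # the range suggested by the correlation results

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]   # one frequent value, three rare ones

profile_uniform = {a: tsallis(uniform, a) for a in ALPHAS}
profile_skewed = {a: tsallis(skewed, a) for a in ALPHAS}
```

For the skewed distribution, the terms with negative exponents are dominated by the rare events, while for α above 1 the sum is dominated by the frequent one, so the same change in traffic moves the entropy in different ways depending on α.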
Results of pairwise correlation between Tsallis entropy timeseries of different feature distributions are presented in Table 8 and Table 9. The results obtained for the Renyi entropy are not presented as they closely resemble those obtained for Tsallis.
The results for one positive and one negative value of α are presented because they differ significantly.Averaging (based on results from the whole range of α-values) would hide an essential property.It is noticeable that there is a strong positive correlation of addresses and ports for negative values of α but no correlation for positive α-values.

Performance Evaluation
Experiments were performed for the Tsallis, Renyi and Shannon versions of our method as well as for the traditional volume-based approach with flow, packet and byte counters. Final evaluation was performed with Weka [99]. Experiments were performed with the dataset presented in Section 5. Exemplary results of entropies for selected feature distributions are presented below. Abnormally high dispersion in the destination address distribution for network scan anomalies, exposed by negative values of the α parameter, is depicted in Figure 14. Time t is on the x axis (5-minute time windows), result r on the y axis and α-values on the z axis. The r value corresponds to the normalization applied in our method (Equation (16)). Values of r outside the (0..1) threshold are considered anomalous. Anomalies are marked with (A) on the time axis. Values of Shannon entropy are denoted as S. Abnormal concentration of flow durations for network scans is depicted in Figure 15. This concentration is typical for anomalies with a fixed data stream, i.e., anomalies where all flows have similar size. Figure 16 shows ambiguous detection (no significant excess of the 0−1 threshold) of a port scan anomaly with the volume-based approach using flow, packet and byte counters. While experimenting, we noticed that measurements for all feature distributions as a group work better than single ones or subsets. In our experiments, the address, port and duration feature distributions turned out to be the most deterministic, although we believe that the proper set of network features is specific to particular anomalies. Overall (whole data set, all feature distributions) multi-class classification was performed with Weka. We defined n classes: one for each anomaly type and one class for the legitimate traffic. In order to properly evaluate predictive performance, the 10-fold cross-validation method was used [99]. From the performance point of view, every classification attempt can produce one of four outcomes presented in Figure 17. An ideal classifier should not produce False Positive (FP) and False Negative (FN) statistical errors. To evaluate non-ideal classifiers, one can measure Accuracy (ACC), the proportion of correct assessments to all assessments; False Positive Rate (FPR), the share of benign activities reported as anomalous; and False Negative Rate (FNR), the share of anomalies missed by the detector. Another option is to use Precision (the proportion of reported anomalies that are actual anomalies) and Recall (the share of correctly reported anomalies compared to the total number of anomalies). Based on these measures, tools such as Receiver Operating Characteristics (ROC) and Precision vs Recall (PR) curves are typically used [133,134].
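The measures named above follow directly from the four outcome counts. The sketch below uses the standard definitions in terms of true/false positives and negatives; the function name and the convention of returning 0.0 for an empty denominator are assumptions made for this illustration.

```python
def _ratio(numerator, denominator):
    # Convention assumed here: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, FPR, FNR, Precision and Recall from outcome counts.

    tp: anomalies reported as anomalous   fp: benign reported as anomalous
    tn: benign reported as benign         fn: anomalies that were missed
    """
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
        "Precision": _ratio(tp, tp + fp),
        "Recall": _ratio(tp, tp + fn),
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=8, fp=2, tn=88, fn=2)
```

Note that Recall and FNR always sum to one, so reporting both is redundant; ACC alone can look high when benign traffic dominates, which is why FPR is reported next to it.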
Formulas for the mentioned metrics, as well as some additional measures which can also be used to evaluate the performance of a classifier, are presented in Table 10. In our approach we deal with a multi-class classification problem, where more than two classes are utilized, instead of a single binary classification (or detection), where only two classes, e.g., anomalous and not anomalous, are used. We classify instances into one of many classes such as port scan, network scan, brute force, etc. We use classifiers from Weka which internally transform the multi-class problem into multiple binary ones. One possible way to handle it is One-vs-All classification [135]. The idea behind this method is:
• Take n binary classifiers (one for each class);
• For the ith classifier, let the positive examples be all the points in class i, and let the negative examples be all the points not in class i;
• Let $f_i$ be the ith classifier;
• Classify with the following rule: $$f(x) = \arg\max_i f_i(x)$$
Averaged Accuracy and averaged FPR results based on Scenario 1 are presented in Table 11. As one can see, the results for several popular classifiers are presented. ZeroR is a trivial classifier which classifies the whole traffic as not anomalous. We included it here as a reference for the other results, as it is expected that other classifiers should perform better. Using weighted Accuracy and weighted FPR is the most popular way to measure the performance of multi-class classification, but we also propose our own measurement tool, namely weighted ROC curves, which are presented later in this section. Evaluation results based on Scenario 2 and Scenario 3 are presented in Table 12 and Table 13, respectively. It is noticeable that the best performance in each scenario was obtained by applying SimpleLogistic. In Weka, SimpleLogistic is a classifier for building linear logistic regression models [136]. Logistic regression comes from the fact that linear regression [137] can also be used to solve classification problems. The idea of logistic regression is to make linear regression produce probabilities; thus, instead of a class prediction, there is a prediction of class probabilities. More details on SimpleLogistic can be found in [136,138]. If we look at the detailed results of SimpleLogistic (Renyi entropy case) for all scenarios (Table 14), one can see that different classes are characterized by rather different recognition performance. For example, the models for network scan and not anomalous are very strong, whereas the one for p2p is much weaker.
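The One-vs-All rule and the idea of a linear model producing probabilities can be combined in a short sketch. Each class is given a hypothetical linear model (weights and bias) whose output is passed through the logistic function, and the class with the highest score wins. This is an illustration of the general technique only; it does not reproduce how Weka builds or combines SimpleLogistic models, and all names and values are made up for the example.

```python
from math import exp


def logistic_score(weights, bias, features):
    """Linear model followed by the logistic function, giving a value in (0, 1)."""
    z = bias + sum(w * v for w, v in zip(weights, features))
    # Numerically stable form of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def one_vs_all_predict(models, features):
    """Pick the class whose binary model gives the highest score.

    models: dict mapping class label -> (weights, bias) of its binary model.
    Returns (predicted_label, scores_per_class).
    """
    scores = {label: logistic_score(w, b, features)
              for label, (w, b) in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical models over two features, for illustration only.
toy_models = {
    "port scan": ([1.0, 0.0], 0.0),
    "network scan": ([0.0, 1.0], 0.0),
    "not anomalous": ([-1.0, -1.0], 0.0),
}
label, scores = one_vs_all_predict(toy_models, [2.0, -1.0])
```

Because the per-class scores come from independently defined binary models, they do not have to sum to one; only their ordering matters for the final decision.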
As mentioned before, ROC plots can also be used to evaluate the performance of a classifier. A ROC plot presents a more detailed characteristic of a classifier than ACC. The ROC curve is obtained for a classifier by plotting FPR on the x-axis and TPR on the y-axis. The Area Under the Curve (AUC) is a scalar measure connected with a ROC. While evaluating the classifier, the ROC plot considers all possible operating points (thresholds) in the classifier's prediction in order to identify the operating point at which the best performance is achieved. A ROC curve does not directly present the optimal value; instead, it shows a tradeoff between TPR and FPR. Depending on the goals, one can change the operating point in order to limit FPR or to increase TPR. An exemplary ROC for a perfect (a), partially overlapped (b) and random (c) classifier is presented in Figure 18. ROC is only applicable to the binary classification case. As more than two classes are considered in our approach, we can analyze individual ROC curves for each of the classes separately, as presented in Figure 19. Based on such an analysis we can find the performance of a particular classifier for each class. This may be useful for finding the best classifier for a specific anomaly, but this is out of the scope of this article. In this work we are looking at classifiers which are (on average) the best for all classes. This is typically measured by weighted ACC and weighted FPR; however, these measures hide some important characteristics. Thus, we propose a method of calculating a multi-class ROC based on the weighted results of the binary ROC for each individual class. Weka has a feature to generate and save to files individual ROC curves for each of the classes of a multi-class classifier separately. A Weka ROC file consists of operating points (threshold values) and confusion matrices containing the relevant TP, FN, TN, FP values for binary classification of a particular class. Our approach is to take the ROC files generated by Weka (one file for each class in the dataset) and process them in order to average the results. The idea is to average the corresponding ROC for each class with respect to the number of class instances. As a result we obtain one weighted ROC based on all binary ROC results. Weighted ROC curves for the SimpleLogistic classifier for all scenarios are depicted in Figure 20, Figure 21 and Figure 22, respectively. It is noticeable that those for Tsallis and Renyi entropy are better than that for Shannon. The ROC curves for the volume-based approach show that a classifier based on this approach performs really poorly.

Summary

Conclusions
The general conclusion of our studies is that it is possible to detect modern botnet-like malware based on anomalous patterns in the network with an entropy-based approach. Summarizing the particular results of our experiments, we can observe that:
• Tsallis and Renyi entropy performed best;
• Shannon entropy turned out to be worse both in Accuracy and False Positive Rate as well as in weighted ROC curves;
• the volume-based approach performed poorly;
• using a broad spectrum of network traffic features is essential to successfully detect and classify different types of anomalies; this was shown both by the results of feature correlation and by the good results of classification of different anomalies in the tested scenarios;
• using α-values from the set {−2, −1, 0, 1, 2} is a proper choice; this was shown by the results of α-value correlation and by the good results of classification of different anomalies in the tested scenarios; using a bigger set of α-values is redundant; using one α-value is not enough to recognize different types of anomalies;
• the most suitable classifier for our approach (among the popular classifiers available in Weka) is SimpleLogistic, which relies on linear logistic regression.
While we admit that our experiments were limited to a small number of cases, we also believe that these cases were representative. Our dataset contains traces of malicious network activities which are typical of botnet-like malware propagation, communication and attacks performed by such malware. Although only a one-day legitimate traffic profile was built in our experiments, we observed that this profile suits each regular working day in the network we monitored, so there was no need to prepare a whole-week profile. The weak performance of the Shannon entropy and the poor performance of volume-based counters call into question whether they are the right approach to the detection of anomalies caused by botnet-like malware.

Further Work
Multiclass classification usually means classifying a data point into only one of the many (more than two) possible classes. It is much more advanced and sophisticated than simple detection, where only two classes, e.g., anomalous and not anomalous, exist. However, the multiclass approach does not solve the problem when more than one class should be assigned to one instance. For example, an instance may belong to the port scan and brute force classes simultaneously because both anomalies appeared at the same time. With multi-label classification [139,140] one can classify a data point into more than one of the possible classes. In this work we do not cover the multi-label problem; however, this is one of the directions for further work.

Figure 4. Shannon, Renyi and Tsallis entropy for a uniform distribution.

Figure 7. Legitimate traffic profile by number of flows.

Figure 8. Distribution of anomalies in time.

Figure 9. Legitimate and anomalous traffic by number of flows.

Figure 10. Distribution of anomalies in time.

Figure 11. Legitimate and anomalous traffic by number of flows.

Figure 12. Distribution of anomalies in time.

Figure 13. Legitimate and anomalous traffic by number of flows.

Figure 16. Ambiguous detection of port scan anomaly with a volume-based approach.

Figure 17. Possible results of classification.

Table 1. Probability distribution of X.

Table 2. Impact of frequent and rare events on the value of parameterized entropy.

Table 3. Selected traffic feature distributions.
number of pkts(bytes) with x_i as src(dst) addr (port) / total number of pkts(bytes); in(out)-degree: number of hosts with x_i as in(out)-degree / total number of hosts

Table 4. Characteristics of anomalies.

Table 5. Characteristics of anomalies.

Table 6. Characteristics of anomalies.

Table 8. Results of correlation of features for α = −3.

Table 9. Results of correlation of features for α = 3.

Table 10. Metrics used to evaluate performance of classification.