A Review on Hot-IP Finding Methods and Its Application in Early DDoS Target Detection

On the high-speed connections of the Internet or computer networks, the IP (Internet Protocol) packet traffic passing through the network is extremely high, and that makes it difficult for network monitoring and attack detection applications. This paper reviews methods to find the high-occurrence-frequency elements in the data stream and applies the most efficient methods to find Hot-IPs that are high-frequency IP addresses of IP packets passing through the network. Fast finding of Hot-IPs in the IP packet stream can be effectively used in early detection of DDoS (Distributed Denial of Service) attack targets and spreading sources of network worms. Research results show that the Count-Min method gives the best overall performance for Hot-IP detection thanks to its low computational complexity, low space requirement and fast processing speed. We also propose an early detection model of DDoS attack targets based on Hot-IP finding, which can be deployed on the target network routers.


Introduction
In network monitoring and attack detection applications, the collection and processing of data packets passing through the network is a big challenge because the number of transferred packets is huge, especially on high-speed connections.In addition, apart from the processing of information items in packet headers, many applications also need to process the packet payloads.This causes significant delays in the processing of data packets and can result in network traffic congestion.Therefore, it is necessary to select suitable packet information items and appropriate processing methods to speed up the processing of data packets.
In an IP (Internet Protocol) packet header, the source IP and destination IP are important information for the transfer of the packet from the source host to the target host.In DDoS (Distributed Denial of Service) attacks, a huge number of packets of fake requests are sent to the target host to exhaust system resources or to flood the network connection.Since the destination IP of these packets is the attacked target host, it is possible to detect a DDoS attack in the early stage by monitoring high-frequency destination IPs in the target network router.Similarly, the monitoring can be done in ISP (Internet Service Provider) routers to track source IPs of hosts that originate a large number of packets.These IPs may be the addresses of hosts infected with network worms and these worms are scanning the network for the next target hosts.IP addresses with a high occurrence frequency in the IP packet stream are called Hot-IPs.Therefore, the problem of target detection of DDoS attacks or the detection of network worm sources can be solved by monitoring the IP packet stream transferred through the network to find Hot-IPs [1][2][3].
On a certain network connection, an IP packet stream is a sequence of IP packets, which can be represented as S = {a 1 , a 2 , . . ., a M }, where there are M packets with N unique IP addresses.Suppose f i to be the occurrence frequency of the packet with IP address s i in S, and then we have f i = {j|s j = s i }, where 1 ≤ i ≤ N and 1 ≤ j ≤ M. We also have f 1 + f 2 + . . .+ f N = M and, given the occurrence frequency threshold ϕ, the Hot-IP = {s i |f i ≥ ϕ} [1,2].
The problem of finding Hot-IPs in the IP packet stream can be solved using algorithms for finding elements with a high occurrence frequency in the data stream.There are a number of such algorithms, including Majority, Frequent, Lossy Counting, Space Saving, Count-Sketch, Count-Min and Group Testing [4][5][6][7][8][9][10][11][12].Section 2 of this paper will briefly discuss these algorithms.
Huynh et al. [1,2] proposed to use the non-adaptive Group Testing method for fast finding of Hot-IPs in the IP packet stream and to apply the results in the detection of DDoS attacks and network worm spreading sources.Since the computational complexity of the Group Testing method (O(tN), where N is the number of unique IP addresses and t is the number of tests) is relatively high, it is not efficient for the processing of the IP packet stream in heavy traffic [1].To solve this issue, they use the Reed Solomon code concatenation method to construct the d-disjunct matrix m [1,2].Thus, the storage space of matrix m is significantly reduced and the computational complexity is also reduced, which is equivalent to the polynomial time.
In this paper, we evaluate methods to find high-occurrence-frequency elements in the data stream and apply them to find Hot-IPs in the IP packet stream passing through the network.Based on the computational and space complexity evaluation results of each method, the most efficient method is selected for use in the target detection model of DDoS attacks.
The rest of the paper is organized as follows: Section 2 reviews and compares methods for finding high-occurrence-frequency elements in the data stream and applies them in finding Hot-IPs in the IP packet stream.Section 3 presents the proposed target detection model of DDoS attacks and Section 4 is our conclusion.

Majority
The Majority algorithm was proposed by Boyer-Moore in 1982 [4].A common element is defined as the element that has the number of occurrences greater than half of the total number of elements in the data stream.The algorithm can be summarized as follows: Given the data stream P = {a 1 , a 2 , . . ., a m } and N is the number of unique elements.Assume f to be the frequency vector of N unique elements, f = {f 1 , f 2 , . . ., f N } and f 1 + f 2 +, . . .+ f N = m.A high-occurrence-frequency element j is determined by j|f j > m/2.The Majority algorithm consists of two stages, including (1) finding elements with a high occurrence frequency if they exist, and (2) checking if these elements are really common elements based on their occurrence frequencies.

Frequent
The Frequent algorithm was proposed by Misra and Gries in 1982 [8].The algorithm attempts to find all elements that have an occurrence frequency greater than m/k, where m is the total number of elements in the data stream and k is the number of high-occurrence-frequency elements to find.It uses k pairs of elements and counters to monitor these k elements.Initially the k pairs are empty.The main idea of the algorithm can be described as follows: for each element in the data stream, if the element exists in k pairs, increase the element's counter by 1.If the element does not exist in k pairs and the number of pairs is less than k, add the new element into the k pairs and assign 1 to its counter.Otherwise, decrease all counters and remove elements from k pairs if their counters are 0. At completion, the algorithm can find most k elements that have an occurrence frequency greater than m/k.

Lossy Counting
The Lossy Counting algorithm was proposed by Manku and Motwani in 2002 [7].The algorithm uses a data structure that has three attributes, including the element, L (lower bound) and H (higher bound) to process the data stream.Assume that m is the total number of elements in the data stream and k is the number of high-occurrence-frequency elements to find.Set P = H − L. For each ith element in the data stream, if the element was stored in the data structures, set L ← L + 1; otherwise, create a structure for the element and set L = 1 and P = i/k.If H < i/k, the ith element is removed from data structures.At completion, the algorithm can find k elements that have an occurrence frequency greater than m/k.

Space Saving
The Space Saving algorithm was proposed by Metwally et al. in 2005 [11].The algorithm uses a data structure of k pairs, in which each pair consists of an element and its counter.Initially, the first k unique elements are assigned to k pairs and their counters are assigned to their corresponding numbers of occurrences.For the next element from the data stream, if the element is found in k pairs, increase its counter by 1. Otherwise, replace the smallest counter element with the new element and increase its counter by 1.At completion, the algorithm can find k elements that have an occurrence frequency descending from high to low.

Count-Sketch
The Count-Sketch algorithm was proposed by Charikar et al. for the first time in 2002 and the improved version of the algorithm was introduced in 2004 [12].Sketch is a terminology used to represent a data structure, or a linear mapping of the input frequency vector.Count-Sketch is used to find k elements that have a high occurrence frequency in the data stream.Sketch data structure is compact and the algorithm's processing speed is fast.
Count-Sketch uses two input variables, where ε is the error rate and δ is the error probability.To store counter variables in the processing, Count-Sketch uses an array sized w × d, where d is the number of hash functions and w is the space size used by a hash function.
The Count-Sketch algorithm consists of three stages: Stage 1 is for variable initialization; Stage 2 is a loop to process the data stream; and Stage 3 is for processing of the output result.The details of this algorithm can be found in [12].

Count-Min
The Count-Min algorithm was proposed by Cormode and Muthukrishnan in 2005 [5] to replace other sketch-based methods, such as Count-Sketch.The aim of Count-Min is to provide a simple and compact sketch data structure.Count-Min has been widely used in many computing areas thanks to its simple sketch data structure.Count-Min's data structure uses the linear reference system with some random vectors.These vectors are defined by a simple hash function.The increase of the hash function's operating scope will increase the counting accuracy and reduce the error rate.Count-Min's data structure can be expanded or shrunk to fit each specific case without compromising the accuracy.
Count-Min's data structure is a matrix of simple counting variables.The matrix has the width w and depth d, including elements from CM [1,1] to CM [d,w].Similar to other sketch-based methods, Count-Min also uses ε as the error rate and δ as the error probability and the algorithm consists of three stages: Stage 1 is for variable initialization; Stage 2 is a loop to process the data stream; and Stage 3 is for the calculation of occurrence frequencies of elements in the data stream.The details of this algorithm can be found in [5].

Group Testing
The Group Testing method was proposed by R. Dorfman in 1943 [9] to detect who was infected with syphilis in the US Army in the Second World War with the minimum number of blood tests.He first divided each soldier's blood sample into many smaller samples, and then he formed groups of soldiers and mixed their blood samples for each group to make group samples.Then, he carried out a test for each group sample.If the group test result is negative, all samples in the group are negative.If the group test result is positive, at least one sample in the group is positive.Repeat the procedure on the samples of the positive group until the infected sample is found.Thus, the number of tests had been decreased significantly.The purpose of the Group Testing method is to determine a sub-set of d positive elements of a large set of N elements with the number of tests as small as possible [1].
Group Testing methods can be classified into two types: adaptive Group Testing and non-adaptive Group Testing.In adaptive Group Testing, the next test is created based on the result of the previous test.Therefore, tests of this method must be done in sequence.On the other hand, in non-adaptive Group Testing, tests are independent and therefore tests can be processed in parallel to speed up the process.In this research, we use the non-adaptive Group Testing method proposed by Kautz and Singleton [9].Non-adaptive Group Testing requires that all tests must be designed in advance and then executed at the same time to produce the final result.
The problem of using the non-adaptive Group Testing method to find high-occurrence-frequency elements in a data stream can be stated as follows: Given the data stream a of M elements, in which there are N unique elements (M >> N), assume that there is a maximum of d high-occurrence-frequency elements; we need to design t tests for N elements.Construct the binary matrix m txN , where the matrix's columns represent elements and the matrix's rows represent tests.If m[i][j] = 1, it means that element j belongs to test i and if m[i][j] = 0, it means that element j does not belong to test i [9].The details of the non-adaptive Group Testing method can be found in [9].
We adopt the non-adaptive Group Testing method to find d high-occurrence-frequency elements used in [1,2,9], in which the d-disjunct matrix m txN is built using the Reed Solomon code concatenation.The binary matrix m txN is the d-disjunct matrix if and only if the union of any d columns does not contain any column in the matrix.The Reed Solomon code is the error-correcting code in the form of [n, k] q , where q is the alphabet size, n is the block length and k is the message length.The relationship among the three parameters is k < n ≤ q and it is often that n = q − 1.The construction procedure of the d-disjunct matrix using the Reed Solomon code concatenation can be found in [1,2,9].

Application in Finding Hot-IPs
In this section, we evaluate the performance of the Count-Sketch, Count-Min and Group Testing methods on finding Hot-IPs in IP packet streams extracted from the UCLA (University of California, Los Angeles) data sets [13].These methods have been widely used to find high-occurrence-frequency elements in the data stream because they have low computational complexity and low space requirements [1,2,4-7].

Experimental Data Sets and Parameters
The UCLA data sets [13] were collected from simulated DDoS attacks using DoS/DDoS attack tools.Raw data of collected packets were processed, whereby each packet is converted into a record.For TCP packets, each record consists of 11 fields, including Packet_TIME, IP_from, IP_to, PORT_from, PORT_to, LENGTH, FLAG, SEQ_from, SEQ_to, ACK and WIN.In the experiments in this section, the IP_from (source IP address) is used to find Hot-IPs.
The experimental data set extracted from the UCLA data sets [13] consists of 200,000 records (also contains M = 200,000 source IP addresses), in which there are N = 2214 unique IP addresses.
For the Count-Sketch and Count-Min methods, select ε = 0.005, δ = 0.00000001 and the number of IPs with the highest occurrence frequency k = 10.
We use a one-dimensional matrix C N to be the counter of the occurrence frequency of IP addresses in the IP packet stream.An IP is considered a Hot-IP if C[i] > L/(d + 1).

Experimental Results
Table 1 shows the result of finding 10 source IP addresses that have the highest occurrence frequency using the Count-Sketch and Count-Min methods.The IP occurrence frequencies found by both methods are almost the same.If we set the occurrence frequency threshold for a Hot-IP as 10,000, we get four Hot-IPs, which are at the top of the table (1.1.17.29, 1.1.6.6, 1.1.35.39 and 1.1.17.17).Table 2 shows the result of finding Hot-IPs using the Group Testing method.Since the purpose of the Group Testing method is to find d Hot-IPs, the IP occurrence frequencies are not the output parameters.It can be seen that the result of finding Hot-IPs using the Group Testing method is equivalent to the results of the Count-Sketch and Count-Min methods.

Comparison of Hot-IP Finding Methods
As discussed in Section 2.2, the Count-Sketch, Count-Min and Group Testing methods are capable of finding Hot-IPs correctly in the IP packet stream.In this section, we compare these methods based on the following criteria: (1) algorithm complexity; (2) space requirements; and (3) processing time in finding Hot-IPs.

Count-Sketch
The space required for this algorithm is O(d × w + 2d) because the counter matrix has the size of d × w and there are two hash functions for each row.Thus, the space required for the algorithm is and the average processing time for d elements is linear; therefore, the processing time required is O (d) = O(ln 1 δ ) [12].

Count-Min
Count-Min uses a counter matrix sized d × w and d hash functions, thus the space required for the algorithm is O (d × w) = O(ln 1  δ × e ε ).The processing time for each element is O (d) = O(ln 1 δ ) and the average processing time for d elements is linear; therefore, the processing time required is O (d) = O(ln 1 δ ) [6].
Table 3 presents the complexity of the Count-Sketch, Count-Min and Group Testing methods.

Comparison of Hot-IP Finding Time
We use data sets extracted from UCLA data sets [13] for our experiments.Our experimental data sets have the numbers of records (also the IP addresses) gradually increased to measure the processing time of the Count-Sketch, Count-Min and Group Testing methods.The sizes of the data sets are gradually increased from 10,000 IP addresses to a maximum of 1,000,000 IP addresses.Because of the differences in the input parameters and space required, we create three scenarios to compare these three methods on two pairs, including (1) Count-Sketch and Count-Min; and (2) Count-Min and Group Testing.
All experiments in this section are implemented in Java and run on a laptop machine with an Intel core i5-M460 2.53 GHz CPU, 8 GB RAM on 64-bit Microsoft Windows 8.For each dataset, five runs are executed and the average of processing time (excluding the time for I/O operations) is the final result. 1 Scenario 1: Compare Count-Sketch and Count-Min using the same initialization parameters ε = 0.005 and δ = 0.00000001.The experiment result is shown in Table 5. 2 Scenario 2: Compare Count-Sketch and Count-Min using the same processing space.For Count-Sketch, select ε = 0.01 and δ = 0.00005, and select ε = 0.00009 and δ = 0.00001 for Count-Min.The experiment result is shown in Table 4.
It can be seen from Tables 4 and 5 that Count-Min is significantly faster than Count-Sketch in both cases of using the same initialization parameters or using the same processing space.3 Scenario 3: Compare Count-Min and Group Testing using the same matrix size.For Count-Min, select ε = 0.00045 and δ = 0.00000001, and we have the counter matrix CM [19,6041].For Group Testing, select RS [15,3] 16 and d-disjunct as 6000 (corresponding to 6000 unique IPs), and we have d-disjunct matrix m[240,6000].The experiment result is shown in Table 6.The experimental results given in Table 6 show that Group Testing is slightly faster than Count-Min in finding Hot-IPs in the IP packet stream.However, the space requirement of Group Testing is much higher than that of Count-Min, even though they use the same width matrices because Group Testing uses the d-disjunct matrix with a much larger number of tests t.Overall, Count-Min is more efficient than Group Testing because its space requirement is much smaller than that of Group Testing while its processing time is almost equivalent to that of Group Testing.

Proposed Target Detection Model of DDoS Attacks
DDoS (Distributed Denial of Service) is the type of attack that makes the computer or network system overload, and the system cannot provide the service, or has to stop working.While a DoS attack is usually originated from one source, or a small number of sources, a DDoS attack is originated from large number of sources distributed all over the Internet.In real DDoS attacks, network service servers are "flooded" by a huge amount of requests sent from controlled hosts (also called zombies or bots) distributed on the networks [14].When the amount of requests is too large, the server is overloaded and fails to handle incoming requests.Consequently, legitimate users are not able to access the service provided by servers.Figure 1 illustrates a typical architecture of DDoS attacks.
Future Internet 2016, 8, 52 8 of 10 There have been a number of proposed measures to defend against DDoS attacks over the last decade.However, until now there has not been any solution capable of DDoS prevention comprehensively and effectively due to the complexity, scale and highly distributed nature of DDoS attacks [14][15][16][17].These DDoS defense measures can be classified into three groups: (1) measures based on deployment location; (2) measures based on network protocols; and (3) measures based on the time of action.Group (1) consists of measures that are deployed at the sources or targets of DDoS attacks.On the other hand, group (2) includes measures that defend DDoS attacks at the IP, TCP/UDP (Transmission Control Protocol/User Datagram Protocol), or application layers.Based on the time of action, group 3 includes pre-attack, in-attack and post-attack measures [14][15][16][17].In the proposed model, the Hot-IP/DDoS detector deployed on the target network router is responsible for capturing and processing all IP packets sent to service servers.The detector uses the sliding window method on the IP packet stream to find destination IPs that have high occurrence There have been a number of proposed measures to defend against DDoS attacks over the last decade.However, until now there has not been any solution capable of DDoS comprehensively and effectively due to the complexity, scale and highly distributed nature of DDoS attacks [14][15][16][17].These DDoS defense measures can be classified into three groups: (1) measures based on deployment location; (2) measures based on network protocols; and (3) measures based on the time of action.Group (1) consists of measures that are deployed at the sources or targets of DDoS attacks.On the other hand, group (2) includes measures that defend DDoS attacks at the IP, TCP/UDP (Transmission Control Protocol/User Datagram Protocol), or application layers.Based on the time of action, group 3 includes pre-attack, in-attack and post-attack measures [14][15][16][17].
Based on the analysis of DDoS architectures and the Hot-IP finding results presented in Section 2, we propose a DDoS target or victim detection model based on Hot-IP finding, as shown in Figure 2. The IP addresses of service servers are the destination IP addresses of IP packets sent to these servers.Under a DDoS attack, these IP addresses usually have an extremely high occurrence frequency.Therefore, if we deployed a Hot-IP/DDoS detector on the target network router, it is possible to detect the signals of a DDoS attack early.Our DDoS target detection model can be deployed at the target host, or would be best at the router of the target network.
In the proposed model, the Hot-IP/DDoS detector deployed on the target network router is responsible for capturing and processing all IP packets sent to service servers.The detector uses the sliding window method on the IP packet stream to find destination IPs that have high occurrence frequency.Count-Min is the method implemented for finding high-occurrence-frequency IPs in the detector.A threshold of occurrence frequency for Hot-IPs is determined in advance to identify the possibility of a DDoS attack.The threshold is determined based on the type of network services and the user access patterns when the system is in normal operation.Initial experiment results on a simulated environment show that the detector is capable of quickly identifying Hot-IPs correctly and based on that, it can detect DDoS attacks on network service servers.Based on the analysis of DDoS architectures and the Hot-IP finding results presented in Section 2, we propose a DDoS target or victim detection model based on Hot-IP finding, as shown in Figure 2. The IP addresses of service servers are the destination IP addresses of IP packets sent to these servers.Under a DDoS attack, these IP addresses usually have an extremely high occurrence frequency.Therefore, if we deployed a Hot-IP/DDoS detector on the target network router, it is possible to detect the signals of a DDoS attack early.Our DDoS target detection model can be deployed at the target host, or would be best at the router of the target network.In the proposed model, the Hot-IP/DDoS detector deployed on the target network router is responsible for capturing and processing all IP packets sent to service servers.The detector uses the sliding window method on the IP packet stream to find destination IPs that have high occurrence

Conclusions
Finding Hot-IPs in the IP packet stream flowing through the network can be used to detect targets or victims of DDoS attacks, or spreading sources of network worms.It can also be used in applications to monitor activities of network elements.This paper reviewed methods to find high-occurrence-frequency elements in the data stream and applied them in finding Hot-IPs in the IP packet flow passing through the network.Research results shows that Count-Min gives the best overall performance thanks to its low computational and space complexity, and its fast processing speed.We also proposed a Hot-IP finding-based model for early target/victim detection of DDoS attacks, which can be deployed on the router of the target network.
This research can be extended in the following directions: (i) complete the Hot-IP-based detection module and deploy it in the real environment; and (ii) optimize the Hot-IP finding module using embedded processors to speed up the processing of IP packets to be able to monitor large bandwidth network connections.

Figure 1 .
Figure 1.A typical architecture of DDoS attacks.

Figure 2 .
Figure 2. DDoS target detection model based on Hot-IP finding.

Figure 1 .
Figure 1.A typical architecture of DDoS attacks.

Figure 1 .
Figure 1.A typical architecture of DDoS attacks.

Figure 2 .
Figure 2. DDoS target detection model based on Hot-IP finding.

Figure 2 .
Figure 2. DDoS target detection model based on Hot-IP finding.

Table 1 .
Result of finding 10 source IP addresses having the highest occurrence frequency using Count-Sketch and Count-Min.

Table 2 .
Result of finding Hot-IPs using Group Testing.

Table 3 .
Complexity comparison of Count-Sketch, Count-Min and Group Testing methods.

Table 4 .
Time for finding Hot-IPs of Count-Sketch and Count-Min using the same processing space.

Table 5 .
Time for finding Hot-IPs of Count-Sketch and Count-Min using the same initialization parameters.

Table 6 .
Time for finding Hot-IPs of Count-Min and Group Testing using the same matrix size.