5.1. Limitation of Supervised Learning-Based Approach
In Section 3 and Section 4, we presented the proposed supervised learning-based approach to NATD identification using port response patterns, together with its evaluation. The approach was highly effective, with F1 scores of over 90%. Nevertheless, it has the following two main problems in practical applications.
First, it requires a very long time to collect port response patterns. It scans up to 65,536 ports for each of TCP and UDP, a total of 131,072 ports. To demonstrate the effect of the number of examined ports on the identification time, we configured two hosts as a scanner and a target, connected in the same subnet through an Ethernet switch with 1 ms latency. The port scanning was then performed using [24] rather than the proposed system to provide more general insight. The round trip time (RTT) was measured as the difference between the time when the scanner sends a probe packet and the time when it receives the response from the target. Note that the classification or identification task is performed only after all the response patterns on all the target ports have been collected. We also measured the processing time of this task.
Figure 4a shows the elapsed time for the TCP ports as the number of target ports varies. The RTT increases linearly with the number of ports, whereas the processing takes approximately 2 s regardless of the number of ports. The elapsed time for the UDP port scans also increases linearly with the number of target ports, as shown in Figure 4b, but the rate of increase is significantly larger than that of TCP. A comparison of Figure 4a,b reveals that the TCP scan time is small compared to the UDP scan time: the UDP port scan was measured to take approximately 1 s per UDP probe packet. The UDP scan takes significantly longer because UDP is a stateless and connectionless protocol, so the scanner must wait for a reply that may or may not arrive. Furthermore, even when the waiting time-out is set to zero, Linux limits ICMP messages to one reply per second. According to Table 2, a UDP port may respond with an ICMP packet when the probe packet is filtered. This is why we selected 200 TCP and UDP ports for the experiments described in Section 4.
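The cost of a full UDP scan can be illustrated with simple arithmetic. The following is a minimal sketch that assumes a sequential scan at roughly one probe per second, which is the per-probe cost reported above; the function name and the 100-port subset are illustrative assumptions, not part of the proposed system.

```python
# Rough UDP scan-time estimate, assuming about one probe per second
# (the per-probe cost reported in the text; real scans will vary).
FULL_PORT_RANGE = 65_536  # ports per protocol, as stated in the text


def estimated_udp_scan_seconds(num_ports: int, seconds_per_probe: float = 1.0) -> float:
    """Return the estimated duration of a sequential UDP scan in seconds."""
    if num_ports < 0 or seconds_per_probe < 0:
        raise ValueError("num_ports and seconds_per_probe must be non-negative")
    return num_ports * seconds_per_probe


if __name__ == "__main__":
    full_scan = estimated_udp_scan_seconds(FULL_PORT_RANGE)
    print(f"Full UDP range: {full_scan / 3600:.1f} h")

    reduced = estimated_udp_scan_seconds(100)  # hypothetical subset size
    print(f"100 UDP ports: {reduced / 60:.1f} min")
```

Under this assumption, scanning the full UDP range of a single host takes on the order of 18 hours, which is why restricting the port set matters.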
Second, intrusion detection systems (IDSs) and firewalls may consider probe packets as a threat or an attack. Because port scanning techniques are typically used by malicious users to identify vulnerabilities of target hosts, IDSs and firewalls are generally configured to block port scan packets as well as the host that sends them. To detect the port scan threat, a few studies and commercial services have defined thresholds in units of packets per second (PPS). To detect TCP-SYN and UDP flooding attacks, threshold values were set to 20 incomplete TCP-SYN and 10 UDP PPS in [33] and 200 incomplete TCP-SYN and 300 UDP PPS in [34]. Commercial network devices such as routers and firewalls also have their own default rules for detecting these attacks: 128 incomplete TCP-SYN and 500 UDP PPS for Cisco [35], and 25 incomplete TCP-SYN PPS for the Juniper Networks firewall [36]. The threshold values are summarized in Figure 5.
The port scanner of the proposed system architecture described in Section 3 sends 100 TCP-SYN and 100 UDP PPS per host to collect its port response patterns. When numerous hosts are to be scanned, the PPS value increases tremendously and exceeds the thresholds shown in Figure 5. The probe packets are then all filtered by IDSs or firewalls.
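The effect of scanning several hosts at once can be checked against the TCP-SYN thresholds quoted above. The sketch below assumes that hosts are scanned concurrently and that a device compares the aggregate rate from the scanner with its threshold; how each device actually counts packets is not specified here, so this is an illustration only, and the function and constant names are hypothetical.

```python
# TCP-SYN thresholds in packets per second, as quoted in the text.
TCP_SYN_THRESHOLDS_PPS = {
    "[33]": 20,
    "[34]": 200,
    "Cisco [35]": 128,
    "Juniper [36]": 25,
}
PER_HOST_TCP_SYN_PPS = 100  # probe rate per scanned host, as stated in the text


def exceeded_thresholds(num_hosts, per_host_pps=PER_HOST_TCP_SYN_PPS,
                        thresholds=TCP_SYN_THRESHOLDS_PPS):
    """List the thresholds exceeded when num_hosts are scanned concurrently.

    Assumption: the detector sees the aggregate rate num_hosts * per_host_pps.
    """
    aggregate_pps = num_hosts * per_host_pps
    return [name for name, limit in thresholds.items() if aggregate_pps > limit]


if __name__ == "__main__":
    for hosts in (1, 2, 3):
        print(hosts, "host(s):", exceeded_thresholds(hosts))
```

Under these assumptions, three concurrently scanned hosts already exceed every TCP-SYN threshold listed.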
5.2. DT-FS: DT-Based Fast and Stealthy NATD Identification
A simple way to address the limitations described in the previous subsection is to send fewer probe packets over a time period long enough to evade detection by IDSs or firewalls. However, this period becomes excessively long as the number of hosts increases, which may be impractical. Some works avoid this situation by optimizing the scanning rate, considering congestion and throughput in wireless LAN environments [37], IP addresses and their collected features for Internet-wide scans [38], and so on. However, those existing methods address different scan targets and are not applicable to NATD identification. Therefore, another effective method is required to solve the problem.
Here, we propose a fast and stealthy NATD identification method based on DT (called DT-FS) to overcome the limitations of the supervised learning-based approach. Based on the evaluation results in Section 4.3, we select DT as the underlying classification model because it exhibited relatively higher performance than the other models, with 94.7% accuracy and a 94.2% F1 score, as well as lower elapsed times for the training and identification phases. In addition, the DT constructed in the training phase has a hierarchical tree structure in which the NATD identification rules are represented by the paths from the root to the leaf nodes. The identification process starts from the root and is repeated until it reaches a leaf node. Because this process does not visit all the nodes in the tree, we can achieve fast identification by having the port scanner send probe packets only to the ports that lie on the visited path of the DT.
The algorithm for DT-FS is presented in Algorithm 2. Each node is represented as a feature vector comprising c, m, r, and the set of its child nodes, where c is the outcome at the node. In general, c at the root and intermediate nodes is set to TBD (To Be Determined), whereas c at a leaf node is true or false. m and r are the port number to be probed and its response pattern, respectively. The set of child nodes contains one child for each response pattern r ∈ {0x00, 0x01, ⋯, 0x05}, namely the node to branch to when the response pattern r is observed.
Algorithm 2 functions as follows. First, it examines c in the current node. If c is TBD, the node is not a leaf node and more response patterns are needed. The algorithm then sends a probe packet to the port m of the host h and obtains the response pattern r from h by using the function SendProbePacket( ). Here, r ∈ {0x00, 0x01, ⋯, 0x05} (lines 1–2). Then, it branches to the child node selected by r from the set of child nodes and calls the function DT-FS( ) again (line 3). If c is true or false, the node is a leaf, and the algorithm returns c as the identification value (line 5).
Because DT-FS uses DT as its classification model, its performance with regard to precision, recall, accuracy, F1 score, and AUC is identical to that of the supervised learning-based NATD identification method with the DT model (DT-SL) presented in Table 5.
However, unlike DT-SL, which collects all the response patterns on all the target ports of a host, DT-FS sends probe packets only to the ports that it visits on the DT's path. Accordingly, DT-FS significantly reduces the time for NATD identification while using substantially fewer port response patterns than DT-SL.
Algorithm 2 DT-FS Algorithm |
Input: node and h | // node = current DT node (c, m, r, child nodes), h is a target host |
Output: boolean value | // true or false |
function DT-FS(node, h): |
1: if c == TBD then |
2: r = SendProbePacket(h, m); | // get port response |
3: return DT-FS(child node of node for r, h); | // branch to child node |
4: else if c == true OR c == false then |
5: return c; | // identification result |
6: end if |
end function |
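The traversal in Algorithm 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Node` class, the `send_probe` callback, and the toy tree with its port numbers and response codes are all hypothetical, and `None` stands in for the TBD outcome.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Node:
    """Hypothetical DT node: outcome None plays the role of TBD."""
    outcome: Optional[bool] = None          # c: None (TBD), True, or False
    port: Optional[int] = None              # m: port to probe at this node
    children: Dict[int, "Node"] = field(default_factory=dict)  # keyed by r


def dt_fs(node: Node, host: str, send_probe: Callable[[str, int], int]) -> bool:
    """Follow one root-to-leaf path, probing only the ports on that path."""
    if node.outcome is None:                     # not a leaf: probe needed
        response = send_probe(host, node.port)   # r, assumed in 0x00..0x05
        return dt_fs(node.children[response], host, send_probe)
    return node.outcome                          # leaf: identification result


if __name__ == "__main__":
    # Toy tree with arbitrary ports and response codes, for illustration only.
    tree = Node(port=80, children={
        0x00: Node(outcome=False),
        0x01: Node(port=53, children={0x00: Node(outcome=True),
                                      0x01: Node(outcome=False)}),
    })
    canned = {80: 0x01, 53: 0x00}   # stubbed responses instead of real probes
    probed = []

    def fake_probe(host: str, port: int) -> int:
        probed.append(port)
        return canned[port]

    print(dt_fs(tree, "192.0.2.10", fake_probe), probed)
```

In this sketch only the ports on the visited path are probed, which is the property that makes the approach fast. A response pattern without a matching child raises `KeyError`, so a real implementation would need a fallback for that case.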
Figure 6 shows the cumulative distribution function of the number of probe packets required for NATD identification, for different ratios of the training set and of the test set to the total dataset. Note that the two ratios sum to 1. As is evident from Figure 6, some hosts can be identified with a very small number of probe packets. For example, 40% of the hosts can be identified with at most five probe packets. With fewer than 30 probe packets, all the hosts are identified for every combination of training and test set ratios.
Furthermore, it is evident that the larger the training set ratio is, the higher the number of probe packets that need to be sent. This is because, as the training set ratio increases, the paths from the root to the leaf nodes become deeper, and the identification requires more port response patterns. However, in all cases, the tree depth of DT-FS remains bounded, with a maximum of 30 probe packets.
The result in Figure 6 shows that NATD identification for a host can be performed with a small number of probe packets. We can therefore readily compute the elapsed time for identifying all the hosts in a network protected by an IDS or firewall as follows. Let τ and θ be the time unit and the threshold value of the number of packets, respectively, that an IDS or a firewall uses for the detection of abnormal behaviors. Let M and N be the number of probe packets and hosts, respectively, for the NATD identification. Then, the total time of NATD identification without being detected by the IDS or firewall is T = τ · ⌈(M · N) / θ⌉. Here, ⌈x⌉ denotes the smallest integer not less than x. For example, let us consider M = 30 and N = 1. Then, it requires 1 s for [35,36], and 2 s for [33,34], as shown in Figure 5.
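The timing estimate can be sketched as a small helper. The sketch assumes that the total time is the number of time units needed to send all M × N probe packets without exceeding the threshold in any single time unit, which is our reading of the description above. The function name and the threshold values in the usage lines are illustrative and are not taken from Figure 5.

```python
def total_identification_time(probes_per_host: int, num_hosts: int,
                              threshold: int, time_unit: float = 1.0) -> float:
    """Time to send all probes while staying within `threshold` packets
    per `time_unit` (assumed formula: time_unit * ceil(M * N / threshold))."""
    if threshold <= 0:
        raise ValueError("threshold must be a positive packet count")
    if probes_per_host < 0 or num_hosts < 0:
        raise ValueError("packet and host counts must be non-negative")
    total_packets = probes_per_host * num_hosts
    # Integer ceiling division avoids floating-point rounding surprises.
    time_units_needed = -(-total_packets // threshold)
    return time_units_needed * time_unit


if __name__ == "__main__":
    # Hypothetical threshold of 100 packets per 1 s time unit.
    print(total_identification_time(30, 1, threshold=100))    # one host
    print(total_identification_time(30, 50, threshold=100))   # fifty hosts
```

With these illustrative values, one host fits within a single time unit, and the total time then grows roughly linearly with the number of hosts.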