Adoption of IP Truncation in a Privacy-Based Decision Tree Pruning Design: A Case Study in Network Intrusion Detection System

: A decision tree is a transparent model where the rules are visible and can represent the logic of classiﬁcation. However, this structure might allow attackers to infer conﬁdential information if the rules carry some sensitive information. Thus, a tree pruning methodology based on an IP truncation anonymisation scheme is proposed in this paper to prune the real IP addresses. However, the possible drawback of carelessly designed tree pruning might degrade the performance of the original tree as some information is intentionally opted out for the tree’s consideration. In this work, the 6-percent-GureKDDCup’99, full-version-GureKDDCup’99, UNSW-NB15, and CIDDS-001 datasets are used to evaluate the performance of the proposed pruning method. The results are also compared to the original unpruned tree model to observe its tolerance and trade-off. The tree model adopted in this work is the C4.5 tree. The ﬁndings from our empirical results are very encouraging and spell two main advantages: the sensitive IP addresses can be “pruned” (hidden) throughout the classiﬁcation process to prevent any potential user proﬁling, and the number of nodes in the tree is tremendously reduced to make the rule interpretation possible while maintaining the classiﬁcation accuracy.


Introduction
The decision tree is a powerful white-box classifier encompassing branches and nodes to deliver an interpretable model in the form of "if-else" rules. However, as most of the information is visible in the tree-like structure, the model is highly susceptible to privacy attacks. For example, the protocol type or port number attribute that branches out to or from the IP address node in the tree model may reveal the possibility of certain services being available on the particular machine [1].
Although Li and Sarkar [2] had proposed a privacy pruning model based on risk assessment, the approach is entirely different from the solution we proposed in this paper. Specifically, we utilise a case study in the network intrusion detection system (NIDS) domain to illustrate how we tackle the privacy issue as mentioned earlier. Furthermore, to the best of our knowledge, none of the previous works designed a privacy pruning solution from the perspective of IP truncation.
The IP address is viewed as personal data when its usage is combined with other information and can be used for profiling an individual [3,4]. For example, an attacker can deduce sensitive information such as individual online behaviours and personal communication from the IP address, and other information from the network traces [4]. IP truncation can be embedded into the C4.5 tree to anonymise any sensitive information [5,6]. A decision tree model often suffers from overfitting its training data due to its greedy approach in building the classifier. Pruning is often used to simplify an unpruned decision tree by removing the insignificance nodes, branches, or leaves [7][8][9]. Various pruning techniques have been proposed in the literature to yield a decision tree model with higher performance (i.e., classification accuracy) and better generalisation capability. Since the focus of this work is on tree pruning, some existing pruning strategies are discussed. The most common usage of pruning is to improve the decision tree model's classification accuracy (i.e., reduced classification error). In general, nodes or leaves will be pruned if the classification accuracy for the pruned version of the tree is better than the unpruned version. Reduced Error Pruning [10], Pessimistic Error Pruning [10], Error-based Pruning [11], Critical Value Pruning [12], Minimum Error Pruning [12], and Improved Expected Error Pruning [13] are some of the notable pruning algorithms that endeavour to minimise the error rate of the decision tree model. In the default settings, the C4.5 decision tree [11] uses Error-based Pruning (an improved version of Pessimistic Error Pruning).
A decision tree model can be viewed as a sequence of "if-else" rules flowing from the root node to the leaf node to provide a classification decision or regression result. Though maximising the classification accuracy is important, the structural complexity (i.e., number of nodes, branches, and leaves) of a tree should not be overlooked. This is because a tree with over-complicated tree structures would hinder the interpretation of the decision rules. Therefore, some researchers opt to build a simpler decision tree through the pruning mechanism to unleash the interpretability of a decision tree. On top of this, a simpler tree structure is more computational friendly. A few of the well-known pruning schemes that seek to reduce the complexity of the tree structure include Cost-Complexity Pruning [14], improved Cost-Complexity Pruning [15], and the Optimal Pruned Tree [16]. Among them, the Cost-Complexity Pruning was employed in the renowned Classification and Regression Trees (CART) developed by Breiman et al. [14].
Apart from pruning with a single objective such as maximising the classification accuracy or reducing the number of nodes, Osei-Bryson [17] pointed out another evaluation criteria-the stability and interpretability of a tree to assess the quality of the model. Fournier and Crémilleux [18] designed a Depth-Impurity Pruning based on the Impurity Quality Node index that considers the average distance of the data belonging to a leaf node to its centre of mass [19] and also the depth of a node. In contrast to Cost-Complexity Pruning [14], which only considers the number of nodes, Wei et al. [20] include the depth (i.e., level) of the tree and classification accuracy as the pruning evaluation criteria by utilising rough set theory [21].
In the purview of the unbalanced classes in most real-world classification problems, researchers had proposed a Cost-Sensitive Pruning technique to tackle this issue. It is very common to treat all classes as equally significant during the pruning process. However, this may not work perfectly for certain applications. In the medical field, a misclassification (i.e., false negative) in certain scenarios might be more expensive than a false positive. For instance, the consequence of identifying a patient with cancer as a healthy patient is more expensive than identifying a healthy patient as a patient diagnosed with cancer [22]. Thus, Knoll, Nakhaeizadeh, and Tausend [22] proposed a Cost-Sensitive Pruning that incorporates weights for each class in the dataset. In their work, they supplied the cost matrix directly to the algorithm. To evaluate the optimum ratio of misclassification costs (i.e., cost matrix), Bradley and Lovell [23] exploited the receiver operating characteristics curve, and Ting [24] incorporated instance weighting with a C4.5 decision tree to compute the cost matrix. Decision tree models exploit the concepts of information gain to split the training dataset based on the features selected into a smaller group. In most cases, the leave nodes will only contain a very small subset of the training data. This scenario will lead to reidentification attacks through information linking [25], as described by Li and Sarkar [2]. Thus, Li and Sarkar [2] proposed a new pruning metric that evaluates each node's disclosure risks. They measured the risks by counting the number of data available in each node. A larger number of data would imply a lower risk, while a smaller number would suggest a higher risk of information disclosure. To prevent re-identification attacks, the authors introduced a data swapping procedure for the data found in each node. In this procedure, the majority class instances will be swapped with the minority class instances. To retain the classification performance, the superior probability after the data swapping process is retained by the majority class of each node.
From the literature review, we realise that only a minor group is working on privacyoriented pruning, and there is room for improvement. Therefore, instead of weighting each node, we propose a new approach to prune the tree by truncating the value of the IP address in the application domain of NIDS. The methodology of this proposed work is presented in the next section.

Methodology
In this work, truncation [26] is performed on IP addresses in three ways: 8-bit, 16-bit, and 24-bit truncations. Given the original IP address as "192.168.123.121", 8-bit truncation will zeroize the last byte (8-bit) of the IP address, thus representing it as "192.168.123.0"; 16bit truncation will zeroize the last two bytes (16-bit) of the original IP address, representing it as "192.168.0.0"; whereas 24-bit truncation will zeroize the last three bytes (24-bit) of the original IP address, representing it as "192.0.0.0". The goal of this proposed method is twofold: (1) to anonymise the real IP address and (2) to prune the original C4.5 decision tree. In this work, we have adopted the C4.5 decision tree from the Weka J48 decision tree package to run the experiment. It should be noted that the default J48 decision tree without any fine-tuning of parameters is utilised on all the experiments deliberated on in this paper.
To illustrate the impacts of decision tree pruning with IP truncation, part of the ASCII J48 (Weka implementation of C4.5) tree model built against the original 6-percent-GureKDDCup'99 dataset is shown in Figure 1. Similarly, the 24-bit truncation version of the same dataset trained with the identical dataset and model is presented in Figure 2. We can observe that all the IP addresses from the tree model in Figure 2 have been truncated by three bytes (24-bit), spawning a new set of IP addresses such as "192.0.0.0", "135.0.0.0", "196.0.0.0", etc. In both figures, the values in the bracket should be read as (total number of instances from the training distribution reaching the leaf, number of training instances that is incorrectly classified) [25]. For instance, "resp_port <= 20: normal (6465.0/6.0)" from Figure 2 denotes that 6465 number of training instances reached the leaf node, with 6 of them classified incorrectly [27]. To have a better insight into the proposed approach, we give the overall methodological diagram and pseudocode for the proposed IP truncation pruning algorithm in Figures 3 and 4. Note that 10-fold cross-validation is utilised in the absence of a testing dataset.

Experimental Empirical Studies
The examination is tested on a 6-percent-GureKDDCup'99 [28,29] NIDS dataset. This dataset is suitable because it contains IP addresses, which the IP truncation method will be applied to. To avoid the biasness or cherry-picking of the features, this work intentionally opted out of feature extraction, and no feature enhancement was performed on the dataset.
As the dataset has not been segregated into a training and testing distribution by the authors, the classification performance of the proposed method will be evaluated based on 10-fold cross-validation to provide a fair comparison between the models.

Experimental Results
To gauge the usability of a decision tree model, the decision tree structure (i.e., number of nodes) and the classification accuracy attained before and after the application of IP truncations are compared. The number of nodes is selected as one of the evaluation criteria since the interpretability of a decision tree greatly depends on the tree structure. For instance, a very complex decision tree structure having a vast number of nodes is harder to be understood when compared to a simpler tree with a smaller number of nodes.
We present the empirical results for each model (i.e., original, 8-bit truncation, 16-bit truncation, and 24-bit truncation) in Table 1 with crucial evaluation metrics such as classification accuracy, true positive rate, false positive rate, true negative rate, false negative rate, the number of nodes found in the tree models, and the number of leaves present in the tree models.

Experimental Empirical Studies
The examination is tested on a 6-percent-GureKDDCup'99 [28,29] NIDS dataset. This dataset is suitable because it contains IP addresses, which the IP truncation method will be applied to. To avoid the biasness or cherry-picking of the features, this work intentionally opted out of feature extraction, and no feature enhancement was performed on the dataset.
As the dataset has not been segregated into a training and testing distribution by the authors, the classification performance of the proposed method will be evaluated based on 10-fold cross-validation to provide a fair comparison between the models.

Experimental Results
To gauge the usability of a decision tree model, the decision tree structure (i.e., number of nodes) and the classification accuracy attained before and after the application of IP truncations are compared. The number of nodes is selected as one of the evaluation criteria since the interpretability of a decision tree greatly depends on the tree structure. For instance, a very complex decision tree structure having a vast number of nodes is harder to be understood when compared to a simpler tree with a smaller number of nodes.
We present the empirical results for each model (i.e., original, 8-bit truncation, 16-bit truncation, and 24-bit truncation) in Table 1 with crucial evaluation metrics such as classification accuracy, true positive rate, false positive rate, true negative rate, false negative rate, the number of nodes found in the tree models, and the number of leaves present in the tree models.

Discussions and Findings
Affected by the IP truncation concealing procedure, it is speculated that this process might be reducing the accuracy of the original decision tree due to information loss. However, the experimental results prove otherwise since the difference in accuracy between the original and truncated versions are remarkably insignificant (on average~0.01% of degradation), as tabulated in Table 1. Similarly, although the original model performance in terms of true positive rate, false positive rate, true negative rate, and false negative rate is superior when compared to the other models, the difference is miniature.
Interestingly, the total number of nodes in the decision tree structure is observed to be considerably decreased after applying IP truncation (i.e., 23,096 or 98.3520% of nodes are pruned from the original tree when 24-bit truncation is adopted) based on the results shown in Table 1. This scenario signifies that the truncation scheme can simplify the tree structure and provide additional privacy protection without deteriorating the classification performance of the decision tree at the same time.
Akin to the observation in the number of nodes, we also noticed a significant reduction in the number of leaves. In particular, the number of leaves is reduced by 21,354 (91.0696%) when 16-bit truncation is adopted and 23,129 (98.6395%) when 24-bit truncation is applied.
To better understand the differences between the truncation methods, we have also presented the total number of distinct source IP addresses and destination IP addresses in Table 1. The highest number of nodes and leaves are pruned when 24-bit truncation is applied. This can be substantiated by the total reduction in IP address from 5900 to 99. A total of 5801 or 98.3220% of IP addresses have been reduced as compared to the original model when 24-bit truncation is employed.
Based on the empirical results presented in this section, it is safe to assume that the truncation methods are suitable to be adopted as a pruning strategy to reduce the complexity of the tree structure while maintaining an adequate classification performance and preserving privacy simultaneously.

Extended Experimental Empirical Studies
To further corroborate the claims made in Section 4, extended empirical studies performed on the full version of three NIDS datasets are substantiated in this section. The three NIDS datasets include (i) GureKDDCup'99 [28,29], (ii) UNSW-NB15 [30], and CIDDS-001 [31]. As the full version of these three datasets was supplied in a weekly fashion by the authors, we adopted a distinct approach to split the training and testing data following each of the weeks, as depicted in Table 2. Due to the over-optimistic results produced by cross-validation shown in the previous section, this approach was advocated by Ring et al. [32] and Al Tobi and Duncan [33]. Akin to the 6-percent-GureKDDCup'99, the three NIDS datasets are chosen as they contained the mandatory IP truncation features: the source IP address and destination IP address. More details of the datasets, data cleansing, and data preparation can be found in our previous work [34].

Extended Empirical Experimental Results
For the sake of simplicity, only the classification accuracy (i.e., model performance) and the number of nodes (i.e., tree structure) will be discussed and analysed in the subsequent section. Table 3 tabulates the accuracy performance of the decision tree model trained with different train-test distributions incorporated with or without the IP truncation. The number of nodes present in the model after the pruning process is delivered in Table 4. From the tables, we can notice that the obtained empirical results are very encouraging as most tree models pruned with IP truncation can perform almost on a par with the original model.

Discussion and Analysis on the Extended Empirical Experimental Results
Similar to the observation presented in Section 4, the classification performance difference between the truncated models and original model is not significant, with, at most, 2% of degradation across the three datasets based on the classification accuracy difference shown in Table 5. However, it is interesting to witness that some tree models utilising the IP truncation pruning mechanism could perform better than the original model. For example, when the model is trained with GureKDDCup'99 (Week 1~4) and tested with GureKDDCup'99 (Week 5~7), the proposed model integrated with the 24-bit truncation gains 0.9189% accuracy compared with the original model. This scenario assumes that the nodes pruned by truncation can produce a tree with better generalisation capability. For the tree's structure, the number of nodes realised by each model is tabulated in Table 6. From the result, we can observe that the number of nodes is significantly reduced in all three datasets when 8-bit, 16-bit, and 24-bit truncations are adopted. These results further justify that the IP truncation is an effective strategy to prune the tree to create a simpler tree structure. By plotting the classification accuracy against the number of nodes in Figure 5, it can be evidently seen that the truncation approach can prune the nodes efficiently without sacrificing a good classification accuracy. Interestingly, the performance of the truncated models containing lesser nodes in GureKDDCup'99 and UNSB-SW15 are either on a par with or superior when compared to the original models. Although the model's classification accuracy slightly deteriorated by~2% in CIDDS-001, it is compensated with a simpler tree containing only~700 nodes instead of 8766 nodes from the original model.     From Table 5, we noticed that the classification performance of the tree models is identical when 16-bit or 24-bit truncation is applied in UNSW-NB15 and CIDDS-001. At the same time, we also observed that the number of nodes remains constant in accordance with the results in Table 6. To further understand this finding, the total number of unique IP addresses, including the source and destination, is calculated and organised in Table 7. Therefore, we can identify that the total number of IP addresses remains unchanged in UNSW-NB15 and CIDDS-001 when 16-bit or 24-bit truncation is applied. This may be the key reason leading to the identical performance achieved by the models.
In Table 8, we present the total number of reductions in IP addresses when different truncation approaches are applied in the three datasets. Since the tree model only utilised the number of unique IP addresses from the training distribution as the training attributes to build the tree, we deduced that the total number of unique IP addresses found in the training distribution will have a particularly significant relationship with the number of nodes. Referring to Table 8, we noticed a huge amount of IP address reduction in GureKDDCup'99 when 24-bit truncation is applied on the Week 1~6 training distribution, whereby a total of 28,262 (99.63%) unique IP addresses are removed. By eliminating a tremendous amount of unique IP addresses, the number of nodes is also significantly pruned by 317,166 nodes (refer Table 6) compared to the original model. Figure 6 is plotted to illustrate the relationship between the number of nodes and the total number of unique IP addresses from the training distribution. We can clearly perceive that as the number of unique IP addresses increases, the number of nodes produced by the tree model will also increase. This undoubtedly justifies the noticeable effects of IP truncation when applied on GureKDDCup'99 since it contains the largest number of unique IP addresses.

Performance Comparison Using Wilcoxon Signed-Rank Test
To further evaluate and ensure that the adoption of the proposed pruning does not deteriorate the performance of the original classifier, we have additionally performed a Wilcoxon Signed-Rank test. A two-tailed test with a significance level of 0.05 is conducted based on the original classification accuracy against the truncated classification accuracy (8-bit, 16-bit, and 24-bit) obtained from Tables 1 and 3. The empirical results substantiated in this section further uphold the claims made in Section 4. This substantially demonstrates that the adoption of the truncation approach to pruning the tree model will not cause a substantial decline in the classification performance of the model. In short, the truncation strategy is a suitable method to preserve privacy and reduce the complexity of the tree concurrently without sacrificing the classification performance.

Performance Comparison Using Wilcoxon Signed-Rank Test
To further evaluate and ensure that the adoption of the proposed pruning does not deteriorate the performance of the original classifier, we have additionally performed a Wilcoxon Signed-Rank test. A two-tailed test with a significance level of 0.05 is conducted based on the original classification accuracy against the truncated classification accuracy (8-bit, 16-bit, and 24-bit) obtained from Tables 1 and 3.
According to Table 9, all the tests show that the results are insignificant. In other words, the finding affirms that the performance achieved by the proposed approach is identical to the original classifier. However, we also notice that the classification accuracy attained by the proposed method in some scenarios surpasses the original classifier. Specifically, the accuracy is improved when the full GureKDDCup'99 is trained (Week 1~4; Week 1~5) and tested against (Week 5~7; Week 6~7) for all 8-bit, 16-bit, and 24-bit truncation. Additionally, all the models that exploited the IP truncated pruning approaches outperformed the original classifier in the UNSW-NB15 datasets. Details of the performance comparison can be found in Table 5. These test results further confirm the claims made in Sections 4.2 and 5.2. We can comprehend that the adoption of IP truncation can simplify the tree structure through pruning and preserve privacy at the same time without deteriorating the performance of the tree classifier.

Conclusions and Research Limitation
The proposed pruning method, which is based on IP truncation, spells two main advantages: (1) the IP address can be anonymised to prevent any potential user profiling, and (2) the number of nodes in a C4.5 tree is tremendously reduced to make the rule interpretation possible while maintaining the classification accuracy.
A slight accuracy degradation is observed from the results presented after we applied the truncation in most scenarios. However, it is safe to assume that the truncation can prune the decision tree effectively without much affecting the performance of the tree classifier, and merely a very small performance degradation is observed (refer to Tables 1 and 5) after the truncation. Based on the empirical results presented in Table 5, it is also interesting to observe that the classification accuracy of the truncated model outperforms the original model in some situations. It can be postulated that the IP truncation can group similar features for building a more generalised tree model. However, masking the IP address for the sake of privacy will also hinder its interpretability when we want to investigate an incident for an exact node (i.e., IP address or specific device). This can be viewed as a trade-off of privacy. Based on the empirical results attained, we believe that the proposed IP truncation-based pruning can preserve the utility and privacy simultaneously.

Future Works
In future, other network anonymisation techniques and machine learning classifiers will be explored and investigated to further assess the relationship between them. On the other hand, the similar analysis procedure demonstrated in this paper can also be adopted in other domains to examine the pruning effect of a decision tree resulting from the application of data privacy solutions. Nevertheless, future experiments in the Botnet [35] application may also be considered to exclude a list of Botnet IP addresses from the IP truncation anonymisation algorithm since they can be regarded as public information.