Network Intrusion Detection with a Hashing Based Apriori Algorithm Using Hadoop MapReduce

The ubiquity of Internet services across the globe has undoubtedly expanded the strategies and modes of operation that cybercriminals use to perpetrate unlawful activities through intrusion on various networks. Network intrusion has led to many global financial losses and privacy problems for Internet users. To safeguard networks and to prevent Internet users from becoming regular victims of cybercriminal activities, new solutions are needed. This research proposes a solution for intrusion detection that uses an improved hashing-based Apriori algorithm implemented on the Hadoop MapReduce framework and mines association rules to identify and detect network intrusions. We used the KDD dataset to evaluate the effectiveness and reliability of the solution. Our results show that this approach provides a reliable and effective means of detecting network intrusion.


Introduction
The survival of any business organization depends upon security mechanisms that adequately protect the organization's confidential data against illegal access. However, at present it appears impossible to prevent security breaches entirely. Given this, researchers can attempt to detect intrusions and act to counter the actions that brought them about. Intrusion detection scrutinizes computer activities for the purpose of uncovering violations [1]. The activity is especially relevant for new technologies such as smartphones [2], cloud computing [3], fog computing [4], and edge computing [5], where private and business data is shared across computer or wireless sensor networks, thus increasing the likelihood of attacks [6]. An intrusion detection system (IDS) tracks and analyzes system and user activity, looking for vulnerabilities and statistical anomalies and performing behavioral analysis of user activities. An IDS can add expertise to the rest of the critical network infrastructure and can follow the activities of users from entry to action points. Being active, it can report alterations to administrators when the system is under attack or when it detects errors in the system configuration, guide the administrator in establishing the policies that are important for the organization's safety, and even allow non-expert staff to carry out security management on the system. Lately, new threats related to cloud computing, fog computing, and edge computing have emerged, as these technologies remain vulnerable to intrusion and other malicious activities that affect the integrity and availability of cloud resources.
A wide range of methods has been applied to intrusion detection. For example, Abadeh and Habibi [16] proposed using evolutionary fuzzy rules and an optimized genetic algorithm (GA) for intrusion detection. Al Haddad et al. [17] introduced a collaborative network intrusion detection system (C-NIDS) to find network attacks while addressing intrusion detection in virtual networks and other security challenges. Das et al. [18] used principal component analysis (PCA) to lower the complexity of the network data and detect network intrusions. Huang et al. [19] demonstrated the use of rough sets and support vector machines (SVM) for intrusion detection, while Khamphakdee et al. [20] used association rules for identifying probe attacks. Kola Sujatha et al. [21] used a combination of SVM, fuzzy logic, and genetic network programming (GNP) to create rules for detecting network intrusions. Hashem et al. [22] used the Bee Ranker (BR) algorithm, which is based on the foraging behavior of honeybees, to select features useful for detecting network intrusions. Gao et al. [23] combined adaptive principal component analysis (A-PCA) for adaptive selection of network traffic features with an incremental extreme learning machine (I-ELM) for intrusion detection. Abdulhammed [24] used an auto-encoder (AE) and PCA to lower feature dimensionality before building Bayesian network, random forest (RF), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA) classifiers for an IDS. Al Tobi and Duncan [25] explored threshold adaptation for C5.0, random forest, and SVM to improve the accuracy of network intrusion detection models.
The Apriori algorithm has been used in the context of network security before. Prasenna et al. [26] used fuzzy weighted association rules (WAR) based on the Apriori algorithm and genetic network programming (GNP) to generate rules for evaluating network misuse and detecting anomalies. Lalli et al. [27] used a GA to reduce network traffic features, Apriori association to generate a rule schema, and an artificial neural network (ANN) to filter the rules and increase the detection accuracy. An et al. [4] proposed using a hypergraph clustering model based on the Apriori algorithm to find the association between fog computing nodes with a higher risk of distributed denial of service (DDoS) attacks. Jie et al. [28] proposed a similarity factor for quantifying the similarity between current and past frequent itemsets in real time. The factor is used as a reliability parameter to identify an abnormal system state.
For network IDSs, different frequent itemset mining algorithms, including modifications of the Apriori algorithm, have been widely used. Yan et al. [29] employed the PrefixSpan-based sequence mining algorithm for network log analysis and intrusion discovery. Ohrui et al. [30] combined Apriori with PrefixSpan, a sequential pattern mining approach, to detect and predict future botnet attacks. Zeng et al. [31] used sparse matrices to optimize Apriori for big data computing. The improved algorithm reduces both the runtime of data mining in intrusion detection and the storage space needed for the analyzed network data. Khalili et al. [32] used Apriori to reduce the number of candidate states of industrial systems evaluated as critical. Zheng et al. [33] utilized relational algebra theory to obtain the optimization relation association rule (ORAR) based on the relationship matrix and correlation operation, which reduces the number of dataset scans to one and thus overcomes the disadvantage of Apriori, which requires multiple scans.
Artificial intelligence (AI), machine learning (ML), and ANNs have become key research topics in the domain of anomaly-based intrusion detection. Chiba et al. [34] used signature-based detection to find known attacks and a back-propagation neural network (BPN) classifier to identify unknown attacks. Odusami et al. [35] employed long short-term memory (LSTM) to protect against layer seven distributed denial of service (L7DDoS) attacks. Yang et al. [36] employed a modified density peak clustering algorithm (MDPCA) to reduce the size of the training set and to deal with the class imbalance problem. Each cluster was used to train its own deep belief network (DBN) classifier. Finally, the decisions of the DBN classifiers are combined based on fuzzy membership weights, which are calculated using the nearest neighbor criterion. Le et al. [37] used a hybrid sequence forward selection and decision tree (SFSDT) model to generate the optimal feature subset for training recurrent neural network (RNN) models. The latter are used to deal with user-to-root (U2R) and remote-to-local (R2L) attacks.
In summary, despite the plethora of methods used, including advanced deep learning models that have proven successful in many other application domains, these methods often fail at real-world network intrusion detection even though they show excellent results on benchmark intrusion datasets. The reason is that the need to adapt to pattern variability has often been neglected. Certain classes of attack, such as denial of service (DoS) attacks, consist of abrupt patterns, which bring a high level of variability into network traffic. Machine learning models are typically adapted only once, using validation data in a scheme such as k-fold cross-validation. However, in real-world scenarios the data changes continuously, which makes such approaches inefficient [26].
Our novelty and contribution are the proposed improved hashing-based Apriori algorithm and its implementation on the MapReduce framework. The hashing-based modification makes it possible to find the frequency of the k-itemsets without computationally expensive candidate sets, which makes the algorithm usable for detecting network attacks in near real time, whereas the MapReduce framework makes it possible to handle the traffic of large networks. We demonstrate the applicability of the proposed method to identifying and detecting network intrusions using the KDD dataset as a benchmark.

Apriori Algorithm
The Apriori algorithm mines frequent itemsets for Boolean association rules and is considered the most widely recognized algorithm for mining association rules. Developed by Agrawal and Srikant [38], the algorithm addresses association rule mining on a large scale and permits rules whose consequents contain more than one element. Association rule mining first seeks frequent itemsets, whose number of occurrences exceeds a pre-defined minimum threshold, and then derives association rules from those frequent itemsets. These two sub-problems are solved repeatedly until no new rules appear. The minimum support threshold must be set by the user. The Apriori algorithm is summarized in Algorithm 1. The algorithm has two stages: a training stage and a testing stage. In the training stage, the algorithm observes samples of behavior and then generalizes from them. Several algorithms use a supervised learning stage, in which samples of known attacks are supplied. In the testing stage, the algorithm is presented with a situation and decides how likely it is to be an attack.

Algorithm 1: Apriori algorithm
Variables:
  C_k is a candidate itemset of size k
  L_k is a frequent itemset of size k
BEGIN
  Find frequent set L_{k−1}
  Generate C_k by using the Cartesian product of L_{k−1}, i.e., L_{k−1} × L_{k−1}
  Perform pruning: remove from C_k any candidate that contains a (k−1)-size itemset that is not frequent
  Return frequent set L_k
END

The algorithm utilizes a breadth-first search mechanism and a hash tree structure to count candidate itemsets efficiently and thus determine the frequency of occurrence of each itemset. The pseudocode of the algorithm is summarized in Algorithm 2.
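The join and prune steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `generate_candidates` and `apriori`, the toy transactions, and the use of an absolute support count as the threshold are assumptions made for illustration, and the counting step rescans the data exactly as classical Apriori does.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Join L_{k-1} with itself, then prune candidates with an infrequent (k-1)-subset."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset occurring in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = generate_candidates(frequent.keys(), k)
        # Classical Apriori rescans the whole dataset at every level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical connection records reduced to three categorical items each.
toy = [
    {"tcp", "http", "SF"},
    {"tcp", "http", "SF"},
    {"tcp", "smtp", "SF"},
    {"udp", "domain", "SF"},
]
frequent_itemsets = apriori(toy, min_count=2)
```

The repeated scan inside the `while` loop is the cost that grows quickly with dataset size.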

Algorithm 2: Find frequent itemsets
The disadvantage of the classical Apriori algorithm is rapid performance degradation on very large datasets, caused by repeated scanning of the dataset and the creation of many candidate sets. To improve the efficiency of the Apriori algorithm, we have adopted the idea of hashing first presented by Tribhuvan et al. [39]. The modified Apriori algorithm employs hash tables to generate large itemsets efficiently. It runs over the entire dataset and stores previous results in hash tables; reusing the stored results avoids repeated scans. We also employ the double hashing technique of Jayalakshmi et al. [40]. First, the data is represented in the transaction ID format. Hashing is used to store the data values, and an independent second hash function is used to resolve hash collisions. Following the suggestion of [41], we adopt Hamming projections for hashing, where Smax is the maximum support and n is the count of transactions. The hash table construction procedure has two parts: generation of the hash value and update of the hash table. Similarly to Reference [42], to speed up hash table construction we employ parallel hash value generators, which generate hash values simultaneously. However, instead of hardware parallelization, we use MapReduce [43].
The MapReduce model is based on dividing a large dataset into smaller data subsets. The Map function parallelizes the processing of each data subset, and the Reduce function combines the results. For the Apriori algorithm, we follow the suggestion of Zhou et al. [44] and start from the frequent 2-itemsets. As the key, we use the first (k − 2) terms of a frequent (k − 1)-itemset, whereas the value is its last term. The Reduce function combines the values that share the same first (k − 2) items. Thus, our method has two stages. First, MapReduce is applied to calculate the frequent itemsets in parallel. Then MapReduce is applied again to the frequent itemsets to find the association rules. Pruning is performed based on the support and confidence threshold criteria. The algorithm is summarized as a flow diagram in Figure 1.
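The Map, shuffle, and Reduce steps, together with the key/value scheme in which the first (k − 2) terms of a frequent (k − 1)-itemset form the key and the last term forms the value, can be sketched in plain Python. This is a single-process simulation under stated assumptions, not Hadoop code and not the authors' implementation: the function names, the toy data splits, and the use of a Python dictionary as the hash table are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(split, k):
    """Mapper: emit (itemset, 1) for every k-itemset of each transaction in one data split."""
    for transaction in split:
        for itemset in combinations(sorted(transaction), k):
            yield itemset, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, min_count):
    """Reducer: sum the partial counts and keep itemsets that reach the support threshold."""
    totals = {key: sum(values) for key, values in groups.items()}
    return {key: count for key, count in totals.items() if count >= min_count}


def join_candidates(frequent_prev):
    """Key = first (k-2) items of a frequent (k-1)-itemset, value = its last item.

    Values that share a key are paired to form k-item candidates.
    """
    by_prefix = defaultdict(list)
    for itemset in frequent_prev:
        by_prefix[itemset[:-1]].append(itemset[-1])
    candidates = []
    for prefix, lasts in by_prefix.items():
        for a, b in combinations(sorted(lasts), 2):
            candidates.append(prefix + (a, b))
    return sorted(candidates)


# Two hypothetical data splits, each of which would go to a separate mapper.
splits = [
    [{"tcp", "http", "SF"}, {"tcp", "http", "SF"}],
    [{"tcp", "smtp", "SF"}, {"udp", "domain", "SF"}],
]
pairs = [pair for split in splits for pair in map_phase(split, 2)]
frequent_2 = reduce_phase(shuffle(pairs), min_count=2)
candidates_3 = join_candidates(frequent_2)
```

In an actual Hadoop job the mappers would run on different nodes and the framework would perform the shuffle; only the division of labor between the two functions carries over from this sketch.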

Dataset
The KDDcup99 intrusion dataset (available from the UCI KDD repository) was used for testing the algorithm. The dataset contains many intrusion attacks simulated in a military-grade network. The KDDcup99 dataset is considered the benchmark labeled dataset [45] that is commonly used for assessing network intrusion detection methods. Here we use the 10% KDD training data subset. The dataset has several properties that support its use for intrusion detection. Each connection is described by the sequence of TCP packets between its start and its end at well-defined times. Each connection is labeled as normal or abnormal. There are 494,021 connection attempts at the LAN. The dataset covers four main attack types (see Table 1) and has 41 features, which are either continuous or discrete. The 41 features (see Table 2) include the characteristics of TCP connections, content features, and network traffic features, which are calculated with a 2 s time window [46].

Evaluation
We evaluate the results using the strength measures of association rules, i.e., support and confidence. Support defines how often a rule can be applied to a dataset, whereas confidence defines how often the items in the consequent of a rule appear in the transactions that contain the antecedent:
$$S(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{n}, \qquad C(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)},$$
where S is the support, C is the confidence, X is the antecedent, Y is the consequent, σ is the frequency of the itemset, and n is the count of transactions.
A rule that has a low level of support may be occurring by chance. Confidence measures the reliability of the rule and provides an estimate of the conditional probability of the consequent given the antecedent.
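A small worked example may make the two measures concrete. The sketch below uses the usual definitions, support as σ(X ∪ Y) divided by the count of transactions and confidence as σ(X ∪ Y) divided by σ(X); the function names and the four toy transactions are hypothetical and are not taken from the KDDcup99 data.

```python
def frequency(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))


def support(antecedent, consequent, transactions):
    """S(X -> Y) = sigma(X u Y) / n."""
    both = set(antecedent) | set(consequent)
    return frequency(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """C(X -> Y) = sigma(X u Y) / sigma(X); undefined when X never occurs."""
    both = set(antecedent) | set(consequent)
    return frequency(both, transactions) / frequency(antecedent, transactions)


# Hypothetical records: protocol, service, and class label.
toy = [
    {"icmp", "ecr_i", "smurf"},
    {"icmp", "ecr_i", "smurf"},
    {"icmp", "ecr_i", "normal"},
    {"tcp", "http", "normal"},
]
s = support({"icmp", "ecr_i"}, {"smurf"}, toy)      # 2 / 4
c = confidence({"icmp", "ecr_i"}, {"smurf"}, toy)   # 2 / 3
```

Here the rule {icmp, ecr_i} → {smurf} holds in two of four transactions, so its support is 0.5, while two of the three transactions containing the antecedent also contain the consequent, giving a confidence of about 0.67.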

Results
We adopted Tanagra 1.4.50 (Lumière University Lyon 2, Lyon, France) on a Windows 10 OS with an Intel Core i7 2.7 GHz CPU and 16 GB RAM as the main software platform for this research. The tool supports both multivariate and univariate parametric and nonparametric tests and extracts the results of the Apriori algorithm in the form of rules. We employed the tool to perform operations on the KDDcup99 dataset and to obtain results for the various network cyberattack attempts. Validation was performed using 10-fold cross-validation. This technique yields accuracy results that are less sensitive to the choice of training subset. In 10-fold cross-validation, the traffic profiles are split into ten sets, and a training set is made by joining nine of them. The remaining set is used as the testing set for assessing the classification performance. The entire process is repeated ten times, joining the subsets in ten different ways, and the mean accuracy rate is computed.
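The 10-fold procedure described above can be sketched as follows. This is only an illustration of the splitting scheme, not the Tanagra implementation: the function names, the fixed random seed, and the `evaluate` callback (a stand-in for training and scoring a model) are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Hypothetical usage with 25 samples; a real evaluate() would fit and score a model.
splits = list(k_fold_indices(25, k=10))
```

Averaging over the ten folds is what makes the reported accuracy less dependent on any single train/test partition.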
The results (the rules with the largest confidence) are presented in Table 3 below for the following parameters of the Apriori algorithm: maximum rule length 4, minimum support 0.33, lift filtering 1.1, and minimum confidence 0.75. The algorithm generates 146 rules (based on 494,020 transactions). Figure 2 shows the relationship between the parameters and how they served to detect unusual signatures in the database for the attack types given in Table 1. Note that R2L attacks have been identified with a higher level of support than the other types of attack (DOS, U2R, Probe, see Table 1), while DOS attacks had the lowest level of support.
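To show how such thresholds act on a set of mined rules, the sketch below filters rules by length, support, lift, and confidence using the parameter values quoted above as defaults. Several points are assumptions: lift is taken in its usual sense (confidence divided by the support of the consequent), rule length is taken as the total number of items in the antecedent and consequent, and the `Rule` class, `filter_rules` function, and example rules are hypothetical rather than Tanagra's API or output.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    antecedent: frozenset
    consequent: frozenset
    support: float             # S(X -> Y)
    confidence: float          # C(X -> Y)
    consequent_support: float  # fraction of transactions that contain Y

    @property
    def lift(self):
        # Usual definition (assumed here): confidence relative to the base rate of Y.
        return self.confidence / self.consequent_support


def filter_rules(rules, max_length=4, min_support=0.33, min_lift=1.1, min_confidence=0.75):
    """Keep rules that satisfy every threshold, sorted by decreasing confidence."""
    kept = [
        r for r in rules
        if len(r.antecedent) + len(r.consequent) <= max_length
        and r.support >= min_support
        and r.lift >= min_lift
        and r.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda r: r.confidence, reverse=True)


# Hypothetical rules with made-up statistics.
rules = [
    Rule(frozenset({"a"}), frozenset({"b"}), 0.40, 0.90, 0.50),  # kept: lift 1.8
    Rule(frozenset({"a"}), frozenset({"c"}), 0.40, 0.80, 0.90),  # dropped: lift below 1.1
    Rule(frozenset({"a"}), frozenset({"d"}), 0.20, 0.95, 0.30),  # dropped: support below 0.33
]
selected = filter_rules(rules)
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is overall, which is why a lift filter removes rules that merely restate a common consequent.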

Accuracy
We compare our results against state-of-the-art results that other authors achieved with a variety of methods on the KDDcup99 dataset (see Table 4). Note that the perfect result achieved by some of the other methods does not mean that the corresponding method will behave well in a real-world situation, where network traffic patterns constantly change. Finally, we present the confusion matrices for DOS, U2R, R2L, and PRB attack classification in Figure 4. The accuracy of recognition is similar across the different types of attack, with DOS attacks recognized with the highest accuracy of 98.2% and PRB attacks recognized with the lowest accuracy of 96.91%.
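For readers who want to relate a binary confusion matrix to the accuracy figures, a minimal sketch follows. The function name and the counts are hypothetical; they are not the values shown in Figure 4.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all connections classified correctly
        "precision": tp / (tp + fp),     # share of raised alarms that were real attacks
        "recall": tp / (tp + fn),        # share of real attacks that raised an alarm
    }


# Hypothetical counts for one attack class versus all other traffic.
metrics = confusion_metrics(tp=90, fp=5, fn=10, tn=95)
```

Accuracy alone can hide a low recall on rare attack classes, so inspecting the full matrix is useful when classes are imbalanced.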

Following the recommendation of Demšar [50], we used a series of statistical tests to compare the methods. The Friedman test ranks the algorithms by assigning a rank to the performance of each method on each dataset. The Nemenyi post-hoc test was applied to compute a threshold on the difference of average ranks, the critical distance (CD). The hypothesis that "the accuracy of two methods is the same" is rejected if their mean rank difference is larger than the CD. The results are summarized as the Demšar significance diagram in Figure 5. Considering the different types of attack, on average, the proposed method performs better than the other considered methods. However, the ranking of the proposed method is statistically indistinguishable from those of the methods of Zhao et al. [45], Le et al. [37], and Papamartzivanos et al. [48].
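The ranking and critical distance computation can be sketched as below. The critical distance follows the formula given by Demšar, CD = q_α · sqrt(k(k + 1) / (6N)) for k methods and N datasets. The accuracy matrix is hypothetical, the function names are assumptions, and the critical value q_α = 2.343 (three methods, α = 0.05) is taken from published tables and passed in as a parameter rather than computed.

```python
import math


def average_ranks(scores):
    """scores[d][m] is the accuracy of method m on dataset d.

    Rank 1 is the best method on a dataset; tied methods share the mean rank.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for pos in range(i, j + 1):
                totals[order[pos]] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Critical distance of the Nemenyi test for a given critical value q_alpha."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Hypothetical accuracies of three methods on four attack types.
scores = [
    [0.98, 0.97, 0.90],
    [0.97, 0.98, 0.91],
    [0.99, 0.95, 0.92],
    [0.96, 0.96, 0.90],
]
ranks = average_ranks(scores)
cd = nemenyi_cd(q_alpha=2.343, n_methods=3, n_datasets=4)
```

In this toy example the best and worst methods differ by 1.625 in mean rank, which is below the critical distance of about 1.66, so the test would not separate them; with few datasets the CD is large and differences are hard to establish.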

Scalability
To analyze and evaluate the performance and scalability of the proposed solution, we set up a cluster of eight PC nodes. Each node runs the Microsoft Windows 10 Home operating system on an Intel i5-8265U CPU at 1.60 GHz (4 cores, 8 logical processors) with 8 GB RAM and 15.6 GB of virtual memory available. All algorithms were implemented using Python version 3.7.4. For the implementation of MapReduce, we used the Apache Hadoop 1.2.1 framework. To test the solution, we used the full KDDcup99 dataset, which has about five million records in the training part and around two million records in the testing part. To evaluate scalability, we used the speedup measure [51], which is the ratio of the running time on a single-node system to the running time on an n-node system. Speedup is measured by evaluating the performance of the framework on the dataset for each number of nodes.
In Figure 6, we report the running times for different numbers of computing nodes. The impact of using MapReduce on the running time can be seen. The running time on 2 nodes is 1875 s, while the running time on 8 nodes is 420 s for the same dataset. The speedup for 4 nodes as compared to the runtime on a single node is 3.78, and for 8 nodes it is 7.59. As we can observe from Figure 6, the speedup is very close to linear when using from 2 to 8 nodes. The results demonstrate reasonable scalability for the suggested network intrusion detection system.
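The speedup arithmetic can be checked with a few lines of Python. The parallel efficiency (speedup divided by the number of nodes) and the single-node running time implied by the 8-node figures are derived quantities that the text does not report, so they should be read as back-of-the-envelope estimates; the function names are hypothetical.

```python
def speedup(t_single, t_n):
    """Speedup: running time on one node divided by running time on n nodes."""
    return t_single / t_n


def efficiency(speedup_value, n_nodes):
    """Parallel efficiency: fraction of the ideal linear speedup that is achieved."""
    return speedup_value / n_nodes


# Reported speedups for 4 and 8 nodes.
eff_4 = efficiency(3.78, 4)   # 0.945
eff_8 = efficiency(7.59, 8)   # about 0.949

# Estimate (not reported): single-node time implied by 420 s on 8 nodes at speedup 7.59.
t_single_estimate = 7.59 * 420                            # about 3188 s
speedup_2_estimate = speedup(t_single_estimate, 1875)     # about 1.70 on 2 nodes
```

Under these estimates the efficiency stays near 0.95 at 4 and 8 nodes, which is consistent with the near-linear behavior described for Figure 6.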

Discussion and Conclusion
One of the main shortcomings of the classical Apriori algorithm is its inefficient performance on big datasets, caused by repeated scanning of the database and the creation of many candidate sets. In this paper we tackled this problem by adopting a hashing approach, which makes it possible to find the frequency of the k-itemsets without computationally expensive candidate sets. The hashing-based approach also has the advantage of high computation speed, which has already been noted by other researchers [52-54]. This speed makes the approach usable for detecting network attacks in near real time, while the adoption of MapReduce provides scalability to big data [55,56].
We have described an intrusion detection framework for the identification of network intrusions. The framework uses the Apriori algorithm to find attacks and derive rules, and it achieves an appreciable level of accuracy and efficiency in finding new cyberattacks from the information provided about known and recognized attacks. The framework was applied to the KDDcup99 dataset and successfully recognized four types of network attacks with high confidence and support.
The proposed method can produce solutions that address the shortcomings of other approaches, specifically the lack of adaptability demonstrated by neural network based methods. The proposed methodology based on double hashing is promising and can be used for detecting cyberattacks. However, association rules do not imply causality, only the co-occurrence of events. Moreover, researchers need better and more standardized datasets that are representative of today's web servers. Researchers and business organizations need to examine network defense mechanisms with a view to identifying loopholes and improving the system to provide more reliable protection from cyberattacks.
In future work, we will perform more in-depth research on the recognition of cyberattacks in edge and fog computing environments.
