1. Introduction
Data traffic over wireless networks is exhibiting ever-increasing growth. Due to its ability to offer increased mobility, speed, usability, and low installation and maintenance costs, IEEE 802.11 networks are at the epicenter of this rapid shift to a wireless realm. Such networks, commercially known as Wi-Fi, are omnipresent in our daily life for providing connectivity to areas facilitating a wide spectrum of contemporary services [
1], including Voice over Wi-Fi (VoWiFi) and automotive and smart city applications.
On the other hand, mainstream digital technologies are also in the crosshairs of a variety of threat actors. Furthermore, while the 802.11 standard has greatly advanced over the years in terms of security, recent research work indicates that even the latest defenses, say, the Simultaneous Authentication of Equals (SAE) authentication and key exchange method and the Protected Management Frames (PMF) mechanism, embraced by the most recent at the time of writing 802.11-2020 standard are not impermeable [
2,
3,
4]. Through a security prism, the situation becomes more cumbersome and complicated, given that at least infrastructure-based Wi-Fi domains co-exist with their wired counterparts, and therefore the former can be used as a springboard for attacking the latter.
In this context, Intrusion Detection Systems (IDS) provide a supplementary layer of defense either to purely wireless domains or others that exploit a mixture of wired and wireless zones, trusted, say, within the premises of an enterprise, or not. Thus, far, a significant mass of works has investigated Machine Learning (ML) driven IDS both for wireless and wired networks and through the lens of diverse benchmark datasets and techniques. However, most likely due to the lack of proper datasets, research on IDS capitalizing simultaneously on 802.11-oriented and other types of network protocol features, including TCP, UDP, and Address Resolution Protocol (ARP), strikingly lags behind.
Our contribution: The work at hand aspires to fill this important literature gap by exploiting the modern AWID3 benchmark dataset. AWID3 contains a rich repertoire of attacks, which span from legacy 802.11 ones, say, deauthentication, to application layer assaults, including amplification, malware, botnet, SQL injection, and others. This renders AWID3 an ideal testing platform for assessing IDS that target the detection of a wide variety of attacks mounted on diverse layers of the protocol stack. Under this angle, and by considering an opponent who takes advantage of a Wi-Fi domain to launch application layer attacks, the current work answers the following key questions, which to the best of our knowledge are neglected by the related work:
Given two different network protocol feature sets, the first comprising 802.11-specific features and the second encompassing an assortment of non-802.11 features, which of them is superior in detecting application layer attacks, and to what degree? The features included in each feature are selected based on prior work on the topic.
Which features per set are the most important and bear the most information to the ML model?
How the IDS detection performance is affected if the two above-mentioned feature sets are combined and possibly escorted by engineered (artificial) features? Note that an engineered feature aims at improving the detection of a cumbersome identify class of attacks.
To respond to these questions, we performed a series of experiments utilizing both shallow and deep learning techniques. It is important to note that, in the context of the current work, the term “non-802.11” does not embrace any application layer feature. This makes the responses to the above questions more interesting, given that, typically, the detection of application layer attacks involves features of the same layer, which, however, are not normally available due to encryption or anonymization.
The rest of the manuscript is divided into sections as follows. The next section presents the related work on the topic.
Section 3 details the feature selection and data preprocessing schemes. The results after experimenting with each set of features are included in
Section 4. A deeper look into feature importance is provided in
Section 5. The same section offers an additional set of experiments performed over a unified feature set, and elaborates on the potential of engineered features. The last section provides concluding remarks and describes future research avenues.
2. Related Work
The current section briefly reviews the relevant literature. We consider major contributions spanning a time period from 2011 to 2021. The section only embraces works focusing on the identification of layer attacks through ML techniques utilizing non-802.11 network protocol features; in this respect, we do not consider works that deal with application layer attacks in general [
5,
6,
7]. The emphasis is put on the feature selection process, the utilized methodology, and the ML algorithm or models utilized per work. The reader should keep in mind that this section purposefully omits related work examining Wireless IDS (WIDS) capitalizing on 802.11-specific features. For such contributions, the reader is referred to [
8].
In [
9], the authors relied on a three-layered Neural Network (NN) structure to perform IoT network traffic flow classification within the context of a proposed IDS. Both binary and multiclass classification via a Feedforward NN (FNN) model were conducted against the
Bot-IoT dataset [
10], and towards the identification of 10 diverse classes of IoT-oriented attacks. The layers of the FNN model were randomly weighted based on a sampled version of the initial normal data distribution. The performance of the model was evaluated through legacy metrics, including Accuracy and F1. Twenty-five high-level features were selected from the dataset, representing diverse related field categories pertaining to different traffic, including ARP, IP, TCP, and UDP. Specifically, FNN achieved an F1 score above 99% in both the classification categories, i.e., binary and multiclass. The results were compared against the Support Vector classifier model with 5-fold cross-validation, achieving 82% at best. On the downside, the proposed IDS failed to generalize the classification procedure presenting low precision during the identification of specific categories of attacks or even variations of the same attack, namely flooding and reconnaissance ones, for both binary and multiclass experiments.
The authors in [
11] proposed a DDoS IDS for the classification of malicious traffic with the Gradient Boosting (GBT) algorithm. The experiments utilized two custom datasets created from the real-world Internet traffic traces dataset obtained from
CAIDA [
12]. The proposed scheme was evaluated against GBT algorithms metrics, achieving an F1 score above 95% with a False Positive Rate (FPR) between 9% and 12%, especially when large iteration and DT values were applied. However, the authors provide minimal details regarding the creation of the two datasets and the feature selection procedure.
The authors in [
13,
14] concentrated on the identification and categorization of encrypted traffic using Skype and SSH as case studies. The IP packet header along with flow-based features was extracted from various public and private datasets, including
DARPA-99 and
NIMS. They evaluated their proposal against various classifiers and DNN models in the context of binary and multiclass classification. The authors chose 61 basic features. By capitalizing on them, they also constructed a set of UDP and TCP flow-based (artificial) features, without, however, properly justifying their choices, and the way these engineered features were utilized in the context of the NN models.
The work in [
15] relied on a supervised ML approach to develop a dual-layer IoT IDS, which is destined for malicious traffic classification and subsequently aids in differentiating between attack types. They executed five attacks, namely network scanning, DoS, evil twin, Man in the Middle (MiTM), and injection, on a custom-made IoT testbed and created a dataset comprising 88 features. Following a feature selection process, they resulted in two subsets of 29 and 9 features. The experiments carried out with a handful of ML models, namely Multinomial Naive Bayes (MNB), Support Vector Machines (SVM), Decision Trees (DT), Random Forest (RF), and Artificial Neural Networks (ANN), achieved a score above 95% and 92% regarding the F1 metric for malicious traffic and attack recognition models, respectively. During the preprocessing phase of the dataset, all the missing values were replaced with a zero value, possibly raising the risk of affecting or misleading the performance and effectiveness of the models in such an imbalanced dataset.
Several other contributions were dedicated to the classification of higher-layer attacks with ML techniques. The authors in [
16] relied on C4.5 decision tree and Symbiotic Bid-based (SBB) Genetic Programming (GP) models for the creation of a botnet classification mechanism. The work in [
17] implemented two DNN models, namely, Autoencoder and CNN, to perform feature selection and classification of TLS traffic. Moreover, the author in [
18] put forward a hybrid KNN-GP classification approach for the identification of DDoS traffic. Despite the promising results, the three aforementioned papers suggest a feature implementation approach that relies on custom extracted flow-based statistical measurements, providing little information regarding their extraction process. Nevertheless, an approach that totally neglects features based on header fields may lead to dubious results. Precisely, engineered features are interlinked with a specific attack, and a slight deviation in the underlying settings may cause the model to fail to generalize to even minor variations of the same attack.
The authors in [
19] relied on DNN models to assess the performance of an IDS protecting against DDoS attacks. For model training, the authors implemented the extended and imbalanced “UNB ISCX Intrusion Detection Evaluation 2012 DataSet”. The evaluation of the models was conducted across several NN models, namely CNN, RNN, LSTM, and GRU. The authors created a balanced dataset that was sampled repeatedly prior to the execution of each classification model’s experiment. It can be said that the continuous re-sampling of the original dataset along with data normalization should be executed as a preprocessing step in conjunction with feature importance for avoiding compromising the integrity of the final results and the overall generalization of the created model.
In [
20], the authors proposed two feature selection algorithms, namely
Chi Square and
Symmetrical, together with
Decision Tree to effectively identify and detect DDoS assaults. They took advantage of five different subsets stemming from the CAIDA [
12] dataset. Their experiments revealed that from the 25 initially selected features, only seven contributed to positively achieving a precision score above 95%. The authors do not elaborate on whether and in which way the feature selection process conducted on the CAIDA subsets can influence the effectiveness of a generalized IDS.
The authors in [
21] proposed a dataset, namely
Edge-IIoTset, destined to IoT and Industrial IoT (IIoT) applications. As an initial step, they collected network traffic from a great variety of IoT devices and digital sensors during the execution of 14 IoT-related attacks, which were derived from five generalized categories: information gathering, DDoS, MiTM, injection, and malware. Nearly 1.2K features were identified, from which only the 61 most relevant were finally selected. Categorical data conversion into ML algorithm’s compatible form was carried out by means of the
pandas.get_dummies Python library, while duplicate and missing values, including “NAN” and “INF”, were removed. Above that, flow-based features related to IP addresses, port, payload information, and timestamps were dropped as irrelevant to the concept of the proposed dataset. Both supervised swallow classification and DNN analysis were utilized to evaluate the effectiveness of the proposed IDS model. The authors relied on hyperparameter tuning using Grid Search, tying their ML models exclusively to the proposed data set. We argue that the aforesaid approach does not highlight the general nature of the conducted experiments and emphasizes how the proposed analysis may apply to unknown data beyond the presented work.
The work in [
22] introduced an FNN-based IDS for multiclass classification of high-layer attacks on IoT devices. Regularization and hyperparameter model tuning was adopted by the authors, while the final results of the FNN model were compared against the linear-SVM supervised algorithm. They concluded that FNN is more time efficient and expands better vis-à-vis the SVM model. The authors relied on frame-, ARP-, TCP-, IP-, and UDP-related fields during the feature extraction procedure. However, the absence of feature importance verification in the selected fields could not corroborate the robustness of the 29 selected features.
The authors in [
23] presented another dataset, coined
UKM-IDS20, comprising 46 features extracted from DoS, ARP poisoning, network scanning, and malware attack traffic. The dataset was evaluated through Artificial NN against the legacy
KDD99 and
UNSW-NB15 datasets, revealing higher attack detection rates. It can be said that, as a rule of thumb, engineered features may be tightly interrelated to the described testbed scenarios, and therefore even tiny variations of an attack may go undetected.
The contributions in [
24,
25] coped with unsupervised DNN techniques towards the creation of IDS specially designed to identify higher-layer attacks. Precisely, the authors in [
24] assessed two datasets comprising EtherNet/IP and Modbus protocol packets. Stacked denoising autoencoders NN were used to train and evaluate the proposed IDS. Above that, the work in [
25] implemented a signature-based ML approach, dubbed “Classification Voting”, in an effort to deliver a packet-agnostic IDS. Both these approaches provide little information regarding the feature selection procedure.
Works such as [
26,
27,
28] are considered marginally within the scope of the current paper as they focus on the comparative presentation of commonly used classifiers, NN models, and feature selection techniques towards the creation of an IDS. Moreover, the authors in [
29] presented an adversarial approach that is applicable to the falsification concept of LSTM-based IDS targeting DDoS traffic. This survey is also considered marginally relevant to ours as it examines the manipulation of high-layer features towards bypassing DDoS detection.
To ease the parsing of the relevant literature,
Table 1 summarizes the pertinent characteristics of each work included in this section. Namely, we outline the features selected per work plus the classification methods used. It is important to point out that the non-802.11 features shown in boldface in
Table 1 are common to that listed in the penultimate column of
Table 2, i.e., the features used in the context of this work. Overall, most of the works discussed in this section resorted to some sort of feature selection towards the identification of malicious traffic [
9,
11,
13,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
29]. To this end, the majority of contributions implemented binary or multiclass classification with traditional algorithms such as Adaboost, KNN, C4.5, Random Forest, and Decision Trees [
9,
11,
13,
15,
16,
17,
18,
20]. Deep Learning techniques were also implemented in several cases [
9,
14,
15,
17,
19,
21,
22,
23,
24,
25,
29,
29].
Altogether, the analysis of the related work carried out in the current section alongside the argumentation provided in § 2 of [
8], suggests that there is a noticeable lack of contributions attempting to detect higher-layer attacks, e.g., HTTP-oriented, by merely capitalizing on non-application features of diverse kinds.
5. Delving into Feature Analysis
This section elaborates on the selected features. First off, we examine the importance of each feature on both feature sets. Second, we construct an artificial feature that could potentially assist in predicting the most challenging class, namely Other. Thirdly, and more interestingly, we investigate if using the two feature sets in tandem can increase the prediction rate of an ML model. For each of the aforementioned cases, only the best ML performers, i.e., LightGBM and Bagging were considered.
5.1. Feature Importance
Feature importance aims at inferring the dominant features, i.e., those which possibly bear the greater information for the ML model. To this end, a permutation importance analysis was carried out using LightGBM. The analysis used 10% of the stratified data from each feature set. Precisely, LightGBM was trained with a 10% subset of stratified samples and tested with a different 10% subset of stratified samples.
As illustrated in
Figure 5, the analysis of the 802.11 features showed that six of them offer the most information:
frame.len,
radiotap.dbm_antsignal,
radiotap.length,
wlan.duration,
wlan_radio.duration, and
wlan_radio.signal_dbm. Further, the same type of analysis on the non-802.11 features revealed that only three of them, namely
arp,
ip.ttl, and
udp.length, provide significant information. This is a logical result, since ARP and UDP features were more important for the
Flooding class, while the TCP ones were assisted mostly in the detection of the
Other class. On the flip side, this result also entails that the rest of the non-802.11 features have almost zero contribution, especially in the detection of the
Other class, which is by far the most challenging. It is to be noted that while a couple of 802.11 features, namely
wlan.fc.ds and
wlan_radio.phy did have some importance, they were not picked because the useful information did not pertain to the feature as a whole, but to specific columns, say,
wlan.fc.ds_1, due to the use of the OHE technique. As a result, if used, such a feature may introduce more noise rather than improve the detection capacity of the algorithm.
After dropping the insignificant features per set, we repeated the experiments, and the results for the two best performers are summarized in
Table 7. Once more, the Bagging model was superior, yielding an AUC score of 90.71% on the 802.11 reduced set. This result corroborates the feature importance analysis; indeed, the dropped features do not contain useful information, since, vis-á-vis the results of
Table 4, in terms of the AUC metric, Bagging lost only 0.05% and 0.71% for the 802.11 and non-802.11 feature sets, respectively. On the negative side, the reduced non-802.11 feature set completely missed the
Other class, classifying its instances as
Normal ones. Nevertheless, it managed to identify the
Normal class with great success, misplacing approximately 100 samples in each fold.
The above-mentioned results on feature importance corroborate pretty well the outcomes of both kinds of analysis given in
Section 4.1 and
Section 4.2, and the current one: the 802.11 feature set produces better results vis-à-vis the non-802.11 set, and this out turn is far more obvious when it comes to shallow analysis. With reference to
Figure 5, this result can be mainly attributed to a couple of key factors. First, the non-802.11 set misses the 802.3
frame.len feature because AWID3 was created in an 802.11 set. Conversely, this important feature is included in the 802.11 feature set, contributing appreciably to the detection of attacks. Second, a quartet of features, namely
radiotap.dbm_antsignal,
wlan_radio.duration,
wlan_duration, and
radio_signal_dbm, incorporated in the 802.11 set evidently aid in pinpointing the attacker, while the non-802.11 set misses this information. The reader should however keep in mind that these observations and findings are closely tied to the feature sets of
Table 2. That is, the possible refinement and expansion of the feature sets depending on the particular case are left for future work.
5.2. Conflating the Feature Sets
In light of the analysis in
Section 4.1,
Section 4.2 and
Section 5.1, an important question emerges: what if both these feature sets are exploited in tandem? To provide an answer, as shown in
Table 8, we examined the detection performance of both the full and the reduced feature sets when used jointly. Simply put, the combined full feature set includes all the 33 features of
Table 2, while the combined reduced set comprises the nine features depicted in
Figure 5. As observed from
Table 8, the combined feature set produces substantially better results in comparison to the case each set is used separately. Precisely, the gain in terms of the AUC metric is +4.52% over that of the full 802.11 16-features set given in
Table 4. A significant AUC improvement (almost 3%) is also perceived in the percentage of the combined reduced set vis-à-vis that is seen in
Table 7 regarding the 802.11 reduced feature set.
Figure 6 elaborates on this outcome by illustrating the respective confusion matrices. As observed, in comparison to the results of
Section 4.1, the combined feature set improved the prediction rate of the model by about 0.32%, 2.07%, and 18.42%, for the
Normal,
Flooding, and
Other classes, respectively.
5.3. Use of Engineered Features
While the combined full feature set did augment the AUC score up to almost 95.30%, it would be interesting to indicatively examine the potential of (mostly empirically-derived) engineered features in possibly ameliorating this score. This demonstrative effort would also serve as a reference and guidance for future work. Obviously, based on the preceding discussions, the most cumbersome to detect class is the Other. To this end, for the explanation given below, we consider an enterprise network, that is, a similar setting to that deployed in the creation of AWID3. In such a network realm, the chance of a client machine being (also) utilized as a server is practically tiny. Nevertheless, such strange behavior, i.e., a local machine to serve a dual role, is exhibited in the botnet attack contained in AWID3. That is, the opponent, an insider in this case, operates a Command and Control (C2) server to herd and manage infected hosts (bots) in the local network. Based on this observation, using the same dataset, i.e., the Botnet pcap file, we used the pseudocode of Algorithm 1 for constructing an artificial feature dubbed “Insider”.
Specifically, this feature concentrates on the local IP address of each client. Namely, when the packets are sent from one client machine to another (client-to-client), the respective traffic samples were flagged with 1, otherwise (client-to-server) with 0. After its generation, the feature was preprocessed with OHE. It is argued that this engineered feature does not affect the generalization of the produced ML models, since it does not directly rely on the IP addresses per se, but only considers the correlation between them (client-to-client). In other words, a client’s IP address can change for different reasons, say, Dynamic Host Configuration Protocol (DHCP), alterations in network topology, and so on, but the model will be trained to detect weird communication patterns between local network nodes having a client role. Obviously, as with all the other features, this one is constructed based on readily available (not typically encrypted) packet-level information.
To assess the contribution of this feature to the detection performance, we utilized it alongside a triad of feature sets: the reduced 802.11 one, the reduced combined one, and the combined full set. The results per examined set are given in
Table 9. Obviously, in all three cases, the engineered feature favorably improved the AUC score in the range of 2% to 3%. For instance, regarding the combined full feature set, the detection performance increased by nearly 1.5%. Looking at the confusion matrices presented in
Figure 7, this betterment is clearly due to the improved identification of the
Other class. Precisely, as expected, the addition of this single feature rendered possible the detection of the samples belonging to the
botnet attack: in comparison to the two confusion matrices in
Figure 7, the algorithm (bagging) was now able to correctly classify much more (around +6.7%) samples of the
Other class, which in
Figure 6 were misclassified in the
Normal class.
Algorithm 1 Algorithm for constructing the “Insider” feature. |
Require: |
Require: |
for i=0; i< ; i++ do |
|
|
|
|
for j=0; j< ; j++ do |
|
if is in then |
|
else if is in then |
|
else if and == 1 then |
|
|
end if |
end for |
if == 0 or == 0 then |
|
end if |
end for |
6. Conclusions
Since its inception back in the late 1990s, Wi-Fi has ripened into a full-fledged mature technology being utilized in numerous everyday applications. Nevertheless, IEEE 802.11 networks are alluring to attackers as well. That is, in absence of any intrinsic network access control as in its wired counterparts, an assailant can either attack the wireless network directly or used it as an (anonymous) springboard for assaulting other networks; the opponent can be anywhere in the vicinity or further afield depending on the strength/type of the wireless signal/equipment. Furthermore, while 802.11 security features have greatly evolved and enriched in the passing of time, new vulnerabilities emerge. In this context, an intriguing from an IDS viewpoint issue is to examine the potential of combining the information stemming from both wired and wireless protocols in such commonplace hybrid network realms, to possibly improve the detection performance. From that standpoint, the current study aspires to set the ground for IDS that hinge on diverse feeds in terms of network traffic features. Differently to the related work, the ultimate goal here is to investigate if and to what degree application layer attacks can be detected with lower layer features, either or both frame-level or packet-level, which however are readily accessible, meaning neither encrypted nor anonymized.
While this effort in the context of this paper is concentrated on IEEE 802.11 networks, future work may exploit the same methodology for other mainstream network access technologies, including cellular. In short, the analysis conducted in the above sections suggests that when features stemming from different network realms and layers of the protocol stack are used alongside each other, the detection performance of the ML model is increased. With reference to our experiments, this boost rose up to almost 95.3% in terms of the AUC metric, which is significantly greater (around 4.5%) vis-à-vis the best result obtained with just the 802.11 feature set. Finally, yet importantly, it was demonstrated that the inclusion of engineered, yet generalized enough, features grounded in empirical evidence and/or theoretical insight can improve the prediction capacity of the ML model; this can be particularly beneficial for detecting challenging attacks exhibiting a diminutive and imperceptible footprint. Nevertheless, a thorough investigation of this potential is well beyond the scope of this paper and is left for future work. Along with the previous future direction, a different one could aim at experimenting with diverse sets of non-802.11 cherry-picked features originating from diverse protocols in the protocol stack.