Models versus Datasets: Reducing Bias through Building a Comprehensive IDS Benchmark

Abstract: Today, deep learning approaches are widely used to build Intrusion Detection Systems (IDS) for securing IoT environments. However, the hidden and complex nature of these models raises various concerns, such as trusting the model output and understanding why the model made certain decisions. Researchers generally publish a proposed model's settings and performance results on a specific dataset and classification model but do not report the model's output and findings. Similarly, many researchers propose an IDS solution based only on a single benchmark dataset and classifier. Such solutions are prone to generating inaccurate and biased results. This paper overcomes these limitations of previous work by analyzing various benchmark datasets and various individual and hybrid deep learning classifiers towards finding the best IDS solution for IoT that is efficient, lightweight, and comprehensive in detecting network anomalies. We also show the models' localized predictions and analyze the top contributing features impacting the global performance of the deep learning models. This paper aims to extract aggregate knowledge from various datasets and classifiers and to analyze their commonalities to avoid possible bias in the results and increase the trust and transparency of deep learning models. We believe this paper's findings will help future researchers build a comprehensive IDS based on well-performing classifiers and utilize the aggregated knowledge and the minimum set of significantly contributing features. This output can help researchers improve their models' design to reduce size, increase processing speed, and improve attack detection capability. This work reveals future opportunities towards building a comprehensive online IDS based on a minimal but significantly contributing feature set. Future researchers can also gain insight into well-performing classifiers and benchmark datasets to find an optimal fusion of the best classifiers and related settings.


Introduction
Growing consumer, business, and industrial demand for advanced Internet of Things (IoT) solutions creates unique challenges in securing these devices. IoT devices are limited in resources such as memory, disk space, and processing power. Therefore, traditional security solutions (firewalls, antivirus, software patches) applicable to personal computers do not adequately secure the growing number of IoT devices [1]. Unlike personal computers, these small devices not only lack built-in robust security features, but manufacturers also do not consistently deliver security patches and upgrades after selling the devices [2]. Naïve users either do not change the default passwords or fail to set strong passwords; this makes the devices vulnerable to various cyber-attacks. Using advanced mining tools, adversaries can easily take over these devices remotely by finding the many access points and penetrating deep into the network [3]. Once compromised, thousands or possibly millions of IoT devices can be controlled by a command and control (C&C) server to launch various large-scale attacks, such as Distributed Denial of Service (DDoS), on critical national infrastructure and websites.
In order to secure IoT devices, a drastic change in security approaches is required: a paradigm shift away from applying conventional security solutions such as firewalls, antivirus software, and patches [4]. A common and widely adopted network attack detection approach is the anomaly-based (also known as behavior-based) Intrusion Detection System [1]. Meanwhile, internet access allows cyber-attackers to use sophisticated software to quickly overtake these devices and use them as bots to launch further large-scale distributed attacks on critical infrastructure and websites [15]. Many attack types can impact IoT systems, including distributed denial of service (DDoS), botnet, keylogging, phishing, spamming, click fraud, identity theft, and malware proliferation; however, the most common and most dangerous are DDoS and botnet attacks. Some of these are active attacks because they occur during real-time network communication: they either affect normal operations or alter the system resources, e.g., data alteration or flooding the target system (DDoS, DoS) to make it unresponsive and prevent it from serving new requests. Contrary to active attacks, passive attacks monitor and analyze traffic patterns by eavesdropping to find vulnerabilities such as open ports, weak ciphers, and unencrypted traffic [16]. Protecting the IoT environment is therefore critical. Many research studies have been based on finding the optimal machine learning solution; however, most of them do not provide a comprehensive solution by investigating network traffic from various sources or building robust models that converge quickly and provide the best results. Some researchers trained their models on a single, old dataset (e.g., NSL-KDD), which lacks modern-day attacks [17]. An IDS must detect a large number of attack types because some DDoS attacks, such as "Slowloris", slowly deplete server resources and are hard to detect [18], while others, such as "Jamming" attacks, can quickly drain resources and have serious consequences [19].

Related Work
The performance of conventional ML approaches plateaus and remains stagnant once a certain computational threshold has been reached; in contrast, DL performance grows rapidly as the dataset size grows [10]. Therefore, our focus in this study is only on deep learning-based models. Because network attacks change continuously every year, it is important to train models on diverse datasets consisting of different attacks and also to provide the ability to quickly retrain models on new traffic patterns to evaluate IDS performance [20]. Security experts need to be able to interpret a model's output consistently and understand the factors and features that cause the model to make certain predictions [21]. Features have a significant impact on ML model predictions and their quality [22]. Many research studies have recently been performed on explaining model outputs [23][24][25][26]. The two common approaches to explaining model output are post hoc (also known as model-agnostic) and model-specific approaches. The model-agnostic approach has received much attention in the research community [27]. Model-agnostic approaches to explaining machine learning models provide many advantages over model-specific interpretation approaches; an important one is that they allow researchers to build any machine learning model they desire without worrying about the model's complexity and interpretability [28]. With model-specific interpretation, it is difficult to switch to a new or existing model, even with a slight change in the model settings.
Many researchers have proposed deep learning-based IDS for IoT environments; however, few have explained their model's output. For example, in [23], the authors tried to explain their model's output, but their solution is not comprehensive: they used only a single dataset, NSL-KDD, to train the model and predict attacks. In another study, the authors only investigated the robustness of interpretable models [29]. Some researchers used Convolutional Neural Networks (CNN) to detect diseases using localized model interpretation only [30]. In other research, the authors used the NSL-KDD dataset to train a deep neural network and applied LIME and SHAP to explain the model behavior [31]. However, their solution focuses on explaining a single classifier trained on a single dataset, NSL-KDD, which is old and lacks recent attacks [32], and they did not use their research to find optimal classifier settings across various datasets. In [33], the authors used a localized explanation to interpret a Multilayer Perceptron (MLP) model trained on an old dataset. This paper is a step forward from our previously published paper [34], which demonstrated the bias issues in models that are trained on a single dataset. In our previous work, we first replicated various DL-based models from many researchers using the same datasets, classifier settings, and preprocessing steps; second, we demonstrated the bias issue by switching to a new benchmark dataset, which caused the well-performing classifiers to drop their performance considerably. In this paper, we overcome this limitation: instead of building a model on a single dataset with a single DL algorithm, we incorporate multiple datasets and multiple algorithms to diversify both the input data and the processing algorithms for optimal performance.

Benchmark Datasets
The dataset has a critical role in building and testing a holistic IDS solution. Deep learning models need a large volume of data for better classification and improved performance. Some of our goals in this study, as stated earlier, are (a) to gather, extract, and analyze aggregate knowledge from various benchmark network datasets and look at the commonalities to avoid any possible bias in the produced results and (b) to mitigate dataset quality and reliability issues. To achieve these goals, we gathered various benchmark network datasets that capture a diverse set of attacks, devices used for data collection, traffic capture durations, and actual and simulated traffic patterns. Some of these datasets, such as KDD CUP 99 and NSL-KDD, are very old and not IoT-specific; however, researchers have still used them extensively to build their models. Others, such as BoT_IoT, are recent and provide IoT-specific traffic. Table 1 provides details of the datasets used in our empirical analysis.

Dataset Quality and Reliability Issues
Data quality is important and greatly affects the reliability and robustness of an IDS. Data quality problems impact model predictions and, in turn, the decisions of security experts who rely on IDS results. From an IDS perspective, it is important not only to train ML-based models on clean and reliable data but also to diversify the data to cover a vast range of traffic patterns and minimize the model's biased output towards limited traffic patterns. In ML, some of the common data quality issues are noisy or insufficient labeled data, imbalanced data, duplicate data, incomplete data, and inconsistent data. A dataset consisting of only a few attack patterns cannot fulfill the IDS purpose. In the preprocessing steps of ML-based modeling, data quality issues are fixed using statistical or manual techniques to ensure models are trained on clean and reliable data only.
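The preprocessing fixes listed above (duplicate records, missing labels, incomplete values) can be sketched in a few lines of pandas. The column names, fill strategy, and toy data here are illustrative assumptions, not the paper's exact pipeline:

```python
import pandas as pd

def clean_traffic_frame(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Apply common preprocessing fixes: duplicates, missing labels, gaps."""
    df = df.drop_duplicates()                      # duplicate records
    df = df.dropna(subset=[label_col]).copy()      # unlabeled rows are unusable
    # Fill remaining numeric gaps with the column median (incomplete data)
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df.reset_index(drop=True)

# Tiny illustrative frame exhibiting the issues named above
raw = pd.DataFrame({
    "duration": [1.0, 1.0, None, 4.0],
    "bytes":    [10,  10,  30,   40],
    "label":    ["normal", "normal", "dos", None],
})
clean = clean_traffic_frame(raw)   # duplicate row and unlabeled row removed
```

In a real pipeline, the same steps would run before encoding and scaling, so that noise never reaches the training set.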
In [48], the authors emphasized the impact of poor-quality data on ML-based models' performance. The authors argued that very few research studies pay attention to data requirements and quality issues. The authors presented eight data quality issues relevant to ML-based IDS. These include reputation, relevance, comprehensiveness, timeliness, variety, accuracy, consistency, and de-duplication. In [49], the authors empirically analyzed data quality issues and stated that the data quality issues in ML-based models cause compounding negative events, create downstream effects, and become a technical burden to manage over time.
Today, many ML-based IDS solutions are proposed by training a model on a single dataset. These models assume that the training labels are accurately identified. However, this assumption is not fully accurate, because the labels are manually created by a security expert who may lack the appropriate judgment; similarly, imbalanced class labels can impact the performance and accuracy of an ML model [50]. Researchers adopt different approaches to mitigate dataset quality and reliability issues, such as removing noise in the labels, increasing unique input data instances to statistically represent the population, and utilizing techniques such as bootstrapping or SMOTE (Synthetic Minority Over-sampling Technique) [5] to overcome class imbalance. In this paper, we mitigated the data quality issues that may cause biased results by performing various preprocessing steps to remove noise from the data and by training various models on multiple datasets containing unique attack classes and traffic patterns.
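The core idea behind SMOTE, interpolating between a minority-class sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration of the technique, not the imbalanced-learn implementation typically used in practice:

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Generate synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from X_min[i] to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]       # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Oversample a 4-sample minority class with 6 synthetic points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority class already occupies.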

Deep Learning Classifiers for Sequential Data
Researchers have been proposing a vast range of ML and DL solutions for network anomaly detection. This paper focuses its experiments only on those DL classifiers that are well known, commonly used, and known to perform well on sequential data. Section 3 presents our proposed framework with the various DL classifiers used in this study. First, we implemented a Multilayer Perceptron (MLP), a feed-forward neural network that consists of an input, a hidden, and an output layer. It has frequently been used for network anomaly detection [11,[51][52][53][54]. Autoencoders (AE) are robust unsupervised neural networks. AEs help mitigate data imbalance, perform dimensionality reduction, and use reconstruction errors to detect anomalies [55]. Researchers have shown great interest in AEs for anomaly detection, fault diagnostics, dimensionality reduction, compression, and other related problems.
A Recurrent Neural Network (RNN) is a powerful and well-known algorithm for finding hidden patterns in sequential data [56]. An RNN captures temporal dependencies in the data by storing the data received earlier [57]. RNNs are good at processing sequential data; however, they suffer from the vanishing gradient problem, are challenging to train [58], and cannot remember longer sequences [59]. Long Short-Term Memory (LSTM) is a specialized RNN that can remember information for a longer period using a memory cell, overcoming the RNN's vanishing gradient problem [60]. LSTMs consist of three gates: an input gate, a forget gate, and an output gate [57]. The gates control access and information flow in the memory cell and prevent stored information from being overwritten with irrelevant information. Bi-directional LSTM (BLSTM) was introduced to solve some of the problems of traditional LSTM, such as its inability to operate in both forward and backward time directions [61]. Similarly, LSTM does not work for tasks requiring explicit external memory [60]. BLSTMs are good at processing sequential data in both time directions using forward and backward LSTM layers [62].
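The three gates described above can be made concrete with a single numpy LSTM step; the stacked weight layout and the input/hidden sizes below are illustrative assumptions, not settings from our experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: the input, forget, and output gates control
    what enters, what stays in, and what leaves the memory cell c."""
    z = W @ x + U @ h_prev + b          # all four pre-activations, stacked
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2 * H])             # forget gate
    o = sigmoid(z[2 * H:3 * H])         # output gate
    g = np.tanh(z[3 * H:4 * H])         # candidate cell state
    c = f * c_prev + i * g              # gated memory update
    h = o * np.tanh(c)                  # gated output
    return h, c

rng = np.random.default_rng(0)
D, H = 5, 4                             # illustrative input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(3):                      # run a short random sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```

A BLSTM simply runs one such recurrence forward over the sequence and a second one backward, then concatenates the two hidden states at each step.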
A Convolutional Neural Network (CNN) is primarily used for image classification and produces high accuracy on complex tasks [56]. A basic CNN consists of one or more convolutional, pooling, and fully connected layers. The 1D-CNN has been used in many IDS studies and produces good results in processing sequential data [63]. CNN's main advantage over a feed-forward neural network is that each neuron in a CNN connects only to a subset of the input; this reduces the total number of network parameters and improves training time [64]. A Temporal Convolutional Network (TCN) is a feed-forward network and a variant of CNN that uses causal convolutions; a CNN architecture combined with causal padding makes it a causal convolution [65]. A TCN is known to maintain the temporal sequence of the data, which helps prevent information loss, and it allows better GPU optimization during training [66].
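Causal convolution, the building block of the TCN described above, can be illustrated in a few lines: with left padding, each output y[t] depends only on current and past inputs, and the dilation rate widens the lookback. The kernel values below are illustrative:

```python
import numpy as np

def causal_conv1d(x: np.ndarray, kernel: np.ndarray, dilation: int = 1) -> np.ndarray:
    """1D causal convolution: y[t] depends only on x[t], x[t-d], x[t-2d], ...
    Left zero-padding keeps the output the same length as the input."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y1 = causal_conv1d(x, np.array([1.0, 1.0]), dilation=1)  # y[t] = x[t] + x[t-1]
y2 = causal_conv1d(x, np.array([1.0, 1.0]), dilation=2)  # y[t] = x[t] + x[t-2]
```

Stacking such layers with dilation rates 1, 2, 4, ... lets the top layer see an exponentially growing slice of the input history with few parameters, which is what makes a TCN receptive to long sequences.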

Feature Importance
Over the years, extensive research has been performed on building an optimal IDS for IoT environments. However, researchers have been challenged to optimally handle changing network traffic patterns, technology shifts, and large data volumes [67]. An essential operation in machine learning (ML) is finding the best features before training the models. Unfortunately, there are no commonly agreed-upon feature selection approaches that can be applied to every problem at hand. Researchers sometimes use their judgment, their experience, or statistical techniques such as Principal Component Analysis (PCA) to find the smallest number of optimal features. Each of these approaches aims to effectively handle large datasets with many features without compromising model performance and accuracy. Once the representative features are extracted, they are used to train the ML classifiers and are applied to test unknown traffic patterns [68].
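As one example of the statistical techniques mentioned above, PCA can be sketched with a plain SVD; the synthetic data and the component count below are illustrative choices, not part of our experiments:

```python
import numpy as np

def pca_reduce(X: np.ndarray, n_components: int):
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# 200 samples with 10 features but only 2 underlying directions of variance
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
Z, ratio = pca_reduce(X, n_components=2)   # two components capture nearly all variance
```

When only a couple of directions dominate the variance, as here, the projected data Z can replace the original features at a fraction of the dimensionality.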
Selecting the important features is commonly performed manually in traditional ML approaches. The process is challenging, labor-intensive, time-consuming, and prone to errors [57]. The feature selection computation time increases when the input dataset has a large number of features [69]. On the other hand, deep learning algorithms perform well on large datasets without explicit feature selection [70]. However, some researchers still find benefits in performing feature selection in deep learning-based models [56]. For example, in one study, the authors used the NSGA-ii-aJG feature selection method with a CNN + LSTM model and achieved 99.03% accuracy [71]. In another study, the authors used DL's capability to auto-select features to detect anomalies; their proposed LSTM model achieved a 99.98% precision score, compared to an SVM model, which could only achieve 88.18% [57].
Deep learning models are inherently complex with a large number of hidden parameters, complex settings, multiple layers, and hidden nodes. Because of the complex nature of DL models, many researchers only report model settings and performance results but nothing about the output and findings of the model itself. Some researchers try to justify their classifier performance using techniques such as cross-validation, but DL models could still fail to learn important hidden representations that an expert human in the field might consider necessary. Similarly, a model may consider certain features important, which an expert human may not consider necessary.
Explaining model output is essential for various reasons, including (a) human curiosity about specific predictions [72], (b) making the model explainable to naïve users, (c) enhancing and redesigning the model by analyzing its output to gain optimal performance, (d) understanding why specific wrong predictions were made, and (e) improving trust and confidence in the model's decisions. Once a model is analyzed as a whole, or when individual predictions are deeply analyzed, the model can be redesigned with a limited set of features for better results and performance. When a model is explained locally, its scope is limited to a single input data instance; individual probabilistic scores are calculated to express the prediction for that instance relative to all predictions. On the other hand, models can be explained as a whole, with a global scope. The approach to explaining models can be built within a DL model itself or can be applied post hoc. However, an explanation built within a model has a restrictive scope and can change significantly with a slight change in model settings. Researchers have therefore shown interest in the post hoc approach, which uses ad hoc methods and techniques to explain any model that has already been built [73]. Complex deep learning models, such as ensemble or hybrid models, are not easy to interpret; for such models, the post hoc approach provides a simpler way to interpret the output. The following sub-sections present two state-of-the-art libraries commonly used in post hoc methods to explain models locally and globally.

Local Interpretable Model-Agnostic Explanation (LIME)
LIME explains the individual predictions of any model by approximating it locally [74]. The output of LIME provides a quantitative and visual understanding of an instance and the model predictions so that a naïve or expert person can build trust in a model and make effective decisions. LIME provides local fidelity of predictions, which may produce different results than global explanatory methods; in other words, features that LIME considers important for a local prediction might not be considered important by the global explainers, and vice versa. LIME has some advantages over global model explainers such as SHAP; for example, it is computationally fast and converges quickly, because explaining an individual model prediction is faster than finding and aggregating global permutations. LIME uses an intuitive approach to analyze model output by providing variations in the input data. The variation is generated by creating new perturbed data from the original input; the new data are then used to train a surrogate model and interpret its predictions. An important pre-consideration when interpreting models is to specify the number of features with which we want to interpret the model; the more features, the better the trustworthiness of the model. Equation (1) represents a local interpretation of an input data sample x:

$\xi(x) = \operatorname*{argmin}_{g \in G} \, L(f, g, \pi_x) + \Omega(g)$  (1)

where g represents an individual model from the list of available explainable models G (e.g., decision tree, linear regression) for the sample instance x, and f represents the original model (e.g., MLP) being explained. $\pi_x$ defines the proximity measure, or weight, between the original and sampled data. $L(f, g, \pi_x)$ represents the fidelity function, which tries to reduce the loss L (e.g., categorical cross-entropy) and measures the difference between the predictions made by the original model f and its representation g. Finally, $\Omega(g)$ represents the complexity of model g.
LIME explains the model by trying to minimize the fidelity function and the complexity so naive humans can easily interpret the model.
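A minimal sketch of this local-surrogate idea: perturb the instance, weight the samples by a proximity kernel, and fit a weighted linear model to mimic the black-box near x. This is a simplified illustration of Equation (1), not the lime library's implementation, and the kernel width and sample count are illustrative assumptions:

```python
import numpy as np

def lime_sketch(f, x, n_samples=500, width=0.75, seed=0):
    """LIME-style local surrogate: perturb x, weight samples by proximity,
    and fit a weighted linear model g that mimics f near x."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))  # perturbed data
    y = np.array([f(z) for z in Z])                          # black-box outputs
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)                       # proximity kernel
    A = np.hstack([Z, np.ones((n_samples, 1))])              # linear g + intercept
    Wh = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(Wh * A, Wh[:, 0] * y, rcond=None)
    return coef[:-1]                                         # local feature weights

# Black-box model whose local behaviour is dominated by feature 0
f = lambda z: 3.0 * z[0] + 0.1 * z[1]
weights = lime_sketch(f, x=np.array([1.0, 1.0]))
```

For this (already linear) black-box, the surrogate recovers the true local weights; for a deep model, the weights describe only the behaviour in the neighbourhood of x, which is exactly the local fidelity discussed above.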

SHapley Additive Explanation (SHAP)
SHAP is another popular method used to interpret model predictions for increased model transparency. It is based on coalitional game theory, which fairly distributes the payout among the features; SHAP measures each feature's contribution towards the model output [75]. The two great advantages of SHAP are (a) global interpretability, which provides a good explanation of each feature's contribution (either positive or negative) to the model predictions across all permutations, and (b) local interpretability, which provides transparency through the local interpretation of each observation. Each observation x is assigned corresponding SHAP values, and this can be applied to any model. SHAP computes the contribution of each feature to the predictions as follows:

$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$  (2)

where g represents the explainable model of the deep learning model f. $z' \in \{0,1\}^M$ is the vector of simplified features, also known as the coalition vector, where 0 means the feature value in the new data is not present in the original data and 1 means it is present. M represents the maximum coalition size, and $\phi_j \in \mathbb{R}$ represents the feature attribution, quality, or impact of feature j on the predictions. A large value of $\phi_j$ indicates a strong positive impact of a feature on the predictions. Depending on the model type, SHAP provides various explainers [76]; some of the commonly used ones are as follows: (a) TreeExplainer can be used to explain tree-based classifiers such as XGBoost, random forest, etc.; (b) KernelExplainer can be used to explain any function; (c) DeepExplainer only explains deep neural networks; and (d) GradientExplainer can also be used to explain neural networks built in frameworks such as TensorFlow, Keras, and PyTorch [77].
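The additive attribution above can be illustrated with exact Shapley values on a toy model, enumerating all coalitions. This brute-force enumeration is feasible only for a handful of features; the SHAP library's explainers exist precisely to approximate it efficiently. The toy model and baseline are illustrative:

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values for a small model: average each feature's
    marginal contribution over all coalitions. Features outside the
    coalition are held at the baseline value."""
    M = len(x)
    phi = np.zeros(M)
    for j in range(M):
        others = [k for k in range(M) if k != j]
        for r in range(M):
            for S in itertools.combinations(others, r):
                weight = (math.factorial(len(S)) * math.factorial(M - len(S) - 1)
                          / math.factorial(M))
                z_with, z_without = baseline.copy(), baseline.copy()
                for k in S:
                    z_with[k] = x[k]
                    z_without[k] = x[k]
                z_with[j] = x[j]
                phi[j] += weight * (f(z_with) - f(z_without))
    return phi

f = lambda z: 2.0 * z[0] + z[1] * z[2]      # toy "model" with an interaction
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(f, x, baseline)
```

Note how the interaction term z[1]*z[2] is split equally between features 1 and 2, and the attributions sum to f(x) - f(baseline) (the efficiency property that makes the payout distribution "fair").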

Proposed Framework
Many research studies have been performed on anomaly detection in which researchers report their model settings and performance measures. Our proposed methodology overcomes some important limitations of previous research studies, where researchers report neither their proposed model's output itself nor which features led the model to make certain predictions or influenced the overall model performance. Similarly, many previous research studies generate biased results by training the models on only a single dataset consisting of a limited network traffic pattern. A single dataset does not cover a vast range of attack classes and may lack recent traffic patterns [11]. This paper explores the output of various DL models by implementing SHAP and LIME, analyzing predictions, finding commonalities to avoid bias, improving classifier quality and reliability, and extracting the top contributing features that influenced the model predictions most. Figure 1 depicts the proposed framework overview.
To avoid the bias issue of a single dataset, we used various benchmark datasets; their details are presented in Section 2.1. Eight different deep learning models, known for processing sequential data and producing good results in anomaly detection, are then trained on these datasets. In our approach, we first implemented single classifiers and then hybrid classifiers to find the impact on classification. Detailed summary results were captured with numerous pieces of valuable information, such as model settings, trainable parameters, training time, and model size. The novelty of this work is reducing biased results and improving model interpretability by using LIME to explain local predictions and SHAP to explain both local and global model output. The top 20 important contributing features were extracted and then used to enhance the DL models by improving model settings and performance and by choosing the best classifier for a resource-constrained environment. Our experiments involve various datasets and DL models. To conserve space, we present only the architecture of one of our hybrid models, "Autoencoder + TCN", in Figure 2. The same architecture was used for the other hybrid model implementations shown in Figure 1; in those implementations, we replaced the "TCN Structure" block in Figure 2 with LSTM, BRNN, or BLSTM. Model-specific settings are reported in Table 2 (individual classifiers) and Table 3 (hybrid classifiers). In this architecture, the autoencoder first takes the input shape of the dataset, reduces its dimensions to a smaller size (also known as the latent space), and then reconstructs the original data from the compressed representation. Autoencoders generate reconstruction errors and provide very high accuracy with low-latency detection [78]. The output of the autoencoder is passed into a TCN structure. The pooled outputs of the TCN block layers are flattened into a one-dimensional array and passed to a fully connected layer.
The last layer outputs the total number of classes using the "softmax" activation function to make predictions for each label. The TCN block in Figure 2 is the layered implementation of the dilated causal convolution of the TCN architecture; Figure 3 shows the details of this model. The TCN block takes three parameters: TCN block (filters, kernel_size, dilation_rate), where "filters" are similar to units in an LSTM and affect the model size. A larger filter size is preferred; it helps to train the model in parallel and faster, unlike RNN and LSTM, where predictions must wait for the predecessor results [79]. In a TCN, a longer input sequence can be processed as a whole instead of sequentially. The "kernel_size" parameter determines the size of each filter in each convolution layer; it determines how much of the input is used to calculate each value in the output. A larger kernel_size helps detect complex features. The "dilation_rate" parameter represents a fixed step between two adjacent filter taps. A dilation_rate of 1 is the same as a regular convolution in a CNN; a larger dilation rate captures a bigger input range in the top convolution layer, making a TCN more receptive [79]. Our TCN structure starts with a Conv1D dilated causal convolution layer, followed by a batch normalization layer to obtain high accuracy and speed up model training. We then applied a Rectified Linear Unit ("ReLU") to allow quick network convergence and added a "Dropout" layer to avoid over-fitting and add regularization by randomly dropping 30% of the weights connected to certain nodes during training. The same layer structure is repeated in the TCN block, as shown in Figure 3.
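A minimal Keras sketch of this "Autoencoder + TCN" pipeline follows; the input dimension, latent size, filter counts, dense width, and class count are illustrative assumptions rather than the exact settings reported in Tables 2 and 3:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim, n_classes = 41, 16, 5   # illustrative sizes

inp = keras.Input(shape=(input_dim,))
# Autoencoder: compress to the latent space, then reconstruct
encoded = layers.Dense(latent_dim, activation="relu")(inp)
decoded = layers.Dense(input_dim, activation="relu")(encoded)
# Treat the reconstructed features as a 1D sequence for the TCN structure
x = layers.Reshape((input_dim, 1))(decoded)
for rate in (1, 2, 4):                         # stacked dilated causal convolutions
    x = layers.Conv1D(32, kernel_size=3, dilation_rate=rate, padding="causal")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Dropout(0.3)(x)                 # drop 30% during training
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(n_classes, activation="softmax")(x)  # one score per label

model = keras.Model(inp, out)
```

Swapping the dilated-convolution loop for an LSTM, BRNN, or BLSTM layer yields the other hybrid variants in Figure 1 with the rest of the pipeline unchanged.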

Experiments, Results, and Analysis
This section discusses the experimental setup and various findings. We present our rigorous experiments in several parts. First, Table 2 shows the summary of the individual classifiers, and Table 3 shows the hybrid classifiers' implementation summary. All experiments presented in Table 2 were executed with 20 epochs and a batch size of 256. Several pieces of valuable information were gathered, including (a) trainable parameters, (b) training time, (c) model size, (d) accuracy, and (e) epochs and batch size. Separate experiments were performed on the NSL-KDD and UNSW_NB15 datasets: in (*a), models were trained and evaluated on the given training and testing sets; in (*b), we merged the training and testing sets, shuffled the result, and recreated new training and testing sets based on a 70:30 split ratio; and in (*c), models were trained on the given full dataset, from which training and testing sets were created using a 70:30 ratio. Second, Figure 6 shows SHAP's local explanation for the KDD CUP 99 dataset; to conserve space, we present only a single prediction from a single dataset. Figure 7 shows the local explanation of one of the predictions using LIME. Lastly, Figure 8 shows the results of SHAP's global explanation by listing the top 20 most important contributing features. Table 2 shows that the LSTM model requires over 3 million trainable parameters, thus requiring a longer training time than MLP and CNN. Running the model on UNSW_NB15 (*c) requires over 8.5 h, making it an expensive algorithm for a resource-constrained IoT environment. Figure 4 compares the accuracy of each of the classifiers. All three algorithms return similar accuracy, with LSTM performing slightly better on the UNSW_NB15 (*a) dataset. Table 3 shows the summary of the hybrid classifiers. "Autoencoder + TCN" requires on average the smallest number of trainable parameters, around 199K, whereas "Autoencoder + BLSTM" requires the most, over 9.2 million.
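The (*b) resplitting regime described above can be sketched as follows; the array shapes are illustrative placeholders for the actual benchmark datasets:

```python
import numpy as np

def resplit_70_30(X_train, y_train, X_test, y_test, seed=0):
    """Regime (*b): merge the given training and testing sets, shuffle,
    and recreate a 70:30 split (regime (*a) keeps the original split)."""
    X = np.concatenate([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle the merged dataset
    cut = int(0.7 * len(X))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

# Placeholder arrays standing in for a dataset's given train/test files
X_tr, y_tr = np.zeros((70, 4)), np.zeros(70)
X_te, y_te = np.ones((30, 4)), np.ones(30)
Xa, ya, Xb, yb = resplit_70_30(X_tr, y_tr, X_te, y_te)
```

Because the shuffle mixes the two original files before splitting, the new training and testing sets share the same traffic distribution, which is the point of comparing (*a) against (*b).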
All other classifiers also require a large number of parameters, with "Autoencoder + LSTM" taking over 3.3 million and "CNN + LSTM" over 3.2 million trainable parameters. For an IoT environment, "Autoencoder + TCN" is convenient due to its model size of under 3 B, its shorter training time, and its better accuracy compared to the other models. Figure 5 shows the accuracy comparison of the hybrid classifiers: "Autoencoder + TCN" returns better accuracy than the other algorithms, whereas "Autoencoder + BRNN" returns very low accuracy, with only 23.30% on the N-BotIoT and 31.90% on the UNSW_NB15 (*a) datasets. Model training time is also an important concern in a live environment. As shown in Table 3, algorithms such as "CNN + LSTM" took over 9 h to converge on the UNSW_NB15 (*c) dataset. Models that converge faster can be retrained multiple times a day with new attack information to keep them up to date.

Individual Prediction Interpretation-Localized Explanation
We performed two analyses to look for a suitable measure of feature importance: SHAP and LIME each provide a localized explanation of a specific prediction, and both local and global explanations show the complexity of models when making predictions. The SHAP local explanation in Figure 6 visually shows the features the classifier considered important in making an individual prediction. The prediction reflects the probability that the input traffic is a DoS attack. The results show that the base value of 79.24% (the model's average prediction over the training set) would be predicted if the features contributing to the current output f(x) were unknown. However, the classifier was able to predict the DoS attack with 100% confidence. The top features that push the prediction towards the attack are shown in red (count, dst_host_count, service_ecr_i, service_http, etc.), whereas the blue feature, dst_host_srv_count, drives the prediction value lower. The larger arrow for count shows the magnitude of this feature's effect on the classification decision. Figure 7 shows the local explanation of a DoS vs. NOT DoS prediction using LIME. The classifier predicts that the traffic is a DoS attack, and the LIME bar chart highlights the importance given to the most relevant features that led to the prediction. As shown in Figure 7, protocol_type_udp, logged_in, count, srv_count, and dst_host_srv_count contribute to the "DoS" prediction, while serror_rate counts against it. Cybersecurity experts can review these granular details to make an informed decision about trusting the model's predictions.

Model Interpretation-Global Explanation
Similar to the local explanations presented in Section 4.1, global explanations also interpret the model by providing insights into the input features, patterns, and their correlation with the output to promote model behavior transparency [73]. SHAP provides various methods to explain models. This paper's scope is limited to deep learning models only, so we used DeepExplainer and GradientExplainer to extract the important contributing features in the overall model output. Analyzing global explanations of models is important to understand general model behavior when dealing with large datasets and a large number of features. We implemented multiclass classification in our experiments and extracted the top 20 contributing features for each attack type. The experiments were run on all eight datasets and eight DL classifiers. To conserve space, we present the visual output of only two datasets in Figure 8. Figure 8 ranks features by their mean absolute Shapley value, I_j = (1/N) ∑_{i=1}^{N} |φ_j^(i)|, where φ_j^(i) refers to the Shapley value of feature j in the i-th sample, N is the total number of samples in the dataset, and I_j is the average Shapley value of feature j. Briefly, Figure 8a depicts dur as the most important feature, followed by sum, N_IN_Conn_P_DstIP, etc., in the BoT_IoT dataset. Similarly, Figure 8b depicts sttl as the most important feature, followed by ct_state_ttl, dttl, smeansz, etc. Further details and analysis of which features each classifier considered important are presented in Section 4.3. Global explanation helps to further refine models towards building an online IDS, which can look at only a limited set of network traffic features to make decisions quickly in an unsupervised manner.
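The global importance I_j can be aggregated from per-sample Shapley values with a simple mean of absolute values; the sketch below illustrates this (the array layout and function names are assumptions for illustration; SHAP's own summary plots perform an equivalent computation):

```python
import numpy as np

def global_importance(shap_values):
    """I_j = (1/N) * sum_i |phi_j^(i)|: the mean absolute Shapley
    value of each feature over all N samples."""
    return np.mean(np.abs(shap_values), axis=0)

def top_k_features(shap_values, feature_names, k=20):
    """Return the k features with the highest global importance,
    sorted from most to least important."""
    imp = global_importance(shap_values)
    order = np.argsort(imp)[::-1][:k]
    return [(feature_names[i], float(imp[i])) for i in order]
```

Applying `top_k_features` per classifier and dataset yields the top-20 lists that are aggregated in Section 4.3.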

Top Contributing Features
This section presents the aggregated knowledge of important features identified by each of the eight DL classifiers across all eight datasets. Our approach was to extract the top 20 features from each classifier and divide them into three sub-categories: (a) Table 4 lists features identified by every classifier; these should be considered the most important features; (b) Table 5 lists features identified by six or more classifiers; and (c) Table 6 lists features identified by five or fewer classifiers. Each classifier reports features sorted from highest to lowest contribution; however, we do not report the ordering here, because presenting the individual output of eight classifiers on eight datasets would require a very large table. Our goal is to present aggregated knowledge and report the features to which all or the majority of classifiers assigned some importance in their top-20 lists, regardless of ordering. Table 4 reveals that some features in each dataset carry more importance than others; every classifier picked these features when making decisions. Similarly, as shown in Table 5, some features were given slightly less importance, with only one or two of the eight classifiers not ranking them as important. Lastly, many other features were considered important by five or fewer classifiers. The features in this third category still carry importance, but not as much as those in the first two categories.
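The aggregation into Tables 4-6 can be sketched as a counting exercise over the per-classifier top-20 lists. The function below is a hypothetical illustration of this bookkeeping, not our actual tooling:

```python
from collections import Counter

def categorize_features(top_lists, majority_min=6):
    """Split features by how many of the classifiers ranked them in
    their top-20 list: identified by every classifier (Table 4),
    by `majority_min` or more but not all (Table 5), and by fewer
    than `majority_min` (Table 6)."""
    counts = Counter(f for top in top_lists for f in set(top))
    n = len(top_lists)
    every    = sorted(f for f, c in counts.items() if c == n)
    majority = sorted(f for f, c in counts.items() if majority_min <= c < n)
    minority = sorted(f for f, c in counts.items() if c < majority_min)
    return every, majority, minority
```

Running this per dataset over the eight classifiers' top-20 lists produces exactly the three categories reported in Tables 4-6.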
Feature selection plays an important role in machine learning. For an IDS, eliminating unnecessary features can help improve performance, reduce computational cost, and detect malicious traffic earlier. In addition to these benefits, feature optimization also helps to find a subset of features that can produce better classification results. This paper, based on the output of multiple deep learning algorithms and datasets, presented a minimum set of top contributing features covering diverse attack classes and algorithms. Among the top contributing features shown in Tables 4-6, multiple traffic flow-based features are discovered by each algorithm, for example, source and destination IP addresses, ports, protocol, and flags used. Network flow-based features provide metadata on many security-related details of network activities. From the traffic flow information, security experts can identify a given session's full TCP/IP information to better analyze a particular activity. Network flow-based features also present a good use case for detecting large-scale attacks such as DDoS and DoS. These attacks generate excessive network traffic, and detecting them using individual packets could consume all available resources; therefore, it is optimal to analyze flow-based features to detect large-scale attacks [80].
The experimental results and the identification of the top contributing features in each dataset by a specific model are a step towards future research, in which the authors plan to further optimize each model by training it using only the top contributing features shown in Tables 4-6. Some DL models generate thousands of neurons and trainable parameters, as shown in Tables 2 and 3. For a large dataset with many features, this leads to the "curse of dimensionality"; reducing the number of input features helps mitigate this problem [81]. In [71], the researchers reported an accuracy improvement with a 5-fold reduction in training time by selecting a reduced feature set for model training compared to the total feature space.
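Training on only the top contributing features amounts to projecting the dataset onto the selected columns before model training. A minimal sketch of this preprocessing step, with hypothetical names:

```python
import numpy as np

def reduce_features(X, feature_names, keep):
    """Project dataset X (samples x features) onto the columns named
    in `keep`, in the order given by `keep`.  Features not present in
    `feature_names` are silently skipped."""
    idx = [feature_names.index(f) for f in keep if f in feature_names]
    return X[:, idx], [feature_names[i] for i in idx]
```

The reduced matrix is then fed to the same classifier, shrinking the input layer and, with it, the number of trainable parameters.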

Models versus Datasets, a Comparison Study
The performance of an ML-based model depends on multiple factors. For example, the dataset's size directly impacts the model training time. Although DL is known to provide optimal performance on large datasets, the bigger the training dataset, the longer it takes to train the model. Therefore, researchers have started exploring options to train DL-based models with distributed training using TPUs and GPUs that provide better computational capabilities [82]. Another important factor impacting DL-based models is datasets with a large number of features. As shown in Tables 2 and 3, the LSTM model generates over 3 million trainable parameters as an individual classifier, whereas "autoencoder + BLSTM" generates over 9 million trainable parameters on multiple datasets. Models that generate a large number of trainable parameters require longer training times and have bigger model sizes. Domain shift is another issue that impacts DL performance [83]. A general assumption in DL-based modeling is that the training and testing data come from the same distribution under the same settings. This assumption contradicts real-life scenarios, where data comes from different sources and consists of different attack patterns (variants of existing attacks or completely new attacks). DL-based models are normally trained offline with limited traffic patterns collected over a certain time period. Any change in the test data would result in models generating poor results. In [84], the authors empirically proved the impact of dataset size on model accuracy, arguing that accuracy increases as the dataset size grows for the DL model. The authors also argued that increasing the minority classes using techniques such as the SMOTE algorithm considerably increases the model's accuracy on minority classes.
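For reference, the core idea of SMOTE mentioned above is to synthesize minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea only; real projects would use the `imbalanced-learn` implementation:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: each synthetic sample lies on the line
    segment between a randomly chosen minority sample and one of its
    k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to all other minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Appending such synthetic rows to the training set rebalances the minority attack classes before model training, which is the mechanism behind the accuracy gains reported in [84].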
In data analytics projects, researchers usually publish their results using a particular dataset and a classification model. An accurate comparison with previous work is not possible for various reasons, which include but are not limited to the following: (a) variation in model settings, i.e., researchers normally publish their results but not their model settings, such as the number of hidden layers, epochs, and trainable parameters; (b) differences in computational resources, i.e., training a model on a CPU vs. a GPU environment or a local machine vs. a cloud provider; and (c) differences in dataset preprocessing steps, i.e., reducing dataset dimensionality using Principal Component Analysis (PCA) vs. an Autoencoder or any other method. Considering all these limitations, we attempted to provide a model accuracy comparison of our work with some previous work in which researchers trained an individual or hybrid deep learning classifier using a single benchmark dataset, although we strongly believe, and have reported in this paper, that any model trained on a single dataset produces biased results. Table 7 gathers some of the previous work in which researchers proposed an IDS for IoT using a single benchmark dataset to train deep learning classifiers. Our proposed deep learning model stack provides comparable performance to previous researchers' work, with the added benefit of avoiding bias by utilizing various datasets and classifiers. Some of our classifiers perform well on one dataset but not on others; for example, "autoencoder + LSTM" produces an accuracy of 100% on the BoT_IoT dataset but 80.50% on the N-BaIoT dataset. Similarly, "Autoencoder + BRNN" returns 100% accuracy on the BoT_IoT dataset but only achieves 23.30% on N-BaIoT, 51.90% on NSL-KDD, and 79.20% on KDD CUP 99. This emphasizes one of the reasons why a single classifier is not good enough to capture the attacks reported in different benchmark datasets.
Compared with previous researchers' work, Reference [10] reported 92.99% accuracy on CNN, whereas our CNN classifier achieved 98.60% accuracy. Reference [70] reported 86.34% accuracy on MLP, which is less than our MLP's accuracy of 99.90%; they also reported 95.14% accuracy on CNN, whereas our CNN achieved 100% accuracy. The one notable case where our models returned lower accuracy is on the N-BaIoT dataset, where Reference [53] reported 99.57% accuracy on CNN and 96.13% on MLP, while our models achieved 90.80% accuracy on average. Lastly, Reference [85] reported only 54.43% accuracy on MLP, compared to our MLP classifier's 100% accuracy. Although our results show overall better performance than other researchers' work, we believe there is a need for baselines or benchmarks that reuse the same configurations so that results can be reproduced and truly compared.

Conclusions and Future Work
Machine learning predictive modeling often involves a trade-off between what the model has predicted and understanding why certain predictions were made. Many machine learning-based research studies only report model settings and performance metrics but say nothing about the model's interpretability, output, and findings. Similarly, many machine learning models produce biased results for various reasons, such as training the classifier on a single benchmark dataset or trying to find and tune a single classifier that can provide blanket protection against intrusions. Explaining models visually and understanding the reasons for certain model behaviors provides a useful tool to detect bias.
Due to the complexity and hidden layers of deep learning models, a post hoc (after model training) approach to interpreting the model is proposed in this paper. We performed a comparative analysis of various benchmark datasets and deep learning models to gather the models' output commonalities and the aggregated knowledge each model generated on the various datasets. SHAP and LIME were used to gather localized explanations of specific predictions and the overall impact of the top contributing features on the model's output, providing insight into model decisions. Our results reveal that building a comprehensive IDS is not possible if the chosen framework is based on a single classifier and a single dataset. To be more explicit, in Table 2, an MLP model could detect attacks in the BoT_IoT dataset with 100% accuracy, while for UNSW_NB15 (*a) it could detect attacks with only 38.90% accuracy. Similarly, the "autoencoder + TCN" model could detect attacks in BoT_IoT with 100% accuracy, but for the NSL-KDD (*a) dataset it could detect attacks with only 75.60% accuracy. Thus, it is important to diversify model training across multiple input datasets to increase the model's knowledge base for optimal predictions. Our findings in this paper will help security experts make informed decisions and improve their trust in the model's predictions. Similarly, the global interpretation of the model's output will help researchers improve their model's design to reduce its size, increase processing speed, and improve attack detection capability. This work reveals future opportunities towards building a comprehensive online IDS based on a minimum set of significantly contributing features. Future researchers can also gain insight into well-performing classifiers and benchmark datasets to find an optimal fusion of the best classifiers and related settings.