In this section, an overview of the experimental results is provided, focusing on how well the models performed in both binary and multiclass classification scenarios. Different classifiers are evaluated to measure their accuracy in identifying encrypted traffic. The binary classification results were highly accurate, successfully distinguishing between DoH and non-DoH traffic. In the more challenging multiclass scenario, the models still performed well, accurately differentiating between multiple types of encrypted traffic, including the behaviors of various DoH clients. Feature selection was crucial to this performance, as it focused the models on the most relevant data, significantly improving their accuracy. The steps and techniques used in the experiments are discussed in detail in the following sections, exploring how performance was measured and how the proposed method influenced the results across both scenarios.
4.1. Performance Evaluation Metrics
In this study, five key performance measures are used—accuracy, precision, recall, the F1 score, and the confusion matrix—to evaluate the effectiveness of the proposed method. These measures collectively provide a comprehensive understanding of the model’s classification performance across various aspects, including overall accuracy, the handling of positive instances, and the balance between precision and recall.
Accuracy, as the most general metric, measures the overall correctness of a model. It is calculated by dividing the total number of correct predictions (both True Positives (TP) and True Negatives (TN)) by the total number of samples [34]. This metric is especially useful when the dataset is balanced, as it provides an overall measure of performance, as expressed in Equation (7).
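From this definition, Equation (7) takes the standard form
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]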
Precision, on the other hand, focuses on the model’s ability to correctly predict positive instances via Equation (8). It calculates how many of the predicted positive instances are actually correct, making it a crucial metric when False Positives (FP) need to be minimized [35].
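Correspondingly, Equation (8) takes the standard form
\[
\text{Precision} = \frac{TP}{TP + FP}.
\]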
Recall (also known as sensitivity) evaluates how well the model captures all TP instances via Equation (9). It measures the proportion of actual positive instances that are correctly identified, making it important when missing positive instances is costly [36].
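Equation (9) accordingly takes the standard form
\[
\text{Recall} = \frac{TP}{TP + FN}.
\]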
The F1 score balances both precision and recall by calculating their harmonic mean, providing a single metric that is particularly useful when dealing with imbalanced datasets, as defined in Equation (10). The F1 score helps balance the trade-offs between precision and recall [37].
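Equation (10) combines the two as
\[
F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.
\]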
The confusion matrix is an essential tool for evaluating the performance of an ML classification model. It shows a breakdown of the model’s predictions versus the actual results, allowing one to see how well the model performs and where it fails. The matrix itself is a table where the rows and columns indicate different categories [38]. Each cell in the table shows the number of predictions for a particular category from the predicted categories versus the actual categories.
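For illustration (this is not the authors’ evaluation code, and the toy labels below are assumptions), a confusion matrix can be computed with scikit-learn as follows:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical label vectors; in practice these come from the hold-out test set.
y_true = ["DoH-benign", "DoH-malicious", "DoH-malicious", "DoH-benign"]
y_pred = ["DoH-benign", "DoH-malicious", "DoH-benign", "DoH-benign"]

# Rows correspond to actual classes, columns to predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["DoH-benign", "DoH-malicious"])
print(cm)  # [[2 0]
           #  [1 1]]
```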
The Chi-square test was applied to demonstrate the significant impact of ACO-based feature selection on model performance compared with the same model without feature selection. For instance, when the Random Forest algorithm was used with ACO-selected features, it achieved the highest accuracy, significantly outperforming the model without feature selection. The p-values for the binary and multiclass classification scenarios were 0.0012 and 0.0007, respectively.
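One plausible way to run such a test (the contingency counts below are hypothetical, not the paper’s) is to compare correct/incorrect prediction counts with and without feature selection using SciPy:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: correct vs. incorrect predictions
# for a model with ACO-selected features and the same model with all features.
table = [[990, 10],   # with ACO feature selection: correct, incorrect
         [965, 35]]   # without feature selection:  correct, incorrect
chi2, p, dof, expected = chi2_contingency(table)
# A small p-value (< 0.05) indicates the improvement is statistically significant.
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```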
In addition to evaluating classification performance on the hold-out test set, 5-fold cross-validation (5-fold CV) was employed to further validate the consistency and robustness of the results. This method involves dividing the training data into five subsets, ensuring that each model is trained and tested across varied data partitions.
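A minimal sketch of this procedure with scikit-learn, using synthetic data as a stand-in for the CIRA-CIC-DoHBrw-2020 training split, is:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data restricted to 15 selected features.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

# 5-fold CV: the model is trained and tested on five different partitions.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=5, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.4f} +/- {scores.std():.4f}")
```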
4.2. Feature Analysis and Discussion
Feature selection is a critical step in handling large datasets, as it reduces dimensionality, enhances model performance, and mitigates overfitting caused by irrelevant or redundant features. An excessive number of features can introduce noise, increase computational complexity, and degrade model performance, particularly in large-scale data processing. Effective feature selection not only improves processing efficiency and reduces storage requirements but also ensures the inclusion of relevant features, thereby enhancing both model accuracy and interpretability.
The proposed method reduces the total number of dataset features to 15 for both classification scenarios, ensuring optimal performance. The selected features for each scenario, along with their corresponding importance values, are presented in Figure 7 and Figure 8, providing insights into their contributions to model predictions. Among these features, PacketLengthMode is identified as the most influential in both scenarios. In contrast, PacketLengthCoefficientofVariation in the binary classification scenario and PacketTimeMode in the multiclass classification scenario exhibit the lowest feature importance, indicating their minimal impact on model performance. Furthermore, Figure 9 highlights the mutual features selected in both classification tasks, emphasizing the common attributes that contribute to consistent model behavior across scenarios.
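The ACO-based selection procedure itself is not reproduced in this section; the following is a minimal, self-contained sketch of a wrapper-style ACO feature selector on synthetic data, in which each ant samples a 15-feature subset biased by pheromone levels and the best subset found reinforces those levels. All hyperparameters (ant count, iterations, evaporation rate, classifier settings) are illustrative and not those of the proposed method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the dataset; the paper uses CIRA-CIC-DoHBrw-2020.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)
n_features, subset_size = X.shape[1], 15
pheromone = np.ones(n_features)          # pheromone level per feature
rng = np.random.default_rng(0)
best_subset, best_score = None, -np.inf

for iteration in range(10):              # colony iterations
    for ant in range(5):                 # ants per iteration
        # Each ant samples a feature subset, biased by pheromone levels.
        probs = pheromone / pheromone.sum()
        subset = rng.choice(n_features, size=subset_size, replace=False, p=probs)
        # Wrapper evaluation: cross-validated accuracy on the candidate subset.
        score = cross_val_score(RandomForestClassifier(n_estimators=25,
                                                       random_state=0),
                                X[:, subset], y, cv=3).mean()
        if score > best_score:
            best_subset, best_score = subset, score
    # Evaporate pheromone, then reinforce features in the best subset so far.
    pheromone *= 0.9
    pheromone[best_subset] += best_score

print(sorted(best_subset), round(best_score, 4))
```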
4.3. Classification Performance and Analysis
This section presents the evaluation results of the classification performances. The analysis compares all classifiers across both binary and multiclass scenarios using the previously mentioned metrics, with and without feature selection. The following tables present the performance results of the four classifiers, evaluated using the previously defined comparison metrics. Table 4 and Table 5 report the results for binary and multiclass classification, respectively, without feature selection. In contrast, Table 6 and Table 7 report the outcomes after applying ACO for feature selection in both binary and multiclass classification scenarios. The results indicate that the features selected by the proposed method enhance the classification performance across all evaluated algorithms.
To further assess the generalizability of the proposed models and ensure robustness beyond a single train-test split, 5-fold CV was applied in addition to the hold-out test set evaluation. This method evaluates model performance across multiple, diverse subsets of the data, providing a more comprehensive view of its stability. As expected, a slight decrease in performance metrics was observed during cross-validation; this modest drop is a positive indication that the models are not overfitted to a particular split and generalize well to various dataset samples.
Figure 10 and Figure 11 provide comparative charts of classification accuracy with all features versus the selected features for all classifiers in both classification tasks.
To demonstrate the feasibility of the proposed method, confusion matrices for all classifiers are presented for binary and multiclass classification in Figure 12a,b, respectively. These matrices provide insight into how accurately each classifier identifies TP, TN, FP, and FN instances using the features selected for it by the proposed method. The classes in binary classification are (DoH-benign, DoH-malicious), while those in multiclass classification are (DoH-benign, DoH-malicious, and Non-DoH).
For the binary scenario, XGBoost demonstrated superior discriminative capability, achieving 74,846 TN for DoH-benign and 74,969 TP for DoH-malicious, with minimal misclassifications (10 FP and 3 FN). RF exhibited comparable robustness, yielding 74,854 TN and 74,964 TP, though with a slightly higher FN count (8) relative to XGBoost. In contrast, KNN showed reduced precision, misclassifying 34 benign instances as malicious (FP) and 193 malicious instances as benign (FN), resulting in lower TP (74,779) and TN (74,822) counts. CNNs, despite their computational complexity, underperformed significantly, with 52 FP and 223 FN (74,751 TP and 74,802 TN), reflecting challenges in generalizing to minority-class detection. These results underscore the efficacy of ensemble methods (XGBoost, RF) for imbalanced DoH traffic classification while highlighting potential limitations of distance-based (KNN) and deep learning (CNN) models in scenarios requiring high specificity for security-critical DoH-malicious detection. The elevated FN rate of the CNNs, roughly 0.3% of malicious cases, further emphasizes the risks of relying on unoptimized architectures for cybersecurity applications, where false negatives pose critical operational threats.
In the multiclass scenario, XGBoost achieved robust discriminative performance, with high TP counts for all classes: 224,259 (DoH-benign), 225,620 (DoH-malicious), and 223,900 (Non-DoH). Its errors were minimal, primarily misclassifying 2023 DoH-benign instances as Non-DoH and 2316 Non-DoH samples as DoH-benign, while maintaining near-perfect precision for DoH-malicious (52 misclassified as benign, 21 as Non-DoH). RF was similarly strong in DoH-malicious detection (225,582 TP), with exceptionally low cross-class errors (e.g., only 3 Non-DoH samples misclassified as malicious) and strong Non-DoH recognition (224,228 TP). In contrast, KNN exhibited pronounced instability, particularly in distinguishing Non-DoH traffic: 4853 Non-DoH samples were misclassified as DoH-benign and 2593 as DoH-malicious, while its DoH-malicious TP count (225,097) lagged behind the ensemble methods. CNNs mirrored KNN’s limitations, with severe Non-DoH misclassifications (4868 as DoH-benign and 2633 as DoH-malicious) and elevated DoH-malicious errors (443 misclassified as Non-DoH), underscoring challenges in hierarchical feature learning for minority classes.
Notably, DoH-benign and Non-DoH confusion dominated the error patterns across models, suggesting overlapping feature representations in non-malicious traffic. However, DoH-malicious detection remained highly reliable for XGBoost and RF (FN < 0.1%), which is critical for security applications where false negatives carry severe risks. The stark underperformance of KNN and CNNs, particularly in Non-DoH specificity, highlights the limitations of distance-based and deep learning models in multiclass scenarios with imbalanced or semantically ambiguous categories. These results advocate for ensemble methods (XGBoost, RF) in operational environments requiring high precision across heterogeneous network traffic while cautioning against deploying less interpretable models (CNNs) without targeted architectural optimization.
To further validate the effectiveness of the proposed approach, approximately 1000 records of live network traffic were collected and structured to exactly match the feature set and format of the CIRA-CIC-DoHBrw-2020 dataset. After undergoing the same preprocessing steps, the data were utilized to train the classifiers using the previously optimized feature subsets. The classification results, in both binary and multiclass scenarios, demonstrate that all models maintained performance levels comparable to those achieved on the original dataset, with only slight variations observed. These findings confirm the robustness and generalizability of the selected features and classification models across different classification tasks. A detailed comparison of the results is presented in Table 8.
4.4. Computational Time and Complexity Reduction
In addition to improving classification performance, it is essential to evaluate the computational efficiency of the proposed feature selection method, particularly in terms of execution time and complexity. The execution time required by each classifier to process all features of the dataset was measured and compared with the time taken to classify the dataset using only the features selected by the proposed method, for both binary classification (Table 9) and multiclass classification (Table 10). This comparison provides a comprehensive assessment of the computational complexity involved in the decision-making process across different scenarios. By analyzing the classification times and calculating the reduction rate (as defined in Equation (11)) for both the full and reduced feature sets, the results demonstrate that the proposed method significantly reduces execution time and overall computational complexity for all classifiers, as illustrated in Figure 13 and Figure 14.
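Equation (11) is not reproduced here; a reduction rate consistent with this description presumably takes the form
\[
\text{Reduction Rate} = \frac{T_{\text{all}} - T_{\text{selected}}}{T_{\text{all}}} \times 100\%,
\]
where \(T_{\text{all}}\) and \(T_{\text{selected}}\) denote the classification times with the full and ACO-selected feature sets, respectively.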
The proposed method significantly contributed to removing the less important features, which in turn reduced the computational complexity of the classifiers used in binary and multiclass classification. Table 11 summarizes the effect of selecting the most important features on the computational complexity before and after feature selection for KNN, XGBoost, RF, and CNNs.
Deploying machine learning-based solutions within IDSs in real-world environments introduces critical practical challenges, particularly concerning scalability and latency. As traffic volumes grow, IDSs must maintain efficient processing without compromising real-time detection capabilities. In the specific context of DoH, these challenges become even more pronounced due to encryption and decryption overhead and the high variability of traffic patterns. Furthermore, preprocessing overhead significantly affects latency, as features often differ in format and require various preprocessing steps before classification. This added computational cost can delay real-time decision making and impact overall system responsiveness. Therefore, optimizing both the preprocessing pipeline and the learning model is essential to ensure scalable, low-latency deployment in large-scale DoH production environments.
4.5. State-of-the-Art Comparison
In this section, the proposed method is compared with other methods in which metaheuristic optimization algorithms, such as the GA, PSO, ABC, and LOEO, have been deployed to select effective features in the IDS field in general and in DoH detection in particular. Since studies that used the CIRA-CIC-DoHBrw-2020 dataset with metaheuristic optimization algorithms were scarce, studies that relied on other benchmark datasets in the IDS field, such as UNSW-NB15 [39] and ISCXIDS2012 [40], were included in the comparison. Studies that focused exclusively on one scenario (binary or multiclass) with respect to feature selection and classification accuracy are reported as not applicable (N/A) in Table 12. For example, ref. [20] employed the PSO algorithm and achieved an accuracy of 97.2% for an exclusively binary scenario, selecting all the statistical features in the CIRA-CIC-DoHBrw-2020 dataset. Similarly, ref. [23] focused exclusively on the multiclass scenario, achieving a high classification accuracy of 98.9% using 25 of the ISCXIDS2012 dataset’s features via ABC optimization, leaving the binary scenario unaddressed. Notably, including all or most of the dataset’s features increases computational complexity or degrades the convergence efficiency of the metaheuristic optimization algorithm. In contrast, studies such as [21,22], which select a minimal number of features (at most 12) from different benchmark datasets using optimization algorithms such as the GA and LOEO and focus exclusively on multiclass scenarios, obtain relatively lower classification accuracies of 94.5% and 97.6%, respectively. Notably, the proposed method in this study selected a moderate number of features (15 per scenario) from the CIRA-CIC-DoHBrw-2020 dataset via ACO, achieving a balance between dimensionality and relevance and reporting the highest accuracies compared with the other methods for both the binary and multiclass scenarios, reaching 99.99% and 99.55%, respectively. This calls attention to the possibility of improving feature selection to achieve superior classification performance in various scenarios while maintaining model simplicity.