Article

FSDC: Flow Samples and Dimensions Compression for Efficient Detection of DNS-over-HTTPS Tunnels

1 School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710071, China
2 Faculty of Computing and Information Sciences, University of Lay Adventists of Kigali, Kigali 6392, Rwanda
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(13), 2604; https://doi.org/10.3390/electronics13132604
Submission received: 25 May 2024 / Revised: 27 June 2024 / Accepted: 1 July 2024 / Published: 3 July 2024
(This article belongs to the Special Issue Advances in Data Science and Machine Learning)

Abstract

This paper proposes an innovative approach that capitalizes on the distinctive characteristics of command and control (C&C) beacons, namely, the time intervals and frequency between consecutive unique connections, to compress a network flow dataset. While previous studies on this problem used a single technique, we propose a multi-technique approach for the efficient detection of DoH tunnels. We use a baseline public dataset, CIRA-CIC-DoHBrw-2020, containing over a million network flow properties and statistical features of DoH tunnels, benign DoH, and normal browsing (HTTPS) traffic. Each sample is represented by 33 features and a timestamp. Our methodology combines star graph and bar plot visualizations with supervised and unsupervised learning techniques, and underscores the importance of C&C beacon characteristic features in compressing a dataset and reducing flow dimensionality while enabling efficient detection of DoH tunnels. Through compression, the original dataset size and dimensions are reduced by approximately 95% and 94%, respectively. For supervised learning, RF emerges as the top-performing algorithm, attaining precision and recall scores of 100% each, with training 678.5 times and testing 54.8 times faster. Among the anomaly detection models, OCSVM emerges as the most suitable choice, with a precision of 88.89 and a recall of 100. The star graph and bar plot models also show a clear difference between normal traffic and DoH tunnels. The reduction in flow sample size and dimensionality, while maintaining accuracy, holds promise for edge networks with constrained resources and aids security analysts in interpreting complex ML models to identify Indicators of Compromise (IoC).

1. Introduction

Data Science and machine learning applications for network intrusion detection systems (NIDSs) require considerable computation, especially at the edge, where computing and storage devices are constrained to low or medium capabilities. At the same time, efforts have been made to optimize machine learning (ML) algorithms so that they can run on less powerful end nodes, such as mobile handsets. In this case, the network infrastructure plays a role in distributed and federated learning, requiring solutions that reduce computation costs and increase storage capacity, especially in high-speed environments. Additionally, the growing adoption of DNS over HTTPS (DoH) has significantly enhanced user privacy by encrypting DNS queries, thus preventing interception and manipulation by third parties. However, this encryption also opens avenues for malicious activities, including covert communication channels and command and control (C&C) beaconing, which are challenging to detect due to their encrypted nature [1].
Previous studies have developed machine learning-based solutions aiming to achieve high prediction and computational performance [2,3,4]. Feature dimensionality has often been a key consideration, as simplifying models to include fewer features can enhance efficiency and interpretability. For example, ML models are evaluated using all available features as baseline models, then feature selection or feature ranking methods are used to select a subset of predictors that could be used to evaluate subsequent models [5]. These models are then compared to find the optimal feature subset and models. Previous studies were predominantly supervised, limiting their ability to evaluate selected feature subsets for the detection of unknown patterns.
To address these limitations, this paper proposes a novel multi-technique approach for detecting DoH tunnels. Our methodology combines star graph and bar plot visualizations with supervised and unsupervised learning techniques. This integrated approach aims to leverage the strengths of various techniques to evaluate the effectiveness of the newly created features based on C&C characteristics. This approach focuses on feature engineering for dataset compression (sample size and dimension reduction) to ensure that the ML model remains computationally efficient and interpretable.
We leverage a baseline public dataset, CIRA-CIC-DoHBrw-2020 [6]. Created in 2020, two years after the official adoption of the DoH protocol [7], this dataset has been extensively used by researchers to develop and enhance machine learning models aimed at detecting DoH tunnels. It contains over a million network flow properties and statistical features for benign DoH traffic, DoH tunnels generated by three different DNS tunnels through an HTTPS proxy, and normal HTTPS web traffic. Each flow example is represented by 33 features with a timestamp.
After extracting the time and frequency features, we generate a new compressed binary dataset that reduces the baseline sample size by an average of 95% and the flow dimensions by 94%. The star graph and bar plot analysis indicates a clear difference between normal and DoH tunnel traffic. All selected ML models showed many-fold faster computation, with some models achieving the desired prediction performance. Our proposed approach therefore positions itself as a promising solution for network security applications, especially in high-speed environments with constrained resources. By using only two features aligned with expert knowledge, it also helps security analysts identify Indicators of Compromise (IoC) more effectively, thereby enhancing the overall robustness of network defense mechanisms against sophisticated threats utilizing DoH tunnels.
The summary of our contributions is as follows:
  • Our proposed methodology creates labeled and unlabeled versions of a new compressed binary dataset derived from the CIRA-CIC-DoHBrw-2020 dataset.
  • We used star graphs and bar plots to analyze DoH tunnels.
  • We modeled DoH tunnels by evaluating four supervised cost-sensitive algorithms, RF, LR, SVM, and XGB, on both the CIRA-CIC-DoHBrw-2020 dataset and the new compressed dataset, and compared the prediction and computation outcomes.
  • We modeled DoH tunnels by evaluating four anomaly detection algorithms, OCSVM, iForest, LOF, and MAD, on the unlabeled compressed dataset, and conducted further analysis on the selected OCSVM algorithm.
The remainder of this paper is organized as follows. In Section 2, we explain the background concepts pertaining to this research, including the DNS-over-HTTPS (DoH) protocol and how command and control (C&C) beaconing leverages it to create covert communication channels (DoH tunnels). In that section, we hypothesize that the frequency and time interval between consecutive connections, which are characteristics of C&C beacons, could serve as features to detect DoH tunnels, and we review the literature on studies that used the CIRA-CIC-DoHBrw-2020 dataset to propose ML solutions that are efficient in both prediction and computation. In Section 3, we detail our research methodology, explaining the dataset and methods used, including star graphs, bar plots, and ML algorithms. In Section 4, we present the experimental results and analysis, and we conclude the paper in Section 5.

2. Background and Related Works

2.1. Background

Traditionally, DNS queries are sent in plaintext, making them susceptible to interception and manipulation by malicious actors. DNS-over-HTTPS (DoH) is a protocol that encrypts DNS queries using HTTPS to mitigate these risks by encapsulating DNS queries within HTTPS traffic, thereby protecting the integrity and confidentiality of the queries [7]. This ensures that DNS requests and responses are not easily visible to third parties, thus enhancing user privacy and preventing potential eavesdropping or tampering.
While DoH provides significant privacy benefits, it also introduces challenges, particularly in network security. Malicious actors can exploit DoH to create covert communication channels, commonly referred to as DoH tunnels [8]. These tunnels can be used to bypass network security measures, enabling covert communication between compromised systems and command and control (C&C) servers [9]. This communication often involves periodic beacon signals sent from the infected system to the C&C server, which can issue commands or exfiltrate data. The beacon signal exhibits two important characteristics, i.e., a consistent frequency and time interval between consecutive connections [10].
In this study, we hypothesize that these beacon features can differentiate normal traffic from DoH tunnels [11], on the basis that compromised hosts consistently send more traffic than normal hosts. By measuring these features, we posit that the connection frequency and the number of time intervals for compromised hosts are likely to be much higher and more consistent than for normal traffic. Since these are transformational features obtained by aggregating unique connections, we assume that creating a dataset of only these two features could compress the baseline dataset, improving ML model computation efficiency while retaining the desired precision and recall. In the next subsection, we present other research works that pursued the same idea with different methodologies, and we end the section by identifying the gap that previous studies have not yet addressed.

2.2. Related Work

Developing ML-based intrusion detection systems requires a representative dataset that in some way simulates a real-world scenario. Acquiring such a dataset requires considerable network-attack expertise, a robust real network, or a powerful computer that can host the many VM instances used as compromised hosts. Since creating our own dataset was impractical, we utilized a widely cited public dataset that meets our requirements.
Montazerishatoori et al. created the CIRA-CIC-DoHBrw-2020 dataset [2] two years after the official adoption of the DoH protocol [7]. This dataset has been extensively used by researchers to develop and enhance machine learning models aimed at detecting DoH tunnels. It contains over a million network flow properties and statistical features for benign DoH traffic, DoH tunnels generated by three different DNS tunneling tools (Iodine [12], DNS2tcp [13], and DNSCat2 [14]) through an HTTPS proxy, and normal HTTPS web traffic. Each flow example is represented by 33 features and a timestamp. At the time of writing, it remains the most representative, albeit highly imbalanced, public dataset for detecting DoH tunnels. The paper accompanying the dataset reports experiments and results based on a two-layered architecture. In the first layer, the dataset consists of normal HTTPS (897,493 samples) and DoH traffic flows (269,643 samples); here, the DoH samples comprise both benign DoH and DoH tunnels, for a total of 269,643 samples. In the second layer, the DoH dataset is split into DoH tunnels and benign DoH with a ratio of 9:1. In each layer, a number of ML algorithms are evaluated, and precision, recall, and F1 scores are recorded, along with training and testing times. Since then, many studies following the same two-layered architecture but using different techniques have emerged. In this paper, we restrict our review to those that report computation performance and use the same data sample size as this study.
For example, Behnke et al. [15] used Chi-Square and Pearson Correlation Coefficient to rank features and evaluate different supervised classifiers, including RF, XGB, and LGBM. They overlooked the importance of IP-related features after criticizing Banadaki et al. [16] who had previously used them. Behnke et al. [15] reported LGBM as the best model, outperforming others in accuracy and training time. In the study conducted by Jafar et al. [17] on the same dataset, they treated IP-related features as categorical variables and used the one-hot encoding technique to convert them into integer variables. They also sampled the dataset using the SMOTE technique.
They evaluated various supervised ML algorithms, such as RF, DT, QDA, GNB, SGD, KNN, LR, and SVM, and measured both accuracy and computation time (training and testing). Experimental results showed that no single model managed both high accuracy and low computation time. For example, RF, SVM, DT, and KNN achieved an accuracy of over 99.99%; however, the computation times of SVM and KNN ran into minutes or even hours. GNB was the fastest, with a training time of 0.71 s and a testing time of 0.23 s, but had the lowest accuracy at 80%.
In the study conducted by D. Vekshin et al. [18], the aim was to evaluate supervised ML classifiers, including 5-NN, C4.5, RF, NB, and AdaBoosted DT, to detect DoH traffic and identify the clients generating such traffic using their newly created dataset. Unlike CIRA-CIC-DoHBrw-2020, which contains DoH tunnels, D. Vekshin's dataset and some others, such as that of K. Jeřábek et al. [19], contain only benign DoH traffic. Additionally, this study does not cover computation efficiency, which is the main focus of our study.
Another study, conducted by S. Singh et al. [20], discusses the use of ML to detect and classify DoH traffic. The study evaluates various ML classifiers, such as Naive Bayes, Logistic Regression, Random Forest, K-Nearest Neighbor, and Gradient Boosting, to determine their effectiveness in identifying malicious DoH traffic using the CIRA-CIC-DoHBrw-2020 dataset. The results demonstrate that ensemble learning-based classifiers, particularly Random Forest and Gradient Boosting, achieve outstanding accuracy rates (100%) and are effective in recognizing DoH traffic. Although S. Singh et al. [20] included DoH tunnel traffic, unlike D. Vekshin et al. [18] and K. Jeřábek et al. [19], they did not include regular web traffic. The lack of regular traffic in their dataset limits a comprehensive evaluation of their model's performance under realistic network conditions, potentially affecting its applicability in practical scenarios where distinguishing between DoH and regular web traffic is crucial.
In our recent study [5], we challenged all the previous results for several reasons: (1) poor or unclear methodology and results reporting, including neglecting important features, such as IP-related ones, for both DoH and DoH tunnel detection [16]; (2) excessive, unnecessary workflow tasks, such as sampling [17]; (3) a failure to explore the depths of some algorithms, such as the parallel-computation hyperparameters of XGB that can speed up training and testing; and (4) the use of many features, which makes the outcomes hard to explain. We therefore proposed the XTS framework, which used an optimized XGB made explainable by the TreeSHAP explainer (a stable and consistent technique) [5]. In the model selection, we chose algorithms that are cost-sensitive by design, such as RF, SVM, LR, and XGB; XGB showed both prediction and computation superiority over the others. We then further evaluated it on different feature subsets generated after feature ranking by the TreeSHAP explainer. Experimental results showed that this framework outperformed previous studies in prediction power, computation, and explainability, with only three features (source IP, destination IP, and packet length mode) able to predict DoH traffic. The training and testing times were 1.8 and 0.07 s, respectively.
Although our previous study improved on the others in computation speed and in the use of fewer, interpretable features, a few challenges remain: (1) Most previous studies follow the architecture proposed by Montazerishatoori et al. [2]. The problem with this architecture is that it separates benign DoH from regular HTTPS traffic, yet both are legitimate traffic that does not need to be blocked. (2) They all reduce computation time by selecting feature subsets from the entire feature set using either complex or classical feature ranking or selection techniques. This not only consumes time during the feature selection process but also produces different feature sets, some of which may be unintelligible to a security expert [15]. (3) They all rely on supervised learning, which lacks the ability to detect unseen threats. Therefore, this paper proposes a different approach that uses expert knowledge for feature engineering and treats the problem as both supervised classification and anomaly detection to efficiently detect DoH tunnels. By efficiency, we mean an approach that can detect DoH tunnels with or without a labeled dataset, with a simple ML workflow, high prediction performance, low computation cost, and few features aligned with expert knowledge.

3. Materials and Methods

This section explains in detail how we compressed the CIRA-CIC-DoHBrw-2020 dataset and used different techniques to analyze and detect DoH tunnels. As shown in Figure 1, we call the CIRA-CIC-DoHBrw-2020 dataset the original or baseline dataset. The goal of this methodology is to utilize expert knowledge of the behavior of network malware—C&C beaconing—to extract beacon signal characteristics (time intervals and frequency between two consecutive connections) and use them as predictors for star graphs, bar plots, and ML algorithms to detect DoH tunnels. The reason for the choice of these two features was explained previously in Section 2.1.
Figure 2 shows a visual representation of the proposed scheme, i.e., data processing and analysis. First, we combine the regular web traffic dataset with the benign DoH dataset to create what we call the normal dataset. The reason for this combination is that both datasets contain legitimate traffic flows that are not to be filtered. For supervised DoH detection with high feature dimensionality (the original dataset), a number of data processing tasks are performed, such as converting IP addresses to integers, imputation, feature scaling, and label encoding, as explained later in this section.
Second, we compress the original dataset with 33 features and over a million samples into a compressed binary dataset with around 50 thousand samples by filtering flows based on the source and destination IP addresses to track unique flows (connections). This process allows us to extract C&C characteristic features (time and frequency) by calculating and recording each from unique consecutive connections. Third, we perform data processing on the newly created dataset and train, then evaluate both supervised and unsupervised algorithms and compare the results, as shown in Figure 2.

3.1. Datasets

3.1.1. CIRA-CIC-DoHBrw-2020 Dataset

This dataset initially contains four subsets. The first two are non-DoH, containing 897,493 regular HTTPS traffic flow samples captured while visiting 10 k Alexa websites, and DoH, containing 19,807 benign DoH samples. To ensure there were no outside influences on the data, all of the mentioned operations were performed on VMs with no other significant HTTPS traffic, and both non-DoH and DoH traffic were recorded simultaneously. Because the IPs of the DoH resolvers used in this experiment (shown in Table 1) were known and were not used in any other connections, flows to these resolvers were labeled DoH and the rest non-DoH [21].
A third subset, so-called malicious DoH, contains 249,836 DoH tunnel samples collected after simulating DoH tunnel attacks for data exfiltration using three DNS tunneling tools (Iodine, DNSCat2, and DNS2tcp), as indicated in Figure 3. The network topology used to capture these datasets can be seen in [2,21] and is revised in Figure 3. We revised the topology made by Montazerishatoori et al. [2] to show the part abstracted in their original design, i.e., the capture point.
The tool DoHLyzer [22], which the dataset authors used to extract flow properties and statistical features, can be hard to use for non-programmers; in Figure 3 we therefore show a generic methodology [23] that can be used for the same purpose. We downloaded a zipped folder, denoted as $X \in \mathbb{R}^{n \times d}$, containing four CSV files, as shown in Figure 3. Each record in a subset has $d = 33$ flow features, as shown in Table 2 and explained in more detail by D. Stalder [24] in his thesis. For the supervised algorithms, we first used all 33 features and later used the 2 features generated after extracting the C&C beacon characteristics. This allowed us to compare the speed and prediction performance of the same algorithm and to analyze the effect of feature dimensionality on model computation and that of the C&C beacon characteristic features on DoH tunnel detection.

3.1.2. Data Processing

Converting IP addresses: to use IP address-based features in supervised algorithms, we convert each IP address into an integer and scale it to reduce the range differences between features that can bias the models.
Imputation: the original datasets contain 16,056 missing values each. Therefore, for training supervised algorithms on the 33 features, imputation was performed on the relevant variables by filling in the missing values with the column mean.
Label encoding: a vector $y$ is encoded to represent the target variable such that $y \in \{0, 1\}$, where 0 means non-DoH and 1 means DoH.
Feature scaling: unscaled features have different ranges of values and may bias the computation of many machine learning algorithms [25,26]. There are two common ways to bring all attributes to the same scale: min-max scaling and standardization. Min-max scaling (also known as normalization) scales a feature value $x_i$ by subtracting $\min(x)$ and dividing by the difference between $\max(x)$ and $\min(x)$, mapping each feature value into the range between 0 and 1 [25,26]. However, it does not handle outliers.
Feature scaling using the standardization technique instead subtracts the mean $\mu$ from each feature value $x$ and divides by the standard deviation $\sigma$, producing a standard score $z = (x - \mu)/\sigma$. Scaled values then have the properties of the standard normal distribution, with mean $\mu = 0$ and a standard deviation of approximately 1 [25,26]. Unlike min-max scaling, standardization is much less affected by outliers. Therefore, we applied this technique to all integer features to reduce their magnitudes, thus helping to prevent model overfitting and to speed up convergence. Initially, the source and destination IP columns in the original dataset are string objects; before scaling, they are converted into numeric whole numbers, since many ML models only handle numeric data and these features matter for the supervised learning algorithms.
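To make these processing steps concrete, the following is a minimal Python sketch; the file name and the column names ("SourceIP", "DestinationIP") are assumptions for illustration, as the actual CSV headers may differ.

```python
# A minimal preprocessing sketch (assumed file and column names).
import ipaddress
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("doh_flows.csv")  # hypothetical file name

# Convert dotted-quad IP strings into integers so ML models can consume them.
for col in ["SourceIP", "DestinationIP"]:
    df[col] = df[col].map(lambda ip: int(ipaddress.ip_address(ip)))

# Impute missing values with the column mean (numeric columns only).
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Standardize: z = (x - mu) / sigma, less sensitive to outliers than min-max.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```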

3.1.3. Compressed Dataset

First, we merge the two subsets, non-DoH and benign DoH, because they represent legitimate traffic, as shown in Figure 3. A flow record $x_i$ in each dataset $X_j$ is represented by $d = 33$ features and a timestamp indicating when the flow was recorded. To trace the C&C beaconing process, we only need the features that can uniquely identify a connection, i.e., the pair of source and destination IP addresses. Figure 4a shows a zoomed-in view of the process of creating the subset records for a specific local host HA and server SA.
In Figure 4, the nodes $H_i$ represent local hosts' source IP addresses, and $S_i$ the destination public servers' IP addresses. We refer to a connection as the packet flows between a pair of hosts (i.e., flows with the same source IP, destination IP, source port, and destination port [27]). We include the timestamp to record a connection at a particular time, which is later used to calculate the time interval between consecutive unique connections. Each flow record $x_i \in X$ is uniquely characterized by a flow key (a tuple of source IP, destination IP, source port, and destination port) [27].
According to Algorithm 1, the flows $x_1, x_2, x_3, \ldots, x_n \mid x_i \in X$ are arranged in groups $G = \{g_1, g_2, g_3, \ldots, g_n\}$, one per unique connection pair (i.e., flows with the same source and destination IPs are grouped together). This process results in a compressed subset $R \subset X \mid R \in \mathbb{R}^{m \times p}$. Within each group $g_i \in G$ in subset $R$, also represented graphically in Figure 4a, where a group is a pair of nodes connected by an edge, the flows are sorted by timestamp (Figure 4b).
Algorithm 1: Computing Time Intervals (TI) and Connection Frequency (CF) between consecutive unique connections.
Input:
        $X$: a set of flow records
        $T_i$: the time at which flow record $i$ occurred
        $S_i$: the source IP address of flow record $i$
        $D_i$: the destination IP address of flow record $i$
Output:
        TI: time intervals between consecutive connections
        CF: connection frequency for each unique source and destination IP pair
1  Procedure($X$, $T_i$, $S_i$, $D_i$)
2        $G \leftarrow$ Aggregate($x_1, x_2, x_3, \ldots, x_n$) $\in X$ by unique $(S_i, D_i)$ pair into groups of unique connections
3        for $g_i \in G$ do:
4              $X' \leftarrow$ sort the flow records in $g_i$ by their occurrence time $T_i$ in ascending order
5              TI $\leftarrow \emptyset$   // initialize an empty list to store time intervals
6              for $x'_i \in X'$ do:   // for each sorted flow record
7                    $\Delta t = T_i - T_{i-1}$   // compute the time interval between consecutive connections
8                    TI $\leftarrow$ TI $\cup \{\Delta t\}$   // add $\Delta t$ to the list of time intervals
9              end for
10            CF $\leftarrow$ Count($x'_1, \ldots, x'_n \in X'$)   // count the number of flow records within the sorted group
11      end for
12 end procedure
For two consecutive connections within each group $g_i \in G$, as shown in Figure 4b, the time interval $\Delta t$ is calculated using Equation (1). For each group $g_i \in G$, the number of connections is counted to determine the connection frequency feature value, and all $\Delta t$ values within a group are summed to form the overall time interval feature value. Together, these two computed features make a new vector $s_i$ of a newly created compressed dataset $S \in \mathbb{Z}^{m \times l}$, where $m$ represents the samples and $l$ the two features.
$$\Delta t = t_i - t_{i-1} \quad (1)$$
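A possible pandas implementation of Algorithm 1 is sketched below; the column names ("SourceIP", "DestinationIP", "Timestamp") are assumptions, as the actual headers in the CSV files may differ.

```python
# A sketch of Algorithm 1: group flows by unique (source, destination) pair,
# then derive the time-interval and connection-frequency features per group.
import pandas as pd

def compress_flows(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("Timestamp")
    records = []
    for (src, dst), g in df.groupby(["SourceIP", "DestinationIP"]):
        deltas = g["Timestamp"].diff().dropna()   # t_i - t_{i-1} per consecutive flow
        records.append({
            "SourceIP": src,
            "DestinationIP": dst,
            "TimeInterval": deltas.sum(),         # overall time-interval feature
            "ConnectionFrequency": len(g),        # connection-frequency feature
        })
    return pd.DataFrame(records)
```

Each row of the returned frame corresponds to one unique connection, which is why the compressed dataset is orders of magnitude smaller than the original flow table.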

3.2. Graph Modeling

In graph theory, a graph, denoted as $G$, is composed of a set of vertices $V$ (also known as nodes) connected by a directed or undirected set of edges $E$. While many graph types exist, a star graph is a tree with one internal node and $k$ leaves (the nodes adjacent to the central/internal node). In our case, we use this model to analyze the connection frequency between local hosts and public DoH servers.
Let us consider a directed graph $G = (V, E, w)$, where $V$ is a set of nodes, in our case the IP addresses of local hosts and DoH servers, and $E$ is the set of edges, representing the connection frequency between each local host and a specified public DoH server. The weight function $w: E \rightarrow \mathbb{Z}^+$ assigns a non-negative value to each edge $e \in E$. In order to model tunnel and normal traffic separately, we denote by $V_n = \{h_1, h_2, \ldots, h_n\}$ the set of hosts with normal traffic and by $V_t = \{h_1, h_2, \ldots, h_t\}$ the set of hosts with DoH tunnel traffic. Let us also denote by $V_d = \{d_1, d_2, \ldots, d_n\}$ the set of DoH servers. We can define two subgraphs, $G_n = (V_n, V_d)$ and $G_t = (V_t, V_d)$, representing the star networks of connections between the DoH servers (central nodes) and the connected hosts (adjacent nodes). We can then calculate the weighted sum $e_i$ of connections between each host and a specific server, as shown in Equation (2).
$$e_i = w(h_i, d_j) \quad (2)$$
In this research, we are more interested in the frequency of connections, not in which servers a local host connects to. Therefore, we model the connections from a specific DoH server as an independent subtree. The goal here is to observe the variability in weights between $V_n$ and $V_t$. We assume that the connection frequency variability of a local host running a DoH tunnel application, $h_i \in V_t$, toward any public DoH server $d_i \in V_d$ is much higher than that of a host with normal traffic, $h_i \in V_n$, as shown in Equation (3).
$$\forall h \in V_n: \quad \mathrm{var}(h_i \in V_n, d_i \in V_d) \ll \mathrm{var}(h_i \in V_t, d_i \in V_d) \quad (3)$$
Despite its simple and elegant representation of the connections, this graph model may become cluttered as the network grows and the number of internal hosts increases. Moreover, the model is univariate, meaning that it can only detect outliers, not novelties. The problem with this is that an outlier can be normal traffic with unusual behavior, such as a legitimate chat application or another application that sends frequent update requests. Therefore, we use bar plots to accommodate more weight values and to model both the time and frequency variables, as shown in Section 4.2 (bar plot analysis).
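A minimal sketch of how such a star graph can be drawn with networkx follows; the host IPs and edge weights are illustrative, and the graphviz layout assumes Graphviz with pygraphviz is installed.

```python
# A sketch of the star-graph model: one DoH server as the central node,
# edge weights carrying the per-host connection frequency.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
server = "1.1.1.1"  # central node (DoH server from Table 1)
hosts = {"192.168.20.111": 3482, "192.168.20.112": 57}  # host -> frequency (toy weights)
for host, freq in hosts.items():
    G.add_edge(server, host, weight=freq)

pos = nx.nx_agraph.graphviz_layout(G, prog="twopi")  # radial star layout
nx.draw(G, pos, with_labels=True, node_color="lightblue")
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "weight"))
plt.show()
```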

3.3. Machine Learning Modeling

This section demonstrates the advantages of compressing a flow dataset for the efficient detection of DoH tunnels using machine learning models. We use both supervised machine learning algorithms and anomaly detection algorithms. There are three approaches to optimizing the machine learning process: (1) redesigning the algorithm or tuning hyperparameters, the most complex but efficient approach; (2) a data processing approach; and (3) a hybrid approach. Data processing usually involves dimensionality reduction, feature creation, and so on. The goal is an efficient model that generalizes well with few configurations and few relevant features, allowing easy interpretability with less computation. In this case, we use the hybrid approach, with few configurations and more data processing. For supervised learning, we choose cost-sensitive algorithms to avoid data sampling techniques. For anomaly detection, we use classical algorithms to avert the complexity of deep learning algorithms, in keeping with our lean-and-smart principle.

3.3.1. Supervised Models

Problems involving rare events, such as anomaly detection, are naturally highly class imbalanced, with the minority class significantly smaller than the majority class. For instance, let the number of positive instances in a dataset be denoted as $P$ and the number of negative instances as $N$. If $P \ll N$, the dataset is said to be highly imbalanced. In machine learning, this imbalance can bias the model toward the majority class, as the algorithm infers more patterns from the majority class than from the minority. As a result, it is likely to predict minority samples as majority ones, causing more false negatives in the results. This can be problematic in cases such as fraud detection, malware detection, or medical diagnosis.
To address the problem of class imbalance, various sampling techniques have been developed. For instance, the synthetic minority over-sampling technique (SMOTE) [28] has been a common approach in the literature. This technique, however, is a separate task outside the ML algorithm and requires extra data processing. Another simplified yet effective technique is to use cost-sensitive algorithms, which assign weights to training samples according to class proportions, allowing the algorithm to account for the cost of misclassification. Let $i$ be the predicted class and $j$ the actual class of an instance $x$, and let $C(i, j)$ be a function that computes the cost of predicting class $i$ when the actual class is $j$. Table 3 shows the matrix of how the algorithm assigns the cost for binary classification. Because of the highly imbalanced dataset, the model tends to classify every sample as the majority class; in the matrix shown in Table 3, a heavy cost ($n/p$) is therefore assigned when the model misclassifies the majority class as the minority (FP). The expected cost of classifying $x$ into a class $i$ different from its actual class $j$ can be expressed, as in Equation (4), as the sum over classes of the true class probability $P(j \mid x)$ times the cost of misclassification. Models were selected based on having a class-weight hyperparameter in the scikit-learn implementation and on their popularity in the practitioner community.
$$C(i \mid x) = \sum_j P(j \mid x) \cdot C(i, j) \quad (4)$$
Cost-Sensitive Logistic Regression (CS-LR)
Logistic regression (LR) [29] is a fundamental linear model used for binary classification tasks [30,31]. It models the relationship between the independent variables and the probability of a binary outcome using the logistic function, also known as the sigmoid function, shown in Equation (5), which squeezes the output of a linear equation between 0 and 1.
$$P(z) = \frac{1}{1 + e^{-z}} \quad (5)$$
Logistic regression can be adapted to improve its performance in imbalanced classification scenarios. The model’s coefficients are determined through an optimization algorithm aimed at minimizing the negative log likelihood (loss) on the training data, as shown in Equation (6):
$$\min \; \sum_{i=1}^{n} -\left[\log(\hat{y}_i) \times y_i + \log(1 - \hat{y}_i) \times (1 - y_i)\right] \quad (6)$$
This process entails iteratively using the model to generate predictions and then adjusting the coefficients in a manner that minimizes the model's loss. The computation of the loss for a specific set of coefficients can be adjusted to accommodate class imbalance (see Equation (7)). By default, errors for each class may be assigned an equal weight, typically set at 1.0, but these weights can be fine-tuned according to the significance of each class; see, for example, Table 3.
$$\min \; \sum_{i=1}^{n} -\left[w_0 \times \log(\hat{y}_i) \times y_i + w_1 \times \log(1 - \hat{y}_i) \times (1 - y_i)\right] \quad (7)$$
The weighting is applied to the loss function such that smaller weight values lead to lower error values, thereby inducing a lesser update to the model coefficients. Conversely, larger weight values lead to higher error calculations, resulting in more substantial updates to the model coefficients.
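As a minimal numeric illustration of the weighted loss in Equation (7), consider the sketch below; this is only a demonstration of the formula, and in practice scikit-learn applies the same effect internally when class_weight is passed to LogisticRegression.

```python
# A toy demonstration of the class-weighted negative log likelihood (Equation (7)).
import numpy as np

def weighted_log_loss(y, y_hat, w0=1.0, w1=1.0):
    """w0 weights the positive-class term, w1 the negative-class term."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return -np.sum(w0 * y * np.log(y_hat) + w1 * (1 - y) * np.log(1 - y_hat))

y_true = [1, 0, 0, 0, 0]            # imbalanced toy labels (minority class = 1)
y_prob = [0.4, 0.1, 0.2, 0.1, 0.3]  # predicted probabilities of class 1
print(weighted_log_loss(y_true, y_prob))           # unweighted loss
print(weighted_log_loss(y_true, y_prob, w0=5.0))   # minority errors cost 5x more
```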
Cost-Sensitive Support Vector Machine (CS-SVM)
Support Vector Machine (SVM) [32] is a powerful supervised machine learning algorithm capable of handling both linear and nonlinear classification and regression tasks. It aims to find the hyperplane that best separates the classes in the feature space while maximizing the margin between them. In general terms, the margin represents the distance between the classification boundary and the nearest point in the training set [33]. By default, this margin tends to favor the majority class in imbalanced datasets. However, it can be adjusted to consider the significance of each class, leading to significant improvements in algorithm performance, particularly on datasets with highly skewed class distributions.
The kernel trick allows SVMs to transform the data into a higher-dimensional space where linear hyperplanes can separate the classes, even when the boundary in the original feature space is nonlinear. The softening of the margin, controlled by the regularization hyperparameter referred to as the soft margin, lambda, or C, is crucial for accommodating misclassified points. With C determining the tolerance for margin violations, a higher value allows for a softer margin, while a lower value results in a harder margin, indicating less tolerance for misclassifications [34].
To address this limitation for imbalance handling, extensions to SVMs have been devised, with one common approach being the adjustment of the C value in proportion to the importance of each class. This instance-level weighted modification assigns a penalty term (C) to each example in the training dataset based on the class distribution [35]. An example’s C-value can be determined by weighing the global C-value, with the weight being defined in proportion to the class distribution as shown in Equation (8):
$$C_i = \mathrm{weight}(w_i) \times C \quad (8)$$
By assigning larger weights to minority class examples and smaller weights to majority class examples, the modified SVM algorithm aims to balance the trade-off between classification error and margin maximization. This strategy encourages the margin to contain the majority class more rigidly while allowing flexibility for the minority class, effectively mitigating the skew in the separating hyperplane and reducing misclassifications. This modification is commonly referred to as Weighted SVM or Class-Weighted SVM, among other terms, signifying its importance in enhancing SVM performance on imbalanced datasets. Despite its effectiveness in high-dimensional spaces and robustness against overfitting, the SVM can be computationally demanding, especially with large datasets [17].
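A sketch of how the per-class penalty of Equation (8) can be configured in scikit-learn is shown below; the weight of 10.0 is illustrative, not a value used in our experiments.

```python
# Class-weighted SVM: scikit-learn scales C by the per-class weight internally.
from sklearn.svm import SVC

# Explicit weights: errors on the minority class (label 1) cost ten times more.
svm_weighted = SVC(C=1.0, kernel="rbf", class_weight={0: 1.0, 1: 10.0})

# Or derive the weights automatically from the class frequencies.
svm_balanced = SVC(C=1.0, kernel="rbf", class_weight="balanced")
```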
Cost-Sensitive Random Forest (CS-RF)
Random Forest (RF) [36] is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. RF is known for its robustness, scalability, and ability to handle high-dimensional data with complex interactions. It reduces overfitting by aggregating the predictions of multiple trees, making it less sensitive to noise in the data. RF is suitable for a wide range of tasks and performs well in heterogeneous datasets, but it may not be as interpretable as simpler models such as LR.
A simple method to adapt a decision tree for imbalanced classification involves adjusting the weight assigned to each class when computing the impurity score of a selected split point [37]. Impurity, which gauges the mixture of samples within groups for a given split in the training dataset, is typically assessed using metrics such as Gini impurity or entropy. By biasing the calculation to favor a mixture that benefits the minority class, this adjustment allows some false positives for the majority class to be tolerated. Referred to as Weighted Random Forest, this modification enhances the random forest algorithm's performance on imbalanced datasets [38]. Alternatively, another strategy to improve the suitability of random forests for learning from highly imbalanced data adopts the concept of cost-sensitive learning. Given that the random forest classifier often exhibits a bias toward the majority class, a heavier penalty is imposed for misclassifying the minority class. This can be accomplished by specifying the class_weight argument of the RandomForestClassifier class, which accepts a dictionary mapping each class value (e.g., 0 and 1) to its corresponding weight [39]. Alternatively, the argument value 'balanced' can be supplied to automatically apply inverse weighting based on the training dataset, thereby prioritizing the minority class.
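Both options are sketched below; the explicit weight of 25.0 is purely illustrative.

```python
# Cost-sensitive random forest via the class_weight argument.
from sklearn.ensemble import RandomForestClassifier

# Explicit mapping from class value to weight ...
rf_weighted = RandomForestClassifier(class_weight={0: 1.0, 1: 25.0})

# ... or "balanced" to apply inverse class-frequency weighting automatically.
rf_balanced = RandomForestClassifier(class_weight="balanced")
```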
Cost-Sensitive eXtreme Gradient Boosting (CS-XGB)
XGB [40] is a gradient-boosting algorithm known for its efficiency and performance in supervised learning tasks. XGB sequentially builds a series of weak learners (usually decision trees) and combines their predictions to improve accuracy. It is highly customizable, allowing fine-tuning of parameters to optimize performance [40]. The final model output for a sample is the summation of the results of all the learners' iterative training. Let $P = \{1, 2, 3, \ldots, M\}$ denote the set of weak learners, where $M$ is the total number of trees in the model. If $y_i$ represents the true label of a sample $x_i$ in a dataset, then the predicted value $f_M(x_i)$ of the XGBoost model can be expressed as in Equation (9):
$$f_M(x_i) = \sum_{m=1}^{M} f_m(x_i), \quad f_m \in F \quad (9)$$
where $F$ represents the set of all classification trees and $f_m(x_i)$ an individual base classifier's prediction. The raw output, or score, from each tree in XGBoost is referred to as the "raw prediction" and is denoted by $z$ [40]. The predicted probability is then obtained by passing the raw prediction through the sigmoid function shown in Equation (5). The objective function $\mathrm{obj}(\theta)$ to be minimized is given in Equation (10):
$$\mathrm{obj}(\theta) = L(\theta) + \Omega(\theta) \quad (10)$$
where $L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$ is the loss function and $\Omega(\theta) = \sum_{m=1}^{M} \Omega(f_m)$ the regularization term that penalizes the complexity of the model. Since training is an iterative process, the predicted value $\hat{y}_i^{(t)}$ of the $i$-th instance at iteration $t$ is expressed in Equation (11):
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \quad (11)$$
Since the problem is binary classification, we let the model use the default loss function, the cross-entropy shown in Equation (12):
$$L(y_i, \hat{y}_i) = -\left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right] \quad (12)$$
where $y$ is the true label (either 0 or 1) and $\hat{y}$ is the predicted probability of the positive class (i.e., the output of the sigmoid function $P(z)$ in Equation (5)). The loss is minimized when the predicted probabilities $\hat{y}$ are as close as possible to the true labels $y$.
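A sketch of the cost-sensitive XGBoost configuration follows; the label counts are illustrative, and the weight follows the negative-to-positive ratio of Equation (20) in Section 4.

```python
# Cost-sensitive XGBoost via scale_pos_weight (= N_negative / N_positive).
import numpy as np
from xgboost import XGBClassifier

y = np.array([0] * 950 + [1] * 50)        # illustrative imbalanced labels
ratio = (y == 0).sum() / (y == 1).sum()   # negative/positive ratio, Equation (20)
model = XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio)
```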

3.3.2. Anomaly Detection Models

As demonstrated in the preceding section, machine learning algorithms are adept at categorizing data patterns into various classes. A prevalent strategy for training these classifiers involves supplying them with examples from all classes; essentially, the classifier is furnished with both benign and malicious data to facilitate learning the discrimination between samples from each class [41]. However, while acquiring abundant benign data from networks and systems may be straightforward, obtaining malicious samples from cyberattacks can prove exceedingly challenging and costly. Consequently, an alternative approach for training a binary classifier is to train it exclusively with data from just one of the classes. This way, the classifier learns to discern data samples that fall within the distribution of the training samples of that particular class. In simpler terms, these classifiers learn to recognize samples that match the pattern of the class they were trained on and those that do not. They are known as one-class novelty detection classifiers, and they can be trained solely with benign data from networks and systems.
There are many one-class novelty detection algorithms. In general, we can divide them into two groups: traditional machine learning and deep learning techniques. The former includes several techniques that rely on traditional machine learning algorithms, such as one-class support vector machines (OCSVM) [42] and isolation forests (iForest) [43], which are largely adopted in anomaly detection tasks [44,45,46]. The latter rely on deep learning algorithms and frameworks, such as autoencoders [47] and generative adversarial networks (GANs) [48]. In this paper, we focus on the traditional techniques for their simplicity and effectiveness.
One-Class Support Vector Machine (OCSVM)
The one-class Support Vector Machine (OCSVM) algorithm builds an optimal hyperplane to differentiate data samples that resemble the ones seen during training from those that do not, essentially separating data samples from two classes. This hyperplane acts as a decision boundary function, denoted as f, which assigns a value of +1 to data samples located on one side of the hyperplane and −1 to those on the other side. To minimize errors, such as false positives and false negatives, OCSVM determines this decision boundary by solving an optimization problem aimed at maximizing the separation margins between each class [46].
Isolation Forest
To construct an isolation tree (iTree) from a dataset $X$ comprising $n$ instances $x_1, x_2, \ldots, x_n$ drawn from a $d$-dimensional distribution, the process recursively partitions $X$. This partitioning involves the random selection of an attribute $q$ and a split value $p$ until one of the following conditions is met: (i) the tree attains a predefined height limit; (ii) the size of $X$ reduces to 1, i.e., $|X| = 1$; or (iii) all instances in $X$ possess identical attribute values [46,49].
Since anomalies are assumed to make up a small percentage of the data and to have very different attribute values than normal data, they tend to be isolated with fewer partitions than normal points. In other words, since the benign samples are the majority, we expect them to require many split operations to be isolated from each other, whereas the minority malicious samples will be separated with fewer split operations [46,49]. The construction of isolation trees (iTrees) follows a structure similar to binary search trees (BSTs) [43], allowing an estimation of the average path length $h(x)$ for terminations at external nodes; this estimation parallels the analysis of unsuccessful searches in BSTs. For a dataset comprising $n$ instances, the average path length of an unsuccessful search in a BST is approximated by Equation (13):
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \quad (13)$$
where $H(i) \approx \ln(i) + 0.5772156649$ (the Euler–Mascheroni constant). Leveraging the comparable structure of binary search trees and isolation trees, the value $c(n)$ represents the average depth of an isolation tree constructed from $n$ training instances. When a leaf node contains $M > 1$ training instances, $c(M)$ is added to the measured path length from the root to the leaf node to derive the effective path length $h(x_i)$ for a specific instance $x_i$. An ensemble of isolation trees, referred to as an isolation forest, can be trained, and the output averaged to mitigate model variance. Subsequently, for an isolation forest model, the outlier score for an instance $x_i$ is determined by Equation (14):
$$s(x_i, n) = 2^{-\frac{E[h(x_i)]}{c(n)}} \quad (14)$$
where $E[h(x_i)]$ signifies the effective path length for the instance $x_i$, averaged across all trees in the ensemble, and $c(n)$ denotes the expected depth of an isolation tree with $n$ training instances. This uncalibrated score, $s(x_i, n)$, ranges from 0 to 1, with higher scores indicating a greater likelihood of being an outlier.
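A minimal scoring sketch with scikit-learn's IsolationForest on toy two-feature flows is given below; the feature values and hyperparameters here are illustrative defaults, not the configuration reported in Section 4.5.

```python
# Isolation-forest scoring on a toy (time interval, connection frequency) matrix.
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[1.2, 40], [0.9, 35], [1.1, 38], [300.0, 5000]])  # toy flows
iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0).fit(X)
print(iso.predict(X))        # +1 = inlier, -1 = outlier
print(iso.score_samples(X))  # lower scores indicate more anomalous points
```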
Local Outlier Factor
The Local Outlier Factor (LOF) [50] algorithm is a density-based method used to identify unusual or anomalous data points within a dataset $X = \{x_1, x_2, \ldots, x_n\}$. The LOF score of each data point $x_i$ indicates the extent to which it deviates from its neighbors; a higher LOF value suggests that $x_i$ is more likely to be an outlier. In the algorithm, $\mathrm{dist}(x, x')$ represents the distance between data points $x$ and $x'$. The Local Outlier Factor $LOF_k(x)$ of a data point $x$ is defined in Equation (15) [51]:
$$LOF_k(x) = \frac{\sum_{x' \in N_k(x)} \frac{ldr_k(x')}{ldr_k(x)}}{|N_k(x)|} \quad (15)$$
Here, $N_k(x)$ denotes the set of $k$ nearest neighbors of data point $x$. The numerator is the sum of the ratios of each neighbor's local reachability density $ldr$ to $x$'s own $ldr$, where $ldr$ denotes the local reachability density (LRD), a measure of how dense a data point's neighborhood is compared to those of its neighbors. Mathematically, the LRD of a point $x$, denoted $ldr_k(x)$, is computed as the inverse of the average reachability distance from $x$ to its $k$ nearest neighbors, as shown in Equation (16) [51]. The denominator $|N_k(x)|$ is the number of neighbors considered. The LOF score thus reflects how much more or less dense the neighborhood of $x$ is compared to those of its neighbors, helping identify outliers within the dataset.
$$ldr_k(x) = \frac{|N_k(x)|}{\sum_{x' \in N_k(x)} \mathrm{reach\_dist}_k(x', x)} \quad (16)$$
In our specific context, the data sample x signifies a binary flow vector within a binary dataset that incorporates transformed features of time intervals and connection frequencies.
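A sketch of LOF in novelty detection mode follows: the model is fit on normal flows only and then scores unseen flows; the feature values are toy data for illustration.

```python
# LOF as a novelty detector: fit on normal traffic, score new flow vectors.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X_train = np.array([[1.0, 30], [1.2, 28], [0.8, 33], [1.1, 31]])  # normal traffic
lof = LocalOutlierFactor(n_neighbors=3, metric="minkowski", p=2, novelty=True)
lof.fit(X_train)

X_new = np.array([[1.1, 29], [250.0, 4800]])
print(lof.predict(X_new))  # +1 = normal, -1 = anomaly (beaconing-like flow)
```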
Modified Z-Score
The Modified Z-score, built on the Median Absolute Deviation (MAD), is a powerful technique for outlier detection that offers robustness and interpretability. By utilizing the median and the median absolute deviation, it provides a reliable measure of the deviation of data points from the center of the distribution, making it valuable in various analytical applications. While several outlier detection methodologies have been discussed in this section, we opted for the Modified Z-score technique for two primary reasons: (i) it is a simple yet robust approach to outlier detection; and (ii) its core component, the MAD described in Equation (17), finds widespread application in prominent real-world command and control (C&C) beacon detection frameworks such as Real Intelligence Threat Analytics (RITA), particularly for detecting short beacons [52]. This endorsement by real-world cybersecurity frameworks underscores the efficacy and relevance of the Modified Z-score technique in practical anomaly detection scenarios.
$$MAD = b \cdot M_i\left(\left|x_i - M_j(x_j)\right|\right) \quad (17)$$
where, according to C. Leys et al. [53], $M_i$ denotes the median of the series and $x_j$ the $n$ original observations. The constant $b = 1/Q(0.75)$ is linked to the density distribution and is the inverse of the third quartile (Q3) [53]. Equation (18), for computing the Modified Z-score, encapsulates the essence of this technique, offering a succinct representation of its application in outlier detection tasks.
$$M - (d \times MAD) < x_i < M + (d \times MAD) \quad \text{or} \quad \left|\frac{x_i - M}{MAD}\right| > d \quad (18)$$
where $d$ is the threshold defining how many MAD units a point may deviate before it is flagged as an outlier. Note that $d$ can be changed depending on the researcher's criteria and should be justified. Commonly proposed values of $d$ [54] include, but are not limited to, 2.5, 3.0, and 3.5.
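The technique reduces to a few lines of numpy, as in the sketch below; it assumes the usual normal-distribution constant $b = 1.4826 \approx 1/Q(0.75)$, and the frequency values are toy data.

```python
# Modified Z-score (MAD) outlier check per Equations (17) and (18).
import numpy as np

def mad_outliers(x, d=3.0, b=1.4826):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = b * np.median(np.abs(x - med))    # Equation (17)
    return np.abs(x - med) / mad > d        # True where a point exceeds d MAD units

freq = np.array([40, 35, 38, 42, 37, 5000])  # toy connection frequencies
print(mad_outliers(freq))                    # flags the beaconing-like host
```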

4. Experimental Results and Analysis

The experiments in this paper were conducted on a Lenovo laptop featuring a 64-bit Intel i7-9750H CPU with 6 cores clocked at 2.6 GHz, a Pascal GTX 1050 GPU with 2 GB of memory, and 8 GB of RAM, running Windows 10 Pro 64-bit. Python libraries, primarily scikit-learn, were used in Jupyter Notebook for data processing and analysis.
Following the compression method described in Section 3, Table 4 shows the size and dimensions of the compressed dataset. The new compressed dataset needs only the two new features; hence, compared with the 33 features of the original dataset, the dimensionality is reduced by 94%. The average compression rate for both normal and DoH tunnel samples reaches 95.2%.

4.1. Graph Analysis

We used the graphviz layout of the networkx library to plot the star network. To obtain more connections, we captured incoming connections from three arbitrary servers (1.1.1.1, 8.8.4.4, and 176.103.130.130), as shown in Table 1, using the same servers for both normal and DoH tunnel connections. The goal is to analyze and confirm the hypothesis stated earlier, i.e., that the DoH tunnel connection frequency is likely much higher than that of normal traffic.
The experimental results in Figure 5 show that the incoming connection frequencies from DoH servers (central nodes) to local hosts with tunnel clients installed (adjacent IP addresses in red) are in the thousands, whereas the incoming connection frequencies from DoH servers to local hosts with normal traffic (adjacent IP addresses in green) are in the tens or low hundreds.

4.2. Bar Plot Analysis

The density distribution of outgoing connection frequency and time intervals for a set of local hosts (normal and tunnel), presented in Figure 6, also reveals some insightful observations. The host IPs are in the network 192.168.20.0/24, whereas the servers are 1.1.1.1, 8.8.4.4, 9.9.9.9, 176.103.130.130, and 176.103.130.131. Figure 6a depicts benign DoH connections to the DoH servers, whereas Figure 6b represents DoH tunnel traffic to the same servers.
The arrows facing down indicate the connection direction. To save space, only the last three digits of each IP are shown in Figure 6a and only the last digit in Figure 6b, in accordance with Table 2. The left y axis indicates the number of connections made between a local host and a particular server, while the right y axis shows the total time (in seconds) elapsed during these connections. This analysis provides a comprehensive view of the behavioral differences between hosts with normal traffic and those with DoH tunnels, and it can serve as a red flag prompting security analysts or blue teams to conduct further analysis on the identified hosts. At a glance, we can observe some variation in the benign traffic in Figure 6a, whereas the malicious traffic in Figure 6b shows a clear consistency.

4.3. Supervised Machine Learning

We trained four cost-sensitive ML algorithms: CS-LR, CS-SVM, CS-RF, and CS-XGB. The LR model was configured with C = 100 and solver = 'newton-cg'; the other algorithms were configured with default hyperparameters. All four algorithms were configured with the class_weight hyperparameter set to 'balanced', which makes the classifier automatically adjust the weights inversely proportional to the class frequencies in the input data. The weight for each class is calculated using Equation (19):
$$\text{class weight} = \frac{N_{\mathrm{samples}}}{N_{\mathrm{classes}} \times \mathrm{np.bincount}(y)} \quad (19)$$
Here, np.bincount(y) provides the number of occurrences of each class in the target vector y. For the XGB algorithm, the corresponding hyperparameter is called scale_pos_weight; we configured it as the ratio between the negative and positive classes, as shown in Equation (20).
$$\mathrm{scale\_pos\_weight} = \frac{N_{\mathrm{negative\ samples}}}{N_{\mathrm{positive\ samples}}} \quad (20)$$
This option helps to counteract imbalance by assigning higher weights to minority classes and lower weights to majority classes, ensuring that the classifier pays more attention to underrepresented classes. For the original dataset, we converted the IP address variables into integers and, since there were missing values, imputed them by replacing each missing value with the mean of its respective column. For both datasets, we scaled the variables with the standard scaler of the scikit-learn library, applying the standardization technique described in Section 3.1.2 to help prevent overfitting and speed up convergence. We then evaluated (trained and tested) the algorithms first with the original dataset and then with the compressed dataset, recording for each evaluation the prediction metrics (precision, recall, and F1 score) and the computation times (training and prediction). Table 5 shows the experimental results.
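The following sketch shows how the 'balanced' weights of Equation (19) can be reproduced and how the training time was recorded; the data here are random placeholders standing in for the compressed two-feature dataset.

```python
# Reproducing Equation (19) and timing the training step.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 100)   # illustrative imbalanced labels
X_train = np.random.rand(1000, 2)           # two compressed features (toy values)

# Weights per Equation (19): n_samples / (n_classes * np.bincount(y)).
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
print(weights)

clf = RandomForestClassifier(class_weight="balanced")
t0 = time.perf_counter()
clf.fit(X_train, y_train)
print(f"training time: {time.perf_counter() - t0:.3f} s")
```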
Table 6 presents the increase in training and testing speeds for various machine learning models after dataset compression. The LR model experienced a 104.8-fold increase in training and a 135.3-fold increase in testing speed. The SVM showed the most dramatic improvements, with training speed increasing by 36,284.9 times and testing by 22,477.5 times. The RF model also saw a substantial speed increase, with a 678.5-fold increase in training and a 54.8-fold increase in testing. The XGB model improved its training speed by 626.8 times and its testing by 27.8 times.
Here, we compare the XGB model used in [5] with the one used in this study. The reason we compare only layer 1, and not layer 2, of Table 5 in ref. [5] is that layer 1 uses the same sample size as this research. We can therefore see that our approach improved the training speed by 18-fold and the testing speed by 17.5-fold; however, it experienced a decrease in precision of about 50%.
These results demonstrate that compressing the dataset significantly enhances both training and testing speed across all compared methods, with SVM showing the greatest improvements. Looking at the results in Table 5 and Table 6, we can see that, despite its computational gains, SVM experienced prediction performance variation: precision increased by approximately 4.5%, but recall decreased by 7.5%. If we were to choose the best model, RF stands out for both prediction and computation time performance.

4.4. Effect of Flow Samples and Dimension Compression

The results in Table 5 show that compressing flow samples and reducing feature dimensions with the transformation technique lowers the prediction performance of some models. In terms of computational speed, however, Table 6 shows that models using the compressed dataset outperform those using the original dataset by several orders of magnitude. Suppose n represents the sample size and d the sample dimension. For most ML algorithms, these two parameters appear in the computational complexity. Therefore, by selecting fewer features from a dataset and using fewer samples, this paper demonstrated a substantial reduction in processing time. This is consistent with our previous study, where reducing the number of features improved computation speed exponentially [5].
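As a back-of-the-envelope illustration (our assumption, not a measured result), take kernel SVM training, whose cost is commonly modeled as growing at least quadratically in n and linearly in d. With the sizes in Table 4 (roughly 1,167,136 × 34 samples originally versus 55,914 × 2 after compression), the expected speedup is

$$\left(\frac{n_{\text{orig}}}{n_{\text{comp}}}\right)^{2} \cdot \frac{d_{\text{orig}}}{d_{\text{comp}}} = \left(\frac{1{,}167{,}136}{55{,}914}\right)^{2} \times \frac{34}{2} \approx 20.9^{2} \times 17 \approx 7.4 \times 10^{3},$$

which is on the same order as, though below, the 36,284.9-fold training speedup observed for SVM in Table 6; the remaining gap is consistent with solvers whose empirical scaling in n is superquadratic.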

4.5. Anomaly Detection

In this section, we employ both outlier detection algorithms (IF and MAD) and novelty detection algorithms (OCSVM and LOF) [42,50] to assess which one(s) achieve superior performance on the compressed dataset for further analysis. Despite their long history of application, these algorithms have consistently demonstrated strong performance across numerous domains and are by nature lighter than deep learning-based methods [44,55].
For OCSVM, we set the parameter ν to 0.0057 and utilized a radial basis function (RBF) kernel with automatic determination of the gamma parameter. In the case of IF, our configuration consisted of 42 estimators, with a maximum of 20% of samples and all features considered for each split. Additionally, a contamination rate of 0.000978 was specified to identify outliers. For LOF, we employed the Minkowski metric with p = 2 and enabled novelty detection mode to identify outliers. Notably, the configuration for MAD did not require specific parameters as it relies solely on the computation of the median absolute deviation. This comprehensive setup facilitated the evaluation and comparison of these algorithms in our anomaly detection experiments.
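The following minimal scikit-learn sketch mirrors these configurations. The MAD implementation is our own assumption: the paper fits no MAD parameters, so the conventional modified z-score constant (0.6745) and a cutoff of 3 are illustrative choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Novelty detectors (fit on normal traffic, then score unseen flows).
ocsvm = OneClassSVM(nu=0.0057, kernel="rbf", gamma="auto")
lof = LocalOutlierFactor(metric="minkowski", p=2, novelty=True)

# Outlier detector: 42 trees, 20% of samples and all features per split.
iforest = IsolationForest(n_estimators=42, max_samples=0.2,
                          max_features=1.0, contamination=0.000978)


def mad_outliers(x: np.ndarray, cutoff: float = 3.0) -> np.ndarray:
    """Flag outliers via the median absolute deviation (modified z-score).

    The 0.6745 constant and the cutoff of 3 are conventional, assumed
    values; the paper specifies no MAD parameters.
    """
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > cutoff
```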
According to Table 7, the OCSVM and LOF models share the highest overall performance across all three metrics, with a precision of 88.89, a recall of 100, and an F1 score of 94.12. The MAD model has both the lowest precision (0.98) and the lowest recall (94.12). This means that MAD flags many normal traffic data points as anomalies (many false positives), although it still catches most of the actual DoH tunnels. All models except MAD detect every DoH tunnel with no false negatives, as indicated by their recall of 100.
Concerning computation time, it is evident that OCSVM requires the least time for both training and testing, followed by LOF, whereas IF exhibits the highest computation cost. Hence, we opt for the novelty detection algorithm that offers the highest accuracy, efficient computation, and minimal parameter overhead: OCSVM emerges as the most suitable choice for this purpose. We therefore investigated this algorithm further to delve deeper into its effectiveness and performance.
During the experiment, we observed that 99.9% of all false negatives came from connections generated by compromised local hosts to a specific public DoH server, 8.8.8.8, as shown in Figure 7a, plus one connection to another public DoH server, 176.103.130.131, as indicated in Figure 7b. Only this traffic created false negatives, which reduced the models’ performance. This is probably due to recording errors, or because these servers were not used during the attack and hence no data was captured on these connections. In a real attack, these connections would likely have been blocked by the public DoH servers due to their security policies. Consequently, we call these connections special outliers, and they were removed.
After their removal, the confusion matrix (CM) in Figure 8 shows extremely few false positives (five), corresponding to a false-alarm probability (FPR) of about 4.4 × 10−4%. Remarkably, there were no false negatives, indicated by the 0 in the CM, hence the recall of 100 in Table 7.
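For illustration, the removal of these special outliers can be sketched as a simple flow filter, assuming a pandas DataFrame whose SourceIP and DestinationIP column names follow the flow direction features in Table 2 (the exact names are our assumption):

```python
import pandas as pd

# Connections that produced the "special outliers" (Figure 7): traffic from
# the compromised local hosts (tunnel source IPs, Table 1) to two public
# DoH servers.
SPECIAL_SERVERS = {"8.8.8.8", "176.103.130.131"}
COMPROMISED_HOSTS = {f"192.168.20.{i}" for i in [144, *range(204, 213)]}


def drop_special_outliers(flows: pd.DataFrame) -> pd.DataFrame:
    """Remove flows from compromised hosts to the two public DoH servers."""
    mask = (flows["SourceIP"].isin(COMPROMISED_HOSTS)
            & flows["DestinationIP"].isin(SPECIAL_SERVERS))
    return flows[~mask]
```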

5. Conclusions

This paper presents a novel approach for analyzing encrypted traffic to detect DNS-over-HTTPS (DoH) tunnels using C&C beacon characteristic features (frequency and time intervals between consecutive connections). We leverage a dataset containing over a million samples of network flow properties and statistical features covering DoH tunnels, benign DoH, and normal browsing (HTTPS) traffic. Each flow sample is represented by 33 features with a timestamp. The approach underscores the importance of extracting and using only frequency and time intervals when compressing a dataset and reducing flow dimensions.
We applied a star graph and a bar plot to these two features to analyze the difference between normal and tunnel traffic, and we applied both supervised and anomaly detection machine learning algorithms. The star graph and bar plot show a clear difference between the two traffic types. We compared our proposed approach with different supervised ML models evaluated on the uncompressed dataset, as well as with recent approaches aimed at enhancing computation speed. The results show that, with the compressed dataset, models become many times faster than when evaluated on the original dataset. While some supervised algorithms, such as RF and SVM, showed stable performance in both cases, others, such as LR and XGB, showed a high false positive rate, reflected in their low precision on the compressed dataset. Among the unsupervised algorithms, OCSVM and LOF stand out as the best-performing models.
Our proposed approach is simple yet effective for encrypted-traffic Network Intrusion Detection Systems (NIDS) operating in high-speed networks or resource-constrained environments to detect known and unknown attacks. The frequency and time intervals between consecutive connections are also intuitive network features from which security analysts can create Indicators of Compromise (IoC) for countermeasures.

6. Limitations and Recommendations

Despite its success, our proposed approach is specific to network flow-based analysis in which flows can be aggregated; it cannot be applied as a general-purpose intrusion detection system. Additionally, star graphs and bar plots are simple models that only indicate the behavioral pattern of malware traffic using C&C beacons across a small number of IP addresses. They cannot accommodate regular traffic in which hundreds or even thousands of websites with different IPs are visited; even if they could, the resulting graph would not be easily interpretable. Another challenge is obtaining a DoH tunnel dataset with real-world attacks to replace the well-known CIRA-CIC-DoHBrw-2020 dataset. We recommend that future research develop solutions that address the limitations above, among them the creation of a new dataset containing various DoH tunneling tools or real DoH tunnel attacks, or the use of more advanced Generative Adversarial Network (GAN) models to generate a better synthetic dataset.

Author Contributions

I.M.: Conceptualization, Methodology, Formal analysis, Implementation, Investigation, Validation, and Writing—original draft, Y.W.: Methodology, Writing—review and editing, Project administration, and Funding acquisition, X.H.: Writing—review and editing, Project administration, Funding acquisition, and Supervision, X.S.: Methodology and Writing—review and editing, E.M.N.: Resources and Data Curation, J.C.T.: Resources and Data Curation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (U20B2050, 62302389) and the Natural Science Basic Research Program of Shaanxi Province (2023-JC-QN-0742).

Data Availability Statement

The baseline dataset used to support the findings of this study is publicly available and was cited in this paper. The compressed dataset and other necessary materials can be provided by the authors of this paper on demand.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Hynek, K.; Vekshin, D.; Luxemburk, J.; Wasicek, A. Summary of DNS Over HTTPS Abuse. IEEE Access 2022, 10, 54668–54680. [Google Scholar] [CrossRef]
  2. Montazerishatoori, M.; Davidson, L.; Kaur, G.; Habibi Lashkari, A. Detection of DoH Tunnels Using Time-Series Classification of Encrypted Traffic. In Proceedings of the 2020 IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 63–70. [Google Scholar] [CrossRef]
  3. Abualghanam, O.; Alazzam, H.; Elshqeirat, B.; Qatawneh, M.; Almaiah, M.A. Real-Time Detection System for Data Exfiltration over DNS Tunneling Using Machine Learning. Electronics 2023, 12, 1467. [Google Scholar] [CrossRef]
  4. Nguyen, T.A.; Park, M. DoH Tunneling Detection System for Enterprise Network Using Deep Learning Technique. Appl. Sci. 2022, 12, 2416. [Google Scholar] [CrossRef]
  5. Irénée, M.; Wang, Y.; Hei, X.; Song, X.; Turiho, J.C.; Nyesheja, E.M. XTS: A Hybrid Framework to Detect DNS-Over-HTTPS Tunnels Based on XGBoost and Cooperative Game Theory. Mathematics 2023, 11, 2372. [Google Scholar] [CrossRef]
  6. DoHBrw 2020 Datasets. Available online: https://www.unb.ca/cic/datasets/dohbrw-2020.html (accessed on 25 November 2022).
  7. Hoffman, P.; McManus, P. DNS Queries over HTTPS (DoH); RFC 8484; IETF, 2018. [Google Scholar] [CrossRef]
  8. Turing, A.; Ye, G. An Analysis of Godlua Backdoor. Available online: https://blog.netlab.360.com/an-analysis-of-godlua-backdoor-en/ (accessed on 24 November 2022).
  9. Ramos, F.M.; Wang, X. A Machine Learning Based Approach to Detect Stealthy Cobalt Strike C &C Activities from Encrypted Network Traffic. In Machine Learning for Networking; Lecture Notes in Computer Science (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics); Springer: Cham, Switzerland, 2023; Volume 13767, pp. 113–129. [Google Scholar] [CrossRef]
  10. Cobalt Strike|Defining Cobalt Strike Components & BEACON. Available online: https://www.mandiant.com/resources/blog/defining-cobalt-strike-components (accessed on 24 October 2023).
  11. Abu Talib, M.; Nasir, Q.; Bou Nassif, A.; Mokhamed, T.; Ahmed, N.; Mahfood, B. APT Beaconing Detection: A Systematic Review. Comput. Secur. 2022, 122, 102875. [Google Scholar] [CrossRef]
  12. Kryo.Se: Iodine (IP-over-DNS, IPv4 over DNS Tunnel). Available online: https://code.kryo.se/iodine/ (accessed on 26 November 2022).
  13. GitHub-Alex-Sector/Dns2tcp. Available online: https://github.com/alex-sector/dns2tcp (accessed on 26 November 2022).
  14. GitHub-Iagox86/Dnscat2. Available online: https://github.com/iagox86/dnscat2 (accessed on 26 November 2022).
  15. Behnke, M.; Briner, N.; Cullen, D.; Schwerdtfeger, K.; Warren, J.; Basnet, R.; Doleck, T. Feature Engineering and Machine Learning Model Comparison for Malicious Activity Detection in the DNS-Over-HTTPS Protocol. IEEE Access 2021, 9, 129902–129916. [Google Scholar] [CrossRef]
  16. Banadaki, Y.M.; Robert, S. Detecting Malicious DNS over HTTPS Traffic in Domain Name System Using Machine Learning Classifiers. J. Comput. Sci. Appl. 2020, 8, 46–55. [Google Scholar] [CrossRef]
  17. Jafar, M.T.; Al-Fawa’reh, M.; Al-Hrahsheh, Z.; Jafar, S.T. Analysis and Investigation of Malicious DNS Queries Using CIRA-CIC-DoHBrw-2020 Dataset. Manch. J. Artif. Intell. Appl. Sci. 2021, 2, 65–70. [Google Scholar]
  18. Vekshin, D.; Hynek, K.; Cejka, T. DoH Insight: Detecting DNS over HTTPS by Machine Learning. In Proceedings of the ACM International Conference Proceeding Series, New York, NY, USA, 19–23 October 2020. [Google Scholar]
  19. Jeřábek, K.; Hynek, K.; Čejka, T.; Ryšavý, O. Collection of Datasets with DNS over HTTPS Traffic. Data Brief 2022, 42, 108310. [Google Scholar] [CrossRef]
  20. Singh, S.K.; Roy, P.K. Detecting Malicious DNS over HTTPS Traffic Using Machine Learning. In Proceedings of the International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT 2020), Zallaq, Bahrain, 20–21 December 2020. [Google Scholar] [CrossRef]
  21. MontazeriShatoori, M. An Anomaly Detection Framework for DNS-over-HTTPS (DoH) Tunnel Using Time-Series Analysis. Bachelor’s Thesis, University of New Brunswick, Fredericton, NB, Canada, 2020. [Google Scholar]
  22. GitHub-Ahlashkari/DoHLyzer: DoHlyzer Is a DNS over HTTPS (DoH) Traffic Flow Generator and Analyzer for Anomaly Detection and Characterization. Available online: https://github.com/ahlashkari/DoHlyzer (accessed on 26 November 2022).
  23. Hofstede, R.; Čeleda, P.; Trammell, B.; Drago, I.; Sadre, R.; Sperotto, A.; Pras, A. Flow Monitoring Explained: From Packet Capture to Data Analysis with NetFlow and IPFIX. IEEE Commun. Surv. Tutorials 2014, 16, 2037–2064. [Google Scholar] [CrossRef]
  24. Stalder Zurich, D. Machine-Learning Based Detection of Malicious DNS-over-HTTPS (DoH) Traffic Based on Packet Captures. Bachelor’s Thesis, University of Zurich, Zürich, Switzerland, 2021. [Google Scholar]
  25. Yang, Z.; Liu, X.; Li, T.; Wu, D.; Wang, J.; Zhao, Y.; Han, H. A Systematic Literature Review of Methods and Datasets for Anomaly-Based Network Intrusion Detection. Comput. Secur. 2022, 116, 102675. [Google Scholar] [CrossRef]
  26. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; Roumeliotis, R., Nicole, T., Eds.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2017; ISBN 9781492032649. [Google Scholar]
  27. Brownlee, N.; Mills, C.; Ruth, G. RFC2722: Traffic Flow Measurement: Architecture. USA: RFC Editor. 1999. Available online: https://www.rfc-editor.org/rfc/rfc2722.html (accessed on 30 June 2024).
  28. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  29. Kleinbaum, D.G.; Klein, M. Logistic Regression: A Self-Learning Text; Statistics for Biology and Health; Springer: New York, NY, USA, 2010; ISBN 978-1-4419-1741-6. [Google Scholar]
  30. Amiri, P.A.D.; Pierre, S. An Ensemble-Based Machine Learning Model for Forecasting Network Traffic in VANET. IEEE Access 2023, 11, 22855–22870. [Google Scholar] [CrossRef]
  31. Singh, S.K.; Roy, P.K. Malicious Traffic Detection of DNS over HTTPS Using Ensemble Machine Learning. Int. J. Comput. Digit. Syst. 2022, 11, 1061–1069. [Google Scholar] [CrossRef] [PubMed]
  32. Support Vector Machine-Wikipedia. Available online: https://en.wikipedia.org/wiki/Support_vector_machine (accessed on 10 July 2023).
  33. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013; ISBN 9781461468493. [Google Scholar]
  34. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar]
  35. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer Texts in Statistics; Springer: New York, NY, USA, 2021; ISBN 978-1-0716-1417-4. [Google Scholar]
  36. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  37. Ting, K.M. An Instance-Weighting Method to Induce Cost-Sensitive Trees. IEEE Trans. Knowl. Data Eng. 2002, 14, 659–665. [Google Scholar] [CrossRef]
  38. Mienye, I.D.; Sun, Y. Performance Analysis of Cost-Sensitive Learning Methods with Application to Imbalanced Medical Data. Inform. Med. Unlocked 2021, 25, 100690. [Google Scholar] [CrossRef]
  39. Brownlee, J. Cost-Sensitive. Imbalanced Classification with Python: Choose Better Metrics, Balance Skewed Classes, and Apply Cost-Sensitive Learning. Martin, S., Sanderson, M., Koshy, A., Andrei Cheremskoy, J.H., Eds.; 2020, pp. 237–240. Available online: https://www.amazon.com/Imbalanced-Classification-Python-Cost-Sensitive-Learning/dp/B09FP165TZ (accessed on 30 June 2024).
  40. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  41. Chaabouni, N.; Mosbah, M.; Zemmari, A.; Sauvignac, C.; Faruki, P. Network Intrusion Detection for IoT Security Based on Learning Techniques. IEEE Commun. Surv. Tutor. 2019, 21, 2671–2701. [Google Scholar] [CrossRef]
  42. Scholkopf, B.; Williamson, R.; Smola, A.; Shawe-Taylor, J.; Platt, J.; Holloway, R. Support Vector Method for Novelty Detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; MIT Press: Denver, CO, USA, 1999. [Google Scholar]
  43. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining ICDM, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  44. Pimentel, M.A.F.; Clifton, D.A.; Clifton, L.; Tarassenko, L. A Review of Novelty Detection. In Signal Processing; Elsevier: Amsterdam, The Netherlands, 2014; Volume 99, pp. 215–249. [Google Scholar]
  45. Prasad, N.R.; Almanza-Garcia, S.; Lu, T.T. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 14, 1–22. [Google Scholar] [CrossRef]
  46. Freitas De Araujo-Filho, P.; Pinheiro, A.J.; Kaddoum, G.; Campelo, D.R.; Soares, F.L. An Efficient Intrusion Prevention System for CAN: Hindering Cyber-Attacks with a Low-Cost Platform. IEEE Access 2021, 9, 166855–166869. [Google Scholar] [CrossRef]
  47. Mirsky, Y.; Doitshman, T.; Elovici, Y.; Shabtai, A. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. arXiv 2018, arXiv:1802.09089. [Google Scholar]
  48. Freitas De Araujo-Filho, P.; Kaddoum, G.; Campelo, D.R.; Gondim Santos, A.; Macedo, D.; Zanchettin, C. Intrusion Detection for Cyber-Physical Systems Using Generative Adversarial Networks in Fog Environment. IEEE Internet Things J. 2021, 8, 6247–6256. [Google Scholar] [CrossRef]
  49. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation-Based Anomaly Detection. ACM Trans. Knowl. Discov. from Data 2012, 6, 1–39. [Google Scholar] [CrossRef]
  50. Breuniq, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. (ACM Spec. Interes. Gr. Manag. Data) 2000, 29, 93–104. [Google Scholar] [CrossRef]
  51. Song, X.; Wang, Y.; Zhu, L.; Ji, W.; Du, Y.; Hu, F. A Method for Fast Outlier Detection in High Dimensional Database Log. In Proceedings of the 2021 International Conference on Networking and Network Applications (NaNA 2021), Lijiang City, China, 29 October–1 November 2021; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2021; pp. 236–241. [Google Scholar]
  52. Rita/Analyzer.Go at Master Activecm/Rita GitHub. Available online: https://github.com/activecm/rita/blob/master/pkg/beacon/analyzer.go (accessed on 29 April 2023).
  53. Leys, C.; Ley, C.; Klein, O.; Bernard, P.; Licata, L. Detecting Outliers: Do Not Use Standard Deviation around the Mean, Use Absolute Deviation around the Median. J. Exp. Soc. Psychol. 2013, 49, 764–766. [Google Scholar] [CrossRef]
  54. Miller, J. Short Report: Reaction Time Analysis with Outlier Exclusion: Bias Varies with Sample Size. Exp. Psychol. Soc. 1991, 43, 907–912. [Google Scholar] [CrossRef]
  55. Perera, P.; Oza, P.; Patel, V.M. One-Class Classification: A Survey. arXiv 2021, arXiv:2101.03064. [Google Scholar]
Figure 1. A high-level view of the process of extracting C&C features leading to a newly compressed dataset. The original dataset contains 33 features, including packets (bytes, length, time, request/response time), each encompassing other statistical features. The original dataset passed through filtering based on the flow direction features (IP source and destination addresses). The grouped flows are aggregated, and two C&C features are extracted (computed), hence making a compressed binary dataset.
Figure 2. Proposed scheme. (1) Legitimate DoH traffic (NonDoH and DoH) is merged to create normal traffic. (2) and (3) represent original and filtered datasets, respectively, as shown in Figure 1. (4) shows star graph and statistical analysis and modeling. (5) shows a compressed dataset after extracting beacon characteristics and data processing. (6) and (7) show unsupervised and supervised machine learning modeling and evaluation.
Figure 3. The network topology used to simulate DoH tunnel attacks and generate CIRA-CIC-DoHBrw-2020.
Figure 4. A graphical representation showing the outgoing connections from local hosts to public servers using nodes in a simplified directed graph. (a) shows consecutive unique connections from a specific client to a specific DoH server. (b) shows connections made by different hosts (HA, HB, HC).
Figure 5. A star graph showing network connections. The center node is an arbitrary DoH server, and the adjacent nodes are local hosts. Red nodes represent source IPs used to create DoH tunnels, and green adjacent nodes represent the IPs used to visit the web, from Table 1.
Figure 6. Bar plot of connection frequency and time intervals. The x-axis indicates host IPs as shown in Table 2, while the two y-axes (left and right) indicate connection frequency and time intervals, respectively. (a) shows benign DoH traffic connections and (b) DoH tunnel traffic connections.
Figure 7. The traffic from the compromised local hosts that created special outliers. (a) shows the traffic to the public DoH server (8.8.8.8) and (b) the traffic to the public DoH server (176.103.130.131).
Figure 8. Confusion matrix for OCSVM.
Table 1. IP address list used for creating the CIRA-CIC-DoHBrw-2020 dataset, edited from [2].

Public DoH IP addresses: 1.1.1.1, 8.8.8.8, 9.9.9.10, 8.8.4.4, 9.9.9.9, 9.9.9.11, 176.103.130.131, 176.103.130.130, 149.112.112.10, 149.112.112.112, 104.16.248.249, 104.16.249.249
Source IP used to connect to websites (Google Chrome): 192.168.20.191
Source IPs used to connect to websites (Mozilla Firefox): 192.168.20.111, 192.168.20.112, 192.168.20.113
Source IPs used to create DoH tunnels: 192.168.20.144, 192.168.20.204, 192.168.20.205, 192.168.20.206, 192.168.20.207, 192.168.20.208, 192.168.20.209, 192.168.20.210, 192.168.20.211, 192.168.20.212
Table 2. CIRA-CIC-DoHBrw-2020 traffic flow features without the timestamp [2].

Flow Direction: F1: Source IP, F2: Destination IP, F3: Source Port, F4: Destination Port.
Packet Bytes: F5: Duration, F6: Number of flow bytes sent, F7: Rate of flow bytes sent, F8: Number of flow bytes received, F9: Rate of flow bytes received.
Packet Length: F10: Mean, F11: Median, F12: Mode, F13: Variance, F14: Standard deviation, F15: Coefficient of variation, F16: Skew from median, F17: Skew from mode.
Packet Time: F18: Mean, F19: Median, F20: Mode, F21: Variance, F22: Standard deviation, F23: Coefficient of variation, F24: Skew from median, F25: Skew from mode.
Request/response time difference: F26: Mean, F27: Median, F28: Mode, F29: Variance, F30: Standard deviation, F31: Coefficient of variation, F32: Skew from median, F33: Skew from mode.
Table 3. Cost of misclassification of minority classes.

                | Predicted Positive | Predicted Negative
Actual Positive | C(1, 1) = 1        | C(0, 1) = n/p
Actual Negative | C(1, 0) = 1        | C(0, 0) = 1
Table 4. The size and dimensions of the compressed dataset.

Dataset            | Original dataset size | Compressed dataset size | Compression rate (%)
Merged: Non-DoH    | 897,493 × 34          | 55,852 × 2              | 93.8
Merged: Benign DoH | 19,807 × 34           | 11 × 2                  | 99.94
Merged: Normal     | 917,300 × 34          | 55,863 × 2              | 93.9
Malicious DoH      | 249,836 × 34          | 51 × 2                  | 99.98
Table 5. Performance comparison of supervised ML models between original, previous studies, and compressed datasets.

Dataset              | Method | P      | R       | F1     | Training (s) | Testing (s) | # of Predictors
CIRA-CIC-DoHBrw-2020 | LR     | 82.367 | 95.399  | 88.8   | 18.863       | 0.541       | 33
CIRA-CIC-DoHBrw-2020 | SVM    | 95.552 | 98.4322 | 96.97  | 4789.607     | 449.549     | 33
CIRA-CIC-DoHBrw-2020 | RF     | 99.989 | 99.905  | 99.47  | 298.558      | 2.192       | 33
CIRA-CIC-DoHBrw-2020 | XGB    | 99.993 | 99.998  | 99.995 | 62.681       | 0.111       | 33
Recent studies       | XTS    | 99.99  | 99.96   | 99.94  | 1.8          | 0.073       | —
New compressed       | LR     | 68.8   | 100     | 81.5   | 0.18         | 0.0042      | 2
New compressed       | SVM    | 100    | 90.9    | 95.2   | 0.132        | 0.02        | 2
New compressed       | RF     | 100    | 100     | 100    | 0.44         | 0.04        | 2
New compressed       | XGB    | 50     | 100     | 66.7   | 0.1          | 0.004       | 2
Table 6. Number of times that supervised models evaluated using compressed datasets are faster than those evaluated with CIRA-CIC-DoHBrw-2020 and previous studies. The * shows the speed of a recent study [5], which also outperformed previous studies that were shown in the literature review.

Method | Training | Testing
LR     | 104.8    | 135.3
SVM    | 36,284.9 | 22,477.5
RF     | 678.5    | 54.8
XGB    | 626.8    | 27.8
XTS *  | 18       | 17.5
Table 7. Anomaly detection results.

Model | Precision | Recall | F1 Score | Training (ms) | Testing (ms)
OCSVM | 88.89     | 100    | 94.12    | 336.1         | 28.9
IF    | 78.57     | 100    | 88       | 597.4         | 137.6
LOF   | 88.89     | 100    | 94.12    | 398.9         | 49.8
MAD   | 0.98      | 94.12  | 1.95     | N/A           | N/A