Article

Understanding the Feature Space and Decision Boundaries of Commercial WAFs Using Maximum Entropy in the Mean

Henryk Gzyl, Enrique ter Horst, Nathalie Peña-Garcia and Andres Torres

1 Centro de Finanzas IESA, Caracas 1010, Venezuela
2 School of Management, Universidad de los Andes, Bogota 111711, Colombia
3 Research Department, CESA Business School, Bogota 110311, Colombia
* Author to whom correspondence should be addressed.
Entropy 2023, 25(11), 1476; https://doi.org/10.3390/e25111476
Submission received: 10 August 2023 / Revised: 24 September 2023 / Accepted: 26 September 2023 / Published: 24 October 2023

Abstract

The security of a network requires the correct identification and characterization of the attacks through its ports. This involves tracking all requests for access to the network by all kinds of users. We consider the frequency and the type of connections to a network and determine their joint probability. This leads to the problem of determining a joint probability distribution from the knowledge of its marginals in the presence of measurement errors. Mathematically, this is an ill-posed linear problem with convex constraints, which we solve by the method of maximum entropy in the mean. The procedure is flexible enough to accommodate errors in the data in a natural way. Moreover, the procedure is model-free and, hence, does not require fitting unknown parameters.

1. Introduction

In the ever-changing digital environment, the dangers posed by cyber-attacks have become increasingly critical. The growing complexity of malicious actors, together with the swift proliferation of connected devices and systems, has resulted in an unparalleled degree of vulnerability for both individuals and organizations (see ISACA (https://www.isaca.org/go/state-of-cybersecurity-2021 (accessed on 23 September 2023))). Consequently, it is crucial to devise and employ innovative intrusion detection techniques that are capable of effectively combating these emerging threats.
Web application firewalls (WAFs) play a vital role in safeguarding contemporary applications, as the majority of attacks target the application layer of the OSI model [1]. Although commercial WAFs have exhibited superior overall performance in comparison to their open-source counterparts [2], they pose a challenge for defenders due to the absence of transparency concerning the feature space, learning algorithm, and decision function utilized by these systems. The feature space encompasses the set of all potential features that can be employed to characterize a data point, while decision boundaries represent the demarcations that distinguish different classes within the feature space. This knowledge gap obstructs defenders' ability to optimize their usage of commercial WAFs and enhance their defenses against cyber-attacks. Furthermore, recent research has indicated that even widely used classification algorithms can be circumvented with high likelihood by attackers, even when the attacker has access to only a small surrogate dataset of the classifier and limited knowledge of the targeted system [3,4]. Research has made significant advances in the automatic detection of attacks, but most of the proposed methodologies are based on very complex models that demand substantial processing time and computing resources [5]. One of the most prevalent trends is the use of feature selection algorithms to reduce the cost associated with the training and inference of these models. Some examples include employing filter-based feature reduction using the XGBoost algorithm [6], utilizing genetic algorithms (GAs) in conjunction with a logistic regression (LR) wrapper-based feature selection methodology [7], and implementing hybrid sampling with deep hierarchical networks [8], among others. Following this trend, our model-free approach reduces the resources required to detect and prevent cybersecurity attacks and bolsters the ability of defenders to differentiate between malicious and benign traffic, even in scenarios with limited access to information, as is the case for users of commercial firewalls. In other areas, researchers have modeled cooperation and extortion alliances in network systems from a game-theoretical perspective [9].
As such, comprehending the multidimensional space of attributes (feature space) and the boundaries that differentiate various classes of network traffic within the feature space (decision boundaries) in commercial WAFs is critical for augmenting the efficacy of security systems for web applications. In this paper, we introduce a novel approach that employs maximum entropy in the mean to obtain insights into the feature space and decision boundaries of commercial WAFs. Our methodology enables defenders to reveal the underlying distribution of request features classified as malicious and benign by commercial WAFs, thereby bolstering their ability to defend against cyber-attacks.
The manuscript is organized as follows. First, to gain some perspective about the nature of the problem, in Section 2, we describe the data collection process. With this in mind, we explain the method of maximum entropy in Section 3. There we extend the results in [10] to cover the current augmented problem. In Section 4, we describe the results of the implementation of the maxentropic procedure. In Section 5, we present our concluding remarks.

2. Data Collection

In this study, we utilized the CSE-CIC-IDS2018 dataset for our analysis. This dataset is a publicly available collection of network traffic captures, which were recorded in a controlled environment using a custom-designed setup [11]. The dataset contains both benign and malicious traffic and is specifically designed to develop and evaluate network intrusion detection systems.
The CSE-CIC-IDS2018 dataset consists of dynamically generated records of benign and malign events within a computer network [11].
The benign events were created by fitting a distribution to data collected from real users and subsequently drawing from that distribution using multiple machine learning and statistical analysis techniques.
The malign events were created by automated agents attacking the network under seven different scenarios, described in Table A1 in Appendix A.
The variables we chose were the requests over ports and the attack scenarios described in Table A1. In a real-world network, each port can be directly associated with a computer application; thus, the joint distribution of attacks over ports provides defenders with a quantitative measurement of the attackers' application targets and behavior within the network. Additionally, the data required to find the joint distribution can be collected from the defenders' firewall and monitoring systems, which makes the results of this paper accessible to any cybersecurity team.
To process the dataset, we first limited the number of records used to 80% of the dataset and computed the frequency of port usage and requests for each class of traffic. The dataset included 14 classes of malicious requests, plus benign requests. Using this information, we then calculated the joint distribution of port usage and requests for each traffic class, and we computed the individual probability of each port and traffic class under analysis. To understand the feature space and decision boundaries of commercial WAFs, we employed a novel approach using maximum entropy in the mean (MEM). MEM is a statistical method that allows us to estimate the underlying joint probability distribution of a set of features when only its marginals are known, possibly with measurement error. The data can therefore be described by the following two variables, X and Y:
  • X: traffic type, the frequency of connections belonging to one of the classes 'Benign', 'Infilteration', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'SQL Injection', 'DoS attacks-GoldenEye', 'DoS attacks-Slowloris', 'DDoS attacks-LOIC-HTTP', 'DDOS attack-HOIC', 'DDOS attack-LOIC-UDP', 'FTP-BruteForce', 'SSH-Bruteforce', 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest' (the spellings follow the dataset's own labels). We shall label these 15 traffic classes from 1 to 15, respectively.
  • Y: traffic behavior, the frequency of usage of ports 0, 135, 137, 139, 21, 22, 3128, 3389, 443, 445, 53, 5355, 67, 80, and 8080. These ports were selected since they account for more than 80% of web network traffic. We shall label these 15 ports from 1 to 15, respectively.
Additionally, the reason for selecting ports as the focal point of the analysis is that they carry the most information about network traffic: each port is associated with a specific process or service, and it has been shown that the analysis of network traffic can provide the specific insight needed to develop novel approaches to collect, classify, and, eventually, mitigate malware [12].
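As an illustration, the marginal frequencies of X and Y can be tallied along the following lines. This is only a sketch: the file name is hypothetical, and "Dst Port" and "Label" are the column names used in the public CSE-CIC-IDS2018 CSV exports (treat them as assumptions for your own copy of the data).

```python
import pandas as pd

# Hypothetical file name; "Dst Port" and "Label" follow the public
# CSE-CIC-IDS2018 CSV exports.
df = pd.read_csv("cse-cic-ids2018_flows.csv", usecols=["Dst Port", "Label"])

# Keep only the 15 ports analyzed in the paper.
ports = ["0", "135", "137", "139", "21", "22", "3128", "3389",
         "443", "445", "53", "5355", "67", "80", "8080"]
df = df[df["Dst Port"].astype(str).isin(ports)]

p = df["Label"].value_counts(normalize=True)       # marginal of X: traffic type
q = df["Dst Port"].value_counts(normalize=True)    # marginal of Y: port usage
joint_emp = pd.crosstab(df["Label"], df["Dst Port"], normalize="all")
```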

3. Statement of the Inverse Problem and Its Solution by MEM

The mathematical problem to be solved consists of determining a 15 × 15 matrix from the imprecise knowledge of its row and column sums. Let us write $p_{i,j} = P(X = x_i, Y = y_j)$, where $1 \le i \le 15$ and $1 \le j \le 15$, and the values of X and Y are described above, in Section 2. So, to begin with, if there were no measurement errors, the problem to solve consists of determining a joint distribution from its marginals; that is:
Determine a joint distribution $p_{i,j}$ such that
$$0 \le p_{i,j} \le 1, \qquad \sum_{i=1}^{15}\sum_{j=1}^{15} p_{i,j} = 1, \qquad \sum_{j=1}^{15} p_{i,j} := p_i, \quad i = 1, \dots, 15, \qquad \sum_{i=1}^{15} p_{i,j} := q_j, \quad j = 1, \dots, 15. \tag{1}$$
To write this in a standard form, it is convenient to vectorize the problem. For that, we list the components of $p_{i,j}$ lexicographically into a vector $x$ of $N = 15 \times 15 = 225$ components. Similarly, the row and column sums are vectorized into a data vector $d$ of dimension $K = 15 + 15 = 30$, defined by
$$d = \begin{pmatrix} p \\ q \end{pmatrix}. \tag{2}$$
The 30 row and column sum constraints upon $p_{i,j}$ are vectorized into a $30 \times 225$ matrix $C$, the specification of which is given in Appendix A. Also, let $u$ denote the 225-vector with all of its components equal to 1. As usual, we think of vectors as column vectors, and the superscript $t$ denotes the transposition of the indicated object.
With these notations, after vectorization, problem (1) becomes:
Determine a vector $x$ such that
$$0 \le x_k \le 1, \qquad u^t x = 1, \qquad Cx = d. \tag{3}$$
Observe that the constraint matrix can be augmented by adding $u^t$ as a first row to $C$, so that $u^t x = 1$ does not appear as an extra constraint. An important remark is the following. When the two data vectors, $p$ and $q$, are probabilities, we do not have to impose this constraint. But when the marginals are not proportions, that is, when each of them does not add up to 1, we may still want to impose $u^t x = 1$. In this case, we augment the constraint matrix as follows:
Let us put
$$A_0 = \begin{pmatrix} u^t \\ C \end{pmatrix}. \tag{4}$$
The situation just described may occur when the data vector is known only up to an observational error. In this case, besides augmenting the constraint matrix, the statement of problem (3) has to be modified because now the range of the matrix $C$, or that of $A_0$, may not contain the data point. Apart from redefining the constraint matrix, as in (4), we replace (3) with the following problem:
Determine a vector $x \in \mathbb{R}^{225}$ and $e \in [-\delta, \delta]^{30}$ such that
$$0 \le x_k \le 1 \qquad \text{and} \qquad A_0 x + e = d. \tag{5}$$
This is equivalent to regularizing the problem because it permits the solution to deviate from satisfying the constraints exactly. The role of $e$ is to absorb the error. We choose $[-\delta, \delta]^{30}$ as the range of $e$ due to computational simplicity. The order of magnitude of the parameter $\delta$ is determined by a statistical analysis of the data. So, (5) is a typical ill-posed, linear problem with convex constraints.
In short, we use $N = 225$ and $K = 30$. To allow for the possibility of constraints upon the entries of some cells, we suppose that $x_j \in [a_j, b_j]$ with $0 \le a_j < b_j \le 1$ for $j = 1, \dots, N$. This may occur when the probability and severity of a given type of attack lie within a known range. To estimate the errors of measurement as part of the solution of the problem, we augment (5) as follows. We put
$$A = \begin{pmatrix} A_0 & I_K \end{pmatrix}, \qquad z = \begin{pmatrix} x \\ e \end{pmatrix}, \qquad \mathcal{K} = \prod_{j=1}^{N} [a_j, b_j] \times [-\delta, \delta]^K.$$
We shall use $M = N + K = 255$. With these notations, the extension of the problem of reconstructing a joint probability from its marginals, when there is an error of measurement in the data, can be stated as follows:
$$\text{Determine a vector } z \in \mathcal{K} \text{ such that } Az = d. \tag{6}$$
The method of maximum entropy in the mean (MEM) transforms the constrained, ill-posed problem into an unconstrained, non-linear, convex optimization problem. Instead of solving (6), we search for a probability $P$ on $(\mathcal{K}, \mathcal{F})$, where $\mathcal{F}$ denotes the class of Borel measurable subsets of $\mathcal{K}$, such that
$$A\,E_P[Z] = d. \tag{7}$$
This is the essence of the method of maximum entropy in the mean: To transform solving a linear ill-posed problem, subject to convex constraints, into a problem of searching for a probability measure on the space of constraints, such that the average of the coordinates with respect to the probability satisfies the constraint. Since the measure is concentrated within the constraint set, and the average over a convex set remains within the set, the convexity constraint is inherently satisfied.
Let $Z : \mathcal{K} \to \mathcal{K}$ be the coordinate (identity) mapping. This is a standard generalized moment problem. To determine $P$ in this setup, it is easier to start from a reference measure $Q$, which we choose to be
$$dQ(\zeta) = \prod_{j=1}^{N} \big( \delta_{a_j}(d\xi_j) + \delta_{b_j}(d\xi_j) \big) \prod_{i=1}^{K} \big( \delta_{-\delta}(d\eta_i) + \delta_{\delta}(d\eta_i) \big). \tag{8}$$
This choice works because any interior point of an interval can be written as a convex combination of the endpoints. Any probability $P$ that is absolutely continuous with respect to $Q$ (written $P \ll Q$) has the form
$$dP(\zeta) = \prod_{j=1}^{N} \big( p_j\,\delta_{a_j}(d\xi_j) + q_j\,\delta_{b_j}(d\xi_j) \big) \prod_{i=1}^{K} \big( \alpha_i\,\delta_{-\delta}(d\eta_i) + \beta_i\,\delta_{\delta}(d\eta_i) \big), \tag{9}$$
where the coefficients satisfy
$$p_j + q_j = 1, \quad j = 1, \dots, N, \qquad \text{and} \qquad \alpha_i + \beta_i = 1, \quad i = 1, \dots, K.$$
It is when determining the coefficients $p_j$ and $\alpha_i$ that the notion of entropy comes in. We leave it as an exercise for the reader to verify that the expected values of $\xi_j$ and $\eta_i$ are given by
$$x_j = a_j\,p_j + b_j\,(1 - p_j) \qquad \text{and} \qquad e_i = -\delta\,\alpha_i + \delta\,(1 - \alpha_i). \tag{10}$$
The class of probabilities that satisfies (7) is a convex set. When it comes to locating a point within a convex set, one effective approach is to maximize a strictly concave function defined over that set. We define the entropy of $P$ relative to $Q$ by
$$S_Q(P) = -\sum_{j=1}^{N} \big( p_j \ln p_j + (1 - p_j) \ln(1 - p_j) \big) - \sum_{i=1}^{K} \big( \alpha_i \ln \alpha_i + (1 - \alpha_i) \ln(1 - \alpha_i) \big). \tag{11}$$
With all this, we replace (6) with
$$\text{Determine } P \ll Q \text{ maximizing } S_Q(P) \text{ given in (11), subject to } A\,E_P[Z] = d. \tag{12}$$
This clarifies the comment made immediately after (7). Once the probability in (12) is found, $z^* = E_P[Z]$ automatically satisfies the convexity constraints and the equation $Az^* = d$. The problem is easy to solve using Lagrange multipliers. Going through the routine, Equation (10) yields:
$$x_j^* = \frac{a_j e^{-a_j (A_0^t \lambda^*)_j} + b_j e^{-b_j (A_0^t \lambda^*)_j}}{e^{-a_j (A_0^t \lambda^*)_j} + e^{-b_j (A_0^t \lambda^*)_j}}, \qquad e_i^* = \frac{-\delta e^{\delta \lambda_i^*} + \delta e^{-\delta \lambda_i^*}}{e^{\delta \lambda_i^*} + e^{-\delta \lambda_i^*}}. \tag{13}$$
When there is no prior information on the content of the cells, all values become $a_j = 0$ and $b_j = 1$, and (13) becomes
$$x_j^* = \frac{e^{-(A_0^t \lambda^*)_j}}{1 + e^{-(A_0^t \lambda^*)_j}}, \qquad e_i^* = \frac{-\delta e^{\delta \lambda_i^*} + \delta e^{-\delta \lambda_i^*}}{e^{\delta \lambda_i^*} + e^{-\delta \lambda_i^*}}. \tag{14}$$
The next step is to elucidate how to obtain the optimal Lagrange multiplier $\lambda^* \in \mathbb{R}^K$. This is achieved by minimizing the convex function
$$\Sigma(\lambda, d) = \sum_{j=1}^{N} \ln \big( e^{-a_j (A_0^t \lambda)_j} + e^{-b_j (A_0^t \lambda)_j} \big) + \sum_{i=1}^{K} \ln \big( e^{\delta \lambda_i} + e^{-\delta \lambda_i} \big) + \langle \lambda, d \rangle. \tag{15}$$
This is a strictly convex function that has a minimizer in $\mathbb{R}^K$ as long as the datum $d$ lies in the range of $A_0$. It is easy to verify that the first-order condition for $\lambda^*$ to be a minimizer of $\Sigma(\lambda, d)$ is equivalent to ensuring that $x^*$, as given in (13) or (14), satisfies (7).
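To make the procedure concrete, the following is a minimal numerical sketch of minimizing (15) and recovering the primal solution via (13). The function name mem_solve, the use of SciPy's BFGS optimizer, and the tolerances are our own illustrative choices, not the authors' implementation (the paper relies on the R package DEoptim [13]). The gradient identity $\nabla \Sigma = d - A_0 x^*(\lambda) - e^*(\lambda)$ follows by differentiating (15), so the first-order condition is exactly $A_0 x^* + e^* = d$.

```python
import numpy as np
from scipy.optimize import minimize

def mem_solve(A0, d, a, b, delta=0.0):
    """Sketch of MEM: find x with x_j in [a_j, b_j] and e with
    e_i in [-delta, delta] such that A0 @ x + e = d, by minimizing
    the dual function Sigma(lambda, d) of Eq. (15)."""
    m, _ = A0.shape

    def primal(lam):
        g = A0.T @ lam                                # (A_0^t lambda)_j
        w = np.clip((b - a) * g, -700.0, 700.0)       # guard against overflow
        x = a + (b - a) / (1.0 + np.exp(w))           # Eq. (13), first formula
        e = -delta * np.tanh(delta * lam)             # Eq. (13), second formula
        return x, e

    def dual(lam):                                    # Sigma(lambda, d), Eq. (15)
        g = A0.T @ lam
        return (np.logaddexp(-a * g, -b * g).sum()
                + np.logaddexp(delta * lam, -delta * lam).sum()
                + d @ lam)

    def grad(lam):                                    # zero exactly when A0 x + e = d
        x, e = primal(lam)
        return d - A0 @ x - e

    res = minimize(dual, np.zeros(m), jac=grad, method="BFGS",
                   options={"gtol": 1e-10, "maxiter": 10000})
    x, e = primal(res.x)
    return x, e, res.x
```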

4. Results

4.1. Simulation Exercise

In this section, we work through a small toy example to test our method and to see how well it reconstructs the bivariate probability distribution of two random variables, X and Y, both of which take the values 1, 2, and 3 (see Table 1).
From Table 1, we compute the marginal distribution of X, with probability vector $p = (0.30, 0.46, 0.24)$, and of Y, with probability vector $q = (0.19, 0.47, 0.34)$. We use the algorithm from [13], which, after a few convergence steps, yields the reconstructed probability distribution for $\delta = 0$ shown in Table 2:
We were able to reconstruct the theoretical joint probability distribution without errors to a close level of accuracy. If we add an error of $\delta = 0.005$, then we obtain the reconstructed probability distribution shown in Table 3:
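As a check, this toy reconstruction can be reproduced with the hypothetical mem_solve sketch from Section 3; the constraint assembly mirrors the construction of Appendix A on a 3 × 3 grid.

```python
import numpy as np

n = 3
# Constraint matrix: total mass, row sums, and column sums of the
# lexicographically vectorized 3 x 3 joint distribution.
A0 = np.vstack([np.ones(n * n),
                np.kron(np.eye(n), np.ones(n)),    # row sums -> p
                np.kron(np.ones(n), np.eye(n))])   # column sums -> q
d = np.concatenate(([1.0],
                    [0.30, 0.46, 0.24],            # marginal of X (Table 1)
                    [0.19, 0.47, 0.34]))           # marginal of Y (Table 1)

x, e, lam = mem_solve(A0, d, np.zeros(n * n), np.ones(n * n), delta=0.005)
print(x.reshape(n, n).round(2))                    # compare with Tables 2 and 3
```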

4.2. Real Data Application

Here, we apply the method developed in Section 3 to obtain the joint probabilities that fit the data. We compute the distributions for X and Y from the data, and obtain the following probability vectors:
  • $p = (1.791 \times 10^{-2}, 1.109 \times 10^{-3}, 1.723 \times 10^{-3}, 2.895 \times 10^{-4}, 2.509 \times 10^{-2}, 1.485 \times 10^{-2}, 2.107 \times 10^{-3}, 1.379 \times 10^{-1}, 1.523 \times 10^{-1}, 5.349 \times 10^{-2}, 3.060 \times 10^{-1}, 3.817 \times 10^{-3}, 1.018 \times 10^{-3}, 2.612 \times 10^{-1}, 2.121 \times 10^{-2})$
  • $q = (7.976 \times 10^{-1}, 8.399 \times 10^{-3}, 2.118 \times 10^{-2}, 3.637 \times 10^{-5}, 1.698 \times 10^{-5}, 6.538 \times 10^{-6}, 3.119 \times 10^{-3}, 8.259 \times 10^{-4}, 4.330 \times 10^{-2}, 5.155 \times 10^{-2}, 1.300 \times 10^{-4}, 1.453 \times 10^{-2}, 1.410 \times 10^{-2}, 3.471 \times 10^{-2}, 1.051 \times 10^{-2})$
When examining these distribution vectors, one should keep in mind the high numerical resolution and the size of the dataset, which comprises over 13 million records. This scale of data allows us to observe events with probabilities as low as $1 \times 10^{-5}$, even if they are considered unlikely.
Finally, we minimized the dual function (15) using the R package DEoptim [13] and found the joint distribution of X and Y; the norm of the gradient at the optimum was $4.618528 \times 10^{-7}$. The results are presented in Table 4, which contains the main computational result of this work: the joint probabilities of the type of connections and the frequency of connections to the network.
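For completeness, here is a sketch of how the marginals above would be fed to the hypothetical mem_solve from Section 3. The value of δ is an illustrative placeholder, since the paper calibrates δ from a statistical analysis of the data.

```python
import numpy as np

p = np.array([1.791e-2, 1.109e-3, 1.723e-3, 2.895e-4, 2.509e-2,
              1.485e-2, 2.107e-3, 1.379e-1, 1.523e-1, 5.349e-2,
              3.060e-1, 3.817e-3, 1.018e-3, 2.612e-1, 2.121e-2])
q = np.array([7.976e-1, 8.399e-3, 2.118e-2, 3.637e-5, 1.698e-5,
              6.538e-6, 3.119e-3, 8.259e-4, 4.330e-2, 5.155e-2,
              1.300e-4, 1.453e-2, 1.410e-2, 3.471e-2, 1.051e-2])

n = 15
A0 = np.vstack([np.ones(n * n),
                np.kron(np.eye(n), np.ones(n)),   # row sums -> p
                np.kron(np.ones(n), np.eye(n))])  # column sums -> q
d = np.concatenate(([1.0], p, q))

# delta = 1e-4 is an illustrative error band, not the paper's calibrated value.
x, e, lam = mem_solve(A0, d, np.zeros(n * n), np.ones(n * n), delta=1e-4)
joint = x.reshape(n, n)        # rows: traffic types; columns: ports (cf. Table 4)
```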

5. Discussion

In conclusion, this paper presented a novel methodology that employs maximum entropy in the mean to gain insights into the feature space and decision boundaries of commercial web application firewalls (WAFs). The approach enables defenders to uncover the joint underlying distribution of requests classified as malicious and benign by commercial WAFs, thereby enhancing their ability to defend against cyber-attacks by taking into account the dependence between the types of attacks.
The bivariate density reconstructed in Table 4 enables us to evaluate the most probable future attacks, along with the conditional probabilities of an attack occurring given factors such as internet traffic volume or any other relevant measurable variable used for conditioning.
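As an illustration of such conditioning, the following sketch computes $P(X = i \mid Y = j)$ from the reconstructed joint table; joint denotes the hypothetical 15 × 15 matrix recovered in Section 4.2 (rows are traffic types, columns are ports).

```python
import numpy as np

p_port = joint.sum(axis=0)                              # marginal of port usage Y
cond = joint / np.where(p_port > 0.0, p_port, np.nan)   # P(X = i | Y = j), column-wise
# e.g., probability that a connection on port 80 (label j = 14) is benign (i = 1):
print(cond[0, 13])
```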
We demonstrated that our methodology can reconstruct a joint distribution of the different traffic types (benign and malicious) and the port usage of each class, with an error of less than $1 \times 10^{-4}$ at the gradient level. This level of accuracy allows for the precise identification of the amount of incoming malicious traffic on each port. Furthermore, since our methodology can be employed to reconstruct the joint distribution of any two variables, it can be utilized to pinpoint any measurable feature of malicious traffic.
The results presented in this paper not only emphasize the importance of understanding the feature space and decision boundaries of commercial WAFs, but also showcase the effectiveness of the proposed methodology in addressing a critical aspect of cybersecurity, namely the dependence between different types of attacks. By providing a robust and versatile tool for analyzing and interpreting the characteristics of malicious traffic, this research contributes to the cybersecurity community's ongoing efforts to develop and implement innovative solutions for safeguarding individuals and organizations in a constantly evolving digital landscape.
As part of future work, we recommend further exploring the applicability of our methodology to other types of security systems and exploring its potential for integration into existing intrusion detection frameworks. Additionally, ongoing research on the identification and analysis of measurable features for malicious traffic will help to refine and expand the capabilities of our approach, ultimately contributing to the development of more comprehensive and effective strategies for defending against cyber threats.

Author Contributions

Methodology, Software, Writing—review, H.G., E.t.H., A.T. and N.P.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The Vectorized Form of the Constraint Matrix

For numerical processing, we already mentioned that it is convenient to cast the $15 \times 15$ joint probability matrix as a vector $x \in \mathbb{R}^{225}$, subject to positivity and other constraints. Thus, it is also necessary to rewrite the constraints as linear or convex constraints upon $x$. For that, we denote by $\mathbf{1}$ (resp. $\mathbf{0}$) the vector in $\mathbb{R}^{15}$ with all entries equal to 1 (resp. equal to 0), let $\mathbf{1}^t$ and $\mathbf{0}^t$ be their transposes, and let $I$ denote the $15 \times 15$ identity matrix. We then form the $30 \times 225$ matrix:
$$C = \begin{pmatrix} \mathbf{1}^t & \mathbf{0}^t & \cdots & \mathbf{0}^t \\ \mathbf{0}^t & \mathbf{1}^t & \cdots & \mathbf{0}^t \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}^t & \mathbf{0}^t & \cdots & \mathbf{1}^t \\ I & I & \cdots & I \end{pmatrix}$$
From now on, the remainder continues as described in Section 3.
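As a sanity check, $C$ can be assembled with two Kronecker products; this is a sketch in Python, with numpy as an assumed dependency.

```python
import numpy as np

n = 15
C = np.vstack([np.kron(np.eye(n), np.ones(n)),    # rows 1-15: the 1^t blocks (row sums)
               np.kron(np.ones(n), np.eye(n))])   # rows 16-30: [I I ... I] (column sums)
assert C.shape == (30, 225)

# Verify on a random joint distribution, vectorized lexicographically.
P = np.random.dirichlet(np.ones(n * n)).reshape(n, n)
x = P.ravel()
assert np.allclose(C[:n] @ x, P.sum(axis=1))      # row marginals
assert np.allclose(C[n:] @ x, P.sum(axis=0))      # column marginals
```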
Table A1. Table with descriptions of attacks.
Brute-force attack: A set of agents attempts to connect via SSH or FTP to machines within the network by guessing the system's password.
Heartbleed attack: Heartbleed is a vulnerability found in the OpenSSL library that allows an attacker to leak memory data from a system. Some vulnerable machines were added to the network, and the heartleech program was used to exploit them.
Botnet: Botnets are computers infected with malicious software and controlled as a group without the owners' knowledge. Some computers within the network were infected with Zeus and other types of Trojans used to create botnets.
Denial-of-Service (DoS): This attack seeks to shut down a machine or network, making it inaccessible to its intended users. An agent was used to attack the systems within the network using the Slowloris variant of this attack.
Distributed Denial-of-Service (DDoS): DDoS is a variant of the DoS attack in which multiple agents (usually a botnet) are used to overwhelm the target with a huge number of requests. An agent was tasked with performing stress tests on the services to simulate a DDoS.
Web attacks: Web attacks aim to exploit vulnerable web applications (such as websites). Some computers within the network ran a vulnerable PHP/MySQL web application, and an agent was used to automatically exploit the vulnerabilities.
Infiltration of the network from inside: This scenario simulates the actions of an attacker who has gained control of a computer within the network and uses Nmap to perform an IP sweep, a full port scan, and service enumeration.

References

  1. Matatall, N.; Arseniev, M. Web Application Security; University of California: Irvine, CA, USA, 2008. [Google Scholar]
  2. Prandl, S.; Lazarescu, M.; Pham, D.S. A study of web application firewall solutions. In Information Systems Security; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; pp. 501–510. [Google Scholar]
  3. Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; Roli, F. Evasion attacks against machine learning at test time. In Advanced Information Systems Engineering; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 387–402. [Google Scholar]
  4. Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z.B.; Swami, A. Practical Black-Box Attacks against Machine Learning. In Proceedings of the ACM Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017; pp. 506–519. [Google Scholar] [CrossRef]
  5. Ahmad, Z.; Khan, A.S.; Shiang, C.W.; Abdullah, J.; Ahmad, F. Network intrusion detection system: A systematic study of machine learning and deep learning approaches. Trans. Emerg. Telecommun. Technol. 2020, 32, e4150. [Google Scholar] [CrossRef]
  6. Kasongo, S.M.; Sun, Y. Performance Analysis of Intrusion Detection Systems Using a Feature Selection Method on the UNSW-NB15 Dataset. J. Big Data 2020, 7, 105. [Google Scholar] [CrossRef]
  7. Khammassi, C.; Krichen, S. A GA-LR wrapper approach for feature selection in network intrusion detection. Comput. Secur. 2017, 70, 255–277. [Google Scholar] [CrossRef]
  8. Jiang, K.; Wang, W.; Wang, A.; Wu, H. Network Intrusion Detection Combined Hybrid Sampling With Deep Hierarchical Network. IEEE Access 2020, 8, 32464–32476. [Google Scholar] [CrossRef]
  9. Xu, X.; Rong, Z.; Tian, Z.; Wu, Z.X. Timescale diversity facilitates the emergence of cooperation-extortion alliances in networked systems. Neurocomputing 2019, 350, 195–201. [Google Scholar] [CrossRef]
  10. Gzyl, H. Construction of contingency tables by maximum entropy in the mean. Commun. Stat. Theory Methods 2021, 50, 4778–4786. [Google Scholar] [CrossRef]
  11. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the International Conference on Information Systems Security and Privacy, Funchal, Portugal, 22–24 January 2018. [Google Scholar]
  12. Rossow, C.; Dietrich, C.J.; Bos, H.; Cavallaro, L.; van Steen, M.; Freiling, F.C.; Pohlmann, N. Sandnet. In Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, Salzburg, Austria, 10–13 April 2011. [Google Scholar]
  13. Mullen, K.; Ardia, D.; Gil, D.; Windover, D.; Cline, J. DEoptim: An R Package for Global Optimization by Differential Evolution. J. Stat. Softw. 2011, 40, 1–26. [Google Scholar] [CrossRef]
Table 1. True joint probability distribution.

        Y = 1   Y = 2   Y = 3
X = 1   0.09    0.14    0.07
X = 2   0.07    0.23    0.16
X = 3   0.03    0.10    0.11
Table 2. Reconstructed probability distribution without errors.

        Y = 1   Y = 2   Y = 3
X = 1   0.06    0.14    0.10
X = 2   0.09    0.21    0.16
X = 3   0.04    0.11    0.08
Table 3. Reconstructed probability distribution with error $\delta = 0.005$.

        Y = 1   Y = 2   Y = 3
X = 1   0.06    0.14    0.10
X = 2   0.09    0.21    0.16
X = 3   0.04    0.11    0.08
Table 4. Reconstructed probability distribution from data. Rows index the traffic classes ($X = 1, \dots, 15$) and columns index the ports ($Y = 1, \dots, 15$); entries are in e-notation (e.g., 1.77e-2 $= 1.77 \times 10^{-2}$).

X\Y   1        2        3        4        5        6        7        8        9        10       11       12       13       14       15
1     1.77e-2  1.69e-4  1.93e-5  1.57e-6  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
2     9.14e-4  1.93e-4  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
3     1.70e-3  2.00e-5  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
4     2.21e-4  6.77e-5  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
5     3.93e-5  2.63e-6  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     1.45e-2  2.25e-6  0.00     1.05e-2
6     6.88e-4  7.05e-5  0.00     7.51e-8  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     1.40e-2  0.00     0.00
7     2.10e-3  2.10e-6  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
8     1.36e-1  1.02e-3  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
9     1.50e-1  2.19e-3  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
10    5.30e-2  4.10e-4  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
11    3.02e-1  3.78e-3  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
12    3.80e-3  8.49e-6  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
13    1.00e-3  8.64e-6  0.00     1.50e-7  7.51e-8  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
14    1.27e-1  4.37e-4  0.00     3.45e-5  1.69e-5  6.53e-6  3.11e-3  8.25e-4  4.32e-2  5.15e-2  1.30e-4  0.00     0.00     3.47e-2  0.00
15    4.33e-5  2.70e-6  2.11e-2  0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00

