3.1. Overview of ECML/PKDD 2007 Dataset
The ECML/PKDD 2007 dataset was introduced as part of the Discovery Challenge held at the 2007 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD). The dataset was created to provide a realistic, large-scale benchmark for evaluating machine learning models in the domain of web traffic analysis and attack detection. Unlike legacy datasets such as KDD’99, which were largely synthetic and tailored to lower-level network features, the ECML/PKDD 2007 dataset captures real HTTP request data. This distinction makes it especially relevant for application-layer intrusion detection systems, such as modern Web Application Firewalls (WAFs).
Although several more recent intrusion detection datasets, such as UNSW-NB15 and CICIDS2017, are now available, the present study employed the ECML/PKDD 2007 dataset due to its unique suitability for multilabel classification tasks. This dataset provides a well-structured benchmark with eight distinct labels (seven attack types and one normal class), allowing models to be trained and evaluated on overlapping, co-occurring attack categories. Furthermore, its use ensures continuity and comparability with our previous research, which focused on deep learning models using the same data source.
The decision to use the ECML/PKDD 2007 dataset exclusively in this study was made to maintain a controlled experimental environment and ensure fair, reproducible comparisons among the seven evaluated models. This dataset offers a well-labeled, multilabel structure that includes multiple types of web application attacks and normal traffic, aligning closely with the objectives of multi-class and multi-label classification. While newer datasets are available, many are not natively multilabel or differ significantly in structure, which could introduce inconsistencies when comparing model performance. By focusing on a single, consistent benchmark, this work provides a solid foundation for future research, where the best-performing models identified here can be further validated on more recent and diverse datasets.
The data was collected from a honeypot system designed to mimic vulnerable web applications. Honeypots are security mechanisms intended to attract malicious activity, allowing researchers to analyze real-world attack patterns. The traffic includes both legitimate and malicious HTTP requests directed at the honeypot, making the dataset rich in both clean and noisy signals. As such, the dataset provides a suitable environment for developing and benchmarking supervised learning techniques that aim to distinguish between normal and malicious behavior.
Each instance in the dataset corresponds to a single HTTP request and is associated with one or more class labels. In total, the dataset contains eight labels: one for normal traffic and seven representing distinct categories of web-based attacks [11]. These attack types include XPath Injection, OS Commanding, LDAP Injection, Path Traversal, SQL Injection, Server-Side Includes (SSI), and Cross-Site Scripting (XSS). This structure supports multilabel classification, as individual requests may trigger multiple types of attacks. For example, a single request might simultaneously exploit SQL Injection and Path Traversal vulnerabilities.
The inclusion of multilabel samples introduces additional complexity not typically addressed in older intrusion detection datasets. Most legacy datasets treat intrusion detection as a binary or multiclass classification problem. However, in real-world environments, web-based attacks often overlap or exploit multiple weaknesses within the same request. By preserving multilabel relationships, the ECML/PKDD 2007 dataset faithfully reflects these realities and enables a more accurate evaluation of machine learning algorithms designed for WAF applications.
From a technical standpoint, the dataset contains approximately 60,000 samples, although the precise number may vary slightly depending on how preprocessing is handled. Each record includes raw HTTP request components such as the method (e.g., GET or POST), the URL and query string, headers, payload parameters, and a timestamp. The dataset does not initially include pre-engineered features for model training; significant preprocessing and feature engineering are therefore required before it can be used with machine learning models, a topic addressed in detail in the next subsection.
The number of features extracted from each request can vary depending on the encoding and representation chosen by the researcher. In typical preprocessing pipelines, features such as request length, number of special characters, parameter entropy, and n-gram frequency (from URLs or payloads) are commonly used. After feature transformation, most implementations result in 30 to 50 numerical or categorical features per sample.
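For illustration, one common way to realize the n-gram features mentioned above is a character-level TF-IDF representation. The sketch below uses scikit-learn’s TfidfVectorizer; the n-gram range and feature cap are illustrative choices, not values prescribed by the dataset or by this study’s pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character-level 2- to 4-grams encode raw URLs and payloads without
# hand-crafted signatures; max_features caps the vocabulary (here 50,
# matching the 30-50 feature range discussed above).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4), max_features=50)

requests = [
    "GET /index.php?id=42",
    "GET /index.php?id=1%27%20OR%20%271%27=%271",  # URL-encoded SQL injection probe
]
X = vectorizer.fit_transform(requests)  # sparse matrix of shape (n_samples, <=50)
print(X.shape)
```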
The time span of the data collection process extended over several weeks. Although the dataset documentation does not specify exact start and end dates, the prolonged capture period ensured the inclusion of both periodic traffic patterns and a diverse set of attack attempts. This temporal depth is valuable for training and evaluating models that may incorporate time-aware architectures such as RNNs or LSTMs.
In terms of class distribution, the dataset is significantly imbalanced. Normal traffic forms the largest single class, while attack categories such as OS Commanding or XSS are comparatively rare. This imbalance poses a challenge for classification models, especially in multilabel settings where minority classes may be overlooked.
Figure 1 shows the proportion of samples per class, offering insight into the data distribution and highlighting the need for appropriate metric selection and sampling strategies during model evaluation.
The ECML/PKDD 2007 dataset’s design and contents make it highly relevant for research in machine learning-based intrusion detection, particularly at the application layer. Despite its value, it remains underutilized in the literature compared to more widely adopted datasets such as NSL-KDD or CICIDS2017. Most existing studies that use this dataset focus on binary or multiclass classification and often simplify its multilabel structure by choosing only one label per request, typically the first or most prominent one. This practice, while simplifying model design, limits the realism and effectiveness of the resulting classifiers in real-world settings.
In this study, we preserve and leverage the full multilabel structure of the ECML/PKDD 2007 dataset. All seven attack categories are treated independently, and models are evaluated based on their ability to detect one or more attack types per sample. This approach provides a more accurate and complete assessment of model performance and reflects the operational requirements of WAFs deployed in production environments.
Overall, the ECML/PKDD 2007 dataset presents a valuable, realistic, and suitably complex benchmark for multilabel classification tasks in web traffic analysis. Its use in this paper enables a fair and consistent comparison of both classical and deep learning models, contributing to a more informed understanding of their respective strengths and limitations in WAF applications.
3.2. Data Preprocessing and Feature Engineering
Effective data preprocessing and feature engineering are essential steps in any machine learning pipeline, particularly when working with raw web traffic logs such as those found in the ECML/PKDD 2007 dataset. Since the dataset was collected from a honeypot system, it contains raw HTTP requests and associated metadata, which must be transformed into a structured format suitable for machine learning models [12].
The dataset was originally provided in text format, where each line corresponds to an HTTP request, including its method, requested URL, query parameters, headers, and associated labels. These raw inputs are rich in information but require significant processing to extract meaningful features and ensure compatibility with machine learning algorithms.
3.2.1. Data Cleaning
The first stage of preprocessing involved removing corrupted, incomplete, or malformed HTTP requests. Some entries contained missing fields or improperly encoded characters, which could introduce noise or lead to parsing errors during feature extraction. These were either corrected using fallback heuristics or excluded from the dataset if recovery was not possible.
Headers and payloads were decoded using UTF-8 encoding where applicable, and common formatting issues (e.g., escape characters, null bytes, and HTML entities) were resolved using regular expressions and standard cleaning libraries. Duplicate entries, if detected, were also removed to prevent bias in model training.
In addition, non-ASCII characters, excessively long request lines (e.g., >5000 characters), and unsupported HTTP methods were filtered out during initial parsing to standardize the dataset. This step ensured consistency across all samples and simplified feature extraction.
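As a minimal sketch of the cleaning rules described in this subsection (UTF-8 decoding with a fallback, null-byte and HTML-entity removal, the 5000-character length cap, and method filtering), the function below illustrates one possible implementation; the method whitelist is an assumption, since the text does not enumerate the supported methods.

```python
import re
from html import unescape
from typing import Optional

ALLOWED_METHODS = {"GET", "POST", "HEAD", "PUT", "DELETE", "OPTIONS"}  # assumed whitelist
MAX_LINE_LEN = 5000  # length cap mentioned above

def clean_request_line(raw: bytes) -> Optional[str]:
    """Return a cleaned request line, or None if the entry should be dropped."""
    try:
        line = raw.decode("utf-8")
    except UnicodeDecodeError:
        line = raw.decode("utf-8", errors="replace")  # fallback heuristic
    line = line.replace("\x00", "")            # strip null bytes
    line = unescape(line)                      # resolve HTML entities
    line = re.sub(r"\s+", " ", line).strip()   # normalize escapes/whitespace
    if len(line) > MAX_LINE_LEN or not line.isascii():
        return None                            # filter over-long or non-ASCII lines
    method = line.split(" ", 1)[0]
    if method not in ALLOWED_METHODS:
        return None                            # drop unsupported HTTP methods
    return line
```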
3.2.2. Label Processing
The ECML/PKDD 2007 dataset supports multilabel classification, meaning a single HTTP request may belong to more than one attack category. The raw label field often contains a list of labels, which were parsed and encoded into a binary vector for each sample. Each of the eight classes (seven attack types plus normal) was mapped to a specific position in a binary array. For instance, a request labeled as both “XPathInjection” (second position in the vector) and “SQLInjection” (sixth position) would be represented as [0, 1, 0, 0, 0, 1, 0, 0].
Requests with no attack label were assigned the “Normal” class. In cases where “Normal” appeared alongside attack labels, it was removed to preserve the semantics of anomalous behavior.
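The label encoding described above can be reproduced with scikit-learn’s MultiLabelBinarizer, as in the sketch below; the fixed class order is taken from the vector positions given in the example, and the list comprehension implements both rules (assigning “Normal” to unlabeled requests and dropping “Normal” when it co-occurs with attack labels).

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fixed label order so each class maps to a stable vector position.
CLASSES = ["Normal", "XPathInjection", "OSCommanding", "LDAPInjection",
           "PathTraversal", "SQLInjection", "SSI", "XSS"]
mlb = MultiLabelBinarizer(classes=CLASSES)

raw_labels = [
    ["XPathInjection", "SQLInjection"],  # multilabel attack request
    [],                                  # no attack label -> Normal
]
# Drop "Normal" when attack labels are present; assign it when no label exists.
processed = [[l for l in ls if l != "Normal"] or ["Normal"] for ls in raw_labels]
Y = mlb.fit_transform(processed)
print(Y)  # [[0 1 0 0 0 1 0 0]
          #  [1 0 0 0 0 0 0 0]]
```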
3.2.3. Feature Extraction
Each HTTP request was processed to extract multiple types of features relevant to detecting malicious behavior. Lexical features captured string characteristics from the URL and query parameters, such as length, special character counts, and entropy. Statistical features included the number of parameters, token counts, and the presence of common attack keywords like SELECT, UNION, or <script>. Boolean indicators were used to flag suspicious patterns, including encoded characters, shell commands, or known exploit substrings. Additionally, header-based features such as request method, content length, and user-agent properties were included.
To improve detection of obfuscation, character-level analysis and tokenization were applied, enabling the identification of symbol patterns and encoded inputs. These features together provided a detailed representation of the request structure and content for model training.
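The sketch below illustrates a handful of the lexical, statistical, and Boolean features just described; the keyword list and feature set are simplified stand-ins for the fuller feature inventory used in practice.

```python
import math
import re
from collections import Counter

ATTACK_KEYWORDS = ("select", "union", "<script", "../")  # illustrative subset

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy, a common obfuscation indicator."""
    if not s:
        return 0.0
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def extract_features(url: str, query: str) -> dict:
    text = url + query
    return {
        "length": len(text),                                  # lexical
        "special_chars": len(re.findall(r"[^\w\s]", text)),   # lexical
        "entropy": shannon_entropy(query),                    # lexical
        "num_params": query.count("&") + 1 if query else 0,   # statistical
        "has_keyword": int(any(k in text.lower() for k in ATTACK_KEYWORDS)),  # Boolean
        "has_encoding": int("%" in text),                     # Boolean: encoded input
    }

print(extract_features("/search.php?", "q=1' UNION SELECT password FROM users--"))
```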
3.2.4. Categorical Encoding
Several features, such as HTTP method and protocol version, are categorical in nature. These were encoded using one-hot encoding to avoid imposing ordinal assumptions. For example, the HTTP method feature was expanded into binary features for GET, POST, HEAD, and so on. Similarly, user-agent types were grouped into high-level categories (e.g., browser, bot, unknown) and encoded accordingly.
For deep learning models, where embedding layers are used, categorical features were optionally converted into integer indices instead of one-hot vectors. This approach was especially useful in models where the number of distinct categories was large (e.g., for tokenized user-agents or paths).
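Both encoding strategies can be sketched as follows; the snippet assumes scikit-learn 1.2 or later (for the sparse_output argument), and the method list is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

methods = np.array([["GET"], ["POST"], ["HEAD"], ["GET"]])

# One-hot encoding for classical models; handle_unknown guards against
# methods unseen at fit time.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
onehot = encoder.fit_transform(methods)  # shape (4, 3): one column per method

# Integer indices for deep models with embedding layers.
vocab = {m: i for i, m in enumerate(encoder.categories_[0])}
indices = [vocab[m[0]] for m in methods]  # [0, 2, 1, 0] (alphabetical category order)
```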
3.2.5. Normalization and Scaling
After numerical features were extracted, they were normalized to ensure uniformity across all input dimensions. Features with large ranges or skewed distributions were scaled using min–max normalization to the range [0, 1]. Alternatively, z-score standardization [13] was used when the feature distributions approximated Gaussian behavior.
Scaling was particularly important for classical machine learning models such as SVM, which are sensitive to the magnitude of input features (tree-based ensembles such as Gradient Boosting are largely insensitive to feature scale). For deep learning models, normalization also contributed to faster convergence during training.
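A minimal sketch of both scaling options follows, with the usual caveat that scalers must be fitted on the training split only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[120.0, 3.0], [5400.0, 0.0], [860.0, 12.0]])  # toy feature matrix

# Min-max normalization to [0, 1] for skewed or bounded features.
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization for approximately Gaussian features.
X_zscore = StandardScaler().fit_transform(X)

# In practice: fit on the training split, then transform validation/test
# splits with the same fitted scaler to avoid information leakage.
```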
3.2.6. Handling Missing Values
Most missing values were encountered during the feature extraction stage, typically when specific fields such as content-length, headers, or parameters were absent from an HTTP request. In these situations, numerical attributes were imputed either with zero to indicate absence or with the median value of the corresponding feature to preserve its statistical distribution. Categorical features were assigned a default “unknown” label or treated as a distinct category to prevent loss of information related to missing data patterns. Additionally, Boolean indicators were introduced to explicitly mark the absence of certain fields, allowing models to learn whether missing information itself carried predictive significance.
The selection of imputation strategies was informed by their observed effect on model accuracy and stability during preliminary validation experiments. Different methods were tested to ensure that the chosen approach minimized bias and maintained consistency across all feature types, thereby supporting reliable model training and evaluation.
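The three imputation strategies described above (zero or median filling, an explicit “unknown” category, and missingness indicators) can be sketched with pandas; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "content_length": [512.0, np.nan, 1024.0],    # hypothetical numeric field
    "user_agent_type": ["browser", None, "bot"],  # hypothetical categorical field
})

# Boolean indicator so models can learn from the missingness pattern itself.
df["content_length_missing"] = df["content_length"].isna().astype(int)

# Numeric: median imputation preserves the feature's distribution
# (zero imputation is the alternative when absence means "none").
df["content_length"] = df["content_length"].fillna(df["content_length"].median())

# Categorical: missing values become an explicit "unknown" category.
df["user_agent_type"] = df["user_agent_type"].fillna("unknown")
```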
3.2.7. Dataset Summary After Preprocessing
After applying all preprocessing and feature extraction steps, the dataset was transformed into a structured format consisting of numeric and binary features, ready for input into machine learning models.
Table 3 summarizes the key characteristics of the dataset after transformation.
The structured dataset was then shuffled and split into training, validation, and test sets. Given the multilabel nature of the ECML/PKDD 2007 dataset, a stratified multilabel split was applied to preserve the label distributions across all subsets. Specifically, 70% of the data was allocated for training, 15% was used for validation to guide model optimization and prevent overfitting, and the remaining 15% was reserved as an independent test set for assessing generalization performance. This partitioning ensured a balanced trade-off between sufficient training data and fair, reliable evaluation across all models.
During preliminary experiments, several dataset partitioning strategies were evaluated to determine the most stable configuration for model training and validation. Ratios such as 80% training/10% validation/10% testing and 60% training/20% validation/20% testing were tested but led to slightly lower or less consistent results. In particular, configurations with smaller validation sets resulted in weaker early-stopping behavior for recurrent models, while larger validation splits reduced the available data for training and affected model convergence. Therefore, the 70%/15%/15% division was adopted, as it offered the optimal balance between training data sufficiency and reliable validation and testing performance across all evaluated models.
To perform a stratified split in a multilabel setting, we used the iterative stratification method, which balances the presence of each label across the splits. This approach ensures that rare labels (e.g., SSI, XSS) are present in all sets, avoiding biased evaluation due to missing classes during training or testing.
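The text does not name a specific implementation; one widely used option is scikit-multilearn’s iterative_train_test_split, shown below with placeholder data. Two successive splits yield the 70%/15%/15% partition.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

rng = np.random.default_rng(42)
X = rng.random((1000, 40))                      # placeholder feature matrix
Y = (rng.random((1000, 8)) < 0.15).astype(int)  # placeholder binary label matrix

# 70% train, then split the remaining 30% in half: 15% validation, 15% test.
X_train, Y_train, X_tmp, Y_tmp = iterative_train_test_split(X, Y, test_size=0.30)
X_val, Y_val, X_test, Y_test = iterative_train_test_split(X_tmp, Y_tmp, test_size=0.50)
```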
3.3. Label Distribution and Attack Types
This subsection provides an overview of the dataset’s label distribution and the types of attacks considered in the analysis.
3.3.1. Label Categories and Definitions
The eight labels included in the dataset are defined as follows:
Normal: Legitimate traffic without any malicious behavior.
XPathInjection: Attempts to manipulate XPath queries through crafted input.
OSCommanding: Injection of operating system commands intended for execution on the host system.
LDAPInjection: Attempts to exploit LDAP query structures by injecting unauthorized code.
PathTraversal: Accessing unauthorized files or directories by exploiting file path traversal.
SQLInjection: Insertion of malicious SQL queries to manipulate databases.
SSI: Exploitation of Server Side Includes to inject or execute unauthorized directives.
XSS: Injection of malicious scripts into web content that is rendered by the client.
3.3.2. Class Distribution
The distribution of samples across labels is highly imbalanced, which has important implications for model training and evaluation. The Normal class dominates the dataset with 10,289 samples, while attack classes vary between 1273 and 2220 samples each; see Table 4.
3.3.3. Discussion of Label Imbalance
Although Normal traffic remains the largest single class, the seven attack types together account for a comparable share of the samples. This makes the ECML/PKDD 2007 dataset relatively well-balanced within attack categories, although skewed overall in favor of normal traffic.
This structure contrasts with many older intrusion detection datasets (such as NSL-KDD or KDD’99), where the imbalance is often extreme, and attack types are collapsed into broad categories. In this case, the moderate imbalance still presents challenges—especially for classifiers that may overfit to the dominant class without adequate weighting or regularization—but it also allows for more meaningful evaluation across attack types.
3.3.4. Multilabel Characteristics
The dataset supports multilabel assignments, meaning some HTTP requests are tagged with more than one attack label. For example, a request containing both OS-level commands and SQL query fragments may be labeled with both OSCommanding and SQLInjection. This is important in realistic WAF environments, where attackers often craft payloads that target multiple vulnerabilities simultaneously.
While the majority of the dataset samples are single-labeled, a significant minority (around 5.2% of attack samples) exhibits label co-occurrence, which must be taken into account during training. This motivates the use of multilabel-aware evaluation metrics, such as macro/micro F1-score, subset accuracy, and Hamming loss.
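These metrics are all available in scikit-learn; the toy example below evaluates predictions against the eight-position label vectors from Section 3.2.2.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

y_true = np.array([[0, 1, 0, 0, 0, 1, 0, 0],   # XPathInjection + SQLInjection
                   [1, 0, 0, 0, 0, 0, 0, 0]])  # Normal
y_pred = np.array([[0, 1, 0, 0, 0, 0, 0, 0],   # SQLInjection missed
                   [1, 0, 0, 0, 0, 0, 0, 0]])

print("macro F1:       ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("micro F1:       ", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("subset accuracy:", accuracy_score(y_true, y_pred))  # exact-match ratio
print("Hamming loss:   ", hamming_loss(y_true, y_pred))
```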
3.3.5. Implications for Model Training
From a machine learning perspective, this label distribution requires careful handling. Strategies such as class weighting, threshold tuning, or oversampling of minority classes may be necessary to prevent performance degradation on less frequent classes like SSI or XSS. Additionally, multilabel classification introduces complexity in loss function design and output interpretation, particularly in deep learning models.
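As one concrete option, inverse-frequency class weights and per-label threshold tuning can be sketched as follows; the weighting formula and threshold grid are illustrative, not the study’s fixed recipe.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
Y_train = (rng.random((1000, 8)) < 0.15).astype(int)  # placeholder label matrix

# Inverse-frequency weights: rare labels (e.g., SSI, XSS) receive larger weights.
label_counts = np.maximum(Y_train.sum(axis=0), 1)
class_weights = Y_train.shape[0] / (Y_train.shape[1] * label_counts)

# Per-label decision thresholds, tuned on validation data, are an alternative
# to a fixed 0.5 cutoff for every class.
def tune_threshold(y_true, probs, grid=np.linspace(0.1, 0.9, 17)):
    return max(grid, key=lambda t: f1_score(y_true, (probs >= t).astype(int),
                                            zero_division=0))
```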
In summary, the ECML/PKDD 2007 dataset offers a nuanced and realistic label distribution, both in terms of class balance and multilabel structure. It enables a robust evaluation of machine learning models under conditions that closely resemble actual WAF use cases, where traffic can be diverse, noisy, and simultaneously malicious in multiple ways.