Abstract
We present a real-world dataset capturing thirty consecutive days of malicious HTTP traffic filtered and blocked by the OWASP ModSecurity Web Application Firewall (WAF) on a live production server. Each entry corresponds to a request that triggered one or more rules in the OWASP Core Rule Set (CRS), resulting in its inclusion in the audit log due to suspected exploitation attempts. The dataset includes attack categories such as SQL injection, cross-site scripting (XSS), local file inclusion, scanner probes, and various malformed or evasive input forms. The data has been carefully anonymized to protect sensitive information while preserving critical structural tags, including request method, URI, triggered rule IDs, request headers, and user-agent strings. This dataset provides a real-world resource for cybersecurity researchers, particularly those developing or evaluating intrusion detection systems (IDSs), WAF rule tuning strategies, anomaly detection algorithms, and adversarial machine learning models. The dataset also allows performance testing of threat prevention pipelines. By making this dataset publicly available, we aim to support reproducible research in web security, encourage benchmarking of detection techniques under real-world conditions, and contribute insight into the nature of contemporary web-based threats observed in an uncontrolled environment.
1. Introduction
Today, web applications remain a primary target for attackers due to their direct exposure and frequent role as entry points to sensitive systems and data. Although Web Application Firewalls (WAFs) such as OWASP ModSecurity [] are widely deployed [,] to mitigate common exploitation attempts, the academic community often lacks open, real-world data to validate detection and defense strategies. Many public datasets in this area originate from controlled experiments []. A review of 89 publicly available NIDS datasets revealed that a substantial proportion were generated under experimental conditions, involving simulated or emulated components rather than authentic network traffic with natural background noise [].
This paper introduces a dataset of blocked HTTP requests collected over a continuous thirty-day period from a production server operating in an uncontrolled, adversarial environment. The dataset provides request-level details and is anonymized to protect sensitive information while retaining attack-relevant features such as CRS rule IDs [] and payload and HTTP header information.
Table 1 summarizes the key characteristics of widely used datasets and contrasts them with our own. Our goal is to bridge the gap between simulated and real data over HTTP(S) transport, forming the basis for the contributions discussed in the following sections.
Table 1.
Dataset comparison.
1.1. Key Contributions
The dataset exhibits several key properties that enhance its research value. First, it demonstrates realism, as the traffic reflects live attack attempts against a real-world application environment without reliance on synthetic injection. Second, it ensures diversity, spanning multiple attack classes that include, but are not limited to, injection attempts, scanner probes, and obfuscation strategies. Third, it is anonymized with utility preservation: sensitive identifiers have been sanitized, while the structural features essential for analysis have been retained. Finally, it emphasizes accessibility, being provided in its original raw format to facilitate use across diverse research domains.
1.2. Research Utility
The dataset provides multiple avenues for empirical research, including benchmarking IDS/IPS models against much-needed real-world attack data [] and developing anomaly detection algorithms that are sensitive to obfuscated payloads. It also enables the study of automated scanning and bot-driven exploitation attempts, as well as the evaluation of rule tuning and false-positive mitigation strategies in WAFs. Moreover, the dataset can be applied to training adversarial machine learning models with real attack payloads while adhering to anonymization constraints.
2. Data and Description
The dataset is available on Zenodo [] under the Creative Commons Attribution 4.0 International license. It consists of request-level ModSecurity audit log entries from a live web server protected by ModSecurity v2.9.3-1+deb10u2 with the OWASP CRS v3.2.3-0+deb10u3. The source server was part of a commercial fleet running multiple WordPress installations with the WooCommerce e-commerce add-on, as well as custom PHP scripts managed by third-party developers. Customer-managed WordPress installations tend to receive updates late, if at all, and are therefore a common target for exploitation [,].
In our case, ModSecurity was configured to maximize end-user performance; therefore, a single additional custom rule (ID: 444444) was added for “early denial” of access to “well-known” abusive bots based on the user-agent string. The exact rule can be found in the supporting OWASP server configuration documentation; however, structurally, it lists all discovered data scraper bots under a single rule ID and is expected to dominate the statistics.
To protect WordPress’s daily operations, seven rules were removed for known detrimental interactions (ID: 941160, 949110, 980130, 941100, 932110, 200004, 932100) based on customer feedback. This means that when genuine customers encountered a WAF-based rejection while operating either the front-end or back-end of WordPress, the conflicting rule was deactivated. Having an early rejection option and a simplified ruleset have also been found to significantly boost performance [].
Only those anomalous requests that had been tagged by the WAF module are included in the dataset. The exact server configurations are available from GitHub and Zenodo []. Using these files will set up the same logging pipeline utilized in the data collection.
2.1. Scope
- Coverage period: 30 consecutive days (27 July 2025–25 August 2025), randomly selected
- Total requests: 147,205
- Daily average: 4907
- Data format: raw ModSecurity audit log
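The scope figures can be cross-checked with a few lines of arithmetic. The snippet below only re-derives the day count and the daily average from the numbers listed above; the variable names are illustrative.

```python
from datetime import date

# Figures taken from the scope list above.
first_day = date(2025, 7, 27)
last_day = date(2025, 8, 25)
total_requests = 147_205

# Both endpoints belong to the collection window, hence the +1.
days = (last_day - first_day).days + 1
daily_average = round(total_requests / days)

print(days, daily_average)
```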
2.2. Dataset Fields
Each record was assigned a unique identifier {UNIQUE_ID}, and the audit data was divided into a maximum of six sections (not including the closing section).
The --{UNIQUE_ID}-A-- section contains the audit log header, which provides general metadata about the transaction, including the unique transaction ID, client IP address, timestamp, host, and request line (method, URI, and protocol).
The --{UNIQUE_ID}-B-- section stores the request headers, capturing all HTTP request headers as received from the client.
The --{UNIQUE_ID}-C-- section records the request body, typically consisting of POST data or payloads.
The --{UNIQUE_ID}-E-- section holds the intermediary response body, which is not always available. This part contains the response body as observed at the end of the response phase, before any transformations or filtering applied by ModSecurity or the server.
The --{UNIQUE_ID}-F-- section includes the final response headers that ModSecurity and other web server modules sent back to the client.
The --{UNIQUE_ID}-H-- section provides audit log trailer information, including details about rules triggered, actions taken, tagging data, and performance metrics.
Finally, the --{UNIQUE_ID}-Z-- section serves as the audit log terminator, marking the end of the log entry for that transaction.
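The section layout described above can be read with a small boundary-driven parser. The following is a minimal sketch, not the tooling used to build the dataset: the function name `iter_transactions` and the shortened sample entry are illustrative, and it assumes that every boundary sits on its own line in the raw files.

```python
import re
from typing import Dict, Iterator, List

# Boundary lines look like "--c30afe70-A--": an entry-local ID plus a section letter.
BOUNDARY = re.compile(r"^--([0-9A-Za-z]+)-([A-Z])--\s*$")

def iter_transactions(lines) -> Iterator[Dict[str, str]]:
    """Yield one dict per audit log entry, mapping section letter to its text."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        match = BOUNDARY.match(line)
        if match is None:
            if current is not None:
                sections[current].append(line.rstrip("\n"))
            continue
        letter = match.group(2)
        if letter == "Z":
            # The terminator closes the entry; emit what was collected so far.
            yield {key: "\n".join(body).strip() for key, body in sections.items()}
            sections, current = {}, None
        else:
            current = letter
            sections[current] = []

# Shortened entry for illustration only.
sample = """--c30afe70-A--
[01/Aug/2025:19:59:37 +0200] aI0AibHA3gfwyP1Wo76@lwAAAAg 100.122.171.187 47574 100.115.127.60 443
--c30afe70-B--
GET /.env HTTP/1.1
Host: www.d4941dc.hu

--c30afe70-F--
HTTP/1.1 404 Not Found

--c30afe70-Z--
"""
entries = list(iter_transactions(sample.splitlines()))
print(sorted(entries[0]))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which avoids loading a full daily log into memory. A request body that happens to contain a boundary-like line would confuse this simple approach.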
2.3. Data Sample
We include a representative log excerpt to illustrate the plain-text raw log format. The example shows the structure of a dataset entry and how readily it can be handled by text-based processing utilities, so that researchers can quickly adapt existing log analysis tools or develop new ones for experimental purposes.
--c30afe70-A--
[01/Aug/2025:19:59:37 +0200] aI0AibHA3gfwyP1Wo76@lwAAAAg 100.122.171.187 47574 100.115.127.60 443
--c30afe70-B--
GET /.env HTTP/1.1
Host: www.d4941dc.hu
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Connection: keep-alive
Accept-Encoding: gzip
--c30afe70-F--
HTTP/1.1 404 Not Found
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Fri, 08 Dec 2023 13:47:02 GMT
ETag: "70f-60bffd1aa7854"
Accept-Ranges: bytes
Content-Length: 1807
Keep-Alive: timeout=1, max=100
Content-Type: text/html
--c30afe70-H--
Message: Warning. Matched phrase "/.env" at REQUEST_FILENAME. [file "/usr/share/modsecurity-crs/rules/REQUEST-930-APPLICATION-ATTACK-LFI.conf"] [line "125"] [id "930130"] [msg "Restricted File Access Attempt"] [data "Matched Data: /.env found within REQUEST_FILENAME: /.env"] [severity "CRITICAL"] [ver "OWASP_CRS/3.2.3"] [tag "application-multi"] [tag "language-multi"] [tag "platform-multi"] [tag "attack-lfi"] [tag "OWASP_CRS"] [tag "OWASP_CRS/WEB_ATTACK/FILE_INJECTION"] [tag "WASCTC/WASC-33"] [tag "OWASP_TOP_10/A4"] [tag "PCI/6.5.4"]
Apache-Error: [file "apache2_util.c"] [line 273] [level 3] [client 100.122.171.187] ModSecurity: Warning. Matched phrase "/.env" at REQUEST_FILENAME. [file "/usr/share/modsecurity-crs/rules/REQUEST-930-APPLICATION-ATTACK-LFI.conf"] [line "125"] [id "930130"] [msg "Restricted File Access Attempt"] [data "Matched Data: /.env found within REQUEST_FILENAME: /.env"] [severity "CRITICAL"] [ver "OWASP_CRS/3.2.3"] [tag "application-multi"] [tag "language-multi"] [tag "platform-multi"] [tag "attack-lfi"] [tag "OWASP_CRS"] [tag "OWASP_CRS/WEB_ATTACK/FILE_INJECTION"] [tag "WASCTC/WASC-33"] [tag "OWASP_TOP_10/A4"] [tag "PCI/6.5.4"] [hostname "www.d4941dc.hu"] [uri "/.env"] [unique_id "aI0AibHA3gfwyP1Wo76@lwAAAAg"]
Stopwatch: 1754071177142684 2452 (- - -)
Stopwatch2: 1754071177142684 2452; combined=1448, p1=595, p2=743, p3=0, p4=0, p5=108, sr=100, sw=2, l=0, gc=0
Producer: ModSecurity for Apache/2.9.3 (http://www.modsecurity.org/); OWASP_CRS/3.2.3.
Server: Apache
Engine-Mode: "ENABLED"
--c30afe70-Z--
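The bracketed fields of an alert line lend themselves to simple pattern matching. The sketch below extracts the rule ID, message, and tags from a shortened copy of the alert shown in the excerpt. It assumes that the raw files use plain ASCII double quotes and that field values contain no embedded double quotes; `parse_message` is an illustrative helper, not part of any ModSecurity tooling.

```python
import re

# Fields in an alert line look like [name "value"]; the tag field may repeat.
FIELD = re.compile(r'\[(\w+) "([^"]*)"\]')

def parse_message(line: str) -> dict:
    """Collect the bracketed fields of one 'Message:' line from section H."""
    parsed = {"tags": []}
    for name, value in FIELD.findall(line):
        if name == "tag":
            parsed["tags"].append(value)
        else:
            parsed[name] = value
    return parsed

# Shortened version of the alert from the excerpt.
message = (
    'Message: Warning. Matched phrase "/.env" at REQUEST_FILENAME. '
    '[file "/usr/share/modsecurity-crs/rules/REQUEST-930-APPLICATION-ATTACK-LFI.conf"] '
    '[line "125"] [id "930130"] [msg "Restricted File Access Attempt"] '
    '[severity "CRITICAL"] [tag "attack-lfi"] [tag "OWASP_CRS"]'
)
alert = parse_message(message)
print(alert["id"], alert["msg"], alert["tags"])
```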
2.4. Data Organization
The dataset is distributed as a compressed archive, with a separate folder representing each day of the thirty-day observation period. The folder names correspond to calendar dates in the format dd-MMM-yyyy. The raw ModSecurity audit data is stored in a single file named modsec_audit.anon.log within each folder. The day-by-day separation allows for both incremental loading and parallelized analysis.
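The per-day layout can be walked with the standard library alone. The following sketch assumes that the archive has been extracted and that folder names use English month abbreviations (for example, 27-Jul-2025, a made-up illustration of the dd-MMM-yyyy pattern); `daily_logs` is a hypothetical helper, and the demonstration runs on a throwaway directory tree rather than on the real data.

```python
import tempfile
from datetime import datetime
from pathlib import Path

LOG_NAME = "modsec_audit.anon.log"

def daily_logs(root: Path):
    """Return (date, log path) pairs for every day folder, in calendar order."""
    found = []
    for folder in root.iterdir():
        log_file = folder / LOG_NAME
        if not log_file.is_file():
            continue
        try:
            # dd-MMM-yyyy; %b follows the active locale (English by default).
            day = datetime.strptime(folder.name, "%d-%b-%Y").date()
        except ValueError:
            continue  # skip folders that do not follow the naming scheme
        found.append((day, log_file))
    return sorted(found)

# Demonstration on a temporary tree; the folder names are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("01-Aug-2025", "27-Jul-2025", "notes"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / LOG_NAME).write_text("")
    demo_days = [day.isoformat() for day, _ in daily_logs(Path(tmp))]

print(demo_days)
```

Sorting by parsed date rather than by folder name matters here, because names in this format do not sort chronologically as plain strings.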
3. Methods
3.1. Data Collection
For our data collection, we used ModSecurity as the “standard” open-source WAF and the best current candidate to receive machine learning-based extensions []. All requests that matched any WAF rule were logged. Logs were written in the native ModSecurity audit log format with full metadata [].
3.2. Anonymization
Because raw logs contain sensitive client and server data, a deterministic anonymization was applied, meaning that the same input always yields the same output. For IP addresses, this involves a random table lookup between source and target ranges, which eliminates collisions. Text-based data, in contrast, is passed through a one-way hash with a fixed salt and truncated to a maximum length of 16 bytes (SHA-256(salt + payload)[:16]), which keeps the chance of collisions low. These methods keep the replacements consistent across files while preventing the recovery of the original values. We balanced privacy preservation against data fidelity [] with the following transformations.
- IPv4 addresses were consistently remapped to 100.64.0.0/10.
- IPv6 addresses were consistently remapped to 2001:db8::/32.
- Domains (Host, Authority, in URLs): The public suffix/TLD and the label count were preserved, and the left labels were anonymized.
- Directory paths in HTTP(S) URLs and request targets were anonymized to per-segment stable tokens with the slashes preserved.
- Filesystem access paths starting with /var/www/ were anonymized per segment, and the final filename was preserved.
- Cookies and sensitive query parameters were retained, but their values were anonymized. If a value contained a domain name, the same procedure was applied as for the other domain names, rather than full anonymization.
- Any email addresses were anonymized unless they appeared in an agent string.
- Response filtering: All transactions that returned a “200 OK” response were omitted. Although these are part of the audit file, they triggered rules based on their output rather than the input itself and are not suitable for incoming data-based analysis.
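The two deterministic mechanisms described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the salt value is hypothetical, the `[:16]` truncation is read here as 16 hexadecimal characters, and the `IPv4Remapper` class is one possible way to build a collision-free random lookup table into 100.64.0.0/10.

```python
import hashlib
import ipaddress
import random

SALT = b"example-salt"  # hypothetical; the salt used for the dataset is not published

def stable_token(text: str) -> str:
    """SHA-256(salt + payload)[:16], read here as 16 hex characters (assumption)."""
    return hashlib.sha256(SALT + text.encode("utf-8")).hexdigest()[:16]

class IPv4Remapper:
    """Collision-free lookup table from real addresses into 100.64.0.0/10."""

    def __init__(self, seed: int = 0):
        self.target = ipaddress.ip_network("100.64.0.0/10")
        self.table = {}
        self.used = set()
        self.rng = random.Random(seed)

    def remap(self, address: str) -> str:
        if address not in self.table:
            # Draw random offsets until an unused target address is found.
            while True:
                offset = self.rng.randrange(self.target.num_addresses)
                if offset not in self.used:
                    break
            self.used.add(offset)
            self.table[address] = str(self.target.network_address + offset)
        return self.table[address]

remapper = IPv4Remapper(seed=1)
first = remapper.remap("203.0.113.7")    # documentation-range example address
second = remapper.remap("198.51.100.9")  # documentation-range example address
print(first, second, stable_token("wp-admin"))
```

The table keeps the mapping stable within one run; reusing it across files (for example, by persisting `table`) is what makes the replacement consistent across the whole archive.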
All other content was subject only to the email filtering rule: it was preserved as-is unless it contained an email address, in which case the address was masked. This includes the following:
- All request headers and values were retained.
- All response codes, headers, and values were retained.
- POST payloads were retained.
- User-agent strings were fully retained.
- Performance metrics, dates, and time stamps were fully retained.
- Path filenames (in URLs/requests) were fully preserved.
- Query-string filenames were fully preserved.
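The combination of tokenized directory segments and preserved filenames can be illustrated with a short sketch. It assumes the same salted, truncated SHA-256 token as described for text data; the salt, the helper names, and the example path are hypothetical.

```python
import hashlib

SALT = b"example-salt"  # hypothetical salt, for illustration only

def token(segment: str) -> str:
    """Stable per-segment token: the same segment always maps to the same value."""
    return hashlib.sha256(SALT + segment.encode("utf-8")).hexdigest()[:16]

def anonymize_path(path: str) -> str:
    """Tokenize directory segments; keep the slashes and the final filename."""
    *directories, filename = path.split("/")
    masked = [token(part) if part else part for part in directories]
    return "/".join(masked + [filename])

masked_path = anonymize_path("/shop/cart/index.php")
print(masked_path)
print(anonymize_path("/.env"))
```

A request for `/.env` has no directory segment, so it passes through unchanged under this scheme, which matches the request line shown in the data sample.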
3.3. Attack Categorization
Requests were categorized according to the CRS rule tags associated with each alert. The categories identified include SQL injection (SQLi), cross-site scripting (XSS), local file inclusion (LFI), remote file inclusion (RFI), scanner and reconnaissance activity, as well as protocol anomalies and obfuscation. A single request may fall into multiple categories if more than one rule is triggered.
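Tag-based categorization can be sketched by collecting the attack-related tags of a transaction into a set. The snippet assumes that category tags share the `attack-` prefix seen in the data sample (`attack-lfi`); how these tag names map onto the category names used in this paper is not specified here, and the H section below is a shortened, illustrative example.

```python
import re

TAG = re.compile(r'\[tag "([^"]*)"\]')

def categories(h_section: str) -> set:
    """Return the set of attack-* tags found anywhere in one H section."""
    return {tag for tag in TAG.findall(h_section) if tag.startswith("attack-")}

# Shortened, illustrative H section with two triggered rules.
h_section = (
    'Message: Warning. [id "930130"] [tag "attack-lfi"] [tag "OWASP_CRS"]\n'
    'Message: Warning. [id "920350"] [tag "attack-protocol"]\n'
    'Apache-Error: [client 100.122.171.187] ModSecurity: Warning. '
    '[id "930130"] [tag "attack-lfi"]'
)
found = categories(h_section)
print(sorted(found))
```

Using a set keeps a request from being counted twice for the same category, since the Apache-Error line repeats the tags of the corresponding Message line.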
4. Dataset Characterization
Our objective was to preserve the original dataset and enable researchers to conduct independent analyses without being influenced by our prior interpretations. Limiting our intervention to anonymization ensures that the dataset remains as close as possible to the original while safeguarding sensitive information. This approach enables the application of various analytical methods, including off-the-shelf software currently used to analyze ModSecurity audit files. Furthermore, the current dataset is well-suited for exploratory research and benchmarking new detection techniques under realistic conditions.
4.1. Attack Category Distribution
Table 2 summarizes the observed attack categories. This information is stored in the tag fields of the H section in the raw audit log files. Some rules, including the custom bot rejection rule 444444, do not carry tag information. These are listed as UNTAGGED, followed by the ID of the rule that triggered the audit log entry.
Table 2.
Attack category statistics.
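A category tally with the UNTAGGED fallback could be computed along the following lines. The label format (`UNTAGGED 444444`) and the shortened alert lines are assumptions made for illustration; in particular, the message text of the custom rule is not reproduced from the dataset.

```python
import re
from collections import Counter

TAG = re.compile(r'\[tag "([^"]*)"\]')
RULE_ID = re.compile(r'\[id "(\d+)"\]')

def category_labels(message: str) -> list:
    """Tags of one alert line, or a single UNTAGGED label carrying the rule ID."""
    tags = TAG.findall(message)
    if tags:
        return tags
    rule = RULE_ID.search(message)
    return [f"UNTAGGED {rule.group(1)}"] if rule else []

# Synthetic, shortened alert lines for illustration only.
alerts = [
    'Message: Warning. [id "930130"] [tag "attack-lfi"] [tag "OWASP_CRS"]',
    'Message: Access denied. [id "444444"]',
    'Message: Access denied. [id "444444"]',
]
counts = Counter(label for line in alerts for label in category_labels(line))
print(counts.most_common())
```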
4.2. Attack Code Distribution
Table 3 summarizes the attack codes that were observed. The raw audit log files contain this information in the “id” and “msg” fields.
Table 3.
Attack rule violation ID statistics.
4.3. User-Agent Patterns
User-agent strings were retained without modification, so the dataset supports user-agent analysis directly.
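As a starting point for such an analysis, the user-agent value can be read from the request headers in section B. The sketch below works on two illustrative B sections (the second one is made up and has no User-Agent header); `user_agent` is a hypothetical helper name.

```python
from collections import Counter

def user_agent(b_section: str) -> str:
    """Return the User-Agent header value from section B, or '' if absent."""
    for line in b_section.splitlines()[1:]:  # line 0 is the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "user-agent":
            return value.strip()
    return ""

b_sections = [
    "GET /.env HTTP/1.1\n"
    "Host: www.d4941dc.hu\n"
    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\n"
    "Connection: keep-alive",
    "GET / HTTP/1.1\nHost: example.invalid",
]
agent_counts = Counter(user_agent(section) for section in b_sections)
print(agent_counts.most_common())
```

Header names are compared case-insensitively because clients are free to vary the capitalization.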
5. Limitations
The IP addresses in the dataset were remapped consistently, so that a given address is always replaced by the same substitute. This allows accurate statistics on unique attacker sources, but the addresses cannot be used to determine geographic or network origins. Although the accuracy of IP-based geolocation is limited at best [], even that information is lost through the obfuscation process. This limits threat intelligence-style research, but the utility for attack pattern analysis is preserved.
All data used in this study were collected from a single server. The attack distribution and access patterns are therefore specific to the hosted content, namely WordPress sites and the most commonly associated back-end components: MySQL and PHP scripting tools. This scope inherently limits generalizability.
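Because the remapping is consistent, unique sources can still be counted from the A section. The following sketch parses the header line layout visible in the data sample (timestamp, transaction ID, source address and port, destination address and port); the second and third header lines are made-up variations used only to exercise the code.

```python
import ipaddress
import re

# Section A: [timestamp] unique_id source_ip source_port destination_ip destination_port
HEADER = re.compile(r"^\[([^\]]+)\] (\S+) (\S+) (\d+) (\S+) (\d+)$")
ANON_V4 = ipaddress.ip_network("100.64.0.0/10")

def source_address(a_section: str):
    """Return the client address of one entry, or None if the line is malformed."""
    match = HEADER.match(a_section.strip())
    return ipaddress.ip_address(match.group(3)) if match else None

a_sections = [
    "[01/Aug/2025:19:59:37 +0200] aI0AibHA3gfwyP1Wo76@lwAAAAg 100.122.171.187 47574 100.115.127.60 443",
    "[01/Aug/2025:19:59:38 +0200] exampleId0000000000000000001 100.122.171.187 47580 100.115.127.60 443",
    "[01/Aug/2025:20:01:02 +0200] exampleId0000000000000000002 100.70.3.9 51000 100.115.127.60 443",
]
sources = {source_address(section) for section in a_sections} - {None}
print(len(sources), all(address in ANON_V4 for address in sources))
```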
The rules 941160, 949110, 980130, 941100, 932110, 200004, and 932100 were deactivated in ModSecurity to provide a seamless WordPress experience for end users. Therefore, these violations do not appear in the attack code and category distributions.
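When re-deriving the distributions, it can be useful to confirm that none of the deactivated rules show up among the observed rule IDs. The check below is a trivial sketch; the list of observed IDs is a placeholder for whatever a parser extracts from the H sections.

```python
# Rule IDs listed as deactivated in this paper.
DISABLED_RULES = frozenset(
    {"941160", "949110", "980130", "941100", "932110", "200004", "932100"}
)

def disabled_rules_seen(observed_ids):
    """Return the deactivated rule IDs that nevertheless appear in the input."""
    return sorted(DISABLED_RULES & set(observed_ids))

observed = ["930130", "444444", "930130"]  # placeholder for parsed rule IDs
print(disabled_rules_seen(observed))
```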
During the anonymization process, file and URL path information were converted to stable tokens. This prevents the full interpretation of messages that were triggered by a path segment match. However, the trigger message contains the matched string or a regex with a group of strings that activate the match.
The dataset contains only those hits that have triggered at least a single ModSecurity rule. This means that requests that may have been malicious but were not detected by ModSecurity are excluded.
ModSecurity is a mature software solution that can be configured in multiple ways to handle malicious requests. We included our full configuration to allow for a comprehensive analysis of why a particular request resulted in a specific response or error code. Different configurations could have triggered the same rule and created an entry in the audit log, but with a different response code.
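To study how rules relate to response codes, the status line of section F can be paired with the rule IDs in section H. The sketch below assumes that each transaction has already been split into a dict keyed by section letter (a hypothetical intermediate form); the sample transactions are shortened and partly synthetic.

```python
import re
from collections import Counter

STATUS = re.compile(r"^HTTP/\S+ (\d{3})")
RULE_ID = re.compile(r'\[id "(\d+)"\]')

def status_by_rule(transactions):
    """Count (rule ID, response status) pairs over parsed transactions."""
    pairs = Counter()
    for sections in transactions:
        status = STATUS.match(sections.get("F", ""))
        code = status.group(1) if status else "unknown"
        # A set avoids double counting IDs repeated in the Apache-Error line.
        for rule in set(RULE_ID.findall(sections.get("H", ""))):
            pairs[(rule, code)] += 1
    return pairs

transactions = [
    {"F": "HTTP/1.1 404 Not Found\nContent-Length: 1807",
     "H": 'Message: Warning. [id "930130"]\nApache-Error: ModSecurity: [id "930130"]'},
    {"F": "HTTP/1.1 403 Forbidden", "H": 'Message: Access denied. [id "444444"]'},
    {"H": 'Message: Warning. [id "930130"]'},  # entry without an F section
]
pairs = status_by_rule(transactions)
print(pairs.most_common())
```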
6. Conclusions
This dataset offers a rare, now publicly available view of real-world web attack traffic, anonymized to protect sensitive information while preserving research value. It is intended to support reproducible evaluation of detection methods, provide evidence for understanding attacker behavior, and serve as a basis for advocating web application security.
Beyond its use for security research, the dataset also highlights a performance optimization opportunity for WordPress. WordPress constructs all error pages as if they were intended for human users, consuming roughly the same CPU, I/O, and network resources as serving a normal page. By estimating the total resources spent on generating these error responses and partially or fully offloading them through simplified error pipelines or lightweight redirects, the system can maintain higher responsiveness during peak load, a main weakness of CMS platforms, especially when improperly tuned []. Running the dataset under different error-handling modes enables the quantification of these potential performance gains and the identification of the most efficient mitigation strategy.
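A first, rough input to such an estimate is the volume of response data per status code, which can be read from the Content-Length header in section F. This captures only bytes on the wire, not CPU or I/O cost, and the sample headers below are illustrative (the 403 entry is made up).

```python
import re
from collections import defaultdict

STATUS = re.compile(r"^HTTP/\S+ (\d{3})")
LENGTH = re.compile(r"^Content-Length:\s*(\d+)\s*$", re.IGNORECASE | re.MULTILINE)

def body_bytes_by_status(f_sections):
    """Sum the Content-Length values of final response headers per status code."""
    totals = defaultdict(int)
    for headers in f_sections:
        status = STATUS.match(headers)
        length = LENGTH.search(headers)
        if status and length:
            totals[status.group(1)] += int(length.group(1))
    return dict(totals)

f_sections = [
    "HTTP/1.1 404 Not Found\nContent-Length: 1807\nContent-Type: text/html",
    "HTTP/1.1 404 Not Found\nContent-Length: 1807\nContent-Type: text/html",
    "HTTP/1.1 403 Forbidden\nContent-Length: 199",
]
byte_totals = body_bytes_by_status(f_sections)
print(byte_totals)
```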
Author Contributions
Methodology, G.L.; Data curation, G.L.; Writing—original draft, G.L.; Writing—review & editing, G.L. and B.F.; Supervision, B.F. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Artificial Intelligence National Laboratory: European Union project RRF-2.3.1-21-2022-00004.
Institutional Review Board Statement
The dataset was collected in accordance with applicable legal and ethical requirements. The server operator (the authors’ own infrastructure) provided explicit approval for data collection and publication. Only requests flagged as anomalous or malicious by the Web Application Firewall were included, and all sensitive identifiers were extensively anonymized to ensure that no personal data could be traced back to individual users. This anonymization process aligns with GDPR principles by removing or hashing personal identifiers while preserving research-relevant structural features. A risk assessment was conducted prior to publication, and the dataset was released under appropriate safeguards to strike a balance between research utility and privacy protection.
Data Availability Statement
The dataset is publicly available at https://doi.org/10.5281/zenodo.17178461 (accessed on 22 September 2025).
Conflicts of Interest
The authors declare no conflict of interest.
References
- ModSecurity. ModSecurity Open-Source Web Application Firewall. Available online: https://modsecurity.org (accessed on 22 September 2025).
- Bilic, I.; Josić, K.; Pranic, D.; Ribaric, S. Web application firewalls (WAFs) in protecting software. In Proceedings of the 35th DAAAM International Symposium on Intelligent Manufacturing and Automation, Vienna, Austria, 24–25 October 2024; DAAAM International: Vienna, Austria, 2024; pp. 306–311.
- Prates, L.; dos Santos, R.P.; de Lima, T.L.; Costa, A.L. DevSecOps practices and tools. Int. J. Inf. Secur. 2025, 24, 11.
- Dehlaghi-Ghadim, A.; Helali Moghadam, M.; Balador, A.; Hansson, H. ICS-Flow: An anomaly detection dataset for industrial control systems. arXiv 2023, arXiv:2305.09678.
- Goldschmidt, P.; Chudá, D. Network intrusion datasets: A survey, limitations, and best practices. arXiv 2025, arXiv:2502.06688.
- OWASP Core Rule Set Project. OWASP ModSecurity Core Rule Set. Available online: https://coreruleset.org (accessed on 22 September 2025).
- Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–6.
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. CICIDS2017 Dataset. Canadian Institute for Cybersecurity, University of New Brunswick. 2018. Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 22 September 2025).
- Tavallaee, M.; Stakhanova, N.; Ghorbani, A.A. HTTP CSIC 2010 Dataset [Dataset]. Information Security Institute, Spanish Research Council (CSIC). 2010. Available online: https://www.kaggle.com/datasets/ispangler/csic-2010-web-application-attacks (accessed on 22 September 2025).
- Şen, Ö. Benchmark Evaluation of Anomaly-Based Intrusion Detection Systems. arXiv 2023, arXiv:2312.13705.
- Lucz, G. A Thirty-Day Dataset of Malicious HTTP Requests Blocked by OWASP ModSecurity on a Production Web Server [Data Set]. Zenodo. 2025. Available online: https://zenodo.org/records/17178461 (accessed on 22 September 2025).
- Kasturi, G.; Zhao, P.; Alowaisheq, E.; Kotipalli, S.; Chen, Z. A large-scale study of malicious plugins in WordPress. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 2022), Boston, MA, USA, 10–12 August 2022; USENIX Association: Berkeley, CA, USA, 2022; pp. 1045–1062. Available online: https://www.usenix.org/conference/usenixsecurity22/presentation/kasturi (accessed on 22 September 2025).
- Mohamed Mohideen, M.A.; Nadeem, M.S.; Hardy, J.; Ali, H.; Tariq, U.U.; Sabrina, F.; Waqar, M.; Ahmed, S. Behind the Code: Identifying Zero-Day Exploits in WordPress. Futur. Internet 2024, 16, 256.
- Thomas-Reynolds, D.; Butakov, S. Factors affecting the performance of web application firewall. In Proceedings of the 2020 Workshop on Information Security and Privacy (WISP 2020), Virtual, 12 December 2020; AIS Electronic Library (AISeL): Atlanta, GA, USA, 2020. Available online: https://aisel.aisnet.org/wisp2020/8 (accessed on 22 September 2025).
- glucz. Glucz/OWASP-Server-Configuration: Zenodo Release (v1.1). Zenodo. 2025. Available online: https://zenodo.org/records/17188106 (accessed on 22 September 2025).
- Antonov, A.; Sidorov, S. Web application firewalls: Comparative evaluation of ModSecurity, NAXSI, and Shadow Daemon. arXiv 2024, arXiv:2406.13547.
- OWASP ModSecurity Project. ModSecurity 2 Data Formats. GitHub. Available online: https://github.com/owasp-modsecurity/ModSecurity/wiki/ModSecurity-2-Data-Formats (accessed on 22 September 2025).
- Sarmin, S.; Sarkar, S.; Wang, Y.; Mohammed, N. Synthetic data: Revisiting the privacy–utility trade-off. arXiv 2025, arXiv:2502.19282.
- Livadariu, I.; Dainotti, A.; Jonker, M.; Stiller, B.; Elmokashfi, A. On the accuracy of country-level IP geolocation. In Proceedings of the Applied Networking Research Workshop (ANRW 2020), Virtual, 30–31 July 2020; ACM/IRTF: New York, NY, USA, 2020; pp. 1–7.
- Drivas, I.; Karampelas, P.; Anagnostopoulos, I.; Verginadis, Y. Content management systems performance and website speed. Information 2021, 12, 259.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).