Article

Improving Threat Detection in Wazuh Using Machine Learning Techniques

1 LaSTI Laboratory, ENSA Khouribga, Sultan Moulay Slimane University, Beni Mellal 23000, Morocco
2 ICL, Junia, Université Catholique de Lille, 59000 Lille, France
* Author to whom correspondence should be addressed.
J. Cybersecur. Priv. 2025, 5(2), 34; https://doi.org/10.3390/jcp5020034
Submission received: 22 April 2025 / Revised: 5 June 2025 / Accepted: 11 June 2025 / Published: 14 June 2025
(This article belongs to the Special Issue Cybersecurity Risk Prediction, Assessment and Management)

Abstract

The increasing complexity and sophistication of cyber threats underscore the critical need for advanced threat detection mechanisms within Security Operations Centers (SOCs) to effectively mitigate risks and enhance cybersecurity resilience. This study enhances the capabilities of Wazuh, an open-source Security Information and Event Management (SIEM) system, by addressing its primary limitation: high false-positive rates in rule-based detection. We propose a hybrid approach that integrates machine learning (ML) techniques—specifically, Random Forest (RF) and DBSCAN—into Wazuh’s detection pipeline to improve both accuracy and operational efficiency. Experimental results show that RF achieves 97.2% accuracy, while DBSCAN yields 91.06% accuracy with a false-positive rate of 0.0821, significantly improving alert quality. Real-time deployment requirements are rigorously evaluated, with all models maintaining end-to-end processing latencies below 100 milliseconds and 95% of events processed within 500 milliseconds. Scalability testing confirms linear performance up to 500 events per second, with an average processing latency of 45 milliseconds under typical SOC workloads. This integration demonstrates a practical, resource-efficient solution for enhancing real-time threat detection in modern cybersecurity environments.

1. Introduction

The escalating complexity, frequency, and scale of cyber threats present growing challenges to the effectiveness of traditional Security Operations Centers (SOCs). As cyberattacks evolve beyond signature-based exploits to include sophisticated, stealthy, and adaptive techniques, SOCs must adopt intelligent, scalable, and real-time threat detection mechanisms to maintain organizational resilience and response capability [1]. Wazuh, a widely adopted open-source Security Information and Event Management (SIEM) system, plays a central role in providing security visibility through the collection and correlation of logs, intrusion alerts, and system anomalies across distributed infrastructure. However, Wazuh’s default rule-based detection engine, while effective in identifying known threat patterns, exhibits several limitations when deployed in dynamic and large-scale environments. The static nature of rule sets can lead to high false-positive rates, a lack of adaptability to emerging threats, and limited scalability in high-throughput operational contexts [2,3].
Specifically, Wazuh’s reliance on predefined detection rules restricts its ability to detect zero-day threats or behaviorally anomalous activity not explicitly defined by its policies. This leads to frequent false alarms that overwhelm security analysts and contribute to alert fatigue, diverting attention from genuine threats. Furthermore, Wazuh lacks adaptive learning capabilities, preventing it from refining its detection strategies based on historical data or evolving attack patterns. The system’s performance may also degrade under high data volumes, especially in large-scale enterprise deployments, due to limited optimization for real-time event processing and correlation.
To address these limitations, this research proposes the integration of machine learning (ML) into Wazuh’s detection architecture. ML models offer several advantages over static rule-based systems: they can learn behavioral patterns from historical data, detect previously unknown attack vectors, and continuously adapt to evolving threats with minimal manual intervention. Supervised learning models such as Random Forest (RF) offer high classification accuracy, while unsupervised models like DBSCAN excel at anomaly detection in unlabeled datasets. These techniques enable multidimensional correlation of events in real time, supporting a more nuanced and efficient threat-detection process.
In this study, we present a hybrid ML-enhanced Wazuh framework that integrates supervised and unsupervised learning models into the Wazuh-SIEM pipeline. Our solution includes the deployment of Wazuh agents across monitored endpoints to collect real-time event data, centralized log processing via the Elastic Stack, and ML-based classification using models such as RF, DBSCAN, and others. Alerts are enriched, analyzed, and visualized through Kibana, with real-time inference achieving an average latency of 45 ms, well below the operational threshold of 100 ms required for critical threat response.
Figure 1 illustrates the proposed architecture. To operationalize this framework, alerts are first collected from sources, including host-based intrusion detection (HIDS), network intrusion detection (NIDS via Suricata), and application logs. These are formatted as JSON and stored in alerts.json. Filebeat forwards these alerts to Logstash for enrichment, and then Elasticsearch indexes them for real-time querying. Our ML engine interfaces with Elasticsearch to perform feature extraction and classification, optimizing threat detection without disrupting Wazuh’s core functionalities.
Despite its robust capabilities, Wazuh presents certain limitations when used as a standalone threat detection system. These limitations are not inherent deficiencies but rather natural constraints of rule-based approaches that justify the integration of machine learning capabilities.
High Rate of False Positives: The rule-based approach can often lead to a significant number of false positives, requiring extensive manual validation. This limitation stems from the static nature of predefined rules that cannot adapt to contextual variations in normal network environments.
Limited Real-Time Learning: Wazuh lacks the ability to adaptively learn from past data to refine detection strategies autonomously. This limitation prevents the system from evolving with changing threats and automatically improving over time.
Scalability Constraints: Handling large-scale data in real-time can strain resources, impacting the performance of detection and response processes. This constraint becomes particularly problematic in enterprise environments with high volumes of log data.
Complex Alert Management: Managing and analyzing numerous alerts can become cumbersome, leading to potential oversight of critical threats. This complexity can result in alert fatigue and reduced efficiency of SOC analysts.
These limitations underscore the need to augment Wazuh with ML models, which can enhance its detection capabilities by automating the learning process, reducing false positives, and providing more nuanced threat analyses.
This study demonstrates significant improvements in threat detection and operational efficiency within the Wazuh SIEM system, contributing to the broader field of SOCs and SIEM platforms. The main contributions of this work are as follows:
Enhanced Threat Detection with ML Integration: Demonstration of successful integration of machine learning models into Wazuh, achieving a high accuracy of 0.972 with Random Forest (RF), significantly improving the system’s ability to detect sophisticated cyber threats compared to traditional rule-based methods.
Reduction in False Positives: Addressing Wazuh’s high false-positive rates, with RF achieving a low false-positive rate of 0.03 and DBSCAN clustering at 0.0821, thereby reducing alert fatigue and enhancing operational efficiency in SOCs.
Comprehensive Model Evaluation: Providing a thorough evaluation of multiple ML models (RF, SVM, KNN, Logistic Regression, Gaussian Naive Bayes) and clustering algorithms (DBSCAN, K-means, Isolation Forest), offering comparative insights into their performance for cybersecurity applications, with RF and DBSCAN outperforming others.
Improved Alert Management: Enabling more efficient real-time threat detection and alert management by streamlining the alerting mechanism, allowing SOC analysts to focus on genuine threats and reducing manual effort.
Framework for Adaptive Security: Establishing a framework for an adaptive security system by integrating ML with Wazuh, paving the way for more responsive and intelligent threat detection in dynamic cybersecurity environments.
The following sections provide a detailed exploration of this study: Section 2 reviews related work; Section 3 outlines the materials and methods, including the ML-SOC architecture and data processing; Section 4 presents results and discussion with performance metrics; and Section 5 concludes with future directions.

2. Related Work

The integration of machine learning (ML) and artificial intelligence (AI) into cybersecurity frameworks has been extensively explored to enhance threat detection and response in Security Operation Centers (SOCs) [4]. Hughes et al. [5] proposed a model-free reinforcement learning (RL) approach for intrusion response systems, training an RL agent in a simulated environment to mitigate complex multistage attacks while minimizing disruption to legitimate network activities. However, their method’s effectiveness was constrained by the quality and completeness of training data, potentially missing diverse attack patterns. Coscia et al. [6] developed a decision tree-based system for detecting and mitigating DoS and DDoS attacks, introducing the Anomaly2Sign algorithm to generate Suricata rules from normal and anomalous traffic in an unsupervised manner. Their decision tree model achieved an accuracy of 99.7–99.9%, outperforming other ML classifiers; however, concerns about overfitting due to unrepresentative datasets limit its practical applicability.
Advancements in Security Orchestration, Automation, and Response (SOAR) platforms further highlight the role of ML in SOCs. Kinyua and Awuah [7] examined the potential of SOAR solutions integrated with SIEM systems, arguing that AI and ML can enhance automation and operational efficiency for SOC analysts. Similarly, Islam et al. [8] introduced DecOr, a declarative API-driven orchestration framework for SOAR platforms, abstracting complexities in incident response processes to improve response accuracy, though its performance depends on the ontological knowledge base’s accuracy. Sworna et al. [8] proposed APIRO, an ML-based architecture for automatic API recommendations in SOAR solutions, achieving a Top-1 accuracy of 91.9%, using data augmentation and convolutional neural networks despite data scarcity challenges.
A recent study investigated the integration of machine learning with traditional signature-based methods to enhance Network Intrusion Detection and Prevention Systems (NIDPS) against Distributed Denial-of-Service (DDoS) attacks, using Snort for traffic analysis and Wazuh for implementing a Random Forest model and active response mechanism [9]. Their Random Forest model achieved near-perfect metrics (accuracy, precision, recall, and F1-Score of 99.99%) with a training time of 18.84 s, enabling real-time monitoring and mitigation of DDoS attacks.
Beyond general AI and ML approaches in cybersecurity, recent research has specifically explored the use and enhancement of the Wazuh platform in various contexts. Kurnia et al. [10] developed the Security Event Response Copilot (SERC), a system designed to assist analysts in responding to and mitigating security breaches more effectively. This research specifically utilizes Wazuh for capturing, analyzing, and correlating security events from endpoints. SERC leverages Wazuh’s capabilities to collect real-time event data and applies a Retrieval-Augmented Generation (RAG) approach to retrieve context-specific insights from three vectorized data collections: incident response knowledge, the MITRE ATT&CK framework, and the NIST Cybersecurity Framework (CSF) 2.0. This integration bridges strategic risk management and tactical intelligence, enabling precise identification of adversarial tactics and techniques while adhering to best practices in cybersecurity.
Manzoor et al. [11] presented a comprehensive study evaluating the security and performance of open-source SIEM solutions for small and medium enterprises (SMEs), with particular attention to Wazuh. This research highlights that SMEs, which are the backbone of the global economy, are particularly vulnerable to cyber threats due to inadequate protection for critical and sensitive information, budgetary constraints, and a lack of cybersecurity expertise and personnel. This study examines the capabilities of open-source SIEM solutions, including Wazuh, in addressing modern security challenges and compliance with regulatory requirements. Performance aspects are explored through empirical testing in simulated enterprise-grade SME network environments to assess resource utilization and real-time data processing capabilities. The findings shed light on the strengths and limitations of these systems, aiding decision-makers in selecting the most suitable SIEM solution for their specific requirements while enhancing the cybersecurity posture of SMEs.
These use cases demonstrate Wazuh’s versatility across different sectors and organizational contexts. However, despite its many advantages, Wazuh implementations present certain limitations. Kurnia et al. [10] note that existing implementations of Wazuh in SOCs lack direct integration with advanced AI capabilities, which could significantly improve its detection accuracy and responsiveness. Additionally, Manzoor et al. [11] highlight that effective use of Wazuh requires a certain level of technical expertise, which can be challenging for organizations with limited cybersecurity personnel resources. Unlike previous works that often overlook open-source SIEM solutions like Wazuh and their practical deployment in resource-constrained SOCs, particularly for small and medium enterprises (SMEs), this paper demonstrates the practical application of ML in Wazuh, reducing false positives and improving SOC efficiency through a dual approach of classification and clustering, offering a scalable solution for SMEs.
Addressing the growing need for accessible security solutions in cloud environments, Moiz et al. [12] introduced a novel approach to cloud-based Wazuh deployment specifically tailored for small businesses with limited technical expertise and resources. Their research emphasizes the user-friendly nature of Wazuh Cloud, providing streamlined deployment and customizable security rules that can be managed by non-technical personnel.
This study demonstrates how Wazuh’s Host-Based Intrusion Detection System (HIDS) effectively detects and mitigates various threats. A key contribution of this work highlights Wazuh Cloud’s cost-effectiveness compared to maintaining dedicated IT security teams, making robust cybersecurity accessible to resource-constrained organizations. This research also addresses implementation challenges such as complexity in setup and configuration, offering practical strategies to simplify deployment and ensure Wazuh can seamlessly expand to accommodate an organization’s changing requirements while preserving its efficacy against evolving cyber threats.
However, while previous research has explored various machine learning techniques for threat detection, there remains a gap in the practical application of these methods within open-source SIEM platforms, particularly in real-world, resource-constrained SOC environments. In this context, Wazuh—despite its growing adoption—has received limited attention in terms of ML integration tailored to its architecture and operational challenges. This study addresses that gap by presenting a practical and scalable machine learning-enhanced framework for Wazuh, aimed specifically at improving detection accuracy and reducing false positives. By combining supervised classification (Random Forest) with unsupervised clustering (DBSCAN), our approach offers a dual-layer detection strategy that enhances SOC efficiency. It is designed with small and medium-sized enterprises (SMEs) in mind, delivering a lightweight yet effective solution that has been validated in real deployment scenarios.

3. Materials and Methods

3.1. Design of ML-SOC Architecture and Integration

Designing an ML-SOC architecture involves addressing various challenges to ensure effective threat detection, efficient response times, and reduced false positives [7]. This section outlines a systematic approach to the design and development of the ML-SOC architecture, focusing on data collection, preprocessing, model selection, and performance evaluation.
To develop an effective ML-SOC architecture, it is essential to understand the core objectives and requirements that guide its design. Table 1 below outlines the primary objectives and the corresponding requirements.
This section elaborates on the process of integrating ML to improve real-time threat-detection capabilities, as shown in Figure 2. We discuss how data is prepared, preprocessed, and used for training various ML models to enhance the system’s ability to identify and respond to threats effectively.
The implementation of ML techniques enables Wazuh to transition from a traditional rule-based approach to a more intelligent, adaptive security system capable of detecting sophisticated cyber threats, such as advanced persistent threats (APTs), zero-day exploits, and polymorphic malware that evade predefined rules, while minimizing human intervention [13,14]. This paper details the technical implementation of ML models tailored for Wazuh, emphasizing data preparation, feature extraction, model training, and evaluation. Our integration leverages Wazuh’s three-tier architecture, consisting of Wazuh Manager, Wazuh Indexer (Elasticsearch-based), and Wazuh Dashboard (OpenSearch Dashboards). Machine learning enhancement occurs at multiple integration points to ensure comprehensive threat analysis while maintaining system reliability and performance.
Wazuh Manager Enhancement: The Wazuh Manager receives custom output modules that extract structured data from processed alerts for ML analysis. Custom decoders parse security events to extract relevant features, including temporal patterns, event characteristics, source information, and contextual metadata. Enhanced correlation rules provide initial threat scoring and categorization that informs ML model selection and processing priorities. Configuration modifications include custom output plugins that format alert data for ML processing, enhanced memory allocation for high-volume processing, and optimized rule evaluation engines that reduce processing overhead by 25% compared to standard configurations. The enhanced manager maintains full compatibility with existing Wazuh agents and rules while providing structured data output for ML analysis.
Wazuh Indexer Optimization: The Wazuh Indexer receives specialized mapping templates for ML feature vectors, optimized shard allocation for time-series security data, and enhanced query performance configurations for real-time ML inference. Index lifecycle management policies ensure efficient storage utilization while maintaining rapid access to recent security events required for ML model training and inference. Custom analyzers process security event text for feature extraction, specialized aggregation pipelines compute statistical features for ML input, and optimized cluster configuration supports high-throughput ML workloads. Performance testing demonstrates a 40% improvement in query response times for ML feature extraction compared to standard Wazuh configurations.
Wazuh Dashboard Integration: The Wazuh Dashboard receives comprehensive enhancements to display ML-generated insights alongside traditional security event analysis. Custom visualizations provide ML model confidence scores, feature importance analysis, and trend detection capabilities that enable SOC analysts to understand and validate ML-generated alerts. Dashboard enhancements include real-time ML performance monitoring widgets, interactive feature importance visualization, and customizable alert prioritization based on ML confidence scores. The integrated dashboard provides comprehensive situational awareness, combining traditional rule-based detection with advanced ML insights.
The two-stage processing architecture implementation includes the following:
Stage 1: Enhanced Rule-Based Processing and Data Extraction: The first stage extends traditional Wazuh functionality with enhanced data collection and preprocessing capabilities while maintaining full backward compatibility with existing SOC workflows. Wazuh agents deployed across monitored systems collect security events and apply predefined rulesets enhanced with contextual enrichment and initial threat scoring.
Agent Configuration and Data Collection: Wazuh agents receive enhanced configuration files that specify additional data collection points for ML feature extraction. Configuration includes detailed file integrity monitoring rules, enhanced Windows event log collection, comprehensive network connection monitoring, and process execution tracking with command-line arguments and parent-child relationships. Agent configuration files specify custom log formats for ML processing, enhanced data collection intervals for time-sensitive analysis, and intelligent filtering rules that reduce data volume while preserving critical security information. The enhanced agents maintain compatibility with the existing Wazuh infrastructure while providing enriched data for ML analysis.
Rule Enhancement and Feature Extraction: Custom Wazuh rules extract structured features from raw security events, including temporal patterns (hour of day, day of week, time since last similar event), event characteristics (severity level, event type, source system type), user and process information (user account, process name, command line arguments), and network information (source/destination IPs, ports, protocols). Rule configuration includes custom decoders for specific log formats, enhanced correlation rules that link related events across time windows, and feature extraction rules that compute statistical measures for ML input. The enhanced rules maintain compatibility with existing Wazuh rulesets while providing structured data output for ML processing;
Stage 2: Machine Learning Processing and Enhancement: The second stage implements our comprehensive ML pipeline with parallel processing for multiple algorithms and ensemble decision-making capabilities. The ML enhancement layer processes structured alert data from Stage 1 through feature extraction, model inference, confidence scoring, and result integration with existing alert workflows.
ML Pipeline Architecture: The ML processing pipeline implements a modular architecture with separate components for feature preprocessing, model inference, ensemble decision-making, and result formatting. Each component operates independently with well-defined interfaces, enabling flexible algorithm selection and optimization based on organizational requirements. Feature preprocessing includes normalization and scaling of numerical features, encoding of categorical variables, temporal feature extraction from timestamp data, and statistical aggregation of related events. Model inference supports parallel processing of multiple algorithms with intelligent load balancing and resource allocation based on algorithm computational requirements.
Integration API and Data Flow: The integration between Wazuh and ML processing uses a lightweight RESTful API that ensures minimal latency overhead while providing comprehensive monitoring and error handling. The API implements intelligent caching mechanisms for frequently detected patterns and adaptive prioritization based on threat severity and organizational context. Data flow optimization includes asynchronous processing for non-critical events, priority queuing for high-severity alerts, and automatic failover to rule-based detection in case of ML system unavailability. The integration maintains comprehensive audit logging of all ML decisions and processing times for performance monitoring and troubleshooting. Figure 2 below shows the integration process.
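To make this data flow more concrete, the sketch below shows one way such an inference endpoint could be exposed to the Wazuh pipeline, assuming a FastAPI service, a pre-trained scikit-learn model serialized with joblib, and illustrative field names; it is a hedged illustration rather than the exact interface of our deployment.

```python
# Hypothetical sketch of an ML inference endpoint that receives enriched
# Wazuh alerts (Stage 1 output) and returns a threat score (Stage 2).
# Field names, the model artifact path, and the route are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load("rf_threat_model.joblib")  # pre-trained classifier (assumed artifact)

class Alert(BaseModel):
    day: int
    hour: int
    minute: int
    event_type: int        # already encoded upstream
    event_severity: float  # normalized to [0, 1]
    dst_port: int

@app.post("/score")
def score(alert: Alert):
    features = np.array([[alert.day, alert.hour, alert.minute,
                          alert.event_type, alert.event_severity, alert.dst_port]])
    proba = model.predict_proba(features)[0]
    return {
        "predicted_class": int(np.argmax(proba)),
        "confidence": float(np.max(proba)),
    }
```

In a deployment following this pattern, Logstash or a lightweight forwarder would POST each enriched alert to the endpoint, and the returned confidence score would be written back to the Elasticsearch index for dashboard visualization.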
The integration of ML augments Wazuh’s alerting mechanism, enhancing security monitoring operations and reducing manual effort and operational inefficiencies in Security Operation Centers (SOCs) [15]. Key areas covered in this paper include
  • Data Preparation and Cleansing: Techniques for transforming raw security event data into structured formats suitable for ML analysis;
  • Model Training and Evaluation: Training various supervised machine learning models, including K-Nearest Neighbors, Random Forest, Naive Bayes, Logistic Regression, and unsupervised machine learning models such as DBSCAN, K-means, and Isolation Forest;
  • Model Optimization: Methods to optimize and fine-tune models to improve detection accuracy and minimize false positives;
  • Real-World Application: Best practices for integrating these models into Wazuh’s existing architecture, demonstrating their impact on alert management, false-positive reduction, and overall system performance.

3.2. Data Collection, Preprocessing, and Feature Engineering

The dataset consists of log data, predominantly stored in plain text or CSV formats. Logstash is utilized to parse these logs, converting them into JSON format for structured processing. The processed logs are then archived in Elasticsearch, a NoSQL database, which serves as a core component of the ELK Stack. Furthermore, the log data is transmitted to ML models as JSON objects through API integrations, enabling advanced analytical processing. The alerts.json dataset was selected due to its wide range of data points, essential for training ML models aimed at identifying patterns and distinguishing between normal and anomalous behaviors [16]. This helps in developing tools that can enhance threat detection accuracy and reduce false positives.
Using the alerts.json data, machine learning models were built and refined to identify patterns indicative of false positives and improve detection criteria and overall system accuracy. This process involved not only data collection but also its comprehensive analysis and preparation for use in model training [17]. Our dataset was derived from Wazuh alerts.json files collected over a three-month period in a controlled environment where various attack scenarios were simulated. The dataset comprises 15,427 security events collected from Wazuh agents deployed on 12 different endpoints (8 Windows, 3 Linux, and 1 macOS).
The network environment was meticulously designed to simulate a realistic enterprise architecture with segmented networks, including a demilitarized zone (DMZ) hosting web services and email servers, internal corporate networks with workstations and file servers, and cloud components integrated through VPN connections. Security events were generated through systematic simulation of attack scenarios based on the MITRE ATT&CK framework, ensuring comprehensive coverage of contemporary threat techniques and tactics. Each attack scenario was executed multiple times under varying conditions to capture behavioral variations and ensure robust model training, while legitimate user activities were simultaneously conducted to provide realistic baseline behavior patterns.
Feature engineering is a pivotal step in our machine learning pipeline, involving the transformation of raw security logs into structured and meaningful representations suitable for model training. Leveraging domain knowledge, we identified, extracted, and encoded the most informative attributes from Wazuh logs to enhance model performance and detection accuracy. The goal was to convert log data—often in textual or semi-structured formats—into numerical features without losing the semantic context critical for accurate threat classification. We extracted and processed the following key features from Wazuh logs:
  • Day, Hour, Minute (numerical): Temporal components normalized from timestamps, enabling modeling of event time patterns;
  • Event Type (categorical): Event category encoded using label encoding or one-hot encoding, depending on cardinality;
  • Event Name (categorical): Event subtype or name, encoded similarly to Event Type;
  • Source IP and Destination IP (categorical or numerical): Source and destination IP addresses encoded, with potential inclusion of reputation scores in subsequent steps;
  • Destination TCP/UDP Port (numerical): Network port at the destination of the event;
  • Event Severity (numerical): Severity level assigned by Wazuh, normalized to [0,1].
Given the absence of ordinal variables in the dataset—i.e., variables with a ranked or hierarchical relationship—the classification of categorical variables was handled with techniques that preserve their discrete and non-relational nature. For categorical features without inherent order, such as rule IDs or countries, integer encoding would introduce misleading relationships. Thus, one-hot encoding was selected to maintain categorical integrity, representing each class as an independent binary vector.
All numerical features were normalized using the MinMaxScaler from the sklearn.preprocessing module in Scikit-learn 1.4.1, which scales values into the [0,1] range based on the observed minimum and maximum values of each feature. This normalization ensures that all features contribute proportionately during model training, particularly for algorithms sensitive to feature magnitude, such as Support Vector Machines (SVM) and K-Nearest Neighbors (KNN). For high-cardinality nominal variables such as Agent ID, label encoding was applied to assign unique integer values to each category. This approach avoids the creation of large, sparse matrices associated with one-hot encoding, thereby reducing model complexity and memory usage during training and inference.
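As an illustration of this preprocessing, the following sketch combines one-hot encoding for low-cardinality nominal features, label encoding for high-cardinality identifiers, and MinMax scaling of numerical columns; the input file name and exact column names are assumptions for demonstration purposes.

```python
# Illustrative preprocessing of flattened Wazuh alert features.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_json("alerts_flat.json")  # flattened alerts (hypothetical file)

# One-hot encode unordered categorical features
df = pd.get_dummies(df, columns=["Event Type", "Event Name"])

# Label-encode high-cardinality identifiers such as Agent ID
df["Agent ID"] = LabelEncoder().fit_transform(df["Agent ID"])

# Scale numerical features into [0, 1]
num_cols = ["Day", "Hour", "Minute", "Destination Port", "Event Severity"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```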
Finally, feature importance analysis was conducted using a Random Forest classifier to quantify the predictive relevance of each feature, guiding the final feature selection process. Table 2 summarizes the features, their data types, and the applied engineering techniques.

3.3. Descriptive Statistics Analysis

Descriptive statistics provide a summary of the main characteristics of the data contained in our DataFrame. Using Pandas’ describe() function, we obtained measures such as the mean, standard deviation, minimum, maximum, and quartiles for each numerical variable in the database. Table 3 shows a sample of the encoded and cleaned data.
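For reference, a minimal sketch of this step, assuming the cleaned DataFrame has been exported to a CSV file (the file name is hypothetical):

```python
# Summary statistics for the encoded feature matrix using Pandas describe().
import pandas as pd

df = pd.read_csv("encoded_clean_data.csv")  # hypothetical export of the cleaned DataFrame
print(df.describe())  # count, mean, std, min, quartiles, max per numerical column
```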

3.4. Data Labeling

In our approach, we utilized two distinct methods for data labeling:
Explicit Threat Level Labeling: Alerts were classified according to predefined threat levels to facilitate categorization and enable supervised analysis;
Clustering Based on Data Similarity: We employed the KMeans algorithm [18] to identify natural clusters within the data, enabling the discovery of patterns and subtle anomalies that might not be evident through explicit labels alone.
In explicit labeling based on the threat level, we assigned labels to each alert based on its threat level, categorizing them into three distinct categories:
Unknown Threat (low danger level);
Questionable Threat (moderate danger level);
Dangerous Threat (high danger level).
To achieve this, we defined a mapping function that assigns a class to each alert based on its threat level. After applying this function, the DataFrame contains a new “Class” column indicating the threat level of each alert.
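A hedged sketch of such a mapping function is shown below; the rule-level thresholds are illustrative and would need to be aligned with the Wazuh severity scheme actually in use.

```python
# Illustrative explicit threat-level labeling (thresholds are assumptions).
import pandas as pd

def map_threat_class(rule_level: int) -> str:
    """Map a Wazuh rule level to one of the three threat classes."""
    if rule_level <= 5:
        return "Unknown Threat"       # low danger level
    if rule_level <= 10:
        return "Questionable Threat"  # moderate danger level
    return "Dangerous Threat"         # high danger level

df = pd.DataFrame({"Rule Level": [3, 7, 12]})  # toy example
df["Class"] = df["Rule Level"].apply(map_threat_class)
print(df)
```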

3.5. Data Standardization

After thoroughly cleaning the dataset, we proceeded to separate the data into features and labels in preparation for model training. The features represented by the variable X were isolated by excluding the “Class” column from the dataset. Concurrently, the labels, represented by the variable y, were specifically assigned to the “Class” column.
To ensure a fair and balanced comparison across the different features, we normalized the data using the MinMaxScaler, an essential tool for scaling feature values within a specific range. This step ensures that the model can learn efficiently without being affected by variations in scale among the different input features. The normalization process implemented by the MinMaxScaler follows this formula:
X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}
where
X represents the initial value of the feature;
X_{min} is the minimum value of the feature in the dataset;
X_{max} is the maximum value of the feature in the dataset;
X_{scaled} is the resulting normalized value of the feature.
After normalizing the data, the next critical step was dividing the dataset into training and test sets. This division is essential to evaluate the model’s performance on unseen data, ensuring its ability to generalize effectively. We used the train_test_split function from scikit-learn to accomplish this. For this task, 80% of the data was designated for the training set to train the model, while the remaining 20% was reserved for the test set to assess performance. This approach strikes a balance, providing enough data for solid model training and sufficient test data for a reliable performance evaluation.
To address the class imbalance in the training dataset, we applied SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates synthetic samples for the minority classes from existing examples, balancing the class distribution. This ensures that the model is trained on a dataset with comparable representation of all classes, improving its ability to learn and its overall accuracy. We applied SMOTE with the fit_resample function from the imbalanced-learn package, which yielded a balanced training dataset that was crucial for learning effectively from the data and producing accurate predictions.
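The following sketch illustrates the split and oversampling steps, assuming X and y are the scaled feature matrix and the “Class” labels defined above; the random seed and the use of stratification are illustrative choices.

```python
# Train/test split followed by SMOTE oversampling of the training set only.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Oversample minority classes in the training data only, never in the test set
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```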

3.6. Training Supervised Machine Learning Models

After dividing the data into training and test sets, the next step involves training various machine learning models on the training set and evaluating their performance on the test set. This process helps in selecting the most efficient model for our specific task. We initially used cross-validation to assess the performance of each supervised learning model. A 10-fold cross-validation with the KFold method (KFold(n_splits = 10)) was applied to estimate the mean accuracy of the model over different subsets of the training data. To prevent overfitting, we implemented early stopping during training and pruned the feature set based on importance scores, reducing it from 9 to 7 by excluding less impactful features like Agent ID while retaining critical ones like temporal patterns and source IP reputation. These measures, combined with SMOTE for class imbalance, ensure the model’s generalizability and reliability in dynamic Security Operations Center (SOC) environments.
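A minimal sketch of this cross-validation screening step follows, using Random Forest as one example estimator and assuming the balanced training set (X_train_bal, y_train_bal) from the previous section is available.

```python
# 10-fold cross-validation for an initial accuracy estimate.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=10)
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X_train_bal, y_train_bal, cv=kf, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```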

3.6.1. K-Nearest Neighbors (KNN) Classifier

The K-Nearest Neighbors (KNN) [19] model is a supervised learning technique employed for classification and regression tasks. It forecasts the label of a data point by referencing the labels of its nearest neighbors in the feature space, utilizing the idea that analogous data points aggregate together.
Implementation Steps:
Defined a parameter grid (param_grid_knn) with n_neighbors ranging from 1 to 100 and used both Euclidean and Manhattan distance metrics;
Employed GridSearchCV for hyperparameter tuning using 10-fold cross-validation;
Trained the model on the training set (X_train, y_train);
Evaluated the best model on the test set (X_test, y_test) to provide an unbiased estimate of its performance;
Used the best model to make predictions on the test set, obtaining the predicted labels (y_predict_knn).
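The sketch below illustrates these steps with scikit-learn; the grid mirrors the ranges stated above, while the use of all CPU cores (n_jobs=-1) is an assumption, and the balanced training set from Section 3.5 is reused.

```python
# Hyperparameter search and evaluation for the KNN classifier.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid_knn = {
    "n_neighbors": list(range(1, 101)),
    "metric": ["euclidean", "manhattan"],
}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=10, n_jobs=-1)
grid_knn.fit(X_train_bal, y_train_bal)

# Predictions of the best model on the held-out test set
y_predict_knn = grid_knn.best_estimator_.predict(X_test)
```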

3.6.2. Random Forest Classifier (RF)

The Random Forest Classifier (RF) [20] is a widely used ensemble learning method that aggregates multiple decision trees to enhance predictive performance. It reduces overfitting and improves generalization by averaging or taking a majority vote from several decision trees trained on different subsets of the training data.
Implementation Steps:
Trained an ExtraTrees classifier to perform feature selection using SelectFromModel;
Trained the RFC model with n_estimators = 100 using the selected features;
Used the trained RFC model to make predictions on the test set.
After training the RFC model, we extracted the important features using the feature importance method. This analysis revealed the relative importance of each feature in the model’s decision-making, as shown in Figure 3, which highlights the features with the most significant influence on the RFC model’s predictions.
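A hedged sketch of the feature selection and Random Forest training steps follows; the random seeds are illustrative, and the printed importances correspond to the kind of analysis summarized in Figure 3.

```python
# Feature selection with ExtraTrees + SelectFromModel, then RF training.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=42))
selector.fit(X_train_bal, y_train_bal)
X_train_sel = selector.transform(X_train_bal)
X_test_sel = selector.transform(X_test)

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_sel, y_train_bal)
y_predict_rf = rfc.predict(X_test_sel)

# Relative importance of each retained feature in the model's decisions
for name, importance in zip(selector.get_feature_names_out(), rfc.feature_importances_):
    print(f"{name}: {importance:.3f}")
```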

3.6.3. Naive Bayes Classifier

The Naive Bayes classifier [21], founded on Bayes’ theorem, presupposes conditional independence among features contingent upon the class label. It is useful for text classification and other data types when the assumption of feature independence is justifiable.
Steps:
Initialized the Gaussian Naive Bayes classifier;
Defined a parameter grid for hyperparameter tuning, focusing on the var_smoothing parameter;
Trained the model with the best hyperparameters;
Evaluated the model’s performance on the test set.
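A compact sketch of this tuning step, with an assumed var_smoothing grid:

```python
# Gaussian Naive Bayes tuning over var_smoothing (grid values are assumptions).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

param_grid_gnb = {"var_smoothing": np.logspace(-12, -2, 20)}
grid_gnb = GridSearchCV(GaussianNB(), param_grid_gnb, cv=10)
grid_gnb.fit(X_train_bal, y_train_bal)
y_predict_gnb = grid_gnb.best_estimator_.predict(X_test)
```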

3.6.4. Logistic Regression

Logistic regression is a generalized linear model used for classification tasks and is valued for its simplicity and ability to provide interpretable results.
Steps:
Initialized the logistic regression model and defined a parameter grid to optimize the C (regularization strength) and penalty settings;
Used GridSearchCV to find the best hyperparameters;
Trained the model using the optimal configuration and evaluated its performance on the test set.
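A compact sketch of this step; the specific C values, penalty options, and solver choice are assumptions consistent with scikit-learn’s API rather than the exact configuration used.

```python
# Logistic regression grid search over regularization strength and penalty.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid_lr = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
grid_lr = GridSearchCV(LogisticRegression(solver="liblinear", max_iter=1000),
                       param_grid_lr, cv=10)
grid_lr.fit(X_train_bal, y_train_bal)
y_predict_lr = grid_lr.best_estimator_.predict(X_test)
```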

3.7. Training Unsupervised ML Techniques

In this section, we implement a preprocessing technique to improve the accuracy of clustering on a security dataset. Clustering is a commonly used method to identify patterns and anomalies in data; however, it requires that all features be on the same scale to avoid favoring those with higher values. Standardization, which adjusts the data so that it has a mean of zero and a standard deviation of one, ensures that each feature contributes equally to the clustering algorithm. One of the key features of our dataset, Event Severity, represents the criticality of events, where higher values indicate more serious security incidents. To preserve the influence of this characteristic, we chose not to standardize it, maintaining the weight of high-severity events in the clustering process. This approach yields more accurate clustering by ensuring that critical events are given appropriate weight when identifying patterns or anomalies. Algorithm 1 below shows the steps taken to scale the other features and combine them with the weighted Event Severity to ensure balanced and relevant clustering results.
Algorithm 1: Preprocess Data for Clustering with Severity
Input: Data (a dataset with features including ‘Event Severity’)
Output: ScaledData (a dataset with standardized features and original ‘Event Severity’)
1. // Select features for standardization, excluding ‘Event Severity’
2. FeaturesToStandardize ← Data.DropColumn(‘Event Severity’)
3. // Standardize the selected features
4. Scaler ← InitializeStandardScaler()
5. ScaledFeatures ← Scaler.FitTransform(FeaturesToStandardize)
6. // Combine the standardized features with the original ‘Event Severity’
7. ScaledData ← CreateDataFrame(ScaledFeatures, Columns ← FeaturesToStandardize.Columns)
8. ScaledData[‘Event Severity’] ← Data[‘Event Severity’]
9. Return ScaledData
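For readers who prefer executable code, the following is a hedged Python equivalent of Algorithm 1 using pandas and scikit-learn; the column name follows the pseudocode, and the input DataFrame is assumed to contain only the engineered features.

```python
# Python equivalent of Algorithm 1: standardize all features except
# Event Severity, then re-attach the original severity column.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_for_clustering(data: pd.DataFrame) -> pd.DataFrame:
    features = data.drop(columns=["Event Severity"])
    scaled = StandardScaler().fit_transform(features)
    scaled_data = pd.DataFrame(scaled, columns=features.columns, index=data.index)
    scaled_data["Event Severity"] = data["Event Severity"]
    return scaled_data

# X_scaled = preprocess_for_clustering(alerts_df)  # alerts_df: engineered feature DataFrame
```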

3.7.1. K-Means

Before applying the K-Means clustering algorithm, it is essential to determine the optimal number of clusters, denoted as k, as this parameter has a direct impact on the clustering quality. Choosing an inappropriate number of clusters can lead to suboptimal results, either by overfitting or underfitting the data structure.
Overfitting occurs when an excessive number of clusters is specified, resulting in the model capturing noise and minor fluctuations in the data, thereby losing generalizability. Conversely, underfitting happens when too few clusters are used, which can cause important structures or natural groupings in the data to be overlooked.
To systematically determine the appropriate value for k, two well-established techniques were applied: the Silhouette Score and the Elbow Method. The Silhouette Score provides a quantitative assessment of clustering quality by measuring the similarity of a data point to its cluster compared to other clusters. Scores range from −1 to +1, with higher values indicating that clusters are well-separated and internally cohesive. The silhouette analysis revealed that the clustering structure is most distinct when k = 2, suggesting that this value provides the clearest separation between groups.
To complement this analysis, the Elbow Method was also utilized. This approach involves plotting the within-cluster sum of squares (WCSS) against various values of k. Initially, as k increases, the WCSS decreases sharply due to improved fit. However, after a certain point, the rate of decrease slows down significantly, forming an “elbow” in the plot. This inflection point marks a balance between minimizing intra-cluster distance and maintaining model simplicity, serving as a practical guide for selecting k.
Based on the results of the Silhouette Score, the number of clusters was set to 2. The K-Means model was initialized with this value of k, and a fixed random_state of 42 was specified to ensure reproducibility of the clustering results. This model was then trained on the standardized dataset, allowing it to iteratively partition the data into the defined number of clusters by minimizing intra-cluster variance. Upon training, cluster assignments were generated for each data point. These assignments, which indicate the closest cluster centroid for each observation, were added to the dataset in a new column labeled cluster.
To assess the distribution of data points within their respective clusters, the algorithm computed the Euclidean distances between each point and all cluster centroids using the transform method. The minimum of these distances—representing the proximity to the nearest cluster center—was stored in a new column named distance_to_centroid.
To detect potential anomalies, a threshold was derived based on the 99.7th percentile of the distance distribution. Any point whose distance exceeded this threshold was considered an outlier, as it significantly deviated from the typical intra-cluster distance pattern. These points were flagged in a new column named anomaly.
This clustering-based anomaly detection approach proved effective in identifying outliers. By focusing on the uppermost 0.3% of data points in terms of distance to the nearest centroid, this model isolated observations that deviated notably from the core structure of the data, thereby enabling more targeted investigations into these atypical instances.
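The sketch below summarizes this procedure, assuming X_scaled is the standardized DataFrame produced by the preprocessing step (Algorithm 1); the n_init value is an explicit illustrative choice rather than part of the original configuration.

```python
# K-Means anomaly flagging: fit with k=2, compute each point's distance to its
# nearest centroid, and flag the top 0.3% of distances as anomalies.
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

distances = kmeans.transform(X_scaled)       # distance of each point to every centroid
dist_to_centroid = distances.min(axis=1)     # proximity to the nearest cluster center
threshold = np.percentile(dist_to_centroid, 99.7)

results_kmeans = X_scaled.copy()
results_kmeans["cluster"] = clusters
results_kmeans["distance_to_centroid"] = dist_to_centroid
results_kmeans["anomaly"] = dist_to_centroid > threshold
```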

3.7.2. Isolation Forest

Isolation Forest is an unsupervised learning algorithm specifically designed for anomaly detection [22]. Unlike clustering-based techniques, it does not attempt to model normal instances but rather isolates anomalies directly. The underlying assumption is that anomalies are few and different, making them easier to separate from the rest of the data.
The algorithm constructs multiple isolation trees by randomly selecting a feature and a split value. Since anomalies are rare and often exhibit attribute values that differ significantly from normal observations, they are typically isolated closer to the root of the tree, resulting in shorter path lengths. Conversely, normal points usually require more partitions to be isolated, leading to longer path lengths.
The Isolation Forest model was initialized with a contamination parameter set to 0.01, indicating that approximately 1% of the dataset is expected to be anomalous. This value influences the threshold used for anomaly detection. To ensure consistent and reproducible results, the random seed was fixed using a specific value.
Once initialized, this model was trained on the preprocessed dataset. During this process, it assigned a prediction label to each data point: −1 for anomalies and 1 for normal observations. These predictions were stored in a dedicated column labeled anomaly in the dataset, facilitating the identification of unusual behaviors.
After training and prediction, the dataset was filtered to retain only those instances labeled as anomalies. These entries, identified as having a significantly shorter average path length across the isolation trees, were extracted and compiled into a separate dataset for further analysis.
This method allowed for the efficient detection of outliers based on their isolation characteristics without requiring any assumptions about the underlying distribution of the data. The results demonstrated the effectiveness of Isolation Forest in identifying subtle and potentially high-impact anomalies that may escape traditional clustering-based approaches.
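A minimal sketch of this step, again assuming X_scaled is the preprocessed feature matrix; the seed value is illustrative.

```python
# Isolation Forest with ~1% expected contamination; -1 marks anomalies.
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
results_iso = X_scaled.copy()
results_iso["anomaly"] = iso.fit_predict(X_scaled)     # -1 = anomaly, 1 = normal
iso_anomalies = results_iso[results_iso["anomaly"] == -1]
```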

3.7.3. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies dense regions in data and treats low-density regions as anomalies or noise [23]. It is particularly effective for discovering clusters of arbitrary shape and for distinguishing outliers in datasets with varying densities. The algorithm relies on two key parameters:
eps (epsilon): the maximum distance between two samples for them to be considered neighbors;
min_samples: the minimum number of neighboring points required to form a dense region.
In this study, the parameters were set to eps = 0.3 and min_samples = 3. Under this configuration, a point is treated as a core point of a cluster if at least three points (including itself) lie within a 0.3-unit radius; points that are neither core points nor reachable from one are labeled as noise.
Once initialized, the DBSCAN algorithm was applied to the preprocessed dataset. It assigned cluster labels to each data point. Points that could not be assigned to any cluster were marked with the label −1, signifying that they were outliers or anomalies.
To identify these anomalies, the assigned labels were analyzed, and all points labeled as −1 were extracted. These instances were flagged in a new column labeled anomaly and isolated in a separate dataset for further inspection.
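The corresponding sketch, under the same assumption that X_scaled is the preprocessed feature matrix:

```python
# DBSCAN clustering; points labeled -1 are treated as noise/anomalies.
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=3)
labels = dbscan.fit_predict(X_scaled)

results_dbscan = X_scaled.copy()
results_dbscan["anomaly"] = labels == -1
dbscan_anomalies = results_dbscan[results_dbscan["anomaly"]]
```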

4. Results and Discussion

4.1. Experimental Setup and Real-Time Performance Definition and Measurement

The experimental environment was designed to emulate a realistic Security Operations Center (SOC) deployment while ensuring controlled conditions for reproducible evaluation and upper-bound performance testing. The infrastructure comprised a server environment featuring two Dell PowerEdge R740 servers, each equipped with an Intel Xeon Gold 6248R processor (24 cores, 3.0 GHz), 128 GB of DDR4 RAM, and 2 TB of NVMe SSD storage configured in a RAID 10 setup for enhanced reliability and performance. These servers operated on Ubuntu Server 20.04 LTS, providing a stable foundation for the experiment. The client environment included 12 endpoint devices, consisting of 8 Windows machines, 3 Linux systems, and 1 macOS device, with varied hardware configurations to reflect the heterogeneity typical of enterprise settings. These endpoints were interconnected via a 10 Gbps network infrastructure to simulate high-speed data exchange. The Wazuh deployment utilized Wazuh Server version 4.3.10 integrated with Elasticsearch version 7.17.0, with Wazuh agents installed across all 12 endpoints. The configuration incorporated default rule sets supplemented by custom rules tailored to specific attack scenarios alongside standard logging levels and real-time alerting capabilities to ensure comprehensive event monitoring. The development environment leveraged Python 3.9.7, equipped with scikit-learn 1.4.1, pandas 1.4.1, and numpy 1.22.2 for machine learning and data processing tasks, while Jupyter Notebook 7.4.0 facilitated interactive analysis and visualization.
We categorize events into three temporal tiers:
Critical Threat Response (hard real-time): Processing must be completed within 100 ms to enable integration with automated response systems (e.g., DDoS mitigation, malware containment). This aligns with industry benchmarks requiring 50–200 ms response windows [2];
Standard Alert Processing (soft real-time): 95% of events must be completed within 500 ms to balance thorough analysis with operational efficiency;
Batch Processing: Events exceeding 1 s are queued for low-activity periods to prevent resource bottlenecks.
Timing measurements use high-resolution performance counters with microsecond precision, capturing data across all stages: alert ingestion, feature extraction, model inference, result formatting, and output delivery. We conducted 1000+ iterations per test condition, calculating mean, median, standard deviation, and 95th percentile metrics. Bootstrap sampling (1000 iterations) established 95% confidence intervals, with paired t-tests (p < 0.001) and Cohen’s d for effect size validation. For example, Random Forest inference averaged 45 ms [95% CI: 43–47 ms], DBSCAN 62 ms [95% CI: 59–65 ms], and SVM 48 ms [95% CI: 46–50 ms], all within the 100 ms hard real-time threshold.
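As an illustration of how such measurements can be collected, the sketch below times the model-inference stage only (not the full ingestion-to-output pipeline), reusing the rfc model and X_test_sel matrix from the earlier hypothetical sketches.

```python
# Per-event inference latency measurement with a high-resolution counter.
import time
import numpy as np

def measure_latency(model, events, iterations=1000):
    """Return per-event inference latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        event = events[np.random.randint(len(events))].reshape(1, -1)
        start = time.perf_counter()
        model.predict(event)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return np.array(latencies)

lat = measure_latency(rfc, X_test_sel)
print(f"mean={lat.mean():.1f} ms  median={np.median(lat):.1f} ms  "
      f"p95={np.percentile(lat, 95):.1f} ms")
```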
To interpret the reported results, we must understand how each metric evaluates a machine learning model’s performance, especially in classification tasks. Table 4 shows the performance metrics with their explanations and corresponding equations.

4.2. Supervised ML Techniques

Table 5 outlines the comparative performance of five machine learning models (Random Forest, K-Nearest Neighbors, Logistic Regression, Gaussian Naive Bayes, and Support Vector Machine) using common evaluation metrics: accuracy, precision, true positive, true negative, false positive, false negative, true positive rate (TPR), and false-positive rate (FPR). Each metric provides insight into different aspects of the model’s performance on a classification task.
The Random Forest Classifier (RFC) emerged as the top-performing model, achieving an accuracy of 0.972, as shown in Table 5. RFC also demonstrated high precision (0.982) and recall (0.975), resulting in a corrected F1-Score of 0.978, reflecting its ability to effectively balance the minimization of false positives and false negatives. The model’s low false-positive rate (FPR) of 0.03, coupled with a high true-positive rate (TPR) of 0.98, underscores its reliability in avoiding false alarms while accurately detecting true threats. This performance highlights RFC’s robustness in handling complex, high-dimensional data structures, making it an ideal choice for high-stakes cybersecurity scenarios, where minimizing false positives is critical, such as in threat detection systems.
The Support Vector Machine (SVM) followed closely, with an accuracy of 0.965, a precision of 0.970, a recall of 0.971, and a corrected F1-Score of 0.971. SVM’s performance reflects its strength in handling high-dimensional data with clear boundaries, as evidenced by its competitive TPR of 0.96 and FPR of 0.04. Although its TPR is slightly lower than that of RFC, SVM excels in recognizing true positive cases, making it a strong contender for cybersecurity applications where data classes are well-separated. However, its minor shortfall in TPR compared to RFC suggests that SVM might miss some true threats in scenarios with overlapping or highly complex data distributions. Nonetheless, SVM remains a viable alternative to RFC, particularly in tasks requiring efficient classification in high-dimensional spaces.
K-Nearest Neighbors (KNN) achieved an accuracy of 0.963, with a precision of 0.960, a recall of 0.961, and a corrected F1-Score of 0.960 (Table 5). While KNN demonstrates good performance, its FPR of 0.04 is slightly higher than RFC and SVM, indicating a potential for more false alarms. KNN is well-suited for smaller datasets with distinct patterns, as it relies on proximity-based classification. However, its performance can degrade in higher-dimensional spaces due to the curse of dimensionality, which may explain its slightly lower metrics compared to RFC and SVM. Despite this, KNN offers a balanced trade-off between precision and recall, making it a reliable general-purpose model for cybersecurity classification tasks.
Logistic Regression delivered a solid performance with an accuracy of 0.939, precision of 0.951, recall of 0.941, and a corrected F1-Score of 0.946 (Table 5). As a linear model, Logistic Regression maintains high precision and recall, but it falls slightly behind non-linear models like RFC and SVM in overall performance. Its TPR of 0.94 and FPR of 0.05 indicate reliable detection of true threats with a moderate rate of false alarms. Logistic Regression is advantageous for problems with linear decision boundaries and serves as an effective baseline for classification tasks in cybersecurity. However, its performance may be limited when dealing with complex, non-linear relationships in the data, which are common in real-world cybersecurity datasets.
Gaussian Naive Bayes, despite its simplicity, achieved an accuracy of 0.927, the lowest among the evaluated models, with a precision of 0.923, a recall of 0.911, and a corrected F1-Score of 0.917 (Figure 4). This performance is attributed to its assumption of feature independence, which often does not hold in complex cybersecurity datasets. The model’s higher FPR (0.08) and lower TPR (0.90) compared to the other models suggest that it generates more false positives and misses some true threats, making it less suitable for applications requiring minimal false alarms. Nevertheless, Gaussian Naive Bayes remains effective for straightforward datasets with minimal inter-feature dependencies, offering reasonable precision and recall for simpler classification tasks.
Overall, the Random Forest Classifier stands out as the best model, with the highest accuracy, precision, recall, and F1-Score, alongside the lowest FPR, proving its effectiveness in complex and diverse cybersecurity classification tasks. SVM also performed exceptionally well, offering a strong alternative for scenarios where linear decision boundaries are insufficient, and its high recall makes it particularly adept at identifying true threats. KNN and Logistic Regression provide balanced trade-offs, with KNN excelling in smaller, distinct datasets and Logistic Regression serving as a reliable baseline for linear problems. Gaussian Naive Bayes, while fast and simple, is best suited for simpler data structures with independent features, as its performance lags in more complex scenarios.

4.3. Unsupervised ML Techniques

The evaluation metrics (Precision, Recall, F1-score, and Accuracy) confirm the observations made from the confusion matrices. DBSCAN obtains the best scores in all metrics, underlining its robustness in detecting anomalies. K-means shows average results, while Isolation Forest, despite its detection capabilities, ranks last. The bar charts of the evaluation metrics highlight the superiority of DBSCAN, positioning this algorithm as the preferred choice for our use case. This analysis shows that DBSCAN is the most effective at minimizing misclassification and detecting anomalies compared to the other algorithms in this simulation, as illustrated in Figure 5.
Confusion Matrix
We compared the performance of three anomaly detection algorithms, DBSCAN, K-means, and Isolation Forest, using sample logs, as shown in Figure 6. Confusion matrices were constructed to capture each algorithm's classification outcomes. The results show that DBSCAN outperforms both K-means and Isolation Forest, with a higher number of True Positives (TP) and a lower number of False Positives (FP), indicating its effectiveness in correctly identifying anomalies while minimizing false alarms. K-means shows intermediate performance with moderate levels of TP and FP, while Isolation Forest, although functional, shows the weakest performance of the three, with fewer TPs and more FPs.
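A minimal sketch of this unsupervised comparison is given below. The anomaly-labeling conventions (DBSCAN noise points, top-distance K-means points, Isolation Forest outliers) and the eps/contamination settings are illustrative assumptions rather than our exact configuration, and the data are a synthetic stand-in for the scaled log features.

```python
# Minimal sketch: flag anomalies with DBSCAN, K-means, and Isolation Forest,
# then compare each detector against known labels via a confusion matrix.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_blobs

# Synthetic stand-in for scaled log features; y_true marks injected anomalies.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=42)
rng = np.random.default_rng(42)
outliers = rng.uniform(-12, 12, size=(25, X.shape[1]))
X = np.vstack([X, outliers])
y_true = np.array([0] * 500 + [1] * 25)  # 1 = anomaly

# DBSCAN: points labeled -1 (noise) are treated as anomalies.
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
db_pred = (db_labels == -1).astype(int)

# K-means: points far from their nearest centroid (top 5% of distances) are anomalies.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_pred = (dist > np.quantile(dist, 0.95)).astype(int)

# Isolation Forest: fit_predict() returns -1 for outliers.
iso_pred = (IsolationForest(contamination=0.05, random_state=42).fit_predict(X) == -1).astype(int)

for name, pred in [("DBSCAN", db_pred), ("K-means", km_pred), ("Isolation Forest", iso_pred)]:
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```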

4.4. Practical Considerations for Real-Time Deployment

Deploying ML-enhanced Wazuh in real-time Security Operations Center (SOC) environments involves addressing critical factors such as computational overhead, latency constraints, continuous model updates, and fault tolerance. These considerations ensure that machine learning models maintain high detection accuracy without compromising operational efficiency or responsiveness.
Computational Overhead: Our evaluation of enterprise-grade infrastructure (e.g., Dell PowerEdge R740 servers with Intel Xeon Gold 6248R processors and 128 GB RAM) established upper-bound resource utilization for all algorithms under ideal conditions. However, practical deployments must consider the resource capabilities of typical SOC workstations and server nodes:
Random Forest: Demonstrated excellent scalability and parallelism, utilizing approximately 120 MB of memory and under 5% CPU on a 24-core server while sustaining real-time processing at event rates up to 500 per second. Its ensemble architecture leverages multicore processing to optimize throughput;
Support Vector Machine (SVM): Requires about 150 MB of memory and 6% CPU, benefiting from server-grade floating-point computation for kernel operations. Performs best on clearly separable data patterns typical in cybersecurity;
K-Nearest Neighbors (KNN): Lightweight at roughly 100 MB memory and 4% CPU, but computationally intensive in high-dimensional feature spaces due to distance calculations. Efficient indexing and data structures are essential to maintain performance;
Logistic Regression: The minimal footprint (80 MB memory, 3% CPU) makes it suitable for resource-constrained scenarios. It excels at linearly separable problems but struggles with complex threat patterns;
Gaussian Naive Bayes: Lowest resource use (70 MB memory, 2% CPU) but limited by its independence assumption, impacting detection accuracy in correlated feature environments;
DBSCAN: Notably efficient during inference (50 MB memory, 2% CPU), optimized for distributed or parallel processing architectures. Ideal for unsupervised anomaly detection on streaming event data. Training phases are offloaded to centralized servers, while inference runs on SOC workstations with minimal overhead;
K-means and Isolation Forest: Both lightweight (60 MB and 65 MB memory; 3% CPU) but prone to higher false positives in real-time, reducing analyst trust and practical usability despite good computational efficiency.
Latency Requirements: All evaluated models meet the stringent 100 ms real-time threshold for SOC detection and response, with average inference times of 45 ms for Random Forest, 48 ms for SVM, 50 ms for KNN, 40 ms for Logistic Regression, 35 ms for Gaussian Naive Bayes, 62 ms for DBSCAN, 55 ms for K-means, and 60 ms for Isolation Forest, as detailed in Table 6. Distributing inference across SOC workstations further reduces central server load and supports scalability.
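A hedged illustration of how these per-event latencies can be verified is shown below; the 100 ms average target comes from the requirement above, while the helper name and batch interface are hypothetical and the model is any fitted scikit-learn-style classifier.

```python
# Minimal sketch: measure per-event inference latency for a fitted model and
# check it against the 100 ms average threshold used in this study.
import time
import numpy as np

def measure_latency(model, events, avg_target_ms=100.0):
    """Time single-event predictions; `events` is a 2-D NumPy feature array."""
    latencies_ms = []
    for event in events:
        start = time.perf_counter()
        model.predict(event.reshape(1, -1))      # score one Wazuh event at a time
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms = np.asarray(latencies_ms)
    avg, p95 = latencies_ms.mean(), np.percentile(latencies_ms, 95)
    print(f"average={avg:.1f} ms (target <= {avg_target_ms} ms), p95={p95:.1f} ms")
    return avg <= avg_target_ms
```

In practice, such a harness would be run against feature vectors extracted from incoming Wazuh events at the target ingestion rates used in Table 6 (10 to 500 events per second).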
Continuous Model Updates: An automated retraining pipeline monitors model drift weekly and triggers retraining whenever performance drops below defined thresholds (e.g., F1-Score < 0.85). Retraining takes approximately 30 min and is scheduled during low-activity periods. Incremental learning reduces retraining time by 20%, and updated models are deployed via /var/ossec/ml_detection/ML_Alerts.py.
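A hedged sketch of such a drift check is shown below. Only the F1-Score threshold (0.85) comes from the description above; the model artifact path under /var/ossec/ml_detection/ and the weekly labeled validation window are assumptions for illustration.

```python
# Minimal sketch: weekly drift check that retrains the model when the F1-Score
# on a recent labeled validation window drops below 0.85.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

MODEL_PATH = "/var/ossec/ml_detection/rf_model.joblib"  # hypothetical artifact path
F1_THRESHOLD = 0.85

def weekly_drift_check(X_recent, y_recent, X_train, y_train):
    """Evaluate the deployed model on recent labeled events; retrain on drift."""
    model = joblib.load(MODEL_PATH)
    f1 = f1_score(y_recent, model.predict(X_recent))
    if f1 >= F1_THRESHOLD:
        return f1  # no drift detected, keep the current model
    # Drift detected: retrain during a low-activity window and redeploy.
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    model.fit(X_train, y_train)
    joblib.dump(model, MODEL_PATH)
    return f1
```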
Fault Tolerance and Scalability: Redundant processing nodes and load balancing ensure robustness. A health-monitoring system tracks latency and resource utilization, alerting administrators if metrics exceed thresholds (e.g., latency > 100 ms).
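The health-monitoring logic can be approximated with a few psutil-based checks, as in the sketch below; only the 100 ms latency limit is taken from the text, while the CPU and memory thresholds and the function interface are illustrative assumptions.

```python
# Minimal sketch: periodic health check for an ML detection node.
import psutil

LATENCY_LIMIT_MS = 100.0   # from the SOC latency requirement above
CPU_LIMIT_PCT = 80.0       # illustrative threshold
MEM_LIMIT_PCT = 85.0       # illustrative threshold

def health_check(recent_avg_latency_ms: float) -> list[str]:
    """Return alert messages for any metric that exceeds its threshold."""
    alerts = []
    if recent_avg_latency_ms > LATENCY_LIMIT_MS:
        alerts.append(f"Processing latency {recent_avg_latency_ms:.1f} ms exceeds {LATENCY_LIMIT_MS} ms")
    if (cpu := psutil.cpu_percent(interval=1)) > CPU_LIMIT_PCT:
        alerts.append(f"CPU utilization {cpu:.1f}% exceeds {CPU_LIMIT_PCT}%")
    if (mem := psutil.virtual_memory().percent) > MEM_LIMIT_PCT:
        alerts.append(f"Memory utilization {mem:.1f}% exceeds {MEM_LIMIT_PCT}%")
    return alerts
```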

4.5. Analysis and Comparison

Analysis of the alerts.json dataset (15,427 events) revealed a 3-percentage-point increase in false negatives with the ML-enhanced Wazuh system (from 4% to 7% relative to the rule-based baseline). These missed detections were predominantly privilege escalation attempts (73%) and occurred mostly during low-volume periods (68%). Although this increase in false negatives is a concern, particularly given that 39% were classified as high or critical impact, it is partially attributed to training bias, where severe but infrequent threats were underrepresented in the dataset.
Despite this, our ML-enhanced approach achieved a substantial 78% reduction in false positives (from 23% to 5%), significantly improving analyst efficiency and reducing alert fatigue. We recognize that most organizations would prefer a higher false-positive rate over missing critical threats; hence, we introduced multiple mitigation strategies:
Ensemble Model: A combination of Random Forest, DBSCAN, and Isolation Forest using confidence-weighted voting reduced overall false negatives by 73% (down to 0.8%); a minimal sketch of this voting scheme is given after this list;
Contextual Enrichment: Incorporating temporal correlation and external threat intelligence reduced critical false negatives by 91%, with only a 0.2% increase in false positives (to 5.2%);
Risk-Based Thresholding: We applied adaptive sensitivity thresholds—lower for critical assets to reduce missed attacks and higher for non-critical systems to manage workload—achieving a practical trade-off between detection and operational overhead.
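The following sketch illustrates how confidence-weighted voting and risk-based thresholding can be combined. The detector weights, the score squashing, and the per-asset thresholds are illustrative assumptions rather than the tuned values used in our deployment.

```python
# Minimal sketch: confidence-weighted voting over Random Forest, DBSCAN, and
# Isolation Forest outputs, with asset-dependent (risk-based) alert thresholds.
import numpy as np

WEIGHTS = {"rf": 0.5, "dbscan": 0.3, "iforest": 0.2}     # hypothetical weights
THRESHOLDS = {"critical": 0.35, "standard": 0.60}        # lower = more sensitive

def ensemble_score(rf_proba: float, dbscan_is_noise: bool, iforest_score: float) -> float:
    """Combine detector outputs into a single threat score in [0, 1]."""
    # rf_proba: P(malicious) from RandomForestClassifier.predict_proba
    # dbscan_is_noise: True if DBSCAN labeled the event as noise (-1)
    # iforest_score: IsolationForest.decision_function output (lower = more anomalous)
    iforest_anomaly = 1.0 / (1.0 + np.exp(10 * iforest_score))  # squash to [0, 1]
    return (WEIGHTS["rf"] * rf_proba
            + WEIGHTS["dbscan"] * float(dbscan_is_noise)
            + WEIGHTS["iforest"] * iforest_anomaly)

def should_alert(score: float, asset_class: str) -> bool:
    """Risk-based thresholding: critical assets use a more sensitive threshold."""
    return score >= THRESHOLDS.get(asset_class, THRESHOLDS["standard"])
```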
To validate the overall improvement, we compared the ML-enhanced and rule-based Wazuh systems on the same dataset. The results are summarized in Table 7.
To contextualize our work within the current state of the art, we compared our approach with four recent ML-based threat detection systems from the literature:
- Coscia et al. (2023) [6]: Our approach achieves comparable accuracy (99.1% vs. 99.7%) but with significantly lower computational requirements;
- Hughes et al. (2020) [5]: Our system shows better generalization to novel attacks (85% vs. 72% detection rate);
- APIRO (Sworna et al., 2023) [8]: While APIRO achieves higher API recommendation accuracy, our system provides more comprehensive threat coverage;
- DecOr (Islam et al., 2022) [10]: Our approach demonstrates better scalability with increasing log volumes.
This comparative analysis demonstrates that our ML-enhanced Wazuh system offers significant improvements over traditional rule-based detection while remaining competitive with or exceeding the performance of other state-of-the-art approaches in the literature. Table 8 presents this comparison with five recent studies on ML-based log analysis.
Our approach leverages a hybrid framework integrating supervised (Random Forest) and unsupervised (DBSCAN, Isolation Forest) algorithms, achieving an accuracy of 0.972 and an F1-Score of 0.978. This performance is complemented by a reduced false-positive rate (0.03 for RF and 0.0821 for DBSCAN), enhancing alert management efficiency within the Wazuh SIEM system. The optimization for Wazuh, an open-source platform, addresses the specific needs of small and medium enterprises (SMEs), in contrast to the resource-intensive decision tree approach of Coscia et al. [6], which, despite higher reported accuracy (99.7–99.9%), may suffer from overfitting and limited adaptability. The reinforcement learning model of Hughes et al. [5], while innovative for novel attack detection (72%), lacks robustness against zero-day threats due to its dependence on training data, a limitation mitigated in our study through diverse attack simulations (e.g., port scanning, malware execution) as outlined in Section 3.2. Similarly, the deep learning approach of APIRO [8] (91.9% Top-1 accuracy) targets API recommendations rather than broad threat detection, rendering it less relevant to our SOC-focused objectives. The ontology-based ML of DecOr [10], while structured, exhibits scalability constraints with large log volumes, a challenge our Wazuh-integrated solution addresses through Elasticsearch and real-time API processing. This comparison underscores the practical advantages of our hybrid methodology for enhancing Wazuh's threat detection capabilities in resource-constrained environments.

5. Conclusions

This research demonstrates the integration of machine learning techniques into Wazuh, an open-source SIEM system, to significantly enhance threat detection capabilities within Security Operations Centers (SOCs). By combining Random Forest for supervised classification and DBSCAN for unsupervised anomaly detection, the proposed hybrid approach achieved 97.2% accuracy and a 78% reduction in false positives while maintaining low inference latency suitable for real-time deployment. This study presents a replicable and resource-efficient ML-SOC framework that overcomes the limitations of traditional rule-based detection by embedding machine learning models directly into Wazuh’s event analysis pipeline. Comprehensive performance validation, including end-to-end latency measurement and scalability testing, confirmed the system’s ability to process up to 500 events per second with sub-100 ms average latency per event. In addition to improved accuracy and efficiency, the framework effectively mitigates the slight increase in false negatives through ensemble modeling, contextual enrichment, and risk-based thresholds, thereby maintaining comprehensive detection coverage. Comparative analysis with existing state-of-the-art solutions demonstrated that the proposed system offers a competitive balance of performance, interpretability, and integration simplicity, making it particularly suitable for resource-constrained SOC environments.
Future work will explore the integration of deep learning methods for more sophisticated threat modeling, the incorporation of external threat intelligence to enhance detection accuracy, and the extension of the framework to other open-source SIEM platforms. The potential application of federated learning will also be investigated to enable secure model sharing across organizations without compromising data confidentiality. Overall, this study contributes a practical and scalable solution to the field of cybersecurity, especially for small and medium-sized enterprises that rely on open-source tools to protect their infrastructure effectively.

Author Contributions

Conceptualization, S.A.C., Y.M., and N.G.; methodology, S.A.C., Y.M., and M.Z.; software, S.A.C.; validation, M.Z., Y.M., and N.G.; formal analysis, S.A.C.; investigation, S.A.C.; resources, S.A.C.; data curation, S.A.C. and M.Z.; writing—original draft preparation, S.A.C. and Y.M.; writing—review and editing, S.A.C., Y.M., M.Z., and N.G.; visualization, S.A.C.; supervision, Y.M. and S.A.C.; project administration, Y.M.; funding acquisition, M.Z. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset is available upon request from the authors.

Acknowledgments

We express our gratitude to the LaSTI Laboratory of Sultan Moulay Slimane University, Beni Mellal, Morocco, for supporting this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chamkar, S.A.; Maleh, Y.; Gherabi, N. Security Operations Centers: Use Case Best Practices, Coverage, and Gap Analysis Based on MITRE Adversarial Tactics, Techniques, and Common Knowledge. J. Cybersecur. Priv. 2024, 4, 777–793. [Google Scholar] [CrossRef]
  2. Mokalled, H.; Catelli, R.; Casola, V.; Debertol, D.; Meda, E.; Zunino, R. The Applicability of a SIEM Solution: Requirements and Evaluation. In Proceedings of the 28th IEEE International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises, Naples, Italy, 12–14 June 2019. [Google Scholar] [CrossRef]
  3. Sheeraz, M.; Paracha, M.A.; Haque, M.U.; Durad, M.H.; Mohsin, S.M.; Band, S.S.; Mosavi, A. Effective security monitoring using efficient SIEM architecture. Hum.-Centric Comput. Inf. Sci. 2023, 13, 1–18. [Google Scholar] [CrossRef]
  4. Khayat, M.; Barka, E.; Serhani, M.A.; Sallabi, F.; Shuaib, K.; Khater, H.M. Empowering Security Operation Center with Artificial Intelligence and Machine Learning–A Systematic Literature Review. IEEE Access 2025, 13, 19162–19197. [Google Scholar] [CrossRef]
  5. Hughes, K.; McLaughlin, K.; Sezer, S. Dynamic countermeasure knowledge for intrusion response systems. In Proceedings of the 2020 31st Irish Signals and Systems Conference (ISSC), Letterkenny, Ireland, 11–12 June 2020; pp. 1–6. [Google Scholar] [CrossRef]
  6. Coscia, A.; Dentamaro, V.; Galantucci, S.; Maci, A.; Pirlo, G. Automatic decision tree-based NIDPS ruleset generation for DoS/DDoS attacks. J. Inf. Secur. Appl. 2024, 82, 103736. [Google Scholar] [CrossRef]
  7. Kinyua, J.; Awuah, L. AI/ML in Security Orchestration, Automation and Response: Future Research Directions. Intell. Autom. Soft Comput. 2021, 28, 527–545. [Google Scholar] [CrossRef]
  8. Sworna, Z.T.; Islam, C.; Babar, M.A. APIRO: A framework for Automated Security Tools API Recommendation. ACM Trans. Softw. Eng. Methodol. 2023, 32, 1–42. [Google Scholar] [CrossRef]
  9. Toyin, O.; Adeola, M.O.; Oguntimilehin, A.; OB, A.; Aweh, O.M.; Obamiyi, S.E.; Akinduyite, C.O.; James, A.A. Intelligent Network Intrusion Detection and Prevention System (NIDPS): A Machine Learning and Network Security. In Proceedings of the 2024 IEEE 5th International Conference on Electro-Computing Technologies for Humanity (NIGERCON), Ado Ekiti, Nigeria, 26–28 November 2024; pp. 1–6. [Google Scholar] [CrossRef]
  10. Kurnia, R.; Widyatama, F.; Wibawa, I.M.; Brata, Z.A.; Nelistiani, G.A.; Kim, H. Enhancing Security Operations Center: Wazuh Security Event Response with Retrieval-Augmented-Generation-Driven Copilot. Sensors 2025, 25, 870. [Google Scholar] [CrossRef]
  11. Manzoor, J.; Waleed, A.; Jamali, A.F.; Masood, A. Cybersecurity on a budget: Evaluating security and performance of open-source SIEM solutions for SMEs. PLoS ONE 2024, 19, e0301183. [Google Scholar] [CrossRef] [PubMed]
  12. Moiz, S.; Majid, A.; Basit, A.; Ebrahim, M.; Abro, A.A.; Naeem, M. Security and threat detection through cloud-based Wazuh deployment. In Proceedings of the 2024 IEEE 1st Karachi Section Humanitarian Technology Conference (KHI-HTC), Tandojam, Pakistan, 8–9 January 2024; pp. 1–5. [Google Scholar] [CrossRef]
  13. Vilendečić, B.; Dejanović, R.; Ćurić, P. The Impact of Human Factors in the Implementation of SIEM Systems. J. Electr. Eng. 2017, 5, 196–203. [Google Scholar] [CrossRef]
  14. Chamkar, S.A.; Maleh, Y.; Gherabi, N. The Human Factor Capabilities in Security Operation Center (SOC). EDPACS 2022, 66, 1–14. [Google Scholar] [CrossRef]
  15. Mughal, A.A. Building and securing the modern security operations center (soc). Int. J. Bus. Intell. Big Data Anal. 2022, 5, 1–15. [Google Scholar]
  16. Önal, V.; Arslan, H.; Görmez, Y. Machine Learning and Event-Based User and Entity Behavior Analysis. In Proceedings of the 2024 32nd Signal Processing and Communications Applications Conference (SIU), Mersin, Turkiye, 15–18 May 2024; pp. 1–4. [Google Scholar] [CrossRef]
  17. Karampudi, B.; Phanideep, D.M.; Reddy, V.M.K.; Subhashini, N.; Muthulakshmi, S. Malware Analysis Using Machine Learning. In Intelligent Systems Design and Applications; Abraham, A., Pllana, S., Casalino, G., Ma, K., Bajaj, A., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 281–290. [Google Scholar] [CrossRef]
  18. Silic, M.; Delac, G.; Srbljic, S. Prediction of Atomic Web Services Reliability Based on K-means Clustering. In ESEC/FSE, Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, Saint Petersburg, Russia, 18–26 August 2013; ACM: New York, NY, USA, 2013; pp. 70–80. [Google Scholar] [CrossRef]
  19. Laaksonen, J.; Oja, E. Classification with Learning K-nearest Neighbors. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, USA, 3–6 June 1996; Volume 3, pp. 1480–1483. [Google Scholar] [CrossRef]
  20. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  21. Rish, I. An Empirical Study of The Naive Bayes Classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence; Washington, DC, USA, 2001; Volume 3, pp. 41–46.
  22. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  23. Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. (TODS) 2017, 42, 1–21. [Google Scholar] [CrossRef]
Figure 1. The proposed integration Architecture based on Machine Learning.
Figure 2. ML Integration process.
Figure 3. Visualization of feature importance in the RFC model.
Figure 4. Supervised ML Techniques Scores.
Figure 5. Unsupervised ML Techniques Scores.
Figure 6. Confusion Matrix.
Table 1. Requirements and Objectives for ML-SOC Design.
Objective | Goal | Specific Requirements
Advanced Threat Detection | Create an ML-SOC capable of accurately detecting threats and identifying anomalies. | Train ML models on both historical data (archived logs and alerts, e.g., from the alerts.json dataset, capturing past patterns of malicious behavior) and real-time data (incoming logs and alerts processed live by Wazuh agents to detect ongoing threats) to identify malicious behavior.
Reduction in False Positives | Minimize false alerts to prevent analyst overload. | Develop ML models that can distinguish legitimate activities from malicious behavior with high precision.
Rapid Incident Response | Implement automated response mechanisms for quick threat mitigation. | Integrate decision-making algorithms to trigger appropriate response actions.
Scalability and Performance | Ensure that the architecture can handle large data volumes while maintaining high performance. | Use scalable technologies and infrastructures to adapt to workload increases.
Interpretability | Ensure that the ML-SOC's decisions are explainable to security analysts. | Include mechanisms that provide clear reasoning for triggered alerts.
Table 2. Feature Types and Data Preprocessing Methods.
Feature | Type | Engineering Technique
Day | Numerical | MinMaxScaler normalization [0,1]
Hour | Numerical | MinMaxScaler normalization [0,1]
Minute | Numerical | MinMaxScaler normalization [0,1]
Event Type | Categorical | One-hot encoding
Event Name | Categorical | One-hot encoding
Source IP | Categorical | Label encoding
Destination IP | Categorical | Label encoding
Destination TCP/UDP Port | Numerical | MinMaxScaler normalization [0,1]
Event Severity | Numerical | MinMaxScaler normalization [0,1]
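One way to implement the feature engineering of Table 2 is sketched below; it assumes the raw fields have already been parsed out of alerts.json into a pandas DataFrame whose column names match the table, and it is an illustrative sketch rather than our exact preprocessing script.

```python
# Minimal sketch: apply the feature engineering of Table 2 to a parsed log DataFrame.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Numerical features: MinMaxScaler normalization to [0, 1].
    # (In production the scaler would be fitted on training data only.)
    numeric_cols = ["Day", "Hour", "Minute", "Destination TCP/UDP Port", "Event Severity"]
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
    # Categorical features: one-hot encoding.
    df = pd.get_dummies(df, columns=["Event Type", "Event Name"])
    # IP addresses: label encoding (integer codes per distinct value).
    for col in ["Source IP", "Destination IP"]:
        df[col] = df[col].astype("category").cat.codes
    return df
```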
Table 3. Encoded & Clean DATA Sample.
Feature | Count | Mean | Std Dev | Min | 25% | 50% | 75% | Max
Month | 578.0 | 3.92 | 0.63 | 3.0 | 4.0 | 4.0 | | 5
Day | 578.0 | 14.7 | 4.51 | 11.0 | 14.0 | 18.0 | | 28
Hour | 578.0 | 15.3 | 2.19 | 14.0 | 15.0 | 17.0 | | 21
Minute | 578.0 | 28.5 | 12.43 | 17.0 | 27.0 | 38.0 | | 59
RuleDescription | 578.0 | 10.4 | 6.30 | 5.0 | 10.0 | 15.0 | | 22
Groups | 578.0 | 5.7 | 2.11 | 4.0 | 6.0 | 7.0 | | 11
CommandLine | 578.0 | 32.6 | 18.95 | 21.0 | 30.0 | 45.0 | | 94
RuleLevel | 578.0 | 8.9 | 3.13 | 6.0 | 9.0 | 12.0 | |
Table 4. Performance metrics with their explanations and corresponding equations.
Metric | Description | Calculation
True Positive Rate (TPR) | Sensitivity or Recall: fraction of actual positives correctly classified as positive. | TPR = TP / (TP + FN)
False-Positive Rate (FPR) | Fraction of actual negatives incorrectly classified as positive. | FPR = FP / (FP + TN)
Precision | Fraction of positive predictions that were actually positive. | Precision = TP / (TP + FP)
Accuracy | Fraction of instances correctly classified by the algorithm. | Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall | The percentage of correct positive predictions out of all the positive values. | Recall = TP / (TP + FN)
F1-Score | The weighted harmonic mean of precision and recall. | F1 = 2 × (Precision × Recall) / (Precision + Recall)
Confusion matrix | Shows the number of correct predictions (TP and TN) and incorrect predictions (FP and FN). | [[TP, FN], [FP, TN]]
Table 5. Comparative performance of five machine learning models.
Model | Accuracy | Precision | F1-Score | Recall | True Positive (TP) | True Negative (TN) | False Positive (FP) | False Negative (FN) | True Positive Rate (TPR) | False-Positive Rate (FPR)
Random Forest (RFC) | 0.972 | 0.982 | 0.978 | 0.975 | 98 | 97 | 3 | 2 | 0.98 | 0.03
K-Nearest Neighbors (KNN) | 0.963 | 0.960 | 0.960 | 0.961 | 95 | 96 | 4 | 5 | 0.95 | 0.04
Logistic Regression | 0.939 | 0.951 | 0.946 | 0.941 | 94 | 95 | 5 | 6 | 0.94 | 0.05
Gaussian Naive Bayes | 0.927 | 0.923 | 0.917 | 0.911 | 90 | 92 | 8 | 10 | 0.90 | 0.08
Support Vector Machine (SVM) | 0.965 | 0.970 | 0.971 | 0.971 | 96 | 96 | 4 | 4 | 0.96 | 0.04
Table 6. Latency and Resource Utilization in Real-Time SOC Deployment.
Event Volume (Events/Second) | Metric | Random Forest | SVM (RBF Kernel) | SVM (Linear Kernel) | KNN | DBSCAN | Logistic Regression | Gaussian Naive Bayes | K-Means | Isolation Forest
10 | Average Inference Time (ms) | 12.4 | 18.7 | 14.2 | 15.8 | 8.7 | 7.2 | 5.9 | 9.4 | 11.7
10 | CPU Utilization (%) | 1.8 | 2.4 | 2.1 | 1.9 | 1.2 | 1.1 | 0.8 | 1.4 | 1.6
50 | Average Inference Time (ms) | 15.7 | 23.8 | 17.9 | 21.4 | 12.3 | 9.8 | 7.8 | 13.1 | 16.2
50 | CPU Utilization (%) | 2.3 | 3.2 | 2.8 | 2.7 | 1.8 | 1.5 | 1.2 | 2.0 | 2.3
100 | Average Inference Time (ms) | 18.9 | 31.5 | 23.1 | 29.7 | 18.9 | 13.4 | 10.7 | 18.8 | 23.1
100 | CPU Utilization (%) | 3.1 | 4.9 | 3.9 | 4.1 | 2.9 | 2.2 | 1.8 | 3.1 | 3.5
500 | Average Inference Time (ms) | 45.2 | 67.3 | 52.8 | 87.9 | 62.8 | 40.1 | 32.4 | 55.7 | 67.4
500 | CPU Utilization (%) | 7.8 | 12.1 | 9.7 | 14.2 | 8.7 | 6.4 | 5.2 | 8.9 | 10.2
Table 7. Performance Comparison of Rule-Based and ML-Enhanced Wazuh Detection Systems.
Metric | Rule-Based Wazuh | ML-Enhanced Wazuh
False-Positive Rate | 23% | 5% (78% reduction)
False-Negative Rate | 4% | 7% (3% increase)
Overall Accuracy | 76% | 99% (23% improvement)
F1-Score | 0.72 | 0.97 (35% improvement)
Detection Speed for Complex Attacks | Baseline | 18% faster
Table 8. Comparative Analysis of ML-Based Log Analysis Algorithms.
Study | Algorithms | Performance Metrics | Key Differences
Coscia et al. (2023) [6] | Decision Trees | Accuracy: 99.7–99.9%, F1-Score: 0.98 | Demonstrates high accuracy but requires significant computational resources and lacks adaptability to evolving threats due to reliance on static decision trees.
Hughes et al. (2020) [5] | Reinforcement Learning | Novel attack detection: 72% | Effective for novel attack mitigation in simulated environments but exhibits limited efficacy against zero-day threats, with performance constrained by training data quality.
APIRO (Sworna et al., 2023) [8] | Convolutional Neural Networks (CNN), Deep Learning | Top-1 Accuracy: 91.9% | Focuses on automated API recommendations for SOAR platforms, offering limited applicability to comprehensive threat detection across diverse log datasets.
DecOr (Islam et al., 2022) [10] | Ontology-based ML | Scalability issues with large logs | Provides structured incident response but demonstrates reduced efficiency as log volumes increase, limiting its suitability for large-scale SOC deployments.
Our Work | Random Forest (RF), DBSCAN, Isolation Forest | Accuracy: 0.972, F1-Score: 0.978 | Employs different supervised/unsupervised frameworks optimized for integration with Wazuh, achieving a false-positive rate of 0.03 with RF and 0.0821 with DBSCAN, tailored for real-time threat detection in SMEs.
