1. Introduction
The National Center for Research and Development in Information and Telematic Technologies of the National Institute for Nuclear Physics (INFN-CNAF) represents one of the most advanced computing infrastructures in Italy and internationally, playing a key role in supporting high-intensity scientific research. Since 2003, it has hosted the Italian Tier-1, one of the core components of the Worldwide LHC Computing Grid (WLCG) (
https://home.cern/science/computing/grid, accessed on 2 January 2025), a distributed infrastructure that manages and analyzes vast amounts of data produced by experiments at the Large Hadron Collider (LHC) (
https://home.cern/science/accelerators/large-hadron-collider, accessed on 2 January 2025) at CERN. In addition to supporting the LHC, which uses approximately 70% of CNAF’s resources, the center provides essential services for many other world-class research projects. These include astroparticle experiments, such as AMS and DarkSide, neutrino studies like ICARUS (
https://icarus.fnal.gov, accessed on 2 January 2025) and JUNO (
https://w3.lnf.infn.it/ricerca/fisica-delle-astroparticelle/juno/, accessed on 2 January 2025), as well as projects such as Virgo, dedicated to the detection of gravitational waves. The center is also deeply involved in the management and development of Grid Computing technologies, which allow computational resources to be distributed on a global scale. Through the integration of distributed storage systems and technologies such as IBM Spectrum Scale (formerly General Parallel File System, GPFS) and IBM Spectrum Protect (formerly Tivoli Storage Manager, TSM), CNAF ensures maximum reliability in data management and long-term storage. These systems are supported by advanced monitoring technologies based on Grafana and Sensu, which provide real-time performance and security monitoring of the IT infrastructure, enabling rapid detection of anomalies and maintaining high levels of operability.
In the context of time series analysis, anomaly prediction aims to forecast the occurrence of future data points, segments, or sequences that are likely to diverge significantly from expected behavior or established patterns within the temporal dynamics of the system [
1,
2]. Unlike anomaly detection [
3,
4,
5], which identifies deviations retrospectively, anomaly prediction [
6] proactively identifies conditions or trends that indicate potential anomalies. These predicted deviations are characterized by their rarity and unexpected nature when compared to historical trends or statistical properties of the time series. Leveraging techniques such as time series forecasting models [
7,
8], recurrent neural networks [
9,
10,
11,
12], Bayesian methods [
13,
14], or Reinforcement Learning [
15,
16], anomaly prediction enables IT systems to anticipate disruptions and take preventive measures, making it a valuable component in many different domains. Applications of anomaly prediction span a diverse range of fields, including geological hazard forecasting [
17], maintenance of mechanical devices [
18], and monitoring complex systems and networks. Notable examples include Power Distribution Systems [
19], Integrated Energy Systems [
20], and wireless sensor networks [
21].
In complex distributed environments [
22], anomalies can lead to severe consequences, such as performance degradation, service interruptions, or, in the worst cases, loss of critical data. This challenge is particularly pronounced in large-scale systems like distributed data centers, where efficient and reliable management of computational and storage resources is essential. Proactively addressing anomalies in these environments helps prevent potentially harmful events, such as cyberattacks or security breaches [
23,
24].
One of the major challenges in anomaly prediction lies in the sparsity of data: systems generally operate in a healthy mode, with their normal behavior adhering to expected patterns and exhibiting few significant deviations. This scarcity of anomalous events limits the availability of labeled data, which is crucial for training and validating predictive models. Furthermore, different domains often exhibit unique characteristics and constraints, such as uni-modal versus multi-modal data [
25,
26], spatio-temporal dependencies [
27,
28], single versus compound faults [
29,
30], and varying scales or patterns of anomalies. These domain-specific variations exacerbate the difficulty of developing models that are both generalizable and effective across diverse contexts.
The work presented in this article focuses on adopting a machine learning-based approach to predict anomalies within an INFN-CNAF WebDAV service, starting from a time series dataset comprising various system metrics, such as CPU usage, memory, and network traffic. The WebDAV service is implemented using the StoRM WebDAV tool [
31,
32] developed at INFN-CNAF and deployed in multiple instances to serve the data management requirements of the scientific communities accessing the data center. In the following, the INFN-CNAF WebDAV service always refers to the StoRM WebDAV service.
Preliminary investigations identified a single major type of anomaly that caused the monitoring system to fail in collecting data, indicating a system malfunction. Our research focuses on predicting this specific anomaly with the aim of enabling proactive prevention.
Various models were tested to identify the one best suited to the context, also exploring optimal configurations to maximize their effectiveness. Furthermore, we show that, with the aid of additional techniques, it is possible to improve the interpretability of the results, obtaining valuable insights into the potential causes of the detected anomalies. This aspect is particularly important in critical operational contexts, such as data centers, where a deep understanding of anomalies enables timely intervention to prevent malfunctions, reduce downtime, and minimize the risk of system crashes, thereby ensuring stable and reliable operation.
The work is organized as follows:
Section 2 reviews the state of the art in anomaly detection, presenting the main techniques and approaches.
Section 3 describes the proposed method and implementation, covering the entire pipeline, including data collection, metric analysis and its relationships, pre-processing, data organization for training, a description of the machine learning models used, and their application.
Section 4 presents and compares the results obtained, along with an analysis via a Saliency Map of an exemplary case in
Section 4.2. Finally,
Section 5 provides the conclusions and an overview of possible future work.
2. Related Works
This section explores methods and techniques for predicting anomalies in time series data evaluated at regular intervals, in settings similar to the one considered in this work, focusing on both statistical and machine learning methods.
Wei et al. [
33] propose a hybrid anomaly detection approach using Long Short-Term Memory (LSTM)-Autoencoder architecture. The LSTM encoder compresses the time series into a latent representation, while the decoder reconstructs it, with anomalies identified when reconstruction errors exceed a threshold. Similar approaches are discussed in works by Githinji and Maina [
34] and Ho et al. [
35]. Frehner et al. [
36] extend this idea by combining LSTM-Autoencoders with Kernel Density Estimation (KDE), which avoids fixed thresholds by using probabilistic density estimates for anomaly detection.
Statistical methods such as KDE are effective in estimating data distributions without predefined assumptions, labeling low-density points as anomalies [
37]. However, these approaches often fail to capture temporal dependencies in sequential data. Hybrid methods, like those by Frehner et al., address this limitation by integrating statistical and machine learning techniques.
Isolation Forest, introduced by Liu et al. [
38], isolates anomalies based on shorter paths in randomly generated trees, leveraging the distinctiveness of anomalies. This method has been applied to contexts like smart aquaculture [
39] and has been further developed by Xu et al. [
40], who introduced an approach that integrates the use of randomly initialized neural networks and an innovative method for anomaly score calculation, and by AbuAlghanam et al. [
41], who introduced a dynamic and adaptive management of the contamination parameter. Clustering methods, such as K-Means, also contribute to anomaly detection by identifying data points distant from cluster centroids, as demonstrated by Enayati et al. [
42] and Dong et al. [
43].
Supervised learning techniques leverage labeled datasets to train models for distinguishing between normal and anomalous patterns, using architectures like LSTMs, Random Forests, and one-class SVMs. Recently, Graph Neural Networks (GNNs) have emerged as a powerful tool for anomaly detection, particularly in contexts requiring structured data analysis. The Dynamic Graph Neural Network (DGNN) proposed by Chen et al. [
44] models the relationships between time series metrics, introducing subgraphs and adaptive attention mechanisms for efficient anomaly detection. Other works, such as those by Guo et al. [
45,
46], further explore GNN applications in anomaly detection.
Finally, comparative studies by Schmidl et al. [
47] and Zhang et al. [
48] assess the effectiveness and robustness of these methods, providing insights into their suitability for different anomaly detection scenarios.
In our work, we utilized some of the previously established techniques, specifically LSTM-Autoencoders, during the preliminary phase to identify potential anomalies. However, the predictive models for these anomalies discussed in this article are based on traditional architectures designed for analyzing temporal sequences.
3. Materials and Methods
In the following section, the study conducted on the WebDAV service of INFN-CNAF is described, for which the temporal data of the service metrics measuring system resource usage were made available. The section first provides a description of the service and the available metrics, then describes how the data were managed and organized, and finally illustrates the implemented solutions.
3.1. Data Collection
The data related to the WebDAV service are maintained in an InfluxDB v2 database. InfluxDB is a platform specifically designed to collect, store, process, and visualize time series data. Time series data represent a sequence of observations organized in chronological order. These observations, generally consecutive measurements from the same source, are used to monitor and analyze changes over time. InfluxDB organizes time series data into buckets and measurements. A bucket can contain multiple measurements, and measurements contain multiple tags and fields. A bucket is a named location where time series data are stored, while a measurement is a logical grouping for time series data. All points of a given measurement must have the same tags. A tag is a key-value pair intended to store metadata for each point (for example, an identifier of the data source such as the host or the experiment). Each data point in the database is also associated with a timestamp, which temporally identifies the data. InfluxDB also allows the data retention frequency to be modified through a downsampling process. This makes it possible to reduce the granularity of stored data over time, saving disk space and retaining only the most relevant information for the long term. A retention policy can be defined, i.e., a set of rules specifying how long the data should be retained and at what granularity.
The data were made available through database dumps carried out at specific time points. After a restore process of the provided dumps, the data were extracted through database queries that allowed for selecting only the data relevant to the study, by applying filters on the various available fields. In particular, the applied queries were used to obtain data with different retentions, filter the time windows of interest and the experiments under study, and select the reference hosts belonging to each experiment. Finally, the extracted data were reorganized in CSV format to facilitate subsequent analysis.
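To make the extraction step concrete, the snippet below is a minimal sketch of pulling one host's metrics from a restored InfluxDB v2 instance and exporting them to CSV with the influxdb-client Python package. The bucket, measurement, tag, and host names are hypothetical placeholders, not the actual ones used at INFN-CNAF.

```python
# Hedged sketch: query a restored InfluxDB v2 instance with Flux and export to CSV.
# Bucket, measurement, tag, and host names below are illustrative placeholders.
import pandas as pd
from influxdb_client import InfluxDBClient

flux_query = '''
from(bucket: "one_month")
  |> range(start: 2024-01-01T00:00:00Z, stop: 2024-02-01T00:00:00Z)
  |> filter(fn: (r) => r._measurement == "storm_webdav")
  |> filter(fn: (r) => r.host == "webdav-host-01")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    result = client.query_api().query_data_frame(flux_query)

# query_data_frame may return a list of DataFrames when the result has several tables
df = pd.concat(result) if isinstance(result, list) else result
df.to_csv("webdav-host-01_one_month.csv", index=False)
```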
3.2. Metrics Analysis Through Dashboard
The initial step involved analyzing metrics related to system components such as CPU, disks, network interfaces, memory, and WebDAV performance. INFN-CNAF provided an interactive dashboard featuring visualizations of key metrics, enabling pattern and anomaly identification. While not all metrics from the database dump are displayed, the dashboard focuses on the most relevant ones. Metrics are categorized into three groups: general, detailed, and StoRM WebDAV metrics.
General metrics: The “general” group of metrics provides an overall view of the system’s main components, offering insights into trends and variations affecting infrastructure performance. Key metrics include CPU usage details, such as user and system process times, idle time, and time spent on I/O operations, as well as network traffic metrics that monitor incoming and outgoing data and analyze traffic per interface. These metrics enable basic monitoring of the CPU’s operational state, workload distribution, and network capacity, helping to identify potential issues like bottlenecks and fluctuations.
Figure 1 is an example of the dashboard visualization related to CPU usage metrics.
Detailed metrics: This group includes metrics for monitoring system memory and load, offering detailed insights into resource usage and system performance. Metrics include load average, which tracks the average number of running or waiting processes across hosts, and various memory-related metrics such as memory used, memory cached, memory buffered, memory free, and memory swap used. These provide a clear view of RAM usage, cached and buffered data, available memory, and swap memory utilization. This group enables a comprehensive assessment of resource utilization and helps identify areas for optimization.
In
Figure 2, the section of the dashboard concerning the load average can be observed, which is a metric representing the average load within a system. It indicates the average number of processes running or waiting to be executed by the CPU, divided across the various available hosts.
StoRM WebDAV metrics: This group focuses on monitoring the WebDAV service, providing insights into service performance and resource usage. Key metrics include HTTP request rate, reflecting the volume of client requests, and workload metrics such as active dispatches, thread pool size, and thread pool utilization, which together depict system processing capacity and thread usage. Error monitoring is supported by the 4xx and 5xx error rate metrics, identifying client- and server-side issues, respectively. JVM threads metrics offer details on active, blocked, and executable threads within the Java Virtual Machine. Data transfer performance is tracked via TPC (ThirdPartyCopy) pull throughput, which measures data speed and volume, and TPC pull success/error, highlighting the success rate of server-to-server data transfers. These metrics provide a detailed view of WebDAV service health and efficiency.
In
Figure 3, the section of the dashboard related to the “HTTP request rate” can be observed. This metric measures the number of HTTP requests received within a given time interval, thus reflecting the rate at which clients make requests to the server via the HTTP protocol.
3.3. Separation of Normal and Anomalous Observations
The criterion for distinguishing between normal and anomalous situations was determined through a detailed analysis of dashboard metrics and discussions with INFN-CNAF system technicians. A consistent anomaly pattern was identified: during system crashes or malfunctions, the “TPC pull success/error” metric exhibited a clear interruption in data reception. This interruption, as illustrated in
Figure 4, was visually evident as a gap in the metric’s graphical representation.
This gap does not reflect the absence of TPC pull requests but rather a disruption in monitoring processes that prevent the system from recording data. Such disruptions indicate a specific system malfunction, which is the sole anomaly addressed in this article. The absence of values in this metric often coincides with gaps in other dashboard metrics, further confirming system inactivity during these periods.
The anomalous behavior identified by the “TPC pull success/error” metric is corroborated by other metrics in the dashboard, reinforcing its reliability as an indicator of this particular type of system malfunction. This metric was used to delineate periods of normal activity from those characterized by disruptions, focusing exclusively on this specific anomaly.
Our study centers on accurately predicting this anomaly by analyzing the sequence of indicators collected in the temporal window preceding its occurrence.
3.4. Data Pre-Processing and Feature Selection
After identifying anomalous moments, the next phase involves defining and organizing the temporal windows used as input for the predictive model. The first decision made concerned the type of data to use. The available data have different granularities, defined based on the specific retention policy applied. There are four available retention policies, as described previously:
one_week, with data every 5 min and a retention period of one week;
one_month, with data every 15 min and a retention period of one month;
six_month, with data every 2 h and a retention period of six months;
all_data, with data every 3 h and infinite retention.
The best choice would be to adopt the “one_week” retention policy, which would allow for a higher granularity. With this choice, more detailed data could be obtained, collected more frequently and taken directly from the sensors, without intermediate processing such as averaging. However, the available data are limited, so a compromise was sought between the Precision of the data and their quantity. It was decided to use the “one_month” retention policy, which provided a significant number of observations while maintaining reasonable Precision in the measurements, albeit averaged. This was considered the most sensible choice to obtain a meaningful amount of data, as otherwise the sample would have been too small to allow for an effective and adequate analysis. An initial pre-processing of the data was then carried out, with the goal of better organizing the sensor data from the various CSV files obtained previously. Using a Python script based on the Pandas library, the CSV files were manipulated into a more structured format, providing a clear view of the measurements associated with each time point. Another important decision concerns the organization of the time windows to be considered in a single observation. A single observation is not treated as an isolated time point but as a time window composed of multiple time instances, in order to capture the evolution of the various metrics rather than a static snapshot of a single moment. The data were therefore organized in three distinct modes, each characterized by a different time window size (a minimal sketch of this windowing step is given after the list below):
Time window equal to 2: equivalent to observing the data from the 30 min preceding the anomalous time instant.
Time window equal to 3: equivalent to observing the data from the 45 min preceding the anomalous time instant.
Time window equal to 5: equivalent to observing the data from the 75 min preceding the anomalous time instant.
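The following is a minimal sketch of the windowing step described above, under the assumption that the pre-processed data sit in a Pandas DataFrame with one row per timestamp, the 32 selected metrics as columns, and a hypothetical "is_anomaly" column marking the labeled instants; column names are placeholders.

```python
# Hedged sketch: build overlapping windows of W consecutive time points, each
# labeled with whether the instant that follows the window was marked anomalous.
import numpy as np
import pandas as pd

def build_windows(df, feature_cols, window=5):
    df = df.sort_values("timestamp").reset_index(drop=True)
    X, y = [], []
    for end in range(window, len(df)):
        X.append(df.loc[end - window:end - 1, feature_cols].to_numpy())  # W preceding points
        y.append(int(df.loc[end, "is_anomaly"]))                          # label of the next instant
    return np.stack(X), np.array(y)   # X: (N, window, n_features), y: (N,)
```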
Subsequently, a crucial phase was undertaken: the selection of the metrics to be used. The metrics were selected following an approach based on the analysis of their significance and redundancy. To this end, correlation matrices were computed among the various metrics to identify those that were potentially redundant or uninformative. Metrics identified as such were removed to optimize the dataset and reduce noise, while most of the metrics were retained, as no sufficient evidence or strong reasons emerged to justify their elimination, ensuring the most comprehensive dataset possible.

A total of 32 metrics were selected. Specifically, 11 of these relate to CPU usage, addressing aspects such as the distribution of time spent by the processor on various activities, including time dedicated to system processes or user processes. Another 8 metrics focus on memory and system load, describing the total memory usage, cache, and buffer memory, as well as providing insights into the system’s average load. A group of 11 metrics is specifically dedicated to monitoring the WebDAV service, providing data on HTTP request activity, thread usage for connection management, and the error or success rates of operations. Finally, 2 metrics describe network traffic, measuring the volume of data received and transmitted through the network interfaces.

Upon completion of this phase, the data were organized into time windows in a clear and structured manner and divided into normal and anomalous observations, ready to be used in the subsequent training phase. The resulting dataset consists of approximately 39,000 normal observations and 200 anomalous observations, where each observation is represented by multiple consecutive time points, determined by the size of the time window. Each time point is described through the 32 features previously selected. To address the class imbalance, Oversampling strategies were considered during training, such as replicating the observations belonging to the minority class, namely the anomalies, to increase their presence in the training dataset. This approach balances the number of examples between the classes, improving the model’s ability to effectively learn the distinctive characteristics of the anomalies.
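As a concrete illustration of the Oversampling strategy just described, the sketch below replicates anomalous windows (sampling with replacement) until the two classes have the same number of training examples; X and y are assumed to come from the windowing step above, and this should be applied to the training split only.

```python
# Hedged sketch: balance the classes by replicating minority-class (anomalous) windows.
import numpy as np

def oversample_minority(X, y, seed=0):
    rng = np.random.default_rng(seed)
    anom_idx = np.where(y == 1)[0]
    norm_idx = np.where(y == 0)[0]
    extra = rng.choice(anom_idx, size=len(norm_idx) - len(anom_idx), replace=True)
    keep = np.concatenate([norm_idx, anom_idx, extra])   # equal counts per class
    rng.shuffle(keep)
    return X[keep], y[keep]
```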
3.5. Dataset Description
After completing all the previous pre-processing operations, it is possible to provide a comprehensive and detailed description of the dataset. The dataset includes time series data related to various system metrics, such as CPU usage, memory utilization, network traffic, and additional specific parameters monitoring the operational state of the WebDAV service, as detailed in
Section 3.2. The dataset contains approximately 39,000 normal observations and 200 anomalous observations. Each observation is represented by a sequence of consecutive time points, described through the 32 features derived from the monitored metrics. This highly imbalanced distribution reflects the rarity of anomalies in a stable operational environment, a typical challenge in anomaly detection contexts. The available data have varying levels of granularity, based on different retention policies, which are described in detail in
Section 3.4. For this study, the “one_month” retention policy was chosen, providing data at 15 min intervals. This choice allowed for balancing temporal granularity with the overall availability of data, ensuring a sufficiently large and detailed sample for analysis. Anomalies were identified through an in-depth analysis of the metrics and consultation with system experts. A key indicator for anomaly classification was the parameter related to the success or failure of TPC transfer requests. During system malfunctions, this metric displayed evident interruptions, visible as gaps in the temporal graphs, as detailed in
Section 3.3. These anomalous behaviors were used to label the corresponding periods as anomalies, while periods without such phenomena were classified as normal.
At an early stage of the research, event logs were considered as additional inputs. However, they did not exhibit a significant correlation with temporal metrics and did not enhance the effectiveness of the predictive model.
3.6. Overview of Used Architectures
During the training phase, several architectures were used. This section will describe and analyze the main architectures that showed the best performance. The code for the model implementations was developed using PyTorch 2.4.0, one of the most widely used and flexible deep learning libraries. It was chosen for its high degree of customization for training and the availability of numerous tools for monitoring the training process. In fact, Tensorboard, a monitoring library that allows visual tracking of performance metrics, was used in combination with PyTorch, facilitating comparative analysis between the different architectures developed. The models used are as follows:
Single-Layer Perceptron [
49]: One of the most basic artificial neural network models. This architecture was chosen because, despite its simplicity, it represents an ideal starting point for understanding the magnitude of the problem and developing more complex neural networks. It consists of a single layer of neurons, i.e., the output layer, which directly receives input from the initial data and provides an output result. Therefore, there are no hidden layers, making the architecture capable of only making linear predictions. It therefore allows for an understanding of the basic characteristics of the data without introducing unnecessary complexity. However, this simplicity limits it to being used as a baseline rather than a true definitive solution.
Shallow Network: A more advanced neural network architecture compared to the Single-Layer Perceptron, it represents a step forward in tackling more complex problems. The addition of the hidden layer, which processes data from the input layer and applies a nonlinear activation function to the result of the processing, enables the network to learn nonlinear relationships between the input features, significantly improving its classification ability. It therefore introduces greater flexibility and represents a fundamental intermediate step for understanding how the addition of a hidden layer and the ability to model nonlinear relationships actually affect the results. However, it remains limited in its ability to capture the temporal dynamics of the data, as it lacks a structure specifically designed to handle sequential data.
Random Forest [
50]: A supervised learning model that uses a large number of independent decision trees to make predictions or classifications. Each tree is constructed using the bootstrap method, meaning a random subset of the Training-Set data is used. Additionally, at each node of the tree, a random subset of the available features is selected to determine the best split, which is crucial for introducing diversity among the trees. In classification tasks, the result is obtained through majority voting, where each tree “votes” for a class, and the class with the highest number of votes is returned as the final output. Therefore, it represents a powerful and flexible approach for classification, thanks to the combination of multiple independent decision trees. It was selected for its robustness against noisy datasets and its ability to identify complex patterns without assuming linear relationships. This approach does not require extensive pre-processing and proves to be versatile, but, once again, the lack of an inherently temporal structure limits its ability to analyze data with sequential dependencies.
Long Short-Term Memory [
51,
52]: An architecture suitable for processing sequential or temporal data. Its operation is based on a complex internal structure that allows the network to learn both long-term and short-term dependencies through the use of memory cells and gating mechanisms. The main variables used by the architecture to maintain information over time are the cell state $c_t$, which represents the long-term memory of the LSTM units, and the hidden state $h_t$, which represents the short-term memory. These variables are updated through gating mechanisms that control the flow of incoming information, memory updates, and the output produced by the network. Specifically, the Forget Gate Layer controls which information in the memory cell should be forgotten or preserved, the Input Gate Layer determines which new information should be added to the memory cell, and the Output Gate Layer decides which part of the memory cell state $c_t$ should be used to calculate the output at the current time. This architecture is an optimal choice for processing the available data, as it leverages the LSTM’s ability to learn temporal relationships and retain relevant information across sequences, making it particularly well-suited to the specific problem addressed. However, it requires careful optimization to balance generalization capability with the risk of overfitting.
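For concreteness, the sketch below shows a minimal PyTorch LSTM binary classifier for the windowed metric data, assuming inputs of shape (batch, window, 32). The layer sizes and dropout are illustrative, not the configuration reported in Table 1.

```python
# Hedged sketch: LSTM classifier mapping a metric window to an anomaly probability.
import torch
import torch.nn as nn

class LSTMAnomalyPredictor(nn.Module):
    def __init__(self, n_features=32, hidden=64, layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=layers, batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                          # x: (batch, window, n_features)
        out, (h_n, c_n) = self.lstm(x)             # h_n: (layers, batch, hidden)
        return torch.sigmoid(self.head(h_n[-1]))   # probability of "anomaly"
```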
3.7. Training
The previously described architectures underwent a training process aimed at identifying the optimal configuration of hyperparameters. This optimization was carried out using random search and grid search techniques, enabling a systematic and thorough exploration of a wide range of possible configurations. Among the hyperparameters analyzed, particular attention was given to defining the structure of the neural networks, including parameters such as the number of layers and the number of neurons per layer. The selection of these parameters was guided by theoretical considerations that took into account the specific characteristics of the dataset, such as the number of features and the complexity of their relationships, as well as practical experiments aimed at balancing the model’s ability to learn complex representations without overfitting.
Regarding the network’s depth, various configurations were tested, from simpler architectures to deeper networks, to identify a compromise between model complexity and generalization ability. Similarly, the number of neurons per layer was optimized through an iterative process: initial configurations with an excessive number of neurons showed a significant increase in training time without proportional improvements in performance.
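To illustrate the random-search procedure described above, the sketch below samples candidate configurations from a hand-defined search space and keeps the one with the best validation score. The hyperparameter values are illustrative, and evaluate_config is a hypothetical helper that trains a model with the given configuration and returns its Validation-Set F1-Score.

```python
# Hedged sketch: random search over an illustrative hyperparameter space.
import random

search_space = {
    "hidden_units": [32, 64, 128],
    "num_layers": [1, 2, 3],
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "dropout": [0.0, 0.2, 0.5],
    "window_size": [2, 3, 5],
}

def random_search(evaluate_config, n_trials=30, seed=0):
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: random.choice(v) for k, v in search_space.items()}  # sample one configuration
        score = evaluate_config(cfg)                                  # hypothetical training + validation
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```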
Regularization techniques were also considered to improve the model’s performance and generalization, including dropout and L2 regularization, both of which proved effective in reducing the risk of overfitting.
Another hyperparameter analyzed was the choice of optimizer. After a comparative analysis, the Adam optimizer (Adaptive Moment Estimation) [
53] was selected, known for its ability to combine the advantages of stochastic gradient-based optimization with adaptive learning rate adjustments, ensuring more stable and faster convergence.
Training was conducted over a maximum of 100 epochs to ensure adequate learning, but the early stopping mechanism was implemented to avoid overfitting. This approach allowed the training process to stop early if improvements in Validation-Set performance were no longer significant for a predefined number of consecutive epochs. This criterion optimized computational efficiency, reducing training time, and ensured that the final model maintained good generalization capabilities.
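The sketch below summarizes the training loop just described: Adam with L2 regularization via weight decay, up to 100 epochs, and early stopping on a validation metric. The train_one_epoch and evaluate callables are hypothetical helpers supplied by the caller, and the learning rate, weight decay, and patience values are illustrative.

```python
# Hedged sketch: Adam optimization with early stopping on a validation score.
import copy
import torch

def fit(model, train_one_epoch, evaluate, train_loader, val_loader,
        max_epochs=100, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    best_score, best_state, stale_epochs = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # one pass over the Training-Set
        score = evaluate(model, val_loader)               # e.g., Validation-Set score to monitor
        if score > best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:                  # early stopping criterion
                break
    model.load_state_dict(best_state)                     # restore the best epoch
    return model, best_score
```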
Overall, the optimization process and training design were crafted to ensure the model achieved optimal performance, effectively balancing Accuracy on the Training-Set with the ability to generalize to test data. Full details on the optimal hyperparameter values and the configurations selected are provided in the following sections.
3.8. Overview of Loss Functions Used
During the training phase, several loss functions were used to improve the results obtained. Some of these functions are provided natively by PyTorch; others were modified to better fit the specific task, while some were implemented manually as they were not available among the predefined ones. Below, the main loss functions used during the training phase will be described and analyzed.
3.8.1. Binary Cross-Entropy
The first loss function used is Binary Cross-Entropy (BCE) (
https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html, accessed on 2 January 2025), one of the most commonly used loss functions in binary classification problems. It measures the dissimilarity between the probabilities predicted by the model and the actual target values, assigning a higher penalty to more significant errors. It is based on the concept of entropy and is calculated for each sample independently. BCE, as defined in the literature (see PyTorch documentation (
https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html, accessed on 2 January 2025)), is expressed as follows:
$$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log\!\left(\hat{y}_i\right) + \left(1 - y_i\right)\log\!\left(1 - \hat{y}_i\right) \right]$$
where $N$ is the number of samples, $y_i$ is the actual target value, and $\hat{y}_i$ is the probability predicted by the model.
Binary Cross-Entropy heavily penalizes incorrect predictions, particularly when the prediction deviates significantly from the actual target value. In fact, due to the logarithmic nature of the function, it penalizes errors nonlinearly, amplifying the penalty for substantial errors and reducing it for smaller errors. However, one disadvantage of BCE emerges when there is class imbalance in the dataset, as the model tends to give more weight to the majority class, neglecting instances of the less frequent class. To address this issue, two different approaches were used. The first involves using a common variant of Binary Cross-Entropy, namely, Weighted Binary Cross-Entropy, which assigns different weights to the classes based on their frequency in the dataset. This approach leverages the “weight” parameter to adjust class importance (see PyTorch documentation (
https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html, accessed on 2 January 2025)) and is defined as follows:
$$\mathrm{WBCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, w_1\, y_i \log\!\left(\hat{y}_i\right) + w_0\left(1 - y_i\right)\log\!\left(1 - \hat{y}_i\right) \right]$$
where $w_1$ and $w_0$ are the weights assigned to classes 1 and 0, respectively, to balance the contribution of the less frequent classes compared to the majority class. The second approach, which was ultimately used, is based on the Oversampling technique described earlier. This technique involves replicating examples from the minority class (i.e., anomalies) to increase their presence in the dataset. This allows the Binary Cross-Entropy Loss to function effectively because the classes have an equivalent number of samples. In conclusion, Binary Cross-Entropy is an extremely effective loss function for binary classification, as it allows for penalizing incorrect predictions based on their severity. This enables the model to converge more quickly to an optimal solution and progressively improve its ability to distinguish observations from different classes.
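As a concrete illustration of the weighted variant, the sketch below builds per-sample weights from the class weights $w_1$ and $w_0$ and passes them to PyTorch's binary cross-entropy; this is one way to realize class weighting with the "weight" argument, and the weight values shown are illustrative.

```python
# Hedged sketch: class-weighted BCE via per-sample weights derived from the targets.
import torch
import torch.nn.functional as F

def weighted_bce(pred, target, w1=10.0, w0=1.0):
    # pred: predicted probabilities in [0, 1]; target: float tensor of 0./1. labels
    weights = target * w1 + (1.0 - target) * w0        # weight each sample by its class
    return F.binary_cross_entropy(pred, target, weight=weights)
```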
3.8.2. Binary Cross-Entropy in Combination with F1-Score
Another loss function used during the training of the model is a combination of Binary Cross-Entropy and F1-Score, defined through a “Composite Score”. This combination is designed to focus the optimization not only on minimizing the loss but also on maximizing the F1-Score. However, it is important to clarify that Binary Cross-Entropy is the actual loss function used to update the model’s weights during the optimization process. The Composite Score, on the other hand, is used as a criterion to compare and select the optimal configuration of the model, taking into account not only the minimization of BCE but also the ability to achieve a high F1-Score. During the process of selecting the best epoch of the model, based on the performance obtained on the Validation-Set, the Composite Score allows for balancing the importance of BCE and F1-Score, favoring configurations that are more effective at detecting anomalies. This combination is defined as follows:
$$\text{Composite Score} = \alpha \cdot \text{F1-Score} - \beta \cdot \mathrm{BCE}$$
where $\alpha$ is the parameter that balances the importance of the F1-Score and $\beta$ is the parameter that balances the importance of Binary Cross-Entropy. These two parameters were treated as hyperparameters in order to find the best configuration that maximizes the model’s performance. In conclusion, this composite metric proves to be an excellent solution in Anomaly Detection contexts, as it allows for balancing error minimization with the maximization of the model’s ability to correctly identify anomalies, thus ensuring greater attention to the most critical observations.
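The sketch below shows how such a composite selection criterion can be computed on the Validation-Set, assuming the linear combination reconstructed above (higher is better); the default $\alpha$ and $\beta$ values follow Section 3.9, and the 0.5 decision threshold is an assumption.

```python
# Hedged sketch: composite epoch-selection criterion combining F1-Score and BCE.
import torch
import torch.nn.functional as F

def composite_score(pred, target, alpha=0.7, beta=0.3, eps=1e-8):
    bce = F.binary_cross_entropy(pred, target).item()
    hard = (pred >= 0.5).float()                 # thresholded predictions for F1
    tp = (hard * target).sum().item()
    fp = (hard * (1 - target)).sum().item()
    fn = ((1 - hard) * target).sum().item()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return alpha * f1 - beta * bce               # higher is better
```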
3.8.3. Soft-F1 Loss
An additional loss function used during training is the Soft-F1 Loss, a function introduced by Joan Pastor-Pellicer et al. in 2013 [
54]. As described in the article, this loss function was specifically designed for imbalanced datasets, where traditional loss functions often do not yield good results. This makes it particularly suitable for tasks like anomaly detection, where the F1-Score is a more relevant metric than Accuracy. However, since the F1-Score is non-differentiable, it cannot be used directly for training models via backpropagation. The Soft-F1 Loss addresses this issue by making the F1-Score continuous and differentiable, treating predictions as probabilities, so that it can be used within the training process directly as a loss function. The idea behind the Soft-F1 Loss is to replace True Positives (TPs), False Positives (FPs), and False Negatives (FNs) with their probabilistic versions, defined as follows:
$$\mathrm{TP}_{soft} = \sum_{i} \hat{y}_i\, y_i, \qquad \mathrm{FP}_{soft} = \sum_{i} \hat{y}_i \left(1 - y_i\right), \qquad \mathrm{FN}_{soft} = \sum_{i} \left(1 - \hat{y}_i\right) y_i$$
where the following are true:
$\hat{y}_i$ is the model’s output, i.e., the predicted probability.
$y_i$ is the actual target value, either 0 or 1.
The calculation of the Soft-F1 follows the standard F1-Score formula, but with these probabilistic versions:
$$\text{Soft-F1} = \frac{2\,\mathrm{TP}_{soft}}{2\,\mathrm{TP}_{soft} + \mathrm{FP}_{soft} + \mathrm{FN}_{soft}}$$
This function allows for directly optimizing the F1-Score during the training process, thus maximizing the model’s ability to balance Precision and Recall. An additional extension of the Soft-F1 Loss is the Soft-F$_{\beta}$ Loss, a parameterized version of the F1-Score that allows for giving more weight to either Precision or Recall. This is particularly useful in cases where it is preferable to optimize one of the two metrics over the other. In anomaly detection tasks, for example, it might be preferable to optimize Recall more to ensure that most anomalies are correctly identified, even at the cost of a slight increase in false positives. The formula that defines the Soft-F$_{\beta}$
, as reported by Joan Pastor-Pellicer et al. [
54], is as follows:
$$\text{Soft-F}_{\beta} = \frac{\left(1 + \beta^{2}\right) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}$$
where $\beta$ is the parameter that regulates the model’s behavior based on the specific needs of the task, meaning it favors either Precision or Recall. These differentiable versions of the F1-Score and F$_{\beta}$-Score are therefore very useful in contexts like anomaly detection, where the primary objective is the identification of anomalies. They allow the optimization of the model to focus on forecasting rare events like anomalies. The integration of these functions directly into the backpropagation process could lead to an improvement in the model’s performance in these specific contexts compared to the use of traditional loss functions.
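The sketch below is a minimal differentiable implementation of the Soft-F$_{\beta}$ Loss built from the probabilistic counts above (with $\beta = 1$ giving the Soft-F1 case); returning one minus the soft score is a common way to turn it into a loss, and is an assumption rather than the authors' exact implementation.

```python
# Hedged sketch: differentiable Soft-F1 / Soft-F_beta loss from probabilistic TP/FP/FN.
import torch

def soft_fbeta_loss(pred, target, beta=1.0, eps=1e-8):
    # pred: predicted probabilities; target: float tensor of 0./1. labels
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
    return 1.0 - fbeta        # minimize 1 - Soft-F_beta (beta = 1 gives Soft-F1)
```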
3.9. Explored Hyperparameters and Best Configurations
In
Table 1, the hyperparameters explored for each architecture and the corresponding best values identified can be observed.
As can be observed, in all four configurations, the time window size is set to 5. This highlights the necessity of having a sufficiently large time window to correctly identify the highest number of anomalies, allowing the models to capture complex patterns and long-term relationships within the data. This choice is particularly relevant for forecasting not only sudden anomalies but also those that emerge gradually over time. Additionally, it can be noted that the loss function yielding the best results in neural network architectures is the one that combines Binary Cross-Entropy with the F1-Score. This combination has allowed the focus to shift toward configurations that not only minimize the Binary Cross-Entropy Loss but also achieve a higher F1-Score. This approach favors configurations where the model is more effective at detecting anomalies. Specifically, the values assigned to the $\alpha$ and $\beta$ parameters within the composite loss are 0.7 and 0.3, respectively. Finally, in all four architectures, the Oversampling technique with the replication of anomalous examples produced the best results compared to the use of alternative balancing techniques. This approach allowed for an increase in the representativeness of the anomalous class, significantly improving the model’s ability to identify anomalies and minimizing the risk of overlooking critical events.
4. Results
In the following section, the results obtained with the different models used in the study will be described and analyzed. The section will be divided into two parts: first, the results on the Test-Set for the various models used will be presented, followed by an analysis of a case example using Saliency Maps [
55], a method based on the gradient analysis of the model’s prediction with respect to the input. It works by calculating the partial derivatives of the model’s prediction with respect to each input feature, thus determining how much each feature contributes to the final outcome. Features with higher gradient values are highlighted as particularly influential. In this way, it will be possible to identify the most significant features for anomaly detection and improve the interpretability of the model.
4.1. Results on the Test-Set
In this section, the results obtained on the Test-Set by the various architectures will be presented. The metrics used to express the results include Accuracy, Precision, Recall, and F1-Score. The results are shown in
Table 2.
From the table, it is possible to analyze the results of the various architectures:
Single-Layer Perceptron: It achieves relatively stable baseline performance in terms of Accuracy but shows some difficulty in generalizing to the test data. Precision is very low, suggesting that the model struggles to accurately distinguish the positive class, while Recall remains relatively high at the expense of Precision. The F1-Score reaches only modest values, confirming the limitations of the architecture.
Shallow Network: There is a significant improvement compared to the previous architecture. Accuracy reaches good values, indicating an enhanced ability of the model to generalize to unseen data. Precision also improves significantly, suggesting that the model is more accurate in correctly distinguishing positive classes compared to the previous architecture. Recall remains high, indicating that the model maintains a strong ability to identify most positive examples. The overall improvement is confirmed by the increase in the F1-Score, which reaches a good value. This highlights the need to use models capable of capturing nonlinear relationships to address the problem and ensure better generalization.
Random Forest: It achieves excellent Accuracy, indicating that the model has a good ability to generalize to unseen data. Precision reaches the best results among all the architectures, indicating that the model is very effective in reducing false positives, successfully distinguishing positive examples. However, Recall shows a slight decrease, demonstrating a good ability to correctly identify positive examples, but with a slight drop compared to Precision. The F1-Score confirms the model’s good ability to balance the correct classification of positive examples with the reduction in false positives, achieving slightly lower but very similar results to the previous architecture.
Long Short-Term Memory: It achieves excellent Accuracy, indicating a very good generalization ability on unseen data. Precision reaches fairly high values, demonstrating that the architecture is effective in reducing false positives and accurately distinguishing positive examples. Recall remains high, indicating that the model possesses a strong ability to identify the majority of positive examples. Finally, the F1-Score reaches the highest values among all models, confirming the model’s ability to balance the correct classification of positive examples with the reduction in false positives.
The results clearly show that the LSTM model achieves the best overall performance, leveraging its ability to capture long-term relationships and use them for more accurate decision making. In particular, Accuracy is the highest among the models, indicating excellent capability in correctly classifying samples. The Random Forest and Shallow Network follow closely behind, with values 0.76% and 1.55% lower, respectively. Regarding Precision, the LSTM achieves a good score, demonstrating solid capability in reducing false positives. The Random Forest stands out particularly in this metric, achieving a result 3.91% higher than the LSTM and 14.8% higher than the Shallow Network, while the Single-Layer Perceptron exhibits significantly lower values. For Recall, both the LSTM and Shallow Network achieve the best values, highlighting their greater ability to correctly identify anomalies in the data. The Random Forest ranks further behind, with values approximately 14% lower. Regarding F1-Score, which represents a trade-off between Precision and Recall, the LSTM achieves the best results, indicating an excellent balance between reducing false positives and correctly identifying anomalies. In particular, it achieves results 5.1% better than the Shallow Network and 5.5% better than the Random Forest, while the Single-Layer Perceptron confirms itself as the worst-performing model, achieving values 64.1% lower.
4.2. Saliency Map for Key Feature Identification
In Anomaly Detection contexts, it is particularly useful not only to be alerted to the onset of a potential anomaly but also to obtain information about the nature of the anomalies within the system and understand the cause of the problem generating them. This type of information can be crucial: being notified of an anomaly and having indications of what is causing it is essential for timely intervention and preventing potential damage or service interruptions. In particular, a domain expert could greatly benefit from receiving insights into the nature of the problem, as they would be able to interpret the system’s signals more accurately and quickly and identify the causes of the anomalous behavior. For this purpose, one of the most useful tools is represented by Saliency Maps [
55]. These significantly improve the interpretability of machine learning models, particularly neural networks, which are often considered “black boxes” due to their complexity and the large number of parameters involved, making it difficult to understand which factors most influence the model’s decisions. Saliency Maps overcome this obstacle by providing a representation of the importance of the different features in predicting a specific output. The concept behind Saliency Maps is based on gradient analysis to highlight the importance of features with respect to the model’s output. In practice, this involves calculating the gradient of the output with respect to the input, determining how small variations in the features influence the output. For neural networks, this is equivalent to calculating the partial derivative of the output with respect to each input feature, providing a measure of each feature’s impact on the prediction. Features with a greater absolute gradient value are considered more relevant, as small variations in these characteristics cause significant changes in the output. Saliency Maps, therefore, help identify the most critical features in the model’s prediction, providing a clear representation of what the model focused on to make its decision. It is possible to observe the benefits of this technique in the example case shown in
Figure 5, where an anomaly in the WebDAV service is detected using the “TPC pull success/error” metric.
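Before examining this case, the sketch below shows how such per-feature importances can be extracted via input gradients, assuming a model that maps a windowed input of shape (window, n_features) to an anomaly probability (e.g., the LSTM sketched in Section 3.6). The aggregation used here (sum of absolute gradients over the window) and the feature_names list are assumptions for illustration, not necessarily the exact scheme behind Tables 3 and 4.

```python
# Hedged sketch: gradient-based saliency for a single windowed observation.
import torch

def top_salient_features(model, window, feature_names, k=3):
    model.eval()
    x = window.detach().unsqueeze(0).clone().requires_grad_(True)  # (1, window, n_features)
    score = model(x).squeeze()              # predicted anomaly probability
    score.backward()                        # gradients of the output w.r.t. the input
    saliency = x.grad.abs().sum(dim=(0, 1))                        # one value per feature
    ranking = torch.argsort(saliency, descending=True)[:k]
    return [(feature_names[int(i)], float(saliency[int(i)])) for i in ranking]
```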
The anomaly in question was examined by analyzing the metrics displayed on the INFN-CNAF dashboard in order to understand its causes and identify the reason for the detected anomalous behavior. Observing the metrics on the dashboard for the time period associated with the identified anomaly, illustrated in
Figure 5, it becomes evident that the anomaly originated from an increase in active dispatch requests to manage, as shown in
Figure 6. An increase in active dispatches indicates a rise in the number of requests that the system must assign to various processes or computation nodes for processing. The sudden and consistent increase in these requests gradually saturated the system’s resources, preventing it from effectively handling the workload. Consequently, a critical overload occurred, leading to a temporary shutdown, interrupting services and preventing the system from responding correctly to future requests. In
Figure 6, the moment when the system reached saturation and caused a crash or temporary shutdown is highlighted with a red circle. This coincides with the last time we received data from the metric measuring the TPC pull success/error requests, shown in
Figure 5. A blue circle also highlights the point when the system was restarted and resumed normal operation.
This overload is confirmed by observing another metric, namely, the thread pool utilization, as shown in
Figure 7. Indeed, the graph clearly shows a gradual and consistent increase in thread pool utilization, starting from about 20% and reaching 100%, indicating full saturation of the available computational resources. The constant increase and subsequent saturation of thread pool utilization coincide with the peak of active dispatches observed earlier, confirming that the increase in requests led to an overload of the available resources, resulting in a system crash.
Once the cause of the anomaly is understood, the previously trained Shallow Network is used, with the anomalous observation being provided as input to obtain a prediction. The prediction correctly indicates that the provided observation is classified as anomalous. By using the Saliency Map technique, it is possible to display a list of the most important features that the network used to detect the anomaly. For this specific case, the three most important metrics for identifying the anomaly are shown, with their corresponding importance represented by the aggregated gradient values, in
Table 3. The metrics shown in the table are sorted in descending order, from the most important to the least important.
Comparing the table with the analysis of the metrics in the INFN-CNAF dashboard carried out earlier, it is immediately clear that the Saliency Map was able to correctly identify the causes of the anomaly. It indicates that the most important metric is “storm.http.thread-pool.utilization.value”, which measures the thread pool utilization rate; as previously shown, this rate steadily increased until reaching 100%, completely saturating the available resources. The second most important metric identified is “storm.http.handler.dispatches.m1_rate”, which measures the rate of correctly managed dispatches relative to the total incoming dispatches. Like the previous metric, this one also showed a progressive increase until saturation, being a direct cause of the system’s overall overload and subsequent crash. The third metric returned by the Saliency Map, “storm.http.handler.5xx-responses.m1_rate”, measures the rate of 5xx responses (server errors). Its importance reflects an increase in server-side errors due to the inability to handle incoming requests, likely linked to the saturation of the thread pool and the increase in unmanaged dispatches: an overloaded system cannot properly complete requests, resulting in 5xx errors.
These features are strongly related to each other and jointly influence the anomalous state of the system. Indeed, the increase in thread pool utilization is closely tied to the rise in the rate of correctly managed dispatches, as a growing number of requests to be processed led to an increasingly higher utilization of the thread pool until it reached full saturation. This overload prevented the system from properly handling incoming requests, causing an increase in 5xx responses, which are server-side errors. In this anomalous scenario, the interdependence between these metrics is evident: the increase in dispatches to be managed saturates the thread pool, and this saturation directly results in the system’s inability to complete requests, generating critical errors.
The results of the Saliency Map, which identify the anomalous state of the system, are also confirmed by an analysis on the LSTM, although some slightly different features were identified. This discrepancy is mainly attributable to architectural differences between the models: while the Shallow Network processes the data statically, an LSTM is designed to capture temporal dependencies in sequential data, tending to identify as relevant features that emerge only in relation to the temporal sequence, leading to a different interpretation of the causes contributing to the anomalous state. The three most important features for the anomaly are shown in
Table 4.
As shown, the feature identified as the most important for the anomaly is the metric “storm.http.handler.dispatches.m1_rate”, which measures the rate of correctly managed dispatches relative to the total incoming dispatches. This feature was also identified by the Shallow Network, highlighting its importance as a direct cause of the system’s overall overload and subsequent crash. The second most important feature is “storm.webdav.TPC.pull.ok-count.count”, which measures the number of successful TPC requests. The presence of this metric is likely due to the increase in handled TPC requests, suggesting a rise in the system’s workload. Although these transfers are completed successfully, the increase represents a demand beyond the standard operational capacity, contributing to a general overload. Finally, the third most important feature is “load_avg.fifteen”, which represents the system’s average load over the last 15 min. The presence of this metric confirms that the system is operating under stress and at the limit of its capabilities, with an increasing risk of performance degradation or a complete crash. By combining these three metrics, it is possible to identify the primary cause of the anomaly, which is consistent and overlaps with that previously identified using the Shallow Network: an increase in the system’s workload, evidenced by the growing number of handled requests and completed transfers, led to significant operational stress, ultimately compromising the system’s stability and resulting in the observed crash.
The use of the Saliency Map proves to be extremely important in such a context, as it helps identify the most relevant metrics that contributed to the anomaly. This provides important insights to a domain expert about what is going wrong within the system, allowing for a timely and targeted intervention to resolve the issue.
5. Conclusions
In this study, the problem of Anomaly Detection in IT systems, specifically in a WebDAV service, has been addressed using a wide range of machine learning models, ranging from simpler architectures like the Single-Layer Perceptron to more complex models like Long Short-Term Memory (LSTM). The main objective of this study was to identify anomalies in time series data obtained from the service’s sensors, tackling challenges such as the imbalanced nature of datasets and the need to capture temporal and spatial dependencies between the analyzed variables.
The analyses clearly showed that while simpler models like the Single-Layer Perceptron offer a useful starting point for data exploration, they are not powerful enough to capture the complexities present in datasets that require accurate modeling of temporal dependencies. More advanced models like LSTMs proved to be much more effective, thanks to their ability to handle temporal relationships in sequential data.
A crucial aspect that influenced the success of the models was hyperparameter tuning. Through random search and grid search processes, configurations such as the loss function, optimizer, learning rate, and, for LSTMs, the number of layers and the number of units per layer were optimized. This significantly improved the model’s performance by reducing overfitting and maximizing generalization capacity. In particular, the use of a combined loss function between Binary Cross-Entropy and F1-Score proved particularly useful for our task.
While the best models demonstrate promising predictive capabilities in terms of Precision and Recall, there is significant room for improvement. Two key limitations are identified: the limited amount of anomalous data and the granularity of the available data. The dataset is highly imbalanced, and while techniques like sample replication and weighted loss functions were applied, they were insufficient for achieving full generalization. Increasing the amount of anomalous data would enhance the model’s ability to learn their distinguishing characteristics. Additionally, the 15 min sampling interval, resulting from downsampling higher-resolution data, may obscure critical details. This loss of granularity can mask subtle patterns and sudden peaks, reducing the model’s ability to detect short-term anomalies. A higher temporal resolution would provide a more accurate representation of the system’s behavior, improving the model’s ability to identify and respond to anomalies effectively.
In conclusion, the work presented shows how the use of machine learning techniques can provide an effective and automated method for predicting anomalies in complex IT systems, enabling continuous monitoring and prompt response to unexpected changes in system metrics. The adopted approach has proven useful in identifying anomalous patterns that would otherwise be difficult to detect with traditional manual analysis methods, requiring constant metric monitoring by domain experts. This work provides a solid foundation for future developments, which could benefit from more complete and detailed data, further improving model performance and strengthening anomaly prediction capabilities in a more precise and reliable manner.