A Taxonomy of Techniques for SLO Failure Prediction in Software Systems

Abstract: Failure prediction is an important aspect of self-aware computing systems. Therefore, a multitude of different approaches has been proposed in the literature over the past few years. In this work, we propose a taxonomy for organizing works focusing on the prediction of Service Level Objective (SLO) failures. Our taxonomy classifies related work along the dimensions of the prediction target (e.g., anomaly detection, performance prediction, or failure prediction), the time horizon (e.g., detection or prediction, online or offline application), and the applied modeling type (e.g., time series forecasting, machine learning, or queueing theory). The classification is derived based on a systematic mapping of relevant papers in the area. Additionally, we give an overview of different techniques in each sub-group and address remaining challenges in order to guide future research.


Introduction
Self-aware computing systems are defined as software systems that (1) learn models capturing knowledge about themselves and their environment (such as their structure, design, state, possible actions, and run-time behavior) on an ongoing basis, and (2) reason using the models (for example predict, analyze, consider, plan), enabling them to act based on their knowledge and reasoning in accordance with their higher-level goals [1]. Therefore, the techniques and algorithms employed in these self-aware computing systems are often required to perform failure prediction as part of the predict or the analyze phase during reasoning. Often, these techniques have to be applied on an ongoing basis or in an online fashion as the system itself is also applied in a changing environment, and the employed models change continuously.
Hence, a variety of different algorithms for the prediction of Service Level Objective (SLO) failures, i.e., the inability to comply with user-defined service quality goals, has been proposed in the literature. These algorithms are based on different techniques, such as time series forecasting, machine learning, or queueing theory; each was designed with a specific use case in mind and offers specific advantages and disadvantages over related approaches.
In this paper, we propose a taxonomy of different approaches for SLO failure prediction, grouping them according to three dimensions: (1) the prediction target (e.g., anomaly detection, performance prediction, or failure prediction); (2) the time horizon (e.g., detection or prediction, online or offline application); and (3) the applied modeling type, i.e., the underlying base technique that was utilized.
For the identification of relevant literature, we refrained from relying on a fixed search term, as recommended by Webster and Watson [10] as well as Kitchenham and Charters [11], since all comprehensive queries would result in too many (>5000) irrelevant search results. Therefore, the well-known method of systematic mapping was applied [12]. We started with a set of initial literature and performed forward and backward passes on those works in order to identify further works in the respective area. This procedure is known as Berry Picking with footnote chasing and backward chaining [13]. We filtered potential works by manually reading title, abstract, and keywords in order to take the context into account, as opposed to applying an automated keyword-based mechanism. In summary, we included a total of 67 scientific papers, eight theses [81][82][83][84][85][86][87][88], three tech-reports [89][90][91], and two US patents [92,93] for the derivation of the taxonomy presented in Section 4. After refining the taxonomy, we re-classified all works to verify that each work was still in the right category and then continued to classify all remaining works by reading their approach in detail. This procedure results in an unbalanced taxonomy tree; however, this is due to the imbalance in the distribution of the underlying works.

Taxonomy
This section introduces our taxonomy for research in the field of failure prediction. The taxonomy has a tree-like, iteratively refined structure, as shown in Figure 1. The works are split along three different dimensions: (1) the prediction target, (2) the time horizon, and (3) the applied modeling type, i.e., the underlying base technique that was utilized. In the following, we briefly explain the distinction criteria for each dimension.

Prediction Target
Our first dimension concerns the prediction target. As there exists a variety of different approaches for the prediction of performance events, different notions of what is being predicted exist. One key difference from related work is that we focus on SLO failure prediction, as opposed to general performance prediction and/or optimization, the prediction of performance anomalies (cf. Section 2), as well as the prediction of failures caused by hardware faults. However, the definition of failure still varies between different communities and can refer to a software fault, a performance anomaly, a malfunction of the system, or a combination thereof.
Hence, the first dimension aims at distinguishing the different notions of failure present in the literature. The first group comprises works on anomaly prediction. Here, the focus is on detecting and/or predicting anomalous behavior of the system, typically described by a combination of time series. An anomaly can be a sudden load spike or an increase in user requests, resource consumption, or error rates.
The second group aims at predicting failures by predicting the general performance properties of a system. The rationale behind this interpretation is that by predicting the general system performance, all performance events, including erroneous or failure states, can be predicted and therefore used for failure prediction. Consequently, these works are usually limited to predicting performance-related failures; however, they are also able to predict normal system states, with or without the presence of failures.
Third, the last group focuses on the definition of failure as defined by Pitakrat et al. [87] and Avizienis et al. [94]: A service failure is defined as an event that occurs when the service deviates from the correct behavior. This includes an increase in response time, a service outage, or an incorrect return result. For example, a failure might follow the following definition: A transaction type is in a failure state if and only if (i) the 95th percentile of the server-side response times of all requests in each 10-s time window exceeds 1 s, or (ii) the ratio of successful requests over all requests in the same window falls below 99.99%.
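As an illustration, the example definition above can be checked programmatically per monitoring window. The following sketch (using the thresholds from the example definition and the nearest-rank percentile method) is purely illustrative and not taken from any of the cited works:

```python
import math

def is_failure_state(response_times_s, n_success, n_total):
    """Check the example SLO failure definition for one 10-s window.

    The window is a failure state if (i) the 95th percentile of the
    server-side response times exceeds 1 s, or (ii) the ratio of
    successful requests falls below 99.99%.
    """
    if n_total == 0 or not response_times_s:
        return False  # no traffic in this window: nothing to judge
    # 95th percentile via the nearest-rank method
    ordered = sorted(response_times_s)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    p95 = ordered[rank]
    success_ratio = n_success / n_total
    return p95 > 1.0 or success_ratio < 0.9999
```

For instance, a window where 6% of requests take 2 s would be flagged via condition (i), while a window with a 99.98% success ratio would be flagged via condition (ii).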
The goal of these works is to cover all non-functional properties of a software system, as opposed to just the performance properties covered by the works in the second group. Therefore, in the following, when we refer to a failure, we imply the definition stated above. Additionally, our definition of failure is also compatible with definitions from other surveys [2], as we only classify behavior as a failure if the issue becomes apparent to the end-user.

Time Horizon
The second dimension of interest is the time horizon of the model learning and model predictions. For most applications, it is important that the failure predictions can be delivered in a timely fashion, desirably even ahead of time [2]. Therefore, we divide all techniques of failure prediction into techniques that can be applied in an online context, and other techniques that serve as an offline analysis tool, delivering theoretical results or hypothetical scenarios. Note that, even though the model prediction is classified as online, the actual model learning could still happen offline.
As the focus of anomaly prediction is implicitly on application in an online context, the notion of time horizon is interpreted differently here. For anomaly prediction, we distinguish between works that focus on the detection of anomalies, usually within a certain time window or after a pre-defined acceptable delay, and works that are able to predict anomalies, i.e., detect anomalies before the respective anomaly manifests in the system. The obvious advantage of prediction is the increased time available to analyze or react to the detected anomaly, while detection algorithms are usually more robust and stable and can implicitly deliver some preliminary analysis.

Modeling Type
Lastly, we partition the works based on the underlying modeling type, i.e., the base techniques that are applied or utilized for performing the prediction. The underlying modeling type implies several characteristics ranging from the required type and amount of information, the time-to-result, and the accuracy deliverable by the approach to the types of failures that are predictable by the approach as well as the amount of explainability (i.e., why the model results in the given prediction). Naturally, not all techniques are applicable to all prediction targets.
For anomaly detection, we distinguish between rule-based approaches and approaches based on time series forecasting. Performance prediction approaches can be divided into approaches based on machine learning, queueing theory, or architectural software modeling. Online failure prediction can be done using rule-based approaches, machine learning, or architectural approaches.
We will discuss each of these groups, as well as the representative works from these groups in the following Section 5.

Survey Results
This section presents an overview of the related work and sorts all works into the respective groups, according to the taxonomy introduced in Section 4. First, the prediction target, as discussed in Section 4.1, defines what type of prediction a work is aiming at. Along this dimension, we split all related works into three groups. We discuss the area of event and anomaly prediction in Section 5.1, the area of general performance prediction in Section 5.2, and the area focusing on the prediction of actual failures in Section 5.3. Finally, we summarize all results in Section 5.4 by aggregating our classification into one table.
The first group, event and anomaly prediction, is outlined in Section 5.1. These works focus on the prediction and detection of special events and/or anomalies. Detected or predicted events are usually not analyzed as to whether or not they have a critical or negative impact on the application performance. For example, an anomaly could be an unusual spike in the number of concurrent users of a system. However, it is unclear whether or not this load spike leads to a performance problem for the application.
Second, we describe works generally targeted at performance prediction in Section 5.2. These works focus on the performance prediction of software systems, usually targeted towards what-if analyses and/or optimization of the respective software System Under Study (SUS). Therefore, they are usually generally applicable, but also require a lot of information to be provided and are seldom tailored to one specific use case.
Third, Section 5.3 deals with the prediction of failures. These works focus on the prediction of failures, according to the definition we discussed in Section 4. We conclude this section with a summarizing categorization of all listed works into our taxonomy table in Section 5.4.

Event and Anomaly Prediction
This section outlines works related to event and anomaly prediction. The focus of these works is usually not on classifying the severity or degree of an anomaly, but rather on predicting or detecting whether a measured behavior or metric deviates from expected or normal behavior. However, the notion of event or anomaly might also include failures as we define them in Section 4.1. In the following, we divide the works in this area along the dimension of the corresponding time horizon, as described in Section 4.2. Section 5.1.1 focuses on works aiming at detecting anomalies, i.e., classifying events while or after they happen based on the monitoring streams. In contrast, Section 5.1.2 describes works targeting the prediction of anomalies, i.e., alerting or detecting a performance event before it actually happens on the system.

Anomaly Detection
In addition to the works we enumerate here, Chandola et al. [3] present a comprehensive survey of anomaly detection approaches. According to the taxonomy derived for our work, we divide the area into the following two subgroups. Both model the expected behavior of the system and then classify an anomaly based on the deviation from the predicted expected behavior. However, works described in the first group anticipate the expected behavior by applying time series forecasting, while the second group focuses on more explicit models.

Detection Based on Time Series Forecasting
The works of Bielefeld [81], Frotscher [82], and Oehler et al. [14] utilize time series forecasting to detect anomalies in large scale software systems. Based on historical data, a forecast for the anticipated behavior is made. Then, the observed behavior can be compared to the predicted one in order to detect anomalies. Usually, the predicted metrics are compared to the measured values, and an anomaly is alerted based on a pre-defined threshold. Another work based on time series modeling and prediction is the patent by Iyer and Zhao [92].
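The general scheme of forecast-based anomaly detection can be sketched as follows. A simple moving-average forecast stands in here for the more elaborate forecasting methods used in the cited works, and the fixed relative threshold is a hypothetical choice:

```python
def detect_anomalies(series, window=5, threshold=0.3):
    """Flag time steps whose observed value deviates from the
    forecast by more than `threshold` (relative error).

    A moving-average forecast over the last `window` observations
    stands in for the forecasting techniques of the cited works.
    """
    anomalies = []
    for t in range(window, len(series)):
        forecast = sum(series[t - window:t]) / window
        if forecast == 0:
            continue  # relative deviation undefined for a zero forecast
        deviation = abs(series[t] - forecast) / forecast
        if deviation > threshold:
            anomalies.append(t)
    return anomalies
```

A sudden doubling of a previously flat metric is flagged at the time step where it occurs, while values close to the forecast pass silently.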
As forecasting is a promising direction for anomaly detection (as well as for other domains), it remains an active area of research. Hence, there exist different forecasting methods that are of use for the aforementioned techniques. Wold [103], Hyndman et al. [104], Goodwin et al. [105], Herbst et al. [106], De Livera et al. [107], Züfle et al. [108], Faloutsos et al. [109], and Bauer et al. [110] aim at creating or improving general forecasting techniques. Note that these works do not specifically consider anomaly detection or performance prediction but propose generally applicable techniques for forecasting.

Detection Based on Normal Behavior Modeling
Other works define normal behavior based on models and detect anomalies based on deviations from the defined behavior. Chan et al. [89] define the model with rules based on machine learning, while Song et al. [15] rely on manual models. Zhang et al. [16] employ unsupervised clustering on black-box tasks to induce normal resource usage behavior patterns from historical data. Monni et al. [17,18] develop a technique for energy-based anomaly detection using Restricted Boltzmann Machines (RBMs) [18] and show how their technique can be used to detect collective anomalies and failures in software systems. Chan and Mahoney [19] automatically construct models based on the observed time series, but without explicitly applying forecasting techniques. Rathfelder et al. [83] apply workload-aware performance predictions in order to improve anomaly and failure detection. This targets the problem of fixed thresholds inherent to the approaches of this and the previous section, as observed deviations from the anticipated behavior have to be analyzed subject to a specific threshold. Similarly, Mayle et al. [93] dynamically adapt the baseline and thresholds based on the historical properties of the system.
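The idea of dynamically adapted baselines and thresholds can be sketched as a rolling mean-and-deviation band. This is a generic illustration, not the specific mechanism of any of the cited works:

```python
import statistics

def adaptive_threshold_alerts(series, window=20, k=3.0):
    """Alert when a value leaves the dynamic normal band
    mean +/- k * stdev, computed over a sliding history window."""
    alerts = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        band = k * std if std > 0 else k  # fallback band for a flat history
        if abs(series[t] - mean) > band:
            alerts.append(t)
    return alerts
```

Because the band is recomputed at every step, the baseline drifts with the system's recent behavior instead of being pinned to a fixed threshold.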

Anomaly Prediction
Tan et al. [20] propose an anomaly prediction mechanism, called ALERT, designed for predictive anomaly detection in large scale systems. They achieve predictive power by applying a three-state prediction approach, dividing the observed behavior into normal, anomaly, and alert states, with the alert state being the state directly before an anomaly occurs and, therefore, the predictive state of interest. They furthermore improve the model accuracy by clustering the different execution contexts into different groups and deriving a mapping between execution context and predictive model.
Schörgenhummer et al. [21] propose to apply online anomaly prediction based on multiple monitoring streams. They achieve this by training machine learning models on monitoring data provided by an industrial partner. The data are then split into vectors using features consisting of 34 data metrics and 11 aggregation functions. After that, each feature vector is assigned to be either normal or faulty, based on the available log data. The obtained data are then fed into a classification algorithm in order to predict a given monitoring stream as normal or abnormal.
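The metrics-times-aggregations feature construction can be illustrated as follows. The metric names and the reduced set of aggregation functions are hypothetical stand-ins for the 34 metrics and 11 aggregation functions used in the described approach:

```python
import statistics

# Hypothetical aggregation functions applied per metric window,
# mirroring the metrics-times-aggregations feature construction.
AGGREGATIONS = {
    "max": max,
    "mean": statistics.fmean,
    "min": min,
    "stdev": statistics.pstdev,
}

def build_feature_vector(metric_windows):
    """Turn a dict {metric_name: [samples]} into one flat feature
    vector by applying every aggregation to every metric."""
    features = []
    for name in sorted(metric_windows):          # deterministic ordering
        samples = metric_windows[name]
        for agg_name in sorted(AGGREGATIONS):
            features.append(AGGREGATIONS[agg_name](samples))
    return features
```

Each resulting vector would then be labeled as normal or faulty from log data and fed to a standard classifier, as described above.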

Performance Prediction
Many approaches to online performance and resource management in dynamic environments have been developed in the literature. Approaches are typically based on control theory feedback loops, machine learning techniques, or stochastic performance models, such as Layered Queueing Networks (LQNs) or Stochastic Petri Nets (SPNs). Other approaches (listed in Section 5.2.3) develop their own modeling languages to describe the architecture of the software system together with its performance properties. Performance models are typically used in the context of utility-based optimization techniques. They are embedded within optimization frameworks aiming at optimizing multiple criteria such as different Quality of Service (QoS) metrics [111][112][113]. One such use case is, for example, the auto-scaling of containers or Virtual Machines (VMs) of applications. For more details, we refer to the study by Lorido-Botran et al. [114].
In the following, we divide the works from this area according to the underlying modeling formalism, as discussed in Section 4.3. First, black-box or machine learning models, covered in Section 5.2.1, usually rely on statistical analysis or machine learning. Second, stochastic modeling formalisms based on queueing theory are described in Section 5.2.2; these already include system-specific information but do not yet model the application architecture explicitly. Finally, Section 5.2.3 discusses white-box or architectural modeling formalisms, where the software and its performance properties are explicitly modeled. Amiri and Mohammad-Khanli [4] give a further overview of performance prediction approaches; their focus is on resource provisioning in the cloud.

Black-Box and Machine Learning Models
Black box models apply statistical approaches or machine learning techniques to historical or online measurement data in order to build a model representation of the software system. This technique has a variety of different terms in different communities and is also known in the literature as software performance curves [22,23], performance prediction functions [24], performance predictions using machine learning [25], or using statistical techniques [26]. These approaches train statistical models on measurement data usually collected during a dedicated measurement phase. The resulting models can then be used to infer the performance of a software system under different scenarios, for example, different workloads or software configurations. Other approaches based on feedback loops and control theory (e.g., [27,28]) aim at optimizing different QoS objectives while ensuring system stability. Machine learning techniques capture the system behavior based on observations at run-time without the need for an a priori analytical model of the system [29,30].
Thereska et al. [26] build a statistical performance model to predict the performance of several Microsoft applications. The authors collect data from several hundred thousand real users using instrumented applications. The respective performance predictions are then based on a similarity search after irrelevant features were filtered out using Classification and Regression Trees (CART). Mantis, developed by Kwon et al. [25], predicts the performance of Android applications using a regression model. The result is used to determine whether or not a specific task can be efficiently offloaded. The authors train the regression model on the input parameters to the application, the current hardware utilization, and on values calculated during the program execution. Westermann et al. [24] compare the algorithms Multivariate Adaptive Regression Splines (MARS), CART, Genetic Programming (GP), and Kriging in the context of constructing statistical performance models for performance prediction. In their case studies, MARS outperformed the other evaluated approaches. Furthermore, the authors compare three different algorithms for selecting measurement points used to train the regression algorithms, which is useful to reduce the number of required performance measurements. Noorshams et al. [31] evaluate the accuracy of Linear Regression (LR), MARS, CART, M5 Trees, and Cubist Forests for the performance prediction of storage systems. Furthermore, the authors optimize the parameterization of the individual algorithms and propose their own algorithm, called Stepwise Sampling Search (S3) [115]. In their case study, MARS and Cubist outperform CART, M5 Trees, and LR after applying the proposed parameter optimization. Faber et al. [22] derive so-called software performance curves using GP techniques. They introduce parameter optimization approaches as well as a technique to prevent overfitting for GP.
In their evaluation, the introduced optimized GP approach outperforms an unoptimized MARS model. Finally, Chow et al. [32] create a performance model by working with a large set of component hypotheses that get rejected if the empirical data do not support them. They evaluate their model using an evaluation set of 1.3 million requests provided by Facebook in order to predict and improve the end-to-end latency.
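As a minimal illustration of the general idea behind such statistical performance models, the following sketch fits an ordinary least-squares line to hypothetical workload/response-time measurements and extrapolates to an unseen workload level. The cited works use considerably more sophisticated regression techniques (MARS, CART, GP, etc.), and all numbers here are made up:

```python
def fit_linear_model(xs, ys):
    """Ordinary least squares for y = a * x + b, a minimal stand-in
    for the regression techniques used in the cited works."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Train on measurements (workload intensity -> response time) from a
# hypothetical dedicated measurement phase, then predict performance
# for an unseen workload level.
workload = [10, 20, 30, 40]          # requests per second (hypothetical)
response = [0.11, 0.21, 0.31, 0.41]  # mean response time in s (hypothetical)
a, b = fit_linear_model(workload, response)
predicted = a * 50 + b               # extrapolate to 50 req/s
```

As discussed in the summary below, such models extrapolate workload changes reasonably well, but they say nothing about changes to the system or its deployment, since those are not captured by the input features.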
In summary, the approaches based on statistical or machine learning models can predict the impact of changes in workload intensity and workload parameterization well but are unreliable when predicting the impact of changes to the system or its deployment. Therefore, their application for failure prediction is limited. On the other hand, the used models are relatively easy to obtain and are generically applicable.

Models Based on Queueing Theory
Works in this area mainly use predictive performance models to capture and predict system behavior. The platform is normally abstracted as a black box, that is, the software architecture and configuration are not modeled explicitly (e.g., [33][34][35][36][37][38]). Such models include Queueing Networks (QNs) (e.g., [39,40]), LQNs (e.g., [41]), Queueing Petri Nets (QPNs) (e.g., [42]), stochastic process algebras [43], and statistical regression models (e.g., [44]). Models are typically solved analytically, e.g., based on mean-value analysis [34], or by simulation [38]. Both approaches have their respective benefits and downsides. Analytic solutions are usually much faster to compute; however, they might not be able to provide the same level of detail as a simulation or are only applicable to certain scenarios or system types. Therefore, a variety of approaches have been developed to enable dynamically switching between different solution methods (see, e.g., Walter et al. [116,117] for computer systems or Rygielski et al. [118,119] for networks) based on the current demand, or alternatively altering the prediction model itself [120,121]. These approaches accomplish this by working on a higher abstraction layer, e.g., the white-box descriptions of Section 5.2.3. For failure prediction, explicitly considering dynamic changes and being able to predict their effect at run-time is vital for ensuring predictable performance and meeting SLOs. Therefore, applying the proposed approaches in the context of failure prediction is possible in principle; however, the required computation time of many of the simulation-based approaches makes their application in an online context questionable. As most of the works lack a clear performance analysis for large-scale systems, it is hard to judge whether their application in an online context would be feasible, although it could conceptually be possible for some of them.
For this reason, the performance prediction target is not divided with respect to the dimension of the time horizon.
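As a minimal illustration of an analytically solved queueing model, consider an M/M/1 queue, whose mean response time is R = 1/(mu - lambda). The sketch below uses this closed form to predict whether a hypothetical mean-response-time SLO would be violated at a given load; it is a deliberately simple stand-in for the richer QN, LQN, and QPN formalisms discussed above:

```python
def mm1_mean_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: R = 1 / (mu - lambda).
    Valid only for a stable queue (arrival_rate < service_rate)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable (utilization >= 1)")
    return 1.0 / (service_rate - arrival_rate)

def violates_slo(arrival_rate, service_rate, slo_seconds):
    """Predict whether the mean response time exceeds the SLO."""
    try:
        return mm1_mean_response_time(arrival_rate, service_rate) > slo_seconds
    except ValueError:
        return True  # an unstable queue certainly violates the SLO
```

For example, at 80 req/s against a 100 req/s service rate the model predicts a 50 ms mean response time, whereas at 95 req/s the predicted response time already exceeds a 100 ms SLO, illustrating how analytic models give near-instant what-if answers.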

Architectural White-Box Models
In addition to the techniques described in the previous sections, a number of meta-models for building architecture-level performance models, i.e., white-box models of software systems, have been proposed. Such meta-models provide concepts to explicitly model the performance-relevant behavior of a system as well as its architecture, together with some aspects of its execution environment, i.e., the hardware the respective system is deployed on [6]. The most prominent meta-models are the Unified Modeling Language (UML) Schedulability, Performance, & Time (SPT) and UML Modeling and Analysis of Real-time and Embedded systems (MARTE) profiles [90]. Further proposed meta-models include ACME [45][46][47], ROBOCOP [48], the Software Performance Engineering Meta-Model (SPE-MM) [49], the Core Scenario Model (CSM) [50], KLAPER [51], SLAstic [84], the Palladio Component Model (PCM) [52], and the Descartes Modeling Language (DML) [53]. A more detailed overview and comparison of some of the given techniques can be found in the survey by Koziolek [6].
A major drawback of all white-box approaches is their missing adoption in industry, due to their inherent complexity [122]. An additional limitation of all approaches focusing solely on performance prediction is that, while they are usually able to accurately predict performance problems (e.g., time-outs or latency spikes), they are not able to predict content failures. Content failures include, for example, malicious or faulty user behavior that leads to erroneous responses or software failures. As most performance modeling formalisms do not include these properties in their model abstractions, they are not able to anticipate such types of failures.

Failure Prediction
This section describes all works falling into the area of failure prediction, as we defined it in Section 4.1. Since our definition is not restrictive, the area comprises a wide variety of approaches, which makes it necessary to divide the works into further subgroups. We split the works by analyzing the time horizon of the model learning and the model predictions, as already discussed in Section 4.2. In the following, we therefore distinguish between works intended for offline application in Section 5.3.1 and approaches explicitly developed for an online context in Section 5.3.2. Approaches designed for online applications must not only provide a reasonable time-to-result but also need to ensure a low Central Processing Unit (CPU) and memory footprint.

Offline Prediction
If the focus is not to predict software failures at run-time but rather to generally analyze the reliability or the safety of a system, reliability models (as proposed by Goševa-Popstojanova and Trivedi [54]) or safety models (see Grunske and Han [55]) are of relevance. These works are in principle very similar to the performance models introduced in the previous Section 5.2, but instead of specifically focusing on the performance of a system, these models analyze the quality of the software [56][57][58], or the amount and severity of probable failures occurring in a software system for a specific scenario.
Notable approaches include the works of Cheung [59], Cortallessa and Grassi [60], Yilmaz and Porter [61], Brosch [85], and Uhle [86]. Cheung [59] employs a Markov model to build a reliability model of the whole software system based on the reliability of the individual components. Similarly, Cortallessa and Grassi [60] consider the error propagation probabilities in addition to the individual failure probabilities of each component based on the system architecture. Yilmaz and Porter [61] classify measured executions into successful and failed executions in order to apply the resulting models to systems with an unknown failure status. They furthermore apply their technique to online environments (see Section 5.3.2.3). Brosch [85] extends the performance modeling formalism PCM introduced in Section 5.2.3 with reliability attributes in order to obtain an estimate of the reliability of the system. Finally, Uhle [86] models the dependability of micro-service architecture applications using dependency graphs.
All of the included approaches take the architecture of the application into account in order to assess its reliability. However, none of them is intended for the online prediction of failures.

Online Prediction
This section focuses on methods for online failure prediction. A comprehensive survey of different failure prediction methods in an online context is already given by Salfner et al. [2]. In the following, we assume all works to either specifically consider online environments or at least be easily applicable in that context. Note that we do not explicitly consider the areas of High Performance Computing (HPC) and distributed computing, which mainly focus on batch processing. Most works in the field of distributed computing aim at predicting the durations of parallel-running, long-term tasks, optimizing scheduling strategies, wait times, or overall throughput. In contrast, cloud computing usually aims at minimizing latency for user-facing web applications. Here, mostly transaction-oriented, latency-sensitive, and interactive applications are analyzed.
Witt et al. [5] survey PPM approaches in the area of distributed computing and give an overview of the state-of-the-art in that field. Note that the referenced survey focuses on general performance prediction, not solely on failure prediction in the context of batch-processing. Works focusing on distributed computing or HPC include Islam and Dakshnamoorthy [123], Zheng et al. [124], Yu et al. [125], or Liang et al. [126]. Islam and Dakshnamoorthy [123] use Long Short-Term Memory (LSTM) networks to detect and predict task failures before they occur using traces from Google cloud. Zheng et al. [124], Yu et al. [125], and Liang et al. [126] all extract rules from log data of the Blue Gene supercomputer. These rules are then used to predict failures at run-time. Although these works focus on HPC, there is another branch of works applying the same principle on web servers (see Section 5.3.2.2).
Similarly to the previous sections, this section will further analyze all works concerned with the online prediction of failures for user-facing web applications by splitting them into different groups. For the remainder of this section, we assume that all works concentrate on user-facing web applications, as described at the beginning of this section. In the following, we analyze the type of model used to conduct the prediction of performance degradation and/or failure, as explained in Section 4.3.
We, therefore, divide this section into two sub-sections, grouping the modeling types of the area. First, we list monolithic black-box models in Section 5.3.2.1, i.e., models that treat the application as black-box and do not contain or require any architectural knowledge about the application. These models are mostly based on machine learning. Afterwards, we present works implicitly incorporating architectural or white-box information about the SUS using rules in Section 5.3.2.2, while works in Section 5.3.2.3 use explicit architectural information.

Black-Box and Machine Learning Models
Alonso et al. [62] apply and compare different machine learning models in order to find the one best suited to their task of predicting anomalies and failures caused by software aging due to resource exhaustion. They furthermore employ Lasso regularization to filter features. Their evaluation is based on the TPC-W benchmark. Lou et al. [63] predict the failures of cloud services with Relevance Vector Machines (RVMs) using different flavors of Quantum-inspired Binary Gravitational Search Algorithms (QBGSAs). Li et al. [64] propose incorporating information from the network, the hardware, and the software in order to detect not only timing but also content failures, as both contribute to failure problems. This approach therefore also follows the failure definition of Section 4.1. Sharma et al. [65] apply a multi-layered online learning mechanism to distinguish cloud-related anomalies from application faults and furthermore include automatic remediation. They evaluate their tool on Hadoop, Olio, and Rubis, using enterprise traces on an IaaS cloud testbed. Daraghmeh et al. [66] use local regression models and the Box-Cox transformation to forecast the future state of each host. This information is then utilized to optimize VM placement and consolidation within a cloud platform, improving resource usage. Although the target of this work differs from ours, the applied algorithms might still be useful for our approach. Grohmann et al. [67] use random forest machine learning techniques to build a model of resource saturation for micro-service applications running in the cloud. They propose using one holistic model for all types of applications in order to infer the resource saturation of different applications without measuring any QoS metrics at the application level. The works of Cavallo et al. [68] and Amin et al. [69][70][71] predict QoS violations of web servers based on time series analysis methods such as Auto-Regressive Integrated Moving Average (ARIMA) or Generalized Auto-Regressive Conditional Heteroskedasticity (GARCH) models. Van Beek et al. [72] focus on predicting CPU contention using different regression models. They evaluate their approach on a workload trace from a cloud data center.
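To illustrate the general idea behind such time-series-based SLO violation predictors, the following sketch fits a simple autoregressive model to a response-time series and checks whether the forecast crosses the SLO threshold. This is a deliberate simplification: the cited works use full ARIMA or GARCH models, whereas here we fit only an AR(1) model by least squares, and all function names and thresholds are illustrative.

```python
# Illustrative sketch (not the cited implementations): AR(1)-based
# forecasting of response times with an SLO threshold check.

def fit_ar1(series):
    """Fit x[t] = c + phi * x[t-1] via ordinary least squares."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    phi = cov / var if var else 0.0
    c = mean_y - phi * mean_x
    return c, phi

def forecast(series, horizon, model):
    """Iterate the fitted AR(1) model `horizon` steps ahead."""
    c, phi = model
    x = series[-1]
    out = []
    for _ in range(horizon):
        x = c + phi * x
        out.append(x)
    return out

def predict_slo_violation(response_times_ms, slo_ms, horizon=5):
    """Flag an upcoming SLO violation if any forecast exceeds the threshold."""
    model = fit_ar1(response_times_ms)
    return any(v > slo_ms for v in forecast(response_times_ms, horizon, model))
```

In practice, one would replace the AR(1) fit with a proper ARIMA or GARCH model (e.g., to capture trends, seasonality, or volatility clustering), but the prediction logic, forecasting the QoS metric and comparing it against the SLO, stays the same.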

Rule-Based Models
NETradamus, proposed by Clemm and Hartwig [73], is a framework for forecasting failures based on event messages. The tool mines critical event patterns from logs of past failures in order to create rules for detecting failures in an online context. The work of Gu et al. [74], Pitakrat et al. [75], and, as already introduced at the beginning of this section, the works of Zheng et al. [124], Yu et al. [125], and Liang et al. [126] are based on a similar idea, i.e., creating rules from log events of past failures in order to predict upcoming failures of the same type.
The presented works do not explicitly consider explainability. However, as rule-based approaches are generally easier for humans to interpret and event logs can be used for root-cause analysis, we consider these approaches to be white-box models that implicitly deliver a certain degree of explainability.
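The common core of these approaches can be sketched as follows: offline, mine event types that frequently preceded past failures within a lookback window; online, raise a warning when such a precursor event appears. This is a strongly simplified illustration of the NETradamus-style idea, not any of the cited implementations; the event names, the support threshold, and the single-event rules are illustrative assumptions (the actual tools mine richer event patterns).

```python
# Illustrative sketch of log-based rule mining for failure prediction.
from collections import Counter

def mine_precursor_rules(events, failures, window, min_support=0.8):
    """events: list of (timestamp, event_type); failures: list of timestamps.
    Returns the event types observed in the `window` before at least a
    `min_support` fraction of past failures."""
    counts = Counter()
    for f in failures:
        seen = {e for t, e in events if f - window <= t < f}
        counts.update(seen)  # count each event type at most once per failure
    return {e for e, c in counts.items() if c / len(failures) >= min_support}

def predict_failure(recent_events, rules):
    """Online check: any mined precursor event triggers a failure warning."""
    return any(e in rules for _, e in recent_events)
```

For example, if a `disk_warn` event preceded every recorded failure but `ok` events appeared only occasionally, only `disk_warn` survives the support threshold and becomes a warning rule.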

Architectural Models
Pitakrat et al. [76,87] develop a technique for predicting failures in an online environment. Their technique is based on two components. First, failure probabilities of individual components are predicted based on defined thresholds for key metrics and ARIMA-based metric forecasting. Second, these failure probabilities are inserted into a Failure Propagation Model (FPM) using a Bayesian Network (BN) in order to incorporate the probabilities of cascading failures. Mohamed [88] uses the so-called error spread signature to derive a Connection Dependency Graph (CDG) for propagating software function failures (excluding QoS metrics), an approach that works similarly but does not consider probabilistic propagation. Pertet and Narasimhan [91] published a technical report in which they conclude that many software failures can be traced back to fault chains, i.e., cascading failures, which supports the above modeling strategies. To that end, they also propose their own approach [77] that uses the dependency graph of the application nodes to deal with cascading failures. Capelastegui et al. [78] propose combining monitoring on the VM level as well as on the host level with different data sources, such as monitoring data, event logs, and failure data, to derive online failure predictions. Similarly, Ozcelik and Yilmaz [79] propose combining measurements from the hardware and from inside the software into so-called hybrid spectra, in order to overcome the downsides of black-box failure models while keeping the monitoring overhead acceptable. This is the improved version of their offline technique already introduced in Section 5.3.1. Lastly, Mariani et al. [80] aim at predicting failures in distributed multi-tier environments by employing a two-stage approach, consisting of anomaly detection in the first stage and a signature-based classification technique in the second stage.
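The propagation step underlying such architectural models can be sketched in a few lines: each component carries a local failure probability (e.g., obtained from a metric forecast and a threshold), and these probabilities are combined along the dependency graph. The noisy-OR combination used below is an illustrative simplification of a full Bayesian network as used by Pitakrat et al.; the component names and probabilities are made up for the example.

```python
# Illustrative sketch of failure-probability propagation over an acyclic
# dependency graph, using a noisy-OR combination (a simplification of a
# full Bayesian-network FPM).

def propagate(local_p, depends_on):
    """local_p: {component: P(local failure)};
    depends_on: {component: [components it depends on]} (must be acyclic).
    Returns the overall failure probability per component, assuming a
    component fails if it fails locally or any dependency fails."""
    memo = {}

    def total(c):
        if c not in memo:
            p_ok = 1.0 - local_p[c]
            for dep in depends_on.get(c, []):
                p_ok *= 1.0 - total(dep)  # survives only if the dependency does
            memo[c] = 1.0 - p_ok
        return memo[c]

    return {c: total(c) for c in local_p}
```

For instance, a perfectly healthy web tier (local probability 0) still inherits a substantial failure probability if the database it transitively depends on is likely to fail, which is exactly the cascading-failure effect these models aim to capture.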

Summarizing Taxonomy Table
We conclude this section by presenting a final taxonomy comprising all categories discussed in Sections 5.1-5.3. Table 1 presents a condensed aggregation of all works into the taxonomy introduced in Section 4. In addition, we assess each group of approaches in Table 1 with regard to the three properties Generality (G), Adaptability (A), and Explainability (E). Categories marked with (+) exhibit the specific property to a greater degree than categories marked with (−).
Generality estimates how generic the specific group of approaches is and how well it can be applied to different software systems. Generally, we find that black-box approaches are advantageous here, as they require very little knowledge about the system and fewer assumptions need to be made. In contrast, architectural and queueing-theoretical approaches only apply to system types that can be modeled using the particular modeling formalism, which in turn limits their generality. Similarly, the rule-based systems from the second paragraph of Section 5.3.2 mostly rely on system logs and therefore make assumptions about their structure.
Adaptability refers to the ability to react to changes in the system, i.e., to deliver accurate predictions even if the originally modeled system has changed. This is a strength of all approaches dealing with anomaly prediction, as their goal is essentially to detect such deviations from the system model. Additionally, as architectural approaches encode more information about the system, they are able to model more of its states. In contrast, black-box approaches encode less information and are therefore seldom able to transfer their knowledge to unknown system states. The same applies to the rule-based approaches, as rules are usually hard to generalize.
Finally, we discuss the concept of explainability. Explainability, as we discuss in more detail in Section 6, refers to the ability to report the reason for a certain SLO failure of the system, sometimes also referred to as its root cause. Hence, approaches designed for anomaly prediction can usually only detect a non-normal state, without the ability to pinpoint its cause. The same applies to black-box approaches, as long as they are not explicitly trained with the corresponding root causes. The more expressive architectural model types usually have the best chances of finding the reason for a failure, as they encode the most information about the system. However, we would like to note that the degree of explainability is still quite limited, in our opinion, which is why we discuss this as an open issue in Section 6.

Table 1. Comprehensive depiction of the presented classification along with our assessment of Generality (G), Adaptability (A), and Explainability (E).

Open Research Challenges
In this section, we list some remaining challenges and open research issues that have not been addressed, or only rarely, by the respective works in the field.

Explainability
One important issue when predicting failures is the explainability of the approaches. We distinguish between two types: (1) model explainability and (2) failure explainability. Model explainability refers to explaining the predictions of the model itself. Here, the focus is on why the specific approach produced the respective prediction, i.e., which specific inputs led to which intermediate and final results. Model explainability is often desired or even required when a high degree of trust in the applied approach is necessary, e.g., in production or high-availability systems. Failure explainability, in contrast, targets the actual software system in production. Usually, the root cause of a failure is desired in order to avoid failure manifestation, quickly resolve the failure, or mitigate its noticeable effects for the end user. The specific degree of explainability required of a failure prediction varies strongly between use cases. We argue that both types of explainability depend strongly on the chosen modeling types and their restrictions. Furthermore, it is crucial to address both types of explainability when designing a production-ready system. Hence, we encourage the research community to increase their efforts in these directions.

Resource Consumption
Although most approaches included in this work are intended for use in an online environment, many works do not explicitly consider or analyze the performance of the prediction approach itself. We argue that, in order to be applicable in the online environment of a production system, the resource consumption, i.e., the CPU and memory footprint, of any approach must be analyzed in depth. However, as most of the works lack a clear performance analysis for large-scale systems, it is hard to judge whether their application in an online context would be feasible. A possible follow-up to our study could be a representative benchmark that evaluates the performance of the different approaches in terms of accuracy and time-to-result, as well as resource and energy consumption.

Hybrid or Overarching Approaches
This work presents a variety of different modeling strategies and approaches targeted at achieving the same or a similar goal. As all these approaches exhibit different advantages and disadvantages in certain scenarios, it seems natural to combine them into an overarching approach that unites the strengths of various techniques. For example, many failure prediction methods included in Section 5.3 rely on anomaly detection. However, advanced techniques like the ones presented in Section 5.1 are only seldom utilized; instead, standard off-the-shelf anomaly detection is deployed. While this is a reasonable restriction in a research paper with a specific focus, it is certainly possible to improve the proposed approach by improving its anomaly detection component. Additionally, it seems useful to combine the advantages of machine learning techniques (see the first paragraph of Section 5.3.2) with methods using architectural knowledge (see the third paragraph of Section 5.3.2) for failure prediction by creating a hybrid approach. We therefore want to motivate the community to increase work towards compound approaches in the future.

Limitations and Threats to Validity
Although we conducted this survey to the best of our knowledge and belief, there are still some remaining threats to validity to be discussed.
First, the applied search method of systematic mapping might not include all relevant papers for the considered scope. As already discussed in the methodology section, a systematic literature review did not seem fruitful for our application, which is why we consider the proposed methodology the best way of collecting a reasonable number of papers. However, we cannot rule out that some relevant areas of research were missed during the study. Nevertheless, due to the nature of the taxonomy creation, this does not challenge the validity of the proposed taxonomy. Instead, additional references and categorizations need to be added in future versions in order to complete the taxonomy view.
Second, we use the works analyzed in this study to draw conclusions about the strengths and weaknesses of the discussed approaches, as well as to identify directions for future research in the area. Due to the summarizing, abstracting, and generalizing nature of our taxonomy, we need to formulate general statements that apply to all or almost all works in a specific group. Hence, although we aim to avoid such cases, there can be occasional instances where a conclusion drawn for a whole group of approaches does not hold for specific approaches in that sub-field.
Third, this study represents our critical view of the current state of the art. It is therefore unavoidable that future research will address some of the weaknesses, challenges, and open issues we point out throughout this study. Hence, not all claims and statements made in this work may still apply as research in the field progresses. However, as this lies in the nature of all science, it cannot be seen as a threat to this research overview. Instead, this work can be used for later analyses and comparisons of which areas of research have developed and improved, and which areas still need more research attention in the future.
Finally, this study does not provide a quantitative comparison in terms of prediction accuracy or required computation time. Although such a benchmark would provide a great benefit for the research community, a comparative evaluation of this kind is out of scope for this work. First, many of the discussed works do not disclose open-source implementations. Second, even if implementations are available, running this wide variety of approaches, each with its own interfaces, poses a huge technical challenge. Additionally, some conceptual questions need to be answered in order to conduct a comparative study: What are realistic test scenarios for this investigation? What information can or will be provided to the individual approaches? What degree of detail is expected for a specific prediction? How can the need for more information be penalized, or a more detailed prediction (e.g., a smaller time window, a faster prediction, or included mitigation or root-cause information) be rewarded? Answering these questions and providing a quantitative comparison could be an interesting target for future work.

Conclusions
This work presents a taxonomy of SLO failure prediction techniques by dividing research works along three different dimensions: (1) the prediction target (e.g., anomaly detection, performance prediction, or failure prediction); (2) the time horizon (e.g., detection or prediction, online or offline application); and (3) the applied modeling type, i.e., the underlying base technique that was utilized (e.g., time series forecasting, machine learning, or queueing theory). Furthermore, we present an overview of the current state of the art and categorize existing works into the different groups of the taxonomy.
Finally, we address open issues and remaining research challenges in order to guide future research. This work can be used by both researchers and practitioners to classify and compare different approaches for SLO failure prediction.