1. Introduction
Software-as-a-Service (SaaS) systems are deployed in an increasingly modularised and distributed pattern. The architecture of those systems benefits from a separation between a collection of computing tasks and a workflow that orchestrates those tasks. Traditionally, workflow management systems (WMSs) are (usually monolithic) software tools that are driven by workflow models and automate business processes [1]. In modern service-based workflow-centric systems (called workflow systems hereafter), the workflow and task applications are implemented by software developers and deployed in containers or run by serverless functions. Moreover, those workflow systems often manifest high throughput and high concurrency and consist of scalable computing tasks. One notable example is the web service AWS Step Functions [2].
In order to meet quality-of-service requirements, workflow systems usually provide access to a wide range of monitoring metrics. One important kind of performance metric is the execution time. We can easily measure the end-to-end time (i.e., makespan) of any workflow and the task time for individual tasks in the workflows. For complex workflows, it is also useful to consider complex time metrics at the level of execution traces. For example, we may want to know the execution time of traces traversing specific branches and iterating in some loop at most i times (where i is a positive integer). We use the term “trace component” to refer to a subset of traces that can be described by a temporal pattern. The time metrics for trace components can provide more pertinent information to establish a monitoring plan for workflow systems.
In general, there are two approaches used to analyse execution traces. Event stream processing [3,4] is a technique of continuously querying timestamped event logs, which contain low-level and precise footprints of the traces. However, ingesting the telemetry (possibly after a cleaning step) from a high-throughput system into a processing platform is not cost-effective. Sampling can alleviate the cost by randomly selecting a portion of the data to process, but it comes at the price of reduced accuracy. In contrast, trace profiling is a model-based technique used to build and maintain a statistical profile of the traces. Compared with the set of concrete traces, a profile model is compact and contains rich trace information that can be analysed further. In particular, temporal (or behavioural) profiling, which employs models preserving the temporal order of events, is adopted in various problem domains (e.g., clickstream analysis [5], intrusion detection [6] and process mining [7]).
Following the trace profiling approach, we leverage probabilistic model checking [8] to analyse time metrics for trace components in workflows. Probabilistic model checking, a formal technique for quantitative verification, has been applied to address various problems in software engineering, including Quality-of-Service (QoS) management [9], self-adaptive systems [10,11], event-streaming systems [12] and business workflows [13]. One main strength of this technique is the ability to express a wide range of temporal properties for a system model in the Markovian model family [14] and to perform verification in an automated way. As our focus is on time analysis, the Markovian model of workflow profiles should encode not only the state transition probabilities but also the state holding times. Moreover, because of the uncertain workload and environment of the workflow system, it is more reasonable to consider the state holding times as random variables rather than fixed values.
If we assume that the state holding times are exponentially distributed, we obtain a continuous-time Markov chain (CTMC), which is a widely adopted model in performance analysis [15]. Based on the memoryless property of exponential distributions, we can perform automated verification on a CTMC using well-established methods of numerical computation [16]. But this assumption is usually too restrictive to fit the empirical trace data and thus results in inaccurate analysis output. Therefore, we consider an extension of the CTMC, namely, the semi-Markov chain (SMC), in which the state holding times are governed by general distributions. The execution time of a trace component is naturally formalised by a fundamental measure of SMCs, called first passage times (FPTs). It is noteworthy that in some existing works [17,18] that aggregate workflow makespans from randomised task times, the workflow is represented as a directed acyclic graph (DAG). But loops, which are allowed in SMCs (and other models in the Markovian family), are an essential structure for building workflow models.
However, the generality of an SMC results in a serious challenge for its quantitative verification. First, it is impractical to estimate the exact distribution functions of state holding times from empirical trace data. Second, even if those distributions are given, it is extremely hard to compute the probability that an FPT falls in an interval for an SMC [19]. In view of these difficulties, we maintain a simple mergeable summary of workflow executions (which comprises frequency values and power sum values only), and compute a group of generic statistical parameters, namely moments, for the FPT. Our core computational method, which is underpinned by a well-established property of SMCs [20], resorts to solving a sequence of linear systems of reasonable size (i.e., the same size as the SMC). We can either use the FPT moments directly as time metrics for trace components or further infer other statistical parameters, such as quantiles, from the FPT moments using the method of moments ([21], Chapter 10.2).
In real-world applications, our approach is suitable for performance monitoring in workflow systems such as AWS Step Functions. In particular, our approach can build compact workflow trace profiles from historical data and from recent data in a sliding or decaying time window, and produce a broad range of complex time metrics to define a performance baseline and the current performance for monitoring. We also include an empirical evaluation of the computation time of our approach.
The main contributions of this paper are summarised as follows:
We leverage probabilistic model checking as a trace profiling approach to formally analyse the complex time metrics for workflow systems.
We present an efficient method to compute the FPT moments in SMC, which complements the existing quantitative verification technique for SMC.
We describe an application of our approach to a notable workflow Web service, namely, AWS Step Functions.
The remainder of the paper is organised as follows: Section 2 presents the background of our problem in the time analysis of workflow trace components. Section 3 presents an overview of our approach. Section 4 and Section 5 present the technical details, namely, the model building process and the core time analysis method, respectively. Section 6 describes an application of our approach to a real-world workflow system. Section 7 discusses the related work. Finally, Section 8 concludes the paper.
2. Background and Problem
In this section, we explain the background and problem of workflow execution time analysis using a simple running example.
2.1. Execution Trace Components
A workflow orchestrates a set of computing tasks to form a complex workflow logic. Typically, besides a “start” state and an “end” state, a workflow comprises two kinds of steps, namely tasks (where the main computing functions are performed) and control stages (such as “choice”, “fork”, “join” and “loop”). For illustration purposes, Figure 1 presents the diagram of a simple workflow, called SimpleFlow, which includes two tasks (Task 1 and Task 2). The workflow logic of SimpleFlow is as follows: After the workflow starts, two parallel threads are spawned at the “fork” stage. In the first thread, Task 1 is executed once. In the second thread, Task 2 is executed iteratively, depending on the output of the “ok?” stage. After both threads enter the “join” stage, the workflow terminates.
In real-world workflow engines (such as AWS Step Functions), workflows are usually defined in a domain-specific specification language (such as a JSON-style language). The execution traces of a workflow can be identified at different levels of abstraction. For example, we can consider each task as a single execution stage of SimpleFlow or a combination of the lower-level execution stages (e.g., scheduling, queuing and processing). For simplicity, we adopt the former (i.e., more abstract) level of execution traces. A trace component can be viewed as a part of the workflow, which can be described by some temporal pattern. Informally speaking, a trace component represents a subset of possible execution traces in a workflow. Consider the following two examples of trace components for SimpleFlow:
- TC-A: Task 2 is executed only once.
- TC-B: Task 2 is executed at least twice.
The formalisation of trace components and their relationship with workflows is provided in Section 4.
2.2. Execution Time Metric
Suppose that SimpleFlow is a long-running workflow service, where multiple executions can run simultaneously. Also suppose that the resources allocated to the two tasks in SimpleFlow are scalable at runtime (which is often the case in real-world SaaS systems). To manage the resource allocation, the workflow system needs to have access to a wide range of performance metrics, including the time metrics. Basic time metrics include the end-to-end time of the workflow and the task time (i.e., the time to run an individual task) in the workflow. But in order to provide more pertinent information about the workflow performance, it is also important to consider time metrics for trace components.
Time metrics of those trace components can provide performance information to determine whether more computing resources should be allocated to the two tasks. Unlike the end-to-end time and task time, the time metrics for trace components cannot be measured directly. One generic technique is event stream processing, in which we can route the trace logs to a stream processing platform, filter the traces belonging to an interesting trace component and aggregate the time for those traces. But this technique is resource-consuming, especially for high-throughput streams, because the telemetry needs to be streamed into a processing platform from (possibly) multiple sources (e.g., multiple service units), and because fragments of the streams need to remain in main memory for aggregation. In practice, the overhead of performance monitoring atop the operation cost of the applications should be minimised. Therefore, a lightweight trace profiling method, which can produce reasonably accurate time metrics for a variety of trace components, is desirable.
3. Approach Overview
In this section, we present an overview of our approach to workflow execution time analysis, and illustrate its several key aspects.
Because of the uncertain workload and environment of the workflow system, workflow executions manifest stochastic characteristics. Therefore, the execution time of any part of the workflow should be viewed as a random variable. The workflow execution traces, which are collected according to some pre-determined time window function (e.g., a sliding or decaying function), can be considered random samples. We assume that those random samples (i.e., traces) are generated by a stochastic process in which the probabilities and elapsed times of jumping to the next states depend on the current states only. This assumption, or an equivalent statement, is adopted in many QoS composition studies (e.g., [17,22,23,24]). Under this mild assumption, we developed a quantitative verification technique to compute the time metrics for trace components efficiently, just by constructing a fixed workflow model and maintaining an execution summary.
An overview of our approach is illustrated in Figure 2, whilst the technical details are presented later in Section 4 and Section 5. Our approach requires three inputs: (i) a workflow model, which is a transition system, (ii) a simple mergeable summary of workflow executions, and (iii) one or more to-be-analysed trace components, which are formalised by deterministic finite automata (DFA). The key idea is to build an SMC model for each trace component based on the aforementioned three inputs and analyse its FPT measure formally. The direct quantitative verification of FPT in an SMC is impractical because (i) the probability distribution functions of state holding times in an SMC can be arbitrary and difficult to determine precisely from data and (ii) even if those probability distribution functions are given, the probability that the FPT falls within some interval is extremely hard to compute [19]. However, we can calculate a data summary, which comprises frequency values and power sum values, and compute a group of important statistical parameters (i.e., moments) for the FPT efficiently [20]. Besides the mean (which is the first moment), some other common statistical parameters (such as variance, skewness and kurtosis) are derivable from moments directly. We can use those statistical parameters for the FPT as time metrics of the trace components. From the FPT moments, we can also infer (approximate) quantiles of the FPT as other time metrics.
We highlight the following advantages of our approach. First, our approach early-aggregates simple statistical summaries from the workflow execution data, thus alleviating the cost of transferring and persisting the raw telemetry. Second, owing to the expressive power of DFA (which is equivalent to that of regular languages), a very broad range of trace components for a given workflow can be defined and analysed. Third, the computation of the FPT moments is very efficient, as it resorts to solving a sequence of linear systems, whose size is not larger than the size of the SMC state space.
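To make the third point concrete, the following minimal sketch (our illustration, with made-up numbers) computes the first FPT moments, i.e., the expected times to absorption, of a toy SMC by solving a linear system whose size equals the number of transient states; the higher-order moments handled in Section 5 are omitted here.

```python
# Minimal sketch: first FPT moments (expected times to absorption) of a toy 3-state SMC,
# obtained by solving (I - P_T) m = c, where P_T restricts the transition matrix to the
# transient states and c holds the expected holding time per jump. Numbers are placeholders.
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.0, 0.0, 1.0]])      # transition probabilities; state 2 is absorbing
M1 = np.array([[0.0, 2.0, 1.5],
               [3.0, 0.0, 0.5],
               [0.0, 0.0, 0.0]])     # first moment (mean) of each holding-time distribution

transient = [0, 1]
P_T = P[np.ix_(transient, transient)]                 # transitions among transient states
c = (P * M1).sum(axis=1)[transient]                   # expected holding time before each jump
m = np.linalg.solve(np.eye(len(transient)) - P_T, c)  # first FPT moment per transient state
print(dict(zip(transient, m.round(3))))
```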
4. Model Building
In this section, we focus on the model building stage of our approach, namely the building of the SMC model from a transition system, an execution summary and DFA.
4.1. Transition System for Workflow
Definition 1 (Transition system). A transition system is a tuple $\mathcal{W} = (W, w_{\mathit{init}}, \rightarrow, AP, L)$ where
$W$ is a set of states;
$w_{\mathit{init}} \in W$ is an initial state;
$\rightarrow \subseteq W \times W$ is a transition relation;
$AP$ is a set of atomic propositions;
$L : W \rightarrow 2^{AP}$ is a labelling function that assigns a subset of atomic propositions to each state.
A trace in $\mathcal{W}$ is a finite sequence of states $w_0 w_1 \cdots w_n$ such that $(w_i, w_{i+1}) \in \rightarrow$ for all $0 \leq i < n$. We present a transition system in Figure 3 as the model of SimpleFlow. In general, to model a workflow as a transition system $\mathcal{W}$, the following rules are applied:
- The set $AP$ includes the names of some stages in the workflow. The excluded stages (such as the control stages “fork”, “join” and “ok?” in SimpleFlow) have almost instant elapsed times, which are negligible.
- The labelling function $L$ assigns a distinguished “start” proposition to the initial state and a distinguished “end” proposition to a state without outgoing transitions, respectively.
- The parallel threads in the workflow are represented by branches in $\mathcal{W}$. For example, the two “fork–join” threads in SimpleFlow are represented by the two branches of Task 1 and Task 2 from the initial state.
Figure 3. Transition system for SimpleFlow.
The last rule is worthy of further explanation. Because some threads are temporarily blocked before entering a “join” stage in a workflow execution, only the last-completed thread determines the execution time. As our focus is on time analysis, we can resolve the parallel structures in the workflow by the branching structures in $\mathcal{W}$. Thus, the traces in $\mathcal{W}$ model the traces of last-completed threads in the workflow only. In $\mathcal{W}$, a path that starts at the initial state, traverses Task 2 a positive number of times and then reaches the end state models an execution trace of SimpleFlow where the thread of Task 2 is the last one completed. This avoids the exponential growth in the size of the transition system as the number of threads in the workflow increases. For convenience, we also use $\mathcal{W}$ to denote the workflow that it models from now on.
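For concreteness, one possible encoding of this transition system in code is sketched below; the state names and labels follow our reading of Figure 3 and are assumptions rather than the paper's exact model.

```python
# A possible encoding of the SimpleFlow transition system (state names are our assumptions).
# Branching (rather than true parallelism) captures the two fork-join threads, and the
# self-loop on "Task2" captures the loop controlled by the "ok?" stage.
simpleflow = {
    "states": ["start", "Task1", "Task2", "end"],
    "initial": "start",
    "transitions": {
        ("start", "Task1"), ("Task1", "end"),    # branch where Task 1 completes last
        ("start", "Task2"), ("Task2", "Task2"),  # branch where Task 2 iterates
        ("Task2", "end"),
    },
    # labelling function: each state is mapped to a set of atomic propositions
    "labels": {"start": {"start"}, "Task1": {"Task1"}, "Task2": {"Task2"}, "end": {"end"}},
}
```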
4.2. Workflow Execution Summary
An execution trace of a workflow is a list of events, each with the following fields: (eventType, timestamp), where eventType is an event type, such as entering a stage in the workflow, and timestamp is the event timestamp. The list may also contain a unique ID to identify the execution instance. Execution traces include all the information needed to create a workflow execution (data) summary.
Assume that we have collected a set of execution traces falling into a prescribed time window. We remove the events that do not belong to last-completed threads. With the execution trace data, we can replay the executions and calculate two kinds of statistical values. The first kind of statistical values are the frequencies of transitions, denoted $\mathrm{Freq}(w, w')$, where $(w, w')$ is a transition in $\mathcal{W}$. The frequencies can be calculated by counting the pairs of consecutive events. The second kind of statistical values are power sums of transition time intervals, denoted $\mathrm{PSum}_j(w, w')$, where $j$ is a power, and $(w, w')$ is a transition in $\mathcal{W}$. Let $t$ and $t'$ be the timestamps of two consecutive events that correspond to $w$ and $w'$, respectively, and let $\tau = t' - t$. Thus, we have a set of time intervals $\{\tau_1, \ldots, \tau_n\}$ for $(w, w')$, where $n = \mathrm{Freq}(w, w')$. Then, we calculate $\mathrm{PSum}_j(w, w') = \sum_{k=1}^{n} \tau_k^{\,j}$, where $j = 1, \ldots, m$. An alternative way of interpreting the time interval of a transition is as the holding time at state $w$ if the next state is $w'$. For example, when the execution of SimpleFlow is located in the state of Task 1 in $\mathcal{W}$, the transition time from that state to its successor means the task time for Task 1.
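The sketch below shows one way to carry out this bookkeeping; the event-record layout (a state name paired with a timestamp in seconds) and the cap of four power sums are our assumptions.

```python
# Sketch: build (Freq, PSum) from replayed traces. Each trace is assumed to be a list of
# (state_name, timestamp_in_seconds) pairs, already restricted to its last-completed thread.
from collections import defaultdict

M = 4  # highest power kept (enough for the first four moments)

def summarise(traces):
    freq = defaultdict(int)                   # Freq(w, w')
    psum = defaultdict(lambda: [0.0] * M)     # PSum_j(w, w') for j = 1..M
    for trace in traces:
        for (w, t), (w2, t2) in zip(trace, trace[1:]):   # consecutive events
            tau = t2 - t                                 # holding time at w before moving to w2
            freq[(w, w2)] += 1
            for j in range(1, M + 1):
                psum[(w, w2)][j - 1] += tau ** j
    return freq, psum

# Example with two made-up SimpleFlow traces:
freq, psum = summarise([
    [("start", 0.0), ("Task2", 0.1), ("Task2", 2.3), ("end", 2.9)],
    [("start", 0.0), ("Task1", 0.2), ("end", 4.0)],
])
```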
The pair $(\mathrm{Freq}, \mathrm{PSum})$ is an execution summary of the workflow. It is noteworthy that one important property of the summary is mergeability [25,26]: Given $\mathrm{Freq}_1$ and $\mathrm{Freq}_2$ (resp., $\mathrm{PSum}_1$ and $\mathrm{PSum}_2$), we can simply merge them, namely, let $\mathrm{Freq} = \mathrm{Freq}_1 + \mathrm{Freq}_2$ (resp., $\mathrm{PSum} = \mathrm{PSum}_1 + \mathrm{PSum}_2$). This property enables the parallel calculation of the summary in practice. Moreover, we can calculate the summary with sliding or decaying windows [4]. Assume that we calculate one summary with historical data (e.g., data from the past week) and another with recent data (e.g., data from the past hour). In the case of (exponential) decaying windows, we can let the combined summary be the recent summary plus the historical summary scaled by a decaying factor $\lambda$. Thus, we represent the entire execution history with a compact summary.
We use a (statistical) workflow profile to refer to the combination of a workflow model $\mathcal{W}$ and its execution summary $(\mathrm{Freq}, \mathrm{PSum})$, where $\mathcal{W}$ is fixed, and the summary is updated by the aforementioned method as more recent execution trace data are collected.
4.3. Specification of Trace Component
We present the definition of DFA, which we employ to define trace components of the workflow $\mathcal{W}$.
Definition 2 (Deterministic finite automaton). A DFA is a tuple $\mathcal{A} = (Q, \Sigma, \delta, Q_0, F)$, where
$Q$ is a set of locations;
$\Sigma$ is an alphabet;
$\delta : Q \times \Sigma \rightarrow Q$ is a transition function;
$Q_0 \subseteq Q$ is a set of initial locations;
$F \subseteq Q$ is a set of accept locations.
Throughout the paper, we let the alphabet of $\mathcal{A}$ be the power set of the atomic propositions in a workflow $\mathcal{W}$, namely, $\Sigma = 2^{AP}$. Based on a transition system $\mathcal{W}$ and a DFA $\mathcal{A}$, we can define a product transition system that models the behaviour of a trace component.
Definition 3 (Product transition system). Given a workflow $\mathcal{W} = (W, w_{\mathit{init}}, \rightarrow, AP, L)$ and a DFA $\mathcal{A} = (Q, \Sigma, \delta, Q_0, F)$, a product transition system $\mathcal{W} \otimes \mathcal{A}$ is a tuple whose states are pairs $\langle w, q \rangle \in W \times Q$ and whose transition relation is the smallest relation defined by the following two rules: the first rule is the standard synchronisation rule, i.e., $\langle w, q \rangle \rightarrow \langle w', q' \rangle$ whenever $w \rightarrow w'$ and $q' = \delta(q, L(w'))$; the second rule adds self-loops to designated product states so that they become absorbing.
Clearly, traces of $\mathcal{W} \otimes \mathcal{A}$ (projected onto their workflow components) are also traces of $\mathcal{W}$. This justifies $\mathcal{W} \otimes \mathcal{A}$ as a trace component of $\mathcal{W}$. The DFA is a very expressive formalism used to specify temporal patterns for traces. Figure 4a,b presents two DFA examples, one for each of the trace components TC-A and TC-B. The product transition systems of these two DFA with $\mathcal{W}$ model the two trace components of SimpleFlow in Section 2.1.
Note that Definition 3 is slightly different from the standard definition of product transition systems. For technical purposes, we encode absorbing states in the product transition system with the second rule of the transition relation.
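A sketch of this construction is given below; the DFA encodes TC-B (“Task 2 is executed at least twice”) under the labelling assumed in the Section 4.1 sketch, and the two transition rules follow our reading of Definition 3 (the standard synchronisation rule plus self-loops on accept locations).

```python
# Sketch: product of a transition system and a DFA. The DFA is our own encoding of TC-B:
# accept traces in which "Task2" is entered at least twice before the "end" state.
simpleflow = {  # the SimpleFlow transition system from the Section 4.1 sketch
    "initial": "start",
    "transitions": {("start", "Task1"), ("Task1", "end"), ("start", "Task2"),
                    ("Task2", "Task2"), ("Task2", "end")},
    "labels": {"start": {"start"}, "Task1": {"Task1"}, "Task2": {"Task2"}, "end": {"end"}},
}

def delta(q, label):
    """DFA transition function over label sets; q0/q1/q2 count occurrences of Task2."""
    if q in ("acc", "rej"):
        return q
    if "Task2" in label:
        return {"q0": "q1", "q1": "q2", "q2": "q2"}[q]
    if "end" in label:
        return "acc" if q == "q2" else "rej"
    return q

def product(ts, q0, accept):
    """Build the product states and transitions; accept locations become absorbing."""
    init = (ts["initial"], q0)
    states, trans, frontier = {init}, set(), [init]
    while frontier:
        w, q = frontier.pop()
        if q in accept:                         # second rule: absorbing product state
            trans.add(((w, q), (w, q)))
            continue
        for (u, v) in ts["transitions"]:        # first rule: synchronise with the DFA
            if u == w:
                nxt = (v, delta(q, ts["labels"][v]))
                trans.add(((w, q), nxt))
                if nxt not in states:
                    states.add(nxt)
                    frontier.append(nxt)
    return states, trans

states, trans = product(simpleflow, "q0", accept={"acc"})
```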
4.4. SMC Model
With the workflow profile and DFA, we can define an SMC model. Our presentation of an SMC follows the probabilistic model checking literature [19] but also specifies the absorbing states explicitly.
Definition 4 (Semi-Markov chain). An SMC (with absorbing states) is a tuple $\mathcal{M} = (S, \alpha, \mathbf{P}, \mathbf{H}, G)$ where
$S$ is a state space of $\mathcal{M}$;
$\alpha$ is an initial (probability) distribution on $S$;
$\mathbf{P}$ is a transition (probability) matrix on $S$;
$\mathbf{H}$ is an $|S| \times |S|$ matrix of continuous probability distribution functions (i.e., Cumulative Distribution Functions (CDFs)) whose supports belong to $\mathbb{R}_{\geq 0}$;
$G \subseteq S$ is a set of absorbing states (i.e., $\mathbf{P}(s, s) = 1$ if $s \in G$).
Let $\mathit{Succ}(s)$ be the set of successor states of $s$, namely, $\mathit{Succ}(s) = \{ s' \in S \mid \mathbf{P}(s, s') > 0 \}$. A path in $\mathcal{M}$ is given by an infinite alternating sequence of states and time values, namely, $\sigma = s_0\, t_0\, s_1\, t_1\, s_2 \cdots$ with $s_i \in S$ and $t_i \in \mathbb{R}_{> 0}$, such that $s_{i+1} \in \mathit{Succ}(s_i)$ for all $i$. For convenience, let $T_i(\sigma) = \sum_{j < i} t_j$, which is the holding time of the path before the $i$-th jump. We use $\mathit{Path}(\mathcal{M})$ to denote the set of paths in $\mathcal{M}$ and $\mathit{Path}(\mathcal{M}, s)$ to denote the set of those paths in $\mathcal{M}$ originating from state $s$. Following the literature [19], we define a probability measure on an SMC as follows: Let $C(s_0, I_0, s_1, I_1, \ldots, I_{i-1}, s_i)$, where each $I_j$ is a non-empty interval in $\mathbb{R}_{\geq 0}$, denote a cylinder set consisting of all paths in $\mathit{Path}(\mathcal{M}, s_0)$ whose $j$-th state is $s_j$ and whose $j$-th time value lies in $I_j$, for all $j < i$. Then, a probability measure $\mathrm{Pr}$ is defined by induction on $i$: If $i = 0$, $\mathrm{Pr}(C(s_0)) = \alpha(s_0)$; for all $i \geq 0$,
$\mathrm{Pr}(C(s_0, I_0, \ldots, s_i, I_i, s_{i+1})) = \mathrm{Pr}(C(s_0, I_0, \ldots, s_i)) \cdot \mathbf{P}(s_i, s_{i+1}) \cdot \int_{\bar{I}_i} \mathrm{d}\mathbf{H}(s_i, s_{i+1}),$
where $\bar{I}_i$ is the closure of $I_i$. Based on the probability measure $\mathrm{Pr}$, we can formally analyse a variety of stochastic and timed properties of an SMC $\mathcal{M}$.
For convenience, let $\mathbf{h}$ contain the corresponding Probability Density Functions (PDFs) for the CDFs in $\mathbf{H}$. The $k$-th moment of a random variable $X$ is the expected value of the $k$-th power of $X$, usually denoted as $E[X^k]$ (where $E$ is the “expectation” operator). Formally, let $F$ and $f$ be the CDF and PDF of $X$, respectively; the $k$-th moment of $X$ is the following quantity (if it exists):
$E[X^k] = \int_{-\infty}^{\infty} x^k f(x)\, \mathrm{d}x.$
The first moment of $X$ is just its expected value $E[X]$. The variance of $X$ is derived from the first and second moments of $X$, namely, $\mathrm{Var}(X) = E[X^2] - (E[X])^2$. We can also speak of the moments of a CDF (or PDF) without referring to a random variable.
4.4.1. Building the SMC Structure
We build an SMC based on a product transition system $\mathcal{W} \otimes \mathcal{A}$. For convenience, if $s = \langle w, q \rangle$, we write $s.w$ for $w$ and $s.q$ for $q$. We first prune $\mathcal{W} \otimes \mathcal{A}$ such that each remaining state reaches at least one absorbing state and is reachable from at least one initial state. Then, the state space $S$ of the SMC is just the set of states in the (pruned) $\mathcal{W} \otimes \mathcal{A}$, and the set $G$ of absorbing states is the set of absorbing states of $\mathcal{W} \otimes \mathcal{A}$. Also, the SMC is “qualitatively determined” in the sense that if $(s, s')$ is a transition in $\mathcal{W} \otimes \mathcal{A}$, then $\mathbf{P}(s, s') > 0$.
4.4.2. Estimating the SMC Quantitative Parameters
The initial distribution $\alpha$ is a distribution over the states in the intersection of $S$ and the set of initial states of $\mathcal{W} \otimes \mathcal{A}$. If this intersection is not a singleton, we usually let $\alpha$ be a uniform distribution over it. But in general, $\alpha$ can be any discrete distribution over the state space.
Recall that the statistics of the execution traces are given as the pair $(\mathrm{Freq}, \mathrm{PSum})$, where $\mathrm{Freq}$ includes the frequencies of transitions, and $\mathrm{PSum}$ includes the power sums of state holding times. We use $\mathrm{Freq}$ to estimate the transition matrix $\mathbf{P}$ of the SMC. For each non-absorbing state $s$ and each $s' \in \mathit{Succ}(s)$, let
$\mathbf{P}(s, s') = \frac{\mathrm{Freq}(s.w, s'.w)}{\sum_{s'' \in \mathit{Succ}(s)} \mathrm{Freq}(s.w, s''.w)}.$
Clearly, if $(s, s')$ is not a transition in $\mathcal{W} \otimes \mathcal{A}$, then $\mathbf{P}(s, s') = 0$.
It is very difficult to determine continuous probability distribution functions from data. We instead use $\mathrm{Freq}$ and $\mathrm{PSum}$ to estimate the moments of the CDFs in $\mathbf{H}$. For each transition $(s, s')$ of the SMC and each power $j = 1, \ldots, m$, we estimate the $j$-th moment of $\mathbf{H}(s, s')$ as
$\frac{\mathrm{PSum}_j(s.w, s'.w)}{\mathrm{Freq}(s.w, s'.w)}.$
We briefly summarise the model building stage in our approach. The fixed workflow model $\mathcal{W}$ and an updatable execution summary form a workflow profile. One or more DFAs are defined to formalise the trace components of $\mathcal{W}$. The workflow profile and the DFA are used together to build an SMC model for time analysis (which is detailed in the next section).
6. Application and Evaluation
In this section, to demonstrate the practicality of our time analysis approach, we present a detailed description of applying our approach to establish a monitoring plan for AWS Step Functions, which also includes an empirical evaluation of the computational efficiency.
6.1. AWS Step Functions
AWS Step Functions (https://aws.amazon.com/step-functions/, accessed on 1 June 2024) is a web service that provides a flexible way to coordinate tasks running on other AWS services, such as AWS Lambda (https://aws.amazon.com/lambda/, accessed on 1 June 2024) (a serverless computing service) and AWS ECS (https://aws.amazon.com/ecs/, accessed on 1 June 2024) (a container-based computing service). It is well suited for running high-throughput workflows and processing massive parallel jobs. Workflows in AWS Step Functions are defined with a domain-specific JSON-style language called the Amazon States Language (https://states-language.net/spec.html, accessed on 1 June 2024), which employs finite state machines (FSMs) as its semantics.
6.2. Example: JobDeciderWorkflow
We envisage a workflow called JobDeciderWorkflow as an example to explain the applicability of our approach to AWS Step Functions. For readability, we do not present its JSON-style specification directly but instead illustrate the specification in Figure 5. The main purpose of this workflow is to run a job in a cost-effective way, either with Lambda (if the job is small to medium-sized) or ECS (if the job is large). Assume that each job of this workflow can be decomposed into multiple parts. All main stages (except “start” and “end”) are implemented as Lambda functions and play the following different roles: the workflow controllers (i.e., “Job Decider” and “Job Complete?”), job runners (i.e., “Run Job in Parallel”), job submitter (i.e., “Submit Job to ECS”), time keeper (i.e., “Wait X Seconds”), status checker (e.g., “Get Status”) and event emitters (i.e., “Emit ‘Successful’” and “Emit ‘Failed’”). If the “Job Decider” decides that the job should be run in Lambda, by exploiting the easy parallelisation of this service, a set of parallel threads is spawned to handle different parts of the job. If the “Job Decider” decides that the job should be submitted to ECS and run there, two loops are used to guarantee the quality of job completion: The inner loop monitors the status of the job, and the outer loop resubmits the job if it has failed. This workflow example manifests many common characteristics of AWS Step Functions workflows.
6.3. Execution Time Monitoring
Execution time monitoring is an important part of maintaining the performance of service-based systems, such as AWS Step Functions. The historical monitoring data provide a baseline, against which the workflow administrator can compare the current performance of the workflow system. A monitoring plan includes a set of baseline metrics and current performance metrics, which are determined by historical and current data, respectively. If the current performance falls below the baseline, the administrator can undertake various countermeasures, ranging from simply reconfiguring some workflow parameters to purchasing more cloud resources to run the tasks. In JobDeciderWorkflow, parameter reconfiguration can involve updating the criterion for running the job with Lambda or ECS and resetting the maximum number of job resubmissions to ECS.
In this case study, we focus on the metrics of execution times. AWS Step Functions supports the monitoring of the task execution time and the end-to-end workflow execution time. However, it does not provide time metrics at the level of traces. Consider the following trace component for JobDeciderWorkflow:
The “Submit Job to ECS” stage is traversed up to i times, where i is a parameter, until the “Emit ‘Successful’” stage is reached.
The time metrics of trace components similar to the above one can help to define a performance baseline and the current performance, and to create a thorough monitoring plan.
6.4. Model Building
In the following, we explain in detail how to apply our time analysis approach (as depicted in Figure 2) to define the performance baseline and the current performance for the JobDeciderWorkflow example. Recall that there are three inputs to our approach, namely (Input I) a transition system as the workflow model, (Input II) a workflow execution summary and (Input III) one or more DFA, which specify the to-be-analysed trace components.
For Input I, the transition system for JobDeciderWorkflow is built from its specification in the Amazon States Language. There are seven types of states in the FSM of AWS Step Functions, namely Task, Choice, Fail, Succeed, Pass, Wait and Parallel. In the model building, we need to decide the level of abstraction. In AWS Step Functions, when a Lambda Task state (i.e., a Task state that runs a Lambda function) is traversed, the following five low-level events are created (in order): “TaskStateEntered”, “LambdaFunctionScheduled”, “LambdaFunctionStarted”, “LambdaFunctionSucceeded” and “TaskStateExited”. Thus, the transition system includes five low-level states for each Lambda Task state in the “Run Job in Parallel” step. Alternatively, we can consider all steps in JobDeciderWorkflow (including the Task states) as single abstract states in the transition system. Another problem is the parallel states in the FSM, which, if directly translated to a transition system, result in a state explosion: the composition of N parallel Lambda Task states results in a number of lower-level states that grows exponentially with N. However, since our objective is time analysis, we can model the parallel threads with non-deterministic last-completed threads (c.f., Section 4.1). When an execution trace traverses the step “Run Job in Parallel”, the traverse time is determined by the thread that takes the longest time to complete its part of the job. Thus, the step “Run Job in Parallel” can be modelled with a number of states that grows only linearly with N in the transition system.
For Input II, the statistical values in the execution summary are the frequencies of transitions and the power sums of time intervals (as specified in Section 4.2), which are calculated from the execution traces. The log data of execution traces are available in AWS Step Functions. In particular, the getExecutionHistory API (https://docs.aws.amazon.com/step-functions/latest/apireference/API_Operations.html, accessed on 1 June 2024) of AWS Step Functions returns a list of (raw) timestamped trace events for a specific workflow execution. We filter the execution traces whose last events have timestamps falling into a prescribed time window and then remove the unnecessary events in the following two steps: First, we remove the events that do not belong to the last-completed threads. For example, if “Run Part-1” takes the longest time to complete in a trace, we remove all the events related to “Run Part-2”, “Run Part-3”, …, and “Run Part-N” in the same trace. Second, based on the modelling abstraction level, we may also need to remove the unnecessary low-level events. For example, if we adopt the high-level modelling of the transition system, among the aforementioned five low-level events for each Lambda Task state, we retain “TaskStateEntered” only and remove the other four. After removing the unnecessary events, we can replay the traverses of paths in the transition system and calculate the values of frequencies and power sums, following the method presented in Section 4.2. Moreover, by setting two suitable time windows, we can create two versions of summaries, namely, the baseline summary and the current performance summary. For example, we can calculate the former from data in the past week and the latter from data in the past hour. Note that the baseline summary can be updated with sliding or decaying windows, and that the mergeability of frequencies and power sums can facilitate the parallel calculation of both summaries in practice.
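A minimal sketch of this retrieval step with the boto3 SDK is shown below; the field names follow our reading of the GetExecutionHistory response (each event carries a type, a timestamp and, for state-entered events, the state name), and pagination via nextToken as well as error handling are omitted.

```python
# Sketch: pull the event history of one execution and keep the "TaskStateEntered" events as
# (state_name, timestamp) pairs, matching the trace format assumed in the Section 4.2 sketch.
import boto3

def entered_events(execution_arn, region="us-east-1"):
    sfn = boto3.client("stepfunctions", region_name=region)
    history = sfn.get_execution_history(executionArn=execution_arn, maxResults=1000)
    trace = []
    for event in history["events"]:
        if event["type"] == "TaskStateEntered":
            details = event["stateEnteredEventDetails"]
            trace.append((details["name"], event["timestamp"].timestamp()))
    return trace
```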
For Input III, the DFA for specifying trace components are user-defined. The expressive power of DFA allows a broad range of complex trace components to be defined and analysed. The trace component for JobDeciderWorkflow informally described in Section 6.3 can be formalised with the DFA in Figure 6, whose number of locations depends on i. Note that the DFA is parametric on i; in particular, if i = 0, the trace component just includes the traces traversing the “Run Job in Parallel” step but not the “Submit Job to ECS” step. As DFAs are independent of the two execution summaries, the set of user-defined DFAs can be updated dynamically.
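As an illustration, such a parametric DFA can be generated programmatically; the sketch below is our own construction (it counts entries to the “Submit Job to ECS” stage up to i and accepts once “Emit ‘Successful’” is observed), not the exact automaton of Figure 6.

```python
# Sketch: generate, for a given i, a DFA transition function over label sets. Locations
# c0..ci count how many times "Submit Job to ECS" has been entered; "acc" is the accept
# location reached on "Emit 'Successful'"; "rej" is a sink for violating traces.
def make_delta(i):
    def delta(q, label):
        if q in ("acc", "rej"):
            return q
        k = int(q[1:])                              # current submission count (initial location "c0")
        if "Submit Job to ECS" in label:
            return f"c{k + 1}" if k < i else "rej"  # more than i submissions: reject
        if "Emit 'Successful'" in label:
            return "acc"
        if "Emit 'Failed'" in label:
            return "rej"
        return q
    return delta

delta_3 = make_delta(3)   # trace component: at most 3 ECS submissions, then "Emit 'Successful'"
```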
6.5. Evaluation of Computation Time
Provided with the three inputs (i.e., the workflow transition system, an execution summary and the DFA; Table 1), the computation includes building the SMC model, and generating and solving the linear systems of FPT moments. In the empirical evaluation, we parametrised the value of i in the DFA in Figure 6 for JobDeciderWorkflow and created SMC models of different sizes. The experimental data, as summarised in Table 1 and depicted in Figure 7, were collected on a Linux system with a 2.8 GHz Intel Core i7 CPU and 32 GB memory. As shown in the figure, the computation time depends on the size (i.e., the number of states) of the SMC models, which is also the size of the linear systems. Within 16 s, we can compute the first four moments for SMC models with up to 5000 states, and within 70 s, we can compute the same moments for models with up to 10,000 states. The superlinear increase in computation time with respect to model size is mainly due to the need to solve linear systems (c.f., Proposition 2). However, we believe that the meaningful time analysis of most real-world workflows in our approach results in SMC models with no more than 10,000 states. In Figure 7, the largest SMC model is built from the product of the JobDeciderWorkflow transition system with 13 states and a DFA with 2400 locations. Another observation from Figure 7 is the linear increase in computation time as the order of the FPT moment increases (note that the computation time of the first FPT moment also includes the model building time). This conforms with the fact that the linear systems for FPT moments are of the same size regardless of the moment order.
6.6. Summary and Discussion
To apply our approach to performance monitoring in AWS Step Functions, we translate the JSON-style specification of a workflow into a fixed transition system in a suitable level of abstraction, maintain two sets of execution summaries, which are created from the historical and recent workflow execution traces, respectively, and employ DFA to specify trace components, whose time metrics are interesting to monitor. With these three inputs, the building of the SMC models and the computation of the FPT moments in those models are automated and efficient. We can use the FPT moments as time metrics to define the baseline and current performance. Alternatively, we can further infer the FPT quantiles from the FPT moments using the method of moments. Therefore, our approach is a useful monitoring technique for AWS Step Functions, and the evaluation of computation time also demonstrates its efficiency in general.
It is noteworthy that, although this case study focuses on AWS Step Functions, our approach can be applied to other realistic workflow systems. The process follows the same steps, namely, the preparation of the three inputs for model building (c.f., Section 6.4). However, the translation of workflow specifications into transition systems and the collection of execution statistics are platform-dependent. In AWS Step Functions, workflows are specified in the JSON format, and execution statistics can be retrieved via a specific API.
7. Related Work
Workflow execution time analysis has been heavily investigated in existing works, most of which belong to a specific problem domain, such as makespan estimation for workflow scheduling and QoS aggregation in service composition (i.e., how to aggregate QoS metrics, such as execution times, for composite services from those of atomic services). Many works in those two domains represent makespans and QoS metrics as constant values (e.g., [30,31,32,33,34]). We discuss only those works in which the makespans and QoS metrics are modelled as stochastic values, as those analytic methods are more related to our approach.
In the framework proposed by Hwang et al. [22], the QoS metrics of atomic services are modelled as independent discrete random variables. Their algorithm works recursively to derive the Probability Mass Functions (PMFs) for the service QoS metrics through five common composition structures (i.e., sequential, conditional, loop, synchronised parallel and unsynchronised parallel) until the PMF of the end-to-end QoS is computed. Although the PMF composition is straightforward, in order to represent the empirical QoS values accurately, the support of the PMF must consist of a high number of small discretised value ranges. The complexity of their algorithm grows exponentially as the support size of the PMF increases. Zheng et al. [24] presented an approach to aggregate the continuously distributed QoS metrics of service components in the same aforementioned five composition patterns. In order to compute the composite QoS, they assumed that the distributions of all QoS metrics must have the same form, and that the histograms must have the same start point and interval width. This assumption may be overly restrictive in applications.
Zheng et al. [23] employed an SMC to formally model composite services. They derived basic statistical parameters (i.e., means and variances) of QoS metrics (i.e., the response time and availability probability) for composite services in closed form in terms of the basic statistical parameters of the atomic services. Although we also use an SMC as the formal model, our method computes not only the first and second moments but also higher-order moments numerically, which are not so convenient to express in closed form. Our approach is also equipped with an expressive formalism (i.e., DFA) to specify trace components for quantitative verification.
In the task scheduling framework for workflows proposed by Chirkin et al. [18], an essential component is a method to infer the execution time of workflows from generally (and independently) distributed times of basic tasks. Their method estimates both the CDFs and characteristic functions (CFs) of task times. By representing the workflow as a directed acyclic graph (DAG) (without loops), they can employ the former to aggregate the parallel structure and the latter to aggregate the sequential and branching structures. During the aggregation process, their method uses Fourier transforms between CDFs and CFs at the cost of some approximation error. In contrast, by considering the last-completed parallel threads for execution time analysis purposes, our approach reduces the parallel structure to the branching structure and works on the computation of moments in the presence of loops. Moreover, the moment estimation of task times from workflow execution data is much easier than the estimation of CDFs and CFs.
Glatard et al. [17] proposed a probabilistic model to estimate workflow makespans for scientific grid applications. In their model, they separated the task time and the grid latency for executing a task, where the former is a constant value and the latter is a random variable. In practice, the randomised grid latency represents the time variations caused by factors such as new-resource provision, system failures or network interruptions. They considered the parallel composition (called “data parallel”) and the sequential parallel composition (called “data and service parallel”) only. As with Chirkin et al. [18], their probabilistic model assumes a DAG workflow topology. The aggregation of the makespan means and variances in those two kinds of compositions can be viewed as a special case of Zheng et al.’s approach [23] and our approach. Rogge-Solti and Weske [7] presented a method based on the Generalised Stochastic Petri Net (GSPN) [35] (which is a more expressive stochastic model than an SMC in that the independence assumption for the general distributions of state holding times is relaxed) to predict the remaining time to workflow completion and the likelihood of breaching workflow deadlines. However, in order to solve the GSPN, one can only resort to discrete-event simulation, which is suitable for offline analysis but too time-consuming for online monitoring. Meyer et al. [36] proposed the Timed Probabilistic Workflow Net (TPMN), which is underpinned by Markov Decision Process (MDP) semantics, and analysed the expected execution time of TPMN. However, task times are modelled as fixed costs in TPMN rather than random variables. Carnevali et al. [37] employed Stochastic Time Petri Nets to analyse complex workflows that are constructed through sequence, choice/merge, and balanced/unbalanced split/join operators. They provided an efficient and accurate compositional approach to derive an upper bound on the response time distribution. Belgacem and Beghdad-Bey [38] investigated the makespan and the cost of virtual machine usage in cloud-computing workflows. They used a multi-objective optimisation method to optimise the trade-off between the two factors. Wang et al. [39] presented a probabilistic modelling and evolutionary optimisation approach to analyse the performance and cost of serverless workflows as well as the trade-off between the two metrics. Similarly, Zhang et al. [40] considered effective workflow scheduling strategies for cloud computing services while ensuring service execution reliability. They proposed a formal approach that includes a reliability checkpoint mechanism to meet workflow deadlines and energy consumption constraints.
In general, our execution time analysis is different from the aforementioned works in that we adopt a quantitative verification approach. In particular, we use DFA to specify trace components whose time metrics are interesting to analyse in a workflow. Su et al. [12] employed model checking to analyse cloud-native event-streaming systems. Due to the empirical estimation of event stream statistics, they considered how the empirical estimation influences the model checking output. A DTMC was used as the formal model, and thus, their approach does not aim to address continuous-time metrics such as the first passage time (FPT), which is the focus of the current work. However, in this work, we do not employ the standard technique of probabilistic model checking directly. The well-established quantitative verification technique for CTMCs (based on uniformisation) [16] cannot be generalised to SMCs, which allow state holding times to be governed by general distributions. We report several recent advancements in probabilistic model checking for SMCs and related models as follows.
López et al. [19] formalised a quantitative verification framework for SMCs, in which the verification of an SMC against the expressive Continuous Stochastic Logic (CSL) is reduced to FPT analysis. They further demonstrated that the FPT analysis for an SMC requires solving a set of Volterra integral equations, which seriously limits its scalability. Paolieri et al. [41] proposed another model called the Stochastic Time Petri Net (STPN), whose expressive power lies between that of an SMC and a GSPN, and demonstrated that the verification of STPN can also resort to (a variant of) the Volterra equation system. Although this extends the previous work, no efficiency benefit is gained with their technique. Bryans et al. [42] verified stochastic automata with underlying SMCs against “non-nested until” formulas in CSL by unfolding the model into a region tree, where regions are defined by relations between randomised clocks. A direct computation of the probabilities of reaching individual regions requires a number of integration operations that grows exponentially with the number of regions.
Compared with the above model-based and quantitative analysis literature, our work adopts a very different approach. We exploited the history-independence property of SMCs (that is, the moment property of convolution) and developed a sequence of linear systems to compute the moments of the FPT, whose coefficients are the moments of the state holding times. In addition to the computational efficiency, an advantage of working with moments is the easy estimation of moments from empirical data, especially for probability distributions whose forms are unknown.
It is also noteworthy that, beyond the context of model-based and quantitative analysis, our approach is related to the literature that utilises mergeable summaries of moments to describe data streams and recovers quantiles of the streams from those moments [25].
8. Conclusions
We presented a formal approach to analyse time metrics for workflow systems at the level of traces. Our objective is to compute the randomised time metrics of trace components, which represent subsets of workflow traces that satisfy specific temporal patterns. In the model building stage of our approach, we create compact trace profiles from the workflow specification and the workflow execution data summaries, define trace components using DFA, and build SMC models from the profiles and DFA. Our core computation method is an efficient quantitative verification technique that produces the moments of the FPT in an SMC. We demonstrated the applicability of our approach to create a monitoring plan in real-world workflow systems such as AWS Step Functions. The computation time evaluation shows the efficiency of our approach in computing high-order FPT moments for large workflow instances. In particular, our approach can compute up to the fourth moment for a workflow model with 10,000 states within 70 s.
In future work, we plan to apply our approach to other real-world workflow systems. We also plan to extend the current approach to analyse not only time metrics but also other common QoS metrics, such as reliability and availability. Also, despite its expressive power, encoding temporal properties directly as DFA is rather low-level. To improve the usability of our approach, we can introduce a high-level expressive notation (such as a notation based on Linear Temporal Logic) to specify trace components. Finally, as mentioned in Section 5.3, in practice, we usually do not know the exact forms of the probability distributions for time metrics. In order to infer quantiles from moments in the absence of distribution functions with known forms, we plan to develop more advanced means of statistical inference than the standard method of moments for our approach.