1. Introduction
The Internet of Things (IoT) is a novel expansion of the Internet that forms a network of heterogeneous physical objects such as sensors, video cameras, mobile phones, and industrial machines (all of which we refer to as IoT devices) that can communicate and exchange data with each other over the Internet [
1,
2,
3].
In recent years, the quantity of data generated from IoT devices (we refer to this data as IoT data) has grown significantly and there has been a great deal of interest in extracting valuable insights from these data [
4]. To accomplish this, IoT applications collect IoT data from appropriate IoT devices and produce high-value information by analyzing the collected IoT data.
In this paper, we focus on IoT applications that must complete the tasks of collecting IoT data and analyzing them within an application-specific time-bound to produce high-value results. If this is not achieved, the value of such applications and the results they produce depreciates. We refer to such applications as Time-Sensitive IoT (TS-IoT) applications and such time-related requirements of the data analysis as time-bound requirements. Consider, for example, a vehicle accident prediction application that must (1) gather IoT data generated from sensors located at traffic lights, vehicles, and traffic cameras deployed at intersections, (2) analyze the collected data to predict a possible accident using machine learning techniques [
5,
6], and (3) prevent the accident by informing the corresponding driver in near real-time (e.g., within a 30 ms time-bound). If the data analysis takes longer than the application time-bound, the predicted accident information arrives too late to prevent the accident. To discuss further the problem of satisfying time-bound requirements, consider that TS-IoT applications comprise a set of tasks and are intrinsically distributed. Each of these tasks may need to perform one of the following: collect IoT data from heterogeneous IoT devices, process the collected data using various techniques such as stream processing and resource-intensive machine learning and statistical techniques, manage the storage requirements for stateful data analysis, and maintain the data analysis pipelines (i.e., results produced by a task may be used as an input to another task/tasks in the same application). Currently, TS-IoT applications comprised of such tasks are executed in distributed IoT environments.
The IoT environments consist of various heterogeneous distributed computing resources [
3,
7]. These computing resources include: (1) IoT devices that are directly connected with sensors, which generate IoT data; (2) edge-computers, i.e., any computing resource that is closer than the cloud to the IoT devices and is not directly connected to them (e.g., network gateways, dedicated edge computers); and (3) cloud data centers. These computing resources are connected to each other and/or to the Internet via multiple heterogeneous networks (e.g., narrowband/NB-IoT, LoRa, BLE, and Wi-Fi). Moreover, IoT environments tend to be volatile due to (1) unpredictable IoT data generation rates, (2) uncertain availability of resources caused by mobility, connection issues, etc., and (3) overloaded resources ascribed to multitenancy (i.e., multiple applications utilizing the same set of resources).
Whether the time-bound requirements of TS-IoT applications are met is largely determined by the total application execution time. This can be computed as the sum of the total data processing time and the total data communication time. The total data communication time is influenced by the network delays involved in moving IoT data to the resources that process them, whereas the total data processing time is influenced by the computing resource where the data analysis is performed. Therefore, guaranteeing the time-bound requirements depends significantly on the choice of suitable resources from the IoT environment. Nevertheless, there are trade-offs in the selection of cloud virtual machines, edge computers, and/or IoT device resources to execute a TS-IoT application [
8,
9]. Although analyzing IoT data on IoT devices yields the lowest communication delays, IoT devices have limited computing resources. Edge computers have greater computing resources than IoT devices, but they are likely to suffer from more communication delays than IoT devices. Cloud virtual machines offer virtually unlimited resources [
8]; therefore, several researchers [
10] have used cloud computing resources to process large quantities of sensor data, such as remote sensing data. However, processing these data in the cloud induces significant communication delays when IoT data is transferred to the cloud [
11,
12]. Moreover, each task of the TS-IoT application also has diverse resource requirements. For example, a machine learning classification task may require more computing resources than a simple data aggregation task. Therefore, although it is usually feasible to meet the time-bound requirements of each TS-IoT application by distributing tasks for execution in the IoT devices, edge computers, and the cloud, we must determine the best possible distribution of tasks from the perspective of communication and computing resource constraints. However, determining the task distribution for TS-IoT applications is more difficult than for other applications due to the volatility of the IoT environment and the unpredictability of IoT data streams. Therefore, task distributions of TS-IoT applications may need to be dynamically adapted to deal with possible time-bound violations caused by the unpredictable and volatile nature of the IoT environment. Thus, this has become a major research challenge.
Many existing studies in IoT have proposed various task management techniques (i.e., techniques that are used to manage the execution of IoT applications in distributed computing resources) to address the research challenge of supporting the time-bound requirements of TS-IoT applications. These task management techniques include: (1) task sizing techniques that determine the amount of computing and networking resource needed to complete the execution of TS-IoT application tasks; (2) task distribution techniques that distribute and execute TS-IoT application tasks in the IoT environment; and (3) task adaptation techniques that dynamically adapt the distribution of TS-IoT application tasks to mitigate possible time-bound violations. However, existing task sizing techniques [
13,
14,
15] have depended on simulation tools [
16] or include limited testbed implementations. Thus, they cannot accurately estimate the suitable quantity of resources required for TS-IoT application tasks. Existing task distribution techniques have employed computationally expensive complex optimization techniques [
17,
18,
19,
20] because they have been built for use with distributed real-time systems that include controlled execution environments, unlike volatile IoT environments, and most of them cannot scale with a large number of IoT devices. Thus, existing task distribution techniques are not suitable to use with TS-IoT applications that have stringent time-bound requirements. Task adaptation techniques [
18,
21] that are found in the literature involve high overheads, and most of them adapt the distribution of tasks in a reactive manner (i.e., only after observing that the application's execution time has exceeded its time-bound). Finally, there is a lack of comprehensive task management techniques that integrate task sizing, task distribution, and task adaptation techniques to collectively enable TS-IoT applications to meet their time-bound requirements.
To overcome this shortcoming, in this paper, we propose a novel Time-Sensitive IoT Data Analysis platform called TIDA that includes a task management technique, called DTDA, which utilizes the computing resources available in the IoT devices, edge computers, and the cloud for meeting the time-bound requirements of each TS-IoT application when the entire pool of available computing resources is sufficient to collectively achieve this. Specifically, this paper significantly extends our earlier work [
3] and makes the following novel contributions:
- 1.
A novel Dynamic Task Distribution and Adaptation (DTDA) technique. Our earlier work [3] presented a greedy task distribution technique that is unable to dynamically adapt the task distribution in real-time to cope with the volatile changes in the IoT environment. The DTDA technique presented in this paper addresses this key limitation of the greedy task distribution technique.
- 2.
A complete implementation of the improved task sizing and novel task adaptation techniques using Microsoft's Orleans Actor framework.
- 3.
A comprehensive experimental evaluation using a real-world smart city use case and related dataset that includes: (a) an experimental setup that forms four clusters of computing resources with heterogeneous system configurations to validate the DTDA task management technique; and (b) a comprehensive comparison with two current state-of-the-art task management techniques that shows how well the TIDA platform, which implements the above techniques, meets the time-bound requirements of TS-IoT applications.
The remainder of the paper is organized as follows.
Section 2 presents the related work,
Section 3 presents a motivating use case scenario,
Section 4 describes the system model and problem formulation,
Section 5 discusses the dynamic task distribution and adaptation technique, and
Section 6 presents the design and implementation of the TIDA platform.
Section 7 presents the experimental evaluation results and
Section 8 concludes the paper and outlines potential future work.
2. Related Work
TS-IoT applications are executed in an IoT environment that connects a variety of IoT devices such as sensors, mobile phones, cameras, and industrial machines with each other and/or to the Internet. Intrinsically, this environment is highly distributed. Traditional distributed real-time systems consist of a set of computer resources interconnected by a real-time communication network [
22]. Generally, distributed real-time systems include a controlled execution environment (i.e., an environment where applications are executed). In addition, distributed real-time systems found in power plants and factory control systems are not connected to the Internet. In these distributed real-time systems, attributes of the execution environment, such as latencies of networks, data generation rates, and available computing resources, can be determined in advance [
23,
24]. On the contrary, although the IoT environment is highly distributed, it has unique characteristics that, compared to traditional distributed real-time systems, impose additional challenges for executing TS-IoT applications in such a manner that their time-bound requirements are met [
3,
8,
25]. We further discuss these additional challenges of the IoT environment in
Section 3 via a motivation scenario, and in
Section 4 by presenting a formal model for the IoT environment.
Many techniques can be found in the literature that have been developed for distributed real-time systems and aim to optimize the execution of real-time applications in terms of execution time [
26], cost of execution [
27], and energy consumption [
28]. Furthermore, due to the controlled execution environment of the distributed real-time systems, these techniques largely consist of optimization-based scheduling techniques [
22,
29,
30,
31] that can effectively determine which task needs to be executed, at what time it needs to be executed, and the resource that should be used for the execution. However, such optimization-based techniques are only effective in controlled and predictable execution environments. More specifically, in this paper, we focus on techniques for dynamically distributing and adapting the tasks of TS-IoT applications by determining the relevant communication delays involved and the needed computing resource capacities. However, adaptive scheduling techniques found in distributed real-time systems determine which tasks need to be executed, at which times, to meet the deadlines of the system. Therefore, the task scheduling techniques developed for distributed real-time systems cannot be used for TS-IoT application task distribution. Due to this reason and the unique characteristics of the IoT environment (such as heterogeneity and volatility) compared to traditional distributed real-time systems, techniques developed for distributed real-time systems cannot be effectively used with TS-IoT applications to meet their time-bound requirements. Therefore, many studies have been recently conducted to devise suitable techniques for facilitating TS-IoT applications to meet their time-bound requirements.
Meeting the time-bound requirements of TS-IoT applications is challenging due to the heterogeneous and volatile nature of the IoT environment and the time-bound requirements of such applications [
8,
32]. In one of our previous works [
3], we proposed an approach for dealing with these challenges that involves decomposing TS-IoT applications into a collection of interrelated tasks and selecting the appropriate IoT computing and network resources to execute the tasks of each application in such a manner that they collectively meet the application’s time-bound requirements. To enable such task distribution, a task sizing technique was proposed for estimating the computing and network resources required by the tasks of TS-IoT applications. Related work in determining the most suitable IoT resources for computing IoT application tasks includes [
13,
14], which (1) investigated how to estimate the computing resources required by cloud-based IoT applications based on historical performance metrics, and (2) evaluated various techniques for achieving this via the CloudSim simulator. Zeng et al. [
15] presented a simulator called IOTSim to analyze IoT applications to understand how they would perform in cloud computing environments. In [
16], the researchers proposed a technique for measuring the performance of computing resources when different IoT application tasks are executed, whereas Korala et al. [
33] introduced a platform to experimentally evaluate the performance of TS-IoT applications. Alhamazani et al. [
34] proposed another approach for measuring performances of IoT applications across multiple cloud environments. Souza et al. [
35] proposed a novel framework to take measurements of IoT microservices-based applications deployed across edge and cloud environments.
Most related research in task distribution has considered this problem as an optimization problem and proposed various optimization techniques (such as linear programming, non-linear programming, and heuristic techniques) for this purpose. For example, Taneja et al. [
36] proposed a technique for efficient distribution of application tasks across cloud and edge resources in a resource-aware manner. Skarlat et al. [
37] proposed an optimization technique that generates a task execution plan for IoT applications. Hong et al. [
17] introduced a technique for optimizing the scheduling of IoT application tasks in edge devices. Yousefpour et al. [
18] formulated IoT application distribution as an Integer Non-Linear Problem (INLP). The authors then used INLP to minimize the cost of resource usage while satisfying the QoS requirements of the applications. The optimization techniques proposed by [
12,
19,
20] determine appropriate computing resource selection for meeting the QoS requirements of IoT applications. Related computing frameworks and tools, such as those in [
38,
39,
40,
41,
42], have employed similar techniques to manage the distribution of TS-IoT applications. Zhang et al. [
43] proposed a recommender system for dealing with the heterogeneity of cloud computing resources.
The IoT is subject to uncertainties, including unpredictable IoT data, volatile and mobile computing resources, and resource-constrained computing resources that become overloaded due to multitenancy [
35]. Therefore, task execution plans may need to be adapted dynamically to deal with possible time-bound violations. To address this, many related studies have presented dynamic task redistribution techniques, which involve generating new task execution plans periodically and/or in instances where certain computing resources become overloaded. Yousefpour et al. [
18] introduced a technique to dynamically adapt the task execution plans by periodically releasing and deploying tasks when there are QoS requirement violations. Skarlat et al. [
21] also proposed a technique to dynamically redistribute the tasks when certain resources are overloaded and/or disconnected from the IoT environment. Xu et al. [
44] proposed a solution that can recommend appropriate virtual machines for IoT application workloads in edge-cloud environments using a tree-based machine learning algorithm to predict performance metrics for these IoT application workloads.
In summary, optimization-based techniques that have been developed for distributed real-time systems cannot be used with TS-IoT applications to enable them to meet their time-bound requirements due to the volatile nature of the IoT environment. Task sizing techniques found in the literature have relied on simulation tools [
13,
14] or include limited testbeds [
16] for sizing tasks using estimations. These techniques cannot effectively estimate the resources needed by TS-IoT application tasks because they do not deal with the heterogeneity and dynamic nature of the IoT environment. Most of the task distribution techniques in the literature employ complex optimization techniques [
17,
18,
19,
20] to devise task execution plans; furthermore, most of them do not consider task sizing and are expensive to compute.
18,
21] have mainly employed task redistribution techniques and they dynamically adapt the task execution plans in a reactive manner (i.e., after observing that the application has failed to meet the time-bound requirement or after observing the resources are overloaded), which is not an effective solution to guarantee the time-bound requirements. Due to these reasons, these techniques are not suitable for TS-IoT applications that have demanding time-bound requirements. On the contrary, TIDA includes a Dynamic Task Distribution and Adaptation (DTDA) task management technique that integrates: (1) a task sizing technique that measures the computing and network resources required by the tasks when they are executed in the IoT environment; (2) a greedy task distribution technique that uses the task sizing information to generate time-bound satisfying task execution plans to distribute tasks in the IoT environment; and (3) a dynamic task adaptation technique that utilizes a predictive machine learning model to accurately predict possible time-bound violations and make necessary adaptations to the task execution plans to ensure the time-bound requirements are met. Furthermore, TIDA was implemented by extending Microsoft Orleans and the greedy algorithm was evaluated using a real-world smart city application.
3. Smart City Passenger Counting Application—Motivating Scenario
Let us consider a smart city application that requires an accurate count of passengers for a public transport system in near real-time. The passenger count information is used by the transport service to improve planning and scheduling of buses, allocate buses or trains to meet the actual demand, and respond to unplanned incidents such as bus breakdowns and accidents.
Figure 1 illustrates the motivating scenario, computing resources, and IoT data analysis tasks in this TS-IoT application.
To count passengers in this smart city environment we utilized the following IoT devices, edge computers, and cloud resources:
Orbbec Persee IoT devices providing a combination of RGB and infrared cameras with a fully functioning onboard computer were mounted above the doors of each bus. We used these devices to count the passengers stepping in and out of each bus at each bus stop in the transport network. The IoT data generated by these IoT devices included (1) video data, (2) depth sensor data, and (3) infrared data at 30 frames per second. In addition to generating a large volume and variety of IoT data from their sensors, the Orbbec Persee devices provide internal computing and storage resources.
Edge computers at bus stops and train stations. These edge computers act as gateways for IoT devices and connect to the cloud data center via the Internet. Furthermore, edge computers also include additional computing and storage resources that can be used for IoT data analysis.
A cloud data center with virtually unlimited computing resources.
In this IoT environment, the IoT devices, edge computers, and the cloud are connected via different networks (e.g., NB-IoT, 4G, and broadband). The Orbbec Persee IoT devices incorporate Wi-Fi cards and, as a result, they can connect to the edge computer at each bus stop. In addition, these IoT devices can also be directly connected to the cloud via 4G during the entire bus journey. However, IoT devices can connect to edge computers only when they are near bus stops or train stations. The edge computers and the cloud data centers are connected via broadband Internet. To compute the occupancy of each bus and the total occupancy, this TS-IoT application must perform the following: (1) capture passenger data while passengers are stepping in and out of each bus; (2) analyze the collected RGB/infrared/depth data and recognize individual passengers; and (3) compute the occupancy of each bus at each bus stop and the entire transport network. This data analysis may involve the following sub-tasks: (1) pre-processing the collected RGB/infrared/depth data; (2) classifying passengers as entering or exiting by applying classification techniques such as the Haar-cascade classifier (please note that, in this paper, we consider the classifier to be an already trained classifier; hence, training the classifier is not considered to be an IoT data analysis task and is not discussed further in this paper); (3) calculating the total occupancy of the bus; (4) computing the total occupancy of all the buses in the transport network.
The IoT passenger count application has a variable time-bound that is hard to meet, i.e., it fails to meet its time-bound requirement when any bus reaches the next bus stop before its occupancy information from the previous bus stop has been computed. Meeting the time-bound requirements in the IoT often depends on the appropriate selection of computing and networking resources for each TS-IoT application. In the passenger counting IoT application, although we could perform the entire data analysis quickly in the cloud, doing so may involve a significant communication delay to collect all the passenger RGB/infrared/depth data. Offloading the collected passenger data to the edge computers and performing the data analysis in the edge computers is another option. However, we only have a limited time to transfer the passenger data to the edge computer, and the computing resources in edge computers are more limited than in the cloud. Processing data in an IoT device itself is another option that is viable only if an IoT device has enough computing resources available for the tasks of the IoT application at hand.
Therefore, to meet the time-bound requirements of this and any other TS-IoT application, we must determine the best possible distribution of the data analysis tasks that comprise the TS-IoT application from the perspective of providing enough computing resources and communication capacity, and compute the assigned analysis tasks in a manner such that the entire TS-IoT application meets its time-bound(s).
However, as discussed in
Section 2, unlike distributed real-time systems, the IoT environment introduces additional challenges. The IoT environment consists of heterogeneous computing resources connected by multiple networks, which have different data transfer rates, latencies, and bandwidth values. Moreover, the IoT environment tends to be volatile due to (1) unpredictable IoT data generation rates; (2) uncertain availability of resources caused by mobility, connection issues, etc.; and (3) overloaded resources attributed to multitenancy [
3,
8,
25]. Therefore, TS-IoT applications require adaptation of their task distribution at runtime to continue to satisfy the application time-bounds. For example, the number of passengers onboard at each bus stop varies and cannot be predicted; thus, the passenger onboarding data fluctuates throughout the bus journey. Moreover, some computing resources (e.g., the cloud virtual machines) that are processing the TS-IoT application tasks may disconnect from the IoT environment due to network issues. Additionally, some computing resources such as edge computers may become overloaded due to multiple applications competing for resources, in addition to simply not having enough available resources to handle the incoming variable IoT data. Hence, the computing and storage resources of the IoT are volatile. As a result of this, although we determine the best possible distribution of tasks for the TS-IoT application before deployment, it is a challenging task to meet the application time-bounds during the application runtime. Moreover, due to these reasons, existing optimization-based techniques devised for distributed real-time systems are not suitable and are ineffective for supporting the time-bound requirements of TS-IoT applications, because it is necessary to dynamically adapt the distribution of the tasks of TS-IoT applications to deal with possible time-bound violations caused by the volatile nature of the IoT environment.
4. System Model & Problem Formulation
Due to the trade-offs between IoT resources in the distributed IoT environment, it is necessary to generate a task execution plan (which meets the application’s time-bound requirement) by determining the relevant communication delays involved and the needed computing resource capacities for each task. A task execution plan is a mapping of the tasks of the TS-IoT application to the corresponding computing resources where the tasks are executed. To address this, first, we present a formal description of the resources in the IoT environment and the TS-IoT applications. Then, we formulate the task distribution problem as an optimization problem.
4.1. Resource Model
Computing resources (i.e., IoT devices, edge computers, and the cloud) and network resources in the distributed IoT environment form a graph $G_{env} = (R, L)$, where $R$ represents the distributed computing resources and $L$ represents the network links between computing resources. A single computing resource of $R$ can be denoted as $r_i$, where $r_i \in R$ and $1 \le i \le m$; $m$ is the total number of computing resources in $R$. Each $r_i$ has an attribute called $cap_i$, which is the amount of computing and storage resources available at $r_i$. Further, $cap_i$ can be represented as a tuple $\langle cpu_i, mem_i, sto_i \rangle$, i.e., the available CPU, memory, and storage capacities of $r_i$.
Figure 2 shows the IoT environment graph of the resource model.
A single network link of $L$ represents the network resources of a network link between two computing resources, $r_j$ and $r_k$. This can be denoted as $l_{j,k}$, where $j$ and $k$ denote the corresponding indexes of the two computing resources that are connected via network link $l_{j,k}$. Each $l_{j,k}$ has the following attribute: $bw_{j,k}$ is the amount of available bandwidth of the network link $l_{j,k}$. Furthermore, $bw_{j,k}$ is captured by a tuple $\langle bw^{up}_{j,k}, bw^{down}_{j,k} \rangle$, where $bw^{up}_{j,k}$ is the amount of available upload bandwidth and $bw^{down}_{j,k}$ is the amount of available download bandwidth in $l_{j,k}$.
As discussed in
Section 3, due to the volatile nature of the IoT environment, the available capacities $cap_i$ of computing resources, the number of connected computing resources in the IoT environment, and the available bandwidths $bw_{j,k}$ of the network links vary over time, and the degree of change is hard to predict.
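To make the resource model concrete, the following is a minimal Python sketch of the environment graph $G_{env} = (R, L)$; the class and attribute names, and the example capacity and bandwidth values, are illustrative assumptions rather than part of the TIDA implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class ComputingResource:
    """A computing resource r_i (IoT device, edge computer, or cloud VM)."""
    name: str
    cpu: float      # available CPU capacity (e.g., cores)
    mem: float      # available memory (MB)
    storage: float  # available storage (MB)

@dataclass
class NetworkLink:
    """A network link l_{j,k} between two computing resources."""
    endpoints: Tuple[str, str]  # names of the two connected resources
    bw_up: float                # available upload bandwidth (Mbit/s)
    bw_down: float              # available download bandwidth (Mbit/s)

@dataclass
class IoTEnvironment:
    """The environment graph G_env = (R, L)."""
    resources: Dict[str, ComputingResource] = field(default_factory=dict)
    links: Dict[Tuple[str, str], NetworkLink] = field(default_factory=dict)

# Example: one IoT device, one edge computer, and a cloud VM.
env = IoTEnvironment()
env.resources["device"] = ComputingResource("device", cpu=4, mem=2048, storage=16000)
env.resources["edge"] = ComputingResource("edge", cpu=8, mem=16384, storage=512000)
env.resources["cloud"] = ComputingResource("cloud", cpu=64, mem=262144, storage=10**7)
env.links[("device", "edge")] = NetworkLink(("device", "edge"), bw_up=20, bw_down=20)
env.links[("edge", "cloud")] = NetworkLink(("edge", "cloud"), bw_up=100, bw_down=100)
```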
4.2. Application Model
A TS-IoT application comprises a set of (possibly inter-dependent) tasks that interact via data exchanges. A TS-IoT application can be represented as a directed acyclic graph (DAG) $G_{app} = (V, E)$, where $V$ represents the tasks of the TS-IoT application and $E$ represents the data flows between tasks. Each TS-IoT application has a time-bound requirement, which we denote as $T_{bound}$.
A single task of $V$ can be denoted as $v_i$, where $v_i \in V$ and $1 \le i \le n$, where $n$ is the total number of tasks in $V$. Each task can be of two types: stateful tasks and stateless tasks. Stateful tasks require buffering a certain number of data items before processing them. We identify the number of data items required to buffer in a stateful task as the queue size and denote this as $q_i$. Stateless tasks do not require buffering of data items during their data processing; therefore, we consider $q_i$ of stateless tasks to be 1. Furthermore, to identify whether a task is stateful, we define the binary attribute $s_i$: $s_i = 1$ if the task is a stateful task and $s_i = 0$ otherwise. In the current model, we assume that the tasks run continuously; hence, we do not consider any loop variables (i.e., control variables) at this stage.
Each $v_i$ has the following attributes: $req_i$ is the amount of computing resources required for the execution of $v_i$; $pt_i$ denotes the time taken to process the IoT data at a specific computing resource, which depends on the computing resource where the task is executed. A task $v_i$ is also associated with two delays, which we denote as $d^{first}_i$ and $d^{gap}_i$. The time taken to produce the first data item during IoT data processing is denoted by $d^{first}_i$ and the delay between producing data items is denoted as $d^{gap}_i$. We assume that the aforementioned attributes can be obtained by measurements.
A single dataflow of $E$ represents the dataflow (i.e., data transfer) between a predecessor task $v_i$ and a successor task $v_j$, and this can be denoted as $e_{i,j}$, where $e_{i,j} \in E$; $i$ and $j$ denote the indexes of the corresponding tasks. In our model, we assume that data is transferred piece by piece. Each $e_{i,j}$ has the following attributes: $ds_{i,j}$ is the size of a single data piece transferred through $e_{i,j}$; the amount of time to send a single piece of data via a network link is denoted as $dt_{i,j}$.
Figure 3 shows the TS-IoT application graph.
The above model is based on the following assumptions:
We assume that cloud data centers in the IoT environment have a large amount of computing and storage resources [
8], whereas IoT devices and edge computers have limited computing and storage resources.
We assume the required computing resources $req_i$ can be obtained by measurements via executing the corresponding task on a reference computing resource.
We assume the processing time $pt_i$ on a computing resource can be obtained by estimation based on previous measurements.
We assume the environment graph $G_{env}$ is developed by considering the number of computing resources and their networks available in the IoT environment.
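Under the same assumptions, the application model can be sketched in Python as follows; the attribute names mirror the notation above, and the example values for the passenger counting pipeline of Section 3 are illustrative, not taken from the evaluation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    """A TS-IoT application task v_i."""
    name: str
    req_cpu: float          # computing resources required (req_i)
    queue_size: int = 1     # q_i; 1 for stateless tasks
    stateful: bool = False  # s_i

@dataclass
class Dataflow:
    """A dataflow e_{i,j} between a predecessor and a successor task."""
    source: str
    target: str
    data_size: float        # ds_{i,j}: size of a single data piece (MB)

@dataclass
class Application:
    """The application DAG G_app = (V, E) with its time-bound T_bound."""
    tasks: Dict[str, Task] = field(default_factory=dict)
    dataflows: List[Dataflow] = field(default_factory=list)
    time_bound: float = 0.0  # T_bound in seconds

# Illustrative passenger counting pipeline.
app = Application(time_bound=5.0)
for name, cpu, q, st in [("preprocess", 1, 1, False), ("classify", 4, 1, False),
                         ("bus_occupancy", 1, 30, True), ("network_occupancy", 2, 100, True)]:
    app.tasks[name] = Task(name, req_cpu=cpu, queue_size=q, stateful=st)
app.dataflows += [Dataflow("preprocess", "classify", 0.5),
                  Dataflow("classify", "bus_occupancy", 0.001),
                  Dataflow("bus_occupancy", "network_occupancy", 0.001)]
```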
4.3. Problem Formulation
Our objective is to generate an application-specific, time-bound satisfying task execution plan for the IoT environment within the available resources. To realize this, we need to generate a task execution plan in an IoT environment in a manner such that the end-to-end response time of the TS-IoT application is within the time-bound requirement of the application. Furthermore, in this model, we consider TS-IoT application graphs with multiple paths, and to capture this we consider the end-to-end response time of the critical path in the graph. We define this critical path of the application graph as a set of tasks and dataflows, forming a path, for which the end-to-end response time is maximal. We refer to this end-to-end response time of the application as the Total Application Execution Time and denote it as $T_{total}$. Given this definition, we can formulate the following equation:

$$T_{total} = \max_{p \in \{1, \dots, P\}} T_p \quad (1)$$

where $T_p$ is the end-to-end execution time along the path $p$ and $P$ is the total number of paths in $G_{app}$. For any path $p$, we can calculate $T_p$ as the summation of execution times (i.e., summation of data processing time at tasks and delays involved in bringing data to the task, buffering data at tasks, etc.) of each task that is in that path $p$. Given this definition, we obtain the following:

$$T_p = \sum_{k=1}^{n_p} ET^{p}_{k} \quad (2)$$

where $n_p$ is the total number of tasks in the path $p$ and $ET^{p}_{k}$ is the execution time of the $k$-th task in the path $p$ of $G_{app}$. $ET^{p}_{k}$ can be calculated from the following:
$$ET^{p}_{k} = pt_k + \left(q_k \cdot dt_{k-1,k}\right) + d^{first}_{k-1} + \left(q_k \cdot d^{gap}_{k-1}\right) \quad (3)$$

In Equation (3), $pt_k$ is the amount of time taken to process IoT data by $v_k$. $dt_{k-1,k}$ is the amount of time taken to transfer a single data item from the predecessor task $v_{k-1}$ to the task at hand $v_k$ via $e_{k-1,k}$. To capture the total transfer time, we multiply this with the queue size of $v_k$, which we denote as $q_k$. Note that we do not need to consider the maximum over multiple predecessor tasks because we apply this equation on a single path of the graph, and at the end, the critical path is chosen using Equation (1). We assume $dt_{k-1,k}$ to be 0 if the two tasks (i.e., $v_{k-1}$ and $v_k$) are executed in the same computing resource. $d^{first}_{k-1}$ is the time taken to produce the first data item by the predecessor task $v_{k-1}$ and $d^{gap}_{k-1}$ is the delay between producing data items at the predecessor task $v_{k-1}$. For stateful tasks, to capture the total delay, this is multiplied by $q_k$ (i.e., the queue size of the task $v_k$).
$dt_{k-1,k}$ can be calculated using the following:

$$dt_{k-1,k} = \frac{ds_{k-1,k}}{bw_{a,b}} \quad (4)$$

where $ds_{k-1,k}$ denotes the size of a single data piece that needs to be sent to $v_k$ from the predecessor task $v_{k-1}$ via $e_{k-1,k}$ that is placed on network link $l_{a,b}$, and $bw_{a,b}$ is the available bandwidth of the network link $l_{a,b}$.
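To illustrate how Equations (1)–(4) combine, the following Python sketch computes the total application execution time from given per-task attributes; the function names and the example numbers are hypothetical.

```python
def transfer_time(data_size_mb, bandwidth_mbit_s, same_resource):
    """Equation (4): time to move one data piece over the assigned link (0 if co-located)."""
    if same_resource:
        return 0.0
    return (data_size_mb * 8) / bandwidth_mbit_s

def task_execution_time(pt, q, dt, d_first, d_gap):
    """Equation (3): processing time + buffered transfer time + predecessor production delays."""
    return pt + q * dt + d_first + q * d_gap

def path_time(task_times):
    """Equation (2): sum of task execution times along one path."""
    return sum(task_times)

def total_application_time(per_path_task_times):
    """Equation (1): the critical (maximum) path time over all paths of G_app."""
    return max(path_time(times) for times in per_path_task_times)

# Example: two paths through an application graph.
paths = [
    [task_execution_time(0.5, 1, transfer_time(0.5, 20, False), 0.1, 0.01),
     task_execution_time(1.2, 30, transfer_time(0.001, 100, False), 0.05, 0.02)],
    [task_execution_time(0.5, 1, 0.0, 0.1, 0.01)],
]
print(total_application_time(paths))
```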
Decision variables: We define the decision variables that form the task execution plan as follows: the first decision variable $x_{i,k} \in \{0, 1\}$ denotes whether a task $v_i$ is distributed on a computing resource $r_k$. The next decision variable $y^{a,b}_{i,j} \in \{0, 1\}$ denotes whether a dataflow $e_{i,j}$ is placed on a network link $l_{a,b}$.
Constraints: First, the task distribution on computing resources and the dataflow placement on network link resources must not exceed the available resources of those corresponding computing and network resources. A task $v_i$ can be distributed on the computing resource $r_k$ if $cap_k$ is at least equal to or more than $req_i$ of $v_i$. We can formally denote it as follows:

$$\sum_{i=1}^{n} x_{i,k} \cdot req_i \le cap_k, \quad \forall r_k \in R \quad (5)$$
Each network link can only transfer data that is within its available bandwidth, and we can formally denote it as follows:

$$\sum_{e_{i,j} \in E} y^{a,b}_{i,j} \cdot data_{i,j} \le bw_{a,b}, \quad \forall l_{a,b} \in L \quad (6)$$

where $data_{i,j}$ denotes the amount of data transfer between task $v_i$ and $v_j$ via network link $l_{a,b}$, and $y^{a,b}_{i,j}$ is the binary variable denoting whether a dataflow $e_{i,j}$ is placed on the network link $l_{a,b}$.
Regarding the second constraint, TS-IoT applications must satisfy their time-bound requirements. We can formally denote it as follows:

$$T_{total} \le T_{bound} \quad (7)$$
Objective function: The objective of the task distribution problem is to devise a task execution plan in an IoT environment that yields the minimum application execution time while satisfying the time-bound and resource constraints. We formally denote it as follows:

$$\min_{x,\, y} \; T_{total} \quad (8)$$

Subject to: Equations (5)–(7).
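As an illustration of the decision variables and constraints, the following hypothetical feasibility check accepts a candidate task execution plan only if Equations (5)–(7) hold; the dictionary-based representation is an assumption made for this sketch, not the TIDA data model.

```python
def plan_is_feasible(plan, req, cap, flow_data, link_bw, link_of, t_total, t_bound):
    """plan: task -> resource; link_of(res_a, res_b) -> link key or None if co-located."""
    # Constraint (5): per-resource load must not exceed available capacity.
    load = {}
    for task, res in plan.items():
        load[res] = load.get(res, 0.0) + req[task]
    if any(load[res] > cap[res] for res in load):
        return False
    # Constraint (6): per-link traffic must not exceed available bandwidth.
    traffic = {}
    for (u, v), amount in flow_data.items():
        link = link_of(plan[u], plan[v])
        if link is not None:
            traffic[link] = traffic.get(link, 0.0) + amount
    if any(traffic[l] > link_bw[l] for l in traffic):
        return False
    # Constraint (7): total application execution time within the time-bound.
    return t_total <= t_bound
```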
However, this problem is NP-hard to solve via optimization techniques, and to ensure the time-bound requirements of TS-IoT applications in such a volatile IoT environment we need to generate time-bound satisfying task execution plans at the rate at which the IoT environment changes. Existing optimization-based techniques devised for distributed systems are therefore ineffective for this purpose. Hence, we aim to solve this problem using a novel dynamic task distribution and adaptation technique, described in the next section.
Table 1 lists the definitions of all notations we used in the above system model and the problem formulation.
5. Task Management Technique That Includes Task Sizing, Distribution and Adaptation Techniques
In this section, we present three novel techniques we devised for meeting TS-IoT application time-bounds as part of the task management technique. The task sizing technique is presented in
Section 5.1 and its results are used by the task distribution technique that is discussed in
Section 5.2. The task adaptation technique, presented in Section 5.3, adapts the task distribution by redistributing or reconfiguring individual tasks whenever a potential violation of the TS-IoT application's time-bound is detected.
5.1. Task Sizing Technique
The task sizing technique is used for measuring: (1) the computing and network resources needed for the execution of TS-IoT application tasks in the available IoT devices, edge computers, and the cloud; and (2) the execution times of the TS-IoT application tasks. Unlike the existing task sizing techniques found in the literature, which estimate these quantities, the proposed task sizing technique actually executes the TS-IoT application tasks on the available computing resources in the IoT environment and measures the computing and network resources they need and their execution times on the available IoT devices, edge computers, and the cloud. This yields more realistic resource needs and execution times for TS-IoT application tasks than estimations. The task sizing technique is executed whenever the application developer submits a new TS-IoT application to the TIDA platform or whenever there is a change in the underlying IoT environment, such as resources becoming disconnected or connected. In this task sizing technique, each task of the TS-IoT application is executed on every available unique IoT device, edge computer, and cloud resource using sample IoT data submitted by the TS-IoT application developer. The computing and network resources needed for the execution of each TS-IoT application task in the IoT environment are measured and recorded. These resource measurements are then used by the task distribution technique discussed in
Section 5.2, and for training the machine learning (ML) model that is used in the task adaptation technique, which is discussed in
Section 5.3.
The task sizing measurement data mainly captures the resource utilization metrics (e.g., CPU utilization and memory utilization metrics) and execution times of the tasks when the tasks are executed on available computing resources. As shown in
Table 2, the raw task sizing measurement data consisted of 12 features. Unlike earlier task sizing solutions, e.g., [
3], this task sizing technique employs an unsupervised machine learning technique called Principal Component Analysis (PCA) [
45] to extract the most important features from the 12 features of the task sizing measurement data and then store these extracted features in the database. Feature extraction is the process of constructing a new set of features that is more informative and non-redundant than the raw measured data. Feature extraction techniques such as PCA can find correlations among the features in the task sizing measurement data. For example, consider the two features, the average CPU usage and maximum CPU usage, from the task sizing measurement data. These two features could be closely correlated; hence, keeping both of these features will not yield any additional benefit. There may also be many other features that are highly correlated with each other. Therefore, using a feature extraction technique such as PCA, a summarized version (which is more informative and non-redundant) of the original features can be derived from a combination of the original features. After applying the PCA technique on the task sizing measurement data, we were able to derive five new features that better describe the task sizing measurement data.
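A minimal sketch of this feature-extraction step is shown below, assuming the raw task sizing measurements are available as a numeric matrix whose 12 columns correspond to the features of Table 2; scikit-learn's PCA is used here purely for illustration, and the TIDA implementation may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rows: task sizing runs; columns: the 12 raw measurement features of Table 2.
raw_measurements = np.random.rand(200, 12)  # placeholder data

# Standardize, then keep the 5 principal components that summarize the raw features.
scaled = StandardScaler().fit_transform(raw_measurements)
pca = PCA(n_components=5)
extracted_features = pca.fit_transform(scaled)

print(extracted_features.shape)              # (200, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```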
The extracted features are then used to train the machine learning model used in the task adaptation technique. In addition, during the execution of the application, as performed in the task sizing step, we periodically measure the resource utilization and execution times. This monitored data is then used to periodically update the task sizing measurements. This provides continuously updated task sizing measurements and allows the technique to deal with variations in the volume and velocity of IoT data that lead to unexpected demand for computing resources.
Table 2 shows the features of task sizing measurement data.
More in-depth details of this task sizing technique can be found in [
3]. Most of the task sizing techniques in the literature include limited testbed implementations or rely on simulation tools to size the TS-IoT application tasks (i.e., to estimate the amount of resources required for the execution of tasks and the execution time of each task on each resource). However, none of the existing task sizing techniques can effectively deal with (1) variations in the volume and velocity of IoT data that are common in IoT, and (2) the computing resource heterogeneity in the real IoT environment. Unlike other existing task sizing techniques, the technique proposed here measures the actual (i.e., not simulated) computing and network resources required to complete each task of the TS-IoT application at hand on the available computing resources in the IoT environment. Therefore, it provides more accurate estimates of the resources required for each TS-IoT application task and the execution time of each task per unique IoT device, edge computer, and cloud computing resource available.
Section 5.2 discusses the task distribution technique that uses the resource estimates (i.e., task sizing measurements) produced by the task sizing technique. The pseudocode of the task sizing technique is shown in Algorithm 1.
Algorithm 1: Task Sizing Technique
Input: TaskList, ResourceList
Output: MeasuredData
     function SizeTask (TaskList, ResourceList)
1:     foreach resource in ResourceList do
2:       foreach task in TaskList do
3:         Execute task and measure the computing and network resource usage and execution time
4:         Add the measurements to MeasuredData
5:       end foreach
6:     end foreach
7:     return MeasuredData
     end function
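A hedged Python rendering of Algorithm 1 follows. Execution is simulated locally here for the sake of a self-contained example; the actual TIDA platform runs each task remotely on the corresponding resource through its Orleans-based implementation and also records CPU, memory, and network usage alongside the execution time.

```python
import time

def size_tasks(task_list, resource_list, sample_data):
    """Algorithm 1: execute every task on every available resource and record measurements."""
    measured_data = []
    for resource in resource_list:
        for task_name, task_fn in task_list:
            start = time.perf_counter()
            task_fn(sample_data)                      # run the task on sample IoT data
            elapsed = time.perf_counter() - start
            measured_data.append({"task": task_name, "resource": resource,
                                  "execution_time": elapsed})
    return measured_data

# Example with two toy tasks and three resources.
tasks = [("preprocess", lambda d: [x * 2 for x in d]),
         ("aggregate", lambda d: sum(d))]
print(size_tasks(tasks, ["device", "edge", "cloud"], sample_data=list(range(1000))))
```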
5.2. Task Distribution Technique
In this section, we propose a novel task distribution technique that follows a greedy heuristic approach to incrementally solve the task distribution problem and generate a time-bound satisfying task execution plan. As mentioned in
Section 5.1, the greedy task distribution technique utilizes the task sizing measurements produced by the task sizing technique to construct a time-bound satisfying task execution plan for executing each TS-IoT application. The greedy task distribution technique can be re-executed as needed to construct different task execution plans in situations where certain computing resources are disconnected from or connected to the IoT environment after the greedy task distribution technique was previously executed for any TS-IoT application. The pseudocode of the proposed greedy task distribution technique is shown in Algorithm 2.
Algorithm 2: Greedy Task Distribution Technique
Input: TaskList, ResourceList, MeasuredData, T_bound
Output: TaskExecutionPlan <task, resource>
      function GreedyTaskDistribution (TaskList, ResourceList, MeasuredData, T_bound)
01:     Initialize TaskExecutionPlan <task, resource> = null, CanDistribute = true, SortedResourceMap <resource, execution_time> = null, T_total = 0;
02:     while TaskList is not empty AND CanDistribute is true do
03:       task = TaskList.GetItem ();
04:       if IsFirstTask (task)
05:         SortedResourceMap = GetResourcesCloserToDataSource ();
06:       else
07:         SortedResourceMap = MeasuredData.GetData (task);
08:       end if
09:       Initialize TaskPlaced = false, i = 0;
10:       while i < SortedResourceMap.Count AND TaskPlaced is false do
11:         resource = SortedResourceMap.GetItem (i);
12:         if CheckResourceCapacity (task, resource)
13:           TaskExecutionPlan.Add (task, resource);
14:           UpdateAvailableResourceCapacities (resource, ResourceList);
15:           T_total += SortedResourceMap.GetValue (resource);
16:           TaskPlaced = true;
17:         end if
18:         i++;
19:       end while
20:       if TaskPlaced is false OR T_total > T_bound
21:         CanDistribute = false;
22:       end if
23:     end while
24:     if T_total ≤ T_bound AND CanDistribute is true
25:       return TaskExecutionPlan;
26:     else
27:       TaskExecutionPlan = AllocateAllTasksToCloud ();
28:       return TaskExecutionPlan;
29:     end if
      end function
The technique takes the TaskList, ResourceList, MeasuredData, and the application time-bound T_bound as inputs. Then, for each task in the TaskList, the technique finds an eligible computing resource (i.e., one that has enough capacity to fulfill the resources required by the task) that yields the lowest execution time for that task from a sorted resources map. To construct the sorted resources map for the first task in the TaskList, the technique uses only the computing resources that are closer to the IoT data source. To find such resources, the algorithm uses the GetResourcesCloserToDataSource () function. Therefore, the first task of the application is always assigned to a computing resource that is closer to the data source, provided it has enough resource capacity (lines 4–5 in Algorithm 2). Moreover, to construct the sorted resources map for tasks that have predecessor tasks, the algorithm retrieves the tuples of the corresponding task from the measurement table and constructs a sorted resources map using the data in the tuples. The map consists of the resources and the corresponding execution time measured for that task. Furthermore, the map is sorted based on the measured execution times, and we consider that one computing resource can host multiple tasks if it has enough resource capacity (lines 6–8 in Algorithm 2).
Once the sorted resources map is created, the technique iterates through each item in the sorted resources map until it finds an eligible computing resource. When the technique identifies an eligible computing resource, it first assigns that resource to the corresponding task by updating the task distribution map, then updates the available resources of the selected resource, updates the estimated total execution time T_total based on the measured execution time, exits the while loop, and moves to the next task in the task list (lines 10–19 in Algorithm 2). The technique iteratively determines eligible computing resources in a greedy manner (i.e., picks the resource that would yield the lowest execution time) for each task in the task list.
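The core of Algorithm 2 can be summarized in Python roughly as follows; the helper functions are passed in as parameters and are hypothetical stand-ins for the operations named in the pseudocode, and the measurement lookup is assumed to return (resource, execution_time) pairs sorted by execution time.

```python
def greedy_task_distribution(task_list, measured, time_bound,
                             closer_to_source, fits, update_capacity, cloud_plan):
    """Greedy heuristic: assign each task to the eligible resource with the
    lowest measured execution time; fall back to an all-cloud plan if the
    distribution cannot be completed within the time-bound."""
    plan, total_time = {}, 0.0
    for index, task in enumerate(task_list):
        # The first task must stay close to the IoT data source (Algorithm 2, lines 4-5).
        candidates = closer_to_source(measured, task) if index == 0 else measured[task]
        placed = False
        for resource, exec_time in candidates:       # sorted by measured execution time
            if fits(task, resource):                 # enough capacity left on the resource?
                plan[task] = resource
                update_capacity(resource, task)
                total_time += exec_time
                placed = True
                break
        if not placed or total_time > time_bound:
            return cloud_plan(task_list)             # Algorithm 2, lines 26-28
    return plan
```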
It should be noted that the IoT environment can be volatile because, in the real world, the IoT data volume/velocity and the available computing resources often vary. Although the task distribution technique can be executed several times to produce different task execution plans whenever computing resources are disconnected or connected, it is not capable of changing or adapting its current task execution plan to deal with possible time-bound violations caused by varying IoT data volumes and available computing resources. Therefore, the task execution plans constructed by the task distribution technique may not achieve the time-bound requirements of the TS-IoT application due to such volatility. To overcome this drawback, we propose a novel task adaptation technique, which is discussed in
Section 5.3.
5.3. Task Adaptation Technique
The objective of the task adaptation technique is to dynamically adapt the task execution plan of the TS-IoT application to mitigate potential time-bound violations caused by unexpected changes in the IoT device observation volume and velocity, and computing resources becoming unavailable due to the volatility of the IoT environment. To achieve this, the task adaptation technique employs a variation of the XGBoost regression tree model [
46] to periodically predict the application’s total execution time assuming that the application continues to follow its current task execution plan. Next, the predicted total application execution time is compared with the time-bound requirement of the TS-IoT application to assess whether the current task execution plan can meet its time-bound requirement. If the predicted total application execution time is within the time-bound requirement of the TS-IoT application, then the current task execution plan remains unchanged. If the predicted total application execution time exceeds the time-bound of the TS-IoT application, then an alternative task execution plan that can guarantee the time-bound requirements is selected from a set of alternative task execution plans (the creation of alternative execution plans is discussed further in
Section 5.4). More specifically, when selecting an alternative task execution plan, the XGBoost model is used to predict the total application execution times for each of the alternative task execution plans and pick the task execution plan that yields the lowest total application execution time. After selecting an alternative task execution plan, the tasks that were running will be stopped. Then, the TS-IoT application tasks will be redistributed according to this alternative task execution plan and start executing again.
The task adaptation technique continuously trains the XGBoost model in an online manner using the features extracted from the task sizing measurement data (as shown in
Table 2 and discussed in
Section 5.1). XGBoost is a tree-based ensemble machine learning algorithm, which was selected as the basis of the proposed task adaptation technique due to its fast convergence speed. Compared to the other performance prediction models, such as neural networks [
47], decision tree-based models require less training data, less training time, and less parameter tuning. Moreover, the ensemble tree-based algorithms, which combine several decision trees, performed better when compared to single decision tree-based models [
44].
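A minimal sketch of this predictive step is shown below, assuming an XGBoost regressor trained on the extracted PCA features plus an encoding of the task execution plan, with the observed total application execution time as the target; the hyperparameters, feature layout, and placeholder data are illustrative only.

```python
import numpy as np
from xgboost import XGBRegressor

# Training data: PCA features + plan encoding -> observed total execution time (s).
X_train = np.random.rand(500, 6)          # placeholder feature matrix
y_train = np.random.rand(500) * 10        # placeholder execution times

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

def predict_total_time(extracted_features, plan_encoding):
    """Predict the total application execution time for a candidate plan."""
    x = np.hstack([extracted_features, plan_encoding]).reshape(1, -1)
    return float(model.predict(x)[0])

# Adaptation trigger: redistribute if the current plan is predicted to violate the bound.
time_bound = 5.0
if predict_total_time(np.random.rand(5), np.array([1.0])) > time_bound:
    print("Predicted time-bound violation: select an alternative task execution plan")
```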
5.4. Combining the Task Sizing, Distribution, and Adaptation Techniques to Meet TS-IoT Application Time-Bounds
In this section, we outline how the techniques we discussed in
Section 5.1,
Section 5.2 and
Section 5.3 are combined in an integrated task management technique called the Dynamic Task Distribution and Adaptation (DTDA) technique. In summary, DTDA ensures that TS-IoT applications meet their time-bounds as follows:
Step 1: Size the TS-IoT application’s tasks in the available resources (i.e., all available IoT devices, edge computers, and cloud virtual machines) using the task sizing technique (line 4 in Algorithm 3).
Step 2: Train the XGBoost model according to the task adaptation technique using the features extracted from the task sizing measurement data (lines 5–8 in Algorithm 3).
Step 3: Construct all the possible task execution plans for the submitted TS-IoT application. For this purpose, we need to create all the different possibilities for distributing the tasks of the TS-IoT application to the available computing resources. Therefore, we generate all the combinatorial possibilities for distributing tasks of a given TS-IoT application to the available computing resources in the IoT environment. For example, consider an instance where a given TS-IoT application is comprised of three tasks (e.g., $v_1$, $v_2$, and $v_3$) and an IoT environment that has two available computing resources (e.g., $r_1$ and $r_2$). In this step, all the different possible distributions of the tasks $v_1$, $v_2$, and $v_3$ over the two computing resources $r_1$ and $r_2$ will be generated (a minimal enumeration sketch is shown after this list). As we discussed in Section 4, each different task distribution possibility in the IoT environment is considered a task execution plan; thus, we identify all the possible task execution plans for the submitted TS-IoT application. Pre-computing all the possible task execution plans permits the task adaptation technique to quickly pick an alternative task execution plan, i.e., one that will meet the time-bound of the application, from the list of pre-computed plans (line 7 in Algorithm 3).
Step 4: Generate a time-bound satisfying task execution plan (from this point on we refer to this as the current task execution plan) using the greedy task distribution technique, which was discussed in
Section 5.2 (line 9 in Algorithm 3).
Step 5: Distribute the tasks into the IoT devices, edge computers, and/or cloud-based virtual machines according to the task execution plan and start executing the application tasks (line 10 in Algorithm 3).
Step 6: Periodically measure the resource usage and execution times of the tasks to update the task sizing measurement data by (1) extracting the features from them, and (2) using the extracted features to retrain the ML model (lines 13–18 in Algorithm 3).
Step 7: Periodically predict the total application execution time of the task execution plan at hand using the XGBoost model (lines 20–22 in Algorithm 3).
Step 8: Assess whether the task execution plan at hand can guarantee the time-bound requirement of the TS-IoT application using the predicted total application execution time (line 23 in Algorithm 3).
Step 9: If a possible time-bound violation is detected, select an alternative task execution plan from the set of pre-computed task execution plans (lines 24–34 in Algorithm 3).
Step 10: Redistribute the TS-IoT application tasks according to the alternative task execution plan and restart the application execution (line 35 in Algorithm 3).
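As mentioned in Step 3, the candidate plans can be pre-computed by enumerating every task-to-resource assignment. A minimal Python sketch is shown below; note that the number of plans grows as |R|^|V|, so this exhaustive enumeration is only practical for small task and resource counts.

```python
from itertools import product

def all_task_execution_plans(tasks, resources):
    """Enumerate every mapping of tasks to resources (Step 3 of DTDA)."""
    plans = []
    for assignment in product(resources, repeat=len(tasks)):
        plans.append(dict(zip(tasks, assignment)))
    return plans

# Example from Step 3: three tasks, two resources -> 2^3 = 8 candidate plans.
plans = all_task_execution_plans(["v1", "v2", "v3"], ["r1", "r2"])
print(len(plans))  # 8
```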
The pseudocode of the proposed DTDA technique is shown in Algorithm 3.
Algorithm 3: Dynamic Task Distribution and Adaptation Technique (DTDA)
Input: G_app, G_env, T_bound
01:    Initialize TaskList = null, ResourceList = null, TaskExecutionPlanAtHand = null;
02:    TaskList = CreateTaskListFromAppGraph ();
03:    ResourceList = CreateResourceListFromResourceGraph ();
04:    MeasuredData = SizeTask (TaskList, ResourceList); // Use Task Sizing Technique
05:    ExtractedFeatures = ExtractFeatures (MeasuredData); // Extract features from measured data
06:    SaveExtractedFeatures (ExtractedFeatures);
07:    TaskExecutionPlanList = ComputeAllPossiblePlans (TaskList, ResourceList);
08:    TrainMLModel (ExtractedFeatures); // Train ML model using extracted features
09:    TaskExecutionPlanAtHand = GreedyTaskDistribution (TaskList, ResourceList, MeasuredData, T_bound);
10:    DistributeTasks (TaskExecutionPlanAtHand);
11:    NoChangeInIoTEnvironment = True; // This variable will change to false if any resource disconnects or connects to the IoT
12:    while NoChangeInIoTEnvironment is True do
13:      if PeriodicUpdate is true AND PeriodicPredict is false
14:        MeasuredData = MonitorAndMeasure ();
15:        ExtractedFeatures = ExtractFeatures (MeasuredData);
16:        UpdateValuesOfFeatureData (ExtractedFeatures);
17:        TrainMLModel (ExtractedFeatures);
18:      end if
19:      if PeriodicUpdate is false AND PeriodicPredict is true
20:        MeasuredData = MonitorAndMeasure ();
21:        ExtractedFeatures = ExtractFeatures (MeasuredData);
22:        PredictedTotalTime = Predict (ExtractedFeatures, TaskExecutionPlanAtHand);
23:        IsTimeBoundViolation = AssessTimeBoundViolation (PredictedTotalTime, T_bound);
24:        if IsTimeBoundViolation is true
25:          Initialize FoundNewPlan = False, i = 0;
26:          while i < TaskExecutionPlanList.Count AND FoundNewPlan is false do
27:            TaskExecutionPlan = TaskExecutionPlanList.GetItem (i); i++;
28:            PredictedTime = Predict (ExtractedFeatures, TaskExecutionPlan);
29:            if PredictedTime ≤ T_bound
30:              TaskExecutionPlanAtHand = TaskExecutionPlan;
31:              FoundNewPlan = True;
32:            end if
33:          end while
34:        end if
35:        DistributeTasks (TaskExecutionPlanAtHand);
36:      end if
37:    end while
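Finally, the periodic predict-and-adapt portion of Algorithm 3 (lines 19–35) can be condensed as follows; monitor(), extract_features(), predict(), and redistribute() are hypothetical placeholders for the corresponding TIDA components.

```python
def adaptation_cycle(current_plan, candidate_plans, time_bound,
                     monitor, extract_features, predict, redistribute):
    """One periodic prediction step of DTDA: keep the current plan if it is
    predicted to meet the time-bound, otherwise switch to the first
    pre-computed plan that is predicted to satisfy it."""
    features = extract_features(monitor())
    if predict(features, current_plan) <= time_bound:
        return current_plan                       # no adaptation needed
    for plan in candidate_plans:                  # Algorithm 3, lines 26-33
        if predict(features, plan) <= time_bound:
            redistribute(plan)                    # Algorithm 3, line 35
            return plan
    return current_plan                           # no better plan found; keep running
```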
8. Conclusions and Future Work
In this paper, we extended our previous research presented in [
3] and proposed a complete solution, called TIDA, for meeting the time-bound requirements of TS-IoT applications. The proposed TIDA platform uses a novel task management technique, called DTDA, which combines three novel techniques: task sizing, task distribution, and task adaptation. The task sizing technique measures the computing and network resources required to complete each TS-IoT application task in each available computing resource in the IoT environment, whereas the task distribution technique utilizes the measurements of task sizing to create time-bound satisfying task execution plans to distribute and execute TS-IoT application tasks in the IoT environment. Finally, the task adaptation technique utilizes a machine learning model to accurately and periodically predict possible time-bound violations and, in the case of a time-bound violation, dynamically adapts the task distribution by redistributing tasks according to an alternative task execution plan.
We described a proof-of-concept implementation of the TIDA platform that implements the above techniques using Microsoft’s Orleans framework. We evaluated TIDA by developing a passenger counting IoT application, executing the application in a cloud-based testbed under four TS-IoT application management techniques, and assessing how well each of these techniques enables the application to meet its time-bound requirements. The results showed that the dynamic task distribution and adaptation (DTDA) technique of TIDA (discussed in
Section 5), on average, improves the time-bound violation ratio by 43.34%, compared to the greedy technique, which is the baseline task management technique of the TIDA platform. Moreover, the evaluation demonstrated TIDA’s ability to adapt to the varying volume of IoT data by dynamically adapting the task execution plans to deal with possible time-bound violations.
Although the current implementation of the DTDA technique is capable of dynamically adapting its task execution plans, during the adaptation step, the technique iterates through a set of pre-computed task execution plans to select an alternative time-bound satisfying plan, and then redistributes all the tasks into the computing resources according to the new plan. This step is neither efficient nor scalable because it may cause a significant delay when the number of tasks and computing resources increases. Moreover, in this research, for the evaluation, we considered only one TS-IoT application. However, in the real world, there can be multiple IoT applications competing for the computing and networking resources in the IoT environment, and resources often connect and disconnect from the IoT environment. Therefore, in our future work, we aim to: (1) improve the DTDA technique to be more efficient and scalable; (2) conduct more experiments in a volatile IoT environment with multiple IoT applications; (3) compare the TIDA platform’s ability to meet time-bound requirements with existing solutions; and (4) improve the existing machine learning technique by incorporating federated machine learning techniques [
56]. Moreover, we plan to explore and incorporate (1) IoT services search and discovery techniques [
57], (2) digital twin capabilities [
58], and (3) contextualization [
59] and approximation approaches [
60] into the TIDA platform to make it a comprehensive IoT solution.