Securing Real-Time Internet-of-Things

Modern embedded and cyber-physical systems are ubiquitous. Many critical cyber-physical systems have real-time requirements (e.g., avionics, automobiles, power grids, manufacturing systems, industrial control systems, etc.). Recent developments and new functionality require real-time embedded devices to be connected to the Internet. This gives rise to the real-time Internet-of-things (RT-IoT), which promises a better user experience through stronger connectivity and more efficient use of next-generation embedded devices. However, RT-IoT systems are also increasingly becoming targets for cyber-attacks, a trend exacerbated by this increased connectivity. This paper gives an introduction to RT-IoT systems, an outlook on current approaches, and possible research challenges towards secure RT-IoT frameworks.


Introduction
Nowadays, smart embedded devices (e.g., surveillance cameras, home automation systems, smart TVs, in-vehicle infotainment systems, etc.) are connected to the Internet. This rise of the Internet-of-things (IoT) links together devices and applications that were previously isolated. At the same time, embedded devices with real-time properties (e.g., strict timing and safety requirements) require interaction between the cyber and physical worlds. These devices are used to monitor and control physical systems and processes in many domains. Examples include manned and unmanned vehicles such as aircraft, spacecraft, unmanned aerial vehicles (UAVs), and self-driving cars; critical infrastructures; process control systems in industrial plants; and smart technologies (e.g., electric vehicles, medical devices, etc.), to name just a few. Given the drive towards remote monitoring and control, these devices are increasingly interconnected, often via the Internet, giving rise to the Real-Time Internet-of-things (RT-IoT). Many of these systems have to meet stringent safety and timing requirements. Any problem that disrupts their normal operation could therefore damage the system or the environment, or pose a threat to human safety. Several factors are making cyber-security a design priority for such systems: the drive towards remote monitoring and control facilitated by the growth of the Internet, the rise in the use of commercial-off-the-shelf (COTS) components and standardized communication protocols, and the high value of these systems to adversaries. Security breaches are not uncommon in critical IoT applications. Consider the recent spate of IoT-centric attacks (e.g., the Mirai botnet, attacks on the Dyn DNS provider, and DoS attacks from IoT devices [1,2]). Others have targeted safety-critical systems (e.g., Stuxnet [3], BlackEnergy [4], and attack demonstrations by researchers on automobiles [5,6] and medical devices [7]).
Because of their critical nature, successful cyber-attacks against such systems could lead to problems more serious than loss of data or availability [6,8]. Attacks on one or more of these systems can have catastrophic results, including loss of life or injury to humans and damage to the system and even the environment.
Figure 1. An overview of RT-IoT in everyday life. Dotted lines and radiate symbols indicate the wireless connectivity supported by the devices. Each RT-IoT device executes periodic real-time tasks (e.g., τ_{i,j} denotes the jth activation of task τ_i) required for safe operation of the physical system.

Stringent Timing/Safety Requirements and Resource Constraints
Many RT-IoT devices (e.g., sensors, controllers, UAVs, autonomous vehicles, etc.) have severely limited resources (e.g., memory, processor, battery, etc.) and often require control tasks to complete within a few milliseconds [15]. Apart from functional correctness, RT-IoT nodes also require that temporal properties be met. These temporal properties are often expressed as deadlines, and the usefulness of the results produced by the system drops once a deadline passes. If the usefulness drops sharply, the system is referred to as a hard real-time system (e.g., avionics, nuclear power plants, anti-lock braking systems in automobiles, etc.). If it drops more gradually, it is referred to as a soft real-time system (e.g., multimedia streaming, automated windshield wipers, etc.) [16].
• Implemented as a system of periodic/sporadic tasks
• Stringent timing requirements
• Worst-case bounds are known for all loops
• No dynamically loaded or self-modifying code
• Recursion is either not used or statically bounded
• Memory and processing power are often limited
• Communication flows with mixed timing criticality

Heterogeneous Communication Traffic
Many conventional real-time systems (RTS) consist of several independently operating nodes with limited or no communication capabilities. With the emergence of RT-IoT, however, cyber-physical nodes not only communicate over closed industrial communication networks but are also often connected via the Internet. Most real-time applications need to trigger events based on specific data conditions. A real-time communication channel with guaranteed QoS (e.g., throughput and data processing requirements, delay guarantees, etc.) is therefore also necessary to support such applications [17,18].
RT-IoT systems also often carry traffic flows with mixed criticality, i.e., flows with varying timing (and perhaps even bandwidth and availability) requirements:
(a) High-priority/criticality traffic is essential for the correct and safe operation of the system. Examples include sensor data for closed-loop control and actual control commands in avionics, automotive or power grid systems, as well as security systems in home automation.
(b) Medium-criticality traffic is critical to the correct operation of the system but tolerates some delays, packet drops, etc. Examples include navigation systems in aircraft, system monitoring traffic in power substations, and messages exchanged between electric vehicles and the power grid or a home charging station. Traffic related to home automation equipment also falls here, e.g., water sprinklers, heating, air conditioning, lighting devices, and food preparation appliances.
(c) Low-priority traffic is essentially all other traffic in the system that does not need delay or bandwidth guarantees. Examples include engineering traffic in power substations, multimedia flows in aircraft, and notification messages from smart home equipment.
In many safety-critical RT-IoT systems, the properties of all high-priority flows are typically well known, while the number and properties of other flows can be more dynamic. Consider on-demand video in a commercial aircraft, where new flows arise and old ones stop based on the viewing patterns of passengers.
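One simple way to honor the three criticality classes at a shared link is a strict-priority queue. The minimal sketch below serves high before medium before low and keeps FIFO order within each class. The class constants, the `StrictPriorityLink` name and the packet strings are hypothetical illustrations. This is not a mechanism proposed in the text.

```python
import heapq
from itertools import count

# Hypothetical criticality classes; a lower value is served first.
HIGH, MEDIUM, LOW = 0, 1, 2


class StrictPriorityLink:
    """Strict-priority egress queue: FIFO within a class, HIGH > MEDIUM > LOW."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order within a class

    def enqueue(self, criticality, packet):
        heapq.heappush(self._heap, (criticality, next(self._seq), packet))

    def dequeue(self):
        if not self._heap:
            return None
        _, _, packet = heapq.heappop(self._heap)
        return packet


link = StrictPriorityLink()
link.enqueue(LOW, "cabin video frame")
link.enqueue(HIGH, "flight-control command")
link.enqueue(MEDIUM, "navigation update")
link.enqueue(HIGH, "sensor sample")
order = [link.dequeue() for _ in range(4)]
# order: flight-control command, sensor sample, navigation update, cabin video frame
```

Strict priority can starve low-priority flows under sustained high-priority load. A real deployment would likely need rate limiting or bandwidth reservation on top of it.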

Real-Time Scheduling Model
Many such systems are implemented as a set of periodic tasks or sporadic tasks [19] (Chapter 1) and [20]. A periodic task has a fixed temporal separation between consecutive instances. A sporadic task can make an execution request at any time, but with a minimum inter-invocation interval. For instance, a sensor management task that monitors the conveyor belt in a manufacturing system is naturally periodic, whereas tasks that monitor the arrival of automated cars at traffic intersections are sporadic. Another example is an engine control unit (ECU) in a modern vehicle. There, the task that controls the valve in the electronic throttle body (ETB) is periodic, while the task that handles commands from the in-vehicle computer is sporadic. Application tasks in RT-IoT nodes are often designed based on the Liu and Layland model [21,22]. This model contains a set of tasks Γ, where each task τ_i ∈ Γ is characterized by the parameters (C_i, T_i, D_i):
• C_i is the worst-case execution time (WCET);
• T_i is the period or minimum inter-arrival time;
• D_i is the deadline, with D_i ≤ T_i.
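The (C_i, T_i, D_i) model maps directly onto a small record type. The sketch below encodes it and computes two quantities that schedulability analyses commonly use: total utilization Σ C_i/T_i and the hyperperiod (the least common multiple of the periods, meaningful for strictly periodic tasks). The task names and numbers are hypothetical stand-ins for the ETB-control and command-handler tasks mentioned above.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import lcm  # Python 3.9+


@dataclass(frozen=True)
class Task:
    name: str
    C: int  # worst-case execution time (WCET)
    T: int  # period or minimum inter-arrival time
    D: int  # relative deadline, constrained so that D <= T

    def __post_init__(self):
        if not (0 < self.C <= self.D <= self.T):
            raise ValueError(f"{self.name}: need 0 < C <= D <= T")


def utilization(tasks):
    """Total processor utilization sum(C_i / T_i), kept exact with Fraction."""
    return sum(Fraction(t.C, t.T) for t in tasks)


def hyperperiod(tasks):
    """Least common multiple of all periods (for strictly periodic tasks)."""
    return lcm(*(t.T for t in tasks))


# Hypothetical ECU task set (times in ms).
ecu = [Task("etb_control", 1, 5, 5), Task("cmd_handler", 2, 20, 15)]
```

For this set, `utilization(ecu)` is 3/10 and `hyperperiod(ecu)` is 20.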
In the multicore context, real-time task scheduling can be viewed as an allocation problem (i.e., deciding on which processor a task should execute) subject to design criteria [23]. Three common options are:
(a) With no migration, tasks are statically allocated to a processor and never migrate.
(b) With task-level migration, different jobs of a task may execute on different cores, but each job executes on only a single core.
(c) With job-level migration, a single job may migrate between cores, but parallel execution of a job is not permitted.
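The no-migration option is often realized as bin packing: each task is assigned to one core before run-time. The sketch below uses first-fit decreasing on utilization. It assumes, for illustration only, implicit deadlines (D_i = T_i) and EDF on each core, so a core stays schedulable while its utilization is at most 1. The `first_fit_partition` name and the task tuples are hypothetical. The text does not prescribe a particular partitioning heuristic.

```python
from fractions import Fraction


def first_fit_partition(tasks, n_cores):
    """Statically assign (name, C, T) tasks to cores with no migration.

    Assumption: implicit deadlines and EDF per core, so a core is
    schedulable while its total utilization is <= 1. Returns a list of
    task-name lists (one per core), or None if some task does not fit.
    """
    cores = [[] for _ in range(n_cores)]
    load = [Fraction(0)] * n_cores
    # First-fit decreasing: place the heaviest tasks first.
    for name, C, T in sorted(tasks, key=lambda t: Fraction(t[1], t[2]), reverse=True):
        u = Fraction(C, T)
        for k in range(n_cores):
            if load[k] + u <= 1:
                cores[k].append(name)
                load[k] += u
                break
        else:
            return None  # heuristic failed to place this task
    return cores


tasks = [("a", 3, 5), ("b", 2, 5), ("c", 4, 10), ("d", 1, 10)]
```

With two cores, `first_fit_partition(tasks, 2)` yields `[["a", "b"], ["c", "d"]]`. With one core it returns `None`, since the total utilization is 1.5.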
Schedulability tests [23][24][25][26] are used to determine whether all tasks in the system meet their respective deadlines. If they do, the task set is deemed "schedulable" and the system safe.
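A widely used exact test for fixed-priority, preemptive scheduling on one core with constrained deadlines is response-time analysis. It iterates R_i = C_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ C_j until the value converges or exceeds D_i. The minimal sketch below assumes synchronous release and tasks listed from highest to lowest priority. It is one standard test, not necessarily the one used in [23]-[26].

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def response_times(tasks):
    """Response-time analysis for fixed-priority preemptive scheduling.

    tasks: list of (C, T, D), ordered from highest to lowest priority.
    Returns the worst-case response time of each task, or None for a task
    whose response time exceeds its deadline (task set not schedulable).
    """
    results = []
    for i, (C_i, _, D_i) in enumerate(tasks):
        R = C_i
        while True:
            # Interference from every higher-priority task released within R.
            R_next = C_i + sum(ceil_div(R, T_j) * C_j for C_j, T_j, _ in tasks[:i])
            if R_next > D_i:
                results.append(None)
                break
            if R_next == R:
                results.append(R)
                break
            R = R_next
    return results
```

For the rate-monotonic set (1, 4, 4), (2, 6, 6), (3, 12, 12), the analysis gives response times 1, 3 and 10, all within their deadlines. For (2, 4, 4), (3, 6, 6), the second task misses its deadline.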

CPU Architectures and System Development Model
Even though most RT-IoT applications are designed using platforms equipped with a single-core CPU, a trend towards multicore systems is visible, as many COTS devices are nowadays built on multicore processors [23]. For some specific applications (e.g., avionics systems), regulations restrict the use of additional cores. In such cases, the additional cores that do not execute real-time or safety-critical tasks can be used to provide layers of security to the system. We have leveraged multicore platforms in the real-time domain to develop security solutions [27][28][29][30][31][32], as discussed in Section 4.1.
It is also common for multiple vendors to be involved in the development of RT-IoT systems. Such a system is said to be developed under the multi-vendor development model [33]. In this model, each vendor designs/controls several separate tasks. Figure 2 shows an electronic control unit (ECU) for an avionics system (on an unmanned aerial vehicle) that uses the multi-vendor development model. While this example focuses on the avionics domain, other RT-IoT systems (e.g., automotive, home automation, etc.) could also be built using a similar (albeit loosely defined) model.
Figure 2. A high-level design of a UAV that exemplifies the multi-vendor development model. In this demonstrative system, three vendors are involved in building the ECU system: Vendor 1 provides tasks that process image data from a surveillance camera attached to the ECU; Vendor 2 is in charge of flight control tasks interacting with the UAV; and the Integrator handles communication between the system and a base station.

the system. This information may later be used by the attacker to launch further attacks. In the rest of this section, we summarize the common attack surfaces for RT-IoT systems.

Integrity Violation with Malicious Code Injection
An intelligent adversary can get a foothold in the system. For example, an adversary may insert a malicious task that respects the real-time guarantees of the system (to avoid immediate detection) and/or compromise one or more existing real-time tasks. The attacker may use such a task to manipulate sensor inputs and actuator commands, for instance, and/or modify system behavior in undesirable ways. Conceptually, integrity violation through code injection consists of two steps [34]. First, the attacker sends instruction snippets (e.g., a valid machine code program) to the device, and the software application receiving them stores them somewhere in memory. Such instruction snippets are referred to as gadgets. In the second step, the attacker triggers a vulnerability in the application software (e.g., the real-time OS (RTOS) or task code) to divert the control flow. Since the instruction snippets represent a valid machine code program, the malicious code is executed when program execution jumps to the start address of the injected data. As we illustrate in Section 4, our recent solutions [27][28][29][30][31][32][35][36][37] can detect integrity violations through a combination of hardware and software mechanisms.

Side-Channel Attacks
The adversary may learn important information through side-channel or covert-channel attacks [38], simply by lodging itself in the system and extracting sensitive information. A side-channel attack exploits previously unknown channels to acquire useful information from the victim. Typical side channels used by attackers include memory/cache access times [39], power consumption traces [40], schedule preemptions [41], electromagnetic (EM) emanations [42], and temperature [43]. These attack surfaces are particularly applicable to RT-IoT nodes that execute real-time tasks, because of the deterministic behavior of such systems. A demonstrative cache-timing attack is presented in Section 3.2.2. Section 4.2.1 then illustrates our recent approaches [33,44] to mitigating information leakage through timing-based attacks on storage channels.

Attacks on Communication Channels
RT-IoT makes the Internet the main communication medium between physical entities. However, the Internet is an insecure communication medium. It introduces a variety of vulnerabilities that may put the security and privacy of RT-IoT systems at risk. Threats to communication include eavesdropping or interception, man-in-the-middle attacks, and falsification, tampering or repudiation of control/information messages [45]. From the perspective of RT-IoT, defending against communication threats is not an easy task. It is challenging to distinguish rogue traffic from legitimate traffic (especially for critical/high-priority flows) without degrading QoS (e.g., bandwidth and end-to-end delay constraints). Threats to communication are usually dealt with by integrating cryptographic protection mechanisms [46,47]. However, this increases the WCET of the real-time tasks and may require modification of existing schedulers. Many cryptographic operations are also computationally expensive, especially given the limited resources of embedded RT-IoT devices. Therefore, existing cryptographic approaches may not be a preferable option for many RT-IoT systems. In Section 4.2.3, we illustrate a solution for integrating security mechanisms that can also deal with communication threats but does not require modification of existing real-time tasks.
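To make the WCET cost concrete, the sketch below shows one common cryptographic protection for control messages: an HMAC-SHA256 tag over each command, verified with a constant-time comparison. Every job that sends or receives such a frame pays for the tag computation, which is exactly the WCET inflation discussed above. The message layout, key and function names are hypothetical. A real system would also need key management and a receiver-side check that sequence numbers increase, to stop replays.

```python
import hashlib
import hmac
import struct

KEY = b"example-shared-key"        # hypothetical pre-shared key
PAYLOAD = struct.Struct(">Id")     # hypothetical layout: uint32 sequence number, float64 setpoint
TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def pack_command(seq, setpoint):
    """Serialize a control command and append an HMAC-SHA256 tag."""
    payload = PAYLOAD.pack(seq, setpoint)
    return payload + hmac.new(KEY, payload, hashlib.sha256).digest()


def unpack_command(frame):
    """Return (seq, setpoint) if the tag verifies, else None."""
    if len(frame) != PAYLOAD.size + TAG_LEN:
        return None
    payload, tag = frame[:PAYLOAD.size], frame[PAYLOAD.size:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # tampered or forged message
    return PAYLOAD.unpack(payload)


frame = pack_command(7, 1.5)
```

`unpack_command(frame)` returns `(7, 1.5)`. Flipping any bit of the frame makes it return `None`.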

Denial-of-Service (DoS) Attacks
Due to resource constraints (e.g., limited memory and computation resources) and stringent timing requirements, RT-IoT nodes are vulnerable to DoS attacks. The attacker may take control of the real-time task(s) and exhaust system-level resources (e.g., CPU, disk, memory, etc.). A more severe type of DoS attack is the distributed denial-of-service (DDoS) attack, where many malicious/compromised nodes simultaneously attack the physical plant. In particular, when critical tasks are scheduled to run, an attacker may capture I/O or network ports and perform network-level attacks to tamper with the confidentiality and integrity (i.e., safety) of the system. Again, defense mechanisms developed for generic IT or embedded systems do not consider the timing, safety and resource constraints of RT-IoT, and they are not easily adapted without significant modifications. As described in Sections 4.1.4 and 4.2.3, our recent work [32,35,36,37] may be used to defend against DoS attacks.
For any of these attacks to succeed, however, the attacker first needs to carry out reconnaissance. We illustrate this step next by demonstrating an attack mechanism.

Reconnaissance: Attack Preparation
Reconnaissance, essentially, is the first step for launching other successful attacks and, at the very least, the attacker gains important information about the system's internals.

ScheduLeak
In initial work [48], we developed an algorithm, "ScheduLeak", to show the feasibility of a schedule-based side-channel attack. It targets real-time embedded systems built with the multi-vendor development model introduced in Section 2.4. The adversary could be one of the vendors or an attacker who compromises a vendor. The ScheduLeak algorithm uses an observer task with the lowest priority in the victim system to observe busy intervals. A "busy interval" is a block of time during which one or more tasks are executing. Measuring or observing the busy intervals alone does not tell an adversary which tasks are running when.
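A lowest-priority observer runs only when no other task is ready. One simple way for it to record busy intervals is therefore to read a clock in a tight loop and treat any unusually large gap between consecutive readings as time during which it was preempted. The sketch below shows only this gap-detection idea, under that assumption. The `busy_intervals` name and the sample readings are hypothetical, and this is not the implementation from [48].

```python
def busy_intervals(timestamps, gap_threshold):
    """Infer busy intervals from a lowest-priority observer's clock readings.

    The observer reads the clock in a tight loop whenever it runs. A gap
    larger than gap_threshold between two consecutive readings means that
    higher-priority work preempted it, i.e., the processor was busy
    between those two readings.
    """
    intervals = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > gap_threshold:
            intervals.append((prev, curr))
    return intervals


readings = [0, 1, 2, 7, 8, 9, 10, 15, 16]  # hypothetical observer clock samples
```

Here `busy_intervals(readings, 1)` reports `[(2, 7), (10, 15)]`. Note that it cannot tell which task ran inside each interval.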
The ScheduLeak algorithm can be represented as a function R(Γ, W) = J. Here Γ is the victim task set, W is a set of observed busy intervals, and J is the inferred schedule information, which can be used to pinpoint the possible start time of any particular victim task. This function is illustrated in Figure 3. Using the ScheduLeak algorithm, an attacker can deconstruct the observed busy periods into their constituent jobs (with up to 99% success rate if tasks have fixed execution times) and precisely pinpoint the instant when a task is scheduled.
Figure 3. An example of the schedules produced from a task set of three tasks [48]. The ScheduLeak algorithm can recover the precise schedules from the observed busy intervals. (a) Busy intervals observed by the attacker's observer task. (b) Schedules reconstructed by the ScheduLeak algorithm.
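One observation helps explain why busy intervals leak a task's timing. At every arrival instant φ + kT of a periodic victim task, the victim is ready, so the processor must be busy. Folding the busy timeline modulo the victim's period T and keeping only the offsets that are busy in every period therefore narrows down the arrival phase φ. The sketch below is a deliberately simplified, hypothetical illustration of that idea on a discretized timeline. It is not the full ScheduLeak algorithm from [48]. `busy_ticks` simulates any work-conserving single-core scheduler, and the (period, phase, wcet) task tuples are made up.

```python
def busy_ticks(tasks, horizon):
    """Busy/idle timeline of a work-conserving single-core scheduler.

    tasks: list of (period, phase, wcet) in integer ticks. Busy intervals
    do not depend on priorities, only on the released workload.
    """
    backlog, busy = 0, []
    for t in range(horizon):
        for period, phase, wcet in tasks:
            if t >= phase and (t - phase) % period == 0:
                backlog += wcet
        busy.append(backlog > 0)
        if backlog > 0:
            backlog -= 1
    return busy


def candidate_phases(busy, period):
    """Offsets (mod period) at which the processor is busy in every period.

    The victim's true arrival phase is always among them, because the CPU
    cannot be idle while a released job is pending.
    """
    n_periods = len(busy) // period
    return [p for p in range(period)
            if all(busy[p + k * period] for k in range(n_periods))]


victim, other = (10, 3, 2), (7, 0, 1)  # hypothetical tasks
timeline = busy_ticks([victim, other], 700)
```

For this example, `candidate_phases(timeline, 10)` returns `[3, 4]`. The earliest candidate, 3, is the victim's true phase. Offset 4 survives because the victim is always still executing one tick after its arrival.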

Targeted Attacks
Side-channel attacks become more effective when combined with the reconnaissance step introduced above. Take the demonstrative ECU system introduced in Section 2.4, and assume that code inserted into Vendor 2's tasks wants to determine whether the surveillance camera controlled by the I/O Operation Task is enabled. The attacker can run the ScheduLeak algorithm to infer the exact start times of the I/O Operation Task. It can then carry out a cache-timing attack to gauge cache usage while that task is scheduled. Figure 4 shows the result of such a cache-timing attack. Because ScheduLeak reveals when the I/O Operation Task is scheduled to execute, the attacker probes the cache only when the task is active. The result indicates that the attacker is able to identify the instants when the camera is on (i.e., when the I/O Operation Task processes large amounts of data).
Figure 4. A demonstration of a cache-timing attack [48]. The x-axis shows sample points and the y-axis shows both the cache usage inference (round dots) and the actual memory usage (solid line). It shows that a successful cache-timing attack can precisely infer the memory usage of the victim task.

Securing RT-IoT: Host-Based Approaches
In what follows, we summarize our initial attempts to provide security in RT-IoT nodes. We refer to these approaches as host-based solutions since they primarily focus on securing an individual RT-IoT node. These approaches fall into two major classes: (i) solutions that require custom hardware support to provide security and (ii) solutions at the scheduler/software level that do not require any architectural modifications. Table 2 summarizes these security mechanisms for RT-IoT systems.
Table 2. Summary of Security Solutions for RT-IoT.

References | Approach | Attack Surface | Overhead/Costs
Simplex-based security [27][28][29][30][31] | Use a verified/secure hardware module to monitor system behavior (e.g., timing [28] and execution pattern [27], memory access [29], system call usage [30], control flow [31]) | Code injection attacks | Requires custom hardware or a monitoring unit
Security by platform-level reset [32,49] | Periodically and/or asynchronously (e.g., upon detection of malicious activity) restart the platform and load an uncompromised OS image | Code injection, side-channel and DoS attacks | Extra hardware to ensure safety during periodic/asynchronous restart events
Cache flushing [33,44] | Flush the shared medium (e.g., cache) between consecutive executions of high-priority (security-sensitive) and low-priority (potentially vulnerable) tasks | Side-channel (cache) attacks | Overhead of cache flushing reduces task-set schedulability
Schedule randomization [50] | Randomize the task execution order (i.e., schedule) to reduce predictability | Side-channel attacks | Extra context switches
Security task integration for legacy RT-IoT [35,37] | Execute monitoring/intrusion detection tasks with a priority lower than the real-time tasks to preserve the real-time task parameters (e.g., period, WCET and execution order) | |

Security with Hardware Support
Our key idea for providing security without compromising the safety of the physical system builds on the Simplex framework [51]. Simplex is a well-known real-time architecture. It uses a minimal, verified controller (i.e., the safety controller) as a backup when the complex, high-performance controller (i.e., the complex controller) is unavailable or malfunctioning. The goal of Simplex is to guarantee that the physical system remains safe even though a complex controller is in control. We have used the idea of Simplex in the context of RT-IoT security [27][28][29][30][32].
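The Simplex switching logic can be illustrated with a toy decision module. It forwards the complex controller's command only if the predicted next state stays inside a safety envelope, and otherwise falls back to the simple safety controller. The sketch below assumes a hypothetical one-dimensional plant x' = x + u·DT and a one-step prediction. Real Simplex designs instead check membership in a verified recoverable region (e.g., a Lyapunov level set). All names and constants are illustrative.

```python
X_MAX = 1.0  # hypothetical safety envelope: |x| <= X_MAX
DT = 0.01    # assumed control period in seconds


def safety_controller(x):
    """Simple, verifiable control law that always pushes the state toward 0."""
    return -5.0 * x


def decision_module(x, u_complex):
    """Choose between the complex and safety controllers (Simplex-style).

    Forwards the complex controller's command only if the predicted next
    state of the assumed plant x' = x + u * DT stays inside the envelope.
    """
    x_next = x + u_complex * DT
    if abs(x_next) <= X_MAX:
        return u_complex, "complex"
    return safety_controller(x), "safety"
```

`decision_module(0.0, 10.0)` keeps the complex command. `decision_module(0.99, 10.0)` switches to the safety controller, because the predicted state 1.09 would leave the envelope.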
The key concept of a Simplex-based architecture for security is to use a minimal, simple subsystem (e.g., a trusted core) to monitor the properties (i.e., timing behavior [27,28], memory access [29], system call trace [30], behavioral anomalies [32], etc.) of an untrusted entity (e.g., a monitored core). The untrusted entity is designed for more complex tasks and/or exposed to less secure media (e.g., network, Internet, I/O channels, etc.).
As mentioned in Section 2, the worst-case, best-case and average-case behaviors of most RT-IoT nodes are calculated ahead of time to ensure that all resource and schedulability requirements will be met during system operation. S3A [27] uses this knowledge of the system's deterministic execution profile to detect violations of the predicted (i.e., uncompromised) system behavior. S3A is one of our earliest efforts to use an additional trusted hardware component (FPGA-based, in this case) to monitor the behavior (e.g., execution time and period) of a real-time control application running on an untrustworthy main system. The goal of this Simplex-based architecture is to detect an infection as quickly as possible and then ensure that the physical system components always remain safe. We used an FPGA-based implementation with an inverted pendulum (IP) as the physical plant. With this setup, we demonstrated that S3A can detect intrusions in less than 6 µs without violating the safety requirements of the actual plant.
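The kind of check a timing monitor can perform against an offline profile is easy to sketch. The example below flags a job whose measured execution time falls outside the [BCET, WCET] range, or whose release deviates from the expected period by more than an allowed jitter. The parameter names and values are hypothetical, and the code only illustrates the principle. S3A itself performs monitoring in FPGA hardware.

```python
def check_timing(jobs, bcet, wcet, period, jitter):
    """Flag jobs whose timing deviates from the offline profile.

    jobs: list of (release_time, execution_time) for one periodic task.
    Returns the indices of suspicious jobs.
    """
    alerts = []
    for k, (release, exec_time) in enumerate(jobs):
        if not (bcet <= exec_time <= wcet):
            alerts.append(k)  # execution time outside the profiled bounds
        elif k > 0 and abs(release - jobs[k - 1][0] - period) > jitter:
            alerts.append(k)  # release does not match the expected period
    return alerts


trace = [(0, 3), (10, 4), (20, 9), (31, 3), (45, 3)]  # hypothetical observations
```

With `bcet=2`, `wcet=5`, `period=10` and `jitter=1`, jobs 2 (execution time 9) and 4 (released 14 ticks after its predecessor) are flagged.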

SecureCore Framework
As illustrated in Figure 5, the SecureCore architecture uses the redundancy in multicore chips to create a trusted entity (e.g., a "secure" core). This entity continuously monitors the behavior (e.g., code execution pattern [28], memory usage [29], system call trace [30]) of a real-time application running on an untrustworthy entity (e.g., the monitored core). The SecureCore is protected by hypervisor-based approaches (e.g., memory-region isolation and I/O device consolidation).
The secure monitor (a software process) in the SecureCore uses the on-chip hardware monitoring unit to observe the states (e.g., I/O activities, memory usage, etc.) of the monitored cores and checks the system behavior at runtime.
Figure 5. An illustration of the SecureCore framework. The trusted core is used to monitor the behavior of the complex (and potentially vulnerable) core used for executing application/control tasks.
The initial SecureCore architecture [28] uses a statistical learning-based mechanism to profile the correct execution behavior of the target system. It uses these profiles to detect malicious code execution. The profiled distribution of legitimate executions gives the probability P(e) of an execution instance e. The secure monitor compares P(e) with a predefined minimum required probability θ: if P(e) falls below this threshold (i.e., P(e) < θ), the execution instance is considered malicious. The SecureCore framework was also extended [29] to profile memory behavior, represented as a memory heat map (MHM). An MHM records how many times each memory region was accessed during a time interval. We proposed machine learning algorithms to characterize the information contained in MHMs and then detect deviations from normal memory behavior patterns. We have also extended the SecureCore architecture to detect anomalous executions using the distribution of system call frequencies. Specifically, we proposed [30] using clustering algorithms (e.g., global k-means clustering with the Mahalanobis distance) to learn the legitimate execution contexts of real-time applications (in terms of their distributions of system call frequencies). These contexts are then monitored at run-time to detect intrusions.
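The P(e) < θ rule can be sketched with any density estimate learned from trusted runs. The example below uses a plain histogram over binned execution times, which is an assumed and deliberately simple estimator. It is not necessarily the statistical learning method of [28]. The function names, bin width and training data are hypothetical.

```python
from collections import Counter


def learn_profile(training_times, bin_width):
    """Estimate P(e) from legitimate executions with a histogram (assumed estimator)."""
    counts = Counter(int(t // bin_width) for t in training_times)
    total = len(training_times)
    return {b: c / total for b, c in counts.items()}, bin_width


def is_malicious(exec_time, profile, theta):
    """Flag an execution instance whose estimated probability is below theta."""
    probs, bin_width = profile
    p = probs.get(int(exec_time // bin_width), 0.0)  # unseen bins get probability 0
    return p < theta


# Hypothetical execution times (in microseconds) from trusted runs.
profile = learn_profile([100, 101, 102, 100, 101, 150], bin_width=5)
```

With θ = 0.2, an execution time of 103 falls in the common bin and is accepted. Times of 151 (rare) and 200 (never seen) are flagged.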

Control Flow Monitoring
We then proposed a hardware-based approach for checking the integrity of the control flow of real-time tasks [31]. In particular, we add an on-chip control flow monitoring module (OCFMM) with a dedicated memory unit that hooks directly into the processor and tracks the control flow of the tasks. The control flow graph (CFG) of each task is produced from the program binary and loaded into the OCFMM memory in advance (e.g., during system boot). During program execution, the detection module inside the OCFMM compares the control flow of the running program with the stored one (i.e., the CFG profiles loaded into the dedicated memory at boot time). At run-time (i.e., during the execution of a given block), the CFG profiles for the next possible blocks are pre-fetched. The decision module continuously scans the current block and validates the execution flow. It does so by comparing the current address of the program counter (PC) against the possible destination addresses fetched beforehand. If any mismatch occurs, the detection module raises a detection flag that indicates a possible breach.
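In software terms, the check performed by the OCFMM resembles validating a trace of executed basic-block addresses against a successor map derived from the CFG. The sketch below illustrates that comparison. The CFG addresses and the `monitor` function are hypothetical. The real module does this in hardware, pre-fetching the successor set for the current block.

```python
# Hypothetical CFG profile: basic-block start address -> allowed successor addresses.
CFG = {
    0x100: {0x120, 0x140},  # conditional branch
    0x120: {0x160},
    0x140: {0x160},
    0x160: {0x100, 0x180},  # loop back-edge or exit
    0x180: set(),           # task returns
}


def monitor(pc_trace, cfg):
    """Return the index of the first illegal control transfer, or None.

    For each executed block, the allowed successors (the "pre-fetched"
    destinations) are looked up and the next PC is checked against them.
    """
    for i in range(1, len(pc_trace)):
        allowed = cfg.get(pc_trace[i - 1], set())
        if pc_trace[i] not in allowed:
            return i  # raise the detection flag
    return None
```

A legitimate loop iteration such as `[0x100, 0x120, 0x160, 0x100, 0x140, 0x160, 0x180]` passes. A diverted jump such as `[0x100, 0x120, 0x400]` is flagged at index 2.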

Security via Platform-Level Reset
In traditional computing systems (e.g., servers, smartphones, etc.), software problems are often resolved by restarting either the application process or the platform [52,53]. Unlike in those conventional computing systems, however, restart-based recovery mechanisms are not straightforward in RT-IoT. The obstacles are the real-time constraints and the interactions of the control system with the physical world (for example, a UAV can quickly be destabilized if its controller restarts). In initial work [32], we proposed a restart-based concept to improve security guarantees for RT-IoT. This Simplex-based framework, which we refer to as ReSecure, is specifically designed to improve the security of safety-critical RT-IoT systems. In particular, we proposed restarting the platform periodically/asynchronously and loading a fresh image of the applications and OS from read-only media after each reboot. The objective is to wipe out the intruder or malicious entity. The ReSecure architecture (see Figure 6) produces a verified system (by using a safety unit) despite the use of an unverified complex controller (i.e., the complex unit). The OS/firmware in the complex unit is exposed to external (possible attack) surfaces and can fail. A decision module predicts whether future states are safe. Watchdog (WD) and periodic timers restart the complex unit (and reload the OS image from read-only memory) upon fail-stop.
Figure 6. The ReSecure framework [32]: the safety unit is the bare-metal verified component and the complex unit is not verified. The decision module switches between the controllers to provide overall system safety.

Our primary focus here is to ensure the safety of the system despite the presence of a malicious entity. The main idea is that, if we restart the system frequently enough, the attacker is less likely to have time to re-enter the system and cause meaningful damage (such as data breaches or jeopardizing safety). After every restart, there is a predictable down time (during the system reboot). This is followed by some operational time (before the system is compromised again) and then some compromised time (until the compromise is detected or the periodic timer expires). The length of each of these intervals depends on the type and configuration of the platform, the adversary model, the complexity of the exploits, etc. As a general rule, the effectiveness of the restarting mechanism increases (i) as the time needed to re-launch the attacks increases, or (ii) as the time needed to detect attacks and trigger a restart decreases. We also evaluated the expected loss of availability due to restarts and the expected damage from the attacks/exploits for a given restart configuration.
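The trade-off between down time and compromised time can be made concrete with a toy model. The sketch below adopts a simplifying assumption that is not from the source: all durations are deterministic, the attacker needs a fixed time to re-compromise the system after each reboot, and no detector cuts the compromised phase short, so only the periodic timer ends it. The function and parameter names are hypothetical.

```python
def restart_cycle_breakdown(restart_period, reboot_time, time_to_compromise):
    """Fractions of one restart cycle spent down, operational and compromised.

    Simplifying assumptions: deterministic durations; the attacker re-enters
    exactly time_to_compromise after each reboot completes; only the periodic
    restart timer (no intrusion detector) ends the compromised phase.
    """
    up = restart_period - reboot_time
    if up <= 0:
        raise ValueError("restart period must exceed reboot time")
    operational = min(time_to_compromise, up)
    compromised = up - operational
    return {
        "down": reboot_time / restart_period,
        "operational": operational / restart_period,
        "compromised": compromised / restart_period,
    }
```

With a 5 s reboot and an attacker needing 40 s, a 60 s restart period leaves the system compromised for 25% of each cycle. Shortening the period to 30 s eliminates the compromised phase but raises the down-time fraction from about 8% to about 17%.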
In later work [49], we introduced the secure execution interval (SEI): a period of time after each restart and before the untrusted applications begin to execute. During this period, the execution environment is not yet contaminated, and hence security is guaranteed. During SEI, the system executes trusted code to determine the next restart time based on the current discrete state of the physical system. When necessary, a safety controller can override the control of the system (during SEI) to guide the system back to a safe state. In addition, we introduced a root of trust (RoT): an isolated hardware timer responsible for enforcing the restart process. It issues the restart signal at designated times, which are computed by the trusted code in SEI. The RoT is designed to be programmable only once in each execution cycle and only during SEI. Since it is inaccessible outside of SEI and works independently, the triggering of the restart process is not affected even when the system is compromised. An example of our framework operating in a UAV system is illustrated in Figure 7. The UAV operates normally within its safe flight zone, and the safety controller does not need to be activated during SEI. Once the attacker compromises the system after the second restart (the orange area), the UAV flies towards its unsafe zone. Before the UAV reaches the unsafe zone, the RoT hardware timer expires and triggers a restart. The safety controller (in SEI) takes over and brings the UAV back to the safe zone. Once the UAV returns within a predefined safe-zone threshold, SEI ends and control is handed back to the applications.
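The access rules of the RoT can be modeled as a small state machine. The timer can be programmed only during SEI and only once per cycle. Leaving SEI requires that the next restart time has been set. When the timer expires, a restart fires and the next cycle begins again in SEI. The class below is a hypothetical software model of those rules for illustration only. The actual RoT is an isolated hardware timer.

```python
class RootOfTrustTimer:
    """Hypothetical model of a restart timer programmable once per cycle, only in SEI."""

    def __init__(self):
        self.in_sei = True       # the platform boots into the secure execution interval
        self.restart_at = None

    def program(self, restart_time):
        if not self.in_sei:
            raise PermissionError("RoT interface is locked outside SEI")
        if self.restart_at is not None:
            raise PermissionError("RoT already programmed in this cycle")
        self.restart_at = restart_time

    def end_sei(self):
        if self.restart_at is None:
            raise RuntimeError("next restart time must be set before leaving SEI")
        self.in_sei = False      # untrusted applications may now run

    def tick(self, now):
        """Return True when the restart signal fires; the next cycle starts in SEI."""
        if self.restart_at is not None and now >= self.restart_at:
            self.restart_at = None
            self.in_sei = True
            return True
        return False
```

Once `end_sei()` has been called, any attempt by (possibly compromised) application code to reprogram the timer raises `PermissionError`. The scheduled restart still fires on time.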

Figure 7. An example of a UAV system operating under the ReSecure framework [49]. The black line extending from the UAV indicates the distance before it leaves the safe flight zone. The red arrows mark the restart points triggered by the RoT. The blue arrows mark the exits from SEI (at which point the next restart time is scheduled in the RoT). Colors distinguish the phases of system operation: white, the main flight controller is in charge and the system is not compromised; yellow, the system is restarting; green, SEI is active, the safety controller is running and the next restart time is being calculated; orange, the system is compromised and the adversary is in charge; and blue/gray, the time spans when the RoT interface is available and unavailable, respectively.

Security without Architectural Modifications
Even though architectural modifications can improve the security posture of RT-IoT nodes, those approaches require an overall redesign and may not be suitable for systems developed using COTS components. We now review some of the approaches we recently proposed to enhance security in RT-IoT without custom hardware support.

Dealing with Side-Channel Attacks
As introduced in Section 3.2.2, we demonstrated that an attacker can carry out a cache-timing attack to indirectly estimate memory usage behavior. This is possible because most COTS-based RT-IoT systems lack isolation of shared resources across tasks. Information can therefore leak between tasks when the system transitions from one task to another. Capturing security constraints between tasks thus becomes essential for preventing side-channel attacks.
In previous work [44], we proposed integrating security in RT-IoT by adding constraints to tasks scheduled with fixed-priority real-time schedulers. Each task is assigned a user-defined security level. The scheduler flushes the shared cache when the system transitions from a high-security task (i.e., a task demanding higher confidentiality) to a low-security task (i.e., an untrusted task that is potentially compromised). Consider the set of security levels S for tasks, which forms a total order. Hence, any two tasks (τ_i, τ_j) have one of the following two relationships with respect to their security levels s_i, s_j ∈ S: (i) s_i ≺ s_j, meaning that τ_i has a higher security level than τ_j, or (ii) s_j ≺ s_i.
We proposed mitigating information leakage among tasks of varying security levels by transforming security requirements into constraints on scheduling algorithms. Modifying or constraining scheduling algorithms is appealing because: (a) it is a software-based approach and hence easier to deploy than hardware-based approaches; and (b) it allows security requirements to be reconciled with real-time (schedulability) requirements. Consider a simple case with two periodic tasks, a high-priority task H and a low-priority task L, scheduled by a fixed-priority policy. Assume that $s_H \prec s_L$; hence, information from H must not leak to L. These tasks must be scheduled on a single processor, P, so that both deadlines ($D_H$, $D_L$) are satisfied. If L (or any part of it) executes immediately after all or part of H, then L could "leak" data from resources recently used by H. The main intuition is that a penalty must be paid for each shared resource in the system every time execution switches between security levels; in this case, the cache must be flushed before the new task is scheduled. Hence, we proposed the use of an independent task, called the Flush Task, for this purpose.
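The flushing rule can be made concrete with a small Python sketch. It assumes security levels are encoded as integer ranks (a smaller rank means a higher security level, so `rank[a] < rank[b]` plays the role of $s_a \prec s_b$) and takes an already-computed execution timeline as input; the names `insert_flushes` and `FLUSH` are illustrative. The sketch only decides where flushes are needed; it does not model the Flush Task's execution time or the schedulability analysis of [44].

```python
from typing import Dict, List, Optional

FLUSH = "FLUSH"  # marker for a cache flush (the role played by the Flush Task)

def insert_flushes(timeline: List[Optional[str]], rank: Dict[str, int]) -> List[Optional[str]]:
    """Insert a flush whenever execution moves from a more secure to a less secure task.

    rank[t] encodes the total order: a smaller rank means a higher security level.
    None denotes an idle slot, assumed to leave the cache contents untouched.
    """
    out: List[Optional[str]] = []
    last: Optional[str] = None  # most recent task to run since the last flush
    for slot in timeline:
        if slot is not None:
            # Without a flush, ranks along the timeline never increase, so `last`
            # is always the most secure task whose data may still be cached.
            if last is not None and rank[last] < rank[slot]:
                out.append(FLUSH)
            last = slot
        out.append(slot)
    return out
```

For example, with `rank = {"H": 0, "L": 1}`, the timeline `["H", "H", "L", "H", "L"]` becomes `["H", "H", "FLUSH", "L", "H", "FLUSH", "L"]`: a flush precedes each H-to-L transition, while L-to-H transitions need none.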
In subsequent work [33], we relaxed many of these restrictions (e.g., the requirement of a total order over security levels) and proposed a new, more general model to capture security constraints between tasks in a real-time system, together with schedulability analysis for both preemptive and non-preemptive tasks. We proposed a constraint named noleak to capture whether unintended information sharing between a pair of tasks must be forbidden. Using this constraint, we can prevent information leakage via implicitly shared resources. For any two tasks $\tau_i$ and $\tau_j$: if noleak($\tau_i$, $\tau_j$) = True, then information leakage from $\tau_i$ to $\tau_j$ must be prevented; if noleak($\tau_i$, $\tau_j$) = False, no such constraint needs to be enforced. We showed that the system remains schedulable (i.e., all tasks meet their deadlines) under the proposed constraints without significant performance impact.
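A similar hypothetical sketch for the more general noleak constraint tracks which tasks may still have state in the shared cache since the last flush, and inserts a flush only when the next task is forbidden from observing one of them. Encoding noleak as a set of (source, destination) pairs, and treating idle slots as leaving the cache untouched, are assumptions made for illustration; this is not the analysis of [33].

```python
from typing import List, Optional, Set, Tuple

FLUSH = "FLUSH"

def insert_flushes_noleak(timeline: List[Optional[str]],
                          noleak: Set[Tuple[str, str]]) -> List[Optional[str]]:
    """Flush only when the next task must not observe data left by a resident task.

    (src, dst) in noleak means noleak(src, dst) = True; any pair not in the set is
    treated as noleak = False (no constraint). None denotes an idle slot.
    """
    out: List[Optional[str]] = []
    resident: Set[str] = set()  # tasks whose state may still be in the cache
    for slot in timeline:
        if slot is not None:
            if any((src, slot) in noleak for src in resident):
                out.append(FLUSH)
                resident.clear()  # the flush removes every resident task's state
            resident.add(slot)
        out.append(slot)
    return out
```

Unlike the total-order version, this sketch checks every task resident since the last flush, not only the most recent one. With `noleak = {("A", "B")}`, the timeline `["A", "C", "B"]` therefore still gets a flush before B, while `["B", "A"]` needs none.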

Schedule Randomization
One way to protect a system from certain attacks (e.g., the schedule-based side-channel attack mentioned in Section 3.2.1) is to randomize the task schedule, reducing the deterministic observability of periodic RT-IoT applications. Randomizing the task schedules enforces non-determinism: every hyper-period (the smallest interval of time after which the periodic pattern of all tasks repeats itself, typically defined as the least common multiple of the task periods) shows a different order (and timing) of task execution. Unlike in traditional systems, randomizing task schedules in RT-IoT is not straightforward, since it leads to priority inversions [54] that, in turn, may cause missed deadlines and hence put the safety of the system at risk.
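As a quick illustration of the hyper-period definition, here is a minimal Python helper. It assumes integer periods in a common time unit (e.g., milliseconds or ticks) and requires Python 3.9+ for the multi-argument `math.lcm`.

```python
import math
from typing import Iterable

def hyperperiod(periods: Iterable[int]) -> int:
    """Least common multiple of the task periods: a deterministic periodic
    schedule repeats with this length."""
    return math.lcm(*periods)
```

For periods of 5, 10 and 20 the hyper-period is 20; for 4, 6 and 10 it is 60. An observer who records one hyper-period of a deterministic schedule has effectively seen all of them, which is the observability that randomization targets.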
Hence, we proposed TaskShuffler [50], a randomization protocol for fixed-priority scheduling, to achieve such randomness in the task schedule. For instance, at each scheduling point it can pick a random task instead of the highest-priority one, subject to the deadline constraints. The degree of randomness in TaskShuffler is flexible. Depending on the system's needs, TaskShuffler implements the following randomization schemes:
• Randomization (Task Only): This is the most basic form of randomization compared with the other schemes below. We randomly pick a task to execute whenever a task arrives or finishes its job, i.e., at the scheduling points. Its effectiveness against the schedule-based side-channel attack is limited, since the busy intervals in this scheme remain the same.
• Randomization with Idle Time Scheduling: In addition to the randomness provided by the basic scheme, we include the idle task (i.e., the dummy task executed by an RTOS when no real-time task is running) as a candidate at each scheduling point. This eliminates the periodicity of busy intervals across hyper-periods and makes it harder for the schedule-based side-channel attack to produce effective results.
• Randomization with Idle Time Scheduling and Fine-grained Switching: To push randomization to an extreme, one could randomize the schedule at every tick. That is, the scheduler randomly picks a task to execute, subject to the deadline constraints, at every tick interrupt. This yields the most randomness in the schedule; however, it greatly increases the overhead and thus may not be applicable to all use cases. Figure 8 illustrates an instance of the randomized schedule for a simple taskset with three tasks.
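The following Python sketch illustrates the general idea of bounded random selection; it is not TaskShuffler's published algorithm. Each ready job carries a priority-inversion budget, assumed to be computed offline (for example, as the slack between a task's deadline and its worst-case response time). The sketch walks the ready jobs from highest to lowest priority: a lower-priority job, or the idle task, is eligible only while every job above it still has budget left, and each decision charges one tick of inversion to the jobs it delays. Calling the hypothetical `pick` with `allow_idle=False` mirrors the task-only scheme, and `allow_idle=True` adds idle time scheduling. Calling it on every tick mirrors fine-grained switching; charging exactly one tick per call is itself an assumption that fits only this fine-grained variant.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

IDLE = None  # stands in for the RTOS idle task

@dataclass
class Job:
    name: str
    priority: int  # smaller value = higher priority (fixed-priority order)
    budget: int    # remaining inversion budget in ticks (assumed computed offline)

def candidates(ready: List[Job], allow_idle: bool) -> List[Optional[Job]]:
    """Jobs (and optionally the idle task) that may run at this scheduling point."""
    result: List[Optional[Job]] = []
    for job in sorted(ready, key=lambda j: j.priority):
        result.append(job)
        if job.budget <= 0:
            return result  # nothing below this job may delay it any further
    if allow_idle:
        result.append(IDLE)  # idle is eligible only if no ready job is out of budget
    return result

def pick(ready: List[Job], allow_idle: bool, rng: random.Random) -> Optional[Job]:
    """Choose uniformly among eligible entries and charge the resulting inversion."""
    chosen = rng.choice(candidates(ready, allow_idle) or [IDLE])
    for job in ready:
        # Every ready job that outranks the chosen one (all of them, if idling) is delayed.
        if chosen is IDLE or job.priority < chosen.priority:
            job.budget -= 1
    return chosen
```

Whether a given budget rule guarantees every deadline depends on the analysis used to derive the budgets. This sketch does not attempt that proof. It only shows how the candidate list shrinks as budgets are consumed.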
IoT systems with real-time properties are predictable by design. In the hands of capable adversaries, this very determinism can become a vulnerability, making it easier to carry out attacks such as side-channel attacks [40,48], DoS (making critical resources unavailable at important times), or even the recently developed timing-inference attacks [48]. TaskShuffler can reduce the determinism visible to external entities while still meeting real-time guarantees. With such randomization, even if an observer captures the exact schedule for a limited period of time (for instance, a few hyper-periods), TaskShuffler schedules tasks so that succeeding hyper-periods show different orders (and timing) of task execution.

Integrating Security for Legacy RT-IoT
As described in Section 3.2.1, an adversary can extract important information while remaining undetected; it is therefore essential to build layered defenses and resilience against such attacks into the design of RT-IoT. However, any security mechanism has to coexist with the real-time tasks in the system and must operate without violating the timing and safety constraints of the control logic. Moreover, the embedded nature of these systems limits the computational resources (e.g., memory and processor time) available for resource-intensive monitoring mechanisms. This creates a tension between security requirements (e.g., having enough cycles for effective monitoring and detection) and timing and safety requirements. For example, a critical design question is how often, and for how long, a monitoring and intrusion detection task should execute to be effective without interfering with real-time control or other safety-critical tasks. While this tension could potentially be addressed at design time for newer systems, it is especially challenging when retrofitting legacy systems, where the control tasks are already in place and perhaps cannot be modified. Any hardware- and/or software-level modification to the parameters of such legacy systems is costly, since it must go through several verification and validation steps and may increase system downtime [15]. Most security solutions for RT-IoT proposed in the literature either require custom hardware [27–32,49,55], modification of the existing schedulers [46,47], or extra instrumentation [55], or they may need to change task parameters (e.g., execution order and/or run-time) [33,44,50], and are therefore not suitable for legacy systems. Integrating monitoring and detection tasks into RT-IoT without custom hardware support is an open problem.
Given the tension between security and timing requirements, the execution frequency of the monitoring tasks is an important design parameter when integrating security mechanisms into a practical system, since it trades security requirements off against timing constraints. If the interval between consecutive monitoring events is too large, the adversary may harm the system (and remain undetected) between two invocations of the security task. Conversely, if the security tasks execute very frequently, they may impact the schedulability of the real-time tasks.
In preliminary work [35], we addressed the problem of determining the frequency of execution (i.e., the periods, or inter-monitoring intervals) of the security tasks. To integrate security without perturbing the real-time scheduling order, we execute the security tasks at a lower priority than the real-time tasks. We refer to this scheme as opportunistic execution, since the security tasks are allowed to execute only during slack time, when no real-time task is running.
We measure the security of the system in terms of achievable periodic monitoring. Let $T_i$ be the period of security task $i$, which needs to be determined. Our goal is to minimize the deviation between the achievable (i.e., unknown) period $T_i$ and the desired (i.e., designer-provided) period $T_i^{des}$. We formulate a constrained optimization problem and develop a polynomial-time solution that allows security tasks to execute with a frequency close to the desired values while respecting the temporal constraints of the other real-time tasks.
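To make this trade-off concrete, here is a simple greedy sketch in Python. It is not the polynomial-time optimization from [35]. The sketch assumes integer WCETs and periods and implicit deadlines. It also assumes the security tasks are listed in descending priority, all below every real-time task, and each has an assumed upper bound $T_i^{max}$ on its period. Each security task receives the period closest to its desired value $T_i^{des}$ that its worst-case response time, from standard fixed-priority response-time analysis, permits. Names such as `assign_security_periods` are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Task:
    wcet: int    # worst-case execution time C
    period: int  # period T; implicit deadline D = T

def response_time(wcet: int, higher: List[Task], limit: int) -> Optional[int]:
    """Fixed-priority response-time iteration; None if it grows beyond `limit`."""
    r = wcet
    while True:
        nxt = wcet + sum(-(-r // t.period) * t.wcet for t in higher)  # ceil(r / T)
        if nxt > limit:
            return None
        if nxt == r:
            return r
        r = nxt

def assign_security_periods(rt_tasks: List[Task],
                            sec_specs: List[Tuple[int, int, int]]) -> Optional[List[int]]:
    """sec_specs holds (wcet, T_des, T_max) per security task, highest priority first.
    Security tasks run below every RT task, so RT response times are unchanged."""
    higher = list(rt_tasks)
    periods: List[int] = []
    for wcet, t_des, t_max in sec_specs:
        r = response_time(wcet, higher, t_max)
        if r is None:
            return None  # cannot be monitored even at the slowest acceptable rate
        period = max(t_des, r)  # as close to T_des as the response time allows
        periods.append(period)
        higher.append(Task(wcet, period))  # interferes with lower security tasks
    return periods
```

With real-time tasks $(C, T) = (1, 4)$ and $(2, 10)$, security tasks with desired periods 10 and 15 get exactly those periods. If the desired periods are tightened to 3 and 5, the sketch settles on 4 and 8, the fastest rates the response times permit. Being greedy, it favors higher-priority security tasks and does not minimize a global deviation metric.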
If the security tasks always execute at the lowest priority, they suffer more interference (i.e., preemption by higher-priority real-time tasks), and the resulting longer detection time (due to poor response time) makes the security mechanisms less effective. To provide better responsiveness and increase the effectiveness of monitoring and detection mechanisms, we then proposed a multi-mode model [36]. This framework (called Contego) allows the security policies/tasks to execute in different modes (i.e., passive monitoring at the lowest priority as well as exhaustive checking at a higher priority). Using this approach (see Figure 9), for instance, security routines can execute opportunistically when the system is deemed to be clean (i.e., not compromised). However, if any anomaly or unusual behavior is suspected, the security policy may switch to a fine-grained checking mode and execute at a higher priority. The security routines may return to normal mode if: (i) no anomalous activity is found; or (ii) the intrusion is detected and the malicious entities are removed.
Figure 9. Flow of operations in Contego depicting different modes for the security tasks.
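The mode logic described above can be sketched as a small state machine. This is a hypothetical Python illustration, not Contego's implementation. The class name, the two priority values (a smaller value means a higher priority), and reporting each check's outcome through `on_check_result` are all assumptions.

```python
from enum import Enum

class Mode(Enum):
    PASSIVE = "passive"  # opportunistic monitoring at the lowest priority
    ACTIVE = "active"    # fine-grained checking at an elevated priority

class MultiModeSecurityTask:
    """Hypothetical mode controller for one security task (smaller value = higher priority)."""

    def __init__(self, passive_prio: int, active_prio: int) -> None:
        self.mode = Mode.PASSIVE
        self.passive_prio = passive_prio
        self.active_prio = active_prio

    @property
    def priority(self) -> int:
        return self.active_prio if self.mode is Mode.ACTIVE else self.passive_prio

    def on_check_result(self, anomaly_suspected: bool, threat_removed: bool = False) -> Mode:
        if self.mode is Mode.PASSIVE and anomaly_suspected:
            self.mode = Mode.ACTIVE   # escalate to fine-grained checking
        elif self.mode is Mode.ACTIVE and (not anomaly_suspected or threat_removed):
            self.mode = Mode.PASSIVE  # (i) nothing found, or (ii) intrusion removed
        return self.mode
```

A scheduler would read `priority` after each transition. The task runs at `passive_prio` while the system looks clean and is raised to `active_prio` only while an anomaly is under investigation.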
The aforementioned works, however, were developed for single-core systems only; integrating security mechanisms into legacy multicore platforms (where designers have less flexibility to change the system architecture/parameters) is also a challenging problem. In recent work [37], we developed a scheme for multicore RT-IoT that finds an assignment of security tasks ensuring that they can execute with a frequency close to what a designer expects. We considered a multicore platform composed of M identical cores. One fundamental problem when integrating security mechanisms into multicore platforms is to determine which security tasks are assigned to which core and when they execute. Although security tasks can execute on any of the M available cores and any period $T_i^{des} \le T_i \le T_i^{max}$ is acceptable, the actual task-to-core assignment and the periods of the security tasks are not known a priori. The goal of this scheme is therefore to jointly find the task-to-core assignment and suitable periods for the security tasks. However, finding such an assignment is NP-hard due to the combinatorial nature of the problem. Therefore, we developed a near-optimal, low-complexity solution (called HYDRA) that jointly finds the security tasks' periods and core assignments. In our experiments, HYDRA (which distributes security tasks across all available cores) achieved, on average, a 27.23% faster intrusion detection rate (on a quad-core system) than the case in which the security tasks are allocated a dedicated core and the real-time tasks are assigned to the remaining cores.
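The joint assignment problem can be illustrated with a simple greedy heuristic in Python. It is explicitly not HYDRA, whose algorithm is not described here. The sketch assumes the real-time tasks are already partitioned onto cores. Security tasks are placed in the given order and run below all real-time tasks on their core, with earlier-placed ones at higher priority. Each one goes to the core that yields the smallest feasible period within $[T_i^{des}, T_i^{max}]$. The function names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Task:
    wcet: int    # worst-case execution time C
    period: int  # period T; implicit deadline D = T

def response_time(wcet: int, higher: List[Task], limit: int) -> Optional[int]:
    """Fixed-priority response-time iteration; None if it grows beyond `limit`."""
    r = wcet
    while True:
        nxt = wcet + sum(-(-r // t.period) * t.wcet for t in higher)  # ceil(r / T)
        if nxt > limit:
            return None
        if nxt == r:
            return r
        r = nxt

def assign_to_cores(rt_per_core: List[List[Task]],
                    sec_specs: List[Tuple[int, int, int]]) -> Optional[List[Tuple[int, int]]]:
    """Place each security task (wcet, T_des, T_max) on the core giving the smallest
    feasible period. Returns [(core, period), ...] or None if some task fits nowhere."""
    per_core = [list(tasks) for tasks in rt_per_core]
    placement: List[Tuple[int, int]] = []
    for wcet, t_des, t_max in sec_specs:
        best: Optional[Tuple[int, int]] = None
        for core, higher in enumerate(per_core):
            r = response_time(wcet, higher, t_max)
            if r is None:
                continue  # infeasible on this core within T_max
            period = max(t_des, r)  # never faster than desired, never below R
            if best is None or period < best[1]:
                best = (core, period)
        if best is None:
            return None
        per_core[best[0]].append(Task(wcet, best[1]))  # lowest priority on that core
        placement.append(best)
    return placement
```

On two cores hosting real-time tasks $(2, 4)$ and $(1, 10)$, two identical security tasks $(1, 2, 50)$ end up spread across both cores with periods 2 and 3. Packing them onto one core would force a slower rate, which is the intuition behind distributing security tasks rather than dedicating a core to them.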

Securing Legacy RT-IoT Systems
Since most RT-IoT nodes are resource-constrained embedded devices, resource-intensive processing and complex protocols (e.g., heavy cryptographic operations) for securing them are unrealistic and may threaten the safety of such systems. For instance, a safety-critical task may miss its deadline because computation-heavy security tasks are running. In addition to execution frequency, another important consideration is how quickly intrusions can be detected. Thus, responsiveness versus schedulability of critical tasks is another important trade-off, and investigating it is a research challenge in itself.
Thus far, we have assumed that we are given a set of security tasks and that each security task has a desired frequency of execution for better security coverage. Security tasks have so far been treated as independent and preemptible. In practice, however, as previously discussed, some security monitoring may need atomicity or non-preemptive execution. Further, security tasks may have dependencies, where one task depends on the output of one or more other tasks. For example, an anomaly detection task may depend on the outputs of multiple scanning tasks. Alternatively, the scheduling framework may need to follow certain precedence constraints for security tasks. For example, to ensure the integrity of the monitoring itself, the security application's own binary may need to be examined before the system binary files are checked. In such cases, we cannot execute the security tasks independently, and we need to consider the problem of integrating security tasks with dependencies between them. One approach could be to use a directed acyclic graph (DAG) to capture the dependencies and constraints among security tasks. In this case, the tightness of achievable periodic monitoring described in Section 4.2.3 may no longer be a reasonable metric. Constraints ensuring that the entire DAG is executed often enough should be included, and the optimization problem should be reformulated and evaluated with different metrics.
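A small Python sketch of the DAG idea uses the standard-library `graphlib` module (Python 3.9+). It derives one valid execution order from the precedence constraints and computes the WCET-weighted critical path, a lower bound on the end-to-end latency of one DAG instance even with unlimited cores. The task names and WCET values are hypothetical, and the critical path is only one candidate metric, not one proposed in the text.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """One valid order for dependent security tasks; raises graphlib.CycleError on cycles.
    deps maps each task to the set of tasks that must finish before it starts."""
    return list(TopologicalSorter(deps).static_order())

def critical_path(deps: Dict[str, Set[str]], wcet: Dict[str, int]) -> int:
    """Longest WCET-weighted path through the DAG (earliest finish time of its sink)."""
    finish: Dict[str, int] = {}
    for task in execution_order(deps):
        start = max((finish[p] for p in deps.get(task, ())), default=0)
        finish[task] = start + wcet[task]
    return max(finish.values(), default=0)
```

For example, if a self-check (2 units) must precede a file-system scan (3) and a binary scan (5), and an anomaly detector (1) consumes both scans, one DAG instance needs at least 2 + 5 + 1 = 8 units end to end. A period constraint on the whole DAG would have to respect at least this bound.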

Security for Multicore based RT-IoT Platforms
Most of the work [33,35,36,44,48] presented thus far has been in the context of single-core processors, the most common type of processor used in RT-IoT systems. However, as mentioned above, due to increasing computational demands, multicore processors are becoming increasingly relevant to real-time systems [23,56]. With more cores, more computation can be packed into a single chip, thus reducing power and weight requirements, both of which may be relevant to many RT-IoT systems. However, multicore processors can increase attack vectors, especially for side-channel attacks. First, two or more tasks run concurrently and (most likely) share low-level resources (e.g., last-level caches). Hence, a task running on one core can snoop on another, and not only when tasks follow each other. In fact, it has been shown that leakage can occur with much higher bandwidth through shared resources in multicore processors [57]. Second, when tasks execute concurrently, a malicious task can increase the "interference" faced by a critical task. For instance, the malicious task can flood the cache/bus with memory references just when an important task (say, one that computes the control loop) is running. This could delay the critical task and even cause it to miss its deadline. To prevent such problems, system designers need to enforce constraints ensuring that protected tasks do not execute simultaneously with unprotected ones on the multicore chip.
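A minimal checker for that co-scheduling constraint is sketched below. It assumes a tick-by-tick multicore schedule represented as one list per tick, with one entry per core and `None` for an idle core. The function name and representation are illustrative. It reports every tick at which a protected task shares the chip with an unprotected one.

```python
from typing import List, Optional, Set

def isolation_violations(schedule: List[List[Optional[str]]], protected: Set[str]) -> List[int]:
    """Ticks at which protected and unprotected tasks run simultaneously.

    schedule[t][m] is the task running on core m at tick t (None = idle core).
    """
    bad: List[int] = []
    for tick, cores in enumerate(schedule):
        running = [task for task in cores if task is not None]
        has_protected = any(task in protected for task in running)
        has_unprotected = any(task not in protected for task in running)
        if has_protected and has_unprotected:
            bad.append(tick)
    return bad
```

A scheduler enforcing the constraint would produce only schedules for which this list is empty. A checker of this kind can also be used offline to validate a candidate multicore schedule.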
The problem of integrating security tasks into legacy RT-IoT systems is also interesting in the multicore context. Perhaps the security tasks can always be running (e.g., on a dedicated core) instead of running opportunistically as in single-core systems. In addition, the security mechanism could take up more cores and execute fine-grained sanity checks (e.g., a complete system-wide scan) when it detects malicious activity. Analyzing the impact of integrating security tasks into multicore legacy RT-IoT is an open problem worth investigating.

Secure Communication with Timing Constraints
With the rise of RT-IoT, edge devices increasingly exchange control messages and data, often over unreliable media such as the Internet. Therefore, in addition to the host-based approaches [27–31,33,44,50] described above, communication channels must be secured to ensure the authenticity and integrity of control messages. While some of our previous work [32,35] can also be used to deal with network-level attacks, designing a unified framework that protects edge devices as well as communication messages (given the stringent end-to-end delay requirements of highly critical traffic) is still an open problem.
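As a generic illustration of message authenticity and integrity (not a scheme proposed in this paper), the Python sketch below tags each control message with HMAC-SHA256 from the standard `hmac` module. It rejects replays with a monotonically increasing sequence number. It assumes a pre-shared key and in-order delivery, and it provides no confidentiality. The per-message hashing cost would still need to be accounted for in the end-to-end delay budget. The frame layout and the names `seal` and `Receiver` are hypothetical.

```python
import hashlib
import hmac
import struct
from typing import Optional

TAG_LEN = 32  # HMAC-SHA256 output size in bytes

def seal(key: bytes, seq: int, payload: bytes) -> bytes:
    """Frame = 8-byte big-endian sequence number || payload || HMAC-SHA256 tag."""
    header = struct.pack(">Q", seq)
    tag = hmac.new(key, header + payload, hashlib.sha256).digest()
    return header + payload + tag

class Receiver:
    def __init__(self, key: bytes) -> None:
        self.key = key
        self.last_seq = -1  # highest sequence number accepted so far

    def accept(self, frame: bytes) -> Optional[bytes]:
        """Return the payload if the frame is authentic and fresh, else None."""
        if len(frame) < 8 + TAG_LEN:
            return None
        body, tag = frame[:-TAG_LEN], frame[-TAG_LEN:]
        expected = hmac.new(self.key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):  # constant-time comparison
            return None  # tampered frame or wrong key
        (seq,) = struct.unpack(">Q", body[:8])
        if seq <= self.last_seq:
            return None  # replayed (or reordered) message
        self.last_seq = seq
        return body[8:]
```

Strict sequence checking rejects legitimately reordered packets. A deployment over a network that reorders traffic would need a replay window instead, which is one of the timing-versus-security choices discussed above.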
Safety-critical RT-IoT systems often have separate networks (hardware and software) for each type of flow for safety (and security) reasons. This leads to significant overhead (equipment, management, weight, etc.), a potential for errors/faults, and even an increased attack surface and more attack vectors. Network-level nondeterminism, i.e., unpredictability in sensor readings, packet delivery, forwarding, and processing, further complicates the management of RT-IoT systems. Existing protocols in use in many real-time domains, e.g., avionics full-duplex switched Ethernet (AFDX) [58] and controller area network (CAN) [59], are either proprietary, complex, and expensive, requiring custom hardware, or are exposed to known vulnerabilities [60].
Given the limitations of existing protocols, leveraging software-defined networking (SDN) can also be effective for RT-IoT systems. SDN is compatible with COTS components (and thus suitable for legacy RT-IoT systems) and provides a centralized mechanism for developing and managing the system. The resulting global view is useful for ensuring QoS (e.g., bandwidth and delay) and enforcing security mechanisms (such as remote attestation, secure key/message exchange, remote monitoring, etc.). While SDNs provide a global view of the network and high-level management capabilities (including resource allocation), the standards used in traditional SDN (e.g., OpenFlow [61]) do not consider the inherent timing and safety-critical nature of RT-IoT systems. In recent work [62], we addressed this problem through static flow allocation and routing: we used static path allocation and over-provisioned hardware resources (e.g., dedicating one queue per real-time flow) to meet the end-to-end delay requirements and provide isolation. This limited the number of flows that could be admitted and resulted in underutilized network resources. Retrofitting the capabilities of SDN to the RT-IoT domain requires further research. We also need mechanisms to prioritize between flows (e.g., among critical real-time flows, or across real-time and non-real-time flows) and schemes for multiplexing flows onto the same queues in SDN switches (to improve network efficiency) while still meeting the real-time constraints.
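The limitation of static allocation with one queue per flow can be seen in a toy admission check. This Python sketch is a hypothetical simplification, not the method of [62]. It assumes each flow's frames are delayed at every hop only by a fixed switching delay plus their own transmission time, with no interference from other queues sharing the link. A flow is admitted only if every egress port on its path still has a free queue and the summed per-hop bound meets the deadline. Port identifiers and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Port:
    free_queues: int       # hardware queues still available on this egress port
    rate_bps: float        # link rate in bits per second
    switch_delay_s: float  # assumed fixed per-hop processing delay

@dataclass
class Flow:
    path: List[str]        # egress ports traversed, in order
    frame_bits: int        # largest frame of the flow
    deadline_s: float      # end-to-end delay requirement

def try_admit(flow: Flow, ports: Dict[str, Port]) -> bool:
    """Reserve a dedicated queue on every hop if the idealized delay bound fits."""
    if any(ports[p].free_queues < 1 for p in flow.path):
        return False  # some hop has no spare queue left
    bound = sum(ports[p].switch_delay_s + flow.frame_bits / ports[p].rate_bps
                for p in flow.path)
    if bound > flow.deadline_s:
        return False
    for p in flow.path:
        ports[p].free_queues -= 1  # one queue per real-time flow on each hop
    return True
```

Once a port's queues are exhausted, further flows are rejected even when the link still has spare bandwidth. This mirrors the underutilization noted above and motivates multiplexing flows onto shared queues.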

Related Work
Several works have investigated security in real-time systems [46,47,63], approaching the area from different angles. Information leakage via side channels has been discussed in many works. Kadloor et al. [64] and Gong et al. [65] introduced analyses and methodologies for quantifying side-channel leakage. Kelsey et al. [39], Osvik et al. [66], and Page et al. [67] demonstrated the practicality of cache-based side channels. Son et al. [41] and Völp et al. [68] examined the exploitation of timing channels in real-time scheduling. Bao et al. [69] introduced a scheduling algorithm for soft real-time systems (where some tasks can miss deadlines) that trades off thermal side-channel information leakage against the number of deadline misses. Studies of the robustness of AES secret keys against differential power analysis (DPA) [70] attacks also exist [40].
While the work above focuses on exploring vulnerabilities, studies that aim to provide security for real-time systems also exist. Ghassami et al. [71] and Völp et al. [72] proposed techniques to address leakage via shared resources. Krüger et al. proposed an online job randomization scheme [73] for time-triggered real-time systems. Xie et al. [46] and Lin et al. [47] addressed security in real-time systems by encrypting communication messages. Similar to our hardware-assisted security mechanisms (e.g., S3A, SecureCore, ReSecure, etc.), researchers have also proposed architectural frameworks [55] that dynamically utilize slack time (i.e., intervals when no real-time task is executing) for run-time monitoring. Recent studies [74,75] propose schemes to secure systems against man-in-the-middle attacks, in which an attacker compromises the communication between system sensors and controllers.
Some recent work has raised security awareness in IoT applications [10,13,76–78], and some researchers aim to add security properties to IoT. Pacheco et al. [79] introduced a security framework that offers security solutions for smart infrastructures. Kuusijärvi et al. [80] proposed mitigating IoT security threats using trusted networks. These works primarily focus on generic IoT applications and do not consider the additional real-time constraints of RT-IoT systems.

Conclusions
The sophistication of recent attacks on RT-IoT requires rethinking security solutions for such systems. The goal of this paper is to raise awareness of real-time security and to bridge gaps in the current IoT context, namely securing IoT systems with real-time constraints. The techniques and methodologies presented here span different perspectives, from hardware-assisted security to scheduler-level techniques, as well as techniques for legacy systems. System designers and the research community can integrate and build upon these frameworks to secure safety-critical RT-IoT systems. We believe that the real-time and IoT worlds are closely connected and will become inseparable in the near future.