Real-Time Performance and Response Latency Measurements of Linux Kernels on Single-Board Computers

Abstract: This research performs real-time measurements of Linux kernels with real-time support provided by the PREEMPT_RT patch on embedded development devices such as BeagleBoard and Raspberry Pi. The experimental measurements of the Linux real-time performance on these devices are based on real-time software modules developed specifically for the purposes of this research. Taking into consideration the constraints of the specific hardware platforms under investigation, new measurement software was developed. The measurement algorithms are designed upon response and periodic task models. Measurements investigate latencies of real-time applications in user and kernel space. An outcome of this research is that the proposed performance measurement approach and evaluation methodology could be applied and deployed on other Linux-based boards and platforms. Furthermore, the results demonstrate that the PREEMPT_RT patch overall improves the Linux kernel real-time performance compared to the standard one. The reduced worst-case latencies on such devices running Linux with real-time support could make them potentially more suitable for real-time applications as long as a latency value of about 160 µs, as an upper bound, is an acceptable safety margin.


Introduction
By default, the Linux kernel handles lightweight or soft real-time requirements successfully. Nevertheless, it does not provide full assurance for the hard timing deadlines required in safety-critical applications in industrial automation and control (e.g., robotics control, aerospace and air traffic control, vehicle control). Linux is a general-purpose operating system that provides important features, such as process management, but not all of them are subject to strict timing constraints. For example, the scheduler can cause unbounded latencies, which makes Linux insufficiently deterministic to guarantee that task deadlines are met. However, safety-critical systems must be safe at all times. On the other hand, PREEMPT_RT, a real-time preemption patch provided by Ingo Molnar and Thomas Gleixner, is a popular patch for the Linux kernel that transforms Linux into a hard real-time operating system with deterministic and predictable behavior. This patch allows nearly all of the kernel code to be preempted by higher-priority kernel threads and reduces the maximum thread switching latency, although that depends on the system, that is, on a combination of hardware and software. By way of example, not all microprocessors include a memory management unit (MMU), and even when one is present it is not always enabled. Currently, documentation is maintained on the Linux Foundation Wiki [1]. In addition, many other kernel developers and real-time experts have made significant contributions to the development of this patch.
Open source operating systems such as Linux continue to evolve and have a significant impact in many embedded systems for control applications. In particular, embedded systems with real-time support are employed by a wide variety of applications ranging from simple consumer electronics and home appliances to military weapons and space systems [2]. The fast growth of the Industrial Internet of Things (IIoT) is accelerating the move towards open source Linux in the embedded market. The increasing requirements on the performance of real-time applications, and the need to reduce development costs and time, have led to increased interest in employing COTS (commercial off-the-shelf) hardware and software components in real-time applications [3][4][5][6]. However, their reliable real-time performance is still under investigation, and this is also an objective of this research. The real-time measurements focus on the real-time capabilities provided by the PREEMPT_RT patch in handling real-time tasks and operations in user and kernel space. The experimental measurement platform is based upon ARM-based embedded devices, such as Raspberry Pi (a Raspberry Pi3, referred to from now on as RPi3) and BeagleBoard microcontroller (a BeagleBone Black, referred to from now on as BBB), running in a master-slave mode. Raspberry Pi was designed as an educational and experimental board [7]. However, it has already made the leap into industry, e.g., with the Compute Module 3 (CM3) intended for industrial applications. Raspberry Pi applications are now part of Industry 4.0 and the Internet of Things, e.g., the JanzTec emPC-A/RPI3+ Industrial Controller [8] and the Kunbus Revolution Pi [9].
The Linux kernel distributions for RPis and BBBs do not currently have any hard real-time support. Therefore, the use of Linux for hard real-time applications on these boards is an issue under investigation. However, it is possible to patch them with PREEMPT_RT, and hopefully in the future it will be provided as the de facto standard option in the mainline kernel for such microcontrollers. For the purposes of this research, specific software modules were developed and applied to investigate and evaluate the real-time performance of Linux kernels patched with PREEMPT_RT. Standard benchmark tools such as cyclictest [10] could have also been used. However, this benchmark is difficult to extend and does not combine different types of operations [11]. As a result, taking into consideration the specific hardware platforms under investigation and the targeted real-time applications with certain constraints and requirements (e.g., high priorities, locking of memory pages, high-resolution timers, and specific metrics measurements), building our own new measurement software was judged to be the best approach. The locking of memory pages is essential in order to avoid page faults and even thrashing. However, this is an issue which needs further investigation to ensure optimal memory usage. An interesting approach currently under investigation is provided by Reuven and Wiseman [12], specifically for systems with very heavy memory usage; it proposes thrashing minimization by splitting the processes into a number of bins, using Bin Packing approximation algorithms.
Development platforms such as Raspberry Pi and BeagleBone are being extensively used in IoT embedded applications, and even in Industrial IoT. Although their Linux kernel distributions do not have any hard real-time support, this is possible with the installation and configuration of the PREEMPT_RT patch. However, there is still insufficient research on the evaluation of the real-time performance of Linux kernels patched with PREEMPT_RT on such development platforms. This was one of the major motivations to investigate the behavior of real-time Linux kernels with the real-time preemption patch.
This work provides experimental results on real-time latency metrics for Linux kernels patched with PREEMPT_RT, on Raspberry Pi3 and BeagleBone Black development boards. Response and periodic task models were introduced, upon which novel software real-time measurement modules were designed. These modules take into consideration specific critical real-time requirements, e.g., high priorities, locking of memory pages, and high-resolution timers. In the majority of measurement cases, the worst-case maximum latency decreased to values in the order of a few tens of microseconds. One of the key findings is that a value of about 160 µs, as an upper bound, could be an acceptable safety margin for such low frequencies in many real-time systems running in a master-slave mode. Although that is a general outcome of several measurements under this specific master-slave schema, it provides some evidence that it could be valid for real-time embedded systems based on such devices and connected to various kinds of actuators, which require fast response times below this threshold value. Taking into account that the real-time capabilities of the PREEMPT_RT patch and the Linux mainline kernel continue to evolve, together with other constant improvements of ARM-based microcontrollers, both in terms of hardware and software, such systems can be another candidate for compute-intensive tasks in hard real-time applications. Some of the important aspects and outcomes of this research work are the following:

• Extends the measurements methodology presented in previous work by Brown and Martin [13], by introducing new sets of experiments with additional measurement metrics, applicable in a wider range of Linux kernels and distributions in ARM-based platforms.
• Implements latency measurements based on software real-time measurement modules, designed upon response and periodic task models.
• The same performance measurements approach and evaluation methodology could be applied and deployed on other Linux-based boards.
This paper is structured as follows: Section 2 describes previous related work; Section 3 presents the key components of the methodology followed; Section 4 describes the performance measurement algorithms and modules developed; Section 5 presents the setup of the experimentation platform; Sections 6 and 7 present the results of the experimental measurements of response and periodic tasks in user and kernel space; Section 8 discusses the research findings; Section 9 provides a summary of the research results and draws conclusions.

Related Work
The real-time performance of operating systems and applications is analyzed with many different approaches [14][15][16]. The tools and methods used rely upon the performance metrics targeted, most commonly schedulability issues in real-time systems [17,18]. There is also a number of interesting schedulability analysis tools, e.g., RTDruid, TimeWiz, symTA/S, and chronVAL. Many different scheduling policies and algorithms exist, but not all of them are adequate for real-time tasks. Scheduling policies for real-time systems should ensure a number of factors, including first and foremost the timely response to critical events, low task switching and interrupt latency, low worst-case execution times, allowing for the preemption of any kind of task in the system, etc. In real-time operating systems, the methods and approaches used have to guarantee that certain deadlines are always met. The methods used to investigate predictability and timing characteristics of such systems typically measure scheduling jitter and interrupt latency with benchmark tools [19,20]. Tracing tools are also being used to identify latency issues [21,22]. Other approaches introduce the design and development of new benchmarks and software modules that investigate performance metrics of real-time operating systems [23,24].
The approach followed in this research work is based on software test modules, developed particularly for latency performance measurements in Linux kernels patched with PREEMPT_RT. A similar approach that inspired this research is presented in the work of Brown and Martin [13]. They developed a test system for evaluating the performance of two real-time tasks on Linux and Xenomai systems. They compare the performance of Linux kernels with real-time support such as Xenomai and the PREEMPT_RT patch, using C software modules to perform timing measurements of responsive and periodic tasks, with real-time characteristics, at user and kernel space. However, their evaluation is based only on a BeagleBoard microcontroller and Ubuntu Lucid Linux kernel configuration.
It is worth the effort to run a real-time kernel and evaluate its potential and performance benefits for applications. The advantages of using a real-time kernel are presented in many cases. However, performance evaluation of different kernel versions with real-time support has been presented primarily on Intel x86 platforms [25]. In the work of Litayem and Saoud [26], the authors evaluate the timing performance (latency) and throughput of PREEMPT_RT with different kernel versions, using cyclictest and unixbench. The platform is an x86 computer with an Intel Core 2 Duo CPU, running Ubuntu Linux 10.10. In the work of Fayyad-Kazan et al. [27], the authors present experimental measurements and tests that benchmark RTOSs such as Linux with PREEMPT_RT (v3.6.6-rt17) against two commercial ones, QNX and Windows Embedded Compact 7. The tests were executed on an x86 platform (ATOM processor). In the work of Cerqueira and Brandenburg [20], a comparison of scheduling latency in Linux, PREEMPT_RT, and LITMUS^RT is presented, based again on a 16-core Intel CPU platform. The majority of these works rely upon x86-based computer platforms with Ubuntu Linux.
The open source code accessibility and portability, the amount of implemented algorithms and libraries have made Linux with PREEMPT_RT a strong alternative to commercial RTOSs and specialized approaches, also in industrial environments [28]. Other research articles have recently focused on latency measurements of Raspbian Linux with real-time patch PREEMPT_RT vs. the standard Raspbian [29,30]. However, measurements are performed only with Raspbian Linux and the cyclictest benchmark.
The latest research shows that such Linux-based embedded systems play an important role in nearly every aspect of modern life, particularly in real-time control systems [31][32][33]. However, there is still insufficient research on the evaluation of the real-time performance of Linux kernels patched with PREEMPT_RT on Raspberry Pi and BeagleBone Black development platforms. Their low cost, open source design, and ease of integration with various peripherals make these development platforms appropriate for research in various fields, particularly in embedded control systems, robotics, smart cities, sensor systems, and for fast experimentation and prototyping in manufacturing [34][35][36].
This research focuses on Linux latency measurements, aiming to find out how the real-time patch affects real-time performance. In contrast to many of the above works, the experimental platform includes multiple Linux kernels and distributions. This experimental study of the latency performance of Linux kernels patched with PREEMPT_RT adds to the knowledge and understanding of real-time execution behavior on such platforms.

Objectives
One of the major goals of this research is to measure the real-time responses of Linux kernels and variants in ARM-based development platforms with the real-time preemption patch PREEMPT_RT. This goal is addressed by creating new software multithreaded modules in C, which implement the proposed measurement algorithms. These modules provide the ability to observe the execution state of multiple parameters including response latency during the real-time tasks execution.

Design Methodology
The research methodology is based upon two simple task models, the periodic task model and the sporadic task model [37]. In the periodic task model, the tasks of a job arrive strictly periodically, separated by a fixed time interval. In the sporadic task model, each task may arrive at any time once a minimum interarrival time has elapsed since the arrival of the previous task. This is because real-time tasks are usually activated in response to external events (e.g., upon sensor triggering) or by periodic timer expirations.
In this research, we introduce a response task model. In a periodic task model, each invocation of a task arrives strictly periodically, separated by a fixed time interval. In the proposed response task model, each task may arrive at any time after the arrival of the previous task. Each task τ_i is characterized by its execution time relative to a deadline t_i, its maximum (or worst-case) response latency wcrl_i, and its minimum interval time t_irv. A task's worst-case response latency wcrl_i is defined as the overall time elapsed from the arrival of this task (timer interrupt) to the moment this task is switched to a running state producing results. The models' structure is described in the algorithms provided below in Section 4 and implemented as the measurement software modules. In the experiments, each task τ_i is scheduled using the highest real-time priority to eliminate the latency caused by scheduling jitter. Each module executes the measurement loop, based on timing data acquired from the device under test, performs analysis of the measurements, and outputs the results. The experiments with the software modules were executed multiple times to obtain the following measurements:

• The optimum sustained interrupt frequency, that is, the maximum frequency of the signal on the associated GPIO (General Purpose Input/Output) line that the device can handle efficiently when running in continuous mode.
• The response latency, that is, the estimated time elapsed between the GPIO input level change (IRQ trigger, i.e., interrupt request) and the GPIO output level change.
• In response tasks, the total time elapsed until the device under test responds; in periodic tasks, whether the slave device responds at the proper time periods.
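To illustrate how a set of recorded response latencies could be reduced to the worst-case value discussed above, the following is a minimal sketch in C. The names latency_stats and summarize_latencies are hypothetical and are not taken from the measurement modules of this work; the sketch only assumes that each latency sample is already available in nanoseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of a series of response-latency samples (hypothetical type). */
struct latency_stats {
    int64_t min_ns;   /* best-case latency */
    int64_t max_ns;   /* worst-case response latency */
    double  mean_ns;  /* average latency */
};

/* Reduce n latency samples (in nanoseconds) to min, max, and mean. */
struct latency_stats summarize_latencies(const int64_t *samples_ns, size_t n)
{
    assert(samples_ns != NULL && n > 0);

    struct latency_stats s = { samples_ns[0], samples_ns[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (samples_ns[i] < s.min_ns)
            s.min_ns = samples_ns[i];
        if (samples_ns[i] > s.max_ns)
            s.max_ns = samples_ns[i];
        sum += (double)samples_ns[i];
    }
    s.mean_ns = sum / (double)n;
    return s;
}
```

In such a sketch, max_ns plays the role of the worst-case response latency of a task, and it is the value that would be compared against an upper bound such as the 160 µs margin discussed in this paper.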

Measurements Software Design Considerations
Real-time multithreaded modules were executed under two modes, in user and kernel space. A thread is a basic unit of CPU utilization, which can be implemented in user space or in kernel space. These multithreaded applications perform the proposed response and periodic real-time tasks. Processes are scheduled under the real-time policy SCHED_FIFO, with a sched_priority value in the range of 1 (low) to 99 (high). This ensures timely execution of the tasks and decreased execution times and latencies. The SCHED_FIFO policy runs a task until it blocks, yields, or is preempted by a higher-priority task. Together with PREEMPT_RT, this may not contribute to the overall throughput, but it increases determinism, because nearly all kernel code becomes preemptible once interrupt handlers are switched to threaded interrupts. Scheduling policy, attributes, and priorities were also set per thread upon creation, using the POSIX thread scheduling function calls. The design approach is illustrated in Figure 1.
The software modules are designed in a master-slave mode and perform measurements in user and kernel space of response and periodic real-time tasks. The master software module controls the overall execution process and performs the actual measurements in user and kernel space. The slave modules run the actual tasks on the device under test and provide feedback to the master control modules. The measurements are passed as function arguments to the thread function calls during their creation (pthread_create()). The running tasks are scheduled as threads, with the real-time SCHED_FIFO scheduling policy (sched_setscheduler()) and a high priority set to 99. All user-space processes are scheduled with the real-time scheduling class SCHED_FIFO and high priorities.
From a scheduling point of view, there is no difference between the initial thread of a process, e.g., the one executing the main() function, and all additional threads created dynamically. The slave modules in kernel space are implemented as kernel modules. The initialization function (kgpio_init) in the response task uses the GPIO kernel interface and an interrupt handler function (gpio_irq_handler) to service the input changes. In the periodic task, the slave modules use a high-resolution timer (hrtimer) to produce timer-based interrupts.
It is possible to avoid intercore interferences by setting the processor affinity. However, the intention is to investigate threads execution by having more than one thread per core, and threads are allowed to migrate among all cores in RPi3 ARM CPU. This will affect scheduling latencies due to potential locks, and will add to the total response latency.
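The per-thread scheduling setup described in this section can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name make_fifo_attr is hypothetical and does not come from the modules of this work, and only standard POSIX thread attribute calls are used.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Prepare thread attributes so that a thread created with them runs under
 * SCHED_FIFO at the given priority (1..99 on Linux) instead of inheriting
 * the scheduling settings of its creator.
 * Returns 0 on success or a POSIX error number. */
int make_fifo_attr(pthread_attr_t *attr, int priority)
{
    assert(attr != NULL);

    struct sched_param param = { .sched_priority = priority };
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

    /* Without PTHREAD_EXPLICIT_SCHED, the policy and priority set below
     * would be ignored when the thread is created. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    if (rc == 0)
        rc = pthread_attr_setschedparam(attr, &param);

    if (rc != 0)
        pthread_attr_destroy(attr);
    return rc;
}
```

The prepared attribute object would then be passed to pthread_create() together with the measurement arguments. Note that creating a SCHED_FIFO thread normally requires root privileges or the CAP_SYS_NICE capability, so pthread_create() may fail with EPERM on an unprivileged account even though the attribute setup itself succeeds.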

Performance Measurements Modules
Usually, worst-case execution time (WCET) analysis is mandatory for evaluating the latency performance of hard real-time systems. Using the software modules developed, experimental tests were run for a long duration (about 1 h per test run) to evaluate the latency that occurred in real-time task execution on the Linux kernels patched with PREEMPT_RT and on the standard ones. Measurements were conducted in user and kernel space.
The software runs in a master-slave mode. A synopsis of the overall software control flow is presented in Figure 2.
The master software module performs initializations, sets the scheduling policy and events to poll, and triggers the device under test (writes GPIO output, gets clock time and polls input). In user space, the slave software in response mode polls the input and writes the output accordingly, while in periodic mode it reads the timer until the time interval elapsed and writes the output. In kernel space, the slave software (as a kernel module) in response mode services the interrupt by getting the input value and setting the output, while in periodic mode it services the interrupt starting the high-resolution timer and returns. Once the desired number of loops is reached, the master software module performs metrics calculations and outputs the results.

Response Task Modules
The response task modules, based on the measurements analysis presented earlier, invoke code that measures the responsiveness of the real-time applications in user and kernel space. The measurements software module in the master device performs the overall control of execution and metrics measurements at user and kernel space. This module triggers the slave device at specific and random time intervals in a loop for a number of iterations (1 M), and measures the time elapsed (latency) until the slave device under test responds. In user space, in the slave device, the software module responds to GPIO toggle frequency (e.g., 10 kHz) in an asynchronous manner by activating a GPIO output, as soon as the level of a GPIO input changes. In kernel space, in the slave device the software module is inserted into the slave's kernel as a loadable kernel module. This module uses an interrupt handler function (only the top-half) to service the input change.

Master and Slave Response Tasks in User and Kernel Space
The Linux kernel has a way to expose internal structures using SysFS, a virtual file system which exposes a common interface for kernel implementation details and internal structures. The software modules make use of the kernel's SysFS interface. The algorithms that describe the basic functionality of these modules are shown as pseudocode in Algorithms 1-3.

Algorithm 1
Master response task in user and kernel space
scheduling is SCHED_FIFO at priority 99 ← set thread's scheduling algorithm to real-time
events is POLLIN or POLLPRI ← set the events to poll until there is data to read
loops ← set by command line argument
no_of_iterations is below or equal to loops
while no_of_iterations is below or equal to loops, do
    setting ← 1
    write fd_output setting ← set the value of output pin that triggers the slave
    clock_gettime begin_time
    poll fd_input for events ← await for interrupt infinitely
    clock_gettime end_time
    read fd_input ← read input once enabled by the slave
    setting ← 0
end
perform measurements

Algorithm 2
Slave response task in user space
scheduling is SCHED_FIFO at priority 99 ← set thread's scheduling algorithm to real-time
events is POLLPRI ← set the events to poll until there is data to read
while 1 do
    read fd_input ← read input once enabled by the master
    poll fd_input for events ← await for interrupt infinitely
    write fd_output setting ← set the value of output pin accordingly (to 0 or 1)
end

Algorithm 3
Slave response task in kernel space
function kgpio_init ← uses the GPIO kernel interface
    gpio_request gpio_out ← request GPIO output
    gpio_direction output ← set up as output
    gpio_request gpio_in ← request GPIO input
    gpio_direction input ← set up as input
    gpio_to_irq irqNumber ← maps GPIO to IRQ number
    irq_request irq_handler ← request an interrupt line
end function kgpio_init
function gpio_irq_handler ← uses an interrupt handler function (only the top-half) to service the input change
    gpio_get_value gpio_in ← gets GPIO input value
    gpio_set_value gpio_out to gpio_in ← sets GPIO output accordingly
    return IRQ_HANDLED ← interrupt serviced
end function gpio_irq_handler
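One iteration of the master loop in Algorithm 1 could look roughly like the following C sketch. The function names timespec_diff_ns and measure_response_ns are hypothetical, and the descriptors are assumed to be already opened by the caller (for instance on SysFS GPIO value files). The set of poll events is left as a parameter, since a SysFS GPIO file signals an edge as POLLPRI, whereas other descriptors such as pipes signal readable data as POLLIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Elapsed time from begin to end, in nanoseconds. */
int64_t timespec_diff_ns(struct timespec begin, struct timespec end)
{
    return (int64_t)(end.tv_sec - begin.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - begin.tv_nsec);
}

/* Trigger the slave through fd_output, wait until fd_input reports one of
 * the requested poll events, and return the elapsed time in nanoseconds.
 * Returns -1 on error. */
int64_t measure_response_ns(int fd_output, int fd_input, short events)
{
    assert(fd_output >= 0 && fd_input >= 0);

    struct pollfd pfd = { .fd = fd_input, .events = events };
    struct timespec begin, end;
    char value;

    if (write(fd_output, "1", 1) != 1)        /* raise the trigger line */
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &begin) != 0)
        return -1;
    if (poll(&pfd, 1, -1) < 0)                /* wait without a timeout */
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;

    /* A SysFS value file is re-read from offset 0; on descriptors that
     * cannot seek (e.g., pipes) the call fails and is simply ignored. */
    (void)lseek(fd_input, 0, SEEK_SET);
    if (read(fd_input, &value, 1) < 0)        /* consume the slave's reply */
        return -1;

    return timespec_diff_ns(begin, end);
}
```

Calling this function in a loop and storing each returned value would yield the latency samples on which the final metrics are computed. The sketch omits resetting the output to 0 at the end of each iteration, which Algorithm 1 performs through the setting variable.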

Periodic Task Modules
The purpose of the periodic task modules is to service certain interrupts periodically, at a specific interval. The master control software monitors whether the slave device under test responds at the proper periods in user and kernel space measurements. The slave device responds to the interrupts by toggling the value (0, 1) of an output pin, at specific time intervals, based on an internal timer. In kernel space, the slave's software, which is inserted as a kernel module, uses an internal high-resolution timer. Because a periodic timer interrupt is not an appropriate solution for a real-time kernel, most of the existing real-time kernels provide high-resolution timers [38][39][40][41]. Since hard real-time systems usually have timing constraints in the microsecond range, a high-resolution timer is usually a requirement when a task needs to occur more frequently than the 1 millisecond resolution provided under Linux.

Master and Slave Periodic Tasks in User and Kernel Space
Reliable latency performance measurements require an accurate timing source. For this reason, the performance measurement software makes use of the system call clock_gettime() with the highest possible resolution, and the clock is set to CLOCK_MONOTONIC. The master control software reads the slave's input for the corresponding interrupts and measures the time interval in between (half period). The slave's software module uses, again, a high-resolution timer to produce timer-based interrupts. The algorithms that describe the basic functionality of these modules are shown as pseudocode in Algorithms 4 and 5.

Algorithm 4
Slave periodic task in user space
timerfd_create is CLOCK_MONOTONIC ← set the clock to mark the timer's progress
timerfd_settime is ABSTIME ← start the timer
semi_period_interval ← set by command line argument
no_of_iterations is below or equal to semi_period_interval
while no_of_iterations is below or equal to semi_period_interval, do
    read timer_fd ← read the timer until the time interval is elapsed
    write fd_output setting ← set the value of output pin accordingly (to 0 or 1)
end

Algorithm 5
Slave periodic task in kernel space
function kgpio_init ← uses an internal high-resolution timer
    hr_timer_init high_res_timer
    hr_timer_set CLOCK_MONOTONIC
    hr_timer_mode HRTIMER_MODE_REL
    hr_timer_function timer_func
end function kgpio_init
function gpio_irq_handler ← the GPIO IRQ handler function
    hrtimer_start high_res_timer ← starts high-resolution timer
    return IRQ_HANDLED ← interrupt serviced
end function gpio_irq_handler
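The user-space periodic task of Algorithm 4 could be sketched in C along the following lines. The function name periodic_toggle and its parameters are hypothetical; the sketch assumes the Linux timerfd interface, an output descriptor already opened by the caller, and a half period given in nanoseconds. It is an illustration only, not the code of the measurement modules.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Toggle fd_out between '1' and '0' every half_period_ns nanoseconds,
 * for the given number of iterations. Returns the number of toggles
 * written, or -1 if the timer could not be set up. */
int periodic_toggle(int fd_out, long half_period_ns, int iterations)
{
    assert(fd_out >= 0 && half_period_ns > 0 && iterations >= 0);

    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0)
        return -1;

    struct timespec now;
    struct itimerspec its;
    clock_gettime(CLOCK_MONOTONIC, &now);

    /* Period of the timer. */
    its.it_interval.tv_sec  = half_period_ns / 1000000000L;
    its.it_interval.tv_nsec = half_period_ns % 1000000000L;

    /* First expiry as an absolute time: now + one half period. */
    its.it_value.tv_sec  = now.tv_sec + its.it_interval.tv_sec;
    its.it_value.tv_nsec = now.tv_nsec + its.it_interval.tv_nsec;
    if (its.it_value.tv_nsec >= 1000000000L) {
        its.it_value.tv_sec  += 1;
        its.it_value.tv_nsec -= 1000000000L;
    }

    if (timerfd_settime(tfd, TFD_TIMER_ABSTIME, &its, NULL) < 0) {
        close(tfd);
        return -1;
    }

    int level = 0;
    int done = 0;
    while (done < iterations) {
        uint64_t expirations;  /* expiries since the last read (ignored here) */
        if (read(tfd, &expirations, sizeof expirations)
                != (ssize_t)sizeof expirations)
            break;             /* blocks until the next expiry */
        level = !level;
        char c = level ? '1' : '0';
        if (write(fd_out, &c, 1) != 1)
            break;
        done++;
    }

    close(tfd);
    return done;
}
```

The blocking read() on the timer descriptor is what paces the loop. A value of expirations greater than 1 would indicate that one or more periods were missed, which a fuller implementation might record as an overrun.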

Issues Solved
Real-time metrics measurements depend upon how well the software or benchmarking modules are written, as well as how well the kernel is configured. Comparing the performance of a real-time application running on different systems is a challenge, mainly because of the difficulty of isolating the various factors that may affect performance. That usually implies configuring the kernel and adapting the source software to the native kernel of each system. Optimal decisions also have to be made on how to set various settings related, e.g., to memory management mode, system timers, peripheral device configuration, etc., since they can make a huge difference to the latencies of a given system. During the experimental work, a few problems were encountered and solved, meaning that there are still issues to be considered and improved in real-time support with PREEMPT_RT. In some cases, long latencies were due to the use of timer functions other than clock_gettime() or clock_nanosleep() in time measurements. In another case, it was observed that during the experimental runs, after a few minutes the RPi became unstable and the system had to be restarted. In particular, the FIQ (Fast Interrupt reQuest) system implementation causes lock-ups of the RPi when using threaded interrupts. A solution to this problem is proposed by the Open Source Automation Development Lab (OSADL) [42], which disables the IRQ while the FIQ spin lock is held, and with it the kernel indeed ran stably. As the Linux Foundation points out [1], since the kernel of Raspberry Pi is not part of the mainline, there are some known limitations of PREEMPT_RT running on RPi platforms.

Experimental Setup
A Raspberry Pi3 (RPi3) is used as the master device in all measurement schemes. The RPi3 has integrated a System on Chip (SoC) based on Broadcom BCM2837, which features a 1.2 GHz 64-bit quad-core ARM Cortex-A53 (ARMv8) processor. The BeagleBone Black development board features a 1 GHz ARM Cortex-A8 (ARMv7) processor based on TI Sitara AM3358AZCZ100 SoC from Texas Instruments. Both the devices are low-cost and low-power single-board computers, commonly used as development platforms for various system applications, specifically for embedded systems.
The slave devices (RPi3 and BBB) can communicate and transfer data to and from the master device using the standard GPIO interface. The master and slave arrangements are shown in Figure 3.
The master and slave devices are connected through GPIOs in a master-slave schema, as illustrated in Figure 4. For RPi3, GPIO27 (pin 13) in the slave device is defined as input and connected to GPIO17 (pin 11) defined as output in the master device. For BBB, GPIO1_13 (pin 11) is defined as input and GPIO1_16 (pin 15) as an output. The connections establish a pin-to-pin bidirectional communication, so the same connection is applied, respectively, in reverse directions from the slave devices to the master.
Standard Linux kernel configurations and kernels with real-time support were installed and configured (on different microSD cards) on the slave devices under test. These include: Ubuntu Mate (4.14.74-rt44-v7), Arch Linux (4.19.10-1-ARCH), and Debian (4.19.67-2).
The developed software measurement modules provide consistent and reliable results based on multiple experiments. These were visualized and validated with an oscilloscope. In particular, the latency measurements obtained internally by the software modules are compared to those directly measured externally with an oscilloscope.

Response Task Measurements in User Space
At specific time intervals, the master control software runs a task τ_i that triggers a GPIO input on the slave device with alternating loops of "0s" and "1s"; the slave polls this input in an infinite loop. The master then begins to measure the slave's response delay and accumulates the relevant measurement metrics. The slave device, upon reading the change of the input state, sets its output accordingly (on a rising edge it sets its output line, while on a falling edge it clears its output line). The master device repeats the loop for a number of cycles (1 million loops) so that sufficient samples are collected for analysis. The variation in the input signal level (values of 0s and 1s) provides a way to check that the devices under test read the input signals correctly and respond accordingly. Measurements are performed on both edges, rising and falling, of the trigger signals, as shown in Figure 5.
Each running task τ_i runs two loops, and thus consists of two subtasks (loops of "1s" and "0s") that are executed sequentially and alternately. The total execution time includes the execution of all the subtasks (that is, t_lat1 for the loop of "1s" and t_lat0 for the loop of "0s") times the number of iterations, plus the time interval t_irv between the generated subtasks. In the experiments, the time interval t_irv was initially unset and random; later, for efficiency purposes, it was set to specific values within the range of 1 to 10 ms. This is because, for lower time intervals, long delays were sometimes observed in the latency measurements. Although rare, such delays make it apparent that the devices could not react properly at such frequencies.
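The slave-side behavior described above (poll the input, set the output on a rising edge, clear it on a falling edge) can be sketched as follows. This is only an illustrative outline under stated assumptions: the GPIO accessors are passed in as function pointers, and all names (gpio_read_fn, gpio_write_fn, respond_to_edges, sim_line) are hypothetical rather than part of the developed modules. A small simulated line is included so that the loop can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical GPIO accessors; a real build would bind these to the board. */
typedef int  (*gpio_read_fn)(void *ctx);
typedef void (*gpio_write_fn)(void *ctx, int level);

/*
 * Poll the input line and mirror every detected edge on the output line:
 * the output is set on a rising edge and cleared on a falling edge.
 * Stops after max_edges answered edges or max_polls reads, whichever
 * comes first, and returns the number of edges answered.
 */
size_t respond_to_edges(gpio_read_fn read_in, gpio_write_fn write_out,
                        void *ctx, int initial_level,
                        size_t max_edges, size_t max_polls)
{
    assert(read_in != NULL && write_out != NULL);

    int last = initial_level ? 1 : 0;
    size_t edges = 0;

    for (size_t polls = 0; polls < max_polls && edges < max_edges; polls++) {
        int level = read_in(ctx) ? 1 : 0;
        if (level != last) {          /* edge detected */
            write_out(ctx, level);    /* 1 on rising, 0 on falling */
            last = level;
            edges++;
        }
    }
    return edges;
}

/* Simulated line: replays a recorded input trace and records the output. */
struct sim_line {
    const int *trace;   /* input levels returned by successive reads */
    size_t len;         /* number of entries in trace */
    size_t pos;         /* next entry to return */
    int out;            /* last level written to the output */
    size_t writes;      /* number of output writes */
};

int sim_read(void *ctx)
{
    struct sim_line *s = ctx;
    if (s->pos < s->len)
        return s->trace[s->pos++];
    return s->len ? s->trace[s->len - 1] : 0;   /* hold the last level */
}

void sim_write(void *ctx, int level)
{
    struct sim_line *s = ctx;
    s->out = level;
    s->writes++;
}
```

The loop is bounded by max_polls only so that it terminates in a simulation; the polling described in the text runs indefinitely.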

Estimation of Maximum Sustained Frequency
The time interval between two consecutive generated interrupts is estimated so that subtasks are properly initialized and executed. For this purpose, a number of tests were executed with variable frequency values to determine the optimum value for the time interval between the generated interrupts at the master device. The results show that the slave devices with PREEMPT_RT can handle all the generated interrupts if the time interval in between is above 10 ms. This value was used for the majority of the experiments; lower values were used for testing and sensitivity analysis purposes. This means that the state of a GPIO pin can be toggled at a low frequency, e.g., 1000 Hz, for a million (1M) interrupts with reliable responses.
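The relation between the trigger interval, the corresponding trigger frequency, and the duration of a run can be made explicit with a small helper. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the run time is only a lower bound because the response latencies themselves are not added.

```c
#include <assert.h>

/* Trigger frequency in Hz for a given interval in microseconds. */
double trigger_frequency_hz(double interval_us)
{
    assert(interval_us > 0.0);
    return 1e6 / interval_us;
}

/*
 * Lower bound on the run time in seconds: number of cycles times the
 * trigger interval. Response latencies are not included.
 */
double min_run_time_s(unsigned long cycles, double interval_us)
{
    assert(interval_us > 0.0);
    return (double)cycles * interval_us / 1e6;
}
```

Over the 1 to 10 ms range of intervals, trigger_frequency_hz() yields 1000 Hz down to 100 Hz. For one million cycles at a 10 ms interval, min_run_time_s() gives 10,000 s, that is, roughly 2.8 h.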

Response Latency Measurements
The importance of measuring the response latency is unquestionable in real-time systems. The slave devices were tested continuously by circulating the loops of "1 s" and "0 s" for a million (1M) interrupts, with an average cycle duration of 120 µs, and an overall running time of about 3 h. At the end of each measurement cycle, the master control software processes the results and estimates the mean, minimum, and the maximum response latency, plus some statistics on variance and standard deviation.
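The summary metrics listed above can be accumulated sample by sample without storing the full series. The following sketch uses Welford's online algorithm, which is a choice made for this example and not necessarily the method used by the master control software; the type and function names are hypothetical, and the use of the sample variance (divisor n - 1) is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Running summary of latency samples (all values in microseconds). */
struct latency_stats {
    size_t n;      /* number of samples seen */
    double min;    /* smallest sample */
    double max;    /* largest sample */
    double mean;   /* running mean */
    double m2;     /* sum of squared deviations from the running mean */
};

void stats_init(struct latency_stats *s)
{
    assert(s != NULL);
    s->n = 0;
    s->min = 0.0;
    s->max = 0.0;
    s->mean = 0.0;
    s->m2 = 0.0;
}

/* Add one latency sample using Welford's update. */
void stats_add(struct latency_stats *s, double us)
{
    assert(s != NULL);
    if (s->n == 0 || us < s->min)
        s->min = us;
    if (s->n == 0 || us > s->max)
        s->max = us;

    s->n++;
    double delta = us - s->mean;
    s->mean += delta / (double)s->n;
    s->m2 += delta * (us - s->mean);
}

/* Sample variance (divisor n - 1); returns 0 for fewer than two samples. */
double stats_variance(const struct latency_stats *s)
{
    assert(s != NULL);
    return s->n > 1 ? s->m2 / (double)(s->n - 1) : 0.0;
}
```

The standard deviation is the square root of the value returned by stats_variance(), e.g., via sqrt() from <math.h>.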

Response Task Measurements in Kernel Space
In kernel space, experimentation is conducted in a similar way. The master control software initiates the triggering cycles at specific intervals, which the slave's software module is polling in an infinite loop, and responds once a change of the input state is detected. The slave's software in this case is a kernel module developed for this purpose and inserted in the kernel.

Periodic Task Measurements in User and Kernel Space
In periodic measurements, the master device measures the period of the signal produced by the slave's internal timer. The master software module checks the slave's output status at specific time periods in order to verify that the device responds at the proper periods, and at the same time to investigate the conditions under which the slave device cannot react properly.
In user space, the master device polls the slave device in an infinite loop until its GPIO input status changes (rising edge of the first interrupt). The slave device, in turn, periodically toggles the value of a pin configured as output at a fixed rate, based on an internal timer. The master control software then counts the time until its input status changes again (falling edge of the second interrupt) (Figure 6).
In kernel space, the experimental setup and layout of the devices is the same as described earlier. The master device performs the measurements in a similar way to the user space experimentations. However, in this case, the slave's control software is a kernel module that uses an internal high-resolution timer to produce the periodic interrupts.
Measurements are performed on both edges of the triggering signals for a variable number of samplings starting at 10,000 and decreasing, with a semi-period at 15,000 µs (down to a 1500 µs period). The results show that the slave devices generate the timer interrupts at exact time intervals, both at standard Linux kernels and with real-time support.
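One way to check on the master side that the timer interrupts arrive at the expected intervals is to compare consecutive edge timestamps with the nominal interval. The sketch below assumes that the edge timestamps have already been captured into an array of nanosecond values; the function name and the data layout are hypothetical and are not taken from the developed modules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Largest absolute deviation, in nanoseconds, of the measured
 * edge-to-edge intervals from the nominal interval.
 * edge_ns holds n_edges timestamps in increasing order.
 */
int64_t max_period_deviation_ns(const int64_t *edge_ns, size_t n_edges,
                                int64_t nominal_ns)
{
    assert(edge_ns != NULL && n_edges >= 2);

    int64_t worst = 0;
    for (size_t i = 1; i < n_edges; i++) {
        int64_t dev = (edge_ns[i] - edge_ns[i - 1]) - nominal_ns;
        if (dev < 0)
            dev = -dev;
        if (dev > worst)
            worst = dev;
    }
    return worst;
}
```

With a 15,000 µs semi-period, nominal_ns would be 15,000,000; a return value close to zero indicates that the slave generates its edges at the expected intervals.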

Results and Discussion
Linux-based platforms on ARM-based devices such as the Raspberry Pi3 and BeagleBone Black are continuously gaining popularity as embedded systems in various standalone control applications. However, many of the approaches to measuring their real-time performance, and particularly latency, are still based on x86 CPU architectures and the use of benchmark tools such as cyclictest. Regarding latency measurements, very few works, such as the work of Brown and Martin [13], proceed to the development of specific software measurement modules for such ARM-based devices. Their research inspired this work to extend the measurement metrics to a wider range of Linux kernels and distributions on such ARM-based embedded platforms. There are software structure similarities in both approaches; however, the hardware development platforms and Linux kernel versions are different. On the other hand, both hardware platforms are based on ARM CPU architectures running Ubuntu, among other Linux distributions. Table 1 provides a summary of the results obtained in both approaches for Ubuntu Linux distributions and kernels with real-time support. Even though the results are very close, the intention is to reconfirm the results obtained with PREEMPT_RT rather than to provide a fair comparison, since the kernel versions are significantly different. In RPi3 and BeagleBone Black with PREEMPT_RT-patched kernels, the minimum latency is measured below 50 µs, both in user and kernel space. In user space, 90% of the latencies fall below the maximum of 147 µs and 160 µs, respectively, while in kernel space, 95% of the latencies fall below the maximum of 67 µs and 76 µs, respectively. In BeagleBoard C4, in user space, for 95% of the time the maximum latency does not exceed the value of 157 µs, while in kernel space, this value is lower at 43 µs. Figure 7 illustrates the above results for both approaches in all devices.
Table 2 presents a comparative summary of the response latency results for Raspberry Pi3 and BeagleBone Black running Linux kernels with the PREEMPT_RT patch. Data on the most commonly used measures of spread, that is, variance (var) and standard deviation (stdev), are also given. The specified kernel versions are different in order to investigate the spread of the variations in latency results. Linux kernels with real-time support maintain much lower latencies. The oscilloscope measurements reconfirm the results produced with the software measurement modules.

Response Latency Results
The majority of Linux kernels' measurements with PREEMPT_RT-patched kernel show the minimum response latency to be below 50 µs, both in user and kernel space. The maximum worst-case response latency (wcrl) reached 147 µs for RPi3 and 160 µs for BBB in user space, and 67 µs and 76 µs, respectively, in kernel space (average values). Most of the latencies are quite below this maximum (90% and 95%, respectively, for user space and kernel space). In general, it seems that maximal latencies do not often cross these values.
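Statements of the form "90% of the latencies fall below a given value" can be checked directly against the recorded samples. The following sketch computes the fraction of samples that stay at or below a given bound; the function name is hypothetical and the sample values in the usage note are made up for illustration, not measured data.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction (0.0 to 1.0) of latency samples that are at or below bound_us. */
double fraction_within(const double *lat_us, size_t n, double bound_us)
{
    assert(lat_us != NULL && n > 0);

    size_t within = 0;
    for (size_t i = 0; i < n; i++) {
        if (lat_us[i] <= bound_us)
            within++;
    }
    return (double)within / (double)n;
}
```

For example, for the ten illustrative samples {40, 45, 50, 60, 147, 30, 35, 42, 48, 200} and a bound of 147 µs, nine samples are within the bound, so the function returns 0.9.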
The measurements in standard Linux kernels show the minimum response latency to be about the same and below 55 µs, both in user and kernel space. However, the maximum worst-case response latency reached 360 µs (RPi3) and 380 µs (BBB) in user space, and 160 µs and 86 µs, respectively, in kernel space. This maximum observed latency is significantly higher than the one observed under the PREEMPT_RT-patched Linux kernels. Figure 8 illustrates the worst-case response latencies in user space for both kernels (standard and preempted).
In real-time systems designed with tight timing constraints, these worst-case latency values must be taken into consideration.

Conclusions
This research work presents the experimental evaluations on the real-time performance of the PREEMPT_RT patch, and particularly latency metrics, in Linux kernels and distributions running on Raspberry Pi3 Model B and BeagleBone Black ARM-based development platforms. These devices have become a popular choice for a wide range of applications in many embedded systems, while being easy to use, flexible, and lower cost.
Challenges in recent real-time embedded systems, such as those found in cloud computing platforms using commercial-off-the-shelf technology, have prompted further research into their real-time behavior. However, currently, there is still limited research on investigating the real-time performance of such ARM-based architectural platforms running Linux patched with PREEMPT_RT.
This experimental work provides further insights into their real-time behavior. The performance measurement and evaluation approach is based upon the introduction of response and periodic task models implemented as new, specific real-time software measurement modules. This approach could be applied and deployed on other Linux-based development boards and platforms too. Any device that supports a Linux kernel from release 4 onward (e.g., 4.4, 4.9, 4.14, 4.19) and is configured with the PREEMPT_RT patch is an appropriate platform on which to deploy the developed measurement modules. These experimental software modules, written in C, are available as an open-source project on GitHub at https://github.com/gadam2018/RPi-BeagleBone (accessed on 10 April 2021), where further details about their installation and usage are provided. The experimental results show that latencies on kernels with real-time support are considerably lower compared to those in the standard kernels, and the majority fall below 50 µs. The average maximum observed latency of 160 µs is still significantly lower than the one observed under the standard Linux kernels. As an outcome, Linux kernels patched with PREEMPT_RT on such devices have the ability to run in a deterministic way as long as a latency value of about 160 µs, as an upper bound, is an acceptable safety margin. Such results reconfirm the reliability of such COTS devices running Linux with real-time support and extend their life cycle for the running applications. In addition, such results could further stimulate the use of these devices in the development of architectural frameworks and systems for reliable real-time control applications, as is the case presented here [43]. Initially, the preliminary results of this research were also utilized in the development of a real-time controller based on Raspberry Pi and kernel modules [44].