Using Graphics Processing Units and Compute Shaders in Real Time Multimodel Adaptive Robust Control

: Graphics processing units and video cards have seen a surge of usage in domains other than graphics computers, due to advances in hardware and software technologies; however, little uptake has been in the domain of systems engineering and real time control. This research article will demonstrate the use of video cards in multimodel adaptive robust control, using openGL and compute shaders. A software simulation will show the behavior of the adaptive robust multimodel control scheme as the target process is exposed to both parametric and structural disturbances and will show the viability of using graphics processing units in real time systems control.


Introduction
Adaptive-robust control [1][2][3] is a strategy used to control processes with nonlinear dynamics, as it guarantees reference tracking and rejection of disruption by reidentifying the process model and controller each time performance degradation is detected. It requires little a priori knowledge of the process and is best suited for situations where information about system dynamics is difficult to obtain or processes are subject to various types of disruptions. However, process identification and controller tuning in close loops can be difficult to implement correctly, and the mathematical complexities may lead the control system to miss the real-time sampling deadline for fast processes. In case of processes that are exposed to a many different types of disturbances repeatedly, a combination of multimodel [1,4] and adaptive robust control can be used to avoid reidentification of nominal models and tuning of robust controllers. Similar to adaptive and robust philosophies, multimodel control is an approach to dealing with perturbations and changes in the operating circumstances.
Instead of trying to tune algorithms or reidentify nominal models, multimodel control stores precalculated nominal models and controllers that match various working regimes. When perturbations occur, the change in the working regime is detected through switching algorithms. The correct controller and model to use for a specific working regime is determined through the computation of a switching criterion [5][6][7].
This control scheme works very well with systems that have many working regimes yet repetitive behavior. These are encountered in vehicles. One possible example is a vehicle's suspension system, for a particular wheel. Depending on the speed, angle of movement, type of road pavement, temperature, car weight, and its weight distribution, the system behaves differently, with a different model for each working regime. Most cars tend to follow the same routes throughout their lifetime, so the repetitive nature of the perturbations is also met, as shown in Figure 1. Another example are the components of the steering system in a plane or shuttle, such as the ailerons on wings. As the plane flies through changing atmospheric conditions and executes maneuvers, the working regime of an aileron changes, accounting for multiple possible models during a single flight trip. The detailed working model of the multimodel control scheme is shown in Figure 2. When the process is exposed to perturbations, either parametric or structural, the output value y changes. This output value y is then fed into the model selector block, along with the command value u. For each model in the model selector block, an output value yn is computed using the commands and the output errors of the real process. Finally, yn, u, y, and un are used to compute a switching criterion. Based on the switching criterion, a model that best matches the real process is chosen. The switching criterion is usually based on the delta between the real process output value y and the output value yn of the nominal model [5]. With the model determined, the appropriate control algorithm is chosen from the algorithm selector block, which is then finally used to issue commands to the real process. The multimodel part consists of a knowledge base containing previously identified nominal models and their robust controllers, matching various working regimes that are subject to different types of perturbations. Switching between the current nominal model and robust controller to one in the cache is decided by calculating a switching criterion [5], and only occurs when performance degradation is detected, a difference from the classical multimodel control philosophy, where the model switch can occur at every sampling time. If the knowledge base grows, it becomes difficult for the classical iterative approaches to compute the switching criterion for each cached nominal model and still hit the real time sampling deadline, especially when dealing with fast systems. This article presents a novel approach to fixing the large knowledge base problem, by employing a graphics processing unit (GPU) to perform the calculation of the switching criterion in highly parallelized manner [8]. As part of this new approach, the mathematical algorithms that simulate the evolution of processes (calculation of transfer function outputs in discrete form and controller command computation) have been adapted to make full use of the particularities of the GPU hardware. A software simulation was developed to showcase the results. The software can be easily modified to work with real systems.
Other approaches so far have avoided the large knowledge base issue by limiting the number of "fixed" models and initializing an adaptive model from the initial fixed model and performing continuous tuning [9,10]. This approach is accurate but slow and is therefore unsuitable for fast processes. The proposed approach is coarse but very fast and will be better suited for fast and slow processes alike.
The paper will first present the mathematical basis of the simulation, followed by a brief explanation on how GPUs were used in the application, and finally some simulation results. The simulation results will show that the algorithms work correctly. Although the point of the research is to deal with large numbers of models, the simulation will only show four models, to better explain how it works.

GPU Hardware Overview
GPUs are designed to perform many mathematical operations in parallel due to the multithreaded nature of the hardware platforms, which help efficiently generate display output images, by executing a series of operations on the input data in a process called a graphics pipeline. The graphics pipeline has several stages [11] (not all mandatory-it depends on the expected results) in which various operations can be done in the outputs from of the previous stage (which become inputs for the current stage). At each stage, scripts called shaders are used to perform programmatic manipulation of input data. Among these, pixel shaders (which apply per pixel operations over a rasterized image) were typically used to perform various general purpose mathematical operations (with some limitations). In recent times, GPU hardware and driver support have evolved to expose a new kind of shader-called compute shader [12]-which can be used to perform general purpose mathematics on large data sets in a natural fashion, very similar to how things would normally operate with a classical algorithm that runs on the central processing unit (CPU) of the computer. These shaders also fully expose the parallel, heavily multithreaded nature of the GPU hardware. Compute shaders run on virtual work groups (the shader declaration defines the size of the group), composed of small execution units called cores. Each core has floating point and an integer arithmetic unit, and each core inside a virtual group has access to the same data buffers and will perform the same instructions in the same order. Virtual groups may or may not have a 1 to 1 mapping with hardware groups. In case the mapping is not exact (i.e., the size of the virtual group is not an integer multiplier of hardware group size), then the graphics hardware is not fully used, it is up to the driver and the execution scheduler in each hardware implementation to decide what happens with the unused cores in the group. Figure 3 shows an overview of nVidia Turing GPU work group (named streaming multiprocessor by Nvidia) [13]. Each core in the work group will run a thread in the compute cluster. The compute cluster represents the total number of threads assigned by the application when invoking the compute shader. When invocation occurs, the programmer can decide how many work groups will be executed as part of the compute cluster. The total number of threads will be the product of the group size, defined by the shader declaration in source code, and the total number of groups passed to the invocation call. The maximum number of threads in a compute cluster depends on the framework being used to access the GPU hardware and the hardware specifications of each GPU. Both the size of the virtual compute group and the total number of groups dispatched can be represented in 3D vector numbers, with X, Y, and Z sizes. This helps process data that are naturally 3D, but in this research, Y and Z are always 1, as the data are naturally a 1D array, so both the group size and the number of groups are effectively 1D vectors. Attempting to map it to 3D does not offer any advantages, so a design choice was made to keep Y and Z always 1 [12].
Even though compute shaders can provide support in computing various mathematical operations needed as part of the usual graphics generating pipeline, they can be used outside of the graphics pipeline to act as powerful mathematical co-processors for the applications. This feature, however, depends on the framework being used to access the hardware. DirectX allows calling compute shaders [14,15] without an actual game window; openGL currently does not. This is not a problem in itself when it comes to industrial applications. In most cases, a process computer will have some sort of user interface that is used to show the status and evolution of the process, so the requirement of an active openGL context is already met. The calling of compute shaders will be strictly tied to the refresh rate of the graphics context. This refresh rate can be coordinated with the system sampling time, to ensure real time responsiveness, when using openGL [11]. In case of DirectX, conditions are much simpler because programmers can directly access the hardware pointers and create contexts whenever it is necessary [14][15][16].
In the case of the multimodel adaptive control philosophy presented here, each core will execute the logic needed to compute the value of a switching criterion, which is then used to execute a nominal model and controller switch in the real plant. Since a GPU boosts hundreds of cores, it is quite easy to maintain a large knowledge base of nominal models and their optimal controllers and compute the switching criterion in parallel, while still adhering to the strict sampling time requirements imposed by real time control. This is not trivially possible when using just a CPU, because it lacks the number of cores that a GPU has, and it is more likely to perform these calculations in parallel [8,17].
The simulation software used in this research was developed in C#, using Visual Studio 2019, and targets the net framework 4.7.2. The application uses openTk nuget package to interop with openGL, and allow calling openGL functions from C#. Finally, openGL itself is packaged through the freeglut, glew32, glut, and glut32 dlls. These are bundled with the source code and will automatically copy themselves to the output directory so they will be loaded by the runtime correctly.
Additionally, the application makes use of MATLAB software for PID tune algorithms. The application code did have its own algorithm for computing an optimal PID for first degree processes, but this will not work for second degree processes. Since the purpose of the research work is not to implement PID tune algorithms but rather to demonstrate the use of GPUs in real time system control, it was decided to use as many external software packages as needed to deal with dependencies. Another reason why MATLAB is used to compute the optimal PIDs instead of using other inputs is because it makes it easier to just plug in various types of models in the simulation and see how it evolves, without the need to deal with the PID tune [18]. MATLAB was also used to print the plots throughout this paper. Any version of MATLAB that supports the COM server can be used [19]. It may be necessary to replace the COM reference to MLApp in the C# project if versions other than 2020 are used.
To compile the source code, Visual Studio 2019 with the Desktop Development Package must be used. It is likely that this will work with older versions as well, but it was never attempted. Visual Studio Community edition is free to use and install for opensource developers and small companies [20].

Computations of System Outputs
In the adaptive-robust control philosophy [2], the supervisor uses an approximation model of the real plant, called a nominal model, to determine when performance degradation occurs due to changes in the real plant. Equation (1) describes the relationship between the real plant, the nominal model, and the robustness margin [1,3,17,21].
where CP is the closed loop transfer function of the real plant, and CM is the closed loop transfer function of the nominal process; δ is the robustness margin.
To simulate the outputs of a transfer function, necessary for determining the value of CM, it is necessary to perform a discrete transform over its continuous form. There are several transformations provides in literature; in this research, the Tustin or Trapezoidal transform will be used. For a first order transfer function, such as the one described in Equation (2), The Tustin transform will involve replacing s as shown in Equation (3), where T is the sampling period, and z is the delay operator, y is the output, and u is the input.
By replacing (3) in (2), it can be easily demonstrated that Furthermore, (4) can be rewritten in the regressive form, as shown in (5) For a second order system, the transfer function can be written as in (6) H II = y u = ω 2 s 2 +2ζωs+ω 2 = W Ls 2 +Ks+W (10) where L =1 (usually), K = 2 ω, W = ω . Applying the same Tustin transformation shown in (3), the discrete form of the second order transfer function is determined as where In the regressive form, Equation (11) Similarities can be observed between Equation (11), which describes the regressive form of a second order process, and Equation (9), which describes the regressive form of a first order process. If in (11), c[1] = 0 and c[4] = 0, then (11) becomes equivalent to (9). This shows that the set of transfer functions of second order systems is a superset of the transfer functions of first order systems, and that any first order transfer function can be rewritten as a second order transfer function. This is important, as the mathematics show that a single algorithm can be used to compute the output of both first order and second order transfer functions. The value of u is computed the by the controller, using the reference r and the output error e = r − yk. In the case of a PID controller, with the general form The output value of uk can be written as where The values of u can now be plugged in to Equations (9) and (18) to compute the final output of each nominal system. The number of historical values for both y and u depends on the order of the process. First order transfer functions require fewer historical values compared to second order transfer functions. With those formulae, it is now possible to compute the values of CM and CP in (1). The compute shader will execute compute the values of each CM in the knowledge base, as will be explained in the next section.

Using the Compute Shader
The compute shader represents the code that will be executed by the graphics hardware work groups to calculate the values described in (1). The openGL framework was used for accessing the graphics hardware, the programming language used was GLSL, a language similar to C in syntax [12]. As described in the introduction of this article, shaders get executed by threads in parallel, using hardware groups called work groups. The size of a work group is declared by the GLSL code itself, whereas the application consuming the shader through the invocation only gets to decide the number of groups to run. Both the size declaration in the shader code and the number of groups dispatched by the invoking application are described as 3D vectors. In GLSL, the shader describes the size of the group by using the layout directive: layout (local_size_x = X, local_size_y = Y, local_size_z = Z) in [12]. This declares the local size, also known as the size of the compute shader invocation group.
When invoking, or running a compute shader, the application will call the Dispatch method. The method signature, void glDispatchCompute (GLuint num_groups_x, GLuint num_groups_y, GLuint num_groups_z), exposes three parameters, which match the X, Y, and Z dimensions of the 3D vector used to define the number of work groups that will be used by the application. The total number of threads will be the product between the three parameters of the glDispatchCompute function [12], and the three parameters of the layout directive. Figure 4 shows the 3D virtual layout of the compute dispatch. To determine the location of a thread in the 3D space of the compute shader invocation, openGL provides several built-in indices that allow for quick identification [12]: • gl_NumWorkGroups, contains the number of work groups requested by the dispatch call. • gl_WorkGroupID represents the current work group. • gl_LocalInvocationID represents the thread index inside the current work group. • gl_GlobalInvocationID uniquely identifies the current thread among all threads in all work groups. • gl_LocalInvocationIndex is the 1D normalization of the position of the thread within the work group.
To invoke a compute shader, applications must follow a series of steps [12,22].
1. Create an openGL Program. Programs are objects that are used to apply shaders in the graphics pipeline. 2. Create a pointer to the compute shader. 3. Set the source of the shader. The shader can be precompiled before the application is run or it can be compiled at runtime. In this case, it was chosen to compile the shader at runtime. 4. Compile the shader (not necessary if it was precompiled). 5. Attach the shader to the Program object. 6. Link the Program object. 7. Install the Program in the graphics pipeline.
8. Generate the buffers that will be used for passing data to and from the GPU, with the appropriate target type (like ShadderStorageBuffer) and usage mask (e.g., the write bit and invalidate buffer bit). For efficiency, the same input buffer can also be used for output. 9. Fill the buffers with the appropriate input data. 10. Call the glDispatchCompute method with the desire parameters. 11. Read the results from the output buffers.
The procedure may look complicated, but steps 1-7 must be performed only once per compute shader. Afterwards, it just becomes a matter of filling in the buffers and calling glDispatchCompute and reading the results.
Shaders are highly dependent on the hardware being used and their binary representation is highly dependent on the hardware drivers [22]. Usually, shaders must be compiled on each hardware platform they run on and each time the hardware driver changes. Applications typically choose to compile the shaders whenever needed, most often at startup. The simulation software used to generate the results presented in this article will compile the shader every time it runs [22].
Data buffers can be either structured (with strongly typed language symbols) or unstructured (raw byte arrays). In this simulation, structured buffers were used, as it makes maintenance extremely easy. The data structure contains information about the current state of the real process, such as the current output errors of the system, the system reference, the regression coefficients of the real process and of the simulated nominal process, the P, I, and D values of the nominal process optimal controller, and various other intermediary values that are needed to compute the outputs of the system and the value of the switching criterion. The input buffer also contains the needed memory to hold the output values. The shader essentially decorates the data structures representing the state of the nominal models with the output values. The local group size used will be (1024, 1, 1); the virtual layout is shown in Figure 5. Each box represents an instance of the data structure that is used to pass data to and from the GPU, associated with a nominal model in the knowledge base. Figure 5. Overview of the work group layout of the compute shader used to calculate the switching criterion, with local_size_x = 1024, local_size_y = local_size_z = 1. Each box represents an instance of the data structure associated with a nominal model used to pass information to and from the GPU.

Results
To test the efficiency of the control scheme, a small knowledge base was used, and no dynamic filling added (as would be required for adaptive control integration). This will keep the knowledge base size constant, with three models, and the real plant would switch between them, to simulate its exposure to various types of perturbances. The switching criterion used to select the proper model is defined in (21) [5,23].
where DY is the difference between the output of the nominal model and the output of the real plant. DY [1], DY [2], DY [3] represent the previous values of DY, and α and β are constants chosen by the designer of the system, depending on the particularities of the models. The transfer function of the real plant, which is also the initial nominal model used in the simulation is shown in (21) H P = 2 1.2s+1 (22) The knowledge base is prefilled with the models described in Equations (23) (25) The system reference will be set to 10. It is expected that when the switch occurs from one model to another, after a brief hysteresis period, the proper nominal model, which best matches the real plant from Equations (23)-(25), will have its output stabilized around the reference value of 10. Figure 6 shows the initial evolution of the real plant and nominal models. Real plant tracks the reference value of 10 without any errors. The real plan is then exposed to parametric perturbations, switching the real plant model to something like the model described in (24). For example, if our real plant would be the suspension system in a car, a sudden perturbation would be caused by degrading weather conditions. The supervisor will detect the degradation in performance and will proceed to look for a model in the knowledge base that will offer better performance. The evolution is shown in Figure 7.  Figure 7 shows that the switching logic is working as expected and allows the system to maintain the required performance. It can also be seen that the shock of the perturbation is forwarded to the nominal models in the cache, as the output errors of the real system are passed as parameters to the GPU shader. As expected, cached model 3, which matches the transfer function (25), stabilizes at the reference value, along with the real plant. At this point, the optimal controller of (25) is also applied to the real plant, to achieve best possible performance. This is felt in the evolution of the other two cached models, which change their output values to 12 and 50. Figure 8 shows another switch in the perturbation, this time to something like the nominal model (24). The behavior is very much the same. Cached model 2, which matches model (24), stabilizes to the reference value of 10, along with the real plant. Cached model 3, no longer offering best performance, degrades along with cached model 1 to output value 2. Similar behavior will be observed with a switch to model (24). If the output value measurement is polluted by a white noise, it can be removed through filtering. In the following examples, a moving average filter with window size 4 was applied to even out the measurement noise. This means the values of CM and CP from (1) will be applied on mean of the last four output values read. Figure 9 shows the effect of perturbations applied to the real plant by altering its model to match (25). This is the same case as shown in Figure 7, but with white noise affecting the measurement of the system output. Figure 9. Evolution of the real plant and the nominal models after exposing the real plant to perturbations, with output measurement noise.
As expected, nominal model 3 from the cache will now have its value gravitating around the reference value, but not quite matching it, due to the white noise polluting the output measurements of the real plant (but at a much lower intensity). The analogue of Figure 8 with measurement noise is shown in Figure 10. Simulating structural perturbances is achieved by having some second order systems in the nominal model knowledge base. To keep the same low number of models, the second model, cached model 2, will be replaced with various second order transfer functions.
Similar tests have been conducted to demonstrate the behavior of the control system. Figure 11 shows the switch from the initial nominal model to cached model 1 (a parametric disturbance) to cached model 2 (a structural disturbance). The control scheme behaves as expected. First, nominal model 1 tracks the reference value, then when the structural perturbation occurs, the appropriate cached model 2 tracks the reference, signaling which model and controller was chosen by the control structure. In this case, the second order system behaves similarly to a first order system, because its poles lay on the real axis and have no imaginary components. The system will show oscillations when switching, as shown in Figure 12. This occurs because the poles of (27), -0.1 ± 2i√5, are close to the imaginary axis, and this is known to make the system unstable, on top of the systemic shock caused by switching and disturbances. This leads to strong oscillations the moment the perturbation occurs and eventually oscillations around the reference value. Unlike oscillations caused by noisy measurements, these cannot be trivially removed using simple filters. When dealing with noisy measurements of the output error, the MA filter is still good in case of structural perturbances, as shown in Figure 13. Additionally, Figure 13 shows a series of switches between various types of models, Equations (23), (25) and (26). As far as performance is concerned, GPUs are designed to deal with large data sets. Depending on the technical specifications, a GPU can handle thousands of parallel computations of the switching criterion without performance degradations. This allows the control system to provide real time responses even for real plants that work in extremely varied working conditions. Table 1 shows the time taken to finish the shader invocation with an increasing number of nominal models in the knowledge base. Using GPUs is therefore best in situations where the potential number of nominal models is high. Because performance is strictly tied to hardware, we can observe that execution times do not have a linear increase. For 100 and 1000 models, the execution time does not grow in a linear fashion, with only a slightly 2× increase in execution time for a 10× increase in the number of models. Things take a turn for the worse at 10,000 models copying of structured data from GPU memory to CPU memory.

Discussion
The premises of the research were to show that GPUs and compute shaders can be used in real time systems control. The adaptive robust multimodel control philosophy is a perfect place to apply such technologies, as the parallel nature of the GPU hardware makes it extremely easy to handle many nominal models in the knowledge base, perhaps hundreds or even thousands of nominal models, and compute the value of the switching criterion for each of them in a single shader invocation. As seen in the results section, this approach works for both parametric and structural perturbations.
It was possible to use a single shader implementation for both first and second order transfer functions. Second order transfer functions can be combined to achieve higher order functions, but it may no longer be possible to use the same shader declaration for higher order functions.
Unlike traditional programming, shaders are supposed to be simple and not contain much branching logic or iterative functions. Branching logic is supported through various instructions such as if, for, and while, but it is notoriously expensive, and many have unforeseen side effects. The recommendation is to not use branching logic unless it is unavoidable. Iterative functions are usually replaced by the parallel nature of the GPU. Implementing a universal solution for all possible kinds of models is therefore possible but remains as a future research topic on the matter.
The choice of models that form the knowledge base also has a significant impact on the behavior of this control scheme. As physical processes can be represented by several transfer functions, it is possible that the chosen model from the knowledge base will not always be the expected one. If its robust controller provides decent performance, then it will be deemed good enough by the supervisor. This is because the robustness margin from (1) creates a zone of stability around nominal models, and any other model that fits in that zone will be a good match for the robust controller (although maybe not optimal). This can be adjusted by using a different criterion for detecting performance degradation.
Another important aspect regarding the choice of models is stability. The results show that poorly dumped or undumped models will introduce instability, and this instability is further augmented by the shock of switching. This control scheme is therefore not suitable for handling such models, but this is not related to the use of GPUs.
Output measurement noise was filtered out using a moving average filter. This worked very well, as seen in the results section, but it may not always be the best choice. Assuming more information is known about the system and the type of disturbances it may encounter, other time domain filters like band pass biquad filters or other time domain IIR filters may be used for removing the noise.
The potential applicability of compute shaders in real time control is extensive. Most modern vehicles now come equipped with GPU hardware, as infotainment systems or state of the art car status systems no longer use mechanical knobs or gauges but use high resolution displays and touchscreens to show information to drivers, sometimes even in 3D. This requires powerful GPU hardware that can keep up with real time considerations. The GPUs can potentially be used to keep track of the many electronic control units in modern cars and act as powerful supervisors over the good functioning of cars. Future research will explore the usability of compute shaders in fault detection and recovery.
As performance scales very well with a large number of models in the knowledge base, the GPU approach makes it possible to use multimodel control with a large knowledge base for fast systems, where the sampling period imposes a hard limit on the number of models, as traditional CPUs will have limited scalability due to a lower number of cores.

Conclusions
This article presented the use of graphics processing units in real time adaptive robust multimodel control, which is best suited for processes that may encounter repetitive types of parametric or structural perturbances. The multimodel aspect of this implementation is based on a knowledge base of previously discovered nominal models and their robust controllers. Graphics processing units are designed to deal with massive parallel computations and are extremely proficient at dealing with large knowledge bases, containing hundreds or thousands of models matching various types of perturbances that the real process may encounter during its function. The knowledge base can be filled in advance, depending on the knowledge on the functional environment of the plant, or it can be combined with an adaptive closed loop identification algorithm that discovers new working models for the plant and fills in the knowledge base in real time.
The results section showed how the control scheme would behave when dealing with parametric and structural perturbances, both with and without output measurement noise. The algorithms performed as expected, showing that graphics processing units and compute shaders can be used in real time control.
Future research will focus on implementing the control model on physical systems and implementing more of the adaptive-robust control scheme using the GPU. This includes implementing the PID tuning algorithms and system identification procedures on the GPU, thus removing the dependency on third party software, such as MATLAB.
Author Contributions: C.-C.M. handled conceptualization, methodology, software development, writing the original draft of the research article, formal analysis, investigation data acquisition, validation, and visualization. C.L. contributed to conceptualization, handled project administration and funding acquisition, reviewed scientific approach and article, and approved final research results. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.