Hybrid GPU–CPU Efficient Implementation of a Parallel Numerical Algorithm for Solving the Cauchy Problem for a Nonlinear Differential Riccati Equation of Fractional Variable Order

Abstract: The numerical solution of fractional dynamics problems can create a high computational load, which makes it necessary to implement efficient algorithms for their solution. The main contribution to the computational load of such computations comes from heredity (memory), which is determined by the dependence of the current value of the solution function on its previous values in the time interval. In mathematical terms, the heredity here is described using a fractional differentiation operator in the Gerasimov-Caputo sense of variable order. As an example, we consider the Cauchy problem for the nonlinear fractional Riccati equation with non-constant coefficients. An efficient parallel implementation algorithm is proposed for the known sequential non-local explicit finite-difference numerical solution scheme. This implementation of the algorithm is hybrid, since it uses both GPU and CPU computational nodes. The program code of the parallel implementation of the algorithm is described in the C and CUDA C languages and is developed using the OpenMP and CUDA hardware and software architectures. This paper presents a study of the computational efficiency of the proposed parallel algorithm based on data from a series of computational experiments obtained on an NVIDIA DGX STATION computing server. The average computation time is analyzed in terms of the following: running time, acceleration, efficiency, and the cost of the algorithm. As a result, it is shown on test examples that the hybrid version of the numerical algorithm can give a significant performance increase of 3-5 times in comparison with both the sequential version of the algorithm and the OpenMP implementation.


Introduction
The fractional differentiation operator is a generalization of the concept of the derivative to the case of a non-integer order of differentiation. This direction originated almost immediately after the appearance of integro-differential calculus about 300 years ago [1]. The study of this topic continues to the present day and is associated with such names as Uchaikin V. [2], Pskhu A. [3], Kilbas A. [4,5], Ortigueira M. D. [6,7], Patnaik S. [8], and many others. More about the history of the origins of fractional differential calculus can be learned from the excellent encyclopedia by Samko S., Kilbas A., and Marichev O. on such a complex topic as "Integrals and derivatives of fractional order and some of their applications" [4].
Fractional calculus is closely related to the theory of fractals, which is reflected in the works [9]. In particular, there is a connection between the fractal (Hausdorff) dimension of a medium and the orders of fractional operators. This is related to the concept of heredity: the property of a system or process to retain a memory of its past. The concept of heredity is equivalent to such concepts as aftereffect, residuality, memory, and lag, as well as non-locality; it was first introduced in the paper [10] by the Italian mathematician Vito Volterra in 1912. A system possesses the hereditarity property if not only its current state, i.e., the initial values of the state parameters of the system and some time derivatives, but also a finite number of its previous states are taken into account [2]. Such models are called non-local in time [11]. Today there exist various definitions of the fractional differentiation operator: Liouville, Riemann-Liouville, Weyl, Grunwald-Letnikov, and many others. For some of them, there are further generalizations to the case of variable order (VO) [7,8], which in turn leads to variable non-locality.
As a consequence, the applications of fractional-differential calculus in various branches of science and engineering are being developed ever more actively [12,13]. The generalization of known applications and the development of new mathematical models that take the hereditary function into account allow one to refine known results. Today, derivatives and integrals of non-integer orders are already successfully used in many areas of science and technology, for example, in the following: economics [14,15]; physics and mechanics [16]; hydrodynamic modeling [17]; meteorology and diffusion [18]; modeling complex acoustic oscillations [19]; viscoelastic waves [20]; and many other fields of science.
Attempts to describe such processes in terms of fractional dynamics give rise to rather complicated model integro-differential equations and systems of equations, which often have to be solved numerically. Even if the proposed numerical scheme is relatively simple, it is usually the approximation of the fractional differentiation or integration operator [21] that starts to create difficulties, which inevitably increases the computational complexity of the numerical scheme. In mathematical modeling that is based on ODEs and PDEs, in some cases, it is possible to use sequential algorithms that implement numerical solution schemes and obtain results in an acceptable time. However, when generalizing these models to using fractional calculus, the computation time can increase dramatically due to memory or non-locality effects. It is then expedient to use parallel algorithms of a numerical solution that is realized both on the CPU and GPU nodes of computing systems to accelerate the calculations in those parts of the numerical scheme where possible.
For example, the papers [22,23] investigate the issues of computational acceleration in modeling some diffusion processes based on the fractional Gerasimov-Caputo derivative [24,25] that is non-local in the spatial variables. The authors Bogaenko V.A. and Bulavatskiy V.M. proposed to reduce the computational accuracy at the stage of building the approximation of the fractional Gerasimov-Caputo derivative in order to gain acceleration. The model equation was solved numerically using finite-difference schemes. The proposed schemes were implemented as parallel algorithms for systems with distributed memory and GPUs (graphics processing units). It is shown that such approaches make it possible to speed up the computations for the proposed mathematical model by several times.
The literature review shows that there are no studies devoted to the development of parallel algorithms for solving the Cauchy problem for nonlinear fractional equations with the Gerasimov-Caputo VO derivative. In particular, hybrid implementations of a parallel explicit finite-difference scheme for solving the fractional Riccati equation on both the CPU and GPU nodes of a computing system have not been investigated; this will be the subject of this paper.
The research plan of the article is as follows: Section 2 describes the hardware base and principles of the organization of parallel computing, for which the proposed efficient parallel algorithm of numerical solution is developed; Section 3 gives the formulation of the Cauchy problem for the nonlinear equation with a fractional derivative of the Gerasimov-Caputo VO. In particular, the fractional Riccati equation is considered. A nonlocal explicit finite-difference scheme for the numerical solution of the problem is given; Section 4, in the form of code blocks of the C programming language, describes a hybrid parallel algorithm that realizes the explicit numerical scheme. The whole code located in the blocks represents a complete function implementing the numerical scheme; and Section 5 analyzes the efficiency of the presented hybrid parallel algorithm of the explicit finite-difference scheme in comparison with its best sequential version.

About Organization of Parallel Computations on CPU and GPU Nodes of a Computing System
It is difficult to overestimate how deeply computing technology and devices have penetrated our lives, the most notable of which is the computer. The basis of any computer or computing system is the central processing unit (CPU), which executes program instructions and coordinates all other nodes of the computing system [26]. The main part of the CPU is the core: a collection of data buses and caches, control units, and arithmetic-logic units (ALUs) [26]. Over the last 20 years, the never-ending performance race has taken CPU architecture to a new level: multi-core. Within such CPUs there are physical cores (specific hardware blocks) and logical cores (independent threads to which the CPU is able to distribute program instructions) [27]. A specially designed program, a parallel algorithm, can then be executed on such a computational node.
To solve the problems of developing parallel numerical algorithms, one of the chosen technologies is OpenMP, an open standard software interface for parallel systems with CPUs and shared memory [28]. OpenMP implements parallel computing using the idea of multithreading: multiple CPU threads running tasks in parallel. This is achieved by providing the developer with a set of compiler directives, function libraries, and environment variables. The high-level language C was chosen as the main programming language because of its versatility and rather large freedom in working with memory [29], and also because OpenMP is officially supported for the C and C++ programming languages.
The ideas of code development that take CPU multithreading into account are also applicable to GPUs. The main difference is that while CPUs can solve several complex tasks in parallel, GPUs are oriented toward executing programs with a large number of uniform arithmetic operations. This way of organizing parallel calculations is called single instruction, multiple data (SIMD) [26], i.e., several processors perform the same operation simultaneously but on different data. From Figure 1, we can see that the CPU has a system of CPU caches and complex ALUs, while the GPU contains many simplified ALUs that share a common memory. GPU memory is hierarchically structured and optimized for maximum bandwidth. To achieve the best acceleration from parallel algorithms developed for GPUs, it is necessary to think through the memory access patterns of the program code and to take the hardware features into account. All this allows one to improve performance in these kinds of computational tasks, but it makes programming more difficult [30].
The tool for working with GPUs in this article will be NVIDIA's hardware and software parallel computing architecture for GPUs. This is also called Compute Unified Device Architecture (CUDA), and it allows us to significantly increase computational performance.
CUDA gives you the ability to control the RAM of the GPU at will, and it generally provides a wide range of tools. The architecture invites us to implement functions executable on the GPU, also called the device, using extensions of the C programming language called CUDA C. In turn, we will call the CPU the host, thus separating the code executed on the CPU and on the GPU [30]. We also note that the high-level programming language C offers versatility and considerable freedom in working with memory [29], which is very useful for designing efficient algorithms.
The ideas that led to the creation of the CUDA hardware-software architecture are somewhat similar to the ideas of the OpenMP software standard [28]. In particular, this is achieved by simplifying and standardizing the work with computing equipment and by providing a convenient programming interface to developers, thereby solving the problems that arose earlier at the stage of «non-graphical» calculations on GPUs. For more details, see [30] (p. 5).
In the problem we are considering, there is a need to utilize a supercomputer and its computing power [31,32]. The main reason is the ability to run simulations with an extremely small sampling step h, since the amount of RAM that supercomputer technologies provide allows us to operate on matrices and vectors of large dimensions. This lets us concentrate on organizing the efficient loading of the CPUs and GPUs with calculations while implementing the parallel algorithms of the numerical solution schemes in software.

Mathematical Problem Statement
There are many tens of definitions of the fractional differentiation operator, but we will focus on applying one of them [33]. In this paper, we consider the generalization of the Gerasimov-Caputo fractional differentiation operator [24,25], which is non-local in time with 0 < α < 1, to the Gerasimov-Caputo VO operator with 0 < α(t) < 1, i.e., with a non-constant order of the fractional derivative. This generalization is studied in detail in the works of the author Tverdyi D.A. [34,35] and has the following form:

Definition 1. The Gerasimov-Caputo fractional derivative of variable order α(t):

∂_0t^α(t) u(t) = 1/Γ(1 − α(t)) ∫_0^t u̇(τ)/(t − τ)^α(t) dτ,    (1)

where u(t) ∈ C[0, T] is the one-dimensional, local-in-space solution function; u̇(t) = du/dt; t ∈ [0, T] is the current simulation time; T > 0 is the total simulation time; and Γ(·) is the Euler gamma function, which monotonically decreases on the interval 0 < x < 1.

The introduction of such generalizations (1) with variable non-locality allows a more accurate description of the time series of some dynamical processes with memory [36-38].
The concept of heredity was proposed and studied by the Italian mathematician Vito Volterra [11]. In the paper [2], various physical processes with hereditary effects were investigated. We will not elaborate on the theoretical aspects related to the description of heredity (the memory effect) using fractional derivatives.
In the following section, for the sake of an example, we consider a more specific model equation, namely the fractional analog of the Riccati equation [39], for which the Cauchy problem takes the form:

∂_0t^α(t) u(t) = a(t)u²(t) + b(t)u(t) + c(t),  0 < t ≤ T,  u(0) = C,    (2)

where u(t) ∈ C²[0, T] is the unknown solution function; a(t), b(t), c(t) ∈ C[0, T] are given functions determining the values of the coefficients at each moment of time; and C is a known constant. When α(t) ≡ 1, Equation (2) reduces to the classical Riccati equation.

Remark 1.
However, for example, in [34], a generalized form of the nonlinear fractional Equation (2) is considered, for which a local Cauchy problem with variable time non-locality is posed [5] due to the use of the VO operator (1). Such a problem can describe a wide class of dynamic processes with variable memory in saturated environments [40].

Remark 2.
The model (2) presented above, also called the hereditary α(t)-model, is used by the authors in problems of the emanation of radon gas [41] through porous media, where the order of the fractional derivative can be responsible for the intensity of the 222Rn transport process, which is related to the permeability of the geo-environment: porosity, fracturing, etc. [2].
We will solve problem (2) numerically using finite-difference schemes for the solution of differential equations, with the approximation of the Gerasimov-Caputo VO derivative (1) introduced according to [34,35]. This manuscript implements a relatively simple non-local explicit finite-difference scheme (EFDS) based on Euler's method [42]. The difficulty in this scheme lies in deriving a discrete analog (approximation) of the fractional Gerasimov-Caputo VO derivative (1).
To obtain the numerical solution, according to [34,43], by Euler's method the problem (2) can be represented as a finite-difference scheme of the form:

u_{i+1} = u_i + h^α_i Γ(2 − α_i)(a_i u_i² + b_i u_i + c_i) − Σ_{j=1}^{i} w_j^i (u_{i−j+1} − u_{i−j}),
w_j^i = (j + 1)^{1−α_i} − j^{1−α_i},  α_i = α(t_i),    (3)

where u_0 = C is the initial condition of the Cauchy problem (2); C is a known constant; h = T/N is the uniform grid discretization step; N is the number of grid nodes; and i = 0, . . . , N − 1 is the node number of the computational grid.
In [34] (p. 14), the convergence and stability of EFDS (3) are investigated. It is shown that EFDS (3) is stable and converges with order O(h) under a restriction on the discretization step given there.

Remark 3. In this paper, the numerical scheme (3) differs slightly from the one presented in [34] (p. 14): in (3), a weighting factor w_1^i appears, derived from the factor w_j^i, in order to speed up the computation. This does not affect the accuracy, convergence, or stability of EFDS (3).

Software Implementation of the Hybrid EFDS Algorithm
Let us consider, step by step, an efficient hybrid parallel program implementation of EFDS (3). We will provide the code listing broken into parts with the necessary explanations. The data structures, auxiliary functions, and directives (see Listing 6) described in this chapter must be declared before the code of the method function. From the code snippets given below, we can compose a working and debugged function EFDS_parallel_HYBRID(). The function code assumes calls to the GPU functions (see Listings 20 and 21) necessary for the calculations. This function will be run within a short program complex, the whole code of which is not given in this article; however, some of the auxiliary functions necessary for the operation of EFDS_parallel_HYBRID() are given above.
Next, the function tic() starts the time measurement into previously created objects of the MATC structure, which store the time consumed by the entire algorithm and its individual parts. Then comes the block of initialization of the calculation grid; in other words, we tell the system what hardware resources we can use for useful calculations. To work with OpenMP here, we set the number of CPU threads:

Listing 11. Indicates the program's available computing resources.

Remark 5.
The gpu_calc_param_grid() function calculates the optimal number of blocks based on the user-defined number of threads num_threads_GPU. The user can vary the set number of threads up to a maximum value based on the known specifications of the GPU. If num_threads_GPU is not specified, the maximum allowed number of threads per block on the current GPU will be used. This may additionally speed up the computation a little, but it may cause distortions in the computation.
In the CUDA runtime environment, a special data type, dim3 (a tuple), holds the computed numbers of threads and blocks of the CUDA environment, thus defining the computational grid of the numerical method on the GPU. After that, we can start implementing blocks of parallel code with useful calculations:

printf( " START : CUDA : kernel_EFDS_1d_coeff : " );
kernel_EFDS_1d_coeff<<< blocks, threads >>>( T, N, h, dev_a, dev_b, dev_c, dev_alpha, dev_A );
// copy the arrays back from the GPU to the CPU
cudaMemcpy( a, dev_a, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaMemcpy( b, dev_b, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaMemcpy( c, dev_c, N * sizeof(float), cudaMemcpyDeviceToHost );
cudaMemcpy( alpha, dev_alpha, N_p1 * sizeof(float), cudaMemcpyDeviceToHost );
cudaMemcpy( A, dev_A, N * sizeof(float), cudaMemcpyDeviceToHost );
// free the memory allocated on the GPU
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
cudaFree( dev_A );
printf( " ... successful END \n" );

Here, for EFDS (3), the discrete values of the coefficients of the equation are computed at each node i of a grid of N nodes. The computation of N values, from i = 0 to i = N − 1, is distributed over the appropriate number of threads and is performed by the GPU function kernel_EFDS_1d_coeff<<< >>>(). Since the calculated coefficient values are one-dimensional vectors of dimension N, on the one-dimensional GPU grid defined earlier, each node-thread with index i calculates the vector values depending on its position i. This is why it is so important to correctly set the numbers of blocks and threads on the GPU.
printf( "\n START : CUDA : kernel_EFDS_2d_w : " );
kernel_EFDS_2d_w<<< blocks_2, threads_2 >>>( N, dev_alpha, dev_w );
// copy the array back from the GPU to the CPU
cudaMemcpy( w, dev_w, (N * N) * sizeof(float), cudaMemcpyDeviceToHost );
// free the memory allocated on the GPU
cudaFree( dev_w );
printf( " ... successful END \n" );

Here, we calculate the values of the weight coefficient w[i][j], which is represented in the memory of both the CPU and the GPU as an N x N matrix in linearized form. Although the matrix w[i][j] has insignificant zeros above the main diagonal, we cannot represent it as a lower-triangular matrix. This is due to the excessive complexity of the code for linearizing a lower-triangular matrix in CUDA C, which is required in order to work with the grid on the GPU.

The sequential part of the EFDS algorithm (3) is an inevitable part of our problem, since the memory effect leads to a dependency in the solution function: the value u at node i+1 requires knowledge of the value at the previous node i. Such a part of the algorithm cannot be distributed by definition, and this cannot be avoided when implementing explicit finite-difference numerical schemes:

criterion_stability( (char *)"EFDS", T, N, h, a, b, c, alpha );
printf( "\n End calc %s. \n", input.current_method );

At the end of the implementation of the function EFDS_parallel_OpenMP(), we release the CPU RAM occupied for storing the N x N values of the weighting factor w[i][j], which is achieved with the command free(w);. The memory occupied on the GPU is freed in the code immediately after it is no longer needed.

Remark 6.
Note that if, for example, we want to realize an inverse problem on the basis of the direct EFDS problem (3), the function solving the direct problem (3) will be called in a loop.
In this case, the automatic «memory release» happens only after the whole program (process) is terminated. As a consequence, after some time, the entire amount of available CPU RAM will be occupied, because the memory reserved in previous runs of the function is still held by the system (although it is not used further in any way).

Used GPU Functions
When describing the EFDS_parallel_HYBRID() code, several functions executed directly on the GPU were used. The description of such functions in the CUDA C language differs from ordinary C by the __global__ specifier, as well as by the <<< ... >>> execution configuration used when calling them. This also applies to the special logic of describing the algorithms for the calculations we need, about which more details are given in [30].

Analysis of Computational Efficiency
We investigated how much faster the developed algorithms are, where the parameter N characterizes the complexity of the algorithm implementing (3). We compared the computation time of the scheme on one CPU thread (the sequential algorithm) and on multiple threads (the parallel algorithm). The number of threads and other GPU resources in the developed program were determined and set automatically in an optimal way.
We conducted experiments on computing systems with a different number of threads: the first system is a laptop (HP Pavilion Gaming Laptop), and the second is the NVIDIA DGX STATION computing server. It follows from Brent's lemma [28,44,45] that any algorithm implemented for execution in PRAM-type systems can be modified so that it can be executed in PRAM systems with a smaller number of threads, where a PRAM is a parallel random-access machine combining c threads (cores or CPUs), shared memory, and a control device. This is one of the most common models of computing systems [28]. It means that we can run the developed parallel algorithm on systems that have a different number of cores or threads.

Remark 7.
We will measure the time spent on code execution, counted since midnight on January 1, 1970, using the gettimeofday() function, which returns seconds and microseconds in the tv_sec and tv_usec fields of the timeval structure, respectively. This is one of the most accurate methods of measuring program execution time. The method is implemented in two functions: tic() starts the measurement, and toc() stops it and calculates the time spent.
To measure the computation time, we consider the model (2) with fixed test parameters. Since we will obtain a slightly different T_c with each new numerical experiment, and the number of experiments is finite, T_c can be considered a random variable with some distribution function. Then, to compute T_1(N) and T_c(N), we resort to the concept of mathematical expectation (average value) [46] for a discrete random variable:

E(T_c(N)) = (1/L) Σ_{i=1}^{L} T_c,i(N),    (5)

where i is the index of the numerical experiment, and L is the sample size. Then, for the developed algorithm, we determine the mean E(·) (5) of a sample of L = 10 values with different numbers c of CPU threads used. The results for the hybrid code are presented in Table 1 below. To compare the results of the analysis, Table 2 provides data on the mean E(·) calculation time of EFDS (3) by the parallel algorithm based on OpenMP.
Graphically, the results of computing the mean E(·) of the computation time from Tables 1 and 2, for all the specified algorithms, are shown in Figure 3 below. Now, using the data on the time spent on code execution, we can study the performance of the parallel algorithm. The average computation time is analyzed in terms of the following: running time, acceleration, efficiency, and the cost of the algorithm.

Table 1. Time spent T (in sec.) and RAM memory (in Mb) for calculating the sequential algorithm (T_1(N)) and the parallel algorithm based on OpenMP (with a different number of threads) implementing EFDS.

Definition 2. The acceleration A_c(N) that a parallel algorithm provides on a c-threaded computing system, when compared to the most efficient sequential algorithm, is defined by the following formula:

A_c(N) = T_1(N) / T_c(N),    (6)

where max(A c (N)) = c is a theoretical value that is possible in the absence of unavoidable time delays when executing a parallel algorithm, i.e., when there are no additional actions on the interaction between threads, then all calculations are evenly distributed among the threads and no actions are required to merge the results of the calculations.

Definition 3. The efficiency V_c(N) of using a given number of threads by a parallel algorithm is determined by the formula:

V_c(N) = A_c(N) / c,    (7)

where, for max(A_c(N)) = c, we get max(V_c(N)) = 1.
Definition 5. A cost-optimal CO_c(N) algorithm is characterized by a cost proportional to the complexity of the best sequential algorithm [47]. In this case, we have the following:

C_c(N) = c · T_c(N),    (8)
CO_c(N) = C_c(N) / T_1(N) = c · T_c(N) / T_1(N),    (9)

where C_c(N) is the cost of the parallel algorithm, i.e., the total thread time spent on it.

Discussion
From the comparison of the analysis results of the proposed algorithms in Figures 4-7, we can see a significant acceleration (6) of the computation speed, by 3-5 times, for the parallel algorithms; however, when the number of processors/threads used for parallel processing increases, their efficiency (7) decreases significantly. From Figure 4, according to the acceleration data, both computing systems, using their maximum available threads (6 and 40, respectively), gave approximately equal computation times with the same parameters of the numerical experiment. At the same time, when running the calculations on the supercomputer on more than 15-20 threads, there was no further performance gain.
As a result, the cost characteristics (8) and (9), which are inversely related to efficiency, grew significantly for the parallel algorithms. However, the hybrid OpenMP-CUDA algorithm showed a lower cost than the parallel OpenMP one. This means that the presented hybrid algorithm is more efficient both on a supercomputer and on a personal computer.
It can be concluded that it is quite possible to use powerful personal computers and laptops for efficient computations in fractional dynamics models based on the Cauchy problem for nonlinear equations of the form (2), in particular with fractional derivatives of the Gerasimov-Caputo type (1).
In this case, the use of supercomputing remains an important future prospect. Non-efficiently loaded threads can be used for the parallel computation of several copies of the numerical algorithm, which will be useful in solving inverse problems for (2). Inverse problems, an important and well-studied class of mathematical problems, find application in various fields of science and technology [48]. Their solution is the next stage of the authors' research, aimed at finding applications of the described model (2) based on the fractional derivatives (1) [34].
Within this direction, the issues related to the emanation of radon gas [41] through the geo-environment were investigated, particularly when describing the processes of its accumulation in, and dissipation from, a storage chamber with registration sensors. Radon 222Rn is regarded as a known, informative, and well-established earthquake precursor [49,50]. The solution of inverse problems is carried out in order to restore some parameters of the mathematical model of a dynamic process from known observed data. Such a solution is extremely time-consuming in itself, and the better efficiency of the hybrid OpenMP-CUDA algorithm will make it possible to obtain it in much less time.

Conclusions
A software implementation of a non-local explicit finite-difference scheme for solving the fractional Riccati equation as an efficient parallel hybrid OpenMP-CUDA algorithm was presented.
The efficiency of the proposed hybrid OpenMP-CUDA algorithm was analyzed. As a result, for the non-local explicit finite-difference scheme, a significant performance gain of 3-5 times was shown when compared to the best sequential counterpart, as well as a better efficiency when compared to the OpenMP parallel algorithm.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: