# Parallel Implementation of Modeling of Fractional-Order State-Space Systems Using the Fixed-Step Euler Method

^{*}

Next Article in Journal

Next Article in Special Issue

Next Article in Special Issue

Previous Article in Journal

Previous Article in Special Issue

Previous Article in Special Issue

Departament of Electrial, Control and Computer Engineering, Opole University of Technology, 45-758 Opole, Poland

Author to whom correspondence should be addressed.

Received: 20 August 2019
/
Revised: 18 September 2019
/
Accepted: 20 September 2019
/
Published: 24 September 2019

(This article belongs to the Special Issue The Fractional View of Complexity)

This paper presents new results in implementation of parallel computing in modeling of fractional-order state-space systems. The methods considered in the paper are based on the Euler fixed-step discretization scheme and the Grünwald-Letnikov definition of the fractional-order derivative. Two different parallelization approaches for modeling of fractional-order state-space systems are proposed, which are implemented both in Central Processing Unit (CPU)- and Graphical Processing Unit (GPU)-based hardware environments. Simulation examples show high efficiency of the introduced parallelization schemes. Execution times of the introduced methodology are significantly lower than for the classical, commonly used simulation environment.

Fractional-order systems incorporating fractional-order derivatives (or differences) have attracted considerable research interest as their specific nature can be more adequate to describe some complex physical phenomena [1,2,3,4,5,6,7,8,9,10,11,12,13]. Since the fractional-order derivative is not defined at a point as in the case of its integer-order counterpart, impulse responses of fractional-order systems are not, in general, a class of exponential functions. In addition, a fractional-order derivative affects the properties of the system in the frequency domain. This is due to the fact that the modulus characteristic of a derivative of the fractional order $\alpha $ is increased by $20\alpha $ dB per decade instead of 20 dB per decade for the integer-order derivative, whereas the phase characteristic is equal to $\pi \alpha /2$. Therefore, the dynamic properties of the fractional-order system are more adjustable than for integer-order systems and can be more accurate in modeling various physical processes involving electrical circuits [4,6], thermal and diffusion processes [14,15], medicine [8,16], and others [17,18,19,20]. Among them, a lot of interest has been devoted to fractional-order generalizations of various entropy definitions and functions, i.e., Rényi entropy [11,21], Tsallis entropy [11,22], etc. The fractional generalizations of entropy definitions have led to different probability distribution functions as compared with the Shannon entropies [5]. However, the main problem encountered in the fractional-order systems is the fact that calculation of fractional-order derivative/difference may lead to computational problems. Namely, the Grünwald-Letnikov (GL) fractional-order derivative/difference, which is considered here, may lead to computational explosion related to the infinite summation. That is why truncated or finite-length implementation is used in approximators to the fractional-order derivative/difference [3,10]. Still, high modeling accuracy requirements often end up with high (upper) summation limits, which may be computationally burdensome for a single-core application. In spite of this, there is little discussion about using the multi-core architectures in calculations of fractional-order derivatives/differences. Therefore, the fractional-oriented parallel implementation issue is the main motivation for this paper.

Recently, trends in the development of more powerful computing hardware have focused on increased numbers of cores rather than the increased performance of an individual unit. For this reason, parallel computing has emerged as a research domain with the capability of meeting time requirements in both real-time applications and offline simulations. Despite the increased difficulty of coding, parallel programming has become very popular due to the ever-growing computational scale [23,24].

There are some papers considering the parallelization process for the fractional-order derivative/difference and fractional-order systems. In [25], a parallel computing application has been proposed for calculation of the Caputo derivative by use of the Adams–Bashforth–Moulton method. That method is extended to modeling of the fractional-order state-space system in [26]. Another approach for solving the Caputo derivative problem in the specific fractional-diffusion equation has been presented in [27], with a parallel algorithm based on linear tridiagonal equations. Yet another example is the use of parallel computing for solving nonlinear time-space fractional partial differential equations [28]. In that case, the authors have studied the efficiency of parallelizations for shared and distributed memory systems.

In this work, we use two strategies for parallel calculation of fractional-order state-space systems, the Grünwald-Letnikov derivative and the fixed-step discretization scheme. The first one is the use of a Central Processing Unit (CPU) with OpenMP (OMP) Application Programming Interface (API) [29]. The second one incorporates a Graphical Processing Unit (GPU) with Compute Unified Device Architecture (CUDA) API.

OMP is a set of compiler directives, library routines, and environment variables enabling shared-memory parallelism. With minimum latency, each thread or process can have direct access to memory throughout the system. It can be used to develop applications in programming languages C, C++, and Fortran on many platforms including Solaris, AIX, HP-UX, Linux, macOS, and Windows. OMP is supported by major computer hardware and software vendors, and is characterized by high-level parallelism, portability, scalability, and simplicity of use.

CUDA is an API developed by Nvidia Corporation. The CUDA platform (Nvidia Corporation, Santa Clara, CA, USA) is a software layer that provides a dramatic increase in computing performance by using the power of CUDA-enabled GPUs. It is designed to work with programming languages such as C, C++, and Fortran. In contrast to APIs like Direct3D and OpenGL, which require advanced skills in graphics programming, CUDA makes it easier for specialists to use GPU resources by use of virtual instruction set and parallel computational elements. It also supports programming frameworks such as OpenACC and OpenCL.

The paper is organized as follows. Having introduced the problem in Section 1, representations of the fractional-order derivative/difference and the fractional-order state-space system are presented in Section 2. Section 3 gives an introduction to the different parallel architectures that are considered in this paper. The detailed description, pseudo-code, and implementation method for the used parallel algorithms are also presented in Section 3. Simulation examples of Section 4 provide a comparative analysis of the introduced algorithm approaches as well as presented architectures. The analysis is accomplished in terms of high modeling accuracy and high time efficiency. Conclusions in Section 5 complete the paper.

Consider a continuous-time linear time-invariant (LTI) state-space fractional-order system described by the following equations:
where $A\in {\Re}^{n\times n}$, $B\in {\Re}^{n\times {n}_{u}}$, $C\in {\Re}^{{n}_{y}\times n}$, and $D\in {\Re}^{{n}_{y}\times {n}_{u}}$ are the system matrices; ${n}_{u}$ and ${n}_{y}$ are the number of inputs and outputs, respectively; t is the continuous time; and ${D}^{\alpha}=\mathrm{diag}\{{D}^{{\alpha}_{1}},\dots ,{D}^{{\alpha}_{n}}\}$ is the matrix ${D}^{\alpha}\in {\Re}^{n\times n}$ consisting of a fractional-order derivative ${D}^{{\alpha}_{i}}$ of order ${\alpha}_{i}$, $i=1,\dots ,n$. The system (1) is commonly considered in a simplified commensurate fractional-order form, where the fractional-order $\alpha ={\alpha}_{i}$, $i=1,2,\dots ,n$. In this case, ${D}^{\alpha}\in \Re $ denotes a fractional-order derivative of order $\alpha $. The fractional-order state-space system (1) is one of the most popular methods used to describe fractional-order processes. By using Equation (1), we can present fractional-order differential equations in a simple, matrix/vector form. Therefore, we can find many uses of the state-space system (1) in practical applications, e.g., in electrical circuits [4,6], in modeling of thermal and diffusion processes [12,13,15], in medicine [8,16], etc.

$$\begin{array}{cc}\hfill {D}^{\alpha}x\left(t\right)=& Ax\left(t\right)+Bu\left(t\right),\hfill \\ \hfill y\left(t\right)=& Cx\left(t\right)+Du\left(t\right),\hfill \end{array}$$

The fractional-order derivative is often described by using one of three definitions, that is, the Riemman-Liouville (RL), Caputo, or Grünwald-Letnikov (GL) definitions. Regarding the practical implementation point of view, in this paper, we use the Grünwald-Letnikov derivative defined as
where $\alpha $∈$(0,2)$, h is the sampling period, $\Gamma (.)$ denotes the Gamma function and $\left(\genfrac{}{}{0pt}{}{\alpha}{j}\right)$, $j=1,2,\dots $, are the Newton binomial coefficients. Taking into account poor numerical feasibility of the Gamma function, we use the definition as in Equation (3). The GL definition is equivalent to the RL definition discretized by the use of the fixed-step discretization scheme. Additionally, in the case of homogenous initial value, the GL is equivalent to the Caputo definition. Moreover, by using specific correction coefficients, we can easily calculate the Caputo derivative by the use of the extended GL definition [30]. Therefore, the GL derivative based on finite length implementation can be used as the approximation of the Riemann–Liouville and Caputo derivatives. The main advantages of the GL definition are (1) they can be easily calculated in the recursive way, by using robust and numerically stable algorithms; and (2) an error of the finite-length GL approximation can be easily calculated, both for fractional-order difference/derivative and for the whole fractional-order system [31]. The main disadvantage of the finite-length GL is a low convergence rate. A detailed analysis of the effectiveness of finite-length GL as compared to other approximation methods can be found in [32]. In the literature, we can find several other definitions of fractional-order derivatives/differences [15,18,32]. In particular, He’s derivative is shown to provide a good numerical performance in specific applications [12,15]. Analysis of the state-space system using this definition will be a topic of our further work.

$$\begin{array}{ccc}\hfill {D}^{\alpha}{x\left(t\right)|}_{t=kh}& =& \underset{h\to 0}{lim}\frac{1}{{h}^{\alpha}}\sum _{j=0}^{\infty}{(-1)}^{j}\frac{\Gamma (\alpha +1)}{j!\phantom{\rule{4pt}{0ex}}\Gamma (\alpha -n+1)}x(t-jh),\hfill \end{array}$$

$$\begin{array}{cc}=& \underset{h\to 0}{lim}\frac{1}{{h}^{\alpha}}\sum _{j=0}^{\infty}{(-1)}^{j}\left(\genfrac{}{}{0pt}{}{\alpha}{j}\right)x(t-jh),\hfill \end{array}$$

In order to solve the system (1) incorporating the GL derivative (3), a typical way is to use the simple fixed-step Euler method for calculation of fractional-order equation. Assuming that $x\left(l\right)=0\phantom{\rule{4pt}{0ex}}\forall \phantom{\rule{4pt}{0ex}}l\le 0$, we obtain
or using a discrete-time formulation for $k=0,1,\dots $
where ${x}_{k}$ is the state vector defined in discrete time k and ${\Delta}^{\alpha}{x}_{k}$ denotes the discrete-time fractional difference
with $\alpha $∈ (0,2), ${q}^{-1}$ being the backward shift operator and

$${D}^{\alpha}{x\left(t\right)|}_{t=kh}\approx \frac{1}{{h}^{\alpha}}\sum _{j=0}^{k}{(-1)}^{j}\left(\genfrac{}{}{0pt}{}{\alpha}{j}\right)x(t-jh),$$

$${D}^{\alpha}{x\left(t\right)|}_{t=kh}\approx \frac{{\Delta}^{\alpha}{x}_{k}}{{h}^{\alpha}},$$

$${\Delta}^{\alpha}{x}_{k}=\sum _{j=0}^{k}{P}_{j}\left(\alpha \right){x}_{k}{q}^{-j}={x}_{k}+\sum _{j=1}^{k}{P}_{j}\left(\alpha \right){x}_{k}{q}^{-j}\phantom{\rule{14.22636pt}{0ex}}k=0,1,\dots $$

$${P}_{j}\left(\alpha \right)={(-1)}^{j}\left(\genfrac{}{}{0pt}{}{\alpha}{j}\right).$$

Note that the Newton binomials can be calculated by use of the simple, well-known formula

$$\left(\genfrac{}{}{0pt}{}{\alpha}{j}\right)=\left\{\begin{array}{c}1\phantom{\rule{76.82234pt}{0ex}}j=0\hfill \\ \frac{\alpha (\alpha -1)...(\alpha -j+1)}{j!}\phantom{\rule{14.22636pt}{0ex}}j>0.\hfill \end{array}\right.$$

Usually, in the fractional-order state-space systems, we use a forward-shifted form of the fractional-order difference (see e.g., [7]).

$${\Delta}^{\alpha}{x}_{k+1}=A{h}^{\alpha}{x}_{k}+B{h}^{\alpha}{u}_{k}.$$

$${y}_{k}=C{x}_{k}+D{u}_{k}.$$

Combining Equations (6) and (9), we obtain the formula for calculating a fractional-order discrete-time state equation:

$${x}_{k+1}=(A{h}^{\alpha}+\alpha I){x}_{k}-\sum _{j=2}^{k+1}{P}_{j}\left(\alpha \right){x}_{k-j+1}+B{h}^{\alpha}{u}_{k}.$$

Note that each incoming sample of the signal ${x}_{k}$ increases the complexity of both the fractional-order difference (Equation (11)) and the whole fractional-order system (Equations (9) and (10)). This leads to the computational explosion for $k\to \infty $. To avoid this, a finite-length version of the fractional-order difference is used (see e.g., [7,10,33])
where L is the upper bound for j.

$${\Delta}_{L}^{\alpha}{x}_{k}={x}_{k}+\sum _{j=1}^{L}{P}_{j}\left(\alpha \right){x}_{k-j}\phantom{\rule{14.22636pt}{0ex}}k=0,1,\dots ,$$

Now, combining Equations (9) and (12), we obtain the finite-length formula for calculation of the state equation

$${x}_{k+1}=(A{h}^{\alpha}+\alpha I){x}_{k}-\sum _{j=2}^{L}{P}_{j}\left(\alpha \right){x}_{k-j+1}+B{h}^{\alpha}{u}_{k}.$$

It is important to note that the results presented in Equations (12) and (13) for $k>L$ define the approximations of the fractional-order difference and fractional-order state-equation, respectively.

Problem formulation: It is important that convergence of the series ${P}_{j}\left(\alpha \right)$ depends on the order $\alpha $ and is quite slow, in particular for a low value of $\alpha $. Therefore, accurate approximation of Equation (5), and consequently Equation (12), requires a very high implementation length L. Exemplary norm $H(\alpha ,L)=\left|\right|{\Delta}_{L}^{\alpha}\mathbf{1}\left(t\right)-{\Delta}^{\alpha}\mathbf{1}\left(t\right){\left|\right|}_{{L}_{\infty}}$, where $\mathbf{1}\left(t\right)$ is the Heaviside step function, is $H(0.5,3180)=0.01$, but to obtain the similar accuracy for $\alpha =0.3$ we need as high a length as L = 2,000,000 (i.e., h(0.3, 2,000,000) = 0.01). Moreover, implementation of the fractional-order difference into the state-space system may require much higher L to obtain the same accuracy [31]. Therefore, real-time applications of both fractional-order difference and fractional-order state-space system for ’fast’ systems (with small sampling periods h) may require realization of the parallelization scheme for the calculation process.

Firstly, we introduce a parallel method for calculating the fractional-order difference. In the next step, the method is extended to the calculation of the fractional-order state-space system. Finally, a new hierarchical parallelization scheme for the calculation of the fractional-order system is proposed. It is important to note that the fractional-order difference incorporated into the fractional-order system leads to computational complication of the state equation only. The calculation of this equation, as is presented in the previous section, constitutes a computational burden process. In contrast, the output equation of the fractional-order system is the same as for the integer-order case and is based on simple vector/matrix operations. Therefore, we consider the calculation of the state equation only.

Consider the fractional-order difference of Equation (12). In order to implement the parallelization scheme, we have to divide the summation process into N independent parts
where N is the number of parts/workers, $\lfloor .\rfloor $ denotes the floor function, and
with ${\mathcal{P}}_{i}=[{P}_{\lfloor (i-1)L/N\rfloor +1}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{P}_{\lfloor iL/N\rfloor}\left(\alpha \right)]$, and ${X}_{i}^{T}=[{x}_{k-\lfloor (i-1)L/N\rfloor -1},\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{x}_{k-\lfloor iL/N\rfloor}]$, $i=1,\dots ,N$. Now, in the parallelization process we delegate calculation of elements ${\Phi}_{i}$ on particular workers. Finally, a block diagram of the calculation process for the fractional-order difference is presented in Figure 1 and the calculation algorithm is presented as Algorithm 1.

$$\begin{array}{ccc}\hfill {\Delta}_{L}^{\alpha}{x}_{k}& =& {x}_{k}+\sum _{i=1}^{N}\left[\sum _{j=\lfloor (i-1)L/N\rfloor +1}^{\lfloor iL/N\rfloor}{P}_{j}\left(\alpha \right){x}_{k-j}\right],\hfill \end{array}$$

$$\begin{array}{cc}=& {x}_{k}+\sum _{i=1}^{N}{\Phi}_{i},\hfill \end{array}$$

$${\Phi}_{i}=\sum _{j=\lfloor (i-1)L/N\rfloor +1}^{\lfloor iL/N\rfloor}{P}_{j}\left(\alpha \right){x}_{k-j}={\mathcal{P}}_{i}{X}_{i}^{T}$$

Algorithm 1: Parallel algorithm for one-step computation of fractional-order difference. |

Note that Equation (12) can be considered as both a) finite implementation of the fractional-order difference when $L<t$ and b) a fractional-order difference when $L=t$. Consequently, Algorithm 1 represents a parallel implementation for both cases.

The number of workers N used in Algorithm 1 is bounded by the following condition $N\le min(L,\phantom{\rule{4pt}{0ex}}k)$. In addition, from a feasibility point of view, the number of workers N should be less than physical cores in the hardware environment.

The parallelization algorithm presented in the previous subsection can be immediately applied to the fractional-order state-space system. In this case, one-step calculation process is as follows:
where
with ${\mathcal{P}}_{i}=[{P}_{\lfloor (i-1)(L-1)/N\rfloor +2}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{P}_{\lfloor i(L-1)/N\rfloor +1}\left(\alpha \right)]$, and ${X}_{i}^{T}=[{x}_{k-\lfloor (i-1)(L-1)/N\rfloor},\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{x}_{k-\lfloor i(L-1)/N\rfloor}]$, where ${\mathcal{P}}_{i}\in {\Re}^{1\times \lfloor i(L-1)/N\rfloor -\lfloor (i-1)(L-1)/N\rfloor}$ and ${X}_{i}\in {\Re}^{\lfloor i(L-1)/N\rfloor -\lfloor (i-1)(L-1)/N\rfloor \times n}$. Note that each row of the vector ${\Phi}_{i}$ is quite similar to those of Equation (16), the difference being only the single forward time shift. Note that, in this case, particular workers calculate ${\Phi}_{i}$, $i=1,\dots ,N$, and the master worker calculates the next step of system states (Equation (18)) and the output signal on the basis of Equation (1). The parallelization process for the above scheme is presented in Algorithm 2. As in Algorithm 1, the number of workers N used in Algorithm 2 is bounded by the condition given in Remark 2.

$$\begin{array}{ccc}\hfill {x}_{k+1}& =& (A{h}^{\alpha}+\alpha I){x}_{k}+B{h}^{\alpha}{u}_{k}-\sum _{i=1}^{N}\left[\sum _{j=\lfloor (i-1)(L-1)/N\rfloor +1}^{\lfloor i(L-1)/N\rfloor}{P}_{j+1}\left(\alpha \right){x}_{k-j}\right],\hfill \end{array}$$

$$\begin{array}{cc}=& (A{h}^{\alpha}+\alpha I){x}_{k}+B{h}^{\alpha}{u}_{k}-\sum _{i=1}^{N}{\Phi}_{i},\hfill \end{array}$$

$${\Phi}_{i}=\sum _{j=\lfloor (i-1)(L-1)/N\rfloor +1}^{\lfloor i(L-1)/N\rfloor}{P}_{j+1}\left(\alpha \right){x}_{k-j}={\mathcal{P}}_{i}{X}_{i}^{T}$$

Algorithm 2: Parallel algorithm for one-step calculation of fractional-order system. |

Furthermore, we can calculate the next values of state vector ${x}_{k+1}$ on the basis of the following equations:
where
with ${A}_{\alpha}=A-I\alpha $, ${\tilde{P}}_{j}\left(\alpha \right)=I{P}_{j}\left(\alpha \right)$, $j=2,\dots ,L$. The sizes of matrices are as follows: $\tilde{P}\in {\Re}^{n\times nL+{n}_{u}}$ and ${\tilde{X}}_{k}\in {\Re}^{1\times nL+{n}_{u}}$. In the case of modeling the noncommensurate-order system, we can still use Equation (20), where the elements ${\tilde{P}}_{j}\left(\alpha \right)$, $j=2,\dots ,L$ are substituted by ${\tilde{P}}_{j}=\mathrm{diag}\{{P}_{j}\left({\alpha}_{1}\right),\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{P}_{j}\left({\alpha}_{n}\right)\}$.

$$\begin{array}{ccc}\hfill {x}_{k+1}& =& \tilde{P}{\tilde{X}}_{k},\hfill \end{array}$$

$$\begin{array}{ccc}\hfill \tilde{P}& =& \left[{A}_{\alpha},\phantom{\rule{4pt}{0ex}}{\tilde{P}}_{2}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{\tilde{P}}_{L}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}B{h}^{\alpha}\right],\hfill \end{array}$$

$$\begin{array}{ccc}\hfill {\tilde{X}}_{k}& =& {\left[{x}_{k}^{T},\phantom{\rule{4pt}{0ex}}{x}_{k-1}^{T},\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{x}_{k-L}^{T},\phantom{\rule{4pt}{0ex}}{u}_{k}^{T}\right]}^{T}\hfill \end{array}$$

Note that Equation (20) is a simple matrix form of the state equation, but the dimensions of the matrix $\tilde{P}$ and vector ${\tilde{X}}_{k}$ are large. Therefore, using Equation (20) is not effective from the computational complexity point of view. Taking into account that $\tilde{P}$ is the sparse matrix, we can present Equation (20) in a more computationally effective form as
where ${\Phi}_{k}^{i}={\tilde{P}}^{i}{\tilde{X}}_{k}^{i}$ and
with $i=1,\dots ,n$, ${a}_{i,j}^{\alpha}$, $j=1,\dots ,n$ are the entries in the i-th row of the matrix ${A}_{\alpha}$ and ${b}_{i,j}$ are the elements in i-th row of the matrix B, respectively. Now, we can apply the parallelization algorithm to Equation (23) and calculate ${\Phi}_{k}^{i}$ in various processes. Moreover, we can implement the parallelization scheme to calculate particular ${\Phi}_{k}^{i}$, $i=1,\dots ,n$, as follows
where
with ${\tilde{p}}^{i,m}$ and ${\tilde{x}}_{k}^{i,m}$, $m=1,\dots ,n+L+{n}_{u}$ being the m-th elements of the vectors ${\tilde{P}}^{i}$ and ${\tilde{X}}_{k}^{i}$, respectively. As a result of the parallelization scheme for Equations (23)–(27), we obtain a kind of a hierarchical parallelization process. Firstly, the calculation for time step k is divided into n parts, computing ${\Phi}_{k}^{i}$, $i=1,\dots ,n$. Then, the calculation of each ${\Phi}_{k}^{i}$ is distributed on M subtasks, calculating ${\Phi}_{k}^{i,j}$, $j=1,\dots ,M$. The block diagram of the hierarchical parallelization scheme is presented in Figure 2 and the pseudo-code is presented in Algorithm 3.

$${x}_{k+1}=\left[\begin{array}{c}{x}_{k+1}^{1}\\ \vdots \\ {x}_{k+1}^{n}\end{array}\right]=\left[\begin{array}{c}{\Phi}_{k}^{1}\\ \vdots \\ {\Phi}_{k}^{n}\end{array}\right],$$

$$\begin{array}{ccc}\hfill {\tilde{P}}^{i}& =& \left[{a}_{i,1}^{\alpha},\phantom{\rule{4pt}{0ex}}...,\phantom{\rule{4pt}{0ex}}{a}_{i,n}^{\alpha},\phantom{\rule{4pt}{0ex}}{P}_{2}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}...,\phantom{\rule{4pt}{0ex}}{P}_{L}\left(\alpha \right),\phantom{\rule{4pt}{0ex}}{b}_{i,1},\dots ,\phantom{\rule{4pt}{0ex}}{b}_{i,{n}_{u}}\right],\hfill \end{array}$$

$$\begin{array}{ccc}\hfill {\tilde{X}}_{k}^{i}& =& {\left[{x}_{k}^{T},\phantom{\rule{4pt}{0ex}}{x}_{k-1}^{i},\phantom{\rule{4pt}{0ex}}\dots ,\phantom{\rule{4pt}{0ex}}{x}_{k-L}^{i},\phantom{\rule{4pt}{0ex}}{u}_{k}^{T}\right]}^{T}\hfill \end{array}$$

$${\Phi}_{k}^{i}=\sum _{j=1}^{M}{\Phi}_{k}^{i,j},$$

$${\Phi}_{k}^{i,j}=\sum _{m=\lfloor (j-1)\frac{n+L+{n}_{u}}{M}\rfloor +1}^{\lfloor j\frac{n+L+{n}_{u}}{M}\rfloor}{\tilde{p}}^{i,m}{\tilde{x}}_{k}^{i,m}$$

Algorithm 3: Hierarchical parallel algorithm for one-step calculation of fractional-order system. |

Finally, in the hierarchical parallelization algorithm presented above, to calculate state vector ${x}_{k+1}$, we use $(M+1)\times n+1$ workers.

The number M used in Algorithm 3 has to fulfill the following condition: $M\le min(L,\phantom{\rule{4pt}{0ex}}k)$.

The parallelization methods for the fractional-order difference in Equation (14) and for the fractional-order system in Equation (17) can be used for a wide spectrum of multi-core processors. On the other hand, the hierarchical parallelization method presented in Figure 2 requires a relatively high number of cores, therefore, the method is recommended in case of using Massively Parallel Processors (MPP).

In this section, we present the implementation results of parallelization methods both for the fractional-order difference and the fractional-order state-space system. For analysis, the CPU- and GPU-based hardware environments have been used.

In simulation experiments, we consider the fractional-order state-space system $\{A,B,C,D\}$ as follows:
with both the fractional-order difference and fractional-order system excited by the Heaviside step function $u\left(t\right)=\mathbf{1}\left(t\right)$.

$$\begin{array}{cc}\hfill A=& \left[\begin{array}{cccccccc}-3.6557& -8.5928& -8.9663& -5.2783& -1.6870& -0.2600& -0.0171& -0.0012\\ 1& 0.8& 0& 0& 0& 0& 0& 0\\ 0& 1& 0.8& 0& 0& 0& 0& 0\\ 0& 0& 1& 0.8& 0& 0& 0& 0\\ 0& 0& 0& 1& 0.8& 0& 0& 0\\ 0& 0& 0& 0& 1& 0.8& 0& 0\\ 0& 0& 0& 0& 0& 1& 0.8& 0\\ 0& 0& 0& 0& 0& 0& 1& 0.8\end{array}\right],\hfill \\ \hfill B=& {\left[\begin{array}{cccccccc}0& 0& 0& 1& 0& 1& 0& 0\end{array}\right]}^{T},\phantom{\rule{1.em}{0ex}}C=\left[\begin{array}{cccccccc}0& -1& 0& 0& 1& 0& 0& 0\end{array}\right],\phantom{\rule{1.em}{0ex}}D=\left[\begin{array}{c}0\end{array}\right],\hfill \\ \hfill \alpha =& \phantom{\rule{3.33333pt}{0ex}}0.7\hfill \end{array}$$

In case of calculation on the CPU-based hardware, numerical simulations are carried out on a computing node equipped with the Ubuntu $14.04$ (Canonical Ltd., London, UK) operating system and an Intel Xeon E5−2650 v3 CPU (Intel Corporation, Santa Clara, CA, USA) with a basic frequency of 2.3 GHz. The hardware system offers 10 physical cores (20 threads based on the hyperthreading technology). During the simulations, we use one thread per physical core only, since the hyperthreading technology is not suitable in our task (see [34,35]). All the calculation algorithms in this subsection are implemented by use of the C++ programming language and the OMP library. The OMP library is based on the shared memory concept, therefore, explicit data distribution techniques are not desirable in this case. For evaluation of parallelization efficiency, solely the simulation times have been taken into consideration because all the methods provide the same calculation results.

Consider fractional-order difference (12) with $\alpha =0.7$. The difference is implemented by use of the parallelization Algorithm 1. The simulation times for various implementation lengths $L=[{2}^{16},{2}^{17},{2}^{18},{2}^{19}]$ and various numbers of cores $N=[1,2,4,8]$ are presented in Figure 3, Figure 4, Figure 5 and Figure 6.

As we can see in Figure 3, Figure 4, Figure 5 and Figure 6, the parallelization process decreases execution times of the one-step calculation process for fractional-order difference. The speedup for 8 cores varies from $S=4.17$ for $L={2}^{19}$ to $S=4.58$ for $L={2}^{16}$. So, the acceleration is similar for the considered lengths.

Consider the fractional-order state-space system presented in the introduction of Section 4. The system is calculated by the parallel scheme introduced in Algorithm 2. The calculation process is executed in CPU-based hardware for implementation lengths $L=[{2}^{14},{2}^{15},{2}^{16},{2}^{17}]$. Results in terms of execution times for number of cores $N=[1,2,4,8]$ are presented in Figure 7, Figure 8, Figure 9 and Figure 10.

We can see from Figure 7, Figure 8, Figure 9 and Figure 10 that, again, the parallelization process decreases execution times of the one-step calculation process for the fractional-order system. The speedup for 8 cores in this case varies from $S=3.65$ for $L={2}^{17}$ to $S=3.84$ for $L={2}^{14}$. This means, again, that the effectiveness of the parallelization is similar for the considered lengths.

We compared results obtained by the introduced methodology with time effectiveness of model implementation in the Matlab environment. In the case of using a single-core method, we obtain times from $1.0\times {10}^{-3}$ for $L={2}^{14}$ to $5.4\times {10}^{-3}$ for $L={2}^{18}$, therefore, the times are significantly higher than in the case where the OpenMP environment is used. Moreover, the implementation of parallel computing by the use of Parallel Computing Toolbox increases calculation times compared to a single-core approach. This is a result of the specific construction of CPU-based parallelization methods in the Matlab environment.

In the case of GPU-based implementation, we use hardware with two Tesla K80 (Nvidia Corporation, Santa Clara, CA, USA) accelerators with a dual-GPU design that consists of 4992 Nvidia CUDA threads, 24GB of GDDR5 memory, 480GB/s aggregate memory bandwidth, and up to 2.91 teraflops of double-precision operations. Accelerators operate on a computing node equipped with the operating system Windows Server 2012 R2 (Microsoft Corporation, Redmond, WA, USA) and an Intel Xeon $E5-2683$ v3 CPU (Intel Corporation, Santa Clara, CA, USA) with a basic frequency of $2.0$ GHz. In contrast to OMP, CUDA does not support globally shared memory, so data distribution and memory allocation must be performed manually by proper data transfers between all processing units. CUDA enables the overlapping of some operations without losing much performance, but still, improper management of data distribution can result in poor time results. CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs), creating a grid of threads arranged within blocks. Each SM can run multiple concurrent thread blocks. Tesla K80 GPU can support up to 1024 active threads in one working block. To take full advantage of all these threads, the program code must be executed with multiple thread blocks.

Consider the fractional-order state-space system presented in the introduction of this section. The system is calculated by use of parallelization Algorithm 3. Times of the one-step calculation process for implementation lengths $L=[{2}^{15},{2}^{17}]$ and various numbers of cores are presented in Figure 11. Moreover, Figure 11 presents data transfer times (red and yellow) and execution times (blue and green).

We can see that in the case of using GPU-based hardware and Algorithm 3, we obtain an effective tool for the distributed calculation of the fractional-order systems. For instance, increasing the number of processors 16 times (from 256 to 4096) for implementation length $L={2}^{17}$ leads to speedup $S=10.22$. In contrast, for $L={2}^{15}$ in the same case, we obtain $S=4.00$. Therefore, the parallelization process is much more effective for longer implementations of fractional-order systems. Additionally, we can see that the data transfer times are longer for high numbers of threads, but are still relatively short compared to the times of calculation.

Again, we compared the effectiveness of the proposed methodology with implicit GPU-support tools in the Matlab Parallel Computing Toolbox. Finally, we obtain execution times from $1.3\times {10}^{-3}$ for $L={2}^{15}$ to $3.0\times {10}^{-3}$ for $L={2}^{17}$. The times are higher in the case of the use of 256 workers for the hierarchical parallelization scheme introduced in the paper. Taking into account that Matlab is a high-level environment, where GPU-support is based on the same CUDA software as we use in our implementation, we can see that the Matlab parallelization algorithms are visibly less effective than for those considered in the paper.

This paper has presented new parallelization algorithms for calculation of both the GL fractional-order derivative/difference and the fractional-order state-space system. For the fractional-order system, we introduce two different parallelization methods. The first method is dedicated to use with classical hardware and relatively low numbers of cores, and the second one is designed for Massively Parallel Processors. In simulation examples, we use computers with (a) an Intel Xeon $E5-2650$ v3 CPU (Intel Corporation, Santa Clara, CA, USA) and (b) Tesla K80 (Nvidia Corporation, Santa Clara, CA, USA) accelerators with a dual-GPU. Simulation examples confirm that the introduced methods can be effectively used in the accurate approximation of the fractional-order systems, in particular for high calculation lengths. The direction of our future research will be focused on numerical methods for solving fractional-order systems based on different fractional-order derivatives/differences, as well as their parallel implementation algorithms.

Conceptualization, R.S.; Formal analysis, R.S.; Methodology, R.S.; Project administration, R.S.; Software, K.K.; Validation, K.K.; Visualization, K.K.; Writing-original draft, R.S. and K.K.

This research received no external funding.

The authors declare no conflict of interest.

- Bauer, W.; Rydel, M. Application of Reduced Models of Non-Integer Order Integrator to the Realization of PI
^{λ}D Controller. In Proceedings of the 39th International Conference on Telecommunications and Signal Processing, Vienna, Austria, 27–29 June 2016. [Google Scholar] - Ferreira, R.A.C.; Tenreiro Machado, J. An Entropy Formulation Based on the Generalized Liouville Fractional Derivative. Entropy
**2019**, 21, 638. [Google Scholar] [CrossRef] - Kaczorek, T. Practical stability of positive fractional discrete-time linear systems. Bull. Pol. Acad. Sci. Tech. Sci.
**2008**, 56, 313–317. [Google Scholar] - Kaczorek, T. Singular fractional linear systems and electrical circuits. Int. J. Appl. Math. Comput. Sci.
**2011**, 21, 379–384. [Google Scholar] [CrossRef] - Karci, A. Fractional order entropy: New perspectives. Optik
**2016**, 127, 9172–9177. [Google Scholar] [CrossRef] - Latawiec, K.J.; Stanisławski, R.; ukaniszyn, M.; Czuczwara, W.; Rydel, M. Fractional-order modeling of electric circuits: Modern empiricism vs. classical science. In Proceedings of the Progress in Applied Electrical Engineering, Kościelisko, Poland, 25–30 June 2017. [Google Scholar]
- Monje, C.; Chen, Y.; Vinagre, B.; Xue, D.; Feliu, V. Fractional-order Systems and Controls; Springer-Verlag: London, UK, 2010. [Google Scholar]
- Muhammad Altaf, K.; Atangana, A. Dynamics of Ebola Disease in the Framework of Different Fractional Derivatives. Entropy
**2019**, 21, 303. [Google Scholar] [CrossRef] - Rydel, M.; Stanislawski, W. Selection of reduction parameters for complex plant MIMO LTI models using the evolutionary algorithm. Math. Comput. Simul.
**2017**, 140, 94–106. [Google Scholar] [CrossRef] - Stanisławski, R.; Rydel, M.; Latawiec, K.J. Modeling of discrete-time fractional-order state space systems using the balanced truncation method. J. Frankl. Inst.
**2017**, 354, 3008–3020. [Google Scholar] [CrossRef] - Ubriaco, M.R. Entropies based on fractional calculus. Phys. Lett. A
**2009**, 373, 2516–2519. [Google Scholar] [CrossRef] - Wang, Y.; Zhang, Y.F.; Liu, J.G.; Iqbal, M. A short review on analytical methods for fractional equations with He’s fractional derivative. Therm. Sci.
**2017**, 21, 1567–1574. [Google Scholar] [CrossRef] - Zúñiga-Aguilar, C.J.; Romero-Ugalde, H.M.; Gómez-Aguilar, J.F.; Escobar-Jiménez, R.F.; Valtierra-Rodríguez, M. Solving fractional differential equations of variable-order involving operators with Mittag-Leffer kernel using artificial neural networks. Chaos Solitons Fractals
**2017**, 103, 382–403. [Google Scholar] [CrossRef] - Oprzędkiewicz, K.; Mitkowski, W. A Memory-Efficient Noninteger-Order Discrete-Time State-Space Model of a Heat Transfer Process. Int. J. Appl. Math. Comput. Sci.
**2018**, 28, 649–659. [Google Scholar] [CrossRef] - Wang, K.L.; Liu, S.Y. He’s fractional derivative and its application for fractional Fornberg-Whitham equation. Therm. Sci.
**2017**, 21, 2049–2055. [Google Scholar] [CrossRef] - Lichae, B.H.; Biazar, J.; Ayati, Z. The Fractional Differential Model of HIV-1 Infection of CD
^{+}T-Cells with Description of the Effect of Antiviral Drug Treatment. Comput. Math. Methods Med.**2019**, 2019, 4059549. [Google Scholar] [CrossRef] [PubMed] - Solís-Pérez, J.E.; Gómez-Aguilar, J.F.; Torres, L.; Escobar-Jiménez, R.F.; Reyes-Reyes, J. Fitting of experimental data using a fractional Kalman-like observer. ISA Trans.
**2019**, 88, 153–169. [Google Scholar] [CrossRef] [PubMed] - He, J.H. Fractal calculus and its geometrical explanation. Results Phys.
**2018**, 10, 272–276. [Google Scholar] [CrossRef] - Alzabut, J.; Sudsutad, W.; Kayar, Z.; Baghani, H. A New Gronwall–Bellman Inequality in Frame of Generalized Proportional Fractional Derivative. Mathematics
**2019**, 7, 747. [Google Scholar] [CrossRef] - Zúniga-Aguilar, C.J.; Coronel-Escamilla, A.; Gómez-Aguilar, J.F.; Alvarado-Martínez, V.M.; Romero-Ugalde, H.M. New numerical approximation for solving fractional delay differential equations of variable order using artificial neural networks. Eur. Phys. J. Plus
**2018**, 133, 75. [Google Scholar] [CrossRef] - Tanaka, H.-A.; Nakagawa, M.; Oohama, Y. A Direct Link between Rényi-Tsallis Entropy and Hölder’s Inequality-Yet Another Proof of Rényi-Tsallis Entropy Maximization. Entropy
**2019**, 21, 549. [Google Scholar] [CrossRef] - Ibrahim, R.W.; Darus, M. Analytic Study of Complex Fractional Tsallis’ Entropy with Applications in CNNs. Entropy
**2018**, 20, 722. [Google Scholar] [CrossRef] - Herlihy, M.; Shavit, N. The Art of Multiprocessor Programming; Morgan Kaufmann: Burlington, NJ, USA, 2008. [Google Scholar]
- Pacheco, P. An Introduction to Parallel Programming; Morgan Kaufmann: Burlington, NJ, USA, 2011. [Google Scholar]
- Diethelm, K. An Efficient Parallel Algorithm for the Numerical Solution of Fractional Differential Equations. Fract. Calc. Appl. Anal.
**2011**, 14, 475–490. [Google Scholar] [CrossRef] - Bonchis, C.; Kaslik, E.; Rosu, F. HPC optimal parallel communication algorithm for the simulation of fractional-order systems. J. Supercomput.
**2019**, 75, 1014–1025. [Google Scholar] [CrossRef] - Wang, Q.; Liu, J.; Gong, C.; Tang, X.; Fu, G.; Xing, Z. An efficient parallel algorithm for Caputo fractional reaction-diffusion equation with implicit finite-difference method. Adv. Differ. Equ.
**2016**, 2016, 207. [Google Scholar] [CrossRef] - Biala, T.A.; Khaliq, A.Q.M. Parallel algorithms for nonlinear time-space fractional parabolic PDEs. J. Comput. Phys.
**2018**, 375, 135–154. [Google Scholar] [CrossRef] - OpenMP Application Programming Interface. Available online: https://www.openmp.org/wp-content/uploads/ OpenMP-API-Specification-5.0.pdf (accessed on 4 September 2018).
- Scherer, R.; Kalla, S.L.; Tang, Y.; Huang, J. The Grünwald–Letnikov method for fractional differential equations. Comput. Math. Appl.
**2011**, 62, 902–917. [Google Scholar] [CrossRef] - Stanisławski, R.; Latawiec, K.J. Normalized finite fractional differences: Computational and accuracy breakthroughs. Int. J. Appl. Math. Comput. Sci.
**2012**, 22, 907–919. [Google Scholar] [CrossRef] - Stanisławski, R.; Latawiec, K.J.; ukaniszyn, M. A Comparative Analysis of Laguerre-Based Approximators to the Grunwald Letnikov Fractional-Order Difference. Math. Probl. Eng.
**2015**, 2015, 512104. [Google Scholar] - Podlubny, I. Fractional Differential Equations; Academic Press: San Diego, CA, USA, 1998. [Google Scholar]
- Koufaty, D.; Marr, D.T. Hyperthreading technology in the netburst microarchitecture. IEEE Micro
**2003**, 23, 56–65. [Google Scholar] [CrossRef] - Song, Y.; Kalogeropulos, S.; Tirumalai, P. Design and implementation of a compiler framework for helper threading on multi-core processors. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, St. Louis, MO, USA, 17–21 September 2005. [Google Scholar]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).