Open Access
Algorithms 2017, 10(4), 137; https://doi.org/10.3390/a10040137
Article
Weakly Coupled Distributed Calculation of Lyapunov Exponents for Non-Linear Dynamical Systems
^1 Centro de Desarrollo Aeroespacial, Instituto Politécnico Nacional, Ciudad de México 06010, Mexico
^2 Departamento de Ingeniería Agroindustrial, Universidad de Guanajuato, Campus Celaya-Salvatierra, Celaya, Guanajuato 38060, Mexico
^3 Centro de Investigación en Matemáticas, Guanajuato, Guanajuato 36240, Mexico
^4 Facultad de Ciencias, Universidad Nacional Autónoma de México, Ciudad Universitaria, Ciudad de México 04510, Mexico
* Correspondence: [email protected]; Tel.: +525557296000 (ext. 64665)
^† These authors contributed equally to this work.
Received: 19 October 2017 / Accepted: 13 November 2017 / Published: 7 December 2017
Abstract
Numerical estimation of Lyapunov exponents in non-linear dynamical systems carries a very high computational cost, owing to the large number of Runge–Kutta problems that must be solved. In this work we introduce a parallel implementation based on MPI (Message Passing Interface) for the calculation of the Lyapunov exponents of a multidimensional dynamical system, using a weakly coupled algorithm. Since we work on an academic high-latency cluster interconnected with a gigabit switch, the design has to be oriented towards reducing the number of messages required. With the design introduced in this work, the computing time is drastically reduced, and the obtained performance leads to close-to-optimal speedup ratios. The implemented parallelisation allows us to carry out many experiments for the calculation of several Lyapunov exponents with a low-cost cluster. The numerical experiments demonstrated high scalability, which we verified with up to 68 cores.
Keywords:
MPI; distributed memory; Lyapunov exponents; chaos theory; non-linear dynamical systems

1. Introduction
Although the study of non-linear differential equations began with the rise of differential equations themselves, the modern field of non-linear dynamical systems was formally born in 1962, when Edward Lorenz, an MIT meteorologist, computationally simulated a set of differential equations for fluid convection in the atmosphere. In those simulations he noticed a complicated behaviour that seemed to depend sensitively on initial conditions, and there he found the "Lorenz" strange attractor. The field of non-linear dynamical systems, particularly chaos theory, is a very active research field today, and its applications cover huge areas of science as well as other disciplines.
Nowadays, it is widely accepted that if a system of differential equations exhibits sensitive dependence on initial conditions, then its behaviour is catalogued as chaotic [1]. The most popular tools to measure sensitive dependence on initial conditions are the Lyapunov exponents (LE), named after the Russian mathematician A. M. Lyapunov (1857–1918) [2]. As Lyapunov exponents measure the separation of solutions with closely adjacent initial conditions once the system has evolved into a steady state (after a very long time), their numerical calculation has always entailed long processing times. Moreover, the finer the sweep over possible initial conditions, the smaller the time step of the simulation, and the larger the final simulation time, the better the obtained approximation to the theoretical Lyapunov exponent.
The numerical processing time problem is aggravated when the set of ODEs (Ordinary Differential Equations) is large, because for each degree of freedom, a Lyapunov exponent ought to be calculated. Historically, in order to alleviate computational load, different approximation techniques have been developed. The most popular is the calculation of the maximal Lyapunov exponent, which, if negative, guarantees that the system is not chaotic. Several methods to estimate the maximal Lyapunov exponent have been developed [3,4,5,6].
Nevertheless, the calculation of the maximal Lyapunov exponent only serves as a guide to detect whether or not a system is chaotic. In order to fully exploit the richness of the non-linear dynamical behaviour embodied in the set of ODEs, the full Lyapunov spectrum is required. The calculation of Lyapunov exponents is thus a high-computational-load problem that benefits from parallelisation techniques. Notwithstanding, the parallelisation might be tricky because of the dependence between solutions with adjacent initial conditions that is needed to calculate each exponent of the spectrum. Thus, in this work we present the weakly coupled distributed calculation of Lyapunov exponents on a high-latency cluster. This paper is organised as follows: in Section 2 we present the theory of Lyapunov exponents as well as the system of differential equations to be tackled in this paper. In Section 3 we describe the design of the application, focusing on the MPI-distributed implementation for a high-latency cluster. In Section 4 we describe the physical experiments, and in Section 5 the performance experiments along with their results. Finally, in Section 6 we present some final remarks.
2. Lyapunov Exponents in Non-Linear Dynamical Systems
There are several different definitions of Lyapunov exponents in the literature. For instance, there are at least two widely used definitions of Lyapunov exponents for linearised systems [7]. For the sake of simplicity, and in order to focus this paper on the development of the parallel scheme, we work with a simple definition of Lyapunov exponents for the fully non-linear dynamical system treated herein (see Section 2.1).
Let $\dot{r}=f(r,t)$ be a first-order differential equation. In order to solve it numerically, it must be discretised; for the sake of brevity, consider the Euler method. Then, the value of r at the nth time step is
$${r}_{n}={r}_{n-1}+\Delta t\,f({r}_{n-1},{t}_{n-1})$$
Now let ${t}_{0}$ and ${t}_{1}$ be two initial conditions such that ${t}_{1}-{t}_{0}=\delta \ll 1$, and let the values of r at the ($n+1$)-th time step of the simulations started at ${t}_{0}$ and ${t}_{1}$ be ${r}_{n}^{(0)}$ and ${r}_{n}^{(1)}$, respectively. Then, the separation of both solutions with adjacent initial conditions is defined to be
$$\left|{r}_{n}^{(0)}-{r}_{n}^{(1)}\right|=\delta \,{e}^{n\lambda}$$
where $\lambda$ is the Lyapunov exponent. From this definition, we can observe that if $\lambda$ is negative, the two trajectories converge (at time step n), but if it is positive, the nearby orbits diverge (at least exponentially), and chaotic dynamics arise. The Lyapunov exponent thus represents the average exponential growth per unit time between the two nearby states. With some algebra, it can be shown that $\lambda$ is
$$\lambda =\underset{n\to \infty}{\lim}\frac{1}{n}\sum_{i=0}^{n-1}\ln \left|\frac{{r}_{i}^{(0)}-{r}_{i}^{(1)}}{\delta}\right|$$
where the limit and the sum take into account the separation of adjacent trajectories (in initial conditions) over the whole dynamics of the system [8]. Again, from the last equation it can be seen that if the Lyapunov exponent is positive, the separation of the trajectories grows at least exponentially, indicating the presence of chaotic behaviour. The set of all values of $\lambda$ obtained when sweeping the interval of possible initial conditions constitutes the Lyapunov spectrum of the system. It must be noted that, for a given system, there are as many Lyapunov exponents as there are variables.
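To fix ideas, the estimation just described can be sketched in a few lines of Python (the production code of this work is Fortran; the one-dimensional system $\dot{r}=-r$ used here is a hypothetical test case whose exact exponent is $\lambda =-1$):

```python
import math

def f(r, t):
    # Hypothetical test dynamics: dr/dt = -r, whose exact Lyapunov exponent is -1.
    return -r

def lyapunov_estimate(r0, delta=1e-8, dt=1e-3, steps=100_000):
    """Estimate lambda from two trajectories with adjacent initial
    conditions, advanced with the Euler step r_n = r_{n-1} + dt*f(...).
    The separation is renormalised each step so it stays infinitesimal."""
    ra, rb = r0, r0 + delta
    total, t = 0.0, 0.0
    for _ in range(steps):
        ra += dt * f(ra, t)
        rb += dt * f(rb, t)
        t += dt
        sep = abs(rb - ra)
        total += math.log(sep / delta)      # accumulate ln|dr_i / delta|
        rb = ra + delta * (rb - ra) / sep   # renormalise the offset to delta
    return total / (steps * dt)             # average exponent per unit time
```

Renormalising the offset at every step is standard practice for long integrations; for this test system the estimate converges to $-1$ up to the $O(\Delta t)$ error of the Euler scheme.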
2.1. Coupled Oscillations Model
The set of ODEs to solve in this work is given by the following six first-order equations with variable coefficients:
$$\dot{x}={v}_{x}\equiv {V}_{x}\left({v}_{x}\right)$$
$$\dot{\theta}={v}_{\theta}\equiv {V}_{\theta}\left({v}_{\theta}\right)$$
$$\dot{\varphi}={v}_{\varphi}\equiv {V}_{\varphi}\left({v}_{\varphi}\right)$$
$$\dot{{v}_{x}}=X\left(x,\theta ,\varphi ,{v}_{x},{v}_{\theta},{v}_{\varphi},t\right)$$
$$\dot{{v}_{\theta}}=\Theta \left(x,\theta ,\varphi ,{v}_{x},{v}_{\theta},{v}_{\varphi},t\right)$$
$$\dot{{v}_{\varphi}}=\Phi \left(x,\theta ,\varphi ,{v}_{x},{v}_{\theta},{v}_{\varphi},t\right)$$
where $x,\theta ,\varphi ,{v}_{x},{v}_{\theta}$, and ${v}_{\varphi}$ are the dynamic variables (x is linear, and $\theta$ and $\varphi$ are angular variables), t is the curve parameter (time), the dot represents the t-derivative, and the functions X, $\Theta$ and $\Phi$ are given by
$$X=\frac{Ak\cos \omega t+\frac{g}{2}({m}_{\theta}\sin 2\theta +{m}_{\varphi}\sin 2\varphi )+{m}_{\theta}{l}_{\theta}{\dot{\theta}}^{2}\sin \theta +{m}_{\varphi}{l}_{\varphi}{\dot{\varphi}}^{2}\sin \varphi -kx-\gamma \dot{x}}{M-{m}_{\theta}{\cos}^{2}\theta -{m}_{\varphi}{\cos}^{2}\varphi}$$
$$\Theta =-\frac{g}{{l}_{\theta}}\sin \theta -X\frac{\cos \theta}{{l}_{\theta}}$$
$$\Phi =-\frac{g}{{l}_{\varphi}}\sin \varphi -X\frac{\cos \varphi}{{l}_{\varphi}}$$
where A, k, $\omega$, g, ${m}_{g}$, ${m}_{\theta}$, ${m}_{\varphi}$, ${l}_{\theta}$, ${l}_{\varphi}$, $\gamma$ and $M={m}_{g}+{m}_{\theta}+{m}_{\varphi}$ are physical parameters. Equations (1)–(6) represent three highly coupled (one harmonic and two pendular) oscillators, and were physically derived by Hernández-Gómez et al. [9]. Thus, in order to obtain the whole spectrum of Lyapunov exponents of the system, six of them ought to be calculated.
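Although the implementation of this work is written in Fortran, the right-hand side of Equations (1)–(6) and one classical fourth-order Runge–Kutta step can be sketched in Python as follows. The parameter values are those listed in Section 4; the explicit signs of the restoring and damping terms are our reading of the model and should be checked against [9]:

```python
import math

# Parameter values of Section 4 (g is the local value in Mexico City).
A, k, omega, g = 0.0055555, 5.15, 2 * math.pi, 9.78
m_g, m_t, m_p = 0.0686, 0.0049, 0.0049
l_t, l_p, gamma = 0.22, 0.22, 0.029
M = m_g + m_t + m_p

def rhs(state, t):
    """Right-hand side of Equations (1)-(6); state = (x, theta, phi, vx, vtheta, vphi)."""
    x, th, ph, vx, vt, vp = state
    X = (A * k * math.cos(omega * t)
         + 0.5 * g * (m_t * math.sin(2 * th) + m_p * math.sin(2 * ph))
         + m_t * l_t * vt ** 2 * math.sin(th)
         + m_p * l_p * vp ** 2 * math.sin(ph)
         - k * x - gamma * vx) / (M - m_t * math.cos(th) ** 2 - m_p * math.cos(ph) ** 2)
    Th = -(g / l_t) * math.sin(th) - X * math.cos(th) / l_t
    Ph = -(g / l_p) * math.sin(ph) - X * math.cos(ph) / l_p
    return (vx, vt, vp, X, Th, Ph)

def rk4_step(state, t, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(state, t)
    k2 = rhs(tuple(s + 0.5 * dt * d for s, d in zip(state, k1)), t + 0.5 * dt)
    k3 = rhs(tuple(s + 0.5 * dt * d for s, d in zip(state, k2)), t + 0.5 * dt)
    k4 = rhs(tuple(s + dt * d for s, d in zip(state, k3)), t + dt)
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))
```

Each Runge–Kutta instance of the experiments below repeatedly applies a step of this kind from a given initial condition.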
3. Parallel Application Design
The application is designed around message passing, since the platform is a cluster interconnected through a gigabit switch. The features of the cluster are:
A master node.
17 slave nodes, each with an Intel i5-4670 processor (four cores, no HT) and 32 GB of PC3-12800 RAM.
A TL-SG1024 24-port gigabit switch.
Rocks Cluster 6.2 OS.
Intel FORTRAN 17.0.1.
3.1. MPI Distributed Implementation
One of the big advantages of using MPI is that it is a standard for coding message-passing-based applications [10]. Although other message-passing implementations exist, such as the Java Parallel Virtual Machine (JPVM, http://www.cs.virginia.edu/ajf2j/jpvm.html) [11], MPI (http://mpiforum.org) is more widespread and is more commonly used, with implementations such as OpenMPI, MPICH, LAM/MPI and Intel MPI [12]. The version used in this work is Intel MPI for FORTRAN.
Considering the gigabit-based intercommunication interface of our cluster, we must minimise the message passing between nodes to obtain good performance. Therefore, we have to look for a design that overcomes the network limitations. The parallel design we developed requires that little data be sent; a gigabit network for intercommunication is therefore enough.
Basically, the computational problem consists in solving the system of Equations (1)–(6) for very long times, which in our case turns into solving a huge number of subproblems with the fourth-order Runge–Kutta (RK) method, slightly varying the initial conditions in each RK call to determine the Lyapunov coefficient between each pair of adjacent initial conditions. A first tactic one might think of is to parallelise the Runge–Kutta method itself, for which a series of procedures already exists for shared-memory architectures [13], as well as for technologies such as the Xeon Phi [14] and Graphics Processing Units (GPUs) [15]. There are also options for Runge–Kutta parallelisation using MPI [16], but they are not good options considering the network's latency. Thus, the best strategy is to decompose the N (RK) problems that must be solved and analyse their dependencies.
By analysing the algorithm, we see that it is necessary to execute $N+1$ Runge–Kutta instances, one for each initial condition in the set ${\left\{{C}_{n}\right\}}_{n=0}^{N}$. Obviously, in the serial implementation, the recalculation of the Runge–Kutta method is avoided by storing the previous adjacent result (in initial condition); that is, once the Lyapunov coefficient between ${C}_{i-1}$ and ${C}_{i}$ is calculated, the recalculation of ${C}_{i}$ is no longer necessary to obtain the Lyapunov coefficient between ${C}_{i}$ and ${C}_{i+1}$ in the next iteration, $\forall i\,(0<i<N)$. Therefore, the algorithm has no strong dependencies other than the adjacent ones, i.e., it is weakly coupled.
To divide tasks in a balanced way, we follow the procedure described by Couder-Castañeda et al. [17] and Arroyo et al. [18]. Let ${p}_{n}$ be the number of MPI processes and ${C}_{n}$ the number of problems to solve; then we define the problem numbers with which a process p must start and finish as ${p}_{\mathrm{start}}$ and ${p}_{\mathrm{end}}$, respectively. To determine ${p}_{\mathrm{start}}$ and ${p}_{\mathrm{end}}$ for each process p, we proceed as follows:
$$s:=\lfloor {C}_{n}/{p}_{n}\rfloor ,$$
$$r:={C}_{n}\ \mathrm{mod}\ {p}_{n}.$$
Therefore
$${p}_{\mathrm{start}}:=p\times s+1$$
and
$${p}_{\mathrm{end}}:=(p+1)\times s.$$
If $r\ne 0$ and $p<r$, then we make an adjustment as:
$${p}_{\mathrm{start}}:={p}_{\mathrm{start}}+p$$
and
$${p}_{\mathrm{end}}:={p}_{\mathrm{end}}+(p+1).$$
If $r\ne 0$ and $p\ge r$, then:
$${p}_{\mathrm{start}}:={p}_{\mathrm{start}}+r$$
and
$${p}_{\mathrm{end}}:={p}_{\mathrm{end}}+r.$$
In this way, the problems to be solved are distributed across the nodes. Because each process (except ${p}_{0}$) must also calculate the problem preceding its block, ${p}_{\mathrm{start}}$ is decremented by 1; that is, ${p}_{\mathrm{start}}={p}_{\mathrm{start}}-1$ for all processes from ${p}_{1}$ to ${p}_{n}$.
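The partitioning rules above, including the final decrement of ${p}_{\mathrm{start}}$, can be collected into one function. This Python sketch is illustrative (the original code is Fortran); it takes a 0-based rank p and returns the 1-based first and last problem indices:

```python
def partition(p, p_n, C_n):
    """Block-distribute C_n problems over p_n MPI ranks (rank p is 0-based),
    following the rules of Section 3.1. Returns 1-based (p_start, p_end)."""
    s, r = divmod(C_n, p_n)      # base block size and remainder
    p_start = p * s + 1
    p_end = (p + 1) * s
    if r != 0:
        if p < r:                # the first r ranks take one extra problem
            p_start += p
            p_end += p + 1
        else:                    # the rest are only shifted by the remainder
            p_start += r
            p_end += r
    # Every rank except the first also recomputes its left neighbour's
    # last problem, so p_start is decremented by one.
    if p > 0:
        p_start -= 1
    return p_start, p_end
```

For example, with ${C}_{n}=10$ problems and ${p}_{n}=3$ processes, the ranks receive the ranges (1, 4), (4, 7) and (7, 10); adjacent ranges overlap by exactly one problem, which is the recomputed boundary solution.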
4. Physical Experiments
The system has six degrees of freedom, two of which were varied simultaneously. The initial conditions that remained fixed in each simulation of the dynamical system were ${\varphi}_{0}=\pi /2$ and ${\dot{x}}_{0}={\dot{\theta}}_{0}={\dot{\varphi}}_{0}=0.0$. The initial conditions that were simultaneously swept were $\left|{x}_{0}\right|\le 0.01$ and $\left|{\theta}_{0}\right|\le \pi /2$, with $dx=0.0001$ and $d\theta =\pi /180$, giving a total of 200 iterations in x and 180 iterations in $\theta$. It ought to be noted that the scanned regions are imposed by the physical limitations of the model. In this way, the sweep of both initial conditions constitutes a total of $36,000$ Runge–Kutta instances to execute, each one simulated for 600 s (10 min) with $\Delta t=0.0001$ ($6\times {10}^{6}$ iterations).
For this experiment, the parameters of the model were initialised as follows: $A=0.0055555$ m, $k=5.15$ N/m, $\omega =2\pi $ Hz, $g=9.78$ kg s${}^{2}$ (this value of gravity’s acceleration is valid in Mexico City, where the experiments were performed), ${m}_{g}=0.0686$ kg, ${m}_{\theta}={m}_{\varphi}=0.0049$ kg, ${l}_{\theta}={l}_{\varphi}=0.22$ m and $\gamma =0.029$ kg s${}^{1}$.
The system exhibits both chaotic and stable regions for the time span studied herein. There are three characteristic times in this system: ${t}_{{c}_{x}}=2\pi /\omega =1$ s and ${t}_{{c}_{\theta}}={t}_{{c}_{\varphi}}\approx 1.11232$ s (when the oscillating pendula are initially dropped from $\pi /2$). As an illustration, we show phase planes for x, for a chaotic and a non-chaotic case, in Figure 1. The initial conditions for the chaotic trajectory are $(x,\theta ,\varphi ,\dot{x},\dot{\theta},\dot{\varphi})=(0.007,1.3439,\pi /2,0,0,0)$, while for the non-chaotic trajectory they are $(x,\theta ,\varphi ,\dot{x},\dot{\theta},\dot{\varphi})=(0,\pi /2,\pi /2,0,0,0)$.
In Figure 2, Lyapunov exponents for varying x and fixed $\theta$ are shown. The chaotic phase-plane trajectory shown in Figure 1a has a Lyapunov exponent ${\lambda}_{x}=1.584224$, which corresponds to that of Figure 2a with $x=0.007$. On the other hand, the non-chaotic phase-plane trajectory shown in Figure 1b has a Lyapunov exponent ${\lambda}_{x}=-4.214377$, as shown in Figure 2b with $x=0$.
5. Performance Experiments
We started the experiments on a single node to verify the performance without the latency of the network. We ought to recall the importance of controlling the affinity, so that the MPI processes do not migrate between cores within the node and the measurements remain as reliable as possible. Thus, the following environment variable was defined: export I_MPI_PIN_PROCESSOR_LIST=allcores:grain=core.
In Figure 3 we show the speedup obtained using just one node with four cores executing from one to four MPI processes. The behaviour is practically linear, so the serial fraction is very low.
From Figure 3 we can conclude that the execution time is considerably reduced. In the next experiment we launch from 1 to 17 MPI processes, taking one node as the unit of computation, to measure performance. In Figure 4 we show computing times per computation unit, while in Figure 5 we show the corresponding speedup across the cluster. From the results we can observe the low overhead introduced by latency, as expected from the design.
An important metric which must be calculated is the efficiency E, defined as
$$E={\textstyle \frac{S\left(n\right)}{n}}\times 100\%$$
where $S\left(n\right)$ is the speedup obtained with n processes; E indicates how busy the nodes are during execution. Figure 6 shows that the obtained efficiency is high, since on average every node is kept busy 99.3% of the time. The efficiency also indicates that the partitioning of tasks implemented herein is scalable, meaning that we can increase the number of computing units to further reduce the computing time without losing efficiency in the use of many nodes. Scalability must be contemplated in a good design of a parallel program, since it allows the algorithm to scale as the number of processing units in the cluster increases. Similar implementations, in which the problem consists of recursively processing the first, second, third and fourth Runge–Kutta coefficients in a parallel fashion, achieve a maximum performance that decreases when more processing units are added [19].
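The metrics can be reproduced directly from the reported measurements; the following short Python check uses the single-node times of Figure 3 and the cluster speedup of Figure 5:

```python
def speedup(t_serial, t_parallel):
    """S(n) = T(1) / T(n)."""
    return t_serial / t_parallel

def efficiency(s, n):
    """E = S(n) / n * 100%."""
    return s / n * 100.0

# Single-node computing times from Figure 3, for 1 to 4 MPI processes (seconds).
times = [154158.0, 77395.0, 53214.0, 41082.0]
speedups = [speedup(times[0], t) for t in times]  # approx. 1.00, 1.99, 2.90, 3.75

# Cluster-wide: 17 computing units reach a speedup of 16.85 (Figure 5),
# i.e., an efficiency of about 99.1% at 17 units.
cluster_eff = efficiency(16.85, 17)
```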
6. Conclusions
A parallel design for the calculation of Lyapunov exponents in the field of non-linear dynamical systems was implemented and validated using an MPI implementation. The numerical experiments and the obtained indicators validate the high efficiency of the proposed implementation, which is able to overcome the limitations of the networking speed.
As is well known, it is necessary to use the implementation that best contributes to attaining the highest performance or the greatest exploitation of the platform. In our case, distributing the calculation of serial Runge–Kutta problems is more efficient than using a parallelised version of the ODE-solving method. Moreover, not only does this scheme obtain excellent speedups, it also results in an easier implementation of the parallelisation.
Finally, we must say that parallelising the calculation of Lyapunov exponents in the field of non-linear dynamical systems allows researchers, using high-latency and low-cost clusters, to drastically reduce the time required for an exhaustive exploration of the manifold of initial conditions, quickly and accurately identifying chaotic and non-chaotic dynamical regions and discarding apparently chaotic regions that arise merely from overly large temporal or spatial steps.
Acknowledgments
The authors acknowledge the partial support of projects 20171536, 20170721 and 20171027, as well as an EDI grant, all provided by SIP/IPN. The authors also acknowledge the FS0001 computational time grant at FSF, provided by CDA/IPN.
Author Contributions
J.J.H.G. conceived the problem and directed the research. C.C.C. defined the parallel scheme and coded the problem. I.E.H.D. physically established the nonlinear dynamical system herein studied, and established conditions for each numerical experiment. N.F.G. performed the numerical experiments. E.G.C. obtained performance metrics. J.J.H.G. and E.G.C. wrote the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. PseudoCode of the Computation Procedure
Algorithm A1: Distributed Algorithm for the Calculation of Lyapunov Exponents.
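As an illustrative companion to Algorithm A1, the following is a minimal Python sketch of the distributed procedure (helper names are hypothetical placeholders; the real code integrates Equations (1)–(6) with fourth-order Runge–Kutta in Fortran under Intel MPI). It combines the Section 3.1 partitioning with per-rank Runge–Kutta calls and falls back to a single serial "rank" when mpi4py is unavailable:

```python
# Minimal sketch of Algorithm A1: each rank solves its block of initial-value
# problems and forms Lyapunov coefficients between adjacent solutions.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:          # fall back to a single serial "rank"
    comm, rank, size = None, 0, 1

def solve_rk(c):
    """Placeholder for one full Runge-Kutta integration started at c."""
    return c

def lyapunov_pair(r_prev, r_curr, delta):
    """Placeholder for the exponent computed from two adjacent solutions."""
    return abs(r_curr - r_prev) / delta

def run(conditions, delta=1e-4):
    n = len(conditions)
    s, r = divmod(n, size)                      # block size and remainder
    start = rank * s + min(rank, r)             # 0-based block start
    end = start + s + (1 if rank < r else 0)    # one past the block end
    if rank > 0:
        start -= 1    # recompute the left neighbour's last problem
    prev = solve_rk(conditions[start])
    local = []
    for i in range(start + 1, end):
        curr = solve_rk(conditions[i])
        local.append(lyapunov_pair(prev, curr, delta))
        prev = curr
    if comm is not None:                        # gather all coefficients
        local = [v for part in comm.allgather(local) for v in part]
    return local
```

Because each rank only recomputes one boundary problem and exchanges a handful of floating-point results at the end, the message traffic stays negligible, which is the design goal stated in Section 3.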
References
 Strogatz, S. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering; Studies in Nonlinearity; Avalon Publishing: New York, NY, USA, 2014. [Google Scholar]
 Lyapunov, A.M. The General Problem of the Stability of Motion. Ph.D. Thesis, University of Kharkov, Kharkiv Oblast, Ukraine, 1892. [Google Scholar]
 Benettin, G.; Galgani, L.; Strelcyn, J.M. Kolmogorov entropy and numerical experiments. Phys. Rev. A 1976, 14, 2338. [Google Scholar] [CrossRef]
 Contopoulos, G.; Galgani, L.; Giorgilli, A. On the number of isolating integrals in Hamiltonian systems. Phys. Rev. A 1978, 18, 1183. [Google Scholar] [CrossRef]
 Sato, S.; Sano, M.; Sawada, Y. Practical methods of measuring the generalized dimension and the largest Lyapunov exponent in high dimensional chaotic systems. Prog. Theor. Phys. 1987, 77, 1–5. [Google Scholar] [CrossRef]
 Kantz, H. A robust method to estimate the maximal Lyapunov exponent of a time series. Phys. Lett. A 1994, 185, 77–87. [Google Scholar] [CrossRef]
 Kuznetsov, N.; Alexeeva, T.; Leonov, G. Invariance of Lyapunov exponents and Lyapunov dimension for regular and irregular linearizations. Nonlinear Dyn. 2016, 85, 195–201. [Google Scholar] [CrossRef]
 Wolf, A.; Swift, J.B.; Swinney, H.L.; Vastano, J.A. Determining Lyapunov exponents from a time series. Phys. D Nonlinear Phenom. 1985, 16, 285–317. [Google Scholar] [CrossRef]
 Hernández-Gómez, J.J.; Couder-Castañeda, C.; Gómez-Cruz, E.; Solis-Santomé, A.; Ortiz-Alemán, J.C. A simple experimental setup to approach chaos theory. Eur. J. Phys. 2017. under review. [Google Scholar]
 Rauber, T.; Rünger, G. Parallel Programming: For Multicore and Cluster Systems; Springer Science & Business Media: New York, NY, USA, 2013; pp. 1–516. [Google Scholar]
 Couder-Castañeda, C. Simulation of supersonic flow in an ejector diffuser using the JPVM. J. Appl. Math. 2009, 2009, 497013. [Google Scholar] [CrossRef]
 Kshemkalyani, A.; Singhal, M. Distributed Computing: Principles, Algorithms, and Systems; Cambridge University Press: Cambridge, UK, 2008; pp. 1–736. [Google Scholar]
 Iserles, A.; Nørsett, S. On the Theory of Parallel Runge–Kutta Methods. IMA J. Numer. Anal. 1990, 10, 463. [Google Scholar] [CrossRef]
 Bylina, B.; Potiopa, J. Explicit Fourth-Order Runge–Kutta Method on Intel Xeon Phi Coprocessor. Int. J. Parallel Program. 2017, 45, 1073–1090. [Google Scholar] [CrossRef]
 Murray, L. GPU Acceleration of Runge–Kutta Integrators. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 94–101. [Google Scholar] [CrossRef]
 Majid, Z.; Mehrkanoon, S.; Othman, K. Parallel block method for solving large systems of ODEs using MPI. In Proceedings of the 4th International Conference on Applied Mathematics, Simulation, Modelling, Corfu Island, Greece, 22–25 July 2010; pp. 34–38. [Google Scholar]
 Couder-Castañeda, C.; Ortiz-Alemán, J.; Orozco-del Castillo, M.; Nava-Flores, M. Forward modeling of gravitational fields on hybrid multi-threaded cluster. Geofis. Int. 2015, 54, 31–48. [Google Scholar] [CrossRef]
 Arroyo, M.; Couder-Castañeda, C.; Trujillo-Alcantara, A.; Herrera-Diaz, I.E.; Vera-Chavez, N. A performance study of a dual Xeon-Phi cluster for the forward modelling of gravitational fields. Sci. Program. 2015, 2015, 316012. [Google Scholar] [CrossRef]
 Zemlyanaya, E.; Bashashin, M.; Rahmonov, I.; Shukrinov, Y.; Atanasova, P.; Volokhova, A. Model of stacked long Josephson junctions: Parallel algorithm and numerical results in case of weak coupling. In Proceedings of the 8th International Conference for Promoting the Application of Mathematics in Technical and Natural Sciences—AMiTaNS 16, Albena, Bulgaria, 22–27 June 2016; Volume 1773. [Google Scholar]
Figure 2.
Lyapunov exponents for varying x and fixed $\theta $. (a) shows two regions. The first half describes chaos, while the second half shows stability. (b) has a large region of pure stability.
Figure 3.
Speedup obtained in one node. As can be observed the speedup is almost linear. The computing times are 154,158 s, 77,395 s, 53,214 s and 41,082 s.
Figure 4.
Computing times obtained from 1 to 17 processes in the cluster; the processes follow an ordered affinity.
Figure 5.
Speedup obtained in the cluster. The perfect speedup is 17 and the obtained speedup is 16.85, giving us a very good performance.
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).