Differential Evolution under Fixed Point Arithmetic and FP16 Numbers

: In this work, the differential evolution algorithm behavior under a ﬁxed point arithmetic is analyzed also using half-precision ﬂoating point (FP) numbers of 16 bits, and these last numbers are known as FP16. In this paper, it is considered that it is important to analyze differential evolution (DE) in these circumstances with the goal of reducing its consumption power, storage size of the variables, and improve its speed behavior. All these aspects become important if one needs to design a dedicated hardware, as an embedded DE within a circuit chip, that performs optimization. With these conditions DE is tested using three common multimodal benchmark functions: Rosenbrock, Rastrigin, and Ackley, in 10 dimensions. Results are obtained in software by simulating all numbers using C programming language.


Introduction
The use of different number types in machine learning applications has been analyzed extensively in previous years, more specifically in deep learning neural networks [1,2]. These kinds of neural networks use the convolution as the basic function and have thousands of parameters and must be trained first; that is, the network must be optimized by modifying all the parameters to obtain a local minimum of the goal function. The optimization step is called training and it could take hours in modern hardware of general purpose graphics processor units (GPGPUs). A special type of number, Brain Floating Point (bfloat16), which is a half-precision FP format of 16 bits with the same range of the usual single precision FP numbers (float in C programming language, of 32 bits length), has been proposed for training deep learning neural networks [2]. Other FP numbers of 16 bit length are the so-called FP16 numbers, these are an IEEE standard [1,2] for half-precision FP numbers and can be used on ARM processors.
The goal of using different, shorter numbers in machine learning applications is to improve the speed, and as a consequence reduce the power consumption as it would take less time to train a deep learning network, and also reduce the storage memory or disk size for the variables. In [1] it is mentioned that half precision is also attractive for accelerating general purpose scientific computing, such as weather forecasting, climate modeling, and solution of linear systems of equations. The supercomputer Summit (it was in the Top 500 list https://www.top500.org), has a peak performance of 148.6 petaflops in the LINPACK benchmark, a benchmark that employs only double precision. For a genetics application that uses half precision, the same machine has a peak performance of 2.36 exaflops [1].
In this work it is proposed to analyze the well known heuristic for single objective optimization, the differential evolution (DE) algorithm, under FP16 numbers, and also under fixed point arithmetic that uses integer numbers of different lengths. This analysis is important if we think of embedded optimization algorithms within a chip [3], which performs a dedicated task. One constraint in these kinds of applications must be that the power consumption is as low as possible. Also it is important if one designs a dedicated algorithm in hardware, just as in FPGAs (Field Programmable Gate Arrays), to accelerate its behavior. Also, another possible application is to execute a fast and small DE inside each core in a GPGPU. These three application scenarios justify the analysis of the DE performed in this work.
The rest of this article is organized as: in Section 2 a very brief description of fixed point arithmetic and FP numbers is made. In Section 3 the DE algorithm is analyzed for which parts could be improved by using other different number types. In Section 4 some experiments and their results are described. Finally, in Section 6 some conclusions are presented.

Fixed Point Arithmetic and Floating Point Numbers
The notation a.b will be used here to represent a set of integer numbers that uses a bits in the integer part, and b bits in the fractional part. Each number is of size a + b + 1 bits (plus the sign bit).
For a number x ∈ a.b, the range of numbers that can be represented is: Summing up two numbers a.b results in a number (a + 1).b [4]. The multiplication of two numbers a.b results in a number (2a + 1).2b [4]. It is possible to verify these results by applying the respective operation to two extreme numbers in (1).
The microprocessors offer the sum and multiplication of two integer numbers and the result is stored in a number of the same size as the operands. In a hardware design for a given application, one must use a big enough number to store the sum of two a.b numbers, and the result to multiply two a.b numbers must be returned to a a.b number. The easiest way to perform this is by truncating the result: the resulted 2a.2b is shifted b bits to the right, again the number must be big enough to store the resulted a.b number. In a PC, if one uses 32 bit integer numbers, the first bit is the sign bit, and then one could multiply up two √ 2 31 = 2 31/2 values to keep the result within the used 31 bits. In any application, normally one does not take care if the used numbers can keep the result of the operations applied to them, and one trusts that the numbers are big enough to store the results.
The operations sum and multiplication of two integer numbers are the fastest because each operation is built in the hardware and both take a single clock step.
The sum and multiplication of two FP numbers is totally different. An FP number is composed as s · 2 e , where s is the significant and e the exponent. If p bits are used for the significant, it is an integer that could take values from 0 to 2 p − 1. The exponent e is an integer number too. The sum of two FP numbers is carried on first by expressing both numbers with the same exponent, then summing up both significants. The greater exponent of both numbers is used to express them with the same exponent. The result must be rounded to express the same number of bits used in the significants. Also, the result could be normalized, which means that the exponent will have a single binary precision number.
The multiplication takes more steps because two numbers s 1 · 2 e 1 , and s 2 · 2 e 2 are multiplied as s = s 1 · s 2 and the exponents are summed (e = e 1 + e 2 ), and also both results are rounded and the final result is normalized.
In the IEEE 754 standard [5], an FP number has a sign bit, i, and the represented number is equal to (−1) i · s · 2 e , where e min ≤ p + e − 1 ≤ e max . The values used in common FP numbers are shown in Table 1. Floating point operations take more than a clock cycle within a microprocessor.
The IEEE 754 standard [5] gives much more aspects that are necessary to work with FP numbers, such as rounding methods, Not a Number (NaN), infinities, and how to handle exceptions. In [6] all these details about FPs are explained.

DE Analysis
DE is a heuristic used for global optimization under continuous spaces. DE solves problems as: minimize: f (x), subject to: g(x) ≥ 0, and where f : R n → R is the function to optimize; x ∈ R n , that is, the problem has n variables; and also we could have g : R n → R m 1 , m 1 inequality constraints; and h : R n → R m 2 , m 2 equality constraints. The solution to the problem x is in a subset S of the whole search space R n and where the constraints are satisfied, this space S is called the feasible space. Also, the search space contains the feasible space and is defined by the box constraints: This is, each variable x i is searched in the interval defined by the lower bound value l i , and the upper bound value u i , for i = {1, 2, . . . , n}.
Constraints can be incorporated into the problem (2) by modifying the objective function as: Now the f 1 will be optimized instead of f in (2). α and β in (4) represent the penalty coefficients that weigh the relative importance of each kind of constraint.
One important point about DE is that the heuristic needs to only evaluate the problem to solve. Classical mathematical optimization methods use the first and perhaps also the second derivative of the given problem. These derivatives are easy to obtain if one has in hand the mathematical expression to the given problem. It is possible to approximate the derivatives numerically but with a very high computational cost [7].
According to the test in the CEC 2005 conference [8], DE is the second best heuristic to solve real parameter optimization problems, when the number of parameters is around 10.
The DE pseudocode is shown in Algorithm 1. for i = 1 to µ do 6: Let r 1 , r 2 and r 3 be three random integers in [1, µ], such that r 1 = r 2 = r 3 7:

Algorithm 1 Differential evolution algorithm (rand/1/bin version)
Let j rand be a random integer in [1, n] 8: for j = 1 to n do 9: DE works with a population that is composed of a set of individuals, or vectors, of real numbers. All vectors are initialized with random numbers with a uniform distribution within the search bounds of each parameter (line 1 in Algorithm 1). For a certain number of iterations (line 4) the population is modified and this modified population could replace the original individuals. The core of DE is in the loop on lines 8-13: a new individual is generated from three different individuals chosen randomly; each value of the new vector (it represents a new individual) is calculated from the first father, plus the difference of the other two fathers multiplied by F, the difference constant; the new vector value is calculated if a random real number (between zero and one) is less than R, the DE's recombination constant. To prevent the case when the new individual could be equal to the current father i, at least one vector's component (a variable value) is forced to be calculated from their random fathers values: it is in line 9 of the pseudocode, when j = j rand , and j rand is an integer random number between 1 and n. In lines 10-12 it is checked if each combined variable value is within the search space. Then the new individual is evaluated, and if it is better than the father (in lines [11][12], then the child replaces its father. The stop condition used here is: if the number of iterations is greater than a maximum number of iterations or when the difference in the objective function values of the worst and best individuals is less than v. This stop condition is called diff criterion in [9], and is recommended for a global optimization task.
A general form to set the parameter values for DE is: if d is the number of variables, the population size is set to 10d, F ∈ [0.5, 1.0], and R ∈ [0.8, 1.0] [9].
The DE in Algorithm 1 can be improved by using a random integer number generator as the one described in [10], which does not use divisions or FP numbers. This idea could improve the algorithm in line 6 (to generate three numbers in the interval [1, µ], and in line 7 where another random integer number is generated in the interval [1, n]. Also, the values for F and R are within the interval [0.5, 1.0], and usually no more than one or two decimal values are used for these constants, thus these values are not affected by using half precision numbers (see Table 1). Even more, a totally integer arithmetic could be used in the comparison Two implementations of DE were used in this work: one with fixed point arithmetic, and another one using FP16 numbers. The implementation with fixed point arithmetic uses integer (of 32 bits) numbers for all the variables. The implementation using FP16 numbers uses half precision floats (FP16, 16 bits) for all the variables. In this paper a computer of 64 bits architecture was used, then the multiplication of two integers was stored in a long type variable of 64 bits, shifted and truncated to a integer of 32 bits. The core part of DE (lines 8-13 in Algorithm 1) calculates the selected and mutated vector x as: for j = {1, 2, . . . , n}, this is for each variable of the given problem. Thus, one subtraction (x r 1 ,j − x r 2 ,j ) followed of one multiplication (by constant F) and one summation (with x r 3 ,j ) are needed to calculated the new vector x . The greatest value for F could be 1, if all the search space is equal for all variables, the result in (5) could be the double of the current x j value. Then, the maximum possible values in the search space could be the double of the bound values of the search space. Another problem is to find the maximum possible value in the function space. Also, it is not clear how many bits are necessary in the fractional part for the fixed point arithmetic. These items are solved in the following section.

Experiments with Three Multimodal Functions in 10 Dimensions
Three very well known benchmark functions were used: shifted version of Rosenbrock, Rastrigin, and Ackley functions in 10 dimensions. All these functions are multimodal, which justify solving them using the DE heuristic. The used Rosenbrock function is defined as: its minimum value is 0.39 with x = [0, 0, . . . , 0] The Rastrigin function is defined as: its minimum value is −33 for x = [0, 0, . . . , 0] The Ackley function is defined as: its minimum value is −7 with x also equal to x = [0, 0, . . . , 0]. These three functions are scaled with respect to the three ones defined in [11] in order to keep their amplitudes within the range of half precision FP numbers (see Table 1). A summary of these three functions is described in Table 2.
All functions were programmed in single precision FP (float in C) arithmetic. The number of bits used for the integer and fractional parts for the simulations in fixed point arithmetic is shown in Table 3. The number of bits in the integer part is set according to Table 2 because the maximum number in the third column in Table 3 must be greater than the maximum extreme value shown in Table 2. Table 3. Calculation of the number of bits in the integer part for the simulations using fixed point arithmetic. Numbers shown here must be greater than the corresponding ones in Table 2 to permit the optimization operations for differential evolution (DE). The resulted statistics for the simulations using 100 runs per bit in the fractional part and FP16 arithmetic are shown in Tables 4-6, for the Rosenbrock, Rastrigin, and Ackley functions, respectively. In those tables the statistics for the number of generations and the obtained function values are shown. The used number of bits in the integer part are shown in Table 3. These number of bits in the integer part were calculated from data in Table 2, for example, for the Rosenbrock function in Table 2 the maximum obtained value function is 10891.29, thus the number of bits for the integer part must be greater than this number, therefore 14 bits were selected because 2 14 = 16, 384 > 10, 891.29. The corresponding variable values for the minimum for each function for the FP16 simulations are shown in Table 7. The obtained mean value for the FP16 simulation for the Rosenbrock function is 0.391538 (see at the end of sixth column in Table 4). The equivalent mean for the fixed point arithmetic is 0.391079 at 11 bits in the fractional part; the associated variable values at this simulation with 11 bits is also shown in Table 7. The same procedure was repeated for the results for the Rastrigin and Ackley functions and are also shown in Table 7. Table 4. Statistics of the 100 runs per bits used in the fractional part for the fixed point arithmetic and for the Rosenbrock function (14 bits were used for the integer part). Results for 100 runs for the FP16 are also shown. g represents the number of generations.  Table 5. Simulation results for Rastrigin function. Statistics of the 100 runs per bits used in the fractional part for the fixed point arithmetic (7 bits were used for the integer part). Results for 100 runs for the FP16 are also shown. g is the number of generations.   Table 6. Simulation results for Ackley function. Statistics of the 100 runs per bits used in the fractional part for the fixed point arithmetic (5 bits were used for the integer part). Results for 100 runs for the FP16 are also shown. g is the number of generations.

Discussion
With the simulation results shown in Tables 4-7 it is confirmed that the heuristic DE can be executed in fixed point arithmetic or half precision FP numbers.
As one can see in Tables 4-6 not all the fractional numbers of bits are necessary with a given application. From Table 7 same results for FP16 numbers can be obtained with numbers 14.11, 7.12, and 5.11 for the scaled Rosenbrok, Rastrigin, and Acklen functions.
About the precision obtained in the solution using FP16 or integer arithmetic. The defined machine epsilon value is that such when = 1 + . In most of the modern microprocessors (that use two's complement arithmetic) this machine epsilon value for each data type is shown in Table 8. Table 8. Machine epsilon values for the different floating point numbers, for a general integer number of n bits in the fractional part, and also for the integer arithmetic of results shown in Table 7. The precision bits is one bit more than the positive exponent of epsilon in floating point types and equal to the number of bits used in the fractional part in integer arithmetic.

Data
Roughly, one cannot expect a result in an optimization problem beyond the precision of the machine epsilon. Thus, using FP16 numbers will give precision in the result at most 9.765625 × 10 −4 . Or using an integer number a.b, the result will have at most a precision of 2 −b . This means also that using FP16 numbers the heuristic, DE in this case, will finish early compared to using single or double precision floating point numbers. In the experiment in this work the DE's stop condition was set equal to 10 −4 . It is expected that using a smaller stop condition the heuristic will finish in more generations but then is necessary to change to other number types.
One possible application of using FP16 numbers of integer arithmetic could be to obtain first a low precision result within the precision given by the used type numbers (see Table 8). If a bigger precision is required, then a traditional mathematical algorithm, such as the Newton method, could be used. The starting solution for the Newton method will be the previous obtained low resolution solution.
Of course if FP16 numbers of integer arithmetic are used, the application should work at the precision results given by those type numbers. Finally, this behavior must be analyzed in advance for a given application.
For all the simulations the DE's stop condition was set equal to 0.0001. This number in 3.28 notation is equal to 0x000068db (it is a hexadecimal number of 32 bits), and this number can be written by convenience with the binary point as 0x0.00068db. The 13 bits after the binary point are all zeros, thus the stop condition is equal to zero for less than 13 bits used in the fractional part, as one can confirm in Tables 4 and 6 where the simulations show the maximum number of iterations and the stop condition is not taken into account for lesser than and equal to 13 bits.
For the use of fixed point arithmetic in DE, it is critical to know in advance the range of values for the function to optimize. Here the extremes values of the search space were used to know those quantities. In a practical task, it could be tried with the extremes and perhaps other points, on a very coarse grid, to evaluate the function to optimize. The same procedure should be applied to use FP16 numbers.
DE core (in Algorithm 1) uses one difference and one multiplication, thus there is not a numerical problem to be used with fixed point arithmetic or FP16 numbers.
A naive implementation of fixed point arithmetic with a word length of 32 bits is not required, in general. As one can see in Table 4, the same results using 14-17 bits in the fractional part for the Rosenbrock function are obtained. The same applies from results in Table 5 for the Rastringin function for 11-24 bits, and in Table 6 for the Ackley function from 13 to 26 bits in the fractional part.
A future work will be the design in the hardware of DE, which should include the random number generator that can be optimized to use directly the generated bits without FP divisions, as is suggested in [10]. This idea of this design also could be used in software within each core of a GPGPU. Also an interesting idea is to incorporate a random number generator based in chaos [12], which is easy to implement.

Conclusions
The DE optimization heuristic was analyzed under its implementation with fixed point arithmetic and half precision floating point arithmetic. Results were shown in software simulation with three multimodal functions: Rosenbrock, Rastrigin, and Ackley in 10 dimensions. To apply these arithmetic representations, it is necessary first to know how to scale the function values to be inside the ranges of FP16 numbers. It is suggested to use the extreme search values to have an idea of those range function values. If this point is solved, DE can be perfectly used in these arithmetics.
Still is possible to optimize the DE algorithm in the pseudo random number generator, without using FP arithmetic. This analysis is required if DE will be embedded in hardware inside a circuit chip or in massive parallel versions in GPGPUs.
Funding: This research received no external funding.