Distributed Genetic Algorithms for Low-Power, Low-Cost and Small-Sized Memory Devices

This work presents a strategy to implement a distributed form of genetic algorithm (GA) on low-power, low-cost, and small-sized-memory devices, aiming at increased performance and reduced energy consumption compared to standalone GAs. The strategy focuses on making a distributed version of the GA feasible to run as a low-cost, low-power-consumption embedded system, using devices such as 8-bit microcontrollers (µCs) and the Serial Peripheral Interface (SPI) for data transmission between them. Details are given about how the distributed GA was designed from a previous standalone implementation by the authors and how the project is structured. Furthermore, this work investigates the implementation limitations and presents results on its proper operation, most of them collected with the Hardware-In-the-Loop (HIL) technique, as well as on resource consumption such as memory and processing time. Finally, some scenarios are analyzed to identify where this distributed version can be used and how it compares to the single-node standalone implementation in terms of performance and energy consumption.


Introduction
Distributed systems are present in our lives every day. They can be simple or complex, such as the ones found in the World Wide Web, social networks, e-commerce, and others. A distributed system can be any system in which hardware or software components are separated and able to communicate with each other by passing messages through some sort of network. The main motivation for constructing these distributed systems is resource sharing, that is, the system can use resources that are not in the same location, and it can eventually be scaled. However, distributed systems usually run concurrently on devices that do not share a global clock and memory, which requires some sort of synchronization, and individual devices may also fail independently [1]. This explains why the area is challenging and has been studied for decades.
Traditionally, most algorithms were created and implemented to run on a single machine. Over time, with the development of multi-core devices and faster networks, several of those algorithms were reinvented to work in a distributed way, so that they could use more resources and be accelerated, for instance [2]. An example of an algorithm that gained a distributed version years after its first implementation is the genetic algorithm. Genetic algorithms are a type of metaheuristic inspired by Darwin's theory of evolution and are an efficient method to solve numerous types of problems, mainly related to search and optimization in different areas [3]. Some researchers have already proposed distributed versions of GAs.

Genetic Algorithms
For the scope of this work, genetic algorithms can be defined as iterative algorithms that start by randomly generating a population of N individuals and after K iterations, called generations, those individuals will converge to some specific result. Each individual is mapped into M bits and during each generation, k-th iteration of the algorithm, the population passes through operations of evaluation, selection, crossover, and mutation. At the end of the generation, a new population of the same size N is generated and then it will become the starting point of the following generation. After this cyclic process repeats K times, most of the individuals are expected to be concentrated around the same values and the best one can be used as the result.
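As an illustration of this cycle, the sketch below implements a minimal single-node GA in C for one 8-bit dimension (D = 1), minimizing a simple quadratic. All names and parameter values here are illustrative assumptions, not the implementation from [13]:

```c
#include <stdint.h>
#include <stdlib.h>

#define POP_N  16   /* population size N (illustrative) */
#define GENS_K 64   /* generations K */
#define BITS_M 8    /* bits per individual M; D = 1 dimension */

/* example objective: x^2 - 6x + 8, with the 8-bit gene decoded into [0, 8] */
static float fitness(uint8_t g) {
    float x = (float)g / 255.0f * 8.0f;
    return x * x - 6.0f * x + 8.0f;
}

/* tournament of size 2; lower fitness wins (minimization) */
static uint8_t tournament(const uint8_t *pop, const float *fit) {
    int a = rand() % POP_N, b = rand() % POP_N;
    return fit[a] < fit[b] ? pop[a] : pop[b];
}

/* runs the full GA and returns the best fitness; best gene in *best_out */
float run_ga(unsigned seed, uint8_t *best_out) {
    srand(seed);
    uint8_t pop[POP_N], child[POP_N];
    float fit[POP_N];
    for (int j = 0; j < POP_N; j++) pop[j] = (uint8_t)(rand() & 0xFF);
    for (int k = 0; k < GENS_K; k++) {
        /* evaluation and elite search */
        int jb = 0;
        for (int j = 0; j < POP_N; j++) fit[j] = fitness(pop[j]);
        for (int j = 1; j < POP_N; j++) if (fit[j] < fit[jb]) jb = j;
        /* selection + one-point crossover, two children at a time */
        for (int j = 0; j < POP_N; j += 2) {
            uint8_t p1 = tournament(pop, fit), p2 = tournament(pop, fit);
            int cut = 1 + rand() % (BITS_M - 1);
            uint8_t hi = (uint8_t)(0xFFu << cut);
            child[j]     = (uint8_t)((p1 & hi) | (p2 & (uint8_t)~hi));
            child[j + 1] = (uint8_t)((p2 & hi) | (p1 & (uint8_t)~hi));
        }
        /* mutation: P = 1 individual gets one bit flipped */
        child[rand() % POP_N] ^= (uint8_t)(1u << (rand() % BITS_M));
        /* elitism (E = 1): best of X(k) goes into slot 0 of X(k+1) */
        child[0] = pop[jb];
        /* population update */
        for (int j = 0; j < POP_N; j++) pop[j] = child[j];
    }
    int jb = 0;
    for (int j = 0; j < POP_N; j++) fit[j] = fitness(pop[j]);
    for (int j = 1; j < POP_N; j++) if (fit[j] < fit[jb]) jb = j;
    *best_out = pop[jb];
    return fit[jb];
}
```

For a fixed seed the run is deterministic, which is convenient when validating embedded builds against a host build.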
Algorithm 1 presents the pseudocode of the GA described above, the same presented in [13], which is inspired by [3]. The vector x_j(k) represents the j-th individual of the N-sized population X(k) on the k-th generation. Each j-th individual has dimension D, thus the element x_j,i[M](k) represents the i-th dimension of this individual, which is mapped into M bits. Therefore, the population X(k) can be expressed as X(k) = [x_0(k) x_1(k) ... x_{N−1}(k)]^T.
After the evaluation, the next operation is the selection, where some individuals are selected and the ones with the best fitness values y_j[B](k) are combined to generate new and possibly better individuals for the next generation. There are several selection methods described in the literature, such as roulette wheel selection, stochastic universal sampling, tournament selection, and rank-based selection [20]. In this work, tournament selection is applied since it is one of the most used and efficient methods according to [21]. The selection function is represented in the pseudocode as SF (Line 10 of Algorithm 1). Finally, the elitism technique can also be applied, so that the best E individuals of the current population are passed directly to the new population without being combined. In this work, E = 1 and the best individual is placed in the first position of the new population (Line 16 of Algorithm 1).
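For concreteness, the core of a tournament round can be written as a small C helper; the candidate indices are assumed to be drawn at random by the caller, lower fitness is taken as better (matching the minimization in Algorithm 1), and the function name and tournament size are illustrative:

```c
#include <stddef.h>

/* Tournament selection for a minimization problem: among t candidate
   indices (already drawn at random elsewhere), return the index of the
   individual with the lowest fitness value. */
size_t tournament_winner(const float *fitness, const size_t *candidates, size_t t)
{
    size_t best = candidates[0];
    for (size_t c = 1; c < t; c++)
        if (fitness[candidates[c]] < fitness[best])
            best = candidates[c];
    return best;
}
```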

Algorithm 1 Genetic Algorithm Pseudocode
Generation of the initial population
1: Initialize(X(0))
Starts to process the generations
2: for k ← 0 to K − 1 do
Calculates the fitnesses and evaluates the individuals (or chromosomes)
3:   for j ← 0 to N − 1 do
4:     if y_j[B](k) < y_jb[B](k) then
6:       jb ← j
7:     end if
8:   end for
Selection and crossover
9:   for i ← 0 to N − 1 with step 2 do
10:     z_i(k), z_{i+1}(k) ← CF(SF(y(k), X(k)), SF(y(k), X(k)))
11:   end for
Mutation
12:   for v ← 0 to P − 1 do
13:     z_v(k) ← MF(z_v(k))
14:   end for
Elitism
15:   for i ← 0 to D − 1 do
16:     x_0,i[M](k+1) ← x_jb,i[M](k)
17:   end for
Updates the population
18:   for j ← 1 to N − 1 do
19:     for i ← 0 to D − 1 do
20:       x_j,i[M](k+1) ← z_j,i[M](k)

The operation following selection is called crossover, where two or more selected individuals from the current population, X(k), are combined to generate new ones that will be inserted into the new population, X(k+1), after passing through the mutation operation. In the literature, there are several strategies for the crossover, such as the one-point, two-point, and uniform crossover [22]. In this work, any of these three options can be used. The crossover function is defined as CF (Line 10 of Algorithm 1) and the offspring is stored in the matrix Z(k), which is defined as Z(k) = [z_0(k) z_1(k) ... z_{N−1}(k)]^T.

After the new individuals are inserted into Z(k), they are processed through the operation called mutation, where P individuals will have their information randomly modified. In this work, the mutation function is defined as MF (Line 13 of Algorithm 1). The mutation rate, called R_M, defines the proportion of individuals that suffer mutation, hence P can be specified as

P = R_M · N    (4)

The last operation of the GA is the population update. In the literature, there are different approaches in which the entire older population or only a part of it is substituted [23]. In this implementation, the entire population X(k) is renewed, that is, each j-th individual of the k-th generation is replaced by a new individual, generating the population of the next generation, X(k+1). These new individuals can come either from the offspring of the k-th generation, stored in Z(k), or directly from the old population due to the elitism technique (Lines 16 and 20 of Algorithm 1).
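The three crossover strategies mentioned above can be sketched in C as bit-mask operations on an M = 16-bit gene; the cut points and mask are assumed to be drawn at random by the GA, and the function names are illustrative:

```c
#include <stdint.h>

/* One-point crossover: bits at position `cut` and above come from p1,
   bits below come from p2. */
uint16_t one_point(uint16_t p1, uint16_t p2, unsigned cut)
{
    uint16_t mask = (uint16_t)(0xFFFFu << cut);
    return (uint16_t)((p1 & mask) | (p2 & (uint16_t)~mask));
}

/* Two-point crossover: bits inside [lo, hi) come from p2, the rest from p1. */
uint16_t two_point(uint16_t p1, uint16_t p2, unsigned lo, unsigned hi)
{
    uint16_t inner = (uint16_t)((0xFFFFu << lo) & ~(0xFFFFu << hi));
    return (uint16_t)((p2 & inner) | (p1 & (uint16_t)~inner));
}

/* Uniform crossover: each bit set in `mask` is taken from p2, others from p1. */
uint16_t uniform_cx(uint16_t p1, uint16_t p2, uint16_t mask)
{
    return (uint16_t)((p2 & mask) | (p1 & (uint16_t)~mask));
}
```

Bit masks keep the operators branch-free, which suits an 8-bit AVR where every cycle counts.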

Distributed Genetic Algorithms
The implementation of distributed genetic algorithms (DGAs) follows the same general idea as the traditional version described in Algorithm 1, but the workload is divided between multiple nodes. There are several possible architectures for DGAs, as described in [24], and some of them are presented below. The main advantage of these distributed architectures is that more resources can be used by the GA: it can work with larger populations, use more bits to represent each individual (increasing precision), and even reduce the processing time by running simultaneous tasks on multiple processors. The architecture proposed in this work is inspired by the ones presented below and is explained in detail in Section 4.
The most traditional architecture for distributed systems is probably the master-slave, where, in the case of genetic algorithms, one of the Q nodes processes most of the operations and sends individuals to be evaluated by the other nodes. While this approach may not sound efficient at first, the evaluation function is usually where most of the computing load lies for most search and optimization problems. Consequently, by adopting this strategy it is possible to accelerate the evaluation of several individuals in parallel, because these evaluations are mutually independent. However, there is a cost to transfer all the individuals during every generation, and if the evaluation function is not costly enough to cover the communication overhead, this approach will not be efficient. This architecture is shown in Figure 1.

Two other options for distributed genetic algorithms are the island and cellular models. In these models, the main population of N individuals is divided into sub-populations that are scattered between the Q nodes, which are spatially distributed. That means each node is responsible for V individuals, where, in this project, N must be divisible by Q. Hence V can be defined as

V = N / Q    (5)

Thus, a node q will have a sub-population X_q(k) of V individuals, mapped into M bits and with D dimensions, which can be expressed as

X_q(k) = [x_{q,0}(k) x_{q,1}(k) ... x_{q,V−1}(k)]^T    (6)

In both the island and cellular models, all nodes process all the operations of the GA, but there is also an extra stage where individuals from an island or cell can migrate to another one as a way to increase the diversity of the global population and avoid a local and premature convergence. That means the nodes can communicate among themselves, differently from the master-slave model, where the communication happens only between the master and the slaves, never between slaves. The island model and the cellular model are presented in Figures 2 and 3, respectively.
All the circles represent nodes and, in this architecture, differently from the island model, individuals can migrate from a node only to its neighbors. In the example of Figure 3, individuals from the central green node can migrate only to the adjacent yellow nodes; the dashed square region limits the nodes that exchange individuals with the green one.
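A possible migration step for the island model can be sketched in C. Here the ring topology and the best-replaces-worst policy are assumptions chosen for illustration (the text above does not fix a migration policy), and for brevity an individual's fitness is taken to be its own value, with lower being better:

```c
#include <stddef.h>

/* Ring migration on a flattened global population of Q islands with V
   individuals each: island q owns pop[q*V .. q*V + V - 1]. The best
   individual of island q overwrites the worst of island (q + 1) mod Q. */
void migrate_ring(float *pop, size_t Q, size_t V)
{
    size_t best[16];        /* sketch assumes Q <= 16 */
    float emigrants[16];
    /* find each island's best before any replacement happens */
    for (size_t q = 0; q < Q; q++) {
        size_t b = q * V;
        for (size_t j = q * V + 1; j < (q + 1) * V; j++)
            if (pop[j] < pop[b]) b = j;
        best[q] = b;
        emigrants[q] = pop[b];
    }
    /* replace the worst of the next island in the ring */
    for (size_t q = 0; q < Q; q++) {
        size_t dst = (q + 1) % Q, w = dst * V;
        for (size_t j = dst * V + 1; j < (dst + 1) * V; j++)
            if (pop[j] > pop[w]) w = j;
        pop[w] = emigrants[q];
    }
}
```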
One last model, among the numerous that exist for DGAs, is the pool model. In this form, the population of individuals is placed in a shared global array that various autonomous nodes can access. That array is split into U segments so that each node is responsible for the group of individuals in one segment. Each processor can read individuals from any segment but can overwrite only individuals in its reserved segment. One advantage of this model compared to the previous ones is that it handles asynchronous tasks and heterogeneity well, while the others need some kind of synchronization between the nodes, mainly during the communication. This model can be seen in Figure 4.
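The pool model's access rule can be captured by a small C predicate; the equal-sized segment layout and the names are illustrative assumptions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Pool-model access rule: with the shared array split into U equal
   segments, node u may read any index but may only overwrite indices
   inside its own segment. pool_len is assumed divisible by U, mirroring
   the divisibility constraint used elsewhere in this project. */
bool can_write(size_t u, size_t U, size_t pool_len, size_t index)
{
    size_t seg = pool_len / U;              /* individuals per segment */
    return index >= u * seg && index < (u + 1) * seg;
}
```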

Algorithm
Based on the several available models of distributed genetic algorithms, the one implemented in this work is mostly inspired by the master-slave model but has some characteristics of the island model. The focus was to keep as many operations as possible parallel and asynchronous, instead of running only the evaluation concurrently, to improve the overall performance. Furthermore, the global population is divided into sub-populations of size V between the Q nodes, aiming to take advantage of the total memory available. After analyzing all the operations described in Section 2, it was noticed that most of the GA operations are independent and can be done in parallel and asynchronously, except the selection and crossover.
The decision to keep the GA operations of selection and crossover synchronous between all nodes and coordinated by the master node is to allow the selection and combination of individuals from any sub-population, which may be stored in different microcontrollers. In the traditional island model, only individuals from the same sub-population, that is, from the same node, can cross. Because of this, some individuals of different nodes that would eventually generate a good result would never have the chance to cross their contents. To address this limitation, this implementation centralizes both the selection and the crossover in the master node, with the slave nodes working synchronized with the master during this stage, so that individuals of any microcontroller can be collected and combined, and the new individuals sent back. This idea is presented in Figure 5.
Once these operations are done, all µCs can continue their run independently; they will synchronize again only during the selection and crossover of the next generation. After K generations, there is one extra step where the master needs to synchronize all slaves again to collect the best individual of all sub-populations. Finally, the master compares the Q best individuals and the best one will be the final result. This whole process is presented in Figure 6.

As stated before, this implementation is a modification of the work presented in [13], so it uses the same base structure and has the same constraints and limitations described there. To conform with those limitations, the number of nodes, Q, must be a power of 2, that is,

Q = 2^n, n = 1, 2, 3, ...    (7)

Since the resources are shared, the size limit for the global population size, N, becomes a function of Q, as follows:

N ≤ 256 · Q    (8)

Ultimately, another consequence of this new N is that it can be bigger than 256. Therefore, the type popsize_t, which is used in the implementation for variables that store the population size, cannot be stored as an 8-bit unsigned int anymore and may need a 16-bit variable instead. ... Almost all operations, called function modules in that work, are still the same except the selection and crossover, which needed to be modified in this project. The selection and crossover functions have different implementations for the master and the slave. Also, at the end of the GA, the master node needs to collect the best individuals of all nodes and then select the best one as the final result. For that reason, the pseudocodes for the master and the slaves are shown separately in Algorithms 2 and 3. In both Algorithms 2 and 3, there are new functions in comparison to the original Algorithm 1, and they are described below:
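The structural constraints above (Q a power of 2, N divisible by Q, and popsize_t widening once the global population exceeds 256) can be checked with a few C helpers; the function names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* A non-zero value q is a power of two iff it has exactly one bit set. */
bool is_power_of_two(uint16_t q) { return q != 0 && (q & (q - 1)) == 0; }

/* Valid DGA configuration: Q = 2^n with n >= 1, and N divisible by Q
   so that V = N/Q is exact. */
bool valid_configuration(uint16_t n, uint16_t q)
{
    return is_power_of_two(q) && q >= 2 && n % q == 0;
}

/* popsize_t must widen to 16 bits once the global population exceeds 256. */
unsigned popsize_bits_needed(uint16_t n) { return n > 256 ? 16 : 8; }
```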

Algorithm 2 Distributed Genetic Algorithm Pseudocode-Master
Generation of the initial population
1: Same block as in Algorithm 1.
Starts to process the generations
2: Same block as in Algorithm 1.
Calculates the fitnesses and evaluates the individuals (or chromosomes)
3: Same block as in Algorithm 1.

16: end for
Inform all nodes to continue the remaining operations.
17: for q ← 0 to Q − 1 do
18:   COF(q)
19: end for
Mutation
20: Same block as in Algorithm 1.
Updates the population
22: Same block as in Algorithm 1.
Collect the best individual of all nodes.
23: for q ← 0 to Q − 1 do
24:   bestIndividuals_q ← CBIF(q)
25: end for

While the proposed implementation provides the benefits discussed previously, it also brings some drawbacks. The first one is the large time consumption during the selection and crossover: while the master is running the tournament method and crossing individuals, all slaves are idle and waiting for commands from the master. Only when the master finishes processing all sub-populations can the slave nodes continue the other operations. Furthermore, the communication method between the nodes is relevant because it is heavily used during the selection and crossover. Since there is an overhead for each data transfer between the nodes, a big population makes the selection and crossover slower.

Algorithm 3 Distributed Genetic Algorithm Pseudocode-Slave
Generation of the initial population
1: Same block as in Algorithm 1.
Starts to process the generations
2: Same block as in Algorithm 1.
Calculates the fitnesses and evaluates the individuals (or chromosomes)
3: Same block as in Algorithm 1.
Selection and crossover
4: while true do
Wait for a command requested by the master node and take an action.
5:   command ← CPF()
6:   if command = "Collect Fitness Value" then
7:     Send Fitness Value to Master Node
8:   else if command = "Collect Individual" then
9:     Send Individual to Master Node
10:  else if command = "Send Individual" then
11:    Receive Individual from Master Node
12:    ...

Another point to be discussed in this new algorithm is the mutation. Since the original function was kept as it is, all nodes will process the mutation of P individuals as described in Equation (4). That means the mutation rate will be higher and will depend directly on the number of nodes, Q. Thus, the new mutation rate R_M can be defined as

R_M = P · Q / N    (9)

Hence, if the project uses several nodes (a large value of Q), it is important to use a reasonable population size; otherwise, the mutation rate increases drastically. For example, keeping P = 1 (the lowest value possible), if there are 8 nodes and the global population N is only 32, then one individual in each sub-population of 4 would mutate, which represents a mutation rate of 25% and is considered high.
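This effective mutation rate can be computed directly: with each of the Q nodes mutating P individuals of its V = N/Q sub-population, the global proportion of mutated individuals is P·Q/N. A sketch consistent with the 25% example above (the function name is illustrative):

```c
/* Effective mutation rate when each of the Q nodes mutates P individuals
   of its own V = N/Q sub-population: R_M = P*Q/N, equivalently P/V. */
double effective_mutation_rate(unsigned p, unsigned q, unsigned n)
{
    return (double)(p * q) / (double)n;
}
```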

Communication between Microcontrollers
To implement the distributed genetic algorithm architecture proposed in Section 4.1, data transmission between the targeted devices is necessary. Most manufacturers usually implement at least the following serial interfaces in these devices: Serial Peripheral Interface (SPI), Inter-Integrated Circuit (I2C), and Universal Asynchronous Receiver/Transmitter (UART) [25]. Thus, developing a distributed system suitable to run over one of these interfaces allows it to be used in a wide range of devices.
A challenge of implementing any distributed system on these limited devices is that those common interfaces are simple and each one has different particularities, which affect transmission speed, the maximum number of connected devices, and energy consumption. It is possible to add extra hardware to the microcontroller to provide other interfaces and protocols; however, this would increase the overall price of the embedded system as well as its energy consumption. Therefore, to keep this implementation efficient and avoid extra hardware and costs, the SPI interface was chosen as the communication mechanism between the devices that are part of the distributed system.

SPI is a simple synchronous serial bus standard that operates in full-duplex mode and is widely supported by different types of low-capacity devices [26]. It uses a master-slave architecture where the master node provides the clock to all the slaves and controls when the data transfer starts. When the master sends data, it also receives data from a selected slave on the same clock cycles, which is why it is full-duplex. Another characteristic of SPI is that it requires at least a four-wire bus in the simplest case with only one slave, and each extra slave requires one additional wire. The SPI wiring structure is shown in Figure 7 and the SPI bus lines are as follows:
• SCLK (Serial Clock): wire through which the master node sends the clock signal to the slaves.
• MOSI (Master Output, Slave Input): wire used by the master to send data and by the slave to receive data.
• MISO (Master Input, Slave Output): wire used by the master to receive data and by the slave to send data.
• SS (Slave Select): wire used to select which slave will be enabled to communicate with the master node.
While this distributed genetic algorithm could be implemented over any of these communication interfaces, the reason to choose SPI over I2C or UART is that it is simpler to implement, faster, and has lower power consumption, since it does not need pull-up resistors like I2C [27]. Furthermore, the other interfaces have limitations that would compromise the DGA architecture proposed in this work and its performance. UART works in a point-to-point way, and because most devices such as microcontrollers have a limited number of UART interfaces, sometimes only one, it would be impracticable to connect multiple slave nodes to the master node. Regarding I2C, despite the fact that it supports multiple devices, in order to send or receive data it also needs to transmit the device address before transmitting useful data. This would cause a large overhead for this DGA proposal because, in each generation, several data transmissions are made, both for individuals and for fitness values.
The SPI interface, which is used in this work, can transmit (send and receive) one byte (8 bits) at a time, where 8 clock cycles are necessary for each byte. As described in Section 4.1, the distributed genetic algorithm needs to transmit individuals, which are mapped into M bits (either 8, 16, or 32 bits) and have D dimensions, and fitness values mapped into B bits, which are usually floating-point numbers (typically 4 bytes in IEEE 754 format). Hence, the clock cycles necessary to transmit these values are

c^ind_CLK = M · D + c^trans_CLK    (10)
c^fit_CLK = B + c^trans_CLK    (11)

where c^ind_CLK represents the number of clock cycles to transmit one individual, c^fit_CLK the number of clock cycles to transmit a fitness value, and c^trans_CLK the clock cycles necessary as overhead to start the transmission.

To abstract the transmission of different data types in the DGA implementation proposed in this work, a simple 2-step protocol based on commands and acknowledge messages was developed, allowing the master to inform the selected slave which kind of transfer it is about to make (whether the master will send an individual, receive an individual, receive a fitness value, etc.). Once the slave receives the command, in the next transmission it sends an acknowledge message as a response; then, since both sides are guaranteed to be synchronized, the transmission of useful data can begin. This idea is represented in Figure 8 and a list of the commands and acknowledge messages is shown in Table 1. Therefore, for each transmission of GA content (individual or fitness value), there is an overhead of 16 clock cycles because of the transmission of 2 bytes for the command and acknowledge messages. Thus, in Equations (10) and (11), c^trans_CLK is 16.
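Given the figures above (8 clock cycles per byte and a 16-cycle command/acknowledge overhead per transfer), the per-transfer cost can be computed as follows; the function names are illustrative:

```c
/* SPI transfer cost in clock cycles: one byte costs 8 cycles, and every
   transfer pays a fixed 16-cycle overhead (2 protocol bytes: command + ack). */
#define C_TRANS_CLK 16u

/* cycles to transmit one individual of D dimensions, each mapped into M bits:
   M*D bits = M*D/8 bytes = M*D clock cycles of payload */
unsigned ind_cycles(unsigned m, unsigned d) { return m * d + C_TRANS_CLK; }

/* cycles to transmit one fitness value mapped into B bits */
unsigned fit_cycles(unsigned b) { return b + C_TRANS_CLK; }
```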

Scalability and Overhead
With Algorithms 2 and 3 and the communication protocol described in Section 4.2, it is possible to estimate the communication overhead. The protocol is used in two moments: during the selection and crossover, and at the end, for the collection of the best individual from all nodes. The collection of best individuals is straightforward and deterministic because it depends only on the individuals and the number of nodes, Q. During the selection and crossover, however, there is some randomness, and only selected individuals from slave nodes require an SPI transfer. For example, in the best-case scenario, if all selected individuals are collected from the master node, the only transfers would be to send the new individuals to the slaves. In the worst-case scenario, all selected individuals would be collected from the slaves, therefore more transfers would be necessary. At the end of the selection and crossover, the master finally needs to synchronize all the nodes again.
The numbers of bytes transferred via SPI during the selection and crossover and during the collection of the best individuals, in the worst-case scenario, are given by Equations (12) and (13), where H_sel-cross represents the number of bytes transferred during the selection and crossover, including the commands and acknowledge messages to collect fitness values and to collect and send individuals; and H_col represents the number of bytes transferred during the collection of the best individuals, which includes the commands and acknowledge messages to collect one individual from each node. The total number of bytes transferred is the sum of Equations (12) and (13). Using this result, the total overhead in seconds is

t_overhead = 8 · (H_sel-cross + H_col) / c^SPI_CLK + ∆    (14)

where t_overhead is the time spent with the transfers, in seconds; c^SPI_CLK the clock speed at which the SPI is running, in Hz; and ∆ a non-deterministic value that may represent delays that are a consequence of limitations in the practical implementation, for instance. Finally, this expression assumes that there are no transmission errors and, consequently, no retransmissions.
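This overhead estimate can be turned into a small helper: the total bytes transferred, at 8 clock cycles per byte, divided by the SPI clock frequency, plus the non-deterministic slack ∆ mentioned above (names are illustrative):

```c
/* Worst-case SPI overhead in seconds: total_bytes transferred at 8 clock
   cycles per byte on an SPI bus clocked at f_spi_hz, plus a
   non-deterministic slack `delta` (the Delta term in the text). */
double overhead_seconds(unsigned long total_bytes, double f_spi_hz, double delta)
{
    return 8.0 * (double)total_bytes / f_spi_hz + delta;
}
```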
The t_overhead is an estimation of the maximum time (worst-case scenario) spent only on the overhead, that is, the transmission of fitness values and individuals between the master and the slaves. This amount of time, however, is just part of the total execution time, which also depends on the other genetic algorithm operations, including the evaluation function. An important observation from Equation (14), though, is that the number of nodes, Q, barely affects the total overhead, because other variables, such as the number of individuals, N, and the number of generations, K, are much greater than Q. For example, a real application could use a population of N = 64 individuals and K = 64 generations with only 2 or 4 nodes (Q = 2 or Q = 4, respectively). Thus, the overhead is not expected to significantly affect the scalability with the number of nodes.
Finally, using the results presented in [13] for the processing time of all sections of the GA, it is possible to estimate the total execution time of the DGA. The processing time for the standalone GA can be simplified as

t_GA = t_IFM + K · (t_FFM + ... + t_NPFM)    (15)

where t_GA is the processing time for the standalone GA; t_IFM the processing time to run the initialization; t_FFM the processing time to run the fitness function; and t_NPFM the processing time to run the new population function module. By expanding and simplifying Equation (15), the equation can be rewritten as

t_GA = N · φ1 + K · N · (φ2 + φ3 + φ4) + K · φ5    (16)

where φ1 is the internal time to run the initialization operation; φ2 the internal time to run the fitness operation (evaluation and normalization); φ3 the internal time to run the selection and crossover operations; φ4 the internal time to run the population update operation; and φ5 the sum of other internal times that do not depend on the population size, N. All these values of φ change depending on other parameters, such as the number of dimensions, D, or the number of bits, M, used to represent the individual.
Since the distributed genetic algorithm implementation is built on top of the implementation proposed in [13], if the devices are running at the same clock speed, the total time for the DGA is the same expression as Equation (16), but with the population divided between the Q nodes, plus t_overhead, that is,

t_DGA = (N/Q) · φ1 + K · (N/Q) · (φ2 + φ3 + φ4) + K · φ5 + t_overhead    (17)

Finally, by putting t_DGA as a function of t_GA, the final expression for t_DGA is

t_DGA = t_GA/Q + K · φ5 · (Q − 1)/Q + t_overhead    (18)

The result of Equation (18) is important because it allows estimating the processing time of the DGA from how the standalone GA performs. Also, since t_GA/Q ≫ K · φ5 · (Q − 1)/Q and t_overhead ≫ K · φ5 · (Q − 1)/Q, the processing time of the DGA is approximately the processing time of the standalone GA divided by the number of nodes, plus the overhead: t_DGA ≈ t_GA/Q + t_overhead. Hence, the expression to test whether the DGA will be faster than the standalone GA, for the same parameters, is

t_overhead < t_GA · (Q − 1)/Q    (19)
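The approximation t_DGA ≈ t_GA/Q + t_overhead and the resulting speed-up condition can be expressed directly in C (a sketch of the reasoning above, with illustrative names):

```c
#include <stdbool.h>

/* Approximate DGA runtime: the standalone time divided by the number of
   nodes, plus the communication overhead. */
double t_dga_approx(double t_ga, unsigned q, double t_overhead)
{
    return t_ga / (double)q + t_overhead;
}

/* The DGA beats the standalone GA when t_GA/Q + t_overhead < t_GA,
   i.e. when t_overhead < t_GA * (Q - 1) / Q. */
bool dga_is_faster(double t_ga, unsigned q, double t_overhead)
{
    return t_overhead < t_ga * (double)(q - 1) / (double)q;
}
```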

Results
To validate the implementation proposed in this work, as well as to analyze its performance and correct operation, an embedded system was developed using the same technologies employed in [13]. The source code, developed in C on Atmel Studio 7, was used as the base for this project and modified to accommodate both versions of the distributed algorithm (Algorithms 2 and 3). The distributed embedded system targeted Atmel microcontrollers, particularly the same ATmega328P microcontroller that runs on the Arduino Uno and was used in the previous work. This µC has an 8-bit processor based on the AVR architecture, which runs by default at 16 MHz, and has 32 KB of program memory and 2 KB of data memory [28]. The reason to choose an 8-bit microcontroller is that it is one of the simplest and most limited devices available, with many restrictions; thus, if the implementation works on it, it will also work on more robust devices.
The construction of the DGA embedded system was done using 2 Arduino Uno boards, which is the minimum number of nodes required to run this project, although more devices can be used as long as they respect Equation (7). The 8-bit microcontrollers on these boards were connected to each other via SPI, configured with a clock frequency of 125 kHz (µC base clock of 16 MHz divided by 128), and 4 wires were necessary, as described in Section 4.2. It is important to mention that a delay of 1 ms was added on purpose for each byte transferred via SPI, to reduce the transmission errors that were happening when the SPI was running at full speed. Thus, the value of ∆ will be approximately 1 ms multiplied by the total number of bytes transferred.

Moreover, a third Arduino Uno board was connected to the master node using a regular GPIO pin to help with the measurement of processing time. The idea is simple: when some routine in the master node needs to be measured, the pin is set high before the routine starts and set low when it finishes. The third microcontroller starts a timer when it reads the high value, stops it when it reads the low value, and finally shows the measured time. The wiring of the three µCs is illustrated in Figure 9.

The following sections about resource consumption, specifically memory and processing time, and about the correct operation using Hardware-In-the-Loop, followed the same strategies used in [13]. Also, some experiments had to be done for both the master and the slave implementations, since they have different contents. In the last subsection, there are more results about how this DGA implementation compares to the standalone GA in terms of performance and energy consumption.

Memory Consumption
The first results collected were the program and data memory consumption. The program memory is non-volatile and is used to store the instructions to be executed by the processor, that is, the compiled program. The data memory is volatile and used to store variables during the run of the program. The data memory can be divided into two segments:
• Static memory: the memory consumed by global and static variables, kept allocated during the whole program execution. That means this section of the memory cannot be freed and used by other variables.
• Stack memory: the memory used by local variables, which can be allocated and freed according to their lifetime (for example, a local variable defined inside a function will be freed when the function finishes).
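The two segments can be illustrated with a trivial C fragment; the variable names are, of course, illustrative:

```c
#include <stdint.h>

/* Static memory: a global lives for the whole program run and is counted
   by the compiler at build time (zero-initialized by default). */
uint8_t lookup[32];

uint16_t sum_window(void)
{
    /* Stack memory: these locals exist only while the function runs
       and are freed automatically on return. */
    uint16_t acc = 0;
    uint8_t buf[8];
    for (int i = 0; i < 8; i++) {
        buf[i] = lookup[i];
        acc += buf[i];
    }
    return acc;
}
```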
The measurement of static memory is straightforward because the compiler can calculate it. The stack memory, in turn, needs to be measured empirically. Therefore, both results are shown separately for the master and for the slave node. To simplify the measurements, all experiments were performed with a fixed number of generations K = 64, since this affects only the processing time. Also, the evaluation function used was f_1(x) = x_0^2 − 6x_0 + 8, with dimension D = 1, to avoid the use of external libraries. Finally, the crossover was configured as one-point and the number of mutated individuals was P = 1.
After running the experiments with the parameters above, the program memory consumption for the master and slave implementations is shown in Tables 2 and 3, respectively. The compiled program consumes only a small portion of the 32 KB available and practically does not scale, using only about 11% of the program memory in the master node and 7% in the slave node for almost all scenarios. This result is important because it allows this distributed genetic algorithm implementation to be deployed as part of other projects.

The results regarding data memory are divided into static and stack memory. For all the scenarios tested, the static memory was always 8 bytes. This was expected because this project does not use global or static variables, so that almost all data memory can be used dynamically as stack memory. The results for stack memory, in turn, are shown in Tables 4 and 5, respectively. The numbers obtained in this work are similar to those obtained in [13], because after dividing the global population, both microcontrollers ended up with the same population size used in the experiments of that work. The numbers were also plotted in Figures 10 and 11, and the best approximation by a linear function was computed for all the cases. The stack memory consumption seems to increase linearly with the population size N, and at a slower rate with the increase of the individual size M. While not presented, the same linear increase is expected for the number of dimensions D, since adding another dimension is equivalent to adding another individual. For a typical situation using 2 microcontrollers with a population size of 128 and individuals mapped into 16 bits, the total memory consumption is around 31% for the master and 28% for the slave. This low usage is important because it leaves about 70% of the memory available and allows this DGA implementation to reside together with other projects in the microcontrollers.
Therefore, it is important to consider the peculiarities of each application of this implementation. For this scenario with 2 microcontrollers, a global population of 512 individuals mapped into 32 bits would not be viable because the data memory would not be enough (following the trend, at least 3.2 KB would be required in each µC). As possible solutions, the population size N or the precision M could be reduced, or more microcontrollers could be added to provide more resources. The only drawback of this last approach is that it doubles the hardware cost, since the number of nodes Q must be a power of 2, as explained in Equation (7).

Processing Time
The second result collected from this distributed genetic algorithm implementation was the processing time. The methodology used in [13], which was mostly based on measuring the number of clock cycles with the Atmel Studio 7 debugger, is not suitable for this work because the communication between multiple microcontrollers may not keep the algorithm fully deterministic. As shown in Figure 6, part of this implementation is not synchronized and some nodes may finish the run before others. Another issue can occur when the master sends a command byte but, because of some error, the slave does not receive it properly; in that case, the slave does not send the acknowledge message and the master must resend the command. Thus, the following results present the real run time, experimentally measured with an external timer.
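The resend-on-missing-acknowledge behavior described above can be sketched as a small retry loop. The `ACK` byte value, function names, and retry budget below are assumptions for illustration; the transfer function is injected so the logic can be exercised without hardware (on the ATmega328P it would wrap the SPDR register exchange).

```c
#include <stdint.h>

#define ACK 0x06  /* hypothetical acknowledge byte; the real value is not given in the text */

/* One SPI full-duplex byte exchange, abstracted as a callback. */
typedef uint8_t (*xfer_fn)(uint8_t out);

/* Master-side sketch: resend the command byte until the slave answers
 * with ACK or the retry budget is exhausted. */
int send_command(uint8_t cmd, xfer_fn xfer, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (xfer(cmd) == ACK)
            return 1;   /* slave acknowledged the command */
    }
    return 0;           /* no acknowledge: caller must handle the failure */
}

/* Stand-ins for the SPI bus, used only to exercise the retry logic:
 * one slave that always answers, one that drops every byte. */
static uint8_t ack_always(uint8_t out) { (void)out; return ACK; }
static uint8_t ack_never(uint8_t out)  { (void)out; return 0x00; }
```

Because a retry may or may not happen on any given run, the total run time is not deterministic, which is why the paper measures wall-clock time with an external timer instead of counting cycles in the debugger.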
To evaluate the processing time, the evaluation functions f1(x) to f5(x) were used. For all these functions, the following GA parameters were fixed: population size N = 32, individuals mapped into M = 16 bits, number of generations K = 64, and number of mutated individuals P = 2. The results are presented in Table 6. The processing time does not change much with the type of evaluation function, which can be noticed by comparing functions f2(x), f3(x), and f5(x): they have different mathematical operations but the same number of dimensions and similar run times. The same happened for f1(x) and f4(x), which have one dimension as a common characteristic. Thus, this suggests that the time spent with communication is the predominant part of the total time. For the experiments analyzing N, K, and M, the function f4(x) was used. To evaluate D, the function f2(x) was used, adding further squared terms x_i² when the dimension was greater than 2. For example, to evaluate the version with 4 dimensions, the terms x2² and x3² were added to the function, and so on.
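The way f2(x) is extended to higher dimensions can be sketched directly in code. The sum-of-squares form below is an assumption consistent with the description above (each extra dimension adds another squared term) and with f2's global minimum at the origin reported later; the function name is illustrative.

```c
/* Sketch of f2(x) extended to D dimensions: each extra dimension just
 * adds another x_i^2 term, so the same routine covers D = 2, 4, ... */
static float f2_eval(const float *x, int d)
{
    float sum = 0.0f;
    for (int i = 0; i < d; i++)
        sum += x[i] * x[i];     /* add the term x_i^2 for dimension i */
    return sum;
}
```

This also explains why increasing D costs data memory (one more decoded value per individual) but only slightly more processing time: the extra work is a single multiply-accumulate per dimension.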
The processing time results for N, K, and D are presented in Tables 7-9, respectively. For the results of N and K, observing the rows from top to bottom, the value of M does not affect the processing time much, and the difference between 16-bit and 32-bit individuals is small. However, analyzing the columns from left to right, a roughly linear increase of processing time with both N and K can be noticed in Tables 7 and 8. This impression is confirmed in Figures 12 and 13, where in both cases the points closely follow a first-degree polynomial function.

Finally, the results for the number of dimensions D are presented in Table 9. The value of M affects the time more here than in the two previous cases (N and K). On the other hand, even though increasing the number of dimensions D affects the consumption of data memory, it produces only a slight increase in the processing time. A first-degree polynomial function is plotted in Figure 14 and shows the expected increase for different values of D.

Therefore, the processing time results are important to show how the time grows with the main parameters of the distributed genetic algorithm. All four variables analyzed above (N, K, D, and M) directly influence the time spent with communication between the nodes, which is the main overhead in this case, because an 8-bit microcontroller can transfer only one byte (8 bits) at a time via SPI. The variables D and M define how large each individual is in terms of bytes, while N and K define how many transfers need to be done during a run of the distributed GA. For that reason, it is crucial to select the proper GA parameters to keep the processing time under control.

Validation with Hardware-In-The-Loop
Another important experiment was the verification of the proper functioning of this implementation. To collect the data, the Hardware-In-The-Loop (HIL) technique was used, in which the microcontrollers are connected to a computer through some interface so that they can exchange messages during the run, such as parameters and results. In this project, both the master and slave nodes were connected to the computer through the USART interface and were set to send the current best individual during each generation. On the computer, a Python program collects the data and, after all generations, plots a chart showing the convergence of the DGA. The functions employed in this section are f2(x) and f4(x), which are shown in Figures 15 and 16, respectively. The first experiment used the evaluation function f2(x), where the goal is to find the global minimum. The search space for all dimensions was defined between −5 and 5 and the DGA was set up with the following parameters: population size N = 16, individuals mapped into M = 16 bits, dimensions D = 2, number of generations K = 64, and number of mutated individuals P = 1. After running the distributed genetic algorithm, the local population in both nodes converged close to the correct result, which is (0, 0). This is shown separately for the master in Figure 17 and for the slave in Figure 18, where each dimension is independent and converges at a different moment. For this particular run, after finishing all the generations and comparing the best individuals of all nodes, the one from the slave was selected as the final result, with the value (0.000076, 0.000687).
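The per-generation HIL report each node sends over the USART can be sketched as a small formatting routine. The text line format (generation followed by the decoded best individual, comma-separated) is an assumption for illustration; the paper does not specify the wire format that its Python collector parses.

```c
#include <stdio.h>

/* Sketch of a per-generation HIL report for a D = 2 problem: write the
 * generation number and the decoded best individual into a text line.
 * On the AVR, the buffer would then be pushed byte-by-byte through the
 * USART to the Python collector running on the PC. */
static int format_report(char *buf, size_t len, int gen, float best0, float best1)
{
    /* returns the number of characters written (excluding the NUL) */
    return snprintf(buf, len, "%d,%f,%f\n", gen, best0, best1);
}
```

Emitting one line per generation is what makes it possible to plot the convergence of each dimension independently, as Figures 17 and 18 do.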
The second function used for the HIL validation was f4(x). The intention was to find the local maximum in the search space between 0 and 1. The distributed genetic algorithm was configured with population size N = 32, individuals mapped into M = 32 bits, dimensions D = 1, number of generations K = 64, and number of mutated individuals P = 4. In both the master and the slave node, the populations converged to the expected local maximum, which is located around x = 0.91. At the end of the algorithm, the populations in both nodes were homogeneous and the best individuals had the same value x = 0.910204, so the best individual from the master was used as the final result. The results for the master and the slave are presented in Figures 19 and 20, respectively.

Comparison with Standalone Version
A final experiment investigated how the distributed genetic algorithm proposed in this work compares to the standalone version, that is, the genetic algorithm that runs on a single 8-bit microcontroller, presented in [13]. There are two motivations for this experiment:

•
Verify whether it is possible to accelerate the genetic algorithm for a certain application by adding more microcontrollers;
•
Evaluate whether, by using multiple microcontrollers configured with lower voltage and lower clock frequency, it is possible to save energy while keeping a performance similar to the standalone version.
Analyzing the results presented in Section 5.2, there is a large overhead due to the SPI communication between the microcontrollers, which consumes a lot of processing time even with those simple evaluation functions. Thus, in order to gain some advantage from multiple cores, the evaluation function needs to be complex enough so that the processing time spent on it is much higher than the time spent on data transfer between the nodes. To preserve the original results, the evaluation functions were modified so that they consume more clock cycles while producing the same output. This idea is expressed in Algorithm 4.

Algorithm 4 Redefinition of Evaluation Function to Become Slower
Define how many times the evaluation function will repeat (2000 times, for example).
1: REPEAT ← 2000
2: Run the original evaluation function REPEAT times, returning its result.

For the following experiments, the GA was set up with the following parameters: population size N = 32, number of generations K = 64, individual size M = 16, number of mutated individuals P = 1, and evaluation function f4(x), which was set up to repeat 1000, 2000, 4000, and 8000 times using the strategy proposed in Algorithm 4. By measuring the number of clock cycles this function needs to run in each case, the processing time of the modified evaluation function, t_EFM-slow, can be calculated as

t_EFM-slow = c_f-slow(x) / CLK,

where c_f-slow(x) is the number of clock cycles needed to run the modified evaluation function and CLK is the clock frequency of the microcontroller. This processing time is used below for the different scenarios.
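Algorithm 4 can be sketched as a thin wrapper around the original evaluation function. The function names and the stand-in for f4(x) below are illustrative assumptions; the essential point is that the wrapper multiplies the cycle count by REPEAT without changing the returned value.

```c
/* Sketch of Algorithm 4: run the original evaluation function REPEAT
 * times so it burns more clock cycles but yields the same result.
 * REPEAT is the knob varied in the experiments (1000, 2000, 4000, 8000). */
#define REPEAT 2000

float slow_eval(float x, float (*original)(float))
{
    volatile float y = 0.0f;    /* volatile keeps the loop from being optimized away */
    for (int i = 0; i < REPEAT; i++)
        y = original(x);
    return y;
}

/* A stand-in for f4(x), used only for illustration; the paper's actual
 * f4 is not reproduced here. */
static float f4_stub(float x) { return x * x; }
```

The `volatile` qualifier matters in practice: without it an optimizing compiler could hoist the call out of the loop, and the "slower" function would consume no extra cycles at all.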
The values of c_f-slow(x) were collected in experiments with Atmel Studio 7 and are shown in Table 10. The first measurements were done with both the standalone and the distributed version of the GA running at the same clock speed and voltage. As shown in Table 11 and Figure 21, when the evaluation function is not complex enough, the overhead due to SPI communication makes the distributed GA slower than the standalone GA. However, as the evaluation function becomes more complex, the distributed GA becomes faster. In fact, this can also be noticed by analyzing the two polynomial functions that fit the points shown in Figure 21, which have the form t = a × c_CLK + b and are given in Equation (22) for the standalone version and Equation (23) for the distributed version, where t represents the processing time in seconds and c_CLK the number of clock cycles of the evaluation function. When c_CLK is large enough, the distributed version runs approximately 2 times faster than the standalone one, as demonstrated by taking the limit of the ratio between the two fitted functions.

Another important aspect of Equation (23) is the high overhead. Applying the Genetic Algorithm parameters and the SPI clock frequency, set to 125 kHz, to Equations (14) and (20) gives 26.155 s as the maximum overhead for the worst-case scenario, that is, if all individuals were selected from the slave. However, since this is unlikely to happen, the 14.97 s in Equation (23) is reasonable and under the theoretical limit. Finally, to validate the theoretical model presented in Equation (18), applying the results from the experiment shown in Equation (22) and from Equation (25) yields the expected equation for the distributed version, Equation (26), where t_dist is the estimated processing time for the distributed GA with the same configuration. The result of Equation (26) is similar to the one obtained experimentally in Equation (23).
It is important to emphasize again that t_overhead is calculated for the worst-case scenario (all individuals selected from the slave), and that is why the second term, 26.186, is greater than 14.97. Figure 22 illustrates how reasonable the theoretical model is when compared to the experiments, showing that the theoretical model (blue line) has approximately the same slope as the experimental result (cyan line). The theoretical model is higher because it represents the time for the DGA when the overhead is maximum (the worst-case scenario). For most practical applications, the overhead will be lower and the line will be shifted vertically to a lower position.

The second experiment ran the distributed version with reduced voltage and lower clock frequency for the same GA configuration used above. The motivation for this configuration is to take advantage of how dynamic power is defined for CMOS systems, which are present in regular microcontrollers [29,30]. By reducing the frequency and voltage, it is possible to reduce the power, and consequently the energy consumption, at a higher rate. This idea can be verified in the equation that defines the power, P, as the sum of the dynamic power, P_dynamic, and the static power, P_static, in a CMOS integrated circuit:

P = P_dynamic + P_static = C × f × V² + P_static, (27)

where C is the capacitance of the transistor gates, f the operating frequency, V the power supply voltage, and P_static the static power, which depends mostly on the number of transistors and how they are organized spatially. Thus, by reducing the voltage V, the dynamic power is reduced quadratically. The behavior of Equation (27) can also be found in the datasheet of the ATmega328P microcontroller [28]. Figure 23 shows the current I_CC consumed by the µC for different combinations of frequency (from 0 to 20 MHz) and voltage (from 2.7 V to 5.5 V).
Since power can also be defined as P = V × I_CC, the power is reduced for low values of voltage and frequency as well (power decreases from right to left and from top to bottom in Figure 23).

To run this last experiment, both microcontrollers in the DGA were arranged to run at 8 MHz with a voltage of 2.7 V, which is the minimum operational voltage for this frequency, as shown in Figure 24. The processing time for the same configuration of the previous experiment is shown in Table 12. As expected, running at a slower clock frequency increased the processing time, and even in situations where the evaluation function is complex, the distributed GA is always slower than the standalone GA running at 16 MHz, as expressed in Table 11. The comparison between these new results and the standalone version is presented in Figure 25. Both lines seem to be parallel, which suggests the distributed version with 2 nodes and half of the clock speed will never be faster than the standalone version. In fact, the first-degree polynomial function that fits these points is given in Equation (29),
where t_red is the processing time for the DGA running at the reduced clock speed. This equation has almost the same slope as Equation (22), and the small difference may be a consequence of measurement error or lack of precision. Thus, this result suggests that, for this GA configuration, the DGA will always be about 16.43 s slower than the standalone GA, no matter how complex the evaluation function is. However, for long runs, the relative time difference decreases. For example, if the standalone GA takes 5 min, the DGA will take 5 min plus 16.43 s, which is only about 5% slower.

Even though the distributed genetic algorithm with 2 microcontrollers running at half the frequency of the standalone version is always slower, the main advantage of this structure is the saving of power and, consequently, energy. This is one of the most common goals in embedded systems because they normally run on batteries and need to be power-efficient. The energy consumption, E, is the product of the power by the elapsed time, defined as

E = P × ∆t, (30)

where ∆t is the elapsed time. Since the elapsed times for the standalone version and for the distributed version at lower frequency, represented by t_std and t_red respectively, were calculated in Equations (22) and (29), applying them to Equation (30) determines the energy consumption equations for both cases as

E_std = P_std × t_std = P_std × (0.0001299 × c_CLK + 0.06282)

and

E_red = P_red × Q × t_red,

where E_std, P_std, and t_std are respectively the energy, power, and time consumption of the standalone system; E_red, P_red, and t_red are respectively the energy, power, and time consumption of the distributed system with reduced clock speed; and Q is the number of nodes in the distributed system (Q = 2 in these results).
where the unit mAh means milliampere-hour. Plotting the energy consumption equations in Figure 26 shows that the distributed version grows more slowly than the standalone version. The energy consumption of the distributed genetic algorithm for this configuration is lower than that of the standalone GA when the evaluation function takes at least 73,244 clock cycles, as demonstrated by solving E_std = E_red, that is, 0.0043841 × c_CLK + 2.120175 = 0.0016270 × c_CLK + 204.0606, whose solution is c_CLK = 73,244. For example, when the evaluation function requires around 1,000,000 clock cycles, the standalone genetic algorithm needs approximately 130 s and 4400 mAh, while the distributed GA needs approximately 147 s but only 1832 mAh, which is less than half of the energy spent by the standalone one. When the number of clock cycles is large enough, the distributed version consumes merely 37.1% of the standalone energy, as demonstrated by

lim_{c_CLK→∞} E_red / E_std = 0.0016270 / 0.0043841 = 0.3711.
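The break-even point above is just the intersection of the two fitted energy lines. A minimal sketch of that computation, using the coefficients reported in the text:

```c
/* Sketch of the break-even computation: given two fitted lines
 * E1(c) = a1*c + b1 and E2(c) = a2*c + b2, solve E1(c) = E2(c) for c.
 * With the paper's coefficients this yields the evaluation-function
 * complexity (in clock cycles) beyond which the reduced-voltage DGA
 * consumes less energy than the standalone GA. */
static double breakeven_cycles(double a1, double b1, double a2, double b2)
{
    return (b2 - b1) / (a1 - a2);
}
```

Evaluating `breakeven_cycles(0.0043841, 2.120175, 0.0016270, 204.0606)` reproduces the reported threshold of approximately 73,244 clock cycles.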
Therefore, the results presented in this section show some possible scenarios where the distributed genetic algorithm has advantages over a regular GA running on a single microcontroller. For situations where the evaluation function is not too complex, the standalone version is still the best option because it runs faster and consumes less energy. However, if the function is complex enough, the proposed DGA, even with a large overhead due to the SPI communication, can be used either to accelerate the execution by running the microcontrollers at a high frequency or to save power by reducing voltage and frequency. Finally, similar results are expected when employing more microcontrollers (4, 8, etc.); with more cores, the clock could be further reduced to 4 MHz, 2 MHz, and so on.

Conclusions
This work proposed a strategy to implement distributed genetic algorithms on 8-bit microcontrollers. Details about the implementation, constraints, and limitations were presented, as well as how this strategy compares to others in the literature. Several experiments showed that the DGA deployed as an embedded system has a low memory consumption and works properly. Furthermore, the processing time results showed that there is a large overhead due to the communication via SPI, which makes this implementation not the best choice for problems where the evaluation function is not very complex. Nevertheless, when it is sufficiently complex, the distributed version can be used either to accelerate the run or to reduce the energy consumption by lowering the voltage and clock speed without losing much performance compared to the regular GA.
Therefore, we conclude that this implementation is feasible for embedded systems using 8-bit microcontrollers and can be a good alternative to a regular GA when the processing time of the evaluation function is high. In this sense, it can be applied in numerous situations where the time limitation due to the SPI communication overhead is not a problem, and it may be useful for some non-real-time applications in IoT, for instance. Finally, as future work, more results can be obtained by analyzing how the performance scales with different SPI clock frequencies, different communication protocols, different distributed GA architectures, and the addition of more microcontrollers as slaves.

Funding: This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)-Finance Code 001.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.