Real-time Neural Networks Implementation Proposal for Microcontrollers

The adoption of intelligent systems with Artificial Neural Networks (ANNs) embedded in hardware for real-time applications currently faces a growing demand in fields like the Internet of Things (IoT) and Machine to Machine (M2M). However, the application of ANNs in this type of system poses a significant challenge due to the high computational power required to process their basic operations. This paper shows an implementation strategy for a Multilayer Perceptron (MLP) neural network on a microcontroller (a low-cost, low-power platform). A modular, matrix-based MLP with the full classification process was implemented, as well as the backpropagation training, on the microcontroller. Testing and validation were performed through Hardware-in-the-Loop (HIL) simulation of the Mean Squared Error (MSE) of the training process, of the classification results, and of the processing time of each implementation module. The results revealed a linear relationship between the values of the hyperparameters and the processing time required for classification, and the measured processing times are compatible with the timing requirements of many applications in the fields mentioned above. These findings show that this implementation strategy and this platform can be applied successfully to real-time applications that require the capabilities of ANNs.


Introduction
Microcontrollers (µCs) have been applied in many areas: industrial automation, control, instrumentation, consumer electronics, among others. Moreover, there is an ever-growing demand for these devices, especially in emerging sectors like the Internet of Things (IoT), Smart Grid, Machine to Machine (M2M) and Edge Computing.
A µC can be described as a programmable hardware platform that enables Embedded System applications in specific cases. µCs are mainly composed of a General-Purpose Processor (GPP) of 8, 16 or 32 bits, connected to peripherals such as Random Access Memory (RAM), flash memory, counters, signal generators, communication-protocol-specific hardware, and analog-to-digital and digital-to-analog converters, among others.
An important fact is that in most products available today, the embedded µCs encapsulate an 8-bit GPP with enough computational power and memory storage to serve as a resourceful platform for many embedded applications. Moreover, those same 8-bit µCs are considered low-power and low-cost platforms when compared with other platforms used to implement AI applications with Artificial Neural Networks (ANNs) (19; 17).
The use of ANNs for embedded intelligent systems with real-time constraints has been a recurrent research topic (10; 12; 16; 15; 11). A large part of the works devised from this topic is driven by the growing demand for AI techniques in IoT, M2M and Edge Computing applications.
A major problem with implementing ANN applications in embedded systems is the computational complexity associated with ANNs. Regarding the Multi-Layer Perceptron (MLP) described in this work, there are many inherent multiplications and evaluations of nonlinear functions (10; 12; 16). Besides the feedforward process between the inputs and the synaptic weights, the MLP also has an associated training algorithm used to find the optimum weights of the neural network, which is very computationally expensive (7). If the training process is also performed in real time, the computational complexity is increased several times, which automatically raises the processing time and the memory requirements of the hardware platform used in the application (10; 12; 16).
The use of MLP neural networks for real-time applications on µCs is not a new effort. In (2), the authors describe a method to linearize the nonlinear characteristics of many sensors using an MLP on an 8-bit PIC18F45J10 µC. The results showed that, if the network architecture is right, even very difficult linearization problems can be solved with just a few neurons.
In (3), a fully connected multi-layer ANN is implemented on a low-end, inexpensive µC, and a pseudo-floating-point multiplication is devised to make use of the internal multiplier circuit of the PIC18F45J10 µC used. The authors managed to store 256 weights in the 1 KB SRAM of the µC and deemed it sufficient for most embedded applications.
In (4), Farooq et al. implemented a hurdle-avoidance system controller for a car-like robot using an AT89C52 µC as the embedding platform. They implemented an MLP with a single hidden layer and a backpropagation training algorithm. The proposed system was tested in various environments containing obstacles and was found to avoid them successfully.
The paper (14) presents a neural network that is trained with the backpropagation (BP) algorithm and validated using a low-end, inexpensive PIC16F876A 8-bit µC. The authors chose a chemical process as a realistic example of a nonlinear system to demonstrate the feasibility and performance of this approach, comparing the results obtained on the microcontroller against a computer implementation. With three inputs, five hidden neurons and one output neuron in the MLP, the application showed complete suitability for a µC-based approach. The comparison showed almost no difference in Mean Square Error (MSE) between the µC and computer implementations after 30 iterations of the training algorithm.
In the work presented in (20), an ANN-based PID controller is shown using an ARM9-based µC. The authors modeled the controller to overcome the nonlinearity of a microbiological fermentation process and to provide a better-performing control strategy. The results showed more accurate control over the controlled parameters with the ANN-PID, achieving greater control performance and a greater ability of the system to meet its requirements.
In (6), the authors implemented a classification application for the MNIST dataset, covering all 10 digits with 99.15% testing accuracy. Moreover, this was achieved with a notoriously resource-hungry ANN model, the Convolutional Neural Network (CNN), using less than 2 KB of SRAM and 6 KB of FLASH program memory. This work was embedded on an Arduino Uno development kit, which comprises a breakout board for the 8-bit ATMega328p µC, with 32 KB of program memory and 2 KB of work memory, running at 16 MHz.
The reference works presented above have shown different aspects of implementing ANNs on µCs. In (3; 4; 14) one finds application proposals with MLPs trained by the Backpropagation algorithm (BP) implemented on µCs with good results, but none of them discusses how memory usage and processing time vary with the MLP hyperparameters. In (2; 13; 20) the authors presented some results regarding processing time, but none as a function of the number of artificial neurons, nor comparing the time required to train the ANN in real time or offline. Therefore, this work proposes an implementation of an MLP that can be trained with the BP algorithm on an ATMega2560 8-bit µC, in the C language, to show that many ANN applications are feasible on this µC platform. We also present two implementations: one model trained on the µC in real time, and another trained with Matlab and then ported to the same architecture to execute classification in real time.
In addition to the implementation proposal, processing-time figures for each feedforward and backward step of the training and classification processes are presented, along with how these figures vary with the hyperparameters of the MLP. We also validate the classification and training results using a Hardware-in-the-Loop (HIL) strategy.

System Description

Figure 1 shows a block diagram detailing the modules implemented in the µC. This implementation takes the work (5) as a direct reference, which models the MLP in matrix form, simplifying and modularizing all the feedforward and backward propagations. As seen in Figure 1, the MLP implemented here is structured as four main modules: the Input Random Permutation Module (IRPM), the FeedForward Module (FFM-k), the Error Module (EM) and the Backpropagation Module (BPM-k), where the variable k represents the ANN layer. It is important to add that the implementation presented in this work is shown with two layers (one hidden and one output layer), but it can easily be extended to more. The modules and the associated mathematical modeling are detailed in the following subsections.

Associated Variables
The implementation is composed of four main variables that are passed by reference between the modules. The first is the input signals matrix,

$$\mathbf{Y}^{0}(n) = \left[\mathbf{y}^{0}_{1}(n), \ldots, \mathbf{y}^{0}_{s}(n), \ldots, \mathbf{y}^{0}_{N}(n)\right]^{T},$$

where $k = 0$ indicates the input layer of the MLP, $n$ is the iteration number of the training algorithm and $N$ is the number of samples of the training set. The $s$-th sample is defined as

$$\mathbf{y}^{0}_{s}(n) = \left[y^{0}_{s1}(n), \ldots, y^{0}_{sj}(n), \ldots, y^{0}_{sP}(n)\right],$$

where $P$ is the number of available inputs of the MLP. The synaptic weights matrix of the $k$-th layer, $\mathbf{W}^{k}(n)$, is an $H_{k} \times H_{k-1}$ matrix, where $H_{k}$ is the number of neurons of the $k$-th layer and the element $w^{k}_{ij}(n)$ represents the synaptic weight associated with the $i$-th artificial neuron and the $j$-th input signal of the $k$-th layer at the $n$-th iteration. The output signals matrix, $\mathbf{Y}^{L}(n)$, where $L$ is the number of layers of the MLP (and therefore indexes the last one), has rows

$$\mathbf{y}^{L}_{s}(n) = \left[y^{L}_{s1}(n), \ldots, y^{L}_{sM}(n)\right],$$

the output signals associated with the $s$-th input sample $\mathbf{y}^{0}_{s}(n)$, where $M$ is the number of output neurons. Finally, the variable $\mathbf{D}(n)$ holds the desired values, or labels, of the training set composed of $\mathbf{D}(n)$ and $\mathbf{Y}^{0}(n)$; its $s$-th row, $\mathbf{d}_{s}(n)$, is the vector of desired values associated with the input signal $\mathbf{y}^{0}_{s}(n)$. For an MLP with two layers, $\mathbf{W}^{1}(n)$ is the synaptic weights matrix of the hidden layer ($H_{0} = P$) and $\mathbf{W}^{2}(n)$ is the weights matrix of the output layer ($H_{2} = M$) at the $n$-th iteration. In addition to these two matrices, a few others are created to hold intermediary results of the feedforward and backward propagation operations of the MLP.
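These matrix variables can be represented with a small, uniform type. The following C sketch (the struct name and field layout are our assumptions, not the paper's code; the paper does store matrices as flat IEEE 754 float arrays accessed through pointers) illustrates one possible representation:

```c
#include <assert.h>
#include <stdlib.h>

/* Flat row-major float matrix, usable for Y^k(n), W^k(n), D(n) and E(n). */
typedef struct {
    int rows;     /* e.g. H_k for W^k(n), N for Y^0(n) */
    int cols;     /* e.g. H_{k-1} for W^k(n), P for Y^0(n) */
    float *data;  /* rows*cols single-precision (IEEE 754) values */
} Matrix;

static Matrix mat_alloc(int rows, int cols) {
    Matrix m = { rows, cols, calloc((size_t)(rows * cols), sizeof(float)) };
    return m;
}

/* Element access via pointer arithmetic, as done throughout the modules. */
static float *mat_at(Matrix *m, int i, int j) {
    return m->data + i * m->cols + j;
}
```

Passing such structs by reference between modules avoids copying the weight and signal matrices, which matters on a platform with only a few KB of SRAM.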

Feedforward Module (FFM-k)
This module is responsible for the feedforward operation of the MLP, propagating the inputs through each $k$-th layer during each $n$-th iteration. As in most BP implementations, this proposal can operate in online mode or in batch mode (by setting $N > 1$). Each FFM-$k$ evaluates, at the $k$-th layer,

$$\mathbf{Y}^{k}(n) = \varphi\left(\mathbf{W}^{k}(n)\,\mathbf{Y}^{k-1}(n)\right),$$

where $\mathbf{Y}^{k-1}(n)$ is the output signals matrix of the previous layer, $\mathbf{Y}^{k}(n)$ is the output of the current layer and $\varphi(\cdot)$ is the activation function of the current $k$-th layer. As previously said, a few other matrices are devised as part of the BP calculation, and some of them store intermediate results of this equation. Algorithm 1 describes the pseudocode executed when the FFM-$k$ module is called. The function prodMatrix() computes the matrix product between $\mathbf{W}^{k}(n)$ and $\mathbf{Y}^{k-1}(n)$ and stores the result in $\mathbf{Y}^{k}(n)$. The actFun() function then applies the activation function of the $k$-th layer to each element of $\mathbf{Y}^{k}(n)$, storing the result in place as it runs.
Algorithm 1 Description of the algorithm implemented by the FFM-k module.
The error module (EM) calculates the error between the desired values matrix $\mathbf{D}(n)$ and the output signals matrix $\mathbf{Y}^{L}(n)$ of the last layer, $L$:

$$\mathbf{E}(n) = \mathbf{D}(n) - \mathbf{Y}^{L}(n).$$

Algorithm 2 describes how the EM module works: the difMatrix() function computes the element-by-element difference between $\mathbf{D}(n)$ and $\mathbf{Y}^{L}(n)$ and stores the result in the $\mathbf{E}(n)$ matrix.
Algorithm 2 Pseudocode implementation of the EM module.
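The EM step is small enough to show in full. A C sketch of difMatrix() (the name follows the text; flat storage and the signature are our assumptions):

```c
#include <assert.h>

/* difMatrix(): E(n) = D(n) - Y^L(n), element by element. All three
 * matrices share the same dimensions, here flattened to `len` floats. */
static void difMatrix(const float *D, const float *YL, float *E, int len) {
    for (int i = 0; i < len; i++)
        E[i] = D[i] - YL[i];
}
```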
The last module is responsible for the main training step of the implementation. The BPM-$k$ calculates the new, updated values of the synaptic weights matrices of the $L$ layers that better approximate the desired values, $\mathbf{W}^{k}(n+1)$. The update implemented by the BPM-$k$ is defined as

$$\mathbf{W}^{k}(n+1) = \mathbf{W}^{k}(n) + \Delta\mathbf{W}^{k}(n),$$

with

$$\Delta\mathbf{W}^{k}(n) = \eta\,\mathbf{g}^{k}(n)\left(\mathbf{Y}^{k-1}(n)\right)^{T} + \alpha\,\Delta\mathbf{W}^{k}(n-1),$$

where $\eta$ is the learning rate of the BP algorithm, $\alpha$ is the learning-moment (momentum) rate factor and $\mathbf{g}^{k}(n)$ is the local gradient matrix of the $k$-th layer, defined for the output layer as

$$\mathbf{g}^{L}(n) = \mathrm{prod}\left(\mathbf{E}(n),\ \varphi'\left(\mathbf{W}^{L}(n)\,\mathbf{Y}^{L-1}(n)\right)\right)$$

and for the hidden layers as

$$\mathbf{g}^{k}(n) = \mathrm{prod}\left(\left(\mathbf{W}^{k+1}(n)\right)^{T}\mathbf{g}^{k+1}(n),\ \varphi'\left(\mathbf{W}^{k}(n)\,\mathbf{Y}^{k-1}(n)\right)\right).$$

The prod() function implements an element-wise product between two matrices, implemented in Algorithm 4, and $\varphi'(\cdot)$ is the derivative of the activation function.
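The weight-update step above can be sketched in a few lines of C. This is not the authors' code; the function name, signature and the convention that dW already contains the $\eta\,\mathbf{g}^{k}(n)(\mathbf{Y}^{k-1}(n))^{T}$ term are our assumptions:

```c
#include <assert.h>
#include <math.h>

/* One BPM-k weight update with momentum. W, dWprev and dW are flattened
 * Hk x Hkm1 matrices; dW holds eta * g_k * Y_{k-1}^T for this iteration,
 * dWprev holds the previous iteration's total update (momentum term). */
static void updateWeights(float *W, float *dWprev, const float *dW,
                          float alpha, int len) {
    for (int i = 0; i < len; i++) {
        float step = dW[i] + alpha * dWprev[i];
        W[i]      += step;
        dWprev[i]  = step;   /* remembered for the next iteration */
    }
}
```

Keeping dWprev in a dedicated buffer is what costs the extra "intermediary result" matrices mentioned earlier.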

Basic Operations
In this section, we describe how a few basic operations used in this work were implemented. First, Algorithm 3 implements the matrix product. An important detail is that all the modules were implemented using pointer arithmetic; as a result, the source code is highly optimized for memory usage, which reduces the overall execution time of each module. Every matrix in this implementation stores floating-point (IEEE 754) (1) values. Algorithm 4 shows the pseudocode of the element-wise product between two matrices used in this work, more specifically in the BPM-k module. The same pseudocode can be slightly altered to perform subtraction or addition by editing the operation between matrices on line 5. It is important to notice that Algorithm 4 requires both matrices to have the same dimensions.
Algorithm 5 presents the implementation of the trace of the product of two matrices as a single operation inside a nested loop. Algorithm 6 shows the steps implemented for calculating the activation function of the induced local field of each neuron, an element-wise operation; in this work, Algorithm 6 implements a sigmoid activation function.
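Algorithms 4 and 5 can be sketched as follows (flat row-major storage and the function names are our assumptions; the key point of Algorithm 5 is that trace(A·B) is accumulated inside one nested loop, without ever materializing the product matrix):

```c
#include <assert.h>

/* Algorithm 4 sketch: element-wise product C = A .* B over matrices of
 * identical dimensions, flattened to `len` floats. Replacing the `*` with
 * `-` or `+` yields element-wise subtraction or addition. */
static void prodElem(const float *A, const float *B, float *C, int len) {
    for (int i = 0; i < len; i++)
        C[i] = A[i] * B[i];
}

/* Algorithm 5 sketch: trace(A * B) in a single nested loop, using
 * trace(AB) = sum_i sum_j A[i][j] * B[j][i], with A of size r x c and
 * B of size c x r. No r x r product matrix is ever allocated. */
static float traceProd(const float *A, const float *B, int r, int c) {
    float acc = 0.0f;
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            acc += A[i * c + j] * B[j * r + i];
    return acc;
}
```

Avoiding the intermediate product matrix is what makes the trace operation affordable within a few KB of SRAM.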

Methodology
The implementation was validated using the HIL simulation strategy, as explained in (17). A parameter that was also analyzed is the memory occupation, regarding both memories of the ATMega-2560 µC used in this paper: FLASH (program memory) and SRAM (work memory).
As previously mentioned, the modules were implemented in the C language for AVR µCs using avr-gcc version 5.4.0, inside Atmel Studio 7, an Integrated Development Environment (IDE) made available by Microchip. After compilation and binary code generation, the solution was embedded into an ATMega-2560. This µC has an 8-bit GPP with 256 KB of FLASH program memory and 8 KB of SRAM work memory, and a maximum processing speed of 1 MIPS/MHz.
The ATMega-2560 is mounted on an Arduino Mega v2.0 development kit, which provides a breakout board for all the ATMega-2560 pins and the other components required for the µC to function properly. A great feature of this development kit is the onboard Universal Serial Bus (USB) programmer, which lets the developer simply connect the board to a computer's USB port and easily test implementations on the µC.
This work is further validated using two cases. First, we train the MLP-BP to behave as an XOR operation, the simplest possible MLP training case, to evaluate its ability to learn a nonlinear relationship between two inputs. Second, we train the network to aid a car-like virtual robot in avoiding obstacles in a virtual map using Matlab, with three cases of increasing ANN architecture complexity. The assembly and analysis of these two validation cases are presented in the following subsections.

Hardware in the Loop Simulation
The tests were executed with the µC running at a 16 MHz clock. The results are obtained by setting a digital pin of the ATMega-2560 to HIGH, executing one of the modules, setting the same pin back to LOW, and measuring the duration of the logic HIGH level of that pin on an oscilloscope. This can be seen in Figure 2, and a picture of the HIL assembly is shown in Figure 3. In Figures 5 and 6 we present curve plots of the results described in Table 5, where we performed curve fitting on the measured points with polynomial regression. Figure 4 shows a closer look at the oscilloscope measurements, covering all the modules whose execution durations were measured for the XOR validation case described below.
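The curve fitting mentioned above can be reproduced offline. As a sketch (not the authors' code, and the sample points below are made up), an ordinary least-squares line fit over (H1, time) measurements recovers the slope and intercept of a linear timing model:

```c
#include <assert.h>
#include <math.h>

/* Ordinary least-squares fit of t = a*H1 + b over n measured points
 * (x[i] = hidden-neuron count, y[i] = measured execution time). */
static void fitLine(const float *x, const float *y, int n,
                    float *a, float *b) {
    float sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *a * sx) / n;
}
```

With a fitted (a, b) pair per module, the execution time for an untested H1 can be predicted before committing an architecture to the µC.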
The calculation of the MSE during training was also embedded into the µC implementation, and the values are transmitted through the Universal Synchronous Asynchronous Receiver Transmitter (USART) peripheral to a computer. The MSE is calculated as

$$\mathrm{MSE}(n) = \frac{1}{N}\,\mathrm{trace}\left(\mathbf{E}(n)\,\mathbf{E}(n)^{T}\right),$$

where trace(·) is the implementation seen in Algorithm 5.
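Since trace(E·E^T) is simply the sum of all squared errors, the MSE can be computed in one pass over the error matrix. A C sketch (the per-element 1/(N·M) normalization here is our choice for illustration; the paper normalizes by the number of samples):

```c
#include <assert.h>

/* MSE over an N x M error matrix E, flattened row-major.
 * trace(E * E^T) equals the sum of every squared entry of E. */
static float mse(const float *E, int N, int M) {
    float acc = 0.0f;
    for (int i = 0; i < N * M; i++)
        acc += E[i] * E[i];
    return acc / (float)(N * M);
}
```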

XOR Operation
A validation of the real-time embedded training using the BPM-k module was devised for the XOR operation, described in Table 1, the most basic MLP validation case. This test comprised both a training and a classification phase, both running on the µC, with the configuration seen in Figure 1. The test was executed in batch mode with N = 4, two layers of neurons (L = 2), two input signals (P = H_0 = 2), a varying number of neurons in the hidden layer (H_1), as seen in Table 5, and a single neuron in the output layer (H_2 = 1). It is important to notice that the strategy presented here can be assembled in various different configurations just by modifying the P, H_k and M parameters, the µC's internal memories being the limiting factor for the ANN architecture size.

Virtual Car-Like Robot
This work also tested an MLP-BP model to control a virtual car-like robot from the iRobot Matlab toolbox, provided by the United States Naval Academy (USNA). This toolbox provides a virtual environment and an interface to test control algorithms on a differentially steered robot on various maps, with different obstacles and different combinations of distance or proximity sensors.
As previously stated, this virtual environment provides an interface to control a differentially steered virtual robot. The robot is controlled by changing the angular speed, in rad/s, of its two wheels, the Right Wheel (RW) and the Left Wheel (LW). This work used three proximity sensors with a virtual 3-meter range to provide the inputs to the MLP: the Front Sensor (FS), the Left Sensor (LS) and the Right Sensor (RS).
We tested the MLP model with three different datasets, increasing the hyperparameters to evaluate the behavior of the robot as we increased the network architecture's complexity. The cases are comprised of three tables with five columns each, as seen in Table 2. The first case is very simple with only eight conditions provided to train the network, which means the network must be able to devise a knowledge representation on how to behave with cases not previously trained from these simple constraints.
The second case has greater complexity, with 18 conditions, as seen in Table 3. As also noted in Table 3, this case required the same number of neurons in the hidden layer, meaning that the first architecture could still be used for a more complex case.
The third and most complex case has 27 conditions to train the MLP-BP. This dataset required a bigger architecture than the previous ones, with double the number of neurons in the hidden layer (H_1 = 10).

Results

XOR Operation
The binary code for the XOR operation generated from this implementation, as seen in Figure 1, occupied 6.672 KB of program memory (equivalent to 2.6%), a quite compact solution compared with the maximum program memory available on the ATMega-2560, and also with the 5.904 KB presented in (8) for an MLP implementation without BP training, considering that these 2.6% already include the training algorithm. The information in Table 5 shows the values obtained for up to H_1 = 38 neurons in the hidden layer. The H_1 parameter is limited by the memory available in the µC used, the ATMega-2560; however, this value is quite reasonable for most real-time applications in robotics and industrial automation. Another noticeable result is that the processing times are also reasonable. Analyzing the network without the training-time results (EM, BPM-1 and BPM-2), one iteration in batch mode with N = 4 samples takes 42.91 ms in the worst case, H_1 = 38. Including the training of the network in online mode, each iteration takes 69.88 ms, again in the worst case of H_1 = 38. This shows that the implementation presented here is indeed suitable for commercial applications in fields like industrial automation, robotics and the automotive industry, among others.
An important result of these tests is the behavior of the fitted curves in Figures 5 and 6: the processing time grows linearly with the number of neurons in the hidden layer (H_1). This result is quite significant, since it allows estimating whether a certain µC can be used with this implementation and whether the chosen network architecture will fit in that µC. It can also serve as a reference for groups that question the feasibility of MLP-BP applications on µCs.
We also compared the MSE curves obtained when varying the number of neurons in the hidden layer (H_1), as shown in Figure 7. With 2 neurons, around 20 iterations were enough to bring the MSE very close to zero, while with 38 neurons about 80 iterations were needed, according to Table 5.

Virtual Car-like Robot
The first validation test case showed small MSEs, close to 1%, with the architecture described in the methodology section: three inputs, two outputs and only five neurons in the hidden layer (H_1 = 5). Since it was trained with a maximum distance of 1.0 m, the robot behaves very well close to obstacles, but in big empty spaces, with more than 1.0 m between obstacles, the robot drifts and spins around the same spot until the test is restarted. This showed us that this training dataset was too simple and that more complex data was required. The test results for this dataset are shown in Figure 8.

The second dataset test showed much better results, with the virtual robot being able to react quickly to farther obstacles and to correct its trajectory faster. The robot was able to run along the map borders smoothly and to keep a safe distance between parallel obstacles at the center of the map. However, the virtual robot still collided with obstacles and would sometimes drag itself along the borders of bigger obstacles, as shown in Figure 9.
The dataset for the third validation case resulted in a higher MSE of 2.4%. The car-like robot was faster than in the previous two tests and was able to react quickly to abrupt changes in proximity to obstacles, as shown in Figure 10. However, with this dataset the robot showed a behavior that can be interpreted as overfitting, since it would sometimes react too quickly and start to spin around itself after detecting obstacles ahead.
The overall results show that the datasets did not contain enough data, which motivated further tests with more virtual sensors. We included the angle between the Front Sensor (FS) and the virtual horizontal axis of the test map. We also performed training tests with two and three hidden layers (L = 3, 4), with hundreds of hidden neurons and thousands of iterations, none of which improved the results; in fact, these configurations prevented the neural network from reaching a training error below 20%.
The binary code generated from the C source for the best dataset (the 3rd) occupied 3.07 KB, which amounts to 1.2% of the 256 KB of program memory of the ATMega-2560. This code size is quite compact compared to a similar model implementation seen in (8). The same variables and testing criteria used for the XOR operation were used in this case, with a 16 MHz clock frequency.
The timing results for the virtual car-like robot were as expected, being smaller than those of the XOR operation, whose architecture has fewer inputs and outputs (H_0 = 2 and H_2 = 1) but a wide range for H_1 (Table 5). Also, since this implementation does not perform online real-time embedded training, only the FFM-1, FFM-2 and FFM-3 modules were used, FFM-2 and FFM-3 being required for the cases with more than one hidden layer.
The timing results for the first and second datasets were the same. This is expected, since the only factor influencing the execution time is the size of the synaptic weight matrices, which depends only on the number of inputs, neurons and samples.
Since this was not a batch classification but a single-sample, online classification, the input size remained 1, unchanged throughout all three tests.
The third case also presented a fast timing result, albeit close to twice the timing of the first and second cases. This is expected and simply reiterates what was shown in Figures 5 and 6: in this implementation proposal, the execution time grows linearly with the hyperparameters, more specifically the number of neurons.
It is important to notice that the execution time results shown in Table 6 relate to what is found in the field's literature: more hidden layers do not necessarily improve the training MSE and can actually make it worse, as seen in (9). In the cases where we trained the robot with more than a single hidden layer, its ability to avoid obstacles did not improve and the best MSE was slightly above 20%.

Comparison with the State-of-the-Art
The work of McDanel et al. (9) shows two implementations of an MLP-BP with single and dual hidden layers and their respective execution times for a classification, with the results shown in Table 7. In (6), the authors were able to embed a full MNIST-10 classification model using CNNs in under 2 KB of SRAM, with inference times on the order of 640 ms per input sample. It is important to notice that the times presented in Table 7 regard only the inference time; these works applied optimizations before porting the synaptic weights and kernels to the µC.
It is possible to analyze how the proposal presented here would behave if it had to support the same hyperparameters as some of these state-of-the-art applications. The work presented by McDanel et al. (9) uses an MLP with a single hidden layer of 100 artificial neurons; the same work also presents an implementation with a two-layer MLP with 200 artificial neurons. Analyzing the t_FFM-1, t_FFM-2, t_BPM-1 and t_BPM-2 fitted curves from Table 5, we obtain four predictive equations that define how much time is required to process this MLP implementation.
The fitted equation for the first layer shows once more that this work's proposal has a linear relationship between inference time and hyperparameters, specifically the number of artificial neurons. Evaluating it for 100 neurons yields 1.18 s of feedforward time for the first layer.
This analysis shows that the modular MLP implementation presented here achieves training and inference times compatible with the state of the art.

Conclusions
This work presented an implementation proposal of an MLP artificial neural network with embedded BP training for an 8-bit µC. Results and implementation details were presented for this proposal on an ATMega-2560 µC, and the validation of the embedded proposal was shown using a HIL simulation strategy. Finally, the results show that the execution times and memory occupation of the implementation are compatible with application requirements seen in industry, given that these requirements fall in the range of hundreds of milliseconds.