Prediction Techniques on FPGA for Latency Reduction on Tactile Internet

Tactile Internet (TI) is a new internet paradigm that enables sending touch interaction information and other stimuli, leading to new human-to-machine applications. However, TI applications require very low latency between devices; the system's latency can result from the communication channel, the processing power of local devices, and the complexity of the data processing techniques, among other factors. Therefore, this work proposes using dedicated hardware based on reconfigurable computing to reduce the latency of prediction techniques applied to TI. We demonstrate that prediction techniques developed on a field-programmable gate array (FPGA) can minimize the impacts caused by delays and loss of information. To validate our proposal, we present a comparison between software and hardware implementations and analyze synthesis results regarding hardware area occupation, throughput, and power consumption. Furthermore, comparisons with state-of-the-art works are presented, showing a significant reduction in power consumption of ≈1300× and speedup rates of up to ≈52×.


Introduction
Tactile Internet (TI) enables the propagation of the touch sensation, video, audio, and text data through the Internet [1]. TI-based communication systems will provide solutions to more complex computational problems, such as human-to-machine (H2M) interactions in real time [2,3]. Therefore, TI is a new communication concept that allows transmitting skills through the Internet [4]. Several applications are available in the literature, such as virtual and augmented reality, industrial automation, games, and education [5]. Currently, the system's latency is a major bottleneck for TI applications; therefore, it is necessary to guarantee very low latency, as demonstrated in [5][6][7][8]. Studies indicate that the latency requirement of TI applications varies from 1 to 10 ms in most cases, or up to 40 ms in specific cases. High latency can result in many problems, as stated in [7], such as cybersickness [9,10]. Several works have investigated methods to minimize the problems associated with latency in TI applications, as presented in [1,[11][12][13][14]. The work in [15] provides a comprehensive survey of techniques designed to deal with latency and proposes prediction techniques as a solution to minimize the impacts caused by delays and loss of information. Thus, the system "hides the real network latency" by predicting the user's behavior; notably, the proposal does not reduce the latency but predicts the system behavior, thus enhancing the quality of the user experience.
Plenty of research areas, such as market, industry, stocks, health, and communication, have used forecasting techniques over the years [16][17][18][19][20][21][22]. However, these techniques are often implemented in software, increasing the latency in computer systems within tactile links due to the high computational complexity of the techniques and the large datasets to be processed.
Systems based on reconfigurable computing (RC), such as field-programmable gate arrays (FPGAs), have been proposed to overcome the processing speed limitations of complex prediction techniques [23]. FPGAs enable the deployment of dedicated hardware, enhancing the performance of computer systems within the tactile system. Moreover, FPGA-based systems proposed in the literature can reach a 1000× speedup compared to software-based ones [24][25][26][27][28].
Therefore, we propose the parallel implementation of linear and nonlinear prediction techniques applied to the TI on reconfigurable hardware, that is, on FPGA. Hence, the main contributions of this work are the following:
• Parallel implementation of prediction techniques on FPGA without additional embedded processors.
• A detailed description of the modules implemented for the linear and nonlinear regression techniques on FPGA.
• A synthesis-based analysis of the system's throughput, area occupation, and power consumption, using data from a robotic manipulator.
• An analysis of fixed-point precision against the floating-point precision used by software implementations.

Related Works
The use of RC for computationally complex algorithms is widely reported in the literature. Prediction techniques based on machine learning (ML), such as the multilayer perceptron (MLP), have been proposed to automatically assist the bandwidth allocation process on the server [29][30][31]. However, the presented systems are local and may not scale to more complex networks with higher traffic, since data from all communications are needed to perform the techniques' configuration and training steps. Therefore, linear prediction techniques have been proposed in [32,33] to avoid packet loss or errors.
Numerous works applied to TI are software-based implementations, such as cloud applications [34][35][36]. Usually, these software-based approaches are slower compared to hardware-based ones, thus affecting the data processing time of prediction techniques. As a result, some proposals were deployed on FPGA to increase the performance of manipulative tools [37][38][39][40], requiring accurate feedback [41][42][43][44].
Prediction techniques deployed on hardware, such as FPGAs, can reduce the latency in computer systems. In [45], an FPGA-based implementation of a quadratic regression prediction technique is proposed. In [46], a technique to detect epistasis based on logistic regression is implemented with an FPGA combined with a GPU, achieving between 1000× and 1600× speedup compared to software implementations. In [47], an implementation of a probabilistic predictor on FPGA is proposed. Ref. [23] presented hardware area occupation and processing time results for various configurations of radial basis function neural networks. Meanwhile, [48,49] demonstrate the feasibility of implementing deep learning (DL) algorithms using an RC-based platform.
Few studies explore linear regression applied to signal prediction on FPGAs or predictors applied in TI systems. However, there are proposals for machine learning (ML) techniques on FPGA. As an example, [50] proposes an MLP architecture for real-time wheezing identification from auscultated lung sounds. The MLP training step is performed offline, and its topology contains 2 inputs, 12 neurons in the hidden layer, and 2 neurons in the output layer (2-12-2). The architecture uses a 36-bit fixed-point implementation on an Artix-7 FPGA, achieving a sampling time of 8.63 ns and a throughput of 115.88 Msps.
The work presented in [51] uses an MLP on FPGA to perform the activity classification for a human activity recognition (HAR) system for smart military garments. The system has seven inputs, six neurons in the hidden layer, and five in the output layer (7-6-5). In addition, five versions of the architecture were implemented by varying the data precision. The analysis shows that the MLP designed with a 16-bit fixed point is more efficient concerning classification accuracy, resource utilization, and energy consumption, reaching a sampling time of 270 ns, using about 90% of the embedded multipliers, and achieving a throughput of 3.70 Msps.
Another MLP implemented on FPGA is proposed by [52] for real-time classification of gases with low latency. The MLP has 12 inputs, 3 neurons in the hidden layer, and 1 neuron in the output layer (12-3-1). In addition, the Levenberg-Marquardt backpropagation algorithm is used to perform offline training. The architecture was developed on Vivado using high-level synthesis (HLS) to optimize the development time and deployed on a Xilinx Zynq-7000 XC7Z010T-1CLG400 FPGA. Concerning the bit width, a 24-bit signed fixed-point representation was used for the trained weight data, with 20 bits in the fractional part. Meanwhile, a 16-bit representation (14 bits in the fractional part) was used to deploy the output layer using the TanH function. A processing time of 539.7 ns was achieved.
In [53], an MLP was implemented for automatic blue whale classification. The MLP had 12 inputs, 7 neurons in the hidden layer, and 3 in the output layer (12-7-3). The backpropagation algorithm was used for an offline training process. The trained weight data were deployed using fixed-point representation with a 24-bit maximum length. The output function adopted was the logistic sigmoid function. The architecture was developed on a Xilinx Virtex 6 XC6VLX240T and Artix-7 XC7A100T FPGAs, reaching a throughput of 27.89 Msps and 25.24 Msps, respectively.
Unlike the literature works discussed, we propose linear and nonlinear prediction techniques designed on hardware for TI applications to reduce the latency. The linear techniques proposed are predictions based on linear regression using the floating-point standard IEEE 754. In addition, four solutions for different ranges of the regression buffer are presented. Regarding the nonlinear techniques, an MLP-BP prediction technique is proposed, using fixed-point representation, performed with online training. The Phantom Omni dataset is used to validate the implementations and compare them to software versions implemented on Matlab.

Proposal Description
TI-based communication enables sending the sensation of touch through the Internet. The user, OP, interacts with a virtual environment or a physical tool, ENV, over the network. Figure 1 shows the general Tactile Internet system, with two devices interacting. The devices can be the most diverse, such as manipulators, virtual environments, and tactile or haptic gloves. The master device (MD) sends signals to the slave device (SD) during the forward flow. Meanwhile, the SD feeds the signals back to the MD on the backward flow.
Each master and slave device has its subsystem, computational system, responsible for data processing, control, robotics, and prediction algorithms at each side of the communication process. MCS and SCS are the identifications for the master and the slave device computational systems, respectively. The total execution time of each of these blocks can be given by the sum of the individual time of each algorithm, assuming they are sequential.
The model adopted in this work considers that several algorithms constitute the computational systems, and each of them increases the system's latency. Thus, the prediction process should be implemented in parallel with the other algorithms embedded in the MCS and SCS. This consideration aims to decouple the prediction techniques from the other algorithms, simplify the analysis, and improve performance. Figure 1 presents a model that uses prediction methods in parallel with the computational systems. The prediction modules, identified as MPD and SPD, have the same signal inputs as their respective computational systems, signals q(n) and c(n). In this project, the predictions are performed on Cartesian values. The MPD module predicts a vector called q̂(n) upon receiving the input vector; this prediction has a processing time of t_mpd. Similarly, the SPD module predicts the ĉ(n) vector on the slave side, with a prediction processing time of t_spd.
Figure 1. Block diagram illustrating the behavior of a generic Tactile Internet system that uses a parallel prediction method.

Prediction Methods
As shown in Figure 1, the modules responsible for the prediction system, called MPD and SPD, can be implemented in parallel with MCS and SCS computational systems. These prediction systems can execute nonlinear prediction methods (NLPM), linear prediction methods (LPM), or probabilistic prediction methods (PPM), as illustrated in Figure 2. We propose the implementation of linear regression and the multilayer perceptron with the backpropagation algorithm (MLP-BP).
As mentioned in the previous section, the system has two data streams, forward and backward, represented by the signal vectors c(n) and q(n). In this section, υ(n) represents the input samples, and υ̂(n) represents the predicted samples for these two vectors in both streams.
Each prediction module can implement different prediction methods that can be applied for both Cartesian and joint coordinates, as described in [54]. The implementations can replicate the same technique multiple times. A replication index, NI, can be used as a metric to define the hardware capacity to implement multiple techniques in parallel. The NI value may vary according to the degree of freedom of the virtual environment or robotic manipulator model.

Linear Regression
The linear regression prediction model uses a set of M past samples to infer the predicted data. It uses a set of observed pairs composed of the time marker, t_m, and the dependent variable, υ, that is, (t_m(1), υ(1)), (t_m(2), υ(2)), . . . , (t_m(M − 1), υ(M − 1)), (t_m(M), υ(M)). The regression is defined by Equation (1),

υ̂(n) = β̂_0(n) + β̂_1(n) t_m(n),    (1)

where υ̂(n) is the predicted value of υ(n), β̂_0(n) is the linear estimation coefficient (intercept), and β̂_1(n) is the coefficient of angular estimation (slope) for the same estimated sample. The parameter estimation process uses the principle of least squares [55]. Equations (2) and (3) indicate the coefficients,

β̂_0(n) = ῡ(n) − β̂_1(n) t̄_m(n),    (2)

β̂_1(n) = [Σ_{i=1}^{M} (t_m(i) − t̄_m(n)) (υ(i) − ῡ(n))] / [Σ_{i=1}^{M} (t_m(i) − t̄_m(n))²],    (3)

where ῡ(n) and t̄_m(n) are the average values of the sample variables υ and t_m.
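Under these definitions, the window-based prediction of Equations (1)-(3) can be sketched in a few lines of Python (an illustrative software model, not the FPGA datapath; the name `lr_predict` is ours):

```python
import numpy as np

def lr_predict(t, v, t_next):
    """Predict v at time t_next from M past (t, v) pairs via least squares.

    Implements Equations (1)-(3): beta1 is the slope, beta0 the intercept,
    and the prediction is v_hat = beta0 + beta1 * t_next.
    """
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    t_bar, v_bar = t.mean(), v.mean()
    beta1 = np.sum((t - t_bar) * (v - v_bar)) / np.sum((t - t_bar) ** 2)
    beta0 = v_bar - beta1 * t_bar
    return beta0 + beta1 * t_next

# Example: a perfectly linear signal is extrapolated exactly.
t = [0.0, 1.0, 2.0]      # M = 3 past time markers
v = [0.0, 0.5, 1.0]      # M = 3 past samples of the dependent variable
print(lr_predict(t, v, 3.0))  # -> 1.5
```

In the hardware version, the division in Equation (3) is replaced by empirically derived constants, as described in Section 4.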

Multilayer Perceptron Networks
Commonly, complex problems are solved with machine-learning-based solutions, such as artificial neural networks (ANN). The mathematical structure of the ANN is composed of processing units called artificial neurons. The neurons can operate in a parallel and distributed manner [56]. Hence, ANN solutions can exploit the high parallelism degree provided by FPGAs.

Architecture
Several applications based on neural networks use the MLP-BP architecture due to its ability to deal with nonlinearly separable problems [57]. Equation (4) represents the prediction function using the MLP technique, which uses B past samples of υ to generate the υ̂(n) value, as follows:

υ̂(n) = f(υ(n−1), υ(n−2), . . . , υ(n−B)),    (4)

where υ(n−1), υ(n−2), . . . , υ(n−B) are the input values of the MLP and υ̂ is the MLP predicted output. Equation (5) presents a generic MLP with L layers, where each k-th (k = 1, . . . , L) layer can have N_k neurons with N_{k−1} + 1 inputs, N_{k−1} being the number of neurons in the previous layer. The neurons from the k-th layer process their respective input and output signals through an activation function f_k(·). At the n-th sample, this function is given by

y_i^k(n) = f_k(x_i^k(n)),    (5)

where y_i^k(n) (i = 1, . . . , N_k) is the i-th neuron output in the k-th layer, and x_i^k(n) can be defined as

x_i^k(n) = Σ_{j=0}^{N_{k−1}} w_{ij}^k(n) y_j^{k−1}(n),    (6)

where w_{ij}^k(n) is the synaptic weight associated with the j-th input of the i-th neuron. Figure 3 illustrates the structure of an MLP ANN with L layers, and Figure 4 illustrates the i-th neuron in the k-th layer.
Figure 4. Structure of a neuron (perceptron) with N_{k−1} + 1 inputs.
The f_k(·) function was defined as the rectified linear unit (ReLU), according to Equation (7):

f_k(x) = max(0, x).    (7)

The backpropagation algorithm is the training algorithm used with the MLP.
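The forward pass of Equations (5)-(7) can be sketched in a few lines of Python (an illustrative floating-point model, not the fixed-point hardware datapath; the names `relu` and `mlp_forward` are ours):

```python
import numpy as np

def relu(x):
    """Equation (7): rectified linear unit."""
    return np.maximum(0.0, x)

def mlp_forward(inputs, weights):
    """Forward pass of an L-layer MLP (Equations (5) and (6)).

    `weights` is a list of (N_k, N_{k-1} + 1) matrices; the extra column
    holds the bias, matching the N_{k-1} + 1 inputs of each neuron.
    """
    y = np.asarray(inputs, dtype=float)
    for w in weights:
        x = w @ np.append(y, 1.0)   # x_i^k(n): weighted sum over the previous layer
        y = relu(x)                 # y_i^k(n) = f_k(x_i^k(n))
    return y

# Toy 2-2-1 topology (B = 2 past samples) with hand-picked weights.
w_hidden = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
w_out = np.array([[1.0, 1.0, 0.0]])
print(mlp_forward([0.5, 0.25], [w_hidden, w_out]))  # -> [0.75]
```

Note that, as in the proposed architecture, the ReLU is applied at the output layer as well.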

Backpropagation Training Algorithm
The weights are updated with the error gradient descent vector. At the n-th iteration, the i-th neuron error signal in the k-th layer is defined by

e_i^k(n) = d_i(n) − y_i^k(n) for the output layer (k = L), and e_i^k(n) = Σ_j δ_j^{k+1}(n) w_{ji}^{k+1}(n) for the hidden layers (k < L),    (8)

where d_i(n) is the desired value, and δ_j^{k+1}(n) is the local gradient for the j-th neuron in the (k + 1)-th layer at the n-th iteration. Equation (9) describes the local gradient,

δ_i^k(n) = e_i^k(n) f′(x_i^k(n)),    (9)

where f′(·) is the derivative of the activation function. The synaptic weights are updated according to the following:

w_{ij}^k(n + 1) = w_{ij}^k(n) + η δ_i^k(n) y_j^{k−1}(n) − α w_{ij}^k(n),    (10)

where η is the learning rate, α is the regularization or penalty term, and w_{ij}^k(n + 1) is the updated synaptic weight used in the next iteration.
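As a concrete instance of Equations (8)-(10), the following Python sketch performs one online weight update for an output neuron with the ReLU activation (an illustrative floating-point model, not the BPM circuit; `bp_update` and its argument names are ours):

```python
def relu_deriv(x):
    # Derivative of the ReLU in Equation (7): 1 for x > 0, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def bp_update(w, y_prev, x, y, d, eta, alpha):
    """One online weight update for an output neuron.

    e = d - y (Eq. (8)), delta = e * f'(x) (Eq. (9)), and each weight
    moves by eta * delta * input minus the penalty alpha * w (Eq. (10)).
    """
    delta = (d - y) * relu_deriv(x)
    return [w_j + eta * delta * y_j - alpha * w_j
            for w_j, y_j in zip(w, y_prev)]

# Example: x = 0.5, so y = relu(0.5) = 0.5; target d = 1.0.
# The single weight moves from 0.5 toward the target by eta * delta.
print(bp_update([0.5], [1.0], 0.5, 0.5, 1.0, eta=0.1, alpha=0.0))
```

For hidden layers, the error e would instead be the weighted sum of the next layer's local gradients, as in Equation (8).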

Implementation Description
We propose an architecture using the 32-bit floating-point (IEEE 754) format for the linear prediction technique; throughout this section, we use the notation [F32] for it. For the proposed MLP prediction technique, we designed an architecture with a fixed-point format (varying the bit width). We use the notation [sT.W] to represent the fixed-point values, where s represents the sign with 1 bit, T is the total number of bits, and W is the number of bits in the fractional part. Therefore, the integer part of signed variables is T − W − 1 bits long, while for unsigned variables it is T − W bits.
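The [sT.W] quantization can be illustrated with a short Python helper (a behavioral sketch of the rounding and saturation a hardware register would perform, not the actual circuit; the name `to_fixed` is ours):

```python
def to_fixed(value, T, W, signed=True):
    """Quantize `value` to the [sT.W] format: T total bits, W fractional bits.

    The integer part gets T - W - 1 bits when signed (1 bit for the sign)
    and T - W bits otherwise; out-of-range values saturate.
    """
    scale = 1 << W                                 # 2**W steps per unit
    lo = -(1 << (T - 1)) if signed else 0          # smallest raw code
    hi = (1 << (T - 1)) - 1 if signed else (1 << T) - 1  # largest raw code
    raw = max(lo, min(hi, round(value * scale)))   # round, then saturate
    return raw / scale

print(to_fixed(0.1, 16, 12))    # 0.1 rounded to the nearest 2**-12 step
print(to_fixed(100.0, 16, 12))  # saturates at the largest [s16.12] value
```

For example, in the [s16.12] format used later in the synthesis analysis, 0.1 becomes 410/4096 ≈ 0.10010, and values above ≈8 saturate.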

Linear Regression
The hardware architecture implemented for the linear prediction technique based on linear regression follows Equations (1)-(3). All circuits in the structure use 32-bit floating-point precision.
The circuit shown in Figure 5 executes Equation (1). As can be observed, the circuit is composed of one multiplier and one adder. There are three input values (t_m[F32](n), β_0[F32](n), and β_1[F32](n)) and one output (υ̂[F32](n)). To perform Equation (2), we use one multiplier and one subtractor, as shown in Figure 6. The circuit shown in Figure 7 performs Equation (3). As can be seen, the circuit is composed of two multipliers, one subtractor, one cascading sum module (CS), and two constant values (C). The constant values, C, were obtained empirically to simplify the division process in Equation (3). The circuit has two input values (t_m[F32](n) and υ[F32](n)) and one output value (β_1[F32](n)).
The cascading sum (CS) module shown in Figure 7 is implemented by the generic circuits shown in Figure 8. The cascading sum is also used as an input to calculate the mean values of t_m[F32](n) and υ[F32](n), as shown by the circuit illustrated in Figure 9.

Multilayer Perceptron
The main modules that perform the multilayer perceptron with the backpropagation training (MLP-BP) and the multilayer perceptron with recurrent output (RMLP-BP) are shown in Figures 10 and 11, respectively. The hardware structures are similar. The main difference between them is that the first input signal of the RMLP-BP is a feedback of the output signal. As can be observed, there are two main modules called multilayer perceptron module (MLPM) and backpropagation module (BPM). Both modules implement the variables in fixed-point format.
The MLPM module for the MLP-BP proposal (Figure 10) has B inputs from previous instants of the υ variable. In the MLPM for the RMLP-BP proposal (Figure 11), the first input is the delayed output signal. The hidden layers of the network also use the structure described. As mentioned in Section 3.2.1, the output layer uses the ReLU activation function, and Figure 13 shows its hardware implementation. The signal x_i^k[sT.W](n) is the input of the nonlinear function described in Equation (7). The linear combination of the weights and the hidden layer outputs provides the neural network output.

Backpropagation Module (BPM)
The BPM computes the error gradient and updates the neurons' weights. The error gradient, e[sT.W](n), described in Equations (8) and (9), is computed by the circuits shown in Figure 14.
The circuit shown in Figure 15 calculates the MLP neurons' weights, as previously described in Equation (10). Table 1 summarizes the value used for each parameter in the MLP-BP and RMLP-BP hardware implementations. It is essential to mention that the training parameters were empirically defined.

Synthesis Results
This section presents synthesis results for the linear and nonlinear prediction techniques. Three key metrics are analyzed: area occupation, throughput, and power consumption. In this work, the throughput (R_s) has a 1:1 ratio with the clock frequency (MHz). All synthesis results analyzed here use a Xilinx Virtex-6 xc6vlx240t-1ff1156 FPGA, with 301,440 registers, 150,720 six-input look-up tables (LUTs), and 768 digital signal processing (DSP) blocks that can be used as multipliers.
Firstly, we carried out analyses for the linear regression technique, varying the M value from 1 to 3, 6, and 9, implemented in a 32-bit floating-point format. Secondly, we present the synthesis analysis values for the MLP-BP using signed fixed-point configurations with the following bit widths: 18.14, 16.12, and 14.10. Finally, we also provide an analysis by increasing the number of implementations (NI) in parallel from 1 to 3 and 6, thus increasing the number of variables processed in parallel.

Linear Prediction Techniques

Tables 2-4 show the synthesis results for the linear regression prediction technique with 1, 3, and 6 parallel implementations, respectively. The first column of each table highlights the M value. The second to seventh columns present the area occupation on the FPGA. The second and third display the number of registers/flip-flops (NR) and their percentage (PNR), and the fourth and fifth, the number of LUTs (NLUT) and their percentage (PNLUT). Finally, the sixth and seventh indicate the number of multipliers (NMULT) and their percentage (PNMULT). The last two columns show the processing time, t_s, in nanoseconds (ns), and the throughput, R_s, in mega-samples per second (Msps).

To demonstrate the linear behavior of our hardware proposal, we provide a linear regression model for Table 4. Figures 16-18 show the NR, NLUT, and R_s results. It is essential to mention that linear regression models return a coefficient of determination called R². The R² rate represents the quality of the linear regression model, i.e., it demonstrates the obtained data variance. Commonly, R² is expressed on a scale from 0% to 100% (or from 0 to 1 for normalized values). Concerning the NR, the plane f_NR(NI, M) can be described by

f_NR(NI, M) ≈ −1439 + 510.7 × NI + 309.5 × M,

with a coefficient of determination of R² = 0.8553. A similar plane, f_NLUT(NI, M), fits the NLUT results with R² = 0.8863. Finally, the plane f_Rs(NI, M), which presents the throughput in Msps, fits with R² = 0.8372. According to the t_s results presented in Tables 2-4 and Figure 18, a significant reduction in throughput is noticeable as M increases. Increasing the number of circuits in the cascading sum (CS) submodule results in a longer critical path and, thus, a larger sampling time (t_s). However, the throughput increases proportionally to NI for a fixed value of M.
There is a linear increase in the number of resources used as M and NI grow. As presented in Table 4, for NI = 6 and M = 9, 46% of the LUTs are occupied. On the other hand, for smaller values, such as M = 3 and NI = 6, the LUT occupation is 21.53%. Additionally, it is possible to increase the NI using the remaining resources; however, there is no guarantee that large throughput losses will not occur.
Therefore, it is relevant to mention that the parallel FPGA implementations of the linear regression can achieve high throughput, as required in the TI scenario. On the other hand, these implementations result in high hardware area occupation. Considering that TI is still under development, high processing speed and intelligent use of resources are crucial.

Nonlinear Prediction Techniques
Commonly, MLP-based implementations use the hyperbolic tangent function. However, using this function resulted in a 28% occupation of the FPGA memory primitives for an MLP with four inputs, four neurons in the hidden layer, and one neuron in the output layer (with NI = 1). For NI = 6, it could occupy ≈68% of the memory primitives, making the tanh function unfeasible due to its high hardware implementation cost. The activation function used in this work is the ReLU, described in Equation (7), since its hardware implementation does not require memory primitives. Tables 5 and 6 show the hardware area occupation and throughput results for the MLP-BP and RMLP-BP nonlinear prediction techniques. The analyses for both techniques use a Virtex-6 FPGA. As presented in the first columns (T.W), they are implemented for different signed fixed-point bit widths. The results displayed in Tables 5 and 6 make it possible to plot surfaces demonstrating the hardware behavior concerning area occupation and throughput. Figures 19 and 20 present the relationship between NI and the number of bits in the fractional part (W) with the number of registers (NR) for the MLP and RMLP, respectively.
The f_NR(NI, W) planes fit the register results with R² = 1 for the MLP and R² = 0.9835 for the RMLP. The corresponding NLUT planes fit with R² = 0.9935 and R² = 0.9899, respectively, and the throughput planes of Equation (18) fit with R² = 1.

Regarding the throughput (R_s) presented in Tables 5 and 6, it is observable that R_s does not vary significantly for a fixed NI and a varying bit width (T.W). For a fixed bit width (T.W) and a varying NI, the throughput values increase linearly with the NI value. It is also worth mentioning that the t_s value has a low variance because the MLP and BP structures adapt well to parallelism. Hence, the circuit provides good scalability without considerable performance losses. Compared to the linear regression discussed in Section 5.1, the MLP shows better flexibility.
The area occupation decreases as the bit width (T.W) and NI parameters decrease, since reducing these parameters also reduces the circuits needed to store or process data. The multipliers (NMULT) are the most used resource, reaching up to ≈42% occupation when NI = 6. In addition, the MLP and RMLP result in similar hardware area occupation, using less than 43%, 27%, and 2% of the multipliers, LUTs, and registers, respectively. Given that, for the current design and chosen FPGA, the maximum feasible NI value would be 9 or 10, with the throughput remaining close to the current range. Nevertheless, this analysis used only the Virtex-6 DSPs; the available LUTs can also implement multipliers, permitting an increase in the parallelization degree and throughput.
We also performed the synthesis for the MLP and BP algorithms separately to verify the hardware impact of each one. Table 7 presents an MLP-only implementation, while Table 8 presents a BP-only implementation. Given that most works in the literature do not implement the BP or any training algorithm on hardware, we provide a complete analysis of the modules implemented separately. The MLP, for NI = 6, occupies only 3.82% and 19.53% of the LUTs and multipliers (PNMULT), respectively, and achieves a throughput of ≈188 Msps. Hence, the low resource usage shows that our approach provides good scalability and high performance for applications that do not require online training and only use the MLP module. The synthesis results show that the hardware proposal occupies a small hardware area: the MLP uses less than 20% of the multipliers and 4% of the LUTs, while the BP occupies less than 4% of the multipliers and LUTs and reaches more than 39 Msps. Thus, it is possible to increase the architecture's parallelization degree using the unused resources, consequently enabling the acceleration of several applications that rely on massive data processing [58]. In addition, the unused resources can also be used for robotic manipulators with more degrees of freedom and other tools [59]. The low hardware area occupation also shows that smaller, low-cost, and low-power FPGAs can fit our approach for IoT and M2M applications [60].
Therefore, for the linear regression and the nonlinear MLP-BP implementations, the throughput results reached values of up to ≈98 Msps. These values make it possible to use these solutions in problems with critical requirements, such as TI applications [9,10,29-31]. Figures 19-24 show that the MLP and RMLP techniques have similar results for NR, NLUT, and R_s. The similarity observed is expected because the RMLP architecture is identical to the MLP except for the input υ̂[sT.W](n), which is delayed by one sample time t_s. Therefore, the following sections will focus only on the MLP and MLP-BP results, as they provide better scalability for increasing the NI.

Validation Results
This work uses bit-precision simulation tests to validate the proposed hardware designs for the prediction techniques described in the previous section. The bit-precision simulation is driven by a dynamic nonlinear system characterized by a robotic manipulator with six degrees of freedom (DOF), i.e., rotational joints, called Phantom Omni [61-64]. Nonetheless, only the first three joints are active [64]. Therefore, the Phantom Omni can be modeled as a three-DOF robotic manipulator with two segments (L_1 and L_2) interconnected by three rotary joints (θ_1, θ_2, and θ_3), as shown in Figure 25. Based on the description provided by [63], the Phantom Omni parameters in the simulations carried out were defined as follows: L_1 = 0.135 mm; L_2 = L_1; L_3 = 0.025 mm; and L_4 = L_1 + A, for A = 0.035 mm. In addition, the dynamics of the Phantom Omni can be described by nonlinear, second-order, ordinary differential equations, as follows:

M(θ(t)) θ̈(t) + C(θ(t), θ̇(t)) θ̇(t) + g(θ(t)) + f(θ̇(t)) = τ(t),

where θ(t) = [θ_1(t), θ_2(t), θ_3(t)]^T is the vector of joints, τ(t) is the vector of acting torques, M(θ(t)) ∈ R^{3×3} is the inertia matrix, C(θ(t), θ̇(t)) ∈ R^{3×3} is the Coriolis and centrifugal forces matrix, g(θ(t)) ∈ R^{3×1} represents the gravity force acting on the joints, θ(t), and f(θ̇(t)) is the friction force on the joints [61-64]. Figure 26 shows the angular position of each joint of the three-DOF Phantom Omni robotic manipulator, that is, θ_1, θ_2, and θ_3. It is possible to observe the trajectory of each joint concerning its angular position as a function of the number of samples received. The mean square error (MSE) between the actual and predicted data is used to define the reliability of the results generated by the proposal and can be defined as

E_qm(X) = (1/N_s) Σ_{i=1}^{N_s} (X̂(i) − X(i))²,

where E_qm(X) is the value of the mean square error, N_s is the number of samples, X̂(i) is the i-th sample estimated value, and X(i) is the i-th sample current value.
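The MSE metric above reduces to a few lines of Python (illustrative only; the name `mse` is ours):

```python
import numpy as np

def mse(estimated, actual):
    """Mean square error between the estimated and actual samples, i.e.,
    the validation metric used to compare hardware and software outputs."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((estimated - actual) ** 2))

print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.5]))  # ≈ 0.1667
```

In the validation below, `estimated` holds the hardware (fixed- or single-precision) outputs and `actual` holds the double-precision Matlab reference.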
The following subsections present the validation results for the implemented linear and nonlinear prediction techniques.

Linear Prediction Techniques
We compared the θ_1(n) signal generated by our proposed FPGA architecture with the one from a Matlab implementation of the linear prediction techniques. Figures 27-30 show the results. We developed a Matlab version using double-precision floating point, whereas our hardware design uses single-precision floating point. As can be observed, the results of the hardware implementation are similar to those of the Matlab version, despite reducing the hardware bit width by half.

Table 9 and Figure 31 present the MSE between the software (64-bit floating-point, based on IEEE 754) and hardware (32-bit floating-point) implementations of the LR prediction techniques, using N_s = 4000 data samples, 80 frames, and 50 samples per frame. As can be observed, the two implementations are equivalent, i.e., the MSE is significantly small.
Table 9. Mean square error (MSE) between the software implementation and the proposed hardware implementation for the LR technique.

Afterwards, we performed an MSE analysis by varying the hardware bit width from 18.14 to 16.12 and 14.10. The analysis was carried out for N_s = 4000 data samples, 80 frames, and 50 samples per frame. Figure 35 and Table 10 show the resultant MSE. As can be observed, similarly to the linear prediction techniques, the MSE between the software and hardware versions is also small for the nonlinear techniques. The proposed hardware implementations have a response similar to that of the double-precision (64-bit) software implementation, even when using fixed point with fewer bits, such as 14.10. Furthermore, fewer bits may allow the implementation of the proposed method on hardware with limited resources; thus, the number of resources available could define the number of bits used to implement a technique. After analyzing the MSE, it is possible to see that both linear and nonlinear techniques perform well in the current test scenario.
However, as previously mentioned, linear-regression-based techniques may not be the most suitable for the TI landscape due to the scalability issues discussed in Section 5.1. Hence, the following section focuses on the MLP-BP results.

Comparison with State-of-the-Art Works
In this section, a comparison with state-of-the-art works is carried out for the following key hardware metrics: throughput, area occupation, and energy consumption. The implementations presented were developed on the Virtex-6 FPGA with T.W = 14.10 bits. Table 11 shows the MLP processing speed and throughput for our work and for other works in the literature. The columns present the number of implementations (NI), the fixed-point data precision (T.W), the MLP and MLP-BP processing speeds, and the throughput in Msps.

Throughput Comparison
The work proposed in [50] is an MLP with a 12-12-2 topology (twelve inputs, twelve neurons in the hidden layer, and two neurons in the output layer) deployed with a 24-bit fixed-point format. The MLP training is offline, and it reaches a throughput of 113.135 Msps and 115.875 Msps for the Virtex 6 XC6VLX240T and the Artix-7 XC7A100T FPGAs, respectively. The high performance achieved is due to the pipeline used in their hardware design, which reduces the system's critical path and increases the maximum frequency. Unlike [50], our proposal uses online training, for which a pipeline-based architecture is not feasible: the chain of delays intrinsic to pipelining can reduce the samples' accuracy during online training. Nevertheless, the throughput of our architecture can improve as the number of implementations grows, increasing the number of samples processed per second without impacting its maximum clock.
The design proposed in [51] implements a 7-6-5 MLP with offline training on the Artix-7 35T FPGA. It achieves a throughput of 3.7 Msps; the number of clock cycles required to obtain a valid output reduces the throughput compared to other works. Meanwhile, the work presented in [52] proposes a 12-3-1 MLP on a Zynq-7000, also with offline training, capable of reaching a maximum throughput of 1.85 Msps. This low throughput (compared to other works) may be related to the use of high-level synthesis (HLS), which usually results in a non-optimized implementation. The architecture presented in [53] is a 12-7-3 MLP with a 24-bit fixed-point data format and offline training; the maximum throughput achieved was 27.89 Msps and 25.24 Msps for the Virtex 6 XC6VLX240T and Artix-7 XC7A100T FPGA implementations, respectively.

Table 12 presents a speedup analysis performed for all works listed in Table 11. The first column presents the NI of our architecture, while the second to seventh columns refer to the literature works compared with ours.
The speedup is defined as Speedup = Throughput_work / Throughput_ref, where Throughput_work represents the throughput of our proposal and Throughput_ref represents the throughput of the literature reference.
The results were obtained only for the MLP-BP implementation. As shown in Table 12, the implementation proposed in [50] achieves a higher throughput than ours. However, our proposal offers good scalability: increasing the NI enables higher throughput, reducing this difference even with online training embedded in the platform. Moreover, our approach reached a higher throughput than the other works, with speedup rates of up to 52×.
In addition, it is vital to mention that a higher clock frequency in MHz does not imply a higher throughput; rather, throughput is commonly related to the degree of parallelism. For example, the MLPs in [51,52] have the lowest throughput despite high clock frequencies (Table 11). In these cases, the speedup was up to 26× and 52× for [51] and [52], respectively.
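The speedup figures follow directly from the throughput ratio defined above. The sketch below reproduces them from the reference throughputs quoted in the text; our design's throughput is not stated explicitly in this section, so it is inferred here from the quoted 52× speedup over [52] and should be read as illustrative, not as a measured figure.

```python
# Speedup = Throughput_work / Throughput_ref.
# Reference throughputs (Msps) as quoted in the text.
refs_msps = {"[50]": 113.135, "[51]": 3.7, "[52]": 1.85, "[53]": 27.89}

# Inferred, not measured: 52x over the 1.85 Msps of [52] -> ~96.2 Msps.
throughput_work = 52 * 1.85

speedups = {ref: throughput_work / t for ref, t in refs_msps.items()}
for ref, s in sorted(speedups.items()):
    print(f"{ref}: {s:.2f}x")
```

Note that the ratio against [50] comes out below 1, consistent with the observation that the pipelined offline-training design in [50] still holds the throughput lead for small NI.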
In [53], the throughput is 27.89 and 25.24 Msps for an MLP with offline training and NI = 1. Meanwhile, even implementing the training algorithm in hardware, our work achieves speedup rates of up to 3×.
In [50], a pipeline scheme reduces the system's critical path and increases the throughput. However, it does not provide online training, which could reduce its performance in changing scenarios. Meanwhile, our proposed architecture provides online training, adapting to different scenarios. In addition, a pipelined scheme would not be feasible in our case, since the samples have a temporal dependence.

Hardware Area Occupation
The area occupation comparison was based on a hardware occupation ratio defined as R_occupation = N_hardware^work / N_hardware^ref.
The superscripts work and ref represent the resource information regarding our work and the compared work, respectively, while N_hardware represents the primitives, such as the number of LUTs, registers, multipliers, or block random access memories (BRAMs).

Table 13 shows the area occupation for our work and for works in the literature. The second and third columns present the NI and the fixed-point data precision (T.W). From the fourth to the seventh columns, we present the number of LUTs (NLUT), registers (NR), multipliers (NMULT), and BRAMs (NBRAM). The work presented in [51] uses an Artix-7 35T FPGA, occupying 3466 LUTs, 569 registers, and 81 multipliers. The proposal shown in [52] uses 4032 LUTs, 2863 registers, 28 multipliers, and 2 BRAMs. The architecture proposed in [53] was implemented on two FPGAs using the sigmoid activation function, occupying 21,322 LUTs, 13,546 registers, 219 multipliers, and 2 BRAMs on the Virtex 6 XC6VLX240T, and 21,658 LUTs, 13,330 registers, 219 multipliers, and 2 BRAMs on the Artix-7 XC7A100T.

Tables 14-17 present the hardware ratio, R_occupation, for our proposed architecture. As shown in Tables 14-17, our proposal uses online training and implements up to six replicas of the same technique in parallel. In most cases, it requires fewer resources, evidencing efficient use of the hardware. For NI = 1, our proposal maintains a clear advantage over the other proposals, except against the works presented in [51,52], which have low throughput (see Table 11). For NI = 6, the present work consumes more hardware resources than the other works; however, this is a deliberate strategy to increase the throughput of the proposal.
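A minimal numerical sketch of this ratio, taking the [53] Virtex 6 counts quoted above as the reference and hypothetical counts for our design (the actual per-NI figures are in Tables 14-17 of the paper):

```python
def occupation_ratio(work, ref):
    """R_occupation per primitive type: N_hardware^work / N_hardware^ref.
    Values below 1 mean the design under test uses fewer resources."""
    return {k: work[k] / ref[k] for k in ref if ref[k] > 0}

# Reference: [53] on the Virtex 6 XC6VLX240T, as quoted in the text.
ref_53 = {"LUT": 21322, "REG": 13546, "MULT": 219, "BRAM": 2}

# Hypothetical NI = 1 counts for illustration only (not the paper's figures).
work = {"LUT": 5000, "REG": 3000, "MULT": 60, "BRAM": 0}

print(occupation_ratio(work, ref_53))
```

A BRAM entry of 0.0 reflects the ReLU-based design discussed next, which occupies no block RAM at all.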
Furthermore, unlike the other proposals, our design does not occupy any BRAMs because we use the ReLU activation function, improving the design's scalability for flexible implementation in different scenarios, such as TI systems with more degrees of freedom (e.g., six or nine DOFs).
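The BRAM saving follows from the shape of the two activation functions: the sigmoid is typically approximated in hardware with a precomputed table (which maps to BRAM on an FPGA), whereas ReLU is a single sign check, i.e., a comparator and a multiplexer. A behavioral sketch of the contrast (the table size and indexing scheme are illustrative, not the paper's design):

```python
def relu_fixed(x_q):
    """ReLU on a signed fixed-point sample: clamp negatives to zero.
    In hardware this is just a comparator and a multiplexer - no memory."""
    return x_q if x_q > 0 else 0

def sigmoid_lut(x_q, table):
    """Sigmoid via a precomputed table - the kind of structure that
    occupies a BRAM on an FPGA (indexing here is illustrative)."""
    idx = max(0, min(len(table) - 1, x_q + (len(table) >> 1)))
    return table[idx]

print(relu_fixed(-37), relu_fixed(42))  # 0 42
```

Because `relu_fixed` needs no stored table, an arbitrary number of parallel replicas can be instantiated without touching the FPGA's block RAM budget, which is the scalability argument made above.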

Dynamic Power Consumption
Dynamic power is the primary factor in a digital circuit's energy consumption. It can be expressed as

P_d ∝ N_g × F_clk × V_DD²,

where N_g is the number of elements (or gates), F_clk is the maximum clock frequency, and V_DD is the supply voltage. Given that the operating frequency of CMOS circuits is proportional to the voltage [65], the dynamic power can also be described as

P_d ∝ N_g × F_clk³.

The number of elements, N_g, can be defined by the FPGA primitives used to deploy the architecture, i.e., N_g = NLUT + NR + NMULT. Tables 18 and 19 present the operating frequency and the dynamic power analysis results regarding N_g. Concerning the dynamic power, we present the reduction rate, S_d, achieved by our proposal according to the following:

S_d = (N_g^ref × (F_clk^ref)³) / (N_g^work × (F_clk^work)³),

where N_g^ref and F_clk^ref are the number of elements and the maximum clock frequency of the work we are comparing against, while N_g^work and F_clk^work are those of our work.

Unlike the works in the literature, our hardware proposal uses a fully parallel layout, requiring a single clock cycle per sample; therefore, the maximum clock frequency is equivalent to the throughput, F_clk^work ≡ R_s. We assume that all proposals operate at the maximum frequency that the platform can reach. Thus, for NI = 1, our design reduced the power consumption by more than 1200× compared to the design proposed in [50]. Overall, our proposal reduced the power consumption compared to the other works in most scenarios. Therefore, IoT projects that require low power consumption can use our method without affecting their performance.
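A worked example of the S_d calculation, using the proportionality above with illustrative numbers only (the actual N_g and F_clk values are in Tables 18 and 19):

```python
def power_reduction(n_ref, f_ref, n_work, f_work):
    """S_d = (N_g^ref * (F_clk^ref)^3) / (N_g^work * (F_clk^work)^3),
    using P_d proportional to N_g * F_clk^3 (F_clk proportional to V_DD)."""
    return (n_ref * f_ref**3) / (n_work * f_work**3)

# Illustrative figures: a pipelined reference clocked at 300 MHz with
# 32,500 primitives vs a fully parallel design whose clock equals its
# sample rate, ~30 MHz, with 25,000 primitives. Not the paper's numbers.
s_d = power_reduction(n_ref=32_500, f_ref=300.0, n_work=25_000, f_work=30.0)
print(round(s_d))  # 1300
```

The cubic dependence on F_clk is what drives the large reduction: a fully parallel design can afford a clock an order of magnitude lower than a deeply pipelined one while sustaining the same sample rate, and that factor of ten enters the power ratio cubed.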
For NI = 6, we observe a power consumption similar to that of [53], owing to their proposal's low clock frequency and lack of online training.
Reducing the use of BRAMs to zero is a highlight of this work, made possible by the implementation of the ReLU function. Unlike proposals that use functions such as the sigmoid, this strategy provides an advantage in terms of scalability: the design can be scaled to various scenarios without consuming BRAMs. Moreover, the fully parallel computing strategy proposed here spends no clock cycles accessing BRAMs, which can increase throughput and decrease power consumption.

Conclusions
This work introduced a method for implementing prediction techniques in parallel to reduce the latency of TI systems using FPGA, thus enabling local devices to be used in conjunction with haptic devices. The hardware-based method minimized the data processing time of linear and nonlinear prediction techniques, showing that reconfigurable computing is feasible for solving complex TI problems.
We presented all the implementation details and the synthesis results for different bit-width resolutions and three different numbers of parallel implementations (one, three, and six). The proposal was validated with a three-DOF Phantom Omni robotic manipulator and evaluated regarding hardware area occupation, throughput, and dynamic power consumption. Furthermore, comparisons with state-of-the-art works were presented.
Comparisons demonstrate that the fully parallel approach adopted for the linear regression and nonlinear prediction techniques can achieve high processing speed. However, linear regression techniques have low scalability and may not be a good path for the TI area. The nonlinear prediction techniques achieve speedup rates of up to ≈52× while also reducing power consumption by ≈1300×. Furthermore, despite the high degree of parallelism, the proposed approach offers good scalability, indicating that the present work can be used in TI systems, especially with the nonlinear prediction techniques.