Comparison of FPGA and Microcontroller Implementations of an Innovative Method for Error Magnitude Evaluation in Reed–Solomon Codes

Abstract: Reed–Solomon (RS) codes are among the most widely used solutions for error correction in data communications. RS decoders are composed of several blocks and, among them, many efforts have been made to optimize the error magnitude evaluation module. This paper assesses the performance of an innovative algorithm introduced in the literature by Lu et al. under different system configurations and hardware platforms. Several configurations of the encoded message, chosen among those typically used in different applications, have been designed to run on an FPGA (field programmable gate array) device and on an MCU (microcontroller unit). The performances have been evaluated in terms of resource usage and output delay for the FPGA and in terms of code execution time for the MCU. As a benchmark, the well-established Forney's method has been implemented in the same configurations and on the same hardware platforms for a proper comparison. The results show that the theoretical findings are fully confirmed only in the MCU implementation, while on the FPGA, the choice of one method over the other depends on the optimization target (i.e., time or area) preferred in the specific application.


Introduction
Information exchange is a significant aspect pervading all modern systems, from miniaturized wireless earphones to heavyweight space satellites. Considerable research work has been done to improve the efficiency of communication processes [1,2] and, among all the introduced techniques, error management is of utmost importance.
The most basic error management scheme involves error detection and message retransmission [3]: if the message is corrupted by random or burst errors, the receiver is configured to require subsequent retransmissions of the same message until the error is no longer present. A clear downside of this approach is the increase in the number of messages in the communication channel, which may lead to band saturation and limitations in high-speed applications.
One solution is the introduction of error correction features [4]. With this approach, the receiver is able to detect the error and, under certain conditions, to correct the message. At the price of an increase in the receiver complexity and in the messages' length, the channel traffic is reduced, with a positive effect on the communication speed [5].
Considerable research work has been done on this topic [6,7], and a variety of systems based on different algorithms have been proposed. Among them, Hamming error correction codes (ECCs) were one of the first introduced, in 1950, to correct errors in punched card readers [8].
In an RS system, the maximum number of detectable and correctable errors in the message is defined as follows: t = ⌊(n − k)/2⌋, (1) where n is the code-word length and k is the number of information symbols.
The architecture of a traditional RS decoder is shown in Figure 2. The decoder is typically composed of five subsystems [30,31]: syndrome calculation, error polynomial calculation, error position extraction, error magnitude evaluation, and error compensation. Once a message is received by a Reed-Solomon decoder, the first step is to divide the received polynomial by the generator polynomial chosen for encoding: the remainders of this division are known as syndromes, and they do not depend on the transmitted code word, but only on the errors. The syndrome calculation block computes the 2t syndromes contained in a Reed-Solomon code word, usually exploiting Horner's method [32]. The next step is to introduce the error locator polynomial, which contains the information about the location of the errors and their magnitudes. Two methods are widely used for the error polynomial calculation: the Euclidean algorithm [33] and Berlekamp's algorithm [34]. Once the coefficients of the error locator polynomial are obtained, the error position block identifies the corrupted symbols by means of the Chien search algorithm [35], while the error magnitude block computes the error values. Finally, the error compensation block uses this information to fix the errors. In traditional RS decoders, as depicted in Figure 2, the whole process typically takes several clock cycles: each block contributes to this delay, and the optimization of a single block can improve the performance of the entire system.
In classical RS decoders, the search for the error positions is carried out in parallel with the error magnitude estimation and, within the blocks, a serial strategy is adopted. This implies a number of clock cycles equal to n for the error magnitude and position evaluation. Recently, many efforts have been made to speed up the whole process, exploiting parallel Chien search strategies [36,37]. Combining these methods with parallel algorithms for error magnitude extraction can potentially reduce the required number of clock cycles to ⌈n/p⌉, where p is the Chien search parallelism, plus one clock cycle for the subsequent error magnitude evaluation. In this context, the analysis of the parallelism capability and of the input-output time of the available error magnitude strategies can be very useful to identify the preferred method to optimize the whole process. For this reason, in the following, we will focus only on the error magnitude subsystem.
There are many approaches to compute error magnitudes in the literature [23-25] and, among these, the one with the most promising performance is the one introduced by Lu et al. [25]. The error magnitude subsystem can be schematically represented as in Figure 3. To compute the error magnitudes δ_k, the subsystem takes as inputs the positions of the errors α^(l_k) in the code word; the error locator polynomial coefficients σ_k, computed in the previous error polynomial calculation phase; and the syndrome polynomial coefficients S_k, computed in the syndrome calculation phase.
The RS error magnitude evaluation performed with a standard technique (the Forney method) will be compared with the alternative method introduced by Lu et al. Table 1 shows the theoretical number of required Galois additions (N_add), multiplications (N_mul), and inversions (N_inv) when the message includes a number ν of errors.

In Table 2, a numerical example of the equations of Table 1 with ν = 3 is reported.

In the following, for the sake of clarity, the two methods are briefly summarized and, as an example, an implementation in terms of addition, multiplication, and inversion blocks in the case of ν = 3 errors is presented.

Forney Method
The Forney method relies on the Forney equation: δ_k = Z_0(α^(−l_k)) / σ'(α^(−l_k)), (2) where ν is the number of errors actually present in the code word (ν ≤ t), Z_0 is the error evaluator polynomial, and σ' is the formal derivative of the error locator polynomial, whose evaluation involves only the odd-degree terms of σ up to ⌊ν⌋_odd, the largest odd number less than or equal to ν. In Figure 4, an example of an implementation of Equation (2) is presented. For simplicity, the Z_0(α^(−l_k)) and σ'(α^(−l_k)) notations are replaced by Z_0k and σ_k, respectively. In this case, the system is configured to perform the error magnitude evaluation of up to three errors (ν = 3). Inside the system, the two operands of Equation (2) (Z_0k and σ_k) can be identified, and the number of operations required to produce the result is found to follow the equations of Table 1: 3 inversions (plus 3 required to generate the inverses of the α coefficients), 12 additions, and 18 multiplications.

Lu Method
The method presented by Lu et al. [25] performs the error magnitude computation with less computational effort with respect to the Forney method, as shown in Table 1. It is composed of three main phases, each performing different operations: a preprocessing phase, a syndrome refining phase, and an error magnitude extraction phase. In the preprocessing phase, partial results P_i,j and Q_i,j are introduced for convenience (their explicit definitions can be found in [25]). In the subsequent syndrome refining phase, following the idea that, to evaluate ν error magnitudes when all the error location numbers are known, only the first ν syndromes are needed, the original syndromes S_k are re-elaborated to compute the refined syndromes S_w^(k). Finally, in the error magnitude extraction phase, δ_k is recursively computed from the refined syndromes.
An implementation for ν = 3 is reported in Figure 5. As for the Forney example in Figure 4, the number of used operations is found to follow the equations of Table 1: 2 inversions (plus 1 required to generate the inverse of the α_1 coefficient), 9 additions, and 12 multiplications. Compared with the Forney method, the new method requires fewer operations to obtain the result. Moreover, as can be seen in the diagram, the new method does not need the values of the error locator polynomial (σ_k) as input.

Hardware Configuration
The aim of this work is to evaluate Lu's method for estimating the error magnitudes in a received Reed-Solomon code word, considering practical implementations. As a benchmark for the evaluation, Forney's method was considered. The two algorithms were designed with different configurations of the n, k, t, and m values, as shown in Table 3. Two different m values were chosen: code words with 4-bit symbols are typically used in image transmission [38,39], while code words with 8-bit symbols are practically implemented in Quick Response (QR) codes [14], Digital Video Broadcasting-Terrestrial (DVB-T) [40], and the Consultative Committee for Space Data Systems (CCSDS) standard for space applications [41]. All configurations were designed to correct any number of errors ν ≤ t in the code word. However, as the worst performance is obtained when the maximum number of errors occurs, the number of errors introduced in the code word was set equal to the maximum correctable number (i.e., ν = t). The configurations were designed for implementation both on an FPGA device and on an MCU, as different behaviors are to be expected because of the different nature of the two platforms.
The FPGA platform was used to evaluate the parallelization capability of the two algorithms. For this purpose, the configurations presented in Table 3 for both the Forney and Lu algorithms were designed in fully concurrent VHDL code, and the results were evaluated in terms of resource usage and delay to obtain a valid output. The design environment is Xilinx Vivado 2019.1, and the target device was set to the Artix-7 XC7A100T-CSG324 FPGA. As an example, in Figures 6 and 7, the register transfer level (RTL) results of the RS(15,11) configuration are reported.
To evaluate the performance of the Forney and Lu methods when implemented on the FPGA platform, a Nexys 4 DDR board hosting the target FPGA was exploited. The measurement set-up is shown in Figure 8. A button is pressed to begin the elaboration, and a controller signals the start by raising the start signal. As the outputs of the method under test consist of several bits (up to 128 in the worst case), it would be impossible to monitor all of them with an oscilloscope. Thus, a comparator was introduced to assert the end signal when the valid data match a constant representing the expected value. The test pin therefore rises with the start signal and falls when the end signal goes high; the time between the two transitions of the test pin gives the algorithm output delay. The measurements were carried out with a Tektronix DPO7254 oscilloscope, exploiting its 40 GSa/s real-time sample rate.
Exploiting the fully sequential architecture of an MCU platform, the speed of the two algorithms without parallelization was evaluated. The platform environment in this case is IAR Embedded Workbench, in which the different configurations of Table 3 were coded in C language and flashed on a Texas Instruments (TI) LaunchPad XL development board. The board hosts a TI CC3200 system-on-chip (SoC) device, with a 32-bit ARM Cortex-M4 MCU clocked at 80 MHz. This device, thanks to an embedded Wi-Fi radio module, is widely used to implement devices and systems compatible with the Internet of Things and wireless sensor network paradigms [42-47], topics in which error correction techniques have been demonstrated to provide an advantage in recent implementations [48,49]. Data about execution speed were collected by measuring a general-purpose input/output (GPIO) pin voltage, which was toggled inside the C code at the computation's start and end. The time between the low-to-high and high-to-low transitions gives the code execution time. The measurements were carried out by means of a Tektronix MSO 2024 oscilloscope. In Figure 9, the Forney and Lu versions of the C code flow chart for the RS(15,11) configuration are reported.

Results and Discussion
In Figure 10, an example of a measurement carried out with the 40 GSa/s Tektronix DPO7254 oscilloscope in the case of the FPGA platform is reported.
In Table 4, the performances of the Forney and Lu methods when implemented on the FPGA platform are compared. Data about cell LUT (look-up table) usage refer only to the block performing the Forney or Lu algorithm (excluding the logic used to compute the output delay, shown in Figure 8). To better appreciate the results, they are also represented in Figure 11.
Figure 10. Oscilloscope measurements of the test pin for the configuration F of Table 3.
Figure 11. Graphical representation of the results of Table 4: (a) FPGA implementations' resource usage in terms of cell look-up tables (LUTs); (b) FPGA implementations' output delay.
As can be seen, as theorized by the model presented in the literature, Lu's method performs better than Forney's in terms of resource usage. The difference between the two algorithms increases with the system complexity (i.e., with increasing m and t parameters), and it is clearly appreciable for the RS(255,223) configuration. Nevertheless, from the output delay point of view, Forney's algorithm definitely performs better: this behavior is justified by the fact that Lu's algorithm computes the result in a strongly recursive way (as can be seen, for example, in the error magnitude extraction phase). This leads to deep architectures in which the last output strongly depends on the previous ones, negatively affecting the overall latency. From this point of view, Forney's algorithm is more efficient thanks to its parallel-prone structure, as can be seen in the resulting RTL scheme (Figure 6). Therefore, if the target application is to be implemented on an FPGA platform, Forney's method should be preferred, unless the available area resources are a binding constraint.
In Table 5, the performances of the two algorithms when the implementation platform is an MCU are reported. The measurement technique can be better visualized in Figure 12, where the outputs from the Tektronix MSO 2024 oscilloscope in the case of the RS(255,223) configuration are reported. The output delays obtained in the MCU implementation are not intended for a direct comparison with those measured in the FPGA implementation: the FPGA solution outperforms the MCU one, and the comparison must be made between the two methods on the same implementation platform and with the same system configuration.
Figure 12. Oscilloscope measurements of the GPIO pin for the configuration F of Table 3.
As can be seen, in the MCU implementation the improvements brought by the method proposed by Lu et al. are fully evident. In this case, as opposed to the FPGA implementation, all the operations are computed sequentially, and Lu's method can take full advantage of the smaller number of operations required to produce the result. Also in this case, the advantage grows with the system complexity: in the RS(255,223) configuration, the time needed to get a valid result with Lu's method is three times lower than that required with Forney's algorithm. To better appreciate these results, the execution times are also compared in Figure 13.
Figure 13. Graphical representation of the results of Table 5 regarding the MCU implementations' code execution time.
Considering the overall results, it appears that the conclusions drawn by Lu et al. in the paper introducing their method are fully valid only in the case of the MCU implementation. In the case of an FPGA (or, in general, of parallel architectures), the choice between Lu's and Forney's algorithms depends on the optimization strategy to be followed. If the output delay is a strong constraint, Forney's method should be preferred, owing to its lower delay. However, if a circuit with a smaller occupied area is needed, Lu's method is the better choice.

Conclusions
In this paper, an innovative method, originally introduced by Lu et al. [25] to compute the error magnitude values in an RS decoder, was implemented to evaluate its realistic performance. Two platforms, an ARM-based MCU and an Artix-7 FPGA, were considered to assess different implementation strategies, fully sequential and fully combinational, respectively. As a benchmark for the analysis, we selected Forney's algorithm because it is widely used and, to date, is considered the standard method for error magnitude evaluation in Reed-Solomon codes. For both implementations, the two algorithms were designed considering six practical configurations, normally used in specific applications. All the solutions were evaluated in terms of output delay, that is, the time to get a valid result when the system is fed with valid inputs. The FPGA-based solutions were also evaluated in terms of resource (LUT) usage. The results show that, when implemented with a parallel strategy, Lu's method performs better than Forney's in terms of occupied LUTs, while suffering from its strongly recursive approach in terms of output time. However, Lu's method performs definitely better on the MCU, where it can take advantage of the smaller number of operations required to produce the output, with an output delay up to three times lower in the configurations examined in this paper. From these results, we can conclude that, if an FPGA platform is considered for the implementation of the whole decoder, the designer should select Forney's method under stringent time constraints, while Lu's method is eligible when the area occupation is critical. Instead, when the decoder has to be implemented on an MCU platform, Lu's method should be preferred owing to its reduced execution time.