FPGA-Oriented LDPC Decoder for Cyber-Physical Systems

Abstract: A potentially useful element of Cyber-Physical Systems is a modern forward error correction (FEC) coding system that uses a code selected from the broad class of Low-Density Parity-Check (LDPC) codes. This paper presents the development of an FPGA hardware implementation of a decoder for the Quasi-Cyclic (QC-LDPC) subclass of codes. The decoder can be configured to support the typical decoding algorithms: Min-Sum or Normalized Min-Sum (NMS). A novel method of normalization in the NMS algorithm is proposed, one that utilizes combinational logic instead of arithmetic units. A comparison of decoders with different bit-lengths of data (beliefs, that is, messages propagated between computing units) is also provided. The presented decoder has been implemented with a distributed control system. Experimental studies were conducted using an Intel Cyclone V FPGA module, which is part of the developed testing environment for LDPC coding systems.


Introduction
In recent years, there has been a strengthening link between advancement in computational technologies and components of physical systems. The so-called Cyber-Physical System (CPS) consists of a set of modules that interact with each other and communicate with the outside world. Combining computational and communication aspects with control techniques in one system becomes a challenge. Cyber-Physical Systems applications can be found in almost all areas of human life, such as production systems, intelligent networks, robotics, transport systems, medical devices, military systems, home networks, intelligent buildings, etc. In many CPS applications, digital communication systems play a key role, as they are often integrated with the executive system. During communication, data may be disturbed by various unwanted signals, noise and/or interferences. Error Correction Coding (ECC) techniques can not only detect communication errors, but also reconstruct valid data. Low-Density Parity-Check (LDPC) codes are one of the best known codes with very good correction capabilities.
An LDPC code is defined by its parity check matrix, which is a sparse matrix, or equivalently, by the corresponding bipartite factor graph (known as a Tanner graph [9]). The graph structure has a direct relationship with the structure of the LDPC encoder and decoder. Particularly important for implementation reasons are the Quasi-Cyclic (QC) codes [10], for which the parity check matrix is an array of square submatrices, with every submatrix being a cyclic permutation of an identity matrix. Arrays of this type allow for efficient hardware implementation of parallel and semi-parallel QC-LDPC decoders [11].
Research in the area of LDPC coding techniques remains active and includes, among others, the methodical construction of LDPC codes [10,12,13], decoding algorithm design [11,14,15] and efficient decoder implementation [4,16-18]. Implementation issues typically constrain the design to some class of implementation-oriented LDPC codes, which most commonly belong to the QC-LDPC class mentioned above.
LDPC codes are decoded iteratively using the sum-product algorithm, also known as belief propagation [2], which closely approximates maximum-likelihood decoding. The commonly used message passing schedule is a two-phase message passing scheme [11], where a decoding iteration is divided into two rounds of computations, corresponding to the variable and check nodes of the code graph, respectively. Other schedules are also known, for example the layered decoding scheme [19].
Every hardware LDPC decoder belongs to one of three categories: serial, semi-parallel or parallel, meaning that the decoder computation units execute the decoding algorithm in a serial, semi-parallel or parallel manner, respectively. The serial decoder has the lowest throughput, but also the lowest hardware utilization. The parallel decoder has the highest throughput and the highest hardware requirements. A semi-parallel decoder is a compromise that is the best choice in most applications. The results presented in this paper concern a semi-parallel decoder architecture with the two-phase message computation schedule. The presented decoder solution can be adapted to any QC-LDPC code. There exists a broad range of known implementations, most of them surveyed in [16], many of them aimed at ASIC implementation [16,18], and some intended for implementation in software [17]. The research presented in this paper is devoted specifically to implementation in a Field-Programmable Gate Array (FPGA) chip. In the designed decoder, important parts of the computing units are fitted directly to the Look-Up-Table (LUT) fabric of the FPGA, giving a slightly improved decoding performance and decreased resource requirements.
An important feature of an LDPC decoder is the method of calculating messages in the computing units, which is essentially independent of the chosen architecture. Computationally efficient message computation algorithms are known as Min-Sum (MS) and Normalized Min-Sum (NMS) [14,15]. The NMS algorithm differs from the Min-Sum algorithm by an additional normalization stage, which slightly reduces the magnitude of the iteratively approximated beliefs and is known to improve the final decoding results.
The main scope of this paper is the presentation of an irregular QC-LDPC decoder implementation that is oriented specifically to the LUT-based structure of an FPGA chip. The essence of the contribution is a technology mapping approach in which the normalization in the NMS algorithm is directly oriented to the LUT-based architecture. Moreover, we present the decoder design and the resulting system solutions, aimed at an efficient implementation of the LDPC decoder inside LUT-based FPGAs.
This paper is organized as follows. Section 2 presents the theoretical background of LDPC decoding and the implementation of a QC-LDPC decoder. Next, new concepts of FPGA-oriented technology mapping of the QC-LDPC decoder are proposed: the implementation of the distributed control unit and the original implementation of the normalization module, which is oriented to the LUT-based architecture. Section 4 presents experimental results of the proposed solutions and compares them with solutions known from the literature. The article ends with a summary, which contains directions for future work.

Theoretical Background of Hardware Implementation of a Semi-Parallel QC-LDPC Decoder
An LDPC code is defined by a parity check matrix H of size M × N. An example matrix H of a QC type code is shown in Figure 1 in the form of an array of its M × N = 8 × 16 entries. Matrix H is a QC matrix, since it consists of circulant submatrices of size P × P = 4 × 4. A submatrix will in short be called the matrix P. The parity check matrix H of a QC-LDPC code can be presented in the form of an array of submatrices, where "X" corresponds to the all-zero submatrix and a numerical value s corresponds to an identity matrix circularly shifted by s positions to the right (i.e., columns are cyclically shifted by the indicated number). For QC-LDPC codes, storing the H structure in the decoder (RAM/ROM) memory requires only storing the array of cyclic shift values. The implementation of the presented decoder takes advantage of this simplified representation. An example parity check matrix of a QC-LDPC code, constructed for this research with the method presented in [20], is shown in Figure 2. The size of the submatrix is P = 64 in this case, the size of the matrix H is 512 × 1024 and the code rate is R = (N − M)/N = 1/2. The minimum number of non-zero elements in a row is 6 and the maximum is 7. For columns these numbers are 2 and 7, respectively. This means that the code is irregular, with a maximum column weight of 7.
The decoding process utilizing the Min-Sum function [15] can be presented in the following consecutive steps:
1. Initialization: the input values, which are Log-Likelihood Ratios (LLRs), are assigned for every m ∈ M(n) and n ∈ (1, N), as in (1).
2. Control nodes: the minimum values are determined for every n ∈ N(m) and m ∈ (1, M), as in (2).
3. Bit nodes: the minimum values and the LLR input values are added for every m ∈ M(n) and n ∈ (1, N), as in (3).
4. Pseudo-posteriori probabilities: the pseudo-posteriori probabilities are determined for every n ∈ (1, N), as in (4).
5. Hard decisions: trial hard decisions are made for every n ∈ (1, N), as in (5).
6. Verification of control equations, as in (6).
7. Another iteration: if the control equations have been met, decoding is terminated with x̂_n as the outcome. Otherwise, the next iteration of decoding begins, starting from step 2 (control nodes), unless the iteration limit is exceeded.
Here := means "becomes"; x = [x_1, x_2, ..., x_n] is the code vector; y = [y_1, y_2, ..., y_n] is the received vector; Q_nm is the credibility (LLR) value passed from the n-th bit vertex to the m-th control vertex of the Tanner graph; L(x_n|y_n) is the a priori LLR for the n-th bit; (1, N) is the set of integers between 1 and N; N(m) is the set of column indexes of the parity check matrix H that contain a one in the m-th row; M(n) is the set of row indexes of H that contain a one in the n-th column; R_mn is the message from the m-th control vertex to the n-th bit vertex of the Tanner graph; x̂ is the decoded vector, with elements x̂_n.
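In the standard two-phase Min-Sum formulation, and with the symbols defined above, the steps can be written as follows. This is a sketch in conventional notation, not a verbatim copy of the numbered equations: the tags simply follow the step numbers, the mapping of a non-negative LLR to bit 0 in the hard decision is an assumed sign convention, and the untagged last line shows the usual NMS variant, in which the control-node magnitude is scaled by a constant 0 < α ≤ 1.

```latex
\begin{align}
Q_{nm} &:= L(x_n \mid y_n) \tag{1}\\
R_{mn} &:= \prod_{n' \in N(m)\setminus\{n\}} \operatorname{sgn}\left(Q_{n'm}\right)
          \cdot \min_{n' \in N(m)\setminus\{n\}} \left|Q_{n'm}\right| \tag{2}\\
Q_{nm} &:= L(x_n \mid y_n) + \sum_{m' \in M(n)\setminus\{m\}} R_{m'n} \tag{3}\\
Q_{n}  &:= L(x_n \mid y_n) + \sum_{m \in M(n)} R_{mn} \tag{4}\\
\hat{x}_n &:= \begin{cases} 0, & Q_n \ge 0 \\ 1, & Q_n < 0 \end{cases} \tag{5}\\
H \hat{x}^{T} &= 0 \tag{6}\\
R_{mn} &:= \alpha \prod_{n' \in N(m)\setminus\{n\}} \operatorname{sgn}\left(Q_{n'm}\right)
          \cdot \min_{n' \in N(m)\setminus\{n\}} \left|Q_{n'm}\right| \notag
\end{align}
```

Equations (3) and (4) differ only in whether the message arriving on the edge being updated is excluded from the sum, which is why both quantities can be produced by the same computing unit.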

Construction of the QC-LDPC Decoder with a Distributed Control System
The block diagram of the irregular QC-LDPC decoder is shown in Figure 3. Initially, the decoder receives the a priori LLR values L(x_n|y_n), which are stored in memory. Data is propagated between modules in portions of P messages carried in parallel on a data bus. The received data is forwarded to the initialization module, and then the P messages on the bus are re-positioned by cyclically shifting them to the right using the Shift Right (SR) module. The number of positions by which the data vector is shifted is defined in the offset memory S (Shift), which stores the shift values corresponding to the P submatrices of H. The shifted message vectors are stored in the Q_nm memory, according to Equation (1). The unit computing the control node messages reads data from the Q_nm memory and determines the minimum value and the sign according to Equation (2). The obtained result can be modified by the normalizing parameter α according to Equation (8). The computed control node messages are saved to the R_mn memory.
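The role of the shift memory can be illustrated with a short Python sketch. It expands a table of shift values into the full matrix H and checks that multiplying a vector of P messages by a circulant block only reorders the vector, which is why a cyclic shifter is sufficient in hardware. The shift table `SHIFTS` and the helper names are hypothetical and are not the values from Figure 1; whether a given rotation counts as "left" or "right" depends on how the bus is indexed.

```python
import numpy as np

def circulant(p, s):
    """P x P identity matrix with its columns cyclically shifted right by s."""
    return np.roll(np.eye(p, dtype=int), s, axis=1)

def expand(shifts, p):
    """Expand a table of shift values (None = all-zero block) into the full H."""
    blocks = [[np.zeros((p, p), dtype=int) if s is None else circulant(p, s)
               for s in row] for row in shifts]
    return np.block(blocks)

P = 4
SHIFTS = [[0, 1, None, 2],      # hypothetical shift table (2 x 4 blocks)
          [3, None, 0, 1]]
H = expand(SHIFTS, P)           # 8 x 16 parity check matrix

# A circulant block applied to P messages is a pure rotation of the vector,
# so only the shift value has to be stored, never the block itself.
msg = np.array([10, 20, 30, 40])
rotations_match = all(
    np.array_equal(circulant(P, s) @ msg, np.roll(msg, -s)) for s in range(P)
)
```

Only the eight entries of `SHIFTS` would need to be stored to describe this 8 × 16 matrix.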
The unit calculating the bit node messages reads data from the R_mn memory and then cyclically shifts it to the left using the SL module. The unit computes the sums of appropriately chosen messages stored in the R_mn and L(x_n|y_n) memories, in accordance with Equation (3). Formulas (3) and (4) are similar, therefore Q_nm and Q_n are computed simultaneously in the bit node unit. The obtained results are stored in the Q_nm and Q_n memories, respectively.
In the next stage, the control equations are verified according to (6) in the verification module of control equations. If all control equations have been met, the result transmission module begins reading data from the Q_n memory and transfers it in the correct order to the communication handling module. Otherwise, all calculations are repeated, starting with the control unit node. Decoding can be interrupted if the assumed maximum number of iterations has been reached. Each of the presented elements of the QC-LDPC decoder has its own control unit that is responsible for its proper operation.
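The sequence of operations described above can be mirrored in a small software reference model. The sketch below is a plain, serial Python version of two-phase Min-Sum decoding, not the hardware data path: it ignores the QC structure and fixed-point effects, assumes that every row of H has at least two non-zero entries, and uses a toy matrix `H_SMALL` that is purely hypothetical. Setting `alpha` below 1 gives a Normalized Min-Sum variant.

```python
import numpy as np

def min_sum_decode(H, llr, max_iter=15, alpha=1.0):
    """Two-phase Min-Sum decoding; returns (x_hat, iterations_used, success)."""
    H = np.asarray(H)
    llr = np.asarray(llr, dtype=float)
    rows, cols = np.nonzero(H)              # one entry per edge of the Tanner graph
    Q = llr[cols].copy()                    # initialization: Q_nm := L(x_n|y_n)
    R = np.zeros_like(Q)
    x_hat = (llr < 0).astype(int)
    for it in range(1, max_iter + 1):
        # Control node phase: sign product and minimum over the other edges.
        for m in range(H.shape[0]):
            edges = np.flatnonzero(rows == m)
            for k in edges:
                others = Q[edges[edges != k]]
                sign = np.prod(np.where(others < 0, -1.0, 1.0))
                R[k] = alpha * sign * np.min(np.abs(others))
        # Bit node phase: Q_n sums every incoming message,
        # Q_nm leaves out the message received on its own edge.
        Qn = llr.copy()
        np.add.at(Qn, cols, R)
        Q = Qn[cols] - R
        # Hard decision (assumed convention: negative LLR -> bit 1) and check.
        x_hat = (Qn < 0).astype(int)
        if not np.any(H @ x_hat % 2):
            return x_hat, it, True
    return x_hat, max_iter, False

H_SMALL = np.array([[1, 1, 0, 1, 0, 0],     # hypothetical toy matrix
                    [0, 1, 1, 0, 1, 0],
                    [1, 0, 1, 0, 0, 1]])
# All-zero codeword with one unreliable, wrongly signed input LLR.
x_hat, iterations, ok = min_sum_decode(H_SMALL, [-1, 2, 2, 2, 2, 2])
```

In this toy case the wrongly signed first bit is corrected in the first iteration, so the loop stops as soon as the control equations are met.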
A typical method for implementing a control unit of an LDPC decoder is to use a global controller, which controls the operation of the entire decoder. The disadvantage of this solution is the high level of complexity of the controller and the possible occurrence of clock skew phenomena.
The proposed decoder uses a distributed control system. The elements "communication handling module", "initialization module", "control unit node", "bit node unit", "verification module of control equations" and "result transmission module" in Figure 3 have their own built-in control units. The control unit of each element is activated by the preceding control unit, in the order determined by the decoder's operation. The first control unit, in the communication handling module, is activated by an external signal informing it about the start of the transfer of a new received data vector. A similar principle is used in systems with so-called Token Passing, e.g., the IEEE 802.5 Token Ring [22] standard developed in the 70s by IBM. This allowed for a significant simplification of the construction of the individual control units and their better adaptation to the needs of a given element. The distributed control system is also less susceptible to clock skew phenomena. Figure 4 shows a simplified layout of the QC-LDPC decoder, which indicates the order in which the Token is activated. The red arrows indicate the stages of running the distributed control system. The dotted arrows are optional: depending on the decision made by the verification module of control equations, the Token is passed either to the result transmission module or to the unit for calculating control nodes. Memory control and addressing are performed by the control units located in the other elements of the decoder. Since every message memory is used by more than one element, attention should be paid to address propagation. For this purpose, a multiplexer is used, as shown in Figure 5. When implementing the multiplexer in a hardware description language (e.g., Verilog), it must be ensured that the address is selected according to which Trigger was the last to change its state (edge trigger).
As mentioned before, in the developed distributed control system, each system element is activated by the preceding element, by detecting a change in the "Trigger X" signal, as shown in Figure 6. After finishing the current iteration tasks, the control unit activates the next control unit with the "Trigger Y" signal, at the same time deactivating itself. Every unit of the control system works only when the "Enable" signal is active.
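The order in which the token travels can be sketched with a few lines of Python. This is only a behavioural illustration of the hand-over principle, not a model of the Verilog control units: `equations_met` is a hypothetical stand-in for the decision of the verification module, and the behaviour at the iteration limit (the trace simply ends) is an assumption.

```python
UNITS_ONCE = ["communication handling module", "initialization module"]
UNITS_LOOP = ["control unit node", "bit node unit",
              "verification module of control equations"]

def token_trace(equations_met, max_iter=15):
    """List the units in the order they hold the token for one received vector.

    equations_met(it) -> bool stands in for the verification result after
    iteration `it` (1-based).
    """
    trace = list(UNITS_ONCE)        # each unit runs, triggers the next, then idles
    for it in range(1, max_iter + 1):
        trace += UNITS_LOOP
        if equations_met(it):       # optional (dotted) arrow: leave the loop
            trace.append("result transmission module")
            return trace
    return trace                    # iteration limit reached, decoding interrupted

trace = token_trace(lambda it: it == 2)
```

At any moment exactly one entry of the trace is active, which is what lets each control unit stay small and local.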

Implementation of the Normalization Module
The block diagram of the unit calculating the control node messages R_mn, with the normalization elements of the NMS algorithm, is presented in Figure 7. The data read from memory is converted from the two's complement format (convenient for additions in the bit node units) to the Sign and Magnitude (SM) format. Every data word contains P messages, represented by B-bit fixed-point numbers, which are separated and delivered to the appropriate Min-Sign modules. Every Min-Sign module has d_c inputs, for messages corresponding to at most d_c check node edges. The value of d_c is equal to the largest number of non-zero elements in any row of H. The DMUX and MUX modules deliver the messages from memory to the inputs of the appropriate Min-Sign modules. The counters addressing the DMUX and MUX are part of the control system unit. When the packet of data in a given computation cycle is smaller than the maximum d_c, the remaining outputs of the DMUX are set to the maximum (positive) value, which is transparent both for the minimum operation and for the sign-product operation. The values of the minimum and the sign are determined in P separate Min-Sign modules operating in parallel, and then normalized by the α parameter, according to (8). After multiplexing in the MUX, the results are converted back from the SM representation to the two's complement format and saved in the R_mn memory.
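The transparency of the padding value can be checked with a small Python model of a single Min-Sign module. The function names are hypothetical, and saturating the most negative two's complement code so that its magnitude fits in B − 1 bits is an assumption of this sketch rather than a documented property of the decoder.

```python
def to_sign_magnitude(value, bits):
    """Split a two's complement integer into (sign bit, magnitude).

    The most negative code is saturated so the magnitude fits in bits-1 bits.
    """
    max_mag = (1 << (bits - 1)) - 1
    return (1 if value < 0 else 0), min(abs(value), max_mag)

def min_sign(messages, bits, d_c):
    """For each input: product of the other signs, minimum of the other magnitudes."""
    max_mag = (1 << (bits - 1)) - 1
    sm = [to_sign_magnitude(v, bits) for v in messages]
    sm += [(0, max_mag)] * (d_c - len(messages))    # pad the unused inputs
    out = []
    for k in range(len(messages)):
        others = sm[:k] + sm[k + 1:]
        sign = sum(s for s, _ in others) % 2         # XOR of the other sign bits
        mag = min(m for _, m in others)
        out.append(-mag if sign else mag)            # back to a signed integer
    return out

row = [3, -2, 5, -7]                     # a check node with only 4 of 7 edges used
padded = min_sign(row, bits=4, d_c=7)    # module with 7 inputs, 3 of them padded
exact = min_sign(row, bits=4, d_c=4)     # module sized exactly for this row
```

Both calls give the same messages, because a positive sign does not change the sign product and the maximum magnitude never wins the minimum.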
The typical hardware-efficient implementation of the multiplication of a fixed-point number by a constant coefficient (α) takes the form of a module consisting of shifters and adders. In this method, α < 1 is expressed as α = α_1·2^−1 + α_2·2^−2 + ... + α_B·2^−B (9), where α_1, α_2, ... ∈ {0, 1}. Since multiplication by 2^−b is equivalent to shifting the fixed-point binary representation by b bits towards the LSB, and α_1, α_2, ... are constants, multiplication of a message m by α can be realized by summing those b-bit shifts of the message m for which α_b = 1 in (9), b = 1, 2, ..., B. For example, multiplication by α = 0.75 = 0.5 + 0.25 can be implemented with a single adder, while the multiplications by 0.5 and 0.25 are realized by shifting the fixed-point number by one and two bits, respectively. Performing this multiplication using shift registers and B-bit adders results in a truncation error, because the least significant bits are shifted out. For data buses narrower than 6 bits, the truncation error can be quite significant, which makes other implementations more attractive. Therefore, we propose another approach, which is oriented to the LUT-based FPGA.
A typical FPGA chip contains an array of programmable logic blocks (ALM, Adaptive Logic Module, in the case of Intel devices) and a hierarchy of reconfigurable interconnections that allow the blocks to be wired according to the needs of a specific project [23]. The ALM block consists of combinational logic (LUT), adders and registers [24]. The design of the ALM unit is shown in Figure 8. Each LUT has a maximum of 8 inputs and can be configured to realize any logic (binary) function.
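The truncation effect of the shift-and-add method is easy to reproduce. The following Python sketch scales every 3-bit magnitude (the magnitude part of a 4-bit message) by α = 0.75 using right shifts and one addition, and compares the result with the exact product; the function name and the tuple encoding of the α bits are illustrative choices.

```python
def shift_add_scale(m, alpha_bits):
    """Multiply a non-negative integer m by alpha = sum(alpha_b * 2**-b)
    using right shifts and additions only; alpha_bits = (alpha_1, alpha_2, ...)."""
    return sum(m >> b for b, bit in enumerate(alpha_bits, start=1) if bit)

ALPHA_075 = (1, 1)                                  # 0.75 = 2**-1 + 2**-2
approx = [shift_add_scale(m, ALPHA_075) for m in range(8)]
exact = [0.75 * m for m in range(8)]
errors = [e - a for e, a in zip(exact, approx)]
# approx == [0, 0, 1, 1, 3, 3, 4, 4]
# exact  == [0.0, 0.75, 1.5, 2.25, 3.0, 3.75, 4.5, 5.25]
```

For a magnitude of 7 the shift-and-add result is 4 instead of 5.25, an error of more than one quantization step, which shows how coarse this approach becomes at such short word lengths.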
Therefore, when implementing the normalization module in an FPGA structure, it is possible to perform a low-precision normalization as a direct mapping of the normalizing function into the FPGA logic resources, that is, LUTs. Figure 9 presents Karnaugh maps for several values of the parameter α, with 4-bit precision. The maps depend only on the magnitude of the number; the sign bits are processed independently. It should be noted that multiplication by 0.625, 0.6875, 0.75 and 0.8125 can be implemented using combinational logic (LUTs) and registers. A single LUT is enough for 4-bit precision. The case α = 1, which corresponds to QC-LDPC decoding without normalization, is also included in Figure 9. During the research, numerous variants of the Karnaugh maps were verified. The article presents one of the best Karnaugh maps for the tested control matrix H.
However, since linear normalization is only an approximate method of belief calculation, we assumed that applying other (non-linear) normalization functions could result in decoding performance that is no worse, and possibly better. An example is the function presented in the last set of Karnaugh maps, labeled "Proposal", which is an arbitrarily chosen map, experimentally shown to give the best results, at least for the codes that we have experimented with. This map is similar to α-normalization, but not exactly the same, and, as will be shown, it results in an improved decoding performance of the implemented decoder. Moreover, this map can still be implemented in a single LUT of the FPGA. A logic function resulting from the synthesis of Karnaugh maps can be implemented directly in a LUT. Implementing the normalization as a logic function is beneficial due to the direct fit into the LUT-based FPGA architecture. Figure 10 presents graphically the dependence between the 3-bit input magnitude (represented by an integer in the range 0 ... 7) and the 3-bit output of the normalization modules for a few of the investigated normalizations, including the "Proposal" function, which showed the best correction performance in our experiments, as will be presented in the next section.
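To make the idea of a table-based normalization concrete, the Python sketch below builds the input-to-output table for a linear α-normalization of a 3-bit magnitude and splits it into one truth-table column per output bit, which is the form a LUT stores. Round-half-up is an assumed rounding rule and may differ from the maps in Figure 9; the entries of the "Proposal" map are not listed in the text, so they are not reproduced here.

```python
import math

def normalization_lut(alpha, mag_bits=3):
    """Table mapping each magnitude m to alpha * m, rounded half up (assumed rule)."""
    return [math.floor(alpha * m + 0.5) for m in range(1 << mag_bits)]

def output_bit_columns(table, mag_bits=3):
    """One truth-table column per output bit (LSB first), indexed by the input."""
    return [[(v >> b) & 1 for v in table] for b in range(mag_bits)]

lut_075 = normalization_lut(0.75)        # [0, 1, 2, 2, 3, 4, 5, 5]
lut_08125 = normalization_lut(0.8125)    # [0, 1, 2, 2, 3, 4, 5, 6]
lut_1 = normalization_lut(1.0)           # identity: plain Min-Sum
columns_075 = output_bit_columns(lut_075)
```

Each column is an arbitrary Boolean function of three inputs, so any table of this size, linear or not, costs the same logic; a non-linear map would be obtained simply by editing the table entries.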

Experimental Results
A special test environment, whose block diagram is presented in Figure 11, was developed for the purpose of the experimental research. The environment consists of three key elements: a computer (PC), a microcontroller (STM32F4DISCOVERY evaluation board) and an FPGA system (evaluation board with a Cyclone V device). The PC runs system simulation software developed in Python, which models a random data generator, an LDPC encoder, a Binary Phase Shift Keying (BPSK) modulator/demodulator, an AWGN (Additive White Gaussian Noise) channel and an LDPC decoder. The encoded and modulated data is disturbed using the AWGN channel model, with the noise level set according to a variable Signal-to-Noise Ratio (SNR). The erroneous data is then corrected (decoded) by the LDPC decoding software. Such a testing system allows verification of the correction properties of an LDPC code with a given parity check matrix H. The simulation results, in the form of Bit Error Ratio (BER) vs. SNR in an AWGN channel, can be used as a reference chart for hardware implementations of the LDPC decoder.
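The channel part of such a simulation chain can be sketched in a few lines of Python with NumPy. This is not the authors' software: the function names are hypothetical, the SNR is assumed to mean Eb/N0 with the code rate taken into account, and the quantizer simply rounds and saturates to a symmetric range without any prior scaling.

```python
import numpy as np

def awgn_llr(bits, snr_db, rate=0.5, rng=None):
    """BPSK-modulate bits (0 -> +1, 1 -> -1), add white Gaussian noise and
    return the channel LLRs L = 2 * y / sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(bits)
    ebn0 = 10.0 ** (snr_db / 10.0)
    sigma2 = 1.0 / (2.0 * rate * ebn0)       # assumes snr_db denotes Eb/N0
    y = (1.0 - 2.0 * bits) + rng.normal(0.0, np.sqrt(sigma2), bits.shape)
    return 2.0 * y / sigma2

def quantize(llr, bits=4):
    """Round and saturate LLRs to the symmetric signed range of the given width."""
    lim = (1 << (bits - 1)) - 1
    return np.clip(np.rint(llr), -lim, lim).astype(int)

def bit_error_ratio(sent, decided):
    """Fraction of positions in which the decided bits differ from the sent bits."""
    return float(np.mean(np.asarray(sent) != np.asarray(decided)))

codeword = np.zeros(1024, dtype=int)          # the all-zero word is always a codeword
llr = awgn_llr(codeword, snr_db=2.0, rng=np.random.default_rng(0))
decoder_input = quantize(llr, bits=4)         # 4-bit a priori LLRs for the decoder
uncoded_ber = bit_error_ratio(codeword, llr < 0)
```

Sweeping `snr_db` and feeding `decoder_input` to a decoder model would give BER and FER points of the kind used as a software reference for the hardware measurements.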
The developed computer software is also responsible for generating descriptions of QC-LDPC decoders in a Hardware Description Language (Verilog) for the FPGA chip. As input parameters, the generator needs the parity check matrix H, specified by the matrix size M × N, the submatrix size P, the locations of the nonzero submatrices and the corresponding cyclic shift values. The generator creates all files (in Verilog) that are necessary to build a project in the Quartus environment. Additional parameters for the generator are:
• FPGA type and family (e.g., Cyclone V, 5CSEBA6U23I7);
• bit-resolution of the decoder input, i.e., of the a priori LLRs (e.g., 4 bits);
• maximum number of decoder iterations (e.g., 50);
• type of algorithm used (e.g., Min-Sum or Normalized Min-Sum with a chosen normalization).
A set of test data consisting of coded vectors (without interference) and distorted data vectors can be generated by the developed software, which allows verification of the FPGA implementation of the QC-LDPC decoder. A graphical interface for the whole environment was created in C#, while the microcontroller software was written in C. The task of the microcontroller is to distribute the data, the clock and other control signals to the FPGA. Using this environment, we can easily compare different constructions and configurations of the hardware decoder, reporting the hardware utilization as well as the experimentally obtained error correction performance curves.
Table 1 shows the hardware requirements obtained as a result of logic synthesis of a few selected configurations of the QC-LDPC decoder. It can be observed that as the data bus size increases, the utilization of ALM units and memory bits also increases. Meanwhile, the choice of normalization, including the proposed non-linear variant ("Proposal"), has no impact on the decoder hardware utilization. Therefore, the Proposal can be applied without any implementation overhead.
Figure 12 shows the observed error correction performance of the QC-LDPC decoder implemented in the Cyclone V FPGA, illustrating its dependence on the message (belief) precision. The Bit Error Rate (BER) and Frame Error Rate (FER) curves for 3-bit data reflect very poor correction performance in the presented SNR range. Meanwhile, the 4-bit and 5-bit precision decoders correct errors with an effectiveness that increases with SNR. The differences between the BER and FER curves of the 4- and 5-bit cases depend somewhat on the SNR, but they are not very significant. Therefore, a 4-bit bus is recommended in general, because it gives better correction properties than the 3-bit solution and lower hardware utilization than the 5-bit solution.
All tests were performed for the parity check matrix H shown in Figure 2 and a maximum of 15 iterations. As a reference, Figure 12 also provides the corresponding BER and FER curves obtained from a computer simulation of BP decoding with floating-point precision.
The next results provide an experimental insight into the normalization method of the NMS algorithm. Figure 13 presents the BER and FER results in the examined SNR range for the Min-Sum algorithm (α = 1) and a range of variants of the NMS algorithm, with the recommended 4-bit precision. It can be observed that, in general, the Min-Sum algorithm performs worse than the NMS algorithm with an optimized α, with α = 0.8125 being the optimal value in this case. Meanwhile, Min-Sum outperforms NMS for some other values of α.
However, it can also be observed that the "Proposal" solution gives correction performance even better than NMS with the best α = 0.8125. The proposed nonlinear normalization method makes it possible to achieve better correction performance than the optimized NMS. At the same time, it fits well into the FPGA resources, with the nonlinear normalization still implemented in a single LUT. Therefore, we obtained a modified NMS algorithm that is well suited for implementation in FPGAs, without any hardware overhead but with improved correction performance.

Conclusions
The article presents a hardware (FPGA) implementation of the irregular QC-LDPC decoder. A few variants of decoding algorithm and precisions have been investigated. It is shown that it is beneficial to use a 4-bit precision, characterized by a better correction performance in comparison with the 3-bit precision, and hardware resources savings in comparison with 5-bit precision. The use of 4-bit buses does not lead to a significant deterioration in the correction performance in comparison with the decoder with 5-bit buses.
The novel idea presented in this article is a method of implementing the normalization of the NMS algorithm, enabling effective technological mapping into FPGA structures. The proposed method of nonlinear normalization with a selected logic function makes it possible to obtain a better correction performance in comparison to standard, well-known NMS solutions. A comparison of the hardware resources used for the different decoder variants indicates a slight increase in the number of ALM units used for the NMS algorithm and the proposed algorithm in comparison with the Min-Sum algorithm.
Particularly noteworthy is the interdisciplinary nature of the work presented. The combined use of a personal computer, a microcontroller and an FPGA system fits the idea of Cyber-Physical Systems, which combine issues from the fields of electronics, computer science and telecommunications. The developed test environment not only allows many activities to be automated (e.g., generating new decoders, carrying out tests) but also significantly expands the research capabilities. The use of a distributed control system allows for a significant simplification of the construction of the QC-LDPC decoder as well as of the control system itself.
Future work will focus on optimizing the developed systems in terms of energy consumption. The idea of a distributed control system provides many new possibilities, including an introduction of Clock Gating for different parts of the decoder, which will also be a further direction of work. It seems that it is possible to develop a decoder which, while ensuring the assumed correction parameters, can become more energy-efficient. The essence of these ideas is to reduce the dynamic power consumption by using Clock Gating methods in a way that best suits QC-LDPC decoders.