A High Throughput Hardware Architecture for Parallel Recursive Systematic Convolutional Encoders

In recent years, recursive systematic convolutional (RSC) encoders have found application in modern telecommunication systems to reduce the bit error rate (BER). Given the need to increase the throughput of such applications, several hardware implementations of RSC encoders have been explored. In this paper, we propose a hardware intellectual property (IP) core for high throughput RSC encoders. The IP core exploits a methodology based on the ABCD matrices model, which makes it possible to increase the number of input bits processed in parallel. Through an analysis of the proposed network topology and of implementation data on a Zynq 7000 xc7z010clg400-1 field programmable gate array (FPGA), we estimate how the input data rate and the resource occupation depend on the parallelism degree. Such analysis, together with the BER curves, provides a description of the principal merit parameters of an RSC encoder.


Introduction
In recent years, many applications, such as space telecommunication systems [1], digital TV broadcasting [2] and wireless metropolitan area networks [3], have exploited forward error correcting (FEC) codes to reduce the bit error rate (BER) in data transmission. FEC approaches encode the data to transmit with redundant codes, which allow the receiver to estimate the transmitted bit stream through maximum likelihood [4] or maximum a posteriori [5] decoding. One of the most important FEC techniques is the recursive systematic convolutional (RSC) encoder. RSC encoders are exploited as the fundamental building block of more complex and efficient FEC encoders, such as convolutional turbo codes (CTCs) [6], which are realized by stacking RSC encoders in a parallel [7] or serial [8] configuration. In the last decades, CTCs have attracted interest thanks to their high efficiency, which permits transmission at a data rate close to the channel capacity boundary [9,10]. In particular, to meet the demand for higher transmission data rates [11] in modern telecommunication systems, research has focused on increasing the efficiency of CTCs [1,12,13] or on improving their architecture at the implementation level [12,14,15].
In this paper, instead of investigating possible improvements of the whole CTC architecture, we focus on the optimization of the RSC encoder building block alone. In particular, we propose to exploit the methodology presented in our previous work [16] to improve the RSC encoder throughput by increasing the number of parallel input bits processed at the same time. To this aim, the methodology exploits the ABCD matrices model to describe the system and indicates how to derive new parallel matrices A′, B′, C′, D′, which allow us to increase the RSC encoder parallelism.
In this article, we extend our previous work by presenting a hardware implementation of the RSC encoder. The implementation, realized in the very high speed integrated circuit hardware description language (VHDL), exploits the described methodology and the A′B′C′D′ model to realize different parallel RSC encoders depending on the chosen polynomials and puncturing scheme.
The main advantage offered by a hardware implementation is the maximization of timing performance, thanks to the increased computational power compared to standard digital signal processors [17,18]. In our previous work, various RSC encoders were analyzed in terms of their BER curves, proving the equivalence of the parallel models through a MATLAB® model. Such RSC encoders were implemented on a Zynq 7000 xc7z010clg400-1 field programmable gate array (FPGA) [19]. Synthesis on the FPGA made it possible to quantify the throughput our methodology offers on such devices. Furthermore, through an analysis of the implemented network topology, we present a methodology to estimate how the input data rate and the number of FPGA slice lookup tables (LUTs) depend on the parallelism degree.
The remainder of the paper is structured as follows. Section 2 presents the approach used in this work: Section 2.1 introduces RSC encoders and their equivalent ABCD model; Section 2.2 sums up the approach described in our previous work to increase the parallelism; Section 2.3 illustrates the proposed hardware architecture for different RSC encoders; Section 2.4 discusses the importance of the input rate as a merit parameter and characterizes the architecture of the RSC encoder as a function of the parallelism degree. Section 3 considers the case studies proposed in our previous work and characterizes their implementation on the Zynq 7000 xc7z010clg400-1 FPGA in terms of resource utilization, maximum clock frequency and input rate; moreover, it presents a case study and analyzes how the input rate and the number of FPGA slice LUTs depend on the parallelism degree. In Section 4, results are discussed. Finally, Section 5 draws the conclusions.

RSC Encoders Introduction
RSC encoders are realized through linear feedback shift registers (LFSRs). The latter are devices described by a set of generator polynomials whose coefficients belong to the Galois field of two elements (GF(2)). In particular, in Equation (1) we define the feedback polynomial:

g(x) = g_0 + g_1 x + ... + g_N x^N,    (1)

where the unitary coefficients g_i indicate which present states contribute to determining the successive ones. The presence of a feedback polynomial makes the encoder recursive. Moreover, since for each input bit the RSC produces N_o outputs, a set of N_o − 1 feedforward polynomials is also defined as in Equation (2):

h_k(x) = h_k0 + h_k1 x + ... + h_kN x^N,  k = 1, ..., N_o − 1,    (2)

where h_k(x) is the polynomial producing the (k + 1)-th RSC output, whose coefficients are indicated as h_ki. Moreover, the input bit is directly reproduced as output (systematic output) together with the other N_o − 1 redundant bits of the output code. For this reason, the encoder is defined as systematic. One of the merit parameters describing an RSC encoder is the code rate, defined in Equation (3):

R = K / N_tot,    (3)

where K is the number of information bits and N_tot is the total number of transmitted bits. The closer R is to 1, the lower the amount of redundant data introduced in the encoder output code. For this reason, values of R close to 1 guarantee lower performance in terms of BER. However, values of R much lower than 1 lead to a low efficiency of the telecommunication system, since many resources are spent to transmit redundant data.
A trade-off between BER performance and efficiency is reached by discarding some bits of the output code according to a fixed puncturing scheme [20,21].
To better describe the architecture of an RSC encoder, we can consider an RSC encoder with R = 1/2 and no puncturing (N_o = 2); the same considerations can be extended to RSC encoders with different values of R. In this case, only one redundant output is generated by a single feedforward polynomial.
Let us define L as the number of coefficients (i.e., one plus the maximum degree) of h(x) and g(x). Under this convention, the RSC encoder is realized by using N = L − 1 flip-flops linked in a shift-register configuration. A feedback network fed by the flip-flop outputs drives the shift-register input together with the network input. When a coefficient g_i = 1, the output of the (i − 1)-th flip-flop is tapped in the feedback network. Moreover, since we want g(x) to be a maximal length polynomial, g_0 and g_N shall be unitary.
In the same way, when h_i = 1, the output of the (i − 1)-th flip-flop contributes to the redundant output c_1.
If h_0 = 1, the input u[n] is considered in the generation of the redundant output; otherwise, only the flip-flop states are used. Figure 1 shows the architecture of an RSC encoder with R = 1/2.
Figure 1. Architecture of a recursive systematic convolutional (RSC) encoder with R = 1/2.
The systematic output c_0[n] is generated through a direct connection with the input u[n]. Equation (4) shows an alternative form to describe the generators of an RSC encoder with R = 1/2:

G(x) = [1, h(x)/g(x)],    (4)

where the terms 1 and h(x)/g(x) respectively indicate the generator functions producing the outputs c_0[n] and c_1[n].
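To make the bit-level behaviour concrete, the following sketch models the R = 1/2 RSC encoder just described. The function name and the polynomials used in the example are illustrative choices of ours, not values taken from the paper.

```python
def rsc_encode(bits, g, h):
    """Bit-serial R = 1/2 RSC encoder (no puncturing).

    g = [g_0, ..., g_N]: feedback polynomial coefficients (g_0 = g_N = 1).
    h = [h_0, ..., h_N]: feedforward polynomial coefficients.
    Returns a list of (systematic, redundant) output pairs.
    """
    N = len(g) - 1                # number of flip-flops
    state = [0] * N               # shift-register contents, newest first
    out = []
    for u in bits:
        # feedback bit: input XOR the register taps selected by g
        a = (u + sum(g[i] * state[i - 1] for i in range(1, N + 1))) % 2
        # redundant bit: h_0 taps the feedback bit, h_1..h_N tap the register
        c1 = (h[0] * a + sum(h[i] * state[i - 1] for i in range(1, N + 1))) % 2
        out.append((u, c1))       # c_0 is the systematic output
        state = [a] + state[:-1]  # shift the register
    return out
```

For instance, with the arbitrary example polynomials g(x) = 1 + x + x^3 and h(x) = 1 + x^2 + x^3, `rsc_encode([1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1])` starts with the systematic/parity pair (1, 1).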
To introduce the equivalent ABCD model, let us consider the vector S[n] containing the information on the flip-flop states, shown in Equation (5):

S[n] = (d_1[n], d_2[n], ..., d_N[n])^T,    (5)

where d_i[n] indicates the state of the i-th flip-flop. Since the RSC encoder is composed of N flip-flops, S[n] can encode 2^N possible combinations of states. In the same way, it is possible to use a vector Y[n] to describe the encoded outputs c_0[n] and c_1[n], as illustrated in Equation (6):

Y[n] = (c_0[n], c_1[n])^T.    (6)
For RSC encoders with a different value of R, Y[n] is an N_o-dimensional vector, since it contains all the system outputs.
To describe the timing evolution of the RSC states and outputs with modulo-2 operations, we can exploit the ABCD model considering the current input and state:

S[n + 1] = (A S[n] + B u[n]) mod 2,
Y[n] = (C S[n] + D u[n]) mod 2,

where A, B, C, D are matrices, u[n] is the input bit and S[n + 1] is the state at instant n + 1.
A is an N × N matrix; as shown in Equation (7), A is a function only of the linear feedback shift register (LFSR) structure: its first row contains the tap elements of g(x), with g_0 excluded, and the remaining rows form a partial identity sub-matrix accountable for the shift operation:

A = [ g_1 g_2 ... g_N ; I_(N−1)×(N−1) , 0 ],    (7)

where I_(N−1)×(N−1) indicates the (N − 1) × (N − 1) identity matrix and g_1, ..., g_N are the coefficients of the polynomial g(x). The B vector describes the impact of the current input on the state evolution. In the case of a base realization of an RSC code, B is equal to the N-dimensional vector shown in Equation (9):

B = (1, 0, ..., 0)^T.    (9)
C is an N_o × N matrix representing the relation between the current state and the output coded bits. For RSC encoders with code rate R = 1/2, only two rows are present. In particular, since c_0 is the systematic output, which is a function of the input u only, the first row of C is filled with zeros. On the contrary, the second row is computed from the c_1 parity output equation described by Equation (10):

c_1[n] = ( h_0 u[n] + Σ_{i=1}^{N} (h_i + h_0 g_i) d_i[n] ) mod 2,    (10)
where h_i and g_i are the coefficients of the polynomials h(x) and g(x).
For RSC encoders having N_o > 2 outputs, the (k + 1)-th row of C is populated with the coefficients of the h_k(x) polynomial according to Equation (10).
Finally, D is an N_o-dimensional vector describing the contribution of the input to the generation of the output code. By using the systematic code c_0[n] = u[n] and Equation (10), it is possible to retrieve the D vector values. For R = 1/2, the D vector is shown in Equation (11):

D = (1, h_0)^T.    (11)

Equations (5)-(11) permit us to redraw the RSC circuit as a finite state machine described by the ABCD matrices model, whose block diagram is shown in Figure 2. Each operation is performed modulo 2.
Figure 2. Block diagram of the ABCD matrices model of an RSC encoder.
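As a cross-check of the model above, the following sketch builds A, B, C, D from the polynomial coefficients and performs one modulo-2 update per clock cycle. The construction of the second row of C follows our reading of the parity equation in Equation (10), and the polynomials used in the test are illustrative.

```python
import numpy as np

def abcd_from_polys(g, h):
    """Build the ABCD matrices of an R = 1/2 RSC encoder over GF(2).

    g = [g_0, ..., g_N] feedback, h = [h_0, ..., h_N] feedforward.
    Shapes follow the paper: A is N x N, B is N x 1,
    C is No x N and D is No x 1 (No = 2 here).
    """
    N = len(g) - 1
    A = np.zeros((N, N), dtype=int)
    A[0, :] = g[1:]                         # feedback taps g_1..g_N
    A[1:, :-1] = np.eye(N - 1, dtype=int)   # shift sub-matrix
    B = np.zeros((N, 1), dtype=int)
    B[0, 0] = 1
    C = np.zeros((2, N), dtype=int)         # row 0: systematic output (all zeros)
    C[1, :] = [(h[i] + h[0] * g[i]) % 2 for i in range(1, N + 1)]
    D = np.array([[1], [h[0]]])             # systematic + parity input terms
    return A, B, C, D

def abcd_step(A, B, C, D, S, u):
    """One modulo-2 update: returns (S[n+1], Y[n])."""
    S_next = (A @ S + B * u) % 2
    Y = (C @ S + D * u) % 2
    return S_next, Y
```

Stepping this model over an input sequence reproduces, bit for bit, the output of the shift-register realization of Figure 1.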

RSC Encoders Parallelization Approach
Increasing the parallelism of an RSC encoder means making it able to process k inputs at a time. This means that, in the absence of puncturing, the RSC encoder produces k vectors Y[n] relative to the k inputs. In addition, it shall be considered that a k-dimensional input vector updates the state S[n] by k steps at a time. For this reason, a k-parallel RSC encoder can be described by the equivalent A′B′C′D′ model shown in Equation (12):

S[n + k] = (A′ S[n] + B′ U[n]) mod 2,
Y′[n] = (C′ S[n] + D′ U[n]) mod 2,    (12)

where U[n] = (u[n], ..., u[n + k − 1])^T is the vector of the k parallel inputs and A′, B′, C′ and D′ are the parallel matrices. In particular, the latter can be generated starting from the original A, B, C, D matrices by exploiting their properties and the information on the RSC encoder topology. Equations (13)-(16) show the expressions of the matrices A′, B′, C′ and D′ as functions of A, B, C, D:

A′ = A^k,    (13)
B′ = [A^(k−1)B | A^(k−2)B | ... | B],    (14)
C′ = [C; CA; ...; CA^(k−1)],    (15)
D′: block matrix whose (j, m) block is D for m = j, C A^(j−1−m) B for m < j, and 0 for m > j.    (16)
In particular, A′ shall be calculated by exploiting the properties of the A matrix. The latter can be derived directly from LFSR theory and are described in Equations (17) and (18):

A^(2^N − 1) = I_N×N,    (17)
A^k = A^(k mod (2^N − 1)),    (18)

where I_N×N is the N × N identity matrix and the operator mod indicates the modulo operation. Therefore, the sequence of A^k matrices is a finite set of 2^N − 1 matrices. Figure 3 shows the block diagram of a k-parallel RSC encoder.
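The parallel matrices can be derived mechanically by unrolling the state equations k times. The sketch below implements the standard state-space unrolling, which Equations (13)-(16) are assumed to express, and can be verified against k serial steps of the base ABCD model.

```python
import numpy as np

def parallel_abcd(A, B, C, D, k):
    """k-parallel matrices obtained by unrolling the state equations mod 2.

    Standard state-space unrolling (our reading of Eqs. (13)-(16)):
      A' = A^k
      B' = [A^(k-1)B | ... | B]
      C' = [C; CA; ...; CA^(k-1)]
      D' block (j, m) = D if m == j, C A^(j-1-m) B if m < j, else 0.
    """
    N, No = A.shape[0], C.shape[0]
    powers = [np.eye(N, dtype=int)]          # A^0, A^1, ..., A^k
    for _ in range(k):
        powers.append(powers[-1] @ A % 2)
    Ap = powers[k]
    Bp = np.hstack([powers[k - 1 - m] @ B % 2 for m in range(k)])
    Cp = np.vstack([C @ powers[j] % 2 for j in range(k)])
    Dp = np.zeros((k * No, k), dtype=int)
    for j in range(k):
        for m in range(j + 1):
            blk = D if m == j else C @ powers[j - 1 - m] @ B % 2
            Dp[j * No:(j + 1) * No, m:m + 1] = blk % 2
    return Ap, Bp, Cp, Dp
```

One parallel step with these matrices produces the same k output vectors and the same final state as k consecutive serial steps, which is exactly the equivalence exploited by the IP core.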
In Section 2.1 we underlined that it is possible to increase the code rate, and thus the efficiency of the telecommunication system, by puncturing the output code. It is important to underline that for a parallel RSC encoder there exist puncturing schemes whose implementation does not require any additional logic; on the contrary, they also allow reducing the complexity of the system. Indeed, let us express the puncturing scheme through a binary vector whose null elements represent the punctured outputs. The implementation of puncturing schemes whose representative vector has a length equal to the number of rows of the C′ and D′ matrices is then trivial: to realize the puncturing, it is sufficient not to implement the logic relative to the rows of C′ and D′ whose indexes correspond to the null elements of the puncturing scheme.
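The "free" puncturing described above amounts to plain row selection on C′ and D′, as in this minimal sketch (function name ours):

```python
import numpy as np

def puncture_rows(Cp, Dp, scheme):
    """Puncturing as row selection: a 0 in `scheme` means the
    corresponding coded bit is discarded, so the XOR network of
    that row of C' and D' is simply never instantiated."""
    keep = [i for i, s in enumerate(scheme) if s == 1]
    return Cp[keep, :], Dp[keep, :]
```

In hardware this corresponds to not synthesizing the trees of the dropped rows at all, which is why such schemes cost no extra logic.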

Parallel RSC Encoders Hardware Architecture
The parallel RSC encoder was implemented as a VHDL intellectual property (IP) core. The architecture of the system matches the block diagram shown in Figure 3. The IP core can be exploited to generate a generic RSC encoder, which is fixed by specifying g(x), the puncturing scheme vector, and the matrix containing the feedforward polynomials h_k(x). These vectors are necessary to generate the matrices A′, B′, C′, D′, whose content drives the construction of the data path logic. Indeed, let us pose Y′ = (c_0, c_1, ..., c_(k·N_o − 1)) and let us consider as an example the generation of the output c_i. By isolating the contributions of the rows C′_i and D′_i of the matrices C′ and D′ in Equation (12), such output is calculated as shown in the system of Equation (19):

c_iC = C′_i S[n] mod 2,  c_iD = D′_i U[n] mod 2,  c_i = (c_iC + c_iD) mod 2,    (19)

where c_iC and c_iD are respectively the contributions of the networks described by the rows C′_i and D′_i.
In particular, c_iD is produced by the network operating on the inputs of the RSC encoder, while c_iC is generated by the subsystem processing the internal flip-flop states. More specifically, only the inputs whose positions correspond to those of the unitary elements of the row D′_i contribute to c_iD.
In view of that, for each element c_iD a dedicated network is instantiated, which sums through exclusive OR (XOR) operations the inputs specified by the unitary elements of the row D′_i. The architecture of the network is designed to minimize the logical path from the inputs to the output: the XOR gates are linked in a tree fashion. In each layer of the tree, the elements of the previous layer are grouped into couples which feed a XOR gate; in case of an odd number of layer inputs, the last element is directly linked to the successive layer. When a matrix row D′_i is identically null, its contribution c_iD is forced to 0, the neutral element for the XOR operation, meaning that the row does not contribute to the output c_i.
Figure 4 shows the tree pattern of the network for the generation of c_iD. The approach described for the generation of the output bits c_i through the matrix rows C′_i and D′_i was also exploited to produce the inputs of the flip-flops through the rows A′_i and B′_i. More specifically, the same network topology described for the rows D′_i is exploited for all the other matrix rows.
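The layer-by-layer pairing rule can be captured in a few lines. The helper below (names are ours) counts the XOR gates and layers of one tree:

```python
def xor_tree_cost(n_inputs):
    """Gates and layers of the XOR tree reducing `n_inputs` bits.

    Each layer pairs its inputs into 2-input XOR gates; an odd
    leftover input is forwarded unchanged to the next layer.
    Returns (total_gates, depth).
    """
    gates, depth, n = 0, 0, n_inputs
    while n > 1:
        gates += n // 2           # one gate per pair in this layer
        n = n // 2 + n % 2        # gate outputs plus the forwarded input
        depth += 1
    return gates, depth
```

Note that the totals reduce to n_inputs − 1 gates and ceil(log2(n_inputs)) layers, which is the logarithmic logic depth exploited in the analysis of Section 2.4.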

Analysis of the Tree Network Topology as a Function of the Parallelism Degree
In Section 2.3 we described the VHDL IP core implementing a k-parallel RSC encoder. In order to characterize the IP core, let us consider the scenario where the RSC encoder is stimulated by a source producing k input data synchronously with the rising or falling edge of a clock signal with frequency f_clk. At the same time, let us suppose that the k · N_o output code is sampled by a sink synchronously with the same reference clock signal. Such a scenario is shown in Figure 5. The method to increase the parallelism degree illustrated in Section 2.2 aims to increase the capacity of an RSC encoder to process the data produced by the source quickly; the system throughput is therefore a merit parameter, and it is possible to consider the input data rate R_IN as such. On the contrary, the output data rate is meaningless for characterizing the processing speed of the system, since it depends on the code rate R.
In the scenario shown in Figure 5, R_IN can be calculated as described by Equation (20):

R_IN = k · f_clk.    (20)

Equation (20) shows that R_IN ∝ k. Such a relation might suggest that an increment of k leads to a proportional growth of R_IN. However, if we suppose to process data with the maximum clock frequency f_clkMAX which guarantees the correct sampling of the output code, it shall be considered that f_clkMAX depends on the critical path propagation delay T_p according to the set-up time rule [22], shown in Equation (21):

f_clkMAX = 1 / (T_cq + T_p + T_sup),    (21)

where T_sup is the set-up time of the sink register and T_cq is the time necessary for the source register to stabilize its output data after the clock edge.
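A minimal sketch combining Equations (20) and (21) follows; the default timing figures are placeholders of ours, not values measured in the paper.

```python
def max_input_rate(k, t_p, t_cq=0.5e-9, t_sup=0.3e-9):
    """R_IN = k * f_clkMAX, with f_clkMAX = 1 / (T_cq + T_p + T_sup)
    per the set-up time rule. Timing arguments are in seconds;
    the defaults are placeholder figures."""
    f_clk_max = 1.0 / (t_cq + t_p + t_sup)
    return k * f_clk_max
```

Because T_p grows with k (as argued next), doubling k does not double R_IN once the critical path lengthens.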
Although the networks described by the matrices A′ and C′ have a constant number of inputs with respect to the parallelism degree k, the dimension of the input vector of the networks described by B′ and D′ depends on k. This implies that the complexity of such networks grows with k; for this reason, it is reasonable to assume a dependency of T_p on the parallelism degree. It follows that f_clkMAX depends on k, making the relation R_IN[k] nonlinear.
In order to derive such a relation, it is possible to consider the tree network architecture described in Section 2.3. In particular, if we define T_p0 as the propagation delay of a single XOR gate, for such a topology the total propagation delay can be estimated by the expression shown in Equation (22):

T_p = T_p0 · ⌈log_2(Ω_M′i)⌉,    (22)

where Ω_M′i indicates the number of unitary elements present in the row i of a generic matrix M′. It is necessary to notice that Equation (22) does not take into consideration the delay due to the interconnections. In the same manner, it is possible to study the dependency of the number of resources on the parallelism degree k.
Indeed, if we suppose that the first layer of the tree has Ω_M′i[k] inputs, ⌊Ω_M′i[k]/2⌋ XOR gates are necessary for the first layer; the second layer then requires ⌊⌈Ω_M′i[k]/2⌉/2⌋ gates, and so on. Summing over the layers, the number of XOR gates composing a tree network is Ω_M′i[k] − 1, i.e., roughly proportional to Ω_M′i[k]. In particular, since for sufficiently high values of k the contribution of the A′ and C′ matrices is roughly constant, the number of XOR gates necessary for the realization of the entire RSC encoder can be estimated through Equation (23):

N_XOR[k] ≈ N_0 + N_1 · Ω_D′[k],    (23)

where N_0 and N_1 are constants to be determined. We shall also consider that in FPGA implementations the number of XOR gates does not in general match the number of slice LUTs used; in such conditions, the model described in Equation (23) represents a coarser estimate of the resource utilization.
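Equations (22) and (23) can be sketched as two small helpers; the affine form of the gate-count model is our reading of Equation (23), and the constants passed in the example are placeholders.

```python
import math

def tree_delay(omega_row, t_p0):
    """Eq. (22): delay of one XOR tree, one t_p0 per layer.

    omega_row is the number of unitary elements in the matrix row,
    i.e., the number of tree inputs.
    """
    return t_p0 * math.ceil(math.log2(omega_row)) if omega_row > 1 else 0.0

def xor_count_model(omega_k, n0, n1):
    """Affine gate-count model (our reading of Eq. (23)):
    N_XOR[k] ~ N_0 + N_1 * Omega[k]."""
    return n0 + n1 * omega_k
```

Both helpers take the row population Ω as input, which is why estimating Ω[k] is the key step of the case studies in Section 3.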

BER Performance Analysis and Implementation Results of some RSC Codes
In our previous work [16], we showed the BER = BER(E_b/N_0) curves of some RSC encoders with R = 1/2, where E_b is the average energy per bit and N_0 is the power spectral density of the white Gaussian noise process. These curves were produced through a MATLAB® simulation. Such RSC encoders were synthesized on the Zynq 7000 xc7z010clg400-1 FPGA by exploiting the architecture described in Section 2.3. Table 1 shows the chosen RSC codes and their results in terms of maximum clock frequency, input data rate and FPGA resources. To estimate the maximum clock frequency, input and output registers were included as shown in Figure 5, and a f_clk = 100 MHz clock constraint was imposed. Such registers are not considered in the slice register results reported in Table 1. Figure 6 shows the BER curves resulting from the MATLAB® simulation. Table 1 shows that an increment of the parallelism degree increases the input data rate R_IN, but requires a larger amount of resources, especially slice LUTs. Section 3.2 presents a deeper analysis of the R_IN[k] trend through a case study.

Impact of the Parallelism Degree on the Data Rate: Case Study
In this section, we present a case study that permits us to estimate the trend R_IN[k] for the RSC encoder described by the generators shown in Equation (24), by applying the analysis reported in Section 2.4.
To this aim, it is necessary to estimate the dependency f_clkMAX[k] of the maximum clock frequency. The first difficulty is that the critical path might involve a different register-logic-register path for every value of k. Even if this problem is real, in Section 2.4 we illustrated that for increasing values of k, only the networks relative to the B′ and D′ matrices grow in complexity. In view of that, it is reasonable to deduce that for sufficiently high values of k the critical path lies in one of the networks relative to the rows of the matrices B′ and D′. Such a hypothesis is confirmed by the FPGA implementation results shown in Table 2, which shows that for k ≥ 5 the critical path is included in the networks relative to the B′ and D′ matrices.
Even if it is probable that for different values k > 11 the critical path is not definitely included in the networks relative to a single matrix, it is necessary to consider that both the B′ and D′ networks are implemented by using the same topology, whose propagation delay versus parallelism degree trend can be described by Equation (22). For this reason, it is possible to approximate the function f_clkMAX[k] by using f_clkMAX_M′[k], which describes the dependency on k of the maximum clock frequency of the paths relative to a generic matrix M′. Therefore, f_clkMAX[k] can be estimated by using the expression described by Equation (25):

f_clkMAX[k] ≈ 1 / (T_A · ⌈log_2(Ω_M′[k])⌉ + T_B),    (25)

where T_A and T_B are parameters to be determined and where, similarly to Equation (22), the term Ω_M′[k] models the maximum number of unitary elements among all the networks relative to the matrix M′ for a fixed value of k (note the absence of the subscript i), as described by Equation (26):

Ω_M′[k] = max_i Ω_M′i[k].    (26)
The parameters T_A and T_B of Equation (25) can be estimated from the data shown in Table 2 through a mean square error (MSE) interpolation. In particular, the trend f_clkMAX[k] was approximated with the trend of the networks relative to the matrix D′. This is due to the fact that the maximum number of unitary elements among the rows D′_i is always higher than the maximum number of unitary elements among the rows B′_i for each value of k ∈ (1, ..., 100), as demonstrated through a MATLAB® simulation which calculated the maximum number of unitary elements of B′_i and D′_i for the different values of k.
The same simulation was exploited to estimate the relation Ω_D′[k] for the matrix D′. Such a trend is difficult to derive by simply exploiting Equations (11) and (16). Nevertheless, it is possible to realize an estimator Ω̂_D′[k] through a machine learning approach. First of all, the model described in Equation (27) was used:

Ω̂_D′[k] = ⌊θ_0 + θ_1 · k⌉,    (27)

where ⌊·⌉ indicates the rounding operation and θ_0 and θ_1 are the learning parameters. The previously calculated relation reporting the maximum number of unitary elements in the rows of the D′ matrix for each value of k in the range (1, ..., 100) was used to build a dataset. The latter was randomly partitioned into a training and a test dataset, whose dimensions are respectively 2/3 and 1/3 of the original one. The values of the learning parameters θ_0 and θ_1 were estimated through a mean square error approach on the training set, without considering the rounding operation.
A total of 20 iterations were performed; during each iteration, the random partition of the dataset was changed and the accuracy of the estimator, calculated as the percentage of correct predictions, was evaluated on the test dataset. At the end of the procedure, the learning parameters relative to the partition with the maximum accuracy on the test dataset were retained.
The best accuracy on the test dataset was 87.87%, and the obtained learning parameters are shown in Table 3. The estimator Ω̂_D′[k] was then used for the estimation of the parameters T_A and T_B. To increase the number of data points for the interpolation process, we also exploited the values of the maximum clock frequency of the networks relative to the matrix D′ for those values of k for which the system critical path was included in the network describing the rows of another matrix. Table 4 reports the T_A and T_B values derived through the described methodology.
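The MSE interpolation step can be sketched with an ordinary least-squares fit on the clock period, under the assumed model 1/f_clkMAX[k] = T_A · ⌈log2(Ω_D′[k])⌉ + T_B. The data used in the test below are synthetic, not the Table 2 measurements.

```python
import numpy as np

def fit_clock_model(k_values, f_max_measured, omega_of_k):
    """MSE fit of T_A, T_B in the assumed model
       1 / f_clkMAX[k] = T_A * ceil(log2(Omega_D'[k])) + T_B,
    obtained by ordinary least squares on the clock period."""
    levels = np.ceil(np.log2([float(omega_of_k(k)) for k in k_values]))
    X = np.column_stack([levels, np.ones_like(levels)])
    periods = 1.0 / np.asarray(f_max_measured, dtype=float)
    (t_a, t_b), *_ = np.linalg.lstsq(X, periods, rcond=None)
    return t_a, t_b
```

Fitting the period rather than the frequency keeps the model linear in T_A and T_B, so a closed-form least-squares solution exists.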

Impact of the Parallelism Degree on the Source Utilization: Case Study
In this section, we describe a methodology to estimate the dependency of the number of slice LUTs on the parallelism degree, by considering as a case study the RSC encoder shown in Equation (24).
By using an approach similar to the one described in Section 3.2, the N_0 and N_1 parameters of Equation (23) were estimated through an MSE approach. In particular, Ω_D′[k] was obtained from the estimator described in Equation (27) by exploiting the values of the learning parameters reported in Table 3. The interpolation was performed by using the data on slice LUT utilization obtained from the multiple implementations of the RSC encoder on board the Zynq 7000 xc7z010clg400-1 FPGA; such data are shown in Table 5. Table 6 shows the N_0 and N_1 values found. Owing to the real nature of the N_0 and N_1 parameters, the result was rounded to obtain an integer estimation. Figure 8 shows the estimated dependency of the number of slice LUTs on the parallelism degree.

Discussion
Data presented in Section 3.1 complete the results shown in our previous work [16] by providing information about the implementation of the various RSC encoders on FPGA. This provides an additional characterization of such encoders in terms of their speed performance and their resource occupation.
The most important contribution of this work lies in the analysis performed in Sections 3.2 and 3.3, which permitted us to extrapolate the dependency of the input data rate and of the FPGA slice LUTs on the parallelism degree. Such analysis, even if approximate, provides a complete description of the most important merit parameters of the implementation, allowing the choice of the value of k depending on the different application requirements.
In particular, it is useful to notice that an increment of k leads to a less than proportional improvement of the R_IN[k] value but requires a more than proportional increment of the number of used resources.
In addition, even if this analysis was performed for a specific RSC encoder, the methodology applied is general and the results remain valid: the estimated trends do not depend on the polynomials h_k(x) and g(x) but only on the topology of the network used.
The validity of such results might be compromised by modifications of the network topology; e.g., the insertion of pipeline registers would lead to a different R_IN[k] trend. Nevertheless, such an optimization is linked to the single implementation and is difficult to generalize.

Conclusions
This article presented a hardware IP core for the implementation of parallel RSC encoders. The architecture is based on the A′B′C′D′ model of an RSC encoder, which can be obtained through the methodology presented in our previous work [16]. The IP core associates to each matrix an equivalent hardware network, based on a tree pattern for the minimization of the logic path.
Through a case study and an analysis of the proposed topology, the article provided an estimation of the trends of the input data rate and of the slice LUT occupation depending on the parallelism degree which, together with the BER curves, provides a complete description of the merit parameters relevant for such devices.

Figure 4. Tree fashion network implementing the logic described in the row D′_i.

Figure 5. RSC encoder in a scenario with synchronous source and sink registers.

Figure 7 shows the estimated f_clkMAX[k] trend and the data used for the interpolation.

Figure 7. Estimated f_clkMAX[k] trend relative to the D′ matrix.

Equation (28) sums up the resulting expression for the R_IN[k] trend:

R_IN[k] = k / (T_A · ⌈log_2(Ω̂_D′[k])⌉ + T_B).    (28)

Figure 8. Estimated trend of the number of lookup tables (LUTs) as a function of k.

Table 2. Results of the case study synthesis on the Zynq 7000 xc7z010clg400-1 field programmable gate array (FPGA).

Table 3. Learning parameters for the Ω̂_D′[k] estimator.

Table 4. T_A and T_B values derived through a mean square error (MSE) interpolation.

Table 6. N_0 and N_1 values derived through a MSE interpolation.