Efficient Stochastic Computing FIR Filtering Using Sigma-Delta Modulated Signals

Abstract: This work presents a soft-filtering digital signal processing architecture based on sigma-delta modulators and stochastic computing. A sigma-delta modulator converts the high-resolution input signal into a single-bit stream, enabling filtering structures to be realized using stochastic computing's negligible-area multipliers. Simulation in the spectral domain demonstrates the filter's proper operation and its roll-off behavior, as well as the signal-to-noise ratio improvement obtained with the sigma-delta modulator compared to typical stochastic computing filter realizations. The proposed architecture's hardware advantages are showcased with synthesis results for two FIR filters using FPGA and Synopsys tools, while comparisons with standard stochastic computing-based hardware realizations, as well as with conventional binary ones, demonstrate its efficacy.


Introduction
Modern digital signal processing (DSP) blocks are characterized by hardware efficiency and high-performance computations [1,2]. On the other hand, standard binary computing methods impose constraints on their design specifications, namely power, area, and energy, which are continuously increasing given the rise of hardware-taxing emerging applications [2]. To this end, unconventional computing paradigms are explored as an alternative to the binary one, with stochastic computing (SC) being an attractive approach [3][4][5][6].
SC represents real-valued numbers in the form of stochastic sequences [7]. Encoding the information in such a way makes its processing resilient to soft errors originating from noisy sources [3,8], for instance bit-flips, which are prohibitive in the case of binary arithmetic. Furthermore, SC's bit-processing nature allows for the realization of fundamental arithmetic operations as well as highly complex functions using a few standard logic gates and cells, thereby reducing the hardware requirements when compared to their binary counterparts [3,4,9-12].
Although the SC filters realized using the methods considered in [27][28][29][30] reduce the hardware resources when compared to the standard binary approach, they are also limited in their spectral characteristics. Specifically, their performance is affected by two factors: (1) the length of the stochastic sequence required to process a single sample, which is SC's essential design trade-off, and (2) the noise introduced from the binary-to-stochastic converters used to generate the input signal and the filter's coefficients.
A well-known method to reduce the noise floor and improve the spectral characteristics of a quantized signal is to use sigma-delta modulators (SDMs) [32][33][34]. They convert a high-resolution signal (several bits) into a lower-bit one by employing the technique of oversampling; the input signal is sampled at a frequency much higher than the Nyquist rate, thus reducing the noise in the frequency band of interest.
Motivated by the properties of SDMs, the use of a first-order SDM in FIR filtering was recently explored in [26]. It serves as a single-bit encoder of an input (quantized) signal, allowing for the SC-based FIR filter's multipliers to be realized exclusively by simple logic gates. As such, the proposed SDM-SC architecture's advantage is twofold; on one hand it offers improved signal-to-noise ratio (SNR), due to the SDM's oversampling technique, which is not possible with conventional SC-based filtering; on the other hand, the SDM allows for the filtered signal to be encoded in time, therefore reducing the SC's typical accuracy-latency trade-off as well as the power and energy consumed for this process.
We organize the remainder of this work as follows. In Section 2, we provide a background on stochastic numbers and their properties, as well as the operation of the first-order SDM. In Section 3, we review the prior work in SC-based FIR filter realizations and mathematically formulate their operation. In Section 4, we present the SDM-SC architecture and explain its principle of operation through proper analysis. Section 5 includes experimental results with respect to (1) the SDM-SC architecture's spectral characteristics and (2) comparisons with the SC literature as well as the conventional binary filters in SNR, FPGA synthesis results, and hardware resources in a 45 nm process. Finally, Section 6 provides the conclusion.

Stochastic Computing and Sigma-Delta Modulation Notation and Principle Operation
In this section, we provide a background on the generation of stochastic numbers and their manipulation using standard logic gates, as well as explain the operation of the first-order sigma-delta modulator.

Stochastic Number Generation & Properties
The conversion of a binary number into a stochastic one is typically performed by the stochastic number generator (SNG), shown in Figure 1. Its operation is based upon the comparison, on each clock cycle, of the output of a pseudo-random number generator uniformly distributed in $R_s = \{0, 1, \ldots, 2^k - 1\}$ with the desired binary number $B \in [0, 1]$ of the same bit length. Usually, the pseudo-random number source is implemented as a $k$-bit linear feedback shift register (LFSR), but note that other methods can also be used [35]. The bit generation is completed after $N = 2^k$ clock cycles, which corresponds to the length of the stochastic sequence.
Formally, the generated sequence of length $N$, $\{X_n\}_{n=1}^{N}$, where $n$ denotes the current sample processed (or the current clock cycle), is assumed to be an independent and identically distributed (IID) Bernoulli sequence. Therefore, the generated stochastic number (SN) has probability defined as

$X \triangleq P(X_n = 1)$ (1)

and mean value

$\bar{X}_N = \frac{1}{N} \sum_{n=1}^{N} X_n$. (2)

The SN's mean value $\bar{X}_N$ represents a non-negative number in $[0, 1]$, known as unipolar format in SC, whereas to obtain a negative SN representation (known as bipolar format), the transformation $X \to 2X - 1$ is used, expanding the range of the SN to the interval $[-1, 1]$. As expected, in both formats the sequence length $N$ plays a critical role in the accuracy of the SN: the accuracy increases with $N$ at the cost of additional clock cycles, with the estimation error being inversely proportional to $\sqrt{N}$ [36]. To further investigate the equivalent noise introduced by an SN, one can consider the noise figure (NF) [37], defined as

$\mathrm{NF} \triangleq 10 \log_{10} \left( \frac{P_S}{P_N} \right)$, (3)

where $P_S$ and $P_N$ are the average signal power and noise power, respectively, of the generated SN.
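The comparator-based generation and the role of the sequence length can be illustrated with a minimal Python sketch. It is not the hardware SNG of Figure 1: Python's `random` module stands in for the LFSR, and the names `sng`, `mean_value`, and `to_bipolar` are hypothetical helpers introduced only for this illustration.

```python
import random

def sng(b, n, rng):
    """Return n bits whose probability of being 1 is b (unipolar format)."""
    # One comparison per clock cycle: emit 1 when the random sample falls below b.
    # Assumption: a software uniform generator replaces the k-bit LFSR.
    return [1 if rng.random() < b else 0 for _ in range(n)]

def mean_value(bits):
    """Estimate the encoded number as the fraction of 1s in the sequence."""
    return sum(bits) / len(bits)

def to_bipolar(x):
    """Map a unipolar value in [0, 1] to the bipolar range [-1, 1]."""
    return 2.0 * x - 1.0

rng = random.Random(0)
target = 0.3
for k in (6, 10, 14):
    n = 2 ** k  # sequence length N = 2^k
    estimate = mean_value(sng(target, n, rng))
    print(f"k={k:2d}  N={n:5d}  estimate={estimate:.4f}  |error|={abs(estimate - target):.4f}")
```

Longer sequences should, on average, give smaller errors, in line with the $1/\sqrt{N}$ behavior noted above; individual runs fluctuate because the sequences are random.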

Mathematical Properties of Logic Gates in Stochastic Computing
Fundamental mathematical operations are supported in the context of SC and can be implemented by simple logic gates, according to the format used. For the following, we assume that $\{X_n\}_{n=1}^{N}$ and $\{Y_n\}_{n=1}^{N}$ are stochastic sequences generated by different SNGs, to ensure independence among them, and $\{H_n\}_{n=1}^{N}$ is the result of their operation. We also note that whenever the bipolar format is required, the transformation $X \to 2X - 1$ is used.

• NOT Gate: In unipolar format, the output of the NOT gate, $H_n = \mathrm{NOT}(X_n)$, complements the probability of the input, i.e., $H = 1 - X$ (4), whereas in the bipolar format, it operates as a sign inverter, i.e., $H = -X$ (5).
• AND Gate: The AND gate in unipolar format, $H_n = \mathrm{AND}(X_n, Y_n)$, performs multiplication, i.e., $H = P(X_n = 1, Y_n = 1) = XY$ (6).
• XNOR Gate: The XNOR gate in bipolar format, $H_n = \mathrm{XNOR}(X_n, Y_n)$, performs multiplication, i.e., $H = XY$ (7), with $X$, $Y$, and $H$ taken as bipolar values.

• Multiplexer: Assuming an IID control sequence $\{C_n\}_{n=1}^{N}$, the multiplexer (MUX), $H_n = \mathrm{MUX}(X_n, Y_n; C_n)$, is the standard way to perform scaled addition between two SNs, regardless of the format used, and is given as

$H = P(H_n = 1) = P(X_n = 1, C_n = 1) + P(Y_n = 1, C_n = 0) = P(X_n = 1)P(C_n = 1) + P(Y_n = 1)P(C_n = 0)$. (8)

Furthermore, if $P(C_n = 1) = 1/2$, the MUX operates as a scaling adder, i.e.,

$H = \frac{1}{2}(X + Y)$. (9)

Stochastic subtraction, on the other hand, can only be realized in the bipolar format, using a NOT gate in one of the two inputs, as

$H = \frac{1}{2}(X - Y)$. (10)

The operation of the logic gates of specific interest is illustrated with an example in Figure 2.
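The gate-level identities above can be checked numerically with a short Python sketch. It assumes ideal independent Bernoulli streams produced by Python's `random` module rather than LFSR-based SNGs, and the helper names (`sng`, `unipolar`, `bipolar`, `bipolar_sng`) are hypothetical.

```python
import random

def sng(p, n, rng):
    """Independent Bernoulli stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def unipolar(bits):
    """Value of a stream in unipolar format."""
    return sum(bits) / len(bits)

def bipolar(bits):
    """Value of a stream in bipolar format (X -> 2X - 1)."""
    return 2.0 * unipolar(bits) - 1.0

def bipolar_sng(value, n, rng):
    """Encode a bipolar value in [-1, 1] as a stream with P(1) = (value + 1) / 2."""
    return sng((value + 1.0) / 2.0, n, rng)

rng = random.Random(1)
N = 1 << 14

# Unipolar examples: X = 0.6, Y = 0.5, control C with P(C = 1) = 1/2.
x, y, c = sng(0.6, N, rng), sng(0.5, N, rng), sng(0.5, N, rng)
not_x = [1 - a for a in x]                           # 1 - X
and_xy = [a & b for a, b in zip(x, y)]               # X * Y
mux_xy = [a if s else b for a, b, s in zip(x, y, c)] # (X + Y) / 2

# Bipolar examples: U = 0.4, V = -0.5.
u, v = bipolar_sng(0.4, N, rng), bipolar_sng(-0.5, N, rng)
xnor_uv = [1 - (a ^ b) for a, b in zip(u, v)]                # U * V
sub_uv = [a if s else 1 - b for a, b, s in zip(u, v, c)]     # (U - V) / 2

print("NOT  :", round(unipolar(not_x), 3), "expected about 0.4")
print("AND  :", round(unipolar(and_xy), 3), "expected about 0.30")
print("MUX  :", round(unipolar(mux_xy), 3), "expected about 0.55")
print("XNOR :", round(bipolar(xnor_uv), 3), "expected about -0.20")
print("SUB  :", round(bipolar(sub_uv), 3), "expected about 0.45")
```

The printed values are estimates and fluctuate around the expected ones with the random seed and the sequence length.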

Correlation in Stochastic Computing
The proper operation of SC processing elements is based upon the assumption that different SNGs are used to generate their input sequences. This is because LFSRs initialized with the same seed cycle through the values in $R_s$ simultaneously, thereby generating maximally correlated sequences (overlap of logic 1s) given the same binary number for conversion [38][39][40]. To provide better insight into this, consider the case where the multiplication of two SNs is desired, i.e., $H = XY$, with $X = Y = 0.6$. If SNGs with the same initial LFSR seed are used, then two identical sequences will be generated and their multiplication will result in $H = 0.6$, instead of the correct value $H = 0.36$. The same also happens when the LFSR is shared among the SNGs without different seed initialization or proper use of delays. However, it has been shown that in certain cases, maximally correlated sequences can benefit specific applications [3,10], offering promising results.
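A standard measure of correlation in SC is the stochastic computing correlation (SCC) [3,38]. For any two sequences $\{X_n\}$ and $\{Y_n\}$ with $X = P(X_n = 1)$ and $Y = P(Y_n = 1)$, $\mathrm{SCC}(X_n, Y_n)$ is calculated as

$\mathrm{SCC}(X_n, Y_n) = \begin{cases} \dfrac{P(X_n = 1, Y_n = 1) - XY}{\min(X, Y) - XY}, & \text{if } P(X_n = 1, Y_n = 1) > XY \\ \dfrac{P(X_n = 1, Y_n = 1) - XY}{XY - \max(X + Y - 1, 0)}, & \text{otherwise} \end{cases}$ (11)

taking values in $[-1, 1]$, with $\mathrm{SCC}(X_n, Y_n) = 0$ corresponding to uncorrelated sequences.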
Note that SCC can also be used to measure the auto-correlation of an output sequence, by considering the current output sample $H_n$ and its version delayed by $r > 0$ samples, $H_k$, with $k = n + r$, as $\mathrm{SCC}(H_n, H_k)$.
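The effect of a shared random source on multiplication, and the SCC value that flags it, can be reproduced with the following Python sketch. The SCC expression follows the standard definition from the SC literature; the random source is Python's `random` module instead of an LFSR, and the function names are hypothetical.

```python
import random

def compare_stream(rand_values, p):
    """Comparator-style SNG: one output bit per random sample."""
    return [1 if r < p else 0 for r in rand_values]

def scc(x, y):
    """Stochastic computing correlation of two equal-length bit streams."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    delta = pxy - px * py
    if delta > 0:
        denom = min(px, py) - px * py
    else:
        denom = px * py - max(px + py - 1.0, 0.0)
    # Degenerate streams (all 0s or all 1s) are treated as uncorrelated here.
    return 0.0 if denom == 0 else delta / denom

rng = random.Random(2)
N = 1 << 14
r_shared = [rng.random() for _ in range(N)]  # one random source reused by two SNGs
r_other = [rng.random() for _ in range(N)]   # a second, independent source

x = compare_stream(r_shared, 0.6)
y_same = compare_stream(r_shared, 0.6)   # same source: identical to x
y_indep = compare_stream(r_other, 0.6)   # different source

and_same = sum(a & b for a, b in zip(x, y_same)) / N
and_indep = sum(a & b for a, b in zip(x, y_indep)) / N

print("shared source     : AND =", round(and_same, 3), " SCC =", round(scc(x, y_same), 3))
print("independent source: AND =", round(and_indep, 3), " SCC =", round(scc(x, y_indep), 3))
```

With the shared source the AND gate returns roughly 0.6 and the SCC equals 1, whereas the independent source gives roughly 0.36 and an SCC close to 0.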

The First-Order Sigma-Delta Modulator
A sigma-delta modulator (SDM) is typically used to convert a higher-resolution analog or digital signal into a lower-bit one. Its main advantage is the exploitation of the oversampling technique, which pushes the in-band quantization noise outside the input signal's frequency band of interest. This is accomplished by sampling the input signal at a rate $f_s$ much higher than the Nyquist rate. The oversampling ratio (OSR) is defined as

$\mathrm{OSR} = \frac{f_s}{2 f_B}$, (12)

where $f_B$ is the maximum frequency of the input signal. Oversampling has a direct impact on the spectral quality of the modulator; more specifically, increasing the OSR improves the modulator's signal-to-noise ratio (SNR) [34].
The first-order single-bit SDM is shown in Figure 3. It comprises an adder and an integrator, followed by a two-level quantizer block, $Q(\cdot)$, which behaves as a nonlinear function as

$Q(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$ (13)

According to the SDM of Figure 3, the modulator's input $U_n$, with $n$ being the time index, is of $m$-bit length, whereas its output $V_n$ is single-bit, $\pm 1$. Therefore, the behavior of the first-order SDM can be expressed as

$S_n = S_{n-1} + U_n - V_{n-1}, \quad V_n = Q(S_n)$, (14)

where $S_n$ denotes the integrator's state.

Figure 3. The first-order single-bit sigma-delta modulator. A dithering sequence can optionally be used.

Prior Work in Stochastic Computing FIR Filters
For an arbitrary discrete-time signal $U_t$, $t = 1, 2, \ldots, T$, and filter coefficients $w_m$, $m = 1, 2, \ldots, M$, an $M$-tap finite impulse response (FIR) filter is described as

$Y_t = \sum_{m=1}^{M} w_m U_{t-m}$, (15)

where $Y_t$ denotes the filter's output. The realization of Equation (15) using SC techniques requires $M - 1$ XNOR gates for multiplication, as the values of $w_m$ and of the stochastic input samples $V_{t-m}$ might be negative, and an $M$-to-1 MUX for addition. However, this implementation leads to the downscaling of the output by a factor of $1/M$, which causes severe accuracy loss, especially when the filter's order $M$ is large.
To address this problem, a stochastic FIR filter that uses a MUX adder-tree was proposed in [41]; it is based on the inner-product processing unit shown in Figure 4. Instead of the standard method of performing multiplication in bipolar format using XNOR gates, the sign of the weights is also considered, and thus the multiplications are realized using XOR gates. To explain the signed XOR gate's operation as an SC multiplier in bipolar format, assume first a two's complement binary representation of a signed-value weight $w_m$, where its most significant bit serves as its sign, i.e.,

$\mathrm{sign}(w_m) = \begin{cases} 0, & w_m \geq 0 \\ 1, & w_m < 0 \end{cases}$

Considering now a sample of the input signal $V_t$ converted into a stochastic sequence with probability $P(V_{t,n} = 1)$, the output of an XOR gate is

$P(G_{t,n} = 1) = \mathrm{sign}(w_m) + P(V_{t,n} = 1) - 2\,\mathrm{sign}(w_m) P(V_{t,n} = 1)$,

which is simplified to a multiplication of $V_t$ by the sign of $w_m$, given the definitions of the NOT gate in Section 2: for $\mathrm{sign}(w_m) = 0$ the input passes unchanged, whereas for $\mathrm{sign}(w_m) = 1$ it is complemented, which inverts its sign in the bipolar format. Finally, using an uneven control signal with probability $P(C_n = 1) = |w_0|/(|w_0| + |w_1|)$, the output of the MUX produces the inner product of the two weights with their respective inputs, scaled by $1/(|w_0| + |w_1|)$. Based on the inner-product module of Figure 4 and the former analysis, an $M$-tap stochastic FIR filter can be realized, with its output scaled, however, by a factor of $1/\sum_{m=0}^{M-1} |w_m|$. It requires, in total, M XNOR gates, $M - 1$ MUXs, and $2M - 1$ SNGs. A representative example of a five-tap FIR filter implemented with the aforementioned inner-product module is shown in Figure 5.
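A two-input instance of the inner-product module can be sketched in Python as follows. This is an illustration under simplifying assumptions (ideal independent streams from Python's `random` module, a second input stream standing in for a delayed sample, hypothetical function names), not the implementation of [41].

```python
import random

def sng(p, n, rng):
    """Independent Bernoulli stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def bipolar_value(bits):
    """Value of a stream in bipolar format."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def inner_product_2tap(w0, w1, v0_bits, v1_bits, rng):
    """XOR each input with its weight's sign bit, then select with a weighted MUX."""
    s0 = 1 if w0 < 0 else 0  # sign bit of w0 (MSB in two's complement)
    s1 = 1 if w1 < 0 else 0  # sign bit of w1
    p_sel = abs(w0) / (abs(w0) + abs(w1))  # P(C_n = 1)
    out = []
    for a, b in zip(v0_bits, v1_bits):
        g0 = a ^ s0  # unchanged if w0 >= 0, complemented (sign-inverted) otherwise
        g1 = b ^ s1
        out.append(g0 if rng.random() < p_sel else g1)
    return out

rng = random.Random(3)
N = 1 << 14
w0, w1 = 0.6, -0.3   # example weights (assumed values)
v0, v1 = 0.5, 0.8    # example bipolar input values (assumed values)

v0_bits = sng((v0 + 1.0) / 2.0, N, rng)
v1_bits = sng((v1 + 1.0) / 2.0, N, rng)
out_bits = inner_product_2tap(w0, w1, v0_bits, v1_bits, rng)

expected = (w0 * v0 + w1 * v1) / (abs(w0) + abs(w1))
print("measured:", round(bipolar_value(out_bits), 3), " expected about", round(expected, 3))
```

The measured value approximates the inner product divided by $|w_0| + |w_1|$, which is the scaling factor discussed above.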

Proposed SDM-SC Processing Scheme
In this section, we present the proposed SDM-SC architecture as it was introduced in [26]; it is shown in Figure 6. It converts a multi-bit input signal into a single-bit one using a first-order SDM to exploit the SC's encoding and benefit from its low-area advantages. Moreover, it allows for time-encoding and, consequently, processing of the input signal, therefore bypassing the long latency of the standard SC approaches. To proceed with the detailed analysis of the architecture, we start from the first-order SDM.

Figure 6. Proposed SDM-SC architecture. The first-order SDM encodes a multi-bit input signal into a single-bit one, carrying the information in {0, 1} representation. The sequence is then processed by an SC-based M-tap FIR filter. A dithering sequence can optionally be used.

SDM Encoding
In the architecture of Figure 6, the SDM block is the digital realization of the system-level one shown in Figure 3. Its input $U_n$ is of $m$-bit length, whereas $V_n$ is the single-bit output. A register of $c$ bits can replace the integrator of Figure 3, allowing for the SDM's iterative behavior to be expressed according to Equation (14). With respect to its size (in bits), it should be noted that $c \geq m$ so as to account for the accumulation process, with a typical value of $c = m + 1$.
The quantization process of the SDM in Figure 3 can be modeled simply as the register's most significant bit (MSB): the current input sample, $U_n$, determines whether the accumulator's current value is positive or negative, corresponding to an MSB of 0 or 1, respectively. Finally, it should be noted that an MSB of 1, which denotes a negative value of $V_n$, is converted into $-1$ using sign-extension methods.
The SDM's maximum operating frequency f s corresponds to the register's one and, as expected, determines the input signal's maximum operating frequency f B that the architecture is able to process. Optionally, a dithering sequence can be employed to further decrease the SDM's output noise floor [32][33][34].
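A register-based first-order SDM of this kind can be sketched with integer arithmetic in Python. The update rule used here (accumulate the input minus the previous full-scale feedback, then take the sign) is the textbook first-order form and is an assumption about the exact recursion; the function name and the returned peak value are introduced only for illustration.

```python
import math

def sdm_first_order(samples, m):
    """First-order single-bit SDM on signed m-bit integer samples.

    Returns the +/-1 output sequence and the largest accumulator magnitude seen.
    """
    full_scale = 1 << (m - 1)  # feedback magnitude, 2^(m-1)
    acc = 0                    # accumulator register replacing the integrator
    feedback = 0               # previous output scaled to full scale
    out, peak = [], 0
    for u in samples:
        acc += u - feedback
        peak = max(peak, abs(acc))
        msb = 1 if acc < 0 else 0  # sign bit of the accumulator
        v = -1 if msb else 1       # MSB = 1 is mapped to -1
        out.append(v)
        feedback = v * full_scale
    return out, peak

m = 15
fs = 1 << (m - 1)

# Constant input at one quarter of full scale: the output average should match it.
dc_out, dc_peak = sdm_first_order([fs // 4] * 4000, m)
print("DC input 0.25 -> output mean:", sum(dc_out) / len(dc_out))

# Slow sinusoid at 0.9 of full scale (frequency is fractional w.r.t. the sampling rate).
sine = [int(round(0.9 * (fs - 1) * math.sin(2 * math.pi * 0.001 * n))) for n in range(4000)]
sine_out, sine_peak = sdm_first_order(sine, m)
print("peak |accumulator| for the sinusoid:", sine_peak, " limit 2^m:", 1 << m)
```

In this sketch the accumulator magnitude never exceeds $2^m$, which is consistent with the typical choice $c = m + 1$ for the register size.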

Stochastic FIR Filter
According to Equation (15), the binary implementation of an $M$-tap FIR filter requires $M - 1$ D flip-flops, $M$ binary multipliers of $(m + l)$-bit length, where $m$ and $l$ are the input signal's and the coefficients' bit resolutions, respectively, and a binary adder of $m + l + \log_2(M) - 1$ bits to avoid overflows, based on the guidelines in [42].
The architecture of Figure 6 exploits the SDM's {0, 1} encoding of the input signal, allowing for the $M$ binary multipliers of $(m + l)$-bit length to be replaced by $M$ AND gates. Furthermore, in contrast to the standard adder method used in SC, which is based upon the realization of an adder tree using the inner-product processing block of Figure 4, we use a simple binary adder of $N = \lceil \log_2 M \rceil$ bits. As such, the value of $Z_n$ is binary and belongs in $\{0, 1, \ldots, M - 1\}$. At this point, we note that one can also explore single-bit output implementations of the proposed architecture, using several non-scaling adders available in the SC literature [11,20,43].
To proceed with further analysis, we assume that each weight is converted into a stochastic sequence with probability $w_m = P(w_{m,n} = 1)$, with $m = 1, 2, \ldots, M$, and also that the probability of each AND gate's output is

$P(w_{m,n} V_{n-m} = 1) = P(w_{m,n} = 1)\, P(V_{n-m} = 1)$.

Considering the above and the architecture of Figure 6, the instantaneous value of the output $Z_n$ is the sum of the multiplications and is given as

$Z_n = \sum_{m=1}^{M} w_{m,n} V_{n-m}$.
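The data path described in this subsection (SDM bit stream, delay line, one AND gate per tap, binary adder) can be sketched in Python as follows. The sketch uses the 5-tap weights of Table 1, a textbook first-order SDM recursion, and ideal random weight streams from Python's `random` module; these choices and all function names are assumptions made for illustration.

```python
import random

WEIGHTS_5TAP = [0.7, 0.6, 0.9, 0.6, 0.7]  # 5-tap weights listed in Table 1

def sdm_bits(samples):
    """First-order SDM on values in [-1, 1]; output returned in {0, 1} form."""
    acc, feedback, bits = 0.0, 0.0, []
    for u in samples:
        acc += u - feedback
        v = 1.0 if acc >= 0 else -1.0
        bits.append(1 if v > 0 else 0)
        feedback = v
    return bits

def sdm_sc_fir(bits, weights, rng):
    """AND each delayed SDM bit with a stochastic weight bit and add the results."""
    delay = [0] * len(weights)  # current bit plus M - 1 delayed bits
    out = []
    for b in bits:
        delay = [b] + delay[:-1]
        z = 0
        for tap, w in zip(delay, weights):
            w_bit = 1 if rng.random() < w else 0  # stochastic bit of the weight
            z += tap & w_bit                      # AND gate acting as multiplier
        out.append(z)                             # binary adder output Z_n
    return out

rng = random.Random(4)
n_samples = 20000
bits = sdm_bits([0.25] * n_samples)          # constant input for an easy check
z = sdm_sc_fir(bits, WEIGHTS_5TAP, rng)

mean_bits = sum(bits) / len(bits)            # (1 + 0.25) / 2 = 0.625 for this input
expected = sum(WEIGHTS_5TAP) * mean_bits
print("mean of Z_n:", round(sum(z) / len(z), 4), " expected about", round(expected, 4))
```

For a constant input, the time average of $Z_n$ approaches the sum of the weights multiplied by the average of the SDM bit stream, which gives a quick sanity check of the AND-plus-adder structure.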

Stochastic Coefficient Generation
The conversion of the filter's coefficients w m to stochastic numbers in the architecture of Figure 6 requires M SNGs. This, however, implies the use of M LFSRs of k-bits, which, along with their respective k-bit comparators, are hardware-taxing (in total, k × M registers). On the other hand, simple sharing of a single LFSR, as the random number generator between all SNGs, introduces maximal correlation among the generated sequences according to Equation (11), and thus it is expected to degrade the filtered signal's spectral quality [26,41].
In order to reduce the hardware resources from the SNGs, we employ the LFSR circular shifting scheme proposed in [44], shown in Figure 7. This scheme uses a single LFSR to generate $M$ stochastic sequences in parallel without them being maximally correlated. This is achieved since the LFSR cycles through all of its values within $R_s$ only once. Therefore, if $R_{n,i}$ is the LFSR's current binary value $n$ at time index $i = 1, 2, \ldots, M$, the circular shift by $s \in \mathbb{N}^*$ bits, where $s < k$, produces the next value $R_{n,i+1}$ as $R_{n,i+1} \triangleq R_{(n-s,i)_N} = R_{n-s,i} \bmod N$.
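The idea of deriving several comparator inputs from one LFSR by circular shifting can be sketched in Python as follows. This is one plausible reading of the scheme, not the exact circuit of [44]: the feedback polynomial $x^{15} + x^{14} + 1$, the rotation amount per stream, the seed, and all function names are assumptions made for illustration.

```python
K = 15                 # LFSR width in bits
MASK = (1 << K) - 1

def lfsr_step(state):
    """One step of a Fibonacci LFSR with polynomial x^15 + x^14 + 1 (assumed)."""
    bit = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | bit) & MASK

def rotl(value, s):
    """Circular left shift of a K-bit word by s bits."""
    s %= K
    return ((value << s) | (value >> (K - s))) & MASK

def lfsr_period(seed=1):
    """Number of steps until the LFSR returns to its seed."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        steps += 1
    return steps

def shared_lfsr_streams(weights, s, seed=1):
    """Generate one stream per weight; stream i compares the LFSR word rotated by i * s bits."""
    thresholds = [round(w * (1 << K)) for w in weights]
    streams = [[] for _ in weights]
    state = seed
    for _ in range(MASK):  # one full LFSR period
        for i, t in enumerate(thresholds):
            streams[i].append(1 if rotl(state, i * s) < t else 0)
        state = lfsr_step(state)
    return streams, thresholds

weights = [0.7, 0.6, 0.9, 0.6, 0.7]  # 5-tap weights listed in Table 1
streams, thresholds = shared_lfsr_streams(weights, s=3)
for w, bits in zip(weights, streams):
    print("weight", w, "-> stream mean", round(sum(bits) / len(bits), 4))
```

Because a rotation only permutes the bits of a word, every rotated sequence still visits each nonzero $k$-bit value exactly once per period, so each stream keeps the intended probability while the streams are no longer identical copies of each other.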

Experimental Results
In this section, we demonstrate the performance of the proposed SDM-SC architecture. Specifically, we show experimental results with respect to (1) its spectral characteristics, and (2) comparisons with standard SC-based approaches, as well as with the conventional binary one, in SNR and hardware resources.

Performance of the Proposed SDM-SC Architecture in the Spectral Domain
Here, we evaluate the performance of the proposed SDM-SC architecture in the spectral domain with simulations using MATLAB. We consider a sinusoidal input signal $U_n = \sin(2\pi f_B n)$, and test it over two FIR filters, a 5-tap and a 7-tap one. The simulation parameters, including the filters' weights, are summarized in Table 1.
It is important to note that all frequencies are fractional with respect to the sampling frequency $f_s$. Moreover, to generate the sequences of the weights, LFSRs with an initial register size of 15 bits are used, corresponding to sequences of length $2^{15}$.

Table 1. Simulation parameters.

Parameter Name              Parameter Value
Input signal U_n            sin(2π f_B n)
5-tap FIR filter weights    w_0 = w_4 = 0.7, w_1 = w_3 = 0.6, w_2 = 0.9
7-tap FIR filter weights    w_0 = w_6 = 0.6, w_1 = w_5 = 0.4, w_2 = w_4 = 0.3, w_3 = 0.9

In Figures 8 and 9, the power spectral density (PSD) (top) and frequency response (bottom) for the 5-tap and 7-tap FIR filters are respectively shown. To calculate the power spectral density, MATLAB's pwelch function was used and $10^6$ samples were considered. With respect to the frequency response, it can be observed that the SDM-SC architecture follows the conventional one, correctly achieving the cut-off frequency $\omega_{-3\,\mathrm{dB}}$.
To showcase the ability of the proposed SDM-SC architecture to achieve improved performance, we show the power spectral density of the SDM's output in Figure 10. It is calculated using MATLAB's pwelch function with $10^6$ samples. As one can observe, the SDM reduces the noise floor at low frequencies.

Signal-to-Noise Ratio Comparisons
Here, we show the advantage of the proposed SDM-SC architecture over the standard SC-based implementations with SNR comparisons. We realize the 5-tap and 7-tap FIR filters using the inner-product module of Figure 4, with their coefficient values taken from Table 1. With respect to the conventional binary approach, we assumed the existence of round-off noise in the coefficients and in the input signal, where the bit resolution is selected to be 15 bits. In Table 2, the SNR comparison between the two approaches is shown, accompanied by the conventional binary one. According to the results shown in Table 2, the proposed SDM-SC filtering scheme achieves better SNR than the filtering realized using the inner-product module of Figure 4, which is due to the SDM's oversampling. On the other hand, compared to the traditional binary approach, the SDM-SC architecture achieves lower SNR. Yet, its low-area advantages as an approximate filtering scheme are highlighted in the following subsection.

FPGA Synthesis Results and Comparison
The low-area benefits of the proposed SDM-SC architecture are demonstrated here. We compared the hardware resources it requires to realize the two FIR filters with 5 and 7 taps against those of the conventional binary and the inner-product approaches, all synthesized in Xilinx's Vivado Design Suite targeting the Kintex-7 FPGA KC705 device. We considered a $k = 15$-bit resolution of the input signal $U$, which also corresponds to the size of the LFSR used to generate the weights $w_m$. The hardware utilization results are given in Table 3. For the results shown, we note the following: (1) for the conventional binary approach, the DSP blocks are converted into their LUT equivalents to have a uniform comparison of the resources among the approaches considered, and (2) the SNGs are included in the utilization results. According to the results shown in Table 3, the proposed SDM-SC architecture can realize a 5-tap FIR filter with 29 LUTs and 35 registers. In contrast, the inner-product FIR approach based on [29,30,41] requires only a few LUTs more, namely 35, but also 143 register slices more, which is due to the SNGs required for (1) the MUXs and (2) the generation of the input signal's delays. The conventional binary 5-tap FIR filter requires an additional 669 LUTs and 25 slice registers (in total, 698 LUTs and 60 slice registers), making the SDM-SC architecture a hardware-favored approach, as it reduces the LUTs by 96% and the slice registers by 42% relative to the binary filter.
To realize the 7-tap filter, the SDM-SC architecture requires only two registers and one LUT more; increasing the order of the filter by two requires two delays, corresponding to two flip-flops as a result of the SDM's single-bit encoding, while the one-LUT increase is due to the additional wiring required to output the result. The inner-product approach, however, requires 60 slice registers more to increase the filter's order by two, which is due to the four SNGs required for the addition of an inner-product block as well as the MUX producing the output. An increase in the hardware resources is also observed when the conventional binary FIR filter's order is increased, amounting to 209 LUTs and 30 register slices more. As such, the SDM-SC architecture reduces, in this case, the binary realization's LUT and slice register utilization by 96.7% and 60%, respectively.
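The reduction percentages can be recomputed from the resource counts quoted in the text with a few lines of Python. The 7-tap counts below are derived by adding the quoted increments to the 5-tap totals, which is an assumption about how the values of Table 3 relate to each other; the function name is hypothetical.

```python
def reduction_percent(proposed, baseline):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (1.0 - proposed / baseline)

# 5-tap counts quoted in the text.
sdm5 = {"lut": 29, "reg": 35}
bin5 = {"lut": 29 + 669, "reg": 35 + 25}   # 698 LUTs, 60 slice registers

# 7-tap counts derived from the quoted increments (assumption).
sdm7 = {"lut": sdm5["lut"] + 1, "reg": sdm5["reg"] + 2}
bin7 = {"lut": bin5["lut"] + 209, "reg": bin5["reg"] + 30}

for name, sdm, binary in (("5-tap", sdm5, bin5), ("7-tap", sdm7, bin7)):
    lut = reduction_percent(sdm["lut"], binary["lut"])
    reg = reduction_percent(sdm["reg"], binary["reg"])
    print(f"{name}: LUT reduction {lut:.1f}%, register reduction {reg:.1f}%")
```

With these inputs the 5-tap figures come out at about 95.8% and 41.7%, matching the quoted 96% and 42%, and the 7-tap LUT figure at about 96.7%. The 7-tap register figure computed this way is about 58.9%, slightly below the quoted 60%, so the exact Table 3 entries may differ from the derived counts.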

Hardware Resources Comparison in a 45 nm Technology
To proceed further with the hardware comparisons, here we provide the resources required to realize the two FIR filters with 5 and 7 taps, extracted using the Synopsys Design Compiler with the FreePDK CMOS library at 45 nm [45]. For the comparisons, the following estimates are provided: (1) the total area in µm²; (2) the average power consumption in mW; (3) the delay in ns; (4) the energy consumption in pJ, defined as the average power × delay product. In Table 4, the results are shown in detail.

Table 4. Comparison of hardware resources for the realization of two FIR filters with 5 and 7 taps in area (µm²), power (mW), delay (ns), and energy (pJ).

Focusing on the realization of the 5-tap FIR filter, when compared to the inner-product and conventional binary approaches, the proposed SDM-SC architecture reduces, respectively, (1) the total area occupied by 81.9% and 93.7%, and (2) the total power and energy consumption by 74.7% and 85.5%. For the realization of the 7-tap FIR filter, the proposed SDM-SC architecture reduces, respectively, (1) the total area occupied by 85.8% and 94.8%, and (2) the total power and energy consumption by 81.09% and 88.34%. As such, from Table 4, one can conclude that the proposed SDM-SC architecture is the most efficient, hardware-wise, as it results in the least occupied area and the lowest power and energy consumption among the considered approaches.
Of great interest is the number of resources required to increase the filter's order. The SDM-SC architecture requires only 83.54 µm², 0.08 mW, and 0.11 pJ, corresponding to an increase of the resources by 11.9%, 9.6%, and 8.8%, respectively. On the other hand, the inner-product approach requires 1509.8 µm², 1.42 mW, and 2.05 pJ, corresponding to an increase of the resources by 30.5%, 32.3%, and 31.4%, respectively, whereas the conventional binary requires 3574.7 µm², 1.93 mW, and 2.8 pJ, corresponding to an increase of the resources by 26.6%, 27.1%, and 26.4%, respectively.
A further advantage of the SDM-SC architecture in filtering over the inner-product approach is that of time-encoding. Assuming that the input signal $U_n$ is of length $\tilde{N}$, the SDM-SC approach processes $\tilde{N}$ samples in N clock cycles, whereas the inner-product approach requires $N$ clock cycles for a single sample of the $\tilde{N}$, thus requiring $N \times \tilde{N}$ clock cycles to complete the processing. This is also reflected in the total energy dissipated by the inner-product approach, which is increased by a factor of $N$. Therefore, the SDM-SC architecture is both faster in terms of processing time and a more energy-efficient approximate processing approach, besides its improved SNR performance.

Conclusions
A soft-filtering architecture based on SDMs and SC was presented in this work. The first-order SDM acts as a single-bit encoder to benefit from the negligible-area SC multipliers, implemented as XNOR gates. The performance of the SDM-SC architecture was evaluated in the spectral domain for two FIR filters of different order, where the filters' proper operation and roll-off behavior were shown. Compared to the standard SC-based filtering approach, it was shown that the SDM-SC architecture improves the SNR by more than 10 dB, while at the same time eliminating the SC's typical latency-accuracy trade-off, as it provides encoding in time, which is also reflected in the power and energy consumption. With respect to the hardware resources, FPGA and Synopsys synthesis results demonstrated the SDM-SC architecture's low-area advantages and its highly energy-efficient processing over standard SC-based and binary approaches.

Acknowledgments: The authors would like to thank anonymous reviewers for their kind suggestions and comments.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: