Efficient FPGA-Based Architecture of the Overlap-Add Method for Short-Time Fourier Analysis/Synthesis

Abstract: This paper proposes a simple and efficient FPGA-based architecture of the overlapping/windowing and overlap-add methods for real-time FFT/IFFT-based signal processing algorithms. The analyzed signal is divided into short-time overlapping frames that are windowed before applying Fourier analysis/synthesis. The original signal is then reconstructed from the windowed (modified) frames using the overlap-add (OLA) technique. The proposed architecture was implemented on a Field Programmable Gate Array (FPGA) using a high-level programming tool in the MATLAB/SIMULINK environment. Its performance was evaluated on artificial and actual signals using objective metrics.


Introduction
The Fourier transform is a powerful mathematical tool that converts a signal from the time domain to the frequency domain (spectrum) and vice versa. It is best suited to analyzing stationary signals, whose frequency content does not vary over time. However, most real-world signals are non-stationary and require a time-frequency representation to analyze their time-varying characteristics. Many techniques, such as the short-time Fourier transform (STFT), wavelet transform, wavelet packet transform, Wigner-Ville distribution, and S-transform, have been proposed in the literature for analyzing non-stationary signals in the time-frequency domain [1].
The short-time Fourier transform (STFT) has been widely used in many signal processing fields, such as speech enhancement [2] and recognition [3], music segmentation and classification [4], biomedical signal/image processing [5], vibration analysis, etc. [6]. The original signal is divided into successive overlapping frames, and each frame is multiplied by a smoothing window to minimize the amplitude of the discontinuities at its boundaries [7]. The overlap-add (OLA) method allows a perfect reconstruction of the original signal from the windowed (modified) overlapping frames [8,9].
In this paper, an efficient FPGA-based architecture of the overlapping/windowing analysis and overlap-add synthesis techniques is proposed for real-time signal processing algorithms based on the short-time Fourier transform. The proposed architecture relies mainly on appropriate management of dual-port RAM memories to implement the overlapping/windowing analysis and the overlap-add synthesis techniques. It has been implemented on the Nexys-4 development board using the XSG programming tool.

Overlapping and Windowing
Let us first consider the original signal x(n), where n is the global time index, varying from zero to the total number of samples. This signal is divided into overlapping frames so that each frame can be processed separately:

$$x(m, n) = x(n)\, w(n - mL) \tag{1}$$

where w(n) is a smoothing window of N samples, m is the frame index, and L is the time-shift step.
The overlap rate between consecutive frames is then (N − L)/N. A frame overlap of 50% corresponds to L = N/2. The overlapping and windowing steps are illustrated for a typical signal in Figure 1. To perform this task in real time, a circular buffer is often used. It is a fixed-size buffer designed so that its last element is connected to its first one. It behaves as a FIFO (first-in, first-out) register in which the oldest sample is replaced (discarded) by the newest sample, thereby creating a moving-window effect.
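The circular-buffer framing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FrameBuffer` and its `push` method are hypothetical, and the buffer emits one windowed N-sample frame every L new samples once it is full.

```python
import numpy as np

class FrameBuffer:
    """Circular buffer emitting one windowed N-sample frame every L new samples."""

    def __init__(self, window, L):
        self.w = np.asarray(window, dtype=float)
        self.N = len(self.w)
        self.L = L
        self.buf = np.zeros(self.N)
        self.pos = 0      # next write position; it also holds the oldest sample
        self.count = 0    # total number of samples written so far

    def push(self, sample):
        self.buf[self.pos] = sample              # overwrite the oldest sample
        self.pos = (self.pos + 1) % self.N       # wrap around: last element -> first
        self.count += 1
        if self.count >= self.N and (self.count - self.N) % self.L == 0:
            ordered = np.roll(self.buf, -self.pos)   # reorder as oldest ... newest
            return ordered * self.w                  # windowed frame x(m, n)
        return None

# Rectangular window of N = 8 samples, shift L = 4 (50% overlap), 16 input samples
fb = FrameBuffer(np.ones(8), L=4)
frames = [f for f in map(fb.push, range(16)) if f is not None]
```

With these values, three frames are produced, covering samples 0 to 7, 4 to 11, and 8 to 15.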

Short-Time Fourier Transform
The short-time Fourier transform (STFT) of the mth frame x(m, n) is defined as:

$$X(m, k) = \sum_{n=0}^{N-1} x(m, n + mL)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$$

where N also represents the number of discrete frequencies. It is usually chosen to be a power of 2 to allow the use of the fast Fourier transform (FFT) algorithm.
If the windowed frames are transformed (STFT) by a single processor, they must be presented as a longer signal y(n), obtained by non-overlapping concatenation of these frames, y(n) = [x(0, n) x(1, n) . . .]. Figure 1 shows that the signal frames are presented without overlapping to a single STFT processor. The Fourier transform of y(n) can also be seen as a non-overlapping concatenation of the corresponding STFTs, i.e., Y(k) = [X(0, k) X(1, k) . . .].

Inverse Discrete Fourier Transform
The windowed frame x(m, n) can be reconstructed by transforming X(m, k) back with the inverse short-time Fourier transform (ISTFT):

$$x(m, n + mL) = \frac{1}{N} \sum_{k=0}^{N-1} X(m, k)\, e^{j 2\pi k n / N}, \quad n = 0, 1, \ldots, N-1$$

The back transformation to the time domain provides a non-overlapping concatenation of the windowed frames. Figure 1 shows that the signal frames are transformed back by a single ISTFT processor without overlapping.
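The single-processor view of the STFT and ISTFT can be checked numerically. The sketch below assumes a periodic Hanning window and an arbitrary test signal (neither is taken from the paper), builds the concatenation y(n), applies one N-point FFT per frame, and verifies that the frame-by-frame inverse FFT returns the same concatenation.

```python
import numpy as np

N, L = 8, 4                                       # window width and 50% shift
n = np.arange(N)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)         # periodic Hanning window (assumed form)
x = np.cos(2 * np.pi * np.arange(32) / 10.0)      # arbitrary test signal

# Windowed frames, then their non-overlapping concatenation y(n)
num_frames = 1 + (len(x) - N) // L
frames = np.stack([x[m * L : m * L + N] * w for m in range(num_frames)])
y = frames.reshape(-1)

# One N-point FFT per frame; Y(k) is the concatenation of the per-frame spectra
X = np.fft.fft(frames, axis=1)
Y = X.reshape(-1)

# The frame-by-frame inverse FFT gives back the concatenated windowed frames
y_back = np.fft.ifft(Y.reshape(-1, N), axis=1).real.reshape(-1)
```

Because the frames are concatenated rather than overlapped, y(n) is longer than x(n): here 7 frames of 8 samples give 56 samples for a 32-sample input.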

Perfect Reconstruction
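The original signal can be reconstructed by overlapping and adding (OLA) all the windowed (modified) signal frames in the time domain:

$$x_r(n) = \sum_{m} x(m, n) \tag{5}$$

By substituting x(m, n), as defined in Equation (1), the synthesized signal can be written as

$$x_r(n) = x(n) \sum_{m} w(n - mL) \tag{6}$$

For perfect reconstruction (x_r(n) = x(n)), the shifted windows must sum to one:

$$\sum_{m} w(n - mL) = 1 \tag{7}$$

For 50% frame overlap (L = N/2), the left-hand side of Equation (7) is periodic with period N/2. Therefore, Equation (7) needs to be satisfied only over a range of N/2 values of n:

$$w(n) + w(n - N/2) = 1, \quad N/2 \le n \le N - 1 \tag{8}$$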
The reconstruction condition in Equation (8) is satisfied for the Hanning, Triangular, and Bartlett windows. It can also be satisfied for the Hamming window by readjusting the amplitude of the reconstructed signal. Figure 2 presents the overlap-add (OLA) method using Hanning, Triangular, Bartlett, and Hamming windows of 32-sample width and 16-sample overlap (50%). The reconstruction condition is perfectly satisfied for the Hanning, Triangular, and Bartlett windows (w(n) + w(n − 16) = 1). For the Hamming window, w(n) + w(n − 16) = 1.08, but the original signal can be recovered by adjusting the amplitude of the reconstructed signal (x(n) = x_r(n)/1.08).
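The window sums quoted above can be verified with a short script. As an assumption, the windows are written in their periodic form (denominator N); the symmetric variants returned by `numpy.hanning` and `numpy.hamming` use N − 1 instead and satisfy the condition only approximately. The helper names are hypothetical.

```python
import numpy as np

def periodic_windows(N):
    """Periodic window definitions (assumed forms, not taken from the paper)."""
    n = np.arange(N)
    return {
        "hanning": 0.5 - 0.5 * np.cos(2 * np.pi * n / N),
        "hamming": 0.54 - 0.46 * np.cos(2 * np.pi * n / N),
        "triangular": 1.0 - np.abs(2.0 * n / N - 1.0),
    }

def half_overlap_sum(w):
    """Return w(n) + w(n - N/2) over the overlap region N/2 <= n < N."""
    half = len(w) // 2
    return w[half:] + w[:half]

# 32-sample windows with a 16-sample overlap, as in Figure 2
sums = {name: half_overlap_sum(w) for name, w in periodic_windows(32).items()}
```

Under these definitions the sum is constant over the overlap region for each window, equal to 1 for the Hanning and triangular windows and to 1.08 for the Hamming window.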
The reconstruction condition in Equation (8) is also satisfied for any even value of the window width N greater than or equal to 4 (a reasonable width). Figure 3 presents the overlap-add (OLA) method for Hanning windows of different widths (N samples) with a shift fixed at L = N/2, which corresponds to an overlap rate of 50%. The condition is also satisfied with the Hamming, Triangular, and Bartlett windows for even values of the window width N and 50% overlap. However, for a given smoothing window of N-sample width, the reconstruction condition is not satisfied for every shift L. A constant OLA is obtained for shift values not exceeding half of the window width (L ≤ N/2), ideally obtained by dividing N by a power of 2. For a Hanning window of N = 256 samples, Figure 4 shows a constant OLA for three shift values (L = 32, 64, and 128), but only a near-constant OLA for L = 46, which does not correspond to a division of N by a power of 2. The reconstruction error increases when the shift exceeds half of the window width (L = 160 and 192).
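The dependence of the OLA sum on the shift L can be explored numerically. The following sketch assumes a periodic Hanning window and a hypothetical helper `ola_ripple` that accumulates shifted copies of the window and measures the relative peak-to-peak variation of their sum away from the signal edges.

```python
import numpy as np

def ola_ripple(w, L, frames=40):
    """Relative peak-to-peak variation of sum_m w(n - mL) in the steady-state region."""
    N = len(w)
    total = (frames - 1) * L + N
    s = np.zeros(total)
    for m in range(frames):
        s[m * L : m * L + N] += w          # overlap-add of the shifted windows
    core = s[N : total - N]                # skip the partially covered edges
    return (core.max() - core.min()) / core.mean()

N = 256
w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hanning (assumed form)
ripple = {L: ola_ripple(w, L) for L in (32, 64, 128, 46, 160, 192)}
```

A ripple at the level of floating-point rounding indicates a constant OLA; the shifts L = 160 and L = 192 leave gaps between the window peaks and therefore give a large ripple.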

Overlap-Add Synthesis
To implement the overlap-add (OLA) method described by Equation (5), the concatenation of non-overlapping frames y(n), defined by Equation (4), must be transformed back into overlapping frames. This can be done by extracting the even and odd frames from y(n).
The overlap-add method in Equation (5), which allows recovering the original signal, can then be written as

$$x_r(n) = y_e(n) + y_o(n) \tag{10}$$

All steps in the overlapping/windowing and overlap-add methods are illustrated in Figure 1.
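The even/odd decomposition can be illustrated as follows for 50% overlap. This is a minimal sketch, assuming that y(n) holds the windowed frames back to back and that each frame is placed at its original position mL in either y_e(n) or y_o(n); the function name `ola_even_odd` is hypothetical.

```python
import numpy as np

def ola_even_odd(y, N):
    """Rebuild x_r(n) from y(n), the concatenation of windowed frames (L = N/2)."""
    frames = y.reshape(-1, N)
    half = N // 2
    total = (len(frames) - 1) * half + N
    y_e = np.zeros(total)
    y_o = np.zeros(total)
    for m, frame in enumerate(frames):
        target = y_e if m % 2 == 0 else y_o         # alternate dispatching
        target[m * half : m * half + N] = frame     # same-parity frames never overlap
    return y_e + y_o, y_e, y_o

N = 8
w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)   # periodic Hanning (assumed form)
x = np.arange(1.0, 25.0)                                # 24-sample test signal
y = np.concatenate([x[m * 4 : m * 4 + N] * w for m in range(5)])
x_r, y_e, y_o = ola_even_odd(y, N)
```

In this example x_r(n) matches x(n) wherever two frames overlap (samples 4 to 19); the first and last half-frames are covered by a single window and are therefore attenuated.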

FPGA Implementation
The proposed architecture for the overlapping/windowing analysis, STFT/ISTFT, and overlap-add synthesis methods was implemented on FPGA using the Xilinx System Generator (XSG) interface (Figure 5). It was tested using the Nexys-4 evaluation kit, based on the Artix-7 XC7A100T FPGA chip. The proposed architecture was evaluated for different values of the window width (N = 256, 512, and 1024). The XSG-based blocks were pipelined to increase the operating frequency of the hardware architecture.

Overlapping/Windowing
The input signal was segmented into overlapping frames using a dual-port RAM (Random Access Memory) block that allows simultaneous read/write operations. The control circuit progressively stores (writes) the input data into memory through the first port, while the second port reads the overlapping signal frames using appropriate address offsets. Each resulting signal frame is then multiplied by a smoothing (Hanning) window w(n), stored in a ROM (Read Only Memory) block, to obtain x(m, n).
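One way to picture the read-side addressing is the small software model below. It is an assumption-laden sketch, not the paper's control circuit: samples are supposed to be written at address n modulo the memory depth, and the hypothetical helper `read_addresses` lists the addresses read for frame m.

```python
def read_addresses(m, N, L, depth):
    """Addresses read for frame m when sample n is written at address n % depth."""
    start = (m * L) % depth
    return [(start + i) % depth for i in range(N)]

# Two consecutive frames with N = 8 and L = 4 share four addresses (50% overlap)
a0 = read_addresses(0, N=8, L=4, depth=16)
a1 = read_addresses(1, N=8, L=4, depth=16)
shared = sorted(set(a0) & set(a1))

# A later frame wraps around the end of the memory
a3 = read_addresses(3, N=8, L=4, depth=16)
```

In this model the read pointer advances by L between frames while N words are read per frame, so with L = N/2 each stored sample is read twice.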

STFT/ISTFT
The STFT and ISTFT algorithms were implemented using the Xilinx FFT (Fast Fourier Transform) block. When used in the analysis task (STFT), this block provides the real and imaginary parts of X(m, k), as well as their corresponding frequency index k. These parts can be transformed back (ISTFT) by the same block to recover the windowed frame x(m, n). The pipelined streaming input/output option was chosen to achieve continuous computation of the short-time Fourier transform. For the complex multipliers, two options, "CLB logic" and "4-multiplier structure", were tested and evaluated in terms of used resources and operating frequency for an FFT/IFFT size of N = 256.

Overlap-Add Synthesis
The successive frames of y(n) provided by the IFFT block are separated into even and odd frames (alternate dispatching) using two dual-port RAM memories. The time-domain signal y(n) is progressively stored in these memories, and a shift in the read addresses separates the even frames y_e(n) from the odd frames y_o(n).
As shown in Figure 5, the proposed architecture is simple and easy to implement. Table 1 presents the required resources and the maximum operating frequency obtained for 16-bit fixed-point data and a window width of N = 256 samples, as reported by the Xilinx ISE Design Suite 14.7. For the FFT/IFFT blocks, the size was also fixed to N = 256 and the "pipelined streaming input/output" option was selected to achieve continuous analysis/reconstruction. In addition, two options for the complex multipliers of these blocks, "CLB logic" and "4-multiplier structure", were compared in terms of used resources and operating frequency. The architecture occupies only a small part of the Artix-7 XC7A100T FPGA chip and can operate at more than 126.920 MHz. As expected, the "CLB logic"-based architecture required more logic resources (3972 Slices, 15,822 Flip Flops, and 14,199 LUTs (Look-Up Tables)); its only DSP48E1 block was used in the windowing step. The "4-multiplier structure"-based architecture used fewer logic resources (1563 Slices, 6504 Flip Flops, and 4775 LUTs), but more DSP48E1 slices (12 for the FFT, 12 for the IFFT, and 1 for windowing). The use of the embedded multipliers gave the highest operating frequency (133.905 MHz instead of 126.920 MHz). To the best of our knowledge, only two approaches that use Xilinx System Generator [10,12] have been proposed in the literature, but they were not sufficiently detailed to be implemented in the present study.
Finally, the co-simulation block and its associated bitstream file were generated automatically by the XSG tool (Figure 6). During the hardware/software co-simulation, the compiled model (configuration bitstream file) was uploaded to and executed on the actual FPGA, taking advantage of the flexible simulation environment of MATLAB/SIMULINK.

Results and Discussion
The proposed FPGA-based architecture was tested and evaluated by hardware/software co-simulation (Figure 6), using artificial and actual signals. The overlapping, windowing, STFT, ISTFT, and overlap-add methods were tested and evaluated using four smoothing windows (Hanning, Triangular, Bartlett, and Hamming) of 256-sample width and 128-sample overlap (50%).

Database
A first synthetic signal was constructed from two cosine functions having different amplitudes and frequencies, f_1 = f_s/200 and f_2 = f_s/30, where f_s is the sampling frequency. Thus, the lower-frequency component has a period of 200 samples, slightly shorter than the window width (N = 256 samples). A cosine signal whose fundamental period differs from the smoothing window width allows a good evaluation of the overlap-add method, especially at frame boundaries. A second synthetic signal was obtained from a chirp, a sinusoid whose frequency is swept linearly over the time interval 0 ≤ n ≤ M, where f_0 = f_s/400 is the instantaneous frequency at n = 0, f_3 = f_s/15 is the instantaneous frequency at n = M, and M is the simulation duration in samples.
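Test signals of this kind can be generated as sketched below. The frequencies follow the values stated above, while the amplitudes `a1` and `a2`, the zero initial phase, and the function names are placeholder assumptions, since the exact expressions are not reproduced here.

```python
import numpy as np

def cosine_mixture(M, fs=1.0, a1=1.0, a2=0.5):
    """Sum of two cosines at f1 = fs/200 and f2 = fs/30 (a1, a2 are placeholders)."""
    n = np.arange(M)
    f1, f2 = fs / 200, fs / 30
    return a1 * np.cos(2 * np.pi * f1 * n / fs) + a2 * np.cos(2 * np.pi * f2 * n / fs)

def linear_chirp(M, fs=1.0):
    """Linear chirp sweeping from f0 = fs/400 at n = 0 to f3 = fs/15 at n = M."""
    n = np.arange(M + 1)                     # 0 <= n <= M
    f0, f3 = fs / 400, fs / 15
    # Phase whose derivative gives the instantaneous frequency f0 + (f3 - f0) * n / M
    phase = 2 * np.pi * (f0 * n + (f3 - f0) * n**2 / (2 * M)) / fs
    return np.cos(phase)

x_cos = cosine_mixture(1280)
x_chirp = linear_chirp(1280)
```

The quadratic phase term is what makes the instantaneous frequency vary linearly between the two end values.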
In addition, an actual ECG (electrocardiogram) signal was taken from the MIT-BIH arrhythmia database [18] that was resampled at 360 Hz.

Evaluation Tests
Three objective tests were used to evaluate the performance of the proposed FPGA-based architecture: the signal-to-noise ratio (SNR), the normalized mean square error (NMSE), and the cross-correlation (CC) measure.
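The SNR evaluates the noise level in the reconstructed signal. It is defined as [13]

$$\mathrm{SNR}_{dB} = 10 \log_{10} \left( \frac{\sum_{n=0}^{M-1} x^2(n)}{\sum_{n=0}^{M-1} \left[ x(n) - x_r(n) \right]^2} \right)$$

where x(n) and x_r(n) are the original signal and the reconstructed signal, respectively, and M is the size of these signals in samples.
The NMSE evaluates the distortion introduced by the analysis/synthesis steps; its definition is given in [19]. The CC measure evaluates the similarity between the original signal and the reconstructed signal. It is defined as [20]

$$\mathrm{CC} = \frac{\sum_{n=0}^{M-1} \left[ x(n) - \mu_o \right] \left[ x_r(n) - \mu_r \right]}{\sqrt{\sum_{n=0}^{M-1} \left[ x(n) - \mu_o \right]^2 \, \sum_{n=0}^{M-1} \left[ x_r(n) - \mu_r \right]^2}}$$

where µ_o and µ_r are the means of the original and reconstructed signals, respectively.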
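The three metrics can be computed as in the following sketch. The SNR and CC follow their usual textbook forms; the NMSE normalization (error energy divided by the energy of the original signal) is an assumption, and the definition in [19] may differ.

```python
import numpy as np

def snr_db(x, x_r):
    """Signal-to-noise ratio in dB between the original x and the reconstruction x_r."""
    return 10 * np.log10(np.sum(x**2) / np.sum((x - x_r) ** 2))

def nmse(x, x_r):
    """Normalized mean square error (assumed normalization: original-signal energy)."""
    return np.sum((x - x_r) ** 2) / np.sum(x**2)

def cross_correlation(x, x_r):
    """Normalized cross-correlation of the mean-removed signals."""
    xc, rc = x - x.mean(), x_r - x_r.mean()
    return np.sum(xc * rc) / np.sqrt(np.sum(xc**2) * np.sum(rc**2))

# Toy check: a reconstruction with a 0.1% gain error
x = np.cos(2 * np.pi * np.arange(1000) / 50.0)
x_r = 1.001 * x
metrics = (snr_db(x, x_r), nmse(x, x_r), cross_correlation(x, x_r))
```

For this toy case the error is 0.001·x(n), so the SNR is 60 dB, the NMSE is 10^−6 under the assumed normalization, and the CC is 1 because a pure gain change does not affect the correlation.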

Results and Discussion
The proposed architecture was tested using artificial signals and an actual ECG signal. As shown in Figures 7-9, the original signal was correctly divided into overlapped (50%) and windowed frames.
The time-domain signal was recovered after the FFT/IFFT processing. The overlap-add method was successfully implemented to provide perfect reconstruction of the original signal. Tables 2 and 3 show the SNR_dB, NMSE, and CC values, estimated on M = 5N = 1280 samples for various smoothing windows, using the 16-bit and 32-bit fixed-point formats, respectively. The ideal values of these parameters (SNR_dB = ∞, NMSE = 0, and CC = 1) were obtained with a floating-point implementation in MATLAB. It can be noted that an SNR_dB of 50 dB corresponds to an SNR of 10^5, which is very large in practice. The slight difference between the theoretical performance obtained by software (MATLAB) and the practical performance obtained by hardware (XSG) is attributable to the quantization errors that occurred during the FFT/IFFT calculation steps. In fact, when the FFT/IFFT blocks were bypassed, these objective evaluation parameters were greatly improved by increasing the fixed-point data width. When the Hanning window was used with the chirp signal, the SNR reached 85.93 dB and 182 dB for the 16-bit and 32-bit formats, respectively. The NMSE reached 2.52 × 10^−9 and 6.21 × 10^−19 for the 16-bit and 32-bit formats, respectively.
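The effect of the fixed-point word length on the SNR can be illustrated in isolation. The sketch below is a simplified software model, not the XSG datapath: it only rounds an arbitrary test tone to a grid with 15 or 31 fractional bits (assumed formats for 16-bit and 32-bit signed words) and does not reproduce the values reported in the tables.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits (no saturation)."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

def snr_db(x, x_r):
    return 10 * np.log10(np.sum(x**2) / np.sum((x - x_r) ** 2))

n = np.arange(1280)
x = 0.9 * np.cos(0.1 * n)               # arbitrary test tone inside (-1, 1)
snr_16 = snr_db(x, quantize(x, 15))     # 16-bit word: sign + 15 fractional bits
snr_32 = snr_db(x, quantize(x, 31))     # 32-bit word: sign + 31 fractional bits
```

Each additional bit halves the rounding step, which is why widening the data path raises the SNR by roughly 6 dB per bit in this simplified model.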

Conclusions
Overlapping/windowing analysis and overlap-add synthesis methods for real-time FFT/IFFT-based applications have been efficiently implemented on FPGA. The proposed FPGA-based architecture relies mainly on appropriate management of dual-port RAM memories to implement short-time analysis/synthesis techniques. The complete system was implemented and evaluated using four smoothing windows with 50% overlap. The output signal can be considered a perfect reconstruction of the input (original) signal. The slight difference can be explained by the quantization errors, mainly in the FFT/IFFT blocks.
In the future, this architecture will be extended to other overlap rates. It will also be incorporated into more advanced signal processing systems, such as speech enhancement and feature extraction.

Figure 4. Illustration of the overlap-add method (red lines) using a Hanning window of fixed width (N = 256) but different shift values L. For example, a frame shift of L = 32 samples corresponds to an overlap rate of (256 − 32)/256 = 87.5%.

Figure 6. Hardware/software co-simulation corresponding to the diagram of Figure 5.

Figure 7. Visualization of the intermediate signals obtained during different calculation stages: (a) original cosine mixture signal x(n); (b) successive windowed frames x(m, n) using the Hanning function; (c) real parts X_r(m, k) of their corresponding STFT; (d) imaginary parts X_i(m, k) of their corresponding STFT; (e) overlapped and windowed signal y(n) obtained by back transformation (ISTFT); (f) non-overlapped even frames extraction y_e(n); (g) non-overlapped odd frames extraction y_o(n); and (h) reconstructed signal x_r(n).

Figure 8. Same as Figure 7 but using a chirp signal.

Figure 9. Same as Figure 7 but using an actual ECG signal from record 101 of the MIT-BIH database.

Table 1. Resource utilization and maximum operating frequency of the proposed architecture obtained for the Artix-7 XC7A100T chip with two FFT/IFFT implementation options for the complex multipliers: "CLB logic" and "4-multiplier structure". CLB, Configurable Logic Block; LUT, Look-Up Table; IOB, Input/Output Block; RAMB, Random Access Memory Block; DSP48E1, Digital Signal Processing slice.

Table 2. Signal-to-noise ratio (SNR_dB), normalized mean square error (NMSE), and cross-correlation (CC) parameters obtained by the proposed architecture using the 16-bit fixed-point format.