1. Introduction
Digital Signal Processing (DSP) in communications has recently focused on advanced techniques to reduce harmful effects on transmission, such as interchannel interference (ICI) and noise [1]. The results obtained with these techniques are promising, but real-time implementation remains challenging. While traditional optical backbone networks demand multi-Gigabit processing, the emerging field of decentralized Edge Computing and IoT imposes a different, yet equally severe challenge: deploying complex DSP algorithms under extreme power, area, and cost constraints.
Field-Programmable Gate Array (FPGA) devices have recently become an alternative for real-time implementation that avoids fabricating an Application-Specific Integrated Circuit (ASIC) [
2]. FPGA-based boards can be re-programmed, allowing exploration of the design space to find an optimal algorithm implementation by applying parallelization techniques to increase throughput [
3,
4]. Also, unlike other parallel computing architectures such as Graphics Processing Units (GPUs), FPGAs have high-speed transceivers for low-latency communication with hardware [
5].
Another critical aspect in hardware-accelerated DSP is that FPGAs support fixed-point operations, allowing user-defined size operations, optimizing hardware resources, and increasing processing speed [
6]. For example, floating-point operations can use more than 100 times as many gates as fixed-point operations, resulting in worse performance when implementing the same algorithm.
For hardware reconfiguration, the System-on-Chip–FPGA (SoC-FPGA) has recently emerged as a technological solution for real-time implementations of parallel machine learning (ML) algorithms [
7]. SoC-FPGA interconnects a microprocessor and an FPGA on a single chip, providing the flexibility of FPGA architecture and management functionality, as well as the ability to re-program the programmable logic from the microprocessor [
8]. SoC devices have also been recently used for Visible Light Communications (VLCs) [
9], and the field of Photonic Networks-on-Chip (NoC) is expected to continue attracting the community’s attention [
10,
11].
On the other hand, given the limitations on hardware resources imposed by current electronic device technology, FPGA (or SoC-FPGA) designs are critical for programming complex and recurrent DSP operations in high-speed transmission systems [
12]. Hardware designers use methodologies that apply optimization directives to obtain hardware architectures that balance the performance and minimum hardware resource utilization [
13]. In this paper, we explore the design space of two well-known blind equalization algorithms: the Constant Modulus Algorithm (CMA) and the Multi-Modulus Algorithm (MMA). The selection of the algorithms was based on their continued use across different applications, such as blind linear channel equalization. Even in recent applications of machine learning algorithms, their relevance remains evident [
14,
15]. The emphasis is therefore on the application of the methodology and on how its concepts extrapolate to other applications. Historically, CMA and MMA have been the cornerstone of high-speed optical communication systems to mitigate severe chromatic dispersion. However, these traditional deployments rely on massive, high-end FPGAs (e.g., Virtex UltraScale) and floating-point arithmetic to achieve Gigabit-per-second (Gbps) throughputs. Today, a critical paradigm shift is required: transitioning these heavy algorithms to severely resource-constrained Edge SoC-FPGAs to support decentralized networks, such as localized Software-Defined Radio (SDR) and IoT gateways.
We present a methodology for SoC-FPGA design and implementation that targets minimum hardware resource utilization. The methodology first identifies and implements frequently used recurrent functions, referred to in this manuscript as Custom Processing Units (CuPUs) [16]. Then, optimization directives and fixed-point numerical representations are applied during hardware design to reduce latency relative to the floating-point implementation while maintaining accuracy. Finally, we explore different numeric representations in online, real-time experiments running on the SoC-FPGA to ensure minimal hardware resource utilization and accurate results compared with the floating-point analysis.
We also present an extensive study of real-time channel equalization algorithms that minimize FPGA hardware resources to validate the methodology. A step-by-step design of the Constant Modulus Algorithm (CMA) and the Multi-Modulus Algorithm (MMA) is presented. As part of the methodology, we validate the designs by testing the hardware structures implemented on the SoC-FPGA using synthetic input signals modulated in 4-QAM, 16-QAM, and 64-QAM and passing through an emulated time-variant channel [
17]. The proposed methodology uses Xilinx Vivado high-level synthesis (HLS) 2023.1 (Advanced Micro Devices, Inc. (AMD), Santa Clara, CA, USA) for rapid prototyping as an alternative to the traditional Register Transfer Level (RTL) design.
The main contributions of this work are summarized as follows: (i) A formal multi-objective Design Space Exploration (DSE) framework targeting the minimization of hardware resources for complex adaptive equalizers on Edge SoC-FPGAs. (ii) The modular architectural mapping of Constant and Multi-Modulus Algorithms into Custom Processing Units (CuPUs) optimized for bit-level fixed-point operations. (iii) A dynamic scaling factor methodology that mitigates severe quantization noise, relaxing mathematical convergence bounds to allow larger step-sizes without divergence. (iv) The validation of the hardware architectures via a bit-true fixed-point emulation coupled with a time-variant multipath channel modeled directly on the processing system (PS) of the SoC-FPGA.
The rest of this paper is organized as follows: Section 2 details the proposed Design Space Exploration (DSE) methodology and the hardware mapping of the processing algorithms on the SoC-FPGA, Section 3 presents the experimental results and the real-time hardware emulation, Section 4 discusses the findings, and Section 5 draws the final conclusions.
2. Design of DSP Algorithms on SoC-FPGA
This section presents the DSE for real-time DSP algorithm implementation on SoC-FPGA. Along the process, we define two types of performance analysis. First, we define “application performance” as the correct implementation of the targeted function, such as channel equalization. The final goal is to equalize the communication channel as much as possible. Second, we refer to “system performance” as the FPGA device’s performance in terms of latency and hardware resources. The final objective is to minimize both variables.
The methodology is composed of four stages: (i) identification of Custom Processing Units; (ii) high-level synthesis of identified CuPU applying parallelization directives, ensuring successful application performance; (iii) characterization of application performance using real-time channel emulation; and (iv) analysis and adjustment of numeric representations of the processed data in the SoC-FPGA device for minimum hardware resources utilization.
To address the complexity of optimizing numerical representations without degrading the equalization performance, the DSE is formally defined as a constrained multi-objective search algorithm. As detailed in Algorithm 1, the methodology systematically evaluates the design space bounded by the total word length ($B$) and the fractional bits ($P$). The objective function aims to minimize the hardware area and latency footprint extracted from the HLS synthesis, subject to the strict constraint that the quantization noise does not disrupt system convergence. This constraint is enforced by ensuring that the evaluated MMSE remains below a tolerable threshold derived from the ideal floating-point baseline. Algorithm 1 outlines the complete DSE workflow, detailing the initialization, boundary constraints, and the multi-objective optimization process.
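| Algorithm 1 Constrained multi-objective Design Space Exploration (DSE) |
|---|
| Require: modulation set, target MMSE, scaling factor set |
| Ensure: optimal word length $B$ and fractional bits $P$ per modulation |
| 1: for each modulation $m$ in the modulation set do |
| 2: initialize the best hardware cost and the best $(B, P)$ pair |
| 3: … |
| 4: for $B$ = 16 to 32 do |
| 5: for $P$ = … to 12 do |
| 6: {1. Hardware Synthesis Evaluation} |
| 7: obtain the area and latency for $(B, P)$ from the HLS synthesis |
| 8: {2. Bit-True Emulation & Error Calculation} |
| 9: compute the MMSE for $(B, P)$ against the floating-point baseline |
| 10: {3. Multi-Objective Constraint Checking} |
| 11: if the MMSE is below the target and the hardware cost improves on the best so far then |
| 12: update the best hardware cost |
| 13: update the best $(B, P)$ pair |
| 14: end if |
| 15: end for |
| 16: end for |
| 17: Output: the best $(B, P)$ pair as the hardware-efficient solution for $m$ |
| 18: end for |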
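The constrained search of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the loop structure, not the authors' tool flow: `explore`, `toy_cost`, and `toy_mmse` are hypothetical names, and the two toy models merely stand in for the HLS synthesis report and the bit-true emulation. The toy error model additionally assumes that $P$ counts integer bits, which is one possible reading of the format.

```python
from typing import Callable, Iterable, Optional, Tuple


def explore(b_values: Iterable[int],
            p_values: Iterable[int],
            synth_cost: Callable[[int, int], float],
            emulate_mmse: Callable[[int, int], float],
            mmse_target: float) -> Optional[Tuple[int, int]]:
    """Return the cheapest (B, P) whose emulated MMSE meets the target."""
    p_list = list(p_values)
    best_cost = float("inf")
    best = None
    for b in b_values:
        for p in p_list:
            if p >= b:
                continue  # no bits would be left for the other field
            cost = synth_cost(b, p)      # stand-in for the HLS report
            err = emulate_mmse(b, p)     # stand-in for bit-true emulation
            # Constraint first (accuracy), then objective (hardware cost).
            if err <= mmse_target and cost < best_cost:
                best_cost, best = cost, (b, p)
    return best


# Toy models, invented purely so that the loop can run (assumptions).
def toy_cost(b: int, p: int) -> float:
    return float(b)  # pretend area and latency grow with the word length


def toy_mmse(b: int, p: int) -> float:
    if p < 3:
        return 1.0  # pretend the integer part overflows
    return 2.0 ** (-2 * (b - p))  # pretend error shrinks with b - p


if __name__ == "__main__":
    print(explore(range(16, 33, 2), range(1, 13), toy_cost, toy_mmse, 1e-8))
```

In a real flow, `synth_cost` would be replaced by values parsed from synthesis reports, which makes each evaluation expensive; that cost is the reason for bounding the search ranges up front.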
2.1. Custom Processing Unit (CuPU)
CuPUs have been defined as recurrently used DSP operations on which more complex functions, or high-level DSP structures, can be constructed. For this paper, we defined the following three functions as CuPUs: (i) Euclidean distance calculation, (ii) a generalized Finite Impulse Response (FIR) filter structure, and (iii) the routine for FIR coefficients update. CuPUs are chosen because of their recurrence for channel equalization algorithm implementations based on the CMA and MMA.
2.2. High-Level Synthesis
The proposed methodology is implemented using a high-level synthesis (HLS) tool, allowing the programmer to describe the system in high-level languages such as C/C++ or SystemC. HLS tools translate an algorithm from a software description into an RTL hardware description. The programmer can use loops, arithmetic operations, and function calls, and these programming structures are automatically converted into processing cores, memories, counters, and finite state machines. In this stage, the identified CuPUs are programmed following parallelization directives. HLS tools use parallelization directives, such as pipelining and loop unrolling, to achieve performance comparable to or even better than that of a hand-written Register Transfer Level (RTL) design. The C/C++ language is used to describe the algorithm and perform simulation tests, and the HLS tool then automatically transforms the description into HDL for real-time hardware implementation.
2.3. Real-Time Emulation on Heterogeneous Architecture Based on SoC-FPGA
SoC-FPGA combines a microprocessor and an FPGA on a single chip, providing greater integration with lower power consumption and high-bandwidth communication between the processor and the FPGA. SoC-FPGA also includes a variety of peripherals and high-speed transceivers and can be reprogrammed at any time.
An SoC-FPGA consists of two main components, the processing system (PS) and the programmable logic (PL). The PS is based on a standard hardware architecture, including a real-time processor and a Graphics Processing Unit (GPU), and it requires additional hardware to run an operating system (OS). The PL, on the other hand, is an FPGA with high-performance digital inputs and outputs. The interconnection between the PS and the PL allows hardware reconfiguration from the PS. In this work, the SoC-FPGA enabled joint real-time emulation of a time-variant channel on the PS and an equalization algorithm implementation on the PL.
2.4. Numeric Representations Analysis
Real-time applications typically perform computations in the IEEE 754 [18] 32-bit single-precision floating-point format because of its wide dynamic range. In this work, we adopted the 32-bit model as a baseline for precision metrics. However, a fixed-point numerical representation is also used to represent real numbers in DSP, especially when system performance is more important than precision, because fixed-point arithmetic is much faster than floating-point arithmetic. We explore system performance by applying different fixed-point formats to minimize hardware resources without sacrificing application performance. The fixed-point representation is denoted FxP$B,P$, where FxP means fixed-point, $B$ is the size of the binary number in bits, and $P$ is the position of the decimal point counted from the most significant bit (MSB). For example, the representation FxP8,3 corresponds to a binary number of 8 bits with 3 bits for the integer part and 5 bits for the fractional part.
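To make the FxP8,3 example concrete, the following Python sketch quantizes a real value to a $B$-bit word with $P$ integer bits. It is a minimal illustration under stated assumptions rather than the format used by the HLS tool: the helper name `quantize_fxp` is hypothetical, and the sketch assumes a signed two's-complement word whose sign bit is counted among the $P$ integer bits, truncation toward negative infinity, and saturation on overflow.

```python
import math


def quantize_fxp(value: float, b: int, p: int) -> float:
    """Quantize `value` to a signed fixed-point word of b bits, p of them integer.

    Assumptions (not taken from the paper): two's complement, sign bit
    counted inside p, truncation toward -inf, saturation on overflow.
    """
    frac_bits = b - p
    scale = 1 << frac_bits              # weight of one integer unit
    lo = -(1 << (b - 1))                # most negative raw code
    hi = (1 << (b - 1)) - 1             # most positive raw code
    raw = math.floor(value * scale)     # truncate the extra fractional bits
    raw = max(lo, min(hi, raw))         # clip to the representable range
    return raw / scale


if __name__ == "__main__":
    # FxP8,3: 5 fractional bits, so the resolution is 1/32 = 0.03125.
    print(quantize_fxp(1.3, 8, 3))    # 1.28125
    print(quantize_fxp(10.0, 8, 3))   # 3.96875 (saturated)
    print(quantize_fxp(-10.0, 8, 3))  # -4.0 (saturated)
```

Under these assumptions the representable range of FxP8,3 is [-4, 3.96875] in steps of 1/32, which shows how moving the point trades range for resolution at a fixed word length.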
2.5. Case Study: Blind Equalization Algorithms
In this section, we apply the methodology introduced in Section 2 to the real-time implementation of the blind channel equalization algorithms CMA and MMA [19], which use geometric calculations to separate the distorted data symbols of a constellation diagram in a communication channel.
In the CMA algorithm, the first step calculates the Euclidean distance between the expected position of a data symbol and the position of the distorted data symbol once it has passed through the distorted multipath communication channel. The channel was modeled by adding additive white Gaussian noise (AWGN) to the data and passing them through a Finite Impulse Response (FIR) filter.
The received distorted symbols are stored in a buffer. They are divided into segments $X_n$ in the delay line and organized in a Toeplitz matrix $X$. The $X_n$ segments are used as training data to update the equalizer’s filter weights $W$ during algorithm iterations. The partial results of this filtering are stored in $y_n$, and the expression is given by

$$y_n = W_n^{T} X_n \qquad (1)$$

The Euclidean distance $e_n$ for CMA is calculated as

$$e_n = y_n \left( |y_n|^2 - R_{\mathrm{CMA}} \right) \qquad (2)$$

The metric in Equation (2) represents the error between the distorted symbols and the expected value $R_{\mathrm{CMA}}$, which is constant for Phase Shift Keying (PSK) modulation because all symbols have the same expected power. Optimization algorithms adjust the filter coefficients $W_n$ to minimize the error. The coefficients are updated by applying gradient descent, as follows:

$$W_{n+1} = W_n - \mu \, e_n \, X_n^{*} \qquad (3)$$

where the constant $\mu$ regulates the step used to find the minimum error in each iteration $n$.

The MMA algorithm is a modification of CMA, which separates $y_n$ into real ($y_{R,n}$) and imaginary ($y_{I,n}$) parts. The algorithm works similarly to CMA, but applies the distance calculation to the real and imaginary parts separately:

$$e_n = y_{R,n} \left( y_{R,n}^2 - R_{\mathrm{MMA}} \right) + j \, y_{I,n} \left( y_{I,n}^2 - R_{\mathrm{MMA}} \right) \qquad (4)$$

where $R_{\mathrm{MMA}}$ is the statistical expected value for the MMA algorithm. Unlike PSK, the amplitude in QAM constellations is not constant. Therefore, as established in the foundational framework by Yang et al. [19], the dispersion constant $R_{\mathrm{MMA}}$ must be pre-calculated from the statistical moments of the input symbols ($a$) to guarantee that the gradient of the cost function reaches zero at convergence:

$$R_{\mathrm{MMA}} = \frac{E\left[ a_R^4 \right]}{E\left[ a_R^2 \right]} \qquad (5)$$

where $a_R$ is the real part of $a$. By extracting the precise moments for the evaluated alphabets (e.g., for 16-QAM), the equalizer avoids the instability of empirical tuning and translates the theoretical boundary into the optimized fixed-point hardware architecture.

The filter coefficients are updated using the same equation used in CMA; see Equation (3).
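The CMA and MMA recursions can be sketched as a floating-point reference in plain Python. This is a hedged illustration, not the authors' implementation: it uses the common textbook forms of the two error terms and of the stochastic-gradient update, all function names are hypothetical, and the 16-QAM alphabet is assumed to use the unnormalized per-axis levels ±1 and ±3.

```python
from typing import List, Sequence


def dispersion_constant_mma(levels: Sequence[float]) -> float:
    """E[a_R^4] / E[a_R^2] for equiprobable per-axis amplitude levels."""
    m4 = sum(a ** 4 for a in levels) / len(levels)
    m2 = sum(a ** 2 for a in levels) / len(levels)
    return m4 / m2


def fir_output(w: Sequence[complex], x: Sequence[complex]) -> complex:
    """Equalizer output y_n as the inner product of weights and delay line."""
    return sum(wi * xi for wi, xi in zip(w, x))


def cma_error(y: complex, r: float) -> complex:
    """CMA error term: penalizes deviation of |y|^2 from a single modulus."""
    return y * (abs(y) ** 2 - r)


def mma_error(y: complex, r: float) -> complex:
    """MMA error term: the same penalty applied per real and imaginary axis."""
    return complex(y.real * (y.real ** 2 - r), y.imag * (y.imag ** 2 - r))


def update_weights(w: Sequence[complex], x: Sequence[complex],
                   e: complex, mu: float) -> List[complex]:
    """One stochastic-gradient step: w <- w - mu * e * conj(x)."""
    return [wi - mu * e * xi.conjugate() for wi, xi in zip(w, x)]


if __name__ == "__main__":
    r_mma = dispersion_constant_mma([-3, -1, 1, 3])  # assumed 16-QAM levels
    w, x = [1 + 0j], [2 + 0j]
    y = fir_output(w, x)
    w = update_weights(w, x, cma_error(y, 1.0), mu=0.01)
    print(r_mma, w)
```

Because both algorithms share `update_weights` and differ only in the error term, the coefficient update can be mapped to a single hardware unit while the distance calculation is swapped per algorithm.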
2.6. CuPUs for CMA and MMA Blind Equalizers
Two of the identified CuPUs are used for the blind equalization algorithm: the distance calculation and the filter coefficient update. These processing units are combined in $N$ iterations. An additional FIR filter unit is used outside of the cycle to apply the coefficients stored in $W$ to the $X$ input matrix. Figure 1 shows the connections of the Custom Processing Units to build the CMA and MMA blind equalizers. Figure 2 and Figure 3 show the internal structure of the processing units for distance calculation in CMA and MMA, respectively. These internal blocks compute different values in each iteration. Parallelization is applied only within each iteration because $W$ must be updated on another CuPU, creating a data dependency between iterations. The block VVM represents the multiplication between two vectors of the same length, the input segment $X_n$ and the weight vector $W$, which produces a result $y_n$ of 1 × 1 size. Since the baseband signals and the filter weights ($W$) are complex-valued, the VVM hardware module is specifically designed to perform complex Multiply-Accumulate (MAC) operations. Through the high-level synthesis (HLS) directives, each complex multiplication is inferred and partitioned at the Register Transfer Level (RTL) into four real-valued multipliers and two adders/subtractors (i.e., $(a + jb)(c + jd) = (ac - bd) + j(ad + bc)$). Importantly, the optimized fixed-point numerical representation (e.g., FxP16,3) enables these real-valued multiplications to map natively and efficiently onto the integer multipliers within the dedicated DSP48E slices of the SoC-FPGA. This architectural mapping maximizes throughput while avoiding the extensive logic (LUT) overhead characteristic of floating-point implementations.
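The decomposition of a complex multiply-accumulate into four real multiplications and two additions or subtractions can be checked with a short Python sketch. It is an illustration only: `complex_mul_4real` and `vvm` are hypothetical names, and the integer operands stand in for raw fixed-point codes such as those a hardware multiplier would receive.

```python
from typing import Sequence, Tuple

IntPair = Tuple[int, int]  # (real, imaginary) as raw integer codes


def complex_mul_4real(a: IntPair, b: IntPair) -> IntPair:
    """(a_re + j a_im)(b_re + j b_im) using four real multiplies."""
    a_re, a_im = a
    b_re, b_im = b
    re = a_re * b_re - a_im * b_im  # two multiplies, one subtractor
    im = a_re * b_im + a_im * b_re  # two multiplies, one adder
    return re, im


def vvm(x: Sequence[IntPair], w: Sequence[IntPair]) -> IntPair:
    """Vector-vector multiply: accumulate the element-wise complex products."""
    acc_re, acc_im = 0, 0
    for xi, wi in zip(x, w):
        p_re, p_im = complex_mul_4real(xi, wi)
        acc_re += p_re
        acc_im += p_im
    return acc_re, acc_im


if __name__ == "__main__":
    # (1 + 2j) * 2 + (3 - 1j) * 1j = (2 + 4j) + (1 + 3j)
    print(vvm([(1, 2), (3, -1)], [(2, 0), (0, 1)]))  # (3, 7)
```

Keeping the real and imaginary accumulators separate mirrors the fact that each one feeds its own adder chain, so the loop body can be pipelined without a dependency between the two.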
2.7. Hardware Architectural Mapping and Pipelining via HLS
While parallelization techniques such as loop unrolling and pipelining are standard in FPGA design, their direct application to highly iterative stochastic algorithms like CMA and MMA often leads to resource exhaustion on Edge devices. Instead of blindly parallelizing the loops, our mapping strategy uses HLS directives constrained by the optimal numeric representations found during the DSE, which ensures that the inferred hardware matches the capability of the SoC-FPGA’s DSP48E slices without incurring the LUT overhead of floating-point logic. The parallelization was based on three techniques widely used in hardware design: data partitioning, loop pipelining, and loop unrolling, all applied using HLS parallelization directives. Figure 4 shows the application of the pipeline directive on the VVM block.
The filter coefficient update is given in Equation (3). In this equation, the lowercase variables are 1 × 1 and the uppercase variables are vectors. Thus, the operation is defined as the multiplication of constants by a vector, followed by vector subtraction. The applied parallelization directive is complete loop unrolling, which obtains the result for each element of $W$ simultaneously.
The FIR filter processing unit was implemented using matrix representation. The input symbol Toeplitz matrix $X$ needs to be multiplied by the weight vector $W$ generated by the equalizer. This multiplication equalizes the input; see Figure 5.
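The matrix form of the FIR filter can be illustrated with a small Python sketch. It is a sketch under assumptions, not the layout used in the hardware: the function names are hypothetical, each row is assumed to hold the current sample followed by the previous ones, and samples before the start of the stream are assumed to be zero.

```python
from typing import List, Sequence


def toeplitz_rows(samples: Sequence[complex], taps: int) -> List[List[complex]]:
    """Row n holds [s[n], s[n-1], ..., s[n-taps+1]], zero-padded at the start."""
    rows = []
    for n in range(len(samples)):
        rows.append([samples[n - k] if n - k >= 0 else 0 for k in range(taps)])
    return rows


def fir_apply(samples: Sequence[complex], w: Sequence[complex]) -> List[complex]:
    """Multiply the Toeplitz matrix by the weight vector, one row per output."""
    return [sum(xk * wk for xk, wk in zip(row, w))
            for row in toeplitz_rows(samples, len(w))]


if __name__ == "__main__":
    print(toeplitz_rows([1, 2, 3], 2))  # [[1, 0], [2, 1], [3, 2]]
    print(fir_apply([1, 2, 3], [1, 1]))  # [1, 3, 5]
```

Each output depends on one row only, so the rows can be processed independently, which is what makes the matrix form convenient for pipelining.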
3. Experimental Results
To ensure rigorous, reproducible performance evaluation, specific experimental parameters were established. The equalizer’s complex weight vector was initialized using the center-spike method, setting the middle tap to a nonzero value and all remaining taps to zero, thereby allowing the initial signal to pass while the error gradient stabilized. Instead of isolated Monte Carlo simulations, the hardware-in-the-loop methodology and its bit-true emulation utilized a continuous stream of pseudo-random QAM symbols to intrinsically capture the statistical variations of the noise and data. The algorithms processed the incoming data iteratively, and the convergence criterion was defined as the system reaching a steady state. Empirical observations confirmed that the required number of iterations to bypass the transient state scales with the modulation order due to the progressively smaller step sizes ($\mu$) needed to ensure fixed-point stability. Specifically, steady-state was achieved after processing approximately 3000, 10,000, and 20,000 symbols for 4-QAM, 16-QAM, and 64-QAM, respectively. Consequently, all performance metrics and visual constellation diagrams were calculated exclusively on the steady-state symbols, deliberately discarding the initial transient convergence period.
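The initialization and the transient-discarding step can be sketched in Python as follows. This is a minimal illustration with assumed values: the function names are hypothetical, the spike value of 1 + 0j is a common choice assumed here rather than taken from the experiments, and the transient lengths simply repeat the approximate symbol counts quoted above.

```python
from typing import Dict, List, Sequence

# Approximate transient lengths quoted in the text, per modulation.
TRANSIENT_SYMBOLS: Dict[str, int] = {"4-QAM": 3000, "16-QAM": 10000, "64-QAM": 20000}


def center_spike(num_taps: int, spike: complex = 1 + 0j) -> List[complex]:
    """All-zero weight vector with a single nonzero middle tap.

    The default spike value is an assumption made for this sketch.
    """
    w = [0j] * num_taps
    w[num_taps // 2] = spike
    return w


def steady_state(outputs: Sequence[complex], modulation: str) -> List[complex]:
    """Drop the transient part so that metrics use converged symbols only."""
    return list(outputs[TRANSIENT_SYMBOLS[modulation]:])


if __name__ == "__main__":
    w0 = center_spike(21)  # 21 taps, as in the reported equalizer
    print(w0.index(1 + 0j))  # 10
```

With a single middle tap, the equalizer initially behaves as a pure delay, so the first outputs are delayed copies of the received samples.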
3.1. Emulation of Blind Equalization Algorithm on SoC-FPGA
The blind equalization emulation uses the PS of the SoC-FPGA as the transmitter, generating QAM modulation signals. The physical communication channel is emulated as a time-varying multipath channel to distort the symbols of the input signal. This is implemented using an FIR filter where the time-varying coefficients are updated using independent Gaussian random processes to simulate stochastic tap variations, combined with additive white Gaussian noise (AWGN). This approach effectively captures the dynamic fading conditions and Doppler shifts typical in Edge and wireless environments, as shown in
Figure 6.
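A time-varying multipath channel of this kind can be emulated with a few lines of Python. The sketch below is one possible model under stated assumptions, not the authors' emulator: `emulate_channel` is a hypothetical name, each tap is assumed to be its nominal value plus an independent complex Gaussian perturbation redrawn at every sample, and the tap values, standard deviations, and seed are illustrative.

```python
import random
from typing import List, Sequence


def emulate_channel(symbols: Sequence[complex],
                    nominal_taps: Sequence[complex],
                    tap_sigma: float,
                    noise_sigma: float,
                    seed: int = 0) -> List[complex]:
    """FIR channel with randomly perturbed taps followed by additive noise."""
    rng = random.Random(seed)            # seeded for reproducible runs
    delay = [0j] * len(nominal_taps)     # delay line, newest sample first
    received = []
    for s in symbols:
        delay = [s] + delay[:-1]
        # Assumed model: independent Gaussian variation on every tap.
        taps = [t + complex(rng.gauss(0, tap_sigma), rng.gauss(0, tap_sigma))
                for t in nominal_taps]
        y = sum(h * d for h, d in zip(taps, delay))
        # Additive white Gaussian noise on both quadratures.
        y += complex(rng.gauss(0, noise_sigma), rng.gauss(0, noise_sigma))
        received.append(y)
    return received


if __name__ == "__main__":
    tx = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j] * 4  # toy 4-QAM stream
    rx = emulate_channel(tx, [1.0, 0.5, 0.2], tap_sigma=0.05, noise_sigma=0.1)
    print(len(rx))
```

Setting both standard deviations to zero reduces the model to a static FIR channel, which is a convenient sanity check before enabling the random variations.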
Gnuplot is used for real-time signal plotting to verify general system operation. Although this is not practical for a final real-time implementation, it is helpful at the design stage; in a deployed real-time application, visualizing only status and control signals may be a practical alternative. The Custom Processing Units were implemented in the PL of the SoC-FPGA. The PL receives the distorted symbols, applies the equalization algorithms, and sends the result back to the PS for visualization.
Communication between the PS and the PL uses the Advanced eXtensible Interface (AXI) protocol. The AXI protocol enables point-to-point communication in microcontroller systems, achieving high bandwidth and low latency. This protocol is part of the ARM Advanced Microcontroller Bus Architecture (AMBA) specification. On the PS side, it works like typical Linux file management functions, and on the PL side, it is handled as a standard read/write to a FIFO.
3.2. Numeric Representation for Minimum Hardware Resources Utilization of CMA and MMA
This approach focuses on minimizing hardware resource utilization. For this reason, the experiments were designed to achieve the best performance within the constraints of a low-cost SoC-FPGA board. The fixed-point representation FxP$B,P$ was used to improve performance by varying $B$ from 16 to 32. $P$ was selected taking into account the maximum values for each QAM modulation.
The minimum mean-square error (MMSE) between floating-point and fixed-point representations allows for identifying suitable values for final implementations. While system-level metrics such as Bit Error Rate (BER) or Error Vector Magnitude (EVM) are standard for evaluating end-to-end communication systems, the primary objective of this DSE framework is to isolate and quantify the hardware degradation introduced specifically by fixed-point quantization. Therefore, the MMSE between the ideal floating-point baseline and the quantized fixed-point hardware implementation is established as the primary evaluation metric. This approach allows precise measurement of quantization noise while deliberately decoupling it from the inherent channel noise.
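The comparison between the floating-point baseline and the fixed-point output reduces to a mean-square error over paired samples. The Python sketch below shows that calculation only; `mean_square_error` is a hypothetical name, and the sample values are illustrative rather than measured.

```python
from typing import Sequence


def mean_square_error(reference: Sequence[complex],
                      quantized: Sequence[complex]) -> float:
    """Average squared magnitude of the difference between paired samples."""
    if len(reference) != len(quantized):
        raise ValueError("both sequences must have the same length")
    if not reference:
        raise ValueError("at least one sample is required")
    total = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return total / len(reference)


if __name__ == "__main__":
    floating = [1 + 1j, 2 + 0j]  # illustrative floating-point outputs
    fixed = [1 + 0j, 2 + 0j]     # illustrative fixed-point outputs
    print(mean_square_error(floating, fixed))  # 0.5
```

Because both sequences are produced from the same received samples, the channel noise is common to them and cancels in the difference, leaving only the quantization contribution.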
To complement this hardware-focused metric and provide qualitative proof of the end-to-end equalization process, visual validation through constellation diagrams is presented alongside the MMSE results. However, to visually evaluate this system-level performance across different modulation orders without exceeding the real-time memory capture limits of the physical FPGA testbed (e.g., Vivado ILA buffers), a bit-true fixed-point emulation of the proposed architecture was developed. This model strictly incorporates the hardware constraints defined by the DSE, including the exact word-length truncation, scaling factors, and hardware clipping limits.
As illustrated in
Figure 7, the emulated hardware successfully tracks and mitigates the time-varying multipath distortion, achieving a clear separation between the constellation points for 4-QAM, 16-QAM, and 64-QAM. This visual evidence confirms that the mathematical boundaries established for the fixed-point parameters are strictly valid, ensuring reliable symbol recovery without the need to calculate real-time hardware BER, which would exceed the logical footprint limits of the Edge SoC-FPGA.
Scale Factor. The MMSE increases with the modulation order because higher-order constellations need finer representations for both the integer and fractional parts, which would otherwise require a larger word length. A solution is to scale all variables involved in the operations while keeping the format. A benefit observed in the experimental results is that $\mu$ can take on a larger value; see Table 1.
Theoretically, to guarantee convergence in adaptive equalizers, the step-size $\mu$ must be strictly bounded by the inverse of the input signal’s power. However, in fixed-point architectures, severe quantization noise and saturation (clipping) alter this boundary. The scaling factors introduced act as a dynamic normalization mechanism, artificially compressing the dynamic range of higher-order modulations (e.g., shifting bits for 64-QAM) to fit within the restricted fixed-point word length. This bit-shifting prevents arithmetic overflow and relaxes the convergence bound, allowing the hardware to use a larger, more aggressive effective step-size without diverging, as summarized in Table 1.
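The effect of a power-of-two scale factor can be illustrated numerically. The Python sketch below is an assumption-laden illustration, not the authors' scaling scheme: the function names and the one-bit shift are hypothetical, the range check assumes a signed format whose sign bit is counted among the $P$ integer bits, and the update term uses the textbook CMA form. Under these assumptions, scaling the input by $c$ and the modulus by $c^2$ shrinks the update term by $c^4$, which is one way to see why a larger step-size can be tolerated.

```python
def fits(value: float, p: int) -> bool:
    """True if value lies in the signed range [-2^(p-1), 2^(p-1)) of p integer bits."""
    limit = 2.0 ** (p - 1)
    return -limit <= value < limit


def scale_down(value: complex, shift: int) -> complex:
    """Apply the scale factor 2^-shift (a right shift in hardware)."""
    return value / (1 << shift)


def scale_up(value: complex, shift: int) -> complex:
    """Reverse the scale factor (a left shift in hardware)."""
    return value * (1 << shift)


def cma_update_term(y: complex, x: complex, r: float) -> complex:
    """Textbook CMA gradient term y (|y|^2 - r) conj(x), without the step-size."""
    return y * (abs(y) ** 2 - r) * x.conjugate()


if __name__ == "__main__":
    # An outer 64-QAM level of 7 does not fit in 3 integer bits, but 3.5 does.
    print(fits(7.0, 3), fits(scale_down(7.0, 1).real, 3))  # False True
    c = 0.5
    full = cma_update_term(2 + 1j, 1 - 1j, 1.5)
    scaled = cma_update_term(c * (2 + 1j), c * (1 - 1j), c * c * 1.5)
    print(abs(scaled) / abs(full))  # close to c**4 = 0.0625
```

Since the scale factor is a power of two, applying and reversing it costs only bit shifts, so the original constellation can be recovered for visualization without extra multipliers.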
3.3. Performance Comparison
The real-time hardware implementation of the algorithms and subsequent testing on an emulated channel were performed on a low-cost board called Ultra96. This board is an ARM-based Xilinx Zynq UltraScale+ MPSoC development board based on the Linaro 96Boards specification. The algorithm was designed in Xilinx SDx, and the results were plotted in real time using Gnuplot 6.0. The signal-to-noise ratio (SNR) was fixed at 25 dB. Input data is processed in a streaming fashion using a 1024-sample buffer for temporary storage, and the blind equalizer filter has 21 coefficients.
3.4. Latency and Sample Rate Analysis
Figure 8 illustrates the latency measured in clock cycles for both floating-point and fixed-point representations across different binary word sizes ($B$) for a 4-QAM modulated input. The experimental results demonstrate a clear direct relationship between the number of bits and latency: as $B$ decreases from 32 (floating-point) to 16 bits, the latency decreases significantly, with optimal performance at FxP16,3. This reduction in latency can be attributed to the lower complexity of arithmetic operations in fixed-point representations, which require fewer hardware resources and fewer computation cycles than floating-point implementations.
Notably, the position of the decimal point ($P$) does not affect the latency or the sample rate shown in Figure 9. The sample rate remains relatively constant across different values of $P$ for a given word size $B$, indicating that the fractional bit allocation does not affect the system’s throughput. This observation is significant because it allows designers to optimize the fractional precision ($P$) independently to improve accuracy without sacrificing data rate.
However, the parameter $P$ does modify the accuracy of the algorithm, as demonstrated in Figure 10, which presents the MMSE changes for different combinations of $B$ and $P$. This relationship reveals a critical trade-off: while reducing $B$ improves latency and throughput, the choice of $P$ directly affects numerical precision and, consequently, equalization performance. The MMSE results show distinct performance regions in which certain $(B, P)$ combinations yield superior accuracy, providing a roadmap for selecting optimal numerical representations that meet target hardware constraints.
The consumption of hardware resources (BRAM, DSP48E, FF, LUT) is presented in Figure 11. CMA and MMA share the same CuPUs, except for the distance calculation. However, the figure shows no difference in the number of DSPs and BRAMs across configurations; for this reason, we used the same results for both blind algorithms. Figure 11 is also used to adjust the design so that it fits on the selected SoC-FPGA: when any hardware resource exceeds 100% of what the selected device provides, the interval of the parallelization directive is increased, sacrificing performance.
Figure 12 shows the MMSE of several fixed-point representations for 4-QAM, 16-QAM, and 64-QAM. This information is used to select the minimum representation with a low MMSE. For example, a viable implementation for 64-QAM is FxP18,2 or FxP20,2 using the scale factor in Table 1. Figure 13 shows the result of applying a scale factor to the 64-QAM input data for the MMA blind equalizer implementation. The scale factor can then be reversed to obtain the same visualization as the original design.
4. Discussion
The results indicate that blind equalization algorithms such as CMA and MMA can be implemented on a low-cost SoC-FPGA with limited hardware resources when the numerical representation is optimally defined. By structuring the design around Custom Processing Units and employing high-level synthesis with parallelization techniques, it becomes feasible to achieve processing rates of several mega-samples per second. Overall, these findings help narrow the gap between theoretical, floating-point signal processing methods commonly seen in academic simulations and the real-time execution required for programmable hardware applications.
In the context of Edge Computing and localized Software-Defined Radio (SDR), this work underscores the importance of considering real-time performance and cost constraints early in the algorithm design process. The investigation into fixed-point formats and their effects on latency, throughput, and resource consumption provides a clear alternative for algorithms implemented on FPGAs without floating-point arithmetic. Managing the trade-off between precision and hardware efficiency is crucial to making advanced techniques such as equalization, channel estimation, and impairment correction applicable on a broader scale in decentralized telecommunications, IoT gateways, and resource-constrained Edge nodes.
The study suggests the potential of combining aggressive fixed-point quantization with AI-based modules that compensate for any loss in precision, looking toward the future of more complex DSP developments. A key question remains: how much numerical precision can be sacrificed to maximize parallelism and reduce costs while still allowing a trained model to reconstruct or refine the output signal effectively? We can establish design guidelines for AI modules, including acceptable distortion levels, latency requirements, and robustness to channel variations. The integration of numerical optimization and hardware parallelization advances the deployment of complex adaptive equalizers into next-generation Edge communication systems.
While state-of-the-art FPGA implementations targeting optical backbone networks achieve multi-Gigabit throughputs using high-end Virtex Ultrascale devices and massive parallelization, our approach deliberately targets a different operational domain. By sacrificing maximum throughput (achieving 4 MS/s), the proposed DSE methodology enables the deployment of complex CMA/MMA equalizers on highly resource-constrained, low-cost Edge SoC-FPGAs. This demonstrates a novel trade-off: trading extreme data rates for critical reductions in power, area footprint, and silicon cost, which is essential for decentralized IoT gateways and localized SDR applications.
To contextualize the efficiency of our proposed DSE framework,
Table 2 compares our optimal fixed-point architecture with recent FPGA-based blind equalizers. Direct throughput comparisons across disparate application domains often mask the underlying engineering challenges. High-end implementations targeting optical backbones inherently achieve Gigabit speeds by heavily relying on massive, power-hungry FPGAs. In contrast, the true novelty of our work lies in the extreme compression of complex stochastic equalizers for severely constrained Edge environments. By strictly bounding the fixed-point quantization via our DSE methodology, we deliberately trade off peak data rates for ultra-low resource utilization, successfully demonstrating that robust CMA/MMA equalization is viable on low-cost, low-power IoT and SDR gateway nodes.
It is important to emphasize that direct data-rate comparisons across the literature require careful interpretation. While backbone network architectures prioritize maximizing throughput by using resource-intensive FPGAs (e.g., [
20]), our DSE framework is mathematically tailored for severely constrained Edge SoC platforms. Furthermore, evaluating effective throughput in the hardware literature can be ambiguous; for instance, Ashmawy et al. [
21] report a maximum synthesis clock frequency ($f_{max}$) of 22 MHz, but the effective symbol throughput (MS/s) remains unspecified. Similarly, while works like Hanzálek et al. [
22] and Kolosovs [
23] explore the architectural trade-offs of blind equalizers, our approach uniquely provides a fully transparent resource footprint. By mathematically bounding the fixed-point quantization, our design guarantees stable symbol recovery while minimizing logic area, making it ideal for decentralized IoT nodes.
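The gap between a reported clock frequency and the effective symbol rate can be made concrete with a small calculation. The sketch below assumes a design that accepts one new symbol every `cycles_per_symbol` clock cycles. The function name and the cycle counts are hypothetical; only the 22 MHz figure comes from the text.

```python
def symbol_rate_msps(clock_mhz: float, cycles_per_symbol: float) -> float:
    """Effective symbol rate in MS/s for a design that needs
    cycles_per_symbol clock cycles between consecutive input symbols."""
    if cycles_per_symbol <= 0:
        raise ValueError("cycles_per_symbol must be positive")
    return clock_mhz / cycles_per_symbol


# Same 22 MHz clock, two hypothetical initiation intervals.
fully_pipelined = symbol_rate_msps(22.0, 1)  # one symbol per clock cycle
iterative = symbol_rate_msps(22.0, 8)        # eight cycles per symbol
```

The same clock frequency yields very different symbol rates depending on how many cycles each symbol occupies, which is why a clock frequency alone does not allow a throughput comparison.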
Table 2.
Performance comparison of FPGA-based blind equalizers.
| Work / Year | Target Domain | Algorithm | Hardware Platform | Data Format | Throughput |
|---|---|---|---|---|---|
| Wang et al. (2023) [20] | Optical Backbone | Parallel CADAMA | Kintex UltraScale | Floating/Custom | ~15 Gbps |
| Hermanek and Hanzalek (2006) [22] | Wireless Comm. | Modified CMA | Generic FPGA | Fixed/Floating | Variable |
| Kolosovs (2022) [23] | Wireless QAM | Adaptive Blind Eq. | Generic FPGA | Fixed-Point | Variable |
| Ashmawy et al. (2009) [21] | Edge/Wireless | CMA/MMA | Zynq-7000 SoC | Fixed-Point | Not Reported |
| Our Work (Proposed) | Edge / IoT nodes | CMA / MMA | Zynq Ultra96-V2 | Optimized FxP16,3 | 4 MS/s |
5. Conclusions
We presented a methodology for minimizing hardware utilization on SoC-FPGA devices and applied it to CMA and MMA blind equalization. The design-space exploration relied on the definition of Custom Processing Units, parallelization through high-level synthesis, emulation on the SoC-FPGA during the design stage, and a numeric representation chosen for minimum hardware utilization. Taking floating-point precision as the reference, the fixed-point representation improved performance. The methodology helps select an adequate fixed-point format for the chosen hardware, reaching more than four mega-samples per second on a low-cost Ultra96 board. The DSE framework thus provides a practical, resource-efficient blueprint for deploying adaptive equalizers in real-time Edge telecommunication nodes, and it shows that advanced DSP is feasible on low-end silicon without the area or power costs associated with optical-grade hardware.
Author Contributions
Conceptualization, D.M.-V. and N.G.-G.; methodology, D.M.-V. and N.G.-G.; software, D.M.-V.; validation, D.M.-V., L.J.M.-G., N.G.-G. and M.B.M.; formal analysis, D.M.-V. and N.G.-G.; investigation, D.M.-V. and N.G.-G.; resources, N.G.-G. and M.B.M.; data curation, D.M.-V.; writing—original draft preparation, D.M.-V., L.J.M.-G. and N.G.-G.; writing—review and editing, D.M.-V., L.J.M.-G., N.G.-G. and M.B.M.; visualization, D.M.-V.; supervision, N.G.-G. and M.B.M.; project administration, D.M.-V., N.G.-G. and M.B.M.; funding acquisition, N.G.-G. and M.B.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Office of Naval Research Global (ONRG) and the Air Force Office of Scientific Research/Southern Office of Aerospace Research and Development (AFOSR/SOARD) grant number N62909-24-1-2088, with financial support by the European Regional Development Fund within the Operational Program “Bulgarian national recovery and resilience plan” and the procedure for direct provision of grants “Establishing of a network of research higher education institutions in Bulgaria”, under the Project BG-RRP-2.004-0005 “Improving the research capacity and quality to achieve international recognition and resilience of TU-Sofia”.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Acknowledgments
Grammarly was used for grammar correction.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
| Abbreviation | Definition |
|---|---|
| FPGA | Field-Programmable Gate Array |
| DSE | Design Space Exploration |
| CMA | Constant Modulus Algorithm |
| MMA | Multi-Modulus Algorithm |
| MMSE | Minimum Mean Square Error |
| ASIC | Application-Specific Integrated Circuit |
| SNR | Signal-to-noise ratio |
| QAM | Quadrature amplitude modulation |
| SoC | System-on-Chip |
| CPU | Central Processing Unit |
| CuPU | Custom Processing Unit |
| BRAM | Block Random Access Memory |
| DSP | Digital Signal Processing |
| FF | Flip-flop |
| LUT | Look-Up Table |
References
1. He, Z.; Vijayan, K.; Mirani, A.; Karlsson, M.; Schröder, J. Inter-Channel Interference Cancellation for Long-Haul Superchannel System. J. Light. Technol. 2024, 42, 48–56.
2. Gandhare, S.; Karthikeyan, B. Survey on FPGA Architecture and Recent Applications. In Proceedings of the 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), Vellore, India, 30–31 March 2019; pp. 1–4.
3. Sakib, S.; Faizullin, D.; Koga, Y.; Uetsuhara, M.; Onishi, S. In-Orbit FPGA reprogramming device for small satellites. Adv. Space Res. 2023, 71, 4549–4556.
4. Zhang, H.; Zuo, J.; Zheng, H.; Liu, S.; Luo, M.; Zhao, M. Implementing Neural Networks on Nonvolatile FPGAs with Reprogramming. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3961–3972.
5. Dano, E.B. A Case Study on the Architectural Development of a Transceiver Assembly. In Proceedings of the 2019 International Symposium on Systems Engineering (ISSE), Edinburgh, UK, 1–3 October 2019; pp. 1–6.
6. Junior, J.C.; Balza, M.; Silva, S.N.; da Silva, L.M.; Fernandes, M.A. Optimizing Tactile Feedback Devices with Fixed-Point Computation on FPGA. Circuits Syst. Signal Process. 2025, 44, 3826–3864.
7. Tripathi, S.L.; Mahmud, M.; Balas, V.E. FPGA implementation for explainable machine learning and deep learning models to real-time problems. In Machine Learning Models and Architectures for Biomedical Signal Processing; Elsevier: Amsterdam, The Netherlands, 2025; pp. 449–471.
8. Seyoum, B.; Pagani, M.; Biondi, A.; Balleri, S.; Buttazzo, G. Spatio-Temporal Optimization of Deep Neural Networks for Reconfigurable FPGA SoCs. IEEE Trans. Comput. 2021, 70, 1988–2000.
9. Putra, A.P.; Fuada, S.; Aska, Y.; Adiono, T. System-on-Chip architecture for high-speed data acquisition in visible light communication system. In Proceedings of the 2016 International Symposium on Electronics and Smart Devices (ISESD), Bandung, Indonesia, 29–30 November 2016; pp. 63–67.
10. Wang, X.; Gu, H.; Yang, Y.; Wang, K. A Group-Based Laser Power Supply Scheme for Photonic Network on Chip. IEEE Photonics J. 2019, 11, 1502614.
11. Pandey, B.; Das, B.; Kaur, A.; Kumar, T.; Khan, A.M.; Akbar Hussain, D.; Tomar, G.S. Performance evaluation of FIR filter after implementation on different FPGA and SOC and its utilization in communication and network. Wirel. Pers. Commun. 2017, 95, 375–389.
12. Diouri, O.; Gaga, A.; Ouanan, H.; Senhaji, S.; Faquir, S.; Jamil, M.O. Comparison study of hardware architectures performance between FPGA and DSP processors for implementing digital signal processing algorithms: Application of FIR digital filter. Results Eng. 2022, 16, 100639.
13. Rasoulinezhad, S.; Zhou, H.; Wang, L.; Leong, P.H. PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 28 April–1 May 2019; pp. 35–44.
14. Schmalen, L.; Lauinger, V.; Ney, J.; Wehn, N.; Matalla, P.; Randel, S.; Bank, A.v.; Edelmann, E.M. Recent Advances on Machine Learning-aided DSP for Short-reach and Long-haul Optical Communications. In Proceedings of the 2025 Optical Fiber Communications Conference and Exhibition (OFC), San Francisco, CA, USA, 30 March–3 April 2025; pp. 1–3.
15. Xu, M.; Zhang, J.; Zhang, H.; Jia, Z.; Wang, J.; Cheng, L.; Campos, L.A.; Knittle, C. Multi-Stage Machine Learning Enhanced DSP for DP-64QAM Coherent Optical Transmission Systems. In Proceedings of the 2019 Optical Fiber Communications Conference and Exhibition (OFC), San Francisco, CA, USA, 3–7 March 2019; pp. 1–3.
16. Marquez-Viloria, D.; Castano-Londono, L.; Guerrero-Gonzalez, N. A Modified KNN Algorithm for High-Performance Computing on FPGA of Real-Time m-QAM Demodulators. Electronics 2021, 10, 627.
17. Guerrero-Gonzalez, N.; Marquez-Viloria, D.; Alzate-Anzola, C. SoC-FPGA based Emulation of CMA Equalizer for Time-Variant Optical Communication Channel. In Proceedings of the Latin America Optics and Photonics Conference; Optica Publishing Group: Washington, DC, USA, 2018; p. Tu2E.1.
18. IEEE 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE: New York, NY, USA, 2019.
19. Yang, J.; Werner, J.J.; Dumont, G. The multimodulus blind equalization algorithm. In Proceedings of the 13th International Conference on Digital Signal Processing; IEEE: Piscataway, NJ, USA, 1997; Volume 1, pp. 127–130.
20. Wang, J.; Zhao, L.; Li, J.; Yu, X.; Bu, X. A parallel CADAMA-based equalization algorithm for FPGA implementation in terahertz communication systems. In Proceedings of the Ninth Symposium on Novel Photoelectronic Detection Technology and Applications; SPIE: Bellingham, WA, USA, 2023; Volume 12617, p. 126171N.
21. Ashmawy, D.; Abdel-Raheem, E.; Mansour, H.; Youssif, M.; Mohanna, M. FPGA implementation of blind adaptive decision feedback equalizer. In Proceedings of the 2009 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT); IEEE: Piscataway, NJ, USA, 2009; pp. 495–500.
22. Hermanek, A.; Hanzalek, Z. Efficient FPGA Implementation of Equalizer for Finite Interval Constant Modulus Algorithm. In Proceedings of the IECON 2006 - 32nd Annual Conference on IEEE Industrial Electronics; IEEE: Piscataway, NJ, USA, 2006; pp. 3562–3567.
23. Kolosovs, D. Adaptive Blind Equalization Algorithms for QAM Systems; RTU Press: Riga, Latvia, 2022.
Figure 1.
High-level architectural block diagram of the blind equalization system. The design is modularized into Custom Processing Units (CuPUs) to efficiently distribute the computational load between distance calculation and filter coefficient updates.
Figure 2.
Detailed internal architecture of the Custom Processing Unit (CuPU) dedicated to Euclidean distance calculation for the Constant Modulus Algorithm (CMA). The block processes the filter output to compute the error gradient based on the constant modulus criterion, integrating complex vector–vector multiplication (VVM) within the iterative loop.
Figure 3.
Detailed internal architecture of the Custom Processing Unit (CuPU) for distance calculation in the Multi-Modulus Algorithm (MMA). Unlike CMA, this unit processes the real and imaginary components independently to compute the error gradient, relying on the statistically derived dispersion constant.
Figure 4.
Pipelined processing in the VVM block. Sequential pipeline stages along the time axis: input “1” multiplies with matrix A (producing A1) in stage 1, input “2” multiplies with matrix B (producing B2) in stage 2, followed by addition (A1 + B2) and data transfer to demonstrate throughput enhancement.
Figure 5.
Parallel processing in the FIR filter unit. Input data “1” multiplies with block A (yielding A1) and data “2” multiplies with block B (yielding B2) simultaneously, as indicated by the red connection lines highlighting parallel multiplications. A1 and B2 are then added (green block) before data transfer along the time axis.
Figure 6.
Block diagram of the emulated time-variant multipath communication channel implemented on the processing system (PS) of the SoC-FPGA. The model introduces dynamic channel impairments, including multipath fading via a variable-coefficient FIR filter and additive white Gaussian noise (AWGN), to rigorously evaluate the tracking capabilities of the hardware equalizer.
Figure 7.
Constellation diagrams of the recovered signals extracted from the bit-true fixed-point (FxP) emulation of the proposed hardware architecture for (top) 4-QAM, (center) 16-QAM, and (bottom) 64-QAM. By mathematically modeling the exact scaling factor and the word-length saturation limits of the Edge SoC-FPGA, these visual results confirm that the steady-state equalizer successfully mitigates channel distortion and maintains reliable signal integrity without divergence.
Figure 8.
Latency (measured in clock cycles) as a function of the numerical representation (word length B and fractional bits P) for the blind equalization algorithm processing a 4-QAM signal on the Ultra96 SoC-FPGA. The hardware synthesis demonstrates that reducing the total word length from a 32-bit floating-point baseline significantly decreases the computational latency.
Figure 9.
Sample rate vs. numeric representations of blind equalization algorithm implementations on Ultra96 for 4-QAM.
Figure 10.
Impact of the fixed-point fractional precision (P) and total word length (B) on the Minimum Mean Square Error (MMSE) for the 4-QAM modulation. The graph highlights the quantization noise trade-off, revealing the optimal configurations that minimize the equalization error without requiring highly resource-intensive floating-point arithmetic.
Figure 11.
Hardware resources (BRAM, DSP48E, FF, LUT) analysis of blind equalization algorithm implementations on Ultra96 for 4-QAM.
Figure 12.
Comparative Minimum Mean Square Error (MMSE) performance across 4-QAM, 16-QAM, and 64-QAM modulations under various fixed-point numerical representations. Higher-order modulations (e.g., 64-QAM) exhibit greater sensitivity to quantization noise, thereby necessitating a more precise bit-width allocation to maintain the error within acceptable steady-state convergence thresholds.
Figure 13.
Demonstration of the scaling factor methodology applied to the MMA blind equalizer for a high-density 64-QAM signal. By mathematically shifting the input bits prior to the error calculation, the dynamic range is successfully compressed, preventing hardware overflow and enabling steady-state convergence within the strictly minimized logic resources of the SoC-FPGA.
Table 1.
Scale factor and corresponding optimized step-size parameters required to maintain hardware stability and prevent gradient divergence across different QAM modulation orders within the fixed-point implementation.
| Modulation | Scale Factor | Step Size |
|---|---|---|
| 4-QAM | 1 | 0.0035 |
| 4-QAM | 2 | 0.04 |
| 16-QAM | 1 | 0.000015 |
| 16-QAM | 4 | 0.01 |
| 64-QAM | 1 | 0.0000015 |
| 64-QAM | 8 | 0.002 |
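The error terms described in the captions of Figures 2 and 3 and the role of the scale factor in Table 1 can be sketched in Python using the textbook CMA and MMA stochastic-gradient updates. This is a floating-point illustration under stated assumptions, not the synthesized CuPU design: all function names are hypothetical, the constellation uses odd-integer levels, the filter output is taken as y = wᴴx, and the scale factor is assumed to divide the input amplitude, as the Figure 13 caption suggests.

```python
import numpy as np


def qam_constellation(order):
    """Square QAM points with odd-integer levels, e.g. 4-QAM -> {+-1 +-1j}."""
    m = int(np.sqrt(order))
    levels = np.arange(-(m - 1), m, 2)
    return np.array([complex(i, q) for i in levels for q in levels])


def cma_radius(points):
    """CMA dispersion constant E[|a|^4] / E[|a|^2]."""
    power = np.abs(points) ** 2
    return np.mean(power ** 2) / np.mean(power)


def mma_radius(points):
    """MMA dispersion constant E[a_r^4] / E[a_r^2] (identical for the
    imaginary part of a square constellation)."""
    re = points.real
    return np.mean(re ** 4) / np.mean(re ** 2)


def cma_error(y, radius):
    """CMA error: depends only on the modulus of the output."""
    return y * (np.abs(y) ** 2 - radius)


def mma_error(y, radius):
    """MMA error: real and imaginary parts are treated independently."""
    return complex(y.real * (y.real ** 2 - radius),
                   y.imag * (y.imag ** 2 - radius))


def equalizer_step(w, x, mu, radius, error_fn):
    """One stochastic-gradient update of the taps w for input vector x."""
    y = np.vdot(w, x)                # y = w^H x (vdot conjugates w)
    e = error_fn(y, radius)
    w_next = w - mu * np.conj(e) * x
    return w_next, y, e


# Assumed use of the scale factor: divide the symbols (and hence the
# dispersion constant's reference constellation) before the error is formed.
scale = 2
points = qam_constellation(4) / scale
radius = cma_radius(points)
w = np.array([1.0 + 0.0j])
w, y, e = equalizer_step(w, np.array([points[0]]), 0.04, radius, cma_error)
```

Under this reading, dividing the input by the scale factor shrinks the cubic error term much faster than the signal itself, which is consistent with the larger step sizes that Table 1 pairs with larger scale factors.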