1. Introduction
Coherent free-space optical communication (FSOC) has emerged as the preferred solution for high-speed, long-distance inter-satellite links owing to its vast channel capacity, superior sensitivity, spectral efficiency, and interference resistance [1,2,3]. Although it requires sophisticated hardware components, such as narrow-linewidth lasers and high-performance processors that support real-time digital signal processing, its performance benefits are essential in mission-critical scenarios, including deep-space exploration and high-capacity inter-satellite data transmission. This trend has been supported by rapid advancements in high-speed data converters, with analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) now achieving sampling rates in the tens of giga-samples per second (GSPS), providing a robust hardware foundation for ultra-high-speed transceivers [4]. However, a major bottleneck persists between the front-end sampling rate and the back-end digital signal processing (DSP) capability [5]. The clock frequencies of contemporary baseband processors, such as field-programmable gate arrays (FPGAs) and digital signal processors, lag far behind these ultra-high sampling rates. To enable true real-time operation, parallel processing architectures must be adopted [6]. By decomposing a high-speed data stream into multiple lower-speed parallel streams, DSP algorithms can be executed on baseband processors operating at feasible clock rates.
In high-speed coherent baseband DSP, parallelization techniques generally fall into three main categories: (i) Polyphase decomposition-based architectures: Widely used for parallelizing finite impulse response (FIR) filters, this method partitions filter coefficients into multiple subfilters operating on decimated input streams, then recombines the outputs. When combined with coefficient symmetry, it reduces per-branch computational load and enables efficient hardware mapping [7]. (ii) Look-ahead and buffered memory structures: Buffered memory schemes exploit pipelining and interleaving to maintain continuous operation across parallel streams. Look-ahead transformations are often applied to recursive filters to mitigate feedback latency while sustaining throughput [8]. This approach is also effective for FIR filters via distributed parallel structures [9]. However, in high-order feedback loops such as phase-locked loops (PLLs), maintaining exact serial equivalence at high parallelism factors increases design complexity sharply due to strong dependence on loop order and parallelism degree. (iii) Frequency-domain parallel computing: This method transforms time-domain DSP operations into the frequency domain using the fast Fourier transform (FFT), executes parallel computations, and then employs the inverse FFT (IFFT) to restore the time-domain signal. It is particularly advantageous for block-based algorithms, including certain equalizers and synchronization schemes.
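The polyphase idea in category (i) can be illustrated with a minimal Python sketch. It is not taken from the cited works: the function name `polyphase_parallel_fir` and the choice of NumPy are assumptions made here for illustration. The filter taps are split into `n_par` subfilters, each subfilter runs on one decimated input stream, and the branch outputs are recombined (with a one-block delay where the branch index wraps around).

```python
import numpy as np

def polyphase_parallel_fir(x, h, n_par):
    """N-parallel FIR filtering via polyphase decomposition (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    assert len(x) % n_par == 0, "input length must be a multiple of n_par"
    n_blk = len(x) // n_par
    # Zero-pad the taps so they split evenly into n_par subfilters.
    h_pad = np.concatenate([h, np.zeros((-len(h)) % n_par)])
    sub = [h_pad[r::n_par] for r in range(n_par)]      # subfilter r holds h[N*m + r]
    streams = [x[j::n_par] for j in range(n_par)]      # stream j holds x[N*k + j]
    y = np.zeros(len(x))
    for i in range(n_par):                             # output branch i: y[N*k + i]
        acc = np.zeros(n_blk)
        for r in range(n_par):
            j = (i - r) % n_par                        # input stream feeding this term
            branch = np.convolve(streams[j], sub[r])[:n_blk]
            if i < r:
                # The stream index wrapped around, so the term refers to the
                # previous block: delay it by one low-rate sample.
                branch = np.concatenate([[0.0], branch[:-1]])
            acc += branch
        y[i::n_par] = acc
    return y
```

Every inner convolution runs at 1/N of the input rate, which is what allows each branch to be clocked at a feasible FPGA frequency. The output matches a direct serial convolution truncated to the input length.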
Parallel architectures for coherent receivers can be classified into three categories: open-loop (feedforward), closed-loop (feedback), and hybrid systems. (i) Open-loop (feedforward) parallel receivers: Feedforward architectures are inherently suited for parallelization as they operate on blocks of data without recursive dependencies. This structure allows for deep pipelining in hardware (FPGA/ASIC) implementations, making them a popular choice for ultra-high-speed systems [10,11]. Classic blind feedforward algorithms, such as the Viterbi–Viterbi (VV) and Blind Phase Search (BPS) methods, estimate carrier phase and frequency offset by processing a block of received symbols at once [12,13,14]. While effective, their performance can be sensitive to block size, and they may struggle with very large frequency offsets or rapid phase fluctuations. To enhance robustness, data-aided or pilot-aided feedforward schemes are often employed, particularly for initial wide-range frequency acquisition [15,16,17,18]. (ii) Closed-loop (feedback) parallel receivers: Typically based on PLLs, these architectures excel at tracking time-varying impairments such as laser phase noise or frequency drift. However, their recursive nature, where each symbol’s processing depends on previous results, creates a “feedback bottleneck” for parallelization. Several strategies exist: (a) Polyphase-based architectures: One classical approach is to use look-ahead transformations, often realized through polyphase decomposition. This receiver architecture is called a parallel receiver (PRX) [19]. This technique mathematically unfolds the serial loop recursion into a set of parallel equations, allowing multiple output symbols to be computed simultaneously. While this creates a structurally parallel system, it often results in a significant increase in computational complexity per symbol processed [20]. (b) Frequency-domain transform: A frequency-domain closed-loop architecture, known as the Alternative Parallel Receiver (APRX), leverages the properties of the fast Fourier transform (FFT) [21]. By transforming a block of signals into the frequency domain, the time-domain convolution within the feedback loop becomes a simple multiplication. Architectures like the APRX perform loop filtering and correction in the frequency domain before transforming the signal back to the time domain [22,23,24]. This approach enables efficient block-based parallel processing. (c) Multi-stream averaging or simplified updates: These approximate schemes average error signals from all streams to produce a common correction term [25], reducing complexity but slightly degrading tracking accuracy. (d) Buffered structures with serial equivalence: Designs aiming for exact equivalence with serial loops rely heavily on look-ahead derivations [26]. Although theoretically applicable to arbitrary parallelism, their complexity grows rapidly with the loop order and parallelism factor, limiting practical use for high-order PLLs. (iii) Hybrid parallel receivers: These architectures combine feedforward and feedback stages to leverage both wide capture range and fine tracking [12,16,27]. A common approach is to use a feedforward stage for coarse frequency/phase estimation, followed by a parallelized feedback loop for precise tracking, thus balancing acquisition speed and steady-state performance.
Despite these diverse approaches, achieving high tracking accuracy, low complexity, and minimal latency in a parallel carrier-recovery system remains an open challenge. Ideally, such a system would offer the robustness of feedback loops and the architectural simplicity of feedforward methods.
Motivated by this goal, we propose a state-space-based parallelization framework that unifies the design of both open-loop and closed-loop DSP algorithms. The main contributions of this work are as follows: (i) We propose a unified state-space modeling framework that automatically maps a broad class of serial DSP algorithms (FIR/IIR, PLL-based carrier recovery, etc.) into parallel multi-input multi-output architectures using only matrix operations. Unlike prior look-ahead and polyphase-based derivations, the proposed procedure is independent of loop order and parallelism degree, and therefore scales gracefully to high-order, high-parallelism feedback systems. (ii) Beyond structural parallelization, we introduce the notion of parallel equivalent delay (PED), which explicitly captures both structural and computational latency in state-space-based parallel feedback loops. We show analytically that PED induces right-half-plane zeros in the loop transfer function, a phenomenon not treated in earlier parallelization-oriented works. (iii) Based on this delay model, we derive a Throughput–Bandwidth Product (TBP) constraint, which links achievable loop bandwidth to implementation-level delay. (iv) We validate the framework through both numerical simulation and FPGA implementation (50-way parallel, 15.625 Gsps), and empirically confirm the predicted TBP behavior.
The remainder of this paper is organized as follows: Section 2 introduces the state-space framework and details the systematic procedure for parallelizing any FIR filter. In Section 3, we apply this framework to a practical example, deriving a highly parallel structure for the second-order Costas loop. Section 4 presents an analysis of the stability of parallel feedback systems, introducing the concept of parallel equivalent delay (PED) and deriving the fundamental Throughput–Bandwidth Product limit. In Section 5, we provide simulation results to validate the proposed framework, verify our theoretical analysis, and compare its performance against conventional feedforward algorithms. Section 6 presents the hardware implementation results on an XCVU13P FPGA (AMD, San Jose, CA, USA), where we demonstrate a 50-parallel design achieving 15.625 Gsps and empirically verify the theoretical TBP trade-off under actual timing constraints. Finally, Section 7 concludes the paper.
3. Application to Feedback Systems: The Parallel Costas Loop
This paper presents a parallel implementation of the Costas loop for QPSK demodulation, as illustrated in Figure 1. The Costas loop comprises a quadrature mixer, a phase detector, a loop filter, and a numerically controlled oscillator (NCO). Among these components, the quadrature mixers and phase detectors, being memoryless, are parallelized by simple replication across N channels, with each channel n operating as:
3.1. Parallel NCO
A numerically controlled oscillator (NCO) is primarily composed of three key components: a phase accumulator, a phase register, and a Look-Up Table (LUT). The phase accumulator, driven by the sampling clock, performs discrete-time integration through incremental summation of the frequency control word (FCW), effectively functioning as an integrator. The bit width of the phase accumulator is denoted by W, and it automatically wraps around upon overflow, which is equivalent to performing a modulo-$2^W$ operation on the integration result. The phase register stores the accumulated phase value, which serves as the address input to the LUT. The LUT, in turn, establishes a mapping between the input address and the corresponding output amplitude value.
For parallel implementation, both the modulo operation (inherent in the accumulator’s wrap-around) and the LUT can be readily realized by simple replication across multiple processing paths. When parallelizing the integrator component, its transfer function is considered to be:
For this system, the discrete-time state-space matrices are:
The corresponding state-space equations are:
From (8) to (11), the matrices for an N-parallel system can be derived as follows:
The N-channel parallelized state-space equations are then:
As depicted in Figure 2, these components are combined to achieve an N-path parallel NCO implementation.
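To make the parallel accumulator concrete, the following Python sketch compares a serial phase accumulator with an N-path block version. It is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names are hypothetical, the accumulator is assumed to output its phase before adding the current FCW, and the per-block prefix sum stands in for the lifted state-space matrices.

```python
import numpy as np

def nco_phase_serial(fcw, width):
    """Serial W-bit phase accumulator: output the phase, then add the FCW."""
    mask = (1 << width) - 1
    acc, out = 0, []
    for w in fcw:
        out.append(acc)
        acc = (acc + int(w)) & mask          # wrap-around = modulo 2**width
    return np.array(out, dtype=np.int64)

def nco_phase_parallel(fcw, width, n_par):
    """N-path accumulator: n_par phases per step from one shared state."""
    fcw = np.asarray(fcw, dtype=np.int64)
    assert len(fcw) % n_par == 0, "length must be a multiple of n_par"
    mask = (1 << width) - 1
    state = 0
    out = np.empty(len(fcw), dtype=np.int64)
    for k in range(0, len(fcw), n_par):
        block = fcw[k:k + n_par]
        # Path i sees the block-start state plus the sum of the i earlier FCWs.
        prefix = np.concatenate(([0], np.cumsum(block)[:-1]))
        out[k:k + n_par] = (state + prefix) & mask   # modulo replicated per path
        state = (state + int(block.sum())) & mask    # single state update per block
    return out
```

Only the shared state is updated recursively, once per block; the modulo and any subsequent LUT lookup are replicated per path, as described above.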
3.2. Parallel Loop Filter
The loop filter of a second-order phase-locked loop (PLL) is a first-order infinite impulse response (IIR) filter. Commonly, a Proportional–Integral (PI) structure is employed for this purpose. Its discrete-time transfer function, H(z), is given by:
where the two gain coefficients are the proportional gain and the integral gain, respectively, and T denotes the sampling period.
As shown in Figure 3, the operation of the PI filter involves two components acting on the phase error signal. The proportional term provides an instantaneous control action based on the current error. The integral term accumulates the historical error, effectively integrating the phase error over time. The total control output signal is the superposition of these two terms.
From a control perspective, increasing the proportional gain generally enhances the loop’s response speed. However, excessively high values can lead to undesirable overshoot or sustained oscillations in the transient response. The integral gain is primarily responsible for eliminating steady-state phase error, thereby ensuring phase lock accuracy. Nonetheless, an overly large integral gain can detrimentally affect loop stability margins.
In the design process of digital PLLs (DPLLs), the proportional and integral gains are typically determined based on the target loop noise bandwidth. This parameter is crucial for achieving the desired trade-off between system stability and dynamic performance characteristics. The loop noise bandwidth quantifies the loop’s susceptibility to input noise; specifically, a narrower bandwidth improves the rejection of high-frequency noise components but consequently results in a slower dynamic response. For a standard second-order type-II PLL, the loop noise bandwidth is related to the loop’s natural angular frequency and damping factor via the expression:
This relationship allows for the determination of the required natural angular frequency based on the specified loop noise bandwidth and the chosen damping factor:
Subsequently, the proportional and integral gains can be calculated using the following formulae, which incorporate the phase detector gain and the oscillator gain:
The damping factor is frequently selected to be approximately 0.707. This value lies below critical damping (a damping factor of 1) and offers a good balance between settling time and overshoot.
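The gain-selection procedure can be sketched in Python using the textbook relations for a second-order type-II loop with a continuous-time PI filter of the form K_p + K_i/s. These formulas are assumed here because the original equations are not reproduced in this text; the paper's own discrete-time expressions may include additional scaling by the sampling period T. All function and parameter names are illustrative.

```python
import math

def noise_bandwidth(omega_n, zeta):
    """Textbook loop noise bandwidth (Hz) of a second-order type-II PLL."""
    return 0.5 * omega_n * (zeta + 1.0 / (4.0 * zeta))

def pll_gains(noise_bw_hz, zeta=1.0 / math.sqrt(2.0), k_d=1.0, k_o=1.0):
    """Return (omega_n, k_p, k_i) for a target noise bandwidth.

    Assumes a PI filter F(s) = k_p + k_i / s, phase detector gain k_d,
    and oscillator gain k_o (continuous-time approximation).
    """
    # Invert the noise-bandwidth relation to get the natural frequency.
    omega_n = 2.0 * noise_bw_hz / (zeta + 1.0 / (4.0 * zeta))
    # Match the closed-loop denominator s^2 + 2*zeta*omega_n*s + omega_n^2.
    k_p = 2.0 * zeta * omega_n / (k_d * k_o)
    k_i = omega_n ** 2 / (k_d * k_o)
    return omega_n, k_p, k_i
```

For example, `pll_gains(1e6)` returns the natural frequency and gains for a 1 MHz noise bandwidth at a damping factor of about 0.707; feeding the returned `omega_n` back into `noise_bandwidth` recovers the target value.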
An alternative representation of the PI filter’s transfer function
, where
is the output and
is the input in the z-domain, can be obtained by defining coefficients
and
. This yields:
This form is often convenient for digital implementation. The corresponding state-space representation of this filter is expressed as:
Based on Equations (
8)–(
11),
The parallelized expression derived from the state-space equations can be formulated as follows:
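As a hedged illustration of how such a parallel form can be obtained with matrix operations alone, the sketch below applies the standard block-lifting construction to a PI filter. It may differ in detail from Equations (8)–(11), which are not reproduced here. The PI realization is assumed to be H(z) = c1 + c2/(1 - z^-1), and the names `lift_siso`, `c1`, and `c2` are illustrative.

```python
import numpy as np

def lift_siso(A, B, C, D, n_par):
    """Block-lift a SISO state-space model to n_par inputs/outputs per step."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.asarray(B, dtype=float).reshape(-1, 1)
    C = np.asarray(C, dtype=float).reshape(1, -1)
    d = float(D)
    pw = [np.linalg.matrix_power(A, i) for i in range(n_par + 1)]
    A_p = pw[n_par]                                            # state jump over N samples
    B_p = np.hstack([pw[n_par - 1 - j] @ B for j in range(n_par)])
    C_p = np.vstack([C @ pw[i] for i in range(n_par)])
    D_p = np.zeros((n_par, n_par))                             # lower-triangular Toeplitz
    for i in range(n_par):
        D_p[i, i] = d
        for j in range(i):
            D_p[i, j] = (C @ pw[i - 1 - j] @ B).item()
    return A_p, B_p, C_p, D_p

def pi_filter_serial(e, c1, c2):
    """Serial PI filter: x[k+1] = x[k] + e[k], y[k] = c2*x[k] + (c1 + c2)*e[k]."""
    x, y = 0.0, []
    for ek in e:
        y.append(c2 * x + (c1 + c2) * ek)
        x += ek
    return np.array(y)

def pi_filter_parallel(e, c1, c2, n_par):
    """Same filter, evaluated n_par samples at a time via the lifted matrices."""
    assert len(e) % n_par == 0, "length must be a multiple of n_par"
    A_p, B_p, C_p, D_p = lift_siso(1.0, 1.0, c2, c1 + c2, n_par)
    x = np.zeros((1, 1))
    y = np.empty(len(e))
    for k in range(0, len(e), n_par):
        u = np.asarray(e[k:k + n_par], dtype=float).reshape(-1, 1)
        y[k:k + n_par] = (C_p @ x + D_p @ u).ravel()
        x = A_p @ x + B_p @ u
    return y
```

The lifted matrices depend only on (A, B, C, D) and N, so the same routine applies unchanged to higher-order filters; the serial and parallel outputs agree to floating-point precision.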
4. Stability Analysis and Design Methodology for Parallel Feedback Loops
While the state-space framework guarantees mathematical equivalence between the serial and parallel implementations, the “parallel equivalent delay” (PED) fundamentally alters the loop characteristics compared to its serial counterpart. In this section, we first analyze the origin and impact of PED within the context of our state-space parallelized PLL, then derive a predictive design guideline that links system throughput, hardware implementation, and maximum achievable loop bandwidth. Finally, we reveal a fundamental trade-off, the “Throughput–Bandwidth Product”, which governs the performance limits of such parallel feedback systems.
4.1. Costas Loop Model and Normalization
For a standard, serial, second-order phase-locked loop (PLL) employing a Proportional–Integral (PI) loop filter, the open-loop transfer function
is given by:
where
is the phase detector gain,
is the oscillator gain, and
is the PI filter’s transfer function. Substituting
yields:
Here,
represents the total loop gain, and
. The system exhibits a double pole at
and a single zero at
. Root locus analysis, as shown in
Figure 4a, reveals that for a stable filter (
), the branches remain entirely within the left-half-plane (LHP) for all positive loop gains (
). This demonstrates that the idealized second-order serial PLL is unconditionally stable, with loop gain K primarily affecting damping characteristics.
The considered demodulator is based on a second-order Costas loop implemented in discrete time, with a given input sampling period and corresponding sampling frequency. The loop filter is realized in a state-space form, which is particularly amenable to parallel and pipelined FPGA implementation. The normalized loop bandwidth is defined as the product of the equivalent noise bandwidth of the Costas loop and the input sampling period. Throughout this section, we analyze how the parallel architecture and its associated delays affect the maximum stable value of this normalized bandwidth.
4.2. Definition of Parallel Equivalent Delay (PED)
In the proposed architecture, the Costas loop is implemented with a parallelization factor
N and an internal processing clock frequency
. In each input sampling period
, the loop processes
N sub-iterations using
N clock cycles of duration
. For a given input sampling rate, the effective parallelization factor is
Parallelism and pipelining introduce an additional “effective” delay into the feedback path of the loop. To relate this implementation-dependent delay to the continuous-time model, we define the parallel equivalent delay (PED) as the total additional group delay experienced by the feedback signal due to the parallel architecture. There are two primary sources contributing to PED:
Structural delay, which arises because the loop update at time index k is based on samples that have passed through a chain of intermediate parallel stages;
Computational delay, which arises from the finite number of serial operations that must be performed within each sampling period, even when parallel hardware is available.
Let
S denote the number of serial operations per input sample that cannot be fully overlapped in the algorithm realization. Under the assumption of a real-time implementation, the total PED can be expressed as
Here, the PED is given in units of input sampling periods. One term models the structural delay associated with the parallel pipeline, while the other captures the cumulative effect of computational latency that scales with the parallelization factor N. This PED model will be used in the following subsections to quantify the impact of the architecture on stability and loop bandwidth.
4.3. Impact of PED on Loop Stability and Bandwidth
The open-loop transfer function for the parallelized system,
, becomes:
To analyze the effect of the delay term $e^{-s\tau}$, where $\tau$ denotes the PED expressed in seconds, we can use a first-order Padé approximation:
$$e^{-s\tau} \approx \frac{1 - s\tau/2}{1 + s\tau/2}$$
Substituting this approximation back into the transfer function replaces the exponential delay factor with this rational term.
The introduction of this delay term fundamentally alters the system’s dynamics. It adds a stable pole in the LHP at $s = -2/\tau$, but more importantly, it introduces a non-minimum-phase zero in the right-half-plane (RHP) at $s = +2/\tau$. The presence of the RHP zero dramatically affects stability. As the degree of parallelization (N) increases, the delay $\tau$ becomes larger. This causes the RHP zero (at $s = 2/\tau$) to move closer to the origin of the s-plane. As shown in Figure 4b, the root locus branches are now “pulled” towards this RHP zero. Consequently, the value of loop gain K at which the root locus crosses the imaginary axis (the threshold of instability) decreases as $\tau$ increases.
In the frequency domain, the delay term introduces an additional phase lag, in radians, equal to the product of the angular frequency and the delay. This phase lag therefore increases linearly with frequency and directly erodes the loop’s phase margin, which is the primary indicator of its stability.
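The effect of the delay on the pole-zero pattern and on the phase response can be checked numerically. The Python sketch below is illustrative only: the symbol `tau` for the delay in seconds and the function names are assumptions introduced here. It returns the zero and pole of the first-order Padé approximant and compares its phase with the exact linear phase lag of a pure delay.

```python
import numpy as np

def pade1_zero_pole(tau):
    """Zero and pole of (1 - s*tau/2) / (1 + s*tau/2), the first-order Padé form."""
    return 2.0 / tau, -2.0 / tau

def delay_phase(omega, tau):
    """Exact phase (rad) of exp(-j*omega*tau): a lag linear in frequency."""
    return -np.asarray(omega, dtype=float) * tau

def pade1_phase(omega, tau):
    """Phase (rad) of the first-order Padé approximant evaluated at s = j*omega."""
    return -2.0 * np.arctan(np.asarray(omega, dtype=float) * tau / 2.0)
```

Doubling `tau` halves the location of the RHP zero, moving it towards the origin. The approximant tracks the exact phase lag closely when the product of angular frequency and delay is well below 1 and underestimates the lag at higher frequencies, so it is best used for qualitative root-locus reasoning rather than exact margin calculations.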
This degradation of phase margin leads to a critical performance limitation: for a given level of parallelism, there exists a maximum achievable loop bandwidth beyond which stability cannot be guaranteed. A robust feedback system typically requires a phase margin of at least 45°. The phase shift introduced by PED at the crossover frequency must not consume the entire available margin. By setting a limit on the maximum allowable phase degradation, we obtain a fundamental design constraint:
The constant C is an empirically validated design parameter that encapsulates the complexities of the loop’s stability requirements. Its value is determined by the trade-off between stability (phase margin) and the tolerable degradation from the ideal serial loop.
Equation (64) reveals the key impact of PED on loop performance: for a fixed sampling period, the maximum stable normalized loop bandwidth is approximately inversely proportional to the equivalent normalized delay, which in turn is determined by the PED through Equation (60).
4.4. Throughput–Bandwidth Product (TBP) and the Stability Constant C
To summarize the trade-off between loop bandwidth and implementation delay in a compact form, we introduce the Throughput–Bandwidth Product (TBP). For a given loop architecture, the TBP is defined as
Combining (64) and (65) yields
We therefore interpret C as a stability constant of the considered loop implementation: for a fixed loop order, loop filter structure, and phase-margin requirement, the TBP is expected to remain approximately constant and numerically equal to C.
In this context, the “throughput” aspect is implicitly captured by the parallelization factor N and the processing clock period, which together determine the PED and its normalized form. The “bandwidth” aspect is quantified by the normalized maximum loop bandwidth. The TBP relation (66) thus formalizes the intuitive statement that, for a given loop topology, any increase in PED (or equivalently in its normalized form) must be compensated by a reduction in the achievable loop bandwidth to maintain stability.
4.5. Design Implications of PED and TBP
The PED and TBP analyses provide direct guidance for the design of parallel FPGA implementations of Costas loops and related feedback loops. For a fixed input sampling rate and a given throughput requirement, the designer mainly controls two architectural parameters: (a) the parallelization factor N, which influences the PED through both the structural delay and the computational delay terms in (60); (b) the internal processing clock frequency, which determines how much computation can be performed within each sampling period without excessively increasing N.
From (60) and (64), it follows that increasing N while keeping the processing clock frequency relatively low tends to increase the PED and its normalized counterpart, thereby reducing the maximum stable normalized loop bandwidth. Conversely, for the same throughput target, operating the processing logic at a higher clock frequency allows the use of a smaller N, which decreases the equivalent delay and thus enables a larger achievable loop bandwidth. In practical terms, this leads to the following design guideline: For a given throughput constraint, loop-bandwidth performance is optimized by minimizing the PED, i.e., by operating at the highest feasible processing clock frequency and using the smallest degree of parallelism that still meets the throughput requirement.
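The guideline can be turned into a small sizing helper. The sketch assumes that throughput equals the parallelization factor times the processing clock frequency, which is consistent with the reported 50-way design at 312.5 MHz reaching 15.625 Gsps; the function name is hypothetical.

```python
def min_parallelism(sample_rate_hz: int, clock_hz: int) -> int:
    """Smallest integer N such that N * clock_hz >= sample_rate_hz."""
    return -(-sample_rate_hz // clock_hz)   # ceiling division on integers

# Reported operating point: 15.625 Gsps at a 312.5 MHz processing clock.
n_fast = min_parallelism(15_625_000_000, 312_500_000)
# Hypothetical slower clock for the same throughput target.
n_slow = min_parallelism(15_625_000_000, 200_000_000)
```

Under this assumption the faster clock needs N = 50, whereas a 200 MHz clock would need N = 79 for the same input rate, which illustrates why the highest feasible clock frequency minimizes the required parallelism and hence the PED.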
From a design perspective, different values of C impose clear performance trade-offs: a larger C allows a higher normalized loop bandwidth for a given equivalent delay. This enables faster tracking and a wider effective capture range, but is typically associated with tighter phase margins and more aggressive transient responses (larger overshoot, shorter settling time). A smaller C corresponds to a more conservative design, with higher phase margin and smoother transients, but a lower achievable loop bandwidth for the same delay. In practice, this reduces the maximum tolerable phase-noise dynamics or frequency offset for a fixed parallel architecture.
5. Simulation Results and Discussion
To validate the theoretical framework and evaluate the performance of the proposed state-space parallelization method, we conducted a series of comprehensive simulations. The method was applied to a Costas phase-locked loop (PLL) for carrier recovery in a coherent optical communication system. The simulation environment was configured with the following key parameters: the modulation format was Quadrature Phase-Shift Keying (QPSK) at a symbol rate of 10 Gbaud. A four-times oversampling ratio was employed, resulting in a sampling rate of 40 Gsps. Pulse shaping was performed using a Root-Raised Cosine (RRC) filter with a roll-off factor of 0.5. The channel was modeled with Additive White Gaussian Noise (AWGN), and laser phase noise was simulated as a Wiener process. For comparison, the performance of the classical open-loop Viterbi–Viterbi (VV) algorithm was also evaluated under identical conditions.
For a fundamental and standardized evaluation, system performance is measured in terms of the bit-energy-to-noise-power-spectral-density ratio (Eb/N0). This metric provides a normalized basis for comparison that is independent of specific system parameters such as modulation format, symbol rate, or channel bandwidth, facilitating a fair assessment against theoretical limits and other digital communication schemes.
For consistency, the OSNR measured in the optical experiments is converted to Eb/N0 using the following relation:
where m is the number of bits per symbol, and the remaining two quantities are the symbol rate and the optical reference bandwidth used in the OSNR measurement. In this system, with m = 2 (QPSK), a symbol rate of 10 Gbaud, and a reference bandwidth of 12.5 GHz, the conversion becomes:
Thus, all BER performance curves in this section are plotted against Eb/N0 to ensure comparability between simulation and experimental results, independent of modulation format, symbol rate, or measurement bandwidth.
5.1. Frequency Tracking Performance and Locking Behavior
First, we verified the fundamental locking capability and dynamic tracking performance of the parallelized Costas loop. Figure 5 depicts the frequency tracking curves for the serial loop and the proposed parallel structures with N = 64, 128, and 256 under a fixed carrier frequency offset. The plots show that all parallel configurations successfully acquire and lock onto the carrier frequency, exhibiting convergence behavior that is qualitatively similar to the serial counterpart.
For a more quantitative analysis, Table 1 summarizes the mean error and standard deviation of the estimated frequency after the loops have achieved a steady state. Several key observations can be made:
The mean frequency error is effectively compensated in all cases, indicating that the parallel structures maintain tracking accuracy. The small residual mean error is inherent to the loop’s operation and does not systematically increase with the parallelism factor N.
The standard deviation of the frequency estimate exhibits a slight, gradual increase as N grows. This is an expected result, consistent with our theoretical analysis. The increased parallelism leads to a larger equivalent loop delay, which slightly degrades the loop’s noise-filtering characteristics, resulting in a marginally higher phase noise variance. Nevertheless, this degradation is graceful and well-controlled.
5.2. Validation of the Throughput–Bandwidth Product
A core contribution of our work is the establishment of a theoretical limit on the loop bandwidth, encapsulated by the “Throughput–Bandwidth Product”. To empirically validate this constraint, we determined the maximum stable loop bandwidth and the corresponding maximum frequency acquisition range for different parallelism factors. The results, presented in Table 2, provide direct evidence supporting our analysis in Section 4.
There is a clear inverse relationship between the parallelism factor N and the maximum achievable loop bandwidth. For instance, increasing N from 64 to 256 reduces the maximum normalized loop bandwidth from 0.018 to 0.005. This directly translates to a reduced frequency acquisition range from ±35 MHz down to ±9 MHz. Moreover, the product of the delay and the maximum bandwidth remains remarkably constant across the different configurations. This empirically validates the existence of a constant Throughput–Bandwidth Product, which for our specific loop filter design is approximately 1.2. This fundamental trade-off provides a critical and quantifiable design guideline: achieving higher throughput via parallelism comes at the direct and predictable cost of reduced loop bandwidth, which in turn limits the system’s ability to track rapid phase variations or large carrier frequency offsets.
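The constancy of the product can be checked with a short calculation on the two operating points quoted above. The sketch assumes, purely for illustration, that the normalized delay in the simulation is approximated by N sampling periods; the paper's Equation (60) is not reproduced here and may contain additional terms. The names `tbp` and `max_normalized_bandwidth` are hypothetical.

```python
def tbp(normalized_delay, normalized_bandwidth):
    """Throughput–Bandwidth Product: delay (in sampling periods) times bandwidth."""
    return normalized_delay * normalized_bandwidth

def max_normalized_bandwidth(stability_constant, normalized_delay):
    """Largest normalized loop bandwidth allowed for a given delay and constant C."""
    return stability_constant / normalized_delay

# Operating points quoted in the text: N -> maximum normalized loop bandwidth.
reported = {64: 0.018, 256: 0.005}
# Assumption: normalized delay is approximated by N sampling periods.
products = {n: tbp(n, bw) for n, bw in reported.items()}
```

Under this assumption the two products evaluate to about 1.15 and 1.28, in line with the value of approximately 1.2 stated above. Conversely, `max_normalized_bandwidth(1.2, 128)` gives a rough bandwidth budget of about 0.0094 for an intermediate configuration.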
5.3. Bit Error Rate (BER) Performance
To assess the impact of the parallel architecture on the end-to-end system fidelity, we evaluated the Bit Error Rate (BER) performance.
Figure 6 shows the constellation diagrams after carrier recovery for both serial and parallel loops under various conditions. In all cases, the parallel loop successfully recovers the carrier, resulting in clean, well-defined constellation points.
Figure 7 presents the BER performance curves. As shown, the parallelized loops achieve performance remarkably close to that of the ideal serial implementation. A minor power penalty is observable as
N increases, which is a direct consequence of the slightly increased phase noise variance noted in
Table 1. For instance, at a target BER of
, the parallel loop with
N = 256 incurs a power penalty of less than 0.5 dB compared to the serial loop. This confirms that our state-space method preserves the system’s performance integrity while enabling massive parallelism.
5.4. Comparison with the Open-Loop Viterbi–Viterbi Algorithm
Finally, we benchmarked our proposed parallel closed-loop method against the widely used open-loop Viterbi–Viterbi (VV) feedforward algorithm. Figure 8 and Figure 9 illustrate the performance of the VV frequency estimator, which shows a critical weakness: its accuracy degrades precipitously at low Eb/N0. Specifically, as shown in Figure 9, the standard deviation of the frequency estimate rises to several hundred kHz for Eb/N0 below 8 dB, rendering the estimate unreliable. This contrasts sharply with the closed-loop method, which maintains a standard deviation below 1 kHz under the same conditions (as shown in Table 1). This performance collapse is characteristic of blind, non-data-aided feedforward estimators, as the underlying phase estimation relies on non-linear operations that amplify noise at low Eb/N0, leading to unreliable block-based estimates.
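The non-linear operation in question can be illustrated with a minimal fourth-power phase estimator for QPSK, written in Python. This is a generic Viterbi–Viterbi-style phase estimate for a single block, not the exact frequency estimator evaluated in the paper; the function name and the assumed constellation (points at odd multiples of 45°) are illustrative choices.

```python
import numpy as np

def vv_phase_estimate(block):
    """Fourth-power phase estimate for one block of QPSK symbols.

    Assumes constellation points at odd multiples of pi/4, so that s**4 = -1
    and the modulation is removed by raising each sample to the fourth power.
    """
    fourth = np.sum(np.asarray(block, dtype=complex) ** 4)
    # -sum(r**4) has angle 4*theta; the estimate is unambiguous for |theta| < pi/4.
    return np.angle(-fourth) / 4.0
```

Raising noisy samples to the fourth power also raises the noise to the fourth power and creates signal-noise cross terms, which is why the block estimate becomes unreliable at low Eb/N0, whereas a feedback loop averages its error signal over many symbols.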
This poor estimation performance leads to a catastrophic failure at the system level, as shown by the high error floor in
Figure 10. The system fails to achieve a BER below
even at an Eb/N0 of 13 dB. In stark contrast, our proposed parallel closed-loop structure (
Figure 7) not only achieves a BER of
at an Eb/N0 of approximately 7 dB but also maintains robust locking even in these challenging conditions. This highlights a fundamental advantage of our method: by preserving the recursive nature of the feedback loop, it leverages the inherent noise-filtering and tracking capabilities that are crucial for reliable performance, making it vastly superior to open-loop estimators in channels characterized by low Eb/N0.
6. FPGA Implementation and Performance Analysis
To validate the feasibility and evaluate the hardware performance of the proposed parallel architecture, the carrier recovery loop was implemented on a Virtex UltraScale+ XCVU13P (FLGA2577-2i) FPGA. The design was synthesized and implemented using the Vivado 2022.1 Design Suite.
6.1. Experimental Setup and Resource Utilization
Unlike block-FFT-based methods, which typically require the parallelism factor N to be a power of 2, the proposed state-space architecture supports arbitrary parallelism. This flexibility allows for the selection of an optimal parallelism factor (N = 50) to precisely match the target line rate and clocking resources of the optical transmission system.
Table 3 summarizes the post-implementation resource utilization and timing analysis for the N = 50 design under two different clock frequency constraints. At the target frequency of 312.5 MHz, the design achieves a throughput of 15.625 Gsps. The implementation is highly efficient, consuming only 6.5% of the available Look-Up Tables (LUTs) and 20.77% of the DSP slices, ensuring sufficient resources remain for other DSP modules.
6.2. Pipeline Depth and Verification of the Throughput–Bandwidth Product
Achieving high-frequency timing closure in recursive feedback loops necessitates the insertion of pipeline registers (S), which directly contribute to the loop latency. Based on the implementation results, the required pipeline depth S was modeled as a function of the parallelism N and the target clock frequency:
It is important to note that these empirical formulas are contingent upon the specific process technology (16 nm FinFET) and speed grade (-2) of the target FPGA. While the logarithmic relationship with N is architectural, the baseline stages and the coefficient of the logarithmic term may vary across different FPGA families or synthesis strategies. The increase in the coefficient from 1 to 2 at higher frequencies reflects the necessity of additional retiming stages to break critical paths in the feedback loop.
To validate the theoretical stability analysis, we calculated the stability constant C using (60) and (66).
Table 4 presents the measured performance metrics for different configurations. Despite the variations in parallelism and pipeline depth, the product of the maximum stable bandwidth and total delay exhibits remarkable consistency, converging to approximately 1.28. This empirical evidence strongly validates the proposed “Throughput–Bandwidth Product (TBP)” metric as a reliable predictor of system stability.
While the proposed architecture successfully achieves high throughput, the experimental results highlight a fundamental trade-off inherent to parallel feedback systems. In the implementation, increasing the clock frequency from 200 MHz to 312.5 MHz requires increasing the number of pipeline stages from 20 to 26. This longer feedback path increases the loop latency and therefore lowers the maximum stable bandwidth.
Consequently, the frequency acquisition range is also reduced. This phenomenon indicates that throughput enhancement via deep pipelining is constrained by the feedback latency bottleneck. For ultra-high-speed optical coherent receivers, this implies that the tracking capability for fast-varying phase noise is strictly bounded. Therefore, the proposed TBP metric serves as a critical design guideline, allowing engineers to determine the optimal operating point between throughput requirements and phase tracking performance prior to hardware implementation.
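This use of the TBP as a pre-implementation check can be sketched as a short calculation. The snippet assumes only that the product of the maximum stable bandwidth and the total normalized loop delay is approximately the reported constant C ≈ 1.28. The function names and the delay values are hypothetical, and the total delay itself would have to be obtained from the paper's delay expressions, which are not reproduced here.

```python
C_STABILITY = 1.28  # value reported for the second-order Costas loop


def max_stable_bandwidth(total_delay: float, c: float = C_STABILITY) -> float:
    """Largest normalized loop bandwidth allowed by bandwidth * delay ~= c."""
    if total_delay <= 0:
        raise ValueError("total_delay must be positive")
    return c / total_delay


def within_stability_budget(bandwidth: float, total_delay: float,
                            c: float = C_STABILITY) -> bool:
    """True if the bandwidth-delay product stays below the stability constant."""
    return bandwidth * total_delay < c


# Hypothetical normalized delays for a shallower and a deeper pipeline.
shallow_delay, deep_delay = 100.0, 130.0
b_shallow = max_stable_bandwidth(shallow_delay)
b_deep = max_stable_bandwidth(deep_delay)

# The achievable bandwidth shrinks in inverse proportion to the delay.
print(b_deep / b_shallow)
print(within_stability_budget(0.01, shallow_delay))  # product is 1.0, below 1.28
```

With the constant fixed, any added pipeline stage that increases the total delay reduces the usable loop bandwidth by the same ratio, which is the trade-off described above.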
7. Conclusions
This paper presents a unified state-space-based framework for parallelizing both feedforward and feedback DSP algorithms in coherent optical receivers. By mapping serial algorithms into an equivalent MIMO state-space representation, the proposed method systematically derives parallel architectures using only matrix operations, eliminating ad hoc, algorithm-specific derivations. The framework is applicable to a broad class of FIR/IIR filters and tracking loops, and guarantees exact serial equivalence at the algorithmic level.
A key theoretical contribution of this work is the identification and analysis of the “parallel equivalent delay” (PED), an inherent latency in any parallel feedback architecture. Our analysis revealed that PED, composed of structural and computational delays, introduces a right-half-plane zero into the loop’s transfer function, fundamentally limiting its stability. This analysis led to the definition of a Throughput–Bandwidth Product (TBP), summarized by the relation in (64), which is expressed in terms of the total normalized loop delay. The corresponding stability constant C was found to be approximately 1.28 for the considered second-order Costas loop and implementation. This constant thus provides a practical design metric that links parallelism, hardware latency, and loop dynamics.
The proposed framework was validated through the design of a highly parallel Costas carrier recovery loop. Simulations confirmed that the parallel loops retained the locking behavior and BER performance of the serial loop, with only a minor and graceful performance degradation as the parallelization factor increased. Compared with a classical Viterbi–Viterbi feedforward estimator, the parallel Costas loop exhibited dramatically improved robustness at low Eb/N0, avoiding the severe error floors observed for VV.
Finally, a 50-way parallel Costas loop was implemented on an AMD XCVU13P FPGA, achieving 15.625 Gsps at 312.5 MHz with less than 7% LUT utilization. Maximum stable loop bandwidths measured across different N, clock frequencies, and pipeline depths confirmed that the bandwidth–delay product remains close to the stability constant of approximately 1.28.
Future work will extend the state-space parallelization and PED analysis to higher-order carrier recovery loops, adaptive equalizers, and timing recovery architectures, and investigate active PED compensation strategies in ultra-high-parallelism regimes.