2.1. FMCW LiDAR Non-Linearity
The fundamental architecture of the FMCW LiDAR system is illustrated in
Figure 1, which presents a block diagram of a direct detection-based laser ranging configuration. In this system, a triangular waveform generated by a driver modulates a tunable laser source, producing a continuous-wave optical signal with a periodically swept frequency. The emitted light is split into two paths: the measurement beam, directed toward the target via a circulator, and the reference (local) beam, sent to an optical coupler. The measurement beam is collimated, transmitted to the target, and the reflected signal is re-collimated and guided back through the circulator to the coupler. There, it interferes with the local beam, and the resulting optical interference is detected via a balanced photodetector (BPD), which converts it into an electrical beat frequency signal. This signal is then amplified via a trans-impedance amplifier (TIA). The target distance is determined by analyzing the frequency of the beat signal.
External-cavity semiconductor frequency-modulated lasers, which achieve modulation by driving cavity mirrors to vary the optical path length, are widely adopted due to their superior FM linearity, wavelength stability, narrow linewidth, low noise, high side-mode suppression ratio, elevated output power, and compact integration. Despite these advantages, practical implementation is challenged by nonlinearities arising from piezoelectric actuator (PZT) hysteresis, mechanical vibrations, and circuit noise. These effects distort the frequency sweep, introducing errors in the beat frequency
and thereby degrading ranging resolution and accuracy. As depicted in
Figure 2, the dashed line illustrates the ideal linear frequency modulation trajectory, while the solid line represents the actual, non-ideal modulation behavior encountered in real-world scenarios.
Within one laser modulation period T, the modulated frequencies of the local light and the measurement light can be written as follows:
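For reference, a standard linear-chirp form of these two signals (the labels $f_{lo}$ and $f_{me}$ and the delay $\tau$ are notation assumed here, consistent with the definitions that follow) is
$$f_{lo}(t) = f_0 + kt, \qquad f_{me}(t) = f_0 + k\,(t - \tau).$$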
Let $f_0$ denote the initial optical frequency and $k$ the modulation rate, defined as $k = B/T$, where $B$ represents the laser's frequency sweep bandwidth and $T$ the modulation period. The parameter $\tau$ corresponds to the round-trip propagation delay associated with a target located at distance $R$, and it is given as follows:
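With $c$ the speed of light, the round-trip delay for a target at distance $R$ takes the standard form
$$\tau = \frac{2R}{c}.$$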
When considering the nonlinear term of the FMCW LiDAR system, the above Equation (1) can be written as follows:
To calculate the beat frequency, we first need to obtain the phases of the local light and the measurement light:
Then, the phase difference between the measurement light signal and the local light signal from Equation (4) can be expressed as follows:
Since practical ranging scenarios typically occur within 300 m, the round-trip time $\tau$ is at most 2 μs. Given that the modulation period $T$ of FMCW LiDAR is generally around 100 μs, we have $\tau \ll T$. Therefore, we neglect the term that is quadratic in $\tau$, regard the nonlinear error as invariant over the interval $[t-\tau, t]$, and thus obtain a simplified expression for the phase difference.
Thus, according to Equation (5), the beat signal in the presence of nonlinearity can be expressed as follows:
The distance $R$ of the target can then be rewritten as follows:
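For a linear sweep, the beat frequency $f_b$ and the target distance are related by the well-known FMCW relation (restated here for readability):
$$f_b = k\tau = \frac{2BR}{cT}, \qquad R = \frac{cT f_b}{2B}.$$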
The accuracy of the beat signal frequency directly determines the precision and resolution of distance measurements in FMCW LiDAR. The ranging resolution is defined as $\Delta R = c/(2B)$. When a window function is applied to the beat signal spectrum via the Fourier transform [23], the theoretical spatial resolution (TSR) corresponds to the full width at half maximum (FWHM) of the spectral peak. To evaluate the deviation in beat signal frequency introduced by modulation nonlinearity, the bandwidth is estimated using Carson's rule [24].
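In its standard statement, and with the symbols defined immediately below, Carson's rule estimates this bandwidth as
$$B_{CR} = 2\left(\Delta f_{RMS} + f_m\right),$$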
where $f_m$ is the repetition frequency of the FMCW LiDAR modulation signal, $\beta$ is the modulation index, defined as $\beta = \Delta f_{RMS}/f_m$, and $\Delta f_{RMS}$ is the root-mean-square (RMS) value of the nonlinear error. Thus, Equation (8) can be rewritten as $B_{CR} = 2(\beta + 1)f_m$. In this way, by analyzing the spectrum of the beat signal, we are able to quantitatively evaluate the sweep nonlinearity, and the TSR can be rewritten as [19]:
Since the magnitude of the modulation slope $k$ can be controlled to achieve compensation, the compensated beat frequency can be expressed as follows:
The compensated distance can be expressed as follows:
To establish the reinforcement learning environment, we adopt the methodology presented in [19]. The environment is constructed using experimentally obtained data and the underlying principles of FMCW LiDAR measurement. A nonlinear setting is simulated by incorporating Gaussian noise, effectively emulating the real-world nonlinear behavior of FMCW systems. This approach enables efficient policy training without direct interaction with the physical hardware, thereby mitigating the risk of potential damage to laser components. The first step involves formulating the physical relationship between the beat frequency and the modulation slope:
In this relation, the Hilbert transform is applied to the beat signal, and the beat frequency can also be expressed in terms of the nonlinear characteristic of the sweep. A noise term is then introduced to model the stochastic properties of the system. Based on these quantities, the modulated beat frequency is calculated, and through this relation the model is constructed as a training environment for the RL algorithm.
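As an illustration of how such a simulated training environment might be assembled, the following is a minimal sketch; the class name, the sinusoidal nonlinearity model, the noise level, the observation layout, and the reward are all assumptions made for this example rather than the authors' implementation.

```python
import numpy as np

class FMCWNonlinearityEnv:
    """Minimal sketch of a simulated FMCW beat-frequency environment.

    Assumptions (not from the paper): a sinusoidal sweep nonlinearity,
    Gaussian noise on the beat frequency, and a reward equal to the
    negative absolute deviation from the ideal beat frequency.
    """

    def __init__(self, B=1.0e9, T=100e-6, R=50.0, c=3e8,
                 nonlin_amp=5e3, noise_std=1e3, seed=0):
        self.B, self.T, self.R, self.c = B, T, R, c
        self.nonlin_amp = nonlin_amp      # amplitude of the assumed nonlinearity (Hz)
        self.noise_std = noise_std        # std of the Gaussian measurement noise (Hz)
        self.rng = np.random.default_rng(seed)
        self.tau = 2.0 * R / c            # round-trip delay
        self.k_ideal = B / T              # ideal modulation slope
        self.t = 0.0

    def reset(self):
        self.t = 0.0
        return self._observe(self.k_ideal)

    def _beat_frequency(self, k):
        # Ideal beat frequency plus the assumed sweep nonlinearity and noise.
        nonlin = self.nonlin_amp * np.sin(2 * np.pi * self.t / self.T)
        noise = self.rng.normal(0.0, self.noise_std)
        return k * self.tau + nonlin + noise

    def _observe(self, k):
        fb = self._beat_frequency(k)
        # Observation: measured beat frequency and its deviation from the ideal value.
        return np.array([fb, fb - self.k_ideal * self.tau])

    def step(self, action):
        # The action adjusts the modulation slope around its ideal value.
        k = self.k_ideal * (1.0 + float(action))
        obs = self._observe(k)
        reward = -abs(obs[1])             # penalize beat-frequency deviation
        self.t = (self.t + self.T / 100) % self.T
        return obs, reward, False, {}
```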
2.2. Soft Actor–Critic with Hybrid Prioritized Experience Replay
The Soft Actor–Critic (SAC) algorithm represents a state-of-the-art reinforcement learning approach founded on the principle of maximum entropy policy optimization. Unlike conventional methods that primarily enhance policy performance through Q-value estimation of a given policy, SAC employs a dual-stage optimization process involving both the value and policy functions. This design eliminates the reliance on manually tuned exploration noise parameters, which are often associated with inefficient and unstable policy learning. A distinguishing feature of SAC is its incorporation of an adaptive temperature coefficient, $\alpha$, which is dynamically optimized as part of the objective function. This mechanism obviates the need for manual hyperparameter adjustment, thereby enhancing the algorithm's ability to adapt to varying environmental complexities and different phases of the training process. Additionally, entropy regularization encourages broader exploration across the state–action space while suppressing redundant interactions with less informative regions. This exploratory efficiency renders SAC particularly well-suited to environments that demand a balance between exploration and learning stability.
Despite these advantages, SAC exhibits limitations due to its uniform sampling strategy, which treats all experience samples with equal priority. This can lead to slower convergence and instability during training. To address these issues, prioritized experience replay is introduced into the SAC framework. By emphasizing samples with higher value-function estimation errors and suboptimal policy outcomes, the network is guided to focus on experiences with greater learning potential. This modification enhances both the stability and the convergence rate of the learning process.
The SAC algorithm can be formally described as a policy search within the framework of a Markov decision process (MDP). By extending the standard reward-maximization objective with a maximum entropy component, the algorithm aims to jointly maximize the expected return and the entropy of the policy. The resulting policy objective, $J(\pi)$, is given as follows:
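In its standard maximum-entropy form, this objective can be written as
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\!\left[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right].$$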
In this context, $r(s_t, a_t)$ represents the immediate reward received from the environment when the agent is in state $s_t$ and executes action $a_t$; a problem-specific form of this reward is adopted in this paper. The term $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of the action distribution under policy $\pi$ given state $s_t$, which quantifies the level of randomness in the agent's action selection at that state. A higher entropy indicates a more exploratory policy. The coefficient $\alpha$ regulates the contribution of the entropy term. The SAC algorithm consists of two primary components: policy evaluation and policy improvement. In the policy evaluation phase, the soft Q-value function, denoted as $Q(s_t, a_t)$, is defined as follows:
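A standard form of this definition, consistent with the soft state value function introduced next, is
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\!\left[ V(s_{t+1}) \right],$$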
where $\gamma$ is the discount factor, and the soft state value function $V(s_t)$ is defined as follows:
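In its standard form,
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right].$$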
The term $\mathbb{E}_{a_t \sim \pi}[Q(s_t, a_t)]$ represents the expected soft Q-value obtained by selecting an action according to the current policy in state $s_t$, capturing the anticipated return following the agent's decision. The entropy regularization component, $-\alpha \log \pi(a_t \mid s_t)$, promotes stochasticity in the policy by penalizing certainty, thereby encouraging exploration. The temperature parameter $\alpha$ modulates this term, balancing exploration and exploitation by scaling the entropy contribution. Importantly, $\alpha$ is updated adaptively to optimize the trade-off between reward maximization and entropy. This adjustment follows the gradient update rule:
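In its standard form, and with the symbols defined in the following sentence, this update reads
$$\alpha_{t+1} = \alpha_t - \lambda\, \nabla_{\alpha} J(\alpha),$$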
where $\alpha_t$ and $\alpha_{t+1}$ denote the current and updated temperature coefficients, respectively, $\lambda$ is the learning rate, and $\nabla_{\alpha} J(\alpha)$ is the gradient of the entropy-related objective with respect to $\alpha$.
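The entropy-related objective is commonly written in batch form as
$$J(\alpha) = \frac{1}{N}\sum_{i=1}^{N}\left[ -\alpha \log \pi(a_i \mid s_i) - \alpha h \right],$$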
where $N$ denotes the size of the data batch sampled from the experience replay buffer and $h$ is the target entropy. In the policy improvement stage, the policy is optimized using the Kullback–Leibler divergence [25]:
where $Z(s_t)$ is the partition function that normalizes the distribution, and the output of the policy $\pi$ is a probability distribution. The soft Q-function parameters $\theta$ can be trained to minimize the soft Bellman residual:
Here, $Q_{\bar{\theta}}$ denotes the target Q-value network, whose parameters $\bar{\theta}$ are obtained via an exponential moving average of the Q-value function parameters $\theta$. The soft actor–critic algorithm uses the reparameterization trick to redefine the policy as follows:
where $\epsilon_t$ is a noise random variable introduced by the reparameterization trick. According to the above definition, the update gradient of the soft Q-function is
where $\theta$ and $\bar{\theta}$ are the parameters of the soft Q-function and its target network, respectively, and the update gradient of the policy network $\pi_{\phi}$ is
During SAC training, experience batches are uniformly sampled from the replay buffer, assigning equal selection probability to all transitions. However, this uniform approach may be suboptimal, as experiences vary in their contribution to learning. Prioritized experience replay (PER) addresses this issue by assigning higher sampling probabilities to transitions with greater learning potential, thereby focusing updates on more informative experiences. PER quantifies the importance of each transition using the temporal-difference (TD) error, which measures the discrepancy between predicted and actual returns. Larger TD errors indicate transitions with greater potential to improve the policy, and thus, they are prioritized during sampling. The sampling probability for each experience is derived from its TD error, ensuring that data with higher learning relevance are revisited more frequently, thereby improving training efficiency and accelerating convergence. The sampling probability is calculated using the following formulation:
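In the standard PER formulation (written here with the prioritization exponent $\alpha_p$ to avoid confusion with the SAC temperature), the sampling probability takes the form
$$P(i) = \frac{p_i^{\,\alpha_p}}{\sum_{k} p_k^{\,\alpha_p}},$$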
where $p_i$ refers to the priority of state transition $i$, which is a positive number, such as $p_i = |\delta_i| + \epsilon$. The exponent is a hyperparameter corresponding to the degree of prioritization and can be regarded as a trade-off factor that balances uniformity and greediness: when it is 0, uniform sampling is performed, and when it is 1, greedy (maximum-priority) sampling is performed. However, in FMCW LiDAR, the TD error provides only a global estimate of the policy and cannot directly reflect the nonlinear correction result in terms of the laser device physics. To ensure accurate correction at the hardware level, we introduce the modulation frequency (MF) difference between the beat frequency signals of adjacent beat-frequency units as a feature quantity that directly reflects instantaneous nonlinearity. This enables the integration of the MF error and the TD error into a hybrid priority mechanism, reconciling instantaneous response with long-term optimization for the agent and achieving a more robust training strategy. The definition of the MF error is
where $\delta_{MF}$ denotes the difference between the beat frequencies of two adjacent segments of the modulation signal, and the sampling probability $P(i)$ of the new hybrid priority mechanism is
Here, $\lambda$ denotes the mixing coefficient. By setting $\lambda$, the weight ratio between $\delta_{MF}$ and $\delta_{TD}$ can be dynamically adjusted, where $\delta_{TD}$ denotes the TD error, which can be expressed as follows:
Among them, $Q_{\theta_1}$ and $Q_{\theta_2}$ are the two value networks of SAC. To further eliminate priority bias, we introduce a dynamic mixing coefficient $\lambda$ to control the weights of $\delta_{MF}$ and $\delta_{TD}$ in the priority, and the importance sampling weight $w_i$ can be expressed as follows:
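For reference, the standard PER importance-sampling correction takes the form (with $N_B$ the number of stored transitions and $\beta_p$ the correction exponent, annealed toward 1; these symbols are chosen here to avoid clashing with notation used above)
$$w_i = \frac{\left(N_B \cdot P(i)\right)^{-\beta_p}}{\max_j w_j}.$$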
When $\lambda$ takes one extreme value, $\delta_{MF}$ dominates the weighting; when it takes the other extreme, $\delta_{TD}$ dominates the weighting. The calculation method for $\lambda$ is as follows:
where $\sigma(\cdot)$ is the sigmoid function, whose output lies in [0, 1], and whose argument is calculated through the following equation:
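Putting the hybrid-priority pieces together, the following sketch is illustrative only: the convex mixing rule, the sigmoid argument (a log-ratio of recent MF- and TD-error magnitudes), and the constants are assumptions made for the example, not the paper's exact equations.

```python
import numpy as np

def dynamic_lambda(mf_errors, td_errors, eps=1e-8):
    """Assumed construction of the mixing coefficient via a sigmoid.

    The argument compares recent MF-error and TD-error magnitudes, so the
    coefficient grows when hardware-level (MF) errors dominate.
    """
    ratio = np.log((np.abs(mf_errors).mean() + eps) / (np.abs(td_errors).mean() + eps))
    return 1.0 / (1.0 + np.exp(-ratio))             # sigmoid output in (0, 1)

def hybrid_sampling_probs(td_errors, mf_errors, lam, alpha_p=0.6, eps=1e-6):
    """Convex combination of normalized |MF| and |TD| errors, then the PER exponent."""
    td = np.abs(td_errors) / (np.abs(td_errors).max() + eps)
    mf = np.abs(mf_errors) / (np.abs(mf_errors).max() + eps)
    priority = lam * mf + (1.0 - lam) * td + eps
    scaled = priority ** alpha_p
    return scaled / scaled.sum()

# Example usage with toy error arrays.
rng = np.random.default_rng(0)
td = rng.normal(0.0, 1.0, size=256)
mf = rng.normal(0.0, 2.0, size=256)
lam = dynamic_lambda(mf, td)
probs = hybrid_sampling_probs(td, mf, lam)
batch_idx = rng.choice(256, size=64, p=probs, replace=False)
```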
In FMCW LiDAR systems, the modulation nonlinearity of the laser is influenced by environmental factors such as temperature fluctuations and laser degradation, as well as hardware characteristics such as driver circuit latency, leading to time-varying behavior. Conventional experience replay (ER) strategies, which typically rely on fixed-delay or immediate data injection, are insufficient for capturing these dynamic nonlinearities. This can result in an imbalance between data recency and diversity, thereby hindering model performance. To address this limitation, we introduce a Time-Varying Delay Experience (TDE) infusion mechanism, which dynamically adjusts the injection delay of training samples based on the system dynamics. Combined with buffer segmentation management, this approach ensures a more balanced temporal distribution of experiences. The delay function is designed to reflect the rate of change in the system nonlinearity, modeled using the first-order derivative of the modulation frequency error $\delta_{MF}$.
where the base delay step guarantees a minimum level of data diversity, and the sensitivity coefficient controls how strongly the delay responds to the rate of nonlinear change. The normalized rate-of-change term takes values in [−1, 1], so the actual delay varies within a bounded range around the base delay, and the rate of change of the modulation frequency error is calculated over a sliding window:
where $k$ denotes the current training step, and $W$ represents the window width used to smooth instantaneous noise, typically determined from the total number of steps in a single training session. When the rate of change is positive, the nonlinearity of the modulation signal is intensifying and the modulation frequency error is increasing rapidly (e.g., a sudden rise in laser temperature causing frequency-modulation distortion). In this case, the normalized term approaches 1 and the delay increases; the system then requires more time to accumulate new data in order to fully capture the evolving trend of the nonlinear dynamics. When the rate of change is close to zero, the nonlinearity of the modulation signal is stable, and the base delay is maintained to preserve exploration diversity. When the rate of change is negative, the nonlinear trend of the modulation signal is weakening and the modulation frequency error is decreasing; the normalized term approaches −1 and the delay is minimized, so the system quickly injects new data to promptly reflect the nonlinear compensation effect and avoid training lag. In addition, we divide the ER buffer into two sub-buffers, namely the fast zone and the slow zone, where the fast zone is used to store experiences from the highly dynamic phase, accounting for a given fraction of the total capacity, and the slow zone is used to store experiences from the steady-state phase, with the corresponding capacity ratio, as shown in the following equation:
Finally, during sampling, we mix the two types of data in proportion within each training batch to balance exploration and exploitation:
The structure of HPER-SAC is shown in Figure 3. In the FMCW LiDAR experimental system, a Field-Programmable Gate Array (FPGA) generates the original modulated signal, which is converted into a drive signal by a digital-to-analog converter (DAC) to drive a tunable laser. The laser output is split by a coupler into 10% local light and 90% measurement light. The measurement light passes through a time-delay fiber before entering a second coupler together with the local light, where the two are mixed and converted into a beat signal by a BPD. An analog-to-digital converter (ADC) then digitizes the beat signal and transfers it to the FPGA. The FPGA transmits the digitized beat frequency data and the original modulation signal to the laptop via a network protocol, and the system environment required for RL training is initialized. The HPER-SAC reinforcement learning algorithm is then started. First, the actor interacts with the environment and stores the samples in the experience replay buffer. It then calculates the MF and TD errors between adjacent segments within the sweep period to construct the hybrid priority experience pool and automatically adjust the proportion of replayed experiences. During training, probabilistic sampling is performed from the experience replay buffer, and the parameters of the critic network are updated; simultaneously, the target network undergoes a soft update to guide the actor network in updating the signal modulation actions. The specific steps of HPER-SAC are shown in Algorithm 1.
Algorithm 1 HPER-SAC RL.
Input: hyperparameters: exponential moving average rate, discount factor $\gamma$, target entropy $h$, batch size, critic NN learning rate, actor NN learning rate, entropy coefficient learning rate.
Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$; actor network $\pi_{\phi}$; target networks $Q_{\bar{\theta}_1}$, $Q_{\bar{\theta}_2}$; replay buffers $B_f$ and $B_s$.
for episode = 1 to N do
  Get the initial state and initialize the modulation slope.
  for t = 1 to T do
    Sample an action from $\pi_{\phi}$ and calculate the resulting beat frequency and reward.
    Update the modulation slope.
    Store the transition tuple and calculate the TD error, the MF error, and the priority.
    if (t mod injection delay) == 0 then
      if the MF-error rate of change is positive then store the tuple in $B_f$
      else store the tuple in $B_s$
      end if
    end if
    Calculate $\lambda$, sample a batch of transitions from $B_f$ and from $B_s$, and form the training batch.
    for each sampled transition do
      Calculate the importance sampling weight and the sampling probability.
    end for
    Update the critic NN loss function.
    Update the actor NN loss function.
  end for
  Update the critic networks; update the actor network.
  Update the temperature coefficient.
  Update the target networks via the exponential moving average.
end for
Output: control policy
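To make the buffer-management step of Algorithm 1 concrete, the following is a brief sketch of the time-varying delay and fast/slow routing; the delay law $D_0(1 + \eta\,\tanh(\cdot))$, the window width, the capacity split, and the sampling proportions are assumed values for illustration, not the paper's settings.

```python
import numpy as np
from collections import deque

class TimeVaryingDelayBuffer:
    """Sketch of the TDE infusion mechanism with fast/slow sub-buffers.

    Assumption (not from the paper): delay D(k) = D0 * (1 + eta * tanh(g)),
    where g is the windowed rate of change of the MF error.
    """

    def __init__(self, capacity=10000, fast_ratio=0.6, base_delay=10,
                 eta=0.5, window=50):
        fast_cap = int(capacity * fast_ratio)
        self.fast = deque(maxlen=fast_cap)               # high-dynamics experiences
        self.slow = deque(maxlen=capacity - fast_cap)    # steady-state experiences
        self.base_delay = base_delay
        self.eta = eta
        self.mf_history = deque(maxlen=window)
        self.pending = []                                 # transitions awaiting injection

    def mf_rate(self):
        # Windowed first-order rate of change of the MF error.
        if len(self.mf_history) < 2:
            return 0.0
        return float(np.mean(np.diff(np.asarray(self.mf_history))))

    def delay(self):
        g = np.tanh(self.mf_rate())                       # normalized to [-1, 1]
        return max(1, int(round(self.base_delay * (1.0 + self.eta * g))))

    def add(self, transition, mf_error, step):
        self.mf_history.append(mf_error)
        self.pending.append(transition)
        if step % self.delay() == 0:                      # inject pending data
            target = self.fast if self.mf_rate() > 0 else self.slow
            target.extend(self.pending)
            self.pending.clear()

    def sample(self, batch_size, fast_frac=0.7, rng=None):
        # Mix fast- and slow-zone data in fixed proportions.
        rng = rng or np.random.default_rng()
        n_fast = min(int(batch_size * fast_frac), len(self.fast))
        n_slow = min(batch_size - n_fast, len(self.slow))
        batch = [self.fast[i] for i in rng.integers(0, len(self.fast), size=n_fast)] if n_fast else []
        batch += [self.slow[i] for i in rng.integers(0, len(self.slow), size=n_slow)] if n_slow else []
        return batch
```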
The training parameters are listed in Table 1.