3.2.1. Distortion-Aware Training Data
Existing deep learning SFF methods typically train on data covering only ideal unimodal curves [19], which makes it difficult for them to handle the complex distortions encountered in practical measurements. This paper proposes a synthetic training data generation strategy encompassing multiple typical distortion patterns (Figure 2), including saturation (20%), ghost peak (25%), low-SNR (35%), asymmetric (10%), and background noise (5%).
Distortion transformations are applied to an ideal Gaussian curve as the base element, whose peak center is the normalized ground-truth depth and whose standard deviation controls the peak width. The synthesis methods for each distortion type are as follows:
Saturation truncation: the curve is clipped at a saturation threshold, simulating peak flattening caused by insufficient sensor dynamic range. This per-curve model captures the temporal saturation plateau but not the lateral electron-overflow (blooming) effect [38]; in practice, lateral blooming manifests as spatially correlated plateau widening. The FM operator (computed over a local neighborhood) and the CFA-based feature reweighting (Section 3.2.2) may reduce sensitivity to some curve-level artifacts, but they do not explicitly correct spatial blooming. A more explicit lateral blooming model is left as future work (Section 4.4).
Ghost peak: a spurious secondary peak with a randomly sampled position and a relative amplitude is superimposed on the base curve, simulating secondary peaks caused by specular reflection or by FM windows crossing depth discontinuities.
Low-SNR weak peak: the base curve is scaled by a weak-texture attenuation coefficient and corrupted by additive noise, simulating low signal-to-noise-ratio responses in weak-textured regions.
Asymmetric: a piecewise standard deviation that differs on the two sides of the peak, simulating peak skewness caused by directional differences in the defocusing process.
Background noise: pure noise sequences without valid peaks, serving as negative samples for background regions.
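The distortion types listed above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' generator: the function names, the parameterization on a normalized frame axis, the zero-mean Gaussian noise model, and every numeric value in the usage lines are assumptions chosen only to make the example run.

```python
import math
import random


def gaussian_curve(K, z, sigma):
    """Ideal focus curve: unit-height Gaussian over K frames (K >= 2),
    centred at normalized depth z in [0, 1] with width sigma."""
    return [math.exp(-((k / (K - 1) - z) ** 2) / (2.0 * sigma ** 2))
            for k in range(K)]


def saturate(curve, tau):
    """Saturation truncation: clip every sample at the threshold tau."""
    return [min(v, tau) for v in curve]


def add_ghost_peak(curve, z_ghost, alpha, sigma):
    """Ghost peak: superimpose a spurious Gaussian of relative amplitude alpha."""
    ghost = gaussian_curve(len(curve), z_ghost, sigma)
    return [v + alpha * g for v, g in zip(curve, ghost)]


def low_snr(curve, beta, noise_std, rng):
    """Low-SNR weak peak: attenuate by beta, then add zero-mean Gaussian noise."""
    return [beta * v + rng.gauss(0.0, noise_std) for v in curve]


def asymmetric_curve(K, z, sigma_left, sigma_right):
    """Asymmetric peak: different standard deviations on each side of z."""
    out = []
    for k in range(K):
        t = k / (K - 1)
        s = sigma_left if t < z else sigma_right
        out.append(math.exp(-((t - z) ** 2) / (2.0 * s ** 2)))
    return out


def background_noise(K, noise_std, rng):
    """Background sample: noise only, no valid peak."""
    return [rng.gauss(0.0, noise_std) for _ in range(K)]


# Usage with illustrative (assumed) parameter values.
rng = random.Random(0)
K = 32
base = gaussian_curve(K, z=0.5, sigma=0.08)
flat_top = saturate(base, tau=0.7)
two_peaks = add_ghost_peak(base, z_ghost=0.2, alpha=0.4, sigma=0.08)
weak = low_snr(base, beta=0.2, noise_std=0.02, rng=rng)
skewed = asymmetric_curve(K, z=0.5, sigma_left=0.05, sigma_right=0.15)
negative = background_noise(K, noise_std=0.02, rng=rng)
```

In a real generator, each parameter would be drawn at random from the ranges chosen for the target acquisition setup rather than fixed as above.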
The above distortion parameter ranges are determined based on observations of typical distorted curves in real HDR and weak-textured data, covering common distortion intensity intervals in industrial measurement scenarios. Hard samples (the four distortion classes: saturation, ghost peak, low-SNR, and asymmetric) constitute approximately 90% of the training set, background-noise negatives account for 5%, and the remaining 5% are ideal Gaussian curves, retained to ensure basic fitting capability for normal samples. The data generation strategy is parameterized by the sequence length K of the target dataset: given K, synthetic training samples of the corresponding length are generated according to the above proportions and parameter ranges, and the network is trained on them. Because sequence lengths differ significantly across acquisition configurations, this paper generates training data and trains an independent model for each sequence length rather than adopting a single universal model.
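As a small illustration of the class proportions, the sketch below draws class labels for a synthetic training set. The proportions are those stated in the text; the dictionary keys, function names, and sample count are hypothetical.

```python
import random

# Class proportions as stated in the text; together they sum to 1.0.
CLASS_PROPORTIONS = {
    "saturation": 0.20,
    "ghost_peak": 0.25,
    "low_snr": 0.35,
    "asymmetric": 0.10,
    "background_noise": 0.05,
    "ideal": 0.05,
}
HARD_CLASSES = ("saturation", "ghost_peak", "low_snr", "asymmetric")


def sample_class_labels(n_samples, seed=0):
    """Draw one class label per synthetic curve, weighted by the proportions."""
    rng = random.Random(seed)
    names = list(CLASS_PROPORTIONS)
    weights = [CLASS_PROPORTIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=n_samples)


def hard_fraction(labels):
    """Fraction of labels that belong to the four hard-sample classes."""
    return sum(label in HARD_CLASSES for label in labels) / len(labels)


labels = sample_class_labels(20_000)
print(f"hard-sample fraction: {hard_fraction(labels):.3f}")
```

Each sampled label would then select the corresponding distortion transformation for a curve of length K.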
To verify that the proposed proportions do not overfit to the test distribution, a sensitivity analysis was performed by perturbing the class proportions and retraining; the full sweep, which also includes a uniform proportion across classes, is reported in the Supplementary Materials. Within shifts of up to 15 percentage points in the four hard-sample classes, RMSE on the HDR test set varies by less than about 8% relative. Only severe imbalance, such as heavily inflating the saturation class, causes a noticeable degradation. The strategy is therefore robust to moderate proportion changes and is not overfitted to the specific test set.
The transferability of this synthesis strategy stems from the physical interpretability of the distortion categories: saturation truncation originates from sensor dynamic range limitations; ghost peaks arise from FM windows crossing depth boundaries or specular reflection; weak-peak noise results from weak focus response due to texture absence. These mechanisms can appear across acquisition systems and measured surfaces, but their quantitative distributions are configuration- and sensor-dependent; therefore, the synthesis parameters and trained weights should be revalidated or regenerated when the optical system or sensor type changes.
3.2.2. Network Architecture
The overall architecture of DAFDR-Net is shown in Figure 3 and comprises four components: a defocus response encoding module, a Channel-wise Feature Attention (CFA) module, a Soft Peak Localization (SPL) mechanism, and multi-task prediction heads.
Defocus Response Encoding Module. This module employs two 1D-CNN layers to extract local geometric features of focus curves. The convolution kernel sizes are selected from a scale analysis of the defocus physical model: in thin-lens imaging, the diameter of the circle of confusion is approximately linearly related to the defocus amount [8], and the effective response interval of the focus curve typically spans several to a dozen frames. The kernel sizes are therefore determined by Equation (4) from the ratio of the physical depth of field along the optical axis to the scanning step size, rather than chosen as a global constant. For the two acquisition configurations used in this work, the free-form HDR setup (simulated objective) and the silicon-wafer microscope setup (high-NA objective), this ratio falls near 5, so Equation (4) yields the same pair of kernel sizes for both configurations. For a different objective with a different ratio, the kernel sizes would be recomputed via Equation (4) and the network retrained with parametrically regenerated synthetic data of matching sequence length, as documented in Section 3.2.1. Adaptive alternatives such as dilated convolutions [39] and deformable/dynamic kernels [40] could enable a single model to span multiple DoF regimes at the cost of additional parameters; this trade-off is discussed in Section 4.4. To verify that this choice is not merely empirical, we performed kernel-size and depth sensitivity sweeps on the HDR free-form surface dataset (full tables are provided in the Supplementary Materials). Smaller kernel pairs under-cover the defocus response, while larger kernels over-smooth fine peak features; the physics-keyed choice achieves the lowest RMSE. Increasing the depth beyond 2 layers does not improve accuracy: the 3- and 4-layer variants do not lower the RMSE and increase the parameter count. Therefore, the 2-layer baseline is retained for efficiency. Batch Normalization (BN) follows the convolutional blocks to stabilize training [41].
Channel-wise Feature Attention (CFA) Module. In ideal defocus imaging, the FM response near the focused frame follows an approximately Gaussian distribution, and responses far from the focused frame should decay monotonically. However, distorted curves (saturated, ghost-peaked, or noise-dominated) violate this prior and produce unstable temporal-response feature patterns. This paper therefore introduces a channel-wise feature attention module, implemented in the squeeze-and-excitation style [42], to learn weights for temporally encoded feature channels rather than explicit per-frame reliability weights. The convolutional feature is first reduced by global average pooling over the temporal/frame dimension; the pooled vector then passes through two learnable fully connected layers (a channel-compressing layer followed by ReLU activation, then an expanding layer followed by Sigmoid activation) to produce the channel-wise attention vector, which rescales the feature channels. Because the temporal index is pooled before the attention vector is generated, CFA should be interpreted as channel-wise attention over temporal-response features, not as a direct estimator of individual frame reliability. It suppresses feature channels that tend to respond strongly to saturation, ghost peaks, or noise-dominated curves, thereby indirectly reducing the influence of distorted temporal responses in the subsequent depth regression.
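A minimal plain-Python sketch of squeeze-and-excitation style channel attention, as described above, may make the pooling order concrete. The shapes, the compression ratio of 4, and the random weights are assumptions for illustration only; the trained DAFDR-Net parameters and its actual compression ratio are not reproduced here.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def channel_attention(features, W1, W2):
    """features: C channels x T frames; W1: (C // r) x C; W2: C x (C // r).

    Returns one attention weight per channel and the reweighted features.
    """
    # Squeeze: global average pooling over the temporal/frame dimension.
    squeezed = [sum(channel) / len(channel) for channel in features]
    # Excitation: compress -> ReLU -> expand -> Sigmoid.
    weights = sigmoid(matvec(W2, relu(matvec(W1, squeezed))))
    # Scale: every frame of a channel is multiplied by the same weight.
    reweighted = [[w * x for x in channel]
                  for w, channel in zip(weights, features)]
    return weights, reweighted


# Usage with assumed sizes: C = 8 channels, T = 16 frames, ratio r = 4.
rng = random.Random(0)
C, T, r = 8, 16, 4
features = [[rng.random() for _ in range(T)] for _ in range(C)]
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]
weights, reweighted = channel_attention(features, W1, W2)
```

Because the frame axis is averaged away before the weights are computed, the output contains exactly one weight per channel, which is why the module cannot assign different reliabilities to individual frames.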
Failure safeguard. On ultra-smooth surfaces whose surface texture lies near the sensor quantization-noise floor [43], the CFA module alone could in principle amplify noise-sensitive feature channels as if they carried defocus information. The multi-task validity head is the explicit safeguard: the background-noise synthetic class (Section 3.2.1, 5% of training samples) trains the network to classify curves dominated by quantization-level noise as invalid rather than to assign them a fabricated depth, while the confidence-guided smoothing (Section 3.3) replaces such pixels by neighborhood interpolation. This mechanism is consistent with the silicon-wafer results, where the smoothed output further reduces the flat-region standard deviation relative to the direct DAFDR-Net output.
Soft Peak Localization (SPL) Mechanism. Traditional Gaussian interpolation performs analytical fitting on three points in the discrete peak neighborhood. This paper introduces differentiable soft peak localization in the network as an inductive bias: the CFA-weighted feature is pooled into a sequence-level representation, from which a soft peak position is derived. This mechanism provides a geometric prior for “peak localization” to the network, making depth regression a differentiable approximation inspired by physical processes rather than a pure black-box mapping. In practice, soft peak information is implicitly encoded in the depth prediction through fully connected layers.
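One common differentiable form of peak localization is the soft-argmax, in which a softmax over per-frame scores gives weights and the peak position is the weighted mean of the frame indices. The sketch below assumes this form, which may differ from the expression used in DAFDR-Net; the function name and the `temperature` parameter are hypothetical.

```python
import math


def soft_peak_position(scores, temperature=1.0):
    """Soft-argmax: expected frame index under a softmax of the scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return sum(k * e / total for k, e in enumerate(exps))


# A symmetric response centred on frame 2 gives a soft peak at 2.0.
symmetric = [0.0, 1.0, 4.0, 1.0, 0.0]
print(soft_peak_position(symmetric))

# A lower temperature sharpens the softmax towards the hard argmax.
skewed = [0.1, 0.9, 0.3]
print(soft_peak_position(skewed, temperature=0.01))
```

Unlike a hard argmax, this estimate varies smoothly with the scores, so gradients can flow through it during training, and it can fall between frames to give sub-frame resolution.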
Multi-task Prediction Heads. Shared features branch through fully connected layers into a depth regression head and a validity discrimination head, which output the normalized depth and the foreground mask probability, respectively (during inference, the binary mask is obtained by thresholding this probability). Dropout follows the fully connected layers to prevent overfitting [44]. This design follows the shared-representation paradigm of multi-task learning [37], in which the complementarity between the depth estimation and validity discrimination tasks facilitates learning more robust feature representations.
The detailed layer-by-layer parameter summary is provided in the
Supplementary Materials. For
, the network has approximately 200k parameters; for longer sequences, the parameter count grows linearly in the aggregation layer.