1. Introduction
Distributed feedback (DFB) semiconductor lasers serve as the workhorse transmitter in dense wavelength division multiplexing (DWDM) networks owing to their single-longitudinal-mode operation and narrow spectral linewidth [1, 2]. A well-known vulnerability is that the lasing wavelength shifts by approximately 0.1 nm/K for typical InGaAsP chips around 1550 nm [3]. A 100 GHz ITU channel grid leaves almost no margin: a thermal excursion of ±0.01 K is enough to push the carrier outside its slot [4]. Holding the wavelength within this window is the task of the thermoelectric cooler (TEC) inside the butterfly package, and maintaining sub-10 mK stability under real operating conditions is harder than first-order thermal analysis suggests.
The TEC current and the junction temperature are coupled nonlinearly through the Peltier effect. The I²R Joule term partially cancels the Peltier cooling, so the optimum drive point shifts with the operating condition [5]. The heat-flow path inside a butterfly package crosses several thermal masses. These include the submount carrying the laser chip, the ceramic spacer beneath it, and the case-mounted heat sink on the hot side. Their time constants span roughly 0.1 s to a few seconds, so a single-node lumped model cannot reproduce even the step response [6]. Conditions on a deployed line card are worse: ambient drift, thermal crosstalk from neighboring lanes, and step changes in laser bias current all act as disturbances, and the TEC loop must reject them simultaneously.
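The competition between the Peltier and Joule terms can be made concrete with the textbook steady-state heat balance of a single-stage TEC. The sketch below is illustrative only: the Seebeck coefficient, module resistance, and thermal conductance are hypothetical values chosen for readability, not parameters of the package studied in this work.

```python
def cold_side_heat_flow(current, t_cold, t_hot,
                        seebeck=0.02, resistance=2.5, conductance=0.05):
    """Net heat pumped away from the cold side, in W.

    Textbook single-stage TEC balance; temperatures in kelvin.
    All default coefficients are hypothetical.
    """
    peltier = seebeck * current * t_cold        # Peltier cooling, linear in I
    joule = 0.5 * resistance * current ** 2     # half of the I^2 R heat flows to the cold side
    leak = conductance * (t_hot - t_cold)       # conduction back-flow from the hot side
    return peltier - joule - leak


def optimum_current(t_cold, seebeck=0.02, resistance=2.5):
    """Current that maximizes the pumped heat: d/dI (S*I*Tc - 0.5*R*I^2) = 0."""
    return seebeck * t_cold / resistance


if __name__ == "__main__":
    # The optimum moves with the cold-side temperature, i.e. with the operating point.
    for t_cold in (288.15, 298.15, 308.15):
        i_opt = optimum_current(t_cold)
        q_max = cold_side_heat_flow(i_opt, t_cold, t_hot=318.15)
        print(f"T_cold = {t_cold:.2f} K: I_opt = {i_opt:.3f} A, Q_c = {q_max:.2f} W")
```

Because the Joule term grows quadratically while the Peltier term grows only linearly, driving the TEC beyond `optimum_current` reduces the net cooling rather than increasing it.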
In practice, almost every commercial laser driver still ships with a plain PID loop [7, 8]. Steady-state performance is adequate, but PID is inherently reactive. The integrator only starts accumulating after an error appears. Step the setpoint by 2 K or bump the laser bias current, and the controller is already behind. The result is the familiar overshoot-then-ring pattern. Adding a feedforward branch changes the picture fundamentally, because a model that estimates the required TEC current ahead of time offloads the bulk of the transient work, leaving the PID to mop up whatever residual the model misses [9, 10].
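The division of labor between the two branches can be sketched as a PID update whose output is simply added to a model-supplied feedforward current. This is a minimal illustration under stated assumptions: the class name, gains, and current limit are hypothetical, positive current is taken to heat the cold side, and anti-windup logic is omitted.

```python
class FeedforwardPID:
    """Discrete PID whose output is summed with an external feedforward term."""

    def __init__(self, kp, ki, kd, dt, u_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt = dt                # control period in seconds
        self.u_limit = u_limit      # symmetric current limit in amperes
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement, feedforward=0.0):
        """Return the clamped drive current for one control cycle."""
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0        # no derivative kick on the first sample
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        feedback = self.kp * error + self.ki * self.integral + self.kd * derivative
        u = feedforward + feedback  # model handles the bulk, PID the residual
        return max(-self.u_limit, min(self.u_limit, u))
```

When the feedforward estimate is accurate, the error stays small throughout the transient and the feedback terms contribute only a minor correction.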
We adopt a residual-correction architecture in which the simplified physics model enters as a structural component of the network rather than as a regularization term in the loss. The network output is formed as the sum of a frozen single-node prediction and a trainable correction, with the correction weights initialized near zero. When training data are scarce, the correction remains small, and the model reduces to the analytical baseline. As more labeled samples are accumulated, the correction branch progressively accounts for the nonlinearities that the single-node formula cannot represent. Temporal context is incorporated in parallel. Two past samples of cold-side temperature and TEC current, taken 0.5 s apart, are appended to the input vector. These lagged values provide indirect access to the ceramic-layer and hot-side temperatures, neither of which is instrumented. Removing them caused every architecture we evaluated to saturate at R² ≈ 0.77.
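The structure described above, a frozen physics prediction plus a near-zero-initialized trainable correction, can be sketched in a few lines. This is a forward-pass illustration only, not the authors' implementation: the one-node coefficients, the layer width, the input ordering, and all function names are assumptions, and training of the correction weights is not shown.

```python
import math
import random


def one_node_rate(temp_c, current, ambient_c=25.0, seebeck=0.02,
                  resistance=2.5, conductance=0.05, heat_capacity=1.0):
    """Frozen single-node estimate of dT/dt in K/s (hypothetical coefficients).

    Positive current is assumed to cool the cold side.
    """
    t_kelvin = temp_c + 273.15
    pumped = seebeck * current * t_kelvin - 0.5 * resistance * current ** 2
    leak = conductance * (ambient_c - temp_c)
    return (leak - pumped) / heat_capacity


class Correction:
    """One-hidden-layer network whose output weights start near zero."""

    def __init__(self, n_inputs, n_hidden=8, out_scale=1e-3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        # Tiny output weights: the untrained model reduces to the physics baseline.
        self.w2 = [rng.gauss(0.0, out_scale) for _ in range(n_hidden)]
        self.b2 = 0.0

    def __call__(self, x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def predict_rate(x, correction):
    """x = [t, T, u, T_lag1, u_lag1, T_lag2, u_lag2] (assumed ordering)."""
    baseline = one_node_rate(x[1], x[2])   # frozen branch, never trained
    return baseline + correction(x)        # trainable residual on top
```

Only the `Correction` weights would receive gradients during training; the physics branch contributes to the output but not to the parameter set, so there is no separate physics loss to balance against the data loss.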
The contributions of this work are as follows. First, we propose a residual-correction PINN in which the one-node thermal model is embedded directly into the computation graph with frozen weights, avoiding the gradient conflict between physics and data losses that limits the accuracy of conventional PINNs when the model mismatch is large. Second, we find that appending only two lagged temperature–current pairs to the input is sufficient to overcome the partial observability of single-sensor TEC packages. Without these features, none of the architectures we evaluated exceeded R² ≈ 0.77. Third, a data-budget sweep from 3% to 100% of the available training set shows that the PINN reaches a given accuracy target with about 5.4× fewer labeled samples than an equivalent purely data-driven network (R² = 0.966 at 293 samples, compared with R² = 0.930 for the NN). Finally, the trained surrogate is embedded in a hybrid feedforward–PID controller and benchmarked in closed-loop simulation against a standalone PID, an integral-separated PID, and an NN+PID baseline across step, multi-setpoint, and disturbance-rejection scenarios, with the plant represented by a three-node TEC model.
Section 2 surveys related work on TEC control, PINNs, and hybrid physics–data methods. Section 3 lays out the thermal models and the proposed architecture. Section 4 details the experimental protocol, Section 5 reports the results, and Section 6 concludes.
4. Experimental Setup
4.1. Data Generation
We generated training data by running 80 separate simulations of the three-node plant. Initial cold-side temperatures were drawn from () and control signals (step, sinusoidal, ramp, telegraph, chirp, and random walk; ). Individual runs last between 30 and 120 s at a base sampling period of 0.1 s. We then decimate by a factor of five to weaken the temporal correlation among neighboring samples, and we obtain the derivative targets by finite differencing. Gaussian measurement noise (2 mK) is superimposed on the cold-side temperature. After pooling and shuffling all runs, we have about 12,200 usable samples, of which 20% are held out for validation and the remaining 80% are used for training.
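The post-processing chain described above (decimation, finite-difference targets, additive sensor noise, shuffle, 80/20 split) can be sketched as follows. The function names are hypothetical, the 2 mK figure is interpreted here as a standard deviation, and the choice to difference the noise-free trace is an assumption, since the text does not say which trace the targets are computed from.

```python
import random


def make_samples(temps, dt=0.1, decimate=5, noise_std=0.002, seed=0):
    """Turn one simulated cold-side trace into (noisy T, dT/dt target) pairs."""
    rng = random.Random(seed)
    step = dt * decimate                  # 0.5 s between retained samples
    coarse = temps[::decimate]            # keep every fifth sample
    noisy = [t + rng.gauss(0.0, noise_std) for t in coarse]
    # Forward difference on the decimated grid (assumed: noise-free trace).
    targets = [(coarse[k + 1] - coarse[k]) / step
               for k in range(len(coarse) - 1)]
    return list(zip(noisy[:-1], targets))


def split(samples, val_fraction=0.2, seed=0):
    """Shuffle the pooled samples and hold out a validation fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(val_fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # A 30 s linear ramp of 0.1 K/s stands in for one simulated run.
    trace = [25.0 + 0.01 * k for k in range(301)]
    train, val = split(make_samples(trace))
    print(len(train), len(val))
```

In the full pipeline the pairs from all 80 runs would be pooled before the shuffle, so that neighboring samples from one run do not end up concentrated in either split.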
4.2. Ablation Design
To isolate what each physics ingredient actually contributes, we train three model variants that share the same backbone architecture (Table 1). The full PINN keeps everything: frozen physics baseline, hand-crafted physics features, and the correction regularizer. PhysFeat removes the physics baseline but retains the engineered input features. This shows whether feature engineering alone can replace structural physics embedding. The pure NN uses no physics and learns everything from the seven raw inputs. All three models share the same layer widths, activation, learning-rate schedule, and early-stopping rule, so any accuracy gap must come from the physics components.
4.3. Control Experiment Configuration
Four controllers are tested in closed loop on the three-node plant at 100 Hz: a conventional PID, an integral-separated PID (IS-PID), an NN+PID, and the proposed PINN+PID. All PID instances use the same gains (). Three test conditions of increasing difficulty are used. The first is a single 2 K setpoint step from 25 °C to 27 °C. The second is a four-transition tracking sequence with both heating and cooling segments. The third is a disturbance-rejection test in which ±0.4 A current pulses are injected into the TEC at preset instants.
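For readers unfamiliar with the IS-PID baseline, the sketch below shows one common variant of integral separation, in which integral accumulation is suspended while the error magnitude exceeds a threshold. The threshold, the gains in the usage lines, and the function name are hypothetical; the text does not specify which variant or settings were used.

```python
def is_pid_step(state, error, kp, ki, kd, dt, threshold):
    """One update of an integral-separated PID.

    state is (integral, previous_error); returns (u, new_state).
    """
    integral, prev_error = state
    if abs(error) <= threshold:
        # Integrate only near the setpoint, so a large step cannot
        # pre-load the integrator and cause overshoot later.
        integral += error * dt
    derivative = (error - prev_error) / dt
    u = kp * error + ki * integral + kd * derivative
    return u, (integral, error)


if __name__ == "__main__":
    dt = 0.01                       # 100 Hz control rate, as in the text
    state = (0.0, 2.0)              # start of a 2 K step: error = 2 K
    u, state = is_pid_step(state, 2.0, kp=1.0, ki=10.0, kd=0.0,
                           dt=dt, threshold=0.5)
    print(u, state)                 # integral is still zero during the large error
```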
5. Results and Discussion
5.1. Surrogate Model Accuracy
Table 2 lists the validation metrics for the three models, each trained on the full dataset. Every variant achieves R² = 0.993. The lag features alone resolve enough of the hidden-state ambiguity that even the physics-free NN tracks the three-node dynamics accurately.
Figure 2 shows the validation-set scatter plots for the PINN and the pure NN; in both cases the points cluster tightly around the identity line.
On the full dataset, the NN slightly beats the PINN, which might suggest that the physics term adds nothing. The reason is that roughly 9700 training points already overdetermine the mapping, so the inductive bias supplied by the frozen baseline no longer helps at this scale. The physics branch pays off only when data are scarce, as the next subsection shows.
5.2. Data Efficiency Analysis
For the data-efficiency study, the PINN and the NN were retrained on subsets ranging from 3% to 100% of the training pool, with the validation set held fixed. Table 3 and Figure 3 report the results.
The R² curves separate as the training pool shrinks. At the 3% budget (293 samples), the PINN reaches R² = 0.9658, whereas the NN reaches only 0.9303. In MSE terms, the PINN's error is about 51% lower than the NN's. The cross-budget view is more direct. The PINN at 3% already beats the NN at 10%. The frozen energy-balance baseline therefore acts as a prior that is worth about 5.4× the labeled sample count. At 20%, the PINN reaches R² = 0.9891, close to its full-data value. The NN at the same budget reaches only 0.9716.
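The 51% figure follows directly from the two R² values, because both models are scored on the same fixed validation set and MSE is then proportional to 1 − R². A short check, with a hypothetical helper name:

```python
def mse_reduction(r2_a, r2_b):
    """Fractional MSE reduction of model A relative to model B.

    Valid when both R^2 values come from the same validation set,
    since MSE = (1 - R^2) * Var(y) and Var(y) cancels in the ratio.
    """
    return 1.0 - (1.0 - r2_a) / (1.0 - r2_b)


if __name__ == "__main__":
    # PINN vs NN at the 3% budget, values taken from the text.
    print(f"{mse_reduction(0.9658, 0.9303):.1%}")
```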
The mechanism is straightforward. Because the frozen one-node branch already supplies a rough but physically grounded estimate, the correction network only needs to approximate a small, relatively smooth residual. The NN, in contrast, must reconstruct the full input–output relationship from raw features. Learning a smaller function takes fewer examples.
To assess the statistical reliability of the data-scarce regime, we repeated the 3% training-budget experiment over 10 independent random seeds and report the mean and standard deviation rather than a single run. The residual-correction PINN attains R² = 0.966 ± 0.006, whereas the pure NN reaches only 0.930 ± 0.025. Two consequences follow. First, beyond its higher mean, the PINN exhibits roughly fourfold smaller seed-to-seed dispersion (standard deviation: 0.006 against 0.025). Because the validation-set sampling error is an order of magnitude smaller still (±0.0005 and ±0.001, respectively), this spread originates almost entirely from training-sample selection and weight initialization rather than from the validation set. The lower variance is itself a useful result. Since the residual branch need only learn a small correction on top of the frozen physics baseline, its effective degrees of freedom, and hence its sensitivity to the random seed, are markedly lower than those of the unconstrained NN, which must reconstruct the full mapping. Second, we make the data-efficiency claim precise by defining the factor η(τ) as the ratio of the training-sample counts that the NN and the PINN respectively require to first reach a target accuracy R² = τ. At the target τ = 0.9658, which is the accuracy the PINN already reaches at the 3% budget of 293 samples, the NN needs about 1592 samples, obtained by linear interpolation between the 10% and 20% budgets, so that η = 1592/293 ≈ 5.4×. One caveat applies to this number. The PINN target is a 10-seed mean, but the NN reference points at the 10% and 20% budgets are single runs, so η inherits their run-to-run variability and should be read as a point estimate rather than an exact multiplier. Given the seed dispersion of the NN at small budgets, we therefore describe the data-efficiency gain as roughly fivefold, with η ≈ 5.4× at this particular accuracy target.
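The interpolation behind η can be sketched as below. Two inputs are assumptions and are flagged as such in the code: the sample counts at the 10% and 20% budgets are derived from a training pool of about 9760 samples (80% of roughly 12,200), and the NN's R² at the 10% budget is not quoted in the text, so a placeholder value back-solved to be consistent with the reported 1592 samples is used. Only the 20% value (0.9716), the target (0.9658), and the PINN count (293) come from the text.

```python
def samples_to_reach(target, curve):
    """Linearly interpolate the sample count at which R^2 first reaches target.

    curve is a list of (n_samples, r2) pairs sorted by n_samples.
    """
    for (n0, r0), (n1, r1) in zip(curve, curve[1:]):
        if r0 < target <= r1:
            return n0 + (target - r0) / (r1 - r0) * (n1 - n0)
    raise ValueError("target lies outside the tabulated range")


# (samples, R^2) for the NN. The 10% entry is a PLACEHOLDER, not a reported value.
NN_CURVE = [(976, 0.9559), (1952, 0.9716)]
PINN_SAMPLES = 293      # 3% budget, from the text
TARGET = 0.9658         # PINN accuracy at the 3% budget, from the text

nn_samples = samples_to_reach(TARGET, NN_CURVE)
eta = nn_samples / PINN_SAMPLES

if __name__ == "__main__":
    print(f"NN samples needed: {nn_samples:.0f}, eta = {eta:.1f}x")
```

Because the interpolation is linear between only two single-run points, small changes in either NN value shift η noticeably, which is the reason for reading it as a point estimate.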
5.3. Role of Temporal Features
We retrained every model with the instantaneous triplet [t, T(t), u(t)] only, without any lagged samples. The result was unequivocal. PINN, PhysFeat, and NN all plateaued at R² ≈ 0.77, and no amount of capacity increase or regularization tuning budged the number. This ceiling is not a training artifact; it reflects a hard information-theoretic limit. The controller has access to the cold-side temperature only, whereas the three-node system has two hidden states (the ceramic-layer and hot-side temperatures). Different internal configurations can yield identical (T, u) pairs yet produce different values of the temperature derivative. Without past readings to disambiguate, the mapping from (T, u) to the derivative is genuinely multi-valued, and no network, regardless of depth or width, can collapse that ambiguity.
Figure 4 illustrates the learned correction: it is largest at high currents, where Joule heating dominates and the one-node model fits worst.
Appending just two historical snapshots, separated by 0.5 s, lifts every model above R² = 0.99. The finite differences between the current and lagged readings serve as numerical proxies for the first and second time derivatives, and together with the present-time state they span the observable subspace of the three-node plant. That a single change in input design, namely adding the lagged samples, lifts R² from 0.77 to above 0.993 underscores that observability, not architectural sophistication, was the binding constraint.
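A minimal sketch of the lagged input vector and of the derivative proxies it implies is given below. The ordering of the seven inputs and the interpretation of the two lags as 0.5 s and 1.0 s in the past are assumptions; the function names are hypothetical.

```python
def lagged_features(times, temps, currents, k, lag=1):
    """Seven-element input at index k on the 0.5 s decimated grid."""
    if k < 2 * lag:
        raise IndexError("need two lagged samples before index k")
    return [times[k], temps[k], currents[k],
            temps[k - lag], currents[k - lag],          # one lag back
            temps[k - 2 * lag], currents[k - 2 * lag]]  # two lags back


def derivative_proxies(temps, k, step=0.5):
    """Backward finite differences that the network can form from the lags."""
    first = (temps[k] - temps[k - 1]) / step
    second = (temps[k] - 2.0 * temps[k - 1] + temps[k - 2]) / step ** 2
    return first, second


if __name__ == "__main__":
    times = [0.0, 0.5, 1.0]
    temps = [t ** 2 for t in times]     # T = t^2, so d2T/dt2 = 2 exactly
    currents = [0.3, 0.3, 0.3]
    print(lagged_features(times, temps, currents, k=2))
    print(derivative_proxies(temps, k=2))
```

The network never receives these differences explicitly; it only needs the three temperature readings, from which a linear first layer can form both combinations.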
5.4. Control Performance
A high open-loop R² does not by itself guarantee closed-loop performance. Errors in the feedforward prediction pass through the bisection inversion and then couple with the PID dynamics. The resulting amplification depends on the scenario, so closed-loop testing is the only reliable check.
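The bisection inversion mentioned above turns a surrogate that predicts the temperature rate for a given current into a feedforward law that returns the current for a desired rate. The sketch below is one possible form, assuming the predicted rate is monotonic in the current over the bracket; the function name, the bracket, the fallback to the nearer limit, and the toy model in the usage lines are all hypothetical.

```python
def invert_by_bisection(rate_model, target_rate, lo, hi, tol=1e-6, max_iter=60):
    """Find u in [lo, hi] such that rate_model(u) is close to target_rate.

    Assumes rate_model is monotonic in u on the bracket. If the target is
    unreachable, the limit whose predicted rate is closer is returned.
    """
    f_lo = rate_model(lo) - target_rate
    f_hi = rate_model(hi) - target_rate
    if f_lo * f_hi > 0:                 # no sign change: actuator saturates
        return lo if abs(f_lo) < abs(f_hi) else hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = rate_model(mid) - target_rate
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid <= 0:           # root lies in the lower half
            hi = mid
        else:                           # root lies in the upper half
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy stand-in for the surrogate at a fixed temperature: Peltier-like
    # linear term plus Joule-like quadratic term (hypothetical coefficients).
    toy = lambda u: -0.8 * u + 0.1 * u * u
    print(invert_by_bisection(toy, toy(0.7), lo=-1.5, hi=1.5))
```

Any prediction error in the surrogate shifts the root that the bisection finds, which is how open-loop model error propagates into the drive current.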
Table 4 summarizes the four controllers across the three test conditions. Figure 5 shows the response to the 2 K setpoint step. The PINN+PID settles in 7.2 s, whereas the standalone PID takes 18.3 s. Overshoot falls from 160 mK to 30 mK. The feedforward branch delivers most of the energy demand at the moment of the setpoint change, so the PID integrator never winds up from zero. The steady-state error is 2.3 mK, inside the ±10 mK budget for 100 GHz DWDM. The NN+PID settles in 9.4 s, but its steady-state residual is visibly larger. We attribute this to the absence of an energy-balance backbone, which leaves the NN feedforward poorly conditioned at small errors.
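Settling time and overshoot can be extracted from a recorded step response as sketched below. The text does not state the settling criterion, so the ±10 mK band used here is an assumption borrowed from the stability budget quoted above; the function name and the sample trace are hypothetical.

```python
def step_metrics(times, temps, setpoint, band=0.010):
    """Return (settling_time, overshoot) for an upward setpoint step.

    settling_time: elapsed time until the trace enters the +/-band around
    the setpoint and stays there (None if it never settles).
    overshoot: largest excursion above the setpoint, in kelvin.
    """
    overshoot = max(0.0, max(temps) - setpoint)
    last_outside = -1
    for k, value in enumerate(temps):
        if abs(value - setpoint) > band:
            last_outside = k            # remember the latest out-of-band sample
    if last_outside == len(temps) - 1:
        return None, overshoot          # still outside at the end of the record
    return times[last_outside + 1] - times[0], overshoot


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    temps = [25.0, 26.0, 27.1, 27.02, 27.005, 27.001]   # made-up trace
    print(step_metrics(times, temps, setpoint=27.0))
```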
The four-step tracking scenario (Figure 6) is more demanding. Heating and cooling transients alternate in rapid succession, and the controller must reverse the TEC drive polarity each time. PINN+PID handles this gracefully, with tracking RMSE dropping to 58.3 mK, a 69% improvement over the 186.4 mK recorded for PID alone. Inspection of the control signal reveals that the feedforward output flips sign almost immediately at each setpoint transition, pre-charging the TEC current before the PID branch even registers an error. Between transitions, the PINN+PID envelope stays within ±4.2 mK, whereas the standalone PID wanders over a 15.3 mK band.
Of the three closed-loop tests, disturbance rejection (Figure 7) is arguably the most telling. A +0.4 A bias-current spike injected at t = 30 s pushes the PID-only loop 6.8 mK off setpoint, and full recovery takes 16.5 s. Under the same perturbation, the PINN+PID excursion stays below 1.8 mK, with a 6.5 s recovery. Across the entire disturbance window, the PINN+PID RMSE is 1.2 mK, which is 3.6× smaller than the 4.3 mK recorded for PID. The physics branch sees the current change at once and updates its heat-injection estimate within the same control cycle. The feedforward path then starts to compensate before the PID integrator has accumulated any error.
5.5. Discussion
The main lesson from these experiments concerns how physics knowledge is incorporated. In an earlier stage, we tried the conventional PINN formulation. A physics-residual term was added to the data loss, and the accuracy dropped below the pure NN baseline. The reason lies in the model itself. The one-node equation does not capture the three-node system. Even after parameter calibration, it yields R² ≈ −0.70 on its own. Under these conditions, the two loss terms pull the optimizer in opposite directions. Reducing the physics residual moves the output toward an inaccurate model. Reducing the data residual moves it away. The residual-correction architecture removes this conflict by construction. The physics model feeds into the network output, not into the loss. This matters for thermal packages, where first-principles models are coarse at best. Building the physics into the network structure therefore works better than imposing it through a loss penalty.
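The conflict between the two loss terms can be illustrated with a deliberately simplified scalar toy problem. It is not the PINN loss used in the experiments: here both terms act on a single output value y, with y_data the true target and y_phys the (wrong) physics prediction, and the weight lam and function names are hypothetical.

```python
def penalized_optimum(y_data, y_phys, lam):
    """Minimizer of (y - y_data)^2 + lam * (y - y_phys)^2 over a scalar y.

    Setting the derivative to zero gives a weighted average, so any
    lam > 0 drags the output toward the inaccurate physics value.
    """
    return (y_data + lam * y_phys) / (1.0 + lam)


def residual_optimum(y_data, y_phys):
    """Minimizer of (y_phys + c - y_data)^2 over the correction c.

    The physics value is part of the output, not of the loss, so the
    correction can absorb the full mismatch.
    """
    correction = y_data - y_phys
    return y_phys + correction


if __name__ == "__main__":
    y_data, y_phys = 1.0, 0.2           # made-up values with a large mismatch
    print(penalized_optimum(y_data, y_phys, lam=1.0))   # biased toward y_phys
    print(residual_optimum(y_data, y_phys))             # recovers y_data
```

In the penalized form the bias vanishes only as lam goes to zero, at which point the physics contributes nothing; the residual form keeps the physics in play without that trade-off.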
A second finding, which we did not anticipate, is that input design dominates model design. Going from R² ≈ 0.77 without lag features to R² > 0.993 with them is by far the largest effect in the ablation. Whether the physics branch is on or off matters much less. From an observability standpoint, the reason is simple. A third-order system requires at least three independent measurements (or their time-shifted equivalents) to pin down the state. Two lagged readings plus the current value give exactly that. Once the state is fully observable, a plain NN fits the data almost as well as the PINN. Physics then serves a different purpose. It reduces the number of samples needed, rather than enabling prediction outright.
The data-efficiency results close the loop on this story. The physics prior helps most when labeled data are scarcest, exactly the situation where extra help matters. At 3% of the training budget, the PINN reaches R² = 0.966. The NN needs 10–20% of the data to get there. For production lines where every TEC module undergoes individual calibration, cutting the required data by about 5.4× directly shortens cycle time and lowers unit cost. The advantage fades with abundant data, because prior knowledge matters less once the data speak for themselves.
We should be upfront about the limitations. Everything reported above is simulated. Our three-node plant includes temperature-dependent material properties, 2 mK measurement noise, and the full Peltier/Joule interaction, so it is not trivial. But it cannot model contact-resistance drift, solder creep, or humidity exposure, all of which matter in fielded hardware. Closing the sim-to-real gap will likely require online adaptation or transfer learning of some kind. A second limitation is that the one-node model parameters are identified offline and never updated. Letting them co-train with the correction network, as recent augmented-physics papers suggest, could reduce the residual the NN branch has to learn and might improve both accuracy and data efficiency further. A third simplification concerns the thermal physics itself. The lumped-node models omit spatially distributed heat-flux gradients and temperature-dependent interface phenomena, such as contact-conductance variation across the solder and ceramic boundaries. These effects are likely to matter in real semiconductor-laser packages, and incorporating them, either in the high-fidelity plant or in the embedded baseline, is a worthwhile direction for tightening the sim-to-real correspondence.
6. Conclusions
We have demonstrated that freezing a one-node energy-balance model inside the network graph, rather than penalizing its residual in the loss, yields a workable PINN even when the underlying physics is badly wrong (standalone physics R² ≈ −0.70). Equally critical is the inclusion of temporal lag inputs. Stripping them out collapses every architecture we tested, physics-augmented or not, to R² ≈ 0.77, because the cold-side sensor alone cannot disambiguate the hidden thermal states of the ceramic and hot-side nodes.
From an engineering standpoint, the headline number is not peak R² (the PINN and a plain NN essentially tie when data are plentiful) but calibration cost. At a 3% data budget, the PINN holds R² = 0.966 against 0.930 for the NN, a gap that translates into an approximately 5.4× reduction in the number of labeled sweeps a production line must collect per module. When the trained surrogate is embedded in a feedforward–PID loop, the settling time drops by 60%, the tracking RMSE by 69%, and the peak disturbance excursion by 74% compared with PID alone. The steady-state error (2.3 mK) and the peak disturbance excursion (1.8 mK) both land well inside the ±10 mK envelope required by 100 GHz DWDM channel spacing.
Hardware validation on actual DFB-TEC modules is the natural next priority. No simulation, ours included, captures the aging trajectories, humidity ingress, and contact-resistance creep that accumulate over thousands of thermal cycles in the field. Three follow-on directions strike us as particularly promising. First, an online adaptation loop that updates the surrogate weights during deployment so the model can track slow parametric drift. Second, extension to multi-lane transmitter packages, where inter-channel thermal coupling is non-negligible and currently unmodeled. Third, relaxing the frozen-coefficient constraint so that the one-node parameters co-evolve with the correction weights during training, which should shrink the residual the NN branch must learn and further improve sample efficiency.