On the Sufficiency of Direct Regression for Perovskite Solar Cell Degradation Forecasting

Chahine, Khaled; Noura, Hassan N.

doi:10.3390/asi9060116

Open Access Article


by Khaled Chahine ¹ and Hassan N. Noura ²,³,*

¹ College of Engineering and Technology, American University of the Middle East, Egaila 54200, Kuwait

² Electrical and Computer Engineering Department, American University of Beirut, Beirut 1107 2020, Lebanon

³ Institut FEMTO-ST, CNRS, IUT-NFC, Université Marie et Louis Pasteur, F-90000 Belfort, France

* Author to whom correspondence should be addressed.

Appl. Syst. Innov. 2026, 9(6), 116; https://doi.org/10.3390/asi9060116

Submission received: 25 April 2026 / Revised: 27 May 2026 / Accepted: 28 May 2026 / Published: 30 May 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Accurate prediction of the long-term maximum power point tracking (MPPT) degradation trajectory of perovskite solar cells (PSCs) from short-term measurements can significantly reduce the time required for material characterization. Although conditional diffusion models have recently been introduced for degradation prediction in energy devices, their applicability to PSC-specific MPPT degradation trajectory forecasting remains uncertain due to the complexity of the underlying dynamics. This study benchmarks three approaches using 2245 devices from a publicly available dataset: NHITS, a hierarchical multilayer perceptron (MLP) with direct multi-horizon regression; Probabilistic NHITS (P-NHITS), which utilizes the same architecture with multi-quantile output; and TimeDiff, a conditional diffusion model with a CSDI backbone, autoregressive initialization, mode conditioning, and classifier-free guidance. The results indicate that PSC degradation under controlled conditions is predominantly single-exponential, with device-specific decay rates identifiable within the first 30 h. Therefore, the forecasting task is most appropriately framed as a regression problem rather than a generative one. NHITS achieves a root mean squared error (RMSE) of 0.738 PCE% compared to TimeDiff’s 0.863 (a 17% increase, p < 10⁻¹⁵), despite TimeDiff incorporating all architectural advantages reported in the literature. P-NHITS matches deterministic accuracy (0.744 PCE%) while providing prediction intervals with 77% coverage without sampling, which is closer to the nominal 80% target than TimeDiff’s 63% coverage from 50 DDPM samples. For T90 (the time at which PCE first falls below 90% of its reference value) lifetime prediction restricted to forecast-window crossings, NHITS achieves a mean absolute error (MAE) of 16.2 h, outperforming TimeDiff’s 22.5 h. For smooth, unimodal degradation processes, direct regression with quantile outputs is both sufficient and preferable to conditional diffusion. Model selection should be guided by the underlying physical processes rather than by methodological trends.

Keywords:

perovskite solar cells; degradation forecasting; time-series prediction; diffusion models; NHITS

1. Introduction

Perovskite solar cells (PSCs) have achieved certified power conversion efficiencies exceeding 26%, positioning them as viable candidates for commercialization alongside silicon [1]. However, stability remains the primary limitation, not efficiency. The International Summit on Organic Photovoltaic Stability mandates that devices undergo MPPT under standardized stress conditions for several hundred hours to assess degradation [2]. At the HySPRINT facility, each aging test records 150 h of data per device at 10 min intervals using a high-throughput system [3]. Given finite testing capacity, each 150 h slot allocated to a single device precludes testing alternative materials or designs during that period.

Predicting long-term degradation from early test data would alleviate this limitation. If the initial 30 h of MPPT data can accurately forecast the subsequent 120 h, the total testing time per device could be reduced from 150 to 30 h, enabling the screening of approximately five times as many devices with the same equipment. The central question is which modeling approach is most appropriate for this forecasting task.

When formulated as a time-series forecasting problem with the aim of predicting hours 30 to 149 based on data from hours 0 to 29, the task becomes amenable to modern deep learning methods. Conditional diffusion models have recently gained prominence as generative approaches for such tasks. CSDI [4] introduced conditional score-based diffusion for probabilistic time-series imputation. TimeDiff [5] extended CSDI with autoregressive initialization and future mix-up, achieving leading results on nine real-world datasets. In a related application, DiffBatt [6] employed a denoising diffusion probabilistic model (DDPM) [7] with classifier-free guidance [8] for lithium-ion battery state-of-health prediction, proposing diffusion as a general framework for degradation forecasting in energy devices. This context raises a question that is both physical and methodological: do PSC degradation trajectories possess sufficient complexity to warrant generative modeling? Diffusion models are advantageous when future outcomes are multimodal, heavy-tailed, or analytically intractable. Battery degradation, characterized by cycling-dependent capacity loss, nonlinear knee points, and path-dependent aging [9], plausibly satisfies these criteria. Whether PSC degradation under controlled MPPT conditions exhibits similar complexity remains unexamined.

PSC degradation under controlled MPPT conditions is governed by a limited set of mechanisms: ion migration across interfaces, trap-state formation, contact degradation, and phase instabilities [10,11]. In the Hartono dataset, which features a nitrogen atmosphere, fixed illumination, and moderate temperatures [1], major external stressors such as humidity, ultraviolet exposure, and thermal cycling are absent. As a result, degradation is primarily determined by material properties established during fabrication. Two observations suggest that the forecasting task is low-dimensional. First, Hartono et al. [1] reported that 96.8% of devices reach their maximum power conversion efficiency (PCE) before 150 h, indicating that the burn-in phase, comprising light soaking, ion redistribution, and contact equilibration, concludes early, with the subsequent trajectory characterized by irreversible degradation at a device-specific rate. Second, unsupervised clustering of normalized MPPT degradation trajectories using self-organizing maps [12] identified four degradation modes: initial gain followed by plateau (approximately 56%), slow exponential decay (approximately 30%), medium exponential decay (approximately 10%), and fast exponential decay (approximately 4%). These modes share a common functional form and differ primarily in rate and the shape of the initial transient [1].

If the post-transient trajectory can be accurately represented by a decaying exponential characterized by a rate constant and a steady-state level, forecasting becomes a matter of estimating these parameters from partial data rather than sampling from a complex joint distribution over future time steps. In this scenario, a direct regression model is sufficient, and employing a diffusion model introduces unnecessary complexity without commensurate benefit.
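As a minimal sketch of this parameter-estimation view, the Python snippet below fits $y(t) = y_\infty + A e^{-kt}$ to a synthetic 30 h window and extrapolates to hour 149. The grid search over $k$, its search range, the function names, and the synthetic curve are illustrative assumptions; the benchmark itself uses NHITS rather than an explicit parametric fit.

```python
import numpy as np

def fit_single_exponential(t, y, rates=None):
    """Fit y(t) ~ level + amp * exp(-rate * t).

    For a fixed rate the model is linear in (level, amp), so those two are
    solved by least squares; the rate with the smallest residual is kept.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    if rates is None:
        # Assumed search range for the per-hour decay rate.
        rates = np.logspace(-4, 0, 400)
    best = None
    for rate in rates:
        design = np.column_stack([np.ones_like(t), np.exp(-rate * t)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(rate), float(coef[0]), float(coef[1]))
    _, rate, level, amp = best
    return rate, level, amp

def extrapolate(t_future, rate, level, amp):
    """Evaluate the fitted exponential at future times."""
    return level + amp * np.exp(-rate * np.asarray(t_future, dtype=float))

# Synthetic device: steady-state level 12 PCE%, amplitude 4, rate 0.02 per hour.
t_obs = np.arange(30.0)
y_obs = 12.0 + 4.0 * np.exp(-0.02 * t_obs)
rate, level, amp = fit_single_exponential(t_obs, y_obs)
forecast = extrapolate(np.arange(30.0, 150.0), rate, level, amp)
```

On real curves the steady-state level is weakly constrained by a 30 h window, which is one reason a learned regressor trained across many devices may be preferable to a per-device fit.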

The main contributions of this work are as follows:

A physics-grounded characterization of PSC degradation complexity: We demonstrate that 150 h MPPT degradation trajectories are predominantly single-exponential, with device-specific decay rates identifiable within the first 30 h, making the forecasting task low-dimensional and amenable to direct regression.
A rigorous three-way benchmark: We compare NHITS [13], a hierarchical MLP with direct multi-horizon regression; P-NHITS, which uses the same architecture with multi-quantile output [14]; and TimeDiff [5], a conditional diffusion model incorporating the full CSDI backbone, autoregressive initialization, mode conditioning, and classifier-free guidance. The evaluation is conducted on 2245 devices from the Hartono et al. dataset [1]. Despite providing TimeDiff with every architectural advantage described in the literature, NHITS outperforms it by 17% on point RMSE (0.738 vs. 0.863 PCE%, p < 10⁻¹⁵, Wilcoxon signed-rank test).
Evidence that uncertainty quantification does not require generative models in this context: P-NHITS attains 77% coverage on nominal 80% prediction intervals with negligible loss in point accuracy (0.744 vs. 0.738 PCE%). In contrast, TimeDiff’s sample-based intervals cover only 63%, indicating overconfidence despite 50 DDPM samples. For T90 lifetime prediction over forecast-window crossings, both NHITS variants achieve an MAE of 16.2–16.9 h, compared to TimeDiff’s 22.5 h.
Practical guidelines linking degradation physics to model selection: We delineate when diffusion models add value (multimodal futures, cycling-dependent dynamics) and when they do not (smooth, unimodal, physics-constrained trajectories).

The remainder of this paper is structured as follows. Section 2 reviews related work on PSC stability assessment, time-series forecasting, and conditional diffusion models. Section 3 details the HySPRINT dataset and the physical basis for low-dimensional degradation dynamics. Section 4 describes the three models and the evaluation protocol. Section 5 presents the benchmark results, and Section 6 interprets these results in the context of degradation physics. Finally, Section 7 summarizes the findings and their broader implications.

2. Related Work

2.1. Perovskite Degradation Modeling and the Data Landscape

Characterizing PSC stability requires standardized aging protocols and large-scale datasets. Khenkin et al. [2] established ISOS consensus procedures for PSC stability assessment, defining stress levels (light, thermal, dark storage) and reporting standards. These protocols make explicit a practical tension: rigorous characterization requires hundreds of hours of continuous MPPT monitoring per device, but screening many compositional and architectural variants requires high throughput.

The Perovskite Database Project [15] aggregated data from over 42,400 devices, but fewer than 20% include degradation data, and quality limitations cause model performance to plateau [16]. Zhang et al. [17] introduced a normalized stability indicator (TS80m) for cross-device comparisons, but the approach relies on scalar metrics rather than continuous trajectory data. Both efforts underscore the same limitation: literature-aggregated databases discard the trajectory information needed for forecasting.

The dataset from Hartono et al. [1] is notable for comprising 2245 PSC devices aged under controlled conditions (continuous MPPT, nitrogen atmosphere, 1 sun illumination, 20 to 85 °C) in a single laboratory over three years, utilizing a custom high-throughput system [3]. To the best of the authors’ knowledge, it represents the only large-scale, homogeneous, open-access PSC MPPT time-series dataset currently available. Hartono et al. reported that higher-efficiency cells exhibit greater statistical stability (∼1.5% reduction in relative PCE loss per 1% increase in maximum PCE) and identified four degradation modes using self-organizing map clustering [12]. This dataset and its mode classification provide the foundation for the benchmark described in this work.

All previous PSC-specific machine learning has focused on predicting tabular properties from composition or processing features. No published work has addressed PSC MPPT degradation trajectory forecasting from partial MPPT observations. This work fills that gap.

2.2. Time-Series Forecasting for Degradation in Energy Devices

Data-driven degradation prediction for lithium-ion batteries has been extensively studied, with the primary objective of estimating remaining useful life (RUL) or state of health (SOH) from early-cycle data. Severson et al. [9] demonstrated that features extracted from the first 100 charge–discharge cycles can predict battery cycle life with a 9.1% test error. Subsequent studies have utilized long short-term memory networks (LSTMs), Transformers, and physics-informed models to address this task by leveraging patterns in cycling-dependent capacity fade.

Battery and PSC degradation differ in complexity. Battery aging involves cycling-dependent capacity knees, calendar aging, and path-dependent changes, which can occasionally produce multimodal futures [9]. PSC MPPT degradation under fixed conditions is simpler: after an initial transient, the decline is monotonic and governed by a device-specific decay rate. This distinction is central to the present argument.

For deterministic forecasting, N-BEATS [18] demonstrated that deep stacks of fully connected layers with backward and forward residual links, without any time-series-specific inductive bias, can match or surpass specialized architectures on standard benchmarks. NHITS [13] advanced this approach by incorporating hierarchical interpolation and multi-rate downsampling, decomposing forecasts into components at different temporal scales. This method achieves approximately 20% greater accuracy than Transformer-based models at a computational cost 50 times lower, as implemented in the NeuralForecast library [19]. Two characteristics make NHITS particularly suitable for degradation forecasting. First, its direct multi-horizon output predicts all 120 future steps simultaneously, thereby avoiding the error accumulation that affects autoregressive methods over extended horizons. Second, NHITS can function without per-series normalization, preserving absolute power conversion efficiency (PCE) levels. This is important for PSC degradation: a device at 20% PCE losing 0.05 percentage points per hour will exhibit a qualitatively different trajectory from a device at 5% PCE losing efficiency at the same absolute rate, and normalizing each series would obscure this distinction.

2.3. Conditional Diffusion Models Versus Direct Regression

Denoising diffusion probabilistic models (DDPMs) [7] generate samples by reversing a noise-adding process, iteratively transforming Gaussian noise into structured outputs. CSDI [4] adapted this framework for time-series imputation by conditioning the denoiser on observed values using a two-channel architecture. Channel 0 contains the conditional observations, while Channel 1 contains the noisy imputation target. The training process is self-supervised, dividing observed values into conditioning and target sets, similar to masked language modeling. CSDI achieved a 40–65% improvement in CRPS over previous probabilistic imputation methods and demonstrated competitive forecasting performance.

TimeDiff [5] extended CSDI through two primary modifications: autoregressive (AR) initialization, in which a pretrained recurrent model provides a warm-start forecast in Channel 0, and future mix-up, which incorporates ground-truth future values into the conditioning signal during training. These enhancements resulted in 10–30% improvements in mean squared error (MSE) over CSDI on standard benchmarks. In contrast, TimeGrad [20] combined DDPM with a recurrent neural network (RNN) encoder for autoregressive probabilistic forecasting; however, its sequential generation process is susceptible to error accumulation over extended prediction horizons.

DiffBatt [6] represents the most directly relevant prior work. This approach applied a DDPM with classifier-free guidance [8] to lithium-ion battery degradation, reporting an RMSE of 196 cycles for remaining useful life (RUL) prediction across multiple datasets and characterizing diffusion as a “general-purpose framework for battery degradation”. However, DiffBatt was evaluated only against less competitive baselines (MLP, CNN, LSTM) and did not include comparisons with NHITS or other advanced deterministic models. Whether the advantages of diffusion for batteries, which often exhibit complex and potentially multimodal degradation, extend to PSCs under controlled MPPT conditions remains unresolved.

This gap motivates the central research question. Diffusion models provide the greatest benefit when future outcomes are multimodal or inadequately represented by simple quantile estimates. In cases where the underlying process follows a single-exponential decay, defined by a rate and a steady-state level, the future trajectory is nearly deterministic given the conditioning window. Generating multiple samples from a learned reverse process to estimate what is essentially a two-parameter regression introduces unnecessary variance without contributing additional information.

A simpler alternative for uncertainty quantification is quantile regression [14], which directly estimates conditional quantiles without distributional assumptions or sampling. If a single architecture can deliver both accurate point forecasts and calibrated prediction intervals, a separate diffusion model for uncertainty becomes unnecessary.

3. Dataset and Degradation Physics

3.1. The HySPRINT Aging Dataset

This study uses the MPPT aging dataset published by Hartono et al. [1], collected at the HySPRINT laboratory (Helmholtz-Zentrum Berlin) using a custom high-throughput aging system [3]. The dataset contains 2245 perovskite solar cell devices fabricated over three years (2019 to 2022) in 66 aging batches, each tied to a specific fabrication campaign. The devices include lead–halide perovskite absorbers such as the triple-cation CsMAFAPbIBr family (881 cells), CsPbI₃ (218 cells), and FAPbI₃ (56 cells). They use charge-selective layers (spiro-OMeTAD, MeO-2PACz, C60/BCP, PCBM, NiO, PTAA), contact metals (Au, Ag, Cu), and both p-i-n (1743) and n-i-p (502) architectures.

All devices were aged under continuous MPPT tracking with a perturb-and-observe algorithm, a 0.01 V step, and a 1-s delay. Tests were conducted in a nitrogen atmosphere under 1000 W/m² simulated AM1.5G illumination from a metal–halide lamp with active intensity control. Device temperatures were set at 25, 45, 65, or 85 °C depending on the batch. PCE values were recorded every 10 min for approximately 150 h per device, producing roughly 900 data points per device, following ISOS-L-1I/L-2I protocols [2]. The public dataset (Zenodo 8185883) includes only the raw PCE time series and does not provide per-device metadata on composition, architecture, or aging temperature.

3.2. Preprocessing Pipeline

The preprocessing pipeline transforms raw 10 min measurements into a uniform hourly grid suitable for time-series modeling. This process consists of three steps:

Duration filter: Devices with a total measurement duration less than 149 h are excluded, ensuring that only devices with complete aging runs are retained.
Causal outlier detection and repair: A backward-looking rolling median filter with a window of 42 samples (approximately 7 h) is applied to each device’s PCE series. At each time step, the local median and median absolute deviation (MAD) are calculated from the preceding window. Data points deviating by more than 5 × 1.4826 × MAD from the rolling median are identified as outliers and replaced with the local median. Devices exhibiting more than 5% outliers, or containing values outside the physical range (PCE below −1% or above 50%), are excluded. Since the filter relies solely on past data, the temporal structure necessary for forecasting is maintained.
Temporal resampling: The irregularly spaced 10 min samples are resampled to a uniform 1 h grid using Akima cubic spline interpolation [21]. Akima interpolation fits piecewise cubic functions determined by local slopes at each data point, thereby preserving monotonicity and preventing the Runge oscillations associated with global polynomial methods. This procedure produces 150 evenly spaced data points per device.
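The outlier step above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the window is taken to be the 42 raw samples strictly before the current one, the first 42 samples are left untouched, and the function names and synthetic series are hypothetical. The resampling step is not covered here.

```python
import numpy as np

def causal_mad_repair(pce, window=42, n_mads=5.0):
    """Flag and repair outliers using only the preceding `window` samples."""
    pce = np.asarray(pce, dtype=float)
    repaired = pce.copy()
    is_outlier = np.zeros(pce.size, dtype=bool)
    # Assumption: the first `window` samples are left untouched because no
    # full backward-looking window exists for them yet.
    for i in range(window, pce.size):
        past = pce[i - window:i]  # raw past values, current sample excluded
        med = np.median(past)
        mad = np.median(np.abs(past - med))
        if abs(pce[i] - med) > n_mads * 1.4826 * mad:
            is_outlier[i] = True
            repaired[i] = med  # replace with the local (past-only) median
    return repaired, is_outlier

def keep_device(pce, is_outlier, max_outlier_frac=0.05, lo=-1.0, hi=50.0):
    """Exclusion rules: too many outliers or PCE outside the physical range."""
    pce = np.asarray(pce, dtype=float)
    in_range = np.all((pce >= lo) & (pce <= hi))
    return bool(in_range and is_outlier.mean() <= max_outlier_frac)

# Synthetic 10 min series: slow decay, a small ripple, and one spike.
idx = np.arange(300)
clean = 15.0 - 0.01 * idx + 0.02 * np.sin(idx)
noisy = clean.copy()
noisy[100] += 3.0
repaired, flags = causal_mad_repair(noisy)
```

Because the median and MAD are computed from past samples only, the same filter could in principle run online during an aging test.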

Following preprocessing, 2030 devices are retained for analysis.

3.3. Degradation Modes and Their Physical Origins

Self-organizing map (SOM) clustering [12] is employed to characterize the diversity of degradation behaviors using MaxAbs-normalized trajectories, where each trajectory is divided by its absolute maximum. A 2 × 2 Kohonen map is trained for 10,000 iterations with principal component analysis (PCA)-based weight initialization, σ = 0.5, and a learning rate of 0.1. Clusters are subsequently sorted by the mean normalized power conversion efficiency (PCE) at hour 150, resulting in four canonical degradation modes:

Mode 0: Initial gain then plateau (∼56%). Light soaking and beneficial ion redistribution during early operation enhance PCE before stabilization. Mobile ions (I⁻, MA⁺) redistribute to establish favorable built-in fields at interfaces [11]. The gain phase is transient, while the plateau reflects the device’s intrinsic steady-state efficiency. This mode is most frequently observed among high-efficiency devices.
Mode 1: Slow exponential decay (∼30%). This mode exhibits gradual, monotonic efficiency loss at less than 0.5% per day, consistent with slow irreversible processes such as trap-state accumulation at grain boundaries, progressive contact oxidation, or slow halide phase segregation.
Mode 2: Medium exponential decay (∼10%). This mode is characterized by a steeper decline at 0.5 to 2% per day, potentially involving multiple concurrent degradation mechanisms or devices with less robust interface engineering.
Mode 3: Fast exponential decay (∼4%). This mode involves rapid failure, often reaching near-zero PCE within 50 to 100 h. It is consistent with catastrophic interface failure, delamination, or severe phase instability. Hartono et al. [1] reported no representation of this mode among devices with maximum PCE above 19.2%.

All four modes exhibit the same functional form: a transient phase (burn-in) followed by approximately exponential decay at a device-specific rate. The modes differ in rate and initial transient shape, but not in functional complexity. If each trajectory can be described by a small set of parameters, specifically transient duration, decay rate, and asymptotic level, then the forecasting task remains inherently low-dimensional, even when represented as 120 hourly values. This observation forms the physical basis for the hypothesis tested in this study.

3.4. Batch Structure and Generalization

The 66 aging batches represent distinct fabrication campaigns, each characterized by unique start dates, compositions, and processing conditions. The within-batch PCE standard deviation is approximately 42% of the overall standard deviation, indicating that batch membership reflects meaningful differences in fabrication and processing conditions.

The train, validation, and test split (70/15/15) incorporates this structure by employing stratified sampling based on conditioning-window slopes. For each device, the linear slope over the first 30 h is calculated and assigned to a slope-quartile label. This stratified approach ensures that each fold contains a representative distribution of initial degradation rates, thereby preventing the model from being evaluated solely on devices with slope profiles not present in the training set. Consequently, this approach provides a conservative evaluation, requiring the model to generalize across a broad range of initial behaviors rather than interpolate within a limited slope range. The slope-quartile labels are used exclusively for stratifying the train/validation/test split; they are never provided as input features to NHITS or P-NHITS. TimeDiff receives causal mode labels derived from the same conditioning-window slopes (Section 4.3), but these labels are also available at test time because they depend only on hours 0–29. No future information (hours 30–149) enters either the stratification or the mode classification. The stratification ensures balanced evaluation across degradation rates, while curvature information, though not used for stratification, is implicitly available to all models through the raw 30 h input curve.
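A slope-quartile stratified split of this kind might look like the following sketch. The helper names, the per-stratum rounding, and the synthetic linear curves are assumptions for illustration; the source does not specify how the quartile edges or fold sizes are computed.

```python
import numpy as np

def slope_quartile_labels(curves, n_cond=30):
    """Label each device 0..3 by the quartile of its conditioning-window slope."""
    curves = np.asarray(curves, dtype=float)
    hours = np.arange(n_cond)
    # polyfit with 2-D y fits every column independently; row 0 holds the slopes.
    slopes = np.polyfit(hours, curves[:, :n_cond].T, deg=1)[0]
    edges = np.quantile(slopes, [0.25, 0.5, 0.75])
    return np.digitize(slopes, edges)

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split device indices so each fold keeps the slope-quartile proportions."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for label in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == label))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])  # remainder goes to the test fold
    return np.concatenate(train), np.concatenate(val), np.concatenate(test)

# Synthetic devices with linear decay at random rates (illustration only).
rng = np.random.default_rng(0)
hours = np.arange(150)
rates = rng.uniform(0.0, 0.05, size=200)
curves = 15.0 - rates[:, None] * hours[None, :]
labels = slope_quartile_labels(curves)
train_idx, val_idx, test_idx = stratified_split(labels)
```

Only hours 0 to 29 enter the label computation, so the same labels can be recomputed for unseen devices at test time.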

3.5. Task Formulation

Using the first 30 h of MPPT efficiency data (30 data points at 1 h intervals), the model forecasts the trajectory for hours 30 to 149 (120 data points). The choice of a 30 h conditioning window is based on Hartono et al.’s [1] finding that 96.8% of devices reach maximum PCE before 150 h. By hour 30, the burn-in transient is largely complete, and the subsequent trajectory is primarily determined by the device-specific decay rate, which can be inferred from the slope and curvature within the conditioning window.

Model performance is evaluated across three complementary dimensions:

Point accuracy: Per-device RMSE is computed over the 120 h forecast horizon and pooled across three random seeds [42, 123, 456]. All RMSE and MAE values are reported in absolute percentage points of PCE (denoted PCE%), representing the difference in efficiency between predicted and actual values on the original scale (e.g., an RMSE of 0.738 PCE% means that the root-mean-square deviation of the prediction from the measured efficiency is 0.738 absolute percentage points). This serves as the primary evaluation metric.
Lifetime milestones: T80 and T90 [22] represent the predicted times at which PCE first falls below 80% or 90% of the reference PCE (maximum PCE in the first 30 h). These values are computed using linear interpolation between hourly grid points and reported as MAE in hours. These milestones are particularly relevant for stability screening, as manufacturers require T90 to determine whether a device will maintain more than 90% of its initial efficiency for 100 h.
Uncertainty calibration: For models that provide prediction intervals (P-NHITS and TimeDiff), 80% coverage is reported as the proportion of actual values within the 10th to 90th percentile band, along with the mean band width in PCE%. Narrower bands at sufficient coverage indicate more informative uncertainty estimates.
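The lifetime and calibration metrics above can be sketched as below. The function names and the toy trajectory are hypothetical, and returning `None` when no crossing occurs is an assumed convention for devices that never reach the threshold.

```python
import numpy as np

def first_crossing_time(hours, pce, reference, fraction=0.9):
    """Time at which pce first drops below fraction * reference.

    Linear interpolation is used between the last grid point at or above the
    threshold and the first grid point below it. Returns None if no crossing.
    """
    hours = np.asarray(hours, dtype=float)
    pce = np.asarray(pce, dtype=float)
    threshold = fraction * reference
    below = np.flatnonzero(pce < threshold)
    if below.size == 0:
        return None
    i = int(below[0])
    if i == 0:
        return float(hours[0])
    t0, t1 = hours[i - 1], hours[i]
    y0, y1 = pce[i - 1], pce[i]  # y0 >= threshold > y1 by construction
    return float(t0 + (y0 - threshold) * (t1 - t0) / (y0 - y1))

def interval_coverage(actual, lower, upper):
    """Fraction of actual values inside the [lower, upper] band."""
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Toy trajectory: 20% PCE losing 0.05 percentage points per hour.
hours = np.arange(150.0)
pce = 20.0 - 0.05 * hours
reference = pce[:30].max()  # maximum PCE in the first 30 h
t90 = first_crossing_time(hours, pce, reference, fraction=0.9)
t80 = first_crossing_time(hours, pce, reference, fraction=0.8)
coverage = interval_coverage([1, 2, 3, 4], [0, 0, 0, 5], [2, 2, 2, 6])
```

The T90 MAE would then be the mean absolute difference between crossing times computed from predicted and measured trajectories, restricted to devices for which both exist.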

For statistical testing, model comparisons employ the Wilcoxon signed-rank test on per-device RMSE, pooled across seeds. This non-parametric, paired test does not assume normality and is suitable for the observed heavy-tailed RMSE distributions. To ensure reproducibility of the results, all experiments utilize three random seeds [42, 123, 456] with identical stratified splits for each seed. No ensembling is performed; each model is evaluated independently on the same test devices.

4. Methods

Three forecasting models of increasing architectural complexity are compared. Table 1 summarizes all hyperparameters.

4.1. NHITS: Direct Multi-Horizon Regression

NHITS [13] addresses the forecasting problem using a hierarchical decomposition: three stacks with progressively finer temporal resolution process the input signal at distinct scales, each generating a partial forecast that is summed to yield the final prediction. Within each stack, MaxPool downsampling, followed by MLP blocks and learned interpolation coefficients, enables the model to capture both slow trends in the low-frequency stack and rapid transients in the high-frequency stack. The model produces all 120 forecast time steps in a single forward pass, performing direct multi-horizon regression without autoregressive generation.

The architecture employs three stacks with pooling kernel sizes of $[2, 2, 1]$ and frequency downsampling ratios of $[4, 2, 1]$. Each stack contains one block with two hidden layers of 256 units each, using the identity stack type. Training is conducted for 2000 gradient steps with a batch size of 128, a learning rate of $10^{-3}$, and Huber loss. The model is implemented using the NeuralForecast library [19].

The scaler_type='identity' setting is applied, meaning no normalization is performed on the input or output. This approach is a deliberate, physics-motivated decision. The absolute power conversion efficiency (PCE) level serves as the primary predictor of trajectory shape: for example, a device at 20% PCE degrading at −0.05 percentage points per hour exhibits a gradual decline over 120 h, whereas a device at 5% PCE degrading at the same absolute rate approaches zero. Per-series standardization, which maps both series to mean 0 and variance 1, eliminates this discriminative information. No auxiliary features, such as batch identity, mode labels, or exogenous covariates, are included; the raw 30 h curve at its natural scale is the only input.

4.2. Probabilistic NHITS: Quantile Regression

The architecture, scaler, and training procedure remain consistent with deterministic NHITS, except that the Huber loss is replaced by multi-quantile loss (MQLoss) at quantiles $[0.1, 0.25, 0.5, 0.75, 0.9]$ [14]. The model outputs five values per forecast time step, corresponding to each quantile. The 0.5 quantile provides the point prediction. The 0.1 to 0.9 quantile range directly defines an 80% prediction interval, eliminating the need for post hoc calibration, conformal wrappers, or stochastic sampling. The pinball loss function penalizes quantile miscalibration during training.

If P-NHITS achieves point accuracy comparable to deterministic NHITS while also providing calibrated uncertainty estimates, the rationale for employing a separate generative model for uncertainty quantification is undermined.
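For readers unfamiliar with the pinball loss, a textbook NumPy version is sketched below. It illustrates the loss only; the averaging over quantiles is an assumption, and the exact reduction used by the library's MQLoss may differ.

```python
import numpy as np

QUANTILES = [0.1, 0.25, 0.5, 0.75, 0.9]

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a single quantile level q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs 1 - q.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def multi_quantile_loss(y_true, y_pred_by_q, quantiles=QUANTILES):
    """Average pinball loss over quantile levels (assumed equal weighting).

    y_pred_by_q has shape (n_steps, n_quantiles), one column per quantile.
    """
    y_pred_by_q = np.asarray(y_pred_by_q, dtype=float)
    losses = [pinball_loss(y_true, y_pred_by_q[:, j], q)
              for j, q in enumerate(quantiles)]
    return float(np.mean(losses))

# At q = 0.9, predicting one unit too low costs 0.9; one unit too high costs 0.1.
loss_under = pinball_loss([1.0], [0.0], 0.9)
loss_over = pinball_loss([1.0], [2.0], 0.9)
```

The asymmetry is what pushes the 0.9 output above most observations and the 0.1 output below them, so the band between them targets 80% coverage without sampling.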

4.3. TimeDiff: Conditional Diffusion with Full Enhancements

The complete TimeDiff pipeline [5] is implemented with a CSDI backbone [4], augmented with mode conditioning and classifier-free guidance [8]. The intent is to give TimeDiff every advantage available in the literature: if it still underperforms, the result is conservative, and the gap would only widen under a more minimal configuration.

The backbone is a stack of 4 residual blocks, each containing time-axis and feature-axis Transformer encoder layers (8 attention heads, 64 channels, feedforward dimension 64, GELU activation). Each block receives the noisy two-channel input and side information through a gated activation mechanism (sigmoid gate × tanh filter), with skip connections summed and normalized across blocks. The input projection maps the 2-channel input (Channel 0: AR-conditioned observations; Channel 1: masked noisy target) to 64 channels. The output projection maps back to the prediction space via two 1 × 1 convolutions with ReLU.

The denoiser receives four components concatenated along the channel dimension:

A 128-dim sinusoidal positional encoding of timestamps (matching the CSDI reference implementation);
A 16-dim learned feature embedding ($K = 1$ for univariate);
A 16-dim learned mode embedding from causal slope-quartile labels (see below);
A 1-dim binary conditioning mask (1 for observed, 0 for target).

This 161-dim side information is richer than a minimal implementation and matches the architecture that produced the best results during development.

A 2-layer GRU with 64 hidden units is pretrained for 30 epochs on the training set to forecast hours 30–149 from hours 0–29 (learning rate $10^{-3}$, gradient clipping at norm 1.0). After pretraining, all AR parameters are frozen. The AR forecast fills Channel 0 in the target region, giving the denoiser a warm-start trajectory. This is the main TimeDiff contribution over CSDI.

Causal mode labels are computed from the conditioning window only (no future leakage): the linear slope over the first 30 h is assigned to quartile bins, yielding four mode classes. During training, mode labels are randomly replaced with a null token (dropout rate $p_{\mathrm{cfg}} = 0.15$). At inference, both conditional and unconditional noise estimates are computed, and the final estimate is interpolated:

$$\hat{\epsilon} = (1 + w)\,\hat{\epsilon}_{\mathrm{cond}} - w\,\hat{\epsilon}_{\mathrm{uncond}}, \tag{1}$$

with guidance weight $w = 1.0$.
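The two classifier-free guidance ingredients, label dropout during training and the guided noise estimate of Equation (1) at inference, can be sketched as follows. The null-token index and the function names are hypothetical; the arrays stand in for denoiser outputs.

```python
import numpy as np

NULL_MODE = 4  # hypothetical index for the null token (real modes are 0..3)

def drop_mode_labels(labels, p_cfg=0.15, rng=None):
    """Training-time label dropout: replace a random subset with the null token."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    labels[rng.random(labels.shape) < p_cfg] = NULL_MODE
    return labels

def guided_noise(eps_cond, eps_uncond, w=1.0):
    """Equation (1): (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * np.asarray(eps_cond) - w * np.asarray(eps_uncond)

# Stand-ins for the two denoiser outputs at one reverse step.
eps_c = np.array([0.2, -0.1])
eps_u = np.array([0.1, 0.1])
eps_hat = guided_noise(eps_c, eps_u, w=1.0)  # equals 2 * eps_c - eps_u
dropped = drop_mode_labels(np.zeros(10000, dtype=int), rng=np.random.default_rng(0))
```

With $w = 0$ the estimate reduces to the conditional prediction; larger $w$ moves it further from the unconditional prediction, at the cost of two denoiser evaluations per reverse step.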

Each device’s trajectory is divided by its conditioning-window mean before entering the diffusion pipeline. This normalization is necessary because the DDPM noise schedule adds unit-variance Gaussian noise: at raw PCE scale (∼14% on average), the noise is invisible at early diffusion steps, making noise estimation numerically intractable. The normalization maps all devices to ≈1.0 in the observed region. Predictions are denormalized by multiplying by the per-device scale after sampling. Note the asymmetry with NHITS: direct regression benefits from raw PCE (the absolute level is informative), whereas diffusion requires normalization (the noise schedule must match the signal amplitude). This is a structural property of the two architectures, not a confound.
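A minimal sketch of this per-device scaling is given below, assuming a positive conditioning-window mean; the function names and the synthetic trajectory are illustrative.

```python
import numpy as np

def normalize_by_conditioning_mean(trajectory, n_cond=30):
    """Divide a trajectory by the mean of its first n_cond hourly values."""
    trajectory = np.asarray(trajectory, dtype=float)
    scale = float(trajectory[:n_cond].mean())  # assumed to be positive
    return trajectory / scale, scale

def denormalize(values, scale):
    """Map model outputs back to absolute PCE% after sampling."""
    return np.asarray(values, dtype=float) * scale

# Synthetic device near 14% PCE with slow exponential decay.
hours = np.arange(150)
pce = 14.0 * np.exp(-0.002 * hours)
normalized, scale = normalize_by_conditioning_mean(pce)
restored = denormalize(normalized, scale)
```

The scale is computed from hours 0 to 29 only, so it is available at test time and introduces no future leakage.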

A quadratic beta schedule is used with $\beta_{1} = 10^{-4}$, $\beta_{T} = 0.5$, and $T = 100$ diffusion steps, following the convention $\beta_{t} = \left(\sqrt{\beta_{1}} + \frac{t - 1}{T - 1}\left(\sqrt{\beta_{T}} - \sqrt{\beta_{1}}\right)\right)^{2}$.
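The schedule formula translates directly into code. The following sketch uses only the values quoted in the text; the function name is an illustrative choice.

```python
import math


def quadratic_beta_schedule(beta_1=1e-4, beta_T=0.5, T=100):
    """Return [beta_1, ..., beta_T], linear in sqrt(beta) and hence quadratic in t."""
    lo, hi = math.sqrt(beta_1), math.sqrt(beta_T)
    return [(lo + (t - 1) / (T - 1) * (hi - lo)) ** 2 for t in range(1, T + 1)]


betas = quadratic_beta_schedule()

# Cumulative signal retention: the product of (1 - beta_t) over all steps.
alpha_bar_T = 1.0
for beta in betas:
    alpha_bar_T *= 1.0 - beta
```

The final cumulative product `alpha_bar_T` is very close to zero for these settings, so the last forward step is essentially pure noise, which is what the reverse process assumes as its starting point.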

An oversampled DataLoader (WeightedRandomSampler balancing mode frequencies) ensures equal representation across degradation modes. The diffusion parameters are optimized with Adam (learning rate $10^{-4}$, weight decay $10^{-6}$) for 200 epochs with MultiStepLR (milestones at 75% and 90% of total epochs, $\gamma = 0.1$). Exponential moving average (EMA) of model weights (decay = 0.999) is maintained and used during validation. Gradient clipping at norm 1.0 is applied throughout. Early stopping with patience 25 monitors validation loss evaluated every 5 epochs.
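Two of these training choices are easy to make concrete without a deep learning framework. The sketch below mirrors the semantics of a multi-step learning-rate schedule and of inverse-frequency sampling weights in plain Python; the helper names are hypothetical, and the actual pipeline presumably relies on the corresponding PyTorch classes.

```python
from collections import Counter


def multistep_lr(epoch, base_lr=1e-4, total_epochs=200,
                 fractions=(0.75, 0.90), gamma=0.1):
    """Learning rate at `epoch`: multiplied by gamma at each milestone reached."""
    milestones = [round(f * total_epochs) for f in fractions]  # epochs 150 and 180
    n_reached = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_reached


def balanced_sample_weights(mode_labels):
    """Per-sample weights inversely proportional to mode frequency.

    Every mode then carries the same total sampling weight, which is the
    usual way to balance classes with a weighted random sampler.
    """
    counts = Counter(mode_labels)
    return [1.0 / counts[m] for m in mode_labels]


# Toy labels: mode 0 is three times as frequent as mode 3.
weights = balanced_sample_weights([0, 0, 0, 3])
```

With these settings the learning rate is $10^{-4}$ up to epoch 149, $10^{-5}$ from epoch 150, and $10^{-6}$ from epoch 180.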

At inference, the full DDPM reverse process runs for 100 steps, generating 50 independent samples per device. The median of 50 samples is the point prediction; the 10th and 90th percentiles define the 80% prediction interval. A boundary-continuity correction is applied post hoc: ${\hat{y}}_{30:149} \leftarrow {\hat{y}}_{30:149} + (y_{29} - {\hat{y}}_{30})$, where $y_{29}$ is the last observed value. This shift uses only known information and is applied consistently to all 50 samples. The correction is needed because CSDI-style simultaneous generation produces all target time steps without an explicit continuity constraint at the conditioning boundary.
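The post-processing described above (boundary shift, then median and percentile bands) can be sketched as follows. The text does not fully specify whether the offset $y_{29} - {\hat{y}}_{30}$ is computed separately for each sample or once from a summary trajectory, so the sketch exposes both options through a hypothetical `shared_offset` flag; the percentile helper uses simple linear interpolation, which is also an assumption.

```python
from statistics import median


def percentile(values, q):
    """Linear-interpolation percentile of a list, with q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def boundary_correct(samples, y_last, shared_offset=False):
    """Shift forecast samples so they connect to the last observed value."""
    if shared_offset:
        # One offset for all samples, taken from the median first forecast step.
        offset = y_last - median(s[0] for s in samples)
        return [[v + offset for v in s] for s in samples]
    # Per-sample offsets: every sample starts exactly at y_last.
    return [[v + (y_last - s[0]) for v in s] for s in samples]


def summarize(samples):
    """Return (median, 10th percentile, 90th percentile) per forecast step."""
    columns = list(zip(*samples))
    point = [median(c) for c in columns]
    lower = [percentile(c, 10) for c in columns]
    upper = [percentile(c, 90) for c in columns]
    return point, lower, upper
```

Under the per-sample variant the interval width collapses to zero at the first forecast step, whereas a shared offset leaves the spread between samples unchanged; which behaviour applies depends on the implementation detail that the text leaves open.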

During training, a fraction of ground-truth future values ($p_{mixup} = 0.3$) are randomly mixed into Channel 0, replacing the AR forecast at those positions. The loss is masked at mix-up positions to prevent the model from learning a trivial copy. This acts as a form of teacher forcing for the diffusion denoiser.
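A minimal sketch of this mix-up step, assuming an independent Bernoulli draw per time step (the text does not state how positions are chosen) and hypothetical helper names:

```python
import random


def mix_future_into_channel0(ar_forecast, ground_truth, p_mixup=0.3, rng=None):
    """Replace AR forecast values with ground truth at random positions.

    Returns the mixed channel and a loss mask that is 0.0 wherever ground
    truth was injected, so those positions do not contribute to the loss.
    """
    rng = rng or random.Random()
    mixed, loss_mask = [], []
    for ar_value, true_value in zip(ar_forecast, ground_truth):
        use_truth = rng.random() < p_mixup
        mixed.append(true_value if use_truth else ar_value)
        loss_mask.append(0.0 if use_truth else 1.0)
    return mixed, loss_mask


def masked_mse(pred_noise, true_noise, loss_mask):
    """Mean squared error over the positions where the mask is 1.0."""
    total = sum(m * (p - t) ** 2
                for p, t, m in zip(pred_noise, true_noise, loss_mask))
    return total / max(sum(loss_mask), 1.0)
```

Masking matters because, at mixed-in positions, the conditioning channel already contains the answer; including them in the loss would reward copying rather than denoising.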

Figure 1 provides a schematic comparison of the three architectures, showing the progression from direct regression (a single forward pass) to the full TimeDiff pipeline (a multi-stage process with iterative sampling).

5. Results

5.1. Point Forecast Accuracy

Table 2 reports per-device RMSE (PCE%) for each seed and pooled across all three seeds. NHITS achieves a mean RMSE of 0.738, outperforming TimeDiff (0.863) by 17.0% ($p < 10^{-15}$, Wilcoxon signed-rank test). P-NHITS is within 0.8% of deterministic NHITS (0.744 vs. 0.738; $p = 0.061$, not significant), confirming that the multi-quantile loss incurs negligible point-accuracy cost.

The gap is stable across seeds: NHITS ranges from 0.715 to 0.756, while TimeDiff ranges from 0.825 to 0.912. This consistency rules out the possibility that the result is an artifact of a favorable data split. The cumulative RMSE distribution (Figure 2) provides further evidence: the NHITS and P-NHITS curves are nearly superimposed, while TimeDiff is shifted rightward across the entire distribution, not just in the tails. The 17% gap is systematic, not driven by outlier devices.

Why does this happen? TimeDiff generates 50 independent trajectories via 100-step DDPM reverse processes starting from Gaussian noise, then takes their median. This median is inherently noisier than a direct regression output; it is computed from a finite sample of a distribution that the denoiser only approximately models. For a process as smooth as single-exponential decay, the sampling machinery adds variance without a compensating benefit.

5.2. Per-Mode Analysis

Table 3 disaggregates RMSE by degradation mode, revealing where the accuracy gap originates. Figure 3 visualizes the full RMSE distributions as box plots.

Mode 0 (Initial Gain, $N = 440$)

All models achieve their lowest errors on this mode. The gain-then-plateau pattern is the smoothest and most predictable trajectory shape; NHITS captures the plateau timing well (RMSE 0.568). TimeDiff’s gap is 19%, modest for this easy mode.

Modes 1–2 (Exponential Decay, $N = 277 + 140$)

The largest gaps appear here. TimeDiff’s RMSE exceeds NHITS by 12% on Mode 1 (slow decay) and 26% on Mode 2 (medium decay). Precise decay-rate estimation is the core forecasting challenge for these modes, and it is where direct regression has a clear advantage: NHITS learns the mapping from 30 h slope and curvature to 120 h trajectory end-to-end, at the natural PCE scale. The diffusion denoiser must learn the same rate from per-device-normalized data, where inter-device rate differences are compressed. Mode conditioning and classifier-free guidance partially recover this lost signal, but not fully.

Mode 3 (Fast Decay, $N = 58$)

The gap narrows to 6%. When PCE approaches zero within 50–100 h, both models converge; there is little room for error when the trajectory is already near its asymptote. This mode also has the smallest sample size, making percentage comparisons less reliable.

The box plots in Figure 3 confirm that TimeDiff’s disadvantage lies in the central tendency, not just the tails: its median RMSE is consistently above NHITS across all four modes.

5.3. T80/T90 Lifetime Milestone Prediction

Table 4 and Table 5 report T90 and T80 prediction error disaggregated by degradation mode. These milestones matter directly for stability screening: a manufacturer asking whether a device will maintain >90% of its initial efficiency for 100 h needs T90, not RMSE.

Table 4 presents T90 prediction error limited to forecast-window crossings, defined as devices with actual T90 exceeding 30 h. Devices in Modes 2 and 3, as well as a significant portion in Modes 0 and 1, cross the 90% threshold within the conditioning window. Since T90 is calculated on the concatenated curve (observed hours 0–29 and predicted hours 30–149), all models reproduce these in-window crossings identically by design. Consequently, such events do not inform forecast quality and are excluded from both the table and aggregate metrics. For devices whose T90 occurs within the forecast horizon, initial-gain devices (Mode 0) are the most challenging to predict, with NHITS achieving an MAE of 16.9 h compared to TimeDiff’s 23.8 h. Slow-decay devices (Mode 1) exhibit a similar difference (15.1 versus 19.5 h). When aggregated across modes for forecast-window events only, NHITS achieves 16.2 h MAE, P-NHITS 16.9 h, and TimeDiff 22.5 h.
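The milestone computation on the concatenated curve can be sketched as below. The reference value is passed in explicitly because the exact reference (for example the maximum or initial PCE) is defined elsewhere in the paper; the function names and the treatment of hour 30 as the first forecast-window hour are assumptions of this sketch.

```python
def first_crossing_hour(observed, forecast, reference, fraction=0.90):
    """First hour at which PCE falls below fraction * reference, else None.

    The curve is the observed hours (0-29) followed by the predicted hours
    (30-149), so in-window crossings come from measured data only.
    """
    threshold = fraction * reference
    for hour, pce in enumerate(list(observed) + list(forecast)):
        if pce < threshold:
            return hour
    return None


def is_forecast_window_event(t_cross, window=30):
    """True when the crossing lies in the forecast horizon, not the observed window."""
    return t_cross is not None and t_cross >= window


def milestone_mae(actual, predicted):
    """MAE over devices for which both actual and predicted crossings exist."""
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if a is not None and p is not None]
    return sum(abs(a - p) for a, p in pairs) / len(pairs) if pairs else None
```

Because the observed segment is identical for every model, any crossing before hour 30 is returned identically regardless of the forecast, which is why such events are excluded from the comparison.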

For T80, the same forecast-only filtering is applied, as shown in Table 5. TimeDiff outperforms NHITS on Mode 0 (25.9 versus 38.9 h MAE), likely due to the per-device normalization implemented by TimeDiff, which may better capture the subtle long-horizon decline in initially-gaining devices. However, Mode 0 accounts for only 14 forecast-window T80 events for NHITS, rendering this comparison statistically fragile. NHITS performs better on Mode 1 (15.3 versus 20.6 h), which constitutes the majority of forecast-window T80 events. The overall forecast-only T80 MAE is as follows: P-NHITS 16.5 h (best), NHITS 16.9 h, and TimeDiff 21.0 h.

For devices with T90 occurring within the forecast horizon, NHITS’s 16.2 h MAE corresponds to approximately 11% timing uncertainty over a 150 h run, whereas TimeDiff’s 22.5 h (approximately 15%) indicates lower predictive confidence. The scatter plots in Figure 4 illustrate the error structure: a dense cluster near the origin demonstrates that all models accurately identify devices that never reach the threshold, while increased scatter in the 50–100 h range highlights the greater temporal impact of decay-rate errors.

A subtlety in Table 4 and Table 5 is that N varies across models within the same mode. For Mode 0 T90, TimeDiff predicts 301 threshold crossings versus NHITS’s 267, meaning TimeDiff erroneously predicts a steeper decline for approximately 34 initial-gain devices that do not actually reach T90. These are false positives: the model forecasts degradation that does not occur. Conversely, for Mode 1 T80, TimeDiff predicts fewer crossings (233 vs. 246 for NHITS), underestimating the decline for some slow-decay devices. Both patterns are consistent with per-device normalization compressing inter-device rate differences, which can lead to occasional mode misidentification at the trajectory level.

5.4. Uncertainty Quantification

Both P-NHITS and TimeDiff provide prediction intervals: P-NHITS via the 10th–90th quantile outputs, TimeDiff via the 10th–90th percentiles of its 50 samples. Table 6 quantifies their calibration, and Figure 5 compares them visually.

P-NHITS achieves 77.2% coverage with a mean band width of 1.78 PCE%, close to the target 80% and practically useful for screening decisions. The multi-quantile loss [14] penalizes miscalibration directly during training, and the resulting intervals adapt to device-specific uncertainty: devices with noisy or ambiguous conditioning windows receive wider bands, while devices with clean exponential signatures receive narrow ones.
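For reference, the two calibration metrics used here, empirical coverage and mean band width, can be computed as in the following sketch. The function name is illustrative, and pooling over all device-hour pairs is an assumption about how the aggregate is formed.

```python
def interval_coverage_and_width(y_true, lower, upper):
    """Fraction of targets inside [lower, upper] and the mean interval width."""
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    return sum(inside) / len(inside), sum(widths) / len(widths)


# Toy example: two of the four targets fall inside their intervals.
coverage, mean_width = interval_coverage_and_width(
    y_true=[1.0, 2.0, 3.0, 4.0],
    lower=[0.0, 0.0, 0.0, 5.0],
    upper=[2.0, 3.0, 2.5, 6.0],
)
```

A well-calibrated 80% interval should give a coverage near 0.80; narrower bands are only an advantage if coverage is maintained.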

TimeDiff tells a cautionary story about sample-based uncertainty. Its intervals are narrower (1.49 PCE%) but achieve only 62.8% coverage, far below the nominal 80%. The model is overconfident: the 50 DDPM samples cluster too tightly around the median, failing to capture the true predictive uncertainty. One contributing factor is the boundary-continuity correction applied post hoc: shifting all 50 samples by the same offset ($y_{29} - {\hat{y}}_{30}$) corrects a shared systematic bias while simultaneously reducing inter-sample variance near the conditioning boundary, where calibration matters most. The overconfidence also arises because the denoiser learns a relatively peaked conditional distribution (the trajectories are near-deterministic), and 50 reverse-process trajectories do not spread enough to cover the actual variability. Increasing the sample count would widen the intervals, but at a substantial computational cost.

The implication is clear: for this application, quantile regression provides more honest uncertainty estimates than diffusion sampling. P-NHITS learns interval width directly as a function of the input, without the indirection of learning a noise-to-data mapping and then computing empirical quantiles. For unimodal predictive distributions [1], this direct approach is both simpler and better calibrated.

5.5. Representative Trajectories

Figure 6 presents a 4 × 3 grid of representative forecast trajectories: four degradation modes × three difficulty levels (P5 best-fitting, P50 typical, P75 challenging), selected from the last seed’s test set.

NHITS tracks all modes well across the full difficulty spectrum. Even P75 (challenging) devices show reasonable trajectory matching, with errors concentrated in the timing of inflection points rather than gross level mismatches. The direct regression output is smooth by construction, reflecting the low-frequency dominance of the hierarchical decomposition.

TimeDiff follows the general trend but shows three characteristic behaviors: (a) systematic level offsets on slow and medium decay modes, despite boundary-continuity correction; (b) noisier predictions, because the median of 50 stochastic samples is inherently less smooth than a single deterministic output; and (c) occasional poor mode identification when the 30 h conditioning window is ambiguous about whether the trajectory will plateau or decay.

For fast-decay P5 devices, both models are indistinguishable from the actual trajectory; the PCE is near zero for most of the forecast window. This trivial case provides little discriminative information, which is why Mode 3 shows the smallest accuracy gap.

5.6. Input-Window Sensitivity

To test whether the 30 h conditioning window is critical to the results, we retrained all three models with 20 h and 40 h input windows, keeping all other hyperparameters fixed (same seeds, same stratified 70/15/15 split based on conditioning-window slopes, same training configuration). Table 7 reports the pooled mean RMSE across three seeds.

NHITS performance improves monotonically as the conditioning window increases (from 0.976 to 0.735 to 0.642 PCE%, representing a 34% reduction from 20 h to 40 h), confirming that additional input data yields a more accurate estimate of the device-specific decay rate. P-NHITS demonstrates a similar pattern of improvement. NHITS achieves lower RMSE than TimeDiff at all three windows, with Wilcoxon p-values below $10^{-13}$ in every case. The 30 h window does not represent a selectively chosen operating point; rather, the central conclusion of this study is independent of the window size.

TimeDiff exhibits a qualitatively different pattern. While its mean RMSE improves from 20 h to 30 h (1.185 to 0.888), the 40 h result (1.464) is worse than the 20 h result, driven by high cross-seed variance (coefficient of variation: 57.6%, versus 3.7% for NHITS at the same window). A diagnostic decomposition of the worst 40 h seed (seed 123) traced the aggregate error to a single device with a mean PCE of 1.7%: the DDPM reverse process generated 50 samples spanning −966 to +6235 PCE% at the forecast boundary, producing a per-device RMSE of 175 PCE% that inflated the aggregate. This unbounded behavior is a structural property of stochastic DDPM sampling, which has no output constraint, and cannot occur in NHITS, which produces a single deterministic forward pass in absolute PCE space. A more detailed analysis is provided in Section 6.5.3.

6. Discussion

6.1. Why Degradation Physics Determines Model Selection

The observed 17% performance gap between NHITS and TimeDiff does not indicate a failure in the diffusion implementation. Rather, TimeDiff was provided with all architectural advantages documented in the literature. This gap arises from applying a high-capacity generative framework to a low-complexity physical process.

The Root Cause: PSC Degradation Is Low-Dimensional

Under the controlled conditions of the Hartono dataset [1], PSC MPPT degradation trajectories are accurately described by a limited set of parameters: an initial transient, governed by ion redistribution and light soaking [11] and typically completed by hour 20 to 30, followed by an approximately exponential decay defined by a rate constant and an asymptotic steady-state level. SOM analysis corroborates this finding: all four degradation modes exhibit the same exponential functional form, and the effective number of degrees of freedom in the forecast is approximately two (a rate constant and an asymptotic level), even though the forecast is represented as 120 hourly values.

Given the simplicity of this process, the forecasting task reduces to parameter estimation: with 30 h of observation, the objective is to estimate the decay rate and steady-state level, then extrapolate. NHITS [13] addresses this implicitly through hierarchical pooling. The low-frequency stack captures the overall decay trend, the mid-frequency stack models the transition from transient to steady state, and the high-frequency stack accounts for residual fluctuations. The mapping from input to output is deterministic and learned in an end-to-end manner.

Why Diffusion Adds Overhead Without Payoff

TimeDiff [5] is required to solve a more complex problem: it must learn to reverse a stochastic diffusion process conditioned on partial observations so that the denoised output statistically matches the true data distribution. For a distribution that is essentially a two-parameter family (rate and level) with added Gaussian measurement noise, this approach is unnecessarily complex. The denoiser, a four-layer attention network with 161-dimensional side information, effectively learns an exponential decay with noise. The DDPM reverse process [7] generates 50 independent trajectories, each following a distinct stochastic path, and computes their median. This median is inherently noisier than a direct regression estimate because it is derived from a finite sample of a distribution that the denoiser only approximates.

The Normalization Asymmetry Is a Symptom, Not a Cause

A consistent observation during development was that NHITS benefits from identity scaling (raw PCE), whereas TimeDiff requires per-device normalization. This difference is sometimes interpreted as a confounding factor, suggesting that varying preprocessing renders the comparison unfair. However, the more accurate interpretation is that this difference exposes a structural limitation of diffusion models for this application. NHITS leverages the absolute PCE level, the most informative feature, because direct regression does not constrain input or output scales. In contrast, diffusion models require normalized data since the noise schedule must be calibrated to the signal amplitude. At the raw PCE scale (approximately 14% on average), the added noise is undetectable during early diffusion steps, making the noise-estimation loss uninformative. The normalization requirement compresses the signal that facilitates the task. Mode conditioning and classifier-free guidance [8] serve as compensatory mechanisms that partially, but not fully, recover this information.

A potential question is whether a more sophisticated normalization scheme could address this limitation, for example, by normalizing only at the diffusion input and applying a learned inverse transformation after sampling. However, the per-device normalization implemented in this study, which involves dividing by the conditioning-window mean and multiplying back after sampling, already constitutes an invertible parameterized transformation. The primary challenge is not invertibility but information loss. Once inter-device amplitude differences are compressed, the denoiser must infer rate information solely from the normalized signal shape, and mode conditioning can only partially address this issue. A learned denormalization would encounter the same fundamental limitation: the distinguishing feature between a 20% device and a 5% device is the absolute scale, which normalization eliminates.

6.2. The Probabilistic NHITS Result in Context

Transitioning from deterministic to probabilistic NHITS incurs minimal additional cost: 0.738 → 0.744 (+0.8%). This finding represents a significant practical contribution. A single architecture addresses both point prediction and uncertainty quantification, achieving 77% coverage on nominal 80% intervals, compared to TimeDiff’s 63% coverage despite 50 DDPM samples. This approach removes the need for a two-model pipeline, in which a deterministic model is trained for accuracy and a diffusion model for uncertainty, as suggested by much of the recent literature.

This result is supported by a straightforward theoretical explanation. For a unimodal, continuous predictive distribution, as produced by PSC degradation under controlled conditions, the conditional quantiles are smooth functions of the conditioning variable. The pinball loss [14] serves as a proper scoring rule, penalizing each quantile separately for miscalibration. The primary additional requirement compared to point regression is that the model must learn the width as a function of input: devices with a more uncertain future should receive wider intervals than those with a more predictable one. In the case of PSC degradation, this width function is smooth and low-dimensional; devices with noisier conditioning windows and ambiguous slope or curvature exhibit wider predictive intervals. An MLP can effectively represent this mapping.
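The pinball loss referred to above has a compact definition, sketched here in plain Python. The quantile levels (0.1, 0.5, 0.9) are an assumption chosen to match the 10th to 90th percentile interval discussed in the text; the levels actually used for training P-NHITS are specified elsewhere in the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


def multi_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss over several quantile outputs.

    preds_by_quantile maps a quantile level to its list of predictions,
    e.g. {0.1: [...], 0.5: [...], 0.9: [...]}.
    """
    losses = [pinball_loss(y_true, preds, q)
              for q, preds in preds_by_quantile.items()]
    return sum(losses) / len(losses)
```

For `q = 0.9`, predicting too low costs nine times as much as predicting too high by the same amount, which pushes that output toward the upper edge of the predictive distribution.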

6.3. Connecting to DiffBatt and the Broader Generative-AI Debate

These results complement DiffBatt [6] rather than contradicting it, as they establish a boundary condition. DiffBatt addresses lithium-ion battery degradation, including cycling-dependent capacity fade, nonlinear capacity knees, calendar aging, and path-dependent degradation [9]. These phenomena generate genuinely multimodal futures, and exploring diverse trajectories through DDPM sampling provides information that a single regression output cannot capture.

PSC MPPT degradation under controlled conditions does not exhibit these characteristics. After the initial burn-in period, trajectories are monotonic, smooth, and governed by a single dominant mechanism [10]. For example, a device at 18% PCE declining at −0.05%/h will reach approximately 12% at hour 150, with a noise margin of ±1%. There is no capacity knee, cycling bifurcation, or latent secondary mechanism.

A key practical lesson is to characterize the system dynamics before selecting a modeling approach. When the effective dimensionality is low, as indicated by smooth trajectories, unimodal futures, and dynamics governed by a small number of parameters, direct regression is sufficient and preferable. In contrast, when the dynamics are high-dimensional or multimodal, generative models may justify their additional complexity. A concrete heuristic is to fit representative trajectories with simple parametric forms, such as single- or double-exponential decay, and assess the residual complexity before adopting a generative architecture.

6.4. Practical Recommendations

The following recommendations are provided for PSC researchers:

For point predictions: NHITS with identity scaler is recommended. Training requires approximately 5 min on a T4 GPU. Inference is immediate, requiring only a single forward pass per device without iterative sampling. An RMSE of 0.738 PCE% over 120 h corresponds to an average prediction error below 1 percentage point, which is within measurement uncertainty for many device architectures.
For uncertainty-aware screening: P-NHITS with multi-quantile loss is recommended. Training time matches that of deterministic NHITS. The 10th to 90th quantile interval (77% empirical coverage) directly addresses practical questions, such as whether a device is likely to maintain greater than 15% PCE at hour 100. This information supports go/no-go decisions in screening workflows. Coverage can be increased toward the nominal 80% using conformal calibration if required.
A concrete screening decision rule using these intervals is as follows. Let ${\hat{q}}_{0.9} (t)$ and ${\hat{q}}_{0.1} (t)$ denote the 90th and 10th percentile forecasts at hour t, and let $θ = 0.90 \times {PCE}_{max}$ be the T90 stability threshold. For a target assessment time $t^{*}$ (e.g., hour 100):
– Accept if ${\hat{q}}_{0.1} (t^{*}) \geq θ$ (even the pessimistic bound stays above the threshold);
– Reject if ${\hat{q}}_{0.9} (t^{*}) < θ$ (even the optimistic bound falls below the threshold);
– Continue testing otherwise (the interval straddles the threshold, indicating insufficient certainty for a decision).
This three-outcome rule uses the 80% prediction interval directly as a decision boundary. In practice, the quantile levels should be selected based on the application’s tolerance for incorrect acceptance versus incorrect rejection, noting that the empirical coverage of the 80% interval is 77% (Section 5.4).
For T80/T90 estimation: When restricted to devices whose T90 falls in the forecast window (after hour 30), both NHITS variants achieve 16.2 to 16.9 h MAE at T90. In-window crossings (those occurring before hour 30) are identified exactly from the observed data. For devices requiring genuine forecasting, the model predicts within approximately ±16 h when a device will fall below 90% of its initial efficiency. When combined with a 30 h observation window, this still enables meaningful screening decisions with 80% time savings, though the timing precision for late-crossing devices is coarser than for in-window events.
When to consider diffusion: Two scenarios justify generative modeling for PSC degradation: (a) if per-device metadata (absorber composition, architecture, aging temperature) becomes available and reveals distinct degradation pathways that create multimodal futures, and (b) if synthetic trajectory generation for data augmentation is needed to bootstrap models for new fabrications with limited real data.
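The three-outcome screening rule given in the recommendations above can be sketched as a small function. The function name, argument names, and return labels are illustrative; the logic follows the accept, reject, and continue conditions exactly as stated.

```python
def screening_decision(q10_at_t, q90_at_t, pce_max, fraction=0.90):
    """Three-outcome screening rule at the assessment time t*.

    q10_at_t, q90_at_t: 10th and 90th percentile forecasts at t*.
    pce_max: reference PCE used to define the threshold theta = fraction * pce_max.
    """
    theta = fraction * pce_max
    if q10_at_t >= theta:
        return "accept"            # even the pessimistic bound stays above theta
    if q90_at_t < theta:
        return "reject"            # even the optimistic bound falls below theta
    return "continue testing"      # the interval straddles theta


# Example with a hypothetical device whose maximum PCE is 20% (theta = 18%).
decision = screening_decision(q10_at_t=17.0, q90_at_t=19.0, pce_max=20.0)
```

Choosing other quantile levels than 0.1 and 0.9 shifts the balance between incorrect acceptance and incorrect rejection, as noted in the text.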

6.5. Limitations and Future Work

6.5.1. Single Dataset

The Hartono et al. dataset is the only large-scale, homogeneous, open-access PSC MPPT time-series dataset [1]. While it spans diverse fabrication conditions (66 batches, multiple absorbers and interlayers), the controlled aging environment (N₂, fixed illumination) removes the outdoor stressors (humidity, thermal cycling, UV) that would add complexity to degradation dynamics. These results may not generalize to field-aged devices, where degradation could be more complex and multimodal, precisely the conditions where diffusion might prove advantageous.

6.5.2. No Per-Device Metadata

Including architecture type, absorber composition, or aging temperature as covariates could benefit both models. For NHITS, these could serve as static exogenous features; for TimeDiff, they would enrich the conditioning signal beyond the current mode embedding. The absence of metadata in the public release of the dataset means both models operate purely from the time-series signal. While this ensures a fair comparison since both models receive identical information, it limits generalizability to scenarios in which metadata-informed conditioning could differentiate degradation pathways.

6.5.3. Window Sensitivity and Stability

The ablation in Section 5.6 confirms that the choice of 30 h is not critical. NHITS outperforms TimeDiff at all three windows tested (20, 30, and 40 h), with Wilcoxon $p < 10^{-13}$ in every case. NHITS accuracy improves monotonically with window length (0.976 → 0.735 → 0.642 PCE%), as expected for a direct regression model whose performance depends on how well the conditioning window constrains the decay rate.

The 40 h TimeDiff result exposes a failure mode absent from direct regression. The mean RMSE (1.464 PCE%) is worse than the 20 h result (1.185), and the cross-seed coefficient of variation rises to 57.6% (versus 3.7% for NHITS). A pipeline decomposition of the worst seed traced the error to the DDPM sampling stage: for a single low-efficiency device (mean PCE of 1.7%), the 50 reverse-process samples spanned −966 to +6235 PCE% at the forecast boundary, producing a per-device RMSE of 175 PCE%. The model’s validation loss for this run was within the normal range (0.020 in normalized space), confirming that the denoiser converged normally. The failure occurs downstream, in the stochastic sampling process, where the learned posterior for an underrepresented device type generates unbounded outputs. NHITS, operating as a single deterministic forward pass with no normalization, cannot produce such outputs. This instability reinforces the framework-level analysis in Section 6.1: the DDPM reverse process adds variance without compensating benefit when the target is near-deterministic.

6.5.4. Denoiser Architecture

Our TimeDiff implementation faithfully follows the CSDI/Shen–Kwok attention-based architecture. DiffBatt [6] uses a U-Net denoiser rather than attention-based residual blocks. A direct comparison using DiffBatt’s architecture on the Hartono dataset would strengthen the generalizability of these conclusions. However, the three root causes of TimeDiff’s underperformance identified in Section 6.1 are properties of the DDPM framework, not the denoiser backbone. Per-device normalization is required because the DDPM forward process adds unit-variance Gaussian noise, making noise estimation intractable at raw PCE scale; this compresses the absolute PCE level regardless of whether the denoiser is a U-Net or a Transformer. The stochastic noise injected at each reverse step is independent of the denoiser architecture, so the inter-sample variance from generating multiple trajectories and taking their median is a property of the sampling process, not the network. For a near-deterministic process such as single-exponential decay, this sampling machinery adds variance without any compensating benefit, irrespective of the denoiser used. Furthermore, adapting DiffBatt to PSC data requires replacing its capacity matrix, which is a structured representation of charge–discharge cycling data constructed from the first 100 cycles [6], with a conditioning signal derived from the 30 h MPPT curve. TimeDiff already conditions on this signal through the AR warm-start in Channel 0 and 161-dimensional side information, including mode embeddings. While generation-strategy differences exist between DiffBatt (full-curve generation with sample selection) and TimeDiff (partial-curve generation via CSDI masking), both share the same DDPM forward and reverse processes and, therefore, the same framework-level bottlenecks. 
A direct empirical comparison using DiffBatt’s U-Net architecture on the Hartono dataset would probe the denoiser backbone in isolation; such a comparison is planned for future work.

6.5.5. Beyond 150 h

Commercial PSC stability targets require thousands of hours of demonstrated operation. Extending the forecast horizon or cascading 150 h predictions is an important direction. The predominantly single-exponential dynamics suggest that long-range extrapolation may be feasible, but this remains untested. The Hartono dataset provides no trajectories beyond approximately 150 h, so whether the exponential decay rate estimated from the first 150 h remains valid over longer time scales cannot be verified from these data. Secondary degradation mechanisms that are negligible over 150 h under nitrogen (e.g., slow-phase instabilities, progressive contact degradation) could alter the long-term trajectory, and field conditions would introduce additional stressors (humidity, thermal cycling, UV) that are absent from this dataset. Validating or refuting this assumption requires datasets with multi-thousand-hour MPPT records under controlled conditions.

7. Conclusions

This paper asked whether the degradation dynamics of PSCs under controlled aging justify the use of conditional diffusion models for trajectory forecasting. The benchmark on 2245 devices shows that they do not. PSC MPPT degradation is predominantly single-exponential: a low-dimensional, unimodal process whose device-specific decay rate is identifiable from the first 30 h of operation. For this class of dynamics, NHITS, a hierarchical MLP with direct multi-horizon regression and no normalization, achieves RMSE 0.738 PCE%, outperforming the full TimeDiff pipeline (CSDI backbone, AR initialization, mode conditioning, classifier-free guidance) by 17% ($p < 10^{-15}$). An input-window ablation (20, 30, and 40 h) confirms that this ranking is window-independent ($p < 10^{-13}$ at all windows). P-NHITS matches this accuracy (0.744) while providing 77%-coverage prediction intervals, better calibrated than TimeDiff’s 63% from 50 DDPM samples, and removes the need for a separate generative model.

The broader finding is simple: match the model to the physics. When degradation dynamics are smooth and governed by a small number of parameters, direct regression is sufficient. Generative models earn their overhead when the underlying process is high-dimensional, multimodal, or path-dependent. PSC MPPT degradation under controlled aging is none of these.

Author Contributions

Conceptualization, K.C. and H.N.N.; methodology, K.C.; software, K.C.; validation, K.C. and H.N.N.; formal analysis, K.C. and H.N.N.; writing—original draft preparation, K.C.; writing—review and editing, K.C. and H.N.N.; visualization, K.C. and H.N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Perovskite Solar Cells Ageing Dataset is publicly available at https://zenodo.org/records/8185883, accessed on 17 May 2026. The model architectures, hyperparameters, and training procedures are described in Section 4, and the preprocessing pipeline is specified in Section 3.2. The key software dependencies and their versions are: neuralforecast 3.1.4, minisom, PyTorch (with CUDA), scipy, numpy, pandas, matplotlib, and scikit-learn, all running on Google Colab with Python 3.12.

Acknowledgments

The authors thank Hartono et al. for making the Perovskite Solar Cells Ageing Dataset publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Hartono, N.T.P.; Köbler, H.; Graniero, P.; Khenkin, M.; Schlatmann, R.; Ulbrich, C.; Abate, A. Stability follows efficiency based on the analysis of a large perovskite solar cells ageing dataset. Nat. Commun. 2023, 14, 4869.
Khenkin, M.V.; Katz, E.A.; Abate, A.; Bardizza, G.; Berry, J.J.; Brabec, C.; Brunetti, F.; Bulović, V.; Burlingame, Q.; Di Carlo, A.; et al. Consensus statement for stability assessment and reporting for perovskite photovoltaics based on ISOS procedures. Nat. Energy 2020, 5, 35–49.
Köbler, H.; Neubert, S.; Jankovec, M.; Glažar, B.; Haase, M.; Hilbert, C.; Topič, M.; Rech, B.; Abate, A. High-Throughput Aging System for Parallel Maximum Power Point Tracking of Perovskite Solar Cells. Energy Technol. 2022, 10, 2200234.
Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 6–14 December 2021; pp. 24804–24816.
Shen, L.; Kwok, J. Non-autoregressive Conditional Diffusion Models for Time Series Prediction. In Proceedings of the 40th International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 31016–31029.
Eivazi, H.; Hebenbrock, A.; Ginster, R.; Blömeke, S.; Wittek, S.; Herrmann, C.; Spengler, T.S.; Turek, T.; Rausch, A. DiffBatt: A Diffusion Model for Battery Degradation Prediction and Synthesis. arXiv 2024, arXiv:2410.23893.
Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239.
Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. arXiv 2022, arXiv:2207.12598.
Severson, K.A.; Attia, P.M.; Jin, N.; Perkins, N.; Jiang, B.; Yang, Z.; Chen, M.H.; Aykol, M.; Herring, P.K.; Fraggedakis, D.; et al. Data-driven prediction of battery cycle life before capacity degradation. Nat. Energy 2019, 4, 383–391.
Domanski, K.; Alharbi, E.A.; Hagfeldt, A.; Grätzel, M.; Tress, W. Systematic investigation of the impact of operation conditions on the degradation behaviour of perovskite solar cells. Nat. Energy 2018, 3, 61–67.
Di Girolamo, D.; Phung, N.; Kosasih, F.U.; Di Giacomo, F.; Matteocci, F.; Smith, J.A.; Flatken, M.A.; Köbler, H.; Turren Cruz, S.H.; Mattoni, A.; et al. Ion Migration-Induced Amorphization and Phase Segregation as a Degradation Mechanism in Planar Perovskite Solar Cells. Adv. Energy Mater. 2020, 10, 2000310.
Kohonen, T. Self-Organizing Maps; Springer Series in Information Sciences; Springer: Berlin/Heidelberg, Germany, 2001; Volume 30.
Challu, C.; Olivares, K.G.; Oreshkin, B.N.; Garza Ramirez, F.; Mergenthaler Canseco, M.; Dubrawski, A. NHITS: Neural Hierarchical Interpolation for Time Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2023, 37, 6989–6997.
Koenker, R.; Bassett, G. Regression Quantiles. Econometrica 1978, 46, 33.
Jacobsson, T.J.; Hultqvist, A.; García-Fernández, A.; Anand, A.; Al-Ashouri, A.; Hagfeldt, A.; Crovetto, A.; Abate, A.; Ricciardulli, A.G.; Vijayan, A.; et al. An open-access database and analysis tool for perovskite solar cells based on the FAIR data principles. Nat. Energy 2021, 7, 107–115.
Graniero, P.; Khenkin, M.; Köbler, H.; Hartono, N.T.P.; Schlatmann, R.; Abate, A.; Unger, E.; Jacobsson, T.J.; Ulbrich, C. The challenge of studying perovskite solar cells’ stability with machine learning. Front. Energy Res. 2023, 11, 1118654.
Zhang, Z.; Wang, H.; Jacobsson, T.J.; Luo, J. Big data driven perovskite solar cell stability analysis. Nat. Commun. 2022, 13, 7639.
Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting. In Proceedings of the Eighth International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
Olivares, K.G.; Challú, C.; Garza, A.; Canseco, M.M.; Dubrawski, A. NeuralForecast: User Friendly State-of-the-Art Neural Forecasting Models; PyCon: Salt Lake City, UT, USA, 2022.
Rasul, K.; Seward, C.; Schuster, I.; Vollgraf, R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8857–8868.
Akima, H. A New Method of Interpolation and Smooth Curve Fitting Based on Local Procedures. J. ACM 1970, 17, 589–602.
Saliba, M.; Stolterfoht, M.; Wolff, C.M.; Neher, D.; Abate, A. Measuring Aging Stability of Perovskite Solar Cells. Joule 2018, 2, 1019–1024.

Figure 1. Model architectures compared in this study: (a) NHITS, (b) P-NHITS, and (c) TimeDiff.

Figure 2. Cumulative distribution of per-device RMSE across all seeds. NHITS and P-NHITS are nearly superimposed; TimeDiff is consistently shifted rightward, indicating a systematic accuracy disadvantage rather than isolated failures.

Figure 3. Per-device RMSE distributions by degradation mode for seed 456. Table 3 reports means pooled across all three seeds; the box plots show single-seed distributions for visual clarity. TimeDiff’s disadvantage is visible in the median, not just the tails, across all modes.

Figure 4. T90 (top row) and T80 (bottom row) scatter plots: predicted vs. actual milestone time for each model. The dense cluster near the origin represents devices that never reach the threshold within the forecast horizon, correctly identified by all models. Scatter increases for devices reaching the threshold in the 50–100 h range.

Figure 5. Prediction intervals on representative devices across four degradation modes (rows) and three difficulty levels (columns: P5 = best-fitting, P50 = typical, P75 = challenging). Orange bands: P-NHITS 10th–90th quantile output. Pink bands: TimeDiff 10th–90th percentile of 50 samples. Black line: actual trajectory. P-NHITS bands adapt to device-specific uncertainty; TimeDiff bands are narrower but severely under-cover.

Figure 6. Representative forecast MPPT degradation trajectories across four degradation modes and three difficulty levels (P5 = best-fitting, P50 = typical, P75 = challenging based on NHITS RMSE). Black: actual MPPT degradation trajectory. Blue: NHITS forecast. Red: TimeDiff forecast (median of 50 samples). The vertical dashed line marks the end of the 30 h conditioning window.

Table 1. Model hyperparameters.

Hyperparameter	NHITS/P-NHITS	TimeDiff
Input/Horizon	30/120 h	30/120 h
Architecture	3 stacks, MLP	4 residual blocks, Transformer
Hidden units/Channels	$3 \times [256, 256]$	64 channels, 8 heads
Pool kernels	$[2, 2, 1]$	—
Freq. downsample	$[4, 2, 1]$	—
Scaler	Identity	Per-device (cond. mean)
Loss	Huber/MQLoss	$\epsilon$-prediction MSE
Training steps/epochs	2000 steps	30 AR + 200 diffusion epochs
Batch size	128	32 (oversampled)
Learning rate	$10^{-3}$	$10^{-3}$ (AR) / $10^{-4}$ (diff.)
Optimizer	Adam	Adam ($λ = 10^{-6}$)
Scheduler	—	MultiStepLR ( $γ = 0.1$ )
EMA	—	$0.999$ decay
Early stopping	—	Patience 25
Grad. clip	—	1.0
Side info dim	—	161
Diffusion steps	—	$T = 100$
Inference samples	1	50 (median)
CFG dropout/weight	—	0.15/1.0
Mix-up rate	—	0.3
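
The two regression losses named in the Loss row can be written out in a few lines. The sketch below is a plain NumPy illustration under stated assumptions, not the neuralforecast implementation: the Huber threshold `delta = 1.0`, the averaging over quantile levels in `multi_quantile_loss`, and the function names themselves are choices made for the example.

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it (delta is an assumed value)."""
    err = np.abs(np.asarray(y) - np.asarray(y_hat))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def pinball_loss(y, y_hat, q):
    """Quantile (pinball) loss of Koenker and Bassett for quantile level q in (0, 1)."""
    diff = np.asarray(y) - np.asarray(y_hat)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def multi_quantile_loss(y, y_hat_by_q):
    """Average the pinball loss over several quantile levels (assumed aggregation)."""
    return float(np.mean([pinball_loss(y, y_hat, q) for q, y_hat in y_hat_by_q.items()]))

# Toy check: under-predicting costs more than over-predicting at the 0.9 quantile.
under = pinball_loss([1.0], [0.0], q=0.9)   # forecast below the observation
over = pinball_loss([0.0], [1.0], q=0.9)    # forecast above the observation
print(under, over)
```

The asymmetry of the pinball loss is what lets a single network output several quantiles at once, so an 80% band can be read off from the 0.1 and 0.9 outputs without any sampling.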

Table 2. Point forecast RMSE (PCE%) per seed and pooled mean across 3 random seeds. Lower is better. Best mean in bold. NHITS outperforms TimeDiff by 17.0% ($p < 10^{-15}$, Wilcoxon signed-rank).

Seed	NHITS	P-NHITS	TimeDiff
42	0.756	0.763	0.912
123	0.741	0.763	0.825
456	0.715	0.706	0.853
Mean	0.738	0.744	0.863
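
The relative gap and the paired significance test reported with this table can be sketched as follows. The pooled means are taken from the table; the per-device RMSE arrays are synthetic stand-ins generated for the example (the study's per-device errors are not reproduced here), and the two-sided alternative is an assumption, since the caption does not state which alternative was used.

```python
import numpy as np
from scipy.stats import wilcoxon

# Pooled mean RMSE (PCE%) from the Mean row of Table 2.
mean_rmse = {"NHITS": 0.738, "P-NHITS": 0.744, "TimeDiff": 0.863}
rel_gap = (mean_rmse["TimeDiff"] - mean_rmse["NHITS"]) / mean_rmse["NHITS"]
print(f"TimeDiff RMSE is {100 * rel_gap:.1f}% higher than NHITS")

# Synthetic paired per-device RMSEs (hypothetical, NOT the study's data).
rng = np.random.default_rng(1)
rmse_nhits = rng.gamma(shape=2.0, scale=0.35, size=300)
rmse_timediff = rmse_nhits * rng.lognormal(mean=0.15, sigma=0.2, size=300)

# Two-sided Wilcoxon signed-rank test on the paired differences.
result = wilcoxon(rmse_nhits, rmse_timediff)
print(f"statistic = {result.statistic:.1f}, p = {result.pvalue:.2e}")
```

With the rounded table values the gap evaluates to roughly 17%. A paired test is the natural choice here because every model is scored on the same test devices, so device-to-device difficulty cancels in the differences.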

Table 3. Per-mode RMSE (PCE%) pooled across 3 seeds. N is the total number of test devices across all seeds. Best per mode in bold.

Mode	Description	NHITS	P-NHITS	TimeDiff	N
0	Initial gain	0.568	0.584	0.678	440
1	Slow decay	0.885	0.866	0.993	277
2	Medium decay	0.953	0.983	1.197	140
3	Fast decay	0.798	0.800	0.847	58
	All	0.738	0.744	0.863	915
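
As a consistency check, the All row can be recovered approximately from the per-mode rows. The sketch below assumes that the pooled value is the device-weighted mean of the per-mode means, which the caption does not state explicitly; small discrepancies are expected because the table entries are rounded to three decimals.

```python
# Per-mode rows of Table 3: (N, NHITS, P-NHITS, TimeDiff), RMSE in PCE%.
modes = {
    "Initial gain": (440, 0.568, 0.584, 0.678),
    "Slow decay":   (277, 0.885, 0.866, 0.993),
    "Medium decay": (140, 0.953, 0.983, 1.197),
    "Fast decay":   (58,  0.798, 0.800, 0.847),
}
models = ("NHITS", "P-NHITS", "TimeDiff")

total_n = sum(row[0] for row in modes.values())

# Device-weighted mean of the per-mode means (assumed pooling rule).
pooled = {
    name: sum(row[0] * row[i + 1] for row in modes.values()) / total_n
    for i, name in enumerate(models)
}

for name in models:
    print(f"{name}: {pooled[name]:.3f}")
```

Under this assumption the weighted means agree with the reported All row to within rounding, and the total device count matches the stated N.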

Table 4. T90 prediction error by degradation mode, restricted to forecast-window crossings (actual T90 $> 30$ h). Devices whose T90 falls within the 30 h conditioning window are excluded because all models reproduce the observed crossing by construction. N is the number of qualifying device–seed pairs per model. MAE and RMSE in hours. Best MAE per mode in bold.

Mode	Model	N	MAE (h)	RMSE (h)
Initial gain	NHITS	92	16.9	24.8
	P-NHITS	88	18.2	25.8
	TimeDiff	126	23.8	31.6
Slow exp. decay	NHITS	53	15.1	25.8
	P-NHITS	53	14.7	24.8
	TimeDiff	52	19.5	28.4
Forecast-only	NHITS	145	16.2	25.2
	P-NHITS	141	16.9	25.4
	TimeDiff	178	22.5	30.7

Note: Modes 2 (medium decay) and 3 (fast decay) are excluded entirely: all 140 and 58 device–seed pairs, respectively, cross the T90 threshold within the conditioning window ($\leq 30$ h), yielding MAE = 0 for all models by construction.
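
The milestone extraction and the forecast-window restriction used in this table can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper `first_crossing_time` is hypothetical, the reference value is taken to be the initial PCE, the trajectories are synthetic exponentials, and pairs for which a trajectory never crosses the threshold are simply skipped.

```python
import numpy as np

def first_crossing_time(t_hours, pce, reference, fraction=0.9):
    """Return the first time at which pce < fraction * reference, or None."""
    below = np.flatnonzero(np.asarray(pce) < fraction * reference)
    return float(t_hours[below[0]]) if below.size else None

# Synthetic actual and predicted trajectories (hypothetical values).
t = np.arange(150.0)                  # hourly grid: 30 h conditioning + 120 h horizon
reference = 15.0                      # assumed reference: initial PCE (%)
actual = reference * np.exp(-0.0020 * t)
predicted = reference * np.exp(-0.0025 * t)

t90_actual = first_crossing_time(t, actual, reference)
t90_pred = first_crossing_time(t, predicted, reference)

# Keep only forecast-window crossings: the actual T90 must lie beyond 30 h.
conditioning_window = 30.0
abs_error_h = None
if t90_actual is not None and t90_pred is not None and t90_actual > conditioning_window:
    abs_error_h = abs(t90_pred - t90_actual)

print(t90_actual, t90_pred, abs_error_h)
```

Averaging `abs_error_h` over all qualifying device–seed pairs gives an MAE in hours; the same helper with `fraction=0.8` yields T80.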

Table 5. T80 prediction error by degradation mode, restricted to forecast-window crossings (actual T80 $> 30$ h). Format follows Table 4.

Mode	Model	N	MAE (h)	RMSE (h)
Initial gain	NHITS	14	38.9	46.6
	P-NHITS	13	42.1	48.5
	TimeDiff	12	25.9	35.8
Slow exp. decay	NHITS	150	15.3	22.0
	P-NHITS	148	14.6	21.4
	TimeDiff	137	20.6	27.1
Medium exp. decay	NHITS	5	2.6	4.2
	P-NHITS	5	2.9	4.8
	TimeDiff	5	9.7	14.6
Fast exp. decay	NHITS	1	24.1	24.1
	P-NHITS	1	27.5	27.5
	TimeDiff	1	66.9	66.9
Forecast-only	NHITS	170	16.9	24.7
	P-NHITS	167	16.5	24.4
	TimeDiff	155	21.0	28.0

Table 6. Uncertainty quantification: 80% prediction interval calibration. Coverage is the fraction of actual values falling within the 10th–90th percentile band. Mean width is averaged across all devices and forecast time steps.

Model	Coverage (%)	Mean Width (PCE%)
P-NHITS	77.2	1.78
TimeDiff	62.8	1.49
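
The two quantities in this table follow directly from their definitions in the caption. The sketch below computes them for a sample-based band, with the 10th and 90th percentiles taken across 50 draws per time step; the helper name `interval_calibration` and the Gaussian toy data are assumptions made for the example and do not reproduce either model's output.

```python
import numpy as np

def interval_calibration(y_true, lower, upper):
    """Return (coverage in %, mean width) of a prediction band."""
    inside = (y_true >= lower) & (y_true <= upper)
    return 100.0 * float(inside.mean()), float((upper - lower).mean())

# Toy setup (hypothetical): 200 devices, 120 forecast steps, 50 samples each.
rng = np.random.default_rng(0)
n_devices, horizon, n_samples = 200, 120, 50
center = rng.normal(10.0, 1.0, size=(n_devices, horizon))
y_true = center + rng.normal(0.0, 0.5, size=(n_devices, horizon))
samples = center + rng.normal(0.0, 0.5, size=(n_samples, n_devices, horizon))

# Sample-based band: percentiles across the sample axis.
lower, upper = np.percentile(samples, [10, 90], axis=0)

coverage, width = interval_calibration(y_true, lower, upper)
print(f"coverage = {coverage:.1f}%, mean width = {width:.2f} PCE%")
```

For a quantile-regression model the same function applies unchanged, with `lower` and `upper` read directly from the 0.1 and 0.9 quantile outputs instead of being estimated from samples.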

Table 7. Input-window ablation: pooled RMSE (PCE%) across 3 seeds. NHITS outperforms TimeDiff at all windows ($p < 10^{-13}$, Wilcoxon signed-rank). Bold: best model per window.

Window	NHITS	P-NHITS	TimeDiff
20 h	0.976	0.966	1.185
30 h	0.735	0.744	0.888
40 h	0.642	0.642	1.464

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

