1. Introduction
Maximum power point tracking (MPPT) in photovoltaic (PV) arrays becomes notably challenging under partial shading conditions (PSCs), where the power–voltage (P–V) landscape is nonconvex and multimodal, with multiple local maxima that shift with irradiance and temperature [1,2,3,4,5]. Classical trackers (perturb-and-observe and incremental conductance) are attractive for their simplicity and low sensor burden, but under PSCs, they can settle at suboptimal peaks and exhibit limit-cycle oscillations; performance is sensitive to stepsize tuning, measurement noise, and plant transients [1,3,4]. Global-search enhancements (adaptive scanning; PSO/GA/DE and hybrids) increase the likelihood of reaching the global MPP, at the cost of additional computation and re-tuning as operating statistics change [2,6,7]. This reflects a broader tension between global exploration to avoid local traps and local tracking to preserve fast, low-perturbation operation.
Several studies further sharpen the MPPT and Deep Reinforcement Learning (DRL) baselines under PSCs. On the DRL side, a recurrent Proximal Policy Optimization–Long Short-Term Memory (PPO–LSTM) agent for global MPPT demonstrates robust gains over classical and nonrecurrent baselines [8], while a hardware-validated DQN achieves GMPPT on real PV strings with rigorous experimental design [9]. Beyond single-agent control, physics-informed multi-agent DRL with centralized training and decentralized execution (CTDE) has matured in power-network settings, offering architectural guidance for PV module coordination [10]. In parallel, the body of comparative evidence and synthesis continues to grow: a controlled 2024 evaluation contrasting classical and learning-based MPPT algorithms [11], a comprehensive 2025 review of traditional and advanced MPPT techniques (including partial-shading scenarios) [12], and a 2025 validation study of a global MPPT strategy under complex partial shading using model-predictive control [13].
Learning-based control offers an alternative that adapts policies to nonstationary PV conditions. Deterministic policy gradient methods enable continuous actuation [14,15]; TD3 reduces overestimation bias via twin critics and target smoothing [16,17]. In multi-string or module-integrated topologies, the problem is intrinsically multi-agent with local observations and shared objectives, motivating centralized training with decentralized execution (CTDE) as in MADDPG, COMA, and value-factorization approaches [18,19,20]. Nevertheless, off-policy actor–critic stability hinges on value-function smoothness and bootstrapping geometry [16,21], while on-policy trust-region methods (TRPO/PPO) trade stability for higher sampling costs in embedded experimentation [22,23]. Two-time-scale stochastic approximation provides convergence guarantees under regularity, stepsize, and contractivity conditions that are often verified for linear approximation but are harder to certify for deep critics and nonconvex policy classes [24,25,26,27,28].
This work develops Fuzzy–MAT3D, a CTDE variant of TD3 in which actors and critics comprise a differentiable fuzzy partition of unity over the compact PV operating domain [29,30,31]. The partition induces locality and global Lipschitz continuity by construction, aligning with classical partition-of-unity ideas in numerical analysis and meshfree interpolation [32,33,34]. In the MPPT setting, this structure reduces temporal-difference (TD) target variance, bounds critic Jacobians in nonsmooth regions of the P–V surface, and yields a global γ-contraction of fixed-policy evaluation (hence non-expansive, since γ < 1)—properties that tighten constants in standard ODE/stochastic-approximation arguments and connect to input-to-state stability margins in control [25,27,28,35,36].
From a mechanistic viewpoint, the differentiable fuzzy PoU constrains the actor and critics to be globally Lipschitz, with moduli that can be explicitly computed from the membership functions. In Section 2, we show that these Lipschitz constants enter directly into the variance bounds of the smoothed TD targets and into the ISS gain of the closed-loop dynamics. Thus, fuzzy regularization reduces the sensitivity of temporal-difference updates to stochastic perturbations and enlarges the ISS margin, which in turn manifests as smoother learning dynamics and reduced steady-state jitter in the PV power trajectories.
We also analyze the order-statistics bias introduced by the TD3 min-of-twins target and provide a projected Bellman residual bound with the correct contraction factor (clarified with and without measure invariance), together with a clean N-agent CTDE extension [37,38,39,40].
Aim and significance. Our aim is to endow multi-agent TD3 with a structured, differentiable fuzzy front-end that regularizes value estimation and stabilizes policy updates for MPPT under PSCs. The contribution has two principal facets: structurally, it injects physics-compatible locality into actor–critic maps; practically, it seeks higher energy capture with reduced steady-state oscillation relative to plain multi-agent TD3, MADDPG, and classical MPPT, as substantiated in the Results Section. For evaluation, we follow established DRL practices and classical multiple-comparison control [41,42,43,44,45].
Problem statement and objectives. We model a two-module PV string under partial shading as a discounted Markov decision process with a compact state space and a bounded action space, where the agents observe electrical quantities (irradiance proxies, voltages, and currents) and select module voltages or duty cycles every control period in order to maximize normalized PV power. The control problem is to synthesize a stationary multi-agent policy that maximizes the long-run MPPT efficiency while keeping steady-state oscillations and settling times within acceptable operating bounds.
Accordingly, the main objectives of this study are as follows: (i) to construct a Lipschitz-regularized CTDE–TD3 architecture with a differentiable fuzzy partition of the operating domain; (ii) to establish convergence and stability guarantees (contraction properties, temporal-difference variance bounds, two-time-scale convergence, and ISS margins) for the resulting actor–critic scheme; and (iii) to demonstrate, on a balanced set of PSC scenarios, that the proposed Fuzzy–MAT3D controller outperforms representative classical and DRL-based MPPT baselines in efficiency and stability.
To the best of our knowledge, this is the first CTDE–TD3 design that explicitly ties a fuzzy partition-of-unity to global Lipschitz constants and ISS margins in PV MPPT.
Relative to entropy-regularized SAC and trust-region methods (TRPO/PPO) [21,22,23], the proposed fuzzy partition of unity (PoU) regularizes deterministic CTDE–TD3 by enforcing global input Lipschitzness and reducing TD-target variance. Recent reports on safe/robust Reinforcement Learning (RL) for power converter control and MPPT under PSCs [46,47,48,49,50,51,52] motivate structure in actor–critic features; hence, Fuzzy–MAT3D offers a physics-compatible alternative, achieving high yield with low ripple under an identical control period and per-tick compute budget; irradiance is measured and logged for MPP reference and analyses, while classical baselines operate on their standard inputs.
Contributions. In summary, this paper achieves the following:
Introduces Fuzzy–MAT3D, a fuzzy-partitioned CTDE–TD3 architecture for multi-string PV arrays, yielding globally Lipschitz and locally interpretable actor–critic mappings [29,31,32].
Establishes a global γ-contraction for fixed-policy evaluation (and a projected bound under measure invariance), reduces TD-target variance, bounds the TD3 min underestimation bias, and provides a projected Bellman residual bound with the appropriate contraction factor [16,27,39].
Provides an N-agent CTDE extension consistent with the above analysis and tailored to PV module coordination [18].
Demonstrates, under a matched control period and actuation limits with identical per-tick compute budgets (irradiance measured/logged for reference) and randomized, CRN-blocked trials, higher mean MPPT efficiency and lower steady-state oscillations than multi-agent TD3, MADDPG, and classical baselines; details appear in the Methods and Results Sections [41,42].
2. Mathematical Framework and Main Results
This section develops the mathematical backbone of our approach and states the main theoretical contributions. We model MPPT control as a discounted Markov decision process (MDP) with compact state and action spaces and augment the CTDE actor–critic pipeline with a differentiable fuzzy partition of unity over the normalized operating domain; this front-end induces global Lipschitz continuity of the actor and twin critics, uniformly bounds target magnitudes and Jacobians, and reduces TD-target variance by construction.
2.1. Setting, Notation, and Fuzzy PoU
We consider a discounted MDP with compact state space S, action space A, and discount γ ∈ (0, 1). Two decentralized controllers with shared parameters (CTDE) act on a PV string; rewards are bounded as in (33); hence, the value functions are uniformly bounded by r_max/(1 − γ).
For each coordinate j = 1, …, 7, let the memberships μ_{j,i} be differentiable, nonnegative, and form a per-coordinate partition: Σ_i μ_{j,i}(s_j) = 1. Define the stacked membership map Φ(s), obtained by concatenating the per-coordinate membership vectors, and the normalized PoU Φ̄(s) = Φ(s)/7. Unless stated otherwise, ‖·‖ denotes the Euclidean norm on state and feature spaces, together with the induced operator norm for matrices. Accordingly, L_Φ and L_Φ̄ quantify the input Lipschitz moduli of the stacked membership map and of the normalized PoU, respectively. Because the memberships satisfy Σ_i μ_{j,i} = 1 for each of the seven coordinates, the entries of Φ(s) sum to 7 identically, so the normalization by 7 is exact. Since the sum is unitary per coordinate and S is compact, Φ̄ is globally Lipschitz; we write L_Φ̄ = L_Φ/7.
Intuitively, a map f is L-Lipschitz if ‖f(x) − f(y)‖ ≤ L‖x − y‖ for all x, y, so L bounds its worst-case slope. In an actor–critic architecture, global Lipschitz continuity implies that small perturbations in voltages, irradiance estimates, or sensor readings cannot induce arbitrarily large changes in actions or value estimates. The differentiable fuzzy PoU used here ensures that the feature map and the actors/critics built on top of it inherit such bounded slopes, with explicit moduli such as L_Φ̄, thereby regularizing the policy and the value functions over the entire state space.
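To make the bounded-slope property concrete, the following minimal Python sketch estimates the Lipschitz modulus of a single-coordinate softmax membership block by finite differences; the membership shape, centers, temperatures, and grid are illustrative choices, not the calibrated values of Table 1.

```python
import numpy as np

def memberships(s, centers, tau):
    """Softmax memberships for one normalized coordinate s in [0, 1]."""
    logits = -(s - centers) ** 2 / tau          # sharper partition for smaller tau
    z = np.exp(logits - logits.max())           # numerically stable softmax
    return z / z.sum()

def empirical_lipschitz(centers, tau, n=2001, h=1e-5):
    """Max 2-norm of d(memberships)/ds over a grid: an empirical Lipschitz modulus."""
    worst = 0.0
    for s in np.linspace(0.0, 1.0, n):
        jac = (memberships(s + h, centers, tau) - memberships(s - h, centers, tau)) / (2 * h)
        worst = max(worst, float(np.linalg.norm(jac)))
    return worst

centers = np.linspace(0.0, 1.0, 5)              # five memberships per coordinate
for tau in (0.05, 0.10, 0.20):
    print(f"tau = {tau:.2f}  ->  empirical L ~= {empirical_lipschitz(centers, tau):.2f}")
```

Lower temperatures sharpen the partition and enlarge the empirical modulus, which is the knob the analysis below associates with TD-target variance and ISS gain.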
The deterministic actor μ_θ and the two critics Q_{w1}, Q_{w2} are built on the fuzzy features, with Polyak-averaged target networks and clipped target-policy noise (TD3).
Throughout the manuscript, the deterministic policy is denoted by μ_θ, where θ are the actor parameters. This notation is reserved exclusively for the policy and must not be confused with the fuzzy memberships μ_{j,i} introduced above. Accordingly, L_μ denotes the Lipschitz modulus of the actor, while L_Φ̄ refers to the global Lipschitz constant of the normalized partition of unity. This distinction is preserved in all subsequent sections.
2.2. Main Structural Results
Theorem 1 (Fuzzy PoU induces global Lipschitzness)
. On compact S,In particular, gradient norms and TD targets remain uniformly bounded along the iterates.
Proposition 1 (Smoothed fixed-policy operator is γ-contractive). Let μ_θ̄ be the target policy and ε the bounded, clipped target-smoothing noise, and define the smoothed fixed-policy evaluation operator T_μ accordingly. Then T_μ is a γ-contraction in the sup norm and, in L²(ρ), satisfies the analogous bound with ρ replaced on the right-hand side by its pushforward ρ′ under the smoothed closed-loop transition. The same holds for the min-of-twins operator, since the pointwise minimum is 1-Lipschitz; we use T_μ^min to denote the TD3 min-of-twins target operator.
Proposition 2 (TD-target variance reduction under PoU). Let the TD target be formed with target-policy smoothing noise ε independent of the next state s′ conditional on s. Then the contribution of state/action perturbations to the conditional TD-target variance is bounded by γ² times the squared Lipschitz modulus of the target critic (itself controlled by L_Φ̄ via Theorem 1) times the trace of the conditional covariance of s′; hence decreasing L_Φ̄ reduces this contribution. The cross term vanishes by design (target-policy noise independent of the next state given s). All Lipschitz moduli are taken with respect to the Euclidean norm.
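As a numerical illustration of Proposition 2, the following hedged sketch checks that the variance of a smoothed TD target built on a Lipschitz critic stays below γ²L² times the perturbation variance; the one-dimensional tanh critic and noise scale are toy choices, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma = 0.99, 0.05                      # discount, next-state perturbation scale

def critic(s, lip):
    """Toy 1-D critic whose slope is bounded by lip (|d/ds tanh(lip*s)| <= lip)."""
    return np.tanh(lip * s)

s_next = 0.0                                   # nominal next state (steepest region)
xi = rng.normal(0.0, sigma, size=200_000)      # state perturbations

for lip in (0.5, 2.0, 8.0):
    targets = gamma * critic(s_next + xi, lip)            # reward omitted (constant shift)
    bound = (gamma * lip * sigma) ** 2                    # gamma^2 L^2 Var[xi]
    print(f"L = {lip:4.1f}  Var[target] = {targets.var():.2e}  bound = {bound:.2e}")
```

The observed variance always sits below the bound and grows with the critic's Lipschitz modulus, mirroring the role of L_Φ̄ in the proposition.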
Collecting these results, the fuzzy PoU and the induced Lipschitz bounds play a dual role. On the critic side, they enter the coefficient multiplying the next-state variance term in the TD-target variance bound, thereby damping the stochastic fluctuations of the updates. On the closed-loop side, the same Lipschitz moduli appear in the ISS gain of (23), so that reducing L_Φ̄ enlarges the input-to-state stability margin. This provides a direct analytical link between the fuzzy regularization mechanism and the empirical reductions in temporal-difference variance and steady-state jitter observed in the PV power trajectories.
Proposition 3 (Projected Bellman residual bound (last-layer linear)). If the critic class is closed and convex (e.g., last-layer linear) and Π is its orthogonal projection, then the projected operator ΠT_μ admits a residual bound whose contraction factor is inherited from Proposition 1. If Q̂ is a fixed point of ΠT_μ, then its Bellman residual is controlled by the projection error amplified by that factor. If, moreover, the evaluation measure is invariant under the smoothed transition, the constant reduces to γ.
Remark 1 (On contraction vs. distribution shift). More generally, for any measurable u, the L²(ρ) norm of u composed with the smoothed transition equals the L²(ρ′) norm of u, where ρ′ is the pushforward of ρ. The bounds in (10) are therefore Lipschitz bounds across different measures. If, in addition, ρ′ = ρ (e.g., when ρ is the discounted occupancy measure induced by the target policy with the target noise), one recovers a γ-contraction. Otherwise, residual bounds inherit a factor involving the distribution-shift constant c of Remark 2 instead of γ.
Remark 2 (Empirical estimation of c and near-invariance). Let ρ denote the (discounted) state–action occupancy under the policy used to define the evaluation operator (typically the target policy), and let ρ′ be its pushforward. Over a finite replay window, we estimate the Radon–Nikodym-type constant c using a measurable partition of the (normalized) state–action domain and Laplace smoothing λ > 0. With Polyak-averaged targets and bounded target-policy noise, consecutive replay windows are empirically near-stationary, so c ≈ 1 is expected; when drift is present, the projected-residual bound inherits the corresponding factor from Proposition 3 (cf. the discussion following Proposition 1).
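A possible implementation of the cell-count estimator described in Remark 2 is sketched below; the uniform partition, bin count, and Laplace constant are illustrative assumptions rather than the exact protocol used for the replay windows.

```python
import numpy as np

def estimate_c(sa_rho, sa_rho_push, bins=8, lam=1.0):
    """Histogram estimate of c = max over cells of (rho'(cell)+lam') / (rho(cell)+lam').

    sa_rho, sa_rho_push : arrays of shape (n, d) with samples from rho and its
    pushforward, assumed pre-normalized to the unit box [0, 1]^d.
    """
    edges = [np.linspace(0.0, 1.0, bins + 1)] * sa_rho.shape[1]
    h_rho, _ = np.histogramdd(sa_rho, bins=edges)
    h_push, _ = np.histogramdd(sa_rho_push, bins=edges)
    p = h_rho / h_rho.sum()                         # empirical cell probabilities
    q = h_push / h_push.sum()
    lam_cell = lam / p.size                         # Laplace smoothing spread over cells
    return float(np.max((q + lam_cell) / (p + lam_cell)))

rng = np.random.default_rng(1)
rho = rng.uniform(size=(50_000, 2))                           # one replay window
push = np.clip(rho + rng.normal(0, 0.02, rho.shape), 0, 1)    # mild drift
print(f"estimated c ~= {estimate_c(rho, push):.3f}")          # near 1 under near-stationarity
```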
Lemma 1 (Underestimation by TD3's minimum). Let Q̂₁ and Q̂₂ be two critic estimates of a common target value Q with zero-mean errors. Then E[min(Q̂₁, Q̂₂)] ≤ Q, and the bias magnitude is controlled by the spread of the error difference. If the errors are jointly normal with common standard deviation σ and correlation ϱ, then, using min(a, b) = (a + b)/2 − |a − b|/2 and the mean of a folded normal, E[min(Q̂₁, Q̂₂)] − Q = −σ√((1 − ϱ)/π), so the underestimation shrinks as the twin-critic correlation increases.
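The jointly normal case of Lemma 1 can be checked numerically; the sketch below compares a Monte Carlo estimate of the min-of-twins bias with the closed-form value −σ√((1 − ϱ)/π) for a few illustrative correlations.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 10.0, 1.0                    # common mean and error std of the twin critics

for rho_corr in (0.0, 0.5, 0.9):
    cov = sigma**2 * np.array([[1.0, rho_corr], [rho_corr, 1.0]])
    q = rng.multivariate_normal([mu, mu], cov, size=500_000)
    mc_bias = q.min(axis=1).mean() - mu                          # Monte Carlo bias
    closed = -sigma * np.sqrt((1.0 - rho_corr) / np.pi)          # Lemma 1, normal case
    print(f"corr = {rho_corr:.1f}  MC bias = {mc_bias:+.4f}  closed form = {closed:+.4f}")
```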
Theorem 2 (Two-time-scale convergence (projected SA)). Under i.i.d./mixing replay, smoothness and boundedness of gradients, square-integrable martingale-difference noises, and stepsizes α_t (critics) and β_t (actor) with Σ_t α_t = Σ_t β_t = ∞, Σ_t (α_t² + β_t²) < ∞, and β_t/α_t → 0, the critics converge to the set of stationary points of their population losses (projected flow) and the actor follows the associated projected differential inclusion on the slow time scale. Every limit point of the actor is stationary.
Corollary 1 (PL/last-layer linear ⇒ uniqueness and point convergence). If, with frozen targets, the critic's last-layer (linear) loss is smooth and satisfies a Polyak–Łojasiewicz (PL) inequality, then the critic has a unique global minimizer and the (fast) critic iterates converge to it; on two time scales, the coupled dynamics track this minimizer along the slow manifold.
Theorem 3 (Finite stepsize neighborhoods with explicit constants). With constant stepsizes α (critics) and β (actor), there exist radii, depending on the Lipschitz moduli and noise bounds, such that the coupled iterates enter and remain within explicit steady-state neighborhoods of the corresponding stationary sets. Under PL, the radii admit closed-form expressions in terms of α, β, and the problem constants.
Remark 3. Theorem 2 assumes decreasing stepsizes with Σ_t α_t = Σ_t β_t = ∞, Σ_t (α_t² + β_t²) < ∞, and β_t/α_t → 0. Our implementations use constant stepsizes, for which the appropriate justification is provided by Theorem 3: the coupled recursions enter an explicit steady-state neighborhood whose radii depend on the stepsizes and the Lipschitz/noise constants. We do not claim that Theorem 2 applies to the constant-step regime; rather, Theorem 2 serves as the ideal decreasing-steps benchmark, while Theorem 3 supports the practical regime.
In summary, Theorem 2 is formulated under standard two-time-scale stochastic-approximation conditions: (i) globally Lipschitz gradients and bounded noise, (ii) projections onto compact parameter sets, and (iii) the stepsize conditions above. Under these assumptions, the critic recursions track the projected gradient flow of their population losses with frozen targets, while the actor follows the projected differential inclusion on the slow time scale.
The PL condition in Corollary 1 strengthens this picture by ruling out spurious stationary points in the last-layer critic loss: it guarantees a unique global minimizer and exponential convergence of the fast critic iterates to it, so that, along the slow manifold, the coupled actor–critic dynamics identify a single value function per actor parameter θ.
Theorem 4 (N-agent CTDE extension with block-norm constants). For N agents with local actors and a joint (or per-agent) critic, under the same two-time-scale regime, the conclusions of Theorem 2 hold; the Lipschitz moduli aggregate in block norms (summing under the 2-norm, taking the maximum under the ∞-norm), and the noise variance aggregates by sum (2-norm) or maximum (∞-norm).
2.3. ISS Margins and Steady-State Jitter (Formalized)
Consider the plant x_{t+1} = f(x_t, u_t, d_t) with control u_t = μ_θ(x_t) and exogenous input d. Let V be a control Lyapunov function on a compact operating set X, decreasing along closed-loop trajectories up to a class-K function of the disturbance and of the approximation errors, with constants tied to a reference state x* defined below. As illustrated in Figure 1, the critic layer contracts rapidly towards the slow manifold, so on the slow time scale, the actor evolves with small approximation errors, which is precisely the regime captured by (17).
Assumption 1 (Equilibrium anchoring on X). There exists a reference state x* ∈ X at which the reference feedback and value are anchored. Moreover, the corresponding maps are Lipschitz on X with finite moduli.
Define the approximation errors on X, recentered at x*: the actor error is the deviation of the learned policy from the reference feedback, and the critic mismatch is the deviation of the learned value from the reference value. By Assumption 1 and the triangle inequality, the actor error at any x ∈ X is bounded by its value at x* plus a Lipschitz term in the distance to x*, giving (20); an analogous argument for the critic mismatch, again using Lipschitz continuity, gives (22). Invoking Theorem 1, the actor Lipschitz modulus satisfies a bound proportional to L_Φ̄, since the fuzzy partition-of-unity (PoU) front-end induces the global input modulus. Substituting (20) and (22) into (17) yields (23).
We emphasize that, among the standing assumptions used in this section, the equilibrium anchoring on X (Assumption 1), tailored to the PV MPPT operating region under PSCs, is the modeling ingredient that is most specific to the present work.
The remaining conditions—compactness of the operating set, bounded actions, global Lipschitz continuity of the dynamics and actor–critic maps, and square-integrable noise—are in line with standard hypotheses in ISS and stochastic-approximation analyses of actor–critic schemes and are recalled here mainly to keep the exposition self-contained.
Because the ISS gain scales linearly with L_Φ̄ (from the fuzzy PoU), decreasing L_Φ̄ strictly reduces the gain, thereby enlarging the effective decay margin in (23). This formalizes the empirical observation that smaller L_Φ̄ lowers steady-state jitter by improving the ISS gain, consistently with the two-time-scale picture in Figure 1.
Remark 4 (Origin-centered fallback with offsets). If an explicit anchor x* is inconvenient, one may work at the origin and carry constant offsets in the error bounds. In designs that calibrate the actor and value to match at the operating point (zero-bias last layers, or explicit alignment at x*), the offsets can be made negligible; the same L_Φ̄-driven conclusion follows.
3. Materials and Methods
3.1. Study Design and Overview
We evaluate a fuzzy-partitioned, centralized training/decentralized execution (CTDE) variant of TD3 (hereafter, Fuzzy–MAT3D) for maximum power point tracking (MPPT) under partial shading. Two local controllers execute with shared parameters, whereas training is centralized from a replay buffer. The design follows a balanced common-random-numbers (CRN) protocol across 7 benchmark scenarios (Table 3) and 20 seeds per algorithm, with a fixed evaluation horizon T_eval (in seconds). The empirical protocol, metrics, and statistical analysis were specified a priori and are detailed below. All methods and constants are chosen to satisfy the bounded-target and Lipschitz regularity assumptions used in the theory.
3.2. PV Plant and Power-Stage Model
We adopt the classical single-diode model with series and shunt resistances,
I = I_ph − I_0 [exp((V + I R_s)/(n V_t)) − 1] − (V + I R_s)/R_sh,
with irradiance G and cell temperature T entering through the photocurrent and the thermal voltage. Two modules in series share the string current and add their voltages. Partial shading is represented by per-module irradiance profiles (G_1(t), G_2(t)) as in Table 3. The coefficients (I_ph, I_0, n, R_s, R_sh) are obtained by least-squares calibration from the module datasheet.
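For concreteness, the following sketch solves the implicit single-diode equation per module with a bracketed root finder and composes the series-string P–V curve for one shaded module; the electrical parameters are illustrative placeholders, not the least-squares calibrated coefficients, and bypass diodes are omitted as in Remark 5 below.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative per-module single-diode parameters (not the calibrated datasheet fit).
N_S, VT = 60, 0.02585            # cells in series, thermal voltage near 300 K [V]
N_IDE, I0 = 1.3, 1e-9            # ideality factor, saturation current [A]
RS, RSH = 0.35, 150.0            # series / shunt resistance [ohm]
IPH_STC = 8.0                    # photocurrent at 1000 W/m^2 [A]

def module_voltage(i, g):
    """Voltage of one module carrying current i [A] under irradiance g [W/m^2]."""
    iph = IPH_STC * g / 1000.0
    f = lambda v: iph - I0 * np.expm1((v + i * RS) / (N_IDE * N_S * VT)) \
                  - (v + i * RS) / RSH - i
    return brentq(f, -5.0, 60.0)

def string_curve(g1, g2, n=300):
    """P-V curve of two series modules (no bypass diodes, cf. Remark 5)."""
    i_max = 0.999 * IPH_STC * min(g1, g2) / 1000.0   # weakest module limits the current
    currents = np.linspace(1e-3, i_max, n)
    volts = np.array([module_voltage(i, g1) + module_voltage(i, g2) for i in currents])
    return volts, volts * currents

v, p = string_curve(1000.0, 450.0)                   # one module shaded
print(f"MPP under this shading pattern: {p.max():.1f} W at {v[p.argmax()]:.1f} V")
```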
Remark 5 (Bypass diodes under PSCs). Commercial modules include substring bypass diodes (see Section 5 and Table 10). For substring ℓ, we considered an augmented bypass branch in parallel with the cell branch. On the two-module bench and shading scripts of Table 3, measured string currents rarely forward-bias the bypass diodes. Enabling this branch in post hoc simulations changed η by a negligible margin. Thus, the single-diode model suffices for our scenarios.
3.3. DC/DC Stage and Control Interface
Let d ∈ [d_min, d_max] denote the converter duty cycle. The averaged power-stage dynamics relate the PV operating point to d, and the agent issues an incremental duty command Δd_t that is mapped to the next duty cycle. Throughout all simulations and hardware runs, the incremental duty update saturates the commanded increment to the admissible per-tick range (27) and then clips the resulting duty cycle to [d_min, d_max] (28). Unless stated otherwise, the action scaling is fixed so that Δ_max sets the maximum admissible per-tick duty change. No additional slew-rate limiter or nonlinearity is applied beyond the clipping in (27) and (28).
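A minimal sketch of the actuation interface implied by (27) and (28) follows; the duty limits and per-tick increment are illustrative values.

```python
import numpy as np

D_MIN, D_MAX = 0.05, 0.95        # admissible duty-cycle range (illustrative)
DELTA_MAX = 0.02                 # maximum per-tick duty change (illustrative)

def apply_action(d_prev, action):
    """Map a normalized agent action in [-1, 1] to the next duty cycle.

    First saturate the incremental command (cf. (27)), then clip the resulting
    duty cycle to the admissible range (cf. (28)); no further slew limiting.
    """
    delta = np.clip(action, -1.0, 1.0) * DELTA_MAX
    return float(np.clip(d_prev + delta, D_MIN, D_MAX))

d = 0.50
for a in (0.7, 1.5, -0.3):       # the second command saturates at +DELTA_MAX
    d = apply_action(d, a)
    print(f"duty -> {d:.3f}")
```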
3.4. Operating Constraints
We enforce plant and safety limits on voltages, currents, and duty cycles, and impose parameter projection (Proj) and gradient clipping (critics/actor) to keep all iterates bounded, consistent with the ISS and SA analyses.
3.5. Observations, Actions, and Horizon
Each module provides a local observation bounded componentwise by fixed normalization limits. The local observation includes the module irradiance, voltage, and current channels; RL methods consume the full tuple. Classical baselines (P&O, INC, PSO) operate with their standard inputs; we nonetheless log irradiance for all runs to enable like-for-like post hoc analyses and MPP reference computation. Actions are incremental duty-cycle commands within the per-tick limit Δ_max; step scenarios share a common change time (Section 3.10).
RL agents consume the full tuple because the irradiance channels are available on the intended hardware and stabilize near-MPP behavior under abrupt PSC changes. Classical baselines (P&O, INC, PSO) are deliberately kept on their canonical inputs to preserve standard formulations rather than re-tune them into nonstandard variants. All methods share the same wall-clock control period and actuation limits, and we log irradiance for every run (for MPP reference and post hoc analyses). This preserves compute and timing parity while making the sensing assumption explicit.
3.6. Fuzzy Features and Function Approximators
Each state coordinate is normalized and equipped with five differentiable memberships, combined via a coordinate-wise softmax so that the memberships of each coordinate sum to one. Stacking the 35 responses yields Φ(s) with the global identity Σ_k Φ_k(s) = 7; hence, the normalized fuzzy map is Φ̄(s) = Φ(s)/7. This induces global input Lipschitz constants for the actor and critics used below. The actor is a two-hidden-layer MLP with a tanh head; the twin critics share the fuzzy front-end but maintain separate downstream towers.
3.7. Fuzzy-Partition Parameters
Each raw state coordinate is affinely normalized to a common interval and equipped with softmax memberships having fixed centers and a per-coordinate temperature. Writing μ_j(s_j) for the jth coordinate's membership vector with Σ_i μ_{j,i}(s_j) = 1, we stack the seven blocks into Φ(s), so that Σ_k Φ_k(s) = 7.
To avoid overloading the policy notation μ_θ, we denote the fuzzy partition by Φ. Observe that Φ̄(s) = Φ(s)/7; for notational consistency with Section 2, we write L_Φ̄ for the global input Lipschitz constant of the normalized PoU. Table 1 reports the per-coordinate parameters and the induced L_Φ̄.
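The following sketch builds the stacked 35-dimensional softmax PoU and its normalization Φ̄ = Φ/7, verifying the identity Σ_k Φ_k(s) = 7; the centers and temperatures are placeholders for the calibrated entries of Table 1.

```python
import numpy as np

N_COORDS, N_MEMB = 7, 5
CENTERS = np.linspace(0.0, 1.0, N_MEMB)            # per-coordinate centers (illustrative)
TAU = np.full(N_COORDS, 0.1)                       # per-coordinate temperatures (illustrative)

def fuzzy_features(s):
    """Stacked softmax PoU Phi(s) in R^35 and its normalization Phi(s)/7.

    s : array of 7 coordinates, each already normalized to the unit interval.
    """
    blocks = []
    for j in range(N_COORDS):
        logits = -(s[j] - CENTERS) ** 2 / TAU[j]
        z = np.exp(logits - logits.max())
        blocks.append(z / z.sum())                 # each coordinate's memberships sum to 1
    phi = np.concatenate(blocks)                   # sum(phi) == 7 by construction
    return phi, phi / N_COORDS

phi, phi_bar = fuzzy_features(np.array([0.1, 0.4, 0.5, 0.6, 0.3, 0.8, 0.2]))
print(phi.shape, round(float(phi.sum()), 6), round(float(phi_bar.sum()), 6))   # (35,) 7.0 1.0
```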
3.8. Reward and Normalization
The saturated, normalized reward used throughout is the instantaneous PV power divided by its physical normalization constants and saturated to the unit interval; the normalization constants are the calibrated voltage and current limits of the string (Section 3.10). This ensures a bounded reward and hence bounded targets.
3.9. Training Protocol (TD3 Under CTDE)
We employ TD3 with centralized replay and decentralized execution:
Discount factor γ, minibatch size, policy-update period, and Polyak factor as listed in Table 2.
Target-policy smoothing noise, clipped and sampled i.i.d., independent of the next state conditional on s; target networks are Polyak-averaged.
Parameter updates include explicit projections onto compact convex parameter sets; gradient clipping is applied to all updates.
Algorithm 1 summarizes the CTDE training loop used.
Table 2 summarizes the training hyperparameters adopted for Fuzzy–MAT3D. These values implement a two-time-scale separation (fast critics and slow actor) with constant stepsizes; accordingly, Theorem 3 provides the formal justification for the training regime used in our experiments.
Training episodes used a longer horizon (up to 12,000 control steps per episode) to improve replay diversity. All reported performance metrics, however, were computed on the fixed evaluation horizon T_eval across the seven scenarios to ensure comparability between algorithms.
Algorithm 1 Fuzzy–MAT3D (CTDE–TD3 for two PV modules in series)
Require: actor and twin-critic parameters with target copies; replay buffer B; stepsizes α (critics) and β (actor).
1: Initialize the actor and twin critics; set the target networks equal to the online networks; start with an empty buffer B.
2: for episodes do
3:   Reset irradiances (G_1, G_2); initialize the duty cycles.
4:   for time step t do
5:     Apply the actions; observe the next state and power; compute the reward by (33); push the transition to B.
6:     Form the TD3 min-of-twins targets with clipped noise; update the critics; Polyak-average the critic targets.
7:     if t mod policy delay = 0 then ascend the deterministic policy gradient; Polyak-average the actor target.
8:     end if
9:   end for
10: end for
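A compact PyTorch-style sketch of the per-minibatch update in lines 6–7 of Algorithm 1 is given below, assuming the actor, twin critics, and their target copies are torch.nn.Module instances built on the fuzzy features; hyperparameter names and values are illustrative, and the projection steps of Section 3.4 are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critic1, critic2, actor_t, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau_polyak=0.005,
               policy_delay=2, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """One CTDE-TD3 update from a centralized replay minibatch (cf. Algorithm 1, lines 6-7)."""
    s, a, r, s2, done = batch                               # tensors sampled from the shared buffer

    with torch.no_grad():                                   # smoothed min-of-twins target (line 6)
        eps = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + eps).clamp(-act_limit, act_limit)
        y = r + gamma * (1.0 - done) * torch.min(critic1_t(s2, a2), critic2_t(s2, a2))

    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    torch.nn.utils.clip_grad_norm_(list(critic1.parameters()) + list(critic2.parameters()), 1.0)
    critic_opt.step()
    for net, net_t in ((critic1, critic1_t), (critic2, critic2_t)):     # Polyak-average critic targets
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau_polyak).add_(tau_polyak * p.data)

    if step % policy_delay == 0:                            # delayed, slower actor update (line 7)
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        torch.nn.utils.clip_grad_norm_(actor.parameters(), 1.0)
        actor_opt.step()
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):    # Polyak-average actor target
            p_t.data.mul_(1.0 - tau_polyak).add_(tau_polyak * p.data)
```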
3.10. Benchmark Scenarios, CRN Protocol, and Metrics
Seven benchmark scenarios are used; step cases share the same change time, as shown in Table 3.
CRN protocol. All algorithms use identical random seeds, initializations, and shading scripts within each scenario to enable paired, blocked inferences. We run 20 independent seeds per scenario (140 total replications per algorithm).
At each sampling instant, the reference power P_MPP(t) is computed via a one-dimensional constrained search over the feasible string voltage. The primary endpoint is MPPT efficiency η, defined in (35) as the ratio of harvested energy to available MPP energy over the evaluation horizon. For step scenarios, we report (i) the settling time t_set, defined as the smallest time after the step such that the delivered power remains within a 2% band around P_MPP(t) for a contiguous window of length W; and (ii) steady-state oscillation, the standard deviation of the delivered power over the final segment of the evaluation horizon.
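A sketch of the fixed-horizon endpoints follows, assuming uniformly sampled trajectories of delivered power and reference MPP power; the 2% band is from the protocol, while the window length and tail fraction are illustrative placeholders.

```python
import numpy as np

def mppt_efficiency(p, p_mpp):
    """Ratio of harvested to available MPP energy over the evaluation horizon [%]."""
    return 100.0 * p.sum() / p_mpp.sum()         # the sampling period cancels

def settling_time(t, p, p_mpp, t_step, band=0.02, window=1.0):
    """Smallest t >= t_step such that |P - P_mpp| <= band*P_mpp over a full window [s]."""
    ok = np.abs(p - p_mpp) <= band * p_mpp
    n_win = max(1, int(round(window / (t[1] - t[0]))))
    for k in np.flatnonzero((t >= t_step) & ok):
        if k + n_win <= ok.size and ok[k:k + n_win].all():
            return float(t[k] - t_step)
    return np.nan                                # never settled within the horizon

def steady_state_oscillation(p, tail_frac=0.2):
    """SD of delivered power over the final segment of the horizon [W]."""
    return float(np.std(p[int((1.0 - tail_frac) * p.size):]))
```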
The normalization constants come from calibrated limits of the single-diode model used for normalization over the two-module operating domain, and thus closely match—but do not strictly equal—the STC datasheet values of the series string (Table 10).
3.11. Baselines and Fairness Constraints
All algorithms share the same control period and actuation limits (27) and (28). For PSO, we cap the per-tick particles × iterations budget so the search finishes within the 60 ms deadline (Table 4). RL methods perform exactly one forward pass per module per tick. Classical baselines operate with their standard inputs; irradiance is logged and used only for reference MPP computation and post hoc analyses, keeping budget parity in wall-clock and actuation.
Under the 60 ms control period, both PSO’s particles × iterations and the RL forward passes are strictly confined to this wall-clock budget; no baseline was granted extra evaluations or sensing beyond its canonical inputs.
3.12. Computational Budget and Real-Time Feasibility
RL inference requires exactly one forward pass per tick per module. We log wall-clock inference latencies (p50/p95/p99) to verify meeting the control deadline. The PSO controller is constrained by an identical per-tick computational budget.
Table 4 reports inference latencies (p50/p95/p99) measured on the evaluation host used for simulation; Section 5 (Table 12) reports on-device latencies measured on the hardware bench under the same 60 ms period.
3.13. Statistical Analysis Plan
Primary confirmatory analysis comprises (i) a one-way ANOVA on η across algorithms and (ii) blocked Dunnett contrasts versus Fuzzy–MAT3D (blocking by scenario and seed). Secondary analyses include CRN-paired one-sided t-tests (Fuzzy–MAT3D minus comparator) with Benjamini–Hochberg FDR control across hypotheses, and linear mixed-effects models with random intercepts for scenario and seed. We report distributional summaries and two-sided Student-t confidence intervals; diagnostics include normality of residuals and Levene's test for homoscedasticity.
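The secondary analysis can be sketched as follows, assuming per-run efficiencies aligned by (scenario, seed) blocks; the helper names are hypothetical and the BH step-up is a standard implementation rather than the exact analysis code.

```python
import numpy as np
from scipy.stats import ttest_rel

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted q-values (step-up), preserving input order."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * p.size / (np.arange(p.size) + 1.0)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]    # enforce monotonicity
    q = np.empty_like(q_sorted)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

def crn_paired_tests(eta_fuzzy, eta_by_comparator):
    """One-sided paired t-tests (Fuzzy-MAT3D minus comparator) with BH-FDR control.

    eta_fuzzy         : per-run efficiencies aligned by (scenario, seed).
    eta_by_comparator : dict name -> array aligned the same way.
    """
    names = list(eta_by_comparator)
    pvals = [ttest_rel(eta_fuzzy, eta_by_comparator[n], alternative="greater").pvalue
             for n in names]
    return dict(zip(names, zip(pvals, bh_fdr(pvals))))
```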
When executed on physical hardware, we mirror the seven scenarios of Table 3 with synchronized irradiance sensing and logging and enforce the same limits and deadlines as in simulation. Each algorithm is evaluated with a fixed number of bench replicates per scenario, blocked by scenario and replicate ID. The reference P_MPP is computed from a calibrated single-diode model parameterized by the measured irradiance and temperature.
4. Results
We report both learning behavior and fixed-horizon performance under a balanced common-random-numbers (CRN) design comprising 7 scenarios and 20 seeds per algorithm (140 replications per algorithm). The primary endpoint is MPPT (maximum power point tracking) efficiency, η (Def. (35)); secondary endpoints quantify transient speed and steady operation: settling time t_set (2% band, window W) and steady-state oscillation (SD of delivered power over the final segment of the evaluation horizon). Unless stated otherwise, all summaries are CRN-blocked; means are reported with two-sided 95% Student-t confidence intervals; and hypothesis testing follows the pre-specified hierarchy of a global one-way ANOVA, confirmatory blocked-Dunnett contrasts versus Fuzzy–MAT3D, and CRN-paired one-sided t-tests with BH–FDR control.
We begin with training dynamics for the RL controllers (Figure 2), then present the CRN-blocked comparison across six algorithms via distributional views and mean–CI summaries (Figure 3 and Figure 4). Stability metrics are analyzed next: steady-state oscillation (Figure 5) and settling time (Figure 6), with effect sizes and paired inferences tabulated in Tables 7 and 8 alongside aggregate RL performance (Table 9). We subsequently dissect the speed–stability trade-off (Section 4.3), provide ablations and sensitivity checks, and include an illustrative step-disturbance case study. Exploratory Tukey–Kramer intervals are relegated to Appendix A to avoid conflating descriptive and confirmatory claims.
Axes and table headers use a unified nomenclature: "MPPT efficiency η [%]", "settling time t_set [s]", and "steady-state oscillation [W]". All figures and tables reflect the same CRN blocking and horizon to ensure like-for-like comparisons across algorithms and scenarios.
4.1. Training Dynamics (RL Group)
We begin by characterizing the learning behavior of the three RL controllers under the same CTDE protocol and logging setup. Figure 2 reports the evolution of the cumulative return and a moving-average return across training episodes. Two features are salient: (i) the Fuzzy–MAT3D trajectories exhibit visibly damped volatility and earlier stabilization relative to MAT3D and MADDPG, and (ii) improvements accrue more steadily once the critics have entered their fast-contraction regime, consistent with the variance-reduction and non-expansivity mechanisms developed in Section 2. These dynamics foreshadow the downstream fixed-horizon advantages—higher MPPT efficiency and lower steady-state oscillation—documented in the comparative analyses that follow.
4.2. Global Comparison Across Six Algorithms (CRN-Blocked)
This subsection presents a global comparison of six algorithms under a CRN-blocked design, matching stochastic trajectories across algorithms to control heterogeneity and reduce estimator variance. We report means, 95% confidence intervals, standardized effect sizes (Hedges' g), and average ranks, and conduct blocked ANOVA with Dunnett/Holm corrections for multiple comparisons against the reference.
Table 5 reports the scenario-wise MPPT efficiency (mean ± 95% CI) with 20 seeds per scenario and algorithm. Across all seven scenarios, Fuzzy–MAT3D attains the highest mean efficiency and consistently narrow confidence intervals, for example under the Standard Condition and Deep Shadow scenarios. These per-scenario results mirror the aggregate advantages shown in Figure 3 and Figure 4 and reinforce the robustness of the fuzzy-regularized approach across both static and step-change conditions.
A global one-way ANOVA on η rejects equality of means (140 runs per group), confirming the significant differences in performance distributions visually apparent in Figure 3 and Figure 4. Following Section 3.13, we report the blocked Dunnett contrasts in the supplement (see, e.g., the exploratory post hoc analysis in Figure A1) and the CRN-paired tests below.
Residual Q–Q plots (to check normality) and Levene's test (for homoscedasticity) were performed to validate the ANOVA assumptions. No gross departures from normality were found, and Levene's test did not reject homoscedasticity within groups at the nominal significance level; all confirmatory inferences therefore follow the pre-specified hierarchy.
4.3. Settling Time vs. Stability
As shown by the CRN-blocked boxplots in Figure 6, Fuzzy–MAT3D exhibits a longer settling time t_set than the plain TD3 baseline (MAT3D) and MADDPG. The RL-only aggregate (Table 9) reports a mean settling time of 7.54 s for Fuzzy–MAT3D, versus 1.57 s for MAT3D, with MADDPG also settling faster.
In our protocol, t_set is the first time after the step at which the trajectory remains within a 2% band around the instantaneous MPP for a contiguous window of length W (see Section 3.10). This definition is agnostic to (i) how much overshoot/undershoot occurred before entering the band and (ii) whether the controller subsequently leaves the band again after the W-window has elapsed. Hence, a controller can register a small t_set by grazing the band early with aggressive moves yet sustain sizable steady-state jitter or even drift later; conversely, a more conservative controller can register a larger t_set while delivering substantially better long-horizon behavior.
Mechanistic explanation for Fuzzy–MAT3D's larger t_set. The fuzzy PoU front-end enforces global Lipschitzness on the actor/critics (Theorem 1) and, together with target smoothing, yields a locally non-expansive fixed-policy operator (Proposition 1); moreover, the twin-critic minimum introduces a small, correlation-dependent negative bias (Lemma 1). These ingredients reduce TD-target variance and damp fast transients, but they also make the closed loop deliberately conservative right after abrupt shading steps, prioritizing a monotone approach over rapid excursions. In short, Fuzzy–MAT3D trades a few seconds of responsiveness for markedly improved stability and bias robustness.
Three lines of evidence—empirical, statistical, and control-theoretic—support Fuzzy–MAT3D despite its larger t_set:
Dominant long-horizon energy capture. Across the CRN-blocked study (140 runs per algorithm), Fuzzy–MAT3D achieves the highest mean MPPT efficiency (92.0%), substantially above MAT3D (80.1%) and MADDPG; CRN-paired tests yield large effects (Cohen's d) and essentially zero q-values (BH–FDR). Thus, any energy loss from a slower transient is more than offset by sustained operation near the MPP over the full horizon (Table 6, Table 7 and Table 8).
Much lower steady-state jitter. Figure 5 shows the steady-state oscillation of Fuzzy–MAT3D tightly concentrated near zero, whereas MAT3D and MADDPG display broad, heavy-tailed distributions. The RL-only aggregate reports 1.36 W for Fuzzy–MAT3D vs. 37.96 W for MAT3D, with MADDPG similarly dispersed. Lower jitter not only improves energy capture but also reduces switching stress and thermal cycling in the power stage (Table 9).
Design intent: stability margins over aggressiveness. Theoretically, the fuzzy-induced Lipschitz constant L_Φ̄ improves ISS gains, while the min-ensemble plus target smoothing controls overestimation and high-frequency actuation. This combination is expected to enlarge decay margins but reduce "snap-to-setpoint" behavior—precisely the speed–stability trade-off seen in Figure 6.
The distribution in Figure 6 shows Fuzzy–MAT3D with a higher median t_set and a long right tail driven by the most abrupt step cases, which is consistent with its conservative transient policy. Yet, when read jointly with Figure 5 (steady-state oscillation) and the efficiency summaries (Figure 3 and Figure 4; Table 6 and Table 9), the picture is unequivocal: Fuzzy–MAT3D sits on a better Pareto front—maximizing energy and minimizing jitter—while conceding some transient speed. In applications like PV MPPT under PSCs, where (i) steps are intermittent and (ii) the objective is integral energy over minutes to hours, this Pareto choice is the correct one.
The ANOVA and CRN-paired tests in Table 6, Table 7 and Table 8 indicate that the superiority of Fuzzy–MAT3D over MAT3D and MADDPG is statistically significant across all seven scenarios, with very small p-values and large paired effect sizes.
Because the design includes both static PSC profiles and step-change scenarios, the reported efficiency should be interpreted as a robust average over a representative family of practically relevant shading patterns.
We do not claim universal optimality beyond this family, but the ISS and Lipschitz analysis suggest that the qualitative advantages of Fuzzy–MAT3D should persist under other smooth, slowly varying PSC profiles; extending the experimental design to more aggressive, rapidly varying shading remains an interesting direction for future work.
Because t_set declares success after any continuous W-window within the band, it cannot penalize later departures from the band. This explains why algorithms with aggressive, oscillatory responses can display deceptively small t_set while still underperforming in energy and stability. Our CRN-blocked analysis therefore treats t_set as a secondary indicator to be interpreted alongside efficiency and steady-state metrics (Figure 5, Table 6, Table 7, Table 8 and Table 9).
Fuzzy–MAT3D is intentionally conservative around abrupt changes; this yields a larger settling time but confers superior stability and decisively better energy tracking. In the aggregate, and for the operational goals of MPPT under partial shading, Fuzzy–MAT3D’s trade-off is the desirable one.
Concretely, comparing the aggregate statistics in Table 6 and Table 9, Fuzzy–MAT3D sacrifices about 6 s of additional settling time on average (7.54 s vs. 1.57 s for MAT3D) in exchange for roughly 12 percentage points in MPPT efficiency (92.0% vs. 80.1%) and a reduction of about 36.6 W in steady-state jitter (1.36 W vs. 37.96 W), which is an advantageous trade-off for energy-centric applications.
As complementary, descriptive evidence, Appendix A (Figure A1) reports Tukey–Kramer confidence intervals for all pairwise algorithmic contrasts. Consistent with the CRN-blocked summaries and directionally aligned with the confirmatory blocked-Dunnett analysis, these intervals place Fuzzy–MAT3D above both MAT3D and MADDPG in mean MPPT efficiency (pairwise CIs against Fuzzy–MAT3D exclude zero), with the largest deficit observed for MADDPG. We therefore treat this panel as supportive context for the superiority ordering rather than an independent inferential claim.
4.4. RL-only Aggregate Across Replications
Table 9 aggregates RL-only results across CRN-blocked replications, showing that Fuzzy–MAT3D attains the highest mean MPPT efficiency with markedly lower steady-state oscillation than MAT3D and MADDPG. Conversely, Fuzzy–MAT3D exhibits a larger settling time t_set, reflecting the study's speed–stability trade-off and the conservative regulation induced by the fuzzy PoU front-end.
The empirical distribution of MPPT efficiency across all CRN-blocked replications is shown in Figure 7, complementing the notched boxplots and mean–CI panels by revealing the full shape and tails of the distributions. This panel contextualizes central-tendency summaries with dispersion and skewness across scenarios and seeds.
4.5. Case Study: Step Disturbance
As shown in Figure 8, after the step at the common change time, Fuzzy–MAT3D approaches the new MPP monotonically and maintains it with negligible steady-state jitter, whereas MADDPG overshoots and develops sustained oscillations that depress the average power; MAT3D converges slowly and under-tracks the MPP. The lower panel explains the mechanism: Fuzzy–MAT3D rapidly desaturates the duty cycle and then holds an almost constant command, while the other agents continue exciting the plant—consistent with the ISS-based stability rationale in Section 2.3.
6. Discussion
Our working hypothesis was that inserting a differentiable fuzzy partition of unity in front of the actor–critic would (i) enforce global input Lipschitzness and lower TD-target variance, (ii) render fixed-policy evaluation non-expansive with the correct contraction factor γ, (iii) enlarge the closed-loop ISS margin, and (iv) convert constant stepsizes into explicit steady-state neighborhoods; these mechanisms are formalized in Section 2.2.
Empirically, the resulting controller attains higher MPPT efficiency with markedly lower steady-state jitter while accepting a more conservative transient—precisely the speed–stability trade-off expected from the theory and consonant with prior observations that classical P&O/INC and PSO approaches tend to exchange rapid steps for oscillatory behavior under PSCs, whereas unregularized RL baselines (e.g., MAT3D and MADDPG) can amplify critic noise into unstable actuation. In the broader context of learning-based power electronics, the fuzzy layer acts as a structural regularizer that improves actor–critic conditioning and yields closed-loop behavior aligned with energy-centric objectives and hardware stress constraints, thereby offering a principled alternative to ad hoc damping or heuristic dithering. These interpretations are consistent with the controlled CRN-blocked study and the theory-first analysis reported herein.
We establish stationarity (not global optimality) and adopt a replay idealization; nonetheless, parameter projections, clipping, and bounded target noise narrow the assumption–implementation gap. The residual bounds depend on distribution shift (Remark 1), and the practical ISS gains in Section 2.3 inherit the usual modeling idealizations.
Future Work
Building on the present results, future work will (i) close the distribution-shift loop by learning replay/behavior policies that better align ρ and its pushforward, tightening the constants in the projected Bellman bounds (Proposition 3); (ii) automate the partition design (number of memberships and temperature) and study its effect on variance and twin-critic correlation, building on Proposition 2 and Lemma 1; (iii) couple ISS-style margins (Section 2.3) with barrier certificates and latency/quantization models for hardware-level safety guarantees; (iv) extend the CTDE analysis to larger N and partial observability, and benchmark against entropy-regularized and trust-region variants under identical sensing/compute budgets; and (v) translate the energy-centric gains to longer horizons and field deployments, including degradation, sensor drift, and grid-level constraints.
Beyond the specific two-module string considered here, the fuzzy-regularized CTDE–TD3 architecture is applicable to other domains where (i) the dynamics admit a compact operating envelope and (ii) safety or comfort requirements demand smooth closed-loop responses. Examples include cooperative voltage control in DC microgrids, coordinated charging of electric-vehicle fleets, and frequency regulation with distributed energy resources, where the PoU structure can encode network topology or operating regions. From the scaling viewpoint, Theorem 4 already shows that the Lipschitz moduli of the N-agent extension grow in a controlled way under block norms, so that Fuzzy–MAT3D can in principle be deployed on larger PV arrays by assigning one agent per string or module cluster. In such settings, additional work is needed to account for partial observability and communication constraints, but the core Lipschitz and ISS guarantees remain valid.