1. Introduction
Maximum power point tracking (MPPT) in photovoltaic (PV) arrays becomes notably challenging under partial shading conditions (PSCs), where the power–voltage (P–V) landscape is nonconvex and multimodal, with multiple local maxima that shift with irradiance and temperature [1,2,3,4,5]. Classical trackers (perturb-and-observe and incremental conductance) are attractive for their simplicity and low sensor burden, but under PSCs, they can settle at suboptimal peaks and exhibit limit-cycle oscillations; performance is sensitive to stepsize tuning, measurement noise, and plant transients [1,3,4]. Global-search enhancements (adaptive scanning; PSO/GA/DE and hybrids) increase the likelihood of reaching the global MPP, at the cost of additional computation and re-tuning as operating statistics change [2,6,7]. This reflects a broader tension between global exploration to avoid local traps and local tracking to preserve fast, low-perturbation operation.
Several studies further sharpen the MPPT and Deep Reinforcement Learning (DRL) baselines under PSCs. On the DRL side, a recurrent Proximal Policy Optimization–Long Short-Term Memory (PPO–LSTM) agent for global MPPT demonstrates robust gains over classical and nonrecurrent baselines [8], while a hardware-validated DQN achieves GMPPT on real PV strings with rigorous experimental design [9]. Beyond single-agent control, physics-informed multi-agent DRL with centralized training and decentralized execution (CTDE) has matured in power-network settings, offering architectural guidance for PV module coordination [10]. In parallel, the body of comparative evidence and synthesis continues to grow: a controlled 2024 evaluation contrasting classical and learning-based MPPT algorithms [11], a comprehensive 2025 review of traditional and advanced MPPT techniques (including partial-shading scenarios) [12], and a 2025 validation study of a global MPPT strategy under complex partial shading using model-predictive control [13].
Learning-based control offers an alternative that adapts policies to nonstationary PV conditions. Deterministic policy gradient methods enable continuous actuation [14,15]; TD3 reduces overestimation bias via twin critics and target smoothing [16,17]. In multi-string or module-integrated topologies, the problem is intrinsically multi-agent with local observations and shared objectives, motivating centralized training with decentralized execution (CTDE) as in MADDPG, COMA, and value-factorization approaches [18,19,20]. Nevertheless, off-policy actor–critic stability hinges on value-function smoothness and bootstrapping geometry [16,21], while on-policy trust-region methods (TRPO/PPO) trade stability for higher sampling costs in embedded experimentation [22,23]. Two-time-scale stochastic approximation provides convergence guarantees under regularity, stepsize, and contractivity conditions that are often verified for linear approximation but are harder to certify for deep critics and nonconvex policy classes [24,25,26,27,28].
This work develops Fuzzy–MAT3D, a CTDE variant of TD3 in which actors and critics comprise a differentiable fuzzy partition of unity over the compact PV operating domain [29,30,31]. The partition induces locality and global Lipschitz continuity by construction, aligning with classical partition-of-unity ideas in numerical analysis and meshfree interpolation [32,33,34]. In the MPPT setting, this structure reduces temporal-difference (TD) target variance, bounds critic Jacobians in nonsmooth regions of the P–V surface, and yields a global γ-contraction of fixed-policy evaluation (hence non-expansive, since γ < 1)—properties that tighten constants in standard ODE/stochastic-approximation arguments and connect to input-to-state stability margins in control [25,27,28,35,36].
From a mechanistic viewpoint, the differentiable fuzzy PoU constrains the actor and critics to be globally Lipschitz, with moduli that can be explicitly computed from the membership functions. In Section 2, we show that these Lipschitz constants enter directly into the variance bounds of the smoothed TD targets and into the ISS gain of the closed-loop dynamics. Thus, fuzzy regularization reduces the sensitivity of temporal-difference updates to stochastic perturbations and enlarges the ISS margin, which in turn manifests as smoother learning dynamics and reduced steady-state jitter in the PV power trajectories.
We also analyze the order-statistics bias introduced by the TD3 min-of-twins target and provide a projected Bellman residual bound with the correct contraction factor (clarified with and without measure invariance), together with a clean N-agent CTDE extension [37,38,39,40].
Aim and significance. Our aim is to endow multi-agent TD3 with a structured, differentiable fuzzy front-end that regularizes value estimation and stabilizes policy updates for MPPT under PSCs. The contribution has two principal facets: structurally, it injects physics-compatible locality into actor–critic maps; practically, it seeks higher energy capture with reduced steady-state oscillation relative to plain multi-agent TD3, MADDPG, and classical MPPT, as substantiated in the Results Section. For evaluation, we follow established DRL practices and classical multiple-comparison control [41,42,43,44,45].
Problem statement and objectives. We model a two-module PV string under partial shading as a discounted Markov decision process with a compact state space and a bounded action space, where the agents observe electrical quantities (irradiance proxies, voltages, and currents) and select module voltages or duty cycles every control period in order to maximize normalized PV power. The control problem is to synthesize a stationary multi-agent policy that maximizes the long-run MPPT efficiency while keeping steady-state oscillations and settling times within acceptable operating bounds.
Accordingly, the main objectives of this study are as follows: (i) to construct a Lipschitz-regularized CTDE–TD3 architecture with a differentiable fuzzy partition of the operating domain; (ii) to establish convergence and stability guarantees (contraction properties, temporal-difference variance bounds, two-time-scale convergence, and ISS margins) for the resulting actor–critic scheme; and (iii) to demonstrate, on a balanced set of PSC scenarios, that the proposed Fuzzy–MAT3D controller outperforms representative classical and DRL-based MPPT baselines in efficiency and stability.
To the best of our knowledge, this is the first CTDE–TD3 design that explicitly ties a fuzzy partition-of-unity to global Lipschitz constants and ISS margins in PV MPPT.
Relative to entropy-regularized SAC and trust-region methods (TRPO/PPO) [21,22,23], the proposed fuzzy partition of unity (PoU) regularizes deterministic CTDE–TD3 by enforcing global input Lipschitzness and reducing TD-target variance. Recent reports on safe/robust Reinforcement Learning (RL) for power converter control and MPPT under PSCs [46,47,48,49,50,51,52] motivate structure in actor–critic features; hence, Fuzzy–MAT3D offers a physics-compatible alternative, achieving high yield with low ripple under an identical control period and per-tick compute budget; irradiance is measured and logged for MPP reference and analyses, while classical baselines operate on their standard inputs.
Contributions. In summary, this paper achieves the following:
Introduces Fuzzy–MAT3D, a fuzzy-partitioned CTDE–TD3 architecture for multi-string PV arrays, yielding globally Lipschitz and locally interpretable actor–critic mappings [29,31,32].
Establishes a global γ-contraction for fixed-policy evaluation (and a projected bound under measure invariance), reduces TD-target variance, bounds the TD3 min underestimation bias, and provides a projected Bellman residual bound with the appropriate contraction factor [16,27,39].
Provides an N-agent CTDE extension consistent with the above analysis and tailored to PV module coordination [18].
Demonstrates, under a matched control period and actuation limits with identical per-tick compute budgets (irradiance measured/logged for reference) and randomized, CRN-blocked trials, higher mean MPPT efficiency and lower steady-state oscillations than multi-agent TD3, MADDPG, and classical baselines; details appear in the Methods and Results Sections [41,42].
2. Mathematical Framework and Main Results
This section develops the mathematical backbone of our approach and states the main theoretical contributions. We model MPPT control as a discounted Markov decision process (MDP) with compact state and action spaces and augment the CTDE actor–critic pipeline with a differentiable fuzzy partition of unity over the normalized operating domain; this front-end induces global Lipschitz continuity of the actor and twin critics, uniformly bounds target magnitudes and Jacobians, and reduces TD-target variance by construction.
2.1. Setting, Notation, and Fuzzy PoU
We consider a discounted MDP with compact state space S, action space A, and discount γ ∈ (0, 1). Two decentralized controllers with shared parameters (CTDE) act on a PV string; rewards are bounded as in (33); hence, the value functions are uniformly bounded by r_max/(1 − γ).
For each coordinate j = 1, …, 7, let the memberships μ_{j,i} be differentiable, nonnegative, and form a per-coordinate partition: Σ_i μ_{j,i}(s_j) = 1. Define the stacked membership map Φ(s), obtained by concatenating the per-coordinate membership vectors, and the normalized PoU Φ̄(s) = Φ(s)/7. Unless stated otherwise, ‖·‖ denotes the Euclidean norm on state and feature spaces, together with the induced operator norm for matrices. Accordingly, L_Φ and L_Φ̄ quantify the input Lipschitz moduli of the stacked membership map and of the normalized PoU, respectively. Because the memberships satisfy Σ_i μ_{j,i} = 1 for each of the seven coordinates, the entries of Φ(s) sum to 7 identically, so the normalization by 7 is exact. Since the sum is unitary per coordinate and S is compact, Φ̄ is globally Lipschitz; we write L_Φ̄ = L_Φ/7.
Intuitively, a map f is L-Lipschitz if ‖f(x) − f(y)‖ ≤ L‖x − y‖ for all x, y, so L bounds its worst-case slope. In an actor–critic architecture, global Lipschitz continuity implies that small perturbations in voltages, irradiance estimates, or sensor readings cannot induce arbitrarily large changes in actions or value estimates. The differentiable fuzzy PoU used here ensures that the feature map and the actors/critics built on top of it inherit such bounded slopes, with explicit moduli such as L_Φ̄, thereby regularizing the policy and the value functions over the entire state space.
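To make the bounded-slope property concrete, the following minimal Python sketch estimates the Lipschitz modulus of a single-coordinate softmax membership block by finite differences; the membership shape, centers, temperatures, and grid are illustrative choices, not the calibrated values of Table 1.

```python
import numpy as np

def memberships(s, centers, tau):
    """Softmax memberships for one normalized coordinate s in [0, 1]."""
    logits = -(s - centers) ** 2 / tau          # sharper partition for smaller tau
    z = np.exp(logits - logits.max())           # numerically stable softmax
    return z / z.sum()

def empirical_lipschitz(centers, tau, n=2001, h=1e-5):
    """Max 2-norm of d(memberships)/ds over a grid: an empirical Lipschitz modulus."""
    worst = 0.0
    for s in np.linspace(0.0, 1.0, n):
        jac = (memberships(s + h, centers, tau) - memberships(s - h, centers, tau)) / (2 * h)
        worst = max(worst, float(np.linalg.norm(jac)))
    return worst

centers = np.linspace(0.0, 1.0, 5)              # five memberships per coordinate
for tau in (0.05, 0.10, 0.20):
    print(f"tau = {tau:.2f}  ->  empirical L ~= {empirical_lipschitz(centers, tau):.2f}")
```

Lower temperatures sharpen the partition and enlarge the empirical modulus, which is the knob the analysis below associates with TD-target variance and ISS gain.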
The deterministic actor μ_θ and the two critics Q_{w1}, Q_{w2} are built on the fuzzy features, with Polyak-averaged target networks and clipped target-policy noise (TD3).
Throughout the manuscript, the deterministic policy is denoted by μ_θ, where θ are the actor parameters. This notation is reserved exclusively for the policy and must not be confused with the fuzzy memberships μ_{j,i} introduced above. Accordingly, L_μ denotes the Lipschitz modulus of the actor, while L_Φ̄ refers to the global Lipschitz constant of the normalized partition of unity. This distinction is preserved in all subsequent sections.
2.2. Main Structural Results
Theorem 1 (Fuzzy PoU induces global Lipschitzness)
. On compact S,In particular, gradient norms and TD targets remain uniformly bounded along the iterates.
Proposition 1 (Smoothed fixed-policy operator is γ-contractive). Let μ_θ̄ be the target policy and ε the bounded, clipped target-smoothing noise, and define the smoothed fixed-policy evaluation operator T_μ accordingly. Then T_μ is a γ-contraction in the sup norm and, in L²(ρ), satisfies the analogous bound with ρ replaced on the right-hand side by its pushforward ρ′ under the smoothed closed-loop transition. The same holds for the min-of-twins operator, since the pointwise minimum is 1-Lipschitz; we use T_μ^min to denote the TD3 min-of-twins target operator.
Proposition 2 (TD-target variance reduction under PoU). Let the TD target be formed with target-policy smoothing noise ε independent of the next state s′ conditional on s. Then the contribution of state/action perturbations to the conditional TD-target variance is bounded by γ² times the squared Lipschitz modulus of the target critic (itself controlled by L_Φ̄ via Theorem 1) times the trace of the conditional covariance of s′; hence decreasing L_Φ̄ reduces this contribution. The cross term vanishes by design (target-policy noise independent of the next state given s). All Lipschitz moduli are taken with respect to the Euclidean norm.
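As a numerical illustration of Proposition 2, the following hedged sketch checks that the variance of a smoothed TD target built on a Lipschitz critic stays below γ²L² times the perturbation variance; the one-dimensional tanh critic and noise scale are toy choices, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma = 0.99, 0.05                      # discount, next-state perturbation scale

def critic(s, lip):
    """Toy 1-D critic whose slope is bounded by lip (|d/ds tanh(lip*s)| <= lip)."""
    return np.tanh(lip * s)

s_next = 0.0                                   # nominal next state (steepest region)
xi = rng.normal(0.0, sigma, size=200_000)      # state perturbations

for lip in (0.5, 2.0, 8.0):
    targets = gamma * critic(s_next + xi, lip)            # reward omitted (constant shift)
    bound = (gamma * lip * sigma) ** 2                    # gamma^2 L^2 Var[xi]
    print(f"L = {lip:4.1f}  Var[target] = {targets.var():.2e}  bound = {bound:.2e}")
```

The observed variance always sits below the bound and grows with the critic's Lipschitz modulus, mirroring the role of L_Φ̄ in the proposition.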
Collecting these results, the fuzzy PoU and the induced Lipschitz bounds play a dual role. On the critic side, they enter the coefficient multiplying the next-state variance term in the TD-target variance bound, thereby damping the stochastic fluctuations of the updates. On the closed-loop side, the same Lipschitz moduli appear in the ISS gain of (23), so that reducing L_Φ̄ enlarges the input-to-state stability margin. This provides a direct analytical link between the fuzzy regularization mechanism and the empirical reductions in temporal-difference variance and steady-state jitter observed in the PV power trajectories.
Proposition 3 (Projected Bellman residual bound (last-layer linear)). If the critic class is closed and convex (e.g., last-layer linear) and Π is its orthogonal projection, then the projected operator ΠT_μ admits a residual bound whose contraction factor is inherited from Proposition 1. If Q̂ is a fixed point of ΠT_μ, then its Bellman residual is controlled by the projection error amplified by that factor. If, moreover, the evaluation measure is invariant under the smoothed transition, the constant reduces to γ.
Remark 1 (On contraction vs. distribution shift). More generally, for any measurable u, the L²(ρ) norm of u composed with the smoothed transition equals the L²(ρ′) norm of u, where ρ′ is the pushforward of ρ. The bounds in (10) are therefore Lipschitz bounds across different measures. If, in addition, ρ′ = ρ (e.g., when ρ is the discounted occupancy measure induced by the target policy with the target noise), one recovers a γ-contraction. Otherwise, residual bounds inherit a factor involving the distribution-shift constant c of Remark 2 instead of γ.
Remark 2 (Empirical estimation of c and near-invariance). Let ρ denote the (discounted) state–action occupancy under the policy used to define the evaluation operator (typically the target policy), and let ρ′ be its pushforward. Over a finite replay window, we estimate the Radon–Nikodym-type constant c using a measurable partition of the (normalized) state–action domain and Laplace smoothing λ > 0. With Polyak-averaged targets and bounded target-policy noise, consecutive replay windows are empirically near-stationary, so c ≈ 1 is expected; when drift is present, the projected-residual bound inherits the corresponding factor from Proposition 3 (cf. the discussion following Proposition 1).
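A possible implementation of the cell-count estimator described in Remark 2 is sketched below; the uniform partition, bin count, and Laplace constant are illustrative assumptions rather than the exact protocol used for the replay windows.

```python
import numpy as np

def estimate_c(sa_rho, sa_rho_push, bins=8, lam=1.0):
    """Histogram estimate of c = max over cells of (rho'(cell)+lam') / (rho(cell)+lam').

    sa_rho, sa_rho_push : arrays of shape (n, d) with samples from rho and its
    pushforward, assumed pre-normalized to the unit box [0, 1]^d.
    """
    edges = [np.linspace(0.0, 1.0, bins + 1)] * sa_rho.shape[1]
    h_rho, _ = np.histogramdd(sa_rho, bins=edges)
    h_push, _ = np.histogramdd(sa_rho_push, bins=edges)
    p = h_rho / h_rho.sum()                         # empirical cell probabilities
    q = h_push / h_push.sum()
    lam_cell = lam / p.size                         # Laplace smoothing spread over cells
    return float(np.max((q + lam_cell) / (p + lam_cell)))

rng = np.random.default_rng(1)
rho = rng.uniform(size=(50_000, 2))                           # one replay window
push = np.clip(rho + rng.normal(0, 0.02, rho.shape), 0, 1)    # mild drift
print(f"estimated c ~= {estimate_c(rho, push):.3f}")          # near 1 under near-stationarity
```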
Lemma 1 (Underestimation by TD3's minimum). Let Q̂₁ and Q̂₂ be two critic estimates of a common target value Q with zero-mean errors. Then E[min(Q̂₁, Q̂₂)] ≤ Q, and the bias magnitude is controlled by the spread of the error difference. If the errors are jointly normal with common standard deviation σ and correlation ϱ, then, using min(a, b) = (a + b)/2 − |a − b|/2 and the mean of a folded normal, E[min(Q̂₁, Q̂₂)] − Q = −σ√((1 − ϱ)/π), so the underestimation shrinks as the twin-critic correlation increases.
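The jointly normal case of Lemma 1 can be checked numerically; the sketch below compares a Monte Carlo estimate of the min-of-twins bias with the closed-form value −σ√((1 − ϱ)/π) for a few illustrative correlations.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 10.0, 1.0                    # common mean and error std of the twin critics

for rho_corr in (0.0, 0.5, 0.9):
    cov = sigma**2 * np.array([[1.0, rho_corr], [rho_corr, 1.0]])
    q = rng.multivariate_normal([mu, mu], cov, size=500_000)
    mc_bias = q.min(axis=1).mean() - mu                          # Monte Carlo bias
    closed = -sigma * np.sqrt((1.0 - rho_corr) / np.pi)          # Lemma 1, normal case
    print(f"corr = {rho_corr:.1f}  MC bias = {mc_bias:+.4f}  closed form = {closed:+.4f}")
```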
Theorem 2 (Two-time-scale convergence (projected SA)). Under i.i.d./mixing replay, smoothness and boundedness of gradients, square-integrable martingale-difference noises, and stepsizes α_t (critics) and β_t (actor) with Σ_t α_t = Σ_t β_t = ∞, Σ_t (α_t² + β_t²) < ∞, and β_t/α_t → 0, the critics converge to the set of stationary points of their population losses (projected flow) and the actor follows the associated projected differential inclusion on the slow time scale. Every limit point of the actor is stationary.
Corollary 1 (PL/last-layer linear ⇒ uniqueness and point convergence). If, with frozen targets, the critic's last-layer (linear) loss is smooth and satisfies a Polyak–Łojasiewicz (PL) inequality, then the critic has a unique global minimizer and the (fast) critic iterates converge to it; on two time scales, the coupled dynamics track this minimizer along the slow manifold.
Theorem 3 (Finite stepsize neighborhoods with explicit constants). With constant stepsizes α (critics) and β (actor), there exist radii, depending on the Lipschitz moduli and noise bounds, such that the coupled iterates enter and remain within explicit steady-state neighborhoods of the corresponding stationary sets. Under PL, the radii admit closed-form expressions in terms of α, β, and the problem constants.
Remark 3. Theorem 2 assumes decreasing stepsizes with Σ_t α_t = Σ_t β_t = ∞, Σ_t (α_t² + β_t²) < ∞, and β_t/α_t → 0. Our implementations use constant stepsizes, for which the appropriate justification is provided by Theorem 3: the coupled recursions enter an explicit steady-state neighborhood whose radii depend on the stepsizes and the Lipschitz/noise constants. We do not claim that Theorem 2 applies to the constant-step regime; rather, Theorem 2 serves as the ideal decreasing-steps benchmark, while Theorem 3 supports the practical regime.
In summary, Theorem 2 is formulated under standard two-time-scale stochastic-approximation conditions: (i) globally Lipschitz gradients and bounded noise, (ii) projections onto compact parameter sets, and (iii) the stepsize conditions above. Under these assumptions, the critic recursions track the projected gradient flow of their population losses with frozen targets, while the actor follows the projected differential inclusion on the slow time scale.
The PL condition in Corollary 1 strengthens this picture by ruling out spurious stationary points in the last-layer critic loss: it guarantees a unique global minimizer and exponential convergence of the fast critic iterates to it, so that, along the slow manifold, the coupled actor–critic dynamics identify a single value function per actor parameter θ.
Theorem 4 (N-agent CTDE extension with block-norm constants). For N agents with local actors and a joint (or per-agent) critic, under the same two-time-scale regime, the conclusions of Theorem 2 hold; the Lipschitz moduli aggregate in block norms (summing under the 2-norm, taking the maximum under the ∞-norm), and the noise variance aggregates by sum (2-norm) or maximum (∞-norm).
2.3. ISS Margins and Steady-State Jitter (Formalized)
Consider the plant x_{t+1} = f(x_t, u_t, d_t) with control u_t = μ_θ(x_t) and exogenous input d. Let V be a control Lyapunov function on a compact operating set X, decreasing along closed-loop trajectories up to a class-K function of the disturbance and of the approximation errors, with constants tied to a reference state x* defined below. As illustrated in Figure 1, the critic layer contracts rapidly towards the slow manifold, so on the slow time scale, the actor evolves with small approximation errors, which is precisely the regime captured by (17).
Assumption 1 (Equilibrium anchoring on X). There exists a reference state x* ∈ X at which the reference feedback and value are anchored. Moreover, the corresponding maps are Lipschitz on X with finite moduli.
Define the approximation errors on X, recentered at x*: the actor error is the deviation of the learned policy from the reference feedback, and the critic mismatch is the deviation of the learned value from the reference value. By Assumption 1 and the triangle inequality, the actor error at any x ∈ X is bounded by its value at x* plus a Lipschitz term in the distance to x*, giving (20); an analogous argument for the critic mismatch, again using Lipschitz continuity, gives (22). Invoking Theorem 1, the actor Lipschitz modulus satisfies a bound proportional to L_Φ̄, since the fuzzy partition-of-unity (PoU) front-end induces the global input modulus. Substituting (20) and (22) into (17) yields (23).
We emphasize that, among the standing assumptions used in this section, the equilibrium anchoring on X (Assumption 1), tailored to the PV MPPT operating region under PSCs, is the modeling ingredient that is most specific to the present work.
The remaining conditions—compactness of the operating set, bounded actions, global Lipschitz continuity of the dynamics and actor–critic maps, and square-integrable noise—are in line with standard hypotheses in ISS and stochastic-approximation analyses of actor–critic schemes and are recalled here mainly to keep the exposition self-contained.
Because the ISS gain scales linearly with L_Φ̄ (from the fuzzy PoU), decreasing L_Φ̄ strictly reduces the gain, thereby enlarging the effective decay margin in (23). This formalizes the empirical observation that smaller L_Φ̄ lowers steady-state jitter by improving the ISS gain, consistently with the two-time-scale picture in Figure 1.
Remark 4 (Origin-centered fallback with offsets). If an explicit anchor x* is inconvenient, one may work at the origin and carry constant offsets in the error bounds. In designs that calibrate the actor and value to match at the operating point (zero-bias last layers, or explicit alignment at x*), the offsets can be made negligible; the same L_Φ̄-driven conclusion follows.
3. Materials and Methods
3.1. Study Design and Overview
We evaluate a fuzzy-partitioned, centralized training/decentralized execution (CTDE) variant of TD3 (hereafter, Fuzzy–MAT3D) for maximum power point tracking (MPPT) under partial shading. Two local controllers execute with shared parameters, whereas training is centralized from a replay buffer. The design follows a balanced common-random-numbers (CRN) protocol across 7 benchmark scenarios (Table 3) and 20 seeds per algorithm, with a fixed evaluation horizon T_eval (in seconds). The empirical protocol, metrics, and statistical analysis were specified a priori and are detailed below. All methods and constants are chosen to satisfy the bounded-target and Lipschitz regularity assumptions used in the theory.
3.2. PV Plant and Power-Stage Model
We adopt the classical single-diode model with series and shunt resistances,
I = I_ph − I_0 [exp((V + I R_s)/(n V_t)) − 1] − (V + I R_s)/R_sh,
with irradiance G and cell temperature T entering through the photocurrent and the thermal voltage. Two modules in series share the string current and add their voltages. Partial shading is represented by per-module irradiance profiles (G_1(t), G_2(t)) as in Table 3. The coefficients (I_ph, I_0, n, R_s, R_sh) are obtained by least-squares calibration from the module datasheet.
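For concreteness, the following sketch solves the implicit single-diode equation per module with a bracketed root finder and composes the series-string P–V curve for one shaded module; the electrical parameters are illustrative placeholders, not the least-squares calibrated coefficients, and bypass diodes are omitted as in Remark 5 below.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative per-module single-diode parameters (not the calibrated datasheet fit).
N_S, VT = 60, 0.02585            # cells in series, thermal voltage near 300 K [V]
N_IDE, I0 = 1.3, 1e-9            # ideality factor, saturation current [A]
RS, RSH = 0.35, 150.0            # series / shunt resistance [ohm]
IPH_STC = 8.0                    # photocurrent at 1000 W/m^2 [A]

def module_voltage(i, g):
    """Voltage of one module carrying current i [A] under irradiance g [W/m^2]."""
    iph = IPH_STC * g / 1000.0
    f = lambda v: iph - I0 * np.expm1((v + i * RS) / (N_IDE * N_S * VT)) \
                  - (v + i * RS) / RSH - i
    return brentq(f, -5.0, 60.0)

def string_curve(g1, g2, n=300):
    """P-V curve of two series modules (no bypass diodes, cf. Remark 5)."""
    i_max = 0.999 * IPH_STC * min(g1, g2) / 1000.0   # weakest module limits the current
    currents = np.linspace(1e-3, i_max, n)
    volts = np.array([module_voltage(i, g1) + module_voltage(i, g2) for i in currents])
    return volts, volts * currents

v, p = string_curve(1000.0, 450.0)                   # one module shaded
print(f"MPP under this shading pattern: {p.max():.1f} W at {v[p.argmax()]:.1f} V")
```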
Remark 5 (Bypass diodes under PSCs). Commercial modules include substring bypass diodes (see Section 5 and Table 10). For substring ℓ, we considered an augmented bypass branch in parallel with the cell branch. On the two-module bench and shading scripts of Table 3, measured string currents rarely forward-bias the bypass diodes. Enabling this branch in post hoc simulations changed η by a negligible margin. Thus, the single-diode model suffices for our scenarios.
3.3. DC/DC Stage and Control Interface
Let d ∈ [d_min, d_max] denote the converter duty cycle. The averaged power-stage dynamics relate the PV operating point to d, and the agent issues an incremental duty command Δd_t that is mapped to the next duty cycle. Throughout all simulations and hardware runs, the incremental duty update saturates the commanded increment to the admissible per-tick range (27) and then clips the resulting duty cycle to [d_min, d_max] (28). Unless stated otherwise, the action scaling is fixed so that Δ_max sets the maximum admissible per-tick duty change. No additional slew-rate limiter or nonlinearity is applied beyond the clipping in (27) and (28).
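A minimal sketch of the actuation interface implied by (27) and (28) follows; the duty limits and per-tick increment are illustrative values.

```python
import numpy as np

D_MIN, D_MAX = 0.05, 0.95        # admissible duty-cycle range (illustrative)
DELTA_MAX = 0.02                 # maximum per-tick duty change (illustrative)

def apply_action(d_prev, action):
    """Map a normalized agent action in [-1, 1] to the next duty cycle.

    First saturate the incremental command (cf. (27)), then clip the resulting
    duty cycle to the admissible range (cf. (28)); no further slew limiting.
    """
    delta = np.clip(action, -1.0, 1.0) * DELTA_MAX
    return float(np.clip(d_prev + delta, D_MIN, D_MAX))

d = 0.50
for a in (0.7, 1.5, -0.3):       # the second command saturates at +DELTA_MAX
    d = apply_action(d, a)
    print(f"duty -> {d:.3f}")
```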
3.4. Operating Constraints
We enforce plant and safety limits on voltages, currents, and duty cycles, and impose parameter projection (Proj) and gradient clipping (critics/actor) to keep all iterates bounded, consistent with the ISS and SA analyses.
3.5. Observations, Actions, and Horizon
Each module provides a local observation bounded componentwise by fixed normalization limits. The local observation includes the module irradiance, voltage, and current channels; RL methods consume the full tuple. Classical baselines (P&O, INC, PSO) operate with their standard inputs; we nonetheless log irradiance for all runs to enable like-for-like post hoc analyses and MPP reference computation. Actions are incremental duty-cycle commands within the per-tick limit Δ_max; step scenarios share a common change time (Section 3.10).
RL agents consume the full tuple because the irradiance channels are available on the intended hardware and stabilize near-MPP behavior under abrupt PSC changes. Classical baselines (P&O, INC, PSO) are deliberately kept on their canonical inputs to preserve standard formulations rather than re-tune them into nonstandard variants. All methods share the same wall-clock control period and actuation limits, and we log irradiance for every run (for MPP reference and post hoc analyses). This preserves compute and timing parity while making the sensing assumption explicit.
3.6. Fuzzy Features and Function Approximators
Each state coordinate is normalized and equipped with five differentiable memberships, combined via a coordinate-wise softmax so that the memberships of each coordinate sum to one. Stacking the 35 responses yields Φ(s) with the global identity Σ_k Φ_k(s) = 7; hence, the normalized fuzzy map is Φ̄(s) = Φ(s)/7. This induces global input Lipschitz constants for the actor and critics used below. The actor is a two-hidden-layer MLP with a tanh head; the twin critics share the fuzzy front-end but maintain separate downstream towers.
3.7. Fuzzy-Partition Parameters
Each raw state coordinate is affinely normalized to a common interval and equipped with softmax memberships having fixed centers and a per-coordinate temperature. Writing μ_j(s_j) for the jth coordinate's membership vector with Σ_i μ_{j,i}(s_j) = 1, we stack the seven blocks into Φ(s), so that Σ_k Φ_k(s) = 7.
To avoid overloading the policy notation μ_θ, we denote the fuzzy partition by Φ. Observe that Φ̄(s) = Φ(s)/7; for notational consistency with Section 2, we write L_Φ̄ for the global input Lipschitz constant of the normalized PoU. Table 1 reports the per-coordinate parameters and the induced L_Φ̄.
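The following sketch builds the stacked 35-dimensional softmax PoU and its normalization Φ̄ = Φ/7, verifying the identity Σ_k Φ_k(s) = 7; the centers and temperatures are placeholders for the calibrated entries of Table 1.

```python
import numpy as np

N_COORDS, N_MEMB = 7, 5
CENTERS = np.linspace(0.0, 1.0, N_MEMB)            # per-coordinate centers (illustrative)
TAU = np.full(N_COORDS, 0.1)                       # per-coordinate temperatures (illustrative)

def fuzzy_features(s):
    """Stacked softmax PoU Phi(s) in R^35 and its normalization Phi(s)/7.

    s : array of 7 coordinates, each already normalized to the unit interval.
    """
    blocks = []
    for j in range(N_COORDS):
        logits = -(s[j] - CENTERS) ** 2 / TAU[j]
        z = np.exp(logits - logits.max())
        blocks.append(z / z.sum())                 # each coordinate's memberships sum to 1
    phi = np.concatenate(blocks)                   # sum(phi) == 7 by construction
    return phi, phi / N_COORDS

phi, phi_bar = fuzzy_features(np.array([0.1, 0.4, 0.5, 0.6, 0.3, 0.8, 0.2]))
print(phi.shape, round(float(phi.sum()), 6), round(float(phi_bar.sum()), 6))   # (35,) 7.0 1.0
```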
3.8. Reward and Normalization
The saturated, normalized reward used throughout is the instantaneous PV power divided by its physical normalization constants and saturated to the unit interval; the normalization constants are the calibrated voltage and current limits of the string (Section 3.10). This ensures a bounded reward and hence bounded targets.
3.9. Training Protocol (TD3 Under CTDE)
We employ TD3 with centralized replay and decentralized execution:
Discount factor γ, minibatch size, policy-update period, and Polyak factor as listed in Table 2.
Target-policy smoothing noise, clipped and sampled i.i.d., independent of the next state conditional on s; target networks are Polyak-averaged.
Parameter updates include explicit projections onto compact convex parameter sets; gradient clipping is applied to all updates.
Algorithm 1 summarizes the CTDE training loop used.
Table 2 summarizes the training hyperparameters adopted for Fuzzy–MAT3D. These values implement a two-time-scale separation (fast critics and slow actor) with constant stepsizes; accordingly, Theorem 3 provides the formal justification for the training regime used in our experiments.
Training episodes used a longer horizon (up to 12,000 control steps per episode) to improve replay diversity. All reported performance metrics, however, were computed on the fixed evaluation horizon T_eval across the seven scenarios to ensure comparability between algorithms.
Algorithm 1 Fuzzy–MAT3D (CTDE–TD3 for two PV modules in series)
Require: actor and twin-critic parameters with target copies; replay buffer B; stepsizes α (critics) and β (actor).
1: Initialize the actor and twin critics; set the target networks equal to the online networks; start with an empty buffer B.
2: for episodes do
3:   Reset irradiances (G_1, G_2); initialize the duty cycles.
4:   for time step t do
5:     Apply the actions; observe the next state and power; compute the reward by (33); push the transition to B.
6:     Form the TD3 min-of-twins targets with clipped noise; update the critics; Polyak-average the critic targets.
7:     if t mod policy delay = 0 then ascend the deterministic policy gradient; Polyak-average the actor target.
8:     end if
9:   end for
10: end for
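A compact PyTorch-style sketch of the per-minibatch update in lines 6–7 of Algorithm 1 is given below, assuming the actor, twin critics, and their target copies are torch.nn.Module instances built on the fuzzy features; hyperparameter names and values are illustrative, and the projection steps of Section 3.4 are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critic1, critic2, actor_t, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau_polyak=0.005,
               policy_delay=2, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """One CTDE-TD3 update from a centralized replay minibatch (cf. Algorithm 1, lines 6-7)."""
    s, a, r, s2, done = batch                               # tensors sampled from the shared buffer

    with torch.no_grad():                                   # smoothed min-of-twins target (line 6)
        eps = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + eps).clamp(-act_limit, act_limit)
        y = r + gamma * (1.0 - done) * torch.min(critic1_t(s2, a2), critic2_t(s2, a2))

    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    torch.nn.utils.clip_grad_norm_(list(critic1.parameters()) + list(critic2.parameters()), 1.0)
    critic_opt.step()
    for net, net_t in ((critic1, critic1_t), (critic2, critic2_t)):     # Polyak-average critic targets
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau_polyak).add_(tau_polyak * p.data)

    if step % policy_delay == 0:                            # delayed, slower actor update (line 7)
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        torch.nn.utils.clip_grad_norm_(actor.parameters(), 1.0)
        actor_opt.step()
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):    # Polyak-average actor target
            p_t.data.mul_(1.0 - tau_polyak).add_(tau_polyak * p.data)
```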
3.10. Benchmark Scenarios, CRN Protocol, and Metrics
Seven benchmark scenarios are used; step cases share the same change time, as shown in Table 3.
CRN protocol. All algorithms use identical random seeds, initializations, and shading scripts within each scenario to enable paired, blocked inferences. We run 20 independent seeds per scenario (140 total replications per algorithm).
At each sampling instant, the reference power P_MPP(t) is computed via a one-dimensional constrained search over the feasible string voltage. The primary endpoint is MPPT efficiency η, defined in (35) as the ratio of harvested energy to available MPP energy over the evaluation horizon. For step scenarios, we report (i) the settling time t_set, defined as the smallest time after the step such that the delivered power remains within a 2% band around P_MPP(t) for a contiguous window of length W; and (ii) steady-state oscillation, the standard deviation of the delivered power over the final segment of the evaluation horizon.
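A sketch of the fixed-horizon endpoints follows, assuming uniformly sampled trajectories of delivered power and reference MPP power; the 2% band is from the protocol, while the window length and tail fraction are illustrative placeholders.

```python
import numpy as np

def mppt_efficiency(p, p_mpp):
    """Ratio of harvested to available MPP energy over the evaluation horizon [%]."""
    return 100.0 * p.sum() / p_mpp.sum()         # the sampling period cancels

def settling_time(t, p, p_mpp, t_step, band=0.02, window=1.0):
    """Smallest t >= t_step such that |P - P_mpp| <= band*P_mpp over a full window [s]."""
    ok = np.abs(p - p_mpp) <= band * p_mpp
    n_win = max(1, int(round(window / (t[1] - t[0]))))
    for k in np.flatnonzero((t >= t_step) & ok):
        if k + n_win <= ok.size and ok[k:k + n_win].all():
            return float(t[k] - t_step)
    return np.nan                                # never settled within the horizon

def steady_state_oscillation(p, tail_frac=0.2):
    """SD of delivered power over the final segment of the horizon [W]."""
    return float(np.std(p[int((1.0 - tail_frac) * p.size):]))
```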
The normalization constants come from calibrated limits of the single-diode model used for normalization over the two-module operating domain, and thus closely match—but do not strictly equal—the STC datasheet values of the series string (Table 10).
3.11. Baselines and Fairness Constraints
All algorithms share the same control period and actuation limits (27) and (28). For PSO, we cap the per-tick particles × iterations budget so the search finishes within the 60 ms deadline (Table 4). RL methods perform exactly one forward pass per module per tick. Classical baselines operate with their standard inputs; irradiance is logged and used only for reference MPP computation and post hoc analyses, keeping budget parity in wall-clock and actuation.
Under the 60 ms control period, both PSO’s particles × iterations and the RL forward passes are strictly confined to this wall-clock budget; no baseline was granted extra evaluations or sensing beyond its canonical inputs.
3.12. Computational Budget and Real-Time Feasibility
RL inference requires exactly one forward pass per tick per module. We log wall-clock inference latencies (p50/p95/p99) to verify meeting the control deadline. The PSO controller is constrained by an identical per-tick computational budget.
Table 4 reports inference latencies (p50/p95/p99) measured on the evaluation host used for simulation; Section 5 (Table 12) reports on-device latencies measured on the hardware bench under the same 60 ms period.
3.13. Statistical Analysis Plan
Primary confirmatory analysis comprises (i) a one-way ANOVA on η across algorithms and (ii) blocked Dunnett contrasts versus Fuzzy–MAT3D (blocking by scenario and seed). Secondary analyses include CRN-paired one-sided t-tests (Fuzzy–MAT3D minus comparator) with Benjamini–Hochberg FDR control across hypotheses, and linear mixed-effects models with random intercepts for scenario and seed. We report distributional summaries and two-sided Student-t confidence intervals; diagnostics include normality of residuals and Levene's test for homoscedasticity.
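The secondary analysis can be sketched as follows, assuming per-run efficiencies aligned by (scenario, seed) blocks; the helper names are hypothetical and the BH step-up is a standard implementation rather than the exact analysis code.

```python
import numpy as np
from scipy.stats import ttest_rel

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted q-values (step-up), preserving input order."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * p.size / (np.arange(p.size) + 1.0)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]    # enforce monotonicity
    q = np.empty_like(q_sorted)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

def crn_paired_tests(eta_fuzzy, eta_by_comparator):
    """One-sided paired t-tests (Fuzzy-MAT3D minus comparator) with BH-FDR control.

    eta_fuzzy         : per-run efficiencies aligned by (scenario, seed).
    eta_by_comparator : dict name -> array aligned the same way.
    """
    names = list(eta_by_comparator)
    pvals = [ttest_rel(eta_fuzzy, eta_by_comparator[n], alternative="greater").pvalue
             for n in names]
    return dict(zip(names, zip(pvals, bh_fdr(pvals))))
```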
When executed on physical hardware, we mirror the seven scenarios of Table 3 with synchronized irradiance sensing and logging and enforce the same limits and deadlines as in simulation. Each algorithm is evaluated with a fixed number of bench replicates per scenario, blocked by scenario and replicate ID. The reference P_MPP is computed from a calibrated single-diode model parameterized by the measured irradiance and temperature.
4. Results
We report both learning behavior and fixed-horizon performance under a balanced common-random-numbers (CRN) design comprising 7 scenarios and 20 seeds per algorithm (140 replications per algorithm). The primary endpoint is MPPT (maximum power point tracking) efficiency, η (Def. (35)); secondary endpoints quantify transient speed and steady operation: settling time t_set (2% band, window W) and steady-state oscillation (SD of delivered power over the final segment of the evaluation horizon). Unless stated otherwise, all summaries are CRN-blocked; means are reported with two-sided 95% Student-t confidence intervals; and hypothesis testing follows the pre-specified hierarchy of a global one-way ANOVA, confirmatory blocked-Dunnett contrasts versus Fuzzy–MAT3D, and CRN-paired one-sided t-tests with BH–FDR control.
We begin with training dynamics for the RL controllers (Figure 2), then present the CRN-blocked comparison across six algorithms via distributional views and mean–CI summaries (Figure 3 and Figure 4). Stability metrics are analyzed next: steady-state oscillation (Figure 5) and settling time (Figure 6), with effect sizes and paired inferences tabulated in Tables 7 and 8 alongside aggregate RL performance (Table 9). We subsequently dissect the speed–stability trade-off (Section 4.3), provide ablations and sensitivity checks, and include an illustrative step-disturbance case study. Exploratory Tukey–Kramer intervals are relegated to Appendix A to avoid conflating descriptive and confirmatory claims.
Axes and table headers use a unified nomenclature: "MPPT efficiency η [%]", "settling time t_set [s]", and "steady-state oscillation [W]". All figures and tables reflect the same CRN blocking and horizon to ensure like-for-like comparisons across algorithms and scenarios.
4.1. Training Dynamics (RL Group)
We begin by characterizing the learning behavior of the three RL controllers under the same CTDE protocol and logging setup. Figure 2 reports the evolution of the cumulative return and a moving-average return across training episodes. Two features are salient: (i) the Fuzzy–MAT3D trajectories exhibit visibly damped volatility and earlier stabilization relative to MAT3D and MADDPG, and (ii) improvements accrue more steadily once the critics have entered their fast-contraction regime, consistent with the variance-reduction and non-expansivity mechanisms developed in Section 2. These dynamics foreshadow the downstream fixed-horizon advantages—higher MPPT efficiency and lower steady-state oscillation—documented in the comparative analyses that follow.
4.2. Global Comparison Across Six Algorithms (CRN-Blocked)
This subsection presents a global comparison of six algorithms under a CRN-blocked design, matching stochastic trajectories across algorithms to control heterogeneity and reduce estimator variance. We report means, 95% confidence intervals, standardized effect sizes (Hedges' g), and average ranks, and conduct blocked ANOVA with Dunnett/Holm corrections for multiple comparisons against the reference.
Table 5 reports the scenario-wise MPPT efficiency (mean ± 95% CI) with 20 seeds per scenario and algorithm. Across all seven scenarios, Fuzzy–MAT3D attains the highest mean efficiency and consistently narrow confidence intervals, for example under the Standard Condition and Deep Shadow scenarios. These per-scenario results mirror the aggregate advantages shown in Figure 3 and Figure 4 and reinforce the robustness of the fuzzy-regularized approach across both static and step-change conditions.
A global one-way ANOVA on η rejects equality of means (140 runs per group), confirming the significant differences in performance distributions visually apparent in Figure 3 and Figure 4. Following Section 3.13, we report the blocked Dunnett contrasts in the supplement (see, e.g., the exploratory post hoc analysis in Figure A1) and the CRN-paired tests below.
Residual Q–Q plots (to check normality) and Levene's test (for homoscedasticity) were performed to validate the ANOVA assumptions. No gross departures from normality were found, and Levene's test did not reject homoscedasticity within groups at the nominal significance level; all confirmatory inferences therefore follow the pre-specified hierarchy.
4.3. Settling Time vs. Stability
As shown by the CRN-blocked boxplots in Figure 6, Fuzzy–MAT3D exhibits a longer settling time t_set than the plain TD3 baseline (MAT3D) and MADDPG. The RL-only aggregate (Table 9) reports a mean settling time of 7.54 s for Fuzzy–MAT3D, versus 1.57 s for MAT3D, with MADDPG also settling faster.
In our protocol, t_set is the first time after the step at which the trajectory remains within a 2% band around the instantaneous MPP for a contiguous window of length W (see Section 3.10). This definition is agnostic to (i) how much overshoot/undershoot occurred before entering the band and (ii) whether the controller subsequently leaves the band again after the W-window has elapsed. Hence, a controller can register a small t_set by grazing the band early with aggressive moves yet sustain sizable steady-state jitter or even drift later; conversely, a more conservative controller can register a larger t_set while delivering substantially better long-horizon behavior.
Mechanistic explanation for Fuzzy–MAT3D's larger t_set. The fuzzy PoU front-end enforces global Lipschitzness on the actor/critics (Theorem 1) and, together with target smoothing, yields a locally non-expansive fixed-policy operator (Proposition 1); moreover, the twin-critic minimum introduces a small, correlation-dependent negative bias (Lemma 1). These ingredients reduce TD-target variance and damp fast transients, but they also make the closed loop deliberately conservative right after abrupt shading steps, prioritizing a monotone approach over rapid excursions. In short, Fuzzy–MAT3D trades a few seconds of responsiveness for markedly improved stability and bias robustness.
Three lines of evidence—empirical, statistical, and control-theoretic—support Fuzzy–MAT3D despite its larger t_set:
Dominant long-horizon energy capture. Across the CRN-blocked study (140 runs per algorithm), Fuzzy–MAT3D achieves the highest mean MPPT efficiency (92.0%), substantially above MAT3D (80.1%) and MADDPG; CRN-paired tests yield large effects (Cohen's d) and essentially zero q-values (BH–FDR). Thus, any energy loss from a slower transient is more than offset by sustained operation near the MPP over the full horizon (Table 6, Table 7 and Table 8).
Much lower steady-state jitter. Figure 5 shows the steady-state oscillation of Fuzzy–MAT3D tightly concentrated near zero, whereas MAT3D and MADDPG display broad, heavy-tailed distributions. The RL-only aggregate reports 1.36 W for Fuzzy–MAT3D vs. 37.96 W for MAT3D, with MADDPG similarly dispersed. Lower jitter not only improves energy capture but also reduces switching stress and thermal cycling in the power stage (Table 9).
Design intent: stability margins over aggressiveness. Theoretically, the fuzzy-induced Lipschitz constant L_Φ̄ improves ISS gains, while the min-ensemble plus target smoothing controls overestimation and high-frequency actuation. This combination is expected to enlarge decay margins but reduce "snap-to-setpoint" behavior—precisely the speed–stability trade-off seen in Figure 6.
The distribution in Figure 6 shows Fuzzy–MAT3D with a higher median t_set and a long right tail driven by the most abrupt step cases, which is consistent with its conservative transient policy. Yet, when read jointly with Figure 5 (steady-state oscillation) and the efficiency summaries (Figure 3 and Figure 4; Table 6 and Table 9), the picture is unequivocal: Fuzzy–MAT3D sits on a better Pareto front—maximizing energy and minimizing jitter—while conceding some transient speed. In applications like PV MPPT under PSCs, where (i) steps are intermittent and (ii) the objective is integral energy over minutes to hours, this Pareto choice is the correct one.
The ANOVA and CRN-paired tests in Table 6, Table 7 and Table 8 indicate that the superiority of Fuzzy–MAT3D over MAT3D and MADDPG is statistically significant across all seven scenarios, with very small p-values and large paired effect sizes.
Because the design includes both static PSC profiles and step-change scenarios, the reported efficiency should be interpreted as a robust average over a representative family of practically relevant shading patterns.
We do not claim universal optimality beyond this family, but the ISS and Lipschitz analysis suggest that the qualitative advantages of Fuzzy–MAT3D should persist under other smooth, slowly varying PSC profiles; extending the experimental design to more aggressive, rapidly varying shading remains an interesting direction for future work.
Because t_set declares success after any continuous W-window within the band, it cannot penalize later departures from the band. This explains why algorithms with aggressive, oscillatory responses can display deceptively small t_set while still underperforming in energy and stability. Our CRN-blocked analysis therefore treats t_set as a secondary indicator to be interpreted alongside efficiency and steady-state metrics (Figure 5, Table 6, Table 7, Table 8 and Table 9).
Fuzzy–MAT3D is intentionally conservative around abrupt changes; this yields a larger settling time but confers superior stability and decisively better energy tracking. In the aggregate, and for the operational goals of MPPT under partial shading, Fuzzy–MAT3D’s trade-off is the desirable one.
Concretely, comparing the aggregate statistics in Table 6 and Table 9, Fuzzy–MAT3D sacrifices about 6 s of additional settling time on average (7.54 s vs. 1.57 s for MAT3D) in exchange for roughly 12 percentage points in MPPT efficiency (92.0% vs. 80.1%) and a reduction of about 36.6 W in steady-state jitter (1.36 W vs. 37.96 W), which is an advantageous trade-off for energy-centric applications.
As complementary, descriptive evidence, Appendix A (Figure A1) reports Tukey–Kramer confidence intervals for all pairwise algorithmic contrasts. Consistent with the CRN-blocked summaries and directionally aligned with the confirmatory blocked-Dunnett analysis, these intervals place Fuzzy–MAT3D above both MAT3D and MADDPG in mean MPPT efficiency (pairwise CIs against Fuzzy–MAT3D exclude zero), with the largest deficit observed for MADDPG. We therefore treat this panel as supportive context for the superiority ordering rather than an independent inferential claim.
4.4. RL-only Aggregate Across Replications
Table 9 aggregates RL-only results across CRN-blocked replications, showing that Fuzzy–MAT3D attains the highest mean MPPT efficiency with markedly lower steady-state oscillation than MAT3D and MADDPG. Conversely, Fuzzy–MAT3D exhibits a larger settling time t_set, reflecting the study's speed–stability trade-off and the conservative regulation induced by the fuzzy PoU front-end.
The empirical distribution of MPPT efficiency across all CRN-blocked replications is shown in Figure 7, complementing the notched boxplots and mean–CI panels by revealing the full shape and tails of the distributions. This panel contextualizes central-tendency summaries with dispersion and skewness across scenarios and seeds.
4.5. Case Study: Step Disturbance
As shown in Figure 8, after the step at the common change time, Fuzzy–MAT3D approaches the new MPP monotonically and maintains it with negligible steady-state jitter, whereas MADDPG overshoots and develops sustained oscillations that depress the average power; MAT3D converges slowly and under-tracks the MPP. The lower panel explains the mechanism: Fuzzy–MAT3D rapidly desaturates the duty cycle and then holds an almost constant command, while the other agents continue exciting the plant—consistent with the ISS-based stability rationale in Section 2.3.
6. Discussion
Our working hypothesis was that inserting a differentiable fuzzy partition of unity in front of the actor–critic would (i) enforce global input Lipschitzness and lower TD-target variance, (ii) render fixed-policy evaluation non-expansive with the correct contraction factor γ, (iii) enlarge the closed-loop ISS margin, and (iv) convert constant stepsizes into explicit steady-state neighborhoods; these mechanisms are formalized in Section 2.2.
Empirically, the resulting controller attains higher MPPT efficiency with markedly lower steady-state jitter while accepting a more conservative transient—precisely the speed–stability trade-off expected from the theory and consonant with prior observations that classical P&O/INC and PSO approaches tend to exchange rapid steps for oscillatory behavior under PSCs, whereas unregularized RL baselines (e.g., MAT3D and MADDPG) can amplify critic noise into unstable actuation. In the broader context of learning-based power electronics, the fuzzy layer acts as a structural regularizer that improves actor–critic conditioning and yields closed-loop behavior aligned with energy-centric objectives and hardware stress constraints, thereby offering a principled alternative to ad hoc damping or heuristic dithering. These interpretations are consistent with the controlled CRN-blocked study and the theory-first analysis reported herein.
We establish stationarity (not global optimality) and adopt a replay idealization; nonetheless, parameter projections, clipping, and bounded target noise narrow the assumption–implementation gap. The residual bounds depend on distribution shift (Remark 1), and the practical ISS gains in Section 2.3 inherit the usual modeling idealizations.
Future Work
Building on the present results, future work will (i) close the distribution-shift loop by learning replay/behavior policies that better align ρ and its pushforward, tightening the constants in the projected Bellman bounds (Proposition 3); (ii) automate the partition design (number of memberships and temperature) and study its effect on variance and twin-critic correlation, building on Proposition 2 and Lemma 1; (iii) couple ISS-style margins (Section 2.3) with barrier certificates and latency/quantization models for hardware-level safety guarantees; (iv) extend the CTDE analysis to larger N and partial observability, and benchmark against entropy-regularized and trust-region variants under identical sensing/compute budgets; and (v) translate the energy-centric gains to longer horizons and field deployments, including degradation, sensor drift, and grid-level constraints.
Beyond the specific two-module string considered here, the fuzzy-regularized CTDE–TD3 architecture is applicable to other domains where (i) the dynamics admit a compact operating envelope and (ii) safety or comfort requirements demand smooth closed-loop responses. Examples include cooperative voltage control in DC microgrids, coordinated charging of electric-vehicle fleets, and frequency regulation with distributed energy resources, where the PoU structure can encode network topology or operating regions. From the scaling viewpoint, Theorem 4 already shows that the Lipschitz moduli of the N-agent extension grow in a controlled way under block norms, so that Fuzzy–MAT3D can in principle be deployed on larger PV arrays by assigning one agent per string or module cluster. In such settings, additional work is needed to account for partial observability and communication constraints, but the core Lipschitz and ISS guarantees remain valid.