2.2. Mathematical Formulation
Let
denote the problem dimension, and let
be the decision vector, where
is the irrigation depth (mm) applied at stage
i of a synthetic growing season. The feasible set is the hyper-rectangle
The only hard constraints imposed on the problem are the box bounds on the decision variables. The internal storage state is maintained within admissible bounds through the clipping operator in the storage transition equation, which acts as an implicit state constraint enforced by construction rather than by external penalization. All remaining operational constraints are treated as soft penalties incorporated directly into the objective function: the seasonal water budget is softly constrained through , which penalizes deviation from the target , peak hydraulic demand is softly constrained through and terminal storage is regularized through . This soft-constraint formulation is deliberate, as it preserves the box-constrained structure of the search domain while embedding operational feasibility requirements into the landscape topology of the objective.
The season length is fixed at 120 days, so the stage duration is
and the normalized stage coordinate is
Recommended difficult dimensions are
, with
and
suitable for routine experiments and
or
or
producing substantially harder landscapes for population-based optimizers.
To make the benchmark fully deterministic and self-contained, all exogenous weather and crop coefficients are generated analytically from the normalized stage coordinate
. No stochastic, scenario-based, or externally sampled weather realizations are involved: every weather profile is a fixed, closed-form trigonometric function of
that produces identical values for any given dimension
D. This design choice eliminates all sources of exogenous randomness from the benchmark, ensuring that any variability in optimization outcomes is attributable solely to the stochastic search process of the optimizer and not to weather input uncertainty. The specific functional forms are chosen to reproduce qualitative seasonal patterns (e.g., mid-season peaks in temperature and evapotranspiration, early-season dominance of rainfall) that are consistent with the general behavior documented in weather-aware scheduling studies [
17,
20], although they do not represent any particular geographic location or climate record. The stage-wise profiles are defined as:
and
The crop evapotranspiration demand follows the standard evapotranspiration-based construction
which grounds the benchmark in classical irrigation science [
4]. The coefficient
is used as a stage sensitivity factor in the relative yield-preservation component, consistent with the general idea of yield response to water stress [
19].
The first major source of difficulty is local decision coupling. A neighbor-coupled hydraulic load is introduced as
with boundary conventions
. Effective irrigation efficiency is then defined by the oscillatory nonlinear law
clipped to the interval
, and the effective irrigation becomes
These oscillatory factors are synthetic ruggedness generators. They are not intended to claim precise field physics, they are introduced to generate repeated attraction basins and to make coordinate-wise search substantially less effective.
The second major source of difficulty is path dependence through a synthetic soil-water storage state
. The initial storage is fixed by
The total water available at stage
i is
and the stage deficit and surplus are
Deep drainage is modeled by the thresholded term
recharge is defined as
evaporative loss is
and the storage transition is
This recursion makes the objective stateful: changing one variable can alter the effective optimization landscape seen by many subsequent variables.
The storage transition is nonlinear due to the and operators appearing in the drainage, surplus, deficit, and recharge terms. Rainfall and effective irrigation enter additively in the total available water while evapotranspiration demand is subtracted through the deficit-surplus split. The clipping to ensures that the storage state remains non-negative and bounded at every stage, guaranteeing feasibility of the dynamic system. The contraction factor 0.82 in the transition ensures that the autonomous dynamics (without recharge) are inherently stable, preventing unbounded state accumulation.
A multiplicative relative yield-preservation term is included to prevent the problem from collapsing into a pure cost minimization model. The relative water-satisfaction ratio is
the stage stress is
and the stage contribution to yield preservation is
The season-level relative yield factor is then
The multiplicative construction is deliberate: a few severely stressed stages can dominate many small local improvements, which makes the global trade off structure significantly more deceptive than in additive reward formulations [
15,
19,
21,
23].
The benchmark is posed as the minimization problem
where
To clarify the role and origin of each term, the eleven components of are organized into three categories according to their design rationale.
The first category contains terms that are directly motivated by agronomic and operational practice. The water-cost term
represents the direct economic cost of water application, which is a primary objective in any irrigation scheduling formulation [
17,
20]. The pumping-cost term
captures the energy cost associated with pressurized water delivery, a well-recognized component of irrigation operating expenses [
23]. The deficit penalty
penalizes unmet crop water demand, reflecting the agronomic principle that water stress reduces yield [
19]. The excess penalty
penalizes over-irrigation that leads to deep drainage and waterlogging, consistent with the water-balance perspective in which surplus water beyond the root zone is a resource loss [
18,
21]. The terminal storage penalty
encourages the preservation of soil moisture at the end of the season, a common objective in multi-season water management [
22]. The seasonal budget penalty
softly constrains total water use over the growing season, reflecting practical water-rights or allocation limits [
16]. The multiplicative yield factor
aggregates stage-wise crop productivity into a seasonal yield metric, following the classical yield response–water framework [
19].
The second category contains terms that model realistic operational concerns but whose specific functional forms are stylized. The smoothness penalty discourages abrupt changes in irrigation between consecutive stages, reflecting practical constraints on irrigation system ramping and scheduling regularity. The peak-load penalty penalizes excessive instantaneous hydraulic demand, which in practice can exceed the capacity of pumping and conveyance infrastructure. The neighbor-interaction penalty captures the operational coupling between adjacent irrigation stages, in which the cumulative load from consecutive high-application periods can create delivery bottlenecks.
The third category contains terms that are deliberately synthetic and are introduced to generate specific landscape difficulty features for benchmarking purposes. The resonance penalty
and the deceptive preferred-window penalty
are oscillatory and phase-shifted ruggedness generators designed to create dense local attractors and misleading basins that make the optimization landscape genuinely challenging for modern solvers. These terms have no direct agronomic counterpart, their role is purely to increase the benchmark’s difficulty in a controlled and ablatable manner, as verified experimentally in
Section 2.4.
The water-cost term is
and the pumping-cost term is
A weather-modulated shock factor is introduced by
and the deficit and excess penalties are given by
Temporal roughness and local resonance are imposed by
Neighbor-stage overloading and bilinear coupling are represented by
A deceptive preferred-window term is defined through
The terminal storage penalty is
the soft seasonal budget penalty uses
and is defined by
while the peak-load penalty is
The complete problem is therefore a box-constrained, deterministic, strongly non-separable, rugged, and partially nonsmooth minimization benchmark.
No closed-form global optimum is assumed known. This is intentional and is part of the benchmark design. The objective contains repeated oscillatory structures, threshold-triggered regime changes, clipped internal states, recursive storage dynamics, multiplicative stage aggregation, and coupled penalties. Consequently, the landscape contains many local minima and deceptive trade offs, and performance cannot be inferred from low-dimensional intuition alone. From the perspective of benchmark design, the difficulty comes from the simultaneous presence of six mechanisms: recursive state propagation, local oscillatory efficiency modulation, thresholded drainage and peak penalties, multiplicative yield aggregation, neighbor coupling through , and soft preferred-window structure that competes with the remaining terms instead of revealing the optimum. For these reasons, the benchmark is substantially harder than a smooth irrigation cost-yield trade off model while remaining interpretable as a weather-coupled irrigation scheduling problem.
Although no closed-form global optimum is available for the full benchmark formulation, the fully simplified variant
provides a tractable reference point for absolute performance calibration. In the
formulation, all embedded difficulty mechanisms are disabled and the resulting landscape is smooth and convex-like: as demonstrated in
Section 2.4, the methods arq, cmaes, lmcmaes, lracmaes, jso, and mlshaderl converge to identical objective values with 100% success rates and zero standard deviation across all dimensions, establishing a consensus empirical optimum that can serve as a lower bound reference for the full formulation. The optimality gap for any method on the full benchmark can therefore be estimated as the difference between its achieved objective value and the corresponding
consensus value, providing an absolute rather than purely relative measure of solution quality. The optimality gap for a given method on the full formulation is defined explicitly as
, where
is the best objective value achieved by the method across 30 independent runs on the full formulation, and
is the consensus reference value established by the subset of methods that solve the
variant to optimality with 100% success rates and zero standard deviation. This definition is solver-independent: the consensus
is not the average of all methods’
results but the value agreed upon by those solvers that reliably converge on the simplified landscape. Methods that fail to solve
such as gwo, ga, and clpso at large dimensions are therefore not used to establish the reference, and their optimality gaps are computed against the consensus established by the converging subset. This yields a conservative lower bound on their true optimality gap rather than an overestimate, since their
best values exceed
, making their apparent
gap smaller than it would be if measured against the true optimum. This approach is consistent with established practice in black-box benchmarking: the BBOB platform and the CEC suites both rely on inter-algorithm relative comparison as the primary performance metric when closed-form optima are not analytically available, and the availability of a tractable simplified variant with a verifiable consensus optimum provides an additional calibration layer that is rarely present in classical synthetic benchmarks.
For convenience,
Table 1 summarizes all parameters, auxiliary computed variables, and numerical coefficients appearing in the mathematical formulation of this benchmark.
2.4. Experimental Results
Table 3 summarizes the complete experimental results obtained by applying the ten competing optimization methods to the full benchmark formulation (
) and to each of its six ablation-based variants (
,
,
,
,
,
,
), across the five benchmark dimensions
. For every method-variant-dimension combination, four summary statistics are reported over 30 independent runs: the best objective value observed across all runs (Value), the mean final objective value at termination (Mean), the empirical success rate (Rate %), defined as the percentage of runs in which the method reached the best value found by any solver on that instance, and the standard deviation of the final objective values (SD). The performance indicator Rate therefore provides a run-level reliability measure that complements the point estimates given by Value and Mean. The table is organized by problem variant: each variant occupies a labeled block, and within each block the five dimensionality levels appear as consecutive rows. This structure allows a direct reading of both the cross-method comparison for a fixed variant and dimension, and the scaling behavior of each method as the problem dimension increases. Methods are listed in a fixed order throughout the table: arq [
34], clpso [
26], cmaes [
27], ga [
31], gwo [
32], jso [
24], lmcmaes [
28,
29], lracmaes [
30], mlshaderl [
25], and acor [
33]. The column structure within each dimension block is (Value, Mean, Rate, SD), repeated once per method. Results that correspond to a Rate of 100 indicate that the method reached the reference best solution in all 30 runs, a Rate of 3 represents the minimum observable rate under the 30-run protocol, indicating that the method succeeded in only one run.
Full problem formulation (). The results for the complete benchmark reveal a strongly differentiated and dimension-dependent performance landscape. At , the method jso emerges as the most reliable solver with a success rate of 40%, converging to the reference best value of approximately 1487.73. The method arq attains the same best value but only in 7% of runs. All remaining solvers clpso, cmaes, ga, gwo, lmcmaes, lracmaes, mlshaderl, and acor achieve the minimum success rate of 3%, indicating that even at the lowest recommended dimension, the benchmark is genuinely challenging for the majority of established optimizers. The standard deviations reveal a clear separation between methods: jso and arq maintain SD values below 2.5, while ga and gwo exhibit SD values exceeding 40 and 51 units, respectively, reflecting high run-to-run variability characteristic of methods that lack sufficient local exploitation capacity in rugged landscapes.
As the dimension increases to , the overall difficulty increases substantially. Only mlshaderl achieves a success rate above the minimum (10%), converging to the reference best value of approximately 2736.36. All other methods report success rates of 3%, with mean values deviating from the reference best by margins ranging from 8 units (arq) to 188 units (gwo). The rapid deterioration from to is attributable to the recursive storage propagation mechanism, which extends the effective dependency chain by six additional stages and thereby increases the probability that at least one stage accumulates sufficient error to deflect the solver from the global basin. This explanation is consistent with the substantially milder dimension scaling observed in the NORS variant, where the recursive storage is disabled.
At , the benchmark becomes markedly harder. The method acor achieves the lowest best value (approximately 6784.16), but with a success rate of only 3%. All ten methods report success rates of 3%, and the gap between best and mean values widens substantially: ga reports a mean of 7265.92 against a best of 6940.32, and gwo reports 7538.50 against 7193.55. These gaps indicate that most runs for these methods terminate in qualitatively different and substantially inferior regions of the landscape. The methods jso, mlshaderl, and acor maintain relatively tight standard deviations (below 19 units), suggesting that while they cannot reliably find the global best, they consistently locate competitive local basins a behavior consistent with their archive-based and distribution-estimation search mechanisms, which provide implicit memory of previously explored regions.
At , all methods achieve success rates of 3%. The best observed value (approximately 11,195.22, obtained by lracmaes) is followed by jso (11,214.25) and mlshaderl (11,246.26), while ga and gwo report best values exceeding 11,700 and 12,400, respectively. The standard deviations for ga (297.79) and gwo (380.82) are an order of magnitude larger than those of jso (40.04) and cmaes (30.74). This confirms that the simpler population recombination (ga) and swarm attraction (gwo) mechanisms fail to maintain sufficient search distribution precision in a 70-dimensional landscape characterized by coupled oscillatory ruggedness and multiplicative yield aggregation.
At the most challenging setting, , the performance hierarchy among methods remains visible despite all methods achieving 3% success rates. The method cmaes achieves the lowest best value (approximately 18,161.57), followed by jso (18,203.71) and mlshaderl (18,213.28). The emergence of cmaes as a competitive solver at the highest dimension, despite its poor reliability, is explained by its covariance matrix adaptation mechanism, which can capture local second-order structure in the landscape and exploit it for rapid descent when initialized near a promising basin. This capability becomes relatively more valuable as the number of competing local attractors grows with dimension. By contrast, ga and gwo report best values exceeding 19,260 and 20,124, respectively, with standard deviations above 398 and 712, indicating near-complete loss of search coherence at this scale.
Simplified formulation with all difficulty mechanisms disabled (). In sharp contrast with the variant, the formulation is solved to optimality by the large majority of methods across all dimensions. At D = 24, six out of ten methods arq, cmaes, jso, lracmaes, mlshaderl, and acor achieve success rates of 100% with standard deviations that are numerically indistinguishable from zero. The reference best values for scale moderately with dimension, from approximately 458.71 at to 5087.80 at . Even at , arq, cmaes, and jso maintain 100% success rates, while mlshaderl achieves 70%. The clpso, ga, and gwo methods are the only consistent exceptions, achieving rates of 3% across all dimensions with means that deviate from the reference best by increasingly large margins. These results demonstrate unambiguously that, in the absence of all embedded difficulty mechanisms, the benchmark collapses to a smooth, tractable problem that most modern continuous optimizers can solve reliably, and that the difficulty reported in the ALL variant is genuinely structural rather than numerical.
Formulation without recursive storage (). Removing the recursive soil-water storage propagation restores some problem tractability compared to the ALL variant, but the problem remains substantially harder than . At , jso and mlshaderl both achieve success rates of 63%, while arq and acor attain 20% each. At , success rates decline to 7% for jso, lracmaes, mlshaderl, and acor. At higher dimensions (D = 50–100), all methods except acor and lracmaes (7% at ) report success rates of 3%. The persistence of low success rates even without recursive storage confirms that the remaining mechanisms particularly the oscillatory efficiency modulation and the multiplicative yield aggregation are independently sufficient to create a challenging landscape. From an agronomic perspective, the recursive storage mechanism represents the temporal coupling that is fundamental to real irrigation scheduling: the soil moisture state at any point in the growing season depends on all prior irrigation decisions, creating a sequential dependency that cannot be decomposed into independent stage-wise choices. Its removal simplifies the problem by making each stage’s contribution to the objective approximately independent of preceding decisions.
Formulation without oscillatory efficiency (). This ablation produces the most dramatic improvement in solver performance across the entire experimental campaign, confirming that the oscillatory efficiency mechanism is the single most consequential source of landscape difficulty. At , four methods arq, jso, mlshaderl, and acor achieve success rates of 100% with zero standard deviation, while cmaes achieves 97%. At , jso and mlshaderl maintain 100% success rates, cmaes achieves 97%, arq 83%, and lracmaes 73%. Even at , cmaes achieves a success rate of 53%, jso 30%, and mlshaderl 17% results that are dramatically better than the 3% rates observed for all methods in the variant at the same dimension. The structural interpretation of this finding is clear: the oscillatory modulation of , which depends nonlinearly on both the local decision and the neighbor-coupled load , introduces a dense lattice of local attractors that are energetically competitive with the global basin. When this mechanism is removed, the residual landscape formed by the water cost, pumping cost, deficit and excess penalties, and storage terms is sufficiently smooth to permit gradient-like local improvement strategies to succeed over repeated restarts. The practical implication for benchmark users is that the oscillatory efficiency amplitude (controlled by the parameter 0.22 in the default formulation) is the primary lever for adjusting the overall difficulty of the benchmark.
Formulation without threshold losses (). The removal of threshold-triggered drainage losses produces moderate improvement relative to . At , jso achieves a success rate of 50% and mlshaderl 17%. At , jso reaches 97%, mlshaderl 73%, arq and cmaes both 30%, and lracmaes 23%. At , jso (47%), mlshaderl (37%), arq (27%), and cmaes (27%) all exceed the minimum rate. However, at and , all methods collapse to 3%. These results indicate that the threshold drainage mechanism, which introduces nonsmooth regime changes when the water surplus exceeds a fraction of the storage capacity, contributes meaningfully to the difficulty of the intermediate-dimensional regime but is not the primary driver of optimizer failure at large scales. In agronomic terms, threshold drainage corresponds to the physical phenomenon of deep percolation beyond the root zone, which occurs abruptly once the soil water content exceeds field capacity a nonlinearity that is genuinely present in real irrigation systems and that makes the optimization landscape locally nonsmooth.
Formulation without multiplicative yield aggregation (
). This variant replaces the multiplicative seasonal yield factor
with an additive average. At
, jso achieves a success rate of 37%, while arq, cmaes, and mlshaderl each achieve 7%. At
and above, all methods report success rates of 7% or lower, with the exception of mlshaderl and acor at
(both 7%). The improvement relative to
is modest at low dimensions and negligible at high dimensions, indicating that multiplicative yield aggregation amplifies the deceptive structure of the objective but is not the sole driver of difficulty. The agronomic rationale for the multiplicative construction is that crop yield is a cumulative outcome of the entire growing season: a single period of severe water stress can irreversibly damage the crop, and this damage cannot be compensated by excess water at other stages. The multiplicative form captures this non-compensatory character, which is consistent with the classical Doorenbo–Kassam yield-response model [
19].
Formulation without neighbor coupling (). The removal of the neighbor-coupling term produces a formulation in which each stage’s hydraulic load depends only on its own decision variable. At , arq and jso both achieve success rates of 10%, while all other methods report 3%. At , arq and acor achieve 7%, and all others 3%. At higher dimensions, all methods report 3%. The overall improvement relative to ALL is modest, confirming that neighbor coupling contributes to non-separability but is not the dominant difficulty mechanism. The relatively small effect of removing neighbor coupling is consistent with the observation that the oscillatory efficiency and multiplicative yield mechanisms create difficulty through fundamentally different channels: local ruggedness and global aggregation deceptiveness, respectively. By contrast, neighbor coupling primarily introduces mild inter-variable dependencies that modern population-based methods can partially accommodate through their implicit modeling of variable correlations.
Formulation without preferred-window regularization (). This final ablation removes the deceptive preferred-window penalty . At , arq achieves a success rate of 17% and jso 7%, while all other methods report 3%. At , jso achieves 17%, arq, lracmaes, and mlshaderl 7%, and all others 3%. At higher dimensions, all methods collapse to 3%. The best observed values for this variant are systematically lower than those for ALL (e.g., approximately 6681.29 vs. 6784.16 at ), confirming that the preferred-window term acts as a genuine penalty that shifts the global optimum relative to the attractors created by the remaining terms. The modest improvement in success rates indicates that, while the preferred-window mechanism does introduce additional deceptive structure, its contribution to the overall difficulty is smaller than that of the oscillatory efficiency or the recursive storage mechanisms.
Cross-variant and cross-method synthesis. Taken together, the ablation study reveals a clear hierarchy of mechanism importance. The oscillatory efficiency modulation is unambiguously the dominant source of landscape difficulty: its removal enables multiple solvers to achieve 100% success rates at dimensions where the full formulation yields universal 3% rates. The recursive storage propagation and threshold drainage mechanisms occupy the second tier, each contributing meaningful but less dramatic difficulty that primarily affects the intermediate-dimensional regime (D = 30–50). The multiplicative yield aggregation, neighbor coupling, and preferred-window regularization contribute more modestly, with their effects becoming largely indistinguishable from the baseline difficulty at high dimensions. This ordering has a structural interpretation: the benchmark’s hardness is primarily driven by local landscape ruggedness (oscillatory efficiency) and temporal state coupling (recursive storage), rather than by global aggregation structure or inter-variable connectivity. Crucially, no single ablation restores the problem to the triviality of the fully simplified variant, confirming that the benchmark’s difficulty is genuinely multi-mechanism in character.
Among the ten competing methods, jso exhibits the strongest overall performance in the full formulation, achieving the highest success rate at (40%) and competitive best values across all dimensions. The methods lracmaes, mlshaderl, and acor follow closely, with lracmaes achieving the best observed value at , mlshaderl’s historical archive mechanism and acor’s continuous probability density estimation providing complementary advantages for navigating the benchmark’s coupled structure. At the highest dimension (), cmaes emerges as a competitive solver in terms of best observed values, despite its low success rate, reflecting the value of second-order covariance information in very high-dimensional rugged landscapes. The methods ga and gwo consistently occupy the lowest performance tier across all variants and dimensions, with standard deviations that are typically 5–20 times larger than those of the best-performing methods. Their simpler search mechanisms tournament-based recombination in ga and hierarchical swarm leadership in gwo are fundamentally insufficient for the combined non-separability, state recursion, and oscillatory structure of the proposed benchmark.
Table 4 reports the mean wall-clock time per run for each method across all five dimensions. The benchmark’s objective function is inexpensive to evaluate: for eight of the ten methods, the time per run ranges from 0.22 s at
to approximately 1.6 s at
on the reference hardware. The exceptions are cmaes and lracmaes, both of which exhibit substantially higher wall-clock times that grow with dimension: cmaes from 2.53 s at
to 380.21 s at
, and lracmaes from 4.87 s at
to 800.07 s at
, reflecting the
cost of their covariance matrix adaptation mechanisms rather than the benchmark’s evaluation cost. A complete 30-run experimental campaign at
therefore requires less than one minute for most methods, confirming that the benchmark is computationally lightweight and suitable for routine algorithm development, with the caveat that covariance-based methods incur substantially higher per-run costs at large dimensions.
A cost-benefit analysis of the covariance-based methods at large dimensions reveals an important practical consideration for benchmark users. At
, lracmaes requires approximately 800s per run, yielding a total experimental cost of
24,000 s (approximately 6.7 h) for a single 30-run campaign, while cmaes requires approximately
11,400 s (approximately 3.2 h). By contrast, all remaining eight methods complete their 30-run campaigns at
in under one minute in total. Despite this substantial computational investment, lracmaes and cmaes do not achieve qualitatively superior performance at
relative to jso and mlshaderl, which require only seconds per full campaign: as shown in
Table 3, all four methods achieve success rates of 3% at
, and cmaes achieves the lowest best value only marginally ahead of jso. This cost-benefit asymmetry motivates a practical recommendation: for benchmark users employing methods with super-linear scaling in D, such as any covariance matrix adaptation variant,
is recommended as the default experimental setting. At
, lracmaes and cmaes complete 30 runs in approximately 30 and 14 min, respectively—a computationally tractable investment while still exposing the full multi-mechanism difficulty of the benchmark—as evidenced by the universal 3% success rates reported in
Table 3. The dimensions
and
remain available for studies specifically targeting the scaling behavior of covariance-based methods, with the understanding that the associated computational cost scales as
and may require dedicated hardware resources. For gradient-based solvers, surrogate-assisted methods, and reinforcement-learning approaches identified as directions for future work in
Section 4 the benchmark’s evaluation cost is negligible (well below 1 s per evaluation for all dimensions), making the full range
computationally accessible without restriction.
The recommended upper dimension of
is intentional and sufficient for the purposes of this benchmark. As demonstrated in
Section 2.4, all ten competing methods achieve success rates of
at
, representing the minimum observable rate under the 30-run protocol. This universal solver failure indicates that the benchmark has already reached a difficulty ceiling at
: extending to
or
would not produce qualitatively new algorithmic findings but would instead amplify the computational cost of the covariance-based methods (notably cmaes, whose wall-clock time scales as
, reaching 380 s per run at
) to levels that preclude routine experimental use. Furthermore, the combinatorial growth argument presented in
Section 3 establishes that the effective number of local attractors scales as
, yielding approximately
at
, a landscape complexity that is already far beyond the exhaustive coverage capacity of any population-based solver. The benchmark is therefore not limited by its upper dimension but by the fundamental intractability of its landscape structure, which manifests fully within the recommended range
.
2.5. Parameter Sensitivity Analysis
To assess how sensitive the benchmark’s optimization landscape is to its internal coefficients, a one-at-a-time perturbation study was conducted on six parameters that directly control the primary difficulty mechanisms of the benchmark. These parameters were selected because the ablation study in
Section 2.4 identified the oscillatory efficiency and multiplicative yield mechanisms as the two most consequential sources of landscape difficulty, and the six parameters studied here are the numerical coefficients that directly modulate these mechanisms. These parameters are: the oscillatory efficiency amplitude (default 0.22), which governs the depth of the sinusoidal modulation in
, the resonance penalty weight (default 8.0), which scales the local ruggedness term
, the oscillation frequency base and amplitude (defaults 0.22 and 0.09), which jointly determine the density of local attractors through
, and the yield-sensitivity base and amplitude (defaults 0.85 and 0.55), which control the stage-wise crop stress coefficient
. Each parameter was varied independently across its tested range while all others were held at their default values. For each setting, 30 independent runs of mlshaderl were executed at
, and the mean best objective value, standard deviation, minimum, and maximum were recorded. The main effect of each parameter is defined as the difference between the highest and lowest mean best values observed across its tested range.
Table 5 reports the complete results, and
Figure 1 visualizes the sensitivity profiles.
The one-at-a-time perturbation protocol adopted in this study is a deliberate and well-established choice for benchmarks whose primary goal is mechanism transparency rather than global parameter optimization. Single-factor variation isolates the individual contribution of each coefficient to the observed landscape difficulty, a property that would be obscured by factorial or Latin hypercube designs in which multiple parameters vary simultaneously. This interpretability is essential for benchmark users who wish to adjust a specific difficulty mechanism without inadvertently altering others. The analysis was conducted at
because this dimension lies at the transition between partial and universal solver failure in the full formulation, making it the most sensitive and informative setting for detecting landscape changes induced by parameter perturbations. The use of mlshaderl as the reference solver is justified by its consistent position among the top-performing methods across all dimensions and variants in
Section 2.4, its low run-to-run variability (SD below 15 units at
), and its archive-based mechanism, which provides stable convergence behavior that makes mean best objective values a reliable proxy for landscape difficulty rather than a reflection of solver-specific search artifacts.
The main effect column in
Table 5 provides a clear ranking of parameter importance. The oscillatory efficiency amplitude exhibits by far the largest main effect (397.27), confirming quantitatively the conclusion drawn from the ablation study: this single coefficient is the dominant lever controlling the benchmark’s difficulty. As the amplitude increases from 0 to 0.33, the mean best objective value decreases monotonically from 2899.34 to 2502.08, reflecting the fact that stronger oscillatory modulation reduces the effective irrigation efficiency and thereby shifts the entire cost-yield trade off toward lower absolute objective values. The standard deviation remains stable across all settings (between 5.88 and 18.21), indicating that the parameter affects the location of the optimal basin rather than the reliability of convergence. The resonance penalty weight ranks second with a main effect of 203.05. Its influence is monotonically increasing: higher weights raise the mean best value from 2636.94 (at weight 5) to 2839.99 (at weight 11), consistent with the expected behavior that a stronger resonance penalty adds cost to all feasible solutions proportionally. The remaining four parameters oscillation frequency base (main effect 18.19), oscillation frequency amplitude (14.81), yield-sensitivity amplitude (5.05), and yield-sensitivity base (3.83) exhibit substantially smaller main effects, indicating that the benchmark’s landscape structure is robust to moderate perturbations of these coefficients. In particular, the near-negligible sensitivity to
base and amplitude (main effects below 5.1) confirms that the specific numerical values of the crop stress coefficients have only a marginal influence on the achievable objective values at this dimension. The multiplicative yield mechanism contributes to the deceptive structure of the objective primarily through its aggregation form, not through the precise values of its coefficients.
Figure 1 visualizes the sensitivity profiles reported in
Table 5. The top-left panel confirms the dominant and monotonically decreasing relationship between the oscillatory efficiency amplitude and the mean best objective value. The top-right panel shows the monotonically increasing effect of the resonance weight. The middle panels reveal that the oscillation frequency parameters produce only mild, non-monotonic variations within a narrow band, suggesting that the density of local attractors is relatively stable across the tested frequency range. The bottom panels confirm the near-flat sensitivity to the yield-sensitivity coefficients, with variations that fall within the inter-run standard deviation and are therefore not statistically distinguishable from noise.
To assess whether the parameter importance ranking is consistent across dimensions, the analysis was extended to
,
Table 6 and
Figure 2 report results for both dimensions. The ordinal ranking is preserved: the oscillatory efficiency amplitude retains its dominant position (main effect 962.89 at
, versus 397.27 at
), the resonance penalty weight ranks second at both dimensions (356.97 and 203.05, respectively), and the oscillation frequency parameters remain negligible. The yield-sensitivity base and amplitude, which exhibit small non-zero effects at
(3.83 and 5.05), collapse to exactly zero at
, reflecting the structural property that at this dimension the shorter stage window (
= 2.4 days) allows high-quality solvers to consistently find near-zero-stress solutions, at which point the
coefficient vanishes from the stage yield computation. This dimensional transition is consistent with the ablation findings and does not reduce benchmark difficulty: all ten methods achieve 3% success rates at
.
2.6. Comparative Difficulty Assessment on Classical Benchmarks
To position the difficulty of the proposed benchmark relative to established test functions, the nine optimizers that are common to both experiments arq, clpso, cmaes, ga, jso, lmcmaes, lracmaes, mlshaderl, and acor were evaluated on four classical continuous optimization benchmarks at and under the same experimental protocol (30 independent runs, identical evaluation budget). The four benchmark functions were selected to represent distinct difficulty profiles that are individually present in the proposed irrigation benchmark but rarely combined in a single classical test function.
The Ackley function [
35] is a widely used multimodal benchmark with a nearly separable structure and an exponentially decaying envelope that creates a broad basin surrounding the global optimum. It was selected because its multimodal character provides a baseline for assessing how the solvers handle oscillatory local structure in the absence of inter-variable coupling. The Rastrigin function [
36] is a strongly multimodal, separable function whose cosine-based oscillatory landscape produces a dense lattice of local minima a structure that is directly analogous to the oscillatory efficiency modulation
embedded in the proposed benchmark. Its inclusion allows a controlled comparison between separable and non-separable oscillatory difficulty. The Rosenbrock function [
37] is a unimodal but non-separable function characterized by a narrow curved valley, making it a standard test for the ability of optimizers to follow non-separable gradient structure. It was selected to isolate the contribution of non-separability to search difficulty, independent of multimodality. The Schwefel function [
38] features a deceptive global structure in which the second-best local minimum is geometrically distant from the global optimum, creating a landscape that misleads solvers toward suboptimal regions. This deceptive character is analogous to the preferred-window penalty
and the multiplicative yield aggregation
in the proposed benchmark.
Table 7 reports the complete results.
The results in
Table 7 reveal a striking contrast between the difficulty of the classical benchmarks and that of the proposed irrigation benchmark. On Ackley at
, seven out of nine solvers (arq, cmaes, jso, lmcmaes, lracmaes, mlshaderl, and acor) achieve success rates of 100%, and cmaes, jso, lmcmaes, lracmaes, and mlshaderl maintain 100% at
. This confirms that multimodality alone, when combined with near-separability and a smooth global envelope, is readily handled by modern continuous optimizers. The Rastrigin function, which presents a denser lattice of local minima, is more challenging: at
, cmaes, lmcmaes, and lracmaes achieve 100%, while arq drops to 10% and most other methods fall to 3%. Nevertheless, the best-performing solvers maintain high reliability, indicating that separable oscillatory structure, even when dense, does not constitute a fundamental barrier for methods with adequate covariance adaptation or restart mechanisms.
The Rosenbrock function introduces non-separability without multimodality. At , arq, jso, and lmcmaes achieve 100%, while mlshaderl achieves 90% and cmaes 87%. At , cmaes maintains a success rate of 93%, demonstrating that non-separable valley structure alone remains tractable for covariance-based methods. The Schwefel function presents the closest difficulty profile to the proposed benchmark among the four classical functions: at , only mlshaderl achieves a meaningful success rate (40%), and at , mlshaderl drops to 10% and all other methods to 7% or below. The deceptive global structure of Schwefel thus represents a genuine challenge, but one that is still partially navigable by archive-based methods.
By comparison, the proposed irrigation benchmark at yields a maximum success rate of only 10% (mlshaderl), and at , all nine methods achieve the minimum rate of 3% without exception. This represents a qualitative step beyond even the Schwefel function in terms of search difficulty. The explanation lies in the simultaneous presence of mechanisms that the classical functions present only in isolation: the proposed benchmark combines the dense oscillatory local structure of Rastrigin (through ), the non-separable coupling of Rosenbrock (through the recursive storage and the neighbor load ), the deceptive global structure of Schwefel (through and the multiplicative , and a stateful path dependence that is entirely absent from all four classical functions. This combination creates a landscape in which no single algorithmic capability whether covariance adaptation, archive-based memory, or probability density estimation is sufficient to ensure reliable convergence, confirming that the proposed benchmark occupies a difficulty regime that is not adequately represented by existing classical test functions.