This section demonstrates the effectiveness of the proposed multi-objective NSGA-II algorithm for robust optimal constrained mixture design, evaluated through two case studies: a concrete formulation problem and an industrial glass formulation problem. To assess its performance, we benchmarked NSGA-II against five alternative design generation methods across four different design sizes in each case study: (i) a genetic algorithm (GA), (ii) GA with arithmetic mean-based desirability (GA-AW), (iii) GA with geometric mean-based desirability (GA-GW), (iv) Design-Expert 23 (DX), and (v) the Fedorov exchange algorithm (FD). The number following each design name indicates the number of design points.
Prior to implementing the GA-based methods, preliminary tuning of the genetic parameters was conducted. Multiple parameter sets were evaluated within practical ranges specified by the experimenter, and the best-performing configuration was identified as the one yielding the highest and most stable objective function values across later iterations. Convergence behavior was also examined to ensure that all algorithms achieved stable solutions. To balance exploration and refinement, larger values of the genetic parameters were applied during the early search phase to promote broad coverage of the design space, followed by reduced values after 2500 and 3000 iterations for the concrete formulation and glass chemical durability case studies, respectively, to accelerate convergence toward high-quality solutions. All algorithms were implemented in MATLAB (version R2023b) and executed on a 2.00 GHz Intel® Core™ i9 processor; the reported CPU times correspond to this hardware configuration.
5.1. Example 1: The Concrete Formulations Study
We examined an example from Santana et al. [53], whose objective was to determine the optimal self-compacting geopolymer formulation that maximized performance while minimizing cost. The mixture design included six components: metakaolin (MK), sodium hydroxide (NaOH), alternative sodium silicate (ASS), sand, water, and superplasticizer. The bounds for each component are:
To account for practical manufacturing variability, the design incorporated component-specific tolerances, consistent with empirical studies on flowability, stress, and viscosity. The relationship between mixture composition and durability was modeled using the Scheffé quadratic mixture model:
$$E(y) = \sum_{i=1}^{6} \beta_i x_i + \sum_{i=1}^{5} \sum_{j=i+1}^{6} \beta_{ij} x_i x_j,$$
where $x_i$ denotes the proportion of the $i$-th component. The model includes six linear terms and $\binom{6}{2} = 15$ pairwise interactions. This formulation was also adopted by Santana et al. [54] to capture both individual component effects and synergistic interactions that influence rheological and mechanical properties.
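As a minimal sketch of how one mixture point expands into the terms of a Scheffé quadratic model, the snippet below builds the linear and pairwise cross-product terms for a six-component blend. The function name `scheffe_quadratic_terms` and the example proportions are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import comb


def scheffe_quadratic_terms(x):
    """Expand one mixture point into Scheffé quadratic model terms.

    Returns the q linear blending terms followed by the q-choose-2
    pairwise products x_i * x_j (i < j). There is no intercept.
    """
    linear = list(x)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return linear + cross


# Hypothetical six-component blend (proportions sum to 1); not from the study.
point = [0.20, 0.10, 0.15, 0.40, 0.10, 0.05]
row = scheffe_quadratic_terms(point)

# 6 linear terms + C(6, 2) = 15 cross-product terms.
print(len(row), 6 + comb(6, 2))
```

Stacking one such row per design point gives the model matrix on which D-, A-, G-, and IV-type criteria are usually evaluated.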
To assess the effectiveness of the proposed NSGA-II approach, we compared it with five alternative design generation methods for producing designs with 22 to 25 runs.
Figure 2 illustrates the Pareto fronts of nominal D-efficiency against the 10th-percentile robust D-efficiency (R-D10). Gray points represent all candidate designs, while red points indicate the non-dominated solutions obtained by NSGA-II.
For the 22-run (Figure 2a), 23-run (Figure 2b), and 25-run (Figure 2d) cases, NSGA-II identified a single Pareto-optimal solution in each scenario. These solutions occupy the upper-right region of the trade-off space, demonstrating strong performance in both nominal D-efficiency and R-D10. Their clear separation from the dominated solutions confirms that NSGA-II effectively captures the balance between efficiency and robustness, even with limited run sizes. In contrast, the 24-run case (Figure 2c) yielded two Pareto-optimal solutions: one emphasizing slightly higher nominal D-efficiency and the other favoring stronger robust D-efficiency under tolerance perturbations. The presence of multiple non-dominated solutions highlights the ability of NSGA-II to uncover alternative robust strategies, providing flexibility in selecting designs depending on whether nominal efficiency or robustness is prioritized.
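The notion of a non-dominated solution used above can be illustrated with a short filter over (nominal D-efficiency, R-D10) pairs, both maximized. This is only a sketch of Pareto dominance; it is not the NSGA-II sorting routine used in the study, and the candidate values are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Both objectives are maximized. Point q dominates p when q is at least
    as good in both objectives and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (nominal D-efficiency, R-D10) pairs on an arbitrary scale.
candidates = [(0.90, 0.70), (0.88, 0.75), (0.85, 0.72), (0.80, 0.60), (0.90, 0.65)]
front = pareto_front(candidates)
print(front)
```

With these toy values, two candidates survive: one with the higher nominal value and one with the higher robust value, which mirrors the kind of trade-off reported for the 24-run case.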
Table 2 provides detailed comparisons of the candidate designs across multiple statistical efficiency metrics, including D-efficiency, R-D10, A-efficiency, G-efficiency, IV-efficiency, and variance-related measures (mean SPV and maximum SPV). Across all run sizes, NSGA-II consistently demonstrated competitive or superior performance relative to GA, DX, and FD designs. For example, in the 22-run case, the NSGA-II design achieved the highest robust D-efficiency (1.2954 × 10⁻⁶) while maintaining strong nominal D- and A-efficiency, with only a marginal decrease in G-efficiency compared with GA. A similar trend was observed in the 23-run case, where NSGA-II-23 achieved the best balance between robust D-efficiency (1.2738 × 10⁻⁶) and A-efficiency (2.4871 × 10⁻¹¹), outperforming both GA and DX benchmarks. In the 24-run case, NSGA-II again dominated: NSGA-II-24-D2 achieved the highest G-efficiency (56.2995) and robust D-efficiency (1.3253 × 10⁻⁶), while NSGA-II-24-D1 provided a balanced compromise with improved IV-efficiency and lower mean SPV. For the 25-run case, NSGA-II-25 attained the highest robust D-efficiency (1.3165 × 10⁻⁶) while maintaining competitive nominal efficiency values and, importantly, offered lower mean SPV than DX-25, which exhibited the largest maximum SPV.
Overall, NSGA-II designs achieved higher robustness (R-D10) and more balanced efficiency profiles across all criteria. In contrast, GA and DX occasionally produced higher single-criterion values (e.g., maximum G-efficiency or extreme SPV), but these gains often came at the expense of robustness or variance stability. Thus, NSGA-II consistently provided the most reliable designs under practical manufacturing tolerances, delivering a superior trade-off between nominal efficiency, robustness, and prediction variance control.
The comparative performance is further summarized in Table 3, which ranks the six methods across all criteria. NSGA-II clearly outperformed all alternatives, securing the top overall ranking (1st) and ranking first in both robust D-efficiency and G-efficiency, while remaining competitive in D-efficiency, A-efficiency, and variance-related metrics. This balanced performance demonstrates its strength in simultaneously optimizing nominal and robust efficiency while maintaining variance stability. By contrast, DX and GA-GW ranked lowest overall (6th and 5th, respectively), reflecting their limited ability to balance efficiency and robustness. Although FD and GA performed moderately well in D-efficiency, their rankings were weakened by poor variance control, as indicated by higher mean and maximum SPV. GA-AW performed better than GA-GW and DX, but still lagged behind NSGA-II, particularly in variance-related measures. These results reinforce that while traditional GA- and exchange-based methods can excel in isolated criteria, they fall short in delivering balanced and robust performance.
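A rank summary of this kind can be sketched as follows: rank the methods within each criterion (1 = best), then average the ranks per method. The exact ranking and tie-handling rules behind the reported table are not stated, so the average-rank tie rule, the function names, and the values below are assumptions for illustration only.

```python
def rank_methods(scores, higher_is_better=True):
    """Rank methods for one criterion (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of rank positions i+1 .. j+1
        for method in ordered[i:j + 1]:
            ranks[method] = shared
        i = j + 1
    return ranks


def average_ranks(tables):
    """tables: list of (scores, higher_is_better) pairs; returns mean rank per method."""
    totals = {}
    for scores, higher_is_better in tables:
        for method, r in rank_methods(scores, higher_is_better).items():
            totals[method] = totals.get(method, 0.0) + r
    return {method: total / len(tables) for method, total in totals.items()}


# Hypothetical values for three placeholder methods.
d_eff = {"A": 0.9, "B": 0.8, "C": 0.7}        # efficiency: higher is better
max_spv = {"A": 30.0, "B": 25.0, "C": 40.0}   # prediction variance: lower is better
avg = average_ranks([(d_eff, True), (max_spv, False)])
print(avg)
```

Efficiency criteria are ranked with higher values as better, while mean and maximum SPV are ranked with lower values as better.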
Analysis of variance with blocking (run size as the blocking factor) indicated statistically significant differences among the six methods at the 0.05 significance level for D-efficiency (F = 9.922, p-value = 0.0002), R-D10 (F = 8.498, p-value = 0.0005), A-efficiency (F = 2.832, p-value = 0.0050), G-efficiency (F = 13.704, p-value = 3.84 × 10⁻⁵), and maximum SPV (F = 4.131, p-value = 0.0149), but not for IV-efficiency (F = 1.877, p-value = 0.158) or mean SPV (F = 1.932, p-value = 0.148). To further investigate, Tukey’s HSD post hoc test was applied, with results summarized in Figure 3.
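For readers who want to see how the F statistic of a blocked ANOVA is formed, the sketch below computes it for a randomized complete block layout with one value per method and block. This layout is an assumption; how the two 24-run NSGA-II designs were handled is not stated. The data are toy values, and the p-value, which requires the F distribution, is omitted.

```python
def blocked_anova_f(data):
    """F statistic for treatments in a randomized complete block design.

    data[b][t] is the response for block b (e.g., run size) and
    treatment t (e.g., design generation method).
    Returns (F, df_treatment, df_error).
    """
    n_blocks, n_treat = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n_blocks * n_treat)
    treat_means = [sum(data[b][t] for b in range(n_blocks)) / n_blocks
                   for t in range(n_treat)]
    block_means = [sum(row) / n_treat for row in data]

    ss_treat = n_blocks * sum((m - grand) ** 2 for m in treat_means)
    ss_block = n_treat * sum((m - grand) ** 2 for m in block_means)
    ss_total = sum((y - grand) ** 2 for row in data for y in row)
    ss_error = ss_total - ss_treat - ss_block  # residual after removing both effects

    df_treat = n_treat - 1
    df_error = (n_treat - 1) * (n_blocks - 1)
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return f_stat, df_treat, df_error


# Toy data: 2 blocks x 3 treatments (hypothetical values).
toy = [[1.0, 2.0, 3.0],
       [2.0, 3.0, 5.0]]
f_stat, df1, df2 = blocked_anova_f(toy)
print(round(f_stat, 3), df1, df2)
```

Under the same one-value-per-cell assumption, six methods and four run sizes would give 5 and 15 degrees of freedom for the method effect and the error term, respectively.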
The heat maps (Figure 3) show that for D-efficiency, significant differences were detected between NSGA-II and GA-based variants, and between FD and several other methods, confirming the superior nominal efficiency of FD and NSGA-II relative to GA-GW and GA-AW. For R-D10, NSGA-II significantly outperformed GA-GW, GA-AW, and DX, emphasizing its robustness advantage under tolerance perturbations. For A-efficiency, fewer significant differences were detected, although GA-AW again performed better than DX. In G-efficiency, the clearest contrasts emerged: DX displayed significantly poorer variance control compared with NSGA-II and GA-based methods, consistent with its tendency to cluster design points and inflate prediction variance. Differences were less apparent for IV-efficiency and mean SPV, where no significant variation was observed. However, for maximum SPV, FD designs showed significantly higher values than NSGA-II, GA, and GA-AW, confirming that FD emphasizes nominal efficiency at the cost of variance dispersion. Overall, the heat map analysis reinforces the superiority of NSGA-II in balancing efficiency and robustness, while highlighting the trade-offs inherent in FD and DX.
Figure 4 presents the fraction of design space (FDS) plots, which characterize the distribution of scaled prediction variance across the design space. A flatter FDS curve indicates more uniform variance control and hence, greater robustness. Across all run sizes, NSGA-II consistently achieved lower SPV distributions compared with GA- and exchange-based designs, particularly in the upper tails of the distribution. This indicates that NSGA-II effectively limits extreme variances, thereby reducing the risk of poor predictive performance in underrepresented regions of the mixture space. In contrast, FD and DX exhibited steeper curves with inflated maximum SPV, reflecting their clustering behavior. The superiority of NSGA-II was particularly evident in the 24-run case, where two NSGA-II solutions (D1 and D2) both outperformed competing methods across nearly the entire design space. Similarly, in the 25-run case, NSGA-II produced one of the flattest variance profiles, underscoring its ability to generate variance-stable designs when the run size increases.
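An FDS curve is the empirical distribution of scaled prediction variance over points sampled from the design region. The sketch below assumes the SPV values have already been computed at sampled points; the function names and the SPV values are hypothetical.

```python
import math


def fds_curve(spv_values):
    """Return (fraction, spv) pairs for an FDS plot.

    Each pair gives the fraction of sampled points whose SPV is at or
    below the given value.
    """
    ordered = sorted(spv_values)
    n = len(ordered)
    return [((k + 1) / n, v) for k, v in enumerate(ordered)]


def spv_at_fraction(spv_values, fraction):
    """Smallest sampled SPV such that at least `fraction` of points lie at or below it."""
    ordered = sorted(spv_values)
    k = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[k]


# Hypothetical SPV values at five sampled mixture points.
spv = [12.0, 8.0, 30.0, 10.0, 15.0]
curve = fds_curve(spv)
print(curve[0], spv_at_fraction(spv, 0.5), spv_at_fraction(spv, 1.0))
```

The value at fraction 1.0 corresponds to the maximum SPV among the sampled points, so a design with an inflated upper tail shows a curve that rises sharply near the right edge of the plot.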
Taken together, the Pareto fronts, efficiency metrics, ANOVA results, post hoc comparisons, and FDS plots provide consistent evidence that NSGA-II delivers the most robust and balanced designs across multiple evaluation criteria. Unlike GA and exchange-based methods, which may excel in isolated metrics but suffer from poor robustness or inflated variance, NSGA-II provides a superior trade-off between nominal efficiency, robustness under tolerances, and global variance control, confirming its suitability for tolerance-aware robust mixture design generation.
5.2. Example 2: The Glass Chemical Durability Study
The benchmark problem, adapted from Martin et al. [55], involves eight oxide components subject to both individual and multicomponent linear constraints. The goal is to model the relationship between glass composition and chemical durability, with all component proportions satisfying the unit-sum constraint. The lower and upper bounds for each component are:
The multicomponent constraints are defined as:
These constraints on the component proportions transform the experimental region from a standard simplex into an irregularly shaped polyhedron within the simplex, reflecting the realistic compositional limits encountered in industrial glass manufacturing.
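A feasibility check for such a constrained region can be sketched as below: a point must satisfy the unit-sum constraint, the single-component bounds, and every multicomponent linear constraint. The three-component bounds and the single multicomponent constraint used here are hypothetical placeholders, not the bounds of the glass study.

```python
def is_feasible(x, lower, upper, linear_constraints, tol=1e-9):
    """Check a mixture point against all region constraints.

    linear_constraints is a list of (coeffs, lo, hi) triples meaning
    lo <= sum(coeffs[i] * x[i]) <= hi.
    """
    if abs(sum(x) - 1.0) > tol:                      # unit-sum constraint
        return False
    for xi, lo, hi in zip(x, lower, upper):          # single-component bounds
        if xi < lo - tol or xi > hi + tol:
            return False
    for coeffs, lo, hi in linear_constraints:        # multicomponent constraints
        value = sum(a * xi for a, xi in zip(coeffs, x))
        if value < lo - tol or value > hi + tol:
            return False
    return True


# Hypothetical three-component region: bounds plus x1 + x2 <= 0.7.
lower = [0.1, 0.1, 0.2]
upper = [0.6, 0.5, 0.7]
constraints = [([1.0, 1.0, 0.0], 0.0, 0.7)]

print(is_feasible([0.3, 0.3, 0.4], lower, upper, constraints))   # inside the region
print(is_feasible([0.5, 0.3, 0.2], lower, upper, constraints))   # violates x1 + x2 <= 0.7
```

A check of this form can be used to screen candidate points or to reject offspring that leave the region during a genetic search.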
The relationship between composition and durability is modeled using the Scheffé quadratic mixture model, consistent with Martin et al. [54]. The response variable $y$ is the percentage weight loss after durability testing, serving as a quantitative measure of chemical stability. The model under consideration is expressed as:
$$E(y) = \sum_{i=1}^{8} \beta_i x_i + \sum_{i=1}^{7} \sum_{j=i+1}^{8} \beta_{ij} x_i x_j,$$
where $x_i$ denotes the proportion of the $i$-th oxide component. The model includes eight linear terms and $\binom{8}{2} = 28$ pairwise interaction terms. This formulation provides sufficient flexibility to capture both individual oxide effects and synergistic interactions that influence chemical durability.
To evaluate the effectiveness of the proposed NSGA-II approach, we compared it against five alternative design generation methods (GA, GA-AW, GA-GW, DX, and FD) for producing 37- to 40-point designs.
Figure 5 illustrates the Pareto front of nominal D-efficiency and 10th-percentile D-efficiency (R-D10) for the glass chemical durability problem with design sizes of 37, 38, 39, and 40 runs, generated using the proposed multi-objective NSGA-II approach. In each subplot, gray points correspond to all candidate designs, whereas red points highlight the non-dominated solutions on the first Pareto front.
Across all cases, the Pareto-optimal solutions consistently occupy the upper-right region of the trade-off space, demonstrating their dominance over the remaining candidates. This confirms that NSGA-II successfully balances nominal efficiency with robustness, generating designs that maintain high statistical efficiency while exhibiting improved resilience to perturbations. Although gains in nominal D-efficiency are relatively modest, the improvements in R-D10 underscore the method’s ability to enhance robustness without sacrificing efficiency. Notably, for the 37-, 39-, and 40-run designs, the optimization yielded a single Pareto-optimal design, whereas for the 38-run design, two distinct solutions were identified. The latter case illustrates the presence of alternative trade-offs between nominal efficiency and robustness, highlighting NSGA-II’s capacity to reveal multiple optimal strategies, depending on the design objectives.
The performance of the six design generation methods (NSGA-II, GA, GA-AW, GA-GW, DX, and FD) was assessed using seven statistical criteria: D-efficiency, 10th-percentile D-efficiency (R-D10), A-efficiency, G-efficiency, IV-efficiency, mean SPV, and maximum SPV, as presented in Table 4. Together, these metrics capture both statistical efficiency and robustness with respect to prediction variance, which are critical for optimizing constrained mixture formulations, such as the glass chemical durability case study. Boldface values in Table 4 indicate the best performance for each criterion. Table 5 presents the corresponding average rank analysis, facilitating a systematic comparison of method consistency across different design sizes.
Across all run sizes, NSGA-II designs consistently achieved the highest D-efficiency and R-D10 values, with R-D10 outperforming the best competing method by approximately 1.5–5.1%. This performance gap underscores the method’s robustness to perturbations, indicating that NSGA-II designs can sustain high statistical efficiency even when factor settings deviate from their nominal targets. By maintaining superior R-D10 values, NSGA-II not only minimizes the risk of poor statistical performance but also reduces product variability and enhances process reliability under real-world implementation conditions. These results reinforce the practical advantage of incorporating R-D10 as a co-criterion in the optimization process, ensuring that the selected design is both theoretically optimal and operationally resilient.
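One way to picture a 10th-percentile robust D-criterion is a Monte Carlo sketch: perturb every design point within a tolerance, re-evaluate the D-criterion, and take the 10th percentile of the resulting values. Everything specific below is an assumption made for illustration: the scaling det(X'X/n)^(1/p), a single uniform tolerance instead of component-specific tolerances, renormalization after perturbation, the simple empirical percentile, and the three-component toy design.

```python
import random
from itertools import combinations


def model_row(x):
    """Scheffé quadratic terms: linear blending terms plus pairwise products."""
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-300:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d  # a row swap flips the sign of the determinant
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def d_criterion(design):
    """det(X'X / n)^(1/p); one common scaling, assumed here."""
    X = [model_row(x) for x in design]
    n, p = len(X), len(X[0])
    M = [[sum(r[i] * r[j] for r in X) / n for j in range(p)] for i in range(p)]
    return max(det(M), 0.0) ** (1.0 / p)


def perturb(x, tol, rng):
    """Shift each proportion uniformly within +/- tol, clip at 0, renormalize."""
    y = [max(0.0, xi + rng.uniform(-tol, tol)) for xi in x]
    s = sum(y)
    return [yi / s for yi in y]


def robust_d10(design, tol, n_sim=200, seed=0):
    """Empirical 10th percentile of the D-criterion over perturbed designs."""
    rng = random.Random(seed)
    values = sorted(
        d_criterion([perturb(x, tol, rng) for x in design]) for _ in range(n_sim)
    )
    return values[int(0.10 * (n_sim - 1))]


# Toy three-component design: vertices, edge midpoints, and centroid.
design = [
    [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
]
nominal = d_criterion(design)
r_d10 = robust_d10(design, tol=0.02)
print(nominal, r_d10)
```

A design whose robust value stays close to its nominal value loses little efficiency when the realized proportions drift within tolerance, which is the property the R-D10 criterion is meant to reward.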
NSGA-II designs also attained the highest A-efficiency in all cases except for the 38-run design, where GA slightly surpassed it. For the 38- and 39-run designs, NSGA-II yielded the highest G-efficiency values, reflecting superior control over worst-case prediction variance in those scenarios. GA designs demonstrated particular strength in minimizing maximum SPV and produced competitive mean SPV values, indicating solid worst-case variance performance. DX designs generally showed moderate efficiency, often ranking below GA but outperforming both GA-AW and GA-GW designs in most criteria. In contrast, GA-AW and GA-GW designs consistently underperformed, with substantially lower efficiency and robustness scores, underscoring the limitations of aggregated desirability functions for complex constrained mixture design optimization. FD designs exhibited inconsistent results, performing competitively in selected D-efficiency and R-D10 cases, yet displaying weak performance in A-efficiency and prediction variance measures.
The average rank comparisons in Table 5 highlight method consistency across all run sizes and criteria. NSGA-II achieved the best overall performance, ranking first in both D-efficiency and R-D10 and maintaining top-three positions across all other metrics. GA ranked second overall, excelling in maximum SPV (rank = 1) and showing competitive performance in A-efficiency and G-efficiency (both ranked 1.75). DX ranked third, with moderate standings in most metrics but weaker results in R-D10 and mean SPV. FD, GA-AW, and GA-GW consistently occupied the lower ranks, with GA-AW producing the lowest D-efficiency and G-efficiency scores, and FD ranking last in A-efficiency and mean SPV. Overall, these results confirm NSGA-II as the most balanced and robust method across multiple statistical criteria, followed by GA as the next-most consistent performer.
The analysis of variance with blocking (run size as the blocking factor) was used to formally evaluate differences among the six design generation methods. The results indicated statistically significant differences among methods at the 0.05 significance level for all metrics: D-efficiency (F = 366.877, p-value = 4 × 10⁻¹⁵), R-D10 (F = 333.662, p-value = 8.1 × 10⁻¹⁵), A-efficiency (F = 55.551, p-value = 3.81 × 10⁻⁹), G-efficiency (F = 23.110, p-value = 1.48 × 10⁻⁶), IV-efficiency (F = 6.677, p-value = 0.0018), mean SPV (F = 10.960, p-value = 0.0001), and maximum SPV (F = 26.838, p-value = 5.57 × 10⁻⁷).
To further investigate these differences, Tukey’s HSD post hoc test was applied at the 0.05 significance level, with results summarized in Figure 6. For D-efficiency (Figure 6a), NSGA-II, GA, DX, and FD formed the top statistical group, with no significant differences among them, all significantly outperforming GA-AW and GA-GW. For R-D10 (Figure 6b), NSGA-II achieved the highest performance, while GA and DX performed comparably and FD, though statistically similar, lagged slightly behind. GA-AW and GA-GW occupied the lowest group. For A-efficiency (Figure 6c), NSGA-II and GA were top performers, grouping with DX and GA-AW, while GA-GW was significantly worse, and FD ranked the lowest. For G-efficiency (Figure 6d), NSGA-II, GA, DX, and FD formed the superior group, with GA-AW and GA-GW consistently at the bottom. For IV-efficiency (Figure 6e), NSGA-II, GA, DX, GA-AW, and GA-GW formed the top group, while FD trailed, overlapping with GA-AW. For mean SPV (Figure 6f), FD recorded the highest (worst) values, significantly exceeding all other methods. For maximum SPV (Figure 6g), GA-AW performed the worst, followed by GA-GW and FD, while NSGA-II, GA, and DX formed the best group, demonstrating superior variance control.
Overall, the Tukey-adjusted comparisons highlight that NSGA-II and GA consistently rank within the top statistical groups for efficiency-based criteria, while FD performs well in nominal efficiency but exhibits poor variance control, as reflected in elevated SPV. Conversely, GA-AW and GA-GW consistently occupy the lowest-performing groups across key metrics, particularly D-efficiency, R-D10, and G-efficiency. These findings reinforce the conclusion that NSGA-II provides the most reliable trade-off between efficiency and robustness among the evaluated methods.
Robustness was further assessed using fraction of design space (FDS) plots, where lower and flatter curves indicate greater robustness.
Figure 7 presents the FDS curves for all methods across different run sizes. For the 37-run designs, the NSGA-II design exhibited consistently lower and flatter curves across the design space, indicating superior robustness. For the 38-run designs, GA and NSGA-II displayed nearly identical FDS curves throughout the design space. For the 39-run designs, GA, NSGA-II, and DX performed comparably across most of the design space. For the 40-run designs, GA and NSGA-II again showed similar and robust performance.
Overall, NSGA-II designs consistently produced the lowest or statistically comparable FDS curves across all run sizes. Importantly, at the boundaries of the design space, NSGA-II often achieved the lowest scaled prediction variance (SPV) or values comparable to the best-performing method, thereby reducing the risk of extreme variance in underrepresented regions. By contrast, GA-AW, GA-GW, and FD designs exhibited noticeably higher SPV at the boundaries, reflecting poorer robustness in these critical regions. These findings confirm that NSGA-II generally provides strong robustness across both the interior and boundaries of the design space, underscoring its effectiveness in generating tolerance-aware optimal designs.
Table 6 summarizes the runtime comparisons between single-objective GA variants and the proposed NSGA-II framework. In the six-component case study, GA achieved the lowest per-generation runtime (approximately 0.006 s/gen) but required 500,000 generations to converge, resulting in a total runtime of 3039 s. By contrast, GA-AW and GA-GW, which aggregate objectives into a single desirability score, incurred substantially higher per-generation costs (0.041–0.043 s/gen) and longer total runtimes (12,400–12,900 s). NSGA-II exhibited a similar per-generation cost (≈0.041 s/gen) but required only 200,000 generations to converge, yielding a total runtime of 8243 s, approximately 30–35% faster than GA-AW and GA-GW.
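The runtime comparison for the six-component case can be rechecked from the totals quoted above. The short calculation below uses only those quoted figures; the helper name `runtime_reduction` is hypothetical.

```python
def runtime_reduction(new_total, old_total):
    """Fractional reduction in total runtime of `new_total` relative to `old_total`."""
    return 1.0 - new_total / old_total


# Totals quoted for the six-component case study (seconds).
nsga2_total = 8243.0
desirability_totals = (12400.0, 12900.0)   # quoted range for GA-AW and GA-GW

reductions = [runtime_reduction(nsga2_total, t) for t in desirability_totals]

# Implied per-generation costs from total runtime and generation counts.
per_gen_nsga2 = nsga2_total / 200_000
per_gen_ga = 3039.0 / 500_000

print([round(100 * r, 1) for r in reductions])
print(round(per_gen_nsga2, 4), round(per_gen_ga, 4))
```

With these totals the reduction comes out at roughly 34% to 36%, depending on which end of the quoted GA-AW/GA-GW range is used, and the implied per-generation costs are about 0.041 s for NSGA-II and 0.006 s for GA.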
A similar pattern was observed in the eight-component case. GA again achieved the lowest runtime overall (3440 s) due to its lightweight per-generation cost, but only after 600,000 generations. GA-AW and GA-GW remained the slowest methods, requiring 15,300–15,900 s. NSGA-II converged in 250,000 generations, with a total runtime of 9939 s, which, based on these totals, is roughly 35–37% shorter than that of the desirability-based GA variants.
These findings highlight the computational trade-offs between methods. Although NSGA-II incurs a higher per-generation cost than standard GA, its convergence in fewer generations partly offsets this overhead, even though its total runtime remains higher than that of GA. Relative to GA-AW and GA-GW, which have similar per-generation costs, NSGA-II achieves shorter total runtimes while additionally producing Pareto-optimal fronts that explicitly balance nominal D-efficiency and R-D10 without relying on subjective desirability weighting.