1. Introduction
Multi-objective optimization (MOO) arises naturally in many real-world optimization and decision problems where multiple conflicting goals must be balanced simultaneously under practical constraints [1]. In finance, the classical trade-off between risk and return in portfolio management is a canonical example, but similar structures appear in many other domains, such as engineering design, logistics, healthcare, and energy systems. Decision-makers are often interested not in a single optimal solution, but in a diverse set of Pareto-optimal solutions (i.e., non-dominated solutions forming the Pareto-optimal front) that reveal the trade-offs among objectives. MOO can generate this set of equally good optimal solutions from the perspective of the objectives considered [2]. Designing algorithms that can efficiently approximate such Pareto-optimal fronts, especially in the presence of complex feasibility regions, remains a central challenge.
In the financial domain, the Markowitz mean–variance framework [3] established the foundation for quantitative portfolio selection by posing the problem as a bi-objective optimization of expected return and variance. Building on this, multi-objective evolutionary algorithms, such as the non-dominated sorting genetic algorithm II (NSGA-II) proposed by Deb et al. [4], have become popular tools for finding Pareto-optimal fronts (also known as efficient frontiers in finance) in portfolio optimization and other domains, owing to their population-based search and ability to handle non-convex, discontinuous Pareto sets. However, when constraints are tight or the feasible region is narrow, the original NSGA-II with fixed parameters and a purely feasibility- or penalty-based constraint-handling strategy may suffer from slow convergence, premature loss of diversity, or excessive sampling of infeasible solutions, which reduces both the quality and interpretability of the resulting Pareto-optimal fronts.
A key factor underlying these issues is that evolutionary algorithms are typically run with static parameter settings and rigid constraint-handling rules. In practice, the balance between exploration and exploitation, or between searching in feasible versus infeasible regions, changes over the course of evolutionary generations. Early in the search, a higher degree of exploration and a more tolerant attitude toward constraint violations may be beneficial for discovering promising regions, whereas in later generations a stronger emphasis on convergence and feasibility is desirable. Fixed crossover and mutation probabilities, fixed constraint tolerance, and rigid survival selection can therefore be suboptimal, especially in constrained MOO problems where diversity and feasibility must be managed concurrently.
In parallel, recent advances in reinforcement learning (RL) and data-driven multi-criteria decision-making (MCDM) methods offer new opportunities for dynamic, feedback-driven control within evolutionary frameworks. RL provides a natural way to model the adaptive adjustment of algorithmic parameters as a sequential decision problem, where the RL agent observes indicators of convergence, feasibility, and diversity and chooses parameter values to maximize long-term performance. MCDM, in turn, is typically an effective approach in multi-criteria and uncertain decision-making environments [5]. Gray relational coefficients (GRC), a popular MCDM concept, provide a flexible mechanism for aggregating multiple indicators into a performance score that can guide selection and ranking [6]. Nonetheless, the integration of an RL agent with GRC-enhanced selection within NSGA-II has remained relatively unexplored, particularly in the context of constrained multi-objective portfolio optimization.
To address these gaps, this work proposes an RL-guided NSGA-II method enhanced with GRC (abbreviated as RL-NSGA-II-GRC) for solving MOO problems, and applies it to NASDAQ portfolio optimization.
Figure 1 shows the flowchart of the RL-NSGA-II-GRC method; the detailed description and corresponding equations are presented in Section 3 of this article.
The main contributions of this work are three-fold. First, we propose a novel RL-NSGA-II-GRC method that integrates an RL agent into the evolutionary framework to adaptively control key parameters, including crossover probability, mutation strength, constraint tolerance, and front sampling fraction; by utilizing population-level metrics (i.e., hypervolume, feasibility ratio, and diversity) as state inputs, the algorithm dynamically balances the trade-off between exploration and exploitation throughout the optimization process. Second, we introduce a GRC-enhanced binary tournament selection operator that evaluates potential parent solutions based on their geometric proximity to an ideal reference solution. This mechanism serves as a comprehensive performance indicator that simultaneously accounts for dominance rank, crowding distance, and objective values, thereby improving selection pressure and guiding the search toward the Pareto-optimal front more effectively. Third, we demonstrate the effectiveness and practicality of RL-NSGA-II-GRC on two mathematical benchmark MOO problems (i.e., the Kursawe and CONSTR problems), achieving convergence improvements of about 5.8% and 4.4% over the original NSGA-II, and on a real-world NASDAQ-100 portfolio optimization case study, where the method yields a smooth and well-populated efficient frontier that supports the identification of critical investment points, such as the maximum Sharpe ratio portfolio and utility-optimal portfolios, providing actionable insights for investors with varying risk preferences.
The rest of this article is structured as follows. Section 2 reviews the relevant literature on MOO, NSGA-II, RL, GRC, and portfolio optimization. Section 3 presents the proposed RL-NSGA-II-GRC methodology in detail, including the constraint-tolerant dominance rule, the GRC-enhanced binary tournament selection, the performance indicators used to evaluate RL actions (i.e., hypervolume, feasibility ratio, and diversity), and the Q-learning control agent. Section 4 reports numerical results on the Kursawe and CONSTR benchmark MOO problems, and Section 5 presents the NASDAQ portfolio optimization case study. Section 6 discusses the findings and limitations of the proposed approach, and Section 7 concludes the paper and outlines directions for future research.
2. Literature Review
As aforementioned, MOO involves optimizing two or more conflicting objectives simultaneously. Instead of a single optimal solution, MOO produces a Pareto-optimal set of solutions, where no objective can be improved without worsening another [7]. This is highly relevant in finance, as portfolio problems naturally involve trade-offs, for instance, maximizing return while minimizing risk [8]. Early methods for MOO often relied on scalarization techniques (e.g., weighted sums) to combine objectives [9], but these require subjective weight tuning and may miss certain Pareto-optimal solutions. Modern approaches also use evolutionary algorithms or other heuristics to approximate the Pareto-optimal front. Additionally, Majumder [10] provides a comprehensive treatment of single- and multi-objective network optimization models under diverse uncertain environments, where uncertainty is represented through paradigms such as expected-value, chance-constrained, and dependent chance-constrained formulations. The study further discusses how these uncertain models can be converted into crisp equivalents and solved using both classical optimization procedures and multi-objective solution methodologies, including the global criterion method, the ε-constraint method, and evolutionary algorithms. MOO has become an essential tool across disciplines [11]; beyond finance, it is applied in engineering [12], supply chain management [13], healthcare [14], and more. In the context of portfolio optimization, formulating the problem as an MOO acknowledges the inherent conflicts and seeks an efficient frontier of portfolios offering different trade-offs. For example, a bi-objective portfolio model can optimize for return and risk concurrently [15], yielding a spectrum of portfolios from low-risk/low-return to high-risk/high-return options. This multi-objective view, pioneered by the Markowitz mean–variance framework [3], underpins much of modern portfolio theory.
One of the most influential techniques for MOO is NSGA-II, introduced by Deb et al. [4]. NSGA-II is a widely adopted evolutionary algorithm that uses a fast non-dominated sorting approach and a crowding distance mechanism to maintain solution diversity on the Pareto front [16]. First, its population is partitioned into Pareto fronts using non-dominated sorting, where solutions in the first front are non-dominated by any other solution, the second front is dominated only by members of the first front, and so on. Each solution is assigned a rank equal to its front index, where a smaller rank indicates a better Pareto status [4]. Second, to maintain diversity along the Pareto front, NSGA-II computes the crowding distance within each front by sorting solutions along each objective and estimating the local density based on the distances to nearest neighbors in objective space; solutions with larger crowding distance are preferred to encourage a well-spread approximation set. Third, NSGA-II adopts an elitist replacement strategy by combining parent and offspring populations and selecting the next generation based on rank first and crowding distance second, typically implemented via a rank-and-crowding comparison in tournament selection. These mechanisms collectively yield a fast and effective baseline for MOO, with a time complexity of $O(MN^2)$, where $M$ is the number of objectives and $N$ is the population size [17].
Over the last two decades, NSGA-II has attracted extensive research interest and remains one of the most commonly used methods for MOO problems [18]. Its popularity stems from its efficiency and robust performance across many domains [19]. The recent comprehensive review by Ma et al. [20] confirms that NSGA-II and its variants are widely adopted and continue to be a cornerstone of MOO studies. In the finance literature, NSGA-II has been frequently used to solve portfolio optimization problems, generating Pareto-optimal portfolios under multiple objectives. For instance, researchers have applied NSGA-II to identify efficient frontiers when considering objectives such as return, risk, and other goals (e.g., liquidity, carbon footprint) in portfolio selection. Comparisons show that NSGA-II often performs competitively with other metaheuristics in this domain [17]. NSGA-II's elitist strategy makes it a reliable choice for tackling the multi-objective nature of portfolio optimization. Recent works continue to extend NSGA-II (e.g., with hybrid deep learning models or problem-specific improvements) to enhance its effectiveness on complex, large-scale portfolio problems [21]. Beyond classical NSGA-II, recent studies have proposed autonomous constrained MOO frameworks that adapt search behavior based on process knowledge and population feasibility states, including the process knowledge-guided autonomous evolutionary optimization method [22] and the population feasibility state guided autonomous constrained algorithms [23], where feasibility-related state information is explicitly modeled to guide adaptive control (e.g., via learned operator decisions). These works are closely related to our study in emphasizing feasibility-aware and adaptive guidance for constrained MOO.
RL is a paradigm of machine learning in which an agent learns to make sequential decisions through interactions with an environment. Formally, the problem is often modeled as a Markov decision process, and the agent's goal is to learn a policy that maximizes cumulative reward. RL differs from conventional static optimization or supervised learning in that it explicitly considers the feedback loop of decisions and outcomes over time [24]. In recent years, RL has demonstrated effectiveness across a wide range of complex decision-making problems and has attracted researchers in many fields. From a theoretical point of view, Gu et al. [25] surveyed the methods, theories, and applications of safe RL. Milani et al. [26] provided a comprehensive survey of explainable RL, whose objective is to explain the decision-making process of RL agents in sequential decision-making settings. Moos et al. [27] reviewed the literature on robust approaches to RL from four perspectives, namely, transition robustness, disturbance robustness, action robustness, and observation robustness. Yan et al. [28] studied a self-adapting Q-learning-based inter-layer feedback mechanism in multilayer networks, where an RL agent in the management layer learns when to apply punishment to trigger game transitions and thereby promote the evolution of cooperation.
For domain-specific RL applications, Panzer and Bender [29] systematically reviewed applications of deep RL in the optimization of production systems, aimed at overcoming challenges such as shorter product development cycles and increasing product customization. Rolf et al. [30] reviewed RL algorithms and applications in supply chain management, including areas such as inventory management. Kayhan and Yildiz [31] surveyed applications of RL in machine scheduling problems, such as job shop scheduling and unrelated parallel machine scheduling. In finance, and portfolio management in particular, RL has garnered significant attention in the past decade as a tool to assist portfolio optimization [32,33]. In addition, Song et al. [34] proposed a genetic algorithm based on RL for the electromagnetic detection satellite scheduling problem. Song et al. [35] also conducted a comprehensive survey on RL-assisted evolutionary algorithms, proposing a taxonomy of integration schemes, analyzing RL-assisted strategies for solution generation, objective modeling, and operator and parameter adaptation, and demonstrating their performance and research challenges across benchmark problems and real-world applications. Guo et al. [36] proposed an RL-assisted genetic programming algorithm for the team formation problem, in which an RL agent adaptively selects among multiple population search modes.
Introduced by Deng [37], gray system theory provides tools for decision-making and analysis under uncertainty. A key technique from this theory is gray relational analysis, which employs GRC to quantify the similarity or relationship between data series [38]. In essence, it measures how closely an alternative's attributes match an ideal target sequence by examining the geometric proximity of their data curves. The alternative that is closer to the reference solution receives a higher ranking [39]. One advantage of gray relational techniques is that they can work with small and/or incomplete data sets by extracting useful information from the partially known data [40]. Multiple variants of gray relational analysis exist in the MCDM literature. The specific variant adopted in this work, following Song and Jamalipour [41] and Martinez-Morales et al. [42], is distinguished by its independence from predefined criterion weights.
In portfolio optimization, investors seek to maximize expected return for a given level of variance (denoting risk or uncertainty in financial management), or equivalently to minimize risk for a target return, ultimately leading to an efficient frontier of optimal portfolios. The Markowitz framework formalized the risk–return trade-off and emphasized diversification benefits (i.e., considering covariances among assets) [17]. Over time, numerous extensions and alternative formulations have been proposed on top of the original framework [43]. For example, Konno and Yamazaki [44] replaced variance with mean absolute deviation (MAD) to create a linear programming model more tractable for large asset universes, while Speranza [45] proposed mean absolute semi-deviation (MASD), focusing only on downside volatility. Rockafellar and Uryasev [46] introduced conditional value-at-risk (CVaR) as a coherent risk measure targeting tail losses, which has since been widely adopted in portfolio optimization. These developments reflect a broadening of risk measures, aligning the optimization with investors' true risk concerns, that is, avoiding large losses. Traditional quadratic programming and other exact solvers struggle with such complex real-world optimization problems, especially when multiple objectives are present [24]. To tackle this, researchers have turned to metaheuristics and artificial intelligence (AI)-based methods. Global search algorithms like NSGA-II and other evolutionary algorithms have been extensively applied to portfolio problems with promising results [17]. These methods can accommodate non-linear, discontinuous objective landscapes and find near-optimal solutions under complex constraints [47]. For instance, there is evidence that evolutionary algorithms and swarm intelligence methods can efficiently construct diversified portfolios that classical solvers might miss [48]. Anagnostopoulos and Mamanis [49] found that an evolutionary approach outperformed exact solvers on a constrained portfolio selection benchmark, and many subsequent studies have confirmed the utility of such heuristics for multi-objective portfolio optimization [50]. Recently, learnheuristic [51] or hybrid methods have emerged, which combine machine learning models with metaheuristic optimization algorithms. For example, Joshi and Dhodiya [21] recently proposed a hybrid deep learning and evolutionary algorithm framework for many-objective portfolio decisions, illustrating the trend of blending AI techniques with metaheuristic algorithms.
Based on our comprehensive literature review, although RL-assisted multi-objective evolutionary algorithms (MOEAs) have been increasingly studied, a substantial portion of existing efforts mainly focuses on operator/parameter control (e.g., adapting crossover/mutation rates) or selecting among predefined search modes, while keeping the feasibility logic and survival selection mechanism largely unchanged. In contrast, the proposed RL-NSGA-II-GRC framework differs fundamentally in that the RL agent intervenes not only in variation parameters but also in constraint handling and survival selection pressure within the NSGA-II backbone. Specifically, the RL agent jointly controls (i) the variation parameters, (ii) a generation-wise constraint tolerance that defines a constraint-tolerant dominance relation for constrained MOO, and (iii) a fractional front-sampling survival mechanism that regulates selection pressure across Pareto fronts. Moreover, rather than replacing Pareto dominance with a single scalar fitness, the proposed method preserves a Pareto-first ranking logic and introduces the GRC as a discriminator when dominance information is insufficient, thereby maintaining the key semantics of non-dominated sorting while improving decision consistency during selection.
3. Methodology
This section presents the proposed RL-NSGA-II-GRC method in a general MOO setting, as illustrated in the flowchart in Figure 1. For clarity, we first formulate the generic MOO problem, followed by a comprehensive description of each methodological component. Specifically, we detail: (1) the fundamental operations of NSGA-II, such as the mathematical definitions of simulated binary crossover, polynomial mutation, non-dominated sorting, and crowding distance calculation; (2) the integration of GRC into the tournament selection operator, enabling selection pressure that simultaneously accounts for dominance rank, crowding distance, and geometric proximity to ideal objective values; (3) the computation of the population-level performance indicators (including hypervolume, feasibility ratio, and diversity) that serve as quantitative feedback on optimization progress (via the reward of the RL agent's action) and as state inputs for the RL agent; (4) the design of the RL control layer, which adaptively takes actions and adjusts evolutionary parameters (e.g., crossover probability, mutation strength, constraint tolerance, and front sampling ratio) based on learned value estimates from the environment, gradually improving the algorithm's balance between exploration and exploitation throughout the evolutionary process of finding the Pareto-optimal solutions (i.e., the Pareto-optimal front). The complete RL-NSGA-II-GRC method is implemented in Python (version 3.12.7). The code is available at no cost to interested readers by contacting the corresponding authors of this paper.
We consider a generic MOO problem involving two or more objective functions to be minimized (i.e., $M \ge 2$). Note that any maximization objective can be readily converted to a minimization form by multiplying its objective function by $-1$ (negative one). The problem is defined over a decision variable vector $\mathbf{x} = (x_1, x_2, \dots, x_n)^{T}$, subject to lower and upper bounds on each decision variable, as well as a set of inequality and/or equality constraints. The mathematical formulation of MOO is as follows:

$$\min_{\mathbf{x} \in \Omega} \; F(\mathbf{x}) = \big(f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_M(\mathbf{x})\big),$$

subject to inequality constraints

$$g_j(\mathbf{x}) \le 0, \quad j = 1, 2, \dots, J,$$

and/or equality constraints

$$h_k(\mathbf{x}) = 0, \quad k = 1, 2, \dots, K,$$

and bound constraints

$$x_i^{L} \le x_i \le x_i^{U}, \quad i = 1, 2, \dots, n,$$

where $\mathbf{x}$ is a decision vector in the search space $\Omega \subseteq \mathbb{R}^{n}$, and $F(\mathbf{x})$ collects the total of $M$ objective functions to be minimized.
To quantify constraint violation for a solution $\mathbf{x}$, we use the following scalar constraint violation:

$$CV(\mathbf{x}) = \sum_{j=1}^{J} \max\big(0,\, g_j(\mathbf{x})\big) + \sum_{k=1}^{K} \big|h_k(\mathbf{x})\big|.$$

Thus, $CV(\mathbf{x}) = 0$ if and only if all constraints are satisfied, with larger values indicating greater constraint violation.
For dominance, a solution $\mathbf{x}^{(1)}$ is said to Pareto-dominate another solution $\mathbf{x}^{(2)}$ (denoted $\mathbf{x}^{(1)} \prec \mathbf{x}^{(2)}$) if and only if

$$f_m(\mathbf{x}^{(1)}) \le f_m(\mathbf{x}^{(2)}) \;\; \forall\, m \in \{1, \dots, M\} \quad \text{and} \quad \exists\, m' : f_{m'}(\mathbf{x}^{(1)}) < f_{m'}(\mathbf{x}^{(2)}).$$

In other words, $\mathbf{x}^{(1)}$ Pareto-dominates $\mathbf{x}^{(2)}$ if $\mathbf{x}^{(1)}$ is no worse than $\mathbf{x}^{(2)}$ in all objectives, and at the same time is strictly better than $\mathbf{x}^{(2)}$ in at least one of the objectives.
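The dominance test above translates directly into code. The following is a minimal Python sketch (an illustration under the definitions above, not the authors' released implementation; the name `dominates` is ours):

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (all objectives minimized):
    f1 is no worse than f2 in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better
```

Note that two identical vectors do not dominate each other, and two vectors that each win on a different objective are mutually non-dominated.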
3.1. NSGA-II Backbone with Dynamic Constraint-Tolerant Dominance and Fractional Sampling Survival Selection
During the evolutionary process of NSGA-II, each individual solution in the population corresponds to a decision vector $\mathbf{x}$, augmented with its objective vector $F(\mathbf{x})$, constraint violation $CV(\mathbf{x})$, Pareto rank (e.g., rank-0, rank-1), crowding distance that measures sparsity, and additional bookkeeping for dominance (i.e., domination count and dominated-solutions set).
3.1.1. Constraint-Tolerant Dominance
In the original NSGA-II, the constrained-domination rules of Deb et al. [4] enforce strict feasibility: any feasible solution is preferred over any infeasible one, and if both are infeasible, the one with the smaller violation is preferred. In our proposed RL-NSGA-II-GRC method, this is generalized by introducing a dynamic, generation-dependent constraint tolerance $\varepsilon_t$, chosen by the RL agent at generation $t$.

For an individual solution $\mathbf{x}$ with violation $CV(\mathbf{x})$, we define the effective violation:

$$CV_{\mathrm{eff}}(\mathbf{x}) = \max\big(0,\, CV(\mathbf{x}) - \varepsilon_t\big).$$

Any solution whose violation is less than or equal to $\varepsilon_t$ is treated as effectively feasible in generation $t$. This allows a certain degree of controlled relaxation of constraints to help the search explore across infeasible regions. Theoretically, $\varepsilon_t$ is defined as a non-negative relaxation threshold, i.e., $\varepsilon_t \ge 0$. In practice, $\varepsilon_t$ is applied as a small tolerance level to determine whether a solution is treated as effectively feasible when computing $CV_{\mathrm{eff}}(\mathbf{x})$. Accordingly, $\varepsilon_t = 0$ corresponds to the standard strict feasibility case (i.e., no relaxation), while larger values allow controlled relaxation by treating violations up to $\varepsilon_t$ as acceptable during search. In our Python implementation, $\varepsilon_t$ is decided adaptively by the RL agent controller with a value up to 0.5 (this is also tunable), so that the relaxation level remains bounded.
Given two individual solutions $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, the constraint-tolerant dominance relation $\prec_{\varepsilon_t}$ is therefore defined by the following three scenarios. This relation is used in all dominance-based operations at generation $t$.

First, if one solution is effectively feasible and the other is not, then the effectively feasible solution dominates:

$$CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) = 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) > 0 \;\;\Rightarrow\;\; \mathbf{x}^{(1)} \prec_{\varepsilon_t} \mathbf{x}^{(2)}.$$

Second, if both solutions are effectively infeasible, then the solution with the smaller effective violation is preferred:

$$CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) > 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) > 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) < CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) \;\;\Rightarrow\;\; \mathbf{x}^{(1)} \prec_{\varepsilon_t} \mathbf{x}^{(2)}.$$

Third, if both solutions satisfy the tolerance, that is, $CV(\mathbf{x}^{(1)}) \le \varepsilon_t$ and $CV(\mathbf{x}^{(2)}) \le \varepsilon_t$, then the dominance determination reduces to the standard Pareto dominance rule as in Equation (6).
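The three scenarios combine naturally into a single comparator. The sketch below is our own illustration of the rule (the function name `eps_dominates` and its argument layout are not from the paper):

```python
def eps_dominates(f1, cv1, f2, cv2, eps):
    """Constraint-tolerant dominance at tolerance eps: does solution 1 dominate
    solution 2?  f1/f2 are objective vectors, cv1/cv2 scalar constraint violations."""
    v1 = max(0.0, cv1 - eps)  # effective violation of solution 1
    v2 = max(0.0, cv2 - eps)  # effective violation of solution 2
    if v1 == 0.0 and v2 > 0.0:   # scenario 1: only solution 1 effectively feasible
        return True
    if v1 > 0.0 and v2 == 0.0:   # scenario 1 mirrored: solution 1 cannot dominate
        return False
    if v1 > 0.0 and v2 > 0.0:    # scenario 2: both effectively infeasible
        return v1 < v2
    # scenario 3: both effectively feasible -> standard Pareto dominance
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))
```

Setting `eps = 0` recovers the strict constrained-domination rule of the original NSGA-II.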
3.1.2. Non-Dominated Sorting
Given a population of size $N$, non-dominated sorting partitions it into fronts with increasing rank. For each individual $i$, we define the set of solutions it dominates,

$$S_i = \{\, j \;:\; i \prec_{\varepsilon_t} j \,\},$$

and the number of solutions that dominate it,

$$n_i = \big|\{\, j \;:\; j \prec_{\varepsilon_t} i \,\}\big|.$$

Here, the first front (rank-0) is:

$$F_0 = \{\, i \;:\; n_i = 0 \,\}.$$

Subsequent fronts are built iteratively: for each member $i$ of the current front $F_k$, the domination count $n_j$ of every $j \in S_i$ is decremented, and

$$F_{k+1} = \{\, j \;:\; n_j = 0 \text{ after processing } F_k \,\},$$

and individuals in $F_k$ are assigned rank $k$. The process continues until all individuals are assigned to a front.
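The iterative construction above can be sketched as follows. This is an illustrative Python version of fast non-dominated sorting (our own helper, parameterized by an arbitrary dominance relation so the constraint-tolerant variant can be plugged in):

```python
def non_dominated_sort(objs, dominates):
    """Partition solutions (a list of objective vectors) into fronts of indices.
    `dominates(a, b)` is the (possibly constraint-tolerant) dominance relation."""
    n = len(objs)
    S = [[] for _ in range(n)]   # S[i]: indices of solutions dominated by i
    counts = [0] * n             # counts[i]: number of solutions dominating i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                S[i].append(j)
            elif dominates(objs[j], objs[i]):
                counts[i] += 1
        if counts[i] == 0:
            fronts[0].append(i)  # rank-0: dominated by nobody
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:      # remove front k and expose the next front
            for j in S[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]           # drop the trailing empty front
```

The nested comparison loop gives the $O(MN^2)$ complexity quoted in the literature review.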
3.1.3. Crowding Distance Calculation
Within each front $F_k$, a crowding distance is computed to measure the sparsity of solutions in objective space and to encourage a well-spread Pareto front. Let $L = |F_k|$. Initialize all distances:

$$d_i = 0, \quad i \in F_k.$$

Then, for each objective $m$, sort the front with respect to objective $m$:

$$f_m^{(1)} \le f_m^{(2)} \le \dots \le f_m^{(L)}.$$

The boundary points are assigned infinite distance:

$$d_{(1)} = d_{(L)} = \infty.$$

For internal points $l = 2, \dots, L-1$, the contribution of objective $m$ to the crowding distance is:

$$\Delta d_{(l)}^{m} = \frac{f_m^{(l+1)} - f_m^{(l-1)}}{f_m^{\max} - f_m^{\min}}.$$

The total crowding distance for individual $i$ is the sum across the $M$ objectives:

$$d_i = \sum_{m=1}^{M} \Delta d_i^{m}.$$

A larger $d_i$ indicates that the solution is located in a sparsely populated region of the front and thus should be preferred for diversity.
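A compact Python sketch of this calculation for a single front follows (our own illustrative helper, not the paper's code; degenerate objectives with zero range are skipped by assumption):

```python
def crowding_distance(front_objs):
    """Crowding distances for one front (list of objective vectors, minimization)."""
    L, M = len(front_objs), len(front_objs[0])
    d = [0.0] * L
    for m in range(M):
        order = sorted(range(L), key=lambda i: front_objs[i][m])
        d[order[0]] = d[order[-1]] = float('inf')   # boundary points: infinite distance
        span = front_objs[order[-1]][m] - front_objs[order[0]][m]
        if span == 0:
            continue  # all values equal for this objective: no contribution
        for pos in range(1, L - 1):
            i = order[pos]
            if d[i] != float('inf'):
                # neighbor gap along objective m, normalized by the objective range
                d[i] += (front_objs[order[pos + 1]][m]
                         - front_objs[order[pos - 1]][m]) / span
    return d
```

Boundary solutions keep infinite distance so they always survive diversity-based truncation.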
3.1.4. Simulated Binary Crossover (SBX)
NSGA-II uses crossover and mutation to generate offspring from selected parents. The key steps of the simulated binary crossover (SBX) are briefly presented as follows. Given two parents $\mathbf{x}^{(p_1)}$ and $\mathbf{x}^{(p_2)}$, SBX simulates, in real-valued variables, a recombination similar to single-point crossover in binary representation. Let the crossover probability be $p_c$ and the distribution index be $\eta_c$. Both $p_c$ and $\eta_c$ are dynamically controlled by the RL agent in our RL-NSGA-II-GRC method.

Here, for each decision variable $i$, with probability $1 - p_c$, the parents' genes (i.e., parents' decision variable values) are directly copied to the offspring:

$$c_i^{(1)} = x_i^{(p_1)}, \quad c_i^{(2)} = x_i^{(p_2)}.$$

Otherwise, SBX is performed. Let $y_1 = \min\big(x_i^{(p_1)}, x_i^{(p_2)}\big)$ and $y_2 = \max\big(x_i^{(p_1)}, x_i^{(p_2)}\big)$, and the lower and upper bounds of the decision variable still apply as in the bound constraints of the problem formulation, that is, $x_i^{L} \le x_i \le x_i^{U}$. Then, uniformly draw a random number $u \sim U(0, 1)$, and compute:

$$\beta_1 = 1 + \frac{2\,(y_1 - x_i^{L})}{y_2 - y_1}, \quad \alpha_1 = 2 - \beta_1^{-(\eta_c + 1)}, \quad \beta_{q1} = \begin{cases} (u\,\alpha_1)^{\frac{1}{\eta_c + 1}}, & u \le \dfrac{1}{\alpha_1}, \\[6pt] \left(\dfrac{1}{2 - u\,\alpha_1}\right)^{\frac{1}{\eta_c + 1}}, & \text{otherwise}. \end{cases}$$

The first child gene for the $i$-th variable is obtained as:

$$c_i^{(1)} = 0.5\,\big[(y_1 + y_2) - \beta_{q1}\,(y_2 - y_1)\big].$$

On the other side of the interval,

$$\beta_2 = 1 + \frac{2\,(x_i^{U} - y_2)}{y_2 - y_1}, \quad \alpha_2 = 2 - \beta_2^{-(\eta_c + 1)},$$

and with the same $u$,

$$\beta_{q2} = \begin{cases} (u\,\alpha_2)^{\frac{1}{\eta_c + 1}}, & u \le \dfrac{1}{\alpha_2}, \\[6pt] \left(\dfrac{1}{2 - u\,\alpha_2}\right)^{\frac{1}{\eta_c + 1}}, & \text{otherwise}. \end{cases}$$

The second child gene for the $i$-th variable is obtained as:

$$c_i^{(2)} = 0.5\,\big[(y_1 + y_2) + \beta_{q2}\,(y_2 - y_1)\big].$$

Both $c_i^{(1)}$ and $c_i^{(2)}$ are finally clipped to their bounds:

$$c_i^{(\cdot)} \leftarrow \min\big(\max\big(c_i^{(\cdot)},\, x_i^{L}\big),\, x_i^{U}\big).$$
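A bounded SBX step for one parent pair can be sketched as follows. This is an illustrative Python version under the equations above (the function name `sbx_pair`, the default parameter values, and the tolerance guard for near-identical genes are our own choices, not the paper's):

```python
import random

def sbx_pair(p1, p2, lo, hi, pc=0.9, eta_c=15.0):
    """Bounded simulated binary crossover on two real-valued parent vectors."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        # with probability 1 - pc (or for near-identical genes) copy genes unchanged
        if random.random() > pc or abs(p1[i] - p2[i]) < 1e-14:
            continue
        y1, y2 = min(p1[i], p2[i]), max(p1[i], p2[i])
        u = random.random()          # one draw shared by both children

        def beta_q(beta):
            """Spread factor from the bound-dependent beta."""
            alpha = 2.0 - beta ** -(eta_c + 1.0)
            if u <= 1.0 / alpha:
                return (u * alpha) ** (1.0 / (eta_c + 1.0))
            return (1.0 / (2.0 - u * alpha)) ** (1.0 / (eta_c + 1.0))

        bq1 = beta_q(1.0 + 2.0 * (y1 - lo[i]) / (y2 - y1))   # lower-side child
        bq2 = beta_q(1.0 + 2.0 * (hi[i] - y2) / (y2 - y1))   # upper-side child
        c1[i] = 0.5 * ((y1 + y2) - bq1 * (y2 - y1))
        c2[i] = 0.5 * ((y1 + y2) + bq2 * (y2 - y1))
        c1[i] = min(max(c1[i], lo[i]), hi[i])                # clip to bounds
        c2[i] = min(max(c2[i], lo[i]), hi[i])
    return c1, c2
```

In RL-NSGA-II-GRC the arguments `pc` and `eta_c` would be supplied per generation by the RL agent rather than fixed.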
3.1.5. Polynomial Mutation
Given a mutation probability $p_m$ per variable and distribution index $\eta_m$, polynomial mutation perturbs each variable $x_i$ with probability $p_m$. Likewise, both $p_m$ and $\eta_m$ are dynamically controlled by the RL agent in our proposed RL-NSGA-II-GRC method. For a selected variable $x_i$, define:

$$\delta_1 = \frac{x_i - x_i^{L}}{x_i^{U} - x_i^{L}}, \quad \delta_2 = \frac{x_i^{U} - x_i}{x_i^{U} - x_i^{L}},$$

and uniformly draw a random number $u \sim U(0, 1)$. Let $\kappa = \frac{1}{\eta_m + 1}$. The mutation step $\delta_q$ is given by:

$$\delta_q = \begin{cases} \big[2u + (1 - 2u)(1 - \delta_1)^{\eta_m + 1}\big]^{\kappa} - 1, & u < 0.5, \\[6pt] 1 - \big[2(1 - u) + 2(u - 0.5)(1 - \delta_2)^{\eta_m + 1}\big]^{\kappa}, & u \ge 0.5. \end{cases}$$

The mutated value of the $i$-th variable is:

$$x_i' = x_i + \delta_q\,(x_i^{U} - x_i^{L}),$$

followed by bound clipping:

$$x_i' \leftarrow \min\big(\max\big(x_i',\, x_i^{L}\big),\, x_i^{U}\big).$$
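The mutation equations above map to a short Python routine (an illustrative sketch with our own function name and default distribution index, not the paper's implementation):

```python
import random

def polynomial_mutation(x, lo, hi, pm, eta_m=20.0):
    """Polynomial mutation: perturb each variable with probability pm,
    with step size shaped by the distribution index eta_m."""
    y = list(x)
    for i in range(len(y)):
        if random.random() >= pm:
            continue
        rng = hi[i] - lo[i]
        d1 = (y[i] - lo[i]) / rng          # normalized distance to lower bound
        d2 = (hi[i] - y[i]) / rng          # normalized distance to upper bound
        u = random.random()
        k = 1.0 / (eta_m + 1.0)
        if u < 0.5:                         # perturb toward the lower bound
            dq = (2.0 * u + (1.0 - 2.0 * u) * (1.0 - d1) ** (eta_m + 1.0)) ** k - 1.0
        else:                               # perturb toward the upper bound
            dq = 1.0 - (2.0 * (1.0 - u)
                        + 2.0 * (u - 0.5) * (1.0 - d2) ** (eta_m + 1.0)) ** k
        y[i] = min(max(y[i] + dq * rng, lo[i]), hi[i])   # step, then clip
    return y
```

Larger `eta_m` concentrates offspring near the parent; smaller values allow bigger jumps.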
3.2. Gray Relational Coefficient (GRC) Enhanced Binary Tournament Selection
To enhance parent selection beyond Pareto rank and crowding distance alone, RL-NSGA-II-GRC employs an MCDM approach to derive the gray relational coefficient (GRC) score for each potential parent solution, which combines the objective performances and crowding distance information in a normalized manner. GRC is founded on gray system theory. This theory serves as a powerful analytical tool, particularly in situations where information is incomplete or uncertain. The core principle of GRC is to evaluate solutions by measuring their proximity to an ideal reference solution [52]. The solution that is closer to the ideal solution receives a higher GRC score.
At generation $t$, consider the current population $P_t$. For each objective $m$, compute the population-level extremes:

$$f_m^{\min} = \min_{i \in P_t} f_{i,m}, \quad f_m^{\max} = \max_{i \in P_t} f_{i,m}.$$

Because all objectives are minimized by default, we normalize each objective so that the value 1 represents the best performance and the value 0 denotes the worst performance consistently:

$$\tilde{f}_{i,m} = \frac{f_m^{\max} - f_{i,m}}{f_m^{\max} - f_m^{\min}}.$$

For crowding distance, recall that we have already calculated the crowding distance $d_i$ of solution $i$ in Section 3.1.3, and the boundary solutions have $d_i = \infty$. Here, we further compute

$$d^{\max} = \max\{\, d_i \;:\; d_i < \infty \,\},$$

and infinite distances (for boundary solutions) are replaced by this finite proxy to facilitate the subsequent calculations. Then the normalized crowding distance is:

$$\tilde{d}_i = \frac{d_i}{d^{\max}}.$$
Next, each normalized criterion $\tilde{c}_i$ (either an objective or the crowding distance) is compared to its ideal reference value $c^{*} = 1$. The GRC for that criterion is:

$$\gamma\big(c^{*}, \tilde{c}_i\big) = \frac{\Delta_{\min} + \zeta\,\Delta_{\max}}{\big|c^{*} - \tilde{c}_i\big| + \zeta\,\Delta_{\max}},$$

where $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum absolute deviations from the ideal, $\zeta \in (0, 1]$ is the distinguishing coefficient, and a value of $\gamma$ closer to 1 indicates that $\tilde{c}_i$ is closer to the ideal. For each individual $i$, we compute one coefficient for the crowding distance:

$$\gamma_i^{d} = \gamma\big(1, \tilde{d}_i\big),$$

and one coefficient for each objective $m$:

$$\gamma_{i,m} = \gamma\big(1, \tilde{f}_{i,m}\big).$$

The overall GRC score is defined as the average of these $M + 1$ coefficients:

$$GRC_i = \frac{1}{M + 1} \left( \gamma_i^{d} + \sum_{m=1}^{M} \gamma_{i,m} \right),$$

where a solution $i$ with a larger $GRC_i$ score is preferred.
After computing the GRC scores for all candidate solutions, the binary tournament selection proceeds straightforwardly. Given two randomly sampled individuals $a$ and $b$, if $\mathrm{rank}(a) < \mathrm{rank}(b)$, we select $a$; else if $\mathrm{rank}(b) < \mathrm{rank}(a)$, we select $b$; otherwise (i.e., the same Pareto rank, which commonly occurs especially toward the later stage of evolution), we compare their GRC scores and select the individual with the larger score. In this way, the GRC score serves as a scalar indicator balancing objective quality and the spread of solutions.
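The scoring and tie-breaking logic can be sketched as follows. This is our own simplified illustration: it assumes the criteria have been normalized to $[0,1]$ with ideal value 1, so the minimum and maximum deviations are taken as 0 and 1, and it fixes the distinguishing coefficient at 0.5 (function names are ours, not the paper's):

```python
def grc_score(norm_criteria, zeta=0.5):
    """GRC score of one solution: average gray relational coefficient over its
    normalized criteria (objectives + crowding distance), ideal value = 1.
    Assumes deviations from the ideal span [0, 1]: Delta_min = 0, Delta_max = 1."""
    gammas = [(0.0 + zeta * 1.0) / (abs(1.0 - c) + zeta * 1.0) for c in norm_criteria]
    return sum(gammas) / len(gammas)

def binary_tournament(a, b, rank, grc):
    """Winner of a binary tournament: lower Pareto rank first,
    larger GRC score as the tiebreaker on equal ranks."""
    if rank[a] != rank[b]:
        return a if rank[a] < rank[b] else b
    return a if grc[a] >= grc[b] else b
```

A solution sitting at the ideal point on every criterion scores exactly 1; scores decay smoothly toward $\zeta/(1+\zeta)$ as criteria move away from the ideal.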
Survival Selection with Fractional Sampling Mechanism
After the parent selection, crossover, and mutation steps, we obtain the set of new offspring solutions $Q_t$. The parent and offspring solutions are merged into a combined pool $R_t = P_t \cup Q_t$ of size $2N$ (where $N$ is the predefined number of solutions desired in the population). This pool is first partitioned into non-dominated fronts $F_0, F_1, F_2, \dots$ using the constraint-tolerant dominance relation $\prec_{\varepsilon_t}$ (as discussed in Section 3.1.1 and Section 3.1.2). Within each front $F_k$, crowding distances are computed to promote a good spread in objective space. Unlike classical NSGA-II, which always admits the entire $F_0$ and subsequent fronts until the population budget is nearly exhausted, our proposed RL-NSGA-II-GRC method adopts a fractional front sampling mechanism dynamically controlled by the RL agent. Specifically, for each front $F_k$, only $\lceil \rho_t\,|F_k| \rceil$ solutions are selected, where $\rho_t \in (0, 1]$ is the front sampling fraction managed by the RL agent at generation $t$ based on the environment. The selected individuals are the $\lceil \rho_t\,|F_k| \rceil$ solutions of $F_k$ with the largest crowding distance, ensuring that the most diverse solutions from each front are retained. If, after scanning all fronts, the size of the new population $P_{t+1}$ is still smaller than the desired population size $N$, the remaining slots are filled from the leftover individuals in $R_t$, sorted lexicographically by (rank, $-$crowding distance).
This two-stage survival selection scheme departs from the behavior of standard NSGA-II: by allowing $\rho_t < 1$ (e.g., $\rho_t = 0.8$), the algorithm may deliberately not take all members of $F_0$, thereby leaving room for selected individuals from inferior fronts to survive in order to boost diversity (and also broaden the range of offspring that can be produced in the next generation). The degree of this relaxation is adaptively controlled by the RL agent based on the observed hypervolume, feasibility ratio, and diversity metrics (discussed in Section 3.3) in the environment. As a result, the survival selection becomes a dynamic balance between exploitation of the current best front and exploration via lower-rank but diverse or promising solutions, which can be particularly beneficial on constrained MOO problems.
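The two-stage scheme can be sketched in Python as follows (an illustration under the description above; the function name `fractional_survival` and the data layout are our own, and leftovers are already accumulated in rank order with decreasing crowding distance, matching the lexicographic fill rule):

```python
import math

def fractional_survival(fronts, crowd, N, rho):
    """Stage 1: take the ceil(rho * |F_k|) largest-crowding members of each front.
    Stage 2: fill remaining slots from the leftovers, which are appended in
    (rank, -crowding distance) order as the fronts are scanned."""
    survivors, leftovers = [], []
    for front in fronts:
        take = min(math.ceil(rho * len(front)), N - len(survivors))
        by_crowding = sorted(front, key=lambda i: -crowd[i])
        survivors.extend(by_crowding[:take])
        leftovers.extend(by_crowding[take:])
        if len(survivors) == N:
            return survivors
    survivors.extend(leftovers[:N - len(survivors)])
    return survivors
```

With `rho = 1.0` this degenerates to the classical NSGA-II whole-front admission; smaller values trade members of the best front for diverse lower-rank survivors.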
3.3. Population-Level Performance Indicators for Reinforcement Learning (RL)
Three population-level indicators, namely, hypervolume, feasibility ratio, and diversity, are utilized to evaluate the impact of the RL agent's actions. These are computed at every evolutionary generation $t$.
3.3.1. Hypervolume
Let $\mathcal{F}_t$ denote the feasible rank-0 front (i.e., the set of best non-dominated solutions) at generation $t$:
$$\mathcal{F}_t = \{ x \in P_t \;:\; \operatorname{rank}(x) = 0,\; v(x) = 0 \},$$
where $v(x)$ is the total constraint violation of solution $x$. Let $r \in \mathbb{R}^m$ be a fixed reference point chosen to be dominated by all relevant objective vectors (e.g., constructed from early population extremes with a safety margin). The hypervolume (HV) at generation $t$ is the Lebesgue measure of the union of objective-space hyperrectangles dominated by $\mathcal{F}_t$ and bounded by $r$:
$$HV_t = \lambda_m \Big( \bigcup_{x \in \mathcal{F}_t} [f_1(x), r_1] \times \cdots \times [f_m(x), r_m] \Big),$$
where $\lambda_m$ denotes the $m$-dimensional Lebesgue measure (volume). For $m = 2$, this reduces to the area dominated by the rank-0 front. A larger hypervolume value is preferred, as it indicates a Pareto front that is closer to the true optimum.
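For the bi-objective case ($m = 2$), the hypervolume reduces to a rectangle sweep over the front sorted by the first objective. The following minimal sketch (assuming minimization of both objectives) illustrates the idea:

```python
def hypervolume_2d(front, ref):
    """2-D hypervolume sketch for a minimization problem.

    front : list of (f1, f2) objective vectors of the rank-0 front
    ref   : reference point (r1, r2) dominated by every relevant point
    """
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:  # skip points dominated within the sweep
            # Add the rectangle newly dominated by this point.
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```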
3.3.2. Feasibility Ratio
The feasibility ratio at generation $t$ is the fraction of the current population $P_t$ (of size $N$) with zero constraint violation:
$$FR_t = \frac{1}{N} \sum_{x \in P_t} \mathbb{1}\!\left[ v(x) = 0 \right],$$
where $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 when its argument holds and 0 otherwise. As such, each feasible solution contributes 1 to the sum, while each infeasible solution contributes 0. Consequently, $FR_t \in [0, 1]$ quantifies the feasibility level of the current population, with values closer to 1 being preferred, indicating that the evolutionary search successfully maintains feasibility across generations.
3.3.3. Diversity
The diversity metric measures how well the population is spread across the objective space, reflecting the algorithm’s ability to explore multiple trade-offs rather than just clustering around a narrow region. A larger diversity value indicates a more widely dispersed set of solutions, which is desirable because it supports the discovery of a well-distributed Pareto front.
To obtain the diversity value for the rank-0 front ($\mathcal{F}_t$), we normalize all objectives to $[0, 1]$ using population-level minimum and maximum values:
$$\tilde{f}_k(x) = \frac{f_k(x) - f_k^{\min}}{f_k^{\max} - f_k^{\min}}, \quad k = 1, \ldots, m.$$
Define the normalized objective vector $\tilde{f}(x) = \big( \tilde{f}_1(x), \ldots, \tilde{f}_m(x) \big)$, and let $n_t = |\mathcal{F}_t|$ denote the number of solutions within the rank-0 front at generation $t$. The diversity is then taken as the average pairwise Euclidean distance:
$$D_t = \frac{2}{n_t (n_t - 1)} \sum_{i < j} \big\| \tilde{f}(x_i) - \tilde{f}(x_j) \big\|_2.$$
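Both indicators are straightforward to compute; a minimal Python sketch is given below. For self-containment, the normalization here uses the front's own min/max values rather than the population-level values described above.

```python
from itertools import combinations
from math import sqrt

def feasibility_ratio(violations):
    """Fraction of the population with zero constraint violation."""
    return sum(v == 0 for v in violations) / len(violations)

def diversity(front_objs):
    """Average pairwise Euclidean distance over min-max-normalized objectives."""
    m = len(front_objs[0])
    lo = [min(f[k] for f in front_objs) for k in range(m)]
    hi = [max(f[k] for f in front_objs) for k in range(m)]
    # Guard against degenerate (constant) objectives with a unit range.
    norm = [[(f[k] - lo[k]) / ((hi[k] - lo[k]) or 1.0) for k in range(m)]
            for f in front_objs]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 0.0
    return sum(sqrt(sum((a[k] - b[k]) ** 2 for k in range(m)))
               for a, b in pairs) / len(pairs)
```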
3.4. Reinforcement Learning (RL) Layer
The RL layer views each generation as a time step in a Markov decision process. At each generation $t$, the agent observes a state summarizing the population, selects an action that sets the evolutionary parameters, and receives a scalar reward based on improvements in hypervolume, feasibility, and diversity. The state at generation $t$ is defined as:
$$s_t = (h_t, \phi_t, d_t, g_t),$$
which comprises the hypervolume trend sign $h_t$ (1 improvement, $-1$ deterioration, 0 no change), computed by comparing $HV_t$ against $HV_{t-1}$ with a small tolerance to absorb numerical noise; the feasibility level $\phi_t$, obtained by discretizing the feasibility ratio $FR_t$ into three bins (0 low, 1 medium, 2 high); the diversity level $d_t$, obtained analogously from $D_t$ (0 low, 1 medium, 2 high); and the search stage $g_t$, encoding whether we are in the early (0), middle (1), or late (2) stage of the run, given the total generation budget $T$ (e.g., 300 generations). This yields a compact, fully discrete state representation for the RL agent. For instance, the state $s_t = (1, 2, 1, 0)$ means an improved hypervolume (relative to the previous generation), a high feasibility level for solutions in the current population, a medium diversity level in the current rank-0 front, and an early stage of the evolutionary search.
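A possible encoding of this discrete state is sketched below. The bin thresholds (0.33/0.66 for feasibility, the `div_lo`/`div_hi` cut points for diversity) and the equal-thirds stage split are illustrative assumptions, not values specified in the text.

```python
def encode_state(hv, hv_prev, feas_ratio, div, div_lo, div_hi, gen, budget,
                 eps=1e-6):
    """Sketch of the discrete RL state (h, phi, d, g); thresholds are assumed."""
    # Hypervolume trend sign: +1 improvement, -1 deterioration, 0 no change.
    h = 1 if hv > hv_prev + eps else (-1 if hv < hv_prev - eps else 0)
    # Feasibility level: 0 low, 1 medium, 2 high (assumed 0.33/0.66 cuts).
    phi = 0 if feas_ratio < 0.33 else (1 if feas_ratio < 0.66 else 2)
    # Diversity level: 0 low, 1 medium, 2 high (assumed cut points).
    d = 0 if div < div_lo else (1 if div < div_hi else 2)
    # Search stage: early (0), middle (1), late (2) thirds of the budget.
    g = min(2, 3 * gen // budget)
    return (h, phi, d, g)
```

The first assertion below reproduces the worked example in the text: improved hypervolume, high feasibility, medium diversity, early stage.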
Next, the action space $\mathcal{A}$ is a finite set of parameter modes. Each action $a \in \mathcal{A}$ encodes the following key parameters: crossover probability $p_c$, mutation probability $p_m$, SBX distribution index $\eta_c$, mutation distribution index $\eta_m$, constraint tolerance $\epsilon$, and front sampling fraction $\rho$ in survival selection. Formally,
$$a = (p_c, p_m, \eta_c, \eta_m, \epsilon, \rho).$$
When action $a_t$ is selected at generation $t$, the NSGA-II parameters are set to the values encoded by $a_t$. Here, $\epsilon$ enters the constraint-tolerant dominance, and $\rho$ controls how many individuals are accepted from each front in survival selection, adjusting selection pressure and diversity.
A tabular action-value function $Q(s, a)$ is maintained as a dictionary (i.e., a Q-learning table) mapping each visited state $s$ to a vector of Q-values over all actions. Therefore, at generation $t$, given state $s_t$, an ε-greedy policy selects:
$$a_t = \begin{cases} \text{a uniformly random action from } \mathcal{A} & \text{with probability } \varepsilon, \\ \arg\max_{a \in \mathcal{A}} Q(s_t, a) & \text{with probability } 1 - \varepsilon, \end{cases}$$
where ties in the argmax are broken uniformly at random. The exploration parameter $\varepsilon$ (e.g., an initial value of 0.3) is gradually decayed:
$$\varepsilon \leftarrow \max(\varepsilon_{\min}, \; \delta \cdot \varepsilon),$$
so that the policy moves from exploration to exploitation over time; for example, $\varepsilon_{\min}$ can be a small value (e.g., 0.05) and the decay factor $\delta$ takes 0.995.
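The ε-greedy selection with random tie-breaking and decay can be sketched as follows; this is a minimal, self-contained illustration whose class and attribute names are our own, not the paper's.

```python
import random
from collections import defaultdict

class EpsGreedyPolicy:
    """Sketch of epsilon-greedy action selection over a tabular Q-function."""

    def __init__(self, n_actions, eps=0.3, eps_min=0.05, decay=0.995, seed=0):
        self.n_actions = n_actions
        self.eps, self.eps_min, self.decay = eps, eps_min, decay
        self.rng = random.Random(seed)
        # Q-table: each visited state maps to a vector of action values.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def select(self, state):
        if self.rng.random() < self.eps:
            return self.rng.randrange(self.n_actions)  # explore
        qs = self.q[state]
        best = max(qs)
        # Break argmax ties uniformly at random.
        return self.rng.choice([a for a, v in enumerate(qs) if v == best])

    def decay_eps(self):
        self.eps = max(self.eps_min, self.eps * self.decay)
```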
As seen, the discretization of the RL state space is motivated by the use of tabular Q-learning, where the number of state-action pairs must remain sufficiently small to ensure stable learning under a limited generational budget. In this study, the state is therefore designed as a compact summary of the population dynamics using (i) the sign of hypervolume change (improved/unchanged/deteriorated), (ii) discretized feasibility ratio (low/medium/high), (iii) discretized diversity level (low/medium/high), and (iv) the search stage (early/middle/late). This design is not intended to represent every population statistic; rather, it provides a parsimonious, decision-relevant representation that captures the three core evolutionary goals, namely convergence, feasibility, and spread, while also accounting for early exploration or late exploitation. Coarser discretization also reduces noise sensitivity and avoids overfitting the control policy to small metric fluctuations, which is particularly important when the same controller is expected to generalize across different constrained MOO instances.
After applying action $a_t$ and completing one evolutionary generation to obtain the new population, the RL agent observes the newly updated metrics $HV_{t+1}$, $FR_{t+1}$, and $D_{t+1}$. To make the reward scale-invariant, we compute normalized gains. For the hypervolume gain:
$$\Delta HV_t = \frac{HV_{t+1} - HV_t}{|HV_t| + \epsilon_0},$$
for the feasibility gain:
$$\Delta FR_t = FR_{t+1} - FR_t,$$
and for the diversity gain:
$$\Delta D_t = \frac{D_{t+1} - D_t}{|D_t| + \epsilon_0},$$
where $\epsilon_0$ is a small constant preventing division by zero. The normalized quantities are typically clamped to $[-1, 1]$ to stabilize learning. The immediate reward ($r_t$) is then defined as a weighted sum of these three gains:
$$r_t = w_{HV} \, \Delta HV_t + w_{FR} \, \Delta FR_t + w_{D} \, \Delta D_t,$$
with non-negative weights in which $w_{HV}$ is set largest. This emphasizes hypervolume improvement while still rewarding better feasibility and diversity.
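A minimal sketch of this reward computation, with clamping to $[-1, 1]$, is shown below. The relative-gain normalization and the specific default weights (0.6/0.2/0.2) are illustrative assumptions consistent with the stated ordering (hypervolume weighted most heavily), not values confirmed by the text.

```python
def clamp(x, lo=-1.0, hi=1.0):
    """Clamp a normalized gain to [-1, 1] to stabilize learning."""
    return max(lo, min(hi, x))

def reward(hv_new, hv_old, fr_new, fr_old, d_new, d_old,
           w_hv=0.6, w_fr=0.2, w_d=0.2, eps=1e-12):
    """Weighted, scale-invariant reward (sketch; weights are assumed)."""
    g_hv = clamp((hv_new - hv_old) / (abs(hv_old) + eps))
    g_fr = clamp(fr_new - fr_old)  # feasibility ratio is already in [0, 1]
    g_d = clamp((d_new - d_old) / (abs(d_old) + eps))
    return w_hv * g_hv + w_fr * g_fr + w_d * g_d
```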
The reward is defined as a weighted sum of normalized gains in hypervolume, feasibility, and diversity to align the RL objective with standard performance criteria in constrained MOO. The higher weight assigned to hypervolume reflects its role as a unified indicator that simultaneously captures convergence and coverage of the Pareto front, whereas feasibility ratio and diversity are included to prevent degenerate behaviors (e.g., rapidly improving hypervolume by concentrating on infeasible regions or by collapsing diversity). The chosen weights therefore express the intended priority, that is, driving Pareto-front quality while maintaining feasibility and solution spread as necessary constraints on the search behavior. Importantly, feasibility and diversity retain nonzero weights so that the RL controller is explicitly discouraged from sacrificing feasibility and spread to obtain short-term hypervolume gains.
Finally, let $\alpha$ denote the learning rate, and $\gamma$ the discount factor. After observing the next state $s_{t+1}$ and reward $r_t$, the temporal difference (TD) target is:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'),$$
where $r_t$ is the immediate reward, and the second term, $\gamma \max_{a'} Q(s_{t+1}, a')$, represents the discounted estimate of future rewards. By incorporating both immediate and future rewards, the agent can assess the long-term consequences of its actions, enabling it to pursue strategies that maximize cumulative returns rather than focusing solely on short-term gains. In essence, an action is considered good not only when it yields an immediate reward but also when it leads to more favorable future states. The Q-value for the executed state-action pair $(s_t, a_t)$ is then updated as:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ y_t - Q(s_t, a_t) \big].$$
This update increases $Q(s_t, a_t)$ if the target return estimate exceeds the current Q-value, and decreases it otherwise. Over time, $Q(s_t, a_t)$ approximates the expected discounted return, guiding the RL agent to prefer parameter configurations that consistently improve the multi-objective search.
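The TD target and update translate directly into code; the following minimal sketch operates on a dictionary-based Q-table of the kind described above (the function name and `n_actions` default are our own).

```python
def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9, n_actions=4):
    """One tabular Q-learning (TD) update on a dict-of-lists Q-table."""
    for state in (s, s_next):
        q.setdefault(state, [0.0] * n_actions)
    target = r + gamma * max(q[s_next])    # TD target: immediate + discounted future
    q[s][a] += alpha * (target - q[s][a])  # move Q(s, a) toward the target
    return q[s][a]
```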
Beyond empirical performance gains (as Section 4 showcases), our proposed RL-NSGA-II-GRC method provides a theoretical contribution in the form of a principled coupling between dominance-based constrained MOO and RL-based adaptive control. The algorithm can be interpreted as a generational Markov decision process, where the population state is summarized by convergence, feasibility, diversity, and stage indicators, and the RL action defines an adaptive control policy over the evolutionary search dynamics. In particular, the constraint-tolerant dominance introduces a continuum between strict feasibility and controlled constraint relaxation, while the fractional survival sampling introduces a continuum between pure elitist survival and exploratory retention; together, these mechanisms formalize how selection pressure and feasibility enforcement can be adaptively scheduled across generations. Meanwhile, the GRC tie-breaker (in case of identical Pareto ranks) provides a weight-free discriminator within a dominance rank, ensuring that Pareto dominance is not overridden while enabling consistent preference toward ideal-point proximity in normalized objective space.
Additionally, from a computational complexity perspective, the proposed RL layer introduces an incremental generational overhead on top of the original NSGA-II, but it does not alter the dominant asymptotic cost terms of NSGA-II. In each generation, the RL agent computes a compact state/reward summary and then performs an ε-greedy action selection and a tabular Q-learning update. The Q-learning decision/update step is $O(1)$ per generation (a table lookup and a constant number of arithmetic operations). The added cost therefore comes mainly from computing the state/reward indicators: the feasibility ratio requires one pass over the population of size $N$ and is $O(N)$; the diversity indicator is computed over the current non-dominated set of size $n_t$ and is $O(n_t^2 m)$; and the hypervolume computation depends primarily on the number of objectives $m$ and is typically inexpensive for problems with few objectives. Separately, our GRC-enhanced tournament only adds $O(N)$ scalar score computations per generation (including the normalization and GRC calculation). The most computationally expensive part is still the NSGA-II backbone, which has the dominant time complexity of $O(mN^2)$ as reported in Abdel-Basset et al. [53]. As a result, compared with the original NSGA-II, the proposed RL-NSGA-II-GRC method only increases runtime modestly by a constant factor per generation rather than changing the overall computational order.
5. Application: Portfolio Optimization with NASDAQ-100 Constituents
To demonstrate the practical usefulness of the proposed RL-NSGA-II-GRC algorithm, we apply it to a real-world portfolio optimization problem. The goal is to design a long-only equity portfolio that simultaneously minimizes risk and maximizes expected return, using recent historical data for large-capitalization stocks from the NASDAQ-100 index.
We start from the 30 largest constituents by weight of the NASDAQ-100 (https://www.slickcharts.com/nasdaq100, accessed on 1 December 2025), and denote their tickers by $i = 1, \ldots, 30$. Daily adjusted closing prices are downloaded for each asset $i$ over the period from 1 June 2022 to 1 December 2025 using the yfinance Python library (https://pypi.org/project/yfinance/, accessed on 1 December 2025). To reduce high-frequency noise while retaining enough observations for estimation, the daily prices are resampled to weekly prices $P_{i,w}$ by taking the last trading day in each calendar week. From these weekly prices, we compute simple (percentage) weekly returns:
$$r_{i,w} = \frac{P_{i,w} - P_{i,w-1}}{P_{i,w-1}}, \quad i = 1, \ldots, n, \;\; w = 1, \ldots, W,$$
where $n$ is the number of assets that remain after cleaning and $W$ is the number of valid weeks. Assets with more than 10% missing price data are discarded; remaining gaps are handled by dropping weeks with any missing prices so that all retained series are time-aligned.
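The cleaning and return-computation steps can be sketched as follows. This is a pure-Python illustration operating on a hypothetical price dictionary that stands in for the yfinance download; the function name is our own.

```python
def weekly_returns(prices, max_missing=0.10):
    """Clean weekly price series and compute simple percentage returns (sketch).

    prices : dict ticker -> list of weekly closes (None marks a missing week)
    Drops assets with more than `max_missing` fraction of missing data, then
    drops any week in which a retained asset is still missing (time-aligned),
    then computes r_w = (P_w - P_{w-1}) / P_{w-1}.
    """
    n_weeks = len(next(iter(prices.values())))
    kept = {t: p for t, p in prices.items()
            if p.count(None) / n_weeks <= max_missing}
    # Keep only weeks where every retained series has a price.
    valid = [w for w in range(n_weeks)
             if all(p[w] is not None for p in kept.values())]
    returns = {}
    for t, p in kept.items():
        series = [p[w] for w in valid]
        returns[t] = [(series[w] - series[w - 1]) / series[w - 1]
                      for w in range(1, len(series))]
    return returns
```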
Then, using the weekly returns, we compute the mean return $\mu_i$ and the standard deviation $\sigma_i$ for each retained asset $i$, as well as the covariance $\sigma_{ij}$ between the returns of assets $i$ and $j$ (where $i \neq j$). We thus consider a portfolio of $n$ assets with non-negative weights $w_i$ (i.e., the fraction of total wealth held in asset $i$). We wish to minimize the risk ($f_1$) and maximize the expected return ($f_2$) of the portfolio simultaneously. Following the Markowitz mean–variance model [55], these two objective functions are thus defined as:
$$f_1(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i \, w_j \, \sigma_{ij}, \qquad f_2(w) = \sum_{i=1}^{n} w_i \, \mu_i.$$
The bounds and equality constraint applied to the decision variables (i.e., weights $w_i$) are:
$$0 \le w_i \le 1, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n} w_i = 1.$$
Here, it is worth mentioning that in the risk objective $f_1$, assets with low or negative covariance $\sigma_{ij}$ reduce the cross-product terms $w_i w_j \sigma_{ij}$, and thus lower overall risk for a given level of expected return. In other words, the covariance structure explicitly rewards combining assets that do not move together, so the risk objective naturally captures the diversification effect in modern portfolio theory. In addition, as noted earlier, any maximization objective (e.g., the expected return function here) can be readily converted into a minimization objective by multiplying its objective function by $-1$.
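The two objectives, together with a simple repair step that projects raw weights onto the simplex constraint, can be sketched as follows (a minimal illustration; the repair strategy is one common choice among several):

```python
def portfolio_objectives(w, mu, cov):
    """Markowitz objectives: risk f1 = w' Sigma w, expected return f2 = w' mu."""
    n = len(w)
    f1 = sum(w[i] * w[j] * cov[i][j] for i in range(n) for j in range(n))
    f2 = sum(w[i] * mu[i] for i in range(n))
    return f1, f2

def repair_weights(w):
    """Clip to non-negative and rescale so that the weights sum to one."""
    w = [max(0.0, x) for x in w]
    s = sum(w)
    return [x / s for x in w] if s > 0 else [1.0 / len(w)] * len(w)
```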
For this portfolio optimization problem with 2 objectives and 30 variables, we run the RL-NSGA-II-GRC algorithm for 1000 evolutionary generations, targeting a final set of 1000 Pareto-optimal solutions (although in the end, 989 solutions are found). The resulting Pareto-optimal front (also commonly known as the efficient frontier in a finance context) is presented in
Figure 5. As expected, the efficient frontier clearly exhibits the fundamental trade-off between risk and return, that is, achieving a higher expected return inevitably requires taking on higher risk, while lowering risk leads to a reduction in expected return. This pattern is consistent with finance theory, particularly the Markowitz mean–variance framework, which states that no portfolio can simultaneously minimize risk and maximize return. Instead, we face a set of optimal trade-off choices (i.e., solutions on the efficient frontier). Those located toward the lower-risk end of the frontier are more conservative, offering smaller but more stable returns. On the contrary, those located toward the upper-return end can potentially give us higher returns but also expose us to greater risks.
Additional observations can be drawn from the curvature and smoothness of the frontier in
Figure 5. The concave shape indicates diminishing marginal gains in return as risk increases, meaning that there is a progressively larger increment of risk to obtain each additional unit of expected return. Moreover, the dense distribution of Pareto-optimal solutions along the frontier showcases that the RL-NSGA-II-GRC method captures a wide range of efficient portfolio choices. As such, investors with different risk tolerances are able to identify suitable allocations of assets.
Furthermore, with this efficient frontier, we analyse the maximum Sharpe (tangency) portfolio and the utility-maximizing portfolios under different levels of risk aversion. We approximate the weekly risk-free rate $r_f$ using the 3-month U.S. Treasury bill (T-bill) rate, a standard proxy for the USD risk-free asset. As shown in Figure 6, over the period 1 June 2022–1 December 2025, the 3-month T-bill yield rose from near 1.0% to more than 5.0% and is currently around 4.0%. The arithmetic average of the 3-month T-bill yield over this period is 4.4578% per annum. The corresponding weekly rate $r_f$ is then calculated as 0.08395% (i.e., 0.0008395).
Next, we use $\mu_p$ and $\sigma_p$ to denote, respectively, the expected weekly return and the weekly standard deviation (note: the standard deviation $\sigma_p = \sqrt{f_1}$, not the variance $f_1$) of a candidate portfolio $p$ on the efficient frontier. For each $p$, we compute the weekly Sharpe ratio:
$$SR_p = \frac{\mu_p - r_f}{\sigma_p},$$
and identify the tangency portfolio $p^*$ as the one with the maximum Sharpe ratio:
$$p^* = \arg\max_p \; SR_p.$$
In this case study, the tangency portfolio $p^*$ attains the maximum weekly Sharpe ratio among all candidate portfolios on the frontier. Financially, this means that among all efficient risky portfolios, $p^*$ delivers the largest expected excess return (above $r_f$) per unit of risk.
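Selecting the tangency portfolio from a discrete set of frontier points is a one-pass scan; the sketch below also annualizes the weekly Sharpe ratio by $\sqrt{52}$, a common convention that we state here as an assumption rather than a detail from the text.

```python
def tangency_portfolio(frontier, rf):
    """Pick the maximum-Sharpe (tangency) portfolio from frontier points.

    frontier : list of (mu_p, sigma_p) weekly mean/std pairs
    rf       : weekly risk-free rate
    Returns (index, weekly Sharpe, annualized Sharpe).
    """
    sharpe = [(mu - rf) / sigma for mu, sigma in frontier]
    best = max(range(len(frontier)), key=sharpe.__getitem__)
    # Annualize the weekly Sharpe by sqrt(52) weeks (assumed convention).
    return best, sharpe[best], sharpe[best] * 52 ** 0.5
```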
With $p^*$, the capital market line (CML) can then be constructed:
$$\mu = r_f + \frac{\mu_{p^*} - r_f}{\sigma_{p^*}} \, \sigma,$$
where $\sigma$ is the portfolio standard deviation. The CML, together with the efficient frontier ($\mu_p$ vs. $\sigma_p$), is illustrated in Figure 7. As seen, the optimal risky portfolio $p^*$ corresponds to the tangency point between the efficient frontier and the CML.
Moreover, to reflect heterogeneous risk preferences, we also consider a quadratic utility function with risk aversion:
$$U(p) = \mu_p - \frac{\lambda}{2} \, \sigma_p^2,$$
where $\lambda > 0$ is the investor's risk-aversion coefficient (a larger $\lambda$ indicates stronger aversion to risk). For each prescribed $\lambda$, we evaluate $U(p)$ for all Pareto-optimal portfolios on the efficient frontier and select the utility-optimal portfolio:
$$p^*_{\lambda} = \arg\max_p \; U(p).$$
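Evaluating the utility-optimal portfolio for a given $\lambda$ is analogous to the tangency scan; the sketch below (with hypothetical frontier values) illustrates how a larger risk-aversion coefficient shifts the choice toward lower-risk portfolios.

```python
def utility_optimal(frontier, lam):
    """Select the frontier portfolio maximizing U = mu - (lam / 2) * sigma^2.

    frontier : list of (mu_p, sigma_p) weekly mean/std pairs
    lam      : risk-aversion coefficient (larger = more risk-averse)
    """
    utilities = [mu - 0.5 * lam * sigma ** 2 for mu, sigma in frontier]
    best = max(range(len(frontier)), key=utilities.__getitem__)
    return best, utilities[best]
```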
For instance, as demonstrated in Figure 7, we examine three values of $\lambda$, corresponding to low, moderate, and high risk aversion, respectively. As expected, the resulting utility-optimal portfolios exhibit decreasing risk and expected return as $\lambda$ increases; these portfolios are shown as the green, purple, and orange points, respectively, in Figure 7.
6. Discussion and Limitations
The numerical experiments on the Kursawe and CONSTR benchmarks, together with the NASDAQ-100 case study, show that our proposed RL-NSGA-II-GRC method is both effective and versatile for MOO problems. On the two benchmark problems with known true Pareto fronts, RL-NSGA-II-GRC consistently yields Pareto-optimal fronts that lie closer to the true front than those produced by the original NSGA-II, as reflected in the reduction in the CM by about 5.8% for Kursawe and 4.4% for CONSTR. At the same time, the constraint-tolerant dominance scheme and fractional sampling survival selection preserve a well-spread and diverse set of non-dominated solutions, instead of collapsing to a narrow region of the front. The RL agent plays a key role here: by adaptively tuning crossover and mutation probabilities, the constraint tolerance, and the front sampling fraction in response to hypervolume, feasibility, and diversity feedback, it automatically balances exploration and exploitation throughout the generations. This adaptive capability is significantly bolstered by the introduction of the GRC, which ensures that selection pressure simultaneously accounts for Pareto rank, crowding distance, and geometric proximity to ideal reference values. Consequently, the resulting Pareto-optimal fronts are not only closer to the theoretical optima but also smoother and more uniformly distributed, highlighting the synergy between dynamic RL agent control and GRC-based parent tournament selection.
Further, to examine the sensitivity of the RL-NSGA-II-GRC framework to different design choices, we additionally evaluated alternative discretization granularities and alternative reward-weight configurations. Specifically, we considered (i) finer and coarser binning for the feasibility and diversity levels, and (ii) multiple reward-weight triplets that varied the relative emphasis placed on hypervolume versus feasibility/diversity while keeping all other algorithmic components fixed. The overall trends observed in the benchmarks and the portfolio case indicate that the algorithm's performance is only mildly sensitive to moderate variations in these settings: the controller continues to learn consistent action preferences (e.g., stronger exploration early and stricter feasibility enforcement later), and the resulting Pareto-front quality remains comparable across a reasonable range of discretization and weight choices. Nevertheless, extremely coarse discretization may reduce the controller's ability to react to intermediate search conditions, whereas overly fine discretization may dilute learning by expanding the state space. Similarly, setting the feasibility/diversity weights too close to zero can weaken feasibility enforcement or reduce spread. These findings suggest that the proposed default settings provide a balanced trade-off, while the design remains tunable for problem-specific requirements.
For the time-complexity analysis of the proposed RL-NSGA-II-GRC against the original NSGA-II, as the population size increases, the dominant runtime within RL-NSGA-II-GRC is consistently driven by the non-dominated sorting of NSGA-II, i.e., on the order of $O(mN^2)$. The RL component adds $O(N)$ feasibility bookkeeping and up to $O(n_t^2)$ diversity computation per generation, plus an $O(1)$ tabular update. The GRC component only adds an inexpensive scalar score computation and does not change the dominant complexity term. Thus, the RL and GRC overheads remain secondary, and the overall computational order continues to be governed by the NSGA-II backbone. This is further corroborated by our empirical observations in both the benchmark mathematical problems and the NASDAQ portfolio optimization problem, where execution times for RL-NSGA-II-GRC and the original NSGA-II do not deviate significantly. Ultimately, the RL and GRC layers contribute only a modest constant-factor overhead rather than shifting the computational complexity of the algorithm.
Regarding performance relative to other state-of-the-art MOEAs (e.g., decomposition-based or indicator-based methods, as well as many-objective extensions), the present study does not claim that RL-NSGA-II-GRC universally dominates all alternatives; rather, it demonstrates that the proposed RL-controlled mechanisms and the GRC tie-breaking strategy can significantly strengthen the widely used NSGA-II backbone for MOO problems. Importantly, the proposed framework is largely modular, which means that the RL agent (our state-action-reward design) and the GRC-based ranking can, in principle, be integrated with other evolutionary algorithms by treating their control components (e.g., variation strength, constraint relaxation, or selection pressure) as RL actions and using convergence, feasibility, and diversity indicators as feedback. For hybrid RL-based optimizers, RL-NSGA-II-GRC differs in that it preserves a Pareto-first ranking logic and uses GRC when dominance information is insufficient, while also explicitly controlling constraint tolerance and survival pressure across generations. A comprehensive head-to-head comparison with additional state-of-the-art MOEAs and recent RL-assisted optimization algorithms is therefore a valuable direction for future work and will be pursued in subsequent studies.
The real-world NASDAQ-100 portfolio application further underlines the practical value of the proposed RL-NSGA-II-GRC. The algorithm constructs a smooth, concave efficient frontier in the mean–variance space, with well-populated trade-off regions that make it easy to identify the tangency (i.e., maximum Sharpe ratio) portfolio and the utility-optimal portfolios for different levels of risk aversion. This indicates that the RL-guided NSGA-II with GRC is able to navigate a realistically constrained financial search space and uncover portfolios that are both attractive and interpretable for decision makers. However, this work is not without limitations. One limitation lies in the design of the RL agent. Our current implementation adopts a tabular Q-learning agent with a discrete state-action space, which is transparent and effective for all the problems studied here, but could be further extended through function approximation (e.g., deep RL) and more expressive state features to handle even higher-dimensional, noisier, or more dynamically changing MOO environments. This would be a promising direction for future work.
7. Conclusions
In conclusion, this study proposed an RL-guided NSGA-II method enhanced with GRC (RL-NSGA-II-GRC) to address constrained MOO problems and demonstrated its application to NASDAQ portfolio optimization. The method integrated a tabular Q-learning agent into the NSGA-II backbone to adaptively control key evolutionary parameters, including crossover probability, mutation strength, constraint tolerance, and front sampling fraction, based on population-level feedback from hypervolume, feasibility ratio, and diversity indicators. In parallel, a GRC-enhanced binary tournament selection operator was designed to assess potential parents using a unified performance score that simultaneously accounted for dominance rank, crowding distance, and geometric proximity to ideal objective values, thereby strengthening selection pressure toward high-quality, well-distributed solutions.
The numerical experiments on the Kursawe and CONSTR benchmark MOO problems showed that RL-NSGA-II-GRC achieved better convergence to the true Pareto-optimal fronts than the original NSGA-II, with CM improvements of approximately 5.8% and 4.4%, respectively, while preserving a diverse and well-spread set of non-dominated solutions. In the NASDAQ portfolio case study, the proposed method produced a smooth, concave efficient frontier in the mean–variance space with a dense distribution of Pareto-optimal portfolios, enabling clear identification of the portfolio with maximum Sharpe ratio and utility-optimal portfolios for different levels of risk aversion. These results indicated that the combination of RL agent, constraint-tolerant dominance, fractional front sampling, and GRC-based selection effectively balanced exploration and exploitation and was capable of navigating realistically constrained financial search spaces to yield practically meaningful solutions. Although the proposed framework performed well across both benchmark and real-world problems, there remained room for further enhancement. In particular, the tabular Q-learning agent with a discrete state-action space, while transparent and effective for the settings studied, could be extended to more expressive function-approximation schemes and richer state representations to better handle higher-dimensional, noisier, or more dynamically evolving optimization environments. Future research may also investigate alternative reward designs, additional performance indicators, and broader classes of risk measures and portfolio constraints, as well as applications of RL-NSGA-II-GRC to other optimization and decision-making domains where constrained multi-objective trade-offs are of central importance.