1. Introduction
Multi-objective optimization (MOO) arises naturally in many real-world optimization and decision problems where multiple conflicting goals must be balanced simultaneously under practical constraints [1]. In finance, the classical trade-off between risk and return in portfolio management is a canonical example, but similar structures appear in many other domains, such as engineering design, logistics, healthcare, and energy systems. Decision-makers are often interested not in a single optimal solution, but in a diverse set of Pareto-optimal solutions (i.e., non-dominated solutions forming the Pareto-optimal front) that reveal the trade-offs among objectives. MOO can generate this set of equally good optimal solutions from the perspective of the objectives considered [2]. Designing algorithms that can efficiently approximate such Pareto-optimal fronts, especially in the presence of complex feasibility regions, remains a central challenge.
In the financial domain, the Markowitz mean–variance framework [3] established the foundation for quantitative portfolio selection by posing the problem as a bi-objective optimization of expected return and variance. Building on this, multi-objective evolutionary algorithms, such as the non-dominated sorting genetic algorithm II (NSGA-II) proposed by Deb et al. [4], have become popular tools for finding Pareto-optimal fronts (also known as efficient frontiers in finance) in portfolio optimization and other domains, owing to their population-based search and ability to handle non-convex, discontinuous Pareto sets. However, when constraints are tight or the feasible region is narrow, the original NSGA-II with fixed parameters and a purely feasibility- or penalty-based constraint-handling strategy may suffer from slow convergence, premature loss of diversity, or excessive sampling of infeasible solutions, which reduces both the quality and interpretability of the resulting Pareto-optimal fronts.
A key factor underlying these issues is that evolutionary algorithms are typically run with static parameter settings and rigid constraint-handling rules. In practice, the balance between exploration and exploitation, or between searching in feasible versus infeasible regions, changes over the course of evolutionary generations. Early in the search, a higher degree of exploration and a more tolerant attitude toward constraint violations may be beneficial for discovering promising regions, whereas in later generations a stronger emphasis on convergence and feasibility is desirable. Fixed crossover and mutation probabilities, fixed constraint tolerance, and rigid survival selection can therefore be suboptimal, especially in constrained MOO problems where diversity and feasibility must be managed concurrently.
In parallel, recent advances in reinforcement learning (RL) and data-driven multi-criteria decision-making (MCDM) methods offer new opportunities for dynamic, feedback-driven control within evolutionary frameworks. RL provides a natural way to model the adaptive adjustment of algorithmic parameters as a sequential decision problem, where the RL agent observes indicators of convergence, feasibility, and diversity and chooses parameter values to maximize long-term performance. MCDM, in turn, is typically an effective approach in multi-criteria and uncertain decision-making environments [5]. Gray relational coefficients (GRC), a popular MCDM concept, provide a flexible mechanism for aggregating multiple indicators into a performance score that can guide selection and ranking [6]. Nonetheless, the integration of an RL agent with GRC-enhanced selection within NSGA-II has remained relatively unexplored, particularly in the context of constrained multi-objective portfolio optimization.
To address these gaps, this work proposes an RL-guided NSGA-II method enhanced with GRC (abbreviated as RL-NSGA-II-GRC) for solving MOO problems, and applies it to NASDAQ portfolio optimization.
Figure 1 shows the flowchart of the RL-NSGA-II-GRC method; the detailed description and corresponding equations are presented in Section 3 of this article.
The main contributions of this work are three-fold. First, we propose a novel RL-NSGA-II-GRC method that integrates an RL agent into the evolutionary framework to adaptively control key parameters, including crossover probability, mutation strength, constraint tolerance, and front sampling fraction; by utilizing population-level metrics (i.e., hypervolume, feasibility ratio, and diversity) as state inputs, the algorithm dynamically balances the trade-off between exploration and exploitation throughout the optimization process. Second, we introduce a GRC-enhanced binary tournament selection operator that evaluates potential parent solutions based on their geometric proximity to an ideal reference solution. This mechanism serves as a comprehensive performance indicator that simultaneously accounts for dominance rank, crowding distance, and objective values, thereby improving selection pressure and guiding the search toward the Pareto-optimal front more effectively. Third, we demonstrate the effectiveness and practicality of RL-NSGA-II-GRC on two mathematical benchmark MOO problems (i.e., the Kursawe and CONSTR problems), achieving convergence improvements of about 5.8% and 4.4% over the original NSGA-II, and on a real-world NASDAQ-100 portfolio optimization case study, where the method yields a smooth and well-populated efficient frontier that supports the identification of critical investment points, such as the maximum Sharpe ratio portfolio and utility-optimal portfolios, providing actionable insights for investors with varying risk preferences.
The rest of this article is structured as follows. Section 2 reviews the relevant literature on MOO, NSGA-II, RL, GRC, and portfolio optimization. Section 3 presents the proposed RL-NSGA-II-GRC methodology in detail, including the constraint-tolerant dominance rule, the GRC-enhanced binary tournament selection, the performance indicators used to evaluate RL actions (i.e., hypervolume, feasibility ratio, and diversity), and the Q-learning control agent. Section 4 reports numerical results on the Kursawe and CONSTR benchmark MOO problems, and Section 5 presents the NASDAQ portfolio optimization case study. Section 6 discusses the findings and limitations of the proposed approach, and Section 7 concludes the paper and outlines directions for future research.
2. Literature Review
As aforementioned, MOO involves optimizing two or more conflicting objectives simultaneously. Instead of a single optimal solution, MOO produces a Pareto-optimal set of solutions, where no objective can be improved without worsening another [7]. This is highly relevant in finance, as portfolio problems naturally involve trade-offs, for instance, maximizing return while minimizing risk [8]. Early methods for MOO often relied on scalarization techniques (e.g., weighted sums) to combine objectives [9], but these require subjective weight tuning and may miss certain Pareto-optimal solutions. Modern approaches also use evolutionary algorithms or other heuristics to approximate the Pareto-optimal front. Additionally, Majumder [10] provides a comprehensive treatment of single- and multi-objective network optimization models under diverse uncertain environments, where uncertainty is represented through paradigms such as expected-value, chance-constrained, and dependent chance-constrained formulations. The study further discusses how these uncertain models can be converted into crisp equivalents and solved using both classical optimization procedures and multi-objective solution methodologies, including the global criterion method, the ε-constraint method, and evolutionary algorithms. MOO has become an essential tool across disciplines [11]; beyond finance, it is applied in engineering [12], supply chain management [13], healthcare [14], and more. In the context of portfolio optimization, formulating the problem as an MOO acknowledges the inherent conflicts and seeks an efficient frontier of portfolios offering different trade-offs. For example, a bi-objective portfolio model can optimize for return and risk concurrently [15], yielding a spectrum of portfolios from low-risk/low-return to high-risk/high-return options. This multi-objective view, pioneered by the Markowitz mean–variance framework [3], underpins much of modern portfolio theory.
One of the most influential techniques for MOO is NSGA-II, introduced by Deb et al. [4]. NSGA-II is a widely adopted evolutionary algorithm that uses a fast non-dominated sorting approach and a crowding distance mechanism to maintain solution diversity on the Pareto front [16]. First, its population is partitioned into Pareto fronts using non-dominated sorting, where solutions in the first front are non-dominated by any other solution, the second front is dominated only by members of the first front, and so on. Each solution is assigned a rank equal to its front index, where a smaller rank indicates a better Pareto status [4]. Second, to maintain diversity along the Pareto front, NSGA-II computes the crowding distance within each front by sorting solutions along each objective and estimating the local density based on the distances to nearest neighbors in objective space; solutions with larger crowding distance are preferred to encourage a well-spread approximation set. Third, NSGA-II adopts an elitist replacement strategy by combining parent and offspring populations and selecting the next generation based on rank first and crowding distance second, typically implemented via a rank-and-crowding comparison in tournament selection. These mechanisms collectively yield a fast and effective baseline for MOO, with a time complexity of $O(MN^2)$, where $M$ is the number of objectives and $N$ is the population size [17].
Over the last two decades, NSGA-II has attracted extensive research interest and remains one of the most commonly used methods for MOO problems [18]. Its popularity stems from its efficiency and robust performance across many domains [19]. The recent comprehensive review by Ma et al. [20] confirms that NSGA-II and its variants are widely adopted and continue to be a cornerstone of MOO studies. In the finance literature, NSGA-II has been frequently used to solve portfolio optimization problems, generating Pareto-optimal portfolios under multiple objectives. For instance, researchers have applied NSGA-II to identify efficient frontiers when considering objectives such as return, risk, and other goals (e.g., liquidity, carbon footprint) in portfolio selection. Comparisons show that NSGA-II often performs competitively with other metaheuristics in this domain [17]. NSGA-II's elitist strategy makes it a reliable choice for tackling the multi-objective nature of portfolio optimization. Recent works continue to extend NSGA-II (e.g., with hybrid deep learning models or problem-specific improvements) to enhance its effectiveness on complex, large-scale portfolio problems [21]. Beyond classical NSGA-II, recent studies have proposed autonomous constrained MOO frameworks that adapt search behavior based on process knowledge and population feasibility states, including the process knowledge-guided autonomous evolutionary optimization method [22] and the population feasibility state guided autonomous constrained algorithms [23], where feasibility-related state information is explicitly modeled to guide adaptive control (e.g., via learned operator decisions). These works are closely related to our study in emphasizing feasibility-aware and adaptive guidance for constrained MOO.
RL is a paradigm of machine learning in which an agent learns to make sequential decisions through interactions with an environment. Formally, the problem is often modeled as a Markov decision process, and the agent's goal is to learn a policy that maximizes cumulative reward. RL differs from conventional static optimization or supervised learning in that it explicitly considers the feedback loop of decisions and outcomes over time [24]. In recent years, RL has demonstrated effectiveness across a wide range of complex decision-making problems and has attracted researchers in many fields. From a theoretical point of view, Gu et al. [25] surveyed the methods, theories, and applications of safe RL. Milani et al. [26] provided a comprehensive survey of explainable RL, whose objective is to explain the decision-making process of RL agents in sequential decision-making settings. Moos et al. [27] reviewed the literature on robust approaches to RL from four perspectives, namely, transition robustness, disturbance robustness, action robustness, and observation robustness. Yan et al. [28] studied a self-adapting Q-learning-based inter-layer feedback mechanism in multilayer networks, where an RL agent in the management layer learns when to apply punishment to trigger game transitions and thereby promote the evolution of cooperation.
For domain-specific RL applications, Panzer and Bender [29] systematically reviewed applications of deep RL in the optimization of production systems, aimed at overcoming challenges such as shorter product development cycles and increasing product customization. Rolf et al. [30] reviewed RL algorithms and applications in supply chain management, including areas such as inventory management. Kayhan and Yildiz [31] surveyed applications of RL in machine scheduling problems, such as job shop scheduling and unrelated parallel machine scheduling. In finance, and portfolio management in particular, RL has garnered significant attention in the past decade as a tool to assist portfolio optimization [32,33]. In addition, Song et al. [34] proposed a genetic algorithm based on RL for the electromagnetic detection satellite scheduling problem. Song et al. [35] also conducted a comprehensive survey on RL-assisted evolutionary algorithms, proposing a taxonomy of integration schemes, analyzing RL-assisted strategies for solution generation, objective modeling, and operator and parameter adaptation, and demonstrating their performance and research challenges across benchmark problems and real-world applications. Guo et al. [36] proposed an RL-assisted genetic programming algorithm for the team formation problem, in which an RL agent adaptively selects among multiple population search modes.
Introduced by Deng [37], gray system theory provides tools for decision-making and analysis under uncertainty. A key technique from this theory is gray relational analysis, which employs GRC to quantify the similarity or relationship between data series [38]. In essence, it measures how closely an alternative's attributes match an ideal target sequence by examining the geometric proximity of their data curves. The alternative that is closer to the reference solution receives a higher ranking [39]. One advantage of gray relational techniques is that they can work with small and/or incomplete data sets by extracting useful information from the partially known data [40]. Multiple variants of gray relational analysis exist in the MCDM literature. The specific variant adopted in this work, following Song and Jamalipour [41] and Martinez-Morales et al. [42], is distinguished by its independence from predefined criterion weights.
In portfolio optimization, investors seek to maximize expected return for a given level of variance (denoting risk or uncertainty in financial management), or equivalently to minimize risk for a target return, ultimately leading to an efficient frontier of optimal portfolios. The Markowitz framework formalized the risk–return trade-off and emphasized diversification benefits (i.e., considering covariances among assets) [17]. Over time, numerous extensions and alternative formulations have been proposed on top of the original framework [43]. For example, Konno and Yamazaki [44] replaced variance with mean absolute deviation (MAD) to create a linear programming model more tractable for large asset universes, while Speranza [45] proposed mean absolute semi-deviation (MASD), focusing only on downside volatility. Rockafellar and Uryasev [46] introduced conditional value-at-risk (CVaR) as a coherent risk measure targeting tail losses, which has since been widely adopted in portfolio optimization. These developments reflect a broadening of risk measures, aligning the optimization with investors' true risk concerns, that is, avoiding large losses. Traditional quadratic programming and other exact solvers struggle with such complex real-world optimization problems, especially when multiple objectives are present [24]. To tackle this, researchers have turned to metaheuristics and artificial intelligence (AI)-based methods. Global search algorithms like NSGA-II and other evolutionary algorithms have been extensively applied to portfolio problems with promising results [17]. These methods can accommodate non-linear, discontinuous objective landscapes and find near-optimal solutions under complex constraints [47]. For instance, there is evidence that evolutionary algorithms and swarm intelligence methods can efficiently construct diversified portfolios that classical solvers might miss [48]. Anagnostopoulos and Mamanis [49] found that an evolutionary approach outperformed exact solvers on a constrained portfolio selection benchmark, and many subsequent studies have confirmed the utility of such heuristics for multi-objective portfolio optimization [50]. Recently, learnheuristic [51] or hybrid methods have emerged, which combine machine learning models with metaheuristic optimization algorithms. For example, Joshi and Dhodiya [21] recently proposed a hybrid deep learning and evolutionary algorithm framework for many-objective portfolio decisions, illustrating the trend of blending AI techniques with metaheuristic algorithms.
Based on our comprehensive literature review, although RL-assisted multi-objective evolutionary algorithms (MOEAs) have been increasingly studied, a substantial portion of existing efforts mainly focuses on operator/parameter control (e.g., adapting crossover/mutation rates) or selecting among predefined search modes, while keeping the feasibility logic and survival selection mechanism largely unchanged. In contrast, the proposed RL-NSGA-II-GRC framework differs fundamentally in that the RL agent intervenes not only in variation parameters but also in constraint handling and survival selection pressure within the NSGA-II backbone. Specifically, the RL agent jointly controls (i) the variation parameters, (ii) a generation-wise constraint tolerance that defines a constraint-tolerant dominance relation for constrained MOO, and (iii) a fractional front-sampling survival mechanism that regulates selection pressure across Pareto fronts. Moreover, rather than replacing Pareto dominance with a single scalar fitness, the proposed method preserves a Pareto-first ranking logic and introduces the GRC as a discriminator when dominance information is insufficient, thereby maintaining the key semantics of non-dominated sorting while improving decision consistency during selection.
3. Methodology
This section presents the proposed RL-NSGA-II-GRC method in a general MOO setting, as illustrated in the flowchart in Figure 1. For clarity, we first formulate the generic MOO problem, followed by a comprehensive description of each methodological component. Specifically, we detail: (1) the fundamental operations of NSGA-II, such as the mathematical definitions of simulated binary crossover, polynomial mutation, non-dominated sorting, and crowding distance calculation; (2) the integration of GRC into the tournament selection operator, enabling selection pressure that simultaneously accounts for dominance rank, crowding distance, and geometric proximity to ideal objective values; (3) the computation of the population-level performance indicators (including hypervolume, feasibility ratio, and diversity) that serve as quantitative feedback on optimization progress (via the reward of the RL agent's action) and as state inputs for the RL agent; (4) the design of the RL control layer, which adaptively takes actions and adjusts evolutionary parameters (e.g., crossover probability, mutation strength, constraint tolerance, and front sampling ratio) based on learned value estimates from the environment, gradually improving the algorithm's balance between exploration and exploitation throughout the evolutionary process of finding the Pareto-optimal solutions (i.e., the Pareto-optimal front). The complete RL-NSGA-II-GRC method is implemented in Python (version 3.12.7). The code is available at no cost to interested readers by contacting the corresponding authors of this paper.
We consider a generic MOO problem involving two or more objective functions to be minimized (i.e., $M \ge 2$). Note that any maximization objective can be readily converted to a minimization form by multiplying its objective function by $-1$ (negative one). The problem is defined over a decision variable vector $\mathbf{x} = (x_1, x_2, \dots, x_n)^{T}$, subject to lower and upper bounds on each decision variable, as well as a set of inequality and/or equality constraints. The mathematical formulation of MOO is as follows:

$$\min_{\mathbf{x} \in \Omega} \; F(\mathbf{x}) = \big(f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_M(\mathbf{x})\big),$$

subject to inequality constraints

$$g_j(\mathbf{x}) \le 0, \quad j = 1, 2, \dots, J,$$

and/or equality constraints

$$h_k(\mathbf{x}) = 0, \quad k = 1, 2, \dots, K,$$

and bound constraints

$$x_i^{L} \le x_i \le x_i^{U}, \quad i = 1, 2, \dots, n,$$

where $\mathbf{x}$ is a decision vector in the search space $\Omega \subseteq \mathbb{R}^{n}$, and $F(\mathbf{x})$ collects the total of $M$ objective functions to be minimized.
To quantify constraint violation for a solution $\mathbf{x}$, we use the following scalar constraint violation:

$$CV(\mathbf{x}) = \sum_{j=1}^{J} \max\big(0,\, g_j(\mathbf{x})\big) + \sum_{k=1}^{K} \big|h_k(\mathbf{x})\big|.$$

Thus, $CV(\mathbf{x}) = 0$ if and only if all constraints are satisfied, with larger values indicating greater constraint violation.
For dominance, a solution $\mathbf{x}^{(1)}$ is said to Pareto-dominate another solution $\mathbf{x}^{(2)}$ (denoted $\mathbf{x}^{(1)} \prec \mathbf{x}^{(2)}$) if and only if

$$f_m(\mathbf{x}^{(1)}) \le f_m(\mathbf{x}^{(2)}) \;\; \forall\, m \in \{1, \dots, M\} \quad \text{and} \quad \exists\, m' : f_{m'}(\mathbf{x}^{(1)}) < f_{m'}(\mathbf{x}^{(2)}).$$

In other words, $\mathbf{x}^{(1)}$ Pareto-dominates $\mathbf{x}^{(2)}$ if $\mathbf{x}^{(1)}$ is no worse than $\mathbf{x}^{(2)}$ in all objectives, and at the same time is strictly better than $\mathbf{x}^{(2)}$ in at least one of the objectives.
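The dominance test above translates directly into code. The following is a minimal Python sketch (an illustration under the definitions above, not the authors' released implementation; the name `dominates` is ours):

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (all objectives minimized):
    f1 is no worse than f2 in every objective and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better
```

Note that two identical vectors do not dominate each other, and two vectors that each win on a different objective are mutually non-dominated.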
3.1. NSGA-II Backbone with Dynamic Constraint-Tolerant Dominance and Fractional Sampling Survival Selection
During the evolutionary process of NSGA-II, each individual solution in the population corresponds to a decision vector $\mathbf{x}$, augmented with its objective vector $F(\mathbf{x})$, constraint violation $CV(\mathbf{x})$, Pareto rank (e.g., rank-0, rank-1), crowding distance that measures sparsity, and additional bookkeeping for dominance (i.e., domination count and dominated-solutions set).
3.1.1. Constraint-Tolerant Dominance
In the original NSGA-II, the constrained-domination rules of Deb et al. [4] enforce strict feasibility: any feasible solution is preferred over any infeasible one, and if both are infeasible, the one with the smaller violation is preferred. In our proposed RL-NSGA-II-GRC method, this is generalized by introducing a dynamic, generation-dependent constraint tolerance $\varepsilon_t$, chosen by the RL agent at generation $t$.

For an individual solution $\mathbf{x}$ with violation $CV(\mathbf{x})$, we define the effective violation:

$$CV_{\mathrm{eff}}(\mathbf{x}) = \max\big(0,\, CV(\mathbf{x}) - \varepsilon_t\big).$$

Any solution whose violation is less than or equal to $\varepsilon_t$ is treated as effectively feasible in generation $t$. This allows a certain degree of controlled relaxation of constraints to help the search explore across infeasible regions. Theoretically, $\varepsilon_t$ is defined as a non-negative relaxation threshold, i.e., $\varepsilon_t \ge 0$. In practice, $\varepsilon_t$ is applied as a small tolerance level to determine whether a solution is treated as effectively feasible when computing $CV_{\mathrm{eff}}(\mathbf{x})$. Accordingly, $\varepsilon_t = 0$ corresponds to the standard strict feasibility case (i.e., no relaxation), while larger values allow controlled relaxation by treating violations up to $\varepsilon_t$ as acceptable during search. In our Python implementation, $\varepsilon_t$ is decided adaptively by the RL agent controller with a value up to 0.5 (this is also tunable), so that the relaxation level remains bounded.
Given two individual solutions $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, the constraint-tolerant dominance relation $\prec_{\varepsilon_t}$ is therefore defined by the following three scenarios. This relation is used in all dominance-based operations at generation $t$.

First, if one solution is effectively feasible and the other is not, then the effectively feasible solution dominates:

$$CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) = 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) > 0 \;\;\Rightarrow\;\; \mathbf{x}^{(1)} \prec_{\varepsilon_t} \mathbf{x}^{(2)}.$$

Second, if both solutions are effectively infeasible, then the solution with the smaller effective violation is preferred:

$$CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) > 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) > 0 \;\wedge\; CV_{\mathrm{eff}}(\mathbf{x}^{(1)}) < CV_{\mathrm{eff}}(\mathbf{x}^{(2)}) \;\;\Rightarrow\;\; \mathbf{x}^{(1)} \prec_{\varepsilon_t} \mathbf{x}^{(2)}.$$

Third, if both solutions satisfy the tolerance, that is, $CV(\mathbf{x}^{(1)}) \le \varepsilon_t$ and $CV(\mathbf{x}^{(2)}) \le \varepsilon_t$, then the dominance determination reduces to the standard Pareto dominance rule as in Equation (6).
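The three scenarios combine naturally into a single comparator. The sketch below is our own illustration of the rule (the function name `eps_dominates` and its argument layout are not from the paper):

```python
def eps_dominates(f1, cv1, f2, cv2, eps):
    """Constraint-tolerant dominance at tolerance eps: does solution 1 dominate
    solution 2?  f1/f2 are objective vectors, cv1/cv2 scalar constraint violations."""
    v1 = max(0.0, cv1 - eps)  # effective violation of solution 1
    v2 = max(0.0, cv2 - eps)  # effective violation of solution 2
    if v1 == 0.0 and v2 > 0.0:   # scenario 1: only solution 1 effectively feasible
        return True
    if v1 > 0.0 and v2 == 0.0:   # scenario 1 mirrored: solution 1 cannot dominate
        return False
    if v1 > 0.0 and v2 > 0.0:    # scenario 2: both effectively infeasible
        return v1 < v2
    # scenario 3: both effectively feasible -> standard Pareto dominance
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))
```

Setting `eps = 0` recovers the strict constrained-domination rule of the original NSGA-II.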
3.1.2. Non-Dominated Sorting
Given a population of size $N$, non-dominated sorting partitions it into fronts with increasing rank. For each individual $i$, we define the set of solutions it dominates,

$$S_i = \{\, j \;:\; i \prec_{\varepsilon_t} j \,\},$$

and the number of solutions that dominate it,

$$n_i = \big|\{\, j \;:\; j \prec_{\varepsilon_t} i \,\}\big|.$$

Here, the first front (rank-0) is:

$$F_0 = \{\, i \;:\; n_i = 0 \,\}.$$

Subsequent fronts are built iteratively: for each member $i$ of the current front $F_k$, the domination count $n_j$ of every $j \in S_i$ is decremented, and

$$F_{k+1} = \{\, j \;:\; n_j = 0 \text{ after processing } F_k \,\},$$

and individuals in $F_k$ are assigned rank $k$. The process continues until all individuals are assigned to a front.
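The iterative construction above can be sketched as follows. This is an illustrative Python version of fast non-dominated sorting (our own helper, parameterized by an arbitrary dominance relation so the constraint-tolerant variant can be plugged in):

```python
def non_dominated_sort(objs, dominates):
    """Partition solutions (a list of objective vectors) into fronts of indices.
    `dominates(a, b)` is the (possibly constraint-tolerant) dominance relation."""
    n = len(objs)
    S = [[] for _ in range(n)]   # S[i]: indices of solutions dominated by i
    counts = [0] * n             # counts[i]: number of solutions dominating i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                S[i].append(j)
            elif dominates(objs[j], objs[i]):
                counts[i] += 1
        if counts[i] == 0:
            fronts[0].append(i)  # rank-0: dominated by nobody
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:      # remove front k and expose the next front
            for j in S[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]           # drop the trailing empty front
```

The nested comparison loop gives the $O(MN^2)$ complexity quoted in the literature review.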
3.1.3. Crowding Distance Calculation
Within each front $F_k$, a crowding distance is computed to measure the sparsity of solutions in objective space and to encourage a well-spread Pareto front. Let $L = |F_k|$. Initialize all distances:

$$d_i = 0, \quad i \in F_k.$$

Then, for each objective $m$, sort the front with respect to objective $m$:

$$f_m^{(1)} \le f_m^{(2)} \le \dots \le f_m^{(L)}.$$

The boundary points are assigned infinite distance:

$$d_{(1)} = d_{(L)} = \infty.$$

For internal points $l = 2, \dots, L-1$, the contribution of objective $m$ to the crowding distance is:

$$\Delta d_{(l)}^{m} = \frac{f_m^{(l+1)} - f_m^{(l-1)}}{f_m^{\max} - f_m^{\min}}.$$

The total crowding distance for individual $i$ is the sum across the $M$ objectives:

$$d_i = \sum_{m=1}^{M} \Delta d_i^{m}.$$

A larger $d_i$ indicates that the solution is located in a sparsely populated region of the front and thus should be preferred for diversity.
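A compact Python sketch of this calculation for a single front follows (our own illustrative helper, not the paper's code; degenerate objectives with zero range are skipped by assumption):

```python
def crowding_distance(front_objs):
    """Crowding distances for one front (list of objective vectors, minimization)."""
    L, M = len(front_objs), len(front_objs[0])
    d = [0.0] * L
    for m in range(M):
        order = sorted(range(L), key=lambda i: front_objs[i][m])
        d[order[0]] = d[order[-1]] = float('inf')   # boundary points: infinite distance
        span = front_objs[order[-1]][m] - front_objs[order[0]][m]
        if span == 0:
            continue  # all values equal for this objective: no contribution
        for pos in range(1, L - 1):
            i = order[pos]
            if d[i] != float('inf'):
                # neighbor gap along objective m, normalized by the objective range
                d[i] += (front_objs[order[pos + 1]][m]
                         - front_objs[order[pos - 1]][m]) / span
    return d
```

Boundary solutions keep infinite distance so they always survive diversity-based truncation.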
3.1.4. Simulated Binary Crossover (SBX)
NSGA-II uses crossover and mutation to generate offspring from selected parents. The key steps of the simulated binary crossover (SBX) are briefly presented as follows. Given two parents $\mathbf{x}^{(p_1)}$ and $\mathbf{x}^{(p_2)}$, SBX simulates, in real-valued variables, a recombination similar to single-point crossover in binary representation. Let the crossover probability be $p_c$ and the distribution index be $\eta_c$. Both $p_c$ and $\eta_c$ are dynamically controlled by the RL agent in our RL-NSGA-II-GRC method.

Here, for each decision variable $i$, with probability $1 - p_c$, the parents' genes (i.e., parents' decision variable values) are directly copied to the offspring:

$$c_i^{(1)} = x_i^{(p_1)}, \quad c_i^{(2)} = x_i^{(p_2)}.$$

Otherwise, SBX is performed. Let $y_1 = \min\big(x_i^{(p_1)}, x_i^{(p_2)}\big)$ and $y_2 = \max\big(x_i^{(p_1)}, x_i^{(p_2)}\big)$, and the lower and upper bounds of the decision variable still apply as in the bound constraints of the problem formulation, that is, $x_i^{L} \le x_i \le x_i^{U}$. Then, uniformly draw a random number $u \sim U(0, 1)$, and compute:

$$\beta_1 = 1 + \frac{2\,(y_1 - x_i^{L})}{y_2 - y_1}, \quad \alpha_1 = 2 - \beta_1^{-(\eta_c + 1)}, \quad \beta_{q1} = \begin{cases} (u\,\alpha_1)^{\frac{1}{\eta_c + 1}}, & u \le \dfrac{1}{\alpha_1}, \\[6pt] \left(\dfrac{1}{2 - u\,\alpha_1}\right)^{\frac{1}{\eta_c + 1}}, & \text{otherwise}. \end{cases}$$

The first child gene for the $i$-th variable is obtained as:

$$c_i^{(1)} = 0.5\,\big[(y_1 + y_2) - \beta_{q1}\,(y_2 - y_1)\big].$$

On the other side of the interval,

$$\beta_2 = 1 + \frac{2\,(x_i^{U} - y_2)}{y_2 - y_1}, \quad \alpha_2 = 2 - \beta_2^{-(\eta_c + 1)},$$

and with the same $u$,

$$\beta_{q2} = \begin{cases} (u\,\alpha_2)^{\frac{1}{\eta_c + 1}}, & u \le \dfrac{1}{\alpha_2}, \\[6pt] \left(\dfrac{1}{2 - u\,\alpha_2}\right)^{\frac{1}{\eta_c + 1}}, & \text{otherwise}. \end{cases}$$

The second child gene for the $i$-th variable is obtained as:

$$c_i^{(2)} = 0.5\,\big[(y_1 + y_2) + \beta_{q2}\,(y_2 - y_1)\big].$$

Both $c_i^{(1)}$ and $c_i^{(2)}$ are finally clipped to their bounds:

$$c_i^{(\cdot)} \leftarrow \min\big(\max\big(c_i^{(\cdot)},\, x_i^{L}\big),\, x_i^{U}\big).$$
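A bounded SBX step for one parent pair can be sketched as follows. This is an illustrative Python version under the equations above (the function name `sbx_pair`, the default parameter values, and the tolerance guard for near-identical genes are our own choices, not the paper's):

```python
import random

def sbx_pair(p1, p2, lo, hi, pc=0.9, eta_c=15.0):
    """Bounded simulated binary crossover on two real-valued parent vectors."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(p1)):
        # with probability 1 - pc (or for near-identical genes) copy genes unchanged
        if random.random() > pc or abs(p1[i] - p2[i]) < 1e-14:
            continue
        y1, y2 = min(p1[i], p2[i]), max(p1[i], p2[i])
        u = random.random()          # one draw shared by both children

        def beta_q(beta):
            """Spread factor from the bound-dependent beta."""
            alpha = 2.0 - beta ** -(eta_c + 1.0)
            if u <= 1.0 / alpha:
                return (u * alpha) ** (1.0 / (eta_c + 1.0))
            return (1.0 / (2.0 - u * alpha)) ** (1.0 / (eta_c + 1.0))

        bq1 = beta_q(1.0 + 2.0 * (y1 - lo[i]) / (y2 - y1))   # lower-side child
        bq2 = beta_q(1.0 + 2.0 * (hi[i] - y2) / (y2 - y1))   # upper-side child
        c1[i] = 0.5 * ((y1 + y2) - bq1 * (y2 - y1))
        c2[i] = 0.5 * ((y1 + y2) + bq2 * (y2 - y1))
        c1[i] = min(max(c1[i], lo[i]), hi[i])                # clip to bounds
        c2[i] = min(max(c2[i], lo[i]), hi[i])
    return c1, c2
```

In RL-NSGA-II-GRC the arguments `pc` and `eta_c` would be supplied per generation by the RL agent rather than fixed.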
3.1.5. Polynomial Mutation
Given a mutation probability $p_m$ per variable and distribution index $\eta_m$, polynomial mutation perturbs each variable $x_i$ with probability $p_m$. Likewise, both $p_m$ and $\eta_m$ are dynamically controlled by the RL agent in our proposed RL-NSGA-II-GRC method. For a selected variable $x_i$, define:

$$\delta_1 = \frac{x_i - x_i^{L}}{x_i^{U} - x_i^{L}}, \quad \delta_2 = \frac{x_i^{U} - x_i}{x_i^{U} - x_i^{L}},$$

and uniformly draw a random number $u \sim U(0, 1)$. Let $\kappa = \frac{1}{\eta_m + 1}$. The mutation step $\delta_q$ is given by:

$$\delta_q = \begin{cases} \big[2u + (1 - 2u)(1 - \delta_1)^{\eta_m + 1}\big]^{\kappa} - 1, & u < 0.5, \\[6pt] 1 - \big[2(1 - u) + 2(u - 0.5)(1 - \delta_2)^{\eta_m + 1}\big]^{\kappa}, & u \ge 0.5. \end{cases}$$

The mutated value of the $i$-th variable is:

$$x_i' = x_i + \delta_q\,(x_i^{U} - x_i^{L}),$$

followed by bound clipping:

$$x_i' \leftarrow \min\big(\max\big(x_i',\, x_i^{L}\big),\, x_i^{U}\big).$$
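The mutation equations above map to a short Python routine (an illustrative sketch with our own function name and default distribution index, not the paper's implementation):

```python
import random

def polynomial_mutation(x, lo, hi, pm, eta_m=20.0):
    """Polynomial mutation: perturb each variable with probability pm,
    with step size shaped by the distribution index eta_m."""
    y = list(x)
    for i in range(len(y)):
        if random.random() >= pm:
            continue
        rng = hi[i] - lo[i]
        d1 = (y[i] - lo[i]) / rng          # normalized distance to lower bound
        d2 = (hi[i] - y[i]) / rng          # normalized distance to upper bound
        u = random.random()
        k = 1.0 / (eta_m + 1.0)
        if u < 0.5:                         # perturb toward the lower bound
            dq = (2.0 * u + (1.0 - 2.0 * u) * (1.0 - d1) ** (eta_m + 1.0)) ** k - 1.0
        else:                               # perturb toward the upper bound
            dq = 1.0 - (2.0 * (1.0 - u)
                        + 2.0 * (u - 0.5) * (1.0 - d2) ** (eta_m + 1.0)) ** k
        y[i] = min(max(y[i] + dq * rng, lo[i]), hi[i])   # step, then clip
    return y
```

Larger `eta_m` concentrates offspring near the parent; smaller values allow bigger jumps.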
3.2. Gray Relational Coefficient (GRC) Enhanced Binary Tournament Selection
To enhance parent selection beyond Pareto rank and crowding distance alone, RL-NSGA-II-GRC employs an MCDM approach to derive the gray relational coefficient (GRC) score for each potential parent solution, which combines the objective performances and crowding distance information in a normalized manner. GRC is founded on gray system theory. This theory serves as a powerful analytical tool, particularly in situations where information is incomplete or uncertain. The core principle of GRC is to evaluate solutions by measuring their proximity to an ideal reference solution [52]. The solution that is closer to the ideal solution receives a higher GRC score.
At generation $t$, consider the current population $P_t$. For each objective $m$, compute the population-level extremes:

$$f_m^{\min} = \min_{i \in P_t} f_{i,m}, \quad f_m^{\max} = \max_{i \in P_t} f_{i,m}.$$

Because all objectives are minimized by default, we normalize each objective so that the value 1 represents the best performance and the value 0 denotes the worst performance consistently:

$$\tilde{f}_{i,m} = \frac{f_m^{\max} - f_{i,m}}{f_m^{\max} - f_m^{\min}}.$$

For crowding distance, recall that we have already calculated the crowding distance $d_i$ of solution $i$ in Section 3.1.3, and the boundary solutions have $d_i = \infty$. Here, we further compute

$$d^{\max} = \max\{\, d_i \;:\; d_i < \infty \,\},$$

and infinite distances (for boundary solutions) are replaced by this finite proxy to facilitate the subsequent calculations. Then the normalized crowding distance is:

$$\tilde{d}_i = \frac{d_i}{d^{\max}}.$$
Next, each normalized criterion $\tilde{c}_i$ (either an objective or the crowding distance) is compared to its ideal reference value $c^{*} = 1$. The GRC for that criterion is:

$$\gamma\big(c^{*}, \tilde{c}_i\big) = \frac{\Delta_{\min} + \zeta\,\Delta_{\max}}{\big|c^{*} - \tilde{c}_i\big| + \zeta\,\Delta_{\max}},$$

where $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum absolute deviations from the ideal, $\zeta \in (0, 1]$ is the distinguishing coefficient, and a value of $\gamma$ closer to 1 indicates that $\tilde{c}_i$ is closer to the ideal. For each individual $i$, we compute one coefficient for the crowding distance:

$$\gamma_i^{d} = \gamma\big(1, \tilde{d}_i\big),$$

and one coefficient for each objective $m$:

$$\gamma_{i,m} = \gamma\big(1, \tilde{f}_{i,m}\big).$$

The overall GRC score is defined as the average of these $M + 1$ coefficients:

$$GRC_i = \frac{1}{M + 1} \left( \gamma_i^{d} + \sum_{m=1}^{M} \gamma_{i,m} \right),$$

where a solution $i$ with a larger $GRC_i$ score is preferred.
After computing the GRC scores for all candidate solutions, the binary tournament selection proceeds straightforwardly. Given two randomly sampled individuals $a$ and $b$, if $\mathrm{rank}(a) < \mathrm{rank}(b)$, we select $a$; else if $\mathrm{rank}(b) < \mathrm{rank}(a)$, we select $b$; otherwise (i.e., the same Pareto rank, which commonly occurs especially toward the later stage of evolution), we compare their GRC scores and select the individual with the larger score. In this way, the GRC score serves as a scalar indicator balancing objective quality and the spread of solutions.
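The scoring and tie-breaking logic can be sketched as follows. This is our own simplified illustration: it assumes the criteria have been normalized to $[0,1]$ with ideal value 1, so the minimum and maximum deviations are taken as 0 and 1, and it fixes the distinguishing coefficient at 0.5 (function names are ours, not the paper's):

```python
def grc_score(norm_criteria, zeta=0.5):
    """GRC score of one solution: average gray relational coefficient over its
    normalized criteria (objectives + crowding distance), ideal value = 1.
    Assumes deviations from the ideal span [0, 1]: Delta_min = 0, Delta_max = 1."""
    gammas = [(0.0 + zeta * 1.0) / (abs(1.0 - c) + zeta * 1.0) for c in norm_criteria]
    return sum(gammas) / len(gammas)

def binary_tournament(a, b, rank, grc):
    """Winner of a binary tournament: lower Pareto rank first,
    larger GRC score as the tiebreaker on equal ranks."""
    if rank[a] != rank[b]:
        return a if rank[a] < rank[b] else b
    return a if grc[a] >= grc[b] else b
```

A solution sitting at the ideal point on every criterion scores exactly 1; scores decay smoothly toward $\zeta/(1+\zeta)$ as criteria move away from the ideal.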
Survival Selection with Fractional Sampling Mechanism
After the parent selection, crossover, and mutation steps, we obtain the set of new offspring solutions $Q_t$. The parent and offspring solutions are merged into a combined pool $R_t = P_t \cup Q_t$ of size $2N$ (where $N$ is the predefined number of solutions desired in the population). This pool is first partitioned into non-dominated fronts $F_0, F_1, F_2, \dots$ using the constraint-tolerant dominance relation $\prec_{\varepsilon_t}$ (as discussed in Section 3.1.1 and Section 3.1.2). Within each front $F_k$, crowding distances are computed to promote a good spread in objective space. Unlike classical NSGA-II, which always admits the entire $F_0$ and subsequent fronts until the population budget is nearly exhausted, our proposed RL-NSGA-II-GRC method adopts a fractional front sampling mechanism dynamically controlled by the RL agent. Specifically, for each front $F_k$, only $\lceil \rho_t\,|F_k| \rceil$ solutions are selected, where $\rho_t \in (0, 1]$ is the front sampling fraction managed by the RL agent at generation $t$ based on the environment. The selected individuals are the $\lceil \rho_t\,|F_k| \rceil$ solutions of $F_k$ with the largest crowding distance, ensuring that the most diverse solutions from each front are retained. If, after scanning all fronts, the size of the new population $P_{t+1}$ is still smaller than the desired population size $N$, the remaining slots are filled from the leftover individuals in $R_t$, sorted lexicographically by (rank, $-$crowding distance).
This two-stage survival selection scheme departs from the behavior of standard NSGA-II: by allowing $\rho_t < 1$ (e.g., $\rho_t = 0.8$), the algorithm may deliberately not take all members of $F_0$, thereby leaving room for selected individuals from inferior fronts to survive in order to boost diversity (and also broaden the range of offspring that can be produced in the next generation). The degree of this relaxation is adaptively controlled by the RL agent based on the observed hypervolume, feasibility ratio, and diversity metrics (discussed in Section 3.3) in the environment. As a result, the survival selection becomes a dynamic balance between exploitation of the current best front and exploration via lower-rank but diverse or promising solutions, which can be particularly beneficial on constrained MOO problems.
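The two-stage scheme can be sketched in Python as follows (an illustration under the description above; the function name `fractional_survival` and the data layout are our own, and leftovers are already accumulated in rank order with decreasing crowding distance, matching the lexicographic fill rule):

```python
import math

def fractional_survival(fronts, crowd, N, rho):
    """Stage 1: take the ceil(rho * |F_k|) largest-crowding members of each front.
    Stage 2: fill remaining slots from the leftovers, which are appended in
    (rank, -crowding distance) order as the fronts are scanned."""
    survivors, leftovers = [], []
    for front in fronts:
        take = min(math.ceil(rho * len(front)), N - len(survivors))
        by_crowding = sorted(front, key=lambda i: -crowd[i])
        survivors.extend(by_crowding[:take])
        leftovers.extend(by_crowding[take:])
        if len(survivors) == N:
            return survivors
    survivors.extend(leftovers[:N - len(survivors)])
    return survivors
```

With `rho = 1.0` this degenerates to the classical NSGA-II whole-front admission; smaller values trade members of the best front for diverse lower-rank survivors.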
3.3. Population-Level Performance Indicators for Reinforcement Learning (RL)
Three population-level indicators, namely, hypervolume, feasibility ratio, and diversity, are utilized to evaluate the impact of the RL agent's actions. These are computed at every evolutionary generation $t$.
3.3.1. Hypervolume
Let $\mathcal{F}_t$ denote the feasible rank-0 front (i.e., the set of best non-dominated solutions) at generation $t$:
$$\mathcal{F}_t = \{ x \in P_t \;:\; \operatorname{rank}(x) = 0,\; v(x) = 0 \},$$
where $v(x)$ is the total constraint violation of solution $x$. Let $r \in \mathbb{R}^m$ be a fixed reference point chosen to be dominated by all relevant objective vectors (e.g., constructed from early population extremes with a safety margin). The hypervolume (HV) at generation $t$ is the Lebesgue measure of the union of objective-space hyperrectangles dominated by $\mathcal{F}_t$ and bounded by $r$:
$$HV_t = \lambda_m \Big( \bigcup_{x \in \mathcal{F}_t} [f_1(x), r_1] \times \cdots \times [f_m(x), r_m] \Big),$$
where $\lambda_m$ denotes the $m$-dimensional Lebesgue measure (volume). For $m = 2$, this reduces to the area dominated by the rank-0 front. A larger hypervolume value is preferred, as it indicates a Pareto front that is closer to the true optimum.
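For the bi-objective case ($m = 2$), the hypervolume reduces to a rectangle sweep over the front sorted by the first objective. The following minimal sketch (assuming minimization of both objectives) illustrates the idea:

```python
def hypervolume_2d(front, ref):
    """2-D hypervolume sketch for a minimization problem.

    front : list of (f1, f2) objective vectors of the rank-0 front
    ref   : reference point (r1, r2) dominated by every relevant point
    """
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:  # skip points dominated within the sweep
            # Add the rectangle newly dominated by this point.
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```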
3.3.2. Feasibility Ratio
The feasibility ratio at generation $t$ is the fraction of the current population $P_t$ (of size $N$) with zero constraint violation:
$$FR_t = \frac{1}{N} \sum_{x \in P_t} \mathbb{1}\!\left[ v(x) = 0 \right],$$
where $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 when its argument holds and 0 otherwise. As such, each feasible solution contributes 1 to the sum, while each infeasible solution contributes 0. Consequently, $FR_t \in [0, 1]$ quantifies the feasibility level of the current population, with values closer to 1 being preferred, indicating that the evolutionary search successfully maintains feasibility across generations.
3.3.3. Diversity
The diversity metric measures how well the population is spread across the objective space, reflecting the algorithm’s ability to explore multiple trade-offs rather than just clustering around a narrow region. A larger diversity value indicates a more widely dispersed set of solutions, which is desirable because it supports the discovery of a well-distributed Pareto front.
To obtain the diversity value for the rank-0 front ($\mathcal{F}_t$), we normalize all objectives to $[0, 1]$ using population-level minimum and maximum values:
$$\tilde{f}_k(x) = \frac{f_k(x) - f_k^{\min}}{f_k^{\max} - f_k^{\min}}, \quad k = 1, \ldots, m.$$
Define the normalized objective vector $\tilde{f}(x) = \big( \tilde{f}_1(x), \ldots, \tilde{f}_m(x) \big)$, and let $n_t = |\mathcal{F}_t|$ denote the number of solutions within the rank-0 front at generation $t$. The diversity is then taken as the average pairwise Euclidean distance:
$$D_t = \frac{2}{n_t (n_t - 1)} \sum_{i < j} \big\| \tilde{f}(x_i) - \tilde{f}(x_j) \big\|_2.$$
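Both indicators are straightforward to compute; a minimal Python sketch is given below. For self-containment, the normalization here uses the front's own min/max values rather than the population-level values described above.

```python
from itertools import combinations
from math import sqrt

def feasibility_ratio(violations):
    """Fraction of the population with zero constraint violation."""
    return sum(v == 0 for v in violations) / len(violations)

def diversity(front_objs):
    """Average pairwise Euclidean distance over min-max-normalized objectives."""
    m = len(front_objs[0])
    lo = [min(f[k] for f in front_objs) for k in range(m)]
    hi = [max(f[k] for f in front_objs) for k in range(m)]
    # Guard against degenerate (constant) objectives with a unit range.
    norm = [[(f[k] - lo[k]) / ((hi[k] - lo[k]) or 1.0) for k in range(m)]
            for f in front_objs]
    pairs = list(combinations(norm, 2))
    if not pairs:
        return 0.0
    return sum(sqrt(sum((a[k] - b[k]) ** 2 for k in range(m)))
               for a, b in pairs) / len(pairs)
```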
3.4. Reinforcement Learning (RL) Layer
The RL layer views each generation as a time step in a Markov decision process. At each generation $t$, the agent observes a state summarizing the population, selects an action that sets the evolutionary parameters, and receives a scalar reward based on improvements in hypervolume, feasibility, and diversity. The state at generation $t$ is defined as:
$$s_t = (h_t, \phi_t, d_t, g_t),$$
which comprises the hypervolume trend sign $h_t$ (1 improvement, $-1$ deterioration, 0 no change), computed by comparing $HV_t$ against $HV_{t-1}$ with a small tolerance to absorb numerical noise; the feasibility level $\phi_t$, obtained by discretizing the feasibility ratio $FR_t$ into three bins (0 low, 1 medium, 2 high); the diversity level $d_t$, obtained analogously from $D_t$ (0 low, 1 medium, 2 high); and the search stage $g_t$, encoding whether we are in the early (0), middle (1), or late (2) stage of the run, given the total generation budget $T$ (e.g., 300 generations). This yields a compact, fully discrete state representation for the RL agent. For instance, the state $s_t = (1, 2, 1, 0)$ means an improved hypervolume (relative to the previous generation), a high feasibility level for solutions in the current population, a medium diversity level in the current rank-0 front, and an early stage of the evolutionary search.
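A possible encoding of this discrete state is sketched below. The bin thresholds (0.33/0.66 for feasibility, the `div_lo`/`div_hi` cut points for diversity) and the equal-thirds stage split are illustrative assumptions, not values specified in the text.

```python
def encode_state(hv, hv_prev, feas_ratio, div, div_lo, div_hi, gen, budget,
                 eps=1e-6):
    """Sketch of the discrete RL state (h, phi, d, g); thresholds are assumed."""
    # Hypervolume trend sign: +1 improvement, -1 deterioration, 0 no change.
    h = 1 if hv > hv_prev + eps else (-1 if hv < hv_prev - eps else 0)
    # Feasibility level: 0 low, 1 medium, 2 high (assumed 0.33/0.66 cuts).
    phi = 0 if feas_ratio < 0.33 else (1 if feas_ratio < 0.66 else 2)
    # Diversity level: 0 low, 1 medium, 2 high (assumed cut points).
    d = 0 if div < div_lo else (1 if div < div_hi else 2)
    # Search stage: early (0), middle (1), late (2) thirds of the budget.
    g = min(2, 3 * gen // budget)
    return (h, phi, d, g)
```

The first assertion below reproduces the worked example in the text: improved hypervolume, high feasibility, medium diversity, early stage.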
Next, the action space $\mathcal{A}$ is a finite set of parameter modes. Each action $a \in \mathcal{A}$ encodes the following key parameters: crossover probability $p_c$, mutation probability $p_m$, SBX distribution index $\eta_c$, mutation distribution index $\eta_m$, constraint tolerance $\epsilon$, and front sampling fraction $\rho$ in survival selection. Formally,
$$a = (p_c, p_m, \eta_c, \eta_m, \epsilon, \rho).$$
When action $a_t$ is selected at generation $t$, the NSGA-II parameters are set to the values encoded by $a_t$. Here, $\epsilon$ enters the constraint-tolerant dominance, and $\rho$ controls how many individuals are accepted from each front in survival selection, adjusting selection pressure and diversity.
A tabular action-value function $Q(s, a)$ is maintained as a dictionary (i.e., a Q-learning table) mapping each visited state $s$ to a vector of Q-values over all actions. Therefore, at generation $t$, given state $s_t$, an ε-greedy policy selects:
$$a_t = \begin{cases} \text{a uniformly random action from } \mathcal{A} & \text{with probability } \varepsilon, \\ \arg\max_{a \in \mathcal{A}} Q(s_t, a) & \text{with probability } 1 - \varepsilon, \end{cases}$$
where ties in the argmax are broken uniformly at random. The exploration parameter $\varepsilon$ (e.g., an initial value of 0.3) is gradually decayed:
$$\varepsilon \leftarrow \max(\varepsilon_{\min}, \; \delta \cdot \varepsilon),$$
so that the policy moves from exploration to exploitation over time; for example, $\varepsilon_{\min}$ can be a small value (e.g., 0.05) and the decay factor $\delta$ takes 0.995.
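The ε-greedy selection with random tie-breaking and decay can be sketched as follows; this is a minimal, self-contained illustration whose class and attribute names are our own, not the paper's.

```python
import random
from collections import defaultdict

class EpsGreedyPolicy:
    """Sketch of epsilon-greedy action selection over a tabular Q-function."""

    def __init__(self, n_actions, eps=0.3, eps_min=0.05, decay=0.995, seed=0):
        self.n_actions = n_actions
        self.eps, self.eps_min, self.decay = eps, eps_min, decay
        self.rng = random.Random(seed)
        # Q-table: each visited state maps to a vector of action values.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def select(self, state):
        if self.rng.random() < self.eps:
            return self.rng.randrange(self.n_actions)  # explore
        qs = self.q[state]
        best = max(qs)
        # Break argmax ties uniformly at random.
        return self.rng.choice([a for a, v in enumerate(qs) if v == best])

    def decay_eps(self):
        self.eps = max(self.eps_min, self.eps * self.decay)
```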
As seen, the discretization of the RL state space is motivated by the use of tabular Q-learning, where the number of state-action pairs must remain sufficiently small to ensure stable learning under a limited generational budget. In this study, the state is therefore designed as a compact summary of the population dynamics using (i) the sign of hypervolume change (improved/unchanged/deteriorated), (ii) discretized feasibility ratio (low/medium/high), (iii) discretized diversity level (low/medium/high), and (iv) the search stage (early/middle/late). This design is not intended to represent every population statistic; rather, it provides a parsimonious, decision-relevant representation that captures the three core evolutionary goals, namely convergence, feasibility, and spread, while also accounting for early exploration or late exploitation. Coarser discretization also reduces noise sensitivity and avoids overfitting the control policy to small metric fluctuations, which is particularly important when the same controller is expected to generalize across different constrained MOO instances.
After applying action $a_t$ and completing one evolutionary generation to obtain the new population, the RL agent observes the newly updated metrics $HV_{t+1}$, $FR_{t+1}$, and $D_{t+1}$. To make the reward scale-invariant, we compute normalized gains. For the hypervolume gain:
$$\Delta HV_t = \frac{HV_{t+1} - HV_t}{|HV_t| + \epsilon_0},$$
for the feasibility gain:
$$\Delta FR_t = FR_{t+1} - FR_t,$$
and for the diversity gain:
$$\Delta D_t = \frac{D_{t+1} - D_t}{|D_t| + \epsilon_0},$$
where $\epsilon_0$ is a small constant preventing division by zero. The normalized quantities are typically clamped to $[-1, 1]$ to stabilize learning. The immediate reward ($r_t$) is then defined as a weighted sum of these three gains:
$$r_t = w_{HV} \, \Delta HV_t + w_{FR} \, \Delta FR_t + w_{D} \, \Delta D_t,$$
with non-negative weights in which $w_{HV}$ is set largest. This emphasizes hypervolume improvement while still rewarding better feasibility and diversity.
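A minimal sketch of this reward computation, with clamping to $[-1, 1]$, is shown below. The relative-gain normalization and the specific default weights (0.6/0.2/0.2) are illustrative assumptions consistent with the stated ordering (hypervolume weighted most heavily), not values confirmed by the text.

```python
def clamp(x, lo=-1.0, hi=1.0):
    """Clamp a normalized gain to [-1, 1] to stabilize learning."""
    return max(lo, min(hi, x))

def reward(hv_new, hv_old, fr_new, fr_old, d_new, d_old,
           w_hv=0.6, w_fr=0.2, w_d=0.2, eps=1e-12):
    """Weighted, scale-invariant reward (sketch; weights are assumed)."""
    g_hv = clamp((hv_new - hv_old) / (abs(hv_old) + eps))
    g_fr = clamp(fr_new - fr_old)  # feasibility ratio is already in [0, 1]
    g_d = clamp((d_new - d_old) / (abs(d_old) + eps))
    return w_hv * g_hv + w_fr * g_fr + w_d * g_d
```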
The reward is defined as a weighted sum of normalized gains in hypervolume, feasibility, and diversity to align the RL objective with standard performance criteria in constrained MOO. The higher weight assigned to hypervolume reflects its role as a unified indicator that simultaneously captures convergence and coverage of the Pareto front, whereas feasibility ratio and diversity are included to prevent degenerate behaviors (e.g., rapidly improving hypervolume by concentrating on infeasible regions or by collapsing diversity). The chosen weights therefore express the intended priority, that is, driving Pareto-front quality while maintaining feasibility and solution spread as necessary constraints on the search behavior. Importantly, feasibility and diversity retain nonzero weights so that the RL controller is explicitly discouraged from sacrificing feasibility and spread to obtain short-term hypervolume gains.
Finally, let $\alpha$ denote the learning rate, and $\gamma$ the discount factor. After observing the next state $s_{t+1}$ and reward $r_t$, the temporal difference (TD) target is:
$$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'),$$
where $r_t$ is the immediate reward, and the second term, $\gamma \max_{a'} Q(s_{t+1}, a')$, represents the discounted estimate of future rewards. By incorporating both immediate and future rewards, the agent can assess the long-term consequences of its actions, enabling it to pursue strategies that maximize cumulative returns rather than focusing solely on short-term gains. In essence, an action is considered good not only when it yields an immediate reward but also when it leads to more favorable future states. The Q-value for the executed state-action pair $(s_t, a_t)$ is then updated as:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ y_t - Q(s_t, a_t) \big].$$
This update increases $Q(s_t, a_t)$ if the target return estimate exceeds the current Q-value, and decreases it otherwise. Over time, $Q(s_t, a_t)$ approximates the expected discounted return, guiding the RL agent to prefer parameter configurations that consistently improve the multi-objective search.
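The TD target and update translate directly into code; the following minimal sketch operates on a dictionary-based Q-table of the kind described above (the function name and `n_actions` default are our own).

```python
def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9, n_actions=4):
    """One tabular Q-learning (TD) update on a dict-of-lists Q-table."""
    for state in (s, s_next):
        q.setdefault(state, [0.0] * n_actions)
    target = r + gamma * max(q[s_next])    # TD target: immediate + discounted future
    q[s][a] += alpha * (target - q[s][a])  # move Q(s, a) toward the target
    return q[s][a]
```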
Beyond empirical performance gains (as Section 4 showcases), our proposed RL-NSGA-II-GRC method provides a theoretical contribution in the form of a principled coupling between dominance-based constrained MOO and RL-based adaptive control. The algorithm can be interpreted as a generational Markov decision process, where the population state is summarized by convergence, feasibility, diversity, and stage indicators, and the RL action defines an adaptive control policy over the evolutionary search dynamics. In particular, the constraint-tolerant dominance introduces a continuum between strict feasibility and controlled constraint relaxation, while the fractional survival sampling introduces a continuum between pure elitist survival and exploratory retention; together, these mechanisms formalize how selection pressure and feasibility enforcement can be adaptively scheduled across generations. Meanwhile, the GRC tie-breaker (in case of identical Pareto ranks) provides a weight-free discriminator within a dominance rank, ensuring that Pareto dominance is not overridden while enabling consistent preference toward ideal-point proximity in normalized objective space.
Additionally, from a computational complexity perspective, the proposed RL layer introduces an incremental generational overhead on top of the original NSGA-II, but it does not alter the dominant asymptotic cost terms of NSGA-II. In each generation, the RL agent computes a compact state/reward summary and then performs an ε-greedy action selection and a tabular Q-learning update. The Q-learning decision/update step is $O(1)$ per generation (a table lookup and a constant number of arithmetic operations). The added cost therefore comes mainly from computing the state/reward indicators: the feasibility ratio requires one pass over the population of size $N$ and is $O(N)$; the diversity indicator is computed over the current non-dominated set of size $n_t$ and is $O(n_t^2 m)$; and the hypervolume computation depends primarily on the number of objectives $m$ and is typically inexpensive for problems with few objectives. Separately, our GRC-enhanced tournament only adds $O(N)$ scalar score computations per generation (including the normalization and GRC calculation). The most computationally expensive part is still the NSGA-II backbone, which has the dominant time complexity of $O(mN^2)$ as reported in Abdel-Basset et al. [53]. As a result, compared with the original NSGA-II, the proposed RL-NSGA-II-GRC method only increases runtime modestly by a constant factor per generation rather than changing the overall computational order.
5. Application: Portfolio Optimization with NASDAQ-100 Constituents
To demonstrate the practical usefulness of the proposed RL-NSGA-II-GRC algorithm, we apply it to a real-world portfolio optimization problem. The goal is to design a long-only equity portfolio that simultaneously minimizes risk and maximizes expected return, using recent historical data for large-capitalization stocks from the NASDAQ-100 index.
We start from the 30 largest constituents by weight of the NASDAQ-100 (https://www.slickcharts.com/nasdaq100, accessed on 1 December 2025), and denote their tickers by $i = 1, \ldots, 30$. Daily adjusted closing prices are downloaded for each asset $i$ over the period from 1 June 2022 to 1 December 2025 using the yfinance Python library (https://pypi.org/project/yfinance/, accessed on 1 December 2025). To reduce high-frequency noise while retaining enough observations for estimation, the daily prices are resampled to weekly prices $P_{i,w}$ by taking the last trading day in each calendar week. From these weekly prices, we compute simple (percentage) weekly returns:
$$r_{i,w} = \frac{P_{i,w} - P_{i,w-1}}{P_{i,w-1}}, \quad i = 1, \ldots, n, \;\; w = 1, \ldots, W,$$
where $n$ is the number of assets that remain after cleaning and $W$ is the number of valid weeks. Assets with more than 10% missing price data are discarded; remaining gaps are handled by dropping weeks with any missing prices so that all retained series are time-aligned.
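The cleaning and return-computation steps can be sketched as follows. This is a pure-Python illustration operating on a hypothetical price dictionary that stands in for the yfinance download; the function name is our own.

```python
def weekly_returns(prices, max_missing=0.10):
    """Clean weekly price series and compute simple percentage returns (sketch).

    prices : dict ticker -> list of weekly closes (None marks a missing week)
    Drops assets with more than `max_missing` fraction of missing data, then
    drops any week in which a retained asset is still missing (time-aligned),
    then computes r_w = (P_w - P_{w-1}) / P_{w-1}.
    """
    n_weeks = len(next(iter(prices.values())))
    kept = {t: p for t, p in prices.items()
            if p.count(None) / n_weeks <= max_missing}
    # Keep only weeks where every retained series has a price.
    valid = [w for w in range(n_weeks)
             if all(p[w] is not None for p in kept.values())]
    returns = {}
    for t, p in kept.items():
        series = [p[w] for w in valid]
        returns[t] = [(series[w] - series[w - 1]) / series[w - 1]
                      for w in range(1, len(series))]
    return returns
```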
Then, using the weekly returns, we compute the mean return $\mu_i$ and the standard deviation $\sigma_i$ for each retained asset $i$, as well as the covariance $\sigma_{ij}$ between the returns of assets $i$ and $j$ (where $i \neq j$). We thus consider a portfolio of $n$ assets with non-negative weights $w_i$ (i.e., the fraction of total wealth held in asset $i$). We wish to minimize the risk ($f_1$) and maximize the expected return ($f_2$) of the portfolio simultaneously. Following the Markowitz mean–variance model [55], these two objective functions are thus defined as:
$$f_1(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_i \, w_j \, \sigma_{ij}, \qquad f_2(w) = \sum_{i=1}^{n} w_i \, \mu_i.$$
The bounds and equality constraint applied to the decision variables (i.e., weights $w_i$) are:
$$0 \le w_i \le 1, \quad i = 1, \ldots, n, \qquad \sum_{i=1}^{n} w_i = 1.$$
Here, it is worth mentioning that in the risk objective $f_1$, assets with low or negative covariance $\sigma_{ij}$ reduce the cross-product terms $w_i w_j \sigma_{ij}$, and thus lower overall risk for a given level of expected return. In other words, the covariance structure explicitly rewards combining assets that do not move together, so the risk objective naturally captures the diversification effect in modern portfolio theory. In addition, as noted earlier, any maximization objective (e.g., the expected return function here) can be readily converted into a minimization objective by multiplying its objective function by $-1$.
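The two objectives, together with a simple repair step that projects raw weights onto the simplex constraint, can be sketched as follows (a minimal illustration; the repair strategy is one common choice among several):

```python
def portfolio_objectives(w, mu, cov):
    """Markowitz objectives: risk f1 = w' Sigma w, expected return f2 = w' mu."""
    n = len(w)
    f1 = sum(w[i] * w[j] * cov[i][j] for i in range(n) for j in range(n))
    f2 = sum(w[i] * mu[i] for i in range(n))
    return f1, f2

def repair_weights(w):
    """Clip to non-negative and rescale so that the weights sum to one."""
    w = [max(0.0, x) for x in w]
    s = sum(w)
    return [x / s for x in w] if s > 0 else [1.0 / len(w)] * len(w)
```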
For this portfolio optimization problem with 2 objectives and 30 variables, we run the RL-NSGA-II-GRC algorithm for 1000 evolutionary generations, targeting a final set of 1000 Pareto-optimal solutions (although in the end, 989 solutions are found). The resulting Pareto-optimal front (also commonly known as the efficient frontier in a finance context) is presented in
Figure 5. As expected, the efficient frontier clearly exhibits the fundamental trade-off between risk and return, that is, achieving a higher expected return inevitably requires taking on higher risk, while lowering risk leads to a reduction in expected return. This pattern is consistent with finance theory, particularly the Markowitz mean–variance framework, which states that no portfolio can simultaneously minimize risk and maximize return. Instead, we face a set of optimal trade-off choices (i.e., solutions on the efficient frontier). Those located toward the lower-risk end of the frontier are more conservative, offering smaller but more stable returns. On the contrary, those located toward the upper-return end can potentially give us higher returns but also expose us to greater risks.
Additional observations can be drawn from the curvature and smoothness of the frontier in
Figure 5. The concave shape indicates diminishing marginal gains in return as risk increases, meaning that there is a progressively larger increment of risk to obtain each additional unit of expected return. Moreover, the dense distribution of Pareto-optimal solutions along the frontier showcases that the RL-NSGA-II-GRC method captures a wide range of efficient portfolio choices. As such, investors with different risk tolerances are able to identify suitable allocations of assets.
Furthermore, with this efficient frontier, we analyse the maximum Sharpe (tangency) portfolio and the utility-maximizing portfolios under different levels of risk aversion. We approximate the weekly risk-free rate $r_f$ using the 3-month U.S. Treasury bill (T-bill) rate, a standard proxy for the USD risk-free asset. As shown in Figure 6, over the period 1 June 2022–1 December 2025, the 3-month T-bill yield rose from near 1.0% to more than 5.0% and is currently around 4.0%. The arithmetic average of the 3-month T-bill yield over this period is 4.4578% per annum. The corresponding weekly rate $r_f$ is then calculated as 0.08395% (i.e., 0.0008395).
Next, we use $\mu_p$ and $\sigma_p$ to denote, respectively, the expected weekly return and the weekly standard deviation (note: the standard deviation $\sigma_p = \sqrt{f_1}$, not the variance $f_1$) of a candidate portfolio $p$ on the efficient frontier. For each $p$, we compute the weekly Sharpe ratio:
$$SR_p = \frac{\mu_p - r_f}{\sigma_p},$$
and identify the tangency portfolio $p^*$ as the one with the maximum Sharpe ratio:
$$p^* = \arg\max_p \; SR_p.$$
In this case study, the tangency portfolio $p^*$ attains the maximum weekly Sharpe ratio among all candidate portfolios on the frontier. Financially, this means that among all efficient risky portfolios, $p^*$ delivers the largest expected excess return (above $r_f$) per unit of risk.
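Selecting the tangency portfolio from a discrete set of frontier points is a one-pass scan; the sketch below also annualizes the weekly Sharpe ratio by $\sqrt{52}$, a common convention that we state here as an assumption rather than a detail from the text.

```python
def tangency_portfolio(frontier, rf):
    """Pick the maximum-Sharpe (tangency) portfolio from frontier points.

    frontier : list of (mu_p, sigma_p) weekly mean/std pairs
    rf       : weekly risk-free rate
    Returns (index, weekly Sharpe, annualized Sharpe).
    """
    sharpe = [(mu - rf) / sigma for mu, sigma in frontier]
    best = max(range(len(frontier)), key=sharpe.__getitem__)
    # Annualize the weekly Sharpe by sqrt(52) weeks (assumed convention).
    return best, sharpe[best], sharpe[best] * 52 ** 0.5
```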
With $p^*$, the capital market line (CML) can then be constructed:
$$\mu = r_f + \frac{\mu_{p^*} - r_f}{\sigma_{p^*}} \, \sigma,$$
where $\sigma$ is the portfolio standard deviation. The CML, together with the efficient frontier ($\mu_p$ vs. $\sigma_p$), is illustrated in Figure 7. As seen, the optimal risky portfolio $p^*$ corresponds to the tangency point between the efficient frontier and the CML.
Moreover, to reflect heterogeneous risk preferences, we also consider a quadratic utility function with risk aversion:
$$U(p) = \mu_p - \frac{\lambda}{2} \, \sigma_p^2,$$
where $\lambda > 0$ is the investor's risk-aversion coefficient (a larger $\lambda$ indicates stronger aversion to risk). For each prescribed $\lambda$, we evaluate $U(p)$ for all Pareto-optimal portfolios on the efficient frontier and select the utility-optimal portfolio:
$$p^*_{\lambda} = \arg\max_p \; U(p).$$
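Evaluating the utility-optimal portfolio for a given $\lambda$ is analogous to the tangency scan; the sketch below (with hypothetical frontier values) illustrates how a larger risk-aversion coefficient shifts the choice toward lower-risk portfolios.

```python
def utility_optimal(frontier, lam):
    """Select the frontier portfolio maximizing U = mu - (lam / 2) * sigma^2.

    frontier : list of (mu_p, sigma_p) weekly mean/std pairs
    lam      : risk-aversion coefficient (larger = more risk-averse)
    """
    utilities = [mu - 0.5 * lam * sigma ** 2 for mu, sigma in frontier]
    best = max(range(len(frontier)), key=utilities.__getitem__)
    return best, utilities[best]
```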
For instance, as demonstrated in Figure 7, we examine three values of $\lambda$, corresponding to low, moderate, and high risk aversion, respectively. As expected, the resulting utility-optimal portfolios exhibit decreasing risk and expected return as $\lambda$ increases; these portfolios are shown as the green, purple, and orange points, respectively, in Figure 7.
6. Discussion and Limitations
The numerical experiments on the Kursawe and CONSTR benchmarks, together with the NASDAQ-100 case study, show that our proposed RL-NSGA-II-GRC method is both effective and versatile for MOO problems. On the two benchmark problems with known true Pareto fronts, RL-NSGA-II-GRC consistently yields Pareto-optimal fronts that lie closer to the true front than those produced by the original NSGA-II, as reflected in the reduction in the CM by about 5.8% for Kursawe and 4.4% for CONSTR. At the same time, the constraint-tolerant dominance scheme and fractional sampling survival selection preserve a well-spread and diverse set of non-dominated solutions, instead of collapsing to a narrow region of the front. The RL agent plays a key role here: by adaptively tuning crossover and mutation probabilities, the constraint tolerance, and the front sampling fraction in response to hypervolume, feasibility, and diversity feedback, it automatically balances exploration and exploitation throughout the generations. This adaptive capability is significantly bolstered by the introduction of the GRC, which ensures that selection pressure simultaneously accounts for Pareto rank, crowding distance, and geometric proximity to ideal reference values. Consequently, the resulting Pareto-optimal fronts are not only closer to the theoretical optima but also smoother and more uniformly distributed, highlighting the synergy between dynamic RL agent control and GRC-based parent tournament selection.
Further, to examine the sensitivity of the RL-NSGA-II-GRC framework to different design choices, we additionally evaluated alternative discretization granularities and alternative reward-weight configurations. Specifically, we considered (i) finer and coarser binning for the feasibility and diversity levels, and (ii) multiple reward-weight triplets that varied the relative emphasis placed on hypervolume versus feasibility/diversity while keeping all other algorithmic components fixed. The overall trends observed in the benchmarks and the portfolio case indicate that the algorithm's performance is only mildly sensitive to moderate variations in these settings: the controller continues to learn consistent action preferences (e.g., stronger exploration early and stricter feasibility enforcement later), and the resulting Pareto-front quality remains comparable across a reasonable range of discretization and weight choices. Nevertheless, extremely coarse discretization may reduce the controller's ability to react to intermediate search conditions, whereas overly fine discretization may dilute learning by expanding the state space. Similarly, setting the feasibility/diversity weights too close to zero can weaken feasibility enforcement or reduce spread. These findings suggest that the proposed default settings provide a balanced trade-off, while the design remains tunable for problem-specific requirements.
For the time-complexity analysis of the proposed RL-NSGA-II-GRC against the original NSGA-II, as the population size increases, the dominant runtime within RL-NSGA-II-GRC is consistently driven by the non-dominated sorting of NSGA-II, i.e., on the order of $O(mN^2)$. The RL component adds $O(N)$ feasibility bookkeeping and up to $O(n_t^2)$ diversity computation per generation, plus an $O(1)$ tabular update. The GRC component only adds an inexpensive scalar score computation and does not change the dominant complexity term. Thus, the RL and GRC overheads remain secondary, and the overall computational order continues to be governed by the NSGA-II backbone. This is further corroborated by our empirical observations in both the benchmark mathematical problems and the NASDAQ portfolio optimization problem, where execution times for RL-NSGA-II-GRC and the original NSGA-II do not deviate significantly. Ultimately, the RL and GRC layers contribute only a modest constant-factor overhead rather than shifting the computational complexity of the algorithm.
Regarding performance relative to other state-of-the-art MOEAs (e.g., decomposition-based or indicator-based methods, as well as many-objective extensions), the present study does not claim that RL-NSGA-II-GRC universally dominates all alternatives; rather, it demonstrates that the proposed RL-controlled mechanisms and the GRC tie-breaking strategy can significantly strengthen the widely used NSGA-II backbone for MOO problems. Importantly, the proposed framework is largely modular, which means that the RL agent (our state-action-reward design) and the GRC-based ranking can, in principle, be integrated with other evolutionary algorithms by treating their control components (e.g., variation strength, constraint relaxation, or selection pressure) as RL actions and using convergence, feasibility, and diversity indicators as feedback. For hybrid RL-based optimizers, RL-NSGA-II-GRC differs in that it preserves a Pareto-first ranking logic and uses GRC when dominance information is insufficient, while also explicitly controlling constraint tolerance and survival pressure across generations. A comprehensive head-to-head comparison with additional state-of-the-art MOEAs and recent RL-assisted optimization algorithms is therefore a valuable direction for future work and will be pursued in subsequent studies.
The real-world NASDAQ-100 portfolio application further underlines the practical value of the proposed RL-NSGA-II-GRC. The algorithm constructs a smooth, concave efficient frontier in the mean–variance space, with well-populated trade-off regions that make it easy to identify the tangency (i.e., maximum Sharpe ratio) portfolio and the utility-optimal portfolios for different levels of risk aversion. This indicates that the RL-guided NSGA-II with GRC is able to navigate a realistically constrained financial search space and uncover portfolios that are both attractive and interpretable for decision makers. However, this work is not without limitations. One limitation lies in the design of the RL agent. Our current implementation adopts a tabular Q-learning agent with a discrete state-action space, which is transparent and effective for all the problems studied here, but could be further extended through function approximation (e.g., deep RL) and more expressive state features to handle even higher-dimensional, noisier, or more dynamically changing MOO environments. This would be a promising direction for future work.
7. Conclusions
In conclusion, this study proposed an RL-guided NSGA-II method enhanced with GRC (RL-NSGA-II-GRC) to address constrained MOO problems and demonstrated its application to NASDAQ portfolio optimization. The method integrated a tabular Q-learning agent into the NSGA-II backbone to adaptively control key evolutionary parameters, including crossover probability, mutation strength, constraint tolerance, and front sampling fraction, based on population-level feedback from hypervolume, feasibility ratio, and diversity indicators. In parallel, a GRC-enhanced binary tournament selection operator was designed to assess potential parents using a unified performance score that simultaneously accounted for dominance rank, crowding distance, and geometric proximity to ideal objective values, thereby strengthening selection pressure toward high-quality, well-distributed solutions.
The numerical experiments on the Kursawe and CONSTR benchmark MOO problems showed that RL-NSGA-II-GRC achieved better convergence to the true Pareto-optimal fronts than the original NSGA-II, with CM improvements of approximately 5.8% and 4.4%, respectively, while preserving a diverse and well-spread set of non-dominated solutions. In the NASDAQ portfolio case study, the proposed method produced a smooth, concave efficient frontier in the mean–variance space with a dense distribution of Pareto-optimal portfolios, enabling clear identification of the portfolio with maximum Sharpe ratio and utility-optimal portfolios for different levels of risk aversion. These results indicated that the combination of RL agent, constraint-tolerant dominance, fractional front sampling, and GRC-based selection effectively balanced exploration and exploitation and was capable of navigating realistically constrained financial search spaces to yield practically meaningful solutions. Although the proposed framework performed well across both benchmark and real-world problems, there remained room for further enhancement. In particular, the tabular Q-learning agent with a discrete state-action space, while transparent and effective for the settings studied, could be extended to more expressive function-approximation schemes and richer state representations to better handle higher-dimensional, noisier, or more dynamically evolving optimization environments. Future research may also investigate alternative reward designs, additional performance indicators, and broader classes of risk measures and portfolio constraints, as well as applications of RL-NSGA-II-GRC to other optimization and decision-making domains where constrained multi-objective trade-offs are of central importance.