1. Introduction
Traditional transmission planning often depends heavily on the experience of human planners. They develop a range of potential solutions and then choose a final plan by comparing the technical and economic pros and cons of each. As modern power systems grow more complex in both structure and operation, we have seen a shift toward mathematical optimisation. This approach uses objective functions and constraints to find the best plan. But applying these mathematical methods in real-world situations is challenging. For one, the growing number of optimisation goals makes it hard to build a single objective function that captures every real-life requirement. At the same time, as systems get larger and involve more variables, the models become highly nonlinear and sometimes non-convex. This makes finding a true global solution much more difficult. These limitations affect the reliability and effectiveness of pure mathematical optimisation in practical engineering. There is a clear need for new types of solutions. We require decision-making tools that not only produce more credible results but also help planners better analyse grid structures and evaluate different proposals. In recent years, advanced DC devices such as modular multilevel converter–coordinated fault current limiters and DC series–parallel power flow controllers have been proposed to enhance fault current limiting capability and power flow controllability in DC and hybrid AC/DC grids [1,2], which further highlights the need for systematic planning methods that can fully exploit these flexible assets.
For solving optimisation planning models, existing literature predominantly employs two major categories of algorithms: mathematical optimisation methods [3,4,5] and heuristic algorithms.
Mathematical optimisation methods employ techniques such as linearisation and constraint relaxation to simplify transmission network planning models, thereby improving solution quality and computational speed and helping the algorithm converge towards an optimal solution. Reference [6] addresses the grid voltage issue by relaxing the branch AC power flow equations, proposing a second-order cone relaxation method for solving optimal power flow. Reference [7] employs network flow methods and the max-flow min-cut theorem, establishing fuzzy sets based on boundary conditions, and utilises branch-and-bound techniques to derive planning solutions. Reference [8] addresses transmission congestion caused by renewable energy sources by enhancing the column-and-constraint generation (CCG) algorithm with a nested column-and-constraint generation procedure, enabling it to handle big-M constraints and problems with very large numbers of variables. A well-built model allows mathematical methods to find the best solution quickly and reliably. However, as models grow more complex, they become progressively harder to formulate, and when the power grid expands and the number of integer variables increases significantly, the computational burden can grow explosively, rendering the problem intractable.
Heuristic algorithms, exemplified by genetic algorithms [9] and particle swarm optimisation [10], extend the evolutionary principles or behavioural patterns observed in various organisms within the natural world to the mathematical domain. Through the process of seeking optimal solutions, they derive either optimal or suboptimal solutions to planning problems. Reference [11] proposes an optimisation model for optical cable planning in power optical transmission networks based on genetic algorithms, addressing the issues of slow computation and poor convergence in traditional planning methods. Reference [12] proposes an optimisation model for power information system network architecture based on the artificial fish swarm algorithm, addressing the issue of low identification accuracy encountered by traditional methods when optimising power communication networks. Reference [13] developed a multi-objective transmission network planning model that comprehensively considers economic efficiency, reliability, and environmental sustainability by modifying the inertia weight and learning factor of the particle swarm optimisation algorithm. This approach effectively addresses the challenges faced by traditional algorithms in solving complex planning problems, namely their susceptibility to local optima and slow convergence rates. Although heuristic algorithms offer relatively high computational efficiency, they remain fundamentally random search algorithms. When addressing large-scale optimisation problems, the quality of solutions deteriorates significantly.
Both types of algorithm aim solely at obtaining an optimal solution. Their intermediate solution steps have no physical meaning for the power grid, cannot be mapped directly onto the physical network, and yield little knowledge that could inform subsequent planning analyses, so they offer planners limited insight.
In parallel with these algorithmic advances, recent studies have started to incorporate multiple flexibility resources and learning-based surrogates into power system decision-making. Reference [14] proposes a stochastic operating framework for systems with high wind penetration in which dynamic line rating, pumped-hydro storage, common energy storage, demand response and aggregated electric vehicles are co-optimised as a portfolio of flexibility options to enhance reliability and reduce operating costs. Reference [15] introduces input-convex neural networks to learn Optimal Power Flow value functions, enabling fast, security-aware evaluation of AC dispatch decisions under non-convex constraints. These works show that richer flexibility modelling and machine-learning-based approximations can improve economic efficiency and operational security; however, they focus mainly on short-term operation and do not explicitly address long-term hybrid AC/DC transmission expansion planning or provide an interpretable mechanism to extract line-importance information from learning algorithms. This paper therefore extends this line of research by embedding AC/DC stability indicators and eligibility-trace-based knowledge extraction into a multi-objective reinforcement learning framework for transmission grid planning.
Reinforcement learning is an algorithm based on the classical Markov Decision Process (MDP) [16], which simplifies an intelligent agent’s learning task by abstracting complex phenomena into straightforward scenarios, reducing it to states, actions, and feedback rewards. Currently, reinforcement learning algorithms represented by Q-learning have been applied in power system research. However, Q-learning employs a single-step policy update, resulting in slow algorithmic convergence and suboptimal learning capabilities. Reference [17] investigated wide-area control strategies for suppressing power system oscillations following large disturbances based on the Q-learning algorithm. Reference [18] designed an optimal control system employing a Q-learning algorithm that accounts for both physical and network uncertainties. Reference [19] proposes an optimal operation model for a combined heat and power supply system, aiming to achieve both economic efficiency and environmental protection within an integrated energy system. This model is solved by reinforcement learning, though it does not address system planning. Reference [20] employs a multi-agent collaborative Q-learning algorithm, wherein multiple agents autonomously learn within the environment and mutually transfer experience. By constructing a fuzzy logic system, this approach endows robots with the capability to autonomously analyse unfamiliar environments. Reference [21] successfully integrated eligibility traces into the Q-learning algorithm. It used a multi-step backtracking method to help the agent solve multi-objective power flow problems, achieving positive outcomes. Following these developments, this paper draws an analogy between robot path planning and transmission network expansion planning, applying the Q-learning algorithm to this new domain. This shift in perspective does more than just find an optimal solution. It allows us to learn from the algorithm’s intermediate decision steps, which were often overlooked before. By analysing these steps, we can assess the importance of various candidate lines. This process enhances the reliability of the final planning results.
Despite these advances, most hybrid AC/DC planning studies still suffer from three main limitations. First, the majority of models optimise investment and reliability, but rarely incorporate explicit AC/DC stability indicators that capture commutation-failure risk and weak-grid conditions in a unified way. Second, mainstream optimisation approaches, whether mathematical programming or meta-heuristics, typically output a single expansion plan, but do not provide planners with interpretable information on which lines are structurally “important” for security and resilience. Third, although reinforcement learning has been introduced into power system decision-making, eligibility-trace-based algorithms and multi-step backtracking mechanisms have not yet been systematically exploited to solve multi-objective hybrid AC/DC transmission expansion planning problems.
To address these gaps, this paper pursues the following objectives:
- (1) Construct a multi-objective planning model for hybrid AC/DC transmission grids that simultaneously minimises annual economic cost, expected power shortage and a comprehensive electrical index reflecting AC/DC stability.
- (2) Develop an improved multi-step backtracking α-Q(λ) reinforcement learning planner that embeds eligibility traces and an adaptive learning factor, enabling both fast convergence and explicit extraction of line importance from the agent’s trajectories.
- (3) Design a practical AC/DC stability evaluation framework that combines electrical betweenness, commutation-failure margin and effective short-circuit ratio, and integrate it with an intuitionistic fuzzy AHP–based weighting scheme for comprehensive plan assessment.
In short, compared with standard Q-learning, the proposed α-Q(λ) planner accelerates convergence by leveraging multi-step backtracking and eligibility traces while adaptively tuning the learning rate. Beyond conventional mathematical-programming and heuristic planners, it also outputs an eligibility-trace matrix that highlights structurally critical lines, providing an interpretable decision aid for planners. This study was conducted to fill the above gap.
2. Evaluation Indicators for the Multi-Objective Model
This section and the following three sections constitute the methodological core of the paper. Section 2 builds an indicator system covering economic cost, supply reliability and AC/DC stability. Section 3 then organises these indicators into a tri-objective planning model with explicit constraints on power flow and generator outputs. Section 4 describes the improved α-Q(λ) reinforcement learning algorithm that solves the mixed-integer planning problem by interacting with the environment, while Section 5 focuses on reward design and eligibility-trace analysis, explaining how the agent’s learning process is converted into interpretable “line importance” information for planners.
2.1. AC/DC Stability Indicators
With the continuous advancement of China’s high-voltage direct current transmission technology, the grid structure of hybrid AC-DC transmission networks has been further refined, and the interconnection of large regions has progressively strengthened. While optimising resource allocation, the risk of grid disturbances and faults has also grown. This has made safety and stability critical concerns. Traditional stability metrics for AC and DC systems, like N-1, N-2, and commutation failure, mainly deal with internal problems. Therefore, we need to study how integrating DC systems affects AC grids and build a complete stability evaluation framework for hybrid AC/DC networks. This section uses AC-DC stability as a key criterion, assessing performance through three aspects: AC power flow distribution, DC commutation failure margin, and the effective short-circuit ratio.
2.1.1. Distribution of System Power Flows
Line load factor: DC systems often carry large amounts of power between distant AC grids. From the AC system’s viewpoint, a DC link therefore behaves like a major power plant or a large load, which significantly changes power flows in the AC network. Lines near the connection point are affected most and may be pushed close to their transmission limits, threatening system stability. The load factor τk for a transmission line is given by:
Improved Electrical Betweenness: The concept of electrical betweenness originates from graph theory. The power system is modelled as a directed graph in which the connections between generation nodes, load nodes, and transmission lines are analysed, so that the model represents the power transfer characteristics of a real-world grid. Since stability in hybrid AC/DC grids is primarily assessed through line power flows, our work focuses on refining the branch electrical betweenness for this purpose.
Branch electrical betweenness quantifies the role that each transmission line plays within the directed graph model, that is, how much a specific line contributes to moving power across the entire network. Suppose a unit of active power is injected at node i and withdrawn at node j, with the injections at all other nodes set to zero. The resulting power flow on the line connecting nodes m and n is denoted Pmn. The electrical betweenness of this line is defined as:
In the equation, Be(m,n) denotes the electrical betweenness of transmission line mn, while wi and wj are the weighting coefficients for generation node i and load node j, taken as the active power output and the load, respectively, in the actual system.
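As a rough illustration of how Be(m,n) might be evaluated, the sketch below injects a unit of power for every generation-load pair, solves a DC power flow, and accumulates the weighted absolute line flows. It is only a sketch under stated assumptions: the DC power flow model, the pair weight sqrt(wi·wj), and the names `dc_flows` and `electrical_betweenness` are illustrative choices and are not taken from the paper.

```python
import numpy as np

def dc_flows(n_bus, lines, injection, slack=0):
    """DC power flow. lines = [(m, n, susceptance)]; returns the flow on each line."""
    B = np.zeros((n_bus, n_bus))
    for m, n, b in lines:
        B[m, m] += b
        B[n, n] += b
        B[m, n] -= b
        B[n, m] -= b
    keep = [k for k in range(n_bus) if k != slack]
    theta = np.zeros(n_bus)  # slack angle stays at zero
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], injection[keep])
    return np.array([b * (theta[m] - theta[n]) for m, n, b in lines])

def electrical_betweenness(n_bus, lines, gen_weights, load_weights):
    """gen_weights / load_weights: dict bus -> weight (active output / load).

    Assumption: each generation-load pair is weighted by sqrt(wi * wj).
    """
    be = np.zeros(len(lines))
    for i, wi in gen_weights.items():
        for j, wj in load_weights.items():
            if i == j:
                continue
            injection = np.zeros(n_bus)
            injection[i], injection[j] = 1.0, -1.0  # unit injection / withdrawal
            be += np.sqrt(wi * wj) * np.abs(dc_flows(n_bus, lines, injection))
    return be

# Three-bus triangle with equal susceptances, one generator and one load.
triangle = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
be = electrical_betweenness(3, triangle, {0: 4.0}, {2: 1.0})
```

In this small example the direct line 0-2 carries two thirds of the unit transfer and therefore receives twice the index value of the two lines on the indirect path.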
To better assess the overall stability of hybrid AC/DC grids, we introduce the Gini coefficient, an economic measure of distributional uniformity, and combine it with electrical betweenness to obtain an improved electrical betweenness index. The index evaluates structural stability through the uniformity of power flows across the branches and nodes of the hybrid AC/DC transmission grid: the more uniform the power flows, the smaller the Gini coefficient and the stronger the system stability.
Within the Gini coefficient framework, SA denotes the area enclosed between the actual Lorenz curve and the line of absolute uniformity, and SB the area enclosed between the actual Lorenz curve and the line of absolute non-uniformity. The uniformity metric G is then defined as:
Sort the electrical betweenness values of all branches in the power grid by magnitude, denoted Be1, Be2, …, Ben. Using the Lorenz curve, the area SB can be obtained:
The area SA is the area of the right-angled isosceles triangle minus the area SB. Combining Equations (3) and (4), the improved expression for the branch electrical betweenness is derived as follows:
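The Lorenz-curve construction described above can be sketched in Python. The function sorts the branch values, builds the cumulative-share curve, takes SB as the trapezoidal area under it, and uses SA = 1/2 − SB so that G = SA/(SA + SB). The function name `gini_from_lorenz` and the trapezoidal approximation of the area are assumptions made for illustration.

```python
import numpy as np

def gini_from_lorenz(values):
    """Gini coefficient G = SA / (SA + SB) from a piecewise-linear Lorenz curve."""
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    # Cumulative share of the total, starting from the origin of the Lorenz curve.
    cum = np.concatenate(([0.0], np.cumsum(x) / x.sum()))
    # SB: area under the Lorenz curve, trapezoids over n equal-width intervals.
    s_b = np.sum((cum[:-1] + cum[1:]) / 2.0) / n
    # SA: right-angled isosceles triangle (area 1/2) minus SB.
    s_a = 0.5 - s_b
    return s_a / (s_a + s_b)

uniform = gini_from_lorenz([1.0, 1.0, 1.0, 1.0])       # perfectly even flows
concentrated = gini_from_lorenz([0.0, 0.0, 0.0, 1.0])  # one branch carries everything
```

With four branches, perfectly uniform values give G = 0, while concentrating everything on one branch gives G = 0.75, matching the reading that a smaller G indicates a more even flow distribution.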
2.1.2. DC System Commutation Failure Coefficient
During normal operation of the DC system, if the inverter’s extinction angle γ falls below the specified threshold γmin (6–8°) at any moment, commutation failure may occur on the inverter side. This paper calculates the commutation failure margin of the DC system by taking the difference between the inverter’s extinction angle and γmin and then expressing this difference relative to γmin. The margin reflects the current operating state of the DC system and indicates the likelihood of commutation failure: a larger value corresponds to a lower probability of commutation failure. Its expression is as follows:
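A minimal formalisation of the description above, assuming that the comparison with γmin is a ratio and using the hypothetical symbol $K_{\gamma}$ for the margin (the paper's own symbol is not shown here):

```latex
K_{\gamma} = \frac{\gamma - \gamma_{\min}}{\gamma_{\min}}
```

Under this reading, an inverter operating at γ = 15° with γmin = 7° would have a margin of (15 − 7)/7 ≈ 1.14, while γ = γmin gives a margin of zero.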
2.1.3. Effective Short-Circuit Ratio
In hybrid AC-DC transmission networks, the magnitude of DC transmission capacity significantly affects the voltage stability of the AC system, so it is vital to assess how well the AC system can accommodate the power delivered by the DC link. A key measure of this strength is the effective short-circuit ratio (ESCR), which reflects the overall robustness of the hybrid AC-DC network. The ESCR accounts for the combined impact of AC filters, reactive power compensation capacitors, and synchronous condensers at the converter stations. Its value is given by:
In this formulation, VN stands for the rated AC-side voltage, while PdN is the rated DC active power at the converter station. The equivalent admittance for filtering and reactive power compensation is given by BC, and |Z| represents the equivalent impedance of the AC system.
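For illustration, the sketch below evaluates the ESCR in the commonly used scalar form, in which the AC short-circuit capacity VN²/|Z| is reduced by the reactive power VN²·BC supplied by filters and compensation, and the result is divided by PdN. This scalar form, the unit choices (kV, Ω, S, MW) and the function name are assumptions; the paper's exact expression may differ in detail.

```python
def effective_short_circuit_ratio(v_n_kv, z_abs_ohm, b_c_siemens, p_dn_mw):
    """ESCR = (S_ac - Q_c) / P_dN, an assumed scalar form.

    v_n_kv      : rated AC-side voltage VN in kV
    z_abs_ohm   : equivalent AC system impedance magnitude |Z| in ohms
    b_c_siemens : equivalent admittance BC of filters and compensation in siemens
    p_dn_mw     : rated DC active power PdN in MW
    """
    s_ac = v_n_kv ** 2 / z_abs_ohm    # short-circuit capacity in MVA
    q_c = v_n_kv ** 2 * b_c_siemens   # reactive power of filters/capacitors in Mvar
    return (s_ac - q_c) / p_dn_mw

# Hypothetical 500 kV converter bus feeding a 3000 MW DC link.
escr = effective_short_circuit_ratio(500.0, 25.0, 0.004, 3000.0)
```

In this hypothetical case the short-circuit capacity is 10,000 MVA and the compensation supplies 1000 Mvar, so the ESCR is about 3.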
2.2. Weight Calculation Method for Evaluation Indicator Systems Based on the Intuitionistic Fuzzy Analytic Hierarchy Process
Assessing transmission grid planning requires a balanced allocation of weights across different indicators. Most current methods for setting these weights rely on subjective approaches. Commonly used techniques include the Analytic Hierarchy Process and expert survey methods. A key drawback of these methods is their strong dependence on personal judgement. Different experts may assign very different weights to the same indicator, sometimes even giving opposite scores. This variation makes it important to reduce human bias in the calculations. Improving the objectivity and fairness of the process is a central challenge in building a good evaluation system.
The intuitionistic fuzzy Analytic Hierarchy Process (AHP) offers one way to address this problem. Fuzzy theory handles vague or uncertain information, and combining it with the classic AHP yields fuzzy AHP, which uses fuzzy numbers to make the traditional method more objective. The intuitionistic variant goes a step further by adding a non-membership function alongside the membership function. Applying intuitionistic fuzzy principles helps manage uncertainty and reduces the ambiguity that comes from human judgement, which makes the evaluation system more reliable, objective and effective.
In intuitionistic fuzzy AHP, relationships between indicators are evaluated by first conducting pairwise comparisons to derive the intuitionistic fuzzy judgement matrix Hz, which represents relative importance. To describe the significance of attributes quantitatively, the scale definitions shown in Table 1 are used.
Upon obtaining the intuitionistic fuzzy judgement matrix provided by expert l, the relative weights of the criterion layer are calculated according to the following formula:
In the formula, ωl denotes the criterion layer weight matrix provided by expert l; hij represents the importance of indicator i relative to indicator j within the intuitionistic fuzzy judgement matrix Hz; μij and νij correspond, respectively, to the degrees of importance and unimportance in the aforementioned scale table.
According to the intuitionistic fuzzy weighted arithmetic averaging operator:
Combining Equations (8) and (9), the criterion layer weights are obtained:
In the formula, σi denotes the final weighting coefficient of the criterion layer.
Furthermore, once the criterion-level weights are obtained, the indicator-level weights can be calculated. Assuming the experts’ intuitionistic fuzzy judgement matrix for the importance of indicator-level metrics relative to the criterion level is denoted He, with heij = (μeij, νeij), the normalised indicator-level weight coefficients σ(2) can then be derived.
2.3. Calculation of Indicator Scores Based on the Effectiveness Coefficient Method
In transmission network planning, the raw numerical values obtained through data collection, statistics, and calculation cannot be weighted directly; they must first be converted into dimensionless scores for comparison. Evaluations typically involve positive indicators, negative indicators, and deviation-type indicators. Because all three types coexist, each must undergo directional standardisation: negative indicators are first converted into positive ones, and deviation-type indicators are then transformed into positive values based on their deviation distance. The conversion is as follows:
In the formula, σij and σ’ij denote the values of the jth indicator in the ith scheme before and after directional standardisation, respectively; max|σj| represents the maximum absolute value of the jth indicator; and p is the coordination coefficient, typically set to 0.1. This processing yields the directionally standardised indicator values. However, as the indicators differ in meaning and units, they undergo further dimensionless processing:
Following directional standardisation and dimensionless transformation, the indicator values can be compared across metrics. However, it remains difficult to judge the relative strength or weakness of an individual indicator intuitively. Consequently, the effectiveness coefficient method is introduced to convert each indicator value into a percentage-based score, enabling a more intuitive assessment of comparative performance:
In the formula, Mj and mj denote the satisfactory value and the unacceptable limit value of indicator σj, respectively; c and d are known constants, with c = 60 and d = 40 adopted in this paper. After conversion via the effectiveness coefficient method, the final scores produced by the evaluation indicator system lie in the range 60–100, providing an intuitive representation of indicator quality.
In this paper, the coordination coefficient p is set to 0.1, following common practice in multi-criteria evaluation where a small value of p avoids excessive compression of indicator differences while maintaining numerical stability. The constants c = 60 and d = 40 in the effectiveness-coefficient method define a final score range of 60–100, which corresponds to a typical pass/fail threshold used in engineering evaluation. Alternative values were tested in preliminary experiments and were found to have little impact on the relative ranking of planning schemes.
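The scoring step can be illustrated with a short Python sketch. It assumes the usual linear form of the effectiveness coefficient method, score = c + d·(σ − m)/(M − m), which reproduces the 60–100 range stated above for c = 60 and d = 40. The function name and the clipping of values outside [m, M] are assumptions added for illustration.

```python
def effectiveness_score(value, satisfactory, unacceptable, c=60.0, d=40.0):
    """Map an indicator value to a score between c and c + d.

    satisfactory : satisfactory value M of the indicator
    unacceptable : unacceptable limit value m of the indicator
    Assumption: values beyond either bound are clipped to the 60-100 range.
    """
    ratio = (value - unacceptable) / (satisfactory - unacceptable)
    ratio = min(max(ratio, 0.0), 1.0)
    return c + d * ratio

at_limit = effectiveness_score(0.0, satisfactory=100.0, unacceptable=0.0)
midway = effectiveness_score(50.0, satisfactory=100.0, unacceptable=0.0)
at_target = effectiveness_score(100.0, satisfactory=100.0, unacceptable=0.0)
```

An indicator sitting exactly at its unacceptable limit scores 60, one at its satisfactory value scores 100, and values in between are interpolated linearly.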
3. Multi-Objective Planning Model for Transmission Grids
3.1. Multi-Objective Planning Model
Considering the economic viability, safety and reliability, and AC-DC stability of hybrid AC-DC power grids, a multi-objective planning model is established to minimise the annual comprehensive economic cost, the annual expected power shortage, and the system’s comprehensive electrical index. The model’s objective function is as follows:
The constraints are as follows:
In the equation, τk(in), τk(out), θk(in), and θk(out) denote the power flow and phase angle at the two end nodes of line τk, where in indicates the inflow node and out indicates the outflow node; N is the node set comprising all nodes in the system; Pdi is the load at node i; bτk is the admittance of line τk; the flow on line τk is bounded by its maximum transfer power; the active power output of generator unit gi is bounded by its upper and lower limits; θo is zero, representing the phase angle of the slack (balance) node; F1 denotes the annual comprehensive economic cost, encompassing line construction expenditure, system operational expenses, network loss costs, and maintenance outlays; F2 represents the system’s expected power shortage, determined via Monte Carlo simulation; F3 signifies the system’s comprehensive electrical index, calculated by incorporating the node and branch electrical betweenness indices derived from the Gini coefficient; Bi and Bτ denote the electrical betweenness of node i and line τ, respectively.
In addition, the notation in (15) and its constraints is defined as follows. One index runs over the buses and another over the generating units connected to them. The candidate transmission lines form a set whose k-th element is the candidate line τk. For each candidate line, a binary decision variable indicates whether the line is constructed or not. For each bus, the set of all lines incident to that bus is collected, so that the nodal power-balance constraint states that the net power injection at the bus (generation minus load) equals the sum of the power flows on its incident lines. Each generating unit has an active power output bounded by its lower and upper limits, and each line has a maximum transferable power. In the Monte Carlo-based reliability evaluation associated with F2, a scenario index runs over the contingency and load scenarios, each scenario has a probability, and the total load shedding is recorded for each scenario. Finally, a further index runs over the candidate planning schemes in the multi-objective evaluation, drawn from the set of all feasible schemes explored by the reinforcement learning agent.
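To make the constraint set more concrete, the following sketch checks a candidate scheme against the constraints described above: DC power flow on each built line, nodal balance, line transfer limits, generator output limits, and a zero slack angle. It is an illustrative check only. The function name, the data layout, and the aggregation of generation to one value per bus are assumptions, not part of the paper's formulation.

```python
import numpy as np

def check_dc_constraints(n_bus, lines, gen, load, gen_min, gen_max, slack=0, tol=1e-6):
    """lines: list of (from_bus, to_bus, susceptance, max_flow) for the built lines."""
    gen, load = np.asarray(gen, float), np.asarray(load, float)
    gen_min, gen_max = np.asarray(gen_min, float), np.asarray(gen_max, float)
    B = np.zeros((n_bus, n_bus))
    for m, n, b, _ in lines:
        B[m, m] += b
        B[n, n] += b
        B[m, n] -= b
        B[n, m] -= b
    keep = [k for k in range(n_bus) if k != slack]
    theta = np.zeros(n_bus)  # phase angle of the slack node fixed at zero
    # Nodal balance: net injection (generation minus load) equals the sum of line flows.
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], (gen - load)[keep])
    flows = np.array([b * (theta[m] - theta[n]) for m, n, b, _ in lines])
    limits = np.array([f_max for _, _, _, f_max in lines])
    return {
        "balanced": bool(abs(gen.sum() - load.sum()) <= tol),
        "lines_ok": bool(np.all(np.abs(flows) <= limits + tol)),
        "gens_ok": bool(np.all((gen >= gen_min - tol) & (gen <= gen_max + tol))),
        "flows": flows,
    }

# Three-bus example: the direct line 0-2 is limited to 0.5 p.u.
tight = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (0, 2, 1.0, 0.5)]
result = check_dc_constraints(3, tight, gen=[1, 0, 0], load=[0, 0, 1],
                              gen_min=[0, 0, 0], gen_max=[2, 0, 0])
```

In the example the direct line would carry two thirds of the transfer, which exceeds its 0.5 p.u. limit, so the scheme is flagged as violating the line constraint even though balance and generator limits hold.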
The unit investment, operation and maintenance costs adopted in this study are taken from recent planning guidelines and case studies for regional 500 kV grids in China. The network loss cost is valued at 1000 RMB/MWh, which is consistent with the average marginal cost of energy used in long-term planning studies of large-scale AC/DC systems. Although more detailed cost models (e.g., time-varying loss prices or device-specific maintenance contracts) could be incorporated, the adopted cost parameters provide a reasonable approximation for comparing alternative expansion schemes at the planning horizon.
3.2. DC System Equivalence
In traditional planning for AC-DC hybrid grids, the DC system is usually modelled as a two-terminal system. Power flow in the hybrid network is then solved using unified or sequential methods. This detailed representation captures the DC system’s operational behaviour and offers high accuracy. However, in many practical hybrid grid projects, the DC system—which is the main focus—often connects to the AC grid at just one end. For this reason, our model treats the DC system as a single-terminal link to the AC network. It acts as an external power delivery path supplying electricity to other grids. Under this setup, the DC system can be represented as a PQ node, equivalent to a load in the planning model. This equivalent load approach simplifies computation while still capturing the effect of the DC link on AC power flow patterns.
It is acknowledged that representing the DC system as a PQ node neglects detailed DC dynamic behaviour such as commutation failures and DC voltage oscillations. However, for long-term transmission expansion planning, where the decision variables are line routes and capacities, the main influence of the DC system is its power injection pattern into the AC grid. This static PQ-node equivalence is therefore widely adopted in hybrid AC/DC planning studies, while more detailed DC dynamic models are usually reserved for operation, protection and transient-stability analyses. In this paper, the proposed framework is intended for planning-level studies; extending it to incorporate more detailed DC dynamics will be an important topic for future work.
5. Reward Design and Eligibility Trace Analysis
5.1. Eligibility Trace Matrix and Knowledge Extraction
During the learning process of the intelligent agent, the α-Q(λ) algorithm progressively constructs a knowledge matrix Q. Simultaneously, the algorithm establishes an eligibility trace matrix M of identical size to Q:
The values within the eligibility trace matrix M represent the credibility of values within the knowledge matrix Q, and participate in the updating of the knowledge matrix Q. The greater the credibility, the larger the proportion of feedback rewards derived from past events. Based on this principle, the values within the eligibility trace matrix correspond to the frequency and importance of events executed by the intelligent agent in the past. The greater the value, the more significant the event, meaning the corresponding pathway for the event action is more important.
From a planning perspective, the eligibility trace associated with a given line-building or line-cancelling action can be interpreted as a measure of how frequently and how recently the RL agent relied on that decision to obtain high-quality expansion schemes. Lines with high eligibility traces are repeatedly selected in successful episodes and rarely removed thereafter, indicating that they play a structurally important role in maintaining economic efficiency, reliability and AC/DC stability across many scenarios. Conversely, actions with persistently low traces correspond to lines that are seldom used or frequently cancelled during learning, suggesting lower importance in the final plan. This interpretation allows the eligibility-trace matrix to be viewed as a data-driven analogue of traditional sensitivity analysis or contingency ranking, providing planners with physically meaningful insight into which corridors should be prioritised for route surveys and reinforcement.
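The interplay between the knowledge matrix Q and the eligibility trace matrix M can be sketched with a generic, textbook-style Q(λ) update using accumulating traces. This is not the paper's α-Q(λ) algorithm: the adaptive learning factor is omitted, and the function names, parameter values and the ranking helper are assumptions made for illustration.

```python
import numpy as np

def q_lambda_step(Q, M, s, a, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One multi-step backtracking update; Q and M are modified in place."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]  # temporal-difference error
    M[s, a] += 1.0             # mark the visited state-action pair
    Q += alpha * delta * M     # propagate the error back along recent decisions
    M *= gamma * lam           # older decisions gradually lose eligibility
    return Q, M

def rank_by_trace(M):
    """Return (state, action) index pairs ordered from highest to lowest trace."""
    order = np.argsort(-M, axis=None)
    return [tuple(int(v) for v in np.unravel_index(k, M.shape)) for k in order]

Q = np.zeros((2, 2))
M = np.zeros((2, 2))
q_lambda_step(Q, M, s=0, a=1, r=1.0, s_next=1)
q_after_first = Q[0, 1]
q_lambda_step(Q, M, s=1, a=0, r=1.0, s_next=0)
ranking = rank_by_trace(M)
```

After the second step, the earlier decision (0, 1) still receives part of the new reward through its decayed trace, and the ranking lists the most recently reinforced pair first, which mirrors the "frequently and recently used" interpretation given above.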
5.2. Reward Function and Termination Criteria
In reinforcement learning, each action taken by an agent requires feedback rewards from the environment to evaluate the quality of its behaviour. Within multi-objective planning models, however, the presence of multiple objectives prevents agents from obtaining feedback rewards based on a single criterion. Therefore, this paper employs a weighting method. By applying a ratio transformation, the objective values under each goal are converted into dimensionless values within the range [0,1]. Finally, by assigning different weights, a composite feedback reward is obtained:
In the formula, F1best, F2best, and F3best denote the optimal comprehensive economic cost, the minimum expected power shortage, and the minimum comprehensive electrical index, respectively; R1, R2, and R3 represent the feedback rewards corresponding to each of the three objectives; β1, β2, and β3 denote the respective weights of R1, R2, and R3. In the initial state, F1best (and likewise F2best and F3best) is assigned a sufficiently large positive value. For planning schemes failing N-1 grid security constraint verification, a feedback reward of −1 is applied to deter the agent from selecting such schemes. When a planning scheme satisfies the N-1 grid security constraint verification, the respective objectives for that state are computed. Taking F1 as an example: when F1now < F1best, a positive feedback reward R1 > 0 is provided. This indicates the agent receives positive reinforcement from the environment, leading it to favour exploring this scheme. Conversely, when the scheme is worse than the current best, the reward is negative and the agent tends to avoid it. A larger gap between F1best and F1now leads to a stronger environmental feedback reward. This greater reward pushes the agent harder toward better solutions. This design helps the agent focus on actions that cause major improvements in F1now, while reducing rewards when the target is nearly met. That way, the agent avoids getting stuck too early with a suboptimal solution. By tuning the weights assigned to each objective’s feedback, we can also steer the agent to emphasise certain goals. For example, setting β1 to the highest weight encourages the agent to find a more cost-effective planning solution.
The reward function in (24) adopts a piecewise design: when a newly generated plan improves the dimensionless scores of all three objectives compared with the current best plan, a positive reward is given; otherwise, a negative reward encourages the agent to search alternative expansion paths. The weights β1 = 0.5, β2 = 0.3 and β3 = 0.2 reflect the planning preference that economic cost is slightly more important than reliability and AC/DC stability. These values were selected based on expert consultation and a simple sensitivity analysis, which showed that moderate perturbations of β1–β3 do not change the qualitative structure of the optimal plans. A more systematic multi-stakeholder weight-setting procedure will be considered in future work.
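A hedged sketch of such a composite reward is given below. It applies the −1 penalty for schemes that fail the N-1 check and otherwise combines per-objective ratio terms with the weights β1 = 0.5, β2 = 0.3 and β3 = 0.2. The specific ratio (best − now)/max(|best|, |now|) and the function name are assumptions chosen to keep each term dimensionless and bounded; the paper's exact transformation in (24) may differ.

```python
def composite_reward(f_now, f_best, weights=(0.5, 0.3, 0.2), passes_n1=True):
    """Weighted feedback reward for a planning scheme (all objectives minimised).

    f_now  : (F1now, F2now, F3now) for the newly generated scheme
    f_best : (F1best, F2best, F3best) recorded so far
    """
    if not passes_n1:
        return -1.0  # penalty for failing the N-1 security check
    reward = 0.0
    for now, best, beta in zip(f_now, f_best, weights):
        # Assumed ratio transformation: positive when the new value is lower.
        scale = max(abs(best), abs(now), 1e-12)
        reward += beta * (best - now) / scale
    return reward

better_cost = composite_reward((80.0, 10.0, 0.5), (100.0, 10.0, 0.5))
worse_cost = composite_reward((120.0, 10.0, 0.5), (100.0, 10.0, 0.5))
insecure = composite_reward((80.0, 10.0, 0.5), (100.0, 10.0, 0.5), passes_n1=False)
```

With these inputs, a 20% reduction in cost alone yields a reward of about 0.1, a cost increase yields a negative reward, and any scheme failing N-1 receives −1 regardless of its objective values.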
In summary, the detailed flow of the α-Q(λ) algorithm is illustrated in Figure 2.
7. Conclusions
This paper has presented a multi-objective hybrid AC/DC transmission expansion planning framework based on an improved multi-step backtracking α-Q(λ) reinforcement learning algorithm. A tri-objective model was formulated to minimise annual comprehensive economic cost, expected power shortage and a comprehensive electrical index that reflects AC/DC stability by combining electrical betweenness, commutation-failure margin and ESCR. An intuitionistic fuzzy AHP method and an effectiveness-coefficient-based scoring procedure were adopted to obtain dimensionless evaluation scores and overall plan rankings.
Numerical studies on the Garver-6 and IEEE 24-bus systems showed that the proposed α-Q(λ) planner achieves comparable or lower investment and operating costs than classical Q-learning, while improving the N-1 pass rate and reducing the expected power shortage. For the 500 kV regional hybrid AC/DC grid, the preferred expansion scheme reduces the expected annual power shortage from 70,810 MWh to 28,320 MWh, indicating a more uniform flow distribution and enhanced grid robustness. At the same time, the eligibility-trace matrix provides interpretable “line importance” information that highlights key AC/DC corridors such as 7–18 and 11–12, which align well with engineering intuition.
Overall, the results demonstrate that the proposed multi-step backtracking α-Q(λ) reinforcement learning framework can effectively solve multi-objective hybrid AC/DC TNEP and offers planners both high-quality expansion schemes and interpretable diagnostic information.
Despite the above advantages, several issues deserve further investigation. First, the present study considers a static planning horizon and uses a PQ-node equivalence for DC systems, whereas multi-period investment decisions and more detailed AC/DC dynamic interactions may alter the optimal expansion scheme. Second, only one form of eligibility-trace-based reinforcement learning is explored; extending the framework to actor–critic architectures or deep Q-networks could further improve scalability to very large-scale grids. Third, the current Monte Carlo reliability assessment assumes independent outages and simplified renewable profiles. Incorporating correlated failures and high-resolution renewable generation data would allow a more realistic quantification of planning risk. These directions will be pursued in future research to enhance the robustness and applicability of the proposed hybrid AC/DC planning framework.