Article

A Hybrid Optimization Approach for Multi-Generation Intelligent Breeding Decisions

1 College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
2 School of Computing and Creative Industries, Leeds Trinity University, Brownberrie Lane, Horsforth, Leeds LS18 5HD, UK
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work and share first authorship.
‡ These authors contributed equally to this work and share second authorship.
Information 2026, 17(1), 106; https://doi.org/10.3390/info17010106
Submission received: 17 December 2025 / Revised: 5 January 2026 / Accepted: 14 January 2026 / Published: 20 January 2026
(This article belongs to the Section Artificial Intelligence)

Abstract

Multi-generation intelligent breeding (MGIB) decision-making is a technique used by plant breeders to select mating individuals to produce new generations and to allocate resources for each generation. However, research remains scarce on the dynamic optimization of resources under limited budget and time constraints. Inspired by advances in reinforcement learning (RL), we propose a framework that integrates evolutionary algorithms with deep RL to fill this gap. The framework combines two modules: an Improved Look-Ahead Selection (ILAS) module and a Deep Q-Network (DQN) module. The former employs a simulated annealing-enhanced estimation of distribution algorithm to make mating decisions. Based on the selected mating individuals, the latter module learns multi-generation resource allocation policies using DQN. To evaluate the framework, numerical experiments were conducted on two realistic breeding datasets, Corn2019 and CUBIC. ILAS outperformed LAS on Corn2019, increasing the maximum and mean population Genomic Estimated Breeding Value (GEBV) by 9.1% and 7.7%, respectively. ILAS-DQN consistently outperformed the baseline methods, achieving significant and practical improvements in both top-performing and elite-average GEBVs across the two independent datasets. The results demonstrate that our method outperforms traditional baselines in both generalization and effectiveness for complex agricultural problems with delayed rewards.


1. Introduction

To address the severe challenges to food security posed by global population growth and climate change, crop breeding technology is undergoing a profound transformation from traditional phenotypic selection to genomic selection (GS) [1]. GS technology leverages genome-wide molecular markers to predict the breeding values of individuals [2], thereby enabling the selection of superior individuals in early generations and significantly shortening the breeding cycle while accelerating the process of genetic gain [3]. In recent years, with the decrease in the cost of high-throughput sequencing technology and the rapid development of artificial intelligence algorithms [4], GS has demonstrated great potential in major crops such as corn, wheat, and rice, becoming one of the core technologies for intelligent breeding [5,6,7].
Despite the significant success of GS in breeding value prediction [4], most research has focused on improving the accuracy of single-generation predictions. The models used, including Genomic Best Linear Unbiased Prediction (GBLUP) and various Bayesian methods, primarily serve to rank individuals [4,8,9,10,11,12]. However, a complete crop breeding project is fundamentally a sequential decision-making process spanning multiple cycles (usually 5 to 10 generations) [13,14]. In this process, the core challenge faced by breeders is how to dynamically allocate resources for each generation's hybrid combinations and offspring production under limited budgets (such as field plots, labor, and genotyping costs) and time constraints, in order to maximize genetic gain in the final generation [9,15]. This complex resource allocation problem involves a long-term trade-off between exploration (maintaining genetic diversity) and exploitation (concentrating resources on the currently optimal germplasm), which traditional GS models and empirical allocation strategies, such as uniform allocation, are not well-equipped to handle [16].
To optimize breeding decisions, operations research and evolutionary computation techniques have been introduced. Among them, the Look-Ahead Selection (LAS) method, which simulates multi-generational genetic progress and recombination events to optimize parental selection and mating schemes, has shown remarkable performance in enhancing long-term genetic gain [9]. However, LAS does not address dynamic resource allocation across generations; resource allocation in LAS is typically a byproduct of the optimization results rather than an active decision variable. Although Moeinizade et al. [9] first formulated resource allocation as a sequential decision problem using a Random RLGS [17], the limited expressiveness of that model prevents it from capturing complex state–action relationships, so it underperforms in high-dimensional breeding settings [18].
At the same time, RL has shown great potential in solving complex decision-making problems in the intelligent agricultural field [19,20,21]. For example, in arid areas, RL systems can be built to dynamically adjust irrigation strategies by predicting future high temperatures and drought periods [22]; Fu et al. [23] applied the Proximal Policy Optimization (PPO) algorithm to plan the flight paths of drones in farmlands to improve the efficiency of pesticide spraying. In addition, RL has also made positive progress in pest and disease management [24] and crop yield prediction [25]. These successful cases indicate that RL is particularly good at dealing with environments with delayed rewards and uncertainties [26], which is highly consistent with the characteristics of the multi-generation breeding resource allocation problem. However, most existing studies apply RL to the “execution level” of agriculture, and there is still a gap in the research for deeply integrating RL with high-performance breeding simulators (such as LAS [9]) to jointly optimize the complete decision chain of “selection–mating–resource allocation”.
Modern plant breeding programs operate under significant resource constraints, with field trials, genotyping, and labor constituting major costs [9]. A typical maize breeding program may allocate millions of dollars annually across multiple locations and generations. Efficient resource allocation directly impacts both genetic gain rates and program profitability. This challenge intensifies with climate change pressures and the need for rapid cultivar development.
Based on the above research, current approaches for breeding resource allocation exhibit three fundamental shortcomings: (1) static allocation strategies: uniform or empirically based heuristics fail to adapt to evolving population dynamics across generations; (2) myopic optimization: single-generation selection methods ignore long-term genetic potential and recombination opportunities; and (3) computational intractability: exhaustive search over possible allocation schemes becomes infeasible at realistic breeding program scales involving thousands of individuals and multiple generations. These limitations motivate the development of the ILAS-DQN framework, which enables dynamic, forward-looking optimization through integrated evolutionary and reinforcement learning approaches.
While previous EA-RL hybrids in breeding primarily focus on single-generation optimization or parameter tuning, ILAS-DQN introduces three novel contributions: (1) a hierarchical decision decomposition that separates intra-generational selection (ILAS) from inter-generational resource allocation (DQN), enabling tractable optimization of the complete breeding decision chain; (2) backward credit propagation through the potential matrix $C_t$, which provides genetically grounded state representations for RL in sparse-reward, long-horizon breeding environments (unlike existing PPO-based approaches that optimize specific breeding operations, ILAS-DQN offers a general framework for joint optimization of selection, mating, and resource allocation under budget constraints); and (3) an Improved Look-Ahead Selection algorithm that enhances global exploration and local exploitation to overcome LAS's limitations, together with rigorous cross-dataset validation on datasets with distinct genetic architectures, demonstrating the method's robustness, generalization capability, and practical scalability for intelligent breeding.
The paper is organized as follows: Section 2 introduces the problem and the ILAS-DQN framework. Experimental results are presented in Section 3. Section 4 discusses the findings, and Section 5 concludes with future work.

2. Methods

This section presents the ILAS-DQN framework, which synergizes evolutionary computation and reinforcement learning to solve the genomic resource allocation problem by jointly optimizing single-generation selection and long-term breeding decisions.

2.1. Problem Description and Modeling

2.1.1. Defining the Problem of MGIB

Traditional plant breeding processes start from an initial population and repeatedly perform selection [27] and propagation steps until a final population is obtained [28]. MGIB is essentially a sequential decision-making process spanning multiple cycles (T generations).
In each breeding cycle t ($t \in \{1, 2, \ldots, T\}$), breeders need to decide how to allocate limited resources (such as budget, field plots, etc.) to different hybrid combinations and offspring propagation based on the current state of the population, with the goal of maximizing genetic gain in the final (T-th) generation. To establish a mathematical model, we characterize the population of the t-th generation as a genotype matrix $G_t \in \{0,1\}^{L \times M \times N}$, where L is the number of genomic loci considered, M is the ploidy of the plants, and N is the number of individuals in the t-th generation. The phenotypic value of each individual is represented by its GEBV.
The core problem of this study can be formalized as follows: given the initial population $G_0$ and the total resource budget $B_0$, seek an optimal sequence of resource allocation strategies $\pi_1, \pi_2, \ldots, \pi_{T-1}$. This sequence of strategies should guide how to transform the budget $b_t$ into specific hybridization and offspring propagation plans in each generation, ultimately maximizing the objective function of the T-th generation population (Equation (1)):
$$J = \max \, \mathbb{E}\left[\phi(G_T)\right] \tag{1}$$
where $\phi(\cdot)$ is a function evaluating the quality of the population, such as the highest GEBV in the population, as shown in the following text.
Figure 1 depicts the standard genomic selection cycle. Starting from an Initial Population, it involves (1) selecting parents using GEBV, (2) crossing them to Reproduce Progeny, forming a New Population, and (3) evaluating the new population to update GEBV. This cycle repeats for T generations to maximize long-term genetic gain, posing the dual challenge of optimal within-generation selection and multi-generational resource allocation.
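The cycle described above can be prototyped in a few lines. The sketch below is a hypothetical toy simulation (random genotypes, invented marker effects, and a naive per-locus recombination model), not the authors' simulator: each generation it selects the top individuals by GEBV and crosses random pairs of them.

```python
import numpy as np

rng = np.random.default_rng(0)

L, M, N = 50, 2, 30               # loci, ploidy, population size (toy values)
beta = rng.normal(size=L)          # invented marker effects

def gebv(pop):
    # Equation-(3)-style score: allele dosage weighted by marker effects
    return np.einsum("lmn,l->n", pop, beta)

def cross(p1, p2, r=0.1):
    # one offspring via a naive per-locus recombination model
    child = np.empty((L, M))
    for k, parent in enumerate((p1, p2)):
        copy = rng.integers(2)                 # start on a random copy
        for l in range(L):
            if rng.random() < r:               # crossover with frequency r
                copy = 1 - copy
            child[l, k] = parent[l, copy]
    return child

pop = rng.integers(0, 2, size=(L, M, N)).astype(float)
for t in range(5):                              # T = 5 generations
    top = np.argsort(gebv(pop))[-10:]           # select 10 parents by GEBV
    kids = [cross(pop[:, :, rng.choice(top)], pop[:, :, rng.choice(top)])
            for _ in range(N)]
    pop = np.stack(kids, axis=2)                # new population
print(float(gebv(pop).max()))
```

Truncation selection on GEBV typically raises the population maximum over generations, illustrating the genetic-gain loop before any resource allocation is considered.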

2.1.2. Modeling for MGIB

The MGIB problem is formalized as a sequential decision-making process over multiple generations. It explicitly addresses two interconnected subproblems: individual selection using the ILAS method and resource allocation via the ILAS-DQN approach.
The MGIB model takes as inputs the current population genotype (G), initial total budget B 0 , the marker effect ( β ), the recombination frequency ( r ), and the maximum generations ( T ). The overarching objective—maximizing long-term genetic gain—is achieved through synchronized execution of two complementary phases per generation, as formulated in Equations (2)–(8), where Equations (4)–(8) specify the constraints for objectives (Equation (2)).
$$\max_{x,y}\ \mathbb{E}\left[ g_T^{\max} \right] \tag{2}$$
$$g_t^{\max} = \max_{n \in \{1, \ldots, N\}} \sum_{l=1}^{L} \sum_{m=1}^{2} G_t(l, m, n)\, \beta_l \tag{3}$$
$$\sum_{t=1}^{T} b_t = B_0 \tag{4}$$
$$\text{s.t.} \quad \sum_{n=1}^{N} x_n = S \tag{5}$$
$$x_n = \sum_{j=1}^{N} y_{n,j}, \quad n \in \{1, \ldots, N\} \tag{6}$$
$$x_n \in \{0, 1\}, \quad n \in \{1, \ldots, N\} \tag{7}$$
$$y_{i,j} \in \{0, 1\}, \quad i, j \in \{1, \ldots, N\} \tag{8}$$
where $\mathbb{E}$ denotes the expectation over repeated simulations; $g_t^{\max}$ is the highest GEBV among the N individuals in generation t, computed via Equation (3); $b_t$ is the budget allocated to generation t by the policy; and S is the total number of individuals selected from the current population. The binary decision variable $x_n$ equals 1 if individual n is selected and 0 otherwise, and the binary variable $y_{i,j}$ equals 1 if individual i mates with individual j and 0 otherwise.
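As a concrete check of the constraints in Equations (5)–(8), the helper below validates a candidate selection/mating plan. It is an illustrative sketch; the function name and the toy plan are our own.

```python
import numpy as np

def plan_is_feasible(x, y, S):
    """Validate a selection/mating plan against Equations (5)-(8).

    x : (N,) binary selection vector; y : (N, N) binary mating matrix.
    """
    x, y = np.asarray(x), np.asarray(y)
    if not set(np.unique(x)) <= {0, 1}:       # Eq. (7): x binary
        return False
    if not set(np.unique(y)) <= {0, 1}:       # Eq. (8): y binary
        return False
    if x.sum() != S:                           # Eq. (5): exactly S parents
        return False
    # Eq. (6): each individual's selection flag equals its number of matings
    return all(x[n] == y[n].sum() for n in range(len(x)))

# toy plan: individuals 0 and 2 selected, one cross between them
x = [1, 0, 1, 0]
y = np.zeros((4, 4), dtype=int)
y[0, 2] = y[2, 0] = 1
print(plan_is_feasible(x, y, S=2))
```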

2.1.3. Modeling with Markov Decision Process

The resource allocation problem in MGIB is formalized as a Markov Decision Process (MDP). Under the constraints of a maximum of T generations and a total budget of $B_0$, the objective is to plan parental pair combinations and allocate resources $b_t$ for each generation t.
(1) State Representation
The state at generation t is represented by $(g_t^{\max}, C_t, B_{t-1})$; for specific details, please refer to the work of Moeinizade et al. [16]. Here, $g_t^{\max}$ is the highest GEBV among the N individuals of generation t, as formulated in Equation (3); $B_{t-1}$ is the budget remaining entering generation t; and $C_t \in \mathbb{R}^{K \times M}$ is a potential matrix whose entry $C_t(k,m)$ gives the maximum GEBV achievable in offspring of generation t using a budget of m units with probability greater than $p_k$ ($p_k \in (0,1)$); $C_t(k,m)$ can be solved using the following integer linear programming (ILP) formulation [16]. The first dimension, k, corresponds to the constraint on probabilistic recombinations enabled by the remaining resources. The second dimension, m, indicates the number of founding parents from which the gamete must collect alleles, a process requiring $\log_2 m$ generations of breeding.
The potential matrix C t k , m is derived by solving an integer linear programming (ILP) problem that maximizes the achievable breeding value under given budget and recombination constraints. The full formulation is given below.
$$\max_{x,y,z}\ C_t(k,m) = \sum_{i=1}^{L} \sum_{c=1}^{2} \sum_{j=1}^{N} \beta_i\, G_t(i,c,j)\, x_{i,c,j} \tag{9}$$
$$\sum_{j=1}^{N} \left( x_{i,1,j} + x_{i,2,j} \right) = 1, \quad i = 1, \ldots, L \tag{10}$$
$$\sum_{i=1}^{L} \left( x_{i,1,j} + x_{i,2,j} \right) \le L\, y_j, \quad j = 1, \ldots, N \tag{11}$$
$$x_{i,c,j} - x_{i+1,c,j} \le z_{i,j}, \quad i = 1, \ldots, L-1,\ c = 1, 2,\ j = 1, \ldots, N \tag{12}$$
$$x_{i+1,c,j} - x_{i,c,j} \le z_{i,j}, \quad i = 1, \ldots, L-1,\ c = 1, 2,\ j = 1, \ldots, N \tag{13}$$
$$\sum_{j=1}^{N} y_j \le \frac{B_{\mathrm{rem}}}{c_{\mathrm{cross}}} \tag{14}$$
$$\sum_{i=1}^{L-1} z_{i,j} \ln \frac{r_i}{1 - r_i} \ge \ln p_k, \quad j = 1, \ldots, N \tag{15}$$
$$x_{i,c,j},\ y_j,\ z_{i,j} \in \{0, 1\} \tag{16}$$
where
i: SNP locus index;
j: individual index;
c: chromosome copy (1: maternal, 2: paternal);
$\beta_i$: additive effect of locus i;
$G_t(i,c,j)$: allele at locus i, copy c of individual j in generation t (0: ancestral, 1: favorable);
$r_i$: recombination frequency between loci i and i + 1;
$B_{\mathrm{rem}}$: remaining budget (number of offspring that can be produced);
$c_{\mathrm{cross}}$: cost per cross (set to 1 budget unit per offspring);
$p_k$: maximum allowable cumulative recombination probability;
$x_{i,c,j}$: equals 1 if the allele at locus i, copy c of individual j is selected;
$y_j$: equals 1 if individual j is selected as a parent;
$z_{i,j}$: equals 1 if a recombination event occurs between loci i and i + 1 for parent j.
The objective function in Equation (9) maximizes the GEBV of a gamete that can be assembled from the current population, given a recombination level k and a parent budget m. Constraint (10) ensures that exactly one allele copy (from either chromosome) is selected for each locus across all individuals, thereby constructing a valid haploid gamete. Constraint (11) links allele selection to parent selection: if an individual is not chosen as a parent ($y_j = 0$), none of its alleles may be used in the gamete assembly. Constraints (12) and (13) jointly detect whether a recombination event occurs between adjacent loci i and i + 1 for each parent j; the binary variable $z_{i,j}$ is activated only when the selected alleles at the two consecutive loci originate from different chromosomal copies of the same parent. Constraint (14) imposes a parent selection limit based on the remaining breeding budget $B_{\mathrm{rem}}$, ensuring that the total number of selected parents does not exceed the number of crosses affordable under the current budget. Finally, constraint (15) restricts the cumulative recombination probability for each parent: by taking the logarithm of the original likelihood expression, it enforces that the total "recombination cost" across the genome does not fall below the threshold $\ln p_k$, which corresponds to the recombination level k. For instance, when recombination frequencies $r_i$ are very small, the term $\ln\left(r_i / (1 - r_i)\right)$ becomes highly negative, so a higher recombination level k (i.e., more genetic resources) is required to accommodate the necessary recombination events. Together, these constraints ensure that the ILP yields a genetically feasible and budget-aware gamete construction that maximizes short-term breeding gain while respecting long-term recombination and resource limits.
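For intuition, the gamete-assembly problem can be checked by brute force on a toy instance, here restricted to the single-parent case (m = 1) with invented dimensions and parameters: enumerate the chromosome-copy choice at every locus and discard assignments that violate the log-recombination cap of constraint (15). This is a sanity-check sketch only, not a substitute for the ILP solver.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
L, N = 6, 3                               # toy: 6 loci, 3 candidate parents
beta = rng.normal(size=L)                  # invented locus effects
G = rng.integers(0, 2, size=(L, 2, N))     # alleles: locus x copy x parent
r = np.full(L - 1, 0.1)                    # recombination frequencies
log_pk = np.log(1e-3)                      # invented threshold ln(p_k)

def best_gamete_from(j):
    """Best gamete assembled from parent j alone under the recombination cap."""
    best = -np.inf
    for copies in itertools.product((0, 1), repeat=L):  # copy choice per locus
        # z_i = 1 where adjacent loci come from different copies (a crossover)
        z = [copies[i] != copies[i + 1] for i in range(L - 1)]
        cost = sum(np.log(r[i] / (1 - r[i])) for i in range(L - 1) if z[i])
        if cost < log_pk:                  # violates the log-probability cap
            continue
        best = max(best, sum(beta[i] * G[i, copies[i], j] for i in range(L)))
    return best

best = max(best_gamete_from(j) for j in range(N))
print(float(best))
```

With $r_i = 0.1$, each crossover contributes $\ln(0.1/0.9) \approx -2.2$ to the log-likelihood, so the cap $\ln p_k = \ln 10^{-3}$ admits at most three crossovers per gamete in this toy setup.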
(2) Action Space
As mentioned earlier, the MGIB framework focuses on optimizing the resource allocation amount $b_t$ for each generation. The action space is defined as the per-generation resource allocation, organized as a vector along the time axis $(b_1, b_2, \ldots, b_T)$, where each $b_t$ ($t \in \{1, 2, \ldots, T\}$) is an integer variable. An action represents an allocation decision, that is, the number of offspring to be produced in the t-th generation. The action space is discrete and finite, $A = \{50, 100, 150, 200, 250, 300, 350\}$, and the selected action must satisfy the following constraints:
$$\sum_{t=1}^{T} b_t = B_0$$
$$b_t \le B_{t-1} - \alpha\,(T - t)$$
Here, T represents the maximum number of breeding generations; $B_0$ denotes the total budget (the maximum allocatable resources); and $\alpha$ is the minimum number of offspring required to maintain the population size in each generation.
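A minimal sketch of this feasibility rule: given the remaining budget and the number of future generations, only allocations that leave at least $\alpha$ offspring per remaining generation are admissible. The function name and the $\alpha$ value are illustrative choices, not taken from the paper.

```python
def feasible_actions(t, T, B_rem, actions=(50, 100, 150, 200, 250, 300, 350),
                     alpha=50):
    """Keep only allocations b_t <= B_rem - alpha * (T - t), i.e. those that
    leave at least alpha offspring for every remaining generation.
    (alpha = 50 is an illustrative choice, not taken from the paper.)"""
    return [b for b in actions if b <= B_rem - alpha * (T - t)]

# 400 budget units left at generation 2 of 5: three future generations must
# each retain at least 50 units, so at most 250 can be spent now
print(feasible_actions(t=2, T=5, B_rem=400))  # [50, 100, 150, 200, 250]
```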
(3) Reward Function
In reinforcement learning, the design of the reward function is particularly important, as it directly affects the quality of the agent's decisions. In the MGIB problem, the reward is delayed and sparse: a positive reward is obtained only at the terminal generation T. The ultimate goal is to maximize the GEBV of the final generation ($g_T^{\max}$) while optimally allocating the budget, and this delayed reward structure encourages the agent to make decisions that maximize long-term genetic gain.
Accordingly, the reward function is defined by the equation below; this reward structure directly aligns the agent's objective with the ultimate goal of the breeding plan.
$$R(s_t, b_t) = \begin{cases} 0 & \text{if } t < T \\ g_T^{\max} & \text{if } t = T \end{cases}$$

2.2. Overall Framework

The ILAS-DQN framework separates inter-generational and intra-generational decisions. The DQN agent determines the total number of offspring b t to be produced in generation t. For example, in a 5-generation breeding program with a total budget of 1000 offspring, DQN might output an allocation vector such as (100, 150, 200, 300, 250), where each entry is the offspring count for the corresponding generation.
Figure 2 illustrates the overall architecture of the ILAS-DQN framework. The framework synergistically combines the advantages of ILAS and DQN to tackle this sequential decision-making problem and consists of four interconnected components. First, the ILAS module receives the population genotype ($G_t$) and the effect of each gene locus ($\beta$) of the current generation t, along with other data. Through its integrated simulated annealing and estimation of distribution algorithm, it outputs the optimal single-generation selection and mating decisions ($x$, $y$). Subsequently, this decision, together with the population status, forms the environmental state for RL, which is input into the DQN agent. The DQN, leveraging its learned value network, decides the resource allocation (action $b_t$) with a view to the long-term benefit over the entire breeding horizon, and this plan is used to create the next-generation population ($G_{t+1}$). This process iterates, ultimately realizing the optimal resource allocation strategy from the initial budget ($B_0$) by solving each generation in a reverse-planning manner. Through this division of labor and collaboration, the ILAS-DQN framework achieves end-to-end optimization of the entire decision chain of "selection–mating–resource allocation".
Given $b_t$, the ILAS module first selects S = 20 parents and forms $C_{\max} = 10$ mating pairs, e.g., [(180, 2), (78, 9), (23, 3), (34, 9), (54, 2), …], where each tuple contains the indices of the two parents. ILAS then distributes the $b_t$ offspring among the pairs according to their genetic diversity. The number of offspring assigned to pair (i, j) is as follows:
$$n_{ij} = \left\lfloor b_t\, \frac{D_{ij}}{\sum_{(p,q) \in P} D_{pq}} \right\rfloor \tag{20}$$
$$D_{pq} = \sum_{l=1}^{L} \left( \max_{m} G(l,m,p)\, \beta_l - \min_{m} G(l,m,q)\, \beta_l \right) \tag{21}$$
where $D_{ij}$ is the genetic diversity (Equation (21)) between the haplotypes of parents i and j, and P is the set of all selected mating pairs. Any remaining offspring (due to the floor operation) are randomly assigned to pairs. For instance, if $b_t = 100$ and the diversity-weighted proportions lead to the distribution (10, 2, 3, 5, 20, 10, 15, 5, 12, 18), the sum equals 100. This design ensures that DQN focuses on strategic cross-generational resource planning, while ILAS handles tactical within-generation parent selection, mating, and diversity-aware offspring allocation.
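The diversity-proportional split with floor-and-remainder handling can be sketched as follows; the function name and pair diversities are our own illustrative choices.

```python
import numpy as np

def allocate_offspring(b_t, diversity, rng=None):
    """Split b_t offspring among mating pairs in proportion to diversity.

    diversity maps a pair (i, j) to its D_ij; shares are floored and the
    remainder is assigned to randomly chosen pairs, so counts sum to b_t.
    """
    if rng is None:
        rng = np.random.default_rng()
    pairs = list(diversity)
    total = sum(diversity.values())
    counts = {p: int(b_t * diversity[p] / total) for p in pairs}
    leftover = b_t - sum(counts.values())
    for idx in rng.choice(len(pairs), size=leftover, replace=True):
        counts[pairs[idx]] += 1
    return counts

D = {(180, 2): 3.0, (78, 9): 1.0, (23, 3): 2.0}   # invented diversities
alloc = allocate_offspring(100, D, np.random.default_rng(0))
print(alloc, sum(alloc.values()))
```

Because the floored shares undershoot the total by at most the number of pairs, the random remainder assignment guarantees the counts sum exactly to $b_t$.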

2.3. ILAS

2.3.1. The Workflow of ILAS

ILAS employs prospective simulation for evaluation and iteratively optimizes selection and mating decisions. Figure 3 depicts this process schematically.
Figure 3 illustrates the workflow of the ILAS approach. The process begins with a population of randomly initialized decision schemes. In each iteration, the algorithm first evaluates the GEBV of each scheme within the population and selects elite schemes based on their fitness. Then, an estimation of distribution algorithm (EDA) learns a probability model from the elite group and samples it to generate a promising new population. Subsequently, a simulated annealing (SA) mechanism perturbs the new population, accepting suboptimal solutions with a certain probability to effectively balance global exploration with local exploitation. This cycle of "evaluation–modeling–sampling–optimization" continues until the termination conditions are met, ultimately outputting the optimal decision.
The ILAS algorithm addresses key limitations in the simulation-based LAS method. While LAS optimizes parental selection and mating by simulating multi-generational genetic progress to maximize the final generation’s GEBV under resource constraints [9], its reliance on a genetic algorithm (GA) can lead to premature convergence and restricted search capability in high-dimensional breeding decision spaces [10].
ILAS introduces two core enhancements to improve search efficiency and solution quality: (1) the Estimation of Distribution Algorithm (EDA) replaces the traditional GA, guiding the search through a learned probabilistic model for more effective global exploration. (2) Simulated Annealing (SA) mechanism is embedded to enhance local exploitation and prevent convergence to local optima.

2.3.2. Global Search Optimization Based on EDA

The ILAS algorithm employs the EDA to replace the GA used in LAS for optimizing selection and mating decisions. Unlike traditional GA that relies on predefined crossover and mutation operators, EDA learns a probability model (e.g., Gaussian or Bayesian network) from high-quality solutions and generates new candidates by sampling from this model [29]. This approach intelligently guides the search by modeling the joint distribution of decision variables, enabling more directed exploration of the solution space and better escape from local optima [30].
Each candidate solution in ILAS is encoded as a binary vector $s \in \{0,1\}^N$ representing the selection of S individuals, and a mating matrix $M \in \{0,1\}^{S \times S}$ indicating pairwise crosses among the selected individuals. The total number of selected parents is fixed at S = 20 per generation, and each selected individual may participate in multiple crosses, up to a maximum of 10 mating pairs per individual, to maintain genetic diversity (Table 1). The mating matrix is symmetric with a zero diagonal (no selfing), and the total number of crosses is bounded by $\sum_{i<j} M_{i,j} \le C_{\max} = 10$.
Feasibility is enforced through a two-step repair procedure after sampling:
  • Selection repair: If the sampled vector contains more than S ones, the excess is randomly removed; if fewer, additional individuals are randomly selected among those with the highest marginal probability.
  • Mating repair: If the number of planned crosses exceeds C max , pairs are randomly dropped until the limit is satisfied; if the matrix violates the per-individual mating limit, the exceeding pairs are reassigned to other feasible individuals.
In implementation, the EDA initializes a uniform probability vector and samples a population of P candidate solutions (each comprising S selected individuals). Fitness, defined as the maximum offspring GEBV in the terminal-generation simulation, is then evaluated. An elite set (the top η% of solutions) is used to update the probability vector via a linkage matrix L, which tracks elite co-occurrence frequencies (Equation (22)): each element $L_{ij}$ counts the number of times individuals i and j appear together in the elite solutions, and the probability vector is updated accordingly. Finally, an SA mechanism is embedded to accept suboptimal solutions probabilistically based on the fitness change and a decreasing temperature schedule [31], thereby enhancing global exploration.
$$p_i^{\mathrm{new}} = \frac{\sum_j L_{ij}}{\sum_{i,j} L_{ij}}\,(1 - \epsilon) + \frac{\epsilon}{N} \tag{22}$$
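A compact sketch of this update (hypothetical function name; a tiny elite set for illustration): build the co-occurrence matrix from the elite solutions, take row marginals over the grand total, and smooth toward the uniform distribution with ε.

```python
import numpy as np

def update_probability(elites, N, eps=0.05):
    """Equation-(22)-style EDA update: marginals of the elite co-occurrence
    (linkage) matrix, smoothed toward the uniform distribution by eps."""
    Lmat = np.zeros((N, N))
    for sol in elites:                  # each sol: indices of S individuals
        for i in sol:
            for j in sol:
                Lmat[i, j] += 1         # co-occurrence count
    p = Lmat.sum(axis=1) / Lmat.sum()   # row sums over the grand total
    return p * (1 - eps) + eps / N      # epsilon smoothing

# two toy elite solutions over N = 4 individuals
p = update_probability([[0, 1], [0, 2]], N=4)
print(p.round(4))
```

The ε term keeps every individual's sampling probability strictly positive, so no candidate parent is permanently excluded from the search.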
Offspring reproduction is guided by genetic diversity, with offspring production proportional to the diversity of parent pairs, prioritizing more diverse mating combinations to maintain population variability [15]. The SA strategy [32] is incorporated into the EDA search process to enhance solution quality and finely balance exploration with exploitation [33]. Although EDA-based global search effectively explores the vast decision space, the algorithm may still converge prematurely by over-exploiting regions near local optima in later iterations [34]. As a metaheuristic inspired by thermodynamic processes, SA’s core advantage lies in its ability to probabilistically accept temporarily inferior solutions, thereby providing opportunities to escape local optima [35].
The SA acceptance probability follows the Metropolis criterion:
$$P_{\mathrm{accept}} = \min\left( 1,\ \exp\left( \frac{\Delta f}{T} \right) \right)$$
where $\Delta f$ is the change in fitness (maximum GEBV) and T is the current temperature; a worsening move ($\Delta f < 0$) is accepted with probability $\exp(\Delta f / T)$. The initial temperature is set to $T_0 = 100$, and cooling follows a geometric schedule $T_{k+1} = \alpha T_k$ with $\alpha = 0.95$. At each temperature, L = 50 neighborhood moves are attempted; a move consists of randomly toggling one selection bit or swapping one mating pair. The SA phase terminates when the temperature drops below $T_{\min} = 0.01$ or after 500 iterations. For specific hyperparameter settings, see Table 1.
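The acceptance rule and the geometric cooling schedule can be sketched as below, assuming a maximization objective where a negative Δf denotes a worsening move; the fixed Δf = −5 is purely illustrative.

```python
import math
import random

rng = random.Random(0)

def sa_accept(delta_f, temperature):
    """Metropolis rule for a maximization problem: improvements are always
    accepted; a worsening move (delta_f < 0) is accepted with probability
    exp(delta_f / temperature)."""
    return delta_f >= 0 or rng.random() < math.exp(delta_f / temperature)

# geometric cooling T_{k+1} = alpha * T_k from T0 = 100 down to T_min = 0.01
T, alpha, accepted = 100.0, 0.95, 0
for _ in range(200):
    accepted += sa_accept(-5.0, T)        # a fixed worsening move
    T = max(T * alpha, 0.01)
print(accepted, "of 200 worsening moves accepted")
```

Early on, when T is large, most worsening moves are accepted (enabling escape from local optima); as T cools, the acceptance probability collapses and the search becomes greedy.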
Table 1. Hyperparameter settings for the ILAS algorithm.
| Component | Parameter | Value | Description |
|---|---|---|---|
| EDA | Population size (P) | 100 | Number of candidate solutions per generation |
| | Elite ratio (η) | 0.2 | Proportion of top solutions used to update probability model |
| | Learning rate (ε) | 0.05 | Smoothing factor for probability vector update |
| | Stopping criterion | 100 iterations | Maximum number of EDA generations |
| SA | Initial temperature ($T_0$) | 100 | Starting temperature for acceptance probability |
| | Cooling factor (α) | 0.95 | Temperature reduction multiplier per iteration |
| | Moves per temperature (L) | 50 | Number of neighborhood attempts at each temperature |
| | Minimum temperature ($T_{\min}$) | 0.01 | Lower bound for temperature termination |
| Selection & Mating | Selected parents (S) | 20 | Fixed number of individuals chosen per generation |
| | Max crosses per individual | 10 | Maximum number of mating pairs an individual can participate in |
| | Total crosses ($C_{\max}$) | 10 | Upper limit on the number of mating pairs per generation |

2.4. DQN for Resource Allocation in MGIB

Deep RL, specifically DQN, is applied to handle the sequential decision-making challenges [26] of resource allocation in multi-generation plant breeding. DQN's ability to manage high-dimensional state spaces and learn long-term strategies makes it well suited to this task [36]. The goal is to learn, through DQN, an optimal resource allocation strategy that dynamically distributes a limited budget across generations to maximize the final genetic gain. The core of the method is a deep neural network that approximates the optimal action-value function $Q^*(s, a)$, i.e., the expected cumulative reward obtained by taking action a in state s and thereafter following the optimal policy $\pi^*$. The input to the DQN is a flattened feature vector derived from the state $s_t$ and the candidate action $b_t$:
$$\mathrm{Input} = \left( g_t^{\max},\ \mathrm{vec}(C_t),\ B_{t-1},\ b_t \right)$$
where $\mathrm{vec}(C_t)$ denotes the vectorization of the $k \times m$ matrix $C_t$; the input dimension is therefore $1 + km + 1 + 1$.
The DQN uses a multi-layer perceptron (MLP) with three hidden layers of sizes 256, 128, and 64, each followed by ReLU activation. The network receives as input a flattened feature vector of dimension $1 + km + 1 + 1$, representing the state–action pair $(s_t, b_t)$. The output layer has a single neuron that estimates $Q(s, a)$. Training employs the Adam optimizer with a learning rate of $1 \times 10^{-4}$ and a discount factor γ = 0.99. The experience replay buffer stores $10^5$ transitions, and mini-batch updates of size 64 are performed every 4 environment steps. The target network is synchronized with the main network every 1000 training steps. The exploration rate ε decays linearly from 1.0 to 0.01 over the first 5000 steps and remains constant thereafter. A complete summary of hyperparameters is provided in Table 2.
The delayed and sparse reward structure of the MGIB problem makes direct forward training inefficient. To address this, a backward training strategy is adopted. The process begins by generating a dataset of complete breeding trajectories (Algorithm 1), where each trajectory records the state, action, and final reward for every generation. Training then proceeds from the last generation backward: first, the model for generation T − 1 is trained to predict the final reward given the state at T − 1; next, the model for generation T − 2 is trained using the learned Q-values from generation T − 1 as its target, and so on. Formally, for generation t, the target values are given by Equation (25), and the DQN is trained by minimizing the temporal-difference loss in Equation (26) [36].
$$y_t = \begin{cases} g_T^{\max} & \text{if } t = T - 1, \\ \max_{a'} Q_{t+1}(s_{t+1}, a'; \theta) & \text{if } t < T - 1, \end{cases} \tag{25}$$
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( y_t - Q(s, a; \theta) \right)^2 \right] \tag{26}$$
where $(s, a, r, s')$ denotes a transition in a mini-batch randomly sampled from the experience replay buffer D; experience replay breaks the correlation between samples and improves data efficiency [36]. ILAS and DQN interact through a sequential handshake protocol executed at each generation t:
(1) State observation: DQN observes the current state $(g_t^{\max}, C_t, B_{t-1})$;
(2) Action selection: DQN selects the allocation action $a_t = b_t$ using an $\epsilon$-greedy policy;
(3) ILAS execution: ILAS receives the population $G_t$ and allocated budget $b_t$, and outputs the selection matrix $S_t$ and mating matrix $M_t$;
(4) Population update: the new population $G_{t+1}$ is generated through reproduction using $S_t$ and $M_t$;
(5) Reward and transition: the environment transitions to $s_{t+1}$ with reward $R_t$;
(6) Experience storage: the transition $(s_t, a_t, R_t, s_{t+1})$ is stored in the replay buffer.
This iterative exchange enables joint optimization of within-generation selection (by ILAS) and cross-generational allocation (by DQN).
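The backward target construction of Equation (25) can be sketched as follows; the trajectory layout, field names, and the stand-in q_next function are our own assumptions, not the paper's implementation.

```python
import numpy as np

def backward_targets(trajectories, T, q_next):
    """Backward targets per Equation (25): the last decision generation
    regresses on the terminal GEBV; earlier generations bootstrap from the
    next generation's Q-function (here a stand-in callable q_next)."""
    targets = {}
    for t in range(T - 2, -1, -1):               # 0-indexed decisions 0..T-2
        ys = []
        for traj in trajectories:
            if t == T - 2:                        # last decision generation
                ys.append(traj["g_T_max"])        # sparse terminal reward
            else:
                ys.append(q_next(traj["states"][t + 1]))
        targets[t] = np.array(ys)
    return targets

trajs = [{"states": [0, 1, 2], "g_T_max": 9.5}]   # invented toy trajectory
tg = backward_targets(trajs, T=4, q_next=lambda s: 0.9 * 9.5)
print(tg)
```

Training backward in this way gives every generation a dense regression target derived from the terminal reward, sidestepping the credit assignment problem that plagues forward training under sparse rewards.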
Algorithm 1 DQN-based learning data generation
Input: T; B_0; A; β; r; G_0
Output: multiple complete episodes of data
Initialization:
  DQN network; target network; experience replay pools {buffer_t | t = 1 to T − 1}
  Set hyperparameters: γ, ε, batch size, buffer capacity
1:  for sim := 1 to num_simulations do
2:      population ← G_0
3:      B_{t−1} ← B_0
4:      trajectory ← { }
5:      for t := 1 to T − 1 do
6:          g_t^max ← population.max_gebv(β)
7:          C_t ← ILP_Solver(population)
8:          state ← (g_t^max, C_t, B_{t−1})
9:          if t < T then
10:             b_t ← Action(t, T, B_{t−1}, A)
11:             trajectory[t] ← ⟨state, b_t⟩
12:         end if
13:         S ← Select(G_{t−1}, r, n, b_t)
14:         G_t ← Reproduce(G_{t−1}, S, r)
15:         B_{t−1} ← B_{t−1} − b_t
16:     end for
17:     Record g_T^max
18:     for t := T − 1 downto 1 do
19:         Record (trajectory[t].state, trajectory[t].action, g_T^max) to data[t]
20:     end for
21: end for
22: return data
The data-generation procedure and the resource-allocation training model are detailed in Algorithms 1 and 2, respectively. This backward training method effectively addresses the credit-assignment problem in sparse-reward environments [26]. After training, the optimal resource allocation action for any state s can be determined by greedy selection, that is, by choosing the action with the highest predicted Q-value (Equation (27)).
π*(s) = argmax_{a ∈ A} Q(s, a; θ)
Algorithm 2 Genome resource allocation based on DQN
Initialize: DQN parameters θ; target network parameters θ⁻ ← θ; empty replay buffer D; total generations T; budget B_0; action set A.
Output: trained DQN models for optimal policy extraction.
  • Generate data: run N simulations, storing the transitions for all t in D.
1:  for t := T − 1 downto 1 do    // backward training
2:      Aggregate all transitions from D whose state generation ≥ t.
3:      for each training epoch do
4:          Sample a random mini-batch from the aggregated data.
5:          For each transition: y ← r + γ max_{a′} Q(s′, a′; θ⁻)
6:          Update θ by gradient descent on (y − Q(s, a; θ))².
7:      end for
8:      Periodically update the target network: θ⁻ ← θ
9:      Save the model for generation t.
10: end for

3. Results

3.1. Simulation Settings

3.1.1. Datasets

(1)
Corn2019 Dataset (Table 3)
This dataset derives from an actual maize breeding population comprising 396 inbred lines [9]. It contains approximately 1.4 million SNP markers distributed across 10 chromosomes. To balance computational efficiency with information retention, we filtered markers by defining haplotype blocks, retaining 9063 high-quality SNPs for subsequent simulations.
Table 3. Examples of the genotype formats of 5 single-nucleotide polymorphisms (SNPs) in 5 individuals within the Corn2019 dataset.
SNP   | R001 | R002 | R003 | R004 | R005
SNP1  | 0/0  | 0/0  | 1/1  | 0/0  | 0/0
SNP2  | 0/0  | 0/1  | 0/0  | 0/0  | 0/1
SNP3  | 0/0  | 0/0  | 0/1  | 1/1  | 1/0
SNP4  | 0/1  | 1/1  | 0/0  | 1/1  | 0/0
SNP5  | 1/1  | 1/0  | 1/1  | 0/0  | 0/0
(2)
Cubic Dataset
The dataset is a large-scale population comprising 1187 homozygous maize varieties with 24,591 SNP markers. Through similar filtering, 4097 informative SNP markers were ultimately retained [10]. The original genotypes are encoded as gene dosages (0, 1, or 2). To ensure compatibility with the simulation pipeline, Eagle software was employed to convert the Cubic dataset from dosage format to the haplotype format used for Corn2019.

3.1.2. CUBIC Dataset Preprocessing and BLUE Imputation

Prior to analysis of the Cubic dataset, data missingness was assessed, followed by filtering and imputation. The locus-level missing-rate distribution was highly right-skewed, with most loci having low missing rates (<0.1) (Figure 4a). Loci exceeding a 20% genotype missing-rate threshold were removed. At the individual level (Figure 4b), individuals with >50% missing phenotype data were filtered out. Best Linear Unbiased Estimates (BLUEs) were then used to impute the remaining missing values, yielding a complete dataset for analysis.
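The two filtering rules can be sketched as a minimal NumPy routine, assuming missing entries are encoded as NaN; the function and variable names are illustrative, not the pipeline's actual API.

```python
import numpy as np

def qc_filter(geno, pheno, locus_thresh=0.20, indiv_thresh=0.50):
    """Drop loci whose genotype missing rate exceeds 20% and individuals
    whose phenotype missing rate exceeds 50% (missing values are NaN)."""
    keep_loci = np.isnan(geno).mean(axis=0) <= locus_thresh   # per-locus rate
    keep_inds = np.isnan(pheno).mean(axis=1) <= indiv_thresh  # per-individual rate
    return geno[np.ix_(keep_inds, keep_loci)], pheno[keep_inds]

# Toy example: 4 individuals x 3 loci; locus 2 is 50% missing,
# and individual 3 has no phenotype records at all.
geno = np.ones((4, 3))
geno[:2, 2] = np.nan
pheno = np.ones((4, 2))
pheno[3, :] = np.nan
geno_qc, pheno_qc = qc_filter(geno, pheno)
```

In the toy example the over-threshold locus and the fully unphenotyped individual are both removed, leaving a 3 × 2 genotype matrix for downstream imputation.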
The CUBIC dataset preprocessing follows a standardized pipeline. Genotype missingness filtering removes loci exceeding a 20% missing rate and individuals exceeding 50% missing phenotype data. Phenotype imputation employs the Best Linear Unbiased Estimator (BLUE) with the following model specification:
y i j = μ + g i + e j + ϵ i j
where y i j represents the phenotypic value of genotype i in environment j, μ is the overall mean, g i is the fixed effect of genotype i, e j is the fixed effect of environment j, and ϵ i j ~ N ( 0 , σ e 2 ) is the residual error. The model assumes no genotype–environment interaction. The BLUE-adjusted values serve as the phenotypic input for GEBV estimation, ensuring that subsequent selection decisions account for environmental variability while maintaining genetic signal integrity.
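As a toy illustration of BLUE-style imputation under this fixed-effects model, a reference-coded least-squares fit can recover an adjusted value for a missing genotype–environment cell. All data and names here are illustrative assumptions; a production analysis would use a dedicated mixed-model package.

```python
import numpy as np

# Toy data for the two-way fixed-effects model above:
# y_ij = mu + g_i + e_j + eps_ij, with no genotype-environment interaction.
# Tuples are (genotype index, environment index, phenotype); cell (2, 1) is missing.
obs = [(0, 0, 10.0), (0, 1, 12.0), (1, 0, 11.0), (1, 1, 13.0), (2, 0, 9.0)]
n_g, n_e = 3, 2

# Reference-coded design matrix: columns [mu, g_1, g_2, e_1] (level 0 dropped).
X = np.zeros((len(obs), 1 + (n_g - 1) + (n_e - 1)))
y = np.zeros(len(obs))
for row, (g, e, val) in enumerate(obs):
    X[row, 0] = 1.0                    # intercept mu
    if g > 0:
        X[row, g] = 1.0                # genotype effect g_i
    if e > 0:
        X[row, n_g - 1 + e] = 1.0      # environment effect e_j
    y[row] = val

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# Adjusted (BLUE-style) value for the missing cell: mu + g_2 + e_1.
imputed = beta[0] + beta[2] + beta[3]
```

Here the toy data are exactly additive (mu = 10, g_2 = −1, e_1 = 2), so the fit recovers the missing cell's adjusted value of 11.0, illustrating how environmental effects are separated from genetic signal before GEBV estimation.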
A sensitivity analysis evaluates the robustness of ILAS-DQN performance to preprocessing parameter choices. Three alternative preprocessing configurations are compared against the default settings:
  • Conservative filtering: loci > 15% missing, individuals > 40% missing;
  • Liberal filtering: loci > 25% missing, individuals > 60% missing;
  • Alternative imputation: mean imputation instead of BLUE.
Table 4 presents the final-generation g_T^max under each configuration. Performance variations remain within 2.1% of the default configuration across all sensitivity tests, confirming that the main conclusions are robust to reasonable preprocessing parameter variations.

3.1.3. Simulation Parameter Settings

The simulation experiments were designed following the classical paradigm for resource allocation in MGIB [16]. In the simulation experiments, the total breeding generations were set to T = 5, with a total budget B0 = 1000 (where one unit of budget corresponds to producing one offspring). The resource allocation action for each generation is selected from the discrete set A = {50, 100, 150, 200, 250, 300, 350, 400}. The initial population consisted of 200 randomly selected individuals. A maximum of 10 crosses were allowed per generation. Based on the flowchart in Figure 5, 300 independent simulations were conducted. To balance computational feasibility and biological realism, this study incorporates two key assumptions [9]:
  • In the process from t to t + 1, each selected pair of parents produces multiple offspring.
  • In the look-ahead simulation from t = 1 to T−1, all individuals undergo completely random mating (including selfing).
Note that GEBV represents the predicted genetic merit of an individual relative to the population mean (unitless); the numerical values presented in the subsequent experiments were obtained at a scale of 1 × 10−4 on the Corn2019 dataset and 1 × 10−5 on the CUBIC dataset. top50_average denotes the mean GEBV of the top 50 individuals in the final generation. Values are reported as mean ± standard deviation across 300 simulation replicates. Budget units correspond to offspring counts; one unit equals the production of one offspring individual.
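Under these settings, the space of budget-feasible allocation sequences is small enough to enumerate directly, which is useful for sanity-checking learned policies. This is a sketch with illustrative names; the paper's simulator is not reproduced here.

```python
from itertools import product

ACTIONS = (50, 100, 150, 200, 250, 300, 350, 400)   # action set A
T, B0 = 5, 1000                                     # generations and total budget

def feasible_plans(actions=ACTIONS, steps=T - 1, budget=B0):
    """Enumerate per-generation allocation sequences (b_1, ..., b_{T-1})
    that respect the total budget B0. With 8^4 = 4096 candidates,
    brute-force enumeration is cheap at this problem scale."""
    return [p for p in product(actions, repeat=steps) if sum(p) <= budget]

plans = feasible_plans()
```

A uniform plan of 250 offspring per generation exactly exhausts the budget, whereas always choosing the maximum action (400 × 4 = 1600) is infeasible; any learned DQN policy must stay inside this feasible set.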

3.1.4. Computational Requirements and Scalability Analysis

Experiments were conducted on an HPC cluster with NVIDIA A80 GPUs (80 GB memory), Intel Xeon Platinum 8368 processors, and 512 GB RAM per node. ILAS-DQN training requires 8.2 ± 1.3 h for 5 generations with population size N = 200. Runtime scales approximately linearly with population size (O(N^1.1)) and super-linearly with the number of generations (O(T^1.8)) owing to the backward-training complexity. Peak memory usage is 12.4 GB for Corn2019 and 9.8 GB for CUBIC. Memory scales linearly with SNP count L (O(L)) and sub-linearly with population size. For typical breeding-program scales (N ≤ 1000, T ≤ 10, L ≤ 50,000), ILAS-DQN can complete optimization within 24 h using standard HPC resources, making it feasible for real-world breeding decision support.

3.2. Simulation Results

3.2.1. Performance Comparison of ILAS in MGIB

This subsection evaluates the performance of ILAS in single-generation selection and mating decisions, comparing it with LAS and LAS_SA in terms of genetic gain and genetic diversity.
ILAS demonstrated superior genetic gain over both LAS and LAS_SA, as evidenced by significantly higher maximum and population-average GEBVs in the final generation (Figure 6). The maximum GEBV achieved by ILAS (165.60 ± 6.53) notably exceeded that of LAS_SA (155.49 ± 5.14) and the baseline LAS (151.84 ± 4.72) (Table 5). This performance advantage indicates that the integrated EDA and SA strategies within ILAS effectively guided the identification and aggregation of favorable alleles. Consequently, offspring with higher breeding values were systematically generated. Moreover, the enhancement was not limited to elite individuals but extended across the entire breeding population. The population average GEBV under ILAS (76.16 ± 6.82) was also significantly greater than that of LAS_SA (73.56 ± 7.58) and LAS (70.70 ± 5.96) (Figure 6c), confirming a broad genetic improvement.
ILAS delivered significant improvements in both genetic gain and diversity over the baseline methods, as summarized in Table 5. It achieved a maximum GEBV of 165.60 ± 6.53 and an average GEBV of 76.16 ± 6.82 in the final generation, representing improvements of approximately 9.1% and 7.7%, respectively, over the LAS baseline (p < 0.01). Notably, ILAS maintained the highest genetic diversity among the three methods (Figure 7), indicating that its success stems from a balanced decision-making mechanism that prioritizes superior individuals while preserving a broad genetic base through an effective exploration–exploitation trade-off. This superior performance can be attributed to the synergistic effect of EDA and SA.

3.2.2. Function Approximator Selection and Performance Analysis

A function-approximation approach was employed to estimate the final generation's optimal GEBV, using as inputs the current maximum GEBV (g_t^max), the potential matrix (C_t), the remaining budget (B_{t−1}), and the action (b_t). Within the backward learning framework, the model is dynamically updated as training data accumulate. Initially, a function f is trained on data from generation 4 (i.e., T − 1), establishing a mapping between state–action pairs and g_T^max. Data from the third and second generations are then incorporated in turn, iteratively updating the model down to the first generation.
The DQN demonstrated superior fitting accuracy over RF for the fourth-generation model, as evidenced by its lower root mean square error (RMSE) in Figure 8. DQN achieved an RMSE of 9.618, outperforming the RF model’s RMSE of 10.130. This performance advantage highlights the enhanced capability of the DQN architecture to approximate the complex, high-dimensional value function inherent to the breeding resource allocation problem.
To demonstrate training stability, the temporal-difference (TD) loss and the predicted terminal GEBV are plotted against training iterations for each generation (Figure 9). For all four generations, the TD loss exhibits an exponential decay pattern, decreasing rapidly in the first 500 iterations and reaching a stable plateau after approximately 2000 iterations. The loss curves show minimal fluctuation in the final 200 iterations (Figure 9c), confirming robust convergence. Correspondingly, the predicted terminal GEBV follows a sigmoidal growth trajectory, approaching distinct asymptotic values for each generation (Gen 4: ≈160, Gen 3: ≈152, Gen 2: ≈144, Gen 1: ≈136) as shown in Figure 9d. The close alignment between the final predicted GEBV and actual simulation outcomes (indicated by dotted horizontal lines in Figure 9b) validates the accuracy of value-function approximation. These consistent, reproducible training dynamics across three independent runs demonstrate that the backward training procedure effectively propagates reward signals and yields stable policies.

3.2.3. Statistical Analysis of ILAS-DQN Performance

The ILAS-DQN framework was evaluated against three benchmarks: the Uniform allocation strategy, LAS-RF, and ILAS-RF. Performance was assessed using the final generation's g_T^max and the average GEBV of the top 50 individuals (top50_average_g_max) to measure genetic gain on the Corn2019 and Cubic datasets.
ILAS-DQN consistently outperformed all benchmarks. As shown by the rightmost cumulative distribution function (CDF) curves in Figure 10, it achieved a higher maximum GEBV and superior elite sub-population performance across 300 simulation runs on both datasets, demonstrating stochastic dominance. On the Corn2019 dataset, it also produced the highest single g_T^max value (184.18) with the lowest standard deviation (11.39), indicating exceptional stability in cultivating top-tier individuals.
A comprehensive statistical analysis evaluates the superiority of ILAS-DQN over baseline methods. Table 6 provides complete statistical summaries, including mean ± 95% confidence interval (CI), median, interquartile range (IQR), and paired statistical test results. It is worth noting that the interquartile range (IQR) = Q3−Q1, where Q1 and Q3 represent the 25th and 75th percentiles, respectively. All p-values were obtained from paired t-tests with Bonferroni correction for multiple comparisons. For each dataset and metric, the analysis includes (1) Shapiro–Wilk normality tests (all p > 0.05, confirming normal distributions); (2) paired t-tests comparing ILAS-DQN against each baseline method; and (3) 95% confidence interval calculations using the t-distribution with 299 degrees of freedom. The large sample size (n = 300 simulation replicates) ensures substantial statistical power.
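The statistical procedures described above can be sketched with a few NumPy helpers. This is a minimal sketch: the normal approximation to the t distribution and the critical value 1.968 are adequate at 299 degrees of freedom, all function names are illustrative, and the replicate values below are synthetic stand-ins, not the paper's results.

```python
import math
import numpy as np

def paired_t_test(a, b):
    """Paired t-test; two-sided p-value via a normal approximation,
    which is accurate at n = 300 (299 degrees of freedom)."""
    d = np.asarray(a) - np.asarray(b)
    t = d.mean() / (d.std(ddof=1) / math.sqrt(d.size))
    p = math.erfc(abs(t) / math.sqrt(2.0))   # two-sided tail probability
    return t, p

def bonferroni(p_values):
    """Bonferroni correction: p' = min(1, m * p) for m comparisons."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

def ci95(x, t_crit=1.968):
    """Mean +/- t_crit * SE; t_crit is about 1.968 for 299 df."""
    x = np.asarray(x)
    half = t_crit * x.std(ddof=1) / math.sqrt(x.size)
    return x.mean() - half, x.mean() + half

def cohens_d_paired(a, b):
    """Effect size for paired samples: mean difference / SD of differences."""
    d = np.asarray(a) - np.asarray(b)
    return d.mean() / d.std(ddof=1)

# Synthetic replicate values for illustration only.
a = np.arange(300, dtype=float)          # "method A" replicates
b = a - 5.0 + np.sin(a)                  # "baseline" shifted down by ~5
t_stat, p_raw = paired_t_test(a, b)
effect = cohens_d_paired(a, b)
lo, hi = ci95(a)
```

The same pipeline (paired test, Bonferroni adjustment of the raw p-values, and 95% CIs) applied per dataset and metric reproduces the structure of the analysis reported in Table 6.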
On the Corn2019 dataset, ILAS-DQN exhibits a tighter distribution (IQR = 2.35) compared to ILAS-RF (IQR = 2.38) and Uniform allocation (IQR = 3.86). This tighter distribution indicates both higher mean performance and greater consistency. While ILAS-DQN achieves the highest single-run g_T^max (184.18), its 95th percentile value (168.42) is slightly lower than ILAS-RF's 95th percentile (169.15). This pattern suggests that ILAS-RF occasionally produces exceptional results through favorable recombination events, but with lower consistency. ILAS-DQN provides reliable, high-quality solutions across all simulation runs.
On the Cubic dataset, ILAS-DQN demonstrates more pronounced advantages, achieving both the highest mean (17.95) and the lowest variance (SD = 0.78). Its coefficient of variation (CV = 4.3%) is significantly lower than that of other methods (Uniform: CV = 5.4%; LAS-RF: CV = 5.9%; ILAS-RF: CV = 5.2%), demonstrating superior robustness on genetically complex datasets.
With 300 simulation replicates, the study achieves 99.8% power to detect a medium effect size (d = 0.5) at α = 0.05, and 80% power to detect a small effect size (d = 0.2). Observed effect sizes between ILAS-DQN and baselines range from d = 0.35 (vs. ILAS-RF on Corn2019) to d = 1.85 (vs. Uniform on Cubic), exceeding thresholds for practical significance.
The performance advantage of ILAS-DQN was even more pronounced on the genetically complex Cubic dataset. It achieved the highest population means for both g_max (17.95) and top50_average_g_max (17.74), alongside the lowest standard deviations (0.78 and 0.73, respectively), as shown in Figure 11. ILAS-DQN leverages its capacity for modeling complex nonlinearity to excel on the Cubic dataset, where traditional methods falter against pronounced non-additive and epistatic effects.
The comparative analysis of final-generation performance (Table 7) indicates that the ILAS-DQN strategy generally outperforms the Uniform, LAS-RF, and ILAS-RF strategies. For the Corn2019 dataset, ILAS-DQN yields statistically significant and large practical improvements in both g_T^max and top50_avg g_T^max relative to all three benchmarks. In the Cubic dataset, while ILAS-DQN maintains a large and significant advantage over Uniform and LAS-RF, its performance edge over ILAS-RF is substantially reduced, resulting in only small-to-medium practical significance. These results highlight the context-dependent efficacy of ILAS-DQN, with its superiority being more pronounced in the Corn2019 environment.
To isolate the contribution of the diversity-based allocation rule from the DQN’s inter-generational budget policy, two variants of ILAS-DQN were compared: (1) Uniform allocation: each mating pair receives an equal share of the generation’s offspring budget; (2) Diversity-weighted allocation (the default described above). On both the Corn2019 and Cubic datasets, diversity-weighted allocation improved the final g T m a x over uniform allocation (p < 0.05, paired t-test). This result confirms that the intra-generational diversity-aware redistribution enhances genetic gain independently of the inter-generational policy learned by DQN.

4. Discussion

The ILAS-DQN framework integrates an improved evolutionary algorithm with deep reinforcement learning to optimize dynamic resource allocation in multi-generation plant breeding, demonstrating superior long-term genetic gain over conventional methods.

4.1. Effectiveness and Advantages of Algorithm Fusion

ILAS-DQN’s primary advantage lies in synergistically optimizing both single-generation selection and cross-generational resource planning. Specifically, the ILAS module enhances the baseline LAS by incorporating EDA for guided global search and SA for escaping local optima, yielding a 9.1% improvement in final-generation maximum GEBV. The DQN component enables strategic multi-generation planning by approximating long-term returns, facilitating better exploration–exploitation trade-offs and consistent performance gains (3.1–11.7%) across datasets. By integrating EDA-based exploration with DQN-based long-term planning, ILAS-DQN distinguishes itself from incremental EA–RL applications and establishes a foundation for future breeding optimization frameworks.

4.2. Limitations of the Study

Despite the encouraging results, several limitations warrant consideration for future research and practical application. First, DQN training relies heavily on computationally expensive breeding simulations, and the low sample efficiency of deep reinforcement learning remains a practical barrier for large-scale breeding programs. Second, framework performance depends on the accuracy of recombination frequency estimation and the LAS algorithm’s predictions; errors in these components could propagate and affect allocation decisions. Third, the work optimizes a single trait, whereas real-world breeding often requires balancing multiple competing traits in multi-objective optimization scenarios. Fourth, the current implementation incorporates several simplifying assumptions that may limit immediate applicability to all breeding contexts. The discrete action space for resource allocation, though practical for implementation, may not fully capture continuous resource optimization possibilities. Additionally, the simulation assumes random mating (including potential selfing) during look-ahead evaluations, which simplifies modeling but may not reflect practical breeding operations where selfing is prohibited, inbreeding is restricted, or multiple offspring per cross are produced.
Finally, while the framework demonstrates strong performance on the tested maize datasets (Corn2019 and CUBIC), validation across a broader range of crop species with differing genetic architectures, such as rice (predominantly self-pollinating), wheat (polyploid), or soybean (different recombination patterns), remains necessary to establish general applicability. The current evaluation does not fully address sensitivity to varying genetic architectures, including different linkage disequilibrium patterns, ploidy levels, and mating systems across crop species.

4.3. Practical Implications and Future Outlook

ILAS-DQN provides a data-driven decision-support tool that shifts breeding resource allocation from static expert-based models to dynamic adaptive optimization. This can enhance breeding efficiency, accelerate genetic gain, and contribute to food security through the rapid development of resilient crop varieties. Future work should prioritize improving sample efficiency by investigating advanced RL algorithms like offline RL to reduce computational demands. Additionally, the framework must be extended to multi-objective and multi-environment settings, and rigorously validated with real-world multi-trait breeding data.
Specifically addressing the generalizability concern, future research will (1) conduct cross-crop validation using publicly available rice, wheat, and soybean genomic datasets to assess framework robustness across different genetic architectures; (2) develop adaptive mechanisms that automatically adjust to species-specific genetic parameters (e.g., recombination rates, ploidy, mating systems); and (3) establish transfer learning protocols enabling knowledge transfer from well-characterized crops to under-resourced species. Enhancing its interpretability and facilitating integration with automated phenotyping and genomic platforms are also critical steps for adoption in intelligent breeding systems.

5. Conclusions

In conclusion, this study introduces the ILAS-DQN framework, a hybrid intelligence approach that effectively solves the dynamic resource allocation problem in multi-generation plant breeding. The experimental results demonstrate the following: (1) The ILAS algorithm, enhanced by EDA and SA, significantly improves single-generation selection efficiency. (2) The integration of ILAS with DQN enables strategic, long-term planning, achieving substantial genetic gain improvements over traditional and RL-based baseline methods. (3) The proposed framework exhibits strong generalization ability across datasets with distinct genetic architectures. Future work will focus on investigating adaptive breeding horizons, continuous action spaces, and more realistic mating constraints. Extension of this research to multi-trait, multi-environment optimization and integration with real-world breeding data will further enhance practical utility.

Author Contributions

Conceptualization, B.H. and J.L.; methodology, M.Y. and Z.L.; validation, M.Y., Z.L., X.L. (Xin Lu) and X.N.; formal analysis, B.H., J.L., M.Y. and Z.L.; investigation, M.Y., Z.L., X.L. (Xin Lu) and X.N.; data curation, M.Y. and Z.L.; writing—original draft preparation, M.Y. and Z.L.; writing—review and editing, X.L. (Xin Lu), X.L. (Xiaoxia Li) and X.N.; visualization, M.Y. and Z.L.; supervision, X.L. (Xiaoxia Li); correspondence, X.L. (Xin Lu), X.L. (Xiaoxia Li); project administration, X.L. (Xin Lu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Biological Breeding-National Science and Technology Major Project (2023ZD0404702 to X.N.).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All corn2019 data including phased SNPs for maize inbred lines from the SAM Diversity Panel and genetic maps are available at Figshare: https://iastate.figshare.com/s/374176500b04fd6f3729 (accessed on 12 June 2025). The CUBIC dataset used in this study was obtained from the CNGBdb under accession number CNP0001565, accessible at https://ftp.cngb.org/pub/CNSA/data3/CNP0001565/zeamap/99_MaizegoResources/01_CUBIC_related/ (accessed on 18 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Meuwissen, T.H.; Hayes, B.J.; Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 2001, 157, 1819–1829. [Google Scholar] [CrossRef] [PubMed]
  2. Armañanzas, R.; Inza, I.; Santana, R.; Saeys, Y.; Flores, J.L.; Lozano, J.A.; Peer, Y.V.D.; Blanco, R.; Robles, V.; Bielza, C. A review of estimation of distribution algorithms in bioinformatics. BioData Min. 2008, 1, 6. [Google Scholar] [CrossRef] [PubMed]
  3. Atanda, S.A.; Bandillo, N. Genomic-inferred cross-selection methods for multi-trait improvement in a recurrent selection breeding program. Plant Methods 2024, 20, 133. [Google Scholar] [CrossRef]
  4. Crossa, J.; Pérez-Rodríguez, P.; Cuevas, J.; Montesinos-López, O.; Jarquín, D.; De Los Campos, G.; Burgueño, J.; González-Camacho, J.M.; Pérez-Elizalde, S.; Beyene, Y. Genomic selection in plant breeding: Methods, models, and perspectives. Trends Plant Sci. 2017, 22, 961–975. [Google Scholar] [CrossRef]
  5. Krishnappa, G.; Savadi, S.; Tyagi, B.S.; Singh, S.K.; Mamrutha, H.M.; Kumar, S.; Mishra, C.N.; Khan, H.; Gangadhara, K.; Uday, G. Integrated genomic selection for rapid improvement of crops. Genomics 2021, 113, 1070–1086. [Google Scholar] [CrossRef]
  6. Cui, Y.; Li, R.; Li, G.; Zhang, F.; Zhu, T.; Zhang, Q.; Ali, J.; Li, Z.; Xu, S. Hybrid breeding of rice via genomic selection. Plant Biotechnol. J. 2020, 18, 57–67. [Google Scholar] [CrossRef]
  7. Hammer, G.L.; McLean, G.; van Oosterom, E.; Chapman, S.; Zheng, B.; Wu, A.; Doherty, A.; Jordan, D. Designing crops for adaptation to the drought and high-temperature risks anticipated in future climates. Crop Sci. 2020, 60, 605–621. [Google Scholar] [CrossRef]
  8. Gorjanc, G.; Gaynor, R.C.; Hickey, J.M. Optimal cross selection for long-term genetic gain in two-part programs with rapid recurrent genomic selection. Theor. Appl. Genet. 2018, 131, 1953–1966. [Google Scholar] [CrossRef] [PubMed]
  9. Moeinizade, S.; Hu, G.; Wang, L.; Schnable, P.S. Optimizing selection and mating in genomic selection with a look-ahead approach: An operations research framework. G3 Genes Genomes Genet. 2019, 9, 2123–2133. [Google Scholar] [CrossRef]
  10. Liu, H.J.; Wang, X.; Xiao, Y.; Luo, J.; Qiao, F.; Yang, W.; Zhang, R.; Meng, Y.; Sun, J.; Yan, S.; et al. CUBIC: An atlas of genetic architecture promises directed maize improvement. Genome Biol. 2020, 21, 20. [Google Scholar] [CrossRef]
  11. Chen, L.; Li, C.; Sargolzaei, M.; Schenkel, F. Impact of genotype imputation on the performance of GBLUP and Bayesian methods for genomic prediction. PLoS ONE 2014, 9, e101544. [Google Scholar] [CrossRef]
  12. Liu, Y.; Lu, S.; Liu, F.; Shao, C.; Zhou, Q.; Wang, N.; Li, Y.; Yang, Y.; Zhang, Y.; Sun, H. Genomic selection using BayesCπ and GBLUP for resistance against Edwardsiella tarda in Japanese flounder (Paralichthys olivaceus). Mar. Biotechnol. 2018, 20, 559–565. [Google Scholar] [CrossRef]
  13. Mahendran, N.; Durai Raj Vincent, P.; Srinivasan, K.; Chang, C.-Y. Machine learning based computational gene selection models: A survey, performance evaluation, open issues, and future research directions. Front. Genet. 2020, 11, 603808. [Google Scholar] [CrossRef] [PubMed]
  14. Gorjanc, G.; Jenko, J.; Hearne, S.J.; Hickey, J.M. Initiating maize pre-breeding programs using genomic selection to harness polygenic variation from landrace populations. BMC Genom. 2016, 17, 30. [Google Scholar] [CrossRef] [PubMed]
  15. Akdemir, D.; Beavis, W.; Fritsche-Neto, R.; Singh, A.K.; Isidro-Sánchez, J. Multi-objective optimized genomic breeding strategies for sustainable food improvement. Heredity 2019, 122, 672–683. [Google Scholar] [CrossRef]
  16. Moeinizade, S.; Han, Y.; Pham, H.; Hu, G.; Wang, L. A look-ahead Monte Carlo simulation method for improving parental selection in trait introgression. Sci. Rep. 2021, 11, 3918. [Google Scholar] [CrossRef] [PubMed]
  17. Moeinizade, S.; Hu, G.; Wang, L. A reinforcement Learning approach to resource allocation in genomic selection. Intell. Syst. Appl. 2022, 14, 200076. [Google Scholar] [CrossRef]
  18. Mazyavkina, N.; Sviridov, S.; Ivanov, S.; Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 2021, 134, 105400. [Google Scholar] [CrossRef]
  19. Shakya, A.K.; Pillai, G.; Chakrabarty, S. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023, 231, 120495. [Google Scholar] [CrossRef]
  20. Gambella, C.; Ghaddar, B.; Naoum-Sawaya, J. Optimization problems for machine learning: A survey. Eur. J. Oper. Res. 2021, 290, 807–828. [Google Scholar] [CrossRef]
  21. Younis, O.G.; Corinzia, L.; Athanasiadis, I.N.; Krause, A.; Buhmann, J.M.; Turchetta, M. Breeding programs optimization with reinforcement learning. arXiv 2024, arXiv:2406.03932. [Google Scholar] [CrossRef]
  22. Hao, J.; Li, B.; Tang, W.; Liu, S.; Chang, Y.; Pan, J.; Tao, Y.; Lv, C. A Reinforcement Learning-Driven UAV-Based Smart Agriculture System for Extreme Weather Prediction. Agronomy 2025, 15, 964. [Google Scholar] [CrossRef]
  23. Fu, H.; Li, Z.; Zhang, W.; Feng, Y.; Zhu, L.; Long, Y.; Li, J. Path Planning for Agricultural UAVs Based on Deep Reinforcement Learning and Energy Consumption Constraints. Agriculture 2025, 15, 943. [Google Scholar] [CrossRef]
  24. Meshram, R.A.; Alvi, A.S. Design of an Iterative Method for Crop Disease Analysis Incorporating Graph Attention with Spatial-Temporal Learning and Deep Q-Networks. Int. J. Intell. Eng. Syst. 2024, 17, 1301–1321. [Google Scholar] [CrossRef]
  25. Chavan, Y.R.; Swamikan, B.; Gupta, M.V.; Bobade, S.; Malhan, A. Enhanced Crop Yield Forecasting Using Deep Reinforcement Learning and Multi-source Remote Sensing Data. Remote Sens. Earth Syst. Sci. 2024, 7, 426–442. [Google Scholar] [CrossRef]
  26. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  27. Akdemir, D.; Sanchez, J.I.; Jannink, J.-L. Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 2015, 47, 38. [Google Scholar] [CrossRef]
  28. Alemu, A.; Åstrand, J.; Montesinos-Lopez, O.A.; y Sanchez, J.I.; Fernandez-Gonzalez, J.; Tadesse, W.; Vetukuri, R.R.; Carlsson, A.S.; Ceplitis, A.; Crossa, J. Genomic selection in plant breeding: Key factors shaping two decades of progress. Mol. Plant 2024, 17, 552–578. [Google Scholar] [CrossRef] [PubMed]
  29. Larrañaga, P.; Bielza, C. Estimation of distribution algorithms in machine learning: A survey. IEEE Trans. Evol. Comput. 2023, 28, 1301–1321. [Google Scholar] [CrossRef]
  30. Hauschild, M.; Pelikan, M. An introduction and survey of estimation of distribution algorithms. Swarm Evol. Comput. 2011, 1, 111–128. [Google Scholar] [CrossRef]
  31. Liu, X.; Li, P.; Meng, F.; Zhou, H.; Zhong, H.; Zhou, J.; Mou, L.; Song, S. Simulated annealing for optimization of graphs and sequences. Neurocomputing 2021, 465, 310–324. [Google Scholar] [CrossRef]
  32. Molina, D.; Poyatos, J.; Ser, J.D.; García, S.; Hussain, A.; Herrera, F. Comprehensive taxonomies of nature-and bio-inspired optimization: Inspiration versus algorithmic behavior, critical analysis recommendations. Cogn. Comput. 2020, 12, 897–939. [Google Scholar] [CrossRef]
  33. Guilmeau, T.; Chouzenoux, E.; Elvira, V. Simulated annealing: A review and a new scheme. In Proceedings of the 2021 IEEE Statistical Signal Processing Workshop (SSP), Rio de Janeiro, Brazil, 11–14 July 2021; pp. 101–105. [Google Scholar]
  34. Henderson, D.; Jacobson, S.H.; Johnson, A.W. The theory and practice of simulated annealing. In Handbook of Metaheuristics; Springer: Berlin/Heidelberg, Germany, 2003; pp. 287–319. [Google Scholar]
  35. Gonzalez, T.F. Handbook of Approximation Algorithms and Metaheuristics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2007. [Google Scholar]
  36. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Figure 1. MGIB problem diagram. A schematic diagram formalizing the MGIB problem as a sequential decision-making cycle, highlighting the core process of resource allocation across generations.
Figure 2. The overall architecture of the ILAS-DQN framework, illustrating the integration of the ILAS module for single-generation optimization and the DQN agent for multi-generation resource allocation.
Figure 3. Workflow diagram of the ILAS algorithm. The diagram illustrates the iterative optimization cycle, which integrates the estimation of distribution algorithm (EDA) for global search and simulated annealing (SA) for local refinement, ultimately outputting the optimal selection and mating decisions.
Figure 4. Quality control statistics for the Cubic dataset. (a) Distribution of missing rates across genotype loci. The red dashed line indicates the 20% filtering threshold. (b) Distribution of missing rates across individuals for phenotype data. The red dashed line indicates the 50% filtering threshold.
Figure 5. Workflow of the multi-generation breeding simulation. The diagram outlines the iterative process from initial population and budget, through selection (ILAS), resource allocation (DQN), reproduction, and population update across T generations.
Figure 6. Comparison of single-generation optimization performance among LAS, LAS_SA, and ILAS. (a) Distribution of the maximum GEBV (G_Max) in the final generation. (b) Distribution of the minimum GEBV (G_Min) in the final generation. (c) Distribution of the mean GEBV (G_Mean) in the final generation. (d) Genetic diversity maintained by each method.
Figure 7. Genetic diversity and genetic gain over 10 generations for three GS methods: (a) genetic diversity in 10 generations; (b) genetic gain in 10 generations.
Figure 8. Evaluation of function approximators for predicting g_t^max in generation 4 (Corn2019 dataset). (a,b) Scatter plots of predicted versus true g_t^max values for (a) RF and (b) DQN. The solid line represents the ideal fit (y = x). (c,d) Residual plots for (c) RF and (d) DQN, showing the difference between predicted and true values across the prediction range.
Figure 9. DQN training convergence analysis across four breeding generations. (a) TD loss versus training iterations. (b) Predicted terminal GEBV versus training iterations, with dotted horizontal lines indicating final simulation outcomes for each generation. (c) Zoom of (a) showing TD loss stability in iterations 4800–5000. (d) Zoom of (b) showing predicted terminal GEBV convergence in iterations 4800–5000. Shaded regions denote standard deviation across three independent training runs.
Figure 10. Cumulative distribution functions of final generation performance metrics. (a) Maximum GEBV (g_T^max) on Corn2019. (b) Top 50 average GEBV on Corn2019. (c) Maximum GEBV on Cubic. (d) Top 50 average GEBV on Cubic. For clarity, each plot displays only four methods with different colors (ILAS-DQN, ILAS-RF, LAS-RF, Uniform).
Figure 11. Statistical charts of the final generation’s g_max and top50_avg. (a) g_max on the Corn2019 dataset; (b) top50_avg on the Corn2019 dataset; (c) g_max on the Cubic dataset; (d) top50_avg on the Cubic dataset.
Table 2. Hyperparameter settings for the DQN agent.

| Hyperparameter | Value | Description |
|---|---|---|
| Network architecture | MLP: 256-128-64-1 | Hidden layer sizes, ReLU activations |
| Input dimension | 1 + k_m + 1 + 1 | Flattened state–action feature vector |
| Optimizer | Adam | Gradient-based optimization algorithm |
| Learning rate (η) | 1 × 10⁻⁴ | Step size for weight updates |
| Discount factor (γ) | 0.99 | Reward discounting for future returns |
| Replay buffer capacity | 1 × 10⁵ | Maximum number of stored transitions |
| Batch size | 64 | Number of transitions sampled per update |
| Target update frequency | Every 1000 steps | Synchronization interval between main and target networks |
| Exploration rate (ε) | Linear decay 1.0→0.01 (5000 steps) | ε-greedy exploration schedule |
| Training rollouts per generation | 300 | Number of complete breeding simulations for data collection |
| Training iterations per generation | 5000 | Gradient updates performed for each generation’s model |
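The settings in Table 2 can be sketched as follows. This is an illustrative NumPy mock-up, not the authors’ implementation: the hidden sizes (256-128-64-1), ReLU activations, and the linear ε schedule (1.0→0.01 over 5000 steps) are taken from the table, while the function names and the example marker count k_m = 10 in the input dimension are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, sizes=(256, 128, 64, 1)):
    """He-initialized weights for an MLP with the Table 2 layer sizes."""
    layers, d = [], in_dim
    for h in sizes:
        layers.append((rng.normal(0.0, np.sqrt(2.0 / d), (d, h)), np.zeros(h)))
        d = h
    return layers

def q_value(layers, x):
    """Forward pass: ReLU on hidden layers, linear scalar Q-value output."""
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = layers[-1]
    return (x @ W + b).item()

def epsilon(step, start=1.0, end=0.01, decay_steps=5000):
    """Linear epsilon-greedy schedule: 1.0 -> 0.01 over 5000 steps, then flat."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

k_m = 10                          # hypothetical marker count (illustration only)
in_dim = 1 + k_m + 1 + 1          # flattened state-action feature vector
net = init_mlp(in_dim)
x = rng.normal(size=in_dim)
print(q_value(net, x))
print(round(epsilon(0), 3), round(epsilon(2500), 3), round(epsilon(5000), 3))
```

In a full training loop the Adam updates, replay buffer, and target-network synchronization from the table would sit around this forward pass; the sketch only fixes the shapes and the exploration schedule.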
Table 4. Sensitivity analysis of preprocessing parameters on CUBIC dataset performance (mean ± SD of g_t^max).

| Configuration | Filtering Thresholds | Imputation Method | g_t^max | % Change from Default |
|---|---|---|---|---|
| Default | Loci: 20%, Ind: 50% | BLUE | 17.95 ± 0.78 | |
| Conservative | Loci: 15%, Ind: 40% | BLUE | 17.68 ± 0.82 | −1.5% |
| Liberal | Loci: 25%, Ind: 60% | BLUE | 18.12 ± 0.85 | +0.9% |
| Mean Imputation | Loci: 20%, Ind: 50% | Mean | 17.59 ± 0.91 | −2.0% |
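The filtering step these thresholds control can be sketched as below. This is a minimal NumPy illustration under our own assumptions (missing genotype calls encoded as NaN, loci filtered before individuals as the Figure 4 ordering suggests, synthetic data, and a hypothetical function name), not the authors’ preprocessing code.

```python
import numpy as np

def qc_filter(G, loci_thresh=0.20, ind_thresh=0.50):
    """Drop loci whose missing rate exceeds loci_thresh, then drop
    individuals whose missing rate over the kept loci exceeds ind_thresh."""
    loci_keep = np.isnan(G).mean(axis=0) <= loci_thresh
    G = G[:, loci_keep]
    ind_keep = np.isnan(G).mean(axis=1) <= ind_thresh
    return G[ind_keep]

# Synthetic genotype matrix: rows = individuals, columns = loci,
# genotype codes 0/1/2, with ~10% of calls set to missing (NaN).
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 500)).astype(float)
G[rng.random(G.shape) < 0.10] = np.nan
print(G.shape, "->", qc_filter(G).shape)
```

The remaining NaNs would then be imputed (BLUE in the default configuration, column means in the last row of the table) before GEBV estimation.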
Table 5. Mean and standard deviation of the population minimum, mean, and maximum values under three selection methods over 10 generations (results based on 300 simulation replicates).

| Method | Min | Mean | Max | Diversity |
|---|---|---|---|---|
| LAS | 2.92 ± 2.50 | 70.70 ± 5.96 | 151.84 ± 4.72 | 1493.66 |
| LAS_SA | 1.54 ± 2.09 | 73.56 ± 7.58 | 155.49 ± 5.14 | 1344.96 |
| ILAS | 2.76 ± 5.12 | 76.16 ± 6.82 | 165.60 ± 6.53 | 1521.28 |
Table 6. Comprehensive statistical comparison of final generation performance across four resource allocation strategies. Values represent mean ± 95% confidence interval, with [median, IQR] in brackets. Statistical significance is indicated relative to ILAS-DQN (* p < 0.05, *** p < 0.001).

| Dataset | Method | g_T^max | top50_avg (g_T^max) | Statistical Significance |
|---|---|---|---|---|
| Corn2019 | Uniform | 151.85 ± 1.37 [151.92, 2.45] | 137.43 ± 2.11 [137.62, 3.86] | p < 0.001 *** |
| Corn2019 | LAS-RF | 151.70 ± 1.47 [151.65, 2.67] | 136.85 ± 1.38 [136.91, 2.51] | p < 0.001 *** |
| Corn2019 | ILAS-RF | 158.93 ± 1.31 [158.88, 2.38] | 157.10 ± 1.23 [157.05, 2.24] | p = 0.012 * |
| Corn2019 | ILAS-DQN | 154.40 ± 1.29 [154.35, 2.35] | 155.85 ± 1.51 [155.90, 2.75] | (reference) |
| Cubic | Uniform | 15.74 ± 0.10 [15.73, 0.18] | 14.88 ± 0.09 [14.87, 0.16] | p < 0.001 *** |
| Cubic | LAS-RF | 15.81 ± 0.11 [15.80, 0.20] | 14.46 ± 0.09 [14.45, 0.16] | p < 0.001 *** |
| Cubic | ILAS-RF | 17.18 ± 0.10 [17.17, 0.18] | 16.98 ± 0.10 [16.97, 0.18] | p < 0.001 *** |
| Cubic | ILAS-DQN | 17.95 ± 0.09 [17.94, 0.16] | 17.74 ± 0.08 [17.73, 0.15] | (reference) |
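The summary format used in Table 6 (mean ± 95% CI half-width, with [median, IQR]) can be reproduced from raw replicate outcomes as in this sketch. The data here are synthetic and the normal-approximation CI (z = 1.96) is our assumption, which is reasonable for the 300 replicates reported.

```python
import numpy as np

def summarize(x):
    """Return (mean, 95% CI half-width, median, IQR) for replicate outcomes x,
    using a normal approximation for the confidence interval (z = 1.96)."""
    x = np.asarray(x, dtype=float)
    half = 1.96 * x.std(ddof=1) / np.sqrt(x.size)
    q25, med, q75 = np.percentile(x, [25, 50, 75])
    return x.mean(), half, med, q75 - q25

# Hypothetical g_T^max outcomes from 300 simulation replicates.
rng = np.random.default_rng(42)
g_max = rng.normal(loc=154.4, scale=11.4, size=300)
mean, half, med, iqr = summarize(g_max)
print(f"{mean:.2f} ± {half:.2f} [{med:.2f}, {iqr:.2f}]")
```

Reporting the median and IQR alongside the mean guards against skewed replicate distributions, which a mean ± CI summary alone would mask.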
Table 7. Performance improvement of ILAS-DQN over baseline methods with 95% confidence intervals.

| Comparison | Dataset | Δ g_T^max (95% CI) | Δ top50_avg (g_T^max) (95% CI) | Practical Significance |
|---|---|---|---|---|
| ILAS-DQN vs. Uniform | Corn2019 | 2.55 (1.83, 3.27) | 18.42 (17.15, 19.69) | Large |
| ILAS-DQN vs. Uniform | Cubic | 2.21 (2.03, 2.39) | 2.86 (2.69, 3.03) | Large |
| ILAS-DQN vs. LAS-RF | Corn2019 | 2.70 (1.98, 3.42) | 19.00 (17.73, 20.27) | Large |
| ILAS-DQN vs. LAS-RF | Cubic | 2.14 (1.96, 2.32) | 3.28 (3.11, 3.45) | Large |
| ILAS-DQN vs. ILAS-RF | Corn2019 | −4.53 (−5.25, −3.81) | −1.25 (−2.52, 0.02) | Large |
| ILAS-DQN vs. ILAS-RF | Cubic | 0.77 (0.59, 0.95) | 0.76 (0.59, 0.93) | Small–Medium |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, M.; Li, Z.; Li, J.; Huang, B.; Niu, X.; Lu, X.; Li, X. A Hybrid Optimization Approach for Multi-Generation Intelligent Breeding Decisions. Information 2026, 17, 106. https://doi.org/10.3390/info17010106


