A Penalized Regression Method for Genomic Prediction Reduces Mismatch between Training and Testing Sets

Genomic selection (GS) is changing plant breeding by significantly reducing the resources needed for phenotyping. However, its accuracy can be compromised by mismatches between training and testing sets, which impact efficiency when the predictive model does not adequately reflect the genetic and environmental conditions of the target population. To address this challenge, this study introduces a straightforward method using binary-Lasso regression to estimate β coefficients. In this approach, the response variable assigns 1 to testing set inputs and 0 to training set inputs. Subsequently, Lasso, Ridge, and Elastic Net regression models use the inverse of these β coefficients (in absolute values) as weights during training (WLasso, WRidge, and WElastic Net). This weighting method gives less importance to features that discriminate more between training and testing sets. The effectiveness of this method is evaluated across eight datasets, demonstrating consistent improvements in terms of the normalized root mean square error. Importantly, the model's implementation is facilitated using the glmnet library, which supports straightforward integration for weighting β coefficients.


Introduction
Genomic selection (GS) is a transformative methodology in plant breeding, offering substantial reductions in phenotyping costs. GS achieves this by using a statistical machine learning algorithm trained on a reference population that includes both phenotypic and genotypic data. The trained model then predicts phenotypic values for a target population based solely on their genotypic data [1]. Consequently, GS is widely adopted in breeding programs, enabling the early selection of candidate lines without the necessity of phenotyping.
GS has been successfully implemented in various crop breeding programs, leading to significant advancements [2]. Notable examples include maize, where GS has enhanced yield, disease resistance, and drought tolerance. In wheat, GS has improved grain yield, disease resistance, and quality traits [3,4]. For rice, GS has increased yield, pest and disease resistance, and environmental adaptability [5,6]. In soybean, GS has boosted yield, disease resistance, and oil/protein content [7,8]. Barley has seen enhanced yield, malting quality, and disease resistance due to GS [9,10]. In cassava, GS has improved yield, disease resistance, and nutritional content [11,12]. Potato breeding has benefited from increased yield, disease resistance, and tuber quality through GS [13,14]. Lastly, GS has improved yield, disease resistance, and fruit quality in tomato breeding. These examples underscore the effectiveness of GS in enhancing key traits across a diverse range of crops.
According to Montesinos-López et al. [1], the successful implementation of GS in crop breeding depends on several critical factors. High-quality and extensive genotypic and phenotypic data are essential for training accurate predictive models. A large, diverse reference population that accurately represents the genetic diversity of the target population enhances model robustness. The choice of advanced statistical models and machine learning algorithms, along with adequate computational resources, also significantly impacts the predictive power of GS. Traits with higher heritability and those influenced by fewer genes are typically predicted more accurately. Environmental factors and genotype-by-environment interactions must be considered to improve prediction accuracy for traits affected by external conditions. Clear breeding objectives, well-designed programs, and cost considerations are crucial for the effective integration of GS. Additionally, the presence of skilled personnel with expertise in quantitative genetics, statistics, bioinformatics, and computational biology is vital. Collaborative efforts among breeding programs, research institutions, and industry stakeholders can provide the necessary resources, data, and expertise, further enhancing the successful application of GS in crop breeding.
A significant mismatch between the reference (training) population and the target (testing) population can substantially undermine the accuracy of GS predictions [15]. This issue arises when the genetic composition, allele frequencies, and linkage disequilibrium patterns differ between the two populations. Such discrepancies can lead to incorrect marker-trait associations and biased predictions. Additionally, genotype-by-environment interactions may not be adequately captured if the populations experience different environmental conditions. Variability in phenotypic expression and genetic backgrounds further complicates prediction accuracy. To mitigate these problems, it is essential to ensure that the reference population is representative of the target population, incorporating diverse genetic backgrounds and environmental conditions. This approach enhances the reliability of GS predictions by better capturing the genetic and environmental complexities present in the target population.
In this context, a method using penalized Lasso, Ridge, and Elastic Net regression [16] is proposed to reduce the mismatch between training and testing sets. This proposed method is evaluated in eight real datasets, each containing multiple environments. The model is implemented in a uni-environment fashion, with each environment containing several families. Given that the genetic material of each family differs, the training set in each dataset consists of all families minus one, and the remaining family is used as the testing set. We compare the proposed method, in which the penalty on each regression coefficient is weighted by the inverse of the absolute value of the corresponding binary-regression coefficient, with conventional penalized methods, assessing prediction performance in terms of the normalized root mean square error. The datasets were obtained from [17] and comprise two crops, Maize and Soybean, which have different numbers of DNA sequences, different numbers of environments (location-year combinations), and different numbers of traits.

Statistical Model
In a general context, we have a covariate vector X = (X_1, . . ., X_p)^T, such as molecular markers, and we want to use this information to predict or explain how this variable affects a real-valued response Y (of size n), for example, grain yield or any other quantitative trait the breeders want to predict. The linear multiple regression model assumes a relationship given by

Y = β_0 + ∑_{j=1}^{p} X_j β_j + ϵ,

where ϵ is a random error with mean 0, E(ϵ) = 0, and is independent of X. This error is included in the model to capture measurement errors and the effects of other unregistered explanatory variables that can help explain the mean response. Then, the conditional mean of this model is E(Y|X) = β_0 + ∑_{j=1}^{p} X_j β_j, and the conditional distribution of Y given X is only affected by the information of X.
To estimate the parameters β_0 and β = (β_1, . . ., β_p)^T, we usually have a set of data (x_i^T, y_i), i = 1, . . ., n, often known as training data, where x_i = (x_i1, . . ., x_ip)^T is a vector of feature measurements and y_i is the response variable measurement corresponding to the i-th individual drawn. In the context of large p and small n, the ordinary least squares method, which does not penalize the parameters, yields estimates that are not optimal. One common method to estimate β_0 and β when p >> n is the Elastic Net regression (ENet) method (1), which consists of taking the (β_0, β) value that minimizes the penalized residual sum of squares [1,16], defined as

PRSS_λ(β_0, β) = ∑_{i=1}^{n} (y_i − β_0 − ∑_{j=1}^{p} x_ij β_j)^2 + λ ∑_{j=1}^{p} [α|β_j| + (1 − α) β_j^2/2],    (1)

where λ ≥ 0 is the regularization parameter, which determines the level or degree to which the β coefficients are shrunk towards zero, and α ∈ [0, 1]. When λ = 0, the ordinary least squares (OLS) estimate is the solution for the β coefficients, but when λ is large, PRSS_λ(β_0, β) is dominated by the penalization term, and the solution shrinks towards 0 [18]. In general, when the number of parameters (p) to be estimated is larger than the number of observations (n), the estimator can be highly variable. In this situation, the intuition of penalized regression is to alleviate this by constraining the β coefficients [19]. Also, it is important to note that when α = 0, the loss function given in (1) reduces to the loss function of Ridge regression (Ridge; [1,16]), while when α = 1, the loss function corresponds to the Lasso regression [16].
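To make the objective in (1) concrete, here is a minimal sketch in Python with scikit-learn. Note the assumptions: the paper itself uses the R glmnet package, and scikit-learn's ElasticNet scales the residual term by 1/(2n), so its `alpha` and `l1_ratio` correspond to the paper's λ and α only up to that convention; the simulated marker data below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 50, 200                      # p >> n, as in typical genomic data
X = rng.standard_normal((n, p))     # stand-in for a marker matrix
beta_true = np.zeros(p)
beta_true[:5] = 2.0                 # only 5 markers carry real effects
y = X @ beta_true + rng.standard_normal(n)

# alpha is the regularization strength (the paper's lambda);
# l1_ratio plays the role of the paper's alpha mixing parameter.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# The mixed L1/L2 penalty shrinks coefficients and sets many exactly to zero.
n_nonzero = np.count_nonzero(enet.coef_)
print(n_nonzero, p)
```

With p = 200 markers and only n = 50 lines, an unpenalized fit would be hopelessly unstable; the penalty keeps the problem well-posed and zeroes out most coefficients, which is exactly the p >> n situation described above.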

Proposed Approach
The proposed method is described below in four steps.

•
Step 1. We assume that we have a uni-environment or single-environment dataset in which there are at least two families, and for each family, there are some lines. Since our goal is to predict a complete family, the information of this family constitutes the testing set {X_tst, y_tst}, and the information of the remaining families is the training set {X_trn, y_trn}. Then, in the original dataset {X_trn, y_trn, X_tst, y_tst}, we remove the original response variable column {y_trn, y_tst} and add a fictitious (new) response variable column {y_f_trn, y_f_tst} that encodes the source of each sample: zero (that is, y_f_trn = 0) for samples (observations) in the training set and one (i.e., y_f_tst = 1) for the samples in the testing set [17]. In other words, the fictitious response variables with 0 s correspond to the remaining families, and those with 1 s correspond to the family we want to predict. Note that because this response variable is a fictitious one, with a training label (y_f_trn = 0) and a testing label (y_f_tst = 1), rather than the original response variable, we avoid any complication in defining these two populations when performing the random cross-validation and also avoid any type of data-leakage confusion.
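As an illustration, the leave-one-family-out split and fictitious labeling of Step 1 might look as follows in Python (the family names and sizes here are hypothetical, purely for demonstration):

```python
import numpy as np

# Hypothetical single-environment dataset: 3 families, 4 lines each.
families = np.repeat(["fam_A", "fam_B", "fam_C"], 4)
held_out = "fam_C"                 # the family we want to predict

is_test = families == held_out     # testing set = the held-out family
X_trn_idx = np.where(~is_test)[0]  # rows of {X_trn, y_trn}
X_tst_idx = np.where(is_test)[0]   # rows of {X_tst, y_tst}

# Fictitious response: 0 for training samples, 1 for testing samples.
y_f = is_test.astype(int)
print(y_f.tolist())
```

The original phenotypes play no role here; only the train/test membership is encoded, which is what allows the binary model in the next steps to measure the mismatch between the two sets.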

•
Step 2. Using the fictitious response variable {y_f_trn, y_f_tst} as the outcome and the markers as predictors, a binary Lasso regression is trained on the whole dataset. From this model, we obtain the estimated coefficients β_j,bin, for j = 1, . . ., p, which quantify how strongly each marker discriminates between the training and testing sets.

•
Step 3. The next step is to compute weights as w_j = 1/abs(β_j,bin), for j = 1, . . ., p, where p denotes the number of markers and β_j,bin is the coefficient of marker j estimated with the binary Lasso regression.
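The binary Lasso fit and the weight computation of Step 3 can be sketched as follows. This is a Python/scikit-learn stand-in for the paper's R/glmnet implementation; the simulated markers and the regularization value `alpha=0.1` are our own assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_trn, n_tst, p = 40, 10, 30
X_trn = rng.standard_normal((n_trn, p))
X_tst = rng.standard_normal((n_tst, p))
X_tst[:, 0] += 3.0   # marker 0 deliberately differs between the two sets

X_all = np.vstack([X_trn, X_tst])
y_f = np.concatenate([np.zeros(n_trn), np.ones(n_tst)])  # fictitious response

# Binary Lasso: regress the train/test indicator on the markers.
beta_bin = Lasso(alpha=0.1).fit(X_all, y_f).coef_

# Step 3: w_j = 1/abs(beta_j_bin).  Markers whose binary coefficient is
# exactly zero receive an infinite weight (an edge case the text leaves
# implicit; glmnet treats an infinite penalty factor as exclusion).
with np.errstate(divide="ignore"):
    w = 1.0 / np.abs(beta_bin)
print(w[0], int(np.isinf(w).sum()))
```

Here marker 0, which the binary model flags as discriminating between the sets, gets a finite weight, while markers the binary Lasso zeroes out get an infinite one.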

•
Step 4. Then, the weights computed in Step 3 are used to train weighted Lasso, Ridge, and Elastic Net regressions. Regarding the weighted Lasso (WLasso), the loss function used to train the models is given as follows:

PRSS_λ(β_0, β) = ∑_{i=1}^{n} (y_i − β_0 − ∑_{j=1}^{p} x_ij β_j)^2 + λ ∑_{j=1}^{p} w_j |β_j|.

Meanwhile, for the weighted Ridge regression (WRidge), the weighted loss function is equal to

PRSS_λ(β_0, β) = ∑_{i=1}^{n} (y_i − β_0 − ∑_{j=1}^{p} x_ij β_j)^2 + λ ∑_{j=1}^{p} w_j β_j^2/2,

and for the weighted Elastic Net regression (WENet), the weighted loss function is equal to

PRSS_λ(β_0, β) = ∑_{i=1}^{n} (y_i − β_0 − ∑_{j=1}^{p} x_ij β_j)^2 + λ ∑_{j=1}^{p} w_j [α|β_j| + (1 − α) β_j^2/2].

To implement these loss functions, we used the glmnet library [16], implemented in R statistical software (v.2024) [20], whose penalty.factor argument applies such per-coefficient weights directly.
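In glmnet, these weighted penalties are supplied through the penalty.factor argument. As a cross-check in Python (scikit-learn has no per-coefficient penalty option), one can use the standard equivalence that a weighted ℓ1 penalty is an ordinary Lasso on columns divided by their weights, with the fitted coefficients rescaled back. This is a sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, w, lam):
    """Minimize ||y - b0 - Xb||^2/(2n) + lam * sum_j w_j |b_j| by rescaling.

    Columns with w_j = inf are excluded (their coefficient is fixed at 0),
    mirroring glmnet's behaviour for an infinite penalty.factor.
    """
    finite = np.isfinite(w)
    X_scaled = X[:, finite] / w[finite]      # x_j -> x_j / w_j
    fit = Lasso(alpha=lam).fit(X_scaled, y)
    beta = np.zeros(X.shape[1])
    beta[finite] = fit.coef_ / w[finite]     # b_j = b~_j / w_j
    return fit.intercept_, beta

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 8))
y = 1.5 * X[:, 0] + 0.1 * rng.standard_normal(60)

w = np.ones(8)
w[3] = np.inf                                # marker 3 fully penalized out
_, beta = weighted_lasso(X, y, w, lam=0.01)
print(beta[3], abs(beta[0]) > 1.0)
```

The design choice is simply a change of variables: dividing column j by w_j and fitting a plain Lasso gives exactly the weighted objective after the coefficients are divided back by w_j.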

How to Avoid Data Leakage
Avoiding data leakage in genomic prediction studies is crucial to ensure the validity and generalizability of a model's performance. Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. This can be avoided by properly separating training sets from testing sets; this means that no data points used for training the model should be included in the testing set, and vice versa.
Furthermore, we use cross-validation techniques where the dataset is split into k subsets, and the model is trained on k − 1 subsets and tested on the remaining subset. This helps in validating the model's performance across different subsets of the data, minimizing the risk of data leakage. In the prediction of longitudinal data, it is important to ensure that the training data precede the testing data chronologically. This simulates real-world scenarios where predictions are made on future, unseen data.
We must ensure that there are no duplicates or near-duplicates between the training and testing sets. In genomic studies, this means carefully managing how genotypes and phenotypes are split, ensuring no overlap. Feature selection should be performed within the cross-validation loop. Performing feature selection before splitting the data can lead to information from the test set leaking into the model, thereby inflating performance metrics. We should use resampling techniques like bootstrapping within the training data only. We must not mix data points from the test set during resampling, as this can cause leakage.
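A leakage-safe leave-one-family-out evaluation, with all data-dependent preprocessing fitted inside each fold, might be organized like this sketch (Python/scikit-learn; the family labels and the Ridge model are hypothetical stand-ins):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 50))
y = X[:, :3].sum(axis=1) + rng.standard_normal(30)
families = np.repeat([0, 1, 2, 3, 4], 6)   # 5 families, 6 lines each

# Scaling (or feature selection) lives inside the pipeline, so it is
# re-fitted on each training fold only -- no test information leaks in.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, groups=families,
                         cv=LeaveOneGroupOut(),
                         scoring="neg_root_mean_squared_error")
print(len(scores))
```

LeaveOneGroupOut yields one fold per family, so each family is held out exactly once, matching the evaluation scheme described in the next section.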

Evaluation of Prediction Accuracy
Our evaluation procedure prevents data leakage by employing a cross-validation approach wherein each iteration excludes one family at a time from the training set, and the entire proposed approach described above is applied. Therefore, data from the testing set do not influence the weightings derived from the training set.
Specifically, in each iteration, data from a single family served as the testing set, while data from the remaining families constituted the training set [1]. The number of iterations corresponded to the total number of families, ensuring that each family was used as the testing set exactly once. This method was implemented to evaluate the model's capability to predict information for an entire family based on data from other diverse families. To compare the predictive performance of the proposed method (WRidge, WLasso, and WENet) with the conventional glmnet method (Ridge, Lasso, and ENet), the normalized root mean square error (NRMSE) was employed as the evaluation metric.
The NRMSE, a measure of prediction error, was calculated by first computing the square root of the mean square error (MSE) and then dividing this value by the average of the observed values in the testing set. The NRMSE is a metric used to evaluate the accuracy of a predictive model, particularly useful when comparing models with different scales. It is a normalized version of the root mean square error (RMSE), which adjusts for the scale of the data, making it easier to interpret and compare across different datasets. An NRMSE of 0 indicates perfect predictions; the further the value is above 0, the worse the model's predictions; and an NRMSE below 1 indicates a good fit, depending on the context and domain. Some advantages of the NRMSE are that it is scale-invariant, as it adjusts to the scale of the data, making it easier to compare across different datasets, and that it has an intuitive interpretation, because it provides a relative measure of the prediction error in relation to the observed values. Using the NRMSE helps provide a more standardized assessment of model performance, making it particularly useful in genomic prediction studies where traits can have varying scales.
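As defined above (RMSE divided by the mean of the observed test values), the NRMSE can be computed as:

```python
import numpy as np

def nrmse(y_obs, y_pred):
    """Root mean square error normalized by the mean of the observations."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))
    return rmse / np.mean(y_obs)

# Perfect predictions give 0; errors increase the value.
print(nrmse([10.0, 12.0, 8.0], [10.0, 12.0, 8.0]))            # 0.0
print(round(nrmse([10.0, 12.0, 8.0], [11.0, 11.0, 9.0]), 3))  # 0.1
```

In the second call the RMSE is 1.0 and the mean of the observed values is 10.0, so the NRMSE is 0.1, i.e., a 10% relative error.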

Results
The results are provided in five sections that contain the results for the datasets Maize_1, Maize_3, Soybean_1, and Soybean_2, as well as across all of these datasets, presented in Figures 1-5 and Tables 2 and 3. The results for the remaining datasets (Maize_2, Maize_4, Soybean_3, Soybean_4) are given in Appendix A, containing Table A1 and Figures A1-A4.

Maize_1
Figure 1 shows the prediction performance in terms of the normalized root mean square error (NRMSE) for Maize Dataset 1. The first three original prediction models (ENet, Lasso, Ridge) were compared with their weighted versions (WENet, WLasso, WRidge) using two traits for these predictions: GDD_ASI and GDD_DTT. In the first comparison, we can observe that, in terms of NRMSE for the GDD_ASI trait, the ENet model, with an NRMSE value of 1.427, was inferior to the weighted WENet model, with an NRMSE value of 1.307, being superior in prediction by 9.185%. Similarly, for the GDD_DTT trait, the WENet model, with an NRMSE value of 1.307, also outperformed the ENet, with an NRMSE value of 1.429, with a difference of 9.358%. In the second comparison, we can observe that, in terms of NRMSE for the GDD_ASI trait, the Lasso model, with an NRMSE value of 1.433, was inferior to the weighted WLasso model, with an NRMSE value of 1.332, surpassing it in prediction by 7.567%. Similarly, for the GDD_DTT trait, the WLasso model, with an NRMSE value of 1.332, also outperformed the Lasso, with an NRMSE value of 1.433, with a difference of 7.564%. The individual values can be observed in more detail in Table 2.
In the third comparison, we can observe that, in terms of NRMSE for the GDD_ASI trait, the Ridge model, with an NRMSE value of 1.371, was inferior to the weighted WRidge model, with an NRMSE value of 1.303, being superior in prediction by 5.171%. Similarly, for the GDD_DTT trait, the WRidge model, with an NRMSE value of 1.303, also outperformed the Ridge, with an NRMSE value of 1.371, with a difference of 5.180%. The individual values can be observed in more detail in Table 2.

Maize_3
In the first comparison (Figure 2), we can observe that, in terms of NRMSE for the GDD_ASI trait, the ENet model, with an NRMSE value of 1.405, was inferior to the WENet model, with an NRMSE value of 1.369, which outperformed it in prediction by 2.647%. Similarly, for the GDD_DTT trait, the WENet model, with an NRMSE value of 1.369, was also superior to ENet, with an NRMSE value of 1.404, with an advantage of 2.549%. The individual values can be observed in more detail in Table 2.
In the second comparison (Figure 2), we can observe that, in terms of NRMSE for the GDD_ASI trait, the Lasso model, with an NRMSE value of 1.409, was inferior to the WLasso model, with an NRMSE value of 1.370, which outperformed it in prediction by 2.866%. Similarly, for the GDD_DTT trait, the WLasso model, with an NRMSE value of 1.370, was also superior to Lasso, with an NRMSE value of 1.407, with an advantage of 2.721% (see Table 2).
For the third comparison displayed in Figure 2, note that regarding the NRMSE for the GDD_ASI trait, the Ridge model, with an NRMSE value of 1.380, was inferior to the WRidge model, with an NRMSE value of 1.331, which outperformed it in prediction by 3.656%. Similarly, for the GDD_DTT trait, the WRidge model, with an NRMSE value of 1.333, was also superior to Ridge, with an NRMSE value of 1.381, with an advantage of 3.598%. The individual values can be observed in more detail in Table 2.


Soybean_1
In the first comparison (Figure 3), it is evident that, in terms of NRMSE for the R8 trait, the ENet model, with an NRMSE value of 1.241, outperformed the weighted WENet model, with an NRMSE value of 1.274. The weighted model, in this instance, fell short in prediction by 2.621%. Similarly, for the Height trait, the WENet model, with an NRMSE value of 1.239, was inferior to ENet, with an NRMSE value of 1.193, exhibiting a shortfall of 3.729%. Detailed individual values can be observed in Table 2.
In Figure 3, it is noticeable that, in terms of NRMSE for the R8 trait, the Lasso model, with an NRMSE value of 1.254, surpassed the weighted WLasso model, with an NRMSE value of 1.255, by 0.04%. Likewise, the weighted WLasso model, with an NRMSE value of 1.227, fell short in prediction for the Height trait, being inferior to the Lasso model, with an NRMSE value of 1.195, by 2.675%. Detailed individual values can be observed in Table 2.
Figure 3 displays that, in terms of NRMSE for the R8 trait, the Ridge model, with an NRMSE value of 1.270, outperformed the weighted WRidge model, with an NRMSE value of 1.284. The weighted model, in this instance, fell short in prediction by 1.112%. Similarly, for the Height trait, the WRidge model, with an NRMSE value of 1.191, was inferior to Ridge, with an NRMSE value of 1.143, exhibiting a shortfall of 4.018%. Detailed individual values can be observed in Table 2.


Soybean_2
In the first comparison (Figure 4), it is evident that, in terms of NRMSE for the R8 trait, the ENet model, with an NRMSE value of 1.554, was inferior to the weighted WENet model, with an NRMSE value of 1.470, which outperformed it in prediction by 5.711%. Similarly, for the Height trait, the WENet model, with an NRMSE value of 1.316, also surpassed ENet, with an NRMSE value of 1.361, exhibiting superiority of 3.427%. Detailed individual values can be observed in Table 2.
In Figure 4, it is noticeable that, in terms of NRMSE for the R8 trait, the Lasso model, with an NRMSE value of 1.559, was inferior to the weighted WLasso model, with an NRMSE value of 1.473, which outperformed it in prediction by 5.828%. Similarly, for the Height trait, the WLasso model, with an NRMSE value of 1.313, also surpassed Lasso, with an NRMSE value of 1.357, exhibiting superiority of 3.317%. Detailed individual values can be observed in Table 2.
Figure 4 displays that, regarding NRMSE for the R8 trait, the Ridge model, with an NRMSE value of 1.545, was inferior to the weighted WRidge model, with an NRMSE value of 1.515, which outperformed it in prediction by 2.013%. Similarly, for the Height trait, the WRidge model, with an NRMSE value of 1.332, also surpassed Ridge, with an NRMSE value of 1.342, exhibiting superiority of 0.682%. Detailed individual values can be observed in Table 2.


Across_Data
In the first comparison (Figure 5), it is evident that, in terms of NRMSE, the ENet model, with an NRMSE value of 1.409, was inferior to the weighted WENet model, with an NRMSE value of 1.378, which outperformed it in prediction by 2.196%. Detailed individual values can be observed in Table 3. Figure 5 shows that the Lasso model, with an NRMSE value of 1.413, was worse than the weighted WLasso model, with an NRMSE value of 1.377, which outperformed it in prediction by 2.630%. Detailed individual values can be observed in Table 3. On the other hand, the Ridge model, with an NRMSE value of 1.387, was inferior to the weighted WRidge model, with an NRMSE value of 1.369, which outperformed it in prediction by 1.309%.
Table 3. Prediction performance across all traits in terms of normalized root mean square error (NRMSE) across the eight datasets. ENet denotes Elastic Net regression, Lasso denotes Lasso regression, and Ridge denotes Ridge regression, while WENet, WLasso, and WRidge denote the weighted counterparts of ENet, Lasso, and Ridge. NRMSE_SD denotes the standard deviation of the NRMSE.

Discussion
There is no doubt that climate change and a growing global population are increasing the pressure to boost productivity worldwide [21]. Therefore, innovative methods like GS are very appealing, as they have the potential to significantly enhance crop productivity on a global scale. For this reason, the adoption of GS is being accelerated in many public and private breeding programs due to its tremendous potential. However, like many new technologies, the practical and successful implementation of GS depends on several factors that impact its accuracy. Consequently, research is being conducted to mitigate these factors affecting its performance. In this study, we propose a method to increase the efficiency of GS in contexts where there is a limited relationship between the training and testing sets.
The core idea of the proposed method is to assign greater weight to features (markers) that less distinctly differentiate between the training and testing sets and, conversely, to assign less weight to those that do. To achieve this, the method first trains a binary classifier (Lasso regression) with a fictitious response variable that assigns a value of 1 to testing set inputs and 0 to training set inputs. From this first step, the β coefficients are obtained and used to compute weights for the second step, which uses the response variable of interest. These weights are calculated as the inverse of the absolute values of the β coefficients from the first stage.
The proposed method demonstrated superior performance across various datasets. Specifically, WENet outperformed ENet by 2.196% in terms of NRMSE. Similarly, WLasso exceeded Lasso by 2.63% in terms of NRMSE. Lastly, WRidge surpassed Ridge with a 1.309% improvement in NRMSE. We acknowledge that the margin of improvement is modest; however, the proposed method provides empirical evidence of its ability to enhance prediction accuracy in contexts with significant mismatches between training and testing sets. Therefore, we encourage further exploration and refinement of this method to boost its predictive power. Since in each environment of all datasets we predict a complete family using the remaining families, we cannot expect good prediction performance when the family to be predicted is genetically distant from those in the training set. Additionally, since families are often genetically distinct from each other, achieving good prediction performance for new families is challenging. Furthermore, predicting a complete family is complicated because markers do not capture all the necessary information for optimal predictions.
Finally, we emphasize that GS is revolutionizing plant and animal breeding by enabling the selection of individuals without the need to measure their phenotypic values. However, its successful implementation faces numerous challenges that must be addressed. This paper underscores the critical importance of understanding that when training and testing materials have significant genetic distance, the predictive quality of GS methodologies deteriorates significantly. Therefore, investigating statistical approaches to mitigate this issue is crucial for enhancing the robustness of GS methodologies.

Why Weighting Markers Might Be Useful for Enhancing Genomic Prediction
The usefulness of weighting markers in genomic prediction can vary depending on the characteristics of the datasets and the traits being studied. There are several reasons why weighting markers might be beneficial for some datasets and traits, but not for others: (1) For traits controlled by many genes with small effects, weighting markers can help identify and prioritize the most influential markers, enhancing prediction accuracy, whereas traits controlled by a few genes with large effects benefit significantly from weighting these key markers, as they have a disproportionate impact on the trait. However, when a trait is controlled by a very large number of genes with very small effects, the benefit of weighting individual markers diminishes, as the effects are too dispersed. (2) When marker density is high, weighting can help manage the large volume of data by emphasizing markers with greater predictive power. (3) Weighting markers in regions with high LD can reduce redundancy and focus on the most informative markers; in populations with clear substructures or relatedness, weighting can help correct for population stratification and relatedness, leading to more accurate predictions.
Furthermore, for traits with significant genotype-by-environment interactions, weighting markers that show consistent effects across environments can improve prediction robustness. For traits with high heritability, certain markers will have strong associations with the trait, making weighting beneficial for capturing this strong genetic signal.
In summary, weighting markers in genomic prediction can enhance model performance by emphasizing the most informative genetic markers, correcting for population structure, and managing high-dimensional data. However, its effectiveness depends on the genetic architecture of the trait, marker density, population characteristics, and environmental stability. Understanding these factors can guide the decision of whether to apply marker weighting in genomic prediction studies.
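As a rough illustration of the weighting idea, the sketch below assumes a logistic-Lasso for the binary training-versus-testing step and emulates glmnet-style per-coefficient penalty factors by rescaling marker columns, since scikit-learn's Lasso has no penalty-factor argument. The +1 inside the weights is our own safeguard against division by zero (most Lasso coefficients are exactly zero) and is not taken from the paper; consult the Methods section for the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = rng = np.random.default_rng(1)

# Toy stand-ins: a training set and a genetically shifted testing set
# (the first 10 markers differ in distribution between the two sets).
X_train = rng.standard_normal((80, 200))
X_test = rng.standard_normal((20, 200)) + np.where(np.arange(200) < 10, 1.5, 0.0)
y_train = X_train[:, :5].sum(axis=1) + rng.standard_normal(80)

# Step 1: binary L1 model -- response is 1 for testing rows, 0 for training rows.
X_all = np.vstack([X_train, X_test])
z = np.r_[np.zeros(80), np.ones(20)]
binary = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X_all, z)
beta_bin = np.abs(binary.coef_.ravel())

# Step 2: weights are the inverse of |beta| (the +1 avoids division by zero).
w = 1.0 / (1.0 + beta_bin)

# Step 3: rescaling column j by w_j is equivalent to a penalty factor
# proportional to (1 + |beta_bin_j|) on coefficient j, so markers that
# separate training from testing sets are shrunk harder and contribute less.
model = Lasso(alpha=0.1).fit(X_train * w, y_train)
pred = model.predict(X_test * w)
```

In glmnet itself, the same effect can be obtained directly through the `penalty.factor` argument of `glmnet()` rather than by rescaling columns.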

Specific Considerations for Genomic Prediction
In the context of genomic prediction, predicting traits based on genetic markers is complex and often challenging. For example, an NRMSE of 0.015 (see Tables 2 and A1) is particularly impressive: it indicates that the model's predictions are very close to the true values, with only a 1.5% error, that is, a very small deviation from the target. The larger and more complex the dataset, the more impressive a low NRMSE becomes. Highly variable traits typically result in higher errors, so a low NRMSE for such traits is significant. Simple models achieving low NRMSEs suggest a robust underlying pattern, whereas complex models might be fitting noise if they overfit. For example, in a genomic prediction study for plant breeding where breeders are predicting crop yield or disease resistance, an NRMSE of around 0.015 means the model's predictions are very accurate relative to the variability in the observed data.
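For concreteness, NRMSE is the RMSE divided by a normalizing scale of the observed values. Assuming normalization by the observed mean (conventions vary; the range or standard deviation of the observations are also used), an NRMSE of 0.015 corresponds to predictions that deviate from the truth by about 1.5% of that scale:

```python
import numpy as np

def nrmse(observed, predicted):
    """RMSE divided by the mean of the observed values (one common
    normalization; the range or standard deviation are also used)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return rmse / observed.mean()

# Predictions uniformly off by 1.5% of the observed scale give NRMSE ~ 0.015.
obs = np.array([100.0, 110.0, 90.0, 105.0])
print(round(nrmse(obs, obs * 1.015), 3))  # -> 0.015
```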

Conclusions
The lack of strong relatedness between training and testing sets is a common issue in the practical implementation of GS, significantly affecting its accuracy. To address this problem, we propose a method aimed at improving accuracy under these conditions. Our findings indicate that the proposed method enhances accuracy when there is a significant mismatch between training and testing sets. However, while the method provides improvement, it does not completely resolve the issue. In addition to addressing this issue, we emphasize the significance of the problem within the GS methodology, as it often compromises its efficiency. We encourage other scientists to enhance the proposed framework to further increase the robustness of the GS methodology. A comparison is presented for the 6 evaluated models (ENet, Lasso, Ridge, WENet, WLasso, and WRidge) and the 2 traits (GDD_ASI and GDD_DTT).

Figure 4
Figure 4 shows that, in terms of NRMSE for the R8 trait, the Ridge model (NRMSE = 1.545) was inferior to the weighted WRidge model (NRMSE = 1.515), which outperformed it by 2.013%. Similarly, for the Height trait, the WRidge model (NRMSE = 1.332) surpassed Ridge (NRMSE = 1.342), exhibiting a superiority of 0.682%. Detailed individual values can be found in Table 2.

Table 1
Table 1 provides details of the 8 datasets included in this study. The data used were obtained from Montesinos-Lopez et al. (2023).

Table 1.
Description of the datasets used to implement the proposed method.

Table 2.
Prediction performance in terms of normalized root mean square error (NRMSE) for datasets Maize_1, Maize_3, Soybean_1, and Soybean_2. ENet denotes Elastic Net regression, Lasso denotes Lasso regression, and Ridge denotes Ridge regression, while WENet, WLasso, and WRidge denote the weighted counterparts of ENet, Lasso, and Ridge. NRMSE_SD denotes the standard deviation of the NRMSE.
