Incorporating Molecular Markers and Causal Structure among Traits Using a Smith-Hazel Index and Structural Equation Models

: The goal in breeding programs is to choose candidates that produce offspring with the best phenotypes. In conventional selection, the best candidate is selected with high genotypic values (unobserved), in the assumption that this is related to the observed phenotypic values for several traits. Multi-trait selection indices are used to identify superior genotypes when a number of traits are to be considered simultaneously. Often, the causal relationship among the traits is well known. Structural equation models (SEM) have been used to describe the causal relationships among variables in many biological systems. We present a method for multi-trait genomic selection that incorporates causal relationships among traits by coupling SEM with a Smith–Hazel index that incorporates markers. The method was applied to ﬁeld data from a Nebraska winter wheat breeding program. We found that the correlation and the relative efﬁciency increased for the proposed Smith– Hazel indices when the total causal information among traits was accounted for by the vector of weights (b), which includes the causal path coefﬁcients in the causal matrix ( Λ ). On the other hand, when selection was based on a primary trait, for example yield, the proposed SI increased the mean yield of the best 28 (Top 10%) genotypes to 7%.


Introduction
Novel food production technologies are needed to tackle the increased demand for food in a world in which population is expected to reach 9 billion people by 2050. The production gains achieved through conventional breeding methods are gradually declining. Producing more food requires the development and implementation of new genetic technologies. The goal of breeding programs is to choose candidates that produce offspring with the best phenotypes. In conventional selection the best candidate is selected with high genotypic values (unobserved) under the assumption that these genotypic values are related to the observed phenotypic values. Selection index (SI) is a tool that helps to select individuals by considering more than two quantitative traits simultaneously. The SI is a linear combination of different traits, in which each trait is weighted according to its importance [1]. Selection indices were first proposed by Smith [2] and Hazel [3] as a linear combination of the phenotypic traits of interest, in which the corresponding weight for each trait is obtained by maximizing the correlation between the phenotypic and the genotypic merits with respect to the weights. This index is called a Smith-Hazel index or an optimum index [4]. To estimate the vector of coefficients for the optimum index, one must know the economic trait for each trait and the phenotypic and genotypic variance-covariance matrices among the traits. Unfortunately, there is no strict rule for setting the economic weights.
There are two general types of applications of SI. The first is single trait improvement, in which there is one trait of interest and other related traits are used to increase the efficiency of selection. The second type is to improve more than one trait simultaneously, which is also called multiple-trait improvement, where the main concern is to assign economic weights for each trait. In addition to the Smith-Hazel multiple trait index, other indices have been proposed, such as the base index, where economic weights are used as index coefficients: the restricted index and the non-weighted multiplicative index [4]. A common application of these indices has been applied to yield and yield components, among other quantitative traits.
The use of molecular markers (MM) in genetic studies has gained enormous attention in recent years, since markers may be used to explain a proportion of the total genetic variance. In this context, incorporating MM into selection indices to improve the estimation of breeding values has been proposed by several authors, either by adding a single-marker [5,6] or by adding k molecular markers [7]. A SI was developed to incorporate k molecular markers as an indirect trait, by regressing the breeding value over the molecular marker values (marker scores). In that SI, the essential steps are: (i) predict the genetic value, known as a marker score, by regressing phenotypes on marker information; and (ii) combining the marker score with phenotypic information and the SI to make the final predictions of the genetic merit [7].
In all the selection indices described above, it is possible to use several traits simultaneously, since they are based on the Smith-Hazel index theory. However, none of them use the causal relationships that exist between the traits. In agriculture, most of the traits of interest are direct or indirect functions of other traits. For example, in plant breeding, grain yield is a function of several intermediate traits. A unidirectional relationship between yield and yield components in small grains was established [8] (Figure 1). The path diagram describes the sequential development of yield components in small grains. For example, the spikes per square meter (SPSM) influence the number of kernels per spike (KPS), which in turn affects the kernel weight (KW), which is also affected by SPSM.
Agronomy 2021, 11, x 2 of 14 an optimum index [4]. To estimate the vector of coefficients for the optimum index, one must know the economic trait for each trait and the phenotypic and genotypic variancecovariance matrices among the traits. Unfortunately, there is no strict rule for setting the economic weights. There are two general types of applications of SI. The first is single trait improvement, in which there is one trait of interest and other related traits are used to increase the efficiency of selection. The second type is to improve more than one trait simultaneously, which is also called multiple-trait improvement, where the main concern is to assign economic weights for each trait. In addition to the Smith-Hazel multiple trait index, other indices have been proposed, such as the base index, where economic weights are used as index coefficients: the restricted index and the non-weighted multiplicative index [4]. A common application of these indices has been applied to yield and yield components, among other quantitative traits.
The use of molecular markers (MM) in genetic studies has gained enormous attention in recent years, since markers may be used to explain a proportion of the total genetic variance. In this context, incorporating MM into selection indices to improve the estimation of breeding values has been proposed by several authors, either by adding a singlemarker [5,6] or by adding k molecular markers [7]. A SI was developed to incorporate k molecular markers as an indirect trait, by regressing the breeding value over the molecular marker values (marker scores). In that SI, the essential steps are: (i) predict the genetic value, known as a marker score, by regressing phenotypes on marker information; and (ii) combining the marker score with phenotypic information and the SI to make the final predictions of the genetic merit [7].
In all the selection indices described above, it is possible to use several traits simultaneously, since they are based on the Smith-Hazel index theory. However, none of them use the causal relationships that exist between the traits. In agriculture, most of the traits of interest are direct or indirect functions of other traits. For example, in plant breeding, grain yield is a function of several intermediate traits. A unidirectional relationship between yield and yield components in small grains was established [8] (Figure 1). The path diagram describes the sequential development of yield components in small grains. For example, the spikes per square meter (SPSM) influence the number of kernels per spike (KPS), which in turn affects the kernel weight (KW), which is also affected by SPSM.  Arrows represent the direction of the influence of variables. The β's represent the path coefficients. The e's represent the error term associated with each trait. Own elaboration based on [8].
Causal relationships among traits have been well established in biology for many years [9]. Specifically, the causal structure among yield components is well known. However, selection is still conducted ignoring this causal structure. Therefore, the purpose of this research was to make use of the Smith-Hazel index and to incorporate structural equation modeling (SEM) theory to improve the predictability of the model for selection response.
Multivariate statistical models (MSM) can handle several variables simultaneously. However, when it is known that there is a causal association among these traits, a MSM does not take advantage of this causal information. In this context, structural equation modeling (SEM) can consider the causal relationships among traits, to improve the predictability of the model.
Structural equation modeling (SEM) is a powerful statistical tool that is useful for understanding complex relationships among variables by considering the causal structure among the traits of interest. SEM models that account for biological causal relationships among traits can be useful in multi-trait selection strategies [10]. SEM is a modern version of path analysis, which was originally proposed by Wright [9] and has been widely applied in the social sciences, but not widely used in the plant sciences. Despite its lack of use in the plant sciences, SEM has potential applications in this field, such as for the analysis of yield components, genetics, and multi-environment studies. SEM was applied to characterize the genetic architecture in multivariate systems modeling the causal relationship among phenotypes [11]. Some applications of SEM in crop sciences have been to explore the causal relationship in grain yield components [12,13], as well as in other areas of plant sciences [14,15]. SEM was used to model genotype using environmental interaction [13,16]. In plant science, a few papers have been published using SEM in the analysis of yield and yield components [12][13][14]17,18].
SEM was used to adapt quantitative genetic models to model causal relationships between phenotypes, and also to show the statistical consequences when the association between two traits was analyzed in terms of standard multi-trait models (MTM), which ignored the causal association among traits [11]. The authors suggested that accounting for the causal information that is presented in many biological systems would be a more realistic way of addressing them [11].
SEM integrates path analysis, a system of simultaneous equations and factor analysis [19]. SEM allows differentiating the effects among variables, into direct and indirect effects. Direct effects are those where one variable directly influences another variable, without intermediate variables. Indirect effects are those where the influence of one variable on another is mediated by one or more variables. In SEM, variables are classified as either exogenous or endogenous; where exogenous variables are independent variables determined outside of the system, and endogenous variables are determined within the system and act as dependent variables.
Causal relationships among traits are present in many biological phenomena [20]; however, neither marker assisted selection (MAS) nor genomic selection procedures account for such causation. In the present research, following Smith-Hazel index theory [2,3], a selection index was developed that takes into account the causal structure among yield component traits by using coupling selection index (SI) theory and structural equation modeling (SEM). We found that when causal information is used in the index, its selection efficiency for finding better genotypes improves.

Path Coefficients
All parameters estimated were assumed to be normal, either due to the normality of the data or large sample sizes. A t test was used to test the null hypothesis that a path coefficient is equal to zero in the population. A path coefficient may be different from zero if its absolute value exceeds 1.96, 2.58, and 3.30 (two-tailed test) at the level of significance p ≤ 0.05, p ≤ 0.01, and p ≤ 0.001, respectively [21]. Using a standardized path coefficient Agronomy 2021, 11,1953 4 of 14 helps make comparisons among them. In the present research, analysis was conducted using PROC CALIS in SAS 9.2 with the ML method.

Causal Coefficients as Economic Weights
When using SI for multiple traits simultaneously, it is common to assign an economic weight to each trait involved in the SI [22]. The vector of economic weights can be set or can be estimated. In this research we used the causal path directly and total effects estimated from the data (see Section 2.6).

Experimental Data
The data consisted of a population of two check varieties and 280 winter wheat lines evaluated at five locations in the state of Nebraska (Lincoln, Mead, McCook, Clay Center and Alliance) during the winter of 2013. The wheat lines were in the F6 generation of the winter wheat UNL program, which were selected based on the experience of the wheat breeder from around 1400 lines in the F5. The experimental design in each location was an augmented incomplete block design, with two replicated check cultivars (Goodstreak and Camelot). There were ten incomplete blocks, each of which consisted of 28 experimental lines and two check varieties. Goodstreak and Camelot are varieties that are well-adapted to different regions of Nebraska. To evaluate grain yield and yield components, we took random samples of 10 spikes per plot, including the new lines and the two checks, making a total of 300 samples per location. The agronomy data used in this research included grain yield (YLD), spikes per square meter (SPSM), kernels per spike (KPS), and kernel weight (KW). Grain yield was measured by using a combine harvester at each plot in each location. Spikes per square meter was estimated based on grain yield, kernel per spike, and kernel weight. Kernels per spike was estimated by the average of counting the number of kernels when threshing 10 spikes. Kernel weight was measured after weighing the total number of seeds of 10 spikes and dividing by the total number of seeds. When conducting a SEM, a sufficient sample size should be 100-200 samples or 5-10 times the number of parameters in the model [23]. The winter wheat data set consisted of a sample size of 1500, which exceeds the minimum recommended. The DNA marker dataset included 231 DArt markers declared significant in a previous analysis [24].

Data Analysis
To estimate the genotypic, phenotypic, and environmental variance-covariance matrices (Σ g ,Σ y ,Σ e ) involved in the SI without molecular markers and the coefficient matrix (Λ), we performed a two-stage analysis. The first stage consisted of the estimation of the variance-covariance matrices using the method of moment estimators based on the sums of squares and cross-products (SSCP) matrices from the multivariate analysis of variance (MANOVA) with the linear model (1). The second stage was performed using MANOVA to fit a linear model, to remove the main effects of environment and blocks for each trait, and modeling the residuals to estimate the coefficient matrix (Λ) [13,16].

Estimation and Covariance Matrices
The linear model to estimate the variance-covariance matrices that are needed for the SI is the model for two-way crossed classification: where Y ijk is the vector for the p traits of the j th genotype in the i th environment for the k th block; µ is the overall mean vector, E i is the vector of main effects of the i th environment; B(E) ki is the vector of effects of the k th block nested in the i th environment; G j is the vector of main effects of the j th genotype; GE ij is the vector of interaction effects between the j th genotype in the i th environment, and ε ijk is the vector of the experimental residual for the j th genotype in the i th environment into the k th block. Since we assumed the basic genetic model that did not contain a GE interaction, we used ε * ijk = (GE) ij + ε ijk in place of the last two terms. It was suggested that since the variation of the interaction is more environmental, it is reasonable to pool the variance of ε ijk (Σ ε ijk ) with the GE interaction variance (Σ gxe ) [25].
The genotype (Σ g ) and environmental (Σ e ) covariance matrices were estimated using the method of moments, based on a multivariate analysis of variance (MANOVA) for balanced data in SAS 9.2 with the procedure PROC GLM [26].
The linear model for estimating the variance-covariance matrices that include molecular markers as a random variable is: where M j is the vector of main effects of the j th molecular marker, and R j is the j th vector of genetic residuals (r), which is part of the genetic variation (g) not explained by the j th molecular marker (m), that is Σ g = Σ * m * Σ r . Each marker had values of −1, 0, and 1. The other parameters are the same as described in Equation (1), and we assumed the genetic residual and the model error were independent and normally distributed. We still assumed the basic genetic model, which did not contain an ME interaction. We pooled the variance with the environmental variance (Σ e ) [25] to make the phenotypic variance-covariance matrix for computing the vector of weights (b).

Estimation of Coefficient Matrix (Λ)
The second stage focuses on the estimation of the coefficient matrix (Λ). In this stage, grain yield and yield components were analyzed using SEM with observable variables by modeling the Y's residuals (YLD R , SPSM R , KPS R , and KW R ). These residuals were obtained by subtracting the main effects of environment and blocks within the environment from the observed values [3].
The estimation of the coefficient matrix (Λ), which accounts for the causal relationship between grain yield and yield components, was conducting using PROC CALIS in SAS 9.2 software with the maximum likelihood estimation method and following the recommended steps [21]. The chi-squared χ 2 test is the most widely used for testing the significance of the difference between sample covariances (Σ) and the predicted covariance (Σ(θ)) matrices. Along with the chi-square test, there are other model-fit criteria, such as GFI, AGF, I and NFI, that are described in the model evaluation section. Non-significant chisquare and values greater than 0.90 for GFI, AGFI, and NFI were used to evaluate the final model.
To conduct the SEM it is necessary to have prior knowledge of the causal relationships among the traits. For this research the unidirectional causal relationship between grain yield and yield components in small grains was used [8] (see Figure 1). The same estimated coefficient (Λ) matrix was used for both causal models, i.e., with and without molecular markers.

SEM Model Methodology
This research was based on the causal unidirectional relationship between grain yield and yield components in small grains [8]. Figure 1 shows the sequential development of yield components, where the later components are influenced by the earlier ones. Using structural equation modeling (SEM) it is possible to estimate the causal relationships present among grain yield and its components for small grains.
In general, the structural equation model with the observed variables can be written as: where Y is a px1 vector of endogenous variables (p = 4): yield (YLD), spikes per square meter (SPSM), kernels per spike (KPS), and kernel weight (KW); B is the pxp coefficient matrix expressing the causal relationship among endogenous variables, which is commonly a triangular matrix with zeros on its diagonal; X is the qx1 vector of exogenous variables; Γ is the pxq coefficient matrix expressing the causal relationship among endogenous and exogenous variables; ζ is the px1 disturbance vector, assumed to have E(ζ) = 0 and a covariance matrix E(ζζ) = Ψ and also assumed to be uncorrelated with exogenous variables. It is also assumed that Λ = (I − B) is nonsingular. For a case in which we only have endogenous variables, based on Figure 1, B can be note that Equation (4) can be written in reduced form, as Estimating parameters in SEM is unlike multiple regression and ANOVA, since the estimated parameters are obtained by minimizing the difference between the sample covariances (Σ) and the predicted covariance (Σ(θ)), where θ is the vector that contains the model parameters (B, Γ and ψ). If the model is correct, the population covariance matrix is equal to the model predicted covariances, Σ = Σ(θ) [19]. In this case the model implied population covariance matrix is based on Y and X, and has the order (p + q) × (p + q), where Each submatrix of the model's implied covariance matrix (Σ(θ)) can be obtained as The parameter vector θ is estimated by minimizing the distance between the model's implied covariance matrix Σ(θ) and the observed covariance matrixΣ using a criterion of "closeness".

SEM Model Evaluation
There are several statistics used to evaluate the model fit, which consists of measuring the validity of the hypothesis that Σ = Σ(θ). The fundamental hypothesis for the SEM is that the matrix of covariance of the observed variables is a function of a set of parameters [19]. If the model is correct, and if the parameters are known, the population covariance matrix will be exactly reproduced. Where Σ is the population covariance, and Σ(θ) is the covariance matrix as a function of the model parameters (θ). Some of these statistics are: chi-square statistics χ 2 , goodness of fit index (GFI), adjusted goodness of fit index (AGFI), and root mean square error of the approximation (RMSEA). The root mean square error of approximation (RMSEA) is one of the most informative fit indices, due to its sensibility to the number of estimated parameters in the model; the range of the RMSEA is between 0.05 to 0.10, where a value less or equal to 0.05 shows a good fit and values above 0.10 indicate a poor fit [27].

Model Comparison
One selection index is preferred to another if it improves selections in some sense. To compare if coupling SEM with the Smith-Hazel index improved the selection efficiency, we used two criteria: Relative efficiency (RE) of the selection index. This is expressed as the ratio of the correlation between the selection index and the breeding value (ρ H I ) for two different selection indices [22].
Mean square error of prediction. This is a criterion used as a way to evaluate and compare SI's [28]. When the correlation is large between H and I, this means that the index (I) will better predict the breeding value (H); that the mean squared error of the breeding prediction will be small and the effectiveness of the prediction of the SI will be greater. The term σ 2 H 1 − ρ 2 H I was called the mean square error of prediction, and the ratio H I was considered the effectiveness of I in predicting H [29]. Therefore, the SI will be more effective for predicting H when ρ 2 H I is large.

Model Validation
To validate the proposed selection indices with causal structure a simulation study was conducted. The process consisted of simulating grain yield (YLD) and yield components (SPSM, KPS, and KW) with a recursive structure of a population of 282 winter wheat genotypes as fixed effects, with 5 environments, 2 replications, and residual as random effects. It is important to point out that the true population consisted of 282 wheat genotypes tested in 5 environments. The means of the 282 true genotypes for each trait for all 5 environments were standardized and used as true fixed effects for genotypes in the simulation model. The causal relationships among traits were taken into account in the simulation program by including the estimated coefficients as the true path coefficients from the true population (see Figure 1). All responses were simulated as multivariate normal random variables using the rnorm function in R. A total of 500 data sets were simulated. The validity of the simulation was assessed using the correlation between the real yield data and the simulated yield data.

Models Developed Coupling Causal Structure and MM into the Smith-Hazel Index
Coupling SEM to the Smith-Hazel index was developed in two scenarios. The first scenario, named model 1, is the Smith-Hazel index with causality among traits ( Figure 2). The second scenario has both the molecular markers and the causality (Figure 3). error of approximation (RMSEA) is one of the most informative fit indices, due to its sensibility to the number of estimated parameters in the model; the range of the RMSEA is between 0.05 to 0.10, where a value less or equal to 0.05 shows a good fit and values above 0.10 indicate a poor fit [27].

Model Comparison
One selection index is preferred to another if it improves selections in some sense. To compare if coupling SEM with the Smith-Hazel index improved the selection efficiency, we used two criteria: Relative efficiency (RE) of the selection index. This is expressed as the ratio of the correlation between the selection index and the breeding value ( ) for two different selection indices [22].
Mean square error of prediction. This is a criterion used as a way to evaluate and compare SI's [28]. When the correlation is large between H and I, this means that the index (I) will better predict the breeding value (H); that the mean squared error of the breeding prediction will be small and the effectiveness of the prediction of the SI will be greater. The term (1 − ) was called the mean square error of prediction, and the ratio = 1 − was considered the effectiveness of I in predicting H [29]. Therefore, the SI will be more effective for predicting H when is large.

Model Validation
To validate the proposed selection indices with causal structure a simulation study was conducted. The process consisted of simulating grain yield (YLD) and yield components (SPSM, KPS, and KW) with a recursive structure of a population of 282 winter wheat genotypes as fixed effects, with 5 environments, 2 replications, and residual as random effects. It is important to point out that the true population consisted of 282 wheat genotypes tested in 5 environments. The means of the 282 true genotypes for each trait for all 5 environments were standardized and used as true fixed effects for genotypes in the simulation model. The causal relationships among traits were taken into account in the simulation program by including the estimated coefficients as the true path coefficients from the true population (see Figure 1). All responses were simulated as multivariate normal random variables using the rnorm function in R. A total of 500 data sets were simulated. The validity of the simulation was assessed using the correlation between the real yield data and the simulated yield data.

Coupling Causal Structure and MM into the Smith-Hazel Index
Coupling SEM to the Smith-Hazel index was developed in two scenarios. The first scenario, named model 1, is the Smith-Hazel index with causality among traits ( Figure 2). The second scenario has both the molecular markers and the causality ( where the subscript c stands for causality. Therefore, equation 16 is the expression that couples SEM and the Smith-Hazel index. Coupling 2: Smith-Hazel index with molecular markers and causality among traits From the path diagram shown in Figure 3 we can see that the genotypic value (g) can be expressed as = Γ + , where m represents the vector of molecular markers and r the vector of the portion of the genetic value that it is not explained by molecular markers, and Γ is a (p x m) matrix where p is the number of traits and m is the number of markers.
The model in Figure 3, can be represented using a SEM with observed variables. The structural model can be represented as where is a 4x1 vector of endogenous variables: YLD, SPSM, KPS, and KW; Λ = (I − ) is the 4 x 4 nonsingular matrix that includes the coefficient matrix ( ) expressing the causal relationship among endogenous variables; is the (m x 1) vector of molecular markers; is the (p x m) coefficient matrix, where p is the number of traits; is the (4 x 1) residual vector assumed to have E( ) = 0 and uncorrelated with .
The reduced form of the model can be written as now define = * + E * , where * = Λ Γ + Λ and E * = Λ E with the index I = , and the breeding value H = * .
Following the Smith-Hazel theory for maximizing the vector of weights (b), we need to maximize the correlation between H and I. Assuming independence among m, r, and E. Coupling 1: Smith-Hazel with causality among traits The model in Figure 2 can be represented by using structural equation modeling with observed variables [10,11,30].
where y is a (p × 1) vector of phenotypic values, g is the additive genotypic value, E is the vector of environmental effect, and Λ is a (p × p) matrix of structural coefficients, accounting for the causal relationship between the p traits. Model 9 can be written in a reduced form as: Defining the basic genetic model as y = g * + E * where g * = Λ −1 g and E * = Λ −1 E, and where g * and E * are the genetic and the environmental effects after accounting for causal structures.
From Figure 2, the index is I = b y, and the merit is H = θ g * . Following the Smith-Hazel theory for maximizing b, i.e., we need to maximize the correlation between H and I (ρ HI ). By assuming an independence between g and E, we can compute ρ HI = Cov(H,I) √ Var(H)Var(I) whose corresponding variances and covariance are: then, substituting the corresponding variances and covariances in ρ HI we end up with the following equation since θ is a vector of constants, we only need to maximize the following equation Agronomy 2021, 11,1953 9 of 14 taking derivatives of this expression with respect to b and set to zero, we end up with the optimal weights being expressed as Finally, the Smith-Hazel index that accounts for the causal relationship between traits is where the subscript c stands for causality. Therefore, Equation (16) is the expression that couples SEM and the Smith-Hazel index. Coupling 2: Smith-Hazel index with molecular markers and causality among traits From the path diagram shown in Figure 3 we can see that the genotypic value (g) can be expressed as g = Γm + r, where m represents the vector of molecular markers and r the vector of the portion of the genetic value that it is not explained by molecular markers, and Γ is a (p × m) matrix where p is the number of traits and m is the number of markers.
The model in Figure 3, can be represented using a SEM with observed variables. The structural model can be represented as where y is a 4 × 1 vector of endogenous variables: YLD, SPSM, KPS, and KW; Λ = (I − B) is the 4 × 4 nonsingular matrix that includes the coefficient matrix (B) expressing the causal relationship among endogenous variables; m is the (m × 1) vector of molecular markers; Γ is the (p × m) coefficient matrix, where p is the number of traits; r is the (4 × 1) residual vector assumed to have E(r) = 0 and uncorrelated with m.
The reduced form of the model can be written as now define y = g * + E * , where g * = Λ −1 Γm + Λ −1 r and E * = Λ −1 E with the index I = b y, and the breeding value H = θ g * . Following the Smith-Hazel theory for maximizing the vector of weights (b), we need to maximize the correlation between H and I. Assuming independence among m, r, and E.
Cov(H, I) = Cov θ g * , b y = θ Cov(g * , y)b since θ is a constant, we only need to maximize taking derivatives this expression with respects to b and set equal to zero, we end up with the optimal weights expressed as Finally, the Smith-Hazel index with molecular markers and causality among the traits is Table 1 shows the models developed considering only causality and both causality and molecular markers. Table 1. Different expressions of the selection index considering molecular markers and causality among traits.

Model Selection Index (I)
Classical Smith-Hazel Figure 4 and Table 2 display the path coefficients from the final model, which fit well since the model is exactly identified; i.e., the total number of parameters estimated (t = 10, 6 path coefficients and 4 errors) is equal to the number of data points p(p+1) 2 = 10 . The six coefficients were all significant, p ≤ 0.05. Since standardized data were used in the analysis, direct comparisons among grain yield and yield components coefficients were possible, and it was easy to understand how they impacted the yield. For example, the direct positive effect of SPSM on yield (0.81) was greater than the indirect effect on KW (−0.347), which means that the net effect of increasing SPSM would be to increase YLD. Note that, the indirect effects are calculated by multiplying the path coefficients of each path of the associated variable to the dependent variable. For instance, the indirect effect of KPS on YLD follows the path KPS→KW→YLD is (−0.25) (0.17) = −0.043. The total effect is just the sum of direct and indirect effects. In the case of SPSM, which has three indirect paths, the indirect effect is the summation of the indirect effects. Spikes per square meter showed the biggest effect (0.81) on grain yield, followed by kernel per spike (0.52) and kernel weight (0.17). That is, the effect of increasing SPSM would increase YLD more than if we increased the effect of KW on YLD.

Smith-Hazel with markers and causality
Agronomy 2021, 11, x 11 of 14 the indirect effects are calculated by multiplying the path coefficients of each path of the associated variable to the dependent variable. For instance, the indirect effect of KPS on YLD follows the path KPS→KW→YLD is (−0.25) (0.17) = -0.043. The total effect is just the sum of direct and indirect effects. In the case of SPSM, which has three indirect paths, the indirect effect is the summation of the indirect effects. Spikes per square meter showed the biggest effect (0.81) on grain yield, followed by kernel per spike (0.52) and kernel weight (0.17). That is, the effect of increasing SPSM would increase YLD more than if we increased the effect of KW on YLD.  An important tool for model comparison is the relative efficiency (RE), which increased when causality was considered over the same SI without causality. For example, the RE for comparing S-H with and without causality was = = = 120, this means that S-H causality is 20% more efficient than S-H classic for  An important tool for model comparison is the relative efficiency (RE), which increased when causality was considered over the same SI without causality. For example, the RE for comparing S-H with and without causality was RE = SI in the column SI in the row = S−H causality S−H classic = 120, this means that S-H causality is 20% more efficient than S-H classic for predicting the breeding value (BV). Moreover,RE = SI in the column SI in the row = S−H classic S−H causality = 84, which means that S-H classic is less efficient than S-H causality for predicting the breeding value (BV). Table 3 shows the best ten percent genotypes under different selection indices. One was based on yield, while the other rankings were based on the different selection indices. Note that the S-H index with causality was one of the closest to the yield selection, this is possible when total causal information is used as the economic weights for the index. The validity of the simulation was assessed through the correlation between the real yield data and the simulated yield data. The correlation over the 500 simulated data sets was 0.94, indicating the simulation accurately captured the structure of the real data. To validate the proposed selection index, we performed four kinds of analyses for the simulated data, comparing with the selection index without causal structure: (1) We computed the relative efficiency (RE) for model comparison between the indices; the results showed that the S-H causality increased by 12% over S-H classic.
(2) We examined how many of the true best 28 (Top 10% based on yield) genotypes based on true indices were identified under the simulation; the result showed a 39% coincidence with respect to the genotypes selected based on yield, while with S-H classic this was 18%.
In general, the results showed that when a causal relationship among traits is accounted for, the SI and the RE increased, and the standardized mean yield of the best 10% genotypes increased with respect to the SI without causality.

Discussion
We found that the direct effects followed the same pattern reported by other researchers [8,13,16]; in which SPSM had the biggest effect on yield, followed by KPS and KW (Table 2). On the other hand, the indirect effects were all negative among yield components (see Figure 1 and Table 2). The correlations between the SI (I) and the breeding value (H) increased when we considered the causal relationship between grain yield and yield components in the case of the Smith-Hazel indices. For example, the correlation for the S-H causality index (0.47) was larger than the S-H classic index (0.39). When the correlation between the SI (I) and the breeding value (H) is large, then the SI will be more effective for predicting H [29].
Since yield and yield components have a causal mechanism underlying the biological processes, it may be reasonable to use these causal coefficients as economic weights [31,32]. The result of using causal path effects as economic weights improves the ability of the Sis, increasing the mean of the selected genotypes with respect to using the same economic weight for each trait. This result confirms that taking the causal biological relationship among certain traits into account can help select promising genotypes.
The idea of using causal information among traits with SI has been suggested by other authors [11,[30][31][32]. Direct causal effects as economic weights were used for improving grain shape and grain yield in rice [31,32]. The authors concluded that using direct path coefficients as economic weights for secondary traits and the economic weight of one for the primary trait was an effective criterion of selection for improving primary traits. Similarly, we used as a vector of economic weights the total causal effects, which not only accounts for direct path coefficients but also for indirect effects. Using total causal effects as a vector of economic weights, we showed that the S-H with causality index increased the mean yield of the best 28 genotypes, and the number of matches between this index and yield per se, more than the other SIs.
Ignoring the causal association among traits leads to a loss of valuable information [8,12,13,15,16,18]. When causal structure among yield and yield components is captured in the index, the correlation improved between the index and the BV and the RE. In addition, it is important to point out that when comparing two indices without and with the causal structure among traits, the index that accounts for causal relationship increased the mean for yield of the selected genotypes. Accounting for causality in the index could be used to help breeders with the selection of promising genotypes, because this method of selection considers the recursive relationship among traits.

Conclusions
The specific purpose of this research was to improve the ability of the selection index for identifying promising genotypes by accounting for the causal structure among yield and yield components in winter wheat. The results from the true data showed that the most important findings were: First, the causal Smith-Hazel indices improved the relative efficiency with respect to those that did not incorporate causal information. Second, the selection indices may help when selecting for a primary trait. Third, using total causal effects as a vector of economic weights we showed that the S-H with causality index increased the mean yield of the best 28 genotypes, and the number of matches between this index and yield per se more, than the other SIs.
The proposed multi-trait selection indices can account for the causal structure among traits when there is a prior knowledge of the causal relationships. These indices provide certain advantages over the classic Smith-Hazel index by improving the correlation between the index and the breeding value, the relative efficiency, and the mean-yield of the selected genotypes when selection is based on a primary trait. In addition, the contributions of yield components to grain yield when the selection index accounts for the causal mechanism can be seen through the increase in the mean yield.
The results indicated that among the evaluated indices, the S-H causality index is recommended for improving yield when no marker information is available. In addition, selecting for a primary trait using total causal effects as economic weights for yield contributors and one for the trait of interest should serve as an effective selection criterion for improving grain yield.