Inbreeding and Genetic Erosion from a Finite Model of a Synthetic Formed with Single Crosses

When seed produced by a single-cross (SC) maize hybrid is sown, the resulting grain yield is usually lower than that of the hybrid because of the inbreeding generated. However, if seed from a mixture of s hybrids were sown instead, the synthetic variety thus formed (SynSC) would have a lower inbreeding coefficient (FSynSC) and a higher grain yield. The number of parent SCs (s), the finite number of representatives of each parent SC (m) and the inbreeding coefficient of the parent lines of the SCs (F) are related to FSynSC. In addition, randomness and the finite size of m can cause the loss of genes and genotypes and increase FSynSC. The objectives of this study were to derive formulas for (1) expressing FSynSC in terms of m, F, and s, and (2) calculating the probability of gene and genotype loss. It was found that for the probability of no genotype being missing from the progeny representing a parent to be at least 0.95, it is necessary that m ≥ 15. It was also found that a sample size of 7 is sufficient for FSynSC to stabilize, more visibly when F is larger, and for the probability of genetic erosion to be practically zero.


Introduction
Synthetic varieties (SVs) of maize (Zea mays L.) are usually highly heterogeneous and heterozygous populations that adapt even to unfavorable environmental conditions. For example, Farid et al. [1] worked on the breeding of maize in drought conditions and developed two SVs that yielded at least 6.0 t ha⁻¹. On the other hand, Badu-Apraku et al. [2], after five selection cycles of S1 families for drought resistance from an SV, obtained a genetic gain of 423 kg ha⁻¹ cycle⁻¹.
However, since these varieties result from the random mating of a population of finite size, they may include progenies generated by mating between related individuals. This type of progeny usually shows reduced vigor [3,4]. For example, inbreeding can reduce seed yield in onions (Allium cepa L.) [5] and grain yield in maize [6].
Commonly, a maize SV originates from the random mating of a set of selected lines. However, it is not common for good-quality lines to be available for the development of SVs oriented towards farmers who are unlikely to have the economic ability to regularly acquire seeds of hybrid varieties. Sometimes, when farmers sow a hybrid variety in one growing season, in the next season, they sow seeds produced by that variety, with a consequent reduction in the means of variables such as grain yield due to the inbreeding produced in one generation.
This effect of inbreeding could be attenuated if, instead of sowing seeds harvested from a single hybrid, farmers sowed the seeds harvested from a stand grown from a mixture of two or more hybrids. Villanueva et al. [7], from the results of a diallelic experiment on nine commercial hybrids, found that two crosses between hybrids could be successfully used as SVs. The idea of generating an SV whose parents are several hybrids has generated interest, along with the need to conduct the research necessary to answer the questions that arise in this context.
The relationship between lines, hybrids and SVs has been studied for a long time. According to Wright's formula [8], the behavior of each synthetic that can be formed with p lines can be estimated with the data from a complete diallelic experiment comprising those p lines. The behavior of any synthetic formed by a subset of p′ lines (p′ ≤ p) is estimated with the experimental data of the p′ lines and their p′(p′ − 1) single crosses. On the other hand, the open-pollinated varieties developed by the International Maize and Wheat Improvement Center (CIMMYT) for their yield potential and tolerance of adverse biotic and abiotic factors are actually synthetics whose parents are lines resulting from the process of hybrid-variety development [9].
The information generated by a diallelic experiment, in addition to helping to predict the yield of each SV that could possibly be formed with the p lines, can contribute to determining the optimal number of parents that a synthetic variety should have; moreover, it allows the identification of single crosses of good quality and outstanding parents per se [10].
The information required in terms of the optimal number of lines for the development of synthetic varieties is not, however, the only useful information of this type. Until recently, little attention was paid to the question concerning the number of plants that should represent each parent of a synthetic variety [11,12].
This number (m) is important because it can affect the inbreeding coefficient and the probability of gene loss in the synthetic variety. On this topic, the most common approach has been to proceed as if the number of representative plants used were of a sufficient size, so that the frequencies of the progeny of each parent were in accordance with the Hardy-Weinberg law. The difficulty that arises in this context is how to properly interpret what "sufficient size" means.
Evidently, the inbreeding coefficient (IC) and the stability of the genotypic array of a population that reproduces by random mating in successive generations can be affected by the randomness of the genetic mechanism operating during gametogenesis when the number of representatives of each parent is finite and their genotypes are heterozygous.
In a study on an SV formed by s single crosses (SynSC) that in turn were generated with 2s unrelated lines with an inbreeding coefficient equal to F, Escalante-González et al. [11] considered that each parent is represented by m plants and that the number of non-identical-by-descent (NIBD) genes that they contain is a random variable. To this end, the authors derived its mean and variance. However, they did not study the relationship of the number of representatives of each SynSC parent (m) with the probabilities of gene and genotype loss and the magnitude of the SynSC inbreeding coefficient (FSynSC). In the context of the Hardy-Weinberg law, there must be a sample size of the progeny of each parent from which changes in the SynSC inbreeding coefficient and the probability of genotype and gene loss are negligible. The objectives of this research were to (1) derive FSynSC in terms of m, s and F, and (2) calculate the probability of losing one or more genes and genotypes when each parent is represented by a finite number of plants. With this information, the aim is to determine the minimum sample size necessary for FSynSC to stabilize and for the probability of losing one or two genes to be negligible.
In this model, m is the sample size of the progeny of each single cross, x is the largest non-negative integer such that 4x ≤ m, and e is the number of plants that failed to form a complete group of four (m = 4x + e; e = 0, 1, 2, 3). The synthetic variety whose parents are s single crosses (SynSC) was visualized as the population generated by the random mating of the ms plants representing the s parents. As this mating in turn involves the random mating of any subset of the ms plants, particularly of the m plants of a single cross, the derivation of the inbreeding coefficient of SynSC (FSynSC) was based on the ICs of the progenies produced by the random mating of the m plants of a parent. This mating in turn involves the random mating of each complete group of four plants whose genotypes are those that form the genotypic array generated by the single cross. Consequently, if the lines that are parents of the single cross are A1A2 and B1B2, the sampled population is formed by the genotypes of the genotypic array produced by the single cross A1A2 × B1B2 (GEA). This is:

$$\mathrm{GEA} = \tfrac{1}{4}A_1B_1 + \tfrac{1}{4}A_1B_2 + \tfrac{1}{4}A_2B_1 + \tfrac{1}{4}A_2B_2 \qquad (1)$$

The basic results used to derive the formula for FSynSC were obtained from the progenies produced by the random mating of the genotypes that form the GEA (Equation (1)). These results are shown in Table 1. In this derivation, the inbreeding of the progeny produced by the random mating of these four plants was considered to have two sources: (1) four selfings and (2) 4 × 3 = 12 intragroup crosses, of which six were direct and six were reciprocal (Table 1).

Table 1. Average inbreeding coefficients of progenies produced by random mating of a group of 4 plants whose genotypes are those generated by the single cross A1A2 × B1B2 (Equation (1)). The inbreeding coefficient of the lines is F and they are unrelated.

The contributions to inbreeding by the selfings and intragroup crosses were, on average, 1/2 and (1 + 2F)/6, respectively (Table 1). The following also contributed to inbreeding: (a) the 16x(x − 1) crosses between each of the plants in a group of four and each of those in another group of four from the same parent (the results can be presented in complete tables, as in Table 1); and (b) selfings and crosses involving the e plants that failed to complete a group of four. The latter included e(e − 1) intragroup crosses, both direct and reciprocal, along with the 8xe intergroup crosses between the e plants that failed to complete a group and the 4x plants from the complete groups (for each of the e plants, 4x direct and 4x reciprocal). The selfings of these e plants were included in those of the m plants representing a parent. The above considerations allowed us to arrive at the formula for FSynSC (Equation (2)). This formula is general: it can be used to determine the exact value of FSynSC for any combination of values of m, F and s, and it is the only formula that has this quality. The previously derived formulas for the IC of a SynSC [13] correspond to "large samples".
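The average contributions quoted above (1/2 for selfings and (1 + 2F)/6 for intragroup crosses) can be checked with a short sketch. It assumes the single-locus model of this study: unrelated lines and P(A1 ≡ A2) = P(B1 ≡ B2) = F. The inbreeding coefficient of the progeny of a mating is taken here as the average probability of identity by descent between a gene drawn from one parent and a gene drawn from the other. The function names (`gene_ibd`, `coancestry`, `group_of_four_averages`) are illustrative and are not taken from the original derivation.

```python
from itertools import product

# Genotypes of the genotypic array of the single cross A1A2 x B1B2.
GENOTYPES = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]


def gene_ibd(g, h, F):
    """Probability that genes g and h are identical by descent (assumed model)."""
    if g == h:
        return 1.0   # the same gene
    if g[0] == h[0]:
        return F     # the two genes of one line, e.g. A1 and A2
    return 0.0       # genes from unrelated lines


def coancestry(p, q, F):
    """Expected inbreeding coefficient of the progeny of the mating p x q."""
    return sum(gene_ibd(g, h, F) for g, h in product(p, q)) / 4.0


def group_of_four_averages(F):
    """Average over the 4 selfings, over the 12 crosses, and over all 16 matings."""
    selfings = [coancestry(p, p, F) for p in GENOTYPES]
    crosses = [coancestry(p, q, F) for p in GENOTYPES for q in GENOTYPES if p != q]
    overall = (sum(selfings) + sum(crosses)) / 16.0
    return sum(selfings) / 4.0, sum(crosses) / 12.0, overall


for F in (0.0, 0.5, 1.0):
    print(F, group_of_four_averages(F))
```

Under these assumptions, the three averages returned for a given F should equal 1/2, (1 + 2F)/6 and (1 + F)/4, respectively.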
The magnitude and significance of the precision with which the inbreeding coefficient is expressed must be considered. In the development of a SV, in addition to offering objective guidance for determining the sample size (m) of each parent, these factors are related to the means of quantitative characters of economic importance. In maize, for example, they relate to grain yield, stubble yield, etc. [6].
Based on Equation (2), the inbreeding coefficients of the progeny of a single cross were calculated for the 120 combinations of 24 values of m (1, 2, 3, . . . , 24) and 5 of F (0.000, 0.500, 0.750, 0.875, and 1.000). The results are shown in Table 2.

Table 2. Inbreeding coefficients of the progenies produced by random mating of the m representatives of a single cross. The parent lines of the cross are unrelated and their inbreeding coefficient is equal to F. The calculation was performed with the formula of Equation (2).

Among these results, the following stand out. Whenever m is a multiple of 4 (4, 8, 12, 16, . . . ), the inbreeding coefficients of the progeny produced by the random mating of the m plants are equal to 0.250, 0.375, 0.4375, 0.4687, and 0.5000 when the inbreeding coefficients of the parent lines of the single cross (F) are 0.000, 0.500, 0.750, 0.875, and 1.000, respectively. Furthermore, as m increases, the values of FSynSC tend towards those obtained when m is a multiple of 4, and they do so faster as F increases. For x = 1, 2, . . . , 6, the values of FSynSC corresponding to sample sizes 4x + 1, 4x + 2 and 4x + 3 are greater than those of 4x and 4(x + 1), but decrease as m increases. In addition, whenever m = 1 or F = 1, FSynSC = 0.5000.
Whenever m > 1, a positive relationship between the values of F and those of FSynSC is observed.
The explanation for the results in Table 2 is obvious in some cases. For example, when m = 1, the "intragroup" progeny reproduces through the fertilization of a single plant whose genotype is made up of two NIBD genes and whose progeny must have an IC equal to 0.5, regardless of the value of F. Moreover, when F = 1, each parent single cross is represented by m plants with the same genotype formed by two NIBD genes. Thus, intragroup random mating generates a progeny whose genotypic array is the same as that produced by a single plant by selfing and which, consequently, has an IC equal to 0.5, regardless of the value of m.
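The pattern described for Table 2 can be explored with a minimal sketch that adds up the contributions listed in the text. This is a reconstruction under stated assumptions, not the published Equation (2): (i) each selfing contributes 1/2; (ii) crosses within a complete group, and among the e leftover plants, contribute (1 + 2F)/6 on average, which presumes that the leftover plants have different genotypes; (iii) crosses between a plant and the plants of another complete group contribute (1 + F)/4 on average; and (iv) only the fraction 1/s of the matings among the ms plants is intraparental. The names `progeny_ic` and `syn_ic` are hypothetical.

```python
def progeny_ic(m, F):
    """Inbreeding coefficient of the progeny of m plants of one single cross
    (sketch under the assumptions stated in the lead-in)."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    x, e = divmod(m, 4)                 # complete groups of four, leftover plants
    self_avg = 0.5                      # average contribution of a selfing
    cross_avg = (1 + 2 * F) / 6         # average contribution of an intragroup cross
    group_avg = (1 + F) / 4             # average contribution of an intergroup cross
    total = (
        m * self_avg                        # m selfings
        + 12 * x * cross_avg                # crosses within each complete group
        + 16 * x * (x - 1) * group_avg      # crosses between complete groups
        + e * (e - 1) * cross_avg           # crosses among the leftover plants
        + 8 * x * e * group_avg             # leftover plants x complete groups
    )
    return total / m ** 2                   # m^2 equally likely matings


def syn_ic(m, F, s):
    """Assumed scaling to s parents: interparental crosses add no inbreeding."""
    return progeny_ic(m, F) / s


for m in (1, 2, 3, 4, 5, 7, 8, 15):
    print(m, [round(progeny_ic(m, F), 4) for F in (0.0, 0.5, 1.0)])
```

With these assumptions the sketch returns (1 + F)/4 whenever m is a multiple of 4 and 0.5 whenever m = 1 or F = 1, in agreement with the behavior described above. Values for other m depend on assumption (ii) and should be compared against Table 2 before being relied on.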
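The four GEA genotypes (Equation (1)) can be represented in a sample of size m according to different sets of frequencies (SFs). The k-th SF is formed by the frequencies ${}_{k}f^{m}_{1}$, ${}_{k}f^{m}_{2}$, ${}_{k}f^{m}_{3}$, and ${}_{k}f^{m}_{4}$ for genotypes A1B1, A1B2, A2B1 and A2B2, respectively. Furthermore, the number of ways in which this k-th SF can occur in the sample of size m, ${}_{k}(NF)_{m}$, must be:

$${}_{k}(NF)_{m} = \frac{m!}{{}_{k}f^{m}_{1}!\;{}_{k}f^{m}_{2}!\;{}_{k}f^{m}_{3}!\;{}_{k}f^{m}_{4}!} \qquad (3)$$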

Probability of Inclusion of GEA Genotypes (Equation (1)) in the Sample
Unfortunately, the probability that a sample of size four contains all four GEA genotypes (Equation (1)) is very low: 4!/4^4 = 0.094. With larger sample sizes, the probability that all four genotypes are included increases. For example, with m = 5, the number of equally possible and mutually exclusive ways in which the sample includes all four genotypes, with one of them repeated once, is [5!/(2!1!1!1!)] × 4 = 240, while the total number of equally possible and mutually exclusive outcomes is 4^5 = 1024. Therefore, with m = 5, the probability that the sample contains all four genotypes is 240/1024 = 0.234. This probability is also too low.
To calculate the probability that the sample includes all four GEA genotypes (Equation (1)) for larger sample sizes, the methods and elements used by Rodríguez-Pérez et al. [14] were adapted to this case.
If the number of different permutations (NP) in which the four frequencies of the k-th SF can be assigned to the four genotypes that make up the sample is ${}_{k}(NP)_{m}$, and if ${}_{k}P^{m}_{l}$ is the number of times that the l-th smallest frequency (l = 1, 2, 3, 4) of the k-th SF appears, then:

$${}_{k}(NP)_{m} = \frac{4!}{{}_{k}P^{m}_{1}!\;{}_{k}P^{m}_{2}!\;{}_{k}P^{m}_{3}!\;{}_{k}P^{m}_{4}!} \qquad (4)$$

Evidently, the total number of equally possible and mutually exclusive events generated by sampling with replacement of size m (TN) is:

$$TN = 4^{m} \qquad (5)$$

According to Equations (3)-(5), the formulas for calculating the probability that the sample of size m contains all four GEA genotypes (Equation (1)) according to the k-th SF, $P_{k}(\text{GEA inclusion})_{m}$, and for all SFs are, respectively:

$$P_{k}(\text{GEA inclusion})_{m} = \frac{{}_{k}(NP)_{m}\;{}_{k}(NF)_{m}}{4^{m}} \qquad (6)$$

and

$$P(\text{GEA inclusion})_{m} = \sum_{k} P_{k}(\text{GEA inclusion})_{m} \qquad (7)$$

For example, based on Equation (6), the probabilities were generated for each of the three possible SFs with which, for m = 7, all four GEA genotypes are included. In this case, the SFs are: k = 1: 1, 1, 1, 4; k = 2: 1, 2, 2, 2; and k = 3: 1, 1, 2, 3. The number of different permutations, ${}_{k}(NP)_{7}$, was calculated for each SF. For example, for SF k = 3, the total number of different permutations is 4!/(2!1!1!0!) = 12. These are shown in Table 3.
The calculation of the probability that a sample of seven would include the four genotypes of the genotypic array of the cross A1A2 × B1B2 was based on Equations (6) and (7) and is presented in Table 4. This probability was 0.513, which is not satisfactory. A sample size that is compatible with the standards of certainty typically required by plant breeders must provide a probability of at least 0.95. Only one permutation of the genotypic frequencies is shown for each of the three sets of frequencies (k = 1, 2, 3).
In the search for a sample size compatible with acceptable reliability, the probability that all four GEA genotypes (Equation (1)) would be included in samples of 8 and 12 was calculated. These probabilities were 0.62 and 0.87, respectively. It was at m = 15 that, based on Equation (7), the probability of inclusion of the four GEA genotypes was found to reach approximately 0.95.
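The inclusion probabilities discussed in this section can be reproduced with a short sketch that enumerates the sets of frequencies (SFs) and, for each one, multiplies the number of orders of occurrence by the number of ways of assigning the frequencies to the four genotypes, then divides by 4^m. The helper names (`frequency_sets`, `prob_all_four`, `prob_all_four_check`) are illustrative. The second function is an independent inclusion-exclusion cross-check that is not part of the original study.

```python
from collections import Counter
from math import factorial


def frequency_sets(m):
    """Yield every sorted set of four positive frequencies that adds up to m."""
    for a in range(1, m // 4 + 1):
        for b in range(a, (m - a) // 3 + 1):
            for c in range(b, (m - a - b) // 2 + 1):
                yield (a, b, c, m - a - b - c)


def prob_all_four(m):
    """Probability that a sample of size m contains all four genotypes."""
    favourable = 0
    for sf in frequency_sets(m):
        nf = factorial(m)                 # orders of occurrence of this SF
        for f in sf:
            nf //= factorial(f)
        n_perm = factorial(4)             # assignments of frequencies to genotypes
        for count in Counter(sf).values():
            n_perm //= factorial(count)
        favourable += n_perm * nf
    return favourable / 4 ** m            # 4^m equally likely samples


def prob_all_four_check(m):
    """Inclusion-exclusion cross-check (an addition, not the paper's method)."""
    return 1 - 4 * (3 / 4) ** m + 6 * (2 / 4) ** m - 4 * (1 / 4) ** m


for m in (4, 5, 7, 8, 12, 15, 16):
    print(m, round(prob_all_four(m), 4), round(prob_all_four_check(m), 4))
```

For m = 7 the enumeration yields the three SFs listed above and a probability of 8400/16384 ≈ 0.513. For m = 15 both functions give approximately 0.947, which rounds to the 0.95 threshold used in the text.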

Probability of Gene Loss
When the sample of a single cross is generated, one or two genes can be lost. Two are lost when the m plants that make up the sample have the same genotype. The loss of one gene occurs only when the sample (m ≥ 2) contains one genotype v times (v = 1, 2, . . . , m − 1) and another genotype the remaining m − v times, except when the two genotypes are (1) A1B1 and A2B2 or (2) A1B2 and A2B1. In these two cases, there is no gene loss.
In the hypothetical case in which m = 1, the probability of losing two genes is 100%. If m = 2, the loss of two genes occurs when the two plants that make up the sample have the same genotype; this event can occur in four ways, each with a probability of (1/4)^2. This means that when m = 2, the probability of losing two genes is 4(1/4)^2 = 0.25. In general, the probability that two genes are lost with a sample of size m (${}_{m}P_{2}$) is:

$${}_{m}P_{2} = 4\left(\tfrac{1}{4}\right)^{m} \qquad (8)$$

If m = 2, the loss of one gene can also occur. This occurs in each of the four samples shown in Table 5. To calculate the probability of losing one gene when m = 2 (${}_{1,1;2}P_{1}$), one must consider that each of the four samples (Table 5) can occur in two orders and that the total number of equally possible and mutually exclusive events is 4^2. Therefore, with m = 2, ${}_{1,1;2}P_{1}$ = (2 × 4)/4^2 = 0.5.
Hereafter, the general expression ${}_{v,m-v;m}P_{1}$ is used to represent the probability that a sample of size m contains v plants with one genotype (v = 1, 2, . . . , m − 1) and m − v plants with another, and that both genotypes lack the same gene of the four in the GEA (Equation (1)). The formula is:

$${}_{v,m-v;m}P_{1} = \frac{m!}{v!\,(m-v)!\;4^{m}} \qquad (9)$$

With m = 3, the loss of one gene can occur in eight ways. These can be visualized as if, in each of the four samples of two in which a gene is lost (Table 5), one of the two genotypes is duplicated in one case and the other in the second case. For example, from sample (A1B1, A1B2), the two resulting samples, now of size three, are (1) (2A1B1, A1B2) and (2) (A1B1, 2A1B2). Therefore, since each of the four samples of two lacking the same gene (Table 5) is associated with two samples of size three lacking the same gene, the probability that one gene is lost when m = 3 in each of the eight cases, ${}_{2,1;3}P_{1}$, is given by Equation (9).

Regarding the probability that one gene is lost when m = 4 (${}_{4}P_{1}$), the three samples of that size associated with the sample of two lacking the A2 gene are samples 1, 2 and 3, as shown in Table 6. Samples 1 and 3 have the same probability of occurrence, ${}_{3,1;4}P_{1}$, which differs from that of sample 2, ${}_{2,2;4}P_{1}$. Based on Equation (9) and the data in Table 6:

$${}_{4}P_{1} = 8\;{}_{3,1;4}P_{1} + 4\;{}_{2,2;4}P_{1} = \frac{8\,[4!/(3!\,1!)]}{4^{4}} + \frac{4\,[4!/(2!\,2!)]}{4^{4}} = 0.219$$

Table 6. GEA genotypes (Equation (1)) in samples of four lacking one gene.

The formulas for calculating the probability of losing one gene in samples of size m ≥ 3 (${}_{m}P_{1e}$ and ${}_{m}P_{1o}$) are given in Equations (10) and (11). Based on Equations (9)-(11), Table 7, containing the probabilities of one-gene (${}_{m}P_{1}$), two-gene (${}_{m}P_{2}$), and one- or two-gene (${}_{m}P_{1,2}$) losses for twelve sample sizes, was completed.

Table 7. Probabilities of loss (PL) of one gene (${}_{m}P_{1}$), of two genes (${}_{m}P_{2}$), and of one or two genes (${}_{m}P_{1,2}$) in the formation of the progeny of size m from a parent single cross of a SynSC.
With the exception of m = 1, the loss of one gene is more likely than that of two (Table 7). This is because the loss of two genes occurs only when the sample includes a single genotype of the four in the GEA (Equation (1)) m times, whereas the loss of one gene, for example A2, occurs with any sample of the form vA1B1 and (m − v)A1B2, which has m − 1 ways of being constructed (v = 1, 2, . . . , m − 1), each of which has m!/[(m − v)!v!] different orders of occurrence. There are three other cases of this kind. These result when A1, B1 or B2 is lost.
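The counting argument for gene loss can be sketched in a few lines. The closed forms below follow the reasoning in the text: four single-genotype samples for the loss of two genes, and four pairs of genotypes sharing a gene, each with m!/[(m − v)!v!] orders, for the loss of one gene. They are a summary written for this sketch and are not necessarily identical in form to the even and odd formulas of the original study. The function names are hypothetical, and `brute_force` simply enumerates all 4^m ordered samples as a check.

```python
from itertools import product
from math import comb

# Genotypes of the genotypic array of the single cross A1A2 x B1B2.
GENOTYPES = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]


def prob_two_genes_lost(m):
    """All m plants share one of the four genotypes."""
    return 4 / 4 ** m


def prob_one_gene_lost(m):
    """Exactly two genotypes that share a gene: 4 such pairs, v = 1..m-1."""
    return 4 * sum(comb(m, v) for v in range(1, m)) / 4 ** m


def brute_force(m):
    """Enumerate every ordered sample of size m and count the genes present."""
    one = two = 0
    for sample in product(GENOTYPES, repeat=m):
        genes = {gene for plant in sample for gene in plant}
        lost = 4 - len(genes)
        one += lost == 1
        two += lost == 2
    return one / 4 ** m, two / 4 ** m


for m in range(1, 8):
    print(m, round(prob_one_gene_lost(m), 4), round(prob_two_genes_lost(m), 4),
          brute_force(m))
```

For m = 2 and m = 4 the sketch returns 0.5 and 56/256 ≈ 0.219 for the loss of one gene, matching the worked examples above.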

Discussion
It was found that for each of the five F values studied, the FSynSC values are equal for all the m values that are multiples of 4 (m = 4x, for x = 1, 2, 3, 4, 5, 6) (Table 2). This result is due to the fact that with these m values, the expected frequencies of the genotypes that make up the sample are equal (1/4). Furthermore, for the three values of m (4x + 1, 4x + 2 and 4x + 3) that lie between two consecutive multiples of 4 [between 4x and 4x + 4], the ICs are greater than those of the two consecutive multiples of 4. This occurs because for these three values of m, the frequencies of the four genotypes of the GEA (Equation (1)) are not equal and, in these cases, matings between plants with the same genotype are more frequent and generate more inbreeding than when genotypic frequencies are equal [15]. When genotypic frequencies are equal, the frequencies of matings between different genotypes, which contribute less to inbreeding, are maximized.
It is also noticeable (as shown in Table 2) that as m grows, the FSynSC values corresponding to the three sample sizes that lie between two consecutive multiples of 4 are increasingly close to those of these multiples [4x and 4(x + 1)]. This trend is related to the decrease in the variance of the number of NIBD genes in the SCs that occurs when m is larger. This variance [11] also decreases as F increases.
It is noteworthy that the smallest FSynSC values for each value of F occur when m = 4, 8, 12, 16 (Table 2). This result is to be expected according to Equation (2), which reduces to the form FSynSC = (1 + F)/4 whenever m is a multiple of 4. On the other hand, although the use of small m values is desirable, the decision to use a size of m = 4, for example, would not be appropriate without taking into account other results. For example, it is also desirable for the sample to include all four of the GEA genotypes (Equation (1)), which means that the entire gene pool of the parents can be involved in the formation of the SynSC. However, the probabilities of sample sizes of four, eight, and twelve including all four GEA genotypes are not satisfactory. According to the calculation for m = 4 and the results from Equation (7), these probabilities are 0.094, 0.623, and 0.8747, respectively. In addition, it was found that: (1) a sample of 15 is required for the probability that all four GEA genotypes (Equation (1)) are included to be 0.95, and (2) from m = 15, FSynSC is practically unchanged, becomes more stable as the value of F increases (Table 2) and approaches (1 + F)/4. It should be considered, however, that a sample that does not include one or two of the four genotypes may contain all four GEA genes (Equation (1)) and, therefore, may avoid the loss of genes and genotypes in the SynSC. For example, if a sample only contains the genotypes A1B1, A1B2, and A2B1, its gametic array will include all four genes, A1, A2, B1, and B2. With these genes in the gametic array, the random mating of the sample should form a genotypic array that includes all four genotypes of the single cross A1A2 × B1B2. This result is also produced by a sample containing only the genotypes A1B2 and A2B1.
If the gametic array of the sample representing a single cross is missing one or two of the four genes, the resulting genotypic array is not complete. This loss of genes, unlike that of genotypes, is always irreparable and results in increased SynSC inbreeding. Fortunately, the probability of loss is not high (Table 7). For the loss of one gene, this probability is negligible from m = 7 onwards; for the loss of two genes, it is negligible even from a sample of five (Table 7).

Materials and Methods
Due to the characteristics of maize (Zea mays L.), the model on which this study was based was that of a locus of a diploid species that reproduces by random mating. The parents of the SV under study, s single crosses, were considered to be formed with 2s unrelated lines whose inbreeding coefficient was F (0 ≤ F ≤ 1). This means that if a line that is a parent in a single cross is represented as A1A2, the probability (P) that A1 and A2 are identical by descent (≡) is F; that is, P(A1 ≡ A2) = F [16]. The same applies to each of the 2s lines that are parents of the s single crosses. This, in turn, means that when the genes A1 and A2 are not identical by descent (in 100(1 − F)% of the cases), they must be alleles. It should be noted that in all cases, A1 and A2 were referred to as genes.
According to the characteristics of their origin, the genotypes of the plants representing a SV parent were visualized as those captured by a random sample of size m taken with replacement from the population formed by the genotypes produced by a single cross.
The derivation of the formula for FSynSC was based on the consideration that the random mating of the ms plants representing the s parent single crosses of SynSC implies the random mating of any subset of the ms plants, particularly the m representatives of each single cross. This intraparental mating is the only source of inbreeding in SynSC. All other matings are interparental crosses. These do not contribute to inbreeding, since the lines are unrelated. For this reason, the derivation of FSynSC in terms of a finite number m was based on the IC of the progeny of m plants from a single cross. This derivation was conducted based on the procedure proposed by Rodríguez-Pérez et al. [17].
To investigate how sample size affects FSynSC, the IC of all combinations of 24 values of m with 5 ICs of the parent lines of a single cross was calculated. The probability that the sample of m plants of a single cross includes all the genotypes that form the genotypic array, and the probability that 1 or 2 genes are lost, were also calculated for different values of m.

Conclusions
A formula was derived to determine the inbreeding coefficient (FSynSC) of a synthetic variety developed by the random mating of s single crosses formed by 2s unrelated lines whose inbreeding coefficient is F. Unlike previously derived formulas for this synthetic, FSynSC can be applied to any combination of the number of representatives of each parent (m), the number of parent single crosses (s), and the inbreeding coefficient of the lines (F).
The greatest effect of the finite size m on FSynSC occurs at values from 1 to 4. In addition, whenever m is a multiple of 4, the FSynSC values corresponding to the same F value do not differ and their value is the smallest. This result is consistent with what would be expected for the genotypic array of a "large" population reproduced by random mating. For the rest of the m values, very small reductions in FSynSC are observed from m = 7 onwards, and the values tend towards those obtained when m is a multiple of 4.
It was also found that for the probability of the sample including all four genotypes to be at least 0.95, it is necessary that m ≥ 15. However, in an apparent contradiction, it was found that the probability of the sample not including one or two genes is practically negligible from m = 7. This means that since all four genes are included in the sample, all four GEA genotypes (Equation (1)) must be present in the SynSC and FSynSC could be closer to its smallest value.