Fuzzy Logic as a Strategy for Combining Marker Statistics to Optimize Preselection of High-Density and Sequence Genotype Data

The high dimensionality of genotype data available for genomic evaluations has presented a motivation for developing strategies to identify subsets of markers capable of increasing the accuracy of predictions compared to the current commercial single nucleotide polymorphism (SNP) chips. In this simulation study, an algorithm for combining statistics used in the preselection and prioritization of SNP markers from a high-density panel (1.3 million SNPs) into a composite “fuzzy” ranking score based on a Sugeno-type fuzzy inference system (FIS) was developed and evaluated for performance in preselection for genomic predictions. FST scores, and p-values were evaluated as inputs for the FIS. The accuracy of genomic predictions for fuzzy-score-preselected panel sizes of 1–50 k SNPs ranged from −0.4–11.7 and −0.3–3.8% higher than FST and p-value preselection, respectively. Though gains in prediction accuracies using only two inputs to the FIS were modest, preselection based on fuzzy scores yielded more accurate predictions than both FST scores and p-values for the majority of evaluated panel sizes under all genetic architectures. FIS have the potential to aggregate information from multiple criteria that reflect SNP-trait associations and biological relevance in a flexible and efficient way to yield higher quality genomic predictions.


Introduction
The availability of high-density (HD) and whole genome sequence (WGS) data has increased in recent years to allow their practical use in genomic evaluations. It was initially expected that prediction accuracies would increase using such data as a result of better tracking of the sources of genetic variation [1]. However, practical results demonstrate little to no advantage relative to low-and moderate-density single nucleotide polymorphism (SNP) panels currently used in genomic evaluations [2]. This failure is likely due to a combination of excessive multicollinearity between SNP/variant predictors [3] and overparameterization of the association model [4].
SNP marker preselection has the potential to improve the quality of genomic predictions by reducing multicollinearity and overparameterization and by increasing the biological relevance of markers included in the association model or in the calculation of the genomic relationship matrix. Several published studies have evaluated single preselection criteria, including p-values [5], estimated marker effects [6], F ST scores [7,8], and annotation missense status [9]. In each of these studies, prediction accuracies were better than genome-wide SNP panels only when one or a few variants had a large contribution to the genetic variance of the trait for both simulated and real data.
The limited success of single-criteria preselection to improve prediction accuracies is likely to be explained by the limited sensitivity of any single criteria to distinguish between relevant and non-relevant markers. All criteria will be subject to false positives, and the impact of such markers on predictions is unlikely to be null. Ling et al. (2021) [10] demonstrated that while the SNP preselection by the estimated effect results in less false positives and more genetic variance explained than F ST preselection, the false positives selected based on estimated effect have a greater negative impact on predictions and can ultimately offset the benefit from capturing more genetic variance. Based on these observations, it is reasonable to hypothesize that a composite score combining these two preselection criteria may harness the benefits of both to ultimately yield more accurate genomic predictions. Ref. [11] proposed a composite score for ranking variants from WGS data based on association and biological criteria and found that when preselected variants were combined with a genome-wide 630 k SNP panel, differences in genomic prediction accuracies between high-and low-ranking variants were statistically significant for several economically important milk traits in Holsteins.
It has been demonstrated that the increase in the accuracy of genomic predictions in the presence of HD marker data requires a balance between the fraction of the genetic variance explained by the prioritized SNPs and the number of preselected markers [8].
Increasing the latter will often increase the portion of genetic variance explained but not the accuracy of the genomic predictions. Thus, if WGS variants fail to significantly improve the prediction accuracies solely by explaining greater genetic variance, then increasing accuracies must rely instead on decreasing the bias in genomic predictions.
Fuzzy logic was initially proposed by Zadeh (1965) [12] as a strategy to mimic the human approach in understanding data-generating processes and logical decision making, where factors in the decision-making process may be subject to vagueness and uncertainty. It has subsequently been applied to a variety of practical applications, from inventory control [13] to analysis of landslide susceptibility [14], due to its flexibility and efficiency in modeling nonlinear processes [15]. In human medical research, fuzzy logic algorithms have been evaluated for their use in disease classification based on genomic markers [16,17]. Fuzzy logic models have also found various applications in agriculture. They have been extensively utilized in dairy cattle production systems, as automated systems have become increasingly utilized [18]. Shahinfar et al. (2012) [19] evaluated the feasibility of estimating breeding values in dairy cattle based on parental estimated breeding values and own performance. More recently, fuzzy logic models have been used to predict phenotypic performance in crop hybrids based on parental line performance data [20] and genomic markers [21]. Fuzzy logic models are a computationally efficient and flexible tool for decision making based on multiple criteria. A fuzzy inference system (FIS) may be particularly suitable for genotype data and genomic prediction analyses, as it is generally assumed that the vast majority of markers available are in imperfect linkage disequilibrium (LD) with causative variants rather than they themselves having a causative effect on the trait(s) of interest. While thresholds for assessing the amount of useful LD between a SNP and causative variant have been proposed [22], these are clearly somewhat arbitrary and the usefulness of information carried by a particular SNP marker with regards to causative variants is more likely to fall in a gray area between true information and noise. However, it is clear that if the loci of the causative variants were known then the problem that originally motivated the use of SNPs would be solved and there would be no purpose for SNP preselection. Instead, criteria that reflects association of SNPs with a particular trait, and by extension the potential LD with a causative variant, could be used in a fuzzy model. In this simulation study, we evaluate the potential of a Sugeno-type fuzzy logic model [23] to combine p-values and F ST scores into a composite score for SNP preselection, which to our knowledge has not previously been evaluated as a SNP preselection tool for the prediction of the genomic breeding values of complex traits.

Materials And Methods
Fuzzy logic is often used in assessing uncertainty associated with a decision-making process. It provides computers with the ability to compare alternatives in a similar manner to the reasoning process of humans. For example, when comparing the daily milk production of two cows producing 40 and 41 kg, a crisp classification system based solely on phenotypic selection for high milk yield will conclude that the daily productions are different and the cow producing 41 kg is a better selection candidate. Realistically, a reasonable farmer would be more likely to use a "soft" approach by balancing the milk production with additional important factors that are not fully assessed (e.g., feed efficiency and health factors) and the final decision would potentially be different from the crisp classification. A FIS is a process that formalizes this strategy and consists of three main components: (1) fuzzification of the inputs, (2) rule evaluation, and (3) defuzzification to achieve a final output. Conceptually, the final fuzzy outputs in this example would not be dissimilar to a selection index where performance across multiple traits is accounted for, though the fuzzy algorithm notably differs from how selection indices are constructed.
Fuzzification of the inputs involves constructing for each input class a fuzzy set, A = (U, m); where U is a set that falls within the sample space of a given input, x; and m is a membership function that maps x to membership in U with m(x) ∈ [0, 1]. Figure 1 illustrates the mapping of the inputs used in this study (p-values associated with estimated SNP effects and marker F ST scores) using piecewise linear functions as a membership function (m). It is worth noting that a given input value may have partial membership in adjacent classes, hence the fuzziness of the set, under the constraint that the sum of all memberships are less than or equal to one. The rule evaluation step maps the underlying association between the input and output of the underlying system. Each set of inputs must be evaluated according to a rule base. The rule base follows a generalized structure of, where z i is the true output for individual i, modeled as a linear combination of the vector of input parameters (x i ); a i is a vector of unknown coefficients; and ∧ stands for the Boolean operator AND. Following the basic fuzzy logic approach formulation, the rule base is constructed according to principles of Boolean logical operators. The choice of a specific Boolean operator has an additional impact on how the truth values are aggregated in the defuzzification step. Each input that factors into a particular rule is associated with a membership in the membership function (m) for that rule, yet only one of these membership values will be associated with the resulting truth output. Which of the membership values is associated with the truth value depends on the choice of the Boolean operator. The minimum (maximum) membership value will be selected using the Boolean operator AND (OR). Table 1 shows the rules and corresponding parameters used in this study, where b 0 is the intercept and b 1 and b 2 are the coefficients corresponding to F ST scores and p-values, respectively. The defuzzification step is the process used to identify the best crisp outcome given the fuzzy set. It consists of generating a single score (y) through the aggregation of the output of the different fuzzy sets, where p is the number of rules in the rule base. It can be observed here how the construction of the rules and choice of Boolean operator ultimately influence the final output. Since the Boolean operators AND and OR result in y being weighted by the smallest and largest membership values, respectively, for a given rule, the contribution of a particular rule to the final output will vary considerably depending on the choice of operator. It would be reasonable to suggest that alternative strategies to collapsing the input memberships into a membership for the truth value could be used, for example an average, but this was not investigated in the present study. Figure 2 shows the output space of the crisp fuzzy scores for the range of inputs observed in this study. The fuzzy algorithm was able to accommodate a highly nonlinear output space for the fuzzy scores despite the membership functions and rule base having linear components.
F ST scores and p-values were chosen as the preselection criteria and fuzzy inputs in this study due to the straightforwardness of calculating these statistics in a simulation environment, but it should be noted that in real data there are many additional criteria that have sound justification for being indicative of biological relevance and could potentially be integrated into a fuzzy model for preselection. p-values were calculated jointly using a single-step genome-wide association study approach with the BLUPF90 suite of programs, as described in [24], where genomic predictions are first calculated using a single-step GBLUP (ssGBLUP) model and SNP effects are then jointly calculated from the genomic predictions, where Z is the transpose of the n × m matrix of genotypes, G −1 is the inverse of the genomic relationship matrix,û is the vector of solutions of genotyped individuals from a ssGBLUP model, and p i and q i . The SNP variances are calculated according to, It should be noted that while the equivalency of GBLUP and SNPBLUP has been demonstrated under certain assumptions [25], SNPBLUP and ssGBLUP are not strictly equivalent due to the inclusion of information from non-genotyped animals in the latter.
F ST scores were calculated as in [7]. With this strategy, subpopulations are formed from the extreme tails of the distribution of phenotypes. Groups of individuals with highly divergent phenotypes are expected on average to carry different alleles at loci that contribute variation to the phenotype of interest and F ST is a natural measure of such differences, where H T = 2p T q T , H S = H S1 n S1 +H S2 n S2 n S1 +n S2 and H Si = 2p Si q Si , p T and q T are major and minor allele frequencies, respectively, in the combined populations; p Si and q Si are major and minor allele frequencies, respectively, in the i th subpopulation; and n Si is the number of individuals in the i th subpopulation.
While separate breeding populations are traditionally used for calculations of F ST and it would seem natural to choose distinct breeding populations with divergent phenotypes (e.g., breeds with historically different selection goals), such an approach has the drawback of capturing deviations in allele frequencies at loci affecting all traits for which the popu-lations differ on average. By grouping individuals from a single population by extreme phenotype, deviations in allele frequencies at unrelated variants will be driven only by correlated traits and random sampling.
Simulated datasets were generated using QMSim [26] for three scenarios that differed in the number of the causative loci contributing variation to the trait of interest (300, 1 k, or 5 k). Table 2 shows parameters specified for the simulation. Each simulation consisted of ten replicates and three sequential populations. First, a historical population (HP) under random mating was generated to establish mutation-drift balance and LD. None of these individuals had data or pedigree recorded. For the second population, 500 males and 4000 females were randomly selected from HP to be founders. This population underwent phenotypic selection and the number of dams retained expanded by 5% each generation. All individuals had data and pedigree recorded without error. From this population, 100 males and 6000 females were selected to be founders of the most recent population. It consisted of 5 non-overlapping generations under selection and 3500 individuals per generation. Selection was based on pedigree EBVs. All individuals had data and pedigree and a random 5000 individuals from generations 1 to 4 and 1000 individuals from generation 5 had genotypes. Preselection was based on random selection, p-values, F ST scores, or fuzzy scores as criteria. The predictive power of preselected SNP panels of size 1, 2, 3, 4, 5, 10, 20, 30, 40, and 50 k was evaluated using a ssGBLUP model, as implemented in BLUPF90 [24].
The simulated genome consisted of 1.3 million SNPs randomly distributed across 29 chromosomes. There were either 300, 1 k, or 5 k causative variants randomly distributed across either only 14 or all 29 of the chromosomes; in the prior scenario, SNPs on 15 chromosomes did not segregate with any causative loci and was designed in such a way to allow investigation of how noninformative SNPs affect genomic predictions and sensitivity of different preselection criteria.
Genomic prediction accuracies were calculated as the correlation between true (u) and estimated (û) breeding values. Bias of SNP associations was measured as the deviation of the regression coefficient of u onû (b 1,u|û ) from one, with values less and greater than one indicating over-and underdispersion, respectively. If the estimated effect of a SNP (or its association with the trait) is not consistent from the training to the validation population, then its contribution to the validation genomic predictions will create a variation that is not present in the TBVs and results in overdispersion of the estimates (b 1,u|û < 1).
The amount of genetic variance explained (GVE) by preselected SNPs was quantified as the variance of validation predictions, var(û). The genomic prediction accuracy can be expressed as a function of the regression of true on estimated breeding values, b 1,u|û , and the ratio of the standard deviations of u andû,

Results
In the simulated genome, only 14 of the 29 chromosomes harbored causative variants, leaving the SNPs on 15 of the chromosomes definitively unlinked with any causative loci. Figure 3 shows the proportion of SNPs that are linked (blue) or not linked (red) selected by each of the marker prioritization criteria. As expected, random preselection resulted in the most unlinked markers for all genetic architectures. F ST scores preselected more unlinked markers than p-values for all genetic architectures and SNP panel sizes. Fuzzy preselection resulted in the least number of unlinked markers for all preselected panels and genetic architectures, though the differences between p-value and fuzzy preselection are relatively modest. Interestingly, the advantage of the fuzzy scores relative to p-values in discriminating between linked and unlinked markers seemed to increase as the genetic architecture increases in complexity, with the percent difference in the number of linked markers selected by fuzzy scores ranging from 0.17 to 2.77, 0.67 to 2.57, and 2.49 to 5.40% higher than p-values for 300, 1 k, and 5 k causative variants, respectively.
It may be useful to further compare how markers that are linked and unlinked with causative variants are commonly (or not) selected across preselection criteria. Tables 3-5 present the number of SNP markers that are unique to a particular criterion and the proportion of those markers that are linked with causative variants. Tables 6-8 show the same information but for SNP that are shared between two or more criteria. Between 15 to 25, 14 to 32, and 16 to 31% of the SNPs preselected based on fuzzy scores for 300, 1 k, 5 k causative variants, respectively, were not present in the equivalent-sized panels for F ST scores and p-values and a higher proportion of these SNPs were linked with causative variants for all genetic architectures. As expected, a high proportion of the SNPs that were selected by all criteria for a given panel size were linked with causative variants; while the proportion linked decreased as the panel size increased due to an expected loss of sensitivity, the proportion linked was still considerably higher than among SNPs selected by only one or two of the criteria, with the notable exception of markers that were selected by both F ST scores and p-values but not fuzzy scores. Greater than 95% of SNPs were linked with causative variants for all genetic architectures for the latter case, indicating that there may be a more optimal set of fuzzy parameters that would include these SNPs in the fuzzy-preselected panels.   While it is not straightforward how to quantify the relevance of a particular SNP in a scenario where all markers are potentially linked and in LD with a causative variant(s), if the overlap in SNP selected by each criteria is similar to the scenarios where a portion of the SNPs are definitively not linked with any causative variants, then it may be reasonable to expect that the fuzzy scores are robust across genetic architectures for preselection. Table 9 shows the number of SNPs unique to one criteria or shared across two or all criteria for a genetic architecture of 1 k causative variants that are distributed across all 29 chromosomes. While this scenario is expected to be more realistic with regards to the genetic architecture of complex traits in real populations, it is not straightforward how to define the relevance of the preselected SNPs. However, it can be observed that the distribution of SNPs across each criteria in Table 9 is quite similar to that observed under the scenarios, where a portion of the markers are definitively not linked with causative variants (and therefore have minimal relevancy for trait-specific predictions), so it may be reasonable to expect that the relevancy of SNPs selected under the more realistic genetic architectures may also be similar to what was measured under the less realistic genetic architectures. Figure 4A presents the trends in accuracy for each of the preselection criteria across the different SNP panel sizes. For each of the genetic architectures, fuzzy-score-based prioritization yielded higher accuracies compared to the other preselection criteria. In fact, the accuracy using the fuzzy approach was higher by up to 15.3, 7.9, and 2.4 under the 5 k causative variant scenario; 29.9, 8.9, and 3.5 under the 1 k causative variant scenario; and 44.5, 10.6, and 4.57% under the 300 causative variant scenario relative to random, F ST , and p-value preselection methods, respectively. The largest advantage for the fuzzy score preselection relative to random and F ST preselection was when the size of the prioritized marker set (panel size) was 1 k and this difference tended to decrease as the panel size grew with the exception of the scenarios of 5 k causative variants for random preselection (fuzzy preselection accuracy was 0.5% lower) and 300 causative variants for F ST scorebased prioritization (fuzzy approach accuracy was 0.7% lower). Accuracies for p-value preselection did not exceed those of fuzzy preselection for any of the genetic architectures or evaluated panel sizes.     Figure 4B shows b 1,u|û on the horizontal axis, var(û) on the vertical axis, and a gradient showing values of accuracy that correspond to the range of values. Additionally, the values of b 1,u|û and var(û) that are observed for each preselection criteria and panel size are plotted to show how each preselection criteria moves through the accuracy parameter space as the SNP panel size increases. For all genetic architectures, random preselection increased the variance of the validation predictions slowly but the overdispersion decreased rapidly as the panel size increased. The opposite trend was observed for p-value preselection; it tended to initially capture more variance in the predictions and increase this variance at a faster rate, but predictions based on p-value preselection are the most over-dispersed. In the case of 300 causative variants, this overdispersion even increased, explaining why under this genetic architecture the p-value accuracy trends downward for panel sizes larger than 3k SNPs. F ST scores tended to initially capture similar genetic variation as random preselection but be less over-dispersed; while this dispersion did not decrease as quickly as observed with random preselection with increasing SNP panel size, the variation of the predictions tended to increase more quickly, resulting in higher accuracies for F ST scores under 300 and 1 k but not 5 k causative variants. Similar trends in accuracies, b 1,u|û , and var(û) for fuzzy scores, F ST scores, and p-values were observed under the more realistic scenario where causative variants were distributed across all chromosomes, as shown in Figure 5. Fuzzy preselection outperformed random, F ST , and p-value preselection by up to 39.7, 7.8 and 3.8, respectively, for 300 causative variants; 24.6, 11.7 and 1.9 for 1 k causative variants; and 6.15, 7.7 and 1.9% for 5 k causative variants. When there were 5 k causative variants, random preselection yielded notably better predictions than the other preselection criteria, outperforming fuzzy preselection for all panels larger than 1 k SNPs by up to 5.3%.

Discussion
Much of the optimism around the use of HD marker panels and WGS data was motivated by the expectations that such data would vastly increase the percentage of the genetic variation captured and ultimately the accuracy of genomic selection. However, both simulation and real-data-based results have demonstrated little benefit using such data in genomic predictions. Equation (1) shows that genomic prediction accuracies are influenced by two factors: (1) the proportion of the true genetic variance captured in the validation predictions, and (2) the dispersion of validation predictions relative to the true additive genomic values. Increasing the proportion of true genetic variance captured will not necessarily improve genomic prediction accuracies if the validation predictions become more over-dispersed b 1,u|û . This is evidenced by the trends of accuracy, GVE, and overdispersion under the 5 k QTL genetic architecture. In such a case, p-value-based preselection yielded lower genomic prediction accuracies than all other criteria for SNP panels larger than 20 k despite explaining the largest percentage of the genetic variation. In fact, random preselection explained the least genetic variation of all criteria yet yielded the highest accuracies for SNP panels larger than 10 k. These results demonstrate that maximizing the genetic variation captured by a set of SNPs is not necessarily ideal if there is an accumulation of bias in the SNP associations, as p-value preselection has a tendency to do that, given that it consistently yielded the greatest over-dispersion of all criteria evaluated.
The performance of random preselection relative to the non-random preselection criteria improved as the number of causative variants increased. It would be reasonable to expect that random preselection might capture more genetic variance by more evenly covering the genome, where non-random criteria might instead saturate regions with causative variants with the strongest signals and consequently fail to capture genetic variance from causative variants with a smaller contribution; however, Figure 2 shows that random preselection in fact captures the least amount of genetic variation across all genetic architectures.
As the complexity of the genetic architecture increases, the maximum contribution to the genetic variance of a single variant is reduced and the sensitivity of non-random preselection criteria to distinguish true from spurious signals becomes more limited.
p-values and F ST scores followed opposing trends in accuracy, genetic variance explained, and SNP preselection bias. p-values tended to capture a larger portion of the genetic variance but result in more over-dispersed genomic predictions, while F ST captured consistently less genetic variation but yielded genomic predictions that were less over-dispersed.
A primary objective of designing a composite preselection score using FIS was to increase accuracies relative to those of the individual preselection criteria used as inputs to the FIS. While fuzzy preselection did not universally yield higher accuracies than p-value and F ST score preselection, there were certain points at which it performed better and this demonstrates that it is feasible to combine multiple criteria in an index to capture the strengths of each and improve predictions. Accomplishing this depends on preserving or increasing genetic variation captured while minimizing the accumulation of bias in SNP associations.
Though ultimately none of the preselection criteria yielded better predictions than the use of all 1.3 million SNPs, very similar accuracies were obtained using preselected panels of 50 k SNPs and smaller, which may be useful in analyses when it is not feasible to routinely use SNP panels of such large dimension.
Although a HD panel was used in this simulation study (1.3 million markers), the reality of the genomic field in livestock and plant applications is that several million markers/variants are available from whole genome sequencing at different depths. This dramatic increase in the number of variants will undoubtably increase the accumulated bias in variant associations using single criteria for prioritization. The results of this study seem to indicate that an aggregated preselection criterion, such as the FIS approach, may be adequate for such data, especially if further prioritization criteria such as annotation, functional, and expression information associated with the considered SNPs/variants is available.

Conclusions
A large amount of WGS data is being generated in several livestock applications largely for potential use in genomic evaluations. Currently, such information has not led to a meaningful increase in accuracy compared to low/moderate commercial marker panels. Several marker prioritization methods were proposed, but their performance were inconsistent across applications with slight to no improve in accuracy. In spite of our understanding of the relationship between the portion of the genetic variance explained by the preselected markers and the bias induced by the prioritization of false positives (markers non-linked to causal variants), we were unable to find the optimum balance between these two opposing forces to maximize accuracy. In this study, a fuzzy logic approach was presented to concatenate two prioritization criteria into a unique score to prioritize relevant markers with the expectation of reducing the number false positives tagged by each individual criteria separately. Under varying marker panel size and genetic complexity (number of causal variants), the FIS approach seems to perform generally better than the single criteria approaches. We believe that the performance of the FIS approach could be improved with the inclusion of already available additional prioritization information such as annotation, functional, and expression data. While the results presented here are promising, only simulated data was evaluated, the generation of which necessitates making simplified assumptions regarding the biology of the genome and its role in influencing phenotypes. Future research regarding this methodology will include its application to real data to evaluate the robustness of its performance in SNP selection for genomic predictions as well as the investigation of algorithms for estimation and tuning of FIS model parameters. The FIS approach presents a flexible method to combine noisy/uncertain information to derive reasonable decision-making outputs.

Informed Consent Statement: Not applicable
Data Availability Statement: Programs used to generate the simulated data are publicly available. Programs used to analyze the data are available upon request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

FIS
Fuzzy inference system HD High-density HP Historical population LD Linkage disequilibrium SNP Single nucleotide polymorphism ssGBLUP single-step GBLUP WGS Whole genome sequence