A New Diversity Panel for Winter Rapeseed (Brassica napus L.) Genome-Wide Association Studies

: A diverse population (429 member) of canola (Brassica napus L.) consisting primarily of winter biotypes was assembled and used in genome-wide association studies. Genotype by sequencing analysis of the population identiﬁed and mapped 290,972 high-quality markers ranging from 18.5 to 82.4% missing markers per line and an average of 36.8%. After interpolation, 251,575 high-quality markers remained. After ﬁltering for markers with low minor allele counts (count > 5), we were left with 190,375 markers. The average distance between these markers is 4463 bases with a median of 69 and a range from 1 to 281,248 bases. The heterozygosity among the imputed population ranges from 0.9 to 11.0% with an average of 5.4%. The ﬁltered and imputed dataset was used to determine population structure and kinship, which indicated that the population had minimal structure with the best K value of 2–3. These results also indicated that the majority of the population has substantial sequence from a single population with sub-clusters of, and admixtures with, a very small number of other populations. Analysis of chromosomal linkage disequilibrium decay ranged from ~7 Kb for chromosome A01 to ~68 Kb for chromosome C01. Local linkage decay rates determined for all 500 kb windows with a 10kb sliding step indicated a wide range of linkage disequilibrium decay rates, indicating numerous crossover hotspots within this population, and provide a resource for determining the likely limits of linkage disequilibrium from any given marker in which to identify candidate genes. This population and the resources provided here should serve as helpful tools for investigating genetics in winter canola.


Introduction
Winter rapeseed is an economically valuable, cool season oilseed crop grown in the United States, ranging from Idaho to the central Mid-Atlantic states. It has also gained attention as a potential winter cover crop, with traits suitable for developing dual-cropping systems, since it generally outperforms spring biotypes in yield and matures early enough to allow for harvest before planting of some varieties of corn, soybean, and sunflower [1]. As a cover crop, winter rapeseed could also provide ecosystem to the F9 generation via selfing prior to DNA extraction. The remaining lines had been advanced to between the F5 and F8 generation. A list of common cultivar names associated with the associated Agricultural Research Service (ARS) numbers and preliminary phenotypic traits, including two years of data for flowering time, are available in Supplemental File S1.

Genotype by Sequencing and Polymorphism Filtering and Assessment
DNA was isolated from a single plant from each line using the DNeasy 96 Plant Kit (Qiagen, Hilden, Germany) with modifications. Samples were lysed with FastPrep-24™ 5G Homogenizer (MP Biomedicals, Irvine, CA, USA). The settings for the instrument were: two pulses at 6.0 m/s for 20 s each with a 10 s pause between pulses. Lysis tubes were spun at 16,000× g in a microcentrifuge. After the addition of Buffer P3, samples were incubated on ice for 10 min. S-Blocks were modified for use in a Sorval SH3000 swing-bucket rotor. Samples were spun at 3827× g (4300 rpm). To bind the DNA, 600 µL of sample was applied twice to the DNeasy 96 plate (because of the reduced volume of the modified S-Blocks). Membrane was washed twice with 400 µL of Buffer AW2 (because of the reduced volume of the modified S-Blocks), each with a 5 min spin. Membrane was dried with a final 10 min spin. DNA was eluted off the membrane twice with 100 µL Buffer AE. Buffer was added to the membrane, incubated for 1 min at room temperature, then spun for 2 min at 3827× g. The elutions were pooled before quantitation. DNA was quantitated using Quant-iT™ PicoGreen ® dsDNA Kit (Life Technologies, P7589) and a fluorescent plate reader (Varioskan LUX, Thermo Scientific, Waltham, MA, USA) with SkanIt analysis software (Thermo Scientific, Waltham, MA, USA). Quality of the extracted DNA was determined by visually observing selected samples run on a gel alongside the same samples cut with restriction enzyme. For each 20 µL reaction: 6 µL (~300 ng) DNA, 4 U Hind III (Roche, Basel, Switzerland), 2 µL reaction buffer, and H20. Reactions were incubated for 2 h at 37 • C. Then, 2.2 µL loading buffer (Invitrogen, Carlsbad, CA, USA,) was added to each reaction before loading on a 1% agarose gel with size standard. Then, 2 µL (~100 ng) undigested DNA was diluted to 20 µL, combined with loading buffer and loaded alongside the corresponding digested sample. Gel was run at 4 V/cm for 1 h and imaged using BioRad gel imager and software (Hercules, CA, USA).
The resulting DNA was sent to Cornell University and the enzyme ApeKI was used to select fragments for building the GBS libraries. Sequencing was done on 5 lanes of Illumina 2500 with 150 base paired end reads as output. Over 3 billion reads were generated with 46% containing readable barcodes and subsequently over 8 million tags present at least three times were identified in the merged files. The resulting sequences were trimmed to remove contaminating and low-quality sequences and then clustered and mapped to the canola reference genome (Brassica_napus_v4.1.chromosomes.fa) using the TASSEL5 pipeline [15]. Over 5 million of these aligned uniquely to the canola genome. The resulting single nucleotide and indels variants were filtered as described [15] and based on the number of missing alleles in individuals and or at specific sites in the genome. This selection reduced the population size to 429 individuals with at least 15% called variants and 290,972 markers that were called in at least 10% of the population. This file is available as Supplemental File S2. The resulting high-quality marker collection was subjected to imputation using the program NGSEP_Impute_VCF_2.1.4 [16] in the CyVerse discovery environment [17], yielding 251,576 markers. The imputed vcf file can be found in Supplemental File S3. This file was further filtered for minor allele counts greater than 5 using Tassel5. The filtered imputed file is available as Supplemental File S4. Basic information on observed and expected allelic frequencies was determined using the program NGSEP Diversity Statistics 2.1.4 in the CyVerse discovery environment and can be found in Supplemental File S5.

Imputation Assessment
The imputation program, NGSEP_Impute_VCF_2.1.4 [16], was tested for effectiveness by using a script to artificially delete a random proportion of existing, non-missing genotype values. The sample vcf file was processed using Python's pandas library. Once loaded into a pandas data-frame, a custom python script was used to generate a list of random, non-missing data points, along with their corresponding IDs and line IDs to store their coordinates within the data structure. These data points were then replaced with the characters used to indicate missing data ('./.' in this case). For our test, we replaced 0.0002%, approximately 13,000 of the non-missing genotype values in the file "all.taxamerged.filtered.recode.vcf.gz". A list of replaced values was saved as a csv file.
The imputation program was then run on the file with the 13,000 replacements, yielding a new file where the replaced values were imputed. Finally, a custom python script was used to compare each replaced genotype value from the list of replaced values against its corresponding data point in the imputed file (Supplemental File S3). We then calculated the percentage of correctly imputed data points overall and for each genotype.

Population Analysis
The program TASSEL5 was used to produce a kinship matrix (Supplemental File S6) based on the imputed vcf file (Supplemental File S4) using the default option "centered IBS". Model-based Bayesian clustering to infer the sub-population structure was performed with STRUCTURE (v 2.3.4) [18]. To speed up computation, genotype calls with a genotype quality (QG) score less than 50 were removed and loci missing more than 5% of the genotypes were removed and thinned to 1/10 kb or 1/50 kb, thus resulting in two genotyping datasets. The remaining 25,665 or 11,386 single-nucleotide polymorphisms (SNPs), for the 10 or 50 kb datasets, respectively, were independently analyzed with the default admixture model (K: 2-12; 9 replications; burn-in 20,000; simulation runs 80,000). CLUMPP [19] was used to align independent runs and allow comparative cluster allocations at each K value. An individual was assigned to the cluster that the largest fraction of its genotypes belonged to, with designation of admixtures set at <50%. Individual lines associated with each bar in the cluster analysis histogram can be found in Supplemental File S7. Due to the nature of the breeding population, the ad hoc methods described [20] were not useful in judging the number of optimal Bayesian clusters. Instead, we examined the change in the average Ln PR (X|K) values between K and K + 1 and analysis of each individual's proportional cluster allocation was also useful in determining K values exhibiting uninformative admixture. Nucleotide diversity (π) with a window size of 50 kb to approximate the average linkage decay rate with a 10 kb step and Tajima's D with a a window size also of 50 kb were calculated using subroutines in the program VCFtools [21]. These data can be found in Supplemental File S8. An unrooted neighbor-joining tree was constructed using Tassel5 and can be found in Supplemental File S9.

Linkage Disequilibrium Decay Analysis
Pairwise linkage disequilibrium (LD) was calculated in TASSEL5 [22] for all SNPs on each chromosome as the squared allele-frequency correlations (r2) following [23] (see Supplemental File S10 for the distance matrix). The decay of linkage disequilibrium (LD) with physical distance was estimated using a nonlinear regression analysis of linkage disequilibrium (LD) between SNP marker loci (r2) versus the genetic distance between SNP sites in base pairs [24]. Under the drift-recombination equilibrium, the expected value of r2 is E(rˆ2 ) = 1/(1 + C), where C = 4Nc, N is the effective population size, and c is the recombination fraction between SNPs [25]. Under the assumption of a low level of mutation and an adjustment of sample size, the expectation becomes [26]: The data were pooled for the canola A and C genome and fit this equation into a nonlinear regression model using the statistical package R version 3.4.3 (R Development Core Team 2017). These data are available in Supplemental File S11. We also measured the rate of LD decay individually for all nineteen canola chromosomes.
Local linkage decay rates were determined using a similar method with a custom R script to determine the LD decay rate in all 500 kb increments across all chromosomes using a 10 kb step (Supplemental File S12 provides the R script and Supplemental File S13 provides a table showing LD decay values for each window).

Phenotypic Analysis of Agronomic Traits of Interest
In order to assess the possibility of genetic variation for traits of agronomic interest in these populations, the diversity panel populations were planted in the fall of 2015 near Manhattan, KS and were arranged in a one replicate test with repeated checks in both growing seasons. The experimental setup had three row plots with a plot size of 3 m long by 0.76 m wide. Seeding rate was equal to common seeding rate for winter oilseed rape in the southern Great Plains 4.3 kg ha −1 or approximately 140 seeds m −2 under dryland conditions with no irrigation. Soil types in 2015-2016 were Ivan and Kennebec silt loam (fine-silty, mixed, superactive, mesic Cumulic hapludolls) and in 2017-2018 were Rossville silt loam (fine-silty, mixed, superactive mesic Cumulic Hapludolls).
The following phenotypic traits were measured: stand establishment (rated on a 0 to 10 scale with 0 = no stand and 10 = complete stand), plant vigor (rated on a 1 to 5 scale with 1 = least vigorous to 5 = most vigorous), and stem elongation (rated on a 1 to 5 scale with 1 = excessive and 5 = no stem elongation). Phenotypic traits measured in the spring included: rosette growth habit (rated on a 1 to 3 scale with 1 = upright, 2 = intermediate, and 3 = prostrate); winter survival (rated as the percentage of fall stand surviving the winter); days to flowering date after 1 January when 50% of plants within a plot had achieved one open flower); and plant height (measured in inches). A second measure for days to flowering was similarly measured in 2017. Above normal spring temperatures in 2016 resulted in much earlier flowering times. Dry conditions and below normal spring temperatures in 2018 resulted in later flowering times. Meteorological data for the field location in both years can be found in Supplemental File S14.

SNP Identification and Filtering
GBS identified~300,000 SNPs among the 433 winter rapeseed lines that were sequenced. After filtering and imputation, we were left with 429 lines with 251,575 single nucleotide polymorphisms (SNPs) mapped to the reference canola genome. An analysis of the imputation program indicated that 90.67% of the total data points were correctly imputed. We found that the imputation program worked well on homozygous genotypes, imputing 99.81% of major alleles and 85.07% of minor alleles. The imputation program was unable to correctly impute any heterozygous genotypes. However, these were rare as B. napus is a primarily self-pollinating species (see below). This imputed vcf file is available as Supplemental File S2. Analysis of the imputed SNPs (Supplemental File S3) indicated that 97,666 loci (~38%) had a minor allele frequency >0.05 but that 190,375 had minor allele counts >5.0. The average expected heterozygosity for all alleles was 13.7%, which was greater than the average observed heterozygosity (6.6%), as might be predicted for a primarily inbreeding species. Seeds from these lines will be made available upon request. The distances between SNPs range from 1 base to 281,248 bases, with 4462.8 and 69 bases for the average and median distances, respectively. Marker density was not even, and clustering of markers can be observed in regions with small and large distances between markers (see Figure 1 for an example). Analysis of LD for a given chromosome is shown in Figure 2, and two range from 7 kb to~65 kb. Regression analysis of the LD indicated that the distance between SNPs decreased sharply with R2 values greater than 0.2, with a value of 11.35 kb for the A genome and 33.88 kb for the C genome (Supplemental File S11). Thus, the marker density is sufficiently high enough to expect viable associations for most loci. More localized LD indicated a range of LD decay intervals from 1 to 491,136 bases for any given 500 kb window (Supplemental File S13). An example for the LD decay along chromosome A01 is shown (Figure 3).

Population Analysis
The population structure of our winter rapeseed lines was found to have significant genetic variability with 2-4 major clusters depending on the type of analysis conducted (Figures 4 and 5). The distance between individuals within any given cluster is generally consistent, as are the distances

Population Analysis
The population structure of our winter rapeseed lines was found to have significant genetic variability with 2-4 major clusters depending on the type of analysis conducted (Figures 4 and 5). The distance between individuals within any given cluster is generally consistent, as are the distances

Population Analysis
The population structure of our winter rapeseed lines was found to have significant genetic variability with 2-4 major clusters depending on the type of analysis conducted (Figures 4 and 5). The distance between individuals within any given cluster is generally consistent, as are the distances Nucleotide diversity and Tajima's D values were calculated along the chromosomes. There was relatively little difference between the average values between the A and C genome (4.89 × 10 −5 and 3.82 × 10 −5 for π, and 0.51 and 0.4 for Tajima's D values, respectively) (Supplemental File S8). There were localized troughs in π that corresponded to negative localized values for Tajima's D in the same region. These may be indicative of regions undergoing some selection within the entire population. It should be noted, though, that this population is part of an active winter B. napus breeding program and thus some selection for agronomic and domestication traits might be expected.

Population Analysis
The population structure of our winter rapeseed lines was found to have significant genetic variability with 2-4 major clusters depending on the type of analysis conducted (Figures 4 and 5). The distance between individuals within any given cluster is generally consistent, as are the distances between clusters. Additionally, the neighbor-joining phylogenetic tree indicates a similar number of major branches (Supplemental File S9). Thus, we conclude that there is considerable genetic diversity in this population but that there are genetic relationships among the members of this diversity panel. Thus, predicted models that include kinship relationships and population structure will usually provide the most accurate inferences. This was proven to be the case, not only with this study but also with a parallel study on freezing tolerance following deacclimation using this same population [14].
Agronomy 2020, 10, x FOR PEER REVIEW 7 of 12 between clusters. Additionally, the neighbor-joining phylogenetic tree indicates a similar number of major branches (Supplemental File S9). Thus, we conclude that there is considerable genetic diversity in this population but that there are genetic relationships among the members of this diversity panel.
Thus, predicted models that include kinship relationships and population structure will usually provide the most accurate inferences. This was proven to be the case, not only with this study but also with a parallel study on freezing tolerance following deacclimation using this same population [14].

Phenotypic Diversity of Numerous Traits of Interest to Breeders
The population was assessed for several different traits under field conditions from plots of plants grown in 2015 (and 2017 in the case of 50% bloom date). The range of phenotypes is shown in Agronomy 2020, 10, x FOR PEER REVIEW 7 of 12 between clusters. Additionally, the neighbor-joining phylogenetic tree indicates a similar number of major branches (Supplemental File S9). Thus, we conclude that there is considerable genetic diversity in this population but that there are genetic relationships among the members of this diversity panel.
Thus, predicted models that include kinship relationships and population structure will usually provide the most accurate inferences. This was proven to be the case, not only with this study but also with a parallel study on freezing tolerance following deacclimation using this same population [14].

Phenotypic Diversity of Numerous Traits of Interest to Breeders
The population was assessed for several different traits under field conditions from plots of plants grown in 2015 (and 2017 in the case of 50% bloom date). The range of phenotypes is shown in

Phenotypic Diversity of Numerous Traits of Interest to Breeders
The population was assessed for several different traits under field conditions from plots of plants grown in 2015 (and 2017 in the case of 50% bloom date). The range of phenotypes is shown in Figure 6.
These results are consistent with, but not conclusively indicative, of the hypothesis that this diversity panel will be useful for identifying loci controlling any of these traits.
Agronomy 2020, 10, x FOR PEER REVIEW 8 of 12 Figure 6. These results are consistent with, but not conclusively indicative, of the hypothesis that this diversity panel will be useful for identifying loci controlling any of these traits.

Discussion
Here, we report on a genotyped canola diversity panel, consisting primarily of winter biotypes, that contains sufficient genetic diversity to serve as a valuable resource for identifying loci associated with a multitude of traits. Similar panels have been used effectively in crops such as corn [27] and model plants such as arabidopsis (Arabidopsis thaliana Heyn) [28]. Our diversity panel has been used to identify loci associated with freezing tolerance following deacclimation [14] and, as demonstrated above, contains significant phenotypic diversity for a number of other different traits. Thus, it seems likely that there is a range of genotypic variation within our canola diversity panel controlling traits of significant agronomic importance.

Marker Density and Linkage Decay Are Higher Than Previously Reported
The local linkage decay rate suggests that linkage decay, on average, occurs in less than 23.888kb in our population. However, an analysis of LD decay for individual chromosomes indicated a much smaller interval (as small as 7 kb on Chrm A01, for example) and some regions where the local LD decay interval is very small (<100 bases).
Interestingly, the rate of LD decay rate is significantly greater than has been observed in other diversity panels of canola. For example, LD decay rates for Chrm A01 have been estimated to be as high as ~457 kb in a different population with a marker density of approximately every 29 kb [29]. However, most estimates (for example, [30]) found Chrm A01 to have an LD decay rate of ~30 kb when marker density was at ~1/10 kb. As might be expected, we detected an increase in the LD decay interval when we included fewer rare alleles in our assessment (data not shown). However, inclusion of these rare but not unique alleles (MAF count >5) is justified because they are likely real as they occur in multiple individuals. These additional alleles could be uncovering more crossover events in

Discussion
Here, we report on a genotyped canola diversity panel, consisting primarily of winter biotypes, that contains sufficient genetic diversity to serve as a valuable resource for identifying loci associated with a multitude of traits. Similar panels have been used effectively in crops such as corn [27] and model plants such as arabidopsis (Arabidopsis thaliana Heyn) [28]. Our diversity panel has been used to identify loci associated with freezing tolerance following deacclimation [14] and, as demonstrated above, contains significant phenotypic diversity for a number of other different traits. Thus, it seems likely that there is a range of genotypic variation within our canola diversity panel controlling traits of significant agronomic importance.

Marker Density and Linkage Decay Are Higher Than Previously Reported
The local linkage decay rate suggests that linkage decay, on average, occurs in less than 23.888kb in our population. However, an analysis of LD decay for individual chromosomes indicated a much smaller interval (as small as 7 kb on Chrm A01, for example) and some regions where the local LD decay interval is very small (<100 bases).
Interestingly, the rate of LD decay rate is significantly greater than has been observed in other diversity panels of canola. For example, LD decay rates for Chrm A01 have been estimated to be as high as~457 kb in a different population with a marker density of approximately every 29 kb [29]. However, most estimates (for example, [30]) found Chrm A01 to have an LD decay rate of~30 kb when marker density was at~1/10 kb. As might be expected, we detected an increase in the LD decay interval when we included fewer rare alleles in our assessment (data not shown). However, inclusion of these rare but not unique alleles (MAF count >5) is justified because they are likely real as they occur in multiple individuals. These additional alleles could be uncovering more crossover events in the population. The smaller than usual LD decay intervals that we observed could also be due to the fairly high diversity in our population, which could provide a greater number of crossover events. However, given the LD decay rate observed in our population, it is clearly too great in many localized regions to allow for effective association mapping of this population with the Brassica 60K Illumina Infinium SNP array [31]. Fortunately, the SNP marker density from the GBS data appears to be quite sufficient for association mapping since, on average, linked intervals will contain five or more markers.
However, there was significant variation observed among distances contained within R2 values >0.2 such that, for some markers, only a few bases apart could appear to be unlinked. This was reflected in the variation in the local LD decay rate, where there were some fairly large chromosomal regions with virtually no linkage between markers with very small distances between them; see, for example, the region between position 12,835,274 and 14,073,635 on chromosome A01 (Supplemental File S13) where there is essentially no linkage between markers even a few bases apart in this region. Such regions represent chromosomal segments that have been highly randomized in the population and suggest that these are areas of exceptionally high recombination rates. Likewise, there are regions with exceptionally low LD decay rates as shown by the large peaks in Figure 3. Such regions likely represent areas of repressed recombination, such as around centromeric regions with high levels of heterochromatin.
In regions with high linkage decay rates, any association between a given trait and the linked marker would be impossible to identify unless the marker fell directly on the causative mutation. In our parallel study [14], it was rare for a marker to fall into the transcribed region of a gene if the LD decay interval was large. However, when the LD decay interval was very small, the associated marker often fell into the coding region of a gene in over 50% of such cases. However, this information is invaluable for identifying relevant intervals in which to look for candidate genes with high confidence. With the information provided in Supplemental File S13, arbitrary choice of intervals around markers of interest is not needed, and candidate genes can be more easily identified. One other interesting observation is that markers are often clustered along the chromosome. Thus, although the average marker distance is small, the power to detect associations may be somewhat hampered by distances between clusters of markers.

Population Structure Provides Information Useful for Genome-Wide Association Studies and Future Breeding Efforts
The population structure obtained through multiple analysis methods indicated that this diversity panel is highly divergent but consists of two to four primary sub-populations. Thus, the population of canola making up this diversity panel appears to be similarly structured relative to previous populations that have been used in the past for conducting GWAS studies on canola. Several other studies in which similar population structure analyses have been done also identified two to four primary sub-populations [9,13]. The observation that many of the lines contained some proportion of admixtures is likely due to the fact the population has been used primarily for breeding purposes and thus likely contains individuals that have been crossed at multiple times in the recent past. Indeed, this fact might also explain the higher than expected rate of LD decay that we observed since outcrossing might be higher than expected from a primarily self-pollinating species. It is also noteworthy that the majority of the 39 spring lines present in this primarily winter canola diversity panel fall within population two, as shown by the orange color in Figures 4 and 5, with a few spring lines falling within clusters one and three in Figures 4 and 5 (and Supplemental File S7). Additionally, some true winter types also fall within cluster two (Supplemental File S7). This is interesting since there was no obvious segregation of populations based on growth habit in a similar mixed population (see, for example, [32]). It is likely that the clustering that we observed is an artifact resulting from the fact that this is a breeding population rather than a natural population. Thus, some kinship might be expected as the breeding lines for this trait might be selected based on this important agronomic trait. The lack of clustering of spring lines in the study observed [32] was expected since any loss of function mutation in the FOWERING LOCUS C pathway would result in a spring growth habit and thus might be expected to have arisen multiple times independently. The contamination of this primarily winter rapeseed population by some spring biotypes should be taken into account in any future use of this population to study the genetics of flowering time.

Conclusions
In conclusion, we have collected and refined a large diversity panel of primarily winter rapeseed lines for GWAS that should be useful in identifying multiple genetic elements controlling numerous traits of agronomic interest. We have made available a well characterized and exceptionally dense set of genetic markers, LD tables, and kinship maps for other researchers interested in using this resource. Finally, we note that this tool has demonstrated usefulness as it has been successfully used to identifying multiple genetic loci associated with freezing tolerance following deacclimation [14] from this population that match those found by other association studies for similar traits in other systems. It is our hope that this resource will be useful for breeders, plant physiologists, and developmental biologists interested in studying various aspects of winter rapeseed biology and diversity.  Table showing individual lines associated with each bar on the histograms from Figure 4. Spring lines are highlighted in yellow. Supplemental File S8: Table containing the Nucleotide diversity for each sliding window with the Tajima's D values noted for each window start site. Supplemental File S9: PDF file with image of the unrooted neighbor-joining tree generated from the imputed and filtered VCF file in Tassel5. Supplemental File S10: Linkage disequilibrium decay table. Supplemental File S11: PDF file with images of the regression analysis from the individual A and C genomes. Supplemental File S12: R script used to define local LD decay. This program should work with any LD decay table generated by the program Tassel5. Supplemental File S13: Table providing the start end and mid-point of each window, and the LD decay rate for that window in bases, as well as the model fit statistics. Supplemental File S14: Meteorological data for field study years.