Genome-Wide Association and Genomic Prediction for Fry Color in Potato

Potatoes destined for crisping are normally stored above 8 degrees; below this glucose accumulates leading to very dark fry colors and potential acrylamide build up. Unfortunately, sprouting occurs above 4 degrees and impacts product quality, necessitating the use of sprout suppressant chemicals. Therefore, a goal of breeders is to develop potatoes with excellent fry color, which is maintained under storage below 8 degrees. Genomic or marker-assisted selection offers an opportunity to improve the efficiency of potato breeding and thereby assist breeders in achieving this goal. In this study, we have accumulated fry-color data on a large population of potato lines and combined this with genotypic data to carry out a GWAS and to evaluate accuracy of genomic prediction. We were able to identify a major QTL on chromosome 10 for fry color, and predict fry color with moderate accuracy using genome-wide markers. Furthermore, our results provide evidence that it is possible to identify a small subset of SNPs for processing characteristics that can give moderate predictive ability, albeit lower than that achieved with genome-wide markers.


Introduction
Conventional phenotypic selection is carried out by many potato breeders. In the breeding program at Teagasc a cycle of breeding is initiated each year with 200-300 pair-crosses that produce over 100,000 true seed. All target ideotypes are then selected from this base population over the next 12 years. In the early years negative selection is employed to reduce the number of seedlings from 100,000 down to approximately 2500 by year four. Marker-Assisted Selection (MAS) is then employed to identify entries with favourable alleles against some common potato pests and diseases. By year seven the number of entries being evaluated has dropped to approximately 50, and at this point multi-location field trials are carried out to record accurate phenotypes for a large suite of traits. A further seven years of multi-environment trialing and phenotyping typically yields one to three varieties from a cycle of selection. A downside of this selection scheme is that our ability to collect phenotype information for many important traits in the first years of the program is limited. Clonal individuals are represented by single tubers in the first field generation, rising to three plots (single location) in the third field generation. Selection based largely on visually assessable characteristics in a single location and year carries the risk of eliminating individuals with favourable characteristics due to chance poor performance. It could be argued that a simple solution is simply to carry forward greater numbers, but most breeding programs are constrained by logistical and financial limitations which partly determine selection intensity. Providing more information on non-visually assessable traits earlier in the program provides a better decision-making framework for breeders in the early stages of a program. Approaches to improve genetic gain in potato breeding using pedigree or genomic information have recently been discussed [1,2].
One approach using best linear unbiased prediction (BLUP) analysis with pedigree information to estimate breeding values has already been shown to lead to increased genetic gains, particularly for low heritability traits [2]. Genomic selection uses phenotyped and genotyped entries from a training set to predict phenotypes of new individuals based only on genomic information and has already been evaluated in various crop species [3][4][5][6]. A recent review concluded that genetic gain can be substantially improved by implementing genomic selection in potato breeding programs [1], and a recent study has also demonstrated empirically that genomic prediction for processing traits shows promise [7]. Genomic selection may be particularly beneficial when many traits can be selected for using the same genotyping data. An example would be developing potato varieties for the processing industry where a variety needs to have favourable performance for a large suite of traits, including fry color, yield, sugar stability under storage, percentage dry matter, tuber shape, flesh color, skin color, eye depth, tuber number, Potato Cyst Nematode (PCN) resistance, bruising, common scab, powdery scab, spraing resistance, blackleg resistance, blight resistance, and Potato Virus Y (PVY) resistance. A body of work has already been completed in an attempt to identify QTL or genes linked to processing characteristics [3,[8][9][10][11][12][13][14][15].
A prerequisite for genomic selection is a genotyped and phenotyped training population for model development. The reduced cost of sequencing and the availability of genomic resources such as a reference genome [16] make it now feasible to characterize genetic variation on a genome-wide scale in potato populations. A greater challenge is the collection of accurate phenotypes for all target traits on the reference population. This is challenging due to time and cost of multi-year and multi-environmental trials on a sufficient number of lines to establish a training population. Resource constraints mean that the most efficient way to collect such data is to aggregate it from later stages of active breeding programs. However, there are generally much lower numbers of breeding lines under selection at these stages, meaning data must be aggregated over longer periods to enable sufficient population sizes. Other issues, such as unbalanced datasets and the fact that selection may have reduced allelic variation at this stage may also be problematic.
In this study, we have collected phenotypes and genotypes for multiple reference populations consisting of entries under evaluation in the Teagasc breeding program. These data were then used to build genomic prediction models that were evaluated on testing sets not used for training. Predictive abilities varied from low to moderate depending on the training and testing sets used. Interestingly, our results suggest that marker number can be greatly reduced with limited impact on predictive ability; that may permit deployment of inexpensive marker assays for prediction of potato processing characteristics.

Genotyping and Phenotyping Potato Lines
We accumulated phenotypic data on lines over three years (2015-2017) from material undergoing evaluation in the third field generation of the Teagasc breeding program. Each line was only evaluated in a single year and location. At the 2015 harvest we processed tubers three weeks after harvest (referred to as 'off-the-field') and also stored tubers for extended periods at either 4.5 • C, or 8 • C with chlorpropham treatment. The correlations between data sets were high (Table 1) and in general we observed that entries which had very light fry color (as measured by the HunterLab L value) when fried 'off-the-field' tended to have very light fry colors when fried after long-term storage. In subsequent years we focused our phenotyping efforts on material 'off-the-field' (OTF) and after long-term storage (LTS) at 4.5 • C for 230 days and from this point on we will only discuss results related to these time-points. It is clear from the results that the median HunterLab L values for fry color in the training population were greatest in the data set fried 'off-the-field', and there was a distinct decrease in median HunterLab L values for tubers stored at 4.5 • C ( Figure 1). The mean HunterLab L values for the populations fried 'off-the-field' was 13 to 26% higher than tubers stored at 4.5 • C for ca. seven months. Additionally, phenotypes were available for lines that had been evaluated for fry-color OTF in multiple locations and over multiple year (referred to as test panel). All lines were genotyped using a genotyping-by-sequencing approach after genome complexity reduction with the restriction enzyme ApeKI. The sequencing depth required to distinguish between the three heterozygous states in autotetraploid potato has been estimated at 60 [17]. As we did not have sufficient sequencing depth at each locus we treated the samples as diploids and determined genotypes according to rules developed in autotetraploid alfalfa [18]. A SNP database of 46,406 SNPs was developed using genotype data from all lines, and was used for developing genomic prediction models for fry color, and also for QTL identification with a GWAS. The SNP database had SNPs covering all potato chromosomes, and the number of SNPs ranged from 2707 on chromosome 10 to 5424 on chromosome 1. SNPs clustered towards the telomeres, which corresponds to regions of much higher gene density in potato [16]. It also confirms that Genotyping-By-Sequencing (GBS) with ApeKI digestion (methylation sensitive) in potato largely avoids the heterochromatin located in the pericentromeric regions. On average the SNP rate across the genome is one SNP every 17,469 bases; however, GBS is only interrogating regions around a portion of the ApeKI restriction sites. SNPs were well distributed across all genomic regions (downstream: 27.99%, exon: 22.30%, intergenic: 11.00%, intron: 11.42%, upstream: 20.18%, utr: 6.40%, splice sites: 0.71%), with 13,407 genes tagged with at least one SNP. Lines with more than 10,000 missing genotypes (22%) were removed and remaining lines with matching phenotype and genotype data were used for GWAS and genomic prediction ( Table 2). Table 2. Number of lines with sufficient genotype and phenotype data for further analysis. The three populations (2015-2017) from the third field evaluation year were used for GWAS and development of genomic prediction models. A collection of lines from highly unbalanced multi-location and multi-year trials (test panel) was used to further evaluate the prediction models developed with the 2015-2017 data.

Genome-Wide Association Analysis
A GWAS was performed within each year separately to identify significant QTL for fry color 'off-the-field' and after long-term storage at 4.5 • C. We did not identify any SNPs significantly associated with fry color after long-term storage at 4.5 • C in any of the three populations. However, we did identify QTL significantly associated with fry color 'off-the-field' in the 2017 population; the largest population available for analysis (Table 3, Figure 2). We identified significantly associated SNPs on chr04 and chr10, with the strongest signal on chr10. The SNPs on chr04 (chr04:67971220 and chr04:68008112) are proximal to a tuber-specific and sucrose-responsive element binding factor (PGSC0003DMG400003316, chr04:67630128-67632587).
The greatest number of SNPs were located on chr10 within the region from 49 Mb to 58 Mb, and several genes associated with sucrose cleavage, synthesis, metabolism and starch storage are located within the 50 to 60 Mb region of chr10 [16]. These include three invertase inhibitors, a sucrose-phosphatase, a cell-wall invertase, a fructose-1,6-biphosphatase, and two patatin genes.

Using Genome-Wide Variants to Predict Fry Color
We developed genomic prediction models with data from each of the three years, and used these to predict fry color 'off-the-field' and fry color after long-term storage at 4.5 • C in the remaining years. We also used the three models to predict fry color 'off-the-field' in a test panel made up of advanced breeding lines from multiple years at later stages of the Teagasc breeding program. The mean predictive ability ranged between 0.11 and 0.77 for fry color 'off-the-field', and between 0.24 and 0.66 for fry color after long-term storage at 4.5 • C (Table 4). Table 4. Predictive ability for fry color 'off-the-field' and after long-term storage at 4.5 • C using various combinations of training and testing sets and four statistical models (bias is shown in brackets). Our ability to predict 'off-the-field' fry color was greater than our ability to predict fry color after long-term storage at 4.5 • C, likely reflecting the additional complexity of the trait and in keeping with our ability to detect QTL for the former but not the latter. There was little difference in predictive ability across the different models evaluated, with the exception of the models developed using Random Forest, which resulted in lower predictive ability in all cases.
The predictive ability dropped when 2016 data was used as either training or test panel. It can be seen from the Genomic Relationship Matrix (GRM) (Figure 3) that the lines from 2016 set have a low relationship to the other populations, which likely explains the poorer predictive ability when 2016 was used as either a training or testing set. Conversely, we can see many high intensity genomic relationship values between lines in 2015 and 2017, which likely explains the greater predictive ability when these are used as training and testing sets.

Using Selected Variants to Predict Fry Color
We also developed predictive models using a subset of SNPs identified in the GWAS. Using the significant GWAS SNPs identified in the 2017 lines (31 SNPs), we were able to predict fry color 'off-the-field' in the 2015 population with a predictive ability of 0.45 and no bias.
In addition to using the GWAS to select subsets of SNPs for prediction, we performed variable selection using the variable importance measures from Random Forest. This was done with the 2015 and 2017 data sets and for both fry color 'off-the-field' and after long-term storage at 4.5 • C. In all cases the top 25 variables were widely spread across chromosomes (Figure 4), and the two SNPs identified on chr04 with the GWAS in the 2017 data set were in the top 25 ranked SNPs identified using variable importance measures.  These SNPs were located at 67.97 and 68.00 Mb and are proximal to a tuber-specific and sucrose-responsive element binding factor (PGSC0003DMG400003316, chr04:67630128), and an Alpha-amylase (PGSC0003DMG400007974, chr04:68255931) involved in starch degradation. The SNP on chr01 at 75.29 Mb that ranked high in importance in the 2015 data set is near three sugar transporters at 76. 16-76.23 Mb; furthermore markers associated with fry color have been found on chr01 at 43.8 cM [8] and QTL have been identified in linkage mapping studies [9]. A number of SNPs in the top 25 in both data sets and for both traits were located on chr09 and markers associated with fry color have been identified in association panels on chr09 [8]. One SNP at 51.35 Mb identified as the second most important variable for fry color after long-term storage at 4.5 • C in 2015 data set was proximal to a gene involved in sugar transport (Sugar transporter, PGSC0003DMG400003848, chr09:51364561).
We used the variable importance measures to select increasing numbers of ranked SNPs to develop genomic selection models for prediction in the data set that was left out of both variable selection and model training. The predictive ability was higher with selected SNPs compared to a random SNPs at lower SNP numbers; however, the difference disappeared as we increased SNP number (Table 5). Table 5. Predictive ability for fry color 'off-the-field' and after long-term storage at 4.5 • C using selected or random markers (bias is shown in brackets). Selected markers were identified in the training population (either 2015 or 2017) using variable importance measures in Random Forest. In the case of randomly selected SNPs the predictive ability is the mean of 100 iterations of random SNP selection.

Discussion
In this study, we present the results of a simple empirical evaluation of predicting fry color with DNA-based markers. Markers were generated using a genotyping-by-sequencing approach following genome complexity reduction with the restriction enzyme ApeKI. Predictive abilities were assessed as a function of statistical algorithm and marker density. We also present the results of a GWAS to identify QTL for fry color and low-temperature sweetening.
We did not observe any great difference between statistical algorithms in terms of predictive ability, which is in agreement with other studies [3,7,19]; with the exception that models developed with Random Forest had lower predictive ability. Predictive abilities were promising for both 'off-the-field' fry color and fry color after long-term storage at low-temperature. This is in general agreement with a recent study on genomic prediction of chipping quality in potato [7]. Our predictive ability was high (0.77) for 'off-the-field' fry color when training with the 2017 data set and predicting in 2015 data set. Similarly, when training with the 2015 data set and predicting in the 2017 data set, the predictive ability was 0.75. The predictive ability varied across training and test population combinations with the lowest predictive abilities observed when 2016 was used as either a training or testing set. This reflects the lower relationship between lines in 2016 and other data sets used. The lower levels of relatedness of the 2016 material was most likely due to the presence of entries from a parallel experimental program for pyramiding and multiplexing disease resistance loci in that year. This resulted in a different parental profile and lower rate of selection in this material. This is similar to previous studies in plants [20,21] and predictions across breeds in animals [22,23], and emphasises the importance of a good relationship between training set and selection candidates.
Our GWAS failed to identify QTL for resistance to low-temperature sweetening but did identify QTL for fry color 'off-the-field' in the 2017 data set. Two SNPs on chro04 at 67.97 and 68.00 Mb were associated with fry color. In particular the SNP at 67.97 Mb is proximal to a tuber-specific and sucrose-responsive element binding factor. These two SNPs were also identified with the variable importance measures in the 2017 data set. Previous studies have identified QTL for fry color in two association panels on chromosome four [8]. Our strongest QTL signal was on chr10 where a large cluster of associated SNPs was identified between 49 and 59 Mb, peaking at 55.28 Mb. Other genome-wide association studies looking at fry color have been carried out. A QTL for fry color has previously been detected at 57.6 Mb in a panel of varieties characterized by several Dutch breeding companies [8]. Another GWAS in a diversity panel did not identify QTL for processing quality [24] although the authors concluded that more lines and higher marker density were required. A recent study [7] also identified a cluster of SNPs associated with fry color on chromosome 10 in the region between 50 and 60 Mb, and our study now reproduces those findings in a different population; indicating that this may be an important region for fry color in potato. In future, genotyping approaches that enable distinction among the three heterozygous states, and support multi-allelic haplotype analysis may increase our power to detect marker-trait associations.
We selected all markers significantly associated with fry color in 2017 data set (31 SNPs) and used these to develop genomic selection models to predict fry color in 2015 data set, which resulted in a predictive ability (0.45), substantially lower than predictive ability with entire marker set. The majority of markers in this subset are within the 10 Mb region on chr10. We also identified and ranked variables using variable importance measures and selected increasing number for development of prediction models. In this case the marker subset is spread out across chromosomes. While predictive ability was lower than using entire marker set, it was higher than randomly selected markers at lower marker number. As we increase the marker number the difference between markers selected via variable importance measures and random selection reduced. Using the top 10 ranked SNPs our predictive ability for both traits ranged between 0.50 and 0.59 depending on which year was used as a training set. The ability to generate predictions with smaller sets of molecular markers is essential if we are to implement DNA-based selection strategies in classical potato phenotypic selection schemes.
Various strategies to improve breeding in potato using pedigree [2] and/or marker-assisted selection strategies [1] have been proposed. In some cases these require significant alterations to breeding schemes, including the classical phenotypic selection scheme outlined. One of the downsides of these schemes is that our ability to phenotype and make accurate selections in early years is very low. There is an opportunity to practice DNA-based selection in these early years, provided low cost DNA evaluations can be carried out. Within our breeding program, marker-assisted selection using low cost diagnostic markers for disease resistance is already carried out on 1000's of entries in single plot trials. Using genome-wide markers for selection in early stages of these schemes when numbers are large is currently not feasible; however, if we can identify smaller sets of markers that together have good predictive ability then there are opportunities to develop inexpensive marker systems [25] and practice marker-assisted selection for both simple (e.g., disease resistance) and more complex (e.g., processing) traits at high selection intensities. In particular, we envisage being able to develop a genotyping platform based on amplicon sequencing that is (i) inexpensive, (ii) multi-allelic, and (iii) adaptable (markers can easily be added or removed from the assay). An example of such an approach is outlined (Figure 5), which has the advantage of being flexible to enable inclusion of new loci and/or estimating the effects of new alleles as new material is introduced to the initial crossing schemes.  Figure 5. Pathway to implementation of marker-assisted selection in potato breeding. An initial reference population representing the breeding material is both genotyped and phenotyped for target traits. These data can be used to identify markers linked to traits and develop prediction models. Markers linked to QTL for complex traits and markers diagnostic for disease resistance can be used in development of an inexpensive genotyping assay for deployment on selection candidates.

Conclusions
Experimental results presented in this manuscript provide further support for the implementation of genomic prediction in potato breeding, and further evidence for a major QTL on chromosome 10 for fry color. Furthermore, our results provide evidence that it is possible to identify informative SNPs for processing characteristics, and that these SNPs have predictive abilities approaching those of genome-wide marker sets.

Phenotyping Training and Test Panels
The training populations consisted of lines collected from the breeding program over a period of three years (2015-2017). Lines were evaluated in 20-tuber plots and each line was phenotyped in a single year. In 2015 the tubers were harvested on the 7th October 2015 and collected for storage and subsequent fry analysis. Tubers from each plot were divided into six batches of ten tubers for drying and storage at either 4.5 • C or 8 • C (with chlorpropham treatment for sprout suppression), and removed for phenotyping at various time-points (Table 1). Four tubers (from the same plot) were selected from each entry at each time point and fried to evaluate crisp color. Tubers were sliced with a Hobert slicer to generate crisps with a thickness of 1.25 mm, and were deep fried for three minutes at 175 • C. Crisp color was then measured using a HunterLab Labscan XE Spectrophotometer (400-700 nm) in an upward configuration through a transparent petri-dish. Hunter L values were recorded, which indicates the level of lightness or darkness of crisps. Sample preparation and presentation were kept consistent across years to avoid variation due to sample preparation/presentation. The arithmetic mean of the four samples from each entry was calculated and used in subsequent analysis. In subsequent years (2016-2017) we focused phenotyping on samples collected off-the-field and those stored at 4.5 • C for ca. seven months. Again, the arithmetic mean of four tubers from each entry was calculated and used in subsequent analysis (https://doi.org/10.6084/m9.figshare.11298941).
The testing panel consisted of 56 lines with data on fry color 'off-the-field', which was collected as part of the breeding program over a six year period (2012-2017) with up to five locations per year and three replicate plots per location. Fry color was evaluated using a HunterLab Labscan XE Spectrophotometer, and not all 67 lines were evaluated together in a common field site, making it a highly unbalanced data set. BLUPs for fry color of each line were calculated using line as a random effect and year, location and the interaction as fixed effects.

Genotyping Training and Testing Panels
Leaf material was harvested from each entry and freeze dried for 48 h prior to tissue disruption on a bead mill and DNA isolation. DNA was isolated using a modified version of the CTAB protocol [26], and pellets were dissolved in 100 µL of TE 0.1 mM EDTA and treated with RNAse A for removal of RNA contamination. DNA samples were transferred to 96-well plates, quantified using a PicoGreen Quant-It ds-DNA assay, and all samples diluted to 20 ng/µL. The Genotyping-By-Sequencing (GBS) protocol followed that of [27]. Briefly, DNA from each sample was digested with the restriction enzyme ApeKI that has a 5 bp recognition site. Digested DNA was ligated to adaptors containing one of 96 unique DNA barcodes and up to 96 samples were then pooled to generate a single library. Each library was amplified via PCR, quantified, and evaluated on a BioAnalyser prior to sequencing. Each library was sequenced on 2-3 lanes of an Illumina HiSeq 2500 to generate single-end (SE) reads of 100 bp.
Sequence data from the same library was concatenated and adaptor contamination was removed with Scythe [28] with a prior contamination rate set to 0.40. Sickle [29] was used to trim reads when the average quality score in a sliding window (of 20 bp) fell below a phred score of 20, and reads shorter than 40 bp were discarded. The reads were demultiplexed using Sabre [30] allowing a single mismatch, data output per sample was determined, and reads from each sample were aligned to the Solanum tuberosum reference genome [16] using BWA aln with default parameters [31]. The Genome Analysis Tool Kit (GATK) [32] was used to identify putative SNPs in the population, and only SNPs with a read map score of 30 were retained for further analysis. We used an approach developed in alfalfa for calling genotypes from GBS data in autotetraploids [18], where distinguishing between three heterozygous states is difficult with low read depth. Briefly, no attempt was made to distinguish between the three different heterozygous states present in an autotetraploid (ABBB, AABB, AAAB), a minimum of 11 reads were required to confirm a homozygote, and a minimum of two reads per allele and a minimum allele frequency for alternative allele of 0.10 were required to call a heterozygote. The minor allele frequency (MAF) was calculated based on these genotype calls and SNPs with a MAF ≥2.5% and with ≤15% missing genotype data were retained for further analysis. Sequence data have been submitted to NCBI under BioProject PRJNA566151.

GWAS to Identify QTL Associated with Fry Color
A GWAS was carried out separately on data from each year with the R package GWASpoly [24]. All heterozygous genotypes were treated as having the same effect (diploidized additive), and kinship was calculated using the realized relationship matrix (see QQ-Plot; Figure 6). The genome-wide false discovery rate was controlled using Bonferroni method (level = 0.05).

Genomic Prediction of Fry Color
We used four statistical algorithms for genomic prediction, ridge regression best linear unbiased predictor (rrBLUP) [33], Bayes A [34], Bayesian Lasso [35] and Random Forest [36]. rrBLUP was used to estimate marker effects in the R package rrBLUP [33], the two Bayesian approaches were implemented in the R package BGLR [37] with the following parameters: number of iterations = 5000, burn-in = 500 and thinning = 5. Random forest was implemented with the R package Random Forest (setting the number of variables at each split to 1/3 of the total variables, and using a terminal node size of five and minimum of 500 trees per forest). Predictive ability was calculated as the Pearson correlation coefficient between observed and predicted values.
Genomic prediction models were developed for each year and evaluated in other years. Predictive models developed for fry color 'off-the-field' in each year (2015-2017) were also evaluated in a test panel consisting of 56 lines.
We also selected markers from the GWAS to use in genomic prediction. The GWAS was carried out in the training population as described above and selected markers were used to build prediction models with the training population. These prediction models were then used for prediction in the testing population. Variable importance measures were carried out in Random Forest using the mean decrease in accuracy as the importance measure. Funding: This project has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement number 658031. The project is also supported by the Irish Department of Agriculture, Food and the Marine (DAFM) through the Virtual Irish Centre for Crop Improvement (VICCI) (Project no. 14/S/819).