Propagation Fidelity and Kinship of Tomato Varieties ‘UC 82’ and ‘M82’ Revealed by Analysis of Sequence Variation

: Plant varieties are named and released based on distinct, unique and stable characteristics but may be maintained separately by genebanks or stock centers under separate accession identiﬁcation numbers. Genetic heterogeneity of the original variety, genetic drift, failure to exclude cross pollination, and propagation error may erode the integrity of genetic resources. The availability of resequencing and genotyping data for duplicate samples enables an analysis to clarify the relationship between speciﬁc varieties or independently curated accessions of the same variety while also assessing the ﬁdelity of germplasm maintenance. We accessed both Single Nucleotide Polymorphism (SNP) array genotypes and resequencing data for two important tomato varieties ‘UC 82’and ‘M82’ that have been maintained as separate accessions in collections as important resources for the research and breeding communities. Our analysis of these data suggests that polymorphism rates from resequencing of cultivated tomato are overestimated in the literature due to heterozygous calls caused by either sequence error or coalignment of repetitive sequences. We deﬁned a set of 32,352 robust SNPs from a ﬁle containing data for all samples and we compared the distribution of data with SNPs called from a genotyping array. For both analyses, intravariety variation was found in haplotype blocks, with the same haplotypes identiﬁed using SNPs detected from array and sequence data. The distribution pattern of variation across the entire genome sequence was similar for both ‘UC 82’and ‘M82’. Overall, the di ﬀ erences between distinct accessions of a variety were nearly as great as the di ﬀ erences between ‘UC 82’and ‘M82’. The similarities between ‘UC82’ and ‘M82’ range from 99.33% to 99.74% and are highly consistent with a common pedigree and shared selection from partially inbred progeny. The data also suggest that these tomato genetic resources have been propagated with high ﬁdelity.


Introduction
High-density genotyping and genome sequence data are increasingly available for large numbers of crops and their wild relatives. The availability of genotyping and resequencing data for duplicate accessions of the same variety in public data bases permits questions to be addressed about the fidelity of propagation and also the genetic relationship, or kinship, between varieties. Genetic heterogeneity within individual accessions of a named variety has been investigated for major crops like soybean [1], maize [2], and the model species Arabidopsis thaliana [3]. At the time the reference sequence for the tomato variety 'Heinz 1706' was released [4] a large Single Nucleotide Polymorphism (SNP) genotyping array was available as a commercial resource [5][6][7]. Subsequently resequencing data from several large [8][9][10][11] and small scale projects [12,13] became available as did a one thousand variety data set based on the SNP array [14]. These genetic and genomic data were used to provide insight into the origin, domestication and breeding history of the tomato [6,8,9,14,15]. There is a growing set of information for duplicate samples of the same variety, which has been maintained by independent groups as distinct accessions. These data provide both experimental controls and provide a resource to ask specific questions about genetic relationships within tomato germplasm collections. We sought to use sequence and SNP data to examine the relationship or kinship between tomato varieties and assess the fidelity of maintenance or propagation within accessions of the same varieties.
In tomato, the Solanum lycopersicum L. variety 'M82' (also referred to as 'M82-1-8' and 'M82-1-8 (VF)') has become an important resource for genetic studies. 'M82' is used as a parent for stable introgression line populations [16,17], which serve as a resource for the discovery of quantitative trait loci (QTL) [18][19][20][21], map-based cloning [22,23], and genomic studies seeking to link form and function [24,25]. The relationship between 'M82' and the variety 'UC 82' has been the subject of debate. 'UC 82' was released based on yield and fruit setting ability as well as properties favorable to bulk handling at the beginning of the mechanical harvest era but prior to the wide spread use of hybrid tomatoes in the canning industry [26]. Its pedigree is well described in the variety release notification and elsewhere [26,27]. Published studies describe 'M82' as a selection from 'UC 82' [28,29], despite the latter being described as a stable variety [27]. In an analysis of population structure based on SNP data, 'M82' and 'UC 82' clustered together with varieties having pedigrees tracing to California breeding programs [6]. The lack of a published pedigree for 'M82' contributes to a vague understanding of its origin.
We accessed independent data sets based on Illumina genotyping and next generation sequencing for 'UC 82' and 'M82' accessions in order to examine the relationship between the two varieties and variability within accessions of the same variety. Genome-wide analysis demonstrated high propagation fidelity among individual accessions of the same variety and a close relationship between these important and widely used tomato varieties.

Plant Material
SNP genotyping and whole genome sequence data from different sources of 'UC 82' and 'M82' were obtained (Table S1). 'M82' accession LA3475 was provided from the Charles Rick Tomato Genetics Resource Center (TGRC), UC Davis, and maintained at The Ohio State University (OSU). Other sources of 'M82' were maintained in Israel (e.g., ERS484046) or were unlisted (e.g., TS-228, see below). Stocks from 'UC 82' accession LA1706 were also obtained from the TGRC and maintained separately at the Genebank of the Institute for the Conservation and Improvement of Agricultural Biodiversity (COMAV) and OSU.

SNP Genotyping Data from the Illumina "Solanaceae Coordinated Agricultural Project" (SolCAP) Array
Genotyping was performed using the commercially available Illumina SNP array (Illumina, Inc., San Diego, CA USA) for tomato developed as part of the "Solanaceae Coordinated Agricultural Project" or SolCAP, as described in previous studies [14]. Data for 7720 SNPs are available from Table S2 from the original manuscript [14]. Samples with genotype data from SNP array were UC 82-COMAV, UC 82-OSU, and M82-OSU (Table S1). SNP data from 'Heinz 1706', the reference genome variety for tomato, were also used to provide a parallel reference for the resequencing data set. The close wild relative Solanum pimpinellifolium. L. LA1589 provided an outgroup for analysis.
SNPs for 'M82' and 'UC 82' accessions were arranged according to physical position in the genome and clustered using the R pvclust 2.0-0 package [30]. The SNP data were used to estimate the percentage of residual heterozygosity for each accession and to calculate the proportion of shared and distinct SNPs in pairwise comparisons between the accessions. The physical size of genomic regions between contiguous SNPs relative to reference assembly version SL3.0 [31] was used to determine the divergent proportion relative to the total genome sequence among accessions.

Sequence Alignment and SNP Calling
SRA were converted into FastQ format using the fastq-dump tool from the SRA toolkit [32]. Quality control of the FastQ files was performed using FastQC version 0.11.4 [33]. The four samples analyzed were derived from different sequencing chemistry and resulted in different read lengths and quality. M82-TS3 reads averaged 50 base pairs (bp) with data Illumina 1.5 encoded and Phred quality scores on a scale of 0-62, while M82-ERS484046, M82-TS228, and UC82-TS171 reads data were Sanger/Illumina 1.9 encoded and averaged 100 bp with Phred quality scores on a scale of 0-93. The wrapper tool Trim Galore! version 0.3.7 [34] was used to remove adapters and trim reads. Trimming was performed according to samples, with paired reads shorter than 60 bp removed from sample M82-ERS484046, M82-TS228, and UC82-TS171 and shorter than 20 bp removed from sample M82-TS3. Quality of reads was filtered using Phred+64 for the M82-TS3 sample and Phred+33 for M82-ERS484046, M82-TS228, and UC82-TS171 to impose a uniform quality filter at a Phred score of 20 corresponding to 99% accuracy.
The trimmed FastQ files were aligned against the S. lycopersicum L. cv. 'Heinz 1706' reference assembly version SL3.0 [31] using Bowtie 2 version 2.3.2 [35]. During the alignment, the "-very-sensitive-local" option was used. This option sets a combination of parameters aimed to cover the speed, sensitivity, and accuracy of the process. The ending "-sensitive" implies a slower progression but more sensitive and accurate. In Bowtie 2 the "-local" mode of alignment was used to trim some characters from one or both ends when doing so maximized the alignment score. The option "-phred64-quals" was used to align M82-TS3, while the option "-phred33-quals" was used for the other three samples, consistent with appropriate options based on the sequencing chemistries and Illumina encoding.
The aligned files in the Sequence Alignment Map (SAM) format were sorted, labeled, and converted to a compressed binary version (BAM files) using Picard software version 2.18.14 [36]. Duplicate records were located using Picard. The resulting BAM files were analyzed with Qualimap version 2.2.1 [37]. Stats and graphical representations of the obtained coverage across the entire genome sequence were extracted.
Three different approaches were followed to perform whole genome sequence comparison among the four sequence samples. For both approaches, the HaplotypeCaller tool from GATK version 4.0.9.0 [38,39] was used to obtain Genomic Variant Calling Format (gVCF) files for each sample. These files had records for all sites whether there is a variant call present or not. The CombineGVCFs tool from GATK version 4.0.9.0 [38,39] was used to merge gVCFs into a single gVCF. The function GenotypeGVCFs in GATK version 4.0.9.0 [38,39] was used to perform joint genotyping for all samples included in each gVCF, resulting in Variant Call Format (VCF) files. As a first approach we compared the four samples in pairwise combinations, so six different VCF files were obtained. The second approach resulted in a single VCF containing all four samples. Finally, we also compared each sample separately against the reference and therefore a single VCF for each sample was generated. SNPs were extracted from VCF files with the SelectVariants tool from GATK version 4.0.9.0 [38,39] using the "-select-type-to-include" option. Default parameters for filtering SNPs were set to use the VariantFiltration tool from GATK version 4.0.9.0 [38,39]. The "-remove-filtered-all" option from the VCFtools version 0.1.15 [40] was used to remove all sites previously filtered with the VariantFiltration tool. The Linux "grep" command with the "-v" option was used to filter out lines in the VCF files containing missing data (marked as "/") for at least one sample.
Since regions with extremely low depth of coverage are unreliable and regions with extremely high depth of coverage are correlated with highly repetitive sequences and the alignment of those reads may not be accurate, descriptive statistics from the read depth of coverage at each SNP were calculated to determine the best cut off criteria. We excluded regions with less than 4× coverage as lower coverage does not allow correction for sequence error. We also filtered based on a maximum read depth of coverage for each sample to exclude repetitive sequences, which may be mapped on top of the same reference leading to an overestimation of polymorphism usually as heterozygosity. Maximum depth of coverage was set as d+4 √ d where d was the mean depth of coverage [41]. The SnpSift filter command from SnpSift version 4.0 [42] was used to filter the VCF files by depth of coverage.

SNP Density Comparison from Resequencing Data
Variation and genotype calling was performed for each sample to calculate the SNP density relative to the reference genome sequence. Single VCFs were used to estimate the percentage of residual heterozygosity for each sample. VCF files obtained from both approaches were used to calculate the proportion of shared and distinct SNPs by chromosome in each pairwise comparison between the four samples. The VCF containing data for all four samples permitted a calculation of the SNPs shared by the four samples relative to the reference genome sequence. Genomic regions between contiguous SNPs were used to determine the divergent proportion from the total genome sequence among samples. Whole genome patterns of SNP distributions were visualized using "CircosVCF" [43].

SNP Genotyping Data Analysis
Data derived from the SolCAP Illumina array were analyzed for 7720 SNPs for two 'UC 82' accessions, an 'M82' accession, and 'Heinz 1706'. The percentage of missing data ranged from 0.76% to 1.27% and percentage of heterozygous SNPs ranged from 0.08% to 0.19% respectively. The low residual heterozygosity detected for all samples was expected for highly inbred lines. Heterozygous SNPs from the original selection are likely to be fixed and have an equal probability to be fixed as polymorphic or reference during each generation of self-pollination. After removal of missing data and heterozygous SNPs, a total of 7578 SNPs were retained for further analysis. SNPs were clustered in seven genomic regions distributed on chromosomes 1, 6, 7, 9, 10, and 11 (Table 1). Differences within and between variety accessions were described based on the physical length of the haplotypes. As a proportion of the genome sequence flanked by polymorphic SNPs, the UC 82-COMAV and UC 82-OSU differ by 0.18% (Table 1), with 92% of polymorphic SNPs clustered on the top and the bottom of Chromosome 11. The comparison between 'UC 82' accessions and 'M82' reveals differences that range between 0.14% and 0.19%, and 0.26% when considering both samples together (Table 1). When pairwise comparisons were performed, fewer polymorphic SNPs were identified between 'UC 82' and 'M82' accessions (38 and 61 SNPs out of 7578) relative to those detected within accessions of 'UC 82' (79 SNPs out of 7578; Table 1). In all cases over 66% of the polymorphic SNPs were clustered on Chromosome 11. Cluster analysis across all polymorphic SNPs, placed UC 82-COMAV closer to the M82-OSU sample at alpha = 0.98. This clustering was not consistent across chromosomes, however. For SNPs on chromosomes 6 and 7 the UC 82-OSU and M82-OSU samples clustered together. For chromosomes 9 and 10, 'UC 82' samples clustered together. On Chromosome 11, markers between 7.18 and 7.72 Mb cluster UC 82-OSU and M82-OSU samples, and between 52.03 and 52.83 Mb, 48 SNPs cluster the UC 82-COMAV with M82-OSU. In a separate analysis, SNPs from the SolCAP array were used to repeat the comparisons using 'Heinz 1706' and LA1589 as outgroups. The results again showed that the UC 82-COMAV and M82-OSU accessions were more closely clustered than the two 'UC 82' accessions ( Figure S1).

Sequence Alignment and SNP Calling
Sequence data from the four SRA files covered 55.30−89.08 percent of the 828,349,174 bp SL3.0 reference genome sequence to a depth of coverage of at least 4× (Table 2). Since different percentages of the genome sequence were covered, the total number of detected variants differed among samples ( Table 2).  Figure 1 presents a schematic representation of genomic regions with SNP calls. We identified 32,352 robust SNPs from a total of 1,886,601 SNPs detected in a single four-way VCF containing data for all samples. Only 414,820 SNPs passed the default filtering parameters from GATK. When SNPs with missing data in one sample were removed we retained 212,068 informative SNPs for all samples.
Filtering by depth of coverage reduced the total number of SNPs to 46,981 SNPs. Of the remaining SNPs 496 were unmapped (located on Chromosome 0) and 14,133 were heterozygous. The percentage of heterozygous SNPs detected when comparing each sample separately to the reference, ranged from 3.8% to 13.9% (Table 2) and could reflect true residual heterozygosity, sequence error, or duplicate genes that were clustered together in the alignment ( Figure 1A,C). Distinguishing residual heterozygosity from sequencing, alignment or SNP calling errors is difficult, even when following a conservative assembly pipeline. Since the mating system of tomato favored elimination of heterozygous SNPs, we therefore defined the set of robust SNPs based on homozygous regions identified after hard-filtering was applied ( Figure 1B).

SNP Density Comparison from Sequence Data
The genome sequences of M82-ERS484046, M82-TS3, M82-TS228, and UC82-TS171 were highly similar. The same conclusions were drawn whether we considered pairwise VCF comparisons or the four-way VCF comparison. The advantage of joining samples in a single fourway VCF was the ease of performing analysis of the samples based on exactly the same positions, and we focused on results from this comparison. Using the set of robust SNPs, differences within ʻM82ʼ accessions ranged from 50 to 165 SNPs or only 0.15%-0.51% of the genome sequence (Table   S2). The range of differences across all 12 chromosomes were between 0% and 1.27% for the M82-ERS484046 vs. M82-TS228 comparison, 0.02% and 12.30% for M82-ERS484046 vs. M82-TS3 and 0.08% and 13.27% for M82-TS3 vs. M82-TS228 (Table S2). The greatest variation within ʻM82ʼ pairwise comparisons was localized on chromosomes 7 and 9 (Table S2). Comparisons of UC82-TS171 with the three ʻM82ʼ accessions displayed a range from 229 to 277 SNPs or 0.71%-0.86% of

SNP Density Comparison from Sequence Data
The genome sequences of M82-ERS484046, M82-TS3, M82-TS228, and UC82-TS171 were highly similar. The same conclusions were drawn whether we considered pairwise VCF comparisons or the four-way VCF comparison. The advantage of joining samples in a single four-way VCF was the ease of performing analysis of the samples based on exactly the same positions, and we focused on results from this comparison. Using the set of robust SNPs, differences within 'M82' accessions ranged from 50 to 165 SNPs or only 0.15%-0.51% of the genome sequence (Table S2). The range of differences across all 12 chromosomes were between 0% and 1.27% for the M82-ERS484046 vs. M82-TS228 comparison, 0.02% and 12.30% for M82-ERS484046 vs. M82-TS3 and 0.08% and 13.27% for M82-TS3 vs. M82-TS228 (Table  S2). The greatest variation within 'M82' pairwise comparisons was localized on chromosomes 7 and 9 (Table S2). Comparisons of UC82-TS171 with the three 'M82' accessions displayed a range from 229 to 277 SNPs or 0.71%-0.86% of the genome sequence (Table S2). The range of polymorphic SNPs across all 12 chromosomes was of 0.02%-21.41%, 0.12%-22.32%, and 0.03%-21.10% when comparing UC82-TS171 with M82-ERS484046, M82-TS228 and M82-TS3, respectively (Table S3). For all pairwise comparisons the greatest differences in percentage of the homozygous SNPs were detected on Chromosome 1 and the largest number of polymorphic SNPs was identified on Chromosome 11 (Table S3).

SNP Allele Comparison across the Entire Genome
The distribution pattern of SNPs across the entire genome sequence for all samples was highly similar (Figure 2). A clear enrichment of SNPs was found on chromosomes 4, 5, 11, and 12, and to a lesser degree on chromosomes 2, 6, 7, and 10. Only a few specific regions on chromosomes 1, 3, 8, and 9 contained variation relative to the reference ( Figure 2).

SNP Allele Comparison Across the Entire Genome.
The distribution pattern of SNPs across the entire genome sequence for all samples was highly similar (Figure 2). A clear enrichment of SNPs was found on chromosomes 4, 5, 11, and 12, and to a lesser degree on chromosomes 2, 6, 7, and 10. Only a few specific regions on chromosomes 1, 3, 8, and 9 contained variation relative to the reference ( Figure 2).  Figure 2B shows a close-up of SNP allele comparisons on chromosomes with maximum differences within ʻM82ʼ and between ʻM82ʼ and ʻUC 82ʼ samples (Tables S2 and S3). Regions on the bottom of chromosome 7 and 9 clearly differentiated M82-TS3 from the other ʻM82ʼ samples, while UC82-TS171 shared SNPs with M82-ERS484046 and M82-TS3 on Chromosome 7 and with M82-TS228 on Chromosome 9. Two regions located on the bottom of Chromosome 1 and 11 differentiate UC82-TS171 from all ʻM82ʼ samples ( Figure 2B).

Comparison of Polymorphic Genomic Regions.
To determine the proportion of the total genome sequence for which within accession variation was detected, genomic regions with at least three consecutive SNPs were summarized in Table 3. Intravariety variation for ʻM82ʼ accessions covered from 0.0001% to 0.43% of the total genome sequence when pairwise comparisons were made (Table 3). These results suggest that samples M82-ERS484046 and M82-TS228 were highly similar. The overall proportion of the tomato genome sequence that varied within ʻM82ʼ accessions was 0.43% with the most polymorphic region located on bottom of Chromosome 7 (SL3.0ch07:60,125,464-62435965 bp) and 9 (SL3.0ch09:71180818-72418523 bp; Table 3). Divergence between ʻUC 82ʼ and ʻM82ʼ accessions detected by pairwise comparisons cover from 0.33% to 0.59% of the total genome sequence ( Table  3). Differences between all ʻM82ʼ samples and UC82-TS171 were mostly located on the bottom of  Figure 2B shows a close-up of SNP allele comparisons on chromosomes with maximum differences within 'M82' and between 'M82' and 'UC 82' samples (Tables S2 and S3). Regions on the bottom of chromosome 7 and 9 clearly differentiated M82-TS3 from the other 'M82' samples, while UC82-TS171 shared SNPs with M82-ERS484046 and M82-TS3 on Chromosome 7 and with M82-TS228 on Chromosome 9. Two regions located on the bottom of Chromosome 1 and 11 differentiate UC82-TS171 from all 'M82' samples ( Figure 2B).

Haplotype Identification Based on SNP Genotyping and Whole Genome Resequence Data
Analyses based on independent genotyping or resequencing data identify the same genomic regions that differentiate 'UC 82' and 'M82' accessions. To compare and visualize these regions, the genotypic data from variety 'Heinz1706' was extracted from the Illumina SolCAP array and used to determine reference and alternate alleles for samples UC 82-COMAV, UC 82-OSU, and M82-OSU. The allelic information relative to the reference was then used to guide a schematic representation of the chromosome regions for the accessions analyzed based on both genotypic and resequencing data ( Figure 3). From this visualization, three different haplotypes could be distinguished among all 'UC 82' and 'M82' accessions. The first haplotype was represented by samples UC 82-OSU and UC 82-TS171; the second by UC 82-COMAV and M82-TS3; and the third by M82-ERS484046, M82-TS228 and M82-OSU.

Haplotype Identification Based on SNP Genotyping and Whole Genome Resequence Data
Analyses based on independent genotyping or resequencing data identify the same genomic regions that differentiate ʻUC 82ʼ and ʻM82ʼ accessions. To compare and visualize these regions, the genotypic data from variety ʻHeinz1706ʼ was extracted from the Illumina SolCAP array and used to determine reference and alternate alleles for samples UC 82-COMAV, UC 82-OSU, and M82-OSU. The allelic information relative to the reference was then used to guide a schematic representation of the chromosome regions for the accessions analyzed based on both genotypic and resequencing data ( Figure 3). From this visualization, three different haplotypes could be distinguished among all ʻUC 82ʼ and ʻM82ʼ accessions. The first haplotype was represented by samples UC 82-OSU and UC 82-TS171; the second by UC 82-COMAV and M82-TS3; and the third by M82-ERS484046, M82-TS228 and M82-OSU.

Discussion
In order to compare variation between varieties and within accessions of the same variety we defined 32,352 robust SNPs in tomato. An important step in defining these SNPs was the elimination of heterozygous calls. The problem of accurate identification of heterozygosity is complicated by both a sequencing error and inaccurate read mapping in next-generation sequencing technology and has been widely discussed [44,45]. Typically, these problems are minimized by invoking strict coverage parameters for depth of coverage, uniformity of calling, and quality scores. Our analysis followed GATK [39,46] best practices. The HaplotypeCaller uses local de-novo assembly to reassemble the reads in variable regions, a strategy that provides more accuracy. A graphical inspection of the depth of coverage from all genotypes revealed regions in common with extremely high concentration of reads ( Figure 1A), probably caused by repetitive DNA or gene-families, which has been artificially condensed in the reference. The regions with higher depth were on Chromosome 1 (between positions SL3.0ch01:23,806,000 and 23,812,500), Chromosome 2 (between positions SL3.0ch02:14,175,000 and 14,190,000), and Chromosome 11 (between positions SL3.0ch11:20,552,400 and 20,553,600). These were also regions with a high frequency of heterozygous SNP calls. Based on results from the analysis of the SolCAP Illumina array, we expected fewer heterozygous SNPs than detected from resequencing data, suggesting many of these are false positives. To assess the  z Genomic region covered by at least three contiguous polymorphic SNPs. y Chromosome. x Considering all differential regions present in at least one pairwise comparison. w Calculated as the proportion from the reference genome sequence version SL3.0 (total size 828,349,174 bp) covered by differential regions.

Discussion
In order to compare variation between varieties and within accessions of the same variety we defined 32,352 robust SNPs in tomato. An important step in defining these SNPs was the elimination of heterozygous calls. The problem of accurate identification of heterozygosity is complicated by both a sequencing error and inaccurate read mapping in next-generation sequencing technology and has been widely discussed [44,45]. Typically, these problems are minimized by invoking strict coverage parameters for depth of coverage, uniformity of calling, and quality scores. Our analysis followed GATK [39,46] best practices. The HaplotypeCaller uses local de-novo assembly to reassemble the reads in variable regions, a strategy that provides more accuracy. A graphical inspection of the depth of coverage from all genotypes revealed regions in common with extremely high concentration of reads ( Figure 1A), probably caused by repetitive DNA or gene-families, which has been artificially condensed in the reference. The regions with higher depth were on Chromosome 1 (between positions SL3.0ch01:23,806,000 and 23,812,500), Chromosome 2 (between positions SL3.0ch02:14,175,000 and 14,190,000), and Chromosome 11 (between positions SL3.0ch11:20,552,400 and 20,553,600). These were also regions with a high frequency of heterozygous SNP calls. Based on results from the analysis of the SolCAP Illumina array, we expected fewer heterozygous SNPs than detected from resequencing data, suggesting many of these are false positives. To assess the importance of this issue we altered filtering conditions, only for samples with similar mean depth of coverage (M82-TS3, M82-TS228, and UC82-TS171; Table 2), comparing both a maximum depth of 15× and 30×, with the rationale that a higher coverage would permit more error due to duplicate regions of the genome sequence. The two filters differed in total SNP calls with 51,535 at 30× and 38,826 at 15×, suggesting that error was introduced due to alignment to repetitive sequences. We therefore applied a cut off criteria for maximum read depth coverage to filter false positive SNP calling due to duplicate sequences clustered together in the alignment. The filter criteria were determined for each sample as d+4 √ d where d is the mean depth of coverage [41]. After this filtering was applied, the remaining heterozygous SNPs should be more reflective of true residual heterozygosity, but a visual inspection of these SNPs revealed heterozygous calls (0/1) when only one SNP was detected out of 30 reads ( Figure 1C). Given the homozygosity of tomato, such calls were likely errors. Such errors appeared to occur independently of depth of coverage, and we assumed these errors might be caused by the Bayes' likelihood method used to estimate genotypes.
The genome sequences of 'UC 82' and 'M82' accessions were highly similar. Both, the SNP density and the SNP allele comparison across the entire genome sequence for all samples presented highly similar distribution patterns. A distribution of chromosomal variant density between 'M82' and the 'Heinz 1706' reference genome sequence was previously published [10] and is similar to the SNPs distribution presented in Figure 2, for both 'M82' and 'UC 82' samples. Furthermore, the high density of homozygous SNPs relative to the alternate allele detected on chromosomes 4, 5, and 11 for the four samples ( Figure 2) were consistent with putative S. pimpinellifolium introgressions into 'M82' previously detected [10]. Variation within 'M82' accessions ranged from 0.0001% to 0.43% of the total genome sequence and between 'UC 82' and 'M82' accessions the range was 0.33%-0.59% when pairwise comparisons are made (Table 3). Considering all of the differential regions detected between 'UC 82' and 'M82' accessions, the divergence covered 0.67% of the total genome sequence ( Table 3).
The same genomic regions differentiate 'UC 82' and 'M82' accessions whether considering genotyping array or resequencing data. From the visualization of these regions, three different haplotypes could be distinguished among all 'UC 82' and 'M82' accessions ( Figure 3). A first haplotype consisting of 'UC 82' accessions, a second of 'M82' accessions and a third haplotype was shared by one sample from each variety.
The genome-wide intra-variety variation of 0.18% and 0.43% we detected within 'UC 82' and 'M82' accessions, respectively, suggests a high fidelity of propagation for these varieties within the tomato genetics community. These percentages compare favorably to the genome-wide differences of 2.3% for the maize variety 'B73' and 3.1% for the soybean variety 'Williams 82', two references for agricultural genetics, genomics and breeding [1,2]. Differences for the model plant standard Arabidopsis 'Col-0' were considerably higher at 20% [3]. Differences that we detected within the same accessions were consistent with drift as regions of residual heterozygosity were resolved and no evidence for outcrossing or introgression were detected.
For specific comparisons and regions of the genome sequences, the within accession variation was greater than between 'UC 82' and 'M82'. A shared history or kinship for these varieties provides an explanation for these patterns of variation. The historical record is unclear as to whether 'M82' and 'UC 82' are independent selections made out of the same four-way cross or independent selections from the same partially inbred line from that same cross. The 'M82' release notice states, "initial cultivation of the variety 'M82-1-8 (VF)' was at the University of California, Davis, by Dr. M.A. Stevens" and, further, "selections and improvement in uniformity were made by Y. Rudich, Faculty of Agriculture, Rehovot" (translated from the Hebrew) [47]. The analysis presented here suggests that the two varieties were selected from the same partially inbred line, rather than from independent F 2 plants. Based on the published pedigree for 'UC 82' [26,27] there is a disproportionate contribution from VF 65-2, suggesting that independent F 2 plants will share more than 50% of their genomes by descent. Translating pedigree into SNP percentages is difficult as all parents were not represented in sequence or genotyping databases and we could not assume they were 100% different. However, observations across genotyping data show a narrow range of 0.14%-0.19% differences in pairwise comparisons. Considering all differential regions present in at least one comparison, the accessions differed at 0.26% of the genome sequence (they share 99.74%). Sequence analysis showed similar proportions, a range of 0.33%-0.59% differences in pairwise comparisons and a 0.67% (shared 99.33%) when considering all differential regions present in at least one pairwise comparison. These ranges were highly consistent with a common pedigree and shared selection between 'UC 82' and 'M82'.
To place the analysis presented here into context based on findings in other plant species, it is useful to consider results from maize, soybean, and Arabidopsis. In the introduction of their paper, describing sequence variation between accessions of the Zea mays inbred 'B73', Liang and Schnable [2] review the implications of genetic variation that may have occurred in genetic stocks as a result of contamination, accidental out-crossing, residual heterozygosity, or mutation. They note that resequencing of single plants from a source of the Arabidopsis thaliana accession 'Columbia-0' identified variation that accounted for nearly 20% of the total reference genome sequence [3]. They also note that two sources of the Glycine max variety 'Williams 82', the reference genome accession for soybean, contains 3.1% polymorphism [1]. The differences in 'Williams 82' accessions originated from the differential segregation of polymorphic chromosomal regions following introgression, backcrossing, and single-seed descent during the breeding process. In maize, 13,360 high confidence SNPs in 27 RNA-seq samples for 'B73' were described [2]. This number was 4.8-fold lower than the number expected from a non-B73 accession. Their analysis defined robust SNPs and found that 97.7% of the maize gene space was consistent with a single haplotype across all 27 'B73' samples. Over half of the SNPs locate within blocks on specific regions in different chromosomes, and were likely introgressions. The remaining SNPs were distributed over all ten chromosomes and were either new mutations, or reflect segregation and fixation of heterozygous loci from the original 'B73' accession [2]. The three genotypes, 'Williams 82', 'B73', and 'Columbia-0' were among the most widely studied by their respective plant research communities, thus understanding the distribution and causes of variation between independent accessions had important implications to the respective research communities.

Conclusions
A high fidelity of propagation for two tomato varieties was demonstrated based on two independent data sets. The close relationship between 'UC 82' and 'M82' accessions was confirmed, and the similarities were highly consistent with a common pedigree and shared selection. We also demonstrated how sequencing technologies and bioinformatics analysis could be successfully used by plant breeders to differentiate haplotypes between and within varieties.  ' sequence samples: M82-ERS484046, M82-TS228, M82-TS3,  Table S3: Pairwise comparison by chromosome (Ch) using a set of 32,352 robust SNPs for 'M82' and 'UC 82' sequence samples: M82-ERS484046, M82-TS228, M82-TS3, and UC 82-TS171, Figure S1: Cluster dendrogram based on SNPs extracted from the SolCAP data set. The accessions 'Heinz 1706' (gold-standard reference) and the Solanum pimpinellifollium LA1589 were used as outgroups for the analysis. Funding: This research was funded by Agencia Nacional de Promoción Científica y Tecnológica PICT2015-0424 and Consejo Nacional de Investigaciones Científicas y Técnicas PIP008 and the National Institute of Food and Agriculture to GRR, and U.S. Department of Agriculture, Hatch project OHO01405 to DMF.