GWAS Based on RNA-Seq SNPs and High-Throughput Phenotyping Combined with Climatic Data Highlights the Reservoir of Valuable Genetic Diversity in Regional Tomato Landraces

Tomato (Solanum lycopersicum L.) is a widely used model plant species for dissecting out the genomic bases of complex traits to thus provide an optimal platform for modern “-omics” studies and genome-guided breeding. Genome-wide association studies (GWAS) have become a preferred approach for screening large diverse populations and many traits. Here, we present GWAS analysis of a collection of 115 landraces and 11 vintage and modern cultivars. A total of 26 conventional descriptors, 40 traits obtained by digital phenotyping, the fruit content of six carotenoids recorded at the early ripening (breaker) and red-ripe stages and 21 climate-related variables were analyzed in the context of genetic diversity monitored in the 126 accessions. The data obtained from thorough phenotyping and the SNP diversity revealed by sequencing of ripe fruit transcripts of 120 of the tomato accessions were jointly analyzed to determine which genomic regions are implicated in the expressed phenotypic variation. This study reveals that the use of fruit RNA-Seq SNP diversity is effective not only for identification of genomic regions that underlie variation in fruit traits, but also of variation related to additional plant traits and adaptive responses to climate variation. These results allowed validation of our approach because different marker-trait associations mapped on chromosomal regions where other candidate genes for the same traits were previously reported. In addition, previously uncharacterized chromosomal regions were targeted as potentially involved in the expression of variable phenotypes, thus demonstrating that our tomato collection is a precious reservoir of diversity and an excellent tool for gene discovery.


Introduction
Tomato (Solanum lycopersicum L.) is a major vegetable crop worldwide and a widely used model species in genomic studies and breeding. The rapidly increasing availability of genomic data has allowed researchers to bypass the limitations of a single reference genome and has facilitated development of a pan-genome that represents the genetic diversity of cultivated and wild tomato species [1]. Recent groundbreaking studies have shown how the integration of widely variable tomato collections and multi-omics tools can contribute to advances in population genomics, domestication and plant evolution studies [2][3][4].
Our interest in tomato derives from the relevance of this crop for the Mediterranean area, where Italy is the leading European tomato producer, with an overall production in 2019 of~6 million tons and a harvested area of~100,000 ha (FAOSTAT, 2018;ISTAT, 2018ISTAT, , 2019. In Sardinia in particular, together with artichoke, tomato is the most important horticultural crop species, with numerous landraces still cultivated locally by farmers. This is of special relevance if we consider that landraces might be directly used in marginal environments or exploited in future breeding programs as a reservoir of genes that underlie quality and productive traits that can be used to increase the performance of modern cultivars and to boost sustainable agriculture [5][6][7][8][9]. As noted by Lin and colleagues [2], modern tomato cultivars have arisen following years of domestication and breeding, which have progressively reduced the genomic diversity, together with the chances of improving the species through conventional breeding. Therefore, it is important that researchers work to preserve and expand the genetic diversity through the analysis of local and regional genotypes. The results of these studies will help in disentangling the role of different genes in the expression of important agronomic traits [2,10,11]. Transcriptomic data, in particular, have been used to both compare expression profiles (expression quantitative trait loci (eQTLs)) in plant populations at different development stages, and genotype (using RNA-Seq-derived single nucleotide polymorphisms (SNPs)) germplasm collections that can be used in genome-wide association studies (GWAS) and genomic selection analyses [10,12,13].
Genome-wide association studies have become a powerful tool to investigate and link the huge available amounts of genome sequence variation data with measurable phenotypic variations [2][3][4].
position, unless otherwise requested in a few specific cases; for proximal and distal angles, 20 degrees was used as the macro distance and three degrees as the micro distance (unless otherwise requested in a few specific cases).

Carotenoids
We extracted the carotenoids from three replicates of each accession (one per block), at both the breaker stage (i.e., early ripening), when the fruit are still rich in chloroplasts and the red-ripe stage, when the fruit accumulate large quantities of carotenoids, for a total of 762 samples. All of the harvested fruit at the breaker and red-ripe stages were healthy and uniformly colored within their stage. An example of the two ripening stages is shown in Supplementary Figure S1. At ripe stage, accessions with mature orange, yellow or pink fruit were harvested when the color reached maximum intensity. The harvested fruit were left for 6 h on a laboratory bench at room temperature to reduce harvest stress and ethylene production. Then they were cut into halves and separated from the seeds, frozen in liquid nitrogen and stored at −80 • C.
Frozen flesh samples from each fruit stage were rapidly homogenized and carotenoids extraction performed as described in [27]. Carotenoid detection was performed on a Summit HPLC system with a photodiode array detector (PDA-100; Dionex, Sunnyvale, CA, USA). Spectra were collected at 286 nm, 450 nm and 471 nm and pigments were identified via co-migration with purified standards and/or by their pigment-specific absorbance spectra. The list of carotenoids analyzed for each stage is shown in Supplementary Table S2.

Climatic Data
For each accession for which geographic coordinates were available (108 accessions), climatic data were also downloaded relative to each collection site. Bioclimatic variables were used for further comparative analyses with phenotypic and molecular data, as they are considered more biologically meaningful than monthly temperatures and rainfall. Indeed, they describe annual temperature and precipitation trends, seasonality or extreme environmental factors (e.g., temperature of the coldest and warmest month or precipitation of the wettest and driest quarters). The DIVA-GIS 7.5 software (http://www.diva-gis.org/) was used to extract ecological data from the free-access database at http://www.diva-gis.org/climate.

Transcriptome Sequencing and SNP Calling
Transcriptome sequencing was performed on 120 of the 126 accessions cultivated during the green-house trial in 2013. The fruits used for RNA extraction were the same used for carotenoid extraction (see above) Total RNA was extracted using RNeasy Plant Mini kits (Qiagen, Hilden, Germany) from full-ripe tissue samples previously frozen at −80 • C and homogenized. Strand-specific RNA-Seq libraries were constructed using a protocol described in [28] and 51-bp-long single-end reads were sequenced using a HiSeq 2000 platform (Illumina, San Diego, CA, USA). Raw RNA-Seq reads were processed to trim low-quality and adapter sequences using Trimmomatic [29]. The resulting cleaned RNA-Seq reads were aligned to the "Heinz" reference genome (version SL2.40) using STAR [30]. Duplicated reads in each RNA-Seq library were marked using Picard (http://broadinstitute.github.io/ picard/) and only uniquely mapped reads were kept. SNPs were then identified based on the mpileup files generated by SAMtools [31]. The resulting raw SNPs (73,859) were filtered to exclude those with missing rate >0.9 and minor allele frequency <0.03, which obtained a final dataset of 2470 SNPs for downstream population genomic and GWAS analyses.

Statistical and Population Genomic Analyses
Ranges and mean values were calculated by site for each accession on quantitative conventional descriptors, Tomato Analyzer descriptors and carotenoid contents. Analysis of variance (ANOVA) was then performed on the conventional descriptors collected from the 122 accessions shared among Genes 2020, 11, 1387 5 of 25 the two sites. To test for significant variations among environments and genotypes and to estimate the size of the genotype by environment interaction (GxE) we used the model given in Equation (1): where Y ijr is the observation of the ith genotype, in the jth environment and block replicate r, µ is the grand mean, G i is the effect of the ith genotype, E j is the effect of the jth environment, G*E ij is the interaction of the ith genotype with the jth environment, B r (E j ) is the effect of the rth replicate in the jth year and ε ijr is the random error. The adjusted means (as the best linear unbiased predictors (BLUPs)) and the broad sense heritability were also calculated across sites for the quantitative conventional descriptors and within sites for the Tomato analyzer descriptors and the carotenoid contents. To calculate these statistics, we have fitted a mixed linear model based on the restricted maximum likelihood method with years as fixed effects and genotypes and blocks as random effects using Equation (2): where Yijr is the observation of the ith genotype, in the jth environment and block replicate r, µ is the grand mean, Gi is the effect of the ith genotype, Ej is the effect of the jth environment, G*Eij is the interaction of the ith genotype with the jth environment, Br(Ej) is the effect of the rth replicate nested in the jth year and εijk is the random error. The expression given in Equation (3) was used to calculate the statistics: where σ 2 p = σ 2 g + (σ 2 g · e/ n e ) + (σ 2 / n e ·n r ) for heritability across the two sites and σ 2 p = σ 2 g + (σ 2 / n e ·n r ) for heritability calculated by site, and where σ 2 g is the genotypic variance, σ 2 p is the phenotypic variance, σ 2 g · e is the genotype by environment interaction, n e is the number of environments, σ 2 is the residual variance and n r is the number of replicates. All the above statistical analyses were carried out using R dedicated packages (lme4 and car) [32]. The Shannon-Weaver index was used to evaluate the diversity among the qualitative traits [33].
GenAlEx 6.5 [34] was used to calculate the formula H' = − n i=1 pilogb, where pi is the frequency proportion of the descriptor state and b is the base of the logarithm. Most studies use natural logarithms, although some use base 2 (which makes no significant differences). Each value was normalized by dividing it by its maximum value (log2n, where n is the number of states) to keep the values between 0 and 1. Pearson's correlations were estimated among the different traits using the cor standard R function and plotted using the corrplot package [35]. Principal component analysis (PCA) and cluster analysis based on all of the morphophenological and quality traits were also performed. The PCA results were plotted using a modified version of the ggbiplot package (https://github.com/vqv/ggbiplot). The Hopkins statistic [36] was used to assess the clustering tendency and the most likely number of groups was detected using the NbClust R package [37]. The Euclidean distance and the Ward method were used to draw the clusters and the results were plotted using the factoextra R package (https://cran.r-project.org/web/packages/factoextra/index.html). Genetic diversity analyses were performed using GenAlEx 6.5 [34] and Arlequin vs. 3.5.1.2 [38], to calculate the main diversity statistics: Number of observed (Na) and expected (Ne) alleles, number of private alleles (PAs), expected heterozygosity (He) and unbiased expected heterozygosity (uHe). Hierarchical analysis of molecular variance (AMOVA) was performed to evaluate the partitioning of the genetic variance into, among and within groups of accessions, defined as: Sardinian landraces (SLRs), exotic landraces (ELRs) and modern or vintage cultivars (CVs). Genetic distances among groups were also calculated by pairwise F ST values and also proportion of shared alleles between pairs of sub-populations.
A Mantel correlation test was then used to compare the genetic and phenotypic distances using the mantel.rtest function of the ade4 R package; 9999 permutations were used to evaluate the significance of the test. The phenotypic distances were those obtained to draw the clustering plot, while the genetic Nei minimum distances were obtained using the R adegenet [39] and popr packages (URLhttps://grunwaldlab.github.io/poppr).
To investigate population structure, we used the model-based clustering method, as implemented in Structure vs. 2.3.4 [40], and discriminant analysis of principal components (DAPC), implemented in the adegenet package for the R software [39]. The first method is a model-based approach that assigns each individual to different groups according to a membership coefficient (qi). We used an admixture model with the options "correlated allele frequencies among populations" and "infer the degree of admixture (a) by the data". For each K (number of hypothetical populations), 20 runs (burn-in length, 100,000; iterations, 200,000) were carried out and the most likely number of K was determined using the method from Evanno et al. (2005), as implemented in the online program STRUCTURE Harvester [41]. The second method does not require any a priori knowledge of the population genetics model and among other advantages, it requires low computational effort to analyze large datasets and provides reliable assignment of individuals to groups [39]. The method uses PCA to transform the data and performs discriminant analysis on the principal components retained, thus also allowing easy graphical representation of the relatedness between the inferred groups. To run these analyses, we filtered the dataset of 2470 SNPs by clumping, to maintain only those with reduced pairwise linkage disequilibrium (r 2 < 0.2). To filter the dataset, we used the R package bigsnpr [42], to obtain a dataset of 649 unlinked SNPs.
To further evaluate the structure of the population, we estimated the familial relatedness and linkage disequilibrium (LD) levels of the tomato collection by calculating the pairwise kinship coefficients as obtained from TASSEL 5.2.42 [43]. We then determined the intrachromosomal LD using both uncorrected estimations and estimates corrected for population structure and familial relatedness. In the first case, the classical r 2 measure was calculated; in the second case, the r 2 unbiased estimates were corrected by the structure of the sample (r s 2 ), the relatedness of the individuals (r v 2 ) or both (r vs 2 ).
These methods were implemented in the LDcorSV R package [44]. Pairwise distance between loci was calculated using LD-vignette included in the Bioconductor package snpStats implemented in R [45,46]. We then plotted the LD r 2 data against the genetic distance and fitted the LD decay line as in [47], adapting an R script to our data (https://fabiomarroni.wordpress.com/). The regression line is based on [48] and the parameter C was calculated by using SneP, a program designed to easily estimate effective population size from genome-wide SNP data or directly from LD levels [49].
To identify the critical value of LD (r 2 value) across our tomato population above which loci can be assumed to be associated, we used the method from [50]. Here, the LD threshold below which markers can be defined as unlinked and is defined based on the 95th percentile of a normalized distribution of markers located on different chromosomes. The intersection point between the regression line and the LD significance threshold was used to indicate the LD decay overall and along each chromosome.

Genome-Wide Association Studies
Genome-wide association studies were performed on the BLUPs of all of the collected traits by fitting a MLM implemented in GAPIT v.3 [51], which accounts for kinship (genotype relatedness) and population structure. To explore farther marker-trait associations, we used FarmCPU, also implemented in GAPIT v.3 [21,51]. FarmCPU is a modified multilocus mixed model approach that enhances the false-discovery rate and the QTL detection power by incorporating one or several markers as cofactors in a stepwise MLM, thus removing the confounding between testing markers and kinship; see e.g., [19]. Liu et al. [21] divided the modified multilocus mixed model approach into two parts: A fixed effect model and a random effect model, using these iteratively. To control for false positives, the fixed effect model tests markers one at a time and uses multiple associated markers as covariates. To limit an overcorrection of the model, the associated markers are then estimated in a random effect model and used to define kinship. To perform MLM and FarmCPU analyses, no missing data are allowed and we therefore imputed the missing phenotypic data using the phenix package implemented in R [52]. We also performed a relatively novel GWA analysis using QTCAT [22], a method that does not require population structure information to detect associations between markers and traits. This method, has been designed to overcome the limitations of single-locus MLM models that while correcting for population structure and genotype relatedness, cannot entirely avoid yielding spurious associations between a marker and a phenotype [22]. In particular, this QTCAT groups markers into clusters of correlated markers while simultaneously associating them to the phenotype and therefore it does not require any correction for the population structure.
Substantial variation was detected also for the conventional qualitative traits, as indicated by the Shannon-Weaver index (H') (Supplementary Table S4). Almost all of the traits showed the same Genes 2020, 11, 1387 8 of 25 number of variants across the 2 years, with some relevant changes seen, e.g., for green stripes, with H' of 0.50 in 2012 that was reduced to nearly zero in 2013 (greenhouse experiment). A similar trend was also observed for leaf attitude (H' _2012 = 0.39, H' _2013 = 0.13; Supplementary Table S4).
Significant differences among genotypes were also detected for the 44 Tomato Analyzer descriptors (Supplementary Table S5). Broad sense heritability was on average higher for these traits, with the highest value for pericarp thickness (h 2 B = 92.21%) and the lowest for the perimeter of the longitudinal section (h 2 B = 36.65%) (Supplementary Table S5). The mean value for all of the traits was~70%. For the carotenoids, the total content was higher at the ripe stage than the breaker stage, as expected (Supplementary Table S6). At the breaker stage, the total mean content of carotenoids was 6.09 µg/g fresh weight (fw) and as a general trend there was an increase of about 20-fold with full ripeness (121.55 µg/g fw). Moreover, the ANOVA output showed that all of the analyzed compounds showed significant differences among genotypes at both the breaker stage and the red ripe stages (Supplementary Table S6).
The heritability of the carotenoids was on average lower than the other phenotypic traits, with β-carotene showing the highest values at both breaker and ripe stages (45%, 57%, respectively) (Supplementary Table S6).
Both positive and negative correlations were found among the different traits (Supplementary Figure S2), with the most significant values observed between fruit size and shape (as recorded by both conventional and digital phenotyping). As an example, fruit weight (FWG) and locule number (NOL) were strongly and positively associated (r > 0.8, p < 0.0001) with the area, perimeter and maximum width (Width_M_H_L and Width_M_H_T) of the transverse and longitudinal sections of the fruit (as registered by Tomato Analyzer). Significant correlations were also observed between carotenoid content and fruit color and size. In particular, β-carotene content was negatively correlated (r = −0.4, p < 0.001) to the mean fruit weight (FWG) at both the breaker stage and the ripe stage and it was positively correlated with the degrees Brix at the breaker stage (r = 0.43, p < 0.0001) and the ripe stage (r = 0.20, p = 0.03). Fewer correlations were detected between bioclimatic variables and the plant/fruit traits, some of which were of particular note, such as those between the traits that described fruit size (e.g., FLE, FWG, FWI, area, perimeter, Max_W_L, Max_H_L, Max_W_T, Max_H_T) and bio3 (Isothermality), bio8 (Mean Temperature of Wettest Quarter), bio12 (Annual Precipitation), bio13 (Precipitation of Wettest Month), bio 16 (Precipitation of Wettest Quarter) and bio18 (Precipitation of Warmest Quarter). The observed correlations varied from −0.45 (p < 0.001) for correlations between FLE and bio3 and bio8, to −0.27 (p < 0.005) for correlations between area and perimeter with bio12, bio13 and bio16 (Supplementary Figure S2).
Overall, a wide pattern of phenotypic variation was observed for all the traits, as shown by the PCA analysis, where the accessions were differentiated into three groups based on the hierarchical clustering obtained with the same data ( Figure 1, yellow, red, green). The landraces from Sardinia were present in all of the three groups, as well as the exotic landraces and modern/vintage cultivars, thus showing that the subdivision into varietal types (i.e., ELR, SLR, CV) does not overlap with the clustering based on phenotypic variance. While we should remark that less than the 50% of the phenotypic variance is explained by the first two principal components.
The main features of the three groups can be deduced from the loading of the different traits on the first two principal components (Supplementary Table S7). Briefly, we can summarize that cluster A is constituted by plants with composite inflorescence and large fruit, with heterogeneous shapes, including both flattened and heart-shaped types, and usually characterized by a medium to thin pericarp and low BRIX degrees, with a higher cis-lycopene content. Cluster B mainly includes varieties of small size, mainly round shaped with a medium pericarp, high sugar content, high β-carotene (at both breaker and ripe stages) and low puffiness. Cluster C is mainly comprised of varieties of small/medium size, elongated fruit, with thick pericarp and intermediate phenotypes for sugar content. The main features of the three groups can be deduced from the loading of the different traits on the first two principal components (Supplementary Table S7). Briefly, we can summarize that cluster A is constituted by plants with composite inflorescence and large fruit, with heterogeneous shapes, including both flattened and heart-shaped types, and usually characterized by a medium to thin pericarp and low BRIX degrees, with a higher cis-lycopene content. Cluster B mainly includes varieties of small size, mainly round shaped with a medium pericarp, high sugar content, high βcarotene (at both breaker and ripe stages) and low puffiness. Cluster C is mainly comprised of varieties of small/medium size, elongated fruit, with thick pericarp and intermediate phenotypes for sugar content. Table 2 gives the summary of the genetic diversity indices that indicate overall gene diversity (uHe) of 0.18, gene diversity that is the same for the Sardinian and exotic landraces (uHe = 0.16) and higher gene diversity within the group of cultivars (uHe = 0.29).  Accessions are attributed to three phenotypic groups (yellow, green, red), based on the hierarchical clustering obtained with the same data. Table 2 gives the summary of the genetic diversity indices that indicate overall gene diversity (uHe) of 0.18, gene diversity that is the same for the Sardinian and exotic landraces (uHe = 0.16) and higher gene diversity within the group of cultivars (uHe = 0.29). The genetic differentiation (F ST ) among the groups was 0.07 on average and was always statistically significant. Moreover, the proportion of shared alleles indicated higher similarity between SLR and ELR than between CV and ELR (Table 3). On the other hand, SLR and ELR showed 11 and 50 private alleles, respectively, while only one private allele was detected in CV (Supplementary Table S8). The partitioning of the SNP variance (AMOVA) indicated that only 4% of the total variation was due to a statistically significant differentiation (p < 0.001) among populations. This thus indicates that more than 96% of this diversity is due to the genetic variation among accessions within groups, either from different countries of the world or from a single country or region (e.g., Sardinia).

Genetic Diversity and Structure
Based on the Bayesian approach, the most likely number of genetic groups within this tomato population was three (Supplementary Figure S3a), while the DAPC method indicated four genetic groups (Supplementary Figure S3b). Looking in detail at these results, we found significant association between the methods in the assignment of the different accessions to the different genetic groups (X 2 = 194.0, p < 0.0001). Indeed, if we do not consider the admixed individuals, the same accessions were attributed to the green and red genetic groups shown in Figure 2 by both methods, while the accessions attributed to the yellow group by STRUCTURE were split into two groups by DAPC (Figure 2, yellow, orange).
The genetic differentiation (FST) among the groups was 0.07 on average and was always statistically significant. Moreover, the proportion of shared alleles indicated higher similarity between SLR and ELR than between CV and ELR (Table 3). On the other hand, SLR and ELR showed 11 and 50 private alleles, respectively, while only one private allele was detected in CV (Supplementary Table S8). Table 3. FST values (below diagonal) and relative significance of the test (above diagonal) and proportion of shared alleles between pairs of sub-populations based on the 2470 SNPs. The partitioning of the SNP variance (AMOVA) indicated that only 4% of the total variation was due to a statistically significant differentiation (p < 0.001) among populations. This thus indicates that more than 96% of this diversity is due to the genetic variation among accessions within groups, either from different countries of the world or from a single country or region (e.g., Sardinia).

Comparison among groups ELR SLR CV
Based on the Bayesian approach, the most likely number of genetic groups within this tomato population was three (Supplementary Figure S3a), while the DAPC method indicated four genetic groups (Supplementary Figure S3b). Looking in detail at these results, we found significant association between the methods in the assignment of the different accessions to the different genetic groups (Χ 2 = 194.0, p < 0.0001). Indeed, if we do not consider the admixed individuals, the same accessions were attributed to the green and red genetic groups shown in Figure 2 by both methods, while the accessions attributed to the yellow group by STRUCTURE were split into two groups by DAPC (Figure 2, yellow, orange).  When looking at the composition of each genetic group, landraces from ELR and accessions from CV were assigned to all the three groups detected by STRUCTURE and to the four groups detected by DAPC. The main difference among the three varietal groups was seen for SLR, which showed rare occurrence of the red genetic group in Figure 2. This is evident at both K3 (STRUCTURE), where we observed a low level of membership of the SLR accessions to the red group and at K4 (DAPC), where there are no red bars. We further investigated whether there was any correlation between the genetic groups and the phenotypic clusters and despite the significant relationships between the genetic and phenotypic distances (Mantel test correlation = 0.38, p = 2 × 10 −4 ; 9999 permutations), the genetic and phenotypic clustering do not completely overlap (Supplementary Figure S4).

Linkage Disequilibrium
Linkage disequilibrium that was calculated using the usual pairwise correlation measure (r 2 ) was higher than the r 2 measure corrected for population structure (r s 2 ), kinship (r v 2 ) and both population structure and kinship (r vs 2 ). These were seen as r 2 > r s 2 > r v 2 > r vs 2 ; r v 2 and r vs 2 LD were very similar overall and within each chromosome (Table 4, Supplementary Figures S5 and S6). The mean linkage disequilibrium (LD) levels over all chromosomes were r 2 = 0.15, r s 2 = 0.11 and r vs 2 = 0.07, whereas the estimates of the unlinked pairwise loci (among chromosomes) indicated r 2 = 0.19, r s 2 = 0.12 and r vs 2 = 0.09 as the threshold above which loci can be assumed to be in LD. There were fewer SNPs across centromeres and the highest LD levels were usually observed around peri-centromeric areas.
We plotted the r 2 , r s 2 , r v 2 and r vs 2 values against the genomic distances and modeled the LD decay for each chromosome ( Figure 3) and the LD decay was determined at the intersection point between the regression curves and the LD thresholds. While the mean values of LD were lower for kinship corrected values, LD decayed at slightly higher distances for r s 2 (1.40 Mb) and r v 2 (1.37 Mb) than for standard r 2 (1.06 Mb). This output was more evident within chromosomes for the r s 2 decay.
Genes 2020, 11, x FOR PEER REVIEW 12 of 26 Figure 3. Linkage disequilibrium decay levels over all chromosomes calculated using the standard r 2 coefficient and the r 2 measure corrected for population structure (rs 2 ), kinship (rv 2 ) and both population structure and kinship (rvs 2 ).
Some differences were also observed for LD decay among the different chromosomes, with lower LD decay distances for chromosomes 1, 3, 8 and 10, while the highest decay distances were for chromosome 5 (Table 4, Supplementary Figure S6). Some differences were also observed for LD decay among the different chromosomes, with lower LD decay distances for chromosomes 1, 3, 8 and 10, while the highest decay distances were for chromosome 5 (Table 4).

Fruit and Plant Traits
In particular, the highest number of MTAs (388) was detected for fruit size and shape (derived from both conventional and digital phenotyping; Table 5). Among the 165 MTAs detected for tomato fruit size, 86 indicated a subset of 13 genes that were shared between the two sets of traits. Among the 223 MTAs detected for fruit shape by both phenotyping methods, 72 where located within 12 genes.
Among the interesting MTAs we identified by both QTCAT and FarmCPU, one SNP (ch1_22) is located within the gene Solyc01g010440, which in addition to leaf size, was associated to multiple fruit traits (mainly related to fruit size) (Supplementary Tables S9 and S10). This gene was located 0.3 Mb from the CRABS-CLAW gene that encodes a YABBY transcription factor. On chromosome 2, one SNP was particularly relevant for fruit shape (ch2_151), which was detected by the three GWA models as associated to several traits (e.g., fruit shape, fruit size, BRIX, shape of pistil scar, shape of transverse section) and was located within the gene Solyc02g081700, which encodes a proteasome subunit α type protein, 13 kb from an ANANTHA gene (Supplementary Tables S9 and S10).
Among the genomic regions that influenced variation of fruit traits, an MTA on chromosome 4 (ch4_31) was 28 kb from a TONNEAU1 Recruiting Motif (TRM) gene that has been previously shown to influence final fruit shape [53].
On chromosomes 6 to 10, we detected more relevant MTAs for fruit size and shape, of which six were located within genes that encode transcription factors and others were either located near genes that influence fruit setting (OVATE-like, WUSCHEL-related, YABBY-like) or in regions where no genes determinant for the investigated fruit traits were found (Supplementary Table S9). On chromosome 11, there were other regions associated to fruit size and shape, among which the most significant was detected around 52 Mb, where the FASCIATED gene is located. FASCIATED has been extensively studied in tomato, where mutations to this gene result in altered floral meristem size and locule number [54][55][56][57][58].
Among the associations detected for quality traits, we found a relevant region on chromosome 1 (at 71 Mb) where three SNPs (ch1_38-40) were associated to fruit color and β-carotene content (Supplementary Table S9). These SNPs are located at 165 kb from Solyc01g079620, a SlMYB12 gene that influences accumulation of the yellow-colored flavonoid (naringenin chalcone) in tomato fruit [59]. These MTAs were detected by MLM and FarmCPU.
Other relevant MTAs were detected for sugar content and other fruit quality traits. Among these, one is located within the gene Solyc02g081700, above mentioned for its association with different fruit size and shape traits and one located within Solyc06g066320, the gene encoding for a transcription factor homolog of Arabidopsis thaliana IWS1 (INTERACTS WITH SPT6) [60]. This gene was also associated to other traits related to fruit size and shape and inflorescence.
Marker-trait associations that correlated with other plant traits were also detected, such as growth habit, leaf traits and inflorescence traits. For plant growth, we found one SNP on chromosome 1 (ch1_153) within Solyc01g108020, a gene that encodes a thioredoxin M3, and one on chromosome 10 within Solyc10g084400, which encodes a glutathione S-transferase (Supplementary Table S9). These MTAs were located at 206 kb and 27 kb, respectively, from genes that encode an AP2-like ethylene-responsive transcription factor (AP2/ERF). Moreover, on chromosome 11 we detected a SNP (ch11_49) within a DELLA gene (Solyc11g011260) that encodes a gibberellic acid insensitive (GAI) transcription factor (Supplementary Table S9). The gene product belongs to a family that has been shown to have a role as redox regulator in hormone signaling pathways of different plant species, including Arabidopsis, rice and tomato [61][62][63][64][65].
Marker-trait associations for leaf size were also identified that included an MTA for leaf length within Solyc04g080730, at 31 kb from Solyc04g080780 and 35 kb from Solyc04g080790, two BEL-like homeodomain genes [66]. BEL-LIKE genes have been shown to influence leaf development and morphology in many plant species, such as Arabidopsis and tomato [66][67][68]. More significant associations were found for flowering time (ch1_22, chr2_155, chr3_33), as well as for inflorescent type and number. In particular, looking at chromosome 11, two SNPs were located within Solyc11g010490 at 82 kb from the J-1 gene, a MADS-box 512 transcription factor implicated in the differentiation of the pedicel abscission zone and the 513 maintenance of the inflorescence meristem [69,70] and a further one within Solyc11g071680, which encodes a serine/threonine kinase (homolog of the Arabidopsis TOUSLED). This is particular interesting because it encodes a serine/threonine protein kinase that is required for leaf and flower development and is involved in the regulation of RNA interference [71,72]. These functions are congruent with the other marker-trait associations that we have detected for this gene (i.e., fruit size and shape).

Climatic Data
Genome-wide association analysis on bio-climatic data relative to each collection site resulted in numerous MTAs distributed among all of the chromosomes (Table 5). Among these MTAs detected, one on chromosome 2 (ch2_130) is located within Solyc02g071510, a transcription factor GTE12, which is associated with bio13, bio15 and bio16 (i.e., precipitation of wettest month, precipitation seasonality, precipitation of wettest quarter). The T allele of ch2_130 is private to the exotic landraces group (Supplementary Table S8). On chromosome 3, we found two more loci with an allele private to exotic landraces. Both of these MTAs were associated with isothermality and were detected within Solyc03g117760 and Solyc03g121660, genes that encode an un-known protein and a homolog of a zinc-finger protein, respectively (Supplementary Table S9).
We identified three additional MTAs associated with precipitation on chromosome 3. The loci ch3_275 and ch3_276 were correlated to bio 13 (i.e., precipitation of wettest month) and were both identified within Solyc03g121000, a gene encoding a PAF1-like protein (Supplementary Table S9). This protein regulates all stages of the RNA polymerase (Pol) II transcription cycle and in Arabidopsis it has been shown to transcriptionally regulate FLOWERING LOCUS C [73,74].
On chromosome 5, we detected a high number of MTAs. One within Solyc05g015510 that it is associated with precipitation (i.e., precipitation of driest month) and encodes SlySBP10 a SQUAMOSA promoter-binding protein (Supplementary Table S9). This protein family has key roles in plant growth and development, including flowering time, shoot architecture and fruit setting and ripening [75][76][77][78]. We singled out other MTAs along the remaining chromosomes in regions that might be involved in responses to different climatic factors. Examples include Solyc06g083150, located 127 kb from a FRIGIDA-like gene. In Arabidopsis, FRIGIDA regulates flowering transition (vernalization) by activating a central flowering repressor that is encoded by flowering locus C [79]. Others are Solyc07g055660, located 26 kb from a heat-stress transcription factor; Solyc08g076930, a transcription factor for jasmonic acid 3 that is 72 kb from a RAMOSA transcription factor (Supplementary Table S9). In maize, RAMOSA genes act during early inflorescence development and determine the fate of axillary meristems [80].

Discussion
We have here investigated the value of a wide collection of tomato landraces for association mapping and detection of both novel loci and candidate genes that might be useful for designing future genomic-based breeding strategies or tomato gene-editing approaches. The collection is enriched by Italian, in particular Sardinian, landraces that represent a wealth of genetic local resources that might be exploited in future conservation and breeding studies. We investigated the collection in different environments and for different traits, including bio-climatic variables recorded for the country of cultivation. We also tested three different GWA methods to determine whether relevant differences might emerge from conventional and digital phenotyping.

Phenotypic and Molecular Diversity
The diversity levels and heritabilities measured across phenotypic traits showed that our tomato collection is adequate for GWAS. We have, indeed, shown that these materials are widely diverse for all of the investigated traits. Sardinian landraces in particular show a level of diversity similar to that of the group of exotic landraces. This is particularly relevant if we consider that the exotic landraces were chosen from among those available at the Centre for Genetic Resources (CGN, NL) to be a worldwide sample. On the other hand, the higher levels of genetic diversity observed within modern and vintage cultivars underlines that the breeding programs might have led to inclusion of different loci in these accessions [11,16].
Previous studies performed on tomato landraces and commercial cultivars using SNP markers revealed comparable levels of diversity and similar levels of heterozigosity when looking at groups of landraces from different regions [16,[81][82][83]. Moreover, the partitioning of genetic variance has indicated that most of this diversity resides within varietal groups and not between them. Interestingly, novel alleles detected within both the Sardinian and the exotic landraces indicated that we have assembled a wide heterogeneous group with some unique diversity, which demonstrates the value of incorporating local and regional accessions when attempting to enrich the reservoir of crop genetic diversity. Indeed, nine of the 52 private alleles from the exotic landraces were associated with climate variables (mainly precipitation related). Private alleles are important because they indicate the presence of loci that can be exploited in future breeding and conservation programs or in genetic diversity studies [84]. For us, this is particularly interesting if we consider that these loci that are also associated to climatic variables, might be under selection. On the other hand, none of the 11 Sardinian private alleles were associated to any trait in this study.
The genetic distance observed among the groups of varieties was not high, which is consistent with the nature of domesticated tomato, for which high phenotypic but low genetic variation has been shown previously [1,11,16,85]. Concurrently, when we used all of the traits to investigate the phenotypic variation among accessions, the three main phenotypic clusters did not show any straightforward overlap with the genetic subdivision. This has been previously observed in studies on tomato and pepper [11,86]. Nonetheless, we observed a significant correlation between genetic and phenotypic distances and a clear tendency of accessions to group according to a small number of key traits, including fruit size and shape, pericarp thickness, carotenoid content and sugar content, similar to previous studies [11,16,81]. On the other hand, we did not observe any clear correlation between the genetic or phenotypic subdivisions and the provenience of our materials, as the landraces (either exotic or Sardinian) and the modern/vintage cultivars were distributed all over the detected groups.

Genome-Wide Association Study Results
The population structure and the linkage disequilibrium analyses performed prior to GWAS have further shown that our population was adequate for association analysis. While the LD levels and LD decay on some chromosomes were comparable to those from previous studies, these were quite variable among chromosomes [16,82,87]. In particular, the high levels of LD observed on chromosome 5 were similar to those from Ruggieri et al. [88], who used a set of genomic SNP markers to characterise a collection of cultivated tomatoes. To account for the high LD levels and decay on some chromosomes, following the example of previous studies [89,90], we reduced the average LD within chromosomes by correcting for population structure and genotype relatedness.
Among the methods used to perform these GWAS, FarmCPU and QTCAT shared the highest numbers of associations while the standard MLM method, showed the least number of shared associations with the other two methods. These data indicate that some associations are highly robust and detected by all methods and that weaker associations might best be identified using multiple strategies. In particular we also chose to use QTCAT to determine whether the associations detected were comparable with those from previous consolidated GWAS models [17,[19][20][21]. Few studies have exploited the QTCAT model until now and results have shown consistent results with previous methods [22,91]. The present results indeed show that this method is efficient and has allowed the identification of novel associations that MLM and FarmCPU did not detect. In particular, the MLM model showed the poorest results.
Differently from many GWA studies that rely on the genomic variation, the present study was based on the transcriptome variation among genes and accessions. The RNAseq based SNPs genotyping has been proven to be valuable for QTL mapping and functional and evolutionary studies [92,93]. Concurrently, in the present study, the variation among landraces and cultivars at the RNAseq level was sufficient to detect numerous associations. Indeed, we showed the association between diverse crop traits to previously identified candidate genes and we were able to suggest that some of the target genes might either have pleiotropic effects on different traits or show co-expression due to linkage. This has also been observed in previous GWAS and QTL studies [14][15][16]94]. As our population was variegated with no clear breeding history, such as in Blanca et al. [81] and Lin et al. [2], the observed genetic diversity might reflect complex metapopulation dynamics, with some loci under divergent selection and others under balancing selection, as previously seen in tomato, sorghum and maize [95][96][97][98]. Indeed, the level and structure of the diversity of cultivated tomato derives from historical and recent gene flow between wild tomato species and cultivars that allowed the dispersal of local varieties across a wide range of environments, with a complex evolution (i.e., domestication, improvements) [2,97,99].
The GWAS results have mainly targeted genomic regions that underlie genes relevant for fruit size and shape traits. This is a direct consequence of our phenotypic analysis, which mainly focused on fruit traits, in line with the relatively thorough phenotypic characterization, as either detected by conventional or digital phenotyping. This is also a good way to evaluate the appropriateness of the present collection for GWAS studies. Indeed, a large number of associations were detected with data from the Tomato Analyzer, which was also successfully used in previous studies [58,86,100]. In particular, this method has been effective in associating different fruit shape categories to different alleles of SUN, OVATE, FASCIATED and LOCULE NUMBER, genes that control the elongated and flattened fruit shape in tomato [55,58,101,102].
Additional traits identified by these authors as predictive for these categories were fruit shape index, distal end protrusion, width widest position, proximal end blockiness, rectangular, distal angle and proximal eccentricity. Here, we also detected many of these associations, such as that with Proximal_angle_macro near OVATE or with fruit-shape-related traits near FASCIATED. However, interestingly, along with these associations, we also detected further associations linked to SlCLV3 (homolog of Arabidopsis CLAVATA3), which underlies the fasciated mutant phenotype [56,101,[103][104][105], and a TONNEAU1-recruiting-motif gene. Together with OVATE, different genes from this family regulate early ovary development, which affects final fruit shape [53].
Along with studies mainly based on mutants, more recent GWAS based on genomic SNP, have detected significant polymorphisms within the same region [16,106]. Among associations detected for the fruit traits described in the present study, some were exclusive to conventional phenotyping and others to digital phenotyping. This demonstrates that the ability of the human eye to synthesize multiple features cannot yet be completely replaced by the available technology.
Many MTAs located either within or near transcription factor genes influenced fruit setting (OVATE-like, WUSCHEL-related, YABBY-like). These show that this combined regional and global tomato collection is useful for targeting genes with prominent roles in fruit. As an example, a single SNP (ch2_151) was identified within a gene that encodes a proteasome subunit α type protein that appears to have a key role during fruit setting [107,108]. This MTA was identified by all three of the GWAS methods for multiple fruit traits (detected by both conventional and precision phenotyping) near ANANTHA, which regulates inflorescence branching and floral organ identity [109], and in the region of additional MTAs for fruit shape and size.
Together, these results suggest that the diversity of this germplasm is sufficient to gain further insights into the role and function of genes that underlie important traits. Indeed, further associations were found in regions that were previously targeted as important for fruit color, carotenoids and sugar content, while others targeted novel genes that might be of relevance for quality traits, such as fruit color and β-carotene content on chromosomes 1 and 4 [59,[110][111][112].
Using GWAS on SNPs from RNA-Seq extracted from mature fruit, we did not expect a direct implication for most of the present polymorphisms in the expressions of genes involved in plant development and phenology. However, we would expect a relevant number of these genes to be in LD (physical or functional) with others expressed in the fruit tissue. The significant RNA variation has indeed allowed the identification of associations for growth habit. These were often at short distances from transcription factors that have determinant roles in the regulation of plant growth and development, including tomato ripening and responses to environmental stimuli [64,[113][114][115].
As revealed by a recent study from Wu and colleagues [53], genes of the OVATE and TONNEAU1 recruiting motif families not only control fruit shape in different domesticated plants (e.g., tomato, potato, melon, cucumber) but also interact to regulate other plant organs, such as leaves and flowers. This is in line with the present results, as among the different MTAs, some show joint association (either functional or as a result of hitchhiking during selection) with traits that underlie fruit development and plant and flower development, as also flowering. The MTA within Solyc06g066320, the gene encoding for a IWS1 transcription factor, is particularly interesting because this protein is highly conserved in different organisms (from humans to yeast) in Arabidopsis it is required for gene expression induced by brassinosteroids [60,116]. Moreover, the region around this gene spanning nearly 0.5 Mb, is enriched of genes encoding for different transcription factors, also related to fruit development and ripening [114,117]. In particular, the closest to IWS1 is a homolog to KANADI2, that together with KANADI1, appeared to be involved in the development of the carpel in Arabidopsis, as well as in the establishment of polarity of lateral organs, including leaf and flower, in Arabidopisis, rice and other species [118][119][120]. The function of Solyc06g066320, together with its simultaneous associations with different traits (i.e., fruit quality, inflorescence, fruit shape and size), suggest a possible pleiotropic role for this gene. Being pleiotropy a target for many GWA studies, variable strategies have been designed for the identification of causal variants and their roles that will help to extract the huge amount of information that is being produced [13,121,122].

Climatic Data
Previous studies based on climatic data have revealed gene expressed regions that might be involved in the answer to different environments [123][124][125][126][127][128]. They also suggest a role for the target genes in responses to environmental effects during plant development, such as those involved in regulation of the circadian clock, DNA methylation or flowering delay.
Among the most interesting results, we found MTAs identified on chromosome 2 for variables related to precipitation, such as ch2_130, with its "T" allele being private to the exotic landrace population. This MTA is located within Solyc02g071510, which encodes a transcription factor (GTE12) that in Arabidopsis is implicated in the cytosolic Ca 2+ increase after application of stress, such as heat, drought and cold [129,130]. Three additional MTAs were also private alleles of the exotic landrace population and all were associated to bioclimatic variables, which suggest that these variants might be a signal of adaptation. Previous studies, performed on sorghum landraces have evidenced that the environment of origin have had a significant role in shaping SNP allele variation and suggested that these genomic signatures of adaptation might be used for crop improvement [126][127][128]131].
Further associations were detected across the remaining chromosomes in genomic regions that appear to be involved in plant responses to climatic changes. As an example, we indicate the MTAs located within a gene that encodes a heat stress transcription factor or within SlMYC2, a jasmonic acid transcription factor involved in fruit chilling tolerance in tomato [132]. Others are located near genes that encode a FRIGIDA-like or other heat-stress proteins, as well as in close proximity to genes implicated in the regulation of flowering or inflorescence architecture. Since tomato cultivation is hindered by high temperature and drought, these results might help to target genes that will be useful to develop new varieties that are better adapted to changing environments.
The present results indicate that we can select the accessions that carry the polymorphic alleles to create specific crosses with controlled fruit size and shape and, for example, improved quality characteristics or that might be linked to specific climate responses.

Conclusions
We demonstrate here the value of exploring the genetic diversity of local and regional accessions, to exploit the variation of this genetic diversity reservoir to gain knowledge and facilitating crop improvements. Our results indicate that, as well as the relevant associations detected for fruit size, shape and quality, there is correlation between the expression of fruit traits and other traits related to plant development and flowering, and the likely response to climate variations. As recently demonstrated in tomato, the human-driven selection for specific traits during domestication has resulted in direct effects on other secondary traits or indirect influences in the expression of other genes to linkage drag [3,133]. Some of the present outcomes might be used to investigate gene functions in basic genetic studies or in the post-GWAS era, to be exploited in gene editing programs, and for genomic selection and breeding. The present collection is enriched in landraces that, along with the wild species, represent a reservoir of diversity that might translate into improved crops and/or crops adapted to ever-changing consumer preferences.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/11/1387/s1, Figure S1: Tomato fruit collected for the carotenoids content analysis. Different varieties of the tomato collection harvested at the breaker (greenish color) and ripe (red color) stages (Photograph: C.M. Posadinu). Figure S2: Estimates of Pearson's correlations among all of the traits recorded. The red (negative) to blue (positive) color palette is used to indicate the strength of the correlation. White indicates no correlation. The codes for each trait are specified in Table S2. Figure S3: Most likely number of populations as obtained from STRUCTURE (A) and DAPC (B) analyses. (A) Delta K for increasing K performed over the 120 tomato accessions characterized by RNA-Seq analysis. Ten runs were performed for each K value. (B) The most likely number of genetic groups of four, as indicated by the lower Bayesian information criterion (BIC) value. Figure S4: Population structure as obtained from morphophenotypical and quality traits (A) and genetic analysis (B). A certain level of overlap is seen between the two clustering methods, although the overall variation captured by one of the two analyses is not fully representative of the other. Figure S5: Linkage disequilibrium (LD) versus physical distance, from chromosome 1 to chromosome 6. Left to right: Each plot represents four different LD measures per single chromosome, for the usual r 2 value, the r s 2 measure (r 2 corrected by the population structure), the r v 2 (r 2 corrected by the kinship), the r vs 2 measure (r 2 corrected by both the kinship of individuals and the population structure). Figure S6: Linkage disequilibrium (LD) versus physical distance, from chromosome 7 to chromosome 12. Left to right: Each plot represents four different LD measures per single chromosome, for the usual r 2 value, the r s 2 measure (r 2 corrected by the population structure), the r v 2 (r 2 corrected by the kinship), the r vs 2 measure (r 2 corrected by both the kinship of individuals and the population structure). Table S1: Accessions, names, origin, donors and details for the analysis performed on each accession. Table S2: Phenotypic traits (conventional, and from Tomato Analyzer) and carotenoid contents as registered for the tomato collection. Bio-climatic data are also given (as obtained for the 108 geo-referenced accessions). Table S3: Significant differences among the tomato accessions for the conventional quantitative traits in 2012 and 2013. Table S4: Shannon-Weaver index for the qualitative traits evaluated in 2012 and 2013. Table S5: Distribution and variance of the digital phenotyping descriptors, as evaluated in 116 tomato accessions in 2013. The ANOVA was applied to the data to test for significant differences among traits. Table S6: Significant differences among genotypes of cultivated tomato for all of the compounds analyzed. Table S7: Loading of the PCA-see Figure 1 for PC plot-performed using phenotypic traits, carotenoids and bioclimate variables. Table S8: Summary of private alleles by population. When the allele is also associated to a specific trait-based on GWAS results-it is indicated in column E. Table S9: Gene ontology based on Genome V2.4 and V2.5 and Tomato Genome_4.0. The columns from S to AB indicate the SNPs that have been detected as associated to different traits. Additional notes are given for each marker-trait association when present. Table S10: Distribution of marker-trait associations on the different chromosomes and genes within each chromosomes. Data are subdivided into groups of different phenotypic traits. Files (.txt format) with all the data are also given: Files with SNP and morphologic/quality/climatic data: SNP dataset in hapmap format: tmt_120g_2470SNPs.hmp.txt; Carotenoid data: tmt_carotenoids_traits_BLUPs.txt; Climatic data: tmt_climate_traits.txt; Conventional morphologic traits: tmt_conventional_traits_BLUPs.txt Morphologic traits as obtained from digital phenotyping: tmt_TomatoAnalyzer_traits_BLUPs.txt; Output of population structure: tmt_pop_str_K3.txt; Kinship matrix: kinship_120i_2470SNPs.txt.  Open access publication was granted by University of Sassari "Fondo di Ateneo per la ricerca 2020".

Conflicts of Interest:
The authors declare that they have no conflicts of interest or competing interests.
Availability of Data and Material: Raw RNA-Seq reads have been deposited into the NCBI sequence read archive (SRA) under accession PRJNA646818. All data necessary for confirming the conclusions of the article are present within the article, Figures, and Tables, and within Supplementary Tables and Figures.