In the past decade, the development of next-generation sequencing (NGS) methods, combined with newly developed genomic library preparation protocols, has provided the tools for relatively inexpensive discovery and genotyping of large numbers of loci useful in population genomics studies [1]. The ultimate way of obtaining genomic data from multiple samples is whole-genome sequencing (WGS). This approach maximizes the quantity of information gathered and opens up a wide variety of analyses; however, it is currently prohibitively expensive and computationally challenging [5], especially in non-model species with large genomes. The genomic variation relevant to most population genomic studies can be analyzed with reduced-representation genomic libraries (RRL) [6], in which only a few percent of the genome is sequenced. The most popular techniques use restriction enzymes to prepare DNA for sequencing (restriction-site-associated DNA: RAD). In recent years, many methods based on the RAD approach have been developed, differing in the number of enzymes used or in additional library preparation steps [8].
In this study, we focused on three RRL methods: RADseq [9], GBS (genotyping-by-sequencing) [10], and ddRAD (double-digest RAD sequencing) [11]. In RADseq, DNA is digested with a single, frequently cutting restriction enzyme. Barcodes and common adaptors carrying the first of two primers (P1) are ligated to the digested DNA; samples are then pooled, randomly sheared, and size-selected within a 300–700 bp window, and, in the final step, P2 adaptors are ligated. Only DNA containing both adaptors is PCR-amplified and sequenced in single-end mode [9]. GBS is a simplified protocol that also relies on a single restriction enzyme. Barcoded adaptors and common adaptors are randomly ligated to the digested DNA, and fragments from multiple samples are pooled. Short DNA fragments carrying both adaptor types are amplified and sequenced in single-end mode [10]. The ddRAD protocol uses two restriction enzymes: a rare and a frequent cutter. The first generates fragments, to which barcoded adaptors are ligated. The second enzyme replaces random shearing, improving the size-selection step. Finally, P2 adaptors are ligated, and the fragments are amplified and sequenced in paired-end mode [11].
Precision and repeatability of genomic data are essential in most types of population genomics analyses, particularly in genomic selection surveys [14]. However, the ability to accurately genotype SNP loci from restriction-enzyme-based sequencing is a major concern [8]. Problems occurring at the genotyping step fall into two groups: missing data and genotyping errors. These problems may originate from flawed library preparation and unoptimized bioinformatics processing [13]. When ignored, these biases may affect the inferences of downstream analyses to an unpredictable degree [15]. Optimizing library preparation and sequencing [13] mitigates sequencing errors. Unfortunately, some problems are unavoidable with restriction-enzyme-based methods, especially when DNA quality is low [8].
On the other hand, sequencing artifacts can be minimized to some degree at the data preparation and core bioinformatics stage [17]. Current bioinformatics tools for genotyping from RAD data follow three crucial steps: (1) raw data processing (preparation); (2) read alignment against a reference genome or de novo assembly of the sequence tags; and (3) variant calling and filtering [13]. These steps aim to provide reliable and precise data; however, due to the probabilistic nature of the bioinformatic algorithms, they can alleviate errors while also generating new mistakes that can profoundly affect the final results of downstream analyses [21]. Conversely, overly conservative filtering of genomic SNP data may cause data loss, leading to misestimation of genetic effects [12].
One of the main advantages of RAD methods for genotyping non-model species is that loci can be identified de novo, without a reference genome [8]. However, a reference genome can be beneficial: RAD loci predicted against a reference are useful for filtering SNPs from paralogous or repetitive sequences, identifying indel variation, and avoiding the calling of spurious loci caused by biological contamination [8]. With well-assembled and annotated reference genomes, identified RAD loci can be readily positioned along the genome and may become directly useful in association studies. However, only a few studies to date have compared the efficiency and precision of SNP discovery and genotyping between de novo and reference-aligned approaches [23].
Regardless of the nature of any genotyping problems, one way to monitor the level of inconsistency in SNP discovery and genotyping is to use technical replicates [13] or to include parent-offspring dyads, if permitted by the experimental design [23]. Mastretta-Yanes et al. [19] defined several types of possible errors that can be investigated using technical replicates while varying different parameters of the Stacks software [26]. Using ddRAD libraries, they analyzed a non-model plant species without a reference genome and demonstrated that, with technical replicates, it is possible to optimize and tune a de novo genotyping pipeline and to identify and mitigate sources of error [19].
Due to their foundational roles in terrestrial ecosystems and their broad economic importance, forest tree species have been thoroughly investigated in population genetics and evolutionary biology. The accumulated genomic resources, including genome assemblies of major tree taxa (Populus trichocarpa, Eucalyptus grandis, Pinus taeda, Olea europaea, and Quercus lobata), have accelerated the progress of population genomics in forest trees. The number of studies involving genomic data is continuously increasing, including works based on RAD approaches [22]. Due to their characteristic life-history traits [32], forest trees are highly heterozygous compared with other species [33]. The high level of genome-wide heterozygosity may complicate genome assemblies [31] and confound SNP discovery and genotyping in RAD-based experiments. However, studies aimed at optimizing RAD approaches in forest trees are limited [22].
In this study, we investigated the efficiency of SNP discovery and genotyping in two of the most important broadleaved tree species in Europe: common beech (Fagus sylvatica L.; abbr. FS) and pedunculate oak (Quercus robur L.; abbr. QR), both belonging to the Fagaceae family. Using the same set of individuals within each species, we evaluated the three RAD approaches mentioned earlier: RADseq, GBS, and ddRAD. Taking advantage of the reference genomes of beech and oak, we contrasted de novo and reference-aligned marker discovery approaches. Among the samples of each species, we included technical replicates (four individuals sampled twice), which enabled us to monitor the genotyping consistency of replicates while optimizing specific parameters of the applied bioinformatics pipelines. Our ultimate goal was to fine-tune the protocols and approaches to gather as many loci as possible with the fewest artifacts and the lowest proportion of missing loci. We believe that our findings will be useful for selecting the most appropriate RAD approaches for population studies in beech and oak, and will provide best-practice guidelines for processing RAD data in general.
In recent years, growing interest in reduced-representation genomic approaches and their applications has proved their usefulness in numerous studies [8]. However, the variety of available library types and analytical pipelines for processing these data can be confusing, especially for researchers with limited experience, despite the broad support of the research community. The use of technical replicates in the optimization of SNP identification and calling has been described in many studies [19], and it appears to be good practice for tuning pipeline parameters to the species and scope of the study, and for monitoring the quality of SNP identification. Our results confirm these findings and provide additional information in the discussion on how different types of libraries from the same species can be influenced by distinct genotyping strategies.
Testing technical replicates is inexpensive, and it should be an initial requirement when using RAD-based genomic libraries for marker discovery. The optimization step can provide initial insights into the expected range of results in a larger study and reduce the risk of project failure. The analytical process after data delivery is fast: evaluation of a few samples on a standard computer (e.g., a four-thread CPU with 8 GB of RAM) takes 24 to 72 h. Due to the rapid development of out-of-the-box solutions, both offline (e.g., dDocent [61]) and online (e.g., Galaxy [69]), the data analysis can be performed by staff only moderately versed in bioinformatics. However, to obtain informative SNPs, specific requirements for the isolated DNA must be met [70].
NGS is particularly sensitive to low DNA quality and vulnerable to errors that can emerge during library preparation [71]. Our results suggest that laboratory errors can lead to genomic region/marker dropout, as in the case of our RADseq dataset, despite a satisfactory number of reads per sample. However, RADseq is known to suffer from large proportions of missing data [73]. When outsourcing library construction and subsequent sequencing, research teams should focus on providing good-quality DNA or fresh raw material to avoid potential data errors [74].
Overrepresentation in de novo SNP data, as observed in our GBS datasets, can result from differences in genome size among individuals within a species [75], especially when a frequently cutting restriction enzyme is used, as illustrated by our GBS data. When the mapping approach is used, sequencing errors may decrease SNP numbers after poorly fitting bases are discarded. Nonetheless, the reference-based strategy provides data that may be more comparable with those of other studies, and it can also deliver additional information, e.g., annotation of SNPs [8]. This genotyping strategy should always be the first choice when a reference genome is available [77]. On the other hand, although the availability of plant genomes is increasing steadily, reference genomes are still scarce. In their absence, de novo genotyping is a reliable alternative, as demonstrated in other studies [78].
It should be noted that in this study we used an initial filtering of the SNP data (biallelic SNPs; present in 75–80% of samples; indels removed), which generally alleviates some common genotyping problems, including missing loci. In particular, filtering out loci present in fewer than a specified fraction of samples (e.g., 80%) appears to be an efficient way of pre-selecting reliable loci [8]. Therefore, varying the Stacks or Heap parameters had only a moderate effect on the size and quality of the resulting SNP datasets. After the initial optimization based on replicated samples, the choice of the correct parameters for genotyping the whole sample set is always the trickiest part of the study. We do not wish to prescribe exact values here, because these depend on the scope of the analysis, the species' biology, and even the sampling strategy (local or wide sampling). However, we share some observations and guidelines on how to perform the optimization process.
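The kind of initial filtering described above can be sketched as a simple pass over a genotype table. The toy code below keeps biallelic SNPs, drops indels, and requires a minimum call rate; the function name, data layout, and thresholds are our own illustrative assumptions, not part of any published pipeline.

```python
# Toy sketch of initial SNP filtering: keep biallelic SNPs, drop
# indels, and require a locus to be called in >= 75% of samples.
# All names and encodings here are illustrative assumptions.

def filter_snps(loci, min_call_rate=0.75):
    """loci: dict mapping locus id -> dict with 'alleles' (observed
    allele strings) and 'genotypes' (one entry per sample; None = missing)."""
    kept = {}
    for locus_id, locus in loci.items():
        alleles = set(locus["alleles"])
        # Drop indels: any allele longer than one base.
        if any(len(a) != 1 for a in alleles):
            continue
        # Keep only biallelic SNPs.
        if len(alleles) != 2:
            continue
        calls = locus["genotypes"]
        called = sum(g is not None for g in calls)
        if called / len(calls) < min_call_rate:
            continue
        kept[locus_id] = locus
    return kept

loci = {
    "snp1": {"alleles": ["A", "G"], "genotypes": [(0, 1)] * 8},               # kept
    "snp2": {"alleles": ["A", "G", "T"], "genotypes": [(0, 1)] * 8},          # triallelic
    "ind1": {"alleles": ["A", "AT"], "genotypes": [(0, 1)] * 8},              # indel
    "snp3": {"alleles": ["C", "T"], "genotypes": [(0, 1)] * 3 + [None] * 5},  # too much missing
}
print(sorted(filter_snps(loci)))  # -> ['snp1']
```

In practice, the same criteria are usually applied with dedicated tools rather than custom code; the sketch only makes the selection logic explicit.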
The assessment of the raw and filtered numbers of markers (i.e., biallelic SNPs, present in at least 6 out of 8 samples, indels removed) helps to detect library/data errors. For example, a significant loss of markers in the filtered set usually signals uneven genome coverage by reads, regardless of the genotyping method.
When assessing the number of markers reported under more restrictive genotyping parameters (e.g., m, depth, mapq), first check the number of SNPs remaining after filtering (with the abovementioned criteria). In some of our sets, more stringent genotyping did not bring the expected improvement but rather led to a loss of SNPs.
Pairs of replicated samples from the same individuals will always share a significantly higher proportion of good loci (GL) with each other than with any other sample. This observation can be used as a quality control tool to determine whether a swap of samples occurred.
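This quality-control check can be sketched in a few lines: for each sample, its best match by genotype agreement should be its own technical replicate, and anything else suggests a swap. The code below is a toy illustration (the sample names and the 0/1/2 allele-count genotype encoding are our own assumptions).

```python
# Toy sketch of using replicate concordance to detect sample swaps.
# Genotypes are allele counts (0/1/2), None = missing call.

def pairwise_agreement(g1, g2):
    """Proportion of loci called in both samples with identical genotypes."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)

def best_match(sample, genotypes):
    """Return the other sample with the highest genotype agreement."""
    return max((s for s in genotypes if s != sample),
               key=lambda s: pairwise_agreement(genotypes[sample], genotypes[s]))

genotypes = {
    "ind1_rep1": [0, 1, 2, 0, 1, 2, 0, 1],
    "ind1_rep2": [0, 1, 2, 0, 1, 2, 0, None],  # same individual, one missing call
    "ind2_rep1": [2, 1, 0, 2, 0, 1, 2, 0],
    "ind2_rep2": [2, 1, 0, 2, 0, 1, None, 0],
}
# Each replicate's best match should be its pair; otherwise suspect a swap.
print(best_match("ind1_rep1", genotypes))  # -> ind1_rep2
```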
To counter the threat of false-positive SNPs, which can occur alongside highly elevated numbers of markers, we suggest focusing on the proportion of good loci (GL) as the key indicator.
Increasing the minimum number of reads necessary to create a stack (m) will decrease the number of reads used and cause dropout of underrepresented markers, leading to decreased levels of GL and a higher proportion of ML (as in the case of GBS and ddRAD) and, where data uniformity across the genome is a problem, to shifted MA values (as in the case of RADseq).
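The effect of raising m can be illustrated with a toy per-sample depth table: loci covered by fewer than m reads simply become missing. This is a schematic sketch, not the actual Stacks implementation.

```python
# Toy illustration of how a higher minimum stack depth (m) drops
# underrepresented loci in one sample. Depth values are invented.

def loci_retained(read_depths, m):
    """read_depths: per-locus read counts for one sample."""
    return [locus for locus, depth in read_depths.items() if depth >= m]

depths = {"loc1": 12, "loc2": 6, "loc3": 3, "loc4": 2}
for m in (3, 6, 10):
    print(m, sorted(loci_retained(depths, m)))
# m=3 keeps 3 loci, m=6 keeps 2, m=10 keeps 1: stricter m means more
# missing loci, hence a lower GL and a higher ML proportion.
```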
If a reference genome for the species under analysis is available, optimization should be conducted using the reference-based approach. It can be expected to deliver results that are more reliable, and more comparable to those of other studies.
The SNP-calling procedure has a profound effect on the number and quality of markers. Both depth and mapq should be treated as filtering tools; high values for these parameters will significantly decrease the number of markers returned. If more stringent filtering is necessary, using an elevated mapq is preferable, as it has no effect on the key indicators, whereas highly elevated marker numbers accompanied by deteriorating key indicators are usually a sign of increased detection of false positive SNPs [19].
Mismatch thresholds, both at the individual level and in the joint catalog, should be adjusted with respect to the species' biology and the sampling strategy applied [66].
Genotyping by next-generation sequencing (NGS) of reduced-representation genomic libraries based on restriction enzymes has become a common approach to identify large numbers of genetic markers (mostly SNPs) uniformly distributed across genomes. However, the number and quality of RAD-based markers obtained in a particular study depend on many factors, including the quality of DNA isolation, the choice of RRL, the type of restriction enzymes, the sequencing design (which determines sequencing depth), and the bioinformatics pipelines used for SNP identification and calling [8]. Testing all of the possible variables is beyond the scope of a single study. In this paper, we briefly evaluated three RRL approaches (GBS, RADseq, ddRAD) and different methods of SNP identification (de novo or by mapping to a reference genome) to find the best toolset for future population genomics studies in two broadleaved tree species: F. sylvatica and Q. robur.
We found that the most promising approach, providing relatively large numbers of reliable SNPs, is to employ the ddRAD technique with a calling approach based on mapping sequence reads to a reference genome. Based on about 90 individuals per species, we identified ~28,000 and ~36,000 loci for beech and oak, respectively, under typical filtering criteria (MAF > 0.05; SNPs present in >80% of samples; LD r2 < 0.5). However, when the LD filtering was relaxed, these numbers increased to ~56,000 and ~59,000, respectively (Table 4). Based on technical replicates, we estimated that in ddRAD more than 80% of SNP loci should be considered reliable. Additionally, according to the reference genome annotations, more than 30% of the identified loci in both species could be related to genes. These findings provide solid support for the use of ddRAD-based SNPs in future population genomics, or even genomic selection, studies.
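The final filtering criteria (MAF > 0.05; greedy LD pruning at r2 < 0.5) can be sketched as follows. In practice this is done with dedicated tools; the toy code below, with invented locus names and 0/1/2 allele-count genotype vectors, only illustrates the logic.

```python
# Toy sketch of MAF filtering and greedy LD pruning on allele-count
# genotype vectors (0/1/2). All data and names are illustrative.

def maf(genos):
    """Minor allele frequency from diploid allele counts."""
    calls = [g for g in genos if g is not None]
    p = sum(calls) / (2 * len(calls))
    return min(p, 1 - p)

def r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    vx = sum((a - mx) ** 2 for a, _ in pairs) / n
    vy = sum((b - my) ** 2 for _, b in pairs) / n
    if vx == 0 or vy == 0:
        return 0.0
    return cov ** 2 / (vx * vy)

def prune(loci, maf_min=0.05, r2_max=0.5):
    """Keep loci with MAF > maf_min, greedily dropping any locus in
    LD (r2 >= r2_max) with an already kept locus."""
    kept = []
    for name, genos in loci.items():
        if maf(genos) <= maf_min:
            continue
        if any(r2(genos, loci[k]) >= r2_max for k in kept):
            continue
        kept.append(name)
    return kept

loci = {
    "a": [0, 1, 2, 0, 1, 2],
    "b": [0, 1, 2, 0, 1, 2],  # perfect LD with "a" -> pruned
    "c": [1, 2, 0, 0, 2, 1],  # uncorrelated with "a" -> kept
    "d": [0, 0, 0, 0, 0, 1],  # MAF ~0.08 -> passes the MAF filter
    "e": [0, 0, 0, 0, 0, 0],  # monomorphic -> dropped by MAF
}
print(prune(loci))  # -> ['a', 'c', 'd']
```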