The Evolution of Molecular Genotyping in Plant Breeding

The era of plant genotyping began in the early 1980s with the progress in molecular biology and nucleic acid research and the advent of molecular marker technology [...]


The Advent of Molecular Markers in Crop Research
The era of plant genotyping began in the early 1980s with progress in molecular biology and nucleic acid research and the advent of molecular marker technology. Genetic markers are nucleotide sequences that can reveal polymorphisms among a group of individuals [1]. Because they are stable, unaffected by the environment, and detectable in all vegetative and reproductive tissues, they are effective for DNA fingerprinting and for a wide range of applications in plant genetics [2]. Several characteristics make molecular markers highly suitable for genetic analysis: their capacity to identify polymorphisms, their widespread distribution throughout the genome, their mode of inheritance, their reproducibility, and their cost-effectiveness. Restriction fragment length polymorphism (RFLP) was a pioneering example of the successful use of markers in gene mapping studies [3] (Figure 1). RFLP relied on the generation of DNA restriction fragments by digestion with specific restriction endonucleases. Polymorphisms were then identified by separating these fragments via gel electrophoresis and detecting them with Southern blot techniques [4]. This hybridization-based method facilitated the establishment of the first linkage maps in several crop species [5,6,7], thereby paving the way to marker-assisted selection [8]. The advent of the polymerase chain reaction (PCR) [9] revolutionized molecular biology, leading to countless applications in various branches of science [10].
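The principle behind RFLP can be illustrated with a minimal in silico sketch. The sequences below are invented toy alleles, and the helper `digest` is a hypothetical function written for this illustration; only the EcoRI recognition site (G^AATTC) is taken from standard knowledge. A single point mutation that abolishes one restriction site changes the set of fragment lengths, which is what would appear as a different banding pattern on a gel.

```python
import re

def digest(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths after cutting `sequence` at every `site`.

    `cut_offset` is the position of the cut within the recognition site.
    A lookahead is used so that overlapping sites would also be found.
    """
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

# EcoRI recognizes GAATTC and cuts after the first G (G^AATTC).
ECORI_SITE, ECORI_OFFSET = "GAATTC", 1

# Toy alleles (assumption: invented sequences, not real loci).
allele_a = "ATGC" * 5 + "GAATTC" + "TTAGC" * 6 + "GAATTC" + "CCGTA" * 4
# Allele B carries a point mutation (A -> G) that destroys the first site.
allele_b = allele_a.replace("GAATTC", "GAGTTC", 1)

fragments_a = digest(allele_a, ECORI_SITE, ECORI_OFFSET)
fragments_b = digest(allele_b, ECORI_SITE, ECORI_OFFSET)

print("Allele A fragments:", fragments_a)
print("Allele B fragments:", fragments_b)
```

In this toy case allele A yields three fragments and allele B only two, so the two alleles are distinguishable by fragment length alone. A real RFLP assay additionally depends on probe hybridization, which this sketch does not model.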
In the field of plant biology, various types of PCR-based markers have been proposed and made publicly available. Some arbitrarily amplify DNA sequences without the need for prior sequence information, such as Random Amplified Polymorphic DNA (RAPD) and Inter Simple Sequence Repeat (ISSR), whereas others detect specific sequence regions of the genome, such as Simple Sequence Repeats (SSR) and Cleaved Amplified Polymorphic Sequences (CAPS). Several other markers, differing in a wide range of characteristics and targeted to specific applications, have been developed [11]. These markers have significantly accelerated agricultural genetic research and have been used in numerous studies for purposes such as germplasm characterization, development of molecular maps, genetic mapping of key genes, phylogenetic investigations, and forensic applications [1,2,11]. Over the past three decades, PCR-based markers have contributed tremendously to plant breeding investigations aimed at dissecting the basis of complex traits [12]. However, the detection methods associated with these markers are laborious, time-consuming, and poorly repeatable, which has posed certain limitations. Consequently, the use of many of these markers has gradually decreased.
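As an illustration of what a sequence-specific marker such as an SSR looks like at the sequence level, the following sketch scans a DNA string for tandem repeats. The function `find_ssrs`, its thresholds (motifs of 2 to 6 bp, at least 4 repeats), and the example sequence are assumptions chosen for this illustration; dedicated SSR-mining tools apply more elaborate rules.

```python
import re

def find_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6, min_repeats: int = 4):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    The motif is matched lazily so that the shortest repeating unit is
    preferred; the backreference then requires it to recur in tandem.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

# Toy sequence (assumption: invented) containing an (AC)5 microsatellite.
example = "TTGACACACACACAGGT"
print(find_ssrs(example))
```

Individuals that differ in the number of repeats at such a locus produce PCR amplicons of different lengths when amplified with primers flanking the repeat, which is the polymorphism an SSR marker scores.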
Scientists have long strived to increase the number of markers in order to finely characterize experimental recombinant populations, dissect QTLs, and precisely map genes of agricultural interest [13]. It has indeed been demonstrated that boosting the frequency of advantageous marker alleles increases the likelihood of producing superior genotypes, thereby facilitating gene pyramiding [12,14]. Reduced laboratory workload, increased marker throughput, and automation at affordable cost were all targets achieved in the new millennium thanks to the advent of next-generation sequencing technologies. While microsatellites maintained a fair balance in terms of abundance across the genome, polymorphism, and automation [15], Single Nucleotide Polymorphisms (SNPs) have made a breakthrough in plant genotyping [16]. SNPs represent the richest form of genetic variation, occurring at high frequency across the genome. Their nature makes discovery and screening applicable to different genotyping systems and flexible across analysis platforms. The transition from expressed sequence tags (ESTs) to whole-genome and transcriptome sequencing [17] increased the number of discovered SNPs exponentially, from a few hundred to millions. These markers are nowadays frequently used in genomics-assisted breeding programs owing to the increased availability of sequence data in many plant species.

Platforms for High Throughput Plant Genotyping
Since the establishment of DNA sequencing technologies, the application of SNP markers has gained great attention as an emerging strategy for plant genotyping. While traditional Sanger sequencing has proven to be a reliable technique for producing high-quality molecular DNA fingerprints, its limited throughput has hindered its applicability in numerous scenarios [18]. Technological advancements sped up the development of next-generation sequencing, improving effectiveness and providing larger genome coverage at higher throughput and lower prices, which led to the development of different array-based genotyping and reduced representation sequencing methods [18]. The transition from low-density markers to large-scale SNP detection in plants was initially facilitated by array platforms. The advantages of arrays included the possibility of designing custom hybridization probes to be tested at a range of multiplex levels, high call rates, and robust SNP calling [19]. Diversity Arrays Technology (DArT) was the first medium/high-density genotyping method as well as the first example of genomic complexity reduction. DArT involved the generation of genetic probe libraries through endonuclease restriction of genomic DNA and subsequent hybridization on the array [20].
Although widely exploited at the beginning of the 2000s, the strategy did not provide a priori the sequence information of the probes or their link with genes. Towards this goal, further advancements were made with the Illumina GoldenGate, Illumina Infinium, and Affymetrix Axiom technologies, which are so far the most widely used array platforms for SNP genotyping in plants [18,19] (Figure 1). The Illumina microarrays use beads coated with allele-specific oligos and fitted into microwells to enable highly multiplexed SNP identification, whereas Affymetrix uses probes synthesized in situ on an array to create a GeneChip. In the former system, the beads for each probe are arranged at random on the array and decoded using tags, while in the Affymetrix system the positions of the probes are predetermined [21]. Beyond technical differences, both systems have been demonstrated to be efficient, providing comprehensive marker coverage across the genome. SNP arrays commonly encompass from a few thousand SNPs, such as the BARCSoySNP6K in soybean [22], to tens of thousands or more, such as the 135 K Axiom Exome Capture in wheat [23]. Furthermore, high-density arrays have been introduced in wheat, comprising 600 thousand and 700 thousand SNPs, respectively [24,25]. The primary focus in array design lies in the positioning of SNPs within euchromatic regions and diverse haplotype blocks, rather than in the total number of SNPs. A survey of 28 genotyping arrays demonstrated that, beyond genome-wide coverage and imputation quality of single nucleotide variants, the genes represented are the main criterion for determining the effectiveness of an array [26].
A huge leap in plant genotyping has been achieved with the implementation of reduced representation library (RRL) sequencing approaches [18]. These methods reduce genome complexity and generate genome-wide high-density markers. Among these, genotyping by sequencing (GBS) and restriction-site associated DNA sequencing (RADseq) have been the most widely used to generate millions of short-read sequences for variant discovery. Both involve similar procedures for library preparation and sequencing, as well as common bioinformatic pipelines for aligning the sequenced tags, with or without a reference genome, and for SNP discovery. While GBS has become a widely used term to describe the strategy of genotyping loci through sequencing, additional RRL methods with distinctive features have been developed (Figure 1) [27]. Commonly, the number of SNP markers developed with RRLs ranges from a few to a hundred thousand, depending on the analysis pipelines and the filtering criteria used [28,29].
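The dependence of the final marker count on filtering criteria can be made concrete with a small sketch. It assumes diploid genotypes coded as the number of alternate alleles (0, 1, 2) with `None` for missing calls, and filters on two commonly used criteria, call rate and minor allele frequency (MAF). The function names, the thresholds (0.8 and 0.05), and the toy genotype matrix are illustrative assumptions, not values taken from the cited studies.

```python
from typing import Optional

Genotype = Optional[int]  # 0, 1, 2 = alternate allele copies; None = missing call

def call_rate(calls: list[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in calls) / len(calls)

def minor_allele_frequency(calls: list[Genotype]) -> float:
    """MAF computed from non-missing diploid calls (0.0 if none observed)."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)

def filter_snps(matrix: dict[str, list[Genotype]],
                min_call_rate: float = 0.8,
                min_maf: float = 0.05) -> list[str]:
    """Return the SNP identifiers that pass both thresholds."""
    return [
        snp for snp, calls in matrix.items()
        if call_rate(calls) >= min_call_rate
        and minor_allele_frequency(calls) >= min_maf
    ]

# Toy matrix (assumption: invented data), 4 SNPs x 10 samples.
toy = {
    "snp_1": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],            # polymorphic, complete
    "snp_2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],            # monomorphic
    "snp_3": [0, None, None, None, 2, 1, 0, 1, 2, 0],   # call rate 0.7
    "snp_4": [2, 2, 2, 2, 2, 2, 2, 2, 1, None],         # rare reference allele
}

print(filter_snps(toy))
print(filter_snps(toy, min_maf=0.10))
```

Tightening either threshold shrinks the retained set, which is one reason why marker numbers reported for the same library can differ considerably between studies.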
The main differences between array and GBS methods lie in whether loci are fixed or flexible and in the complexity of the analysis. With arrays, marker positions are fixed, and data analysis is simplified by graphical user interface (GUI) software. Conversely, GBS requires bioinformatics and coding skills, which, however, make the choice and quality of the loci used in downstream genomic analyses more flexible. These features allow SNP discovery through GBS to be improved by re-mapping the reads once updated versions of the reference genome become available. In contrast, arrays, being designed on a given reference genome, cannot be upgraded unless a new chip is developed. Another significant advantage of GBS is the absence of ascertainment bias, that is, the inability to detect polymorphisms that are not present in the population used to develop the array; GBS therefore allows a more comprehensive assessment of genetic variation. However, GBS methods have a major drawback in that they often sample genetic variation randomly, particularly within repetitive and intergenic regions. This randomness reduces the potential for candidate gene analysis, as these regions are less likely to contain genes of interest.
Single primer enrichment technology (SPET) is another cutting-edge and flexible method for targeted genotyping that uses specialized probes. This method, patented by NuGEN® [30], is the most recent approach for high-density marker discovery using short-read sequencing. SPET combines the key characteristics of RRLs and arrays by sampling the genetic variation inside the gene space, increasing the likelihood of discovering causative polymorphisms. The technology is scalable in terms of design and number of probes, and it detects thousands of genome-wide polymorphic SNPs [31].

What's Next
In the past decade, short-read sequencing has dominated the panorama of plant genotyping. Several research groups opted for second-generation sequencing technologies because of their affordable costs and the possibility of obtaining a medium or high number of SNPs even with low DNA concentrations and average quality parameters. However, despite the progress made, candidate gene discovery using arrays or RRLs is still limited by the lack of causal loci detection, particularly for complex traits. Therefore, whole genome resequencing (WGR) initiatives are being undertaken more frequently, providing higher genomic resolution and increasing the effectiveness and power of genome-wide association mapping research [18,32]. WGR allows the detection of up to hundreds of millions of SNPs, markedly improving the accuracy of gene identification for further functional validation and cloning [33], and has been successfully applied to species with rapid linkage disequilibrium (LD) decay [34]. However, for species showing LD decay over large distances (e.g., wheat), the ability of GWAS to pinpoint candidate genes may be constrained [34]. Among methods using WGR, skim sequencing has been proposed for genotyping at low sequencing coverage (1×-5×) [35].
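To give a sense of what coverage values such as 1× to 5× imply, the following sketch computes mean coverage as reads × read length / genome size and, under the idealized assumption that reads fall uniformly at random on the genome (a Poisson model not stated in the source), the fraction of sites expected to be covered by no read or by at least a given depth. The genome size, read length, and read numbers are hypothetical round figures.

```python
import math

def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: total sequenced bases over genome size."""
    return n_reads * read_length / genome_size

def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of sites with zero reads under a Poisson model."""
    return math.exp(-coverage)

def prob_depth_at_least(coverage: float, k: int) -> float:
    """Probability that a site is covered by at least k reads (Poisson)."""
    below_k = sum(
        math.exp(-coverage) * coverage**i / math.factorial(i) for i in range(k)
    )
    return 1 - below_k

# Hypothetical example: 20 million 150-bp reads on a 1-Gb genome.
cov = mean_coverage(20_000_000, 150, 1_000_000_000)
print(f"mean coverage: {cov:.1f}x")

for c in (1.0, 3.0, 5.0):
    print(
        f"{c:.0f}x: uncovered = {fraction_uncovered(c):.3f}, "
        f"depth >= 2 = {prob_depth_at_least(c, 2):.3f}"
    )
```

Under this simplified model a substantial share of sites has little or no read support at 1×, which is why genotyping from low-coverage data usually relies on imputation or on pooling information across neighboring sites and individuals.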
Second-generation sequencing is limited by the short length of reads, which leads to an inadequate depiction of genome variation. Additionally, most of the variation investigated consists of SNPs and small insertions and deletions (INDELs), making it more difficult to analyze structural variations (SVs, e.g., long insertions/deletions, inversions, translocations, copy number variations). Long-read sequencing (LRS) instead improves genome assembly by accurately anchoring paralogous and repetitive sequences, enabling more precise identification of loci and better inference of structural variations [32]. However, LRS provides lower per-read accuracy than short-read sequencing because of the large number of sequencing errors related to the chemistry and technology [36]. These drawbacks can be overcome by applying bioinformatic pipelines for error correction and polishing. Nowadays, LRS is being used to perform de novo genome assembly and, in combination with short reads, to develop pangenomes. LRS is still prohibitive for genotyping many individuals, so it is difficult to predict when long reads will completely replace short reads. Nevertheless, it is reasonable to assume that the release of high-quality reference genomes, combined with the expected decrease in costs and the improvement of computational pipelines, will make this methodology more accessible in the coming years.
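The intuition behind error correction can be sketched with a column-wise majority vote over reads that cover the same region. This is a deliberately simplified illustration: it assumes the reads are already aligned, of equal length, and affected only by substitution errors, whereas real polishing pipelines must also handle insertions and deletions and typically use statistical or learned models. The function `consensus` and the toy reads are assumptions made for this example.

```python
from collections import Counter

def consensus(aligned_reads: list[str]) -> str:
    """Return the most frequent base at each column of pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )

# Toy pileup (assumption: invented reads); three reads carry one error each.
reads = [
    "ACGTTGCA",
    "ACGTAGCA",  # substitution at position 4
    "ACCTTGCA",  # substitution at position 2
    "ACGTTGCA",
    "ACGTTGGA",  # substitution at position 6
]

print(consensus(reads))
```

Because independent errors rarely hit the same position in most reads, the consensus recovers the underlying sequence even though individual reads are imperfect; an odd number of reads is used here to avoid ties in the vote.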
One of the primary challenges in crop genotyping pertains to the management of the large-scale molecular data generated by extensive sequencing. While the first genomes took several years to complete, with investments exceeding a billion, it is currently possible to sequence a whole genome in a day for a few hundred euros. Consequently, there is a pressing need for rapid processing and rigorous interpretation of data. To that end, graphics processing unit (GPU) supercomputers and artificial intelligence (AI) will play a key role by providing high memory and storage capacity and by shortening the run time of data analysis. Machine learning (ML) approaches can aid in the creation of statistical models that are continuously improved through the addition of new data, thereby facilitating the identification of novel variants and the assembly of genomes [37]. Various ML algorithms are currently being developed to improve the prediction of SNPs and INDELs from both short- and long-read sequences, thereby reducing the occurrence of false positives and enabling the timely detection of high-quality single nucleotide variants [38,39]. Future efforts will be directed towards the development of increasingly high-performance algorithms that can be applied to a wide range of species and read coverages.

Figure 1. The evolution of molecular markers and their applications in crop breeding and genetics. Throughput and the number of loci analyzed have increased from the DNA hybridization techniques of the early 1980s, through the development of PCR-based markers during the 1990s, to the establishment of next-generation sequencing markers after 2000. Genetic markers have been used for countless applications, including the dissection of the genetic diversity of germplasm collections and the investigation of phylogenetic relationships among plant species; the identification of QTLs (quantitative trait loci) and genes through QTL mapping and genome-wide association studies (GWAS); and the increase in efficiency of breeding programs through genomic selection approaches. (1) Hybridization based, RFLP: restriction fragment length polymorphism. (2) DNA amplification based, RAPD: random amplified polymorphic DNA; AFLP: amplified fragment length polymorphism; ISSR: inter simple-sequence repeat; SCAR: sequence characterized amplified region; CAPS: cleaved amplified polymorphic sequences; VNTR: variable number tandem repeat; TRAP: target region amplified polymorphism; SSR: simple-sequence repeats; COS: conserved ortholog set; SCoT: start codon targeted polymorphism; SRAP: sequence-related amplified polymorphism; DAF: DNA amplification fingerprinting; REMAP: retrotransposon-microsatellite amplified polymorphism; SSAPs: sequence-specific amplification polymorphism; KASP: kompetitive allele specific PCR. (3) Next generation sequencing (NGS) based, DArT: diversity arrays technology; RRLs: reduced representation libraries; GBS: genotyping by sequencing; RAD-seq: restriction site associated DNA with modifications in italics; CRoPS: complexity reduction of polymorphic sequences; MSG: multiplexed shotgun genotyping; SLAF: specific-locus amplified fragment; SPET: single primer enrichment technology.
