Advances in Integrating Genomics and Bioinformatics in the Plant Breeding Pipeline

With the global human population growing rapidly, agricultural production must increase to meet crop demand. Improving crops through breeding is a sustainable approach to increase yield and yield stability without intensifying the use of fertilisers and pesticides. Current advances in genomics and bioinformatics provide opportunities for accelerating crop improvement. The rise of third generation sequencing technologies is helping overcome challenges in plant genome assembly caused by polyploidy and frequent repetitive elements. As a result, high-quality crop reference genomes are increasingly available, benefitting downstream analyses such as variant calling and association mapping that identify breeding targets in the genome. Machine learning also helps identify genomic regions of agronomic value by facilitating functional annotation of genomes and enabling real-time high-throughput phenotyping of agronomic traits in the glasshouse and in the field. Furthermore, crop databases that integrate the growing volume of genotype and phenotype data provide a valuable resource for breeders and an opportunity for data mining approaches to uncover novel trait-associated candidate genes. As knowledge of crop genetics expands, genomic selection and genome editing hold promise for breeding disease-resistant and stress-tolerant crops with high yields.


Introduction
Humans depend on crops for over two-thirds of their daily energy intake [1,2]. As the global human population grows, agriculture (crop cultivation) is under increasing pressure to produce higher crop yields [3]. Additionally, climate change, limited availability of land and water shortages are posing further agricultural challenges [4]. To increase crop yields while reducing the environmental impact of agriculture, genomics is accelerating crop breeding by helping systematically leverage the genetic components of agronomic traits [5,6]. Crop genome sequences provide an important foundation for identifying agronomically relevant variation. During the last decade, the decreasing cost of DNA sequencing has led to a rapid rise in the size of crop genomic data, which represents a substantial opportunity for breeders [7].
Although plant genome assembly (generating a genome sequence from fragmented sequencing reads) is still hampered by frequent long repetitive regions, large genome sizes and frequent polyploidy, advances in sequencing technologies and bioinformatics tools have allowed rapid progress since the sequencing and assembly of the rice genome in 2005 [8].
Although rice was still sequenced using bacterial artificial chromosomes (BAC) and Sanger sequencing, the grape genome published in 2007 was the first to use a combination of the less costly 454 sequencing and Sanger sequencing [9]. Two years later, Illumina short reads were combined with Sanger sequencing to assemble the cucumber genome [10], marking the start of the rapid adoption of next generation sequencing (NGS) [11]. By 2013, 55 plant genomes had been sequenced, with 40 of these belonging to crops [12]. Third generation sequencing technologies capable of generating long reads greater than 10 kb in length were developed in recent years, providing a further useful tool for crop genome sequencing. Today, there are over 260 land plant nuclear genomes publicly available in GenBank, including most major crops.
Crop breeding has long relied on cycles of phenotypic selection and crossing, which generate superior genotypes through genetic recombination. When genome sequences are available, all genes and genetic variants contributing to agronomic traits can be identified and changes made during breeding processes can be assessed at the genotype level. Because of the ready availability of genomic data for breeders today, genomics plays an increasingly important role in all aspects of crop breeding, such as quantitative trait loci (QTL) mapping and genome-wide association studies (GWAS), where genomic sequencing of crop populations can allow gene-level resolution of agronomic variation. For example, advances in genomics-based breeding allow the identification of genetic variation in crop species, which can be applied to produce climate resilient crops [5,13,14]. In a different approach, genomic selection (GS) harnesses genome-wide genetic variants to avoid the need for repeated phenotyping in breeding cycles.
Bioinformatics is crucial for processing and analysing large genomic datasets and gaining functional insights into plant genomes [15,16]. Although genome assembly, sequence alignment and variant calling are standard bioinformatics tasks when analysing sequencing data, the algorithms required are non-trivial and there are different competing computational approaches with unique biases [17][18][19]. For alignment of third generation sequencing data, many tools developed for short reads perform poorly when aligning long reads [20]. A challenge for assembly and alignment tools is combining different data types such as short reads and long reads to reduce the impact of unique biases. Downstream analyses such as comparative genomic analysis, variant calling and GWAS can provide comprehensive information to facilitate crop improvement [21]. Variant callers differ in their ability to call indels and in their biases towards calling heterozygous and reference variants [19]. Software designed for carrying out GWAS may apply models of varying complexity to address the effects of population structure. Applying the right bioinformatics tool for the right application is therefore crucial. Although there is a wide range of tools available to mine genomes and variant data, processing the growing amounts of genomic data and selecting the appropriate analyses is also a leading challenge faced by researchers in crop genomics [22]. Adaptation of existing crop databases such as GrainGenes [23] and Gramene [24] as well as development of novel databases such as the Wheat Information System (WheatIS) [25] will help store the deluge of data and make it more accessible to breeders.
An interdisciplinary approach is needed for plant breeding in the 21st century to identify and resolve breeding challenges and improve crop production [26]. Together with novel glasshouse technologies to accelerate plant development [27], genomics and bioinformatics play an important role in increasing the production rate of improved crop cultivars. Nevertheless, the vast amounts of genotypic and phenotypic data available create an enormous challenge to integrate diverse data outputs for breeding [28]. Integrating phenotypes, genomics and bioinformatics tools and resources in public and private breeding pipelines will address this challenge and help deliver breeding targets [29]. In this review, we discuss advances in genomics and bioinformatics, suggesting how these can be integrated to allow precise breeding and overcome bottlenecks in crop improvement.

Third Generation Sequencing to Improve Crop Genome Assemblies
Since the introduction of NGS, genome sequencing and resequencing have become standard in many disciplines of plant biology [11]. However, there are important limitations of NGS, such as inherent biases and ambiguous alignment of repetitive elements, which lead to highly fragmented draft genome assemblies and complicate the study of hidden indels and structural variants [20]. The emergence of third generation sequencing, including Pacific Biosciences (PacBio) single-molecule real-time sequencing and Oxford Nanopore Technologies (ONT) sequencing, has enabled the generation of long reads and allowed production of more accurate and contiguous genome assemblies [30][31][32]. Third generation sequencing helps generate high-quality whole genome de novo assemblies, using reads that span complex regions such as those with high levels of repetitive sequence, and sheds light on the remaining complex repeat sequences and other structural variants. Moreover, the full-length sequenced transcripts produced by third generation sequencing techniques (isoform sequencing) allow the precise study of exons, splice sites and alternatively spliced regions, which is useful in improving genome annotation [30]. A fully assembled and well-annotated genome will allow breeders to discover genes related to agronomic traits, determine their location and function as well as develop genome-wide molecular markers.
Combining long-read sequencing with long-range mapping technologies and chromosome conformation capture has put highly contiguous chromosome-level crop genome assemblies within grasp even for smaller laboratories and non-model crop species [33]. New optical mapping platforms such as BioNano Genomics allow rapid labelling of long DNA molecules over 250 kb, enabling detection of structural variants and generation of high-quality scaffolding at low cost. For example, using PacBio sequencing and optical mapping from BioNano, the assembly of the desiccation-tolerant grass species Oropetium thomaeum achieves a contig N50 of 2.4 Mb with over 99.5% genome coverage [34], a contiguity as high as that of model plant genomes such as Arabidopsis (TAIR10), rice (V 7) and Brachypodium distachyon (V 2.1). Optical mapping data were also used to generate a high-resolution map of the wheat chromosome 7DS with contigs showing an N50 of 1.3 Mb [35]. Long-read sequencing also uncovers repetitive regions with high accuracy. In O. thomaeum, long reads helped identify 18 telomeric regions, nine centromeric satellites and 3247 intact long terminal repeats in 358 families [34]. These repetitive sequences could not have been captured by short reads, potentially resulting in a miscalculation of their content. Another third-generation mapping innovation is chromosome conformation capture sequencing (Hi-C) [36], which is based on ligation of DNA fragments that lie in close physical proximity in the nucleus. Integration of Hi-C data and optical mapping allows further improvement of chromosome phasing and scaffolding. By combining short reads and optical and chromatin interaction mapping data, Mascher et al. [37] assembled the highly repetitive barley genome, attaining an N50 value of 1.9 Mb. The higher level of sequence contiguity provided by third generation sequencing can facilitate genomics-based breeding approaches including trait mapping. For example, using a long read assembly of Arabidopsis, it was possible to define the growth-related SG3i QTL region that could not be previously resolved with Sanger sequencing [38].
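The N50 values quoted above summarise assembly contiguity: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following is a minimal sketch of that calculation; the contig lengths are hypothetical toy values, not taken from any of the assemblies cited here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in kb for readability)
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness, which is why it is usually reported alongside genome coverage as in the O. thomaeum example.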
The most powerful application of third-generation sequencing for breeding is the assembly of improved highly contiguous crop genomes. When selecting a sequencing approach for a crop genome assembly project, it is important to consider the size of the genome, ploidy, levels of repetitive content and the available funds. The current choice is mainly between PacBio, ONT and NGS, which can be used in combination with each other and supplemented with other long-range technologies such as BioNano and Hi-C. While costs for sequencing vary substantially between providers and countries, third generation sequencing can remain an order of magnitude costlier than NGS (Table 1). When sufficient funding is available, deep sequencing (>30×) using PacBio is likely to yield the most accurate, contiguous crop genome because it is less error prone than ONT sequencing [38] and existing sequencing errors can often be self-corrected at deep coverages [20]. However, the cost of PacBio deep sequencing may be prohibitive, particularly for large genomes. At a cost of $900/Gb with the PacBio RS II (Table 1), 30× coverage of the average-sized 389 Mb rice genome [8] would cost over $10,000. For genomes with repeat regions >100 kb and high levels of heterozygosity, the ultra-long reads (>100 kb) generated by ONT sequencing offer a unique advantage, as this is the case in which ultra-long reads improve assembly quality most [20]. When costs limit the depth of third generation sequencing to <30×, short reads can make an important contribution by error-correcting long reads before assembly [39] and by polishing completed assemblies through correction of single base errors and small indels [38]. For resequencing studies of crop populations, third generation sequencing is therefore generally not cost-effective and NGS remains the preferred approach. As the cost of third generation sequencing decreases, the application of these methods may extend into a broader range of genomic analysis beyond genome assembly.
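The cost estimate for rice follows from simple arithmetic: the number of gigabases required is the genome size multiplied by the target coverage, and the cost is that yield multiplied by the price per gigabase. A minimal sketch is given below, using the figures quoted in the text; the function and parameter names are illustrative only, and real projects would also need to budget for library preparation and failed runs, which are ignored here.

```python
def sequencing_cost(genome_size_mb, coverage, cost_per_gb):
    """Estimate raw sequencing cost for a target coverage.

    genome_size_mb: haploid genome size in megabases
    coverage:       target sequencing depth (e.g. 30 for 30x)
    cost_per_gb:    price per gigabase of sequence, in any currency
    """
    gigabases_needed = genome_size_mb / 1000 * coverage
    return gigabases_needed * cost_per_gb


# Figures from the text: 389 Mb rice genome, 30x coverage, $900/Gb
rice_cost = sequencing_cost(genome_size_mb=389, coverage=30, cost_per_gb=900)
print(f"Estimated cost of 30x coverage of rice: ${rice_cost:,.0f}")
```

With these inputs the estimate comes to roughly $10,500, consistent with the figure of over $10,000 given above. The same function makes it easy to see how quickly the cost scales for larger genomes at the same depth.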

Integrated Crop Databases
With third generation sequencing technologies and other 'omics' technologies emerging, large volumes of data are available to investigate crop traits from the gene level to the population level [50]. Although essential sequence repositories such as GenBank [51], European Molecular Biology Laboratory (EMBL) [52], PlantGDB [53] and Phytozome [54] play an essential role, they mainly focus on storing and managing genomic data without integrating variant or phenotype data from other sources. This makes it more challenging for breeders and plant biologists to link genotype to phenotype, which often requires data on genomics, epigenomics, phenotypes and environments. Although some crop databases integrating these data exist, for example, marker and expression data are integrated in GrainGenes [23], additional databases are needed to address this gap in major repositories [55].
The task of creating an integrative crop database by combining annotated genome sequences, gene functions, interaction networks and trait phenotypes is challenging as the relevant data are dispersed in numerous databases with various data formats and different quality and coverage. Intelligent mining of large-scale crop databases is required to merge complex data resources and allow gene discovery and crop improvement [56]. To integrate biological data from different resources, a web-based intelligent mining tool, KnetMiner (Knowledge Network Miner), has been developed to search for links and concepts in biological knowledge networks, which enables the discovery of novel connections between traits and genes [56]. There are four main steps in the KnetMiner approach: (1) integrating diverse biological data into a knowledge graph, (2) improving the knowledge graph with text-mining of the literature, (3) identifying the links between genes and evidence nodes and (4) applying the evidence-based gene ranking algorithm and visualising the integrated data. Currently, KnetMiner has been employed for constructing integrative databases for important crops such as barley and wheat, providing insights into indirect associations between distant traits and biological processes. In barley, a gene-evidence network was applied to infer a connection between the gene MLOC_10687.2 and the seed width phenotype, a finding with great potential for improving barley production [57]. Progress is underway in wheat and rice to further develop the single information systems available for these crops [25,58], allowing broad querying across integrated databases. With the ongoing development of integrative crop databases and advances in data mining techniques, breeders can better understand complex traits and identify trait-associated candidate genes, which is beneficial for crop improvement [59].
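The general idea of ranking genes by the evidence that connects them to a trait in a knowledge graph can be illustrated with a toy example. The sketch below is not KnetMiner's algorithm or API; it assumes a tiny hand-made graph with hypothetical gene, protein and process names, and simply scores each gene by the number of short paths linking it to the query trait.

```python
from collections import defaultdict

# Hypothetical knowledge graph edges: (source, relation, target).
# All identifiers are invented for illustration.
EDGES = [
    ("geneA", "encodes", "proteinA"),
    ("proteinA", "participates_in", "starch biosynthesis"),
    ("starch biosynthesis", "associated_with", "seed width"),
    ("geneA", "cited_with", "seed width"),  # e.g. a text-mined link
    ("geneB", "encodes", "proteinB"),
    ("proteinB", "participates_in", "cell wall organisation"),
    ("geneC", "cited_with", "seed width"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (source, relation, target) triples."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def count_paths(graph, start, goal, max_steps):
    """Count simple paths from start to goal that use at most max_steps edges."""
    def walk(node, visited, steps_left):
        if node == goal:
            return 1
        if steps_left == 0:
            return 0
        return sum(
            walk(neighbour, visited | {neighbour}, steps_left - 1)
            for neighbour in graph[node]
            if neighbour not in visited
        )
    return walk(start, {start}, max_steps)


def rank_genes(graph, genes, trait, max_steps=3):
    """Rank genes by the number of evidence paths linking them to the trait."""
    scores = {gene: count_paths(graph, gene, trait, max_steps) for gene in genes}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = build_graph(EDGES)
ranking = rank_genes(graph, ["geneA", "geneB", "geneC"], "seed width")
print(ranking)
```

In this toy graph the gene supported by both a direct literature link and an indirect pathway link ranks first, which mirrors the intuition behind evidence-based ranking. A real system would weight evidence types differently and handle far larger graphs.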

Mining Quantitative Trait Loci Studies
The analysis of QTL enables estimation of genetic regions linked to quantitative phenotypic traits, bridging the gap between genomics and the field [60]. However, with the increasing number of QTL studies being conducted and reported in plants, a new challenge in identifying high-quality candidate loci and further improving crop breeding is integrating information from different QTL studies. In this case, meta-analysis, a tool to pool the outcomes of a range of studies and predict the location of QTL more precisely than individual studies, is required to use existing resources fully [61]. Bioinformatics tools are available for efficiently carrying out meta-QTL analysis. For instance, using statistics and a consensus model, a computational package called MetaQTL can reduce the length of the confidence interval of QTL, leading to precise estimation of the correct QTL location and effect [62]. Other bioinformatics tools such as solQTL and RASQUAL further offer the convenience of low bias QTL analysis, visualizing the QTL data and linking the QTL data with other genome databases [63,64]. To date, meta-QTL analyses have been performed to map traits related to crop development and abiotic and biotic responses in maize [65], cotton [66], soybean [67] and wheat [68]. For example, meta-QTL analysis has been used to identify five groups of yield and yield-related candidate genes in wheat with 195 molecular markers and 197 ESTs reported from 55 wheat QTL studies in the last 14 years [68]. Moreover, 37 QTLs related to nitrogen use efficiency were identified in maize [65]. Similarly, 20 consensus QTLs and their related markers were narrowed down by meta-QTL analysis from a combination of QTL studies from the last 20 years, which provides a foundation for gene mining and crop improvement in soybean [67]. QTL mapping remains a powerful method to link an observed agronomic trait to a genomic region. It provides a high detection power for scanning the complete crop genome and identifies rare alleles using limited genetic markers. Because QTL mapping is hypothesis-driven, it is applied when a mapping population segregating for a trait of interest is available. However, QTL mapping suffers from two fundamental shortcomings: (1) low resolution caused by coarse mapping, making it hard to differentiate pleiotropic and physically adjacent genes [69], and (2) only allelic diversity present in the parents of the segregating population can be assayed [70]. To overcome these limitations of QTL mapping, GWAS can be employed to pinpoint genomic regions linked to traits in diverse, unrelated populations.
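Why pooling studies narrows the confidence interval of a QTL can be shown with a standard fixed-effect, inverse-variance meta-analysis. The sketch below is a simplified illustration and not the model implemented in the MetaQTL package; it assumes that all studies report the same underlying QTL on a shared consensus map, that the reported 95% confidence intervals are symmetric, and the positions used are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def pool_qtl_positions(qtls):
    """Pool QTL position estimates by inverse-variance weighting.

    qtls: iterable of (position_cM, ci_low_cM, ci_high_cM) tuples,
          one per study, on a common consensus map.
    Returns (pooled_position, pooled_ci_low, pooled_ci_high).
    """
    weights = []
    weighted_positions = []
    for position, ci_low, ci_high in qtls:
        # Approximate the standard error from the width of the 95% CI
        standard_error = (ci_high - ci_low) / (2 * Z_95)
        weight = 1 / standard_error ** 2
        weights.append(weight)
        weighted_positions.append(weight * position)
    pooled_position = sum(weighted_positions) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (
        pooled_position,
        pooled_position - Z_95 * pooled_se,
        pooled_position + Z_95 * pooled_se,
    )


# Hypothetical estimates of one QTL from three independent studies (cM)
studies = [(52.0, 42.0, 62.0), (48.0, 40.0, 56.0), (50.0, 44.0, 56.0)]
position, low, high = pool_qtl_positions(studies)
print(f"Pooled position: {position:.1f} cM, 95% CI: {low:.1f} to {high:.1f} cM")
```

The pooled interval is narrower than the narrowest single-study interval because each study contributes independent information. Deciding which QTL from different studies actually represent the same locus is the harder problem that dedicated meta-QTL software addresses.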

Genome-Wide Association Studies for Identifying Breeding Targets
In contrast to QTL analysis conducted on bi-parental populations derived from controlled crosses, GWAS relies on natural populations, providing higher resolution to identify multiple recombination events and explore the natural variations associated with phenotypical differences. GWAS leverages linkage disequilibrium to detect links between genotype and phenotype in crop species, achieving higher mapping resolution than QTL analysis [71]. When the aim of the breeder is an exploratory analysis to identify a broad range of genomic leads, GWAS is preferred to QTL analysis. Association studies are more likely than QTL analysis to identify specific candidate genes that can be directly introgressed into crop germplasm to improve crops [72]. GWAS has been carried out in a variety of crops such as rice, soybean, maize, wheat and canola [73][74][75][76]. In Oryza sativa indica, 517 landraces were sequenced, identifying around 3.6 million single nucleotide polymorphisms (SNPs) [73]. Using a GWAS of 14 agronomic traits, over 36% of the phenotypic variance could be explained by the identified loci, allowing further discovery of trait-related genes and alleles for crop improvement. Through a GWAS of maize, Tian et al. [75] revealed the architecture of leaf traits and found that variation in liguleless genes can result in upright leaves.
Advances in bioinformatics tools offer extra opportunities for conducting GWAS. For instance, PLINK is a widely used bioinformatics tool for GWAS, employing a standard regression analysis to associate genotypes with phenotypes [77]. However, for rare variants, standard regression cannot provide sufficient sensitivity for GWAS analysis [78]. TASSEL is another common GWAS tool and implements a mixed linear model incorporating population and family structure in the analysis [79]. Unlike PLINK, TASSEL can thus control for population effects. Other enhanced GWAS bioinformatics tools such as GAPIT have also been developed to efficiently handle large datasets containing over 1 million SNPs in 10,000 individuals using the compressed mixed linear model and model-based prediction and selection method [80].
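The standard regression approach mentioned above tests each marker separately by regressing the phenotype on allele dosage. The following minimal sketch shows a single-marker test in plain Python; it is not PLINK's or TASSEL's implementation, it makes no correction for population structure or kinship (which is what mixed linear models add), and the genotypes and phenotypes are hypothetical.

```python
import math


def marker_regression(dosages, phenotypes):
    """Regress a phenotype on allele dosage (0, 1 or 2) for one marker.

    Returns (slope, t_statistic). The marker must be polymorphic and
    the fit must not be perfect, otherwise a division by zero occurs.
    """
    n = len(dosages)
    mean_dosage = sum(dosages) / n
    mean_phenotype = sum(phenotypes) / n
    sxx = sum((d - mean_dosage) ** 2 for d in dosages)
    sxy = sum(
        (d - mean_dosage) * (p - mean_phenotype)
        for d, p in zip(dosages, phenotypes)
    )
    slope = sxy / sxx  # estimated effect of one extra allele copy
    intercept = mean_phenotype - slope * mean_dosage
    residual_ss = sum(
        (p - intercept - slope * d) ** 2 for d, p in zip(dosages, phenotypes)
    )
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / standard_error


# Hypothetical data: six lines, two SNPs, one quantitative trait
phenotypes = [10.1, 9.8, 11.2, 10.9, 12.1, 11.8]
snp1 = [0, 0, 1, 1, 2, 2]
snp2 = [0, 1, 2, 0, 1, 2]

slope1, t1 = marker_regression(snp1, phenotypes)
slope2, t2 = marker_regression(snp2, phenotypes)
print(f"snp1: effect={slope1:.2f}, t={t1:.2f}")
print(f"snp2: effect={slope2:.2f}, t={t2:.2f}")
```

In a real GWAS this test is repeated across all markers and the resulting statistics are converted to p-values and corrected for multiple testing. Rare variants give a small dosage variance (the `sxx` term), which inflates the standard error and illustrates why simple regression loses sensitivity for them.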

Forward and Reverse Genetic Screening
Forward genetic screening is a widely used breeding tool and identifies and characterizes genes based on a known phenotype [81]. Reverse genetic screening, on the other hand, determines the phenotypic effect of altered sequences of specific genes or regulatory regions [82]. The starting point of forward genetics, including most QTL analyses, is a target phenotype that segregates in a population. Reverse genetics, on the other hand, can be used to functionally characterize known genes identified by QTL analysis or GWAS, with a mutation panel or a transformed line as a starting point. The role played by both forward and reverse genetic screening is essential to identify functional variation associated with agronomic traits such as tolerance to abiotic and biotic stresses, disease resistance, increased yield and improved nutritional quality. Forward genetic screening can improve gene cloning and marker development. Rather than performing whole genome sequencing to identify many sequences that are not related to heritable phenotypes, exome sequencing used in forward genetic screening allows selective screening of coding regions, excluding intergenic sequences. In rice, it was shown that to recover induced mutations it is sufficient to sequence 20 Mb of the 389 Mb genome [83]. Reverse genetic screening has been applied to crop functional genomics and breeding. Targeted Induced Local Lesions IN Genomes (TILLING), a reverse genetic approach, can take advantage of conventional mutation induction and high-throughput mutation detection methods, providing the capability of recovering mutations from any genetic region and discovering novel phenotypes [84]. For instance, TILLING was used to induce homozygous mutations in two waxy genes (granule-bound starch synthase I, or GBSSI) in a progenitor of wheat that has the null waxy genotype [84]. Forward and reverse genetic screening can also be combined. After conducting reverse genetic screening in Lotus japonicus using TILLING, forward genetic screening was used to recover the alleles from a set of 275 cultivars with a 10-fold reduction of the cost compared with whole-genome screening methods [85].

Genomic Selection
Trait discovery using QTL analysis, GWAS or reverse genetics is not essential for genomics-based breeding. Particularly when targeting polygenic agronomic traits such as yield, minor effect alleles can be difficult to detect and assess using these methods. When confronted with complex traits that are difficult to introgress systematically, GS presents an additional breeding approach. GS relies on calculating the genomic estimated breeding values (GEBV) for sets of variants [86] based on a genotyped and phenotyped training population. By combining entire SNP marker sets in the prediction model and preventing biased marker effects from variations in small-effect QTL, GS overcomes inefficient translation of QTL analysis results from biparental mapping populations to breeding and insufficient capability of statistical approaches to identify polygenic loci [87,88]. The combination of GS with automated phenotyping techniques can further promote prediction accuracy of GEBV, shortening the breeding cycle [88]. Based on computational simulations, GS-based genomics-assisted breeding allows a four-year reduction in the breeding cycle of the pasture grass Lolium perenne compared with traditional breeding [89]. Additionally, GS in cassava showed a significant increase in predicted genetic gains from 39.42% to 73.96%, compared with phenotypic selection [90]. Genotyping-by-sequencing (GBS) allows low-cost de novo genotyping of crop plants [91,92] and can help to develop accurate GS models in crops with large genomes that are costly to genotype with whole genome sequencing at the population level [93]. GBS was used in elite wheat breeding lines to develop precise GS models, estimating genotypes with high yield and stem rust resistance [93,94]. GBS also delivered 55,000 SNP markers with high prediction accuracy from various maize breeding lines that were used for GS model construction [95].
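Once marker effects have been estimated from the training population, the GEBV of an unphenotyped candidate is commonly computed as the sum of its allele dosages weighted by those effects. The sketch below illustrates only this prediction step, under the assumption of a purely additive model; the marker effects and candidate genotypes are hypothetical, and in practice the effects would come from a statistical model such as ridge regression fitted to the training data.

```python
def gebv(dosages, marker_effects):
    """Genomic estimated breeding value under a simple additive model.

    dosages:        allele dosages (0, 1 or 2) of one candidate, one per marker
    marker_effects: estimated additive effect of each marker, same order
    """
    if len(dosages) != len(marker_effects):
        raise ValueError("dosages and marker_effects must have the same length")
    return sum(d * effect for d, effect in zip(dosages, marker_effects))


def rank_candidates(candidates, marker_effects):
    """Return (name, GEBV) pairs sorted from highest to lowest GEBV."""
    scored = [(name, gebv(dosages, marker_effects))
              for name, dosages in candidates.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical marker effects estimated from a training population
marker_effects = [0.5, -0.25, 0.75, 0.0]

# Hypothetical genotyped but unphenotyped selection candidates
candidates = {
    "line1": [2, 0, 1, 1],
    "line2": [0, 2, 0, 2],
    "line3": [1, 0, 2, 0],
}

gebv_ranking = rank_candidates(candidates, marker_effects)
for name, value in gebv_ranking:
    print(f"{name}: GEBV = {value:.2f}")
```

Because candidates are ranked from genotype data alone, selection can proceed without phenotyping every cycle, which is the source of the shortened breeding cycles described above.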

Beyond the Gene: Targeting Cis-Regulatory Elements for Crop Breeding
Cis-regulatory elements (CREs) such as promoters and enhancers regulate gene expression and may contain close to half of all variants influencing traits [96]. In crops, domestication traits are often caused by variants in CREs [97]. For instance, Wang et al. showed that a mutation in a rice CRE achieved the production of slender rice without decreasing yield production, by reducing the repression of the gene GRAIN WIDTH 7 [98]. Targeting CREs for breeding can be advantageous when the aim is not to knock out a gene entirely but to reduce or increase expression. Although CREs are not expressed like genes, making them more difficult to study, they are associated with open chromatin, which facilitates protein-binding. Open chromatin can be identified through DNase I hypersensitivity mapping [99], ATAC-seq [100] and ChIP-seq [101] experiments, which helps predict candidate CREs. Recently developed laboratory approaches combine chromatin signature detection and genome editing to allow prediction, validation and genome-wide functional assessment of CREs. For example, by editing ChIP-seq identified candidate sites, 73 enhancers were detected in humans, using an approach readily transferable to plants [102]. Additionally, bioinformatics approaches such as word-counting across the different promoter sequences [103] and analysis of sequence conservation [104] are used for regulatory element detection. Due to these methods, identifying cis-regulatory elements has become easier, leading to a growing body of knowledge on mammalian and, to a lesser extent, plant cis-regulatory elements.
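Word-counting approaches look for short sequences (k-mers) that occur in the promoters of a gene set of interest more often than in background promoters. The sketch below illustrates the idea with a simple fold-enrichment score; it is not the method of any specific published tool, the sequences are hypothetical with a planted motif, and a real analysis would use a statistical test, consider both strands and correct for multiple testing.

```python
from collections import Counter


def count_words(sequences, k):
    """Count in how many sequences each k-mer occurs at least once."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        # A set ensures each word is counted once per promoter
        words = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        counts.update(words)
    return counts


def enriched_words(target_promoters, background_promoters, k=6, min_fold=2.0):
    """Return (word, fold_enrichment) pairs for words over-represented in targets."""
    target = count_words(target_promoters, k)
    background = count_words(background_promoters, k)
    hits = []
    for word, n_target in target.items():
        target_freq = n_target / len(target_promoters)
        # A pseudocount of 1 avoids division by zero for words
        # that never occur in the background set
        background_freq = (background[word] + 1) / (len(background_promoters) + 1)
        fold = target_freq / background_freq
        if fold >= min_fold:
            hits.append((word, fold))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical promoter fragments; the targets share the planted word CACGTG
target_promoters = ["TTACACGTGAT", "GGCACGTGCCA", "ACACGTGTTTA"]
background_promoters = ["TTTTAAAACCC", "GGGATATATCC", "CCATTTAGGAA", "AAGGCCTTAGC"]

word_hits = enriched_words(target_promoters, background_promoters)
for word, fold in word_hits:
    print(f"{word}: {fold:.2f}-fold enriched")
```

Words flagged in this way are only candidates for regulatory elements, and their function still needs to be confirmed against chromatin accessibility data or experimentally.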
As our high-throughput CRE detection ability grows, the challenge for breeders is knowing which CRE to target. This is difficult because knowledge of the functional impact of CREs in plants is scant [98]. Integrated databases such as the Plant Cis-Acting Regulatory Elements (PlantCARE) database provide comprehensive resources to study plant CRE function [105]. However, the roles of specific CREs in regulatory networks are largely unknown and whether editing the sequence of a CRE will affect the expression of the target gene can only be accurately determined experimentally. The experiments necessary to characterise CREs in this way have been conducted on genes in rice [106,107], generating a combined mutant library of almost 100,000 independent lines. By producing a CRE mutant library with a similar approach and obtaining expression data from the mutant lines, CREs associated with agronomic traits could be identified. When trait-associated CREs are known, an allelic series produced with genome editing can rapidly create stepwise variation in a target trait. For example, using Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/Cas9 genome editing, Rodríguez-Leal et al. obtained stepwise variation in tomato seed compartment (locule) numbers, a yield-related trait, by targeting the promoters of the inflorescence architecture genes WUSCHEL (WUS) and CLAVATA3 (CLV3) [108].

Applying Machine Learning to Crop Breeding
Machine learning (ML) allows algorithms to interpret data by learning patterns through experience. For large, diverse and unstructured datasets, such as those generated by photo imaging or sequencing, ML can provide substantial advantages over other analytical approaches [109]. With the aid of ML, crop breeders can efficiently phenotype plants and mine diverse datasets for patterns such as associations between DNA sequences and traits.

High Throughput Crop Phenotyping
Plant phenotyping is the measurement of functional or structural traits from the cellular level to the organism level and is essential for association studies and crop improvement [110]. With the intensive development of genomics research and sequencing techniques, there is an increasing demand for plant phenotypes to help understand genomic data. Conventional phenotyping is often a bottleneck because it is subjective, error-prone, labour-intensive and time-consuming, limiting the number of traits, plants and environments that can be sampled [111]. Advances in measuring technologies (high-throughput imaging and automatic sensors) and ML allow the establishment of robotic high-throughput phenotyping, overcoming the shortcomings of traditional human-based phenotyping by allowing rapid generation of phenotypic features across large populations. High-throughput phenotyping has four main elements, namely detection by imaging or sensors, phenotypical data classification, feature quantification and prediction based on specific models or algorithms [112]. Phenotyping using ML has been applied in stress phenotyping and monitoring of diseases. A real-time ML-based high-throughput phenotyping method was developed to assess the severity of iron deficiency chlorosis from a total of 4366 soybeans from representative canopies [113]. Using linear discriminant analysis (LDA) and multi-class support vector machine (SVM) ML algorithms, the collected phenotypic data were used to train the best classification model to predict iron deficiency chlorosis stress severity in soybean, which is useful to measure the real-time severity in the soybean field. Another recent study applied a deep ML-based phenotyping method using an unsupervised identification explanation mechanism to measure, determine and classify the severity of various foliar stresses in Glycine max, including bacterial and fungal diseases and nutrient deficiency [114]. Although ML has been successfully applied in crop genomics and crop phenotyping, several challenges remain. ML modelling requires large datasets for training and model construction. A small training set can lead to statistically insignificant and problematic predictions [115] but a large dataset can be costly and time-consuming to acquire, as measurements on crops can often only be taken once per growth cycle. High-throughput ML-based phenotyping therefore remains limited to certain research institutes and commercial companies [116]. Further reduction of purchasing and operating costs is required to make ML-based phenotyping widely usable on the farm of the future.

Machine Learning in Crop Genomics Research
ML plays a role in many areas of genomics research, including genome assembly, the iterative inference of gene regulatory networks and identifying true SNPs in polyploid plants. A representative list of ML algorithms and related open-source R packages relevant for data analysis in plants is provided by Ma et al. [117]. ML can be used to improve assemblies of polyploid genomes with complex genome redundancies [117]. A complete genome assembly and annotation is the foundation to track the genetic variations within a plant species and understand plant gene function and structure, which are crucial within the crop trait discovery pipeline (Figure 1). Assembling highly redundant genomes is a challenge for non-ML-based assembly approaches that use a linear algorithm to assemble repetitive sequence regions. To overcome this limitation, an ML method was used to detect assembly errors and generate a high-quality assembly of bread wheat (Triticum aestivum) [118]. ML is also deployed by the RNA-seq mapping tool Portcullis to differentiate between real and artificial splicing junctions, which has helped annotate the bread wheat genome [119].
Inferring the relationships between regulatory elements and genes is a promising field for identifying previously unknown candidates for crop improvement. Based only on gene co-expression levels, a regulatory network constructed in silico is limited because the association between genes may not accurately reflect shared gene regulation [120]. Consequently, an ML-based method that can incorporate various kinds of regulatory signals from different data sources has become popular for interactive inference of gene regulatory networks [117]. For instance, by analysing data on transcription factor binding, conserved sequences, gene expression and chromatin modification data, the transcriptional regulatory network in Drosophila melanogaster involving 300,000 regulatory edges and over 12,000 targeted genes could be predicted [121].
SNPs are the most frequent class of variations in plant genomes [122,123]. However, challenges remain for SNP discovery in polyploid plants [124]. Korani et al. [125] developed an ML-based analysis tool called SNP-ML using neural networks and tree bagging models to efficiently filter false positive SNPs. They demonstrated that SNP-ML could be used to detect SNP variation and select true SNPs with over 98% accuracy in simulated SNP variant data of peanut, cotton and strawberry. Neural networks have also been deployed for SNP calling in error-prone long reads, which is particularly challenging in diploid or polyploid samples as errors must be distinguished from genuine heterozygous variants [126].
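The idea behind tree bagging for SNP filtering can be sketched as follows: many small decision trees are each trained on a bootstrap resample of labelled candidate SNPs, and a candidate is kept if the majority of trees vote for it. This is a simplified illustration that uses depth-one trees (stumps) and is not the SNP-ML implementation; the two features (alternative allele fraction and mean mapping quality), the training values and all function names are assumptions made for the example.

```python
import random


def fit_stump(rows, labels):
    """Fit a depth-one tree: the feature/threshold rule with fewest training errors."""
    best = None  # (errors, feature_index, threshold, label_if_at_or_above)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):
                errors = sum(
                    (above if r[f] >= t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    return best[1:]


def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] >= t else 1 - above


def fit_bagged(rows, labels, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict_bagged(models, row):
    """Majority vote over all stumps: 1 = keep as true SNP, 0 = filter out."""
    votes = sum(predict_stump(m, row) for m in models)
    return 1 if votes * 2 > len(models) else 0


# Hypothetical training set: (alt allele fraction, mean mapping quality).
train_rows = [
    (0.05, 22), (0.08, 30), (0.10, 18), (0.12, 25),  # false positives
    (0.45, 58), (0.50, 60), (0.55, 57), (0.98, 59),  # true SNPs
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = fit_bagged(train_rows, train_labels)
keep = predict_bagged(models, (0.60, 55))
drop = predict_bagged(models, (0.02, 20))
```

A practical filter would be trained on many more features and on validated calls from the species of interest, since the feature values that separate true from false SNPs in a diploid need not transfer to a polyploid.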

Genome Editing of Crops and Bioinformatics Challenges in Guide RNA Design
Breeding relies on generating novel combinations of alleles, and this can be achieved by harnessing natural variation from germplasm collections or by generating novel variation. Advanced breeding methods that generate DNA mutations, such as irradiation or chemical mutagenesis, are commonly used [127]. However, these methods are hampered by the high rate of background mutations, some of which can be deleterious and need to be removed with multiple time-consuming rounds of breeding. In contrast to conventional breeding methods (Figure 2), CRISPR/Cas does not require substantial crossing to fix traits and remove deleterious background mutations. The CRISPR/Cas system relies on a guide RNA (gRNA) to target the Cas protein to DNA sites for cleavage by locating a matching ~20 bp sequence and a protospacer adjacent motif (PAM), which is specific to the Cas protein used [128]. The Cas protein induces a double-strand break at the target site, allowing gene knock-out via mutations arising during nonhomologous end joining and gene knock-in via a donor DNA template and homology-directed repair. In the last five years, the CRISPR/Cas system has been efficiently applied in a variety of essential food crops (reviewed in [129]) and is expected to have a major impact on agriculture [130,131]. The remaining challenges for genome editors are improving protoplast transformation [132], increasing the efficiency of gene targeting using homology-directed repair [133] and optimising the bioinformatics tools for gRNA design with minimal off-target effects.
Bioinformatics tools are essential for the optimal design of gRNAs that facilitate efficient and specific CRISPR/Cas gene editing. The two crucial requirements for gRNA design are that the gRNA has high binding affinity for the target site and that it is specific, with few off-target effects. Using human cell lines, it was shown that guanines at the −1 and −2 PAM-proximal positions increase binding efficiency, whereas thymines at the +4/−4 PAM-proximal positions decrease efficiency [134]. An assay on human cells investigated the effect of nucleotide sequence and epigenetic parameters on gRNA efficiency, finding that both locus accessibility and sequence composition affect efficiency [135]. Similar studies of gRNA binding efficiency are lacking in plants. However, early assessments indicate that the base preferences identified in human cells are not shared by plants [136]. The specificity of gRNA designs is difficult to predict based on sequence similarity alone, and the results of off-target prediction are inconsistent between various tools [137].
Although there are over 50 tools for gRNA design publicly available (http://omictools.com/crispr-cas9-category) [138], only CRISPR-P [139] and CRISPR-Plant [140] are designed for plants. CRISPR-P is the more sophisticated tool, supporting design for 49 plant species and providing secondary structure analysis and microhomology scores for guide designs. However, the gRNA activity evaluations and specific searches for knock-in or knock-out designs that are available for non-plant model systems in tools such as the CRISPR-ERA web server [141], as well as learning-based design tools such as sgRNA Designer [142], are not yet available for plants. Because the interactions between Cas proteins, gRNA, DNA and chromatin are likely to differ to some degree in plants [136], it will be important to incorporate this information into plant gRNA design tools. In addition, plant genomes are highly redundant, which may make it difficult to generate unique gRNAs for single target sites. Moreover, for minor crops with few available genomic resources, or highly genetically diverse germplasm, it can be difficult to predict gRNA activity because of SNPs between the reference genome and the target individual [143]. As additional endonucleases with different activity from Cas9 are added to the CRISPR/Cas toolkit and more empirical data on endonuclease activity in plants becomes available, bioinformatics tools will be able to tailor gRNAs to target more genomic sites at higher accuracy in crops.
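The first steps of gRNA design described above, locating a ~20 bp target next to a PAM and checking how unique it is, can be sketched as follows. The sketch assumes the NGG PAM of the commonly used Streptococcus pyogenes Cas9 (other Cas proteins need a different pattern), uses plain mismatch counting as a stand-in for the more elaborate scoring of real design tools, and runs on short hypothetical sequences; all function names are illustrative.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_candidates(seq, guide_len=20):
    """Return (strand, start, protospacer, pam) for each site followed by NGG.

    Start positions on the '-' strand are relative to the reverse complement.
    """
    seq = seq.upper()
    # The lookahead lets overlapping candidate sites be reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in pattern.finditer(s):
            hits.append((strand, m.start(), m.group(1), m.group(2)))
    return hits


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def count_near_matches(guide, genome, max_mismatches=3):
    """Count PAM-adjacent sites within max_mismatches of the guide (on-target included)."""
    return sum(
        mismatches(guide, site) <= max_mismatches
        for _, _, site, _ in find_candidates(genome, len(guide))
    )


# Hypothetical sequences for illustration only.
guide = "GATCTGAAGTCAATCGATCA"
target_region = "TTTT" + guide + "AGG" + "TTTT"
# A second, near-identical site mimics the redundancy of plant genomes.
genome = target_region + "AAAA" + "CTTCTGAAGTCAATCGATCA" + "TGG" + "AAAA"

hits = find_candidates(target_region)
sites_within_3 = count_near_matches(guide, genome, max_mismatches=3)
sites_within_1 = count_near_matches(guide, genome, max_mismatches=1)
```

A count above one at the chosen mismatch tolerance flags a guide that may also cut a duplicated or homoeologous copy. Because SNPs between the reference and the target individual can change both the protospacer and the PAM, such a scan is only as reliable as the sequence it is run on.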

Conclusions
Agriculture faces substantial challenges in harnessing the deluge of genomic data of diverse origins and formats for crop improvement. To overcome these challenges, novel breeding methods and bioinformatics tools must be used to translate genomic data into gains in crop yield and yield stability. To accelerate the detection of robust gene-trait associations, researchers can apply meta-QTL analyses, GWAS and genetic screens. While genome editing offers a valuable approach to rapidly introduce beneficial mutations into elite cultivars, GS increases selection efficiency without requiring knowledge of underlying genetic drivers. ML algorithms can take advantage of high-throughput phenotyping and genomic data to further automate parts of the gene discovery pipeline such as genome annotation and image interpretation that remain particularly challenging. By applying novel technologies and methods in concert, future plant breeding can achieve the crop improvement rate required to ensure food security.

Figure 1. Schematic overview of a crop trait discovery pipeline.

Figure 2. Generating a novel crop cultivar using a parental cross, random mutagenesis and genome editing. Changed genomic regions on a representative chromosome are shown, with regions harbouring beneficial mutations in red. Plant populations are shown to harbour more phenotypic changes when using a parental cross or random mutagenesis and are more uniform when using genome editing, likely varying only in the target trait(s). When using genomic selection to complement traditional breeding, the phenotyping process can be replaced with a faster genotyping step, and plants are selected based on breeding values calculated from a phenotyped and genotyped training population.

Table 1. Comparison of the three main sequencing technologies for crop genomics. Corrected error rates are calculated after applying error correction methods that do not require additional data. Prices are in US$.
* Costs are estimates and can vary substantially between sequencing providers.