Genomic Databases for Crop Improvement

Genomics is playing an increasing role in plant breeding and this is accelerating with the rapid advances in genome technology. Translating the vast abundance of data being produced by genome technologies requires the development of custom bioinformatics tools and advanced databases. These range from large generic databases which hold specific data types for a broad range of species, to carefully integrated and curated databases which act as a resource for the improvement of specific crops. In this review, we outline some of the features of plant genome databases, identify specific resources for the improvement of individual crops and comment on the potential future direction of crop genome databases.


Introduction
The majority of DNA sequence and expressed gene sequence data generated today comes from next- or second-generation sequencing (NGS/2GS) technologies. Compared with Sanger sequencing, NGS technologies produce vast quantities of short sequence reads at relatively low cost and in a short time. Genomics is undergoing a revolution, driven by advances in DNA sequencing technology, and this data flood is having a major impact on approaches and strategies for crop improvement. NGS technologies have been applied to the sequenced genomes of a number of cereal crop species, including rice, sorghum and maize. A quality sequence of rice that covers 95% of the 389 Mb genome has been produced [1]. The Sorghum bicolor (L.) Moench genome has been assembled at a size of 730 megabases, placing ~98% of genes in their chromosomal context [2]. The draft nucleotide sequence of the 2.3-gigabase genome of maize has also been improved [3]. One of the challenges encountered by researchers is to translate this abundance of data into improved crops in the field. There remains a gap between genome data production and next-generation crop improvement strategies, but this is being rapidly closed by far-sighted companies and individuals who can combine the mining of genomic data with practical crop-improvement skills. Bioinformatics can be defined as the structuring of biological information to enable logical interrogation, and databases are a key part of the bioinformatics toolbox. Numerous databases have been developed for genomic data, on a range of platforms and to suit a variety of different purposes (see Table 1 for examples). These range from generic DNA sequence or molecular marker databases to those hosting a variety of data for specific species.

Generic Databases
The largest of the DNA sequence repositories is the International Nucleotide Sequence Database Collaboration (INSDC), made up of the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan [9], GenBank at the National Center for Biotechnology Information (NCBI) in Bethesda, USA [15,16], and the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, maintained at the European Bioinformatics Institute (EBI) in the UK [13]. Daily data exchange between these groups ensures coordinated international coverage [43].
Since the introduction of advanced next-generation sequencing technology, the storage and interrogation of these data have become a growing challenge [44,45]. Searching this vast quantity of data is made feasible by custom databases such as TAGdb (http://flora.acpfg.com.au/tagdb/) [39], but it is increasingly the assembled and annotated genome data that are applied to crop improvement [46].
While it is valuable to maintain all public nucleic acid sequences in one location, the size of this resource limits the ability to visualize the data. Genome viewers, which place genomic data within the context of sequenced or partially sequenced genomes, provide more context-oriented data interrogation. There are two main generic web-based tools to view plant genomes: Ensembl [10] and GBrowse [47,48]. Both are widely used and it is not uncommon to find similar genome information hosted on both systems. A key development in genome databases was the establishment and adoption of a standard file format for genome data [49], and data in the current version, GFF3, can be visualized and searched using a wide range of tools, from custom GBrowse databases to standalone bioinformatics tools such as Biomatters Geneious [50].
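To illustrate why a shared format such as GFF3 makes genome data portable between tools, the following minimal Python sketch parses a single GFF3 feature line. It assumes only the general layout of GFF3 (nine tab-separated columns, with the last column holding semicolon-separated tag=value attributes that may be percent-encoded). The function name and the example record (Chr1, gene0001) are hypothetical and are not taken from any of the databases discussed here.

```python
from urllib.parse import unquote

# The nine tab-separated columns of a GFF3 feature line, in order.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines (starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds tag=value pairs separated by semicolons; '.' means no attributes.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical example record, for illustration only.
example = "Chr1\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```

Because every feature is reduced to the same nine fields, the same record can in principle be loaded by a genome browser or by a standalone tool without conversion.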
There are several resources which collate genome data for multiple plant species. Gramene (http://www.gramene.org/) [20] is an Ensembl-based genome viewer and database that hosts information on a variety of crop species but is based around the rice, maize and Arabidopsis genomes [18]. A similar resource is hosted by the EBI (http://plants.ensembl.org/) [10]. PlantGDB is a resource for comparative plant genomics [31,32] and hosts sequence data for >70,000 plant species, with a focus on complete sequencing of the reference species Arabidopsis, rice, maize and Medicago truncatula. Plaza (http://bioinformatics.psb.ugent.be/plaza/) [34] hosts pre-computed comparative genomics data sets for a range of species. Phytozome (http://www.phytozome.net/) [29] also hosts genome data for numerous plant species and provides several genomes using the GBrowse format. With 25 complete plant genomes, Phytozome is one of the most comprehensive plant genome databases currently available [18]. In addition, PlantsDB is a generic database hosting data for multiple plant species. This database is hosted by MIPS (http://mips.helmholtz-muenchen.de/plant/genomes.jsp) [30].
While genome and transcript sequence information makes up the bulk of genome data maintained within public databases, it is often the differences between individuals and varieties which are the most valuable for crop-improvement applications. A major focus of crop genetic research in recent decades has been the development of molecular genetic markers associated with important traits. Genetic markers can be assayed with a variety of techniques [51]. Early molecular genetic marker technologies such as restriction fragment length polymorphisms have been replaced by higher-throughput methods, including amplified fragment length polymorphisms (AFLPs), diversity array technologies (DArT) and simple sequence repeats (SSRs), also known as microsatellites. Another important crop-improvement-oriented database is the maize database Panzea (http://www.panzea.org/) [28], which hosts data on genomic diversity in a large germplasm collection, including genetic data, trait phenotypes, allele frequencies, phenotyping environments and genetic analysis tools. The Panzea database comprises genotypic and phenotypic data and genetic marker information. The database design is based on the Genomic Diversity and Phenotype Data Model (GDPDM) (http://www.maizegenetics.net/gdpdm/) [20].
An expressed sequence tag (EST) represents a short sub-sequence of a cDNA sequence. Both EMBL and GenBank have sub-sections for EST sequences. The crop expressed sequence tag database, CR-EST (http://pgrc.ipk-gatersleben.de/cr-est/) [40], provides access to more than 200,000 sequences derived from 41 cDNA libraries of four species: barley, wheat, pea and potato.
SSRs are short stretches of DNA sequence occurring as tandem repeats of mono-, di-, tri-, tetra-, penta- and hexa-nucleotides. They are highly polymorphic due to mutation affecting the number of repeat units. The value of SSRs is due to their genetic co-dominance, abundance, dispersal throughout the genome, multi-allelic variation and high reproducibility. The hypervariability of SSRs among related organisms makes them excellent markers for genotype identification, analysis of genetic diversity, phenotype mapping and marker-assisted selection [52,53]. SSRs demonstrate a high degree of transferability between species, as PCR primers designed to an SSR within one species frequently amplify a corresponding locus in related species, enabling comparative genetic and genomic analysis.
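The definition of an SSR as a tandem repeat of a 1 to 6 nucleotide motif can be expressed directly as a pattern search. The sketch below is a simple illustration using a regular expression, not the method of any SSR discovery tool cited in this review; the function name, the minimum repeat threshold of four units and the test sequence are assumptions chosen for the example.

```python
import re


def find_ssrs(sequence, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats of 1-6 nt motifs.

    `start` is a 0-based offset into the sequence. The threshold `min_repeats`
    is an illustrative assumption, not a community standard.
    """
    sequence = sequence.upper()
    # Group 1 is the shortest motif (1-6 nt) that is then repeated in tandem
    # at least (min_repeats - 1) further times.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    ssrs = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        repeat_count = len(match.group(0)) // len(motif)
        ssrs.append((match.start(), motif, repeat_count))
    return ssrs


# Hypothetical test sequence: five AT units flanked by non-repetitive bases.
test_sequence = "GGC" + "AT" * 5 + "CCG"
print(find_ssrs(test_sequence))
```

Because the number of repeat units is reported alongside the motif, running the same search on the corresponding locus from two varieties would expose the length polymorphism that makes SSRs useful as markers. Imperfect or compound repeats are not handled by this pattern.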
With the continued advances in DNA sequencing technologies, single-nucleotide polymorphisms (SNPs) have come to dominate high-throughput molecular marker analysis. SNPs are the ultimate form of molecular genetic marker, as a nucleotide base is the smallest unit of inheritance, and a SNP represents a single-nucleotide difference between two individuals at a defined location. SNPs are direct markers as the sequence information provides the exact nature of the allelic variants. Furthermore, this sequence variation can have a major impact on how the organism develops and responds to the environment. SNPs represent the most frequent type of genetic polymorphism and may therefore provide a high density of markers near a locus of interest. SNPs at any particular site could in principle involve four different nucleotide variants, but in practice they are generally biallelic. This disadvantage, when compared with multiallelic markers such as SSRs, is compensated by the relative abundance of SNPs. The high density of SNPs makes them valuable for genome mapping, and in particular they allow the generation of ultra-high density genetic maps and haplotyping systems for genes or regions of interest, and map-based positional cloning. SNPs are used routinely in crop breeding programs, for genetic diversity analysis, cultivar identification, phylogenetic analysis, characterization of genetic resources and association with agronomic traits [54,55].
The SNP discovery software autoSNP [57,58] identifies SNPs and insertion/deletion (indel) polymorphisms from bulk sequence data using two measures of confidence: redundancy, defined as the number of times a polymorphism occurs at a locus in a sequence alignment, and co-segregation of SNPs to define a haplotype. The autoSNP software has recently been extended to a database format, autoSNPdb, which permits complex queries and provides detailed genomic and functional information [4,5]. Where the sequence trace files are available, the SNP discovery tool PolyPhred [59,60] can make use of the base pair quality scores to further differentiate between true SNPs and random sequence error. The recent developments in next-generation sequence data have led to the identification of large numbers of SNPs in a range of plant genomes and these approaches are likely to dominate SNP discovery in the coming years [61].
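The redundancy measure described above can be illustrated with a short Python sketch. This is a simplified illustration of the idea only and is not the autoSNP implementation: it assumes the reads are already aligned to equal length, ignores indels and co-segregation, and uses a hypothetical function name, threshold and set of reads.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_redundancy=2):
    """Report alignment columns where at least two alleles are each seen
    `min_redundancy` or more times.

    Returns a list of (position, {base: count}) tuples with 0-based positions.
    Gaps and ambiguous bases are ignored. The threshold is an illustrative
    assumption.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    snps = []
    for position in range(length):
        counts = Counter(read[position].upper() for read in aligned_reads)
        # Keep only alleles supported by enough independent reads.
        supported = {base: n for base, n in counts.items()
                     if base in "ACGT" and n >= min_redundancy}
        if len(supported) >= 2:
            snps.append((position, supported))
    return snps


# Hypothetical aligned reads. Column 2 has G (x3) and T (x2), so it is reported.
# Column 5 has a single A, which is treated as possible sequence error.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACTTAC",
    "ACTTAC",
    "ACGTAA",
]
print(candidate_snps(reads))
```

Requiring each allele to be observed more than once is what separates a likely polymorphism from a one-off sequencing error in this sketch; a haplotype-based check such as co-segregation would be a further, separate filter.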
The increased throughput for the discovery and application of molecular genetic markers has led to the requirement for databases hosting the results of molecular marker analysis. These may be integrated within other database systems such as Gramene [20], the Legume Information System (LIS) [22,23], or GrainGenes [17,18].
One of the principal uses of molecular genetic markers is the production of genetic maps and the mapping of heritable traits. While mapping data may be described as lists, graphical representations are more readily understood. The genetic map viewer CMap, developed by the GMOD consortium [62], is valuable for the validation of traits that map to the same position in different populations and also for the linkage between crop genetic maps and sequenced model genomes, enabling the identification of candidate genes for genetically mapped traits. A recent addition, CMap3D [63], enables the comparison of a larger number of maps in 3D space.
The linking of genomic data with agronomic traits remains one of the greatest challenges in the application of genome data for crop improvement [64,65]. Several databases have been developed to assist in this endeavor. The International Crop Information System (ICIS) [21] is a database system that hosts integrated management information for crop improvement, including details on diverse germplasm and traits. One challenge in developing trait databases is the establishment of functional ontologies. The Plant Ontology (http://www.plantontology.org/) [33] is a controlled vocabulary (ontology) that describes plant anatomy and morphology and stages of growth and development for all plants, and this database of ontologies is becoming the standard for comparative physiology and for linking genes with potential function.

Species Focused Databases
It would be impossible to detail all available plant genetic and genomic databases; however, some of the main ones are listed below along with a brief description of their content.
While Brachypodium distachyon is not grown as a crop, this species has many qualities that make it a model for studies in temperate grasses and cereals, including a small genome (~300 Mbp), small physical stature, self-fertility, a short lifecycle, and simple growth requirements. The B. distachyon genome was sequenced in 2010 [66] and the Brachypodium database, which includes a GBrowse-based genome viewer, is available at http://www.brachypodium.org/ [6].
Rice was one of the first crop genomes to be sequenced and there are now numerous resources available to mine this genomic information. Oryzabase is an integrated rice science database established in 2000 (http://www.shigen.nig.ac.jp/rice/oryzabase/) [27]. The database hosts information on genetic resources, chromosome maps, genes and rice mutants. This is complemented by a rice genome annotation project which presents data using GBrowse (http://rice.plantbiology.msu.edu/) [35].
Although wheat is an extremely important crop, advances in genomics have been limited by its large and highly complex genome. Assemblies of the gene-rich regions for the group 7 chromosomes have been completed [67,68], and annotated sequences, including a large number of SNPs, are available at http://www.wheatgenome.info [42].
A central portal for Brassica data is maintained at Brassica.info, with links to genetic marker, map and a range of diverse Brassica-related information. The recently sequenced Brassica rapa genome [7] is hosted at http://brassicadb.org/ in a database named BRAD [8], with a second database which contains Brassica repeat information at http://www.BrassicaGenome.net [7]. Both of these databases use GBrowse.
The SOL Genomics Network (SGN) (http://solgenomics.net/) [37] is a clade-oriented database containing genomic, genetic, phenotypic and taxonomic information for plant genomes, with a focus on the Euasterid clade, which includes Solanaceae (e.g., tomato, potato, eggplant, pepper and petunia) and Rubiaceae (coffee). As well as being a resource for basic crop research, SGN maintains databases with a specific focus on giving breeders direct links to breeder-relevant tools and data.

Conclusions and Future Direction
There is currently a range of databases dedicated to generic genome data or focusing on specific crops or clades. Both the types and volumes of data have increased greatly over the last few years and this trend looks set to continue. Some of the early database formats are either no longer used or have limited applications [69-71]; however, several newer web tools are now becoming predominant. These include the GBrowse genome viewer [47,48] and associated open source bioinformatics developments, as well as the Ensembl system [10]. As genome technology continues to advance and an increasing number of crop genomes become available, an expanding number of these databases will be developed. One of the main challenges facing crop bioinformatics researchers is to make the ever-increasing volume and types of data available in a suitable format for analysis [72]. This includes new high-throughput plant phenotype data as well as the increasing volumes of genotypic diversity data. It is the association of this diversity data with heritable phenotypes which will likely drive genome database development over the coming years [73,74]. These databases will therefore require the implementation of appropriate statistical tools for the association of high-density genotype and high-throughput phenotype data.

Table 1. Examples of genomic databases related to crop improvement.