The Complete Chloroplast Genome of Capsicum annuum var. glabriusculum Using Illumina Sequencing

Chloroplast (cp) genome sequences provide a valuable source for DNA barcoding. Molecular phylogenetic studies have concentrated on DNA sequencing of conserved gene loci. However, this approach is time-consuming and more difficult to implement when gene organization differs among species. Here we report the complete re-sequencing of the cp genome of Capsicum pepper (Capsicum annuum var. glabriusculum) using the Illumina platform. The total length of the cp genome is 156,817 bp, with an overall GC content of 37.7%. A pair of inverted repeats (IRs) of 50,284 bp is separated by a small single copy (SSC; 18,948 bp) region and a large single copy (LSC; 87,446 bp) region. The number of cp genes in C. annuum var. glabriusculum is the same as that in other Capsicum species. Variations in the lengths of the LSC, SSC and IR regions were the main contributors to the size variation in the cp genome of this species. A total of 125 simple sequence repeats (SSRs) and 48 single nucleotide polymorphism (SNP) and insertion/deletion (InDel) variants were found by sequence alignment of Capsicum cp genomes. These findings provide a foundation for further investigation of cp genome evolution in Capsicum and other higher plants.


Introduction
Chloroplasts (cp) are membrane-bound organelles mainly involved in the photosynthetic conversion of atmospheric CO2 into carbohydrates, in which light energy is stored as chemical energy. Chloroplasts possess their own genome, which encodes a range of genes involved mainly in photosynthesis and some essential metabolic pathways [1,2]. The first complete cp genome sequences, from tobacco and liverwort, were reported in 1986 [3,4]. Since then, with the emergence of rapid and cost-effective next-generation sequencing (NGS) approaches, 342 cp genome sequences from different lineages have been reported [5]. Analyses of cp genomes among land plants show that their structure and organization are highly conserved, with a quadripartite structure [6–8].
Capsicum L. (pepper) is a genus of the highly diverse Solanaceae family and comprises approximately 32 recognized species [9]. Capsicum originated in the New World and is cultivated in temperate and tropical regions [10,11], but knowledge of its domestication is incomplete. Capsicum annuum var. glabriusculum, commonly known as the American bird pepper, is a unique Capsicum species well known for its rich variation in flavor and aroma. Peppers play important roles in the economy, food and pharmaceutics [12]. Therefore, knowledge of the genetic diversity among the germplasm is vital for strategic germplasm collection, maintenance, conservation and utilization.
DNA barcoding is a taxonomic method that aims to provide rapid and accurate species identification using a standard DNA region. The highly conserved structure of cp genome organization is a potential source of information for phylogenetic reconstruction of species relationships among plants. The cp genome has a simple and stable genetic structure, and universal primers can be used to amplify target sequences. In land plants, highly variable cp sequences such as matK, rbcL and psbA-trnH are considered efficient DNA barcodes [13–15]. DNA barcoding appears to be a promising way to identify plant species, but most of the individual plastid candidate barcodes lack species-level resolution [16,17].
To find multi-locus DNA barcodes with high resolution at the species level, it is essential to determine the distribution and location of the variable sequence information present in the cp genome. Until now, only three complete cp genome sequences from Capsicum species have been reported: American bird pepper (Capsicum annuum var. glabriusculum) [18], Korean landrace "subicho" pepper (Capsicum annuum var. annuum) [19] and a cultivated pepper (Capsicum annuum L.) [20]. The complete cp genome sequence of Capsicum pepper, C. annuum var. glabriusculum, reported here augments the genetic information for Capsicum species, which will facilitate multi-locus choice for plant barcoding and for population, phylogenetic and cp genetic engineering studies of this species.

Chloroplast Genome Assembly
We sequenced the cp genome of C. annuum var. glabriusculum using the Illumina genome analyzer platform. Illumina paired-end (2 × 300 bp) sequencing produced a total of 7,716,442 paired-end reads, with an average fragment length of 277 bp, which were then analyzed to generate 1,964,163,823 bp of sequence. Low-quality reads (quality score < 20; Q20) were filtered out, and the remaining high-quality reads were mapped to the reference cp genome of Capsicum, yielding 29,609,440 mapped nucleotides and an average coverage of 188× on the cp genome. The cp reads extracted from the Illumina dataset were assembled into a total of four contigs. Contig alignment and scaffolding based on paired-end data resulted in a complete circular C. annuum var. glabriusculum cp genome sequence (Figure 1). The genome sequence was deposited in GenBank under the accession number KR078311.
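The reported depth can be reproduced from the figures in the paragraph above. The following is a minimal sketch, assuming that average coverage is defined as mapped nucleotides divided by genome length; the function and variable names are illustrative and not taken from the original analysis.

```python
# Values quoted in the text; names are hypothetical.
MAPPED_NUCLEOTIDES = 29_609_440  # nucleotides mapped to the reference cp genome
GENOME_LENGTH_BP = 156_817       # length of the assembled cp genome


def mean_coverage(mapped_bases: int, genome_length: int) -> float:
    """Average depth, assumed here to be mapped bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return mapped_bases / genome_length


depth = mean_coverage(MAPPED_NUCLEOTIDES, GENOME_LENGTH_BP)
print(f"mean coverage: {depth:.1f}x")
```

Under this definition the depth comes out at roughly 188.8, which agrees with the reported 188× once truncated to a whole number.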

Features of the C. annuum var. glabriusculum Chloroplast Genome
The C. annuum var. glabriusculum cp genome is 156,817 bp in length, and its GC content is 37.7%. The inverted repeats (IRs) have a higher GC content (43.05%) than the large single copy (LSC) (35.74%) or small single copy (SSC) (32.01%) regions because of the presence of GC-rich rRNA genes. The C. annuum var. glabriusculum cp genome is circular with quadripartite organization (Figure 1). The quadripartite structure includes two single copy DNA fragments, an LSC of 87,380 bp and an SSC of 17,853 bp, separated by a pair of IRs of 25,792 bp each on a single circular molecule. The cp genome contains a total of 132 predicted genes (Table 1), including 87 protein-coding genes, 8 ribosomal RNA (rRNA) genes and 37 transfer RNA (tRNA) genes. Seven of these genes are duplicated in the IR regions; nine genes (rps16, atpF, rpoC1, petB, petD, rpl16, rpl2 (IR), ndhB (IR), ndhA) and six tRNA genes contain one intron; and two genes (clpP, rps12) and one ycf (ycf3) contain two introns.
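The region sizes, gene counts and GC contents given in this section can be cross-checked against one another. The sketch below assumes that the 25,792 bp figure is the length of each IR copy and that overall GC content is the length-weighted mean of the regional values; all names are illustrative.

```python
# Region lengths (bp) and GC contents (%) quoted in the text.
LSC_BP, SSC_BP, IR_BP = 87_380, 17_853, 25_792  # IR_BP is assumed to be per copy
GC_PERCENT = {"LSC": 35.74, "SSC": 32.01, "IR": 43.05}

# Quadripartite structure: LSC + SSC + two IR copies.
total_length = LSC_BP + SSC_BP + 2 * IR_BP

# Predicted gene categories.
gene_counts = {"protein_coding": 87, "rRNA": 8, "tRNA": 37}
total_genes = sum(gene_counts.values())

# Length-weighted GC content across the four regions.
weighted_gc = (
    GC_PERCENT["LSC"] * LSC_BP
    + GC_PERCENT["SSC"] * SSC_BP
    + GC_PERCENT["IR"] * 2 * IR_BP
) / total_length

print(total_length, total_genes, round(weighted_gc, 2))
```

With these inputs the regions sum to the stated 156,817 bp, the gene categories sum to 132, and the weighted GC content is about 37.7%, matching the overall value reported above.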

Discovery of SSRs and SNPs
A total of 125 potential SSR motifs were identified, located mostly in the non-coding regions (Table S1). The majority were tetra-nucleotide (50%) and tri-nucleotide (26%) repeats, and all other types of SSRs, such as di- and penta-nucleotide motifs, were less frequent (25%). The majority of tetra-nucleotide SSRs had the AAAT/AATA/ATAA motif, followed by those with the ATAA/TAAA/AAAT motif; the remaining tetra-nucleotide SSRs, with the TTTG/TTGT/TGTT, TCTT/CTTT/TTTC and AATT/ATTA/TTAA motifs, were found in similar proportions (7.2%). Two different repeats, with the TTTTA/TTTAT/TTATT and TTATT/TATTT/ATTTT motifs, were identified among the penta-nucleotide SSRs. The TTC/TCT/CTT and TTA/TAT/ATT motifs were identified among the tri-nucleotide SSRs. Only the TA/AT motif was identified among the di-nucleotide SSRs (Table S1). Comparison of the C. annuum var. glabriusculum cp genome sequence with the reference cp sequence of C. annuum revealed a total of 48 mutations (15 SNPs and 33 InDels), and 32 of these variants involved more than one nucleotide (Tables S2 and S3). Among the detected variants, 5 SNPs and 3 InDels were observed in the coding region of the cp genome. Of these SNPs and InDels, 43 and 5 mutations were located in the LSC and SSC regions, respectively.

Discussion
Here we report the re-sequencing and assembly of a cp genome using the Illumina sequencing platform, in which we recovered four contigs comprising 156,817 bp and covering the entire C. annuum var. glabriusculum cp genome. Reported Capsicum cp genomes range in size from 156,612 to 156,781 bp, and the size of the C. annuum var. glabriusculum cp genome identified here is consistent with those reported previously in plants of the same species [18,20]. The entire cp genome of C. annuum var. glabriusculum was 36 bp longer than the reported C. annuum L. cp genome (GenBank accession NC_018552) and 205 bp longer than another C. annuum var. glabriusculum cp genome (GenBank accession KJ619462). In addition, the SSC and IR regions of C. annuum var. glabriusculum were 3 and 9 bp longer, respectively, and the LSC region was 14 bp shorter and 167 bp longer, respectively, than those of the previously reported cp genomes. The average GC content of the C. annuum cp genome is 37.7%, similar to that of other Capsicum species. The data generated using the Illumina platform covered the cp genome at a depth of 188×, whereas cp genome sequence coverage was not reported in the previous studies, and allowed us to resolve the ambiguities present in the GS-FLX pyrosequencing data. Thus, the data from the cp assembly reported here support previous findings that Illumina can produce high-quality sequence assemblies covering a greater genome depth [21].
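The size comparison above can be checked with simple arithmetic. The sketch below assumes that each quoted difference is the present genome length minus the length of the named accession; the dictionary and variable names are illustrative.

```python
THIS_STUDY_BP = 156_817

# How many bp longer the present genome is than each accession (from the text).
LONGER_THAN_BP = {"NC_018552": 36, "KJ619462": 205}

# Lengths of the earlier genomes implied by those differences.
implied_lengths = {
    accession: THIS_STUDY_BP - diff for accession, diff in LONGER_THAN_BP.items()
}

smallest = min(implied_lengths.values())
largest = max(implied_lengths.values())
print(implied_lengths, smallest, largest)
```

The implied lengths, 156,781 bp and 156,612 bp, coincide with the two ends of the size range quoted for previously reported Capsicum cp genomes.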

Sampling and DNA Extraction
The sample (accession No. IT158289) was obtained from the National Agrobiodiversity Center, Rural Development Administration, Korea. Fresh leaves were collected from 40-day-old seedlings, and DNA was extracted to construct cp DNA libraries.

Library Preparation and Sequencing
An Illumina paired-end cp DNA library (average insert size of 500 bp) was constructed using the Illumina TruSeq library preparation kit following the manufacturer's instructions. The libraries were sequenced as 2 × 300 bp paired-end reads on the MiSeq instrument at LabGenomics (http://www.labgenomics.co.kr/).

Chloroplast Genome Assembly
Prior to de novo cp assembly, low-quality sequences (quality score < 20; Q20) were filtered out, and the remaining high-quality reads were assembled using the CLC Genome Assembler (version beta 4.6, CLC Inc., Aarhus, Denmark) with a 200-600-bp overlap size. Cp contigs were selected from the initial assembly by performing a BLAST search against known cp sequences (GenBank accession NC_018552). The selected contigs were oriented to construct the complete cp genome structure. Ambiguous nucleotides or gaps were corrected manually to build the complete cp genome.
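The Q20 filtering step can be illustrated with a short sketch. The text does not state whether the threshold was applied per base or per read, nor which tool performed the filtering, so the code below makes two explicit assumptions: reads are judged by their mean Phred score, and quality strings use the Phred+33 encoding. The function names and the toy reads are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (Phred+33 encoding assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a read; 0.0 for an empty quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(
    records: Iterable[FastqRecord], min_mean_q: float = 20.0
) -> Iterator[FastqRecord]:
    """Yield only reads whose mean quality is at least min_mean_q (assumed criterion)."""
    for header, seq, qual in records:
        if mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


# Toy reads: 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2 under Phred+33.
reads = [
    ("@read1", "ACGT", "IIII"),
    ("@read2", "ACGT", "####"),
    ("@read3", "ACGT", "5555"),
]
kept = [header for header, _, _ in filter_reads(reads)]
print(kept)
```

A per-base or sliding-window criterion would retain a different set of reads, so the mean-quality rule here should be read only as one plausible interpretation of the Q20 cutoff.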

Gene Annotation
The web-based program Dual Organellar GenoMe Annotator (DOGMA, http://dogma.ccbb.utexas.edu/) was used with default parameters to annotate the assembled genome and predict protein-coding, tRNA and rRNA genes. Subsequently, BLASTN was used to further identify the positions of intron-containing genes by searching a published cp genome database. A cp gene map was constructed using the OrganellarGenomeDRAW software (OGDRAW, http://ogdraw.mpimp-golm.mpg.de).

Discovery of SNPs and SSRs
Sputnik (http://espressosoftware.com/pages/sputnik.jsp) software was used to find the SSR markers present in the cp genome of C. annuum var. glabriusculum. It uses a recursive algorithm to search for repeats with a length between 2 and 5, and finds perfect, compound and imperfect repeats. Sputnik has been applied for SSR identification in many species, including Arabidopsis and barley [23]. To identify SNP and InDel variants in the C. annuum var. glabriusculum cp genome, we used the BWA [24] and Samtools [25] software. The method and algorithm are described in more detail in Li (2012) [26].
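To make the idea of an SSR search concrete, the sketch below finds perfect repeats with motif lengths of 2 to 5 using a regular expression. It is not Sputnik's recursive algorithm and does not detect compound or imperfect repeats; the minimum of three repeat copies, the function name and the toy sequence are assumptions made for illustration only.

```python
import re
from typing import List, NamedTuple


class SSR(NamedTuple):
    start: int    # 0-based start position in the sequence
    motif: str    # repeated unit
    repeats: int  # number of consecutive copies


def find_perfect_ssrs(
    seq: str, min_unit: int = 2, max_unit: int = 5, min_repeats: int = 3
) -> List[SSR]:
    """Find perfect tandem repeats of motifs min_unit..max_unit bases long."""
    seq = seq.upper()
    hits: List[SSR] = []
    for unit in range(min_unit, max_unit + 1):
        # A motif of `unit` bases followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "TATA" is "TA" x 2).
            if any(
                motif == motif[:k] * (unit // k)
                for k in range(1, unit)
                if unit % k == 0
            ):
                continue
            hits.append(SSR(match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


toy_sequence = "GGC" + "AAAT" * 4 + "CCG" + "TA" * 5 + "G"
ssrs = find_perfect_ssrs(toy_sequence)
for ssr in ssrs:
    print(ssr)
```

On the toy sequence this reports one tetra-nucleotide repeat (AAAT, four copies) and one di-nucleotide repeat (TA, five copies). Equivalent rotations of a motif, such as AAAT/AATA/ATAA, would need to be grouped in a separate step.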

Conclusions
The cp genome sequences of Capsicum species, such as C. annuum var. glabriusculum, C. annuum var. annuum and C. annuum L., have been reported previously; however, information on cp gene content is limited. The complete cp genome sequence of Capsicum pepper (C. annuum var. glabriusculum)

Figure 1. Complete chloroplast genome map of C. annuum var. glabriusculum. Genes drawn inside the circle are transcribed clockwise, and those outside are transcribed counterclockwise and marked by two arrows. Different functional gene groups are color-coded. The GC content variation is shown in the middle circle.

Table 1. General features of the C. annuum var. glabriusculum chloroplast genome.

Table 2. Genes present in the C. annuum var. glabriusculum chloroplast genome.