Complete Chloroplast Genome Sequences of Four Meliaceae Species and Comparative Analyses

The Meliaceae family mainly consists of trees and shrubs with a pantropical distribution. In this study, the complete chloroplast genomes of four Meliaceae species were sequenced and compared with each other and with the previously published Azadirachta indica plastome. The five plastomes are circular and exhibit a quadripartite structure with high conservation of gene content and order. They include 130 genes encoding 85 proteins, 37 tRNAs and 8 rRNAs. Inverted repeat expansion resulted in a duplication of rps19 in the five Meliaceae species, which is consistent with that in many other Sapindales, but different from many other rosids. Compared to Azadirachta indica, the four newly sequenced Meliaceae individuals share several large deletions, which mainly contribute to the decreased genome sizes. A whole-plastome phylogeny supports previous findings that the four species form a monophyletic sister clade to Azadirachta indica within the Meliaceae. SNPs and indels identified in all complete Meliaceae plastomes might be suitable targets for the future development of genetic markers at different taxonomic levels. The extended analysis of SNPs in the matK gene led to the identification of four potential Meliaceae-specific SNPs as a basis for future validation and marker development.


Introduction
The Meliaceae or mahogany family is a flowering plant family of mainly trees and shrubs (and a few mangroves and herbaceous plants) in the order Sapindales. The species of the Meliaceae family, which are contained in The Plant List [1] belong to 52 plant genera, all showing a pantropical distribution. The Plant List includes 3198 scientific plant names of species rank for the family Meliaceae. Of these, 669 are accepted species names.
Cedrela odorata L., a fast-growing deciduous tree species, is the most commercially important and widely distributed species in the genus Cedrela, and one of the world's most important timber species. It is found from Mexico southwards throughout central America to northern Argentina, as well as in the Caribbean. The aromatic wood is in high demand in the American tropics because it is naturally termite-and rot-resistant. It contains an aromatic and insect-repelling resin that is the source of one of its popular names, Spanish-cedar (it resembles the aroma of true cedars, Cedrus spp.). Other common names include Cuban cedar or cedro in Spanish. It is used for a wide range of purposes, including the The plastomes of these four Meliaceae species (and of Azadirachta indica) are small circular DNA molecules of sizes in the range of 158,558 bp to 160,737 bp, with the typical quadripartite structure of land plant cp genomes consisting of two inverted repeats (IRa and IRb) separated by large (LSC) and small (SSC) single copy regions, respectively ( Table 2). In each of the cp genomes of the four newly sequenced Meliaceae species (Table 1), 112 different genes were annotated (78 protein-coding, 30 tRNA and 4 rRNA genes), 18 of which were duplicated in the IR regions, giving a total of 130 genes encoding 85 proteins, 37 tRNAs and 8 rRNAs (Tables 2  and 3). Among the 112 unique genes, 18 included one or two introns. The intron sizes are ranging from 532 bp for trnL-UAA to 2535 bp for trnK-UUU when considering the plastome of Cedrela odorata. One gene, rps12, is presumed to require trans-splicing (Table 3; [21]). Table 3. List of genes annotated in the cp genomes of four Meliaceae species sequenced in this study (Table 2).
The gene maps of the newly sequenced Meliaceae species are provided in Figure 1 (Cedrela odorata), and in Figures S1-S3 (Entandrophragma cylindricum, Khaya senegalensis and Carapa guianensis). The gene content and gene order in the cp genomes of the four species are nearly identical to the previously published plastome of Azadirachta indica (GenBank KF986530.1) with the exception that the ycf1 gene at the IRa/SSC border was not annotated as a pseudogene in Azadirachta indica resulting in a total number of 131 genes in this species compared to 130 genes in the other four Meliaceae species (Table 2). Another exception is that the gene names of trnG-UCC and trnC-GCC are swapped due to misannotation in Azadirachta indica.
As mentioned above, 18 unique genes were annotated to include introns in the four newly sequenced Meliaceae species (Table 3), whereas introns are missing in the annotations of two of these genes in Azadirachta indica (GenBank KF986530.1), namely petD and rps12 (intron in the 3 part of the gene). The annotations of these genes are probably not correct in Azadirachta indica; both exons are relatively short (exon 1 of petD: 8 bp; exon 3 of rps12: 26 bp), thus making their annotation difficult.
The gene rps19 is included in the inverted repeats (close to the IRa/LSC or IRb/LSC border, respectively; Figure 1, Figures S1-S3, Table 3) in the four Meliaceae species sequenced in this study as well as in Azadirachta indica (KF986530.1) thus resulting in a duplication of rps19. The grey arrows indicate the direction of transcription of the two DNA strands. A GC-content graph is depicted within the inner circle. The circle inside the GC content graph marks the 50% threshold. The maps were created using OrganellarGenomeDRAW [22].  The grey arrows indicate the direction of transcription of the two DNA strands. A GC-content graph is depicted within the inner circle. The circle inside the GC content graph marks the 50% threshold. The maps were created using OrganellarGenomeDRAW [22].

Diversity of the Meliaceae Plastome Sequences
The newly sequenced individuals of the four different Meliaceae species share several large regions, located in intergenic regions and in the rpl16 intron (LSC), which show low similarity to Azadirachta indica ( Figure 2). These regions are related to large deletions which become obvious in a multiple alignment of related whole plastomes ( Figure S4). The largest deletion which is in the psbE-petL linker is of a size of about 199 bp ( Figure S4; at position 69531 bp of the Azadirachta indica sequence). These deletions mainly contribute to the smaller genome sizes of the four individuals compared to Azadirachta indica ( Table 2). The Cedrela odorata individual, which is the individual with the smallest cp genome size (158,558 bp; Table 2 Table 2. Based on the multiple whole-plastome alignment of the five Meliaceae individuals ( Figure S4), 7635 positions that showed DNA sequence variations (SNPs or indels) in at least one of the plastomes (compared to the consensus sequence), were called by the SNiPlay tool (SNiPlay genotyping table  in Table S5). The following regions of the consensus sequence showed the highest frequencies of variations (considering intervals of 10,000 bp): 1-10,000 bp (1-9681 bp in Azadirachta indica) with 923 variable positions (top1); 120,000-130,000 bp (117,660-127,364 bp in A. indica) with 771 positions (top2); and 130,000-140,000 bp (127,365-137,199 bp in A. indica) with 735 positions (top3). The top1-region is located at the 5-prime part of the LSC including psbA, matK, rps16, psbK and psbI ( Figure 2). The top2and top3-region are connected and represent the SSC downstream of ndhF, the SSC/IRb border and parts of the rRNA cluster of the IRb (Figure 2).
A phylogenetic tree based on a multiple alignment of the five complete Meliaceae cpDNA sequences ( Table 2) and Acer buergerianum (family Sapindaceae in the Sapindales; NC_034744.1) as an outgroup, shows that the analyzed Meliaceae form a monophyletic sister clade to Azadirachta indica within the Meliaceae. Within this clade, Cedrela odorata and Entandrophragma cylindricum group together, as do Khaya senegalensis and Carapa guianensis (Figure 3). species each with the cpDNA sequence of Azadirachta indica (reference). VISTA-based similarity plots portraying the sequence identity of each of the four Meliacea species with the reference Azadirachta indica are shown. The annotation (protein-encoding genes) is provided for Azadirachta indica on top (based on the related GenBank file; KF986530.1). Plastome regions with the highest diversity between the 5 Meliaceae individuals are marked by blue arrows (top1-3). Further details are provided in the text below.  Table 2. The bootstrap value on the branch separating Azadirachta indica from the other Meliaceae is below 70% and was not shown for this reason. GenBank accession numbers of the plastomes are given in Table 2.

Comparative Analyses for the Identification of Potential Meliaceae-Specific Plastome DNA Variations in One Barcoding Region
The matK gene, one of the cpDNA barcoding regions (see Introduction), was selected for comparative analyses aiming at the identification of potential Meliaceae-specific cpDNA variations. Based on a multiple alignment ( Figure S5) of the extracted matK gene sequences of the five Meliaceae species listed in Table 2 Table 4).
These 16 SNP positions were further analyzed in a multiple alignment of matK gene sequences from 100 Meliaceae individuals downloaded from GenBank ( Figure S6). Only SNPs where all the downloaded Meliaceae sequences showed the same nucleotide at the given position were further validated with more matK sequences from members of other taxonomic groups (other Sapindales, other Rosids, and other land plants; Table 4).
The SNPs at the following four positions of the matK gene were selected as potential Meliaceae-specific: 346 bp/328 bp (consensus sequence/Cedrela odorata sequence), 1318 bp/1270 bp, 1478 bp/1430 bp and 1494 bp/1446 bp (Table 4). At position 1318 bp/1270 bp, e.g., all Meliaceae individuals considered include a G (Table 4, Figure S6) in contrast to members of other Sapindales families, other Rosids, as well as other land plants, where the considered individuals show an A (Table 4). This SNP is at the first position of the codon encoding for MatK amino acid 424 in Cedrela odorata and will result in an amino acid exchange (Meliaceae analyzed: Gly ( Figure S6); other Sapindales, Rosids and land plants analyzed: Ser or Asp).
The other barcoding region, the rbcL gene (see Introduction), showed less nucleotide variation compared to matK when analyzed in a similar way ( Figure S7). In a multiple alignment of the rbcL sequences of five Meliaceae species and five non-Meliaceae species of the order Sapindales ( Figure S7), only 2 SNP positions were identified where the five Meliaceae individuals showed another nucleotide than the non-Meliaceae species at the same position (SNP positions 225 and 582 related to the consensus sequence). However, at these two positions, some other Meliaceae individuals showed differing nucleotides at GenBank; thus, these positions were not further analyzed. The SNP positions listed were validated in different multiple alignments of matK sequences from member individuals of different taxonomic groups downloaded from GenBank (100 sequences each). The "position (consensus)" refers to the position in the consensus sequence in the alignment in Figure S5

Discussion
In this study, the complete chloroplast genomes of four Meliaceae species (order Sapindales) were sequenced by NGS (genome-skimming), annotated (Figure 1, Figures S1-S3, Table 3) and compared with each other and with the only previously published Meliaceae plastome sequence, the sequence of Azadirachta indica (KF986530.1; Table 2). Across eight other families within the Sapindales, there are currently 38 complete cpDNA sequences available at the Organellar Genome Resource at NCBI [24].
Gene and intron content are highly conserved among land plant plastomes, although losses have been identified in several angiosperm lineages (reviewed, e.g., in [25,26]). The following genes known to be lost in some other Angiosperm species [25] are also missing in the four newly sequenced Meliaceae plastomes (Table 3) as well as in Azadirachta indica (KF986530.1): chlB, chlL, chlN, infA, trnP-GGG. The losses of chlB, chlL, chlN, and trnP-GGG represent synapomorphies for flowering plants [25]. The most common gene loss involves infA [25,27]. There is evidence that infA has been transferred to the nucleus in some species [27], which has not yet been investigated in Meliaceae. The low depth of coverage (related to the nuclear genome) of the genome skimming data used in this study did not allow to perform such an investigation.
Intron content is also highly conserved across angiosperms with most genomes containing 18 genes with introns [25]. In the four Meliaceae species sequenced in this study, 18 unique genes with one or two introns were annotated ( Table 3). The apparent lack of the petD intron and the intron in the 3 -rps12 locus in Azadirachta indica (KF986530.1) are annotation errors in our opinion.
Hotspots for structural rearrangements within plastomes include the IRs, which are frequently subjected to expansion, contraction, or even complete loss (summarized in [21]). Inverted repeat expansion resulted in a duplication of rps19 in the five Meliaceae plastomes analyzed so far (Figure 1, Figures S1-S3, Table 3, KF986530.1). This is consistent with research considering the plastomes of Nitraria sibirica (Nitrariaceae [28]) and 38 other Sapindales species [24] not belonging to the Meliaceae. However, some of these plastomes have only incomplete second rps19 copies (Anacardium occidentale, NC_035235.1; Acer miaotaiense, NC_030343.1; Acer buergerianum, NC_034744.1; Mangifera indica, NC_035239.1) or the second copy is even completely missing, such as in Aesculus wangii (NC_035955.1) or Pistacia vera (NC_034998.1), where the first rps19 copy is in addition incomplete. The rps19 gene is completely absent in Rhus chinensis (NC_033535.1). In contrast to the Sapindales, many other rosid species include only one rps19 gene in the LSC of the cpDNA sequence [28]. To answer the question if the duplication of rps19 is a general plastome feature in Meliaceae, the plastomes of other member species of the 52 plant genera within the Meliaceae must be sequenced and annotated.
Complete plastome sequences are valuable for deciphering phylogenetic relationships especially between closely related taxa, or where recent divergence, rapid speciation or slow genome evolution has resulted in limited sequence variation [18,29,30]. Considering that most species-level diversity of Meliaceae in rainforests is recent [31], a whole-plastome phylogeny is highly desired for this family. The whole-plastome phylogeny that was generated in this study (Figure 3) based on the whole-plastome alignment for five Meliaceae species (Figure S4), is an initial step in this direction. The result that Azadirachta indica (member of the subfamily Melioidaeae) belongs to another subclade than the four other Meliaceae species (members of the subfamily Swietenioideae) as well as the further sub-grouping of the four Meliaceae species, agrees with previous studies (subfamilies according to [31][32][33][34]). In the future, more complete cpDNA sequences of member individuals of other Meliaceae species are needed to exploit the power of the whole-plastome phylogenies, especially for differentiation at lower taxonomic levels. Additional integration of complete cpDNA sequences with small amplicon datasets could further improve the phylogenetic resolution, as recently shown in Acacia, where the greatest support has been achieved when using a whole-plastome phylogeny as a constraint on the amplicon-derived phylogeny [30].
Whole-plastome alignments are also very useful to develop cpDNA markers for the genetic identification of species or other taxonomic categories [17]. Especially, large indels can be easily identified in whole-plastome alignments and used, e.g., for the development of robust PCR-based markers, as recently shown for a Populus tremula-specific marker that has been developed based on a 96 bp-indel [17,35]. In the present study, we identified and compactly visualized large indels between the analyzed Meliaceae individuals based on pairwise alignments using VISTA (Figure 2). Exclusive large deletions were identified for the Cedrela odorata and the Carapa guianensis individuals, respectively ( Figure 2). Before effective markers can be developed, it must be further validated with more Meliaceae species and individuals whether these deletions are genus-, species-or individual-specific. PCR-marker development based on indels in small intergenic regions might be particularly robust, because the primers could be placed into conserved genic regions adjacent to the indel. The SNPs and indels identified between the five Meliaceae plastomes (Table S5), based on a multiple alignment, may also serve-after further validation-as targets for future marker development at different taxonomic levels.
Aiming at the identification of potential Meliaceae-specific SNPs, the matK gene, one of the barcoding regions [7]-was selected and the SNPs identified in this gene were analyzed in extended sets of previously published matK sequences. In general, nucleotide variation per site in matK is three times higher than in rbcL (the large subunit of RUBISCO; [36]), and the amino acid substitution rate is six times that of rbcL [37]. Here, four potential Meliaceae-specific SNPs were identified (Table 4), three of which are in the 3 -terminus of the matK gene encoding for the C-terminus of MatK in a region that is homolog to domain X of mitochondrial group II intron maturases [38] and has a significantly higher amount of basic amino acids compared to the N-terminal region [39]. The potential Meliaceae-specific SNPs-once further validated-should be attractive for the taxonomic differentiation of samples from wood or wood products.

Sampling, DNA Extraction and Sequencing
The individual trees analyzed are described in detail in Table 1. Genomic DNA was isolated from leaves or cambium according to Dumolin et al. 1995. Genomic library generation and sequencing on the Illumina MiSeq v3 (2 × 300 bp paired-end reads) was done by Eurofins Genomics (Ebersberg, Germany).

Assemblies of Chloroplast Genome Sequences and Annotation
If not otherwise stated, CLC Genomics Workbench (CLC-GWB; v8.5.1 and v9.5.3; CLC bio, A QIAGEN company; Aarhus, Denmark) was used for data processing. Initial quality control of the NGS reads was done with FastQC [40].

Assembly of cpDNA Sequences of Khaya Senegalensis and Entandrophragma Cylindricum
All reads were trimmed with CLC-GWB including adapter trimming, quality trimming (quality limit of 0.01), trimming of 10 nucleotides at the 5 -end and removing reads of less than 50 bp in length. All other options were set to default. Potential chloroplast reads were extracted by mapping all trimmed reads against the chloroplast sequence of Azadirachta indica (KF986530.1) using the tool MITObim [41]. Multiple de novo assemblies based on different word sizes were performed on extracted chloroplast reads. Overlapping contigs of one or more assemblies were used to combine the contigs to complete cpDNA sequences.

Assembly of cpDNA Sequences of Carapa Guianensis and Cedrela Odorata
Adapter sequences were removed by the trimming software Trimmomatic [42] using simple and palindromic trim. Further trimming was done by CLC-GWB including quality trimming (quality limit of 0.01), trimming of 10 nucleotides at the 3 -and 15 nucleotides 5 -end and removing reads shorter than 50 bp. All other options were set to default. A first set of contigs was generated by de novo assembly of all trimmed reads, using a length fraction of 0.9 and a similarity fraction of 0.95. All resulting contigs were blasted with the command line tool blastn [43] against the nucleotide BLAST database downloaded from GenBank [44] to identify and select chloroplast contigs. All resulting Blast hits were filtered for the keyword "chloroplast" and finally validated with the web Blast tool at NCBI [45]. Trimmed reads were mapped to the chloroplast contigs, mapped reads were extracted and stored in a separate read set as chloroplast reads. Multiple de novo assemblies based on different word sizes were performed on these chloroplast reads. Overlapping contigs of one or more assemblies were used to assemble the complete cpDNA sequence.

Annotation of the cpDNA Sequences and Preparation of GenBank Submission Files
Draft annotations were generated with the web-based software CPGAVAS [46,47] to check gene content and order. These draft annotations were corrected where necessary, guided by alignments to other well-characterized eudicot plastomes including those of Arabidopsis thaliana (NC_000932), Nicotiana tabacum (NC_001879) and Spinacia oleracea (NC_002202).
The file resulting from the fine annotation of the plastome of Khaya senegalensis (GB-format) was transferred to a draft SQN-file using the CHLOROBOX-GenBank2Sequin-tool [48] and edited using the Sequin tool (v13.05; [49]). The error-corrected SQN-file was submitted to GenBank. Because gene content and order were the same in all species analyzed (according to the fine annotation), the GenBank submission files (SQN-format) for the other species were created by updating the sqn-file of Khaya senegalensis with the sequences of the other three species (using the "update sequence" function in Sequin) and subsequent manual editing of shifted annotations in Sequin.

Alignments and Construction of a Phylogenetic Tree
Pairwise alignments of complete cpDNA sequences were performed using the VISTA tool mVISTA ("AVID" as alignment program) at VISTA [51,52]. Identity plots of each pairwise alignment were downloaded from the related VISTA-point results page.
The phylogenetic tree was constructed based on the alignment of five Meliaceae species together with one outgroup (Acer buergerianum, NC_034744.1; family Sapindaceae in the order Sapindales) using the "Maximum likelihood phylogeny" tool of CLC-GWB including bootstrap analysis with 100 replicates (other parameters: construction method for the start tree = UPGMA; existing start tree = not set; nucleotide substitution model = Jukes Cantor; protein substitution model = WAG; transition/transversion ratio = 2.0; include rate variation = No; number of substitution rate categories = 4; gamma distribution parameter = 1.0; estimate substitution rate parameter(s) = Yes; estimate topology = Yes; estimate gamma distribution parameter = no).

SNP and Indel Detection in Multiple Alignments Using SNiPlay
To identify SNPs between the complete cpDNA sequences of the 5 Meliaceae species (Table 2), the alignment-FASTA of the related multiple alignment was exported from CLC-GWB and further edited (replacement of alignment spaces "-"-if present-by "?"at the 5 -or 3 -end of the sequences). The edited alignment-FASTA was used as an input for the web tool SNiPlay (pipeline v2; [53]) to run SNP/indel discovery (default parameters) [54].

NCBI-Blast Analyses of matK and Download of matK Gene Sequences of Different Taxonomic Groups for Multiple Alignments
The matK gene sequence of Cedrela odorata (MG724915) was used as a query in different BlastN searches (parameters: optimized for highly similar sequences, maximal target sequences: 100) at GenBank (NCBI; [45]): (i) restrict to Meliaceae (taxid:43707); (ii) restrict to Sapindales (taxid:41937) and exclude Meliaceae; (iii) restrict to Rosids (taxid:71275) and exclude Sapindales; and (iv) no restriction, but exclude Rosids. After each Blast analysis, all 100 hits were selected and a FASTA was downloaded with the aligned sequences. Each FASTA file was used as input for a multiple alignment together with the matK sequence of Cedrela odorata (used as reference). In the case of the Meliaceae alignment, sequences of wrong orientation in the alignment were reverse complemented and the alignment was repeated. In the case of other alignments only sequences in the right orientation (CDS orientation) were considered in the further analyses.
Acknowledgments: This work was funded by the Federal Ministry of Food and Agriculture (BMEL), Germany in the scope of the "Large scale project on genetic timber verification" and by the International Tropical Timber Organization (project PD 620/11 Rev.1). We would like to thank Susanne Bein for technical assistance in the laboratory, Stephen Cavers and Niklas Tysklind for providing samples from South America. Open access costs were covered by the Thuenen Institute, Braunschweig, Germany.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.