Comparative Analyses of 35 Complete Chloroplast Genomes from the Genus Dalbergia (Fabacaeae) and the Identication of DNA Barcodes for Tracking Illegal Logging and Counterfeit Rosewood

The genus Dalbergia contains more than 120 species several of which are trees that produce traditional medicines and extremely high value timber commonly referred to as rosewood. Due to the rarity of these species in the wild, the high value of the timber, and a growing international illicit trade CITES has listed the entire genus in appendix II and the species D. nigra in appendix I because it is considered threatened with extinction. Given this and the fact that species or even genus level determination is nearly impossible from cut timber alternative molecular methods are needed to identify and track intercepted rosewood. In order to improve molecular identication of rosewood, we sequenced and assembled eight chloroplast genomes including D. nigra as well as conducted comparative analyses with all other available chloroplast genomes in Dalbergia and closely related lineages. From these analyses numerous repeats including simple sequence repeats (SSR) and conserved nucleotide polymorphisms unique to subclades within the genus were detected. From phylogenetic analysis using the CDS of 77 coding genes the groups Siam rosewood and scented rosewood based mainly on wood characteristics were supported as monophyletic. In addition, several instances of paraphyly and polyphyly resulting from mismatch between taxonomic determinations and phylogenetic tree topology were identied. Ultimately, the highly variable regions in the chloroplast genomes will provide useful plastid markers for further studies regarding the identication, phylogeny, and population genetics of Dalbergia species including those frequently intercepted in illegal trade.


Introduction
The genus Dalbergia L.f. in the tribe Dalbergieae (DC.) Cardoso and family Fabaceae contains approximately 275 species of trees, shrubs, and lianas. The genus is widely distributed worldwide with species occurring in the tropics and subtropics of South and Central America, Africa, Madagascar, and Asia. Several tree species in the genus are highly valued for producing premium darkly colored, dense, and sometimes fragrant wood used in a variety of applications such as the production of musical instruments, traditional medicine, ne furniture, cabinetry, and veneers. By way of example rosewood furniture in China can range from thousands of dollars for an ornate chair to millions of dollars ($USD) for a bed frame (Zhu, 2020). The main species used to produce high quality timber are loosely grouped into the three categories (made up of several species in each group) black rosewood, scented rosewood, and Siam rosewood based mainly on characteristics of the wood. The black and Siam rosewoods are the most highly valued.
The extremely high value of rosewood has led to large scale illegal logging and international trade in a number of different rosewood species involving numerous countries where rosewood is native. The main centers of rosewood logging are found in Central and South America (Espinoza et al., 2015;Vardeman and Velásquez Runk 2020), Africa and Madagascar (Innes 2010;Abdul-Rahaman et al., 2016), and Southeast Asia (Siriwat and Nijman 2018;Nhung et al., 2020). Madagascar is an area of particular concern given the high number of endemic Dalbergia species (42 spp.) and the presence of organized syndicates taking advantage of the political and economic instability in the country (DuPuy 2002;Patel 2007;Randriamalala and Liu 2010) Rosewood is considered the most tra cked group of endangered species in the world with the value of global seizures exceeding ivory, rhino horn, and big cats combined (United Nations O ce on Drugs and Crime, 2016).
Because of illegal trade, over exploitation, and the similarity of wood between some species the entire Dalbergia genus is protected under CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) appendix II which requires permitting for export of most parts of the plant including wood from this genus (CITES, 2020). The Brazilian species Dalbergia nigra (Vell.) Benth endemic to the Bahia interior forest ecoregion, is considered threatened with extinction and has been listed in appendix I of CITES (2020) prohibiting all international trade. In addition to listing in appendix II of CITES D. fusca Pierre and D. odorifera T.C. Chen are also listed in China's key protection list. Given the illegal trade in rosewood and di culty in differentiating Dalbergieae species based on differences in con scated wood, a means of effectively identifying different rosewood species is needed. In parallel with tracking illegal trade of rosewood, programs that could be developed to produce licit rosewood timber will also bene t from routine molecular identi cation and tracking as part of a robust certi cation program.
In studies of plant biology, many different kinds of molecular markers have been developed, including AFLP (Ampli ed Fragment Length Polymorphism), SSRs (Simple Sequence Repeats), and a variety of short (500-1000bp) DNA sequences referred to as barcodes for the identi cation of species and populations . The cost of sequencing has rapidly decreased in recent years with concurrent increases in throughput making the discovery and development of diagnostic markers more e cient and cost effective even for a large number of species . Given the utility of DNA barcode-based-identi cation numerous different barcode regions from different cellular compartments have been proposed in plants (CBOL Plant Working Group, 2009). In recent years, super DNA barcodes have been applied to entire genome sequences, as in the comparative analyses of complete plant chloroplast genomes (Hollingsworth et al., 2016;Zhang et al., 2019;Zhang et al., 2020). Unlike the relatively large and complex nuclear genome, the chloroplast genome has several advantages including uniparental inheritance, high information content (in variable sites), very low recombination rates, and high copy number, making chloroplast genomes and chloroplast barcodes ideally suited for studies in plant systematics Fu et al., 2019), population genetics Zhou et al. 2021) and plant taxonomy (CBOL Plant Working Group, 2009). The conserved gene content and structural arrangement of chloroplast genomes in two inverted repeating regions (IRs), a long single-copy region (LSC) and, a short single-copy region (SSC) make assembly and alignment of chloroplast genomes more complete and less error prone than with plant nuclear or mitochondrial genomes. Additionally, because a large number of complete chloroplast genomes are available in public databases comparative analyses and searches are more accurate and thorough than with any other complete genome data in plants. Based on the advantages outlined above the whole chloroplast genome and barcodes derived therefrom would be ideal for identifying Dalbergia species from wood samples.
To isolate informative molecular markers useful for species identi cation from wood material we conducted the following analyses: 1) sequenced, assembled, and annotated the chloroplast genomes of eight Dalbergia species (including the CITES listed appendix I D. nigra) using next-generation sequencing methods; 2) conducted a phylogenetic analysis to infer the relationships among Dalbergia species and; 3) comprehensively analyzed the highly variable regions and conserved nucleotide sites unique to each clade in Dalbergia that could be employed for DNA barcode-based-identi cation.   Table 1). In general, the genome sequences of species in Dalbergia were similar in length.
The ndhF gene spans IRB and SSC in most of Dalbergia species (Supplemental Table 1). The SSC and IRA junction is spanned by ycf1 in all 43 chloroplast genomes, while no gene spans the IRA and LSC junction. The intergenic region that spans the IRA and LSC junction varies widely across the 43 chloroplast genomes with the distance between trnN-GUU and ndhF ranging from 864 bp to 1,668 bp. Even within a species this region was found to vary as in D. tonkinensis with the distance ranging from 882 bp to 900 bp. Similarly, the intergenic region between ycf1 and trnN-GUU varied from 411 bp to 429 bp across the D. tonkinensis chloroplast genomes ( Table  1). The expansion and contraction of the SSC region across the samples used in this study might make this region a suitable candidate for marker development.

Repeat Analysis
Nucleotide repeats in chloroplast genomes can be very useful markers for identifying populations and/or species given the high rates of mutation in these regions. All 43 chloroplast genomes were analyzed using Reputer software using the limitation that repeats must have the length of 8 bp or more. Four types of repeats were considered, including forward (direct) (F), reverse (R), complement (C), and palindromic (P). Based on different motif types the number of each type was counted based on grouping by sequence length (Figure 3). Almost all the repeats were in the 20-29 bp length range, followed by 30-39 bp, then 50+ bp, with the fewest in the 40-49 bp range. No C and R repeats where detected above 40bp in length and they were rare even in the smaller size ranges. In the 30-39 bp group, C and R repeats were only found in a few species. The R repeats only appear in six Dalbergia species, and C is only in Pterocarpus. For F and P repeat types, they are absent in the range of 40-49 bp in the scented rosewood species, while in 20-29 bp group, the C type repeats are lowest in Siam rosewood species (Figure 3). Given these differences in type and abundance, markers for identi cation of species and/or lineages could be devised based on different repeats. Fixed differences in repeat location, abundance, and type in a genome have provided ideal signatures for species or clade identi cation in previous studies (Saltonstall and Lambertini, 2012;Brazda et al., 2018;Zhang et al., 2019).
In chloroplasts, SSRs are often used for population genetics and/or phylogenetic analysis. Among all 43 chloroplast genomes, 86.7% (5579/6423) of the SSRs were single nucleotide A/T motifs. The genus Arachis was found to have less than half the number of A/T SSRs than the other species used in this study suggesting that indels are more abundant in long A/T stretches for these species. Similarly, the number of nearly all other, save AG/CT and AGAT/ATCT, SSR motif types were far fewer in Arachis than the other species (Figure 4). Within Dalbergia, the number of SSRs vary greatly to the point where certain motifs such as AAT/ATT are present in the scented rosewoods and absent in the black rosewoods. Additionally, the presence or absence of certain SSRs differed within a given species such as AT/AT in D. odorifera and C/G in D. tonkinensis. The results from the SSR analyses suggests that these genomic regions might be useful for the identi cation of populations, species, and clades of Dalbergia if the data from different SSR motif types and length difference is combined in a nested analysis.

Genome Sequence Divergence
To further characterize the differences between chloroplast genomes, we employed mVISTA to nd regions of greatest difference between conserved regions in the eight newly sequenced genomes and Pterocarpus indicus Willd. (an outgroup species that also produces high-quality rosewood type lumber). The intergenic and intragenic regions were found to have the least similarity between chloroplast genomes in Dalbergia and P. indicus ( Figure 5), especially in LSC (from psbA to rps19) and SSC regions (from ndhF to ycf1). Given these results there are numerous intergenic and intragenic regions for developing markers to differentiate Pterocarpus from Dalbergia and fewer, but a su cient number of regions to use for within Dalbergia species or clade level differentiation.
In order to more comprehensively assess regions of dissimilarity we used T-coffee to compare all 43 samples using complete genome sequences ( Figure 6, Supplemental Table 4). As in other plant lineages (Gu et al., 2019) CDS regions had very high identity scores across all 43 chloroplast genomes while the lowest identity scores were found in rps8-rpl14 (score 767; length 524bp), trnR-UCU-trnG-UCC (score 788; length 592bp), accD-psaI (score 790; length 824bp), psbA-trnK-UUU (score 797; length 658bp), and ndhG-ndhI (score 804; length 1308bp). The discovery of these hypervariable intergenic regions provide candidate regions for the development of genetic markers. While most intergenic regions are more variable psaA-psaB (score 1000; length 25bp), psbL-psbF (score, 1000; length, 22bp), psbF-psbE (score, 1000; length, 9), and ndhA-ndhH (score, 1000; length, 1) where noted as being very similar and short in length. As such these regions should be excluded from further consideration for marker development but might be useful in providing priming sites adjacent to variable regions in the development of molecular assays. Given that these photosynthesis genes are clustered in a single operon strong selection of function has resulted in nucleotide conservation even in the intergenic regions. However, in general su cient nucleotide divergence has been found for the development of genetic markers in Dalbergia and to closely related genera that produce lumber of a similar appearance.
In a third approach to nd regions of xed differences for identifying Dalbergia species we analyzed the dN (nonsynonymous substitution rates), dS (synonymous substitution rates) and the ratio of dN/dS (quantify strength of selection) of all genes with PAML. This approach is particularly useful in nding xed differences that persist through a given lineage because mutations that are undergoing different modes of selection can be detected. From this analysis the dN of all genes was relatively low, while for the dS, for two ribosomal genes (rps16 and rpl32) were outliers. Except for ycf2 (1.02), the dN/dS of all the other genes are less than 1, indicating that they are subject to puri cation selection, especially photosynthesis genes, which are lower on average than the other four gene categories (Figure 7).

Codon usage in Dalbergia chloroplast genes
Relative Synonymous Codon Usage (RSCU) is often used to analyze the frequency of codon usage, with the higher values scaling with usage frequency (Figure 8). Given the conserved nature of codon usage, if a mutation is detected, it general remains xed and as such can provide a useful marker locus. Given this we analyzed the codon usage of Dalbergia chloroplast genes. There are no large differences in RSCU between Dalbergia chloroplast genes, indicating conservation in codon usage across Dalbergia, and a very limited number of loci for clade or species identi cation.

Phylogenetic analysis and barcode selection
In order to assess chloroplast genome divergence in an evolutionary context and nd synapomorphies  (Figure 9). These discrepancies between taxonomy and phylogeny indicate that some of the species identi cations for Dalbergia NCBI accessions are probably incorrect. Alternatively, some of the examples of polyphyly might be the result of interspeci c hybridization where a chloroplast genome was maternally inherited from a more distantly related species. Ultimately, in either case improper identi cation (of hybrids and/or species) has led to discordance between the topology and taxonomic designations. These discrepancies should be addressed through identi cation of samples by taxonomic experts where possible as well as resequencing checks using nuclear loci (e.g. ITS) and reanalysis as well as increasing the number of different species used in phylogenetic analyses. Furthermore, comparative phylogenetic approaches can be used to isolate which taxa are most likely misidenti ed in the chloroplast data. For instance, D. sissoo MN936016 is probably correctly identi ed based on the location it resolved in the phylogenetic tree (early diverging to a clade containing D. tonkinensis in both trees) in this paper compared to the position in the phylogenetic analyses of Vatanparast et al. (2013) and Hassold et al. (2016). Issues of polyphyly in the Siam rosewood clade were further veri ed using whole chloroplast comparisons. Differences are apparent between the phylogenetic tree presented here and previously published trees as is expected given the different genomic regions and species sampled. That said consistency is found among the two trees in the membership of important taxa in separate clades and branching order. For example, D. nigra has an early diverging position to clade V in Vatanparast et al. (2013) and to a clade with similar membership in the phylogenetic tree presented here. These similarities in phylogenetic topology suggest that both ITS and chloroplast barcoding could be employed to identify wood samples. However, much greater within species sampling is needed to compare the stability of polymorphisms within species for each locus before a molecular assay can be deployed. This is especially true of ITS where the large number tandem copies can be found in multiple ribotypes (Teruel et al. 2013) and potentially affect the accuracy of molecular assays. Lastly it should be noted that in our analyses the black rosewood category does not form a monophyletic grouping whereas scented and Siam rosewood categories do form monophyletic groupings provided the issue of incorrect labeling and polyphyly can be corrected.
From a MAFFT alignment of the coding regions from 43 chloroplast genomes, we scanned for loci rich in SNVs (single nucleotide variants) and INDELS (INsertions DELetions) to identify potential barcode loci (Supplemental File 1 and 2). The MAFFT alignment after manual correction was 219,992 sites in total length. From this alignment 58 SNV loci and 10 INDEL loci were found to identify Siam rosewoods, 29 SNVs and 13 INDELs in scented rosewood, and 2 SNVs and 1 INDEL for black rosewood if D. nigra was taken into account. If D. nigra was removed 229 SNVs and 49 INDELs were found for the identi cation of the clade D. cultrata Benth. + D. fusca. However, since D. nigra is the highest priority Dalbergia species in CITES a set of 94 SNVs and 69 INDELs were isolated for the identi cation of this species from all other Dalbergia used in this study (Table 2; Supplemental Table 5 and 6).

Genome sequencing and assembly
The Illumina HiSeq 2500 platform was used to sequence the raw data with insert sizes of 500 bp, for 150 bp paired-end read lengths. Raw data was quality control ltered with the following criteria: ltered reads with adapters, ltered reads with N bases >10%, and ltered reads with low-quality bases (≤5) >50%, which yielded 2 Gb of clean reads for each species by Trimmomatic (Bolger et al., 2014).
All paired-end clean reads were aligned to the chloroplast database (containing all published chloroplast genomes from NCBI by the date November 26, 2019) with bwa v0.7.17-r1188 (Li, 2013) software, and then the Picard v2.20.3 program was used to select chloroplast reads. The selected chloroplast reads were assembled by Spades v3.14.0 (Nurk et al., 2013) with default parameters and the output scaffolds (GFA le) were imported into Bandage v0.8.1 (Wick et al., 2015) to generate the nal chloroplast genome for each species.

Genome annotation
All 43

Genome Nucleotide Diversity
Analyses of genome sequence diversity was done using an online software mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml, Frazer et al., 2004) to compare the 8 newly assembled Dalbergia species using Shu e-LAGAN (Brudno et al., 2003) alignment program with the P. indicus chloroplast genome used as a reference. All 43 chloroplast genomes were split into several parts based on annotation les, and the overall consistency score of each part was calculated with multiple sequence alignment tools using T-Coffee (Notredame et al., 2000) in default mode.

Phylogenetic Analysis and Nucleotide Substitutions
The whole chloroplast genome sequence alignment of 43 chloroplast genomes were generated using MAFFT v7.464 (Katoh et al., 2002;Katoh and Standley, 2013) software, with TrimAL v1.4(Capella-Gutierrez et al., 2009) used to trim the poorly aligned positions. The longest CDS sequences of 77 proteincoding genes were extracted from each genome according to the annotation les, and also aligned using MAFFT (Katoh et al., 2002;Katoh and Standley, 2013). The nucleotide sequence alignments of 77 proteincoding genes were concatenated. This data set was further used to resolve the phylogenetic tree using IQ-TREE v2.0 (Nguyen et al., 2015;Minh et al., 2020) with 1000 ultrafast bootstrap replicates to assess branch support with FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/ gtree) used for tree visualization. CODEML in PAML v4.9 (Yang, 1997) was used to estimate the nonsynonymous (dN), synonymous (dS) and the ratio of nonsynonymous to synonymous nucleotide substitutions (dN/dS) for each gene.

Declarations Funding
This study was co-supported by the Research Funds for the Central Non-pro t Research Institution of Chinese Academy of Forestry (CAFYBB2017ZA001-7) and National Natural Science Foundation of China (31500537).

Con icts of Interest
The authors declare no con ict of interest.

Figure 1
Gene map of the D. nigra chloroplast genome. Genes shown inside the circle are transcribed clockwise, and those outsides are transcribed counterclockwise. The darker gray color in the inner circle corresponds to the GC content. The IRA and IRB (two inverted repeating regions); LSC (long single-copy region); and SSC (short single-copy region) are indicated outside of the GC content. Variation of repeat abundance and type in 43 chloroplast genomes.

Figure 4
The number of Simple Sequence Repeat (SSR) of different types from 43 chloroplast genomes. A. the number of AAT/ATT, AG/CT, AGAT/ATCT, AT/AT, C/G type SSRs; B, the number of A/T type SSRs in each sample.

Figure 5
Global alignment of eight newly assembled Dalbergia chloroplast genomes using mVISTA with P. indicus as reference. Y-axis shows the range of sequence identity (50-100%). tRNA and rRNA genes were not present in this gure.
Page 19/22 Figure 6 Sequence identity among coding and non-coding regions based on the alignment from 43 chloroplast genomes. T-Coffee was used to calculate the score of identity.

Figure 7
The mode and strength of selection among 77 chloroplast protein coding genes in Dalbergia.