Development of a Target Enrichment Probe Set for Conifer (REMcon)

Simple Summary Conifers are vital for both ecological and economic reasons, offering valuable insights into land plant evolution. Molecular phylogenetics plays a significant role in studying evolution, but research on conifers using large-scale data from multiple nuclear genes has been limited. Target enrichment sequencing has emerged as a crucial method in phylogenomic studies. However, a specific bait set for conifers is missing. The REMcon probe set targets around 100 single-copy nuclear loci for family- and species-level phylogenetic studies of conifers. High target recovery and read coverage were observed for the REMcon when tested on 69 species, including conifers and other gymnosperm taxa. Phylogenetic analysis based on the DNA sequences generated from REMcon recovered the existing understanding of conifer relationships. The REMcon bait set will be beneficial in generating large-scale nuclear data consistently for any conifer lineage. Abstract Conifers are an ecologically and economically important seed plant group that can provide significant insights into the evolution of land plants. Molecular phylogenetics has developed as an important approach in evolutionary studies, although there have been relatively few studies of conifers that employ large-scale data sourced from multiple nuclear genes. Target enrichment sequencing (target capture, exon capture, or Hyb-Seq) has developed as a key approach in modern phylogenomic studies. However, until now, there has been no bait set that specifically targets the entire conifer clade. REMcon is a target sequence capture probe set intended for family- and species-level phylogenetic studies of conifers that target c. 100 single-copy nuclear loci. We tested the REMcon probe set using 69 species, including 44 conifer genera across six families and four other gymnosperm taxa, to evaluate the efficiency of target capture to efficiently generate comparable DNA sequence data across conifers. The recovery of target loci was high, with, on average, 94% of the targeted regions recovered across samples with high read coverage. A phylogenetic analysis of these data produced a well-supported topology that is consistent with the current understanding of relationships among conifers. The REMcon bait set will be useful in generating relatively large-scale nuclear data sets consistently for any conifer lineage.


Introduction
Conifers are the largest extant group among gymnosperms, with more than 722 species in 72 genera and 7 families, e.g., Araucariaceae, Cupressaceae, Pinaceae, Podocarpaceae, Sciadopityaceae, Cephalotaxaceae, and Taxaceae [1][2][3].They are of great economic, ecological, and evolutionary significance, comprising approximately 39% of the world's forests, and have a fossil record spanning more than 300 million years [4][5][6][7].The complete understanding of conifer diversity, trait evolutions, genetic structure, and evolutionary history is still poorly explored [3,5,6].Molecular phylogenetic studies play an important role in understanding the mode and tempo of evolution amongst conifers, but to date, most studies have applied a limited range of markers, principally a small number of chloroplast loci plus nuclear ribosomal DNA regions typically generated by direct amplicon sequencing (e.g., [5,[8][9][10][11][12]).These studies have leveraged DNA direct amplicon sequencing data generated from these loci for phylogenetic and evolutionary analyses.Further complicating the study of molecular evolution in this major land plant lineage is the large genome size and overall complexity of conifer genomes [13,14], leaving a notable gap in the exploration of conifers using large-scale data.Target enrichment by hybridization capture (e.g., hyb-seq; Weitemier et al. 2014) [15] provides an efficient and cost-effective approach for generating DNA sequence data for a large number of single and low-copy nuclear gene regions across multiple samples.
Target enrichment approaches (File S1) have become the method of choice for many systematics, phylogenetic, and evolutionary studies in plants [16][17][18][19][20][21], in part fostered by the availability of 'universal' probe sets that can recover a common set of genes across broad evolutionary timescales.These include angiosperm-specific probe sets that target hundreds of nuclear genes [17,22] and one recently developed to enrich more than 400 nuclear genes across flagellate land plants [18].While there are bait kits developed specifically for specific conifer families (e.g., Pinaceae; [23]), we are not aware of a conifer-specific bait set that targets the entire clade.
Here we present a new molecular toolkit (REMcon) which, based upon published transcriptomic (The 1000 Plant Transcriptomes Initiative, 1KP; [24]) and genomic data [25,26], uses RNA baits to target approximately 100 low-copy nuclear genes across conifers.In the present study, we demonstrate the universality and application of these new molecular tools for reconstructing phylogenetic relationships among conifers based on a broad sample of gymnosperm taxa.The approach typically recovers conservative coding regions plus more variable non-coding regions that flank the exons and has application for evolutionary analyses among closely related species, as we will demonstrate in an upcoming study of the Podocarpus 'Australis' clade (Khan et al., in prep) of family Podocarpaceae [6,27,28].
The sequences retrieved from the BLAST search for each target gene were downloaded and made into a BLAST database in Geneious (Kearse et al. 2012; https://www.geneious.com,accessed on 1 August 2021) [30].We queried each BLAST database using the Thuja plicata gene family member with exon annotations manually added and the following settings: Discontiguous Mega-Blast, expect value = 10, maximum target sequences = 1200, results = Hit Table, retrieve = Matching Region with Annotation.We then extracted all sequences matching one or more exon annotations in P. abies with the caveat that the exon was >180 bp in length to allow for bait tiling.The extracted sequences were clustered using CD-HIT-EST ( [31][32][33]; http://weizhong-lab.ucsd.edu/cdhit_suite,accessed on 4 August 2021) with a sequence identity cut-off fraction of 0.88 (see [34,35]) and a length similarity fraction of 0.2, and one representative sequence (the longest) per cluster was selected.A total of 1,124 representative sequences (mean length nucleotides, range 181-4416 nt) covering exons from 100 putative low-copy nuclear genes (Table 1; Table S1) were used for bait design with 120-nt baits and ~2× flexible tiling density for a total of 17,982 baits (see baits-Spruce [14853] for the nucleotide sequences of the baits).Bait design and synthesis were performed by Daicel Arbor Biosciences (formerly MYcroarray; Ann Arbor, MI, USA) in the generation of the myBaits Custom DNA-Seq kit™ Ann Arbor, MI, USA used for target enrichment-based next-generation sequencing.

Taxon Selection
A total of 44 conifer genera representing six families, three species of Cycadales, and Gingko biloba (69 taxa) were included to evaluate the efficiency of target capture across conifers and more widely among gymnosperms.Most plant specimens were freshly collected from the living collections held at the Botanic Gardens of South Australia and dried in silica gel, and some were sampled from preserved specimens held at the State Herbarium of South Australia (Table S2).

DNA Extraction, Library Preparation, Hybrid Capture and Sequencing
For DNA extractions, about 15 mg of silica gel dried leaf material per sample was used, and homogenized in a Omni ruptor (Omni International, Kennesaw, GA, USA) using ceramic beads.DNA was extracted using the Qiagen Plant Mini kit, QIAGEN, Germantown, MD, USA and normalized 2 ng/uL before proceeding to library preparation, which follows the steps outlined in Waycott et al. [36] for their nuclear bait set.Hybrid capture was performed following the manual provided by myBaits with a hybridization temperature of 65 • C and 150 bp paired-end sequencing was performed at the Australian Genome Research Facility (AGRF), Melbourne, Australia on an Illumina NovaSeq S1 flow cell.

Bioinformatics Analyses
High-throughput paired-end reads were de-multiplexed and quality trimmed (using a Phred score threshold of 20) using CLC Genomics Workbench (v.20; https://www.qiagenbioinformatics.com/,accessed on 22 April 2022).The Sequence Capture Processor pipeline SECAPR v 2.2.3: Andermann et al. [37], http://antonellilab.github.io/seqcap_processor/,accessed on 24 April 2022) was used to generate nuclear DNA sequence data sets from the trimmed reads.First, the reads from each sample were assembled de novo using SPAdes [38] with default kmer values.Contigs matching the reference Thuja plicata reference sequences (i.e., annotated exons, see above) were extracted from the de novo assemblies with LASTZ v. 1.04 [39] using a target length of 0.5 and a similarity fraction of 0.75 (i.e., 50% of the contig has to overlap with the target gene and be no less than 75% similar).The 'keep paralogs' flag was activated and deactivated to assess the extent of paralogy in the data.SECAPR identifies paralogs as multiple overlapping contigs matching a target sequence, keeping the longest contig if the 'keep paralogs' flag is activated.The extracted contigs were aligned per locus to produce multiple sequence alignments (MSA) using MAFFT [40].The aligned contigs were subsequently used for a reference-based assembly using the BWA read mapper v.0.7.16a-r1181 [41], and the 'sample specific' flag, i.e., each sample is extracted from the alignment and mapped separately.Consensus sequences per sample from subsequent read mappings were again aligned using MAFFT to produce MSAs for each targeted gene region.The approach developed by Yang and Smith [42] and modified with containerization for target capture data (Jackson et al. [43]; https://github.com/chrisjackson-pellicle/Yang-and-Smith-paralogy-resolution,accessed on 29 April 2022) was used to resolve groups of orthologous sequences (orthology inference) from targeted gene regions.Following various filtering steps, the approach uses phylogenetic tree-based methods and the pruning of duplicated taxa from rooted phylogenies to resolve orthologous groups of sequences.In this study, de novo contigs for each sample from the SECAPR pipeline (above) were first imported into Geneious Prime (v.2022.0.1; (https://www.geneious.com,accessed on 24 May 2022) and made into a Blast database.This database was queried using the extracted contigs from an outgroup (Ginkgo biloba) matching the targeted gene regions in P. abies.The contigs from Ginkgo were annotated with the CDS from P. abies, and the coding region(s) were queried against the Blast database using blast-n with a maximum expected value of 1e-10 and maximum hits set to 1000.The Blast output was filtered using a minimum coverage fraction (query coverage of at least 0.4) to remove poorly aligned and short sequence fragments [42].The resulting contigs were then used as input into the Yang-and-Smith-paralogy-resolution pipeline [43].We used the monophyletic outgroups (MO) method to identify ortholog groups using reference genes from Gingko biloba as the outgroup.For downstream analyses, we retained alignments with >10 individuals in order to reduce the influence of missing data in tree inference.The aligned ortholog groups were concatenated, and a phylogeny was generated using IQ-tree 2 [44] using model finder [45] to estimate the optimum partitioning scheme and partition-specific nucleotide substitution model (MFP+MERGE flag activated) and 1000 ultrafast bootstrap [46] replicates to assess branch support.

Results and Discussion
The retrieval of target loci across the conifers was high, with, on average, 94% of the targeted gene regions recovered per sample (range 53-100), and 27 loci were recovered across all samples (Table 2).For the recovered loci, read coverage (read depth/position, averaged across loci) was also high, averaging c.146 across the included taxa with a maximum of 622 in Chamaecyparis pisifera and a minimum of c. 6 in Agathis robusta (Figure 1-coverage heatmap).In general, the recovery of genes across the six conifer families was relatively consistent, and with the exception of Araucariaceae, the mean number of loci captured per family exceeds 90 (Table 2).The lower mean value for Araucariaceae (87 loci) is influenced by the poor recovery for Agathis robusta (57 loci), which is likely a consequence of DNA quality and/or issues with the library construction, given that target recovery amongst close relatives (e.g., Agathis microstachya, 94 loci) was high (Figure 1, Table S2).There was a lower recovery of target genes for the non-coniferous gymnosperm species (c.80% of genes recovered across the four samples), although this is not unexpected given that these were not specifically targeted in the probe design.Furthermore, the identity of the target sequences (here, sourced from Thuja) could influence locus recovery with increasing evolutionary distance, and an approach similar to McLay et al. [47] may be valuable in increasing target locus recovery from specific lineages.S1.
Overall, there was a large number of putatively paralogous gene copies recovered at most loci (c.36% of loci/sample, averaged across all samples) but this was highly variable among taxa (Prumnopitys andina, Podocarpaceae: c. 13% of recovered loci; Sciadopitys verticillata, Sciadopityaceae: 80% of recovered loci) (Figure 2; Table S2).The extent of paralogy might reflect the generally large genome size of conifers, which is also highly variable, with at least an order of magnitude difference between the smallest and the largest conifer genomes [14].Polyploidy is a major driver of genome size evolution amongst angiosperms, although until recently [48,49], this phenomenon was thought to be relatively rare among conifers (e.g., [13,50,51]).In addition, conifer genome size evolution has been attributed to other factors, such as a high copy number of long transposable elements (e.g., [25]).The distribution of paralogs in our data supports the view that genome size, per se, is only partly related to the frequency of duplicated genes.For instance, Podocarpaceae  S1.
Overall, there was a large number of putatively paralogous gene copies recovered at most loci (c.36% of loci/sample, averaged across all samples) but this was highly variable among taxa (Prumnopitys andina, Podocarpaceae: c. 13% of recovered loci; Sciadopitys verticillata, Sciadopityaceae: 80% of recovered loci) (Figure 2; Table S2).The extent of paralogy might reflect the generally large genome size of conifers, which is also highly variable, with at least an order of magnitude difference between the smallest and the largest conifer genomes [14].Polyploidy is a major driver of genome size evolution amongst angiosperms, although until recently [48,49], this phenomenon was thought to be relatively rare among conifers (e.g., [13,50,51]).In addition, conifer genome size evolution has been attributed to other factors, such as a high copy number of long transposable elements (e.g., [25]).The distribution of paralogs in our data supports the view that genome size, per se, is only partly related to the frequency of duplicated genes.For instance, Podocarpaceae has the smallest average genome size [14] and the smallest proportion of paralogs in our data.On the other hand, Pinaceae generally have large genomes, and we found a large proportion of putative paralogs among samples from this family.However, the relatively large genome size among Araucariaceae is not strongly associated with a high number of paralogous genes in our data (Figure 2), suggesting that factors other than gene duplication (e.g., transposable elements, larger introns, and abundant pseudogenes; [13,25,48] are also important drivers of genome size evolution. Biology 2024, 13, 361    Of the 100 targeted gene regions, 90 were recovered for Gingko, and these were included in the paralogy resolution analyses.Orthology inference recovered 98 MO ortholog groups and 95 with more than 10 samples included, which were retained for phylogenetic inference.The concatenated length of the 95 loci was an average of c. 48,770 bp with an aligned length of c. 74,179 bp and approximately 34% missing values.The average aligned length of the individual loci was c. 780 bp and ranged from 440-1691 bp.The concatenated alignment includes 47,469 (c.64%) variable positions, of which 36,350 (c.49%) are parsimony informative and 44,818 (c.60%) variable and 34,411 (c.46%) parsimony informative characters within the conifer clade.The maximum likelihood topology inferred from these data is shown in Figure 3.Of the 65 clades recovered, only 7 have a bootstrap support value < 100, and of these, only 3 received less than 80% support (Figure 3).The inferred topology is generally in agreement with our current understanding of conifer relationships (e.g., [2,3,5]), while the poorly supported nodes are associated with short branches and may be inherently difficult to resolve (e.g., [52][53][54]).For example, the relationships within the Prumnopityoid clade of Podocarpaceae, and in particular the placement of Halocarpus, were found to be unstable in the recent analyses of Chen et al. [55] using a large transcriptome data set of c. 1000 nuclear and c. 40 chloroplast gene regions and is poorly resolved here (Figure 3).S1.

Conclusions
In conclusion, we present a conifer-specific hybrid-capture bait set that has been shown to perform well in terms of the consistency of locus recovery across a broad range of gymnosperms, and these data can be applied to credibly resolve deep phylogenetic relationships within the conifer clade.As part of ongoing studies (Khan et al. in prep) [6,27,28], we have found the REMcon bait set to be similarly successful in resolving relationships among closely related species groups within Podocarpaceae.The REMcon bait set offers an efficient and relatively cost-effective approach that fills an important gap in conifer and gymnosperm phylogenomics.This hybrid-capture bait set has exciting future applications, including the resolution of complex phylogenetic relationships, population, and comparative genomics, providing valuable insights into the evolution and conservation of conifers and other gymnosperms.

Figure 1 .
Figure 1.Heatmap showing gene recovery and read depth across samples.The gene recovery includes samples that were flagged as potentially paralogous using the SECAPR pipeline.Sample abbreviations as in TableS1.

Figure 1 .
Figure 1.Heatmap showing gene recovery and read depth across samples.The gene recovery includes samples that were flagged as potentially paralogous using the SECAPR pipeline.Sample abbreviations as in TableS1.

Table 1 .
Targeted nuclear gene regions and the length of the probe sequences.

Table 2 .
Recovery of targeted gene regions, including potentially paralogous loci, across conifer families and four non-conifer gymnosperms.Locus recovery is averaged across samples (n), and the minimum and maximum recovery per sample is indicated.

Table 2 .
Recovery of targeted gene regions, including potentially paralogous loci, ac families and four non-conifer gymnosperms.Locus recovery is averaged across samples minimum and maximum recovery per sample is indicated.Of the 100 targeted gene regions, 90 were recovered for Gingko, and the cluded in the paralogy resolution analyses.Orthology inference recover