Gene Selection and Evolutionary Modeling Affect Phylogenomic Inference of Neuropterida Based on Transcriptome Data

Neuropterida is a super order of Holometabola that consists of the orders Megaloptera (dobsonflies, fishflies, and alderflies), Neuroptera (lacewings) and Raphidioptera (snakeflies). Several proposed higher-level relationships within Neuropterida, such as the relationships between the orders or between the families, have been extensively debated. To further understand the evolutionary history of Neuropterida, we conducted phylogenomic analyses of all 13 published transcriptomes of the neuropterid species, as well as of a new transcriptome of the fishfly species Ctenochauliodes similis of Liu and Yang, 2006 (Megaloptera: Corydalidae: Chauliodinae) that we sequenced. Our phylogenomic data matrix contained 1392 ortholog genes from 22 holometabolan species representing six families from Neuroptera, two families from Raphidioptera, and two families from Megaloptera as the ingroup taxa, and nine orders of Holometabola as outgroups. Phylogenetic reconstruction was performed using both concatenation and coalescent-based approaches under a site-homogeneous model as well as under a site-heterogeneous model. Surprisingly, analyses using the site-homogeneous model strongly supported a paraphyletic Neuroptera, with Coniopterygidae assigned as the sister group of all other Neuropterida. In contrast, analyses using the site-heterogeneous model recovered Neuroptera as monophyletic. The monophyly of Neuroptera was also recovered in concatenation and coalescent-based analyses using genes with stronger phylogenetic signals [i.e., higher average bootstrap support (ABS) values and higher relative tree certainty including all conflicting bipartitions (RTCA) values] under the site-homogeneous model. The present study illustrated how both data selection and model selection influence phylogenomic analyses of large-scale data matrices comprehensively.


Introduction
Neuropterida is a super order of Holometabola that is composed of the orders Neuroptera (lacewings), Megaloptera (dobsonflies, fishflies and alderflies), and Raphidioptera (snakeflies). Neuropterids are generally delicate insects that have two pairs of membranous wings with highly

Illumina Sequencing, Sequence Assembly, and Data Matrix Construction
Illumina sequencing of the transcriptome of C. similis (see Materials and Methods) yielded a total of 26,988,698 pairs of 101 base-pair (bp) long sequence reads (Table 1). After removing low-quality sequences, 25,017,948 clean pair-end sequence reads remained (Table 1). All these clean reads were assembled into loci (see Materials and Methods). Retaining the longest transcript of each locus yielded 67,683 distinct uni-genes. The minimum length of these uni-genes was 100 bp, the maximum length was 50,138 bp, and the N 50 was 1675 bp ( Table 1). The size distribution indicated that 9893/67,683 uni-genes were longer than 1000 bp ( Figure 1). To construct our phylogenomic data matrix, we used 22 holometabolous transcriptomes; 14 of these transcriptomes were from taxa belonging to the Neuropterida, and constitute the ingroup, whereas the remaining 8 represent the 8 other orders of Holometabola and were used as outgroups (Table S1, Materials and Methods). Orthologs of 2675 pre-selected Benchmarking Universal Single-Copy Orthologs (BUSCO) [29] genes that are conserved and broadly single-copy in arthropods were identified from the 22 transcriptomes. We retrieved 1392 orthologous genes that are single-copy and present in more than half of the 22 transcriptomes, resulting in a phylogenomic data matrix that contained 1,666,191 nucleotide (nt) sites and a translated amino acid (aa) version of the data matrix that contained 555,397 sites.
For each of the nt and aa versions of the data matrix, we also constructed several sub-datasets on the basis of the average bootstrap support (ABS) or relative tree certainty all (RTCA) values of the individual gene trees (see Materials and Methods).

Phylogenetic Analysis Under a Site-Homogeneous Model
Both concatenation and species coalescence analyses of the nucleotide (nt) and amino acid (aa) data matrices using a site-homogeneous model recovered a monophyletic Neuropterida. Within Neuropterida, analyses of aa and nt data matrices recovered different topologies ( Figure 2). Specifically, analyses of the aa data matrix recovered Megaloptera as the sister group to Neuroptera in both concatenation (maximum likelihood (ML), bootstrap support (BS) = 57) and species coalescence approaches (BS = 77). In contrast, analyses of the nt data matrix recovered Megaloptera as the sister group to Raphidioptera in both concatenation (ML, BS = 71) and species coalescence approaches (BS = 100). Both Megaloptera and Raphidioptera were recovered as monophyletic lineages. In all analyses, the coniopterygid species Conwentzia psociformis were identified as the basal or earliest diverging branch of the superorder Neuropterida, suggesting that the order Neuroptera is paraphyletic.

Phylogenetic Analysis Using Genes with Strong Signals
To test whether using genes with stronger phylogenetic signals can reduce incongruence, we examined the phylogenetic behavior of different subsets of genes in the aa and nt data matrices. For the nt data matrix, we performed analyses on five different data matrices comprising genes whose maximum likelihood (ML) trees had average bootstrap support (ABS) values across all internodes greater than or equal to 40% (1295 genes), 50% (1132 genes), 60% (834 genes), 70% (442 genes), or 80% (159 genes), as well as five data matrices comprising the 1295, 1132, 834, 442, or 159 genes whose ML trees had the highest relative tree certainty including all conflicting bipartitions (RTCA) values. For the aa data matrix, we performed analyses on five different data matrices comprising genes whose ML trees had ABS values across all internodes greater than or equal to 40% (1306 genes), 50% (1138 genes), 60% (863 genes), 70% (517 genes), 80% (218 genes), or 87% (72 genes), as well as five data matrices comprising the 1295, 1132, 834, 442, 159, or 65 genes whose ML trees had the highest RTCA values. Gene selection was solely based on the strength of phylogenetic signals exhibited in their gene trees (measured by ABS or RTCA) without any consideration to the topology supported. Each of these data matrices were analyzed using both concatenation and species coalescence approaches. In all cases, both the internode certainty (IC) and the internode certainty including all conflicting bipartitions (ICA) values of the vast majority of internodes greatly increased in data matrices comprised of genes with higher ABS or RTCA values (Figure 3, Tables 2 and 3), suggesting that selecting genes with high ABS or high RTCA significantly reduced incongruence in the Neuropteridan phylogeny ( Figure 3, Table 2; Table 3). Using only 834 genes with the highest RTC 2.05 0.11 11 7 Using only 442 genes with the highest RTC 6.20 0.33 16 3 Using only 159 genes with the highest RTC 10.11 0.53 17 2 The specific phylogenomic practice tested (treatment) the tree certainty including all bipartitions (TCA) of the phylogeny, the relative tree certainty including all bipartitions (RTCA) of the phylogeny, the numbers of internodes of the insect phylogeny in which the numbers of internodes of the insect phylogeny in which internode certainty including all bipartitions (ICA) increases or decreases. As the maximum value of ICA for a given internode is 1, the maximum value of TCA for a given phylogeny is the number of internodes, which are 19. Using only 218 genes with the highest RTC 11.09 0.58 17 2 The specific phylogenomic practice tested (treatment) the tree certainty including all bipartitions (TCA) of the phylogeny, the relative tree certainty including all bipartitions (RTCA) of the phylogeny, the numbers of internodes of the insect phylogeny in which the numbers of internodes of the insect phylogeny in which internode certainty including all bipartitions (ICA) increases or decreases. As the maximum value of ICA for a given internode is 1, the maximum value of TCA for a given phylogeny is the number of internodes, which are 19.  Examination of the phylogenies of data matrices comprised of genes with higher ABS or RTCA values showed that most relationships were consistent with those inferred from the original data matrix. The main difference was the placement of C. psociformis (Neuroptera: Coniopterygidae) ( Figure 4). In general, analyses of data matrices that used low stringency filters (e.g., ABS ≥ 40%) placed C. psociformis as the basal branch of Neuropterida with either Megaloptera being the sister group to Neuroptera (in aa data matrices) or to Raphidioptera (in nt data matrices). In contrast, analyses of data matrices that used high stringency filters (e.g., ABS ≥ 80%) placed C. psociformis as the basal branch of Neuroptera and recovered the order as monophyletic. In this topology, Megaloptera was recovered as the sister group to Neuroptera and the two orders together were the sister group to Raphidioptera ( Figure 4).

Heterogeneous Sequence Divergence Test
Recent phylogenetic studies of arthropods have increasingly shown that heterogeneous models could perform better than homogeneous models in resolving ancient relationships, which are often susceptible to systematic errors such as long branch attraction [30][31][32]. These studies have indicated that homogenous models are unable to accommodate the among-site or among-branch variations in evolutionary patterns such as rate, base composition, and substitution profile (e.g., [9,33]). To test whether there is such heterogeneity in the data and if heterogeneous models need to be used for the phylogenetic reconstruction, we next used the AliGROOVE [34] procedure to test the extent of sequence similarity and alignment ambiguity in pairwise sequence comparisons derived from the nt and aa data matrices. This analysis found strong heterogeneity in sequence divergence for both data matrices ( Figure 5). In particular, pairwise sequence comparisons of nt data yielded extremely low scores in almost all species, while pairwise sequence comparisons of aa data received relatively higher scores ( Figure 5).

Heterogeneous Sequence Divergence Test
Recent phylogenetic studies of arthropods have increasingly shown that heterogeneous models could perform better than homogeneous models in resolving ancient relationships, which are often susceptible to systematic errors such as long branch attraction [30][31][32]. These studies have indicated that homogenous models are unable to accommodate the among-site or among-branch variations in evolutionary patterns such as rate, base composition, and substitution profile (e.g., [9,33]). To test whether there is such heterogeneity in the data and if heterogeneous models need to be used for the phylogenetic reconstruction, we next used the AliGROOVE [34] procedure to test the extent of sequence similarity and alignment ambiguity in pairwise sequence comparisons derived from the nt and aa data matrices. This analysis found strong heterogeneity in sequence divergence for both data matrices ( Figure 5). In particular, pairwise sequence comparisons of nt data yielded extremely low scores in almost all species, while pairwise sequence comparisons of aa data received relatively higher scores ( Figure 5). . AliGROOVE analysis for nucleotide (NT) and amino acid (AA) sequences. The mean similarity score between sequences is represented by a colored square, based on AliGROOVE scores from -1, indicating great differences in rates from the remainder of the dataset, i.e., heterogeneity (red), to +1, indicating rates match all other comparisons (blue).

Phylogenetic Analysis Using Site-Heterogeneous Model
Analyses of the entire aa data matrix using the CAT-Poisson site-heterogeneous model in PhyloBayes v4.1c [35] and the LG+C60+F mixture model [36] in IQ-TREE [37] both recovered the same topology as the analysis of nt and aa data matrices that used high stringency filters ( Figure 6). Neuropterida was recovered to be monophyletic. Within Neuropterida, the sister-group relationship between Megaloptera and Neuroptera was recovered with high support (posterior probability (pp) = 0.98 and ultrafast bootstrap support (UFBS) = 100%). Megaloptera was recovered to be monophyletic with absolute support (pp = 1 and UFBS = 100%) and the two subfamilies Corydalinae and Chauliodinae, traditionally placed within the family Corydalidae, were grouped as monophyletic (pp = 1 and UFBS = 100%). Coniopterygidae was recovered as the basal branch in Neuroptera (pp = 0.98 and UFBS = 100%), Nevrorthidae as the sister group to Osmylidae (pp = 0.93 and UFBS = 100%), and Myrmeleontidae as the sister group to Chrysopidae (pp = 1 and UFBS = 100%). . AliGROOVE analysis for nucleotide (NT) and amino acid (AA) sequences. The mean similarity score between sequences is represented by a colored square, based on AliGROOVE scores from -1, indicating great differences in rates from the remainder of the dataset, i.e., heterogeneity (red), to +1, indicating rates match all other comparisons (blue).

Phylogenetic Analysis Using Site-Heterogeneous Model
Analyses of the entire aa data matrix using the CAT-Poisson site-heterogeneous model in PhyloBayes v4.1c [35] and the LG+C60+F mixture model [36] in IQ-TREE [37] both recovered the same topology as the analysis of nt and aa data matrices that used high stringency filters ( Figure 6). Neuropterida was recovered to be monophyletic. Within Neuropterida, the sister-group relationship between Megaloptera and Neuroptera was recovered with high support (posterior probability (pp) = 0.98 and ultrafast bootstrap support (UFBS) = 100%). Megaloptera was recovered to be monophyletic with absolute support (pp = 1 and UFBS = 100%) and the two subfamilies Corydalinae and Chauliodinae, traditionally placed within the family Corydalidae, were grouped as monophyletic (pp = 1 and UFBS = 100%). Coniopterygidae was recovered as the basal branch in Neuroptera (pp = 0.98 and UFBS = 100%), Nevrorthidae as the sister group to Osmylidae (pp = 0.93 and UFBS = 100%), and Myrmeleontidae as the sister group to Chrysopidae (pp = 1 and UFBS = 100%).

Discussion
The dramatically decreased cost of the whole-genome and transcriptome sequencing has facilitated the generation of genome-scale data from a wide variety of organisms. For insects, there are at least 138 whole genomes and 116 transcriptomes currently available [38]. These large datasets undoubtedly provide significant molecular evidence toward the understanding of the phylogeny

Discussion
The dramatically decreased cost of the whole-genome and transcriptome sequencing has facilitated the generation of genome-scale data from a wide variety of organisms. For insects, there are at least 138 whole genomes and 116 transcriptomes currently available [38]. These large datasets undoubtedly provide significant molecular evidence toward the understanding of the phylogeny and evolution of insects. However, figuring out how to properly use such large amounts of data to reconstruct the insect phylogeny is challenging.
By far, most published insect phylogenies based on genomic or transcriptomic data have been inferred using the concatenation approach on the entire data matrix, without filtering any orthologs that may lack phylogenetic signal [2,21,[39][40][41]. Analyses based on concatenation of all orthologous genes in a data matrix almost always results in absolute support values for most internodes of a phylogeny [2,42]. However, absolute support values do not necessarily indicate the reliability of a phylogeny [27,43]. Several case studies have shown that most individual gene trees in phylogenomic studies are topologically incongruent with each other and with the phylogeny supported from concatenation [23,27,29,40,42,44].
Incongruences are prevalent in the phylogenetic analyses and might be caused by both biological and analytical factors. Biological factors such as gene duplication and loss, recombination, natural selection, horizontal gene transfer, as well as incomplete lineage sorting (ILS) [45][46][47][48] can result in genuine differences between the evolutionary histories of genes and species, and some common solutions include careful gene selection (e.g., to avoid paralogy or horizontal gene transfer) and specialized phylogenetic approaches (e.g., coalescent methods for ILS). On the other hand, analytical factors such as stochastic error (e.g., insufficient taxon samples or sequence length) or systematic error (improper model assumptions) can introduce errors into the phylogenetic reconstruction, and might be potentially reduced by the increased sampling of genes and/or taxa, and with some data filtering approaches, such as using genes with high phylogenetic information content [40], slowly evolving genes [49], genes with stationary base composition [50], and so on.
In this study, we mainly investigated the impact of selecting genes that are highly informative or phylogenetic models that are more realistic on the reconstruction of Neuropterida phylogeny. Our results showed that the monophyly of Raphidioptera, Megaloptera, and Neuropterida (Raphidioptera + (Neuroptera + Megaloptera)) were consistently recovered as monophyletic clades, whereas the monophyly of Neuroptera was obtained only if genes with strong signals were analyzed or models that are more realistic were applied. It has been recently shown that phylogenomic data sets may contain genes that are highly informative but yet highly biased. In other words, some genes may have well supported phylogenies that are different from the underlying species tree, and they may bias the phylogenetic reconstruction under our gene selection criterion. Importantly, here the same topology was recovered by both data filtering and model selection, two independent strategies to improve phylogenetic inference, suggesting that our results were unlikely to be dominated by a few strongly biased genes. In addition, the monophyly of Neuropterida and each of the three orders are consistent with several recent phylogenetic studies based on the mitochondrial genome or transcriptome data [2,9,10,21,51].
Within Megaloptera, Corydalinae was recovered as the sister group of Chauliodinae through all the analytical methods, supporting the traditional monophyletic Corydalidae. Coniopterygidae was recovered as the basal branch in Neuroptera, which is consistent with Withycombe and Misof et al. [2,13], as well as Wang et al. [51]. Osmylidae was recovered as the sister group to Nevrorthidae, which is consistent with Winterton et al. [12], which was recovered as a sister group to the rest of Neuroptera with the exclusion of Coniopterygidae, Nevrorthidae, and Sisyridae based on the complete mitochondrial genome [51]. Besides, Gillung et al. [52] reported that NT data gave the better result because AA models were inadequate. However, the NT results with less filtering genes gave incomprehensible topology (Raphidioptera being sister group to Megaloptera), while the AA data gave the better one in this study. Furthermore, our results clearly show that using genes with stronger phylogenetic signals could significantly reduce the incongruence between different datasets as well as between different methods of phylogenetic inference.
Our study presented a comparison between the concatenated method and coalescent method using different datasets under the site-homogeneous and site-heterogeneous model in a transcriptome phylogenomic analysis of Neuropterida insects. Interestingly, analyses using genes with stronger phylogenetic signals under the site-homogeneous model from either concatenated or coalescence approaches and analysis of the AA data matrix under the site-heterogeneous model, yielded identical topologies, in which Neuroptera were recovered as monophyletic. In contrast, inclusion of genes with low phylogenetic signal under the site-homogeneous model in both concatenation and coalescence analyses yielded a paraphyletic Neuroptera. These results suggest that both selections of genes with strong phylogenetic signals as well as the use of more realistic models of sequence evolution are likely to be important in efforts to reconstruct a more accurate tree of insects. Meanwhile, in order to decrease the large computational resources and time, using genes with stronger phylogenetic signals may have a broader prospect as an efficient and accurate approach in the phylogenomic studies of insects.

Insect Samples and RNA Extraction
The C. similis specimen used in this experiment was collected from Daming Mount, Guangxi Province, China, on 12 May 2014. To obtain as many gene transcripts as possible, the whole body was sampled and frozen immediately in liquid nitrogen, and stored at −80 • C. Total RNA was extracted using the TRIzol reagent (Invitrogen, Carlsbad, CA, USA) following the manufacturer's protocol. RNA contamination and degradation were monitored on 1% agarose gels. Other quality parameters, such as purity, concentration, and integrity, were examined using the NanoPhotometer ® spectrophotometer (IMPLEN, CA, USA), the Qubit ® RNA Assay Kit in Qubit ® 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), and the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA).

cDNA Library Construction and Sequencing
Illumina sequencing was completed by Biomarker Technologies (Beijing, China), with the use of an Illumina HiSeq™ 2500. The first-strand cDNA was synthesized using random hexamer-primers from purified Poly (A) mRNA. Second-strand cDNA was synthesized using buffer, dNTPs, RNaseH and DNA polymerase I. Short fragments were purified using a QiaQuick PCR extraction kit. These fragments were washed with ethidium bromide (EB) buffer for end reparation poly (A) addition and then ligated to sequencing adapters. Suitable fragments, as judged by agarose gel electrophoresis, were selected for use as templates for PCR amplification. The cDNA library was sequenced on Illumina HiSeq™ 2500 using paired-end technology with a 101 base-pair long read in a single run.
Then, transcriptome de novo assembly was carried out with SOAPdenovo-Trans v1.03 (k-mer size = 31) [54], and 71,115 transcripts were obtained. The longest transcript of each locus was collected to generate 67,683 uni-genes as the final assembly. All raw transcriptome data have been deposited in the NIH Short Read Archive (SRA) with the accession number SAMN05525730.

Data Matrix Construction
We used the complete sets of annotated orthology data of 22 holometabolous transcriptomes (Table S1). The ingroup taxa include 14 species of Neuropterida, which represent three orders within the superorder and all families with available transcriptomes. The remaining 8 species represent 8 other orders of Holometabola and were selected as outgroups. Almost all the transcriptome data were downloaded from Transcriptome Shotgun Assembly (TSA) database of GeneBank (http://www.ncbi. nlm.nih.gov) with the accession number in the Table S1, except Chrysopa nipponensis and the newly sequenced C. similis. The transcriptome of C. nipponensis was downloaded from the Sequence Read Archive (SRA) of GeneBank. Both of these transcriptomes were assembled using SOAPdenovo-Trans v1.03 [54] since all other transcriptomes from TSA were assembled by SOAPdenovo-Trans [2,21]. Each transcriptome assembly was assessed for the copy number of 2675 pre-selected genes that were single-copy in 38 arthropod genomes in the OrthoDBv7 database using BUSCO v1.1b1 [29,55]. In total, 1392 genes were found to be present in more than 50 percent of the 22 species examined in this study, and their coding sequences and the respective translated amino acid sequences were retrieved to construct the phylogenomic data matrices. A series of different sub-datasets was constructed using custom Perl scripts. ABS and RTCA values were used to construct eight sub-datasets of 1392 orthologs: Five sub-datasets for comprising genes whose ML trees had ABS values across all internodes greater than or equal to 40% (1295 genes), 50% (1132 genes), 60% (834 genes), 70% (442 genes), or 80% (159 genes), and five sub-datasets comprising the 1295, 1132, 834, 442, or 159 genes whose ML trees had the highest RTCA values. For amino acids, we analyzed five sub-datasets comprising genes whose ML trees had ABS values across all internodes greater than or equal to 40% (1306 genes), 50% (1138 genes), 60% (863 genes), 70% (517 genes), 80% (218 genes), or 87% (72 genes), and five sub-datasets comprising the 1295, 1,132, 834, 442, 159, or 65 genes whose ML trees had the highest RTCA values.

Gene Alignment
We aligned all genes using the MAFFT software, v7.182 [56] based on their amino acid sequence, using E-INS-i (mafft-maxiterate 1000-reorder-genafpair). Then, we used PAL2NAL [57] to translate amino acid sequence alignments to codon sequence alignments, and the "gappyout" option of trimAl [58] to trim the amino acid sequence alignments. Trimmed segments of the amino acid sequence alignments were deleted from their corresponding codon sequence alignments using custom Perl scripts. Following trimming, our data matrix consisted of 1392 genes from 22 species.

Phylogenetic Inference Under Site-Homogeneous Model
For the codon sequence and amino acid alignments of each gene, the un-rooted phylogenetic tree under the optimality criterion of maximum likelihood (ML) was inferred using the RAxML, version 8.0.20 [59], under the GTRGAMMA (codon sequence) and PROTGAMMAAUTO (amino acids) model. The values of the nucleotide base/amino acid frequencies were fixed to "observed" and those of the substitution rate parameters estimated from the data. For the concatenation analysis, codon sequence and amino acid alignments from all genes were analyzed as a single super-matrix.
The un-rooted concatenation species phylogeny was inferred through a single ML search in RAxML v8.0.20 [59], with the values of the nucleotide base/amino acid frequencies fixed to "observed" and those of the substitution rate parameters estimated from the data. The concatenated file was partitioned based on every gene, and the model for every nucleotide sequence was GTRGAMMA and the model for every amino acid sequence was extracted from the single gene tree analysis. In all cases, robustness in inference was assessed via bootstrap resampling (100 replicates). Note that the RAxML software first infers the topologies for each of the bootstrap replicates and then searches for the best-scoring ML tree using every fifth bootstrap replicate tree as a starting tree.

Phylogenetic Inference Under Site-Heterogeneous Model
Analysis of the entire aa data matrix using the CAT-Poisson site-heterogeneous model (among site variation in stationary frequencies is modeled by a Dirichlet process and exchange rates among amino acids are assumed to be equal) was conducted in PhyloBayes v4.1c [35]. Four independent Markov Chain Monte Carlo (MCMC) chains were run in parallel for at least 2000 cycles until the convergence between the four chains were considered acceptable (the maxdiff parameter below 0.3). A consensus tree was obtained by discarding 25% of the samples as burn-in, and then sampling a tree every 10 cycles from the remaining samples. At the same time, we also analyzed the aa data matrix under the maximum-likelihood framework using the empirical site-heterogeneous model LG+C60+F [36] implemented in IQ-TREE v1.6.9 [37]. Here, the reliability of inferred relationships was assessed via ultrafast bootstrap approximation 2 plus a final nearest-neighbor-interchange based optimization (UFBoot2+NNI) [60] with 1000 replicates.

Conflicts of Interest:
The authors declare no competing financial interests.