Comparative Analysis of Complete Mitochondrial Genomes of Three Gerres Fishes (Perciformes: Gerreidae) and Primary Exploration of Their Evolution History

Mitochondrial genome is a powerful molecule marker to explore phylogenetic relationships and reveal molecular evolution in ichthyological studies. Gerres species play significant roles in marine fishery, but its evolution has received little attention. To date, only two Gerres mitochondrial genomes were reported. In the present study, three mitogenomes of Gerres (Gerres filamentosus, Gerres erythrourus, and Gerres decacanthus) were systemically investigated. The lengths of the mitogenome sequences were 16,673, 16,728, and 16,871 bp for G. filamentosus, G. erythrourus, and G. decacanthus, respectively. Most protein-coding genes (PCGs) were initiated with the typical ATG codon and terminated with the TAA codon, and the incomplete termination codon T/TA could be detected in the three species. The majority of AT-skew and GC-skew values of the 13 PCGs among the three species were negative, and the amplitude of the GC-skew was larger than the AT-skew. The genetic distance and Ka/Ks ratio analyses indicated 13 PCGs were suffering purifying selection and the selection pressures were different from certain deep-sea fishes, were which most likely due to the difference in their living environment. The phylogenetic tree was constructed by molecular method (Bayesian Inference (BI) and maximum Likelihood (ML)), providing further supplement to the scientific classification of fish. Three Gerres species were differentiated in late Cretaceous and early Paleogene, and their evolution might link with the geological events that could change their survival environment.


Introduction
Mitochondrial genomes (mitogenome) have become a powerful molecule marker to explore phylogenetic analysis and taxonomic diagnosis in ichthyological studies because of its simple genetic structure, maternal inheritance, fast evolutionary rate, high specificity and easy detection [1].
Cys, Tyr, Ser, Glu, and Pro) were encoded on the light strand (L-strand), and the others were located on the heavy strand (H-strand). The D-loop was located between tRNA-Pro and tRNA-Phe gene as well as other vertebrates (Figure 1) [11]. The gene structure and arrangement of these species were identical with other vertebrates mitogenomes, without gene rearrangement, which were always detected in fungus and insects which may be relate to the characteristics of species' life history [26,27].  The GC content of each gene varied from 30.43% to 60.56%, and the highest was found in tRNA-Trp of G. decacanthus. In addition, the richest AT content regions were detected in the D-loop of G. filamentosus, tRNA-Arg of G. erythrourus, and tRNA-His of G. decacanthus, respectively ( Table 1). The entire base composition of the three mitogenomes in H-strand was similar with AT content varied from 52.83% to 53.71% ( Table 2). The G content of the three species was low with an obvious bias against G. Besides, the content of A and C exhibited high values at the third codon position, indicating that the codon usage preferred A and C at this position ( Table 2). The content of T was the highest in the second codon position, which might be the reason of hydrophobic nature of the proteins [28].  The lengths of PCGs, tRNAs, rRNAs, and control regions for the three Gerres species and other Perciformes species were compared in Figure 2. The maximum length diversification was detected in D-loop, and its variation was regarded as the responsible for the differences of whole mitogenomes length [12]. Besides, the D-loop length in Gerres was longer than in other species, and the length of G. decacanthus was the highest resulted in its longest mitogenome. However, the primary sequences of D-loop seem to play minor roles in regulatory function, as the region reveals wide variability across species even their relationship were close [29]. There was a large length variation between G. oyena, G. filamentosus, G. erythrourus, and G. decacanthus, which belong to the same genus. The rapid variation of D-loop seems to provide some information for species revolution, but its internal mechanism needs more data and deeper examinations.

RNA Genes
There were 22 tRNA genes observed in mitogenomes of G. filamentosus, G. erythrourus, and G. decacanthus with 67-72 bp in lengths (Table 1). Both Leu (TAA, TAG) and Ser (GCT, TGA) were determined by two types of anticodons, while others were determined by only one type. Besides, 21 tRNAs displayed a canonical cloverleaf secondary structure, while tRNA-Ser (GCT) formed a simple loop missing dihydrouridine arm (D-arm). The structure of absence D-arm in tRNA-Ser has been treated as a common trait of fish mitogenomes, and it invariably transformed the recognition potential of tRNA-Ser [30]. Furthermore, the stem region of tRNA in the study contained lots of noncomplementary and T-G base pairs. In mitochondrial tRNA genes, stem mismatches seem to be a universal phenomenon and could be repaired through post-transcriptional editing [31]. The most reasonable explanation was that mitogenome was unaffected by the recombination process, and therefore allowed existence of base mismatches which might be helpful for eliminate deleterious mutations [32].
The small and large rRNA genes were recognized in the H-strand, ranging from 752 to 780 bp and 1960 to 1978 bp in length, respectively ( Table 1). The AT content of the rRNA genes were 53.22% in G. filamentosus, 52.79% in G. erythrourus, and 52.95% in G. decacanthus, respectively, which was slightly lower than other bony fishes [33]. Defining the boundaries of rRNA genes seems to be more difficult than the PCGs which had functional annotation features [34]. Therefore, the boundaries of the genes could be inferred by assuming that there was no overlap or gap between contiguous genes.

Intergenic Spacers and Overlapping
Spacers in animals' mitochondrial genes were short, and can be used for evolutionary studies due to quickly changing ratio than gene regions [35]. There were 13, 15, and 12 small intergenic spacers (IGS) region in G. filamentosus, G. erythrourus, and G. decacanthus totaling 87, 157, and 167 bp respectively identified ( Table 1). The length of each IGS sequences ranged from 1 to 40 bp, with the longest located between tRNA-Ile and tRNA-Gln on the L-strand in G. decacanthus. The number and the size of IGSs was one of the reasons for mitogenome length variation [26]. The size of IGS in G. erythrourus and G. decacanthus were larger than G. filamentosus, wherefore the complete mitogenomes length of the former were longer than the later. Besides, the IGS situated between tRNA-Asn and tRNA-Cys representing initiation signals for replication of L-strand (O L ) in the length of 35 bp to 37 bp. The O L was always found in the intergenic region between two conserved genes [36], and the three mitogenomes were discovered. The O L was usually observed between tRNA-Asn and tRNA-Cys in bony fishes, and the secondary structure that is folded into a stable stem-loop was the main feature of O L [37,38].
The mitogenomes also contained overlapping regions. Five, three, and four overlap sites totaling 23, 15, and 22 bp were found in G. filamentosus, G. erythrourus, and G. decacanthus, respectively (Table 1). In bony fishes, the overlapping of PCGs often meant that transcripts were partially shared between abutting regions, and most reading-frames overlaps were discovered in ATP8-ATP6, ATP6-COXIII, ND4L-ND4, and ND5-ND6. The gene overlapping discovered in the same region might suggest recent common ancestry and a putative genera-specific pattern [39]. Gene overlap was one reason for mitochondrial genome compact, and the smaller mitochondrial genomes pass to offspring more frequently than the larger ones [40]. However, selection was responsible for genomes size variation, relating to adapt new environment [38,40]. The same overlapping regions were detected in the Gerres species, indicating they might have their own mechanism to cope with the environment.

Protein Coding Genes (PCGs)
The 13 PCGs of G. filamentosus, G. erythrourus, and G. decacanthus were 11,429 to 11,430 bp in length, accounting for 67.74% to 68.56% of the whole mitochondrial genome ( Table 2). The 12 PCGs were encoded on the H-strand, only the ND6 were expressed on the L-strand in the mitogenomes. Three initiation codons (ATG, ATA, and GTG) were detected, and the ATG was the most common initiation codon in the mitochondrial genomes of three species. Except for the COXI, ATP6, and ND4 genes, all PCGs used ATG as initiation codon in the study (Table 1). Eight complete termination codons for G. filamentosus and six for G. erythrourus and G. decacanthus were detected (Table 1). And the incomplete termination codons (TA or T) were discovered. TA was detected in ND2, ATP6, COXIII, and ND5, while T was detected in COXII, ND3, ND4, and Cytb. These genes were followed by a gene encoded on the same strand that allowed transcription to terminate without complete codons [41]. The existence of incomplete termination codons was common in fish mitogenomes and could be accomplished by the addition of a poly A tail during RNA processing [42].
In DNA sequences, AT-skew and GC-skew was considered as a potential indicator to measure strand asymmetry and the patterns of nucleotide composition [43]. The majority of the AT-skew values and GC-skew values of the 13 PCGs among the three species were negative, demonstrating base T and C were more plentiful than A and G (Figure 3). In many cases, the amplitude of the GC-skew is larger than the AT-skew, and it is not statistically significant [11,44]. Here, the absolute value of GC-skew was indeed larger than AT-skew, which conformed to conventional preferences that GC-skew was more obvious. The lowest AT-skew and highest GC-skew value were all found in ND6, and it was the only gene displayed positive value in the GC-skew curve, which was consistent with other studies [33,43,44]. Nucleotide skew might be attributed to the equilibrium between mutation pressure and selection pressure during replication and transcription, providing a potential direction for gene replication [27,45]. ND6 had larger fluctuation in AT/GC-skew value, suggesting that the selection and mutational pressure on it might be significantly different from other genes.

Usage of Mitogenome Codon
The amino acid, Leu, had the highest value of codon usage, which was utilized by six different codons. Cys was the least used amino acids and were encoded by only two codons (Figure 4a). The using frequency of each amino acid in G. filamentosus, G. erythrourus, and G. decacanthus was relatively identical.
RSCU was also used to assess mitochondrial gene codon usage. When the RSCU value = 1, it indicated that the frequency of use of codons had no different with other degenerate codons; when the RSCU value >1, it represented the codon was used more frequently [46]. The RSCU value of all amino acids in the three species were not equal to 1, implying that the usage of each amino acid had varying degrees of bias (Figure 4b). The biases of codon usage were significant in the mitochondrial genomes of different species, and it made the gene under different selection pressure and could predict the gene function [47,48]. The identical RSCU values for each amino acid in three species suggested the gene function in family Gerreidae might similar. They displayed more quantity of NNA and NNC, echoing with the result of nucleotide composition analysis of third position that was preference A and C ( Table 2). Mutation pressure, genetic drift and natural selection were the main elements affecting codon bias [49]. In addition, GC contents at the third codon position, gene expression levels and gene length also were related to the codon bias [47,48]. The main evolutionary force led to high content of A + T or G + C was the mutation pressure in animals. Comparing with the low GC content gene, higher GC content of third codon position seemed easier methylated and caused mutations [50].

Variations, Genetic Distance, and Evolution Rates of PCGs
The genetic distance could be used to evaluate different mutation pressures among genes [51]. The pairwise genetic distances (p-distance) were calculated to reveal the sequence conservation and divergence of the PCGs among the Gerres species ( Figure 5). The genetic distance at the third nucleotide position was obviously higher than the first and second nucleotide position, indicating that the evolution of the third position was faster than the first and the second. The highest p-distance were found in ND1 (2.184, 1.610) and ND6 (2.286) at the third nucleotide of codons, while explored in ND2 (0.309, 0.286) and ND5 (0.243) base on the first and second nucleotide position ( Figure 5). The COXI-III and Cytb genes had low genetic distance in both first + second and third analysis. ND1, ND2, and ND6 genes might have high evolutionary rates among the three species, while COXI-III and Cytb were low. The value of nonsynonymous substitution (Ka)/synonymous substitution (Ks) is a common indicator to assess selective pressure and evolutionary relationships of species in molecular studies [52]. Ka/Ks < 1, Ka/Ks = 1, and Ka/Ks > 1 were represented purifying selection, neutral mutation and positive selection, respectively [53]. All 13 PCG genes were under strong purifying selection with Ka/Ks values below 1 ( Figure 5). The result was different from deep-sea fishes, where most genes exhibited positive selection or convergent/parallel signals with the exception of ND4L and ND5 [54]. One of the reasons might be the different living environment between them. The basic characteristics of genome evolution depended on random genetic drift and mutation pressure that closely connected with the environment [55]. The deep-sea fishes inhabited in the condition of oxygen deficiency, food lacked, no sunlight and extreme cold, while the Gerres species survived in the warm coastal waters [15,16,54]. Positive selection usually related to the adaptation of new environments and the development of new function, and most nonsynonymous mutations were disadvantage [56,57]. The Ka/Ks values in Gerres species showed they were under purifying selection, indicating that the environment variation was not great enough to change their genetic function.
ATP8 and ND2 genes had high Ka/Ks (mean: 0.15, 0.16) values across three Gerres mitogenomes comparing to other genes, while COXI and Cytb genes were low ( Figure 6). Low mutation rates tended to occur on highly expressed genes due to DNA repair mechanisms [58]. Compare to other genes, the COXI and Cytb showed low Ka/Ks representing a low mutation rate, indicating that they may have higher expression level.
The AliGROOVE analysis of 13 PCGs showed that there were no strongly divergent patterns among 26 species and all sites displayed positive scores. Besides, the Gerreidae exhibited higher heterogeneity than other families (Figure 7). The site score of nucleotides dataset was lower than the amino acid dataset, indicating that the degrees of heterogeneity of the PCG-NT datasets were higher than PCG-AA datasets. Generally, high divergences between different taxa suggested that species was not robustly placed or might be misplaced on phylogenetic trees [59]. From the PCG-NT or PCG-AA datasets, the heterogeneity of most pairwise comparisons was low with site score above 0.5. The low heterogeneity of pairwise comparisons demonstrated that the two datasets were applicable for further phylogenetic studies [60].   The obtained mean similarity score between sequences is represented by a colored square. The site scores are ranging from −1, indicating great difference in sequence composition (red coloring), to +1, indicating similarity to other sequence composition (blue coloring).

Phylogenetic and Divergence Times
The phylogenetic analysis contained 27 species from nine families (Lutjanidae, Haemulidae, Triodontidae, Pentacerotinae, Sinpercidae, Sciaenidae, Carangidae, Gerreidae, and Cyprinidae) (Figure 8). The phylogenetic trees inferred by two methods generated identical topologies and had formidable values of intermediate bootstrap and post probability. It showed that each family gathered together and separated with other families, while Gerridae species were clustered to be one branch without sister lineage. It was consistent with the traditional morphological classification, indicating that the morphological phenotypes of fish were closely related to genetic background [61]. The phylogenetic tree also revealed that the G. decacanthus had the closest relationship with G. oyena and the farthest with G. erythrourus in family Gerridae.
The present divergence time calculations showed that the initial of the Gerres species was separated around 104.5 million years ago (Mya) (Figure 9). In family Gerreidae, the differentiation time between G. erythrourus and other Gerres species was the earliest (70.01 Mya). However, the G. decacanthus and G.·oyena had latest differentiation time (50.18 Mya). There were 7.78 million gaps between G. erythrourus and G. filamentosus, and 12.05 between G. filamentosus and G. decacanthus. The accumulation of changes in genetic composition resulted species reproductive isolation and evolution, indicating evolution was a long-term process as shown in the earlier study [57]. And the change of genetic structure might relate to mutations, recombination, selection, drift, migration, and isolation. However, the occurrence of geological events was the main reason for migration and isolation [62,63]. The divergence time of Gerres species could trace back to the late Cretaceous . At that time, many mountains were formed, angiosperms began to appear and shale were extensively deposit to the ocean and the Oceanic Anoxic Event 2 occurred [64]. The geological events might change the habitats of fish and their genetic structure might induce differentiation. Thus, the Gerreidae family differentiated in the period. Besides, G. filamentosus differentiation from other species was around 62.23 Mya, and G. decacanthus and G. oyena were differentiated around 50.18 Mya. Cenozoic-Paleogene, approximately occurred at 2.4-65 Mya, a period with significantly shrank of transgressive range in the continent and appeared of marine sediments in the marginal areas of China [65]. Geographical isolation caused by geological movements might provide sufficient environmental conditions for divergence of fish, while high aquatic biological productivity caused by marine sedimentation could offer food sources for growth. In the present study, the divergence of most species concentrated on the Cenozoic-Paleogene, and should be closely related to these geological events.

Library Construction and High-Throughput Sequencing
High-quality DNA samples were randomly broken into fragments with the length of 350 bp for paired-end sequencing, and the DNA libraries were constructed according to the standard procedure of Illumina DNA library construction. The quality of the library was control by qPCR method and Agilent 2100 Bioanalyzer ( Agilent, California, USA). The quality-qualified DNA library was run on an Illumina HiSeq 4000 instrument (Illumina, California, USA) with paired-end reads of 150 bp, and the sequencing data of per sample was not less than 2 GB.

Sequence Assembly, Annotation, and Analysis
The original sequences obtained by Illumina HiSeq 4000 sequencing were filtered to get high-quality sequences, with the following principle: when the content of N in any sequencing reads exceeded 10% of the number of read bases or any sequencing reads containing low-quality (Q ≤ 5) bases exceeded 50%, the paired reads were removed. The obtained high-quality fragments were aligned with Gerridae mitochondrial genomes on NCBI to remove sequence repeats and inaccurate sequencing, and then assembled by SPAdes v.3.5.0 [66] software to obtain the complete circular mitochondrial genome. The preliminary annotation of mitochondrial genome were used MitoFish [24] (http://mitofish.aori.utokyo.ac.jp/) and ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/). The protein-coding regions and ribosomal genes were determined by align with the reported mitochondrial genomes of close related species base on the methods of Blastl and Blastn (https://blast.ncbi.nlm.nih.gov/Blast.cgi). The complete mitogenomes of G. filamentosus, G. erythrourus and G. decacanthus were uploaded to GenBank with accession number MG587039, MN075144, and MT023107, respectively.
Mitochondrial gene structure maps were drawn using CGView Server [67] (http://stothard. afns.ualberta.ca/cgview_server/). The secondary structure of tRNAs were obtained by ARWEN (Version1.2) [68], and verified again by tRNAscan-SE 2.0 [69] (http://lowelab.ucsc.edu/tRNAscan-SE/) if any tRNA structure was abnormal. Sequence length, nucleotide composition and codon usage were calculated via DNAStar. The following formulas were used to calculate the values of AT/GC-skew to assess the nucleotide bias: AT-skew = (A -T)/(A + T) and GC-skew = (G -C)/(G + C). The data of RSCU were received by MEGA 7 software [70]. Based on the RSCU values, a histogram of the distribution and visualization of codon usage were drawn by software of GraphPad Prism 8.1. The AliGROOVE [71] was used to assess the heterogeneities of sequence divergence, separately for different datasets.

Phylogenetic Analyses
To establish evolutionary relationships among G. filamentosus, G. erythrourus and G. decacanthus and the related species, the complete mitogenomes of other 24 Perciformes species were downloaded from GenBank. The phylogenetic tree was constructed using concatenated sequences of 13 PCGs. The MUSCLE v.3.8.31 software [72] was utilized to perform the alignment of individual genes between multiple species and excluded the start and stop codon. The Cyprinus carpio (GenBank accession number: KP993137) was used as the out-group to determine the root of phylogenetic tree [44]. The Bayesian Inference (BI) and maximum Likelihood (ML) methods were applied and the optimal model for nucleotide sequences was estimated by jModelTest2.1.7 [73]. Mtmam + I + G + F captured the minimum values of Akaike Information Criteria (AIC) and was considered to be the best model for phylogenetic tree construction. The ML tree was constructed by RAxML8.1.5 software [74] with 1000 replicates of bootstrap and the BI analysis was inferred by the software of MrBayes 3.2.6 based on 10,000,000 generations [75]. The divergence time was predicted by MEGA 7.0 with the RelTime-ML method and GTR + I + G modeling [70]. The calibration of divergence times were obtained from online Time Tree database (http://www.timetree.org/) [76].

Conclusions
In the present study, mitogenome sequnences of G. filamentosus, G. erythrourus, and G. decacanthus were obtained by high-throughput sequencing. Their mitogenomes were with a total length of 16,673 bp in G. filamentosus, 16,728 bp in G. erythrourus, and 16,871 bp in G. decacanthus, respectively. Each of the mitogenome composed of 13 PCGs, 2 rRNAs, 22 tRNAs and one D-loop. Most PCGs were initiated with the typical ATG codon and terminated with TAA codon. The ratio of Ka and Ks indicated that three species were suffering a purifying selection, while the COI and Cytb showed the highest Ka/Ks values. The three Gerres species were differentiated in late Cretaceous and early Paleogene, and their evolution might link with the geological events that could change their survive environment. The phylogenetic tree provided further supplement to the scientific classification of Gerres fishes. This study could provide basis information for genetic characters, phylogenetic position and evolution profile for these fishes, which could benefit for resource management or selective breeding in fishery and aquaculture.