The Mitochondrial Genome of a Freshwater Pelagic Amphipod Macrohectopus branickii Is among the Longest in Metazoa

There are more than 350 species of amphipods (Crustacea) in Lake Baikal, which have emerged predominantly through the course of endemic radiation. This group represents a remarkable model for studying various aspects of evolution, one of which is the evolution of mitochondrial (mt) genome architectures. We sequenced and assembled the mt genome of a pelagic Baikalian amphipod species Macrohectopus branickii. The mt genome is revealed to have an extraordinary length (42,256 bp), deviating significantly from the genomes of other amphipod species and the majority of animals. The mt genome of M. branickii has a unique gene order within amphipods, duplications of the four tRNA genes and Cox2, and a long non-coding region, that makes up about two thirds of the genome’s size. The extension of the mt genome was most likely caused by multiple duplications and inversions of regions harboring ribosomal RNA genes. In this study, we analyzed the patterns of mt genome length changes in amphipods and other animal phyla. Through a statistical analysis, we demonstrated that the variability in the mt genome length may be a characteristic of certain phyla and is primarily conferred by expansions of non-coding regions.


Introduction
Mitochondrial (mt) genome sequencing is a powerful tool used in many areas of modern biology, such as phylogenetics and phylogenomics; population genetics and molecular evolution; and studies of biodiversity, conservation, aging, and genetic diseases. At the same time, a mitochondrial genome is itself used for the investigation of fundamental molecular mechanisms governing its functionality and evolution [1,2].
Previous studies have demonstrated that most animal phyla have a relatively uniform mt genome length and gene content, thus establishing the concept of a "typical mt genome" in animals [3]. Indeed, the majority of animal mt genomes are single circular molecules of about 16 Kbp, with 13 protein-coding genes (PCG), that encode components of the electron transport chain and ATP synthesis, 22 tRNA genes, 2 ribosomal RNA genes, that provide a basis for the in-house protein synthesis machinery, and a control region, that maintain regulatory elements for replication and transcription [4]. However, deeper exploration of animal diversity, facilitated by advances in sequencing technologies, have shown that the mt genomes of many species deviate in terms of structure, length, gene content, and gene order from the archetypical animal mt genome [3,5,6]. slipped strand mispairing and errors in termination during mt genome replication [68,69]; however, more specific mechanisms, such as transposition, retrotransposition via an RNA intermediate, and recombination, were suggested to explain the proliferation of repeat elements in mt genomes and contribute to size expansion [68][69][70][71][72]. Purifying selection removes redundant genes and non-coding fragments, favoring compact mt genomes in most lineages [73,74], as shorter genomes are thought to have more effective transcription and replication [73], while excessive DNA is a target for deleterious mutations [67]. On the other hand, many researchers assume there to be no association between the excessive mt genome length conferred by selfish elements and negative organism fitness, which is corroborated by studies on Drosophila melanogaster laboratory lineages [75], freshwater sponges [21], and salamanders of Aneides spp. [23]. Recent findings have shown that such selfish elements may increase the replicative potential of certain mtDNA sequences and lead to the positive selection of such variants [66,67,76].
There is a growing body of evidence showing that short and long non-coding RNAs (ncRNA) transcribed inside known ORF (PCGs), ribosomal genes, intergenic regions, and pseudogenic sequences of nuclear and mt genomes participate in the regulation of different processes, such as protein translation, RNA methylation and splicing, mRNA degradation and silencing, etc. [77][78][79]. Translated short ORFs (30-60 bp) encode biologically active peptides that may also have regulatory functions in cells, and some of those peptides (humanin, SHLP 1-SHLP 6, MOTS-c) originate from the mt genome [78]. Data on the functionality of mt ncRNA and peptides obtained in human and murine cells models suggest the existence of similar regulatory units in the mt genomes of non-model species with non-canonical mt genes or inside their non-coding regions [5,80].
The sequencing and analysis of mt genomes with unusual lengths and peculiar architectures will help in studying the evolution of these features and provide direction for further researching the processes of maintenance and regulation in mitochondria.
Amphipods of the ancient Lake Baikal are a useful model for studying different aspects of mt genome evolution, as previous studies have shown many peculiarities in their mt DNA [22,81]. We found an unusual variability in mt genome lengths (from 14,370 to 18,114 bp) and gene orders within the currently sequenced mt genomes of ten representatives. Further analyses have revealed an unusually high number of tRNA genes that have undergone duplication and remolding (changes in tRNA gene identity through singular or multiple mutations in anticodon sequence) in the mt genomes of Baikalian species in comparison to those of amphipods from other habitats [22]. Out of more than 350 Baikalian amphipod species, M. branickii (Dyb.) is the only pelagic amphipod dweller [82,83]. This species inhabits the whole lake and is usually encountered at depths of more than 100-300 m. [84,85]; however, specimens are sometimes found at shallower depths and even at the water's edge [86,87], as the species performs diel vertical migrations from deep to shallow water layers [82,88]. M. branickii is an important component of the lake ecosystem; it is the main zooplankton predator, as well as an essential feeding component of pelagic fishes (gen. Comephorus, gen. Cottocomephorus, Coregonus autumnalis migratorius) [88].
In this study, we describe the mt genome of M. branickii, detail its extraordinary length, and discuss the mechanisms of this extension. In the context of this finding, we analyze the mt genome length distributions in different phyla of invertebrate animals to reveal how frequently variations in mt genome length occur and try to find common and distinctive features in their architectures.

Sampling, DNA Sequencing and Assembly
M. branickii samples were collected in 2015 at the south basin of Lake Baikal near the estuary of Harauz river (52 • 0 24" N, 105 • 59 04" E) at a depth of 0-70 m. Amphipods were attracted during the nighttime by the artificial light of the research vessel "G. Titov" and collected using a Juday plankton net. Total DNA was extracted from separate individuals using the modified CTAB method [89].
The genomic sequencing of a single individual of M. branickii was performed at the Faculty of Bioengineering and Bioinformatics of Lomonosov Moscow State University with an Illumina HiSeq 4000 system. A total of 41 million 150 bp paired-end reads were generated. The reads were cleaned with Trimmomatic [90] to remove sequencing adapters and assembled with SPAdes [91] using k-mer sizes of 21, 33, 55, 77, and 99. Mitochondrial sequences were detected in the assembly by BLAST [92] searches with the mtDNA-encoded protein sequences of amphipods. Fragmented mt contigs were extended by iteratively aligning read pairs to the ends of contigs. We used BLAST searches with the ends of contigs to find corresponding reads in the sequencing library and then aligned read pairs to the contig, thus extending its sequence. The sequences were extended until sufficient overlaps with other mt contigs were available. The contigs were then merged using overlapping sequences while manually resolving cases of inverted repeat structures.

Mt Genome Sequence Verification and Annotation
To validate the mt genome assembly obtained from total genomic reads, we additionally performed an assembly with transcriptomic data of M. branickii acquired from the Sequence Read Archive SRR3467077 [93]. Assembly was performed with SPAdes in single-cell mode (-sc) using k-mer sizes of 55 and 77. Mt contigs from both assemblies were aligned and inspected manually using BioEdit [94].
The merges and long inverted repeat regions in the assemblies were additionally verified by PCR and Sanger sequencing and by mapping reads to the complete mitochondrial sequence. The areas containing prominent repeats were amplified in two fragments of about 5 Kbp and 2 Kbp, and then sequenced using the primer walking method. Read mappings were performed using Bowtie2 [95] with the genomic and transcriptomic sequencing libraries. The mappings were inspected in Tablet [96] and visualized as circular diagrams using Circos [97]. The coverage statistics for genes were obtained from the read mappings using the BEDTools [98] genomecov utility. Duplicated reads were excluded from the mapping using Picard Tools (http://broadinstitute.github.io/picard) (accessed on 20 January 2021). Histograms of coverage were built from the sequence mappings with coverage values calculated using the rolling average in a 50 bp sequence window.
The mt genome sequence of M. branickii was annotated using the MITOS pipeline [99]. The prediction of the tRNA genes was performed with MiTFi [13] using both the default metazoan covariance models and the amphipod-specific models developed in our previous study [22]. The secondary structure visualization of tRNAs was carried out using the forna package [100]. The M. branickii mt genome map was also visualized using the OGDRAW program [101]. PCGs and ribosomal gene boundaries were manually corrected using sequence alignments with genes from the previously published amphipod mt genomes.

Structural Analyses of M. branickii Mt Genome and Phylogenetic Inference of Amphipods
Basic statistics for the nucleotide content of the newly sequenced mt genome were calculated using BioEdit [94].
Direct and inverted repeats in the mt genome sequence of M. branickii and four other Baikalian amphipods with length >17 Kbp (Acanthogammarus victorii, Brachyuropus grewingkii, Garjajewia cabanisii, Gmelinoides fasciatus) were found using NUCmer (-l 10 -maxmatch -nosimplify) and visualized using the Mummerplot of the MUMmer3.23 package [102]. To define the content of non-coding sequences of mt genomes from listed Baikalian species, we annotated the ORFs in these regions using the online version of the ORF finder integrated with the NCBI database (https://www.ncbi.nlm.nih.gov/orffinder/, accessed on 7 March 2021). ORFs with a minimal length of 30 nt. were found using the invertebrate mt genetic code and translated amino acid sequences were used for carrying out blastp searches in the online version of BLAST with using default settings. BlastN search with the non-coding regions was conducted with standalone BLAST (v.2.6.0.) using previously published amphipod mt genomes as queries (Eulimnogammarus vittatus KM287572, Pallaseopsis kesslerii KX341968, Gammarus duebeni JN704067, Metacrangonyx repens HE860495, Caprella mutica GU130250, Parhyale hawaiiensis MH542432). The analysis was conducted using the following settings: -word_size 9 -gapopen 2 -gapextend 1 -reward 1 -penalty -1 -evalue 0.001.
To infer the taxonomic position of M. branickii within other amphipods, we built a Maximum likelihood phylogenetic tree using IQ-TREE v.1.6.9. [103] based on the concatenated alignments of the amino acid sequences of 13 mt PCGs, including the newly sequenced species, ten other Baikalian species, and some non-Baikalian amphipods. Individual mt PCG sets and deduced amino acid sequences were aligned using Mafft [104] implemented in the local version of the TranslatorX program [105]. The substitution model mtMet+F+R10 was selected for the amino acid dataset using ModelFinder [106] implemented in IQ-TREE [103]. The SH-aLRT test and ultrafast bootstrap with 3000 replicates were used to assess node support values [107,108]. The resultant tree was rooted with the outgroup species and visualized in FigTree v.1.4.3 [109].

Statistical Analysis of Mt Genome Sequences from RefSeq
A dataset of animal mt genomes was acquired from the RefSeq database (entries released before 1 January 2020) and processed to extract the mt genome lengths. The dataset included data from species with mt genomes organized as a singular "chromosome". The lengths of the coding and non-coding parts of every mt genome were counted based on annotation data using custom R scripts. Animal mt genomes were separated into groups according to animal phyla from the Taxonomy Browser of the NCBI database. Phyla maintaining three or more species (Annelida, Arthropoda, Brachiopoda, Bryozoa, Chaetognatha, Chordata, Cnidaria, Ctenophora, Echinodermata, Hemichordata, Kinorhyncha, Mollusca, Nematoda, Nemertea, Onychophora, Placozoa, Platyhelminthes, Porifera, Sipuncula, Tardigrada, Xenacoelomorpha) were selected for further analysis.
Distributions of the mt genome length characteristics (length of the entire mt genome, length of the coding part, length of the non-coding part, and the ratio of lengths of noncoding part to coding part) of every animal phylum were visualized as boxplots. The outliers were defined as the values of the elements that were less from the first quantile of the distribution (downward outlier) and more than the third quantile of the distribution (upward outlier) according to a three-fold interquartile range (IQR). We used a 3-fold IQR to select a relatively small number of mt genomes with very diverse lengths for further analyses.
To test if the number of sequences in a phylum (sample size) significantly affects the mt genome length variability, we used regression analysis. The measures of the mt genome length variability in every phylum were the aforementioned characteristics (length of the entire mt genome, length of the coding part, length of the non-coding part, and the ratio of lengths of the non-coding part to lengths of the coding part) and the proportion of outliers. Additionally, a regression analysis was used to assess the contribution of the coding and non-coding regions to the length of the mt genomes by calculating the dependencies of the coding region lengths, non-coding region lengths, and their ratio on the length of the entire mt genome. We carried out all types of regression analysis with the next regression models: linear regression, second-degree polynomial, third-degree polynomial, exponential dependence, power dependence, and logarithmic dependence. The best regression model was chosen according to the Bayesian information criterion (BIC) (minimal BIC value for the best model) [110]. For the best regression model chosen, we calculated the R 2 covariance coefficient and estimated its reliability using the F-test (F). We assumed that the regression model was trustworthy with a p-value threshold of 0.05. The regression analysis and visualization were conducted with the standard function set of the R programming language according to the guidelines detailed in Reference [110].
The nonparametric version of ANOVA was used to examine the dependence of the genome length characteristics. The calculation of the ANOVA p-value was carried out using the permutation test [111] in the «lmPerm» package for the R programming language. ANOVA was used to test the dependencies of the entire mt genome lengths of all available animal species (9127 species) for three subsets of the data: (i) a set of mt genome lengths within general distributions (excluding values from outliers); (ii) a set of mt genome lengths from downward outliers; (iii) a set of mt genome lengths from upward outliers. The analysis shows if there is a significant difference between the lengths of the mt genomes belonging to the outlier categories in every phylum and the lengths of the rest of the mt genomes of this phylum.

Phylogenetic Analysis and Analysis of Repeats in Selected Mt Genome Sequences from RefSeq
Species whose non-coding mt genome region sizes were identified as outliers were selected for phylogenetic analysis. The datasets included species of interest with long mt genomes and all other species from the same taxa available in the RefSeq mt genomes. The set of taxa for the analysis in every case was chosen for balanced taxonomic sampling. We did not analyze the mt genome sequences of the phylum Chordata in this study because of their high number and the prevalence of small mt genomes without extensive non-coding regions in this group.
Substitution saturation tests were performed for the 1st + 2nd codon positions of each mt PCG in every sequence set using DAMBE v.7.2.43 [112]. Amino acid sequences from genes without saturation were concatenated and used for phylogenetic inference. The taxa used in the analysis, selected gene/protein sets, and the substitution model used in every set are summarized in Table S1. Phylogenetic trees were built and visualized using the aforementioned software.
The repeats in mt DNA sequences from the outlier set were examined and visualized using the pairwise alignment utility of the online version of BLAST.

M. branickii Mt Genome Assemblies and Features
Similarity searches with BLAST identified six mitochondrial contigs in the SPAdes assembly with the genomic sequencing data and another six contigs in the assembly with the transcriptomic data, ranging in size from 242 bp to 22 Kbp. Alignments and repeat resolution for mitochondrial contigs from each assembly resulted in a 42 Kbp circular sequence. The assembled sequence had two inverted repeat regions of 600 bp and 1.5 Kbp. These repeat regions were additionally verified by Sanger sequencing, generating six contigs with a total length of 6350 bp ( Figure 1, File S1). For the amplification and Sanger sequencing, we used a DNA sample of the same amphipod individual that was previously used for the total genomic Illumina sequencing. The new sequencing data led to slight correction of the assembly, which turned into a final version of the M. branickii mt genome, spanning a total of 42,256 bp (GenBank accession MT047459). The average read depth of the genomic reads for the genome assembly was estimated to be around 120 reads per nucleotide ( Figure 2, Table S2). The transcriptome read coverage was far less equal and was minimal in the non-coding part of the genome ( Figure 3, Table S2).    (Table S2). Genes encoded by the positive strand are shown in the outside ring, while genes encoded by the negative chain are shown in the inner ring. tRNA genes are labeled by their single-letter amino acid code. X is a tRNA pseudogene with a CCCC sequence in its anticodon.
The AT content of the total mt genome sequence is 59.20%. The AT-skew and GC-skew counted for the positive strand (coding the biggest portion of genes) of the entire mt genome of M. branickii were −0.0034 and −0.047, respectively, indicating a slight prevalence of pyrimidine over purine bases. The mt genome of M. branickii encodes 13 PCG, 2 ribosomal RNA genes, and 26 tRNA genes. All PCGs and the majority of the tRNA genes are grouped in a cluster that encompasses about 40% of the total mt genome length, while the ribosomal RNA gene cluster is separated from the PCGs by long non-coding regions spanning 16 Kbp and 7.5 Kbp (Figures 2 and 3). The genes are distributed between two strands of the mtDNA: the rRNA genes and five PCGs are encoded on one (negative) strand, while the other eight PCGs are encoded by the opposite (positive) strand.
A duplicated fragment (559 bp) of the Cox2 is located near the original Cox2 (664 bp) in a reverse orientation, constituting a prominent inverted repeat unique to the coding cluster of mtDNA. The 559 bp copy is identical to the original gene but lacks the first 105 bp and, thus, is annotated as "Cox2 fragment". Both copies have adjacent trnK(uuu) and trnD(guc) near their 3 ends, which implies that the Cox2-trnK(uuu)-trnD(guc) region is duplicated as a single unit.
The mt genome annotation in MITOS and a further BLAST search (Table S3) revealed partial copies of ribosomal RNA genes in the large non-coding segments between Nad2 and Nad6 (Figure 3). We defined the location of the true functional rrnL and rrnS based on the gene sequence integrity and the coverage values by transcriptomic reads (Figure 3, Table S2).
In the mt genome of M. branickii, we found additional tRNA gene copies along with the standard tRNA gene set. The Metazoan covariance models and amphipod-specific models predicted 26 and 28 tRNA genes, respectively (Table S4). One of the additional findings with the amphipod-specific models was trnL2(uaa) located in the region from 14,623 to 14,679 bp, which was ruled to be false positive due to its marginal bitscore (21.49) and e-value (6.79 × 10 −4 ). The second finding was a tRNA gene located between 34,609 and 34,669 bp. This tRNA gene was identified using a model for methionine tRNA with an e-value of 5.89 × 10 −7 , but as its anticodon loop contained eight nucleotides with the CCCC sequence in the anticodon, we ruled this finding as a trnM(cau)-derived pseudogene and annotated it as trnX(cccc). The M. branickii mt genome has two identical copies of trnV(uac), two copies of trnM(cau) with an 88.5% identity, two copies of trnK(uuu) with 96.6% identity, and two copies of trnD(guc) with 90.0% identity. Each duplicated copy is located on the opposite strand. Their secondary structures were not impaired ( Figure S1) and the transcriptome read coverage was comparable with the values of other singular tRNA genes (Table S2). In the mt genome sequence of M. branickii, we found direct and inverted repeats ranging from 39 to 1632 bp which cover about 20 Kbp of the whole sequence. All repeat pairs were located in the large non-coding sequence between Nad2 and Nad6, except for the Cox2 duplication ( Figure 1). Repeat searches in other Baikalian amphipods did not show such massive repeat expansions, even considering the differences in lengths of the non-coding parts ( Figure S2).

Non-Coding Regions of the Large Mt Genomes of Baikalian Amphipods
The amino acid sequences translated from 633 ORFs predicted in the large non-coding sequences of the mt genome of M. branickii between Cox1 and Cox2, Nad2 and Nad6, and Nad6 and Cox1 did not produce any hits in Blastp searches against the nr/UniProtKB/SwissProt/ refseq_protein databases. A similar BLAST search for translated ORFs from the non-coding parts of other mt genomes under consideration did not reveal homology with any protein either. Data on the ORF findings in the mt genomes of Baikalian amphipods are summarized in Table S5.
BLASTn searches revealed numerous copies of ribosomal RNA genes in the noncoding regions of the M. branickii mt genome ( Figure 1) and three Atp8 gene fragments of 81-83 bp predicted with a marginal e-value of 8 × 10 −4 (Table S3). Short fragments of the Cox2 (132 bp), Atp8 (58 bp), and Nad2 (64 bp) were detected with marginal e-values (from 1 × 10 −3 to 6.5 × 10 −4 ) in the control region of B. grewingkii and may constitute either degenerated gene copies or false-positive predictions. A truncated copy of the CytB of 405 bp was found in a non-coding region of G. cabanisii near the full-length CytB, indicating an event of duplication and subsequent degeneration. Additionally, small portions of the rrnL (100 bp), Nad4L (70 bp), and Nad1 (115 bp) were detected in the non-coding regions between rrnS and Nad2. Truncated copies of the Atp6 (135 bp) and Nad4L (197 bp) are found in a control region of G. fasciatus. No additional gene fragments were found in the non-coding regions of the A. victorii mt genome (Table S3).

Mt Gene Order of M. branickii and Its Phylogenetic Position within Baikalian Amphipods
A Maximum likelihood phylogenetic tree based on the concatenated alignments of amino acid sequences of mt PCGs placed M. branickii inside a well-supported clade comprising one of the two lineages of Baikalian amphipods (Figure 4). This lineage combines small species that mainly inhabit shallow water and have a tolerance to high temperatures [83,113,114]. The placement of M. branickii in this amphipod lineage corroborates the results of previous studies based on nuclear molecular markers and the analysis of morphological features [93,115]. The mt genome of M. branickii has an unusual gene order that differs from patterns seen in the majority of sequenced Baikalian amphipods or the closest non-Baikalian species of gen. Gammarus. The gene order of the M. branickii mt genome mostly resembles the one of Crypturopus tuberculatus, the nearest species it clusters with. It is worth noting that the mt genomes of both C. tuberculatus and M. branickii have pseudo tRNA genes that originate from trnM(cau) duplications. The presence of additional tRNA genes found in the M. branickii mt genome may be regarded as a common feature observed in the majority of currently sequenced genomes of Baikalian amphipods [22].

Statistical Analysis Reveals Genome Length Modes in Invertebrate Phyla
Statistical analysis was performed for 9127 sequences of complete mt genomes of animals from the RefSeq database submitted before 1 January 2020 (accessed on 20 October 2020) (Table S6).
The distributions of total mt genome lengths and their coding and non-coding lengths have distinct peaks, indicating that the majority of the total mt genome lengths and their constituents vary within narrow ranges ( Figure 5). At the same time, the long distribution tails indicate that a small number of values deviate significantly from the general averages in every group.  (Table S7) and used in a regression analysis. The RefSeq numbers of mt genomes that fall into the outliers category are shown in Table S8. A regression analysis showed no dependence of the mt genome length characteristics on the number of sequences in a phylum (p-value from 0.28 to 0.70) (Table S9), as well as no dependence of the proportion of outliers in distributions on the number of sequences in a phylum (p-value from 0.39 to 0.87) (Table S10).
At the same time, we found a significant dependence of genome lengths on the phylum itself (p-value = 2 × 10 − 16, R2 from 0.963 to 0.999) using ANOVA (Table S11). A regression analysis revealed that the length of non-coding parts of mt genomes contributes more significantly (R2 = 0.592) to the entire mt genome length than the length of coding parts (R2 = 0.480) ( Table S12). The longer the entire mt genome is, the larger the portion of its non-coding part will be ( Figure 7, Table S12). We also defined using ANOVA that the contribution to the variability of the entire mt genome length is more significant in mt genomes from the upward outliers category (F value = 2946) than in mt genomes from the downward outliers category (F value = 36.97) (Table S13). Figure S3 shows that the variability of the mt genome lengths of upward outliers is mainly determined by the variability of the lengths of the non-coding parts of the mt genomes (Table S13).
Thus, a statistical analysis and mt genome length distribution visualization showed that the length of the animal mt genomes, as well as the longest ones in different phyla (upward outliers), is mainly determined by the lengths of the non-coding parts, while the contribution of the coding part is much less.

Phylogenetic Analysis and Repeat Pattern Analysis of the Long Mt Genomes Sequences of Invertebrates
Mt genomes whose non-coding region lengths fall into the outliers category (the long mt genomes) were selected for phylogenetic analysis and repeat pattern examination. Phylogenetic trees were constructed using translated mt genome PCG sequences and included relatives from different taxa (Table S1) for the comparison of the mt genome lengths. We built nine phylogenetic trees encompassing the majority of outlier cases (Figures 8 and S4-S11). Repeat patterns in long mt genomes are visualized in Figure S12.
We found a significant variability in the lengths of non-coding regions on the level of the large taxa and smaller taxa in particular, for instance, in Polychaeta and Amphipoda (phylum Arthropoda) species with long mt genomes cluster with species with smaller mt genomes. The topology of the Amphipoda tree (Figure 8) shows that a significant increase in the non-coding genome length occured in the lineage of M. branickii. Some groups of closely related species, such as representatives of superfamily Nephropoidea, subclass Copepoda, or species from gen. Bombus from order Hymenoptera (phylum Arthropoda), show a significant variability of mt non-coding region lengths ( Figures S5 and S6). On the other hand, some groups, such as freshwater sponges (phylum Porifera), consistently maintain long non-coding regions in their mt genomes ( Figure S11). Other examples of such groups include the Bivalvia mollusks of the Arcoidea superfamily (represented by Cucullaea labiata, Scapharca broughtonii, Anadara sativa, Tegillarca granosa) and nematodes of gen. Meloigogyne, clade of gen. Trachelus, and gen. Cephus in Hymenoptera (phylum Arthropoda) (Figures S5, S7, and S9). Species whose non-coding region lengths fall into the category "outlier" are marked in blue. Yellow and blue rectangles show Baikalian amphipod species of the first and second lineages, respectively. Long mt genome sequences showed different patterns of repeats ( Figure S12). The mt DNA sequences of some species from the phyla Annelida, Mollusca, and Nematodes possessed direct repeats with variable copy counts and lengths in the non-coding regions. The patterns with relatively long inverted repeats seen in M. branickii were rare and were found in only three Nematode species: Hexamermis agrotis, Romanomermis culicivorax, and Romanomermis iyengari. Very short tandem and inverted repeats constitute low-complexity regions in the non-coding regions of insects from order Hymenoptera, family Curculionidae, and species of gen. Meloidogyne from phylum Nematoda suborder Tylenchina. In all Porifera species with long non-coding regions, mt genome sequences have very short repeats covering the whole sequences almost evenly ( Figure S12) [29,30]. The mt genomes of some species, such as Longpotamon kenliense, Vespa affinis (phylum Arthropoda), Meloidogyne graminicola (phylum Nematoda), and Isodiametra pulchra (phylum Xenacoelomorpha), did not show any repetitive sequences or had just a few very short repeats ( Figure S12).

The Unusual Architecture of Mt Genome of M. branickii and Its Potential Usefulness for Studies of Mt Genome Transcription and its Regulation
A newly sequenced mt genome of a pelagic amphipod M. branickii from Lake Baikal has an unusually large length of 42,256 bp, making this the largest length seen within Amphipods and one of the largest seen within all animals. Unusual gene orders and contents in comparison to other sequenced amphipod species were shown in a studied mt genome. In particular, there were duplications of four tRNA genes (trnM(cau), trnV(uac), trnK(uuu), trnD(guc)), a trnM(cau)-derived pseudogene, and a partial copy of the Cox2. It is worth noting that all duplicated genes (tRNA genes and Cox2) are located on the opposite strand relative to their copy, which indicates that inversion events happened along with duplication. The full-length Cox2 is located on the negative strand of the mt genome, which has never been observed in amphipods before and is not typical for the majority of mt genomes of Arthropods.
Additional tRNA genes and tRNA-like structures are sometimes found in mt genomes of different species, as well as in long mt genomes of some mollusks [20,46,116] and sponges [30]. Full-length and non-degraded copies of PCGs and ribosomal RNA genes are far less frequently identified in mt genomes than copies of tRNA genes; however, there are several examples of these cases. For instance, duplicated Cox2 were found in the mt genomes of mollusks of the species Ruditapes philippinarum (in F-type) [63,65]; Musculista senhousia [16] and Venustaconcha ellipsiformis [117] (in M-type); and Chaetoderma nitidulum [118], Anadara crebricostata, Scapharca inaequivalvis, Scapharca kagoshimensis, and Tegillarca sp. [19]. Authors have suggested that there may be a different functional status of additional gene copies, such as an ongoing pseudogenization [65], neofunctionalization [116,119], or concerted evolution of the copies [17]. The partial duplication of the Cox2 in the mt genome of M. branickii happened recently, as both 559 bp fragments were identical. At the same time, the adjacent trnK(uuu) and trnD(guc) had substitutions with their counterparts, as well as with other duplicated tRNA genes in the studied mt genome. The unusually high length of the non-coding region, multiple tRNA gene copies, and duplication of Cox2 cause the mt genome of M. branickii to mostly resemble the long mt genomes of Bivalvia mollusks of the Arcidae family [19,20,60]. One of the possible reasons for the retention of gene copies and long non-coding regions in genomes is their potential functionality. For example, Li and colleagues proposed that the Nad5-derived non-coding fragment in the Caenorhabditis briggsae mt genome upregulates the transcription of the neighboring Nad3 [120]. Transcribed copies of pseudogenes from non-coding regions of the nuclear and mt genomes were shown to participate in numerous processes of transcriptional and post-transcriptional gene regulation [121,122]. Moreover, proteomic studies confirmed the existence of small ORF-encoded peptides transcribed from non-coding regions or within PCGs or ribosomal RNA genes [123], affecting metabolism, development, DNA reparation, transcription, etc. [42,77]. tRNA genes, tRNA-like structures, and rRNA-derived fragments have also been shown to have a wide spectrum of functions beyond the mediation of translation [124,125]. However, most studies of such putative regulatory transcripts and peptides are devoted to nuclear-derived elements [77,78,121], and the regulatory potential of mt genome coding structures is yet to be assessed. Additionally, a detailed transcription pattern of the mt genome has been described in a handful of model organisms [123]. Thus, an unusual mt genome of M. branickii may be a useful model for studying the pattern of transcription and regulation of mt gene expression.

Features of the Long Mt Genomes in Invertebrates and Putative Mechanisms of Mt Genome Lengthening
About 65% of the length of the mt genome of M. branickii was annotated as non-coding intergenic regions. The two largest non-coding areas, interrupted by tRNA genes, are located between Nad2 and rrnL and between rrnS and Nad6. A BLAST search revealed vestiges of ribosomal RNA genes inside these two regions, which nevertheless occupy only a minor part of the entire length of the non-coding area. Most of the two aforementioned non-coding regions consists of relatively long direct and inverted repeats. Such features of the non-coding region indicate that the mechanism for the extension of the mt genome of M. branickii involves multiple duplications and inversions of regions harboring ribosomal RNA genes, with the subsequent degradation of redundant gene copies. Phylogenetic analysis based on available complete mt genome sequences of amphipods shows that a significant mt genome length is a unique feature of M. branickii. It is also worth noting that the relatively long mt genomes of Baikalian amphipod species (>17 Kbp) do not display the repeat patterns observed in the sequence of M. branickii, i.e., there is no proliferation of ribosomal RNA genes. Thus, M. branickii has a unique pattern of mt genome extension among amphipods. It is worth mentioning that M. branickii, uniquely among Baikalian amphipods, has a pelagic lifestyle; however, it is not clear whether the length of the mt genome is specifically associated with this lifestyle. One of the possible approaches to studying this issue would be to analyze of the effective population size and other characteristics of populations of this species in Lake Baikal using mt genes as molecular markers. It would also be useful to assess the variability of mt genome lengths, as well as the integrity of duplicated genes in the mt genome of M. branickii, on the population level.
The statistical analysis showed that animal mt genome lengths and the presence of mt genomes with prominent expansions in sequence length (upward outliers in length distributions) depend on the phylum, but not on the number of sequences in the phylum, which refutes sample size bias as a significant factor in the assessment. Thus, a significant bias in the number of available mt genomes within different animal phyla may not be taken into consideration in studies of mt genome length patterns. It was also shown that the overall mt genome length mainly depends on the variability of their non-coding region. Although the latter conclusion was previously made by different authors based on the analysis of separate species and taxa [19,32,126], we confirmed that this is a common rule for animal mt genomes in general (Figure 7, Table S12).
Species with especially long mt genomes, including M. branickii, were included in the phylogenetic analysis. We detected different ranges for the length variability of the noncoding regions in mt genomes of different taxa. In some lineages (for example, superfamily Nephropoidea, subclass Copepoda, species from gen. Bombus from order Hymenoptera) ( Figures S5 and S6), non-coding regions differ considerably, while other lineages maintain relatively short or unusually long non-coding regions for long evolutionary periods. Species from the latter group may be useful as models for studying the ecological reasons for and mechanisms of mt genome length extension and/or maintenance.
For instance, some species of Bivalvia mollusks from the family Arcidae have long mt genomes ( Figure S7) [19]. Authors have shown that the time of divergence in a clade of gen. Scapharca, which includes species with the largest mt genome sizes , was about 61 My and proposed that the low metabolic rate seen in these bivalves is associated with weakened purifying selection against long non-coding regions [19]. Another interesting group is freshwater sponges of the order Spongillida ( Figure S11) [21,30,63,127], which split from marine sponges at about 18 My [128]. It is worth noting that within this group there are three endemic sponges from Lake Baikal (Lubomirskia baikalensis, Rezinkovia echinata, Baikalospongia intermedia morpha profundalis) [21,63]. Baikalian sponges are another example of Baikalian invertebrate species with long mt genomes. Among Baikalian invertebrates with sequenced mt genomes, there are also four endemic mollusk species from the family Baicaliidae that have mt genomes with lengths in the range of 15,127 to 15,224 bp. These mt genomes have a uniform gene order and very short non-coding regions [129]. Considering that Baicaliidae mollusks and Baikalian sponges are estimated to have comparable divergence times from their respective last common ancestors [128][129][130][131], further studies of the life histories of these species might hold clues for the reasons behind the variability of their mt genome lengths.
The majority of examined long mt genomes possess one or two long non-coding regions and a relatively compact cluster of PCGs and ribosomal genes. Such mt genome organization is frequently seen in mollusks, nematodes, insects, crustaceans species, etc. [46,[132][133][134]. These non-coding regions often (but not always) contain repeat sequences ( Figure S12). Relatively long direct repeats were identified in the mt genomes of mollusks, nematodes, and annelid species ( Figure S12) [135][136][137][138]. Vestiges of different mt genes inside non-coding regions illuminate what parts of the genome were duplicated and degenerated, leading to mt genome length changes [46,139,140]. The non-coding regions of mt genomes of some species from different insects (order Hymenoptera, family Curculionidae) and Nematoda taxa (species of gen. Meloidogyne) possess segments with AT-rich short direct and inverted repeats ( Figure S12), indicating their emergence from a duplicated control region [141,142]. The aforementioned repeat patterns suggest the contribution of the tandem duplication-random loss (TDRL) mechanism [143] in mt genome extension.
The large non-coding region of the mt genome of M. branickii possesses an approximately even ratio of relatively long direct and inverted repeats, which makes this pattern quite rare within the examined animal mt genomes. Indeed, the prevalence of direct repeats over inverted repeats in mt genomes was shown earlier in a study by Nardi and colleagues (2012) [71]. Patterns combining direct and inverted repeats were identified in only three nematode species: R. iyengari, R. culicivorax [18,144], and H. agrotis ( Figure S12). The multiple sequence duplications and inversions found in the mt genomes of these species cannot be explained purely by the TDRL mechanism. An alternative explanation for such repeat patterns and the huge variation in the size of mt genomes could involve intramolecular or intermolecular recombination. Recombination in mt DNA was shown using bioinformatics and experimental approaches in plants, animals, and fungi [145][146][147][148][149], and, in particular, in the nematode Meloidogyne javanica [150]. Thus, recombination events along with duplications may have also contributed to the unusual mt genome architecture of M. branickii.
Another type of non-coding region repeat was detected in all analyzed long mt genomes of the phylum Porifera, which was, for the first time, noticed in the mt genome of Suberites domuncula [29]. Their short repeats are distributed almost evenly within mt genomes ( Figure S12). Earlier studies of Porifera mt genomes carried out by Lavrov and colleagues [151] showed that the numerous non-coding regions seen in mt genomes maintain small repetitive palindromic sequences in different sponge species [21,30,63,151]. Authors have proposed that these elements do not have adaptive significance and evolve in mt genomes as selfish elements [21]. Similar types of intergenic repetitive elements (short inverted regions) were described in the animal mt genomes of the phylum Placozoa [32,61], as well as in mt genomes of algae and fungi [72,152].
Some lengthy mt genomes do not have repeats noted in species from different taxa, such as insects (V. affinis, Diadegma semiclausum, D. melanogaster, Gonioctena intermedia, etc.), annelids (Owenia fusiformis), Nematoda (M. graminicola), and crustaceans (Sinopotamon xiushuiense, L. kenliense) ( Figure S12). We may assume that there were also duplication events in these genomes, but that the duplicated parts accumulated substantial substitutions or/and deletions and become indistinguishable as copies. Another but still less likely explanation for such sequence regions is their acquisition due to horizontal gene transfer, as was shown in Medusozoan Cnidarians [31], Octocorallia [34], and some Placo-zoan species [32]. The search for traces of horizontal gene transfer in the sequences of such mt genomes may well lead to new findings of unusual mt genes. It is also worth mentioning that the type of repeats seen, or their complete absence, is a lineage-specific feature, that is especially notable at the level of closely related species from monophyletic groups, such as gen. Calameuta (order Hymenoptera) or gen. Metanephrops (superfamily Nephropoidea) ( Figures S4-S12). However, in the mt genomes of species from higher taxonomic units (superfamily, order, phylum), different types of repeats are frequently seen, that indicates duplication of different genome regions and different stages of their evolution.
Thus, the analysis of the huge number of invertebrate animal mt genomes points to at least two mechanisms that might be responsible for their extension, that were previously discussed in studies of smaller mt genome groups: (i) the duplication of different regions (both coding and non-coding) with the subsequent rapid degradation of redundant gene copies [143] and (ii) the proliferation of small palindromic intergenic sequences [21,63,151]. The large variation in the non-coding region lengths in comparison to the coding regions in the mt genomes implies reduced negative selection in the former group of sequences. The mt genome of the amphipod M. branickii may be a rare intermediate stage of mt genome evolution bearing signs of multiple duplications as sequence copies with different degrees of degeneration and an excess of the non-coding sequence that has not yet been deleted by selective pressure forces. Further studies of these sequence feature variations at the population level might provide us with clues as to their effect on organism fitness.

Conclusions
In this study, we report the complete mt genome of the pelagic amphipod species M. branickii from the ancient Lake Baikal. This mt genome has an unusually large length of 42,256 bp and a unique gene order within amphipods. In particular, duplications and inversions of four tRNA genes and Cox2 were detected in the studied mt genome. The largest part of the mt genome consists of non-coding regions containing vestiges of ribosomal RNA genes. Phylogenetic analysis and the analysis of repeat patterns suggest that multiple duplications and inversions of regions containing ribosomal RNA genes occurred during the evolution of the M. branickii lineage and were followed by the degradation of redundant gene copies. The multiple inverted repeats found in the mt genome of M. branickii imply more complex mechanism of the sequence lengthening than mere duplication and loss. The mt genome of M. branickii is the largest found within amphipods so far and one of the largest within animals. We cofirmed that the length of animal mt genomes is mainly determined by the lengths of their non-coding regions. It was also revealed that mt genome length distributions depend on the phylum and are not affected by sampling bias. Thus, the emergence of exceptionally large mt genomes, such as that of M. branickii, is a rare and presumably lineage-specific phenomenon, as a more extensive sampling of other metazoan phyla did not reveal regular cases of large deviations in the lengths of mt genomes. Further studies on the details of M. branickii mt genome transcription, as well as the evolution of additional genes and long non-coding regions in populations of this species in Lake Baikal, will help us to define their degree of their evolution, maintenance and regulation.  Table S1: Data of the Maximum likelihood phylogenetic analyses of selected invertebrate taxa maintaining species whose non-coding mt genome region sizes were identified as outliers. File S1: Alignment of the Sanger sequence reads against the complete M. branickii mt genome. Table S2: The coverage statistics of the genomic and transcriptomic reads mapping on the mitochondrial genome of M. branickii. Table S3: The result of BlastN search in the mt genomes of Baikalian amphipods whose lengths exceed 17 Kbp. Table S4: The result of tRNA gene prediction in the mt genome of M. branickii. Figure S1: The predicted secondary structures of tRNA genes in the M. branickii mt genome. Figure  S2: Dotplots of repeated sequences in mt genomes of Baikalian amphipods with lengths exceeding 17 Kbp. Table S5: Data on ORF findings in mt genomes of Baikalian amphipods with lengths exceeding 17 Kbp. Table S6: Data of the complete animal mt genomes retrieved from the RefSeq database submitted before 1 January 2020. Table S7: Statistical data of mt genome length characteristics in animal phyla sets. Table S8: RefSeq numbers of the long mt genomes whose non-coding region lengths fall into the category of upward outliers. Table S9: Dependence of the animal mt genomes length characteristics on the number of sequences in the phylum assessed using regression analysis. Table S10: Dependence of the portions of outliers in sequence sets of genome length characteristics in phyla on the number of sequences in the phyla assessed using regression analysis. Table S11: Dependence of animal mt genome length characteristics on belonging to a phylum assessed using ANOVA. Table S12: Dependence of mt genome length characteristics on the length of the entire mt genome assessed using regression analysis. Table  S13: Dependencies of entire mt genome lengths on belonging to three gradations in ANOVA (mt genome lengths within general distributions, excluding values from outliers; mt genome lengths from downward outliers; mt genome lengths from upward outliers). Figure S3: Distributions of mt genome lengths from three sets: lengths of mt genomes within general distribution values (excluding values from outliers), lengths of mt genomes from downward outliers, and lengths of mt genomes from upward outliers shown for the lengths of the entire mt genome, lengths of coding mt genome parts, and lengths of non-coding mt genome parts. Figure S4: Maximum likelihood tree of Polychaeta species (phylum Annelida) from the RefSeq database based on mt PCG amino acid sequences. Figure S5: Maximum likelihood tree of Arthropods group 2 (Hymenoptera, Scarabaeiformia Staphyliniformia, Curculionidae, Nephropoidea) from the RefSeq database based on mt PCG amino acid sequences. Figure S6: Maximum likelihood tree of Arthropods group 3 (Copepoda, Potamoidea, Nephropoidea, Pentastomida) from the RefSeq database based on mt PCG amino acid sequences. Figure S7: Maximum likelihood tree of Pteriomorphia (phylum Mollusca) from the RefSeq database based on mt PCG amino acid sequences. Figure S8: Maximum likelihood tree of Dorylaimia (phylum Nematoda) from the RefSeq database based on mt PCG amino acid sequences. Figure S9: Maximum likelihood tree of order Strongylida and suborder Tylenchina (phylum Nematoda) from the RefSeq database based on mt PCG amino acid sequences. Figure S10: Maximum likelihood tree of phylum Xenacoelomorpha from the RefSeq database based on mt PCG amino acid sequences. Figure S11: Maximum likelihood tree of class Demospongiae (phylum Porifera) species from the RefSeq database based on mt PCG amino acid sequences. Figure S12: Dotplots of repeat sequences in long mt genomes of animals.

Data Availability Statement:
The data that support the findings of this study (sequencing reads and assemblies) are available from the corresponding authors upon reasonable request.