Diversity and Distribution of Mites (ACARI) Revealed by Contamination Survey in Public Genomic Databases

Simple Summary: Mites are a group of minute animals distributed ubiquitously across the planet. They have close ecological ties with other organisms, such as plants, insects and vertebrates. With the development of sequencing technology, the volume of genomic data has increased dramatically. Although contamination by microbial symbionts in public genomic databases has been explored to reveal the interactions between microbes and hosts, no similar study has been carried out for the microscopic mites. Here, we present the first survey and analysis of mite contamination in Genbank genomic resources. The results show that mite contamination in public databases is not rare. Based on the contaminated contigs, the host associations and evolution of mites are discussed.
Abstract: Acari (mites and ticks) are a biodiverse group of microarthropods within the Arachnida. Because of their diminutive size, mites are often overlooked. We hypothesized that mites, like microorganisms closely associated with their hosts, could also contaminate public genomic databases. Here, using a previously reported strategy based on DNA barcodes, we scanned the Genbank WGS/TSA database for contaminations related to mites (Acari, exclusive of Ixodida). In 22,114 assemblies (17,845 animal and 4269 plant projects), 1717 contigs in 681 assemblies (3.1%) were detected as mite contaminations. Further taxonomic analysis showed the following: (1) most of the contaminants (1445/1717) were from specimens of Magnoliopsida, Insecta and Pinopsida; (2) the contamination rates were higher in plant or TSA projects; (3) mite distribution among different classes of hosts varied considerably. Phylogenetic analysis of these contaminated contigs further revealed complicated mite-host associations. Overall, we conducted the first systematic survey and analysis of mite contaminations in public genomic databases, and these DNA barcode related mite contigs will provide a valuable resource for understanding the diversity and phylogeny of mites.


Animals 2023, 13, 3172

Introduction
Acari (mites and ticks) are a highly speciose group of animals within the Arthropoda [1]. With nearly 55,000 described species and up to one million species awaiting discovery or description [2,3], mites are found across a wide range of microhabitats around the world, from terrestrial to aquatic or oceanic environments, and even underground niches. Not surprisingly, their lifestyles are also highly diverse, ranging from detritivorous, phytophagous, pollinivorous, fungivorous and predaceous in nonparasitic members to obligate ectoparasitism [1]. They also play multifaceted roles in ecosystems, as pests of crops (e.g., spider mites and gall mites), parasites of birds and mammals (e.g., quill mites, scabies mites and follicle mites), vectors of notorious viruses and sources of allergens (e.g., house dust mites) [4]. Meanwhile, some of them are beneficial to humans as biocontrol agents of pests and weeds. Despite their great economic and ecological importance, our knowledge of mites is fragmentary, usually focused on a particular mite taxon at a local scale [1,5], and many gaps remain in our understanding of the distribution, diversification and evolution of mites.
Microbiologists have long been aware of contaminations in genomic databases caused by symbiotic bacteria, fungi or protists, and have exploited them as a valuable resource for studying host-microbe interactions [13][14][15][16][17]. However, contaminations by microscopic mites in genomic databases have not been studied. Our assumption is as follows: the ubiquitous mites, with their very small size (mostly 0.4-0.8 mm) [2] and close associations with plants and animals, may go unnoticed in field samples and thereby contaminate public databases. Thus, we modified our previously published pipeline for protistan contaminations to survey mite contaminations in Genbank whole genome shotgun (WGS) genomes and transcriptome shotgun assemblies (TSA) based on DNA barcodes. DNA barcodes (e.g., the mitochondrial cytochrome c oxidase I, COI) are widely used in DNA barcoding experiments because such short sequences can produce accurate species identifications [18]. Our pipeline takes advantage of this attribute, and reliably detects contaminations related to DNA barcodes in large genomic databases [13].
The aims of the current study were as follows: (1) to survey possible contaminations by mites in animal and plant genomic data; (2) to compare the contamination rates between sequencing methods (WGS versus TSA) and among specimens of different host classes; (3) to assess the host associations of different mites by calculating the distribution of mite contaminations among host classes; (4) to explore the phylogenetic origins of the contaminated contigs. Given the wide geographic scope and the breadth of organisms covered by the Genbank WGS/TSA genomic database, we expect our findings to provide a broad illustration of the distribution and biodiversity of mites.

Pipeline of Mite Contamination Survey
We modified our pipeline designed for scanning protistan contamination [13], using mite barcodes as the inclusion set and nonmite barcodes as the exclusion set, to scan for mite contaminations (Figure 1). As the Genbank WGS/TSA database is too large to be analyzed routinely, we sequentially eliminated candidate sequences in four steps, removing those that (1) were too long (>100,000 bp); (2) had no similarity to mite barcodes; (3) had more similarity to nonmite barcodes; or (4) aligned with a best hit outside of Acari (exclusive of Ixodida) in the Genbank nt database, or with less than 80% identity.
Considering the huge size of Genbank WGS/TSA and our limited computational resources, we filtered out contigs longer than 100 kb. The reasoning was that all RefSeq mitochondrial genomes of the Acari are shorter than 25 kb (Figure S1a), and 98.5% of mite barcodes in the BOLD library are COI related (see Section 3.1 for details); therefore, most of the detected mite contaminations were mitochondrial-derived and shorter than 100 kb (Figure S1b).
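The four elimination steps can be sketched roughly as follows. This is a minimal illustration, not the published pipeline: the `Candidate` record, its field names and the use of a best alignment score to compare mite and nonmite similarity are all assumptions made here; only the 100,000 bp and 80% cutoffs come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONTIG_LEN = 100_000  # step 1 cutoff stated in the text (bp)
MIN_NT_IDENTITY = 80.0    # step 4 cutoff stated in the text (%)


@dataclass
class Candidate:
    """Hypothetical summary of the alignments for one WGS/TSA contig."""
    contig_id: str
    length: int
    mite_score: Optional[float]     # best score vs mite barcodes, None if no hit
    nonmite_score: Optional[float]  # best score vs nonmite barcodes, None if no hit
    nt_best_is_mite: bool           # best nt hit is Acari (exclusive of Ixodida)
    nt_identity: float              # percent identity of that best nt hit


def rejection_step(c: Candidate) -> Optional[int]:
    """Return the step (1-4) that eliminates the contig, or None if it is kept."""
    if c.length > MAX_CONTIG_LEN:
        return 1  # too long
    if c.mite_score is None:
        return 2  # no similarity to mite barcodes
    if c.nonmite_score is not None and c.nonmite_score > c.mite_score:
        return 3  # more similar to nonmite barcodes (assumed score comparison)
    if not c.nt_best_is_mite or c.nt_identity < MIN_NT_IDENTITY:
        return 4  # best nt hit outside mites, or identity below 80%
    return None


if __name__ == "__main__":
    example = Candidate("contig_1", 1500, 900.0, 300.0, True, 95.0)
    print(rejection_step(example))
```

Ordering the checks from cheapest (length) to most expensive (nt search) mirrors the stated motivation of reducing the candidate set before the costly alignment steps.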

Taxonomic Analysis of Mite Contaminated Contigs
To correctly assign the mite contaminated contigs to family, genus or even species level, the thresholds need to be more restrictive. It has been reported that DNA barcodes enable family-level taxonomic assignments in the Acari with strict similarity thresholds (Sarcoptiformes 89.9%, and Trombidiformes 91.4%) [21]. Thus, we further assigned the output contaminated contigs of mite origin to family level with a similarity threshold of 91.4%, according to the best-scoring hit against the nt database. The abundance of contaminated contigs was plotted with Krona [22]. Additionally, the relative abundances were calculated as the percentages of contaminations of different mite family origins across different host classes, and plotted with the matplotlib library.
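A minimal sketch of the family assignment and the relative abundance calculation is given below. The record layout (host class, family of the top nt hit, percent identity) is hypothetical, and treating the 91.4% threshold as inclusive is an assumption; the percentages are computed per mite family across host classes, which is one reading of the description above.

```python
from collections import Counter, defaultdict

FAMILY_THRESHOLD = 91.4  # percent identity, from the text


def assign_family(top_hit_family, identity):
    """Return the family of the top hit, or None if identity is below threshold."""
    return top_hit_family if identity >= FAMILY_THRESHOLD else None


def relative_abundance(records):
    """Compute {family: {host_class: percent of that family's assigned contigs}}.

    records: iterable of (host_class, top_hit_family, identity) tuples.
    """
    counts = defaultdict(Counter)
    for host_class, family, identity in records:
        assigned = assign_family(family, identity)
        if assigned is not None:
            counts[assigned][host_class] += 1

    result = {}
    for family, by_host in counts.items():
        total = sum(by_host.values())
        result[family] = {host: 100.0 * n / total for host, n in by_host.items()}
    return result


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        ("Magnoliopsida", "Phytoseiidae", 95.0),
        ("Magnoliopsida", "Phytoseiidae", 93.2),
        ("Insecta", "Phytoseiidae", 92.0),
        ("Insecta", "Phytoseiidae", 90.0),  # below threshold, left unassigned
    ]
    print(relative_abundance(demo))
```

The resulting nested dictionary can be passed to a stacked bar plot in matplotlib, one bar per mite family.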



Mite DNA Barcodes in BOLD Database
Using a Python script with a regular expression ('.*\|Animalia,Arthropoda,Arachnida,(Trombidiformes|Sarcoptiformes|Mesostigmata|Holothyrida)') to match the sequence id, 138,272 DNA barcodes belonging to mites were extracted from the BOLD database to form the inclusion set, and the remaining nonmite barcodes were used to build the exclusion set.
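The split into inclusion and exclusion sets can be illustrated with the regular expression quoted above. This is only a sketch: the function name and the example sequence ids are invented for illustration, and the real BOLD header layout may differ from what is assumed here.

```python
import re

# Regular expression quoted in the text; Ixodida (ticks) is deliberately absent.
MITE_ID = re.compile(
    r".*\|Animalia,Arthropoda,Arachnida,"
    r"(Trombidiformes|Sarcoptiformes|Mesostigmata|Holothyrida)"
)


def split_barcodes(seq_ids):
    """Partition sequence ids into (inclusion, exclusion) lists."""
    inclusion, exclusion = [], []
    for seq_id in seq_ids:
        if MITE_ID.match(seq_id):
            inclusion.append(seq_id)
        else:
            exclusion.append(seq_id)
    return inclusion, exclusion


if __name__ == "__main__":
    # Hypothetical ids, written only to exercise the pattern.
    ids = [
        "ID1|COI-5P|Animalia,Arthropoda,Arachnida,Mesostigmata,Phytoseiidae",
        "ID2|COI-5P|Animalia,Arthropoda,Arachnida,Ixodida,Ixodidae",
        "ID3|COI-5P|Animalia,Arthropoda,Insecta,Coleoptera,Scarabaeidae",
    ]
    mites, others = split_barcodes(ids)
    print(len(mites), len(others))
```

Because ticks are not listed in the alternation, tick barcodes fall into the exclusion set, consistent with the survey covering Acari exclusive of Ixodida.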
As for the distribution among genes, COI-5P (132,679, 96%) plus COI-3P (3507, 2.5%) account for 98.5% of all the barcodes. The COI has long been used to discriminate the small mites, and to resolve the diversity of mite fauna in large-scale surveys [30,31]. It can overcome the shortage of external diagnostic characters of mites in traditional identification through morphology [32,33].


Mite Contaminations in Genbank nt Database
A substantial fraction of sequences in the Genbank database appear to be contaminated [34]. Undetected mite contaminations in the Genbank nt database would lead to false negatives in the fourth step (Figure 1) of eliminating candidate sequences. However, our pipeline [13] can discriminate mite contaminations in the nt database by checking for records whose best match is a misidentified sequence from the source species with 100% identity, but whose second-best match is a mite sequence.
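The check described above can be sketched as follows, under the assumption that each nt record is searched against nt and its hits are available sorted from best to worst. The `Hit` record and the function name are hypothetical, and the simplification that the best match is the record itself follows the Table 1 footnote rather than the pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one BLAST hit against the nt database."""
    subject_id: str
    identity: float  # percent identity
    is_mite: bool    # subject is annotated as Acari (exclusive of Ixodida)


def is_hidden_mite_contaminant(query_id, hits):
    """Flag an nt record whose best hit is itself at 100% identity, annotated
    as a nonmite species, while its second-best hit is a mite sequence."""
    if len(hits) < 2:
        return False
    best, second = hits[0], hits[1]
    self_match = (
        best.subject_id == query_id
        and best.identity == 100.0
        and not best.is_mite
    )
    return self_match and second.is_mite


if __name__ == "__main__":
    # Placeholder ids, not real accessions.
    hits = [Hit("REC1", 100.0, False), Hit("MITE1", 88.0, True)]
    print(is_hidden_mite_contaminant("REC1", hits))
```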
After running the pipeline, four misidentified sequences (mite contaminants) were reported in the Genbank nt database (Table 1). XM_022085578.1-XM_022085580.1 are annotated as mitochondrial genes of Zootermopsis nevadensis (Dictyoptera, Termopsidae), but are actually contaminations derived from an Acaroidea mite; XR_002707260.1 is predicted to be the Onthophagus taurus small subunit rRNA, but the real source of this sequence is a Macrochelidae mite. Thus, care is needed when using COI-like genes with the '-like' suffix to identify species, because these genes are likely to be contaminants propagated from contaminations in the Genbank WGS database.
Table 1 footnotes: 1 The misidentified sequences in the Genbank nt database were searched with BLAST against the nt database; the top two matches are listed, the first being the record itself and the second a mite sequence. 2 Alignment length. 3 Abbreviations: COX, cytochrome c oxidase subunit; rRNA, ribosomal RNA.
Next, we calculated the numbers of mite contigs and the contamination rates in specimens from different hosts (Figure 3a). The results showed that the richness of contaminations varied greatly among host classes. The three host classes with the largest numbers of contaminated contigs were Magnoliopsida (730 contigs), Insecta (562 contigs) and Pinopsida (148 contigs). Although the contamination rates of Pinopsida (30/138, 21.7%) and Magnoliopsida (290/4047, 7.2%) were well above the overall rate (681/22,114, 3.1%), the rate of Insecta (223/6224, 3.6%) was only slightly above it.
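For reference, the contamination rates implied by the assembly counts quoted above can be recomputed with a few lines. The counts are taken from the text; the dictionary layout and function name are illustrative only.

```python
# (contaminated assemblies, total assemblies) as quoted in the text
counts = {
    "Pinopsida": (30, 138),
    "Magnoliopsida": (290, 4047),
    "Insecta": (223, 6224),
    "All assemblies": (681, 22114),
}


def contamination_rate(contaminated, total):
    """Percentage of assemblies with at least one mite contig."""
    return 100.0 * contaminated / total


rates = {
    name: round(contamination_rate(contaminated, total), 1)
    for name, (contaminated, total) in counts.items()
}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate}%")
```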
To further reveal the distribution of mites, we assigned these contigs to mite families and plotted the relative abundance among different host classes (Figure 3b). Using a similarity threshold of 91.4%, 1041 contigs were successfully assigned to mite families. The distribution can be summarized as follows: Contaminations in the order Mesostigmata are mostly from plant or insect specimens. For example, in the family Phytoseiidae, which harbors the most common plant-inhabiting predatory mites [35], 38/48 of the contaminated contigs are from projects of Magnoliopsida.
[...] contaminants of Eupodina are found in assemblies of plants, except in the family Halacaridae. Notably, about 40% of the Halacaridae (marine mites) contigs were from Anthozoa, and over 90% of the Phytoptidae contigs were from Pinopsida.
For families in Eleutherengona, the numbers of detected contigs are modest: Tarsonemidae (46 contigs), Demodicidae (12 contigs), Tenuipalpidae (14 contigs) and Tetranychidae (98 contigs). Apart from Demodicidae, contaminations of these families are mostly associated with the class Magnoliopsida. Tetranychidae (spider mites) and Tenuipalpidae (false spider mites) are phytophagous and include major agricultural pests, and are thus mainly found on plants. In the family Tarsonemidae (white mites), Steneotarsonemus spinki Smiley (rice mite) is a serious pest of rice crops, whereas some other genera and species are associated with bark beetles [38,39]. Here, we found a modest percentage of Tarsonemidae contigs from Pinopsida and Insecta. Demodicidae mites are ubiquitous skin parasites of mammals [40]. However, all 12 Demodicidae contigs here were from nonmammalian hosts. After carefully checking these contigs, we found that all of them had high identities (96-100%) to the human mites Demodex folliculorum or Demodex brevis (Table S2); thus, we regard these Demodicidae contigs as fortuitous contaminations by human Demodex mites, and they should not be considered in further mite-host association analysis.
Finally, in the order Sarcoptiformes, most of the contaminations were related to Insecta, followed by plants. Interestingly, several of these contigs are from Actinopteri (bony fishes) assemblies (Table S3). This is consistent with the report that Histiostomatidae mites can attack fishes [41].

Phylogenetic Analysis of the Mite Contaminants
To further understand the phylogenetic origins of these contaminants, the contigs were annotated with MitoZ, and the predicted COI sequences longer than 80 amino acids were used to infer a phylogenetic tree (Figure 4). The clades are colored according to the taxa of the mite references retrieved from Genbank, and the host taxa of the contigs are derived from the project/assembly information (Spreadsheet S1) and indicated with symbols. As in the preceding subsection, similar host-mite associations can be deduced from this smaller COI dataset.
According to the phylogenetic tree, the following conclusions can be drawn: (1) the supercohort Anystina is monophyletic with low support, whereas the Eupodina is paraphyletic; (2) two superfamilies, Phytoseioidea (Blattisociidae + Phytoseiidae) and Eriophyoidea (Eriophyidae + Diptilomiopidae + Phytoptidae), were both recovered as monophyletic; (3) the monophylies of two clades, Parasitengona (Anystina) and [...]
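The length filter applied to the predicted COI proteins before tree inference can be sketched as below. The FASTA parsing is a generic stand-in, not the MitoZ output handling used in the study, and the function names are hypothetical; only the "more than 80 amino acids" criterion comes from the text.

```python
MIN_COI_LENGTH = 80  # amino acids; sequences must be longer than this


def read_fasta(lines):
    """Parse FASTA-formatted lines into {record_name: sequence}."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def filter_coi(proteins):
    """Keep predicted COI proteins longer than 80 residues (stop '*' ignored)."""
    return {
        name: seq
        for name, seq in proteins.items()
        if len(seq.rstrip("*")) > MIN_COI_LENGTH
    }


if __name__ == "__main__":
    # Artificial sequences, used only to exercise the filter.
    demo = [">contigA COI", "M" * 50, "M" * 40, ">contigB COI", "M" * 80]
    print(sorted(filter_coi(read_fasta(demo))))
```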
Next, we investigated the contamination by clades as follows: Manure-inhabiting (coprophilous) Mesostigmata mites are important biological control agents that feed on the eggs or larvae of pests [48]. In the dung beetle (Onthophagus taurus) assembly, a contig (JHOM02004312.1) was found to be related to a Macrochelidae (Mesostigmata) mite. In the same assembly, there was another contig related to rRNA (JHOM02004223.1), which was misidentified (XR_002707260.1) in the nt database (Table 1).
In the Tetranychoidea (Eleutherengona, Raphignathae) clade, all the contaminated contigs are from Magnoliopsida; among them, the ratio of dicots to monocots is 8:6. Two clades, Demodicidae (Raphignathae) and Stigmaeidae (Raphignathae), were close to the Tetranychoidea. In the Demodicidae clade, the contig is from the black howler monkey (GGWL01), with 83.4% nucleotide identity to Demodex folliculorum (Table S2), a known mite parasite that inhabits the skin of humans [40]. In the Stigmaeidae clade, the contig is from Japanese cedar (Pinopsida; IABV01).
Oribatida are primarily soil dwelling, but also occur on trees [56]. For example, Eueremaeus trionus (Eremaeidae) was found on the bases of branches of Siberian pine trees (Pinus sibirica) [57]. Consistent with this, most of the contigs in the Oribatida clade are from Magnoliopsida (7/10). Interestingly, there was a contig from Brachystomella parvula (Collembola, JABASM01) that is closest to Hypochthonius rufulus (Oribatida, Hypochthoniidae). Springtails (Collembola) are also microarthropods that, like oribatid mites, live below ground, and the two groups are often used together to reveal the effects of environmental change on soil microarthropod populations [58].

Discussion
Distribution and host associations of mites are complex because of their remarkable diversity of trophic preferences and habitats. Moreover, crossovers often occur (e.g., predators may feed on plants; free-living mites may become parasitic or phoretic on other animals; and litter-inhabiting mites may move onto plants) [1]. Thus, it is very challenging to summarize the distribution and host interactions of mites.
Fundamental advances in sequencing technology and bioinformatics have made en masse biodiversity assessments of microscopic organisms possible [60]. In this study, we applied a bioinformatics method to mine mite contaminations in the Genbank WGS/TSA database at acceptable computational cost, and drew conclusions that are in line with our expectations and with mite-host associations established in traditional studies. However, we would like to emphasize some limitations of our study. First, this study was not intended to survey all contaminated contigs related to all mite genes, but only those related to DNA barcodes. The reason is that the huge size and rapid growth of the Genbank database exceed our computational resources, as mentioned previously [13].
Second, the mite contaminations detected in this study are still biased. The greatest number of mite species is found in soils [61]. However, we detected relatively few contigs of Oribatida and Endeostigmata (many of which live in deep soil). The reason is that Genbank WGS/TSA does not contain soil environmental data. In addition, environmental specimens are not suitable for host association studies because their host information is unclear.
Third, although the BOLD barcode library is largely complete for vertebrate species, it remains poorly developed for invertebrates, especially mites [62]. Since our pipeline relies heavily on the BOLD and Genbank nt databases, we expect that mite contaminations from unrecognized species remain undetected. As the BOLD database grows, it will provide enough barcodes to allow more precise resolution of the contaminating mites.
Lastly, as mites are so speciose, the contaminated contigs detected in this study cannot cover all mite or host taxa. Hence, some mite families or host classes are missing from our deduced distribution pattern. However, as the Genbank database grows, the number of mite contaminations will increase and provide more comprehensive information for the study of mite distribution.

Conclusions
In this study, we systematically studied the distribution of mites based on contaminations in the Genbank WGS/TSA database, which covers a large cohort of species (animals: 10,240; plants: 1970; Spreadsheet S1). The results suggest that mite-derived contaminations are common in genomic databases, with about three in every hundred assemblies contaminated by mites. Thus, apart from the commonly known microbial contaminations, we should also be aware of contaminations derived from minuscule mites to avoid erroneous interpretation of genomic data. Based on these contaminated contigs, host associations of mites were inferred, such as Parasitengona mites on arthropods and Phytoseiidae, Tetranychidae, Tenuipalpidae and Eriophyoidea on plants. Further phylogenetic analysis of the predicted COI derived from these contigs corroborated the mite origin and heterogeneous distribution of the contaminated contigs. Overall, our study provides valuable insights into the global biodiversity and distribution of mites.

Figure 1. Pipeline to scan mite contamination. The four steps (1-4) to eliminate candidate sequences are marked in red font.

Figure 2. (a) Krona plot displaying the distribution of mite DNA barcodes at various Acari taxonomic levels in the BOLD database. (b) Pie chart of mite DNA barcodes by gene marker in the BOLD database.

Figure 3. Distribution of mite contaminations among different host classes or mite families. (a) Numbers of mite contigs and contamination rates among projects of different host classes. (b) Relative richness, shown as the percentages of contigs from different host classes for each mite family. The contig numbers are listed in parentheses, and the host classes are indicated at the bottom of the plot. The host/mite cladograms were generated by taxtree (https://github.com/nongxinshengxin/taxtree, accessed on 6 August 2023) based on NCBI taxonomy. The artificial contamination with human Demodex (Demodicidae) is marked with a star symbol.

Figure 4. Phylogenetic tree of COI predicted from mite contaminated contigs. The species names of the mite references retrieved from Genbank are colored in blue font. The contaminated WGS/TSA contigs are named with accession numbers following the host names, with host classes represented by a symbol at the nodes (the most representative class of that clade), or by symbols after exceptional branches individually. A D. following a name indicates that the host taxon is a dicot, and an M. indicates a monocot. Nodes with bootstrap values (BSP) ≥ 70% are marked with a black dot.

Table 1. Misidentified sequences in the Genbank nt database that are actually sourced from mites.