Article

Forensic Genomic Analysis Determines That RaTG13 Was Likely Generated from a Bat Mating Plug

Biology Department, University of Puerto Rico—Rio Piedras, San Juan, PR 00901, USA
Microbiol. Res. 2024, 15(3), 1784-1805; https://doi.org/10.3390/microbiolres15030119
Submission received: 30 July 2024 / Revised: 31 August 2024 / Accepted: 3 September 2024 / Published: 5 September 2024

Abstract

RaTG13 is phylogenomically the closest related coronavirus to SARS-CoV-2; consequently, understanding the provenance of this high-value genome sequence is important in understanding the origin of SARS-CoV-2. While RaTG13 was described as being generated from a Rhinolophus affinis fecal swab obtained from a mine in Mojiang, Yunnan, numerous investigators have pointed out that this is inconsistent with the low proportion of bacterial reads in the sequencing dataset. Metagenomic analysis confirms that only 10.3% of small-subunit (SSU) rRNA sequences in the dataset are bacterial, which is inconsistent with a fecal sample. In addition, the bacterial taxa present in the sample are shown to be inconsistent with fecal material. The assembly of mitochondrial SSU rRNA sequences in the dataset produces a sequence 98.7% identical to R. affinis mitochondrial SSU rRNA, indicating that the sample was generated from R. affinis or a closely related species. In addition, 87.5% of the reads in the dataset map to the Rhinolophus ferrumequinum genome, and 62.2% of these map to protein-coding genes, indicating that the dataset represents a Rhinolophus sp. transcriptome rather than a fecal swab sample. Differential gene expression analysis reveals that the pattern of expressed genes in the RaTG13 dataset is similar to that of RaTG15, which was also collected from the Mojiang mine. GO enrichment analysis reveals the overexpression of spermatogenesis- and olfaction-related genes in both datasets. This observation is consistent with a mating plug found in female Rhinolophid bats and suggests that RaTG13 was mis-sampled from such a plug. A validated natural provenance of the RaTG13 dataset throws into relief the unusual features of the SARS-CoV-2 genome.

1. Introduction

Understanding the origin of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease 2019 (COVID-19) is vital for preventing future pandemics. There are two main hypotheses regarding the origin of the COVID-19 pandemic. The zoonosis hypothesis proposes that the progenitor of SARS-CoV-2 jumped from a bat or intermediate host to a human [1]. This scenario requires that an infected bat or intermediate host came into close contact with a human in a non-research setting that allowed the initial transmission to occur. The contrasting lab leak hypothesis proposes that SARS-CoV-2 was transmitted to the human population through an accident associated with a research-related activity, such as a laboratory experiment [2,3].
RaTG13, sequenced by the Wuhan Institute of Virology (WIV), is still phylogenomically the closest known relative to SARS-CoV-2 [4,5,6,7]. While BANAL-52 [7] has a slightly higher sequence identity to SARS-CoV-2 than RaTG13, RaTG13 resolves as the sister taxon of SARS-CoV-2 using phylogenomic approaches. This is because phylogenomic methods take into account site-specific differences in nucleotide substitution rates. While some recombination is part of the evolutionary history of coronaviruses, including SARS-CoV-2 [8], the RaTG13 backbone is most closely related to that of SARS-CoV-2 overall.
The reported provenance of RaTG13 from a fecal swab from an intermediate horseshoe bat (Rhinolophus affinis) was used to support a proposed zoonotic origin of SARS-CoV-2 in the publication by Zhou et al. that first described the RaTG13 genome sequence [9]. However, Zhou et al. provided limited information regarding the sampling location and date of sequencing of RaTG13. Additional confusion occurred, as the study appeared to imply that RaTG13 had been sequenced after SARS-CoV-2. This seemed unlikely given the short window of time between the sequencing of SARS-CoV-2 in late December 2019 and the submission of the preprint version of the paper on 22 January 2020 [10].
A fragment of RNA-dependent RNA polymerase (RdRp) was the first part of RaTG13 to be sequenced and was published in 2016, initially labeled as ‘RaBtCov/4991’ [11] and subsequently renamed ‘RaTG13’ in Zhou et al. The link between the two was clarified by [12,13], as Zhou et al. had failed to reference the RdRp sequence. After online discussion and queries to the authors, further details were provided in a subsequent Addendum [14], which gave the date of sequencing of the RaTG13 genome as 2018 and the sampling location as a mine in Mojiang, Yunnan Province, China. It was subsequently revealed that the mine had been associated with the deaths of three miners in 2012 who had been clearing bat guano and succumbed to a virus-like respiratory infection [13,15,16].
Clearly, the provenance of RaTG13 is of importance, given its close relationship to SARS-CoV-2. However, a number of researchers have identified potential problems with the RaTG13 raw sequence data [17,18,19,20,21,22]. In particular, database entries for the raw reads deposited in the Genome Sequence Archive (GSA), Short Read Archive (SRA) and European Nucleotide Archive (ENA) state that the data were generated from an R. affinis fecal swab, as does the original paper describing the detection of ‘RaBtCov/4991’ (RaTG13) RdRp [11]. Likewise, a Master’s thesis from the WIV describing the sequencing of the RaTG13 genome (labeled in the thesis as ‘Ra4991_Yunnan’) attributed the provenance of the sample to 1 of 2815 ‘anal swabs’/‘fecal pellets’ collected in Yunnan Province, China, from 2011 to 2016 [23]. However, surprisingly, a low proportion of bacteria-related reads was revealed by a taxonomic analysis of the raw reads [17,19,20,21,22]. This is inconsistent with a fecal swab origin because fecal material is typically dominated by bacteria, with only a small amount of host nucleic acid present [24].
Here, the raw sequence reads used to generate the RaTG13 genome are analyzed in detail using metagenomic, phylogenetic, genome mapping and transcriptomic approaches in order to identify the true source of the dataset.

2. Materials

Sequence Data

The next-generation sequencing (NGS) dataset used to generate the RaTG13 genome by the WIV was obtained from the Genome Sequence Archive (GSA, accession number CRR122287). The date of collection was reported as 24 July 2013 in Yunnan, Pu’er, 22.82N 100.96E (GSA Biosample accession number SAMC133252). This location corresponds to the town of Pu’er rather than that of the Mojiang mine. The dataset was labeled as being generated from a ‘fecal swab’ in the GSA (Experiment accession CRX097481), the Short Read Archive (SRA, accession number SRR11085797) and the European Nucleotide Archive (ENA, accession number SRX7724752). The GSA entry states that the QIAamp Viral RNA Mini Kit (Qiagen, Hilden, Germany) was used to extract RNA and that the TruSeq Stranded mRNA Library Preparation kit (Illumina, San Diego, CA, USA) was used to produce the sequencing library for sequencing on the HiSeq 3000 (Illumina) platform.
An NGS dataset generated from an ‘anal swab’ obtained from ‘R. affinis’ by Li et al. at the WIV [25] was used as a comparison (note that the species attribution is likely incorrect, as discussed below in Results). The dataset was used to generate the BtRhCoV-HKU2r (Bat Rhinolophus HKU2 coronavirus-related) genome (National Center for Biotechnology Information, NCBI, accession number MN611522) and was apparently sampled from the mine in Mojiang [25]. The raw sequence data were obtained from the ENA (accession number SRR11085736) and were labeled as being generated from an R. affinis anal swab and sequenced on a HiSeq 3000 platform. Likewise, the dataset was described as being generated from an R. affinis anal swab using the QIAamp Viral RNA Mini Kit (Qiagen) and TruSeq Library Preparation kit (Illumina) in the publication describing its genome sequence [25].
Nine datasets from the same study, including the BtRhCoV-HKU2r dataset described above, were used for beta-diversity analysis. These were (species names in brackets) BtRaCoV-229Er (Rousettus aegyptiacus), BtScCoV-512r (Scotophilus sp.), BtHiCoV-CHB25 (Hipposideros pomona), BtMiCoV-1r (Miniopterus sp.), BtHpCoV-HKU10r (Hipposideros pomona), BtRhCoV-HKU2r (Rhinolophus affinis), BtTyCoV-HKU4r (Tylonycteris sp.), BtPiCoV-HKU5r (Pipistrellus sp.) and BtMiCoV-HKU8r (Miniopterus sp.).
Two transcriptomes generated from R. sinicus splenocyte primary cell lines by the WIV were used for comparison and were obtained from the SRA (accession numbers SRR5819066 and SRR5819069) [26]; these were termed ‘splenocyte 1’ and ‘splenocyte 2’, respectively. RNA was extracted using the RNeasy Mini Kit (Qiagen) [26]. The library construction protocol used to generate the datasets is not described in [26]; however, its NCBI entry describes it as being sequenced on a HiSeq 2000 platform.
An NGS dataset that was generated by the EcoHealth Alliance (EHA) from an oral swab from the bat Miniopterus nimbae from Zaire and that was described as containing the Ebola virus was obtained from the SRA (accession number SRR14127641). The dataset was described on its SRA webpage as being generated using the VirCapSeq target protocol [27] and sequenced using a HiSeq 4000 platform. No Ebola virus reads were detected in the dataset in this study; consequently, it was mis-named. The four datasets are described here as the RaTG13, BtRhCoV-HKU2r anal swab, splenocyte transcriptome 1 and Ebola oral swab datasets, respectively.
R. sinicus transcriptomes generated from kidney, liver, muscle, spleen, lung, heart and brain tissues were obtained from the SRA (accession numbers SRR2273931, SRR2273875, SRR2273816, SRR2273762, SRR2273740, SRR2273739 and SRR2273738, respectively). The transcriptomes were generated using the mRNA-seq Prep Kit (Illumina) and sequenced on an Illumina HiSeq2500 platform. The datasets are referred to here as R. sinicus kidney, liver, muscle, spleen, lung, heart and brain.
NGS datasets from Rhinolophus spp. ‘anal swab’ samples taken from ‘Tongguan Town, Mojiang County, Yunnan’, the ‘same location as RaTG13’ (i.e., the Mojiang mine) [28], were obtained from the GSA (accession numbers CRR290603, CRR290600, CRR290601, CRR290602, CRR290604, CRR290605, CRR290606 and CRR290607). These datasets had been used to generate R. affinis SARSr-CoV RaTG15 (Ra7909), R. stheno SARSr-CoV (Rs7896), R. stheno SARSr-CoV (Rs7905), R. stheno SARSr-CoV (Rs7907), R. stheno SARSr-CoV (Rs7921), R. stheno SARSr-CoV (Rs7924), R. stheno SARSr-CoV (Rs7931) and R. stheno SARSr-CoV (Rs7952). These eight genomes constitute Clade 7896, a novel sarbecovirus clade [28]. The samples were collected in May 2015. The RNA was extracted using a High Pure Viral RNA Kit (Roche, Basel, Switzerland) and prepared using an MGIEasy RNA Library Prep Kit before sequencing on a BGI MGISEQ-2000RS (from the GSA entry).

3. Methods

3.1. Microbial Analysis

Metaxa2 [29] was used to identify (forward) reads in the raw datasets that match small-subunit (SSU) rRNA from mitochondria, bacteria and eukaryotes. Phylogenetic affiliation was assigned to the lowest taxonomic rank possible from the read alignments by Metaxa2.
Metaxa2 was used to output the taxonomic compositions of individual datasets (using the metaxa2_ttt function and the -b option), which were then combined into a table (using the metaxa2_dc function). Then, the R vegan package was used to calculate a distance matrix using the Bray–Curtis distance metric. A principal coordinate (PCoA) analysis was conducted on the distance matrix using the R ape package and k-means clustering was conducted using the kmeans function.
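The distance step above can be illustrated with a minimal sketch of the Bray–Curtis dissimilarity, written in Python rather than R for compactness. The sample names and taxon counts are hypothetical and are not values from this study; the R code actually used is not reproduced in the text.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(table):
    """Pairwise Bray-Curtis distances for a {sample: [count per taxon]} table."""
    names = sorted(table)
    dist = {name: {name: 0.0} for name in names}
    for s, t in combinations(names, 2):
        d = bray_curtis(table[s], table[t])
        dist[s][t] = d
        dist[t][s] = d
    return dist


# Hypothetical counts for three made-up taxa (illustration only).
example = {
    "sample_A": [90, 5, 5],
    "sample_B": [80, 10, 10],
    "sample_C": [5, 60, 35],
}
matrix = distance_matrix(example)
```

Samples with similar taxonomic profiles (here sample_A and sample_B) receive a small distance, which is what allows the subsequent PCoA and k-means steps to group them.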

3.2. Mitochondrial Genome Analysis

Forward reads from the RaTG13, BtRhCoV-HKU2r anal swab and splenocyte 1 transcriptome datasets were initially mapped, using fastv, to a variety of mitochondrial genomes corresponding to mammalian species known to have been studied at the WIV [30].
Subsequently, forward reads from different samples were mapped to all mitochondrial genomes in the NCBI database using a custom pipeline termed ‘Mitoscan’ [31]. The output calculates the number of reads mapped and the percentage of genome coverage for each mitochondrial genome, ranking the genomes on the basis of the latter criterion.
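The ranking criterion described above can be sketched as follows. This is not the Mitoscan code itself; the function names and the interval-based input format are assumptions made for illustration. Each mapped read is represented as a half-open (start, end) interval on the reference.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of a genome covered by at least one read.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each interval to the genome boundaries.
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def rank_genomes(mappings):
    """Rank genomes by percent coverage, highest first.

    mappings: {genome_name: (genome_length, [(start, end), ...])}
    Returns a list of (genome_name, reads_mapped, percent_coverage).
    """
    rows = [
        (name, len(reads), 100.0 * covered_fraction(reads, length))
        for name, (length, reads) in mappings.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Ranking by coverage rather than by read count avoids favoring a genome in which many reads pile up on a single conserved region.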

3.3. Mitochondrial rRNA Phylogenetic Analysis

Reads from the transcriptome datasets were identified as corresponding to mitochondrial SSU rRNA using Metaxa2. These were assembled using Megahit [32]. The resulting contigs were used to query the NCBI nr database using Blast [33] in order to determine the closest match. An NGS dataset from a rectal swab sample from Cambodian Rhinolophus shameli (sample ID PH201, accession number SRR17498209) [34] was subjected to the same workflow in order to generate an R. shameli SSU rRNA sequence.
Additional mitochondrial SSU rRNA gene sequences from a range of Rhinolophus spp. and Hipposideros armiger (used as an outgroup) were obtained from the NCBI. The accession numbers were as follows: R. rex (NC_028536.1), R. paradoxolophus (NC_061980.1), R. siamensis (NC_061981.1), R. huananus (NC_061978.1), R. macrotis (NC_026460.1), R. marshalli (NC_061979.1), R. philippinensis (NC_061262.1), R. pumilus (NC_005434.1), R. pusillus (NC_046021.1), R. monoceros (NC_005433.1), R. ferrumequinum subsp. nippon (KT779432.1), R. sinicus subsp. sinicus (KP257597.1), R. thomasi (NC_034306.1), R. affinis subsp. himalayanus (NC_053269.1), R. yunnanensis (NC_036419.1) and H. armiger (NC_018540).
Sequence alignment, model testing and phylogenetic tree construction were conducted using Mega11 [35]. First, a nucleotide alignment was constructed using Muscle [36]. DNA substitution model selection was conducted and the General Time-Reversible (GTR) model [37] was identified as having the best fit to the data using the Akaike Information Criterion [38]. Then, a maximum likelihood analysis was conducted using an estimated gamma parameter and 100 bootstrap replicates.

3.4. Viral Genome Abundance Analysis

The number of forward reads mapping to the RaTG13 genome from the RaTG13 dataset was determined using fastv. Eight novel coronavirus genome sequences in addition to the BtRhCoV-HKU2r anal swab dataset were generated from NGS datasets derived from bat anal swabs from southern China by [25]. Fastv was used to determine the number of reads mapping to these nine coronavirus genomes from their respective NGS datasets.

3.5. Nuclear Genome Mapping

Raw sequences from the RaTG13, BtRhCoV-HKU2r anal swab and splenocyte 1 datasets were mapped to a variety of mammalian nuclear genomes. The Rhinolophus ferrumequinum genome (NCBI accession number GCA_004115265.3) was the closest related bat genome to R. affinis available and was used for mapping. In each case, the most recent assembly was used for mapping. First, the paired reads were trimmed and filtered using fastp [39], with polyX trimming enabled and with reads discarded when >5% of their bases fell below a quality threshold of Q20. Then, the reads were mapped using the splicing-aware mapper BBMap (https://sourceforge.net/projects/bbmap/), using the default parameters and the usemodulo option.
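One reading of the quality filter described above is that a read is discarded when more than 5% of its bases have a Phred score below 20. The sketch below illustrates that rule only; it is not fastp's implementation, and the function name and the Phred+33 encoding are assumptions.

```python
def fails_quality_filter(quality_string, min_phred=20,
                         max_low_fraction=0.05, offset=33):
    """Return True if the read should be discarded.

    quality_string: FASTQ quality line (assumed Phred+33 encoded).
    A read fails when the fraction of bases with Phred < min_phred
    exceeds max_low_fraction.
    """
    if not quality_string:
        return True  # assumption: empty reads are treated as failing
    low = sum(1 for ch in quality_string if ord(ch) - offset < min_phred)
    return low / len(quality_string) > max_low_fraction
```

With a threshold of exactly 5%, a 100-base read with five low-quality bases is retained, while one with six is discarded.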

3.6. Transcriptome Analysis

In order to assess the proportion of reads that mapped to protein-coding genes, reads from the transcriptomes were mapped to a previous version of the Rhinolophus ferrumequinum bat nuclear genome [40] (NCBI accession GCA_014108255.1), as a gff annotation file was not available for the most recent version of the genome assembly (GCA_004115265.3).
The corresponding annotation file, GCA_014108255.1_mRhiFer1.p_genomic.gff, was incompatible with the sam file containing mapped reads due to differences in chromosome naming between the annotation file and the corresponding genome assembly file. This was corrected by editing the sam file so that its chromosome names matched those of the gff annotation file, using the following commands:
sed -i 's/Rhinolophus ferrumequinum isolate mRhiFer1 scaffold_m29_p_[0-9][0-9]*, whole genome shotgun sequence//g' test.sorted.sam
and
sed -i 's/Rhinolophus ferrumequinum isolate mRhiFer1 mitochondrion, complete sequence, whole genome shotgun sequence//g' test.sorted.sam
The corrected sam file was then sorted, converted to bam and indexed using SAMtools [41]. Then, the bedtools [42] multicov function was used to assign the numbers of reads that mapped to different genomic features present in the R. ferrumequinum genome contained in the gff file. ORF-mapping values were extracted from the output file using the term ‘Genbank\sgene’, which produced 19,066 rows. The number of reads per gene was then converted to reads per million (RPM) values for each ORF in R.
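The RPM conversion was carried out in R; the arithmetic is sketched here in Python for illustration. The choice of denominator is an assumption: the text does not state whether the total was all reads in the dataset or only those assigned to genes, so the sketch accepts either.

```python
def reads_per_million(counts, total_reads=None):
    """Convert {gene: read_count} into {gene: RPM}.

    total_reads: denominator for the scaling. If None, the sum of the
    per-gene counts is used (an assumption, not stated in the text).
    """
    if total_reads is None:
        total_reads = sum(counts.values())
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {gene: count * 1_000_000 / total_reads
            for gene, count in counts.items()}
```

Scaling to a common total is what makes per-gene values comparable between datasets of very different sequencing depth.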

3.7. Comparative Transcriptomics

The RPM values for the different datasets were compared using hierarchical cluster analysis (HCA) using the R package pheatmap. Row values were normalized using the R scale function.
In order to identify genes upregulated in RaTG13 and RaTG15 and the remaining seven Clade 7896 datasets, each of these transcriptomes’ RPM values were individually compared to those of the R. sinicus tissue and splenocyte transcriptomes using the R scale function. This produced a value for each gene in each sample, calculated in relation to the mean of all samples, which represented the number of standard deviations from the mean (Z-score). The Z-score represents the amount of overexpression or underexpression of a gene in one sample compared to the average value across samples. The genes with the highest positive Z-score value are those most highly overexpressed.
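The Z-score calculation can be sketched as below, in Python rather than R. The sketch assumes that each gene is standardized across samples using the sample standard deviation (n − 1 denominator), which is what the R scale function uses; the table layout and function names are hypothetical.

```python
from statistics import mean, stdev


def row_z_scores(values):
    """Standardize one gene's RPM values across samples."""
    mu = mean(values)
    sd = stdev(values)  # n - 1 denominator, as in R's scale()
    if sd == 0:
        # Assumption: genes with no variation are given a Z-score of 0.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def top_overexpressed(rpm_table, sample, n=1000):
    """Return the n genes with the highest Z-score in `sample`.

    rpm_table: {gene: {sample_name: rpm}}, same samples for every gene.
    """
    scored = []
    for gene, by_sample in rpm_table.items():
        samples = sorted(by_sample)
        z = row_z_scores([by_sample[s] for s in samples])
        scored.append((z[samples.index(sample)], gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:n]]
```

A gene that is highly expressed in every sample receives a Z-score near zero, so the ranking highlights genes that are unusually abundant in the sample of interest rather than genes that are abundant everywhere.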

3.8. GO Enrichment

The Z-scores were used to identify the 1000 most overexpressed genes from each transcriptome of interest. These genes were annotated with their HUGO Gene Nomenclature Committee (HGNC) descriptors and inputted into the g:GOSt functional enrichment tool [43] (part of the g:Profiler suite [44]), which identified categories of gene products that were over-represented in the 1000-gene list using a cumulative hypergeometric test. These were divided into the three GO sub-ontologies: cellular component (CC), molecular function (MF) and biological process (BP).
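For readers unfamiliar with the test, a cumulative hypergeometric enrichment p-value can be sketched as follows. This illustrates the statistic only; it is not the g:GOSt implementation, and the parameter names are assumptions. The background gene set and the multiple-testing correction applied by the tool are not modeled here.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     size of the query list (e.g. 1000)
    hits:       query genes carrying the GO term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    if hits < 0:
        raise ValueError("hits must be non-negative")
    upper = min(annotated, sample)
    denominator = comb(population, sample)
    numerator = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return numerator / denominator
```

A small p-value indicates that the query list contains more genes from the category than would be expected if the list had been drawn at random from the background.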

4. Results

4.1. Numbers of Reads Matching SSU rRNA

Only 1.8% of forward reads in the RaTG13 dataset match SSU rRNA, in contrast to 20.7% in the BtRhCoV-HKU2r anal swab dataset, 27.4% in the splenocyte 1 transcriptome dataset and 13.4% in the Ebola oral swab dataset (Table 1). This implies that the RaTG13 dataset has undergone rRNA depletion during preparation, probably using the Ribo-Zero procedure, which is part of the TruSeq library preparation protocol. This procedure involves the enzymatic degradation of rRNA from both eukaryotes and bacteria. It is unclear whether the procedure preferentially degrades rRNA from eukaryotes or bacteria, thus altering the ratio of eukaryotic to bacterial SSU rRNA sequences in the dataset.

4.2. Ratio of Eukaryotic to Bacterial SSU rRNA Reads

Considering the ratio of eukaryotic to bacterial SSU rRNA reads reveals marked differences between the datasets. The ratio is 8.3:1 for the RaTG13 dataset, 88.7:1 for the splenocyte transcriptome dataset and 3.7:1 for the Ebola oral dataset (Table 1), indicating that eukaryotic SSU rRNA dominates these samples. In contrast, the BtRhCoV-HKU2r anal swab dataset has a ratio of 1:5.4, indicating that bacterial SSU rRNAs dominate the dataset, as expected with fecal material. The ratio of eukaryotic to bacterial SSU rRNAs in the RaTG13 dataset is therefore inconsistent with that of the BtRhCoV-HKU2r anal swab dataset and with fecal material.
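The ratio notation used above (for example 8.3:1 versus 1:5.4) can be derived from raw read counts as in the short sketch below. The counts shown in the usage lines are hypothetical and chosen only to reproduce the notation; they are not the values in Table 1.

```python
def euk_to_bac_ratio(eukaryotic_reads, bacterial_reads):
    """Format the eukaryotic:bacterial SSU rRNA read ratio.

    The smaller count is normalized to 1, so a eukaryote-dominated sample
    reads "x:1" and a bacteria-dominated sample reads "1:x".
    """
    if eukaryotic_reads <= 0 or bacterial_reads <= 0:
        raise ValueError("both counts must be positive")
    if eukaryotic_reads >= bacterial_reads:
        return f"{eukaryotic_reads / bacterial_reads:.1f}:1"
    return f"1:{bacterial_reads / eukaryotic_reads:.1f}"


# Hypothetical counts, for illustration only.
eukaryote_dominated = euk_to_bac_ratio(830, 100)
bacteria_dominated = euk_to_bac_ratio(100, 540)
```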

4.3. Microbial Analysis of the RaTG13 Dataset

Microbial taxonomic analysis provides a fingerprint that can be used to track the source of a sample by identifying microbial taxa characteristic of the microhabitat from which they were derived [45]. The results of taxonomic analysis for four bat NGS datasets using Metaxa2 are displayed in Table 2.
The RaTG13 dataset is dominated by Lactococcus spp. (64.9% of bacterial SSU rRNA sequences). Lactococcus spp. are lactic acid bacteria and are not characteristic of gut microbiota [46]. Only 4.9% of bacterial SSU rRNA sequences match Escherichia spp. In contrast, 74.1% of bacterial rRNA sequences match Escherichia spp. in the splenocyte 1 transcriptome dataset. The splenocyte 1 transcriptome is dominated by members of Enterobacteriaceae (90.7% of bacterial SSU rRNA reads), which include members of the Escherichia genus. Their presence presumably reflects the contamination of the culture medium, which may be characterized by a bottlenecking effect and consequent low numbers of bacteria and overall diversity. This is consistent with the dataset being dominated by a single bacterial family (Enterobacteriaceae) and with the low proportion of bacterial SSU rRNA sequences overall in the dataset (1.1% of the total SSU rRNA sequences). Consequently, the presence of Escherichia spp. in the RaTG13 dataset by itself may not be indicative of the presence of fecal material per se. It is notable that only 7.4% of the bacterial SSU rRNA reads in the BtRhCoV-HKU2r anal swab dataset belong to members of the Escherichia genus, so they may not be useful for distinguishing fecal material in bats. Unfortunately, currently, there are no studies of the gut microbiome of Rhinolophus spp. available for comparison.
There is an absence of other bacteria characteristic of the gut in the RaTG13 dataset. Helicobacter spp. are diagnostic of the mammalian stomach and intestinal microbiota [47]. In the BtRhCoV-HKU2r anal swab dataset, 0.4% of bacterial SSU rRNA reads correspond to Helicobacter spp., but only 0.005% in the RaTG13 dataset. H. pylori is of low abundance in the human intestine [48], which is consistent with the data for the BtRhCoV-HKU2r anal swab. Micrococcus spp. make up 4.5% of bacterial SSU rRNA reads in the RaTG13 dataset. However, Micrococcus spp. are typically strict aerobes [49], so their presence in the RaTG13 dataset appears inconsistent with a fecal swab, given that the intestines are an anaerobic environment [50].
Members of the Peptostreptococcaceae family are anaerobes found in the human gut, soil and sediments [51]. In the BtRhCoV-HKU2r anal swab dataset, 21.2% of bacterial SSU rRNA reads correspond to the Peptostreptococcaceae family, but only 0.07% in the RaTG13 dataset. Members of the Lachnospiraceae family are some of the most abundant members of the gut and intestinal microbiota in humans [52]. While they account for 6.2% of bacterial SSU reads in the BtRhCoV-HKU2r anal swab dataset, they only constitute 0.7% in the RaTG13 dataset. Finally, Clostridium spp. are a major component of the intestinal tract [53]. They make up 47.8% of the bacterial rRNA sequences in the anal swab dataset but only 0.7% of the RaTG13 dataset. These differences in the proportions of members of the Peptostreptococcaceae and Lachnospiraceae families and Clostridium spp. between the two samples indicate that the microbiota of the RaTG13 sample is inconsistent with fecal material.
The possibility that the RaTG13 dataset was generated from a bat oral swab that was incorrectly labeled was examined. Oral swabs are reported as having been collected by the WIV and the EHA from 2010 to 2015 [54] in a joint EHA-WIV NIAID R01 grant proposal, 1R01AI110964-01, that commenced in 2014 and in other publications by Dr Zheng-li Shi [55]. The issue is pertinent, as bat coronaviruses present in the oral mucosa are more likely to be transmissible via aerosols than those present in higher abundance in fecal swabs. Thus, determining whether RaTG13 was generated from an oral swab would provide a better understanding of the emergence of SARS-CoV-2 (which is transmitted via respiratory droplets and aerosols rather than the fecal route [56]).
A taxonomic analysis of the SSU rRNA reads from the Ebola oral swab dataset is instructive. Firstly, there is a low proportion of reads corresponding to Lactococcus spp. present (0.06% of bacterial SSU rRNA reads), in contrast with the RaTG13 dataset. This is expected, given that lactic acid bacteria are not present in high abundance in the mammalian oral cavity. The Ebola oral swab dataset has a low proportion of Escherichia spp. (0.003% of bacterial SSU rRNA sequences), in contrast to the RaTG13 dataset (4.9% of bacterial SSU rRNA sequences). Escherichia spp. are not expected in the oral cavity, as they are intestinal bacteria, and coprophagy is unreported in bats; consequently, their presence in the RaTG13 dataset is inconsistent with it being generated from an oral swab. Likewise, there are no Micrococcus spp. present (0% of bacterial SSU rRNA reads), in contrast to the RaTG13 dataset.
Members of the Pasteurellaceae family are mostly commensals living on mucosal surfaces, particularly in the upper respiratory tract [57]. The Ebola oral swab dataset is dominated by members of the Pasteurellaceae family (51.3% of bacterial SSU rRNA reads), in contrast to only 0.03% in the RaTG13 dataset. The Haemophilus genus is part of the Pasteurellaceae family, and Haemophilus spp. are characteristic of the upper respiratory tract [58]. They make up 4.4% of bacterial SSU rRNA reads in the Ebola oral swab dataset but only 0.01% in the RaTG13 dataset. The near absence of members of the Pasteurellaceae family in the RaTG13 dataset is inconsistent with it being derived from an oral swab. Members of the Gemella genus are characteristic of the oral microbiota in humans [59]. Gemella spp. account for 0.3% of the bacterial SSU rRNA reads in the Ebola oral swab dataset but are completely absent in the RaTG13 dataset.
While some sequences in the BtRhCoV-HKU2r anal swab dataset might be expected to originate from the bat’s insectivorous diet via carry-through into fecal material [60], only a few arthropod nuclear rRNA sequences were observed in the BtRhCoV-HKU2r, RaTG13 and splenocyte 1 datasets (0.01%, 0.02% and 0.01% of eukaryotic SSU rRNA reads, respectively). However, the Ebola oral swab dataset has substantially more (0.4% of eukaryotic SSU rRNA reads). This is consistent with the insectivorous diet of Miniopterus bats, from which the oral swab was taken. Rhinolophus spp. are also insectivorous, so the lower relative proportion of arthropod SSU rRNA sequences in the RaTG13 sample is an additional inconsistency with an oral swab. The observations described above indicate the differences between the Ebola oral swab microbiota, which is consistent with an oral microhabitat, and the microbiota present in the RaTG13 sample. These data indicate that the RaTG13 sample was not derived from an oral swab.

4.4. Microbial Community Comparison of RaTG13 and Clade 7896

Given that the Clade 7896 NGS datasets were generated from ‘anal’ swabs sampled from Rhinolophid bats from the Mojiang mine during a similar time period (2015) to RaTG13 (2013) and by the same research group, a comparison of the microbial communities in the Clade 7896 datasets was conducted. Firstly, using Metaxa2, it was determined that the eight Clade 7896 datasets were also low in microbial reads, similar to RaTG13 (Table 3).
Given these parallels between the RaTG13 and Clade 7896 datasets, a beta-diversity analysis of the microbial communities was conducted, as described in Methods. Given that both RaTG13 and the Clade 7896 datasets are anomalous in their low proportions of bacterial reads, nine ‘anal’ swab datasets from Li et al. (2020) were included in the comparison. These datasets had a high proportion of bacterial rRNA reads present, as expected for an anal swab (Supplementary Table S1).
The results of the beta-diversity analysis show a clustering of the microbial communities into two clusters, one of which contains RaTG13, RaTG15, Rs7905, Rs7907, Rs7896 and Rs7952 (Figure 1, black circles). The second cluster comprises the datasets from Li et al. (2020), combined with Rs7921, Rs7931 and Rs7924 (open circles). This indicates a similarity between the microbial communities of RaTG13 and five Clade 7896 datasets, which include RaTG15, implying a common source. However, the clustering of three Clade 7896 datasets (Rs7921, Rs7931 and Rs7924) with the datasets from Li et al. (2020) suggests some limitations to the approach.

4.5. Viral Genome Abundance Comparison

A comparison was made of the number of coronavirus reads present in the RaTG13 sample, the BtRhCoV-HKU2r anal swab sample and eight additional bat anal swab samples generated at the WIV by Li et al. [25] (Table 4). These data show that the viral concentration in the RaTG13 sample was relatively low (7.2 × 10^−5 of total reads map to the RaTG13 genome) compared to the nine anal swab samples generated by Li et al. (which include the BtRhCoV-HKU2r anal swab sample), which ranged from 3.0 × 10^−5 to 4.9 × 10^−2 of total reads mapping to the respective coronavirus genomes. There are some differences in how the datasets were generated, but it is unclear whether they would have an effect on viral read abundances. Unfortunately, there are no coronavirus-containing datasets generated from cell lines by the WIV available for comparison. Finally, it was found that raw reads generated from Rhinolophus larvatus (SRA accession SRR11085733) mapped to the BtHiCoV-CHB25 genome and not to Hipposideros pomona, as reported in the Supplementary Materials file msphere.00807-19-st002.xlsx [25].

4.6. Mitochondrial Analysis

Reads from the RaTG13, BtRhCoV-HKU2r anal swab and bat splenocyte transcriptome datasets were mapped to a range of mammalian mitochondrial genomes (Table 5). Reads from the RaTG13 dataset mapped most efficiently to the R. affinis mitochondrial genome, with 75,335 reads mapping with 97.2% coverage, while 18,017 reads mapped with 40.4% coverage to the R. sinicus mitochondrial genome. This implies that the sample originated from R. affinis or a Rhinolophus species more closely related to R. affinis than R. sinicus.
Reads from the BtRhCoV-HKU2r anal swab dataset mapped most efficiently to the R. sinicus mitochondrial genome, with 29.8% coverage and 10,019 reads mapping, in contrast to the R. affinis mitochondrial genome, which mapped with 14.9% coverage and 6278 reads mapping. This indicates that the sample was derived from R. sinicus or a Rhinolophus species more closely related to R. sinicus than to R. affinis, and is consistent with the phylogenetic analysis below. This contradicts the description of the sample as being derived from R. affinis.
Reads from the splenocyte transcriptome dataset mapped most efficiently to the R. sinicus mitochondrial genome, with 94.5% coverage and 170,591 reads mapping, in contrast to the R. affinis mitochondrial genome, which mapped with 32.2% coverage and 88,220 reads mapping. This indicates that the sample was derived from R. sinicus or a Rhinolophus species more closely related to R. sinicus than to R. affinis.
The low proportion of total reads in the RaTG13 dataset mapping to the R. affinis mitochondrial genome (0.3%) is unusual. The RaTG15 dataset also had a low proportion of reads mapping to the R. affinis mitochondrial genome (3.9%). This may indicate that the datasets were derived from tissues with low metabolic demands and hence a low number of mitochondria; mitochondrial content can vary substantially between tissue types, as indicated by the proportion of mitochondrial DNA present [61]. Consequently, the proportion of mitochondrial reads constitutes a useful forensic metric that can be used to infer the type of tissue source.
A more systematic analysis involved mapping the raw datasets to all mitochondrial genomes present in the NCBI database using the Mitoscan pipeline, as described in Methods. This showed that most of the RaTG13 reads mapped to the R. affinis mitochondrial genome with 97.6% coverage (Supplementary File S1), confirming that the sample was generated from R. affinis or a closely related species.
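The two quantities reported in this section, coverage of a reference mitochondrial genome and the proportion of reads mapping to it, can be illustrated with a minimal Python sketch. It assumes that read alignments are already available as 0-based, half-open (start, end) intervals on the reference; the function names and the toy values in the comments are hypothetical and are not part of the Mitoscan pipeline or of the mapping procedure described in Methods.

```python
from typing import Iterable, Tuple


def breadth_of_coverage(intervals: Iterable[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of reference positions covered by at least one alignment.

    Intervals are 0-based, half-open (start, end) spans on the reference.
    Overlapping or adjacent intervals are merged so positions are counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clamp each alignment to the reference boundaries.
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # A gap: close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap or adjacency: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def mapped_fraction(mapped_reads: int, total_reads: int) -> float:
    """Proportion of all reads in a dataset that map to the reference."""
    return mapped_reads / total_reads


# Toy example with invented numbers: three alignments on a 100 bp reference.
toy_breadth = breadth_of_coverage([(0, 10), (5, 20), (30, 40)], 100)
```

Comparing coverage across candidate reference genomes, rather than read counts alone, reduces the influence of a few highly conserved regions that attract reads from related species.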

4.7. Mitochondrial SSU rRNA Phylogenetic Analysis

While mapping to complete mitochondrial genomes gives a convincing indication of the general phylogenetic affinities of the majority of the eukaryotic reads in the NGS datasets, a phylogenetic analysis of mitochondrial SSU rRNA confers more precision. A 1139 bp contig generated by Megahit from SSU rRNA sequences extracted from the RaTG13 dataset using Metaxa2 was found to match R. affinis mitochondrial SSU rRNA (NCBI accession number MT845219) with 98.7% sequence identity, with eight mismatches (Supplementary Figure S1). A maximum likelihood phylogenetic tree indicates that the RaTG13 contig was most closely related to R. affinis mitochondrial SSU rRNA compared to other Rhinolophus species for which full-length mitochondrial SSU rRNA sequences were available (Figure 2).
Mitochondrial SSU rRNA sequences generated by Metaxa2 from the BtRhCoV-HKU2r anal swab dataset were likewise assembled using Megahit. A resulting 960 bp contig aligned to Rhinolophus sinicus sinicus mitochondrial SSU rRNA (accession number KP257597.1), with only one mismatch. This is surprising given that the anal swab sample is described as having been obtained from R. affinis [25] (Supplementary Materials file msphere.00807-19-st002.xlsx), as is the BtRhCoV-HKU2r coronavirus genome sequence (NCBI accession number MN611522). However, it is consistent with the mitochondrial genome mapping results reported above.
Mitochondrial SSU rRNA sequences generated by Metaxa2 from the splenocyte transcriptome 1 dataset were also assembled using Megahit. This produced a contig of 943 bp, which perfectly aligned to R. sinicus sinicus mitochondrial SSU rRNA (KP257597.1), with no mismatches. This is consistent with the database description of the sample as being derived from an R. sinicus cell line.
The almost-perfect match of the BtRhCoV-HKU2r anal swab dataset and the perfect match of the splenocyte transcriptome 1 dataset to R. sinicus sinicus mitochondrial SSU rRNA demonstrate the accuracy of the methodology in generating high-quality mitochondrial SSU rRNA sequences from bat NGS RNA datasets, whether anal swab or cell line. The number of mismatches of the mitochondrial SSU rRNA sequence from the RaTG13 dataset with R. affinis mitochondrial SSU rRNA is, therefore, interesting, as these are unlikely to have arisen as the result of sequencing or assembly errors.
This inference is supported by the observation that an 866 bp contig generated from mitochondrial SSU rRNA sequences extracted from the reverse read dataset (CRR122287_r2) perfectly matched the 1139 bp contig generated from the forward read dataset (CRR122287_f1), where they overlapped. If the mismatches of the 1139 bp contig with the R. affinis mitochondrial SSU rRNA reference sequence were due to sequencing or assembly errors, they would not be observed in the reverse 866 bp contig. In addition, when the 1139 bp contig sequence is aligned with the mitochondrial SSU rRNA sequences of other Rhinolophus species included in the phylogenetic analysis, only three mismatches with the R. affinis sequence are unique, while the other five mismatches are also observed in other species in the alignment (Supplementary Figure S2). This non-random distribution implies that they are true SNPs. These results indicate that although the dataset appears to have undergone rRNA depletion, there are sufficient reads present for accurate rRNA sequence recovery.
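The consistency argument above can be sketched in Python as follows. The sketch assumes that the reference and both contigs have already been placed in a common alignment of equal length, with '-' marking gaps or positions a contig does not cover; the function name and the toy sequences are hypothetical and are not taken from the actual alignments in Supplementary Figures S1 and S2.

```python
from typing import List, Tuple


def compare_aligned(seq_a: str, seq_b: str) -> Tuple[float, List[int]]:
    """Compare two aligned sequences column by column.

    Returns the identity over columns where both sequences have a base,
    plus the 1-based alignment positions at which the bases differ.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches: List[int] = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a == "-" or b == "-":
            continue  # skip gaps and uncovered columns
        compared += 1
        if a != b:
            mismatches.append(pos)
    identity = (compared - len(mismatches)) / compared if compared else float("nan")
    return identity, mismatches


# Toy alignment (invented): one difference from the reference at column 4.
reference = "ACGTACGTAC"
forward_contig = "ACGAACGTAC"
reverse_contig = "---AACGT--"  # shorter contig covering columns 4-8 only

_, forward_mismatches = compare_aligned(reference, forward_contig)
_, reverse_mismatches = compare_aligned(reference, reverse_contig)

# Differences seen independently in both contigs are unlikely to be
# read-specific sequencing or assembly errors.
confirmed = sorted(set(forward_mismatches) & set(reverse_mismatches))
```

In this toy case, the single difference from the reference is recovered from both contigs, which is the pattern expected of a true variant rather than an error confined to one read set.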
The reference mitochondrial sequence for R. affinis (MT845219.1) was generated from the subspecies himalayanus, sampled from Anhui Province [62]. The eight mismatches of the 1139 bp contig to the reference sequence imply that the dataset was derived from a genetically distinct population/subspecies of R. affinis or a closely related cryptic species. This is consistent with the observation that the R. affinis taxon has nine subspecies with marked morphological and echolocation differences and might represent a species complex [63]. However, the only other R. affinis subspecies recorded from the Chinese mainland is R. affinis macrurus [64,65], so this is a candidate for the RaTG13 host. Interesting molecular evidence (using Cox1) for the existence of genetically distinct R. affinis subspecies/cryptic species from SE Asia, including southern China, is described in [66].
Contigs were likewise generated from the forward SSU mitochondrial rRNA reads for Clade 7896 samples. These were described as belonging to R. affinis (RaTG15) and Rhinolophus stheno (remaining samples). Interestingly, the RaTG15 R. affinis sequence is distinct from the RaTG13 R. affinis sequence. This indicates either the coexistence of two R. affinis subspecies within the Mojiang mine or the presence of a cryptic species complex related to R. affinis. It should be noted that, while the branching of the RaTG13 and RaTG15 sequences from R. affinis himalayanus is supported (bootstrap value = 80), the branching of the RaTG13 and RaTG15 sequences from each other is not strongly supported (bootstrap value = 54). The phylogenetic analysis confirms that the RaTG13 and RaTG15 datasets are derived from R. affinis proper or a closely related species.
R. stheno is a species closely related to R. affinis [67,68] and groups accordingly on the tree (Figure 2). Two distinct lineages of R. stheno are revealed by the analysis, which may likewise represent subspecies, or perhaps one clade represents a closely related cryptic species. The existence of potential R. stheno subspecies/cryptic species has also been noted by [66]. The method described here may be used to further explore the phylogenetic diversity of difficult-to-distinguish species of bats using publicly available datasets generated from swabs.
Lastly, it is worth noting the factor of mitochondrial introgression, which refers to the differential increase in the frequency of a mitochondrial haplogroup from one of the parental populations in comparison to its nuclear component. Mitochondrial introgression has been observed in R. affinis [65,69] and urges caution when making species attributions based solely on mitochondrial sequences. This is because observed mitochondrial haplogroups may be introgressed from another species or subspecies, confusing species attribution. A solution to this is to consider nuclear sequences.

4.8. Nuclear Genome Mapping Analysis

In order to identify the origin of the bulk of reads in the RaTG13 dataset, of apparent eukaryotic origin, they were mapped to a variety of mammalian nuclear genomes, corresponding to species for which cell lines were known to be in use at the WIV, as well as Rhinolophus ferrumequinum, which is the bat genome most closely related to R. affinis available (Table 6). The results show that the reads most efficiently map to the R. ferrumequinum genome, with 87.5% of reads mapping. An even higher percentage of reads would be expected to map to the exact Rhinolophus species used to generate the RaTG13 sample (R. affinis or closely related species, as identified in the phylogenetic analysis above), assuming the assembly qualities are comparable. Mapping to other species represents cross-mapping due to sequence conservation between mammalian genomes.
The high percentage of reads mapping to R. ferrumequinum is inconsistent with a fecal swab, which is expected to have a majority of reads mapping to bacterial sequences. Consistent with this expectation, only 2.6% of the reads from the BtRhCoV-HKU2r anal swab sample mapped to the R. ferrumequinum genome.
The Clade 7896 datasets also showed a high proportion of reads mapping to the R. ferrumequinum genome (79.0–92.2%, Supplementary Table S2). This is inconsistent with an anal swab origin for these datasets, a finding in common with the RaTG13 dataset.
Further analysis of the RaTG13 reads that mapped to the R. ferrumequinum genome shows that 62.1% map to protein-coding genes, and 92.1% of protein-coding genes have at least one read mapping to them. These data indicate that the RaTG13 sample represents a bat transcriptome. In addition, the result indicates that the sample did not have large amounts of DNA present, as this would lead to mapping to the non-protein-coding parts of the genome, which constitute the large majority of the bat genome. This indicates that the sample was subjected to DNase treatment, an optional step in the QIAamp Viral RNA Mini Kit.

4.9. Transcriptome Comparison

RPM values for protein-coding genes were calculated for the datasets that had high mapping values to the R. ferrumequinum genome (RaTG13, Clade 7896, two R. sinicus splenocyte samples and the R. sinicus tissue samples), as described in Methods (Supplementary File S2). While cross-species mapping is not as precise as mapping to the genome of the species from which the RNA was generated, the high mapping values obtained indicate good sequence conservation between species, and it was not deemed necessary to decrease the sequence match criteria in order to enhance mapping. Cross-species mapping is a more accurate method than de novo transcriptome assembly for transcriptome analysis when a genome assembly for the species used to generate the RNA-seq data is not available [70].
RPM Z-scores were then used to generate an HCA plot (Figure 3, Supplementary File S3). From the plot, it can be observed that RaTG13 groups with RaTG15. This indicates that they are from the same bat source. Likewise, the remaining 7896 clade datasets, which were all generated from R. stheno, group together, indicating a common origin. None of these datasets group with any of the R. sinicus tissue samples, indicating that they are from a different source. RaTG13 and RaTG15 group with the two splenocyte samples but not the spleen sample, potentially indicating either a splenocyte origin of the two samples or that they are from a source not present in the plot that clusters with splenocyte transcriptomes by default. The clustering of the two splenocyte transcriptomes together is an indication of the accuracy of the analysis.
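The normalization steps that precede the clustering can be sketched in Python as shown below. Two assumptions are made that are not confirmed by this section: RPM is taken as the read count of a gene divided by the total number of mapped reads in the dataset, scaled to one million, and Z-scores are computed for each gene across datasets using the population standard deviation. The exact definitions used are those given in Methods; the function names here are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict


def reads_per_million(gene_counts: Dict[str, int], total_mapped_reads: int) -> Dict[str, float]:
    """Scale raw per-gene read counts to reads per million mapped reads."""
    scale = 1_000_000 / total_mapped_reads
    return {gene: count * scale for gene, count in gene_counts.items()}


def z_scores_across_samples(rpm_by_sample: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Standardize each gene across samples (assumed scheme, see lead-in).

    rpm_by_sample maps sample name -> {gene: RPM}. Only genes present in
    every sample are standardized. A gene with zero variance gets Z = 0.
    """
    samples = list(rpm_by_sample)
    shared_genes = set.intersection(*(set(rpm_by_sample[s]) for s in samples))
    result: Dict[str, Dict[str, float]] = {s: {} for s in samples}
    for gene in shared_genes:
        values = [rpm_by_sample[s][gene] for s in samples]
        mu, sigma = mean(values), pstdev(values)
        for sample, value in zip(samples, values):
            result[sample][gene] = 0.0 if sigma == 0 else (value - mu) / sigma
    return result
```

Standardizing each gene in this way puts highly and weakly expressed genes on a common scale, so that a clustering of datasets is not dominated by a handful of abundant transcripts.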

4.10. GO Enrichment Analysis

The Z-score values were used for GO enrichment analysis to determine categories of genes upregulated in the RaTG13 and Clade 7896 datasets (Supplementary Table S3 and Figure S3), as described in Methods. For RaTG13, in the cellular component sub-ontology, there were a striking number of GO terms directly associated with sperm, which included sperm flagellum (GO: 0036126), sperm midpiece (GO: 0097225), sperm head (GO: 0061827) and perinuclear theca (GO: 0033011). There were also terms associated with cilia (the sperm flagellum is a specialized cilium), such as 9 + 2 motile cilium (GO: 0097729), motile cilium (GO: 0031514), cilium (GO: 0005929), stereocilium bundle (GO: 0032421) and keratin filament (GO: 0045095) (found in the sperm acroplaxome [71]).
In the molecular function sub-ontology, melatonin receptor (GO: 0008502) and olfactory receptor (GO: 0004984) were upregulated. Melatonin acts as a protectant for sperm, which are known to possess melatonin receptors [72], while olfactory receptor genes are overexpressed in sperm and play a role in chemotaxis, allowing the sperm to locate the egg [73]. Other categories associated with signaling were transmembrane signaling receptor activity (GO: 0004888), signaling receptor activity (GO: 0038023) and molecular transducer activity (GO: 0060089).
In the biological process sub-ontology, four groups of related GO categories were apparent: firstly, functions generally related to reproduction (GO: 0022414, GO: 0000003, GO: 0022412, GO: 0032504, GO: 0019953, GO: 0048609, GO: 0007276, GO: 0007281, GO: 0003006); secondly, functions related to meiosis (GO: 1903046, GO: 0051321, GO: 2000243, GO: 0140013, GO: 0061982, GO: 0007127, GO: 2000241, GO: 0007130); thirdly, functions related to gamete production and sperm (GO: 0007283, GO: 0048232, GO: 0048515, GO: 0007286, GO: 0060046); and lastly, functions related to olfaction and chemo-detection (GO: 0007600, GO: 0051606, GO: 0007606, GO: 0050906, GO: 0050907, GO: 0007608, GO: 0050911, GO: 0009593, GO: 0050896).
Similar results were observed for the eight Clade 7896 datasets, with the upregulation of genes involved in sperm and olfaction/chemo-detection (Supplementary Figure S3). The similarities in the results between the seven Clade 7896 datasets generated from R. stheno and RaTG13/RaTG15 generated from two R. affinis-related lineages were notable.
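For readers unfamiliar with GO enrichment, the underlying idea of an over-representation test can be sketched in Python. This is a generic one-sided hypergeometric test written for illustration, not the procedure used here (which is described in Methods), and it omits the multiple-testing correction that any real analysis across many GO terms requires. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, selected: int, hits: int) -> float:
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    selected:   genes in the list being tested (e.g. upregulated genes)
    hits:       selected genes that carry the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value for a term such as sperm flagellum would mean that more of the upregulated genes carry that annotation than expected if genes were drawn at random from the background.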

5. Discussion

Metaxa2 analysis identifies a low proportion of microbial reads in the RaTG13 dataset (Table 3), which is unexpected for a fecal swab sample, confirming the observations of multiple investigators [17,19,20,21,22]. A low proportion of microbial reads is also observed in the Clade 7896 datasets (Table 3). Consistent with this, the genome mapping results (Table 6) indicate that the majority of reads of RaTG13 are likely of bat origin and not microbial in nature. Similar mapping results were also observed for the Clade 7896 datasets (Supplementary Table S2). In addition, it was determined that the taxonomic assignments of the microbial reads present in RaTG13 (Table 2) were not consistent with gut microbiota. In agreement with this, a beta-diversity analysis showed that the microbial community of the RaTG13 dataset was not closely related to the bat anal swab datasets generated by Li et al. [25] and instead grouped with the microbial communities of five Clade 7896 datasets (Figure 1).
These results stimulate the question of the true origin of both the RaTG13 and Clade 7896 datasets, given that they are inconsistent with fecal/anal swabs. The validation of the RaTG13 dataset is of importance, given that RaTG13 is the closest (phylogenomic) relative of SARS-CoV-2, and the origin, evolution and biology of RaTG13 can help to inform the origin of SARS-CoV-2.
The phylogenetic analysis of assembled mitochondrial SSU rRNA contigs shows that the originating bat species are consistent with the reported species assignments of the RaTG13 and Clade 7896 datasets (Figure 2). RaTG13 and RaTG15 were attributed to R. affinis, and phylogenetic analysis showed that they were derived from either two distinct subspecies of R. affinis or a closely related species. Likewise, the remaining seven Clade 7896 datasets were shown to be derived from two distinct Rhinolophus sp. lineages, with R. shameli as a sister species. From phylogenetic analysis, this placement is consistent with R. stheno [67,68]; however, unfortunately, no mitochondrial SSU rRNA sequences from R. stheno are available for confirmation. The two clades into which the seven mitochondrial SSU rRNAs are divided indicate that they may comprise two distinct subspecies coexisting within the mine or perhaps two closely related cryptic species. The pipeline described here may have utility in the species discovery of bats and other vertebrates for which RNA swab samples are available, given the high quality of assembled sequences that can be produced. A caveat is that this approach may be challenging for true fecal samples, given the low quantity of host RNA present, so a swab with a high proportion of host RNA is needed.
The HCA analysis indicates that RaTG13, RaTG15 and the rest of Clade 7896 were generated from unusual sources, as they are distinct from the major tissue types of R. sinicus (Figure 3). The RaTG13 dataset shows a significant upregulation of genes associated with sperm, reproduction and chemo-detection/olfaction (Supplementary Table S3), which, along with the other data described here, is inconsistent with a fecal swab. A likely tissue source for the dataset is a mating plug (also known as a copulatory or vaginal plug), which is composed of congealed semen and blocks the vagina after copulation, remaining until parturition [74]. Olfactory receptors are highly expressed in human semen and have a role in chemotaxis (allowing the sperm to locate the egg in the female reproductive tract) [73]. This is likely the case in other mammals.
Female Rhinolophid bats, like many other mammals, use mating plugs [75,76]. Their purpose is unclear, but they are likely used either to store sperm (to release at an appropriate time for egg fertilization) [77] or to act as a form of mate guarding by the male that contributes the plug, blocking mating by other males [78]. While in R. ferrumequinum, there is evidence that the sperm constituting the plug are dead, the animals have been shown to store live sperm in their oviduct for fertilization in spring [79].
As the eight Clade 7896 datasets were also likely generated from mating plugs, given the similar GO enrichment analysis results to RaTG13 (Supplementary Table S3 and Figure S3), it would appear that only females were sampled to generate all nine datasets. Female R. ferrumequinum bats roost together after mating in maternity roosts [80], as do other Rhinolophid bats. Hence, when sampling a particular roost, it may be entirely female, which can explain the apparent female bias of the datasets. While pregnant female R. affinis have been collected in April–May and in October on the Malay Peninsula [81], the mating period for R. affinis in Yunnan Province is not clear. RaTG13 was sampled in July 2013, while RaTG15 and the rest of the 7896 clade were sampled in May 2015. Rhinolophus spp. tend to gestate and give birth in the spring/summer in more seasonal latitudes [76], and this is likely also true for the Yunnan region.
While it would be desirable to determine the sex of the sampled animals from the NGS datasets, mRNA expression is most reliable for determining sex if the reproductive tissue is sampled, such as the vagina. This is because sex-specific gene expression is most pronounced in reproductive tissues [82]. However, while the NGS datasets show the clear upregulation of genes associated with sperm, as discussed, this likely indicates a mating plug generated by a male but resident in a female. Due to the nature of the sample, it is difficult to directly establish that the sampled animals were female (although mating plugs have a vaginal location). An alternative explanation, that a male with sperm on its skin was sampled, does not seem likely for all nine datasets.
Indirect evidence of a female (vaginal) contribution is provided by the high proportion of lactic acid bacteria (Lactococcus spp.) in the RaTG13 dataset (Table 2). Lactic acid bacteria are abundant in the human vagina, where they are responsible for producing the acidic environment [83,84]. This is likely also the case in other mammals, including bats. A mating plug would be expected to have vaginal microbiota, such as lactic acid bacteria, associated with it.
Mis-sampling the vagina instead of the anus, with the former possessing a mating plug, is a plausible explanation for the observations outlined here, given the proximity of the two orifices. Mating plugs in Rhinolophid bats have been described as ‘huge’ and ‘massive’ [75]. A positive assignment of a tissue source for RaTG13 is useful in understanding its tissue tropism: the data described here indicate that it is associated with the genito-urinary tract. However, this observation does not rule out a gut location for the virus also.
Previously, the author proposed in a preprint that RaTG13 was generated from either a cell line or tissue from R. affinis or a related species [85]. The preprint included some of the analyses presented here but did not assess the Clade 7896 datasets and lacked the transcriptomic and GO enrichment analyses. Here, transcriptomic analysis reveals the overexpression of sperm- and olfaction-related genes, which is more consistent with a mating plug tissue source. Cell-division-associated transcripts were previously taken as indicative of a cell line provenance of the dataset; however, the more detailed analysis described here indicates that these are associated with spermatogenesis.
A natural origin of the RaTG13 background reads supports a natural origin of the RaTG13 sequence itself. It should be noted, however, that while the analysis described here focuses entirely on the background reads, unequivocal validation of the RaTG13 virus sequences themselves is not possible without additional information, such as a timestamped hash derived from the original dataset. Immutable timestamping could be achieved by depositing the hash on a blockchain.
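As an illustration of what deriving a hash from a dataset involves, the following Python sketch computes a SHA-256 digest of a file in fixed-size chunks using the standard library. The choice of SHA-256 and the function names are assumptions made for the example; no particular hash function or deposition service is specified here.

```python
import hashlib
import io


def sha256_of_stream(stream: io.BufferedIOBase, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Reading in chunks keeps memory use constant for multi-gigabyte files.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)
```

The digest alone carries no date. It becomes evidence of a file's existence at a given time only once it has been recorded with an independent party that the data holder cannot alter retrospectively, which is the role a blockchain deposit would play.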
A validated natural origin of RaTG13 informs its comparison to the SARS-CoV-2 sequence. In particular, it emphasizes the puzzling presence of two features in SARS-CoV-2: the furin cleavage site (FCS) insertion [3] and the evenly spaced pattern of BsaI/BsmBI restriction sites [86]. The FCS is absent in RaTG13, and it is hard to understand which molecular process would lead to a functional sequence of four amino acids inserted into exactly the right location in the SARS-CoV-2 spike protein. Likewise, the BsaI/BsmBI restriction sites in SARS-CoV-2 are suitable for the construction of an infectious clone, but the BsaI/BsmBI sites present in RaTG13 are not. To convert the RaTG13 sequence to SARS-CoV-2 requires the removal of three BsaI sites and two BsmBI sites and the insertion of two BsaI sites and two BsmBI sites. As Bruttel et al. note, this is hard to explain via the mutational process [86]. Notably, in order to artificially insert an FCS into a coronavirus, the construction of an infectious clone would be necessary. In addition, at the positions of two key mutations that enhance SARS-CoV-2 infectivity in the human host, T372A [87] and N519H [88], RaTG13 carries the same residues as other related sarbecoviruses (T372 and N519). Given a natural origin of RaTG13, these two mutations can be seen to have been important in facilitating the colonization of the human respiratory tract from an original bat genito-urinary/intestinal location.
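The comparison of restriction site patterns between two genomes can be illustrated with a short Python sketch that locates BsaI (GGTCTC) and BsmBI (CGTCTC) recognition sequences on both strands of a cDNA sequence. The function names and the toy sequence are hypothetical, and this is not the method of Bruttel et al. [86]. Both enzymes are type IIS and cut outside their recognition sequence, so the spacing computed here from recognition-site positions only approximates true fragment lengths.

```python
from typing import Dict, List

# Recognition sequences (5'->3') of the two type IIS enzymes discussed.
RECOGNITION_SITES: Dict[str, str] = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_sites(genome: str, site: str) -> List[int]:
    """1-based start positions of a recognition site on either strand."""
    genome = genome.upper()
    positions = set()
    for pattern in {site, reverse_complement(site)}:
        start = genome.find(pattern)
        while start != -1:
            positions.add(start + 1)
            start = genome.find(pattern, start + 1)
    return sorted(positions)


def site_spacing(genome_length: int, positions: List[int]) -> List[int]:
    """Segment lengths of a linear genome split at recognition-site starts."""
    boundaries = [0] + [p - 1 for p in positions] + [genome_length]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


# Toy sequence (invented): one BsaI site and one reverse-strand BsmBI site.
toy = "AAGGTCTCAAAAGAGACGTT"
toy_sites = {name: find_sites(toy, seq) for name, seq in RECOGNITION_SITES.items()}
```

Running the same scan on two aligned genomes and comparing the resulting position lists shows which sites would need to be gained or lost to convert one pattern into the other.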
Finally, it is hoped that the work described here illustrates that the validation of high-value datasets using forensic genomics approaches is feasible and that the methods may be of value in other forensic studies of NGS datasets.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/microbiolres15030119/s1, Figure S1: Alignment of mitochondrial SSU rRNA sequence assembled from the RaTG13 sample with the R. affinis reference sequence; Figure S2: Multiple sequence alignment of mitochondrial SSU rRNA sequences from Rhinolophus species; Figure S3: GO enrichment analysis of the Clade 7896 datasets; Table S1: Numbers of eukaryotic and bacterial SSU rRNA reads present in the anal swab NGS datasets from Li et al. (2020); Table S2: Mapping of Clade 7896 reads against the R. ferrumequinum nuclear genome; Table S3: GO enrichment analysis of the RaTG13 dataset; File S1: Mapping statistics of the RaTG13 dataset against all NCBI mitochondrial genomes; File S2: RPM values for the Rhinolophus spp. NGS datasets; File S3: Z-scores for the RaTG13 and Clade 7896 datasets.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article or Supplementary Materials.

Acknowledgments

This work is an example of ‘Decentralized Science’ (DeSci), the result of discussions on Twitter/X with members of the DRASTIC research collective and friends and with the Paris Group. The Master’s thesis of Yu Ping was identified and translated by ‘The Seeker’ @TheSeeker268 and Francisco de Asis. The EcoHealth-WIV NIAID grant 1R01AI110964-01 was made available through a FOIA request made by The Intercept (theintercept.com).

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Holmes, E.C.; Goldstein, S.A.; Rasmussen, A.L.; Robertson, D.L.; Crits-Christoph, A.; Wertheim, J.O.; Anthony, S.J.; Barclay, W.S.; Boni, M.F.; Doherty, P.C.; et al. The origins of SARS-CoV-2: A critical review. Cell 2021, 184, 4848–4856.
  2. Sirotkin, K.; Sirotkin, D. Might SARS-CoV-2 Have Arisen via Serial Passage through an Animal Host or Cell Culture?: A potential explanation for much of the novel coronavirus’ distinctive genome. Bioessays 2020, 42, e2000091.
  3. Segreto, R.; Deigin, Y. The genetic structure of SARS-CoV-2 does not rule out a laboratory origin: SARS-COV-2 chimeric structure and furin cleavage site might be the result of genetic manipulation. Bioessays 2021, 43, e2000240.
  4. Li, L.-L.; Wang, J.-L.; Ma, X.-H.; Sun, X.-M.; Li, J.-S.; Yang, X.-F.; Shi, W.-F.; Duan, Z.-J. A novel SARS-CoV-2 related coronavirus with complex recombination isolated from bats in Yunnan province, China. Emerg. Microbes Infect. 2021, 10, 1683–1690.
  5. Delaune, D.; Hul, V.; Karlsson, E.A.; Hassanin, A.; Ou, T.P.; Baidaliuk, A.; Gámbaro, F.; Prot, M.; Tan Tu, V.; Chea, S.; et al. A novel SARS-CoV-2 related coronavirus in bats from Cambodia. Nat. Commun. 2021, 12, 6563.
  6. Zhou, H.; Ji, J.; Chen, X.; Bi, Y.; Li, J.; Wang, Q.; Hu, T.; Song, H.; Zhao, R.; Chen, Y.; et al. Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses. Cell 2021, 184, 4380–4391.e14.
  7. Temmam, S.; Vongphayloth, K.; Baquero, E.; Munier, S.; Bonomi, M.; Regnault, B.; Douangboubpha, B.; Karami, Y.; Chrétien, D.; Sanamxay, D.; et al. Bat coronaviruses related to SARS-CoV-2 and infectious for human cells. Nature 2022, 604, 330–336.
  8. Hassanin, A.; Rambaud, O. Retracing Phylogenetic, Host and Geographic Origins of Coronaviruses with Coloured Genomic Bootstrap Barcodes: SARS-CoV and SARS-CoV-2 as Case Studies. Viruses 2023, 15, 406.
  9. Zhou, P.; Yang, X.-L.; Wang, X.-G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.-R.; Zhu, Y.; Li, B.; Huang, C.-L.; et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020, 579, 270–273.
  10. Zhou, P.; Yang, X.-L.; Wang, X.-G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.-R.; Zhu, Y.; Li, B.; Huang, C.-L.; et al. Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin. bioRxiv 2020.
  11. Ge, X.-Y.; Wang, N.; Zhang, W.; Hu, B.; Li, B.; Zhang, Y.-Z.; Zhou, J.-H.; Luo, C.-M.; Yang, X.-L.; Wu, L.-J.; et al. Coexistence of multiple coronaviruses in several bat colonies in an abandoned mineshaft. Virol. Sin. 2016, 31, 31–40.
  12. Segreto, R. Is Considering a Genetic-Manipulation Origin for SARS-CoV-2 a Conspiracy Theory That Must Be Censored? ResearchGate. 2020. Available online: https://www.researchgate.net/publication/340924249_Is_considering_a_genetic-manipulation_origin_for_SARS-CoV-2_a_conspiracy_theory_that_must_be_censored (accessed on 30 July 2024).
  13. Rahalkar, M.C.; Bahulikar, R.A. Lethal Pneumonia Cases in Mojiang Miners (2012) and the Mineshaft Could Provide Important Clues to the Origin of SARS-CoV-2. Front. Public Health 2020, 8, 581569.
  14. Zhou, P.; Yang, X.L.; Wang, X.G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.R.; Zhu, Y.; Li, B.; Huang, C.L.; et al. Addendum: A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020, 588, E6.
  15. Xu, L. The Analysis of Six Patients with Severe Pneumonia Caused by Unknown Viruses. Ph.D. Thesis, Kunming Medical University, Kunming, China, 2013.
  16. Huang, C. Novel Virus Discovery in Bat and the Exploration of Receptor of Bat Coronavirus HKU9. Ph.D. Thesis, National Institute for Viral Disease Control and Prevention, Beijing, China, 2016.
  17. Rahalkar, M.; Bahulikar, R. The anomalous nature of the fecal swab data, receptor binding domain and other questions in RaTG13 genome. Preprints 2020, 2020080205.
  18. Lin, X.; Chen, S. Major concerns on the identification of bat Coronavirus strain RaTG13 and quality of related Nature paper. Preprints 2020, 2020060044.
  19. Singla, M.; Ahmad, S.; Gupta, C.; Sethi, T. De-novo assembly of RaTG13 genome reveals inconsistencies further obscuring SARS-CoV-2 origins. Preprints 2020, 2020080595.
  20. Deigin, Y.; Segreto, R. SARS-CoV-2’s claimed natural origin is undermined by issues with genome sequences of its relative strains: Coronavirus sequences RaTG13, MP789 and RmYN02 raise multiple questions to be critically addressed by the scientific community. Bioessays 2021, 43, e2100015.
  21. Bostickson, B.; Ghannam, Y. 2. INVESTIGATION OF RaTG13 AND THE 7896 CLADE. 2021, Unpublished. Available online: https://doi.org/10.13140/RG.2.2.22382.33607 (accessed on 30 July 2024).
  22. Zhang, D. Anomalies in BatCoV/RaTG13 Sequencing and Provenance; Zenodo: Geneva, Switzerland, 2020.
  23. Ping, Y. Geographic Evolution of Bat SARS-Related Coronaviruses; Shi, Z., Jie, C., Eds.; Wuhan Institute of Virology: Wuhan, China, 2019.
  24. He, K.; Fujiwara, H.; Zajac, C.; Sandford, E.; Reddy, P.; Choi, S.W.; Tewari, M. A Pipeline for Faecal Host DNA Analysis by Absolute Quantification of LINE-1 and Mitochondrial Genomic Elements Using ddPCR. Sci. Rep. 2019, 9, 5599.
  25. Li, B.; Si, H.-R.; Zhu, Y.; Yang, X.-L.; Anderson, D.E.; Shi, Z.-L.; Wang, L.-F.; Zhou, P. Discovery of Bat Coronaviruses through Surveillance and Probe Capture-Based Next-Generation Sequencing. mSphere 2020, 5, e00807-19.
  26. Xie, J.; Li, Y.; Shen, X.; Goh, G.; Zhu, Y.; Cui, J.; Wang, L.-F.; Shi, Z.-L.; Zhou, P. Dampened STING-Dependent Interferon Activation in Bats. Cell Host Microbe 2018, 23, 297–301.e4.
  27. Briese, T.; Kapoor, A.; Mishra, N.; Jain, K.; Kumar, A.; Jabado, O.J.; Lipkin, W.I. Virome Capture Sequencing Enables Sensitive Viral Diagnosis and Comprehensive Virome Analysis. mBio 2015, 6, e01491-15.
  28. Guo, H.; Hu, B.; Si, H.-R.; Zhu, Y.; Zhang, W.; Li, B.; Li, A.; Geng, R.; Lin, H.-F.; Yang, X.-L.; et al. Identification of a novel lineage bat SARS-related coronaviruses that use bat ACE2 receptor. Emerg. Microbes Infect. 2021, 10, 1507–1514.
  29. Bengtsson-Palme, J.; Hartmann, M.; Eriksson, K.M.; Pal, C.; Thorell, K.; Larsson, D.G.J.; Nilsson, R.H. METAXA2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Mol. Ecol. Resour. 2015, 15, 1403–1414.
  30. Chen, S.; He, C.; Li, Y.; Li, Z.; Melançon, C.E. A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data. Brief. Bioinform. 2021, 22, 924–935.
  31. Scanning-NGS-Datasets-for-Mitochondrial-and-Coronavirus-Contaminants. Available online: https://github.com/semassey/Scanning-NGS-datasets-for-mitochondrial-and-coronavirus-contaminants (accessed on 30 July 2024).
  32. Li, D.; Liu, C.-M.; Luo, R.; Sadakane, K.; Lam, T.-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015, 31, 1674–1676.
  33. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
  34. Zhu, F.; Duong, V.; Lim, X.F.; Hul, V.; Chawla, T.; Keatts, L.; Goldstein, T.; Hassanin, A.; Tu, V.T.; Buchy, P.; et al. Presence of Recombinant Bat Coronavirus GCCDC1 in Cambodian Bats. Viruses 2022, 14, 176.
  35. Kumar, A.; Choudhury, B.; Dayanandan, S.; Khan, M.L. Molecular Genetics and Genomics Tools in Biodiversity Conservation; Springer: Berlin/Heidelberg, Germany, 2022.
  36. Applied Research Press. MUSCLE: A Multiple Sequence Alignment Method with Reduced Time and Space Complexity; CreateSpace Independent Publishing Platform: Scotts Valley, CA, USA, 2015.
  37. Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 1986, 17, 57.
  38. Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. Sel. Pap. Hirotugu Akaike 1998, 199–213.
  39. Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890.
  40. Jebb, D.; Huang, Z.; Pippel, M.; Hughes, G.M.; Lavrichenko, K.; Devanna, P.; Winkler, S.; Jermiin, L.S.; Skirmuntt, E.C.; Katzourakis, A.; et al. Six reference-quality genomes reveal evolution of bat adaptations. Nature 2020, 583, 578–584.
  41. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079.
  42. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842.
  43. Kolberg, L.; Raudvere, U.; Kuzmin, I.; Adler, P.; Vilo, J.; Peterson, H. g:Profiler-interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 2023, 51, W207–W212.
  44. Gprofiler. Available online: https://biit.cs.ut.ee/gprofiler/gost (accessed on 30 July 2024).
  45. Massey, S.E. Comparative Microbial Genomics and Forensics. Microbiol. Spectr. 2016, 4, 4. [Google Scholar] [CrossRef] [PubMed]
  46. Louis, P.; Scott, K.P.; Duncan, S.H.; Flint, H.J. Understanding the effects of diet on bacterial metabolism in the large intestine. J. Appl. Microbiol. 2007, 102, 1197–1208. [Google Scholar] [CrossRef] [PubMed]
  47. Péré-Védrenne, C.; Flahou, B.; Loke, M.F.; Ménard, A.; Vadivelu, J. Other Helicobacters, gastric and gut microbiota. Helicobacter 2017, 22 (Suppl. S1), e12407. [Google Scholar] [CrossRef]
  48. Andersson, A.F.; Lindberg, M.; Jakobsson, H.; Bäckhed, F.; Nyrén, P.; Engstrand, L. Comparative analysis of human gut microbiota by barcoded pyrosequencing. PLoS ONE 2008, 3, e2836. [Google Scholar] [CrossRef]
  49. Schleifer, K.H.; Kloos, W.E.; Kocur, M. The genus Micrococcus. In The Prokaryotes; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar] [CrossRef]
  50. Friedman, E.S.; Bittinger, K.; Esipova, T.V.; Hou, L.; Chau, L.; Jiang, J.; Mesaros, C.; Lund, P.J.; Liang, X.; FitzGerald, G.A.; et al. Microbes vs. chemistry in the origin of the anaerobic gut lumen. Proc. Natl. Acad. Sci. USA 2018, 115, 4170–4175. [Google Scholar] [CrossRef]
  51. Slobodkin, A. The Family Peptostreptococcaceae in the Prokaryotes; Rosenberg, E., DeLOng, E.F., Lory, S., Stackebrandt, E., Thompson, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 291–302. [Google Scholar]
  52. DeLong, E.F.; Lory, S.; Stackebrandt, E.; Thompson, F. The Prokaryotes: Firmicutes and Tenericutes; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  53. Guo, P.; Zhang, K.; Ma, X.; He, P. Clostridium species as probiotics: Potentials and challenges. J. Anim. Sci. Biotechnol. 2020, 11, 24. [Google Scholar] [CrossRef] [PubMed]
  54. Latinne, A.; Hu, B.; Olival, K.J.; Zhu, G.; Zhang, L.; Li, H.; Chmura, A.A.; Field, H.E.; Zambrana-Torrelio, C.; Epstein, J.H.; et al. Origin and cross-species transmission of bat coronaviruses in China. Nat. Commun. 2020, 11, 4235. [Google Scholar] [CrossRef]
  55. Ge, X.-Y.; Li, J.-L.; Yang, X.-L.; Chmura, A.A.; Zhu, G.; Epstein, J.H.; Mazet, J.K.; Hu, B.; Zhang, W.; Peng, C.; et al. Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. Nature 2013, 503, 535–538. [Google Scholar] [CrossRef]
  56. Zhou, L.; Ayeh, S.K.; Chidambaram, V.; Karakousis, P.C. Modes of transmission of SARS-CoV-2 and evidence for preventive behavioral interventions. BMC Infect. Dis. 2021, 21, 496. [Google Scholar] [CrossRef]
  57. Christensen, H.; Kuhnert, P.; Nørskov-Lauritsen, N.; Planet, P.J.; Bisgaard, M. The Family Pasteurellaceae. In The Prokaryotes: Gammaproteobacteria; Rosenberg, E., DeLong, E.F., Lory, S., Stackebrandt, E., Thompson, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 535–564. [Google Scholar]
  58. Fink, D.L.; Geme, J.W. The Genus Haemophilus. In The Prokaryotes: A Handbook on the Biology of Bacteria Volume 6: Proteobacteria: Gamma Subclass; Dworkin, M., Falkow, S., Rosenberg, E., Schleifer, K.-H., Stackebrandt, E., Eds.; Springer: New York, NY, USA, 2006; pp. 1034–1061. [Google Scholar]
  59. Dewhirst, F.E.; Chen, T.; Izard, J.; Paster, B.J.; Tanner, A.C.R.; Yu, W.-H.; Lakshmanan, A.; Wade, W.G. The human oral microbiome. J. Bacteriol. 2010, 192, 5002–5017. [Google Scholar] [CrossRef]
  60. Bourgarel, M.; Noël, V.; Pfukenyi, D.; Michaux, J.; André, A.; Becquart, P.; Cerqueira, F.; Barrachina, C.; Boué, V.; Talignani, L.; et al. Next-Generation Sequencing on Insectivorous Bat Guano: An Accurate Tool to Identify Arthropod Viruses of Potential Agricultural Concern. Viruses 2019, 11, 1102. [Google Scholar] [CrossRef] [PubMed]
  61. Osorio, D.; Cai, J.J. Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control. Bioinformatics 2021, 37, 963–967. [Google Scholar] [CrossRef] [PubMed]
  62. Ding, Y.; Chen, W.; Mao, X. The complete mitochondrial genome of Rhinolophus affinis himalayanus. Mitochondrial DNA B Resour. 2021, 6, 164–165. [Google Scholar] [CrossRef]
  63. Ith, S.; Bumrungsri, S.; Furey, N.M.; Bates, P.J.; Wonglapsuwan, M.; Khan, F.A.A.; Thong, V.D.; Soisook, P.; Satasook, C.; Thomas, N.M. Taxonomic implications of geographical variation in Rhinolophus affinis (Chiroptera: Rhinolophidae) in mainland Southeast Asia. Zool. Stud. 2015, 54, e31. [Google Scholar] [CrossRef]
  64. Tan, S.; Shen, Y.; Sordoni, A.; Courville, A.; O’donnell, T.J. Recursive Top-Down Production for Sentence Generation with Latent Trees. arXiv 2020, arXiv:2010.04704. [Google Scholar] [CrossRef]
  65. Mao, X.; Rossiter, S.J. Genome-wide data reveal discordant mitonuclear introgression in the intermediate horseshoe bat (Rhinolophus affinis). Mol. Phylogenetics Evol. 2020, 150, 106886. [Google Scholar] [CrossRef] [PubMed]
  66. Chornelia, A.; Lu, J.; Hughes, A.C. How to accurately delineate morphologically conserved taxa and diagnose their phenotypic disparities: Species delimitation in cryptic Rhinolophidae (Chiroptera). Front. Ecol. Evol. 2022, 10, 854509. [Google Scholar] [CrossRef]
  67. Stoffberg, S.; Jacobs, D.S.; Mackie, I.J.; Matthee, C.A. Molecular phylogenetics and historical biogeography of Rhinolophus bats. Mol. Phylogenetics Evol. 2010, 54, 1–9. [Google Scholar] [CrossRef]
  68. Wu, H.; Jiang, T.; Huang, X.; Feng, J. Patterns of sexual size dimorphism in horseshoe bats: Testing Rensch’s rule and potential causes. Sci. Rep. 2018, 8, 2616. [Google Scholar] [CrossRef]
  69. Mao, X.G.; Zhu, G.J.; Zhang, S.; Rossiter, S.J. Pleistocene climatic cycling drives intra-specific diversification in the intermediate horseshoe bat (Rhinolophus affinis) in Southern China. Mol. Ecol. 2010, 19, 2754–2769. [Google Scholar] [CrossRef]
  70. Ockendon, N.F.; O’Connell, L.A.; Bush, S.J.; Monzón-Sandoval, J.; Barnes, H.; Székely, T.; Hofmann, H.A.; Dorus, S.; Urrutia, A.O. Optimization of next-generation sequencing transcriptome annotation for species lacking sequenced genomes. Mol. Ecol. Resour. 2016, 16, 446–458. [Google Scholar] [CrossRef]
  71. Kierszenbaum, A.L.; Rivkin, E.; Tres, L.L. Acroplaxome, an F-actin-keratin-containing plate, anchors the acrosome to the nucleus during shaping of the spermatid head. Mol. Biol. Cell 2003, 14, 4628–4640. [Google Scholar] [CrossRef]
  72. Espino, J.; Ortiz, Á.; Bejarano, I.; Lozano, G.M.; Monllor, F.; García, J.F.; Rodríguez, A.B.; Pariente, J.A. Melatonin protects human spermatozoa from apoptosis via melatonin receptor- and extracellular signal-regulated kinase-mediated pathways. Fertil. Steril. 2011, 95, 2290–2296. [Google Scholar] [CrossRef] [PubMed]
  73. Milardi, D.; Colussi, C.; Grande, G.; Vincenzoni, F.; Pierconti, F.; Mancini, F.; Baroni, S.; Castagnola, M.; Marana, R.; Pontecorvi, A. Olfactory Receptors in Semen and in the Male Tract: From Proteome to Proteins. Front. Endocrinol. 2017, 8, 379. [Google Scholar] [CrossRef]
  74. Schneider, M.R.; Mangels, R.; Dean, M.D. The molecular basis and reproductive function(s) of copulatory plugs. Mol. Reprod. Dev. 2016, 83, 755–767. [Google Scholar] [CrossRef]
  75. Oh, Y.K.; Mori, T.; Uchida, T.A. Studies on the vaginal plug of the Japanese greater horseshoe bat, Rhinolophus ferrumequinum nippon. J. Reprod. Fertil. 1983, 68, 365–369. [Google Scholar] [CrossRef]
  76. Lee, J.-H. Vaginal plug formation and release in female hibernating Korean greater horseshoe bat, Rhinolophus ferrumequinum korai (Chiroptera: Rhinolophidae) during the annual reproductive cycle. Zoomorphology 2020, 139, 123–129. [Google Scholar] [CrossRef]
  77. Rossiter, S.J.; Jones, G.; Ransome, R.D.; Barratt, E.M. Genetic variation and population structure in the endangered greater horseshoe bat Rhinolophus ferrumequinum. Mol. Ecol. 2000, 9, 1131–1135. [Google Scholar] [CrossRef] [PubMed]
  78. Wilkinson, G.S.; McCracken, G.F. Bats and balls: Sexual selection and sperm competition in the Chiroptera. In Bat Ecology; Kunz, T.H., Fenton, M.B., Eds.; University of Chicago Press: Chicago, IL, USA, 2006; pp. 128–155. [Google Scholar]
  79. Mori, T.; Oh, Y.K.; Uchida, T. Sperm Storage in the Oviduct of the Japanese Greater Horseshoe Bat, Rhinolophus ferrumequinum nippon. J. Fac. Agric. Kyushu Univ. 1982, 27, 47–53. [Google Scholar] [CrossRef]
  80. Flanders, J.; Jones, G. Roost Use, Ranging Behavior, and Diet of Greater Horseshoe Bats (Rhinolophus ferrumequinum) Using a Transitional Roost. J. Mammal. 2009, 90, 888–896. [Google Scholar] [CrossRef]
  81. Burgin, C. Rhinolophidae. In Handbook of the Mammals of the World—Volume 9; Wilson, D.E., Mittermeier, R.A., Eds.; Lynx Edicions: Barcelona, Spain, 2019; pp. 280–332. [Google Scholar]
  82. Gershoni, M.; Pietrokovski, S. The landscape of sex-differential transcriptome and its consequent selection in human adults. BMC Biol. 2017, 15, 7. [Google Scholar] [CrossRef] [PubMed]
  83. Boskey, E.R.; Telsch, K.M.; Whaley, K.J.; Moench, T.R.; Cone, R.A. Acid production by vaginal flora in vitro is consistent with the rate and extent of vaginal acidification. Infect. Immun. 1999, 67, 5170–5175. [Google Scholar] [CrossRef] [PubMed]
  84. Boskey, E.R.; Cone, R.A.; Whaley, K.J.; Moench, T.R. Origins of vaginal acidity: High D/L lactate ratio is consistent with bacteria being the primary source. Hum. Reprod. 2001, 16, 1809–1813. [Google Scholar] [CrossRef] [PubMed]
  85. Massey, S.E. SARS-CoV-2’s closest relative, RaTG13, was generated from a bat transcriptome not a fecal swab: Implications for the origin of COVID-19. arXiv 2021, arXiv:2111.09469. [Google Scholar]
  86. Bruttel, V.; Washburne, A.; VanDongen, A. Endonuclease fingerprint indicates a synthetic origin of SARS-CoV-2. bioRxiv 2022. [Google Scholar] [CrossRef]
  87. Kang, L.; He, G.; Sharp, A.K.; Wang, X.; Brown, A.M.; Michalak, P.; Weger-Lucarelli, J. A selective sweep in the Spike gene has driven SARS-CoV-2 human adaptation. Cell 2021, 184, 4392–4400.e4. [Google Scholar] [CrossRef] [PubMed]
  88. Cereghino, C.; Michalak, K.; DiGiuseppe, S.; Guerra, J.; Yu, D.; Faraji, A.; Sharp, A.K.; Brown, A.M.; Kang, L.; Weger-Lucarelli, L.; et al. Evolution at Spike protein position 519 in SARS-CoV-2 facilitated adaptation to humans. npj Viruses 2024, 2, 29. [Google Scholar] [CrossRef]
Figure 1. Beta-diversity (principal coordinate) analysis of the microbial communities of the RaTG13 and Clade 7896 datasets compared to datasets from Li et al. (2020) [25]. Microbial beta-diversity was calculated using the Bray–Curtis distance and clustered as described in Methods. The plot displays RaTG13, RaTG15 (Ra7909), the remaining Clade 7896 datasets (Rs7896, Rs7905, Rs7907, Rs7921, Rs7924, Rs7931, Rs7952), and the following datasets from [4]: 229Er (BtRaCoV-229Er), 512r (BtScCoV-512r), CHB25 (BtHiCoV-CHB25), CoV1r (BtMiCoV-1r), HKU2r (BtRhCoV-HKU2r), HKU4r (BtTyCoV-HKU4r), HKU5r (BtPiCoV-HKU5r), HKU8r (BtMiCoV-HKU8r) and HKU10r (BtHpCoV-HKU10r). K-means clustering was used to identify two clusters: cluster 1 (black circles) and cluster 2 (open circles).
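The Bray–Curtis distance named in this caption can be illustrated with a minimal sketch. It covers only the pairwise distance step, not the principal coordinate analysis or the k-means clustering. The function names and the toy counts are assumptions made for illustration and are not taken from the paper or its pipeline.

```python
from typing import Dict, List


def bray_curtis(a: List[float], b: List[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # assumption: two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(samples: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Bray-Curtis dissimilarities for a dict of named samples."""
    names = list(samples)
    return {p: {q: bray_curtis(samples[p], samples[q]) for q in names} for p in names}


# Hypothetical counts for three taxa per sample; not values from the paper.
toy = {
    "sample_A": [10, 0, 5],
    "sample_B": [8, 2, 5],
    "sample_C": [0, 20, 1],
}
dm = distance_matrix(toy)
print(round(dm["sample_A"]["sample_B"], 3))  # similar communities, small distance
print(round(dm["sample_A"]["sample_C"], 3))  # dissimilar communities, distance near 1
```

A matrix of this kind is the usual input to an ordination step; samples with similar taxon profiles end up close together in the resulting plot.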
Figure 2. The phylogenetic tree of mitochondrial SSU rRNA contigs generated from sample sequence datasets. The tree was constructed as described in Methods, using maximum likelihood and an HKY substitution model with an estimated gamma parameter. A total of 100 bootstrap replicates were conducted, and values > 50 are shown. The accession numbers of the mitochondrial genomes from which additional Rhinolophus sp. mitochondrial SSU rRNA sequences were obtained are listed in Methods. Those sequences derived from NGS RNA datasets have the sample number appended to the species name (with the exception of RaTG13 and RaTG15). Three taxa have the subspecies appended; these are ‘R.ferrumequinum nippon’, ‘R.affinis himalayanus’ and ‘R.sinicus sinicus’.
Figure 3. A hierarchical cluster analysis of bat NGS transcriptomic datasets. The HCA plot shows the RaTG13 and Clade 7896 datasets (RaTG15, Rs7896, Rs7905, Rs7907, Rs7921, Rs7924, Rs7931, Rs7952) and a variety of R. sinicus datasets, clustered using RPM Z-scores and Euclidean distance. Samples are on the x-axis, while genes are on the y-axis. The color bar represents the Z-scores.
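As a rough illustration of the quantities named in this caption, the sketch below scales raw counts to reads per million (RPM), converts each gene to Z-scores across samples, and computes the Euclidean distance between samples. It stops at the distances and does not perform the hierarchical clustering itself. The helper names, the use of the per-sample count sum as the RPM denominator, the population standard deviation, and the toy counts are all assumptions, not details confirmed by the paper.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def rpm(counts: List[int]) -> List[float]:
    """Scale raw per-gene read counts to reads per million (assumed denominator: sum of counts)."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


def zscores_by_gene(matrix: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Z-score each gene (list position) across all samples."""
    names = list(matrix)
    n_genes = len(matrix[names[0]])
    out: Dict[str, List[float]] = {name: [] for name in names}
    for g in range(n_genes):
        values = [matrix[name][g] for name in names]
        mu, sd = mean(values), pstdev(values)
        for name in names:
            # A gene with no variation across samples gets a Z-score of 0.
            out[name].append(0.0 if sd == 0 else (matrix[name][g] - mu) / sd)
    return out


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical raw counts for three genes in three samples (illustration only).
raw = {
    "sample_1": [100, 300, 600],
    "sample_2": [110, 290, 600],
    "sample_3": [600, 300, 100],
}
scaled = {name: rpm(counts) for name, counts in raw.items()}
z = zscores_by_gene(scaled)
d12 = euclidean(z["sample_1"], z["sample_2"])
d13 = euclidean(z["sample_1"], z["sample_3"])
print(d12 < d13)  # samples 1 and 2 have the more similar expression profiles
```

Working on Z-scores rather than raw RPM values keeps a few highly expressed genes from dominating the distance, which is one common reason for this choice.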
Table 1. Numbers of eukaryotic and bacterial SSU rRNA reads present in four bat NGS datasets. SSU rRNA reads were identified in each dataset, as described in Methods.
| Sample | Total Number of (Forward) Reads in Dataset | Total Number of SSU rRNA Sequences (% of Total Reads in Brackets) | Number of Bacterial SSU rRNA Sequences (% of Total rRNA Sequences in Brackets) | Number of Eukaryotic SSU rRNA Sequences (% of Total SSU rRNA Sequences in Brackets) | Ratio of Eukaryotic to Bacterial SSU rRNA Sequences |
| --- | --- | --- | --- | --- | --- |
| RaTG13 | 11,604,666 | 208,776 (1.8%) | 21,548 (10.3%) | 178,804 (85.6%) | 8.3:1 |
| BtRhCoV-HKU2r anal swab | 11,924,182 | 2,470,567 (20.7%) | 2,085,824 (84.4%) | 384,023 (15.5%) | 1:5.4 |
| Splenocyte 1 transcriptome | 4,764,112 | 1,306,781 (27.4%) | 13,959 (1.1%) | 1,238,388 (94.8%) | 88.7:1 |
| Ebola oral swab | 1,000,000 | 1,341,026 (13.4%) | 283,350 (21.1%) | 1,050,512 (78.3%) | 3.7:1 |
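The percentages and ratios in Table 1 follow directly from the read counts. The short sketch below recomputes them for two rows; the function name `ssu_summary` is a hypothetical helper, and only the counts themselves come from the table.

```python
def ssu_summary(total_ssu: int, bacterial: int, eukaryotic: int) -> dict:
    """Percent of SSU rRNA reads per domain and the eukaryote-to-bacteria ratio."""
    return {
        "bacterial_pct": 100 * bacterial / total_ssu,
        "eukaryotic_pct": 100 * eukaryotic / total_ssu,
        "euk_to_bact": eukaryotic / bacterial,
    }


# Counts copied from Table 1 (total SSU rRNA, bacterial, eukaryotic).
ratg13 = ssu_summary(208_776, 21_548, 178_804)
anal_swab = ssu_summary(2_470_567, 2_085_824, 384_023)

print(f"RaTG13: {ratg13['bacterial_pct']:.1f}% bacterial, "
      f"ratio {ratg13['euk_to_bact']:.1f}:1")
print(f"Anal swab: {anal_swab['bacterial_pct']:.1f}% bacterial, "
      f"ratio 1:{1 / anal_swab['euk_to_bact']:.1f}")
```

Run on these counts, the sketch gives 10.3% bacterial reads and an 8.3:1 ratio for RaTG13, against 84.4% and 1:5.4 for the anal swab, matching the tabulated values.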
Table 2. The taxonomic analysis of SSU rRNA sequences present in four bat NGS datasets. The SSU rRNA sequences present in the NGS datasets were identified as described in Methods. The percentage represents the proportion of sequences corresponding to the respective taxonomic group in the different samples (number of sequences in brackets). For bacteria, this was calculated as the proportion of bacterial SSU rRNA reads, and for eukaryotes, this was calculated as the proportion of eukaryotic SSU rRNA reads. Results of particular interest are highlighted in bold.
| Taxonomic Group | RaTG13 Sample | BtRhCoV-HKU2r Anal Swab | Splenocyte 1 Transcriptome | Ebola Oral Swab |
| --- | --- | --- | --- | --- |
| Bacteria | | | | |
| Enterobacteriaceae | 18.1% (3891) | 7.4% (154,498) | 90.7% (12,667) | 2.4% (6860) |
| Enterobacteriaceae, Escherichia | 4.9% (1046) | 4.8% (100,567) | 74.1% (10,340) | 0.003% (9) |
| Mycoplasma | 0.07% (15) | 0.04% (869) | 0.1% (18) | 0.005% (14) |
| Helicobacter | 0.005% (1) | 0.4% (8490) | 0% (0) | 0% (0) |
| Bacillus | 0.004% (8) | 0.4% (8214) | 0.01% (1) | 5.2% (14,662) |
| Peptostreptococcaceae | 0.07% (16) | 21.2% (442,050) | 0% (0) | 0.0007% (2) |
| Enterococcus | 6.7% (1453) | 6.3% (132,235) | 0.2% (30) | 2.4% (6828) |
| Lachnospiraceae | 0.7% (146) | 6.2% (128,630) | 0.01% (1) | 0.08% (221) |
| Clostridium | 0.7% (141) | 47.8% (997,409) | 0% (0) | 0.01% (31) |
| Lactococcus | 64.0% (13,780) | 0.07% (1532) | 0% (0) | 0.06% (162) |
| Lactobacillus | 0.02% (5) | 0.0009% (18) | 0.01% (1) | 0.005% (13) |
| Micrococcus | 4.5% (960) | 0% (0) | 0.08% (11) | 0% (0) |
| Pasteurellaceae | 0.03% (7) | 3.1% (64,468) | 0% (0) | 51.3% (145,394) |
| Pasteurellaceae, Haemophilus | 0.01% (2) | 0.02% (511) | 0% (0) | 4.4% (12,383) |
| Gemella | 0% (0) | 0.005% (106) | 0.02% (3) | 0.3% (742) |
| Eukaryota | | | | |
| Arthropoda | 0.02% (35) | 0.01% (36) | 0.01% (157) | 0.4% (4005) |
| Arthropoda, Insecta | 0.02% (29) | 0.008% (30) | 0.009% (113) | 0.1% (1348) |
| Fungi | 0.2% (302) | 0.002% (6) | 0.02% (261) | 0.02% (186) |
| Viridiplantae | 0.1% (214) | 0.05% (198) | 0.001% (13) | 0.002% (22) |
Table 3. Metaxa2 analysis of bacterial SSU rRNA reads in the Clade 7896 datasets.
| Sample | Total Number of (Forward) Reads in Dataset | Total Number of rRNA Sequences (% of Total Sequences in Brackets) | Number of Bacterial rRNA Sequences (% of Total rRNA Sequences in Brackets) | Number of Eukaryotic rRNA Sequences (% of Total rRNA Sequences in Brackets) | Ratio of Eukaryotic to Bacterial rRNA Sequences |
| --- | --- | --- | --- | --- | --- |
| RaTG15 (Ra7909) | 57,967,763 | 7,582,328 (13.1%) | 15,497 (0.2%) | 7,364,952 (97.1%) | 475.3:1 |
| Rs7896 | 33,095,822 | 2,334,500 (7.1%) | 3599 (0.2%) | 2,273,798 (97.4%) | 631.8:1 |
| Rs7905 | 34,515,819 | 1,472,236 (4.3%) | 45,803 (3.1%) | 1,387,437 (94.2%) | 30.3:1 |
| Rs7907 | 43,686,062 | 3,321,723 (7.6%) | 111,111 (3.3%) | 3,239,588 (97.5%) | 29.2:1 |
| Rs7921 | 100,971,808 | 12,814,436 (12.7%) | 2,200,755 (17.2%) | 10,312,025 (80.5%) | 4.7:1 |
| Rs7924 | 45,210,219 | 1,908,141 (4.2%) | 108,044 (5.7%) | 1,719,177 (90.1%) | 15.9:1 |
| Rs7931 | 51,086,664 | 2,565,395 (5.0%) | 319,175 (12.4%) | 2,195,238 (85.6%) | 6.9:1 |
| Rs7952 | 40,979,989 | 372,805 (0.9%) | 2720 (0.7%) | 342,244 (91.8%) | 125.8:1 |
Table 4. The number of coronavirus reads present in the RaTG13 datasets and anal swab datasets from the WIV. Nine coronavirus genomes were generated by the WIV from anal swabs by Li et al. (2020). The raw (forward) reads were mapped to the respective coronavirus genomes as described in Methods. The reads from the RaTG13 sample were mapped to the RaTG13 genome for comparison.
| Sample Dataset SRA Accession Number (Number of Reads in Brackets) | Coronavirus Genome (NCBI Accession Number in Brackets) | Number of Reads Mapped to Respective Coronavirus Genome | Proportion of Reads Mapping to Coronavirus Compared to Total Number of Reads |
| --- | --- | --- | --- |
| SRR11085797 (23,209,332) | RaTG13 (MN996532) | 1669 | 7.2 × 10⁻⁵ |
| SRR11085736 (23,848,364) | BtRhCoV-HKU2r (MN611522) | 886 | 3.7 × 10⁻⁵ |
| SRR11085735 (8,032,494) | BtHpCoV-HKU10-related (MN611523) | 7030 | 8.8 × 10⁻⁴ |
| SRR11085733 (R. larvatus) (27,083,324) | BtHiCoV-CHB25 (MN611525) | 1,035,522 | 3.8 × 10⁻² |
| SRR11085741 (24,828,142) | BtRaCoV-229E-related (MN611517) | 99,776 | 4.0 × 10⁻³ |
| SRR11085734 (19,171,950) | BtMiCoV-1-related (MN611524) | 581 | 3.0 × 10⁻⁵ |
| SRR11085740 (19,562,848) | BtMiCoV-HKU8-related (MN611518) | 2817 | 1.4 × 10⁻⁴ |
| SRR11085737 (23,088,962) | BtScCoV-512-related (MN611521) | 142,646 | 6.2 × 10⁻³ |
| SRR11085738 (29,134,128) | BtPiCoV-HKU5-related (MN611520) | 1,437,700 | 4.9 × 10⁻² |
| SRR11085739 (9,589,348) | BtTyCoV-HKU4-related (MN611519) | 2778 | 2.9 × 10⁻⁴ |
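The final column of Table 4 is the number of mapped reads divided by the total number of reads in the dataset. The sketch below recomputes that proportion for three rows; `mapped_fraction` is a hypothetical helper name, and the read counts are copied from the table.

```python
def mapped_fraction(mapped_reads: int, total_reads: int) -> float:
    """Fraction of all reads in a dataset that map to the coronavirus genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


# (mapped reads, total reads) copied from Table 4.
table4 = {
    "RaTG13": (1669, 23_209_332),
    "BtRhCoV-HKU2r": (886, 23_848_364),
    "BtPiCoV-HKU5-related": (1_437_700, 29_134_128),
}
fractions = {name: mapped_fraction(m, t) for name, (m, t) in table4.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1e}")
```

On these figures the RaTG13 dataset sits at the low end of the range, close to BtRhCoV-HKU2r and about three orders of magnitude below BtPiCoV-HKU5-related.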
Table 5. The mapping of NGS sample reads to mammalian mitochondrial genomes. NGS sample reads were mapped to a series of mammalian mitochondrial genomes as described in Methods.
Percent of mitochondrial genome covered, with the number of reads mapped in brackets, for each dataset:

| Species | Mitochondrial Genome NCBI Accession Number | RaTG13 | BtRhCoV-HKU2r Anal Swab | Splenocyte Transcriptome |
| --- | --- | --- | --- | --- |
| R. affinis | NC_053269.1 | 97.2% (75,335) | 14.9% (6278) | 32.2% (88,220) |
| R. sinicus | KP257597.1 | 40.4% (18,017) | 29.8% (10,019) | 94.5% (170,591) |
| Mouse | NC_005089.1 | 6.3% (111) | 1.6% (18) | 6.0% (1755) |
| Human | NC_012920.1 | 3.6% (26) | 9.4% (23) | 40.5% (91) |
| Pig | NC_012095.1 | 6.6% (1606) | 4.2% (155) | 5.2% (2238) |
| Black-footed ferret (Mustela nigripes) | NC_024942.1 | 6.6% (1537) | 3.0% (61) | 5.2% (1383) |
| Malaysian pangolin (Manis javanica) | NC_026781.1 | 4.9% (88) | 2.2% (32) | 3.5% (1254) |
| Rabbit (Oryctolagus cuniculus) | NC_001913.1 | 5.1% (92) | 1.4% (16) | 2.7% (1529) |
| Asian palm civet (Paradoxurus hermaphroditus) | MG200264.1 | 5.6% (1836) | 4.3% (185) | 4.4% (5258) |
| Chinese tree shrew (Tupaia chinensis) | AF217811 | 4.2% (655) | 2.7% (14) | 2.9% (4117) |
Table 6. Nuclear genome mapping statistics.
| Species | Genome Assembly NCBI Accession Number | % of RaTG13 Sample Reads Mapped to Genome | % of BtRhCoV-HKU2r Anal Swab Sample Reads Mapped to Genome |
| --- | --- | --- | --- |
| Greater horseshoe bat (R. ferrumequinum) | GCA_004115265.3 | 87.5% | 2.6% |
| Human | GCA_000001405.28 | 64.3% | 7.4% |
| Mouse | GCA_000001635.9 | 62.2% | 6.6% |
| Green monkey | GCA_000409795.2 | 51.5% | 7.5% |
| Pig | GCA_000003025.6 Sscrofa11.1 | 62.5% | 7.4% |

Massey, S.E. Forensic Genomic Analysis Determines That RaTG13 Was Likely Generated from a Bat Mating Plug. Microbiol. Res. 2024, 15, 1784-1805. https://doi.org/10.3390/microbiolres15030119