Retracing Phylogenetic, Host and Geographic Origins of Coronaviruses with Coloured Genomic Bootstrap Barcodes: SARS-CoV and SARS-CoV-2 as Case Studies

Phylogenetic trees of coronaviruses are difficult to interpret because they undergo frequent genomic recombination. Here, we propose a new method, coloured genomic bootstrap (CGB) barcodes, to highlight the polyphyletic origins of human sarbecoviruses and understand their host and geographic origins. The results indicate that SARS-CoV and SARS-CoV-2 contain genomic regions of mixed ancestry originating from horseshoe bat (Rhinolophus) viruses. First, different regions of SARS-CoV share exclusive ancestry with five Rhinolophus viruses from Southwest China (RfYNLF/31C: 17.9%; RpF46: 3.3%; RspSC2018: 2.0%; Rpe3: 1.3%; RaLYRa11: 1.0%) and 97% of its genome can be related to bat viruses from Yunnan (China), supporting its emergence in the Rhinolophus species of this province. Second, different regions of SARS-CoV-2 share exclusive ancestry with eight Rhinolophus viruses from Yunnan (RpYN06: 5.8%; RaTG13: 4.8%; RmYN02: 3.8%), Laos (RpBANAL103: 3.3%; RmarBANAL236: 1.7%; RmBANAL52: 1.0%; RmBANAL247: 0.7%), and Cambodia (RshSTT200: 2.3%), and 98% of its genome can be related to bat viruses from northern Laos and Yunnan, supporting its emergence in the Rhinolophus species of this region. Although CGB barcodes are very useful in retracing the origins of human sarbecoviruses, further investigations are needed to better take into account the diversity of coronaviruses in bats from Cambodia, Laos, Myanmar, Thailand and Vietnam.


Introduction
Tree reconstruction methods, such as Bayesian inference and maximum likelihood (ML), are very popular in deciphering phylogenetic relationships between pathogens based on multiple sequence alignments. In particular, newly discovered coronaviruses are routinely described based on Bayesian and ML trees reconstructed from whole or partial genome alignments [1][2][3][4]. For instance, most recent studies on SARS-CoV-2, the virus involved in the COVID-19 pandemic, have published a whole-genome tree of the subgenus Sarbecovirus (family Coronaviridae, genus Betacoronavirus) in which the human virus was found to be closely related to RaTG13, a virus detected in a horseshoe bat of the species Rhinolophus affinis sampled in 2013 in the Yunnan province of China [4][5][6][7]. However, discordant placements of SARS-CoV-2 were supported in phylogenetic trees based on different genomic fragments. This was well illustrated in Zhou et al. [7], in which SARS-CoV-2 appeared closely related to four bat viruses, namely RaTG13, RmYN02, RpYN06 and RshSTT200, in the RNA-dependent RNA polymerase (RdRp) gene tree, sister-group of RmYN02 and RpYN06 in the ORF1ab tree, and linked to RaTG13 in the Spike gene tree. Such conflicting results between gene trees are typically explained by genomic recombination, a process resulting in mosaic genomes containing regions from different parental viruses. The most widely-accepted model to explain recombination in coronaviruses is the copy-choice

Phylogenetic Analyses
Maximum likelihood analysis of the whole-genome alignment of sarbecoviruses was carried out using RAxML 8.2.11 [19], different GTR+G models for the three codon-positions and non-coding regions, and 1000 bootstrap replicates. The RAxML bootstrap trees were executed in PAUP* version 4.0a [20] to construct the bootstrap 50% majority-rule consensus tree.
To examine the distribution of phylogenetic support along the whole-genome alignment, the dataset was bootstrapped under the SWB program [9] using a window of W nucleotides (W parameter) moving in steps of 50 nt (S parameter). The window size (W) is a key parameter for SWB analyses because the amount of phylogenetic signal depends on both the number of nucleotide sites and their evolutionary rates [9]. For that reason, we decided to perform five SWB analyses with the same step parameter (S) of 50 nt but using five different window sizes, i.e., 400 nt, 500 nt, 600 nt, 1000 nt and 2000 nt ( Table 1). The smallest window size (W = 400 nt) was applied to detect possible changes in phylogenetic relationships due to the recombinant origin of small genomic regions, whereas the largest window size (W = 2000 nt) was used to guarantee enough phylogenetic signal (informative sites) for bootstrap analyses. The intermediate window sizes (500, 600 and 1000 nt) were used to better interpret the differences between SWB results based on the two extreme values. In the SWB program, each window bootstrap (WB) sub-dataset was automatically run in RAxML [19] with a GTR+G model and 100 bootstrap replicates. The SWB output is a CSV file containing the window bootstrap percentages (WBP) calculated for each WB sub-datasets and for all the bipartitions (nodes) reconstructed during the SWB analysis. For example, the SWB 400 output (SWB analysis based on a window of 400 nt) includes 1,213,380,595 WBP values (595 WB sub-datasets × 204,001 bipartitions), whereas the SWB 2000 output includes 11,189,625 WBP values (563 WB sub-datasets × 19,875 bipartitions) (Table 1).  [9]; 2 : Number of characters in the MRP matrix reconstructed using LFG [9] and SuperTRI [21] programs; 3 : Number of SWB bipartitions including at least one WBP ≥ 50% selected under the BBC program using the SWB file as input [9]; 4 : Number of BBC bipartitions selected under Microsoft ® Excel using the BBC file as input.
The bootstrap bipartitions generated from the five SWB analyses based on different window sizes were used for SuperTRI analyses [21] to reconstruct the trees showing the most reliable phylogenetic relationships. The LFG program [9] was used to convert the SWB output file into bootstrap log files, which were then used as inputs in SuperTRI v57 [21] to construct a MRP (Matrix Representation with Parsimony) file. For example, the SWB 400 output was converted with the LFG program into 595 bootstrap log files, and these files were further transformed into a MRP file using SuperTRI v57. In the MRP 400 file, each of the 750,330 characters represents one SWB bipartition with its assigned bootstrap percentage calculated in one of the 595 window bootstrap analyses. The MRP 400 file was then executed in PAUP* version 4.0a [20] using 1000 bootstrap replicates of weighted parsimony (with WBPs of the SWB 400 analysis assigned as weights) in order to reconstruct the SuperTRI bootstrap 50%-majority-rule consensus (SB 400 ) tree. Finally, the mean window bootstrap percentages (MBP 400 ) were calculated automatically in SuperTRI v57 [21] for all nodes of the SB 400 trees. The same approach was conducted using the SWB outputs obtained with the sliding windows of 500, 600, 1000 and 2000 nt. In total, we therefore reconstructed five SB trees with MBP values at the nodes.

Construction of Genomic Bootsrap Barcodes
The BBC program [9] was used to select only SWB bipartitions (i.e., phylogenetic hypotheses) showing one or more WBP values ≥ 50% and to construct their corresponding genomic bootstrap (GB) barcodes. For example, the SWB 400 output contains 204,001 bipartitions, each with 595 WBP values. After BBC analysis, the GB 400 barcodes were done for the 1263 selected SWB 400 bipartitions showing one or more WBP ≥ 50% (Table 1). A GB barcode is a small image representing the genome alignment and in which the n WBP values (n = 595 with an alignment of 30,115 nt and a window size of 400 nt) obtained for the BBC bipartition of interest were transformed into coloured vertical bars using the following code: green for WBP ≥ 70%; grey for 30% < WBP < 70%; and red for WBP ≤ 30%. Using the same BBC procedure but a different SWB output (either SWB 500 , SWB 600 , SWB 1000 or SWB 2000 ), we constructed 1240 GB 500 barcodes, 1212 GB 600 barcodes, 1061 GB 1000 barcodes, and 818 GB 2000 barcodes (Table 1). All the 5594 GB barcodes and the five BBC output files (BBC 400 , BBC 500 , BBC 600 , BBC 1000 , and BBC 2000 ) were deposited in the Open Science Framework (OSF) platform at https://osf.io/nj57e/, accessed on 15 June 2022.

Construction of Coloured Genomic Bootstrap Barcodes
A new method was developed in this study for constructing coloured genomic bootstrap (CGB) barcodes for a virus of special interest or the common ancestor of several viruses. A phylogenetic CGB barcode is a small image representing the genome of a virus (or an ancestral virus) in which the different colours show the best phylogenetic signals, i.e., containing the fewest number of closely-related viruses, detected in the different regions of the genomic alignment used for the analyses. Two other kinds of CGB barcodes were derived from the phylogenetic CGB barcodes: the host CGB barcodes showing the host origins of the closely-related viruses and the geographic CGB barcodes showing the geographic origins of the closely-related viruses. For this study, we choose to reconstruct the CGB barcodes for the common ancestor of seven selected SARS-CoV genomes and for the common ancestor of nine selected SARS-CoV-2 genomes (see Supplementary Table S1 for the origin of the sequences used in this study). To avoid repetitions, only the SARS-CoV-2 procedure is described below.
In the first step, the five BBC output files were imported in Microsoft ® Excel. They represent five lists of BBC bipartitions, i.e., SWB bipartitions showing at least one WBP ≥ 50% (Table 1). They contain the following information for each BBC bipartition: bipartition number, lists of viruses included in the bipartition, binary representations of the bipartition (*: viruses included in the bipartition; -: viruses excluded from the bipartition), WBPs obtained for the n sliding window bootstrap analyses (e.g., n = 595 for SWB 400 and n = 563 for SWB 2000 ). The five BBC outputs include the following number of BBC bipartitions: 1263 for BBC 400 , 1240 for BBC 500 , 1212 for BBC 600 , 1061 for BBC 1000 , and 818 for BBC 2000 (Table 1). The outputs were entered sequentially into the same Excel file starting with BBC 400 and ending with BBC 2000 and highlighted with five different background colours. To allow comparisons between the five BBC results, W and S parameters (window size and moving steps) were used to calculate the median positions (pos.) for the n WBPs calculated in each of the five SWB analyses.
In the second step, only BBC bipartitions including all the nine SARS-CoV-2 sequences were selected: 178 bipartitions for BBC 400 , 176 bipartitions for BBC 500 , 175 bipartitions for BBC 600 , 163 bipartitions for BBC 1000 and 132 bipartitions for BBC 2000 (Table 1). For each of the five BBC lists, the single bipartition including only SARS-CoV-2 sequences was removed. Then, other BBC bipartitions were ranked in increasing order of size, from +1 (for the bipartitions including the nine SARS-CoV-2 sequences + one closely-related virus) to +66 (=75 − 9, for the single bipartition including all viruses of our dataset). To make comparisons between WBPs calculated in the five SWB analyses, a new column was inserted to renumber BBC 500 , BBC 600 , BBC 1000 , and BBC 2000 bipartitions using BBC 400 numbers as references.
In the third step, all WBPs ≥ 70% were highlighted in green and all WBPs comprised between 50% and 70% were highlighted in yellow green using conditional formatting options in Microsoft ® Excel. We performed the comparisons starting with bipartitions +1, i.e., containing all SARS-CoV-2 sequences and another virus. Due to past events of genomic recombination, we found several bipartitions +1 supporting conflicting phylogenetic relationships. For each of these BBC bipartitions, we identified the intervals of genomic regions containing a robust phylogenetic signal (GRPS) using the criteria developed below. Then we proceeded similarly by analysing other genomic fragments for bipartitions +2 (containing all SARS-CoV-2 sequences and two other viruses), bipartitions +3 (containing all SARS-CoV-2 sequences and three other viruses), etc. In this way, we were able to identify the closest virus(es) to SARS-CoV-2 in all regions of our genome alignment. All selected GRPS contained a robust phylogenetic signal (WBP ≥ 70%) in at least two WB sub-datasets of the BBC 400 , BBC 500 or BBC 600 results. The 5' and 3' ends of GRPS were extended using the following criteria: (i) by accepting WBPs between 50% and 70% for BBC 400 results; (ii) when median positions showed an average WBP ≥ 50% for BBC 400 , BBC 500 and BBC 600 results; and (iii) when median positions showed an average WBP ≥ 50% for all the five BBC results (BBC 400 , BBC 500 , BBC 600 , BBC 1000 , and BBC 2000 ). This strategy was adopted for three major reasons: (i) GRPS of small lengths cannot be detected using SWB analyses based on the largest window sizes (i.e., 1000 nt and 2000 nt); (ii) due to stochastic variation in bootstrap values, the comparisons between the three SWB analyses based on the smallest window sizes (i.e., 400 nt, 500 nt and 600 nt) allow us to better detect GRPS of small lengths; (iii) GRPS of large size can be erroneously interrupted if we consider only SWB analyses based on the smallest window sizes because they contain lesser amounts of phylogenetic signal [9]. It is therefore important also to make comparisons with BBC results based on the largest window sizes (i.e., 1000 nt and 2000 nt).
In the fourth step, the intervals of GRPS (5 and 3 median positions in the wholegenome alignment) were written in a new CSV file for each of the BBC bipartitions including SARS-CoV2 sequences in which one or more GRPS were identified as the best phylogenetic signals (i.e., containing the fewest number of closely-related viruses). A specific colour code was chosen for each of the selected BBC bipartitions and the file was used as input in the CGB program (python script) to construct phylogenetic CGB barcodes of different colours. Two other files were derived from the original CSV file: (i) a file for host CGB barcodes, in which different colours were assigned to the BBC bipartitions including viruses extracted from the following nine taxa: Rhinolophus affinis, Rhinolophus malayanus, Rhinolophus marshalli, Rhinolophus pusillus, Rhinolophus shameli, R. affinis + Manis javanica, Rhinolophus species, Rhinolophus species + M. javanica, bat species + M. javanica; and (ii) a file for geographic CGB barcodes, in which different colours were chosen for the BBC bipartitions including viruses collected in the following nine geographic areas: Cambodia, North Laos, SE Asia, Yunnan, North Laos + Yunnan, SE Asia + Yunnan, North Laos + China, SE Asia + China, and SE Asia + China + Japan. All files used to construct the CGB barcodes for SARS-CoV-2 and SARS-CoV (including BBC 400 , BBC 500 , BBC 600 , BBC 1000 , BBC 2000 files, the two Excel files used to identify GRPS intervals, and the six CSV files used for phylogenetic, host and geographic CGB barcodes) were deposited at https://osf.io/nj57e/, accessed on 15 June 2022.

Phylogenetic Analyses Based on An Alignment of 75 Sarbecovirus genomes
In this study, 75 genomes of the subgenus Sarbecovirus (Supplementary Table S1) were aligned to infer phylogenetic relationships. The positions of the coding sequences were the following in our final alignment of 30 Figure 1).

Phylogenetic Analyses Based on An Alignment of 75 Sarbecovirus genomes
In this study, 75 genomes of the subgenus Sarbecovirus (Supplementary Table S1) were aligned to infer phylogenetic relationships. The positions of the coding sequences were the following in our final alignment of 30 Figure 1). Two different approaches were used for phylogenetic reconstruction: whole-genome bootstrap (WGB) analysis versus SuperTRI bootstrap (SB) analysis. On the one hand, a classical ML approach was applied with the RAxML method using the whole-genome alignment and 1000 bootstrap replicates. The WGB 50% majority-rule consensus tree is shown in Figure 2. On the other hand, five SWB analyses were conducted using five different window sizes (400, 500, 600, 1000 and 2000 nt) and the results were used to reconstruct five SuperTRI bootstrap (SB) consensus trees (with MBP values indicated at the nodes). The five SB trees (available at https://osf.io/nj57e/) were found to be very similar, except for a few nodes. For instance, Rs4237 and Rs4247 are sister viruses in SB400, SB500, SB600 trees, whereas Rs4247 is more closely related to Rs4081 in SB1000 and SB2000 trees. Additionally, MBP values were generally lowest for SB400 nodes and highest for SB2000 nodes. Therefore, we only reported MBP400 and MBP2000 values at the nodes of the WGB tree of Figure 2. The MBP values written in black indicate nodes recovered as monophyletic in all the five SB trees, whereas the MBP values highlighted in red indicate nodes not found monophyletic in SB trees. In Figure 2, we also showed the GB barcodes built from SWB400 and SWB2000 analyses for all nodes of the WGB tree supported by BPWG ≥ 90% or recovered as monophyletic in SB trees. Two different approaches were used for phylogenetic reconstruction: whole-genome bootstrap (WGB) analysis versus SuperTRI bootstrap (SB) analysis. On the one hand, a classical ML approach was applied with the RAxML method using the whole-genome alignment and 1000 bootstrap replicates. The WGB 50% majority-rule consensus tree is shown in Figure 2. On the other hand, five SWB analyses were conducted using five different window sizes (400, 500, 600, 1000 and 2000 nt) and the results were used to reconstruct five SuperTRI bootstrap (SB) consensus trees (with MBP values indicated at the nodes). The five SB trees (available at https://osf.io/nj57e/, accessed on 15 June 2022) were found to be very similar, except for a few nodes. For instance, Rs4237 and Rs4247 are sister viruses in SB 400 , SB 500 , SB 600 trees, whereas Rs4247 is more closely related to Rs4081 in SB 1000 and SB 2000 trees. Additionally, MBP values were generally lowest for SB 400 nodes and highest for SB 2000 nodes. Therefore, we only reported MBP 400 and MBP 2000 values at the nodes of the WGB tree of Figure 2. The MBP values written in black indicate nodes recovered as monophyletic in all the five SB trees, whereas the MBP values highlighted in red indicate nodes not found monophyletic in SB trees. In Figure 2, we also showed the GB barcodes built from SWB 400 and SWB 2000 analyses for all nodes of the WGB tree supported by BP WG ≥ 90% or recovered as monophyletic in SB trees.
In the WGB tree of Figure 2, 48 out of the total 69 nodes (70%) are supported by BP WG ≥ 90%. Robustness comparisons with SB trees revealed that 24 of these 48 robust nodes (50%) are associated with MBP 400/2000 ≤ 30%, indicating that the phylogenetic signal is restricted to only one or several regions of the genome alignment, representing in total less than 30% of the whole genome alignment [9]. For examples, the monophyly of RecSar, which includes the three bat viruses RpPrC31, RsVZC45, and RsVZXC21, was supported by MBP 400 = 15 and MBP 2000 = 21; the sister-group relationship between SARS-CoV and Rs4231 + Rs4874 was supported by MBP 400 = 3 and MBP 2000 = 8; and the grouping of SARS-CoV-2 with RaTG13, RmBANAL52, RmaBANAL236, and RpBANAL103 was supported by MBP 400 = 2 and MBP 2000 = 4. Note that the later node was not recovered monophyletic in the five SB trees reconstructed from SWB analyses (MBP values written in red in Figure 2). For all these nodes, the GB barcodes revealed that the phylogenetic signal is restricted to one or several small genomic regions (in green, WBP ≥ 70%), whereas most other genomic regions (in red, WBP ≤ 30%) can support different phylogenetic relationships, as illustrated with CGB barcodes described in Sections 3.2 and 3.3.  RAxML method and a GTR+G model for each of the four partitions of the whole-genome alignment (three codon positions and non-coding regions). The bootstrap percentages (BP WG ) are given for all nodes in bold. The genomic bootstrap (GB) barcodes built from SWB 400 and SWB 2000 outputs and the BBC program [9] were reported for all nodes recovered as monophyletic in the SuperTRI bootstrap (SB) trees reconstructed from SWB analyses (for these nodes, MBP 400 and MBP 2000 values are indicated in black at the right of the slash) and for all nodes supported by BP WG ≥ 90% that were not found as monophyletic in SB trees (for these nodes, MBP 400 and MBP 2000 values are indicated in red at the right of the slash). The GB 400 and GB 2000 barcodes are displayed above and below the branches, respectively. The colours of Asian sarbecoviruses indicate to which group of synonymous nucleotide composition they belong [16]: The nine SARS-CoV-2 genomes were sampled between December 2019 (HsRef, GenBank: NC_045512) and July 2022 (HsOmicron, GenBank: OP010674) and we know that their most recent common ancestor (MRCA) emerged a few weeks before the end of 2019. Due to their recent divergence, these viruses have very similar genomes. Indeed, pairwise nucleotide distances are comprised between 0.04% and 0.28%. We also found very similar distances for most other closely-related viruses supported by high MBP values: between 0.11% and 0.41% for the seven SARS-CoV genomes; 0.46% between Rmac1 and Rmac279; 0.76% between Rf1 and Rf273. All these pairwise distances are the smallest calculated for our dataset. In contrast, the two other virus pairs supported by MBP 400/2000 ≥ 70% were found to be more distant: 2.0% between Ra7909 and RstYN04, and 12.5% between RhGB01 and RhKhosta2.
Interrelationships between the seven human and two mink SARS-CoV-2 sequences were not robust in the WGB tree of Figure 2, except the basal position of HsRef (BP WG = 92) which was supported by one GRPS (pos. 14,251-14,750, with an exclusive Uracil in pos. 14,507). However, we found discordant SARS-CoV-2 relationships. For instance, HsGamma and HsOmicron were related with HsDelta based on pos. 9851-10,350 (including an exclusive

Coloured Genomic Bootstrap Barcodes Reconstructed for the Ancestor of SARS-CoV
We found that the best phylogenetic signals detected for the common ancestor of SARS-CoV involved 80 GRPS and 49 BBC bipartitions. The file used to identify the intervals of the 80 GRPS (5 and 3 median positions in the whole-genome alignment) is available at https://osf.io/nj57e/ (accessed on 15 June 2022).
The phylogenetic, host and geographic CGB barcodes constructed for the common ancestor of SARS-CoV are shown in Figure 3. The phylogenetic CGB barcodes indicate that 25.6% of the SARS-CoV genome shares exclusive ancestry with five Rhinolophus viruses (bipartitions +1 in Figure 3) detected in three provinces of Southwest China: three viruses from Yunnan, including RfYNLF/31C sampled in R. ferrumequinum (eight GRPS representing 17.9% of the WG alignment), RpF46 sampled in R. pusillus (two GRPS; 3.3% of the alignment); RaLYRa11 sampled in R. affinis (one GRPS; 1.0% of the alignment); one virus from Guangxi, Rpe3 sampled in R. pearsonii (one GRPS; 1.3% of the alignment); and one virus from Sichuan, RspSC2018, sampled in an unidentified species of Rhinolophus (two GRPS; 2.0% of the alignment).  If we consider the BBC bipartitions uniting SARS-CoV with two to five closely-related viruses (n = +2, +3, +4 or +5 in Figure 3), they involve eight additional viruses, all found in Rhinolophus bats collected in Yunnan: RmYN07, Rs4231, Rs4237, Rs4247, Rs4874, Rs7327, Rs9401, and RstYN09. These results therefore suggest that SARS-CoV originated from horseshoe bat (genus Rhinolophus) viruses. This hypothesis was confirmed with our analyses of Figure 4B, which revealed that 100% of the phylogenetic CGB barcodes reconstructed for SARS-CoV involved Rhinolophus viruses. As shown in Figure 4A, the most important contributors are five sarbecoviruses extracted from horseshoe bats: RfYNLF/31C (which is 65% involved), Rs7327 (63%), Rs9401 (63%), Rs4874 (62%), and Rs4084 (61%). Importantly, all the 22 viruses showing a significant contribution in SARS-CoV CGB barcodes (≥25%) belong to the SCoVrC lineage.
Several regions of the SARS-CoV genome were found to be exclusively related to RfYNLF/31C, a virus found in Rhinolophus ferrumequinum.  Figure 4B show that the three Rhinolophus species with the highest contribution to the phylogenetic CGB barcodes of SARS-CoV are R. ferrumequinum (76%), R. sinicus (71%) and R. affinis (58%).
The geographic CGB barcodes indicate that most regions of the SARS-CoV genome are exclusively related to bat viruses from Southwest China (84%; highlighted in orange, yellow and red in Figure 3), including Yunnan (41%), Sichuan (2%), and Guangxi (1%). Consistent with this, we found that 100% of the phylogenetic CGB barcodes reconstructed for SARS-CoV involved viruses detected in Southwest China, and the contribution of the Yunnan province is 97%, which is much more important than that of the three other provinces of Southwest China, i.e., Guizhou (53%), Guangxi (46%), and Sichuan (33%) ( Figure 4C).

Coloured Genomic Bootstrap Barcodes Reconstructed for the Ancestor of SARS-CoV-2
We found that the best phylogenetic signals detected for the common ancestor of SARS-CoV-2 involved 95 GRPS and 52 BBC bipartitions. The file used to identify the intervals of the 95 GRPS (5 and 3 median positions in the whole-genome alignment) is available at https://osf.io/nj57e/ (accessed on 15 June 2022).   The phylogenetic, host and geographic CGB barcodes constructed for the common ancestor of SARS-Co2 are shown in Figure 5. The phylogenetic CGB barcodes indicate that 23.4% of SARS-CoV-2 genome shares exclusive ancestry with eight Rhinolophus viruses (bipartitions +1 in Figure 5): three from Yunnan, RaTG13 sampled in R. affinis (four GRPS representing 4.8% of the alignment), RmYN02 sampled in R. malayanus (two GRPS; 3.8% of the alignment), and RpYN06 sampled in R. pusillus (two GRPS; 5.8% of the alignment); four from northern Laos, RpBANAL103 sampled in R. pusillus (one GRPS; 3.3% of the alignment), RmaBANAL236 sampled in R. marshalli (one GRPS; 1.7% of the alignment), RmBANAL52 (one GRPS; 1% of the alignment) and RmBANAL247 (one GRPS; 0.7% of the alignment), both sampled in R. malayanus; and one from northern Cambodia, RshSTT200 sampled in R. shameli (two GRPS; 2.3% of the alignment). The BBC bipartitions including between two and five additional viruses (n = +2, +3, +4 or +5 in Figure 5) involve only two additional viruses: one isolated from the Sunda pangolin (M. javanica), named MjGuangxi, and another found in Rhinolophus acuminatus from Eastern Thailand, named RacCS203. These results therefore suggest that SARS-CoV-2 originated from horseshoe bat (genus Rhinolophus) viruses rather than pangolin viruses. This hypothesis was confirmed with the results of Figure 6B, which show that 100% of the phylogenetic CGB barcodes reconstructed for SARS-CoV-2 involved Rhinolophus viruses, whereas the contribution of Manis viruses remains modest (24%). As shown in Figure 6A, the most important contributors are six viruses extracted from horseshoe bats of northern Laos and Yunnan: RpBANAL103 (which is 72% involved), RmBANAL52 (70%), RpYN06 (65%), RmaBANAL236 (65%), RmYN02 (63%), and RaTG13 (60%). Importantly, all the 10 viruses showing a significant contribution (≥25%) belong to the SCoV2rC lineage (between 72% and 35%), except RpPrC31 (31%), which belongs to the RecSar group [9,16].
The geographic CGB barcodes indicate that most regions of the SARS-CoV-2 genome are exclusively related to bat viruses from Yunnan and Southeast Asia (81%), some of them being specific to Yunnan (17% highlighted in red in Figure 5) or Southeast Asia (19%; northern Laos representing 16%, in yellow). Consistent with this, we found that 100% of the phylogenetic CGB barcodes reconstructed for SARS-CoV-2 involved viruses detected in Southeast Asia and Yunnan ( Figure 6C). However, sampling efforts were much more important in Yunnan: 53% of the Sarbecovirus genomes used in our study were collected in Yunnan against 7% for those found in Laos. The contribution of bat sarbecoviruses from Yunnan is 81%. The contribution of bat sarbecoviruses from Laos is 78%, which is much more important than the contribution of bat sarbecoviruses found in other Southeast Asian countries: 47% for the virus from Cambodia and 35% for the virus from Thailand ( Figure 6C).

Phylogenetic Trees versus Phylogenetic CGB Barcodes
One or several phylogenetic trees have been published in all reports dealing with the origin and evolution of SARS-CoV-2 [4][5][6][7]15,16]. In these studies, SARS-CoV-2 was generally found to be closely related to RaTG13, a virus detected in a R. affinis bat collected in Yunnan in 2013 [4][5][6][7]. However, phylogenetic analyses based on different genomic regions were found to support conflicting relationships [6,7,9,16]. This was well illustrated in the Figure 3 of Zhou et al. [7], in which SARS-CoV-2 appeared closely related to RaTG13, RmYN02, RpYN06, and RshSTT200 in the RdRp gene tree (BP = 94), sister to RmYN02 and RpYN06 in the ORF1ab tree (BP = 100) and linked to RaTG13 in the Spike gene tree (BP = 100). Our phylogenetic CGB barcodes showed that these discordances are explained by the mosaic structure of the SARS-CoV-2 genome, in which different regions support different relationships. Although SARS-CoV-2 was closely related to RaTG13 based on four genomic regions, representing only 4.8% of the WG alignment, other genomic regions provided support for 51 other phylogenetic relationships, including its grouping with RpYN06 (two regions; 5.8%), RmYN02 (two regions; 3.8%), RpBANAL103 (one region; 3.3%), RshSTT200 (two regions; 2.3%), etc. ( Figure 5).
Phylogenetic CGB barcodes allowed us to detect that several gene fragments provide conflicting phylogenetic signals. The best example concerns the Spike gene (pos. 21,664-25,557 in our alignment) for which we found conflicting phylogenetic signals supporting the grouping of SARS-CoV  (Figure 3). We conclude, therefore, that phylogenetic CGB barcodes provide much more reliable and accurate information than WG and gene trees for understanding the evolution of sarbecoviruses.

Intermediary Hosts for SARS-CoV and SARS-CoV-2?
While many studies agree that bats are the main reservoir host for sarbecoviruses, the role of other mammals as possible intermediate hosts between bats and humans, such as small carnivores and pangolins, remains unclear and controversial [4][5][6][7]9,14,15,22]. To better understand this issue, we have included in our analyses several sarbecoviruses sequenced from captive mammals, including pangolins, minks and civets (Supplementary Table S1).
In theory, pangolins could be contaminated in their natural habitat by pathogens circulating in horseshoe bats because both taxa are occasionally found together in hollow trees, burrows and possibly caves [14,22]. In contrast to bats, which are considered as asymptomatic for coronavirus, pangolins were found to be highly sensitive to sarbecovirus [23]. In addition, pangolins are not gregarious like Rhinolophus bats; they are solitary species and the female and male meet only for reproduction. Therefore, most pangolins infected by bat sarbecovirus in the wild should be considered as evolutionary dead ends for the virus. However, the situation changed significantly with the intense pangolin trafficking during the decade before COVID-19, as Sunda pangolins (M. javanica) imported illegally into China became infected in captivity [14]. In this way, at least two different pangolin sarbecoviruses were exported from Southeast Asia to China, MjGuangxi before 2017 and MjGuangdong before 2019 [14,[22][23][24]. Despite this, we did not find any evidence of exclusive ancestry between SARS-CoV-2 and the two pangolin viruses. Moreover, the synonymous nucleotide compositions of MjGuangxi and MjGuangdong genomes were found to be similar but divergent from those of SARS-CoV-2 and bat SCoV2rC viruses, suggesting that the two pangolin viruses have evolved independently and for some time in pangolin populations in the wild or in captivity [16]. Although pangolins may be intermediary hosts between bats and humans for some viruses, current data and our findings do not support their involvement in the case of SARS-CoV-2.
Recent studies have provided strong evidence that SARS-CoV-2 was introduced from humans to domestic or captive carnivores, including dogs, cats, lions, tigers and minks, and that the virus evolved very rapidly in mink farms, with several back transmissions from infected animals to humans [25,26]. The common ancestor of human and mink SARS-CoV-2 genomes was supported by high MBP values (82/99), confirming that a strong phylogenetic signal is present in all parts of the alignment (as highlighted by the green colour of GB 400 and GB 2000 barcodes in Figure 2). In addition, we found no genomic region providing support for the paraphyly or polyphyly of SARS-CoV-2. These results therefore confirmed that human and mink SARS-CoV-2 genomes included in our study share the same MRCA, fully isolated genetically from bat Sarbecovirus lineages since its emergence in December 2019 or a few weeks earlier. Our analyses also revealed that different genomic regions can support conflicting relationships between SARS-CoV-2 sequences. The best example concerns the placement of Gamma and Omicron variants, which were related to the Delta variant based on pos. 9851-10,350 or to Alpha variant based on pos. 28,801-29,350. These discordant patterns could be due to past genomic recombination, as several recombinants have already been reported in human populations [11,12].
The masked palm civet (Paguma larvata) was identified as the possible intermediate host transmitting SARS-CoV to humans during the SARS epidemic, which began on November 2002 in Foshan, a city about 20 km from Guangzhou, the largest city of Guangdong province (China) [13,27]. In our analyses, we have included three SARS-CoV-like genomes detected in civets maintained in captivity in the Guangdong province: two came from a wildlife market in Shenzhen before May 2003 (PlSZ3 and PlSZ61) and another was sampled in a restaurant in Guangzhou in 2003 (PlGZ81). In the WG tree of Figure 2, it is worth noting that human and civet SARS-CoVs are enclosed into a robust clade (BP WG = 100) also supported by high MBP values (74/100), which indicates that the phylogenetic signal is strong in all parts of the whole-genome alignment. These results therefore confirmed that human and civet SARS-CoV genomes share the same MRCA, which was fully isolated genetically from bat Sarbecovirus lineages. Although the node grouping the four human SARS-CoVs was not robust in the WG tree (BP WG = 52), it was supported by two GRPS, one including the RBD region (pos. 22,651-24,200 in our alignment) and another covering the C-terminus of the Spike protein and the N-terminus of ORF3a (pos. 25,750). Interestingly, there are two RBD amino-acids characterizing the four human SARS-CoVs included in our alignment, F360 and T487. The last one was found to increase by 20-fold the RBD affinity for human ACE2, suggesting therefore a specific adaptation to enhance human-to-human transmission during the 2002-2003 SARS-CoV outbreak [28]. In contrast, all human and animal viruses sequenced during the 2003-2004 SARS-CoV outbreak had a Serine (instead of Threonine) at this position of the Spike protein [13,28], indicating that this new episode may have resulted from an independent viral invasion from animal to human [29]. Our analyses have shown that several discordant phylogenetic signals involving the para-or polyphyly of human SARS-CoVs were detected in other genomic regions, such as the grouping of the three civet SARS-CoVs with HsRef and HsShanghaiQXC1 or the grouping of PlSZ3 with HsRef, HsShanghaiQXC1 and HsGD01. Therefore, these results suggest that human and civet SARS-CoVs were involved in past genomic recombination events in the same host and that humans and captive civets have exchanged SARS-CoV viruses. In other words, the situation was probably very similar to that recently observed for SARS-CoV-2 between humans and captive minks [25,26]. Due to the low divergence separating human and civet genomes (between 0.16% and 0.41%), however, we cannot rule out two other hypotheses involving either nucleotide homoplasy (due to convergence(s) and reversion(s)) or sequencing and genome assembly errors.

Species Involved as Reservoir Hosts for the Ancestors of SARS-CoV and SARS-CoV-2
As discussed previously [9,10], most conflicting phylogenetic signals in different genomic regions of our Sarbecovirus alignment can be explained by multiple past events of recombination at different periods of time and involving viruses circulating in several reservoir species of horseshoe bats (genus Rhinolophus). Host CGB barcodes obtained for both SARS-CoV and SARS-CoV-2 (Figures 3 and 5) clearly confirm that horseshoe bats are the animal reservoir in which coronaviruses related to either SARS-CoV (SCoVrC) or SARS-CoV-2 (SCoV2rC) evolve and diversify. First, all animal sarbecoviruses showing exclusive ancestry with human sarbecoviruses were extracted from Rhinolophus species. Second, 100% of the phylogenetic CGB barcodes reconstructed for either SARS-CoV or SARS-CoV-2 involved Rhinolophus viruses. The Rhinolophus species showing the highest contribution are R. ferrumequinum (76%), R. sinicus (71%), and R. affinis (58%) for SARS-CoV, and R. pusillus (82%), R. malayanus (81%), R. marshalli (65%) and R. affinis (60%) for SARS-CoV-2 ( Figures 4B and 6B). All these results corroborate high and recent evolutionary dynamics of genomic recombination between sarbecoviruses circulating in several Rhinolophus species. As pointed out in previous publications [9,16,22], recombination implies that whole or partial genomes of two divergent viruses co-exist in a cell of the same host. Such a situation is expected to occur frequently in horseshoe bats because interspecific transmission of sarbecoviruses could be favoured by their behaviour, as several Rhinolophus species often nest in colonies in the same cave, and by their cave habitat, in which viral contamination could be facilitated by promiscuity within and between bat colonies, high humidity levels and cool temperatures all year round.
The ecological niches (i.e., geographic distributions predicted using climatic parameters) of bat sarbecoviruses related to either SARS-CoV (SCoVrC) or SARS-CoV-2 (SCoV2rC) have been reconstructed using an original approach combining genetic data on both viruses and bat species [22]. The results have shown that the ecological niche of SCoVrC extends from Southwest China and northern Myanmar, through northern Vietnam and Central China to East China, Korea and southern Japan, whereas the ecological niche of SCoV2rC includes four main different regions of Southeast Asia: (i) northern Laos and bordering regions; (ii) southern Laos, southwestern Vietnam, and north-eastern Cambodia; (iii) the East region of Thailand and southwestern Cambodia; and (iv) the Dawna Range in central Thailand and south-eastern Myanmar. Since the geographic ranges of SCoVrC and SCoV2rC are different, we assumed that these two groups of viruses generally do not circulate in the same Rhinolophus species assemblages. In agreement with that, geographic CGB barcodes indicate that 84% of the SARS-CoV genome show exclusive ancestry with bat viruses from Southwest China ( Figure 4C), whereas 81% of the SARS-CoV-2 genome show exclusive ancestry with bat viruses from Southeast Asia and Yunnan ( Figure 6C). In addition, host CGB barcodes show that SARS-CoV shares exclusive ancestry with several viruses found in three bat species mainly distributed in China [30]: 19.4% for R. ferrumequinum, 13.1% for R. sinicus, and 1.3% for R. pearsonii ( Figure 4B), whereas SARS-CoV-2 shares exclusive ancestry with several viruses found in three bat species mainly distributed in Southeast Asia or endemic to this region [30]: 6.8% for R. malayanus, 2.3% for R. shameli, and 1.7% for R. marshalli ( Figure 6B).
However, the detection of recombinant viruses between SCoVrC and SCoV2rC lineages, named RecSar [9,16], has revealed that some bats were simultaneously infected by the two divergent virus lineages in the past. Interestingly, different viruses showing exclusive ancestry with either SARS-CoV or SARS-CoV-2 were isolated from R. pusillus (representing 3.3% and 9.1%, respectively) and also from R. affinis (representing 1.0% and 4.8%, respectively). Since these two Rhinolophus species are widely distributed in both China and Southeast Asia [30], their dispersal capacity is expected to be greater than that of other Rhinolophus species endemic to either China or Southeast Asia. Therefore, it can be argued that the rare events of genomic recombination between SCoVrC and SCoV2rC lineages may have occurred in R. affinis and R. pusillus bats and most likely in the region where the ecological niches of SCoVrC and SCoV2rC overlap, i.e., southern Yunnan and adjacent regions in northern Laos, and border regions between Laos and Vietnam [22].

Conclusion and Perspectives
Exclusive ancestry with human sarbecoviruses has currently been found in sarbecoviruses collected in the following Rhinolophus species (Figures 3 and 5): R. ferrumequinum, R. sinicus, and R. pearsonii for SARS-CoV; R. malayanus, R. shameli, and R. marshalli for SARS-CoV-2; and R. pusillus and R. affinis for both SARS-CoV and SARS-CoV-2. These results provide strong evidence that viral transmission within and between bat colonies of these species as well as genomic recombination have participated to the emergence of human sarbecoviruses. However, the lists of involved reservoir species cannot be considered exhaustive due to limited investigations for detecting sarbecoviruses in bats, particularly in Southeast Asia where the diversity of Rhinolophus species is much higher than anywhere else in the Old World [30]. Although genome data currently available support an origin of SARS-CoV in horseshoe bats of Yunnan and an origin of SARS-CoV-2 in horseshoe bats of Yunnan and northern Laos, the two hypotheses need to be further investigated by exploring Sarbecovirus diversity in bats of Cambodia, Laos, Myanmar, Thailand and Vietnam. Indeed, only a few bat sarbecoviruses were recently found in a limited number of localities in northern Cambodia (two highly similar genomes of the same virus collected in one locality) [5], eastern Thailand (five highly similar genomes of the same virus collected in one locality) [6] and northern Laos (five viruses; four localities) [15], and no sarbecovirus has yet been described from bats of Myanmar and Vietnam.
Another problem that limits our interpretation on reservoir hosts is that most studies dealing with the evolution of sarbecoviruses have generally provided very little information about the bats and caves in which the viruses were found. Recently, a new Sarbecovirus genome, RspSC2018 (GenBank accession: MK211374), was described from an unidentified bat at the species level collected from an unknown locality in Sichuan province [31]. Since Illumina reads were not deposited in any of the international Sequence Read Archive (SRA) databases, it was impossible to know from which host species the virus was extracted. Moreover, the quality of the virus genome assembly cannot be verified, which may also compromise some future evolutionary studies.
Our poor knowledge of horseshoe bat taxonomy also hinders understanding of their role as Sarbecovirus reservoir. For instance, some taxonomists have proposed to split R. ferrumequinum into two different species: R. ferrumequinum in Europe and West Asia, and Rhinolophus nippon in East Asia [32]. Similarly, a molecular study based on a set of 1500 nuclear loci has suggested that the R. sinicus complex may contain three different species [33]: R. sinicus, apparently distributed from northern Vietnam to East China, through Hainan Island and Central China; Rhinolophus septentrionalis in Yunnan; and an undescribed species of Rhinolophus in Vietnam. Funding: This research was funded by the "Agence nationale de la recherche" (AAP RA-COVID-19, grant number ANR-21-CO12-0002).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The whole-genome alignment, the five BBC output files, the five SuperTRI MRP files, the five SB trees (SB 400 , SB 500 , SB 600 , SB 1000 , and SB 2000 ), all the 5594 GB barcodes, the two Excel and six CSV files used to construct CGB barcodes, and all the 303 CGB barcodes constructed in this study are available in the Open Science Framework (OSF) platform at https://osf.io/nj57e/. The SWB, BBC, CGB and LFG programs are available at https://github.com/ OpaleRambaud/GB_barcodes_project.