The Novel Halovirus Hardycor1, and the Presence of Active (Induced) Proviruses in Four Haloarchaea

The virus Hardycor1 was isolated in 1998 and infects the haloarchaeon Halorubrum coriense. DNA from a frozen stock (HC1) was sequenced and the viral genome found to be 45,142 bp of dsDNA, probably having redundant, circularly permuted termini. The genome showed little similarity (BLASTn) to known viruses. Only twenty-two of the 53 (41%) predicted proteins were significantly similar to sequences in the NCBI nr protein database (E-value ≤ 10⁻¹⁵). Six caudovirus-like proteins were encoded, including large subunit terminase (TerL), major capsid protein (Mcp) and tape measure protein (Tmp). Hardycor1 was predicted to be a siphovirus (VIRFAM). No close relationship to other viruses was found using phylogenetic tree reconstructions based on TerL and Mcp. Unexpectedly, the sequenced virus stock HC1 also revealed two induced proviruses of the host: a siphovirus (Humcor1) and a pleolipovirus (Humcor2). A re-examination of other similarly sequenced, archival virus stocks revealed induced proviruses of Haloferax volcanii, Haloferax gibbonsii and Haloarcula hispanica, three of which were pleolipoviruses. One provirus (Halfvol2) of Hfx. volcanii showed little similarity (BLASTn) to known viruses and probably represents a novel virus group. The attP sequences of many pleolipoproviruses were found to be embedded in a newly detected coding sequence, split in the provirus state, that spans between genes for integrase and a downstream CxxC-motif protein. This gene might play an important role in regulation of the temperate state.


Introduction
Viruses of prokaryotes are extraordinarily numerous in aquatic environments [1,2], commonly outnumbering cells by a factor of 5 to 10 [3]. They play significant roles in a variety of important biological and biogeochemical processes, including the lysis of cells and the consequent release of organic matter, selective sweeps of prokaryotic populations that drive the evolution of both virus and host, the acceleration of genetic exchange and the redirection of host metabolism [4]. They continue to provide a source of surprising new discoveries and insights, such as the BREX (BacteRiophage EXclusion) defense system [5] of bacteria and the viral counter defense to this based on the DNA mimic protein Ocr [6]. Another impressive example of their ingenuity is the recently described phage-specific peptide communication system, termed the arbitrium system, that allows proviruses to decide between lytic and lysogenic lifestyles [7].
Viruses of diverse morphotypes are known to infect extremely halophilic archaea (class Halobacteria), including caudoviruses such as phiH1 [8] and HF1 [9] (reviewed by [10][11][12][13]); spindle shaped viruses such as His1 [14][15][16]; lipid enveloped pleolipoviruses such as His2 [17,18] and the spherical/icosahedral, membrane-containing sphaerolipoviruses such [...]
Genes 2021, 12, 149
[...] and sequenced. Read assembly produced a high coverage contig of 45,142 bp in length, and circular in form. As shown later, the most parsimonious interpretation of the circular assembly is that the virus genome is packaged as linear dsDNA molecules with ends that are terminally redundant and circularly permuted. A summary of the sequencing results is shown in Table 1. The %G + C of the host, Hrr. coriense, is 66.6% [37], very close to that of the virus. Contig sequences from contaminating DNA in the same virus sample exactly matched the genome of Hrr. coriense, confirming the host.
Analysis of tetramer frequencies revealed the absence of three tetrameric sequences, all of which are palindromic (Table 2). Another three non-palindromic tetramers were strongly avoided. A similar analysis of palindromic 6-mers (excluding those with absent or under-represented tetrameric cores shown in Table 2) found that the viral genome lacks 22 such motifs (Table 3). The results are indicative of a strong selection against numerous 4-6 bp sequence motifs, particularly palindromic motifs. Most likely, this helps to avoid host defences such as restriction-modification (R-M) systems. The host species, Hrr. coriense, has five annotated genes involved in R-M (Supplementary Table S1) and has previously been shown to be dam-methylated [38].

Table 3. Absent palindromic 6-mers in the Hardycor1 genome (1). Column headings: First Base; 6-mer Motifs Not Present in Hardycor1.
(1) Excluding all motifs that include those listed as absent or under-represented in Table 2. All motifs have known restriction enzymes (REBASE, [39]).
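The motif-avoidance analysis described above can be illustrated with a minimal sketch. It assumes a simple exact k-mer count over a circularised sequence; the function names and the toy input are hypothetical, and the authors' actual software and statistical thresholds are not given in this section.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(kmer):
    """A DNA motif is palindromic if it equals its own reverse complement."""
    return kmer == reverse_complement(kmer)


def absent_palindromes(genome, k, circular=True):
    """List palindromic k-mers that never occur in the genome.

    Assumption: the genome is treated as circular by default, so k-mers
    spanning the arbitrary start/end junction are also counted.
    """
    genome = genome.upper()
    if circular:
        genome = genome + genome[: k - 1]
    observed = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(m for m in candidates if is_palindromic(m) and m not in observed)


if __name__ == "__main__":
    toy = "GATCGATC"  # toy sequence, not Hardycor1 data
    missing = absent_palindromes(toy, 4)
    print(len(missing), "palindromic tetramers absent from the toy sequence")
```

Applied to a real genome with k = 4 or k = 6, the returned list corresponds to the kind of absent motifs summarised in Tables 2 and 3, although under-representation (as opposed to complete absence) would need an expected-frequency model that is not sketched here.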
Repeat sequences: A number of related repeats of varying length (20-83 bp) occur in several intergenic regions, upstream of six ORFs (HrrHc1_045, _095, _160, _230, _235 and _245). One of these repeats partly overlaps the start of three CDS (HrrHc1_045, _095 and _235) with the result that the predicted proteins have identical N-termini (MNANT...). There is also a 530 bp direct repeat (nt 21,857-22,916) that spans the borders of three CDS (HrrHc1_135, _140 and _145), with the latter two predicted proteins sharing 106 aa of identical N-terminal sequence.
Annotation of the genome sequence revealed 53 CDS, and the map displayed in Figure 1 represents the unit genome in linear form with the starting base chosen for its proximity to the large subunit terminase gene, terL (hrrhc1_030) but placed upstream of the five closely spaced genes that precede terL because they were in the same orientation, had overlapping CDS and are likely to be transcribed together. Genes are generally closely spaced, with 27 genes (50%) overlapping at their start and stop codons, and a further 13 genes (24%) separated by 10 bp or less. Most genes are oriented inwards to a point around 27 kb (Figure 1b,c). This broad organizational pattern is reflected in the cumulative AT-skew plot shown above (panel a), which displays a major inflection at this point, and falls steadily to either side except for short regions corresponding to local reversals in gene orientation.
Figure 1 abbreviations: terL, large subunit terminase; por, portal protein; muf, MuF-family head morphogenesis protein (SPP1 gp7 family); mcp, major capsid protein; nep1, neck protein of type 1; tmp, tape measure protein; hjc, Holliday junction resolvase Hjc; aaa, AAA ATPase; vwa, Von Willebrand factor type A (vWA) interaction domain that includes a metal ion-dependent adhesion site (MIDAS). Asterisks immediately below gene arrows indicate the predicted hypothetical proteins contain CxxC motifs. The locus tags (e.g., HrrHc1_005) of several genes are shown below the gene map. Scale at bottom represents DNA length in kb.
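A cumulative AT-skew profile of the kind referred to above can be sketched as follows. This is an illustrative calculation only: the non-overlapping window scheme and the window size are assumptions, since the parameters used for Figure 1 are not stated in this section.

```python
def cumulative_at_skew(seq, window=100):
    """Return the running sum of AT-skew over non-overlapping windows.

    AT-skew per window is (A - T) / (A + T). Windows without A or T
    contribute zero. The window size is an assumed, adjustable parameter.
    """
    seq = seq.upper()
    total = 0.0
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        a_count = chunk.count("A")
        t_count = chunk.count("T")
        denom = a_count + t_count
        skew = (a_count - t_count) / denom if denom else 0.0
        total += skew
        values.append(total)
    return values


if __name__ == "__main__":
    # Toy sequence: A-rich first half, T-rich second half.
    profile = cumulative_at_skew("A" * 10 + "T" * 10, window=5)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print(profile, "inflection at window", peak)
```

In such a profile, a change in the dominant coding strand shows up as a turning point of the cumulative curve, which is the behaviour described for the region around 27 kb.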
The predicted Hardycor1 proteome was submitted to VIRFAM [34], which classified four of the inferred proteins as caudovirus homologs (TerL, Portal, MCP and Nep1) and predicted Hardycor1 was most likely a siphovirus.
A BLASTn search of the GenBank nr/nt nucleotide database restricted in scope to sequences from Halobacteria (taxid:183963) + Viruses (taxid:10239), and at a lower stringency than before (E-value ≤ 10⁻¹⁰), returned two short matches (Table 4), one of which was previously mentioned. These were to tmp (hrrhc1_120), the gene encoding the tape measure protein, and hrrhc1_160, which specifies a hypothetical protein. The top match for tmp was to a 540 bp region within an annotated tape measure protein gene (HPS36_14875) carried on the chromosome of Halorubrum sp. strain RHB-C. The top match to hrrhc1_160 was the halovirus HCTV-1 gene DNAM5_77. HCTV-1 is a siphovirus infecting Har. californiae [12].
Nucleotide sequence similarity of Hardycor1 with the genomes of 23 known tailed haloviruses is presented as a dot plot in Figure 2. Related viruses are clearly detected as lines of similarity parallel to the main diagonal, such as members of the Myohalovirus genus (ChaoS9, phiCh1 and phiH1; lower left corner), and members of the Haloferacalesvirus genus (HF1 to HRTV-8; near upper right corner). Hardycor1 (blue triangle) shows little or no sequence similarity to any of the other haloviruses.
The low nucleotide sequence similarity of Hardycor1 to other tailed haloviruses prevents any meaningful alignment or phylogenetic inferences; however, whole genome similarity values are useful to define viral taxa. Figure 3 shows a heat map of intergenomic similarities of tailed haloviruses, produced using the VIRIDIC suite of programs [40]. Values are calculated using the traditional algorithm recommended by the International Committee on Taxonomy of Viruses (ICTV), Bacterial and Archaeal Viruses Subcommittee. In this scheme, members of the same species share ≥95% nt similarity, while members of the same genus share more than about 70% nt similarity, although more recently, the ICTV have suggested a threshold of ~50% nt similarity for caudoviruses [41]. Hardycor1 shows negligible similarity (0-2%) to the other 23 virus genomes and represents a novel species and genus. An independently described algorithm, VICTOR [42], calculates similarities of viral genomes based on nucleotide or protein sequences, and in both cases, Hardycor1 was predicted to represent a novel species and novel genus (Supplementary Table S2).
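The species and genus thresholds quoted above can be expressed as a small decision rule. The sketch below is only an illustration of how the cut-offs are applied to a pairwise similarity value; the function name and return labels are hypothetical, and it does not reproduce the VIRIDIC similarity calculation itself.

```python
def classify_pair(similarity_pct, genus_threshold=70.0, species_threshold=95.0):
    """Interpret a pairwise intergenomic nucleotide similarity (in percent).

    Defaults follow the thresholds quoted in the text: >=95% for the same
    species and more than about 70% for the same genus. Passing
    genus_threshold=50.0 applies the more recently suggested cut-off.
    """
    if similarity_pct >= species_threshold:
        return "same species"
    if similarity_pct > genus_threshold:
        return "same genus"
    return "different genus"


if __name__ == "__main__":
    # Hardycor1 shares 0-2% similarity with the other 23 genomes.
    print(classify_pair(2.0))                        # different genus
    print(classify_pair(60.0, genus_threshold=50.0))  # same genus
```

With either genus threshold, values in the 0-2% range fall far below the cut-off, which is why the choice between 70% and 50% does not affect the conclusion for Hardycor1.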

Annotation and Predicted Proteins
Twenty-two of the 53 annotated proteins (41%) returned significant matches (BLASTp, E-value ≤ 10⁻¹⁵) to protein sequences of the NCBI nr database, and the top matches are shown in Table 5. Fourteen matched the proteins of various species of haloarchaea, six matched proteins of three haloviruses (HCTV-1, HHTV-1 and HCTV-2), and two matched bacterial proteins. The three haloviruses with similar proteins are all siphoviruses with linear, circularly permuted dsDNA genomes, and infect species of Haloarcula [12,13].
Table 5 notes: (a) BLASTp searches (E-value ≤ 10⁻¹⁵, January 2021) against the NCBI nr protein database, with matches specified by their locus_tag followed by species or virus (in square brackets). Accessions for haloviruses HCTV-2, HHTV-2 and HHTV-1 are given in Figure 3. (b) H-J, Holliday Junction.
The presence of conserved protein domains and characteristic VIRFAM profiles of virus proteins [34] allowed functional assignments for eight proteins, revealing that the first 27 kb of the Hardycor1 genome carries genes encoding key proteins of caudoviruses, including the large subunit terminase (TerL), portal protein (Por), major capsid protein (Mcp) and tape measure protein (Tmp). A muf gene is found just downstream of the portal protein gene (por), and specifies a MuF (SPP1 gp7) family protein of the longer type [43]. MuF proteins have been reported to have a number of functions in different viruses, such as protecting the ends of viral DNA from nuclease attack when entering a host cell. The close gene spacing and typical arrangement of viral genes identified this region as being responsible for DNA packaging, virus assembly and morphogenesis. The absence of genes for tail-sheath or base-plate J proteins is consistent with the VIRFAM prediction that Hardycor1 is a siphovirus. Upstream of terL is a dam gene (hrrhc1_020) encoding a putative N-6-adenine-methyltransferase (Dam).
A strongly conserved feature among caudoviruses is a pair of genes upstream of the tape measure protein gene (tmp) that encode two related chaperone proteins via programmed ribosomal frameshifting [44,45]. In Hardycor1, these correspond to HrrHc1_110 and HrrHc1_115, and a classical −1 slippery sequence of the type X XXY YYZ is found at the appropriate position near the end of HrrHc1_110 (nt 17589-17595; G GGA AAT) that would allow translation of a protein encompassing the CDS of both genes. A similar protein to HrrHc1_110 is found in Natronolimnobius (NGM69196.1; 34% aa identity), is encoded by a gene upstream of a tape measure protein gene and also contains a classical −1 slippery sequence near its 3′ end (G GGA AAG, nt 226569-226575, AAKXY010000003.1). The inferred tape measure protein (HrrHc1_120) of Hardycor1 is 703 aa long, and would predict a tail length of about 84 nm using the formula described by [46].
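A simple pattern search for candidate −1 slippery sites of the X XXY YYZ type can be sketched as below. The pattern is an assumption made for illustration: X is any base repeated three times, Y is A or T repeated three times, and Z is left unrestricted so that both examples quoted above (G GGA AAT and G GGA AAG) are matched. The sketch does not check the reading frame or proximity to the 3′ end of a gene, which a real analysis would need.

```python
import re

# Lookahead allows overlapping candidates to be reported.
# Group 2 = X (three identical bases), group 3 = Y (three identical A or T),
# final base Z is unrestricted in this sketch (an assumption).
SLIPPERY = re.compile(r"(?=(([ACGT])\2{2}([AT])\3{2}[ACGT]))")


def find_slippery_sites(dna):
    """Return (0-based start, heptamer) for every X XXY YYZ candidate."""
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(dna.upper())]


if __name__ == "__main__":
    # Toy sequences embedding the two heptamers quoted in the text.
    print(find_slippery_sites("CCGGGAAATCC"))
    print(find_slippery_sites("ccgggaaagcc"))
```

Candidate heptamers found this way would still need to be inspected in the context of the two overlapping CDS, as was done for HrrHc1_110 and HrrHc1_115.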
Genes hrrhc1_050 and hrrhc1_095 are unusual, as they are found on the complementary strand to the other genes in this region (Figure 1b,c). However, both are predicted by GeneMarkS2 and by Glimmer3 (see Methods), and both proteins have features consistent with other haloviral proteins. For example, HrrHc1_050 protein contains two CxxC motifs [22] and a predicted C-terminal membrane spanning domain, and HrrHc1_095 protein has a pI of 4.24 and an over-abundance of Asp residues, typical features of haloarchaeal proteins [47]. Manual examination of alternative ORFs in the regions of HrrHc1_050 and HrrHc1_095 did not reveal any that were more likely.
The right end (27-45 kb) consists largely of genes specifying proteins of unknown function (yellow in Figure 1), even though seven of these are similar to proteins of haloarchaea or viruses (Table 5). All but the last two genes face inwards, an organisation similar to that of bacterial siphoviruses [48]. Like many siphoviruses, this region includes a gene specifying a Holliday junction resolvase Hjc (HrrHc1_175), an endonuclease that among other roles acts in debranching DNA structures to allow packaging of the viral genome into capsids [49,50]. Two other proteins specified by genes in the replication/accessory module of the genome have conserved functional domains. The 711 aa protein HrrHc1_245 is predicted to carry a von Willebrand factor A domain (vWA) and a metal ion-dependent adhesion site (MIDAS) domain (Table 5). Such domains often function in protein-protein interactions [51]. The encoding gene is situated next to a gene encoding an AAA ATPase (HrrHc1_240), an arrangement that is commonly found in bacteria and archaea [52,53]. Similar vWA domain proteins have been reported previously in siphoviruses of haloarchaea: HCTV-2, HHTV-2 and HVTV-1 [11][12][13]. In the case of HVTV-1, the corresponding gene is also near to a gene for an AAA ATPase, and in the same orientation. The close proximity of genes encoding an AAA ATPase and a vWA-MIDAS domain protein has been reported in thermophilic archaeal viruses, such as Acidianus Two-Tailed Virus (ATV) [54]. The interaction between AAA ATPase and a vWA-MIDAS domain protein has been closely studied in several cases, and a common finding is that the vWA-MIDAS domain protein provides an adaptor function, while the AAA ATPase acts as a chaperone [55,56].
Four annotated proteins have predicted transmembrane domains (TMD) or a signal sequence (HrrHc1_050, _150, _155 and _165). Ten hypothetical proteins contain one or more CxxC motifs (asterisked in Figure 1c), a signature feature of zinc-finger domain proteins [57]. HrrHc1_050 is a 104 aa long, CxxC motif-containing protein that carries a strongly predicted TMD near its C-terminus. The gene encoding this protein is located on the minus-strand, unlike the other genes around it (Figure 1). HrrHc1_150 has a predicted signal sequence, and is the only annotated protein to do so. HrrHc1_155 possesses three evenly spaced TMDs (Phobius) and shares this and other similarities with the well-studied S105 holin of lambda [58,59]. HrrHc1_150 and HrrHc1_155 are separated by only 11 bp. The fourth protein, HrrHc1_165, has a TMD near its N-terminus, but this is not predicted to be a signal sequence. The gene is located just before the major switch in coding strand that occurs around 27 kb (Figure 1), and its inferred protein matched several similar sequences in the NCBI nr protein database, although none have an annotated function.
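Counting CxxC motifs in a predicted protein is straightforward, and a minimal sketch is given below. The function name and the toy peptide are hypothetical; the sketch counts every Cys-x-x-Cys occurrence, including overlapping ones, and makes no claim about whether a given motif forms a functional zinc finger.

```python
import re

# Lookahead so that overlapping motifs (e.g. CxxCxxC) are each counted.
CXXC = re.compile(r"(?=C..C)")


def count_cxxc(protein):
    """Count CxxC motifs (cysteine, any two residues, cysteine)."""
    return sum(1 for _ in CXXC.finditer(protein.upper()))


if __name__ == "__main__":
    toy_peptide = "MACGHCLLLCPRC"  # invented sequence, not a Hardycor1 protein
    print(count_cxxc(toy_peptide))
```

Run over a set of translated CDS, proteins with a count of one or more correspond to those marked with asterisks in the gene map.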

Protein-Based Phylogenetic Analyses
The large subunit terminase (TerL) is highly conserved in caudoviruses and has often been used to infer phylogeny [60][61][62]. The Hardycor1 TerL sequence showed significant similarity to numerous homologues present in the NCBI nr protein database. A phylogenetic tree reconstruction is presented in Figure 4 and shows that the Hardycor1 protein clusters with TerL sequences of haloarchaea and haloviruses but is distinct and branches just outside the other members of this clade.
The major capsid protein is also commonly used to infer viral phylogeny [63], but the Hardycor1 Mcp shows low similarity to known homologs (≤31% aa identity), and the top four BLASTp hits were to a wide variety of organisms, including an oceanic (bacterial) virus (QDP55370.1) and three diverse taxa of bacteria (Pseudomonas, Bacteroidetes, Paenibacillus). Without more examples of specifically related relatives, no meaningful phylogenetic inferences are possible based on Mcp trees.
A whole proteome-based phylogenetic reconstruction is presented in Figure 5, and shows Hardycor1 branches deeply and is not closely related to other known tailed haloviruses.

Match to CRISPR Spacer
The Hardycor1 genome was used to search for CRISPR spacer matches at the IMG/VR and CRISPRfinder websites (accessed December 10, 2020; see Methods). Only one significant match was found (Supplementary Table S3), to a 40 nt region (nt 17635-17674) found between the two annotated CDS immediately upstream of the gene encoding tape measure protein (tmp). The source of the spacer sequence was a halite endolithic microbial community found in the Atacama Desert, Chile [65].
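Locating a spacer within a viral genome can be sketched as a two-strand string search. This is a simplification made for illustration: only exact matches are reported, whereas the database searches described above may tolerate mismatches, and the function name and toy sequences are hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_spacer(genome, spacer):
    """Return (start, end, strand) for exact spacer matches.

    Coordinates are 1-based and inclusive on the forward strand, in the
    same style as the nt ranges quoted in the text.
    """
    genome, spacer = genome.upper(), spacer.upper()
    hits = []
    for strand, query in (("+", spacer), ("-", reverse_complement(spacer))):
        start = genome.find(query)
        while start != -1:
            hits.append((start + 1, start + len(query), strand))
            start = genome.find(query, start + 1)
    return hits


if __name__ == "__main__":
    toy_genome = "TTTACGGATTT"  # invented sequence
    print(find_spacer(toy_genome, "ACGGA"))
    print(find_spacer(toy_genome, "TCCGT"))
```

A hit reported this way gives the protospacer position, which can then be compared with the annotated CDS coordinates to see whether it falls within or between genes.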

Active Proviruses of Hrr. coriense
Assembly of sequence reads of the Hardycor1 virus stock (HC1) revealed another circular contig, distinct from Hardycor1, that was 11,758 bp in length with a high read coverage, and matched a region on contig 20 of the Hrr. coriense draft genome (Table 6). The circular nature of this contig indicated it was an extrachromosomal element and not simply an amplified fragment of host chromosomal DNA. Its gene content indicated it was a virus, closely related to pleolipoviruses such as HRPV-6 (Figure 6), and was designated Humcor2.
Figure 5. Phylogenetic tree reconstruction of tailed haloviruses inferred from viral proteomes using the Genome-BLAST Distance Phylogeny method (GBDP) under optimal settings (formula VICTOR d6), as implemented at the DSMZ webserver [42,64]. Percentage support values above 60% are shown near the branch points. The branch lengths are scaled in terms of the GBDP distance formula d6 [42]. Tree scale (0.1) is indicated by the bar. Accessions are given at the right in brackets.

A map of Humcor2 is shown in Figure 6. It begins just after the 3′ end of tRNA-Pro gene C464_t04328 and ends after the 13 bp att sequence, which is identical to the 3′ end of the same tRNA. Near the att sequence is a gene coding for an integrase, a typical pattern for integrative prokaryotic viruses, including haloviruses [10,66]. Currently, the only available genome sequence of Hrr. coriense is a draft consisting of 69 contigs (accession GCF_000337035 [67]), and a nucleotide alignment with Humcor2 revealed it to be identical to Hrr. coriense except for three separate 90 bp long artifactual direct repeats in the Hrr. coriense draft genome sequence, most likely assembly errors due to poor quality reads. Many of the predicted proteins of Humcor2 are similar to those of alphapleolipoviruses, such as HRPV-6 (Halorubrum pleomorphic virus 6) and its close relative HRPV-2 (Halorubrum pleomorphic virus 2) [17]. In summary, Humcor2 represents the extrachromosomal form of a pleolipoviral provirus, and most likely originates from virions. Both alpha and beta pleolipoviruses have circular dsDNA genomes [17].
Table 6 (residue): Matches Hfx. volcanii DS2ᵀ (nt 1307486-1294960). (a) All from virus stocks except Halfvol3, which was found to excise using publicly available sequence data (see Section 3.6). (b) These virus stocks were described previously in [22].
Humcor1 (Table 6) is a second provirus of Hrr. coriense that was found in the assembled sequence reads from a different archival virus stock, labelled CC1. This virus isolate was recovered from Cheetham saltern (38°09′23.5″S 144°25′41″E) in 1998 and infected Hrr. coriense Ch2ᵀ. The provirus Humcor1 present in this stock assembled as a circular contig (Figure 7) and carries many genes that are characteristic of caudoviruses including genes encoding large subunit terminase (TerL), portal protein (Por), major capsid protein (Mcp) and tape measure protein (Tmp). No genes for tail sheath, base plate or tail fibres were detected, so it is most likely of the siphovirus type, as supported by VIRFAM typing of the head proteins (see Methods), which predicted Humcor1 as a siphovirus and also identified the Nep1 homologue (HK97 gp10 family phage protein) as C464_06210. Related proviruses are found in the genome sequences of Hrr. aidingense JCM 13560 and Halobonum sp. NJ-3-1 (red and green rings of Figure 7). The most closely related halovirus is BJ1 (Figure 7, blue ring), which shared a similar large subunit terminase (53% aa identity) as well as many of the accessory genes, such as an integrase, MCM and a strongly similar (82% aa identity) homologue of the hypothetical protein C464_06065 (nt 34838-35560). The circular assembly of Humcor1 is most likely due to head-full packaging of the virus genome, which produces a population that is circularly permuted and terminally redundant. This is also the case with halovirus BJ1 [68].

Figure 7 (caption fragment). The predicted protein products of several genes are given, with names shown next to their encoding gene. For more details see the text. Similarity (tBLASTx; E-value ≤ 10⁻¹⁰) to the two proviruses and halovirus BJ1 are shown as inner, coloured rings, with the colour key displayed at the top left. Below the similarity rings is a plot of GC content (black), with higher than average GC content depicted as outward pointing peaks, and lower than average GC content as inwardly pointing peaks. The scale, in kb, is shown at the outer perimeter.

Proviruses Present in Virus Stocks from Other Haloarchaeal Hosts
Searches for proviruses were made using the sequence data collected from six other archival virus stocks that, like HC1, had been stored frozen since 1998 and analysed by sequencing in the same manner (see Methods). Four active proviruses were detected (Halfgib1, Harhisp1, Halfvol1 and Halfvol2) that matched chromosomal regions of Hfx. gibbonsii Ma2.38ᵀ, Haloarcula hispanica Y27ᵀ and Haloferax volcanii DS2ᵀ (Table 6). All assembled to circular contigs with high read coverage. Three are pleolipoviruses, and the fourth (Halfvol2) represents a novel virus group, and was previously unsuspected in the Hfx. volcanii genome.
Halfgib1 is found to be integrated near the end of tRNA-Arg (C454_t15621) in the Hfx. gibbonsii Ma2.38ᵀ genome. The Halfgib1 sequence aligned near perfectly to the draft Hfx. gibbonsii genome sequence, except that the draft genome sequence across this region contains two separate 90 bp direct repeats, most likely due to misassembly. Similar errors were mentioned earlier in the Hrr. coriense draft genome, which was part of the same sequencing study [69]. The closely related ARA6 strain of Hfx. gibbonsii has no integrated provirus at this tRNA. In Hfx. volcanii DS2, there is a 12.5 kb provirus present in the corresponding tRNA (CP001956.1, nt 1294959-1307485), as reported previously [70,71]. We denote this as Halfvol3, and it shares high (>90%) nucleotide similarity with Halfgib1 (Supplementary Figure S1) and both encode predicted proteins that show similarity to proteins of pleolipoviruses [17].
Harhisp1 assembled as a circular contig from reads recovered from halovirus stock HH1, which was produced from Har. hispanica. The contig matched a provirus integrated at tRNA-Ala (HISP_14435) of the host chromosome and encompassed genes hisp_14430 to hisp_14315. This region had previously been identified as being related to betapleolipoviruses, such as HHPV3 (see Figure 6 of [72]), and its excision from the chromosome had been detected using PCR amplification across the predicted attP region [71].
Halfvol1 and Halfvol2 were recovered as circular contigs from sequence reads derived from the virus stock HV2, which was produced from cells of Hfx. volcanii. Their summary characteristics are shown in Table 6. Halfvol1 matched one of two proviruses on the Hfx. volcanii chromosome that had been pointed out in earlier studies [70,71], and is found to be integrated at the tRNA-Pro gene (HVO_3017). It is affiliated with the betapleolipovirus group, and a closely related but smaller provirus (14,675 bp) is found to be integrated in the corresponding tRNA-Pro of Hfx. volcanii strain SS0101 (VMTR00000000.1, Supplementary Figure S2).
Halfvol2 had not previously been recognised as a provirus because its encoded proteins do not show significant matches to known viruses. It was first reported at the 2019 Halophiles conference [73]. The genome size, circular form and the ten annotated proteins with predicted transmembrane domains (asterisked in Figure 8) suggest it may be a lipid enveloped virus belonging to a novel virus group. Related proviruses are found in other haloarchaea (Supplementary Figure S3), and an example of similar size (12,732 bp) found in Hfx. volcanii SS0101 is compared to Halfvol2 in Figure 8. It is integrated in the corresponding tRNA-Ala of that strain. Searches of the ArcPP proteome database [74] revealed that the proteins expressed from several genes of both Halfvol1 and Halfvol2 have been detected in Hfx. volcanii. For example, HVO_0271 of Halfvol1 (dataset PXD011015) corresponds to the virus structural protein VP4 of Halogeometricum pleomorphic virus 1 (HGPV-1), and was detected in enriched fractions of cell surface proteins (archaella/pilins) after partial purification by CsCl centrifugation [75].
We designate the remaining previously described provirus of Hfx. volcanii as Halfvol3 (CP001956, nt 1294959-1307485), and for convenience, it is also shown in Table 6. While most of the genes of Halfvol3 have been shown to be transcribed [76], only one of the predicted proteins (HVO_0143) has been detected in proteomic studies (datasets PXD006877, PXD009116 and PXD011056) available from the ArcPP database [74].
No contig matching the length and sequence of Halfvol3 was produced by de novo assembly of the HV2 reads, and mapping of reads to Halfvol3 showed that only a small number were present, resulting in patchy coverage. However, reads spanning the termini were present, indicating that the element can excise and circularise. As added support for these findings, we examined the publicly available sequence archives of previous genomic sequencing studies of Hfx. volcanii and also found reads traversing the joined termini of Halfvol1 and Halfvol3. Two examples are given in Supplementary Tables S4A,B and the accompanying Supplementary Figures S4-S6. In the case of Halfvol1, the read coverage of the provirus region was significantly higher than the read coverage outside of the provirus (Supplementary Figure S6), indicating a high level of induced virus in strain Hv1 (sequenced in the study of [77]). Reads traversing the circularised termini of Halfvol2 were not found. In a recent study on hypermotile mutants of Hfx. volcanii, the deletion of Halfvol3 was detected as a secondary genome alteration in one of the analysed strains [78].
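The logic of looking for reads that traverse the joined termini can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes exact string matching rather than a read aligner, and it assumes that the `provirus` string is the excised element with a single copy of the att core. The function names and the `min_overlap` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def spans_junction(read, provirus, min_overlap=10):
    """True if the read crosses the end-to-start join of the circular form.

    Assumption: exact matches only. Real reads contain errors and would
    normally be mapped with an aligner to a reference that includes the join.
    """
    n = len(provirus)
    # Doubling the sequence places the end-to-start join at index n.
    doubled = (provirus + provirus).upper()
    for candidate in (read.upper(), reverse_complement(read.upper())):
        pos = doubled.find(candidate)
        while pos != -1 and pos < n:
            left = n - pos                    # read bases before the join
            right = pos + len(candidate) - n  # read bases after the join
            if left >= min_overlap and right >= min_overlap:
                return True
            pos = doubled.find(candidate, pos + 1)
    return False
```

A read taken from the interior of the element returns False, whereas a read made of the last few bases followed by the first few bases returns True on either strand. Such reads can only arise from excised, circularised (or concatemeric) DNA, which is why they are taken as evidence of excision.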

A CDS Frequently Encompasses the attP Sequence of Pleolipovirus-Like Proviruses
Curiously, the attP sequences of the pleolipovirus-like proviruses described in this study are all found within a putative CDS that begins just downstream of the integrase gene and ends adjacent to, or overlapping, the next CDS (a CxxC motif protein) in the circular form of their genomes (Figure 9). In all cases, the three adjacent CDS are on the same DNA strand, and so closely spaced that they may be transcribed together. A bridging CDS also occurs in the novel virus Halfvol2, as well as in viruses and proviruses reported in earlier studies, such as SNJ2 and HRPV-9 (Figure 9). Although the lengths and inferred protein sequences of these genes vary, they regularly span the region between the genes for integrase and CxxC protein, genes that would be widely separated in the provirus state. Since the CDS is only complete when the virus genome is circularised, there is an obvious potential for switching off its activity upon provirus integration into the host genome.
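The idea of a CDS that is complete only in the circular form can be sketched as a search for open reading frames that cross the join between the two ends of the integrated (linear) sequence. The Python below is an illustrative sketch, not the gene-calling method used in this study. It assumes ATG as the only start codon, scans the forward strand only (the reverse complement can be passed for the other strand), and uses hypothetical names and a hypothetical `min_codons` threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_spanning_junction(linear_provirus, min_codons=30):
    """Find forward-strand ORFs that exist only when the ends are joined.

    Returns (start, end, codons) tuples: 0-based start in the linear
    sequence, end coordinate (exclusive, including the stop codon) after
    wrapping past the join, and codon count excluding the stop codon.
    """
    seq = linear_provirus.upper()
    n = len(seq)
    doubled = seq + seq  # the join between the ends sits at index n
    hits = []
    seen_ends = set()    # keep only the earliest start for each stop codon
    for start in range(n):
        if doubled[start:start + 3] != "ATG":
            continue
        pos = start
        # Never read further than one full circumference.
        while pos + 3 <= start + n:
            if doubled[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                codons = (end - start) // 3 - 1
                if end > n and codons >= min_codons and end not in seen_ends:
                    seen_ends.add(end)
                    hits.append((start, end - n, codons))
                break
            pos += 3
    return hits
```

An ORF reported by this function starts near one end of the integrated element and finishes near the other, so in the provirus state its two parts lie at opposite ends of the insertion and no complete product can be made.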

Discussion and Conclusions
Hardycor1 was isolated 22 years ago and is a lytic halovirus infecting Halorubrum coriense strain Ch2T, but its genetic makeup was unknown until this study. It was found to have a 45,142 bp dsDNA genome encoding proteins that are typical of siphoviruses, a classification supported by gene organization, the presence of a tmp gene that predicts a tail length of 84 nm, and the absence of genes for tail sheath or baseplate J family protein (BpJ) [79]. This classification was confirmed by the conserved features of its head and neck proteins (VIRFAM). At the DNA sequence level, Hardycor1 shares little similarity with other described haloviruses, and standard comparisons show that it represents a novel species and genus. Inferred phylogenies using conserved proteins such as terminase (TerL) and the major capsid protein (Mcp) also support this conclusion.
The genome is most likely linear and packaged in a head-full manner that produces terminally redundant, circularly permuted ends. This was supported by protein similarities and protein phylogenetic tree reconstructions that indicated a distant relationship to haloviruses HCTV-2 and HHTV-2, both of which are siphoviruses with dsDNA genomes that are circularly permuted and terminally redundant [13]. Consistent with the view that Hardycor1 leads a lytic lifestyle, the genome does not carry a tRNA-like attP sequence or a gene for a site-specific integrase, and it lacks a gene for a DNA replicase. The absence of replicase genes means that the virus is dependent on host enzymes for this process, and in this respect, Hardycor1 is similar to HHTV-2 [12].
The viral genome has undergone strong selection against palindromic motifs, as evidenced by the absence of three tetrameric motifs (AGCT, CTAG and TGCA). This is a common finding among prokaryotic viruses [22], and protects the viral DNA from attack by sequence-specific defences of host species, such as restriction-modification (R-M) systems that are widespread in Halobacteria [80]. Hrr. coriense alone carries five genes predicted to encode R-M enzymes including a Dam methylase and two restriction endonucleases (Mrr) that target methylated DNA.
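The kind of motif survey described above can be sketched in a few lines of Python. This is an illustration only, not the analysis pipeline used in this study. The expected count uses a simple base-composition model (an assumption; Markov-chain models are often preferred), the sequence is treated as linear, and all function names are hypothetical.

```python
from itertools import product


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def palindromic_tetramers():
    """All 16 tetramers that equal their own reverse complement."""
    tetramers = ("".join(p) for p in product("ACGT", repeat=4))
    return sorted(t for t in tetramers if t == reverse_complement(t))


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def expected_count(seq, motif):
    """Expected occurrences if bases were independent (composition model)."""
    windows = len(seq) - len(motif) + 1
    probability = 1.0
    for base in motif:
        probability *= seq.count(base) / len(seq)
    return windows * probability


def absent_palindromes(genome):
    """Palindromic tetramers that never occur in the genome."""
    genome = genome.upper()
    return [m for m in palindromic_tetramers()
            if count_overlapping(genome, m) == 0]
```

Applied to a genome sequence, `absent_palindromes` lists the palindromic tetramers with zero occurrences, and comparing `count_overlapping` with `expected_count` for a motif gives a rough observed/expected ratio for judging under-representation.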
Three other haloviruses that infect Hrr. coriense have been described previously, the myoviruses HF2, Hardycor2 and Serpecor1 [9,22,81,82]. Like Hardycor1, their dsDNA genomes lack CTAG motifs and are under-represented in TGCA. Unlike Hardycor1, they lack the motif GATC and have the expected frequency of AGCT. Differences in the under-representation of palindromic (and non-palindromic) motifs seen in the genomes of viruses infecting the same host probably reflect distinct evolutionary histories, such as differing alternate host species, but could also be modulated by the defence systems carried by each virus. For example, DNA methylase genes are carried by Hardycor1 (hrrhc1_020), HF1 and Hardycor2 but not by Serpecor1 [22].
The overall pattern of gene organisation in Hardycor1 is typical of many siphoviruses [48]. Genes are generally oriented towards the centre, with a transition point at around 27 kb, and this divides the genome into two major regions that are functionally distinct. The left region carries genes for DNA packaging (TerL) and virus assembly (head and tail proteins). At the inner end of this region, near the major switch in gene orientation at 27 kb, there are three genes specifying proteins that contain transmembrane domains (hrrhc1_155, hrrhc1_160 and hrrhc1_165), which could represent the lysis module. They occupy the region in which holin and lysin genes are found in many siphoviruses [48], genes that are usually transcribed late in the infection cycle. In bacterial viruses, these proteins are regulated so that cell lysis only occurs after virion assembly has been completed [59].
The right end of the genome (27-45 kb) is designated as the replication and accessory gene region but has many genes specifying proteins of unknown function. In well-studied siphoviruses, the corresponding region is transcribed early in infection and carries a variety of genes involved in the evasion of host defences, genome replication and the alteration of host metabolism to enhance virus production [48]. In Hardycor1, only a few genes code for proteins with conserved domains indicative of function, such as a Holliday junction resolvase (Hjc), an AAA ATPase and a von Willebrand factor type A (vWA) interaction domain protein that includes a metal ion-dependent adhesion site (MIDAS). Genes for Holliday junction resolvases are widespread in caudoviruses of bacteria [49,83], and occur in some haloarchaeal viruses, such as HCTV-2 [12], as well as other archaeal viruses [84]. Hjc functions to resolve recombination intermediates but can also debranch DNA prior to packaging as well as degrade host DNA [83,84].
The close association of genes for AAA ATPase and vWA-MIDAS proteins has been well documented [52], and examples are known among archaeal viruses such as Acidianus Two-Tailed Virus [54] and the haloarchaeal siphovirus HVTV-1 [11]. The functions of these proteins are unclear, but it is thought that vWA domain proteins interact with and assist the activity of AAA ATPases, which may function as chaperones [54].
Seven genes of the replication and accessory gene region code for proteins containing CxxC motifs, a signature feature of zinc-finger (ZF) domains that are commonly involved in interaction modules, such as DNA binding [85]. Such proteins are often small, commonly occur in haloviruses and are most frequently encoded by genes situated outside of the virus assembly module [22]. In Hfx. volcanii, small CxxC motif-containing proteins have been shown to be important in a variety of phenotypes, including stress adaptation, biofilm formation and swarming [86].
Two genes within the virus assembly module of the Hardycor1 genome are unusual (hrrhc1_050 and hrrhc1_095), as they are found on the strand complementary to the other genes. This is uncommon, as the genes for head and tail proteins are typically closely spaced and all in the same orientation [48], allowing them to be transcribed together [81,82]. However, hrrhc1_050 and hrrhc1_095 are predicted independently by two gene callers (GeneMarkS2 and Glimmer3; see Methods), and the specified proteins have characteristics similar to other viral proteins. HrrHc1_050 contains two CxxC motifs [22,86] and a C-terminal membrane-spanning domain, while the HrrHc1_095 protein has a low pI of 4.24 and an over-abundance of Asp residues, typical features of haloarchaeal proteins [47]. Alternative CDS options on the other strand are less likely. It is unclear why Hardycor1 is organised in this way. The haloarchaeal siphovirus HCTV-5 also has two CDS encoded on the opposite strand in the head and tail assembly module (HCTV5_113 and HCTV5_115).
The frequent presence of active proviruses in archival virus stocks was unexpected and facilitated the discovery of a novel and previously undocumented provirus (Halfvol2). The genome sequences of six induced proviruses were determined at high read coverage from five different virus stocks, and evidence for the presence of one more (Halfvol3) was detected using publicly available sequence read data. In the latter case, the increased read coverage of the virus sequence also indicated a high level of virus production. Three groups of viruses were identified: a siphovirus (Humcor1), five different pleolipoviruses and the novel virus Halfvol2. The gene content of Halfvol2 suggests it is probably lipid enveloped. In a previous study, the induction of specific proviruses in four species of haloarchaea was detected by PCR of att sequences [71], and evidence for the excision of Harhisp1 in Har. hispanica, as well as four others (in Har. marismortui and Hmc. mukohataei), was reported. In the current study, induced proviruses were not expected but were significant contaminants occurring in cell-free virus stocks, and their genomes were sequenced and assembled inadvertently.
In comparing the numerous proviruses described in this study, a surprising observation was the frequent occurrence of a CDS that overlaps the attP sequence, and which neatly spans between the viral integrase gene and a gene coding for a CxxC motif protein. This protein can only be produced after circularization of the virus genome and would not exist in the provirus state. The function(s) of the encoded protein will be interesting to elucidate, but its position and fragmentation upon integration into the host genome suggest it is intimately involved in the regulatory mechanisms underlying the transition between virulent and temperate states.
The frequency and multiplicity of active proviruses present in virus stocks highlight important issues when studying haloarchaea and their viruses. At the cell level, the interactions, mutations and phenotypes of haloarchaea may be influenced by provirus induction, loss or cross-infection. It is also unclear what perturbations in cell physiology might trigger provirus induction and subsequent virulent growth. To control for these variables, it would be prudent to document all functional proviruses of the species under study, and assess their activity when cells are placed under experimental conditions. As shown in this study, an unsuspected provirus can be detected by sequencing cell-free DNA preparations, such as viral lysates. On the other hand, when studying viruses, it is important to realise that one or more induced proviruses of the host may well be present, and could be difficult to remove from virus preparations using standard purification regimes. This would be most problematic if they share similar physical characteristics with the virus of interest, for example, when growing the pleolipovirus His2 on Har. hispanica, a host species known to produce the endogenous pleolipovirus Harhisp1.