A Survey of Archaeal Restriction–Modification Systems

When compared with bacteria, relatively little is known about the restriction–modification (RM) systems of archaea, particularly those in taxa outside of the haloarchaea. To improve our understanding of archaeal RM systems, we surveyed REBASE, the restriction enzyme database, to catalog what is known about the genes and activities present in the 519 completely sequenced archaeal genomes currently deposited there. For 49 (9.4%) of these genomes, we also have methylome data from Single-Molecule Real-Time (SMRT) sequencing that reveal the target recognition sites of the active m6A and m4C DNA methyltransferases (MTases). The gene-finding pipeline employed by REBASE is trained primarily on bacterial examples and so will look for similar genes in archaea. Nonetheless, the organizational structure and protein sequence of RM systems from archaea are highly similar to those of bacteria, with both groups acquiring systems from a shared genetic pool through horizontal gene transfer. As in bacteria, we observe numerous examples of “persistent” DNA MTases conserved within archaeal taxa at different levels. We experimentally validated two homologous members of one of the largest “persistent” MTase groups, revealing that methylation of C(m5C)WGG sites may play a key epigenetic role in Crenarchaea. Throughout the archaea, genes encoding m6A, m4C, and m5C DNA MTases, respectively, occur in approximately the ratio 4:2:1.


Introduction
Restriction-modification (RM) systems are one of the best-known defense systems used by prokaryotes to prevent phage infection [1,2]. They comprise a restriction enzyme (REase) that cleaves unmodified DNA and a DNA methyltransferase (MTase) that modifies DNA to block cleavage by the cognate REase. There are four main types of such systems. Type I systems employ three subunits acting in complex, where the R subunit is responsible for restriction, the M subunit is responsible for methylation, and the S subunit is responsible for recognizing the specific DNA sequence that is to be modified or cleaved. Type II systems usually contain two independent enzymes, an REase and an MTase, both of which must recognize and target the same DNA sequence. However, in some Type II systems (designated Type IIG), the MTase and REase activities are encoded in the same polypeptide. Type III systems, like Type I systems, consist of subunits that must act as complexes: an MTase (Mod) subunit that is also solely responsible for sequence recognition and an REase (Res) that must complex with the Mod subunit to cleave unmodified sequences. Type IV systems, also found in many prokaryotes, comprise only an REase that cleaves methylated DNA. Examples of all four types of systems are found in both bacteria and archaea.
REBASE is a comprehensive database of sequence and experimental information about RM systems, drawing information from all fully sequenced microbial genomes deposited in GenBank [3,4]. Methylome data derived from Single-Molecule Real-Time (SMRT) sequencing are also included, enabling the assignment of target sites to MTases and their companion REases. Such assignments are then propagated to homologous enzymes in other organisms for which no experimental data are available. While nanopore sequencing is also capable of detecting DNA methylation, the accuracy of de novo motif calling, particularly for motifs with m6A and m4C, is currently lower than for SMRT sequencing [5,6]. As a result, relatively little nanopore-based microbial methylation data have been deposited in REBASE to date. We expect this to change as methods continue to improve.
RM systems of bacteria have been far more extensively studied than those of archaea. Numerous studies have surveyed different types of RM systems with a largely or exclusively bacterial focus, many of them using REBASE as source data. These studies have covered topics such as Type I systems with recombining S subunits [7], phase-variable Type I systems [8], phase-variable Type III systems [9], solitary REase genes [10], conserved ("persistent") MTases [11], and the association of RM systems with mobile elements and genome rearrangements [12]. Surveys of RM systems found specifically in archaea are fewer, with the largest being a study in Halobacteria [13]. This review examines more broadly the RM systems of archaea, for which there is relatively little experimental data about restriction and REases. Owing to methylation-sensitive sequencing techniques such as SMRT sequencing, however, our knowledge of DNA methylation and MTases in archaea is improving. There are currently 519 complete DNA sequences for archaeal genomes, and SMRT methylation data are available for 49 of these.

Identification of Genomes
We retrieved a list of accession numbers of all genome sequence files stored in the REBASE database [3] and grouped together different accession numbers associated with the same strain (n = 59,327 strains). From this list, we first retrieved all genome sequences that had been taxonomically curated by NCBI and were stated to belong to the domain Archaea (n = 697 strains). We next retrieved all sequences, regardless of taxonomic assignment, that had not originated with NCBI (n = 1417 strains). The latter set was manually curated to identify the archaea (n = 15 strains), and these were combined with the NCBI set for a total of 712 strains.
This set was further parsed to remove genomes whose sequence was not complete at the time of accession. Of the 712 strains, 487 were flagged as "complete genome" in the GenBank definition line and retained. From the other 225 strains, we removed those flagged as whole-genome shotgun data, those where the longest sequence was less than 500 kb, and those where the status of the NCBI genome sequence project was anything less than complete. The remaining strains in the latter set (n = 32 strains) were combined with the earlier set for a total of 519 archaeal strains with complete genome data in REBASE. Of these, 49 also had associated methylome data from SMRT sequencing.
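The completeness rules described above can be sketched as a simple filter over strain records. This is only an illustration: the `Strain` record, its field names, and the status string "complete" are hypothetical and do not reflect the actual REBASE or NCBI schema.

```python
from dataclasses import dataclass

MIN_LONGEST_BP = 500_000  # strains whose longest sequence is < 500 kb are removed


@dataclass
class Strain:
    # Hypothetical record; field names are assumptions made for this sketch.
    name: str
    definition: str       # GenBank definition line
    is_wgs: bool          # flagged as whole-genome shotgun data
    longest_seq_bp: int   # length of the longest sequence, in bp
    project_status: str   # status of the NCBI genome sequence project


def is_complete(s: Strain) -> bool:
    """Return True if the strain passes the completeness rules."""
    # Rule 1: flagged "complete genome" in the definition line -> retained.
    if "complete genome" in s.definition.lower():
        return True
    # Rule 2: otherwise the strain must survive all three removal criteria.
    return (
        not s.is_wgs
        and s.longest_seq_bp >= MIN_LONGEST_BP
        and s.project_status.lower() == "complete"
    )


def select_complete(strains):
    """Keep only the strains considered completely sequenced."""
    return [s for s in strains if is_complete(s)]
```

Applied to the 712 archaeal strains, rules of this kind would retain the 487 flagged strains plus the 32 that survive the secondary checks, giving the 519 strains analyzed here.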

Identification and Clustering of Genes
Genomes processed for entry into REBASE were analyzed to identify genes associated with RM systems using the SEQWARE v. 4 software pipeline [14]. We obtained all such genes encoded by the 519 archaeal strains identified above (n = 4135 protein sequences). These sequences were clustered to 30% sequence identity using Usearch v11 cluster_fast (n = 1034 sequence clusters) [15].
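To illustrate what centroid-based clustering at an identity threshold means, the toy sketch below assigns each sequence to the first existing centroid it matches at or above the threshold, and otherwise makes it a new centroid. It is not a reimplementation of Usearch: the similarity measure (`difflib.SequenceMatcher.ratio`) is a crude stand-in for alignment-based identity, and the function names are ours.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for pairwise sequence identity (an assumption of this sketch);
    # real tools compute identity from an alignment.
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(seqs: dict, threshold: float = 0.30) -> dict:
    """Greedy centroid clustering.

    seqs maps sequence IDs to protein sequences. Returns a dict mapping
    each centroid ID to the list of member IDs (centroid included).
    """
    clusters = {}
    # Process longer sequences first so they tend to become centroids.
    for seq_id in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for centroid_id in clusters:
            if similarity(seqs[seq_id], seqs[centroid_id]) >= threshold:
                clusters[centroid_id].append(seq_id)
                break
        else:
            # No centroid was similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters
```

Each cluster is represented by its centroid, which is the sequence later used as the query for functional prediction.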

Construction of HMM Library
To predict the function of the uncharacterized archaeal proteins, we built a library of 62 HMMs spanning many different RM system-related functions and protein types (Supplementary Materials, Table S1). Protein sequences from which these HMMs were constructed were obtained from REBASE, focusing on experimentally characterized examples, where available, and their close homologs. Of these protein sequences, 462 were DNA MTases (including Type IIG RM proteins) and 202 were of all other functions (REases, S proteins, etc.). Sequences comprising two fused MTase domains were separated into component domains, but other multidomain proteins (Type IIG RM proteins, for example) were left intact. These two groups were separately clustered and visualized in two dimensions using CLANS [16] run under the MPI Bioinformatics Toolkit [17]. The resulting clusters were used to verify and refine the protein sets used for each HMM. Most of the final sets formed visually well-defined clusters in the CLANS analysis.
The protein sets were presumed to comprise functionally similar and/or evolutionarily related groups of proteins. The MTase sets were generally homogeneous in terms of methylation type (m6A, m4C, or m5C) based on experimentally characterized examples. However, protein sequences of MTases conferring m6A and those conferring m4C can be very similar [18], and four HMMs (b1a, lmoa118-like, nru-like, and b3) were built from sequence sets that included characterized MTases of both types (Supplementary Materials, Table S1). For the purpose of classifying based on methylation type (used in the tables in this work), the HMMs b1a, lmoa118-like, and nru-like were all considered to be m6A, and b3 was considered to be m4C.
Each set of protein sequences was aligned using Muscle v. 5.1 [19] run under Geneious Prime 2023.0.4 (https://www.geneious.com) using default parameters. An HMM was built from each alignment using Hmmer v. 3.3.2 hmmbuild (http://hmmer.org). A list of the HMMs can be found in the Supplementary Materials, Table S1. Of the 62 HMMs, 41 were built from the MTases and 21 from the other functions.

Bacterial Genes and Genomes
For comparison with archaea, we also retrieved the set of RM genes encoded in all completely sequenced bacterial genomes deposited in REBASE that had associated methylome data, resulting in a total of 36,718 RM-related genes from 3369 genomes. These RM-related genes were individually classified using the same HMM library and methodology used for the archaeal genomes described above.

Characterization of MTase Activity
Plasmid clones were synthesized (GenScript Biotech, Piscataway, NJ, USA) with codon-optimized genes encoding suaIIM and asp7IM in pRRS10, a lower-copy number derivative of the constitutive expression plasmid pRRS (GenBank acc. no. JN569339) with a pBR322 origin of replication. Clones were used to transform the DNA methylation-deficient E. coli strain ER2796, which is notably Dcm−. Genomic DNA from overnight cultures grown at 37 °C in LB with 100 µg/mL ampicillin was purified using the Monarch HMW DNA Extraction Kit (New England Biolabs, Ipswich, MA, USA). DNA was sheared in a Covaris ML230 (Covaris, Woburn, MA, USA) using the 175 bp AFA-TPX protocol.
Sequencing libraries were constructed from 100 ng of sheared DNA using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA) and partially deaminated using the RIMS-seq2 protocol [20]. Five µL of USER-treated library DNA was used for the PCR amplification step (6 cycles, with barcoded primers from the NEBNext 96 Unique Dual Index Primers) (New England Biolabs, Ipswich, MA, USA).
Libraries were sequenced on a NextSeq (Illumina, San Diego, CA, USA) using the 2 × 76 + 8 + 8 protocol. We obtained 1.4 × 10^7 reads from the asp7IM clone and 1.5 × 10^7 reads from the suaIIM clone. Methylation at m5C sites was determined by comparing the C>T deamination rates of read1 and read2 [21]. Motifs were determined by searching for over-represented sequences around these sites using pipelines based on both MoSDi [22] and DiNAMO [23], with similar results. The presence of dcm-6, the nonsense mutation inactivating the dcm gene in the ER2796 host, was verified in the sequence assembly.

Archaeal Genomes and RM Genes in REBASE
Genome sequences from archaea, and the RM system-related genes encoded by them, were obtained from the REBASE database [3]. To minimize our chances of making assumptions based on missing data, we restricted our analysis to those genomes that appeared to be completely sequenced, closed, and finished, for a total of 519. The genomes in this set are not evenly distributed across the phylogenetic tree, with 480 (92.5%) coming from just six archaeal classes (phylum in parentheses): Thermoprotei (Crenarchaeota); Methanomada, Halobacteria, Methanomicrobia, and Thermococci (all Euryarchaeota); and Nitrososphaerota (TACK group). This uneven distribution likely reflects a combination of sampling bias, academic or industrial interest, and ease of culturing. In the 519 archaeal genomes, we identified 4135 RM-related genes, which were grouped into 1034 sequence clusters based on 30% protein sequence identity. The sizes of these clusters ranged from 167 to 1, with 88 clusters of size ≥ 10 and 494 of size = 1. A complete list of genomes and cluster members can be found in the Supplementary Materials, Table S2.

Functional Categorization of Gene Clusters
For functional prediction, we constructed a library of 62 HMMs, each built from an RM-related evolutionary or functional group of protein sequences, using experimentally characterized examples where available (see Section 2). Each HMM was assigned to one of 13 general functional categories based on RM system type and biochemical activity (Supplementary Materials, Table S1). The protein sequence of the centroid of each gene cluster was used as a query to search the HMM library, and the predicted function of the cluster was determined as the functional category of the top HMM hit. For Type II DNA MTase clusters that included experimentally characterized members (largely based on SMRT sequencing data), the target site of the characterized examples was taken as representative of the entire cluster.
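The top-hit rule can be sketched as a lookup on the best-scoring HMM. The example below is a minimal illustration under stated assumptions: the function name and the scores are hypothetical, and the label table holds only the four methylation-type assignments stated in the methods (b1a, lmoa118-like, and nru-like as m6A; b3 as m4C). The same lookup would apply to the 13 functional categories, whose HMM assignments are listed in Table S1 and are not reproduced here.

```python
from typing import Dict, Optional

# Methylation-type labels for the four mixed-type HMMs, as stated in the methods.
HMM_METHYLATION_TYPE: Dict[str, str] = {
    "b1a": "m6A",
    "lmoa118-like": "m6A",
    "nru-like": "m6A",
    "b3": "m4C",
}


def classify_centroid(hits: Dict[str, float]) -> Optional[str]:
    """Label a cluster by the top-scoring HMM hit of its centroid.

    hits maps HMM names to scores (higher is better). Returns None when
    there are no hits or the top HMM has no label in the table.
    """
    if not hits:
        return None
    top_hmm = max(hits, key=hits.get)
    return HMM_METHYLATION_TYPE.get(top_hmm)
```

For instance, a centroid whose best hit is b3 would be labeled m4C, and that label would then be assigned to every member of its cluster.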
For each high-level taxonomic group (phylum, class, and order) represented in our set of 519 archaeal genomes, we determined the mean number of genes per genome from each of these 13 functional categories (Table 1). Looking at the set of genomes in its entirety, the most common category is Type II MTases (IIM), with about 2.7 per genome. The mean number of known Type II REase genes (IIR) is more than 30-fold lower; this partially reflects the prevalence of orphan MTases, which are similar to those of Type II RM systems but lack an REase partner. However, it is worth noting that Type IIR genes are difficult to identify based on sequence similarity [24], and our HMM library captured only three specific homologous groups of these enzymes, typified by BsiHKI, DpnII, and (presumably) DUF3883. As a result, this category is expected to be significantly under-represented in our data, with most IIR genes instead captured in the "Other" category. Type I and Type IIG RM systems are the next most common types, at just under one per genome. Type III and IV systems are the least common, at less than 0.2 per genome on average. However, it is also possible that Type IV systems are under-represented for the same reason as Type II REases.
Among the phyla, the Crenarchaeota are generally depleted in RM systems of all types, although the single representative from the order Cenarchaeales, Cenarchaeum symbiosum A, harbors 22 MTase genes of Type IIM, so this is not universally true. Figure 1 illustrates two extremes in RM system content in the Crenarchaeota, and archaea in general. The 25 RM system loci in C. symbiosum A, which are spread throughout the chromosome, include 17 orphan Type II MTases, one Type II MTase paired with a second MTase, one Type II MTase paired with a vsr gene, two complete Type II RM systems, two Type IIG genes, and two complete Type III RM systems (Figure 1A). All or nearly all recognize different sites based on characterized homologous examples. The genome of Fervidicoccus fontis Kam940 is more typical of Crenarchaeota, with only two RM loci, both Type II orphan MTases (Figure 1B).
Type I systems are particularly prevalent among the Methanomicrobia, at more than three per genome, and Type III systems are prevalent among both the Methanomicrobia and Thermoplasmata. The Halobacteria and Methanomicrobiales are relatively rich in Type IIG RM systems, at more than one per genome. Factors affecting the differences in RM system content and type between taxonomic groups may include the frequency of exposure to phage, the relative efficiency of horizontal exchange, and the microbiomes in which their members typically reside.

For the purposes of this work, the taxa in bold will be considered phyla, those in Roman type classes, and those in italics orders. If every member of a particular taxon represented here belongs to the same lower-order taxon, that lower-order taxon is shown in parentheses next to the higher-order taxon.

DNA Methylation Phenotypes
Of the 519 complete archaeal genomes under consideration here, 49 have associated methylome data from SMRT sequencing (Pacific Biosciences). From these data, one can readily identify DNA motifs around m6A and m4C methyl marks; m5C-associated motifs can also sometimes be identified, but with less efficiency and accuracy [25,26]. Alternative methods such as bisulfite sequencing, EM-seq [27], TAPS-seq [28], and RIMS-seq [21] are better suited to identifying m5C motifs, but they have not yet been applied to archaeal genomes at a large scale. Table 2 shows the number of genomes in each taxonomic group that have associated methylome data derived from SMRT sequencing, as well as the mean number of genes and observed motifs of each methylation type.
It is expected that the number of MTase genes should equal or exceed the number of motifs since not every gene is active, and many m5C motifs are not detected via SMRT sequencing. Indeed, we observed that in general, the numbers of genes and motifs are comparable, indicating that most of the MTase genes are active. We observed two cases where the number of motifs exceeds the number of genes: m6A in Desulfurococcales and m4C in Methanosarcinales (Table 2). This can be due to erroneous prediction of protein activities (typically misclassifying m6A vs. m4C) or identification of motifs (typically misclassifying m4C vs. m5C), or it may indicate that the genome sequence is incomplete, likely missing one or more plasmids that could encode additional MTases. In the archaea as a whole, the ratio of MTase genes predicted to encode m6A, m4C, and m5C enzymes is approximately 4:2:1 (Table 2). Certain phyla show significantly different ratios, however. In Crenarchaeota, the most prevalent class is m5C due to the universal presence of a single persistent m5C MTase (see below) and the general depletion of RM systems in this taxon. In the TACK group, the most prevalent class is m4C due to the presence of several persistent m4C MTases in the Nitrososphaerota (see below).

Comparison with Bacteria
For comparison with archaea, we retrieved a large set of completely sequenced bacterial genomes from REBASE and performed a similar analysis (see Section 2). Overall, archaea encoded fewer RM-related genes than bacteria (7.9 vs. 10.9), and this was true of every class of genes except IIR, IIG, M (BREX), and V (Figure 2A). Interestingly, the overall ratio of m6A, m4C, and m5C MTase genes in the bacterial genome set is approximately 5:1:1.5, with m5C outnumbering m4C (Figure 2B). The relative difference in the ratio of m4C and m5C between bacteria and archaea may reflect a greater proportion of hyperthermophiles in archaea.
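The two approximate ratios are easier to compare once converted to fractions of all MTase genes. The short sketch below does only that arithmetic on the rounded ratios quoted in the text (4:2:1 and 5:1:1.5); it does not use the underlying gene counts, and the function name is ours.

```python
def ratio_to_fractions(ratio):
    """Convert a ratio such as (4, 2, 1) into fractions that sum to 1."""
    total = sum(ratio)
    return [part / total for part in ratio]


BASES = ("m6A", "m4C", "m5C")
archaea = dict(zip(BASES, ratio_to_fractions((4, 2, 1))))
bacteria = dict(zip(BASES, ratio_to_fractions((5, 1, 1.5))))

for base in BASES:
    print(f"{base}: archaea {archaea[base]:.0%}, bacteria {bacteria[base]:.0%}")
```

On these rounded ratios, m4C accounts for about 29% of archaeal MTase genes but about 13% of bacterial ones, whereas m5C accounts for about 14% in archaea and 20% in bacteria.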

Persistent MTases and RM Systems
Many RM systems and orphan MTases show a "patchy" distribution of homologs across a phylogenetic tree and significant differences between closely related strains, a pattern most parsimoniously explained by frequent horizontal gene transfer (HGT) and gene loss [13]. The resulting diversity of defense systems can be advantageous in protecting a population from infection by phage and other deleterious genetic elements. However, the ability of DNA methylation to affect gene transcription and other DNA-protein interactions can result in orphan DNA MTases (and sometimes full RM systems) acquiring functional roles outside of cellular defense. When this happens, the selective pressure on the genes encoding them can favor conservation and vertical transmission; such genes are sometimes termed "persistent" because they are less likely to be lost over time than most RM systems [11]. Classical examples of these include dam in the Gammaproteobacteria and ccrM in the Alphaproteobacteria. Large-scale comparative genomic studies have identified additional examples in bacteria and in the archaeal class Halobacteria [11,13,14,29].
We define a persistent MTase or RM system as one that is present in at least 75% of members of a given taxonomic group represented by at least five genomes in our set. We mapped the 88 clusters with ≥10 members to the taxonomic tree of the 519 archaeal genomes to identify such cases. For those clusters that met our definition, or nearly so, we combined them with closely related clusters, built phylogenetic trees on the combined sets, and reassorted the members based on monophyletic groups where necessary. We refer to these manually adjusted clusters as homologous groups (HGs). Table 3 shows each taxonomic group encoding at least one HG that met the criteria for persistence, and Supplementary Materials Table S3 shows the original cluster number to which each HG member belongs.

a For the purposes of this work, the taxa in bold will be considered phyla, those in Roman type classes, and those in italics orders. If every member of a particular taxon represented here belongs to the same lower-order taxon, that lower-order taxon is shown in parentheses next to the higher-order taxon.
b Methylated base on the top strand is underlined.
c Methanospirillaceae and Methanoregulaceae consistently form a subclade under Methanomicrobiales and are treated as a single group for the purpose of this table.
We identified 1 persistent group at the phylum level, 3 at the class level, 7 at the order level, and 18 between the levels of family and species. Of these 29 persistent systems, 20 are Type II orphan MTases (all with 4-5 base recognition sites, and all but one palindromic), 1 is a complete Type II RM system, 2 are BREX-like MTases, 5 are Type I systems (comprising 2 or 3 genes), and 1 is a Type IV REase (Table 3). Four persistent systems (HG2, HG3, HG1, and HG11) are shared between multiple taxonomic groups, which may be due either to independent acquisition or to gene loss in sister taxa.
The largest group, HG1, is found throughout the Halobacteria (163/181), except for Halorubrum (1/10) and Haloquadratum (0/2); its members are orphan m4C MTases that modify CTAG (with the underline here and elsewhere indicating the methylated base), and it corresponds to cHG U observed previously by Fullmer and coworkers [13]. Although the general function of this epigenetic signal remains unknown, the CTAG sequence is generally under-represented in Halobacterial genomes [13] but locally clustered upstream of orc1/cdc6 gene orthologs [14], which encode the origin of replication binding complex in most archaea, a role analogous to that of DnaA in bacteria. This suggests a role for HG1 in chromosome replication or the regulation thereof in Halobacteria, but its precise function remains to be elucidated.
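The persistence criterion (present in at least 75% of the members of a taxon represented by at least five genomes) can be applied directly to the HG1 counts quoted above. The sketch below is a minimal illustration; the function name and the use of `None` for "too few genomes to decide" are our own conventions.

```python
MIN_FRACTION = 0.75  # present in at least 75% of members
MIN_GENOMES = 5      # taxon must be represented by at least five genomes


def is_persistent(n_with_system, n_genomes):
    """Apply the persistence criterion to one taxon.

    Returns True or False when the taxon has enough genomes to decide,
    and None when it is represented by fewer than MIN_GENOMES genomes.
    """
    if n_genomes < MIN_GENOMES:
        return None
    return n_with_system / n_genomes >= MIN_FRACTION


# HG1 counts quoted in the text (genomes with HG1 / genomes in the taxon).
hg1_counts = {
    "Halobacteria": (163, 181),
    "Halorubrum": (1, 10),
    "Haloquadratum": (0, 2),
}
hg1_status = {taxon: is_persistent(*c) for taxon, c in hg1_counts.items()}
```

Under this rule HG1 is persistent in the Halobacteria as a whole (163/181 is about 90%), is not persistent in Halorubrum, and cannot be assessed in Haloquadratum, which has only two genomes in the set.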
The second largest group, HG2, is found almost universally throughout the Crenarchaeota phylum (99/100) as well as in most Methanococci (where in Methanocaldococcus it is present in two copies) and Pyrococcus; its members are orphan m5C MTases. Prior to this work, two examples from this clade had predicted recognition sites, although neither had been tested directly: M.SuaII had been predicted to modify RGATCY based on SMRT sequencing of Sulfolobus acidocaldarius DSM639 [14] and M.Asp7I was predicted to modify GGCAC in Acidilobus species 7A. To address the conflicting predictions, we cloned and expressed both genes in a methyl-deficient strain of E. coli and, using RIMS-seq [21], found both to modify the heterologous host chromosome in vivo at CCWGG sites, the same site modified in wild-type E. coli strains by the product of dcm. In other words, both predictions were incorrect. The presence of a persistent m5C MTase in hyperthermophiles is intriguing since the rate of deamination of m5C is expected to be high at elevated temperatures, leading to a mutator phenotype [30]. The answer to this conundrum may be that HG2 is silenced under most conditions: although M.SuaII is active as a constitutively expressed clone, negligible levels of m5C methylation were observed in its native host, S. acidocaldarius, under the conditions of one published experiment [31]. This suggests that HG2 may be under tight regulatory control, in contrast to Dcm, which provides nearly complete methylation of CCWGG sites in E. coli.
Group HG4, nearly ubiquitous in Natrialbales (41/42), encodes an orphan m6A MTase and is the only example of a Type II MTase group found here that modifies a nonpalindromic sequence, CATTC. All of the remaining persistent Type II MTases modify m4C: HG10 and HG18 (GTAC); HG11, HG16, and HG19 (AGCT); HG12 (CGCG); HG15 (GGCC); HG20 (CTNAG); and HG21 (unknown recognition site). All are orphans except for HG15, which is always accompanied by a companion REase, an arrangement atypical of persistent systems [11]. Two taxa, the Methanomicrobiales and the Nitrososphaerota, are particularly rich in these persistent m4C orphan MTases. Interestingly, in both taxa, GATC (conferred by HG3) and AGCT (conferred by HG11, HG16, or HG19) are present throughout the group or nearly so, with one or more additional persistent m4C groups present in the subclades. This may indicate a common epigenetic function for GATC and AGCT methylation in these distantly related taxa.
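Whether a recognition site is palindromic can be checked by comparing it with its reverse complement, taking the IUPAC degenerate codes (such as W = A or T, N = any base) into account. The sketch below applies this standard check to the sites named in the text; the function names are ours.

```python
# Complements of the IUPAC nucleotide codes.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "W": "W", "S": "S",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(site: str) -> str:
    """Reverse complement of a (possibly degenerate) recognition site."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(site.upper()))


def is_palindromic(site: str) -> bool:
    """A site is palindromic if it equals its own reverse complement."""
    return site.upper() == reverse_complement(site)


# Recognition sites mentioned in the text.
sites = ["CATTC", "GTAC", "AGCT", "CGCG", "GGCC", "CTNAG", "GATC", "CTAG", "CCWGG"]
nonpalindromic = [s for s in sites if not is_palindromic(s)]
```

Of the sites listed, only CATTC (the HG4 site) differs from its reverse complement, GAATG, consistent with the statement that HG4 is the only persistent Type II MTase group modifying a nonpalindromic sequence.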
Several Type I RM systems met the criteria for persistence. However, given that the target sites of these systems are dictated by the specificity subunit, which tends to be the least conserved of the three Type I components, it is not clear that members of all of these systems recognize and methylate the same sequence. It may be that these systems are not vertically inherited, but rather are frequently horizontally exchanged between strains of the same species or taxon. HG6, for example, is also found frequently in Thermococcus and Methanothermobacter, and HG14 in other Methanomicrobia. Interestingly, four of the five Type I RM systems that meet the criteria for persistence are found in Methanosarcina, and they are the primary reason that this genus has the highest density of Type I RM systems in the archaea generally, at more than four per genome (Table 1).
HG5 resembles PglX, the MTase associated with BREX systems, and is persistent in the genus Haloterrigena but sporadically found throughout the rest of the Halobacteria. HG13, which weakly resembles Eco57I-like Type IIG systems, is persistent in Methanosarcina mazei (where it is largely coincident with the four Type I systems) but sporadic throughout other Methanosarcinales. The lone persistent Type IV system, HG9, strongly resembles (39% identity) Mrr from E. coli K-12 and is persistent in the genus Methanosarcina (22/29).
The determination of persistence is highly dependent on the availability of completely sequenced genomes. Many taxa in our set are not represented by a sufficient number of genomes to be able to determine persistence based on our criteria. In general, higher-order taxa are represented by more examples than lower-order taxa. However, even among higher-order taxa, two of six phyla and 9 of 16 classes are represented in our set by fewer than five examples, too few to make a persistence determination. Also, in general, lower-order taxa tend to be less diverse groups and therefore would be expected to have more persistent systems than higher-order taxa. However, more specific taxa are also less likely to have enough examples to make the assessment. For example, only seven named archaeal species have more than five examples in our set, but three of these seven have at least one persistent system by our criteria (Table 3). The sequencing to closure of additional archaeal genomes from a broad diversity of taxa will no doubt reveal many additional examples.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/microorganisms11102424/s1.
Table S1: Profiles and functions used for gene classification. Functions refer to the biochemical function(s) of the proteins in the set: M = MTase; R = REase; C = control protein (transcriptional regulator); S = target specificity; V = Vsr repair endonuclease. Categories refer to the 13 general functional categories, a combination of biochemical activity and RM system type, used for functional classification. The methylation type applies only to the MTases.
Table S2: Number of protein cluster members in each genome. In the header are shown the cluster number, number of members, and top HMM hit to the cluster centroid. Columns are in descending order by cluster size. Entries show the number of cluster members (n) in each genome, colored pink when n = 1 and yellow when n > 1. Those genomes with associated methylome data from SMRT sequencing are shaded in green. Orgnum = REBASE organism number; taxonomy = taxonomy string as determined by NCBI.
Table S3: Number of homologous group (HG) members in each genome, showing only those homologous groups that are persistent in at least one taxon. In the header are shown the HG number, number of members, and top HMM hit to the cluster centroid. Columns are in descending order by HG size, except for Type I RM systems, for which the persistent MTase member is shown adjacent to its companion REase and specificity subunits, which may or may not also be persistent. Members of the same RM system are shaded in color in the top row. Entries show the original cluster number of each HG member(s), colored in orange for those taxa where it meets the criteria for persistence. Those genomes with associated methylome data from SMRT sequencing are shaded in green. Orgnum = REBASE organism number; taxonomy = taxonomy string as determined by NCBI.

Figure 1. Locations of RM systems in two Crenarchaeota. For each locus, the arrows show the gene arrangement (green = Type II MTase; blue = Type IIG RM system; red = Type III MTase; black = REase; and gray = Vsr nuclease). Numerals show the ORF number from the REBASE nomenclature (where, for example, 1514 = CysAORF1514P) and the motif is the predicted recognition site based on characterized homologs. (A) Cenarchaeum symbiosum A (2.05 Mbp). (B) Fervidicoccus fontis Kam940 (1.32 Mbp).

Figure 2. (A) Mean numbers of RM system genes, by function, in 3369 bacterial genomes (blue) and 519 archaeal genomes (orange). Values for archaea and definitions of the functional groups are from Table 1. (B) Mean numbers of MTase genes conferring each of the three methylated bases in the same sets of bacterial and archaeal genomes. Values for archaea are from Table 2.

Table 1. Mean number of RM-related genes per taxonomic group.

Table 2. Mean numbers of MTase genes and motifs based on methylated base and position.

Table 3. Persistent RM systems in each taxonomic group.