Comparative Genome Analysis of 33 Chlamydia Strains Reveals Characteristic Features of Chlamydia Psittaci and Closely Related Species

To identify genome-based features characteristic of the avian and human pathogen Chlamydia (C.) psittaci and related chlamydiae, we analyzed whole-genome sequences of 33 strains belonging to 12 species. Using a novel genome analysis tool termed Roary ILP Bacterial Annotation Pipeline (RIBAP), this panel of strains was shown to share a large core genome comprising 784 genes and representing approximately 80% of individual genomes. Analyzing the most variable genomic sites, we identified a set of features of C. psittaci that in its entirety is characteristic of this species: (i) a relatively short plasticity zone of less than 30,000 nt without a tryptophan operon (also in C. abortus, C. avium, C. gallinacea, C. pneumoniae), (ii) a characteristic set of of Inc proteins comprising IncA, B, C, V, X, Y (with homologs in C. abortus, C. caviae and C. felis as closest relatives), (iii) a 502-aa SinC protein, the largest among Chlamydia spp., and (iv) an elevated number of Pmp proteins of subtype G (14 in C. psittaci, 14 in Cand. C. ibidis). In combination with future functional studies, the common and distinctive criteria revealed in this study provide important clues for understanding the complexity of host-specific behavior of individual Chlamydia spp.


Introduction
Chlamydiae are different from typical eubacteria for their obligate intracellular nature, which manifests itself in a biphasic developmental cycle comprising extracellular and intracellular stages. Along this cycle, infectious, but metabolically largely inactive elementary bodies (EBs) enter host cells to transform into non-infectious, metabolically active reticulate bodies (RBs) within a vacuole-like inclusion. These RBs replicate causing expansion of the inclusion and differentiate back into EBs to start a fresh cycle after host cell rupture.
In addition, the inclusion membrane (Inc) proteins form a large family whose members are inserted in the inclusion membrane via type III secretion. Being exposed to the cytosol, some of them are among the major immunogens [36]. It is remarkable that, in the average chlamydial genome, approximately 4 percent of the coding capacity is dedicated to this family [37]. Furthermore, all chlamydial species harbor polymorphic membrane proteins (Pmps), which represent autotransporters with surface-exposed and membrane-translocated domains. They are regarded as virulence factors [38], as well as adhesins and immune modulators [39]. Due to its central regulatory role in differentiation between EBs and RBs, the histone-like protein pair HctA/B [40] could be of interest in the context of strain viability and growth characteristics.
The present study was based on comprehensive comparative analysis that included 33 strains of all species of Chlamydia validly published by March 2020, irrespective of their status as a pathogen, co-infecting agent or commensal. Publicly available genome sequences were complemented by de novo sequenced and assembled genomes of eight field strains of avian chlamydiae (C. avium, C. gallinacea and C. abortus). We anticipated that exploration of inter-species genomic diversity throughout the genus Chlamydia could entail advances in revealing distinctive properties of the avian and human pathogen C. psittaci and other chlamydiae with avian host preference.

General Characteristics of the Genome Sequences
Basic genomic parameters of all 33 strains are given in Table 1. While C. pneumoniae strain TW-183 has the largest genome of this panel (sized 1,225,935 bp), strains of C. avium (10DC88 1,041,170 bp) and C. trachomatis (434-Bu 1,038,842 bp) possess the smallest genomes.
Between 89 and 93% of individual genomes carry coding sequences (CDS). The total number of CDS ranges from 887 (C. muridarum) to 1050 (C. pneumoniae), but only 53 to 61% of them have been assigned a specific function by the annotation software. The remaining 39-47% are still at hypothetical protein status.

Common and Unique Elements in the Genomes of Chlamydia spp.
Each grouping of chlamydial organisms, e.g., at species or genus level, can be described by its pan-genome, which comprises all genes found in all strains included. It can be subdivided into the core genome containing genes present in all strains and a dispensable or accessory genome comprising genes encountered in a single or a few strains [41]. In this study, RIBAP, a specially designed pipeline was used to calculate both core and pan-genomes.
The pan-genome of the present panel of 33 chlamydial strains comprised a total of 31,573 CDS (Table 1) forming 11,054 homologous clusters consisting of gene homologs (Roary output at 95% identity, including clusters with a single CDS). Through pairwise ILP comparisons within RIBAP, these clusters were refined into 1583 RIBAP groups with 784 groups including genes from all 33 input genomes.
To visualize intersecting sets, i.e., shared genes among the chlamydial species, UpSet diagrams involving all 33 strains (Supplementary Figure S1) as well as the 12 type and reference strains ( Figure 1) have been constructed. While 791 genes are common to the 12-strain panel (compared to 784 common to the complete 33-strain panel), it is obvious that C. trachomatis and its closest relatives share comparatively few genes outside the core genome with C. psittaci, C. abortus, C. avium and C. gallinacea. For instance, there are as few as 13 genes that C. trachomatis shares with C. abortus and C. psittaci only (Figure 1, i.e., 7 genes in column 19 plus 6 in column 23). In contrast, the more closely related species of C. psittaci, C. gallinacea, C. avium and C. abortus have an additional 62 genes in common, while C. psittaci and C. abortus share 126 more genes outside the core genome. The C. pneumoniae genome, the largest in this dataset, possesses the highest number of unique genes (121), but also Cand. C. ibidis (41) and C. avium (49) with its relatively small genome have an elevated number of singletons. The core genome of all 33 strains comprised 784 genes (Supplementary Figure S1), i.e., they share about four-fifths of their genome.
Concatenated core genes of all strains were used to reconstruct a phylogenetic tree ( Figure 2). Two major monophyletic groups can be distinguished, the first one containing C. trachomatis, C. muridarum and C. suis, hence referred to as 'the trachomatis group', and the second one harboring the remaining 9 taxa. In the latter, C. pecorum and C. pneumoniae make up their own clade, while the other clade contains 7 species. At the second node, Candidatus C. ibidis forms an external branch, whereas C. avium/C. gallinacea, C. felis/C. caviae, and C. abortus/C. psittaci appear in further clades, each containing two species. We will refer to the members of these three two-species clades as 'the psittaci cluster'. Concatenated core genes of all strains were used to reconstruct a phylogenetic tree ( Figure 2). Two major monophyletic groups can be distinguished, the first one containing C. trachomatis, C. muridarum and C. suis, hence referred to as 'the trachomatis group', and the second one harboring the remaining 9 taxa. In the latter, C. pecorum and C. pneumoniae make up their own clade, while the other clade contains 7 species. At the second node, Candidatus C. ibidis forms an external branch, whereas C. avium/C. gallinacea, C. felis/C. caviae, and C. abortus/C. psittaci appear in further clades, each containing two species. We will refer to the members of these three two-species clades as 'the psittaci cluster'.
The tree also reveals particularly high genetic heterogeneity within the species of C. psittaci and C. gallinacea. Notably, the atypical C. abortus strain 16DC122, which was isolated from a duck, forms its own branch apart from two typical ruminant strains of the species. Concatenated core genes of all strains were used to reconstruct a phylogenetic tree ( Figure 2). Two major monophyletic groups can be distinguished, the first one containing C. trachomatis, C. muridarum and C. suis, hence referred to as 'the trachomatis group', and the second one harboring the remaining 9 taxa. In the latter, C. pecorum and C. pneumoniae make up their own clade, while the other clade contains 7 species. At the second node, Candidatus C. ibidis forms an external branch, whereas C. avium/C. gallinacea, C. felis/C. caviae, and C. abortus/C. psittaci appear in further clades, each containing two species. We will refer to the members of these three two-species clades as 'the psittaci cluster'.
The tree also reveals particularly high genetic heterogeneity within the species of C. psittaci and C. gallinacea. Notably, the atypical C. abortus strain 16DC122, which was isolated from a duck, forms its own branch apart from two typical ruminant strains of the species. The tree also reveals particularly high genetic heterogeneity within the species of C. psittaci and C. gallinacea. Notably, the atypical C. abortus strain 16DC122, which was isolated from a duck, forms its own branch apart from two typical ruminant strains of the species.

The Plasticity Zone (PZ)
Major hotspots for variable regions are observed at~300,000 bp and~600,000 bp (normalized genome sequences with hemB gene in initial position). The first hotspot is the hypervariable region of the plasticity zone (PZ).
There is only a low degree of similarity among Chlamydia spp. as seen with nucleotide identity values of PZ sequences, which are typically in the range of 30-50% (Table 2, Supplementary Figure S2). C. psittaci 6BC and C. abortus S26-3 share the most similar structure with 81.55% nucleotide identities, whereas the avian C. abortus strain 16DC122 has a less similar PZ compared to strain S26-3 (67.41% identity). Remarkably, the closely related species of C. avium and C. gallinacea lack any PZ homology. Basic PZ parameters of all 33 strains are given in Table S1, representative strains of each species are compiled in Table 3. C. suis (82,505 nt), C. muridarum (82,115 nt) and C. trachomatis (52149-56792 nt) were found to have the largest PZs, while strains of C. avium (4669-5694 nt), C. pneumoniae (8759 nt) and C. gallinacea (15,845-16,624 nt) harbor considerably reduced versions.
The composition of the PZ was found to differ widely among the species and, to a lesser extent, also within species. While biotin modification genes accB and accC were identified at one PZ boundary in all 33 strains, purine synthesis genes guaA and guaB, which are usually marking the other PZ boundary, were missing in C. avium, C. gallinacea, C. suis and C. trachomatis. A complete guaAB-ADA operon was seen in C. caviae, C. felis, C. muridarum, C. pecorum and most of the C. psittaci strains. Functional Trp operons, which include the regulatory region, structural genes trpA, B, D, F, kynU and repressor gene trpR, were encountered in C. caviae and C. felis. The analogous region in C. trachomatis and C. suis consisted of three genes only. All other Chlamydia spp. genomes lacked a Trp operon.
The gene encoding the large cytotoxin (toxB) was found in all species except C. avium, C. pneumoniae and C. trachomatis. Its size ranged from 7053 nt in C. psittaci strain VS225 to 10,317 nt in C. pecorum. In C. suis and C. pecorum, two gene copies were identified, in C. muridarum even three. In C. psittaci, sequence and size variations were observed from strain to strain (Table S1). In the case of C. abortus, it is remarkable that a 9312-nt toxB gene was seen in the avian strain 16DC122, whereas it is absent in typical ruminant strains, such as S26-3 and C18-98 (B577).

Genes Encoding Polymorphic Membrane Proteins (pmps)
The above-mentioned hotspot regions also harbor clusters of pmp genes. To identify homologs and characterize the spectrum of pmp genes present in strains of the 'psittaci cluster', all previously annotated pmp sequences of C. psittaci, C. avium, C. gallinacea and C. abortus were blasted against the genome sequences of all 33 strains. While 21 individual pmp genes were seen in C. psittaci and 18 in C. abortus, their number was considerably reduced in C. gallinacea (10) and C. avium (7). The results in Table 4 also show that all known Pmp subtypes A, B, D, E, G, and H are represented by at least one member in each species. Subtype G proved the most extensive one with 11 members in C. abortus, 14 in C. psittaci, four in C. gallinacea and two in C. avium. Table S2 illustrates the extent of sequence variation among strains of these four species. Notably, the set of individual pmp genes was conserved within a given species, where strain-to-strain variations can cover similarity values between 50 and 80%. The more conserved Pmp family members, such as PmpD, can be used for phylogenetic purposes. While sequence similarity within the same Pmp subtype among Chlamydia spp. is generally low, phylogenetic analysis revealed a genetic relatedness closely resembling genome-based data shown in Figure 1 and Supplementary Figure S1. Comparison of PmpD proteins revealed higher than 50% similarity at amino acid level among the species of the 'psittaci cluster', but lower similarity to the 'trachomatis group' (Table S3).

Inc Proteins
In contrast to the 'trachomatis group', where at least eight Inc family subtypes, i.e., IncA-G and V, and more individual members can be found, the remaining Chlamydia spp. possess less inc genes (Table 5). Six different inc gene types have been identified in C. psittaci and C. abortus, but two of them, incX and incY (or NC), have yet to be assigned to a particular subtype. Again, C. avium and C. gallinacea appear to have undergone a reduction of this locus, as they presented only subtypes A, B, C and V.
The Inc proteins are highly variable among the species of Chlamydia. As an example, IncB of the closely related species of C. psittaci and C. abortus share only 74.26% amino acid sequence similarity (see Table S4). At the same time, Inc sequences are remarkably conserved within species. For instance, Inc sequence identities within C. psittaci and C. abortus are typically close to 100%, within C. avium and C. gallinacea higher than 90%.

The Secreted Inner Nuclear Membrane-Associated Chlamydia Protein (SINC)
Homologs of the gene encoding SINC of C. psittaci 6BC were identified in all ten strains of this species, with amino acid similarities between 86.2% (strains GR9 and WS-RT-E30), 96.8% (NJ1) and 99-100% (rest of the strains). SINC orthologs of reduced size were encountered in C. abortus, C. caviae, C. felis, C. gallinacea, Cand. C. ibidis, and C. avium (order of descending amino acid similarity, see Table 6 and Table S5).

Histone-Like Proteins HctA and HctB
This protein pair was present in most of the Chlamydia spp. However, C. avium and C. gallinacea strains completely lacked the hctA gene (Table 6). In addition, hctB was missing in C. gallinacea strain JX-1 while present in the other field strains. In C. avium, only type strain 10DC88 harbors a truncated hctB gene sized 258 nt or 86 aa (probably a pseudogene) that was highly homologous to its counterpart in C. gallinacea. In the other C. avium strains, the hctA/B pair seems to be absent altogether.

Pseudogenes
To check genomes for the presence of homologs to C. trachomatis pseudogenes we conducted a blastn 2.7.1+ search with default parameters using known pseudogenes of C. trachomatis (n = 15) as queries. The latter were extracted from [42]. We found that two of the pseudogenes tested were conserved in all 33 genomes (CTL0228 and CTL0627 with 70-83% sequence identity and 76-86% sequence identity, respectively). These are the only homologous pseudogenes found in strains of C. psittaci, C. avium, C. gallinacea C. abortus and C. pneumoniae. As expected, more of the C. trachomatis pseudogenes are shared with the phylogenetically closer relatives of C. suis (9) and C. muridarum (7), but also C. pecorum (4) and Cand. C. ibidis (3) harbor additional homologs (Table S6).

Discussion
Many bacterial genome sequences in public databases are in a permanent draft state [43]. Failure in achieving complete genome sequence assembly is often due to multiple sequence repeats in certain loci, which may confound assembly tools. However, incompletely assembled genome sequences pose limitations on accurate annotation and subsequent genome analysis [44]. In the present study, we used 27 completely and 6 nearly completely assembled genomes of chlamydial strains from 12 species, 25 of which had already been in the public domain. In addition, 4 strains of C. gallinacea, 3 of C. avium and 1 of C. abortus were de novo sequenced and assembled.

Bioinformatics Tools: New and Unique Features of the RIBAP
While several tools for pan-genome calculation were already available (such as Roary), we observed that the core genome size was often underestimated, in particular, when evolutionarily distinct species were compared. Thus, in the course of this comparative study, we developed a combined approach by connecting initial Roary groups based on high sequence similarity (95%) with additional parameters derived from evolutionary pairwise ILP calculations (github.com/hoelzer-lab/ribap). Using this approach, we are able to calculate a comprehensive core genome, even at genus level and for evolutionarily distinct bacteria.
In addition, the RIBAP code is public domain and the pipeline is implemented using a workflow management system with containerized steps, thus allowing easy and reproducible execution on different platforms and modular pipeline adjustments including versionization in the future.
One output of RIBAP is a concise HTML table (see osf.io/j9zas for download or direct HTML access) that compiles not only the core genome, but also other gene groups belonging to the accessory genome. The table allows an interactive search for specific genes of interest and visualizes sequence similarity thresholds and phylogeny for every single gene group. When the number of input genomes to be compared becomes too large for visualization in a table format, an UpSet plot as shown in Figure 1 (and Supplementary Figure S1) will still allow quantitative visualization of multiple gene set intersections. An example of the RIBAP output concerning the HctB protein is shown in Figure 3.  Importantly, our approach also encompasses genes that lack a functional annotation due to assembly errors, annotation problems or incomplete reference databases (so-called 'hypothetical genes'). Thus, we are able to determine a more complete set of core genes and even assign functions a posteriori. Hypothetical genes can be added to a RIBAP group consisting of genes with an assigned function, thus allowing annotation of genes that had been overlooked by the annotation software. For example, the RIBAP group13 (see HTML table located at https://osf.io/j9zas/) includes genes present in all 33 strains and functionally annotated as rpmG, which encodes the 50S ribosomal protein L33. However, for C. trachomatis 434-Bu, Prokka only reported a hypothetical gene despite clear sequence similarity to the other rpmG genes in this group. By exclusion or routine handling of 'hypothetical genes', this group would have lacked one of the 33 strains and thus would not have been included in the core gene set. Furthermore, our analysis also revealed additional core genes (annotated as 'hypothetical proteins') present in all 33 strains, which are interesting targets for further studies and additional annotation efforts.

Core Genome vs. Dispensable Genome
As a central parameter in comparative genomic studies, the core genome comprises the genes or CDS shared by all strains of the selected panel. However, there is no general agreement on definition of the term itself, nor on similarity cut-off values to be applied.
In this situation, core genome size naturally depends on the selection of strains studied and the calculation method including sequence similarity thresholds.  (color table) and a dendrogram based on multi-sequence alignment and FastTree. Displayed for each RIBAP group are the group size and all involved strains, as well as gene name and gene description for each strain. The corresponding assignment to Roary runs with sequence identities of 60, 70, 80, 90 and 95% is color coded in their respective groups. On the left-hand side, a phylogenetic tree is displayed, which is generated from the MSA of the CDS within this RIBAP group. Alignment and tree are provided for download. The full HTML table can be found in the OSF repository.
Importantly, our approach also encompasses genes that lack a functional annotation due to assembly errors, annotation problems or incomplete reference databases (so-called 'hypothetical genes'). Thus, we are able to determine a more complete set of core genes and even assign functions a posteriori. Hypothetical genes can be added to a RIBAP group consisting of genes with an assigned function, thus allowing annotation of genes that had been overlooked by the annotation software. For example, the RIBAP group13 (see HTML table located at https://osf.io/j9zas/) includes genes present in all 33 strains and functionally annotated as rpmG, which encodes the 50S ribosomal protein L33. However, for C. trachomatis 434-Bu, Prokka only reported a hypothetical gene despite clear sequence similarity to the other rpmG genes in this group. By exclusion or routine handling of 'hypothetical genes', this group would have lacked one of the 33 strains and thus would not have been included in the core gene set. Furthermore, our analysis also revealed additional core genes (annotated as 'hypothetical proteins') present in all 33 strains, which are interesting targets for further studies and additional annotation efforts.

Core Genome vs. Dispensable Genome
As a central parameter in comparative genomic studies, the core genome comprises the genes or CDS shared by all strains of the selected panel. However, there is no general agreement on definition of the term itself, nor on similarity cut-off values to be applied.
In this situation, core genome size naturally depends on the selection of strains studied and the calculation method including sequence similarity thresholds.
As standard Venn and Euler diagrams are an inadequate solution for quantitative visualization of multiple (n > 4) gene set intersections, we used the UpSetR package (Figure 1, Supplementary Figure S1), a scalable alternative for visualizing intersecting sets and their properties [45]. The result of 784 genes shared among 33 chlamydial strains (or 791 among 12 type strains) is somewhat higher than the number reported in the study by Joseph et al. [32], who identified 668 core genes in a panel of 36 strains of Chlamydiaceae. These authors studied a different panel of strains with emphasis on C. pneumoniae, C. pecorum and C. psittaci/C. abortus and defined core genes as protein-coding gene clusters shared by all strains. Sigalova et al. identified 698 orthologous groups (genes) that were universally conserved in a huge set of 227 genomes of 16 species and Candidati of Chlamydia [1].
Investigating genome synteny in higher taxonomic hierarchies, Collingro et al. [33] previously found 560 genes belonging to the core genome of the phylum Chlamydiae, but nine years later, based on a far larger number of organisms and genome sequences, that number decreased to 340 [46]. In another study, Pillonel et al. [47] identified 424 core proteins shared by members of the order Chlamydiales.
The higher number of core genes in our study can be explained by our combined approach: We relied on the same annotation source for all genomes (re-annotation with Prokka) and also included all 'hypothetical proteins' (39-46% of all CDS in our annotations) in the comparison. This approach became possible because we based our study on gene homology rather than gene function. By including non-annotated ('hypothetical') genes, often omitted in other studies, we provide a more comprehensive comparative genome analysis.
In addition, our novel ILP approach is able to connect even gene groups that share low sequence similarity (Figure 3), potentially resulting in additional core gene sets that are missed by other approaches. It should be emphasized that membership of the core genome is based on a sequence identity threshold of 95% within Roary clusters, which are further connected by supporting ILP results based on an all-vs-all MMSeqs2 comparison without any sequence identity cutoff. Thus, the combination of re-annotation of all genomes based on the same database and novel functionalities of RIBAP accounts for the larger set of core genes identified in our study.
Using the new RIBAP pipeline, we selected the following criteria and thresholds for core genome identification: We define a protein-coding gene to be part of the core genome when it forms a RIBAP group of 33 members (i.e., the number of genomes included in this study). Thereby, a RIBAP group consists of at least one Roary cluster with a sequence identity of 95%. Multiple 95% Roary clusters can be connected via ILP support into larger RIBAP groups, potentially forming a core gene. We believe that these conditions take the peculiarities of chlamydial genomes into account and may represent a working compromise for core genome definition.
In an alternative approach, van Aggelen et al. [48] recently proposed a core genome definition to be based on conserved sequences rather than conserved genes. In these circumstances, larger core genome sizes will be calculated, since also non-coding sequences are included.
The elements forming the core genome are considered to be indispensable and represent an accurate dataset for phylogenetic studies [49,50]. Therefore, we reconstructed the phylogenetic tree of Chlamydia spp. using concatenated core gene sequences of the strains examined ( Figure 2). While largely confirming previously published phylogenies based on rRNA [51], conserved proteins [47] and core locally collinear blocks [32], this tree comprises all validly published species of the genus Chlamydia and highlights intra-species genetic variability of C. psittaci and C. gallinacea. It is worth noting that all chlamydial organisms originating from avian hosts have been included in 'the psittaci cluster' following the second node, thus confirming their close evolutionary relationship. This is in line with our observation that the four species of C. psittaci, C. gallinacea, C. avium and also C. abortus share another 68 common genes in addition to those of the core genome ( Figure 1). C. abortus, C. caviae and C. felis, which have a preference for non-avian hosts, separated from the common ancestor at a later stage, so that they are still closely related.

The Plasticity Zone
The data of this study show that the absence of a tryptophan (Trp) operon is a common feature of the avian species C. psittaci, C. gallinacea, C. avium, and the closely related C. abortus. The fact that C. abortus with its mainly ruminant and only occasional avian strains also lacks the operon is not surprising as it evolved from C. psittaci relatively recently [30,32].
Functional operons, which include the regulatory region, structural genes trpA, B, C, D, F, kynU and repressor gene trpR, were encountered in C. caviae and C. felis. In the case of C. trachomatis, it was suggested that genital strains possess the operon, while ocular strains do not [52]. Our analysis revealed trpA, trpB and trpR genes in all three C. trachomatis strains, but the ocular strain A-Har-13 had two shortened CDS encoding trpB segments instead of a single trpB gene, which may indicate its limited functionality. In C. muridarum, we found no trp genes, thus confirming the absence of a functional operon, although the presence of truncated remnants as suggested by Xie et al. [52] cannot be ruled out. Similarly, C. pecorum type strain E58 lacked trp genes in the PZ, but a locus containing six trp genes outside the PZ was recently reported [1]. Besides the absence of a Trp operon, chlamydial strains of C. psittaci, C. abortus, C. avium, C. gallinacea, as well as C. pneumoniae share a tendency towards reduced PZ size (Table 4).
Regarding the most prominent locus in the PZ, the role of the large cytotoxin as a potential virulence factor is far from understood. ToxB was found in all species except C. avium, C. pneumoniae and C. trachomatis. In the present analysis, the toxB gene product has been categorized as an ortholog of lymphostatin/EFA-1, a toxin from E. coli (EPEC and EHEC) that also occurs in Citrobacter rodentium besides chlamydiae. It harbors three enzymatic activities associated with glycosyltransferase -(D-X-D, 1.6 kb), protease-(C, H, D, 4.5, 4.8 kb), and aminotransferase (TMGKALSASA, 5.8 kb) motifs [53]. While in E. coli the lifA/efa-1 gene is present in pathogenic, but absent in non-pathogenic strains, there is no analogous data for chlamydiae. Based on similarity to the cytotoxin of Clostridoides difficile, the glycosyltransferase domain was suggested to interact with eukaryotic host cells by glycosylating proteins of the Ras superfamily, thus inactivating them and causing disassembly of the actin cytoskeleton [54,55]. Other studies highlighted homologies of the chlamydial toxin to clostridial large cytotoxins and yersinial type III effector YopT, which are involved in protein translocation and disassembly of the host cytoskeleton [56,57].
Functional members of the MAC/perforin family were encountered in most of the species considered, but not in C. avium and C. gallinacea, nor in C. abortus and C. caviae, where truncated genes or pseudogenes were found. However, these findings remain provisional as the accuracy of annotation tools is limited in this regard. All MAC/perforin proteins possess a MACPF domain, but the degree of conservation within the family is low, so that sequence-based search algorithms may not always be suitable for identification. Their assumed function consists in pore formation, which may contribute to host cell entry [58].
With so many question marks left, there is a notion that PZ genes, even if translated into functional proteins, may be dispensable for infectious processes [59]. Nevertheless, the present finding of toxB in an atypical C. abortus strain from an avian source, which contrasts the absence of the cytotoxin gene in typical ruminant C. abortus, may be an indication of a role in host tropism, even though it is also missing in the C. avium genome. All in all, it has yet to be experimentally shown that this toxin can be a specific virulence factor in C. psittaci and closely related avian chlamydiae.
There is also a notable diversity among Chlamydia spp. regarding the guaAB/ADA locus. While C. caviae, C. felis, C. pecorum, C. muridarum and most of the C. psittaci strains harbor an intact operon it is absent in the other species (Table 3). As the operon's gene products are involved in salvaging biosynthesis of purine nucleotides, which are required for bacterial growth, this indicates different purine salvage strategies [60]. In the case of C. psittaci, the two strains lacking guaAB/ADA, GR9 and WS-RT-E30, form their own clade in the phylogenetic tree in Figure 2. The presence of the guaAB/ADA operon was suggested to reflect adaption of species or strains to a niche with biochemical restrictions [29]. However the distribution of this trait among species and strains in our study did not reflect phylogeny or host tropism, so that a role of the operon as a major delineator of speciation cannot be anticipated.

The Family of Polymorphic Membrane Proteins (Pmps)
The Pmps represent autotransporters carrying conserved tetrapeptide motifs in the central part or passenger domain. While sharing many of the characteristics of classical autotransporters in eubacteria [38], Pmps are unique to chlamydiae. Some of them were shown to be adhesins and immune modulators in C. pneumoniae and C. trachomatis [39,61]. Since Chlamydia spp. use a high proportion of their genome to encode this protein family (C. pneumoniae 17.5%; C. trachomatis 13.6%) it is likely that they are of crucial importance.
An unusually high number of mutated sites in these loci is responsible for high sequence variation across species [62,63]. The data of our study highlight the extreme interspecies diversity among members of this gene/protein family. At the same time, the spectrum of pmp genes proved fairly stable within the same species (Table 5 and Table S2). Due to low inter-species sequence homologies of individual Pmp family members, annotation can be complicated. In the case of hitherto unassigned hypothetical proteins, the software would only recognize sufficiently similar sequences as homologs of known proteins, which could prevent identification of certain orthologs.
In this study, all species were found to harbor the complete set of Pmp subtypes A, B, D, E, G, and H. In the C. avium strains, a reduced set consisting of only 7 members was identified, among them only two subtype G proteins. Most species of the 'psittaci cluster' are distinguished from the trachomatis group (with 9 Pmps) by a larger number of individual Pmps ranging from 21 in C. psittaci to 10 in C. gallinacea. The species of C. psittaci and C. gallinacea display the highest intra-species sequence variation in the Pmp family (Table S2).
With the exception of C. avium, the 'psittaci cluster' contains an elevated number (≥4) of subtype G members, in contrast to the trachomatis group, which has only two members, PmpG and PmpI. In the case of C. psittaci, other workers had already noted the presence of several pmpG genes [35] and their variability among genotypes [64]. The present data show that the pmpG subtype of this pathogen exhibits the greatest variety among all Chlamydia spp., with 14 currently known members (pmp7-17 and pmp19-21).
Considering our current knowledge on Pmps, which includes experimentally proven adhesive functions [39,61] and immunogenicity [65,66] it is straightforward to suggest a substantial role in host tropism and adaptation, as well as species-specific pathogenicity [67].

Inclusion Membrane Proteins
As the presence of a bi-lobed hydrophobic domain of 40-60 amino acids is the only common feature, and inter-species similarity of primary sequences is low, it is difficult to identify all members of the Inc protein family solely using bioinformatics tools. In C. trachomatis, 39-59 Inc proteins have been predicted, but only 23 of them functionally characterized [68]. Nevertheless, due to being located in the inclusion membrane, these proteins are players in direct interaction with host defense factors, which suggests a major role in pathogenesis. In addition to the known subtypes A-G, a new family member IncV was recently identified and shown to be directly involved in the formation of membrane contact sites with host cells [69]. The fact that Incs are among the major immunogens [36,70] confirms their active participation in pathogen-host interaction and also recommends them as efficient species-specific capture antigens in serology and potential vaccine components. In the present study, we have identified a tendency towards reduction in Inc family size among some avian Chlamydia spp., particularly in C. gallinacea and C. avium. This could be a contributing element to their particular host preference. It is also obvious that IncB sequences of 'psittaci cluster' members are more similar to each other than to the remaining Chlamydia spp. (Supplementary Table S4).

SINC Protein
This type III-secreted effector is considered a potential virulence factor for its ability to target the nuclear envelope [71]. We have identified the sinC locus encoding a 502/503-aa protein in all ten strains of C. psittaci, which is by far the largest SinC among Chlamydia spp.
In strain Mat116 we found a truncated version consisting of two overlapping sequence strands. In the other species of the 'psittaci cluster', there were SINC orthologs shortened to roughly 50-70% of the original size (Table 6). Mojica et al. were able to show that the shorter SINC orthologs in the closely related species of C. abortus and C. caviae still retained their specific functionality [71]. On the other hand, they found a 'weak ortholog' in C. trachomatis, which has only 12.5% sequence similarity and lacks the specific capability of nuclear membrane association. Our own multiple Blast query did not identify SINC orthologs outside the 'psittaci cluster'. This could indicate a role of this protein in host tropism and pathogenesis.

Histone-Like Proteins HctA and HctB
Although the hct locus is a priori not considered a particularly variable genomic site there is a striking contrast between high intra-species conservation and low inter-species homology of these proteins. For instance, sizeable homology of HctB proteins is only observed between C. psittaci and C. abortus (71-83% aa identity) as well as between the Hct-carrying C. gallinacea and C. avium strains (83-88%). Nevertheless, comparison among species reveals that HctBs within the 'psittaci cluster' are more similar to each other than to other chlamydial species (Supplementary Table S7).
This protein pair is known for its central role in RB-to-EB differentiation late in the developmental cycle [72,73], which includes the capability of repressing transcription and translation through binding to the genome. It is conceivable that sequence variations in this pair of lysine-rich and highly basic proteins could have implications on host preference and the ability to proliferate in host tissue. In this context, it is remarkable that C. gallinacea and C. avium, with their narrow host range and challenging cell culture, are obviously poorly equipped in terms of Hcts. In future research, a closer look at primary and secondary structures of chlamydial histone-like proteins combined with cell culture experiments including hct mutants could provide some answers to those questions.

Pseudogenes
Detailed studies on pseudogenes in chlamydiae are still scarce, but their number is believed to be low compared to other bacteria [1]. Although the issue has been addressed only superficially here, our findings suggest that pseudogenes encountered in C. trachomatis are not as prominent in phylogenetically distant chlamydial species. The latter, which include all Chlamydia spp. with avian host preference, probably have a separate set of pseudogenes. This in itself is an interesting point since it is conceivable that pseudogenes could be part of a marker set of host or tissue tropism. In any case, more systematic studies are needed to investigate this subject.

Chlamydial Strains
The 33 strains included in this study and the sources of whole-genome sequences are given in Table 6.

Genome Sequencing and Genome Assembly
The genomes of eight field strains (three C. avium, four C. gallinacea, one C. abortus), were de novo sequenced and assembled (Table 1). These strains were cultured on BGM cells and DNA extracted as described previously [14]. Approximately 5 µg of genomic DNA was sent to GATC/Eurofins Genomics (Konstanz, Germany) for Illumina MiSeq 2 × 300-bp paired-end sequencing (except C. abortus 16DC122 with 2 × 151 cycles). In the case of C. gallinacea strain 12-4358, a genomic library was prepared using 1 ng of genomic DNA and the Nextera XT kit (Illumina, Berlin, Germany, and sequencing was performed on the MiSeq platform at Anses using a Micro V2 with 2 × 151 cycles. Raw sequencing data were quality controlled using FASTQC [74]. Subsequently, the raw reads were assembled using SPAdes v3.12 [75] with k-mer values of 21,33,55,77,99, and 127, the careful option and automatic coverage cut-off. The assembly quality was evaluated using QUAST [74]. The raw sequencing data are deposited in the European Nucleotide Archive under accession PRJEB40883 and the final assemblies are available at https://osf.io/j9zas/.

Pan-Genome and Core Genome Calculation Using RIBAP
The Roary ILP Bacterial Annotation Pipeline (RIBAP) is a newly developed pipeline freely available at github.com/hoelzer-lab/ribap and implemented using the workflow management system Nextflow [76]. For an input set of genome sequences, the pipeline performs annotation, core gene set calculation of the identified coding genes, alignments and phylogenetic reconstructions of homologous genes and the full core gene set, as well as various visualizations of the results. Each step of the pipeline is encapsulated in its own Docker container, thus, only Nextflow and Docker are required to run RIBAP. For the present study, we ran the pipeline in v0.4 using the following command: nextflow run hoelzer-lab/ribap -r v0.4 -fasta '*.fasta' -tree -tmlim 240.
We executed the pipeline twice: (1) on the full set of all 33 genomes and (2) only on the type strains of the 12 species included in this study. RIBAP will be described in full detail in an upcoming publication. Briefly, the pipeline combines the output of various tools, most importantly Roary and an ILP-approach, to calculate a comprehensive core gene set even in the case of evolutionarily different species. Below, we explain the core features of RIBAP to calculate a core gene set. Version numbers of all third-party tools involved in the execution of RIBAP are provided in version 04 of the GitHub release at github.com/hoelzer-lab/ribap. Complete results of the pipeline are provided at the Open Science Framework (osf.io/j9zas).

Annotation
First, to ensure full comparability of annotations using the same data basis, RIBAP re-annotates all 33 genome sequences using Prokka v1.14.5 [77] with default parameters. Prokka searches each candidate CDS (from start to stop codon) against a reference protein database derived from UniProtKB, while CDS with no database match are designated 'hypothetical protein'. Since Prokka is an integral part of RIBAP, the subsequent core genome calculation is based on Prokka's CDS annotations, which also include hypothetical proteins.

Pan-Genome Scaffold
Next, we calculated a preliminary scaffold of the core gene set using the rapid large-scale prokaryote pan-genome analysis tool Roary v3.13.0 [78] based on the Prokka CDS annotations. RIBAP automatically performs multiple Roary calculations with different sequence similarity thresholds (60%, 70%, 80%, 90%, 95%). However, the final RIBAP core gene set calculation is based on the 95% Roary output, whereas results of lower similarity thresholds are only used for visualization of the final core gene groups. The output consists in so-called Roary clusters, which are further connected into larger RIBAP groups using Integer Linear Programming (ILP) results.

Integer Linear Programming and GLPK
A disadvantage of pan-genomes produced by Roary at high sequence similarity thresholds is the fragmentation of clusters that likely belong together. On the other hand, a low sequence similarity threshold can result in an increased amount of false positive assignments (see Roary online FAQ). Therefore, within RIBAP, we adapted and extended an ILP approach [79] by an InDel-model to minimize the evolutionary distance between CDS based on an MMseqs2 (v10.6d92c) all-vs-all CDS comparison [80]. We split the initial MMseqs2 all-vs-all homology table into all pairwise genome comparisons and formatted the output to be solved with ILPs using the GNU Linear Programming Kit (GLPK, v4.65) package [81]. Subsequently, the solved ILPs were combined again and the resulting data was used to connect fragmented Roary clusters even below the sequence similarity threshold of 95%. We run RIBAP with the '-tmlim' parameter set to 240 s to limit how long GLPK will try to solve each individual ILP.

Creating a RIBAP Group
A RIBAP group (set of homologous genes) is created by merging the preliminary Roary pan-genome clusters and the refined pairwise ILP results. Using pairwise ILP data, smaller Roary clusters of low sequence similarity were connected into larger RIBAP groups that potentially represent homologous core genes. For detailed documentation refer to the scripts provided at https://github.com/hoelzer-lab/ribap.

The RIBAP Output
All results are compiled in a searchable and interactive HTML table embedding the final RIBAP groups and including gene designation, gene description, a color-coded heat map based on Roary assignments at different thresholds, as well as a phylogenetic tree based on the multiple sequence alignment (MSA) of the CDS making up a RIBAP group.

Phylogenetic Tree Based on Core Genomes
RIBAP further calculates a phylogenetic tree based on the core genome, comprising CDS present in all input genomes. The pipeline utilizes MAFFT v7.455 [82] to create an MSA for each RIBAP group and concatenates the resulting alignments. Then, RAxML v.8.2.12 [83] is applied to construct a phylogenetic maximum likelihood tree using a bootstrap value of 100 and the PROTGAMMAWAG model.

UpSet Diagrams
Standard Venn and Euler diagrams are an inadequate solution for quantitative visualization of multiple (n > 4) gene set intersections. Thus, we used the UpSetR package v1.4.0 in RIBAP, a scalable alternative method for visualizing intersecting sets and their properties [45]. As input for UpSetR, we selected all 33 genomes as well as type strains of the 12 species included in this study and their homologous gene groups identified with RIBAP. We restricted the visualization to the largest 40 intersecting sets.

Multiple Blast to Identify Homologs of Pmp, Inc and SinC Genes
All sequences of pmp/Pmp and inc/Inc family members of C. psittaci, C. avium, C. gallinacea and C. abortus type strains annotated in the NCBI (nucleotides) and/or UniProt (proteins) databases were compiled in multi-FASTA files (one query file per species) and blasted against the genome sequences of all 33 strains of this study. The resulting hits were sorted according to target strains and filtered at 50% sequence identity. Likewise, the amino acid sequence of SinC of C. psittaci (EGF85279.1) was blasted against all 33 genomes.

Calculation of Sequence Identities
Nucleotide and amino acid sequence identity values were calculated using distance matrices based on multiple sequence alignments in Geneious v. 10.2.4 (Biomatters Ltd., Auckland, New Zealand).

Normalization of Genomes
To facilitate visual genome comparison using generic tools, such Geneious, whole-genome sequences were rearranged by means of a custom python script (located as helper script in the RIBAP repository) to have the hemB (delta-aminolevulinic acid dehydratase) gene in the initial position.

Conclusions
Using a newly developed combined genome analysis approach we were able to accurately calculate the core genome of 33 strains of Chlamydia spp. belonging to 12 species, which comprises four-fifths of the respective genomes. Our analysis also revealed a high degree of genetic variability among field strains of the species of C. psittaci and C. gallinacea. An atypical C. abortus strain isolated from a bird was shown to share common traits with Chlamydia spp. of avian host preference.
Our study has revealed a number of characteristic features, which, in its entirety, distinguish C. psittaci from other Chlamydia spp.: (i) a relatively short PZ without a tryptophan operon, (ii) a characteristic set of Inc proteins (2)(3)(4)(5), (iii) the presence of the largest sinC locus, and (iv) an elevated number of subtype G Pmp proteins. The fact that characteristic genomic traits common to all avian Chlamydia spp. could not be identified indicates that the clue for understanding chlamydial host preference may rather lie at the level of SNPs or small indel events as well as gene regulation.
Future studies should investigate the functions of these characteristic loci in more detail and address the implications for pathogen-host interactions.

Availability of Data and Materials
The datasets generated and/or analyzed during the current study are available in an OSF repository at https://osf.io/j9zas/. All code is available at https://github.com/hoelzer-lab/ribap. Novel sequencing data were uploaded to ENA acc.no. PRJEB40883 and assemblies generated for eight strains in the course of this study are available at https://osf.io/j9zas/.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-0817/9/11/899/s1, Figure S1: UpSet diagram illustrating the core genome and common genes among all 33 strains based on complete genome sequences. The intersections are based on the RIBAP output for the 33 Chlamydia strains. Figure S2: Schematic presentation of the ClustalW alignment of the plasticity zone of all 33 strains included in this study. Abbreviations: Cab Chlamydia abortus, Cav Chlamydia avium, Cca Chlamydia caviae, Cfe Chlamydia felis, Cga Chlamydia gallinacea, Cmu Chlamydia muridarum, Cpe Chlamydia pecorum, Cpn, Chlamydia pneumoniae, Cps Chlamydia psittaci, Csu Chlamydia suis, Ctr Chlamydia trachomatis. Table S1: Plasticity zone parameters of all 33 strains included in this study. Table S2: Members of the Pmp family and subtypes of all strains investigated and intra-species homology. Table S3: Amino acid sequence similarity among PmpD proteins of Chlamydia spp. Table S4: Amino acid sequence similarity among IncB proteins of Chlamydia spp. Table S5: Results of Blast search for SINC protein orthologs. Table S6: Blastn hits of C. trachomatis pseudogenes in all 33 strains of this study. Table S7: Amino acid sequence similarity among HctB proteins of Chlamydia spp.