Regulation of Expression and Evolution of Genes in Plastids of Rhodophytic Branch

A novel algorithm and original software were used to cluster all proteins encoded in plastids of 72 species of the rhodophytic branch. The results are publicly available at http://lab6.iitp.ru/ppc/redline72/ in a database that allows fast identification of clusters (protein families) both by a fragment of an amino acid sequence and by a phylogenetic profile of a protein. No such integral clustering with the corresponding functions can be found in the public domain. The putative regulons of the transcription factors Ycf28 and Ycf29 encoded in the plastids were identified using the clustering and the database. A regulation of translation initiation was proposed for the ycf24 gene in plastids of certain red algae and apicomplexans as well as a regulation of a putative gene in apicoplasts of Babesia spp. and Theileria parva. The conserved regulation of the ycf24 gene expression and specificity alternation of the transcription factor Ycf28 were shown in the plastids. A phylogenetic tree of plastids was generated for the rhodophytic branch. The hypothesis of the origin of apicoplasts from the common ancestor of all apicomplexans from plastids of red algae was confirmed.


Introduction
The rapid growth of the number of sequenced plastid genomes gives rise to assumptions concerning their evolution and regulation not only in algae but also in plastid-bearing non-photosynthetic protists. The latter include the agents of dangerous protozoan infections such as malaria and toxoplasmosis. In particular, the phylum Apicomplexa includes many parasitic genera. For example, malaria is caused by Plasmodium spp.; Toxoplasma gondii is one of the most common parasites and can cause toxoplasmosis; Babesia microti is the primary cause of human babesiosis. In HIV patients, Toxoplasma gondii as well as Cryptosporidium spp. can cause serious and often fatal illness. Apicomplexan parasites also cause diseases in animals including cattle, chickens, dogs, and cats.
Apicoplasts are relict nonphotosynthetic plastids found in many species of the supergroup Chromalveolata. They originated from red algae through secondary endosymbiosis. The apicoplast is surrounded by four membranes that could emerge during endosymbiosis. The ancestral genome was reduced by deletions and rearrangements to its present 35 kb size.
Apicoplasts are among the efficient targets for therapeutic intervention and generation of non-virulent strains for rapid vaccine production [1].
All known plastids originate from cyanobacteria [2]. Three branches of primary plastids of independent origin are recognized; they are represented in GenBank by green algae and plants, the glaucophyte Cyanophora paradoxa, and red algae. At the same time, many species distant from those mentioned above have secondary or tertiary plastids derived from the primary ones. This study is focused on plastids of the rhodophytic branch, which have a common origin with red algal plastids. These comprise apicomplexan apicoplasts [3] as well as plastids of various algae including photosynthetic alveolates [4,5]. The latter include Durinskia baltica and Kryptoperidinium foliaceum with tertiary plastids originating from the plastids of diatoms, which in turn originate from those of red algae.
All plastid genomes are examples of reductive evolution. The identification of apicoplast origin in non-photosynthetic species is often problematic due to a significant reduction of their genomes. This explains the controversy concerning the origin of apicoplasts [6,7]. Indeed, early reports suggested green algae as the source of apicoplasts. Recent studies confirm that apicoplasts belong to the rhodophytic branch of plastids [3,5]. The identified putative common regulation of gene expression preserved in some apicoplasts is an important argument for the red algal origin of apicoplasts [3]. The coral endosymbiotic algae Chromera velia and Vitrella brassicaformis share a common ancestry with apicomplexan parasites [8]. A common ancestry of their plastids and apicoplasts can also be anticipated.
Some plastids have no genes of the photosystems and are incapable of photosynthesis but synthesize amino acids and isoprenoids and carry out fatty acid oxidation as well as other chemical reactions. For instance, such plastids are found in the red alga Choreocolax polysiphoniae (GenBank: NC_026522) [9] and the cryptomonad Cryptomonas paramecium (GenBank: NC_013703.1), and such apicoplasts are found in many apicomplexan parasites. Comparative analysis of proteomes of photosynthetic and non-photosynthetic species exposes the relationships between different proteins and makes it possible to identify putative regulons of transcription factors encoded in plastids.
Certain apicomplexan species lack apicoplasts, for instance Cryptosporidium parvum [10] and Gregarina niphandrodes [11,12]. This raises the question of the origin of apicoplasts: do they have a common origin and were lost in some species or were they independently acquired by different groups?

Methods
Bacterial-type promoters were identified using the method described elsewhere [22,23] based on the data relating nucleotide substitutions with the intensity of binding of bacterial-type RNA polymerase to the promoter upstream of the psbA gene in mustard plastids [24]. On the whole this method relies on comparison of genome regions with known promoters. The sfdp program of the Graphviz package [25] was used to visualize the clusters (protein families). The sequence Logos were prepared with WebLogo tool [26]. The phylogenetic trees were visualized using the MEGA 6 [27] and TreeView 1.6.6 [28] software. Conserved protein domains were identified using the Pfam database [29]. Amino acid sequences were aligned using the MUSCLE algorithm [30]. Trees were generated from multiple alignments of protein sequences using the RAxML software [31].
Protein clustering was performed with the method described in [32], which was successfully tested in a series of works [33][34][35]. Note that MCL [36] is commonly used to define clusters in a graph. However, our method performs well, as confirmed by the correct clusterings it produced for reference data [33][34][35]; at the same time, it requires substantially less computation time.
The representation of proteins as points in Euclidean space makes it possible to apply clustering methods described in [37][38][39][40][41]. However, the real data on proteins are inconsistent with the Euclidean metric. Our approach to clustering does not require even the triangle inequality to hold.
In mathematical terms, the following problem is solved. We are given a set of protein sequences. It is required to generate a clustering, i.e., to partition this set into pairwise disjoint subsets so that a cluster includes proteins with similar sequences from different proteomes, and proteins from the same proteome are included in the same cluster as rarely as possible.

Description of the Clustering Algorithm
We are given a set of proteomes $S_i$ and sets of component proteins $P_{ij}$ for each proteome. The BLAST raw score was used to compute the similarity $s_o(P_1, P_2)$ between proteins; $s_o(P_{ij}, P_{kl})$ is evaluated for all pairs of proteins $(P_{ij}, P_{kl})$ from all pairs of proteomes, so that the normalized similarity $s(P_{ij}, P_{kl})$ can be computed; it peaks for identical proteins. Let us consider an undirected graph $G_o$ with a set of nodes $\{P_{ij}\}$, which are connected by an edge if the BLAST E-value for the corresponding pair of proteins does not exceed the expect threshold. Each edge $(P_{ij}, P_{kl})$ is given the value $s(P_{ij}, P_{kl})$, which will be referred to as the edge weight; loops are not allowed. $G_o$ is used to generate a sparse graph $G$ which only includes edges meeting the following requirements: $s(P_{ij}, P_{kl}) = \max_m s(P_{im}, P_{kl}) = \max_m s(P_{ij}, P_{km})$ and $s(P_{ij}, P_{kl}) \geq L$, where the maximums are taken over all proteins of the corresponding plastids $i$ and $k$, and $L$ is the algorithm parameter. The case when $i = k$ imposes the constraint that $m \neq l$, and the second equality is not considered.
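The edge filter that turns the similarity graph into the sparse graph can be illustrated with a minimal Python sketch. It assumes that normalized similarities are already available for the pairs that passed the E-value filter; the function name `sparse_edges`, the dictionary layout, and the `(proteome_id, protein_id)` node encoding are hypothetical choices made for this illustration. For pairs within one proteome the sketch applies the same symmetric best-hit test, which may be stricter than the rule stated above.

```python
from collections import defaultdict


def sparse_edges(sim, L):
    """Keep edges that are mutual best hits and weigh at least L.

    sim maps frozenset({a, b}) -> normalized similarity, where a and b are
    distinct (proteome_id, protein_id) tuples (no loops).
    """
    # best[(a, k)]: highest similarity between protein a and any protein of proteome k
    best = defaultdict(float)
    for pair, w in sim.items():
        a, b = tuple(pair)
        best[(a, b[0])] = max(best[(a, b[0])], w)
        best[(b, a[0])] = max(best[(b, a[0])], w)

    kept = {}
    for pair, w in sim.items():
        a, b = tuple(pair)
        # Assumption: the same two-sided test is used when a and b share a proteome.
        if w >= L and w == best[(a, b[0])] and w == best[(b, a[0])]:
            kept[pair] = w
    return kept
```

With this filter, an edge survives only if neither endpoint has a better partner in the other endpoint's proteome, so each protein keeps at most one strongest link per proteome unless several links tie.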
Our algorithm implements Kruskal's procedure [42] for the graph G to generate a forest F (an acyclic subgraph with trees as the connected components) that includes all nodes from G. Specifically, edges in G are examined in descending order of their weight (in the case of equal weights, the edges connecting proteins of the same proteome are considered first), and each edge from G whose addition to F does not introduce a cycle in F is added to F. The total weight of all edges in the forest is called its weight. The weight of the resulting forest is the highest among all forests in G.
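A compact sketch of this step in Python is given below. It uses a standard union-find structure to detect cycles; the function name `max_spanning_forest`, the `(weight, u, v)` edge tuples, and the `(proteome_id, protein_id)` node encoding are assumptions made for illustration and are not taken from the original software.

```python
def max_spanning_forest(nodes, edges):
    """Kruskal's procedure on edges given as (weight, u, v) tuples.

    Nodes are (proteome_id, protein_id) tuples. Returns the list of edges
    of a maximum-weight spanning forest.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Find the root of x with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Descending weight; among equal weights, same-proteome edges come first
    # (False sorts before True).
    ordered = sorted(edges, key=lambda e: (-e[0], e[1][0] != e[2][0]))

    forest = []
    for w, u, v in ordered:
        ru, rv = find(u), find(v)
        if ru != rv:  # adding the edge does not create a cycle
            parent[ru] = rv
            forest.append((w, u, v))
    return forest
```

Nodes without any incident edge simply remain single-node trees of the forest.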
The following forest partition procedure, which generates a set C of desired protein clusters, is applied to the forest F. Let T be a tree from F and e be the edge in T with the minimum weight s among all edges in T. If $s < H$, where H is the algorithm parameter, and T does not meet the criterion of tree preservation stated below, then T is replaced in F with two new trees F' and F" by removing the edge e from T; otherwise (when the criterion is met or $s \geq H$) the tree T is moved to the set C.
The criterion of tree T preservation is that two conditions are satisfied: (1) the edge $(P_{ij}, P_{kl})$ with the minimum weight in T connects proteins $P_{ij}$ and $P_{kl}$, where $i = k$; and (2) any pair of nodes $P_{ij}$ and $P_{il}$ in the tree T corresponding to proteins of plastid i is connected in T by a path composed of nodes that correspond to proteins of this plastid.
If there are trees remaining in F, the next tree T in F is considered; otherwise the algorithm terminates. The resulting set of trees C represents clusters of initial proteins: each cluster consists of sequences assigned to all nodes of the same tree.
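The partition procedure and the preservation criterion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation: nodes are encoded as `(proteome_id, protein_id)` tuples, edges as `(weight, u, v)` tuples, and the function names `components`, `preserved`, and `partition_forest` are hypothetical. Condition (1) is read as requiring that the weakest edge joins two proteins of the same plastid.

```python
from collections import defaultdict


def components(nodes, edges):
    """Return connected components as (node_set, edge_list) pairs."""
    adj = defaultdict(list)
    for _, u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        result.append((comp, [e for e in edges if e[1] in comp]))
    return result


def preserved(nodes, edges, weakest):
    """Tree preservation criterion: conditions (1) and (2) of the text."""
    _, u, v = weakest
    if u[0] != v[0]:  # (1) the weakest edge must join proteins of one plastid
        return False
    by_plastid = defaultdict(set)
    for n in nodes:
        by_plastid[n[0]].add(n)
    for members in by_plastid.values():
        # (2) nodes of one plastid must be connected through nodes of that plastid
        inner = [e for e in edges if e[1] in members and e[2] in members]
        if len(components(members, inner)) != 1:
            return False
    return True


def partition_forest(nodes, forest_edges, H):
    """Split trees at their weakest edge until every tree qualifies as a cluster."""
    clusters = []
    pending = components(nodes, forest_edges)
    while pending:
        tree_nodes, tree_edges = pending.pop()
        if tree_edges:
            weakest = min(tree_edges, key=lambda e: e[0])
            if weakest[0] < H and not preserved(tree_nodes, tree_edges, weakest):
                rest = [e for e in tree_edges if e is not weakest]
                pending.extend(components(tree_nodes, rest))
                continue
        clusters.append(tree_nodes)
    return clusters
```

For example, a tree with edges of weight 0.9 and 0.2 and H = 0.5 is cut at the 0.2 edge when that edge joins proteins of different plastids, and is kept whole when it joins two paralogs of the same plastid that satisfy condition (2).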

Clustering of Proteins
We have clustered proteins encoded in the plastids of the rhodophytic branch. The results are publicly available at http://lab6.iitp.ru/ppc/redline72/. The database functions allow rapid cluster identification by either a fragment of a protein amino acid sequence or by a protein phylogenetic profile.
The total number of proteins is 9286; the number of singletons is 265; and the number of clusters is 305. The number of clusters including exactly n proteins in a particular species and no more than n proteins in any species is referred to as PC(n). For this clustering, PC(1) = 223, PC(2) = 79, PC(3) = 2, and PC(4) = 1. Some general data about the clusters are given in Table 1.
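Under the definition above, PC(n) counts the clusters whose largest per-species protein count equals n. A short Python sketch of this tally follows; the function name `pc_counts` and the `(species, protein_id)` encoding are hypothetical, and the input is assumed to contain clusters only, with singletons excluded.

```python
from collections import Counter


def pc_counts(clusters):
    """Map n -> PC(n) for clusters given as collections of (species, protein_id)."""
    result = Counter()
    for cluster in clusters:
        per_species = Counter(species for species, _ in cluster)
        # n is the largest number of proteins that any one species contributes
        result[max(per_species.values())] += 1
    return result
```

As a consistency check on the reported figures, PC(1) + PC(2) + PC(3) + PC(4) = 223 + 79 + 2 + 1 = 305, which equals the stated number of clusters.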

Regulons of Transcription Factors Encoded by Plastids
As compared to our previous data [35], the clusters of the MoeB and Ycf28 proteins were both supplemented by proteins encoded in the plastids of Vertebrata lanosa; neither of these proteins is encoded in plastids of Choreocolax polysiphoniae or any species beyond Rhodophyta. The profile identical to that of MoeB and Ycf28 was found in the proteins encoded by the apcA, apcB, apcD, apcE, apcF, carA, cpcA, cpcB, cpcG, gltB, nblA (ycf18), preA, and rpl28 genes; however, their 5'-leader sequences lack the conserved site found upstream of the moeB genes instead of the typical -35 promoter box.
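Finding proteins with identical phylogenetic profiles, as done here for MoeB and Ycf28, amounts to grouping clusters by the set of species they cover. The Python sketch below illustrates the idea on hypothetical inputs; the function name `group_by_profile` and the data layout are assumptions and do not describe the interface of the database.

```python
from collections import defaultdict


def group_by_profile(clusters):
    """Group cluster names by phylogenetic profile.

    clusters maps a cluster name to a collection of (species, protein_id)
    pairs; the profile of a cluster is the set of species it contains.
    """
    groups = defaultdict(list)
    for name, members in clusters.items():
        profile = frozenset(species for species, _ in members)
        groups[profile].append(name)
    return groups
```

Clusters that end up under the same key share a profile and are therefore candidates for joint regulation, which still has to be checked against the upstream sequences.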
The transcription factor Ycf29 is encoded in plastids of cryptomonads and rhodophytic algae except Porphyridium purpureum. The Ycf29 proteins are listed in Table 2. In the sparse graph, the Ycf29 and Ycf27 (OmpR) proteins belonged to the same connected component but were separated after clustering by our algorithm, which corresponds to the NCBI annotation. No other proteins with such a phylogenetic profile have been identified. A similar profile was observed for the CemA protein found in Porphyridium purpureum but not in Choreocolax polysiphoniae. CemA includes the PF03040 domain and was localized to the inner face of the outer membrane in chloroplasts but not to the thylakoid membrane. Cyanobacterial proteins orthologous to CemA are involved in carbon dioxide transport but are not transporters [43]. The membrane protein Ycf19 also has a similar phylogenetic profile. A sequence close to the consensus of the conserved bacterial-type promoter was found upstream of the ycf19 gene. Since Ycf29 is a part of the two-component signaling system, its regulon is linked to the response to environmental rather than intraplastid changes. The Ycf19 and Ycf89 proteins are not partitioned with the clustering parameters used. At the same time, the proteins listed in Ycf19 annotations together with several related proteins constitute a dense subgraph. The graph of proteins Ycf19 and Ycf89 generated by the algorithm is shown in Figure 4.

Regulation of Ycf24 (SufB) Translation Initiation
A conserved site was found in the 5'-untranslated region of ycf24 (sufB) in Eimeria tenella, Cyclospora cayetanensis, Toxoplasma gondii RH, Leucocytozoon caulleryi, Plasmodium chabaudi, and Porphyra purpurea. The sequence logo of this site is shown in Figure 5.

Regulation of Translation Initiation in Babesia spp. and Theileria parva
The genes from plastids of the Piroplasmida order lying between the rpl14 and rps8 are of particular interest. Although one such gene codes for the ribosomal protein L5 in many plastid-bearing algae of the rhodophytic branch, rpl14 and rps8 are neighboring genes in Coccidia and Haemosporida. The functional identification of the protein encoded by the gene lying between rpl14 and rps8 is questionable. Our clustering in Babesia bovis, Babesia orientalis, and Theileria parva suggests that this gene codes for the ribosomal protein L5, which belongs to a large cluster. In Babesia microti, this protein forms a singleton but is also annotated as L5. At the same time, it is only marginally similar to ribosomal proteins according to Pfam. The tree of these proteins is shown in Figure 6, and the proteins are listed in Table 3.
Conserved sites were identified in the leader regions 170-100 nt upstream of such genes in Piroplasmida. In Babesia spp., such sites reside within the coding sequence of rpl14. However, there was an insertion near the site, which is missing in orthologous L14 proteins. The sequence of this insertion is TSYSIDDRNRFKD in Babesia bovis. In Theileria parva, the site is not overlapped by the coding sequences. The corresponding transcription factor remains unknown in this case.
Figure 6. The tree of proteins encoded by the plastid genes located between the rpl14 and rps8 genes in Babesia spp. and Theileria parva. The plastid protein L5 from Chromera velia was used as the outgroup.

Protein Clustering
Overall, the data obtained indicate a good agreement between the clustering of plastid-encoded proteins performed by our algorithm and published data on the protein and species evolution. The proposed clustering algorithm and its software implementation are applicable to a wide range of problems related to graphs.
The clustering pattern of proteins encoded in red algal plastids demonstrates a substantial distance of Porphyridium purpureum from other species, which is accompanied by multiple DNA rearrangements in Rhodophyta plastids [44]; in addition, it demonstrates the separation of the Cyanidiaceae family including Galdieria sulphuraria, Cyanidium caldarium, and Cyanidioschyzon merolae.

Regulons of Plastid-Encoded Transcription Factors Ycf28, Ycf29, and Ycf30
The coincidence of the phylogenetic profiles of Ycf28 and MoeB reported previously [35] has been confirmed. The Ycf28 protein demonstrates a significant similarity with the cyanobacterial transcription factor NtcA. Consequently, we propose that Ycf28 is the factor that controls the transcription of the moeB gene by binding the DNA region near the promoter where the conserved motif was identified. There are no grounds to believe that Ycf28 is related to nitrogen metabolism, which implies a change of the transcription factor specificity relative to cyanobacteria, contrary to the previous proposal [45]. The absence of the typical -35 promoter box upstream of the moeB gene indicates that Ycf28 is a transcription activator.
The presence of Ycf29 in the plastid genomes of non-photosynthetic Cryptomonas paramecium and Choreocolax polysiphoniae indicates that this protein regulates processes unrelated to photosynthesis. One can assume that Ycf19 orthologs include proteins in the large cluster combining Ycf19 and Ycf89 that are encoded in plastids together with the Ycf29 factor. This allows us to refine protein clustering and, at the same time, to identify the putative photosynthesis-independent regulation.
Plastids of many algal species are known to encode the transcription factor Ycf30, which controls the expression of the rbcLS genes coding for subunits of ribulose-bisphosphate carboxylase (EC 4.1.1.39) as well as of the cbbX gene. Light-induced transcriptional activation was experimentally demonstrated and the Ycf30-binding motif was identified in these genes in plastids isolated from Cyanidioschyzon merolae [46]. Our phylogenetic profiles of these proteins agree with these data. However, the variability of Ycf30-binding site complicates its unambiguous identification in the DNA sequence. The sequence variability of experimentally confirmed Ycf30-binding site suggests that the factor binding to DNA largely depends on the DNA curvature [47] or electrostatic potential along the DNA [48] rather than on the nucleotide context.

Regulation of Ycf24 (SufB) Translation Initiation
The same regulation found in red algae, Coccidia, and Haemosporida supports the common origin of all apicoplasts from red algal plastids. Moreover, early separation of these apicomplexan groups naturally suggests that Cryptosporidium spp. and Gregarina niphandrodes lost their apicoplasts in the course of evolution but the common ancestor of apicomplexans had apicoplasts.
Moreover, a site identical to that upstream of ycf24 was found in the 5'-untranslated region of rps4 of Toxoplasma gondii [3]. This indicates a possible common regulation of translation in the apicoplast.

Regulation of Translation Initiation in Babesia spp. and Theileria parva
We believe that the gene coding for the ribosomal protein L5 was eliminated from the apicoplast in the ancestor of apicomplexan parasites, and a new gene was inserted into this chromosomal locus in the ancestor of Piroplasmida. The recognition of a new type of proteins is confirmed by the analysis of their 5'-leader regions, where conserved sites were identified. Indeed, it is natural to assume that a conserved site is involved in the regulation of gene expression, and the same expression pattern indicates a common functional significance of the corresponding proteins.

Conclusions
We have made a publicly available web service for protein identification by their phylogenetic profile. To our knowledge, no other services for the identification of plastid-encoded proteins by their phylogenetic profile (the two lists of species) are available. Our method allowed us to confirm the previous assumption concerning the regulation of plastid gene expression in the rhodophytic branch. In particular, our results confirm the hypothesis that apicoplasts in the common ancestor of apicomplexans descend from red algal plastids.