#### 2.2. Methods

Bacterial-type promoters were identified using the method described elsewhere [22,23], which is based on data relating nucleotide substitutions to the binding intensity of bacterial-type RNA polymerase at the promoter upstream of the psbA gene in mustard plastids [24]. On the whole, this method relies on comparing genome regions with known promoters. The sfdp program of the Graphviz package [25] was used to visualize the clusters (protein families). Sequence logos were prepared with the WebLogo tool [26]. Phylogenetic trees were visualized using MEGA 6 [27] and TreeView 1.6.6 [28]. Conserved protein domains were identified using the Pfam database [29]. Amino acid sequences were aligned using the MUSCLE algorithm [30]. Trees were generated from multiple alignments of protein sequences using RAxML [31].

Protein clustering was performed with the method from [32], which has been successfully tested in a series of works [33,34,35]. Note that MCL [36] is commonly used to define clusters in a graph. However, our method performs well, as confirmed by the correct clusterings it produced for reference data [33,34,35]; at the same time, it requires substantially less computation time.

The representation of proteins as points in Euclidean space makes it possible to apply the clustering methods described in [37,38,39,40,41]. However, real protein data are inconsistent with the Euclidean metric. Our approach to clustering does not even require the triangle inequality to hold.

In mathematical terms, the following problem is solved. We are given a set of protein sequences. It is required to generate a clustering, i.e., to partition this set into pairwise disjoint subsets so that a cluster includes proteins with similar sequences from different proteomes, and proteins from the same proteome are included in the same cluster as rarely as possible.

#### 2.3. Description of the Clustering Algorithm

We are given a set of proteomes S_{i} and sets of component proteins P_{ij} for each proteome. The BLAST raw score was used to compute the similarity s_{o}(P_{1},P_{2}) between proteins; s_{o}(P_{ij},P_{kl}) is evaluated for all pairs of proteins (P_{ij},P_{kl}) from all pairs of proteomes, so that the normalized similarity can be computed:

The normalized similarity s peaks for identical proteins. Let us consider an undirected graph G_{o} with a set of nodes {P_{ij}}, which are connected by an edge if the BLAST E-value for the corresponding pair of proteins does not exceed the expect threshold E. Each edge (P_{ij}, P_{kl}) is assigned the value s(P_{ij}, P_{kl}), which will be referred to as the edge weight; loops are not allowed.
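The construction of G_{o} can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record layout, the function name `build_initial_graph`, and the representation of a protein P_{ij} as a tuple `(i, j)` are assumptions made for illustration. The sketch also assumes that the normalized similarity has already been computed for each pair and that an edge is kept when the E-value does not exceed the threshold.

```python
from typing import Dict, Iterable, NamedTuple, Tuple

Protein = Tuple[int, int]  # P_ij encoded as (proteome i, protein j); an assumption


class Hit(NamedTuple):
    """One pairwise BLAST comparison (hypothetical record layout)."""
    a: Protein
    b: Protein
    evalue: float
    weight: float  # normalized similarity s(a, b), assumed precomputed


def build_initial_graph(hits: Iterable[Hit],
                        e_threshold: float = 0.001
                        ) -> Dict[Tuple[Protein, Protein], float]:
    """Return the weighted undirected edges of G_o as {(u, v): weight}, u < v."""
    edges: Dict[Tuple[Protein, Protein], float] = {}
    for h in hits:
        if h.a == h.b:
            continue  # loops are not allowed
        if h.evalue > e_threshold:
            continue  # hit is not significant enough to form an edge
        key = (min(h.a, h.b), max(h.a, h.b))  # one entry per unordered pair
        # If a pair is reported in both directions, keep the larger weight
        # (an assumption; a symmetric similarity gives equal values anyway).
        edges[key] = max(h.weight, edges.get(key, h.weight))
    return edges
```

Storing each unordered pair once keeps the graph undirected without a separate symmetrization step.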

G_{o} is used to generate a sparse graph G which only includes edges meeting the following requirements:

where the maximums are taken over all proteins of the corresponding plastids i and k, and L is the algorithm parameter. The case when i = k imposes the constraint that m ≠ l and the second equality is not considered.

Our algorithm implements Kruskal's procedure [42] for the graph G to generate a forest F (an acyclic subgraph whose connected components are trees) that includes all nodes of G. Specifically, the edges of G are examined in descending order of weight (among edges of equal weight, those connecting proteins of the same proteome are considered first), and each edge whose addition to F does not introduce a cycle is added to F. The total weight of all edges in a forest is called its weight. The resulting forest has the maximum weight among all forests in G.
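The forest construction can be sketched as a standard Kruskal procedure with a union-find structure. This is an illustrative sketch rather than the authors' code: the function name `max_weight_forest`, the edge dictionary `{(u, v): weight}`, and the encoding of a protein P_{ij} as a tuple `(i, j)` are assumptions.

```python
def max_weight_forest(nodes, edges):
    """Kruskal's procedure for a maximum-weight spanning forest.

    nodes: iterable of (proteome, protein) tuples (assumed encoding of P_ij).
    edges: dict {(u, v): weight} describing the sparse graph G.
    Returns the forest edges as a list of (u, v, weight) tuples.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Find the component representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Descending weight; among equal weights, same-proteome edges come first
    # (False sorts before True).
    ordered = sorted(
        edges.items(),
        key=lambda kv: (-kv[1], kv[0][0][0] != kv[0][1][0]),
    )

    forest = []
    for (u, v), w in ordered:
        ru, rv = find(u), find(v)
        if ru != rv:          # adding the edge does not create a cycle
            parent[ru] = rv   # merge the two components
            forest.append((u, v, w))
    return forest
```

Nodes without incident edges remain isolated single-node trees, so the forest still covers every node of G.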

The following forest partition procedure, which generates the set C of desired protein clusters, is applied to the forest F. Let T be a tree from F and e be the edge of T with the minimum weight s among all edges of T. If s < H, where H is the algorithm parameter, and T does not meet the tree preservation criterion stated below, then T is replaced in F with two new trees F’ and F” obtained by removing the edge e from T; otherwise (when the criterion is met or s ≥ H) the tree T is moved to the set C.

The criterion of tree T preservation is that two conditions are satisfied: (1) the edge (P_{ij}, P_{kl}) with the minimum weight in T connects proteins P_{ij} and P_{kl}, where i ≠ k; and (2) any pair of nodes P_{ij} and P_{il} in the tree T corresponding to proteins of plastid i is connected in T by a path composed of nodes that correspond to proteins of this plastid.

If there are trees remaining in F, the next tree T in F is considered; otherwise the algorithm terminates. The resulting set of trees C represents clusters of initial proteins: each cluster consists of sequences assigned to all nodes of the same tree.
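The partition procedure and the preservation criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(u, v, weight)` edge tuples, the encoding of P_{ij} as `(i, j)`, and the handling of single-node trees (moved to C directly) are all assumptions. Because the path between two nodes of a tree is unique, condition (2) is checked by testing that the nodes of each proteome induce a connected subgraph of T.

```python
from collections import defaultdict


def components(nodes, edges):
    """Connected components as (node_set, edge_list) pairs."""
    adj = defaultdict(list)
    for u, v, _ in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        result.append((comp, [e for e in edges if e[0] in comp]))
    return result


def is_preserved(t_nodes, t_edges, min_edge):
    """Tree preservation criterion: conditions (1) and (2)."""
    u, v, _ = min_edge
    if u[0] == v[0]:
        return False  # (1) fails: minimum edge joins proteins of one proteome
    by_proteome = defaultdict(set)
    for n in t_nodes:
        by_proteome[n[0]].add(n)
    for members in by_proteome.values():
        inner = [e for e in t_edges if e[0] in members and e[1] in members]
        if len(components(members, inner)) > 1:
            return False  # (2) fails: same-proteome nodes are not contiguous
    return True


def partition_forest(nodes, forest_edges, H=0.60):
    """Split the trees of the forest F into the final clusters C."""
    clusters = []
    queue = components(nodes, forest_edges)
    while queue:
        t_nodes, t_edges = queue.pop()
        if not t_edges:                 # single-node tree (assumed handling)
            clusters.append(t_nodes)
            continue
        e = min(t_edges, key=lambda x: x[2])
        if e[2] < H and not is_preserved(t_nodes, t_edges, e):
            rest = list(t_edges)
            rest.remove(e)              # removing e splits T into two trees
            queue.extend(components(t_nodes, rest))
        else:
            clusters.append(t_nodes)    # T becomes a cluster
    return clusters
```

For example, a tree in which two cross-proteome pairs are joined only by a weak same-proteome edge is split at that edge, whereas a two-node tree joined by a weak cross-proteome edge meets the criterion and is kept as one cluster.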

The following algorithm parameters were used: H = 0.60, E = 0.001, and L = 0.