Nearest Embedded and Embedding Self-Nested Trees

Self-nested trees present a systematic form of redundancy in their subtrees and thus achieve optimal compression rates under DAG compression. A method for quantifying the degree of self-similarity of plants through self-nested trees was introduced by Godin and Ferraro in 2010. The procedure consists in computing a self-nested approximation, called the nearest embedding self-nested tree, that both embeds the plant and is the closest to it. In this paper, we propose a new algorithm that computes the nearest embedding self-nested tree with a smaller overall complexity, as well as an algorithm for the nearest embedded self-nested tree. We show from simulations that the latter is in most cases the closer of the two to the initial data, which suggests that this better approximation should be privileged as a measure of the degree of self-similarity of plants.


Introduction
Trees form a wide family of combinatorial objects with many application fields, e.g., plant modeling and the analysis of XML files. Modern databases are huge and thus stored in compressed form. Compression methods take advantage of repeated substructures appearing in the tree. As explained in [2], one often considers the following two types of repeated substructures: subtree repeats (used in DAG compression [3,4,6,7]) and tree pattern repeats (exploited in tree grammars [5,9] and top tree compression [2]). We restrict ourselves to DAG compression of unordered rooted trees, which consists in building a Directed Acyclic Graph (DAG) that represents a tree without the redundancy of its identical subtrees (see Fig. 1). Two different algorithms exist for computing the DAG reduction of a tree τ [7, 2.2 Computing Tree Reduction], which share the same time-complexity in O(#V(τ)² × D(τ) × log(D(τ))), where V(τ) denotes the set of vertices of τ and D(τ) its outdegree.
Trees that are the most compressed by DAG compression present the highest level of redundancy in their subtrees: all their subtrees of a given height are isomorphic. In this case, the DAG related to a tree τ is linear, i.e., there exists a path going through all its vertices, and has exactly H(τ) + 1 vertices, H(τ) denoting the height of τ, which is the minimal number of vertices among trees of this height (see τ3 in Fig. 1). This family of trees was introduced in [8] as the first interesting class of trees for which the subtree isomorphism problem is in NC². It has been known under the name of nested trees [8] and later self-nested trees [7] to insist on their recursive structure and their proximity to the notion of structural self-similarity.

Figure 1: Trees and their DAG reduction. In the tree, roots of isomorphic subtrees are colored identically. In the DAG, vertices are equivalence classes colored according to the class of isomorphic subtrees that they represent.
The authors of [7] are interested in capturing the self-similarity of plants through self-nested trees. They propose to construct a self-nested tree that minimizes the distance of the original tree to the set of self-nested trees that embed it. The distance to this Nearest Embedding Self-nested Tree (NEST) is then used to quantify the self-nestedness of the tree and thus its structural self-similarity (see τ and NEST(τ) in Fig. 2). The main result of [7, Theorem 1 and E. NEST Algorithm] is an algorithm that computes the NEST of a tree τ from its DAG reduction in O(H(τ)² × D(τ)).
The goal of the present article is threefold. We aim at proposing a new and more explicit algorithm that computes the NEST of a tree τ with the same time-complexity O(H(τ)² × D(τ)) as in [7] but that takes as input the height profile of τ instead of its DAG reduction. We establish that the height profile of a tree τ can be computed in O(#V(τ) × D(τ)), reducing the overall complexity by a linear factor. Based on this work, we also provide an algorithm in O(H(τ)²) that computes the Nearest embedded Self-nested Tree (NeST) of a tree τ (see τ and NeST(τ) in Fig. 2). Finally, we show from numerical simulations that the distance of a tree τ to its NeST is much lower than the distance to its NEST. The NeST is most of the time a better approximation of a tree than the NEST and should thus be privileged to quantify the degree of self-nestedness of plants.
The paper is organized as follows. The structures of interest in this paper, namely unordered trees, DAG compression and self-nested trees, are defined in Section 2. Section 3 is dedicated to the definition and the study of the height profile of a tree. The approximation algorithms are presented in Section 4. We give new insight into the definitions of the NEST and of the NeST in Subsection 4.1. Our NEST algorithm is presented in Subsection 4.2, while the NeST algorithm is given in Subsection 4.3. Section 5 is devoted to simulations. We show in Subsection 5.1 that the NeST is mostly a better approximation of a tree than the NEST. An application to a real rice panicle is presented in Subsection 5.2.

Unordered rooted trees
A rooted tree τ is a connected graph containing no cycle, that is, without a chain from any vertex v to itself, and such that there exists a unique vertex R(τ), called the root, which has no parent, while any vertex different from the root has exactly one parent. The leaves of τ are all the vertices without children. The set of vertices of τ is denoted by V(τ). The height of a vertex v may be recursively defined as H(v) = 0 if v is a leaf and H(v) = 1 + max_{w ∈ C_τ(v)} H(w) otherwise, C_τ(v) denoting the set of children of v in τ. The height of the tree τ is defined as the height of its root, H(τ) = H(R(τ)). The outdegree D(τ) of τ is the maximal branching factor that can be found in τ, that is, D(τ) = max_{v ∈ V(τ)} #{w ∈ V(τ) : (v, w) ∈ E(τ)}, with E(τ) the set of edges of τ.
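These recursive definitions translate directly into code. The sketch below is ours, not the paper's: it represents a rooted tree as nested lists of children (a leaf is the empty list), an illustrative choice, and computes H(τ) and D(τ).

```python
def height(tree):
    # H(v) = 0 for a leaf, 1 + max over the children C_tau(v) otherwise
    return 1 + max((height(c) for c in tree), default=-1)

def outdegree(tree):
    # D(tau): maximal branching factor over all vertices
    return max([len(tree)] + [outdegree(c) for c in tree])

# Example: a root with two children, one of which has two leaf children
tau = [[[], []], []]
```

For the example tree, both the height and the outdegree equal 2.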
In all the sequel, we consider unordered rooted trees, for which the order among the sibling vertices of any vertex is not significant. A precise characterization is obtained from the additional definition of isomorphic trees. Let τ and θ be two rooted trees. A one-to-one correspondence ϕ : V(τ) → V(θ) is called a tree isomorphism if, for any edge (v, w) ∈ E(τ), (ϕ(v), ϕ(w)) ∈ E(θ).
The trees τ and θ are called isomorphic whenever there exists a tree isomorphism between them. One can determine whether two n-vertex trees are isomorphic in O(n) [1, Example 3.2 and Theorem 3.3]. The existence of a tree isomorphism defines an equivalence relation on the set of rooted trees. The class of unordered rooted trees is the set of equivalence classes for this relation, i.e., the quotient set of rooted trees by the existence of a tree isomorphism.
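In practice, unordered tree isomorphism can be tested with a canonical encoding: sorting the children's encodings erases the sibling order. The sketch below (our nested-list representation again) illustrates the idea; this simple string-sorting version is not the linear-time algorithm of [1].

```python
def canonical(tree):
    # Encode a subtree as a parenthesized string; sorting the children's
    # encodings makes sibling order irrelevant, so two unordered trees
    # are isomorphic iff their encodings coincide.
    return "(" + "".join(sorted(canonical(c) for c in tree)) + ")"

def isomorphic(t1, t2):
    return canonical(t1) == canonical(t2)
```

For instance, a root whose children are a leaf and a one-child node is isomorphic to the same tree with the two children swapped.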

DAG compression
Now we consider the equivalence relation "existence of a tree isomorphism" on the set of the subtrees of a tree τ. We consider the quotient graph Q(τ) = (V, E) obtained from τ using this equivalence relation: V is the set of equivalence classes of the subtrees of τ, while E is the set of pairs of equivalence classes (C1, C2) such that R(C2) ∈ C_τ(R(C1)) up to an isomorphism. The graph Q(τ) is a DAG [7, Proposition 1], that is, a connected directed graph without a path from any vertex v to itself.
Let (C1, C2) be an edge of the DAG Q(τ). We define N(C1, C2) as the number of occurrences of a tree of C2 just below the root of any tree of C1. The tree reduction R(τ) is defined as the quotient graph Q(τ) augmented with the labels N(C1, C2) on its edges [7, Definition 3 (Reduction of a tree)]. Intuitively, the graph R(τ) represents the original tree τ without its structural redundancies (see Fig. 1). R(τ) has at least H(τ) + 1 vertices, and this bound is attained if and only if τ is self-nested. This proves that self-nested trees achieve optimal compression rates among trees of the same height, whatever their number of nodes (compare τ3 with τ1 and τ2 in Fig. 1).
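One possible construction of R(τ) reuses the canonical encoding sketched earlier: each distinct encoding is an equivalence class, and counting the children's classes gives the labels N(C1, C2). This is a hedged sketch under our nested-list representation, not either of the two algorithms cited from [7].

```python
from collections import Counter

def reduction(tree):
    """Return (classes, edges): classes maps a canonical encoding to a
    class index; edges[i] counts, for each child class j, the label
    N(C_i, C_j), i.e., occurrences of class j just below class i."""
    classes, edges = {}, {}

    def visit(node):
        child = [visit(c) for c in node]          # (key, class) pairs
        key = "(" + "".join(sorted(k for k, _ in child)) + ")"
        if key not in classes:
            classes[key] = len(classes)
            edges[classes[key]] = Counter(i for _, i in child)
        return key, classes[key]

    visit(tree)
    return classes, edges
```

For the linear tree [[[]], [[]]] (two isomorphic height-1 subtrees under the root), this yields H(τ) + 1 = 3 classes and a single root edge of multiplicity 2, matching the saturated bound discussed above.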

Self-nested trees
Height profile of the tree structure

Definition and complexity
This section is devoted to the definition of the height profile ρ_τ of a tree τ and to the presentation of an algorithm to calculate it. In the sequel, we assume that the tree τ is always traversed in the same order, say depth-first search to fix ideas. In particular, when vectors are indexed by nodes of τ sharing the same property, the order within the vector is important and should always be the same.
For any vertex v and any height h, γ_h(v) is the number of subtrees of height h directly under v. Now, for 0 ≤ h2 < h1, we consider the vector ρ_τ(h1, h2) made of the concatenation of the integers γ_{h2}(v) over the subtrees τ[v] of height h1, ordered in depth-first search. Consequently, ρ_τ is an array made of vectors with varying lengths.
Let A1 and A2 be two arrays for which each entry is a vector. We say that A1 and A2 are equivalent if, for any line i, there exists a permutation σ_i such that, for any column j, A1(i, σ_i(j)) = A2(i, j). In particular, i being fixed, all the vectors A1(i, j) and A2(i, j) must have the same length. This condition defines an equivalence relation. The height profile of τ is the array ρ_τ seen as an element of the quotient space of arrays of vectors under this equivalence relation. In other words, the vectors ρ_τ(h1, h2), 0 ≤ h2 < h1 and h1 fixed, must be ordered in the same way, but the choice of the order is not significant.

Finally, it should already be remarked that ρ_τ(h1, h2) = ∅ when h2 ≥ h1 or h1 > H(τ). Consequently, the height profile can be reduced to the triangular array (ρ_τ(h1, h2))_{0 ≤ h2 < h1 ≤ H(τ)}. The application ρ_τ provides the distribution of subtrees of height h2 just below the root of subtrees of height h1 for all couples (h1, h2), which typically represents the height profile of τ. For clarity's sake, the values of ρ_{τ_k} for the trees τ_k of Fig. 1 are given in (1). It should be noticed that the height profile does not contain all the topology of the tree, since the trees τ1 and τ2 of Fig. 1 are different but share the same height profile (1). However, the height of a tree τ can be recovered from its height profile through the relation H(τ) = dim(ρ_τ), the dimension of ρ_τ being defined as the largest h1 such that the row ρ_τ(h1, ·) is not empty.

The height profile can be computed in O(#V(τ) × D(τ)): traverse the tree in depth-first search in O(#V(τ)) and calculate for each vertex v the vector of the heights of its children, from which the counts γ_h(v), and thus the entries of ρ_τ, are obtained.
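The computation just described can be sketched as follows (our nested-list trees; a direct, not fully optimized transcription). Children are visited before their parent, which fixes a valid order up to the equivalence relation above.

```python
from collections import defaultdict

def height_profile(tree):
    """rho[(h1, h2)]: one entry per subtree of height h1, counting the
    height-h2 subtrees directly under its root. Single traversal."""
    rho = defaultdict(list)

    def visit(node):
        child_heights = [visit(c) for c in node]
        h = 1 + max(child_heights, default=-1)   # height of this subtree
        for h2 in range(h):
            rho[(h, h2)].append(child_heights.count(h2))
        return h

    visit(tree)
    return dict(rho)
```

For the tree [[[], []], []] (root of height 2 with a height-1 child and a leaf), the unique height-1 subtree has two leaf children, and the root has one child of each height 0 and 1.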

Relation with self-nested trees
Self-nested trees are characterized by their height profile in light of the following result.

Proposition 3.2.
τ is self-nested if and only if, for any 0 ≤ h 2 < h 1 ≤ H(τ), all the components of the vector ρ τ (h 1 , h 2 ) are the same (for instance see the profile (1) of the tree τ 3 presented in Fig. 1). In addition, a self-nested tree τ may be reconstructed from ρ τ (see Algorithm 1).
Proof. If τ is self-nested, the N_{h1} subtrees of height h1 appearing in τ are isomorphic and thus have the same number n_{h1,h2} of subtrees of height h2 just below their root. As a consequence, ρ_τ(h1, h2) = (n_{h1,h2}, …, n_{h1,h2}).
The converse may be established in light of the following lemma, whose proof presents no difficulty. All the subtrees of height 1 in τ are isomorphic because all the components of ρ_τ(1, 0) are the same. The expected result is shown by induction on the height thanks to the previous lemma, whose assumptions are satisfied since ρ_τ always contains vectors for which all the entries are equal. The previous reasoning also provides a way (presented in Algorithm 1) to build a unique (self-nested) tree T from the height profile ρ_τ. In addition, it is easy to see that τ and T are isomorphic.
In order to present the algorithm reconstructing a self-nested tree from its height profile, we need to define the restriction of a height profile to some height. Let p be a height profile. The restriction p|_h of p to height h ≥ 0 is the array defined by p|_h(h1, h2) = p(h1, h2) if h1 ≤ h, and p|_h(h1, h2) = ∅ otherwise. A peculiar case is p|_0, for which each entry is the empty set and thus dim(p|_0) = 0. It should also be remarked that there may exist no tree τ such that p|_h is the height profile of τ.
As we can see in the proof of Proposition 3.2 or in Algorithm 1, the lengths of the vectors ρ_τ(h1, h2) are not significant for reconstructing a self-nested tree τ. Consequently, since all the components of ρ_τ(h1, h2) are the same, we can identify the height profile of a self-nested tree with the integer-valued array [ρ_τ(h1, h2)_1].
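With this identification, the reconstruction becomes a short recursion: a node of height h1 receives p(h1, h2) copies of the unique subtree of height h2, for every h2 < h1. The sketch below is in the spirit of Algorithm 1 but is our own formulation (nested-list trees; p is the integer-valued array just introduced, and p[(h, h-1)] ≥ 1 is assumed so that heights are consistent).

```python
def self_nested_tree(p, H):
    # Build the unique self-nested tree of height H whose height
    # profile is the integer-valued array p[(h1, h2)], h2 < h1 <= H.
    def build(h):
        return [build(h2) for h2 in range(h) for _ in range(p[(h, h2)])]
    return build(H)
```

For p = {(1, 0): 2, (2, 0): 0, (2, 1): 1} and H = 2, the result is a root with one child of height 1 that has two leaf children.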
Proposition 3.4. The number of nodes of a self-nested tree τ can be computed from ρ_τ in O(H(τ)²).
Proof. By induction on the height, one has #V(τ) = N(H(τ)), where the sequence N is defined by N(0) = 1 (number of nodes of a tree reduced to a root) and, for 1 ≤ h ≤ H(τ),

N(h) = 1 + Σ_{h2=0}^{h−1} ρ_τ(h, h2)_1 N(h2). (2)

The number of operations required to compute N(H(τ)) is of order O(H(τ)²).
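The recursion (2) can be sketched by dynamic programming as follows (p is the integer-valued height profile of the self-nested tree, identified as above; the code is our illustration).

```python
def node_count(p, H):
    # N(0) = 1; N(h) = 1 + sum_{h2 < h} p[(h, h2)] * N(h2).
    # Filling N bottom-up gives N(H) in O(H^2) operations.
    N = [1] * (H + 1)
    for h in range(1, H + 1):
        N[h] = 1 + sum(p[(h, h2)] * N[h2] for h2 in range(h))
    return N[H]
```

For p = {(1, 0): 2, (2, 0): 0, (2, 1): 1}: N(1) = 1 + 2·1 = 3 and N(2) = 1 + 1·3 = 4, matching the four nodes of the tree built from this profile.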
The authors of [7, Proposition 6] calculate the number of nodes of a tree (self-nested or not) from its DAG reduction by a formula very similar to (2), which achieves the same complexity on self-nested trees. As mentioned before, a tree cannot be recovered from its height profile in general, thus we cannot expect such a result from the height profile of an arbitrary tree.

Editing operations
We shall define the NEST and the NeST of a tree τ. As in [7, eq. (5)], we ask these approximations to be consistent with Zhang's edit distance between unordered trees [10] denoted D Z in this paper. Thus, as in [10, 2.2 Editing Operations], we consider the following two types of editing operations: adding a node and deleting a node. Deleting a node w means making the children of w become the children of the parent v of w and then removing w (see Fig. 3). Adding w as a child of v will make w the parent of a subset of the current children of v (see Fig. 4).
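With the nested-list representation of our earlier sketches, these two editing operations are list splices: deleting a node re-attaches its children in its place, and the inverse insertion wraps a slice of children in a new node. A minimal illustration (ours, not the paper's formulation):

```python
def delete_node(parent, i):
    # Delete the i-th child w of parent: the children of w become
    # children of parent, in place of w (Zhang's deleting operation).
    parent[i:i + 1] = parent[i]

def insert_node(parent, lo, hi):
    # Add a new node w as a child of parent, making w the parent of
    # the current children parent[lo:hi] (Zhang's inserting operation).
    parent[lo:hi] = [parent[lo:hi]]
```

Deleting the first child of [[[], []], []] yields [[], [], []]; re-inserting a node above the first two children restores the original tree, showing that the two operations are inverse of each other.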

Constrained editing operations
Zhang's edit distance is defined from the above editing operations and from constrained mappings between trees. Let θ be a tree approximating τ, obtained by inserting nodes in τ only, and consider the induced mapping M_{τ→θ} that associates the nodes of τ with themselves in θ. We want the approximation process to be consistent with Zhang's edit distance D_Z, i.e., we want the mapping M_{τ→θ} to be a constrained mapping in the sense of Zhang, which in particular implies D_Z(θ, τ) = #V(θ) − #V(τ). We shall prove that this requirement excludes some inserting operations in our context.
Indeed, the mapping M_{τ→θ} involved in the inserting operation of Fig. 4 is partially displayed in Fig. 5, nodes v_i of τ being associated with nodes w_i of θ. The LCA (lowest common ancestor) of v1 and v2 in τ is a proper ancestor of v3. However, the LCA of w1 and w2 in θ is not a proper ancestor of w3. As a consequence, this mapping is not a constrained mapping as defined by Zhang. A necessary and sufficient condition for M_{τ→θ} to be a constrained mapping is given in Lemma 4.1.

Figure 5: The tree θ is obtained from τ by inserting an internal node. The associated mapping does not satisfy the conditions imposed by Zhang [10] because the LCA of v1 and v2 is a proper ancestor of v3 whereas the LCA of w1 and w2 is not a proper ancestor of w3.
Proof. The proof is obvious if v has one or two children. Thus we assume that v has at least three children c 1 , c 2 and c 3 . In τ, the LCA of c 1 and c 2 is v and v is an ancestor of c 3 . Adding w as the parent of c 1 and c 2 makes it the LCA of these two nodes, but not an ancestor of c 3 in θ. The additional condition on the LCAs is then not satisfied. This problem appears only when making w the parent of at least two children and of not all the children of v.
Consequently, we restrict ourselves to the following inserting operations, which are the only ones ensuring that the associated mapping satisfies Zhang's condition: adding w as a child of v may make w (i) a leaf, (ii) the parent of one current child of v, or (iii) the parent of all the current children of v. However, it should be noticed that (iii) can always be expressed as (ii) (see Fig. 6). Finally, we only consider the inserting operations that make the new child of v the parent of zero or one current child of v. For obvious reasons of symmetry, the allowed deleting operations are the complements of the inserting operations, i.e., one can delete an internal node if and only if it has a unique child, which also ensures that the induced mapping is constrained in the sense of Zhang.

Preserving the height of the pre-existing nodes
In [7, Definition 9 and Fig. 6], the NEST of a tree τ is obtained by successive partial linearizations of the (non-linear) DAG of τ which consist in merging all the nodes at the same height of the DAG.
A consequence is that the height of any pre-existing node of τ is not changed by the inserting operations. For the sake of consistency with [7], we only consider inserting and deleting operations that preserve the height of all the pre-existing nodes of τ.
The next two results deal with inserting operations that preserve the height of the pre-existing nodes.

Lemma 4.2.
Let τ be a tree, v ∈ V(τ) and c ∈ C_τ(v). Let θ be the tree obtained from τ by adding the internal node w as a child of v, making w the parent of c. Then the heights of the pre-existing nodes of τ are unchanged in θ if and only if v has a child of height at least H(c) + 1, i.e., H(v) ≥ H(c) + 2.

Proof. Adding w may only increase the height of v and that of its ancestors in τ. If the height of v is not changed by adding w, the heights of its ancestors are not modified either.

Lemma 4.3. Let τ be a tree, v ∈ V(τ) and t a tree. Let θ be the tree obtained from τ by adding t as a new subtree under v. Then the heights of the pre-existing nodes of τ are unchanged in θ if and only if H(t) is at most the height of the highest child of v; in particular, v must not be a leaf.

Proof. Adding a subtree t under v may only increase the height of v and that of its ancestors in τ. If the height of v is not changed by adding t, the heights of its ancestors are not modified either. Adding t makes the height of v increase if and only if H(t) is strictly greater than the height of the highest child of v; in particular, adding t under a leaf always increases its height.

The results below concern deleting operations that preserve the heights of the remaining nodes of τ.

Lemma 4.4.
Let τ be a tree, v ∈ V(τ), w ∈ C_τ(v) and C_τ(w) = {c}. Let θ be the tree obtained from τ by deleting the internal node w, making its unique child c a child of v. Then the heights of the remaining nodes of τ are unchanged in θ if and only if v has a child other than w of height at least H(w).

Proof. Deleting w may only decrease the height of v and that of its ancestors in τ. If the height of v is not changed by deleting w, the heights of its ancestors are not modified either.
Proof. The proof follows the same reasoning as in the previous result.

NEST and NeST
In view of the foregoing, we consider the set of inserting and deleting operations that fulfill the following requirements:
(AI) adding an internal node w as a child of v, making w the parent of zero or one current child of v, without modifying the height of the pre-existing nodes;
(AS) adding a subtree under a node of τ without modifying the height of the pre-existing nodes;
(DI) deleting an internal node that has a unique child, without modifying the height of the remaining nodes;
(DS) deleting a subtree without modifying the height of the remaining nodes.
The NEST (the NeST, respectively) of a tree τ is the self-nested tree obtained from τ by a set of inserting operations AI and AS (of deleting operations DI and DS, respectively) of minimal cost, the cost of inserting or deleting a subtree being its number of nodes. Existence and uniqueness of the NEST are not obvious at this stage. The NeST exists because the (self-nested) tree composed of a unique root can always be obtained from any tree by deleting operations, but its uniqueness is not evident.

NEST algorithm
In order to present our NEST algorithm in a concise form in Algorithm 2, we need to define the following operations involving two vectors u and v of the same size n and a real number γ: u + v = (u1 + v1, …, un + vn), u − v = (u1 − v1, …, un − vn), and max(u, γ) = (max(u1, γ), …, max(un, γ)).
In other words, these operations must be understood component by component. In addition, in a condition, u = 0 (u ≠ 0, respectively) means that, for all 1 ≤ i ≤ n, u_i = 0 (u_i ≠ 0, respectively). Finally, for 1 ≤ i ≤ j ≤ n, u_{i…j} denotes the vector (u_i, …, u_j) of length j − i + 1. This notation will also be used in Algorithm 3 for calculating the NeST.
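For concreteness, this vector notation amounts to the following elementwise helpers (a sketch of ours; in a numerical setting NumPy arrays would provide these operations directly, and the slice u_{i…j} is u[i-1:j] with 0-based indexing).

```python
def vadd(u, v):
    # u + v, component by component (u and v of the same size)
    return [ui + vi for ui, vi in zip(u, v)]

def vmax(u, gamma):
    # max(u, gamma) = (max(u1, gamma), ..., max(un, gamma))
    return [max(ui, gamma) for ui in u]

def is_zero(u):
    # the condition "u = 0" of the algorithms
    return all(ui == 0 for ui in u)
```

For instance, vmax([1, 5, 2], 3) clips every component below 3 up to 3.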
The relation between the above algorithm and the NEST of a tree is provided in the following result, which states in particular the existence of the NEST.

Proof. By definition of the NEST, the heights of the pre-existing nodes of τ cannot be modified. Thus, the number of subtrees of height h − 1 under a node of height h can only increase by inserting subtrees in the structure. Then we have

ρ_{NEST(τ)}(h, h − 1) ≥ max_j ρ_τ(h, h − 1)_j. (3)

Let v be a vertex of height h in τ. We recall that γ_i(v) denotes the number of subtrees of height i under v. Our objective is to understand the consequences for γ_i(v) of the inserting operations required to obtain ρ_{NEST(τ)}(h, h − 1) subtrees of height h − 1 under v. To this aim, we shall define a sequence that corresponds to the successive modified versions of τ. The first exponent h − 1 means that this sequence concerns the editing operations used to obtain the right number of subtrees of height h − 1 under v.
Let Δ^(0)_{h−1}(v) be the number of subtrees of height h − 1 that must be added under v to obtain the height profile of the NEST under v, i.e., Δ^(0)_{h−1}(v) = ρ_{NEST(τ)}(h, h − 1) − γ_{h−1}(v). The subtrees of height h − 1 that we have to add are isomorphic, self-nested and embed all the subtrees of height h − 2 appearing in τ, by definition of the NEST. In particular, they can be obtained by the allowed inserting operations from the subtrees of height h − 2 under v, by first adding an internal node to increase their height to h − 1. In addition, it is less costly in terms of editing operations to construct the subtrees of height h − 1 from the subtrees of height h − 2 available under v than to add these subtrees directly under v. If all the subtrees of height h − 2 under v must be reconstructed later, it will be possible to insert them and the total cost will be the same as by directly adding the subtrees of height h − 1 under v. As a consequence, all the available subtrees of height h − 2 are used to construct subtrees of height h − 1 under v, and it remains Δ^(1)_{h−1}(v) = max(Δ^(0)_{h−1}(v) − γ_{h−2}(v), 0) subtrees of height h − 1 to be built. These can be constructed from subtrees of height h − 3 (with a larger cost than from subtrees of height h − 2), and so on, which defines the sequence of the modified versions of τ indexed by j, 1 ≤ j ≤ h − 2. At the final step j = h − 2, the Δ^(h−2)_{h−1}(v) remaining subtrees of height h − 1 are inserted as new subtrees under v. From now on, the number of subtrees of height h − 2 under v will not decrease. Indeed, decreasing it would mean that an internal node has been added between v and the root of a subtree of height h − 2. This would have the consequence of increasing by one unit the number of subtrees of height h − 1 in subtrees of height h, whose cost is (strictly) larger than adding a subtree of height h − 2 in all the subtrees of height h.
Consequently, we obtain a lower bound on the number of subtrees of height h − 2 in the NEST. We can reproduce the above reasoning to construct under v subtrees of height h − i, i from 2 to h − 1, from subtrees of smaller height, which defines a sequence γ^(h−i,j)_i of modified versions of τ, of size h − i + 1, and we get the analogous inequality (4). The tree returned by Algorithm 2 is self-nested and, by construction, its height profile saturates the inequalities (3) and (4) for all the possible values of h and i. In addition, we have shown that this tree can be obtained from τ by the allowed inserting operations. Since increasing by one unit the height profile at (h1, h2) has a (strictly) positive cost, this tree is thus the (unique) NEST of τ.
As seen previously, the number of iterations of the while loop at line 7 is bounded by the number of subtrees of height h2 < h1 available to construct a tree of height h1, i.e., by the outdegree of τ in the worst case, which gives the announced complexity.

NeST algorithm
This section is devoted to the calculation of the NeST, presented in Algorithm 3.

Proof. The proof follows the same reasoning as the proof of Proposition 4.7. First, one may remark that the number of subtrees of height h − 1 under a node of height h can only decrease through the allowed deleting operations. Instead of deleting a subtree of height h − 1, it is always less costly to decrease its height by one unit by deleting its root. However, this is possible only if this internal node has a single child, i.e., if ρ_τ(h − 1, h − 2) = 1 and ρ_τ(h − 1, i) = 0 for 0 ≤ i < h − 2. If this new tree of height h − 2 has to be deleted in the sequel, it will be done with the same global cost as by directly deleting the subtree of height h − 1. As a consequence, we obtain a first inequality (5) on the height profile of the NeST. From now on, the number of subtrees of height h − 2 under v will not increase, and we obtain a second inequality (6). We can repeat the previous reasoning and delete the root of subtrees of height h − 2 when possible rather than the whole structure, and so on for any height, which defines as before a sequence γ of modified versions of τ. The tree returned by Algorithm 3 saturates the inequalities (5) and (6) for all the possible values of h and i. Decreasing by one unit the height profile at (h1, h2) has a (strictly) positive cost. Thus this tree is the (unique) NeST of τ. The time-complexity is given by the size of the height profile array.

Random trees
The aim of this section is to illustrate the behavior of the NEST and of the NeST on a set of simulated random trees, regarding both the quality of the approximation and the computation time. We have simulated 3 000 random trees of sizes 10, 20, 30, 40, 50, 75, 100, 150, 200 and 250. For each tree, we have calculated the NEST and the NeST. The numbers of nodes of these approximations are displayed in Fig. 9. We can observe that the number of nodes of the NEST is very large relative to the size of the initial tree: approximately one thousand nodes on average for a tree of 150 nodes, that is to say an approximation error of 750 vertices. Remarkably, the NEST was never a better approximation than the NeST on the set of simulated trees.
The computation time required to compute the NEST or the NeST of one tree on a 2.8 GHz Intel Core i7 has also been estimated on the set of simulated trees and is presented in Fig. 10. As predicted by the theoretical complexities given in Propositions 4.7 and 4.8, the NeST algorithm requires less computation time than the NEST algorithm. As a consequence, the NeST provides a much better and faster approximation of the initial data than the NEST.

Structural analysis of a rice panicle
In light of [7], we propose to quantify the degree of self-nestedness of a tree τ by the following indicator based on the calculation of NEST(τ),

δ_NEST(τ) = 1 − D_Z(NEST(τ), τ) / #V(τ). (7)

In [7, eq. (6)], the degree of self-nestedness of a plant is defined as in (7) but normalizing by the number of nodes of the NEST and not by the size of the initial data, which prevents the indicator from being negative. In the present paper, we prefer normalizing by the number of nodes of τ so as to obtain the following comparable self-nestedness measure based on the calculation of NeST(τ),

δ_NeST(τ) = 1 − D_Z(NeST(τ), τ) / #V(τ) = #V(NeST(τ)) / #V(τ).
The main advantage of this normalization is that, if the NEST and the NeST offer equally good approximations, i.e., D Z (NEST(τ), τ) = D Z (NeST(τ), τ), then the degree of self-nestedness does not depend on the chosen approximation scheme, δ NEST (τ) = δ NeST (τ).
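Given the node counts of a tree and of its approximations, both indicators reduce to elementary ratios. A sketch of ours, under the normalization chosen in this paper, using D_Z(NEST(τ), τ) = #V(NEST(τ)) − #V(τ) and D_Z(NeST(τ), τ) = #V(τ) − #V(NeST(τ)):

```python
def delta_NEST(n_tau, n_nest):
    # 1 - D_Z(NEST, tau) / #V(tau) with D_Z = #V(NEST) - #V(tau);
    # may be negative, hence the truncation "delta_NEST v 0" in figures
    return 1 - (n_nest - n_tau) / n_tau

def delta_NeST(n_tau, n_nest_embedded):
    # #V(NeST) / #V(tau), always between 0 and 1
    return n_nest_embedded / n_tau
```

For a 100-node tree, a 150-node NEST gives δ_NEST = 0.5 and an 80-node NeST gives δ_NeST = 0.8; a 250-node NEST would give a negative δ_NEST, which motivates the truncation at 0.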
We propose to investigate the degree of structural self-similarity of the topological structure of the rice panicle studied in [7, 4.2 Analysis of a Real Plant] through these self-nested approximations. The rice panicle V1 is made of a main axis bearing a main inflorescence P1 and lateral systems Vi, 2 ≤ i ≤ 5, each composed of inflorescences Pj, 2 ≤ j ≤ 8 (see Fig. 11). We have computed the indicators of self-nestedness δ_NEST ∨ 0 and δ_NeST for each substructure composing the whole panicle (see Fig. 12). The numerical values and the shapes of these indicators are similar. However, δ_NeST is always greater than δ_NEST, in particular for the largest structures Vi. Based on a better approximation procedure, as highlighted in the previous section, the NeST better captures the self-nestedness of the rice panicle.