Determinization and Minimization of Automata for Nested Words Revisited

: We consider the problem of determinizing and minimizing automata for nested words in practice. For this we compile the nested regular expressions ( NREs ) from the usual XPath benchmark to nested word automata ( NWAs ). The determinization of these NWAs , however, fails to produce reasonably small automata. In the best case, huge deterministic NWAs are produced after few hours, even for relatively small NREs of the benchmark. We propose a different approach to the determinization of automata for nested words. For this, we introduce stepwise hedge automata ( SHA s) that generalize naturally on both (stepwise) tree automata and on ﬁnite word automata. We then show how to determinize SHA s, yielding reasonably small deterministic automata for the NREs from the XPath benchmark. The size of deterministic SHA s automata can be reduced further by a novel minimization algorithm for a subclass of SHA s. In order to understand why the new approach to determinization and minimization works so nicely, we investigate the relationship between NWAs and SHA s further. Clearly, deterministic SHA s can be compiled to deterministic NWAs in linear time, and conversely NWAs can be compiled to nondeterministic SHA s in polynomial time. Therefore, we can use SHA s as intermediates for determinizing NWAs , while avoiding the huge size increase with the usual determinization algorithm for NWAs . Notably, the NWAs obtained from the SHA s perform bottom-up and left-to-right computations only, but no top-down computations. This NWA behavior can be distinguished syntactically by the (weak) single-entry property, suggesting a close relationship between SHA s and single-entry NWAs . In particular, it turns out that the usual determinization algorithm for NWAs behaves well for single-entry NWAs , while it quickly explodes without the single-entry property. Furthermore, it is known that the class of deterministic multi-module single-entry NWAs enjoys unique minimization. The subclass of deterministic SHA s to which our novel minimization algorithm applies is different though, in that we do not impose multiple modules. As further optimizations for reducing the sizes of the constructed SHA s, we propose schema-based cleaning and symbolic representations based on apply-else rules that can be maintained by determinization. We implemented the optimizations and report the experimental results for the automata constructed for the XPathMark benchmark. a ) where n = 0, . . . ,9: size (#states). There is no schema-based cleaning. Our implementation of the minimization algorithm was applied to all dSHA with at most 200 states, since it failed for larger dSHA s. No minimization algorithm for subclasses of single-entry dNWAs was implemented.


Introduction
Nested words are hierarchical structures that are omnipresent in computer science. They were used to represent sequences of data trees, like XML or JSON documents, and to analyze the call structure of recursive programs. The idea of nested words is to generalize on both words and trees, resulting in sequences of unranked trees that are also known as hedges. Otherwise, nested words can be obtained by enriching Dyck words with internal letters, besides opening and closing parentheses. Furthermore, nested words are the elements of the least set containing internal letters from a given alphabet, triples consisting of an opening parenthesis, a nested word, and a closing parenthesis, and all sequences of nested words. Last but not least, nested words can be seen as words over an alphabet with internal letters, opening parentheses, and closing parentheses, under the conditions that the parenthesis are well nested, so that every opening parenthesis is properly closed and every closing parenthesis properly opened.
From the viewpoint of formal language theory, a natural question is how to lift the notions of finite automata and regular expressions, from words and trees to nested words, while preserving their well-known relationships. Nested word automata (NWAs) were heavily studied since the 1980s [1][2][3][4], under the name input-driven automata. They are the same as visibly pushdown automata [5], pushdown forest automata [6], and streaming tree automata [7]. NWAs can recognize the same languages of unranked trees as hedge automata [8], a generalization of tree automata for ranked trees [9]. NWAs are often defined as pushdown automata with visible stacks, meaning that exactly one symbol is pushed when reading an opening parenthesis, and exactly one symbol is popped when reading a closing parenthesis, while the stack is not used otherwise. Their main advantage is a powerful notion of determinism, generalizing both over bottom-up and top-down determinism of tree automata for ranked trees [2,3]. We note that general pushdown automata do not permit determinization in contrast.
Regular expressions for nested words were proposed more recently by Hosoya and Pierce [10] under the name of regular expression types. In the present article, we will call them nested regular expressions (NREs) instead. Independently, more complex notions of nested regular expressions were introduced [11,12] in order to deal with generalizations of nested words with dangling opening and closing parentheses, which are not of interest to us. It was already claimed in [10], that our simpler notion of NREs has the same expressiveness as hedge automata [8,9], which in turn have the same expressiveness as NWAs [3]. However, the question under which conditions NREs can be compiled to small deterministic NWAs has not been studied. For classes of NREs for which deterministic NWAs can be computed in polynomial time, we can decide language inclusion or equivalence in polynomial time too. For other classes, these problems may not be feasible since language inclusion for nondeterministic NWAs is EXP-complete.
Our concrete interest in the universality of deterministic NWAs is motivated by XML stream processing: we want to compute the certain answers of a navigational XPath query on an XML stream [13,14], i.e., those elements that are selected in all possible futures of the stream. Whether an answer is certain is computationally hard, even for tiny syntactic fragments of navigational XPath [14,15], but can be done in polynomial time for queries defined by deterministic NWAs [16]. A natural question is, therefore, whether it is possible to compile navigational XPath queries as in the usual benchmark [17] to deterministic NWAs of reasonable size. Unfortunately, the existing compilers fail to do so [18], as they are based on NWA determinization for dealing with disjunction, negation, and recursive steps. Thereby, they produce huge deterministic automata even for very simple navigational XPath queries from the benchmark, or do not terminate after some hours.
In this article, we consider NREs for defining queries on nested words. For benchmarking with realistic example, we consider the navigational XPATH queries in the XPathMark benchmark with only forwards axis, that we compiled to NREs of the same size up to a constant factor. The question is then whether these NREs can be compiled to reasonably small deterministic NWAs.
As a first approach, we distinguish a subclass of "deterministic" NREs that can be compiled in polynomial time to deterministic NWAs by generalizing on Glushkov's construction of deterministic finite-state automata (DFAs) from "deterministic" regular expressions [19,20]. However, the NREs obtained by compilation from navigational XPath queries are rarely deterministic, so neither are the NWAs compiled from them. Moreover, as we cannot apply NWA determinization to them in practice as argued above, this first approach has a much too low coverage to reach the objective. Therefore, we will report it only at the end in Section 9.
For our second approach, we propose a novel variant of automata for nested words that we call stepwise hedge automata (SHAs). Even though motivated by the wish to create deterministic automata for the NREs of our benchmark, they are of general interest: for the class of deterministic SHAs to obtain unique minimization. We could have required the existence of multiple modules as for multi-module single-entry NWAs. Instead, we restrict ourselves to deterministic SHAs for which the initial states for trees and hedges coincide. We then show that minimization for such SHAs can be reduced to the minimization of tree automata up to a novel encoding of hedges to binary trees.
We implemented all four algorithms for compiling our benchmark NREs to deterministic NWAs and report the experimental results. We have also implemented the novel minimization algorithm for SHAs with equal tree and hedge initial states, and used it in our experiments. We propose two further optimization methods for reducing the sizes of the constructed automata.
First, we introduce schema-based cleaning both for SHAs and NWAs. In our application, the schema expresses the XML data model, stating that hedges must encode valid XML documents. More generally, an automaton A can be cleaned relative to an automaton S for the schema, if the language of interest is the intersection L(A) ∩ L(S) rather than L(A) itself. The idea of schema-based cleaning is to keep only those transition rules of A that are used to recognize some hedge of L(S). These transition rules can be computed from the product of A and S. Note that schema-based cleaning may change the language of the automaton. Only the intersection L(A) ∩ L(S) is preserved, not necessarily L(A).
Second, we propose a symbolic representations for SHAs based on apply-else rules. They help to represent more compactly a large number of apply rules produced by the determinization of SHAs. Before compiling SHAs to NWAs, however, we need to eliminate the apply-else rules. This is because we have not developed analogous symbolic representations for NWAs so far. A second limitation is that we have not implemented any minimization algorithm for NWAs at the time being.
The main improvement of this journal article compared with the conference version [25] is the addition of the minimization algorithm for the subclass of SHAs with equal tree and hedge initial states. Furthermore, we added the idea of schema-based cleaning and the symbolic representations for SHAs by apply-else rules. The experimental results were enhanced with minimization, symbolic representations of rules, and schema-based cleaning. All the nested regular expressions generated for the XPathMark benchmark queries that we consider, as well as their corresponding automata-when we could produce them-can be found at http://researchers.lille.inria.fr/niehren/complementary-material (accessed on 1 February 2021).

Related Work.
In the present article, we restrict ourselves to nested words over signatures with a single opening parenthesis, a single closing parenthesis, and possibly many internal letters a, b. This permits us to simplify the presentation of nested regular expressions, the notions of NWAs and SHAs, their forth and back compilers, as well as the determinization algorithms. Note that any multi-module NWA for such signatures must have exactly 2 modules. From an application perspective, multiple parentheses can be encoded by using internal letters, that is a named opening parenthesis a by the word a · and a named closing parenthesis b by the word b · . When encoding XML documents as nested words, such some encoding is needed anyway in order to deal with the complex information in XML tags, and also to provide symbolic representations with else rules that are able to deal with infinite signatures.
For the minimization of deterministic NWAs, general signatures with multiple parentheses raise additional problems. Chervet and Walukiewicz [23] solved such problems by reducing the minimization for expanded CDAs to the minimization of CDAs. Gauwin, Muscholl, and Raskin [24] showed that the minimization for deterministic NWAs is NPhard in the case with general signatures. Their approach is based on a reduction from the problem of minimal immersion for sequences of DFAs, for which they construct NWAs with an unbounded number of opening parenthesis and an unbounded number of entry states. Weak single-entry NWAs in our setting do not permit this. Neither do NWAs over fixed general signatures with a finite number of opening parenthesis. Navigational XPath queries on XML documents can be formalized in the language Cor-eXPath [26], or more generally by nested regular path queries [27] on data trees. Nested regular path queries were introduced earlier under the name of the propositional dynamic logic (PDL) in the 1970s [28], where they are applied to labeled graphs that generalize on data trees.
As certain query answering for XPath was considered difficult, the currently existing approaches to XPath query evaluation on XML streams [13,18] either approximate certain query answers based on nondeterministic machines or restrict the queries so that answer certainty can be decided without latency [15,29]. This also holds for recent streaming algorithms on words without nesting in the context of complex event processing [30].

Nested Words
Nested words are words with parentheses that are well nested. They can be identified with hedges, that is, sequences of internal symbols and unranked trees.
Nested words are constructed with opening and closing parentheses, respectively, and . An unranked alphabet Σ is a possibly infinite set of so-called "internal" symbols, that does not contain the two parentheses. The set of nested words over Σ is denoted N Σ and is defined by the following abstract syntax: The empty nested word is denoted by ε and assumed to be the neutral element of the composition operator ε · h = h = h · ε, which furthermore is assumed to be associative, i.e., Nested words can be identified with hedges, i.e., words of trees and internal symbols. Seen as a graph, the inner nodes are labeled by the tree constructor and the leaves by symbols in Σ or the tree constructor. For instance, a · b · ε · c · d · ε corresponds to the hedge in Figure 1. A nested word of type tree has the form h . a b c d Figure 1. Nested word a · b · ε · c · d · ε seen as a graph.
Variants. Our notion of nested words accepts only well-nested words without dangling opening or closing parentheses in contrast to others [3,5]. This will lead to simpler notion of regular expressions, avoiding the more complex operators as with visibly rational expressions [12,31]. A less important difference is that we do not support labeled parentheses.
Labeled unranked trees. Labeled parentheses can be simulated by using internal letters. For instance, the labeled tree a(b(), c()) can be represented by the nested word of type tree a · b · c . In this way, the labeled tree a() is represented by the nested word a , which is of type tree (while the internal letter a alone is not). Unranked sequences of subtrees, often called hedges and sometimes forests, can be composed by using the sequence operator.
were introduced earlier under the name of the propositional dynamic logic (PDL) in the seventies [28], 185 where they are applied to labeled graphs, that generalize on data trees. Since certain query answering for XPath was considered difficult, the currently existing approaches to XPath query evaluation on XML streams [13,18] either approximate certain query answers based on nondeterministic machines or restrict the queries so that answer certainty can be decided without latency [15,29]. This also holds for recent streaming algorithms on words without 190 nesting in the context of complex event processing [30].

Nested Words
Nested words are words with parentheses that are well-nested. They can be identified with hedges, that is sequences of internal symbols and unranked trees.
Nested words are constructed with opening and closing parentheses, respectively and . An unranked alphabet Σ is a possibly infinite set of so called "internal" symbols, that does not contain the two parentheses. The set of nested words over Σ is denoted N Σ and is defined by the following abstract syntax: The empty nested word is denoted by ε and assumed to be the neutral element of the composition 195 operator ε · h = h = h · ε, which furthermore is assumed to be associative, i.e., Nested words can be identified with hedges, i.e. words of trees and internal symbols. Seen as a graph, the inner nodes are labeled by the tree constructor and the leaves by symbols in Σ or the tree constructor. For instance a · b · ε · c · d · ε corresponds to the hedge in Figure 1. A nested word 200 of type tree has the form h . Variants. Our notion of nested words accepts only well-nested words without dangling opening or closing parentheses in contrast to others [3,5]. This will lead to simpler notion of regular expressions, avoiding the more complex operators as with visibly rational expressions [12,31]. A less important difference is that we do not support labeled parentheses.  doc· elem · site· elem · closed_auctions· elem · closed_auction· elem · date · 0 · 1 · / · 0 · 1 · / · 2 · 0 · 0 · 0 elem · keyword · w · i · n · e · duced earlier under the name of the propositional dynamic logic (PDL) in the seventies [28], y are applied to labeled graphs, that generalize on data trees. certain query answering for XPath was considered difficult, the currently existing es to XPath query evaluation on XML streams [13,18] either approximate certain query ased on nondeterministic machines or restrict the queries so that answer certainty can be ithout latency [15,29]. This also holds for recent streaming algorithms on words without the context of complex event processing [30].
Words ed words are words with parentheses that are well-nested. They can be identified with at is sequences of internal symbols and unranked trees. ed words are constructed with opening and closing parentheses, respectively and . An alphabet Σ is a possibly infinite set of so called "internal" symbols, that does not contain arentheses. The set of nested words over Σ is denoted N Σ and is defined by the following yntax: where a ∈ Σ y nested word is denoted by ε and assumed to be the neutral element of the composition ε · h = h = h · ε, which furthermore is assumed to be associative, i.e., h 1 · (h 2 · h 3 ) = h 3 . ed words can be identified with hedges, i.e. words of trees and internal symbols. Seen as a inner nodes are labeled by the tree constructor and the leaves by symbols in Σ or the tree r. For instance a · b · ε · c · d · ε corresponds to the hedge in Figure 1. A nested word e has the form h . Our notion of nested words accepts only well-nested words without dangling opening or rentheses in contrast to others [3,5]. This will lead to simpler notion of regular expressions, the more complex operators as with visibly rational expressions [12,31]. A less important is that we do not support labeled parentheses. We use the names of XML elements as labels of the nested word, as well as the letters of UTF8 for the string data values. Further labels such as doc and elem are added to express the types of the XML data model document and element, respectively.
When it comes to querying for nodes in XML documents, we will be interested in nested words encoding XML documents, in which a unique node is marked. We will use the label x to mark the selected node and the label ¬x for all others. When marking the date in the XML document of Figure 2, we obtain the nested word in doc · ¬x· elem · site · ¬x· elem · closed_auctions · ¬x· elem · closed_auction · ¬x· elem · date · x · 0 · 1 · / · 0 · 1 · / · 2 · 0 · 0 · 0 elem · keyword · ¬x · w · i · n · e · Labeled unranked trees. Labeled parentheses can be simulated by using internal letters. For instance, the labeled tree a(b(), c()) can be represented by the nested word of type tree a · b · c . In this way, the labeled tree a() is represented by the nested word a which is of type tree (while the internal letter a alone is not). Unranked sequences of subtrees, often called hedges and sometimes forests, can be composed by using the sequence operator.

210
XML Documents. Our notion of nested words is sufficiently powerful to express general XML documents. An example of an XML document is given in Figure 2 and the representing nested word in Figure 3. We use the names of XML elements as labels of the nested word, as well as the letters of UTF8 for the string data values. Further labels such as doc and elem are added to express the types of the XML 215 data model document and element respectively.
When it comes to querying for nodes in XML documents, we will be interested in nested words encoding XML documents, in which a unique node is marked. We will use the label x to mark the selected node and the label ¬x for all others. When marking the date in the XML document of Figure 2, we obtain the nested word in Figure 4.

Nested Regular Expressions
We present nested regular expressions (NREs), which were introduced under the name regular expression types in the context of XDuce [10] up to minor details. Note that similar nested regular expressions for ranked trees are folklore in the context of tree automata [32].

Syntax and Semantics
Let the alphabet Σ be a set. An NRE over Σ is a term describing a language of nested words. It has the following abstract syntax where a ∈ Σ: The µa.E expressions are the same as in µ-calculus [33], except that we restrict them such that all occurrences of a in E are nested below parentheses. Otherwise, nonregular languages could be defined such as with µa. (b · a · c + ε) whose language would be {b n · c n | n ∈ N}. We also forbid intersections and complements in expression µa.E on all paths between the µa-operator and the occurrences of a in E that are bound by this operator. The expressions µa.E allow for vertical recursion, while the expressions with the Kleene star E * support horizontal recursion.
Our syntax allows for conjunctions E&E and negations E, which are well known to not add expressiveness if Σ is finite. They are still relevant from the viewpoints of modeling, and for the treatment of infinite signatures. This comes at the price of increasing the complexity, as for the well-known case of words [34].
For infinite signatures, we can define for any finite subset Σ of labels the language of single-letter words Σ \ Σ by some NRE. This can be seen as follows. If Σ = ∅, then the expression _ does the job: it matches exactly the set of all labels in Σ. Moreover, if Σ is nonempty then we can use negation. For instance, if Σ = {a, b} then the expression a + b&_ describes the language Σ \ Σ .
The sets of free and bound letters fn(E) and bn(E) are defined as usual. The only binder µa.E binds the symbol a with scope E. Note that f n(_) = ∅.
There are three differences with respect to the regular expression types from [10]. First, our NREs treat labels as internal symbols instead of labels of parentheses. Second, they provide recursion through the µ-operator instead of using recursive equation systems. Third, conjunctions and general negations are not considered there.
Any NRE E describes a language L(E) of nested words that we define by induction on the structure of E as follows: N Σ is the set of nested words over Σ, as defined in Section 2.
For all expressions-E, E 1 , and E 2 , the notation E[E 1 /E 2 ] stands for the expression E where all the occurrences of E 1 have been replaced by E 2 . The semantics of a µ-operator is then defined using the shortcuts µ 0 a.E = E[a/∅] and µ n a.E = E[a/µ n−1 a.E] for all n ≥ 1. In particular L(µa._) = L(_) = Σ, so that a ∈ L(µa._). The semantics of the complement expression L(E) is the complement of L(E) in the set of all nested words, that is N Σ \ L(E).

XPATH Example
We now show how to express navigational XPATH queries by NREs that are restricted to forward axis. The idea is to adapt the spirit of a generate-and-test algorithm for query answering. The generation produces a nested word from XML documents by guessing a single node and marking it by x. This node is a candidate for a query answer that is to be tested. The test is done by a NRE.
For expressing XPATH queries with child and descendant-or-self axes we will use the following NREs where a ∈ fn(E): For instance, consider the XPATH query A5 from the XPathMark benchmark [17]: /site/closed_auctions/closed_auction[descendant::keyword]/date Applied to the above XML document, it selects all date children of closed_auctions nodes that contain at least one keyword descendant. Query A5 can be compiled to the following NRE, which will accept the nested word in Figure 4 in particular: The only label that the expression _ may match on a document that is properly annotated with the variable x will be the letter ¬x ∈ Σ. The label x is annotated to the marked node, which is tested for being selected by the query. The label ¬x is annotated to all nodes except a unique x-marked node.
Note also that the µ-operator of the ch + (. . .)-expression expresses the recursion of the descendant axis. Furthermore, the conjunction permits us to connect the main path of A5 with its only filter.

XPath Benchmark
For testing NREs, we rely on the usual XPathMark benchmark [17]. We restrict ourselves to navigational path queries with forward axis: child, descendant, and followingsibling. We notice that the following axis is excluded in contrast to following-sibling, as following is not strictly forwards. We can also admit path composition and filters with conjunction, disjunction, and negation.
The XPATH queries of the benchmark satisfying these restrictions are the queries A1, . . ., A8 and B3 given in Figure 5. We developed a more general compiler from navigational forward XPATH queries to NREs, which yields the NREs in Figure A1 of the Appendix B for the benchmark XPATH queries. The NREs for A1-A3 do have neither conjunctions nor negations, while the queries A4-A8 contain filters, which are mapped to conjunctions in NREs. The compiler uses the µ-operator to capture the recursion of descendant axis as in A2, A3, and A5. Furthermore, nondeterminism is introduced by disjunctions in filters as in A7 and A8. Conjunction in filters appears in A6 which is mapped to conjunctions in NREs too. A detailed description of this compiler is not in the scope of the present article though.  Note also that the µ-operator of the ch + (. . .)-expression expresses the recursion of the descendant axis. Furthermore, the conjunction permits us to connect the main path of A5 with its only filter.

265
For testing NREs, we rely on the usual XPathMark benchmark [17]. We restrict ourselves to navigational path queries with forward axis: child, descendant, and following-sibling. We notice that the following axis is excluded in contrast to following-sibling, since following is not strictly forwards. We can also admit path composition and filters with conjunction, disjunction and negation.
The XPATH queries of the benchmark satisfying these restrictions are the queries A1, . . . , A8 and 270 B3 given in Figure 5. We developed a more general compiler from navigational forward XPATH queries to NREs, which yields the NREs in Figure A1 of the appendix for the benchmark XPATH queries. The NREs for A1-A3 do have neither conjunctions nor negations, while the queries A4-A8 contain filters,

Nested Word Automata
Nested word automata (NWAs) are pushdown automata reading nested words, whose stacks are visible: they push a single stack symbol when reading an opening parenthesis, pop a single stack symbol when reading a closing parenthesis, and do not alter or inspect the stack otherwise.

Definition 1.
An NWA is a tuple A = (Q h , Q t , Σ, Γ, ∆, I, F) consisting of a possibly infinite set Σ of internal symbols; finite sets Q h and Q t of states of type hedge and tree, respectively; sets of initial and final states I, F ⊆ Q h ; a finite set Γ of stack symbols; and a finite set ∆ of transition rules of the forms: hedge rules Our NWAs are symbolic, in that they come with else rules, i.e., elements of (q, q ) ∈ _ ∆ that we will denote by q _ − → q , for dealing with large or infinite alphabets. An example for an NWA is given in a graphical syntax in Figure 6. Tree states are drawn in circles that are filled in light gray q , while hedge states are in unfilled circles q . Initial states are drawn as → q and final states as q . Hedge rules that have the form Our notion of NWAs supports factorization in the spirit of the work in [35]. It is obtained by distinguishing two types of states, q ∈ Q h and p ∈ Q t , and adding explicit type coercion rules q −→ p. Semantically, both kinds of states could be merged when replacing the type coercion rules by the epsilon rule q ε − → p, but at the cost of introducing additional nondeterminism. This may lead to quadratically larger deterministic automata, as we will illustrate at the NWA in Figure 20.
The language of nested words between two states q 1 , q 2 ∈ Q h is defined as the least language such that The language of the NWA is then L(A) = q 1 ∈I,q 2 ∈F L q 1 ,q 2 (∆).

Determinization of NWAs
Determinization for NWAs was first studied by von Braunmühl and Verbeek [2] in the 1980s, where NWAs are named input-driven pushdown automata. We notice that the determinization algorithm was published only in the journal version of this paper, but not in the conference version. Later on, the same algorithm was rediscovered in the context of visibly pushdown automata and republished for nested word automata.

Definition 2. An NWA A is called deterministic or equivalently a dNWA if
• I contains at most one element; • there is no epsilon rule, i.e., ε ∆ = ∅, • a ∆ and _ ∆ are partial functions from Q h to Q h for all a ∈ Σ, and T ∆ is a partial function from Q h to Q t ; • for all q ∈ Q h and γ ∈ Γ there exists a most one q ∈ Q h such that q ∈ ∆ γ ; and • ∆ γ is a partial function from Q t to Q h for all γ ∈ Γ.
Many of our results are based on the determinization algorithm going back to von Braunmühl and Verbeek. For self-containedness, we recall the version of this algorithm that we will use in the Appendix A. For illustrations, the determinization of the NWA in Figure 6 is also presented here too, see Figure A2. It has size 271 while the nondeterministic NWA has size 39 (12 states + 2 letters + 3 stack symbols + 22 rules). The blow-up is even worse in general as our experimental results will show and as noticed earlier by [18].

Multi-Module NWAs
Multi-module NWAs will play a prominent role for our NWA constructions and are relevant for minimization [23]. For signatures with a single opening parenthesis, each multimodule NWAs has exactly two modules, one for the top level and one for the nested level.
We can define multi-module NWAs based on the natural notion of homomorphisms for NWAs. A homomorphism from an NWA A to an NWA A with the same signature is a triple of functions (α h : Q h → Q h , α t : Q t → Q t , β : Γ → Γ ) that maps all concepts of A to the corresponding concepts of A . These concepts are hedge initial states, final states, opening, closing, internal, and tree transitions. We do not enforce the preservation of epsilon rules by homomorphisms.

Definition 3.
A multi-module NWA A is an NWA for which there exists a homomorphism from A to the NWA T in Figure 7. The NWA T evaluates all top level positions of a nested word to state 0: all those positions that are not between parentheses. All nested positions are evaluated to state 1. The homomorphism of a multi-module NWA A to T thus partitions the states of A between those that can be assigned to top level positions, and the others that can be assigned to nested positions.

Compilation of NREs to NWAs
We next discuss a compiler from NREs E to NWAs nwa(E). This compiler extends on the McNaughton-Yamada-Thompson algorithm [36] for regular expressions, which introduces epsilon edges for constructing the automata of composition E · E .

Theorem 1. For any NRE E, we can construct an NWA A such that L(A) = L(E). If E contains neither conjunctions nor negations, then the construction is in time O(|E|).
Proof sketch. Conjunctions E&E are compiled to products of automata, so repeated conjunctions may lead to an exponential blow up. Negations E are computed by complementing automata based on determinization. Each complementation may lead to an exponential blow-up, so when this is repeated, the construction may become non-elementary.
For expressions without conjunction and negation, no such blow-up may arise. As stated by the theorem, we have to show that expressions can be compiled in linear time.
Case E = E · E : We use the McNaughton-Yamada-Thompson algorithm for composing the NWAs of nwa(E ) and nwa(E ).
and Γ be, respectively, the set of hedge states, tree states, and stack symbols of nwa(E ). We consider new hedge states q i and q f that are not in Q h , a new tree state p not in Q t and a new stack symbol γ not in Γ . Then, nwa(E) is constructed by adding to nwa(E ) opening rules q i ↓γ − −− q for all the initial states q of nwa(E ), tree rules q −→ p for all the final states q of nwa(E ) and a closing rule p ↑γ − −− q f . Furthermore, we set q i as the only initial state of nwa(E), and q f as its sole final state.
Case E = µa.E : Special care has to be given to repeat expression µa.E. First of all, the naive compilation approach for these expression turns out to be wrong. Second, fixing the problem in the simplest possible manner does not lead to a linear time algorithm.
Note that we can assume w.l.o.g. that a occurs at most once in E by using the golden lemma of the µ calculus [37], stating for all names a 1 , . . . , a n and expressions E in which a 1 , . . . , a n can appear free that µa 1 . . . . .µa n . E ≡ µa.E [a 1 /a, . . . , a n /a].
Our construction guarantees that all transitions of the form q a − → q in nwa(E) will start with the same state q. The wrong naive construction would remove the transitions q a − → q from nwa(E) and add ε-rules from q to all the initial states of nwa(E), and from all final states of nwa(E) to q . Unfortunately, the construction is not correct. For illustration, we consider the NRE E = µa. a * . The reader should be warned that constructing an NWA for E is less trivial than it might seem at first sight. One has to start from the NWA for a * which is given in Figure 8. Simply adding epsilon edges to capture the operator µa will not work though. It will lead to the wrong automaton in Figure 9. This automaton will wrongly accept the hedge , as this hedge does not belong to L(E).
If the NWA for E is multi-module, then the naive construction of compiling µa.E can be made correct. Therefore, the simplest fix is to make the NWA multi-moduled, before applying the naive construction. This can be achieved by typing the states of the automaton, by states of the NWA T in Figure 7. The added types yield the homomorphism of the constructed automaton to T .
The naive algorithm is then adapted as follows. Let P be the multi-module NWA obtained from the product of nwa(E) and T . Note that we keep only the accessible top level states (type 0), but all nested states (type 1). In our example, this yields the NWA in Figure 10. We then remove transition (q, 1) a − → (q , 1) and add ε-rules from state (q, 1) to all states in I × {1}, and from all states in F × {1} to (q , 2), where I and F are, respectively, the set of initial and final states of nwa(E). Then, P recognizes L(µa.E). The result obtained in the example is shown in Figure 11.
The algorithm described so far makes the NWA multi-moduled before compiling a µ-operator. For this, two copies of all states are introduced. This, however, could lead to an exponential construction if multiple µ-operators are nested. This problem can be avoided by preserving multi-moduledness as an invariant. Whenever a new state is created, it is created twice: once for the top level and once for the nested level. This information is maintained by typing the states, so that no further copies of the same state are produced later on. Figure 8. Automaton for the a * expression. Figure 9. Wrong naive construction for µa. a * . Figure 10. The multi-module NWA for a * . Figure 11. The correctly adapted naive construction for µa. a * .
We omit the correctness proof of this construction.

Experimental Results Starting with the NWA Compiler
In the first two column of Figure 12, we report the sizes of the NWAs obtained from NREs by our compiler, and the size of the deterministic NWAs produced thereof. For each automaton, we give its total size and in parentheses the number of states.

Experimental Results Starting with the NWA Compiler
In the first two column of Figure 12, we report the sizes of the NWAs obtained from NREs by our compiler, and the size of the deterministic NWAs produced thereof. For each automaton we give its total size and in parentheses the number of states.

385
The sizes of the nondeterministic NWAs produced by the compiler for the NREs for A1-A8 and B3 are given in column nwa(.) of Figure 12. Note that the NWAs are cleaned so that only accessible and co-accessible states remain. The sizes of the nondeterministic NWAs are acceptable for all NREs, except for A8 for which the NWA has more than 3000 states and an overall size greater than 10000. This can be partially explained by the fact that the NRE for A8 contains 3 conjunctions (1 for the filter 390 and 2 for the conjunctions in the filter). But still, the number of states remains surprising. The sizes of the nondeterministic NWAs produced by the compiler for the NREs for A1-A8 and B3 are given in column nwa(.) of Figure 12. Note that the NWAs are cleaned so that only accessible and co-accessible states remain. The sizes of the nondeterministic NWAs are acceptable for all NREs, except for A8, for which the NWA has more than 3000 states and an overall size greater than 10, 000. This can be partially explained by the fact that the NRE for A8 contains three conjunctions (one for the filter and two for the conjunctions in the filter). Still, the number of states remains surprising.
The determinized NWAs are given in column det(nwa(.)). It turns out that only A2 and A3 could be determinized successfully with some few hours of computation time on a standard laptop. However, even in the successful cases, the resulting deterministic NWAs are simply huge. This confirms similar problems first noticed in [18] and not solved since then.
The remaining columns of Figure 12 based on the back-and-forth compiler from SHAs to NWAs from the following Section 6. They show that better determinization algorithms can indeed be obtained, yielding NWAs of acceptable size for all benchmark queries, with the exception of A8. The idea of det(nwa(step(nwa(.))) is to compile the NWAs obtained from the NREs to stepwise hedge automata and back before applying the above algorithm for NWAs. This might be surprising, as this determinization algorithm failed for the original NWAs, while it now proves successful on the forth-and-back transformed NWAs.

Stepwise Hedge Automata
We propose SHAs as an extension of stepwise tree automata [21] that allows to recognize not only unranked trees but also hedges. We avoid more classical hedge automata from [9] that were already introduced in 1967 by Thatcher [8], as their notion of determinism is problematic. For instance, it makes unique minimization fail [38] and universality hard.
Our notion of SHAs will be symbolic in using else rules, and factorized as in [35]: there are two types of states for hedges and trees and an operator for explicit type coercion. We also propose a novel treatment of internal letters inspired by nested word automata, so that SHAs generalize both on stepwise tree automata and on NFAs.

Definition 4.
A SHA is a tuple A = (Q h , Q t , Σ, ∆, I, F) such that Q t and Q h are finite sets of states of two types: t for tree and h for hedge, respectively; Σ an alphabet of internal letters (that may be infinite); I, F ⊆ Q h are subsets of hedge initial and final states, respectively; and ∆ is a finite set of transition rules such that for all q ∈ Q t and a ∈ Σ: An example for a SHA is given in graphical syntax in Figure 13. It recognizes all hedges that are either just a or b or contain some tree node that contains either just a or b. In the graphical syntax, the states of type tree q ∈ Q t are drawn in circles filled in light gray q , while the states of type hedge q ∈ Q h are drawn in unfilled circles q . The right part of the graph is an NFA which uses tree states as additional edge labels, while the left part is a stepwise tree automaton that defines the tree languages of these tree states. Let ∆ h be the restriction of ∆ to the hedge rules. Then, (Q h , Σ Q t , ∆ h , I, F) is a standard NFA with ε-rules, which is symbolic [39] in providing else rules for dealing with large or infinite alphabets in addition. Therefore, we denote the hedge initial states q ∈ I by h − → q and the final states q ∈ F by q . A rule with an internal letter (q 1 , q 2 ) ∈ a ∆ is denoted by q 1 a − → q 2 ∈ ∆ stating that a hedge in state q 1 can be extended by the internal letter a leading to a hedge in state q 2 . Similarly, an epsilon rule (q 1 , q 2 ) ∈ ε ∆ is denoted by q 1 ε − → q 2 , and an else rule (q 1 , q 2 ) ∈ _ ∆ is denoted by q 1 _ − → q 2 . In the same spirit, a hedge rule (q 1 , q 2 ) ∈ q ∆ -also called apply rule-is denoted by q 1 q − −− q 2 ∈ ∆, stating that a hedge in state q 1 can be extended by a tree in state q leading to a hedge in state q 2 .
A tree initial state q ∈ ∆ is graphically denoted by t − → q and a tree final rule (q 1 , q 2 ) ∈ T ∆ by q 1 −→ q 2 . Intuitively, a tree h can be evaluated to state q if h can be evaluated starting with some tree initial state q 1 ∈ ∆ to some state q 2 such that q 2 −→ q ∈ ∆. More formally, the hedge languages L q 1 ,q 2 (A) between any two hedge states q 1 , q 2 ∈ Q h are defined as This definition is mutually recursive with the definition of the tree languages L q (A) of all tree states q ∈ Q t : The hedge language L(A) that is recognized by the automaton is

Determinization of SHAs
We formalize the notion of determinism for stepwise hedge automata and show how determinization works.

Definition 5.
A SHA (Q h , Q t , Σ, ∆, I, F) is deterministic or equivalently a dSHA, if it satisfies the following conditions: • I contains at most one element; • ∆ contains at most one element; • there is no epsilon transition, i.e., ε ∆ = ∅; • a ∆ , q ∆ , _ ∆ are partial functions from Q h to Q h for all a ∈ Σ and q ∈ Q t ; and • T ∆ is a partial function from Q h to Q t .

Proposition 2.
A SHA of size n can be made deterministic in time O(2 n ) while preserving the hedge language.
Proof. The determinization procedure for SHAs combines the determinization algorithms of word and tree automata in the natural manner, while eliminating epsilon transitions. Let ε ∆ * be the reflexive and transitive closure of ε ∆ , and for any subset Q ⊆ Q h ∪ Q t let ε ∆ * (Q) = q∈Q ε ∆ * (q). Given a SHA A = (Q h , Q t , Σ, ∆, I, F), we define an equivalent There is a unique tree initial state in ∆ det = {ε ∆ * ( ∆ )} and no ε-rule, that is, ε ∆ det = ∅. The inference rules in Figure 14 define the missing part of ∆ det . We can show for all Q 1 , Q 2 ⊆ Q h and P ⊆ Q t that L Q 1 ,Q 2 (det(S)) = q 1 ∈Q 1 ,q 2 ∈Q 2 L q 1 ,q 2 (S) so that L P (det(S)) = q ∈Q L q (S). Therefore, L(det(S)) = Q ∈F det L I,Q (det(S)) and thus L(det(S)) = q 1 ∈I,q 2 ∈F L q 1 ,q 2 (S) = L(S).
For illustration, the deterministic SHA in Figure 15 is obtained by determinization of the SHA in Figure 13

Compilation of NREs to SHAs
As for NWAs, we introduce the notion of multi-module SHAs for which the sets of hedge states are partitioned between those that can evaluate top level positions and those to which nested positions are assigned. Therefore, multi-module SHAs will have exactly two modules too. For instance, consider the multi-module SHA in Figure 13. The states of module for the top level are Q 0 = {1, 2, 3, 4, 5, 6, 12}. The others belong to the module for the nested level.
Any NRE E can be compiled to a multi-module SHA step(E) = (Q h , Q t , Σ, ∆, I, F) such that Q t = {E | E = E subexpression of E} and L t (E ) = L(E ) for all tree states E ∈ Q t . The SHA step(E) can be partitioned into disjoint SHAs for all E ∈ Q t and ∆ top = ∅. Note that the transitions relation ∆ is decomposed thereby into independent connected components. The automaton A top can be identified with an NFA with signature Σ ∪ Q t given that it has no tree initial states. The automata A E are stepwise tree automata that recognize the tree language L(E ) when taking E as final state. For this, they may have tree initial states, but will not have any initial nor final states.

Theorem 2. For any NRE E, we can construct a SHA A such that L(A) = L(E). If E contains neither conjunctions nor negations, then the construction is in time O(|E|).
Proof sketch. For the case of expressions with conjunctions or negations, the construction is analogous to the way it is done for NWAs. We next sketch the construction of SHAs for expressions without conjunction and negation.
Case E = E · E : We use McNaughton-Yamada-Thompson algorithm [36] for composing the multi-module NFAs of step(E ) and step(E ). The stepwise tree automata A E of the subexpressions of type tree are preserved. For succinctness, if some subexpression E occurs more than once, then only a single copy of A E is kept. References to states of the removed copy should be renamed to their equivalent counterparts.
Case E = E : We construct step(E) from step(E ). The initial states of step(E ) are turned into tree initial states. We then add a new tree state E and connect it to all final states of step(E ) by a tree final rule q −→ E . Furthermore, the previously final state q becomes non final. Again by the golden lemma of the µ-calculus, we can assume w.l.o.g. that a occurs at most once in E . By using ε-rules, we can preserve the invariant that there will be at most one pair (q, q ) such that q a − → q in step(E ). Furthermore, these transitions cannot be on top level, given that the occurrence of a in E must be nested below parentheses. The automaton step(E) is obtained from step(E ) by first copying the top level NFA of step(E ), as in Figure 16. We thus obtain two versions for each state of the top level NFA of step(E ): one referred to as the top level copy-q 0,0 and q 3,0 in Figure 16, and another one as the nested levelq 0,1 and q 3,1 in Figure 16. Only top level states may be initial or final. Then, we add ε-rules from q to the nested states that correspond to the initial states of step(E ), and from the nested states corresponding to the final states of step(E ) to q . Finally we remove the rule q a − → q . The resulting automaton is shown in Figure 17.
Note that every transition added for a state-top level or nested-in a subsequent step of the construction-except the ε-rules added for µ-expressions-must also be added for its copy.
The construction is correct as the µ-bound name a is nested below parenthesis in E . Therefore, it can be shown that the ε-edges introduced cannot be used to produce unwanted order in successful runs. Maintaining this invariant in polynomial time requires an additional argument. Instead of copying the top level parts of subexpressions, each state is introduced twice during the construction: one version for nesting, and another one for being part of top level parts. This way the size of the automaton is not doubled at each step, but only once. We omit the correctness proof of this construction.
Unlike NWAs, one cannot preserve the determinism of the expressions of nregexp(ch, T) in SHAs, even with Glushkov-like constructions. For instance, for the deterministic NRE a 1 · a 2 · . . . · a n . . . , one would have an SHA having a tree initial state for each of the a i . . . subtree, implying nondeterminism.

Experimental Results Starting with the SHA Compiler
In the first two columns of Figure 18, we report the size of the SHAs obtained from NREs by our compiler, and the size of the deterministic SHAs produced thereof. step(.) det(step(.)) nwa(det(step(.))) nwa(step(.)) det(nwa(step(.)))  The SHA compiler yields automata of acceptable size from the NREs of all benchmark queries. These sizes are given in the first column step(.) of Figure 18. This even holds for A8, in contrast to the case where the produced SHA has overall size 1106 and 267 states.
The determinization of the SHAs in the second column det(step(.)) even yields smaller automata in all cases. For A8, we obtain a deterministic stepwise automaton of overall size 749 and with 123 states. This might be surprising, in that the determinization algorithm may lead to an exponential blow-up in the worst case. However, it may also clean the automaton, keeping only accessible sets of states. This is what seems to happen systematically on the benchmark with the exception of A2, where the size goes up by a factor of four and A5 where the size doubles. For A2 the number of states grows by one third, while for A5 it decreases by one third.
Based on the back-and-forth compiler from SHAs to NWAs from following Section 6, we can obtain deterministic NWAs of acceptable size for all benchmark queries. The method nwa(det(step(.))) yield for A a dNWA of size 2831 and with 124 states. The alternative method det(nwa(step(.))) yields a dNWA of size 2520, which is even smaller, and the same number of states.

NWAs versus SHAs
We next show how to compile SHAs to NWAs such that determinism is preserved, and back while introducing nondeterminism. Thereby, we can obtain small NWAs for NREs such as E = ch * (a + b) for which det(nwa(E)) blew up in size in a surprising manner (see Figure 12).

SHAs to NWAs
As a first step, we introduce a transformation on SHAs, so that for any SHA A: Then, we compile any SHA A = (Q h , Q t , Σ, ∆, I, F) obtained after the above transformation to an NWA nwa(A) = (Q h , Q t , Σ, Γ, ∆ , I, F) such that L(A) = L(nwa(A)). We set Clearly, if S is deterministic then so is nwa(S), as p is unique in this case in particular. One might be tempted to skip the first-step transformation and restrict the above construction rule to states p such that L q (S[ ∆ /{p}]) = ∅. However, this would lead to a huge blow-up when determinizing these NWAs, basically as this change spoils the single-entry property discussed in Definition 7.
The conversion of step(ch * (a + b)) in Figure 13 yields the NWA in Figure 19. Note that the opening rules are deterministic (but not the whole NWA), as for all tree states q there is at most one hedge state p with → p such that q is accessible from p. The NWA has size 64, while its determinization has size 159 (see Figure A4 of the Appendix C). The size increase raised by determinization is thus 95 = 159 − 64 for this NWA.
The size increase for determinization is considerably smaller for the NWA obtained from the regular expressions by indirection via a SHA than for NWAs obtained by direct compilation. Indeed, the determinization of nwa(ch * (a + b)) blows the size from 39 to 271. The size increase for the determinization of nwa(ch * (a + b)) is thus 242 = 271 − 39 while for nwa(step(ch * (a + b))) it is only 95 = 159 − 64.
The experiments will show that this is not an exception but the general rule. Intuitively, the reason is that NWAs obtained from SHAs do all the work bottom-up, where NWAs obtained directly from the regular expression do a considerable amount of work top-down. In terms of the work in [22], this restriction can be characterized syntactically by the single-entry property: Definition 7. A (weak) single-entry NWA A = (Q h , Q t , Σ, Γ, ∆, I, F) is a NWA for which there exists a single state q entry ∈ Q h such that all opening rules in ∆ have the form q ↓q − −− q entry . Figure 19. The single-entry NWA nwa(step(ch * (a + b))) obtained from the SHA.
It can be shown that nwa(S) has the (weak) single-entry property for all SHAs S for which the p's are unique in the above construction rule, i.e., such that ∆ = {p}. Note that this was not the case for step(ch * (a + b)) in Figure 13 but could have been imposed w.l.o.g. leading to a slightly different NWA than in Figure 19.
The conversion of the determinization det(step(ch * (a + b))) in Figure 15 yields the deterministic NWA in Figure 20. The size goes up slightly from 53 to 73. It should be noticed that factorization avoids a quadratic blow up in this case. This can be observed at state 14, which has 3 incoming tree edges and 10 outgoing closing edges. Without factorization, the 3 tree edges could be replaced by 3 ε-edges whose elimination would produce 30 closing edges. This would increase the number 3 + 10 edges to 3 * 10 edges.  *  (a + b)))).

NWAs to SHAs
Conversely, NWAs can be compiled to stepwise hedge automata, but at the cost of introducing nondeterminism, as an NWA may traverse the branches of a tree top-down, while a stepwise must traverse them bottom-up. For this, the stepwise guesses the state in which the NWA will arrive from above and then evaluates the subtree starting with this state, while verifying the correctness of the guess later on. Let A = (Q h , Q t , Σ, ∆ , I, F) be an NWA. We build a SHA step and ∆ s is the smallest satisfying the following rules: The construction is such that L(A) = L(step(A)).
For the NWA nwa(ch * (a + b)) in Figure 6, we obtain the stepwise in Figure A3 up-to removing useless states and separating the top level.

Optimizations
We will use three optimization methods for constructing smaller dSHAs and thus smaller dNWAs: minimization, symbolic representations of sets of transition rules, and schema-based cleaning.

Minimization
Our next objective is to reduce the size of deterministic SHAs by developing a minimization algorithm for a subclass of dSHAs. Even though our implementation can deal with them, we consider SHAs without symbolic rules q − − → q for simplicity in this section. We start with an example that motivates the choice of our subclass. In Figures 21 and 22, two dSHAs are given that both recognize the language of all hedges with signature Σ = {x, y} containing exactly one occurrence of the letter x. The dSHA in Figure 22 is the dSHA recognizing this language which has the minimal number of states. The dSHA in Figure 21 is the minimal multi-module dSHA for this language. The question is how a minimization algorithm for dSHAs could convert the dSHA in Figure 21 to this minimal one in Figure 22. In particular, why would it merge the tree initial state and the hedge initial state? We do not see how this could be done based on some Myhill-Nerode-like equivalence relation. This motivates an a priori restriction to dSHAs imposing that the tree initial state and hedge initial state must be equal.  Note that any SHA can be converted into a dSHA with equal tree and hedge initial states. For this, it is sufficient to "fuse" these states and then to determinize the SHA obtained. When doing so for the dSHA in Figure 21, we indeed obtain the minimal dSHA from Figure 23, so no further minimization is needed in this case. Given the close relationship between SHAs and weak single-entry NWAs, it is instructive to consider the existing results on minimization for dNWAs. It is known that the class of general dNWAs does not allow for unique minimization [22] and that the minimization becomes NP-hard when admitting general signatures with multiple parenthesis [24].
On the positive side, the best existing minimization algorithm is due to Chervet and Walukiewicz [23]. It applies to the subclass of multi-module single-entry dNWAs, called there call-driven automata (CDA) (Chervet and Walukiewicz [23] permit signatures with multiple opening parenthesis. In the case of a single opening parenthesis, the class of CDAs is equal to their subclass of expanded CDAs for which they develop their minimization algorithm in the first place.). They showed that the subclass of multi-module single-entry dNWAs enjoys unique minimization in polynomial time.
In the case of dSHAs, we believe that unique minimization holds for the following two subclasses, and will show it for the second: • the subclass of multi-module dSHAs, and • the subclass of dSHAs where the hedge and tree initial state are the same, i.e., ∆ = I. The first subclass of multi-module dSHAs is motivated by the subclass of multi-module single-entry dNWAs. Note, however, that the SHAs that are obtained by compilation from single-entry dNWAs need not be deterministic, so the analogy between both automata classes is not perfect. The dSHA in Figure 21 is minimal for the class of multi-module dSHAs.
The second subclass of dSHAs corresponds to the subclass of single-entry dNWAs in which the single-entry state is equal to the initial state. The dSHA in Figure 22 is minimal for the second subclass. In the remainder of this section, we present a minimization algorithm for the second subclass. For this, we identify dSHAs in which tree and hedge initial state coincide with two-sorted deterministic tree automata, so that we can use a minimization algorithm for the latter. Our automaton translation is based on a novel encoding of hedges into ranked well-sorted trees with monadic and binary function symbols, which is inspired by the previous binary encoding of unranked trees known from stepwise tree automata [21]. For any unranked signature Σ, as for the construction of hedges, we consider two sorts: h for hedges and t for trees. We then consider the following ranked signature with these two sorts: The well-sorted trees over Σ @ of both sorts then have the following abstract syntax: well-sorted trees of sort h: τ ::= a(τ) | @(τ, τ ) | ε well-sorted trees of sort t: τ ::= T(τ) Any hedge over Σ can be encoded into a ranked well-sorted tree of sort h with signature Σ @ . For instance, the hedge is encoded into the following ranked well-sorted tree of sort h over Σ @ : (c(ε)))) Any SHA A = (Q h , Q t , Σ, ∆, I, F) with equal tree and hedge initial states, that is, ∆ = I, can then be encoded into a two sorted tree automaton [[A]] = (Q h , Q t , Σ @ , ∆ , I, F) by mapping the transition rules in ∆ to those in ∆ as follows: . Furthermore, the unique minimization of deterministic two-sorted tree automata implies the unique minimization of the class of dSHAs with equal tree and hedge initial states. Using this translation back and forth, we can thus lift the minimization algorithm of deterministic two-sorted tree automata to a minimization algorithm for the subclass of dSHAs with equal tree and hedge initial states. This is the minimization algorithm for dSHAs that we have implemented. We then used it in our constructions to reduce the size of the dSHAs obtained by determinization.

Symbolic SHAs with Apply-Else Rules
The sizes of the dSHAs constructed so far are dominated by the number of transitions. We now propose a class of symbolic dSHAs by adding apply-else rules, in order to represent large numbers of apply rules in a more compact and symbolic manner.
An apply-else rule has the form q _ − −− q where q, q ∈ Q h . It represents the set of apply rules q p − −− q , where p ∈ Q t can be chosen arbitrarily from a subset of tree states distinguished by the automaton.
We have also adapted our determinization for dSHAs so that it preserves apply-else rules. What is missing so far is a concept for NWAs that corresponds to the apply-else rules of dSHAs. Therefore, we have to eliminate apply-else rules before translating SHAs to NWAs.

Schema-Based Cleaning
Automata for XPATH queries recognize nested words that can be obtained by encoding XML documents with a single x-marked position. The class of such nested words is characterized by a schema that we can define as the intersection of the two dSHAs in Figures 23 and 24. The first SHA tests whether there is exactly one occurrence of the internal letter x, and the second one tests that the XML data model is satisfied, and the node annotations with x and ¬x are put at the right positions. The automata constructed for the XPATH queries may accept some trees that do not satisfy the schema (but will never be evaluated on such trees when answering the query). The idea of schema-based cleaning is to remove all transition rules and states that are not used for recognizing any nested word satisfying the schema. Schema-based cleaning of an automaton can be performed by constructing the product of the automaton with the schema, which is in our case an intersection of two dSHAs. We then only keep those states of the original SHA that are used in accessible and co-accessible states of the product with the schema.
Note that schema-based cleaning typically changes the language of the automaton. Different languages may be obtained when cleaning different automata for the same query with respect to the schema. If one is interested in a unique language, then one can choose the intersection of the automaton with the schema. This intersection, however, is usually larger than the automaton obtained by schema-based cleaning.

Experimental Results for Optimizations
The sizes of optimized automata for the benchmark queries are reported in Figure 25.   is encoded into the following ranked well-sorted tree of sort h over Σ @ : (c(ε)))) Any SHA A = (Q h , Q t , Σ, ∆, I, F) with equal tree and hedge initial states, that is ∆ = I, can then be encoded into a two sorted tree automaton [[A]] = (Q h , Q t , Σ @ , ∆ , I, F) by mapping the transition rules in ∆ to those in ∆ as follows: The function step @ _ used in the first column compiles NREs to SHAs with apply-else rules. This does not change the number of states, but reduces the number of automata transitions. In the case of A8, the size of the stepwise automaton is reduced from 1106 to 894.
An optimized determinizer is applied by the function D(.) = det(step @ _(.)) in the second column. It preserves apply-else rules in particular. For A8, the size is reduced from 749 to 639 while the number of states is preserved.
Schema-based cleaning is applied by the function S(.) = schema − clean(D(.)) in the third column. For A8, the number of rules is reduced further from 639 to 527.
Minimization is applied by the function M(.) = mini(S(.)) in the fourth column. In the case of A8, it reduces the number of states from 117 to 101 and the size from 527 to 487.
In order to come back to dNWAs, we have to eliminate the apply-else rules in column six. For A8 this increases the number of rules back from 527 to 1257.
In the final column, we apply the compiler from SHAs to NWAs which preserves determinism. For A8, this results in a dNWA of size 1413 and 102 states. This is better than the previous results, in particular with respect to the number of states.

Summary of Experimental Results
We now plug the different compilers and optimization methods all together and compare the sizes of deterministic NWAs that we can obtain thereby.

Summary of Experimental Results
We now plug the different compilers and optimization methods all together and compare the sizes of deterministic NWAs that we can obtain thereby.

670
The overall sizes (#states) of the resulting dNWAs are given in Figure 26. We see that the two methods starting with SHAs nwa(det(step(.))) and det(nwa(step(.))) yield reasonably small deterministic NWAs for the NREs of all benchmark queries. The methods starting with NWAs nwa(det(step(nwa(.)))) and det(nwa(step(nwa(.)))) provide reasonably small deterministic NWAs for queries except for A8. 675 We also tested our algorithms on collections of XPath queries with a scalable parameter, such as the queries ch n (a) for increasing n. This series is known to require automaton with a number of states exponential in n for deterministic bottom-up evaluation. The best methods to produce deterministic NWAs in this case is nwa(det(step)). It works until n = 9, leading to an dNWA of size 134929 with 772 states. The number of states close to doubles when increasing n by 1. The second best method for 680 Figure 26. Deterministic NWAs computed with optimizations for the XPath benchmark queries. Note that different dNWAs for the same query may recognize different languages, due to schemabased cleaning with respect to the XML data model. Furthermore, our implementation of the minimization algorithm for the subclass of dSHAs worked successfully only for dSHAs with at most 200 states.
We also tested our algorithms on collections of XPath queries with a scalable parameter, such as the queries ch n (a) for increasing n. This series is known to require automaton with a number of states exponential in n for deterministic bottom-up evaluation. The best methods to produce deterministic NWAs in this case is nwa(det(step)). It works until n = 9, leading to an dNWA of size 134, 929 with 772 states. The number of states close to doubles when increasing n by 1. The second best method for producing dNWAs for the series ch n (a) works only until n = 6.
For explaining the different size of the dNWAs for the series ch n (a), we first note that no schema-based cleaning was applied in this experiment. As a consequence, unique minimal single-entry dNWAs in which the single-entry state is the initial state should exist. The reason for the larger number of states with the three other methods is that we have not implemented any minimization algorithm for this subclass of single-entry dNWAs. Furthermore, our implementation of the minimization algorithm for our subclass of dSHAs failed for too big dSHAs. In this case, the number of states reported in Figure 27 could not be reduced to the minimum. In addition, the number of rules seems to be increased further by the lack of any symbolic representation for rules of NWAs that could mimic the apply-else rules for SHAs.
The only query of the benchmark for which we can provide a deterministic NRE is the query A1. The NRE for query A1 in Figure A1 is nondeterministic nevertheless, since we replaced ch(E) 705 with T · E · T. This is not problematic, given that we can use a decent method for determinization of NWAs. For this reason, it does no more seem worth the effort to maintain specialized compilation methods for deterministic NREs. For the same reason, we will not present any experimental results for our specialized compiler from deterministic NREs to deterministic NWAs. Instead we use the more general compiler for general nondeterministic NREs.

710
The compiler from Theorem 1 introduces epsilon rules and thus, it does not preserve determinism: some deterministic NREs will be compiled to nondeterministic NWAs. This introduction of nondeterminism can be avoided by eliminating epsilon rules on the fly, that is by using Glushkov's approach rather than that of Thompson. Proof sketch. Theorem 3 uses Glushkov's construction and thus eliminates ε-edges on the fly compared to the McNaughton-Yamada-Thompson algorithm. The Glushkov construction is well-known to preserve determinism when compiling regular expressions without nesting to finite state automata [20]. For the additional deterministic expressions ch(E), we adapt the deterministic 720 compilation from [18]. This quadratic time result generalizes a previous result for the Glushkov construction [19] from regular expressions without conjunctions and negations to NREs without conjunctions and negations. Small deterministic NREs without conjunction and negation can thus be compiled to small dNWAs.

Deterministic Nested Regular Expressions
We finally show how to distinguish NREs that can be evaluated deterministically in polynomial time, for instance, by compilation to deterministic NWAs. For this, we consider the language of NREs nregexp(ch, T) that extends the abstract syntax of NREs by a new constant T and a new unary operator ch. Definition 8. An expression of nregexp(ch, T) is deterministic if it does not contain a subexpression of any of the forms: E 1 + E 2 , E * , T · E, µa.E.
Note in particular that ch(a) is a deterministic expression of nregexp(ch, T), as the child operator is added as a primitive there. In contrast, the semantically equivalent expression T. a .T is not deterministic. Similarly, T is deterministic since it is a primitive expression of nregexp(ch, T), while the equivalent expression µx.( x + _) * is nondeterministic for three different reasons: the µ-operator, star * , and disjunction +. The recursive expression ch * (E) is nondeterministic: it is not primitive in nregexp(ch, T), and its definition is based on the µ-operator and disjunction.
The only query of the benchmark for which we can provide a deterministic NRE is the query A1. The NRE for query A1 in Figure A1 is nondeterministic nevertheless, as we replaced ch(E) with T · E · T. This is not problematic, given that we can use a decent method for determinization of NWAs. For this reason, it does no more seem worth the effort to maintain specialized compilation methods for deterministic NREs. For the same reason, we will not present any experimental results for our specialized compiler from deterministic NREs to deterministic NWAs. Instead we use the more general compiler for general nondeterministic NREs.
The compiler from Theorem 1 introduces epsilon rules, and thus it does not preserve determinism: some deterministic NREs will be compiled to nondeterministic NWAs.
This introduction of nondeterminism can be avoided by eliminating epsilon rules on the fly, that is by using Glushkov's approach rather than that of Thompson.

Theorem 3.
For any deterministic regular expression E of nregexp(ch, T) without conjunction and negation, we can construct in time O(|E| 2 ) a dNWA recognizing the same language.
Proof sketch. Theorem 3 uses Glushkov's construction and thus eliminates ε-edges on the fly compared to the McNaughton-Yamada-Thompson algorithm. The Glushkov construction is well-known to preserve determinism when compiling regular expressions without nesting to finite state automata [20]. For the additional deterministic expressions ch(E), we adapt the deterministic compilation from the work in [18]. This quadratic time result generalizes a previous result for the Glushkov construction [19] from regular expressions without conjunctions and negations to NREs without conjunctions and negations. Small deterministic NREs without conjunction and negation can thus be compiled to small dNWAs. On the benchmark, however, this construction can be applied to the query A1 only, so only a few queries can be covered in this manner.

Conclusions and Future Work
We presented SHAs and showed how they can be used to compile NREs to deterministic NWAs. When applied to NREs for navigational XPATH queries in the usual XPathMark benchmark, we obtained reasonably small deterministic NWAs, in contrast to all previous approaches.
The dNWAs that we obtain by compilation from SHAs all have the weak single-entry property. This property means that the computation of the NWA is done in a purely bottomup and left-to-right manner, so in the same way as by an SHA. Our experiments show that the usual determinization algorithm for NWAs is well-behaved when applied to weak single-entry NWAs, while it quickly fails without the weak single-entry property.
We have also stated a unique minimization algorithm for dSHAs with the same tree and hedge initial state. It is open whether unique minimization holds for general dSHAs. Neither do we know whether dSHA minimization is NP-hard. The analogous questions remain open for the class of weak single-entry dNWAs.
In future work, one needs to tackle the open questions on the minimization of dSHAs, weak single-entry dNWAs, and dNWAs with fixed general signatures. One has to understand, whether and why unique minimization holds or not, and whether and why minimization is hard or not. Independently, it is interesting to use SHAs in various questions in theory and practice. In particular, we want to develop new algorithms for earliest query answering for dSHAs that are more efficient than the existing algorithms for dNWAs [16] and to see how they behave in practice.  Data Availability Statement: The nested regular expressions generated for the queries mentioned in this article and their corresponding automata-when we could produce them-can be found at http://researchers.lille.inria.fr/niehren/complementary-material.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Determinization of NWAs
Let us first introduce some notations. For a transition τ ∈ Q h × Q h , we write lab(τ) = {a ∈ Σ | ∃(q, q ) ∈ τ, q ∈ Q.q a − → q ∈ ∆}. Furthermore, we write ε ∆ * to denote the reflexive and transitive closure of ε ∆ . Finally, for any set Q, we write id Q to denote the binary relation that relates every element of Q to itself, that is, id Q = {(q, q) ∈ Q 2 }.
We adapt the usual determinization procedure for NWAs [3,18] so that they can account for hedge ending and else rules. Given an NWA A = (Q h , Q t , Σ, Γ, ∆, I, F), the difficulty is to deal with concurrent opening rules q ↓γ 1 − −− q 1 and q ↓γ 2 − −− q 2 in ∆ during determinization without mixing up the stack symbols γ 1 and γ 2 . Therefore, we use transition relations as states of the determinized automaton det(A) = (Q det h , Q det t , Σ, Γ det , ∆ det , I det , The only initial state is the composition of id I with ε ∆ * , i.e., I det = {id I • ε ∆ * }. The set of final states is F det = {τ ∈ Q det h | τ ∩ (I × F) = ∅}. Schemas generating the transition rules in ∆ det are given below.

Appendix B. NREs for the XPathMark Benchmark
We compiled navigational XPath queries of the XPathMark benchmark to the NREs given in Figure A1.

Appendix B NREs for the XPathMark Benchmark
We compiled navigational XPath queries of the XPathMark benchmark to the NREs given in 860 Figure A1.