Static Analysis for ECMAScript String Manipulation Programs

Abstract: In recent years, dynamic languages, such as JavaScript or Python, have been increasingly used in a wide range of fields and applications. Their tricky and misunderstood behaviors pose a great challenge for static analysis of these languages. A key aspect of any dynamic language program is the multiple usage of strings, since they can be implicitly converted to values of another type, transformed by string-to-code primitives or used to access an object property. Unfortunately, string analyses for dynamic languages still lack precision and do not take into account some important string features. In this scenario, more precise string analyses become a necessity. The goal of this paper is to take a first step towards precisely handling dynamic language string features. In particular, we propose a new abstract domain approximating strings as finite state automata and an abstract interpretation-based static analysis for the most common string manipulating operations provided by the ECMAScript specification. The proposed analysis comes with a prototype static analyzer implementation for an imperative string manipulating language, allowing us to show and evaluate the improved precision of the proposed analysis.


Introduction
Dynamic languages, for instance JavaScript or Python, have seen an important growth in a very wide range of fields and applications. Common features in these languages are dynamic typing (typing occurs during program execution, at run-time) and implicit type conversion [1], which lighten the development phase and allow programs not to block execution in the presence of unexpected or unpredictable situations. Moreover, one important aspect of dynamic languages is the way strings may be used. In JavaScript, for example, strings can be used either to access object properties or transformed into executable code by the global function eval. In this way, dynamic languages provide multiple string features that simplify the writing of programs while allowing statically unpredictable executions, which might make programs harder to understand [1]. For this reason, string obfuscation (e.g., string splitting) is becoming one of the most common obfuscation techniques in JavaScript malware [2], making it hard to statically analyze code. Consider, for example, the JavaScript program fragment in Figure 1, where strings are manipulated, de-obfuscated, combined together into the variable d and finally transformed into executable code, the statement ws = new ActiveXObject(WScript.Shell). This command, in Internet Explorer, opens a shell which may execute malicious commands. The command is not hard-coded in the fragment but it is built at run-time, and the initial values of i, j and k are unknown, as is the number of iterations of the loops.

    vd, ac, la = "";
    v = "wZsZ";
    m = "AYcYtYiYvYeYXY";
    tt = "AObyaSZjectB";
    l = "WYSYcYrYiYpYtY.YSYhYeYlYlY";
    while (i += 2 < v.length)
      vd = vd + v.charAt(i);
    while (j += 2 < m.length)
      ac = ac + m.charAt(j);
    ac += tt.substring(tt.indexOf("O"), 3);
    ac += tt.substring(tt.indexOf("j"), 11);
    while (k += 2 < l.length)
      la = la + l.charAt(k);
    d = vd + "= new " + ac + "(" + la + ")";
    eval(d);

All these observations suggest that, in order to statically understand statements which are dynamically generated and executed, it may be extremely useful to statically analyze the string value of d. Unfortunately, existing static analyzers for dynamic languages [3][4][5][6] might fail to precisely analyze strings in dynamic contexts. For instance, in the example above, TAJS [3], JSAI [4] and SAFE [5] lose precision on the eval input value and any information gathered so far about it. Namely, the issue of analyzing dynamic languages, even if tackled by sophisticated tools such as the cited ones, still lacks formal approaches for handling the dynamic features of string manipulation, such as dynamic typing, implicit type conversion and dynamic code generation. Instead, in [7], a new approach for dynamic language analysis is proposed, based on finite state automata for abstracting strings, coming with both a precise string abstraction able to infer string properties in general and a sound abstract interpreter for dynamically-generated code.

Contributions
In this paper, we focus on the characterization of an abstract interpretation-based [9] formal framework, capable of handling dynamic typing and implicit type conversion, by defining an abstract semantics able to capture (precisely, when possible) the previously mentioned dynamic features. Even if we do not tackle the problem of analyzing dynamically generated code (meaning that we do not analyze its behavior), as highlighted in [7], such a semantics is a necessary step towards a sufficiently precise analysis for it, since it is able to reason about a class of string manipulation programs (as far as string values are concerned) that state-of-the-art static analyzers would fail to precisely analyze. Indeed, the domain we propose allows us to collect (and potentially approximate) the set of all possible string values that a variable may receive during computation (at each program point). It should be clear that, in order to analyze what an eval statement might execute, we need to (over-)approximate the set of precise string values of its input. Hence we propose an approach defining a collecting semantics for strings. With this task in mind, we will first discuss how to combine abstract domains of primitive types (strings, integers and booleans) in order to capture dynamic typing. Once we have such an abstract domain, we will define on it an abstract semantics for the language µJS, whose concrete semantics is inspired by the JavaScript one, augmented with implicit type conversion, dynamic typing and several interesting string operations taken from the official ECMAScript language specification [10], namely the JavaScript language specification. In particular, for each of these operations we will provide the algorithm computing its abstract semantics and we will discuss its soundness and completeness. This paper is an extended and revised version of [8], integrated with a more complete range of string operations, detailed proofs of the results presented (reported in Appendix A) and an improved implementation that will be discussed in Section 6.

Paper structure
In Section 2 we recall relevant notions on finite state automata, and in Section 3 we establish the core language adopted in this paper. In Section 4 we discuss and present two ways of combining abstract domains (for primitive types) suitable for dynamic languages. In Section 4.1 we define the finite state automata domain, highlighting some important operations and theoretical results. Then, in Section 5, we present the new abstract semantics for string manipulating operations. In Section 6 we examine and evaluate the precision of the string static analyzer based on the above semantics. Finally, in Section 7, we compare this paper with the most closely related works and we draw our conclusions.

Background
In this section, we recall some basic notations and notions that will be used in the rest of the paper.

String Notation
We denote by $\Sigma$ a finite non-empty alphabet of symbols, its Kleene-closure by $\Sigma^*$ and a string element by $\sigma \in \Sigma^*$. If $\sigma = \sigma_0 \sigma_1 \cdots \sigma_n$, the length of $\sigma$ is $|\sigma| = n + 1$ and the element in the $i$-th position is $\sigma_i$. Given two strings $\sigma, \sigma' \in \Sigma^*$, $\sigma \cdot \sigma'$ is their concatenation. A language is a set of strings, i.e., $L \in \wp(\Sigma^*)$. We use the following notations: $\Sigma^i \stackrel{\text{def}}{=} \{ \sigma \in \Sigma^* \mid |\sigma| = i \}$ and $\Sigma^{<i} \stackrel{\text{def}}{=} \bigcup_{j<i} \Sigma^j$. Given $\sigma \in \Sigma^*$ and $i, j \in \mathbb{N}$ ($i \le j \le |\sigma|$), the substring between $i$ and $j$ of $\sigma$ is the string $\sigma_i \cdots \sigma_{j-1}$. We denote by $\Sigma_{\mathbb{Z}} \stackrel{\text{def}}{=} \{+, -, \varepsilon\} \cdot \{0, 1, \ldots, 9\}^+$ the set of numeric strings, i.e., strings corresponding to integers. $\mathcal{I} : \Sigma_{\mathbb{Z}} \to \mathbb{Z}$ maps numeric strings to the corresponding integers. Dually, we define the function $\mathcal{S} : \mathbb{Z} \to \Sigma_{\mathbb{Z}}$ that maps each integer to its numeric string representation (e.g., 1 is mapped to the string "1", and not "+1"). Given $\sigma \in \Sigma^*$ and $n \in \mathbb{N}$, we denote by $\sigma^n$ the $n$-times concatenation of $\sigma$. Given a symbol $c \in \Sigma$, we denote by toLowerCase($c$) its corresponding lower-case symbol if $c$ is a capital letter; otherwise $c$ itself is returned. We abuse notation by denoting by toLowerCase($\sigma$) the string $\sigma$ where, at each position, any upper-case symbol is replaced with the corresponding lower-case symbol.

Regular Languages and Finite State Automata
We follow [11] for automata notation. A finite state automaton (FA) is a tuple $A = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite non-empty set of states, $q_0 \in Q$ is the initial state, $\Sigma$ is a finite alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation and $F \subseteq Q$ is the set of final states. In particular, if $\delta : Q \times \Sigma \to Q$ is a function then $A$ is called a deterministic FA (DFA). We also consider as DFA those FA which are not complete, namely those in which a transition is not defined for every pair $(q, a)$ ($q \in Q$, $a \in \Sigma$). They can be easily transformed into a complete DFA by adding a sink state receiving all the missing transitions. The class of languages recognized by FA is the class of regular languages. We denote the set of all DFA as DFA. Given an automaton $A$, we denote the language accepted by $A$ as $\mathcal{L}(A)$. A language $L$ is regular iff there exists a FA $A$ such that $L = \mathcal{L}(A)$. From the Myhill-Nerode theorem [12], for each regular language there exists a unique minimum automaton, i.e., one with the minimum number of states, recognizing the language. Given a regular language $L$, we denote by Min($L$) the minimum DFA $A$ s.t. $L = \mathcal{L}(A)$. Given an automaton $A$, we denote by Kleene($A$) the automaton that recognizes the language corresponding to the Kleene-closure of $\mathcal{L}(A)$, namely $\mathcal{L}(\text{Kleene}(A)) = \{ \sigma^n \mid \sigma \in \mathcal{L}(A), n \in \mathbb{N} \}$. Moreover, given an automaton $A$, we rely on the predicate hasCycle($A$) that checks whether $A$ is cyclic.
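The tuple definition and the sink-state completion described above can be sketched in a few lines of JavaScript. This is only an illustrative encoding: the names makeDfa, accepts and complete, as well as the "state,symbol" key format, are assumptions made for this example and do not come from the paper or its prototype.

```javascript
// A (possibly incomplete) DFA; delta maps the key "state,symbol" to a state.
// All names here are hypothetical helpers for illustration only.
function makeDfa(states, alphabet, delta, initial, finals) {
  return {
    states: new Set(states),
    alphabet: new Set(alphabet),
    delta: new Map(Object.entries(delta)),
    initial,
    finals: new Set(finals),
  };
}

// Runs the automaton on a word; a missing transition means rejection.
function accepts(dfa, word) {
  let q = dfa.initial;
  for (const c of word) {
    q = dfa.delta.get(`${q},${c}`);
    if (q === undefined) return false;
  }
  return dfa.finals.has(q);
}

// Completes the DFA by routing every missing transition to a sink state.
function complete(dfa, sink = "sink") {
  const delta = new Map(dfa.delta);
  const states = new Set(dfa.states).add(sink);
  for (const q of states)
    for (const a of dfa.alphabet)
      if (!delta.has(`${q},${a}`)) delta.set(`${q},${a}`, sink);
  return { ...dfa, states, delta };
}

// Incomplete DFA for the language { "ab" } over the alphabet {a, b}.
const ab = makeDfa(["q0", "q1", "q2"], ["a", "b"],
                   { "q0,a": "q1", "q1,b": "q2" }, "q0", ["q2"]);
const abTotal = complete(ab);
console.log(accepts(ab, "ab"), accepts(abTotal, "ab")); // true true
console.log(accepts(abTotal, "aba"));                    // false
console.log(abTotal.delta.size);                         // 8 (4 states x 2 symbols)
```

Completion does not change the recognized language, since the sink state is not final and has no way back to the original states.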

Abstract Interpretation
Abstract interpretation establishes a correspondence between a concrete semantics and an approximated one called abstract semantics [9,13]. In a Galois Connection framework, if $C$ and $A$ are complete lattices, a pair of monotone functions $\alpha : C \to A$ and $\gamma : A \to C$ forms a Galois Connection (GC for short) between $C$ and $A$ if for every $x \in C$ and $y \in A$ we have $\alpha(x) \le_A y \Leftrightarrow x \le_C \gamma(y)$. $\alpha$ and $\gamma$ are called abstraction function and concretization function, respectively.
Let $L$ be a complete lattice. $X \subseteq L$ is a Moore family of $L$ if $X = \mathcal{M}(X) = \{ \bigwedge S \mid S \subseteq X \}$, where $\bigwedge \emptyset = \top$ (the top element) belongs to $\mathcal{M}(X)$. The condition that any concrete object in $C$ has a best abstraction in the abstract domain $A$ implies that $A$ is a Moore family of $C$, and so there exists a Galois connection between $C$ and $A$.
Weaker forms of correspondence are possible, e.g., when $A$ is not a complete lattice or when only $\gamma$ exists [14]. In all cases, relative precision in $A$ is given by comparing the meaning of abstract objects in $C$, i.e., $x_1 \le_A x_2 \Leftrightarrow \gamma(x_1) \le_C \gamma(x_2)$. If $f : C \to C$ is a continuous function and $A$ is an abstraction of $C$ by means of the GC $\langle \alpha, \gamma \rangle$, then $f$ always has a best correct approximation in $A$, $f^A : A \to A$, defined as $f^A = \alpha \circ f \circ \gamma$. In abstract interpretation, there exist two notions of completeness: backward completeness and forward completeness. The former is the best known form of completeness and focuses on complete abstractions of the inputs, while the latter [15][16][17] focuses on complete abstractions of the outputs, both w.r.t. an operation of interest. When we do not have a GC, namely when only the concretization $\gamma$ exists, we need to focus only on forward completeness, as we will do in this paper. Given a GC $\langle \alpha, \gamma \rangle$, a concrete function $f : C \to C$ and an abstract function $f^\sharp : A \to A$, $f^\sharp$ is backward complete for $f$ if $\alpha \circ f = f^\sharp \circ \alpha$, and forward complete for $f$ if $f \circ \gamma = \gamma \circ f^\sharp$. An abstract domain $A$ satisfies the ascending chain condition (ACC) if all ascending chains are finite. When $A$ is not ACC, convergence to the limit of the fix-point iterations can be ensured through widening operators. A widening operator $\nabla : A \times A \to A$ approximates the least upper bound, i.e., $\forall x, y \in A .\ x, y \le_A (x \nabla y)$, and it is such that for any increasing chain $x_1 \le x_2 \le \cdots \le x_n \le \ldots$ the increasing chain $w_0 = \bot$ and $w_{i+1} = w_i \nabla x_{i+1}$ is finite.

The Core Language
In this paper, we consider a JavaScript core language, reported in Figure 2, that we call µJS, containing several representative string operations taken from the set of methods offered by the JavaScript built-in class String, detailed in the ECMAScript language specification [10]. Even though we have decided to focus on a core of the operations, note that the missing methods (e.g., indexOf or endsWith) can be easily modeled as compositions of our chosen string methods or as particular cases of them. Nevertheless, as we will discuss in Section 6, these operations have been implemented and tested. Program states are partial maps from identifiers to primitive values, i.e., STATES : ID → V. The concrete big-step semantics $[\![\cdot]\!]$ : STMT × STATES → STATES is standard and follows [18], and it includes dynamic typing and implicit type conversion. In addition, the expression semantics, $(\!|\cdot|\!)$ : EXP × STATES → V, is standard and follows [18]; we only provide the formal and precise semantics of the µJS string operations. Let $\sigma, \sigma' \in \mathbb{S}$ and $i, j \in \mathbb{Z}$ (values which are not strings or numbers, respectively, are converted by the implicit type conversion primitives; moreover, negative values are treated as zero).
substring: It extracts the substring between two indexes from a string. The semantics is the function SS : $\mathbb{S} \times \mathbb{Z} \times \mathbb{Z} \to \mathbb{S}$.
charAt: It returns the character, i.e., the string of unitary length, at a specified index in a string $\sigma$. The semantics is the function CA : $\mathbb{S} \times \mathbb{Z} \to \mathbb{S}$.
length: It returns the length of a string $\sigma \in \mathbb{S}$. Its semantics is the function LE : $\mathbb{S} \to \mathbb{Z}$ defined as LE($\sigma$) $\stackrel{\text{def}}{=} |\sigma|$.
concat: It returns the concatenation of two strings. Its concrete semantics CC : $\mathbb{S} \times \mathbb{S} \to \mathbb{S}$ relies on the concatenation operator reported in Section 2: CC($\sigma, \sigma'$) $= \sigma \cdot \sigma'$.
startsWith: It determines whether a specified string $\sigma$ starts with $\sigma'$. The semantics is the function SW : $\mathbb{S} \times \mathbb{S} \to \mathbb{B}$.
repeat: It returns the given string repeated $n$ times. The semantics is the function RT : $\mathbb{S} \times \mathbb{Z} \to \mathbb{S}$ defined as RT($\sigma, n$) $\stackrel{\text{def}}{=} \sigma^n$.
includes: It determines whether a string $\sigma'$ is a substring of $\sigma$. The semantics is the function IN : $\mathbb{S} \times \mathbb{S} \to \mathbb{B}$.
toLowerCase: It returns the given string in all lowercase letters. The semantics is the function LC : $\mathbb{S} \to \mathbb{S}$ defined as LC($\sigma$) $\stackrel{\text{def}}{=}$ toLowerCase($\sigma$).
trimLeft: It removes all the white-spaces at the beginning of a string. The semantics is the function TL : $\mathbb{S} \to \mathbb{S}$.
trimRight: It removes all the white-spaces at the end of a string. The semantics is the function TR : $\mathbb{S} \to \mathbb{S}$.
trim: It removes all the white-spaces at the end and beginning of a string. The semantics is the function TM : $\mathbb{S} \to \mathbb{S}$ defined as TM($\sigma$) $\stackrel{\text{def}}{=}$ TR(TL($\sigma$)).
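As an informal reading of the operations listed above, the following JavaScript sketch gives one executable interpretation of each function. It is an assumption-laden reconstruction from the prose (index clamping, argument swapping in substring, the built-in notion of white-space), not the paper's formal definitions; the helper clamp is hypothetical.

```javascript
// Hypothetical reference implementations named after the functions in the text.
const clamp = (n) => Math.max(0, n);          // negative values are treated as zero

function SS(s, i, j) {                        // substring
  let a = Math.min(clamp(i), s.length);
  let b = Math.min(clamp(j), s.length);
  if (a > b) [a, b] = [b, a];                 // swap when the start exceeds the end
  return s.slice(a, b);
}
const CA = (s, i) => s.charAt(clamp(i));      // "" when the index is out of range
const LE = (s) => s.length;
const CC = (s, t) => s + t;
const SW = (s, t) => s.startsWith(t);
const RT = (s, n) => s.repeat(clamp(n));
const IN = (s, t) => s.includes(t);
const LC = (s) => s.toLowerCase();
const TL = (s) => s.replace(/^\s+/, "");
const TR = (s) => s.replace(/\s+$/, "");
const TM = (s) => TR(TL(s));

console.log(SS("hello", 1, 3), SS("hello", 3, 1), SS("bc", 1, 3)); // el el c
console.log(JSON.stringify(CA("hello", 7)));                        // ""
console.log(RT("ab", 3), JSON.stringify(TM("  a b ")));             // ababab "a b"
```

On non-negative indexes SS agrees with the built-in String.prototype.substring, which performs the same swap and clamps indexes beyond the string length.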

Implicit Type Conversion
In order to properly capture the semantics of the language µJS, inspired by the JavaScript semantics, we need to deal with implicit type conversion [18]. For each primitive value, we define an auxiliary function converting it to other primitive values (Figure 3). Note that all the functions behave like the identity when applied to values not needing conversion, e.g., toInt on integers. Then, toString : V → S maps any input value to its string representation; toInt : V → Z ∪ {NaN} returns the integer corresponding to a value, when possible: for true and false it returns 1 and 0, respectively, for strings in $\Sigma_{\mathbb{Z}}$ it returns the corresponding integer, while all the other values are converted to NaN. For instance, toInt("42") = 42 and toInt("42hello") = NaN. Finally, toBool : V → B returns false when the input is 0, and true for all the other non-boolean primitive values. It is worth noting that the auxiliary functions defined in Figure 3 do not correspond to explicit casting; rather, they model the implicit type conversion implemented by JavaScript. In particular, these functions cannot be directly called by a programmer, since they are used exclusively internally (indeed, implicitly) by the semantics when an expression operand is required to have a value of a specific type.
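The conversion functions can be sketched as follows. The sketch follows only the prose description above, since Figure 3 is not reproduced here; in particular, toBool maps only 0 to false, which is an assumption taken from the text and differs from the full JavaScript rules (where, for instance, the empty string is also falsy). The name toStr is used to avoid clashing with the built-in toString.

```javascript
// Numeric strings: an optional sign followed by one or more digits.
const NUMERIC = /^[+-]?[0-9]+$/;

// toInt : V -> Z ∪ {NaN}, following the prose description only.
function toInt(v) {
  if (typeof v === "boolean") return v ? 1 : 0;
  if (typeof v === "number") return v;             // identity on integers (and NaN)
  return NUMERIC.test(v) ? parseInt(v, 10) : NaN;  // strings
}

// toBool : V -> B. Assumption from the text: only 0 is converted to false.
function toBool(v) {
  if (typeof v === "boolean") return v;            // identity on booleans
  return v !== 0;
}

// toStr : V -> S, the string representation of a primitive value.
function toStr(v) {
  return String(v);
}

console.log(toInt("42"), toInt("42hello"), toInt(true)); // 42 NaN 1
console.log(toBool(0), toBool("hello"));                  // false true
console.log(toStr(42), toStr(false));                     // 42 false
```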

The Finite State Automata Abstract Domain for Strings
In this section, we describe the finite state automata abstract domain for strings [19][20][21], namely the domain of regular languages over $\wp(\Sigma^*)$. In particular, our goal is to exploit automata, and therefore regular languages, for approximating string values collected during analysis. The idea is to approximate strings as regular languages represented by the minimum DFA [12] recognizing them. In general, we have more DFA than regular languages, hence the domain of automata is indeed the quotient $\text{DFA}_{/\equiv}$ w.r.t. the equivalence relation induced by language equality: $\forall A_1, A_2 \in \text{DFA} .\ A_1 \equiv A_2 \Leftrightarrow \mathcal{L}(A_1) = \mathcal{L}(A_2)$. Therefore any equivalence class is composed of automata that recognize the same regular language. We abuse notation by representing each equivalence class in $\text{DFA}_{/\equiv}$ by one of its automata (usually the minimum), i.e., when we write $A \in \text{DFA}_{/\equiv}$ we mean $[A]_{\equiv}$.
The partial order $\sqsubseteq_{\text{DFA}}$ is induced by language inclusion, i.e., $\forall A_1, A_2 \in \text{DFA}_{/\equiv} .\ A_1 \sqsubseteq_{\text{DFA}} A_2 \Leftrightarrow \mathcal{L}(A_1) \subseteq \mathcal{L}(A_2)$, which is well defined since automata in the same $\equiv$-equivalence class recognize the same language.
The corresponding least upper bound, $\sqcup_{\text{DFA}} : \text{DFA}_{/\equiv} \times \text{DFA}_{/\equiv} \to \text{DFA}_{/\equiv}$, is the standard union between automata: $\forall A_1, A_2 \in \text{DFA}_{/\equiv} .\ A_1 \sqcup_{\text{DFA}} A_2 = \text{Min}(\mathcal{L}(A_1) \cup \mathcal{L}(A_2))$. It is the minimum automaton recognizing the union of the languages $\mathcal{L}(A_1)$ and $\mathcal{L}(A_2)$. This is a well-defined notion since regular languages are closed under union. As an example, consider Figure 4, where the automaton in Figure 4c is the least upper bound of $A_1$ and $A_2$ given in Figure 4a,b, respectively. However, regular languages are not closed under infinite intersection, so $\text{DFA}_{/\equiv}$ is not a Moore family of $\wp(\Sigma^*)$. In other words, there cannot exist a Galois connection between $\text{DFA}_{/\equiv}$ and $\wp(\Sigma^*)$, i.e., there may be no minimal automaton abstracting a language. Note that some works [22][23][24] have studied automatic procedures to compute, given an input language $L$, the regular cover of $L$ [23] (i.e., an automaton containing the language $L$). Some of them [22,23] studied regular covers guaranteeing that the automaton obtained is the best w.r.t. a minimal relation (but not minimum). However, this is not a concern, since the relation between concrete semantics and abstract semantics can be weakened while still ensuring soundness [14]. A well known example is the convex polyhedra domain [25].

Widening
The domain $\text{DFA}_{/\equiv}$ is an infinite domain, and it is not ACC, i.e., it contains infinite ascending chains. For instance, consider the set of languages $\{L_i\}_{i \in \mathbb{N}} \subseteq \wp(\Sigma^*)$ with $L_i = \{ a^j b^j \mid 0 \le j \le i \}$, forming an infinite ascending chain of finite regular languages. The set of the corresponding minimal automata trivially forms an infinite ascending chain on $\text{DFA}_{/\equiv}$. This clearly implies that a computation on $\text{DFA}_{/\equiv}$ may fail to converge [14]. Most of the proposed abstract domains for strings [3][4][5]26] trivially satisfy ACC, being finite, but they may lose precision during the abstract computation [27].
As far as automata are concerned, existing widenings are defined in terms of a state equivalence relation merging states that recognize the same language, up to a fixed length $n$ (set as a parameter for tuning the widening precision) [28,29]. We denote this parametric widening by $\nabla_n : \text{DFA}_{/\equiv} \times \text{DFA}_{/\equiv} \to \text{DFA}_{/\equiv}$, with $n \in \mathbb{N}$ [28], and we define it in the following.
Let $A = (Q, \Sigma, \delta, q_0, F)$ and $A' = (Q', \Sigma, \delta', q'_0, F')$ be two finite state automata such that $\mathcal{L}(A) \subseteq \mathcal{L}(A')$: the widening between $A$ and $A'$ is formalized in terms of a relation $R \subseteq Q \times Q'$ between the sets of states of the two automata. The relation $R$ is used to define an equivalence relation $\equiv_R \subseteq Q' \times Q'$ over the states of $A'$, such that $\equiv_R = R \circ R^{-1}$. The widening between $A$ and $A'$ is then given by the quotient automaton of $A'$ w.r.t. the partition induced by $\equiv_R$: $A \nabla_R A' = A'/_{\equiv_R}$ (given $A \in \text{DFA}_{/\equiv}$ and a partition $\pi$ over its states, we denote by $A/\pi$ the quotient automaton [12]). Thus, the widening operator merges the states of $A'$ that are equivalent w.r.t. the relation $\equiv_R$. By changing the relation $R$, we obtain different widening operators [28]. It has been proved that convergence is guaranteed when the relation $R_n \subseteq Q \times Q'$ is such that $(q, q') \in R_n$ iff $q$ and $q'$ recognize the same language of strings of length at most $n$ [28]. The parameter $n$ therefore tunes the length of the strings that determine the equivalence of the states merged by the widening. It is worth noting that the smaller $n$ is, the more information will be lost by widening.
In the following, given $A, A' \in \text{DFA}_{/\equiv}$ (without any constraints on the languages they recognize), we define the widening operator on $\text{DFA}_{/\equiv}$, parametric on $n \in \mathbb{N}$, as follows: $A \nabla_n A' = A \nabla_{R_n} (A \sqcup_{\text{DFA}} A')$.
In order to show how the defined widening operator works, let us discuss the following example.
Example 1. Consider the following µJS fragment:

    str = "";
    while (x < 100) {
      str = str + "a";
      x = x + 1;
    }

The value of the variable x is unknown and so is the number of iterations of the while-loop. In these cases, in order to guarantee soundness and termination, we apply the widening operator.
In Figure 5a we report the abstract value of the variable str at the beginning of the second iteration of the loop, while in Figure 5b we report the abstract value of the variable str at the end of the second iteration. Before starting a new iteration, in the example, we apply $\nabla_1$ between the two automata; specifically, we merge all the states having the same outgoing character. The minimization of the automaton so obtained is reported in Figure 5c. The next iteration will reach the fix-point, guaranteeing termination.
Figure 5. Widening of $\text{DFA}_{/\equiv}$.
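The state-merging idea behind the parametric widening can be illustrated with a simplified JavaScript sketch. It assumes that the automaton passed to widen already over-approximates both operands, and it merges the states that accept the same strings of length at most n; determinization and minimization of the quotient are omitted. The representation and the names langUpTo, widen and accepts are assumptions made for this example, and the input automaton is a made-up one for the language {ε, a, aa}, not the content of Figure 5.

```javascript
// Strings of length <= n accepted starting from state q.
function langUpTo(aut, q, n) {
  const out = new Set();
  if (aut.finals.has(q)) out.add("");
  if (n > 0)
    for (const [p, c, r] of aut.trans)
      if (p === q)
        for (const w of langUpTo(aut, r, n - 1)) out.add(c + w);
  return out;
}

// Quotient of the automaton w.r.t. "same language up to length n".
function widen(aut, n) {
  const key = new Map(
    aut.states.map((q) => [q, JSON.stringify([...langUpTo(aut, q, n)].sort())]));
  const trans = new Map();
  for (const [p, c, r] of aut.trans) {
    const t = [key.get(p), c, key.get(r)];
    trans.set(JSON.stringify(t), t);          // drop duplicate transitions
  }
  return {
    states: [...new Set(key.values())],
    initial: key.get(aut.initial),
    finals: new Set([...aut.finals].map((q) => key.get(q))),
    trans: [...trans.values()],
  };
}

// Acceptance for a possibly nondeterministic automaton (subset simulation).
function accepts(aut, word) {
  let cur = new Set([aut.initial]);
  for (const ch of word) {
    const next = new Set();
    for (const [p, c, r] of aut.trans)
      if (cur.has(p) && c === ch) next.add(r);
    cur = next;
  }
  return [...cur].some((q) => aut.finals.has(q));
}

// Automaton for { "", "a", "aa" }: 0 -a-> 1 -a-> 2, every state final.
const upTo2 = { states: [0, 1, 2], initial: 0, finals: new Set([0, 1, 2]),
                trans: [[0, "a", 1], [1, "a", 2]] };
const widened = widen(upTo2, 1);
console.log(widened.states.length);                              // 2
console.log(accepts(widened, "aaaaa"), accepts(widened, "ab"));  // true false
```

With n = 1, states 0 and 1 are merged, which introduces a loop on a; the widened automaton recognizes a*, so a further loop iteration adds no new strings.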

An Abstract Domain for µJS
By definition, string operations in our language also involve other primitive values, such as booleans or integers, hence we need an abstract domain able to observe any possible concrete value. This is additionally necessary for dealing with implicit type conversion as we will later observe.
We therefore have to design an abstract domain for string manipulation dealing with other primitive types, namely being able to combine different abstractions of various types. In particular, an abstract domain for string analysis equipped with dynamic typing must include all the possible primitive values, i.e., the whole $\mathbb{V} = \mathbb{Z} \cup \mathbb{B} \cup \mathbb{S} \cup \{\text{NaN}\}$. The idea is to consider an abstract domain for each type of primitive value and to combine them in a unique abstract domain for $\mathbb{V}$. Consider, for each type of primitive values $D$, an abstract domain $D^\sharp$ (we denote by $D^\sharp_{\bot}$ the domain $D^\sharp$ without bottom), equipped with an abstraction $\alpha_D : D \to D^\sharp$ and a concretization $\gamma_D : D^\sharp \to D$ forming a Galois insertion [9].

Coalesced Sum
One way to merge domains is the coalesced sum [30]. The resulting domain contains all the non-bottom elements of the input domains, with a new top and a new bottom.
Let $\langle A, \le_A \rangle$ and $\langle B, \le_B \rangle$ be two lattices abstracting the posets $\langle C, \le_C \rangle$ and $\langle D, \le_D \rangle$ with abstraction functions $\alpha_A : C \to A$ and $\alpha_B : D \to B$, respectively. The coalesced sum domain $A \oplus B$ is defined as $A \oplus B = (A \setminus \{\bot_A\}) \cup (B \setminus \{\bot_B\}) \cup \{\bot_{A \oplus B}, \top_{A \oplus B}\}$, such that the partial order is defined as $x \le_{A \oplus B} y \Leftrightarrow x \le_A y\ (x, y \in A) \vee x \le_B y\ (x, y \in B)$ and $\forall x \in A \oplus B .\ \bot_{A \oplus B} \le_{A \oplus B} x \le_{A \oplus B} \top_{A \oplus B}$. Its least upper bound $x \sqcup_{A \oplus B} y$ is $x \sqcup_A y$ if $x, y \in A$, $x \sqcup_B y$ if $x, y \in B$, and $\top_{A \oplus B}$ otherwise; its greatest lower bound $\sqcap_{A \oplus B}$ can be dually defined. The abstraction function $\alpha_{A \oplus B}$ is defined accordingly from $\alpha_A$ and $\alpha_B$. In our case, if we consider the abstract domains $\mathbb{Z}^\sharp$, $\mathbb{S}^\sharp$ and $\mathbb{B}^\sharp$, the coalesced sum is the abstraction of $\wp(\mathbb{V})$ depicted in Figure 6. This is the simplest choice, but unfortunately it is not suitable for dynamic languages, in particular for dealing with dynamic typing and implicit type conversion. The problem is that the type of variables is inferred at run-time and may change during execution. For example, consider the µJS fragment if (y < 5) {x = "42"; } else {x = true; }. The value of the variable y is statically unknown; hence, in order to guarantee soundness, we must take into account both branches, meaning that x may be either a string or a boolean value after the if statement. On the coalesced sum domain, the analysis would lose all precision w.r.t. the collecting semantics by returning $\alpha_{\mathbb{S}}(\text{"42"}) \sqcup \alpha_{\mathbb{B}}(\text{true}) = \top$.
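The loss of precision on the if example can be reproduced with a small JavaScript sketch of a join on a coalesced-sum-like domain. The tagged-object encoding, the finite sets standing in for automata and for abstract booleans, and the "anyInt" marker are all assumptions made for this illustration.

```javascript
// Elements of the sum: a fresh bottom, a fresh top, or a tagged component value.
const BOT = { type: "bot" };
const TOP = { type: "top" };

// Joins inside each component domain (simplified stand-ins).
const joinWithin = {
  int: (a, b) => (a === b ? a : "anyInt"),   // constant integers, or unknown
  bool: (a, b) => new Set([...a, ...b]),
  str: (a, b) => new Set([...a, ...b]),      // finite languages stand in for automata
};

// Least upper bound on the sum: values of different types collapse to top.
function joinSum(x, y) {
  if (x.type === "bot") return y;
  if (y.type === "bot") return x;
  if (x.type !== y.type || x.type === "top") return TOP;
  return { type: x.type, value: joinWithin[x.type](x.value, y.value) };
}

// if (y < 5) { x = "42"; } else { x = true; }
const afterThen = { type: "str", value: new Set(["42"]) };
const afterElse = { type: "bool", value: new Set([true]) };
console.log(joinSum(afterThen, afterElse).type); // top
```

Joining two values of the same type keeps the information, for instance two string values yield the union of their languages; only the mixed case degrades to top.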

Cartesian Product
In order to catch union types without losing too much precision, we need to complete [15,16,32] the above domain so that it can observe collections of values of different types. In order to define this combination, we rely on the Cartesian product, following [33]. The complete abstract domain w.r.t. dynamic typing and implicit type conversion is $\mathbb{Z}^\sharp \times \mathbb{B}^\sharp \times \mathbb{S}^\sharp \times \wp(\{\text{NaN}\})$, an abstraction of $\wp(\mathbb{V})$. In this combined abstract domain, the value of x after the execution of the if statement is precisely $(\bot, \alpha_{\mathbb{B}}(\text{true}), \alpha_{\mathbb{S}}(\text{"42"}), \bot)$, now an element of the domain, inferring that the value of x can be $\alpha_{\mathbb{B}}(\text{true})$ or $\alpha_{\mathbb{S}}(\text{"42"})$ but surely neither an abstract integer nor NaN.
In the following, we consider the abstract domain $\mathbb{V}^\sharp$ for string analysis obtained as the Cartesian product of the following abstractions: $\mathbb{Z}^\sharp$ = Const (the abstract domain of constant integers), the boolean abstraction $\mathbb{B}^\sharp$ = Bool, and $\mathbb{S}^\sharp = \text{DFA}_{/\equiv}$.
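A component-wise join on the product domain can be sketched as follows. The record layout, the use of null for the bottom of the integer component, and the finite sets standing in for automata and abstract booleans are assumptions made for this illustration.

```javascript
// Bottom of the product: every component is bottom / empty.
const bottom = () => ({ int: null, bool: new Set(), str: new Set(), nan: false });

// Join on constant integers: null is bottom, "top" is the unknown integer.
const joinConst = (a, b) =>
  a === null ? b : b === null ? a : a === b ? a : "top";
const union = (a, b) => new Set([...a, ...b]);

// Component-wise least upper bound.
function joinProduct(x, y) {
  return {
    int: joinConst(x.int, y.int),
    bool: union(x.bool, y.bool),
    str: union(x.str, y.str),       // finite languages stand in for automata
    nan: x.nan || y.nan,
  };
}

// if (y < 5) { x = "42"; } else { x = true; }
const afterThen = { ...bottom(), str: new Set(["42"]) };
const afterElse = { ...bottom(), bool: new Set([true]) };
const x = joinProduct(afterThen, afterElse);
// The integer and NaN components stay at bottom: x is a string or a boolean.
console.log(x.int, [...x.bool], [...x.str], x.nan);
```

Unlike the coalesced sum, the join keeps the string and the boolean information side by side, so a later string operation on x can still use the language {"42"}.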

Abstract Semantics of ECMAScript String Operations
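In this section, we define the abstract semantics of the language µJS over the abstract domain $\mathbb{V}^\sharp$. In particular, we have to define the expressions abstract semantics $(\!|\cdot|\!)^\sharp : \text{EXP} \times \text{STATES}^\sharp \to \mathbb{V}^\sharp$, abstracting the collecting semantics, which is standard except for the string operations, which will be explicitly provided by describing the algorithms for computing them. The string collecting semantics (fully reported in Appendix A) is defined by lifting to $\wp(\mathbb{V})$ the concrete one reported in Section 3. For example, abusing notation, the collecting semantics of substring on a language $L$ is $\text{SS}(L, i, j) = \{ \text{SS}(\sigma, i, j) \mid \sigma \in L \}$. Let us first recall some important notions on regular languages, useful for the algorithms we will provide.

Definition 2 (Suffixes and prefixes [12]). Let $L \in \wp(\Sigma^*)$ be a regular language. The suffixes of $L$ are $\text{SU}(L) = \{ y \in \Sigma^* \mid \exists x \in \Sigma^* .\ xy \in L \}$, and the prefixes of $L$ are the strings $\{ x \in \Sigma^* \mid \exists y \in \Sigma^* .\ xy \in L \}$. We can define the suffixes from a position, namely given $i \in \mathbb{N}$, the set of suffixes from $i$ is $\text{SU}(L, i) = \{ y \in \Sigma^* \mid \exists x \in \Sigma^i .\ xy \in L \}$.

Definition 3 (Right quotient [12]). Let $L_1, L_2 \in \wp(\Sigma^*)$ be regular languages. The right quotient of $L_1$ by $L_2$ is the language $\{ y \in \Sigma^* \mid \exists x \in L_2 .\ yx \in L_1 \}$.

Definition 4 (Substrings/Factors [34]). Let $L \in \wp(\Sigma^*)$ be a regular language. The set of its substrings/factors is $\text{FA}(L) = \{ y \in \Sigma^* \mid \exists x, z \in \Sigma^* .\ xyz \in L \}$.

These operations are all defined as transformations of regular languages. In [12], the corresponding algorithms on FA are provided. In particular, given $A, A_1 \in \text{DFA}_{/\equiv}$ and $i \in \mathbb{N}$, we apply these operations directly to automata, writing, e.g., SU($A$, $i$) and FA($A$).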
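On finite languages, the language transformations recalled above (suffixes, prefixes, suffixes from a position, right quotient and factors) can be computed directly on sets of strings. The following JavaScript sketch uses standard textbook definitions; the function names are hypothetical, and the automata-based algorithms of [12] are not reproduced.

```javascript
// All cut positions 0..|w| of a word.
const cuts = (w) => [...Array(w.length + 1).keys()];

// Suffixes and prefixes of every string in the language.
const suffixes = (L) =>
  new Set([...L].flatMap((w) => cuts(w).map((i) => w.slice(i))));
const prefixes = (L) =>
  new Set([...L].flatMap((w) => cuts(w).map((i) => w.slice(0, i))));

// Suffixes obtained by dropping exactly i leading symbols.
const suffixesFrom = (L, i) =>
  new Set([...L].filter((w) => w.length >= i).map((w) => w.slice(i)));

// Factors (substrings) are the prefixes of the suffixes.
const factors = (L) => prefixes(suffixes(L));

// Right quotient of L1 by L2: strip a suffix that belongs to L2.
function rightQuotient(L1, L2) {
  const out = new Set();
  for (const w of L1)
    for (const x of L2)
      if (w.endsWith(x)) out.add(w.slice(0, w.length - x.length));
  return out;
}

console.log([...factors(new Set(["ab"]))].sort());                   // "", "a", "ab", "b"
console.log([...rightQuotient(new Set(["hello"]), new Set(["lo"]))]); // "hel"
console.log([...suffixesFrom(new Set(["hello", "bc"]), 2)]);          // "llo", ""
```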

Abstract Semantics of Substring
In this section we define the abstract semantics of substring. In particular, we define the operator $\text{SS}^\sharp : \text{DFA}_{/\equiv} \times \text{Const} \times \text{Const} \to \text{DFA}_{/\equiv}$, which takes as input an automaton and two constant integer indexes $i, j \in$ Const, and computes the automaton recognizing the set of all substrings of the strings in the language of the input automaton between the two provided indexes. Since the abstract semantics has to take into account the swap performed when the initial index is greater than the final one, several cases arise when one of the two integer parameters is unknown, namely when it is equal to $\top_{\text{Const}}$. Indeed, the abstract semantics $\text{SS}^\sharp$ is divided into four cases, which are reported in Table 1. Consider $A \in \text{DFA}_{/\equiv}$ and $i, j \in$ Const (for the sake of readability, we denote by $\sqcup$ the automata lub $\sqcup_{\text{DFA}}$, and by $\sqcap$ the glb $\sqcap_{\text{DFA}}$). As in the concrete semantics of substring, negative integer values are treated as zero.

1. If $i, j \in \mathbb{Z}$ (second row, second column of Table 1), we have to compute the language of all the substrings between the initial index $i$ and the final index $j$, i.e., SS($\mathcal{L}(A)$, $i$, $j$). For example, let $L = \{a\}^* \cup \{hello, bc\}$; the set of its substrings from 1 to 3 is SS($L$, 1, 3) $= \{\varepsilon, a, aa, el, c\}$. When $i \le j$, as in the example, the automaton accepting this language is computed by the operator reported in Table 1. If $i > j$, the integer arguments are simply swapped, as in Table 1.

2. When both integer parameters correspond to $\top_{\text{Const}}$, the result is the automaton of all possible factors of $A$ (third row, third column), i.e., FA($A$).

3. When $i$ is defined and $j = \top_{\text{Const}}$ (second row, third column), we have to compute the automaton recognizing all the substrings of $\mathcal{L}(A)$ ending at $i$ and all the substrings starting from $i$. For example, let us consider $\text{SS}^\sharp$(Min({helloworld}), 5, $\top_{\text{Const}}$). Due to the semantics of substring reported in Section 3, we need to compute the substrings from $a \in [0, 5]$ to 5 and then any substring with initial index equal to 5. The automaton recognizing any substring starting at a specific index $l$ is obtained by taking the prefixes of the suffixes of $A$ from $l$. The abstract semantics returns the least upper bound of all the automata of the substrings from $a \in [0, i]$ to $i$ and of the automaton recognizing any substring with initial index equal to $i$.

4. Similarly to the previous case, when $j$ is defined and $i = \top_{\text{Const}}$ (third row, second column), we have to compute the automaton recognizing all the substrings of $\mathcal{L}(A)$ ending at $j$ and all the substrings starting from $j$. Let us consider $\text{SS}^\sharp$(Min({helloworld}), $\top_{\text{Const}}$, 5). As in the previous case, we compute the substrings from $a \in [0, 5]$ to 5 and then any substring with initial index equal to 5. The abstract semantics therefore returns the least upper bound of all the automata of the substrings from $a \in [0, j]$ to $j$ and of the automaton recognizing any substring with initial index equal to $j$.

In Figure 7 we report an example obtained by applying the rules in the table.
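The four cases above can be checked on finite languages with a brute-force JavaScript reference that enumerates every concrete value an unknown index may take. This is not the automata construction of Table 1: it only computes the collecting semantics that the abstract operator is meant to match, and the names TOP, indexes and absSubstring are assumptions for this example. Enumerating indexes from 0 to the string length is enough, because larger indexes behave like the length and negative ones like zero.

```javascript
const TOP = Symbol("unknown integer");

// Concrete values to try for an index: the constant itself, or every
// position 0..|w| when the index is unknown.
const indexes = (w, k) => (k === TOP ? [...Array(w.length + 1).keys()] : [k]);

// Collecting semantics of substring on a finite language.
function absSubstring(L, i, j) {
  const out = new Set();
  for (const w of L)
    for (const a of indexes(w, i))
      for (const b of indexes(w, j))
        out.add(w.substring(a, b));   // built-in substring swaps and clamps
  return out;
}

// Case 1 on a finite fragment of {a}* ∪ {hello, bc}.
const L = new Set(["", "a", "aa", "aaa", "hello", "bc"]);
console.log([...absSubstring(L, 1, 3)].sort());   // "", "a", "aa", "c", "el"

// Case 3: known start, unknown end.
const hw = new Set(["helloworld"]);
console.log([...absSubstring(hw, 5, TOP)]);
// "hello", "ello", ..., "o", "" (ending at 5) and "w", "wo", ..., "world" (starting at 5)
```

With both indexes unknown the function returns all factors of the language, matching the second case.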

Theorem 2. $\text{SS}^\sharp$ is sound and complete. From here on, when we say completeness we mean forward completeness. As highlighted in Section 2, this is the only form of completeness we can ensure in the absence of a Galois connection. In particular, saying that an abstract operation (e.g., $\text{SS}^\sharp$) is forward complete for a concrete operation (e.g., SS) means that the computation on the abstract domain (e.g., $\text{DFA}_{/\equiv}$) does not lose information as a consequence of computing only on abstract elements.

Abstract Semantics of charAt
The abstract semantics of charAt should return an automaton accepting the language of the characters at position $i$ in the strings accepted by the given automaton. Since charAt is a particular case of substring, its abstract semantics, determined by $\text{CA}^\sharp : \text{DFA}_{/\equiv} \times \text{Const} \to \text{DFA}_{/\equiv}$, relies on the abstract semantics of substring previously defined. In particular, we call $\text{SS}^\sharp$ (defined before) when the index $i$ corresponds to a determinate integer value; otherwise we use the function chars : $\text{DFA}_{/\equiv} \to \wp(\Sigma)$, returning the set of characters read in any transition of an automaton, together with Min($\{\varepsilon\}$).
Theorem 3. $\text{CA}^\sharp$ is sound and complete.
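The two cases of the abstract charAt can be mimicked on finite languages as in the sketch below. Finite sets of strings stand in for automata, the characters occurring in the strings play the role of chars, and the empty string accounts for out-of-range indexes; the names TOP and absCharAt are assumptions for this example.

```javascript
const TOP = Symbol("unknown integer");

// charAt on a finite language with a constant or unknown index.
function absCharAt(L, i) {
  const out = new Set();
  for (const w of L) {
    if (i === TOP) {
      for (const c of w) out.add(c);  // every character that may be read
      out.add("");                    // the index may fall outside the string
    } else {
      out.add(w.charAt(Math.max(0, i)));  // negative indexes are treated as zero
    }
  }
  return out;
}

console.log([...absCharAt(new Set(["hello", "bc"]), 1)]);  // "e", "c"
console.log([...absCharAt(new Set(["ab"]), TOP)]);          // "a", "b", ""
```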

Abstract Semantics of length
The abstract semantics of length should return a value of the integer domain Const that soundly approximates the length of all the possible strings of an automaton. It is defined by the function $\text{LE}^\sharp : \text{DFA}_{/\equiv} \to \text{Const}$, computed by Algorithm 1, where Paths : $\text{DFA}_{/\equiv} \to \wp(\wp(Q))$ returns the set of the paths from the initial state to any final state of $A$ [35]. Given a path $p \in$ Paths($A$), we denote by $|p|$ the length of $p$. If the input automaton has cycles, $\text{LE}^\sharp$ returns $\top_{\text{Const}}$; otherwise it checks that all the paths of the automaton $A$ have the same length (lines 5-8). Whenever the algorithm finds two paths in the automaton that have different lengths, $\top_{\text{Const}}$ is returned (lines 8-10). Due to the constant integers domain, the abstract semantics of length can give a precise answer only when all the strings of the automaton have exactly the same length. More accurate results can be obtained by using more precise integer abstract domains, e.g., intervals, as we will discuss in Section 6. For example, consider the automata $A$ and $A'$ in Figure 8a
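The behavior described above (unknown length for cyclic automata or for paths of different lengths, a constant otherwise) can be sketched with a depth-first traversal. This is a simplified stand-in for Algorithm 1, which is not reproduced here: the transition-list representation and the names TOP and absLength are assumptions, and the automaton is assumed to be trimmed (every state useful) and to recognize a non-empty language.

```javascript
const TOP = "⊤";   // stands for the unknown integer of the Const domain

// Returns the common length of all accepted strings, or TOP.
function absLength(aut) {
  const succ = (q) => aut.trans.filter(([p]) => p === q).map(([, , r]) => r);
  const lengths = new Set();   // lengths of the paths reaching a final state
  const onPath = new Set();    // states on the current DFS path
  let cyclic = false;
  (function visit(q, depth) {
    if (onPath.has(q)) { cyclic = true; return; }  // back edge: a cycle
    if (aut.finals.has(q)) lengths.add(depth);
    onPath.add(q);
    for (const r of succ(q)) visit(r, depth + 1);
    onPath.delete(q);
  })(aut.initial, 0);
  if (cyclic || lengths.size !== 1) return TOP;
  return [...lengths][0];
}

// { "ab", "cd" }: both strings have length 2.
const sameLen = { initial: 0, finals: new Set([3]),
                  trans: [[0, "a", 1], [1, "b", 3], [0, "c", 2], [2, "d", 3]] };
// { "a", "ab" }: two different lengths.
const mixedLen = { initial: 0, finals: new Set([1, 2]),
                   trans: [[0, "a", 1], [1, "b", 2]] };
// a*: a cyclic automaton.
const star = { initial: 0, finals: new Set([0]), trans: [[0, "a", 0]] };

console.log(absLength(sameLen), absLength(mixedLen), absLength(star)); // 2 ⊤ ⊤
```

Replacing the set of lengths with its minimum and maximum would give the interval-based refinement mentioned in the text.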

Abstract Semantics of Concat
The abstract semantics of string concatenation is $\text{CC}^\sharp : \text{DFA}_{/\equiv} \times \text{DFA}_{/\equiv} \to \text{DFA}_{/\equiv}$ and returns the concatenation of the input automata. Since regular languages are closed under concatenation, so are finite state automata. Hence, $\text{CC}^\sharp$ exactly implements the standard concatenation operation between automata. Given the closure property on automata, the following result holds: $\text{CC}^\sharp$ is sound and complete. As already mentioned, completeness holds thanks to the closure properties of regular languages (and, in turn, of finite state automata).

Abstract Semantics of StartsWith
The abstract semantics of startsWith takes as input two automata and checks whether a string of the language of the first automaton starts with a string of the language of the second one. It is captured by the function SW : DFA /≡ × DFA /≡ → B , computed by Algorithm 2, where maxString : DFA /≡ → DFA /≡ returns the (minimal) automaton recognizing the longest string of the automaton given as input, and isSinglePath : DFA /≡ → {true, false} checks whether the input automaton A = (Q, Σ, δ, q_0, F) satisfies the following condition: δ = ⋃_{i ∈ [0,|Q|]} {(q_i, q_{i+1}, c)}. Informally, a single-path automaton is an automaton where, if we sort the strings of its language from the shortest to the longest, each string is a prefix of the next one. An example of a single-path automaton is reported in Figure 9b, where it is graphically clear that each state, excluding the initial and the last one, has one incoming and one outgoing transition. Since the longest string of a single-path automaton has all the other strings of the language as prefixes, to decide whether the strings of an automaton A start with the strings of a single-path automaton it suffices to check the longest one. For example, let L (A) = {softer} and L (A′) = {s, so, soft}. The string s is a prefix of so, which is in turn a prefix of soft, so A′ is a single-path automaton. Therefore, in this case, it suffices to check whether softer starts with soft (the longest string of L (A′)) since, A′ being single-path, the other strings (s and so) are consequently prefixes of softer. Instead, consider L (A′) = {s, no}. It would be impossible for a string to start with both of them, since there is no prefix relation between them.
Algorithm 2 takes as input two automata denoted by A and A′. Lines 1-9 handle some corner cases. If L (A′) = {ε}, {true} is returned, since any string starts with ε (lines 1-3). If none of the prefixes of A is recognized by A′, meaning that none of the strings recognized by A starts with a string of A′, we can safely return {false} (lines 4-6). Finally, if at least one of the input automata has cycles, we return {true, false} (lines 7-9). Lines 10-17 determine whether every string of A′ is the beginning of every string of A; otherwise ⊤Bool (i.e., {true, false}) is returned. In order to explain our approach in lines 10-17, consider the automata A and A′ reported in Figure 9. To be sure that every string recognized by A′ is the beginning of every string recognized by A we need to check two conditions: (1) every string recognized by A′ is a prefix of its longest recognized string σ′ and (2) each string in A starts with σ′ (all strings must have a common prefix). Only if both conditions hold can we safely return {true}; otherwise we return ⊤Bool. In particular, (1) is checked by the function isSinglePath at line 10 and (2) is checked at lines 11-15. It is worth noting that if an automaton is single-path, then its longest string is unique (line 11).
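The case analysis of Algorithm 2 can be sketched in Python on explicit finite languages. This is a simplification under stated assumptions: both languages are given as finite, non-empty Python sets of strings (so the cyclic case never arises), the function name starts_with is hypothetical, and the result is a set of booleans standing for an abstract boolean.

```python
def starts_with(la, lb):
    """Abstract startsWith on finite languages.

    la : finite non-empty set of strings, standing for L(A)
    lb : finite non-empty set of strings, standing for L(A')
    Returns {True}, {False} or {True, False}.
    """
    if lb == {""}:
        return {True}  # every string starts with the empty string
    if not any(s.startswith(p) for s in la for p in lb):
        return {False}  # no string of A starts with a string of A'
    # Single-path check: sorted by length, each string is a prefix of the next.
    ordered = sorted(lb, key=len)
    single_path = all(b.startswith(a) for a, b in zip(ordered, ordered[1:]))
    longest = ordered[-1]
    if single_path and all(s.startswith(longest) for s in la):
        return {True}  # every string of A starts with every string of A'
    return {True, False}


print(starts_with({"softer"}, {"s", "so", "soft"}))
print(starts_with({"panda", "koala"}, {"p", "pan"}))
```

The first call reproduces the softer example from the text. The second uses a hypothetical language {panda, koala} and yields the imprecise answer, because only one of its strings starts with pan.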
In our example, both the strings p and pan in L (A′) are prefixes of pan, which is the longest string recognized by A′, so we build B, the (minimal) automaton that recognizes pan, and C, with L (C) = {pan, koa}, and compare them (line 13). We return {true} if B and C recognize the same language; otherwise we return ⊤Bool. In the other cases, as already mentioned, we return {true, false}. For example, in Figure 9, {true, false} is returned because, although A′ is a single-path automaton, only the string panda ∈ L (A) begins with pan, namely the longest string of L (A′).

Abstract Semantics of ToLowerCase
The abstract semantics of toLowerCase is defined by the function LC : DFA /≡ → DFA /≡ , which returns an automaton that recognizes the same strings as the input automaton, where any upper-case symbol is replaced with the corresponding lower-case symbol. LC is computed by Algorithm 3.

Algorithm 3: LC
Starting from an input automaton A, the idea is to return the automaton A′, a copy of A in which any upper-case symbol read by a transition is replaced by its corresponding lower-case symbol. Transitions that already read lower-case or special symbols are left unaltered. An example is reported in Figure 10.
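The relabeling idea can be sketched in a few lines of Python. The encoding (a dict from (state, symbol) to state) and the name to_lower are assumptions of this sketch. Note that relabeling may merge an upper-case and a lower-case transition leaving the same state, so the sketch maps each (state, symbol) pair to a set of targets; a real implementation would then determinize and minimize the result.

```python
def to_lower(delta, start, finals):
    """Copy the automaton, lowering every symbol read by a transition.

    delta : dict (state, symbol) -> state
    Returns (lowered, start, finals) where lowered maps
    (state, symbol) -> set of target states (possibly nondeterministic).
    """
    lowered = {}
    for (state, sym), target in delta.items():
        # str.lower leaves lower-case and special symbols unchanged.
        lowered.setdefault((state, sym.lower()), set()).add(target)
    return lowered, start, finals


# Hypothetical automaton reading 'A' and 'a' from the initial state.
print(to_lower({(0, "A"): 1, (0, "a"): 2, (1, "B"): 3}, 0, {2, 3})[0])
```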

Abstract Semantics of Includes
The abstract semantics of includes is defined by the function IN : DFA /≡ × DFA /≡ → B . It takes as input two automata A and A′ and checks whether a string recognized by A′ is a substring of a string recognized by A. The function IN is computed by Algorithm 4, where, given a path p of an automaton A, we abuse notation by denoting with Min(p) the automaton that recognizes the string encoded by the path p (lines 11-12). The algorithm first checks some corner cases: if A′ only recognizes the empty string, {true} is returned, since the empty string is a substring of any string (lines 2-4); if none of the substrings of A is contained in A′, {false} is returned (lines 5-7); and if one of the input automata is cyclic, it returns ⊤Bool (lines 8-10). When these corner cases are excluded, we check each string recognized by A. If the algorithm finds at least one string σ′ in L (A′) that is not a substring of a string σ of A, ⊤Bool is returned; otherwise {true} is returned. This is done in lines 10-14 where, for each path p of A, we create Min(p) and check whether its factorization with A′ equals A′, i.e., whether it contains every string of A′. For example, consider the automata A and A′ reported in Figure 11. The algorithm returns ⊤Bool since the string fg ∈ L (A′) is not a substring of abc ∈ L (A). Another example is reported in Figure 12.
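As with startsWith, the decision structure of Algorithm 4 can be sketched in Python on explicit finite languages. The sketch assumes both languages are finite, non-empty sets of strings (the cyclic case is therefore omitted), and the name includes is hypothetical.

```python
def includes(la, lb):
    """Abstract includes on finite languages.

    la : finite non-empty set of strings, standing for L(A)
    lb : finite non-empty set of strings, standing for L(A')
    Returns {True}, {False} or {True, False}.
    """
    if lb == {""}:
        return {True}  # the empty string is a substring of any string
    if not any(p in s for s in la for p in lb):
        return {False}  # no string of A' occurs in any string of A
    if all(p in s for s in la for p in lb):
        return {True}  # every string of A contains every string of A'
    return {True, False}


print(includes({"abc", "xfgx"}, {"fg"}))
print(includes({"abc", "bcd"}, {"bc", "c"}))
```

The first call mirrors the situation described in the text, where fg is not a substring of abc and the imprecise answer is returned; the strings other than abc and fg are hypothetical.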

Abstract Semantics of Repeat
The abstract semantics of repeat is defined by the function RT : DFA /≡ × Const → DFA /≡ that, given as input an automaton A and a constant integer value i, returns an automaton that recognizes any string of L (A) repeated i times. RT is computed by Algorithm 5, and we suppose that the abstract integer value i is positive or zero; any negative value is treated as zero. The algorithm first checks some corner cases. If i = 0 or the input automaton only recognizes the empty string, then Min({ε}) is returned (lines 1-3). If the automaton has a cycle or i = ⊤Const, it returns the Kleene closure of the input automaton (lines 4-6). If none of these corner cases is detected then, for each string in L (A), we concatenate it with itself (i − 1) times using the already defined CC. The result is the union of all the concatenated automata. Let us consider the automaton A reported in Figure 13a and suppose we call RT (A, 2). The resulting automaton, obtained by applying Algorithm 5, is reported in Figure 13b. Now suppose we call RT (A, ⊤Const). In this case, since the input integer value is not definite, Algorithm 5 returns the Kleene-star automaton of A, and the result is reported in Figure 13c. As a counterexample to completeness, consider the automaton A s.t. L (A) = { ab^n | n ∈ N }. The completeness condition is not met; indeed, RT(L (A), 2) = { ab^n ab^n | n ∈ N } ≠ L (RT (A, 2)) = { (ab^n)^m | n, m ∈ N }, since when the input automaton is cyclic, Algorithm 5 returns the Kleene closure of the input automaton.
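A small Python sketch can make the three cases of repeat concrete. It is only an illustration under assumptions: the input language is a finite set of strings (so the cyclic-automaton case cannot occur), None stands for an unknown integer, the result is a regular expression for Python's re module rather than an automaton, and the names repeat_pattern and matches are hypothetical.

```python
import re


def repeat_pattern(strings, i):
    """Regex for the abstract result of repeat on a finite language.

    strings : finite non-empty set of strings, standing for L(A)
    i       : int, or None for an unknown integer (top)
    """
    if (i is not None and i <= 0) or strings == {""}:
        return ""  # only the empty string
    if i is None:
        # Unknown count: Kleene closure of the input language.
        alternatives = "|".join(re.escape(s) for s in sorted(strings))
        return f"(?:{alternatives})*"
    # Known count: each string is concatenated with itself, then united.
    return "(?:" + "|".join(re.escape(s * i) for s in sorted(strings)) + ")"


def matches(pattern, word):
    """True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None


print(matches(repeat_pattern({"ab", "c"}, 2), "abab"))
print(matches(repeat_pattern({"ab", "c"}, 2), "abc"))
print(matches(repeat_pattern({"ab", "c"}, None), "abcab"))
```

Note that with a known count each string is repeated with itself only, so abc is rejected for i = 2, whereas the Kleene closure used for an unknown count also accepts mixed concatenations such as abcab.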

Abstract Semantics of TrimLeft, TrimRight and Trim
In this section, we show the abstract semantics of the trimLeft, trimRight and trim operations. The abstract semantics of trimLeft is defined by the function TL : DFA /≡ → DFA /≡ . It takes as input an automaton A and returns an automaton accepting the strings of A with any leading consecutive white-spaces removed. In the following, we denote a white-space by ␣. The function is computed by Algorithm 6. The idea of the algorithm is to iteratively replace white-space transitions from the initial state with ε-transitions (lines 5-7), while leaving the other transitions unaltered (lines 7-9). At each iteration, the resulting automaton is minimized, and hence determinized (line 11). This operation is repeated until the initial state has no white-space transitions, i.e., until white-space is no longer a prefix of the automaton (line 3). An example of application of our algorithm is depicted in Figure 14.
Proof. The proof for TR follows from the completeness of TL and of the reverse operation, while the proof for TM follows from the completeness of TL and TR .
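The effect of trimLeft on an automaton can be sketched in Python with a one-pass variant. This is not Algorithm 6, which iterates ε-replacement and minimization; instead, the sketch collects the states reachable from the initial state through white-spaces only and adds a fresh initial state that copies their non-white-space transitions, which recognizes the same trimmed language. The encoding, the fresh state name "init" (assumed unused) and the function names are assumptions of this sketch.

```python
WS = " "  # the white-space symbol


def trim_left(delta, start, finals):
    """Build an NFA for the left-trimmed language of a DFA.

    delta : dict (state, symbol) -> state
    Returns (nfa, init, new_finals), nfa mapping (state, symbol) -> set.
    """
    # States reachable from the start state through white-spaces only.
    reach, stack = {start}, [start]
    while stack:
        target = delta.get((stack.pop(), WS))
        if target is not None and target not in reach:
            reach.add(target)
            stack.append(target)
    init = "init"  # fresh initial state, assumed not to clash with others
    nfa = {(s, c): {t} for (s, c), t in delta.items()}
    for (s, c), t in delta.items():
        if s in reach and c != WS:
            nfa.setdefault((init, c), set()).add(t)
    new_finals = set(finals) | ({init} if reach & set(finals) else set())
    return nfa, init, new_finals


def accepts(nfa, start, finals, word):
    """Standard NFA membership test."""
    current = {start}
    for c in word:
        current = set().union(*(nfa.get((s, c), set()) for s in current))
    return bool(current & finals)


# Hypothetical automaton recognizing {"  a", "b c"}.
DELTA = {(0, " "): 1, (1, " "): 2, (2, "a"): 3, (0, "b"): 4, (4, " "): 5, (5, "c"): 3}
TRIMMED = trim_left(DELTA, 0, {3})
print(accepts(*TRIMMED, "a"), accepts(*TRIMMED, "b c"), accepts(*TRIMMED, "  a"))
```

Only leading white-spaces are dropped: the inner white-space of b c is preserved because the original transitions are kept for every state other than the fresh initial one.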

Concerning Abstract Implicit Type Conversion
In this section, we discuss the abstraction of implicit type conversion functions. Here we focus only on the conversion of automata into other values, since conversions concerning booleans, not-a-number and integers are standard. Let toBool : V → B be applied to A ∈ DFA /≡ : If . Regarding abstract integers, if i ∈ Z, then the automaton recognizing the string S(i) is returned (we recall that the function S(i) maps an integer i to its numeric string representation); otherwise, i.e., when i = ⊤Const, the automaton recognizing any possible integer, reported in Figure 15, is returned. Finally, toInt : V → Const ∪ {NaN} handles conversion to constant integers. Given an automaton A, if A ⊓ Min(Σ_Z) = Min(∅), the automaton is precisely converted to NaN, since A does not recognize any numeric string. Otherwise, if A ⊑DFA Min(Σ_Z), then L (A) contains only numeric strings. In particular, if A recognizes only one numeric string, the corresponding integer is returned; otherwise ⊤Const is returned.

µFASA Implementation
In this section we present µJS Finite-state Automata String Analyzer (µFASA), the string static analyzer integrating the finite state automata abstract domain, and the corresponding abstract semantics, presented in the previous sections.

Theoretical Concerns
It is worth noting that, as reported in Theorem 1, ℘(Σ * ) (string concrete domain) and DFA /≡ (abstract string domain) do not form a Galois connection; however, this is not a concern. We have shown, for the core language we adopted, that the abstract semantics we have defined for string operations guarantees soundness. Hence, if the abstract interpreter starts from regular initial conditions (i.e., constraints expressible as finite state automata), it will always compute regular invariants.
When implementing, an important issue is computational complexity. The abstract semantics reported in this paper often relies on the minimization of finite state automata, in order to keep the automata that arise during abstract computations deterministic and minimal. Our library relies on Brzozowski's algorithm, which has exponential complexity in the worst case, but in practice this is not a problem: the algorithm is extremely fast on average and consistently outperforms other minimization algorithms (e.g., Hopcroft's algorithm, having average-case complexity O(n log n), where n is the number of states), as reported in [36]. Moreover, minimization is only applied when the input automaton is nondeterministic.
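For reference, Brzozowski's minimization (reverse, determinize, reverse, determinize) can be sketched in Python as follows. This is a textbook-style illustration, not the code of the library mentioned above; the encoding of automata as (delta, starts, finals), with delta mapping (state, symbol) to a set of states, and all function names are assumptions of this sketch.

```python
def reverse(delta, starts, finals):
    """Flip every transition and swap initial and final states."""
    rev = {}
    for (s, c), targets in delta.items():
        for t in targets:
            rev.setdefault((t, c), set()).add(s)
    return rev, set(finals), set(starts)


def determinize(delta, starts, finals):
    """Subset construction restricted to reachable, non-empty subsets."""
    alphabet = {c for (_, c) in delta}
    init = frozenset(starts)
    dfa, seen, todo = {}, {init}, [init]
    while todo:
        subset = todo.pop()
        for c in alphabet:
            nxt = frozenset(t for s in subset for t in delta.get((s, c), ()))
            if not nxt:
                continue  # no sink state is materialized
            dfa[(subset, c)] = {nxt}
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return dfa, {init}, {q for q in seen if q & finals}


def brzozowski(delta, starts, finals):
    """Minimal DFA obtained by reversing and determinizing twice."""
    return determinize(*reverse(*determinize(*reverse(delta, starts, finals))))


def state_count(delta, starts, finals):
    """Number of states that appear in the automaton."""
    return len(set(starts) | {s for (s, _) in delta}
               | {t for ts in delta.values() for t in ts})


def accepts(delta, starts, finals, word):
    """Membership test that works for both NFAs and DFAs in this encoding."""
    current = set(starts)
    for c in word:
        current = {t for s in current for t in delta.get((s, c), ())}
    return bool(current & finals)


# Hypothetical non-minimal DFA for {"ab", "cb"} with four states.
NFA = ({(0, "a"): {1}, (0, "c"): {2}, (1, "b"): {3}, (2, "b"): {3}}, {0}, {3})
MINIMAL = brzozowski(*NFA)
print(state_count(*NFA), state_count(*MINIMAL))
```

The exponential worst case mentioned above comes from the two subset constructions; on this small example the two equivalent middle states are simply merged.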

Implementation
µFASA is a string static analyzer for extended µJS inter-procedural programs and it is built upon the finite state automata abstract domain described above (available at www.github.com/SPY-Lab/mufasa). It analyzes string variables and is also able to express associative arrays. The finite state automata abstract domain has been implemented as an external library (available at www.github.com/SPY-Lab/java-fsm-library), offering a suitable and easy way to plug the domain into existing static analyzers, such as [3][4][5]37]. The library includes the implementation of all the algorithms concerning the finite state automata domain and provides well-known operations on automata such as suffix, right quotient, abstract domain-related operations, such as DFA , DFA , and the parametric widening for tuning precision and forcing convergence.
In addition to the string operations (and the corresponding automata-based abstract semantics) introduced in this paper, µFASA also analyzes functions that can be defined as compositions of the ones presented here (e.g., endsWith w.r.t. startsWith, slice w.r.t. substring). The full list of implemented string operations is reported in Table 2, which also summarizes for which operations soundness and completeness hold and the average complexity of their algorithms (w.r.t. the constant integer abstract domain).
Table 2 (excerpt): indexOf: O(n(n log n)(n²m)); slice: O((n + m)(n log n)).

Extension to Interval Abstract Domain
For the presentation of this paper, in Section 4.1, we have chosen to abstract integer values to the constant integer abstract domain. Of course this affects the abstract semantics of those string operations that involve them, namely substring, charAt, repeat and length, while the other methods only use strings or booleans. Nevertheless, µFASA abstracts integer values to the more precise interval abstract domain [9], i.e., to the set Intervals.
The choice of presenting the automata-based abstract semantics with constant integers, rather than intervals, was motivated by the wish not to burden the definition of the abstract semantics of the string operations involving integers. Let us consider the interval-based substring abstract semantics. Since intervals can be unbounded (e.g., [5, +∞]), more than 20 different cases have been identified in its abstract semantics [8]. Given substr (A, [a, b], [c, d]), for some A ∈ DFA /≡ , these cases include b = +∞ with d a definite value, b definite with d = +∞, and b, d = +∞ with a, c definite values and a ≤ c, to cite only a few. Moreover, the interval-based abstract semantics does not add any further important technical detail to our contribution, since the cases cited above, which arise in an interval-based analysis, are handled in an ad hoc manner and would have made this paper harder to follow. In particular, since the constant integer abstract domain is strictly contained in the interval domain, restricting the presentation to constant integers allowed us to report only the technically meaningful cases, avoiding the others (related to intervals), which are handled in specific ways and are relevant for the implementation.
Nevertheless, µFASA implements intervals (which include constant integers) and, accordingly, the abstract semantics based on them of substring, charAt, length and repeat, as reported in [8].
The abstract semantics of the other string operations remain unaffected by the change. Just as an example, in the following we report the abstract semantics of length on intervals.
length Abstract Semantics with Intervals
The constant integer domain leads to a big loss of precision in the abstract semantics of length, reported in Section 5.3. The idea behind the algorithm capturing its abstract semantics is to check whether all strings recognized by the input automaton have the same length l ∈ N. If so, l is returned as result; otherwise ⊤Const is returned. Clearly, this is a forced choice, due to the fact that the constant integer abstract domain can only track a single integer value. In this sense, the precision of the abstract semantics of length can be improved when we deal with intervals rather than constant integers. Algorithm 7 reports the abstract semantics of length using the interval abstract domain. We compute the minimum and the maximum path reaching each final state of the automaton and then abstract the set of lengths obtained into an interval. Problems arise when the automaton contains cycles. In that case, we return the unbounded interval from the length of the minimum path to a final state up to +∞.
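The interval-based idea can be sketched in Python as follows. This is an illustration rather than Algorithm 7: it explores the automaton layer by layer instead of computing paths per final state, the encoding (a dict from (state, symbol) to state) and the name length_interval are assumptions, the automaton is assumed to be trimmed, and an empty language is reported as None.

```python
import math


def length_interval(delta, start, finals):
    """Interval [lo, hi] containing the lengths of all accepted strings.

    delta : dict (state, symbol) -> state, assumed trimmed
    Returns (lo, hi), with hi = math.inf for cyclic automata,
    or None if no string is accepted.
    """
    states = {start} | {s for (s, _) in delta} | set(delta.values())
    lengths = set()
    frontier = {start}  # states reachable in exactly k steps
    for k in range(len(states)):
        if frontier & finals:
            lengths.add(k)
        frontier = {t for (s, _), t in delta.items() if s in frontier}
    if not lengths:
        return None  # empty language
    if frontier:
        # A cycle is reachable: lengths are unbounded from the minimum on.
        return (min(lengths), math.inf)
    return (min(lengths), max(lengths))


# Hypothetical automata: {"ab", "cde"} and a*b.
DIFF = {(0, "a"): 1, (1, "b"): 2, (0, "c"): 3, (3, "d"): 4, (4, "e"): 2}
STAR = {(0, "a"): 0, (0, "b"): 1}
print(length_interval(DIFF, 0, {2}), length_interval(STAR, 0, {1}))
```

Compared with the constant domain, the language {ab, cde} is now abstracted by [2, 3] rather than by the top element, and a cyclic automaton still yields a useful lower bound.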
Clearly, using the interval abstract domain produces more precise results for certain operations, but it complicates the abstract semantics of others.

Qualitative Evaluation of µFASA
In this section, we evaluate the precision of µFASA and, in turn, of the finite state automata abstract domain. In particular, we comment on and discuss two string manipulation programs. The first is the one already introduced in Section 1, namely an obfuscated malware manipulating strings and transforming them into code by using eval, while the second is a benign function taken from a real-world string manipulation program. In both cases, we show that important string information can be obtained by µFASA.
Consider the fragment reported in Figure 1 in the introduction. By analyzing it with µFASA, we obtain that the abstract value of d, at the eval call, is the automaton A_d in Figure 16. The cycles are caused by the application of widening in the while computations. From this automaton we can retrieve some important and non-trivial information. For example, we can answer the following question: may A_d contain a string corresponding to an assignment of an ActiveXObject? We can answer by checking the predicate A_d ⊓ Min(Id · {new ActiveXObject(} · Σ * · {)}) ≠ Min(∅), i.e., by checking whether A_d recognizes strings that are concatenations of any identifier with the string new ActiveXObject(, followed by any possible string. In the example, the predicate returns true. Another interesting question is: may A_d contain the string eval? We can also answer this by checking whether A_d ⊓ Min({eval}) ≠ Min(∅). In this case the check returns false, which supports the conclusion that no explicit call to eval can occur.
This analysis may lose precision during fix-point computations, causing the cycles in the automaton in Figure 16, due to the widening application. Nevertheless, it is worth noting that this result is obtained without any precision improvement on fix-point computations, such as loop unrolling, narrowing or widening with thresholds, that can surely be implemented in the future development of µFASA.

String Manipulation Program
In order to evaluate the precision of µFASA, we decided to create a benchmark of tests taken from real-world programs. We therefore selected string manipulating functions from popular modern frameworks (such as Mozilla useful methods, RXJava, Mockito) whose code can be easily found on GitHub. (The selected string manipulation functions are available at www.github.com/SPY-Lab/mufasa/src/test/resources, where their original sources are also indicated.) Among this set of methods, we focus our attention on the precision of the function fixStations reported in Figure 17, taken from [38]. The function takes as input an object stations containing information about train stations (each item contains the three-letter station code, followed by some machine-readable data, followed by a semicolon, followed by the human-readable station name) and extracts the station code (in capital letters) and the station name. For instance, given the input stations = {st1: "MANay781;Manchester", st2: "gNfbx420;Greenfield"}, the function returns the object {st1: "MAN: Manchester", st2: "GNF: Greenfield"}.
Thus, given an object containing strings that follow the station information pattern described above, the function fixStations returns another object containing strings that follow the pattern of three capital letters, followed by a colon, followed by a string. The goal of our analyzer is to preserve exactly this information on the variable result. Let us consider a statically unknown value of stations, namely stations = {st1:σ_1, . . . , stn:σ_n}, where n ∈ N and σ_i follows the station information pattern for each i ∈ [1, n]. While other static analyzers, such as TAJS, which has a finite-height string abstract domain, lose any information about the returned string, µFASA is able to infer, for the variable result, the object {st1:p_1, . . . , stn:p_n}, where each p_i is a string abstract value, namely a finite state automaton, following the desired pattern. We are therefore able to preserve the string pattern that the function returns. As we have already highlighted, the result is obtained without implementing ad hoc improvements regarding loop computations, and we believe that even more precise results can be obtained by integrating such techniques. We believe the integration of these analyses will drastically decrease the false positives of the proposed string analysis (we address this topic in the future work section).

Discussion and Related Work
In this paper we have proposed an abstract semantics for a toy imperative language µJS, augmented with string manipulation operations, expressive enough to handle dynamic typing and implicit type conversion. In our abstract semantics we have combined the DFA domain with abstract domains for the other primitive types, necessary to deal with static analysis of programs with dynamic typing. The proposed formal framework allows us to formally prove soundness and to study the precision of the abstract semantics of each string operation: depending on the property of interest, one can tune the degree of precision, namely the completeness of any string operation.

Analysis vs. Verification
Even if several solutions, also involving finite state machines, have been proposed for string solving and verification [21,39,40], it should be noted that our approach is placed instead in the context of string static analysis. Over the years, there has always been the intuition that program analysis was harder than verification: given a program, the aim of the former is to derive invariants for each program point, the one of the latter is instead to check whether a certain property holds for the given input program. Recently, this concept has been formalized from a computability point of view [41], confirming this belief. Therefore, our approach, placed in the context of static analysis of string manipulation programs, has goals that are hardly comparable with the solutions proposed in the context of verification, such as those cited above.

Main Related Works
The issue of analyzing strings is a widely studied problem and it has been tackled in the literature from different points of view. Before discussing the most related works, we observe what makes our approach original w.r.t. the existing literature: (1) we provide a modular abstract domain, parametric on the abstractions of the different primitive types, which allows us both to obtain a tunable semantics precision and to handle dynamic typing for operations having both integer and string parameters, such as substring; (2) our focus is on the characterization of a formal abstract interpretation-based framework where it is possible to prove soundness and to analyze the completeness of string operations, in order to understand where it is possible to tune precision versus efficiency. The main feature we have in common with existing works is the use of DFA (regular expressions) for abstracting strings. In [21], the authors propose a symbolic string verifier for PHP based on finite state automata represented by a particular form of binary decision diagrams, the MBDD. Even if it could be interesting to understand whether this representation of DFAs may also be used for improving our algorithms, their work only considers operations exclusively involving strings (and not also integers, as substring does) and therefore it provides a solution for different string manipulations. In [20], the authors propose an abstract interpretation-based string analyzer approximating strings into a subset of regular languages, called regular strings, and define the abstract semantics of four string operations of interest, equipped with a widening. This is the most related work, but our approach is strictly more general, since we do not introduce any restriction on regular languages. In [19], the authors propose a scalable static analysis for jQuery that relies on a novel abstract domain of regular expressions. The abstract domain in [19] contains the finite state automata one but pursues a different task and does not provide semantics for string operations. It may certainly be interesting to integrate our library for string manipulation operators into SAFE. Finally, [42] proposes a lattice-based generalization of regular expressions, formally illustrating a parametric abstract domain of regular expressions starting from a complete lattice of reference. However, this work does not tackle the problem of analyzing string manipulations, since it instantiates the parametric abstract domain in the network communication environment, analyzing the exchanged messages as regular expressions.
Finite state machines (transducers and automata) have also found a critical application in model checking, both for enforcing string constraints and for modeling infinite transition systems [43]. For example, the authors of [44] define a sound decision procedure for a regular language-based logic for the verification of string properties. The authors of [45] propose an automata abstraction in the context of regular model checking to tackle the well-known problem of state space explosion. Moreover, other formal systems, similar to DFA, have been proposed in the context of string analysis [46][47][48]. As future work, it would be interesting to study the relation between standard DFA and the other existing formal models, such as logics or other forms of FA.
In the context of JavaScript, several static analyzers have been proposed, motivated by the wide range of applications and security issues related to the language [3][4][5]37]. TAJS [3] is a static analyzer for JavaScript based on abstract interpretation. The authors focus on allocation site abstraction, plugging the recency abstraction [49] into the static analyzer and thereby decreasing the number of false positives when objects are accessed. On top of TAJS, a sound way to statically analyze a large range of non-trivial eval patterns has been defined in [50]. Loop-Sensitive Analysis (LSA) is defined in [37]; it distinguishes loop iterations using loop strings, in the same way call strings distinguish function calls from different call sites in k-CFA [51]. The authors have implemented LSA in SAFE [5], a static analyzer for JavaScript web applications. As future work, it may be interesting to combine LSA with our abstract semantics to decrease the occurrences of false positives introduced by the widening operator during fix-point computations.

Future Ideas
In this paper we have proposed a static string analysis for a set of relevant JavaScript string manipulation operations, whose semantics is inspired by the official ECMAScript specification [10]. The first goal is to integrate our abstract semantics into a static analyzer for JavaScript that uses finite state automata to approximate strings. In order to decrease the number of false positives in our string approximation in the presence of loops, several techniques can be employed, such as loop unrolling and LSA [37]. The domain described in this paper has been equipped only with a widening, to enforce termination in fix-point computations, which may lead to a big loss of precision. A narrowing will be studied and integrated into our static analyzer in order to recover some of the precision lost when the widening is applied.
We conclude by observing, as already highlighted in [7], the important application of finite state automata for string-to-code primitives analysis. Consider, for instance, in JavaScript programs, the eval function, transforming strings into code. Our semantics is sound and precise enough to answer some non-trivial properties of interest. Indeed, in [7], the finite state automata domain and the corresponding abstract semantics for strings turned out to be the basis for a sound and precise enough analysis of eval.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Selected Proofs
In this appendix we report all the long proofs of results presented in the paper. The proofs are listed in order of appearance.
Proof of Theorem 2. The collecting semantics of substring is defined lifting the concrete semantics defined in Section 3 as follows, where S ∈ ℘(Σ * ) and I, J ∈ ℘(Z).
In order to prove soundness and completeness of SS , we need to prove that ∀A ∈ DFA /≡ , ∀i, j ∈ Const: SS(L (A), γ(i), γ(j)) = L (SS (A, i, j)). We split the proof in the following cases. Since in the substring semantics any negative value is treated as zero, in the proof we suppose w.l.o.g. that when a negative value arises it is treated as zero.
Proof of Theorem 3. The collecting semantics of charAt is defined lifting the concrete semantics defined in Section 3 as follows, where S ∈ ℘(Σ * ) and I ∈ ℘(Z).
CA(S, I) = { CA(σ, i) | σ ∈ S, i ∈ I }
In order to prove soundness and completeness of CA , we need to prove that ∀A ∈ DFA /≡ , ∀i ∈ Const: CA(L (A), γ(i)) = L (CA (A, i)). We split the proof in the following two cases.
• Let us suppose that i = ⊤Const , hence γ(i) = Z. It is worth noting that the function chars we used in the abstract semantics of charAt is complete. Let CHARS : ℘(Σ * ) → ℘(Σ) be the function that, given a set of strings, returns the set of characters occurring in any string of the input set. It holds that CHARS(L (A)) = chars(A). We split the proof in the following cases. The condition is checked at line 13 of Algorithm 2; it fails, hence {true, false} is returned at line 17.
Proof of Theorem 7. The soundness and completeness of LC follow from the fact that any upper-case transition found in A is replaced with the same transition reading the corresponding lower-case symbol, without changing either the orientation of the transitions or the automaton states.
Proof of Theorem 8. The collecting semantics of includes is defined lifting the concrete semantics defined in Section 3 as follows, where S, S′ ∈ ℘(Σ * ).
Proof of Theorem 9. The collecting semantics of repeat is defined lifting the concrete semantics defined in Section 3 as follows, where S ∈ ℘(Σ * ) and I ∈ ℘(Z).
RT(S, I) = { σ^n | σ ∈ S, n ∈ I }
In order to prove soundness of RT , we need to prove that ∀A ∈ DFA /≡ , ∀i ∈ Const: RT(L (A), γ(i)) ⊆ L (RT (A, i)). We split the proof in the following two cases.
• Let us suppose that i ≠ ⊤Const , hence γ(i) = {n}, where n ∈ Z. We split the proof in the following cases: Min(p) · Min(p) · … · Min(p) | p ∈ Paths(A) }) In this case, Algorithm 5 returns the above automaton at lines 8-15.
• Let us suppose that i = ⊤Const , hence we have that γ(i) = Z.
Proof of Theorem 10. At each iteration of Algorithm 6, we remove white-space transitions from the initial state q_0. The invariant of Algorithm 6 after line 9 is that the initial state has no white-space transitions. Before checking the while-loop condition (line 3), the automaton is minimized and determinized (line 11) with the new set of transitions δ′. Hence, if the initial state q_0 had only white-space transitions, after the minimization at line 11, q_0 is no longer the initial state of R, and a new initial state is computed at line 11. Hence, when the loop is repeated, the algorithm searches for further white-space transitions from the new initial state. In this way, Algorithm 6 is able to remove consecutive white-space transitions from the original initial state q_0.