Article

Static Analysis for ECMAScript String Manipulation Programs

1 Department of Computer Science, University of Verona, 37134 Verona, Italy
2 Ca’ Foscari University, Department of Environmental Sciences, Informatics and Statistics, 30170 Venice, Italy
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(10), 3525; https://doi.org/10.3390/app10103525
Submission received: 25 April 2020 / Revised: 14 May 2020 / Accepted: 16 May 2020 / Published: 20 May 2020
(This article belongs to the Special Issue Static Analysis Techniques: Recent Advances and New Horizons)

Abstract

In recent years, dynamic languages such as JavaScript or Python have been increasingly used in a wide range of fields and applications. Their tricky and often misunderstood behaviors pose a great challenge for static analysis. A key aspect of any dynamic language program is the manifold use of strings, since they can be implicitly converted to values of another type, transformed by string-to-code primitives or used to access an object property. Unfortunately, string analyses for dynamic languages still lack precision and do not take into account some important string features. In this scenario, more precise string analyses become a necessity. The goal of this paper is to take a first step towards precisely handling dynamic language string features. In particular, we propose a new abstract domain approximating strings as finite state automata and an abstract interpretation-based static analysis for the most common string manipulating operations provided by the ECMAScript specification. The proposed analysis comes with a prototype static analyzer implementation for an imperative string manipulating language, allowing us to show and evaluate the improved precision of the proposed analysis.

1. Introduction

Dynamic languages, such as JavaScript or Python, have seen significant growth in a very wide range of fields and applications. Common features of these languages are dynamic typing (typing occurs during program execution, at run-time) and implicit type conversion [1], which lighten the development phase and allow programs not to block execution in the presence of unexpected or unpredictable situations. Moreover, one important aspect of dynamic languages is the way strings may be used. In JavaScript, for example, strings can be used to access object properties or be transformed into executable code by the global function eval. In this way, dynamic languages provide multiple string features that simplify the writing of programs while allowing statically unpredictable executions, which may make programs harder to understand [1]. For this reason, string obfuscation (e.g., string splitting) is becoming one of the most common obfuscation techniques in JavaScript malware [2], making code hard to analyze statically. Consider, for example, the JavaScript program fragment in Figure 1, where strings are manipulated, de-obfuscated, combined into the variable d and finally transformed into executable code, namely the statement ws = new ActiveXObject("WScript.Shell"). In Internet Explorer, this command opens a shell which may execute malicious commands. The command is not hard-coded in the fragment but is built at run-time, and the initial values of i, j and k are unknown, as is the number of iterations of the loops.
All these observations suggest that, in order to statically understand statements which are dynamically generated and executed, it may be extremely useful to statically analyze the string value of d. Unfortunately, existing static analyzers for dynamic languages [3,4,5,6] might fail to precisely analyze strings in dynamic contexts. For instance, in the example above, TAJS [3], JSAI [4] and SAFE [5] lose precision on the eval input value, together with any information gathered so far about it. Namely, the analysis of dynamic languages, even when tackled by sophisticated tools such as those cited, still lacks formal approaches for handling the dynamic features of string manipulation, such as dynamic typing, implicit type conversion and dynamic code generation. In contrast, [7] proposes a new approach for dynamic language analysis based on finite state automata for abstracting strings, which comes with both a precise string abstraction able to infer string properties in general and a sound abstract interpreter for dynamically generated code.
Contributions
In this paper (an extended and revised version of [8], integrated with a more complete range of string operations, detailed proofs of the results presented, reported in Appendix A, and an improved implementation discussed in Section 6), we focus on the characterization of an abstract interpretation-based [9] formal framework capable of handling dynamic typing and implicit type conversion, by defining an abstract semantics able to capture (precisely, when possible) the previously mentioned dynamic features. Even if we do not tackle the problem of analyzing dynamically generated code (meaning that we do not analyze its behavior), as highlighted in [7], such a semantics is a necessary step towards a sufficiently precise analysis of it, since it is able to reason about a class of string manipulation programs (as far as string values are concerned) that state-of-the-art static analyzers would fail to analyze precisely. Indeed, the domain we propose allows us to collect (and potentially approximate) the set of all possible string values that a variable may receive during computation (at each program point). In order to analyze what an eval statement might execute, we need to (over-)approximate the set of string values of its input. Hence we propose an approach defining a collecting semantics for strings. With this task in mind, we first discuss how to combine abstract domains of primitive types (strings, integers and booleans) in order to capture dynamic typing. Once we have such an abstract domain, we define on it an abstract semantics for a core language μJS, whose concrete semantics is inspired by the JavaScript one, augmented with implicit type conversion, dynamic typing and several interesting string operations taken from the official ECMAScript language specification [10], i.e., the JavaScript language specification.
In particular, for each one of these operations we will provide the algorithm computing its abstract semantics and we will discuss their soundness and completeness.
Paper structure
In Section 2 we recall relevant notions on finite state automata, and in Section 3 we establish the core language adopted for this paper. In Section 4.1 we define the finite state automata domain, highlighting some important operations and theoretical results. In the rest of Section 4 we present and discuss two ways of combining abstract domains (for primitive types) suitable for dynamic languages. Then, in Section 5, we present the new abstract semantics for string manipulating operations. In Section 6 we examine and evaluate the precision of the string static analyzer based on the above semantics. Finally, in Section 7, we compare this paper with the most closely related works and draw our conclusions.

2. Background

In this section, we recall some basic notations and notions that will be used in the rest of the paper.

2.1. String Notation

We denote by $\Sigma$ a finite non-empty alphabet of symbols, its Kleene-closure by $\Sigma^*$ and a string element by $\sigma \in \Sigma^*$. If $\sigma = \sigma_0 \sigma_1 \cdots \sigma_n$, the length of $\sigma$ is $|\sigma| = n + 1$ and the element in the $i$-th position is $\sigma_i$. Given two strings $\sigma, \sigma' \in \Sigma^*$, $\sigma \cdot \sigma'$ is their concatenation. A language is a set of strings, i.e., $L \in \wp(\Sigma^*)$. We use the following notations: $\Sigma^i \overset{\text{def}}{=} \{\sigma \in \Sigma^* \mid |\sigma| = i\}$ and $\Sigma^{<i} \overset{\text{def}}{=} \bigcup_{j < i} \Sigma^j$. Given $\sigma \in \Sigma^*$ and $i, j \in \mathbb{N}$ ($i \le j \le |\sigma|$), the substring between $i$ and $j$ of $\sigma$ is the string $\sigma_i \cdots \sigma_{j-1}$. We denote by $\Sigma_{\mathbb{Z}} \overset{\text{def}}{=} \{+, -, \epsilon\} \cdot \{0, 1, \ldots, 9\}^+$ the set of numeric strings, i.e., strings corresponding to integers. $\mathcal{I} : \Sigma_{\mathbb{Z}} \to \mathbb{Z}$ maps numeric strings to the corresponding integers. Dually, we define the function $\mathcal{S} : \mathbb{Z} \to \Sigma_{\mathbb{Z}}$ that maps each integer to its numeric string representation (e.g., 1 is mapped to the string "1", and not "+1"). Given $\sigma \in \Sigma^*$ and $n \in \mathbb{N}$, we denote by $\sigma^n$ the $n$-times concatenation of $\sigma$. Given a symbol $c \in \Sigma$, we denote by $\mathsf{toLowerCase}(c)$ its corresponding lower-case symbol if it is a capital letter; otherwise $c$ is returned. We abuse notation by denoting by $\mathsf{toLowerCase}(\sigma)$ the string $\sigma$ where, at each position, any upper-case symbol is replaced with the corresponding lower-case symbol.

2.2. Regular Languages and Finite State Automata

We follow [11] for automata notation. A finite state automaton (FA) is a tuple $\mathsf{A} = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite non-empty set of states, $q_0 \in Q$ is the initial state, $\Sigma$ is a finite alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation and $F \subseteq Q$ is the set of final states. In particular, if $\delta : Q \times \Sigma \to Q$ is a function then $\mathsf{A}$ is called a deterministic FA (DFA). We also regard as DFA those FA that are not complete, namely those lacking a transition for some pair $(q, a)$ ($q \in Q$, $a \in \Sigma$). They can easily be transformed into a complete DFA by adding a sink state receiving all the missing transitions. The class of languages recognized by FA is the class of regular languages. We denote the set of all DFA by $\textsc{Dfa}$. Given an automaton $\mathsf{A}$, we denote the language accepted by $\mathsf{A}$ by $\mathcal{L}(\mathsf{A})$. A language $L$ is regular iff there exists an FA $\mathsf{A}$ such that $L = \mathcal{L}(\mathsf{A})$. By the Myhill-Nerode theorem [12], for each regular language there exists a unique minimum automaton, i.e., one with the minimum number of states, recognizing the language. Given a regular language $L$, we denote by $\mathsf{Min}(L)$ the minimum DFA $\mathsf{A}$ s.t. $L = \mathcal{L}(\mathsf{A})$. Given an automaton $\mathsf{A}$, we denote by $\mathsf{Kleene}(\mathsf{A})$ the automaton that recognizes the language corresponding to the Kleene-closure of $\mathcal{L}(\mathsf{A})$, namely the automaton $\mathsf{A}'$ s.t. $\mathcal{L}(\mathsf{A}') = \mathcal{L}(\mathsf{Kleene}(\mathsf{A})) = \{\sigma^n \mid \sigma \in \mathcal{L}(\mathsf{A}), n \in \mathbb{N}\}$. Moreover, given an automaton $\mathsf{A}$, we rely on the predicate $\mathsf{hasCycle}(\mathsf{A})$ that checks whether $\mathsf{A}$ is cyclic.
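As an informal illustration of the definitions above, the following sketch encodes a possibly incomplete DFA and its acceptance test. The names makeDfa and accepts and the transition-map encoding are assumptions made for this example only; they are not taken from the paper's analyzer.

```javascript
// Hypothetical encoding: transitions are triples [state, symbol, nextState].
// A missing entry plays the role of the implicit sink state.
function makeDfa(initial, finals, transitions) {
  const delta = new Map(transitions.map(([q, a, r]) => [`${q},${a}`, r]));
  return { initial, finals: new Set(finals), delta };
}

// Returns true iff sigma belongs to the language accepted by the automaton.
function accepts(dfa, sigma) {
  let state = dfa.initial;
  for (const symbol of sigma) {
    const key = `${state},${symbol}`;
    if (!dfa.delta.has(key)) return false; // missing transition: sink state
    state = dfa.delta.get(key);
  }
  return dfa.finals.has(state);
}

// Example: an automaton for the regular language a*b.
const aStarB = makeDfa(0, [1], [[0, "a", 0], [0, "b", 1]]);
console.log(accepts(aStarB, "aaab")); // true
console.log(accepts(aStarB, "aba"));  // false
```

The automaton is deliberately left incomplete (state 1 has no outgoing transitions), which matches the convention of treating such automata as DFA.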

2.3. Abstract Interpretation

Abstract interpretation establishes a correspondence between a concrete semantics and an approximated one called abstract semantics [9,13]. In a Galois Connection framework, if $C$ and $A$ are complete lattices, a pair of monotone functions $\alpha : C \to A$ and $\gamma : A \to C$ forms a Galois Connection (GC for short) between $C$ and $A$ if for every $x \in C$ and $y \in A$ we have $\alpha(x) \le_A y \Leftrightarrow x \le_C \gamma(y)$. $\alpha$ and $\gamma$ are called abstraction function and concretization function, respectively.
Let $L$ be a complete lattice. $X \subseteq L$ is a Moore family of $L$ if $X = \mathcal{M}(X) = \{\bigwedge S \mid S \subseteq X\}$ and $\top$ (top element) $\in \mathcal{M}(X)$. The condition that any concrete object in $C$ has a best abstraction in the abstract domain $A$ implies that $A$ is a Moore family of $C$, and hence that there exists a Galois connection between $C$ and $A$.
Weaker forms of correspondence are possible, e.g., when $A$ is not a complete lattice or when only $\gamma$ exists [14]. In all cases, relative precision in $A$ is given by comparing the meaning of abstract objects in $C$, i.e., $x_1 \le_A x_2$ if $\gamma(x_1) \le_C \gamma(x_2)$. If $f : C \to C$ is a continuous function and $A$ is an abstraction of $C$ by means of the GC $\langle \alpha, \gamma \rangle$, then $f$ always has a best correct approximation in $A$, $f^A : A \to A$, defined as $f^A = \alpha \circ f \circ \gamma$. Any approximation $f^\sharp : A \to A$ of $f$ in $A$ is sound if $f^A \sqsubseteq f^\sharp$.
In abstract interpretation, there exist two notions of completeness: backward completeness and forward completeness. The former is the best known form of completeness and focuses on complete abstractions of the inputs, while the latter [15,16,17] focuses on complete abstractions of the outputs, both w.r.t. an operation of interest. When we do not have a GC, namely when only the concretization $\gamma$ exists, we can only focus on forward completeness, as we will do in this paper. Given a GC $\langle \alpha, \gamma \rangle$, a concrete function $f : C \to C$ and an abstract function $f^\sharp : A \to A$, $f^\sharp$ is forward complete w.r.t. $f$ if $\forall a \in A.\ f(\gamma(a)) = \gamma(f^\sharp(a))$.
An abstract domain $A$ satisfies the ascending chain condition (ACC) if all ascending chains are finite. When $A$ is not ACC, convergence to the limit of the fix-point iterations can be ensured through widening operators. A widening operator $\nabla : A \times A \to A$ approximates the least upper bound, i.e., $\forall x, y \in A.\ x, y \le_A (x \nabla y)$, and it is such that for any increasing chain $x_1 \le x_2 \le \cdots \le x_n \le \cdots$ the increasing chain $w_0 = \bot$ and $w_{i+1} = w_i \nabla x_i$ is finite.

3. The Core Language

In this paper, we consider a JavaScript core language, reported in Figure 2, that we call μJS, containing several representative string operations taken from the set of methods offered by the JavaScript built-in class String, detailed in the ECMAScript language specification [10]. Even though we have decided to focus on a core of the operations, note that the missing methods (e.g., indexOf or endsWith) can easily be modeled as compositions of our chosen string methods or as particular cases of them. Nevertheless, as we will discuss in Section 6, these operations have been implemented and tested.

μJS Semantics

In μJS the primitive values are $\mathbb{V} = \mathbb{S} \cup \mathbb{Z} \cup \mathbb{B} \cup \{\mathtt{NaN}\}$ with $\mathbb{S} \overset{\text{def}}{=} \Sigma^*$ (strings on the alphabet $\Sigma$), $\mathbb{B} \overset{\text{def}}{=} \{\mathtt{true}, \mathtt{false}\}$ and $\mathtt{NaN}$ a special value denoting not-a-number.
Program states are partial maps from identifiers to primitive values, i.e., $\textsc{States} : \textsc{Id} \to \mathbb{V}$. The concrete big-step semantics $[\![\cdot]\!] : \textsc{Stmt} \times \textsc{States} \to \textsc{States}$ is standard and follows [18], and it includes dynamic typing and implicit type conversion. In addition, the expression semantics, $(\!|\cdot|\!) : \textsc{Exp} \times \textsc{States} \to \mathbb{V}$, is standard and follows [18]; we only provide the formal and precise semantics of the μJS string operations. Let $\sigma, \sigma' \in \mathbb{S}$ and $i, j \in \mathbb{Z}$ (values which are not strings or numbers, respectively, are converted by the implicit type conversion primitives; moreover, negative values are treated as zero).
substring:
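It extracts the substring between two indexes from a string. The semantics is defined by the function $\textsc{Ss} : \mathbb{S} \times \mathbb{Z} \times \mathbb{Z} \to \mathbb{S}$ as:
$$\textsc{Ss}(\sigma, i, j) \overset{\text{def}}{=} \begin{cases} \textsc{Ss}(\sigma, j, i) & \text{if } j < i \\ \sigma_i \cdots \sigma_j & \text{if } j < |\sigma|,\ i \le j \\ \sigma_i \cdots \sigma_n & \text{if } j \ge n = |\sigma|,\ i \le j \end{cases}$$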
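The three cases (swap, in-range end index, end index beyond the string) can be sketched as follows. This is an illustration under the assumption that the end index is exclusive, as in the substring notation of Section 2.1 and in the built-in String.prototype.substring; the name ss is hypothetical.

```javascript
// Sketch of the substring semantics for integer indexes.
// Assumption: the character at the end index j is excluded.
function ss(sigma, i, j) {
  i = Math.max(i, 0); // negative values are treated as zero
  j = Math.max(j, 0);
  if (j < i) return ss(sigma, j, i);     // swap case
  const end = Math.min(j, sigma.length); // j beyond the end: stop at |sigma|
  let result = "";
  for (let k = i; k < end; k++) result += sigma[k];
  return result;
}

console.log(ss("hello", 1, 3)); // el
console.log(ss("hello", 3, 1)); // el (indexes swapped)
console.log(ss("hello", 2, 10)); // llo
```

On these inputs the sketch agrees with the built-in method, which can serve as a quick sanity check when experimenting with other index combinations.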
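charAt:
It returns the character, i.e., the string of unitary length, at a specified index in a string $\sigma$. The semantics is the function $\textsc{Ca} : \mathbb{S} \times \mathbb{Z} \to \mathbb{S}$ defined as follows:
$$\textsc{Ca}(\sigma, i) \overset{\text{def}}{=} \begin{cases} \sigma_i & \text{if } 0 \le i < |\sigma| \\ \epsilon & \text{otherwise} \end{cases}$$
length:
It returns the length of a string $\sigma \in \mathbb{S}$. Its semantics is the function $\textsc{Le} : \mathbb{S} \to \mathbb{Z}$ defined as $\textsc{Le}(\sigma) \overset{\text{def}}{=} |\sigma|$.
concat:
It returns the concatenation of two strings; its concrete semantics $\textsc{Cc} : \mathbb{S} \times \mathbb{S} \to \mathbb{S}$ relies on the concatenation operator reported in Section 2.
$$\textsc{Cc}(\sigma, \sigma') = \sigma \cdot \sigma'$$
startsWith:
It determines whether a specified string $\sigma$ starts with $\sigma'$. The semantics is the function $\textsc{Sw} : \mathbb{S} \times \mathbb{S} \to \mathbb{B}$ defined as:
$$\textsc{Sw}(\sigma, \sigma') \overset{\text{def}}{=} \begin{cases} \mathtt{true} & \text{if } \exists \sigma'' \in \Sigma^*.\ \sigma = \sigma' \cdot \sigma'' \\ \mathtt{false} & \text{otherwise} \end{cases}$$
repeat:
It returns the given string repeated $n$ times. The semantics is the function $\textsc{Rt} : \mathbb{S} \times \mathbb{Z} \to \mathbb{S}$ defined as $\textsc{Rt}(\sigma, n) \overset{\text{def}}{=} \sigma^n$.
includes:
It determines whether a string $\sigma'$ is a substring of $\sigma$. The semantics is the function $\textsc{In} : \mathbb{S} \times \mathbb{S} \to \mathbb{B}$ defined as:
$$\textsc{In}(\sigma, \sigma') \overset{\text{def}}{=} \begin{cases} \mathtt{true} & \text{if } \exists \phi, \psi \in \Sigma^*.\ \sigma = \phi \cdot \sigma' \cdot \psi \\ \mathtt{false} & \text{otherwise} \end{cases}$$
toLowerCase:
It returns the given string in all lowercase letters. The semantics is the function $\textsc{Lc} : \mathbb{S} \to \mathbb{S}$ defined as $\textsc{Lc}(\sigma) \overset{\text{def}}{=} \mathsf{toLowerCase}(\sigma)$.
trimLeft:
It removes all the white-spaces at the beginning of a string. The semantics is the function $\textsc{Tl} : \mathbb{S} \to \mathbb{S}$ defined as:
$$\textsc{Tl}(\sigma) \overset{\text{def}}{=} \sigma' \quad \text{where } \sigma = \psi \cdot \sigma' \text{ and } \psi = \max\{\psi' \in \{\text{\textvisiblespace}\}^* \mid \exists \sigma'' \in \Sigma^*.\ \sigma = \psi' \cdot \sigma''\}$$
trimRight:
It removes all the white-spaces at the end of a string. The semantics is the function $\textsc{Tr} : \mathbb{S} \to \mathbb{S}$ defined as:
$$\textsc{Tr}(\sigma) \overset{\text{def}}{=} \sigma' \quad \text{where } \sigma = \sigma' \cdot \psi \text{ and } \psi = \max\{\psi' \in \{\text{\textvisiblespace}\}^* \mid \exists \sigma'' \in \Sigma^*.\ \sigma = \sigma'' \cdot \psi'\}$$
trim:
It removes all the white-spaces at the end and beginning of a string. The semantics is the function $\textsc{Tm} : \mathbb{S} \to \mathbb{S}$ defined as: $\textsc{Tm}(\sigma) \overset{\text{def}}{=} \textsc{Tr}(\textsc{Tl}(\sigma))$.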
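The trimming operations can be read as removing the longest white-space prefix or suffix. A minimal sketch follows, assuming for simplicity that the only white-space symbol is the space character; the names tl, tr and tm are hypothetical.

```javascript
// Sketch of trimLeft: drop the longest prefix made of spaces.
function tl(sigma) {
  let k = 0;
  while (k < sigma.length && sigma[k] === " ") k++;
  return sigma.slice(k);
}

// Sketch of trimRight: drop the longest suffix made of spaces.
function tr(sigma) {
  let k = sigma.length;
  while (k > 0 && sigma[k - 1] === " ") k--;
  return sigma.slice(0, k);
}

// trim is the composition of the two, as in the definition of Tm.
const tm = (sigma) => tr(tl(sigma));

console.log(JSON.stringify(tm("  a b "))); // "a b"
```

Inner white-space is preserved, since only the maximal prefix and suffix are removed.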

Implicit Type Conversion

In order to properly capture the semantics of the language μJS, inspired by the JavaScript semantics, we need to deal with implicit type conversion [18]. For each primitive value, we define an auxiliary function converting it to other primitive values (Figure 3). Note that all the functions behave like the identity when applied to values not needing conversion, e.g., toInt on integers. Then, $\mathsf{toString} : \mathbb{V} \to \mathbb{S}$ maps any input value to its string representation; $\mathsf{toInt} : \mathbb{V} \to \mathbb{Z} \cup \{\mathtt{NaN}\}$ returns the integer corresponding to a value, when it is possible: for true and false it returns 1 and 0, respectively, for strings in $\Sigma_{\mathbb{Z}}$ it returns the corresponding integer, while all the other values are converted to NaN. For instance, toInt("42") = 42 and toInt("42hello") = NaN. Finally, $\mathsf{toBool} : \mathbb{V} \to \mathbb{B}$ returns false when the input is 0, and true for all the other non-boolean primitive values. It is worth noting that the auxiliary functions defined in Figure 3 do not correspond to explicit casting; rather, they model the implicit type conversion implemented by JavaScript. In particular, these functions cannot be directly called by a programmer, since they are used exclusively internally (indeed implicitly) by the semantics when a value of a given type is required for an expression operand.
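The behavior described in this paragraph can be sketched as follows. Since Figure 3 is not reproduced here, the code encodes only what the text states; the names toInt, toBool and toStr and the regular expression for numeric strings are assumptions of this example.

```javascript
// Numeric strings: an optional sign followed by at least one digit (Section 2.1).
const NUMERIC = /^[+-]?[0-9]+$/;

function toInt(v) {
  if (typeof v === "boolean") return v ? 1 : 0;
  if (typeof v === "number") return v; // identity on integers and NaN
  if (typeof v === "string" && NUMERIC.test(v)) return parseInt(v, 10);
  return NaN; // every other value
}

function toBool(v) {
  if (typeof v === "boolean") return v; // identity on booleans
  return v !== 0; // as stated in the text: only 0 maps to false
}

function toStr(v) {
  return String(v); // string representation of any primitive value
}

console.log(toInt("42"));      // 42
console.log(toInt("42hello")); // NaN
```

Note that the built-in parseInt alone would accept "42hello", which is why the sketch checks membership in the set of numeric strings first.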

4. An Abstract Domain for String Manipulation

4.1. The Finite State Automata Abstract Domain for Strings

In this section, we describe the finite state automata abstract domain for strings [19,20,21], namely the domain of regular languages over $\wp(\Sigma^*)$. In particular, our goal is to exploit automata, and therefore regular languages, for approximating string values collected during analysis. The idea is to approximate strings as regular languages represented by the minimum DFA [12] recognizing them. In general, we have more DFA than regular languages, hence the domain of automata is indeed the quotient $\textsc{Dfa}_{/\equiv}$ w.r.t. the equivalence relation induced by language equality: $\forall \mathsf{A}_1, \mathsf{A}_2 \in \textsc{Dfa}.\ \mathsf{A}_1 \equiv \mathsf{A}_2 \Leftrightarrow \mathcal{L}(\mathsf{A}_1) = \mathcal{L}(\mathsf{A}_2)$. Therefore any equivalence class is composed of automata that recognize the same regular language. We abuse notation by representing each equivalence class of $\textsc{Dfa}_{/\equiv}$ by one of its automata (usually the minimum), i.e., when we write $\mathsf{A} \in \textsc{Dfa}_{/\equiv}$ we mean $[\mathsf{A}]_{\equiv}$.
The partial order $\sqsubseteq_{\textsc{Dfa}}$ is induced by language inclusion, i.e., $\forall \mathsf{A}_1, \mathsf{A}_2 \in \textsc{Dfa}_{/\equiv}.\ \mathsf{A}_1 \sqsubseteq_{\textsc{Dfa}} \mathsf{A}_2 \Leftrightarrow \mathcal{L}(\mathsf{A}_1) \subseteq \mathcal{L}(\mathsf{A}_2)$, which is well defined since automata in the same ≡-equivalence class recognize the same language.
The corresponding least upper bound, $\sqcup_{\textsc{Dfa}} : \textsc{Dfa}_{/\equiv} \times \textsc{Dfa}_{/\equiv} \to \textsc{Dfa}_{/\equiv}$, is the standard union between automata: $\forall \mathsf{A}_1, \mathsf{A}_2 \in \textsc{Dfa}_{/\equiv}.\ \mathsf{A}_1 \sqcup_{\textsc{Dfa}} \mathsf{A}_2 \overset{\text{def}}{=} \mathsf{Min}(\mathcal{L}(\mathsf{A}_1) \cup \mathcal{L}(\mathsf{A}_2))$. It is the minimum automaton recognizing the union of the languages $\mathcal{L}(\mathsf{A}_1)$ and $\mathcal{L}(\mathsf{A}_2)$. This is a well-defined notion since regular languages are closed under union. As an example, consider Figure 4, where the automaton in Figure 4c is the least upper bound of $\mathsf{A}_1$ and $\mathsf{A}_2$ given in Figure 4a,b, respectively.
The (finite) greatest lower bound $\sqcap_{\textsc{Dfa}} : \textsc{Dfa}_{/\equiv} \times \textsc{Dfa}_{/\equiv} \to \textsc{Dfa}_{/\equiv}$ corresponds to automata intersection (since regular languages are closed under finite intersection): $\forall \mathsf{A}_1, \mathsf{A}_2 \in \textsc{Dfa}_{/\equiv}.\ \mathsf{A}_1 \sqcap_{\textsc{Dfa}} \mathsf{A}_2 \overset{\text{def}}{=} \mathsf{Min}(\mathcal{L}(\mathsf{A}_1) \cap \mathcal{L}(\mathsf{A}_2))$.
Theorem 1.
$\langle \textsc{Dfa}_{/\equiv}, \sqsubseteq_{\textsc{Dfa}}, \sqcup_{\textsc{Dfa}}, \sqcap_{\textsc{Dfa}}, \mathsf{Min}(\varnothing), \mathsf{Min}(\Sigma^*) \rangle$ is a sub-lattice but not a complete meet-sub-semilattice of $\wp(\Sigma^*)$.
In other words, there cannot exist a Galois connection between $\textsc{Dfa}_{/\equiv}$ and $\wp(\Sigma^*)$, i.e., there may be no minimal automaton abstracting a language. Note that some works [22,23,24] have studied automatic procedures to compute, given an input language $L$, a regular cover of $L$ [23] (i.e., an automaton containing the language $L$). Some of them [22,23] studied regular covers guaranteeing that the automaton obtained is the best w.r.t. a minimality relation (minimal, but not minimum). However, this is not a concern, since the relation between concrete semantics and abstract semantics can be weakened while still ensuring soundness [14]. A well-known example is the convex polyhedra domain [25].

Widening

The domain $\textsc{Dfa}_{/\equiv}$ is an infinite domain, and it is not ACC, i.e., it contains infinite ascending chains. For instance, consider the set of languages $L_i = \{a^j b^j \mid 0 \le j \le i\} \in \wp(\Sigma^*)$, indexed by a constant natural number $i \in \mathbb{N}$, forming an infinite ascending chain of finite regular languages. The set of the corresponding minimal automata trivially forms an ascending chain on $\textsc{Dfa}_{/\equiv}$. This clearly implies that any computation on $\textsc{Dfa}_{/\equiv}$ may lose convergence [14] (Most of the proposed abstract domains for strings [3,4,5,26] trivially satisfy ACC, being finite, but they may lose precision during the abstract computation [27].).
As far as automata are concerned, existing widenings are defined in terms of a state equivalence relation merging states that recognize the same language, up to a fixed length $n$ (set as a parameter for tuning the widening precision) [28,29]. We denote this parametric widening by $\nabla_n : \textsc{Dfa}_{/\equiv} \times \textsc{Dfa}_{/\equiv} \to \textsc{Dfa}_{/\equiv}$, with $n \in \mathbb{N}$ [28], and define it in the following.
Let $\mathsf{A} = (Q, \Sigma, \delta, q_0, F)$ and $\mathsf{A}' = (Q', \Sigma, \delta', q_0', F')$ be two finite state automata such that $\mathcal{L}(\mathsf{A}) \subseteq \mathcal{L}(\mathsf{A}')$: the widening between $\mathsf{A}$ and $\mathsf{A}'$ is formalized in terms of a relation $R \subseteq Q \times Q'$ between the sets of states of the two automata. The relation $R$ is used to define an equivalence relation $\equiv_R \subseteq Q' \times Q'$ over the states of $\mathsf{A}'$, such that $\equiv_R = R \circ R^{-1}$. The widening between $\mathsf{A}$ and $\mathsf{A}'$ is then given by the quotient automaton of $\mathsf{A}'$ w.r.t. the partition induced by $\equiv_R$: $\mathsf{A} \nabla_R \mathsf{A}' = \mathsf{A}'/{\equiv_R}$ (Given $\mathsf{A} \in \textsc{Dfa}_{/\equiv}$ and a partition $\pi$ over its states, we denote by $\mathsf{A}/\pi = (Q', \delta', q_0', F', \Sigma)$ the quotient automaton [12].). Thus, the widening operator merges the states of $\mathsf{A}'$ that are equivalent by the relation $\equiv_R$. By changing the relation $R$, we obtain different widening operators [28]. It has been proved that convergence is guaranteed when the relation $R_n \subseteq Q \times Q'$ is such that $(q, q') \in R_n$ iff $q$ and $q'$ recognize the same language of strings of length at most $n$ [28]. The parameter $n$ therefore tunes the length of the strings determining the equivalence of states used for merging them in the widening. It is worth noting that the smaller $n$ is, the more information is lost by widening.
In the following, given $\mathsf{A}, \mathsf{A}' \in \textsc{Dfa}_{/\equiv}$ (without any constraints on the languages they recognize), we define the widening operator on $\textsc{Dfa}_{/\equiv}$, parametric on $n \in \mathbb{N}$, as follows.
$$\mathsf{A} \nabla_n \mathsf{A}' \overset{\text{def}}{=} \mathsf{A} \nabla_{R_n} (\mathsf{A} \sqcup_{\textsc{Dfa}} \mathsf{A}')$$
In order to show how the defined widening operator works, let us discuss the following example.
Example 1.
Consider the following μJS fragment
str = ""; while (x < 100) { str = str + "a"; x = x + 1; }
The value of the variable x is unknown, and so is the number of iterations of the while-loop. In these cases, in order to guarantee soundness and termination, we apply the widening operator.
In Figure 5a we report the abstract value of the variable str at the beginning of the second iteration of the loop, while in Figure 5b we report the abstract value of the variable str at the end of the second iteration. Before starting a new iteration, in the example, we apply $\nabla_1$ between the two automata; specifically, we merge all the states having the same outgoing character. The minimization of the automaton so obtained is reported in Figure 5c. The next iteration will reach the fix-point, guaranteeing termination.

4.2. An Abstract Domain for μJS

By definition, string operations in our language also involve other primitive values, such as booleans or integers, hence we need an abstract domain able to observe any possible concrete value. This is additionally necessary for dealing with implicit type conversion as we will later observe.
We therefore have to design an abstract domain for string manipulation that also deals with other primitive types, namely one able to combine different abstractions of various types. In particular, an abstract domain for string analysis equipped with dynamic typing must include all the possible primitive values, i.e., the whole $\mathbb{V} = \mathbb{Z} \cup \mathbb{B} \cup \mathbb{S} \cup \{\mathtt{NaN}\}$. The idea is to consider an abstract domain for each type of primitive value and to combine them in a unique abstract domain for $\mathbb{V}$. Consider, for each type of value $D$, an abstract domain $D^\sharp$ (we denote by $D^\sharp_{\not\bot}$ the domain $D^\sharp$ without bottom), equipped with an abstraction $\alpha_D : \wp(D) \to D^\sharp$ and a concretization $\gamma_D : D^\sharp \to \wp(D)$ forming a Galois insertion [9].

4.2.1. Coalesced Sum

One way to merge domains is the coalesced sum [30]. The resulting domain contains all the non-bottom elements of the input domains, with a new top and a new bottom.
Definition 1
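(Coalesced sum domain [31]). Let $\langle A, \le_A, \sqcup_A, \sqcap_A, \bot_A, \top_A \rangle$ and $\langle B, \le_B, \sqcup_B, \sqcap_B, \bot_B, \top_B \rangle$ be two lattices abstracting the posets $\langle C, \le_C \rangle$ and $\langle D, \le_D \rangle$ with abstraction functions $\alpha_A : C \to A$ and $\alpha_B : D \to B$, respectively. The coalesced sum domain $A \oplus B$ is defined as:
$$A \oplus B \overset{\text{def}}{=} \{\bot_{A \oplus B}\} \cup \{a \mid a \in A \setminus \{\bot_A\}\} \cup \{b \mid b \in B \setminus \{\bot_B\}\} \cup \{\top_{A \oplus B}\}$$
such that the partial order $\le_{A \oplus B}$ is defined as $x \le_{A \oplus B} y \Leftrightarrow (x \le_A y \wedge x, y \in A) \vee (x \le_B y \wedge x, y \in B)$ and $\forall x \in A \oplus B.\ \bot_{A \oplus B} \le_{A \oplus B} x \le_{A \oplus B} \top_{A \oplus B}$; its least upper bound is defined as:
$$x \sqcup_{A \oplus B} y \overset{\text{def}}{=} \begin{cases} x \sqcup_A y & \text{if } x, y \in A \\ x \sqcup_B y & \text{if } x, y \in B \\ x & \text{if } y = \bot_{A \oplus B} \\ y & \text{if } x = \bot_{A \oplus B} \\ \top_{A \oplus B} & \text{otherwise} \end{cases}$$
and its greatest lower bound $\sqcap_{A \oplus B}$ can be dually defined. The abstraction function $\alpha_{A \oplus B} : C \cup D \to A \oplus B$ is defined as:
$$\alpha_{A \oplus B}(x) \overset{\text{def}}{=} \begin{cases} \alpha_A(x) & \text{if } x \in C \\ \alpha_B(x) & \text{if } x \in D \\ \top_{A \oplus B} & \text{otherwise} \end{cases}$$
In our case, if we consider the abstract domains $\mathbb{Z}^\sharp$, $\mathbb{S}^\sharp$ and $\mathbb{B}^\sharp$, the coalesced sum is the abstraction of $\wp(\mathbb{V})$ depicted in Figure 6.
This is the simplest choice, but unfortunately it is not suitable for dynamic languages, in particular for dealing with dynamic typing and implicit type conversion. The problem is that the type of variables is inferred at run-time and may change during execution. For example, consider the μJS fragment if (y < 5) { x = "42"; } else { x = true; }. The value of the variable y is statically unknown, hence, in order to guarantee soundness, we must take into account both branches, meaning that x may be either a string or a boolean value after the if statement. On the coalesced sum domain, the analysis would lose all precision w.r.t. the collecting semantics by returning $\alpha_{\mathbb{S}}(\texttt{"42"}) \sqcup \alpha_{\mathbb{B}}(\mathtt{true}) = \top$.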
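The loss of precision can be observed on a toy encoding of the coalesced sum. In the sketch below, each component domain is modelled by a finite set of concrete values tagged with its type; this is a stand-in chosen for the example (not the automata and constant domains used in the paper), and the names BOT, TOP, abstractValue and lubSum are hypothetical.

```javascript
// New bottom and top elements of the coalesced sum.
const BOT = { tag: "bot" };
const TOP = { tag: "top" };

// Hypothetical abstraction: a tagged finite set of concrete values.
const abstractValue = (tag, ...values) => ({ tag, values: new Set(values) });

// Least upper bound: join inside one component domain,
// jump to the new top when the two components differ.
function lubSum(x, y) {
  if (x.tag === "bot") return y;
  if (y.tag === "bot") return x;
  if (x.tag === "top" || y.tag === "top") return TOP;
  if (x.tag !== y.tag) return TOP;
  return { tag: x.tag, values: new Set([...x.values, ...y.values]) };
}

// Abstract value of x after: if (y < 5) { x = "42"; } else { x = true; }
const afterIf = lubSum(abstractValue("string", "42"), abstractValue("boolean", true));
console.log(afterIf.tag); // top
```

Joining two values of the same type stays inside that component, which is why the sum is adequate only when the type of each variable never changes.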

4.2.2. Cartesian Product

In order to capture union types without losing too much precision, we need to complete [15,16,32] the above domain so that it can observe collections of values of different types. In order to define this combination, we rely on the Cartesian product, following [33]. The complete abstract domain w.r.t. dynamic typing and implicit type conversion is $\mathbb{Z}^\sharp \times \mathbb{B}^\sharp \times \mathbb{S}^\sharp \times \wp(\{\mathtt{NaN}\})$, an abstraction of $\wp(\mathbb{V})$. In this combined abstract domain, the value of x after the execution of the if statement is precisely $(\bot, \alpha_{\mathbb{B}}(\mathtt{true}), \alpha_{\mathbb{S}}(\texttt{"42"}), \varnothing)$, now an element of the domain, from which we infer that the value of x can be $\alpha_{\mathbb{B}}(\mathtt{true})$ or $\alpha_{\mathbb{S}}(\texttt{"42"})$ but surely neither an abstract integer nor NaN.
In the following, we consider the abstract domain $\mathbb{V}^\sharp$ for string analysis obtained as the Cartesian product of the following abstractions: $\mathbb{B}^\sharp = \wp(\{\mathtt{true}, \mathtt{false}\})$, $\mathbb{Z}^\sharp = \textsc{Const} \overset{\text{def}}{=} \{\top_{\textsc{Const}}, \bot_{\textsc{Const}}\} \cup \{\{z\} \mid z \in \mathbb{Z}\}$ (the abstract domain of constant integers) and $\mathbb{S}^\sharp = \textsc{Dfa}_{/\equiv}$.
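A component-wise join on the product can be sketched as follows. The integer slot follows the constant domain (null for bottom, a number for a constant, the string "top" for top); the boolean and string slots are modelled by finite sets, which is a simplification of this example (strings are abstracted by automata in the paper). All names are hypothetical.

```javascript
// Bottom element of the product: nothing is possible in any slot.
const bottom = () => ({ int: null, bool: new Set(), str: new Set(), nan: false });

// Least upper bound on the constant domain of integers.
function lubConst(a, b) {
  if (a === null) return b;
  if (b === null) return a;
  return a === b ? a : "top";
}

// Component-wise least upper bound on the product.
function lubProduct(x, y) {
  return {
    int: lubConst(x.int, y.int),
    bool: new Set([...x.bool, ...y.bool]),
    str: new Set([...x.str, ...y.str]),
    nan: x.nan || y.nan,
  };
}

// Abstract value of x after: if (y < 5) { x = "42"; } else { x = true; }
const thenBranch = { ...bottom(), str: new Set(["42"]) };
const elseBranch = { ...bottom(), bool: new Set([true]) };
const xAfterIf = lubProduct(thenBranch, elseBranch);
console.log(xAfterIf.int, xAfterIf.nan); // null false
```

The integer and NaN slots remain at bottom, so the result records that x is either the string "42" or the boolean true and nothing else.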

5. Abstract Semantics of ECMAScript String Operations

In this section, we define the abstract semantics of the language μJS over the abstract domain $\mathbb{V}^\sharp$. In particular, we have to define the expression abstract semantics $(\!|\cdot|\!)^\sharp : \textsc{Exp} \times \textsc{States}^\sharp \to \mathbb{V}^\sharp$, abstracting the collecting semantics (The string collecting semantics (fully reported in Appendix A) is defined by lifting to $\wp(\mathbb{V})$ the concrete one reported in Section 3. For example, the collecting semantics of substring is, abusing notation, $\textsc{Ss} : \wp(\Sigma^*) \times \wp(\mathbb{Z}) \times \wp(\mathbb{Z}) \to \wp(\Sigma^*)$ defined as $\textsc{Ss}(L, I, J) = \{\textsc{Ss}(\sigma, i, j) \mid \sigma \in L, i \in I, j \in J\}$.), which is standard except for the string operations, which will be explicitly provided by describing the algorithms for computing them. Let us first recall some important notions on regular languages, useful for the algorithms we will provide.
Definition 2
(Suffixes and prefixes [12]). Let $L \in \wp(\Sigma^*)$ be a regular language. The suffixes of $L$ are $\textsc{Su}(L) \overset{\text{def}}{=} \{y \in \Sigma^* \mid \exists x \in \Sigma^*.\ x \cdot y \in L\}$, and the prefixes of $L$ are $\textsc{Pr}(L) \overset{\text{def}}{=} \{x \in \Sigma^* \mid \exists y \in \Sigma^*.\ x \cdot y \in L\}$.
We can define the suffixes from a position, namely given $i \in \mathbb{N}$, the set of suffixes from $i$ is $\textsc{Su}(L, i) \overset{\text{def}}{=} \{y \in \Sigma^* \mid \exists x \in \Sigma^*.\ x \cdot y \in L, |x| = i\}$. For instance, let $L = \{abc, hello\}$, then $\textsc{Su}(L, 2) = \{c, llo\}$.
Definition 3
(Right quotient [12]). Let $L_1, L_2 \subseteq \Sigma^*$ be regular languages. The right quotient of $L_1$ w.r.t. $L_2$ is $\textsc{Rq}(L_1, L_2) \overset{\text{def}}{=} \{x \in \Sigma^* \mid \exists y \in L_2.\ x \cdot y \in L_1\}$.
For example, let $L_3 = \{xab, yab\}$ and $L_4 = \{b, ab\}$. The right quotient of $L_3$ w.r.t. $L_4$ is $\textsc{Rq}(L_3, L_4) = \{xa, ya, x, y\}$.
Definition 4
(Substrings/Factors [34]). Let $L \in \wp(\Sigma^*)$ be a regular language. The set of its substrings/factors is $\textsc{Fa}(L) \overset{\text{def}}{=} \{y \in \Sigma^* \mid \exists x, z \in \Sigma^*.\ x \cdot y \cdot z \in L\}$.
These operations are all defined as transformations of regular languages. In [12], the corresponding algorithms on FA are provided. In particular, let $\mathsf{A}, \mathsf{A}_1 \in \textsc{Dfa}_{/\equiv}$ and $i \in \mathbb{N}$, then $\mathsf{SU}(\mathsf{A})$, $\mathsf{PR}(\mathsf{A})$, $\mathsf{SU}(\mathsf{A}, i)$, $\mathsf{FA}(\mathsf{A})$ and $\mathsf{RQ}(\mathsf{A}, \mathsf{A}_1)$ are the algorithms corresponding to the transformations $\textsc{Su}(\mathcal{L}(\mathsf{A}))$, $\textsc{Pr}(\mathcal{L}(\mathsf{A}))$, $\textsc{Su}(\mathcal{L}(\mathsf{A}), i)$, $\textsc{Fa}(\mathcal{L}(\mathsf{A}))$ and $\textsc{Rq}(\mathcal{L}(\mathsf{A}), \mathcal{L}(\mathsf{A}_1))$, respectively. Namely, $\forall \mathsf{A}, \mathsf{A}_1 \in \textsc{Dfa}_{/\equiv}, i \in \mathbb{N}$, the following facts hold:
$$\begin{aligned} \textsc{Su}(\mathcal{L}(\mathsf{A})) &= \mathcal{L}(\mathsf{SU}(\mathsf{A})) \\ \textsc{Pr}(\mathcal{L}(\mathsf{A})) &= \mathcal{L}(\mathsf{PR}(\mathsf{A})) \\ \textsc{Fa}(\mathcal{L}(\mathsf{A})) &= \mathcal{L}(\mathsf{FA}(\mathsf{A})) \\ \textsc{Rq}(\mathcal{L}(\mathsf{A}), \mathcal{L}(\mathsf{A}_1)) &= \mathcal{L}(\mathsf{RQ}(\mathsf{A}, \mathsf{A}_1)) \\ \textsc{Su}(\mathcal{L}(\mathsf{A}), i) &= \mathcal{L}(\mathsf{SU}(\mathsf{A}, i)) \end{aligned}$$
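The language-level definitions can be checked on finite languages. The sketch below models a finite language as a Set of strings, a simplification standing in for the automata algorithms of [12]; the function names are hypothetical.

```javascript
// Su(L): all suffixes of the strings in L.
function suffixes(L) {
  const out = new Set();
  for (const w of L) for (let k = 0; k <= w.length; k++) out.add(w.slice(k));
  return out;
}

// Pr(L): all prefixes of the strings in L.
function prefixes(L) {
  const out = new Set();
  for (const w of L) for (let k = 0; k <= w.length; k++) out.add(w.slice(0, k));
  return out;
}

// Su(L, i): suffixes obtained by dropping a prefix of length exactly i.
function suffixesFrom(L, i) {
  const out = new Set();
  for (const w of L) if (w.length >= i) out.add(w.slice(i));
  return out;
}

// Rq(L1, L2): strings x such that x . y is in L1 for some y in L2.
function rightQuotient(L1, L2) {
  const out = new Set();
  for (const w of L1)
    for (const y of L2)
      if (w.endsWith(y)) out.add(w.slice(0, w.length - y.length));
  return out;
}

// Fa(L): every factor is a prefix of some suffix.
const factors = (L) => prefixes(suffixes(L));

console.log([...suffixesFrom(new Set(["abc", "hello"]), 2)].sort()); // [ 'c', 'llo' ]
```

The two examples from the text (suffixes from position 2, and the right quotient of {xab, yab} w.r.t. {b, ab}) are reproduced by these functions.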

5.1. Abstract Semantics of Substring

In this section we define the abstract semantics of substring. In particular, we define the operator $\mathsf{SS} : \textsc{Dfa}_{/\equiv} \times \textsc{Const} \times \textsc{Const} \to \textsc{Dfa}_{/\equiv}$, which takes as input an automaton and two constant integer indexes $i, j \in \textsc{Const}$, and computes the automaton recognizing the set of all substrings of the language of the input automaton between the two provided integer indexes. Since the abstract semantics has to take into account the swaps when the initial index is greater than the final one, several cases arise when one of the two integer parameters is unknown, namely when it is equal to $\top_{\textsc{Const}}$. Indeed, the abstract semantics $\mathsf{SS}$ is divided into four cases, which are reported in Table 1. Consider $\mathsf{A} \in \textsc{Dfa}_{/\equiv}$ and $i, j \in \textsc{Const}$ (for the sake of readability we denote by ⊔ the automata lub $\sqcup_{\textsc{Dfa}}$, and by ⊓ the glb $\sqcap_{\textsc{Dfa}}$). As in the concrete semantics of substring, negative integer values are treated as zero.
  • If $i, j \in \mathbb{Z}$ (second row, second column of Table 1) we have to compute the language of all the substrings between the initial index $i$ and the final index $j$, i.e., $\textsc{Ss}(\mathcal{L}(\mathsf{A}), i, j)$. For example, let $L = \{a\}^* \cup \{hello, bc\}$; the set of its substrings from 1 to 3 is $\textsc{Ss}(L, 1, 3) = \{\epsilon, a, aa, el, c\}$. When $i < j$, as in the example, the automaton accepting this language is computed by the operator
    $$\mathsf{SS}(\mathsf{A}, i, j) \overset{\text{def}}{=} (\mathsf{RQ}(\mathsf{SU}(\mathsf{A}, i), \mathsf{SU}(\mathsf{A}, j)) \sqcap \mathsf{Min}(\Sigma^{j-i})) \sqcup (\mathsf{SU}(\mathsf{A}, i) \sqcap \mathsf{Min}(\Sigma^{<j-i}))$$
    If $i > j$, the integer arguments are simply swapped, as in Table 1.
  • When both integer parameters correspond to $\top_{\textsc{Const}}$, the result is the automaton of all possible factors of $\mathsf{A}$ (third row, third column), i.e., $\mathsf{FA}(\mathsf{A})$.
  • When $i$ is defined and $j = \top_{\textsc{Const}}$ (second row, third column), we have to compute the automaton recognizing all the substrings of $\mathcal{L}(\mathsf{A})$ from 0 to $i$ and any substring starting from $i$. For example, let us consider $\mathsf{SS}(\mathsf{Min}(\{helloworld\}), 5, \top_{\textsc{Const}})$. Due to the semantics of substring reported in Section 3, we need to compute the substrings from $a \in [0, 5]$ to 5 and then any substring with initial index equal to 5. The automaton recognizing any substring starting at a specific index $l$ is defined as $\mathsf{SS}(\mathsf{A}, l) \overset{\text{def}}{=} \mathsf{FA}(\mathsf{SU}(\mathsf{A}, l))$. The abstract semantics returns the least upper bound of the automata of substrings from $a$ to $i$, for each $a \in [0, i]$, and the automaton recognizing any substring with initial index equal to $i$.
  • Similarly to the previous case, when $j$ is defined and $i = \top_{\textsc{Const}}$ (third row, second column), we have to compute the automaton recognizing all the substrings of $\mathcal{L}(\mathsf{A})$ from 0 to $j$ and any substring starting from $j$. Let us consider $\mathsf{SS}(\mathsf{Min}(\{helloworld\}), \top_{\textsc{Const}}, 5)$. As before, we compute the substrings from $a \in [0, 5]$ to 5 and then any substring with initial index equal to 5. The abstract semantics therefore returns the least upper bound of the automata of substrings from $a$ to $j$, for each $a \in [0, j]$, and the automaton recognizing any substring with initial index equal to $j$.
Figure 7 reports an example obtained by applying the rules in Table 1.
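The four cases can be illustrated on finite languages, used here as a stand-in for automata. The following is a minimal sketch, not the analyzer's implementation: the names `substring`, `ss` and `TOP` are hypothetical, languages are plain Python sets, and an unknown index is handled by enumerating every index up to the longest string (larger indexes are clamped and add nothing).

```python
from itertools import product

TOP = None  # hypothetical marker for an unknown integer index


def substring(s, i, j):
    """ECMAScript-style substring: clamp to [0, len(s)], swap if i > j."""
    i = min(max(i, 0), len(s))
    j = min(max(j, 0), len(s))
    if i > j:
        i, j = j, i
    return s[i:j]


def ss(lang, i, j):
    """Substrings of a finite language; TOP ranges over every useful index."""
    bound = max((len(s) for s in lang), default=0)
    indexes = list(range(bound + 1))
    i_values = indexes if i is TOP else [i]
    j_values = indexes if j is TOP else [j]
    return {substring(s, a, b)
            for s in lang
            for a, b in product(i_values, j_values)}


# A finite sample of {a}* united with {hello, bc}
sample = {"", "a", "aa", "aaa", "aaaa", "hello", "bc"}
print(ss(sample, 1, 3))
print(ss({"abc"}, TOP, TOP))  # every factor of "abc"
```

With both indexes unknown the result is the set of all factors, which matches the third-row, third-column case described above.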
Theorem 2.
SS is sound and complete. Formally,
$$\forall A \in \mathrm{DFA}_{/\equiv},\ \forall i, j \in \mathsf{Const}.\ \mathrm{Ss}(\mathcal{L}(A), \gamma(i), \gamma(j)) = \mathcal{L}(\mathsf{SS}(A, i, j))$$
From here on, when we say completeness we mean forward completeness. As highlighted in Section 2, this is the only form of completeness we can ensure in the absence of a Galois connection. In particular, when an abstract operation (e.g., $\mathsf{SS}$) is forward complete for a concrete operation (e.g., $\mathrm{Ss}$), it means that computing on the abstract domain (e.g., $\mathrm{DFA}_{/\equiv}$) does not lose information, given that the computation is necessarily performed only on abstract elements.

5.2. Abstract Semantics of charAt

The abstract semantics of charAt should return an automaton accepting the language of the characters at position $i$ in the strings accepted by the given automaton. Since charAt is a particular case of substring, its abstract semantics, given by $\mathsf{CA} : \mathrm{DFA}_{/\equiv} \times \mathsf{Const} \to \mathrm{DFA}_{/\equiv}$, relies on the abstract semantics of substring previously defined. In particular,
$$\mathsf{CA}(A, i) \stackrel{\mathrm{def}}{=} \begin{cases} \mathsf{SS}(A, i, i + 1) & \text{if } i \neq \top_{\mathsf{Const}} \\ \mathsf{Min}(\mathsf{chars}(A)) \sqcup \mathsf{Min}(\{\epsilon\}) & \text{otherwise} \end{cases}$$
We call $\mathsf{SS}$ (defined before) when the index $i$ corresponds to a determinate integer value; otherwise, we use the function $\mathsf{chars} : \mathrm{DFA}_{/\equiv} \to \wp(\Sigma)$, returning the set of characters read in any transition of an automaton, together with $\mathsf{Min}(\{\epsilon\})$.
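The two branches of this definition can be mimicked on finite languages. This is only a sketch under that simplification: `char_at` and `TOP` are hypothetical names, and the set of all characters occurring in the strings plays the role of chars.

```python
TOP = None  # hypothetical marker for an unknown integer index


def char_at(lang, i):
    """charAt over a finite language, following the two cases above."""
    if i is TOP:
        # Every character occurring in some string, plus the empty string
        # (the result of an out-of-range index).
        return {c for s in lang for c in s} | {""}
    if i < 0:
        return {"" for _ in lang}
    # charAt(i) coincides with substring(i, i + 1)
    return {s[i:i + 1] for s in lang}


print(char_at({"hello", "bc"}, 1))
print(char_at({"ab"}, TOP))
```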
Theorem 3.
CA is sound and complete. Formally,
$$\forall A \in \mathrm{DFA}_{/\equiv},\ \forall i \in \mathsf{Const}.\ \mathrm{Ca}(\mathcal{L}(A), \gamma(i)) = \mathcal{L}(\mathsf{CA}(A, i))$$

5.3. Abstract Semantics of length

The abstract semantics of length should return a value of the integer domain $\mathsf{Const}$ that soundly approximates the lengths of all the strings recognized by an automaton. It is defined by the function $\mathsf{LE} : \mathrm{DFA}_{/\equiv} \to \mathsf{Const}$, computed by Algorithm 1, where $\mathsf{Paths} : \mathrm{DFA}_{/\equiv} \to \wp(\wp(Q))$ returns the set of the paths from the initial state to any final state of $A$ [35]. Given a path $p \in \mathsf{Paths}(A)$, we denote by $|p|$ the length of $p$.
Algorithm 1: $\mathsf{LE} : \mathrm{DFA}_{/\equiv} \to \mathsf{Const}$ algorithm [listing: image in the original]
If the input automaton has cycles, $\mathsf{LE}$ returns $\top_{\mathsf{Const}}$; otherwise, it checks that all the paths of the automaton $A$ have the same length (lines 5–8). Whenever the algorithm finds two paths of different lengths, $\top_{\mathsf{Const}}$ is returned (lines 8–10). Due to the constant integer domain, the abstract semantics of length can give a precise answer only when all the strings of the automaton have exactly the same length. More accurate results can be obtained by using more precise integer abstract domains, e.g., intervals, as we will discuss in Section 6. For example, consider the automata $A$ and $A'$ in Figure 8a,b, respectively. $\mathsf{LE}(A)$ precisely returns 5, since all the strings recognized by $A$ have the same length, while $\mathsf{LE}(A')$ returns $\top_{\mathsf{Const}}$.
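Since the listing of Algorithm 1 is not reproduced here, the behavior just described can be sketched as follows. This is an illustrative reconstruction from the prose, not the authors' algorithm: the automaton encoding (a dictionary from states to symbol-labelled successors), the names `le`, `has_cycle`, `accepted_lengths` and the marker `TOP` are all assumptions, and the automaton is assumed to be trimmed (every state lies on a path to a final state) and to recognize at least one string.

```python
TOP = "TOP"  # hypothetical marker for the unknown integer


def has_cycle(delta, start):
    """Depth-first search for a back edge among states reachable from start."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(q):
        color[q] = GREY
        for nxt in delta.get(q, {}).values():
            c = color.get(nxt, WHITE)
            if c == GREY or (c == WHITE and visit(nxt)):
                return True
        color[q] = BLACK
        return False

    return visit(start)


def accepted_lengths(delta, start, finals):
    """Lengths of the strings accepted by an acyclic automaton."""
    lengths, layer, n = set(), {start}, 0
    while layer:  # terminates because the automaton is acyclic
        if layer & finals:
            lengths.add(n)
        layer = {nxt for q in layer for nxt in delta.get(q, {}).values()}
        n += 1
    return lengths


def le(delta, start, finals):
    """A single length if all accepted strings agree on it, TOP otherwise."""
    if has_cycle(delta, start):
        return TOP
    lengths = accepted_lengths(delta, start, finals)
    return lengths.pop() if len(lengths) == 1 else TOP


# Automaton for {"ab", "cd"}: both strings have length 2
print(le({0: {"a": 1, "c": 2}, 1: {"b": 3}, 2: {"d": 3}}, 0, {3}))
```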
Theorem 4.
LE is sound and complete. Formally,
$$\forall A \in \mathrm{DFA}_{/\equiv}.\ \mathrm{Le}(\mathcal{L}(A)) = \gamma(\mathsf{LE}(A))$$

5.4. Abstract Semantics of Concat

The abstract semantics of string concatenation is $\mathsf{CC} : \mathrm{DFA}_{/\equiv} \times \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$ and returns the concatenation of the input automata. Since regular languages are closed under concatenation, so are finite state automata. Hence, $\mathsf{CC}$ exactly implements the standard concatenation operation between automata. Given the closure property on automata, the following result holds.
Theorem 5.
The function $\mathsf{CC}$ is sound and complete. Formally, $\forall A, A' \in \mathrm{DFA}_{/\equiv}$.
$$\mathrm{Cc}(\mathcal{L}(A), \mathcal{L}(A')) = \mathcal{L}(\mathsf{CC}(A, A'))$$
As we have already mentioned before, completeness holds thanks to the closure properties of regular languages (and in turn of finite state automata).

5.5. Abstract Semantics of StartsWith

The abstract semantics of startsWith takes as input two automata and checks whether a string of the language of the first automaton starts with a string of the language of the second one. It is captured by the function $\mathsf{SW} : \mathrm{DFA}_{/\equiv} \times \mathrm{DFA}_{/\equiv} \to B$, computed by Algorithm 2, where $\mathsf{maxString} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$ returns the (minimal) automaton recognizing the longest string of the automaton given as input, and $\mathsf{isSinglePath} : \mathrm{DFA}_{/\equiv} \to \{\mathsf{true}, \mathsf{false}\}$ checks whether the input automaton $A = (Q, \Sigma, \delta, q_0, F)$ respects the following condition: $\delta = \bigcup_{i \in [0, |Q|]} (q_i, q_{i+1}, c)$. Informally, a single-path automaton is an automaton where, if we sort the strings of its language from the shortest to the longest, each string is a prefix of the next one. An example of a single-path automaton is reported in Figure 9b, where it is graphically clear that each state, excluding the initial and the last one, has one incoming and one outgoing transition. Since the longest string in a single-path automaton has all the other strings of the language as prefixes, it is sufficient to check whether the strings of an automaton $A$ start with that longest string. For example, let $\mathcal{L}(A) = \{\mathit{softer}\}$ and $\mathcal{L}(A') = \{s, \mathit{so}, \mathit{soft}\}$. The string s is a prefix of so, which is in turn a prefix of soft, so $A'$ is a single-path automaton. Therefore, in this case, it is sufficient to check whether softer starts with soft (the longest string of $\mathcal{L}(A')$) since, $A'$ being single-path, the other strings (s and so) are consequently prefixes of softer. Instead, consider $\mathcal{L}(A') = \{s, \mathit{no}\}$. It would be impossible for a string to start with both of them, since there is no prefix relation between them.
Algorithm 2: $\mathsf{SW} : \mathrm{DFA}_{/\equiv} \times \mathrm{DFA}_{/\equiv} \to B$ algorithm [listing: image in the original]
Algorithm 2 takes as input two automata denoted by $A$ and $A'$. Lines 1–9 handle some corner cases. If $\mathcal{L}(A') = \{\epsilon\}$, $\{\mathsf{true}\}$ is returned, since any string starts with $\epsilon$ (lines 1–3). If none of the prefixes of $A$ is recognized by $A'$, meaning that none of the strings recognized by $A$ starts with a string of $A'$, we can safely return $\{\mathsf{false}\}$ (lines 4–6). Finally, if at least one of the input automata has cycles, we return $\{\mathsf{true}, \mathsf{false}\}$ (lines 7–9). Lines 10–17 determine whether every string of $A'$ is the beginning of every string of $A$; if this cannot be established, $\top_{\mathsf{Bool}}$ is returned.
In order to explain our approach in lines 10–17, consider the automata $A$ and $A'$ reported in Figure 9. To be sure that every string recognized by $A'$ is the beginning of every string recognized by $A$, we need to check two conditions: (1) every string recognized by $A'$ is a prefix of its longest recognized string $\sigma$ and (2) each string in $A$ starts with $\sigma$ (all strings must have a common prefix). Only if both conditions hold can we safely return $\{\mathsf{true}\}$; otherwise we return $\top_{\mathsf{Bool}}$. In particular, (1) is checked by the function $\mathsf{isSinglePath}$ at line 10 and (2) is checked at lines 11–15. It is worth noting that if an automaton is single-path, then its longest string is unique (line 11).
In our example, both the strings p and pan in $\mathcal{L}(A')$ are prefixes of pan, which is the longest string recognized by $A'$, so we build $B$, the (minimal) automaton that recognizes pan, and $C$, with $\mathcal{L}(C) = \{\mathit{pan}, \mathit{koa}\}$, and compare them (line 13). We return $\{\mathsf{true}\}$ if $B$ and $C$ recognize the same language; otherwise we return $\top_{\mathsf{Bool}}$. In the other cases, as already mentioned, we return $\{\mathsf{true}, \mathsf{false}\}$. For example, in Figure 9, $\{\mathsf{true}, \mathsf{false}\}$ is returned because, although $A'$ is a single-path automaton, only the string $\mathit{panda} \in \mathcal{L}(A)$ begins with pan, namely the longest string of $\mathcal{L}(A')$.
Example 2.
Consider for example $A$ s.t. $\mathcal{L}(A) = \{\mathit{panda}, \mathit{panem}\}$, and $A'$ s.t. $\mathcal{L}(A') = \{p, \mathit{pan}\}$. $\mathsf{SW}(A, A')$ returns $\{\mathsf{true}\}$ since $A'$ is a single-path automaton and both strings of $A$ start with the longest string in $A'$, the string pan. Consider instead the automata $A$, with $\mathcal{L}(A) = \{\mathit{panda}, \mathit{koala}\}$, and $A'$, with $\mathcal{L}(A') = \{p, k\}$. In this case, $\mathsf{SW}(A, A')$ returns $\{\mathsf{true}, \mathsf{false}\}$ since $A'$ is not a single-path automaton. Indeed, we can easily check that even if the string $\mathit{panda} \in \mathcal{L}(A)$ starts with $p \in \mathcal{L}(A')$, the string $\mathit{koala} \in \mathcal{L}(A)$ does not.
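The decision steps described above can be replayed on finite languages. The sketch below is a simplification under stated assumptions: languages are Python sets (so the cyclic-automaton corner case never arises), and `sw`, `prefixes` and `is_single_path` are hypothetical names that mirror, but do not reproduce, Algorithm 2.

```python
TRUE = frozenset({True})
FALSE = frozenset({False})
TOP = frozenset({True, False})


def prefixes(lang):
    """All prefixes of all strings of a finite language."""
    return {s[:k] for s in lang for k in range(len(s) + 1)}


def is_single_path(lang):
    """Sorted by length, each string is a prefix of the next one."""
    chain = sorted(lang, key=len)
    return all(b.startswith(a) for a, b in zip(chain, chain[1:]))


def sw(lang, lang2):
    """Abstract startsWith: do the strings of lang start with those of lang2?"""
    if lang2 == {""}:
        return TRUE                      # every string starts with epsilon
    if not (prefixes(lang) & lang2):
        return FALSE                     # no string of lang2 is a prefix
    if is_single_path(lang2):
        longest = max(lang2, key=len)
        if all(s.startswith(longest) for s in lang):
            return TRUE
    return TOP


print(sw({"panda", "panem"}, {"p", "pan"}))
print(sw({"panda", "koala"}, {"p", "k"}))
```

The first call corresponds to the first case of Example 2 and the second call to the case where the second language is not single-path.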
Theorem 6.
SW is sound but not complete. Formally,
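$$\forall A, A' \in \mathrm{DFA}_{/\equiv}.\ \mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) \subseteq \mathsf{SW}(A, A')$$
As a counterexample to completeness, consider the automaton $A$ s.t. $\mathcal{L}(A) = \{a^n \mid n > 1\}$ and the automaton $A'$ s.t. $\mathcal{L}(A') = \{a\}$. The completeness condition is not met; indeed,
$$\mathsf{SW}(A, A') = \{\mathsf{true}, \mathsf{false}\} \neq \mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\}$$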

5.6. Abstract Semantics of ToLowerCase

The abstract semantics of toLowerCase is defined by the function $\mathsf{LC} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$, which returns an automaton that recognizes the same strings as the input automaton, where any upper-case symbol is replaced with the corresponding lower-case symbol. $\mathsf{LC}$ is computed by Algorithm 3.
Algorithm 3: $\mathsf{LC} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$ algorithm [listing: image in the original]
Starting from an input automaton $A$, the idea is to return the automaton $A'$, which is a copy of $A$ except that any upper-case symbol read by a transition is replaced by its corresponding lower-case symbol. Transitions that already read lower-case or special symbols are unaltered. An example is reported in Figure 10.
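The relabelling idea can be sketched in a few lines. The encoding of transitions as `(source, symbol, target)` triples and the name `lc` are assumptions made for illustration; Python's `str.lower` stands in for the lower-case mapping.

```python
def lc(delta):
    """Relabel each transition (q, symbol, q2) with the lower-case symbol."""
    return {(q, symbol.lower(), q2) for (q, symbol, q2) in delta}


# 'A' and 'a' leave state 0 towards different states
delta = {(0, "A", 1), (0, "a", 2), (1, "!", 3)}
print(lc(delta))
```

In this example two transitions from state 0 end up with the same label, so the relabelled automaton can be non-deterministic and would need to be determinized (and minimized) before being used as an element of the abstract domain.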
Theorem 7.
LC is both sound and complete. Formally,
$$\forall A \in \mathrm{DFA}_{/\equiv}.\ \mathrm{Lc}(\mathcal{L}(A)) = \mathcal{L}(\mathsf{LC}(A))$$

5.7. Abstract Semantics of Includes

The abstract semantics of includes is defined by the function $\mathsf{IN} : \mathrm{DFA}_{/\equiv} \times \mathrm{DFA}_{/\equiv} \to B$. It takes as input two automata $A$ and $A'$ and checks whether a string recognized by $A'$ is a substring of a string recognized by $A$. The function $\mathsf{IN}$ is computed by Algorithm 4, where, given a path $p$ of an automaton $A$, we abuse notation and denote by $\mathsf{Min}(p)$ the automaton that recognizes the string encoded by the path $p$ (lines 11–12). The algorithm first checks some corner cases: if $A'$ only recognizes the empty string, $\{\mathsf{true}\}$ is returned, since the empty string is always a substring of the strings of a non-empty automaton (lines 2–4); if none of the substrings of $A$ is contained in $A'$, $\{\mathsf{false}\}$ is returned (lines 5–7); and if one of the input automata is cyclic, $\top_{\mathsf{Bool}}$ is returned (lines 8–10). When these corner cases are excluded, we check each string recognized by $A$. If the algorithm finds at least one string $\sigma'$ in $\mathcal{L}(A')$ that is not a substring of a string $\sigma$ of $A$, $\top_{\mathsf{Bool}}$ is returned; otherwise $\{\mathsf{true}\}$ is returned. This is done in lines 10–14 where, for each path $p$ of $A$, we create $\mathsf{Min}(p)$ and check whether its factor automaton, intersected with $A'$, equals $A'$, i.e., whether it contains every string of $A'$.
Algorithm 4: $\mathsf{IN} : \mathrm{DFA}_{/\equiv} \times \mathrm{DFA}_{/\equiv} \to B$ algorithm [listing: image in the original]
For example, consider the automata $A$ and $A'$ reported in Figure 11. The algorithm returns $\top_{\mathsf{Bool}}$ since the string $\mathit{fg} \in \mathcal{L}(A')$ is not a substring of $\mathit{abc} \in \mathcal{L}(A)$. Another example is reported in Figure 12. $\mathsf{IN}(A, A')$ returns $\{\mathsf{true}\}$ since $\forall \sigma \in \mathcal{L}(A), \sigma' \in \mathcal{L}(A')$. $\sigma'$ is a substring of $\sigma$.
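The same case analysis can be sketched on finite languages. As before, this is an illustration under assumptions rather than Algorithm 4 itself: languages are Python sets, so the cyclic case cannot occur, and the names `includes` and `factors` are hypothetical.

```python
TRUE = frozenset({True})
FALSE = frozenset({False})
TOP = frozenset({True, False})


def factors(lang):
    """All substrings (factors) of all strings of a finite language."""
    return {s[a:b]
            for s in lang
            for a in range(len(s) + 1)
            for b in range(a, len(s) + 1)}


def includes(lang, lang2):
    """Abstract includes: are the strings of lang2 substrings of those of lang?"""
    if lang2 == {""}:
        return TRUE                   # the empty string is always included
    if not (factors(lang) & lang2):
        return FALSE                  # no string of lang2 occurs anywhere
    if all(t in s for s in lang for t in lang2):
        return TRUE                   # every string contains every candidate
    return TOP


print(includes({"abc", "abcd"}, {"bc", "b"}))
print(includes({"abc", "xfg"}, {"fg"}))
```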
Theorem 8.
IN is sound but not complete. Formally,
$$\forall A, A' \in \mathrm{DFA}_{/\equiv}.\ \mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) \subseteq \mathsf{IN}(A, A').$$
As a counterexample to completeness, consider the automaton $A$ s.t. $\mathcal{L}(A) = \{a^n \mid n > 1\}$ and the automaton $A'$ s.t. $\mathcal{L}(A') = \{a\}$. The completeness condition is not met; indeed
$$\mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\} \neq \mathsf{IN}(A, A') = \{\mathsf{true}, \mathsf{false}\}$$
since when one of the input automata is cyclic, Algorithm 4 returns $\top_{\mathsf{Bool}}$.

5.8. Abstract Semantics of Repeat

The abstract semantics of repeat is defined by the function $\mathsf{RT} : \mathrm{DFA}_{/\equiv} \times \mathsf{Const} \to \mathrm{DFA}_{/\equiv}$ that, given as input an automaton $A$ and a constant integer value $i$, returns an automaton that recognizes any string of $\mathcal{L}(A)$ repeated $i$ times. $\mathsf{RT}$ is computed by Algorithm 5, and we suppose that the abstract integer value $i$ is positive or zero; any negative value is treated as zero. The algorithm first checks some corner cases. If $i = 0$ or the input automaton only recognizes the empty string, then $\mathsf{Min}(\{\epsilon\})$ is returned (lines 1–3). If the automaton has a cycle or $i = \top_{\mathsf{Const}}$, it returns the Kleene closure of the input automaton (lines 4–6). If none of these corner cases is detected then, for each string in $\mathcal{L}(A)$, we concatenate it with itself $(i - 1)$ times using the already defined $\mathsf{CC}$. The result is the union of all the concatenated automata.
Algorithm 5: $\mathsf{RT} : \mathrm{DFA}_{/\equiv} \times \mathsf{Const} \to \mathrm{DFA}_{/\equiv}$ algorithm [listing: image in the original]
Let us consider the automaton $A$ reported in Figure 13a and suppose we call $\mathsf{RT}(A, 2)$. The resulting automaton, obtained by applying Algorithm 5, is reported in Figure 13b. Now suppose we call $\mathsf{RT}(A, \top_{\mathsf{Const}})$. In this case, since the input integer value is not determinate, Algorithm 5 returns the Kleene-star automaton of $A$, and the result is reported in Figure 13c.
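The reason for repeating each string separately, rather than concatenating the whole automaton with itself, can be seen on a small finite language. The sketch below uses Python sets in place of automata; `rt` and `cc` are hypothetical names for the per-string repetition and for language concatenation.

```python
def cc(lang1, lang2):
    """Concatenation of two finite languages."""
    return {a + b for a in lang1 for b in lang2}


def rt(lang, i):
    """Repeat every string of a finite language i times (negative i as zero)."""
    i = max(i, 0)
    return {s * i for s in lang}


lang = {"a", "b"}
print(rt(lang, 2))     # each string repeated with itself
print(cc(lang, lang))  # strictly larger: it also mixes different strings
```

Concatenating the language with itself would also produce strings such as ab, which no call to repeat can return, so the per-string construction is the precise one when the language is finite.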
Theorem 9.
RT is sound but not complete. Formally,
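$$\forall A \in \mathrm{DFA}_{/\equiv},\ \forall i \in \mathsf{Const}.\ \mathrm{Rt}(\mathcal{L}(A), \gamma(i)) \subseteq \mathcal{L}(\mathsf{RT}(A, i))$$
As a counterexample to completeness, consider the automaton $A$ s.t. $\mathcal{L}(A) = \{ab^n \mid n \in \mathbb{N}\}$. The completeness condition is not met; indeed
$$\mathrm{Rt}(\mathcal{L}(A), 2) = \{ab^n ab^n \mid n \in \mathbb{N}\} \neq \mathcal{L}(\mathsf{RT}(A, 2)) = \{(ab^n)^m \mid n, m \in \mathbb{N}\}$$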
since when the input automaton is cyclic, Algorithm 5 returns the Kleene closure of the input automaton.

5.9. Abstract Semantics of TrimLeft, TrimRight and Trim

In this section, we show the abstract semantics of the trimLeft, trimRight and trim operations. The abstract semantics of trimLeft is defined by the function $\mathsf{TL} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$. In particular, it takes as input an automaton $A$ and returns an automaton accepting the strings of $A$ with any consecutive white spaces at the beginning of each string removed. In the following, we denote a white space by ␣. The function is computed by Algorithm 6. The idea of the algorithm is to iteratively replace white-space transitions from the initial state with $\epsilon$-transitions (lines 5–7), while leaving the other transitions unaltered (lines 7–9). At each iteration, the resulting automaton is minimized, and hence determinized (line 11). This operation is repeated until the initial state has no white-space transitions, checking the condition that white space is not a prefix of the automaton (line 3). Figure 14 depicts an example of the application of our algorithm.
Algorithm 6: $\mathsf{TL} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$ algorithm [listing: image in the original]
Theorem 10.
$\mathsf{TL}$ is sound and complete. Formally, $\forall A \in \mathrm{DFA}_{/\equiv}$,
$$\mathrm{Tl}(\mathcal{L}(A)) = \mathcal{L}(\mathsf{TL}(A))$$
The abstract semantics of trimRight can be defined in terms of the already defined function $\mathsf{TL}$. Indeed, the abstract semantics $\mathsf{TR} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$ reverses the input automaton, applies $\mathsf{TL}$ and finally reverses the automaton so obtained. Formally,
$$\mathsf{TR}(A) \stackrel{\mathrm{def}}{=} \mathsf{reverse}(\mathsf{TL}(\mathsf{reverse}(A)))$$
Similarly, the abstract semantics of trim applies both the abstract semantics of trimLeft and trimRight. Thus, the abstract semantics of trim is captured by the function $\mathsf{TM} : \mathrm{DFA}_{/\equiv} \to \mathrm{DFA}_{/\equiv}$, defined as
$$\mathsf{TM}(A) \stackrel{\mathrm{def}}{=} \mathsf{TR}(\mathsf{TL}(A))$$
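The two definitions compose existing operations, which can be checked on finite languages. In this sketch, languages are Python sets, white space is restricted to the single space character, and the function names `reverse`, `tl`, `tr` and `tm` are chosen here only to mirror the operators above.

```python
def reverse(lang):
    """Reverse every string of a finite language."""
    return {s[::-1] for s in lang}


def tl(lang):
    """trimLeft: drop leading spaces from every string."""
    return {s.lstrip(" ") for s in lang}


def tr(lang):
    """trimRight, defined through reversal and trimLeft."""
    return reverse(tl(reverse(lang)))


def tm(lang):
    """trim, defined as trimRight after trimLeft."""
    return tr(tl(lang))


lang = {"  ab ", "c  ", " d e "}
print(tr(lang))
print(tm(lang))
```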
Theorem 11.
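$\mathsf{TR}$ and $\mathsf{TM}$ are sound and complete. Formally, $\forall A \in \mathrm{DFA}_{/\equiv}$
$$\mathrm{Tr}(\mathcal{L}(A)) = \mathcal{L}(\mathsf{TR}(A)) \qquad \mathrm{Tm}(\mathcal{L}(A)) = \mathcal{L}(\mathsf{TM}(A))$$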
Proof. 
The proof of TR follows from the completeness of TL and reverse operations, while the proof of TM follows from the completeness of TL and TR . □

5.10. Concerning Abstract Implicit Type Conversion

In this section, we discuss the abstraction of implicit type conversion functions. We focus only on the conversion of automata into other values, since conversions concerning booleans, not-a-number and integers are standard. Let $\mathsf{toBool} : V \to B$ be applied to $A \in \mathrm{DFA}_{/\equiv}$: if $A \sqcap \mathsf{Min}(\{\epsilon\}) = \mathsf{Min}(\varnothing)$, it returns $\{\mathsf{true}\}$; when $A = \mathsf{Min}(\{\epsilon\})$, it returns $\{\mathsf{false}\}$; otherwise it returns $\top_{\mathsf{Bool}}$. Implicit type conversion to $\mathrm{DFA}_{/\equiv}$ is handled by the function $\mathsf{toStr} : V \to \mathrm{DFA}_{/\equiv}$. As far as non numeric strings are concerned, $\mathsf{toStr}$ returns $\mathsf{Min}(\{\mathit{NaN}\})$. If the input is the boolean value true [false], it returns $\mathsf{Min}(\{\mathit{true}\})$ [$\mathsf{Min}(\{\mathit{false}\})$]; otherwise it returns $\mathsf{Min}(\{\mathit{true}\}) \sqcup \mathsf{Min}(\{\mathit{false}\})$. Regarding abstract integers, if $i \in \mathbb{Z}$, then the automaton recognizing the string $\mathcal{S}(i)$ is returned (we recall that the function $\mathcal{S}(i)$ maps an integer $i$ to its numeric string representation); otherwise, i.e., when $i = \top_{\mathsf{Const}}$, the automaton recognizing any possible integer, reported in Figure 15, is returned. Finally, $\mathsf{toInt} : V \to \mathsf{Const} \cup \{\mathsf{NaN}\}$ handles conversion to constant integers. Given an automaton $A$, if $A \sqcap \mathsf{Min}(\Sigma_{\mathbb{Z}}) = \mathsf{Min}(\varnothing)$, the automaton is precisely converted to $\mathsf{NaN}$, since $A$ does not recognize any numeric string. Otherwise, if $A \sqsubseteq_{\mathrm{DFA}} \mathsf{Min}(\Sigma_{\mathbb{Z}})$, then $\mathcal{L}(A)$ contains only numeric strings. In particular, if $A$ recognizes only one numeric string, the corresponding integer is returned; otherwise $\top_{\mathsf{Const}}$ is returned.
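The case analyses of the conversions to booleans and to integers can be sketched on finite languages. Everything in this block is an assumption made for illustration: languages are Python sets, the names `to_bool`, `to_int`, `TOP_INT` and `NAN` are hypothetical, the regular expression is only one possible notion of numeric string, and the case of a language mixing numeric and non-numeric strings, which the text above does not spell out, is conservatively mapped to the unknown integer.

```python
import re

TRUE = frozenset({True})
FALSE = frozenset({False})
TOP_BOOL = frozenset({True, False})
TOP_INT, NAN = "TOP", "NaN"          # hypothetical markers
NUMERIC = re.compile(r"-?\d+")       # assumed shape of a numeric string


def to_bool(lang):
    """true if the empty string is excluded, false if it is the only string."""
    if "" not in lang:
        return TRUE
    if lang == {""}:
        return FALSE
    return TOP_BOOL


def to_int(lang):
    """NaN without numeric strings, an integer for a single numeric string."""
    numeric = {s for s in lang if NUMERIC.fullmatch(s)}
    if not numeric:
        return NAN
    if numeric == lang and len(numeric) == 1:
        return int(next(iter(numeric)))
    return TOP_INT  # assumption: also used for mixed languages


print(to_bool({"", "a"}))
print(to_int({"42"}))
```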

6. μFASA Implementation

In this section we present the μJS Finite-state Automata String Analyzer (μFASA), the string static analyzer integrating the finite state automata abstract domain, and the corresponding abstract semantics, presented in the previous sections.

6.1. Theoretical Concerns

It is worth noting that, as reported in Theorem 1, $\wp(\Sigma^*)$ (string concrete domain) and $\mathrm{DFA}_{/\equiv}$ (abstract string domain) do not form a Galois connection; however, this is not a concern. We have shown, for the core language we adopted, that the abstract semantics we have defined for string operations guarantees soundness. Hence, if the abstract interpreter starts from regular initial conditions (i.e., constraints expressible as finite state automata), it will always compute regular invariants.
An important implementation issue is computational complexity. The abstract semantics reported in this paper often relies on minimization of finite state automata in order to keep the automata that arise during abstract computations determinized and minimized. Our library relies on Brzozowski's algorithm, which has exponential complexity in the worst case. This is not a problem in practice: the algorithm is extremely fast on average and consistently outperforms other minimization algorithms (e.g., Hopcroft's algorithm, having average-case complexity $O(n \log n)$, where $n$ is the number of states), as reported in [36]. Moreover, minimization is only applied when the input automaton is non-deterministic.
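For reference, Brzozowski's algorithm minimizes an automaton by reversing it, determinizing it, and then reversing and determinizing once more. The following is a minimal sketch of that textbook construction, not the code of the library mentioned above: the encoding of an automaton as a triple `(initials, finals, delta)`, with `delta` mapping `(state, symbol)` to a set of states, is an assumption, and dead states are omitted so the resulting automaton is partial.

```python
from collections import defaultdict


def reverse(initials, finals, delta):
    """Flip every transition and swap initial and final states."""
    rev = defaultdict(set)
    for (q, a), targets in delta.items():
        for t in targets:
            rev[(t, a)].add(q)
    return set(finals), set(initials), dict(rev)


def determinize(initials, finals, delta):
    """Subset construction restricted to reachable, non-empty subsets."""
    alphabet = {a for (_, a) in delta}
    start = frozenset(initials)
    seen, todo, det = {start}, [start], {}
    while todo:
        subset = todo.pop()
        for a in alphabet:
            target = frozenset(t for q in subset for t in delta.get((q, a), ()))
            if not target:
                continue  # omit the dead state
            det[(subset, a)] = {target}
            if target not in seen:
                seen.add(target)
                todo.append(target)
    det_finals = {s for s in seen if s & finals}
    return {start}, det_finals, det


def brzozowski(initials, finals, delta):
    """Minimize by reverse, determinize, reverse, determinize."""
    return determinize(*reverse(*determinize(*reverse(initials, finals, delta))))


def accepts(initials, finals, delta, word):
    """Run the (possibly non-deterministic) automaton on a word."""
    current = set(initials)
    for a in word:
        current = {t for q in current for t in delta.get((q, a), ())}
    return bool(current & finals)


# Automaton for {"ab", "cb"} with redundant states
delta = {(0, "a"): {1}, (0, "c"): {2}, (1, "b"): {3}, (2, "b"): {4}}
minimal = brzozowski({0}, {3, 4}, delta)
print(accepts(*minimal, "ab"), accepts(*minimal, "aa"))
```

The exponential worst case comes from the two subset constructions; on the small example above the five original states collapse into three.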

6.2. Implementation

μFASA is a string static analyzer for extended μJS inter-procedural programs and it is built upon the finite state automata abstract domain described above. (Available at www.github.com/SPY-Lab/mufasa) It analyzes string variables and is also able to express associative arrays. The finite state automata abstract domain has been implemented as an external library (Available at www.github.com/SPY-Lab/java-fsm-library), offering a suitable and easy way to plug the domain into existing static analyzers, such as [3,4,5,37]. The library includes the implementation of all the algorithms concerning the finite state automata domain and provides well-known operations on automata, such as suffix and right quotient, abstract domain-related operations, such as $\sqcup_{\mathrm{DFA}}$ and $\sqcap_{\mathrm{DFA}}$, and the parametric widening for tuning precision and forcing convergence.
In addition to the string operations (and the corresponding automata-based abstract semantics) introduced in this paper, μFASA also analyzes functions that can be defined as compositions of the ones presented here (e.g., endsWith w.r.t. startsWith, slice w.r.t. substring). The full list of implemented string operations is reported in Table 2, which also summarizes for which operations soundness and completeness hold and the average complexity of their algorithms (w.r.t. the constant integer abstract domain).

6.3. Extension to Interval Abstract Domain

For the presentation of this paper, in Section 4.1, we have chosen to abstract integer values to the constant integer abstract domain. Of course, this affects the abstract semantics of those string operations that involve them, namely substring, charAt, repeat and length, while the other methods only use strings or booleans. Nevertheless, μFASA abstracts integer values to the more precise interval abstract domain [9], i.e., to the set $\mathsf{Intervals}$.
$$\mathsf{Intervals} \stackrel{\mathrm{def}}{=} \{[a, b] \mid a, b \in \mathbb{Z} \cup \{-\infty, +\infty\},\ a \leq b\} \cup \{\bot\}$$
The choice of presenting the automata-based abstract semantics with constant integers, rather than intervals, was driven by the wish not to burden the definition of the abstract semantics of the string operations involving integers. Let us consider the interval-based substring abstract semantics. Since intervals can be unbounded (e.g., $[5, +\infty]$), more than 20 different cases have been identified in its abstract semantics [8]. Given $\mathsf{substr}(A, [a, b], [c, d])$, for some $A \in \mathrm{DFA}_{/\equiv}$, these cases include $b = +\infty$ with $d$ a definite value, $b$ definite with $d = +\infty$, and $b, d = +\infty$ with $a, c$ definite values under a further ordering constraint between $a$ and $c$, to cite only a few. Moreover, the interval-based abstract semantics does not add any further important technical detail to our contribution, since the cases cited above, which arise in an interval-based analysis, were handled in an ad hoc manner and would have made this paper harder to follow. In particular, since the constant integer abstract domain is strictly contained in the interval one, restricting the presentation to constant integers allowed us to report only the technically meaningful cases, avoiding the others (related to intervals), which are handled in specific ways and are relevant only for the implementation.
Nevertheless, μFASA implements intervals (which include constant integers) and, accordingly, the interval-based abstract semantics of substring, charAt, length and repeat, as reported in [8]. The abstract semantics of the other string operations remains unaffected by the change. Just as an example, in the following we report the abstract semantics of length on intervals.

length Abstract Semantics with Intervals

The constant integer domain leads to a big loss in precision in the abstract semantics of length, reported in Section 5.3. The idea behind the algorithm capturing its abstract semantics is to check whether all the strings recognized by the input automaton have the same length $l \in \mathbb{N}$. If so, $l$ is returned as result; otherwise $\top_{\mathsf{Const}}$ is returned. Clearly, this is a forced choice, given that the constant integer abstract domain is only able to track a single integer value. In this sense, the precision of the abstract semantics of length can be improved when we deal with intervals rather than constant integers. Algorithm 7 reports the abstract semantics of length using the interval abstract domain. We compute the minimum and the maximum path reaching each final state in the automaton and then we abstract the set of lengths so obtained into an interval. Problems arise when the automaton contains cycles. In that case, we return the unbounded interval from the length of the minimum path to a final state to $+\infty$.
Clearly, using the interval abstract domain produces more precise results for certain operations, but it complicates the abstract semantics of others.
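The interval-based idea can be sketched as follows. This is an illustrative reconstruction from the description above, not Algorithm 7: the automaton encoding (a dictionary from states to symbol-labelled successors), the function names and the use of `None` for the empty interval are assumptions, and the automaton is assumed to be trimmed (every state lies on a path to a final state).

```python
import math


def has_cycle(delta, start):
    """Depth-first search for a back edge among states reachable from start."""
    color = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(q):
        color[q] = 1
        for nxt in delta.get(q, {}).values():
            c = color.get(nxt, 0)
            if c == 1 or (c == 0 and visit(nxt)):
                return True
        color[q] = 2
        return False

    return visit(start)


def shortest_accepted(delta, start, finals):
    """Breadth-first search for the length of the shortest accepted string."""
    layer, seen, n = {start}, {start}, 0
    while layer:
        if layer & finals:
            return n
        layer = {t for q in layer for t in delta.get(q, {}).values()} - seen
        seen |= layer
        n += 1
    return None  # empty language


def longest_accepted(delta, start, finals):
    """Length of the longest accepted string; acyclic automata only."""
    best, layer, n = None, {start}, 0
    while layer:
        if layer & finals:
            best = n
        layer = {t for q in layer for t in delta.get(q, {}).values()}
        n += 1
    return best


def le_interval(delta, start, finals):
    """Interval (lo, hi) of accepted lengths; hi is math.inf on cycles."""
    lo = shortest_accepted(delta, start, finals)
    if lo is None:
        return None
    if has_cycle(delta, start):
        return (lo, math.inf)
    return (lo, longest_accepted(delta, start, finals))


# Automaton for {"a", "bc"}: lengths range over [1, 2]
print(le_interval({0: {"a": 1, "b": 2}, 2: {"c": 1}}, 0, {1}))
```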

6.4. Qualitative Evaluation of μFASA

In this section, we evaluate the precision of μFASA and, in turn, of the finite state automata abstract domain. In particular, we discuss two string manipulation programs. The first is the one already introduced in Section 1, namely an obfuscated malware manipulating strings and transforming them into code by using eval, while the second is a benevolent function taken from a real-world string manipulation program. In both cases, we will show that important string information can be obtained by μFASA.
Algorithm 7: $\mathsf{LE} : \mathrm{DFA}_{/\equiv} \to \mathsf{Intervals}$ algorithm [listing: image in the original]

6.4.1. Obfuscated Malware

Consider the fragment reported in Figure 1 in the introduction. By analyzing it with μFASA, we obtain that the abstract value of d, at the eval call, is the automaton $A_d$ in Figure 16. The cycles are caused by the application of widening in the analysis of while loops.
From this automaton we are able to retrieve some important and non-trivial information. For example, we are able to answer the following question: may $A_d$ contain a string corresponding to an assignment to an ActiveXObject? We can answer by checking the predicate $A_d \sqcap \mathsf{Min}(\mathsf{Id} \cdot \{\texttt{new ActiveXObject(}\} \cdot \Sigma^* \cdot \{\texttt{)}\}) \neq \mathsf{Min}(\varnothing)$, i.e., by checking whether $A_d$ recognizes strings that are concatenations of any identifier with the string new ActiveXObject, followed by any possible string. In the example, the predicate returns true. Another interesting question could be: may $A_d$ contain the string eval? We can also answer that by checking whether $A_d \sqcap \mathsf{Min}(\{\mathit{eval}\}) \neq \mathsf{Min}(\varnothing)$. In this case the check returns false, supporting the conclusion that no explicit call to eval can occur.
This analysis may lose precision during fix-point computations, causing the cycles in the automaton in Figure 16, due to the application of widening. Nevertheless, it is worth noting that this result is obtained without any precision improvement on fix-point computations, such as loop unrolling, narrowing or widening with thresholds, which can surely be implemented in the future development of μFASA.

6.4.2. String Manipulation Program

In order to evaluate the precision of μFASA, we decided to create a benchmark of tests taken from real-world programs. We therefore selected string manipulating functions from popular modern frameworks (such as Mozilla useful methods, RXJava, Mockito) whose code can be easily found on GitHub. (The selected string manipulation functions are available at www.github.com/SPY-Lab/mufasa/src/test/resources, from where it is possible to trace them back to their origin.) Among this set of methods, we focus our attention on the precision of the function fixStations reported in Figure 17, taken from [38]. The function takes as input an object stations containing information about train stations (each item contains the three-letter station code, followed by some machine-readable data, followed by a semicolon, followed by the human-readable station name) and extracts the station code (in capital letters) and the station name. For instance, given the input stations = {st1: "MANay781;Manchester", st2: "gNfbx420;Greenfield"}, the function returns the object {st1: "MAN:Manchester", st2: "GNF:Greenfield"}.
Thus, given an object containing strings following the station information pattern previously described, the function fixStations returns another object containing strings following the pattern of three capital letters concatenated with a colon concatenated with a string. The goal of our analyzer is to exactly preserve this information on the variable result. Let us consider a statically unknown value of stations, namely stations = $\{\mathit{st1} : \sigma_1, \ldots, \mathit{stn} : \sigma_n\}$, where $n \in \mathbb{N}$ and $\sigma_i$ follows the station information pattern, for each $i \in [1, n]$. While other static analyzers, such as TAJS, which has a finite-height string abstract domain, lose any information about the returned string, μFASA is able to infer, for the variable result, the object $\{\mathit{st1} : p_1, \ldots, \mathit{stn} : p_n\}$, where each $p_i$ is a string abstract value, namely a finite state automaton, following the desired pattern
$$\sigma_1 \cdot \sigma_2 \cdot \sigma_3 \cdot {:} \cdot \sigma \quad \text{where } \sigma_i \text{ is capital},\ i \in [1, 3],\ \sigma \in \Sigma^*.$$
We are therefore able to preserve the string pattern that the function returns. As we have already highlighted, the result is obtained without implementing ad hoc improvements regarding loop computations, and we believe that even more precise results can be obtained by integrating such techniques. We believe the integration of these analyses will drastically decrease the false positives of the proposed string analysis (we address this topic in the future work section).

7. Discussion and Related Work

In this paper we have proposed an abstract semantics for a toy imperative language, μJS, augmented with string manipulation operations and expressive enough to handle dynamic typing and implicit type conversion. In our abstract semantics we have combined the DFA domain with abstract domains for the other primitive types, necessary to deal with static analysis of programs with dynamic typing. The proposed formal framework allows us to formally prove soundness and to study the precision of the abstract semantics of each string operation: depending on the property of interest, one can tune the degree of precision, namely the completeness of any string operation.

7.1. Analysis vs. Verification

Even if several solutions, also involving finite state machines, have been proposed for string solving and verification [21,39,40], it should be noted that our approach is placed instead in the context of string static analysis. Over the years, there has always been the intuition that program analysis is harder than verification: given a program, the aim of the former is to derive invariants for each program point, while the aim of the latter is to check whether a certain property holds for the given input program. Recently, this concept has been formalized from a computability point of view [41], confirming this belief. Therefore, our approach, placed in the context of static analysis of string manipulation programs, has goals that are hardly comparable with the solutions proposed in the context of verification, such as those cited above.

7.2. Main Related Works

The issue of analyzing strings is a widely studied problem and it has been tackled in literature from different points of view. Before discussing the most related works, we can observe what makes our approach original w.r.t. existing literature: (1) We provide a modular parametric abstract domain on the abstractions of the different primitive types, this allows us to obtain both a tunable semantics precision and to handle dynamic typing for operations having both integer and string parameters, such as substring; (2) our focus is on the characterization of a formal abstract interpretation-based framework where it is possible to prove soundness and to analyze the completeness of string operations, in order to understand where it is possible to tune precision versus efficiency. The main feature we have in common with existing works is the use of DFA (regular expressions) for abstracting strings. In [21], the authors propose symbolic string verifier for PHP based on finite state automata represented by a particular form of binary decision diagrams, the MBDD. Even if it could be interesting to understand whether this representation of DFAs may be used also for improving our algorithms, their work only considers operations exclusively involving strings (not also integers such as substring) and therefore it provides a solution for different string manipulations. In [20], the authors propose an abstract interpretation-based string analyzer approximating strings into a subset of regular languages, called regular strings and define the abstract semantics of four string operations of interest equipped with a widening. This is the most related work, but our approach is strictly more general, since we do not introduce any restriction to regular languages. In [19], the authors propose a scalable static analysis for jQuery that relies on a novel abstract domain of regular expressions. 
The abstract domain in [19] contains the finite state automata one but pursues a different task and does not provide semantics for string operations. Surely it may be interesting to integrate our library for string manipulation operators into SAFE. Finally, [42] proposes a lattice-based generalization of regular expression, formally illustrating a parametric abstract domain of regular expressions starting from a complete lattice of reference. However, this work does not tackle the problem of analyzing string manipulations, since it instantiates the parametric abstract domain in the network communication environment, analyzing the exchanged messages as regular expressions.
Finite state machines (transducers and automata) have also found a critical application in model checking, both for enforcing string constraints and for modeling infinite transition systems [43]. For example, the authors of [44] define a sound decision procedure for a regular language-based logic for the verification of string properties. The authors of [45] propose an automata abstraction in the context of regular model checking to tackle the well-known problem of state space explosion. Moreover, other formal systems similar to DFA have been proposed in the context of string analysis [46,47,48]. As future work, it would be interesting to study the relation between standard DFA and the other existing formal models, such as logics or other forms of FA.
In the context of JavaScript, several static analyzers have been proposed, driven by the wide range of applications of the language and by its security issues [3,4,5,37]. TAJS [3] is a static analyzer for JavaScript based on abstract interpretation. Its authors focus on allocation site abstraction, plugging the recency abstraction [49] into the static analyzer, which decreases the number of false positives when objects are accessed. On top of TAJS, a sound way to statically analyze a large range of non-trivial eval patterns has been defined in [50]. Loop-Sensitive Analysis (LSA), defined in [37], distinguishes loop iterations using loop strings, in the same way call strings distinguish function calls from different call sites in k-CFA [51]. The authors have implemented LSA in SAFE [5], a static analyzer for JavaScript web applications. As future work, it may be worthwhile to combine LSA with our abstract semantics in order to decrease the false positives introduced by the widening operator during fix-point computations.

7.3. Future Ideas

In this paper we have proposed a static string analysis for a set of relevant JavaScript string manipulation operations, whose semantics is inspired by the official ECMAScript specification [10]. The first goal is to integrate our abstract semantics into a static analyzer for JavaScript that uses finite state automata to approximate strings. In order to decrease the number of false positives of our string approximation in the presence of loops, several techniques can be used, such as loop unrolling and LSA [37]. The domain described in this paper has been equipped only with a widening, which enforces termination of fix-point computations but may lead to a large loss of precision. A narrowing will be studied and integrated into our static analyzer in order to recover some of the precision lost when the widening is applied.
We conclude by observing, as already highlighted in [7], the important application of finite state automata to the analysis of string-to-code primitives. Consider, for instance, the eval function of JavaScript, which transforms strings into code. Our semantics is sound and precise enough to answer some non-trivial properties of interest. Indeed, in [7], the finite state automata domain and the corresponding abstract semantics for strings turned out to be the basis for a sound and sufficiently precise analysis of eval.

Author Contributions

V.A.: writing, conceptualization, formal analysis, original draft preparation, visualization, implementation; I.M.: writing, conceptualization, formal analysis, original draft preparation, visualization, supervision, methodology; S.X.: visualization, formal analysis, implementation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Verona under the 2017 grant project "Analyzing secuRity in the modErn Software" (ARES).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Selected Proofs

In this appendix we report all the long proofs of results presented in the paper. The proofs are listed in order of appearance.
Proof of Theorem 2
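The collecting semantics of substring is defined by lifting the concrete semantics defined in Section 3 as follows, where $S \in \wp(\Sigma^*)$ and $I, J \in \wp(\mathbb{Z})$.
$$\mathrm{Ss}(S, I, J) = \{\mathrm{Ss}(\sigma, i, j) \mid \sigma \in S,\ i \in I,\ j \in J\}$$
In order to prove soundness and completeness of $\mathsf{SS}$, we need to prove that $\forall A \in \mathrm{Dfa}_{/\equiv},\ i, j \in \mathsf{Const}$
$$\mathrm{Ss}(\mathcal{L}(A), \gamma(i), \gamma(j)) = \mathcal{L}(\mathsf{SS}(A, i, j))$$
We split the proof into the following cases. Since in the substring semantics any negative value is treated as zero, in the proof we suppose w.l.o.g. that when a negative value arises it is treated as zero.
  • $\gamma(i) = \{l\}$, $\gamma(j) = \{k\}$, with $l, k \in \mathbb{Z}$: let us suppose, w.l.o.g., that $l < k$ (otherwise the indexes are swapped).
    $$\begin{aligned}
    \mathrm{Ss}(\mathcal{L}(A), \{l\}, \{k\})
    &= \{\sigma_l \cdots \sigma_k \mid \sigma \in \mathcal{L}(A),\ k < |\sigma|\} \cup \{\sigma_l \cdots \sigma_n \mid \sigma \in \mathcal{L}(A),\ k \geq n = |\sigma|\} \\
    &= \{y \mid \exists z \in \Sigma^*.\ yz \in \mathrm{Su}(\mathcal{L}(A), l),\ z \in \mathrm{Su}(\mathcal{L}(A), k),\ |y| = k - l,\ k < |\sigma|\} \cup \{y \mid y \in \mathrm{Su}(\mathcal{L}(A), l),\ y \in \Sigma^{<k-l}\} \\
    &= (\mathrm{Rq}(\mathrm{Su}(\mathcal{L}(A), l), \mathrm{Su}(\mathcal{L}(A), k)) \cap \Sigma^{k-l}) \cup (\mathrm{Su}(\mathcal{L}(A), l) \cap \Sigma^{<k-l}) \\
    &= \mathcal{L}((\mathsf{RQ}(\mathsf{SU}(A, i), \mathsf{SU}(A, j)) \sqcap \mathsf{Min}(\Sigma^{j-i})) \sqcup (\mathsf{SU}(A, i) \sqcap \mathsf{Min}(\Sigma^{<j-i}))) \\
    &= \mathcal{L}(\mathsf{SS}(A, i, j))
    \end{aligned}$$
  • $\gamma(i) = \mathbb{Z}$, $\gamma(j) = \{k\}$, with $k \in \mathbb{Z}$:
    $$\begin{aligned}
    \mathrm{Ss}(\mathcal{L}(A), \mathbb{Z}, \{k\})
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ l \in \mathbb{Z}\} \\
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ 0 \leq l < k\} \cup \{\mathrm{Ss}(\sigma, k, l) \mid \sigma \in \mathcal{L}(A),\ l \geq k \wedge l < |\sigma|\} \\
    &= \bigcup_{a \in [0,k]} \mathrm{Ss}(\mathcal{L}(A), a, k) \cup \mathrm{Fa}(\mathrm{Su}(\mathcal{L}(A), k)) \\
    &= \mathcal{L}\Big(\bigsqcup_{a \in [0,k]} \mathsf{SS}(A, a, k) \sqcup_{\mathrm{Dfa}} \mathsf{FA}(\mathsf{SU}(A, k))\Big) \\
    &= \mathcal{L}\Big(\bigsqcup_{a \in [0,k]} \mathsf{SS}(A, a, k) \sqcup_{\mathrm{Dfa}} \mathsf{SS}(A, k)\Big) \\
    &= \mathcal{L}(\mathsf{SS}(A, i, j))
    \end{aligned}$$
  • $\gamma(i) = \{l\}$ with $l \in \mathbb{Z}$, $\gamma(j) = \mathbb{Z}$:
    $$\begin{aligned}
    \mathrm{Ss}(\mathcal{L}(A), \{l\}, \mathbb{Z})
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ k \in \mathbb{Z}\} \\
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ k \geq l \wedge k \leq |\sigma|\} \cup \{\mathrm{Ss}(\sigma, k, l) \mid \sigma \in \mathcal{L}(A),\ 0 \leq k < l\} \\
    &= \bigcup_{a \in [0,l]} \mathrm{Ss}(\mathcal{L}(A), a, l) \cup \mathrm{Fa}(\mathrm{Su}(\mathcal{L}(A), l)) \\
    &= \mathcal{L}\Big(\bigsqcup_{a \in [0,l]} \mathsf{SS}(A, a, l) \sqcup_{\mathrm{Dfa}} \mathsf{FA}(\mathsf{SU}(A, l))\Big) \\
    &= \mathcal{L}\Big(\bigsqcup_{a \in [0,l]} \mathsf{SS}(A, a, l) \sqcup_{\mathrm{Dfa}} \mathsf{SS}(A, l)\Big) \\
    &= \mathcal{L}(\mathsf{SS}(A, i, j))
    \end{aligned}$$
  • $\gamma(i) = \gamma(j) = \mathbb{Z}$:
    $$\begin{aligned}
    \mathrm{Ss}(\mathcal{L}(A), \mathbb{Z}, \mathbb{Z})
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ l, k \in \mathbb{Z}\} \\
    &= \{\mathrm{Ss}(\sigma, l, k) \mid \sigma \in \mathcal{L}(A),\ l, k \geq 0,\ l, k < |\sigma|\} \\
    &= \mathrm{Fa}(\mathcal{L}(A)) = \mathcal{L}(\mathsf{FA}(A)) = \mathcal{L}(\mathsf{SS}(A, i, j))
    \end{aligned}$$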
 □
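The lifting used at the start of this proof can be illustrated on finite sets of strings. The following is a minimal sketch, assuming that the concrete substring coincides with the built-in `String.prototype.substring` (negative indexes clamped to zero, indexes swapped when the first exceeds the second); the helper name `collectSs` is hypothetical and is not part of the analyzer described in the paper.

```javascript
// Concrete substring on a single string: the built-in ECMAScript operation,
// which treats negative indexes as 0 and swaps the indexes when i > j.
const ss = (sigma, i, j) => sigma.substring(i, j);

// Collecting semantics (hypothetical helper): apply ss to every combination
// of a string in S, a start index in I and an end index in J.
function collectSs(S, I, J) {
  const out = new Set();
  for (const sigma of S)
    for (const i of I)
      for (const j of J) out.add(ss(sigma, i, j));
  return out;
}

const lang = new Set(["lang", "hello"]);
const r1 = collectSs(lang, new Set([1]), new Set([3]));  // constant indexes
const r2 = collectSs(lang, new Set([3]), new Set([1]));  // swapped indexes
const r3 = collectSs(lang, new Set([-2]), new Set([2])); // negative start index
console.log([...r1], [...r2], [...r3]);
```

Under these assumptions, `r1` and `r2` contain the same strings, which is the reason the first case of the proof can assume $l < k$ without loss of generality.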
Proof of Theorem 3
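The collecting semantics of charAt is defined by lifting the concrete semantics defined in Section 3 as follows, where $S \in \wp(\Sigma^*)$ and $I \in \wp(\mathbb{Z})$.
$$\mathrm{Ca}(S, I) = \{\mathrm{Ca}(\sigma, i) \mid \sigma \in S,\ i \in I\}$$
In order to prove soundness and completeness of $\mathsf{CA}$, we need to prove that $\forall A \in \mathrm{Dfa}_{/\equiv},\ i \in \mathsf{Const}$
$$\mathrm{Ca}(\mathcal{L}(A), \gamma(i)) = \mathcal{L}(\mathsf{CA}(A, i))$$
We split the proof into the following two cases.
  • Let us suppose that $i \neq \top_{\mathsf{Const}}$, hence $\gamma(i) = \{n\}$, where $n \in \mathbb{Z}$.
    $$\begin{aligned}
    \mathrm{Ca}(\mathcal{L}(A), \{n\})
    &= \{\mathrm{Ca}(\sigma, n) \mid \sigma \in \mathcal{L}(A)\} \\
    &= \{\sigma_n \mid \sigma \in \mathcal{L}(A),\ 0 \leq n < |\sigma|\} \cup \{\epsilon \mid \exists \sigma \in \mathcal{L}(A).\ n \geq |\sigma| \vee n < 0\} \\
    &= \{\mathrm{Ss}(\sigma, n, n+1) \mid \sigma \in \mathcal{L}(A),\ 0 \leq n < |\sigma|\} \cup \{\mathrm{Ss}(\sigma, n, n+1) \mid \sigma \in \mathcal{L}(A),\ n \geq |\sigma| \vee n < 0\} \\
    &= \{\mathrm{Ss}(\sigma, n, n+1) \mid \sigma \in \mathcal{L}(A)\} \\
    &= \mathrm{Ss}(\mathcal{L}(A), \{n\}, \{n+1\}) = \mathcal{L}(\mathsf{SS}(A, i, i+1)) = \mathcal{L}(\mathsf{CA}(A, i))
    \end{aligned}$$
  • Let us suppose that $i = \top_{\mathsf{Const}}$, hence $\gamma(i) = \mathbb{Z}$. It is worth noting that the function $\mathsf{chars}$ we used in the abstract semantics of charAt is complete. Let $\mathrm{Chars} : \wp(\Sigma^*) \rightarrow \wp(\Sigma)$ be the function that, given a set of strings, returns the set of characters occurring in any string of the input set. It holds that $\mathrm{Chars}(\mathcal{L}(A)) = \mathsf{chars}(A)$.
    $$\begin{aligned}
    \mathrm{Ca}(\mathcal{L}(A), \gamma(i))
    &= \{\mathrm{Ca}(\sigma, n) \mid \sigma \in \mathcal{L}(A),\ n \in [0, |\sigma| - 1]\} \cup \{\epsilon\} \\
    &= \{\sigma_n \mid \sigma \in \mathcal{L}(A),\ n \in [0, |\sigma| - 1]\} \cup \{\epsilon\} \\
    &= \mathrm{Chars}(\mathcal{L}(A)) \cup \{\epsilon\} \\
    &= \mathcal{L}(\mathsf{Min}(\mathsf{chars}(A)) \sqcup \mathsf{Min}(\{\epsilon\})) = \mathcal{L}(\mathsf{CA}(A, i))
    \end{aligned}$$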
 □
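Both cases of this proof can be checked on small inputs. The sketch below assumes that the concrete charAt and substring coincide with the built-in `String.prototype.charAt` and `String.prototype.substring`; the helper names `charAtViaSubstring`, `chars` and `collectCharAtUnknownIndex` are hypothetical. It compares charAt with substring on the indexes $n$ and $n+1$ (including negative and out-of-range values of $n$), and it computes the result for an unknown index as the set of characters plus the empty string.

```javascript
// First case: charAt at a known index n, expressed as substring(n, n + 1).
function charAtViaSubstring(sigma, n) {
  return sigma.substring(n, n + 1);
}

// Collect every (string, index) pair on which the two operations disagree.
const samples = ["", "a", "hello", "paper"];
const mismatches = [];
for (const sigma of samples)
  for (let n = -3; n <= sigma.length + 2; n++)
    if (sigma.charAt(n) !== charAtViaSubstring(sigma, n)) mismatches.push([sigma, n]);

// Second case: the set of characters occurring in some string of S.
function chars(S) {
  const out = new Set();
  for (const sigma of S) for (const c of sigma) out.add(c);
  return out;
}

// charAt with an unknown index: any character of any string, or the empty string.
function collectCharAtUnknownIndex(S) {
  const out = chars(S);
  out.add("");
  return out;
}

const unknown = collectCharAtUnknownIndex(new Set(["ab", "b"]));
console.log(mismatches.length, [...unknown]);
```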
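Proof of Theorem 4
The collecting semantics of length is defined by lifting the concrete semantics defined in Section 3 as follows, where $S \in \wp(\Sigma^*)$.
$$\mathrm{Le}(S) = \{|\sigma| \mid \sigma \in S\}$$
In order to prove soundness of $\mathsf{LE}$, we need to prove that $\forall A \in \mathrm{Dfa}_{/\equiv}$
$$\mathrm{Le}(\mathcal{L}(A)) \subseteq \gamma(\mathsf{LE}(A))$$
We split the proof into the following cases.
  • $\mathrm{Le}(\mathcal{L}(A)) = I \in \wp(\mathbb{Z})$, s.t. $|I| = 1$:
    $$|\mathrm{Le}(\mathcal{L}(A))| = 1 \Leftrightarrow \mathrm{Le}(\mathcal{L}(A)) = \{n\} \text{ for some } n \in \mathbb{N} \Leftrightarrow \forall \sigma \in \mathcal{L}(A).\ |\sigma| = n \Leftrightarrow \forall p \in \mathsf{Paths}(A),\ |p| = n$$
    This condition checks whether the size of any path of $A$ is $n$. This check is performed by Algorithm 1 at lines 5–8.
  • $\mathrm{Le}(\mathcal{L}(A)) = I \in \wp(\mathbb{Z})$, s.t. $|I| > 1$: this means that
    $$|\mathrm{Le}(\mathcal{L}(A))| > 1 \Leftrightarrow \exists \sigma, \sigma' \in \mathcal{L}(A).\ |\sigma| \neq |\sigma'|$$
    If $A$ is cyclic, then the condition at line 1 is successful and $\top_{\mathsf{Const}}$ is returned. Let us suppose that $A$ is not cyclic.
    $$|\mathrm{Le}(\mathcal{L}(A))| > 1 \Leftrightarrow \exists \sigma, \sigma' \in \mathcal{L}(A).\ |\sigma| \neq |\sigma'| \Leftrightarrow \exists p, p' \in \mathsf{Paths}(A).\ |p| \neq |p'|$$
    This condition is checked by lines 5–8 of Algorithm 1.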
 □
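The case analysis above can be mirrored by a short sketch over an explicit automaton. This is not Algorithm 1 from the paper: the encoding of an automaton as `{ initial, finals, delta }`, the name `abstractLength` and the `TOP` marker are assumptions made for illustration, and the automaton is assumed to have no useless states. The sketch returns a constant when every accepting path has the same length, and the top element otherwise (cyclic automaton, or paths of different lengths).

```javascript
// Marker standing for the top element of the constant domain (hypothetical).
const TOP = Symbol("TopConst");

// Hypothetical encoding: delta maps a state to a list of [symbol, target] pairs.
function abstractLength(A) {
  const lengths = new Set();  // lengths of the accepting paths found so far
  const onStack = new Set();  // states on the current depth-first path
  let cyclic = false;

  function visit(q, depth) {
    if (cyclic) return;
    if (onStack.has(q)) { cyclic = true; return; } // a state repeats: cycle
    if (A.finals.has(q)) lengths.add(depth);
    onStack.add(q);
    for (const [, target] of A.delta.get(q) ?? []) visit(target, depth + 1);
    onStack.delete(q);
  }

  visit(A.initial, 0);
  // One single path length gives a constant; anything else gives top.
  if (cyclic || lengths.size !== 1) return TOP;
  return [...lengths][0];
}

// {abc, abd}: every path has length 3.
const fixed = { initial: 0, finals: new Set([3]),
  delta: new Map([[0, [["a", 1]]], [1, [["b", 2]]], [2, [["c", 3], ["d", 3]]]]) };
// {p, pa}: paths of length 1 and 2.
const mixed = { initial: 0, finals: new Set([1, 2]),
  delta: new Map([[0, [["p", 1]]], [1, [["a", 2]]]]) };
// a*: cyclic automaton.
const looping = { initial: 0, finals: new Set([0]),
  delta: new Map([[0, [["a", 0]]]]) };

console.log(abstractLength(fixed), abstractLength(mixed) === TOP, abstractLength(looping) === TOP);
```

The path enumeration is exponential in the worst case; it is kept here only because it follows the statement of the proof directly.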
Proof of Theorem 6
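The collecting semantics of startsWith is defined by lifting the concrete semantics defined in Section 3 as follows, where $S, S' \in \wp(\Sigma^*)$.
$$\mathrm{Sw}(S, S') = \{\mathrm{Sw}(\sigma, \sigma') \mid \sigma \in S,\ \sigma' \in S'\}$$
In order to prove soundness of $\mathsf{SW}$, we need to prove that $\forall A, A' \in \mathrm{Dfa}_{/\equiv}$
$$\mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) \subseteq \mathsf{SW}(A, A')$$
We split the proof into the following cases.
  • Let us suppose that $\mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{false}\}$.
    $$\begin{aligned}
    \mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{false}\}
    &\Leftrightarrow \forall \sigma \in \mathcal{L}(A).\ \forall \sigma' \in \mathcal{L}(A').\ \nexists \phi \in \Sigma^*.\ \sigma' \cdot \phi = \sigma \\
    &\Leftrightarrow \mathrm{Pr}(\mathcal{L}(A)) \cap \mathcal{L}(A') = \varnothing \\
    &\Leftrightarrow \mathsf{PR}(A) \sqcap_{\mathrm{Dfa}} A' = \mathsf{Min}(\varnothing) \quad \text{(lines 4–6 of Algorithm 2)}
    \end{aligned}$$
  • Let us suppose that $\mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\}$. We split the proof into the following cases:
    - if $A' = \mathsf{Min}(\{\epsilon\})$: Algorithm 2 verifies the condition $(A' == \mathsf{Min}(\{\epsilon\}))$ at lines 1–3 and returns $\{\mathsf{true}\}$.
    - if $A$ or $A'$ is cyclic: Algorithm 2 verifies the condition $(\mathsf{hasCycle}(A) \vee \mathsf{hasCycle}(A'))$ at lines 7–9 and returns $\{\mathsf{true}, \mathsf{false}\}$.
    - if $A'$ is not a single-path automaton: in this case, the check at line 10 of Algorithm 2 detects that $A'$ is not a single-path automaton and $\{\mathsf{true}, \mathsf{false}\}$ is returned at line 17.
    - if $A'$ is a single-path automaton: let us denote by $\mathrm{MaxString}(\mathcal{L}(A'))$ the longest string recognized by $A'$. As we already highlighted, if $A'$ is single-path, the longest string is unique. Clearly, we have that $\mathrm{MaxString}(\mathcal{L}(A')) = \mathsf{maxString}(A')$. Let us denote $\mathrm{MaxString}(\mathcal{L}(A'))$ by $\sigma_m$.
    $$\begin{aligned}
    \mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\}
    &\Leftrightarrow \forall \sigma \in \mathcal{L}(A),\ \sigma' \in \mathcal{L}(A').\ \exists \phi \in \Sigma^*.\ \sigma' \cdot \phi = \sigma \\
    &\Leftrightarrow \forall \sigma \in \mathcal{L}(A)\ \exists \phi \in \Sigma^*.\ \sigma = \sigma_m \cdot \phi \\
    &\Leftrightarrow \mathrm{Ss}(\mathcal{L}(A), 0, |\sigma_m|) = \sigma_m \\
    &\Leftrightarrow \mathsf{SS}(A, 0, \mathsf{LE}(\mathsf{maxString}(A'))) = \mathsf{maxString}(A') \quad \text{(lines 10–15 of Algorithm 2)}
    \end{aligned}$$
  • Let us suppose that $\mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}, \mathsf{false}\}$. We split the proof into the following cases:
    - $A$ or $A'$ is cyclic: Algorithm 2 verifies the condition $(\mathsf{hasCycle}(A) \vee \mathsf{hasCycle}(A'))$ at lines 7–9 and returns $\{\mathsf{true}, \mathsf{false}\}$.
    - $A'$ is not a single-path automaton: the check at line 10 of Algorithm 2 fails and $\{\mathsf{true}, \mathsf{false}\}$ is returned at line 17.
    - $A'$ is a single-path automaton: as before, if $A'$ is single-path, the longest string is unique. Let us denote $\mathrm{MaxString}(\mathcal{L}(A'))$ by $\sigma_m$.
    $$\begin{aligned}
    \mathrm{Sw}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}, \mathsf{false}\}
    &\Rightarrow \exists \sigma \in \mathcal{L}(A),\ \sigma' \in \mathcal{L}(A')\ \forall \phi \in \Sigma^*.\ \sigma' \cdot \phi \neq \sigma \\
    &\Leftrightarrow \exists \sigma \in \mathcal{L}(A)\ \forall \phi \in \Sigma^*.\ \sigma_m \cdot \phi \neq \sigma \\
    &\Leftrightarrow \mathrm{Ss}(\mathcal{L}(A), 0, |\sigma_m|) \neq \sigma_m \\
    &\Leftrightarrow \mathsf{SS}(A, 0, \mathsf{LE}(\mathsf{maxString}(A'))) \neq \mathsf{maxString}(A') \quad \text{(lines 10–15 of Algorithm 2)}
    \end{aligned}$$
    The condition checked at line 13 of Algorithm 2 fails, hence $\{\mathsf{true}, \mathsf{false}\}$ is returned at line 17.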
 □
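The three possible outcomes of the collecting semantics, and the single-path criterion used in the proof, can be reproduced on finite sets of strings. The sketch below is an illustration only: it assumes the built-in `String.prototype.startsWith` as concrete semantics, and the helpers `collectStartsWith`, `maxString` and `allStartWithLongest` are hypothetical names operating on sets of strings rather than on automata. The first example uses the languages of Figure 9.

```javascript
// Collecting semantics of startsWith over finite sets of strings.
function collectStartsWith(S, Sprime) {
  const out = new Set();
  for (const sigma of S)
    for (const prefix of Sprime) out.add(sigma.startsWith(prefix));
  return out;
}

// Longest string of a finite set; it is unique when the set is a chain of
// prefixes, which is what a single-path automaton recognizes.
function maxString(Sprime) {
  return [...Sprime].reduce((best, s) => (s.length > best.length ? s : best), "");
}

// Criterion from the proof: the answer is {true} exactly when every string
// of S begins with the longest string of S'.
function allStartWithLongest(S, Sprime) {
  const m = maxString(Sprime);
  return [...S].every((sigma) => sigma.substring(0, m.length) === m);
}

const prefixes = new Set(["pan", "p"]);
const fig9 = collectStartsWith(new Set(["panda", "koala"]), prefixes);
const onlyTrue = collectStartsWith(new Set(["panda", "pants"]), prefixes);
const onlyFalse = collectStartsWith(new Set(["koala"]), prefixes);
console.log([...fig9], [...onlyTrue], [...onlyFalse]);
```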
Proof of Theorem 7
The soundness and completeness of LC follow from the fact that any upper-case transition found in A is replaced with the same transition reading the corresponding lower-case symbol, changing neither the direction of the transitions nor the automaton states. □
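The relabeling described in this proof can be sketched as follows. The automaton encoding `{ initial, finals, delta }` and the helpers `lowerCaseTransitions` and `language` are hypothetical and are not taken from the prototype; the example automaton recognizes the language of Figure 10. The sketch assumes single-character symbols whose lower-case form is again a single character, and it omits the determinization and minimization that may be needed when two transitions from the same state end up reading the same symbol.

```javascript
// Replace every transition symbol with its lower-case form, keeping states,
// transition direction and final states unchanged.
function lowerCaseTransitions(A) {
  const delta = new Map();
  for (const [q, edges] of A.delta)
    delta.set(q, edges.map(([symbol, target]) => [symbol.toLowerCase(), target]));
  return { initial: A.initial, finals: new Set(A.finals), delta };
}

// Enumerate the language of an acyclic automaton (used only to inspect results).
function language(A) {
  const words = new Set();
  (function walk(q, prefix) {
    if (A.finals.has(q)) words.add(prefix);
    for (const [symbol, target] of A.delta.get(q) ?? []) walk(target, prefix + symbol);
  })(A.initial, "");
  return words;
}

// Automaton recognizing {!Ab, CdE}.
const fig10 = {
  initial: 0,
  finals: new Set([3]),
  delta: new Map([
    [0, [["!", 1], ["C", 4]]],
    [1, [["A", 2]]],
    [2, [["b", 3]]],
    [4, [["d", 5]]],
    [5, [["E", 3]]],
  ]),
};

const lowered = language(lowerCaseTransitions(fig10));
console.log([...lowered]);
```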
Proof of Theorem 8
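The collecting semantics of includes is defined by lifting the concrete semantics defined in Section 3 as follows, where $S, S' \in \wp(\Sigma^*)$.
$$\mathrm{In}(S, S') = \{\mathrm{In}(\sigma, \sigma') \mid \sigma \in S,\ \sigma' \in S'\}$$
In order to prove soundness of $\mathsf{IN}$, we need to prove that $\forall A, A' \in \mathrm{Dfa}_{/\equiv}$
$$\mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) \subseteq \mathsf{IN}(A, A')$$
We split the proof into the following cases.
  • Let us suppose that $\mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{false}\}$.
    $$\begin{aligned}
    \mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{false}\}
    &\Leftrightarrow \forall \sigma' \in \mathcal{L}(A').\ \sigma' \notin \mathrm{Fa}(\mathcal{L}(A)) \\
    &\Leftrightarrow \mathcal{L}(A') \cap \mathrm{Fa}(\mathcal{L}(A)) = \varnothing \\
    &\Leftrightarrow A' \sqcap_{\mathrm{Dfa}} \mathsf{FA}(A) = \mathsf{Min}(\varnothing) \quad \text{(lines 4–6 of Algorithm 4)}
    \end{aligned}$$
  • Let us suppose that $\mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\}$. Thus, consider the following cases:
    - $A' = \mathsf{Min}(\{\epsilon\})$: Algorithm 4 verifies the condition $(A' == \mathsf{Min}(\{\epsilon\}))$ at lines 1–3 and returns $\{\mathsf{true}\}$.
    - $A$ or $A'$ is cyclic: Algorithm 4 verifies the condition $(\mathsf{hasCycle}(A) \vee \mathsf{hasCycle}(A'))$ at lines 7–9 and returns $\{\mathsf{true}, \mathsf{false}\}$.
    - $A' \neq \mathsf{Min}(\{\epsilon\})$ and $A, A'$ are not cyclic:
    $$\begin{aligned}
    \mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}\}
    &\Leftrightarrow \forall \sigma \in \mathcal{L}(A).\ \forall \sigma' \in \mathcal{L}(A')\ \exists \phi, \psi \in \Sigma^*.\ \phi \cdot \sigma' \cdot \psi = \sigma \\
    &\Leftrightarrow \forall \sigma \in \mathcal{L}(A).\ \mathrm{Fa}(\{\sigma\}) \cap \mathcal{L}(A') = \mathcal{L}(A') \\
    &\Leftrightarrow \forall p \in \mathsf{Paths}(A).\ \mathsf{FA}(\mathsf{Min}(p)) \sqcap_{\mathrm{Dfa}} A' = A'
    \end{aligned}$$
    This condition is verified in lines 11–15 of Algorithm 4 and in this case the algorithm returns $\{\mathsf{true}\}$.
  • Let us suppose that $\mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}, \mathsf{false}\}$. Thus, consider the following cases:
    - $A$ or $A'$ is cyclic: Algorithm 4 verifies the condition $(\mathsf{hasCycle}(A) \vee \mathsf{hasCycle}(A'))$ at lines 7–9 and returns $\{\mathsf{true}, \mathsf{false}\}$.
    - $A' \neq \mathsf{Min}(\{\epsilon\})$ and $A, A'$ are not cyclic:
    $$\begin{aligned}
    \mathrm{In}(\mathcal{L}(A), \mathcal{L}(A')) = \{\mathsf{true}, \mathsf{false}\}
    &\Rightarrow \exists \sigma \in \mathcal{L}(A)\ \exists \sigma' \in \mathcal{L}(A').\ \nexists \phi, \psi \in \Sigma^*.\ \phi \cdot \sigma' \cdot \psi = \sigma \\
    &\Leftrightarrow \exists \sigma \in \mathcal{L}(A).\ \mathrm{Fa}(\{\sigma\}) \cap \mathcal{L}(A') \neq \mathcal{L}(A') \\
    &\Leftrightarrow \exists p \in \mathsf{Paths}(A).\ \mathsf{FA}(\mathsf{Min}(p)) \sqcap_{\mathrm{Dfa}} A' \neq A'
    \end{aligned}$$
    This condition is verified in lines 11–15 of Algorithm 4 and in this case the algorithm returns $\{\mathsf{true}, \mathsf{false}\}$.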
 □
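The factor-based characterizations used in this proof can be checked on finite sets of strings. The sketch below assumes the built-in `String.prototype.includes` as concrete semantics; `factors`, `collectIncludes` and `disjointFromFactors` are hypothetical helpers working on sets of strings, not on automata. The examples use the languages of Figures 11 and 12.

```javascript
// All factors (contiguous substrings, including the empty string) of a set.
function factors(S) {
  const out = new Set([""]);
  for (const sigma of S)
    for (let i = 0; i < sigma.length; i++)
      for (let j = i + 1; j <= sigma.length; j++) out.add(sigma.substring(i, j));
  return out;
}

// Collecting semantics of includes over finite sets of strings.
function collectIncludes(S, Sprime) {
  const out = new Set();
  for (const sigma of S)
    for (const needle of Sprime) out.add(sigma.includes(needle));
  return out;
}

// Criterion for {false}: no string of S' is a factor of a string of S.
function disjointFromFactors(S, Sprime) {
  const fa = factors(S);
  return [...Sprime].every((needle) => !fa.has(needle));
}

const fig11 = collectIncludes(new Set(["abc", "abd", "efg"]), new Set(["ab", "fg"]));
const fig12 = collectIncludes(new Set(["panda", "candy", "andy"]), new Set(["an", "nd"]));
const never = collectIncludes(new Set(["abc"]), new Set(["xy"]));
console.log([...fig11], [...fig12], [...never]);
```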
Proof of Theorem 9
The collecting semantics of repeat is defined by lifting the concrete semantics defined in Section 3 as follows, where $S \in \wp(\Sigma^*)$ and $I \in \wp(\mathbb{Z})$.
$$\mathrm{Rt}(S, I) = \{\sigma^n \mid \sigma \in S,\ n \in I\}$$
In order to prove soundness of $\mathsf{RT}$, we need to prove that $\forall A \in \mathrm{Dfa}_{/\equiv},\ i \in \mathsf{Const}$
$$\mathrm{Rt}(\mathcal{L}(A), \gamma(i)) \subseteq \mathcal{L}(\mathsf{RT}(A, i))$$
We split the proof into the following two cases.
  • Let us suppose that $i \neq \top_{\mathsf{Const}}$, hence $\gamma(i) = \{n\}$, where $n \in \mathbb{Z}$. We split the proof into the following cases:
    - $i = 0$: $\mathrm{Rt}(\mathcal{L}(A), 0) = \{\epsilon\}$ and Algorithm 5 checks this condition and returns $\mathsf{Min}(\{\epsilon\})$ at lines 1–3.
    - $i \neq 0$:
      * if $A$ is s.t. $\mathcal{L}(A) = \{\epsilon\}$: since $\mathrm{Rt}(\{\epsilon\}, i) = \{\epsilon\}$, Algorithm 5 checks this condition and returns $\mathsf{Min}(\{\epsilon\})$ at lines 1–3.
      * if $A$ is cyclic: $\mathrm{Rt}(\mathcal{L}(A), i) \subseteq \mathcal{L}(\mathsf{Kleene}(A))$ and Algorithm 5 checks this condition and returns $\mathsf{Kleene}(A)$ at lines 4–6.
      * if $A$ is not cyclic:
    $$\mathrm{Rt}(\mathcal{L}(A), i) = \{\sigma^i \mid \sigma \in \mathcal{L}(A)\} = \{\underbrace{\sigma \cdot \sigma \cdots \sigma}_{i\ \text{times}} \mid \sigma \in \mathcal{L}(A)\} = \mathcal{L}\Big(\bigsqcup \{\underbrace{\mathsf{Min}(p) \cdot \mathsf{Min}(p) \cdots \mathsf{Min}(p)}_{i\ \text{times}} \mid p \in \mathsf{Paths}(A)\}\Big)$$
    In this case, Algorithm 5 returns the above automaton at lines 8–15.
  • Let us suppose that $i = \top_{\mathsf{Const}}$, hence we have that $\gamma(i) = \mathbb{Z}$.
    $$\mathrm{Rt}(\mathcal{L}(A), \gamma(i)) = \mathrm{Rt}(\mathcal{L}(A), \mathbb{Z}) = \{\sigma^n \mid \sigma \in \mathcal{L}(A),\ n > 0\} \subseteq \mathcal{L}(\mathsf{Kleene}(A))$$
    In this case, Algorithm 5 returns $\mathsf{Kleene}(A)$, guaranteeing the soundness of $\mathsf{RT}$.
 □
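The inclusion in the Kleene closure, and the fact that it is in general strict, can be observed on the language of Figure 13. The sketch below assumes the built-in `String.prototype.repeat` as concrete semantics restricted to non-negative counts; `collectRepeat` and `kleeneRecognizer` are hypothetical helpers, and the Kleene closure of a finite language is modeled here with a regular expression rather than with an automaton.

```javascript
// Collecting semantics of repeat over finite sets (non-negative counts only,
// since the built-in repeat throws a RangeError on negative counts).
function collectRepeat(S, I) {
  const out = new Set();
  for (const sigma of S)
    for (const n of I) out.add(sigma.repeat(n));
  return out;
}

// Recognizer for the Kleene closure of a finite language: (s1|s2|...)*,
// with regular expression metacharacters escaped in each string.
function kleeneRecognizer(S) {
  const alternatives = [...S]
    .map((s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");
  return new RegExp(`^(?:${alternatives})*$`);
}

const notes = new Set(["do", "mi"]);
const twice = collectRepeat(notes, new Set([2]));
const kleene = kleeneRecognizer(notes);

// Soundness: every repeated string belongs to the Kleene closure.
const allCovered = [...collectRepeat(notes, new Set([0, 1, 2, 3]))].every((s) => kleene.test(s));
// Precision loss: the closure also contains strings that no repetition produces.
const spurious = kleene.test("domi") && !twice.has("domi");
console.log([...twice], allCovered, spurious);
```

This illustrates why only soundness, and not completeness, is claimed when the abstract semantics falls back to the Kleene closure.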
Proof of Theorem 10
At each iteration of Algorithm 6, we remove the white-space transitions from the initial state q 0 . The invariant of Algorithm 6 after line 9 is that the initial state has no white-space transitions. Before the while-loop condition (line 3) is checked again, the automaton is determinized and minimized (line 11) with the new set of transitions δ . Hence, if the initial state q 0 had only white-space transitions, after the minimization at line 11 q 0 is no longer the initial state of R, and a new initial state is computed at line 11. Hence, when the loop is repeated, the algorithm searches for further white-space transitions from the new initial state. In this way, Algorithm 6 is able to remove consecutive white-space transitions starting from the original initial state q 0 . □

References

  1. Pradel, M.; Sen, K. The Good, the Bad, and the Ugly: An Empirical Study of Implicit Type Conversions in JavaScript. In Proceedings of the 29th European Conference on Object-Oriented Programming, ECOOP 2015, Prague, Czech Republic, 5–10 July 2015; Boyland, J.T., Ed.; LIPIcs. Schloss Dagstuhl- Leibniz-Zentrum für Informatik: Wadern, Germany, 2015; Volume 37, pp. 519–541.
  2. Xu, W.; Zhang, F.; Zhu, S. The power of obfuscation techniques in malicious JavaScript code: A measurement study. In Proceedings of the 7th International Conference on Malicious and Unwanted Software, MALWARE 2012, Fajardo, PR, USA, 16–18 October 2012; IEEE Computer Society: Washington, DC, USA, 2012; pp. 9–16.
  3. Jensen, S.H.; Møller, A.; Thiemann, P. Type Analysis for JavaScript. In Proceedings of the 16th International Symposium on Static Analysis, SAS 2009, Los Angeles, CA, USA, 9–11 August 2009; Palsberg, J., Su, Z., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2009; Volume 5673, pp. 238–255.
  4. Kashyap, V.; Dewey, K.; Kuefner, E.A.; Wagner, J.; Gibbons, K.; Sarracino, J.; Wiedermann, B.; Hardekopf, B. JSAI: A static analysis platform for JavaScript. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, 16–22 November 2014; Cheung, S., Orso, A., Storey, M.D., Eds.; ACM: New York, NY, USA, 2014; pp. 121–132.
  5. Lee, H.; Won, S.; Jin, J.; Cho, J.; Ryu, S. SAFE: Formal specification and implementation of a scalable analysis framework for ECMAScript. In Proceedings of the 19th International Workshop on Foundations of Object-Oriented Languages (FOOL’12), Tucson, AZ, USA, 19–26 October 2012.
  6. Hauzar, D.; Kofron, J. Framework for Static Analysis of PHP Applications. In Proceedings of the 29th European Conference on Object-Oriented Programming, ECOOP 2015, Prague, Czech Republic, 5–10 July 2015; Boyland, J.T., Ed.; LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum für Informatik: Wadern, Germany, 2015; Volume 37, pp. 689–711.
  7. Arceri, V.; Mastroeni, I. A sound abstract interpreter for dynamic code. In Proceedings of the SAC ’20: The 35th ACM/SIGAPP Symposium on Applied Computing, Brno, Czech Republic, 30 March–3 April 2020; Hung, C., Cerný, T., Shin, D., Bechini, A., Eds.; ACM: New York, NY, USA, 2020; pp. 1979–1988.
  8. Arceri, V.; Mastroeni, I. Static Program Analysis for String Manipulation Languages. In Proceedings of the Seventh International Workshop on Verification and Program Transformation, VPT@Programming 2019, Genova, Italy, 2 April 2019; Volume 299, pp. 19–33.
  9. Cousot, P.; Cousot, R. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. In Proceedings of the Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, CA, USA, 17–19 January 1977; Graham, R.M., Harrison, M.A., Sethi, R., Eds.; ACM: New York, NY, USA, 1977; pp. 238–252.
  10. ECMA. Standard ECMA-262 Language Specification, 9th ed. Available online: https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf (accessed on 6 December 2018).
  11. Hopcroft, J.E.; Ullman, J.D. Introduction to Automata Theory, Languages and Computation; Addison-Wesley: Reading, MA, USA, 1979.
  12. Davis, M.D.; Sigal, R.; Weyuker, E.J. Computability, Complexity, and Languages: Fundamentals of Theoretical Computer Science; Academic Press Professional, Inc.: Cambridge, MA, USA, 1994.
  13. Cousot, P.; Cousot, R. Systematic Design of Program Analysis Frameworks. In Proceedings of the Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, TX, USA, 29–31 January 1979; Aho, A.V., Zilles, S.N., Rosen, B.K., Eds.; ACM Press: New York, NY, USA, 1979; pp. 269–282.
  14. Cousot, P.; Cousot, R. Abstract Interpretation Frameworks. J. Log. Comput. 1992, 2, 511–547.
  15. Giacobazzi, R.; Quintarelli, E. Incompleteness, Counterexamples, and Refinements in Abstract Model-Checking. In Proceedings of the Static Analysis, 8th International Symposium, SAS 2001, Paris, France, 16–18 July 2001; Cousot, P., Ed.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2001; Volume 2126, pp. 356–373.
  16. Giacobazzi, R.; Mastroeni, I. Making abstract models complete. Math. Struct. Comput. Sci. 2016, 26, 658–701.
  17. Giacobazzi, R.; Mastroeni, I. Transforming Abstract Interpretations by Abstract Interpretation. In Proceedings of the Static Analysis, 15th International Symposium, SAS 2008, Valencia, Spain, 16–18 July 2008; Alpuente, M., Vidal, G., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2008; Volume 5079, pp. 1–17.
  18. Arceri, V.; Maffeis, S. Abstract Domains for Type Juggling. Electron. Notes Theor. Comput. Sci. 2017, 331, 41–55.
  19. Park, C.; Im, H.; Ryu, S. Precise and scalable static analysis of jQuery using a regular expression domain. In Proceedings of the 12th Symposium on Dynamic Languages, DLS 2016, Amsterdam, The Netherlands, 1 November 2016; Ierusalimschy, R., Ed.; ACM: New York, NY, USA, 2016; pp. 25–36.
  20. Choi, T.; Lee, O.; Kim, H.; Doh, K. A Practical String Analyzer by the Widening Approach. In Proceedings of the 4th Asian Symposium on Programming Languages and Systems, APLAS 2006, Sydney, Australia, 8–10 November 2006; Kobayashi, N., Ed.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2006; Volume 4279, pp. 374–388.
  21. Yu, F.; Bultan, T.; Cova, M.; Ibarra, O.H. Symbolic String Verification: An Automata-Based Approach. In Proceedings of the 15th International SPIN Workshop on Model Checking Software, Los Angeles, CA, USA, 10–12 August 2008; Havelund, K., Majumdar, R., Palsberg, J., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2008; Volume 5156, pp. 306–324.
  22. Câmpeanu, C.; Paun, A.; Yu, S. An Efficient Algorithm for Constructing Minimal Cover Automata for Finite Languages. Int. J. Found. Comput. Sci. 2002, 13, 83–97.
  23. Domaratzki, M.; Shallit, J.O.; Yu, S. Minimal Covers of Formal Languages. In Proceedings of the 5th International Conference Developments in Language Theory, DLT 2001, Vienna, Austria, 16–21 July 2001; Revised Papers. Kuich, W., Rozenberg, G., Salomaa, A., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2001; Volume 2295, pp. 319–329.
  24. Mohri, M.; Nederhof, M. Regular Approximation of Context-Free Grammars through Transformation. In Robustness in Language and Speech Technology; Springer: Dordrecht, The Netherlands, 2001; pp. 153–163.
  25. Cousot, P.; Halbwachs, N. Automatic Discovery of Linear Restraints Among Variables of a Program. In Proceedings of the Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, Tucson, AZ, USA, 23–25 January 1978; Aho, A.V., Zilles, S.N., Szymanski, T.G., Eds.; ACM Press: New York, NY, USA, 1978; pp. 84–96.
  26. Costantini, G.; Ferrara, P.; Cortesi, A. A suite of abstract domains for static analysis of string values. Softw. Pract. Exp. 2015, 45, 245–287.
  27. Cousot, P.; Cousot, R. Comparing the Galois Connection and Widening/Narrowing Approaches to Abstract Interpretation. In Proceedings of the 4th International Symposium on Programming Language Implementation and Logic Programming, PLILP’92, Leuven, Belgium, 26–28 August 1992; Bruynooghe, M., Wirsing, M., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 1992; Volume 631, pp. 269–295.
  28. D’Silva, V. Widening for Automata. Ph.D. Thesis, Institut Fur Informatick, UZH, Zurich, Switzerland, 2006.
  29. Bartzis, C.; Bultan, T. Widening Arithmetic Automata. In Proceedings of the 16th International Conference on Computer Aided Verification, CAV 2004, Boston, MA, USA, 13–17 July 2004; Alur, R., Peled, D.A., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2004; Volume 3114, pp. 321–333.
  30. Cousot, P. Types as Abstract Interpretations. In Proceedings of the Conference Record of POPL’97: The 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Paris, France, 15–17 January 1997; Lee, P., Henglein, F., Jones, N.D., Eds.; ACM Press: New York, NY, USA, 1997; pp. 316–331.
  31. Reynolds, J.C. Theories of Programming Languages; Cambridge University Press: Cambridge, UK, 1998.
  32. Giacobazzi, R.; Ranzato, F.; Scozzari, F. Making abstract interpretations complete. J. ACM 2000, 47, 361–416.
  33. Fromherz, A.; Ouadjaout, A.; Miné, A. Static Value Analysis of Python Programs by Abstract Interpretation. In Proceedings of the 10th International Symposium on NASA Formal Methods, NFM 2018, Newport News, VA, USA, 17–19 April 2018; Dutle, A., Muñoz, C.A., Narkawicz, A., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2018; Volume 10811, pp. 185–202.
  34. Bordihn, H.; Holzer, M.; Kutrib, M. Determination of finite automata accepting subregular languages. Theor. Comput. Sci. 2009, 410, 3209–3222.
  35. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 3rd ed.; MIT Press: Cambridge, MA, USA, 2009.
  36. Holzer, M.; Jakobi, S. Brzozowski’s Minimization Algorithm - More Robust than Expected-(Extended Abstract). In Proceedings of the 18th International Conference on Implementation and Application of Automata, CIAA 2013, Halifax, NS, Canada, 16–19 July 2013; Konstantinidis, S., Ed.; Springer: Berlin, Germany, 2013. Lecture Notes in Computer Science. Volume 7982, pp. 181–192.
  37. Park, C.; Ryu, S. Scalable and Precise Static Analysis of JavaScript Applications via Loop-Sensitivity. In Proceedings of the 29th European Conference on Object-Oriented Programming, ECOOP 2015, Prague, Czech Republic, 5–10 July 2015; LIPIcs. Boyland, J.T., Ed.; Schloss Dagstuhl-Leibniz-Zentrum für Informatik: Wadern, Germany, 2015; Volume 37, pp. 735–756.
  38. Mozilla. MDN Web Docs-Useful String Methods. Available online: https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/Useful_string_methods (accessed on 20 April 2020).
  39. Abdulla, P.A.; Atig, M.F.; Chen, Y.; Holík, L.; Rezine, A.; Rümmer, P.; Stenman, J. Norn: An SMT Solver for String Constraints. In Proceedings of the Computer Aided Verification-27th International Conference, CAV 2015, San Francisco, CA, USA, 18–24 July 2015; Part I. Kroening, D., Pasareanu, C.S., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2015; Volume 9206, pp. 462–469.
  40. Liang, T.; Reynolds, A.; Tsiskaridze, N.; Tinelli, C.; Barrett, C.W.; Deters, M. An efficient SMT solver for string constraints. Form. Methods Syst. Des. 2016, 48, 206–234.
  41. Cousot, P.; Giacobazzi, R.; Ranzato, F. Program Analysis Is Harder Than Verification: A Computability Perspective. In Proceedings of the Computer Aided Verification-30th International Conference, CAV 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, 14–17 July 2018; Part II. Chockler, H., Weissenbacher, G., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2018; Volume 10982, pp. 75–95.
  42. Midtgaard, J.; Nielson, F.; Nielson, H.R. A Parametric Abstract Domain for Lattice-Valued Regular Expressions. In Proceedings of the Static Analysis-23rd International Symposium, SAS 2016, Edinburgh, UK, 8–10 September 2016; Rival, X., Ed.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2016; Volume 9837, pp. 338–360.
  43. Lin, A.W.; Barceló, P. String solving with word equations and transducers: Towards a logic for analysing mutation XSS. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, 20–22 January 2016; Bodík, R., Majumdar, R., Eds.; ACM: New York, NY, USA, 2016; pp. 123–136.
  44. Abdulla, P.A.; Atig, M.F.; Chen, Y.; Holík, L.; Rezine, A.; Rümmer, P.; Stenman, J. String Constraints for Verification. In Proceedings of the Computer Aided Verification-26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, 18–22 July 2014; Biere, A., Bloem, R., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2014; Volume 8559, pp. 150–166.
  45. Bouajjani, A.; Habermehl, P.; Vojnar, T. Abstract Regular Model Checking. In Proceedings of the 16th International Conference on Computer Aided Verification, CAV 2004, Boston, MA, USA, 13–17 July 2004; Alur, R., Peled, D.A., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2004; Volume 3114, pp. 372–386.
  46. Bouajjani, A.; Habermehl, P.; Holík, L.; Touili, T.; Vojnar, T. Antichain-Based Universality and Inclusion Testing over Nondeterministic Finite Tree Automata. In Proceedings of the 13th International Conference on Implementation and Applications of Automata, CIAA 2008, San Francisco, CA, USA, 21–24 July 2008; Ibarra, O.H., Ravikumar, B., Eds.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2008; Volume 5148, pp. 57–67.
  47. Alur, R.; Madhusudan, P. Visibly pushdown languages. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004; Babai, L., Ed.; ACM: New York, NY, USA, 2004; pp. 202–211.
  48. Holík, L.; Janku, P.; Lin, A.W.; Rümmer, P.; Vojnar, T. String constraints with concatenation and transducers solved efficiently. Proc. ACM Program. Lang. 2018, 2, 4.
  49. Balakrishnan, G.; Reps, T.W. Recency-Abstraction for Heap-Allocated Storage. In Proceedings of the 13th International Symposium on Static Analysis, SAS 2006, Seoul, Korea, 29–31 August 2006; Yi, K., Ed.; Lecture Notes in Computer Science. Springer: Berlin, Germany, 2006; Volume 4134, pp. 221–239.
  50. Jensen, S.H.; Jonsson, P.A.; Møller, A. Remedying the eval that men do. In Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2012, Minneapolis, MN, USA, 15–20 July 2012; Heimdahl, M.P.E., Su, Z., Eds.; ACM: New York, NY, USA, 2012; pp. 34–44.
  51. Sharir, M.; Pnueli, A. Two Approaches to Interprocedural Data Flow Analysis; NYU CS: New York, NY, USA, 1978.
Figure 1. A potentially malicious obfuscated JavaScript program.
Figure 1. A potentially malicious obfuscated JavaScript program.
Applsci 10 03525 g001
Figure 2. μ JS syntax.
Figure 2. μ JS syntax.
Applsci 10 03525 g002
Figure 3. μ JS implicit type conversion functions.
Figure 3. μ JS implicit type conversion functions.
Applsci 10 03525 g003
Figure 4. Least upper bound of D FA / .
Figure 4. Least upper bound of D FA / .
Applsci 10 03525 g004
Figure 5. Widening of D FA / .
Figure 5. Widening of D FA / .
Applsci 10 03525 g005
Figure 6. Coalesced sum abstract domain for μ JS .
Figure 6. Coalesced sum abstract domain for μ JS .
Applsci 10 03525 g006
Figure 7. (a) A , L ( A ) = { l a n g , h e l l o } (b) A = SS ( A , 2 , Const ) .
Figure 7. (a) A , L ( A ) = { l a n g , h e l l o } (b) A = SS ( A , 2 , Const ) .
Applsci 10 03525 g007
Figure 8. (a) A, L(A) = {paper, hello}. (b) A′, L(A′) = {abc, hello}.
Figure 9. (a) A, L(A) = {panda, koala}. (b) A′, L(A′) = {pan, p}.
Figure 10. (a) A, L(A) = {!Ab, CdE}. (b) LC(A).
Figure 11. (a) A, L(A) = {abc, abd, efg}. (b) A′, L(A′) = {ab, fg}.
Figure 12. (a) A, L(A) = {panda, candy, andy}. (b) A′, L(A′) = {an, nd}.
Figure 13. (a) A, L(A) = {do, mi}. (b) RT(A, 2). (c) RT(A, Const).
Figure 14. (a) A, L(A) = {( )*ab, d}. (b) TL(A).
Figure 15. toStr(Const).
Figure 16. A_d, the abstract value of d before the eval call of the program in Figure 1.
Figure 17. Useful string manipulation method taken from [38].
Table 1. Definition of SS.
SS(A, i, j) | j ∈ ℤ (j ≠ Const) | j = Const
i ∈ ℤ (i ≠ Const) | SS(A, min(i, j), max(i, j)) | ⊔_{a ∈ [0, i]} SS(A, a, i) ⊔ SS(A, i)
i = Const | ⊔_{a ∈ [0, j]} SS(A, a, j) ⊔ SS(A, j) | FA(A)
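The case split of Table 1 can be illustrated on a much simpler representation than automata. The following is a minimal sketch, not the analyzer's implementation: it assumes a language is a finite JavaScript Set of strings, models a statically unknown index with a hypothetical UNKNOWN marker (standing in for the Const entries of the table), and reads the two-argument SS(A, i) as "all substrings starting at i". All function names are illustrative assumptions.

```javascript
// Hypothetical finite-set stand-in for the automaton A: a language is a Set of strings.
// UNKNOWN is an assumed marker for a statically unknown integer index.
const UNKNOWN = Symbol("unknown integer");

// Union of any number of string sets (plays the role of the join).
function union(...sets) {
  const out = new Set();
  for (const s of sets) for (const x of s) out.add(x);
  return out;
}

// Substring with known, ordered bounds applied to every string of the language.
function ssConcrete(lang, lo, hi) {
  const out = new Set();
  for (const s of lang) out.add(s.substring(lo, hi));
  return out;
}

// Assumed reading of the two-argument form: substrings that start at i
// and end at any position up to the end of the string.
function ssFrom(lang, i) {
  const out = new Set();
  for (const s of lang) {
    for (let end = i; end <= Math.max(i, s.length); end++) {
      out.add(s.substring(i, end));
    }
  }
  return out;
}

// All factors (substrings) of all strings: the role played by FA(A).
function allFactors(lang) {
  const out = new Set();
  for (const s of lang) {
    for (let a = 0; a <= s.length; a++) {
      for (let b = a; b <= s.length; b++) out.add(s.substring(a, b));
    }
  }
  return out;
}

// Case split mirroring the four cells of Table 1.
function ss(lang, i, j) {
  if (i === UNKNOWN && j === UNKNOWN) return allFactors(lang);
  if (i === UNKNOWN) return ss(lang, j, UNKNOWN); // the table is symmetric in i and j
  if (j === UNKNOWN) {
    const k = Math.max(0, i); // substring clamps negative indexes to 0
    const parts = [ssFrom(lang, k)]; // unknown index at or after k
    for (let a = 0; a <= k; a++) parts.push(ssConcrete(lang, a, k)); // unknown index before k
    return union(...parts);
  }
  // Both indexes known: ECMAScript substring swaps them when i > j.
  return ssConcrete(lang, Math.min(i, j), Math.max(i, j));
}

// Usage on the language of Figure 7(a) with an unknown second index.
const figure7 = new Set(["lang", "hello"]);
console.log([...ss(figure7, 2, UNKNOWN)].sort());
```

Enumerating finite sets keeps the sketch short but only works for finite languages; the automata-based domain exists precisely to handle infinite ones.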
Table 2. μJS Finite-state Automata String Analyzer (μFASA) string operations.
String Operation | Soundness | Completeness | Average Complexity
substring | | | O(n log n)
charAt | | | O(n log n)
length | | | O(n + m)
concat | | | O(n log n + n + m)
startsWith, endsWith | | | O(n log n + n + m)
toLowerCase, toUpperCase | | | O(m)
includes | | | O(n log n + n + m)
repeat | | | O(n log n + n + m)
replace | | | O((n + m) n log n)
indexOf | | | O(n (n log n) (n^2 m))
slice | | | O((n + m) (n log n))

Share and Cite

MDPI and ACS Style

Arceri, V.; Mastroeni, I.; Xu, S. Static Analysis for ECMAScript String Manipulation Programs. Appl. Sci. 2020, 10, 3525. https://doi.org/10.3390/app10103525


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
