Abstracting Strings for Model Checking of C Programs

Abstract: Data type abstraction plays a crucial role in software verification. In this paper, we introduce a domain for abstracting strings in the C programming language, where strings are managed as null-terminated arrays of characters. The new domain M-String is parametrized on an index (bound) domain and a character domain. By means of these different constituent domains, M-String captures shape information on the array structure as well as value information on the characters occurring in the string. By tuning these two parameters, M-String can be easily tailored for specific verification tasks, balancing precision against complexity. The concrete and the abstract semantics of basic operations on strings are carefully formalized, and soundness proofs are fully detailed. Moreover, for a selection of functions contained in the standard C library, we provide the semantics for character access and update, enabling an automatic lifting of arbitrary string-manipulating code into our new domain. An implementation of abstract operations is provided within a tool that automatically lifts existing programs into the M-String domain along with an explicit-state model checker. The accuracy of the proposed domain is experimentally evaluated on real-case test programs, showing that M-String can efficiently detect real-world bugs as well as prove that programs do not contain them after they are fixed.


Introduction
C is still one of the most widely used programming languages [1], and a large portion of systems of critical relevance are written in this language, such as server-side software and embedded systems. Unfortunately, C programs suffer from bugs, due to the way they are laid out in memory, which malicious parties may exploit to drive security attacks. Ensuring the correctness of such software is of great concern. Our main interest is guaranteeing the correctness of C programs that manage strings, because incorrect string manipulation may lead to several catastrophic events, ranging from loss or exposure of sensitive data to crashes in critical software components.
Strings in C are not a basic data type. As a matter of fact, strings in C are represented by zero-terminated arrays of characters and there are libraries that provide functions which allow operating on them [2]. C programs that manipulate strings can suffer from buffer overflows and related issues due to the possible discrepancy between the size of the string and the size of the array (buffer). A buffer overflow is a bug that affects C code when a buffer is accessed out of its bounds. An out-of-bounds write is a particularly dangerous case of buffer overflow, while an out-of-bounds read is less critical as a bug. It is important to design methods supporting the automatic correctness verification of string management in C programs for the previously mentioned reasons and also because buffer overflows are usually exploitable and can easily lead to arbitrary code execution [3].
Existing bugs can be identified by enhancing tools for code analysis, which can also reduce the risk of introducing new bugs and limit the occurrence of costly security incidents.

Paper Contribution
This paper is a revised and extended version of [4,5]. We introduce M-String, a new abstract domain tailored for the analysis of strings in C, whose elements:
• approximate sets of C character arrays;
• allow the abstraction of both shape information on the array structure and value information on the contained characters;
• highlight the presence of well-formed strings in the approximated character arrays.
M-String refines the segmentation approach to array representation introduced in [6]. M-String's goal is to detect the presence of common string management errors that may lead to undefined behaviours or, more specifically, which may result in buffer overflows. Moreover, keeping track of the content of the characters occurring after the first null character allows us to reduce the number of false positives. In fact, rewriting the first null character in the array is not always an error, as further occurrences of the null character may follow. M-String, like the array segmentation-based representation introduced in [6], is parametric in two ways: both with respect to the representation of the indices of the array and with respect to the abstraction of the element values.
To provide evidence of the effectiveness of M-String, we extend LART [7], a tool which performs automatic abstraction on programs, so that it also supports sophisticated (non-scalar) domains such as M-String.
We extend LART along with DIVINE 4 [8], an explicit state model checker based on LLVM. This way, we can verify the correctness of operations on strings in C programs automatically. The experimental evaluation is performed by analyzing several C programs, ranging from quite simple to moderately complex, including parsers generated by bison, a tool which translates context-free grammars into C parsers. The results show the actual impact of an ad-hoc segmentation-based abstract domain on model checking of C programs.

Paper Structure
In Section 2 we give the basics of abstract interpretation and we introduce the array segmentation abstract domain [6] on which M-String is based. Section 3 introduces the syntax of some operations of interest. Section 4 defines the concrete domain and semantics. Section 5 presents the M-String abstract domain for C character arrays and its semantics, whose soundness is formally proved. In Section 6, we present a general approach to abstraction as a program transformation and extend it to the abstraction of program strings. Sections 7 and 8 present implementation and evaluation details of the M-String abstraction. In Section 9 we discuss related work. Finally, Section 10 concludes.

Prerequisites
We assume the reader is familiar with order theory.

Abstract Interpretation
Abstract Interpretation [9,10] is a theory about sound approximation or abstraction of semantics of computer programs, focusing on some run-time properties of interest. Formally, the concrete semantics is based on a concrete domain $D$. Likewise, the abstract semantics is based on an abstract domain $\overline{D}$. Both the concrete and the abstract domains form a complete lattice: $(D, \leq_D, \bot_D, \top_D, \sqcup_D, \sqcap_D)$ and $(\overline{D}, \leq_{\overline{D}}, \bot_{\overline{D}}, \top_{\overline{D}}, \sqcup_{\overline{D}}, \sqcap_{\overline{D}})$. Please note that we use the same notation interchangeably to denote a domain and its set of elements. The concrete and the abstract domains are related by a pair of monotonic functions: the concretization $\gamma_D : \overline{D} \to D$ and the abstraction $\alpha_D : D \to \overline{D}$. In order to obtain a sound analysis, $\alpha_D$ and $\gamma_D$ have to form a Galois Connection (GC) [11]. $(\alpha_D, \gamma_D)$ is a GC if and only if for every $d \in D$ and $\overline{d} \in \overline{D}$ we have that $d \leq_D \gamma_D(\overline{d}) \Leftrightarrow \alpha_D(d) \leq_{\overline{D}} \overline{d}$. Notice that each of the two functions uniquely determines the other. Consequently, we can infer a Galois Connection by proving that $\gamma_D$ is a complete meet morphism (resp. $\alpha_D$ is a complete join morphism) (Proposition 7 of [12]). Please note that these conditions can be relaxed, performing abstract interpretation over non-lattice abstract domains [12]. Abstract domains that do not respect the Ascending Chain Condition (ACC) need to be equipped with a widening $\nabla_{\overline{D}}$ and a narrowing $\Delta_{\overline{D}}$ operator, in order to get fast convergence and to improve the accuracy of the resulting analysis, respectively [13]. An abstract domain functor $\overline{D}$ is a function from the parameter abstract domains $\overline{D}_1, \overline{D}_2, ..., \overline{D}_n$ to a new abstract domain $\overline{D}(\overline{D}_1, \overline{D}_2, ..., \overline{D}_n)$. The abstract domain functor $\overline{D}(\overline{D}_1, \overline{D}_2, ..., \overline{D}_n)$ composes abstract domain properties of the parameter abstract domains to build a new class of abstract properties and operations [6].
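To illustrate why a widening is needed when the Ascending Chain Condition fails, consider the classical interval domain, which admits infinite strictly ascending chains such as [0,0], [0,1], [0,2], and so on. The following is a minimal sketch, not part of the formalization above: the type `interval` and the functions `interval_join` and `interval_widen` are illustrative names, and `INT_MIN`/`INT_MAX` are assumed to stand for the infinite bounds.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative interval [lo, hi]; INT_MIN and INT_MAX play the role of
 * minus and plus infinity (an assumption of this sketch). */
typedef struct { int lo; int hi; } interval;

/* Least upper bound of two non-empty intervals. */
interval interval_join(interval a, interval b) {
    interval r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Classical interval widening: a bound that is still moving between two
 * iterates is pushed to infinity, so every chain stabilizes. */
interval interval_widen(interval old, interval new_) {
    assert(old.lo <= old.hi && new_.lo <= new_.hi);
    interval r = { new_.lo < old.lo ? INT_MIN : old.lo,
                   new_.hi > old.hi ? INT_MAX : old.hi };
    return r;
}
```

With the join alone, the chain above never stabilizes; widening [0,0] with [0,1] directly yields an interval whose upper bound is infinite, at the cost of precision that a narrowing may later partially recover.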

Fun Array
In the following we recall the array segmentation analysis presented in [6]. Notice that we slightly modified the notation to be consistent with the whole work. For more details, we invite the reader to refer directly to the original paper.

Array Concrete Semantics
Let $R_a$ be the set of concrete array environments. A concrete array environment $\theta \in R_a$ maps array variables $a \in A$ to their values $\theta(a) \in \mathcal{A}$, such that $\theta(a) = (\rho, low_a, high_a, A_a)$, where:
1. $R_v$ is the set of concrete variable environments. A concrete variable environment $\rho \in R_v$ maps variables (of basic types) $x \in X$ to their values $\rho(x) \in V$.
2. $E$ is the set of program expressions made up of constants, variables, mathematical unary and binary operators. In the following, for simplicity, expressions are evaluated to integers. $low_a, high_a \in E$ are expressions whose values, given by $[\![low_a]\!]\rho$ and $[\![high_a]\!]\rho$, represent respectively the lower bound and the upper bound of an array $a$, i.e., the lower and the upper bound of its index range. Following the denotational semantics approach of [6], the value of an arithmetic expression $e$ is denoted by $[\![e]\!]\rho$, where (1) the double square brackets denote the semantic evaluation function and (2) $\rho$ is an environment mapping program variables (which may also appear in $e$) to their values. Typically, $[\![x]\!]\rho$ is equivalent to $\rho(x)$, with $x \in X$, and $[\![n]\!]\rho$, where $n$ is a constant, is equivalent to $n$ itself. Thus, for example, if $e$ is the expression $x - 1$, its semantics $[\![x - 1]\!]\rho$ is defined as $[\![x]\!]\rho - [\![1]\!]\rho$, which corresponds to $\rho(x) - 1$. Notice that the value of the upper bound of an array concrete value corresponds to the index immediately after the one that points to the last memory block allocated to the array when it has been initialized. As usual, array indexes are 0-based.
3. $Z$ is the set of integer numbers and $V$ is the set of values. Let $I_a$ be the set of indexes $i$ of an array $a$, i.e., $I_a = \{i \mid i \in [[\![low_a]\!]\rho, [\![high_a]\!]\rho)\} \subseteq Z$, and let $P_a$ be the set of pairs $(i, v)$ such that $v$ is the value of the element indexed by $i$ in an array $a$, i.e., $P_a = \{(i, v) \mid i \in I_a \wedge [\![a[i]]\!]\rho = v \in V\} \subseteq Z \times V$. Thus, $A_a : I_a \to P_a$ is a function mapping the indexes of an array $a$ to their corresponding pairs (index, indexed array value).
… (2, 9), (3, 11), (4, 13)}. Moreover, let $b$ denote the sub-array of $a$ from position 2 to 3 included; its concrete value is given by $\theta(b) = (\rho, 2, 4, A_b)$ such that $P_b = \{(2, 9), (3, 11)\}$.
Observe that this array representation allows reasoning about the correspondence between shape components of an array and actual values of the array elements.

Array Segmentation Abstract Domain Functor
According to [6], the FunArray abstract domain $\overline{S}$ (shortcut for $\overline{S}(\overline{B}, \overline{A}, \overline{R})$) allows representing a sequence of consecutive, non-overlapping and possibly empty segments that over-approximate a set of concrete array values in $\mathcal{P}(\mathcal{A})$, i.e., an element of the powerset of the set $\mathcal{A}$ of concrete array values. Each segment represents a sub-array whose elements share the same property (e.g., being positive integer values) and is surrounded by the so-called segment bounds, i.e., abstractions of its lower and upper bounds.

Example 2.
Consider the integer array a[5] = {5,7,9,10,12}. As an abstraction of a we may consider {0} odd {3} even {5} saying that the array contains odd numbers in the first three elements (indexed from 0 to 2) and two even elements (indexed from 3 to 4).
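The reading of a segmentation as a property of concrete arrays can be sketched in C. The code below is only an illustration under simplifying assumptions: segment bounds are plain constants rather than sets of symbolic expressions, no segment is optional, and the names `predicate` and `satisfies_segmentation` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A segment property is modeled as a predicate on element values. */
typedef bool (*predicate)(int);

bool is_odd(int v)  { return v % 2 != 0; }
bool is_even(int v) { return v % 2 == 0; }

/* Checks that, for every segment k, all elements of a with index in
 * [bounds[k], bounds[k + 1]) satisfy preds[k].
 * n_segments segments are delimited by n_segments + 1 bounds. */
bool satisfies_segmentation(const int *a, const size_t *bounds,
                            const predicate *preds, size_t n_segments) {
    assert(a != NULL && bounds != NULL && preds != NULL);
    for (size_t k = 0; k < n_segments; k++)
        for (size_t i = bounds[k]; i < bounds[k + 1]; i++)
            if (!preds[k](a[i]))
                return false;
    return true;
}
```

For the array of Example 2, the bounds {0, 3, 5} and the predicates {is_odd, is_even} encode the segmentation {0} odd {3} even {5}, and the check succeeds.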

The elements of FunArray belong to the set
1. $\overline{B}$ is the segment bound abstract domain, approximating array indexes, with abstract properties $b_i \in \overline{B}$ such that $i \in [1, n]$ and $n > 1$. We denote by $\overline{E}$ the set of expressions of canonical form $x + k$, where $x \in \overline{X}$ and $k \in Z$. The segment bounds $b_i$ are sets of expressions $\{e_i^1, ..., e_i^m\}$, such that $e_i^j \in \overline{E}$. The variable abstract domain $\overline{X}$ encodes program variables, i.e., $\overline{X} = X \cup \{v_0\}$, where $v_0$ is a special variable whose value is assumed to be zero. Moreover, $b_i = \emptyset$ denotes unreachability; if $b_i \neq \emptyset$, the expressions appearing in a segment bound are all equivalent symbolic denotations of some concrete value (generally unknown in the abstract representation except when one of the $e_i^j$ is a constant). Thus, $\overline{B}$ depends on the expression abstract domain $\overline{E}$ which, in turn, depends on the variable abstract domain $\overline{X}$.
2. $\overline{A}$ is the array element abstract domain, with abstract properties $p_i \in \overline{A}$. It denotes possible values of pairs (index, indexed array element) in a segment, for relational abstractions, array elements otherwise.
3. $\overline{R}$ is the variable environment abstract domain, which depends on the variable abstract domain $\overline{X}$, with abstract properties $\overline{\rho} \in \overline{R}$.
4. The question mark, if present, expresses the possibility that the segment that precedes it may be empty. The question mark can never precede $b_1$. The space symbol in $\{\ , ?\}$ represents a non-empty segment.
Please note that in the last case, the lack of positive values is justified by the presence of the question mark, which says that the first segment is optional. The unification algorithm in [6] modifies two compatible segmentations in order to align them with respect to the same list of bounds. The unification algorithm does not guarantee the maximality of the result, but it is always well-defined, it terminates and it is deterministic. The partial order $\sqsubseteq_{\overline{S}}$ over $\overline{S}$ is defined over unified segmentations, as are the join $\sqcup_{\overline{S}}$ and the meet $\sqcap_{\overline{S}}$ operators. Please note that $\overline{S}$ is not necessarily a lattice [14]. Moreover, $\overline{S}$ does not respect the Ascending Chain Condition; therefore, in order to ensure the convergence of the analysis, it is equipped with a widening operator $\nabla_{\overline{S}}$. A narrowing operator, which improves the precision of the widening result, is also defined. Widening and narrowing operators are applied on unified segmentations.
Such an abstract array representation is effective for analyzing the content of arrays, but in the case of the C programming language, where a string is defined as a null-terminated character array, it is not powerful enough to detect common string manipulation errors.

Syntax
Strings in the programming language C are arrays of characters, whose length is determined by a terminating null character '\0'. Thus, for example, the string literal "bee" has four characters: 'b', 'e', 'e', '\0'. Moreover, C supports several string handling functions defined in the standard library string.h.
We focus on the most significant functions in the string.h header (see Table 1), manipulating null-terminated sequences of characters, plus the array elements access and update operations. Recall that char, int and size_t are data types in C, const is a qualifier applied to the declaration of any variable which specifies the immutability of its value, and *str denotes that str is a pointer variable.
• strcat appends the null-terminated string pointed to by str2 to the null-terminated string pointed to by str1. The first character of str2 overwrites the null-terminator of str1 and str2 should not overlap str1. The string concatenation returns the pointer str1.
• strchr locates the first occurrence of c (converted to a char) in the string pointed to by str. The terminating null character is considered to be part of the string. The string character function returns a pointer to the located character, or a null pointer if the character does not occur in the string.
• strcmp lexicographically compares the string pointed to by str1 to the string pointed to by str2. The string compare function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by str1 is greater than, equal to, or less than the string pointed to by str2.
• strcpy copies the null-terminated string pointed to by str2 to the memory pointed to by str1. str2 should not overlap str1. The string copy function returns the pointer str1.
• strlen computes the number of bytes in the string to which str points, not including the terminating null byte. The string length function returns the length of str.
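The behaviour of these functions can be observed on small inputs. The sketch below relies only on the standard string.h functions listed above; the two wrappers `first_index_of` and `compare_sign` are hypothetical helpers introduced here to make the return values easier to inspect.

```c
#include <assert.h>
#include <string.h>

/* Standard prototypes, for reference:
 *   char  *strcat(char *str1, const char *str2);
 *   char  *strchr(const char *str, int c);
 *   int    strcmp(const char *str1, const char *str2);
 *   char  *strcpy(char *str1, const char *str2);
 *   size_t strlen(const char *str);
 */

/* Index of the first occurrence of c in str, or -1 if c does not occur.
 * Since strchr treats the terminator as part of the string, searching
 * for '\0' yields the string length. */
long first_index_of(const char *str, int c) {
    assert(str != NULL);
    const char *p = strchr(str, c);
    return p != NULL ? (long)(p - str) : -1L;
}

/* Reduces the result of strcmp to -1, 0 or 1: the C standard fixes only
 * the sign of the result, not its magnitude. */
int compare_sign(const char *str1, const char *str2) {
    assert(str1 != NULL && str2 != NULL);
    int n = strcmp(str1, str2);
    return (n > 0) - (n < 0);
}
```

For instance, with a destination buffer of eight characters, copying "bee" with strcpy and then appending "s" with strcat both return the destination pointer and leave "bees" in the buffer.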
An array element is accessed by indexing the array name. Let i be an index; the i-th element of the character array str is accessed by str[i]. On the other hand, a character array element is updated (or an assignment is performed to a character array element) by str[i] = 'x', where 'x' denotes a character literal.
As mentioned in Section 1, C does not guarantee bounds checking on array accesses and, in case of strings, the language does not ensure that the latter are null-terminated. As a consequence, improper string manipulation leads to several vulnerabilities and exploits [15]. For instance, if non null-terminated strings are passed to the functions above, the latter may return misleading results or read out of the array bound. Moreover, since strcat and strcpy do not allow the size of the destination array str1 to be specified, they are frequent sources of buffer overflows.
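The two hazards just mentioned, missing terminators and undersized destinations, can be made explicit as run-time checks. The following is a hedged sketch: `is_null_terminated` and `strcat_fits` are hypothetical helper names, and the sketch assumes that the sizes of both buffers are known to the caller, which is exactly the information that strcat and strcpy do not receive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the buffer buf[0..size) contains a null character. */
bool is_null_terminated(const char *buf, size_t size) {
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}

/* True if strcat(dst, src) would write only inside dst[0..dst_size).
 * Both buffers must hold a terminated string, and the concatenation
 * plus its terminator must fit into the destination. */
bool strcat_fits(const char *dst, size_t dst_size,
                 const char *src, size_t src_size) {
    if (!is_null_terminated(dst, dst_size) ||
        !is_null_terminated(src, src_size))
        return false;
    return strlen(dst) + strlen(src) + 1 <= dst_size;
}
```

A verification tool has to establish such conditions statically, for all executions, rather than test them on one run.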

Concrete Domain and Semantics
Our aim is to capture the presence of well-formed strings in C character arrays, to avoid undesired execution behaviours that may be security relevant. To reach our goal, we propose a character array concrete value which highlights the occurrence of null characters in it and we introduce the notion of string of interest of an array of chars. The concrete semantics relative to the operations presented in Section 3 is also given.

Character Array Concrete Semantics
Let $C$ be a finite set of characters representable by the character encoding in use, equipped with a top element $\top_C$ representing an unknown value, and let $M$ be the set of character array variables, such that $M \subseteq A$ (with $A$ being the set of array variables, of any type, presented in Section 2.2). Then, the operational semantics of character array variables are concrete array environments $\mu \in R_m$ mapping character arrays $m \in M$ to their values $\mu(m)$. Precisely:

String of Interest
We formally define the string of interest of a character array as the sequence of its elements up to the first terminating one (included).

Definition 1 (string of interest).
Let m ∈ M be an array of characters with concrete value µ(m) = (ρ, low m , high m , M m , N m ) and let z be the minimum element of N m (if it is non-empty). The string of interest of the character array described by µ(m) is defined as follows: with v i denoting the character value which occurs in the pair (i, v).

Example 5.
Consider the concrete character array value of Example 4. Its string of interest is the sequence of characters ''bee\0''.
Our definition of string of interest of character arrays allows us to distinguish well-formed strings and avoid bad usage of arrays of characters. If the null character appears at the first index of a character array, then we refer to its string of interest as null (null). In general, we refer to character arrays which contain a well-defined or null string of interest as character arrays which contain a well-formed string.
Moreover, when allocated memory capacity is not sufficient for a declared character array, the system writes a null character outside the array, occupying memory that is not destined for it and causing a buffer overflow. We do not represent this system behaviour, since it leads to an undefined one, so we simply consider the string of interest of such character arrays as undefined (undef). We stress the fact that the concrete domain we introduce is used as a framework that helps us in creating the abstract representation, and it is not how the (concrete) character array values are actually represented in C programs.
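The three cases distinguished above (well-defined, null and undefined string of interest) can be sketched operationally. The code is an illustration only: it works on an ordinary C buffer rather than on the concrete values $\mu(m)$ of the formalization, and the names `soi_kind` and `string_of_interest` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SOI_UNDEF, SOI_NULL, SOI_WELL_DEFINED } soi_kind;

/* Classifies the string of interest of buf[0..size), i.e., the characters
 * up to the first null character (included). On return, *len holds the
 * number of characters of the string of interest, terminator included,
 * or 0 when no null character occurs inside the array. */
soi_kind string_of_interest(const char *buf, size_t size, size_t *len) {
    assert(buf != NULL && len != NULL);
    const char *z = memchr(buf, '\0', size);  /* first null, if any */
    if (z == NULL) {
        *len = 0;
        return SOI_UNDEF;
    }
    *len = (size_t)(z - buf) + 1;
    return z == buf ? SOI_NULL : SOI_WELL_DEFINED;
}
```

On an array holding 'b', 'e', 'e', '\0' followed by further characters, the function reports a well-defined string of interest of four characters, matching the sequence "bee\0".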

Concrete Semantics
To formalize the concrete semantics of the C standard library functions from string.h introduced in Section 3, the following auxiliary functions embedding, extraction, comparison and substitution over single concrete character array values need to be introduced.
Moreover, consider the intervals of equal length: The function embedding($\mu(m_1)$, [2,4]
Definition 4 (comparison). Let $\mu(m_1), \mu(m_2) \in M$ be two concrete character array values which contain a fully initialized well-formed string of interest, i.e., no $\top_C$ occurs. The function comparison($\mu(m_1)$, $\mu(m_2)$) (cf. Algorithm 1) lexicographically compares the strings of interest of $\mu(m_1)$ and $\mu(m_2)$ and returns an integer value $n$ which denotes the lexicographic distance between them. Notice that $n$ will be strictly smaller than zero if string($\mu(m_1)$) precedes string($\mu(m_2)$) in lexicographic order, equal to zero if string($\mu(m_1)$) and string($\mu(m_2)$) are lexicographically equivalent, and strictly greater than zero if string($\mu(m_1)$) follows string($\mu(m_2)$) in lexicographic order.

Example 8.
Let µ(m 1 ) and µ(m 2 ) be the character array concrete values of Example 6. Both of them contain a fully initialized well-formed string of interest and the function comparison(µ(m 1 ), µ(m 2 )) computes the lexicographic distance between them. Precisely, the procedure stops after the first iteration of the for loop (c.f. Algorithm 1) and, assuming ASCII as the character encoding set, it returns the value -1, i.e., n = 97 -98, which means that string(µ(m 1 )) lexicographically precedes string(µ(m 2 )).
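Algorithm 1 is not reproduced here. As a minimal sketch in the same spirit, and assuming that the lexicographic distance is the difference between the codes of the first pair of differing characters (which is what the value 97 - 98 in Example 8 suggests), comparison can be written over ordinary null-terminated C strings as follows.

```c
#include <assert.h>
#include <stddef.h>

/* Lexicographic distance between two well-formed strings: walks both
 * strings while the characters agree and the terminator is not reached,
 * then returns the difference of the first differing character codes
 * (0 if the strings are equal). */
int comparison(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    for (; *s1 != '\0' && *s1 == *s2; s1++, s2++) {
        /* characters agree: keep scanning */
    }
    return (unsigned char)*s1 - (unsigned char)*s2;
}
```

With ASCII encoding, comparing a string starting with 'a' against one starting with 'b' stops at the first position and yields 97 - 98 = -1.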

Definition 5 (substitution). Let $\mu(m) \in M$ be a concrete character array value, $z \in [[\![low_m]\!]\rho, [\![high_m]\!]\rho)$ be an index and $c \in C$ be a character. The function substitution($\mu(m)$, $z$, $c$) substitutes the character which appears in $\mu(m)$ at the index $z$ with the character $c$. Formally, substitution($\mu(m)$, $z$, $c$) = $\mu(m)$ such that:
Let $\mu(m_1)$ be the character array concrete value of Example 6, the index $z$ be equal to 4 and the character $c$ be the null terminator '\0'. The function substitution($\mu(m_1)$, 4, '\0') = $\mu(m_1)$ such that:
The size.condition is true if:
The semantics operator M, when applied to strcpy($M_1$, $M_2$), behaves similarly to the string concatenation function above. Formally,
The size.condition is true if:
Moreover, $M_1$ is the set of embedding($\mu(m_1)$, $[l_1, u_1]$, $\mu(m_2)$, $[l_2, u_2]$), such that:
In particular, M is the set of substitution($\mu(m)$, $j$, $v$) (cf. Definition 5).
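The effect of a substitution on the string of interest can be sketched on plain C buffers. The code below is an illustration under the assumption that the array bounds are 0 and `size`; the helper `contains_well_formed_string` is a hypothetical name. It also shows why overwriting the first null character is not necessarily an error: a later null character may take over as terminator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Replaces the character at index z with c. Returns false, leaving the
 * buffer untouched, when z lies outside buf[0..size). */
bool substitution(char *buf, size_t size, size_t z, char c) {
    assert(buf != NULL);
    if (z >= size)
        return false;
    buf[z] = c;
    return true;
}

/* True if a null character occurs inside the array, i.e., the array
 * contains a well-defined or null string of interest. */
bool contains_well_formed_string(const char *buf, size_t size) {
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}
```

Overwriting the terminator of an array that holds a second null character yields a longer, still well-formed string; doing the same on an array with a single null character leaves no string of interest inside the array.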

M-String
In the previous section we defined the concrete value of a character array, which highlights the presence of a well-formed string in it. Moreover, we presented our concrete domain $\mathcal{P}(M)$, made of sets of character array values, and its concrete semantics for some operations of interest. In the following we formalize the M-String abstract domain, which approximates elements in $\mathcal{P}(M)$, and its semantics, for which soundness is proved.
1. $\overline{B}$ denotes the abstraction of segment bounds, equipped with the addition ($+_{\overline{B}}$) and subtraction ($-_{\overline{B}}$) operations.
2. $\overline{C}$ is the abstraction of the character array elements; it is signed, it contains the value 0, and it is equipped with is_null, a special monotonic function lifting abstract elements in $\overline{C}$ to a value in the set {true, false, maybe}, and with subtraction ($-_{\overline{C}}$).
3. $\overline{R}$ denotes the abstraction of scalar variable environments (cf. Section 2.2), namely the constant propagation domain on the set of variables $X$.
where $\alpha_{\overline{B}}$ and $\gamma_{\overline{B}}$ are respectively the abstraction and concretization functions over the bounds abstract domain. Please note that $b_1$ and $b_n$ respectively represent the segmentation lower and upper bound and, in the case in which $\overline{m}$ corresponds to the split segmentation, the segmentation upper bound is hidden, due to a representation choice, and equal to $b_{n-1} +_{\overline{B}} 1$.
2. $p_i \in \overline{C}$ are abstract predicates, chosen in an abstract domain $\overline{C}$, denoting possible values of pairs (index, character array element value) in a segment, for relational abstraction, character array elements otherwise.
3. The question mark ?, if present, indicates that the preceding segment might be empty, while the space symbol indicates a non-empty segment and, as in [6], non-empty segments are not marked.

Example 10.
Consider the split segmentation abstract predicate m = ([0, 0] 'a' [2,5], ∅) where C is the constant propagation domain for characters and B the interval domain. m approximates character arrays certainly containing a string of interest which is actually a sequence of 'a', whose length goes from 2 to 5, followed by a null character, e.g., "aa\0" and "aaaaa\0".
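The meaning of the abstract element of Example 10 can be sketched as a membership test. This is a deliberately restricted illustration, not the concretization function of the domain: it handles only a split segmentation whose first component has a single segment and whose second component is empty, with interval bounds and a constant character, and the names `bound`, `one_segment_mstring` and `describes` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Interval abstraction of a segment bound. */
typedef struct { size_t lo; size_t hi; } bound;

/* One-segment split segmentation (lower value upper, empty set):
 * the string of interest consists of copies of `value`. */
typedef struct { bound lower; char value; bound upper; } one_segment_mstring;

/* True if the null-terminated string s is among the strings of interest
 * approximated by m: it starts at index 0, all its characters equal
 * m->value, and its terminator index lies in the upper bound interval. */
bool describes(const one_segment_mstring *m, const char *s) {
    assert(m != NULL && s != NULL);
    if (m->lower.lo != 0)          /* C strings start at index 0 */
        return false;
    size_t len = 0;
    while (s[len] != '\0') {
        if (s[len] != m->value)
            return false;
        len++;
    }
    return m->upper.lo <= len && len <= m->upper.hi;
}
```

With lower bound [0, 0], character 'a' and upper bound [2, 5], the test accepts "aa" and "aaaaa" and rejects "a", "aaaaaa" and "ab".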
In the rest of the paper we will refer to the s and to the ns parameters of a given split segmentation abstract predicate m by m.s and m.ns respectively.
M-String, like FunArray, is equipped with join $\sqcup_{\overline{M}}$, meet $\sqcap_{\overline{M}}$, widening $\nabla_{\overline{M}}$ and narrowing $\Delta_{\overline{M}}$ operators (cf. Section 2.2.2). We highlight the fact that the choice of $\overline{B}$ is left free, so the segmentation unification algorithm presented in [6] needs to be modified accordingly, while preserving its original requirements. The unify procedure behaves as follows: given m 1
Example 11. Consider the following split segmentations: [7,7], ∅) and [3,6] even [7,7], ∅). Their unification leads to the abstract elements [7,7], ∅). Observe that unify yields a pair of segmentations with the same number of segments, which is not always optimal.
if m 1 .s and m 2 .s (resp. m 1
Formally, we first define the concretization function $\gamma^*_{\overline{M}}$ of a generic segment $b\,p\,b'[?']$ (regardless of which part of the split it belongs to), following [6], which corresponds to the set of character array values whose elements in the segment $[b, b')$ satisfy the predicate $p$.
is the concretization function for the variable environment abstract domain, $\gamma_{\overline{B}} \in \overline{B} \to \mathcal{P}(Z^+)$ is the concretization function for the segment bounds abstract domain, and $\gamma_{\overline{C}} \in \overline{C} \to \mathcal{P}(Z \times C)$ is the concretization function for the array characters abstract domain. Recall that the upper bound of $\overline{m}$.s is not followed by a segment abstract predicate. Let $b$ be the upper bound of $\overline{m}$.s (which may coincide with the lower bound of $\overline{m}$.s in the case in which $\overline{m}$ approximates character arrays containing null strings of interest). $b$ is equivalent to the segment $b\,p\,b'$ such that $b' = b +_{\overline{B}} 1$ and $p$ is null.
An abstract element in the M-String domain is a pair of segmentations. Thus, we define the concretization function of the possible m.s and m.ns belonging to a character array abstract predicate m, i.e., γ M ∈ M → R → P(M). Let +M denote the concatenation of several concrete values.
Finally, the concretization function of a split segmentation abstract predicate m is as follows: where + M returns all the possible concatenations between a concrete array value taken from γ M (m.s), and a concrete array value taken from γ M (m.ns).

Definition 8 (invalid segment). Given a generic segment $b\,p\,b'[?']$, it is considered invalid if its segment abstract predicate $p$ is equal to $\bot_{\overline{C}}$ and its upper bound $b'$ is not followed by a question mark.
In the implementation we will make use of two functions lift and lower that relate single strings to their abstraction in M-String.

Abstract Semantics
Let us now formalize the abstract semantics of the concrete operations defined in Section 4.3, over the M-String domain. In doing so, we will take advantage of the auxiliary function minlen which computes the minimum length of an element m ∈ M, as the upper bound of a split segmentation is possibly followed by a question mark.
Definition 11 (m minimum length). Let $\overline{m} \in \overline{M}$ be different from $\bot_{\overline{M}}$ and let $low_m, high_m \in \overline{B}$ denote the lower and the upper bound of $\overline{m}$, respectively. We define the minimum length of a split segmentation abstract predicate $\overline{m}$, denoted by minlen($\overline{m}$), as follows: if $\overline{m}.ns \neq \emptyset \wedge high_m$ is followed by ? $\wedge\ \exists k \in \overline{m}.ns : k = \max\{i \in \overline{m}.ns \mid b_i \text{ is not followed by ?}\}$
Please note that in the second case of Definition 11, the minimum length of a split segmentation corresponds to its length, denoted by len($\overline{m}$). The len operation can also be applied to the parameters of $\overline{m}$ themselves, when they are different from the empty set and their upper bound is not question marked, which is always the case with $\overline{m}$.s.

Abstract String Concatenation
The k-1 ))[? 2 2 ] such that b denotes the immediately preceding adapted segment bound. On the other hand, m 1 .ns is the result of removing from m 1 .ns the sub-segmentation that goes from the lower bound of m 1 .ns to the upper bound of m 1 .s included.

Abstract String Character
The semantics operator $M_{\overline{M}}$, when applied to strchr$_v$($\overline{m}$), returns a split segmentation abstract predicate whose left-hand side parameter is the suffix segmentation of the input $\overline{m}$.s starting from the first segment in which $v$ certainly occurs, and whose right-hand side parameter is the empty set, if $\overline{m}$ approximates character arrays which contain a well-formed string and the character $v$ appears in at least one segment whose bounds are not question marked. Otherwise, if $\overline{m}$ approximates character arrays which contain a well-formed string of interest and the abstract character $v$ does not occur in $\overline{m}$.s, it returns $\bot_{\overline{M}}$; otherwise it returns $\top_{\overline{M}}$. Formally, are not question marked}.

Abstract String Compare
The semantics $P_{\overline{M}}$ is the abstract counterpart of P. In particular, strcmp($\overline{m}_1$, $\overline{m}_2$) returns a value $n$ denoting the lexicographic distance between $\overline{m}_1$.s and $\overline{m}_2$.s if both the input split segmentations approximate character arrays which contain a well-formed string and they can be unified; otherwise it returns $\top_Z$.
Notice that if $n$ is negative, the strings of interest approximated by $\overline{m}_1$ precede those represented by $\overline{m}_2$ in lexicographic order. Conversely, if $n$ is positive, the strings of interest approximated by $\overline{m}_1$ follow those represented by $\overline{m}_2$ in lexicographic order, and if $n$ is equal to zero they are lexicographically equal. Formally,

Program Abstraction
Adapting M-String to the analysis of real-world C programs requires, first of all, a procedure that identifies string operations automatically. A subset of such operations then has to be performed using abstract operations, carried out on a suitable abstract representation. The technique that captures this approach is known as abstract interpretation. A typical implementation is based on an interpreter in the programming language sense: it executes the program by directly performing the operations written down in the source code. However, rather than using concrete values and concrete operations on those values, part (or the entirety) of the computation is performed in an abstract domain, which over-approximates the semantics of the concrete program.
In this paper, we mainly focus on string abstraction. Therefore we will interpret the portions of the program that do not make use of strings without abstracting values. We only apply abstraction to strings that within the program are manipulated by string operations: when the program deals with string variables that exhibit minimal variation, e.g., string literals, the M-String representation would provide no benefit, and instead it could either hurt performance or it may introduce spurious counterexamples.
Based on the considerations above, it is clear that it is beneficial to reuse and refactor existing tools that implement abstract verification in a modular way on explicit programs. A compilation-based abstraction design that follows this approach was introduced and implemented in [7]. However, such a tool is designed to abstract scalar values only. This is why we need to extend it to operate with more sophisticated domains that represent more complex objects, such as strings.
In the rest of this section, we will first summarize the general approach to abstraction as a program transformation. In Section 6.3, we explore the implications of aggregate (as opposed to scalar) domains within this framework. Sections 6.4 and 6.5 then go on to discuss the semantic (run-time) aspects of the abstraction and which operations we consider as primitives of the abstraction.

Compilation-Based Approach
Instead of (re-)interpreting instructions abstractly, in a compilation-based approach, abstract instructions are transformed into equivalent explicit code that implements the abstract computation. The transformation takes place before the analysis of the program (e.g., model checking), during the compilation process.
Consequently, the analysis processes the program without needing special knowledge of the abstract domains in use, as the abstraction is encoded directly in the program. Figure 1 depicts a comparison of the compilation-based approach with the interpretation-based approach adopted by more conventional abstract interpreters.

Figure 1. Interpretation-based versus compilation-based abstraction. In the interpretation-based approach, the whole abstract interpretation is performed at runtime: the bitcode operations are interpreted abstractly by a virtual machine (VM), which maintains an abstract state. In this way, an abstract state space is generated for a model-checking algorithm (MC). In the compilation-based approach, the abstract operations are instrumented into the compiled program and their implementation is provided as a library linked to the bitcode. The virtual machine then executes the instrumented program as regular bitcode [7].
In a compilation-based approach, two different abstraction perspectives are considered:
1. static, referring to the syntax and the type system;
2. dynamic, or semantic, referring to execution and values.
The LART tool performs syntactic (static) abstraction on LLVM bitcode [16]. Syntactic abstraction replaces some of the LLVM instructions that occur in the program with their abstract counterparts, as depicted in Figure 2.

Syntactic Abstraction
The first step of program abstraction performed by LART is a syntactic abstraction. Syntactic abstraction replaces LLVM instructions or whole functions with their abstract counterparts. Since we do not want to perform all operations abstractly, we need to identify exactly those operations that might receive abstract values as their arguments. Abstract values enter the program as input values. Starting from these values, LART uses a combination of data-flow and alias analyses to compute all operations that might come into contact with abstract values. As a result of these analyses, LART obtains a set of possibly abstract operations, which are replaced by their abstract equivalents, e.g., strcat and strlen are replaced by abstract_strcat and abstract_strlen. Abstract operations then implement the manipulation of abstract values, in our case M-Strings as described in Section 4. In other words, the specific meaning of abstract instructions and abstract values defines the semantic abstraction.
For the precise formulation of syntactic abstraction, we take advantage of the static type system of LLVM. We leverage the fact that we can assign to each variable its type, which is either concrete or abstract. In this way, we can precisely set a boundary between concrete and abstract values.
Let us consider a simplified version of LLVM. It defines a set of concrete scalar types S. The set of all possible types is given by a map Γ that inductively defines all finite (non-recursive) algebraic types over the set of given scalars. To be precise, the set of all possible types Γ(T) derived from a set of scalars T is defined as follows: 1.
In a concrete LLVM program, the set of admissible types comprises those derived from concrete scalars S, i.e., $\Gamma(S)$. In syntactic abstraction, we need to extend the admissible types by abstract types. From these, we generate all possible types using Γ. Depending on the type of abstraction, we use a different set of basic abstract types. In the case of scalar abstraction, the set of basic abstract types contains the abstract scalar types $\overline{S}$. The correspondence between concrete and abstract scalars is given by a bijective map $\Lambda : S \to \overline{S}$. Finally, each value that exists in the abstracted program has an assigned type from $\Gamma(S \cup \overline{S})$. Specifically, this implies that the abstraction works with mixed types: products and unions might contain both concrete and abstract fields. Moreover, it is possible to create pointers to both abstract and mixed values.

Aggregate Domains
In addition to scalar values, which cannot be further decomposed, programs typically operate on more complex data that can be seen as compositions (aggregates) of multiple scalar values. Depending on their nature, aggregates can be classified as arrays, which contain a variable number of items, or records, which contain a fixed number of items in a fixed layout, where each item can be of a different type. The items in such aggregates can be (and often are) scalars. However, more complex aggregates are also possible: arrays of records, records which in turn contain other records, and so on.
While scalar domains deal only with simple values, in aggregate abstraction we consider composite data in the spirit of the above definition. Similarly to scalar domains, abstract aggregate domains approximate concrete aggregate values by describing a particular set of aggregate properties. For example, we can describe a set of aggregates by their length or by the set of values that appear in the aggregate. In M-String, the tracked properties take the form of a segmentation, where segments are further abstracted by bounds and characters. Values in an aggregate domain then keep a representation of the chosen properties, and the operations update them. For instance, consider an array-length domain: its operations work only with the lengths of arrays, e.g., the abstract concat of two arrays adds together the lengths of its arguments (abstract arrays).
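The array-length domain mentioned above can be sketched as follows. This is a minimal illustration, not part of the LART implementation: the names AbstractArray, lift_length, join and abstract_concat are hypothetical, and we assume that a set of lengths is approximated by an interval.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical array-length domain: an abstract array is known only by an
// interval [lo, hi] that contains its concrete length (an assumption made
// for this sketch).
struct AbstractArray {
    std::size_t lo;
    std::size_t hi;
};

// Abstraction of a single concrete array of length n.
inline AbstractArray lift_length(std::size_t n) { return {n, n}; }

// Join of two abstract arrays: the smallest interval covering both.
inline AbstractArray join(AbstractArray a, AbstractArray b) {
    return {std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Abstract concat: the lengths of the arguments are added bound by bound.
inline AbstractArray abstract_concat(AbstractArray a, AbstractArray b) {
    return {a.lo + b.lo, a.hi + b.hi};
}
```

For instance, concatenating an array whose length is known to lie between 2 and 5 with an array of length 3 yields an abstract array whose length lies between 5 and 8; the contents of the arrays are never inspected.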
In general, aggregate domains can provide arbitrary operations. However, two operations are, in some sense, universal, being elementary memory manipulation operations: byte-wise access and update of the aggregate. The universality of these operations originates from the fact that all aggregate operations can be represented as accesses and updates. In a low-level representation of a program (assembly), they are usually present in this form. LLVM allows a slightly higher level of manipulation, namely access and update of the individual scalars present in the aggregates (as opposed to bytes). For M-String, though, this distinction is not essential because the scalars stored in C strings are individual bytes (characters). All other operations are present in the form of sequences of elementary instructions, possibly encapsulated in functions. Moreover, just as access and update represent an interface between scalars and memory in concrete programs, in the abstraction they form an interface between scalar and aggregate domains (even in the case of byte-oriented access, since bytes are also scalars). We refer the reader to Section 4.3.1 for the abstract semantics of access and to Section 4.3.7 for the abstract semantics of update.
In comparison to scalar abstraction, the syntactic abstraction of aggregates does not operate directly with aggregate types. In LLVM, aggregate values are usually represented by a pointer to the underlying aggregate type. Therefore, all accesses and updates are made through pointers to the aggregates. For instance, strings are represented as a pointer to a character array. We need to take this fact into account when we perform the syntactic abstraction. In the analysis, we consider the pointers to aggregates as base types for the abstraction. In the case of arrays, the base types are concrete pointers to those arrays: let us call them $P^*$, where $P^* \subseteq \Gamma(S)$. A set of abstract pointer types $\overline{P^*}$ then describes the types of abstracted aggregates (arrays). As for scalar domains, we define a natural correspondence between pointers to concrete values and abstract aggregates as a bijective map $\Lambda : P^* \to \overline{P^*}$. For instance, in the case of the M-String abstraction, the map Λ assigns to char* the type of an M-String value. Finally, we allow all the mixed types generated from scalars and abstract aggregates: $\Gamma(S \cup \overline{P^*})$.
Observe that pointers, in general and also in LLVM, maintain two pieces of information about a memory location: they represent both the memory object and an offset into that object. In particular, our implementation treats the first 32 bits of the pointer as an object identifier and the last 32 bits as its offset. This distinction is not very relevant in explicit programs because the two components are represented uniformly in a single value and often cannot be distinguished at all. However, the distinction becomes relevant when dealing with abstract aggregate values. In this case, the object component of the pointer is concrete, as it determines a single specific abstract object. On the other hand, the offset component may or may not be concrete. The choice depends on the specific abstract aggregate domain: it may be more advantageous to represent the offset abstractly, i.e., by a 32-bit abstract scalar value. Observe that in both cases a memory access through such a pointer needs to be treated as an abstract access or update operation.
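The split of a pointer into its two components can be illustrated by the following sketch. It assumes that the object identifier occupies the upper 32 bits and the offset the lower 32 bits of a 64-bit value, which is one possible reading of "first" and "last" above; the names SplitPointer, split, pack and advance are hypothetical and do not refer to DIVINE or LART code.

```cpp
#include <cassert>
#include <cstdint>

// Assumption of this sketch: the upper 32 bits identify the memory object,
// the lower 32 bits hold the offset into that object.
struct SplitPointer {
    std::uint32_t object;
    std::uint32_t offset;
};

// Decompose a raw 64-bit pointer value into its two components.
inline SplitPointer split(std::uint64_t raw) {
    return {static_cast<std::uint32_t>(raw >> 32),
            static_cast<std::uint32_t>(raw & 0xFFFFFFFFu)};
}

// Recombine the two components into a raw 64-bit value.
inline std::uint64_t pack(SplitPointer p) {
    return (static_cast<std::uint64_t>(p.object) << 32) | p.offset;
}

// Pointer arithmetic only changes the offset; the object stays fixed.
inline SplitPointer advance(SplitPointer p, std::uint32_t n) {
    return {p.object, static_cast<std::uint32_t>(p.offset + n)};
}
```

In an abstract aggregate domain, the object field would stay a concrete integer as shown, while the offset field could be replaced by an abstract 32-bit scalar.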
In LLVM, two basic memory access operations are defined, load and store, corresponding to the access and update operations. It is important to notice that memory access is always explicit: memory is never used in a computation directly. This observation is used in the design of aggregate abstraction, where we can assume that the access to the content of an aggregate will always go through a pointer associated with the abstract object.

Semantic Abstraction
In syntactic abstraction, we dealt with the syntax of operations, their types, and the types of values and variables; it described how LART performs a source-to-source transformation. In contrast, semantic abstraction is concerned with the values computed by a program at runtime. It defines how abstract operations modify values and how values are transferred between their concrete and abstract representations. Therefore, just as syntactic abstraction defined the maps $\Lambda$ and $\Lambda^{-1}$ to transfer between concrete and abstract types, semantic abstraction makes use of lift and lower (cf. Definitions 9 and 10): operations (instructions) that convert values between their concrete and abstract representations. They realize a runtime implementation of the domain functions: abstraction ($\alpha_M$ in the case of M-String) and concretization ($\gamma_M$).
The lift operation implements the abstraction of concrete values by a single over-approximating abstract value. For example, in Figure 2 on line 3 of the abstracted program, a concrete string b is lifted to the abstract domain. This allows abstract_strcat to be performed in a single abstract domain. In other words, operations do not need to consider concrete values because all their arguments are lifted to the abstract domain. This simplifies the implementation of a domain and reduces the number of possible domain interactions. In comparison to Λ, which is a purely syntactic construct, lift and lower perform the actual conversion of values between domains at program runtime. During program execution, lowering an abstract value into multiple concrete values can be seen as nondeterministic branching in the program, and the lower operator is indeed based on a nondeterministic choice operator. In a model checker, the nondeterministic choice would typically be implemented as branching in the state space, and the consequences of all possible outcomes would be explored. In a testing context, however, the choice might be implemented as random, by choosing one particular path. For further details of the program transformation performed by LART, we refer the reader to [7].
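The roles of lift and lower can be sketched on a toy interval domain. The sketch below is an assumption-laden illustration rather than the LART interface: Interval, lift, abstract_add and lower are hypothetical names, and the nondeterministic choice is modelled by returning every possible outcome as a list of branches.

```cpp
#include <cassert>
#include <vector>

// Toy abstract domain used only for illustration: an integer interval.
struct Interval {
    int lo;
    int hi;
};

// lift: a concrete value becomes the singleton abstract value.
inline Interval lift(int v) { return {v, v}; }

// An abstract operation takes only abstract operands, so a concrete
// operand has to be lifted first.
inline Interval abstract_add(Interval a, Interval b) {
    return {a.lo + b.lo, a.hi + b.hi};
}

// lower: every concrete value described by the abstract value is one
// outcome of the nondeterministic choice. A model checker would explore
// each element as a separate branch; a tester could pick one at random.
inline std::vector<int> lower(Interval a) {
    std::vector<int> branches;
    for (int v = a.lo; v <= a.hi; ++v) branches.push_back(v);
    return branches;
}
```

Adding the lifted constant 10 to the interval [1, 3] gives [11, 13], and lowering that result produces three branches, one per concrete value.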

Abstract Operations
As a result of syntactic abstraction, we obtain a program that temporarily contains abstract operations. These operations take abstract values as operands and return abstract values as results. However, after the program transformation, the resulting program is required to be semantically valid LLVM bitcode. Therefore, we demand that each abstract operation can be realized as a sequence of concrete instructions. This allows us to obtain an abstract program that does not contain any abstract operations and to execute it using standard (concrete, explicit) methods.
More precisely, syntactic abstraction substitutes concrete operations with their abstract counterparts: an operation of type $(t_1, \ldots, t_n) \to t_r$ is substituted by an abstract operation of type $(\Lambda(t_1), \ldots, \Lambda(t_n)) \to \Lambda(t_r)$. Furthermore, the transformation inserts lift and lower operations as needed, e.g., in places where concrete values are operands of abstract operations. The implementation is free to select the operations to be abstracted and the places where value lifting and lowering are inserted, as long as the type constraints are satisfied. In practice, it tries to minimize the number of abstracted operations.
In addition to LLVM instructions, the M-String abstraction requires the transformation to abstract function calls to standard library functions such as strcmp, strcat. From the perspective of syntactic abstraction, we can treat function calls as single atomic operations that take abstract values and produce abstract results. Hence, the transformation substitutes them in the same way as instructions: for instance strcmp operation of type (m, m) → s is replaced by abstract_strcmp of type (Λ(m), Λ(m)) → Λ(s) where m is a concrete character array and s is a concrete scalar result of the string comparison. Afterwards, all abstract operations are implemented by using concrete subroutines (implementation of abstract semantics). For details, see [7].
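The type-level substitution can be mimicked with C++ type traits. The following sketch is only an analogy under stated assumptions: Lambda and LambdaFn are hypothetical templates standing in for the map Λ, and AbstractInt and AbstractString are empty placeholder types rather than the types used by LART.

```cpp
#include <cassert>
#include <type_traits>

struct AbstractInt {};     // placeholder for an abstract scalar type
struct AbstractString {};  // placeholder for an M-String value type

// The type map: defined only for types that have an abstract counterpart.
template <typename T> struct Lambda;
template <> struct Lambda<int> { using type = AbstractInt; };
template <> struct Lambda<const char*> { using type = AbstractString; };

// Lifting of a function type (t1, ..., tn) -> tr to
// (Lambda(t1), ..., Lambda(tn)) -> Lambda(tr).
template <typename F> struct LambdaFn;
template <typename R, typename... Args>
struct LambdaFn<R(Args...)> {
    using type = typename Lambda<R>::type(typename Lambda<Args>::type...);
};

using ConcreteStrcmp = int(const char*, const char*);
using AbstractStrcmp = LambdaFn<ConcreteStrcmp>::type;

static_assert(
    std::is_same<AbstractStrcmp,
                 AbstractInt(AbstractString, AbstractString)>::value,
    "strcmp is lifted argument by argument");
```

The static assertion checks at compile time that the lifted signature of strcmp takes two abstract strings and returns an abstract scalar, mirroring the replacement of strcmp by abstract_strcmp.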
Observe that, as an alternative approach, the standard library functions strcat, strcmp, etc. could have been transformed instruction by instruction, using only abstract access and update of the string content. However, the price to pay would have been losing a certain degree of accuracy in the abstraction, the exact amount depending on the individual operation.

Instantiating M-String
As an aggregate domain, M-String is parametrizable by scalar domains of characters and indices (bounds). This allows us to tailor the abstraction to the needs of the analysis of string values. Depending on the precision of the chosen domains, the instance of the M-String domain will inherit their properties. With more precise domains, the M-String values will maintain a higher granularity of segmentation. On the other hand, a simpler character representation will decrease the segmentation granularity at the cost of a higher rate of false alarms.
A particular instance of M-String is automatically derived from the parametric description given in Section 5, provided a suitable scalar domain C for characters and a scalar domain B to represent segment bounds. The instantiation demands that both scalar domains C and B are equipped with the operations used when manipulating the segmentation. These are mainly elementary arithmetic and relational operations. In the implementation, we provide an M-String domain template that automatically derives all the operations from the provided scalar domains.
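The idea of a domain template parametrized by C and B can be sketched as follows. This is a simplified illustration, not the template shipped with LART: MStringSketch and its members are hypothetical names, and the access operation as written assumes that comparisons on B return a plain bool (true for concrete bounds, whereas an abstract bound domain would return abstract truth values and require a join over segments).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical template, not the LART API. C is the character domain and
// B the bound domain. B must provide operator< and operator<=.
template <typename C, typename B>
struct MStringSketch {
    std::vector<C> chars;   // chars[k] is the character filling segment k
    std::vector<B> bounds;  // segment k spans [bounds[k], bounds[k + 1])

    // access: return the character of the segment that contains idx.
    // Precondition: bounds.front() <= idx < bounds.back().
    C access(const B& idx) const {
        for (std::size_t k = 0; k < chars.size(); ++k)
            if (bounds[k] <= idx && idx < bounds[k + 1])
                return chars[k];
        return chars.back();  // not reached when the precondition holds
    }
};
```

Instantiating the sketch as MStringSketch<char, int> with characters a, b and the zero character and bounds 0, 3, 5, 8 describes the buffer "aaabb" followed by three zero bytes; an access at index 4 returns b without the buffer ever being materialized.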

Symbolic Scalar Values
In program verification, it is common practice to represent certain values symbolically (for instance, inputs from the environment). The symbolic representation allows the verifier to consider all admissible values with a reasonably small overhead. In DIVINE, symbolic verification is implemented using an abstraction similar to the one described in the previous section: symbolic scalar values represent their content by SMT formula expressions (terms) in the form of abstract syntax trees.
The input values are represented as unconstrained variables in the bit vector logic. Operations then build formulae trees from their arguments. In addition to these so-called data definitions, symbolic representation also maintains one global formula of constraints (path-condition), which is derived from the control flow of the program. A more detailed description of this symbolic representation is presented in [7].
The domain of symbolic values (we call it a term domain) requires DIVINE to be augmented with an SMT solver for a suitable theory. For scalars in C programs, we use the bitvector theory. DIVINE uses the solver to detect computations that have reached the bottom of the term domain (those are the infeasible paths through the program). Furthermore, as a model checker, it needs to identify equal states and to decide whether one state subsumes another. This is achieved by an equivalence check of the corresponding formulae. With these prerequisites, the symbolic representation together with the bitvector theory is a precise abstraction (i.e., it is not an approximation but models the program state faithfully).
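A term domain of this kind can be sketched as a small abstract syntax tree whose operations build formulae instead of computing values. The sketch is illustrative only: Term, leaf, apply and to_smt are hypothetical names and do not reflect the data structures used in DIVINE; the operator names bvadd and bvult are taken from the SMT-LIB bitvector theory.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical term representation: a node is a variable, a literal or an
// operator applied to argument terms.
struct Term {
    std::string op;                                 // name, literal or operator
    std::vector<std::shared_ptr<const Term>> args;  // empty for leaves
};
using TermPtr = std::shared_ptr<const Term>;

// An unconstrained input variable or a literal.
inline TermPtr leaf(const std::string& name) {
    return std::make_shared<Term>(Term{name, {}});
}

// An operation does not compute a value; it builds a larger term.
inline TermPtr apply(const std::string& op, TermPtr a, TermPtr b) {
    return std::make_shared<Term>(Term{op, {a, b}});
}

// Print the term in SMT-LIB prefix notation.
inline std::string to_smt(const TermPtr& t) {
    if (t->args.empty()) return t->op;
    std::string out = "(" + t->op;
    for (const auto& a : t->args) out += " " + to_smt(a);
    return out + ")";
}
```

For a symbolic input x, the program expression x + 1 < n becomes the term (bvult (bvadd x #x00000001) n), which could be appended to the path condition when the corresponding branch is taken.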

Concrete Characters, Symbolic Bounds
In the evaluation, we instantiate the M-String domain in two ways. The first, simpler instantiation sets the domain of characters C to be the concrete domain (i.e., we let the characters be represented by themselves). We let the domain of segment bounds B be symbolic 32-bit integers. This instantiation strikes a balance between simplicity (both domains used as parameters were already present in DIVINE) and the ability to describe strings of undetermined length and structure.
At the implementation level (as described in more detail in the following section), the domain remains generic: the particular domains we picked can be easily substituted by other domains. Compared to the theoretical description of M-String, the implementation uses a slightly simplified representation of segmentation by a pair of arrays (cf. Figure 3). The elements of these arrays are characters and bounds, whose types are derived from the parametrization, i.e., from the scalar domains C and B. This modification of the representation is just an optimization for the implementation and does not affect the semantics of the operations. The analysis with this representation is presented in Example 14. This instantiation of M-String is particularly suitable for representing strings that consist of variable-length runs of a single character, i.e., strings of the form $a^k b^l c^m \ldots$, where relationships between k, l, m, ... can be specified using standard arithmetic and relational operators and each of a, b, c is a concrete letter. This, in turn, allows M-String to be used for the analysis of program behavior on broad classes of input strings described this way. A more detailed description of this approach can be found in Section 8.

$[0, b_1, b_2, b_3, b_4]$. In the following, we describe mstring values as pairs of these two arrays. The second line creates a symbolic index of arbitrary value. On line 3, the program constrains the index to be smaller than the maximal length of the mstring; otherwise, the update on the next line would yield an error. Next, the program assigns the character y to the position given by the abstract index. The assignment is implemented as an update operation on the mstring value. Depending on the value of idx, the operation results in one of the following strings $str_x$, and the result is the join of all possibilities:

1. if $idx < b_1$: idx falls into the first segment, and the update creates a new segment between idx and idx + 1 containing the character y. Notice that if $idx = 0$ the first segment is empty; similarly, the third segment is empty if $idx + 1 = b_1$. The string of interest of $str_1$ is of the form $x^{idx}\, y\, x^{b_1 - idx - 1}$.

2. if $b_1 \le idx < b_2$: idx falls into the segment of zeros that follows the first segment, and the string of interest of $str_2$ is the join of the following forms:
• if the update is performed right after the first segment, i.e., $idx = b_1$, there are two cases: if $b_2 - b_1 > 1$, i.e., the segment of zeros contains more elements, then the string of interest has the form $x^{b_1} y$; otherwise, the update overwrites the single zero character and hence extends the string of interest by the segment of y characters;
• otherwise, there is a terminating zero between the first segment and idx, hence the string of interest remains unchanged: $x^{b_1}$.

3. if $b_2 \le idx < b_3$: then $str_3 = str$, because the update stores the same character as is already present in the segment.

4. if $b_3 \le idx < b_4$: then the update creates a new segment inside the last sequence of zeros.

Consequently, the abstract_strlen operation on the last line of the program computes the join of all possible lengths of strings of interest, i.e., $b_1 \cup b_3$.

Symbolic Characters, Symbolic Bounds
The second instantiation is used in benchmarks where the computation with M-String values encounters abstract scalars (characters). This occurs when the program obtains a character as input from the environment and tries to store it into the M-String value. Therefore, we instantiated the M-String domain with an abstract representation of characters by setting the domain C to be the term domain, which keeps track of symbolic 8-bit bitvectors (characters in the C language). In this way, we do not need to lower abstract characters before storing them in M-Strings, which was needed for the concrete domain used in the previous instantiation. However, we pay the price of more expensive computation with symbolic characters.

Implementation
Finally, we implemented the M-String abstraction as a LART domain. The implementation, with examples and documentation of domain usage, can be found online on the supplementary page https://divine.fi.muni.cz/2020/mstring. The LART domain is a C++ library that implements abstract semantics of M-String operations presented in Section 5. Such a library is then linked to the transformed program allowing the program to perform abstract analysis with model-checker DIVINE. An abstract domain definition in LART consists of a C++ class that describes both the representation (in terms of data) and the operations (in terms of code) of the abstract domain.
In the case of M-String domain, this class contains 2 attributes: an array of bounds and an array of characters, as outlined in Section 7.2 and depicted in Figure 3. The class has two type parameters: the domain to use for representing segment bounds and the domain to represent individual characters (i.e., the content of segments). A specific instantiation is then automatically derived by the C++ compiler from the classes which represent the type parameters and the parametric class which represents M-String values.
As a minimal set of operations, the M-String domain implements all requisite aggregate operations: lift, update and access. Furthermore, the implementation provides optimized versions of the string operations described in Section 5: strlen, strcpy, strcat, strcmp and strchr. These operations reduce the loss of precision that would arise if only the abstraction of accesses to and updates of strings were used.
Since C strings are stored, in fact, as shared, mutable character arrays, the implementation of the M-String domain needs to reflect the sharing semantics of such arrays. If multiple pointers exist into the same abstract string, modifications through one such pointer must be also visible when the string is accessed through another pointer. Moreover, the pointers do not have to be equal: they may point to different suffixes of the same string. Therefore, the representation of pointers to abstract strings must treat the object and the offset components separately (see also Section 6.3), and the representation of the offset component must be compatible with the bound domain B.

Experimental Evaluation
In the evaluation, we chose a few scenarios to demonstrate the properties of the abstraction. In the first scenario, we show that using abstract versions of standard functions is more efficient than transforming their concrete versions using only abstract string accesses and updates. The second scenario investigates several implementations of standard library functions: we transform them automatically in terms of accesses and updates, and we show that their results agree with the results produced by M-String library operations. In the third scenario, we evaluate the M-String instantiation with symbolic characters on a set of benchmarks from real software that contain buffer-overflow errors. Here we show that M-String can efficiently detect real-world bugs as well as prove that a program no longer contains them after they are fixed. The last benchmark shows the use of abstractions on more complex C programs. As an example, we analyze parsers automatically generated by the bison and flex tools on abstract (M-String) inputs. The resource limits for all scenarios were the same: each verification run was limited to 4 processing units (cores), 80 GB of memory, and 1 hour of CPU time. The processor used to run the benchmarks was an AMD EPYC 7371 clocked at 2.60 GHz.

M-String Operations
The first group of benchmarks focuses on the resource usage of the abstraction. The benchmarks compare the efficiency of abstract domain operations with the automatically abstracted implementations of standard library functions from PDCLib, a public-domain libc implementation, which use only the essential abstract operations: lift, update and access. The results depicted in Table 2 were measured with parametrized M-String inputs of two kinds (l is a parametric length of the input):
• Word: w is a string of the form $w = c_1^{i_1} \cdot c_2^{i_2} \cdot \ldots \cdot c_l^{i_l}$, where $\sum_{k=1}^{l} i_k \le l$ and $c_x$ is an arbitrary character from domain C.
• Sequence: w is a string of the form $w = c^{i}$, where $i \le l$ and c is a character from domain C.
For each standard library function and input type, we created an isolated benchmark in two variants: one using an abstract semantics of M-String operations (see Table 2) and the other variant (Table 3) only with an automatic abstraction of essential aggregate operations.
The first notable difference between automatically abstracted implementations of library functions and M-String operations is that the analysis of the former times out for input strings longer than 64 characters. The main cause of the inefficiency of the lifted implementations is that they have to iterate over all characters, while M-String operations iterate over larger segments. This difference also causes a blow-up of the model checker's state space for the lifted implementations, while the state space size does not change for M-String operations. The reason is that the number of segments does not change with the length of the input. Therefore, M-String operations always perform the same computation independently of the M-String length.
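The difference between segment-wise and character-wise iteration can be made concrete with a small sketch that counts loop steps. It uses concrete bounds for simplicity and is not the benchmarked code: Segments, seg_strlen, byte_strlen and expand are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified segmentation with concrete bounds (an assumption of this
// sketch): segment k holds chars[k] on [bounds[k], bounds[k + 1]).
struct Segments {
    std::vector<char> chars;
    std::vector<std::size_t> bounds;
};

// Segment-wise strlen: one step per segment, stopping at the first
// non-empty segment of zeros.
inline std::size_t seg_strlen(const Segments& s, std::size_t& steps) {
    steps = 0;
    for (std::size_t k = 0; k < s.chars.size(); ++k) {
        ++steps;
        if (s.chars[k] == '\0' && s.bounds[k] < s.bounds[k + 1])
            return s.bounds[k];
    }
    return s.bounds.back();  // no terminator inside the buffer
}

// Character-wise strlen on the expanded buffer: one step per character.
inline std::size_t byte_strlen(const std::string& buf, std::size_t& steps) {
    steps = 0;
    std::size_t i = 0;
    while (i < buf.size() && buf[i] != '\0') {
        ++i;
        ++steps;
    }
    return i;
}

// Materialize the buffer described by the segmentation.
inline std::string expand(const Segments& s) {
    std::string out;
    for (std::size_t k = 0; k < s.chars.size(); ++k)
        out.append(s.bounds[k + 1] - s.bounds[k], s.chars[k]);
    return out;
}
```

For a buffer of 40 characters a, 24 characters b and one terminating zero, both functions return 64, but the segment-wise version needs 3 steps while the character-wise version needs 64. With symbolic bounds, each of those character-wise steps may additionally fork the state space, which is the blow-up described above.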

C Standard Libraries
In the second set of benchmarks (see Table 4), we investigate whether the implementations from several standard libraries match the expected results of the abstract implementation. In other words, we perform an equivalence check of the results obtained from M-String operations against the results of the automatically abstracted (originally concrete) standard library functions. We expect both to give the same results. For the evaluation, we picked three open-source libraries: PDCLib, musl-libc and µCLibc. Since the results for the libraries are rather similar, we present here only the evaluation of PDCLib functions. The remaining results are provided in the Supplementary Material. All benchmarks showed that our implementation matches the standard one. As in the previous case, these benchmarks suffer from the state space blow-up caused by an exponential number of possible character combinations. For this reason, we decreased the size of the input strings. In addition to the large state space, the many string accesses and updates of the concrete implementations result in large SMT formulae, causing a long time to be spent in solvers.
Furthermore, notice that the analysis with Word input, which has more segments, results in longer execution times than the analysis with Sequence input. The reason is that a larger number of segments naturally causes overhead for the analyses. For example, M-String needs to consider cases in which some segments have zero length: this leads to hard SMT queries because, in the worst case, the solver needs to check all possible strings for the given segment bounds and characters.

Veriabs Overflow Benchmarks
In this scenario (see Table 5), we show that the domain is capable of efficiently finding overflow bugs. The Veriabs benchmarks consist of real-world software that exhibits overflow errors, together with fixed variants. To soundly prove the correctness of these benchmarks, we instantiate M-String with the term domain also for characters. Hence, we can reason about arbitrary strings of symbolic length. However, a drawback of this instantiation is that whenever the length of the string bounds a loop, we might have to unroll the loop infinitely in the analysis; such cases time out in the correct benchmarks.

Parsers
Lastly, we evaluate our implementation on more complex programs: automatically generated parsers. For the generation, we use the Bison tool. It reads a language specification in the form of a context-free grammar and produces a C parser that accepts the language. In the benchmarks, we generate two such parsers. The first one accepts a language of numerical expressions (mathematical expressions that consist of numbers and binary operators). The second parser accepts a simple programming language with variables and branching. We present the evaluation of both parsers in Table 6. As with the previous benchmark sets, the M-String inputs with a smaller number of segments outperformed the other analyses. In these benchmarks, we use specifically hand-crafted M-String inputs for the parsers. For the parser of mathematical expressions, the inputs were: addition, which had the form of two arbitrary numbers with a plus sign between them; ones, a simple input consisting of a single digit sequence; and alternation, an input that produced complicated M-Strings by alternating digits inside expressions. The parser of the simple programming language was evaluated on: value, an input that created a variable and assigned a constant to it; loop, a short program with some control flow; and wrong, a program that contained a syntax error.

Related Work
Static methods tailored to automatically identify buffer overflows have been extensively studied in the literature, and several inference techniques have been proposed and implemented: tainted data-flow analysis, constraint solvers for various theories (including string theories) and techniques based on them (e.g., symbolic execution), annotation analysis, and string pattern matching analysis [17]. Furthermore, a large number of bug-hunting tools based on the above-mentioned techniques and on static analysis have been implemented [18][19][20][21][22][23].
For instance, in [24] authors introduced a backward compatible method of bounds checking of C programs, which leaves the representation of pointers unchanged, allowing inter-operation between checked and unchecked code, with recompilation confined to the modules where problems might occur. The just mentioned feature differentiates the proposed schema from previously existing techniques. In [20] the static verifier of C strings CSSV is introduced. Contracts are supplied to the tool, which acts in 4 stages, reducing the problem of checking code that manipulates string to checking code that manipulates integers. Finally, Splat, described in [25], is a tool that automatically generates test inputs, symbolically reasoning about lengths of input buffers.
Static code analysis aims at approximating possible behaviours of a program without examining all of its (possibly infinite) actual executions. By a proper abstraction of data and operations, static analysis results into an over-approximation of all the possible runs of a program, and its effectiveness heavily depends on degree of precision of such an abstraction. In particular, the framework of abstract interpretation [9] can be adopted also to approximate semantics of string operations. The basic, well-known domain is a string set domain, which simply keeps track of a set of strings and it is a specific instance of the general (bounded) set domain. Others are the prefix-suffix domain (which captures the first and the last letter of a string) and the character inclusion domain (which only tracks the characters that surely or maybe appear in a string). Another general-purpose string domain is the string hash domain proposed in [26], based on a distributive hash function. More complete reviews of general-purpose string domains can be found in [11,27].
Most general-purpose domains focus on the generic aspects of strings, without accounting for the specifics of string handling by the different programming languages. However, it is often beneficial to consider specific aspects of string representation when designing abstract domains for program analysis. Referring to the C programming language, [28] has proposed an abstract domain for C strings which tracks both their length and the buffer allocated size into which they are contained.
Combined with the cell abstraction [29], this domain can describe relations between the length of variables and the offsets of pointers. Amadini et al. [27] evaluated several abstract string domains (and their combinations) for the analysis of JavaScript programs. The simplified regular expression domain, also aimed at JavaScript analysis, was defined in [30]. In addition to theoretical work, a number of tools based on the above-mentioned abstract domains and their combinations have been designed and implemented [30][31][32][33]. While dynamic languages rely heavily on strings and their analysis benefits greatly from tailored abstract domains, the specifics of the C approach to strings also deserve attention.
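The idea of tracking both the length of a C string and the allocated size of its buffer, mentioned above for [28], can be illustrated with a small sketch. The representation below (an interval of possible lengths plus a known allocation size) and the names abs_cstring, strcpy_is_safe and strcpy_abstract are assumptions made for this example only; they are not the formalization given in [28].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Closed interval [lo, hi] of possible values. */
typedef struct { size_t lo, hi; } interval;

/* Hypothetical abstract C string: possible lengths and allocated bytes. */
typedef struct {
    interval len;   /* string length, excluding the terminating null */
    size_t alloc;   /* size in bytes of the underlying buffer */
} abs_cstring;

/* strcpy(dst, src) is certainly within bounds when the longest possible
   source string plus its terminating null fits into dst's buffer. */
bool strcpy_is_safe(const abs_cstring *dst, const abs_cstring *src) {
    return src->len.hi < dst->alloc;
}

/* Abstract effect of a safe strcpy: dst takes over src's length interval,
   while its allocation size is unchanged. */
abs_cstring strcpy_abstract(abs_cstring dst, const abs_cstring *src) {
    assert(strcpy_is_safe(&dst, src));
    dst.len = src->len;
    return dst;
}
```

With an 8-byte destination, a source whose length lies in [3, 7] is accepted, whereas a source whose length may reach 8 is flagged, since its terminating null could be written past the end of the buffer.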

Conclusions
A new segmentation-based abstract domain for approximating C strings has been introduced, whose main novelty lies in abstracting both index bounds and substrings while managing strings as a pair of string buffers: the string of interest itself, and a tail of allocated and possibly initialized but unused memory.
The presented approach enables more precise modelling of the string functions in the standard C library, also accounting for their known weaknesses in the management of terminating null characters and buffer bounds. The M-String domain proves effective in identifying security flaws caused by string manipulation errors, e.g., buffer overflows.
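As a purely concrete illustration of the split between the string of interest and the tail, the sketch below locates the first null character in an allocated buffer and divides the string of interest into maximal runs of equal characters, each delimited by index bounds. It operates on concrete buffers only and is not the abstract domain itself; the names segment, string_of_interest_len and segment_runs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segment: buf[from..to) holds one repeated character. */
typedef struct { size_t from, to; char ch; } segment;

/* Length of the string of interest: index of the first null in
   buf[0..alloc), or alloc when the buffer holds no terminator. */
size_t string_of_interest_len(const char *buf, size_t alloc) {
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', alloc);
    return nul ? (size_t)(nul - buf) : alloc;
}

/* Split buf[0..len) into maximal runs of equal characters. At most cap
   segments are written to out; the total number of runs is returned. */
size_t segment_runs(const char *buf, size_t len, segment *out, size_t cap) {
    size_t n = 0, i = 0;
    while (i < len) {
        size_t j = i + 1;
        while (j < len && buf[j] == buf[i]) {
            ++j;
        }
        if (n < cap) {
            out[n] = (segment){ i, j, buf[i] };
        }
        ++n;
        i = j;
    }
    return n;
}
```

For an 8-byte buffer holding 'a','a','b','\0','x','x','\0','\0', the string of interest is buf[0..3) with the runs [0,2) of 'a' and [2,3) of 'b', while the bytes after the first null form the tail. A buffer in which no null is found within its allocated size is exactly the situation in which functions such as strlen would read out of bounds.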
After describing the domain and the basic operations on strings theoretically, we implemented the abstract semantics in C++ and combined it with a tool that lifts string-manipulating C code to the M-String domain. Our experimental evaluation also focused on tuning the parameters of M-String (the domains for both segment content and segment bounds), instantiating them with both concrete and symbolic characters and with symbolic (bitvector) bounds.
As future work, we plan to further enhance the effectiveness of the M-String domain by combining it, via reduced product, with other numerical or symbolic domains.