1. Introduction
C is still one of the most widely used programming languages [
1], and a large portion of systems of critical relevance are written in this language, such as server-side software and embedded systems. Unfortunately,
C programs suffer from bugs, due to the way they are laid out in memory, which malicious parties may exploit to mount attacks. Ensuring the correctness of such software is of great concern. Our main interest is guaranteeing the correctness of
C programs that manage strings, because incorrect string manipulation may lead to several catastrophic events, ranging from loss or exposure of sensitive data to crashes in critical software components.
Strings in
C are not a basic data type. As a matter of fact, strings in
C are represented by zero-terminated arrays of characters and there are libraries that provide functions which allow operating on them [
2].
C programs that manipulate strings can suffer from buffer overflows and related issues due to the possible discrepancy between the size of the string and the size of the array (buffer). A buffer overflow is a bug that affects
C code when a buffer is accessed out of its bounds. An out-of-bounds write is a particularly dangerous case of buffer overflow, whereas an out-of-bounds read is less critical as a bug. It is important to design methods supporting the automatic correctness verification of string management in
C programs for the previously mentioned reasons and also because buffer overflows are usually exploitable and can easily lead to arbitrary code execution [
3]. Existing bugs can be identified by enhancing tools for code analysis, which can also reduce the risk of introducing new bugs and limit the occurrence of costly security incidents.
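The discrepancy between string size and buffer size mentioned above can be made concrete with a small check. The following is a minimal sketch, not part of the paper's tooling; the helper names fits_in_buffer and checked_copy are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when the string s, including its
 * terminating '\0', fits into a buffer of `capacity` bytes. */
bool fits_in_buffer(const char *s, size_t capacity)
{
    assert(s != NULL);
    return strlen(s) < capacity;   /* i.e., strlen(s) + 1 <= capacity */
}

/* Hypothetical helper: copies src into dst only if it fits.
 * Returns false and leaves dst untouched otherwise. */
bool checked_copy(char *dst, size_t capacity, const char *src)
{
    if (!fits_in_buffer(src, capacity))
        return false;              /* an unchecked strcpy would write out of bounds */
    memcpy(dst, src, strlen(src) + 1);
    return true;
}
```

With a four-byte buffer, copying "bee" succeeds, while copying "bees" is rejected because its terminator would land one byte past the end of the array.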
1.1. Paper Contribution
This paper is a revised and extended version of [
4,
5]. We introduce M-String, a new abstract domain tailored for the analysis of strings in
C, whose elements:
approximate sets of C character arrays;
allow the abstraction of both shape information on the array structure and value information on the contained characters;
highlight the presence of well-formed strings in the approximated character arrays.
M-String refines the segmentation approach to array representation introduced in [
6]. M-String’s goal is to detect the presence of common string management errors that may lead to undefined behaviour and, more specifically, to buffer overflows. Moreover, keeping track of the content of the characters occurring after the first null character allows us to reduce the number of false positives. In fact, rewriting the first null character in the array is not always an error, as further occurrences of the null character may follow. M-String, like the array segmentation-based representation introduced in [
6], is parametric in two ways: both with respect to the representation of the indices of the array and with respect to the abstraction of the element values.
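To illustrate why overwriting the first null character is not always an error, consider the following concrete sketch. It is an illustration on plain C arrays, not the abstract domain itself; the helper names first_null and overwrite_first_null are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Index of the first '\0' in buf[0..size), or size if there is none. */
size_t first_null(const char *buf, size_t size)
{
    const char *p = memchr(buf, '\0', size);
    return p ? (size_t)(p - buf) : size;
}

/* Overwrites the first '\0' with c. Returns 1 if the array still
 * contains a well-formed (null-terminated) string afterwards, and 0
 * if there was no terminator or the last terminator was destroyed. */
int overwrite_first_null(char *buf, size_t size, char c)
{
    size_t i = first_null(buf, size);
    if (i == size)
        return 0;                  /* not a well-formed string to begin with */
    buf[i] = c;
    return first_null(buf, size) < size;
}
```

For the array {'a','b','\0','c','d','\0'}, the update yields the well-formed string "abxcd", whereas the same update on {'a','b','\0'} leaves no terminator inside the array. An analysis that ignores the content after the first null character would have to flag both cases.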
To provide evidence of the effectiveness of M-String, we extend
LART [
7], a tool which performs automatic abstraction on programs, so that it also supports sophisticated (non-scalar) domains such as M-String.
We extend
LART along with
DIVINE 4 [
8], an explicit state model checker based on
LLVM. This way, we can verify the correctness of operations on strings in
C programs automatically. The experimental evaluation is performed by analyzing several
C programs, ranging from quite simple to moderately complex, including parsers generated by
bison, a tool which translates context-free grammars into
C parsers. The results show the actual impact of an ad-hoc segmentation-based abstract domain on model checking of
C programs.
1.2. Paper Structure
In
Section 2, we recall the basics of abstract interpretation and introduce the array segmentation abstract domain [
6] on which M-String is based. Furthermore,
Section 3 introduces the syntax of some operations of interest.
Section 4 defines the concrete domain and semantics.
Section 5 presents the M-String abstract domain for
C character arrays and its semantics, whose soundness is formally proved. In
Section 6, we present a general approach to abstraction as a program transformation and extend it to the abstraction of program strings.
Section 7 and
Section 8 present implementation and evaluation details of M-String abstraction. In
Section 9 we discuss related work. Finally,
Section 10 concludes.
3. Syntax
Strings in the programming language C are arrays of characters, whose length is determined by a terminating null character ‘\0’. Thus, for example, the string literal “bee” has four characters: ‘b’, ‘e’, ‘e’, ‘\0’. Moreover, C supports several string handling functions defined in the standard library string.h.
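The difference between the array size and the string length of the literal "bee" can be checked directly. This is a minimal sketch; the accessor function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The literal "bee" occupies four bytes: 'b', 'e', 'e', '\0'. */
static const char bee[] = "bee";

/* Size of the array, which counts the terminating null character. */
size_t bee_array_size(void) { return sizeof bee; }

/* Length of the string, which does not count the terminator. */
size_t bee_string_length(void) { return strlen(bee); }
```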
We focus on the most significant functions in the
string.h header (see
Table 1), manipulating null-terminated sequences of characters, plus the array element access and update operations. Recall that
char,
int and
size_t are data types in
C,
const is a qualifier applied to the declaration of any variable which specifies the immutability of its value, and
*str denotes that
str is a pointer variable.
strcat appends the null-terminated string pointed to by str2 to the null-terminated string pointed to by str1. The first character of str2 overwrites the null-terminator of str1 and str2 should not overlap str1. The string concatenation returns the pointer str1.
strchr locates the first occurrence of c (converted to a char) in the string pointed to by str. The terminating null character is considered to be part of the string. The string character function returns a pointer to the located character, or a null pointer if the character does not occur in the string.
strcmp lexicographically compares the string pointed to by str1 to the string pointed to by str2. The string compare function returns an integer greater than, equal to, or less than zero, accordingly as the string pointed to by str1 is greater than, equal to, or less than the string pointed to by str2.
strcpy copies the null-terminated string pointed to by str2 to the memory pointed to by str1. str2 should not overlap str1. The string copy function returns the pointer str1.
strlen computes the number of bytes in the string to which str points, not including the terminating null byte. The string length function returns the length of str.
An array element is accessed by indexing the array name. Let i be an index; the i-th element of the character array str is accessed by str[i]. On the other hand, a character array element is updated (or an assignment is performed to a character array element) by str[i] = ‘x’, where ‘x’ denotes a character literal.
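The operations described above can be exercised together in a short sketch. The functions make_pair and value_of are hypothetical helpers written for this illustration; only the string.h calls come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Builds "key=value" in out, a buffer of `capacity` bytes.
 * Returns the resulting length, or 0 if the result would not fit. */
size_t make_pair(char *out, size_t capacity, const char *key, const char *value)
{
    size_t needed = strlen(key) + 1 + strlen(value) + 1;  /* '=' and '\0' */
    if (needed > capacity)
        return 0;
    strcpy(out, key);      /* copies key including its terminator */
    strcat(out, "=");      /* overwrites the terminator, appends a new one */
    strcat(out, value);
    return strlen(out);
}

/* Returns a pointer to the value part of "key=value", or NULL
 * if the string contains no '=' character. */
const char *value_of(const char *pair)
{
    const char *eq = strchr(pair, '=');
    return eq ? eq + 1 : NULL;
}
```

After a successful call, individual characters can be read with out[i] and replaced with an assignment such as out[0] = 'L', which are the access and update operations discussed above.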
As mentioned in
Section 1,
C does not guarantee bounds checking on array accesses and, in case of strings, the language does not ensure that the latter are null-terminated. As a consequence, improper string manipulation leads to several vulnerabilities and exploits [
15]. For instance, if non null-terminated strings are passed to the functions above, the latter may return misleading results or read out of the array bound. Moreover, since
strcat and
strcpy do not allow the size of the destination array
str1 to be specified, they are frequent sources of buffer overflows.
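One common defensive pattern, shown here only as a sketch, is to pass the destination capacity explicitly and to verify both that the destination is null-terminated and that the result fits. The function checked_concat is hypothetical and is not part of string.h; it assumes that src is a well-formed string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Appends src to the string stored in dst, a buffer of `capacity`
 * bytes. Refuses to write when dst holds no terminator within its
 * capacity or when the concatenation would not fit. */
bool checked_concat(char *dst, size_t capacity, const char *src)
{
    const char *end = memchr(dst, '\0', capacity);
    if (end == NULL)
        return false;                       /* dst is not null-terminated */
    size_t used = (size_t)(end - dst);      /* used < capacity holds here */
    size_t extra = strlen(src);
    if (extra >= capacity - used)
        return false;                       /* used + extra + 1 > capacity */
    memcpy(dst + used, src, extra + 1);     /* copies the terminator too */
    return true;
}
```

With char buf[8] = "bee", appending "hive" fills the buffer exactly, and any further non-empty append is rejected instead of overflowing.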
6. Program Abstraction
Adapting M-String to the analysis of real-world C programs requires, first of all, a procedure that identifies string operations automatically. A subset of such operations then has to be performed using abstract operations, carried out on a suitable abstract representation. The technique that captures this approach is known as abstract interpretation. A typical implementation is based on an interpreter in the programming language sense: it executes the program by directly performing the operations written down in the source code. However, rather than using concrete values and concrete operations on those values, part (or the entirety) of the computation is performed in an abstract domain, which over-approximates the semantics of the concrete program.
In this paper, we mainly focus on string abstraction. Therefore we will interpret the portions of the program that do not make use of strings without abstracting values. We only apply abstraction to strings that within the program are manipulated by string operations: when the program deals with string variables that exhibit minimal variation, e.g., string literals, the M-String representation would provide no benefit, and instead it could either hurt performance or it may introduce spurious counterexamples.
Based on the considerations above, it is clear that it is beneficial to reuse and refactor existing tools that implement abstract verification in a modular way on explicit programs. A compilation-based abstraction design that follows this approach was introduced and implemented in [
7]. However, such a tool is designed to abstract scalar values only. This is why we need to extend it to operate with more sophisticated domains that represent more complex objects, such as strings.
In the rest of this section, we will first summarize the general approach to abstraction as a program transformation. In
Section 6.3, we explore the implications of aggregate (as opposed to scalar) domains within this framework.
Section 6.4 and
Section 6.5 then go on to discuss the semantic (run-time) aspects of the abstraction and which operations we consider as primitives of the abstraction.
6.1. Compilation-Based Approach
Instead of (re-)interpreting instructions abstractly, in a compilation-based approach, abstract instructions are transformed into equivalent explicit code that implements the abstract computation. The transformation takes place before the analysis of the program (e.g., model checking) during the compilation process.
Consequently, the analysis processes the program without needing special knowledge of the abstract domains in use, as the abstraction is encoded directly in the program.
Figure 1 depicts a comparison of the compilation-based approach with the interpretation-based approach adopted by more conventional abstract interpreters.
In a compilation-based approach, two different abstraction perspectives are considered:
static, referring to the syntax and the type system,
dynamic, or semantic, referring to execution and values.
The
LART tool performs syntactic (
static) abstraction on
LLVM bitcode [
16]. Syntactic abstraction replaces some of the
LLVM instructions that occur in the program with their abstract counterparts, as depicted in
Figure 2.
6.2. Syntactic Abstraction
The first step of program abstraction performed by
LART is a syntactic abstraction. Syntactic abstraction replaces
LLVM instructions or whole functions with their abstract counterparts. Since we do not want to perform all operations abstractly, we need to identify those operations that might obtain abstract values as their arguments. The abstract values emerge in the program as input values. From these values,
LART computes all operations that might come into contact with abstract values using a combination of data flow and alias analyses. Finally, as a result of analyses,
LART obtains a set of possibly abstract operations that are replaced by their abstract equivalents, e.g.,
strcat,
strlen are replaced by
abstract_strcat and
abstract_strlen. Abstract operations then implement the manipulation of abstract values, in our case M-Strings, as described in
Section 4; in other words, the specific meaning of abstract instructions and abstract values defines the semantic abstraction.
For the precise formulation of syntactic abstraction, we take advantage of the static type system of LLVM. We leverage the fact that we can assign to each variable its type, which is either concrete or abstract. In this way, we can precisely set a boundary between concrete and abstract values.
Let us consider a simplified version of LLVM. It defines a set of concrete scalar types $S$. The set of all possible types is given by a map $\Gamma$ that inductively defines all finite (non-recursive) algebraic types over the set of given scalars. To be precise, the set of all possible types $\Gamma(T)$ derived from a set of scalars $T$ is defined as follows:
$T \subseteq \Gamma(T)$, meaning each scalar type is included in $\Gamma(T)$,
if $t_1, \dots, t_n \in \Gamma(T)$ then also the product type is in $\Gamma(T)$: $(t_1, \dots, t_n) \in \Gamma(T)$,
if $t_1, \dots, t_n \in \Gamma(T)$ then also the disjoint union is in $\Gamma(T)$: $t_1 \mid \dots \mid t_n \in \Gamma(T)$,
if $t \in \Gamma(T)$ then $t^{*} \in \Gamma(T)$, where $*$ denotes the pointer type.
In a concrete LLVM program, the set of admissible types comprises those derived from concrete scalars $S$, i.e., $\Gamma(S)$. In syntactic abstraction, we need to extend the admissible types by abstract types. From these, we generate all possible types using $\Gamma$. Depending on the type of abstraction, we use a different set of basic abstract types. In the case of scalar abstraction, the set of basic abstract types contains abstract scalar types $\widehat{S}$. Correspondence between abstract and concrete scalars is given by a bijective map $\Lambda : S \to \widehat{S}$. Finally, each value that exists in the abstracted program has an assigned type from $\Gamma(S \cup \widehat{S})$. Specifically, this implies that the abstraction works with mixed types: products and unions might contain both concrete and abstract fields. Moreover, it is possible to create pointers to both abstract and mixed values.
6.3. Aggregate Domains
In addition to scalar values that cannot be further decomposed, programs typically operate with more complex data which can be seen as compositions (aggregates) of multiple scalar values. Depending on their nature, we can classify aggregates as arrays, which contain a variable number of items, or records, which contain a fixed number of items in a fixed layout, where each item can be of a different type. The items in such aggregates can be (and often are) scalars. However, more complex aggregates are also possible: arrays of records, records which in turn contain other records, and so on.
While scalar domains only deal with simple values, in aggregate abstraction we consider composite data in the spirit of the above definition. Similarly to scalar domains, abstract aggregate domains approximate concrete aggregate values by describing a particular set of aggregate properties. For example, we can describe a set of aggregates by their length or by the set of values that appear in the aggregate. In M-String, the kept properties are in the form of a segmentation, where segments are further abstracted by bounds and characters. Values in an aggregate domain then keep the representation of the chosen properties, and operations update them. For instance, consider an array length property domain: the domain operations in such a case operate only with lengths of arrays, e.g., the abstract concat of arrays adds together the lengths of its arguments (abstract arrays).
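The array length property domain mentioned above can be sketched in a few lines. The interval representation and all names below (abs_len, lift_len, join_len, abstract_concat, may_have_len) are assumptions made for this illustration; the text only states that the abstract concat adds the lengths of its arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Abstract array: all concrete arrays whose length lies in [lo, hi].
 * The interval shape is an assumption of this sketch. */
typedef struct { size_t lo, hi; } abs_len;

/* Abstraction of a single concrete array of length n. */
abs_len lift_len(size_t n) { return (abs_len){ n, n }; }

/* Least upper bound of two abstract arrays. */
abs_len join_len(abs_len a, abs_len b)
{
    return (abs_len){ a.lo < b.lo ? a.lo : b.lo,
                      a.hi > b.hi ? a.hi : b.hi };
}

/* Abstract concat: adds the lengths of its arguments
 * (arithmetic overflow is ignored in this sketch). */
abs_len abstract_concat(abs_len a, abs_len b)
{
    return (abs_len){ a.lo + b.lo, a.hi + b.hi };
}

/* True when some array described by a may have length n. */
bool may_have_len(abs_len a, size_t n) { return a.lo <= n && n <= a.hi; }
```

Concatenating an array of length 2 or 5 with an array of length 3 yields the interval [5, 8]. The result also admits lengths 6 and 7, which shows the over-approximation introduced by the join.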
In general, aggregate domains can provide arbitrary operations. However, two operations are, in some sense, universal, being elementary memory manipulation operations, namely: byte-wise
access and
update of the aggregate. The universality of these operations originates from the fact that all aggregate operations can be represented as accesses and updates. In a low-level representation of a program (assembly), they usually are presented in this form.
LLVM allows a slightly higher level of manipulation to access and update individual scalars present in the aggregates (as opposed to bytes). For M-String, though, this distinction is not essential because the scalars stored in
C strings are individual bytes (characters). All other operations are present in the form of sequences of elementary instructions, possibly encapsulated in functions. Moreover, just as
access and
update form the interface between scalars and memory in concrete programs, in the abstraction they form the interface between scalar and aggregate domains (even in the case of byte-oriented access, since bytes are also scalars). We refer the reader to
Section 4.3.1 for the abstract semantics of
access, and to
Section 4.3.7 for the abstract semantics of
update.
In comparison to scalar abstraction, the syntactic abstraction of aggregates does not operate directly with aggregate types. In LLVM, aggregate values are usually represented by a pointer to the underlying aggregate type. Therefore, all the accesses and updates are made through the pointers to the aggregates. For instance, strings are represented as a pointer to a character array. We need to take this fact into account when we perform the syntactic abstraction. In the analysis, we consider the pointers to aggregates as base types for the abstraction. In the case of arrays, the base types are concrete pointers to those arrays: let us call them $P$, where $P \subseteq \Gamma(S)$ is a subset of the types generated from the concrete scalars $S$. A set of abstract pointer types $\widehat{P}$ then describes the types of abstracted aggregates (arrays). As for scalar domains, we define a natural correspondence between pointers to concrete values and abstract aggregates as a bijective map $\Lambda_P : P \to \widehat{P}$. For instance, in the case of M-String abstraction, the map assigns to char* the type of an M-String value. Finally, we allow all the mixed types generated from scalars and abstract aggregates: $\Gamma(S \cup \widehat{P})$.
Observe that pointers in general, and also in LLVM, maintain two pieces of information about a memory location: they represent both the memory object and an offset into that object. In particular, our implementation treats the first 32 bits of the pointer as an object identifier and the last 32 bits as its offset. This distinction is not very relevant in explicit programs because those two components are represented in a uniform way in a single value and often they cannot be distinguished at all. However, the distinction becomes relevant when dealing with abstract aggregate values. In fact, in this case, the object component of the pointer is concrete as it determines a single specific abstract object. On the other hand, the offset component may or may not be concrete. The choice depends on the specific abstract aggregate domain: it may be more advantageous to represent the offset in an abstract way, i.e., by a 32-bit abstract scalar value. Observe that a memory access through such a pointer needs to be treated in both cases as an abstract access or update operation.
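The split of a pointer into an object identifier and an offset can be sketched with plain 64-bit integers. This is an illustration only: it assumes that "the first 32 bits" are the most significant half, and the names split_ptr, pack_ptr, unpack_ptr and ptr_add are hypothetical rather than taken from the DIVINE or LART sources.

```c
#include <assert.h>
#include <stdint.h>

/* The two components of a pointer, kept apart. */
typedef struct { uint32_t object; uint32_t offset; } split_ptr;

/* Packs the object identifier into the upper 32 bits and the
 * offset into the lower 32 bits (layout assumed for this sketch). */
uint64_t pack_ptr(uint32_t object, uint32_t offset)
{
    return ((uint64_t)object << 32) | offset;
}

/* Recovers both components from a packed pointer. */
split_ptr unpack_ptr(uint64_t p)
{
    return (split_ptr){ (uint32_t)(p >> 32), (uint32_t)(p & 0xFFFFFFFFu) };
}

/* Pointer arithmetic only changes the offset; the object is kept. */
uint64_t ptr_add(uint64_t p, uint32_t delta)
{
    split_ptr s = unpack_ptr(p);
    return pack_ptr(s.object, s.offset + delta);
}
```

In an abstract setting, the object field would stay a concrete identifier, while the offset field could be replaced by a value from the bound domain.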
In LLVM, two basic memory access operations are defined—load and store, corresponding to the access and update operations. It is important to notice that memory access is always explicit: memory is never used in a computation directly. This observation is used in the design of aggregate abstraction, where we can assume that the access to the content of an aggregate will always go through a pointer associated with the abstract object.
6.4. Semantic Abstraction
In syntactic abstraction, we dealt with the syntax of operations, their types, and the types of values and variables. It described how LART performs a source-to-source transformation. In contrast, semantic abstraction is concerned with the values computed at runtime by a program. It defines how abstract operations modify values and how to transfer between concrete and abstract values. Therefore, similarly to syntactic abstraction, which defined maps to transfer between concrete and abstract types, semantic abstraction makes use of lift and lower (cf. Definitions 9 and 10): operations (instructions) converting values between their concrete and abstract representations. They realize a runtime implementation of the domain functions of abstraction and concretization.
The
lift operation implements abstraction of concrete values by a single over-approximating abstract value. For example, in
Figure 2 on line 3 of the abstracted program, a concrete string
b is lifted to the abstract domain. This allows performing
abstract_strcat in a single abstract domain. In other words, operations do not need to consider concrete values because all their arguments are lifted to the abstract domain. This simplifies the implementation of a domain and reduces the number of possible domain interactions. In comparison to
the type map of Section 6.2, which was a purely syntactic construct,
lift and
lower accomplish actual conversion of values between domains during program runtime. During program execution, lowering an abstract value into multiple concrete values can be seen as nondeterministic branching in the program and the
lower operator is indeed based on a non-deterministic choice operator. In a model checker, the non-deterministic choice would typically be implemented as branching in the state space, and the consequences of all possible outcomes would be explored. In a testing context, however, the choice might be implemented randomly, by choosing one particular path. For further details of the program transformation performed by
LART, we kindly refer the reader to [
7].
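The roles of lift and lower can be sketched on a simple interval domain of integers. This stand-in domain and all names below are assumptions made for illustration; they do not reproduce the LART interface. The callback plays the part of state-space branching in a model checker: every concrete value described by the abstract value is explored.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in abstract value: the set of integers in [lo, hi]. */
typedef struct { int lo, hi; } abs_int;

/* lift: a concrete value becomes a single abstract value. */
abs_int lift(int v) { return (abs_int){ v, v }; }

/* join: least upper bound of two abstract values. */
abs_int join(abs_int a, abs_int b)
{
    return (abs_int){ a.lo < b.lo ? a.lo : b.lo,
                      a.hi > b.hi ? a.hi : b.hi };
}

/* One explored branch receives one concrete value. */
typedef void (*branch_fn)(int concrete, void *ctx);

/* lower: a non-deterministic choice among all concrete values,
 * modelled here by visiting each of them in turn. Returns the
 * number of branches explored. */
size_t lower(abs_int a, branch_fn explore, void *ctx)
{
    size_t branches = 0;
    for (long long v = a.lo; v <= a.hi; ++v) {
        explore((int)v, ctx);
        ++branches;
    }
    return branches;
}

/* Example branch body: accumulates the values seen on all branches. */
static void sum_branch(int v, void *ctx) { *(long long *)ctx += v; }
```

Lowering the join of lift(1) and lift(4) explores four branches, one for each of 1, 2, 3 and 4. A testing tool could instead pick a single value at random.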
6.5. Abstract Operations
As a result of syntactic abstraction, we obtain a program that temporarily contains abstract operations. These operations take abstract values as operands and return abstract values as a result. However, after the program transformation, the resulting program is required to be semantically valid LLVM bitcode. Therefore, we demand that each abstract operation can be realized as a sequence of concrete instructions. This allows us to obtain an abstract program that does not contain any abstract operations and to execute it using standard (concrete, explicit) methods.
More precisely, syntactic abstraction substitutes concrete operations with their abstract counterparts: an operation over concrete types is substituted by an abstract operation over the corresponding abstract types. Furthermore, the transformation inserts lift and lower operations as needed, e.g., in places where concrete values are operands of abstract operations. The implementation is free to select the operations to be abstracted and where value lifting and lowering are inserted, so long as type constraints are satisfied. However, it tends to minimize the number of abstracted operations.
In addition to
LLVM instructions, the M-String abstraction requires the transformation to abstract function calls to standard library functions such as
strcmp,
strcat. From the perspective of syntactic abstraction, we can treat function calls as single atomic operations that take abstract values and produce abstract results. Hence, the transformation substitutes them in the same way as instructions: for instance
strcmp operation of type
$(m, m) \to s$ is replaced by
abstract_strcmp of the corresponding abstract type,
where
m is a concrete character array and
s is a concrete scalar result of the string comparison. Afterwards, all abstract operations are implemented by using concrete subroutines (implementation of abstract semantics). For details, see [
7].
Observe that, as an alternative approach, the standard library functions strcat, strcmp, etc. could have been transformed instruction by instruction, using only the abstract access and update of the content. However, the price to pay would have been losing a certain degree of accuracy in the abstraction, the exact amount depending on the individual operation.
7. Instantiating M-String
As an aggregate domain, M-String is parametrizable by scalar domains of characters and indices (bounds). This allows us to tailor the abstraction to the needs of the analysis of string values. Depending on the precision of the chosen domains, the instance of the M-String domain will inherit their properties. With more precise domains, the M-String values will maintain a higher granularity of segmentation. On the other hand, a simpler character representation will decrease the segmentation granularity at the cost of a higher rate of false alarms.
A particular instance of M-String is automatically derived from a parametric description given in
Section 5, provided a suitable scalar domain
for characters and a suitable scalar domain to represent segment bounds. The instantiation demands that both of these scalar domains are equipped with the operations used by the operations on the segmentation. These are mainly elementary arithmetic and relational operations. In the implementation, we provide an M-String domain template that automatically derives all the operations from the provided scalar domains.
7.1. Symbolic Scalar Values
In program verification, it is common practice to represent certain values symbolically (for instance, inputs from the environment). The symbolic representation allows the verifier to consider all admissible values with a reasonably small overhead. In
DIVINE, symbolic verification is implemented using an abstraction similar to the one described in the previous section: symbolic scalar values represent their content by SMT formula expressions (terms) in the form of abstract syntax trees. The input values are represented as unconstrained variables in the bit-vector logic. Operations then build formula trees from their arguments. In addition to these so-called data definitions, the symbolic representation also maintains one global formula of constraints (the path condition), which is derived from the control flow of the program. A more detailed description of this symbolic representation is presented in [
7].
The domain of symbolic values (we call it a term domain) requires DIVINE to be augmented with an SMT solver for a suitable theory. For scalars in C programs, we use the bit-vector theory. DIVINE uses the solver to detect computations that have reached the bottom of the term domain (those are the infeasible paths through the program). Furthermore, as a model checker, it needs to identify equal states or whether one state subsumes another. This is achieved by an equivalence check of the corresponding formulae. With these prerequisites, the symbolic representation in conjunction with the bit-vector theory is a precise abstraction (i.e., it is not an approximation but models the program state faithfully).
7.2. Concrete Characters, Symbolic Bounds
In the evaluation, we instantiate the M-String domain in two ways. The first, simpler instantiation sets the domain of characters to be the concrete domain (i.e., we let the characters be represented by themselves). We let the domain of segment bounds be symbolic 32-bit integers. This instantiation balances simplicity on the one hand (both domains we used for parameters were already present in DIVINE) against the ability to describe strings with undetermined length and structure on the other.
At the implementation level (as described in more detail in the following section), the domain remains generic: the particular domains we picked can be easily substituted by other domains. Compared to the theoretical description of M-String, the implementation uses a slightly simplified representation of segmentation by a pair of arrays (cf.
Figure 3). The elements of these arrays are characters and bounds, whose types are derived from the parametrization, i.e., from the two scalar domains. The modification of the representation is just an optimization for the implementation and does not affect the semantics of the operations. The analysis with this representation is presented in Example 14.
This instantiation of M-String is particularly suitable for representing strings with sequences of a single character of variable length, i.e., strings of the form $c_1^{k_1} c_2^{k_2} \cdots c_n^{k_n}$,
where relationships between the repetition counts $k_1, \dots, k_n$
can be specified using standard arithmetic and relational operators and each of $c_1, \dots, c_n$
is a concrete letter. This, in turn, allows M-String to be used for the analysis of program behavior on broad classes of input strings described this way. A more detailed description of this approach can be found in
Section 8.
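The pair-of-arrays representation can be sketched as follows. This is a simplified illustration with concrete bounds standing in for symbolic ones; the struct layout (n segments described by n characters and n + 1 bounds) and the names mstring_sketch, mstr_access and mstr_strlen are assumptions of this sketch, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGMENTS 8

/* Segment s holds chars[s] at every index in
 * bounds[s] .. bounds[s + 1] - 1 (layout assumed for this sketch). */
typedef struct {
    size_t count;                     /* number of segments */
    char   chars[MAX_SEGMENTS];       /* one character per segment */
    size_t bounds[MAX_SEGMENTS + 1];  /* count + 1 non-decreasing bounds */
} mstring_sketch;

/* access: stores the character at index i in *out and returns 1,
 * or returns 0 when i lies outside all segments (out of bounds). */
int mstr_access(const mstring_sketch *m, size_t i, char *out)
{
    for (size_t s = 0; s < m->count; ++s) {
        if (m->bounds[s] <= i && i < m->bounds[s + 1]) {
            *out = m->chars[s];
            return 1;
        }
    }
    return 0;
}

/* strlen: the lower bound of the first non-empty segment of '\0'.
 * Returns 0 when the array contains no terminator at all. */
int mstr_strlen(const mstring_sketch *m, size_t *len)
{
    for (size_t s = 0; s < m->count; ++s) {
        if (m->chars[s] == '\0' && m->bounds[s] < m->bounds[s + 1]) {
            *len = m->bounds[s];
            return 1;
        }
    }
    return 0;
}
```

For instance, characters {'a', 'b', '\0'} with bounds {0, 3, 5, 6} describe the array "aaabb" followed by one terminator, so the string length is 5 and index 6 is out of bounds.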
Example 14. Simple program analysis with symbolic bounds and concrete characters:
Imagine we are given symbolic bounds , then the first line of the transformed program creates an mstring value with characters and bounds . In the following, we describe mstring values as pairs of these two arrays. The second line creates a symbolic index of arbitrary value. On line 3, the program constrains the index to be smaller than the maximal length of the mstring. Otherwise, the update on the next line would yield an error. Next, the program assigns the character y to the position given by the abstract index. The assignment is implemented as an update operation on the mstring value. Depending on the value of the , the operation results in the following strings , as result we join all possibilities:
if falls to the first segment: and creates a new segment between and containing character . Notice that if the first segment is empty, similarly the third segment for . The string of interest for is of form .
if then , with the string of interest being the join of the following forms:
if the update is performed right after the first segment, i.e., :
if and , i.e., the segment of zeros contains more elements, then the string has form ,
otherwise the update overwrites the single zero character, hence extends the string of interest by segment of characters: .
otherwise between first segment and is a terminating zero, hence the string of interest remains unchanged: .
if then , because the update stores the same character that is already present in the segment.
if then the update creates a new segment inside the sequence of trailing zeros: .
Consequently, the operation on the last line of the program computes the join of all possible lengths of strings of interest, i.e., .
7.3. Symbolic Characters, Symbolic Bounds
The second instantiation is used in benchmarks where the computation with M-String values encounters abstract scalars (characters). This occurs when the program obtains some character as input from the environment and tries to store it into the M-String value. Therefore, we instantiated the M-String domain with an abstract representation of characters by setting the domain to be the term domain, which keeps track of symbolic 8-bit bit-vectors (characters in the C language). In this way, we do not need to lower abstract characters before storing them in the M-Strings, which was needed for the concrete domain used in the previous instantiation. However, we pay the price of more expensive computation with symbolic characters.
7.4. Implementation
Finally, we implemented the M-String abstraction as a
LART domain. The implementation, with examples and documentation of domain usage, can be found online on the supplementary page
https://divine.fi.muni.cz/2020/mstring. The
LART domain is a
C++ library that implements abstract semantics of M-String operations presented in
Section 5. Such a library is then linked to the transformed program allowing the program to perform abstract analysis with model-checker
DIVINE. An abstract domain definition in
LART consists of a
C++ class that describes both the representation (in terms of data) and the operations (in terms of code) of the abstract domain.
In the case of the M-String domain, this class contains two attributes: an array of
bounds and an array of
characters, as outlined in
Section 7.2 and depicted in
Figure 3. The class has two type parameters: the domain to use for representing segment bounds and the domain to represent individual characters (i.e., the content of segments). A specific instantiation is then automatically derived by the
C++ compiler from the classes which represent the type parameters and the parametric class which represents M-String values.
As a minimal set of operations, the M-String domain implements all requisite aggregate operations: these are
lift,
update and
access. Furthermore, the implementation provides an optimized version of string operations described in
Section 5:
strlen,
strcpy,
strcat,
strcmp and
strchr. These operations reduce the loss of abstraction precision that would arise if only the abstraction of accesses and updates from strings were used.
Since C strings are stored, in fact, as shared, mutable character arrays, the implementation of the M-String domain needs to reflect the sharing semantics of such arrays. If multiple pointers exist into the same abstract string, modifications through one such pointer must be also visible when the string is accessed through another pointer. Moreover, the pointers do not have to be equal: they may point to different suffixes of the same string. Therefore, the representation of pointers to abstract strings must treat the
object and the
offset components separately (see also
Section 6.3), and the representation of the
offset component must be compatible with the bound domain.
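The sharing requirement can be sketched with a pointer that stores its object and its offset separately. The types `Buffer` and `StrPtr` are hypothetical, a plain character vector stands in for the abstract string, and `std::size_t` stands in for a value of the bound domain.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for an abstract string object (hypothetical name).
struct Buffer {
    std::vector<char> data;
};

// A pointer into a string: a shared object plus an offset into it.
struct StrPtr {
    std::shared_ptr<Buffer> object; // shared by all pointers into the string
    std::size_t offset;             // would be a bound-domain value in M-String

    // Pointer arithmetic changes only the offset, never the object.
    StrPtr operator+(std::size_t n) const { return {object, offset + n}; }

    char load() const {
        assert(offset < object->data.size());
        return object->data[offset];
    }

    void store(char c) const {
        assert(offset < object->data.size());
        object->data[offset] = c;
    }
};
```

A write through a pointer to a suffix is then observed through any other pointer into the same object, because both refer to the same `Buffer`.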
8. Experimental Evaluation
In the evaluation, we chose a few scenarios to demonstrate the properties of the abstraction. In the first scenario, we show that using abstract versions of standard functions is more efficient than transforming their concrete versions using only abstract string accesses and updates. The second scenario investigates several implementations of standard library functions: we transform them automatically in terms of accesses and updates, and we show that their results agree with the results generated by M-String library operations. In the third scenario, we evaluate the M-String instantiation with symbolic characters on a set of benchmarks from real software that contain buffer-overflow errors. Here we show that M-String can efficiently detect real-world bugs as well as prove that the program no longer contains them after they are fixed. The last scenario shows the use of abstractions on more complex C programs. As an example, we analyze parsers automatically generated by the bison and flex tools on abstract (M-String) inputs. The resource limits for all scenarios were the same: each verification run was limited to 4 processing units (cores), 80 GB of memory, and 1 hour of CPU time. The processor used to run the benchmarks was an AMD EPYC 7371 clocked at 2.60 GHz.
8.1. M-String Operations
The first group of benchmarks focuses on the use of resources by abstraction. Benchmarks compare the effectiveness of abstract domain operations with the automatically abstracted implementation of standard library functions from
PDCLib, a public-domain
libc implementation, using only essential abstract operations:
lift,
update and
access. The results depicted in
Table 2 were measured with parametrized M-String inputs of two kinds (l is a parametric length of the input):
Word w is a string of the form: where and is an arbitrary character from domain .
Sequence w is a string of the form , where and c is a character from domain .
For each standard library function and input type, we created an isolated benchmark in two variants: one using an abstract semantics of M-String operations (see
Table 2) and the other variant (
Table 3) only with an automatic abstraction of essential aggregate operations.
The first notable difference between the automatically abstracted implementations of library functions and the M-String operations is that the analysis of the former times out for input strings longer than 64 characters. The main cause of the lifted implementations' inefficiency is that they have to iterate over all characters, while M-String operations iterate over whole segments. This difference also causes a blow-up of the model checker's state space for the lifted implementations, while the state space size does not change for M-String operations. The reason is that the number of segments does not change with the length of the input. Therefore, M-String operations always perform the same computation independently of the M-String length.
8.2. C Standard Libraries
In the second set of benchmarks (see
Table 4), we investigate whether the implementations from several standard libraries match the expected results of the abstract implementation. In other words, we perform an equivalence check of the results obtained from M-String operations against the results of the automatically abstracted (originally concrete) standard library functions. We expect both to give the same results. For the evaluation, we picked three open-source libraries:
PDCLib,
musl-libc and
uClibc. Since results for the libraries are rather similar, we present here only an evaluation of
PDCLib functions. The remaining results are provided in the
Supplementary Material. All benchmarks showed that our implementation matches the standard one.
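The idea of checking that a segment-level operation agrees with a concrete library function can be illustrated by plain differential testing on small inputs. This sketch is not the model-checking setup used in the benchmarks: it enumerates concrete segment lengths, expands each segmentation into the buffer it denotes, and compares a hypothetical `seg_strlen` with `std::strlen`. All names other than `std::strlen` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical concrete segmentation: bounds[i] and bounds[i + 1] delimit
// segment i, and every position in segment i holds chars[i].
struct Segmented {
    std::vector<std::size_t> bounds;
    std::vector<char> chars;
};

// Expand the segmentation into the single concrete buffer it denotes.
std::vector<char> concretize(const Segmented& s) {
    std::vector<char> out;
    for (std::size_t i = 0; i < s.chars.size(); ++i)
        out.insert(out.end(), s.bounds[i + 1] - s.bounds[i], s.chars[i]);
    return out;
}

// Segment-level length: lower bound of the first non-empty '\0' segment.
// Precondition: the buffer contains a terminator.
std::size_t seg_strlen(const Segmented& s) {
    for (std::size_t i = 0; i < s.chars.size(); ++i)
        if (s.chars[i] == '\0' && s.bounds[i] < s.bounds[i + 1])
            return s.bounds[i];
    assert(false && "precondition violated: no terminator in the buffer");
    return 0;
}

// Compare both implementations on every two-segment word with segment
// lengths up to max_len, including words with empty segments.
bool agree_up_to(std::size_t max_len) {
    for (std::size_t l1 = 0; l1 <= max_len; ++l1) {
        for (std::size_t l2 = 0; l2 <= max_len; ++l2) {
            Segmented s{{0, l1, l1 + l2, l1 + l2 + 1}, {'a', 'b', '\0'}};
            std::vector<char> buf = concretize(s);
            if (seg_strlen(s) != std::strlen(buf.data()))
                return false;
        }
    }
    return true;
}
```

Unlike this enumeration, the equivalence check in the benchmarks covers abstract inputs in a single verification run, which is what exposes the state space growth discussed next.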
As in the previous case, these benchmarks suffer from the state space blow-up caused by an exponential number of possible character combinations. For this reason, we decreased the size of the input strings. In addition to the large state space, the many string accesses and updates of the concrete implementations result in large SMT formulae, causing a long time spent in solvers.
Furthermore, notice that the analysis with Word input, which has more segments, results in longer execution times than the analysis with Sequence. The reason is that a larger number of segments naturally causes overhead for the analysis. For example, M-String needs to consider cases in which some segments have zero length: this causes hard SMT queries because, in the worst case, the analysis needs to check all possible strings for the given segment bounds and characters.
8.3. Veriabs Overflow Benchmarks
In this scenario (see
Table 5), we show that the domain is capable of efficient overflow bug finding. The Veriabs benchmarks comprise real-world software exhibiting overflow errors, together with fixed variants. To soundly prove the correctness of these benchmarks, we instantiate M-String with the term domain also for characters. Hence we can reason about arbitrary strings of a symbolic length. However, a drawback of this instantiation is that whenever the length of the string bounds a loop, the analysis may have to unroll the loop indefinitely; these cases time out in the correct benchmarks.
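As a reminder of the class of bug targeted here, the sketch below states the overflow condition for a string copy on concrete values, together with a guarded variant. It is an illustrative example and is not taken from the Veriabs benchmarks; `copy_would_overflow` and `checked_copy` are hypothetical names. With a symbolic string length, the same comparison becomes a query to the solver instead of a concrete test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when copying src, including its terminator, would write past the
// end of a destination buffer of dst_capacity bytes.
bool copy_would_overflow(std::size_t dst_capacity, const char* src) {
    return std::strlen(src) + 1 > dst_capacity;
}

// Guarded variant: performs the copy only when it fits, otherwise reports failure.
bool checked_copy(char* dst, std::size_t dst_capacity, const char* src) {
    if (copy_would_overflow(dst_capacity, src))
        return false;
    std::memcpy(dst, src, std::strlen(src) + 1); // characters plus terminator
    return true;
}
```

An unguarded `strcpy` into a 4-byte buffer is safe for `"abc"` but writes out of bounds for `"abcd"`, since the terminator needs a fifth byte.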
8.4. Parsers
Lastly, we evaluate our implementation on more complex programs: automatically generated parsers. To generate them, we use
Bison. It reads a language specification in the form of a context-free grammar and produces a C parser that accepts the language. In the benchmarks, we generate two such parsers. The first one accepts a language of numerical expressions (mathematical expressions that consist of numbers and binary operators). The second parser is for a simple programming language with variables and branching. We present an evaluation of both parsers in
Table 6. As with the previous benchmark sets, the M-String inputs with a smaller number of segments outperformed other analyses. In these benchmarks, we use specifically hand-crafted M-String inputs for the parsers. For the parser of mathematical expressions, these were:
addition had the form of two arbitrary numbers with a plus sign between them,
ones was a simple input of a single digit sequence, and lastly,
alternation was an input that produced complicated M-Strings by alternating digits inside expressions. The other parser, for the simple programming language, was evaluated on:
value was an input that created a variable and assigned a constant to it,
loop was a short program with some control flow and
wrong was a program that contained a syntax error.
9. Related Work
Static methods tailored to automatically identify buffer overflows have been extensively studied in the literature and several inference techniques were proposed and implemented: tainted data-flow analysis, constraint solvers for various theories (including string theories) and techniques based on them (e.g., symbolic execution), annotation analysis or string pattern matching analysis [
17]. Many of these techniques have been implemented in bug-hunting tools based on static analysis [
18,
19,
20,
21,
22,
23].
For instance, in [
24] the authors introduced a backward-compatible method of bounds checking for
C programs, which leaves the representation of pointers unchanged, allowing inter-operation between checked and unchecked code, with recompilation confined to the modules where problems might occur. This feature distinguishes the proposed scheme from earlier techniques. In [
20] the static verifier of
C strings CSSV is introduced. Contracts are supplied to the tool, which works in four stages, reducing the problem of checking code that manipulates strings to checking code that manipulates integers. Finally, Splat, described in [
25], is a tool that automatically generates test inputs, symbolically reasoning about lengths of input buffers.
Static code analysis aims at approximating possible behaviours of a program without examining all of its (possibly infinite) actual executions. By a proper abstraction of data and operations, static analysis results in an over-approximation of all the possible runs of a program, and its effectiveness heavily depends on the degree of precision of such an abstraction. In particular, the framework of abstract interpretation [
9] can also be adopted to approximate the semantics of string operations. A basic, well-known domain is the
string set domain, which simply keeps track of a set of strings and is a specific instance of the general (bounded) set domain. Others are the
prefix-suffix domain (which captures the first and the last letter of a string) and the
character inclusion domain (which only tracks the characters that surely or maybe appear in a string). Another general-purpose string domain is the
string hash domain proposed in [
26], based on a distributive hash function. More complete reviews of general-purpose string domains can be found in [
11,
27].
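As a small illustration of how such general-purpose domains abstract strings, the character inclusion domain can be sketched as a pair of sets. The names `CharInclusion`, `abstract_of` and `join` are hypothetical, and the encoding (a set of characters that surely occur and a set of characters that may occur) follows the informal description above rather than any specific implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical encoding of the character inclusion domain.
struct CharInclusion {
    std::set<char> must; // characters that surely appear in the string
    std::set<char> may;  // characters that may appear in the string
};

// Abstraction of a single concrete string: both sets are its character set.
CharInclusion abstract_of(const std::string& s) {
    std::set<char> cs(s.begin(), s.end());
    return {cs, cs};
}

// Join of two abstract values: a character is certain only if it is certain
// in both, and possible if it is possible in either.
CharInclusion join(const CharInclusion& a, const CharInclusion& b) {
    CharInclusion r;
    std::set_intersection(a.must.begin(), a.must.end(),
                          b.must.begin(), b.must.end(),
                          std::inserter(r.must, r.must.begin()));
    std::set_union(a.may.begin(), a.may.end(),
                   b.may.begin(), b.may.end(),
                   std::inserter(r.may, r.may.begin()));
    return r;
}
```

Joining the abstractions of `"abc"` and `"bcd"` keeps `b` and `c` as certain characters and `a`, `b`, `c`, `d` as possible ones; the positions and the length of the strings are lost, which is why such domains say nothing about buffer bounds.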
Most general-purpose domains focus on the generic aspects of strings, without accounting for the specifics of string handling by the different programming languages. However, it is often beneficial to consider specific aspects of string representation when designing abstract domains for program analysis. Referring to the
C programming language, [
28] has proposed an abstract domain for
C strings which tracks both their length and the allocated size of the buffer in which they are contained. Combined with the cell abstraction [
29], this domain is able to describe relations between the lengths of variables and the offsets of pointers. Amadini et al. [
27] have evaluated several abstract string domains (and their combinations) for the analysis of JavaScript programs. In [
30], the simplified regular expression domain was defined, also for JavaScript analysis. In addition to theoretical work, a number of tools based on the above-mentioned abstract domains and their combinations have been designed and implemented [
30,
31,
32,
33]. While dynamic languages heavily rely on strings and their analysis benefits greatly from tailored abstract domains, the specifics of the
C approach to strings also deserve attention.
10. Conclusions
A new segmentation-based abstract domain for approximating C strings has been introduced, whose main novelty lies in abstracting both index bounds and substrings while managing strings as a pair of two string buffers: the string of interest itself, and a tail of allocated and possibly initialized but unused memory.
The presented approach enables a more precise modelling of the functions in the standard C library for strings, considering also the known weaknesses in the management of terminating null characters and buffer bounds. The M-String domain proves effective for identifying security flaws caused by string manipulation errors, e.g., buffer overflows.
After theoretically describing the domain and the basic operations on strings, we implemented the abstract semantics in C++ and combined it with a tool that lifts string-manipulating C code to the M-String domain. Our experimental evaluation also focused on tuning the parameters of M-String (the domains for both segment content and segment bounds) by instantiating them with both concrete and symbolic characters and with symbolic (bitvector) bounds.
As future work, we plan to further enhance the effectiveness of the M-String domain by combining it, via reduced product, with other numerical or symbolic domains.