As we explained in the previous section, our goal is to combine the advantages of these two types of benchmarks and separate the analysis challenges in a way that reduces evaluation entropy and enables more precise attribution of tool performance.
Our intuition is that SAST tools capable of correctly handling these three entropy-contributing dimensions may still struggle to detect vulnerabilities in real-world applications, owing to the inherently higher information diversity and uncertainty of long data flows and nested contexts. However, tools that cannot reduce entropy and control uncertainty in these three areas are certain to fail in real-world scenarios, because these abilities are prerequisites for handling the high-entropy execution space of PHP applications. Owing to their limited Lines of Code (LoC), the small test cases inevitably exhibit a lower-entropy distribution and cannot fulfill the relevance characteristic (C). By additionally incorporating real-world applications, we aim to create a comprehensive benchmark that balances and spans the full entropy spectrum, meeting the three key characteristics of test cases.
4.2. Workload
In designing the workload, we follow the principle of a “separation requirement”, which refers to isolating individual analysis challenges so that each test case focuses on a single dimension of difficulty without interference from others. We discuss the workload design along four dimensions.
Identify the three elements of taint analysis (A1): Currently, there is no dataset specifically designed for identifying the three elements of taint analysis. Therefore, we conducted a comprehensive investigation of all built-in PHP functions [29]. Based on the definitions of sources, sanitization, and sinks given in the background section, we analyzed and matched the semantics of these built-in functions to complete the classification and construct the workload for this part. More specifically, for taint sources, we designate five user-controllable superglobal variables as sources of taint: $_GET, $_POST, $_FILES, $_COOKIE, and $_REQUEST. For sanitization functions, we define them as functions that disrupt the attack semantics of their parameters with respect to a specific vulnerability type. If the return value processed by such a function can no longer trigger the relevant vulnerability, effectively interrupting meaningful taint propagation, the function is considered a sanitization function. For example, as shown in Listing 2, the sink function is echo, and thus we consider potential XSS vulnerabilities. After being processed by the abs function, the tainted variable $tainted is transformed into a positive integer, which cannot carry the malicious semantics required to trigger an XSS attack (e.g., injection of <script> tags). For sink functions, we selected security-sensitive PHP functions such as system, exec, unlink, file_get_contents, mysql_query, move_uploaded_file, and echo. When unprocessed tainted variables are passed into these functions, they may result in vulnerabilities.
Listing 2. Example of a sanitization function.
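A minimal sketch of such a test case, assuming $_GET as the source (the exact code of Listing 2 may differ; the variable and key names are assumptions):

```php
<?php
// Sanitization sketch: abs() destroys the attack semantics of the taint.
$tainted = $_GET['number'];   // user-controlled input, assumed numeric
$tainted = abs($tainted);     // abs() yields a non-negative number, which cannot carry an XSS payload
echo $tainted;                // sink: no vulnerability should be reported
```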
Below, we use specific test cases as examples to illustrate how to evaluate the ability of SAST tools to identify the three key elements of taint analysis.
For the source, we traverse all the PHP source points (the five superglobal variables listed above) and use the echo statement as the sink. Listing 3 shows a test case for evaluating the source identification capability. According to the separation requirement, this part should ideally focus solely on source points without involving taint propagation. However, if no propagation is introduced, the source points cannot be evaluated as effective sources of taint, since their impact would not manifest. Therefore, we slightly relax the separation requirement by constructing a minimal taint propagation process, which enables us to evaluate the capability of SAST tools to correctly recognize source points.
Listing 3. Example of a test case of source identification.
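A minimal sketch of a source-identification test case, using $_COOKIE as the source under test (the superglobal varies across test cases; variable and key names are assumptions):

```php
<?php
// Source-identification sketch: minimal taint propagation from the source under test
// to the fixed echo sink.
$tainted = $_COOKIE['input'];   // source under test
echo $tainted;                  // sink: XSS expected if the source is recognized
```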
For sanitization, we fix the source as $_GET and the sink as echo, and then iterate through all sanitization functions in PHP based on this setup. In our previous work [23], we found that there is a class of sanitization functions whose effect can be undone by a paired function, thereby restoring the taint. We call these reversible sanitization functions and added them to our benchmark to evaluate whether SAST tools can recognize such sanitization functions. Specifically, this type of function typically involves encoding/decoding or encryption/decryption operations. When a function A encodes or encrypts a string, there exists a corresponding function B that can reverse the operation and restore the encoded or encrypted data. Thus, when a tainted variable is processed by function A, the taint is temporarily sanitized but can be restored when processed by function B. Listing 4 presents two scenarios of reversible sanitization functions. In lines 3–5, only htmlspecialchars is executed, so the taint is sanitized and no vulnerability exists. In lines 7–9, both htmlspecialchars and htmlspecialchars_decode are executed in sequence, restoring the taint in line 9 and resulting in an XSS vulnerability in line 10. For SAST tools, reporting no vulnerability in the first case (lines 3–5) and reporting a vulnerability in the second case (lines 7–10) are both considered correct results.
Listing 4. Example of a test case of sanitization identification.
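A sketch of the two scenarios described above, assuming $_GET as the source; the exact code and line numbering of Listing 4 may differ:

```php
<?php
// Reversible-sanitization sketch.
$tainted = $_GET['input'];

// Scenario 1: htmlspecialchars() alone keeps the taint sanitized.
$safe = htmlspecialchars($tainted);
echo $safe;                               // no vulnerability expected

// Scenario 2: htmlspecialchars_decode() reverses the encoding and restores the taint.
$encoded  = htmlspecialchars($tainted);
$restored = htmlspecialchars_decode($encoded);
echo $restored;                           // XSS expected here
```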
For the sink, we fix the source as $_GET and do not apply any sanitization. We iterate through the PHP sink functions associated with different vulnerability types to evaluate which vulnerability types each SAST tool can detect. In addition, a sink function may have multiple parameters, and usually only a taint reaching a specific parameter can lead to a vulnerability. For such multi-parameter sink functions, we set up an example that passes a taint to a non-hazardous parameter to determine whether the SAST tool has modeled the hazardous parameter positions. Listing 5 presents the test case for evaluating the sink identification capability. We use a paired positive and negative test case to assess whether the SAST tool recognizes the dangerous parameters of a dangerous function, as illustrated in lines 3 to 8 of the code. When the sink function has only one required parameter and the other parameters are optional, we do not consider it a multi-parameter sink function in our evaluation, as shown by exec in lines 9–10 of the code.
Listing 5. Example of a test case of sink identification.
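A sketch of the paired sink-identification cases, assuming file_put_contents() as the multi-parameter sink and treating only its first parameter ($filename) as hazardous for an arbitrary file write; the sink actually used in Listing 5 and its line numbering may differ:

```php
<?php
// Sink-identification sketch.
$tainted = $_GET['input'];

// Positive case: the taint reaches the hazardous parameter (arbitrary file write).
file_put_contents($tainted, 'fixed content');

// Negative case: the taint only reaches a non-hazardous parameter; a tool that models
// parameter positions should not report this call.
file_put_contents('/tmp/log.txt', $tainted);

// exec() has a single required parameter (the others are optional), so it is not
// treated as a multi-parameter sink in the evaluation.
exec($tainted);   // command injection expected
```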
In summary, A1 is designed to assess whether SAST tools can correctly model sources, sanitization functions, and sinks in PHP applications. We conducted a comprehensive survey of all PHP built-in functions and classified them based on the definitions of source, sanitization, and sink. The dataset was constructed step by step, with each component (source, sanitization, and sink) isolated to minimize interference from other factors.
To elaborate, for sources, we created minimal vulnerable code snippets with straightforward taint propagation, such as a two-line example involving an XSS vulnerability. These examples require SAST tools to recognize basic sources like $_GET and fundamental sinks like echo. For sanitization, we extended these examples by adding sanitization functions, testing whether the tools properly model them. Similarly, sink-related examples fixed the source as $_GET and excluded any sanitization, focusing on the correct identification of sinks. This approach effectively delineates the boundaries of SAST tools in modeling these essential components.
Basic data flow analysis capabilities (A2): Similarly, there is currently no dataset specifically designed for evaluating the basic data flow analysis capabilities of PHP SAST tools. Therefore, we conducted in-depth investigations into known vulnerabilities in PHP applications, analyzed the taint propagation of these vulnerabilities, and studied what capabilities PHP SAST tools need in order to detect them. We found that these vulnerabilities involve longer data flows, more branch conditions, more interprocedural data flows, and more complex contexts than small test cases. Therefore, PHP SAST tools need flow-sensitive, context-sensitive, and interprocedural analysis capabilities to detect these vulnerabilities.
For the evaluation of flow-sensitivity, we consider three levels of capability: flow-insensitive, flow-sensitive, and path-sensitive analysis.
Figure 2 shows sample code of the test cases for evaluating flow-sensitivity. Overall, we designed the execution logic of the test cases so that, through the combined verification of two or more test cases, we can determine the analysis capability of a PHP SAST tool. As shown in the figure, the code in Figure 2a contains a simple if-branch, so both a flow-insensitive method that ignores branch logic and a flow-sensitive method can analyze this sample code in top-to-bottom order. Figure 2b introduces an else-branch and inserts an assignment statement below the if-branch that sanitizes the tainted variable $tainted. A flow-insensitive method that still parses the code in top-to-bottom order concludes that $tainted no longer carries any taint and misses the vulnerability, whereas a flow-sensitive method does not. In Figure 2c, a for loop is added so that the program only accepts the taint and triggers the sink when $i equals 20, a condition that never holds at runtime. A flow-sensitive but path-insensitive analysis ignores the conditional predicates and tracks all branches as potentially executable, so it assumes the if-branch may be executed and produces a false positive report; only a path-sensitive analysis avoids it.
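The following sketch illustrates the kind of branch logic described for Figure 2b,c, assuming the sanitizing assignment sits in the else-branch and that the loop bound keeps the guarded code unreachable; the actual figure may differ:

```php
<?php
// In the spirit of Figure 2b: the else-branch overwrites the tainted variable.
$tainted = $_GET['input'];
if (isset($_GET['flag'])) {
    echo $tainted;               // taint reaches the sink on this branch (vulnerable)
} else {
    $tainted = 'safe value';     // sanitizing assignment below the if-branch
}

// In the spirit of Figure 2c: the guarded source and sink can never execute.
for ($i = 0; $i < 10; $i++) {
    if ($i == 20) {              // never true within the loop bounds
        $tainted = $_GET['input'];   // taint accepted only on this infeasible path
        echo $tainted;               // a path-insensitive tool reports a false positive here
    }
}
```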
For the evaluation of interprocedural analysis capabilities, we consider two scenarios: function calls (including method calls) and file inclusion. For function calls, we focus on whether SAST tools can correctly analyze the data flow of function call edges and function return edges.
Figure 3 shows a function call test case, where (a) has a vulnerability and (b) is the secure version of (a). In
Figure 3a, the taint is passed as a parameter into the function vul, propagates within the function, and is then returned through a return statement. On line 11, the variable
$ret receives the tainted value and triggers the sink on line 12.
Figure 3b adds an assignment inside the called function (the safe version) that sanitizes the taint. By combining these two test cases for verification, we can determine whether the SAST tool can handle the most common function call scenarios.
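A sketch in the spirit of Figure 3, combining the vulnerable and secure versions in one snippet (function and variable names are assumptions, and the line numbering differs from the figure):

```php
<?php
// In the spirit of Figure 3a: the taint enters vul() as a parameter, propagates
// inside the function, and comes back through the return edge.
function vul($param) {
    $local = $param;           // intraprocedural propagation
    return $local;             // taint leaves through the return edge
}

// In the spirit of Figure 3b: the callee sanitizes the taint before returning.
function safe($param) {
    $param = 'constant';       // assignment removes the taint
    return $param;
}

$tainted = $_GET['input'];
$ret = vul($tainted);          // call edge carries the taint
echo $ret;                     // sink: XSS expected

echo safe($_GET['input']);     // no vulnerability expected
```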
On the other hand, SAST tools often rely on assumptions. Taking Figure 3a as an example, a tool may simply assume that the function vul called on line 11 propagates taint, and therefore that its return value $ret carries taint because the argument $tainted is tainted. In that case, the tool does not actually perform interprocedural analysis, and it detects the vulnerability in this test case only because its assumption happens to succeed. Therefore, we separate the data flows of function call edges and function return edges to further evaluate the function call analysis capability of SAST tools.
Figure 4 shows a test case where function parameters are passed, but function return values are not used. Compared with
Figure 3, the sink is located inside the called function, and the vulnerability is no longer triggered by returning a tainted variable. This makes it possible to determine whether the SAST tool accurately analyzes the data flow along the function call edge.
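A sketch in the spirit of Figure 4, where only the call edge is exercised (names are assumptions):

```php
<?php
// The taint is passed as an argument and the sink sits inside the callee;
// the return value is not used.
function vul($param) {
    echo $param;               // sink inside the called function
}

vul($_GET['input']);           // XSS expected only if the call edge is analyzed
```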
For file inclusion, we focus on whether SAST tools can correctly analyze the data flow within the included files.
Figure 5 shows such a test case, where (a) contains a vulnerability and (b) is the secure version of (a). It is worth mentioning that the vulnerability reports generated by various SAST tools are inconsistent: some tools only report the file where the sink is located and do not report the propagation process from source to sink. When the main file and the included file are in the same directory and such a SAST tool analyzes that directory, its report only shows a vulnerability in the included file, which makes it impossible for us to determine whether the tool actually analyzed the file inclusion. Therefore, we place the main file and the included files in two separate directories and let the SAST tools analyze only the directory containing the main file, so that a correct assessment can be made.
Calling functions defined in the included file is another scenario for file inclusion interprocedural analysis.
Figure 6 shows the test case for calling functions defined in the included file. Similarly, we still adhere to the strategy of separating the main file from the directory containing the included files.
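A sketch in the spirit of Figures 5 and 6, with the main file and the included file placed in separate directories (file paths, directory layout, and function names are assumptions):

```php
<?php
// main/index.php: only this directory is handed to the SAST tool.
include __DIR__ . '/../included/lib.php';

$tainted = $_GET['input'];
echo_wrapper($tainted);        // calls a function defined in the included file
```

```php
<?php
// included/lib.php: defines the sink-carrying function used by the main file.
function echo_wrapper($value) {
    echo $value;               // sink: XSS expected when $value is tainted
}
```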
For the evaluation of context-sensitive analysis capability, we also use the joint validation of two test cases to determine whether the SAST tool has context-sensitive analysis capability, mainly evaluating it in two scenarios: function calls and method calls. Because context-insensitive analysis typically relies on a single summary of the called function, the test case for this assessment contains two function call points, one causing a vulnerability and the other not. Listing 6 shows a test case in the function call scenario. When the SAST tool reports exactly one vulnerability (at the tainted call site), it indicates that the tool has context-sensitive analysis capability in that scenario. When the SAST tool reports no vulnerabilities or two vulnerabilities, it indicates that the tool only has context-insensitive analysis capability.
Listing 6. A test case of context-sensitive assessment.
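A sketch of the two-call-site design described above (function and variable names are assumptions; the code of Listing 6 may differ):

```php
<?php
// Two call sites to the same function: only the first should be reported.
function pass($value) {
    return $value;             // the taint of the return value depends on the call site
}

$a = pass($_GET['input']);     // tainted argument
echo $a;                       // vulnerable call site: exactly one report expected here

$b = pass('constant');         // untainted argument
echo $b;                       // safe call site: no report expected
```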
To summarize, A2 focuses on evaluating the data flow analysis capabilities of SAST tools, including flow-insensitive, flow-sensitive, path-sensitive, interprocedural, and context-sensitive analyses. To control experimental variables, we fixed $_GET as the source and echo as the sink, and excluded sanitization functions. This standardization ensures that the evaluation isolates the tools’ abilities to handle different levels of data flow analysis without interference.
Complex semantic analysis capabilities (A3): For a long time, PHP SAST tools have faced complex challenges in type inference, dynamic features, and built-in functions during the analysis process. Accurate type inference is the foundation for handling dynamic features, resolving function and method calls, and other analysis scenarios: only by correctly inferring the types of the relevant variables can the subsequent analysis be carried out correctly. The behavior of dynamic features may only be determined at runtime, and static analysis must understand these features without executing the code, which is a challenging task. The built-in functions of PHP are implemented in C; for SAST tools, their internal implementation is a black box, making it difficult to analyze their internal data flow. In previous studies [18], researchers examined the impact of these complex semantics on SAST tools and open-sourced their dataset. Based on their work, we reclassified their dataset according to type inference, dynamic features, and built-in functions, and corrected some erroneous test cases. It is worth mentioning that because test cases with complex semantics usually involve branch structures and function calls, it is difficult to fully control the experimental variables. However, we believe that a SAST tool that performs well in the evaluations of taint-element identification (A1) and basic data flow analysis (A2) will not be hindered by these aspects, so the test cases can still effectively evaluate its complex semantic analysis capability. This is also the core advantage of our proposed progressive assessment method for SAST capabilities: it gradually increases the difficulty of analysis through a carefully designed hierarchical structure, with each level built on the foundation of the previous one, ensuring that the depth and breadth of the analysis increase as the levels progress.
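The following illustrative sketch (not taken from the reclassified dataset; all names are assumptions) shows one dynamic-feature case and one built-in-function case of the kind described above:

```php
<?php
// Dynamic feature: a variable function call. Resolving which function is invoked
// requires inferring the string value of $fn without executing the code.
function print_value($value) {
    echo $value;                       // sink
}
$fn = 'print_' . 'value';
$fn($_GET['input']);                   // XSS expected if the dynamic call is resolved

// Built-in function: str_replace() is implemented in C, so the tool must model
// that the taint of the third argument flows into the return value.
$replaced = str_replace('a', 'b', $_GET['input']);
echo $replaced;                        // XSS expected if str_replace is modeled
```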
Real-world application vulnerability detection capability (A4): Evaluating the capability of SAST tools to detect vulnerabilities in real-world applications is the most essential requirement when assessing SAST tools. For the selection of real-world applications, we follow two main principles: first, the application must be widely used, and second, its latest version must have been released within the past ten years. The first principle ensures that the selected applications are representative of software developed according to industry standards, which reflects current trends in production environments better than applications used by only a few people. The second principle ensures that the PHP version used by the application is not too outdated, avoiding vulnerabilities that are obsolete or no longer relevant. Real-world applications selected according to these principles can effectively assess the capability of SAST tools to detect vulnerabilities in real-world applications.
More specifically, we selected 24 representative PHP applications, totaling over 10 million lines of code, as shown in
Table 1. These applications were chosen based on three criteria: (1) popularity, quantified by the number of GitHub stars, with a threshold of more than 1000 stars at the time of collection; (2) functional diversity, ensured by categorizing applications into different usage domains (e.g., content management systems, e-commerce, enterprise management, customer relationship management, project management), as summarized in the “Usage” column of Table 1; and (3) relevance to related work, focusing on applications that have been used in prior studies.