Comparative Analysis of Open-Source Tools for Conducting Static Code Analysis

The increasing complexity of web applications and systems, driven by ongoing digitalization, has made software security testing a necessary and critical activity in the software development lifecycle. This article compares the performance of open-source tools for conducting static code analysis for security purposes. Eleven different tools were evaluated in this study, scanning 16 vulnerable web applications. The vulnerable web applications were selected for having the best available documentation of their security vulnerabilities, so that reliable results could be obtained. The static code analysis tools used in this paper can, however, also be applied to other types of applications, such as embedded systems. Based on the results obtained and the analysis conducted, recommendations for the use of these types of solutions are proposed, to achieve the best possible results. The analysis of the tested tools revealed that there is no perfect tool. For example, Semgrep performed better on applications developed using JavaScript technology but had worse results on applications developed using PHP technology.


Introduction
The digital transformation continues to accelerate. More and more businesses and regular users are utilizing various types of software, whether for work or entertainment. Access to public services is also undergoing digitization. For instance, in Poland, the mObywatel application enables the storage of identification documents on mobile devices [1], medical prescriptions circulate electronically (E-recepta) [2], and taxes can be filed electronically (e-PIT) [3]. Ever newer and larger systems are being developed, comprising thousands of lines of code and numerous libraries and technologies. A steadily growing number of programmers collaborate on a single project, needing to work closely together to deliver a finished product on time. An increasing number of services are available over the Internet, and their functionality continues to expand. In short, software security affects the end user regardless of the device they use, be it an Internet of Things (IoT) device, a sensor-equipped device, an embedded system, or a mobile phone [4]. The importance of securing such software applications, which frequently involve complex codebases, cannot be overstated. Vulnerabilities in these applications can lead to serious security breaches, data leaks, and even physical harm if the devices controlled by the software are safety-critical.
Cybercriminals tirelessly devise new ways to exploit vulnerabilities in applications to cause harm and extract data. Analyzing program code for security purposes is challenging, time-consuming, and expensive; thus, there is a need to support these processes. Examples of such support include tools for conducting static code analysis for security (SAST) and dynamic code analysis for security (DAST). Both of these processes have been discussed in the literature [5,6]. However, it can be observed that authors tend to focus on only one technology, such as C [7,8] or Java [6]. The literature also includes comparisons of solutions for related technologies, such as C and C++ [9]. Additionally, the authors in [6] used enterprise-grade tools that are not available to every user due to their high cost. To the best of the authors' knowledge, there is a lack of a broader comparison of open-source tools, available to everyone, that support current trends in software development. Furthermore, the literature lacks a perspective on solutions that can perform static code analysis for more than one technology, such as analyzing code written in both Java and JavaScript. Such solutions can potentially reduce the number of tools used in the software development process, thereby simplifying the continuous integration/continuous delivery process and reducing the amount of data processed in big data pipelines [10].
The novel contribution of this paper is the comparison of the results of open-source tools, supporting various technologies used in software development, for conducting static analysis and detecting potential errors affecting the security of applications, thereby enhancing the security of organizations and end users. Based on the analysis of the obtained results, recommendations were formulated regarding the use of such solutions, which could significantly enhance the quality of the applications developed, even at the code-writing stage. The analysis was carried out based on the list of vulnerabilities reported by all tools. Vulnerable web applications written in the most popular programming languages [11] were scanned using these tools.
The scope of this work encompassed a review of the available literature; an analysis of methods for scanning code for vulnerabilities; a determination of the pros and cons of SAST tools; an overview of existing tools, their comparison, research, and analysis of the obtained results; as well as effectiveness verification. Within the conducted research, vulnerable web applications were configured and launched, SAST tools were configured and launched, an extraction, transformation, loading (ETL) process [12] was used to consolidate results from the tools, and the acquired data were processed. Given that web applications are the most popular type of application enabling access to services through personal computers, this work focused specifically on them. This paper is divided into the following sections:
• Background - describes the research basics and presents the problems, processes, and compromises that occur in static code analysis research;
• Environment - describes the hardware and software used for conducting the research. This section also provides an overview of the analyzed tools and the utilized vulnerable web applications;
• Research Methodology - covers the design of the experiment for the research conducted;
• Results - provides a discussion of the results and highlights the strengths and weaknesses of the examined tools for static code analysis;
• Conclusions - summarizes the obtained results.

Background
A web service is a component of an information technology system that can be communicated with via the Internet. This could be a web application, an API server, and so on. With ever-improving global access to fast and affordable internet connections, these services have become more popular than ever before: they are used for tasks such as online banking, information searches, doctor appointments, watching movies, listening to music, and playing computer games. However, not only has their popularity increased, but their functionality has also expanded. Bank websites offer ever more features, and government websites enable administrative procedures, all through the use of computers or smartphones.
Nevertheless, despite the many advantages of web services, as their complexity grows, so does the likelihood of vulnerabilities that could lead to security threats. In the context of cybersecurity, a vulnerability (or flaw, or weakness) is a defect in an information technology system that weakens its overall security. By exploiting such a flaw, a cybercriminal could cause harm to the system owner (or its users) or pave the way for further attacks on the system. Each vulnerability can also be described by an attack vector: a method or means that enables the exploitation of a system flaw, thereby jeopardizing security. A set of attack vectors is termed the attack surface [13]. The use of a vulnerability is referred to as exploitation.
The common weakness enumeration (CWE) is a community-developed list of software and hardware weakness types [14]. At the time of writing, the CWE database contained 933 entries [15]. Each CWE entry possesses attributes that describe the specific weakness, including an identification number, name, description, relationships to other entries, consequences, and examples of exploitation.
The common vulnerability scoring system (CVSS) is a standard for rating the severity of vulnerabilities [16]. It aims to assign a threat rating to each identified vulnerability (in the case of a successful exploit). Such a rating allows the response efforts of the reacting team to be prioritized based on threat severity. The latest version is CVSSv3.1, released in June 2019. The specification is available on the CVSS website, and CVSSv4 is currently in the public consultation phase [17].
The common vulnerabilities and exposures (CVE) database contains entries about known security vulnerabilities [18]. It differentiates vulnerabilities, which can directly lead to exploitation, from exposures, which may indirectly lead to exploitation. Each entry is assigned a unique identifier and includes a description, name, software version, manufacturer, cross-references to other resources about the entry, and creation date.
Due to the increasing complexity of web systems, software testing for security vulnerabilities has become an essential and critical activity in the software development life cycle (SDLC), especially for web applications [19]. The secure software development life cycle (SSDLC) is an extension of the SDLC with additional security measures [20]. It aims to assist developers in creating software in a way that reduces future security threats. It includes, among other things, defining and implementing security requirements alongside the functional requirements of the application being developed, as well as periodically assessing the security level, for instance through conducting penetration tests.
Numerous SSDLC models have been proposed and are successfully employed in contemporary processes [20]. These include the National Institute of Standards and Technology (NIST) guidelines 800-64 [21], Microsoft's Security Development Lifecycle (MSSDL) [22], and the Comprehensive Lightweight Application Security Process by OWASP (OWASP CLASP) [23].
Penetration testing is one of the most popular methods for assessing the security level of web applications [24]. It constitutes a part of the testing phase within the SSDLC process and involves attacking the application to discover the existence or extent of vulnerabilities within the application's attack surface. In contemporary cybersecurity solutions, automation plays a pivotal role. As the complexity of developed solutions continues to grow, so does the need for more efficient methods of testing the security of web services. In today's fast-paced environment, where software updates are released daily, penetration tests must be conducted swiftly and effectively.
It is impossible to fully automate the entire process of conducting penetration tests; certain aspects must be carried out by a human. However, many tasks, such as fuzzing (a technique involving the supply of incorrect, unexpected, or random data), can be easily automated. Although no automation tool can fully replace the intuition and abstract thinking of a human tester, it can expedite their work by identifying well-known and documented vulnerabilities.
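As an illustration of how easily such a task can be automated, the following Python sketch implements a naive random fuzzer. It is a minimal sketch only: the `parse_age` target function and the payload alphabet are hypothetical examples, not drawn from the tools studied in this paper.

```python
import random
import string

def random_payload(max_len=64):
    """Generate a random string mixing printable and edge-case characters."""
    alphabet = string.printable + "\x00\xff'\"<>%"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

def fuzz(target, iterations=1000):
    """Call `target` with random inputs and collect inputs that raise exceptions."""
    failures = []
    for _ in range(iterations):
        payload = random_payload()
        try:
            target(payload)
        except Exception as exc:
            failures.append((payload, exc))
    return failures

# Hypothetical target: a naive parser that crashes on unexpected input.
def parse_age(text):
    return int(text.split(":")[1])

crashes = fuzz(parse_age)
print(f"{len(crashes)} crashing inputs found out of 1000")
```

Real fuzzers add coverage feedback and input mutation on top of this idea, but even this blind loop shows why the technique is a natural candidate for automation.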
Static code analysis tools analyze the source code of a program, following a white-box testing approach. There are various approaches to conducting this analysis, such as string pattern matching, lexical analysis, and abstract syntax tree (AST) analysis [25]. The earlier a software error is detected, the lower the cost of its resolution [26]. SAST tools can scan code not only during the software testing phase within the SDLC but also while the developer is writing the program code, providing real-time error feedback. They examine the entire code, ensuring 100% coverage of the software. Besides detecting vulnerabilities in the code, these scanners often analyze the libraries used by the application, highlighting those with known vulnerabilities. Despite developers' positive evaluation of SAST tools for error reduction, integrating such tools into SDLC processes encounters certain challenges, such as low analysis performance, the need for multiple tools, and technical debt (a phenomenon in which choosing a seemingly easier and cheaper option in the short term becomes less cost-effective in the long run [27]) when they are implemented late [28]. Scanning extensive codebases can result in hundreds or even thousands of vulnerability alerts for a single application. This generates numerous false positives, prolongs investigation time, and diminishes trust in SAST tool results [28].
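To illustrate the AST-based approach mentioned above, the following Python sketch flags calls to dangerous functions by walking the abstract syntax tree of a source file. The two-entry deny-list is an illustrative assumption; real SAST tools ship with rule sets that are orders of magnitude larger.

```python
import ast

DANGEROUS_CALLS = {"eval", "exec"}  # illustrative deny-list, not a complete rule set

def find_dangerous_calls(source):
    """Walk the abstract syntax tree and report calls to known-dangerous functions."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS_CALLS:
                findings.append((node.func.id, node.lineno))
    return findings

sample = """
user_input = input()
result = eval(user_input)   # CWE-95: code injection
print(result)
"""
print(find_dangerous_calls(sample))  # → [('eval', 3)]
```

Unlike plain string matching, the AST approach ignores comments and string literals that merely mention `eval`, which is one reason AST-based scanners produce fewer trivial false positives than lexical ones.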
Unlike SAST tools, dynamic application security testing (DAST) tools assess the behavior of running program code through the user interface and APIs. This is a black-box approach to penetration testing, as these tools lack access to the source code of the tested application. DAST tools are language-independent and can identify issues not present in the application code, such as misconfigured environments, manipulation of cookies (text snippets stored by a web browser in a computer's memory, which can be transmitted by a web service or created locally), and errors in integration with third-party services [5]. As DAST tools do not have access to the source code of the tested application, there is a significant likelihood that they may overlook certain parts of the scanned application. They will not pinpoint the location of vulnerabilities in the code; rather, they will indicate the detected issues, leaving it to the programmer to identify the line(s) of code responsible for the error.
Since DAST requires a functioning application, vulnerabilities are detected towards the end of the SDLC process, increasing the cost of their remediation. Additionally, a separate environment is needed for conducting tests, further amplifying the financial investment: the entire infrastructure of the tested service must be provided, typically encompassing (but not limited to) the client application, API server, and database. Similarly to SAST tools, they also generate numerous false alarms.
Static and dynamic code analysis for security are not the only types of code analysis. Additionally, we can distinguish the following:
• Interactive application security testing (IAST) - a method that combines SAST and DAST. It utilizes monitoring mechanisms to observe the behavior of the web application server's code, while simultaneously attacking the application through its graphical interface or API [5];
• Runtime application self-protection (RASP) - a solution that uses tools to monitor the behavior of a web application during runtime to detect and block attacks. Unlike IAST, RASP does not attempt to identify vulnerabilities but rather protects against attacks that might exploit them.

Environment
To ensure accurate research results, an appropriate testing environment was prepared. It consisted of computer hardware and software enabling a thorough analysis process. To facilitate seamless reproduction of the experimental results, open-source software was used, along with software whose licenses allow use for research purposes. Table 1 presents a list of the hardware and software used to conduct the research.

Tested Vulnerable Web Applications
Table 2 presents the tested web applications. These applications were designed to include numerous security vulnerabilities. Many of them also come with a list of the vulnerabilities present, which facilitates the analysis of the results from static code analysis tools.

• EasyBuggy - Prepared by Kohei Tamura, EasyBuggy is a deliberately flawed web application, written in the Java programming language to help users understand error behavior and vulnerabilities [29]. A list of the implemented vulnerabilities is available on the project's website;
• Java Vulnerable Lab - Created by the Cyber Security and Privacy Foundation (CSPF), the Java Vulnerable Lab is intended for Java programmers and others interested in learning about vulnerabilities in web applications and how to write secure code [30];
• SasanLabs VulnerableApp - Another application written in Java, VulnerableApp is a project prepared by and for cybersecurity experts [31]. According to its creators, it stands out for its ability to be extended by individuals who want to utilize it. The project is part of the OWASP Foundation's incubator;
• Security Shepherd - This application was prepared by the OWASP Foundation in the Java language. It serves as a platform for training skills in the field of cybersecurity and was designed to enhance awareness of application security [32];
• Broken Crystals - Developed by Bright Security using JavaScript (and TypeScript), Broken Crystals is an application that implements a collection of common security vulnerabilities. The list of vulnerabilities can be found on the project's website [33];
• Damn Vulnerable Web Services - An application created by Sam Sanoop using JavaScript. It facilitates learning about vulnerabilities related to web services and APIs [34]. The list of present vulnerabilities can be found on the project's website;
• OWASP Juice Shop - Described as possibly the most modern and sophisticated insecure web application [35], it was developed by the OWASP Foundation for practicing cybersecurity skills. The list of vulnerabilities can be found on the project's website [36];

Selected Code Analysis Tools
Within the scope of this study, 11 code analysis tools were tested. For the SAST tools, both those dedicated to specific technologies (e.g., Java) and those supporting multiple technologies were chosen. Table 3 presents the tools used for code analysis, and Table 4 presents the support of the selected SAST tools for various technologies. All chosen tools operate locally on a computer. They are not cloud-based solutions, although some of them can be incorporated into the continuous integration/continuous delivery (CI/CD) process, a software delivery method that introduces automation into application deployment stages. The employed tools differed in aspects such as their supported technologies, extensibility capabilities, and the content of the generated reports.
• Find Security Bugs - This is a plugin for the SpotBugs tool, supporting web applications written in the Java language. Similarly to phpcs-security-audit, it extends the base tool's capabilities to identifying security vulnerabilities, effectively turning it into a SAST tool. The report is generated in HTML or XML format. Each entry in the report includes the file path, the identifier of the detected vulnerability (which, when entered on the tool's website, provides a detailed description of the vulnerability), and the line number of the code [51];
• Progpilot - This is a SAST tool for scanning code written in the PHP language for security vulnerabilities. It allows adding custom rules, disabling existing ones, and excluding files and folders from scanning. The JSON report contains a list of vulnerabilities, each with various attributes, such as the file path, line number of the code, type, and a link to the CWE [52];
• Bandit - This SAST tool, developed by PyCQA, is designed for scanning code written in the Python language for security vulnerabilities. It allows adding custom rules, disabling existing ones, and excluding files and folders from scanning. The report can be generated in various formats, such as JSON, CSV, or HTML. For each detected vulnerability, details are included, such as the file path, line number of the code, and vulnerability type and description, along with the CWE identifier [53];
• It uses its own scanners, as well as other open-source ones (a full list is available in the documentation). It allows adding custom rules and excluding specific files and folders from the scan. The report can be saved in text, JSON, or SonarQube formats. Each entry in the report contains a description of the detected vulnerability (not always with a CWE identifier) and its location [57].

Research Methodology
Table 5 presents a structured summary of the experiments conducted in this paper. Each row represents a distinct software application, and each column corresponds to a specific SAST tool. The letter "Y" indicates that the experiment was conducted using the given tool and application. A total of 95 experiments were conducted as part of the research.

Data Unification
Each tool presented its results in a unique way. To conduct further analysis, it was necessary to unify the received data. For this purpose, Python scripts were written to implement the ETL process: they extracted data from reports written in different formats, transformed them to a predefined format, and then output the reformatted data to CSV (comma-separated values) files. A separate CSV file was created for each application. The file contained data in tabular form, where each row represented a vulnerability detected by a specific tool. Not all tools provided complete information; for example, Graudit did not specify the vulnerability type. In such cases, the corresponding fields remained empty.
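The unification step can be sketched in Python as follows. This is a minimal sketch only: the JSON field names and the unified column set are assumptions for illustration, not the exact schema used in the study.

```python
import csv
import json

# Assumed unified schema; the study's actual column list is not reproduced here.
UNIFIED_COLUMNS = ["tool", "file_path", "line", "vulnerability_type"]

def extract_semgrep(report_path):
    """Extract findings from a Semgrep-style JSON report (field names illustrative)."""
    with open(report_path) as fh:
        data = json.load(fh)
    for result in data.get("results", []):
        # Transform: map the tool-specific fields onto the unified schema.
        yield {
            "tool": "semgrep",
            "file_path": result.get("path", ""),
            "line": result.get("start", {}).get("line", ""),
            "vulnerability_type": result.get("check_id", ""),
        }

def load_to_csv(rows, out_path):
    """Load the transformed rows into a single CSV file per application."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=UNIFIED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

One extractor function per report format keeps the transform logic isolated, so adding a new tool only requires writing a new `extract_*` generator; missing fields simply stay empty, as described above.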

Used Indicators
To measure the effectiveness of vulnerability detection in the analyzed applications using the selected SAST tools, a methodology and metrics proposed by the OWASP organization were utilized [58]. However, the tool recommended by OWASP was not used, because it focuses solely on one technology and does not allow checking the various vulnerability variants that may occur in application code. Therefore, for each tool, the following set of indicators was calculated:

• True positive (TP) - the count of vulnerabilities reported by the tool that actually exist in the corresponding location within the application's code (correctly reported vulnerabilities);
• False positive (FP) - the count of vulnerabilities reported by the tool that do not actually exist in the corresponding location within the application's code (incorrectly reported vulnerabilities);
• True negative (TN) - the count of vulnerabilities that were not reported by the tool and do not exist in the corresponding location within the application's code (correctly unreported vulnerabilities);
• False negative (FN) - the count of vulnerabilities that were not reported by the tool but actually exist in the corresponding location within the application's code (incorrectly unreported vulnerabilities);
• Positive (P) - the number of vulnerabilities reported by all tools that exist in the application. The P indicator is calculated using Equation (1). It is the same for all tools within a given application;
• Negative (N) - the number of vulnerabilities reported by all tools that do not exist in the application. The N indicator is calculated using Equation (2). It is the same for all tools within a given application;
• TOTAL - the total number of vulnerabilities reported by all tools. It is the same for all tools within a given application;
• Accuracy (ACC) - the proportion of all vulnerabilities reported by all tools that a specific tool either correctly reported or correctly left unreported. The ACC indicator is determined using Formula (3);
• Sensitivity (SEN) - the proportion of correctly reported vulnerabilities among all vulnerabilities actually present in the application. The SEN indicator is calculated using Formula (4);
• Precision (PRE) - the proportion of correctly reported vulnerabilities among all reported vulnerabilities. The PRE indicator is calculated using Formula (5).
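Since Equations (1)-(5) are referenced but not reproduced in this excerpt, the following Python sketch computes the indicators as inferred from their textual definitions; the example counts are hypothetical.

```python
def indicators(tp, fp, tn, fn):
    """Compute the per-tool indicators (Equations 1-5, inferred from the text)."""
    p = tp + fn      # Equation (1), inferred: real vulnerabilities (found or missed by this tool)
    n = fp + tn      # Equation (2), inferred: non-existent findings in the pooled result set
    total = p + n    # all vulnerabilities reported by all tools
    acc = (tp + tn) / total                       # Equation (3): accuracy
    sen = tp / p if p else 0.0                    # Equation (4): sensitivity
    pre = tp / (tp + fp) if tp + fp else 0.0      # Equation (5): precision
    return {"P": p, "N": n, "TOTAL": total, "ACC": acc, "SEN": sen, "PRE": pre}

# Hypothetical counts for one tool on one application:
print(indicators(tp=8, fp=38, tn=12, fn=2))
```

With these hypothetical counts, P = 10, N = 50, and TOTAL = 60, giving a sensitivity of 0.80 but a precision of only about 0.17, illustrating how a tool can find most real vulnerabilities while still burying them in false alarms.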

Results
The study conducted in this work examined 11 SAST tools. These tools were used to scan 16 vulnerable web applications written in four different technologies. The tools and applications are presented in Section 3.

Results for Applications Developed Using Java Technology
Table 6 presents a comprehensive analysis of the various static application security testing (SAST) tools for the EasyBuggy application. These tools were evaluated based on the indicators presented in Section 4. Among the tools examined, Semgrep performed the best in terms of true positives (TP), with eight identified vulnerabilities. This indicates that Semgrep was effective in detecting actual security issues within the EasyBuggy application. However, it is worth noting that FindSecBugs achieved the highest sensitivity (SEN), at 87.88%. This means it had a higher capability to identify true positives relative to the other tools, even though its absolute number of TPs was lower than Semgrep's. On the other hand, Graudit had no true positives (TP) in this context, which raises concerns about its effectiveness for this specific application. It is important to consider that the absence of TPs could indicate either a lack of vulnerabilities in the code or limitations in the tool's scanning capabilities. In terms of false positives (FP), Horusec had the highest count, with 38. High FP values can lead to wasted resources and time investigating false alarms. In summary, FindSecBugs was consistently the top-performing SAST tool, excelling in accuracy, sensitivity, and precision. It consistently achieved the highest proportion of true positives, while maintaining a reasonable accuracy. On the other hand, Graudit consistently performed the worst, with lower accuracy, sensitivity, and precision, and a higher rate of false negatives. The choice of SAST tool should consider the specific needs of the application and the importance of minimizing false alarms while maximizing the detection of true security vulnerabilities.

These average values provide an overall picture of how each SAST tool performed when applied to JavaScript-based applications. Semgrep stands out, with high accuracy, sensitivity, and precision, making it a strong choice for securing JavaScript applications. However, the selection of the most suitable tool should consider project-specific requirements and constraints. In conclusion, the choice of a SAST tool for JavaScript applications should be made based on a careful evaluation of the specific requirements and constraints of the project. While Semgrep consistently exhibited a strong overall performance, other tools may excel in particular areas or be better suited for specific use cases. A comprehensive security strategy should involve the selection of the right tools, continuous monitoring, and expert analysis, to ensure robust protection against vulnerabilities in JavaScript-based applications.

Results for Applications Developed Using PHP Technology
Table 16 presents the results of the assessment of SAST tools applied to the Conviso Vulnerable Web application. The results reveal important insights into each tool's performance:

• Horusec stands out with 100% sensitivity, indicating that it successfully identified all true positives. However, it is essential to consider the balance between sensitivity and specificity, as achieving 100% sensitivity might come at the cost of a high number of false positives;
• Graudit and Horusec demonstrated perfect precision, at 100%, meaning that all reported vulnerabilities were true positives. Conversely, ShiftLeft Scan and Semgrep showed 0% precision, implying that they reported only false positives in this context;
• Graudit, Horusec, PHP_CS, and Progpilot exhibited true positive rates ranging from 20% to 44.44%, while ShiftLeft Scan and Semgrep had 0% true positive rates, indicating that they failed to identify any true positives;
• Semgrep had a notably high false positive rate of 90%, meaning it reported many issues that were not actual vulnerabilities in the application;
• Some tools, such as Horusec and Progpilot, reported true negatives, indicating that they correctly identified non-vulnerable portions of the application;
• Horusec achieved 100% accuracy, which is commendable. However, it is crucial to consider accuracy in conjunction with other metrics, as a high accuracy rate may be achieved by reporting fewer vulnerabilities, potentially missing real issues.

Table 17 presents the results of an assessment of SAST tools applied to the Damn Vulnerable Web application. The results reveal important insights into each tool's performance:

• Horusec stands out, with a high sensitivity (90.40%), indicating that it successfully identified a substantial portion of true positives. Conversely, Progpilot showed a sensitivity of only 12.80%, suggesting it missed many true positives;
• Progpilot demonstrated 100% precision, implying that all reported vulnerabilities were true positives. However, ShiftLeft Scan had a relatively low precision, at 55.26%, indicating a higher likelihood of false positives;
• Horusec had a high true positive rate (25.17%), while Progpilot and Semgrep had lower rates, implying they missed a significant number of true positives;
• Horusec and PHP_CS had relatively high FP rates, indicating they reported some issues that were not actual vulnerabilities in the application. Semgrep had the lowest FP rate among the tools;
• Some tools, such as Graudit, PHP_CS, and ShiftLeft Scan, reported TNs, indicating that they correctly identified non-vulnerable portions of the application;
• Graudit, Progpilot, and ShiftLeft Scan exhibited reasonably high accuracy rates. However, it is essential to consider accuracy in conjunction with other metrics to assess the overall performance of each tool.

Table 20 presents the average values of the selected indicators for SAST tools applied to applications developed using PHP technology. These averages provide an overview of the overall performance of each SAST tool across multiple applications. In conclusion, the choice of SAST tool for PHP applications should consider a balance between accuracy, sensitivity, and precision. Progpilot excels in precision but may miss some vulnerabilities. Horusec has high sensitivity but reports more false positives. Graudit and ShiftLeft Scan offer a good trade-off between these metrics. Semgrep demonstrated a lower overall performance, particularly in sensitivity and precision. The selection should align with the specific requirements and constraints of the project, and fine-tuning may be necessary for comprehensive security testing.

In summary, the SAST tools exhibited varying performance for the Damn Small Vulnerable Web application. Aura achieved perfect precision but had limited sensitivity. Horusec balanced precision and sensitivity, while Graudit showed limited performance. Bandit performed well, with high precision and sensitivity. ShiftLeft Scan excelled in precision but had limited sensitivity. Semgrep achieved a good balance between precision and sensitivity.
Table 22 presents the results of the assessment of SAST tools applied to the Damn Vulnerable GraphQL application. In summary, the SAST tools exhibited varying levels of performance when analyzing this application. ShiftLeft Scan stood out, with the highest accuracy and perfect precision, indicating a low rate of false positives. However, it also reported a relatively higher number of false negatives. Semgrep achieved a balanced performance, with good precision and sensitivity. Other tools, such as Aura, Horusec, and Bandit, showed moderate performance, with different trade-offs between accuracy, precision, and sensitivity. Graudit had a limited performance, with no true positives reported.

The SAST tools also provided varying results when analyzing the Damn Vulnerable Python Web Application. Horusec and Semgrep demonstrated the highest accuracy and precision, indicating their ability to identify vulnerabilities with fewer false positives. Aura and Bandit showed moderate performance, with a balance between accuracy, sensitivity, and precision. Graudit reported limited performance, with no true positives, while ShiftLeft Scan had the lowest accuracy and precision among the tools. The choice of a specific tool should consider the trade-offs between accuracy and precision, depending on the specific application's security requirements.

For the Tiredful API application, Semgrep demonstrated the highest accuracy, sensitivity, and precision, indicating its ability to identify vulnerabilities effectively. Horusec and Bandit showed moderate performance, with balanced accuracy and precision. Aura had the lowest accuracy and precision among the tools.

In summary, Semgrep consistently performed well across multiple evaluation criteria, making it a strong candidate for analyzing Python applications for security vulnerabilities. However, the choice of the most suitable SAST tool should also consider project-specific requirements; the types of vulnerabilities being targeted; and trade-offs between accuracy, sensitivity, and precision. Additionally, the effectiveness of these tools can vary depending on the specific codebase and the complexity of the application. Therefore, conducting comprehensive testing and fine-tuning the tools' configurations may be necessary to achieve optimal results.

Scan Duration
Table 26 presents a comparison of the scan duration times (in seconds) for the various security scanning tools across the different applications. The times were rounded up to the nearest full second. The table demonstrates significant variability in scan duration across tools and applications: scan times ranged from a few seconds to several minutes, depending on the combination of the tool and the target application. Bandit consistently demonstrated fast scan times, typically taking only 1 s to complete its analysis, regardless of the application. Other tools, such as Graudit, also exhibited fast scan times, completing scans in just 1 s for most applications. The choice of the target application had a considerable impact on scan duration. Some applications, such as Broken Crystals, required longer scan times, with Bearer CLI taking 510 s for this particular application. Semgrep and ShiftLeft Scan both showed competitive scan times across a wide range of applications; they tended to provide relatively quick results without compromising scan depth. On average, across all applications, Bearer CLI had the longest scan time, averaging 181 s (approximately 3 min). In contrast, Bandit, Graudit, and Semgrep had average scan times of 1 s. While some tools, such as Bandit, consistently exhibited fast scan times, they may have limitations in terms of the types of vulnerabilities they can detect. Therefore, the choice of a tool should consider not only scan duration but also the tool's coverage and effectiveness in identifying vulnerabilities. Scan times can also be influenced by tool configurations, such as the scan depth and the number of rules enabled; adjusting these configurations may help balance scan duration with depth of analysis. The complexity and size of the target application play a significant role in scan times: for example, Bearer CLI takes longer to scan more complex applications, while smaller applications generally have shorter scan times. In practice, organizations should consider a balance between scan duration and the tool's ability to identify vulnerabilities effectively; a tool with a very fast scan time but low detection rates may not be as valuable.

To facilitate testing, a dedicated test infrastructure was established. This infrastructure encompassed a host machine and a virtual machine, serving as the platform for experimental execution. The study provided concise descriptions of the scrutinized tools and the web applications subjected to evaluation. A total of eleven distinct tools, each tailored to diverse technologies prevalent in web application development, underwent assessment. The research encompassed a broad spectrum of programming languages, including Java, JavaScript, PHP, and Python, and involved the analysis of sixteen vulnerable web applications. The analysis adhered to a structured methodology: scan reports were standardized into a uniform format, outcomes for each application were consolidated, and each detected vulnerability was categorized into one of three labels: True Positive (TP), False Positive (FP), or Not Applicable (N/A). Vulnerabilities designated as N/A were excluded from subsequent analyses. Finally, performance metrics were computed for each tool, and the results underwent meticulous scrutiny.
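The consolidation and labeling step described above can be sketched as follows. The record fields ("tool", "file", "line", "cwe"), the verdict map, and the sample findings are illustrative assumptions, not the study's exact data format:

```python
# Sketch of the report-consolidation and labeling step: each normalized
# finding receives a TP/FP/N-A verdict, and N/A findings are discarded
# before metrics are computed. Field names here are hypothetical.

def consolidate(findings, verdicts):
    """Attach a TP/FP/N-A label to each finding and drop the N/A ones."""
    kept = []
    for f in findings:
        key = (f["file"], f["line"], f["cwe"])
        label = verdicts.get(key, "N/A")
        if label == "N/A":  # N/A findings are excluded from further analysis
            continue
        kept.append({**f, "label": label})
    return kept

raw = [
    {"tool": "Semgrep", "file": "app.py", "line": 10, "cwe": "CWE-89"},
    {"tool": "Bandit",  "file": "app.py", "line": 10, "cwe": "CWE-89"},
    {"tool": "Graudit", "file": "app.py", "line": 42, "cwe": "CWE-79"},
]
# Manually assigned verdicts; unlisted findings default to N/A.
verdicts = {("app.py", 10, "CWE-89"): "TP"}

print(consolidate(raw, verdicts))
```

A shared key of file path, line number, and weakness identifier makes it possible to apply one manually reviewed verdict to the same finding reported by several tools.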
The findings emerging from this exhaustive analysis of security testing tools for static code analysis underscore a pivotal realization: the absence of a universally impeccable tool. A salient example is Semgrep, which exhibited outstanding performance when evaluating applications developed using JavaScript technologies but faltered when confronted with applications written in PHP. This observation underscores the intricacy of tool selection, as distinct tools exhibit superior efficacy in different contexts. For instance, native tools specifically engineered for particular technologies, such as Java and PHP, generally outperformed their counterparts when evaluated within their respective domains. Conversely, "multitechnology" tools demonstrated enhanced effectiveness when scrutinizing applications developed with JavaScript and Python technologies.
Furthermore, it is imperative to emphasize that the deliberate inclusion of security vulnerabilities in the test applications amplifies the real-world relevance of this study's outcomes. These insights transcend the domain of web applications, as the tested tools are inherently versatile and can be applied to a spectrum of application types, including those designed for embedded systems, IoT, or sensor-equipped devices. This versatility accentuates their relevance in fortifying overall software security across diverse domains, extending beyond the confines of web development.
In summation, this study advocates for a nuanced approach to tool selection in the realm of static code analysis, given the absence of a universally flawless tool. Tailoring tool choices to the specific technologies in use emerged as a critical consideration for effective vulnerability detection. The deliberate inclusion of security errors in the test applications reinforces the practical applicability of the study's findings, thereby elucidating the versatility of these tools in diverse application landscapes beyond web development.

Table 1 .
Hardware and software used.

Table 2 .
Web applications used.

Table 3 .
Tested security code analysis tools.
[47] The report can be generated in three formats: JSON, YAML, and SARIF. Each entry in the report includes the file path, the line number of the code, and a description of the detected vulnerability (along with the CWE identifier(s)) [47];
• phpcs-security-audit-Prepared by Floe, phpcs-security-audit is a set of rules for the PHP_CodeSniffer tool. It extends its capabilities for detecting vulnerabilities and weaknesses in PHP code, turning it into a SAST tool. The scan report is displayed in the system shell console window. It includes file paths, line numbers of the code, and descriptions of detected vulnerabilities, though without any links to, for example, the CWE database [48];
• Graudit-Developed by a single programmer, Eldar Marcussen, Graudit is a SAST tool that searches for potential vulnerabilities in application source code, using another tool, GNU grep, for text filtering. It supports multiple programming languages (including all those tested within this work; a complete list can be found on the project's website). It is essentially a Bash script. Adding rules involves entering specific rules into files provided with the tool; similarly, rules can be "disabled" by removing them. The tool's output is text displayed in the system shell console. It contains only file paths, the line numbers of the code, and highlighted code snippets that triggered the rule [49];
• Insider CLI-This is a SAST tool prepared by InsiderSec. It supports the Java, Kotlin, Swift, C#, and JavaScript languages. It does not allow extending the set of rules, but specific files or folders can be excluded from scanning. The report, available in HTML or JSON format, includes a list of detected vulnerabilities. Each vulnerability is accompanied by a CVSS score, file path, line number of the code, description, and removal recommendation [50];
• Semgrep-Semgrep, a versatile SAST tool supporting multiple programming languages, allows adding custom rules, disabling existing ones, and excluding specific files and folders from scanning. Reports can be generated in various formats, such as JSON or SARIF. For each vulnerability, details are provided, including the file path, the line number of the code, and the vulnerability type and description, along with references to resources such as the CWE and OWASP Top 10 [54];
• Scan-The SAST tool developed by ShiftLeft supports multiple programming languages (full list available on the project's website). It is not a standalone solution; Scan is a combined set of static analysis tools, and its scan report is a combination of reports from the tools it uses. It does not allow rule disabling or excluding specific files or folders. The report is distributed across multiple files in formats such as HTML, JSON, and SARIF. Each entry in the report contains information about the vulnerability's location and type [55];
• Aura-The SAST tool developed by SourceCode.AI is used for scanning code written in the Python programming language. The tool does not allow adding custom rules, nor can individual rules be blocked. The report can be generated in text, JSON, or SQLite database file formats. Each detected vulnerability is associated with a file path, line number, and vulnerability type [56];
• Horusec-A SAST tool for scanning code written in multiple programming languages.
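Since several of the tools above emit JSON reports containing a file path, line number, and description, converting each report into a common shape is a simple mapping step. The sketch below uses a hypothetical report schema; the `path`, `line`, and `message` field names are assumptions, as each real tool defines its own schema:

```python
import json

# Illustrative sketch: flattening one tool's JSON scan report into the
# uniform (file, line, description, cwe) records used for comparison.
# The input schema below is hypothetical, not any specific tool's format.

sample_report = json.loads("""
{
  "results": [
    {"path": "src/login.php", "line": 12,
     "message": "SQL injection", "cwe": "CWE-89"},
    {"path": "src/view.php", "line": 7,
     "message": "Reflected XSS", "cwe": "CWE-79"}
  ]
}
""")

def normalize(report):
    """Map one report's entries onto the shared finding record shape."""
    return [
        {"file": r["path"], "line": r["line"],
         "description": r["message"], "cwe": r.get("cwe")}
        for r in report["results"]
    ]

for finding in normalize(sample_report):
    print(finding["file"], finding["line"], finding["cwe"])
```

In practice, one such adapter per tool is enough to bring eleven differently formatted reports into a single comparable dataset.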

Table 5 .
Matrix of conducted experiments.
• TP%-The TP% indicator determines what proportion of vulnerabilities reported by all tools are correctly reported vulnerabilities;
• FP%-The FP% indicator determines what proportion of vulnerabilities reported by all tools are incorrectly reported vulnerabilities;
• TN%-The TN% indicator determines what proportion of vulnerabilities reported by all tools are correctly not reported vulnerabilities;
• FN%-The FN% indicator determines what proportion of vulnerabilities reported by all tools are incorrectly not reported vulnerabilities.
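These percentage indicators, together with the ACC, SEN, and PRE metrics reported in the tables, follow the standard confusion-matrix definitions and can be computed as follows (the variable names and the example counts are ours, not values taken from the tables):

```python
# Standard confusion-matrix metrics as used throughout the tables.

def indicators(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "ACC": 100 * (tp + tn) / total,                    # accuracy
        "SEN": 100 * tp / (tp + fn) if tp + fn else 0.0,   # sensitivity
        "PRE": 100 * tp / (tp + fp) if tp + fp else 0.0,   # precision
        "TP%": 100 * tp / total,
        "FP%": 100 * fp / total,
        "TN%": 100 * tn / total,
        "FN%": 100 * fn / total,
    }

# Illustrative example: 30 TP, 10 FP, 50 TN, 10 FN
m = indicators(tp=30, fp=10, tn=50, fn=10)
print(m["ACC"], m["SEN"], m["PRE"])  # 80.0 75.0 75.0
```

The four percentage indicators always sum to 100%, which provides a quick sanity check when re-deriving values from the tables.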

Table 6 .
The values of SAST tool indicators for the EasyBuggy application.

Table 7 presents the results of the SAST tool evaluations for the Java Vulnerable Lab application:
• TP-Among the tools, FindSecBugs achieved the highest count of TPs, with 32 vulnerabilities detected. This indicates a strong capability to identify actual security issues within the application;
• FN-Each tool had a varying count of FNs, representing missed vulnerabilities;

Table 7 .
The values of SAST tool indicators for the Java Vulnerable Lab application.

Table 8 presents a comprehensive analysis of the various SAST tools applied to the Security Shepherd application. The results provide valuable insights into the performance of each tool:
• The tools correctly identified TNs where no vulnerabilities were present. The TN count was highest for Insider (1841), indicating its ability to avoid false alarms;
• The overall ACC of the tools varied, ranging from 59.51% (Horusec) to 90.32% (Insider), showing differences in their effectiveness in correctly classifying vulnerabilities;
• Insider identified only six TPs, signifying limited effectiveness in detecting vulnerabilities;
• ShiftLeft Scan found 173 TPs, demonstrating a robust ability to identify security problems;
• Semgrep detected 28 TPs, indicating some effectiveness in identifying vulnerabilities;
• FP counts were high for FindSecBugs (767) and Horusec (801). Such high FP counts could lead to resource-intensive investigations of false alarms;

Table 8 .
The values of SAST tool indicators for the Security Shepherd application.

Table 9 provides an analysis of the various SAST tool indicators for the Vulnerable App application. The results reveal important insights into each tool's performance:
• The overall ACC of the tools varied, ranging from 39.18% (Graudit) to 76.29% (Semgrep), showing differences in their effectiveness in correctly classifying vulnerabilities;
• Semgrep achieved the highest SEN, at 62.50%, indicating its strong capability to identify true positives relative to the other tools;
• Variation in FP counts was observed, with the highest count for Graudit (31) and the lowest for Insider (3). High FP counts can lead to resource-intensive investigations of false alarms;
• The tools correctly identified TNs where no vulnerabilities were present. TN counts ranged from 26 (Graudit) to 55 (Insider), indicating the ability to avoid false alarms.

Table 9 .
The values of SAST tool indicators for the Vulnerable App application.

Table 10 provides average values for the selected SAST tool indicators for applications developed using Java technology. The results provide valuable insights into the performance of each tool:

Table 10 .
Average values of the selected SAST tool indicators for applications developed using Java technology.
• TP and FN-FindSecBugs consistently identified a higher proportion of TPs among all vulnerabilities detected, indicating its strong capability to find actual security issues. Graudit had the highest average FN percentage (22.19%), suggesting that it frequently missed vulnerabilities;
• FP and TN-Insider had the lowest average FP percentage (2.02%), indicating a lower rate of false alarms among non-vulnerable instances. FindSecBugs had the highest average FP percentage (17.83%), suggesting a higher rate of false alarms among non-vulnerable instances.

Table 11 presents the results of the SAST tool evaluations for the Broken Crystals application. The results reveal important insights into each tool's performance:
• Semgrep achieved the highest accuracy, at 85.26%, meaning that it made fewer misclassifications. Horusec had the lowest accuracy, at 25.64%;
• Horusec demonstrated the highest sensitivity, at 55.56%, indicating its effectiveness in identifying true positives. Bearer had the lowest sensitivity, at 7.41%, implying that it missed many vulnerabilities;
• Semgrep achieved the highest precision, at 75.00%, meaning that when it reported a positive result, it was often accurate. Graudit had a precision of 0.00%, because it reported no TPs.

Table 11 .
The values of SAST tool indicators for the Broken Crystals application.

Table 12 presents the results of the SAST tool evaluations for the Damn Vulnerable Web Services application. The results reveal important insights into each tool's performance:

Table 12 .
Values of the SAST tool indicators for the Damn Vulnerable Web Services application.

Table 13 presents the results of the assessment of SAST tools applied to the Juice Shop application. The results reveal important insights into each tool's performance:
• Semgrep achieved the highest precision, at 60.61%. Graudit, in contrast, reported a high number of false positives and no true positives, resulting in a precision of 0.00%.

Table 13 .
The values of SAST tool indicators for the Juice Shop application.

Table 14 presents the results of the SAST tool evaluations for the NodeGoat application. The results reveal important insights into each tool's performance:

Table 14 .
The values of SAST tool indicators for the NodeGoat application.

Table 15 presents the average values for the selected indicators for SAST tools applied to applications developed using JavaScript technology. These averages provide an overview of the overall performance of each SAST tool across multiple applications. On average, Semgrep achieved the highest accuracy, at 88.21%, with a sensitivity of 53.00%, a precision of 80.78%, a true positive rate of 10.65%, a false negative rate of 9.00%, and a false positive rate of 2.79%. Its true negative rate averaged 77.56%.

Table 15 .
Average values of selected SAST tool indicators for applications developed using JavaScript technology.
• Semgrep and Bearer also excelled in precision. Reducing false positives is crucial to minimize the time spent investigating non-existent vulnerabilities;
• Semgrep consistently maintained a high true positive rate;
• Horusec showed promise in precision but lagged in sensitivity, which might be suitable for certain scenarios.

Table 16 .
The values of the SAST tool indicators for the Conviso Vulnerable Web application.

Table 17 .
The values of the SAST tool indicators for the Damn Vulnerable Web application.

Table 18 presents the results of the assessment of SAST tools applied to the WackoPicko application. The results reveal important insights into each tool's performance:

Table 18 .
The values of the SAST tool indicators for the WackoPicko application.

Table 19 presents the results of the assessment of SAST tools applied to the Xtreme Vulnerable Web application. The results reveal important insights into each tool's performance:

Table 19 .
The values of SAST tool indicators for the Xtreme Vulnerable Web application.

Table 20 .
Average values of the selected SAST tool indicators for applications developed using PHP technology.

Table 21 presents the results of the assessment of SAST tools applied to the Damn Small Vulnerable Web application. The results reveal important insights into each tool's performance:

Table 21 .
The values of the SAST tool indicators for the Damn Small Vulnerable Web application.

Table 22 .
The values of the SAST tool indicators for the Damn Vulnerable GraphQL application.

Table 23 presents the results of the assessment of SAST tools applied to the Damn Vulnerable Python Web application. The results reveal important insights into each tool's performance:

Table 23 .
The values of SAST tool indicators for the Damn Vulnerable Python Web application.

Table 24
TPs and six FNs. It had an ACC of 62.50%, a SEN of 25.00%, and a perfect PRE of 100.00%. ShiftLeft Scan reported six TPs and two FNs. It had an ACC of 62.50%, a SEN of 75.00%, and a PRE of 60.00%;
• Semgrep reported three TPs and five FNs. It achieved an ACC of 68.75%, a SEN of 37.50%, and a perfect PRE of 100.00%.

Table 24 .
The values of the SAST tool indicators for the Tiredful API application.

Table 25 presents the average values of selected SAST tool indicators for applications developed using Python technology. These averages provide an overview of the overall performance of each SAST tool across multiple applications:
• The average FN rates ranged from approximately 12.50% to 65.10%. Graudit had the highest FN rate, implying that it missed a substantial number of vulnerabilities. Semgrep and ShiftLeft Scan demonstrated relatively lower FN rates;
• The average FP rates ranged from around 0.00% to 20.54%. ShiftLeft Scan had the highest FP rate, followed by Aura. Semgrep produced the fewest false alarms;
• Graudit achieved the highest TN rate, followed by Semgrep.

Table 25 .
Average values of selected SAST tool indicators for applications developed using Python technology.