An Empirical Comparison of Pen-Testing Tools for Detecting Web App Vulnerabilities

Abstract: Today, one of the most popular ways for organizations to provide their services, or, broadly speaking, to interact with their customers, is through web applications. These applications should be protected and meet all security requirements. Penetration testers need to ensure that an attacker cannot find any weaknesses to destroy, exploit, or disclose information on the Web. Automated vulnerability assessment tools are therefore the easiest and most practical part of web application pen-testing, but these tools have strengths and weaknesses, and using the wrong tool may leave expected or known vulnerabilities undetected, opening doors for cyberattacks. This research proposes an empirical comparison of pen-testing tools for detecting web app vulnerabilities using approved standards and methods to facilitate the selection of appropriate tools according to the needs of penetration testers. In addition, we propose an enhanced benchmarking framework that combines the latest research on benchmarking and evaluation criteria with new criteria, covering more ground with benchmarking metrics as an enhancement for web penetration testers in practice. We measure each tool's abilities using a score-based comparative analysis, and we conducted simulation tests of both commercial and non-commercial pen-testing tools. The results showed that Burp Suite Professional scored the highest among the commercial tools, while OWASP ZAP scored the highest among the non-commercial tools.


Introduction
The aim of web application penetration testing (pen-testing) is to identify vulnerabilities that are caused by insecure development practices in software or website design, coding, and server configuration. Generally, web app pen-testing includes testing user authentication to verify that data cannot be compromised through the authentication mechanism; assessing the web app for vulnerabilities and flaws such as cross-site scripting (XSS); confirming the secure configuration of web browsers and servers; identifying design features that minimize vulnerabilities; and ensuring web server and database server security [1]. Pen-testing has become an essential requirement for identifying vulnerabilities and security flaws that cyberattackers can exploit [2]. With technological advancement, the complexities of pen-testing are increasing in terms of security. The Open Web Application Security Project (OWASP) [3] is a non-profit organization dedicated to promoting software security. The organization provides a wide range of services that help developers, including educational resources, social events, and tools. It also provides guidelines, including the recently updated OWASP Top 10 Application Security Risks. The updated list features considerable changes, such as Broken Access Control moving from the fifth spot to the first position. According to the organization [3], 94% of the applications were tested for some form of broken access control, and the 34 Common Weakness Enumerations (CWEs) mapped to Broken Access Control had more occurrences in applications than any other category. To understand the development of pen-testing over the last few years, this work presents an empirical comparison of pen-testing tools. The paper provides a review of the literature on the work of various researchers in the field of pen-testing and discusses the various aspects of web application pen-testing.
Most importantly, the study reviewed pen-testing tools in terms of their characteristics and performance. The contributions of this work are as follows:
• We have conducted a literature review on the work done by various researchers in the area of web application pen-testing.
• We have studied the various tools used for pen-testing in terms of their performance, vulnerability detection, test coverage, etc.
• We have proposed an enhanced benchmarking framework for web application pen-testing tools.

Background
Web application pen-testing is a form of ethical hacking designed specifically to assess the design, configuration, and architecture of a web application. The aim of conducting such assessments is to identify security risks that could result in unauthorized access or data exposure [4]. This section compares the three major types of security testing technologies relied on by developers to help identify security flaws before software releases: static application security testing (SAST), dynamic application security testing (DAST), and interactive application security testing (IAST). SAST consists of technologies and tools designed to check for vulnerabilities and flaws in code. SAST tools scan code at rest, meaning that the code is not being executed by any program or human. The tool analyzes the static code, following each line and instruction and comparing it against a set of rules and known error patterns. SAST is regularly used by development teams to enforce compliance with coding standards and formats. DAST represents multiple tools used for vulnerability checking in Web-based applications. While SAST can see the code base, DAST has no knowledge of the underlying code; it works by running the application in a staging environment and probing it, as a hacker would, for weaknesses. IAST is a hybrid testing method that aims to address the main failures of SAST and DAST by combining the best features of the two. Agents continuously monitor and analyze the behavior of the Web application during automated or manual tests. When IAST is properly configured, it can identify information such as calls to other services, data flow, infrastructure data, HTTP traffic, or configuration options, and it can access application components such as frameworks, libraries, and data within the back-end dependencies [5].

Problem Statement
There are numerous advantages to checking web applications for security vulnerabilities, but doing this manually requires considerable skill and time. Various pen-testing tools are available that can automate the process, but they too have their limitations. For example, some web application pen-testing tools are known to report false positives. A false positive can be equated to a false alarm, like a house alarm being triggered when there is no burglar. In web application security, a false positive is when a security scanner indicates the presence of a vulnerability on a website, such as SQL injection, when it is not there in reality. Pen-testers are experts in web security and use automated web app security scanners to simplify the pen-testing process and ensure that all the attack surfaces of the web application are tested rapidly and comprehensively. However, automated tools can still cause problems. False positives make web application pen-testing consume a considerable amount of time [6], because pen-testers must go through all the reported security vulnerabilities and try to exploit them through manual verification. This lengthy process makes web application security unaffordable for many businesses, though the problems caused by false positives go beyond cost. Naturally, as human beings, we tend to ignore false alarms, and this also applies to pen-testing. For example, if a web app security scanner detects 100 cross-site scripting vulnerabilities and the first 25 variants are false positives, the pen-tester may assume that all the remaining ones are false positives and ignore them. This increases the chance of real security vulnerabilities going undetected. Beyond false positives, web app scanners also have limitations in terms of performance, scan speed, accuracy, and cost.

Related Work
Many developers of security-critical Web services face the problem of choosing the best vulnerability detection tools. Both practice and research indicate that state-of-the-art tools are not very effective in terms of false-positive rates and vulnerability coverage. The main issue is that these tools are limited in the detection techniques they adopt and are designed for very specific application scenarios. Therefore, using the wrong tool for vulnerability detection may result in the deployment of services with undetected vulnerabilities. The authors of [7] proposed a benchmarking approach for assessing and comparing the effectiveness of vulnerability detection tools in Web services. The outcomes indicate that the proposed benchmarks accurately depict the effectiveness of vulnerability detection tools and support their application in the field. The increased use of web vulnerability scanners and the differences in their effectiveness necessitate benchmarking of the scanners. Furthermore, the existing literature does not present a comparison of results on the effectiveness of scanners from various benchmarks. The authors of [8] compared the performance of certain open-source vulnerability scanners by running them against the OWASP Benchmark. They proceeded to compare the results from the OWASP Benchmark with existing outcomes from the Web Application Vulnerability Scanner Evaluation Project (WAVSEP) benchmark. The results of the study's evaluation of the web vulnerability scanners confirmed that scanners perform differently depending on the vulnerability category; thus, no single scanner can perform all the tasks of scanning web vulnerabilities [8].
Pen-testing enables the tester to examine the system's functional aspects, that is, the extent to which a system is vulnerable to network security and intrusion attacks, and to view the system's defense mechanisms for mitigating these attacks. The authors of [9] conducted a comprehensive literature review of pen-testing and its applications. The study reviewed the work conducted in the field of pen-testing, and the authors reviewed the various aspects related to pen-testing. In addition, the work reviews the various pen-testing strategies and tools in terms of their technical specifications, platform compatibility, release date, and utility. It reviews the significance of pen-testing in detecting system vulnerabilities and in protecting a system from network attacks. The paper concludes that penetration testing is a proven and efficient technique for detecting system security flaws. Cybersecurity has become crucial today due to the rise in cybercrime, and every firm is striving to avoid cybercrimes such as hacking and data breaches. The authors of [10] studied pen-testing processes and tools, focusing on a comparison of four different port scanning tools to show their effectiveness. The project tested the different scanning tools in the Kali Linux environment in terms of the number of ports discovered and the time each tool took to discover them. Of the various scanning tools tested, Sparta emerged as the best in terms of efficiency and ease of use. Its recommendation is backed by its availability in Kali Linux and the fact that it is a free tool, making it ideal for small businesses with fewer than 10 employees. Apart from the analysis, the project also included a study of various processes, types, and models of pen-testing, presenting a detailed discussion of seven different types of penetration and two models of pen-testing.
With the wide range of pen-testing tools on the market today, practitioners often find it difficult to make properly informed decisions when searching for suitable tools. A study by [11] provides an overview of pen-testing and a list of criteria for selecting suitable pen-testing tools for a given purpose. The paper briefly describes the selected tools and then provides a comparison of them. As society continues to depend on technology, hacking remains an underlying security threat to computer systems. The authors of [12] analyzed the tools, techniques, and mathematics involved in pen-testing. The study introduced the idea of pen-testing and investigated the security and vulnerability of the server of Appalachian State University's Computer Science Department. The work began by obtaining permission from the appropriate system administrators, including a discussion of the scope of the pen test, before launching an attack on any of the systems. The project then obtained background information on the Department of Computer Science, followed by the formulation of a targeted attack focusing on a flaw in the Linux kernel called Dirty COW (CVE-2016-5195). Eventually, root access was gained through Dirty COW, which enabled the fetching of both the /etc/passwd and /etc/shadow files. A total of 61.01% of all the passwords stored in the shadow file were cracked using oclHashcat. The CVE-2016-5195 awareness campaign will enable the hardening of the Appalachian State University Computer Science Department's server (student.cs.appstate.edu), preventing future exploitation of the vulnerability by malicious attackers. The work also contributed to awareness of security vulnerabilities and their ongoing importance in the 21st century. There are two types of web vulnerability scanners (WVSs): open-source and commercial. However, the two vary in terms of their vulnerability detection performance and capability.
The authors of [13] conducted a comparative study to determine the capabilities of eight WVSs (IronWASP, Skipfish, HP WebInspect, Acunetix, OWASP ZAP, IBM AppScan, Arachni, and Vega) to detect vulnerabilities. The study examined two web apps: WebGoat and Damn Vulnerable Web Application. Multiple evaluation metrics were used to evaluate the performance of the eight WVSs, including the web application security scanner evaluation criteria, the OWASP Benchmark, the Youden Index, recall, and precision. According to the experimental results, besides the commercial scanners, some open-source vulnerability scanners such as Skipfish and ZAP are also effective. The study recommended improving the vulnerability detection capability of both commercial and open-source scanners to increase the detection rate and code coverage and minimize the number of false positives. While there are various open-source web application security scanners with similar functionalities, it is always important to choose the best one. In [14], the authors carried out a comparison and assessment of different open-source web application security scanners, with a specific focus on the OWASP Top 10 (2013) Application Security Risks. One of the significant findings of this study was that Skipfish 2.07, Arachni v0.40.0.3, and W3AF 1.2 emerged as the best among the sampled security scanners. The authors demonstrated the differences between open-source scanners that concentrate on session management, injection, cross-site scripting, and broken authentication. The growing requirement to perform pen-testing of network and web applications has increased the need for benchmarking and standardizing the techniques that penetration testers use. The author of [15] examined modern web pen-testing tools and compared them against an OWASP vulnerability list. The paper also addresses the lack of literature on scanner evaluation frameworks with a 360-degree view.
Their research indicates that scanners with configured crawling and web proxies show better performance compared to point-and-shoot scanners. The author also observed that scanners with an active maintenance cycle showed better performance. Thus, the study concluded that, to obtain reliable results, penetration testers should use multiple automated scanning tools to detect multiple vulnerabilities. There are differences in the design of the algorithms and techniques used by dynamic, interactive, and static security testing tools. Thus, each tool varies in the extent to which it detects the vulnerabilities it is designed for, and because of their different designs, their percentages of false positives also differ. To take advantage of the potential synergies that the various types of analysis tools may have, the authors in [16] combined various static application security testing (SAST), dynamic application security testing (DAST), and interactive application security testing (IAST) tools. The study was aimed at improving the effectiveness of security vulnerability detection while minimizing the number of false positives. Specifically, the authors combined two interactive security analysis tools and two dynamic security analysis tools to study their behavior using specific OWASP Top 10 security vulnerability benchmarks. The study recommended using a combination of DAST, SAST, and IAST tools in both the development and testing phases of Web application development.

Research Methodology
In this section, we discuss the research methodology. Figure 2 shows the steps undertaken in our research methodology for comparing and evaluating the selected web application pen-testing tools.

Selection of Top 6 Tools
We started our research with a collection of tools based on the most frequently repeated tools in the latest published academic comparison papers. Then, we surveyed experts in the cybersecurity industry to choose the top six among them. The objective was to select and evaluate the top six tools preferred by experienced penetration testers in the cybersecurity industry. We also made certain that we had the most recent version of each tool available as of the project deadline, as shown in Table 1.

Design of a Framework for Evaluation Criteria
This subsection describes a comprehensive comparative benchmark framework to evaluate the selected top six pen-testing tools. We specified the selected metrics to evaluate the tools in all aspects. After considering existing web application scanner evaluation frameworks such as [10,11,17–19], we propose a new framework that is similar to their methods but covers more ground with benchmarking metrics and criteria as an enhancement for practitioners in the web pen-testing field. We examined all of the criteria in [13,20–22] to build one framework that includes all of the following: test coverage criteria, attack coverage criteria, vulnerability detection criteria, and efficiency criteria.
Based on our evaluation of different web application scanner evaluation frameworks such as the OWASP Benchmark Project, WAVSEP, and the Web Input Vector Extractor Teaser (WIVET), we found that most of them focus on specific areas of the automated scanners, with limited metrics to validate scanner performance. Therefore, we propose a framework that covers more ground than existing frameworks, adding further parameters to consider while evaluating web application scanners. In addition, we use the scoring system previously applied in [10] to comparatively analyze each tool. Each key parameter is scored as follows:
• Scanner Scoring System: The selected criteria are kept in mind while benchmarking the top six web application pen-testing tools. We use the scoring system proposed in [10] to evaluate the tools; each key metric has a point system of up to 5 points.
• Criteria and Metric Selection: The benchmarking metrics and criteria used for tool evaluation are presented as follows.
• Tool Type: There are two interface types: the graphic user interface (GUI) and the command-line interface (CLI). Most pen-testers prefer GUIs over CLIs when pen-testing web applications. Score for tool type: -1: only GUI or only CLI; -2: both GUI and CLI.
• Crawling Types: There are two types of crawling: passive crawl and active crawl. The active crawl is the first step before active scanning, cataloging the found links, whereas the passive crawl is best for coverage. Score for crawling ability: -1: only passive crawler or only active crawler; -2: active crawler and passive crawler.
• Number of URLs Covered: Web application crawling is part of the information gathering stage in the pen-testing process [10]. In this stage, a penetration tester gathers as much information as possible about the web application. Crawler coverage can be signified by the number of URLs crawled by the scanner; the more URLs the scanner covers, the higher the score. Score for covered URLs: -1: less than 25% coverage; -2: 25% to 50% coverage; -3: 50% to 70% coverage; -4: 70% to 90% coverage; -5: more than 90% coverage.
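The banded percentage-to-score mapping above (also reused later for OWASP Top 10 coverage) can be sketched as a small lookup function. This is only an illustrative sketch: the function name is our own, and the handling of exact band boundaries is an assumption, since the bands in the text overlap at 25%, 50%, 70%, and 90%.

```python
def coverage_score(covered: int, total: int) -> int:
    """Map URL (or vulnerability) coverage to the 1-5 band score described above."""
    pct = 100.0 * covered / total
    if pct > 90:
        return 5
    if pct > 70:
        return 4
    if pct > 50:
        return 3
    if pct >= 25:
        return 2
    return 1

# Example using the crawl counts reported in the results section:
# Qualys WAS crawled 4979 of the benchmark's 5500 URLs (~90.5%).
print(coverage_score(4979, 5500))  # -> 5
print(coverage_score(3231, 5500))  # -> 3 (Burp Suite Professional, ~58%)
```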
• Scanning Time: Automated tools help penetration testers cover a large web application in the least possible time. Therefore, the time taken is important for scanner evaluation. Score for scanning time: -1: more than 6 h; -2: more than 3 h; -3: more than 2 h; -4: more than 45 min; -5: less than 30 min.
• Types of Scan: There are two types of scans in web application pen-testing: passive and active. In this metric, the scanner with both active and passive options receives the highest score. Score for scan type: -1: only active scan or only passive scan; -2: active and passive scan; -3: active, passive, or policy scan.
• Reporting Features: Reports can be formatted according to the compliance policy the penetration tester needs to analyze, which is a recent feature in scanners. Some of these standards are the OWASP Top 10, HIPAA, and so on. There are several common formats for reporting, such as HTML, PDF, and XML. Compliance policy reports are cleaner and easier for the penetration tester to analyze. Score for reporting features: -0: HTML, PDF, and XML reports; -1: compliance standards reports such as OWASP Top 10 and HIPAA.
• Added Features: Some automated tools have add-ons and extension features that improve scanner performance in vulnerability detection. Most penetration testers take advantage of these features. Score for add-ons and extension features: -0: no add-on and extension features; -1: with add-on and extension features.
• Configuration Effortlessness: A previous article [10] defined three levels of configuration: difficult, hard, and easy. Score for configuration level: -1: difficult: requirements such as server and database configuration are needed to launch the scanner; -2: hard: some dependencies are needed before installation; -3: easy: a (plug-and-play) out-of-the-box, ready-to-use application.
• Scan Logging Option: Logs are essential in pen-testing to monitor and inspect thousands of requests and responses, and logging these processes is important for retrieving them when needed. Some automated tools provide options to store logs in formats such as txt, csv, html, or xml [10]. Score for scan logs: -0: no scan log option; -1: scan log option.
• Tool Cost: The cost of the tool is an important factor in choosing the right tool. More features at a lower cost is an essential consideration for penetration testers and organizations. In addition, some frameworks perform better depending on their brand and continued development relative to the offered cost.
• OWASP Top 10 Vulnerabilities Coverage: The OWASP Top 10 vulnerabilities are essential for evaluation: many organizations and penetration testers use pen-testing tools to cover the Top 10 vulnerabilities in their web applications and protect their assets from these known vulnerabilities. Developers and software testers also try to avoid these Top 10 vulnerabilities. This metric evaluates the proportion of vulnerabilities covered out of the total existing vulnerabilities in the OWASP benchmark. Score for vulnerabilities coverage: -1: less than 25% coverage; -2: 25% to 50% coverage; -3: 50% to 70% coverage; -4: 70% to 90% coverage; -5: more than 90% coverage.
• Pause and Resume Scans: The ability to pause and resume a scan from the same point is a strength factor for the scanner, and it helps the pen-tester reduce the time needed to rescan the web application. Score for pause and resume ability: -0: no ability to pause and resume scans; -1: the ability to only pause or only resume scans; -2: the ability to pause and resume scans.
• Number of Test Cases Generated: This evaluates the number of test cases produced by a web application security scanner in a scanning session [13]. Score for the number of test cases generated:
• Automation Level: In this metric, we evaluate the scanner's proficiency in automating the scan without the penetration tester's manual involvement. Score for automation level: -1: 100% tester involvement needed; -2: 80% tester involvement needed; -3: 70% tester involvement needed; -4: 50% tester involvement needed; -5: less than 30% tester involvement needed.

• Number of False Positives: A false positive is an unreal indicator of a vulnerability in the OWASP benchmark reported by the scanner. A lower false positive percentage is helpful for penetration testers.

• Number of True Positives: A true positive means that a real vulnerability in the OWASP benchmark is detected correctly by the scanner. It is the most important metric in the vulnerability detection criteria. True positive rate (%) = (number of detected true positives / total number of real vulnerabilities in the benchmark) × 100. The scores for this and the remaining metrics are summarized in Table 2.
We have enhanced the framework above with additional metrics covering all evaluation criteria aspects for web application pen-testing scanners, as summarized in Table 2. Our framework defines a total score of up to 53 points, as listed in Table 2. Using it, we can benchmark any web pen-testing tool and choose the better one depending on the needed metrics.
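Given the definitions of true and false positives above, the true-positive rate (used later with the 233/591 figure for Qualys WAS) and the Youden index (used in the results section) can be computed in a few lines. The function names and the confusion-matrix counts in the example are our own illustrative choices, not our benchmark results.

```python
def tp_rate(detected_tp: int, total_real: int) -> float:
    """Percentage of the benchmark's real vulnerabilities detected by the scanner."""
    return 100.0 * detected_tp / total_real

def youden_index(tp: int, fn: int, fp: int, tn: int) -> float:
    """Youden's J = sensitivity + specificity - 1 (equivalently TPR - FPR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

# Example: 233 of 591 real benchmark vulnerabilities detected.
print(round(tp_rate(233, 591), 1))  # -> 39.4
```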

Experimental Setup
In this section, we describe our implementation approach. We took two main test implementation steps. First, we set up the environment; then, we configured the scan and started the benchmarking, using the OWASP Benchmark test tool to evaluate each scanner's crawling and vulnerability detection coverage. After benchmarking, we analyzed the results and compared them using our proposed evaluation method.

Environment Setup
Our tools' installation and evaluation environment are detailed in Table 3. The benchmarking then proceeds as follows:
• Take the score results and start our manual benchmarking using our proposed framework.
• Compare the tools after the overall benchmarking.
The summary of our implementation process is shown in Figure 3. We used the Open Web Application Security Project (OWASP) Benchmark, a vulnerable web application, as a test of how well our top six pen-testing tools could find vulnerabilities in web applications. The OWASP Benchmark is an open-source project that covers all of the OWASP Top 10 vulnerabilities, is commonly used for benchmarking web application vulnerability scanners, and has an accurate scoring process.
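To illustrate how a scanner's report can be scored against such a benchmark, the sketch below compares a set of reported findings to a ground-truth table and tallies the confusion-matrix counts. The test-case identifiers and the in-memory dictionaries are hypothetical simplifications, not the OWASP Benchmark's actual file format.

```python
# Hypothetical ground truth: test case id -> True if the flaw is real.
expected = {"Test00001": True, "Test00002": False, "Test00003": True}
# Hypothetical set of test cases flagged by the scanner under evaluation.
reported = {"Test00001", "Test00002"}

tp = sum(1 for t, real in expected.items() if real and t in reported)
fp = sum(1 for t, real in expected.items() if not real and t in reported)
fn = sum(1 for t, real in expected.items() if real and t not in reported)
tn = sum(1 for t, real in expected.items() if not real and t not in reported)

print(tp, fp, fn, tn)  # -> 1 1 1 0
```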

Results
In this section, we divide the best six tools into two categories. In the first category, the commercial tools, including Qualys WAS, Fortify WebInspect, and Burp Suite Professional, are compared and contrasted. In the second category, we examine the disparities between the free, open-source programs OWASP ZAP, Arachni, and Wapiti3. Our benchmarking results serve as the basis for the comparisons.
Case One: Commercial Tools
Crawling Types: Burp Suite Professional receives a higher score (2 points) than Qualys WAS and Fortify WebInspect because it can apply both active and passive crawling, whereas Qualys WAS and Fortify WebInspect scored 1 point each since they can only crawl actively.
Number of URLs Covered: Qualys WAS scored 5 points, while Burp Suite Professional and Fortify WebInspect scored 3 points each, covering 50% to 70% of the benchmark URLs. Qualys WAS crawled 4979 URLs, which is 90.5% of the 5500 URLs; Burp Suite Professional crawled 3231 URLs (58%), and Fortify WebInspect crawled 3598 URLs (65%). Thus, Qualys WAS received the highest score in this metric.
Scanning Time: Scanning speed is important for the pen-tester, especially for vulnerability detection. Fortify WebInspect received the highest score (5 points); its scan took 15 min. In comparison, Qualys WAS and Burp Suite Professional both received a score of 1: the Qualys WAS scan took 24 h and did not cover the whole site, while Burp Suite Professional took over 12 h.
Type of Scan: Fortify WebInspect and Burp Suite Professional use active and passive scan modes. Fortify WebInspect also has a scan-by-policy mode, which is a new and helpful feature: it lets you choose a known policy or create your own. Thus, Fortify WebInspect and Burp Suite Professional received 3 points each. Conversely, Qualys WAS uses only active scan modes, such as discovery scan and vulnerability scan (1 point).
Reporting Features: Fortify WebInspect and Qualys received 1 point each. Besides their ability to generate standard reports in HTML and PDF, they can also generate a report depending on the needed compliance with OWASP Top 10, ISO, or a custom template. On the other hand, Burp Suite Professional only generates standard reports in HTML or PDF.
Added Features: Qualys WAS scored 0 because it does not support any added features. In contrast, Fortify WebInspect and Burp Suite Professional support added features (1 point each). Fortify WebInspect has simulated attack tools for SQL injection, an HTTP editor, a server analyzer, a web proxy, a traffic viewer, and an SWF scan, all available during the scan, manually and automatically. With Burp Suite Professional, a pen-tester can configure the scanner specifications as needed and can also download add-ons from the updated marketplace.
Configuration Effortlessness: Qualys WAS and Burp Suite Professional received 3 points because they are configured easily (plug-and-play), out of the box and ready to use after installation. Qualys is a cloud-based platform, whereas Burp Suite Professional is an easily installed application. In contrast, Fortify WebInspect requires dependencies, such as SQL Server, before completing installation; for this difficulty, its score is 1 point.
Scan Logging Option: Burp Suite Professional, Fortify WebInspect, and Qualys WAS all received the same score: 1 point. Burp Suite Professional logs all requests and responses during the scan, whereas Qualys WAS logs all scans with a date filtering option. Fortify WebInspect logs all scans with their results and provides the ability to generate reports from them.
Tool Cost: Each tool has its own installation features and cost, depending on what the consultant pen-tester or organization requires. Qualys costs USD 30,000 per year to cover WAS and VM Security, which are additional services. Burp Suite Professional costs USD 399 per year for personal or consulting pen-tester use, while Fortify WebInspect costs USD 24,000 per year.
OWASP Top 10 Vulnerabilities Coverage: Burp Suite Professional covered 5% command injection, 8% cross-site scripting, 3% insecure cookie, 4% LDAP injection, 3% path traversal, 8% SQL injection, and 7% XPath injection. The scanner covers 70% of the OWASP Top 10. Accordingly, Burp Suite Professional scored 5 points, higher than the rest of the tools. Qualys WAS covered the following categories in one scan: 24% command injection, 38% cross-site scripting, 53% insecure cookies, and 32% SQL injection; it covered 40% of the OWASP Top 10 vulnerabilities and scored 2 points. Unexpectedly, Fortify WebInspect detected only one SSL cipher issue, with the rest of its findings being best-practice recommendations; its OWASP Top 10 coverage was 14% or less. After comparison, Burp Suite Professional is the best in OWASP Top 10 vulnerabilities coverage (see Table 4).
Number of True Positives: Qualys WAS detected 233 true positives, which is 233/591 × 100 = 39.4%. Qualys WAS detected the highest number of true positive vulnerabilities.
Youden Index: Qualys WAS and Burp Suite Professional each received 2 points for the Youden Index. The Youden Index for Qualys WAS is 0.14%, which means that the tool outputs the expected result from the web application (FP, TP). Likewise, Burp Suite Professional (0.0%) also outputs the expected result. Fortify WebInspect, however, did not detect any FP, so no Youden Index could be calculated. To sum up the comprehensive comparison between Fortify WebInspect, Burp Suite Professional, and Qualys WAS, the scores in Figure 4 illustrate the strongest features of each tool.
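The text does not restate the Youden Index formula, so for reference: the textbook definition is J = sensitivity + specificity − 1 (equivalently, true-positive rate minus false-positive rate). The sketch below is our own illustration under that standard definition; the confusion-matrix counts in it are hypothetical, not figures from the benchmark runs described above.

```python
# Textbook Youden's Index: J = sensitivity + specificity - 1.
# Illustrative helper only; the counts passed in below are hypothetical.
def youden_index(tp: int, fn: int, fp: int, tn: int) -> float:
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return sensitivity + specificity - 1

# A perfect scanner (all TPs found, no FPs) yields J = 1;
# a scanner no better than chance yields J = 0.
print(youden_index(tp=591, fn=0, fp=0, tn=100))  # 1.0
print(round(youden_index(tp=233, fn=358, fp=10, tn=990), 3))
```

A near-zero index therefore indicates that a scanner's output adds little beyond what was already expected of the benchmark application.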
Qualys WAS is a fully automated tool and solid in web application crawling coverage. Burp Suite Professional is the best in OWASP Top 10 vulnerability coverage and the highest in test case generation. It can automate scans fully or partially, and it can scan actively or passively with a customized scan template. It also reported fewer false positive vulnerabilities while maintaining a high true positive detection capability. Fortify WebInspect is the best in scanning speed, scan customization, report features, and adding features/add-ons to the scan. In addition, it can impressively intercept HTTP requests and simulate attacks (see Figure 5).

Case Two: Non-Commercial Tools
In the non-commercial category, we compared Arachni, Wapiti3, and OWASP ZAP using our framework to determine each tool's strengths and weaknesses across the metrics.
Tool Type: Arachni, OWASP ZAP, and Wapiti3 scored the same (1 point) because each supports only a single interface: Arachni and OWASP ZAP use only a graphical user interface (GUI), whereas Wapiti3 uses only a command line interface (CLI).

Pen-testing Level: Most open-source tools, especially open-source vulnerability detection tools such as those in the Kali toolkit, use black box pen-testing. Arachni, OWASP ZAP, and Wapiti3 all use only the black box method, so all three received the same score (1 point).
Crawling Types: OWASP ZAP received 2 points, higher than Arachni and Wapiti3, for its ability to use both active and passive crawlers in the same scan. Conversely, Arachni and Wapiti3 scored 1 point each because they can only crawl actively.
Number of URLs Covered: OWASP ZAP scored higher than Arachni and Wapiti3 in coverage of the benchmark's URLs, receiving 5 points for covering more than 90% of the OWASP Benchmark. The OWASP Benchmark has nearly 5500 URLs, and OWASP ZAP discovered 30,369 URLs, meaning it covered all the benchmark URLs. Arachni did not crawl, although it should have according to the tool's site, so it scored 0. In contrast, Wapiti3 scored 1 point for crawling 621 URLs, which is 11% of the roughly 5500 total URLs.
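The coverage percentages above are simple ratios against the benchmark's roughly 5500 URLs. A minimal sketch (a hypothetical helper of our own; capping at 100% for tools that discover more URLs than the benchmark contains is our assumption):

```python
# Illustrative calculation of benchmark URL coverage (hypothetical helper,
# not part of any tool): crawled URLs as a share of the benchmark's ~5500 URLs.
def url_coverage(crawled: int, total: int = 5500) -> float:
    """Return coverage as a percentage of benchmark URLs, capped at 100%."""
    return min(crawled, total) / total * 100

# Figures reported in the comparison above:
print(round(url_coverage(621), 1))  # Wapiti3: ~11.3
print(url_coverage(30369))          # OWASP ZAP: capped at 100.0
```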
Scanning Time: All three tools (OWASP ZAP, Wapiti3, and Arachni) scored the same (1 point), since each took over 6 h to finish one scan. Wapiti3's scan took over 21 h, and Arachni's took over 48 h, whereas OWASP ZAP finished in just over 7 h, making it the fastest of the three.
Type of Scan: OWASP ZAP again scored higher than Arachni and Wapiti3, receiving the maximum 3 points for its ability to use both active and passive scan modes. Wapiti3 and Arachni can only scan in active mode, so they scored the same (1 point).
Reporting Features: OWASP ZAP, Arachni, and Wapiti3 all scored 0 in this metric; they render only their expected standard reports and have no additional reporting features.
Added Features: OWASP ZAP again scored higher than Arachni and Wapiti3 for its ability to add extensions and add-ons for a stronger scanner and better vulnerability detection. It also has an updated marketplace for installing add-ons. Arachni and Wapiti3 scored 0 due to their inability to add extensions and add-ons to their scanners.
Configuration Effortlessness: As expected, OWASP ZAP scored higher than Arachni and Wapiti3 in effortlessness of configuration. Arachni needs a Java JRE and a PostgreSQL server to complete installation, while Wapiti3 requires dependencies such as Python 3.x, httpx, BeautifulSoup, yaswfp, tld, Mako, and httpx-socks to complete installation.
Scanning Logging Option: OWASP ZAP and Arachni received 1 point, higher than Wapiti3, for their ability to log scans. OWASP ZAP can log all requests and responses during the scan, whereas Arachni logs all scans but without the vulnerability detection results. Conversely, Wapiti3 scored 0 since it has no logging option.
Tool Cost: All three tools (OWASP ZAP, Wapiti3, and Arachni) are free. However, OWASP ZAP is updated more rapidly than the others, which is a great feature for keeping up with new vulnerability detection.
OWASP Top 10 Vulnerabilities Coverage: Unexpectedly, Wapiti3 scored 2 points, higher than OWASP ZAP and Arachni, for coverage of the Top 10 vulnerabilities in one scan. In one scan, Wapiti3 covered the following categories: 9% Command Injection, 9% Cross-Site Scripting, 1% Path Traversal, and 3% SQL Injection, covering 40% of the OWASP Top 10 categories. On the other hand, OWASP ZAP covered only 1% of the Cross-Site Scripting category, which was not a predictable result. As noted on the OWASP ZAP official site, a pen-tester needs to install add-ons to improve the ZAP scanner's detection of the OWASP Top 10 vulnerabilities. Arachni scored 0 because it did not detect any vulnerabilities; this issue was present in version 0.5.12 [10], and the updated version v1.5.1 did not patch it. Table 5 compares the coverage percentages of the top ten OWASP vulnerabilities.

Pause and Resume Scans: OWASP ZAP and Arachni scored higher than Wapiti3 in this metric. Both offer pause and resume functionality, whereas Wapiti3 does not. OWASP ZAP and Arachni scored 2 points each, whereas Wapiti3 received 0.
Number of test cases generated: OWASP ZAP received the maximum points (5) in test case generation, then Wapiti3 (3), followed by Arachni (1). OWASP ZAP generated 8799 test cases in one scan, whereas Wapiti3 generated 621 test cases in one scan and Arachni generated nearly 50 test cases in one scan.
Automation level: The three tools (OWASP ZAP, Wapiti3 and Arachni) scored 5 points since they needed less than 30% pen-tester involvement.
Number of False Positives: According to the benchmark results, OWASP ZAP, Wapiti3, and Arachni did not report any false positive vulnerabilities.
Number of True Positives: OWASP ZAP scored 1 point since the scan results covered only 1% of the Cross-Site Scripting category, while the result of benchmarking the rest was 0%; there were no FP vulnerabilities, and the TP number was 3. Similarly, Wapiti3 scored 1 point because its TP number is 42, which is less than 10%. In contrast, Arachni did not detect any vulnerabilities.
Youden Index: Wapiti3 received a higher score (2 points) than OWASP ZAP and Arachni. The Youden Index for Wapiti3 is 0.03%, which means that the tool outputs the expected result from the web application (FP, TP). On the other hand, OWASP ZAP and Arachni did not detect any FP, so no Youden Index calculation was possible. Finally, each tool has its own strengths and weaknesses. Among the strengths: OWASP ZAP is the top tool in crawling coverage, crawls and scans both actively and passively, and generated the most test cases. It is a fully automated tool requiring zero configuration effort (see Figure 6).

Discussion
The effectiveness of web vulnerability scanners should be evaluated using a set of "benchmark" web applications covering all OWASP Top 10 vulnerability types. We proposed a benchmarking approach that extends the framework with new metrics and applies the benchmarking methodology. Consequently, new standards and benchmark web applications were developed, covering the majority of web application domains. This ensures that the results of web vulnerability scanners are comparable and complete. Due to the lack of standardization in most of the literature, it is challenging to measure and compare our results against previous studies. Web vulnerability scanners should also be evaluated on their usability and performance. This research found only a small number of surveys and overviews of black box web vulnerability scanners, each with limited metrics. Most of these surveys focus on summarizing the general ideas of the approaches without examining their effectiveness and characteristics [8,23–25]. In contrast, the current study includes a systematic review of the literature on the most popular web vulnerability scanners, extends the framework with new metrics, applies the benchmarking approach, summarizes the scanners' features, and discusses their performance in finding common web application vulnerabilities.

Conclusions
We presented an empirical comparison of the top six web application pen-testing tools (OWASP ZAP, Burp Suite Professional, Qualys WAS, Arachni, Wapiti3, and Fortify WebInspect) using our proposed benchmark framework. We split the six tools into two common use cases: commercial tools and non-commercial tools. We aimed to make our proposed framework comprehensive, with all the features required to suit pen-tester needs. Generally, penetration testers should take advantage of the strengths of each tool separately, according to their needs, since each tool has strengths and weaknesses. For instance, Burp Suite Professional and Qualys WAS are the best in vulnerability detection, despite taking longer to complete the task. On the other hand, the automated tool Fortify WebInspect did not detect any vulnerabilities in a 15 min scan, but its manual attack simulation feature is helpful for manually assessing known vulnerabilities. In addition, OWASP ZAP and Burp Suite Professional have powerful crawling capabilities. Future work will include extending the framework with more new metrics and applying the benchmarking approach to other new tools. The OWASP Top 10 Benchmark Project can be extended to other benchmarks and real-life vulnerable environments, as long as they can provide deep results that help in choosing the best tool for the required task.

Conflicts of Interest:
The authors declare that they have no conflict of interest.
Appendix A

Table A1. Results of benchmarking OWASP ZAP using our proposed framework in detail (columns: Metric, Score, Score Details, Score Reason). The scan results covered only 1% of SSL cipher, while the result of benchmarking the rest was 0%; the TP number is 261, which is 261/2741 × 100 = 9.5%, less than 10%. The number of expected vulnerabilities from the OWASP benchmark project is 2741. Youden's Index: 0, because the benchmark did not detect any FP, so no Youden's Index could be calculated. Total score: 32.

Table A5. Results of benchmarking Arachni using our framework in detail. The Youden Index for Wapiti3 is 0.03%, which means that the tool outputs the expected result from the web application (FP, TP). Total score: 20.