The Scanning Legion:
Web Application Scanners Accuracy Assessment & Feature Comparison
Commercial & Open Source Scanners
A Comparison of 60 Commercial & Open Source Black Box Web Application Vulnerability Scanners
By Shay Chen
Security Consultant, Researcher & Instructor
sectooladdict-$at$-gmail-$dot$-com
August 2011
Disclaimer
The results of this research are only valid for estimating the detection accuracy of SQLi & RXSS exposures, and for counting and comparing the various features of the tested tools.
The author did not evaluate every possible feature of each product, only the categories tested within the research, and thus, does not claim to be able to estimate the ROI from each individual product.
Furthermore, several vendors invested resources in improving their tools according to the recommendations of the WAVSEP platform, which has been publicly available since December 2010. Some of them did so without any relation to the benchmark (and before they were aware of it), and some in preparation for it. Since the special structure of the WAVSEP testing platform requires the vendor to cover more vulnerable test scenarios, that action actually improves the detection ratio of the tool in any application (for the exposures covered by WAVSEP).
It is, however, important to mention that a few vendors were not notified of this benchmark and were not aware of the existence of the WAVSEP platform, and thus could not have enhanced their tools in preparation for this benchmark (HP WebInspect, Tenable Nessus, and Janus Security WebCruiser), while other vendors that were tested in the initial research phases released updated versions that were not tested (Portswigger Burp Suite and Cenzic Hailstorm).
That being said, the benchmark does represent the accuracy level of each tool on the date it was tested (the results of the vast majority of the tools are valid for the date this research was released), but future benchmarks will use a different research model in order to ensure that the competition is fair for all vendors.
Table of Contents
1. Prologue
2. List of Tested Web Application Scanners
3. Benchmark Overview & Assessment Criteria
4. Test I – The More The Merrier – Counting Audit Features
5. Test II – To the Victor Go the Spoils – SQL Injection
6. Test III – I Fight (For) the Users – Reflected XSS
7. Test IV – Knowledge is Power – Feature Comparison
8. What Changed?
9. Initial Conclusions – Open Source vs. Commercial
10. Moral Issues in Commercial Product Benchmarks
11. Verifying the Benchmark Results
12. Notifications and Clarifications
13. List of Tested Scanners
14. Source, License and Technical Details of Tested Scanners
15. Comparison of Active Vulnerability Detection Features
16. Comparison of Complementary Scanning Features
17. Comparison of Usability and Coverage Features
18. Comparison of Connection and Authentication Features
19. Comparison of Advanced Features
20. Detailed Results: Reflected XSS Detection Accuracy
21. Detailed Results: SQL Injection Detection Accuracy
22. Drilldown – Error Based SQL Injection Detection
23. Drilldown – Blind & Time Based SQL Injection Detection
24. Technical Benchmark Conclusions – Vendors & Users
25. So What Now?
26. Recommended Reading List: Scanner Benchmarks
27. Thank-You Note
28. Frequently Asked Questions
29. Appendix A – Assessing Web Application Scanners
30. Appendix B – A List of Tools Not Included in the Test
31. Appendix C – WAVSEP Scan Logs
32. Appendix D – Scanners with Abnormal Behavior
I've always been curious about it… from the first moment I executed a commercial scanner, almost seven years ago, to the day I started performing this research. Although manual penetration testing has always been the main focus of the assessment, most of us use automated tools to easily detect "low hanging fruit" exposures, to increase coverage when testing large-scale applications in limited timeframes, and even to double check locations that were manually tested. The questions always pop up, in every penetration test in which these tools are used…
"Is it any good?", "Is it better than…" and "Can I rely on it to…" are questions that every pen-tester asks himself whenever he hits the scan button.
Well, curiosity is a strange beast… it can drive you to wander and search, and consume all your time in the pursuit of obscure answers.
So recently, because of that curiosity, I decided to find out for myself, and to invest whatever resources were necessary to solve this mystery once and for all.
Although I can hardly state that all my questions were answered, I can definitely sate your curiosity for the moment, by sharing insights, interesting facts, useful information and even some surprises, all derived from my latest research which is focused on the subject of commercial & open source web application scanners.
This research covers the latest versions of 12 commercial web application scanners and 48 free & open source web application scanners, while comparing the following aspects of these tools:
· Number & Type of Vulnerability Detection Features
· SQL Injection Detection Accuracy
· Reflected Cross Site Scripting Detection Accuracy
· General & Special Scanning Features
Although my previous research included similar information, I regretted one thing after it was published: I did not present the information in a format that was useful to the common reader. In fact, as I found out later, many readers skipped the actual content, and focused on sections of the article that were actually a side effect of the main research.
As a result, the following article will focus on presenting the information in a simple, comprehensible graphical format, while still providing the detailed research information to those interested… and there's a lot of new information to be shared – knowledge that can aid pen-testers in choosing the right tools, managers in budget-related decisions, and visionaries in properly reading the map.
But before you read the statistics and insights presented in this report, and reach a conclusion as to which tool is the "best", it is crucial that you read Appendix A - Section 29, which explains the complexity of assessing the overall quality of web application scanners… As you're about to find out, this question cannot be answered so easily… at least not yet.
…
So without any further delay, let's focus on the information you seek, and discuss the insights and conclusions later.
The following commercial scanners were included in the benchmark:
· IBM Rational AppScan v8.0.03 - iFix Version (IBM)
· WebInspect v9.10.78.0, SecureBase 4.05.99 (HP)
· Hailstorm Professional v6.5-5267(Cenzic)
· Acunetix WVS v7.0-20110608 (Acunetix)
· NTOSpider v5.4.098 (NT Objectives)
· Netsparker v2.0.0.0 (Mavituna Security)
· Burp Suite v1.3.09 (Portswigger)
· Sandcat v4.2.4.0 (Syhunt)
· ParosPro v1.9.12 (Milescan)
· JSky v3.5.1-905 (NoSec)
· WebCruiser v2.5.0 EE (Janus Security)
· Nessus v4.41-15078 (Tenable Network Security) – Only the Web Application Scanning Features
The following new free & open source scanners were included in the benchmark:
VEGA 1.0 beta (Subgraph), Safe3WVS v9.2 FE (Safe3 Network Center), N-Stalker 2012 Free Edition v7.1.1.106 (N-Stalker), DSSS (Damn Simple SQLi Scanner) v0.1h, SandcatCS v4.2.3.0
The updated versions of the following free & open source scanners were re-tested in the benchmark:
Zed Attack Proxy (ZAP) v1.3.0, sqlmap v0.9-rev4209 (SVN), W3AF 1.1-rev4350 (SVN), Watobo v0.9.7-rev544, Acunetix Free Edition v7.0-20110711, Netsparker Community Edition v1.7.2.13, WebSecurify v0.8, WebCruiser v2.4.2 FE (corrections), arachni v0.2.4 / v0.3, XSSer v1.5-1, Skipfish 2.02b, aidSQL 02062011
The results were compared to those of unmaintained scanners tested in the original benchmark:
Andiparos v1.0.6, ProxyStrike v2.2, Wapiti v2.2.1, Paros Proxy v3.2.13, PowerFuzzer v1.0, Grendel Scan v1.0, Oedipus v1.8.1, Scrawler v1.0, Sandcat Free Edition v4.0.0.1, JSKY Free Edition v1.0.0, N-Stalker 2009 Free Edition v7.0.0.223, UWSS (Uber Web Security Scanner) v0.0.2, Grabber v0.1, WebScarab v20100820, Mini MySqlat0r v0.5, WSTool v0.14001, crawlfish v0.92, Gamja v1.6, iScan v0.1, LoverBoy v1.0, openAcunetix v0.1, ScreamingCSS v1.02, Secubat v0.5, SQID (SQL Injection Digger) v0.3, SQLiX v1.0, VulnDetector v0.0.2, Web Injection Scanner (WIS) v0.4, Xcobra v0.2, XSSploit v0.5, XSSS v0.40, Priamos v1.0
For the full list of commercial & open source tools that were not tested in this benchmark, refer to Appendix B - Section 30.
The benchmark focused on testing commercial & open source tools that are able to detect (and not necessarily exploit) security vulnerabilities on a wide range of URLs, and thus, each tool tested was required to support the following features:
· The ability to detect Reflected XSS and/or SQL Injection vulnerabilities.
· The ability to scan multiple URLs at once (using either a crawler/spider feature, URL/Log file parsing feature or a built-in proxy).
· The ability to control and limit the scan to an internal or external host (domain/IP).
The testing procedure of all the tools included the following phases:
· The scanners were all tested against the latest version of WAVSEP (v1.0.3), a benchmarking platform designed to assess the detection accuracy of web application scanners. The purpose of WAVSEP's test cases is to provide a scale for understanding which detection barriers each scanning tool can bypass, and which vulnerability variations can be detected by each tool (a simplified illustration of the idea behind such test cases appears right after this list). The various scanners were tested against the following test cases (GET and POST attack vectors):
o 66 test cases that were vulnerable to Reflected Cross Site Scripting attacks.
o 80 test cases that contained Error Disclosing SQL Injection exposures.
o 46 test cases that contained Blind SQL Injection exposures.
o 10 test cases that were vulnerable to Time Based SQL Injection attacks.
o 7 different categories of false positive RXSS vulnerabilities.
o 10 different categories of false positive SQLi vulnerabilities.
· In order to ensure result consistency, the directory of each exposure sub-category was individually scanned multiple times using various configurations.
· The features of each scanner were documented and compared, according to documentation, configuration, plugins and information received from the vendor.
· In order to ensure that the detection features of each scanner were truly effective, most of the scanners were tested against an additional benchmarking application that was prone to the same vulnerable test cases as the WAVSEP platform, but had a different design, slightly different behavior and different entry point format (currently nicknamed "bullshit").
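To make the idea of a "vulnerable test case" versus a "false positive category" more tangible, here is a simplified, hypothetical illustration written as a tiny Flask application. It is not actual WAVSEP code (the real test cases are JSP pages and differ in structure); the route names and parameter are illustrative assumptions only.

from flask import Flask, request
from markupsafe import escape

app = Flask(__name__)

@app.route("/rxss-case")
def rxss_true_positive():
    # "True positive" style case: the parameter is echoed into the HTML body
    # without any encoding, so a reflected XSS payload will execute.
    return "<html><body>Hello %s</body></html>" % request.args.get("name", "")

@app.route("/rxss-false-positive")
def rxss_false_positive():
    # "False positive" style category: the same reflection, but HTML-encoded.
    # A scanner that only checks whether its injected input is reflected,
    # without verifying the output encoding/context, will wrongly flag this page.
    return "<html><body>Hello %s</body></html>" % escape(request.args.get("name", ""))

if __name__ == "__main__":
    app.run(port=8080)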
The results of the main test categories are presented within three graphs (commercial graph, free & open source graph, unified graph), and the detailed information of each test is presented in a dedicated report.
So, now that you've learned about the testing process, it's time for the results…
The first assessment criterion was the number of audit features each tool supports.
Reasoning: An automated tool can't detect an exposure that it can't recognize (at least not directly, and not without manual analysis), and therefore the number of audit features will affect the number of exposures that the tool will be able to detect (assuming the audit features are implemented properly, that vulnerable entry points will be detected, and that the tool will manage to scan the vulnerable input vectors).
For the purpose of the benchmark, an audit feature was defined as a common generic application-level scanning feature, supporting the detection of exposures which could be used to attack the tested web application, gain access to sensitive assets or attack legitimate clients.
This definition of the assessment criterion rules out product-specific exposures and infrastructure-related vulnerabilities; unique and extremely rare features were documented and presented in a different section of this research, but were not taken into account when calculating the results. Exposures specific to Flash/Applet/Silverlight and Web Services assessment were treated in the same manner.
The Number of Audit Features in Web Application Scanners – Commercial Tools
The Number of Audit Features in Web Application Scanners - Free & Open Source Tools
The Number of Audit Features in Web Application Scanners – Unified List
So, now that we're done with the quantity, let's get to the quality…
The second assessment criterion was the detection accuracy of SQL Injection, one of the most famous exposures and the most commonly implemented attack vector in web application scanners.
Reasoning: a scanner that is not accurate enough will miss many exposures, and classify non-vulnerable entry points as vulnerable. This test aims to assess how good each tool is at detecting SQL Injection exposures in a supported input vector, located in a known entry point, without any restrictions that can prevent the tool from operating properly.
The evaluation was performed on an application that uses MySQL 5.5.x as its data repository, and thus, will reflect the detection accuracy of the tool when scanning similar data repositories.
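As a rough illustration of what the error-based test cases exercise, the following is a minimal sketch of an error-based SQL injection check against a MySQL-backed page. The URL, parameter name and error signatures are illustrative assumptions of mine, not WAVSEP internals or the logic of any tested product.

import requests

# A few well-known MySQL error fragments (illustrative, not exhaustive).
MYSQL_ERROR_SIGNATURES = [
    "You have an error in your SQL syntax",
    "Warning: mysql_",
    "check the manual that corresponds to your MySQL server version",
]

def looks_error_based_sqli(url, param, timeout=10):
    """Send a syntax-breaking payload and look for leaked database errors."""
    baseline = requests.get(url, params={param: "1"}, timeout=timeout).text
    probe = requests.get(url, params={param: "1'"}, timeout=timeout).text
    # Count only signatures that appear after injection but not in the baseline,
    # to reduce false positives on pages that always contain error-like text.
    return any(sig in probe and sig not in baseline for sig in MYSQL_ERROR_SIGNATURES)

# Hypothetical WAVSEP-style entry point, for illustration only:
# looks_error_based_sqli("http://localhost:8080/wavsep/SInjection-Case01.jsp", "userid")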
Result Chart Glossary
Note that the BLUE bar represents the vulnerable test case detection accuracy, while the RED bar represents false positive categories detected by the tool (which may amount to more instances than the bar actually presents, when compared to the detection accuracy bar).
The SQL Injection Detection Accuracy of Web Application Scanners – Commercial Tools
The SQL Injection Detection Accuracy of Web Application Scanners – Open Source & Free Tools
The SQL Injection Detection Accuracy of Web Application Scanners – Unified List
It's obvious that testing one feature is not enough; ideally, the detection accuracy of all audit features should be assessed, but in the meantime, we will settle for one more…
The third assessment criterion was the detection accuracy of Reflected Cross Site Scripting, a common exposure which is the 2nd most commonly implemented feature in web application scanners.
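For readers unfamiliar with the mechanics, the sketch below shows the bare minimum a reflected XSS probe has to do; real scanners must also analyse the reflection context (attribute, script block, partially encoded output, and so on), which is exactly the kind of detection barrier the WAVSEP test cases vary. The marker format and decision rules here are illustrative assumptions.

import html
import uuid
import requests

def probe_reflected_xss(url, param, timeout=10):
    """Return True / False / None (None = reflected, but in an unclear context)."""
    marker = "wvs%s" % uuid.uuid4().hex[:8]
    payload = "<script>alert('%s')</script>" % marker
    body = requests.get(url, params={param: payload}, timeout=timeout).text
    if payload in body:
        return True   # reflected verbatim; likely executable in an HTML body context
    if marker not in body:
        return False  # the input is not reflected at all
    if html.escape(payload) in body:
        return False  # reflected, but HTML-encoded
    return None       # reflected in some other form or context; needs deeper analysis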
Result Chart Glossary
Note that the BLUE bar represents the vulnerable test case detection accuracy, while the RED bar represents false positive categories detected by the tool (which may amount to more instances than the bar actually presents, when compared to the detection accuracy bar).
The Reflected XSS Detection Accuracy of Web Application Scanners – Commercial Tools
The Reflected XSS Detection Accuracy of Web Application Scanners – Open Source & Free Tools
The Reflected XSS Detection Accuracy of Web Application Scanners – Unified List
The list of tools tested in this benchmark is organized within the following reports:
Additional information was gathered during the benchmark, including information related to the different features of the various scanners. These details are organized in the following reports, and might prove useful when searching for tools for specific tasks or tests:
For detailed information on the accuracy assessment results, refer to the following reports:
· The Scan Logs (describing the execution process and configuration of each scanner)
Additional information on the scan logs, the list of untested tools and the abnormal behaviors of scanners can be found in the article appendix sections (at the end of the article):
Appendix B - Section 30 – an appendix that contains a list of tools that were not included in the benchmark
Appendix D - Section 32 – an appendix that describes scanners with abnormal behavior
Since the previous benchmark, many open source & commercial tools have added new features and improved their detection accuracy.
The following list presents a summary of changes in the detection accuracy of free & open source tools that were tested in the previous benchmark:
· arachni – a dramatic improvement in the detection accuracy of Reflected XSS exposures, and a dramatic improvement in the detection accuracy of SQL Injection exposures (verified on mysql).
· sqlmap – a dramatic improvement in the detection accuracy of SQL Injection exposures (verified on mysql).
· Acunetix Free Edition – a major improvement in the detection accuracy of RXSS exposures.
· Watobo – a major improvement in the detection accuracy of SQL Injection exposures (verified on mysql).
· N-Stalker 2009 FE vs. 2012 FE – although this tool is very similar to N-Stalker 2009 FE, the surprising discovery was that the detection accuracy of N-Stalker 2012 is very different – it detects only a quarter of what N-Stalker 2009 used to detect. Assuming this result is not related to a bug in the product or in my testing procedure, it means that the newer free version is significantly less effective than the previous free version, at least at detecting reflected XSS. A legitimate business decision, true, but surprising nevertheless.
· aidSQL – a major improvement in the detection accuracy of SQL Injection exposures (verified on mysql).
· XSSer – a major improvement in the detection accuracy of Reflected XSS exposures, even though the results were not consistent.
· Skipfish – a slight improvement in the detection accuracy of RXSS exposures (it is currently unknown if the RXSS detection improvement is related to changes in code or to the enhanced testing method), and a slight decrease in the detection accuracy of SQLi exposures (might be related to the different testing environment and the different method used to count the results).
· WebSecurify – a slight improvement in the detection accuracy of RXSS exposures (it is currently unknown if the RXSS detection improvement is related to changes in code or to the enhanced testing method).
· Zed Attack Proxy (ZAP) – Identical results. Any minor difference was probably caused due to the testing environment, configuration or minor issues.
· W3AF – slight improvement in the detection accuracy of RXSS exposures and slight decrease in the detection accuracy of SQL Injection exposures.
· Netsparker Community Edition – Identical results. Any minor difference was probably caused due to the testing environment, configuration or minor issues.
· WebCruiser Free Edition – a minor decrease in accuracy, due to fixing documentation mistakes from the previous benchmark.
The following section presents my own personal opinions on the results of the benchmark, and since opinions are beliefs, which are affected by emotions and circumstances, you are entitled to your own.
After testing over 48 open source scanners multiple times, and after comparing the results and experiences to the ones I had after testing 12 commercial ones (and those are just the ones that I reported), I have reached the following conclusions:
· As far as accuracy & features go, the distance between open source tools and commercial tools is not as big as it used to be – tools such as sqlmap, arachni, wapiti, w3af and others are slowly closing the gap. That being said, there is still a significant difference in stability & false positives: most open source tools tend to produce more false positives and to be relatively unstable when compared to most commercial tools.
· Some open source tools, even the most accurate ones, are relatively difficult to install & use, and still require fine-tuning in various fields. In my opinion, a non-technical QA engineer will have difficulties using these tools, and as a general rule, I'd recommend using them only if your background is relatively technical (consultant, developer, etc). For all the rest, especially non-technical enterprise employees that prefer a decent usage experience - stick with commercial products, with their free versions, or with the simpler variations of open source tools.
· If you are using a commercial product, it's best to combine tools that have a wide variety of features with tools that have high detection accuracy. It's possible to use tools that score relatively well in both of these aspects, or to use a tool with a wide variety of features alongside another tool with enhanced accuracy. Yes, this statement can be interpreted as using combinations of commercial and open source tools, and even as using two different commercial tools, so that one tool complements the other. Budget? Take a look at the cost diversity of the tools before you make any harsh decisions; I promise you'll be surprised.
While testing the various commercial tools, I have dealt with certain moral issues that I want to share. Many vendors that were aware of this research enhanced their tools in preparation for it, an action I respect, and consider a positive step. Since the testing platform that included most of the tests was available online, preparing for the benchmark was a relatively easy task for any vendor that invested the resources.
So, is the benchmark fair for vendors that couldn’t improve their tools due to various circumstances?
The testing process of a commercial tool is usually much more complicated and restrictive than testing a free or open source tool; it is necessary to contact the vendor to obtain an evaluation license and the latest version of the tool (a process that can take several weeks), and the evaluation licenses are usually restricted to a short evaluation timeframe (usually two weeks), so updating and testing the tools at a later date can become a hassle (since some of the process will have to be performed all over again)… but why am I telling you all this?
Simply because I believe the tests performed for vendors that provided me with an extended evaluation period and access to new builds were more relevant; for example, a few days before the latest benchmark, immediately after testing the latest versions of two major vendors, I decided to rescan the platform using the latest versions of all the commercial tools I had, to ensure that the benchmark would be published with the most updated results.
I verified that JSky, WebCruiser, and ParosPro hadn't released new versions, and tested the latest versions of AppScan, WebInspect, Acunetix, Netsparker, Sandcat and Nessus.
It made sense that builds tested only a short while earlier (such as NTOSpider) could also be relied on to represent the current state of the tool (I hope).
I did, however, have a problem with Cenzic and Burp, two of the first tools that I tested in this research: my evaluation licenses were no longer valid, so I couldn't update the tools to their latest versions and scan again. With 2-3 days until the end of my planned schedule and a million tasks pending, I simply couldn't afford to go through the evaluation request phase again, despite all my good intentions and my willingness to sacrifice my spare time to ensure these tools would be properly represented.
Even though the results of some updated products (WebInspect and Nessus being the best examples) didn't change at all, even after I updated them to the latest version, who could say that the result would be the same for other vendors?
So, were the terms unfair to Burp and Cenzic?
Finally, several vendors sent me multiple versions and builds – they all wanted to succeed, a legitimate desire of any human being, even more so of a firm. Apart from the time each test took (a price I was willing to pay at the time), new builds were sent even on the last day of the benchmark, and afterwards.
But if a newer version is better and more accurate, then by limiting the number of tests I perform for a given vendor, aren't I working against what I'm trying to achieve in all my benchmarks – releasing the benchmark with the most updated results, for all the tools?
(For example, Syhunt, a vendor that did very well in the last benchmark, sent me its final build (2.4.2.5) a day after the deadline, and included a time based SQL injection detection feature in that build; since I couldn't afford the time anymore, I couldn't test that build, so am I really reflecting the tool's current state in the most accurate manner? But if I had tested this build, shouldn't I give the rest of the vendors the same opportunity?)
One of the questions I believe I can answer – the accuracy question.
A benchmark is, in a very real sense, a competition, and since I take the scientific approach, I believe that the results are absolute, at least for the subject that is being tested. Since I'm not claiming that one tool is "better" than the other in every category, only at the tested criterion, I believe that priorities do not matter – as long as the test really reflects the current situation, the result is reliable.
I leave the interpretation of the results to the reader, at least until I cover enough aspects of the tools.
As for the rest of the open issues, I don't have good answers for all of those questions, and although I did my very best in this benchmark, and even exceeded what I thought I was capable of, I will probably have to think of solutions that will make the next benchmark's terms equal, even for scanners tested at the beginning of the benchmark, and less time consuming than it has been.
The results of the benchmark can be verified by replicating the scan methods described in the scan log of each scanner, and by testing the scanner against WAVSEP v1.0.3.
The latest version of WAVSEP can be downloaded from the web site of project WAVSEP (binary/source code distributions, installation instructions and the test case description are provided in the web site download section):
How to use the results of the benchmark
The results of the benchmark clearly show how accurate each tool is at detecting the tested vulnerabilities (SQL Injection (MySQL) & Reflected Cross Site Scripting), as long as it is able to locate and scan the vulnerable entry points. The results might even help to estimate how accurate each tool is at detecting related vulnerabilities (for example, SQL Injection vulnerabilities based on other databases), and to determine which exposure instances cannot be detected by certain tools;
However, currently, the results DO NOT evaluate the overall quality of the tool, since they don't include detailed information on the subjects such as crawling quality, technology support, scoping, profiling, stability in extreme cases, tolerance, detection accuracy of other exposures and so on... at least NOT YET.
I highly recommend reading the detailed results, and the appendix that deals with web application scanner evaluation, before getting to any conclusions.
Additional Notifications
During the benchmark, I reported bugs that had a major effect on detection accuracy to several commercial and open source vendors:
· A performance improvement feature in NTOSpider caused it not to scan many POST XSS test cases, and thus the detection accuracy for RXSS POST test cases was significantly lower than the RXSS GET detection accuracy. The vendor was notified of this issue, and provided me with a special build that overrides this feature (at least until a GUI option to disable this mechanism is available).
· A similar performance improvement feature in Netsparker caused the same issue; however, that feature could be disabled in Netsparker, and thus, with the support of the relevant personnel at Netsparker, I was able to work around the problem.
· A few bugs in arachni prevented the blind SQL injection diff plugins from working properly. I notified the author, Tasos, of the issue, and he quickly fixed it and released a new version.
· Acunetix RXSS detection result was updated to match the results of the latest free version (one version above the tested commercial version) - Since the tested commercial version of Acunetix was older than the tested free version (20110608 vs 20110711), and since the results of the upgraded free version were actually better than the older commercial version I had tested, I changed the results of the commercial tool to match the ones of the new free version (from 22 to 24 in both the GET & POST RXSS detection scores).
· Changes in results from the previous benchmark might be attributed to enhanced scanning features, and/or to enhanced stability in the test environment & method (connection pool, limited & divided scope).
The following report contains the list of scanners tested in this benchmark, and provides information on the tested version, the tool's vendor/author and the current status of the product:
The following report compares the licenses, development technology and sources (home page) of the various scanners:
The following reports compare the active vulnerability detection features (audit features) of the various tested scanners:
First Report:
Second Report:
Aside from the Count column (which represents the total number of audit features supported by the tool, not including complementary features such as web server scanning and passive analysis), each column in the report represents an audit feature. The description of each column is presented in the following glossary table:
Title | Description |
SQL | Error Dependent SQL Injection |
BSQL | Blind & Intentional Time Delay SQL Injection |
RXSS | Reflected Cross Site Scripting |
PXSS | Persistent / Stored Cross Site Scripting |
DXSS | DOM XSS |
Redirect | External Redirect / Phishing via Redirection |
Bck | Backup File Detection |
Auth | Authentication Bypass |
CRLF | CRLF Injection / Response Splitting |
LDAP | LDAP Injection |
XPath | X-Path Injection |
MX | MX / SMTP / IMAP Injection |
Session Test | Session Identifier Complexity Analysis |
SSI | Server Side Include |
RFI-LFI | Directory Traversal / Remote File Include / Local File Include (Will be separated into different categories in future benchmarks) |
Cmd | Command Injection / OS Command Injection |
Buffer | Buffer Overflow |
CSRF | Cross Site Request Forgery |
A-Dos | Application Denial of Service / RegEx DoS |
Privilege Escalation | Privilege Escalation Between Different Roles and User Accounts (Resources / Features) |
Format String | Format String Injection |
File Upload | File Upload / Insecure File Upload |
Code Injection | Code Injection (ASP/JSP/PHP/Perl/etc) |
XML Injection | XML / SOAP Injection |
Source Code Disclosure | Source Code Disclosure Detection |
Integer Overflow | Integer Overflow |
Padding Oracle | Padding Oracle Detection / Exploitation |
Session Fixation | Session Fixation |
The following report compares complementary vulnerability detection features in the tested scanners:
In order to clarify what each column in the report table means, use the following glossary table:
Title | Description |
Web Server Hardening | Features that are able to detect Insecure HTTP method support (PUT, Trace, WebDAV), directory listing, robots and cross-domain files information disclosure, version specific vulnerabilities, etc. |
CGI Scanning | Default files, common vulnerable applications, etc. |
Passive Analysis | Security tests that don’t require any actual attacks, and are instead based on information gathering and analysis of responses, including certificate & cipher tests, content & metadata analysis, mime type analysis, autocomplete detection, insecure transmission of credentials, google hacking, etc. |
File / Dir Enumeration | Directory and file enumeration features |
Notes and Other Features | Uncommon or Unique features |
The following report compares the usability, coverage and performance-related characteristics of the tested scanners. In order to clarify what each column in the report table means, use the following glossary table:
Title | Possible Values |
Configuration & Usage Scale | Very Simple - GUI + Wizard; Simple - GUI with simple options, or command line with a scan configuration file or simple options; Complex - GUI with numerous options, or command line with multiple options; Very Complex - Manual scanning feature dependencies, multiple configuration requirements |
Stability Scale | Very Stable - Rarely crashes, never gets stuck; Stable - Rarely crashes, gets stuck only in extreme scenarios; Unstable - Crashes every once in a while, freezes on a consistent basis; Fragile - Freezes or crashes on a consistent basis, fails to perform the operation in many cases |
Performance Scale | Very Fast - Fast implementation with a limited number of scanning tasks; Fast - Fast implementation with plenty of scanning tasks; Slow - Slow implementation with a limited number of scanning tasks; Very Slow - Slow implementation with plenty of scanning tasks |
The following report contains a comparison of advanced and uncommon scanner features:
The results of the Reflected Cross Site Scripting (RXSS) accuracy assessment are presented in the following report (the graphical results representation is provided in the beginning of the article):
The results that were taken into account only include vulnerable pages linked from the index-xss.jsp index page (the RXSS-GET and/or RXSS-POST directories, in addition to the RXSS-FalsePositive directory). XSS-vulnerable entry points in the SQL injection vulnerable pages were not taken into account, since they don't necessarily represent a unique scenario (or at least, not until the "layered vulnerabilities" scenario is implemented).
The overall results of the SQL Injection accuracy assessment are presented in the following report (the graphical results representation is provided in the beginning of the article):
The results of the Error-Based SQL Injection benchmark are presented in the following report:
The results of the Blind & Time based SQL Injection benchmarks are presented in the following report:
While testing the various tools in this benchmark, I dealt with numerous difficulties, witnessed many inconsistent results and noticed that some tools had difficulties optimizing their scanning features on the tested platform. I did, however, also deal with the other end of the spectrum, and used tools that easily overcame most of the difficulties related to detecting the tested vulnerabilities.
I'd like to share my conclusions, with the authors and vendors that are interested in improving their tools, and aren't offended by someone that's giving advice.
As far as detecting SQL injection exposures goes, I have noticed that tools that implemented the following features detected more exposures, had fewer false positives, and provided consistent results:
· Time based SQL Injection detection vectors are very effective. They are, however, very tricky to use, since they might be affected by other attacks that are executed simultaneously, or affect the detection of other tests in the same manner. As a result, I recommend that all authors & vendors implement the following behavior in their products: execute time based attacks at the end of the scanning process, after all the other tests are done, while using a reduced number of concurrent connections. Executing other tests in parallel might have a negative effect on the detection accuracy (a minimal sketch of such a careful time based check follows this list).
· Since the upper/lower timeout values used to determine whether or not a time based exploit was successful may change due to various circumstances, I recommend calculating and re-calculating this value during the scan, and revalidating each time based result independently, after verifying that the timeout values are "normal".
· Implement various payloads for time based attacks – the sleep method is not enough to cover all the databases, or even all the versions of MySQL.
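The following is a minimal sketch of a time-based check that follows the recommendations above: it measures the baseline around every delayed request, rejects verdicts when the environment is too slow or noisy, and re-validates each hit independently. The payload, URL handling and thresholds are illustrative assumptions (and MySQL-specific; other databases need other delay payloads), not the implementation of any tested tool.

import time
import requests

DELAY = 5  # seconds of delay requested from the database
PAYLOAD = "1' AND SLEEP(%d)-- -" % DELAY  # MySQL-specific; illustrative only

def timed_get(url, param, value, timeout=30):
    start = time.monotonic()
    requests.get(url, params={param: value}, timeout=timeout)
    return time.monotonic() - start

def looks_time_based_sqli(url, param, rounds=2):
    for _ in range(rounds):  # re-validate every hit independently
        baseline = min(timed_get(url, param, "1") for _ in range(3))
        if baseline > DELAY:
            return False  # environment too slow/noisy for a reliable verdict
        delayed = timed_get(url, param, PAYLOAD)
        if delayed - baseline < DELAY * 0.8:
            return False  # the injected delay did not materialize
        # Re-check that the baseline is still "normal" right after the delayed request.
        if min(timed_get(url, param, "1") for _ in range(2)) > DELAY:
            return False
    return True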
So now that we have all those statistics, it's time to analyze them properly, and see which conclusions we can draw. Since this process will take time, I have to set some priorities.
In the near future, I will try to achieve the following goals:
· Find a better way to present the vast amount of information on web application scanners features & accuracy. I have been struggling with this issue for almost 2 years, but I think that I finally found a solution that will make the information more useful for the common reader… stay tuned for updates.
· Provide recommendations for the best current method of executing free & open source web application scanners; the most useful combinations, and the tiny tweaks required to achieve the best results.
· Release the new test case categories of WAVSEP that I have been working on. Yep, help needed.
In addition to the short term goals, the following long term goals will still have a high priority:
· Improve the testing framework (WAVSEP); add additional test cases and additional security vulnerabilities.
· Perform additional benchmarks on the framework, on a consistent basis. I previously aimed for one major benchmark per year, but that formula might completely change if I manage to work out a few issues around a new initiative I have in this field.
· Integration with external frameworks for assessing crawling capabilities, technology support, etc.
· Publish the results of tests against sample vulnerable web applications, so that some sort of feedback on other types of exposures will be available (until other types of vulnerabilities are implemented in the framework), as well as on features such as authentication support, crawling, etc.
· Gradually develop a framework for testing additional related features, such as authentication support, malformed HTML tolerance, abnormal response support, etc.
I hope that this content will help the various vendors improve their tools, help pen-testers choose the right tool for each task, and in addition, help create some method of testing the numerous tools out there.
Since I have been in this situation before, I know what's coming… so I apologize in advance for any delays in my responses over the next few weeks.
The following resources include additional information on previous benchmarks, comparisons and assessments in the field of web application vulnerability scanners:
During the research described in this article, I have received help from quite a few individuals and resources, and I’d like to take the opportunity to thank them all.
To all the open source tool authors that assisted me in testing the various tools at unreasonable late-night hours, to the kind souls that helped me obtain evaluation licenses for commercial products, to the QA, support and development teams of the commercial vendors, who saved me tons of time and helped me overcome obstacles, and to the various individuals that helped me contact these vendors – thank you.
I hope that the conclusions, ideas, information and payloads presented in this research (and the benchmarks and tools that will follow) will be for the benefit of all vendors, open source community projects and commercial vendors alike.
Q: 60 web application scanners is an awful lot, how many scanners exist?
A: Assuming you are using the same definition for a scanner that I do, then I'm currently aware of 95 web application scanners that can claim to support the detection of generic application-level exposures, in a safe and controllable manner, and on multiple URLs (48 free & open source scanners that were tested, 12 commercial scanners that were tested, 25 open source scanners that I haven't tested yet, and 10 commercial scanners that slipped my grip). And yes, I'm planning on testing them all.
Q: Why RXSS and SQLi again? Will the benchmarks ever include additional exposures?
A: Yes, they will. In fact, I'm already working on test case categories of two different exposures, and will use them both for my next research. Besides, the last benchmark focused on free & open source products, and I couldn't help myself, I had to test them against each other.
Q: I can't wait for the next research, what can I do to speed things up?
A: I'm currently looking for methods to speed up the processes related to these researches, so if you're willing to help, contact me.
Q: What’s with the titles that contain cheesy movie quotes?
A: That's just it - I happen to like cheese. Let's see you coming up with better titles at 4AM.
Although this benchmark contains tons of information, and is very useful as a decision-assisting tool, the content within it cannot be used to calculate the accurate ROI (return on investment) of each web application scanner. Furthermore, it can't predict on its own exactly how good the results of each scanner will be in every situation (although it can predict what won't be detected), since there are additional factors that need to be taken into account.
The results in this benchmark could serve as an accurate evaluation formula only if the scanner is used to scan a technology that it supports, pages that it can detect (manual crawling features can be used to overcome many obstacles in this case), and locations without technological barriers that it cannot handle (for example, web application firewalls or anti-CSRF tokens).
In order for us to truly assess the full capability of web application vulnerability scanners, the following features must be tested:
· The entry point coverage of the web application scanner must be as high as possible; meaning, the tool must be able to locate and properly activate (or be manually "taught") all the application entry points (e.g. static & dynamic pages, in-page events, services, filters, etc). Vulnerabilities in an entry point that wasn't located will not be detected. The WIVET project can provide additional information on coverage and support.
· The attack vector coverage of the web application scanner – does it support input vectors such as GET / POST / Cookie parameters? HTTP headers? Parameter names? Ajax parameters? Serialized objects? Each input vector that is not supported means exposures that won't be detected, regardless of the tool's accuracy level (assuming the unsupported attack/input vector is vulnerable). A small sketch of enumerating common input vectors appears after this list.
· The scanner must be able to handle the technological barriers implemented in the application, ranging from authentication mechanism to automated access prevention mechanisms such as CAPTCHAs and anti-CSRF tokens.
· The scanner must be able to handle any application-specific problems it encounters, including malformed HTML (tolerance), stability issues and other limitations. If the best scanner in the world consistently causes the application to crash within a couple of seconds, then it's not useful for assessing the security of that application (in matters that don't relate to DoS attacks).
· The number of features (active & passive) implemented in the web application vulnerability scanner.
· The accuracy level of each and every plugin supported by the web application vulnerability scanner.
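To make the attack vector coverage point concrete, here is a small, hypothetical sketch of how a scanner might enumerate injection points across several input vectors of a single request. The vector names and request structure are illustrative assumptions; real scanners support varying subsets of these vectors, which is why an unsupported vector translates directly into missed exposures.

def injection_points(request):
    """Yield (vector, name) pairs a scanner could mutate for one request."""
    for name in request.get("query", {}):
        yield ("GET parameter", name)
    for name in request.get("body", {}):
        yield ("POST parameter", name)
    for name in request.get("cookies", {}):
        yield ("Cookie", name)
    for name in ("User-Agent", "Referer", "X-Forwarded-For"):
        if name in request.get("headers", {}):
            yield ("HTTP header", name)

# Illustrative request representation (not the format of any specific tool):
example = {
    "query": {"id": "1"},
    "body": {"comment": "hi"},
    "cookies": {"session": "abc"},
    "headers": {"User-Agent": "scanner/1.0", "Referer": "http://localhost/"},
}
print(list(injection_points(example)))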
That being said, it's crucial to remember that even in the most ideal scenario, in the absence of human intelligence, scanners can't detect all the instances of exposures that are truly logical – meaning, exposures related to specific business logic, which are therefore not perceived as an issue by an entity that can't understand that business logic.
But the sheer complexity of the issue does not mean that we shouldn't start somewhere, and that's exactly what I'm trying to do in my benchmarks – create a scientific, accurate foundation for achieving that goal, with enough investment, over time.
Note that my explanations describe only a portion of the actual tests that should be performed, and I'm sharing them only to emphasize the true complexity of the core issue; I haven't touched stability, bugs, and a lot of other subjects, which may affect the overall result you get.
The following commercial web application vulnerability scanners were not included in the benchmark, since I didn't manage to get an evaluation version before the article publication deadline, or, in the case of one scanner (McAfee), had problems with the evaluation version that I didn't manage to work out before the benchmark's deadline:
Commercial Scanners not included in this benchmark
· Falcove (BuyServers ltd, currently Unmaintained)
The following open source web application vulnerability scanners were not included in the benchmark, mainly due to time restrictions, but will be included in future benchmarks:
Open Source Scanners not included in this benchmark
· Vulnerability Scanner 1.0 (by cmiN, RST) - excluded since the source code contained traces of RFI lists being downloaded from remote locations that no longer exist.
The benchmark focused on web application scanners that are able to detect either Reflected XSS or SQL Injection vulnerabilities, can be locally installed, and are also able to scan multiple URLs in the same execution.
As a result, the test did not include the following types of tools:
· Online Scanning Services – Online applications that remotely scan applications, including (but not limited to) Appscan On Demand (IBM), Click To Secure, QualysGuard Web Application Scanning (Qualys), Sentinel (WhiteHat), Veracode (Veracode), VUPEN Web Application Security Scanner (VUPEN Security), WebInspect (online service - HP), WebScanService (Elanize KG), Gamascan (GAMASEC – currently offline), Cloud Penetrator (Secpoint), Zero Day Scan, DomXSS Scanner, etc.
· Scanners without RXSS / SQLi detection features:
o LFI/RFI Checker (astalavista)
o etc
· Passive Scanners (response analysis without verification):
o Watcher (Fiddler Plugin by Casaba Security)
o etc
· Scanners of specific products or services (CMS scanners, Web Services Scanners, etc):
o WSDigger
o Sprajax
o ScanAjax
o Joomscan
o wpscan
o Joomlascan
o Joomsq
o WPSqli
o etc
· Web Application Scanning Tools which are using Dynamic Runtime Analysis:
o PuzlBox (the free version was removed from the web site, and is now sold as a commercial product named PHP Vulnerability Hunter)
o etc
· Uncontrollable Scanners - scanners that can’t be controlled or restricted to scan a single site, since they either receive the list of URLs to scan from Google Dork, or continue and scan external sites that are linked to the tested site. This list currently includes the following tools (and might include more):
o Darkjumper 5.8 (scans additional external hosts that are linked to the given tested host)
o Bako's SQL Injection Scanner 2.2 (only tests sites from a google dork)
o Serverchk (only tests sites from a google dork)
o XSS Scanner by Xylitol (only tests sites from a google dork)
o Hexjector by hkhexon – also falls into other categories
o d0rk3r by b4ltazar
o etc
· Deprecated Scanners - incomplete tools that were not maintained for a very long time. This list currently includes the following tools (and might include more):
o Wpoison (development stopped in 2003; the new official version was never released, although the 2002 development version can be obtained by manually composing the sourceforge URL, which does not appear on the web site - http://sourceforge.net/projects/wpoison/files/)
o etc
· De facto Fuzzers – tools that scan applications in a similar way to a scanner, but where a scanner attempts to conclude whether or not the application is vulnerable (according to some sort of "intelligent" set of rules), the fuzzer simply collects abnormal responses to various inputs and behaviors, leaving the task of drawing conclusions to the human user.
o Lilith 0.4c/0.6a (both versions 0.4c and 0.6a were tested, and although the tool seems to be a scanner at first glimpse, it doesn’t perform any intelligent analysis on the results).
o Spike proxy 1.48 (although the tool has XSS and SQLi scan features, it acts more like a fuzzer than a scanner – it sends partial XSS and SQLi payloads, and does not verify that the context of the returned output is sufficient for execution, or that the error presented by the server is related to a database syntax injection, leaving the verification task to the user).
· Fuzzers – scanning tools that lack the independent ability to conclude whether a given response represents a vulnerable location, by using some sort of verification method (this category includes tools such as JBroFuzz, Firefuzzer, Proxmon, st4lk3r, etc). Fuzzers that had at least one type of exposure that was verified were included in the benchmark (Powerfuzzer).
· CGI Scanners: vulnerability scanners that focus on detecting hardening flaws and version specific hazards in web infrastructures (Nikto, Wikto, WHCC, st4lk3r, N-Stealth, etc)
· Single URL Vulnerability Scanners - scanners that can only scan one URL at a time, or can only scan information from a google dork (uncontrollable).
o Havij (by itsecteam.com)
o Hexjector (by hkhexon)
o Simple XSS Fuzzer [SiXFu] (by www.EvilFingers.com)
o Mysqloit (by muhaimindz)
o PHP Fuzzer (by RoMeO from DarkMindZ)
o SQLi-Scanner (by Valentin Hoebel)
o Etc.
· Vulnerability Detection Assisting Tools – tools that aid in discovering a vulnerability, but do not detect the vulnerability themselves; for example:
· Exploiters - tools that can exploit vulnerabilities but have no independent ability to automatically detect vulnerabilities on a large scale. Examples:
o MultiInjector
o XSS-Proxy-Scanner
o Pangolin
o FGInjector
o Absinth
o Safe3 SQL Injector (an exploitation tool with scanning features (pentest mode) that are not available in the free version).
o etc
· Exceptional Cases
o SecurityQA Toolbar (iSec) – various lists and rumors include this tool in the collection of free/open-source vulnerability scanners, but I wasn’t able to obtain it from the vendor’s web site, or from any other legitimate source, so I’m not really sure it fits the “free to use” category.
The execution logs, installation steps and configuration used while scanning with the various tools are all described in the following report:
The following appendix was published in my previous benchmark, but I decided to include it in the current benchmark as well, mainly because I didn't manage to invest the time to get to the bottom of these mysteries, and haven't seen any information from someone else who did.
During the current & previous assessments, parts of the source code of the open source scanners and the HTTP communication of some of the scanners were analyzed; some tools behaved in an abnormal manner that should be reported:
· Priamos IP Address Lookup – The tool Priamos attempts to access “whatismyip.com” (or some similar site) whenever a scan is initiated (verified by channeling the communication through Burp proxy). This behavior might derive from a trojan horse that infected the content on the project web site, so I’m not jumping to any conclusions just yet.
· VulnerabilityScanner Remote RFI List Retrieval (listed in the scanners that were not tested, Appendix B, developed by a group called RST, http://pastebin.com/f3c267935) – In the source code of the tool VulnerabilityScanner (a python script), I found traces of remote access to external web sites for obtaining RFI lists (which might be used to refer the user to external URLs listed in the list). I could not verify the purpose of this feature, since I didn't manage to activate the tool (yet); in theory, this could be a legitimate list update feature, but since all the lists the tool uses are hardcoded, I didn't understand the purpose of the feature. Again, I'm not jumping to any conclusions; this feature might be related to the tool's initial design, which was not fully implemented due to various considerations.
Although I did not verify that any of these features is malicious in nature, these features and behaviors might be abused to compromise the security of the tester's workstation (or to incriminate the tester in malicious actions), and thus require additional investigation to rule out this possibility.