From YouTube: dast-benchmark Details and Workflow
Description
Covers the dast-benchmark tool and workflow for creating baseline applications for benchmarking.
A: Right, so this is covering the dast-benchmark tool and how to go through the workflow of adding a new application to the benchmark. I want to quickly start with some of the issues you run into when you do a benchmark, or an analysis comparison of scans, to see whether your tool is actually getting the flaw coverage and scan coverage that you expect.
A: Some of the issues that crop up when you're doing this really come down to the comparison logic. What ends up happening is that a scanner, as it's scanning an application, will be trying either random inputs, or its request sequences will be a little bit different. So one scan might request a page using a GET method instead of a POST method first, find a flaw, and report it there; another scan might do the POST request first and then report the flaw. So the reports will make it look like the vulnerability exists in a different place, when it's really the same vulnerability, just accessed with a different method. The benchmark tool has a number of techniques it uses to make sure these comparisons are actually valid, rather than comparing too strictly and reporting incorrect results.

One thing you'll end up realizing as you work with these tools is that URLs are not very good for determining uniqueness, so doing a direct comparison on URLs to see if they match doesn't actually work.
A: In a lot of cases, for example, applications will do cache busting: they'll append something like a random ID, or maybe a timestamp, to the end of the URL, and the next time you access it, it'll have a different value so that the browser doesn't cache it (say, /products?id=5&_=1616161616 on one request and /products?id=5&_=1616161617 on the next). When you go to do your comparison and ask, does this URL match, the answer will be no, even though it's the same resource. So what we end up doing comes down to two techniques.
A: One of the solutions is to look at the actual values and determine whether they're randomized; if they are, the tool will basically ignore them. It says: OK, this is a dynamic parameter, it's probably going to change in the next scan, so don't pay attention to it when you do your comparison.
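A minimal sketch of that first technique, assuming dynamic parameters are detected by watching whether a parameter's value changes across repeated requests to the same path (the function name and heuristic here are my own illustration, not the tool's actual implementation):

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def find_dynamic_params(urls):
    """Collect the values seen for each (path, parameter) pair; a
    parameter whose value differs across requests to the same path is
    likely dynamic (cache buster, timestamp, random ID) and should be
    ignored during report comparison."""
    seen = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        for name, value in parse_qsl(parts.query):
            seen[(parts.path, name)].add(value)
    return {key for key, values in seen.items() if len(values) > 1}

# Hypothetical example: "_" is a cache buster that changes every request.
urls = [
    "http://app.test/products?id=5&_=1616161616",
    "http://app.test/products?id=5&_=1616161617",
]
print(find_dynamic_params(urls))  # {('/products', '_')}
```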
The other technique is to allow the user, or the person creating the expected baseline report, to specify it directly: this is going to be a dynamic value, so ignore it completely. And there are a couple of things that you can ignore.
A: You can ignore parameter names, parameter values, the entire parameter, or parts of the URL path, so it's fairly flexible. It lets you say: OK, this vulnerability is going to exist here, and these are the constraints that are really required for that vulnerability to be matched against. Besides that, host names may also change: if it's running in a CI/CD pipeline, it might generate a random URL or host name.
A: So we ignore those: we basically strip out all the host names and just replace them with a dot-star. Another issue is mismatched query parameter ordering. Some applications may just change the order, so there's an example where the first request does x=1 and y=3, but then the same page will flip those parameter key-value pairs around.
A: So what we do is basically parse up the URI and sort the parameters so that they always match in the same order.
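Putting those normalizations together, a rough sketch of canonicalizing a URL before comparison might look like this (my own illustration, not the tool's code): the host is replaced with a wildcard, query parameters are sorted, and any parameters flagged as dynamic are dropped.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def normalize_url(url, dynamic_params=frozenset()):
    """Canonical form used for comparison: wildcard host, sorted query
    parameters, dynamic parameters removed."""
    parts = urlsplit(url)
    params = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query)
        if name not in dynamic_params
    )
    return f"{parts.scheme}://.*{parts.path}?{urlencode(params)}"

# Different hosts and parameter orders normalize to the same string.
a = normalize_url("http://job-1234.ci.test/page?y=3&x=1&_=99", {"_"})
b = normalize_url("http://job-5678.ci.test/page?x=1&y=3&_=42", {"_"})
assert a == b  # both become "http://.*/page?x=1&y=3"
```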
And then there's evidence and attack strings: obviously the tool has many different attack strings it can send, so in some cases we want to actually ignore the attack string or the evidence.
A: The evidence basically states: here's how we determined this was a flaw, whether that was a URL or parts of the page that demonstrate the flaw was actually found. In some cases you do want to take that into consideration. For example, if it's a particular type of link that must exist, or it's missing some property that needs to exist and is otherwise considered vulnerable; in that case we do want to account for the evidence.
A: In other cases, like a SQL injection attack string, we don't care what the attack string is, so we just ignore it. That attack string may also show up in the URL, so we need to cleanse the URL of the attack string when we do our comparisons, because the next time it runs it may have tried a different attack string and found the same flaw. But again, the flaw is exactly the same; it's just a matter of doing a decent job of comparing against it.
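A sketch of that cleansing step, assuming the report carries the attack string alongside each finding (an illustration rather than the tool's actual code):

```python
def cleanse_url(url, attack_string):
    """Remove the reported attack payload from a URL so that two findings
    of the same flaw compare equal even when different payloads were used."""
    if not attack_string:
        return url
    return url.replace(attack_string, "")

finding_a = "http://app.test/item?id=1'%20OR%20'1'='1"
finding_b = "http://app.test/item?id=1'%20UNION%20SELECT%201--"
# Once each payload is stripped, both findings point at the same endpoint.
assert cleanse_url(finding_a, "'%20OR%20'1'='1") == \
       cleanse_url(finding_b, "'%20UNION%20SELECT%201--")
```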
A: So, as I mentioned, the expected report allows you to ignore various fields: parameter names, parameter values, path indexes. We could say, OK, we're going to ignore this particular path, and the comparison just walks through all of that. So that covers why it's so hard to do a comparison between two scans. Now let's look at how it actually does it. There are two key metrics that we're looking for when we do a baseline or benchmark comparison of a scan.
A: The first is scanned resources, which basically determines the coverage of the DAST tool's crawler. If there are a hundred links and it only found 50 of them, you have 50% scan coverage. For vulnerabilities, we have flaw coverage, and flaw coverage really depends on what types of flaws exist, how many exist, where they exist, and other properties that allow us to do the comparison. So, for scanned resources:
A: There's a diff algorithm that basically takes the two sets, all the URLs from the expected report and all the URLs from the scan report, and goes through them asking: OK, was this found, yes or no? If something from the expected report wasn't found by the scan, that is a false negative. If something exists in the new scan but not in the expected report, that would be a potential false positive. I say potential because a lot of times, when you're doing these baseline analyses, you as the person creating the report may have missed a link that the scanner just happened to find. So usually you want to look at those results to see whether any of these potential FPs are actual true positives, in which case you need to update your baseline report.
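In set terms, the scanned-resources comparison reduces to something like this (a simplified sketch; the real tool also applies the URL normalization described above):

```python
def compare_resources(expected_urls, scanned_urls):
    """Compare two sets of normalized URLs and report scan coverage,
    false negatives (expected but not scanned), and potential false
    positives (scanned but not in the baseline)."""
    expected, scanned = set(expected_urls), set(scanned_urls)
    false_negatives = expected - scanned
    potential_fps = scanned - expected
    coverage = len(expected & scanned) / len(expected) if expected else 1.0
    return coverage, false_negatives, potential_fps

cov, fns, fps = compare_resources(
    ["http://.*/a", "http://.*/b"],
    ["http://.*/a", "http://.*/c"],
)
print(f"{cov:.0%}")  # 50%: /b is a false negative, /c a potential FP
```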
For vulnerability instances, it's a bit more complicated.
A: One of the issues that we ran across with the DAST tool is that it doesn't have a concept of a unique vulnerability ID. So what ended up happening is we had to go through all of the alerts that ZAP, the DAST tool, could create, then create vulnerability IDs ourselves and use those as our primary key to do our comparisons against. So in the expected report you'll see vulnerability IDs, which are essentially subsets of CWEs. You can see one here:
A: 79.1, for example: 79 is the CWE for cross-site scripting, and the .1 means it's, say, reflected. There are other types of alerts that may exist in the report, and I'll show those in a bit. We go through this pretty similarly to how scanned resources are compared: we create these two sets and then compare them. If a finding doesn't exist in the expected report, it's considered an FP; if it doesn't exist in the scan result, it's considered a false negative; and so on and so forth.
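Since the scanner has no stable vulnerability ID, the comparison is keyed off the alert titles. A rough sketch of the idea (the titles and ID values below are illustrative, not the actual rule list):

```python
# Hypothetical mapping from scanner alert titles to benchmark vulnerability
# IDs, where the ID is the CWE number plus a sub-index for the variant.
ALERT_TITLE_TO_VULN_ID = {
    "Cross Site Scripting (Reflected)": "79.1",
    "Cross Site Scripting (Persistent)": "79.2",
    "SQL Injection": "89.1",
}

def vuln_id_for(alert_title):
    """Resolve an alert title to the benchmark's primary key."""
    return ALERT_TITLE_TO_VULN_ID.get(alert_title)
```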
A: Any questions so far? No? Pretty good, thank you! So next I'm going to do a quick walkthrough of the workflow for creating this type of expected report. What I have here is a new application, NodeGoat. I did a scan earlier, running it locally, because it was crashing our DAST tool (it was finding a flaw and then crashing), so I've got to fix that.
A: But this is the repository for NodeGoat, and it has a customized CI job to basically build it as an image and run it. The CI template will create the image with the NodeGoat application built into it, and then we have a secondary project (you could put it in the same one, it doesn't really matter) to actually run the DAST scan and get the results. Then we download the results, and this is kind of the end product of that.
A: So this just tells you, if you're going to benchmark an application, this is how you would set it up. A lot of times, when you're creating these expected or baseline reports, you don't want to do everything manually: the scanner is going to find stuff that you as a human will either miss, or it's just too tedious to report everything by hand. So what we end up doing is, let me pull this up real quick.
A: Here is the report; it has all our flaws, all that good stuff in it, and we're going to take it and actually generate an expected report from it. The tool will go through and do that for us. So we run dast-benchmark with a few options, and by default it will output an expected report. If you open that up, let me make it bigger again, this is what the expected report looks like: it has some configuration stuff, rule files, etc.
A: Here you can see the evidence; in this case the evidence is important, and the attack was not included. It just has these different types of instances, because each flaw can obviously be found multiple times, and we want to account for that. This one was the anti-CSRF token scanner; here's 79, which is cross-site scripting, and you can see it's in the POST username parameter of the signup page. And if we actually look at it here, we can take this attack string...
A: It uses a single quote, a double quote, and then a script tag. I've already tried this once before, and I believe this attack string actually doesn't work, but if you modify it a bit, it does work. So it actually found a flaw; it just reported it incorrectly. So it's not really an FP, but if it were an FP, what you'd end up doing is just deleting this instance from the expected report, so that you have a clean result set. So that's pretty much what that looks like.
A: Right, so here's a great example: we have CWE 16, which is a content security policy issue, and a lot of times this will exist on every single page it tries, because the page is missing the CSP header, or the CSP header has some sort of, yeah, in this case a wildcard directive. So in a lot of cases you could probably say: OK, no matter what, if this is found in a POST request, it's going to be the same flaw. In this case...
A: The URLs are different, so we might want to keep that; a lot of times you don't want to, but it really depends on the flaw itself. Here's another one, directory browsing. There we go: you can see a GET request and a POST request to the same exact URL, with the same evidence. OK, this is clearly the same flaw, so let's just delete one of them and mark this one as matching either POST or GET or any other method.
A: OK, so here's one where we don't care what the parameter value is. It could be anything; it's still going to be the /learn URL, with the url parameter being the actual vulnerable endpoint. So what we do is delete that value and say we're going to ignore the url parameter's value, because we don't care what it is; as long as everything else matches, we're good. So we add that and go back to our results.
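To give a sense of the shape, an expected-report instance with an ignore constraint might look roughly like this (the field names and the 601.1 ID are illustrative guesses, not the tool's actual schema):

```python
# Illustrative only: one vulnerability instance in an expected report,
# telling the comparison to ignore the value of the "url" parameter.
instance = {
    "vulnerability_id": "601.1",    # hypothetical: CWE-601 open redirect
    "method": "GET",
    "uri": "http://.*/learn",
    "parameter": "url",
    "ignore": ["parameter_value"],  # match regardless of the value sent
}
```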
A: All right, so now if we do a diff of our expected report and our expected report with URLs, you'll see that it added these new URLs for us, so the next time we do our comparison it'll include these as well. That's great for URLs, but sometimes you need to add new vulnerabilities as well. Sometimes you, as an auditor, will be looking at the code and say: hey, here's a flaw that the DAST tool missed, and we need to add those vulnerabilities too.
A: Right, so this is what the rule list looks like. It just has the vulnerability ID; the CWE, which is actually the Java class name minus the .java; and then the alert title. A lot of the time we key off the vulnerability titles, because again, there's no concept of a unique ID; so we key off the vulnerability title and then assign it the vulnerability ID. And these two fields are for ignoring parameters or ignoring evidence.
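A hypothetical rule-list row, just to give the flavor (the column names and values here are made up for illustration):

```
vulnerability_id,cwe_class,alert_title,ignore_parameters,ignore_evidence
79.1,CrossSiteScripting,Cross Site Scripting (Reflected),false,true
```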
A: You'll see that it added this new attack evidence to the login URI, method GET, parameter user. This just makes it easier when you're trying to add new vulnerabilities to a baseline report: you don't have to go through and add all these JSON fields and try to get all your syntax correct; it'll just read the CSV file. You can have newlines, too; if the evidence requires newlines, you just double-quote it to keep it together as one chunk.
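Since evidence can span multiple lines, standard CSV quoting applies; a quick illustration of how a multi-line field round-trips (my example, not the tool's code):

```python
import csv
import io

# Evidence containing a newline survives as one field when double-quoted.
row = ["79.1", "GET", "/login", "user", "line one\nline two"]
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())  # the evidence field is quoted automatically
assert next(csv.reader(io.StringIO(buf.getvalue())))[4] == "line one\nline two"
```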
So now we have that; our vulnerability is added.
A: So now we can do a comparison. Let's say we had either a new scan, or we just want to compare what we converted, this NodeGoat ZAP report, against our current expected report. So we're going to run dast-benchmark again, and this is all done in a CI job as well; we give it the path name, and the type is full.
A: It compares against the expected report, the one with the URLs and vulnerabilities we added, and by default it will spit out the results. You can see here it found a number of duplicates; the tool accounts for duplicate true positives and duplicate false positives. A lot of these vulnerabilities are found multiple times in very similar places, so we count those as duplicates, because they're not technically unique.
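A sketch of that duplicate accounting, assuming each finding reduces to a normalized comparison key (my illustration):

```python
from collections import Counter

def count_with_duplicates(finding_keys):
    """Findings that reduce to the same normalized key count once; the
    rest are tallied as duplicates rather than as unique findings."""
    counts = Counter(finding_keys)
    unique = len(counts)
    duplicates = sum(n - 1 for n in counts.values())
    return unique, duplicates

keys = ["79.1|POST|/signup|username"] * 3 + ["89.1|GET|/item|id"]
print(count_with_duplicates(keys))  # (2, 2): two unique, two duplicates
```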
So now we have the benchmark results; we open that up, and you can see what it does.
A: It creates statistics for each type of vulnerability ID. So in this case there are 182 of these; this is the user agent fuzzer, which usually produces a lot of results. We don't really consider these true positives; I would probably remove that and consider this whole class of issues as false positives. You can see, for each one, there are, say, two true positives, duplicate true positives, and so on and so forth. Another key metric is the expected true positives.
A: So if there are only supposed to be three, you need to match against those to see whether you actually have the correct number. We should see one FN: as you remember, we added that cross-site scripting flaw, and obviously it's not going to exist in the scan report because we just added it, so it's marked as a false negative. And then we had a number of duplicate true positives.
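Roughly, the per-ID statistics reduce to instance counts like these (a sketch; the tool's actual report carries more fields):

```python
from collections import Counter

def per_id_stats(expected, found):
    """Compare per-ID instance counts: if three instances of 79.1 are
    expected but only two are found, one is a false negative."""
    exp, got = Counter(expected), Counter(found)
    stats = {}
    for vid in set(exp) | set(got):
        tp = min(exp[vid], got[vid])
        stats[vid] = {
            "expected": exp[vid],
            "true_positives": tp,
            "false_negatives": exp[vid] - tp,
            "false_positives": got[vid] - tp,
        }
    return stats

# One expected 79.1 instance is missing; 16.1 was not expected at all.
print(per_id_stats(["79.1"] * 3, ["79.1", "79.1", "16.1"]))
```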
A: Yep, yeah, and you'd want to manually verify and review these. Obviously the verification process is the most time-consuming part, and it requires the kind of skill set you need for identifying these types of issues, especially if you're trying to add new vulnerabilities that the tool has not found before. But overall, if you wanted to just, for example, take this expected report and then use it to do a kind of base comparison, you could say:
A: OK, I ran one scan and got these results, so I'll generate my expected results from that. Then I do a new scan: how much does it differ? Am I finding 150 new vulnerabilities? That means there's some kind of variance in your scan; maybe the DAST tool is doing something wrong, or it found something new. So you could still use it that way.
A: Now, this is tied to our own report format only. In the future it would be nice to be able to compare against how other tools do as well; that would be nice, but I think for v1 we just want to be able to use this tool as more of a QA process, to make sure that our scanner is getting the proper amount of flaw and scan coverage.
A: Every time this benchmark is run on the master branch, it will generate a new entry. So let's take a look; right now only DVWA is in there, but we're going to load that up. This gives you a nice summary: OK, we got 48% scan coverage here, and the average is looking good, above 70%. Then for each different flaw we can see the expected count versus how many it actually found, and again the false positive rates, false negatives, duplicates, and so on and so forth.
A: Hmm, that was weird. OK, so here it's doing the comparison, and you can see we got the same exact results, which is good. This is what you want to see if you run the same configuration against the same application twice. I don't have the tables for doing the comparison, just because the tables are kind of hard to work with, but yeah. Well, I definitely appreciate it.
A: If you wanted to look at the chart view of this, it would probably be easier to see what was different. Obviously these are all going to be the same bar charts, because it found the same exact issues, but if we compared it against, say, a baseline scan, which doesn't do actual attacks, you're going to see very different results between the two. If it loads.
B: Meanwhile, while it's loading, I was curious: I don't think I've seen anything in the rule set around headers. Is that something we intentionally omitted, or is it just up to the test itself whether to use headers or not? It seems like everything is around the query string and HTTP; nothing for headers.
A: So in this case, for the evidence or the parameter, I believe the header name would be the parameter value in here. So if you were, say, attacking a header, and the header name was something like Content-Type, then that header name would be filled in here as the parameter value.