GitLab Secure: Brown Bags, 21 Jul 2020

From YouTube: 2020.07.21 - Brown Bag: Unit-test Derived Fuzzing

Description

This BrownBag session discusses problems and solutions for deriving fuzzing harnesses from existing unit tests.

BrownBag issue: https://gitlab.com/gitlab-org/secure/brown-bag-sessions/-/issues/28

A

All right, can everybody hear me okay? Yep, awesome. So this is a brown bag session about unit-test-derived fuzzing. There turned out to be a lot to talk about here, so we won't go too much into code or see this run with Python, Ruby, and Go like I had originally planned, but we will cover everything we need to do to get there. All right, let's get started. So who am I? My name is James Johnson.

A

I am a staff security engineer on the Vulnerability Research team within the Secure group at GitLab.

A

I am doc savage on GitLab; that's my username, so you can see the work I'm doing there. These slides are made with a tool I wrote called lookatme, so if you wanted to follow along, this is what you would need to do.

A

Basically, it renders Markdown slides in the terminal so that I can mix code and bash prompts. All right. So the problem that we'll talk about in this brown bag session is making it easier for developers to start using fuzzing. There are a lot of things you have to do to be able to use it, and use it well, with a project.

A

You have to have knowledge about fuzzing methodologies: how to know if it's working well, or if it's fuzzing things that aren't even related to what you're working on.

A

A lot of times you have to set up environments again, copy and paste a lot of code, and create fuzzing harnesses to do all of the work. Then, once you get everything set up, you have to actually run it, monitor for results, and tweak the settings of your fuzzer to dial it in so that it's very targeted and does what you're expecting.

A

So these are all hurdles that hinder developers from easily using fuzzing within their normal workflows, and so this brown bag talks specifically about trying to derive fuzzing harnesses directly from unit tests. I've done another brown bag about doing this with pytest, using pytest auto explorer.

A

That has a different approach; this brown bag session is about a more general, more sustainable approach that isn't so specific to Python.

A

All right, so in standard fuzzing situations, the fuzzer will often modify a single input, and the fuzzing harness will forward that input to the target binary or target program, doing whatever setup needs to be done for that input to actually be processed. This is a very simple, straightforward case; it works very well for image processors and other things that process raw data. Now, for deriving fuzzing harnesses or fuzzing code directly from unit tests, things won't be quite that simple.

A

So here is an example of a very straightforward, classic fuzzing harness. This is taken from the Chromium source code itself. This is fuzzing their time parser library, and again it just operates directly on raw data. So as a developer, this is all I would need to set up. The LLVMFuzzer entry point is all part of libFuzzer, and libFuzzer itself takes care of the mutations.

A

You can override it and provide a custom mutator, but by default all you have to provide is the harness that forwards the data to the correct, targeted location within your project. So as a developer, this is all I would have to do. I wouldn't have to worry about mutation or anything, just forward the data on. And again, it is only raw data: an array of bytes. A few things to note about this, some of which I already mentioned: the data is already mutated, and this is called once per mutation.
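The Chromium harness shown on the slide is not part of the transcript. As a rough illustration of the same shape, here is a minimal Python sketch of a harness that receives already-mutated bytes and forwards them to a time parser. The function name `fuzz_one_input` and the choice of `datetime.strptime` as the target are assumptions made for this example; they are not the code from the talk.

```python
from datetime import datetime


def fuzz_one_input(data: bytes) -> None:
    """Hypothetical harness: called once per mutated input by a fuzzer."""
    # The only setup is turning the raw bytes into a string.
    text = data.decode("utf-8", errors="replace")
    try:
        # Forward the data to the target (here, a stdlib time parser).
        datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Rejecting malformed input is expected behaviour, not a finding.
        pass
```

The harness itself does no mutation; a fuzzing engine would be responsible for producing each `data` value.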

A

So if you have very heavy setup processes, as a developer you're going to have to be aware of that and set everything up in a way that won't slow down your fuzzing. There's a lot to be aware of as you write fuzzers. Basically, the faster your fuzzers run, the more success you'll have. If you can only fuzz a few iterations per second, it's better than nothing, but if you can make it exponentially faster, it's obviously much, much better.

A

Now suppose we have this Go code. We have a sum function that accepts two parameters, and a test function that tests it. This might be something that I would write if I'm writing a basic math library.

A

Now, it does look very similar to the fuzzing harness here. It's a single function, and inside it sends data on to the target library or project in a targeted manner; the target here is the sum function. Except now we have two variables and it's not raw bytes, so that right there makes it harder to make it work with libFuzzer or other types of fuzzing libraries.
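The Go code from the slide is not in the transcript. A rough Python analogue of what is described (a two-parameter sum function and a unit test for it) might look like the sketch below. Python is used purely for illustration, and the names `sum_values` and `test_sum_values` are assumptions.

```python
def sum_values(a: int, b: int) -> int:
    """Target function: takes two typed parameters rather than raw bytes."""
    return a + b


def test_sum_values() -> None:
    """Unit test: forwards known-good values to the target and checks the result."""
    total = sum_values(5, 5)
    assert total == 10
```

Compared with a byte-oriented harness, the two integer arguments are what a fuzzer would need to learn how to mutate.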

A

So the rest of this brown bag is mostly going to talk about how we can take these unit tests and map them into the same type of setup that fuzzing usually uses today. All right, so let's look at this one more time. This is called once per iteration. There's mutation that happens; it's taken care of by libFuzzer, so we don't have to do it ourselves.

A

There is a corpus of data. We will talk about that, but it is basically a way of seeding the fuzzer so that you have known-good starting points. Otherwise, you start with empty data and you, or the fuzzer, have to guess the correct format. It also keeps track of whatever the fuzzer deems interesting, and we'll get to that as well. These are also usually feedback driven, and there's only one parameter fuzzed at a time.

A

Now, if we look at this, though: we could easily put this in a loop and call it once per iteration. We'd still have to modify some of the parameters, so we have to take care of mutation. The corpus is not easily supported.

A

You have to think about it totally differently. It is not just a collection of raw sets of bytes in the corpus; it's a lot more complicated when you're fuzzing function calls, or multiple function calls, from a unit test. Feedback driven: this should be very easy to map, and we'll talk about that. One input parameter fuzzed at a time: I did want to put this in its own category to talk about, because it is a different problem set than just fuzzing a set of bytes.

A

All right, so: called once per iteration. This could have just been a bullet point, but there is actually a little more to it than that. Unit tests have setup and teardown.

A

So if we look back at this, there is some setup here: it creates a string and then uses that string when parsing the time.

A

Now, if you look at, say, a libpng function for libFuzzer, there's maybe 50 lines of setup code before you actually forward the data to libpng. So that concept still applies, but with unit tests you have to be aware of the testing framework and still maintain that setup and teardown. This was something that I ran into with pytest auto explorer.

A

I ran into database problems because I wasn't running the setup and teardown on every fuzz iteration. A unit test would try to create a new user and then do operations on it, but primary key constraints failed, and I was getting all of these other errors that didn't have anything to do with actual bugs in the code. So when calling a test case multiple times, there's more to be aware of than just putting it in a loop.

A

One of the other aspects, though, is that the developer's logic for testing can also be used during the fuzzing process. So if the developer specifically says I care about... oh, let's go back to this.

A

Let's say the developer didn't want the sum function to ever return the value 10. Instead of checking that 5 plus 5 equals 10, the developer actually cared about the total never being equal to 10, for some reason. We could maintain that logic from the developer during fuzzing and shake out cases where the total still ends up being 10.
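A minimal sketch of keeping the developer's check in place while the arguments are fuzzed is shown below. The function names, the seed arguments, and the small random offsets used as a "mutator" are all assumptions for illustration only.

```python
import random


def sum_values(a, b):
    return a + b


def check_never_ten(a, b):
    """The developer's invariant, lifted from a hypothetical unit test."""
    total = sum_values(a, b)
    assert total != 10, f"sum_values({a}, {b}) returned 10"


def fuzz_invariant(seed_args, iterations, rng):
    """Mutate the known-good arguments and record inputs that break the invariant."""
    failures = []
    for _ in range(iterations):
        a = seed_args[0] + rng.randint(-3, 3)
        b = seed_args[1] + rng.randint(-3, 3)
        try:
            check_never_ten(a, b)
        except AssertionError:
            failures.append((a, b))
    return failures
```

For example, `fuzz_invariant((4, 5), 500, random.Random(0))` starts from arguments that pass the check and collects the nearby argument pairs that violate it.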

A

All right, so mutating binary data is very straightforward. It's very simple to change a byte, add a new byte, or mix things up, and you can add all types of logic on top of it. For example, this is the go-fuzz mutator. There are actually 19 different cases in the source code for this; this is just as many as would fit on the slide.

A

Often it has to do with these types of things: removing bytes; treating, say, four bytes in a row as an unsigned integer and incrementing, decrementing, or shifting it left or right; flipping bits; that type of thing. And again, every fuzzing framework has its own set of mutators, and they're kind of derived from the experience of the person who wrote the fuzzer.
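The go-fuzz mutator from the slide is not reproduced in the transcript. The sketch below shows a few byte-level mutations of the kind just described, written in Python; the function names and the selection of mutations are assumptions, not go-fuzz's actual cases.

```python
import random
import struct


def remove_byte(data, rng):
    if not data:
        return data
    i = rng.randrange(len(data))
    return data[:i] + data[i + 1:]


def insert_byte(data, rng):
    i = rng.randrange(len(data) + 1)
    return data[:i] + bytes([rng.randrange(256)]) + data[i:]


def flip_bit(data, rng):
    if not data:
        return data
    out = bytearray(data)
    out[rng.randrange(len(out))] ^= 1 << rng.randrange(8)
    return bytes(out)


def increment_uint32(data, rng):
    """Treat four consecutive bytes as a little-endian unsigned int and add 1."""
    if len(data) < 4:
        return data
    i = rng.randrange(len(data) - 3)
    (value,) = struct.unpack_from("<I", data, i)
    out = bytearray(data)
    struct.pack_into("<I", out, i, (value + 1) & 0xFFFFFFFF)
    return bytes(out)


MUTATORS = [remove_byte, insert_byte, flip_bit, increment_uint32]


def mutate(data, rng):
    """Apply one randomly chosen mutation."""
    return rng.choice(MUTATORS)(data, rng)
```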

A

Now again, this is directly operating on raw bytes. So how do we mutate built-in types? There are collections, strings, integers, booleans. Booleans are a very simple one: the only thing you can do is flip it the other way, and that's about it. But what if you have a dictionary, or a list, or a nested set of lists?

A

You have to deal with that in a way that makes sense. A lot of this isn't something that I've done a lot before, so there's a lot to be explored here. This sample code is what I did with pytest auto explorer to mutate dictionaries.

A

So it'll pick a random value in the dictionary. It recursively finds a value (not necessarily a leaf value, any value in there): if it's a nested dictionary, it may recurse into it, find a value, and then mutate that value.

A

But then again, if we're dealing with different types of data, that value could be an integer, it could be a list, it could be a string.

A

You have to have mutators for all these different types so that you know how to mutate them when you run across them. Some of the other operations are cutting keys out and inserting keys. And this is just to mutate. It talked about... okay, crossing over. I think I skipped over that.
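The pytest auto explorer sample shown on the slide is not in the transcript. The following is an independent sketch of the idea described: per-type mutators plus a dictionary mutator that can mutate a value (recursing into nested containers), cut a key out, or insert a key. All names and the specific mutation choices are assumptions.

```python
import random


def mutate_value(value, rng):
    """Dispatch to a mutator based on the runtime type of the value."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, str):
        return value + rng.choice("abc")
    if isinstance(value, list):
        if not value:
            return [0]
        i = rng.randrange(len(value))
        return value[:i] + [mutate_value(value[i], rng)] + value[i + 1:]
    if isinstance(value, dict):
        return mutate_dict(value, rng)
    return value  # unknown types are left alone in this sketch


def mutate_dict(data, rng):
    """Return a mutated copy: change one value, cut one key, or insert one key."""
    out = dict(data)
    if not out:
        out["new_key"] = 0
        return out
    action = rng.choice(["mutate", "remove", "insert"])
    key = rng.choice(sorted(out, key=repr))
    if action == "mutate":
        out[key] = mutate_value(out[key], rng)
    elif action == "remove":
        del out[key]
    else:
        out[f"key_{rng.randrange(1000)}"] = rng.randrange(256)
    return out
```

The mutators return new objects rather than modifying their input, so a corpus entry can be reused as a starting point for many mutations.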

A

All right, so crossover. I have a slide that talks about it... just a second, let me find it. Oh, okay, it's in a different section, so we will circle back to what crossover means in this setting, but basically it is merging two sets of inputs. So say we have these two inputs, a and b, two integer values in hex. If you wanted to mix them, these are ways you could do it.
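The values of a and b from the slide are not in the transcript. As a sketch of what mixing two integers (or two byte strings) could look like, here are three hypothetical crossover helpers; the names, the 32-bit width, and the example values in the usage note are assumptions.

```python
def crossover_halves(a: int, b: int, width: int = 32) -> int:
    """Keep the high half of a and the low half of b."""
    half = width // 2
    low_mask = (1 << half) - 1
    high_mask = low_mask << half
    return (a & high_mask) | (b & low_mask)


def crossover_bits(a: int, b: int, mask: int) -> int:
    """Take bits from a where mask is 1 and from b where mask is 0."""
    return (a & mask) | (b & ~mask)


def crossover_bytes(a: bytes, b: bytes, point: int) -> bytes:
    """Classic single-point crossover on raw byte inputs."""
    return a[:point] + b[point:]
```

For instance, `crossover_halves(0x12345678, 0x9ABCDEF0)` gives `0x1234DEF0`, and `crossover_bytes(b"abcd", b"WXYZ", 2)` gives `b"abYZ"`.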

A

Now, whether this actually makes sense and gets you results, I have no idea, but this is a concept that is used when you're fuzzing binary data. You have two sets of inputs that were previously deemed important, and you cross them over to create a new input that's blended from both of them. So trying to map fuzzing these native data types in different programming languages to that concept may not always make sense; in this case maybe it gets you something, but maybe not.

A

All right. So the other part of this is that you also have non-built-in types. You have data structures, you have classes, you have whatever the developer dreamed up, and these will be passed around into the functions that are being targeted in the unit test.

A

One example of a project that already modifies these at runtime is gofuzz, and that is different from go-fuzz. go-fuzz is the libFuzzer implementation for Go, or, to phrase it another way, it's a fuzzing library that uses libFuzzer, made specifically for Go. gofuzz specifically modifies fields randomly on Go objects.
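As a rough Python analogue of randomly modifying a field on a developer-defined object, the sketch below uses dataclass introspection. It does not reproduce the gofuzz API; the `User` type, the `mutate_field` helper, and the per-type mutations are all hypothetical.

```python
import random
from dataclasses import dataclass, fields, replace


@dataclass
class User:
    """Hypothetical developer-defined type passed to a target function."""
    name: str
    age: int
    admin: bool


def mutate_field(obj, rng):
    """Pick one field via introspection and return a copy with it mutated."""
    field = rng.choice(fields(obj))
    value = getattr(obj, field.name)
    if isinstance(value, bool):  # bool first: bool is an int subclass
        new_value = not value
    elif isinstance(value, int):
        new_value = value + rng.randint(1, 10)
    elif isinstance(value, str):
        new_value = value + rng.choice("xyz")
    else:
        new_value = value  # unsupported types are left alone in this sketch
    return replace(obj, **{field.name: new_value})
```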

A

And if we wanted to do something similar in other languages, we would have to use runtime introspection, and I will get to the place where that introspection would occur. All right, so let's talk about a corpus of inputs. As I mentioned, it seeds your fuzzer. It gives you a known-good starting place, something that gets you relatively deep into the code without having to start from nothing, and it helps a lot with genetic algorithms and hill-climbing algorithms.

A

So, for example, libpng: just during its normal development, they have collected a series of test PNGs that you could very easily use as a corpus if you wanted to fuzz libpng, and these are intentionally made to cause libpng to go down different code paths. They're for testing, so there is a lot of correlation between fuzzing corpuses and data that you might collect just throughout your normal process of writing unit tests.

A

Now, some of this data we will be able to derive from the function calls themselves inside the unit test. But how do we track a corpus of inputs from the unit test? If we look back at this, we're not dealing with raw data; we're dealing with native data types, maybe collections.

A

So if we want to track that in a corpus, we'll have to think about it completely differently. We'll want to save function call invocations, and we'll need to track the variables that are used during the unit test. Not all variables that are in the unit test will be relevant, or maybe even passed directly to the target function, so we'll have to do some analysis of the code to make that work.

A

One other thing, which I'm on the fence about whether we need or not, though I think we would: this is something that comes up a lot with DAST and API fuzzing. We may also need to keep track of sequences of function call invocations.

A

So, instead of having a corpus of unique function calls, we may need to track sequences of function calls. If we look back at this one, suppose the unit test calls sum and then, with the return value, calls sum again with that value as another argument, or it chains a few function calls together. We may need to track that as a series in order to have a corpus that makes sense for the unit test.

A

All right, so this is something that I took from pytest auto explorer. This is the type of data that I was saving for every new crash or error that it detected. There's the file, here's the source, and these are all of the inputs that were passed to the function. This is the type of data that is being mutated by pytest auto explorer, and this is saved in memory; none of it was ever written to disk.

A

So if we wanted to make unit-test-derived fuzzing that operates on function calls work, we'll have to be able to serialize this to disk so that we can persist the corpus and reuse it in later fuzzing sessions. And again, it just gets more complicated because we're dealing with data types that aren't straightforward.
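One possible shape for such a corpus, sketched under stated assumptions: each corpus item is a sequence of recorded function calls, and the whole corpus is serialized as JSON. The entry layout and every function name here are hypothetical, and JSON only covers JSON-native argument types; developer-defined classes would need their own encoding.

```python
import json
from pathlib import Path


def make_entry(function, args, kwargs=None):
    """One recorded function call invocation."""
    return {"function": function, "args": list(args), "kwargs": kwargs or {}}


def dumps_corpus(corpus):
    """Serialize a corpus (a list of call sequences) to a JSON string."""
    return json.dumps(corpus, indent=2, sort_keys=True)


def loads_corpus(text):
    return json.loads(text)


def save_corpus(path, corpus):
    Path(path).write_text(dumps_corpus(corpus), encoding="utf-8")


def load_corpus(path):
    return loads_corpus(Path(path).read_text(encoding="utf-8"))


def replay(sequence, functions):
    """Re-run one recorded sequence against a name-to-callable mapping."""
    results = []
    for entry in sequence:
        func = functions[entry["function"]]
        results.append(func(*entry["args"], **entry["kwargs"]))
    return results
```

A sequence such as `[make_entry("sum_values", [5, 5]), make_entry("sum_values", [10, 3])]` captures the chained-call case as a single corpus item.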

A

All right, so any questions before I talk more about feedback-driven fuzzing and how that would play out with unit-test-derived test cases? No? Okay. All right, so feedback-driven fuzzing. I've got a few links here if anybody needs them for reference. So, genetic algorithms: they mutate data to try to... I didn't add this in here, but there's the concept of a fitness function, something that says one input is better than another: more ideal or more performant.

A

Whatever the fitness function is, an item may be mutated and then run through the fitness function to see if it's better, or two items may be merged together and then tested to see if they produce a better outcome.

A

So this is exactly what most fuzzers are using now. If a fuzzer is not explicitly using that, it may be using a variant of a hill-climbing algorithm, where one solution is found and then incremental changes are made to the input, again with some sort of fitness function to tell whether you're making incremental progress.
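A toy illustration of hill climbing with a fitness function follows. It is deliberately artificial: the fitness function here is told the goal bytes directly, whereas a real fuzzer only observes indirect feedback such as coverage. The names `fitness` and `hill_climb` are assumptions.

```python
import random


def fitness(candidate: bytes, goal: bytes) -> int:
    """Toy fitness: number of positions that already match the goal."""
    return sum(1 for x, y in zip(candidate, goal) if x == y)


def hill_climb(goal: bytes, rng, max_steps: int = 100000) -> bytes:
    """Keep an incremental change only when it scores strictly better."""
    current = bytes(len(goal))
    for _ in range(max_steps):
        if current == goal:
            break
        candidate = bytearray(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(256)
        if fitness(bytes(candidate), goal) > fitness(current, goal):
            current = bytes(candidate)
    return current
```

Because each accepted step can only increase the score, the search converges one byte at a time instead of having to guess the whole input at once.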

A

All right, so if we look at these concepts with unit testing in mind, we need a fitness function. Code coverage is something that is already used by most testing frameworks, and a lot of projects have that set up; it's very common. So getting that type of feedback for fuzzing should be relatively straightforward.

A

Unit tests and testing frameworks are already set up to deal with that. Now, this is an example of how feedback-driven fuzzing would work, and we'll walk through it. Let's say we have this function, handle data. It takes two inputs: an array of bytes (chars) and its length. If the length is less than two it returns; otherwise it does some processing. Now, suppose we call this function with an empty array of bytes, length zero, and nothing in our corpus.

A

These are the lines that are hit, and we've never seen them before, so it's new coverage, and we add this data to our corpus. All right. So now we choose a new set of data. Choosing the new data is done by the mutator, optionally using the corpus as a starting place.

A

So now, if we run this function again with the input a and a length of one, we run through the same lines, we don't cover any new code, and we add nothing to the corpus. Again, if we do it with b, it doesn't help us. Now let's say we use the previous input b, or we decide to randomly put together two bytes, and we have ba.

A

Now we got past this first check and we're here, which means we had new coverage, so now we have two items in our corpus. We choose the last item we had, we mutate it, and we come up with h a. So now we proceed a little further into the code, and we have b a and h a in the corpus. Now suppose we had mutated h a to have an h and a lowercase a. Now we've made it even further.

A

Now we're processing that input, and again the same process would occur since we have these items in our corpus. These tend to be prioritized; that logic is found in the fuzzing framework itself.

A

So recent items in the corpus are sometimes prioritized over old items, so the odds of this being randomly created are pretty high, and then the process continues: we take another item from the corpus, randomly mutate it, and try to get further into the code.

A

So if we watch it run a little faster, it's pretty cool how it works: if you have that fitness function, that feedback mechanism, you can continuously derive values that make their way further into the code.

A

Now, code coverage is usually the type of feedback that is talked about. It's kind of the easy one, the obvious one, but it really can be any type of fitness function, and I wanted to bring this up because it doesn't have to have anything to do with the code directly.

A

So I wrote something for fun that doesn't use code coverage for feedback. It uses performance events: instruction counts and branch counts are its feedback mechanism. So you don't have to instrument the code. All you have to do is run the target binary with these performance measurements in place.

A

Actually, let's look at the target. So this is what the target looks like: very similar to the test example, except it's very deeply nested, and it should be very obvious whether the fuzzer is more performant than just randomly generating, what, nine bytes with the correct values.

A

All right, so if we run this, it does find it pretty quickly, and this is using performance events instead of direct code coverage. So the point I often bring up is that code coverage is just one feedback mechanism, and this is one example of that. So if code coverage is unavailable during unit testing, there are other ways that we can figure out if we are progressing further into the code.

A

And some of those may even be application specific. But again, this brown bag is about trying to figure out ways to make fuzzing more accessible to developers without them having to be fuzzing experts.

A

All right, so one input parameter is fuzzed at a time. I've talked about this a lot already; I don't think we need to talk about it again. So now, this is where the actual implementation would come into play, and I'm saying it that way because I have not made it as far on the code for this as I wanted to. So that's why the brown bag is kind of ending around here.

A

But this is my approach, and I am in the middle of it; I'm in here somewhere. The methodology that I am using right now is to parse the existing unit test into an AST, an abstract syntax tree, and then rewrite all the function calls. Previously, I had rewritten the Python bytecode to do this hooking, this instrumentation to capture known-good values from the original unit test, but that's really not that sustainable and it's very Python specific. Rewriting the source code should work between major versions of the language, roughly, unless it's a very new language.

A

It would also work across languages. You would still have to implement it for each one, but it is a bit more generic and, I think, a bit more sustainable. You would also need to monitor code coverage, or whatever feedback mechanism you would use.

A

Once you add or rewrite the code to capture function calls and monitor code coverage, you would run the tests normally to capture those known-good values at runtime. So any runtime introspection would happen at this stage.
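A minimal sketch of the source-rewriting step, using Python's standard `ast` module: calls to selected target functions are wrapped in a recorder, and the rewritten test is then run normally to capture its known-good arguments. This is not the pytest auto explorer implementation; `record_call`, `CallRewriter`, `run_instrumented`, and the embedded test source are all hypothetical.

```python
import ast

CAPTURED = []


def record_call(name, func, *args, **kwargs):
    """Runtime hook: remember the arguments, then call the real function."""
    CAPTURED.append({"function": name, "args": list(args), "kwargs": kwargs})
    return func(*args, **kwargs)


class CallRewriter(ast.NodeTransformer):
    """Rewrite target(x, y) into record_call("target", target, x, y)."""

    def __init__(self, targets):
        self.targets = set(targets)

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id in self.targets:
            wrapped = ast.Call(
                func=ast.Name(id="record_call", ctx=ast.Load()),
                args=[ast.Constant(value=node.func.id), node.func, *node.args],
                keywords=node.keywords,
            )
            return ast.copy_location(wrapped, node)
        return node


TEST_SOURCE = '''
def sum_values(a, b):
    return a + b

def test_sum_values():
    total = sum_values(5, 5)
    assert sum_values(total, 1) == 11
'''


def run_instrumented(source, targets, test_name):
    """Parse, rewrite, compile, and run one test to capture its calls."""
    tree = CallRewriter(targets).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {"record_call": record_call}
    exec(compile(tree, "<instrumented>", "exec"), namespace)
    namespace[test_name]()
```

After `run_instrumented(TEST_SOURCE, ["sum_values"], "test_sum_values")`, `CAPTURED` holds both calls in order, including the chained call that uses the first result as an argument.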

A

Once you have captured all of that data, you'll have a corpus of function calls, and then you can create standalone fuzzing harnesses for all the functions that you want to fuzz. Exactly how that would occur is still to be determined.

A

You could have a thousand unit tests, and you wouldn't create a thousand separate targets; that wouldn't really make sense to me. I think you would probably round-robin through each of them on each fuzzing iteration. But yeah, that is about it. Preserving the imports, the environment, and the module hierarchy of everything you need for the unit test is something you would have to do when you parse the AST and create the standalone fuzzing harnesses.

A

And yup, that's what the rest of this would be: draw the rest of the owl. There is a pretty clear path to it. I did not want to put off the brown bag yet again just so that I could have the code in the state that I want before we talk about it. But yeah, that is the process, and so far everything does seem to be panning out. Does anyone have any questions? Anything you'd like me to go over a bit more?

B

Hey James, thanks for the talk. I do have a question, just a pretty generic one. You mentioned that you're currently working on this; I'd be curious to check it out. I noticed that the one performance fitness test that you had was written in Rust. Is that the...

A

Yeah, that one. So, let's see: Rust isn't very high on the priority list as far as work is concerned. It's more of a new language that I wanted to learn, so this was something that I did for fun. I brought it up because it uses a different feedback mechanism than code coverage, but the same concepts would apply with Rust as well. Is that where you were going with that question?

B

Well, yeah, I was just wondering if that was the thing that you're working on right now, so I don't know.

A

Just for fun.

A

Let's see. I'm actually not sure if I've pushed up the branches. Let me just move this.

A

All right, here we go. So this is the pytest auto explorer project. This is where I'm starting, because I've got a lot of code in place for this. I ran into some snags trying to make it more than an MVC, to make it work very nicely with pytest instead of just rewriting the raw source code and then running it.

A

So currently, the main branch of pytest auto explorer does the instrumentation and captures the function calls. It doesn't create standalone files to do the fuzzing, and that's the aspect that I'm currently working on. Once I've figured out those topics with pytest auto explorer, I was going to basically re-implement them in Ruby and Go, and they each have different tool sets that could help you with it. But pytest auto explorer would be the one to look at right now; it's the one I've been focusing on.

B

Cool, nice. So would this be kind of in the same vein as the generic SAST idea, where the fuzzer is language agnostic, or would we be implementing specific...

A

It would have to be specific to a language. I was thinking about that. You have the unit tests from the project, in language X, right? You would have to do the instrumentation on that code.

A

To me, the easiest, most sustainable way would be to parse the source code into an AST and rewrite it, so that would have to be language specific. But maybe there's a way to wrap it into a common library where you can abstract away a lot of the language-specific things. Then, if you wanted to add a new language, maybe all you have to do is implement the specific pieces, like the source code rewriting and a way to understand the testing framework.

A

So you can create a standalone fuzzing harness. You could wrap this in a single tool, but languages would still have to be added incrementally; you wouldn't be able to knock them all out with one stone.

B

Gotcha.

A

All right, well, any other questions? Anybody want to see other slides, or have questions in general?

A

No? All right, cool. Then I will stop the recording here. I did label this one as part one, and I will have a part two once the code is in place and in a state to show. In the past, when I've merged a lot of technical topics with wanting to talk about the code, it tended to get really messy presenting, so I'm kind of liking having it split up. Next time we'll be talking specifically about the code.