Rust Programming Language Rust Verification Workshop 2021, 11 May 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Towards Automatic Verification of Unsafe Rust with Constrained Horn Solvers

Description

Zafer Esen, Philipp Rümmer and Amanda Stjerna

A

Next uh talk is going to be by two speakers again and amanda. uh uh She has just been mentioned at the end of uh nico's talk already, so they are both from uppsala university in sweden and work with philip brima there. Sorry for butchering your name amanda, you can fix it once you have the microphone turned on. I'm really sorry about that, and so they will talk about automatic verification of rust. So please go.

A

B

Okay thanks peter for the introduction. I guess I don't need to repeat the same information, and I hope you can hear me so yeah.

B

This is joint work with my colleagues, philip and amanda, and to start off with the motivation we envision a high enough level encoding, which programs from different languages such as c rust, java can be encoded into and which can then be fed into, for example, smt or csc servers, and here high enough level means that the encoding does not throw away information which can be useful for the solvers in the lower levels.

B

So the encoding should also be able to satisfy the needs of different programming languages, and that is why we are here today that, because we would really appreciate your feedback from the last community and the context for this work is in model checking and automated verification. So we are mostly interested in the automatic verification of non-complicated properties such as proving absence of runtime exceptions and, in this case using constrained horn clauses, which is short, chcs, is a good candidate for this purpose. Chc solvers are actively being developed and cancelled harder and harder problems.

B

Each year- and it is really straightforward, as we will show in a while to encode programs in cscs, so let's take a very simple program which simply takes the maximum of two variables a and b, and then we have an assertion which asserts simply that this max the returned max is the indeed the maximum of these two.

B

So uh on the right, you can see the pro control flow graph of the program, and here I wrote bb0 bb1, just to show that this can easily be generated from the mir, and here we have. Actually, we are already almost in constraint, horn clause format, because now you can see that here bb0 is actually like a predicate, and it has two arguments. A and b and the arguments are the program variables at scope at that time, and these don't need to be necessarily the program variables. We can have other variables here as well.

B

If you want to represent additional things- and one thing to note here is that these are actually universally quantified so, but I will get to that just for now. Just think of this as a simple control flow graph, and I guess I don't need to go through it much just it maps this to the control flow graph. Maybe one thing to note here is the assertion that the negation of this formula here, which says max, is indeed to max the negation of that would lead to an error right in the control flow graph.

B

So if we are to encode this in mathematical formulas, so here the variables are implicitly universally quantified. We don't have any precondition. We can't just start from true and the edges become these arrows here, the implication arrows, and then we have these predicates and each program variable or any other data which we might want to add, become the terms of these predicates and it's then simply a matter of following this structure.

B

So here, for example, we have this constraint a is greater than b.

B

So this is why it's called a constraint, horn clause and you have a predicate on this side and you just model the structure of the program and of course it is really straightforward. For example, if you would have a loop that would just, for example, going from three to zero. You would just have an arrow from three to zero here, and the assertion itself is, the error is simply replaced with false. So if again, the negation holds- and we are at this pb3- we are reaching the false state.

B

B

Here the css exactly capture the correctness of the program, that is the chcs. The set of css are only solvable if the program is correct so and if we get a violation, then we can get a trace like leading to this false. So we know- and here maybe also one thing to note- is that here bb3 you see that there's another a here. That is simply the term being assigned to the max.

B

So if you follow this path, then max has value a and otherwise in the second path max is b, so we could also say max is equal to a or b and then just use max here, okay, I hope I didn't make that too confusing.

B

But if you look at the formula like it has this format a c8c, it has a head and the head is uh in this case.

B

If you look at the example, the bb one, so it is usually written this right to left way, and this is simply an application of this predicate or it can be false in case of this assertions which represent the errors states and we have a constraint, for example, this a is greater than b and then we can have other predicate applications or nothing here, and this is in essential if we need to simply explicitly state that these are all universal quantified aap in this case- and we just write it down in this smt ellipse- not very readable syntax.

B

So there are already various csc based tools such as force, the c language, there's seahorn corn tricera, which we are developing. It's quite experimental, there's a j-horn for java. There is rust on for rust recently, and there are also various solvers for this, for example, spacer, which is part of z3 and there's alderica, which is also partly developed by us, and there is a solar competition which is held yearly cs comp, which pushes developments in the solar area. But this is relatively still developing, like there are new solvers being developed every year.

B

It's quite active field, okay, so one thing uh that we cannot easily encode using uh css is the heap unfortunately so like. If you have a data that resides on dynamic memory hit the heap or if you want to make some heap related operations, then you need to either use arrays, which is provided by smtlip, which is used by this smt solvers or, alternatively, you can transform it away and uh each method has advantages and disadvantages.

B

For example, this usage of arrays is used by seahorn. It's raston and jhorn chooses to transform it away, uh but I think ruston depends on the type system uh so that, like for in case of rust it is, it doesn't really lose anything by transforming it away for safe rust.

B

So if we again look at this diagram, so here what we meant by this high level, including is chc's modulo the heap theory. So without the hips here it's still a good match, but the heap theory makes it even higher level, because we can now preserve the structure of heap as well, while encoding the programs and rather than doing design coding at this upper level.

B

Now we shifted the work to the lower level, so the solvers need to deal with this, but on the bright side the solvers also know about the various like they know about the heat. Now it's not transformed away or it's not.

B

The arrays are not used to implement it. So in this case, this can be actually done at the slower level, and the solvers can also develop native decision procedures which are specialized in solving this kind of clauses, and another advantage of this is, of course, that now one optimization from, for example, which was done at this level if it's done at this level, is available to all the tools. So it's a good way to actually provide a framework for.

B

Improving all the tools at the same time so because all the optimizations are done at the solar level.

B

Okay, let's have a look at the signature of the theory, so what we would need is a we need a null address that would give us an address which is not allocated anywhere in any heap. We would need an empty heap function that would return a heap which is not containing any allocated addresses, and then we can. We need a predicate to check if a given heap and address pair is indeed allocated or not.

B

We also have allocate given a heap and object returns. A new heap that's allocated at this address using this object, and then we have the two partial functions, read and write when it's valid the heap and address pair, it's straightforward. We just returned objects at this location and in case of fleet. We just update that location and return the new hip.

B

So these are like the array select and store operations, but if they are not allocated the heap address pair at this location, then uh in the hip declaration we can define and, for example, an empty object. So this is required to make the functions total, which is required by smtlip, so the read simply returns that empty object in that case. For right, this is not required.

B

We can simply turn the unchanged tip in that case, so this covers reading and updating of data on heap also a location, but it's not really clear what this object sort here is, so the addresses can be used for pointers and if we go back to our previous clauses, for example, now we have a heap sort which can be just tagged along in these horn clauses in the predicates, so we can keep heap state.

B

So we have some new sorts to deal with it, but we don't really know what this object is.

B

So to talk a bit about that, we have thought that adts using the algebraic data types is a good match for uh the object sort, because uh now we can have a single sort on the heap right. We have, for example, integers addresses, maybe more complicated data types, and then we can just wrap those around an object, sort another adt using some referent getter method. So it's only would have this constructor and selector.

B

So, given a sort t, we can wrap it to obtain this object or we can just get whatever is contained, and we also have thanks to the theory of edit, is this tester method, so we can actually without opening this packaging check. If the underlying sort is indeed t or not- and this is supported by smtlip since version 2.6, so it's nothing new and yeah. This allows for very complicated objects to be placed on the heap.

B

So a heap declaration would look like this. I will not go every line, but you just define a heap name. Another address name the sort names and then you choose an object sort which can be either one of these adt sorts. So this is like a data type declaration, as I said like maybe not good to spend time here, but just to show you that uh this was like how the envisioned smt lip declaration of hip would look like.

B

So you have the various swords, and here maybe one thing to look at- is that a node sort with some data and then it has an x field, and this is the.

C

B

Which is defined by this heap declaration, so you can use as pointers these addresses okay, so my colleague commander will take over now. Let me stop sharing.

D

And um I'm mute and sorry while I he's there with me, while I figure out how to full screen so, can you see this?

D

um I don't have a copy. So please tell me you can see this yeah. We can see it excellent thanks. um Okay, so we go back to the previous example we had and, as you can see, it's it's barely translated c, uh among other things, apparently from the lack of the exclamation point on the assertion.

D

But here we've made one of the the parts a uh a reference. So uh when we do the encoding of the uh control photograph here, we would have to start by saying that a is valid on heap. It's what happens here on the first line and then we would continue much as we did before, but we also have now get this and set this and we perform reeds on the heap when we extract the thing and, depending on on how worried we are or how much of the the uh infrastructure we trust.

D

uh We may also want to add a final assertion here that says that the thing we're reading actually is an integer as well. So this is one of the st functions that zafar talked about previously, um so this would actually encode that uh what we have uh reading from the heap is actually an integer.

D

Now I I don't know enough about ansei for us to to know. If that is always the case, I can imagine that it might not always be, uh but if I recall correctly, that's fine. um This is uh how new I am to that um okay. So why are we in particular looking at unsafe? Well, it turns out that, uh as I said, you can pretty much transform away the heap and that is done with rosthorn. So there's nothing interesting left for us to do here.

D

We have we sort of have to go into and say for us to figure out point of chasing and that sort of thing um it might be the case that you can. You can sort of extend the logic um to to actually also capture safe rest, but but this is sort of right now, the low-hanging fruit we're going for, for, as you can tell, we were sort of trying to show you different things and in the hopes that some of this will be useful to you um so because we sort of come from outside okay.

D

So another example that immediately stood out to me when I was looking around for for things to verify is that if we look at the re-alloc uh function from the library, we can see that we have a bunch of pretty nice constraints here, and I mean some of them are obviously easy to verify. We it's it's not that hard to check that uh the size is greater than zero, for example yeah that shouldn't be a problem for you. uh But if you recall, we talked about that.

D

We sort of enrich our uh object from the representation with with this abstract data type uh system, and if we sort of uh go and look at what we can do uh so, for example, we can, we can go ahead and we can store uh the the allocator that we used to allocate some bit of memory and uh the layout we use for the allocation and the size of the allocations. uh Those things we can store uh associated with the pointers so that, uh oh right, you can't see my explanation hands.

D

um That's the problem with with the digital presentation um yeah, and then you can sort of associate those with a pointer and then we can. We can use those to enforce the constraints um using this abstract uh data representation.

D

We can also so so. That would essentially mean that to verify this, we would add assertions stating uh that all the preconditions are satisfied by sort of unpacking the source of this, and, of course, that would kind of look like a lot of flow analysis as well, uh because we would have this information of the source of the points as well um and also, of course, the effect of running the reallocation would be to update the size of the heap object, which we can also track to best of our knowledge.

D

So this is one of the things that sort of stood out to me as something that could be enhanced. By doing this, um let me have a quick look, because I see something happening in the chat and.

A

I think philip has the chat under control. I think you can just go ahead with yourself. Thank you.

D

um I figured that was the case. I mean that's, that's a good thing about having extra people on hand. um Okay, so I also had a look at. We also had a look at this bug that occurs in redox. It's actually from a pretty recent paper that looks at various memory bugs, and the problem here is that on line eight, of course we have. We are overwriting this uh this allocated uh object and which that will that were sort of partially allocated, and uh this will cause a drop on something that isn't initialized properly.

D

um So that's the problem and the fix looks like this, and so you just replace the right with uh with a pointer writing thing uh and if you look at the mirror here, this is just fresh off of the the breast compiler.

D

This is what the mirror would look like and we can sort of uh apply our recipe to to for verifying this, uh using the tools that we redefined.

D

So we will define abstract data types for the possible heap objects and we would use constraint home clauses on these edges of the control photograph and we would use the theory and so on and as far as we got so, for example, we would start by asserting again that this is not a recompiled version. Wait. Okay, um I did some fixes to this, and apparently that is not the version I'm looking at anymore, um but uh we would have not a valid here but an allocation.

D

So what is happening here is that we we um essentially uh carry on the information that this after running this location here we would have um this um variable having come from the allocation, and we would also have I've aligned a lot of stuff that we would have to add for bookkeeping so figuring out that which allocator it comes from that it comes from this particular layout here over here in block one and so on.

D

uh All this information would have to be sort of uh carried along and um yeah, and then once this happens, it shouldn't be. So you can't do reads of elements like this, and this was an early draft.

D

I added, um but it would look roughly like this, so we would have an assertion that essentially makes sure that the read here of that happens as part of the drop down here, but this is correct and uh when we run this to the uh to the home server, it would look at this and say that uh you're you're, reading uninitialized data essentially um and on this unwind edge here we would have uh hopefully try false and I hope that we never have to uh run the unwind.

D

But uh I'm I haven't looked into that enough to be absolutely sure.

D

Okay, so uh the things that we get out of this um so out of the box of things that we accomplished with the um theory is that we can now encode structural properties. We can say things like uh most of the early examples are sort of linked lists. We don't really go well with rust, because it's not the kind of thing you do, but you can still uh encode properties like the nth element of the linked list is.

E

D

um We can also tack on more information that allows us to track the origin or the coherence of various stuff, like with real look and, of course, we can also track initialization stages of memory for some some abstract notion of memory.

D

But given that we have this very abstract notation, we do not model alignment and representation, so we can't, uh for example, model things like transportation and that sort of thing um we don't know what the bits are doing. Really. We have overflows, of course, because we have bit vectors for representing.

E

D

Like like regular in in s, but we can't do alignments and so on, it's beyond the point of sort of. We know that this thing has this alignment, but we don't know what alignment semantically means, because that's not part of the model uh yeah okay, uh so we also have, as safari said, uh a preliminary silver, and when I say we I've definitely been suffer and our supervisor phillip. I haven't done this work, so I can't take any credit for it. um I'm essentially the rust part of this uh this project and yeah.

D

Some of the next steps for this is basically working on improving the interpolation theories for the theory, uh but you're you're, of course, very welcome to to look at this and experiment the software and so on, and that's essentially why we're here we want to have your feedback on this and see if it's useful to you in any way um continuing some stuff. We would like to try doing uh just throwing some spaghetti at the walls here.

D

um It would be interesting, given that we have infrastructure for encoding uh c and for uh generating contracts from from c uh down here. You can see an example of a contract that we've generated, given that we can do that. It would be interesting to to look at in verifying the interaction between rest and see. It'd be interesting to see if anyone can come up with with anything else.

D

To do, I know, but most of you have pretty advanced tools that also use smg, solvers and, and you might be able to adapt them or integrate them with with the theories we have, and it would also, of course be interesting to see how we can adapt this and utilize the uh reachability and typing and flow information that we already have as part of the rust compiler.

D

If you can sort of piggyback off of that, and to what extent we can do that um and yeah. That was pretty much sums it up. I think land is within the time slot, so I have to thank you for listening.

A

Thank you very much for the presentation, also to the software. Of course, we had quite some discussions already in the chat, but maybe there are more questions.

A

Maybe let me ask one in the meantime, while people think about more questions, so I'm wondering how would the interface between safe, rust and unsafe rust look when you said at the beginning that save for us can be handled by essentially just uh compiling it away encoding it away. uh I mean encoding the heap of way, but then on the unsafe side the heap is represented. So so how would you work on the interface there.

D

I'm going to have that old stuff just immediately.

B

Yeah, okay, I'm not is that amanda or, as amanda said, she's the rust expert. So I'm not exactly sure, but I guess one could.

B

For example, uh I mean one option would be to encode everything I guess in hip, but maybe that's not the best, because rust, as far as I know, deals with a lot of things already so as rustron does like it just depends on the type system and or another option might be to maybe determine what holds at the entrance of these unsafe rust points and maybe assume those and just add some assertions attempt like kind of like contracts, maybe for unsafe, rust but yeah. That's probably you have much better ideas telling me about this.

B

I'm I really don't have that much experience with trust.

A

So let me ask a follow-up question anyway or let's let ralph ask the question so he says: if you have interior ability, you can also not just compile the heap of ways. Do you have ideas how you want to handle that.

B

I'm not really sure if.

C

So if I might chime in so, I think our idea or vision would be that we actually encode the complete um program, including the safe parts, using the theory of feed and then what is done in raston, for instance. um So this would be a precision optimization that could be implemented within the theory solver as well, so we could actually identify. Does the program make use of these features that we cannot compile away? In that case, we fall back to kind of the more heavyweight heap machinery, and otherwise we do the efficient encoding.

C

Which would have the side effect of course, then, that the clever encoding of rostone would also apply to programs in c. That would follow the same discipline, for instance right, but really this is in a very early stage, so.

A

Yeah yeah okay. So if there are no further questions, then uh thanks again to both speakers- and I have a.

E

A

For answering questions now, nico has a question.

E

Just to follow on to that, to that point, uh if you were to do that, optimization, is that something that you could do like? Would it have to be true of the whole program, or could you do it for isolated parts or subsets of the heap, or something like that.

C

I'm quite sure that you could do it only for parts of the heap. Yes now, if you don't have any results for that yet, but I'm quite sure that this could be done. Yes,.

A

All right good! So thanks a lot.