Description
Fast, Safe, Pure-Rust Elliptic Curve Cryptography by Isis Lovecruft & Henry De Valence
This talk discusses the design and implementation of curve25519-dalek, a pure-Rust implementation of operations on the elliptic curve known as Curve25519. We will discuss the goals of the library and give a brief overview of the implementation strategy. We will also discuss features of the Rust language that allow us to achieve competitive performance without sacrificing safety or readability, and future features that could allow us to achieve more safety and more performance. Finally, we will discuss how -dalek makes it easy to implement complex cryptographic primitives, such as zero-knowledge proofs.
A: So, what is Curve25519? In order to talk about our library, it's necessary to situate it within the stack it sits in. You have your application at the top, and your application is using some sort of cryptographic protocol; that could be, for example, a signature, a key exchange, or a zero-knowledge proof. Underneath that you have an abstraction layer, normally called a group in cryptography.
This also results in super excessive copy-pasta: cryptographers have this thing where they tend to literally copy-paste each other's code around. This is exacerbated by a lot of cryptographers somehow thinking it's appropriate to distribute a tarball of their code, unsigned, inside another tarball of a benchmarking library, and that's how you're actually supposed to get it as an end user. It's just mind-boggling to me. So, anyway.
This leads to large, monolithic code bases which are idiosyncratic, incompatible with one another in really hard-to-debug ways, and often highly specialized to perform only the single protocol they implement, which is usually signatures or key exchange. There's just no consideration that there's this whole rest of the field of cryptography, and that you might want to do something other than these two protocols. Worse, some of the bugs I've personally seen in major, widely used cryptographic libraries (I'm not going to name any names) include using C pointer arithmetic to index into an array. Just as a recap: in C, array indexing works both ways, so taking the sixth element of the array can be written either as a[6] or as 6[a].
I've seen code overflowing signed integers in C and expecting the behavior to be the same across different platforms and compilers. This is canonical undefined behavior: just don't do this. And I've seen code using basically untyped integer arrays (in Rust it would be a [u8; 32]) without using the type system at all, declaring that this is the canonical representation of multiple things in the library, multiple things which are mathematically fundamentally incompatible.
For example, when you have an elliptic curve point, you can usually compress it to 32 bytes by taking either the x or the y coordinate, so that could be a [u8; 32]. A scalar, that is, a number, could also be 32 bytes. These are things that are not mathematically compatible; they shouldn't be interchangeable, and your type system should be protecting you against making errors like this.
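As a sketch of what that protection can look like (hypothetical types, not the library's actual API), you can wrap the same 32 bytes in two distinct newtypes so the compiler rejects mixing them up:

```rust
// Hypothetical newtypes: both wrap 32 bytes, but they are distinct types.
pub struct CompressedPoint(pub [u8; 32]);
pub struct Scalar(pub [u8; 32]);

// A function like this can no longer be called with its arguments swapped,
// even though both arguments are "just 32 bytes" underneath.
pub fn scalar_mul_bytes(s: &Scalar, p: &CompressedPoint) -> ([u8; 32], [u8; 32]) {
    (s.0, p.0)
}
```

Trying to pass a `Scalar` where a `CompressedPoint` is expected then becomes a compile-time error rather than a silent mix-up.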
Or how about this one: using pointer arithmetic to determine both the size and the location of a write buffer. And there are still more bugs; I could keep going with a lot of horror stories of things that I've seen. So we didn't want to do this in C, obviously, so we started working in Rust. The design goals of our library were that it should be usable for other cryptographers to implement their protocols, and it should be fast to write: essentially the same as writing a Sage script.
It should be safe, and by that we mean multiple kinds of safe: memory-safe and type-safe, and Rust has extra nice things there, like the underflow and overflow protections you get if you build in debug mode. It should be readable, which is a huge thing, because if you're copy-pasting around all these assembly files, and each cryptographer is making all these tiny tweaks and changes, and they're tarballs, and there's no git history, then there's no way to know why someone changed something.
You just have this blob of unreadable code that takes forever to work through, and it's not very explicit. Readability also implies that it should be auditable, which is a huge thing for security-critical code. All of these are things that we get from a higher-level, memory-safe, strongly-typed, polymorphic programming language. Okay, so with that, I'm going to turn it over to Henry, who will start to explain some of the low-level field arithmetic in Rust.
B: So, as you saw in one of the previous slides, there's this table of the different pieces, and since that's kind of large, as an example we're just going to go through one thing that we do. As part of this, we have to implement field arithmetic for integers mod p, where p is 2^255 - 19, and just as a worked example, let's see how that works.
B
So
we're
trying
to
do
this
only
using
the
operations
that
we
have
available
on
our
CPU,
so
in
order
to
figure
out
how
we're
gonna
do
this,
you
need
to
answer
two
questions.
First,
what
are
our
actual
primitive
operations
and
also
what
does
the
multiplication
look
like?
So
when
you
do
a
multiplication
you're
using
a
fixed
size,
primitive
type,
but
when
you
multiply
numbers
or
integers,
they
will
get
bigger.
So
how
does
that
get
handled?
Basically,
there's
four
possibilities.
The first is wrapping arithmetic, where the result wraps around modulo how big a number you can fit into that type; that's what Rust does in release mode. The second is checked or panicking arithmetic, where overflow is detected and reported; that's what Rust's debug builds do for the built-in operators. For some things you might want the third, saturating arithmetic, where if the result gets too big it just clamps to the highest value. And then the fourth thing you can do is widening arithmetic, where the result of the multiplication is the next biggest type. In Rust you have intrinsics for one, two, and three.
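Those first three correspond to the wrapping_*, checked_*, and saturating_* method families on Rust's integer types. A small sketch to show the behaviors side by side (illustrative, not slide code):

```rust
fn main() {
    let a: u8 = 200;
    let b: u8 = 2;
    // 1) wrapping: result is taken modulo 2^8.
    assert_eq!(a.wrapping_mul(b), 144); // 400 mod 256
    // 2) checked: overflow is reported as None.
    assert_eq!(a.checked_mul(b), None);
    // 3) saturating: result clamps to the type's maximum.
    assert_eq!(a.saturating_mul(b), 255);
    // 4) widening: cast up first, and the product is exact.
    assert_eq!((a as u16) * (b as u16), 400);
}
```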
B
So
if
you
explicitly
want
one
of
those
you
can
pick
it
I'm
not
aware
of
an
intrinsic
for
the
fourth
one,
but
you
can
just
write
it
like
this,
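For 64-bit limbs, a widening multiply can be written in one line by casting up to u128 (a sketch of the idea; the library's helper is spelled similarly but may differ in detail):

```rust
/// Widening multiplication: multiply two u64s into a full 128-bit product.
#[inline(always)]
fn m(x: u64, y: u64) -> u128 {
    (x as u128) * (y as u128)
}

fn main() {
    // The product of two maximal u64s does not fit in 64 bits,
    // but the widening result is exact.
    assert_eq!(m(u64::MAX, u64::MAX), u128::MAX - 2 * (u64::MAX as u128));
}
```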
Now you might want to know: what does this actually turn into? Let's suppose that we're on x86-64. There's a really cool tool that you can use for this:
Godbolt's Compiler Explorer. In the top window I've just put an example of a thing that does a widening multiplication of two u64s to produce a u128 output, and the bottom shows the actual assembly this will produce. There are two windows because you can see that LLVM will give you a nicer instruction on newer processors than on older processors.
B
There's
this
mul
instruction,
where
the
inputs
and
outputs
go
into
like
fixed,
predetermined
registers,
so
you'd
have
to
do
a
bunch
of
like
moving
things
around
and
then
on
newer
things
you
can
pick
where
they
go,
but
the
point
is
that
you
can
just
sanity
check
that
this.
This
really
does
turn
into
something
reasonable.
B
So
suppose
that
we
have
this
ability
to
do
multiplication
of
two
64-bit
numbers
into
128-bit
product.
How
are
we
going
to
implement
multiplication?
So
if
you
look
at
the
original
paper,
they
suggest
using
a
radix
to
to
the
51
representation.
So
what
does
that
mean?
It
means
that
you're
going
to
write
numbers
where
the
in
base
2
to
the
50.
Why
so
you're
gonna
get
five
coefficients?
You
might
wonder:
where
does
the
51
come
from?
B
Where
does
the
five
come
from,
so
these
things
are
going
to
be
basically
256
bits
wide,
and
so
you
could
break
it
up
into
four
times
64.
But
if
you
think
about
the
discussion
in
the
previous
talk
about
instructions
per
clock
and
out
of
order
execution,
it's
actually
much
better.
If
not
all
of
the
operations
depend
on
each
other
right.
If
you
have
things
that
are
full
width
and
every
time
that
you
do
an
operation,
there
will
be
a
dependency
between
them
and
it'll
it'll
be
slower.
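The radix 2^51 representation itself can be sketched like this (an illustrative decomposition, not the library's deserialization code): a 255-bit number given as four little-endian 64-bit words becomes five limbs of 51 bits each.

```rust
// Mask selecting the low 51 bits of a word.
const LOW_51_BITS: u64 = (1 << 51) - 1;

// Split four little-endian 64-bit words into five radix-2^51 limbs,
// so limb i holds bits [51*i, 51*i + 51) of the number.
fn to_limbs(words: [u64; 4]) -> [u64; 5] {
    [
        words[0] & LOW_51_BITS,
        ((words[0] >> 51) | (words[1] << 13)) & LOW_51_BITS,
        ((words[1] >> 38) | (words[2] << 26)) & LOW_51_BITS,
        ((words[2] >> 25) | (words[3] << 39)) & LOW_51_BITS,
        words[3] >> 12,
    ]
}
```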
Okay, so now you'll notice that our numbers got bigger when we multiplied them, but we're supposed to be working mod p, so we would like to reduce this back to the original size of the inputs. How do we do that? Notice that this prime has a special form: since it's 2^255 - 19, you know that 2^255 is 19 mod p. The reason is that mod p, p is zero, so 0 = 2^255 - 19, and you bring the 19 over. So why is this useful?
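Because anything at position 2^255 and above can be folded back into the low bits after multiplying by 19. Here is a toy analogue of that folding, assuming the smaller prime 2^61 - 19 so that a whole element fits in a u128 (the real code applies the same identity limb by limb, not like this):

```rust
// Toy modulus with the same shape as 2^255 - 19.
const Q: u128 = (1 << 61) - 19;
const LOW_61_BITS: u128 = (1 << 61) - 1;

// Multiply mod Q by folding the high part: 2^61 is congruent to 19 (mod Q).
fn mul_mod_q(a: u128, b: u128) -> u128 {
    let wide = a * b; // a, b < 2^61, so the product fits in 122 bits
    // Split into the low 61 bits plus (high part) * 2^61, then fold:
    let folded = (wide & LOW_61_BITS) + (wide >> 61) * 19;
    // One more fold and a final conditional subtraction finish the job.
    let mut r = (folded & LOW_61_BITS) + (folded >> 61) * 19;
    if r >= Q {
        r -= Q;
    }
    r
}
```

For instance, (Q - 1) * (Q - 1) mod Q comes out to 1, since Q - 1 is congruent to -1.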
Similarly, for this 2^306 term: since 2^306 = 2^255 times 2^51, you can write it as 19 times 2^51, and that simplifies into this nice thing. So you can get a basically pretty fast inline reduction, and when you combine that with the formulas on the previous slide, you get this picture, where the triangle below gets folded up into the upper part. This technique for doing really fast reduction mod p: you can actually trace its lineage all the way back to the 15th century.
B
If
you're
curious,
it
coincides
with
like
the
development
of
early
capitalism
in
Venice.
So,
unfortunately,
we've
now
moved
on
to
Lake
capitalism
and
things
are
not
looking
up,
but
why
don't
we
just
write
this
in
rust,
so
I
put
I
put
some
rust
code
on
the
on
the
on
the
slide:
we're
implementing
Mull
there's
some
weird
lifetime
stuff.
That's
one
of
the
things
that
we'll
get
to
later,
so
just
disregard
that
for
now
I
just
put
it
in
so
that
because
it's
the
real
code
and
I'm
gonna
define
this
little
helper
function.
B
That's
like
my
own
little
intrinsic
for
doing
widening
multiplication.
It's
in
line
always
so
we'll
just
disappear,
and
we
start
off
so
remember
in
the
in
the
previous
slide.
We
had
a
19
times
some
stuff,
but
that
stuff
is
going
to
be
you
128
and
it's
better.
You
know.
Instead
of
trying
to
do
128
bit
multiplication,
you
could
just
do
a
multiplication
by
19
beforehand
and
then
you
just
write
down
that
formula.
B
Now
we
have
this
problem,
which
is
that
these
see
is
that
we're
getting
our
128
bits
Y
right
there,
you
128,
and
not
you
64
s,
and
remember
that
our
original
goal
is
to
try
to
get
to
back
to
a
like
you,
64
5.
So
we
have
to
then
sort
of
reduce
these
these
see
eyes,
and
we
can
do
that
so
I've
written
this
formula.
But
basically
the
idea
is
that
you
take
this
128
bit
value.
Then you can do whatever you want with that information. Once we've done this first pass, we've fixed all of the c_i to lie in u64s, but now they're maybe not as small as we'd like, so we can just do another carry pass, and that's what the reduce function on the 64-bit field element type does. It does essentially the same thing, but it's less complicated because you don't have to change types, and it also gets inlined. So actually, that's our implementation.
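A carry pass over radix-2^51 limbs can be sketched like this (illustrative code, not the library's exact reduce function): each limb keeps its low 51 bits and pushes the excess into the next limb, and the carry out of the top limb is worth 2^255, so it comes back in multiplied by 19.

```rust
const LOW_51_BITS: u64 = (1 << 51) - 1;

// One carry pass. Assumes the limbs are small enough that the additions
// below cannot overflow a u64 (the real code checks this with assertions).
fn carry_pass(mut limbs: [u64; 5]) -> [u64; 5] {
    let mut carry = 0u64;
    for limb in limbs.iter_mut() {
        let v = *limb + carry;
        *limb = v & LOW_51_BITS;
        carry = v >> 51;
    }
    // Fold the top carry back around: 2^255 is 19 (mod p).
    limbs[0] += carry * 19;
    limbs
}
```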
It's not the simplest thing, but it's not that complicated. In our actual code there's a bunch of debug assertions to make sure that all the values are the right size, that there's no possibility that we're violating some precondition, and that none of the intermediate values can overflow. But that's the actual code that we have, and then we kind of just hand it to LLVM and see what happens.
A: It turns out it's actually really, really fast. ed25519-donna is an implementation with optimized assembly; it's what Tor currently uses by default, and as you can see here in red, our performance is comparable to donna, and slightly better for things like verification. We also threw in a more general comparison point, so that you understand what the numbers are kind of supposed to be.
ring is also a Rust library, but a higher-level library than ours, for implementing protocols, and ring does that by wrapping BoringSSL's implementations, which are assembly implementations, in Rust. So it's pretty, pretty fast. So now, let's talk about certain things about Rust that we really like, and other things that we think could be a little bit better.
So, obviously, Rust's code generation is done by LLVM, and as Henry just showed, it's really good at generating code. It's not just good at generating fast code; it's good at generating safe code. Historically, there's been a worry that an optimizer could, in theory, break the constant-time properties of an implementation. So what does this mean?
People have essentially said in the past that you can't use compilers to write cryptography, that you have to write handwritten assembly, because that's the only way to control what a chipset is going to do. Not only is that not entirely true, there are also chipsets that do weird things, which I'll get into later.
This is all kinds of insane. So what does it mean to say that code is constant time? There are these things called side channels, and a side channel is essentially a mechanism by which an adversary can learn some sort of internal program state. For cryptography this is especially insidious, because learning even a few bits of a secret can often lead to full key recovery attacks.
For example, suppose secret data determines which branch of an if-statement gets executed, the if-branch is a really small piece of code, and the then-branch is this huge chunk of code. You can basically load your own giant static string into the cache, wait a little while and just chill, and then try to access your giant string again and time how long it takes. As you saw in the previous talk, hitting different layers of the cache, or hitting memory instead, results in different timings. So that's a timing side channel. There are other side channels too:
you can do differential power analysis, where you build up profiles of a particular device or chip and how much power it draws as it's doing certain operations. But basically, this is all bad. You want to write code that does exactly the same thing, with respect to secrets, all the time.
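The usual trick is to replace secret-dependent branches with arithmetic on masks, so the same instructions execute regardless of the secret. A minimal sketch of the style (illustrative; the exact helpers in real libraries differ):

```rust
// Branchless conditional select: returns a if choice == 1, b if choice == 0.
// choice must be exactly 0 or 1; negating it gives an all-ones or all-zeros
// mask, so both inputs are always read and no secret-dependent branch exists.
fn ct_select(a: u64, b: u64, choice: u64) -> u64 {
    let mask = choice.wrapping_neg(); // 1 -> 0xffff...f, 0 -> 0x0000...0
    (a & mask) | (b & !mask)
}
```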
To prevent this, we have put a lot of effort in. As you saw with propagating the carries in Henry's code, there are all these crazy shifting operators everywhere, and crypto code just tends to look weirder because of this. As for LLVM's optimizer, it turns out that it's not breaking our code, as far as we can tell, for x86, and we've sat there and geeked out over the assembly for hours and hours and hours, way longer than I would like. In the future,
we're hoping that there will be some way to do an LLVM pass or a sanitizer where we can statically analyze the output assembly. For instance, there's a trick for Valgrind that Adam Langley made, where you just mark secret data as uninitialized, and then if you ever try to index based on that data, or branch on it, the whole thing just blows up, and you know that you've done something wrong.
There are also chips whose mul instruction will return early if you're multiplying something by zero, so I actually just can't guarantee that constant-time crypto works at all on your MacBook. Don't do that. Filippo Valsorda had a recent blog post about this which is really interesting; I think it made the top of both the golang subreddit and the Rust one.
Another thing we would like: the basic idea would be to statically track the sizes of the intermediate values of these field elements and use specialization to insert reductions when necessary. It would automatically detect, like: oh hey, you have three bits of carry space and you've done two adds already, so let's do a reduction now, instead of forcing users to know when to do it by hand.
B: So, on implementing these proofs: for people who care, these are Schnorr-style proofs. There's a lot of boilerplate when your expressions get more complicated, so we made a crate that has an experimental zero-knowledge proof compiler implemented in Rust macros. Not procedural macros, just ordinary macros, because it's worse that way. As a user it's kind of nice, because you just write this, and that corresponds to the example above. And what does that expand into?
A: The problem is that if we're working in a cyclic group, x being bigger than y is the same thing as x being smaller than y if it wraps around the group, so you have to do what's called a range proof, which is saying: I know an x, and it's between y and z. There are ways to do this in zero knowledge, without giving away any information about x other than that it lies in the correct range.
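To see why plain comparisons aren't meaningful once values wrap, here's a tiny illustration using integers mod 7 as a stand-in for a cyclic group:

```rust
fn main() {
    let m = 7i64;
    // 5 and -2 are the same element mod 7...
    let x = -2i64;
    let rep = x.rem_euclid(m);
    assert_eq!(rep, 5);
    // ...but one representative is "bigger" than 3 and the other "smaller",
    // so "x > y" is not well-defined without a range constraint on x.
    assert!(rep > 3);
    assert!(x < 3);
}
```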
So verification essentially amounts to checking a ring signature on each digit's proof, and if each digit is in the correct range, the whole number is in the range. We implemented a recent construction due to Back and Maxwell, which determined that if you do Borromean ring signatures over a ternary system, you can share data between the digits, and this ends up being a lot more efficient. The name "Borromean ring signature" is actually a pretty cool name. Borromean rings are, if you imagine something like the sign for the Olympics:
it's this mathematical thing where you have three rings that are interlinked, and if you cut one, the whole thing falls apart. So the signature scheme is named after these interlinked rings, because each of the rings in the scheme has some of the data of the other ones, or is dependent on the other ones. Computers don't really like ternary systems; neither, for that matter, do human brains.
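The per-digit decomposition those signatures work over can be sketched as follows (an illustrative helper, not code from the construction itself): the value being range-proved is split into base-3 digits, and each digit then gets its own ring signature.

```rust
// Decompose x into n base-3 digits, least significant first.
fn base3_digits(mut x: u64, n: usize) -> Vec<u64> {
    let mut digits = Vec::with_capacity(n);
    for _ in 0..n {
        digits.push(x % 3);
        x /= 3;
    }
    digits
}

fn main() {
    // 11 = 2 * 1 + 0 * 3 + 1 * 9
    assert_eq!(base3_digits(11, 3), vec![2, 0, 1]);
}
```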