From YouTube: GPUs on Filecoin: An Update on Mining
Description
An update on Filecoin mining using GPUs from Volker Mische and CryptoComputeLab.
But before I get into the details: as this is an event celebrating one year of Filecoin, I'd also like to show you the progress. I've run the proof of replication on some hardware and compared the same hardware running a version from a year ago against the version we have today. As you can see, with the version from a year ago, sealing a 32 GiB sector took about three hours, and now we are down to about two and a half hours, which is a great improvement.
All the details about this graph will come later; this is just to get you excited. When I talk about Filecoin in this talk, I mean the proofs, because Filecoin means many different things to different people: sometimes people refer to it as the protocol, or as the cryptocurrency, or sometimes as an implementation like Lotus.
As this talk is about GPUs, I'd also quickly like to talk about what GPUs are good for. If you have huge amounts of data and some operation that you can apply to the data in parallel, crunching through it basically in one step, that's a good fit for GPUs. What they are not good at is when you have interdependencies; then you will probably just go back to the CPU. Filecoin actually has both of those things.
Sadly, I don't have the time to talk about the proof of spacetime. The proof of replication, also known as sealing, has four phases, grouped into two big ones. We make the distinction between the pre-commit, which I would call the preparation, and the commit, which is about the actual proofs. As you can see, those phases split into pieces again, and the important thing here is that we have a CPU-heavy phase and a GPU-heavy phase.
So what we do here, in the pre-commit phase 1, is take the original data and encode it in a unique way. The important bit here is the uniqueness: if you run the process again on the same data, you will get another replica copy that is again unique. This is part of the whole design, because we want people to store unique copies of the data. This step is not parallelizable, by design, so it runs on the CPU and takes quite a lot of time.
What it does in the background is mostly reading a lot of data from disk, doing some SHA hashing, and writing those hashes out again. The reason it's not parallelizable is that the hashing often depends on previous hashes in order to compute the new hashes, so you have those interdependencies. In this step there is also some basic tree building, which isn't really performance critical, but tree building is core to the proof of replication.
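To make those dependencies concrete, here is a minimal sketch of the idea, assuming a simplified labelling rule (this is my own illustration, not the actual filecoin-proofs code): each label is hashed from a label produced earlier, so the labels have to be computed one after another.

```rust
use sha2::{Digest, Sha256};

/// Minimal sketch: label every node of one layer sequentially.
/// Each label depends on the previous label, so the loop cannot be
/// split across threads (hypothetical simplification of the labelling).
fn label_layer(seed: &[u8], num_nodes: usize) -> Vec<[u8; 32]> {
    let mut labels: Vec<[u8; 32]> = Vec::with_capacity(num_nodes);
    let mut prev: [u8; 32] = Sha256::digest(seed).into();
    for node in 0..num_nodes {
        let mut hasher = Sha256::new();
        hasher.update(prev); // dependency on the previously computed hash
        hasher.update((node as u64).to_le_bytes());
        prev = hasher.finalize().into();
        labels.push(prev);
    }
    labels
}
```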
Therefore, I want to briefly talk about what Merkle tree building means. So what is a Merkle tree? Here you can see an example. At the bottom you have your data that you want to build your Merkle tree on. Each node has a certain number of children, in this case two children; we call this the arity of the tree. In Filecoin we also have trees which have, for example, eight children, but in this case it's only two children, so you take two pieces and hash them together.
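As a rough illustration of that idea (my own sketch in Rust, not the tree code used in the proofs), building a binary Merkle tree just means hashing pairs of children, layer by layer, until a single root remains:

```rust
use sha2::{Digest, Sha256};

/// Sketch of building a binary (arity-2) Merkle tree.
/// `leaves` is assumed to have a power-of-two length; returns the root hash.
fn merkle_root(leaves: &[[u8; 32]]) -> [u8; 32] {
    let mut layer: Vec<[u8; 32]> = leaves.to_vec();
    while layer.len() > 1 {
        layer = layer
            .chunks(2)
            .map(|pair| {
                // hash the two children together to form the parent node
                let mut h = Sha256::new();
                h.update(pair[0]);
                h.update(pair[1]);
                h.finalize().into()
            })
            .collect();
    }
    layer[0]
}
```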
So why is this interesting, and why would you do this? There's something called a Merkle inclusion proof. If you want to prove that certain bytes are part of your data, you would normally have to transmit the full file and look into it, but that's not really efficient. What you want to do instead is transmit only a proof, a Merkle proof, and in the light gray boxes you can see what you would need to transmit.
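To give a feel for why those few gray boxes are enough (again just an illustrative sketch, not the proofs code), verifying an inclusion proof only needs the leaf, one sibling hash per tree level, and the root:

```rust
use sha2::{Digest, Sha256};

/// Sketch of verifying a binary Merkle inclusion proof.
/// `siblings` holds one sibling hash per tree level; `index` is the leaf
/// position, whose bits tell us whether the sibling sits on the left or right.
fn verify_inclusion(
    leaf: [u8; 32],
    mut index: usize,
    siblings: &[[u8; 32]],
    root: [u8; 32],
) -> bool {
    let mut current = leaf;
    for sibling in siblings {
        let mut h = Sha256::new();
        if index & 1 == 0 {
            h.update(current);
            h.update(sibling);
        } else {
            h.update(sibling);
            h.update(current);
        }
        current = h.finalize().into();
        index >>= 1;
    }
    current == root
}
```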
Let's get to the next phase, the pre-commit phase 2, which is about Poseidon hashing. Again we do some Merkle tree creation, this time several trees. I won't go into the details, but the important bit is that we use Poseidon hashing for this. Poseidon hashing is an algorithm that works on finite field elements. What are those? Well, they are a cryptographic primitive.
That's all you need to know for now; I will get back to this later. This cryptographic primitive is used in SNARKs, which I will also get to quickly later, and there it is just better, or more efficient, than working with other hashes. And here the tree building is highly parallelizable.
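Since finite field elements will keep coming up, here is a very rough sketch of the idea, using a tiny toy prime rather than the roughly 255-bit prime field the proofs actually use: the elements are integers modulo a prime, and addition and multiplication always wrap around so the result stays inside the field.

```rust
/// Toy finite field: integers modulo a small prime (illustration only;
/// the field used in the proofs is a ~255-bit prime field).
const P: u64 = 101;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Fp(u64);

impl Fp {
    fn new(v: u64) -> Self {
        Fp(v % P)
    }
    fn add(self, other: Fp) -> Fp {
        Fp((self.0 + other.0) % P)
    }
    fn mul(self, other: Fp) -> Fp {
        Fp((self.0 * other.0) % P)
    }
}

fn main() {
    // 70 + 60 = 130, which wraps around to 29 modulo 101
    assert_eq!(Fp::new(70).add(Fp::new(60)), Fp::new(29));
    // 50 * 3 = 150, which wraps around to 49 modulo 101
    assert_eq!(Fp::new(50).mul(Fp::new(3)), Fp::new(49));
}
```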
Then something happens in kind of a loop. Think about the previous example with the Merkle tree: at the bottom you have lots of data, and let's say it's really a huge amount of data, bigger than the memory of your GPU. So you take, say, a third of the data, batch it up, put that batch on the GPU, and hash it there.
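A minimal sketch of that loop, where `hash_on_gpu` is a hypothetical stand-in for the real GPU kernel launch: the CPU slices the leaves into batches that fit into GPU memory and feeds them to the device one after another.

```rust
/// Hypothetical stand-in for launching the hashing kernel on the GPU.
fn hash_on_gpu(batch: &[[u8; 32]]) -> Vec<[u8; 32]> {
    // In the real code this would copy the batch to the device,
    // run the Poseidon kernel and copy the results back.
    batch.to_vec()
}

/// Sketch: split the leaves into batches that fit into GPU memory
/// and process them one after another.
fn build_tree_layer(leaves: &[[u8; 32]], gpu_batch_size: usize) -> Vec<[u8; 32]> {
    let mut out = Vec::with_capacity(leaves.len());
    for batch in leaves.chunks(gpu_batch_size) {
        out.extend(hash_on_gpu(batch)); // only this part runs on the GPU
    }
    out
}
```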
So in this step, where we do the actual tree building, the tree logic and the batching are orchestrated on the CPU, and only the actual hashing is done on the GPU, so only a small part.

Let's get to the next stage, the commit phase 1. This is about Merkle inclusion proofs.
We create those in order to make sure that the replica data we created contains the same data as the original input. This is a really fast process; on a beefy machine it takes less than a second. So we just do it on the CPU, as there's no point in putting something on the GPU if it's that fast.
Finally, we have the commit phase 2, which is about SNARKs. First I would like to talk about why we need another phase, given that we already have proofs that the replica is the same as the original data. The reason is that those Merkle proofs are quite large, and it wouldn't be feasible to put them on chain.
I will go into some bits of it, but not all the details; that would be a separate talk. What you do is build something called a circuit. It's still software, but I would say it's basically the physical representation of a system that does some magic things. Those circuits are basically polynomials, as you know them from school or university, and you have lots and lots of them, really a lot. All the operations that you do in those SNARKs happen in finite fields.
You need finite fields because it's also working with elliptic curves and so on. Now we get back to what I said earlier about Poseidon hashing: Poseidon hashing is native to finite fields, so it's a good fit for those SNARKs. So what is now actually running on the GPU?
What we want to do in this system is evaluate those polynomials at random points, which basically means we want to get out the results. You would have a formula like the one you can see at the bottom, of the form p(x) = a₀ + a₁x + a₂x² + … + aₙxⁿ. You could do it with pen and paper: you put in some value for the x, you have some values for the a's, you do the calculation and get some result back. This is what evaluation means, and we want to do it efficiently on the GPU.
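As a tiny worked example of what evaluation means here (my own sketch with plain integers instead of finite field elements), you can fold the coefficients together one term at a time, for instance with Horner's rule:

```rust
/// Sketch: evaluate p(x) = a0 + a1*x + a2*x^2 + ... using Horner's rule.
/// Plain u64 arithmetic here; the real code works on finite field elements.
fn evaluate(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &a| acc * x + a)
}

fn main() {
    // p(x) = 3 + 0*x + 2*x^2, evaluated at x = 5: 3 + 0 + 2*25 = 53
    assert_eq!(evaluate(&[3, 0, 2], 5), 53);
}
```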
First, what we do is something called an inverse fast Fourier transform. The problem is that in the circuit those polynomials are represented in a certain way, but to do the actual calculation we need to represent them differently. We need to transform them into the coefficient representation, which is the one you can see at the bottom and the one most people are used to, or think about, when they talk about polynomials.
We do this transformation first, and it is again an operation that is parallelizable, because you use a divide-and-conquer algorithm: you recursively step through the data, and in the end you get the coefficients back, which are those a's in the polynomial (see the sketch below). The other thing is: if you then actually want to evaluate the polynomial, you use a function called MultiExp. What it does is operate on many of those elements at the same time.
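To show the divide-and-conquer structure of such a transform, here is a minimal sketch of a radix-2 transform over a toy prime field (my own illustration; the real code works over the large field, runs on the GPU, and the inverse direction has the same shape, just with the inverse root and a final scaling by 1/n). The two recursive calls on the even and odd halves are independent of each other, which is exactly what makes this parallelizable.

```rust
/// Toy prime field and radix-2 number-theoretic transform (NTT), the finite
/// field analogue of the FFT. Illustration only.
const P: u64 = 17; // tiny prime; 17 - 1 = 16, so roots of unity of order up to 16 exist

/// Evaluate the polynomial given by `coeffs` at the powers of `root`,
/// where `root` is a primitive n-th root of unity and n = coeffs.len().
fn ntt(coeffs: &[u64], root: u64) -> Vec<u64> {
    let n = coeffs.len();
    if n == 1 {
        return coeffs.to_vec();
    }
    // Split into even and odd coefficients: two independent sub-problems,
    // which could be handed to different threads (or GPU work groups).
    let even: Vec<u64> = coeffs.iter().copied().step_by(2).collect();
    let odd: Vec<u64> = coeffs.iter().copied().skip(1).step_by(2).collect();
    let even_out = ntt(&even, root * root % P);
    let odd_out = ntt(&odd, root * root % P);
    // Combine the two halves (the "butterfly" step).
    let mut out = vec![0u64; n];
    let mut w = 1u64;
    for i in 0..n / 2 {
        let t = w * odd_out[i] % P;
        out[i] = (even_out[i] + t) % P;
        out[i + n / 2] = (even_out[i] + P - t) % P;
        w = w * root % P;
    }
    out
}

fn main() {
    // p(x) = 1 + 2x + 3x^2 + 4x^3, evaluated at the powers of a 4th root of unity (13 mod 17).
    let evals = ntt(&[1, 2, 3, 4], 13);
    assert_eq!(evals[0], 10); // p(1) = 1 + 2 + 3 + 4 = 10
}
```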
Here I've put up, roughly, the function call: you put in the coefficients and you put in the x's. Those are then really concrete values and not variables anymore, and then you do this calculation in one big step on the GPU and you get the result back.
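Conceptually, that call looks something like the sketch below. This is my own simplification using plain modular arithmetic; in the actual proofs the bases are elliptic curve group elements and the implementation uses much smarter batching than this loop. The point is that every term is independent, so they can all be computed in parallel on the GPU.

```rust
const P: u64 = 1_000_000_007; // toy modulus; the real code works in an elliptic curve group

/// Square-and-multiply modular exponentiation.
fn pow_mod(base: u64, mut exp: u64) -> u64 {
    let p = P as u128;
    let mut result = 1u128;
    let mut b = base as u128 % p;
    while exp > 0 {
        if exp & 1 == 1 {
            result = result * b % p;
        }
        b = b * b % p;
        exp >>= 1;
    }
    result as u64
}

/// Sketch of a multiexp: combine many (base, exponent) pairs into one result.
/// Every term is independent, so the terms can be computed in parallel.
fn multiexp(bases: &[u64], exponents: &[u64]) -> u64 {
    bases
        .iter()
        .zip(exponents)
        .map(|(&g, &e)| pow_mod(g, e))
        .fold(1, |acc, term| (acc as u128 * term as u128 % P as u128) as u64)
}
```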
So let's look at what that means and what performance improvement it gives you. This is again run on the same machine as in the first slide, and now you hopefully have a better understanding of what those bars mean. Let's look at the details: at the top we have the CPU-only runtime of a proof of replication of a 32 GiB sector, and at the bottom we have the runtime with the GPUs enabled on this machine; we even had two GPUs.
The next phase is the pre-commit 2, which is about Poseidon hashing, where we use the GPU, as you remember. We can clearly see here that on the CPU it takes something like almost an hour, around 45 minutes, but if you use the power of GPUs it's a matter of 10 to 12 minutes. The next phase is the commit 1, which is the Merkle inclusion proofs.
You can't actually see it on this diagram; as I mentioned, it takes basically less than a second, so it doesn't show up in the diagram, and in both cases it's just this fast. Then finally we have the commit phase 2, which is the SNARK magic, and again we can see the difference in the diagram.
And now, finally, I'd like to talk about what we use underneath. The proving system I was talking about is programmed in the programming language Rust, and of course we use libraries for certain things; I'd like to get into some of them, the important ones. We interact with OpenCL, and there is a library called opencl3 which is well maintained, and it actually replaces a library that is no longer maintained, called ocl.
We've used ocl in the past, but with newer versions of the proofs we use opencl3, and, as you've probably seen, the transition was quite seamless; you probably haven't had any problems with it. What I'm especially proud of is that Protocol Labs helped in the early stages of getting opencl3 polished, because it was still quite young. But we bet on it, and it was a good decision to spend some time improving it, making it work for our use case and also making it work for other people.
So now it's really a general purpose library, not only for Filecoin; it can be used by anyone, and that's also its intention. On the CUDA side things don't look that happy: the RustaCUDA library is only partially maintained, and currently we even need to use a fork of it. But that should be short-lived; it's just a single function and I hope to get it upstreamed, because I really don't want to maintain a fork of it.
Then, of course, with the most recent version of Lotus and the proofs, what you can do is switch between OpenCL and CUDA. So we have code for OpenCL and CUDA, and of course you don't want to code everything twice, so we've built an abstraction library to make this easier. This is again not Filecoin specific but general purpose.
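As a rough idea of what such an abstraction can look like (my own sketch, not the library's actual API): a common trait for "a device that can run our kernels", with one implementation per backend, so the proving code itself never has to mention OpenCL or CUDA directly.

```rust
/// Sketch of a backend abstraction: the proving code only talks to this trait.
trait GpuBackend {
    fn name(&self) -> String;
    /// Run a hashing/multiexp kernel on the device (hypothetical signature).
    fn run_kernel(&self, input: &[u8]) -> Vec<u8>;
}

struct OpenClDevice;
struct CudaDevice;

impl GpuBackend for OpenClDevice {
    fn name(&self) -> String {
        "OpenCL device".into()
    }
    fn run_kernel(&self, input: &[u8]) -> Vec<u8> {
        // would enqueue an OpenCL kernel here
        input.to_vec()
    }
}

impl GpuBackend for CudaDevice {
    fn name(&self) -> String {
        "CUDA device".into()
    }
    fn run_kernel(&self, input: &[u8]) -> Vec<u8> {
        // would launch a CUDA kernel here
        input.to_vec()
    }
}

/// The caller picks a backend at runtime; the rest of the code is written once.
fn seal_with(backend: &dyn GpuBackend, data: &[u8]) -> Vec<u8> {
    println!("running on {}", backend.name());
    backend.run_kernel(data)
}
```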