Description
Riba demos his IPFS blockstore, backed by a PostgreSQL RDBMS coupled with a write-through RAM cache.
https://github.com/filecoin-project/go-bs-postgres-chainnotated
A
Let me just say hello to anyone who's listening, and welcome to Spark's first demo recording session today. This is our first one, so we don't have many demos yet. Today we have Peter, and he's going to be doing a quick demo for us. So please, Peter: let us know what the demo is and take it away.
B
Yes, so this is a demo of something that has been worked on for almost two months, so it's not exactly a proof of concept. What we're going to do is basically cold-start a Lotus node backed by the blockstore that has been, you know, slowly coming together. In fact, I wasn't expecting this to be recorded, so probably nothing will work. But basically, what I have here is a box where all the caches have been completely cleared, and I have a Postgres which is completely stopped; I stopped it like 10 minutes ago or something like that.
B
So,
let's
see
what
is
going
to
happen
to
basically
bring
everything
up
from
from
absolutely
nothing,
and
I
actually
don't
know
how
well
this
will
work,
but
it
should
be
within
you
know,
within
correct
parameters.
B
So
yeah
there
we
go
so
now
we
have
a
postgres
running
and
you
have
a
a
thing
here
that
runs
every
every
second
basically
tells
you
like.
This
is
what
this
node
has
seen:
epochs
wise
and
tips.
B
When
was
it
seeing
particular
chipsets
what
time
and
so
on
so
forth?
So
basically,
this
will
show
you
in
real
time.
You
can
actually
see
everything
happening,
so
don't
like
mind
that
this
machine
has
a
ton
of
memory.
It's
it's
my
my
definition.
Like
generally,
this
entire
thing
can
work
with
like
60
gigs
of
ram
or
something
like
that.
So we're just going to start this Lotus and we're going to see what happens. What it does right away is this: Postgres is slow as a general database if you use it the way we normally use blockstores, going one block at a time. What this has, however, because everything is transaction-based and so on and so forth, is that within every tipset I can keep a record: in order to compute this tipset, these are all the thousands, or maybe tens of thousands, of blocks that I had to touch.
B
I record that, then I record the next tipset, and so on and so forth, and I keep a log of these things. Then, when I start up, it knows what the last epochs of blocks it has seen are; I believe it's set right now at 3,000 epochs. And I'm just going to load all of them together now. Normally this is way quicker than that, but for the sake of the demo I wanted to do a complete cold start, to see exactly how this is going to work.
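The tipset-annotated preload just described can be sketched in a few lines. This is an illustrative simulation only; the function and variable names are invented here, not taken from the actual go-bs-postgres-chainnotated API:

```python
# Record which block ordinals each tipset had to touch, then on a cold
# start union the sets for the most recent tipsets and fetch them all in
# one bulk read, instead of one block at a time.
touched_per_tipset = {}   # tipset key -> set of block ordinals it touched

def note_access(tipset_key, ordinal):
    """Called on every block access while validating a tipset."""
    touched_per_tipset.setdefault(tipset_key, set()).add(ordinal)

def preload_set(recent_tipset_keys):
    """Everything the recent tipsets needed, loadable in one pass."""
    wanted = set()
    for key in recent_tipset_keys:
        wanted |= touched_per_tipset.get(key, set())
    return wanted
```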
B
So by now this should have said that listing that many tipsets and that many blocks took like 10 seconds or something like that. But the important part is that in the meantime, Lotus is still working: it's basically still doing its slow thing, one block at a time and so on and so forth. But in the background we'll now see that we start 25 preloading threads, and we're going to read 35 million blocks out of the blockstore at once.
B
It took us like 71 seconds to figure out what we're going to load, and you can see that it's going now: it is reading from disk, it is, you know, putting them in RAM, and so on and so forth, and you can see Lotus growing a little bit. Most of this memory is actually the in-memory cache of all these blocks, and in fact, in a moment, once this is all loaded (it takes about 200 seconds), it will tell us exactly how much it had to load. From this point on, this entire thing is essentially self-sustaining, because almost everything it needs to touch for any kind of state validation is already in memory, and whatever it writes to the database is, you know, written through, and so on and so forth.
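The write-through behavior can be sketched minimally like this; a dict stands in for the Postgres table, and this is an illustration of the general pattern rather than the real implementation:

```python
# Minimal write-through cache sketch: puts go to RAM and the database
# synchronously; gets are served from RAM when possible and warm the
# cache on a miss.
class WriteThroughStore:
    def __init__(self, db):
        self.db = db        # backing "Postgres" table (a dict here)
        self.cache = {}     # in-memory cache

    def put(self, cid, block):
        self.cache[cid] = block   # RAM copy
        self.db[cid] = block      # written through to the database

    def get(self, cid):
        if cid in self.cache:        # hot path: already in memory
            return self.cache[cid]
        block = self.db[cid]         # cold path: read from the database
        self.cache[cid] = block      # and keep it warm for next time
        return block
```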
B
So I'm going to switch to the other screen for a moment (it's not going to scroll away; it looks exactly the same for this box here). Let me see where it is. Okay, so yeah: this is where I started, at 514859, and right now we are at ...881, so it has been out for, like, 30 blocks or something like that, and this is me hitting the API.
B
So the API actually works; it does respond to commands and so on and so forth. It's just taking quite some time for it to get to the actual chain validation, because again, this preloading is still ongoing, and it is taking that long because I dumped all the caches. If this database had already been running and had most of the stuff in memory, by now everything would already have been, you know, spinning, and so on and so forth. But it's good to see what the worst-case scenario is. So, in the meantime...
B
It took us, you know, some time, and this will keep going up. So, a couple of interesting things about this blockstore, and this is basically the part that is really not specific to Filecoin in any way: the Filecoin chain is by far the largest blockstore that we have, compared to anything else, for example the cluster. While the cluster is larger data-size-wise, it is about 20 times smaller than the current chain in block count. If you put in the entire chain, with all the states and so on and so forth, we're currently at about 2.8 billion blocks, and the cluster is at some low number of millions. So the interesting part about this particular... oh, and let me show you what this means.
B
So,
while
planning
quality,
all
this,
I
have
to
kind
of
figure
out
like.
Can
I
actually
do
that
in
postgres?
B
Like
you
know,
a
specific
like
design
constraint
here
was
that
people
can
run
just
themselves
like
it
doesn't
have
to
be,
like
you
know,
host
servers
or
anything
like
this,
and
if
I
were
to
like
currently,
I
don't
have
like
all
the
blocks
in
there
because
again
still
still
in
depth,
but
if
I
put
every
single
board
that
we
have
like
for
for
all
the
states
for
everything
I
get
to
about
five
percent
of
my
capacity
that
a
postgres
can
store,
but
before
it
starts
running
out
of
out
of
like
different
oids
and
stuff
like
that,
which
basically
gives
us
about
five
to
seven
years
depends
on
how
exactly
the
chain
grows.
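A quick back-of-envelope check of those figures: the five percent and the 2.8 billion blocks are the numbers quoted above, while the implied total is simple arithmetic, not a measured Postgres limit:

```python
# If 2.8 billion blocks fill ~5% of what one Postgres instance can
# address before running out of ordinals/OIDs, the implied ceiling is:
current_blocks = 2.8e9
fraction_used = 0.05
capacity = current_blocks / fraction_used   # ~56 billion rows
headroom = capacity - current_blocks        # rows still available
```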
B
So this thing is quite durable. And the next thing I want to show you is how the specific table is organized, because this is the interesting part. There's actually not that much stuff going on.
In there, we have a block ordinal, which is an auto-incrementing 64-bit integer (I'll come back to it in a moment); we have the multihash (we'll come to it in a moment); the size of the block; how the block is encoded; the actual content; the CID; and the linked ordinals.
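As a sketch, that table might look roughly like the DDL below. The table and column names here are guesses reconstructed from the narration, not the actual schema in go-bs-postgres-chainnotated:

```python
# Hypothetical DDL matching the columns described in the demo; the real
# names and types in go-bs-postgres-chainnotated may differ.
BLOCKS_DDL = """
CREATE TABLE blocks (
    ordinal         BIGSERIAL PRIMARY KEY,  -- auto-incrementing 64-bit id
    multihash       BYTEA NOT NULL,
    size            INTEGER NOT NULL,
    codec           INTEGER NOT NULL,       -- how the block is encoded
    content         BYTEA,                  -- the actual block bytes
    cid             BYTEA NOT NULL,         -- complete CID, not just the multihash
    linked_ordinals BIGINT[]                -- ordinals of every CID this block links to
);
"""
```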
B
Now, the CID is obvious, except that this blockstore, unlike pretty much everything else we use, stores complete CIDs. So it doesn't go just by multihash: you can store the same block, for example, under two different codecs, and those will be two different entries. Moreover, and this is the main reason I do it, it allows me to store identity CIDs, which basically allows me to have the entire DAG in one place.
B
Why is that important? Because I have a thing called linked ordinals: whenever I put a block in, I actually parse it for links, and I record the block ordinals of every CID seen in these links in this array. What this basically gives me is that I can traverse a DAG entirely in the database, without ever, like, touching Go or anything like that, and this is super important, because you can actually do preloads of small subtrees. You can say: whenever I ask for this CID, give me this CID and then, like, three levels down, or something like that. That is extraordinarily fast to do in Postgres. Moreover, you can, like, go ahead and do an entire chain export directly from this, because if you know which tables and which CIDs you're interested in, you can just say: start at this CID, and then just, you know, traverse all these links, which are already in here.
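The depth-limited subtree preload can be simulated over the linked-ordinals arrays like this. A dict stands in for the table here; in the real store this walk would happen inside Postgres itself (for example via a recursive query), which is the whole point:

```python
from collections import deque

# Walk a DAG using only the linked-ordinals arrays, never decoding any
# block content: collect every ordinal reachable within max_depth hops.
def preload_subtree(table, root_ordinal, max_depth):
    seen = {root_ordinal}
    frontier = deque([(root_ordinal, 0)])
    while frontier:
        ordinal, depth = frontier.popleft()
        if depth == max_depth:
            continue                      # "three levels down" style cutoff
        for linked in table[ordinal]["linked_ordinals"]:
            if linked not in seen:
                seen.add(linked)
                frontier.append((linked, depth + 1))
    return seen
```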
B
So
this
will
take
care
of
the
chain
exports
which
are
extremely
slow
right
now
and
very
resource
consuming.
Here,
I
probably
can
bring
it
down
to
five
minutes
for
a
full
export
or
something
like
that,
and
why
this
interesting
for
anybody
else
in
ipfs
land
is
because,
if
you
have
that
and
then
you
walk
through
and
layer
by
layer
figure
out
what
is
your
depth
for
each
individual
deck?
B
You
can
then
basically
go
through
the
entire
set
and
go
like
okay
for
stuff
that
we
already
know
what
is
the
next
step,
which
means
that
we
know
the
entire
on
the
entire
underlying
deck
we
can
see,
which
we
have
a
reverse
index,
where
we
can
see
which
blocks
are
not
referenced
by
any
of
the
roots
that
we're
interested
in,
and
we
just
delete
this
top
layer.
And
then
we
repeat,
this
again
delete
this
next
layer
next
layer
and
we
are
basically
three
gc
more
or
less
that
is
entirely
index
driven.
B
It
doesn't
lock
anything
it
just
you
just
you
know
legit
you
see,
I
don't
do
any
of
that
right
now,
because
in
falcon
I
actually
need
every
single
ball
that
I
that
I
see,
but
this
is
one
way
out
of
the
of
the
dilemma
and
yeah.
In
the
meantime,
this
thing
is
already
working.
It
is
at
a.
B
Let me resize this a little bit. So, we are already at ...872, so we're halfway there. It started at about 20 seconds per block; it is now down to 18. When everything is warm and fuzzy, it will actually be at about four seconds per block validation, so it's extraordinarily fast. In fact, here we go: we successfully primed the cache. From a complete cold start it took us 209 seconds in total, at 77 megabytes per second.
B
If
things
were
warm,
this
would
be
about
150
and
yeah,
and
the
other
thing
that
you
can
do
now,
unfortunately,
don't
have
to
say
for
the
demo,
but
you
could
spin
up
another
lotus
off
of
that
directly
of
the
same
block
store
writing
literally
in
the
same
tables,
because
everything
is
going
to
address
they.
They
know
how
to
lock
each
other,
how
to
not
get
into
conflicts
and
stuff
like
that
and
each
individual,
each
individual.
B
Each instance gets, sorry, access tracking and recent-access tracking, where basically you can simply instruct it, via config parameters, to either save that many recent blocks and just purge the table from time to time on an interval, or to literally record every individual block access, stored down to microsecond resolution, so that you can later come back and do a frequency analysis: like, oh, I'm asking for these blocks all the time, and I should go fix my code or something like that.
B
So
this
was
another
thing
for,
for
the
actors.
Team
were
interested
in
that
and
tips.
That's
visited
is
the
part
that
basically
reports
for
each
individual
instance.
What
did
it
see
when
it
was
when
it
was
sinking
the
chain?
And
lastly,
the
thing
that
is
missing
here
and
why
it's
not
kind
of
like
instead
of
finished
actual
library
when
we
go
through
tip
sets,
we
can
actually
record
what
is
open
as
lotus
itself
goes
through.
B
So
we
can
have
a
list
of
orphans
as
opposed
to
the
least
supposedly
the
the
main
chain
right
now
and
then
all
that
all
the
transactions
that
we
like
record,
all
the
messages,
all
the
stuff
from
the
storage
market
and
so
on
and
so
forth.
If
we
tag
it
with
a
state,
then
we
can
simply
exclusive
join
it
with
the
list
of
orphans
and
we
have
an
always
up-to-date,
real-time.
B
You
know
view
of
what
is
actually
happening
on
the
network.
Plus
we
can
examine
like
what
nikola
wanted.
We
can
examine
how
the
you
know
how
how
the
orphanage
works
and
so
on
and
so
forth
and
yeah.
B
This
is
pretty
much
it,
and
that
is
what
my
proposal
is
like
based
on
that
once
the
big
database
with
all
the
blocks
and
everything
is
available
and
is
streamables
replica,
starting
off
of
that,
with
with
a
much
smaller
like
pre-caching
and
with
with
a
much
smaller
context
of
what
we
actually
need
should
be
taking
like
within
five
minutes,
or
so
I
mean
as
soon
as
I
was
demonstrating,
even
with
cold
start
and
everything
we
did,
we
actually
did
start
thinking
within
10
minutes
so
yeah,
that's
all
I
have.
C
Yeah, so I'm interested in the access log. Does it mean that it could be used to kind of replay access patterns on the data? The bigger thing is: when I was building my index, I still wondered, like, what is the access pattern, actually? It would be super useful if you could just, like, replay actual access patterns, because... yeah.
B
This
is
what
you
want,
yeah
check
it
out,
so
you
have
the
blog
portal.
Now
I
actually
didn't
talk
about
the
wolf
arnold.
The
reason
I
don't
use
cids
everywhere
because
of
block
horizon
only
is
eight
bytes
and
a
cid
is
like
38
bytes.
If
I
like
reference
everything
by
cid,
my
my
thing
will
be
too
big,
so
we
have
the
block
ordinal,
which
is
a
reference
to
the
to
the
cid
table.
B
Access
type
is,
I
do
not
remember
my
existential
moment,
how
did
I
name
them
access
the
access
types
are
I
drag.
The
put
together
has
the
size
and
I
have
an
ore
on
them
whether
it
was
already
seen
in
the
cache.
So
even
if
it
is
in
the
cache,
I
still
keep
a
number
that
I
did
hit
it
in
the
cache
and
I
still
record
it.
So
you
do
you
can
replace.
B
I
replay
the
thing
that
you
asked
for
and
then
how
many
times
within
this
time
the
the
actual
time
is
rounded
up
to
my
millisecond
to
a
millisecond
so
1000
times.
A
second
is
my
resolution
and
if
I
supply
it,
the
epoch
in
which
this
happened
and
which
tip
set
this
happened,
and
when
we
switch
the
next
tip
set,
we
flush
it
again.
C
And then, as I said, since you also store, like, puts and gets, you could even get, for example, information about what's the ratio between puts and gets, because it's also not clear to me. Like, I just don't know, and I'm not sure if anyone knows. Yeah.
C
Yeah, cool.
A
Yeah, I have a question: is this datastore suitable for miners, or are you thinking that it's only really useful for PL-internal things?
B
You
can
you
can
totally
run
a
note
off
of
that
you
can
like
you,
can
import
a
snapshot
and
just
you
know
just
run
with
with
that.
The
actual
change
to
lotus
itself
is
not
very
large
at
all.
Actually,
it's
more
complex,
the
tracking
of
the
of
the
like
of
the
chipset
states
and
stuff
like
that,
because
this
additional,
but
actually
like
block
store
part,
is
literally
like
you
know,
just
just
like
hannah's
work
change
this
bookstore
to
this
other
bookstore
and
that's
it
so,
yes,
yeah
go
ahead.
B
So the reason it is essentially significantly better is that Postgres is extremely economical. Like, this is the amount of memory it takes for each individual worker; well, it moves around, but as you can see, it's like 300 max, so it's like nothing. And most of Lotus right now is taken up, most of all, by the cache. I actually run with 32 gigs of cache, because I store like 3,000 tipsets back, which is way too much; I could bring it down. And yeah.
B
So if you were to run the actual Badger thing, for it to perform well it essentially requires the entire index to be in memory, and the index is seriously larger. And you either fit that in memory, or you see, like, crazy thrashing on the hard drive. As you can see here, the access patterns are not, like, very heavy at all. I mean...
Yeah, so that's it, and yeah: if our thing gets selected, you know, we'll actually publish an actual thing on top of that. If not, then, you know, it will just be used for the analytics part, because the Sentinel folks are very interested in seeing that. And yeah, one thing that I need to address: I said that this will come down to, like, four seconds.
B
I
have
turned
on
the
synchronous
parsing
of
blocks,
so
whenever
a
block
can
seem,
it
actually
needs
to
look
at
it
and
move
its
links
around
and
stuff
like
this.
So
basically
the
linkage
is
always
up
to
date.
So
that's
why
it
takes
longer
for
people
and
that's
and
that's
fine.
You
know,
because
if,
when
you're
caught
up
it's
okay,
but
you
can
shut
this
off,
then
it
will
go
down
to
like
like
four
seconds
per
chipset
and
then
there's
a
background
process.
B
A
Amazing. And to boot, your demo totally worked, and the demo gods were with us, I thought.