Description
Merkle-DAGs are fun, but even more so when they can be used to compute an ever-evolving state (i.e. a database) collaboratively maintained by a set of peers that may come and go as they please. Join us to learn about Merkle-CRDTs, what use cases they fit and how they power things like IPFS Cluster pinset distribution.
We're going to talk about Merkle-CRDTs as a way to replicate data between peers in a peer-to-peer network. But, more important than understanding exactly how they work is that you leave this talk with some ideas of things you can potentially do on top of the IPFS stack.
First things first: my name is Héctor Sanjuán and I work at Protocol Labs as a software engineer. Together with some colleagues (Samuli, Pedro and Ioannis) we put together the Merkle-CRDT paper a couple of years ago.
I put a picture of my avatar there, not because I look extremely pretty in it, but because of the background: the designer decided to put some castles in it, even though they didn't know that I love castles, so I find it magnificent. In the IPFS ecosystem we have different types of bricks, modules and tech that can be combined into something bigger. So you could say that today I'm essentially going to be building a castle, because a castle is a structure built by putting some things on top of other things in a way that it doesn't fall and crumble. I promise there will be more good pictures.
What we're building, in modern terms, is a replicated key-value store. A key-value store is essentially a table with three operations: get, put and list; read, write and query.
In this case, all participant replicas of this table can write to it at any time. Replicas can join and leave at any time and in any number, and anything can happen to the messages on the network: they can be dropped, they can be reordered, they can be corrupted. Despite all of these things, the replicas eventually converge.
So all replicas will have exactly the same key-value store, with the same keys and the same values, once they have applied exactly the same messages.
If you build something without a solid foundation, this happens. You see that tower? This was a tower in Zaragoza, Spain. The finish was pretty nice; it looked awesome, it looked beautiful. The foundations were not so good. Have you seen this tower before? Has anyone seen this tower before? Why not? Because it's not there anymore: it had to be dismantled.
Therefore, I'm going to use some very solid foundations for what I want to build, so that I don't have to dismantle my towers.
There we go. The first foundation is libp2p and GossipSub. As you know, libp2p allows me to create a peer-to-peer network, and GossipSub is a scalable pubsub implementation that allows every peer in the network to broadcast messages to the others. It is used in Filecoin and is quite solid these days.
The second foundation is IPLD. IPLD is a number of interoperable formats to create content-addressed blocks and links between them to form Merkle-DAGs. A Merkle-DAG is like a tree in which every node is content-addressed. It is interesting not only because of all the properties of content-addressed data, but because once I have traversed one of these Merkle-DAGs and I'm done with it, I will never have to traverse it again, because nothing in it can change.
The third foundation is IPFS. IPFS allows us to make our IPLD nodes available to the peer-to-peer network, providing them to other participants and retrieving them from other participants when needed, using content routing and peer routing. That means IPFS allows us to replicate the Merkle-DAGs that I'm building with IPLD.
This logo for CRDTs was generated using Stable Diffusion, and it was the best it could come up with. So now we have the building blocks and we can start building. Allow me to switch to a different iconography from castles, because it's also good and I find it very beautiful.
We can implement our key-value store rather simply. Instead of having a single table where we add and delete rows, we're going to have two tables: an elements table for added keys and a tombstone table for deleted keys. Additionally, every entry is going to have a unique ID, and this way we're going to avoid conflicts when multiple replicas add or delete the same key at the same time, since the entries will have seemingly random IDs.
What is the process of checking if a key exists, that is, if a key has been added to my tables? We check the elements table for all the entries that have that key and verify that at least one of them is not in the tombstone table. That means it has not been deleted. Note that we may have several entries in the elements table with the same key but different IDs.
In CRDT jargon, this is an add-wins observed-remove set. But since this is not a set but a key-value store, we are going to need two more tables: one with the current value of a key and one with the current priority of a key. So when we write a key, we can use the priority to determine if the new value of that key should overwrite what we already have; and if the priority is the same, we can still resolve the conflict deterministically, for example by sorting the values alphabetically, but never randomly.
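As a rough illustration of the tables just described, here is a minimal Python sketch of one replica's local state. All names are hypothetical and this is not the actual implementation; ties here are broken by value ordering, which is one deterministic choice among several.

```python
class ReplicaState:
    """Local state of one replica: an add-wins observed-remove set
    extended with value and priority tables (illustrative sketch)."""

    def __init__(self):
        self.elements = set()     # (key, id) pairs for added entries
        self.tombstones = set()   # (key, id) pairs for deleted entries
        self.values = {}          # key -> current value
        self.priorities = {}      # key -> priority of the current value

    def apply_put(self, key, value, priority, entry_id):
        self.elements.add((key, entry_id))
        # A higher priority overwrites; equal priorities are broken
        # deterministically (here by value ordering), never randomly.
        current = self.priorities.get(key)
        if (current is None or priority > current
                or (priority == current and value > self.values[key])):
            self.values[key] = value
            self.priorities[key] = priority

    def apply_delete(self, key, observed_ids):
        # Tombstone only the entries the deleting replica has observed,
        # so a concurrent add with a fresh ID survives (add wins).
        for entry_id in observed_ids:
            self.tombstones.add((key, entry_id))

    def get(self, key):
        # A key exists if at least one of its entries is not tombstoned.
        alive = any(k == key and (k, i) not in self.tombstones
                    for (k, i) in self.elements)
        return self.values[key] if alive else None
```

Because puts and deletes commute under these rules, two replicas that apply the same operations in different orders answer get (and list) identically.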
The priority can be a number, potentially a timestamp, anything that helps you decide which write should take priority over the others. So, as you can see, this is conflict-free: replicas do their operations, anyone receiving those operations can apply them regardless of the order, and eventually they all come to the same state. They will answer the get queries and the list queries in exactly the same way. But note that for all replicas to converge, it is important that they receive all the operations. Operation-based CRDTs have the problem that they normally require reliable message delivery, because otherwise things don't converge: if one message is lost, you will never arrive at the state you want to get to.
Therefore, if you don't have reliable message delivery (which, really, you never have), you need to implement strategies to ensure that your messages are delivered. You need to track which messages you are delivering, which you have delivered, which you have not delivered, etc. Things get very complicated very fast. The second part of the Merkle-CRDT work, the Merkle part, helps us deal with that by taking advantage of our foundations.
This is the Merkle-CRDT part. Rather than just broadcasting our CRDT operations, as we would do in a normal implementation of operation-based CRDTs, what we're going to do is add each operation to a Merkle-DAG, embedding the operations in IPLD nodes that get appended as new roots of the DAG. Instead of broadcasting the operation, what I can broadcast is just the CID of the new root, using GossipSub.
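The append-and-broadcast step can be sketched as follows. This is illustrative Python, not the real stack: `cid` here is just a SHA-256 over canonical JSON, standing in for real CIDs and IPLD encoding.

```python
import hashlib
import json

def cid(node):
    # Stand-in for a real CID: hash of a canonical encoding.
    data = json.dumps(node, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

class MerkleClock:
    """Append CRDT operations as new roots of a Merkle-DAG."""

    def __init__(self):
        self.blocks = {}   # cid -> node (the local block store)
        self.heads = []    # current root CIDs of the DAG

    def add_op(self, op):
        # The new node embeds the operation and links to the previous
        # heads, becoming the single new root of the DAG.
        node = {"op": op, "links": sorted(self.heads)}
        c = cid(node)
        self.blocks[c] = node
        self.heads = [c]
        return c   # only this CID needs to be broadcast

    def traverse(self, root):
        # Rebuild the full operation history from any root.
        stack, seen = [root], set()
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            node = self.blocks[c]
            yield node["op"]
            stack.extend(node["links"])
```

Anyone receiving the root CID can fetch the missing blocks and replay every operation reachable from it, which is exactly the property the next paragraphs rely on.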
We get many good things by doing this. First, we broadcast a small piece of information: potentially, if you attach a huge value to your key, you don't necessarily need to broadcast it and make GossipSub transmit it; you can just broadcast the CID.
Second, our operations are all linked to the previous operations in the Merkle-DAG, which means all the operations that we know of. So when someone receives the CID, they have direct pointers to everything that came before it, if they need it. Third, we have a notion of order: because I'm adding new roots to my Merkle-DAG, we know that a parent came later than its children. It must be an event that happened later, because I added it on top of the children. The whole state is uniquely identified by the current root CID and is fully rebuildable by traversing the DAG and applying the operations contained in it. But, most importantly, the state is now a set of IPLD nodes. This is IPFS's favorite food, so now we have what is essentially an IPFS-powered state: we can move it using IPFS.
The replica then remembers the received CID as the new DAG head, the new root of the DAG, and whenever it wants to issue a new operation, it will just add it on top. Note that, in a system with many replicas writing to this Merkle-DAG at the same time, this is not going to make a linear blockchain with a single branch growing on top of itself, and that is a good thing.
Because, if you have a DAG with many branches, you can parallelize traversal of the DAG. If you had a DAG with one branch, you would have to go node by node. If you have several branches and, at a given moment, a node has 10 children, I can have 10 workers processing those children. So as long as the branching doesn't get too bad, it isn't a bad thing; it actually speeds things up. And so, with these foundations, we have built the castle. The tower is very straight; it's solid.
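The branch-parallel traversal idea can be sketched like this. It is an illustrative level-by-level walk, where `fetch` stands in for the (slow) network retrieval of a node by CID; the real syncer works differently but exploits the same property.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_traverse(fetch, roots, process, workers=10):
    """Traverse a Merkle-DAG level by level, fetching each level's
    branches concurrently. `fetch(cid)` returns a node as a dict with
    a 'links' list; `process(cid, node)` handles one node."""
    seen = set()
    frontier = list(roots)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Deduplicate and skip already-visited nodes.
            batch = list(dict.fromkeys(
                c for c in frontier if c not in seen))
            seen.update(batch)
            # The slow part (the fetches) runs in parallel.
            nodes = list(pool.map(fetch, batch))
            frontier = []
            for c, node in zip(batch, nodes):
                process(c, node)
                frontier.extend(node["links"])
```

With a single-branch DAG every level has one node and this degrades to sequential fetching; with many branches, each level keeps up to `workers` fetches in flight.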
The system can deal with a lot of churn: replicas disappearing and rejoining, messages being lost, etc. And it wasn't too difficult to do this with the things and the foundations that I got from the IPFS stack.
Consider all the things that I didn't have to do, and that matters. Did I have to build a network stack or define a protocol for my CRDT nodes to communicate between them? No: libp2p and IPFS do that. Did I have to implement scalable strategies to broadcast the information to the network? No: I have GossipSub; it was already done for me. Did I have to implement operation-tracking logic to identify replicas that are missing operations, so that I can resend them? No.
If there is no activity, I can just re-broadcast the current Merkle root every minute and I'll be done. Did I have to implement logic about which replica is best suited to send operations to another replica that needs to receive them? No: Bitswap figures that out with content routing. Did I have to add any logic to retry fetching from a different replica when the one I was copying operations from doesn't provide them anymore?
No: Bitswap and the DAG syncer take care of that. Did I need to implement a linked data structure that is content-addressed? No: IPLD gives me everything out of the box. Do I need to care about how many replicas there are and how to connect them? No: libp2p just connects things, and you can start broadcasting on the network.
But, of course, there are also a number of downsides. I wouldn't even say issues, but you have to be aware, when you're working with these sorts of systems, of the trade-offs that you make. I told you a very nice list of properties that we have, and if you want something else, you will have to start compromising on those initial properties, which may be acceptable or not.
First, it is difficult to traverse Merkle-DAGs efficiently. In a real system it is a bit more complex than I showed: traversing branches, picking the right new heads when you get a fetch error as you're traversing, and ensuring that the exploration of all branches happens and that you leave nothing behind, even when you face errors. It's tricky, it can be inefficient, and if you want to operate at scale, you certainly need a very careful implementation of these things.
The problem is that the issues that happen are hard to see, because even with an issue, everything converges in the end. So the result may be perfectly fine, but maybe, to get that result, you're working 100 times more than you should because of a bug in your code, and that prevents you from working at scale, it's very hard to test, etc. The same applies to the underlying storage: the underlying key-value store that holds, locally in every node, the tables that I showed you before.
So that is the problem: things are not as easy as I have shown you in the presentation. By the way, who knows the author of this painting?
Why did I put it there? Because there's a trade-off in how he was working and where he went. Obviously, in order to get into abstraction, into abstract geometry, he had to drop a lot of the things that he had before in his painting. I found that interesting and I built a comparison with it for the sake of the talk. Anyway: ever-growing DAG size. I told you we append to the Merkle-DAG and we're always adding new roots on top of it. That means it is perpetually growing.
So replicas don't know when they should get rid of the underlying Merkle-DAG and compact it, or anything like that. Having that requires building agreement, and building agreement requires consensus, and consensus usually requires a blockchain, or knowing the number of participants, knowing the number of replicas, so that you can set up proposals and they can be accepted and so on.
So every application must decide if it's worth trading those assumptions for additional functionality. But what makes Merkle-CRDTs so special is precisely having those assumptions. If you start stepping away from them, the Merkle-CRDT starts being less special and more like any other distributed key-value store that you can build.
Another issue is how to protect against dishonest actors, illustrated here by the Crusaders entering and sacking Constantinople, which was Christian too at that time, so I would consider that a bit dishonest. You can put a lot of rules on what valid CRDT operations are, on the CRDT operations that the system accepts.
I can have nodes not accept invalid operations, but, again, there is no agreement, so you cannot have update rules for the state that are based on the state itself. For example, you cannot implement a cryptocurrency with wallets and balances just using Merkle-CRDTs, because the system cannot possibly agree on what the current balance of a wallet is. Maybe the balance hasn't converged. Therefore, you cannot have a rule that decides whether an update to that balance is valid.
Therefore, this is not directly compatible with Merkle-CRDTs, or even with CRDTs in general, I would say. In general, I find that there are adversarial scenarios that are very hard to deal with when working in CRDT land, and the rules that you can set are limited by the design of the data type. Therefore, Merkle-CRDTs will normally be used in trusted systems, where you trust your replicas to behave properly, or you will interleave them with actual consensus systems, or with things that give you a bit more in the way of guarantees and protections against bad actors.
Finally, there's the question of latency. Batching updates can give you a huge performance boost. It's going to save space, because you have to issue fewer IPLD nodes: instead of one operation per node, you can put in thousands. And it's going to reduce network congestion, because you don't have to broadcast so many updates.
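The batching trade-off can be sketched like this. The `commit` callback is hypothetical: in a real system it would embed the batch of operations in one IPLD node and broadcast the new root CID.

```python
import time

class Batcher:
    """Accumulate operations and commit them as a single batch once
    the batch is full or old enough (illustrative sketch)."""

    def __init__(self, commit, max_ops=1000, max_age=60.0):
        self.commit = commit
        self.max_ops = max_ops
        self.max_age = max_age    # seconds of added latency we accept
        self.ops = []
        self.started = 0.0

    def add(self, op):
        if not self.ops:
            self.started = time.monotonic()
        self.ops.append(op)
        if len(self.ops) >= self.max_ops:
            self.flush()

    def maybe_flush(self):
        # Called periodically: bounds the latency cost of batching.
        if self.ops and time.monotonic() - self.started >= self.max_age:
            self.flush()

    def flush(self):
        if self.ops:
            self.commit(list(self.ops))
            self.ops.clear()
```

`max_ops` and `max_age` are exactly the knobs in question: larger values mean fewer IPLD nodes and broadcasts, but a longer wait before other replicas see the keys.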
But, of course, it introduces latency: the time that you spend building the batch and receiving the operations that make up the batch is time during which the other replicas are not receiving the keys that are being written, etc. Applications that want really low latency, where it's really important that a new key arrives very soon, perhaps don't have time to wait one minute for a batch to be completed, but others can certainly take advantage of this.
This is actually what happened when the Turks conquered Constantinople. There was a water blockade, and Mehmed II decided to essentially do that with his boats to attack from the other side, and it surprisingly worked. If you have a chance to read about the fall of Constantinople, do, because it is nuts.
So what is this all for? It's a complex world: there is good and evil, there are different applications, and there is no one-size-fits-all solution. Some of them are weirder than others, but in the end there might be an application for which this is relevant, or partially relevant. The paper that we wrote happened in the first place because we were looking at how to painlessly sync the pinset database in IPFS Cluster using the go-ipfs stack. OrbitDB was already there.
OrbitDB is a database written in JavaScript by one of my co-authors, Samuli, and it was already using these principles. In Cluster we had Raft before, and Raft is awesome: it's very optimized.
It works very well, but Raft clusters cannot operate if half of the peers are down. That was bad, because if you have a two-node cluster, it is not going to operate if one peer is down. That was a huge pain and a usability problem. Every node that you wanted to add had to be acknowledged by all the other nodes; every departure had to be acknowledged by all the other nodes.
If one node left, there were errors everywhere, and every update that you wanted to send to the cluster needed to travel through the leader. Not everyone could write: only the leader could write at any time, which means redirecting everything to the leader, etc. It was not really suited to the kind of flexible system that we wanted Cluster to be. Therefore, we implemented our pinset syncing layer with Merkle-CRDTs, and here it is.
On top of what is now a pretty big Merkle-DAG, because it has been growing for a year, and using batching, we can, at the very least, with a very naive approach to sending updates to the API, sync more than 250 objects per second.
You can do more: you can hit several nodes, or you can parallelize your requests to the same API and so on, and get more items per batch if you want, but this is the most naive approach and the base number that I can give you. The setup is highly scalable, and the system can absorb write peaks very naturally: you just write, building on top of the Merkle-DAG and broadcasting as you go, and the system will sync as the conditions in the system allow.