From YouTube: RETMKT Builders - Smart Records
Description
Petar presents on Smart Records at the Retrieval Market Builders Mini-Summit in April 2021.
Okay, so I'm going to talk about smart records, which is part of the larger theme of composable routing. The status of the project is that we actually have an end-to-end v0 committed. It's working, but we're not quite at the point where you can start dogfooding it; that's coming soon, probably in a couple of weeks or so. First, I want to give you a broader view of how this fits into the larger composable-routing picture.
Composable routing is the effort, which Juan alluded to, of trying to make all the parts of end-to-end content routing reusable with each other, in other words composable, which basically means making them individually generic. Smart records are the data half of composable routing; the other half is the logic of composable routing. We're going to focus on smart records here, but just for your context: what is this routing behavior I'm referring to? This is the other side of composable routing.
It basically includes three things. First, how you write links that point to content: rich links, not just CIDs, which are opaque and carry no hints. Second, how you describe the context at every given peer: every peer has its own circumstances, where it lives in the network, and this is its context.
Context also includes the preferences peers have when they want to route links. And finally, routing logic: given a link and a context, how do you actually go about finding a piece of content?
The thing that makes this a framework is that the routing logic, in other words what you end up doing to actually find the content, is a function of the link, the context and the records, and that function is fixed. That's what makes it a framework. We're just going to talk about smart records today.
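The fixed-function framing can be sketched in code. This is a hypothetical illustration, not the actual composable-routing API: `Link`, `Context`, and `route` are invented names, and the provider-hint logic is just one plausible behavior.

```python
# Hypothetical sketch: routing logic as a fixed function of
# (link, context, records). All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Link:
    cid: str
    # Rich links carry hints, unlike bare CIDs, which are opaque.
    hints: dict = field(default_factory=dict)

@dataclass
class Context:
    # Per-peer circumstances and routing preferences.
    preferences: dict = field(default_factory=dict)

def route(link: Link, context: Context, records: list) -> list:
    """The fixed framework function: combine provider hints from the
    link and from the records, filtered by the peer's preferences."""
    candidates = list(link.hints.get("providers", []))
    for rec in records:
        candidates.extend(rec.get("providers", []))
    avoid = set(context.preferences.get("avoid", []))
    return [p for p in candidates if p not in avoid]
```

The point of the sketch is that applications supply the data (link hints, context, records), while the function itself never changes.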
Okay, so roughly, what are they? The most general way of stating the problem they solve is this: they try to create a standardized, shared public medium for writing and reading by multiple participants, who talk to each other using multiple protocols, and where the actual contents of this public medium are scattered across multiple locations.
Now, this is a mouthful and it sounds fairly generic, so to put some flesh on it, let's interpret the statement in the context of the familiar DHT that we currently have. What's happening right now is that the DHT protocol, literally in the IPFS stack, takes care of two jobs at once.
The first thing we're saying here is that the putting and getting of data should be decoupled from the DHT; the DHT should only be doing the finding of peers. That's step one. The functionality of actually storing data shared by multiple protocols is this new project we're discussing now, called smart records, and it should be a separate protocol that just happens to run side by side with the DHT on every IPFS node.
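A minimal sketch of this decoupling, using stand-in stubs (the class and method names are assumptions for illustration, not real libp2p interfaces): the DHT's only job is finding peers for a key, while a separate smart-records protocol does the actual storing.

```python
# Illustrative stubs only: separating "find peers" (DHT) from
# "store data" (smart records).
class DHT:
    """Job 1: peer discovery only; no data storage."""
    def find_closest_peers(self, key: str, k: int = 3) -> list:
        # Stand-in for a Kademlia lookup.
        return [f"peer-{i}" for i in range(k)]

class SmartRecordsProtocol:
    """Job 2: data storage, shared by multiple higher-level protocols."""
    def __init__(self):
        self.hosted = {}  # (peer, key) -> value

    def put(self, peer: str, key: str, value: dict):
        self.hosted[(peer, key)] = value

def put_record(key: str, value: dict, dht: DHT, records: SmartRecordsProtocol):
    # The DHT locates the hosts; the records protocol performs the put.
    for peer in dht.find_closest_peers(key):
        records.put(peer, key, value)
```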
The thing to note, though, is that we also want to make this whole notion of a smart record a self-contained entity, if you will, that is also portable. Why do we want that?
Because this kind of shared data space doesn't necessarily live only in the DHT, which is just the most obvious motivating example; you could also imagine sending these records across completely different kinds of protocols. In fact, currently, even in this session, but also in the W3D working group, people are already trying to send records through pubsub and other protocols.
So if smart records, the standard that wraps the data, are self-contained and isolated from other context, you can compose them, in other words reuse them in different settings, whether with the DHT, with pubsub, or with any other exotic combination of protocols.
So far these have been problem statements, in a way. What is it, exactly? Well, the simplistic way of saying it is that we're upgrading our DHT values to become publicly writable, publicly updatable JSON documents.
The key point here is that we want to view the data associated with a record as a replicated state machine, which has its own interface and virtual machine, and hence is portable. The key benefits we get from this are that it supports, like I said, reading, writing, merging, and some smart services that I'll mention at the end. Next I want to talk about the use cases it immediately enables.
The first thing it enables is the deployment of new applications to the IPFS network without upgrading the network. If somebody writes a chat application, for example, where the participants want to exchange chat-specific data, they can store it in the DHT, because the DHT will now allow you to use your own custom data structures, not just the hard-baked ones.
Another design goal here is to enable different protocols to interact with each other. That's why records are a public blackboard, and a single blackboard: there's one document per key, where multiple protocols can write into their own spaces, and they're therefore able to read each other's data, if they understand it, and benefit each other. That's something that isn't really possible today.
Some other use cases: records might also end up being useful for facilitating cryptographic protocols that require a "trusted" third party. I put trusted in quotes because, depending on the setting, you may or may not trust a DHT node to be a fair judge between two other nodes; but it has the potential to be used that way, as a kind of opportunistic judge, which enables lots of cryptographic protocols.
I've only listed fair exchange here. Generally speaking, a side benefit I'm very excited about is that this unlocks application development on the DHT to the public, essentially unconditionally, and that can (and should, when we release it) produce lots of interesting outcomes. So let me get to the technical aspect of this thing and walk through the thought process that led to the design we currently have.
The key thing to remember, keeping the DHT as your mental model, is that there are multiple hosts of the data being stored for a key; you know this because this is how the DHT works. There are multiple writers, since many people, for instance, write provider records for the same key, and there are multiple readers, because many people might be interested in the same key.
So there are multiple players of each kind, and these are the hard technical requirements this shared medium has to satisfy. First of all, the data model has to be generic, because we want different protocols to be able to use it. Think JSON or IPLD; in fact, the IPLD data model is essentially the same thing as JSON. So, bottom line, the data model is roughly JSON.
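As a concrete illustration of that generic, JSON-shaped data model (the section names below are invented for the example), one record value can carry independent sub-documents for different protocols:

```python
import json

# One record value, JSON-shaped, with a namespace per protocol.
# The section names are purely illustrative.
record = {
    "providers": {"peers": ["peer-A", "peer-B"]},
    "my-chat-app": {"room": "builders", "members": ["alice", "bob"]},
}

# Being plain JSON, it round-trips through standard serialization.
assert json.loads(json.dumps(record)) == record
```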
Updates to these records have to be commutative. This is simply because the whole system is asynchronous, so any two updates can be reordered, and the semantics of what happened shouldn't change. The values, these JSON values attached to keys, have to support merging, because in the DHT setting the value of a record is virtually always distributed across a few different DHT nodes, so reconstructing the record really involves collecting all the pieces and merging them together.
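One simple way to get these properties (a sketch, not the actual smart-records algorithm) is a last-writer-wins entry per path, ordered by a (sequence, value) pair so that merging stays commutative even when two partial views carry the same sequence number:

```python
# Sketch of a commutative merge over partial views of a record.
# Each entry is (sequence_number, value); the higher pair wins, with
# the value as a deterministic tiebreak so argument order never matters.
def merge(a: dict, b: dict) -> dict:
    out = dict(a)
    for path, entry in b.items():
        if path not in out or entry > out[path]:
            out[path] = entry
    return out
```

With this ordering, `merge(a, b) == merge(b, a)` and merging is idempotent, which is exactly the kind of behavior the asynchrony requirement above demands.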
So merging has to be supported, and merging must commute with updating, for the same reason: asynchrony. And the final requirement, the crux of it, is that conflicts, which can only happen during merging and updating, must always be resolvable, because this is a public medium; they always have to be resolvable. Designing this part is what I want to spend a few slides on now.
Another aspect I want to point out is that it's pretty clear that a public medium like this requires regulation. What I mean specifically is that the most natural kind of regulation is to say: any time somebody updates a piece of the record, we would expect to be able to impose payments for different aspects of the operation. Payments per size, that is, for how much data you write; payments, perhaps, for how long you want what you've written to stick to the record; and you could envision other kinds of payments as well.
The reason why assuming payments, or at least some virtual form of payments, as part of the system seems straightforward is this: it's pretty clear that if you can write to a public blackboard for free, there are no well-behaved ways of preventing denial-of-service attacks other than payments. Anything else would essentially compromise the clean semantics of the product itself, of writing to this document.
I know some folks are going to be concerned about how we're going to use this technology in the IPFS setting, where there are no payments at the moment. It's not a problem, in the sense that you could use these records without a payment scheme and just reuse the denial-of-service heuristics that are currently being used in IPFS.
So I'm not going to delve into that. Next, I want to talk about conflicts: how we designed conflict resolution, really the semantic aspect of this service. First, stepping back for a second, there are two extreme ways you can approach conflict resolution in general. One is the destructive approach, where you essentially define some form of competition between the different writers to a location in the document, and the winner gets to override the loser.
You can get very creative with how you define competition, and that's also why it doesn't seem to be the right way to go if you want to be general: there are too many possible designs. The alternative extreme is non-destructive conflict resolution, which basically means remembering all updates as they are, not trying to merge them into a single document, and just handing all the updates to whoever wants to read the document. This has its downsides.
It takes a lot of space, but what's more concerning to me is that it's very difficult for our users, that is, application developers, the users of smart records, to deal with a data structure that is a sequence of updates, because they have to worry about reconciling and merging the updates themselves.
That's developer friction, which I'd like to avoid. I think it's pretty key to avoid it, especially given that we're handing this to a wider audience.
The good news is that there's a middle ground which actually gets the best of both worlds. The middle ground is simply to keep a separate copy of the record for every writer. This makes sure that writers cannot DoS each other, since they cannot write over other writers' data, but when they update their own records, they can overwrite to gain efficiency. Now, the pros and cons of the middle approach.
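A sketch of this middle ground (the names are hypothetical; real host logic would do much more): the host keys each sub-document by the writer's identity, so a writer can only ever overwrite their own copy.

```python
# Illustrative sketch of per-writer record copies on a host.
class RecordHost:
    def __init__(self):
        self.records = {}  # key -> {writer_id -> document}

    def update(self, key: str, writer: str, document: dict):
        # A writer freely overwrites their own copy (efficiency)...
        self.records.setdefault(key, {})[writer] = document

    def get(self, key: str) -> dict:
        # ...but readers see every writer's copy, so no writer can
        # destroy another writer's data (no cross-writer DoS).
        return dict(self.records.get(key, {}))
```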
First of all, the good news: if you have the middle approach as your basic API, you can implement any of the other approaches, like the competition-based one, for instance, at the application layer. One downside is that slightly more data resides on the record host, because you keep a copy of the document for every writer; of course, you only keep a copy of what each writer has actually written to the document.
I think this is a fair price to pay, also considering that, if you believe in having a payment model for all the writes, the host of the records not only doesn't mind people writing a lot; in fact, they probably prefer it, in order to make money. And on the other hand, the middle-ground approach uses considerably less data than the extreme non-destructive approach.
Okay. Having given you our thinking behind conflict resolution, what does the product actually look like? What's the model? Every peer, identified by a public key, writes to a peer-specific document for any given key.
As I already said, peers can overwrite their own documents. The last piece is that every time a peer updates a piece of the document, they get to specify a TTL for every single node in the document that they write. This, of course, is to accommodate protocols that have different temporal semantics for how long they want the data to stay. The actual interface of the product, what an application developer gets to interact with, really just puts what we have so far into words: peers update a key with the change they want to make to the value; they get to specify a duration for each node in that value; and they also get to set the sequence number, from the writer's point of view, of this update. I'll mention in a second why that's necessary. And retrieval works like this.
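Putting the pieces of the interface together: an update carries a value delta, a per-node TTL, and the writer's sequence number, and retrieval drops expired nodes. This is a sketch under assumed names, not the committed v0 API.

```python
import time

# Illustrative sketch of one writer's document on a host: each node
# stores (sequence, expiry, value).
class SmartRecordClient:
    def __init__(self):
        self.doc = {}  # path -> (seq, expiry, value)

    def update(self, path: str, value, ttl_seconds: float, seq: int):
        prev = self.doc.get(path)
        if prev is None or seq >= prev[0]:  # stale reordered updates lose
            self.doc[path] = (seq, time.time() + ttl_seconds, value)

    def get(self) -> dict:
        # Retrieval only returns nodes whose TTL has not elapsed.
        now = time.time()
        return {p: v for p, (s, exp, v) in self.doc.items() if exp > now}
```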
These things, by the way, are easily mergeable from a developer's point of view, unlike a list of updates, because each individual peer's record should be a semantically correct document with respect to the protocol. That's not true if you send individual updates to the reader. Why do we need to send the peer sequence number?
One addition that will happen in the near future on the retrieval side of the interface is that we're not going to allow retrieval of all values. Eventually, when this system starts to get used heavily, you're going to have quite a lot of data in a single record, perhaps pertaining to different protocols, and it might be too much data to retrieve if you only care about one of those protocols. So we would replace the get-all with "give us a selector," and you get back only the selected part of the record.
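Selector-based retrieval could look roughly like this: a deliberately naive sketch where a "selector" is just a top-level namespace prefix, not an actual IPLD selector.

```python
# Naive selector sketch: return only the slice of the record whose
# top-level keys match the requested namespace prefix.
def get_with_selector(record: dict, selector: str) -> dict:
    return {k: v for k, v in record.items() if k.startswith(selector)}
```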
The final thing is that the records people write can have sections with a special meaning to the record hosts. In particular, these are essentially ways of asking the record host to perform some service for you, some service that goes beyond just writing the data to the record. Examples of such services include, for instance, making sure that a multiaddress is properly formatted, and also reachable, before you actually write it to the peer records.
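As a toy version of that multiaddress example (a format check only; a real host would use a proper multiaddr parser and would also probe reachability before accepting the write):

```python
# Toy validation a record host might run before accepting a write.
# A multiaddr looks like /protocol/value/..., e.g. /ip4/127.0.0.1/tcp/4001
def looks_like_multiaddr(s: str) -> bool:
    parts = s.split("/")
    return s.startswith("/") and len(parts) >= 3 and all(parts[1:])
```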