IPFS IPFS Camp 2022, 31 Oct 2022

From YouTube: The Future Of Content Addressed Data Transfer - Hannah Howard

Description

How does IPLD represent complex data? How does this affect data transfer and what kinds of data you can request?

A

So, what's going on with data transfer, and how can we do better? Every time I think about Bitswap, I think about BitTorrent, and how BitTorrent is a multi-party data transfer protocol that kicks butt and Bitswap doesn't yet. Why not? Because Bitswap operates in an information vacuum. It's very hard because it knows nothing. With BitTorrent, if I'm about to download a movie (legally, of course), I immediately get a list of peers that have the data.

A

That's the content discovery part. It's built into BitTorrent, because you have a torrent and it tells you who's got it. Then I need to know what the components of that movie file are, and again, from the torrent I get a list of all the blocks that go into it. So I've got all this information with which to split up requests among many peers.

A

In Bitswap you have none of this and, as dig pointed out, one of the reasons every implementation of Bitswap is so freaking complex is that it's trying to figure out which peers have the data while it's trying to get the data. It's a lot. I would agree: we need to break up the content discovery from the data transfer. And, oh wait.

A

I'm out of sync, aren't I? Hold on one second. Okay, Bitswap... no, do I have GraphSync? GraphSync went away. Okay, that's cool, that's fine. Anyway, GraphSync is great, but it's end to end and it loses the multi-peer thing, so that's a problem. But that's not actually the only problem with GraphSync. We've talked about this already, but GraphSync has this everything-but-the-kitchen-sink approach to IPLD.

A

So at this point, who feels like, if I called on you right now, you could tell me what a CID is? Raise your hand; you won't get called on, don't worry. Okay, yes, I see a lot of hands. Okay, who, in front of all of these other people, if I called on you right now, could tell me (other than move and maybe Rod) what codecs, the data model, schemas, ADLs, and selectors are? Wow, bold. All right, let's get the mic over to you. No, just kidding. Anyway.

A

It's a lot fewer people, right? It's just a big lift to get all this right. They're complex. And the other thing is, you know, I introduced selectors. I actually wrote probably the earliest implementation of them. I implemented them; somebody else designed them.

A

But they're like this super complex, arbitrary query language and, as Europa mentioned, they may be a little too complex. I almost feel like selectors were written because we didn't at the time understand the concept of something like an IPVM or virtual machine, so we almost built a Turing-complete query language. Meanwhile, how many of all of you out there actually work on these protocols?

A

How many of you have tried to look up something by something other than either a path or a "give me the whole DAG"? Anyone? Raise your hand if you've ever done something more complex. Oh, volcker, Read Magic! Yes, okay, all right. Well, good for you all, but there's only two of you in a room of, surprisingly, a lot of people who are still here. So yeah, maybe you don't need all that, or maybe, if you need all that, you should be using a VM, right?

A

There are probably cases where you need that computation, but then switch to a VM. There's this other problem with selectors, and I call it the Goldilocks problem: they are complicated, but they can never quite do just what you want. For example, enumerate a UnixFS directory without actually reading any of the files or any of the things at the end. I think magic might literally have encountered this problem like two weeks ago.

A

It's like they're not quite there, right? So maybe this suggests, given that we're like three years into development and we still often reach for the selector we want and it's not there, that we have a bit of a design issue that we ought to revisit, and maybe start from scratch, start simple. Does anyone know what GraphSync was originally written to do?

A

Raise your hand. One person. What was it supposed to do?

A

Sync graphs? No, something more specific. Yes: synchronize the Filecoin blockchain. Yes, cool. Does anyone know if GraphSync syncs the Filecoin blockchain right now? It does not. In fact, it can't quite do that because of some selector design issues. So yeah, something went wrong there. What ended up happening is that the folks developing Filecoin decided to write this little protocol that was just tailored to the blockchain. Again: understand your data, write a protocol for it. They wrote something called BlockSync.

A

All BlockSync does is, when you're syncing the Filecoin blockchain, you query other peers and you say, "What are going to be the CIDs of the next 100 blocks that I'm going to need to sync?" Then, as soon as it has those CIDs, it just uses Bitswap to get them, and so that gets around the problem.
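A minimal sketch of that two-step pattern, under loud assumptions: the chain is a toy in-memory one, `toy_cid` is a hex SHA-256 digest standing in for a real CID, and `Peer`, `next_cids`, `fetch_block`, and `sync_batch` are hypothetical names, not the Filecoin or Bitswap APIs. Checking that the returned CID list really links back to the head is left out.

```python
import hashlib
from dataclasses import dataclass, field


def toy_cid(data: bytes) -> str:
    """Stand-in for a real CID: the hex SHA-256 digest of the block bytes."""
    return hashlib.sha256(data).hexdigest()


def build_chain(length: int):
    """Build a toy chain; each block embeds its parent's CID."""
    blocks, chain, parent = {}, [], ""
    for height in range(length):
        data = f"{height}:{parent}".encode()
        cid = toy_cid(data)
        blocks[cid] = data
        chain.append(cid)
        parent = cid
    chain.reverse()  # head first, walking back toward genesis
    return blocks, chain


@dataclass
class Peer:
    """Hypothetical peer holding the blocks and the order of the chain."""
    blocks: dict = field(default_factory=dict)
    chain: list = field(default_factory=list)

    def next_cids(self, start: str, count: int) -> list:
        """Step 1: answer 'which CIDs will I need next?' in one round trip."""
        i = self.chain.index(start)
        return self.chain[i:i + count]

    def fetch_block(self, cid: str) -> bytes:
        """Step 2: a plain block fetch, standing in for a Bitswap request."""
        return self.blocks[cid]


def sync_batch(head: str, index_peer: Peer, block_peers: list, batch: int = 100) -> dict:
    """Ask one peer for the next CIDs, then spread block fetches over all peers."""
    cids = index_peer.next_cids(head, batch)
    fetched = {}
    for n, cid in enumerate(cids):
        peer = block_peers[n % len(block_peers)]  # round-robin over peers
        data = peer.fetch_block(cid)
        if toy_cid(data) != cid:  # every block is checked against its CID
            raise ValueError(f"block does not match CID {cid}")
        fetched[cid] = data
    return fetched
```

Without the CID list, each block would have to be fetched and decoded before the parent link, and so the next request, is known: one round trip per block. With the list up front, the block requests are independent of each other and can go to any peer.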

A

The problem it gets around is the round-trip problem with the blockchain, because again, the blockchain is a very deep DAG. At the same time, it uses Bitswap, which gives you the multi-peer. So I've been thinking about this, and I think there's something there. I think there's a pattern in this, because we know this actually works for syncing a difficult DAG like a blockchain super fast over libp2p, and it's working in production. And they're not the only ones who wrote a protocol this way. There's a company called Qri.

A

They do not exist anymore, because they went and made Iroh, and they made Iroh because they needed IPFS to do things that it didn't. But in the course of building the product they built a protocol; I think it was called Mana fetch or something. It basically sent the information about the DAG you're going to download before you downloaded it. So think about what it's actually doing there. Like dig was saying, split the content discovery from the data transfer.

A

I would go one further and say: split out the content discovery, and then split out the DAG discovery. Find more information about the data you're going to download before you try to download it, and then multi-peer becomes completely possible, and it becomes fast. Now, there's some funkiness you've got to throw in there to maintain the incremental verifiability. I'll hand-wave over that, but I think it can be done.

A

This gets you to the sort of BitTorrent situation, and I think we could do it. In fact, I wrote GraphSync, I wrote the initial implementation, and if I could go back and do it again, I would have written GraphSync this way. I would have had it just send you CIDs and tell you about your DAG, not try to send blocks.

A

I have my own little keyword for it: I call it lightning storm. It doesn't matter; we all have keywords and magic projects. But what I think we ought to have is essentially a core loop that takes a list of blocks and divides up those requests among multiple peers over a protocol like Bitswap (though it could be something simpler), and then has some software to verify what you're getting back and to make sure you're not downloading too much without verifying it.
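One way such a core loop might look, as a rough sketch and not the design from the talk: peers are modeled as plain callables from CID to bytes (or `None`), `toy_cid` is a SHA-256 stand-in for a real CID, and `fetch_all` is a hypothetical name. Requests run one at a time here; a real loop would issue them concurrently and cap how much unverified data is in flight.

```python
import hashlib


def toy_cid(data: bytes) -> str:
    """Stand-in for a real CID: the hex SHA-256 digest of the block bytes."""
    return hashlib.sha256(data).hexdigest()


def fetch_all(cids: list, peers: list) -> dict:
    """Spread block requests over peers and keep only blocks that verify.

    Each peer is a callable taking a CID and returning bytes or None.
    Block i starts at peer i % len(peers) and moves on to the next peer
    whenever a response is missing or does not hash to the requested CID.
    """
    verified = {}
    for i, cid in enumerate(cids):
        for attempt in range(len(peers)):
            peer = peers[(i + attempt) % len(peers)]
            data = peer(cid)
            if data is not None and toy_cid(data) == cid:
                verified[cid] = data
                break
        else:
            # Every peer was tried once and none returned a valid block.
            raise LookupError(f"no peer returned a valid block for {cid}")
    return verified
```

Because each response is checked against its CID before it is accepted, a peer that sends garbage costs at most one wasted block per request.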

A

This pattern even has a name: we call it the manifest pattern. I think it's something to look into. There are so many optimizations you could unlock with this. One of the things I think is, maybe we should just write a UnixFS sync. With UnixFS queries, we know almost everything we need to know. The gateway's language essentially works very well for UnixFS. If you need to traverse a path, you actually don't need multi-party for traversing your path.

A

GraphSync is great up until the point where you get to the large file at the bottom. In fact, GraphSync would be great if you could traverse the whole DAG of a UnixFS file except for the bottom layer, because it's only the bottom layer of a UnixFS file that has the bytes.

A

So you could traverse that whole thing in a single request if you're using GraphSync, even sending back all the little blocks, and then you get a list of blocks at the bottom. You don't even need any additional verifiability, because you already have the whole DAG. And now you split up those big block requests over Bitswap. There are other ideas. Once you stop sending blocks, you can start thinking about storing just the links between blocks, rather than the blocks themselves, and it's highly cacheable.
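The UnixFS file idea above (fetch the interior of the DAG in one request, then fan out the leaf requests) can be sketched roughly as follows. This assumes the interior nodes have already been fetched and verified and are represented as a plain dict from CID to child CIDs; `leaf_cids` and `assign` are hypothetical helpers, and the short strings are placeholders for real CIDs.

```python
def leaf_cids(root: str, interior: dict) -> list:
    """Return the leaf CIDs of a file DAG in file order.

    `interior` maps each non-leaf CID to its ordered list of child CIDs.
    Any CID that is not a key in `interior` is treated as a leaf, that is,
    a block in the bottom layer that holds the actual bytes.
    """
    order = []
    stack = [root]
    while stack:
        cid = stack.pop()
        children = interior.get(cid)
        if children is None:
            order.append(cid)
        else:
            # Push in reverse so the leftmost child is visited first.
            stack.extend(reversed(children))
    return order


def assign(leaves: list, n_peers: int) -> dict:
    """Split the leaf requests round-robin across n_peers peers."""
    return {p: leaves[p::n_peers] for p in range(n_peers)}
```

Since the interior nodes were verified when they arrived, every leaf CID in the list is already trusted, so each leaf block can be checked against its CID independently no matter which peer serves it.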

A

If I run a DAG traversal via a selector and I generate a list of the CIDs of the blocks it visits, I can record them, and the next time you request that selector I've got that list of CIDs ready to go. There are other ways you could store that data. There's a lot of room to explore there.
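As a small illustration of that caching idea, here is a sketch that memoizes the CID list per (root, selector) pair. `ManifestCache` and the `traverse` callable are hypothetical, and the selector is treated as an opaque hashable value such as a string.

```python
class ManifestCache:
    """Remember the CID list produced by running a selector from a root.

    `traverse` is any callable (root, selector) -> iterable of CIDs. The
    data under a root CID is immutable, so as long as the traversal is
    deterministic, a cached list for the same root and selector stays valid.
    """

    def __init__(self, traverse):
        self._traverse = traverse
        self._cache = {}
        self.hits = 0

    def manifest(self, root: str, selector: str) -> tuple:
        key = (root, selector)
        if key in self._cache:
            self.hits += 1
        else:
            # First request: walk the DAG once and record what was visited.
            self._cache[key] = tuple(self._traverse(root, selector))
        return self._cache[key]
```

The second and later requests for the same root and selector are answered from the recorded list without touching the DAG.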

A

I think we mentioned there's a protocol called car sync. Bloom filters are a cool path to go down to try to avoid sending duplicate data.
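To make the Bloom filter idea concrete, here is a toy sketch and not the wire format of any real protocol: the receiver summarizes the CIDs it already has in a small filter, and the sender skips anything the filter claims to contain. `BloomFilter` and `blocks_to_send` are hypothetical names, and the sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, item: str):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def blocks_to_send(manifest: list, receiver_filter: BloomFilter) -> list:
    """Sender side: skip every CID the receiver's filter claims to have."""
    return [cid for cid in manifest if cid not in receiver_filter]
```

A Bloom filter never gives a false negative, so nothing the receiver has is sent twice. It can give false positives, which here means a block the receiver lacks may be skipped, so a protocol built this way needs some follow-up step to request whatever is still missing.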

A

With Bloom filters, the idea is to just send me what I don't have. Some other folks are using something called erasure coding. Now I'm just throwing out things that we aren't really digging into, but they are things you could explore if you wish to write the best Web3 transfer protocol ever. Erasure coding is really good for ensuring that we have a method for dealing with peers that are slow or that, you know, essentially go on and offline.
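For a feel of why erasure coding helps with slow or vanishing peers, here is a sketch of the simplest erasure code, a single XOR parity shard. It is only an illustration: the talk does not say which code those projects use, production systems typically use stronger codes that tolerate several losses, and `encode` and `decode` are hypothetical helpers.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal shards (zero padded) plus one parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def decode(shards: list, original_len: int) -> bytes:
    """Rebuild the data when at most one of the k + 1 shards is None."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one missing shard")
    if missing:
        present = [s for s in shards if s is not None]
        shards = list(shards)
        # XOR of all surviving shards equals the one that is missing.
        shards[missing[0]] = reduce(xor_bytes, present)
    return b"".join(shards[:-1])[:original_len]
```

If each of the k + 1 shards is requested from a different peer, the download can finish as soon as any k of them arrive, so one slow or offline peer no longer holds up the transfer.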

A

That approach to slow or flaky peers is similar to some of the ideas behind rapid. And finally, I'll just leave you with this: we have the technology. We could do this, folks. We can make it much faster. And if you've come to this talk today, this track today, and you're inspired, there's so much room to do things here and so many ways to help. I hope to see all of you on the GitHubs, on the Slacks, on the Discords, and all that. Thanks.