From YouTube: Future of Decentralized Data Transfer
Description
In this session, hear how we're expanding upon GraphSync, Bitswap, and better data transfer algorithms for IPFS and Filecoin.
Hey everyone. I am so excited to chat with you today about our evolving data transfer protocols in Filecoin and IPFS, and how we're going to make the distributed web more reliable, accessible, and interoperable.
So, my name's Hannah. I am a programmer with the Web3 data systems team at Protocol Labs. I've been working on IPFS and Filecoin for about three years, and honestly, in my entire career I have never worked on more fascinating technological problems than I do now. I hope that by the end of this talk, you are as interested in them as I am.
If you like the first slide, there is a lot more David Bowie coming, because my Filecoin orbit theme fell to Earth, and with that it's full of space oddities. So let's go ahead and get started.

I want to start with a very simple question: do you trust the data that you download on the web? When you enter an address in your web browser, how do you know the content you got is actually what you were looking for? The surprising answer is: you don't really know. On the traditional web, if the website uses SSL, you have some guarantee that you're downloading from the right site, but you still don't know if you're downloading the right data, because you don't know if that site's been hacked. Plus, SSL is pretty cumbersome to set up and maintain for people who run websites. And while there are, in fact, ways to check your data, they're so cumbersome that only the most technical people use them.
So it's somewhat surprising that the whole web works at all. But in the distributed web, we want to do better, right? And that's what I think is one of the core innovations of IPFS and Filecoin: we've developed data formats and addressing schemes that provide much better guarantees about the data you download. Trusting data is no longer about who you get it from, but rather whether it matches a cryptographic contract that's built into the address you use to find it. Put simply: you cannot download something you did not ask for.
So let's talk briefly about our data formats and what they enable us to do. The first thing I want to talk about is the content identifier, or CID. It is a way to identify content that is cryptographically secure. Inside a content identifier is a cryptographic hash of the block of bytes that you're going to download, and the CID also contains information about how the data was encoded, to help us interpret those bytes at a higher level.
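To make that concrete, here is a minimal sketch of content addressing in Python. This is not the real CID format (actual CIDs use the multiformats stack of multibase, multicodec, and multihash); it just shows the core contract: the address is derived from the bytes plus a codec tag, so any block can be checked against the address that was used to request it.

```python
import hashlib

def make_cid(block: bytes, codec: str) -> str:
    """Build a simplified content identifier: a codec tag plus a
    SHA-256 hash of the block's bytes. (Real CIDs use the multiformats
    encodings, but the idea is the same.)"""
    digest = hashlib.sha256(block).hexdigest()
    return f"{codec}:sha256:{digest}"

def verify(cid: str, block: bytes) -> bool:
    """A block is valid only if re-hashing it reproduces the CID,
    so you cannot be handed data you did not ask for."""
    codec = cid.split(":", 1)[0]
    return make_cid(block, codec) == cid

cid = make_cid(b'{"name": "cheese"}', "dag-json")
assert verify(cid, b'{"name": "cheese"}')       # right bytes: accepted
assert not verify(cid, b'{"name": "tampered"}') # wrong bytes: rejected
```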
As a quick aside, CID is sometimes pronounced "sid," and I'm going to try not to do that; it's sort of an internal shorthand we use, but if I slip, that's what I mean. So: most blocks of bytes on our network decode into a structured data format called IPLD, or InterPlanetary Linked Data.
This means that you can take the bytes you get for a block and decode them into something much more meaningful. The data model of IPLD is similar to JSON, in that it has basic data types like strings and numbers, and collections like lists and maps. However, IPLD also supports embedding content identifiers into that structured data, and that allows us to break up large data sets over lots and lots of blocks.

In the sort of contrived example you see on the screen, we have a list with two maps inside of it. In the first map, there's a CID that links to an image, called cheese image, and a number value. The second map has a CID that links to further data, as well as a string value with the name of the fruit.
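As a rough Python analogue of that slide (the field names and CID strings here are made up, and the `{"/": ...}` link convention is borrowed from DAG-JSON), the value might look like this, along with a helper that collects the embedded links:

```python
# A list of two maps, where {"/": "..."} stands in for an embedded CID
# link. The CID strings below are placeholders, not real hashes.
doc = [
    {
        "cheese_image": {"/": "bafy...image"},  # link to an image block
        "count": 1,                              # a plain number value
    },
    {
        "wiki": {"/": "bafy...wiki"},            # link to further data
        "fruit": "durian",                       # a string value
    },
]

def links(node):
    """Collect every CID link embedded in an IPLD-style value; these
    links are how a large data set gets broken up over many blocks."""
    if isinstance(node, dict):
        if set(node) == {"/"}:
            return [node["/"]]
        return [c for v in node.values() for c in links(v)]
    if isinstance(node, list):
        return [c for v in node for c in links(v)]
    return []

assert links(doc) == ["bafy...image", "bafy...wiki"]
```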
In this case, that further data is the wiki. IPLD is in a lot of ways very similar to HTML: it's almost a hypertext format, but it's structured data that a computer can understand, and the links are CIDs that guarantee we get the right data. So, fun fact: almost all data that is stored in Filecoin is encoded in IPLD.
When you make deals in Filecoin, you probably make them for files or directories, but all of those are encoded to IPLD formats before they get transferred, and that produces some unique advantages that we're going to see in just a moment. So that's our format. Large data sets in IPLD can get spread out over multiple blocks, where each block is identified by its own CID, and embedded in the IPLD data for each block are links to other blocks, all going back to an individual root block.
We call this a Merkle DAG. The awesome thing about this is that while the root CID is just a hash of the initial root block, you can use it to incrementally verify the entire graph. Once you find and verify the first block, you have CIDs for the next blocks, and you can go find and verify them, and you can keep going until you know the whole graph is cryptographically the data you wanted. We've even written a query language for IPLD, called selectors.
There are a lot of benefits that emerge from these core data formats. Using CIDs means you never get data you didn't ask for; we've talked about that already. And since you can verify the integrity of data, it doesn't matter where you get it from, because you know it's the right data. IPLD formats allow us to break up large data sets into incrementally verifiable chunks, and that means we can get data faster, because we can ask lots of people for smaller parts of the whole. Plus, if we have a single chunk that appears twice in a graph, we only need to send it once. Finally, we are starting to see the building blocks here for trustless payments for data: if I can verify data incrementally, I can break large transfers into much smaller transactions, minimizing risk on both sides of a paid transfer.
It's like programming in the 90s, and you're trying to figure out how to implement HTTP, or early file sharing protocols like Napster, but, you know, do it better. So the first distributed protocol we ever implemented for IPFS was Bitswap, and Bitswap is a block exchange protocol. It doesn't actually understand anything about IPLD. What you can do is ask a peer for a CID and get a block back; it's kind of like BitTorrent.
In that sense, you ask peers for individual parts of a larger data set and then assemble them yourself. The IPLD knowledge here lives entirely outside of Bitswap; usually it's in the client who's requesting data. This has some really great advantages, and that's why Bitswap remains the core transfer protocol in IPFS today. There are some great things about working at the block level. First, since you're only asking for one block at once, it's really obvious how to break up requests among many peers.
You just ask different peers for different blocks, and since you're breaking up requests, it's easy to make them in parallel. Plus, because the person sending data in response to a request is only sending block data, they're only sending bytes: they don't have to understand the format those bytes are encoded in, and they don't need to know anything about IPLD. Plus, Bitswap is a really mature protocol. We've worked on it a while, the Go implementation is pretty efficient, and it does a whole lot.
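A sketch of why block granularity splits so naturally (this is not Bitswap's actual scheduling logic, which also weighs peer performance; it's just the basic idea of dealing a wantlist out across peers):

```python
from itertools import cycle

def split_wantlist(cids, peers):
    """Bitswap-style request planning sketch: because every request is
    for a single block, a wantlist can be dealt out round-robin across
    peers, and each peer's share then fetched in parallel."""
    assignment = {peer: [] for peer in peers}
    for cid, peer in zip(cids, cycle(peers)):
        assignment[peer].append(cid)
    return assignment

wants = ["cid1", "cid2", "cid3", "cid4", "cid5"]
plan = split_wantlist(wants, ["peerA", "peerB"])
assert plan == {"peerA": ["cid1", "cid3", "cid5"], "peerB": ["cid2", "cid4"]}
```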
There are some challenges, though, with Bitswap. Because you only ask for one block at a time, certain types of nested traversals, like fetching a subdirectory deep in a nested directory, produce lots of round trips to get the data, because you don't know what the CIDs are farther into the graph.
You have to ask for the first block, and then the second, and then the third, and then the fourth, all the way to your destination, and that's pretty inefficient. Also, with Bitswap, for people sending data, since you're only getting very small requests, it's pretty hard to optimize your disk I/O and your network I/O so you can send lots of data at once.
Finally, the block-level nature of Bitswap does tend to produce a lot of network traffic, similar to the traffic congestion problems that you see in BitTorrent. The other part is not a problem with Bitswap itself, but with the implementation of Bitswap we've written for IPFS: the Go version has kept growing and growing and growing, so it's way more than a protocol implementation.
It's almost like an entire data transfer stack, and it's so tightly integrated with go-ipfs that it's almost impossible to pull them apart. So it's not surprising that we ended up writing a second transport in the context of building an entirely new technology with Filecoin. The new transport is called GraphSync.
It's kind of our new hotness, and it powers all of the data transfer in Filecoin. GraphSync is a protocol for replicating entire IPLD graphs across peers. Rather than requesting a single block, in GraphSync we request a CID and a selector, and that allows the peer to stream us all the blocks that match that query. In essence, we're performing an entire query against a remote IPLD data set, and we can ask for and receive much more data at once.
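A toy version of the responder side might look like this. Real IPLD selectors are a much richer declarative language; here the "selector" is just a recursion depth, and blocks are plain lists of child CIDs, but it shows the key difference from block exchange: one request, a whole stream of matching blocks back.

```python
def select_blocks(store, links_of, root, depth):
    """GraphSync-style responder sketch: given a root CID and a selector
    (here just a recursion depth), walk the local store and stream every
    matching block, so the query is answered in a single round trip."""
    yield root, store[root]
    if depth > 0:
        for child in links_of(store[root]):
            yield from select_blocks(store, links_of, child, depth - 1)

# Toy store: each block is simply a list of child CIDs.
store = {"root": ["a", "b"], "a": ["c"], "b": [], "c": []}
links_of = lambda block: block

got = [cid for cid, _ in select_blocks(store, links_of, "root", 1)]
assert got == ["root", "a", "b"]  # depth 1 stops before "c"
got_all = [cid for cid, _ in select_blocks(store, links_of, "root", 2)]
assert got_all == ["root", "a", "c", "b"]
```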
We developed GraphSync because we knew that for storage and retrieval deals, we would require fast point-to-point transfers between clients and miners, similar to HTTP. And in GraphSync, because the person sending data processes much higher-level requests, they can read and marshal their data ahead of time and send it over the wire. So there are no extra round trips, and we should be able to max out the speed of the network pipe. And since we started from scratch, GraphSync is a lot more configurable than Bitswap.
We support custom authentication, sending side-channel information like payments, and distributing individual requests to different data stores on the receiving side. A lot of these things are easier to deal with when you're dealing with higher-level requests, and this configurability has enabled some of the really cool optimizations that we're starting to see in Filecoin, which you may have heard about, like the DAG store. It's allowing us to do I/O at much faster speeds, and we're starting to support really large storage clients like Estuary that are sending out tons of data at once. That kind of scalability is just really hard to do with Bitswap.
But GraphSync has its own challenges. First, because you're dealing with large requests, it's harder to break them up: it's not always clear how to break up a selector among peers. Second, we want to maintain our guarantees around incremental verifiability, and that turns out to be fairly complicated, so our implementation is complicated.
The other challenge with GraphSync is that you need a greater shared understanding on both sides of the network transport. Because the responder is essentially doing an IPLD selector query, they have to be able to execute it locally to get all the data, and this gets complicated because IPLD is pretty extensible and supports a number of customizations.
And finally, again, this isn't a problem with GraphSync, but go-bitswap does so much, and is so much more than a transport protocol, that when you start a new protocol you don't have all of that, and it's just missing.
So we've got some challenges, and probably our most obvious challenge, which may have been apparent up until now, is that we have two different stacks. We have a transport stack for IPFS and a transport stack for Filecoin, and it's not just Bitswap and GraphSync.
It's all the services that we built on top of them. In Bitswap, we have Bitswap sessions, which is sort of an optimization protocol for doing higher-level transfers, and then we have services like the block service and the DAG service that are baked on top of Bitswap and hardcoded into it. In Filecoin, we have GraphSync, and then we built other components on top of it to support Filecoin's needs: we have a control protocol called the data transfer protocol.
We have a Filecoin retrieval protocol and a storage protocol, and finally, we have support for payment channels, so that we do in fact support paid transfers in Filecoin. But the larger problem is that data retrieval is about so much more than transport protocols when you want to build interoperable solutions. The first step is content routing, which is essentially finding your content. On the regular web, you ask users to know website names, and then you use DNS to find those websites. On the distributed web, we just start with a content identifier that could be hosted anywhere, and we need to track down who has that content and how they're making it available.
We've developed, again, different solutions here. With IPFS we have the DHT, or distributed hash table, which enables us to find any content on IPFS, even though it's sometimes not as fast as we'd like it to be. We also have the Bitswap want protocol, which speeds things up among local peers, because you can ask them whether they have a block without them actually sending it to you. On the Filecoin side, it's really still early days. Will Scott gave a talk earlier today about the indexing solutions we're building to help people find content on Filecoin, but a lot of it is still in development.
Once you've found content, it's actually not an immediate step to making a transfer. We need to plan the best way to get it from potentially multiple sources, and there are a lot of factors we need to think about.
We want to get it fast, and we probably want to get it free or at low cost, but that might mean we're distributing the requests over many peers, or switching between protocols, or making trade-offs, maybe paying to get things faster. Plus, we have to recover from failures if peers turn out not to have content. Concretely, almost none of this exists for Filecoin content today; it's largely manual. We do have a lot of planning in IPFS, in Bitswap: that's what the Bitswap sessions largely do, and it's really great.
It just only works for Bitswap. Finally, we have what I've already been talking about, which is transport: how do you get things from one place to another? Here we have really great protocols, but there's one thing missing, and that's accessing all of the web 2 data. Right now, to use our protocols, you have to run an entire libp2p stack. What if we could bring some benefits, like content identifiers and incremental verifiability, to data that's available only on the traditional web?
So I want to talk to you briefly about how we're going to figure this out going forward. We know we want people to be able to download IPLD content from IPFS, Filecoin, and maybe even HTTP, as fast as possible, without having to think about where it's coming from or what protocols it's sent by. Wow, that's a tall order when you say it all together. Well, the good news is we've begun work on a prototype client that can retrieve data from IPFS and Filecoin, automatically choosing the best way to get it.
So that's really good news. So how are we going to do this? The first thing we're going to do is really cleanly separate these stages of data retrieval I've been talking about. All of these steps are asynchronous and distinct: we want to separate finding content, planning requests, and executing transfers. Now, it's not entirely that simple, because feedback goes both ways in a large transfer request.
But we still want to be able to swap in different solutions at each step. We're thinking about content routing now as a problem of finding content from multiple sources. Some of them are fast but have very little content, like a local data store or local peers, and some of them are very slow, like the IPFS DHT, but have tons of content.
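The fast-but-small versus slow-but-complete trade-off can be sketched like this (the router names and lookup functions are made up for illustration; a real client would race the sources asynchronously rather than try them one by one):

```python
def find_providers(cid, routers):
    """Multi-source content routing sketch: consult sources in order
    from fastest/smallest (local peers) to slowest/largest (a DHT), and
    stop at the first one that knows providers for the CID."""
    for name, lookup in routers:
        providers = lookup(cid)
        if providers:
            return name, providers
    return None, []

routers = [
    # Fast source that only knows about a little content.
    ("local-peers", lambda cid: {"cidX": ["peer1"]}.get(cid, [])),
    # Slow source that knows about (nearly) everything.
    ("dht",         lambda cid: ["peer7", "peer9"]),
]
assert find_providers("cidX", routers) == ("local-peers", ["peer1"])
assert find_providers("cidY", routers) == ("dht", ["peer7", "peer9"])
```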
We actually already do this, without thinking about it, in our IPFS stack, but we know there are going to be more sources over time, so we need to build a framework for fetching from multiple sources. And, honestly, we know other people might build better solutions for finding content, so soon you're going to see delegated content routing coming to IPFS, and in our new retrieval stack we want to allow people to plug in their own solutions at each level. The fastest way to get to an awesome retrieval client that downloads fast from anywhere is to provide a framework in which people can discover and implement best-in-class solutions for individual parts of the problem.
Solving planning is not obvious. The planning stage is going to be a complicated one: it's not easy to mix protocols and peers and payments and always deliver the optimal solution. We have some thoughts, but they're just preliminary. For Bitswap and GraphSync, we think we can mix them and get the best of both worlds. GraphSync helps us understand whole graphs and get them in a single round trip, but Bitswap is good at splitting requests. So what if we used GraphSync to get a description of the graph in terms of CIDs, and Bitswap to actually fetch it? This is just one of the proposals for how we could get the best strengths of each protocol, and we're just going to have to evolve solutions over time.
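The hybrid proposal above can be sketched in two phases (purely illustrative, not a shipped design; `describe_graph` is a hypothetical stand-in for a GraphSync-style metadata query):

```python
def plan_hybrid_fetch(describe_graph, root, peers):
    """Hybrid plan sketch: use a GraphSync-style query to learn the
    graph's shape as a flat list of CIDs in one round trip, then deal
    those CIDs out across peers so a Bitswap-style layer can fetch the
    blocks in parallel."""
    cids = describe_graph(root)           # phase 1: single round trip
    plan = {peer: [] for peer in peers}   # phase 2: split the fetch
    for i, cid in enumerate(cids):
        plan[peers[i % len(peers)]].append(cid)
    return plan

# Hypothetical graph description: the root plus three children.
describe = lambda root: [root, "c1", "c2", "c3"]
plan = plan_hybrid_fetch(describe, "root", ["peerA", "peerB"])
assert plan == {"peerA": ["root", "c2"], "peerB": ["c1", "c3"]}
```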
One other thing we are definitely considering is HTTP, and we've developed a proposal for incrementally verifiable transport over just HTTP 1.1. So when is all this coming? I don't know; 2022? This is not a product announcement, and this is not your fbm keynote, so we hope it's going to be very soon, but it'll be even faster if you help out. So, thank you very much, and beware of the Spiders from Mars. Thanks.