From YouTube: GraphSync with Hannah Howard
Description
Join us for Filecoin Liftoff Week, an action-packed series of talks, workshops, and panels curated by the web3 community to celebrate the Filecoin mainnet launch and chart the network’s future. https://liftoff.filecoin.io/
Events take place all week, October 19-23, 2020. #FilecoinLiftoff
For more information on Filecoin
- visit the project website: https://filecoin.io/
- or follow Filecoin on Twitter: https://twitter.com/Filecoin
Get Filecoin community news and announcements in your inbox, monthly: http://eepurl.com/gbfn1n
Welcome to this talk. Let me introduce who I am — my name is... hold on, let me just make sure everything's working. Okay, cool.
So my name is Hannah Howard. I'm the primary author of the GraphSync protocol. You can find me on the internet at @techgirlwonder, I use she/her pronouns, and I work for an organization called Carbon Five. Carbon Five is a product development agency; we've been involved in the development of Filecoin since almost the very beginning, and we're one of the long-time partners of Protocol Labs. I've been working on IPFS, GraphSync, and Filecoin for over a couple of years.
Now, this talk is going to be a little bit improv. I put it together sort of last minute, so bear with me — we're going to do our best. Let's dive in. Cool, okay. Let's talk about GraphSync. GraphSync is a protocol to synchronize Merkle DAGs across peers, which probably means very little to anyone who doesn't know the deep tech underneath it. So let's find a different definition for GraphSync.
I would say GraphSync is the thing that gets your data from one place to another in Filecoin. Everyone understands that, so I'm going to go with that as my working definition, and then I'll talk to you about how it actually works and where that other definition comes from. Cool. So let's talk about the Filecoin environment, and how data transfer works, and how we need to move data around in Filecoin. The Filecoin environment has some pretty interesting characteristics.
One is that essentially no one — I'm calling them peers, but no one involved in Filecoin — trusts each other. Anyone who can participate in the Filecoin network is potentially a malicious actor; for the most part, peers are untrusted. That means that when I receive a request for data, unless we have some other mechanism for verifying it, we don't trust that the person making the request is actually acting in good faith. They may be trying to request lots of data just for the purpose of slowing my computer down, or basically causing me to run out of memory. Also, if I am requesting data, I can't trust what I get back by default. I have no way of knowing whether the data I'm getting back is the right data unless I have some way of verifying it mathematically; if I have to rely on trust, I cannot trust it.

So our goal is that we want to be able to do these data transfers in a way that minimizes the effects of malicious behavior. A key aspect of that is the following.
As we transfer data from one place to another, we need to be able to verify that data, and we need to be able to verify it incrementally. We want to know that we're getting the right data, and we want to be able to know it at every step of the process. So if we're transferring gigabytes of data, we need to be able to know, at one gigabyte, that we've already gotten the correct first gigabyte, and then the correct second, and the correct third — so that we don't get to the end only to find out it was all just garbage.

The other thing is that we would like this transfer, despite these properties, to be fast — ideally as fast as a transfer over traditional HTTP, or faster in some cases. In the case of Filecoin, we're usually just moving data from one peer to another, so it's essentially pretty similar to an HTTP setup. One thing that's going to be key for that is this:
We would like to not gate the transfer on round trips between the requester and the responder. We want the responder to be able to send data without waiting for the client to acknowledge that they've verified the correct data. So what is it that we're transferring here? Basically, the key to doing incremental data transfer is to transfer it with a data structure we call a Merkle DAG.

A Merkle DAG is a really common data structure in all of this sort of protocol software; if you've used IPFS, you have probably worked with Merkle DAGs. So what is a Merkle DAG? Let's say I have a giant chunk of data. Arguably, I could take that entire data, hash it, and then basically say "send me all the data," and verify that the hash of the data I get matches the hash you told me the data would have. That's a simple way of doing it.
But if we want to break up lots and lots of data, a Merkle DAG is essentially a series of blocks of data, each of which has a hash, where the root block has links to additional blocks — and those links are the hashes of those blocks. So you can imagine, in this diagram, you have a root block, and it has some sort of hash. I'm writing these as "Qm..." hashes, because that's what hashes often look like in IPFS and Filecoin.

We refer to these as CIDs, or content identifiers; they are essentially a cryptographic hash plus some additional data. And in this root block, you can see that somewhere inside the data for the block, we have the hashes of the blocks that it links to. So here you have a link to this block, and here you have a link to this block; they each have their hash, and then they have links to other blocks, and so on.
This is essentially a way of breaking up a giant piece of data into smaller bits of data, where at any given level we can verify that we have the right data, and we know how to get the next bit of the right data. So that's a Merkle DAG.
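To make that concrete, here is a minimal toy sketch of the idea in Go — an illustration, not actual IPFS code: a block's identifier is just the hash of its bytes, and a parent embeds its children's hashes inside its own hashed bytes, so verifying blocks top-down pins the whole graph.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashOf stands in for a CID: the identifier of a block is derived
// purely from the block's bytes.
func hashOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	child := []byte("some chunk of the file")
	childHash := hashOf(child)

	// The parent's bytes embed the child's hash, so the parent's own hash
	// also commits to exactly which child bytes are acceptable.
	parent := []byte("header|link:" + childHash)
	fmt.Println("root hash:", hashOf(parent))

	// A receiver verifies any block by rehashing it and comparing
	// against the link that pointed to it.
	fmt.Println("child verifies:", hashOf(child) == childHash)
}
```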
What does the actual inside of these blocks look like? Technically, it could look like anything, but the actual structure we use for Filecoin is something called IPLD, which is essentially a distributed data format. There's a talk, I think on the main track later today, where you can learn a lot more about IPLD, but essentially it is a format for distributed data. The short form: you can imagine that what it looks like inside a single block is a series of nodes, where each node is a type of data. You might have a map node, which is analogous to a map, or a hash table, or just a JavaScript object.
And a link node is essentially a link to another block of data. So the short form I use for describing IPLD is "distributed JSON," which is a super-simplification of what it actually is, because one of IPLD's explicit goals is to not be tied to a serialization format. It could be serialized as JSON; it could be serialized as something like CBOR, which is the Concise Binary Object Representation; we also use protobufs for it. But it's very easy for me, as a human programmer, to think about it as JSON with links in it. You can imagine this is a JSON representation of what I showed you in the previous diagram: it basically looks like JSON, but we have this one special type — the link type — where we actually have a link to another block. And with that addition, we can essentially build a distributed data structure.
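For a flavor of the "JSON with links" idea: in the DAG-JSON encoding, a link is an object whose single key is "/", holding the CID string. The Go structs below are an illustrative stand-in with made-up names, not the go-ipld-prime API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Link mirrors how DAG-JSON serializes a link: {"/": "<cid>"}.
type Link struct {
	CID string `json:"/"`
}

// DirEntry is a hypothetical node type: plain JSON fields plus a link
// pointing at another block in the DAG.
type DirEntry struct {
	Name  string `json:"name"`
	Child Link   `json:"child"`
}

func main() {
	n := DirEntry{Name: "photos", Child: Link{CID: "QmExampleChildCid"}}
	out, _ := json.Marshal(n)
	fmt.Println(string(out)) // {"name":"photos","child":{"/":"QmExampleChildCid"}}
}
```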
So essentially our challenge is: how do we replicate a Merkle DAG of IPLD data over the network? You can imagine that if we were doing traditional HTTP, we could probably throw out all these data structures, throw out the need to verify incrementally, and rely on essentially a central way of verifying authenticity. As long as we can verify that the SSL certificate for cnn.com is valid, and we trust CNN for whatever reason, we can trust that the data they're sending is correct; we can simply send all the data across and then display it. We probably don't even verify it.

And often, when you're downloading files from the internet, unless it's a security-conscious site, they won't even give you a cryptographic hash of the data you're downloading so that you can verify it. You just download a program and open it, which seems extremely risky — but in any case, that's essentially what you're doing with HTTP.
So, over the course of the development of IPFS, we developed ways to transfer this kind of data in a trustless environment, and the original solution we had is this protocol called Bitswap. The basic way Bitswap works is that you request a single block from another peer. You request one block, starting at the root: you say, "I want the block with this cryptographic hash," and then, when somebody sends you back that block, you can run a hash against it and verify that the hashes match. You now know that that block is valid, and therefore the hashes to other blocks inside that block are the hashes to the remaining data.
So you can imagine we would make a request for the root block, and then, once we've done that, we could traverse down to one of the links in the block and request that block. We get the data back, we can verify it again, and then once we get that block we'll know the links to the next child blocks — and we can essentially proceed that way, block by block, all the way until we've got the entire graph of data.
Now, the challenge with this is that in order for me to request any blocks after the root, I have to get the first block; and then, in order to request any blocks after that, I have to get the next block. So you have a whole series of round trips, where I can't request parts of the data until I have some of the initial bits of data. That is a potentially big slowdown.
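Here is a rough plain-Go sketch of why those round trips pile up. Block and Fetcher are hypothetical stand-ins, not the Bitswap API: each GetBlock is one network round trip, and the child CIDs only become known after the parent arrives.

```go
package main

import "fmt"

// Hypothetical stand-ins for illustration; this is not the Bitswap API.
type Block struct {
	Data  []byte
	Links []string // child CIDs, discoverable only after fetching this block
}

type Fetcher interface {
	GetBlock(cid string) (Block, error) // imagine: one network round trip
}

// fetchGraph walks the DAG block by block: we cannot ask for a child
// until its parent has arrived and been verified, so a deep, narrow DAG
// costs roughly one round trip per block.
func fetchGraph(f Fetcher, root string) ([]Block, error) {
	queue := []string{root}
	var out []Block
	for len(queue) > 0 {
		c := queue[0]
		queue = queue[1:]
		blk, err := f.GetBlock(c)
		if err != nil {
			return nil, err
		}
		out = append(out, blk)
		queue = append(queue, blk.Links...)
	}
	return out, nil
}

// mapFetcher fakes a remote peer with an in-memory map.
type mapFetcher map[string]Block

func (m mapFetcher) GetBlock(c string) (Block, error) {
	b, ok := m[c]
	if !ok {
		return Block{}, fmt.Errorf("missing block %s", c)
	}
	return b, nil
}

func main() {
	remote := mapFetcher{
		"root": {Links: []string{"a", "b"}},
		"a":    {Data: []byte("first chunk")},
		"b":    {Data: []byte("second chunk")},
	}
	blocks, err := fetchGraph(remote, "root")
	fmt.Println(len(blocks), "blocks fetched, err:", err)
}
```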
Now, we have some ways of speeding that up, because Bitswap is very parallelized. Once we get the first block, we can request the next blocks from two different peers at the same time; we can do all this stuff to massively parallelize it. And if we have a data structure that's like a tree structure for our DAG, this actually works relatively well — because as long as the tree is wide and not too deep, we're going to be able to utilize the parallelized requests really well and get that data back quickly.
Now, the downside is if we have a structure that's very deep and very narrow. The most extreme example of this is something like a blockchain, where each block only points to one parent block; to get them, you would essentially have to do n round trips for n blocks, which is very much not ideal. So what we need is a way to express what we want from a block and its child blocks that doesn't require us to know the hash of every single block in the tree.
I want to be able to tell the person I'm requesting data from: I want this block, and I want all the children underneath it, and all their children's children, and all their children's children's children. Or I might say: I want to request this block and only its first child. Or: I want the child link that has this name, and then, from that, the child link that has this name, and this name, and this name. You can imagine, if a Merkle DAG represented a directory tree, we might request it in such a way that each link was essentially a subdirectory, all the way down to a single node.
So we have a language for expressing this kind of query, and it's called an IPLD selector. This is sort of just a diagram — there's a whole series of different selectors you can do in IPLD — but the short version is that it's a query language for Merkle DAGs. It's a way of expressing a query against a Merkle DAG that, based on the DAG, will return a predictable result.
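As a sketch of what building one of these looks like in code — assuming a recent go-ipld-prime, whose import paths have moved between versions — the selector builder composes "explore" clauses. This one says "this node and everything under it, down to depth 10":

```go
package main

import (
	"fmt"

	"github.com/ipld/go-ipld-prime/node/basicnode"
	"github.com/ipld/go-ipld-prime/traversal/selector"
	"github.com/ipld/go-ipld-prime/traversal/selector/builder"
)

func main() {
	ssb := builder.NewSelectorSpecBuilder(basicnode.Prototype.Any)

	// Recurse up to 10 levels deep; at every level explore all fields,
	// and mark each link as a point where recursion continues.
	spec := ssb.ExploreRecursive(
		selector.RecursionLimitDepth(10),
		ssb.ExploreAll(ssb.ExploreRecursiveEdge()),
	)

	// spec.Node() is the IPLD node that gets CBOR-encoded into a
	// GraphSync request.
	fmt.Println("selector kind:", spec.Node().Kind())
}
```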
So, given all that, let's revisit our statement: GraphSync is a protocol to synchronize graphs across peers. What GraphSync allows me to do is send a root hash to another peer for a graph that I'm interested in, to send with it a selector which expresses all or part of the subgraph below that root that I am interested in, and then to have the other peer send me all that data back. I think the easiest way to think of GraphSync is that it is essentially an IPLD selector traversal. It's like running a query — if we think of the Merkle DAG as its own database, we're running a query against a Merkle database, but that database lives on another peer; it's a query against a networked data store. So I always say that GraphSync is an IPLD selector traversal backed by a network store.
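In go-graphsync terms, issuing that traversal looks roughly like the sketch below. This is hedged: signatures and import paths vary across versions, and constructing the GraphExchange itself (libp2p network, link system) is omitted.

```go
package main

import (
	"context"
	"fmt"

	"github.com/ipfs/go-graphsync"
	"github.com/ipld/go-ipld-prime/datamodel"
	"github.com/libp2p/go-libp2p/core/peer"
)

// fetch sends a GraphSync request (root + selector) to peer p and drains
// the streaming results; each ResponseProgress is one node visited by the
// locally verified traversal.
func fetch(ctx context.Context, gs graphsync.GraphExchange, p peer.ID, root datamodel.Link, sel datamodel.Node) error {
	progress, errs := gs.Request(ctx, p, root, sel)
	for range progress {
		// blocks arrive, are verified against the selector, and stored
	}
	for err := range errs {
		return fmt.Errorf("graphsync request failed: %w", err)
	}
	return nil
}

func main() { /* wiring of the libp2p host and GraphExchange omitted */ }
```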
So that was a lot. Just a couple of other constraints to know about GraphSync. GraphSync was originally developed for IPFS and then moved into Filecoin, and it has actually seen most of its development in Filecoin; but because it is also intended to work in IPFS, we have to work with some additional constraints. In addition to peers not being trusted and requesters not being trusted, any peer on the network is both a requester and a responder. It's not a client-server environment: anyone can serve GraphSync requests, and anyone can make GraphSync requests. We may have to serve multiple requests at once, so we're trying to build an implementation that optimizes for handling multiple requests. And — oh sorry, I apologize, I'm trying to think if there's anything else — basically, it is a generalized protocol for moving this stuff around, so it's intended to operate in all kinds of environments. So we have all the concerns that would come from IPFS — DDoS protection, making sure that we balance requests between peers, all those types of things. Those are all implementation-level concerns, as opposed to protocol-level concerns, but at least the Go implementation takes all of these things into account. Cool, wow.
That was a mouthful. So I want to look at the GraphSync protocol first, then look at a little bit of the architecture of GraphSync, and then look at a little bit of the Filecoin-specific questions. But before that, I just want to make sure — I don't know how I get to... I don't know if we have any questions; for some reason I can't see the Q&A. Oh no, there is no Q&A for this! Oh wait. So this is all 100% totally comprehensible to all of you? I love that, because you're all geniuses — I don't understand any of it anyway. So let's go in and look at the GraphSync protocol. That's going to be kind of the end of my slide deck, so I'm going to just be showing you some bits from code and other sort of online documents going forward.
Here we go — let's see, let's look at this. Okay, I'm going to skip all these and just get to the protocol. GraphSync is currently a protobuf-based protocol, which is probably going to change sometime in the near future — we're trying to move things to CBOR — but for the moment it's a protobuf. Essentially, it is a libp2p protocol: two peers connect over libp2p, then connect on the GraphSync protocol and set up a stream, and they send messages back and forth that look like this. (A message contains — oh, a question about selectors; okay, cool.)
So a GraphSync message contains these fields: it contains a set of requests, it contains a set of responses, and it contains a set of blocks. Importantly, each message can contain multiple requests, multiple responses, and blocks that are tied to one or more responses. Actually, this "full complete request list" makes me anxious, because it's no longer super well used — I'm just looking at this and seeing things where I'm like, oh wow, we should probably remove that. In any case, you have requests, responses, and data.
A request consists of: a request ID; a root, which is a CID; a selector, which is an IPLD selector encoded as CBOR bytes; and, in the context of GraphSync, requests have a priority. Then there are two special types of requests: a cancel request, which means what you might think, and an update request, which is a way of expressing essentially some kind of update to an existing request.

Importantly, both the request and the response have a set of extensions that may or may not be present. This is really important, because GraphSync is designed to be an extensible protocol, and we use this extensively in Filecoin: it's a way to layer side protocols into the GraphSync request process, and it allows you to do a lot of really cool stuff. You can think of these as being like your TLS extensions, or maybe even like cookies. You can do authentication with this; in retrieval, we exchange payments through these extensions. It's pretty cool.

A response is really simple: a response has the request ID of what it's responding to, and it has a status.
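Paraphrased as Go shapes — approximate, illustrative field names based on the fields described above, not the generated protobuf code — the message looks something like this:

```go
package main

// Approximate Go shapes for the wire message; field names are illustrative.
type GraphsyncMessage struct {
	Requests  []GraphsyncRequest
	Responses []GraphsyncResponse
	Blocks    []GraphsyncBlock
}

type GraphsyncRequest struct {
	ID         int32
	Root       []byte            // a CID
	Selector   []byte            // IPLD selector, encoded as CBOR bytes
	Priority   int32
	Cancel     bool              // special request type: cancel an in-flight request
	Update     bool              // special request type: update an existing request
	Extensions map[string][]byte // named side-channel data (auth, payments, ...)
}

type GraphsyncResponse struct {
	RequestID  int32
	Status     int32             // e.g. in progress / complete / errored
	Extensions map[string][]byte // includes the mandatory response metadata
}

type GraphsyncBlock struct {
	Prefix []byte // CID prefix: everything needed to rebuild the CID except the digest
	Data   []byte
}

func main() {}
```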
Now, there is one extension in the response which is mandatory — sorry, that's a side effect of the process of building it. And then you essentially have a series of blocks. Blocks are essentially a CID prefix and then the data. You verify these blocks by taking the data and hashing it, so that you only get the CID of these blocks from the data itself. You don't want to transmit that independently, or you would need to check it to verify that the blocks are in fact what they say they are. The prefix just allows us to construct the correct CID from the block data. And that's basically the whole of it.
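Receiver-side, that check can be sketched with go-cid, assuming its Prefix helpers: rebuild the CID from the transmitted prefix plus the raw bytes, and compare against the CID a verified parent linked to.

```go
package main

import cid "github.com/ipfs/go-cid"

// verifyBlock recomputes a block's CID from the transmitted prefix (version,
// codec, hash function — everything but the digest) and the block bytes,
// then checks it against the CID we expected from a parent link.
func verifyBlock(prefixBytes, data []byte, expected cid.Cid) (bool, error) {
	prefix, err := cid.PrefixFromBytes(prefixBytes)
	if err != nil {
		return false, err
	}
	got, err := prefix.Sum(data) // hash the data with the prefix's multihash
	if err != nil {
		return false, err
	}
	return got.Equals(expected), nil
}

func main() {}
```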
One other key element in the protocol — sorry about this, I'm going to go back here; seven minutes, wow, okay, real quick — is that must-have extension I mentioned, unfortunately: the response metadata, which will be encoded in the response. It essentially says, for this response to this request ID, here are the links that I have traversed; and with each link it will include one additional thing, which is essentially whether or not I had the block in this traversal. One scenario that can definitely happen is that I express a selector to a remote peer, and they start at the root, start performing the query, and find that they are missing one of the blocks in the process of the query. This is particularly an issue in the IPFS environment; in the Filecoin environment, generally you're the one storing the data that you want stored, so you tend to have all of it. But in any case, it's important, particularly for security reasons. So that is essentially the protocol.
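As a rough Go shape — illustrative field names, not the exact wire encoding — each metadata entry pairs a traversed link with whether the responder had the block:

```go
package main

// MetadataEntry is an illustrative shape for one entry of the mandatory
// response-metadata extension: the link the responder's traversal reached,
// and whether it actually had that block to send.
type MetadataEntry struct {
	Link         []byte // the CID of the traversed link
	BlockPresent bool   // false if the responder was missing this block
}

func main() {}
```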
Now, briefly about the architecture: I'm going to move over to the go-graphsync repository. Here we go. This is the go-graphsync repository; it has a docs folder, and it has an architecture document. Some of this architecture is a little out of date — I need to update it, which I will in the near future — but the basics are pretty accurate. The key thing to understand is that go-graphsync has a top-level interface, a requester implementation, a responder implementation, and a message layer for sending messages.
In order to do a round-trip request: the requester needs to encode and send the request to the responder. The responder needs to receive the request and perform an IPLD selector query based off of it. The responder needs to load blocks from local storage as it performs the selector query, and then it needs to encode and send the blocks traversed, plus metadata about the traversal, to the requester. Now here's the key step.
We also run a selector query on the incoming data to verify that it is the right data. We know the root on the requester side, so we can start with that root and perform a selector query against the data as it comes in, to verify that it is the correct data. And, importantly, we do not save data that does not match the selector query.
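Conceptually — hypothetical types here, not go-graphsync's internal structure — the requester-side rule looks like this: a block is stored only if the local traversal was actually expecting it, and a block's children only become expected once their parent verifies.

```go
package main

// incomingBlock is a hypothetical shape for one verified-by-hash block
// arriving off the wire.
type incomingBlock struct {
	cid   string
	data  []byte
	links []string // links parsed from the already hash-verified data
}

// verifyStream stores only the blocks the local traversal expected,
// growing the expected set as verified parents reveal their children.
func verifyStream(root string, incoming <-chan incomingBlock, store func(incomingBlock)) {
	expected := map[string]bool{root: true}
	for blk := range incoming {
		if !expected[blk.cid] {
			continue // not part of our traversal: drop it, never store it
		}
		store(blk)
		delete(expected, blk.cid)
		for _, l := range blk.links {
			expected[l] = true
		}
	}
}

func main() {}
```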
Once we do verify it, we need to store those blocks and then return the nodes to the caller. So that's essentially the overall sequence of events — and again, I keep putting in this commentary about how I think of a GraphSync request as an IPLD selector query that happens locally, but happens to be backed by a data store that's on the network.
The architecture document gives you the step-by-step of what's going on; I'm probably not going to go too deep into it here. Importantly, we want to make sure that as we're processing multiple incoming requests, and also making outgoing requests, we have a mechanism to balance our workloads so that we don't get overwhelmed. And — oh my goodness, it's 12:27, so I'm going to wrap this up so that we can get a couple of questions in before the end. I just want to talk about one other feature before we do a Q&A, which is this:
The go-graphsync implementation has a number of additional hooks that you can use to hook in with extensions, so that you can perform additional computation on incoming requests. For example — sorry, this is the wrong thing — we have an incoming request hook, which allows you to look at an incoming request and accept or reject it based on additional information in the extensions. We have an outgoing block hook, which allows you to pause requests as blocks go out.
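For instance, registering an incoming request hook looks roughly like this — following the older []byte-extension API go-graphsync had around this time (signatures have changed in later versions), and with an extension name made up for illustration:

```go
package main

import (
	"errors"

	"github.com/ipfs/go-graphsync"
	"github.com/libp2p/go-libp2p/core/peer"
)

// registerAuth attaches a hook that rejects requests lacking a hypothetical
// auth-token extension; gs is an already-constructed GraphExchange.
func registerAuth(gs graphsync.GraphExchange) {
	gs.RegisterIncomingRequestHook(func(p peer.ID, request graphsync.RequestData, hookActions graphsync.IncomingRequestHookActions) {
		token, ok := request.Extension("example/auth-token")
		if !ok || len(token) == 0 {
			hookActions.TerminateWithError(errors.New("missing auth token"))
			return
		}
		hookActions.ValidateRequest() // accept: the responder will serve it
	})
}

func main() { /* construction of the GraphExchange omitted */ }
```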
We have essentially hooks for everything. So with all that, you can build lots and lots of protocols on top of GraphSync, and that's what we do for data transfer in both storage and retrieval in Filecoin. Anyway — I have two minutes, so I'm going to go with Dean's question: "I have a question about selectors and implementation-level details." Yes, go for it — unmute yourself, I'm going to stop screen sharing.
Audience member: Okay. Basically, I wanted to get a better understanding of what types of selectors we have, or can use, and how we can then use those to create multi-peer queries — so we can start using GraphSync to query multiple peers for data.
Sure, yeah. So, to be clear, we currently do not — we generally do not — use the go-graphsync implementation for multi-peer queries at the moment, because in Filecoin we're primarily transferring data from one person to another. However, that is an intended goal of the software and its facilities. The sorts of selectors we have right now: we have field selectors, and we have array selectors — meaning "I want these elements of an array," or "I want these named fields of a map." And we have recursive selectors, which are really key, which essentially say: I want you to perform this selection, and then I want you to keep doing it again until a condition is met. That is used for stuff like doing a whole-DAG traversal, meaning: at each level, I want you to explore all the fields and then traverse all the fields, and then do it again, until you get to a certain depth.
Now, there are certain selectors we have that we might use for this. We could use a field index to do some really primitive request splitting — like, I want you to give me the first field and all the children under it, and you to give me the second field and all the children under it. That's not going to be very good for balancing; to balance, we're probably going to have to do some updating of requests in the middle, because we don't really know the shape of the graph ahead of time.
The other question that's a little complicated is figuring out how to split up a request — whether we want to split it at the selector level or the block level. Because GraphSync works with blocks, we could easily build an extension that says: count blocks by five, and only send me the ones where the block count mods to one; then tell the next peer to send me the blocks where the count mods to two, and another where it mods to three. They essentially all get the same selector query, but they each send back different blocks, and then you verify it. Obviously that has its own problems, because if one peer is not sending you the right data, that breaks it for everyone. There are a number of different things that we could do.
We could change the query and say: hey, listen, I don't need you to send me this part of it — those types of things. That stuff is all a little bit of a frontier; it's probably one of the next main areas for development in GraphSync.