IPFS þing 2022 - Data and IPFS: Transfer, 9 Aug 2022


From YouTube: What's in a manifest? - @b5 - Data and IPFS: Transfer


A

Hi everybody. In the interest of time, I'm just going to rock and roll, and this is going to be very fast. I intentionally want to give a lightning-talk-level overview of a thing that we were calling a manifest. I think it's more of a concept, and there are a couple of names for it, but I think this is a pattern that we've seen. Thank you, Hannah, for identifying and naming a bunch of patterns at the top of this track. I thought it was a really helpful thing.

A

Did you use manifest? I can't remember what that was.

B

I know I mentioned Cleveland, which is like.

A

Yeah yeah totally totally.

A

This is just a technique that worked really well for us, and it was for data transfer and I just want to share it out and hopefully tee up some more conversations.

A

So for us, the definition we're using for manifest is: it's just an aggregation of related datum identifiers. For us in the IPFS world, think array of CIDs, right? In the common case (if I'd finished that slide) it's basically an array of the CIDs in a DAG, but not necessarily. It's basically any aggregation of identifiers for the data that you want to communicate about. It's not the data itself.

A

Other names for this: indexes, aggregations. Oh yeah, you also have to be very... we'll get to that in a second. Manifests are useful in the sense that you can communicate about a thing before you do all of the data transfer for that thing, and so it can put you in a position of knowledge faster. I do want to be really careful about this, because we often deal in merkleized data. It's really important to acknowledge that manifests are gossipy.

A

When someone sends you a manifest of something and says, hey, here's a list of CIDs that are somehow related, you do not have any reason to trust that. You have to do, at bare minimum, Merkle verification, and you need to be very careful about any metadata based on a manifest. You can't trust that either, and so it's really important to keep that in mind. Whenever we're talking about these aggregations, you don't get to just de facto be like, great, we're good.
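
A minimal Go sketch of the "bare minimum" check described above: hash each received block and compare it with the identifier the manifest claimed. To stay within the standard library, the identifier here is a plain hex SHA-256 digest standing in for a CID, which is an assumption of this sketch; a real CID also carries a version, codec, and multihash prefix, and would be checked with a CID library. The names `digestID` and `verifyBlock` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digestID is a stand-in for a CID (assumption): the hex SHA-256 of the
// block bytes. Real CIDs wrap the digest in version/codec/multihash framing.
func digestID(block []byte) string {
	sum := sha256.Sum256(block)
	return hex.EncodeToString(sum[:])
}

// verifyBlock reports whether a received block actually hashes to the
// identifier that the (untrusted) manifest listed for it.
func verifyBlock(claimedID string, block []byte) bool {
	return digestID(block) == claimedID
}

func main() {
	block := []byte("some block bytes")
	claimed := digestID(block) // what an honest manifest would list

	fmt.Println("honest block verifies:", verifyBlock(claimed, block))
	fmt.Println("tampered block verifies:", verifyBlock(claimed, []byte("tampered")))
}
```

Note that this only covers the blocks themselves. Anything else carried alongside a manifest (sizes, labels, links) stays unverified until the corresponding blocks have been fetched and checked.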

A

You have to really think carefully about the fact that, if we're thinking in trustless environments, these are gossipy things. So, giant warning: this is a kind of tax you have to pay when you use these things, and it's really important to think about that in terms of cost overhead. We used manifests in this thing called dsync, which is basically, think rsync for DAGs. It's a point-to-point data transfer protocol, it uses manifests, and we used it in Qri.

A

An open source dataset tool for git-style pushing and pulling. So when you're thinking about what we actually used it for, this was it: we had a command, qri push, and all of that was powered by dsync. It was very successful inside of Qri. We went from relying on Bitswap and having users like.

A

It was really good for this high-churn scenario where git push and qri push were both offline-ish commands, and so the initial version of Qri was like: connect to the IPFS network, start up Bitswap, and try to get the thing to move by Bitswapping.

A

So that was hydrate peers and do all these things, and we got it down to something that looks a lot more like git and matched users' expectations. At the end of our time with Qri, that was powering all of our data movement, period. So just to get a sense of what it was at the time, this was our definition of a manifest. Some.

B

Things that are kind of.

A

Interesting about this: we are actually using strings, which are CID strings, mainly because this struct is intended to go over the wire, and so we were trying to get this as compact as possible.

A

In the first iteration that we used, we also included links. The way that this worked is the nodes array was an array of CIDs, and then the links were tuples of integers that describe links between entries in the nodes array. The CIDs are relatively long, and you don't want to list them a lot, and so it was really nice to be able to say, hey.

A

We can actually send you not just a flat set of blocks, but a description that you can then turn into a DAG on your side. Again, a gossipy DAG, but it gives you a really nice property: we're only listing each CID one time, you can de-duplicate that list, and then you can list all of the links between those things and express that in a very compact form.
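
The nodes-plus-links layout described above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the struct shown on the slide: the type and field names (`Manifest`, `Nodes`, `Links`, `edge`, `buildManifest`) and the placeholder CID strings are hypothetical. The point is that each CID string appears once in `Nodes`, and every link is a pair of small integer indexes into that slice.

```go
package main

import "fmt"

// Manifest is a hypothetical shape: every CID string is listed exactly once,
// and links refer to CIDs by their position in Nodes.
type Manifest struct {
	Nodes []string // de-duplicated CID strings
	Links [][2]int // (from, to) index pairs into Nodes
}

// edge is a parent -> child relationship expressed with full CID strings.
type edge struct{ From, To string }

// buildManifest de-duplicates CIDs and rewrites edges as index pairs.
func buildManifest(root string, edges []edge) Manifest {
	m := Manifest{}
	index := map[string]int{}
	add := func(cid string) int {
		if i, ok := index[cid]; ok {
			return i
		}
		i := len(m.Nodes)
		index[cid] = i
		m.Nodes = append(m.Nodes, cid)
		return i
	}
	add(root) // the root always takes index 0
	for _, e := range edges {
		from := add(e.From)
		to := add(e.To)
		m.Links = append(m.Links, [2]int{from, to})
	}
	return m
}

func main() {
	// cidB is linked from two parents but is listed only once.
	m := buildManifest("cidRoot", []edge{
		{"cidRoot", "cidA"},
		{"cidRoot", "cidB"},
		{"cidA", "cidB"},
	})
	fmt.Println(m.Nodes)
	fmt.Println(m.Links)
}
```

A receiver can rebuild the DAG shape from `Links` alone, but that shape is still only a claim until the blocks themselves have been verified.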

A

For us this was actually quite useful because you could then build more gossipy stuff on top of this, but again, we'll get to why this worked. You can then start labeling path roots, which could be subDAG indications inside of that, and for us that was just a map. Inside of Qri we had components of datasets, and so often the shape of Qri DAGs was.

A

It was a UnixFS directory, and then one file would be called the data, and that thing would often be gigs and gigs and gigs. So you would want to label that: that's where the real data starts. And you could basically reply with, you know, I don't actually want that. Again, this is gossipy, but it really helped all of that be kind of clean and succinct.

A

We ended up including another gossipy thing, which was just the sizes of all those CIDs: again, an array of sizes based on that nodes array, and so it mapped one to one. The advantage of that was for us doing data transfer. As an example, Qri Cloud would not accept datasets larger than 250 megabytes, and so we could actually just deny that right there on the gossip side of things, on the assumption that it was valid.
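
A minimal Go sketch of the early size rejection described above, assuming a hypothetical `Info` type that carries the extra gossip (labelled paths and per-node sizes) alongside the node list. The names, and the reading of "250 megabytes" as decimal megabytes, are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Info is a hypothetical container for manifest metadata. All of it is
// unverified gossip until the blocks arrive.
type Info struct {
	Nodes []string       // CID strings
	Sizes []uint64       // Sizes[i] is the claimed size of Nodes[i]
	Paths map[string]int // label -> index into Nodes, e.g. where the bulk data starts
}

// maxDatasetBytes interprets the 250 megabyte limit as decimal MB (assumption).
const maxDatasetBytes uint64 = 250 * 1000 * 1000

var (
	errMalformed = errors.New("sizes do not map one-to-one onto nodes")
	errTooLarge  = errors.New("claimed size exceeds limit")
)

// checkClaimedSize rejects a transfer up front based on claimed sizes alone.
// Comparing against the remaining budget avoids uint64 overflow on hostile input.
func checkClaimedSize(info Info, limit uint64) error {
	if len(info.Sizes) != len(info.Nodes) {
		return errMalformed
	}
	var total uint64
	for _, s := range info.Sizes {
		if s > limit-total {
			return errTooLarge
		}
		total += s
	}
	return nil
}

func main() {
	info := Info{
		Nodes: []string{"cidRoot", "cidData"},
		Sizes: []uint64{512, 300 * 1000 * 1000},
		Paths: map[string]int{"data": 1},
	}
	fmt.Println(checkClaimedSize(info, maxDatasetBytes))
}
```

Because the sizes are only claims, a check like this can refuse work cheaply but cannot replace enforcing the limit again while blocks actually stream in.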

A

Here's a quick overview of how this would work in a fetching scenario. We've got Basit and Jonathan. Basit is trying to fetch CID A from Jonathan. So the first thing they do is send, hey, I want this, and they give the root CID that they want. Again, you should note here that we were only interested in whole-DAG syncing.

A

We had no subDAG support; it was very, very straightforward. What Jonathan would respond with is a manifest info of CID A, and so you get back: hey, these are all the CIDs involved in that DAG, these are the sizes of all those nodes, and we populate that as required. The thing that Basit would send back to Jonathan is actually a manifest, so not a manifest info, but a manifest of just the diff of what they wanted. So there would be a local comparison against their blockstore.

A

Saying, oh, I already have these, so I don't need any of those; please only send me these. And then Jonathan constructs... this is kind of why we immediately switched to CAR files when they came out (this protocol predates the existence of CAR files), because the CAR file includes an array of CIDs as its header, and so this diff response was actually just.

A

We jammed that into the front as the thing to construct the CAR file from, and started streaming blocks as they came from the blockstore, in the order that the CIDs came back. Because it's just a dynamic instruction, it worked really well as a mechanism for putting the CAR file together.
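
The diff step in the fetch flow above can be sketched in Go as a filter over the remote node list. This is a minimal illustration, not dsync's code: the local blockstore is modelled as a plain set of CID strings, and the function name `diff` is hypothetical. Order is preserved, since the sender streams blocks in the order the requested CIDs come back.

```go
package main

import "fmt"

// diff returns the CIDs from the remote manifest that are missing locally,
// in the remote's order and without duplicates. The `have` set stands in
// for a local blockstore lookup (assumption).
func diff(remote []string, have map[string]bool) []string {
	missing := make([]string, 0, len(remote))
	seen := make(map[string]bool, len(remote))
	for _, cid := range remote {
		if have[cid] || seen[cid] {
			continue
		}
		seen[cid] = true
		missing = append(missing, cid)
	}
	return missing
}

func main() {
	remote := []string{"cidRoot", "cidA", "cidB", "cidC"}
	local := map[string]bool{"cidA": true, "cidC": true}

	// Only these CIDs are sent back as the diff manifest.
	fmt.Println(diff(remote, local))
}
```

The requester sends only this reduced list back, so the sender never streams a block the requester already holds.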

A

The push is actually one step shorter. You just send a manifest info and repeat the process in reverse: you send back a diff, and then you're getting a CAR file of diff blocks streamed in response. Worked well for us, pretty consistent, also very simple.

A

The reason why we think this is simple, and the reason that we went with this, is it puts an information-poor peer in a position of gossip knowledge to optimize planning, right? You get this manifest, and now you're in a much better situation to plan out how you are going to acquire that data. For us that meant a diff response of what we actually needed. While that is gossipy, it just summarizes the whole conversation, and you're not asking for any blocks.

A

You don't have in this protocol at all. Or any blocks you don't need, pardon me. That's a duplicate slide. Things that were tough about this: constructing manifests atop the datastore interface was actually quite expensive, and so (we never got to this) we had plans to create the manifest on ingest. To give you a sense of timing on a 10 gigabyte... this was backed by, like, Kubo. Well, it was go-ipfs; I can use go-ipfs in the historical sense.

A

go-ipfs, like 0.9. A 10 gig file would take on the order of two minutes to construct the manifest, because you actually had to open and touch every block, and it was a lot. It was actually quite a painful part of the user experience: when you would go to push, the construction of that manifest was quite expensive.

A

The other thing to point out here is that the data transfer is actually asymmetric, right? If we go back to this: when I want to pull or fetch, we say, hey, give me this, and then there's a very light data transfer, and then Jonathan is sending a lot of data in response to what Basit said, which can be considered an attack vector. You can send a lot of these, like, hey, I want to fetch this, I want to fetch this, and make the server do a lot of work.

A

So we actually put this behind auth. All of this was attributed to, hey, the users know each other before they do any of this rsync-like, dsync-like process. Personally, I would actually recommend that for this thing, unless you have some sort of connection to who is requesting, where you can associate that with peer IDs, but that was important for us. A couple of opportunities I want to point out here, again.

A

This is really just a high-level overview of a concept, because I think manifests are just a really great utility, a great tool: they're super simple, they're easy to get your head around, and they kind of cut to the point.

A

Speaking of points, there's no reason that this needed to be point to point. I think that there's a really awesome opportunity where, if you ended up with a gossipy manifest, you then cut that up across peers whom you believe have that information and start distributing your requests. And then you get to doing really interesting stuff: I have a better latency connection with this peer, so I'm maybe splitting the list 60/40 with the other peer I'm requesting from. Another pattern.
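
The 60/40 idea above can be sketched in Go as a weighted split of the manifest's CID list across peers. This is a hypothetical illustration of the opportunity the speaker describes, not something dsync implemented; the names `share` and `splitByWeight`, the contiguous-slice strategy, and the assumption of non-negative weights and distinct peer names are all choices made for this sketch.

```go
package main

import "fmt"

// share pairs a peer with a relative weight, e.g. derived from latency.
// Weights are assumed non-negative and peer names distinct.
type share struct {
	Peer   string
	Weight int
}

// splitByWeight hands each peer a contiguous slice of the CID list that is
// proportional to its weight. The last peer absorbs any rounding remainder.
func splitByWeight(cids []string, shares []share) map[string][]string {
	out := map[string][]string{}
	total := 0
	for _, s := range shares {
		total += s.Weight
	}
	if total <= 0 {
		return out
	}
	start := 0
	for i, s := range shares {
		end := start + len(cids)*s.Weight/total
		if i == len(shares)-1 {
			end = len(cids)
		}
		out[s.Peer] = cids[start:end]
		start = end
	}
	return out
}

func main() {
	cids := []string{"c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"}
	plan := splitByWeight(cids, []share{{"fastPeer", 60}, {"slowPeer", 40}})
	fmt.Println(len(plan["fastPeer"]), len(plan["slowPeer"]))
}
```

Since the manifest is gossip, a plan like this still depends on some other mechanism for knowing which peers are likely to hold the blocks.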

A

It's also super compatible with the CAR spec. I think a lot of times, if we do the work right, a manifest is just the same thing as an index in a CAR file, and so there's a really nice thing there, where maybe we're just including those anyways, which is dope.

A

If I could do it again, or if I had to do it today, I would express a manifest this way: I would have moved the links into the super gossipy section and just had the manifest, which is also gossipy, just be an array of CIDs, because I think it's less controversial. And that's basically how we ended up using it in the end: the links field was optional and it didn't get used on the responses, so have it not be there. I think that's it. Short, sweet.

A

Anybody have questions? Anything?

C

Talk more about the auth, because one of the things is that there are a bunch of things that are highly asymmetrical, and you could also choose to, like, maybe stream less. Like, you can change the asymmetry to match the ratio that we get from a drastic.

A

Yeah, I mean, yes. We did put this behind auth, and this was pretty high-trust communication, right? We had a pretty good sense of the peers. The other thing to point out here, as Stephen was mentioning last night, as a point of contrast: in this, you have a really good idea of who has the data, or where you're trying to send the data, by some other means, right? And so the content routing side of this is like.

A

There's already this presumption that, ah, that peer is the one I'm going to ask for that CID. So not only do you have what you just mentioned, that you can play games with what you do with that information, but you also have to be in a better position, with better knowledge of who to ask, or at least that problem needs to be solved somehow.

B

How big were the manifests?

A

It's a great question. I did the math on this. Shoot, I don't want to misquote, but with a CBOR encoding you could get, I think it was, lots of gigs, if not terabytes, into a one megabyte block. It was good enough to me that I was like, I don't even think we need to worry about the block limit for a while. Well.
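
A rough back-of-the-envelope version of that math in Go, under explicit assumptions that do not come from the talk: binary CIDv1 identifiers with a SHA-256 digest (36 bytes: version, codec, two bytes of multihash header, 32-byte digest), a 2-byte CBOR byte-string header per entry, a 1 MiB block limit, and 256 KiB of data per listed block. It does not try to reproduce the speaker's own figure; the amount of data covered scales directly with how large each listed block is.

```go
package main

import "fmt"

const (
	blockLimit   = 1 << 20   // 1 MiB manifest block (assumption)
	cidBytes     = 36        // binary CIDv1 with SHA-256: 1 + 1 + 2 + 32 bytes
	cborOverhead = 2         // CBOR byte-string header for a 36-byte payload
	chunkSize    = 256 << 10 // data covered per listed block (assumption)
)

// cidsPerBlock estimates how many entries fit under a byte limit, ignoring
// the few bytes of array framing around the list.
func cidsPerBlock(limit, perEntry int) int {
	return limit / perEntry
}

func main() {
	n := cidsPerBlock(blockLimit, cidBytes+cborOverhead)
	fmt.Printf("%d CIDs per manifest block\n", n)
	fmt.Printf("about %.1f GiB of data at %d KiB per block\n",
		float64(n)*chunkSize/(1<<30), chunkSize>>10)
}
```

CID strings, as used in the struct described earlier, are longer than the binary form, so a string-based manifest would fit fewer entries per block than this estimate.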

B

I mean, that's really cool, because one thing you could do is adapt: you could actually even ask multiple people for the manifest in an untrusted.

B

Yeah, and that would be like, you could run it after Bitswap does its first discovery and then use that, and then also you have the higher trust if you.

A

Have multiple people, which is basically like the naming system of like 16 verification. Yeah, yeah, cool. I want to get out of it, so let me keep moving, but thank you, everybody.