From YouTube: State of ADLs
Description
Recorded at DataSystems September 2021 Colo. Join us in #retrieval-market on filecoin slack to ask questions or build an open data transfer stack with us!
A: So this talk is going to be about IPLD ADLs. ADL stands for Advanced Data Layout, which sounds pretty complex, but it's actually pretty simple. At the core of go-ipld-prime you've got this Node interface — which actually just got moved, so I need to go find it — and this interface is basically the way you interact with any node from Go. At the center of it you've got the Kind method, which tells you "what kind am I?". So you can have a Node... come on... no, that's good. So what the codegen does is it generates — for example, here's a map from strings to something.
A: You have methods like Kind, and it tells you what kind it is. This is codegen, so it's just static: "I'm always an int." You've got other methods like LookupByString, and these methods are defined as "I am only valid when my kind is X" — so, for example, LookupByString is only valid for map kinds. So in this case, if you call it on something that's not a map — in this case, statically, it's always an int — it errors. So usually, if you're using ipld-prime nowadays with codegen, that's what it looks like. So, going back to the Node interface.
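The Kind / "only valid when my kind is X" contract described above can be sketched in plain Go. This is a drastically reduced stand-in, not go-ipld-prime's real `ipld.Node` (which has many more kinds and methods); the type and method names here are illustrative only.

```go
package main

import "fmt"

// Kind enumerates data model kinds (abridged; the real ipld.Kind also
// has Bytes, Link, List, Null, and more).
type Kind int

const (
	KindInt Kind = iota
	KindString
	KindMap
)

// Node is a drastically simplified sketch of an IPLD node interface.
// LookupByString is documented as "only valid when my kind is map".
type Node interface {
	Kind() Kind
	LookupByString(key string) (Node, error)
}

// intNode is what codegen would produce for a schema int: Kind is static.
type intNode struct{ v int64 }

func (intNode) Kind() Kind { return KindInt }
func (intNode) LookupByString(string) (Node, error) {
	// Statically always an int, so this method always errors.
	return nil, fmt.Errorf("cannot LookupByString on kind int")
}

// mapNode is what codegen would produce for a {String:Int} map.
type mapNode struct{ m map[string]Node }

func (mapNode) Kind() Kind { return KindMap }
func (n mapNode) LookupByString(key string) (Node, error) {
	v, ok := n.m[key]
	if !ok {
		return nil, fmt.Errorf("key not found: %q", key)
	}
	return v, nil
}

func main() {
	m := mapNode{m: map[string]Node{"answer": intNode{42}}}
	v, err := m.LookupByString("answer")
	fmt.Println(v.Kind() == KindInt, err == nil)

	_, err = intNode{7}.LookupByString("nope")
	fmt.Println(err != nil)
}
```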
A: That's all fine for what you would call data model nodes, because they're pretty basic: what you see is what you get. What ADLs do is allow you to build essentially arbitrary behavior into this interface. So you could, for example, have a node that implements this interface with behavior that doesn't directly map to the data you have in memory, or the data you have encoded in JSON, for example. And I thought I would show what that means with...
A: ...the Go IPLD HAMT implementation. A HAMT is sort of like a hashmap that's designed for content addressability — or at least it works decently well in that world. Essentially you can think of it as a list of buckets that forms a tree, and then more buckets, and so on. That's essentially what a HAMT is defined as. But in ipld-prime, when you want to use this, you essentially end up with... let's look at the types, for example.
A: If you want to actually use a HAMT directly, you would have to manually manipulate the buckets and do the rebalancing and everything. You want a library that abstracts all of that away, so that when I set the key "foo" to the value "bar", it just automatically deals with all the bucketing and all that stuff — and this is what the ADL layer does for you. So if I go into node... I think... this is the node.
A: This is the node that actually implements the Node interface, but with a layer of "I'm going to deal with all of this for you". Kind is always map — so this is similar to before — but when we go into LookupByString, this is where it gets interesting. LookupByString doesn't just use the data model stuff it has: it hashes the key, and then it looks into the buckets to decide what to do.
A: Oh, that's actually pretty cool. So yeah, it's essentially going to go down into the right node and into the right bucket, and it's just going to fetch the value that you want. Am I in a bucket? Look in the bucket. Am I going down one level? Then go down one level.
A: And then there's a bunch of TODOs, because I didn't have time to finish this, but you can get an idea of why it's useful to implement this interface for things that are not as static, or not as directly mapped to the data model, as you might think. That's pretty much it. Does anybody have any questions about what an ADL is, or why it's useful?
A: Yeah, so that's possibly the biggest can of worms I've got left to do, because in ipld-prime nodes are immutable, right? So if you want to set, modify, or delete a value in an existing map, what you could do right now is copy all the values from the old map into a new map — copy all the data and generate a new map from scratch. That's going to be really slow.
A: The better way would be to essentially batch a bunch of operations — adds, deletes, sets, and so on — and then apply them efficiently, only copying the data model nodes that you need to update. So, for example, if you only need to update one bucket, that node would update, and then the CIDs all the way up to the root would update, but everything else would be the same: same CIDs, same nodes, same blocks.
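The "only the touched bucket and its ancestors change" idea can be illustrated with a tiny persistent tree: updating one leaf path-copies the nodes on the way to it and shares every other subtree with the old root, which is why, in a content-addressed store, only those copied nodes get new CIDs. A minimal sketch with invented names:

```go
package main

import "fmt"

// tree is an immutable binary tree; "updating" it path-copies only the
// nodes on the way to the changed leaf, sharing every untouched subtree
// with the old root. This mirrors efficiently patching a HAMT: one bucket
// changes, so only that node and its ancestors change; every other block
// (and its CID) is reused as-is.
type tree struct {
	val         string
	left, right *tree
}

// withLeft returns a new root whose left subtree is replaced; the right
// subtree pointer is shared, not copied.
func (t *tree) withLeft(newLeft *tree) *tree {
	return &tree{val: t.val, left: newLeft, right: t.right}
}

func main() {
	old := &tree{
		val:   "root",
		left:  &tree{val: "a"},
		right: &tree{val: "b"},
	}
	// Patch the left leaf: copy the path root→left, share everything else.
	updated := old.withLeft(&tree{val: "a2"})

	fmt.Println(old.left.val, updated.left.val) // old value still intact
	fmt.Println(old.right == updated.right)     // true: subtree is shared
	fmt.Println(old != updated)                 // true: new root node
}
```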
C: [inaudible question]
A: Right now, yes. With Eric we talked about — well, actually, if you're implementing an ADL, you usually want to say "I want to implement a map ADL", and then you only need to implement a few methods; all the others you don't care about. You only need to implement Kind — which is going to be statically map — and LookupByString.
C: [inaudible question]

A: Yeah, but I think in general ADLs are backed by IPLD data model data, if that makes sense. For example, with a HAMT, the actual data you've got in the data model is a bunch of buckets and trees and stuff, but then you've got the ADL layer that shows it to you as a map, and that has the extra logic.
A: Yeah, so I added this, essentially, and then you can set it when you start — but you could probably set it later too — and then it just remembers it. So when it tries to cross a boundary, for example, and it doesn't have a block, it will then use the links.
E: [inaudible question]

A: Yeah, it's also true that it's not just about getting by string — that's just as far as I got — but you also have LookupByNode, because your map could be keyed by anything. You also have iteration, and iteration might be a little bit harder to plug in. Well, I guess you would provide the implementation of this, maybe.
D: There's also — I mean, if you do a generated type with the codegen, it often generates other convenience methods that are not part of the Node interface, but that are useful because they give you less generic types, if you want to use them. Yep, yeah — there's an iterator, for example; it generates an iterator that will give you...
A: It's also worth pointing out — something that Hannah mentioned the other day about the codegen here — that if you look at my implementation of the HAMT, where I actually use the codegen'd hashmap root, I cheat all the time, because I'm way too lazy. So, for example, the bucket size: I just fetch it through the unexported fields.
[inaudible exchange between C and A]
D: I don't know what it is, though. Okay — no, no, it's cool. As long as I have a web browser, I can do this. Yeah, I'm just showing.
D: Okay, yeah, so this is an interesting implementation of an ADL. Basically, UnixFSv1 is really weird, in the sense that UnixFSv1 is encoded with protobuf nodes using the dag-pb specification. dag-pb is a very early IPLD codec that does not support the whole data model. It's, you know — it's definitely...
D: It was just a long, long time ago that it came into being, and one of the things that's interesting is the structure of dag-pb.
D: The way dag-pb works is that it's basically always got two fields: there's a Data field and a Links field. Links is an ordered list, but then each element in the Links field has a name for the link. So the structure is a list, but each element of the list has a name, and that name is actually probably more useful than the index in the list. And so, yeah.
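The two-field shape just described can be sketched as plain Go structs. The field names follow the dag-pb spec (Data, Links, with Name/Hash/Tsize per link), but this is an illustrative stand-in, not go-codec-dagpb's actual types — Hash is a CID in reality, shown as a string here. lookupByName shows the map-style access an ADL layers on top of the ordered list.

```go
package main

import "fmt"

// PBLink is one entry of the dag-pb Links list: it carries a Name,
// which is usually more useful than the entry's index.
type PBLink struct {
	Name  string
	Hash  string // stand-in for a CID
	Tsize uint64
}

// PBNode always has exactly two fields: Data and Links.
type PBNode struct {
	Data  []byte
	Links []PBLink // ordered list, but usually addressed by Name
}

// lookupByName is the map-like access a path selector expects:
// find the link whose Name matches.
func (n PBNode) lookupByName(name string) (PBLink, bool) {
	for _, l := range n.Links {
		if l.Name == name {
			return l, true
		}
	}
	return PBLink{}, false
}

func main() {
	dir := PBNode{Links: []PBLink{
		{Name: "readme.md", Hash: "bafy...1"},
		{Name: "src", Hash: "bafy...2"},
	}}
	l, ok := dir.lookupByName("src")
	fmt.Println(ok, l.Hash)
	_, ok = dir.lookupByName("missing")
	fmt.Println(ok)
}
```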
A: [inaudible]

D: That's how everywhere they look things up: by map key. So the way this actually got started is we had this interesting problem. We were trying to integrate go-ipld-prime into go-ipfs, where we had done all this work to decode protobufs in go-ipld-prime. Dan wrote go-codec-dagpb, which does the reading — sorry, do you go by the nickname Daniel, Dan, or just Daniel?
D: Daniel — sorry about that. Daniel wrote go-codec-dagpb, which deserializes dag-pb into ipld-prime nodes, but it deserializes them the way you would expect the go-ipld-prime data model to work, which is that the Links field is a list, right? And so we were trying to figure out — we wanted to be able to run selectors on dag-pb nodes, and the selectors were almost always going to be path selectors that expected named keys.
D: So the original way this got started is we were trying to write a version of a merkledag node — a dag-pb node — that you could call LookupByString on and iterate like a map. The very first rough version was PathPBNode, and literally all this thing is — it basically has underneath it — here's PathPBNode, and underneath it, it has a so-called...
D: ...substrate, which is the underlying node that we're using. And if you look at this, the main methods are LookupByString and then LookupByNode, and then everything else just defers to the substrate, which itself just errors, you know. So that makes it easy. And then we did write an iterator — a UnixFS iterator — and then I'm trying to think what other methods...
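The substrate-plus-overrides pattern described here — defer almost everything to the underlying node, override just the map-flavored methods — maps naturally onto Go struct embedding. A minimal sketch with invented names and a toy Node interface (the real PathPBNode works against the full ipld.Node interface):

```go
package main

import "fmt"

// Node is a toy stand-in for ipld.Node, just big enough for the sketch.
type Node interface {
	Kind() string
	LookupByString(key string) (Node, error)
}

// listSubstrate stands in for the codegen'd dag-pb node, whose Links
// field is a list: LookupByString on it just errors.
type listSubstrate struct{ names []string }

func (listSubstrate) Kind() string { return "list" }
func (listSubstrate) LookupByString(string) (Node, error) {
	return nil, fmt.Errorf("lists have no string keys")
}

type leaf struct{ name string }

func (leaf) Kind() string { return "string" }
func (leaf) LookupByString(string) (Node, error) {
	return nil, fmt.Errorf("not a map")
}

// pathPBNode embeds the substrate so every un-overridden method falls
// through to it, then overrides just the map-flavored behavior on top.
type pathPBNode struct {
	listSubstrate
}

func (pathPBNode) Kind() string { return "map" } // presented as a map
func (n pathPBNode) LookupByString(key string) (Node, error) {
	for _, name := range n.names {
		if name == key {
			return leaf{name}, nil
		}
	}
	return nil, fmt.Errorf("no link named %q", key)
}

func main() {
	n := pathPBNode{listSubstrate{names: []string{"a", "b"}}}
	fmt.Println(n.Kind()) // the wrapper's answer, not the substrate's
	_, err := n.LookupByString("b")
	fmt.Println(err == nil)
}
```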
D: So you can see that here, again, we're just deferring to the substrate element, and then we added these — these are like those weird generated codegen methods, like an Iterator which is not a map iterator or a list iterator, but a nicer supertype iterator that allows you to do stuff. So yeah, this is what we built, and then it kind of went from there, and we were like, okay, well, we did that, and then we were like...
D: I think that was the very first implementation. We were just trying to get go-path to work with selectors, and it was working until we got to the point of hitting a UnixFS sharded HAMT directory, which is a thing that you need to be able to support. UnixFS has two directory models.
D: One is a regular map directory, where the merkledag — the dag-pb — is represented as the Data, and then the Links is just the list of the files and/or directories in the directory; it's essentially the ls of the directory. The other version is that the directory list is maintained as a HAMT that is embedded in the Links field of the dag-pb — that's a UnixFS sharded directory. And so at that point we were like, okay.
D: Well, now we need to support this, and that required finding out — the very first version was like, oh, we'll just support LookupByString, and then we were like, wait a second: if we need to support this other version, we need to actually look at each UnixFS file and/or directory to determine exactly what type it is. The way UnixFS basically works is that the Data member of the dag-pb structure encodes, in itself, a nested protobuf structure which describes what the UnixFS thing is.
D: So that's how we ended up writing this — like, okay, well, I guess we'd better write some UnixFS data stuff. Where is that stuff? I don't even remember... iteration utilities, HAMT directory, right. I think data is where we first — this is just the thing to — what does this thing do? I can't remember what this does. Oh my god — oh, it looks generated. Let's look at the codegen; the code is the easiest way to figure out what this schema is. I'm pretty sure this schema was — yeah.
D: This was written — right, right. So this is — we are building an IPLD node to deserialize the UnixFS protobuf into an IPLD node, so we can look at methods on it. And so then we have UnixFSData, which has these fields, and then I don't remember how we actually deserialized it. There must...
D: There must be some code to unmarshal and marshal it. Anyway — sorry — right, and then how is that unmarshalled? Consume... decode UnixFSData — oh, okay, oh right: there's basically some protobuf-consuming code to read it. I actually took this from your protobuf method, and I was just like, okay, well, this is how Dan's doing it and I'll just do the same, but with all the fields that I am aware of. So yeah, it's actually re-consuming the protobuf, but then building IPLD out of it.
D: So yeah, that's reading that, and then we have Node, and then we have a HAMT — the sharded HAMT implementation. One fun fact is that in the world of the protocol stack we have two different concepts of a HAMT. One is a general-purpose HAMT, which is the thing that Daniel had implemented, and then the other is this thing in UnixFSv1.
D: It's the sharded directory, which is a totally different implementation — I mean, it's still a HAMT, but it's a different implementation, different code, all that stuff — and it is specifically for directories. So yeah, we have this implementation here, and I believe the big main function here is getting LookupByString to work, which is pretty funky in here, if I remember correctly. Let's see: we have lookup by segment... where's LookupByString... LookupByNode, with a lookup attempt.
D: Oh my god, right — there was a lot of stuff going on. We have this internal lookup function, and you can see it's actually quite a complicated function, and then it's trying to load the next node, because, you know...
F: [inaudible]

D: ...it's a multi-node structure, and eventually, I believe, it produces something at the end. This is largely taken from the existing UnixFS code and moved over. But it's not that bad. Sorry — it's no problem.
D: One thing that is obviously important here is that this has to have a link system in it, right? So that's a tricky little element. And then one thing that got sort of thrown into the mix here, as a result of doing this — we ended up doing this for all of the... the cool thing is that it works for basically all of the read functions, for the most part. Oh, sorry, no: we did a UnixFS read for directories and sharded directories. We haven't done an ADL for files themselves. It's not clear exactly what that ADL should look like. Obviously one version would be to just have the bytes field return all of the bytes in the file, but that may not be that useful; we may want it to actually be able to do byte ranges and stuff.
D: What's it called... IPLD's link system itself. You can set on the LinkSystem the so-called NodeReifier, and what this will do is, at the very end of the node loading process — the point where it gets the regular IPLD node...
D: ...you can specify a reifier function. Reifying — I'm not exactly sure why that's the word we're using, but it is how — reifying means going from the original to the ADL. And in this case, what we're doing — this is a node reifying function, and this runs on every node that gets loaded by the IPLD link system. What we do is we're like: okay, do we have a dag-pb node?
D: If not, just return it as-is. Do we have a Data node? In which case, I believe what it will do is — sorry, if there's no Data field, then it will assume that it is not a UnixFS node, and so it will treat it... I believe the default reifier here is just to return this path node. So we're saying...
D: ...these are nodes we'd prefer to treat as PathPBNodes, meaning we can look up the links by string, just because that's useful. And then, right — and then what is this thing? I can't remember. So then we look at: okay, it has the Data, so we try to decode it as UnixFS; if that doesn't work, we just go back to returning a PathPBNode, and then...
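The decision flow just described — not dag-pb? return as-is; dag-pb without decodable UnixFS data? fall back to a path node; otherwise build the UnixFS ADL — can be sketched as a small reifier-style function. All types and names here are invented stand-ins, not go-ipld-prime's actual NodeReifier signature, and tryDecodeUnixFS is a placeholder for the real protobuf decoding.

```go
package main

import "fmt"

// Node is a toy stand-in for ipld.Node; the real hook is the NodeReifier
// on go-ipld-prime's LinkSystem, called at the end of loading each block.
type Node interface{ Kind() string }

type dagPBNode struct{ hasData bool }

func (dagPBNode) Kind() string { return "dag-pb" }

type otherNode struct{}

func (otherNode) Kind() string { return "other" }

// pathPBNode is the fallback: link names become map keys.
type pathPBNode struct{ under dagPBNode }

func (pathPBNode) Kind() string { return "path-pb" }

// unixFSDir is the UnixFS directory ADL.
type unixFSDir struct{ under dagPBNode }

func (unixFSDir) Kind() string { return "unixfs-dir" }

// reify mirrors the decision flow the talk walks through (simplified).
func reify(n Node) Node {
	pb, ok := n.(dagPBNode)
	if !ok {
		return n // not dag-pb: hand it back untouched
	}
	if dir, ok := tryDecodeUnixFS(pb); ok {
		return dir // decodable UnixFS data: build the ADL
	}
	return pathPBNode{under: pb} // fallback: still get string lookups
}

// tryDecodeUnixFS is a placeholder for decoding the nested protobuf
// inside the Data field.
func tryDecodeUnixFS(pb dagPBNode) (Node, bool) {
	if !pb.hasData {
		return nil, false
	}
	return unixFSDir{under: pb}, true
}

func main() {
	fmt.Println(reify(otherNode{}).Kind())
	fmt.Println(reify(dagPBNode{hasData: false}).Kind())
	fmt.Println(reify(dagPBNode{hasData: true}).Kind())
}
```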
D: So essentially, when you load — if you use this whenever you're using your selector traversal — when you load a UnixFS node, or a merkledag node, it will either convert it to a path node, in which case you can use LookupByString on it, or, if it's a UnixFS node, it'll convert it to either a regular directory or a sharded directory. So that was the thing we ended up doing.
D: Yeah — but it is still useful for compatibility, and also, you know, it was kind of cool that we did this and, all of a sudden, in like a week, we ended up implementing a lot of the read side of UnixFSv1 in ipld-prime, which is unique. And we even implemented it with — what is it, the UnixFS 1.5 edition? I think there are some additional bytes that you get in UnixFS 1.5 that don't exist in regular UnixFSv1, and we added that to this.
D: So if we were to finish this, one of the things that's interesting — an open development question, I think, in ipld-prime — is that we didn't attempt to implement any of the writing or modification side of this, because the builder interfaces are tricky for how you're going to do all that. And then the other idea is: will there be this selector traversal with modification? There's an implementation of that, sort of, but it's really not built out. So it's an interesting question.
A: I haven't thought about it enough, but I think we could do some things to make it easier for maps and lists, and I think, on the other side, we need to figure out some interfaces for update and bulk operations. Because, for example, if you want to do some bulk modify operations on a HAMT or a UnixFS thing or an AMT or whatever else, I think the interface should be somewhat intuitive and common, because otherwise we're just reinventing the wheel in every single ADL. And once we have those two things, I think that would be tenable.
B: I guess the other thing that I wanted to bring up is ADL integration into the rest of the world. So right now we've got this on a LinkSystem: we can put one ADL as the reifier, or we can define one reify function to try and say, great, I've got nodes in this LinkSystem context — I can have some heuristic set of things to try and apply ADLs in some cases.
B: It is plausible — there continues to be thought about where and when you want ADL invocations — and one of the places that has been talked about, that I think generally people haven't been too unhappy about, is that, long term, it might make sense to put ADLs on selectors.
B: So in a selector, you might be able to convey: at some point in this selector, I have now noted that I would like to interpret this node with this ADL. So, for instance, I'm traversing through some large dag, but in this place I need to now interpret it as a HAMT, because I want, you know, this logical map key out of it, but I'm not wanting to convey the concrete path — the concrete jumps through the actual implementation of that — or I might not even know them, in some cases.
A
B
B
I'll
show
you
this
like
a
quick
factory
ids,
but
once
the
id
points,
the
type
information
on
schedule,
the
selector
that
you
get
whatever
and
the
second
half
of
the
cid
points,
the
actual
data.
But
this
gets
complicated
when
you've
never
solved.
B: [inaudible]

A: ...a multicodec, either — yeah, you wouldn't have seen it personally. Or you could have some description of the ADL and how it works, potentially — if it wasn't a program — or some other information as far as how to actually access it. But again, this stuff starts getting into a lot of open-ended territory, and just, you know...
B: You know, basically, this is probably another one that should be similar, right? It's the part you don't know — it's not part of this selector. This is basically: you don't know that part of the selector, right? Okay, I would like to be able to actually negotiate this: really, if I give you a selector, you can execute something; then I would like you to be able to just tell me, "here, I know something — so continue on until you know something," then.