From YouTube: 🖧 IPLD weekly Sync 🙌🏽 2020-11-23
Description
A weekly meeting to sync up on all IPLD (https://ipld.io) related topics. It's open for everyone and recorded. https://github.com/ipld/team-mgmt

A
I actually can't recall what I've worked on last week. It must have been more, but I can only recall that my biggest success is this: finally the PRs merged, so libp2p is now using the latest rust-multihash, and Forest, the Rust Filecoin implementation, is using the latest rust-multihash and rust-cid. And the good news is, when I talked to the people from Forest, they couldn't really measure a performance difference, but they could at least measure that it isn't worse.
A
The quote was "strictly better performance", so it's certainly not worse. It's definitely better, though probably not measurably so, but still: no regression there. That's good to know, and the libp2p folks would probably also have said something if there were a regression. So that's good. Yeah, that's pretty much it. Next on my list is Daniel.
B
So I basically finished the multicodec library the week before last, but this is just an FYI that I plan on tagging the first post-rewrite release tomorrow. So if anybody has any last feedback, please speak now. I did polish up the names a little bit last Monday, so now the names are a little more idiomatic Go, and the code generator is also smaller, which is nice.
B
So I spoke to Rod about exactly what the differences between the two schemas are, and how important it is to support the old one versus the new one, because supporting the old one is useful today, since we've got a lot of data in it. The actual difference is only in the element type, because the new one is a kinded union and the old one is keyed. I think I got that right. So Michael actually came up with an idea to just make a "double union", as he called it.
B
There's an agenda item about this later, so we can take a look at that. I also spoke to Will about potentially using this instead of the Filecoin hamt-ipld in his stuff, because right now I believe he uses that with a go-ipld-prime node wrapper, which is read-only, for doing things like selectors and stuff. So I think I'm pretty close, because I've got most of the stuff.
B
I still need to handle reading existing nodes; I'm just missing that, because the prototype is there and a bunch of stuff is there. I think I'm just missing that and maybe some other bits, like caching of link loading, but most things should be there. I think I also gave a talk about Go's build cache, which went pretty well; there's a video up.
B
If anybody wants to watch that. And I'm giving a talk tomorrow about init-time tracing in Go 1.16, which is pretty niche, but if anybody has super large Go binaries that take a long time to start: that's all the init functions doing work, and with 1.16 you can actually measure how long they take per package, which is really nice. And the last thing is something that's been going on for a bunch of months.
B
I've brought it up once or twice before, I think, but we're kind of redesigning the encoding/json package into a version two. It's not public yet because we don't know if it's going to be a version two; maybe the experiment is going to fail horribly (it probably will). But I can share some details later if anybody's interested, and that's it for me.
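For reference, Go 1.16's init tracing is enabled with the GODEBUG=inittrace=1 environment variable: the runtime emits one line per package init to stderr with its duration and allocations. A minimal sketch of the kind of package-level work it surfaces (the program and the sample output line are illustrative, not from the talk):

```go
// Run with: GODEBUG=inittrace=1 ./mybinary
// The runtime prints lines roughly like:
//   init main @0.11 ms, 12 ms clock, 48000128 bytes, 13 allocs
package main

import "fmt"

// Package-level initialization like this runs before main and is what
// makes large binaries slow to start; inittrace attributes its cost.
var table = buildTable()

func buildTable() map[int]int {
	m := make(map[int]int, 1<<20)
	for i := 0; i < 1<<20; i++ {
		m[i] = i * i
	}
	return m
}

func main() { fmt.Println(len(table)) }
```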
A
Thanks, and this actually reminded me what I did last week: I was speaking at a virtual meetup, and I spent almost all week preparing the talk. That was it. There will be a recording; there's no recording now, but it will be published sometime. Yeah, so next on my list is Rod.
C
A similar week, Volker, it's been. I spent a lot of the week actually on slides for Michael's chunking tree presentation, which I think most of you probably haven't seen. I've spent some time trying to understand the algorithm and then figuring out how to explain it in a way that would have helped me if I were just looking at slides to understand it. So I think the slides are pretty good at getting to the core of the algorithm, and they're probably worth sharing at some point.
C
There's still just a little bit of the security case, like the collision case, that needs to be sorted out. That's really a fundamental problem that needs to have a solid solution before we can trumpet this thing loud and wide, and it continues to bother me, and I know Michael thinks he's got solutions for that.
C
But we'll talk about that one on one, I think, Michael. That actually took up a couple of days. I've also been messing around with rivers' S3 data, Filecoin data.
C
There's lots of blockstore stuff happening here at the moment, lots of interesting things, but I'm hitting all the problems associated with traversing very large graphs. Anyway, I don't want to get stuck too much in this stuff, but I really would like to have easier access to this data. I also spent a little bit of time on clarifying terminology around the Filecoin genesis block, which is a bit of a mess.
C
But I think I've got a final form in a PR that I'll merge today in the specs repo. There's "a genesis", and then there's "a genesis" in Filecoin, and these two things need to be described in our docs, because one's different from the other. So I was trying to extrapolate from Lotus code.
C
Working out how people think and talk about this thing didn't go as well as I hoped, and it required some intervention from various people to clarify the issues around it. Anyway, I think the PR is good enough as it is to clarify that language and the situation, so that's all I'll say about that. That's really all the notable things for me this last week.
D
One of the interesting ones is that the natural base type GraphQL exposes as a scalar is called ID, and it is very natural for us to say our CIDs are that ID type. But since that is a built-in GraphQL type, you do not get to easily override anything about it, because it's built into the GraphQL library. For instance, that means you can't then define how to pretty-print CIDs in their string form while holding them in their byte form.
D
There's a couple of PRs in go-ipld-prime: one that was minor, which is merged, and another one that has the code generation look at the destination package it is generating code into, and skip generation of types if they're already defined somewhere in that destination package outside the auto-generated code. It's a way to let you take a type, define it yourself with some weird custom things, and still have codegen produce the other things that aren't there.
D
There's probably a way for codegen to expose some sort of methods that you can put in as your overrides, something narrower and more tailored to the sorts of things that actually make sense, but I have not figured out how to do that yet.
D
The other interesting thing of note to this group, maybe, is that I added another cache that seems to be pretty effective for speed, and that is a realized-object cache. This GraphQL server keeps an in-memory map of CID to ipld.Node for the things it has parsed. So, rather than going back to the datastore, it reuses the nodes it has already parsed and realized, and that seems quite useful, and a layer at which we should have a cache. It's quite simple, right: you just do an LRU.
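A minimal sketch of such a realized-object cache in Go, assuming hashicorp's golang-lru for the LRU, and keying on the CID plus an expected type name (anticipating the CID-reuse wrinkle described below). All names here are hypothetical, not the actual server's code:

```go
// Hypothetical realized-object cache: CID (plus expected type) -> the
// already-parsed ipld.Node, so traversals skip datastore reads and decoding.
package cache

import (
	lru "github.com/hashicorp/golang-lru"
	"github.com/ipfs/go-cid"
	ipld "github.com/ipld/go-ipld-prime"
)

type NodeCache struct{ inner *lru.Cache }

func New(size int) (*NodeCache, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &NodeCache{inner: c}, nil
}

// key combines the CID with the expected type name, because (as noted
// below) two differently-typed nodes can share one CID in this setup.
func key(c cid.Cid, typeName string) string {
	return typeName + "/" + c.KeyString()
}

func (nc *NodeCache) Get(c cid.Cid, typeName string) (ipld.Node, bool) {
	v, ok := nc.inner.Get(key(c, typeName))
	if !ok {
		return nil, false
	}
	return v.(ipld.Node), true
}

func (nc *NodeCache) Put(c cid.Cid, typeName string, n ipld.Node) {
	nc.inner.Add(key(c, typeName), n)
}
```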
D
Or
do
you
make
some
cash
when
you
do
it
the
only
place
where
it
was
annoying
because
of
my
other
hacks
that
I've
been
doing
because
lotus
on
its
state
routes,
sometimes
until
the
upgrade
of
v2
actors?
It
didn't
have
a
layer
of
indirection
at
stake
roots,
so
it
just
directly.
The
state
group
pointed
to
the
hampt
of
actors.
Instead
of
this
now
defined
state
route,
object,
which
is
a
sid
of
the
hampton
factors
and
a
version
id
and
the
way
that
I
unified.
D
But
now
that
means
there's
two
ipld
nodes
with
the
same
cid
in
that
case,
so
I
am
keying
on
both
the
cid
and
the
type
of
the
ipld
node,
because
I
do
know
what
that
type
is
going
to
be,
so
my
keys
are
uglier
than
they
need
to
be,
ideally
and
that's
sort
of
my
fault
and
my
misuse
of
cds.
In
some
sense-
and
I'm
sure
people
will
feel
strongly
and
happy
at
me
for
that
choice.
But
that's
where
we
are.
D
That was my goal. StateQL is like the Filecoin-applied version, but there's the generic codegen of a GraphQL server over IPLD data, right: there are sort of two layers here. There's the one that's totally generic to an IPLD schema, and then there's the one that is the Filecoin-specific binary server thing.
A
Yeah, okay. Sorry, I totally... who's next? Eric is next.
E
Yeah, sorry, I'm still trying not to laugh at the pronunciation. So, there is a recording from last week: I was also in the ranks of people giving talks, and so there's a "What's New in IPLD" talk that was recorded as part of the IPFS meetup last week, and there is a link to where that exists on YouTube.
E
My section is only about 10 minutes of that, so if you want a really brief overview of what's up and what's new, it's about as brief as it gets. There's a lot of other stuff in that video, but I made a link with the time offset too. In go-ipld-prime code, there are a couple of small things merged, for example the codegen output rearrangement: finally, codegen will output a finite number of files instead of a file for every type, which was operationally annoying. So, fixed.
E
I
was
doing
some
discussions
with
will,
as
he
already
mentioned,
about
how
we
might
integrate
adl
stuff
in
practice,
and
I
think
it
seems
like
we're
probably
going
to
do
some
present
tense
shortcuts,
which
are
aimed
towards
letting
people
do
arbitrary,
weird
stuff
and
then
that'll.
Let
will
do
what
he
wants
and
then
there'll
be
a
rather
distinct
roadmap
for
like
how
we
want
to
integrate
adls
well
in
the
long
term,
and
that's
going
to
be
a
little
we're
going
to
take
that
one
slower.
E
The
present
tense
shortcut
is
probably
going
to
be
something
that's
unique
to
cogen
and
just
like
really,
let's.
It
will
assume
that
a
human
is
very
much
in
the
loop.
So
it's
going
to
be
it'll,
be
safety's
off
sort
of
scenario
and
we'll
see
how
that
goes
and
other
than
that.
A
lot
of
this
week
seemed
to
disappear.
For
me,
too,
working
on
a
lot
of
planning
and
scoping
docs
and
just
plain
thoughts
for
the
future.
So
more
docs
about
that
will
probably
come
out
later
and
that's
about
it.
This
week.
G
Yes, so a lot of glue work again this week, stuff that is interesting for IPLD, aside from, you know, Filecoin Plus, Dumbo Drop, Space Race 2. We started something interesting: we hooked up Will's front end to a not-yet-complete and somewhat unstable version of an SQL-based data store that has everything in it, including state and chain receipts.
G
Well,
supposedly
everything-
and
it
is
currently
served
a
lot
of
people
already
using
it,
and
speaking
of
naming,
I
like
to
call
this
graphil
ql,
because
it
rolls
over
the
tank
very
nicely
and
yeah
that
that
that
was
interesting
to
get
to
get
running
on
a
very
short
notice,
because
the
team
really
needed
that
another
thing
that
was
done.
A
lot
was
moving
around
various
versions
of
the
data
store
for,
for
various
purposes,
solidifying
a
little
bit
more.
G
What
actually
needs
to
happen
for
this
to
be,
you
know
available,
at
least
for
our
ipld
needs
going
forward,
and
I
hope
by
next
week
I'll
be
able
to
talk
about
this
more
more
in
detail.
But
for
now
everything
is
still
very
much
in
the
air
and
it's
just
it's
just
really
really
difficult
to
navigate
through
all
of
this.
When
everything
you
try
takes
like
an
hour
and
a
half
so
yeah,
but
welcome
to
big
chain
and
yeah.
G
That's pretty much all I have this week, thanks. Next one is Michael.
F
Hey, yo. So yeah, I had some more calibration stuff last week, and then more 2021 planning; I've still got to get that ready this week. And then I did a bunch more work on CADB, got it working and benchmarking. It looks great, it's really fast. The code's a mess, though; I just hadn't really implemented these trees much before.
F
Just
things
got
like
wildly
out
of
hand,
and
so
it's
working
really
well,
but
there
are
some
bugs
and
hunting
them
down
has
just
been
so
painful
with
the
way
that
it's
written,
so
I'm
looking
at
as
I
was
looking
at
kind
of
like
breaking
it
apart
and
putting
some
better
kind
of
abstract
layers
in
there.
I
realized
that
there's
just
a
much
more
abstract
version
of
these
trees
that
I
could
implement
and
then
use
that
implementation
of
those
trees
to
do
this
database
or
any
other
data
structures.
F
So
I
started
poking
at
that
a
little
bit
seeing
what
that's
like
the
caching
layer
we're
talking
about
was
funny.
I
have
one
of
these
catching
layers
already
in
cadb
and
then
now
I'm
porting
it
over
to
this
other
one
and
the
funny
thing
about
it
is
like
I'm
actually
making
the
addresses
of
the
nodes
sort
of
abstract,
because
in
cadb
those
those
are
just
integers
for
the
file
offset
and
and
then
like
in
the
other
merkle
structures
like
they're,
obviously
going
to
be
cids.
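A minimal sketch of that address abstraction in Go, assuming nothing about CADB's real code; the tree logic addresses children through a small interface so one store can use file offsets and another CIDs. All names are hypothetical:

```go
// Hypothetical sketch: node addresses hidden behind an opaque Addr, so
// a file-backed store uses integer offsets while a merkle store uses CIDs.
package trees

import "strconv"

// Addr is an opaque node address.
type Addr interface{ Key() string }

// OffsetAddr is a file offset, as in the CADB case described above.
type OffsetAddr int64

func (a OffsetAddr) Key() string { return "o:" + strconv.FormatInt(int64(a), 10) }

// CidAddr is a CID (held here as its binary string form) for merkle trees.
type CidAddr string

func (a CidAddr) Key() string { return "c:" + string(a) }

// Store resolves addresses to encoded nodes and stores new ones.
type Store interface {
	Get(a Addr) ([]byte, error)
	Put(node []byte) (Addr, error)
}
```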
F
So
I
just
wanted
to
make
that
a
little
bit
flexible
but
yeah.
It's
like
you
want
to
make
any
of
this
stuff
fast.
You
don't
need
a
node
cache
where,
like
the
fully
materialized
node
is
already
there
yeah
and
then
that's
that's
most
of
what
I
got
done.
F
It should work; I mean, theoretically it should work, yeah. It also, right, it includes that change that we had talked about a little bit, that just makes the root HAMT reference a link instead of being inline.
B
Intentional? Well, the implementation wouldn't really allow mixing them. It's just that, at least for my use case, there would be an API entry point that says "use the old stuff", another one that says "use the new stuff", and then, depending on that, I follow one path or the other. Does that make sense?
C
So here's the thing: this is fine, except it'd be nice not to do that in the formal spec. Maybe this could be something like "here's how you can combine both of them", a why-not-both thing, because, you know, the keyed union is so horrible. I don't really want to bake the keyed union in.
F
Yeah, I would like to take the linking-the-root-node change. I don't know how you feel about that, Rod.
F
Yeah, I mean, that's not the most annoying tiny block that you'd create. But, what was I going to say... oh yeah: so maybe this could move down to the bottom, and we could basically have a schema for the unified thing at the bottom, instead of just the one-off for Filecoin, if we're going to use this for that.
C
This question of linking to the root node and keeping your configuration out in an external thing: I'm not totally sold on that, except in the case where the root node is interchangeable with other nodes of the graph. Like in the vector spec, you can pull out any arbitrary node and make it the root node, and that's a nice pattern where, if you had a configuration that was a bit more advanced, you could have that separate.
F
Right, right. But another thing to think about too is that it's not a given that those settings and that data go into a unique block. There's no guarantee that that schema is being applied to a unique block; it could actually be embedded into a different schema, and it could be in mine.
F
You could just take this schema and use the value data inline, and there you actually may want to say, "oh no, I actually want to be linking out here", because it's not a big deal; you're not creating tiny blocks at that point, since on every mutation you were already going to have to re-link it.
C
There are so many variables to this that it's very difficult to say one thing is objectively worse than another, but I think we know that the size of CIDs is really costing Filecoin, because they're explicitly storing everything. And so once you push towards smaller blocks, the size of CIDs becomes a real problem.
C
Problem
I
mean
it's
not.
I
don't
want
to
derail
this
particular
question,
but
there
is
this
thing
that
came
up
in
was
it
the
meeting
we
had
on
thursday,
the
no?
It
was.
C
Maybe
it
was
last
week's
meeting
just
about
because
filecoin
stores
everything
explicitly
it
takes
up
so
much
space,
whereas
other
blockchains,
they're,
okay,
with
implicit
structures
that
you
can
recalculate
on
the
fly
and
the-
and
this
goes
back
directly
to
the
nature
of
ipld
itself,
where
we
are
explicit.
C
Every
block
exists
as
a
thing
and
these
and
they're
very
concrete
with
regard
to
cids
and
and
navigation,
and
this
it's
a
more
fundamental
critique
of
ipld
that
if
you
remove
the
space
where
you
can't
be
implicit
for
really
simple
constructions
and
just
say,
look
they
exist
and
you
can
recalculate
them
when
whenever
you
want,
if
you've
got
all
the
data,
it
ends
up
costing
us
really
badly
and
it's
and
we're
seeing
that
in
file
coin,
there's
a
bunch
of
places
where
you
could
imagine
these
implicit
structures
that
exist
and
they're
recalculable
and
everyone
can
consensus,
build
them,
but
because
we
have
all
of
these
blocks
laid
out
on
disk.
G
Yes, but also, Filecoin is one of the very few chains, pretty much the only chain I can think of, where you can literally start from a state root, get everything else from somewhere else, and be assured that the entire thing is correct. You basically don't have any of this stuff that you are fighting in Bitcoin, where you have pieces of the chain which are not actually encoded into any of the hashes anywhere; they're just, you know, nonsense and stuff like that. We don't have any of that.
C
Yeah, that's the pro of IPLD, but there's no reason you can't combine the two worlds, because the problem with Bitcoin is that they've approached it without this lens of every addressable block being a thing; they're devoid of that lens. They just have this sense of:
C
We
need
to
be
able
to
content
address
it
all
or
hash
it
all
in
some
form,
whereas
I
think
we're
at
a
we're
at
a
point
where
we
the
way
we
view
things
means
that
everything
can
exist,
but
it
doesn't
have
to
exist
on
disk.
C
So
there
should
be.
There
should
be
a
synthesis
of
both
worlds
possible.
Where
you
can
say
these
parts
are
so
trivial
that
they
don't
need
to
exist
on
disk.
I
don't
know
what
they
are.
I
mean
I
can't
point
to
any
examples
on
the
far
coin
chains.
Just
and
maybe
that's
maybe
that
comes
from
not
acknowledging
this
and
then
just
saying.
Well,
everything
has
to
exist.
So
therefore,
we're
just
going
to
lay
it
all
out
like
this,
but
if
you
said
well,
not
everything
has
to
exist.
There
is
these
implicit
pieces
that
will
exist.
C
If
you
have
everything
yeah,
I
don't
know,
I
don't
know
where
that's
what's
going,
but
it
is
definitely
a
big
cost.
G
Yeah,
I
I
believe
when
we
spent
a
little
bit
more
time
actually
analyzing
the
chain,
and
you
know
internalizing
the
structures
from
you
know
from
a
general
perspective
and
what
we're
doing
we
will
end
up
in
a
situation
where
it
will
be
like.
Oh,
they
actually
need
all
of
that,
because
if
you
think
about
it,
what
the
chain
actually
does
is
kind
of
unprecedented,
where
every
single
thing
ever
existing
needs
to
ping
you
every
24
hours
like
that's,
not
something
the
chains
normally
do
so
yeah.
G
Like
like,
like
take
the
take
the
entire
system
of
vector
ids,
it's
literally
how
to
increment
primary
key.
You
know
an
integer
which,
when
you
have
a
chain,
reorg
is
rewritten
so
like
that
the
type
chain
internally
points
to
things
by
a
simple
integer,
and
when
you
have
a
reorg,
you
need
to
go
back,
match
every
single
integer
from
the
previous
state
to
a
stable
address
and
rewrite
them
to
the
new
integers
that
are
now
in
in
in
chain,
because
the
state
is
different
and
you
go
forward
with
that.
G
So
even
this
kind
of
stuff
they
already
like
shaved
as
much
as
they
can
by
basically
not
using
the
no
cryptographic
stable,
addresses
and
they're
like
all
these
really
encodings.
For
for
for
all
for
all
the
you
know,
partition
populations
and
stuff
like
that,
so
yeah
I
I'm
actually
not
sure
there
is
much
to
shave
without
yeah.
I
don't
know
we'll
actually
will
know
more
at
this
at
this
stage.
He's
done
who
most
closely
like
had
a
like
birthday
view
of
the
entire
thing.
I
guess.
D
Yeah,
I
mean,
I
think
we
need
to
look
at
delta's.
My
suspicion
is
that
there
are
sub
trees,
that
always
change
together,
where
being
able
to
collapse.
Links
would
be
beneficial
to
us
because
they're
not
referenced
other
places,
but
I
don't
have.
I
I
think
we
need
tooling
to
identify
what
those
places
are,
so
that
we
can
be
better
informed
about
how
to
structure
things.
I
don't
think
we
just
know
that
offhand.
C
So, back to this HAMT question. Okay, so linking from the config to the root, let's call it: I'm not a huge fan of it, but it does help us to support Filecoin, you know, with that additional schema change, and it does also just open the door to use cases where people will show up and say "I don't want to store the configuration with the data".
C
The ideal is: you don't need to make these complex choices, because we've made them for you. There are so many subtle interactions here that for IPLD to be this thing where you have to understand all of it to make informed choices is not a great place to be, and that's where all these unions are getting us.
F
I don't think that we need to get this aggressive about making choices in the specs. I do think the implementations should make some.
E
There's the idea that was touched upon earlier: we should write something in the spec that's terse and clean and is what we want, and then shove in a couple of alternative schemas as appendices, which can be any of these implementation choices, or even several of them, and just highlight them for contrast. Like: "as an implementer, you might want to do this, which does this funny union here, because it's really useful in practice for this compatibility reason". I think having several of those documents show up is actually super reasonable.
C
Yeah, so what I would do in JavaScript is have the front-end algorithmic piece and then the back-end layout piece. But where we're at with go-ipld-prime, that becomes really tricky, because the layout piece is so intertwined with the algorithmic piece, and, Daniel, I think you said you could see a path to doing that, but you would end up having to make sub-packages and it would just get messy and awkward.
E
But yeah, I think Volker's question is also really good, in pursuing this idea of a unified schema for this particular implementation.
E
And
as
long
as
we're
conscious
of
that,
that's
a
totally
fine
trade-off
to
make.
But
we
probably
do
want
to
be
really
conscious
of
a
document,
because
if
there
was
some
totally
other
implementation
of
schema
logic
that
had
different
costs
than
this,
we
might
not
be
feeling
the
desire
to
make
these
choices
at
all.
A
If you have a filename with those characters in it, what can happen on the output, like if you print it somewhere on the terminal or something, is that the control characters could get in the way of everything. I briefly discussed with Alex whether this should be on the dag-pb protocol buffers layer, or if it should be, as it is about printing things out, on the printing-out layer, basically.
G
I
saw
this
this
issue
just
now
in
the
in
the
in
the
in
the
list,
and
I'm
like
really
really
angry
about
it
because,
like
stripping
stuff,
it's
like
the
only
thing
you
cannot
do
like
no
other
tool
does
that
I
actually
have
a
an
issue
from
three
years
ago,
where
I
explored
exactly
this
using
control
characters
to
when
you
do
an
ls
to
basically,
like
you
know,
write
stuff
to
the
screen
that
the
terminal
will
interpret
and
no
other
tool
actually
drops
these
characters
just
and
calls
them
correctly.
G
Ls
does
the
right
thing:
tartar,
the
writing
and
so
on
and
so
forth,
and
our
answer
is
what
we're
just
going
to
drop
them.
I
mean
like
we
need
to
like
either
escape
the
morning
code
them
or
you
know
or
indicate
somehow
that
there
was
some
more
stuff
they
are
just
dropping.
Then
it's
like
silly.
I
don't
know.
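As one illustration of the escape-rather-than-drop approach (a sketch only; the PR under discussion is in JavaScript and this is not its behavior), Go's strconv quoting renders control characters visibly while preserving the original bytes:

```go
// Sketch: render a link name containing terminal control characters
// safely for display by escaping instead of stripping.
package main

import (
	"fmt"
	"strconv"
)

func main() {
	name := "evil\x1b[2Jname" // contains an ANSI "clear screen" sequence
	fmt.Println(name)                // unsafe: the terminal may interpret the escape
	fmt.Println(strconv.Quote(name)) // safe: prints "evil\x1b[2Jname" literally
}
```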
A
Okay, so we could also totally escape them. But also, it's important: this PR, on the JavaScript side, is really only about what you get returned as a JavaScript object. Of course, for creating the hash and so on, you keep the original data around, because you don't want that to have changed; it's just about what you get back as a JavaScript object.
A
So if you have the serialized protocol buffers and you deserialize it into JavaScript, then what you see in JavaScript when you say .name to get the name of the link will have those characters stripped off, just removed. That's what the current PR is doing.
A
It's also in the, wait, let me post it in the chat as well. It's in the notes on the agenda items. Sorry, the chat is... where is it?
A
Yeah, so basically the answer is they should put it on the layer higher, whatever this layer is.
A
Yeah, so it's even about strings as valid UTF-8 strings. Just as context: we're not talking about arbitrary bytes, we're talking about the value of strings, like the first characters of a key. So, okay.
C
Well,
here's
here's
one
reason
I
mean
you're
gonna
have
to
craft
a
response
to
this,
but
one
reason
is
if
they
want
the
my
growth
path,
migrate
path
to
the
next
generation
of
of
the
stack.
This
is
just
not
going
to
work.
It's
not
going
to
fly
there
at
all,
because
we're
very
strict
about
data
model
forms
as
object,
shapes
being
able
to
round
trip
and
not
having
any
of
these
special
properties.
There's
no
special
properties,
no
hidden
properties,
no
rewriting!
Nothing
of
that.
C
It's
just
what
you
get
is:
what's
there,
so
you've
got
to
deal
with
all
that
junk
up
up
the
stack.
If
it's
a
problem
to
you,
we're
not
going
to
be
patching,
these
minor
things
all
the
way
down,
yeah,
because
if
you
were
to,
if
you're
here's
another
argument,
if
you're
a
round
trip
to
this
dag
pb
data
into
daxybore,
for
some
reason,
which
you
can
do,
you
have
the
same
problem.
But
now
it's
in
daxybor,
where
you
don't
have
these
defined
properties
in
javascript.
They
are
just
properties
that
come
out
of
the
data.
A
Yeah, okay, cool. Thanks for the input; I will write a response.
A
Okay, so we still have time. Michael, you wanted to say something?
F
I was actually just going to say: maybe Rod can show the slides that he wrote. I'd actually like to hear Rod do it instead of me, to see how well the transmission has happened of how these work.
C
So, this is what Michael's been working on with Nicola; they've been hammering it back and forth, with lots of interesting discussion happening there, and Nicola keeps on surfacing these interesting ideas. But the core of this, I think, is really interesting to understand.
C
There's just this one key piece; I think if you get that, then you can see what's going on. So here's the problem: what we really want is a way to store a sorted list of entries. That's like the holy grail for our data structures. Once we have that, we can do so much with it, and maybe there are different variations of what we want out of it.
C
So
maybe
there's
different
data
structures
that
will
do
this
for
us,
because
we
we
have
different
utility
for
them,
but
but
really
once
we
have
something
that
we
can
store
sorted
and
we
can
query
in
assorted
ways
to
do
range
queries
on.
Then
we
can
leverage
that
up
to
produce
all
sorts
of
really
interesting
tooling.
C
On
top
of
this,
so
right
now
we
have
a
hamptons,
our
really
our
general
purpose,
data
structure
that
serves
so
many
purposes,
but
and
it's
it's
got
so
many
really
nice
properties,
but
it's
the
lack
of
sorting
means
that
it
stops
short
of
being
a
general-purpose
data
structure.
So
let's
say,
we've
got
a
list
of
things
here
we
want
to
sort.
This
is
this:
is
the
generic
problem
space
there?
C
These
entries,
where
they've
laid
out
here
just
as
strings
to
cids,
but
they
could
be
anything
in
fact
they
could
just
be
strings
in
a
list
and
you
want
to
store
a
set
of
strings.
C
That's
it,
but
something
that
you
can
run
a
compare
operation
on
to
put
them
in
order
and
the
there's
an
additional
constraint
put
in
here
to
make
this
work,
which
is
that
they
should
be
unique,
and
I
think
that's
that's
an
okay
property
to
have
in
general,
because
you
can
work
around
that
if
you
have
non-unique
items
in
a
number
of
ways,
so
the
naive
solution
is
obviously
to
make
a
just
a
normal
tree,
where
you
branch
with
a
maximum
branching
factor,
and
so
in
here
we've
got
a
maximum
branch
factor
of
four
which
splits
our
data
into
two
and
gives
us
a
root.
C
That works, except mutations become a problem, because if you want to do inserts, you have this shuffling problem. An insert will make an overflow that can then impact the whole tree, if you do it in the right place, and so you have to rewrite every node if you insert something at the beginning. It's a yucky solution for what we care about.
C
What
we
care
about
and
the
same
thing
with
deletes,
make
spaces,
and
you
have
to
shuffle
everything
back
up
again,
just
to
get
the
canonical
form
with
the
maximum
branching
factor,
so
really
not
a
pleasant
solution.
So
the
the
idea
that
michael
and
nicola
have
been
working
on
is
to
treat
these
entries
as
if
they
were
something
that
you
were
putting
through
a
chunking
algorithm
and
so,
which
that
was
the
piece
for
me
of
just
viewing
these
things
differently.
C
Viewing
them,
essentially
as
a
string
of
of
like
a
string
of
bytes
that
you
were.
You
were
scanning
through
and
finding
some
way
to
slice
them
up
in
in
a
predictable
manner.
So
if
you,
if
you
scan
through
this
list,
where
do
you
find
predictable
break
points
that
you
can
always
break
and
turn
them
into
nodes?
C
And
you
can
do
that
with
chunking
and
chunking
algorithms
and
you
can
actually
do
really
simple
chunking
algorithms,
so
in
in
this,
in
the
form
that
might
so,
mccollum
had
was
actually
using
a
proper
chunking
function.
So
but
michael
just
decided.
Well,
we
just
we
have
randomness
here
in
the
form,
predictable
randomness
in
the
form
of
cid.
C
So
if
you
have
a
well
actually
the
cid
in
this
instance,
you
have
but
you,
but
if
you
just
have
a
hash
function
for
the
the
entry.
So
if
it
was
just
a
string
and
you
hash
the
string,
then
you
could
just
get
an
identity
from
that
hash
function
in
some
way.
So
just
turn
that
predictably
or
as
stably
into
a
a
number
within
a
certain
range,
and
so
in
michael's,
current
iteration
just
take
the
last
four
bytes
you
get
a
32-bit
integer.
C
Your
range
is
zero
to
max
that
you
win
32..
You
can
then
choose
a
branching
factor
that
will
give
you
a
way
of
slicing
these
things.
So
if
you,
if
you
divide
your
index
space
up
into
your
branching
factor,
so
let's
say
your
branching
factor
is
four:
you
divide
max.
U
and
32
by
four
and
then
you've
got
your
four
slots.
C
Take the lower quarter of that address space, the identity space: whenever an entry's identity lands in that lower quarter, we break there. That gives us a probability of one in four of hitting a break, which then gives us a branching factor of roughly four, approximately four. So in here, two of these entries have hit the threshold, and so we chunk.
C
When
we
read
along
long
and
long,
we
found
one
that
met
so
we're
going
to
say
this:
is
our
branch
so
treat
these
as
a
close
operation
for
our
our
node?
And
then
you
end
up
slicing
your
set
into
a
set
of
nodes
based
on
this
close
operation.
So
we've
got
one
close
here
and
one
close
here.
That's
given
us
three
nodes
and
it
seems
a
bit
sloppy
like
these
things
are
not
exactly
the
right
size
and
that's
true.
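A minimal sketch of that break rule in Go, under the assumptions described above (identity is the last four bytes of the entry's digest, branching factor four); names and shapes here are illustrative, not Michael's actual implementation:

```go
// Sketch of the break rule: treat the last four bytes of each entry's
// digest as a uint32 "identity", and close the current node whenever it
// falls in the lowest 1/branchingFactor slice of the uint32 space.
package chunkytree

import (
	"encoding/binary"
	"math"
)

const branchingFactor = 4

// threshold marks the "lower quarter" of the identity space, so each
// entry is a boundary with probability ~1/branchingFactor.
const threshold = math.MaxUint32 / branchingFactor

// isBoundary reports whether this digest closes the current node.
// Assumes digests are at least four bytes long.
func isBoundary(digest []byte) bool {
	v := binary.BigEndian.Uint32(digest[len(digest)-4:])
	return v < threshold
}

// chunk slices a sorted list of entry digests into nodes at boundaries.
func chunk(digests [][]byte) [][][]byte {
	var nodes [][][]byte
	var current [][]byte
	for _, d := range digests {
		current = append(current, d)
		if isBoundary(d) {
			nodes = append(nodes, current)
			current = nil
		}
	}
	if len(current) > 0 {
		nodes = append(nodes, current) // trailing node with no closing break
	}
	return nodes
}
```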
C
Then you hash and index the entries in your next layer as well. In the next layer, the entries are just links to the base layer, along with a starting index so that you can traverse it later on. But you can run the same hash over those and do the same thing, and so you might end up also splitting at the next layer.
C
So
if
you,
you
know,
get
this
this
threshold
below
the
threshold,
then
you
branch
again,
and
so
you
end
up
with
multiple
layers
and
you
just
keep
on
running
the
algorithm
until
you
get
a
single
root
and
then
inserts
mean
you
just
run
the
algorithm
again
over
the
set
and
they
all
because
this
they're
stable,
they
all
produce
the
same
index,
the
same
identity
and
you
still
have
the
same
breaks
so
new
and
then
new
elements
will
either
fit
in
the
set
or
introduce
new
breaks.
So
here
we've
got
one
in
the
top
there.
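Continuing the earlier sketch (same hypothetical package; reuses chunk from above), the layering step might look like this. hashNode stands in for "encode the node and take its CID digest", which is an assumption for illustration:

```go
// buildLayers runs the chunking rule over each layer's node identities
// until a single root identity remains. Assumes leaves is non-empty.
func buildLayers(leaves [][]byte, hashNode func(node [][]byte) []byte) []byte {
	layer := leaves
	for len(layer) > 1 {
		nodes := chunk(layer)
		if len(nodes) == len(layer) {
			// Degenerate case: every entry was a boundary, so no merging
			// happened. Collapse what's left into one root to terminate;
			// the probabilistic rule makes this vanishingly rare.
			return hashNode(layer)
		}
		next := make([][]byte, len(nodes))
		for i, n := range nodes {
			next[i] = hashNode(n) // the next layer chunks node identities
		}
		layer = next
	}
	return layer[0]
}
```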
C
That's
in
that's
fitting
in
that
set,
so
the
mutation
then,
is
only
that
one
path
along
the
tree
or
the
second
one
is
where
we're
splitting
the
node
into
two
and
so
again,
probabilistically
one
in
branching
factor.
You
will
get
a
new
split,
introduce
new
nodes.
C
And
then
the
same
thing
for
removal-
and
you
can
either
remove
elements
like
this
one
here
that
is
just
in
the
middle
of
a
node
or
you
could
remove
a
split,
in
which
case
you
would
be
joining
nodes.
So
so
these
mutations-
don't
always
just
they
don't
always
just
affect
one
node
and
a
path
up
the
tree.
They
can
also
affect
a
neighbor
and
and
as
you
think,
about
the
way
that
the
tree
works
you
can.
C
This
effect
does
compound,
as
you
go
up
the
tree
because
you're
generating
new
data
here,
because
the
cids
change,
so
these
things
may
shuffle
a
bit
but
you're,
never
in
a
situation
where
this
is
likely
to
propagate
across
the
whole
tree
like
it's,
certainly
not
at
the
base
layer,
but
there
will
be
more
mutation,
cost
within
the
middle
of
the
tree
than
with
a
hamped
or
other
data
structure,
but
it's
it's
relatively
minimal
and
also
within
the
probability
of
the
branching
factor
you
define
so
this.
This
is
the.
C
What
was
this
one
illustrating?
This
is
insertions
that
may
cost
you
this
is.
This
is
a
perverse
case,
so
it's
not
it's
it's.
You
could
push
this
one
to
the
extreme
so
where
we
insert
we
go
from
from
here,
do
we
do
it
we
insert
b
and
then
we
we
end
up
creating
a
new
sc
id
here
which
has
a
break
and
then
the
next
one
has
a
new
cid,
because
you've
split
it.
Does
that.
C
You've inserted a thing here, and then you've created a new CID which is within the threshold. So in the second level we've got these three nodes, and then you do the same thing here, and you've got a break, and so you end up creating more levels here. But over time, within probabilities, these sort of bounce back to the right shape.
C
Lookup
operations
are
you
know
this
is
just
a
b
tree.
You
just
go
to
the
left,
just
go
to
the
the
node
that
is
lower
than
your
index.
You're
seeking
so
look
up
under
eight
indirect
index.
Sorry
operations
are
cheap,
size
operations
are
expensive,
but
I
think
I
laid
that
out
here.
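A sketch of that descend step, assuming internal nodes hold (startKey, child) pairs sorted by start key; hypothetical types, with the same caveats as the earlier sketches:

```go
// Sketch of B-tree-style lookup: pick the last child whose start key is
// <= the target, then repeat one layer down. Assumes entries is non-empty.
package chunkytree

import "sort"

type indexEntry struct {
	StartKey string
	Child    int // stand-in for a link to the child node one layer down
}

// findChild returns which child of an internal node to descend into.
func findChild(entries []indexEntry, target string) int {
	// First entry whose start key is strictly greater than the target...
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].StartKey > target
	})
	if i == 0 {
		// Target sorts before everything; descend into the leftmost child.
		return entries[0].Child
	}
	// ...so the previous entry covers the target's range.
	return entries[i-1].Child
}
```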
C
Yeah, if you trust it, yeah. So lookup, insert, replace, remove: relatively cheap. And, given a sufficiently random hash function, nodes will bias towards the target branching factor.
C
Within that sort of flex, the randomness just discourages thin trees. The problem we have with the AMT is that you can have these really thin trees, because there's no way of biasing it towards thickness; you are just brute-force using your indexes to define the shape of the tree, which is pretty gross.
C
So
so
it
has.
Has
this
thickness
property,
that's
similar
to
a
hampt
and
the
the
big
deal
about
this?
Is
it's
canonical
for
any
given
data
set?
So
even
though
it
looks
like
they've
got
this,
this
random
craziness,
going
on
any
given
data
set
regardless
of
intermediate
removes
or
inserts,
will
form
a
canonical
form
with
a
a
canonical
root
cid,
which
is
what
we
want.
C
So
that's
that,
but
there
is
this
one
problem
which
is
that
it's
cheap
to
create
collisions,
because
because
the
branching
factor
is
just
dividing
the
address
space
into
you
know
into
that
that
segment,
it's
it,
it
doesn't
cost
you
much
to
to
craft
entries
that
will
be
below
the
threshold
to
create
a
break
or
to
not
create
a
break.
So
in
this
instance
we've
you
know,
somebody
could
create
a
bunch
of
entries
that
don't
break,
and
so
you
end
up
getting
these
really
big
nodes
which
can
break
some
systems.
C
We
do
have
limits
on
node
size
on
that
that
we
can
store
in
our
storage
systems
or
trend
or
transport
across
transport
layers.
So
we
really
do
need
to
protect
against
this.
So
that's
the
really
outstanding
one
there's
been
a
couple
of
solutions
proposed
so
far
for
that-
and
I
think
michael
and
michael
are
discussing
through.
F
Yeah,
there's
there's
a
bunch
of
ways
to
solve
this.
Actually,
it's
just
that
these
solutions
are
increasingly
kind
of
complex
the
less
you
control
the
access
patterns
so
like,
if
you,
if
you
have
like,
if
you
just
own
the
data
structure,
you
can
just
insert
tombstones
into
the
structure
whenever
you
see
an
overflow
and
then
it's
becoming
part
of
the
canonical
list,
and
so
it's
it's
not
actually
out
of
hash
or
anything
right.
F
If everybody's applying the changes in a different order and they need to come up with the same particular structure, that becomes a little more difficult, because you could end up with the tombstones in different places depending on when you pulled it. Or, a case that I have in DagDB: it's an index, so the canonical form is not actually the data structure itself, it's a view of a different data structure, and so you can't really be inserting these tombstones.
F
That
means
that
it's
really
not
the
canonical
form
anymore
and
so
for
that
you
can
use
like
I
have
an
algorithm
for
using
like
a
floating
fingerprint
in
order
to
calculate
a
sequence
id
that
you
can
use
also
like
if
you,
if
you
get
to
hide
something
like
if
you
own
the
data
structure
and
other
people,
don't
like
manipulate
it
directly,
you
can
just
hide
a
nonce
for
the
hashing
and
then
they
can't
generate
these
attacks
because
they
can't
predict
what
the
hashes
are
going
to
be.
F
So
it's
like
really
easy
to
fix.
If
you,
if
you
have
any
kind
of
like
agreed
upon
secret
but
yeah,
so
yeah,
there's
a
bunch
of
different
ways
to
solve
it
depending
on
kind
of
what
the
access
patterns
are
you
just.
It
takes
some
some
work
to
figure
out
which
one
is
going
to
work
best.
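A minimal sketch of that hidden-nonce defense, assuming the same last-four-bytes threshold rule as the earlier sketches; using HMAC here is an illustrative choice, not necessarily what Michael's implementation does:

```go
// Hypothetical keyed variant of the boundary rule: derive the break
// decision from an HMAC over the entry digest, so entries cannot be
// crafted to hit or avoid breaks without knowing the secret.
package chunkytree

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"math"
)

const keyedThreshold = math.MaxUint32 / 4 // same 1-in-branchingFactor rule

func isBoundaryKeyed(secret, digest []byte) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(digest)
	sum := mac.Sum(nil) // 32 bytes for SHA-256
	v := binary.BigEndian.Uint32(sum[len(sum)-4:])
	return v < keyedThreshold
}
```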