From YouTube: 🖧 IPLD weekly Sync 🙌🏽 2020-08-24
Description
A weekly meeting to sync up on all IPLD (https://ipld.io) related topics. It's open for everyone and recorded. https://github.com/ipld/team-mgmt
A
Welcome everyone to this week's IPLD sync meeting. It's August 24th, 2020, and as every week we'll talk about the stuff we did last week and what we plan to do next week, and then we'll go through any agenda items we might have. I already see that we have an agenda item, which is great. So I'll start with myself. I don't have that much to report.

A
I got a bit of time on Friday to work on rust-ipld, but I wasn't really working on anything new, just reviewing your PR. That's a pretty big refactoring one, because it switches CIDs and multihashes from heap-allocated to stack-allocated, along with a few other changes.
A
One interesting thing that might be cool for other implementers: David, a guy working on rust-ipld, had an idea about codecs. In Rust, what we do is let you serialize directly from native Rust types into a codec — say, from native Rust types directly to DAG-CBOR, and you can also go directly to DAG-JSON or whatever you like. But the problem is that sometimes you don't know up front which codec you actually want to target.

A
I think his idea might work well, and it might also be an idea for other languages: you combine it into an interface which can be used by multiple codecs or by a single codec. It's kind of neat.
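Purely to illustrate that combined-interface idea (hypothetical names; this is not the actual rust-ipld or serde API), a minimal Rust sketch might look like this, with `DagCbor`/`DagJson` standing in for real codec implementations:

```rust
use serde::Serialize;

/// Hypothetical sketch (not the actual rust-ipld API): one trait that several
/// codecs implement, so the target encoding can be chosen late.
trait Codec {
    fn encode<T: Serialize>(&self, value: &T) -> Result<Vec<u8>, String>;
}

struct DagCbor;
struct DagJson;

impl Codec for DagCbor {
    fn encode<T: Serialize>(&self, value: &T) -> Result<Vec<u8>, String> {
        // stand-in: real DAG-CBOR has extra rules beyond plain CBOR
        serde_cbor::to_vec(value).map_err(|e| e.to_string())
    }
}

impl Codec for DagJson {
    fn encode<T: Serialize>(&self, value: &T) -> Result<Vec<u8>, String> {
        serde_json::to_vec(value).map_err(|e| e.to_string())
    }
}

/// Callers stay generic over the codec, so the "where do I want to go" decision
/// is made by whoever hands in the codec, not by the type being serialized.
fn write_block<C: Codec, T: Serialize>(codec: &C, value: &T) -> Result<Vec<u8>, String> {
    codec.encode(value)
}
```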
The next thing I'm trying to do is get the changes from tiny-multihash, which is kind of a fork of rust-multihash, upstreamed into the proper one.
A
Yes, that's all I have. Next on my list is Peter.

C
Hey guys, I did not die — I'm back from an extended vacation slash personal leave. I can fill you in more on that at some point if you want. But I got assigned to build a JavaScript graphsync library, so that's what I started working on today; there's a link to the repo.
C
If you're interested — I don't know how you all normally do your dev process — I did create a requirements document in a Google Doc. If you have time and interest you could look at it and provide feedback if you think something's missing, especially if you're familiar with graphsync. I did go over it with Michael and scrubbed it a bit today. So that's there, and I did start implementing some things.
C
I could use some help just to get bootstrapped on the go-graphsync side of things, and I have a few questions. I know Michael said maybe Peter or Eric could fill in a few things, so if you have time I'll get something on the calendar to do that, since Hannah's super busy and hard to pin down right now. Anyway, that is my update.
D
Technology is wonderful. So I worked on some documentation this week, and there is a new primer document that I wrote, which is trying to be an all-in-one piece of reference material — something that fits into one contiguous page you just scroll through rather than chasing links — for what IPLD is and what all these things inside of it are.
D
This is currently residing in a HackMD sort of scratch pad, so everybody can edit it — go nuts if you're so inclined. I'm not actually sure if I'm going to try to work this into our formal documentation or some offshoot of it, or what the trajectory is for this, because I want to keep it terse. I don't necessarily want to evolve this particular document into a document for everyone; it's actually already too long.

D
That's the way I wrote the first draft, I think, so something that's slightly more opinionated might be part of its future, but I'm not sure. I'm just trying to flesh out these different kinds of documentation that we might need: we've got lots of rich things, but I think we need more terse things too. So I took a shot at it anyway; feedback welcome.
D
I started using JSON as a shorthand for test fixtures, which on the one hand gives me the heebie-jeebies, because that's kind of a large dependency tree of other things that you're assuming work for your tests. But tests for all of our JSON stuff do already exist, leaning just directly on the basic node implementations, so all this stuff can be tested without any cycles in the dependency graph. If I had to bootstrap the whole thing all over again, it would still be a clear progression, so I think it's going to be okay, and it saves a lot of line count.
These fixtures are also probing everything at once: you give it one piece of JSON for the type-level semantic structure and another piece of JSON for the representation, and it makes sure that both produce the same data in memory and can go in every different direction.
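Those fixtures live in go-ipld-prime; purely to make the "two JSON forms, one in-memory value" idea concrete, here is a toy Rust/serde analogue (nothing to do with the actual fixture code), where a map-shaped document and a tuple-shaped document decode to the same struct:

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug, PartialEq)]
struct Point {
    x: i64,
    y: i64,
}

fn main() {
    // "type level" form: a map keyed by field names
    let from_map: Point = serde_json::from_str(r#"{"x": 1, "y": 2}"#).unwrap();
    // "representation" form: a tuple, relying on field order
    let from_tuple: Point = serde_json::from_str("[1, 2]").unwrap();
    // the core assertion of such a fixture: both forms decode to the same data
    assert_eq!(from_map, from_tuple);
}
```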
D
So this is doing more effective coverage with less boilerplate code, and I've already found a couple of bugs with it, so yay. Also, the traversal package in go-ipld-prime finally has a Get function, which just plainly gets you the thing you asked for by path, as opposed to making you go through a whole callback-driven model. People have been asking for that for a while.
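In spirit, a path-based get is just a fold over path segments instead of a visitor callback. A minimal sketch of that idea over plain JSON values (hypothetical helper, not the go-ipld-prime traversal API):

```rust
use serde_json::Value;

/// Hypothetical helper: fetch a node out of a nested document by path,
/// instead of walking it with visitor callbacks.
fn get_by_path<'a>(root: &'a Value, path: &[&str]) -> Option<&'a Value> {
    path.iter().try_fold(root, |node, seg| match node {
        Value::Object(map) => map.get(*seg),
        Value::Array(list) => seg.parse::<usize>().ok().and_then(|i| list.get(i)),
        _ => None, // scalar reached before the path ended
    })
}
```

So something like `get_by_path(&doc, &["payload", "0", "name"])` either hands back the node or `None`, with no callbacks involved.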
F
Hello, mine was a light week. I took a little bit of time off — I've had some health issues — but I tied up the HAMT and AMT work in Filecoin and handed that over.
F
The work that I did is in pull request form, and the HAMT finally got moved over to the Filecoin org, which is where it should live, to clear up any misconception that this thing is a generic thing.
F
It was good to hand that off, because it was a lot of work. They said that they won't be accepting the breaking block format changes — that was objected to up the chain, because it's just too close to launch, which is fair enough.
F
I think one of the biggest concerns is that they're now trying to build a lot more integration-type testing for the whole system, where you get consistent CIDs for different operations, and they're already working on live data, so breaking block format changes stymies all of that. So anyway, that's fine.
F
I did that stuff because it would be good to do, but I knew it was in their hands, and that's fine. Those changes may get pulled in during the next major network upgrade, whenever that happens — if it ever happens; who knows how that process is going to roll out. It'll be interesting to watch, but it's there for them, and I think those would be good changes.
F
It does leave us with the HAMT, though. One of the aims with this work, other than just trying to improve it, was to pull it a little bit closer to what we would produce as a generic HAMT, as an IPLD offering, so that our work could more easily overlap with theirs. It would then be fairly simple for us to have some code that could work for them as well as for anyone else, and it wouldn't be a big leap between ours and theirs.
F
So all it means now is that theirs is producing blocks that don't really look quite like what we would produce, but the distance is not huge — it's just a little bit annoying. That's about all. I also spent some time in JavaScript ES modules stuff, converting some code and trying to understand Michael's new stack, and yeah, there's a lot of pain to be had there. Those were my major highlights for the week.
G
Yeah, sorry, I'm bouncing back and forth between a million things. Earlier in the week I really finalized all of the ESM migration stuff.
G
So we have actually good tooling there now that really does the right thing and is pleasant to use, and I migrated basically every module that we have onto ESM with it — multiformats all the way up through block, then all the things that depend on block, and all the way up to DagDB, basically. For DagDB I have a branch right now that I'm working on where I'm doing that.
G
Then on Thursday I had the idea for this block format, and that captured all of my attention for the rest of the weekend. I won't get into it a ton now, because I actually prepared some slides to go over and talk about it on the call in a little bit, so I'll let all the other people finish their updates and then talk about the block format.
H
I know most of you are from other backgrounds; I'm actually interested in learning a little bit of Rust, maybe someday. But anyway, all that I've done today so far is really just getting set up with the accounts and reading stuff, and tomorrow I hope to actually get into some code. That's pretty much it — nothing else for me.
C
Yeah, so I know we've talked a bit about compression and encryption in the past, and one of the key objections to it was that it kind of breaks content addressability. So the question comes up: well, why not have, I guess, multiple CIDs?
C
We could associate multiple CIDs with a block: one which is like a semantic CID — that would work like the uncompressed, unencrypted one — and then you could have other CIDs which refer to that same block, as a lookup mechanism, and those would be the actual encrypted or compressed versions of it. If you think about it from an existing tooling point of view, a lot of IPFS works with basically a single CID per block, and so if you're doing bitswap or graphsync or something like that, that could use, I guess, the as-stored one.
C
But
we
could
I'm
actually
I'm
not
sure
exactly
the
best
way
of
doing
it.
But
just
the
idea
is
this:
is
that
to
allow
like
different
identifiers
kind
of
breaks,
content
addressability,
but
it
does
allow
essentially
the
stored
format
to
vary
depending
independent
of
content,
but
still
keep
the
content
addressability
in
a
sense
now
that
makes
any
sense.
But
that's
the
high
level
question
idea.
B
Yeah — for example, graphsync is not really centered around per-CID round trips; it does have batching and things like that, at least in theory, in the spec. The way go-graphsync works right now is a little bit more primitive than that, but still, that's not going to work. So I guess to echo Brad's question: what does it buy you to basically have a jumbo block or something, in a sense?
C
Well, so I guess what I'm thinking about is this: we all understand the use case or the need for encryption and compression, right, and the feedback I always got was that you break content addressability. So think about linking from one block to another. Say you did what people do today, which is to encrypt content and then stick it in IPFS.
C
You
have
to
decrypt
the
block,
because
it's
not
going
to
work
with
things
like
graph,
sync
or
anything
else
really,
and
and
so
you
know,
the
idea
is
like
well
what
if
we
had
like
a
semantic
like
the
way
it
is
today,
there
is
no
encryption
or
compression
everything,
and
so,
if
you
link
from
one
block
to
another,
you
would
still
use
essentially
the
unencrypted
uncompressed
cid
to
link
between
blocks.
C
But
then,
and
so
you
could,
you
could
traverse
the
graph
that
way,
but
then
underneath
it's
actually
could
be
stored,
compressed
or
encrypted
and
therefore
moved
from
one
ipfs
node
or
one
node
to
another.
Preserving
that
compression
or
encryption
not
have
to
actually
undo
it
or
decompress
de-encrypt
send
over
the
network.
Re-Encrypt
re-compress.
C
I
know
it's
kind
of
abstract
and
I
need
to
think
about
it
more,
but
I
just
kind
of
wanted
to
throw
it
out
there
because,
like
I
said
I
keep
when
we
talk
about
this
in
the
past,
I
keep
bumping
the
fact
that
we
break
the
content,
addressability
and
so
to
kind
of
preserve
the
content
addressability.
It's
almost
like.
You
need
different
identifiers,
different
ways,
identifying
the
same
block,
one
which
is
like
I
said,
the
semantic
or
uncompressed
unencrypted
version
and
then
the
one
where
it's
actually
on
disk.
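To make the shape of that proposal concrete — this is only a sketch of the idea as floated, with made-up types, not anything that exists in IPFS or IPLD today — the mapping might look roughly like this:

```rust
use std::collections::HashMap;

/// Sketch of the idea being floated: one "semantic" CID identifies the plain
/// block, while the bytes actually kept on disk may be a transformed
/// (compressed or encrypted) variant addressed by its own CID.
#[derive(Debug)]
enum StoredAs {
    Plain { bytes: Vec<u8> },
    Compressed { codec: &'static str, bytes: Vec<u8>, stored_cid: String },
    Encrypted { scheme: &'static str, bytes: Vec<u8>, stored_cid: String },
}

struct BlockStore {
    // semantic CID -> however the block happens to be stored locally
    by_semantic_cid: HashMap<String, StoredAs>,
}

impl BlockStore {
    /// Links between blocks always use the semantic CID; the stored form is a
    /// local (or transport-level) detail.
    fn get(&self, semantic_cid: &str) -> Option<&StoredAs> {
        self.by_semantic_cid.get(semantic_cid)
    }
}
```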
F
One of the ways we've talked about encryption before is to essentially have a block that presents itself as an opaque container, and to navigate it you need extra information. It presents itself as a normal block, and when you read it you have to apply some logic, which is where advanced data layouts come in.
F
If you read a block and you know that it needs some kind of logic, then it goes through the ADL pipeline, which is starting to exist in Go, and in there you apply the logic, which inserts a key and does the crypto, and then presents the real block format underneath. In that sense you would have sort of what you're getting at, but you don't need that second CID — the CID is the ID of the encrypted data.
F
You do break the graph, though, unless you can decrypt it at every stage. If you're just an intermediate storage service that is only storing the blocks, and you want to do some sort of graphsync exchange, then unless you're given the key you can't do that full thing. So we did throw around some ideas of where you could.
F
You could extract links that are not sensitive outside of the encrypted container, so you present the block with a link section and then the encrypted payload. You could still traverse enough of the graph, and for some use cases that would work just fine, because the links section would just be a graph of all these encrypted blocks, and they'd only expose enough information to be useful.
F
But there are other use cases where the links are going to be sensitive — you might be linking to some particular external source that actually gives away what you're encrypting, or gives a suggestion about it. So there's this leakage of information, and it's a bit of a hard problem to solve. We did talk about the same thing working for compression, but it gets so sloppy with compression — it really does get messy when you're pulling compression into the block layer.
F
There's so much that it messes up, particularly for compression algorithms that are not deterministic. Something we've been talking about a lot recently is: is there really a solid compression algorithm that actually provides compelling results and is deterministic?
F
There are plenty of them that provide okay results — a small amount of compression — and they're fast and they're deterministic, but would you bother using them? If you actually wanted compression you'd reach for something that can do decent-sized compression, and a lot of those are not deterministic. You're stealing Michael's thunder here.

G
It's a maximum, I think, of like three percent, so it's just not actually worth the parsing and decoding time. That's probably because most of those byte values are public keys or something — it's all effectively encrypted data. So you do have these incredibly domain-specific compression needs, and — I'll talk about my block format a little bit later — you can start to actually program those into the data structures, if you have a block format that has a natural compression scheme.
C
Yeah — and I know I won't use much time — I just wanted to throw that idea out there, and I'm sorry, I'm kind of brainstorming it on the fly because I haven't fully thought it out, but I'll just close with this. It just occurred to me that maybe the right way to solve this is at the transport layer: from a client point of view you're always working with the CID, but there's some optimization at the transport layer.
C
If I want to basically move an encrypted block from point A to B, or a compressed block from point A to B, that's purely transfer negotiation; in the end, once I start working with it directly, it's the full content-addressed way. And one thing I don't necessarily agree with:
C
I think that you can have deterministic compression. I've done a lot of work with compression in my career, so I know a lot about it, and I think I understand why we say it's not deterministic, but it's more about controlling the implementation, as opposed to saying compression just isn't deterministic — because it is just code, and you have to control it so that it is deterministic.
C
There's no random number that's part of compression. There are parameters, and there are certainly different strategies you can implement for a given compression format, but that can all be controlled just by controlling the implementation.
F
Well, my biggest concern there comes from this stuff where we're trying to make even DAG-CBOR fully deterministic. Compression algorithms generally provide you with many ways to compress the same data that will decompress with the same algorithm. There's this asymmetrical thing that goes on, where the decompression side will look at it and say, yeah,
F
I can decompress that — whereas on the compression side you can choose any number of ways to compress the data into a format that the decompression will be happy with. So we end up with this problem of the same data, many CIDs, and what we want to do is reduce that possibility.
F
We don't want the same data I've already seen before to just come back compressed with slightly different parameters. We want to minimize that as much as possible — eliminate it, if possible — but we're finding that even hard to do with just DAG-CBOR.
G
This is a nice transition point, so I'll get into this a little bit. One of the cool things that I've been noticing is this: first of all, as I've been writing this block format, in order to be deterministic you have to only ever have one way of doing something. You cannot have variability; you can't have optionality.
G
It always has to be the same way — no variability in the structure. And if the structure itself supports a certain form of compression, what you end up doing is just adjusting your data structure to fit that, and then you don't end up with algorithms changing the deterministic representation. Otherwise, the representation that you spit out is going to be chunked differently and is therefore actually different data — if you have a different array of bytes, that is literally different data.
G
Whereas if you separate that from the actual format, and you say: no, we only have one way to encode an array of bytes, and it is deterministic, but we will compress — we will deduplicate every one of those bytes — then all of a sudden you can write multiple compression schemes that each spit out different data structures, but every one of those data structures can be deterministically serialized and deserialized. Does that make sense?
C
Towards the end, one quick comment — it's real quick. One of the general things I think is that doing some compression, any compression, is better than doing zero compression. That's a rule of thumb I've lived by and heard many people say. So I think it's not about having the most effective compression algorithm, but just doing some — and there are actually compression algorithms that are very fast, like deflate, extremely fast.
C
You can almost do memcpy speed, so rather than doing gzip level nine, just doing gzip level one is better than nothing. Anyway, I'll hand it over.
G
But again, I tested this against the Filecoin chain data, and even there it's only three percent, because that data is just not really compressible — it's all cryptographically generated. So it's actually not worth it; it literally is not better than doing nothing, because it takes twice as long to serialize, and that's not worth the three percent gain on the block format.
B
Yeah, just one thing specifically on the comment that some compression is better than no compression: if you do deflate or gzip level one, then you cannot do a gzip level nine over that anymore. So there is definitely a trade-off — if you compress, you compress once; you cannot double-compress or anything like that.
C
Yeah, you don't double-compress, but I'd actually argue, Michael, that your Filecoin chain data is already compressed — it's basically just using varints everywhere, so it's not going to compress very well.
G
No, no — I'm talking about just the raw data out of Filecoin, decoded from CBOR and then round-tripped into my thing. I shave more than three percent off of it because of other things I do in the block format; I'm saying that when I took just the value data and compressed it, that only lessened it by three percent.
G
You have a links header and a values header, and then the actual structure of the data. So you front-load the links, then you basically have all the value information, and then just the structure printed out. The link encoding looks like this: first you sort the links based on a CID-specific sorting algorithm. And first of all, there's no delimiter between them, because we know what the decoding rules are for CIDs.
G
The only delimiter we need is a null byte at the very end of the links header. And because we sort them all this way, we can then shave all of the prefix bytes off of every different CID type, and since you only put these in here once, this is effectively a compression table. Now, whenever you refer to these CIDs in the rest of the structure, it only takes one or two bytes, depending on the container format, because it may or may not need typing information.
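A toy sketch of why the sorting matters (illustrative only, not this format's actual encoding rules): once the CIDs are sorted, each entry can drop the bytes it shares with its predecessor, and the rest of the block can refer to the table by small indices.

```rust
/// Toy "links header" front-coding: sort the CIDs, then store each one as
/// (shared-prefix length with the previous entry, remaining suffix bytes).
fn front_code(mut cids: Vec<Vec<u8>>) -> Vec<(usize, Vec<u8>)> {
    cids.sort();
    let mut out = Vec::new();
    let mut prev: Vec<u8> = Vec::new();
    for cid in cids {
        let shared = prev
            .iter()
            .zip(cid.iter())
            .take_while(|(a, b)| a == b)
            .count();
        out.push((shared, cid[shared..].to_vec()));
        prev = cid;
    }
    out
}
```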
G
We do the same thing with the values: we take all the values and sort them, first by length and then by byte comparison, and then we store them in the values header. We put the length in front, but what we actually put is the increase in the length. So if you look at this, this one is four because it's increased from zero, then it's zero because the next one is also length four, and then the length-six one gets a two.
G
This keeps the actual length integers in the values header low and compresses even the representation of the table down a little bit further. As you parse this you need to validate its determinism — the length is easy, but you also need to do the byte comparison, because you will rely on the deterministic sorting of the values header later on, in some stuff that we'll get into.
G
Now look at the structure section. Here's just a simple list: numbers under 100 don't conflict with my token space, so they're just inlined, basically, and numbers over 127 are also just varints. So we inline most numbers without any additional typing information; only some stuff in the token space is taken up. That's what a list looks like — and we omit the trailing delimiter when it's the root structure.
G
Also, when there are no links or values, we can drop those null bytes with some extra rules as well, so I've been shaving a lot of bytes off of this format. As you can see, here's a list inside of a list: 109, then 1, 109, then 2, 3, then the delimiter — and again we can omit the trailing delimiter, because we know that the root structure is a list.
G
That's pretty nice — we get a nested list here with three values for six bytes. You can see null, true and false are also constants that show up in here; this is what the serialization looks like for that kind of a list. Okay, let's look at some structs. This one has no links in it, so we get a zero byte to delimit the links. This header value is wrong, sorry — that should be 12, not 10. Then we see five for the offset.
G
Then we get "hello" in binary, then a zero, because the next one is also length five. Then we say that it's a map. I'm playing around right now and poking at the map offset rules, so some of them in this example may be a little bit different. The map is basically: first, the offset for the key in the map.
G
That's offset from one — I'll talk about it a little later, but the first key in a map is always offset by one, and you have a separate token for empty maps. This is so that every subsequent key reference is an increase in the offset from the previous one; anything else would conflict with the trailing null delimiter that ends the map, so it's impossible to actually write out non-deterministic maps.
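Sketching that rule in toy code (illustrative, not the real spec): keys point into the sorted value table, the first reference is index + 1, later references are strictly positive increments, and 0 is reserved for the end-of-map delimiter — so an unsorted or duplicate key simply has no encoding.

```rust
/// Toy map-key encoding: given the table indices of a map's keys in their
/// deterministic order, emit the first as index + 1 and the rest as strictly
/// positive increments, terminated by a 0. Out-of-order keys are unencodable.
fn encode_map_keys(sorted_key_indices: &[usize]) -> Option<Vec<usize>> {
    let mut out = Vec::new();
    let mut prev = None;
    for &idx in sorted_key_indices {
        match prev {
            None => out.push(idx + 1),               // first key: never 0
            Some(p) if idx > p => out.push(idx - p), // later keys: must ascend
            _ => return None,                        // duplicate or unsorted
        }
        prev = Some(idx);
    }
    out.push(0); // trailing null delimiter ends the map
    Some(out)
}
```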
G
You rely on the deterministic order of the table, and then you only see the increases here, and you actually have a rule that means you can't even parse non-deterministic maps, which is great. Then you see that you also have the value index. Obviously this example is wrong, because this is an untyped map, so there should have been a trailing string type right there — it should have been something like 103 or so, and then the one. Sorry.
G
I literally wrote these late at night. But here, this is what a more complex object with more values looks like, and you can see that the compression starts to kick in here: I'm referring to the same strings again, so they're just going to be deduplicated down here, and they're not going to take up a ton of extra space. And this is something I'm working on right now: typed lists and maps.
G
When you have a typed list or map, you can drop the typing prefix from the entries, and then you get a much more compact encoding. Here we have a string-typed list, so this is just going to be all the values — oh no, this is a list, so it's always the index plus one, so that you can use a trailing zero to delimit these typed lists.
G
That means that the compression space for a typed list, when you don't vary the typing in lists and maps, is actually really, really small — we can get really effective representations. And an idea that I had after talking to Rod, actually, is that for the rule set for parsing out these references we can actually say—
G
Okay, sorry, I messed that up — I hit the wrong button to end the presentation and actually exited the thing. Anyway, I'm talking way too fast because I'm very excited about this, but what you end up with are these very efficient representations. My testing on the Filecoin chain is that, when all is said and done, this is going to reduce it by about ten percent.
G
I think we're going to shave about 10 percent off of DAG-CBOR, and that's not really taking advantage of any of this deduplication — none of that data has any duplicate CIDs or binary values in it. The only savings you're really getting in that 10 percent is just the more efficient representation of the links and some of the other bytes that I'm able to shave off the format by leveraging the determinism. But now you know this feature exists, right?
G
If you know that you can always deduplicate, you can do stuff like write domain-specific compression algorithms: you can start to chunk up your data and alter your data structure so that you always leverage this deduplication. For example, if you keep the value table and the link table below — sorry, not below 100, below 255 entries — all of the addresses are going to be one byte for the whole table.
G
Once you go over 255 it's a varint, so it can be a little bit bigger, but we basically have this maximal space for that table. When you work your way through it, you can really start to design some crazy data structures that are super compact just because of how they leverage this compression inside the block format.
G
And the nice thing about it is that all of these domain-specific algorithms for how to lay out the data are specific to the application, but the block format will just decode them every time, without knowing any of those algorithms.
G
Another really cool thing is that, because you're putting all of these in a constant table, the reduction in the size of the block from the deduplication also translates into the memory overhead when you parse the structure, because when you parse them they also come out as constants plus references back — so they're always going to be pointers, unlike when you go through JSON or CBOR.
G
There, every time you see a string you turn it into a new string, and then you have two of those in memory unless you think to go and check and deduplicate them. So yeah, it is kind of crazy, because what it really is is a block format as a compression table.
G
It's a block format that you can program with IPLD to compress the data representation, and so you can do very different things with it than we've thought about doing in the past. When I take the optimization that Rod gave me the idea for, and some other things, you can take a subset of this and just say: okay, we have a byte-typed list and then just references to the table.
G
So now it's just a very good generic container format for any domain-specific compression algorithm that would be fully transactional. If you're not doing a streaming algorithm for compression, you can almost always write a more efficient compression algorithm that uses this block structure, and then you just get back an array of bytes that you put together. You could literally write a standard for it that's just a general compression format, and you just decode the bytes. And then, for us—
E
Just to work through some more examples: one thing that I've been working on is something called CBOR-LD — CBOR linked data, pivoting off of JSON-LD — where you can create ints as keys based on an external library. There's a great discussion about this at the IETF CBOR working group from just two or three weeks ago. The challenge is that you need a cryptographic hash to pin down that external library, and right now the ordering of the keys, the attributes, is—
G
Right — because you can take the same data structure and apply a different table, and you'll effectively end up with the same semantic data out, but it'll be a different hash for the block. The nice thing about this format is that the only information ever used for the compression is in the block itself, so it is fully deterministic to the block data. There's no way to take the same data and end up with different block hashes, ever.
G
But that is a limitation on the compression, because you may not want all of your compression table — or the low end of your compression table — to be really small strings. There are efficiency reasons why you might not want that, and you don't really have the ability to tweak it, because it's deterministically ordered. So it's not always the best compression table, but it is the best deterministic compression table.
F
There are a couple of interesting properties that I've been trying to think through, and they need more thought. One is the ability to validate against a schema pretty efficiently: if you just jump to the structure section, I think there's enough information there to validate against any of our IPLD schemas.
F
You should be able to skip to the structure section and just parse that — there are no extraneous bytes that you need to skip over, it's all structure — and that should either match against the schema or not, so you should be able to do fairly fast schema validation if you built that in at the lowest level. That's the catch: that kind of validation would need to be built in at the lowest level, where you can get at the structure.
G
Well, there's a real opportunity to do that, though, because one of the things you want is to be able to validate that a block is valid without parsing everything out of it. With this you can actually run over the structure and do a validation just to make sure that it is a valid ZDAG block — you know that it's a valid block, it didn't do anything bad, and it actually parses correctly.
G
Unless you specifically need to do a traversal through a string map key — then you would have to parse that — most validation you can actually do without parsing. And there's some other crazy stuff you can do too: if you want to check whether a path is in there, and the map doesn't have any keys in it that match the length of the first path segment,
G
you don't even have to parse the string out — you already know that it's not in that block. So there's a lot of stuff that you get out of this. Also, if you're doing a block traversal and you're just traversing to get a value out, you can do that without most of the parsing: you can just parse along your path, and you only end up converting the strings you actually touch out of the value table.
F
The other property that's interesting for our schemas is that the format makes, I think, all of our union types fairly cheap compared to what we have now, particularly with JSON but also CBOR.
F
You do get a saving for using a kinded union, but the rest of them are not that much more expensive than a kinded union to do in this format, because you brush away a lot of the extraneous data: with a keyed union, the key sort of gets brushed off somewhere else, and it doesn't even matter how big the key is.
F
With unions you tend to have repeating keys — in the HAMT you've got up to 32 elements with exactly the same key in a keyed union — and here you only have to store that key once, whereas with DAG-CBOR and DAG-JSON you have to repeat it. If you can just extract that out, then you save, and that's the nature of unions: they tend to be repeated structures that you use again and again in your data.
G
One of the things that I definitely hit upon here is that, because we don't type the map keys, we get to save that byte, and so for us a map key as a string is the same as if it were bytes or an integer. It's not taking up any extra space, because we don't actually carry any typing information for it.
G
So that's actually quite nice — we're not hitting any of that stuff. One thing I will mention, though, is that this does give you an incentive to use maps or lists that are always the same type. That said, when you're doing the map or tuple representations for structs and you have mixed types, we still definitely shave down the size and make them pretty comparable.
G
Yeah — I don't do that, for a couple of reasons, but one is just that this is smaller, because the length is often going to be larger than zero. If a zero — a one-byte field — is going to be my delimiter, then putting a length in would be bigger, so there's just no point.
E
Yeah, but I think your IPLD spec actually states that there are no infinite-size or undetermined list sizes — you have to basically state "this is an int array of size 10". That's right, but—
G
No, no — hold on. The way the data model works is that the data model only comes into play after the codec is done parsing. So the parser can always tell you the length of the array, because it will be done parsing it by the end; that doesn't need to be in the block format. We know that it's not an infinite length, because the block has an end.
F
Yeah, this goes back to that thing with compression that I was talking about before, Johnny. Our current CBOR decoders — I don't know about the Go one, maybe Eric's is more restrictive — will accept indefinite-size lengths and just give you the thing at the end, whereas when we encode, our encoders put the size up front. So the encoders always do the same thing, but the decoders are sloppy, and they will present you with data.
F
But if we do a round trip, we'll get a different CID, and that's the problem we keep battling with CBOR: there are so many ways to do even simple things that it gets crazy. One nice thing about defining a new format is that we get to say no, there's only one way of doing that thing; there's no variability here; you can't do ints of different sizes — it's all one way. So that's an opportunity here.
B
And just one more point on indefinite-size lists: if you combine them with compression, you actually don't know how big your block is.
G
Yeah. The nice thing about only having one way to do things is that you can just brutally shave bytes off of the format, because the lack of variability means fewer tokens and less differentiation. And the crazy thing I'm getting into is that it also allows for fewer token conflicts with valid data.
G
One example: in my opinion, being able to encode a 0.0 float is a bug in the determinism, because that is effectively the same value as an integer, and in a lot of dynamic languages it won't round-trip properly. You're talking about the same value — you've just typed it in the block format rather than typing it in the data model layer above the block format and converting it
G
if you need a float or an int or whatever. It's similar to varints: a varint is encoded in whatever the smallest space is, and if you need it to be 64-bit, you convert it to a 64-bit integer — that's what we do in the encoding there. The same thing applies here: you're not allowed to encode floats that are 0.0, and in fact they will not parse or validate, because we need to use that zero byte for other things.
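One way to read that rule (a sketch of the general "smallest canonical form" principle, not the actual spec of this format) is that a numerically integral float has no float encoding at all — the writer must emit an integer, and a reader treats a 0.0-style float as invalid:

```rust
/// Illustrative canonicalization: integral floats are typed as integers at the
/// data-model layer, so only "real" floats ever reach the float encoding.
enum Num {
    Int(i64),
    Float(f64),
}

fn canonicalize(n: f64) -> Result<Num, String> {
    if !n.is_finite() {
        return Err("NaN/Inf have no canonical encoding".to_string());
    }
    if n.fract() == 0.0 && n.abs() <= i64::MAX as f64 {
        Ok(Num::Int(n as i64)) // 0.0, 7.0, ... become integers
    } else {
        Ok(Num::Float(n))
    }
}
```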
A
Yeah — I just want to mention what I also mentioned to Michael earlier today: there's a format that already exists which is almost the same thing, and it's called Fleece. I'll post a link. It's been on my to-do list to implement it in Rust for, I don't know, the past six years or something, but now that Michael came up with this, I think I have to do it, because I'm pretty sure it has almost the same properties. Of course it can't store CIDs, because it wasn't built for that.
A
It's basically a format made for general JSON data, but more efficient, and it had a slightly different goal: you can easily traverse the data without the need to deserialize everything. So you can hop through the data to deeply nested things quickly, without deserializing all the stuff.
G
It's actually really simple — it's really not very costly to get from 0 to 100 in the index, so you don't really need the skip-list functionality. But yeah, I can see how that would be really useful if you were ever inlining anything: it's only that efficient here because we never have a large value in the structure that doesn't get parsed on its own.
G
Another thing that I think is really important: we have a lot of numbers all over the place in our data structures, so I really wanted the number format to be as compact as possible. Everywhere that you parse a value, varints basically get inlined, and if you have a varint that conflicts with the token space, then you get a penalty byte, but most integers are just completely inlined without any additional typing information needed in front of them.
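For reference, the varint encoding being leaned on here is the usual LEB128-style one (a generic sketch, not this format's exact token rules): small numbers cost a single byte, which is what makes inlining them cheap.

```rust
/// Unsigned LEB128-style varint: 7 value bits per byte, high bit set on all
/// bytes except the last. Values under 128 fit in a single byte.
fn encode_varint(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}
```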
B
A real quick question for Michael before I have to run: what's the next step? How ready is this to, you know, implement in all the things, to put into the Filecoin chain?
G
No, no — it needs tests, and it needs to feel somewhat done before you start saying that people should use it for real things. We don't want to put out a block format that isn't ready, and also I don't want to stabilize on this until we're done actually working on the format.
G
I have at least a few weeks of just poking at it and tweaking things, seeing how many bytes I can shave off of different pieces, and experimenting with different data structures and stuff like that. I have a few things that I want to run through it to see how efficient a lot of the table stuff is before I move forward. So there's a bunch more of just that.
G
I mainly wanted to get this on everybody's radar, because it changes the way we think about how to create some of our data structures if we potentially have this compression in the future, and it's just a good thing to have on people's radar. But I don't think we're going to be ready to actually move stuff to it for a while. Maybe in a month or so we might want to look at doing another implementation; by then
G
I should have actually stabilized the format. I mean, I've only been working on it for three days, so I've broken the format every day, changing things and moving them around, and there's still stuff like the token space not being in the exact position that I want it to be in.