From YouTube: Invertible Bloom Filters - @matheus23 - Unconf
A: All right, I want to talk about invertible bloom filters, which I thought were just a super interesting kind of data structure. When I was looking into how to do sync better, I came across a paper I liked, and I wanted to present it because I think it's very interesting. It's not to be confused, by the way, with inverted bloom filters. Hannah asked about this: there's a bloom filter variant that doesn't have false positives but rather false negatives, or the other way around, and inverted bloom filters are exactly that. But that's not what I'm talking about. I'm talking about invertible bloom filters.
A: All right, so the problem I was trying to solve: we have two peers, each with a dag, and one of them adds some stuff. I want to sync that without having to send over the whole dag again, so I get some deduplication there. Ideally I just send over those two new nodes, and then both of them have the same data and everything's fine.
A: Sometimes we have these super long chains of stuff, and bitswap just takes a lot of time to do all of those round trips, so you end up with very high latency: the peers keep talking to each other until they've finally synced all of those trees. But the thing is, the dag structure actually doesn't matter that much for us.
A: And so naturally I went looking for papers and eventually found this one: "What's the Difference? Efficient Set Reconciliation without Prior Context", which in disguise is just invertible bloom filters, and so that's what I'm talking about. I like the title of the paper: "without prior context" means that other kinds of protocols need some kind of context that both peers have to have in advance.
A: But this paper, just like bitswap, assumes there's no prior context. Peers are just asking: hey, I have this stuff, what is the difference? And it's a two-round-trip kind of protocol to get all of the differences between those peers, and the amount of stuff you have to send over is big O of the difference of the sets, not big O of the sets themselves.
A: So it's very few round trips and potentially very little stuff to transfer, which I think is pretty cool. So how would you do set reconciliation? Here I have two peers, or two sets of hashes, you could say, and there's one set that has two blue hashes, which indicate some additional stuff, some additional hashes. I can take the set difference and I'll get back those hashes.
A: The problem is those two sets of hashes don't live on the same machine, so to do this kind of difference operation I would have to synchronize the sets, sending potentially lots of hashes over the wire, and I don't want to do that. What I can do instead is use an encoding function to encode these sets as invertible bloom filters, and these bloom filters have, hand-wavily, constant size: their size depends on some other parameter that is not the size of the sets.
A: And what I can do with those invertible bloom filters is just subtract them. That's the magic of these kinds of bloom filters: you get a new bloom filter, and that bloom filter is actually the encoding of the difference set, and there exists a decoding function that takes the bloom filter and gives you the difference set.
A: Of course there are some caveats. Oh yeah, sorry: I can either go the top route, which means potentially sending over a bunch of stuff, or I can go the bottom route. The caveat is that this decoding function, it being a bloom filter, has some success rate and some failure rate. It sometimes fails and you just can't reconstruct the whole set, but that is tunable by the parameter, which is the size of the bloom filter.
A: But the interesting thing is that the failure rate depends on the size of the set you want to reconstruct, not on the size of the sets you used at the very beginning, at the top. And so the bloom filter size depends on the size of the difference of your sets, which is the whole magical thing, and you can tune the size to get different success rates.
A: I just want to preface it with that, but it is nonetheless very interesting and very useful in some cases, I think. So, just like in Brooke's slides, we have a bloom filter, or an invertible bloom filter, which is somewhat like a counting bloom filter. You have a bunch of cells, and at the very bottom, in the last row, you see the count of elements that were hashed into the cell. The middle row is some kind of small byte thing; in my implementation I just used a 64-bit number.
A: I find the cells in the bloom filter that correspond to the hash, and I increase the count. The top cells are just 32-byte cells, so they hold the whole hash; that's where you eventually get the element from in the decoding function. And the middle row is just some kind of check: you just apply a different hash function to your hash. I know I'm going to say "hash" a lot in this talk.
A: I think you saw that kind of thing in there. And when you have something else that you're hashing in there, you just increase the counts again, and when you hit a conflict, what you do is you xor your hash, and you xor your checksum hash. All right, so: the top row is what the paper calls the id, the middle row is what the paper calls the hash, and the bottom row is called the count.
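The cell layout just described can be sketched in Python. This is a toy, not the speaker's Rust implementation or the paper's reference code: the names `IBF`, `hd`, `cell_indices` and the parameters `K` and `M` are my own assumptions.

```python
import hashlib

K = 3   # how many cells each element is hashed into (a small constant, my choice)
M = 32  # total number of cells; the tunable size parameter from the talk

def hd(data: bytes, salt: int) -> int:
    """Salted SHA-256 digest as an integer (stand-in for the talk's hash functions)."""
    return int.from_bytes(hashlib.sha256(bytes([salt]) + data).digest(), "big")

def cell_indices(eid: int) -> list[int]:
    """K distinct cell positions derived from the element's id."""
    idxs, salt = [], 0
    while len(idxs) < K:
        i = hd(eid.to_bytes(32, "big"), salt) % M
        if i not in idxs:
            idxs.append(i)
        salt += 1
    return idxs

class IBF:
    def __init__(self) -> None:
        # One entry per cell, mirroring the three rows from the talk:
        self.ids = [0] * M     # top row: xor of the 32-byte element hashes ("id" in the paper)
        self.checks = [0] * M  # middle row: xor of a checksum hash ("hash" in the paper)
        self.counts = [0] * M  # bottom row: how many elements were hashed into the cell

    def insert(self, elem: bytes) -> None:
        eid = hd(elem, 255)                     # the id is itself a hash of the element
        chk = hd(eid.to_bytes(32, "big"), 254)  # and the checksum is a hash of that hash
        for i in cell_indices(eid):
            self.ids[i] ^= eid    # on a conflict this just xors the hashes together
            self.checks[i] ^= chk
            self.counts[i] += 1
```

Note how a conflict needs no special handling: xor-ing into an occupied cell and bumping the count is the whole insert path.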
A: I find that a little bit confusing, so when you're reading the paper, keep in mind that the id is actually a hash, and the hash is a hash of a hash. All right. And then there is a decode function, so you can take an invertible bloom filter that you encoded and decode it. The way you do this is you look for cells that have a count of one or a minus one.
A: We'll maybe touch on the minus one later. You just look at the very top cell, and you know that only one item was added to it, so it was xored with a zero string, and so you know that it's just the hash, and you can read it out if it matches the checksum. And when you have this element, you now know what other cells it was encoded into, and there may be another cell that does not have a count of one yet, but you can now xor, or subtract, your hash out of that cell, and you get back a bloom filter that may uncover more ones. You iteratively do this until you have an invertible bloom filter that's empty and a set of hashes you recovered from it. Of course, this decode function breaks down very quickly if there are a lot of items in the filter, and the magic of the filter is that its size does not depend on the amount of items you have in there.
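The peeling loop just described might look like the following. Again a toy sketch of my own (the speaker's implementation is in Rust), with assumed names `IBF`, `hd`, `cell_indices` and toy parameters `K`/`M`: find a "pure" cell (count ±1, checksum matches), read the hash out, xor it back out of its other cells, and repeat until the filter is empty or no pure cell remains.

```python
import hashlib

K, M = 3, 32  # cells per element, total cells (toy parameters)

def hd(data: bytes, salt: int) -> int:
    """Salted SHA-256 digest as an integer."""
    return int.from_bytes(hashlib.sha256(bytes([salt]) + data).digest(), "big")

def cell_indices(eid: int) -> list[int]:
    """K distinct cell positions derived from the element's id."""
    idxs, salt = [], 0
    while len(idxs) < K:
        i = hd(eid.to_bytes(32, "big"), salt) % M
        if i not in idxs:
            idxs.append(i)
        salt += 1
    return idxs

class IBF:
    def __init__(self) -> None:
        self.ids, self.checks, self.counts = [0] * M, [0] * M, [0] * M

    def insert(self, elem: bytes) -> None:
        eid = hd(elem, 255)
        chk = hd(eid.to_bytes(32, "big"), 254)
        for i in cell_indices(eid):
            self.ids[i] ^= eid
            self.checks[i] ^= chk
            self.counts[i] += 1

    def decode(self) -> set[int]:
        """Recover the element hashes by peeling pure cells one at a time.

        Raises ValueError when no pure cell is left but the filter is not
        empty -- the tunable failure case mentioned in the talk."""
        recovered: set[int] = set()
        while True:
            pure = next(
                (i for i in range(M)
                 if self.counts[i] in (1, -1)
                 and self.checks[i] == hd(self.ids[i].to_bytes(32, "big"), 254)),
                None,
            )
            if pure is None:
                break
            eid, sign = self.ids[pure], self.counts[pure]
            recovered.add(eid)
            chk = hd(eid.to_bytes(32, "big"), 254)
            for i in cell_indices(eid):  # remove the element everywhere it was encoded
                self.ids[i] ^= eid
                self.checks[i] ^= chk
                self.counts[i] -= sign
        if any(self.counts) or any(self.ids):
            raise ValueError("decode failed: no pure cells left")
        return recovered
```

Note that decode gives you back hashes (the ids), not the original elements: that's all the filter ever stored.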
A: So let's say this is one machine, and it has this kind of invertible bloom filter with thousands of items in there: you can't recover anything. But if you have another machine, and it has a very, very similar kind of set, lots of items in there as well, with just a tiny difference between them, then once you subtract them, and the subtracting is using xor and doing minus on the counts, you'll get back a bloom filter that holds maybe just a single element, and you can decode that. Right, that's basically it. There's some fun stuff you can do with it.
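The subtraction step (xor on the id and checksum rows, ordinary minus on the counts) can be sketched like this; as before, `IBF`, `hd`, `K` and `M` are my own toy names and parameters, not the paper's or the speaker's Rust code.

```python
import hashlib

K, M = 3, 32  # cells per element, total cells (toy parameters)

def hd(data: bytes, salt: int) -> int:
    """Salted SHA-256 digest as an integer."""
    return int.from_bytes(hashlib.sha256(bytes([salt]) + data).digest(), "big")

class IBF:
    def __init__(self) -> None:
        self.ids = [0] * M     # xor of element hashes per cell
        self.checks = [0] * M  # xor of checksum hashes per cell
        self.counts = [0] * M  # element count per cell

    def insert(self, elem: bytes) -> None:
        eid = hd(elem, 255)
        chk = hd(eid.to_bytes(32, "big"), 254)
        salt, seen = 0, []
        while len(seen) < K:  # K distinct cells per element
            i = hd(eid.to_bytes(32, "big"), salt) % M
            salt += 1
            if i in seen:
                continue
            seen.append(i)
            self.ids[i] ^= eid
            self.checks[i] ^= chk
            self.counts[i] += 1

    def subtract(self, other: "IBF") -> "IBF":
        # xor cancels the hashes of shared elements; counts can go negative,
        # which is how elements present only in `other` show up.
        out = IBF()
        out.ids = [a ^ b for a, b in zip(self.ids, other.ids)]
        out.checks = [a ^ b for a, b in zip(self.checks, other.checks)]
        out.counts = [a - b for a, b in zip(self.counts, other.counts)]
        return out
```

After subtracting, every cell touched only by shared elements zeroes out, so the result is as sparse as the set difference, no matter how big the original sets were.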
For example, in our use cases, maybe we can't quite use this, because you may have a phone and you may have a laptop, and your phone doesn't actually store the whole dag, or your whole file system or whatever it is. It may not do that, and so in those cases you don't want to actually sync those dags.
A: You kind of only want to sync some parts of it. What I'm trying to illustrate here is some fun application, or some fun thinking, around invertible bloom filters, using their algebraic properties for some interesting use case. So what I want is: I have a phone, I work on it, I have some dag here, but my phone doesn't have the whole dag. It just has a bunch of things that it fetched, and it doesn't care about the rest.
A: That represents the whole set that the phone has, and I can take that to compute the difference between them, using these kinds of algebraic properties. I think there's maybe some more stuff to uncover here, some more interesting use cases you could do with it, I don't know. It's a very wide application space, I think.
There are some caveats, though. The paper, and some slides that the author has, say invertible bloom filters are only really efficient if you have diffs that are at maximum around 10 to 15% of the total set size. So, for example, if you have a new node coming into the system that has zero hashes, and you have a node that has a whole dag that the new node wants, then of course encoding that set of hashes as an invertible bloom filter that you can decode is going to take a lot of space, and that's just not efficient. So they're efficient up until the difference is about 10 to 15% of the set size.
A: Then there's the problem that you need to choose the size of your bloom filter in advance, which sounds like a bummer and like it kind of breaks the whole protocol, but it doesn't, actually, I promise. There's a strata estimator described in the paper; I just don't have enough space in this talk to talk about it, but it estimates how big the difference in the set sizes is, so you can still do the whole thing in two round trips.
A: You just need to prepare a bunch of bloom filters with different sizes, which is, I know, kind of a thing with bloom filters in general.
All right, and that is also a problem: constructing bloom filters takes time. Yeah, and that's pretty much all I have. Oh yeah, right: there's a chance you can't decode. Yeah, that's pretty much all I have. There's a paper link, and I wrote an implementation in Rust as a toy, so I'll post those slides as a pdf.
D: It goes "uh-oh"? I don't understand.

A: So specifically, you have the count at the bottom, and you look for so-called pure cells, which have a count of one or minus one. It may just happen that all of the cells in there have a count of two or above, and then you don't know what to subtract anymore to get another pure cell, and so it kind of just throws its hands in the air and says: I can't decode, sorry.
A: I mean, it sounds super stupid: I think I just googled different kinds of keywords and found a Stack Overflow answer from the author. I think I googled something like syncing dags, and someone had asked this on Stack Overflow, and the author of this paper actually answered, and so that was kind of my intro to it, and then I looked for other papers, stuff that it references.
E: I will say that, generally, the Fission team has a large repository of papers in our discord and in our discourse forum, and so it might actually be warranted, specifically for ipfs, to literally set up a papers category in the ipfs discourse that we can tag and collectively find useful things in together. Ethereum Research is actually another great example that does this: the Ethereum Research discourse forum.
A: 32 bytes plus eight bytes plus eight bytes, so you have 48 bytes per bucket, right?
A: Yeah, it does pay off, because of the whole thing where, if your set difference is small compared to your set, then it pays off over constructing bloom filters that encode the whole set instead of the set difference. That's why I said there are caveats.
A: Yeah, absolutely. In terms of use cases especially, I can think of this being very useful for ipfs operators that have very, very big pin sets that change very little over time, and then they have, let's say, some cluster or something, and they want to talk within this cluster to figure out what they need to send each other. But, that said, maybe it's not too important to have very, very few round trips between cluster nodes.
G: Maybe a better example is, like, I want to download... I have a newspaper reviews, and I know I downloaded one, downloaded a printed state of communities. It's a different point, having nothing to do with mine, right?
G: And so that doesn't quite fit this model, right? Because we have no... there isn't...
E: There's the case of other versions of yourself, as the first case, or the other case where, if we go a bit larger: you know, James and Philipp have a shared folder, and they each have N devices which are on and offline, and you know that it's going to sync somewhere in there, because they're operating over the same set.
F: If you think about this as a bunch of individual actors who have varying states, the less useful this is. But I would argue that this is not the most common case. Like, when you were talking about the mobile devices and the cloud: maybe you don't want the whole database on the device, absolutely, but if you're thinking of the whole database as just a bunch of patch blocks, then any query is also just a bunch of patch blocks. Yes, and what you do want on...
A: Yeah, and maybe one more thing: I wanted to address something you said about the prior context. It's true, you need some kind of prior context: you need to know what set both peers are talking about, right? But what you don't need to know, maybe in contrast to other kinds of protocols: for example, if you're regularly communicating between your phone and your laptop, you may just remember what your laptop had the last time, and so you can...