From YouTube: The power of two choices - Petar Maymounkov
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
Cool. Well, first of all, I'm super happy that I can join, if only remotely, this year, and thank you for sticking around for what is probably the last talk of the day for you.
So first I want to start by telling you more about what the distribution of the current DHT looks like. For the purposes of this, we can simply just assume that peer IDs are random numbers chosen between 0 and 1. So these are random real numbers.
The engineering details, of whether they are hashes of keys and so on, are sort of irrelevant, because we're just focusing on the distribution of these numbers. So let me start by simply asking the following question: what happens if you pick a large number of random numbers in the interval between 0 and 1?
So, for instance, let's pretend that we're going to pick a thousand numbers in the interval between 0 and 1. This box I have on the screen is a depiction of the interval between 0 and 1, and if you pick a thousand numbers, you're going to see a picture that looks kind of like this. The vertical lines are the random numbers, and you can see that they are roughly uniformly distributed, visually, at the larger scale of the interval.
But at the same time, if you look closely at individual windows in this interval, there are sparser windows and there are denser windows. So we want to understand: why is it that when we pick uniform numbers, we end up with a realized distribution which is not really uniform, and how non-uniform can it be?
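This is easy to see for yourself in a few lines of code. Here is a minimal Go sketch, not from the talk: the window count of 50 and all names are my own illustrative choices. It samples a thousand uniform numbers and reports the sparsest and densest window.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n, windows = 1000, 50
	counts := make([]int, windows)
	for i := 0; i < n; i++ {
		x := rand.Float64() // a uniform random number in [0, 1)
		counts[int(x*windows)]++
	}
	min, max := counts[0], counts[0]
	for _, c := range counts {
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	// A perfectly even spread would put n/windows = 20 points in every
	// window; the realized min and max typically differ a lot.
	fmt.Printf("per-window target: %d, sparsest: %d, densest: %d\n",
		n/windows, min, max)
}
```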
You know, there's this intuitive notion that some parts of the interval are denser and other parts are sparser, and the way we quantify this, and this is probably familiar to most of you, is we take all these numbers between 0 and 1 and we insert them into a binary trie. Now, I have to assume that most of you roughly know what a binary trie is, so I'm just going to very briefly gloss over the definition. Essentially, the idea is this.
If a number is on the left side of the interval, then it goes in the left subtree; if it's on the right side, it goes in the right subtree. Then you keep doing this recursively, and you stop as soon as every number basically lives on a leaf of its own. So the trie doesn't grow past the point where every number is in its own leaf.
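To make the rule concrete, here is a minimal Go sketch of this insertion; the node type, the function names, and the use of floats in [0, 1) are my own illustrative choices, not code from the talk.

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	left, right *node
	val         float64 // set only on leaves
	leaf        bool
}

// insert adds x to the trie covering [lo, hi), splitting any occupied leaf
// it meets so that each number ends up alone on its own leaf. It assumes
// distinct values, which random floats give with probability one.
func insert(t *node, x, lo, hi float64) *node {
	if t == nil {
		return &node{val: x, leaf: true}
	}
	if t.leaf {
		// The leaf is occupied: turn it into an internal node, push the
		// old occupant down, then retry inserting x.
		old := t.val
		t = insert(&node{}, old, lo, hi)
		return insert(t, x, lo, hi)
	}
	if mid := (lo + hi) / 2; x < mid {
		t.left = insert(t.left, x, lo, mid)
	} else {
		t.right = insert(t.right, x, mid, hi)
	}
	return t
}

// depth reports how deep x's leaf sits in the trie.
func depth(t *node, x, lo, hi float64) int {
	if t.leaf {
		return 0
	}
	if mid := (lo + hi) / 2; x < mid {
		return 1 + depth(t.left, x, lo, mid)
	}
	return 1 + depth(t.right, x, mid, hi)
}

func main() {
	var root *node
	xs := make([]float64, 1000)
	for i := range xs {
		xs[i] = rand.Float64()
		root = insert(root, xs[i], 0, 1)
	}
	fmt.Println("depth of the first number's leaf:", depth(root, xs[0], 0, 1))
}
```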
So in this case here, this trie is what you would get for the numbers that we have in the illustration, and you can see now that numbers which are in sparse regions of the interval end up having much shallower leaves in the binary trie, whereas numbers which are in denser regions are deeper in the tree.
I won't go into the details of why, because you can figure it out on your own just by studying it a little bit, but the key point here is that the depth at which a number ends up in the trie represents the density of the interval around it. The depth of a number in the tree is a quantification of how dense the region around this node is.
So here you have the example with the red numbers. If we want to summarize the non-uniformity of the entire outcome, so the whole interval with all the numbers, the standard way to do this is to just look at the distribution of these depths. You're going to see a picture like this, which basically looks roughly like a bell curve. And here comes the first important fact, somewhat non-obvious, that I should tell you.
The fact is that if you do this experiment any number of times that you want, the experiment being picking a thousand numbers and putting them in a binary trie, then even though every single time you're going to see a different trie, because the numbers will be different every single time, the distribution of the depths of the leaves is always going to be the same. So this is a sort of magical fact that one can prove analytically, and it's quite important.
It means that if you generate, you know, about 20,000 numbers, which is the size of the DHT network (you don't need to know the size exactly, you can just roughly generate this many random numbers), the distribution that you see in your experiment is going to exactly match the distribution of the real thing.
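Here is one way to run that experiment, as a Go sketch under my own assumptions: it draws random 64-bit IDs instead of reals (equivalent for this purpose), and it avoids building an explicit trie by using the fact that a leaf's depth is one more than the longest bit-prefix its ID shares with any other ID, and that the closest such match is always adjacent once the IDs are sorted.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
	"sort"
)

func main() {
	const n = 20000
	ids := make([]uint64, n)
	for i := range ids {
		ids[i] = rand.Uint64()
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	hist := map[int]int{}
	for i, id := range ids {
		cpl := 0 // longest common bit-prefix with any other ID
		if i > 0 {
			if c := bits.LeadingZeros64(id ^ ids[i-1]); c > cpl {
				cpl = c
			}
		}
		if i < n-1 {
			if c := bits.LeadingZeros64(id ^ ids[i+1]); c > cpl {
				cpl = c
			}
		}
		hist[cpl+1]++ // leaf depth = longest shared prefix + 1
	}
	// Rerun this as many times as you like: the IDs differ every run,
	// but the printed histogram stays essentially identical.
	for d := 1; d <= 64; d++ {
		if hist[d] > 0 {
			fmt.Printf("depth %2d: %5d leaves\n", d, hist[d])
		}
	}
}
```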
So the analytical result that we know, and this is like a textbook result in math, about what this distribution looks like, is even more precise than just saying that the distribution is going to be the same every time. In particular, the result says that the mean of this distribution is going to be roughly log n, and by roughly I mean there is a constant factor in front of log n, which I'm not saying what it is.
Usually theorems don't give you the exact number, but you can always find it yourself, just experimentally. So the mean is roughly log n, and it will be something like two or three times log n. So it's much deeper than if the tree were balanced; if the tree were balanced, the mean would be exactly log n.
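A sketch of that measurement, under the same assumptions as the previous snippet (random 64-bit IDs, depths from shared prefixes, names mine): print the mean leaf depth next to log2(n) for a few sizes and read the constant off the ratio yourself.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
	"sort"
)

// meanLeafDepth draws n random IDs and returns their average leaf depth in
// the binary trie: 1 + the longest common bit-prefix with a sorted neighbor.
func meanLeafDepth(n int) float64 {
	ids := make([]uint64, n)
	for i := range ids {
		ids[i] = rand.Uint64()
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	total := 0
	for i, id := range ids {
		cpl := 0
		if i > 0 {
			if c := bits.LeadingZeros64(id ^ ids[i-1]); c > cpl {
				cpl = c
			}
		}
		if i < n-1 {
			if c := bits.LeadingZeros64(id ^ ids[i+1]); c > cpl {
				cpl = c
			}
		}
		total += cpl + 1
	}
	return float64(total) / float64(n)
}

func main() {
	for _, n := range []int{1000, 10000, 100000} {
		fmt.Printf("n=%6d  mean leaf depth=%5.2f  log2(n)=%5.2f\n",
			n, meanLeafDepth(n), math.Log2(float64(n)))
	}
}
```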
So, to interpret this statement: it says that the variance is large, that there are leaves whose depth is either larger or smaller than the mean by as much as the mean itself. This is just another way of saying that the tree is really, really imbalanced. And this is in fact very easy to check: if you open up the DHT code and look at the routing table of an individual node, you're going to find out that oftentimes nodes have a routing table of depth 15 or 16.
So the lesson here is that the DHT distribution, predictably, is extremely volatile. And the good news is that you don't have to know these theorems or find them in the literature, because you can always just simulate these experiments in a computer program and figure out exactly what the distribution would be. To facilitate this, I've created two libraries that we use quite a bit in the DHT part of IPFS.
So this sort of concludes part one, which is really just telling you that, unsurprisingly, the distribution of peer IDs in the DHT is quite imbalanced.
Now, part two. I want to tell you about a different way of picking your peer ID, which would result in a significantly better distribution. This way is called the power of two choices. It's an algorithm for picking a peer ID, and the algorithm simply says: pick two numbers.
If you're choosing your peer ID, pick two random numbers, put both of them in the binary trie, and then stick with the one that ends up in a shallower location, in other words, in a less dense region of the distribution, and throw away the one that ended up in a deeper location. In this picture, the orange one ends up much deeper.
You can see this both visually, in the interval 0 to 1, and also in the trie: the trie that corresponds to this distribution would be a nearly perfectly balanced tree, with some defects here and there, so fairly infrequent.
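A minimal Go sketch of this rule, under my own assumptions: peers join one at a time, a candidate's depth is probed against the current ID set (kept as a sorted slice), and all names are illustrative rather than the talk's code.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
	"sort"
)

// depthIn reports how deep a candidate id would land among the existing
// ids (sorted ascending): one more than the longest common bit-prefix
// with its nearest neighbors on either side.
func depthIn(ids []uint64, id uint64) int {
	i := sort.Search(len(ids), func(j int) bool { return ids[j] >= id })
	cpl := 0
	if i > 0 {
		if c := bits.LeadingZeros64(id ^ ids[i-1]); c > cpl {
			cpl = c
		}
	}
	if i < len(ids) {
		if c := bits.LeadingZeros64(id ^ ids[i]); c > cpl {
			cpl = c
		}
	}
	return cpl + 1
}

func main() {
	// Grow a network of 20,000 IDs, each new peer joining via two choices.
	var ids []uint64
	for len(ids) < 20000 {
		a, b := rand.Uint64(), rand.Uint64()
		pick := a
		if len(ids) > 0 && depthIn(ids, b) < depthIn(ids, a) {
			pick = b // b would land in a sparser, shallower spot
		}
		i := sort.Search(len(ids), func(j int) bool { return ids[j] >= pick })
		ids = append(ids, 0)
		copy(ids[i+1:], ids[i:])
		ids[i] = pick // keep the slice sorted
	}
	// Measure the final leaf depths against sorted neighbors; they should
	// concentrate tightly around log2(20000) ~ 14.
	minD, maxD := 64, 0
	for i, id := range ids {
		cpl := 0
		if i > 0 {
			if c := bits.LeadingZeros64(id ^ ids[i-1]); c > cpl {
				cpl = c
			}
		}
		if i < len(ids)-1 {
			if c := bits.LeadingZeros64(id ^ ids[i+1]); c > cpl {
				cpl = c
			}
		}
		d := cpl + 1
		if d < minD {
			minD = d
		}
		if d > maxD {
			maxD = d
		}
	}
	fmt.Printf("leaf depths span [%d, %d]\n", minD, maxD)
}
```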
And if you want to represent this statement in a distributional form, you can see the picture on the right. The distribution would be something that's very skinny and focused around the mean, and it would have very small tails coming off to the sides. Now, the actual formal statement of the theorem is as follows.
If you use the power of two choices algorithm, the mean of the binary trie would actually be pretty much exactly log n, but the more interesting part is that the deviation from the mean is log of log of n. So previously it was log of n; now it's log of log of n, so this is exponentially smaller. Log of log of n is usually such a small number that you can think of it as just being equal to one or two, regardless of how many numbers, you know, peer IDs, your system has.
So in practice, what this means is that you expect to see a completely balanced tree, which very infrequently might have nodes that are either one deeper than the average depth, or maybe one shallower than the average depth. And so this is called, like I said, the power of two choices. Now the question is: why might you care to produce peer IDs that are uniformly distributed versus ones that are not so uniformly distributed?
In general, there are multiple reasons; it depends essentially on what you're trying to accomplish, but I'll give you two examples just to give you an intuition of how you might use this more balanced tree. One question that comes up a lot in the DHT world is: how can we roughly estimate the size of the DHT network? A natural way to estimate the size of the network is to pick a random peer from the network and see how deep they are in the binary trie.
So you can see that if the binary trie is balanced, and you just pick a random peer from the network, it is very likely that the depth of that peer is going to be equal to the mean, which is exactly log n, and this means that you can infer the size of the network from this number.
It would just be 2 to the power of the depth of the node that you sampled, and your estimate of the size of the network would be correct most of the time, say 90% of the time. If you want to be even more certain that your estimate is correct, you can just pick a few random peers and go with the majority vote. So you can do this with a balanced distribution.
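A sketch of that estimator in Go, with hypothetical hard-coded depths standing in for however you would actually probe live peers; the median serves as the majority vote here, and all names are mine.

```go
package main

import (
	"fmt"
	"sort"
)

// estimateSize turns a handful of sampled leaf depths into a size
// estimate: take the majority (here, median) depth d and return 2^d.
func estimateSize(sampledDepths []int) int {
	sort.Ints(sampledDepths)
	median := sampledDepths[len(sampledDepths)/2]
	return 1 << median
}

func main() {
	// Hypothetical depths probed from a balanced network of ~16k peers:
	// with two choices, nearly every sample sits at log2(n) = 14.
	fmt.Println("estimated size:", estimateSize([]int{14, 14, 15, 14, 13}))
}
```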
You cannot do this with the current distribution in the DHT, because here, if you pick a random peer from the network, the chances that you end up with a highly unrepresentative depth are nearly a hundred percent, basically. So this is one example of why you might want to have a balanced distribution, you see.