From YouTube: Decentralized NAT Hole-Punching - Dennis Trautwein
Description
This talk was given at IPFS Camp 2022 in Lisbon, Portugal.
All right, everyone. Thanks so much for having me; I'm glad to present decentralized NAT hole punching to you. NAT hole punching has been a common theme over the last few presentations already, and there is a specific protocol that the libp2p folks implemented, called DCUtR (Direct Connection Upgrade through Relay), which I will get to in a second. I was working on a project measuring the success rate of this hole punching mechanism. In this talk I will briefly discuss how hole punching works, specifically the DCUtR protocol; then how we intend to measure it, so our measurement setup; then the fun part, all the different graphs and results from that; and also some next steps going forward. All right: hole punching.
So what's the motivation? Oh, sorry, no. First, a shout-out to Max, who gave great presentations on this DCUtR protocol at a couple of events earlier this year. Everything I'm about to present is a slightly simplified version of that; Max dives into the protocol way deeper than I will, so check out those talks. I will send out the links to the slides later as well. So, the motivation.
What do we want to do? We want full connectivity among all nodes of a libp2p network, despite NATs and firewalls. As I mentioned, this was a common theme over the past few presentations already. The requirements: no centralized infrastructure, it should work over different transports, and it should integrate nicely into our libp2p stack.
But maybe not all of you are aware of NATs and firewalls, so we'll briefly touch on that. NAT stands for network address translation; it basically maps your local IP address to a publicly reachable IP address. Firewalls, in turn, control the incoming and outgoing network traffic between your home network and the internet, and they do this by keeping track of packets that leave the network and packets that try to come into the network, using a so-called state table, which you can see at the bottom. This is a little more involved than depicted here, but the most simple rule would be: only let packets in from sources that I have previously sent a packet to. So if I try to connect to a server, the router keeps track of my source IP address on my internal network, my source port, the destination IP address and destination port, and also which transport is used.
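As a rough illustration (this is a simplified model, not actual router code), that stateful admit rule over the 5-tuple can be sketched like this:

```python
# Sketch of a stateful firewall's state table (illustrative only).
# An outbound packet records a 5-tuple; an inbound packet is admitted
# only if it matches a previously recorded outbound entry in reverse.

state_table = set()

def record_outbound(src_ip, src_port, dst_ip, dst_port, proto):
    state_table.add((src_ip, src_port, dst_ip, dst_port, proto))

def admit_inbound(src_ip, src_port, dst_ip, dst_port, proto):
    # Inbound source/destination are swapped relative to the outbound entry.
    return (dst_ip, dst_port, src_ip, src_port, proto) in state_table

# A host (10.0.0.2) connects out to a server (93.184.216.34:443).
record_outbound("10.0.0.2", 51000, "93.184.216.34", 443, "tcp")

# A reply from that server is let in; an unsolicited packet is dropped.
assert admit_inbound("93.184.216.34", 443, "10.0.0.2", 51000, "tcp")
assert not admit_inbound("203.0.113.9", 80, "10.0.0.2", 51000, "tcp")
```

Real routers track more state (sequence numbers, timeouts, NAT rewrites), but the membership check is the part that matters for hole punching.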
So what's the problem in the case of a peer-to-peer network? Here we have two peers, peer A and peer B, on the left and right hand sides, both of which are on their home networks, and in the middle we have the internet. Let's assume peer A wants to connect to peer B, say over TCP, so peer A issues a TCP SYN packet. The SYN reaches peer B's router, but since peer B has never sent a packet out to peer A, there is no matching state table entry, and the router drops it.
So how does hole punching work? Basically, we want a mechanism to connect both peers. Let's imagine that both peers, through a mechanism that I will get to in a second, could synchronize with each other and simultaneously start a connection attempt. The packets would get routed through the internet, cross paths somewhere, and reach the other peer's router. Since both peers have dialed the publicly reachable addresses of each other's routers, the state table entries would match: each router looks into its state table, says "yes, that's fine", and the packets reach both peers. That, in a nutshell, is how hole punching works.
So we have a publicly reachable relay in the middle. When peer B boots up, it detects that it is behind a NAT, through a mechanism called AutoNAT, I believe, and finds a limited relay somewhere on the internet through another mechanism. Since Kubo version 0.11, all nodes act as limited relay nodes per default. Limited relays only allow a certain amount of bandwidth to be shared, and only a limited set of protocols to be routed through them.
So peer B connects to this relay. There is no hole punching involved here, because the relay is publicly reachable. Peer B then receives a multiaddress (the font is probably way too small for you to read); it's just a relayed multiaddress, and through some other means this multiaddress is transported to peer A.
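A relayed multiaddress combines the relay's own address, the `p2p-circuit` marker, and the destination peer. A minimal sketch of picking one apart (the peer IDs and address here are made up):

```python
# Sketch: splitting a libp2p circuit-relay multiaddress into the relay's
# own address and the destination peer. The IDs below are made up.

def split_relayed_addr(maddr: str):
    relay_part, _, dst_part = maddr.partition("/p2p-circuit")
    return relay_part, dst_part

relayed = ("/ip4/203.0.113.7/tcp/4001/p2p/QmRelayPeerId"
           "/p2p-circuit/p2p/QmPeerBPeerId")
relay_addr, dst = split_relayed_addr(relayed)

assert relay_addr == "/ip4/203.0.113.7/tcp/4001/p2p/QmRelayPeerId"
assert dst == "/p2p/QmPeerBPeerId"
```

Everything before `/p2p-circuit` tells a dialer how to reach the relay; everything after names the peer hiding behind the NAT.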
Peer A then also connects to the same relay and asks it: hey, can you forward traffic on my behalf to peer B? At this point both peers are connected, but through a third party, the relay. As soon as peer B detects that peer A has connected to it through a relay, and before we do any of the hole punching magic, peer B simply tries to connect directly to peer A, because that would be way faster than doing the DCUtR protocol that I will present in a second. This is called connection reversal, and only if it does not work do we start the DCUtR protocol.
So what does that look like? Let's imagine peer A is not publicly reachable in this case. Peer B sends out a CONNECT message and starts a timer to measure the round-trip time. The CONNECT message contains all the public IP addresses of peer B: no relayed addresses or anything, just the public addresses of its router. The CONNECT message gets routed through the relay and reaches router A; the connections are already established, so it gets through. The CONNECT message ends up at peer A, and peer A responds with another CONNECT message containing its own public IP addresses, then starts waiting for another message that I will get to in a second. The response CONNECT message gets routed back and reaches peer B, and at this point peer B knows the round-trip time: it has measured the time from sending out its CONNECT message to receiving the response CONNECT message.
And now the magic begins. Peer B sends out a SYNC message, the second message type of this protocol, and again starts a timer. The SYNC message gets routed to peer A. Peer A was waiting for the SYNC message, and as soon as it receives it, it tries to connect directly to the public IP addresses of peer B. Peer B, in turn, waits half the round-trip time and then does the same. In theory, both start a simultaneous connection attempt to each other, and this is the hole punch: they each send out a packet, the state table entries match on both routers, and a direct connection comes up.
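The timing logic of that exchange can be sketched as follows (a simplified model of the synchronization, not the real go-libp2p code):

```python
# Simplified model of the DCUtR timing dance (not the real implementation).
# B measures the relay round-trip time via CONNECT, then sends SYNC and
# waits rtt/2, so that B's direct dial and A's dial (triggered when SYNC
# arrives, one half-RTT after B sent it) start at roughly the same moment.

def dial_times(b_sends_sync_at: float, rtt: float):
    a_receives_sync_at = b_sends_sync_at + rtt / 2  # one-way relay latency
    a_dials_at = a_receives_sync_at                  # A dials immediately
    b_dials_at = b_sends_sync_at + rtt / 2           # B waits half the RTT
    return a_dials_at, b_dials_at

a_t, b_t = dial_times(b_sends_sync_at=0.0, rtt=0.2)
assert a_t == b_t  # both dials coincide, so the packets cross mid-path
```

If the relay path's latency is asymmetric or jittery, the two dials drift apart, which is exactly why round-trip time matters for the success rate, as we'll see later.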
Well, so that's the DCUtR protocol. But how do we know that it works in the wild? So far this is just the theory behind it. So we came up with a measurement setup. Everything is open source; you can find it here, and again, I will share the slides somewhere, I believe on the Slack channel. The measurement setup consists of three components: a honeypot, a server component, and a fleet of clients.
The challenge here: we don't know where those peers behind NATs reside, well, because they are behind NATs. We cannot find them, we cannot connect to them, and they don't appear in the DHT, because only publicly reachable peers end up in the DHT's routing tables, and so on. So how do we detect them? We use the honeypot, which is basically just another DHT server node. The honeypot announces itself to the network by crawling it continuously, and it just tracks inbound connections.
So when a peer behind a NAT tries to retrieve something from the DHT, by chance it will end up at the honeypot, and if that peer advertises a relay address, we persist those multiaddresses and have thereby detected a peer behind a NAT. The detected peers then get served through the server component, which just exposes a simple gRPC API with two endpoints.
Basically, you can query a peer that is behind a NAT, and you can track the result after you have performed the hole punch. Then there are two different client implementations that talk to the server. These clients live on your MacBooks or your laptops; they request a hole punch target from the server, perform the DCUtR dance with the other peer, and report back the result: whether it worked out or not, plus some other things.
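The client's loop against those two endpoints could look roughly like this. The endpoint names, field names, and the stand-in server are made up for illustration; only the two-endpoint shape comes from the talk, and the real gRPC schema lives in the open-source repo:

```python
# Sketch of the measurement client's loop. The request_peer / track_result
# names and the in-memory server are hypothetical stand-ins; the real
# system uses gRPC with two endpoints of this general shape.

class FakeServer:
    """Stands in for the real gRPC server component."""
    def __init__(self):
        self.results = []

    def request_peer(self):
        # Hands out a detected NATed peer and its relayed multiaddress.
        return {"peer_id": "QmPeerBehindNat",
                "addrs": ["/ip4/203.0.113.7/tcp/4001/p2p/QmRelay"
                          "/p2p-circuit/p2p/QmPeerBehindNat"]}

    def track_result(self, peer_id, outcome):
        self.results.append((peer_id, outcome))

def hole_punch(peer):
    # Placeholder for the actual DCUtR dance; always "succeeds" here.
    return "SUCCESS"

def run_once(server):
    peer = server.request_peer()
    outcome = hole_punch(peer)
    server.track_result(peer["peer_id"], outcome)

srv = FakeServer()
run_once(srv)
assert srv.results == [("QmPeerBehindNat", "SUCCESS")]
```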
This is the architecture diagram of what I just explained. Again, we have the honeypot; it slowly crawls the DHT (it's a very stable peer), and by slowly crawling the DHT the honeypot announces itself to the network. We just hope that it gets inserted into the routing tables of other peers, so that when a peer behind a NAT requests something from the DHT, it gets routed to the honeypot, and we then have a chance to detect a DCUtR-capable peer behind a NAT. As you can see, the honeypot saves the relevant inbound connections to the database, those entries get served through the server component, and then, on the right hand side, we have a few clients deployed (well, not a lot, as you will see) that request those peers from the server and try the hole punch.
First of all: I mentioned that Kubo version 0.11 enabled the limited relay capabilities per default, which is a requirement for the DCUtR protocol to work. The DCUtR protocol itself, this hole punching, was then enabled per default with Kubo 0.13, which was released on June 9th; that's the red line here. So we can see that people are upgrading to the new version and thus benefiting from these new capabilities.
Also, if we look at the honeypot connections, with time on the x-axis and the incoming DCUtR-capable peers per day on the y-axis, you can see that people are updating and that the DCUtR protocol gets deployed more and more in the network over time. As I said, we have a fleet of clients.
Well, again, it's not a big fleet right now: we have seven or eight different clients deployed so far. Each of those requests peers to hole punch and then reports back the results, and this rather messy graph shows how many hole punches each client has performed during the last months.
Each hole punch can end up in different states. The server hands out the multiaddresses of those clients behind NATs, which are basically these relayed addresses. So if one of those is a very ephemeral peer that just goes online and then leaves the network again, it's very likely that when the client fetches this peer, I don't know, five or ten minutes later, we are not able to connect to it at all.
That is the "no connection" outcome. On the x-axis we have the different outcomes a hole punch can have, on the y-axis the share of all hole punches, and the "no connection" cases are on the left hand side. We can see that many of the peers we detect we are actually not able to connect to through the relayed address. Then there is the "no stream" outcome, which is quite interesting.
It means that we were able to connect to the other peer, and the other peer then needs to initiate the hole punch, the DCUtR protocol, but it just didn't: it didn't open the DCUtR stream. So the client was sitting there, we were connected through the relay, but then nothing happened; the stream was never opened.
That was the "no stream" outcome. "Connection reversed" means we were able to connect to the other peer through the relay, and the peer then tried to dial us directly, and this actually worked out; so that is also an outcome here. And then the actually interesting ones, where we really performed the hole punch, the DCUtR protocol with the message exchange: there we have the outcome "failed".
That means it didn't work out. Or "success": we were able to establish a direct connection. And finally, and this just applies to the Rust implementation, there is a "cancelled" state, which means the Rust client was served a QUIC connection; QUIC is not supported in Rust yet, so they cancelled the hole punch on the client side.
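Summarizing, the outcomes described above could be modeled as an enumeration; the names here are illustrative, and the measurement tool may spell them differently:

```python
# The hole-punch outcomes described in the talk, as an enum.
# Names and descriptions are illustrative, not the tool's exact strings.
from enum import Enum

class HolePunchOutcome(Enum):
    NO_CONNECTION = "could not connect via the relayed address"
    NO_STREAM = "connected, but the peer never opened a DCUtR stream"
    CONNECTION_REVERSED = "peer dialed us back directly; no punch needed"
    FAILED = "DCUtR exchange ran, but no direct connection came up"
    SUCCESS = "direct connection established"
    CANCELLED = "client aborted (e.g. Rust served an unsupported QUIC addr)"

assert len(HolePunchOutcome) == 6
```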
This one is also a little messy, so we will focus on this part, and this is probably the crucial graph here: the success rate of the DCUtR hole punch protocol.
We have the time on the x-axis again, and then the success rate: basically the number of successful hole punches divided by the total number of hole punches performed per day. Each line is a unique client, and as I said, they were deployed and online for different periods during the last month. I gave each client a different fruit name, just so we can tell them apart.
There are some interesting things going on here. You can see the red line that goes down at some point: this is a client that was running over a VPN, and I think it needs further investigation why its success rate is significantly lower than the others. The other lines are roughly at the same levels, but the client in the VPN network is, let's say, 20 percent less successful than all the other clients. So this may be something to look into.
Then there's this, do I have the cursor here? Well, I will just show it here: this little bit here. This is a brief period where a Rust client was deployed, and we need to look into why the success rate of the Rust client is so low compared to the other clients. All the other clients are Go clients, using the Go implementation. We need to look into that at this point.
Another outcome, I think, concerns protocol improvements. I didn't tell you yet: when this hole punch dance, this back and forth of CONNECT and SYNC messages, doesn't work out, we just try again, three times in total. So let's take a look at the successful connection establishments: how many attempts did we need to actually establish the connection?
We can see that in 97 or 98 percent of the cases it worked within the first attempt. This means that if it doesn't work on the first attempt, we could just stop the protocol, not try again, and be reasonably sure that the connection cannot be established at this point. So this could feed back into a protocol improvement.
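That observation suggests a simple retry policy. A sketch of the reasoning, where the 97 percent figure from the talk is the share of eventual successes that already succeeded on attempt one (the 5 percent threshold is an assumption of mine, not from the talk):

```python
# Of all *successful* hole punches, ~97-98% succeeded on the first attempt,
# so attempts 2 and 3 only contribute the remaining ~2-3% of successes.
# That is the argument for stopping after one failed attempt.

MAX_ATTEMPTS = 3                 # current behavior described in the talk
FIRST_ATTEMPT_SHARE = 0.97       # share of successes won on attempt 1

def retries_worth_it(first_attempt_share: float,
                     threshold: float = 0.05) -> bool:
    """Keep retrying only if later attempts contribute more than
    `threshold` of the successes (threshold chosen for illustration)."""
    return (1.0 - first_attempt_share) > threshold

assert retries_worth_it(FIRST_ATTEMPT_SHARE) is False  # stop after one try
assert retries_worth_it(0.50) is True                  # retries still pay off
```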
Next: which transport do we end up with after we did the DCUtR hole punch? We have the transports TCP and QUIC on the left hand side and TCP and QUIC on the right hand side, and on the y-axis the fraction of hole punches that were successful.
Comparing the TCP protocol versus the QUIC protocol, there are again a couple of interesting things going on. On the TCP side you can see that the VPN client, the client running over the VPN, basically cannot hole punch over the TCP protocol at all.
That is why the red bar is so low there, while it's almost 100 percent on the QUIC side. On the other hand, the Rust client has a 100 percent success rate on TCP, in case there is a successful punch, and a zero percent success rate on QUIC, which again makes sense: QUIC is not implemented in Rust, so it cannot work.
But again, the VPN part is interesting to me, so it would be great to find out what's going on there. Then, as I said, there are many relays deployed already, since people are taking up the new versions. Could we actually improve the success rate by choosing relays depending on our round-trip time to the relay? You can imagine that if I relay the hole punch synchronization through a relay that resides, let's say, on the other side of the globe, in Australia or New Zealand, network jitter and so on could interfere with the round-trip time measurement and thus limit the success rate.
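A sketch of that idea, preferring the lowest-RTT relay among candidates (the relay names and RTT values below are made up):

```python
# Idea from the talk: prefer the relay with the lowest measured RTT, since
# a distant relay adds jitter to DCUtR's rtt/2 synchronization.
# The candidate list is made up for illustration.

def pick_relay(candidates: dict) -> str:
    """candidates maps relay id -> measured RTT in milliseconds."""
    return min(candidates, key=candidates.get)

rtts = {"relay-eu": 35.0, "relay-us": 110.0, "relay-au": 290.0}
assert pick_relay(rtts) == "relay-eu"
```

Whether RTT-based relay selection actually moves the success rate is exactly the open question the talk raises; the graphs that follow look at that relationship.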
This is a CDF of the round-trip times for successful hole punches. On the x-axis we have the time in seconds, and on the y-axis the fraction of successful hole punches that had that specific round-trip time.
This next graph is everything added up, and we can see that in the failure case the round-trip time tends to be slightly higher. So with a slightly higher round-trip time, it is a little less likely that the hole punch protocol succeeds, but I think this is a very tiny difference. And if we look at all the different clients that were deployed, these are box plots, the takeaway would be:
They basically all look the same, so in my opinion there is no clear correlation between round-trip time and the success rate of the hole punch. Then, I think, the last graph; there are plenty more graphs, but let's take this one as the last one. This shows the protocol errors by agent. It's something I found when I took a look at the error cases for hole punches and grouped them by the different agent versions.
We can see that the "no stream" error, the outcome I talked about previously, where we were able to connect to another peer but the other peer just didn't open the DCUtR stream, almost always happens with the rust-ipfs implementation. I'm not sure which DCUtR implementation they use, but this is, I think, something we need to look into.
Okay, I have plenty more graphs, but let's leave it at that. The next step would be for every one of you to sign up and run a client, so scan the QR code. If you're on a Mac, you will download a system tray application, a menu item in the top right hand corner, which will just connect to our server, get a peer, hole punch that peer, and report the outcome back.
As for the system requirements: the bandwidth impact is roughly like opening one or two websites every minute or so. On my machine it uses around two percent CPU, but 100 MB of memory, so that's not so good. But yeah, I would really appreciate it if you signed up.
We want to have a hole punch month during December, where we try to get as many people as possible to run this client; binaries are also available for Linux. So please sign up. As I said, we want a hole punch month with as many clients as possible running, to increase the data foundation for all the measurements that we did and all the graphs that I showed you. And with that, thank you.