IPFS IPFS Camp 2022, 30 Oct 2022

From YouTube: Identifying CID Eclipse Attacks through network measurements - Michal Krol

Description

This talk was given at IPFS Camp 2022 in Lisbon, Portugal.

A

So IPFS, as you know, is the system where, if you have the CID, you can request the file. Your request goes somewhere into the network, you talk to one of the providers of the file, and hopefully you receive the file.

A

With regular URLs, if there is a video on YouTube and you have a URL like youtube.com/video, you know that YouTube hosts this file. With CIDs, you really don't. If you have only the CID, you don't know who hosts the file in the network. So, for the whole system to work,

A

You require a kind of CID resolution system, where you provide the CID and retrieve the list of providers who host this specific file, so that you can contact them and get the file. IPFS has two ways of doing it. The first one is using Bitswap. Bitswap is very simple.

A

When you launch an IPFS client, you connect to some peers, usually around 300. Then, if you want to get a CID, you just send a request to every single peer via Bitswap. If they have the file, they give it to you. If they don't, well, they don't. It's pretty fast and simple. However, it's expensive, and the main problem is that it's not guaranteed to find the specific file.

A

So even if the file is in the network, if none of your peers has the file, you will not find it. The second solution is the distributed hash table (DHT), which is slightly slower. However, it's still efficient: you're guaranteed to find the content if it's in the network, and you can find it within log n steps, where n is the size of the network. In this talk we're going to focus only on the DHT.

A

So, one more bit of introduction to the DHT, very simple. When you join the IPFS network, you generate a public/private key pair.

A

Then you hash this key, and the hash of the key will be your identifier in the IPFS network. It basically determines your position in the SHA-256 hash space of the network. If everyone does this, everyone generates a random key, so we have a kind of perfectly uniform, equally distributed hash space. The DHT allows us to go to a specific place in that hash space.
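
The ID derivation described here can be sketched in a few lines. This is a simplified illustration, not the exact libp2p encoding: random bytes stand in for a real public key, the helper names `peer_id_from_public_key` and `xor_distance` are made up for the example, and the XOR metric is the usual Kademlia notion of closeness, which the talk does not spell out.

```python
import hashlib
import os

HASH_BITS = 256  # size of the SHA-256 hash space


def peer_id_from_public_key(public_key: bytes) -> int:
    """Map a public key to a point in the hash space (simplified)."""
    return int.from_bytes(hashlib.sha256(public_key).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance between two points in the hash space."""
    return a ^ b


# Random bytes stand in for freshly generated public keys (assumption).
peers = [peer_id_from_public_key(os.urandom(32)) for _ in range(1000)]

# A content identifier is hashed into the same space as the peers.
target = peer_id_from_public_key(b"example content")
closest = min(peers, key=lambda p: xor_distance(p, target))
shared_bits = HASH_BITS - xor_distance(closest, target).bit_length()
print(f"closest peer shares {shared_bits} leading bits with the target")
```

Because the hash output is effectively uniform, the peers end up spread evenly over the space, which is the property the detection method later relies on.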

A

So I can say: hi, I want to go to this hash in the hash space. The DHT provides a routing mechanism such that, by contacting at most log n nodes, you get to that specific region, and you're guaranteed to find the peers in the network that are closest to that specific place in the hash space. IPFS uses this for content resolution. So now, if I'm a provider and I would like to tell the world that I host a specific file, I create the CID.

A

The CID is also placed somewhere in this SHA-256 hash space. I will use the DHT routing to go to that place in the hash space and discover the K closest nodes to the specific CID. In IPFS, K is 20; here I'm using four because it's easier. I will then contact all those K nodes and tell them: hey, I'm the provider for this file. Then, if there's another provider for the same file, it will do the same thing.

A

So, first of all, it will find the CID, go to this region, discover the K closest peers, and tell them: hey, I have the file.
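
The advertising step, together with the lookup a client would later run against the same K closest peers, can be sketched with a toy in-memory table. This is only an illustration under stated assumptions: `ToyDHT` and its methods are hypothetical names, the K closest peers are found by sorting a global list rather than by iterative routing, and closeness is assumed to be XOR distance.

```python
import hashlib
from collections import defaultdict

K = 4  # the slides use 4 for readability; the talk says IPFS uses 20


def key_of(data: bytes) -> int:
    """Hash arbitrary bytes to a point in the 256-bit key space."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class ToyDHT:
    def __init__(self, peer_ids):
        self.peer_ids = list(peer_ids)
        # For every peer: CID key -> set of providers it has been told about.
        self.records = {p: defaultdict(set) for p in self.peer_ids}

    def k_closest(self, key: int, k: int = K) -> list:
        """Global shortcut for what the real DHT finds by iterative routing."""
        return sorted(self.peer_ids, key=lambda p: p ^ key)[:k]

    def provide(self, cid_key: int, provider: str) -> None:
        """Tell the K closest peers that `provider` hosts the content."""
        for peer in self.k_closest(cid_key):
            self.records[peer][cid_key].add(provider)

    def find_providers(self, cid_key: int) -> set:
        """Ask the same K closest peers who hosts the content."""
        found = set()
        for peer in self.k_closest(cid_key):
            found |= self.records[peer].get(cid_key, set())
        return found


dht = ToyDHT(key_of(f"peer-{i}".encode()) for i in range(100))
cid = key_of(b"my file")
dht.provide(cid, "alice")
dht.provide(cid, "bob")
print(sorted(dht.find_providers(cid)))
```

Only the K peers nearest the CID ever hold the record, which is why controlling exactly those peers matters so much later in the talk.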

A

If everyone does this, those K closest nodes to the specific CID will host this mapping: the CID and the list of providers. And now, if someone wants to get the file, again, we have the CID, so we know where it is in the hash space. We discover the same K closest nodes and ask them: hey, who has this file? Hopefully we'll receive a list of the providers, and then, having this information,

A

We can just contact the provider directly and get the file. Okay, so IPFS is an open system, which is great, but it also comes with some challenges. If I'm an attacker, I can create some Sybil identities, and those identities, again, will be placed based on their public/private key pair. But I can just keep producing public keys until they suit my needs. It's actually pretty easy to just brute-force private keys so that their hashes end up close to the target hash.

A

If the attack succeeds, the attacker can place the Sybil identities close to the target hash, and if that works, the K closest nodes are now controlled by the attacker. This is problematic, because every time a provider wants to say, hey, I have this file, it will talk only to those malicious nodes, and they can simply drop this information. And if someone now wants to retrieve this file, it will discover the K closest nodes, which are now malicious, and they can just say the file is not found.

A

There is no one providing this file.
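
Why brute-forcing identities is cheap can be seen in a toy, in-memory simulation. This is a sketch under assumptions, not the speakers' tool: hashes of counter strings stand in for freshly generated key pairs, closeness is assumed to be XOR distance, the network has only 200 honest IDs, and `mine_sybils` is a hypothetical helper name.

```python
import hashlib
from itertools import count

K = 4  # small K as on the slides; the talk says IPFS uses 20


def key_of(data: bytes) -> int:
    """Hash arbitrary bytes to a point in the 256-bit key space."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def mine_sybils(target: int, honest_ids: list, k: int):
    """Try candidate keys until k of them land closer than every honest ID."""
    best_honest = min(p ^ target for p in honest_ids)
    sybils = []
    for attempts in count(1):
        # Each candidate stands in for a freshly generated key pair.
        candidate = key_of(f"candidate-key-{attempts}".encode())
        if candidate ^ target < best_honest:
            sybils.append(candidate)
            if len(sybils) == k:
                return sybils, attempts


honest = [key_of(f"honest-{i}".encode()) for i in range(200)]
target = key_of(b"target cid")
sybils, attempts = mine_sybils(target, honest, K)

# After the Sybils join, the K closest peers to the target are all Sybils.
closest = sorted(honest + sybils, key=lambda p: p ^ target)[:K]
print(f"{attempts} candidate keys tried; eclipsed: {set(closest) == set(sybils)}")
```

Each candidate beats the closest honest peer with a probability of roughly one over the number of honest peers, so the work grows only linearly with network size and with K.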

A

This is even more problematic because, even if the file is very popular and there are thousands of nodes hosting it, as long as I'm able to eclipse those K nodes, the file becomes unavailable in the network. And if I place those Sybils before the providers advertise the file, the attack is instantaneous: it works right away.

A

If not, if I place my Sybils after the file was advertised, it takes some time, because, as we've seen in the previous talk, honest nodes keep the records for 24 hours. So after 24 hours, only the malicious nodes will have the information about who has the specific file, and the file gets eclipsed. This is problematic because generating those 20 malicious peers takes around half a minute on this laptop, which is not great: with a single laptop, I can basically eclipse any file on the IPFS network.

A

So we were wondering, first, how we can detect this attack.

A

If we have perfectly random keys and only honest nodes, they will be uniformly distributed in the hash space. However, if an attacker wants to launch an attack, it has to place the Sybil identities closer to the specific target hash. And if that happens, we'll see that the distribution is different, because the region near the CID becomes more dense and the distances between the CID and the peer IDs will be much shorter. Obviously, as a node, we don't have a global view of the network.

A

However, we have a view of the peer ID distribution as we go towards the specific CID. Here, with the blue dots, you can see the regular distribution when there is no attack, and with the red dots, the distribution when there is an attack launched against that specific CID.

A

So you can see that they're very different. Because of that, we were looking for a metric that will tell me whether there is an attack or not, based only on the peer ID distribution that I perceive while going to a specific CID. For that, we used the KL divergence.

A

You don't have to dive into the math, but basically this is something I can give two distributions to, and it gives me a number, which is a distance between those distributions. If the KL divergence is large, it means that the two distributions are very different. If it's low, it means that they're basically the same. So, what do we do? First of all, we need to estimate the network size.

A

We can do that pretty securely. Then we calculate the ideal distribution that we expect to see if the peer IDs are uniformly distributed, we go to our specific CID, and we compare the ideal distribution with the one that we perceived. Here on this graph, you can see that the red dots are our walks towards a CID that is eclipsed, and the blue dots

A

are our walks towards a CID that's not eclipsed. On the y-axis you have the KL divergence, so you can see that the red dots are basically in the top part of the graph and the blue ones are at the bottom. So if we set the threshold correctly, we can basically tell whether there is an attack or not. With some tuning, we were able to get a false negative rate of zero percent and a false positive rate of only around one percent.
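
The comparison of a perceived distribution with an ideal one can be sketched as follows. The talk does not say how the distributions are binned, so this sketch makes its own assumptions: peers are binned by the length of the ID prefix they share with the target, the ideal histogram comes from a Poisson approximation of uniformly placed IDs, the true network size is used in place of an estimate, and all function names are hypothetical.

```python
import hashlib
import math

HASH_BITS = 256
K = 20        # number of closest peers inspected
MAX_CPL = 40  # longer shared prefixes are lumped into the last bin


def key_of(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def common_prefix_length(a: int, b: int) -> int:
    return min(HASH_BITS - (a ^ b).bit_length(), MAX_CPL)


def observed_distribution(target: int, closest_peers: list) -> list:
    """Histogram of prefix lengths shared by the target and each peer."""
    hist = [0.0] * (MAX_CPL + 1)
    for peer in closest_peers:
        hist[common_prefix_length(target, peer)] += 1 / len(closest_peers)
    return hist


def expected_distribution(network_size: float) -> list:
    """Model histogram for the K closest peers when IDs are uniform."""

    def expected_min(lam: float) -> float:
        # E[min(K, X)] for X ~ Poisson(lam), as the sum over j < K of P(X > j).
        pmf, cdf, total = math.exp(-lam), 0.0, 0.0
        for j in range(K):
            if j > 0:
                pmf *= lam / j
            cdf += pmf
            total += 1.0 - cdf
        return total

    # at_least[c]: expected number of the K closest peers with prefix >= c.
    at_least = [expected_min(network_size / 2**c) for c in range(MAX_CPL + 1)]
    at_least.append(0.0)  # the last bin absorbs all longer prefixes
    return [(at_least[c] - at_least[c + 1]) / K for c in range(MAX_CPL + 1)]


def kl_divergence(p: list, q: list, eps: float = 1e-9) -> float:
    """D_KL(P || Q) in bits; eps guards against near-empty model bins."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


honest = [key_of(f"honest-{i}".encode()) for i in range(2000)]
target = key_of(b"some cid")
model = expected_distribution(len(honest))

closest_honest = sorted(honest, key=lambda p: p ^ target)[:K]
# Sybil IDs that differ from the target only in their lowest bits.
sybils = [target ^ (i + 1) for i in range(K)]

kl_honest = kl_divergence(observed_distribution(target, closest_honest), model)
kl_attack = kl_divergence(observed_distribution(target, sybils), model)
print(f"no attack: {kl_honest:.2f} bits, eclipsed: {kl_attack:.2f} bits")
```

In this toy setting the eclipsed walk produces a far larger divergence than the honest one, which is the gap a threshold is meant to separate.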

A

Okay. So with this... Yes?

A

Sorry, one second.

B

In terms of false positives and negatives, I'm wondering what kind of tuning?

A

So basically, the tuning is only about setting this threshold. We have to say: okay, it's an attack when the KL divergence is higher than some value. But you can basically do it empirically, and it's based only on the network size.
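
Setting the threshold empirically amounts to sweeping candidate values over labelled walks and reading off the two error rates. The sketch below uses made-up KL values purely for illustration (they are not measurements from the talk), and `error_rates` is a hypothetical helper.

```python
def error_rates(honest_kl, attacked_kl, threshold):
    """Return (false positive rate, false negative rate) when flagging KL >= threshold."""
    false_pos = sum(v >= threshold for v in honest_kl) / len(honest_kl)
    false_neg = sum(v < threshold for v in attacked_kl) / len(attacked_kl)
    return false_pos, false_neg


# Made-up KL values for illustration only; not results from the talk.
honest_kl = [0.2, 0.4, 0.5, 0.9, 1.1, 3.2]
attacked_kl = [2.8, 3.5, 4.0, 5.1]

for threshold in (1.0, 2.8, 3.3):
    fp, fn = error_rates(honest_kl, attacked_kl, threshold)
    print(f"threshold={threshold}: false positives={fp:.2f}, false negatives={fn:.2f}")
```

Lowering the threshold removes false negatives at the price of more false positives, which is the trade-off discussed in the questions that follow.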

A

Right, so we know that there is an attack, but what can we do about it? This is still work in progress, but basically we would like to solve it only when advertising the file. So now, if I have a file and I advertise it, I go to the CID and I check the distribution. The distribution seems off, so what can I do?

A

I can just go slightly away from this eclipsed region and give this information to the original, honest nodes that should receive it. And again, I can verify that using the KL divergence.

A

The good thing about it is that, on the searcher's side, we don't have to do anything, because, while we go towards the specific CID, we should be crossing those honest nodes, and they will just tell me: you know, if you want to go closer, that's fine, but I have this information and I can give it to you. You don't have to go towards this eclipsed region of the network. And yeah, I think this is it from me.

A

So, basically, what we're doing now is confirming that every single time you are looking for a file, you will go through those honest nodes. If that works, we're basically good to go. My colleague Navin will give a demo about that tomorrow. He runs eclipsing CIDs as a service, so if you want a file to be eclipsed, just let us know.

C

Did you tune that threshold so that your false negative rate was zero? Basically, was that your optimality condition for the KL threshold?

A

Yes, so basically we're kind of looking for a good balance between false positives...

C

But you will not accept false negatives, right? You don't want to miss any attacks, but you will accept things that weren't actually attacks.

A

Right. I think it also depends on the actual response, and we are still not set on that. If the response is more harsh, maybe you'll care more about false negatives rather than false positives, and vice versa, depending on what you do afterwards. Yeah, exactly.

D

Hi, good talk. There was one slide where you showed that the false negative rate was zero percent. Is that provably always zero, or was that empirical data?

A

Sorry, I didn't get that.

D

So, is the zero false negative rate empirical, or will it always be zero?

A

Yeah, so we set the threshold, and with the threshold that we chose, that's the result we got. But, as I said, we're still tuning this threshold, because we still don't know whether we care more about false positives or false negatives. That depends on the exact response that you want to have to this attack.

F

Hello, thank you. So you detect this using the KL divergence on the distributions, but it sounds to me like you need a crawl of the entire network to detect this.

A

No, we do it on the distribution as it is perceived while going to the specific CID. Here on this graph, you can see the distribution of the peers I perceive while going to the CID. Those are only the peers that I contact, so I don't need the global view of the network. Exactly.

F

That's yeah, nice.

G

Could you go back to the previous graph? It's this one, yeah, that one. Why are there these horizontal lines in the KL divergence? Could you explain that?

A

The red line? Yeah, this is the threshold that we set, basically saying that above it...

G

The red dots form these lines.

A

My guess would be that this is some property of the KL divergence that is kind of discrete, but I don't have a good answer now.

G

That's enough for me, thanks.

E

That's a very interesting and promising result. So what you have now is that you can detect an attack, right, and then there's a mitigation, but I think it would even be possible to do something better. When you provide your keys, you're going to do a lookup for the spot where you want to store your keys.

Here you can detect the density, and if at this moment you detect an attack, then you publish to a lot of peers, right? Because the attacker is going to be there, but you want to give it to the honest peers that were there from the start. And so you don't need to constantly monitor.

A

Right, so that's the thing: we wanted to avoid being forced to constantly monitor all the keys that I published.

A

There is always a trade-off because, obviously, the more people you give your information to, the better, as it's more secure, but at the same time you increase the overhead on the network. We're also thinking about storing those records on the path. Currently you use a kind of FIND_NODE operation to discover the K closest nodes, and then you give them the information using ADD_PROVIDER. Here, however, it would be asking everyone on the path towards the CID.

E

But, I mean, then the content routing isn't sound. If it's asking on the path, then it's hard to find.

A

Right, but you still go towards the specific CID, so you still give it to those K closest nodes, and then the closer you get... We still have to model that, but, especially for popular files, it should give us enough information spread across the network. The good thing about it is that you don't have any additional overhead, because you just talk to the nodes that you contact anyway.

E

Yeah, but I think that if you first do the lookup to store the provider record and you detect that there is an anomaly, then you can look up some more peers until you're satisfied with the number. So you allocate more provider records than just 20. Maybe it can be combined with optimistic provide, so that the number of provider records depends on the density of this specific location of the key space. I think that's very strong: we don't need to monitor, and it just makes it strong.

A

Yeah, that's a good point.

G

How sensitive is that to the accuracy of the network size estimation? So how accurate...

A

Of the network size?

G

So you say this all depends on the network size, and you are estimating that. How accurate does this need to be?

A

Yeah, it doesn't have to be extremely accurate, because, again, that depends on the threshold. So there are a lot of moving parts, but we've run it on multiple traces acquired by the crawler that, I guess, you wrote. We basically go to a random key in the network, and based on the distribution we're able to estimate the network size. We were off by around 2-3 percent across all the runs that we had.
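
One way to estimate the network size from a walk to a random key can be sketched as below. The talk does not describe the estimator, so this is an assumption: with N uniformly placed IDs, the i-th smallest normalized distance to a random key is expected to be about i / (N + 1), and a least-squares fit through the origin recovers N. The helper names are hypothetical, and this crude sketch is not expected to match the 2-3 percent accuracy quoted above.

```python
import hashlib
import statistics

HASH_SPACE = 2**256
K = 20


def key_of(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def estimate_network_size(target: int, closest_peers: list) -> float:
    """Fit d_i ~ i / (N + 1) to the sorted normalized distances and solve for N."""
    dists = sorted((p ^ target) / HASH_SPACE for p in closest_peers)
    num = sum(i * d for i, d in enumerate(dists, start=1))
    den = sum(i * i for i in range(1, len(dists) + 1))
    slope = num / den  # least-squares estimate of 1 / (N + 1)
    return 1 / slope - 1


peers = [key_of(f"peer-{i}".encode()) for i in range(3000)]
estimates = []
for t in range(50):
    # Each iteration mimics one walk towards a random key.
    target = key_of(f"random-key-{t}".encode())
    closest = sorted(peers, key=lambda p: p ^ target)[:K]
    estimates.append(estimate_network_size(target, closest))

print(f"true size: {len(peers)}, median estimate: {statistics.median(estimates):.0f}")
```

A single walk gives a noisy estimate, so the sketch takes the median over many random keys.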

F

As another idea to avoid relying on the network size: what you're doing when you insert a provider record is that you explore the region close to that CID in the hash space. And you already know the region close to your own ID in the hash space, so you could compare those distributions. Then you don't need to rely on the network size, because you know the density in your proximity. Maybe.

F

Hopefully not.

A

Yeah, that's something to look at.

A

Because here, I think, we were kind of looking globally. It would be good to check whether, if you're a lot closer to the CID, it's more difficult to detect the attack.

H

Hi, thanks for the talk. I was just wondering what the real cost of setting up this attack is.

H

Okay, so if I understood correctly, you can eclipse any CID if you have 20 key pairs that are the closest in the network to that identifier. But identifiers are derived from public key pairs, which are verified whenever you interact with the nodes to establish the secure connection. That means that, in practice, you have to have a huge database of public/private key pairs and peer IDs, so that you can fetch the 20 that are closest to your target. Is that right? Or is it that we messed up and we forget to verify the key when we establish a communication channel?

A

But I have both the private and the public key, right? Like I said, I generate the private key. So basically, what we do: we just fetch the closest node to the specific CID, and then I just generate a random key and check whether it's closer to the CID. So I do have both the private and the public key, and I can establish the...

H

Maybe that's what we're going to see tomorrow. Because we have, like, a Hydra node that has a huge database of public/private keys and peer IDs. Then I decide: okay, I will eclipse everything around this identifier. I pick the right identities, I make my bad Hydra present itself to the network with those identities, and we kill that segment of the network.

A

Sorry, I wasn't sure about the second part.

H

I was just trying to understand how this could be operationalized to make the attack extremely effective.

A

Well, okay, I think... Do you want to take it?

I

You will see tomorrow, but within half a minute to a minute we have a functioning attack. So it's very easy, basically.

A

Yeah, the brute-force thing is pretty quick. Actually, at the beginning we were trying to parallelize it, but then, it takes half a minute, so it's probably not worth it.

B

Okay, hi. Thank you for the talk. You were saying that your idea to mitigate this was to store the CID on the path, right?

B

Have you checked how deterministic the path is when you...

A

Have I checked what, sorry?

B

How deterministic the path is.

A

The further away you are from the CID, the less deterministic it is, because in the buckets close to the specific CID you're kind of guaranteed to have more or less the same nodes, and the further away you get, the less deterministic it is. Although, with the results from one of the talks, it seems that even for the buckets far away we might be getting something deterministic because of these stable nodes. We actually didn't take that into account.

A

So that's something very interesting to also look at. But yeah, basically, if we do this kind of on-the-path approach, the further away you are from the CID, the more secure it is, because it's much more difficult for the attacker to eclipse a larger bucket. But at the same time, you're also less likely to get the information. So the further away you go, it will work only for the very popular files.

B

It also seems that, depending on which point of the network you request the CID from, the paths will be different, right?

A

Yes.

B

And that will make it a bit more complicated. But great work.

F

So Navin is offering eclipsing as a service. There are, for example, blacklists that let some of the gateway providers not serve certain content, and one could extend this and say: okay, if we really don't want some content on the network, we could eclipse it. So it's not all bad, then.

A

Right, so you might think there are kind of good use cases for this, because if there's some harmful content, we can say: hey, let's eclipse it. However, the problem is that you're giving a gun to anyone who wants it, which is probably not a good idea. I don't know, maybe there can be a list of files that you don't want to participate in sharing, and then you just don't do it yourself. But I think actively eclipsing a specific CID is probably off limits.

F

As a second question: is your code open source?

A

No. We probably want to first have something robust in place, because right now, with the code, you provide the CID, you run the script, and it disappears. So yeah, we'll probably want to first have a remedy and see how we handle that more generally.

I

Okay, excellent. Okay, one more.

E

So, just to follow up on how difficult it is to create an actual eclipse attack: I think there's a protection in the code so that you cannot have more than three IP addresses in your whole routing table from the same ISP, or the same IP block, and two inside the same bucket. That means that to mount an actual eclipse attack, you'd need at least 10 IP addresses, right? So even if generating the keys is easy,

E

you still need a lot of IP addresses from different ISPs.
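
The kind of routing-table protection described here can be sketched as a simple admission filter. This is not the actual IPFS implementation: the class name `DiversityFilter`, the grouping of IPv4 addresses by /16 block, and the way the two limits are applied are assumptions made for illustration, with the limits of three and two taken from the question above.

```python
import ipaddress
from collections import Counter

MAX_PER_GROUP_IN_TABLE = 3   # limit quoted in the question
MAX_PER_GROUP_IN_BUCKET = 2  # limit quoted in the question


def ip_group(ip: str) -> str:
    """Group addresses by their /16 block (assumed grouping rule)."""
    return str(ipaddress.ip_network(f"{ip}/16", strict=False))


class DiversityFilter:
    def __init__(self):
        self.table_counts = Counter()   # group -> peers in the whole table
        self.bucket_counts = Counter()  # (bucket, group) -> peers in that bucket

    def try_add(self, ip: str, bucket: int) -> bool:
        """Admit the peer only if both per-group limits still hold."""
        group = ip_group(ip)
        if self.table_counts[group] >= MAX_PER_GROUP_IN_TABLE:
            return False
        if self.bucket_counts[(bucket, group)] >= MAX_PER_GROUP_IN_BUCKET:
            return False
        self.table_counts[group] += 1
        self.bucket_counts[(bucket, group)] += 1
        return True


table = DiversityFilter()
for ip, bucket in [("10.1.0.1", 0), ("10.1.0.2", 0), ("10.1.0.3", 0),
                   ("10.1.0.3", 1), ("10.1.0.4", 2), ("10.2.0.1", 0)]:
    print(ip, bucket, table.try_add(ip, bucket))
```

Note that a filter like this only governs which peers enter a node's own routing table, which is the distinction raised in the reply that follows.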

A

I'm not sure, because I know that there is this limit on the IPs per bucket, but I don't think you have to add all the peers that you discover during an operation to your routing table, so you can still use them. If you discover the K closest nodes to the CID, you don't necessarily have to add them to your routing table to talk to them.

E

Yeah, but what I mean is that you cannot replace them. The attacker's nodes are going to eclipse, like, a couple of nodes: you get closer and you try to remove the nodes that know about the target CID.

A

Oh, I understand, but...

E

It's hard to remove them because they have diversity in the IP address, whereas the attacker is not likely to have diversity of IP addresses.

A

Got it. So that's weird, because we run it from a single IP address and it works in the actual network.

E

On the actual network? Okay, okay.

A

Yeah, but this should kind of protect against it. So yeah, that's something to check.

Even if you need, you know, 20 IP addresses, it's probably not an extreme cost. But yeah.

D

Nice. Any final questions, or should we just go next door?

G

Yeah, let's thank Michal again.