IPFS IPFS þing 2022, 9 Aug 2022

From YouTube: Double Hashing & Content Routing - @guissou - Content Routing 2: Privacy

Description

Double Hashing & Content Routing - presented by @guissou at IPFS þing 2022 - Content Routing 2: Privacy - https://2022.ipfs-thing.io

The IPFS DHT currently makes use of double hashing. This talk describes a small protocol change on how double hashing should be implemented in the DHT. This change would significantly improve the reader’s privacy from the DHT nodes, with low overheads.

Slides CID: bafybeicm22quqzyvvdyxeiczbaacvdc2mliz2rovadaxnajopb6t2cwmbq

A

I'm going to do a presentation on how we can make use of the double hashing that is already in IPFS for privacy. It would require some changes to the IPFS network, but double hashing is already here. This work has been discussed in a privacy discussion group I was in. So what do I mean by double hashing? Here is what we have in IPFS.

A

We have content, and in order to get to the CID, we take the content, compute a hash of it, and then from the hash we can build the CID using normal CID construction.

A

The location of this content, or rather of the provider record that points to the content, in the DHT is the address given by the hash of the CID. This means that the location of the provider records in the DHT is kind of the double hash of the content itself. So, quickly going through an example of how content lookup works in IPFS; feel free to stop me at any point.
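As an illustration of the double hashing described here, a minimal sketch in Python. It is not the actual IPFS code: `toy_cid` is a hypothetical stand-in for real CID construction (it just prepends a label to the content hash), and SHA-256 is assumed for both hashes.

```python
import hashlib

def toy_cid(content: bytes) -> bytes:
    """Hypothetical stand-in for CID construction: a label plus the content hash."""
    digest = hashlib.sha256(content).digest()  # first hash: content -> hash
    return b"toy-cid-v1:" + digest             # not the real CID encoding

def dht_key(cid: bytes) -> bytes:
    """Location of the provider record in the DHT: the hash of the CID."""
    return hashlib.sha256(cid).digest()        # second hash: CID -> DHT key

content = b"holiday picture bytes"
cid = toy_cid(content)
key = dht_key(cid)
print("provider record is stored near key", key.hex())
```

The key that decides where the provider record lives is two hashes away from the content, which is what the talk calls double hashing.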

A

That is, if you have any questions or if I got something wrong on the slide. So first the client gets the CID from somewhere, and it's going to hash this CID to get the bit representation and look up in its routing table the closest peer to this hash; for instance, here it's going to be peer p0. Then it's going to query p0 and say: okay, I want to find this specific CID. And p0 is a peer in the DHT in DHT server mode.

A

It's going to take the CID, hash it, look up in its own routing table, find the closest peer and return it to the original client. So eventually we get closer to where the provider record is stored. We do this recursively with the next peer, which performs exactly the same operation, until eventually we find a peer that is hosting this provider record.

A

That peer gives us the provider record. From this we can read the content of the record and find the peer ID of the peer hosting the actual content of the file. So we need to do another DHT lookup.
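The lookup walk just described can be sketched as a small simulation. This is a toy under loud assumptions, not the real DHT: keys are shrunk to 8 bits (the first byte of SHA-256), each peer keeps one contact per non-empty bucket, and `Peer`, `handle_find`, `build_network` and `lookup` are hypothetical names.

```python
import hashlib

def key_of(cid: bytes) -> int:
    """Toy DHT key: the first byte of SHA-256(CID), so an 8-bit keyspace."""
    return hashlib.sha256(cid).digest()[0]

class Peer:
    def __init__(self, peer_id: int):
        self.id = peer_id
        self.routing_table = []     # other Peer objects this peer knows
        self.provider_records = {}  # DHT key -> provider peer ID

    def handle_find(self, cid: bytes):
        """Current behaviour: the server sees the CID in clear and hashes it itself."""
        key = key_of(cid)
        if key in self.provider_records:
            return "record", self.provider_records[key]
        # XOR distance decides which known peer (or this one) is closest
        closest = min(self.routing_table + [self], key=lambda p: p.id ^ key)
        return "closer", closest

def build_network(ids):
    peers = {i: Peer(i) for i in ids}
    for p in peers.values():
        buckets = {}
        for q in peers.values():
            if q is not p:
                # bucket index = position of the highest bit where the IDs differ
                buckets.setdefault((p.id ^ q.id).bit_length(), q)
        p.routing_table = list(buckets.values())
    return peers

def lookup(start: Peer, cid: bytes):
    """Client side: keep asking the closest known peer until a record comes back."""
    peer, path = start, [start.id]
    while True:
        kind, value = peer.handle_find(cid)
        if kind == "record":
            return value, path
        if value is peer:  # nobody closer and no record stored
            return None, path
        peer = value
        path.append(peer.id)

peers = build_network(range(0, 256, 17))
cid = b"example-cid"
holder = min(peers.values(), key=lambda p: p.id ^ key_of(cid))
holder.provider_records[key_of(cid)] = "provider-peer-id"

provider, path = lookup(peers[0], cid)
print("provider:", provider, "queried peers:", path)
```

Every peer on `path` receives the CID itself, which is the privacy problem the rest of the talk addresses.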

A

That is, if we don't have the IP address of this peer. Then we can request the content from the content provider, which is going to provide the content. All right, so now we want to know what the privacy model is here: who can know what kind of data I am accessing? Of course the content provider knows that I am interested in getting this image from it, and there is nothing we can really do about that, since we download the image from them.

A

Then there is the peer hosting the provider record: they give me the provider record, so they will know that I am probably interested in this file if I download the provider record. Then all of the peers that help me route to the provider record are going to know that I'm interested in this file. And so will any passive observer on the path, which can be an ISP or someone else.

A

Someone on the same airport Wi-Fi will be able to know that I requested the CID, and if they want to observe me, they can just take the CID, make the request themselves and eventually download the file that I'm interested in. So it's very easy to track what people are looking at. For instance, say there was a YouTube built on top of IPFS.

A

You wouldn't want the content to be encrypted, because you would want everyone to be able to access it. But just by looking at the CIDs that people are accessing, you can see which kind of videos they are watching, and you can really spy on them. So that's kind of the problem. We want client privacy, or reader privacy, in the DHT, so that the reader can access things more privately. And here we only focus on the DHT.

A

We don't focus on Bitswap, content provider privacy or the gateway. We just focus on the DHT and a normal client.

A

So what we can do as a first solution is to look up a prefix. But the problem is: if we look up a prefix of the CID, we will not be able to route to the correct file, because when we hash a prefix of the CID, the result has no locality with the hash of the CID itself. This means that it cannot work as is in the system.
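The locality argument can be checked with a short sketch. The CID bytes below are a placeholder rather than a real CID, and SHA-256 is assumed as the DHT hash; `common_prefix_bits` is a hypothetical helper.

```python
import hashlib

def common_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits shared by two equal-length byte strings."""
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - diff.bit_length()

cid = b"toy-cid:" + hashlib.sha256(b"some content").digest()  # placeholder CID
full_key = hashlib.sha256(cid).digest()

# Hashing a prefix of the CID gives a key unrelated to the real one.
key_from_cid_prefix = hashlib.sha256(cid[:20]).digest()
print("bits shared with the real key:",
      common_prefix_bits(full_key, key_from_cid_prefix))

# A prefix of the hash of the CID shares its leading bits by construction.
hash_prefix = full_key[:2]  # first 16 bits of the DHT key
print("real key starts with the hash prefix:", full_key.startswith(hash_prefix))
```

Because a cryptographic hash scatters similar inputs across the keyspace, only a prefix taken after hashing still points at the region where the provider record is stored.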

A

So what we want to do instead is take a prefix, so a substring, of the hash of the CID, and it means that the DHT routing process has to be adapted. So what do I mean by this?

A

In a first step, we need to modify the request: instead of sending the CID directly from the client to the DHT routing peers, I'm going to request the hash of the CID. What it means is that the peers routing in the DHT will not have to compute the hash of the CID anymore, which is good news, because it's one operation less to do, and they can still perform the same operation by looking up in their routing table the closest peer to the value I'm requesting. So it requires a change in the server code.

A

Here it wouldn't change anything for privacy, but then we can build on top of it. So the second change is... can you read the red? Yes? No? Yeah, bad choice of color. Oh perfect! Thank you.

A

So basically the client is going to just take the prefix of the hash of the CID, then compute the closest peer to the prefix, or to the hash of the CID itself, it doesn't really matter, and then request the prefix. So we need to adapt the routing process a little bit for when you look for the closest peer to a prefix.

A

You would look for all of the peers whose ID exactly matches this prefix and, if there are none, just consider the remaining bits as random and take the closest the same way that you would with the XOR distance normally. So it would actually work with the routing: every time you get a closer peer, until you request a peer that actually has one or multiple provider records that match this prefix.

A

This peer is not going to know which of these CIDs you want, and it's going to give you all of the provider records. So here we have a network overhead, because many provider records are transmitted against only one in the current IPFS. What the client then does is discard the records for the CIDs it doesn't care about and then do the same thing as it used to do. Now just a word about the prefix length selection.
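A minimal sketch of the prefix request, under stated assumptions: `RecordStore`, `get_by_prefix` and `client_lookup` are hypothetical names, SHA-256 is assumed as the DHT hash, and the network and routing are left out so that only the matching and filtering steps remain.

```python
import hashlib

def dht_key(cid: bytes) -> bytes:
    return hashlib.sha256(cid).digest()

def to_bits(key: bytes) -> str:
    """Key as a string of '0'/'1' characters, most significant bit first."""
    return bin(int.from_bytes(key, "big"))[2:].zfill(len(key) * 8)

class RecordStore:
    """Provider records held by one DHT server, indexed by full DHT key."""
    def __init__(self):
        self.records = {}

    def put(self, cid: bytes, provider: str):
        self.records.setdefault(dht_key(cid), []).append(provider)

    def get_by_prefix(self, prefix: str):
        # The server cannot tell which matching record the client wants,
        # so it returns all of them.
        return {k: v for k, v in self.records.items()
                if to_bits(k).startswith(prefix)}

def client_lookup(store: RecordStore, cid: bytes, prefix_len: int):
    key = dht_key(cid)
    prefix = to_bits(key)[:prefix_len]     # only this leaves the client
    matches = store.get_by_prefix(prefix)
    # Keep the record for the wanted CID and discard the others.
    return matches.get(key, []), len(matches)

store = RecordStore()
for i in range(64):
    store.put(f"cid-{i}".encode(), f"peer-{i}")

providers, received = client_lookup(store, b"cid-7", prefix_len=3)
print(providers, "kept out of", received, "records received")
```

The number of records received is the network overhead mentioned in the talk; it grows as the prefix gets shorter.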

A

I said that we have to compute the prefix, but I didn't tell you the length. It's not a security parameter as such, but basically what we want to achieve is k-anonymity. If we go back to this example, if I want five provider records to be given to me on average every time, then the file I want to access will not be distinguishable among these five files. So I get k-anonymity with k equal to five. So, on to computing the prefix length.

A

The prefix length l I have to take depends on the k-anonymity, so the k parameter, which is not to be confused with the k-bucket or the k parameter from Kademlia. That's basically the computation. The idea is: say this is the key space of Kademlia, you take the leftmost node, so 0000, and you want to have, I don't know, four provider records.

A

Then it means that you have to take the prefix 00, so that it covers four different elements. So basically that's the computation: we have to take the log of the total number of CIDs in the network divided by the k parameter.
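The computation above can be written as l = floor(log2(N / k)), where N is the total number of CIDs in the network. The sketch below assumes a base-2 logarithm (the prefix is counted in bits), uniformly distributed keys, and rounding down so that a prefix matches at least k records on average; `prefix_length` is a hypothetical helper name.

```python
def prefix_length(total_cids: int, k: int) -> int:
    """Prefix length l = floor(log2(N / k)) for N CIDs and anonymity parameter k.

    With uniformly distributed keys, an l-bit prefix matches about
    N / 2**l records, which is at least k.
    """
    if k < 1 or total_cids < k:
        raise ValueError("need 1 <= k <= total_cids")
    # Integer form of floor(log2(N / k)), avoiding floating-point rounding.
    return (total_cids // k).bit_length() - 1

# The example from the talk: 16 keys in a 4-bit keyspace, 4 records wanted.
print(prefix_length(16, 4))           # 2, i.e. the prefix "00"

# A larger, made-up network size, with k = 5.
l = prefix_length(1_000_000, 5)
print(l, "bits, about", 1_000_000 / 2**l, "records per request")
```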

A

And so what do we gain from implementing this? We get k-anonymity and plausible deniability: you request a prefix, you get served five provider records, and you can pretend that it's not the sensitive or illegal file that you were downloading but another one. So it does make sense, and we are more protected, or less vulnerable, against the DHT routing nodes, the node storing the provider record and the passive observer.

A

But we don't have l-diversity or t-closeness, which are two other metrics associated with k-anonymity, and we have a network overhead, just for the provider record transmission. And it is very easy to just replay the same prefix request. So basically anyone could do it: if we take p0, p0 could just take the prefix request and replay it.

A

It gets all of the provider records and can say: okay, I know a bit about what the client is looking for, and I know it's file number four and not the others. So the privacy is still not that good, but we reduce the impact that this actor may have. But we can go further: we can encrypt the provider record.

A

So how do we do this? We still want the provider record to be accessible to anyone with the CID. So we want to encrypt the provider record using the CID, and you would need to know the CID to access the content of the provider record.

A

It means that the first part doesn't change, but the provider record would be stored encrypted on the node. So if I want to pin a node on IPFS, I will first encrypt the provider record with the key derived from the CID itself and push it to the DHT. It even gives a bit of content provider privacy, because then the peers storing the provider record wouldn't know what is in it.

A

An honest client that knows the CID will easily be able to decrypt the provider record using the key derived from the CID, which it knows from the start. So the overhead here would be one symmetric decryption.
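A sketch of encrypting a provider record with a key derived from the CID. The talk does not name a cipher or a key derivation, so everything here is an assumption: the key is SHA-256 over a hypothetical label plus the CID, and the cipher is a toy HMAC-based stream cipher built from the standard library to keep the example self-contained. It illustrates the data flow only and is not a vetted construction; a real design would use a standard authenticated cipher.

```python
import hashlib
import hmac
import os

def record_key(cid: bytes) -> bytes:
    # Assumed derivation. The label keeps this key different from the DHT key
    # SHA-256(CID), so a server that sees the hash of the CID cannot decrypt.
    return hashlib.sha256(b"provider-record-key:" + cid).digest()

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return out[:length]

def encrypt_record(cid: bytes, record: bytes) -> bytes:
    key, nonce = record_key(cid), os.urandom(16)
    stream = _keystream(key, nonce, len(record))
    ciphertext = bytes(a ^ b for a, b in zip(record, stream))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag

def decrypt_record(cid: bytes, blob: bytes) -> bytes:
    key = record_key(cid)
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong CID or corrupted record")
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))

cid = b"toy-cid-of-holiday-picture"            # placeholder, not a real CID
record = b"provider=peer-id-of-the-host"      # placeholder record content
stored = encrypt_record(cid, record)          # what the DHT server stores
print(decrypt_record(cid, stored))            # a reader who knows the CID
```

A server that only holds `stored` and the hash of the CID would have to invert the hash to rebuild `record_key(cid)`.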

A

That symmetric crypto operation would be on the client. But then, for instance, say that peer p0 is malicious: p0 will request the prefix and get all of the encrypted provider records, and p0 only knows the prefix, which is a part of the hash of the CID. Even if p0 had the whole hash of the CID, it would need to do a pre-image attack to be able to recover the CID, which by design is not feasible.

A

So p0 would be able to get the encrypted provider record, but not be able to access it and see what's inside. So that's mostly it: the observer can only decrypt the provider record if they have the CID. But it means that it's not perfect security. If you have the pictures of your holidays and only you access them, then it is fine: only you know the CID and nobody will be able to see the pictures of your holidays.

A

But again, if we say that there is a decentralized YouTube on IPFS, someone could go through all of the videos, get all of the CIDs of the videos, compute their hashes and build a big dictionary. Then when an observer sees the prefix that is requested, they can check in the dictionary whether it matches a video, and they can still know what you're looking at. So it improves privacy a lot, but it's not the perfect solution.
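The dictionary attack described here can be sketched as follows. The list of public CIDs is made up for the example, SHA-256 is assumed as the DHT hash, and `hash_prefix` and `build_dictionary` are hypothetical names.

```python
import hashlib
from collections import defaultdict

def hash_prefix(cid: bytes, prefix_len: int) -> str:
    """First prefix_len bits of SHA-256(CID), as a string of '0'/'1'."""
    digest = hashlib.sha256(cid).digest()
    return bin(int.from_bytes(digest, "big"))[2:].zfill(256)[:prefix_len]

def build_dictionary(public_cids, prefix_len: int):
    """Map every possible observed prefix to the public CIDs that produce it."""
    table = defaultdict(list)
    for cid in public_cids:
        table[hash_prefix(cid, prefix_len)].append(cid)
    return table

# The observer crawls a public catalogue (placeholder CIDs here).
public_cids = [f"video-{i}".encode() for i in range(1000)]
PREFIX_LEN = 8
dictionary = build_dictionary(public_cids, PREFIX_LEN)

# Later the observer sees a prefix request on the wire.
observed = hash_prefix(b"video-42", PREFIX_LEN)
suspects = dictionary[observed]
print(len(suspects), "candidate videos for the observed prefix")
```

For public content the observer knows every candidate CID, so it can also derive the decryption keys for the matching provider records; the protection left is the k-anonymity among `suspects`.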

A

It is still possible to mount some attacks with a lot of resources, and the downside is that we have one symmetric decryption operation. Again we reduce the impact, or the power, that this adversary can have. So that's kind of the final picture.

B

One additional piece here, right: it does mean that every provider record is now encrypted, where before many provider records were the same, it was just the peer ID, but now each one is different. So for something like a network indexer, that makes the database much worse.

B

Yes, because we've gone from a small number of providers to one distinct provider record per CID. But it does seem like there's potentially a way to de-link that: you could potentially have two layers, one that is a provider record that just says "I am a provider", and one that tells you which provider that is. So do it in two stages; there's potentially a way to do that.

A

Yeah, definitely. I mean, it's where I stopped my thinking, but I think it can be improved a lot, and there are a lot of implementation details I'm not aware of. So there are probably some things that don't work exactly like this, and it's probably possible to improve it even more.

A

So the conclusion is that we can really significantly improve reader privacy in the DHT, and only this. And the DHT servers wouldn't need to hash the CID for each request, which is good, because when you do a lookup, you have a lot of nodes helping you with the routing.

A

That's that many hash operations less. The overhead to pay is just sending k provider records, with k being the k-anonymity parameter, instead of one, but only once per request, and the computation overhead would be one symmetric decryption of the provider record, which shouldn't be too heavy even for mobile applications. But it would require modifying the server code and going through a migration, so republishing all of the CIDs, with all of the provider records now encrypted, to be able to find them again.

A

Then, I mean, it is also an illusion of privacy: people may feel safe whereas they shouldn't, because it's still possible to attack if you have a lot of resources, or if you are not in this very specific reader-privacy DHT-lookup case. So now I'm happy to take any questions, if you have some.

B

Yeah, so the prefixing part kind of reminds me of an instance of a more general thing, which would be a locality-sensitive hash. The question is basically: did you consider using methods other than prefixing as a way of hiding some of the information about what you're looking for?

A

All right business, even.

B

On your radar, is there a reason that wouldn't be.

A

Yeah, so the thing is that it's hard to do privacy and content routing, because if you make each file distinct, then you're not able to route to it anymore. And if you use not the prefix but a suffix method or something else, you lose the routing component. What we have here is that if you have the prefix, you know it's still going to be in here, so you can route up to this node.

A

Yeah, so you know that it's in this zone and you're going to get those provider records. But if you do...

B

Any other technique, you're not able to do your content routing, or you'll have to have, I don't know, a provider record for each reader, which doesn't scale.

B

Yep. You'd probably still also want to modify Bitswap as well, right? Because otherwise you'll just leak which CID you want to everyone you've talked to already.

A

Yeah, definitely. I think something similar would be possible.

B

I guess either you just limit your requests to people you know have it, or maybe you can do something similar to this, where you do a double hash.

A

All right, no more questions? Then I'll leave the floor to the next speaker.