IPFS IPFS Camp 2022 - Libp2p Privacy, 2 Nov 2022

From YouTube: Double-Hashing as a way to increase reader privacy - Guillaume Michel

Description

This talk was given at IPFS Camp 2022 in Lisbon, Portugal.

A

So inside the DHT, we want the DHT server peers to help us route to the content we're looking for. That's the purpose of content routing. We want them to know what we are looking for, because then they can help us, but at the same time we want privacy, so we don't want them to know what we're looking for. That's the challenge: how do we make sure that they can actually help us get to the content we want without revealing to them what we're looking for?

A

I'm just going to go through a couple of definitions that you're going to see again in the presentation. First, the multihash in IPFS: when you have some content, in order to get the CID, you will first hash the content.

A

So if you have an image, you pass it through a hash function and you get a binary hash, the multihash. To this multihash you add some prefix and you get a CID, which is the string you can see. Then there is a concept that is important here, double hashing. Why double hashing? Because we hash the content twice: we hash this multihash once more to get another identifier. On the slides I write it as the hash of the CID.

A

In practice you don't directly hash the CID itself, but it's easier to write it that way on a slide than to write the long version. Another important component is the provider record. When you publish, or provide, some content in IPFS, you do not drag and drop your content and store it on IPFS.

A

The content will stay on your machine, and you just make a pointer and write it in the DHT, so anyone will be able to find this pointer and discover that you store the file. The provider records are these pointers, which map the CID, the content identifier, to the peers that are hosting the content. It's also possible that multiple peers store the same content.
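The definitions so far (multihash, CID, provider record) can be sketched in a few lines. This is only an illustration, assuming SHA-256 content hashing, the raw codec, and a base32 CIDv1; the peer names and the in-memory `provider_records` table are made-up placeholders, not how any IPFS implementation stores records.

```python
import base64
import hashlib

SHA2_256_CODE = 0x12   # multicodec code for sha2-256
RAW_CODEC = 0x55       # multicodec code for raw binary content
CID_VERSION_1 = 0x01


def multihash_sha256(content: bytes) -> bytes:
    """Prefix a SHA-256 digest with the hash-function code and digest length."""
    digest = hashlib.sha256(content).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest


def cid_v1_raw(content: bytes) -> str:
    """Build a CIDv1 string: multibase 'b' + base32(version + codec + multihash)."""
    cid_bytes = bytes([CID_VERSION_1, RAW_CODEC]) + multihash_sha256(content)
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32


# A provider record maps a CID to the peers that host the content.
# The content itself never leaves the provider's machine.
provider_records: dict[str, set[str]] = {}


def provide(cid: str, peer_id: str) -> None:
    """Register peer_id (a placeholder string here) as a provider of cid."""
    provider_records.setdefault(cid, set()).add(peer_id)


cid = cid_v1_raw(b"hello ipfs camp")
provide(cid, "peer-alice")
provide(cid, "peer-bob")  # several peers can provide the same content
```

Looking up `provider_records[cid]` then returns both placeholder peers, which mirrors the idea that a lookup yields a list of content providers rather than the content.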

A

And so, if you look up a CID in the DHT, you get a list of the content providers, and you can then ask them for the content. So now, how does providing content to the DHT work? I say "provide" because we're not putting the actual content in the DHT; we only put the provider record.

A

So what we do is we have our CID and, for some reason, we're going to hash it, because we need a flat namespace, and so we hash this CID. That's how it works today.

A

So there's already a second hash, which gives us a binary string, and we look in the DHT for the closest location in the binary keyspace, the Kademlia keyspace. Usually DHTs are represented as a circle, as we saw in the last presentation, but in the case of Kademlia it's better represented as a binary tree, because we're talking about XOR distances. And the request is going to be iterative.

A

So first I ask the closest peer, or just a peer I know, "Can you get me closer to the place I'm looking for?" Eventually I find the closest peer IDs to the content, or rather to the hash, where I want to store my provider record. Then, once the request has converged, I simply ask these peers to store it.

A

I ask them, "Can you please store this pointer for me, so that anyone who is looking for this content can in turn ask me for the content?" These peers are going to store a provider record, and then any client that knows the CID can simply hash it and look up where the content is stored in the DHT. You only need to find one of the closest peers that is storing a provider record, and this peer is going to give you the provider record.
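The iterative lookup described above can be sketched as follows. This is a toy model, not the libp2p implementation: peers are placeholder byte strings, every peer's routing table is assumed to contain all other peers (real Kademlia tables are partial k-buckets), and `iterative_lookup` is a hypothetical helper name.

```python
import hashlib


def key_of(data: bytes) -> int:
    """Map arbitrary bytes into a 256-bit keyspace by hashing them."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance: bitwise XOR of the two keys, read as an integer."""
    return a ^ b


def closest_peers(target: int, peers: list[bytes], count: int) -> list[bytes]:
    """Return the `count` peers whose keys are XOR-closest to the target."""
    return sorted(peers, key=lambda p: xor_distance(key_of(p), target))[:count]


def iterative_lookup(target, start, routing_tables, count=3):
    """Keep asking the closest known peers for peers even closer to target."""
    known = {start}
    queried = set()
    while True:
        best = closest_peers(target, list(known), count)
        pending = [p for p in best if p not in queried]
        if not pending:  # converged: the closest known peers were all asked
            return best
        for peer in pending:
            queried.add(peer)
            # Each queried peer answers with the closest peers it knows of.
            known.update(closest_peers(target, routing_tables[peer], count))


peers = [f"peer-{i}".encode() for i in range(20)]
routing_tables = {p: [q for q in peers if q != p] for p in peers}
target = key_of(b"some-cid")  # today: the hash of the CID
result = iterative_lookup(target, peers[0], routing_tables)
```

Once the lookup has converged, the provider would ask the peers in `result` to store its provider record, and a client would run the same lookup to find one of them.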

A

So it's going to tell you where the content is actually located. Now that you have the peer ID of the content provider, you can directly request the content over Bitswap. That's how content routing currently works. So what does the DHT learn from each request? If we take a DHT server peer that is going to help you route your request, it will of course learn your peer ID, because you open a connection with them, so they will know you.

A

They also know your IP address and, as you then request some content from them, because you want them to help you, they are able to associate this content with your peer ID. It means they can track which CIDs you are accessing, and if they want to know what you're really accessing, they can just take this CID, request it in turn from the DHT, and get the same file as you.

A

So anyone that wants to spy on you can just listen to your requests, help you route them, and then resolve the same content as you, and you're completely tracked. For instance, if the orange node here is malicious and the client asks it for the CID, then this node now knows the CID, and so it can in turn request the provider record, learn where the content is stored, and retrieve the content. That is undesirable, and that's what we're trying to address with this upgrade.

A

So why double hashing? How did we get there? In IPFS, the way it works now is that content is addressed by the content identifier, but it would be great to have another identifier that is specific to the DHT, so that you can look up some content in the DHT and the peers could learn this identifier but cannot use it to request the content, because when you request the content over Bitswap, for instance, it's going to be a different identifier.

A

We want to have different content identifiers for the different content routing mechanisms or data transfer protocols, so that they cannot be linked together. We also want this new identifier to be easy to derive: once we have the CID, we want to be able to efficiently find the identifier in the DHT.

A

So that's a first improvement: if I request a key in the DHT, the node there can learn it but cannot get anything from it. This new identifier can be the hash of the CID, or you can hash the CID along with a constant, and that will be the DHT identifier.
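A minimal sketch of such a DHT-specific identifier, assuming SHA-256 for both hashes. The constant `DOUBLE_HASH_SALT` is a hypothetical placeholder chosen for this example, not the value used by any specification.

```python
import hashlib

# Hypothetical constant for illustration only; a real deployment would use
# whatever constant string its specification defines.
DOUBLE_HASH_SALT = b"EXAMPLE_DOUBLE_HASH_SALT"


def multihash_sha256(content: bytes) -> bytes:
    """First hash: SHA-256 digest prefixed with its code (0x12) and length."""
    digest = hashlib.sha256(content).digest()
    return bytes([0x12, len(digest)]) + digest


def dht_identifier(multihash: bytes) -> bytes:
    """Second hash: cheap to compute from the multihash, hard to invert."""
    return hashlib.sha256(DOUBLE_HASH_SALT + multihash).digest()


mh = multihash_sha256(b"holiday picture bytes")
dht_id = dht_identifier(mh)
```

Anyone holding the CID can recompute `dht_id`, while a DHT server that only sees `dht_id` cannot recover the multihash from it and so cannot request the content under its original identifier.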

A

So that's the second hash. In this case, in order to gain k-anonymity and plausible deniability, which are usually desired components of privacy, what we can do is request a prefix of this second hash. If we request a prefix, we still get approximately to the right region in the tree, so the nodes can still help us route to the right place.

A

But as we request only a prefix of the hash, there will probably be many provider records that match this prefix, so we get multiple provider records, and the DHT server nodes that are serving us these provider records don't know exactly which content we're looking for. So I say, "I'm looking for approximately this content," I get many provider records, and I select only the one I'm interested in and discard the rest.
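The prefix request can be illustrated with a toy server that returns every record whose key starts with the requested bits. `ToyDhtServer`, the 4-bit prefix length, and the `file-N` / `record-N` names are assumptions made for this sketch only.

```python
import hashlib


def dht_key(name: bytes) -> bytes:
    """Stand-in for the second hash; real keys derive from a multihash."""
    return hashlib.sha256(name).digest()


def bit_prefix(key: bytes, bits: int) -> str:
    """First `bits` bits of the key, as a string of '0' and '1'."""
    return "".join(f"{byte:08b}" for byte in key)[:bits]


class ToyDhtServer:
    """Holds provider records and answers prefix queries."""

    def __init__(self) -> None:
        self.records: dict[bytes, bytes] = {}

    def put(self, key: bytes, record: bytes) -> None:
        self.records[key] = record

    def get_by_prefix(self, prefix: str) -> dict[bytes, bytes]:
        # The server only ever sees the prefix, never the full key.
        return {k: v for k, v in self.records.items()
                if bit_prefix(k, len(prefix)) == prefix}


server = ToyDhtServer()
for i in range(200):
    server.put(dht_key(f"file-{i}".encode()), f"record-{i}".encode())

target = dht_key(b"file-42")
matches = server.get_by_prefix(bit_prefix(target, 4))  # short prefix, many hits
wanted = matches[target]  # the client keeps its record and discards the rest
```

A shorter prefix returns more records, which gives more cover at the cost of more bytes on the wire; a full-length prefix degenerates to an exact lookup.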

A

However, this gives us k-anonymity, but not l-diversity or t-closeness. It means that if, in a specific branch of the tree, for a specific prefix, there is one file that is very popular and everyone wants to access it, and, let's say, 10 other files that nobody wants to access, then if somebody makes a request in this specific branch of the tree, it's very likely that they are accessing the popular content.
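The popularity problem can be put in numbers. The request counts below are invented for illustration; the point is only that an observer who knows rough popularity can guess the target of a prefix request with high confidence.

```python
def likely_target(request_counts: dict[str, int]) -> tuple[str, float]:
    """Best guess for what a request under a prefix targets, with its probability."""
    total = sum(request_counts.values())
    name = max(request_counts, key=request_counts.get)
    return name, request_counts[name] / total


# Hypothetical daily request counts for the 11 files sharing one prefix.
counts = {"popular-file": 1000}
counts.update({f"unpopular-{i}": 1 for i in range(10)})

guess, probability = likely_target(counts)
```

With these made-up counts the observer's guess is right for 1000 out of 1010 requests, even though 11 records match the prefix.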

A

So it's not perfect, but at least it gives us plausible deniability, although we can still do correlation attacks on the prefixes, on the branches. One other component that we can use to improve our system further is that we can encrypt the provider record.

A

Right now, the provider records, the pointers that I put in the DHT to indicate that I store the content, have my peer ID in the clear, which means that anyone who hears about the request, even if it's a different hash, can know that I am storing the content. They won't be able to request the content from me, because they only know the DHT identifier, so I will know that they don't have the CID and I will not give it to them.

A

But my peer ID is in the clear, so as a writer, a content provider, I can be associated with this content. What we can do to address this is encrypt the peer ID with the CID itself.

A

And this can be done through symmetric encryption.

A

So anyone can still request this new DHT identifier and get the provider record, but the CID acts as the secret: only the peers that know the CID will be able to decrypt this provider record and know that I am storing the file.
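To make the "CID as the secret" idea concrete, here is a sketch using only the Python standard library. The cipher is a toy (a SHA-256 keystream with an HMAC tag) standing in for a real authenticated cipher, and it is not secure for production use; the function names, key derivation labels, and peer IDs are all assumptions for this example.

```python
import hashlib
import hmac
import os
from typing import Optional


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a running counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_peer_id(multihash: bytes, peer_id: bytes) -> bytes:
    """Encrypt peer_id under keys derived from the multihash (the secret)."""
    enc_key = hashlib.sha256(b"enc" + multihash).digest()
    mac_key = hashlib.sha256(b"mac" + multihash).digest()
    nonce = os.urandom(12)
    stream = _keystream(enc_key, nonce, len(peer_id))
    ciphertext = bytes(a ^ b for a, b in zip(peer_id, stream))
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def decrypt_peer_id(multihash: bytes, blob: bytes) -> Optional[bytes]:
    """Return the peer ID, or None if the multihash is not the right secret."""
    enc_key = hashlib.sha256(b"enc" + multihash).digest()
    mac_key = hashlib.sha256(b"mac" + multihash).digest()
    nonce, ciphertext, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    stream = _keystream(enc_key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


my_multihash = hashlib.sha256(b"content I know").digest()
other_multihash = hashlib.sha256(b"content I do not know").digest()
records = [
    encrypt_peer_id(other_multihash, b"peer-x"),
    encrypt_peer_id(my_multihash, b"peer-me"),
]
# A client tries its own multihash on every record returned for a prefix.
found = [p for r in records if (p := decrypt_peer_id(my_multihash, r)) is not None]
```

Only the record encrypted under `my_multihash` decrypts, so `found` holds a single peer ID and the other record stays opaque to this client.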

A

Now, the encryption has some undesirable effects. If I know the content identifier of some file, I can claim that somebody else is providing it: I can create a provider record pointing to somebody else in order to DoS them. The DHT server can no longer verify that the peer ID in the provider record matches my peer ID as I upload it. So what we can do is simply sign the encrypted provider record.

A

The DHT server can then verify that the signature of the encrypted provider record, which the server cannot read, matches my connection with them. In turn, it's also good for the client: when a client gets the encrypted provider record along with the signature, they can decrypt the provider record to get the peer ID, and they can verify the signature against this peer ID.

A

So that's roughly the spec. What will change? We will have the encrypted peer ID, which we encrypt using the CID.

A

We will have the signature of this encrypted peer ID, and so the new provider record will be a mapping from the new DHT identifier, the hash of the CID, to the encrypted peer ID, and we also want the signature to be there.

A

That's the easy version. In practice we're going to encrypt with the multihash, which is contained inside the CID, and we will hash with a constant string in order to avoid collisions.

A

So now, how will the system work? I still look up the second hash of my content in the DHT, which is where I will store my record. It will not be exactly the same location as today, because the salt will be different, but the process will be exactly the same. So I look for the closest peers, and now I have my encrypted peer ID: I encrypt my peer ID with the CID, I sign it, and I send it to those peers in the DHT.

A

Then a client that wants to look up some content looks up the hash of the CID, or even just a prefix of it, so that the nodes don't learn a lot about the content being accessed, and what it gets back is a bunch of provider records. That's a system parameter that we can adjust: with k-anonymity, we can of course choose the k.

A

So here I get, say, four different encrypted provider records, and I discard all the ones I'm not interested in, because I'm interested in only a single one. I'm able to decrypt it because I know the CID, and once I decrypt it I can verify the signature, because it was signed with the private key of the peer ID that is storing the content. So I have the guarantee that the peer ID I have at least knows the CID.

A

A malicious node could not have created a valid encryption using the CID as a key, and I can get my content from there. So now, what privacy guarantees does it bring? It brings us k-anonymity and plausible deniability, as the DHT will serve us multiple provider records.

A

So if we consider the DHT as one entity, it cannot determine exactly which content I am accessing. Before, my peer ID could be associated directly with the CID I was looking for, and now it can only be associated with the hash of the CID.

A

But it means that if the adversary already knows the content I am looking for, they can still know what I'm accessing.

A

So, if we take an example, say there's a decentralized, YouTube-like platform on IPFS, and we have a global adversary that crawls it and indexes all of the videos. They can in turn compute the hashes of all of the CIDs, and so, when I make a request with this identifier, they already have it, so they can still track my request.

A

But in the case where I upload, or rather advertise, my holiday pictures on IPFS and I want to share them with my friends, nobody knows the CIDs except the people I shared the link with, and so nobody else can actually know what data is being accessed. We also get some writer privacy improvement: because the provider record is encrypted, it's not in the clear, and we no longer know who is storing specific content.

A

However, the DHT server will still know the content provider because you have to open a direct connection with them, so they still know, but the rest of the DHT doesn't know.

A

And now let's go to the overhead. The overhead is quite low, I would say, because there is no additional RTT or additional hop in the routing; routing works exactly the same.

A

There is no additional packet when doing the lookup; it's just that when we retrieve the provider records, we retrieve k of them, the k of k-anonymity, instead of one, so the number of bytes on the network will increase.

A

On the storage side, the DHT server nodes need to store provider records, and those provider records will be larger, because they will now contain a signature in addition to the peer ID. There's also a computation overhead, because a signature has to be produced for each provider record, and provider records have to be republished over time, every 24 hours, which means that every 24 hours you need to produce a new signature for each piece of content that you are providing.

A

So now, the changes that it would imply for the IPFS environment: only libp2p has to be upgraded. When I say "only", I mean go-libp2p, rust-libp2p, nim-libp2p, js-libp2p, so all DHT implementations have to be upgraded, but then the applications building on top of libp2p or IPFS would automatically benefit from this. As an application builder, you don't need to change anything.

A

These changes integrate into IPFS Reframe: there's a pull request where we specify the new private content request. The double hashing approach will also be implemented by the indexer, so this approach would serve the DHT and the indexer with the same interface. Elizabeth from ChainSafe, who couldn't make it today, is currently working on the implementation.

A

So yeah, you can check it out if you're interested.