IPFS IPFS þing 2022, 9 Aug 2022

From YouTube: Private Set Intersection for Bitswap - @tschorsch - Content Routing 2: Privacy

A

Thanks for inviting me and for having me share our research. My name is Florian. At TU Berlin I head a research group, and we are working on the intersection of distributed systems on the one hand and security and privacy on the other hand. Today I would like to share some of our ideas, in particular a very specific idea for a very specific privacy problem that we see in the Bitswap protocol, and that's the talk.

A

So, let's dive in. The problem that we see is that CID requests, or Bitswap requests in general, leak the client's interest, and that might be quite sensitive with respect to what kind of data clients are interested in. We could also frame it the other way around: providing a certain kind of data might be just as sensitive. So we see that CID requests leak the interest.

A

We could even go further: seeing a particular interest in a particular set of CIDs might be used to fingerprint clients, for example, and can therefore also be used to track them over time. So that's the motivation for the setting. Our primary goal, and as I said it's a very specific problem that we are looking at, is to obfuscate these items of interest, and we have a few main metrics to evaluate this primary goal.

A

On the one hand, we look for low latency: the number of round-trip times should be small, ideally no more than the Bitswap request or the handshake needs anyway. In addition to that, we say the computational effort on the server side, so at the content provider, should be fairly small.

A

Otherwise we have an attack vector with respect to denial-of-service attacks, so this should be quite small and scalable. Optional, which we would consider a nice benefit, is low cost on the client side; on the other hand, if the client who is interested in some data has to perform some additional computation, that is fine with us.

A

Still, we don't want to increase the overhead unnecessarily, so this is also something that we keep in mind, but it's not the primary metric.

A

In order to tackle this goal, we see two directions that we can go in. One is a cryptographic approach and the other is a network-level approach.

A

Network-level approaches include something like forwarding, rerouting, and caching, all to obfuscate the original requester by using indirections and so on. That might help, and the quality of privacy that we get from this kind of approach is, I would say, something like plausible deniability: you can deny that you are the requester of a certain CID because it was rerouted, or could have been rerouted. So that's the level of these network approaches.

A

The one that I would like to focus on today is a cryptographic approach. The pro of cryptographic approaches is that they provide stronger guarantees that we can derive, so it's more clearly defined what we can gain. What I would like to discuss today is a particular cryptographic approach, private set intersection. I would like to share this idea with you, and I'm very curious about your feedback on it.

A

So let's look at this main idea, and consider it also as a general tool for solving these or similar problems. Okay, so what is private set intersection? Private set intersection is a cryptographic approach, and the main idea is this: you have two sets, which could be the client's wants and the server's haves, and what we want to find out is whether the server has the CIDs that I'm interested in as a client.

A

What we basically do here is calculate the intersection of these two sets, if we consider the wants of a client as one set and the haves of the server as another. What we are interested in is getting this intersection to the client, so the client should learn whether a server provides the particular CID that he or she is interested in, without revealing the whole set of interests to the server.

A

That is the main goal that private set intersection can indeed provide.

A

It can provide this particular kind of learning: the client learns the intersection without revealing the set to the server. To be fair, what you usually still learn as a server, and I'll get to all these implicit implications and assumptions in a second, is the set size. So depending on which exact protocol you are focusing on, there is a certain information leak to the server, but nothing that I would consider incriminating. All right.

A

So that is the general setting, and now let's talk about how we can bring that to life, how we would think of integrating that into the Bitswap protocol.

A

So here's an overview of how CID requests, or how the Bitswap protocol, work. You probably all know that better than me, but anyway, let's briefly go through this. You have a client, and the client usually sends want-haves to a number of servers.

A

These servers then answer, and what's interesting in particular is the HAVE CIDs, so the list of CIDs that a particular server can offer. But in the usual process, all the servers also learn the whole set of wants. So that's exactly the big problem: by distributing and broadcasting all my want-haves, all the servers, independent of whether they have the data or not, will learn my interests.

A

As soon as I know from which server I can download the file, I can send my want-blocks for this particular CID and then get the data transfer going. There are also the DHT requests; Will already mentioned this differentiation in his introduction. But for this talk, and for simplicity, we ignore the DHT requests for a second. So what does that give us? If we now simply assume we can perform private set intersection, what would that give us?
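
As a rough illustration of the leak described here, the following toy model broadcasts a want list to every server and records what each server observes. It is only a sketch: it does not use the real Bitswap wire protocol or any Bitswap library API, and the function name, server names, and CID strings are hypothetical.

```python
def broadcast_want_haves(wants, servers):
    """Toy model of plain want-have broadcasting.

    wants:   set of CID strings the client is interested in
    servers: dict mapping a server name to the set of CIDs it holds
    Returns (haves, observed): the HAVE reply of each server, and the
    want list each server got to see.
    """
    haves = {}
    observed = {}
    for name, held in servers.items():
        # Every contacted server sees the full, unencrypted want list ...
        observed[name] = set(wants)
        # ... but only some of them can answer with a HAVE.
        haves[name] = set(wants) & held
    return haves, observed


servers = {"server0": {"cid-1", "cid-7"}, "server1": {"cid-9"}}
haves, observed = broadcast_want_haves({"cid-1", "cid-2"}, servers)
print(haves)     # which server can serve which wanted CID
print(observed)  # what each server learned about the client's interests
```

In this model `server1` holds none of the wanted CIDs and still observes the complete want list, which is the non-selective leak the talk sets out to mitigate.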

A

What it would give us is that this list of want-haves will indeed be protected, except for its size. The size will be leaked to the server, as I mentioned, but the list is still protected: it will be encrypted, in a sense, and the server cannot know the exact CIDs that the client is interested in. As soon as the server answers the client, the client learns something about the server's set, and in particular that means the client learns which servers have or do not have a particular CID.

A

So if server 0 answers with a HAVE for a particular CID, then the client learns that this server has the data. That is also an information leak, but we could argue that for a server that provides a file, this is acceptable. You could also draw up an attacker model where this is a no-go.

A

It really depends, as Will introduced, on the attacker model that we define, and for this protocol we say it's okay to reveal this information to the client. Then, as soon as the client starts the file transfer and sends the want-block for a particular CID, the server also learns the interest in this particular file. Again, this is an implicit assumption that we make: as soon as you start the file transfer, it is okay to leak what you are interested in.

A

This is because it still limits the exposure of the client's interests: they cannot be observed by everyone, only by those who provide these files.

A

Now, of course, you can start designing certain attacks, and we can discuss that if we have time in a Q&A session. But still, I hope you see that it really limits the amount of information about the interests of a particular client.

A

So what we are envisioning is a protocol that is based on elliptic-curve Diffie-Hellman. It's a very particular kind of approach that fulfills our privacy metrics, because it puts the heavy computational work on the client side and not so much on the server side. We argue that it's fairly doable for the server, since its part is only done once, and it will not introduce many additional RTTs. We could also design it so that we have no additional RTTs. So that's good.

A

It all begins with an initial HAVE by a server. This is why I marked it with a star: you could think of it as an inventory.

A

We could also name it an inventory message, but we stick to the semantics of the HAVE message. It could be sent and distributed unsolicited among the neighbors by the server, or it could be piggybacked as soon as the server answers a want-have with a HAVE message. It depends, but for the clarity of this presentation I will consider it here as a single, separate message that is similar to an inventory message. What you encode in this set, basically, is an encrypted set of your actual CIDs.

A

You use homomorphic encryption: you map each CID onto an elliptic curve and then you take it to the power of a random number. This random-number construction is basically the first part of the Diffie-Hellman key exchange. Then you send this encrypted set, U, to the client. The client stores this set and produces the want-have, the set V, in the same way with its own random number; this is the second part of the Diffie-Hellman key exchange.

A

Then we have a re-encryption step, where the server takes the set V, iterates over each item in V, and re-encrypts it by raising it to the power of the server's random number, producing W. The client can then use W to check whether the items in V also exist in the set U.

A

You can do the math here, but this is really the Diffie-Hellman key exchange protocol, and you can do that without revealing any information to the server except the set size. Here you can really see that the server iterates over each item and learns the set size. The client then knows which servers have or don't have the CIDs that the client is interested in, and can initiate the want-block so that we can start the file transfer.
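
The message flow above (HAVE* carrying U, want-have carrying V, re-encrypted reply W) can be sketched in a few lines. This is only an illustration under loud assumptions: it swaps the elliptic curve for plain modular exponentiation modulo the Mersenne prime 2^521 - 1, which shows the same commutative blinding but is not a vetted or secure parameter choice, and every name in it (`hash_to_group`, `Server`, `Client`, `have_star`, and so on) is hypothetical rather than taken from the speaker's protocol or from any library. In line with the talk's own advice, a real implementation should rely on an existing, reviewed PSI library.

```python
import hashlib
import math
import secrets

# ASSUMPTION: toy group (integers modulo a Mersenne prime) standing in for
# the elliptic curve from the talk. Illustrative only, not a secure choice.
P = 2**521 - 1


def hash_to_group(cid: str) -> int:
    """Map a CID string to a nonzero group element (hypothetical helper)."""
    digest = hashlib.sha512(cid.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent() -> int:
    """Pick a secret exponent that is invertible modulo the group order P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


class Server:
    def __init__(self, cids):
        self.s = random_exponent()
        # U, the HAVE* inventory: computed once, reusable for every client.
        self.have_star = {pow(hash_to_group(c), self.s, P) for c in cids}

    def reencrypt(self, v):
        # W: each blinded want raised to the server secret.
        # The server sees only blinded values and len(v).
        return [pow(x, self.s, P) for x in v]


class Client:
    def __init__(self, wants):
        self.wants = list(wants)
        self.r = random_exponent()
        self.r_inv = pow(self.r, -1, P - 1)  # strips the client's blinding

    def want_have(self):
        # V: blinded wants, sent instead of plain CIDs.
        return [pow(hash_to_group(w), self.r, P) for w in self.wants]

    def intersect(self, u, w):
        # Removing r from H(cid)^(r*s) leaves H(cid)^s, comparable with U.
        return {cid for cid, x in zip(self.wants, w)
                if pow(x, self.r_inv, P) in u}


server = Server(["cid-a", "cid-b", "cid-c"])
client = Client(["cid-b", "cid-d"])
w = server.reencrypt(client.want_have())
print(client.intersect(server.have_star, w))  # {'cid-b'}
```

The server does one exponentiation per received item plus the one-time inventory, while the client does the blinding and unblinding work, which matches the cost split the talk aims for. The client could equally raise every item of U to its own secret and compare against W; unblinding W is used here because it also works when U is only available through a filter.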

A

So that's the major idea. What are some issues that we already identified? Well, depending on the number of CIDs that the server has locally, this can become quite a large set.

A

Each item is an encrypted CID, which bloats this set a little bit, so this is an overhead that we identified. In order to tackle it, we also thought about combining the approach with a probabilistic data structure, for example a Bloom filter.

A

We argue that cuckoo filters might be superior to Bloom filters, but Bloom filters are probably better known, so let me stick to Bloom filters here. The idea is that we map this set U onto a Bloom filter. This Bloom filter can then be inspected by the client, which still performs the same operations that we discussed before. But instead of sending each and every item individually, we hash them into a Bloom filter, and since it is a probabilistic data structure, we also get only a probabilistic answer.

A

So there might be some false positives, where one CID maps to the same bits as another CID, which gives us a certain false positive rate. With Bloom filters you can adjust this false positive rate, but we cannot exclude false positives entirely. Still, this reduces the amount of data that needs to be transferred quite significantly, and therefore it scales much better for large sets. And, as you can see, it can still be integrated into the handshake as we sketched it before.
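
For readers less familiar with the data structure, here is a minimal Bloom filter sketch. It is a generic textbook construction, not the filter layout of the proposed protocol: the class name, the SHA-256-based double hashing, and the parameters are assumptions chosen for illustration. In the setting of the talk, the inserted items would be the encrypted CIDs of the set U rather than plain byte strings.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: bytes):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


inventory = BloomFilter(num_bits=1024, num_hashes=4)
for item in (b"encrypted-cid-1", b"encrypted-cid-2"):
    inventory.add(item)

print(b"encrypted-cid-1" in inventory)  # True: no false negatives
print(b"encrypted-cid-9" in inventory)  # usually False, occasionally a false positive
```

The filter has a fixed size regardless of how many items were inserted, which is where the bandwidth saving comes from; the price is that a positive answer is only probabilistic.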

A

As I said, this HAVE* can also be piggybacked on the regular HAVE, and then we have the exact same protocol flow as the original Bitswap, and we don't need any additional RTT. All right. So, let's see what we have accomplished with this protocol.

A

First of all, our goal was to protect client privacy and strengthen server privacy, and we argue that private set intersection can be a tool to achieve this goal. In particular, what it can achieve is that it really mitigates the non-selective distribution of CIDs: you leak CIDs to the relevant servers, but not to the whole network, and that might help to increase privacy. From our initial assessment, our gist is basically that we can indeed integrate this into the Bitswap protocol with very reasonable overhead.

A

I hope I was able to at least give you an intuition of what it would take to integrate that. There is a certain initial overhead on the server side: it needs to generate this HAVE* message, the inventory. But we argue that this can be a single one-time computation that does not need to be repeated every time, so it's fairly acceptable in our opinion. There is an overhead for the want-haves, but it is negligible, because we assume that this set is not as large as the HAVE side.

A

All follow-up requests do not require any resending of stored CIDs. This is great as well, so we really need to send this HAVE* only once, and we have no additional RTTs, because messages can also be piggybacked.

A

Where are we at the moment? We are still looking into the issue of large sets, in particular: what does it mean to integrate a probabilistic data structure such as a Bloom filter?

A

Where do we generate it? How do we adjust it? Do we hold multiple Bloom filters for different set sizes, and so on? So that is still work in progress. Our next steps are to build a prototype implementation in the Bitswap protocol and run an evaluation, to really understand what it would take to integrate this and also how it would perform on a more global scale.

A

So we would like to simulate that. In order to do that, we have already identified some crypto libraries that offer this elliptic-curve-based private set intersection, following the golden rule: don't implement your own crypto. We identified some that we can utilize here, and that's the current approach. For future work, private set intersection also allows the data transfer itself to be secured, so that not much information leaks there either.

A

So we are looking into integrating the data transfer into this protocol as well. We are also thinking about how we can take the main idea to DHT requests, which I ignored for most of my talk today. All right, so that concludes my talk. If you're interested in this work or the work my group does in general, please reach out. You can find my contact details here, and you can also find me on Twitter.

A

I wish you a great conference. I would love to be there, but unfortunately, due to teaching obligations, it's not possible for me to travel at the moment. This will change very soon, because the semester break is in reach.

A

Thank you very much and if you have any questions, feel free to ask. Thank you.

B

First of all, thank you so much for this work. It's really great to have it, and I look forward to being able to deploy it into the network. I have a couple of questions. Maybe you mentioned this, but how big are the sets, like the Bloom-filtered version?

B

You said reasonable overhead, but I just want to get a sense of how much overhead that is.

A

I did not entirely get the question.

A

I understand that you are interested in the overhead of the different sets and the filters that you end up with? Okay, great. So the size of a Bloom filter depends on the false positive rate. You can size it based on an assumption about the number of items that you would like to hash into the Bloom filter and the false positive rate that it should yield, and there's a clear formula that you can use for that.

A

For many, let's say, reasonably large set sizes, you can assume something like 256 bits. We really talk about bits here, not bytes, so we would measure that more in the bit range. That's something that we consider. It also depends a little bit on how many hash functions you use: more hash functions means you are more resilient to false positives, but in the end this also enlarges the Bloom filter.
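
The formula is not spelled out in the talk. A common textbook approximation, given here as an assumption rather than as the sizing rule the speaker's group uses, is m = -n ln(p) / (ln 2)^2 bits for n items at target false positive rate p, with k = (m / n) ln 2 hash functions. The helper name below is hypothetical.

```python
import math


def bloom_parameters(n_items: int, false_positive_rate: float):
    """Return (m, k): filter size in bits and number of hash functions.

    Uses the standard approximation for an optimally configured Bloom filter.
    """
    m = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


for n in (100, 10_000, 1_000_000):
    m, k = bloom_parameters(n, 0.01)
    print(f"n={n}: m={m} bits, k={k}, bits per item={m / n:.2f}")
```

Under this approximation the number of bits per item depends only on the target false positive rate, so the total size grows linearly with the number of CIDs a server holds. That is why an assumption about the set size has to be made up front.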

A

At some point, the Bloom filter will be full and useless, so you really have to make some assumptions about the set size, and that is the crucial point. That is why I said we are also thinking about setting up multiple categories of sizes, so that we can dynamically go one step larger and switch to a larger Bloom filter if, for some reason, we see that the set size increases. And so we want to be independent of the sets that we see and can measure in the IPFS network.

A

For example, if we look at the cid distribution among servers.

C

Does your set have to model the full space of CIDs, or just the CIDs on a given server? Presumably it's not.

C

Because, like, that's what is potentially queried, and so that...

C

Billions, but your false positive rate is going to be any query which is not just the set of that server, right?

D

It's the number of CIDs you put into it. So basically at least...

C

One point: seven bits to get a percent filter um and then, like.

D

You care about how many things you put in the set, not how many things could exist in the universe.

A

So the set size depends on how many CIDs this particular server holds. It's not the entire set of CIDs that exist in the whole network; that's not necessary. It's really only what this server holds locally, and this is what requires the Bloom filter to have a certain size.

D

When do I have to remake U? Do I have to remake U for every single client, or can I have one U that I have just capitalized?

A

Every server has to do this once, and it can share the result with multiple clients. That's the beauty of it: it's really necessary to do this only once. The nice thing about using a cuckoo filter is that you can also delete items from the filter.

A

As soon as the set of a server changes, if you're using a Bloom filter, you might be forced to recalculate it. You can add new CIDs, but as soon as you drop a few CIDs, you usually have to recalculate the Bloom filter. That is not a huge overhead either, but it is something that we can avoid if we use a data structure that also offers deletion, and the cuckoo filter can provide that. So you can use incremental updates, adding new CIDs and removing CIDs, without recalculating everything from scratch.
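
To make the deletion property concrete, here is a compact cuckoo filter sketch. It follows the usual textbook design (short fingerprints, two candidate buckets per item, random eviction when both are full), but the class name, bucket layout, fingerprint width, and hash choices are assumptions for illustration and are not taken from the speaker's prototype.

```python
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count keeps the XOR trick in _alt reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: bytes) -> int:
        return self._hash(b"fp:" + item) % 65535 + 1  # 16-bit, never zero

    def _alt(self, index: int, fp: int) -> int:
        # The partner bucket depends only on the current bucket and fingerprint.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def _candidates(self, item: bytes):
        fp = self._fingerprint(item)
        i1 = self._hash(item) % self.num_buckets
        return fp, i1, self._alt(i1, fp)

    def insert(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            # Evict a random resident fingerprint and move it to its partner bucket.
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


inventory = CuckooFilter()
inventory.insert(b"encrypted-cid-1")
print(inventory.contains(b"encrypted-cid-1"))  # True
inventory.delete(b"encrypted-cid-1")           # incremental update, no rebuild
print(inventory.contains(b"encrypted-cid-1"))  # False
```

Deleting only works safely for items that were actually inserted; removing an item that was never added can evict the fingerprint of a different item that happens to collide with it.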

D

Two notes. One: it's technically a pub/sub protocol, so that may be one other thing here. This works really well in a request-response setting, but it does mean that I can't send you a want and have you service it sometime in the future if you end up with the data. So it means that if you don't have it right now, I'm not going to get it right now; I have to keep re-asking.

C

The other note is that we do also have nodes that have billions

D

of items, or at least many millions of items, and that becomes a bit of a problem here. But this is really good for, like, the small...

B

It seems to me that if you find a good size for the Bloom or cuckoo filter, you can actually take the Bloom filters and aggregate them across different servers. For example, if you had another downstream server beyond this one that was connected to the other server, you could start doing transitive want-haves in a pretty efficient way. So it's not necessarily about the security.

B

You could take the other server's set, aggregate it with this one, and drop that into the client, or even just propagate the neighboring sets. That way you have a larger search space that is still very compressed in terms of the usual overhead.

A

So indeed, yes, you can think about that. But in order to accomplish the combination of Bloom filters, we have to keep a few things in mind.

A

We need to agree on the same hash functions, which is fairly easy to do in a protocol. But then we probably need to rethink the private set intersection part, because we have a private value here, the random number chosen by the server. This is a server secret, and if we combine Bloom filters from different servers, we will mix and match the different values from different servers, and that makes it a little bit more difficult to do.

A

But if we, for example, simply skip the whole private set intersection part, use Bloom filters only, and let the privacy guarantees rest only on the idea of false positives, that could already give you some kind of plausible deniability, by the way. Then you could also start combining them as well. That is true, yes.
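
The plain-Bloom-filter variant discussed in this exchange can be sketched as follows. Two filters that share the same size and hash functions can be merged with a bitwise OR. The sketch deliberately leaves out the private set intersection blinding, as the answer above suggests, and all names and parameters in it are hypothetical.

```python
import hashlib

# Both servers must agree on these parameters for the union to be meaningful.
NUM_BITS = 2048
NUM_HASHES = 4


def positions(item: bytes):
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def make_filter(items) -> int:
    """Build a Bloom filter, represented as a Python int used as a bit array."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: bytes) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


def union(filter_a: int, filter_b: int) -> int:
    # Bitwise OR yields the filter of the combined set.
    return filter_a | filter_b


server_a = make_filter([b"cid-1", b"cid-2"])
server_b = make_filter([b"cid-3"])
combined = union(server_a, server_b)
print(might_contain(combined, b"cid-1"), might_contain(combined, b"cid-3"))  # True True
```

The merged filter answers for both servers at once, but it is fuller than either input, so its false positive rate is higher, and a positive answer no longer says which server holds the CID.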

D

I don't know if this was mentioned earlier. This philosophy is for private content routing, so like large service videos and publish sets to somewhere. Yep.

C

Cool. Thank you. Thank you very much.