From YouTube: IETF104-PEARG-20190325-0900
Description
PEARG meeting session at IETF104
2019/03/25 0900
https://datatracker.ietf.org/meeting/104/proceedings/
A

B
Okay, the slides are not quite scrolling properly, but okay. So I've been looking at internet measurement since 2013. I worked on the PATHspider tool for internet path transparency measurement, and I've been contributing to Tor projects since 2015. I want to set the scope of this early on: this isn't about the ethics of internet measurement. That's a very, very wide, broad topic. This is specifically about safety, making sure that other users of the networks that you're measuring don't come to harm.
B
There's a lot of related work in this area. Some people have already pointed out other related work I didn't know about on the PEARG mailing list. If you know of other related work that I haven't heard of, please do let me know. Lots of people in different communities have come up with their own assessments of safety and ethics in their measurements, but there hasn't been much inter-community discussion on this.
B
So I'm going to give a little bit of background on where I'm coming from. This isn't going to be a talk about how Tor works, but hopefully I'll give you enough that you can understand what's going on. Tor primarily is producing open-source code, and then there's a volunteer-run network, and this network provides security, privacy, anonymity; it's robust, it's authenticated, it provides integrity. Depending on how you use it, it gives you these properties, and we need to monitor this network and make sure that it's working and scaling.
B
Okay, and that it is healthy. One of the things that we measure from this network is the number of directly connecting users. At the moment we have somewhere between 2 million and 8 million daily users of Tor, according to a recent paper. We can use this data to detect censorship events: if there's censorship in a country, then maybe the number of users of Tor goes up; if there are attacks against the network, or maybe Tor is blocked in a country, the number of users goes down. We can also evaluate when we make changes to the software.
B
The philosophy that we've tried to follow is that we only handle public, non-sensitive data, and each analysis goes through a rigorous review, often by academics, before publication of data or analysis. We're guided by the Tor Research Safety Board, which is a group of academics, Tor researchers, and Tor developers, and they basically provide a service for members of the wider Tor community to assess whether or not research that people want to do on Tor is safe.
B
The three principles that we try to follow are data minimization, source aggregation, and transparency. Data minimization is where we try to capture only the least amount of data possible to answer the questions that we have, and the level of detail should also be as small as possible. So it's not just about limiting the properties that are captured, but also the resolution of them.
B
In the Tor network, we capture at relays, and the relays are operated by volunteers. We have distributed trust across those relay operators, and they are doing the aggregations before submitting any statistics. We also make sure that the raw numbers exist for as little time as possible, and then we throw them away as soon as we have our aggregate. And everything is open source.
B
We publish design documents and technical reports on how we are doing things, and hopefully people are looking at these, and if they spot any problems, then we're very happy to fix them. So, going back to the general case: the shortcut to making sure that you're performing safe measurement on a network is to have no one else on that network.
B
So one case study is counting unique users of Tor. The easy way, the web-analytics approach, is you track all the IP addresses you've seen and then work out how many unique IP addresses you saw over a day. In 2010 we came up with a method of measuring the number of users in Tor where we didn't want to count all of their IP addresses, and here's a little bit on how it all works.
B
The first step of a Tor client connecting to the network is that it needs to have a view of the network, so it reaches out to a directory server to get a list of all the currently running Tor relays. We can count, sorry, count the number of directory requests that are made, and from that infer the number of Tor users. So we don't handle IP addresses at all for this.
B
So that's how we get this graph, and this is in an area of problems known as the count-distinct problem: where you have a number of things, you want to know how many unique things there are across this data set. A few methods have been developed for doing this in a privacy-preserving way, one of which is HyperLogLog, from Google.
B
This was originally designed for cases where you have really large data sets and you want to count the number of unique items in them in a probabilistic way. For us, a "really large data set" is one IP address; we don't even want to have that. So this can be adapted for that use case, to keep track of the number of IP addresses we've seen.
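The probabilistic counting HyperLogLog does can be sketched roughly like this; the register count, hash choice, and bias-correction constant are the textbook ones, not anything Tor-specific:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: m = 2^p registers each keep the max leading-zero rank."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:        # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"203.0.113.{i}")      # pretend these are client addresses
# hll.count() lands within a few percent of 10000, using only 1024 registers
```

The privacy point is that the sketch never stores the addresses themselves, only coarse register values.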
B
PrivCount is another system: distributed machines with counters use secret sharing to submit their results to what we're calling a tally reporter, and then the aggregate from the network can come out, but the individual counts can't be disclosed. Another system is private set-union cardinality, and I'm told that currently this is computationally infeasible, but I hope that in the future this might also be an alternative. And there are a whole bunch of others as well: RAPPOR and Prochlo from Google, and Prio from Mozilla.
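The secret-sharing idea behind PrivCount-style aggregation can be sketched with simple additive shares; this toy leaves out PrivCount's noise addition and real transport, and the relay counts are made up:

```python
import secrets

PRIME = 2**61 - 1  # shares live in a field; all arithmetic is mod a large prime

def share(value, n):
    """Split a counter into n additive shares that sum to value mod PRIME."""
    parts = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % PRIME)
    return parts

# Each relay splits its local count among three tally reporters.
relay_counts = [120, 87, 301]
tally_reporters = [0, 0, 0]
for count in relay_counts:
    for i, s in enumerate(share(count, 3)):
        tally_reporters[i] = (tally_reporters[i] + s) % PRIME

# Only the network-wide aggregate is recoverable; no single reporter
# learns any individual relay's count.
total = sum(tally_reporters) % PRIME
print(total)  # -> 508
```

Each reporter's running sum is uniformly random on its own; only combining all of them reveals the total.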
B
This may or may not be something that could be considered safer; I'm going to have a discussion on it nonetheless, and I want to ensure that all types of possible harm are covered. In some cases, thinking back to a study that I did on ECN transparency through the internet, we found there were some routers where, if you send an ECN packet through, they crash and refuse to route future packets. Now, thankfully, those routers are gone, but that has the possibility of crashing someone's home router, which is not something you want to do at scale.
B
So, going back to the two-million number: there was a test done with the new PrivCount strategy, and it discovered eight million unique IP addresses per day. When we revisited what we were really counting in that graph, we found we were actually counting the number of concurrent users. So while we had thought that that was the average session length, it turns out we were wrong. There are in fact eight-million-ish daily users by unique IP address, but then that also doesn't consider NAT and a whole lot of other issues.
E
I had a quick read of the draft and I think it's a good thing, and I hope it finds a home somewhere, whether it's in PEARG or something. I think it's a good piece of work to continue, and I guess that's just positive feedback, basically. Okay, thank you. I guess I have a question too: to what extent are you trying to reach out and get input from other people doing kind of large-scale surveys on the Internet?
B
Definitely; all feedback is good feedback, large-scale measurements, small-scale measurements. There are obviously different classes of measurements. You've got your active measurement, where you're sending probes out into the internet and maybe talking to other people's servers, in which case you might incur the bandwidth costs or whatever, and then you've got the passive measurements, which is large, large scale. But...
E
B
A
C
Yeah, sorry, so I have another question. You were mentioning that PrivCount is primarily the system that you currently use in the Tor project to do all this measurement, and I think there were discussions in the past about potentially using Prio. Have there actually been experiments done using Prio for Tor, and can you comment on the differences between Prio and PrivCount?
B
Okay, so I can't comment on what the differences are; I don't know the Prio system that well. But what I do think is that Prio is in the browser, right? So we currently don't have any client-side metrics at all; Tor Browser does not have any telemetry. All of our telemetry is in the relays. So whether Prio is more suitable for client-side metrics, I'm not sure. It's possible that we might do that at some point in the future, but it's also possible that the community might not want that at all, ever. So, yeah, sure.
F
C
G
H
Yes, I'm here; can you hear me? Yes? Awesome, thanks. So, thanks for the opportunity. My name's Ryan Guest; I'm a Software Architect at Salesforce, and I work on our security and data privacy teams. My email is up here if you want to reach me after, and I'm also available on Twitter. Next slide? Okay.
H
Looks like there's a little lag on the presentation I can see in the remote video from where I'm at. So, what we'll talk about is really two things. One is some of the techniques for identifying personal data in application logs. As a SaaS provider we have lots of enterprises using our system, and we try to strike a balance between providing log analytics tools to developers and analytics folks as well as to our customers. We actually send our logs to customers of our system, so they can get an idea of what's happening in their organization.
H
So, for identifying personal data, we have really two techniques. One is sort of dictionary-based, and this is a general-purpose technique that can be used across many domains. We use it across acquisitions as well: when a new company joins or merges with us, we have sort of a general-purpose tool we can run against their log data or each type of data store.
H
The main thing here is we use common names; we started with the top census names over a certain amount of time. We also look at different location identifiers; US states is a popular one, and that can give us insight into where potential customer data may be leaking into logs. Next slide.
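A dictionary scan of this kind can be sketched in a few lines; the tiny name and state dictionaries here are illustrative stand-ins for the census-name and location lists described:

```python
import re

# Toy dictionaries; a real deployment would load, e.g., top census names.
COMMON_NAMES = {"james", "mary", "john", "patricia", "robert", "jennifer"}
US_STATES = {"california", "texas", "new york", "florida"}

def find_pii_candidates(line):
    """Return dictionary hits suggesting personal data leaked into a log line."""
    lowered = line.lower()
    hits = [("name", w) for w in re.findall(r"[a-z]+", lowered) if w in COMMON_NAMES]
    hits += [("state", s) for s in US_STATES if s in lowered]
    return hits

print(find_pii_candidates("ERROR: lookup failed for Mary in New York"))
```

A hit doesn't prove leakage, only flags a line for review, which matches the "insight into where data may be leaking" framing above.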
H
We also have a couple of formats unique to our domain. So we have user IDs, which we know always start with 005 followed by 15 other alphanumeric characters, and then various different types of custom identifiers you can think of that follow a certain format; so, things unique to our domain.
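A format like the user-ID shape just described is easy to scan for with a regex; this sketch assumes exactly that shape ("005" plus 15 alphanumerics) and a made-up log line:

```python
import re

# Assumed shape from the talk: IDs start with "005", then 15 alphanumerics.
USER_ID = re.compile(r"\b005[A-Za-z0-9]{15}\b")

line = "login ok for 005Ab0000012XyZABC from host-7"
print(USER_ID.findall(line))  # -> ['005Ab0000012XyZABC']
```

Domain-specific patterns like this tend to have far fewer false positives than general-purpose dictionaries.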
H
So then, from there, after we've found places where data may be entering our logs, how do we go and anonymize that? I've put together a collection of eight of the most popular techniques we use. We use a little more than this, but really what we try to do is develop a tool set that we share with developers.
H
So they can have these and they can know when to use one or another. It may seem self-evident, but data deletion is the first one we start with. There's a certain class of data that we don't want to see at all, things like Social Security numbers and credit card numbers. We don't want any of our downstream systems to have to deal with these, so we just drop them right at the start. Next slide.
F
H
This one is really popular when we talk about things like error messages in our system. A lot of times we don't care what generated an error message, but more how many times it happened. So we'll minimize everything we can from the error and just bucket by it. That gives us insight into how things are doing without the full details of whatever generated the error or caused the exception. Next slide.
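Bucketing error messages by shape might look roughly like this; the placeholder rules and sample errors are illustrative, not Salesforce's actual ones:

```python
import re
from collections import Counter

def bucket(msg):
    """Strip variable details so only the error's shape remains."""
    msg = re.sub(r"'[^']*'", "<value>", msg)   # quoted values -> placeholder
    msg = re.sub(r"\d+", "<num>", msg)         # numbers -> placeholder
    return msg

errors = [
    "timeout after 30s for user 'mary'",
    "timeout after 45s for user 'john'",
    "disk full on volume 'v1'",
]
counts = Counter(bucket(e) for e in errors)
print(counts["timeout after <num>s for user <value>"])  # -> 2
```

Counting buckets instead of raw messages keeps the "how many times" signal while dropping who triggered it.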
H
Next is something that is called generalization. So, where I live, the ISP has sequentially numbered every residence, so given an IP address you can drill down to the exact location that someone is at. One of the requests from our marketing team was to track IPs, but really they were using them to figure out what the top countries using their system were, so we just reduce the granularity: run it through a GeoIP location service and only bubble the country up into the logs or analytics.
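Generalizing an IP down to a coarse location could be sketched like this; the prefix table stands in for a real GeoIP service, the addresses are documentation ranges, and the country mapping is made up:

```python
# Toy generalization: replace a precise IP with only its coarse location.
GEOIP_PREFIXES = {"203.0.113.": "AU", "198.51.100.": "US", "192.0.2.": "NL"}

def generalize_ip(ip):
    """Keep only the country so exact locations never reach logs or analytics."""
    for prefix, country in GEOIP_PREFIXES.items():
        if ip.startswith(prefix):
            return country
    return "unknown"

print(generalize_ip("203.0.113.42"))  # -> AU
```

The point is that the precise address is dropped at ingestion, so downstream analytics can only ever see the country.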
H
A similar thing to IP addresses that we've done is to use categories. This use case is popular with our performance team, who look at response times of our service from different places around the world. Really, what they want to do is know how far away the client is from our originating data center and track performance there.
H
Tokenization is also something that we use. We keep a key-value store of a set of tokens over a domain that we define, and everywhere downstream we replace the value with the token. That gives us the advantage of serving things like right-to-be-forgotten requests or data-retention requests where, after a certain amount of time, we want to forget everything we know about the user: we just delete that token, without having to worry about changing it in a bunch of immutable data stores.
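A token vault of the kind described can be sketched as a small key-value mapping; the class and token format here are illustrative:

```python
import secrets

class Tokenizer:
    """Key-value token vault: downstream systems only ever see the token."""

    def __init__(self):
        self._vault = {}      # token -> original value
        self._reverse = {}    # value -> token, so a value reuses its token

    def tokenize(self, value):
        if value not in self._reverse:
            token = "tok_" + secrets.token_hex(8)
            self._vault[token] = value
            self._reverse[value] = token
        return self._reverse[value]

    def detokenize(self, token):
        return self._vault.get(token)

    def forget(self, value):
        """Right to be forgotten: drop the mapping; old logs keep a dead token."""
        token = self._reverse.pop(value, None)
        self._vault.pop(token, None)

vault = Tokenizer()
tok = vault.tokenize("mary@example.com")
# downstream, immutable logs store `tok`; deleting the mapping "forgets" the user
vault.forget("mary@example.com")
```

Deleting one vault entry effectively erases the value from every immutable copy at once.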
H
I believe there's a talk later about differential privacy, so I won't go into much detail, but essentially what's important to us is that for certain categories of numerical values, we add an amount of noise along a fixed distribution, and we use that to give some privacy. And so when we have large amounts of data and are doing more complex machine-learning models,
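The "noise along a fixed distribution" idea is classic Laplace-mechanism differential privacy; a minimal sketch, with a made-up count and an illustrative epsilon:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) via inverse CDF; the stdlib has no direct sampler."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count, epsilon):
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
# Each single release hides the exact value; across many releases the noise
# roughly cancels, which is why aggregates and trained models still work.
estimates = [noisy_count(500, epsilon=0.5) for _ in range(2000)]
average = sum(estimates) / len(estimates)
assert abs(average - 500) < 5
```

Smaller epsilon means more noise per release and stronger privacy, at the cost of per-value accuracy.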
H
The noise really cancels out, and you can make use of a set of data without having the exact details exposed. Next slide. The last technique I'm going to talk about is encryption with a set of access controls. There's nothing special about the encryption itself, it's standard AES for encrypting data, but really the access controls and key management provide some unique opportunities to do some cool things.
H
So you can have unique keys for a customer, for a specific tenant in our system, or for a service, and you can then decide how often you want to rotate those things and how long you retain older legacy keys after rotation. So you can do things like: if you wanted to revoke access across the whole service, you can just delete the key.
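That "delete the key to revoke access" pattern is often called crypto-shredding; here is a minimal sketch in which an insecure SHA-256 XOR keystream stands in for real AES, and the tenant name is made up:

```python
import hashlib
import secrets

def keystream_xor(key, data):
    """Toy cipher (SHA-256 counter keystream XOR); a stand-in for real AES."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# One key per tenant, held in a key manager.
keys = {"tenant-42": secrets.token_bytes(32)}

ciphertext = keystream_xor(keys["tenant-42"], b"mary@example.com")

# Crypto-shredding: deleting the tenant's key makes every copy of its
# ciphertext, in any downstream system, permanently unreadable.
del keys["tenant-42"]
```

The same trick enforces retention windows: rotate keys on a schedule and discard old ones once the window expires.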
G
H
Then the downstream systems would be unable to decrypt the data. So it has some really interesting properties for enforcing data privacy, especially over a certain time, which is most useful to us. So, next slide, and thank you for letting me present some of the work we're doing. We think it's some pretty interesting stuff. I have my contact information on the first slide; if you're doing similar things or have feedback, we'd love to hear from you, and I'll be available to take some questions.
H
I
Thanks. I don't wanna get too far into the weeds, but you mentioned some bits about data encryption and sort of the key management for that, and I was wondering if there was anything interesting to say about, you know, having the keys available for different services, or, you know, different levels of the service having different levels of access to different services, since presumably the data is only going to be encrypted with one key at a given time. But maybe I'm wrong about that.
H
Yes, so one key, but with the idea, scoped towards the application logging use case, that in the same log line there may be different keys encrypting the data. So a log may have sort of a service-specific encrypted value, and then ciphertext that's in there may be a user-specific value, and then, yeah, access controls around those. We use a variety of things, but most important for that is...
H
J
Hi Ryan, this is Pallavi; I'm from Salesforce too. So I had a question, not about the encryption of the data, but about identifying customer data, because I think to identify the data you must be following certain regexes or some known types of searches. I think that would also be of interest, because those need to be tweaked on a regular basis just so that something doesn't slip through the cracks; if the identification doesn't happen, then with all of these things, something might slip through the cracks.
H
We haven't standardized per se, and I kind of alluded to this: it's a combination of sort of general things that would work against any data set, but also things that are very specific to our organization, some strong assumptions we make about how the data is formatted, and we use a combination of those. Our teams right now are figuring out the right balance between how restrictive or how loose things are, and investigating some ML techniques.
H
We have problems with false positives. Just to give an example, some web browsers' user agents look like IP addresses; the version numbers also hit some interesting false positives there. But it's really, like you said, constantly refining, trying to get that right. And, you know, Salesforce is very metrics-driven; I try to report how well we're doing and track that, so, like you said, we make sure we're continuously monitoring this. It's not just a point in time, but constantly going forward: how do we improve this?
H
J
C
Ryan, I have one follow-up question with respect to kind of showing or asserting how effective the anonymization techniques you're using are. You had mentioned that differential privacy is one of the techniques. I'm wondering if you can comment further on how you choose things such as the amount of randomness budget that each client has, or each node has, or whatever the right terminology is, and how you choose, for example, epsilon to, you know, maximize the utility-versus-privacy trade-off for that particular mechanism. Yeah.
H
And this is something we're still very early in. Essentially, what we try to do is, you know, sort of parameterize this, and then each specific use case can provide inputs on, you know, what their model is and how private, or how much less private, they want things. So it's very much right now on a use-case-by-use-case basis. Hopefully we'll get somewhere where we've published some more general principles, but right now I'd say it's usually a cross-functional group.
H
C
H
Yeah, absolutely; I'd need to dig into that. But yeah, you're right, there's a lot of overlap, and a lot of different groups are doing very similar things. So I appreciate the opportunity to share here and sort of give, you know, our perspective on what we have found works for us, and yeah, exactly, feedback like that is generally appreciated. Yeah.
K
All right, thanks everyone. I'm going to talk about a project called Privacy Pass, which is an interesting technology in the privacy world. As a high-level overview, Privacy Pass is a lightweight zero-knowledge protocol, and before getting into the details of what it is and how we can use it, I'm going to give a little bit of context as to where this came from in relation to Cloudflare, the company I work for, and what problem sort of inspired the solution. So Cloudflare has a service that's a reverse proxy.
K
So if you have a website or web service and you're using Cloudflare, requests go through Cloudflare and then they come back; there's a TLS connection between the client and Cloudflare. If something is not cached, then the request goes all the way back to the origin, and there's another little tiny red blob there, which is for bad requests, or requests that are malicious in one way or another. And this is where the problem space occurs.
K
So, in order to reduce malicious activity, and malicious activity is spam, or comment spam, or requests that have malicious payloads, there are several different techniques used online to help protect sites against these sorts of things and reduce the load on websites, one of which is a user challenge. This can be, as demonstrated here, a CAPTCHA.
K
You might have seen this. What happens is the browser is presented with some sort of challenge to prove that it's human, and once that challenge has been passed, a cookie is issued that allows clearance to bypass this. So one of the issues here is that the default security levels are such that this applies to requests coming from clients for which the site has no previous information.
K
This is part of the web-origin policy, so every single site that you visit will be given a CAPTCHA, and there are somewhere around 11 million domains that use Cloudflare, which makes this sort of a bigger problem than it would be if it was just one site and you had to solve one CAPTCHA. If you're browsing the Internet, you're going to run into quite a few of these sites, and it's going to be very annoying.
K
So we want to reduce that problem, and one of the ways you could think of doing it is figuring out how to solve a challenge once and get back some sort of currency, or some sort of proof or token, that you did solve it, and something that's anonymous. So wouldn't it be nice to have some sort of online equivalent to cash?
K
You could do withdrawals and then make a transaction, and have these two things be unlinkable: withdrawing cash and paying with it would be unlinkable. And actually the analogy is not that great, right? If you think of cash in a world where cameras are ubiquitous, you have serial numbers on every bill, and so it actually is trackable from withdrawing from a bank account to paying somewhere else, as long as there are cameras everywhere.
K
So what's sort of a better analogy here? I would propose that a better analogy would be self-printed money that gets signed. I'll describe how this works: you would get a bill, put it in an envelope, put a serial number on it, and then take a piece of carbon paper (this is a very rough physical metaphor here), put the piece of carbon paper in, seal the envelope, and send it to an official authority, who then signs the outside and says, yeah, this is an official bill.
Then, when you open it up, you would have the official bill, and the authority that signed it would have no knowledge of the serial number. So this would be an untrackable cash bill, if you will. And this is the metaphor that kind of motivated David Chaum back in the 80s when he invented the idea of ecash, and from a high level this is based on a cryptographic property called blind signatures, or blind signing. So there are two flows here.
K
G
K
Withdraw tokens, then redeem goods. So I may go into a little bit too much math in this, but I'll try to make it easy. In terms of RSA, there are kind of two values: you have an e, which is a public key, and a d, which is a private key, and this is how Chaumian ecash works. You take your token K and you multiply it by a random number exponentiated by the public key and send it to the server; the server exponentiates
K
it by its secret key and sends it back, and you can just divide out this random number, and you'll get a pair, K and K to the d, which is essentially a token and the same token exponentiated by the server's private key. And if you want to redeem that, you send it to any third party; they exponentiate by the public key, check to see if it matches, and if it does, then great: this is something that was definitely signed by the third party.
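The Chaumian flow just described can be sketched end to end with toy RSA numbers; the primes, exponent, and token below are tiny and insecure, purely to show the algebra of blind, sign, unblind, verify:

```python
import math
import random

p, q = 1009, 1013                 # toy primes; real RSA uses 2048-bit moduli
n = p * q
phi = (p - 1) * (q - 1)
e = 17                            # public exponent
d = pow(e, -1, phi)               # private exponent (Python 3.8+ modular inverse)

token = 123456                    # the client's token K

# Client blinds: K * r^e mod n, with a random r invertible mod n.
while True:
    r = random.randrange(2, n)
    if math.gcd(r, n) == 1:
        break
blinded = (token * pow(r, e, n)) % n

# Server signs the blinded value; it never sees K itself.
signed_blinded = pow(blinded, d, n)

# Client unblinds: (K^d * r) * r^-1 = K^d mod n.
signature = (signed_blinded * pow(r, -1, n)) % n

# Anyone holding the public key can verify the signature.
assert pow(signature, e, n) == token
```

The server only ever sees the blinded value and the signed blinded value, so it cannot link the signature it produced to the token it later verifies.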
K
The server gives you a blind signature, and then later, if you see another CAPTCHA, you can take the token and the signature, send them to the server, and get a bypass: bypass that CAPTCHA without solving it, essentially. So is this it? Is this Privacy Pass? Well, not exactly. This was our original paper that we submitted to PETS, and it got rejected, because, you know, it wasn't that satisfying to use this slow 1980s cryptography, and there have been recent advancements in this field, similar to
K
things like ecash, that we could have used, and we decided eventually to look into and use. The two fundamental ideas here are, first, that of an OPRF, an oblivious pseudorandom function. This is very analogous to blinding: a client and a server compute a value such that the server doesn't know what the result is, but the server is required to be a part of the computation. And then another concept called a VRF, a verifiable random function
K
that is computed using a private key, where you can prove that the private key was used to compute it. Taking these two concepts, we came up with something we're calling a VOPRF, a verifiable OPRF, which takes concepts from both, and I'll kind of walk through what that is. In terms of inspirations and prior work, there's a lot of previous work here: Freedman
K
first came up with the OPRF; Jarecki et al. used this to do something called private set intersection; and in 2014 the real idea of a VOPRF came about. It didn't have all the features that our final Privacy Pass did, but it was used for password-protected secret sharing and a type of PAKE algorithm.
K
We came up with Privacy Pass. (How's my time? I've got quite a few things. We're good? Okay, great.) So hopefully this will be clear, but I'll walk you through exactly how Privacy Pass works. There are a few fundamental things to keep in mind. One is the setting in which we're doing the computation, and the setting is a prime-order group.
K
You can imagine this as the group of points on an elliptic curve, as for, say, elliptic-curve ElGamal, and it has to be a prime-order group, which is just a small wrinkle. In any case, group elements I'll denote with capital letters such as P or Q, and the fundamental operation you're doing on these is scalar multiplication. So you're taking a point P and multiplying it by n, that is, adding it to itself n times; that's scalar multiplication. Oh, I heard something.
K
The last two pieces: one is hash-to-group-element. This is a function that takes a scalar, a token, and outputs a group element that is random in a statistically uniform way. And then the last piece, the only kind of tricky concept, is that of a discrete-log equivalence proof.
K
This is the only place in which zero knowledge comes into play here, but the idea is that two pairs of points can be analogously related to each other: P and R can be related to each other like Q and S, using the same multiplier. So if you have P and s times P, and Q and s times Q, you can prove that there is an s such that R is s times P and S is s times Q, and you can do so without revealing what s is. This is really what's used in VOPRFs, and it's used in other places for proving that a specific private key was used; these scalars are usually private keys. So this is going to be denoted DLEQ(P:R == Q:S). All right.
K
So, with these fundamental pieces, I'll walk you through a naive construction of how this would work, and iterate through until we've kind of ironed out all the problems. Okay, scenario one: the client takes a point on an elliptic curve, T, and sends it to the server. The server has a private key, a secret number s, multiplies T by s, and sends it back. Then, later, the client will take this T and this sT and send them to the server.
K
Then the server can check that it had previously been issued by taking the T, multiplying it by s, and seeing if it equals the second point; this step is called a redemption. This is a very naive scenario, but everything here is built on multiplying by a secret value on the server side: since the server knows s, it can compute sT. The problem in this situation is that during the issuance and the redemption, the same sT is sent.
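The naive scenario, and the malleability problem discussed next, can be sketched in any prime-order group. Here modular exponentiation in a toy subgroup of the integers mod a prime stands in for elliptic-curve scalar multiplication; all parameters are illustrative:

```python
import secrets

# A subgroup of prime order Q inside the multiplicative group mod P.
P = 2039            # prime modulus (2 * 1019 + 1)
Q = 1019            # prime order of the subgroup
G = 4               # generator of the order-Q subgroup

s = secrets.randbelow(Q - 1) + 1       # the server's secret key

def issue(T):
    """Server applies its secret to the client's group element T."""
    return pow(T, s, P)

def redeem(T, sT):
    """Server recomputes s*T and checks it matches what the client sent."""
    return pow(T, s, P) == sT

T = pow(G, 7, P)                       # some client value in the group
sT = issue(T)
assert redeem(T, sT)

# The malleability problem: scaling both halves of a valid pair yields
# another valid pair, so one issuance mints unlimited tokens.
assert redeem(pow(T, 2, P), pow(sT, 2, P))
```

The hash-to-group fix described below closes this hole, because a forger cannot find the preimage t that hashes to the scaled point.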
K
So the problem here is that this is great, but it's slightly malleable: if you have T and sT and you multiply them both by any scalar, say two or three or four or five, then you're going to get an infinite number of valid pairs that you can redeem. So one issuance gives you an infinite number of options, and that's bad. All right, so how is this solved? Well, remember I mentioned hash-to-group-element. If you think of a cryptographic hash, it's one-way; you can't invert it.
K
You can't produce another pair that's a multiple of the other T; because this is a one-way function, you really can't get another pair, and you're guaranteed to have a unique pair. And there's a slight problem with this situation. It's really not that big of a problem, but essentially, if you're sending this token, it's not actually bound to any specific message. So if you happen to be sending it over an insecure channel, then someone could take it and, you know, associate it with a different message.
K
So there's a trick you can do here, which is, rather than sending s times T, rather than sending sort of the signed point, what you can do is send a message and an HMAC of that message keyed with sT. So how does the server verify this? Well, the server takes t, hashes it to the point capital T, multiplies by s, and then it has the key for the HMAC. So, rather than explicitly checking that the sT that was sent is the same as the sT it computed,
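The binding trick just described can be sketched directly; the serialized point below is a placeholder standing in for a real sT value, and the request strings are made up:

```python
import hashlib
import hmac

# Instead of revealing s*T, the client derives an HMAC key from it and
# MACs the request it wants to authorize.
shared_point = b"\x04" + bytes(64)   # placeholder for the serialized point s*T

def redemption_mac(point_bytes, message):
    key = hashlib.sha256(point_bytes).digest()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

tag = redemption_mac(shared_point, b"GET /protected HTTP/1.1")

# The server recomputes s*T from T and its own key, derives the same MAC
# key, and checks the tag; the tag is useless for any other message.
assert tag == redemption_mac(shared_point, b"GET /protected HTTP/1.1")
assert tag != redemption_mac(shared_point, b"GET /other HTTP/1.1")
```

Because only the MAC travels over the wire, an eavesdropper cannot lift the redemption and replay it against a different request.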
K
you just check that the HMAC is computed correctly, which is great, and so this goes through pretty nicely. But this is really as far as you can get with just an OPRF itself; this is essentially an OPRF construction. The problem here is tagging: the idea that this s could be chosen uniquely for each individual, or, you know, there's an s that's chosen for everybody and then an s that's chosen for the one target.
K
When you send in your redemption, then they can track you, essentially. So what is the way to fix this? This is where the DLEQ proof comes into play. Essentially, at this point, the server publishes a generator point and that generator point multiplied by its secret key, which is essentially like a Diffie-Hellman public key, and it publishes it somewhere universal, somewhere like the Tor consensus, or somewhere like a certificate transparency log, or somewhere where every client knows what this value is going to be.
K
The unique value could also be embedded into the software, which is sort of what we did. So the idea here is that when the server sends back sT, it also proves that sT relative to the blinded token is analogous to G and s times G, so that it was the same scalar that was used to multiply. This is how VRFs work: they say, okay, this is a proof that we're using for you the exact same s that is in the public domain, and this is great.
K
This is mostly the end of how Privacy Pass works, except that this only gives you one redemption per issuance. The text is a little small here, but essentially we can do this multiple times, say three times: you can solve one CAPTCHA and send in three different tokens. The problem in this case is that these DLEQ proofs are a little bit big, a little bit expensive.
K
So we came up with a small optimization: you can actually compress these three DLEQ proofs into one, and so you can issue, you know, three tokens and give one DLEQ proof for all of them simultaneously. And this is it; this is essentially how Privacy Pass works. When you solve a CAPTCHA, you come up with three unique values t, or in this case it could be up to 30, depending on how you configure it. So, let's say 30: you hash them all to points on the curve.
K
You blind them all and send them to the server. The server multiplies them all by its secret key, gives a DLEQ proof that the same secret key was used for all of them, and sends it back. Then you can individually redeem each of them with this HMAC system. Okay, so this is it for the protocol. Also, I guess a year and a half ago we released Privacy Pass as a Firefox and Chrome extension.
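The issuance flow just described (hash the seed to a group element, blind it, have the server apply its secret key, unblind) can be sketched in a few lines. This is a toy illustration, not the Privacy Pass implementation: a tiny multiplicative group mod a safe prime stands in for the elliptic curve, the hash-to-group map and the fixed blinding scalar are made up for the example, and the DLEQ proof itself is omitted.

```python
# Toy VOPRF round trip (Privacy Pass issuance sketch, not the real thing).
# A tiny multiplicative group mod a safe prime stands in for the elliptic
# curve; exponentiation plays the role of scalar multiplication.
import hashlib

P, Q, G = 23, 11, 4   # toy group: P = 2Q + 1, G generates the order-Q subgroup

def hash_to_group(t: bytes) -> int:
    """Stand-in for hash-to-curve: map a token seed to a group element."""
    e = int.from_bytes(hashlib.sha256(t).digest(), "big") % (Q - 1) + 1
    return pow(G, e, P)

# client: pick a unique seed t, hash it, blind it with r
t = b"unique-token-seed"
T = hash_to_group(t)
r = 7                      # blinding scalar; would be random in [1, Q-1]
M = pow(T, r, P)           # blinded element, the only thing the server sees

# server: apply the secret key s (plus, in the real protocol, a DLEQ proof)
s = 5
Z = pow(M, s, P)           # Z = T^(r*s)

# client: unblind with r^-1 mod Q to recover T^s without revealing T
r_inv = pow(r, -1, Q)
N = pow(Z, r_inv, P)

# redemption check: the server can recompute T^s directly from t
assert N == pow(T, s, P)
```

Because the server only ever sees M and Z, issuance cannot be linked to the later redemption of (t, N); the DLEQ proof is what stops a server from undermining this by using a per-user key.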
K
There's a slight issue here relative to Chaumian e-cash, where anyone could validate these tokens: VOPRFs are not publicly verifiable. You actually have to have the private key to check that the token is correct, so it's more like a voucher than cash. It's one small downside relative to the RSA version, but the speedups you get by using elliptic curves, and all the space savings, more than make up for it.
K
There's a VOPRF submission to CFRG that's currently on revision 3, and we're also looking into different applications of the idea. For example, there's a draft submission for TLS to use these tokens to do anonymous resumption; there was a recent paper about how TLS session resumption is a tracking vector, and this is meant to solve that. You can also use this to do anonymous referral codes; that's another interesting idea we're exploring. And anything that's really a single bit of zero-knowledge
K
proof can be used for this. So if you wanted to have a Privacy Pass token that validates, that is run by, say, a government, and that proves that you're over 18, or that you're an EU citizen, you can use this. It's very lightweight; there's no advanced math for that. And so with that, I will open it up to questions.
L
Wes Hardaker, ISI. Fascinating work; I really like the intent behind it, and the goals behind it as well, if that's the way to say it. A couple of quick questions. One: you said that there is, of course, some sort of expense related to it. Do you have any percentage, you know, CPU increase, to actually do the level of math associated with it? Yeah.
K
So, in order to do this from the client side and from the server side, it's really one elliptic-curve scalar multiplication, so it's cheaper than a TLS handshake. That was one of the goals. With RSA it was slightly more expensive, and with this it's, yeah, I guess one elliptic-curve operation per token. Okay.
L
That's not that bad. So the next question, and there'll be a follow-on: from what it sounds like, there's a limited number of tokens that you hand a client. Maybe you hand back 30 or something like that, and after 30 they come back and they have to solve a CAPTCHA again. Is that correct?
K
That's right, yeah, there's not an infinite number of tokens. Deciding on the parameters is use-case dependent, and we found that 30 or so was enough that it reduced the friction for users quite a lot. And, you know, you can technically modify the code to do up to a hundred on the server side, but yeah, it really depends on the use case.
L
K
In fact, on the client side, once you've unblinded your token, you don't have any secret state at all. You can share your token; actually, even the blinded token you can share with anybody. One thing to keep in mind that I didn't mention: on the server side there does have to be some level of double-spend protection, because these could potentially be spent multiple times.
K
All in all, there are some larger ecosystem things that are brought up by issues like this, such as farming: you could imagine someone doing a lot of farming, solving a lot of CAPTCHAs and then kind of using it to bypass things on a wider scale. But generally, yeah, we think that having key rotation is a way to help reduce that, and metrics show that it hasn't been abused.
L
K
That's right, and so it would be one. You're essentially multiplying: we're reducing the cost, but you're not eliminating it. In order to get 30 tokens you still have to solve one CAPTCHA, so you're multiplying the value of solving one CAPTCHA by a factor of X, where X is the number of tokens.
L
Still nobody behind me, so: you talked about double-spending on the server side, so you've actually implemented something for that; it's not in your slides from what I saw. So you have something simple that's checking double-spending? And is that double-spend checking infinite in lifetime, or...?
K
So our implementation of double-spend protection is changing. We're rewriting this to leverage a new platform, a JavaScript-based platform called Workers, that has more robust double-spend protection. But essentially you have to keep the double-spend strike register as long as the lifetime of the server's private key, and so key rotation is the way to actually reduce that. And so you do get into a little bit of a chicken-and-egg problem.
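A minimal sketch of the strike register described here (the class and method names are illustrative, not Cloudflare's code): redeemed token preimages are remembered for the lifetime of one signing-key epoch, and key rotation is what lets the server forget them.

```python
# Sketch of a double-spend "strike register" scoped to a signing-key epoch.
# Entries only need to live as long as the key that issued the tokens, so
# rotating the key lets the server drop the whole set.
class StrikeRegister:
    def __init__(self, key_epoch: int):
        self.key_epoch = key_epoch
        self.seen: set[bytes] = set()

    def redeem(self, token_preimage: bytes) -> bool:
        """Accept a token once; reject any replay within this key epoch."""
        if token_preimage in self.seen:
            return False          # double spend detected
        self.seen.add(token_preimage)
        return True

    def rotate_key(self, new_epoch: int) -> None:
        """Key rotation: old tokens become invalid, so the register resets."""
        self.key_epoch = new_epoch
        self.seen.clear()

reg = StrikeRegister(key_epoch=1)
assert reg.redeem(b"t1") is True
assert reg.redeem(b"t1") is False   # replay caught
reg.rotate_key(2)                   # register shrinks back to empty
```

This also makes the chicken-and-egg problem concrete: the register can only be cleared as often as clients can learn about a new key, which is why dynamic key distribution matters.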
K
Right now, in the very first version of Privacy Pass, we hard-coded the server's keys, so we haven't been able to rotate, and we're coming out with a new version in the next few months or so that allows that to be updated on a dynamic basis. Well, anyway, very cool, good work; okay, keep going. Well, thank you.
C
Yeah, Nick, I have a quick question. So the DLEQ proof is a very elegant way to solve the key-tagging problem. Were there any other, earlier designs that you considered in order to address that? You can imagine, for example, the server signing the blinded token, providing the public key back, and clients gossiping it to see that they both have the same view of the key. Yeah.
K
It all boils down to the same problem, which is key rotation and key distribution, and trust of key rotation. So we did consider publishing that in places where the client can be assured of a shared global view. Something like a certificate transparency log is one way to do it, and distributing it from a central source that is signed by a long-term key is sort of the design that we're considering for the rotation going forward.
C
K
Yes, potentially. I think that, in terms of HTTP and using this as an HTTP mechanism, there's nothing that would prevent this from being standardized. We're first exploring federation: we have our own CAPTCHA provider, but we're now currently experimenting with a company called hCaptcha that does their own CAPTCHA, as well as a company called Arkose Labs who has something called FunCaptcha. And, as I mentioned at the end, this is potentially generalizable to a lot of things. So yeah.
N
So first we want to thank everyone for letting us have this presentation today. This is going to be a very high-level view of differential privacy. Some of the slides will contain mathematical notation; do not be alarmed. It looks like there's a lot of it, but it's for completeness: it's in case you forgot something about exponentials or mean values. It's all in there so you can go back and review the slides without feeling that you're completely lost.
O
So what we have here is: we have D, and we have our D prime in the notation. This represents two data sets that differ in just a single element, so they are the same up to one row. And then we have M here, which is a mechanism that we apply to our data set. This could be something like a query: it could be a statistic, computing the mean value or the variance of something, and so on going forward.
O
So we have our P of M here, which is the probability distribution of our mechanism M. This is the range of all possible values that our mechanism can take and, of course, the probability that it will assume each one of those values. Then we have e, the standard exponential, which we have taken to the power of epsilon, and then we have this additive delta here.
O
So what we're trying to achieve here is to make the distributions of our mechanisms sufficiently similar: to make it roughly equally likely for the mechanism to assume any value, no matter whether a certain person is included in the data set or not. We're using our epsilon and delta to achieve this. Next slide, please. So epsilon and delta here become our privacy parameters. These are parameters that we can choose ourselves to enforce a given level of privacy.
O
What if we choose epsilon and delta, for instance, to be very small? What this inequality tells us is that the probability distributions of our queries on data sets that differ by only one element are not going to be very different; they're going to be quite similar. Or at least it puts a bound on how similar or dissimilar they can be. And if epsilon and delta are very large, well, then they can differ by quite a bit.
O
Right, yes. So now we're talking a bit about some methods for how to apply this, and we're going to start off with the method as it was originally proposed in the original paper, and that is to perturb the answer to a query. So we have a data set and someone queries us; they ask something about our data set.
O
So what we do is compute the true answer, so to speak, then add a bit of noise to it, and give back the noisy answer. So the person receiving the answer knows that, okay, this is roughly correct, but they don't know if it's higher or lower than the actual answer. Next slide, please.
O
So the most common way of doing this is to add some sort of noise that is either Gaussian or Laplacian. But basically what this means is that we add noise from a symmetric distribution: it's equally likely to subtract from our data as it is to add to it. And this noise is added in such a way that it depends on epsilon and delta, meaning that if we have very small epsilon and delta, we want to add a lot of privacy.
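A minimal sketch of this "perturb the answer" approach, assuming a bounded-range mean query and pure (epsilon, 0)-DP with Laplace noise; the function names are illustrative. The noise scale is sensitivity divided by epsilon, so a smaller epsilon (more privacy) means wider noise.

```python
# Laplace mechanism sketch: answer a mean query, then perturb it with
# symmetric Laplace noise whose scale depends on epsilon.
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) by inverse-CDF from one uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean(values, epsilon: float, rng: random.Random,
                 lo: float = 0.0, hi: float = 1.0) -> float:
    """Noisy mean of values clamped to [lo, hi]; sensitivity = (hi - lo)/n."""
    clamped = [min(max(v, lo), hi) for v in values]
    true_mean = sum(clamped) / len(clamped)
    sensitivity = (hi - lo) / len(clamped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
data = [0.2, 0.4, 0.6, 0.8] * 250        # 1000 records in [0, 1], mean 0.5
noisy = private_mean(data, epsilon=0.5, rng=rng)
# noisy is close to 0.5 but perturbed; the receiver cannot tell which way
```

Note how the scale shrinks with the number of records: with many records, one individual moves the mean very little, so little noise is needed to hide them.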
N
So there's a second method that you can also use for differential privacy, which is to perturb the measurement. Rather than perturbing an answer to a query to a database, you make sure that whatever goes into the database is already perturbed when it gets there. There are some different ways of doing this. One of them is commonly applied at the IETF: you remove or encrypt identifiers so that they're garbled when they reach the database. You can also imagine perhaps swapping data between different identifiable flows.
N
You can also randomize responses, which means that, with some probability, you allow anyone that inputs data into your database to just provide a false answer. Now, these methods also have drawbacks, the obvious one being that if the data in your database is not true, or if it's swapped, you will also worsen the quality of your estimates.
N
So, for instance, you need to trust the randomization mechanism used, or the swapping mechanism used for your paths. You need to trust that they haven't secretly introduced some other identifier that was not the one that was removed. And so, yeah, it has some of these drawbacks that are really very difficult to work around with differential privacy.
N
Therefore, at the IETF, differential privacy is not the only privacy tool needed. It only deals with a very specific case: when we are trying to protect the identity of the originating individual for some piece of data when you're making a query to a specific database. So data sanitization and security are still going to remain very important.
N
Data minimization is still going to be probably the primary method that we can use to effectively protect privacy at the IETF. Differential privacy is, furthermore, not the only way that we can quantify how much privacy we are protecting. There's a very good survey from 2015 by Isabel Wagner and David Eckhoff, where they found hundreds of privacy metrics that are all usable to quantify how much privacy is preserved in a given setting. We can highly recommend going through that survey and looking at the various ways in which privacy can be quantified.
N
So if you're comfortable with statistics, I can highly recommend it, but just be warned that it invokes the central limit theorem and other methods from statistics that you might have to have prior familiarity with to enjoy it. For the IETF community, I think one challenge for differential privacy is that it mostly applies to APIs, because this is either about how you put data into a database or how you respond to a query to a database.
N
Also, we've been reflecting on whether Explicit Congestion Notification could be an application of this randomized-response mechanism, because you're really interested in aggregate data, not in the specific data of each entity that provides the response. But we're very open to ideas on other potential use cases that people are familiar with from their own work at the Internet Engineering Task Force.
I
Ben Kaduk. So with this bit on the last slide about potentially introducing random or false data into your actual protocol streams: it sort of seems like the key insight you'll need to do that is to figure out what random distribution to use, because, you know, if you just try to use a uniform random distribution for, like, the bit values, that's most likely not going to be representative of what the normal flow is. And so it seems like this is inherently going to be a case-by-case sort of analysis to figure out.
N
So in randomized response, the typical thing that you would do is some discrete distribution; you're not in the Gaussian or Laplacian space. For instance, in the QUIC spin-bit case, either you spin the bit or you don't, so the natural distribution to choose is the Bernoulli distribution, or a binary two-point distribution.
N
And then you say: with some probability p you give the correct transition and bit, and otherwise you don't. The way that I imagine this would actually be implemented in a computer is that you generate a random number between 0 and 1 and choose some cutoff point; if the random number you generated falls below the cutoff point, you spin, and if it was above it, you don't. And, you know, this is the very simplest case; it's one of the simplest distributions out there.
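The cutoff-point implementation just described can be written directly; the names are illustrative. It also shows why the aggregate stays useful: because the response probability p is public, an aggregator can invert it to de-bias the observed population rate, even though any single report is deniable.

```python
# Bernoulli randomized response for a one-bit signal (e.g. a spin bit):
# report the true bit with probability p, otherwise flip it.
import random

def randomized_response(true_bit: int, p: float, rng: random.Random) -> int:
    """Report truthfully if a uniform draw falls below the cutoff p."""
    return true_bit if rng.random() < p else 1 - true_bit

def debias(reported_rate: float, p: float) -> float:
    """Invert E[reported] = (2p - 1)*rate + (1 - p) to estimate the true rate."""
    return (reported_rate - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(7)
p = 0.75                                  # truth probability (the cutoff point)
true_bits = [1] * 300 + [0] * 700         # true population rate = 0.3
reports = [randomized_response(b, p, rng) for b in true_bits]
estimate = debias(sum(reports) / len(reports), p)
# estimate is close to 0.3, yet each individual report could be a lie
```

This binary case extends to the multi-valued headers mentioned later, at the cost of a more involved de-biasing step.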
P
They have a chapter on differential privacy where they claim that anonymization can easily be undone with large enough data sets, and that differential privacy in and of itself protects the individual data set, but can be attacked by using secondary, outside data sets that are large enough. So, in our world today, if I provide differential privacy on my data set, I may be providing information that someone else will incorporate with their own large data set and be able to de-anonymize that information.
N
The observation is entirely correct; differential privacy is a statistical method. What you're talking about is sort of similar to this repeated-query privacy-budget thing that Christopher mentioned: you're creating statistical uncertainty around one additional individual in the data set, and differential privacy does only exactly that. This is what it can provide: deniability properties for a single individual that is additional to a data set, with respect to another query to that data set without the individual there. But it's not a catch-all for all privacy problems.
L
How do you ensure that... there's a potential issue where the noise itself interacts with... there are multiple aggregation functions that could end up... sorry, I'm jet-lagged. Multiple aggregation functions, where some may be able to help distinguish the noise from the rest of the material. So if you were able to run four or five or six different aggregation functions, the noise would affect them each differently, and therefore you're actually getting slightly less privacy because of that. Does that make any sense?
N
Well, I mean, it looks like again a version of the privacy-budget problem: if you make repeated queries to the database, then you can compensate for the noise that's added. The typical way of doing differential privacy is adding noise, but many of you who have studied engineering are familiar with things like the Kalman filter; there are many different filtering algorithms that we already use on a day-to-day basis exactly to get rid of noise from measurements.
N
So now, if differential privacy adds noise, then reasonably, yes, we can remove the noise by performing the same types of filtering that we've been applying to noisy measurements already for what, 60, 70 years. And it's important to recognize, when you're thinking about applying differential privacy to a project, that it does exactly what it says on the box, and that can be very useful; but it has these limitations, including the ones that you brought up. Okay.
L
Thank you, yes. I mean, if I had two aggregation functions, aggregation function one and aggregation function two, I could run one aggregation function twice, and that would give, I think, the perfect mathematical level of knowledge of how much I'm protecting. But if I run aggregation one and then aggregation two, the cost to me in terms of a budget would be the same, yet I may be able to use the difference between those aggregation functions to get slightly less privacy, for example.
A
I mean, it's a concern. So with respect to this spin bit, I think, you know, like you said, the whole thing is about the privacy budget: with enough queries you can figure out exactly what you want to figure out. It might be kind of complicated, right, to ask an on-path observer to only take five measurements.
Q
I just wanted to remark back on your point about APIs. I think we've been a little bit narrow in this community about what an API is, because it's a human being interacting with the system; but there's no really good reason why something like one end of a protocol connecting with another can't be thought of as exactly that database interaction as well. And similarly, a measurement agent receiving data: you could define that as an API too.
N
Explicit Congestion Notification, somebody suggested to me, could be an interesting application, so if you have any ECN folks in the room, you are very welcome to talk to me; and anybody in the room who thinks this might be suitable for them that hasn't been mentioned, please do approach me. So I think the easiest cases to look at, at least for now, are places where you have on/off answers. Basically, you can use a binary distribution to say: rather than giving a true or false answer,
N
you give a true or false answer with some probability. And all the other things that are more complicated, like if you can choose between five different options to communicate in a packet header, for instance: then you would need a more complicated distribution than a binary distribution. So we could certainly think about that too for randomized responses, but we would have to do a bit more mathematical work to make it still useful for aggregate measurements.
S
When we started looking into this topic, we first took a look at the consumer identity-provider market, which looks roughly like this, with offerings from Google and Facebook claiming around 90% of consumer identities. Now, per se this might not be a problem, but if we look into the recent past, some issues arise from this fact.
S
The first one is privacy concerns. Obviously those companies want to make money: they offer the services for free and, in turn, they monetize this mountain of data they are sitting on. This data can also be used for opinion shaping and mass-surveillance data collection, which is an infringement on the users' privacy.
S
It is actually lower than the risk they actually face. And obviously the last one is simply the fact that this seems like a market oligopoly, where there can only be one or two, or a handful, of such providers. That may be for reasons of liability risks, but it might also just be because they offer such good services in the area of social media that they simply have this number of users. But identity federation has been around for decades, and it does not seem to take off in any way.
S
So our approach to those issues was that maybe we need to approach this differently. Our primary objective was that we must enable users to exercise their right to digital self-determination, and in order to do that we must avoid third-party services that basically match the identities and share the data for users. And this should be done using an open and free service which is not under the control of a single organization or business.
S
In summary, we want to empower users to reclaim control of their digital identities; this is also where the name of the idea comes from. So let me explain what re:claimID actually does and how it works. In order to explain that, let's look at some of the tasks that those identity-provider services actually provide to users, and to the websites and services that use them.
S
The first thing they offer: they allow identity provisioning and access control. So the user is able to create an account, basically create an identity, manage personal data, and share this data with third parties. The service itself then enforces the access-control and authorization decisions of the user.
S
It should be noted that the second thing can be addressed, or is addressed recently, using privacy credentials based on, for example, non-interactive zero-knowledge proofs, which we have seen in the previous talk on Privacy Pass as well. re:claimID actually focuses on the first issue: how can we allow the user to provision identities, to manage identities, and to share this identity data, whether it is third-party asserted or not, using a decentralized system?
S
In a nutshell, re:claimID combines a decentralized directory service, that is, a service that is used to hold and provision identity data, with a cryptographic access-control layer. Now, what does this mean, a directory service? We may know Active Directory, but name systems are actually also decentralized directory services.
S
Now, our implementation of this idea does not use the Namecoin blockchain and instead uses the GNU Name System. If you're interested in the security properties of the GNU Name System, there's actually a talk this week, I think on Thursday, but I don't remember, by Christian, who will talk on it in another research group.
S
However, the problem with NameID specifically, but in general when using such a decentralized directory service, is that this data is more or less public. So anybody who has a suitable resolver is able to retrieve the identity data and read it. So we added a cryptographic access-control layer. Basically, you can always add a cryptographic access-control layer using symmetric cryptography and a very complex key management, but in order to reduce the complexity of the key management, we're using attribute-based encryption.
S
Attribute-based encryption allows us to define access policies on the ciphertext, which simply reduces the number of keys we need. So let's look at an example of how it works. The user would basically register a namespace in the name system, in the directory service, and populate it with resource records, and those resource records hold the user's information, such as an email address.
S
Now, this email address is encrypted using attribute-based encryption, which means the user has a private key and a policy, and this policy says, for example: my email address can only be decrypted using a key that contains the attribute "email". The user does this with every attribute, so in the end the namespace is populated with a number of ciphertexts which represent the user's identity attributes.
S
If the user now wants to authorize a third party to access a set of attributes in his identity namespace, what he does is create a single new key, an ABE user key, and attach the set of attributes he wants to share to that key. So instead of having to create or share n keys in order to share n attributes, the user only needs a single key.
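To illustrate the "one key for n attributes" point, here is a deliberately simplified stand-in. This is not real attribute-based encryption and not re:claimID's code: the per-attribute secrets derived from a master key via HMAC, the XOR cipher, and all the names are illustrative only. What it shows is the shape of the scheme: each attribute ciphertext is locked separately, and a single user-key object bundles exactly the secrets for the attributes being shared.

```python
# Toy stand-in for the ABE idea: one user key carries the secrets for
# exactly the attributes the user chose to share (NOT real ABE).
import hashlib
import hmac

MASTER = b"user-master-secret"

def attr_secret(attr: str) -> bytes:
    """Per-attribute secret derived from the user's master key."""
    return hmac.new(MASTER, attr.encode(), hashlib.sha256).digest()

def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Tiny involutive cipher for the sketch: XOR against a hashed stream."""
    stream = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

# namespace: one ciphertext record per identity attribute
plain = {"email": b"alice@example.org", "phone": b"+43 1 2345"}
records = {a: xor_crypt(attr_secret(a), v) for a, v in plain.items()}

# a single user key granting only {"email"}: one object for n shared attributes
user_key = {"email": attr_secret("email")}

def decrypt(records, user_key, attr):
    if attr not in user_key:
        raise PermissionError(f"key does not carry attribute {attr!r}")
    return xor_crypt(user_key[attr], records[attr])

assert decrypt(records, user_key, "email") == b"alice@example.org"
# decrypt(records, user_key, "phone") would raise PermissionError
```

In real ABE the user key is a single cryptographic object whose embedded policy determines which ciphertexts it opens, rather than a bundle of independent secrets; the key-count saving is the same.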
S
Attributes which are not attached to the key cannot be decrypted by the requesting party. Now, obviously, if you want people to use such a system, we do not want to burden the user, or the third party for that matter, with any kind of key management and name-system specifics. So what we did in our implementation and our design was to build an OpenID Connect layer on top of this idea.
S
So under the hood it basically works like I just explained in this example, but from the point of view of the user and of the integrating website, for example, it works just like any generic OpenID Connect provider; it basically adheres to the OAuth protocol. Although I should say that the OpenID Connect protocol and the OAuth protocol are not specific enough to actually address everything.
S
For example, if the requesting party, the website, wants to have a specific credential asserted by a particular identity provider, this is currently not possible, the protocol being still very simplistic. But in general this works within the standard. So, in summary, we have implemented this idea as part of GNUnet, which is a peer-to-peer system.
S
There's a functional proof-of-concept demo as well on GitLab, which you can find under the link here, although it's not really finished; it's very rough around the edges, and we want to make it a bit more user-accessible, because, yeah, it's still coming directly out of research and we're still working on making it practically usable. Thank you. Any questions?
U
Alexander Mayrhofer from nic.at. I was wondering: the moment that you associate the public key of the attribute-based encryption with a name in the GNU Name System, aren't you reducing privacy significantly? Because then it allows somebody else to correlate various ABE ciphertexts via the same name. So if I put different proofs, or whatever you call them, under a single name, it's very easy for somebody else to discover that they were produced by the same identity. Yeah.
S
Basically, it doesn't disguise that; you always know which identity it is, because you're looking up a specific identity namespace. Or are you talking about using the same attribute in different identities? Well, if you're using different identities, then you're probably using different ABE private keys internally, which makes them indistinguishable from each other. However, you could argue that using, for example, privacy-preserving credentials, so zero-knowledge proofs, as attributes does not make that much sense, because you're always identifiable as the single identity. So, yeah.
U
I'm trying to compare it with other self-sovereign identity systems like Sovrin or uPort, where you have a pairwise relation and you essentially cannot discover the real identity unless that person has revealed it. But by putting it behind a single name... that's often the criticism of the DNS: oh, you put it in the DNS, so everything is under the same name, so you know exactly who it is.
S
Pseudonyms: if you have used GNS, yes, but in DNS, or also Namecoin for that matter, you can just create new pseudonyms whenever you want. So obviously, if you use the same pseudonym, you are trackable, but you can just create new ones, because they're effectively just public-private key pairs.
U
Okay, thank you.
G
I have a question, which is actually also a comment for everyone who's interested in these kinds of things, having worked for some years now on a project which is very similar, except that we use the DNS, because we don't have a problem with the delay. So, I mean, the problem now is that there are like a hundred different projects that are trying to do
G
these kinds of things, while in the real world there are basically only two single sign-on identity systems that are in wide use, Google's and Facebook's, and maybe, in Europe, we have eIDAS, the public identity system for taxes and that kind of thing. So I guess the problem that we have is: I think people agree that we need something like this, which can put the user back in charge.
G
But how can we get that adoption? And the real problem is actually that we are dividing up into, I'd say, 50 different projects with slightly different ways of doing the same things; some people use a blockchain, some others don't, but in the end with the same objective. But if we're all so divided, we don't get anywhere. So I don't know if you have any reflection on this, or a strategy for adoption.
S
Yeah, I agree wholeheartedly. But I think if all of those projects actually, for example, used the OpenID Connect standard, then, if you offer the software to users and show the benefits in terms of privacy, for example, they might just start using it. So there's no reason why you cannot put a generic OpenID Connect discovery button on a website.
V
One other question: one of the reasons that I, for instance, use Google single sign-on for DigitalOcean is that it is the most secure way to log on to DigitalOcean; you can use a two-factor authentication device, and they don't offer one any other way. How do you address that level of authentication, the authorization issues? Okay.
S
So, well, DigitalOcean, for example: probably you want to log in using two-factor authentication, but I'm assuming DigitalOcean doesn't really care how you log in; basically, that's something that Google just does for you. So that then depends on the actual client implementation, because, ultimately, OpenID Connect doesn't really do authentication itself; it assumes some authentication scheme. Obviously you could extend it, but that's not really in scope of the system. Well.
W
A follow-on: I think the whole notion of sort of pinning this on the back of SSO is probably a mistake. Progress in the WebAuthn space is probably going to make SSO as a value proposition pretty useless anyway. I don't know what that means for Google and Facebook sign-ons, probably nothing, but in the cases where federation actually works, and there are a number of cases where it does work, it's because relying parties actually want and get information about the user.
W
Eduroam, for instance, where you're basically sharing an affiliation: I'm part of the club, and that gets me access to the network. In those situations it has nothing to do with SSO, and it has nothing to do with, at least not perceived, user privacy; it has everything to do with UX. So I think, to make progress in this space, you actually have to focus on UX, not technology, because that's where the really hard problems are. Yeah.
S
I think, maybe, to just comment on that briefly: what it actually does is, it's just a service. Any OpenID Connect provider just provides a service that allows the relying party, or the website, to retrieve this identity data when the user is offline as well. So it's not just a direct interaction and a single sign-on issue; it's also a service that provides this identity data as a service, right?
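Retrieving identity data while the user is offline maps onto OpenID Connect's "offline_access" scope: the relying party asks for it up front, receives a refresh token, and can later exchange that token for fresh access credentials without the user present. A minimal sketch, assuming hypothetical provider endpoints and client credentials:

```python
from urllib.parse import urlencode

# Hypothetical endpoints; real values come from the provider's
# discovery document (.well-known/openid-configuration).
AUTH_ENDPOINT = "https://idp.example.org/authorize"
TOKEN_ENDPOINT = "https://idp.example.org/token"

def build_auth_url(client_id: str, redirect_uri: str, state: str) -> str:
    """Authorization request including 'offline_access', which asks the
    provider to issue a refresh token alongside the ID token."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid profile offline_access",
        "state": state,
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"

def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> dict:
    """Form body the relying party can POST to the token endpoint later,
    while the user is offline, to obtain a fresh access token and
    re-fetch identity data from the UserInfo endpoint."""
    return {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }
```

This is the mechanism behind "identity data as a service": the user interaction happens once, and the refresh-token grant lets the website keep the data current afterwards.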
W
The client... I suspect that all of these projects should just stop worrying about things like blockchains and start thinking about how to get users to want to use the new UX. I think that might help. I agree.
M
M
S
S
X
Hi, I'm Brook Schofield; I work for GÉANT. This is my first IETF / IRTF sort of meeting, so anyway, I'm here to talk about the Next Generation Internet. So if you have been to a welcome, and presumably the first time you turned up at the IETF there was a welcome, you would have at some point come across the mission of the IETF, and this is to make the internet work better.
X
Now, the NLnet Foundation have a goal to help create the Internet of tomorrow, and as such they were approached by the European Commission to work on a series of documents, because, well, they discovered this to be a problem. This is from a presentation by someone who is not here at the moment: that actually the Internet is not necessarily serving what we hoped it would serve. And so, in a short space of time, as you can see from the presentations today and the problems you're trying to solve by being in this room.
X
How are we going to rebuild a currently working system, working to some definition of working? Because there are a lot of moving parts, and, you know, trying to build something new, or improve user privacy on top of that working infrastructure, is going to be a challenge. So the NLnet Foundation did an analysis and some consultations, produced some paperwork and a vision, and that has resulted in a European Commission funding stream to be able to explore this area.
X
In fact, it was a very big sort of architectural document of all of the things you could possibly look at, and then they decided that trying to work on all of those things individually isn't going to work; the heap would crumble. So we have this Next Generation Internet open call process, where a few of these open calls are kicking off right now.
X
Now, it's being built with a larger ecosystem, so the NLnet Foundation happens to talk to GÉANT, the organization I work for, the association of research networks in Europe; the RIPE NCC; and lots of ISP groups; in order to get their feedback on what work we need to undertake in order to improve the situation. And so there will be a lot of funding programs coming on stream in the coming years, and at the moment there are four, actually, as you'll see on a later slide.
X
So this is the NGI vision document; you can go to ngi.eu/vision and get a copy of this report that the NLnet Foundation developed, and this shows the current funding streams that are coming on board. So this is a pitch for you to come with your ideas to these open call processes and seek funding to complete this work.
X
So at the moment there are, in fact, four open calls: in green, run by the NLnet Foundation; the distributed data and distributed ledger group in orange; and this privacy and trust enhancing technologies one, which is the project that I'm involved in as a consortium. There are also three, I think, where the deadline... these are the proposals for people to run these open call systems, so there are three projects where the deadline is real soon now; it may be at the end of this month.
X
So the NLnet Foundation are looking after search and discovery, and privacy enhancing technology. These are open calls that have a very low barrier to entry. Not everyone is always successful, but they offer money in the range of five thousand euros to fifty thousand euros, and they run these calls regularly, every two months; these calls pop up throughout the cycle of this project.
X
The NLnet Foundation also offer other funding opportunities, so check their website for more. GÉANT is part of a consortium on privacy enhancing technologies, and we have money up to a hundred thousand euros, with, unfortunately, a slightly weightier application process. And the Ledger project offers up to two hundred thousand euros for distributed data, distributed ledger technology and blockchain projects. You can find this information on the ngi.eu website, so yeah, these are the currently open calls.
X
The deadlines are ever so slightly different. The NLnet Foundation's calls run every two months; their deadline is the 1st of April, but also on the 1st of April a new call will open for them and run for a period of two months. NGI Trust and Ledger have open calls that end on the 30th of April and will be running additional calls later in the year. So we're not as agile as the NLnet Foundation, which has this very rapid process.
X
We hope these proposals will be slightly longer and more involved, and we provide some other support infrastructure for getting your applications across the line. You can also contact us, and we can give you advice and guidance in the lead-up to submitting to a call, but once you submit something we can't talk to you until the assessments are out. So feel free to talk to us leading up to that, and this is how you can directly contact me, or visit the NGI website and look at the open calls.
R
R
X
Yeah, you can still apply. It has to be for the benefit of Europe, so it's not exclusively for projects or people resident in the European Union. It does make it easier, and once the individual open call projects have done their assessment, there's actually a final group within the European Commission that decides whether it passes this bar of being beneficial for Europe or operating within Europe.