From YouTube: IETF114-PEARG-20220725-1730
Description
PEARG meeting session at IETF114
2022/07/25 1730
https://datatracker.ietf.org/meeting/114/proceedings/
Sure, thanks. This is the Note Well; please note it well. Also, please note that, according to IETF policy, we do have to wear masks inside all meeting rooms, so just be aware of that, and let's move on to the next slide.

So this is the agenda. We have three presentations and a quick update on the safe measurement draft, which I believe Mallory will be giving. The chairs can keep an eye on Jabber, and we have a minute taker.
Okay, I see the slides now, and I see on the screen that they are also shown in person. Let me stop my video. Okay, should I just go ahead, Shivan?
Okay, thanks. Hi everyone, and thank you so much for having me here. My name is Sofía Celi and I work at Brave Software. Today I'm going to give a very informal note, or an informal presentation, on privacy-preserving measurement techniques; as I said, this is just an informal comparison. Next slide, please.
No? Okay, that's no problem, we'll move the slides along. Thank you very much. Okay, so first a little bit of a disclaimer again: this is not a complete overview, but rather an initial note around the different techniques one can use for trying to attain privacy-preserving measurements, and what they really aim for.
The aim is to answer the question: if I want to execute measurements with privacy, which scheme should I use? In general that seems like a simple question, but in reality it's a very complex one, because there is a big array of different techniques used in different schemes. So it's difficult, as a system administrator or just a user, to actually choose the correct scheme, or the scheme that best suits your needs.
There are also some clear expectations around the efficiency and the monetary costs that these schemes provide. If you are interested in some further notes, I have put a PDF online that is going to be developed into something more formal over the coming months. Next slide, please. Okay.
So, let's just start from the beginning: what is the main notion?
The main notion is, of course, that you, as the provider of whatever system or application, want to know something about your users. The reason you want to know something about the users is mainly to improve the usability of the system, by understanding how users actually use it. While that seems great, in practice it has a big downside: we end up learning certain private things about users that we are not supposed to learn. In light of this, the idea is to provide a private and secure way to collect these aggregate measurements.
Because, of course, as I said, these aggregate measurements otherwise correspond to a centralized leakage of private user data. Currently the IETF actually has a working group devoted to this, called PPM, which has been trying to standardize some of the techniques I'm going to talk about, to provide privacy and security for aggregate measurements.
Now, perfect, thank you. So, a little bit of wishful thinking.
The first question one must pose is what level of security and privacy one wants to attain. The first main definition one finds is the one by Dalenius, which is not specific to taking aggregate measurements in the digital world, but applies to statistical disclosure control in general. The privacy and security it aims to provide is something similar to semantic security, meaning that access to a statistical database should not enable anyone to learn anything about an individual user.
So let's start with the actual techniques and schemes. As I said, there are many, so I've tried to categorize them by the specific technique they use. The first category is differential privacy, and the reason I'm talking about it first is that it is one of the oldest approaches out there for preserving privacy in aggregate measurements. The idea in differential privacy techniques is that some local randomness, some kind of noise, is added at some point.
When you're performing the aggregate functionality, the noise could be added at the data collector, at the output of the statistical function, or in the mechanism itself. What differential privacy wants to attain, in general, is a notion called epsilon-differential privacy. You will see a mathematical notation here; don't worry about it, it's just for reference, but this is the notion these papers aim to attain if you ever read the differential privacy literature.
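As a concrete illustration of this idea, here is a minimal Python sketch of randomized response, one of the simplest mechanisms satisfying epsilon-local-differential privacy. This is my own toy example, not something shown in the talk; the function names and the debiasing helper are assumptions made for illustration.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps); otherwise flip it.

    The output distribution changes by at most a factor of e^eps when the
    input bit changes, which is the epsilon-DP guarantee for a single bit.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else (not truth)

def debiased_count(reports, epsilon: float) -> float:
    """Estimate how many users truly hold the bit, correcting for the noise.

    E[observed] = t * p + (n - t) * (1 - p); solve for t.
    """
    n = len(reports)
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports)
    return (observed - n * (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
truths = [i < 600 for i in range(1000)]  # 600 of 1000 users truly hold the bit
reports = [randomized_response(t, epsilon=1.0, rng=rng) for t in truths]
estimate = debiased_count(reports, epsilon=1.0)  # close to 600, but noisy
```

No individual report can be trusted, yet the debiased aggregate is close to the true count, which is exactly the trade-off the talk describes: local noise buys privacy at the cost of accuracy.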
Without any mathematics, the meaning is that the output of the statistical function is similar on two datasets that differ by changing or removing one element, so that someone looking from the outside cannot distinguish them. There are two basic schemes supporting differential privacy that have actually been deployed in the real world. The first one is RAPPOR, which was supported roughly from 2014 until 2019.
E
It
uses
the
same
idea
of
acting
of
adding
local
random
noise
into
the
statistics
that
is
taken
from
users
and
also
using
memoization.
The
problem
with
this
scheme
is
that
it
is
very
costly
because
locally,
you
have
to
add
all
of
this
randomness
and
therefore
it
is
very
costly.
In
the
light
of
these,
this
very
costly
another
system
was
devel
developed,
which
is
called
proflow
and
it's
much
more
efficient
and
also
uses
a
different
architecture.
The
architecture
that
it
uses
is
called
encode,
shuffle
and
analyze.
It is an ESA architecture, and the idea is that you still add local randomness, but it is augmented by a private channel that randomly permutes the set of user-supplied data. So you have the local noise added in the encode step, then a shuffle step that permutes the different users' data, and then an analyze step that simply performs the statistical function you need.
So there is a bit of a downside with all of these differential privacy schemes. I only touched on two very lightly, but there are many, many more, and as you see, all of them can have a lot of drawbacks. In the face of this, another system was developed.
That system was called Prio, and nowadays the PPM working group at the IETF is also trying to standardize some Prio-based schemes; not specifically the original Prio as it was first written in the original paper, but similar schemes. What does Prio basically do? Basically what we stated on the first slide: private aggregation. The authors also specified three properties they wanted to have: privacy, of course, robustness, and scalability.
So this scheme indeed tries to be a little more efficient. The way it works is with a small number of servers and a large number of clients, and as long as one of the servers is honest, the system leaks nearly nothing about the users' data, except for what the aggregate statistic itself reveals.
So, as you see, we already have a little bit of leakage in the privacy the system provides: it is not completely private, as in the first definition I showed you, but rather there is a specific amount of leakage. For example, let's say you are trying to compute a mean functionality.
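To make that leakage concrete, here is a tiny hypothetical example of my own (not from the talk): even if every individual contribution is perfectly hidden, an adversary who controls all but one client can recover the honest client's value from the published aggregate alone.

```python
# Hypothetical illustration of aggregate-output leakage.
# The aggregation itself can be perfectly private, yet the published
# statistic still reveals information: with n - 1 of n inputs known,
# the sum pins down the remaining honest input exactly.

def aggregate_sum(values):
    """Stand-in for the statistic the servers publish."""
    return sum(values)

adversary_values = [3, 5, 2]   # inputs the adversary submitted itself
honest_value = 7               # the one input it does not control
published = aggregate_sum(adversary_values + [honest_value])

recovered = published - sum(adversary_values)  # equals honest_value
```

This is why Prio's privacy guarantee is stated as "nothing beyond what the aggregate itself reveals" rather than as absolute privacy.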
There is Prio+, which is a little more efficient because it uses Boolean circuits instead of arithmetic circuits; there is Prio2, which is not going to be standardized, as far as I know; and Prio3, which is the one that is going to be standardized, as far as I understand, and which is more efficient in the client-to-server communication.
To give you a specific pinpoint of the privacy Prio provides: it uses a notion under which an adversary who controls any number of clients and all but one server learns nothing about the honest clients' values, except what can be learned from the aggregation function itself. This just repeats what I already showed you on the previous slide: it is a kind of bounded privacy, in the sense that there is some amount of leakage.
Yes, I'll take the questions at the end. I see that there are some in the Zulip, but I will take them at the end.
Okay, so here is a little diagram for those who don't know exactly how Prio works. This is just a really high-level explanation. In general, let's say, for example, that you are a user who goes to a park, and is indeed in the park, and the system wants to know whether you are in the park, because the mobile clients, for some reason, want to know how many users are actually going to this park.
If you are in the park, the mobile phone says one, and if you are not in the park, the mobile phone says zero, and it sends this data to a collection of servers. But instead of sending the one, meaning "yes, I am in the park", what it does is split this one into shares, for example 15, minus 12 and minus 2, which all sum up to 1, and send one individual value to each server that belongs to the system.
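The splitting step above can be sketched in a few lines of Python. This is an illustrative toy, assuming additive secret sharing over a finite field; the field size, server count and names are my own choices, not Prio's actual wire format.

```python
import random

PRIME = 2**61 - 1  # real systems work in a finite field; this prime is a toy choice

def split(secret: int, n_servers: int, rng: random.Random) -> list:
    """Split `secret` into additive shares that sum to it mod PRIME.

    Each share on its own is uniformly random and reveals nothing about
    the secret; only the sum of all shares reconstructs it (as with the
    15, -12, -2 example from the talk, whose shares sum to 1).
    """
    shares = [rng.randrange(PRIME) for _ in range(n_servers - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares) -> int:
    return sum(shares) % PRIME

rng = random.Random(42)
shares = split(1, 3, rng)  # "I am in the park" -> 1, split across 3 servers

# Aggregation: each server sums the shares it received from all clients;
# adding the per-server totals yields the total count, and no server ever
# sees an individual client's bit.
clients = [1, 0, 1, 1, 0]  # hypothetical presence bits
per_client = [split(bit, 3, rng) for bit in clients]
server_totals = [sum(col) % PRIME for col in zip(*per_client)]
total_in_park = reconstruct(server_totals)
```

Because addition commutes with share reconstruction, the servers can aggregate first and only reveal the final total, which is the core trick behind Prio-style private aggregation.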
As you see here, each individual server will not be able to pinpoint the private value that the user is submitting, because it only sees a secret share. Now, it's important here, of course, that the user is honest, in the sense that the value they send should indeed lie in the expected range, meaning they send either a zero or a one: a no or a yes.
But, for example, let's say that a client sends a two. In general, if you're only using secret shares, it is indeed possible for the user to send a two. So what the Prio system also adds is a zero-knowledge proof, specifically called a secret-shared non-interactive proof (SNIP) in the Prio language. Basically, this is a proof attesting that the user is indeed sending a correct value in whatever range is expected.
In our example, the correct value is either a zero or a one. Once that is sent to the different servers, they talk among each other and attest to its validity or not, and if it is indeed valid, they aggregate whatever shares were shared with them. Eventually they will be able to compute whatever aggregate function they were trying to compute, and, as you see, this is preserved in the face of a malicious client.
But then there are some efficiency trade-offs here: server-to-server communication is efficient, but client-to-server communication is inefficient, because computing these zero-knowledge proofs, while more efficient than previous zero-knowledge proofs, is still much more costly. Also, a thing to note is that Prio-based systems only work on numeric values, meaning you can only send numeric values; if you want to send something like strings, the system does not completely cover that. In the face of this, there is another kind of problem one can solve.
Indeed, sometimes one wants all types of data to be part of an aggregate function, and in this case we call it the private heavy-hitters problem. One of the schemes for this is STAR, which has recently been announced in a paper and also proposed to the IETF.
A value is revealed only if at least K minus one other clients are submitting the same data, and what this prevents is the data collector from learning uniquely identifying information, or uniquely co-occurring patterns of data, from a unique client. How does it work? Mainly, a client constructs a ciphertext of the data using an encryption key derived from some randomness. This randomness is usually taken from a randomness server that uses an OPRF functionality.
Once it has the ciphertext, the client sends it to the server, along with a k-out-of-n secret share of the randomness, and tags these shares with a specific tag, so the server will know which ones to combine. At the end, the aggregation server organizes the shares into subsets, depending on the tags submitted by the clients, and recovers encryption keys from those subsets of size greater than or equal to k. If you want a slightly more beautiful diagram, it is here: as you see, there are three entities.
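The k-out-of-n share-and-recover step can be illustrated with textbook Shamir secret sharing. This sketch only shows the threshold mechanics; the prime, parameters and names are my own assumptions, and it omits STAR's actual encoding, OPRF step and encryption.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; the field choice is mine for this toy

def share_secret(secret: int, k: int, n: int, rng: random.Random):
    """Shamir k-of-n sharing: evaluate a random degree-(k-1) polynomial
    with constant term `secret` at n distinct non-zero points."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]

    def poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]

def recover_secret(points) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is den^-1 mod PRIME (Fermat's little theorem)
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

rng = random.Random(7)
shares = share_secret(123456, k=3, n=5, rng=rng)  # any 3 of the 5 shares suffice
```

This is why the aggregation server can only recover a key, and therefore decrypt a measurement, once at least k clients have contributed shares for the same value: fewer than k points leave the polynomial, and hence the secret, undetermined.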
The first entity exists only to gather the randomness needed to derive the key that is going to be used to create the ciphertext. Then the client sends the ciphertext to the aggregation server, which takes the auxiliary tags that the clients define, reconstructs the specific key for the ciphertext, and is then able to decrypt it.
One important thing to note specifically about this system is that it is the first that takes into account not only efficiency, but also the monetary costs associated with running all of these kinds of PPM techniques; one of its aims was precisely to have low monetary costs.
Another system that seems very similar to Prio is the Poplar system, which allows finding the most popular strings among a collection of clients, as well as counting the number of clients that hold a given string. The main difference with Prio is that Prio requires only one honest server; in this case you have to have two non-colluding data collection servers. It provides roughly the same amount of privacy as Prio. Okay.
So that's a really brief overview of the different schemes, but if you are a little more interested in how they actually compare to each other, which was the core question behind this note, I created two tables for this. In the first table, for example, it's good to consider the type of data you need to use when you're actually thinking about choosing a scheme.
As I already said, the Prio-based functionalities handle numeric data, while STAR handles strings and other types of data. There are also different robustness notions, and the trust assumptions differ, in the sense that in some systems at least one server has to be trusted, while others don't require any trusted entity.
Also, as I said, while all of them provide a certain kind of privacy notion, none of them actually attains the original privacy notion I showed you on the second slide; rather, all of them have some kind of leakage. In the case of Prio-based systems, for example, I already mentioned that the leakage is whatever the aggregate function leaks itself.
In the case of STAR, for example, the server learns which clients share the same measurements; in Poplar, there is a leakage of all heavy-hitting prefixes. All of them also have different efficiencies: we already saw that Prio-based systems have inefficient client-to-server communication, which is improved in the iterations after the original Prio paper.
So you have better efficiency in Prio3. And I only added the monetary cost of the two last schemes, STAR and Poplar, because those are the only ones I have seen analyzed for monetary cost. Perhaps what is missing in this table is properly filling this in for the other schemes, but I didn't have a lot of time to analyze them from the monetary-cost perspective.
At the same time, sometimes users will not want to participate in these aggregation functions, or in different kinds of services in general. The reason they will not want to be part of these systems is that, while the systems may provide some individual privacy, they might go against the privacy of a whole group. For example, let's say there is a survey about women and the number of abortions women have. I would not like to participate in that kind of functionality.
Even if the system tells me that it's private, because the output of that function might be used to harm me as a woman, as someone who belongs to that group, or as an individual belonging to it. So the point here is that, while it's important to provide privacy, it is not enough: we also have to take user consent into account, and that should really be noted as important in the design of these systems. And with that.
Thanks for having me. I'm Barath Raghavan, faculty at USC and co-founder of INVISV with Paul Schmitt, one of my colleagues, and thanks also to Jana, Chris and Tommy for a lot of the discussions that went into this.
So I'm going to give a very high-level talk, a very different type of talk, really trying to step back on a question some of us have been noodling over for a while. There are all these interesting privacy-preserving network protocols, systems and architectures that have been come up with by a lot of you in this group, and by folks in the IETF community and in the network-systems and privacy communities, for years, decades really. We were trying to figure out what is in common among all of these, the ones that we think actually achieve some sort of meaningful and practical privacy preservation, and it seemed like there may be a common principle that doesn't address all of the questions we care about, but is one of the core design principles underlying them.
them.
And
so
that's
what
this
talks
about.
What
we're
calling
the
decoupling
principle
so
at
a
high
level.
Basically,
the
decoupling
principle
is
for
internet
privacy.
You
want
to
decouple
who
you
are
from
what
you
do.
That's
that's
the
nutshell
of
this
talk.
This
is
an
old
idea.
Is
nothing
new
that
we've
come
up
with
here.
Chaum introduced this in ten different ways in a classic series of papers back in the 80s and then in the 90s, but it's been inconsistently applied over the decades. For some reason, this principle gets rediscovered over and over again, and then it gets forgotten, and then rediscovered.
It seems like right now we're in a phase, an era over the last several years, where people have rediscovered this principle and are using it to good effect, and a lot of the proposals that have come out from many of you, again, have used this principle. So what we thought we would do is just step back for a second and think about what's going on in a lot of these systems that enables internet privacy by decoupling who you are from what you do. And it seems like decoupling is easiest when we split by entity, meaning who are the different parties in the network that are participating to achieve some internet service, and by the mechanism that's being used.
So the mechanism might be a mechanism for authentication, or a mechanism for connectivity, or for whatever else you're trying to achieve. The decoupling is always going to be protocol- and context-specific, so you have to look at the specific service that you're trying to provide, of course. So we can go through a few examples, very much at a high level, trying to understand what kind of decoupling is being achieved.
The context here, before we go into that, is that ordinary data confidentiality is nearly solved. This is a broad statement I'm making, but we're at a point where TLS is everywhere and data is encrypted at rest, and if it's not, we know that we need to do those things. There are some hurdles in some contexts where you can't use TLS, or you can't encrypt data at rest, or you can't take some of the other very well-known steps for data confidentiality, but we know we need to do those things.
What's left is a little bit more complex, layered metadata privacy problem, and this appears in many different contexts at different layers of the network stack, everything from the mobile layer to various types of internet protocols to applications, and so you need many different overlapping solutions. It's not that you would apply the decoupling principle for a single user in a single context in a single protocol and then be done; rather, you want to decouple all the things, effectively. And really, the privacy challenges we're dealing with here are fundamental to the internet, maybe in a way unique among computing contexts, because we rely upon others to carry our traffic and process our requests.
So, a little bit of terminology; trying not to make it super terminological, but just to be a little clearer about what we're talking about. We're going to make a very crude binary distinction between sensitive and non-sensitive information; that's the very high level we're going to stay at.
This filled triangle is going to be sensitive user identity: so, you know, my name, or maybe, in some contexts, my home router's IP address. The hollow triangle is non-sensitive user identity: some sort of temporary identifier, a random identifier. And then sensitive user data, again, is going to be context-specific.
It could be everything from the actual contents of a request that I make to some service, to the response that comes back; and non-sensitive user data would be, for example, the fact that I did make a request, but not the content of that request.
So then we can describe things using a tuple, which says there is some party, in some context, that has some knowledge about the user, and we're going to talk from a single user's standpoint for the moment. So if I write it like this, a filled triangle and a hollow circle, then we're saying somebody knows the sensitive user identity and non-sensitive user data.
So how do we apply this? We'll go through a couple of examples using existing systems and think about what decoupling is going on. But first there are some caveats. Obviously, identity and data are always shades of gray: it's really difficult to cleanly say "this is sensitive user identity, and this is not", but we're going to use these generally understood categories for the analysis, and then we can complicate it by thinking about side channels and all the shades of gray that also come about for user identity. The same thing, of course, is true with data: there are shades of gray in what counts as sensitive and non-sensitive, and it may even be contextual.
Some data might be sensitive in one context and not sensitive in another; the user is ultimately going to be the judge of that. And then, further still, identity and data are sometimes mixed and conflated, and so there, too, this is going to be complicated. So I'm going to put those caveats out there, but we're also going to ignore them for the moment and think about the simple case first.
So let's just look at something we all know, something like mixnets or Tor. In this context we have some sender trying to send a message, some request or data, to some receiver over this network, and they're trying to achieve some data or metadata privacy for their personal identifier and the message that they're sending. The mixes are third parties that relay the data, and the receiver is a partially trusted party who will receive and respond to the message.
So that's the setting that we have. We don't have to go into the specifics of a particular design for this analysis.
So in this context the sender, of course, has all the sensitive information: I'm the sender, I'm the user, and I know my identity (the triangle) and the data I'm requesting (the filled circle). Now, the first mix here knows my identity in some form, in the sense that I have to talk to them, so they know my IP address, or some sort of network identity for me, and the subsequent hops know less and less; they don't even know who I am.
They just know that they got a message from somebody, and there's the request itself, which they don't know. Now, there's also the setting in which I could pretend to be mix number one, so that mix number one actually thinks I'm relaying somebody else's request; there's that design in some systems, in which case mix one may not even know my identity. But let's leave that out for a second. And then, finally, the receiver is going to get a message from somebody.
They don't know who, unless I specifically convey it to them, and that's going to be non-sensitive user identity and, potentially, sensitive user data, because I am sending a specific request which they are capable of decrypting so that they can then give me a response; or maybe I'm just sending them data for their sake.
And so the basic idea, the decoupling principle, is really simple: third parties should know at most one of sensitive user identity and sensitive user data. Some of them might know the identity piece but not the data piece; some of them might know the data piece but not the identity piece. And it's not always simply that there's one type of identity and one type of data.
In that context you have a client, an issuer and an origin, and neither the issuer nor the origin knows both the identity and the data. In the context of Oblivious DNS and ODoH, the resolver and the oblivious resolver likewise know either the identity or the data, but not both; same with the origin. Then there's Pretty Good Phone Privacy, which is one of our systems.
This is one where we have a mobile identifier, so you have the user's human identity and their mobile identity, and no system knows both of those. In the context of Private Relay, you have the same decoupling across the multiple relays: the first relay knows the user's IP.
The second relay knows the origin that's being requested, but neither knows both. And then with private aggregate statistics, you have an aggregator and a collector, but neither knows both the private identity and the private data. So there are a lot of these, and this is a very incomplete list of examples of systems that have used this principle; really, the idea of this talk is to point out the similarity across all of these. So why does this seem to work? Why do people keep using it?
Why does it seem to work? This is an incomplete reason, but: users often care about hiding their true identity from semi-trusted services, and about hiding the data or metadata of their requests from untrusted parties, but they often don't care, though sometimes they do, whether they reveal that they are some user of a public or popular service. So I don't mind revealing that I'm using such-and-such a service.
I just don't want that service to know too much about me, and I don't want others to know too much about me. And users often don't care about hiding a request from the service that is actually providing it: if I'm requesting something from a website, and I have to reveal a little bit to get that information, then I'm willing to do that, because they are providing me a useful service.
So take the popular architecture, really not popular anymore, but a really common architecture, I'd say, fifteen years ago, which is: to improve the security of some network or some system X, let's just drop in a security gateway, a middlebox, somewhere, and that security gateway is going to do all the things we wanted to do to improve our privacy, improve our security, whatever it may be. In that context the sender, of course, has all the sensitive info, but the gateway also has all the sensitive info.
That gateway often was, and sometimes still is, doing all processing for that user, which means it's seeing decrypted traffic, it's seeing requests that are going out, it's seeing user identity, and so you have to put all your trust in it. We've always known that, but this is just a way of analyzing it that immediately flags the problem; and, again, the receiver doesn't have all that information.
So the value that we get here is that we can quickly identify problems that might arise by doing this quick decoupling analysis. This isn't to say that if we show we've decoupled, we've solved all problems: obviously we need to consider lots of other things, such as non-collusion between the different parties providing a service. Sometimes you get benefits from using hardware enclaves or trusted execution environments, so you can shift trust and therefore shift who knows what; and then, of course, there are side channels.
There was some chatter in the chat, though some of it was carry-over from the last presentation. Okay, hearing no questions: thank you, Barath, and we'll move on to the next presenter, which is Mike.
Okay, so my name is Mike Rosulek. I'm a faculty member at Oregon State University, and I also happen to be on sabbatical at Cloudflare Research these days. I'm going to be talking about this paper that's going to appear next month at USENIX, and I just want to thank the chairs for the opportunity to present here. Let's see if I can figure out how to drive the slides.
So this is a talk about SSH authentication, and specifically authentication using public keys. I want to review how things currently work in SSH. When I connect to an SSH server, my client offers a public key and says: hey, do you want me to authenticate under this public key? And the server might say no.
In which case my client will offer another public key, and ideally, eventually, the server finds a public key that it likes and says yes, in which case I authenticate by producing a standard signature over some random nonce.
I think one of the most well-known problems with this approach is that the server can fingerprint the client. What I mean is that the server can just say no to all of the client's advertisements, and actually, by default, the SSH client will send all of the public keys that are currently loaded into the SSH agent.
So the server can see your public keys, even keys that were presumably not generated for this particular server. I want to point out a cool application of this, or maybe it's creepy; I don't know if it's cool, but maybe creepy. This is what first made me aware of this problem.
C
Back in 2015, Ben Cox had a blog post pointing out that on GitHub, everyone's public keys are truly public: you can just look up anybody's public keys. In some cases that's a nice feature, but he points out that if somebody cared enough, they could collect a massive database of everyone's SSH keys, and that's exactly what he did. He did some analytics on those SSH keys, and then a few months later, Filippo Valsorda had a cryptic blog post where he invites the readers to SSH to his server.
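The crawl relied on a documented GitHub feature: each account's SSH public keys are served in plain text at a well-known address, one key per line. A minimal sketch of the lookup (the username here is just an example):

```python
# GitHub publishes every user's SSH public keys at
# https://github.com/<username>.keys, which is what made
# bulk collection of keys straightforward.
def github_keys_url(username: str) -> str:
    return f"https://github.com/{username}.keys"

# Fetch the result with any HTTP client, e.g.
#   curl -s https://github.com/octocat.keys
print(github_keys_url("octocat"))  # https://github.com/octocat.keys
```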
C
The server is still up; I encourage you to SSH to it. When I SSH to the server, this is the message that I get. In particular, I didn't type anything; I just typed ssh followed by this domain name, and it knew my full name and it knew my username on GitHub.
C
So that's kind of creepy. The reason this works is that my public key for GitHub SSH is loaded into my SSH agent all the time, because I'm always using GitHub, and so my SSH client offers it to Filippo's SSH server. He has a database of public keys, so he knows that this public key belongs to this user on GitHub. This problem can be resolved.
C
You can configure your client to only send keys to the servers that you expect, so this can be resolved with some configuration changes. If this were the only problem with SSH, then I wouldn't really have much to say, so I'm going to mention a few other issues with SSH authentication.
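The configuration change alluded to here is OpenSSH's IdentitiesOnly option, which stops the client from offering every key loaded into the agent. Something like the following in ~/.ssh/config would do it (the host and key file names are illustrative):

```
# Offer only the named key to github.com, not everything in the agent
Host github.com
    IdentityFile ~/.ssh/id_ed25519_github
    IdentitiesOnly yes

# Default for all other hosts: don't advertise agent keys indiscriminately
Host *
    IdentitiesOnly yes
```

With IdentitiesOnly yes, ssh offers only the identities configured via IdentityFile (or given on the command line), even if the agent holds more keys.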
C
Now, the SSH protocol does support a preemptive signature, so the client can provide a signature along with the public key in the offering, in the hope that this might save a round trip. But as far as I know, there's no SSH server that has a configuration option enforcing that these preemptive signatures be given.
C
So that's a bit of a problem. Another issue is that the server obviously sees which of the keys was used. If several keys are authorized to perform an operation on the SSH server, then the server observes which of the keys was actually used. This is kind of fundamental to the protocol, and it's even a little bit worse, because the SSH server can prove to anybody that somebody authenticated under this specific key, so authentication is not deniable. And last, and this is a little bit esoteric:
C
I probably won't have time to get into it too much, but the server can say no to all the advertisements, and it can also just say yes to all the advertisements and let everybody in, including people it could not have predicted in advance. That's pretty fundamental to the protocol.
C
Those are the inputs to the protocol, and you can see in this case that SK1 is supposed to go with PK1 and SK4 is supposed to go with PK4, for example. These keys can be a mixture of RSA, ECDSA, and so on. So, for example, the first public key could be RSA and the second public key could be DSA; all of these can be used together in one attempt.
C
What does the server learn from the interaction? The server learns the number of keys the client has, and it learns that at least one of this client's keys is authorized. In particular, it doesn't learn which ones: in this case, key number one and key number four are authorized, but the server doesn't learn that information.
C
It just knows that at least one of the authorized keys was used, but it doesn't learn which one. The client learns the number of keys that the server has, and it learns which of its keys were authorized: it learns that PK1 and PK4 were authorized keys. But in particular, the client cannot learn whether public key PK2 is authorized by the server, because the client doesn't know the corresponding secret key. So the client can't offer somebody else's public key and learn whether the server recognizes it.
C
This just works without any site-specific configuration, so it's safe for everybody to put all the keys they know about into this protocol, and you get pretty good privacy guarantees. And regarding the kind of strange attack I mentioned on the previous slide: the server can't convince the client that a connection was successful unless the server knows in advance a public key that's going to be used and explicitly includes that key in the protocol.
C
Let's see. So hopefully I have time to give a very high-level technical overview of how the protocol works. It has two main components. The first component is what we call an anonymous multi-KEM.
C
Basically, you can think of it this way: the server generates a ciphertext addressed to a set of public keys, and the ciphertext is c. While generating that ciphertext, the server knows that somebody who holds secret key j will decrypt it to message m_j. So the server learns all of these m_j messages, and we need the property that the ciphertext c hides the identities of the public-key recipients.
C
The server sends that ciphertext over to the client; the client can decrypt with all of its secret keys, and now some of these decryptions are equal to the values that the server already knows. To tell whether they have any in common, they use a private set intersection protocol. In private set intersection, each party has a set of items, and we use a variant where the client learns the intersection of the items and the server only learns whether the intersection was empty.
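To make the two components concrete, here is a toy, deliberately insecure sketch in the spirit of the description above: a single Diffie-Hellman ephemeral plays the role of the "anonymous multi-KEM" ciphertext, and the final comparison, which the real protocol performs inside a maliciously secure PSI so the server learns only whether the intersection is nonempty, is done in the clear here. The group, hash truncation, and key counts are all illustrative; this is not the paper's actual protocol.

```python
# Toy sketch of the anonymous multi-KEM + set-intersection idea.
# INSECURE demo parameters; illustration only, not the paper's protocol.
import hashlib
import secrets

P = 2**61 - 1   # small Mersenne prime, demo only
G = 3

def keygen():
    x = secrets.randbelow(P - 2) + 1
    return x, pow(G, x, P)          # (secret, public) pair

def tag(n: int) -> str:
    return hashlib.sha256(str(n).encode()).hexdigest()[:16]

# Client holds three keypairs.
client_keys = [keygen() for _ in range(3)]

# Server authorizes the client's keys 0 and 2, plus an unrelated key.
_, stranger_pk = keygen()
authorized = [client_keys[0][1], stranger_pk, client_keys[2][1]]

# Server: one ephemeral value "addressed" to all authorized keys.
# The ciphertext c reveals nothing about which public keys were used.
r = secrets.randbelow(P - 2) + 1
c = pow(G, r, P)
server_msgs = {tag(pow(pk, r, P)) for pk in authorized}   # the m_j values

# Client: decrypt c under every secret key it holds.
client_msgs = [tag(pow(c, x, P)) for x, _ in client_keys]

# The real protocol runs this comparison inside private set intersection,
# so the server learns only whether the intersection is nonempty.
matches = [i for i, m in enumerate(client_msgs) if m in server_msgs]
print(matches)   # keys 0 and 2 decrypt to values the server knows
```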
C
Maybe that's less interesting to this audience, but we show how to generate this multi-KEM so that it simultaneously supports the different key flavors supported by SSH, and we make a new modification to private set intersection: this way of proving that the intersection was not empty is kind of a new thing, and we show how to add it to the state-of-the-art PSI protocol. In the paper we have a full UC security proof.
C
So this is composable security, kind of the best kind of security that we know how to prove for an interactive protocol like this. Okay, so finally I want to mention the performance of this protocol. We implemented it as an extension of OpenSSH, and it's quite practical. I mentioned that the protocol supports RSA keys and elliptic-curve keys simultaneously, but the RSA keys are much more expensive.
C
For the worst case, let's look at when all the keys are RSA keys; that's the worst case in terms of performance. The best case in terms of performance is when all of the keys are elliptic-curve keys, and we really didn't find any difference between ECDSA and EdDSA, there's very little difference there, so I just lump them into one category as elliptic-curve keys.
C
So in a realistic setting where the client has five keys and the server has ten keys (I think that's realistic for a small GitHub repository), even with RSA keys it takes 60 milliseconds, and with elliptic-curve keys it's like instantaneous: nine milliseconds. I mean, nine milliseconds is not instantaneous, but for authenticating a connection, that's pretty good.
C
Even that is still, I believe, within the realm of reasonableness for authenticating a connection, even with RSA keys. We tried to take this to the extreme and imagined a server with 1,000 keys, that is, a server that authorizes 1,000 different public keys to connect. With RSA keys that's a little slow, over a second, but even with elliptic-curve keys it's less than a quarter of a second. So again, I think it's pretty reasonable.
C
So that's all I have; this is my last slide. It's just a summary of what we provide in this new protocol.
C
If you want more information, there's a link at the bottom; the paper is on ePrint, and I'll be happy to take questions.
A
There's a question in the chat from Chris P about the concrete performance numbers.
C
This was total round-trip time, from the client's perspective: from the time you say "I want to connect to the server". It includes the TCP setup as well, I think, but these were two servers on the same LAN. So it's the time from saying connect to the time that we can send the first application command to the SSH server.
F
When in doubt, turn the microphone on. This is Daniel Kahn Gillmor from the ACLU. Thanks for this presentation, and thanks for working on this; this has been a long-standing feature of the SSH protocol, or bug, depending on your perspective. Have you thought about how you would apply this to common patterns right now, like the git-based forges: GitHub, GitLab, etc.?
C
Yeah, that's a great observation. We do have a section in the paper where we talk about GitHub as the most obvious application of SSH.
C
So it's true that at the time you run the authentication protocol (let's see if I can illustrate with the picture; it's easier for me), the server has to know which keys are authorized, and if the server only knows that some GitHub user is connecting to some repository, that's not quite enough information. It turns out that in the SSH flow, the client says: I want to authenticate as this user.
C
You would signal that you want to use this new authentication protocol, and then the username would be the repository name. That's how we envision it working, but yeah, it would require some changes for sure.
B
All right, and there's one more quick question in chat. If it's a quick answer, we can answer it now; if not, then we can take it offline.
C
Different configuration and constraints server-side? I don't think I completely understand what this person means by configuration constraints. Maybe we'll take that offline and I'll ask for a clarification.
A
All right, this will be really quick. This is a draft that has been adopted by the working group, and I'm here to give you a really short presentation about why I think we should keep working on it, but I'd like some help. Next slide.
A
This is just a summary of the table of contents and what this draft does. In trying to define what safe internet measurement is, I think the focus on consent is good.
A
I did a bit of a rework on the table of contents, so rather than putting case studies under each of these versions of consent, they're now subsumed into their subsections on informed consent, proxy consent, and implied consent. Then there's a long, but maybe not exhaustive, list of safety considerations, because it is about safe internet measurement. I'm pretty happy with that list as it is right now, but as always there could be things missing from it. And then there's a final section on risk analysis.
A
Next slide, please. There are quite a few open issues, mostly because the original author, Iain Learmonth, who has done the vast majority of the work, already put those issues in there. They have not all been solved, but mostly they're quite low-hanging fruit: he's identified some really good citations that are within scope of the document, and they just need to be elaborated within the structure. Here's a short list of those six open issues; you can see them right there, and they're pretty obvious.
A
So yeah, there's a really basic update to this which I think makes the structure a little bit more straightforward. I'm also planning to send a message to the PEARG list, for those that aren't here or participating online in this meeting, to get more feedback from folks who might be interested in suggesting text for the open issues or reviewing the current version. I think I have two questions about review at this stage.
A
One: is the table of contents complete so far? And two: are there missing sections that we already know about? The last thing I'll say, before people comment or volunteer to help, is that you might have seen the IAB announce a workshop coming up in Q4.
A
I guess it's slated for late October, and submissions for papers are due, I think, at the end of August or early September. It's on measurement techniques in encrypted networks, which I think is a place where we could present this draft, in whatever version it's in, and get a bit more feedback from folks who are also thinking about these issues.
A
The workshop, I think, turns the concept on its head a little bit, because what it's trying to do is make network measurement a little bit easier, or try to solve some of the sticky issues with network management in encrypted environments, and this draft is sort of coming at it from a safety and privacy perspective: you want to make that measurement safe.
B
It doesn't look like there are any questions, but one recommendation: now that the IETF is working on privacy-preserving measurement, that's certainly a group that will probably need some guidance on how to use the systems they're developing, DAP specifically. So I wonder to what extent either this document would benefit from the work that group is doing, or that work would benefit from what's being developed here, but it seems like there's some cross-pollination that should be happening as we move forward.
A
I'd really love to get into that now, or we can talk in the hallway or at some point in the future. I think the idea is that this draft, especially because it's in the IRTF, would be taking a broader sort of approach to the issue, and then PPM would maybe, as you say, take advice from it.