Internet Engineering Task Force 105, 28 Jul 2019

From YouTube: ANRW-DNSandSecurity

Description

DNSANDSECURITY meeting session at ANRW

A

So this is the session on DNS and security. It's a fairly packed agenda. We have an invited talk on Dragonblood, discussing some problems with WPA3, that will hopefully be of very much interest to the security and cryptography people in the room, and we have several talks on DNS, with a lens on privacy, performance and security. So without further ado, I'd like to introduce Mathy Vanhoef. He is a postdoc at NYU. Let's give him a round of applause as he gets started.

B

So thank you for the introduction. I'm going to be talking about WPA3, and specifically about the Dragonfly handshake that is now used in this protocol. This Dragonfly handshake is also used in EAP-pwd, and this research was done in collaboration with Eyal Ronen. To give a very quick background, Dragonfly is a password-authenticated key exchange, which means it provides mutual authentication.

B

It negotiates a session key, which you can use after the handshake to encrypt actual communications, and, more importantly, it is a handshake that provides forward secrecy and defends against dictionary attacks. Now, in the case of Dragonfly, there is no protection against a server compromise. So concretely, let's say in the case of WPA3, if an attacker would somehow get access to the access point,

B

the attacker would obtain enough information to easily authenticate as a client. Recently, some PAKEs have also been introduced that protect against server compromise by introducing some kind of salt.

B

So what we did is we analyzed the security of Dragonfly. To explain the things that we found, I'm going to quickly introduce the most important concepts of the handshake. To do this, I'm going to assume that we have a client here that wants to connect to the access point. The first thing that has to be done with the Dragonfly handshake is that we have a shared password, which can just be an ordinary ASCII string,

B

and this is converted to either an elliptic curve point or, in general, into a so-called group element that we call P, and this P can then be used in the cryptographic calculations.

B

So the handshake consists of two main phases. I'm not going to go into detail. Basically, we first have the commit phase, where the actual session key is being negotiated, and then there is, at least in the case of Wi-Fi, a confirm phase that confirms that both parties indeed used the same password and that they negotiated the same session key. The important question here is: how is this password converted to a group element? I'm going to start with a simple case here, where we are using so-called MODP groups.

B

So these are not yet elliptic curves, because the Dragonfly handshake also supports MODP groups, and here the algorithm is a bit simpler to explain, and it also nicely illustrates the flaws of this method.

B

So an intuitive and naive way to convert the password into a group element is to simply take the hash of the password. In the case of Wi-Fi, we also combine it with the MAC addresses of the client and the server, so in other words, we include the identities of the peers. We then take the output of that hash, we perform some calculations to get a value P

B

that is an element of the cryptographic group we are using, and then we do a quick sanity check to make sure that we have a valid and secure element. In practice, this check always succeeds.

B

So this would be the intuitive way to convert the password into a group element, but there is one thing that is missing here. The problem is that our hash output here, our value, can be bigger than the prime of the group that is being used, and in this case the calculations that we did here wouldn't be valid. So how do we avoid this? Well, the way that it was decided to do this for Dragonfly was to simply include an if test here: if the value is bigger than the prime, then, well, we try again.

B

In other words, we include a counter here. We always increment it by one until we get a hash output that is smaller than the prime, and then we can continue. Now, of course, this leaks timing information. There's a side channel here, and in fact, on the mailing lists of the IETF and the CFRG, people warned about this. They said this doesn't look good.

B

You should do something else. But unfortunately, for the case with the MODP groups, this recommendation was never included in a specification, and this, of course, leads to a trivial, well, a rather easy timing attack.
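As a rough illustration of the try-again loop described above, here is a minimal Python sketch. It is not the Dragonfly key derivation itself: the hash input layout, the toy 16-bit prime `TOY_PRIME` and the function name `iterations_needed` are all assumptions made for the example. It only shows why the number of iterations depends on the password and on the MAC addresses.

```python
import hashlib

# Toy 16-bit prime standing in for a real MODP group prime (assumption).
TOY_PRIME = 32771


def iterations_needed(password: str, mac_a: str, mac_b: str,
                      prime: int = TOY_PRIME) -> int:
    """Return how many loop iterations run before the hash fits below the prime."""
    bits = prime.bit_length()
    counter = 1
    while True:
        data = f"{mac_a}|{mac_b}|{password}|{counter}".encode()
        digest = hashlib.sha256(data).digest()
        # Truncate the 256-bit digest to the bit length of the prime.
        value = int.from_bytes(digest, "big") >> (256 - bits)
        if value < prime:
            return counter  # this count is what a timing measurement reveals
        counter += 1


if __name__ == "__main__":
    for mac in ("02:00:00:00:00:01", "02:00:00:00:00:02", "02:00:00:00:00:03"):
        print(mac, iterations_needed("hunter2", mac, "02:00:00:00:00:ff"))
```

Because the MAC addresses are part of the hashed input, each spoofed address gives a fresh, independent iteration count for the same password.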

B

One thing that makes this timing attack a bit more interesting is that we can pretend to be client A. We can then measure how many iterations it took to find the password element, and because the identities of the clients are included here in the case of Wi-Fi, we can then spoof another MAC address. We can try to connect to the access point, and then we can see how many iterations are executed when using another MAC address. We can spoof a lot of clients, and each time

B

we can do a timing attack to measure how many iterations this algorithm took to find the group element.

B

So we tried this attack by setting up a Wi-Fi access point, so a WPA3 access point, on a somewhat older Raspberry Pi. The reason we picked the Raspberry Pi here is that its CPU is actually similar to that of a professional Wi-Fi access point, and when we then did the timing measurements, we found that indeed, these differences can be easily measured simply over Wi-Fi.

B

So you can see the example here, where the solid blue line corresponds to the response time if the algorithm only takes one iteration. Then we have the case where it takes two iterations, three iterations and so on. In practice, we found that against our target, if we make 75 timing measurements, then we can accurately tell how many iterations it did. So this is the case against WPA3. We can also perform a similar attack against the EAP-pwd protocol, against the so-called IWD client, which is a Wi-Fi client on Linux.

B

This implementation is a bit slower, and here, if you make 30 connection attempts, you can accurately determine how many loops were needed.

B

So we now know that there's a timing leak: we can derive the number of iterations that were being used. But now the question is: does this really leak important information? In other words, can we abuse this to, for example, perform a dictionary attack? And the answer is yes, we can. Let me take the following example. Say we are again attacking a WPA3 access point. We spoofed a client MAC address A, and we measured that in this case the access point took two iterations to derive the element.

B

Well, what we can then do is guess some passwords. In our example here we're trying three different passwords, and we can then simulate this method that converts the password to a group element. We can simulate it offline on our own PC, and we can notice here that, for example, for password one, it only uses one iteration, which doesn't correspond with our measurements, so we can exclude that password based on that. The other two passwords are still possible.

B

So then we can spoof MAC address B. We can again measure how many iterations it takes in the real world, we can compare this to our simulated results, and we can continue this way and exclude passwords that don't match our observations. We can continue doing this until we uniquely determine the password that is being used. So in this example it's password three that matches our observations, and in general,

B

if, for example, you want to test a dictionary of 10 to the power 7 passwords, which is roughly the same size as the RockYou password leak, then we need to do this attack for about 17 MAC addresses, and in practice we found that this is quite feasible in just a few hours.

B

The conclusion here is that the number of iterations that are executed essentially forms a signature of the password that's being used, and we can use that to then do an offline brute-force attack.
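The pruning step described above can be sketched as follows. This is a toy simulation under stated assumptions, not the attack tooling from the talk: the hash layout, the 16-bit `TOY_PRIME`, the address `AP_MAC`, the dictionary and the helper names are all hypothetical, and the "measurements" are simply computed from a known secret instead of being timed over Wi-Fi.

```python
import hashlib

TOY_PRIME = 32771             # toy 16-bit prime (assumption, not a real group)
AP_MAC = "02:00:00:00:00:ff"  # hypothetical access point address


def iterations_needed(password, client_mac, prime=TOY_PRIME):
    """Simulate the try-again loop and count its iterations."""
    bits = prime.bit_length()
    counter = 1
    while True:
        data = f"{client_mac}|{AP_MAC}|{password}|{counter}".encode()
        value = int.from_bytes(hashlib.sha256(data).digest(), "big") >> (256 - bits)
        if value < prime:
            return counter
        counter += 1


def prune(candidates, observations):
    """Keep only passwords whose simulated counts match every measurement."""
    return [pw for pw in candidates
            if all(iterations_needed(pw, mac) == n for mac, n in observations.items())]


# Stand-in for timing measurements: one observed count per spoofed MAC address.
secret = "password1337"
spoofed_macs = [f"02:00:00:00:01:{i:02x}" for i in range(30)]
observations = {mac: iterations_needed(secret, mac) for mac in spoofed_macs}

dictionary = [f"password{i}" for i in range(2000)]
survivors = prune(dictionary, observations)
print(len(dictionary), "candidates ->", len(survivors), "survivor(s)")
```

Each additional spoofed MAC address removes a constant fraction of the remaining wrong candidates, which is why a modest number of addresses is enough to cover a large dictionary.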

B

So I have now covered the case when we use MODP groups, but Dragonfly also supports elliptic curves. So the question is: are they affected as well?

B

Luckily, at least if we use NIST curves, Dragonfly is not affected, because in that case they essentially listened to the warnings of the people on the IETF and CFRG mailing lists, and they included some defenses against it.

B

Unfortunately, Dragonfly also supports Brainpool curves and, interestingly, we initially didn't look at these Brainpool curves, because Dragonfly supports a lot of parameters, so it's hard to analyze every possible scenario. The reason we did look at Brainpool curves is that, after our initial disclosure to the Wi-Fi Alliance, they privately created some recommendations on how to avoid our attacks, and in those recommendations they said: OK, you can use Brainpool curves, they are safe to use, there are no timing attacks against them.

B

Unfortunately, when we checked that, there was bad news: if you do use Brainpool curves, you are vulnerable. So here we also fall back on the typical security advice: don't write recommendations and security advice privately. But we should all know that already. I want to briefly give some insight into why these Brainpool curves also have timing leaks. To do this, I'm going to do the same thing as I did with the MODP groups: I'm going to briefly explain how the algorithm works to convert a password into an elliptic curve

B

point, and I'm then going to explain where this timing information is leaked. So if you want to hash passwords to an elliptic curve point, a naive idea would again be to just take the password. In this case, we again combine it with the MAC addresses of the client and the access point. We take the hash output as the x value, and then we just find the corresponding y value and use that as the elliptic curve point. But you may already have one remark about this, namely that there's not always a solution for y.

B

So what can we do? Well, the first thing we can do is calculate this value here, which is basically y squared, and we can see if it is a so-called quadratic residue. If this value is a quadratic residue, then we know that a solution for y exists. To handle the case where there is no solution, we can just perform a loop. In other words, we include a counter in the values that are being hashed, and we continue executing loops until we find a solution where the square root exists.

B

Now, there's one problem with this, and this is that different passwords again have a different execution time. Luckily, in the case of elliptic curves, they listened to the warnings and they included the suggested defense of always executing k iterations, no matter when the password element was found, and in practice here they generally use a value of k equal to 40.

B

So, no matter when the password element is found, you're always executing 40 iterations, and this prevents at least the timing attacks against NIST curves.
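To make the fixed-iteration defense concrete, here is a minimal Python sketch on a toy curve. Everything about the setup is an assumption for illustration: the prime `P`, the curve coefficients `A` and `B`, the hash layout and the function names are hypothetical, and the hash output is reduced modulo `P` so that every iteration yields a usable x value, which is a simplification of the real algorithm. The sketch is also not constant-time; it only shows the control flow of always running k iterations and switching to a random password after the first hit.

```python
import hashlib
import secrets

P = 32771          # toy prime with P % 4 == 3 (assumption)
A, B = P - 3, 7    # toy curve y^2 = x^3 - 3x + 7 over GF(P) (assumption)
K = 40             # fixed iteration count mentioned in the talk


def is_quadratic_residue(v: int) -> bool:
    """Euler's criterion: v is a nonzero square mod P iff v^((P-1)/2) == 1."""
    return pow(v, (P - 1) // 2, P) == 1


def find_password_element(password: str, mac_a: str, mac_b: str, k: int = K):
    """Always run k iterations; remember only the first valid point."""
    point = None
    seed = password
    iterations = 0
    for counter in range(1, k + 1):
        iterations += 1
        data = f"{mac_a}|{mac_b}|{seed}|{counter}".encode()
        # Simplification: reduce mod P so every iteration yields a valid x.
        x = int.from_bytes(hashlib.sha256(data).digest(), "big") % P
        rhs = (x * x * x + A * x + B) % P
        if is_quadratic_residue(rhs) and point is None:
            y = pow(rhs, (P + 1) // 4, P)  # square root, valid since P % 4 == 3
            point = (x, y)
            seed = secrets.token_hex(16)   # remaining rounds use a random password
    return point, iterations
```

The quadratic residue test runs in every one of the k rounds here, so the loop count no longer depends on the password.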

B

There is one other extra defense that they included here, which is that once the password element has been found, they execute these extra iterations based on a random password. Now, this is just defense in depth in case there is something wrong with the code. That's the best way to explain it for now. Now, what is the problem here? The problem is quite similar to the MODP case, so maybe some of you can already figure it out. The problem is again here.

B

This hash value that we get here as output can be bigger than the prime that is being used in the cryptographic group. What was the solution here? Well, they again just say that if the value is bigger than the prime, just go to the next loop. And you can see the problem here: if this value is indeed bigger than the prime, then we don't do the quadratic residue test, and we again get different execution times.

B

The interesting thing here is that if we use NIST curves, the probability of this condition being true is very low, but with Brainpool curves, this has a probability of, say, between 20 and close to 50 percent, depending on the specific curve you're using, and in that case the quadratic residue test may be skipped. So again we're in trouble.
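The difference between the curve families can be checked with a few lines of arithmetic. The sketch below assumes the hash output is a uniformly random value with as many bits as the prime, and computes how often such a value is at least as large as the prime. The two field primes are quoted from memory of FIPS 186 (P-256) and RFC 5639 (brainpoolP256r1) and should be verified against those documents before being reused; the function name `skip_probability` is hypothetical.

```python
from fractions import Fraction

# Field primes quoted from memory of FIPS 186 (P-256) and RFC 5639
# (brainpoolP256r1); treat them as assumptions and verify before reuse.
NIST_P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
BRAINPOOL_P256R1 = 0xA9FB57DBA1EEA9BC3E660A909D838D726E3BF623D52620282013481D1F6E5377


def skip_probability(prime: int) -> Fraction:
    """Chance that a uniform value with prime.bit_length() bits is >= prime."""
    space = 2 ** prime.bit_length()
    return Fraction(space - prime, space)


for name, prime in (("NIST P-256", NIST_P256),
                    ("brainpoolP256r1", BRAINPOOL_P256R1)):
    print(f"{name}: {float(skip_probability(prime)):.3e}")
```

A prime that sits just below a power of two almost never triggers the skip, while a prime well below the power of two triggers it in a sizeable fraction of iterations.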

B

The timing attack against this case is a bit less straightforward, because we do still have these extra iterations that are being executed, and on top of that, the extra iterations that are executed after the password element has been found are based on a random password. So to quickly illustrate the impact of this in practice, let's say I perform some executions using the same MAC address. In this case here, green illustrates that I executed the quadratic residue test in a loop, and the small bar here illustrates that the output was bigger than the prime,

B

so there's no quadratic residue test. Say that after four iterations we found the password element. At that point, extra iterations are executed based on a random password, so we get some amount of extra time that the algorithm takes. If you now execute the algorithm again, the first iterations are always identical, because they're based on the real password, but these extra iterations use a random password. So each time we also get a random amount of extra time that is being added to the execution time.

B

But even with this, it leaks sensitive information, because now the variance of the execution time depends on when the password element was found.

B

For example, if the password element is found immediately after the first iteration, you get a lot of variance, but if the password element was found, say, after 20 iterations, you get a low amount of variance. On top of that, the average execution time also still leaks information, because that depends on the number of iterations needed to find the password element and on how many of those iterations had a hash output bigger than the prime, so that there was no quadratic residue test.

B

So again, we have the same case here: this forms a signature of the password, and we can use that in an offline dictionary attack. If we perform this test again against a Raspberry Pi, we again notice that this timing information can be measured over Wi-Fi. In the case of Brainpool, we need to make more measurements per MAC address, because the timing differences are smaller, but it's still feasible to do this in practice in a few hours.

B

So another possibility here, and I'm going to cover this very briefly: instead of doing timing attacks, we can also do cache attacks. We have the same algorithm as before, and we can basically use Flush+Reload to detect when the password element was found. We can do the same thing with Brainpool: we can use Flush+Reload to detect when the hash output was bigger than the prime. Now, I'm not going to discuss much more, because I don't have time for this. We actually found a lot more interesting things. We also found some implementation-specific

B

vulnerabilities in EAP-pwd, and the one I want to highlight is that if you use bad randomness in Dragonfly, then you can recover the plaintext password. So that's one thing to take into account. If you design a handshake like this, I think it's good to also discuss what the impact would be if there is a flawed source of randomness.

B

We also found attacks specific to Wi-Fi, but I'm not going to mention them here. The interesting thing is that, finally, the Wi-Fi standard is now being updated to use a constant-time algorithm to find the password element, and maybe this will be included in WPA3. I'm not sure, but at least in the 802.11 group they are working on it. So with that I'd like to conclude my talk. Thank you for your attention.

A

So we have time for probably one quick question, and then we have to move on to the next presenter. If that person could come up and get ready, that'd be great.

C

The first two attacks you showed look like they could be easily, well, relatively simply worked around just by redesigning exactly how the code works: by iterating as many times and always performing the QR test, even if the value is greater than p. Are your other issues,

C

the ones you haven't talked about, also ones that can be worked around?

B

So there are workarounds, but they are not ideal in practice. I mean, theoretically speaking, you can implement this in a secure way to avoid cache attacks and to avoid timing attacks.

B

But the first problem is that this is very tedious in practice, so there's a high chance of developers accidentally still introducing a cache attack or a timing leak, so I would not recommend that at all. The other problem is that a lot of the time the Wi-Fi handshake is offloaded to the Wi-Fi chip itself, which is basically resource-limited, and then always executing 40 iterations is just way too costly.

C

If you wouldn't recommend it, what solution would you recommend?

B

So one thing I would recommend is to allow the offline computation of the password element, so it doesn't have to be recomputed every time you perform the handshake. The second one, as mentioned here, is to use a constant-time way to do it, and there's an RFC being written with several ways to do that.

C

In other words, modify the protocol, unfortunately.

A

Sorry, we don't have time for any more questions at this point. I'm sure you could take it offline.

B

I'm free to discuss things offline as well today.

A

Thank you. Let's thank the speaker again.

A

All right, our next speaker is Zhou Li. He is an assistant professor at UC Irvine, and he's going to be talking about DNS interception. So let's give him a round of applause before he gets started.

D

Okay, thank you for the nice introduction, and welcome to my talk. My name is Zhou Li. I'm currently an assistant professor at UC Irvine, and this is joint work with my coauthors from Tsinghua University and Fudan University in China and UT Dallas. Actually, one of the coauthors is sitting in here, so if you have questions regarding our talk later, you can come to us during the break. So let's get started. The topic of this talk is DNS security.

D

OK, so first, let's give a quick background overview of DNS. In this talk we focus on the DNS resolution process. Essentially, say you are a client and you want to get the IP address of a domain name, for example ietf.org.

D

So typically you ask your recursive resolver, and then the recursive resolver will handle the whole resolution process for you, going to the different authoritative servers, like the root name server, the top-level domain name server and the second-level domain name server, to get the IP address of your domain and give it back to the client. When you have a contract with an ISP, typically they will give you a default recursive resolver to use, but it turns out that more and more users prefer to use public

D

DNS resolvers. We have Google, we have OpenDNS, we also have Cloudflare. They are very good. I believe the three major reasons for users to switch to public DNS resolvers could be the performance, the better security, and also the support for DNS extensions, which turns out to be maybe better. So the security issue we investigate in this talk is this: say, as a user, you choose a public DNS resolver and you want this public DNS resolver to handle your request.

D

You still send that domain request to this resolver, but there are some on-path middleboxes that can see your request, assuming your DNS request is not encrypted. So what can really happen? In particular, we look into this specific case: say there's a guy on the path who intercepts your DNS request, redirects it to an alternative resolver, and lets this alternative resolver handle the resolution.

D

So then the public DNS, say Google, will be kicked out of this picture completely, and this alternative resolver will do the query and the response handling and finally give the response to the user. So that's the main problem we look into in this work, and it turns out this kind of request

D

tampering is really hard to detect, because the alternative resolver can basically spoof the IP address, so it can pretend to be Google. If you are the client looking at the source IP address of these packets, it's really hard to discern those cases.

D

So we found that there are four types of potential interceptors doing this. Network providers, ISPs, are only one of them. Censorship firewalls are also doing that, and there have already been some reports talking about that. Then antivirus software and malware are doing that, and also enterprise proxies are doing that. As an example, for the ISPs we found there have been some reports and news about these practices, and this type of middlebox is actually named a transparent DNS proxy by those parties.

D

So it's kind of something known to the community, but in this work we try to do a large-scale, more comprehensive analysis instead of relying on individual news reports. That's the main contribution of this work. So basically the two main questions we want to answer in this research are: first, how prevalent is this practice, how prevalent is DNS interception? And second, what are the characteristics of DNS interception: what are the strategies, and what do they really look like in practice? So let's first come to the threat model.

D

So actually, in the previous example, we only introduced one way to do DNS interception, and it turns out there are a few more ways to do that. Let's first come to our basic picture. Say we have these five parties. We have a client, the public DNS and the authoritative server; they are the normal parties handling your DNS requests and responses. Then we have an on-path device and an alternative resolver, which is likely to belong to the same owner as this on-path device.

D

So when the on-path device doesn't do anything anomalous, anything suspicious, it should just forward your request to the public DNS and let the public DNS handle everything. The bad things happen when this on-path device tries to change your request. The first example we look into is request redirection. In that case the on-path device will simply block your request to the public DNS, for example Google, and instead it will

D

redirect your actual request to its own alternative resolver, and then the alternative resolver will do the resolution and response handling. Actually, there's another case of DNS interception. In that case the on-path device will replicate your request to different places. So first, your request will still go to the public DNS and get resolved. In the meantime, the on-path device will copy the same request and issue it to its own alternative resolver.

D

So from the perspective of the authoritative server, there will be two requests, and then two responses will go to the client, and typically the first one to reach the client will be cached and used by the client. There's also a third category of DNS interception. In that case the request to the public DNS is still blocked, and the request is sent to the alternative resolver.

D

But when this happens, the alternative resolver refrains from sending your request to the authoritative server; instead, it will directly give the response to the client. So, in the end, we are looking into those three types of DNS interception. Next, let us take a look at the methodology with which we try to detect these DNS interception practices. Actually, from the previous three examples,

D

you may already have some idea of how to detect this. Say you are able to control the clients, you are also able to control some authoritative name servers, and you are able to direct your clients to send requests for your own authoritative servers. Then, by looking at the differences between the request patterns from the public DNS resolvers and from the alternative resolvers, you are able to discern those cases.
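The three interception types can be told apart from what the authoritative server sees for a given probe. The sketch below is one hypothetical way to encode that decision, not the authors' measurement code: the `Verdict` names, the `classify` function and its inputs are assumptions, and it presumes each probe uses a fresh name so that no cache can answer it.

```python
from enum import Enum
from typing import Set


class Verdict(Enum):
    NORMAL = "normal resolution"
    REDIRECTION = "request redirection"
    REPLICATION = "request replication"
    DIRECT_RESPONDING = "direct responding"
    NO_ANSWER = "no answer"


def classify(sources_seen: Set[str], public_egress: Set[str],
             client_got_answer: bool) -> Verdict:
    """Classify one probe from the resolver addresses seen at our server.

    sources_seen: source IPs of queries for this probe at the authoritative server.
    public_egress: addresses the chosen public resolver is expected to query from.
    """
    if not sources_seen:
        # Nothing reached our server, yet the client may still hold an answer.
        return Verdict.DIRECT_RESPONDING if client_got_answer else Verdict.NO_ANSWER
    from_public = bool(sources_seen & public_egress)
    from_other = bool(sources_seen - public_egress)
    if from_public and from_other:
        return Verdict.REPLICATION
    if from_other:
        return Verdict.REDIRECTION
    return Verdict.NORMAL
```

For example, a probe whose only query arrives from an address outside the public resolver's egress set would be classified as redirection.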

D

The main reason is that, by looking at the source IP address of the requests that arrive at the authoritative servers, if the source IP address does not really belong to, for example, Google, but belongs to something you have no idea about, nothing relevant to Google, then you can figure out that there may be some issues with the DNS resolution process. So actually this is what we do.
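The source-address check can be sketched with Python's standard `ipaddress` module. The prefixes below are documentation ranges used as placeholders, not real resolver egress ranges, and the names `PUBLIC_RESOLVER_EGRESS` and `from_expected_resolver` are hypothetical; a real measurement would need the actual egress ranges of each public resolver.

```python
import ipaddress

# Hypothetical egress ranges for a public resolver (documentation prefixes,
# not real resolver addresses).
PUBLIC_RESOLVER_EGRESS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8:1::/48"),
]


def from_expected_resolver(source_ip: str, egress=PUBLIC_RESOLVER_EGRESS) -> bool:
    """True if a query's source address falls inside the expected egress ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in egress)


if __name__ == "__main__":
    for ip in ("192.0.2.53", "198.51.100.7", "2001:db8:1::53"):
        print(ip, from_expected_resolver(ip))
```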

D

We do this end-to-end data collection and comparison. We control some clients and ask them to send a large number of requests, and we also control some authoritative servers to receive those requests, and then we do some comparison. But still, there are two major challenges we need to address. First, how can we gather a large number of vantage points? I mean, some of those middleboxes are very close to the clients.

D

So if you want to observe a large number of those DNS interceptions, you will need a large number of vantage points. We actually leverage two platforms. The first comes from ProxyRack, which is a SOCKS residential proxy network. As a customer, you can buy its services, and it's a peer-to-peer proxy network, so you can send your request to its gateway, the gateway will forward it to a peer in its network, and that peer will redirect

D

your request to some other place. So we were able to leverage a large number of IP addresses to do this measurement study. This is the first platform we use, but its limitation is that it only supports TCP, and we know DNS is mainly based on UDP. So to measure UDP DNS requests,

D

what we do is work with a company in China, with whom we have a good relationship and who allows us to put our code in its network debugger module. So we implement our measurement logic and let it run on the client devices. By doing that, we are able to measure both TCP and UDP. The second challenge we need to address is how we are able to see the interception policies of the middleboxes, because they are kind of a black box.

D

We cannot really go there, open the box and see how they implement their rules. So what we try to do in the end is to enumerate all the possible policies, and this is a best-effort approach. We actually focus on five types of fields. First are the public DNS resolvers that are designated by the users to handle the requests; second, the different protocols; third, the different types of requests; and we also look into the different TLDs. Finally, there's a particular challenge

D

here: how are we able to link the requests from the clients to the requests arriving at the authoritative servers? Because when the requests come through those middleboxes, the source IP address will be changed, so there's no obvious way to link them. In the end, we developed this trick: we encode a unique ID for the source into the domain name, and it turns out that those middleboxes don't really change the domain name.
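The linking trick can be illustrated with a small sketch that embeds identifiers in the queried name and recovers them on the server side. The label layout, the zone `probe.example` and both function names are assumptions for the example; the talk does not specify how the real identifiers were encoded.

```python
import uuid

ZONE = "probe.example"  # hypothetical zone served by our own authoritative server


def make_probe_name(client_id: str, resolver_tag: str) -> str:
    """Build a unique query name that identifies the sender and target resolver."""
    token = uuid.uuid4().hex  # fresh per query, so no cache can answer it
    return f"{token}.{client_id}.{resolver_tag}.{ZONE}".lower()


def parse_probe_name(qname: str):
    """Recover the identifiers from a name seen at the authoritative server."""
    name = qname.rstrip(".").lower()  # DNS names are case-insensitive
    suffix = "." + ZONE
    if not name.endswith(suffix):
        return None
    labels = name[: -len(suffix)].split(".")
    if len(labels) != 3:
        return None
    token, client_id, resolver_tag = labels
    return {"token": token, "client": client_id, "resolver": resolver_tag}


if __name__ == "__main__":
    name = make_probe_name("c42", "google-udp")
    print(name, parse_probe_name(name))
```

Because the name survives the path even when the source address is rewritten, the server log entry can be matched back to the client that issued the probe.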

D

By doing that, from the perspective of the authoritative name servers, we can link the requests from the original senders to the ones they receive themselves. So, in the end, by using these two platforms, we were able to send six million requests through the public DNS resolvers to our authoritative name servers, and we have good coverage in terms of geolocation. We have more than 170 countries covered in our dataset, and more than 3,000 autonomous systems, ASes, in our

D

dataset. So we believe this is a really good dataset to look into. Due to the time limit, I will only talk about three major observations from our study. The first question we want to answer is: how many queries are intercepted? We look at the two different platforms. For the global analysis, we have about 1,700 ASes, and it turns out in the end there are 198 ASes doing interception. For the experiments in China we found that, of those 356

D

ASes, there are 61 ASes doing interception. So those numbers are actually not small, and we think this turns out to be an issue we really need to take good care of. Then we also look into the differences between the public DNS resolvers, and it turns out that if you try to go to a more popular, well-known public DNS resolver, the chance that your request gets intercepted may be higher.

D

So actually, if you are in China and you use Google's public DNS, if you're using UDP there is a 28 percent chance that your requests are being intercepted, and if you're using TCP, the ratio is smaller. We also have one recursive resolver set up by ourselves, under China's EDU network, and it turns out the ratio is much smaller. I think it makes sense, because I don't think those EDU DNS resolvers are known to them or interesting to them.

D

And the second question we want to answer is: how are my queries intercepted? We talked about three types of DNS interception: request redirection, request replication and direct responding. It turns out that, in most cases, those middleboxes will do request redirection, a much smaller ratio will do request replication, and direct responding is really, really rare. And the third question we want to answer is: are my responses tampered with? I think, from the security perspective,

D

this is the most important question we want to answer, and it turns out that the message is kind of positive, in some sense, because most of the responses are not tampered with. From the six million responses, only hundreds of them change the result without users' consent. There's one interesting case we want to briefly go through here, which is traffic monetization.

D

So actually, if you are in China and you belong to this China Mobile Group Yunnan autonomous system, if you use Google's public DNS and you send a request for Yahoo's IP address, then your response will be changed to this app advertisement's IP. This app advertisement also belongs to the same AS, so they actually somehow monetize the traffic from your DNS requests.

D

Okay, so I will quickly go through some ways we think can address this issue. First, we think this issue should still be taken care of, even though it turns out that response tampering is not that common, because I think, from the users' perspective, it's their right to know who is really the party handling their requests, and so far there isn't a way for users to know that. Secondly, we found the security of those alternative resolvers is not really good.

D

Actually, only 43% of them support the s sec and we found for those resolvers using the pant. Yes, resolution to kids, it turns out. All of those versions should be deprecated well before 2009, so they are using all those of very vulnerable versions that are really not good, so two tips about addressing this problem. First, we think we think the attack is still a relevant technology to address this issue. I mean you, those recursive resolvers may ignore the asset as a client.

D

if you verify your responses using DNSSEC, there's a chance you can detect such response tampering and prevent these bad things from happening. Another suggestion we want to give is to use encrypted DNS. For example, if you set up a TLS connection between you and the recursive resolver, you can use the certificate to verify: okay, this turns out to be the right recursive resolver, or this turns out to be something I have no idea about.

D

So we know there are some very good, very interesting RFCs working in this direction, and we believe this is the right way to go. In the meantime, we also provide this online checking tool, so even without encryption, if you go to our website, you can clearly see which DNS recursive resolver you are really using. To conclude, we did the first large-scale, end-to-end measurement of this issue of DNS interception, based on alternative resolvers, and we have some interesting findings.

D

For example, we found there are 259 ASes doing this, and if you are in China and you try to use Google's Public DNS, you should be really careful about that. Then there are some security concerns. And finally, we think there are some mitigations, and we also propose an online checking tool. In the end, we think this issue should be addressed by the efforts of the community. We have more details in our paper, which was published at USENIX Security last year, and here is my contact information.

D

If you want to send questions or have discussions with me or the other authors, feel free to do so. Thank you so much.

A

We have time for one quick question, if someone is going up to the mic.

A

I'm sorry, only one. I'm sorry.

E

For him by Bosco vodka. In your measurement, did you refer to mobile networks or fixed networks? In the case of mobile networks, did you consider a local offload, or did you consider internet access provided through an operator network?

D

So, actually, we have two platforms. The second one, the one with the security software, is a mobile network. For the details of the network operators, I need to check the paper and with the students, but I think we can discuss later. We have the details in the paper, yes.

E

I'm asking because, at least in the context of a mobile network, the creation of the internet PDN, which I assume is the one on which you are operating, directly provides you with a DNS resolver, and virtually always the functionality is not offered. Okay.

A

All right, let's thank the speaker again.

A

All right, so the next talk will be on Oblivious DNS, given by Paul Schmitt, who is an associate research scholar at the Center for Information Technology Policy at Princeton University. So let's give him a round of applause to get started.

F

So this is work that was actually presented at the PETS symposium last week in Stockholm. In the interest of time, since I've got slides matching the previous talk, I'm just going to sort of jump through this very quickly.

F

What we're looking at with conventional DNS is: you issue this query, and you're typically sending it to your ISP, which is running a recursive resolver, and then it goes out to the rest of the DNS hierarchy in the world. The problem with that is that the recursive DNS resolver has all kinds of insight: it can see both your identity, in the form of your IP address, and basically all of your actions on the internet.

F

So you can easily put together a database of users and what they're doing, like one user is going to Google and Amazon and another is going to Bing. They can also see what types of devices you have on your network, and because of that, these recursive resolvers can be the targets of data requests. So an oppressive regime can simply say: hey, you need to hand over all of your DNS logs. And so it's this dangerous position where we're sort of trusting these ISPs, or any recursive resolver, to hold all of this information.

F

There are these cloud services, Google, Quad9, Cloudflare, that are offering recursive resolvers openly, and they say, you know, obviously they're not going to keep logs, which is likely true; there's too much data to actually store this kind of thing. But it doesn't actually solve the problem. They're promising to throw things away. It doesn't fix anything: we're still trusting them, we're just shifting trust from our ISPs to these resolvers, and so fundamentally there's still a problem there.

F

You may have heard of other DNS privacy-focused work like DNS over TLS or HTTPS. That's just encrypting the transport. So that's going to fix things like eavesdroppers between you and the resolver, but the resolver still gets to see all of your queries and your identity. QNAME minimization hides things from the rest of the DNS hierarchy, but it doesn't solve the fundamental problem here. And so what we've done is we've designed a system that we call Oblivious DNS, which essentially separates the user identity from the queries at the recursive resolver.

F

We built this with the requirement that we had to be compatible with existing infrastructure. It's very hard to change DNS software on authoritatives or at recursive servers. It's a really old ecosystem. It's just not simple to sort of throw things out and build a new protocol altogether. We also had to minimize overhead, because DNS underpins basically all web traffic. So what we did was we made a couple of changes. The first was we modified the stub. Stubs normally operate as lightweight processes on your OS.

F

They take care of DNS resolution for the applications on your machine. What we did was we use AES symmetric keys that we generate on the fly, and we encrypt your query to a ciphertext. We then append a domain that we own; imagine we own the TLD .odns. So we append this cleartext domain, which means the recursive server is not going to be able to understand your query.

F

It's just ciphertext, but it will use the existing DNS infrastructure to eventually reach the authoritative that we own, say .odns. At that point, we've got this ODNS authoritative server, which is both an authoritative server and a recursive server. It holds the key pair, and it uses the private key to decrypt the session key; the session key decrypts the query. It then acts as a recursive and does the entire process over again, going to the root, the TLD, and the actual genuine plaintext authoritative.

F

So this separates the client identity from the query: the identity is held at the recursive server, the ISP recursive, and the ODNS authoritative gets to see the actual query, but it doesn't get to see the user, because we're essentially tunneling DNS over DNS and we're abusing the existing recursive resolvers out there.
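
A minimal sketch of how a stub might turn an encrypted query into a name that ordinary recursive resolvers will forward, assuming base32 as the label encoding and a hypothetical `odns` suffix. The talk does not say which encoding is used, and random bytes stand in here for the AES ciphertext and the wrapped session key.

```python
import base64
import os

MAX_LABEL = 63    # DNS limit on a single label, in bytes
MAX_NAME = 253    # practical limit on a full name in text form


def to_odns_name(ciphertext: bytes, suffix: str = "odns") -> str:
    """Encode opaque ciphertext as DNS labels and append the cleartext suffix."""
    # Base32 is case-insensitive, so it survives resolvers that alter case.
    text = base64.b32encode(ciphertext).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [suffix])
    if len(name) > MAX_NAME:
        raise ValueError("ciphertext too large for one DNS name")
    return name


def from_odns_name(name: str, suffix: str = "odns") -> bytes:
    """Recover the ciphertext at the ODNS authoritative server."""
    labels = name.split(".")
    if labels[-1] != suffix:
        raise ValueError("not an ODNS name")
    text = "".join(labels[:-1]).upper()
    text += "=" * (-len(text) % 8)  # restore the base32 padding
    return base64.b32decode(text)


# Stand-in for the encrypted query; a real stub would produce this with AES.
fake_ciphertext = os.urandom(48)
qname = to_odns_name(fake_ciphertext)
assert from_odns_name(qname) == fake_ciphertext
```

The length check reflects the constraint that the whole encrypted payload has to fit inside a single query name.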

F

Of course, we're introducing these operations, and there's going to be some overhead. So what we did was we measured this. We implemented a stub and our ODNS resolver in Go, and these tests were done on the same machine, so there's no sort of WAN latency. What we see is that, as expected, the symmetric crypto is really lightweight. You can generate your keys, you can encrypt your domains, and you can decrypt the domains very, very quickly. We used elliptic curve cryptography.

F

Otherwise the ciphertext is going to be much larger than you can actually issue in a DNS query, and so we use this elliptic curve cryptography.

F

For doing this, we used standard libraries, and this was on my laptop, so if we used a server-class machine with optimized libraries, things should work better.

F

We then fed this the Alexa top 10K domains, and we see conventional DNS outperforms ODNS by about 1 to 2 milliseconds overall, although we feel like we're not doing too poorly with just the standard libraries we're using.

F

So that's one piece of the latency. The other thing that we introduce is WAN latency: because we're tunneling to this ODNS resolver, essentially the round-trip time to that resolver is added to every query. You can see this illustrated in this figure. We've got conventional DNS. Our client was in New Jersey, and you can see Cloudflare is doing quite well.

F

Google and Quad9 are the other solid lines. Then we used two ODNS resolvers: one was in New York City, which is about 4 and a half milliseconds away from us, and the other was in Georgia, which is about 19 milliseconds away. What happens is basically that the RTT is just added to all queries. So this obviously motivates a widespread deployment of ODNS resolvers.

F

You can't just have a single ODNS resolver out there in the world, because you introduce a huge amount of latency for a large number of people, so we argue for widespread anycast deployment in order to sort of get nearby. Because we're using anycast, this introduces a problem: we're using this public key crypto to decrypt the session keys, and we can't just hand out the same public key to all servers on anycast. That would be incredibly unwise. And so what we do is,

F

we have a special request that is sent to that anycast address, and the ODNS resolver that you're nearest to according to BGP will respond with its name, which is, you know, us1.odns, something like that, and it sends back its public key. Your client can do this once a day, once an hour, whatever, in order to find the nearest server, and then use that domain specifically to append to all queries in the future, and you'll know the correct public key to use.

F

So that's all fine and good. But what happens when you actually use this on the internet? Web page load time is the metric we decided to look at here. We used the Alexa top 30 websites and loaded them using both conventional DNS, with the Princeton resolver, and the ODNS resolver that we had set up. The right-hand bar for each is ODNS. You can see that for a few pages we were a decent amount slower: Craigslist, Instagram, Facebook.

F

Mostly, things that had a lot of little objects are what we are a little bit slower on, but overall we're performing pretty closely. There was one that was very odd, live.com. It turns out that's because we were directed to an entirely different CDN: in the ODNS case we had a single giant JavaScript bundle, and the traditional resolver went to a CDN where there are lots and lots of little JavaScript objects to download. So that was a little bit slower.

F

The other ones that we were curious about are Reddit and the New York Times. How could we possibly be faster than conventional DNS in those cases, given that we are introducing this latency? It turns out that if you look at the time to first byte for those sites, it starts to make sense. Essentially, we were directed to CDNs that were closer. It just happened that our ODNS resolver directed us to a more optimal CDN for us than the Princeton resolver did.

F

So this really argues, again, for having this widespread anycast network of ODNS resolvers, because you need to be directed to objects that are near you. Of course, when we're sending these ciphertext queries, we're essentially ruining the caches of existing recursive resolvers, and they're not going to like that.

F

So we wanted to understand what this would look like in terms of traffic. We took a trace of around 8 million queries and simulated users as we turned up and down the percent of users that are using ODNS. You can see that when you have zero percent ODNS users, the cache misses at the recursive are relatively small, and as you increase the percent, that's obviously growing. However, if you implement caching at the stub, which we do, we actually reduce the overall percentage of traffic. We're not reinventing the wheel here.

F

There are some stubs, I think the Windows stub, that actually cache right now, but not every stub does, so just doing that actually helps quite a bit. Even if you're caching for just a single user, you can get quite a benefit. So, overall, we don't think we're introducing too much traffic. We also were worried about undesirable cache entries, so we set our responses with a TTL of zero, which should mean "do not cache this". However, some resolvers out there ignore that value entirely.

F

Some actually treat zero as a special value where they permanently cache it, and so we had to measure this and try to understand it. We did the same set of simulations, where we were varying the percent of ODNS users, and then we varied the size of the cache at the recursive. You can see that if you have a very small cache, something like a thousand entries, ODNS is going to be painful.

F

We're introducing something like 15% of churn from entering these ciphertext queries into the cache, which is not ideal, but in reality recursive resolvers have much larger caches than a thousand entries, and overall it's not that bad. So, just to quickly wrap up: there are other things in the full paper that we talk about. We deal with EDNS0 Client Subnet, which exposes some of your IP address, your client identity, to the ODNS resolver.
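
A rough sketch of the cache-pollution effect described above, under stated assumptions: an LRU cache at the recursive, a synthetic skewed popularity distribution in place of the real 8-million-query trace, and every ODNS query treated as a unique name because it is encrypted under a fresh session key. The function name `miss_rate` and all parameters are illustrative; stub-side caching and TTL handling are not modeled.

```python
import random
from collections import OrderedDict


def miss_rate(odns_fraction, cache_size, n_queries=20000, n_domains=5000, seed=1):
    """Fraction of queries that miss an LRU cache at the recursive resolver."""
    rng = random.Random(seed)
    cache = OrderedDict()
    # Skewed popularity: a few domains receive most of the queries.
    weights = [1.0 / (rank + 1) for rank in range(n_domains)]
    domains = rng.choices(range(n_domains), weights=weights, k=n_queries)
    misses = 0
    for i, domain in enumerate(domains):
        if rng.random() < odns_fraction:
            key = ("odns", i)        # unique ciphertext name, can never hit
        else:
            key = ("plain", domain)
        if key in cache:
            cache.move_to_end(key)   # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses / n_queries


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(fraction, round(miss_rate(fraction, cache_size=1000), 3))
```

Varying `cache_size` alongside `odns_fraction` gives a toy version of the two sweeps mentioned in the talk.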

F

We deal with QNAME lengths, with the very limited space, and the encoding that is used. And then, where do we go with this? We're thinking about policy-based routing, where maybe users might be interested in selecting ODNS resolvers based on location or ISP, or maybe they can choose to say us.odns to avoid certain locations around the world. So with that, I'd be happy to take any questions.

G

It seems there's a potential issue here in that deployment can be subverted. If I were an ISP that wanted to subvert deployment of this, and to be clear, I'm not, I could implement my recursive resolver so that it advertised to transparently forward ODNS queries, which then subverts the whole purpose of not putting the ODNS infrastructure in the position of having to be the third party that the user has transferred their trust to. I think there's maybe a way to address that in the design,

G

If you have space to do it, but you might not have space.

F

Yeah, that's a good question.

F

There is the sort of issue that we essentially need to count on the fact that those two are separated, that the recursive and the ODNS resolvers are separate, which in practice is, yeah.

G

But there's at least one recursive layer that exists exactly.

C

And exactly yeah.

A

All right, if there are no other questions, let's thank the speaker one more time.

A

Just an early right.

A

We's, damn right, I can go with movies.

A

All right, thank you. Is this your adapter? Okay.

A

All right, sorry, everyone, thank you for bearing with us. The last presentation is on some costs of DNS, DNS over TLS, and DNS over HTTPS for the modern web, given by Austin. Please, let's give him a round of applause before he gets going.

H

All right, thank you.

H

Yeah, so DNS privacy has become a significant concern. We know that on-path network observers can spy on and tamper with DNS traffic. This is the DNS that you all know and love, Do53, as I'll be referring to it for the rest of the talk. And so two protocols have been proposed to encrypt DNS traffic. There is DNS over TLS, or DoT, as I'll refer to it for the rest of the talk, defined in RFC 7858, and then there's DNS over HTTPS, or DoH, defined in RFC 8484.
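
As a small illustration of what a DoH request carries, the sketch below builds a DNS query in wire format and encodes it for the GET form defined in RFC 8484 (base64url without padding in a `dns` parameter). The server URL is the placeholder host used in the RFC's own examples, and the helper names are illustrative.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1, txid: int = 0) -> bytes:
    """Build a minimal DNS query in wire format (RFC 1035)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS = IN


def doh_get_url(name: str,
                server: str = "https://dnsserver.example.net/dns-query") -> str:
    """Return the DoH GET URL for an A query, with the transaction ID set to 0."""
    wire = build_query(name, txid=0)
    dns_param = base64.urlsafe_b64encode(wire).decode("ascii").rstrip("=")
    return f"{server}?dns={dns_param}"


print(doh_get_url("www.example.com"))
```

For `www.example.com` this yields the same `dns` parameter as the example request in RFC 8484, which also uses an ID of zero so that identical queries map to identical URLs.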

H

Next slide. So, the contributions of our work: we perform an extensive performance study of Do53, DoT, and DoH, and we give some general insights to optimize DNS performance.

H

So again, we really want to understand how Do53, DoT, and DoH affect the end-user experience, and there are a couple of metrics that we measure in order to give us that understanding. First, we measure query response times. We wanted to reproduce Mozilla's findings. If you don't know, Mozilla did a couple of studies where they measured DoH query response times. They also measured the effect of ECS on user performance, things like that. We wanted to see if we could reproduce the results on DoH query response times. We also measure page load times, and again,

H

this is what we think is really important to users. They want to see if they can load the New York Times and how DoH might affect those page load times. We also wanted to see, if you change network conditions, things like additional latency or loss, how that affects your user experience.

H

Okay, so this is kind of a general overview of our setup, very simplified. The general idea is that we have a client, which is at Princeton University. We perform traffic shaping, so an emulated 4G network, lossy 4G, and 3G, and again, these are not actually connecting to mobile networks.

H

These are just emulated conditions. You're performing queries to recursive resolvers, Princeton's default resolver, Cloudflare, Google, and Quad9, for each unique domain name that is embedded in a web page, for images, links, things like that. Then, in steps 3 and 4, you're actually going to load content from these web pages that you just performed DNS queries for. We use the Tranco top list, which I believe was presented at NDSS; it's basically taking Alexa and other top lists and averaging them over a period of time.

H

So you can think of this as a list of just top websites, and we want to measure, again, things like the New York Times and other stuff. So these are query response times from Cloudflare's resolver on the Princeton network. You can see in the legend at the bottom, in blue, we have the Cloudflare settings, so Cloudflare DoT, Do53, and DoH, and then by default Do53 I'm referring to the university resolver at Princeton, which only supports traditional Do53. So if you go ahead and hit next, you can see some interesting characteristics.

H

One is that for about 50% of queries, you can see that Cloudflare's DoH actually overtakes Cloudflare's DoT and is faster. And if you could hit next again, you can also see, it's kind of hard to see right there, but at the very tail of query response times Cloudflare's DoH overtakes not only its own Do53 implementation but also the university's Do53 resolver, which we thought was very interesting to see. You see similar characteristics with Google's resolver, if you could hit next please: for about 20% of queries, again, DoH seems to outperform DoT, and you can see a similar phenomenon to what you saw with Cloudflare,

H

that DoH is once again outperforming Do53 for the very tail of query response times. And then, lastly, this is Quad9, so you can see at the bottom that this is Quad9 DoT. And if you could hit next, please: for about 90 percent of queries, DoH is outperforming DoT. And then you can see similar behavior once again, as with Cloudflare and Google, that, for some reason, DoH is outperforming Quad9's Do53 resolver.

H

It does not outperform our university's Do53 resolver, but nonetheless you see this kind of similar behavior that, for some reason, DoH is outperforming Do53 for the tail of query response times. So again, the takeaway seems to be that, for some reason, DoH is outperforming Do53 for some percentage of queries.

H

There are multiple reasons we think this might be, one of which might be caching of the DNS wire format. For example, if the recursive resolver already has the answer in its cache, and the transaction ID is some fixed number, as it is in Firefox's implementation of DoH, where I believe it's set to zero, maybe they're caching the wire format of the DNS response, which allows them to send a response back more quickly instead of having to construct a new response

H

each time. And this result, that DoH is outperforming Do53 in the tail of query response times, seems to support Mozilla's findings.

H

Next, we wanted to measure page load times, which again, we believe, reflect the end-user experience. For this talk, we're only going to show page load times for Cloudflare, but if you look at our full paper, which we have the arXiv link for at the bottom, we also show page load times for Quad9 and Google.

H

So, as mentioned before, we also wanted to see how changing network conditions affects the end-user experience, how it affects query response times and page load times, so we performed some traffic shaping to emulate mobile networks. Again, I want to state that these are not actually connecting to real mobile networks. We didn't tether a phone, but we wanted to see, if we could emulate network conditions, how these protocols would perform across different recursors. So for 4G, to emulate that, we added 53.3 milliseconds of additional latency,

H

one millisecond of jitter, and 0.5% loss. Lossy 4G has the same amount of latency and jitter, but now we're doing 1.5% loss. And then, lastly, with 3G: 150 milliseconds of additional latency, eight milliseconds of jitter, and 2.5% loss. This is based on data that was provided in a report by OpenSignal, and again, we talk about why we chose these networks in the full paper. So these are page loads when you use Cloudflare's resolver from Princeton's network. The way you can read

H

these graphs is that this is taking all the page loads that were performed using, for example, Cloudflare DoT on the left graph, and the page loads that were performed using Cloudflare's Do53 implementation. The vertical line is the median, so this is the median difference between DoT and Do53 in terms of page load times. The white background indicates that this difference is within plus or minus 30 milliseconds.
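
One way to read the comparison just described, as a sketch: take the per-page difference in load time between an encrypted transport and Do53, and treat a median within plus or minus 30 milliseconds as indistinguishable. Pairing the loads per page, the function name `classify`, and the sample numbers are assumptions for illustration, not the authors' analysis code or data.

```python
from statistics import median


def classify(diffs_ms, same_ms=30.0):
    """Label the median page-load-time difference (encrypted minus Do53)."""
    m = median(diffs_ms)
    if abs(m) <= same_ms:
        label = "indistinguishable"
    elif m > 0:
        label = "slower than Do53"
    else:
        label = "faster than Do53"
    return m, label


# Made-up differences in milliseconds, one value per page load pair.
print(classify([12.0, -20.0, 25.0, 4.0, -8.0]))
print(classify([120.0, 150.0, 180.0]))
```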

H

So the kind of key takeaway from this slide is that, if you're using DoT or DoH, the difference in page load times on a university network to Cloudflare's recursor is only 30 milliseconds. You can go to the next slide, and the picture starts to change a little. Once you move to an emulated 4G network, you still see on the left plot that DoT is performing within plus or minus 30 milliseconds in page load times compared to Do53. Now, if you look at DoH, this picture starts to change.

H

So the background indicates that the median difference between DoH and Do53 is over 100 milliseconds. The actual median here is about 153 milliseconds. So again, once you go to an emulated 4G network, DoH starts to drop in performance, but DoT stays about the same. And again the picture changes.

H

So now the blue background indicates that DoT is actually performing better than Do53 when you're on a lossy 4G network, which is a pretty surprising result. And now DoH has actually moved back to being within plus or minus 30 milliseconds of Do53. So all these numbers are starting to change around, but the picture seems to be that across these three networks

H

So far, again the university network, the 4G network, and the lossy 4G network, DoT has remained either indistinguishable from or slightly faster than Do53. And now this is an emulated 3G network. Now both DoT and DoH are over a hundred milliseconds slower than Do53.

H

So again, to remind you of the conditions for an emulated 3G network, we added 150 milliseconds of additional latency, 8 milliseconds of jitter, and 2.5% packet loss. So now the conditions have become so dire that both DoH and DoT are significantly slower than Do53 in terms of page load times.
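
For reference, the three emulated conditions can be written down as data. The sketch below also renders them as Linux `tc` netem command lines; the talk does not say which shaping tool was used, so netem, the `eth0` device name, and the 1 ms jitter for the 4G profiles (the recording is unclear at that point) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    delay_ms: float   # additional one-way latency
    jitter_ms: float
    loss_pct: float


CONDITIONS = {
    "4g": Condition(53.3, 1.0, 0.5),        # jitter value assumed
    "lossy-4g": Condition(53.3, 1.0, 1.5),  # jitter value assumed
    "3g": Condition(150.0, 8.0, 2.5),
}


def netem_command(cond: Condition, dev: str = "eth0") -> str:
    """Return a tc/netem command line adding delay, jitter and loss on `dev`."""
    return (
        f"tc qdisc add dev {dev} root netem "
        f"delay {cond.delay_ms:g}ms {cond.jitter_ms:g}ms loss {cond.loss_pct:g}%"
    )


for name, cond in CONDITIONS.items():
    print(name, "->", netem_command(cond))
```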

H

So what does this mean, putting this all together? It seems to be that if you're running DNS over TCP as a transport, this can actually help page load times. We saw that across the university network, the 4G network, and lossy 4G, DNS over TLS performed either indistinguishably, and by that I mean within plus or minus 30 milliseconds in terms of page load times, or actually slightly faster than Do53. We believe this is because TCP packets can be retransmitted after as little as two round trips. So as we're tweaking the loss,

H

more and more DNS packets are getting dropped, and this means that they can be retransmitted faster than with something that's defined by a timeout, perhaps on the order of a couple of seconds for a traditional DNS timeout. So this helps DoT perform well on lossy networks.
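
A back-of-the-envelope model of that argument, assuming each attempt is lost independently, that a lost TCP segment is resent after roughly two round trips, and that a lost UDP query waits for a fixed stub timeout (5 seconds here, the resolv.conf default mentioned later in the Q&A). Using the emulated 3G latency as the round-trip time is a simplification.

```python
def expected_query_time(rtt_s: float, loss: float, retry_wait_s: float) -> float:
    """Expected time to get an answer when attempts are lost independently.

    The number of lost attempts before the first success is geometric,
    with mean loss / (1 - loss); each lost attempt costs `retry_wait_s`.
    """
    if not 0 <= loss < 1:
        raise ValueError("loss must be in [0, 1)")
    return rtt_s + retry_wait_s * loss / (1 - loss)


rtt = 0.150    # seconds, taken from the emulated 3G profile
loss = 0.025   # 2.5% loss
udp = expected_query_time(rtt, loss, retry_wait_s=5.0)      # stub timeout
tcp = expected_query_time(rtt, loss, retry_wait_s=2 * rtt)  # ~2 RTT retransmit
print(f"UDP with timeout: {udp * 1000:.0f} ms, TCP retransmit: {tcp * 1000:.0f} ms")
```

The model ignores connection setup and TLS handshakes, so it only illustrates why a short retransmission delay helps once loss is non-trivial.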

H

Okay, so to conclude, we think that there are several potential improvements for Do53, DoT, and DoH. You could send what we believe are called opportunistic partial responses. So maybe you say to a recursor: here are all the different questions I want the answers for, and then the recursor, as it gets authoritative answers, will send them back to the client. We also believe wire format caching could help, as previously discussed. So in Firefox's DoH implementation,

H

the transaction ID is zero by default, and so what this means is that once an answer is cached on a recursor, it knows what the exact format of the response is going to be. So you could simply cache the entire wire format, instead of having to read the answer from the DNS cache each time; this could instead be an HTTP cache.
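
A sketch of the wire-format caching idea, assuming responses are stored with a transaction ID of zero: a DoH-style query with ID zero can be answered with the stored bytes unchanged, while any other ID needs only the first two header bytes rewritten. The class name `WireCache` is hypothetical, and TTL aging of the cached records is deliberately left out.

```python
import struct


class WireCache:
    """Cache complete DNS response messages keyed by (qname, qtype)."""

    def __init__(self):
        self._store = {}

    def put(self, qname: str, qtype: int, response: bytes) -> None:
        # Assumes `response` was received for a query with transaction ID 0.
        self._store[(qname.lower(), qtype)] = response

    def get(self, qname: str, qtype: int, txid: int = 0):
        cached = self._store.get((qname.lower(), qtype))
        if cached is None:
            return None
        if txid == 0:
            return cached  # fixed ID: serve the stored bytes untouched
        # Otherwise only the 2-byte ID at the start of the header changes.
        return struct.pack("!H", txid) + cached[2:]


cache = WireCache()
cache.put("Example.com", 1, b"\x00\x00\x81\x80" + b"rest-of-message")
assert cache.get("example.com", 1) == b"\x00\x00\x81\x80rest-of-message"
assert cache.get("example.com", 1, txid=0x1234)[:2] == b"\x12\x34"
```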

H

Lastly, we believe that HTTP/2 push for DoH could significantly help performance. HTTP/2 push has been talked about on the mailing list for DoH for quite some time, but this is something that we believe, once implemented widely, could actually significantly help DoH performance, because web servers could push answers to the client for domain names on the web page that they already know the answer for.

H

So in conclusion, we seem to see that DoT performs better than DoH on average, and even sometimes better than Do53, but nonetheless DoH has potential. Again, as I previously discussed, with potential improvements like server push, DoH could actually improve in performance quite significantly. The choice of your recursor and the network conditions matter, and we believe the transport characteristics of TCP should be further explored.

H

Thank you, and you can check out our full preprint at the bottom, with the arXiv link.

I

I'm curious whether you tried multiple UDP retry strategies, because that would really heavily influence the results. I know that the Android resolver has like a five-second UDP retry time, which basically means you're going to lose any race with DoT, right? We did not try multiple strategies, but we think that's definitely something that would

H

fit in the future. Is it documented in the paper which one you did use? So,

I

we used Debian

H

as our default client, so I think the default timeout with that, in resolv.conf, is five seconds, as you said. Okay, thanks.

J

Hi banette from Google Public DNS. So thank you for doing this research. It's good to see all the detailed information here. The one thing, as you mentioned in the beginning, and I'll point it out again, is that you've only done this from the Princeton University campus in the Northeast US, right? Correct.

J

It's a heavily connected, well-connected part of the world, so the results here are not going to be representative for people who are in places where the connectivity is not as good. So that's one important point, and I think the fact that you will share the data will help us get more information from other sources. One of the things which I think keeps coming up here, which

J

I want to point out, is that part of the reason DoH works better is because it has an async API, while DoT and Do53 don't have that, and I'm curious if anyone has an async API for the traditional DNS transport, so we can compare that.

J

So if actually you could include that that would be great, then we'll tease apart the transport differences versus the API differences, yeah.

H

That's something we've actually talked about doing in our next work, trying out different asynchronous APIs to kind of tease that out. So, as you said, that's definitely something we want to do. Let's

L

make sure you heard that: the getdns API is the asynchronous API for traditional DNS transports. Okay, thank you.

G

Erik Nygren, Akamai. Also on the point of only doing this from the perspective of a Princeton vantage point: I think the impact on page load time

G

performance is really not going to be at all representative of the impact on CDN mapping, because going to a non-ECS Cloudflare resolver from Princeton versus using the local Princeton resolver is probably going to get you to the same place for the CDN, just given how well connected it is. Whereas if you're in some part of the world where there's a local on-net CDN cache that may only be accessible to users within that local network,

G

that may have a performance impact on page load time that you're just not going to observe in this case. Thank

M

you. Giovanni, with a clarifying question. You said before that all the queries you made were like random or unique queries; is that what you did in your measurements? What do you mean, random? My question is: is there any caching involved in this process, or

H

No, we're starting with a clean cache each time. Our measurement runs in a Docker image, and each time we start it up we're getting a fresh DNS cache. So which query names are you sending? So each time we're making a page load, and I guess this is something I should have mentioned in the talk, we actually have a separate client for DNS queries: we're using dig for DNS over port 53, for DoT we're using Stubby, and for DoH we're using curl. And the reason why we did that is

H

we noticed some peculiarities in the DNS response times in the HARs that we collected to get page load times. This is something we discuss more in the full paper, but we're starting with a fresh cache each time and using, again, dig, curl, and Stubby in order to make these queries. We're pulling out the domain names that we saw in the page loads and making those queries separately. Yeah.

M

All right, now that makes sense. If you talk to any resolver operator, you're going to see that they have a very high cache hit rate, so maybe it'd be nice to extend your work to also cover cases where there's a cache hit. Interesting. Thank you. Yeah.

N

I guess you partially answered my question already by saying you use curl. But do you include the connection time in your results, like when there is TCP in general, for DNS over TLS and HTTPS? No.

H

That's not in our data. All right, thank you all.

A

Right, let's thank the speaker one last time, as

A

well as all the speakers in this particular session. Now we're going to take a break for lunch, which is in the next room over, and we will see you all when we return at 1:15.