From YouTube: IETF102-MAPRG-20180719-0930
Description
MAPRG meeting session at IETF102
2018/07/19 0930
https://datatracker.ietf.org/meeting/102/proceedings/
A
But given you have probably already found the slides, you might already know what is coming. Before we actually start with our agenda, as we usually do (and we have a lot of nice presentations today), I would like to quickly talk about the IAB review we had at the last IETF meeting. Every research group gets reviewed from time to time by the IAB, to have some kind of feedback mechanism, discuss how to develop the group, and see if it makes sense to continue.
A
We had our very first review with the IAB, and in general it was very positive; they all kind of liked it. But we also discussed how to develop the group in certain directions. What we do right now, because that is the feedback we got from the group at the beginning, is focus very much on measurement results: getting people from academia to present measurement results, so that you can consume this information, talk to the people, and make connections there.

So what we do, in a bit more detail, is that we usually try to solicit contributions from academia and also industry research. We did so by going to different conferences: whenever Dave or I happen to be at a conference, we try to announce that MAPRG exists and that people should send contributions. We also had some lightning talks which were explicitly on the agenda, and these kinds of things.

The way we select the contributions is mainly driven by the charter. The first check is always: is the contribution in charter? Then we also prefer presentations that provide data over presentations that only talk about methodology or something, because that is the feedback I think we got from this group. In general we try to fit as many contributions into the agenda as possible, but it is also sometimes a logistics question.

Usually we prefer an in-person presentation over a remote presentation, just so that you also have a chance to interact with the speaker after the session or during the week. Or sometimes it is just that a person can only present at a European meeting, or only at a US meeting, so we have to figure out if we do it this time or next time or whatever.

What we really try to avoid is having any presentations here that are presented in other working groups. That can still be interesting, but usually we have enough nice presentations, so we think it would be unfair to give time to somebody who already has a slot somewhere else. We also really prefer presentations which are new or cannot be discovered somewhere else: if there is a video of your presentation and we don't have time for you, then we might decide for a different presentation.

That is also what we presented to the IAB, and what came out of this meeting was a couple of ideas for what we could maybe change or improve. I want to go through them quickly; it is all written on the slides, and maybe you have ideas, or we can also discuss later, but I will give you an overview. So there was one idea to, instead of having one...

Does everybody who has a presentation here that is relevant to their own work know that the presentation happens, and do they actually come? We had a feeling that MAPRG is reasonably well known, but of course we could do more. We can send more emails to mailing lists; we could send emails to the chairs' lists, but those chairs can't always judge if it is relevant for their working group or not. So yeah, any input is welcome here.
C
Is it okay to talk about this now? Yeah.
D
Aaron Falk: One thing I recently did in TAPS was a reset on the conflict list. So what you might do is send a message out to the list asking for people who wanted to be here but couldn't because of a conflict, and then you can maybe get some new entries for your list.
E
...the submission cutoff. But you guys know what research you have done and what results you have got, potentially a long time before two weeks before the IETF meeting, so you might be able to give Mirja more of a heads-up, so that she does not have to struggle so hard with the conflict list.
A
Spencer, actually this time we really knew which presentations were in by the time the draft agenda deadline was. Maybe we started too late to ask for contributions, but people are also deadline-driven, right? Whenever you put up a deadline, you get all the contributions at the deadline; and then even after the deadline we were still going to people, trying to talk to them and figure out what they want to present, and there was a lot of back and forth.
H
Colin Perkins: I don't think this has been a problem, but it might be something we should look at going forward. We are getting increasing amounts of measurement research coming in to the IETF, with things like the ANRW and the MAPRG. We should make sure that there is a clear story for where this research goes and where it is presented; we don't want to be having the same talk several times.
A
Sure, I mean, definitely. We actually had this case: somebody who presented at the workshop on Monday had also submitted a contribution to us a while ago, and we knew that this was coming up, and so we didn't have that presentation. That is something we definitely try to avoid. How to redirect people to the right group is actually more difficult, because not everybody at the workshop is interested in the rest of the IETF and these kinds of things. So we also, I mean, it's also...

We don't know what that means yet: whether we just say, if you want to do something on measurement, come to this table; or whether we actually think about a very specific project, where we say we want to hack on this measurement, or we want to run this kind of small measurement study today and see if people are interested. So if you are interested in contributing to that, or if you have ideas for what you would like to do at the hackathon, please talk to us.
A
Another thing that was discussed, and it was discussed in this group at the very beginning, is how we actually get our hands on the data. How can we make sure that all the data that is presented is available for further research and analysis, and how can we support this kind of data exchange as this group?

Of course we could also put in a little bit more effort and try to figure out, in the wiki for example, where this data is available and which data it is; or let everybody who presents fill out a template and tell us about the data: what kind of data you have, what format, how much data, and where to find it, these kinds of things. Would there be interest in this kind of effort?
A
There is also the RIPE MAT working group, which is likewise focused on measurements, so we could try to work together with them more closely. But I have to say I'm actually not a hundred percent certain about that point, because first of all our charter says we are targeted at IETF meetings, and second, given we have this mode where we mostly present data, I don't think there is something where you can do a lot of common work, because it is just a forum for discussion. Brian, who is the MAT chair...
I
Brian Trammell, RIPE MAT co-chair: Yeah, I think that's probably the right way to do it. I mean, we talk occasionally, you and I, and the meeting cycles don't line up in such a way that there is much conflict. So yeah, if there is stuff here that I think should also go to RIPE, I will approach people, and vice versa. A lot of the stuff that's here, I mean, they have slightly different focuses, right? The RIPE MAT stuff is meant to be a little bit more operationally relevant, but also includes things that are, you know, even just interesting. So one of the reasons I'm here, and I cover this, is looking for interesting things to encourage to come to MAT as well.
A
So this one is easy, because you are sitting in the same office. But maybe there are other groups that are focused on measurements that I'm not aware of; if you are involved in or know such groups, let me know, and we can see if it makes sense to cooperate. And then a very last point was that some of these presentations might be interesting for an IETF blog post.

In general we would like to see more posts on the IETF blog, and I already did this and talked to some of the people who presented last time, and they will write a blog post. So if you are presenting, or you have some data that you think is interesting, and you would like to write a blog post, just come to me.

Now you are all waiting for the presentations. Okay, that's the agenda for today. We have two heads-up talks, and both are more focused on tools and methodologies than on measurement data; that's why they are heads-up talks, because, as I said earlier, we are trying to focus more on the actual data. And then we have six talks all over the place: privacy, IPv6, DNS, packet sizes, the typical stuff we are interested in. Let's start. Johnny, where are you?
J
One situation we face at SIDN: if you want to measure a lot of properties associated with a domain name, let's say wikipedia.org, if you want to know stuff like where the DNS servers are located, information about SMTP and HTTP, where it is hosted, it is very hard to do that with current tools. There is a bunch of open datasets, but currently there is no tool that does that automatically for you. So if you want to measure all these different protocols associated with a domain name, you end up with something like this.

You have to use nmap, dig, ZMap, and a bunch of other tools. You can do it, but it ain't pretty, and that's a problem, because you end up wasting a lot of time doing repetitive tasks. You have to deal with very different data formats, wasting a lot of time; and more complexity, as we know, means more errors. It is very hard, if you are an academic reviewer, to review papers and try to reproduce the studies, because you are going to waste a lot of time just doing the measurements.
J
So
what
we
decided
to
do
is
to
build
a
new
tool.
It's
called
demamp
domain
name,
echo
system
mapper
and
what
it
does
is
just
automate
the
measurements
of
five
protocols
and
also
take
a
screenshot
of
a
domain
name
over
websites
of
exists
and
what
it
the
way
works
is
different
from
say,
map
Zimmy,
usually
you
have
a
list
of
IP
addresses
already
just
choices
for
you,
but
this
one,
you
have
to
provide
a
CSV
with
a
list
of
domain
names
can
be
as
long
as
we
use
for
data.
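A minimal sketch of the kind of per-domain probing dmap automates, assuming a domains.csv input with one domain per line (the file name, the protocol subset, and the output format here are illustrative guesses, not dmap's actual implementation):

```python
# Sketch only: probe a couple of protocols per domain from a CSV,
# in the spirit of dmap (not its actual code or schema).
import csv
import json
import socket
import ssl

def probe(domain):
    result = {"domain": domain}
    # DNS: resolve A records (dmap also covers SMTP, HTTPS, etc.).
    try:
        result["a_records"] = sorted({ai[4][0] for ai in
                                      socket.getaddrinfo(domain, 80, socket.AF_INET)})
    except socket.gaierror:
        result["a_records"] = []
        return result
    # TLS: fetch the certificate issuer on port 443, if HTTPS is offered.
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((domain, 443), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=domain) as tls:
                issuer = dict(x[0] for x in tls.getpeercert()["issuer"])
                result["tls_issuer"] = issuer.get("organizationName")
    except (OSError, ssl.SSLError):
        result["tls_issuer"] = None
    return result

with open("domains.csv") as f:       # hypothetical input file
    for row in csv.reader(f):
        print(json.dumps(probe(row[0])))
```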
J
Now it's almost six million domain names, and on a single machine we do one million domains a day; but this application is also distributed, so you can scale up very quickly and easily. This is the website; there is a bunch of stuff there, the paper, applications, demos, whatever, so go and have a look. Oh yeah, the good thing about this tool is that later, once you have done all the measurements, it is very easy for a researcher to just analyze the results using SQL.
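For instance, with the probe output loaded into a relational database, the "one in five use Let's Encrypt" style of finding mentioned below becomes a one-line query. A sketch with sqlite3 and a hypothetical results table (dmap's real schema differs):

```python
# Sketch only: 'results' and its columns are hypothetical stand-ins
# for dmap's measurement output, loaded into SQLite for analysis.
import sqlite3

con = sqlite3.connect("dmap_results.db")   # hypothetical database file
share = con.execute("""
    SELECT 100.0 * SUM(tls_issuer LIKE '%Let''s Encrypt%') / COUNT(*)
    FROM results
    WHERE a_records != '[]'
""").fetchone()[0]
print(f"Let's Encrypt share among resolving domains: {share:.1f}%")
```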
J
We have a demo dataset on the website, for the Alexa 1 million; the queries are there, and you can download and analyze it. I just want to show one thing: this is a table that shows the various properties that we have measured using Alexa, and one thing we found is that currently, in the Alexa 1 million, seventy percent of all the domains support HTTPS, and one in five are using Let's Encrypt. There are many more findings in the paper; I would just like to encourage you to download it.

One thing: we make it open source only for researchers, because this has some potential for commercial application, and we just don't want to incentivize that. So just click on it, you have to register, and we are going to give you access to the repository on GitHub. That's it, thank you very much.
K
So I'm going to give you a quick overview of our real-time DNS monitoring solution; this was presented at the LACNIC meeting as well. DNS is currently monitored in two different ways. One is doing periodic aggregation: statistics are aggregated on the NS servers and then sent to a central server.

This method is used, first of all, by DSC, which is maintained by DNS-OARC, and we have DNS data collected this way for several of the root servers. The second method is doing raw data capture and storage, basically storing the raw packets as they arrive. This is used, for example, by ENTRADA, which uses a Hadoop cluster to process and store all the information. The problem with these methods is that they don't give you much information about the current status of the DNS servers, so we started looking at how to do this in real time.
K
For this we first tried to develop our own stack: we developed a prototype that did capture, storage, and presentation of DNS data. But we realized we were reinventing the wheel; many of the things we were developing were already done by someone else, and also, when we presented this at a DNS meeting, it wasn't all that well received, so we started looking at existing solutions.

We thought that many of the problems we were tackling were already solved by products in production, and we started analyzing different software for the capture, the storage, and the visualization. For the capture we compared different software, like packetbeat, collectd, and DSC, and we looked at things like whether they are maintained, how they transfer the data, and which network protocols they support.
K
Finally we ended up with an architecture like this, where we have the DNS servers and we capture the information with our own solution, dnszeppelin; the information is sent to a ClickHouse cluster, and finally it is presented in Grafana dashboards. You get something like this, where you have, for example, the top queried domains analyzed directly in real time, and we also have the unique domain names that are being queried.
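To illustrate the last hop of that pipeline: ClickHouse exposes an HTTP interface (port 8123) that takes SQL in the request, so pushing per-minute aggregates for Grafana to chart can be as small as the sketch below. The host and table are made up, and the real dnszeppelin schema is richer:

```python
# Sketch only: insert aggregated DNS query counts into ClickHouse
# over its HTTP interface; host and table names are hypothetical.
import time
import urllib.parse
import urllib.request

CLICKHOUSE = "http://clickhouse.example.net:8123/"   # hypothetical host

def insert_counts(counts):
    """counts: dict mapping qname -> queries seen in the last minute."""
    now = int(time.time())
    rows = "\n".join(f"{now}\t{q}\t{n}" for q, n in counts.items())
    sql = "INSERT INTO dns.query_counts (ts, qname, queries) FORMAT TabSeparated"
    req = urllib.request.Request(CLICKHOUSE + "?query=" + urllib.parse.quote(sql),
                                 data=rows.encode())
    urllib.request.urlopen(req).read()

insert_counts({"example.cl": 1234, "nic.cl": 567})
```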
K
Some of the performance: on a single-server setup we tested with real data from NIC Chile, and we found that we could process thousands of packets per second, that the data used 30 to 40 gigabytes, and that we could keep adding around 40 gigabytes every day. So it is entirely possible to keep this running and store a year of data or more.

We also tried to flood the server in an experiment, and we found that we can handle around 20,000 packets per second on one single server and never go above 30% of the CPU; and this was a two-core computer, so it can scale a lot more. So that's it: you can monitor DNS in real time with a single server. Thanks.
L
John: Sorry, maybe I missed where you were capturing packets from. From recursives, from authoritatives, from both?
K
[answer inaudible]
N
So I have some recent data on UDP packet reordering that we are seeing in QUIC. Obviously I can't say whether this is representative of what someone else might see, and I can't say it's representative of TCP, but at least for this data set it's fairly representative. It was in Chrome stable, so we are talking fairly large sample sizes, millions of users and many, many servers, and we have some data from the server side and some from the client side.

The core metric is exactly the same and the measurement code is exactly the same, but it turns out the data does end up displaying a little bit differently, and I apologize for that; the difference makes it a little bit more difficult to interpret, but we'll walk through it anyway. Also, on this slide: we are using BBR congestion control from server to client, but we are using Cubic from client to server, so that may actually affect the measurement data and potentially change the likelihood of reordering.
N
Also,
the
client-side
data,
which
is
so
client,
is
the
client
receiving
and
it's
measuring
on
the
receive
side
and
the
server
is
sending
so
those
flows
tend
to
be
longer
their
CDN
flows
in
this
particular
case,
like
large
video
playback
things
like
that,
the
server's
idea
that
tends
to
be
you
know,
obviously
less
data
intensive
with
a
few
exceptions,
so
there's
an
arrant
asymmetry,
at
least
in
this,
but
it
kind
of
reflects
typical
web
traffic.
So
hopefully
it's
informative.
N
So
the
first
fact
is
just
how
many
percent
of
what
percent
of
connections
have
at
least
one
reordered
packet
on
the
client
side?
It's
only
5.4
percent
on
the
server.
It's
nine
point,
four
percent
I
can't
really
explain
why
there
would
be
more
reordering
going
upstream
and
downstream.
Personally,
maybe
someone
else
can
could
be
Wi-Fi
is
we
can
blame
everything
on
Wi-Fi.
N
So
this
is
on
the
client
side,
so
on
chrome
chrome
is
receiving
packets
and
this
is
in
packet
number
space,
and
this
is
actually
pretty
brutal
and
depressing
so
I
had
to
put
the
bottom
scale
at
a
log
scale,
because
there's
a
still
like
a
decent
amount
of
energy.
Around
100
and
and
I
thought
I
mean
I.
N
Think
we
don't
actually
get
to
like
point
under
0.1%
until
like
a
few
hundred,
so
things
are
pretty
bad
in
a
packet
number
space
and
in
particular,
as
you
can
see,
10
is
like
a
few
percent
right
there,
so
the
default
reading
threshold
of
3
is
kind
of
woke,
fully
and
sufficient
for
something
like
half
the
time
when
reordering
actually
does
happen.
So
it's
sort
of
you
know
you
pick
a
number
and
you
go
with
it,
but
it's
it's
sort
of
a
weird
number
things.
Look
a
little
bit
tomi.
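For concreteness, here is the metric as I understand it from the plot: reordering distance in packet number space is how far behind the highest packet number seen so far a late packet arrives, and the threshold-3 rule declares loss at any distance greater than 3. A generic sketch, not Chrome's measurement code:

```python
# Sketch only: reordering distance in packet number space over a
# stream of received packet numbers (not Chrome's actual code).
def reordering_distances(received_packet_numbers):
    """Yield, for each late packet, how far behind the highest
    packet number seen so far it arrived."""
    highest = -1
    for pn in received_packet_numbers:
        if pn < highest:
            yield highest - pn          # arrived out of order
        else:
            highest = pn

# Packet 2 arriving after 5 gives distance 3: with the classic
# reordering threshold of 3 it just escapes being marked lost.
print(list(reordering_distances([1, 3, 4, 5, 2, 6])))   # [3]
```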
O
[inaudible question]
N
This is only on the client side; well, okay, I'll get to the server side later. I'm going to do all the client-side data and then all the server-side data, and I can't remember if I have this metric on the server side or not. Yes: on the client side we also happen to record it as a fraction of min RTT.
N
That is just kind of an indication of what it would look like if I were to do it in the time domain, and here it looks quite a bit nicer. You see there is almost no energy after 25%, except for that blip at the very end; this is running in user space, so I suspect that's Chrome basically hanging for some long period of time, thread jank or something like that. So this looks quite nice: there is very little energy past 12.5%, so we are doing much better. I did have to filter to min RTTs greater than a hundred milliseconds to get this data to be sensible. It turns out that there are some clients that measure sub-millisecond RTTs, and obviously, if you have a sub-millisecond RTT, your fraction of min RTT just goes crazy.
N
It'd
be
interesting
to
get
data
for
like
greater
than
10
milliseconds
as
well,
which
kind
of
is
a
more
sensible
thing.
So
the
fact
that
it
was
so
crazy,
though,
is
probably
motivation
that
we
need
a
min
like
if
you're
gonna
start
using
time
based
loss
detection.
Like
probably
you
need
something.
That's
on
the
order
of
like
your
clock,
granularity
or
a
timer
granularity
like
1,
millisecond,
so
amount
of
minimum
threshold,
just
to
kind
of
ensure
some
basic
sanity
in
the
network
to.
D
Do
yeah
Aaron
Falk
I
I'm,
finding
it
hard
to
get
any
intuition
around
this
without
having
some
understanding
of
how
distributed
this
is
over
flows
like
this?
Do
you
have
like
one
flow
that
has
a
lot
of
problems,
or
is
it
kind
of
spread
out
and
same.
N
This is an effort at saying: if I had an adaptive kind of time-based loss detection that ratcheted up the time threshold to whatever was necessary to never spuriously mark a packet, how much of a time reordering window would I need to make that happen? That's my internal logic, yeah.
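A sketch of that ratchet logic under my reading of the description (a generic illustration, not QUIC's loss-detection code): track the worst late-arrival delay relative to the highest-numbered packet seen, and report it as a fraction of min RTT; that is the window that would have avoided every spurious loss.

```python
# Sketch only: the time-based reordering window that would have
# avoided spuriously marking any packet lost.
def required_reordering_window(arrivals, min_rtt):
    """arrivals: (packet_number, arrival_time) pairs in arrival order.
    Returns the needed window as a fraction of min RTT."""
    worst = 0.0
    highest_pn, highest_t = -1, 0.0
    for pn, t in arrivals:
        if pn < highest_pn:
            # Late packet: the window must cover how long after the
            # highest-numbered packet it finally showed up.
            worst = max(worst, t - highest_t)
        else:
            highest_pn, highest_t = pn, t
    return worst / min_rtt

print(required_reordering_window([(1, 0.0), (3, 0.01), (2, 0.04)], 0.1))  # 0.3
```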
D
I guess it's just, like the PDFs that you were showing before: when you are showing numbers of events that you are measuring, is it confounded? Does a single flow have multiple events, or a single path multiple events? I'm just trying to understand how pervasive this is.
N
[answer partly inaudible] So over 40% of reordered packets were just one packet off, "jitter" as it is sometimes referred to, and then you get a fair amount of energy around two and three, and then things go off the chart on the top end from there. So it seems that on the server side the maximum distance in packet number space is smaller; I think that's mostly a product of the fact that we usually have a smaller congestion window going from client to server than from server to client.
N
The vast majority of connections are seen over here, but the tail is very, very long in packet number space; it's depressingly long. QUIC runs in user space, so small amounts of network reordering may occasionally get amplified into much larger amounts of apparent reordering due to things like thread jank, which is particularly true on the client side; so TCP hopefully might actually see a little bit less reordering, if things are going well.
I
Brian Trammell: Anecdotally, we would also blame Wi-Fi. Where we have done stuff in the lab, looking at different connection things, it's always the dodgy little wireless router thing; they suck if you come in over the wired interface, but they suck a lot more on wireless.
Q
Gorry Fairhurst: Thank you, it's fun to have real reordering information. I wonder whether this is partly a function of what you measured, and the fact that TCP may be seeing less reordering; maybe it's a feature of your congestion controller actually pushing hard for a little while. So I wonder whether, as a community, we could gather more data. Do you think it would be useful to look at other transports over these links and see what we can actually derive from this?
N
[answer inaudible]
I
Good morning, everyone. Hi, I'm Brian Trammell. This is work I actually did with Mirja, but she's chairing, so I'll present. We asked a very simple question and we have a very simple answer: is buffer bloat a privacy issue? Actually, I want to see: is buffer bloat a privacy issue, yes or no? Hands up for yes... hands up for no... Oh yes, buffer bloat has a potential privacy impact. Okay, that's interesting, but there is a big asterisk on this "yes": everything is a privacy issue, right?

So: if you have significant buffering on a link, which is buffer bloat; if you have a public IP address that is associated only with that link; and if that public IP address responds to an ICMP echo request, and the echo request and reply traverse the buffered queue, then I can ping you and figure out how big your queue is. This was very surprising to me, so I decided to come and talk about it. For the networks that we examined, and this is, I would almost call it, anecdata...

This is an extremely biased study of people that I could get to click on a link in a tweet, but for one in seven of the networks these conditions hold. So the advice we have been giving in the transport area for a long time is "fix buffer bloat", but now we can say: no, seriously, fix buffer bloat, because this is a problem. How did we get here? This is sort of a recap.
I
I didn't actually set out to try and answer this question; I tried to answer a completely different question. This is the QUIC portion of the morning, I guess: this was a question that went to the QUIC RTT design team in the spin bit discussions. Is RTT data privacy-sensitive? Is passive RTT data privacy-sensitive? And the idea is, yeah: if I can see your RTT, I can know where you are, not what you are doing, via essentially very simple trilateration.

You basically take the radius equal to the time of each of these pings multiplied by the speed of light in the internet, and you do some very basic math; and actually it turns out that the data is usually fuzzy enough that when you do this very basic math you divide by zero somewhere, and if you are not dealing with complex numbers, bad things happen.
I
Usually this is done in a more approximate way, right? These are the RTTs by color, so green is faster, to a particular anchor in the RIPE Atlas network; if you just look at the colors you would guess it's probably there, in Europe, and you would be right: yes, it is indeed in Europe.

When we looked at this, we actually found that internet RTT is a sum of delays at each hop, and a lot of these are variable. You can only derive distance when the queuing, stack, and application delays are all zero, which basically never happens. The network operations rule of thumb, that one millisecond of RTT is a little bit less than 100 kilometers of distance, holds.
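As a worked equation (my rendering of the argument, not a formula from the slides): only the propagation share of an RTT carries location information, so each anchor's measurement gives an upper bound on distance, not a position.

```latex
\mathrm{RTT}_i \;=\; \frac{2\,d_i}{c_{\text{internet}}} \;+\; q_i \;+\; s_i
\quad\Longrightarrow\quad
d_i \;\le\; \frac{\mathrm{RTT}_i}{2}\,c_{\text{internet}}
\;\approx\; \mathrm{RTT}_i\,[\mathrm{ms}] \times 100\ \mathrm{km}
```

Here $d_i$ is the distance to anchor $i$, $q_i$ the variable queuing delay, and $s_i$ the stack and application delay; the approximation uses $c_{\text{internet}} \approx 200$ km/ms, i.e. the one-millisecond-per-hundred-kilometers rule of thumb above. Equality would require $q_i = s_i = 0$, which basically never happens.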
I
So, if you know the IP addresses, then trying to do geolocation by exclusion based on RTT data is somewhat more erroneous than even the cheapest, lowest-quality IP geolocation database. If your RTTs are over ten milliseconds, and RTTs under that don't happen in the internet all that often, then you are getting at best exclusion information at the national level. So this isn't a problem; but we were concerned in general about the privacy implications of passive observation of RTT.

It turns out to be not that scary, because of all of this variation. But we flipped the question around: does active observation of RTT pose a problem? It's one of these things where we were actually just sort of cleaning up; we are writing a paper on this, and we were looking at all the loose ends, and thought, we are actually going to go ask this question.
I
So let's actually go ask that question. We asked the question about remote load telemetry almost as an afterthought: can a remote entity, armed only with ping, get information about the operation of machines on my network? So here is me, and here is the internet; and if we look at the equations from before, about what these components of RTT do, then the load on the network shows up as the sum of the queuing delay in one direction plus the sum of the queuing delay in the other direction.
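Again as an equation (my reconstruction of the reasoning, not the slide's notation): over a short ping train the propagation and stack terms are roughly constant, so the variation of the observed RTT around its minimum isolates the queue occupancy in the two directions:

```latex
\mathrm{RTT}(t) \;-\; \min_{t'}\,\mathrm{RTT}(t') \;\approx\; q_{\uparrow}(t) \;+\; q_{\downarrow}(t)
```

which is why a remote pinger can read the rise and fall of load on a bloated access link.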
I
If
I
have
a
cheap,
router
and
somebody's
going
to
ping
me
from
very
far
away,
can
I
actually
get
information,
so
what
I
did
is
my
set
stuff
up
on
my
cheap
router
and
I
started
downloading
my
kernel
from
somewhere
and
then
I
pinged
myself
from
Singapore
in
Amsterdam
and
I
got
this
that
trough
Topeka.
So
that's
like
basically
zero.
It
turns
out
I'm
in
Zurich,
which
is
not
that
far
from
Amsterdam.
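A minimal sketch of that experiment, assuming a Unix ping binary and a placeholder target address (the study's actual probing, 10 pings per second while a download loads the link, is only approximated here):

```python
# Sketch only: sample RTTs to a target and report the trough-to-peak
# spread, a rough proxy for the depth of a bloated queue.
import re
import subprocess

def rtt_samples(target, count=50, interval=0.2):
    """Ping `target` and return the observed RTTs in milliseconds."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", str(interval), target],
        capture_output=True, text=True, check=True).stdout
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", out)]

samples = rtt_samples("198.51.100.7")    # placeholder address
print(f"min {min(samples):.1f} ms, max {max(samples):.1f} ms, "
      f"trough-to-peak {max(samples) - min(samples):.1f} ms")
```

Run it once with the target's link idle and once while it is loaded; on a bloated link the second trough-to-peak figure is dramatically larger.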
I
And, you know, boom: I see this peak of 800 milliseconds, and I'm like, wow. Okay, I'm an internet measurement researcher; I probably should have noticed before that I have a second of buffer bloat in my own network, and I hadn't. I was like, wow, that's scary. Then I went down here and actually started using rate limiters to limit the rate to something other than full rate, so as not to fill up the entire pipe, and I was even able to see some variance when I was pulling down 300 kilobytes a second on a 40-megabit link. So even if I'm only at 10%, that signal looks different from that signal, which looks different from that one: you can actually estimate the rate across there. So my connection sucks. Good. How widespread is this phenomenon? We stood up a piece of software; if you go and click on that link right now, you are probably going to cause it to fall over, because I've never actually had a room this big clicking on it, so feel free.
I
It actually kind of does work. I'll warn you, the JavaScript was me learning JavaScript over Christmas vacation, so you might get into a situation where you give me data and I don't show you the graph; if it fails and says that I can't ping you, send the link back to me and I can give you your data, if you are interested. The way this works is that I have a ping server somewhere, and the client-side JavaScript sends a ping request.

The ping server starts pinging; then, after a delay, I start downloading, and actually I think I download the Hubble Deep Field from a CDN somewhere, and I keep pinging. The ping server itself keeps the ping information. It will only actually ping the public IP address the request comes from, because you don't want to turn this into, essentially, a botnet. I left this up and kept it running, and updated the data from just before.
I
33% of these networks always block ICMP, and, you know, 7/8 of the definitely-mobile networks. We classified these by autonomous system number to see: are you an access network, are you a transit network, or a mobile access network? You see a lot more ICMP blocking in mobile access networks, because there is generally some sort of NAT somewhere along the path.

On 33% of these networks there is no indication of load-dependent RTT, which means the last-mile segment is either not what you are hitting (you are hitting a thing that is not behind the congested queue), or there is a queue in the way, or the queues are tuned in such a way that you don't actually have buffer bloat. And we are at remote load telemetry which might work on about 14% of these networks. So you know what this looks like: we start pinging here, we stop pinging there.
I
This is one Swiss access network, and you see here that even though you have a public IP that is pingable, at that point in time you see no variance. This is what happens when you do it on my network; because of how the pinging works, we don't fill the queue all the way to the top.

It's not maximum load: I'm only seeing 300 milliseconds of variance there, as opposed to 800 milliseconds, and this is a very easy signal to see. So, coming out of that, recommendations for protocol design. Remote load telemetry: anyone who can ping you from anywhere in the internet can measure network activity. I will leave why this might be a thing that you don't want as an exercise to you, the audience, and you can take away sort of two bits of advice here.
I
That will fix it, great; and there is, at least anecdotally, one of those lines where everything is good, and that's how they fixed it, right, because I know the guy who built that network. Bad advice: you could also just roll out CGN everywhere and block ICMP. That would also fix this, and other forces are causing that to happen as well. So with that I'm done, and we'll take questions. How much time do I have?

The pattern of "I'm doing a bulk download" looks somewhat different from the pattern of "I'm doing adaptive-bitrate streaming", so there is a difference between web and "Netflix and chill" on this graph. Now, in this case, this was ten pings a second; if I had a competent network in front of me, they would probably actually report that as abuse. But I also have 800 milliseconds of buffering on my cable modem, so I'm not sure I have a competent network in front of me.
D
[inaudible question about using something other than ICMP]
I
I could also do that. The reason I did ICMP is that I was using ICMP in the rest of the study, and I wanted to compare ICMP to ICMP. If ICMP is blocked, you can use TCP SYN and RST: you send a SYN, and you get a reset back if there is no server there, right? The TCP SYN/RST pair gives you the same signal. I mean, really, the short answer is I was lazy.
J
Giovane, SIDN: Yeah, thanks for the presentation. Coming back to the ping thing: the good thing about ping is that it's less blocked than other applications, and less invasive, so there is the good side of it. The bad side is that many networks treat pings differently; they put them in a different queue, for example.
I
Right, so that assumption doesn't always hold. Actually, I'm going to take Stefan's idea and hack up the tooling to be able to do this over TCP at the hackathon as well, because that should go through the same queue in ways that ICMP won't; and if I rerun that on the networks that I already have information on, then I could see if I can get anything out of that "no indication of load-dependent RTT" space. Thanks, that's good advice.
T
Wes Hardaker: With respect to the privacy problem: at the NDSS DNS privacy workshop this year I showed that, just by studying DNS packets leaving my house, I could determine things like sleep-wake cycles and great stuff like that. So this sort of augments that even further, because, you know, certainly we all stream Netflix starting at 8:00 p.m. at night, and that's when we use most of our bandwidth; and one will notice when you are not at home, and when you are at work, and those things do become important.
I
[exchange inaudible]
W
Thank you, Mirja. Hi everyone, I'm Quirin Scheitle, a PhD candidate at the Technical University of Munich, and I'll talk about a quite specific and quite technical topic today, which is de-aliasing IPv6 hit lists; I'll briefly get into what precisely that even means. It's based on two papers that I'm listing for completeness, or if you want background, go to these and have a look; and of course, as everything academic, it's joint work, with Paweł, Stephen, Luuk, Georg, and others.

So, I do a lot of security scanning on the internet. In IPv4 that has become fairly easy recently, because you can simply ping all the addresses; in IPv6 that doesn't work, the address space is just too large, so we are back to using hit lists, which is what we did for IPv4 ten years ago. To have this hit list, there are basically two approaches.
W
One
is
first,
you
need
to
collect
addresses
from
whatever
source
you
can
get
DNS
passive
observations
whatever
and
there's
also
some
papers
that
then
use
these
lists
to
generate
more
addresses
like
learn
the
structure
of
IPS
in
those
lists
and
generate
new
IPs
and
see
if
they
respond
and
there's
plenty
of
related
work
going
on
so
I'm.
Really
not
the
only
one
doing
something
in
that
space
and
the
question
is
truckers-
is:
are
these
lists
biased?
W
We
call
this
aliases
because
it's
one
IP
address
and
or
many
IP
addresses,
and
just
one
host
for
which
you
have
many
aliased
addresses,
so
you
can
call
it
an
alias
prefix
of
all
the
IPS
in
that
prefix
belong
to
the
same
host,
so
just
a
brief
intro.
So
this
is
what
our
current
ipv6
hit
list
looks
like
it's
also
published.
W
So
if
you
want
I
P
addresses
to
here
there,
it
consists
of
many
components
like
domain
lists
in
DNS
domain
list
from
certificate
transparency,
but
also,
for
example,
running
trace
routes
to
all
the
IPS
and
finding
router
IPS
in
between.
So
let's
just
a
brief
overview,
we
have
around
50
to
60
million
of
IP
addresses.
The
question
is
how
many
of
these
are
real
and
not
just
aliases,
so
the
state
of
the
art
in
detecting
aliases
is
basically
saying
I'll.
W
W
That
typically
only
require
a
subset
of
IPS
to
actually
reply,
so,
let's
say
3
out
of
8
or
something
it's
also
typically
done
at
a
specific
prefix
length
like
slash,
69
or
whatever,
and
the
issue
I
had
with
that
is
that
if
you
use
random
addresses
your
targets
may
actually
cluster.
So
a
random
process
typically
doesn't
give
you
nice
distributions,
but
you
can
have
clusters
which
you
wanted
to
avoid.
W
If
you
use
fixed
addresses,
you
get
a
nice
print,
but
people
may
predict
those
addresses
and
also
they
may
be
in
use,
because
if
they
stand
out
to,
you
may
also
stand
out
to
others.
So
what
we
said,
let's
do
is
we
said:
let's
combine
these
two
approaches,
so
basically
we
enumerate
all
the
combinations
for
the
next
nibble
and
then
add
a
random
portion.
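A sketch of that probe generation as I read it (a generic illustration, not the paper's code): for a candidate prefix, fix each of the 16 values of the nibble just below the prefix and randomize all remaining bits.

```python
# Sketch only: 16 probe targets for a candidate alias prefix, one
# per value of the next nibble, with a random tail after it.
import ipaddress
import random

def probe_targets(prefix):
    """prefix: an ipaddress.IPv6Network such as 2001:db8::/64."""
    base = int(prefix.network_address)
    host_bits = 128 - prefix.prefixlen
    shift = host_bits - 4                 # position of the next nibble
    targets = []
    for nibble in range(16):
        tail = random.getrandbits(shift) if shift else 0
        targets.append(ipaddress.IPv6Address(base | (nibble << shift) | tail))
    return targets

for t in probe_targets(ipaddress.ip_network("2001:db8::/64"))[:4]:
    print(t)
```

If all 16 well-spread but unpredictable targets answer, the prefix is very unlikely to contain 16 distinct responsive hosts by chance, which is the aliasing signal.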
W
So
for
each
prefix,
where
we
say
this
might
be
an
alias
prefix,
we
send
16
probes
and
they
are
reasonably
well
des
distributed,
but
not
predictable,
and
we
do
that
at
many
level
and
we
also
probe
ICMP
and
key
CPA,
so
some
parameters
that
we
used
in
this.
When
do
we
suspect
something
alias
its.
If
we
find
more
than
100
IP
addresses
in
some
level
of
the
prefix
tree,
then
we
say:
let's
run
this
alias
detection
on
this
prefix.
W
Then
of
course
we
say
we
said
we
want
all
the
IPS
to
respond,
but
of
course,
there's
packet
loss.
So
what
we
did
is
we
said
if
an
IP
replies
on
either
ICMP
or
TCP
will
accept
it
and
also
be
have
a
sliding
window
of
passed
measurements
where
we
also
accept
replies
and
that
works
really
well
to
really
increase
the
bar
to
say
all
the
IPS
must
reply,
so
we
have
very
few
prefixes
where
we
have
fifteen
or
fourteen
replies,
or
so
typically,
it's
either
all
on
one
of
these
random
IPS.
W
To
reply,
and
one
very
interesting
impact
that
we
did
find
is
that
we
would
actually
find
prefixes
that
appear
to
be
alias
at
a
high
level
like
a
slash
ready
to
level.
But
then
you
could
find
tiny
pieces
in
it.
That
would
not
be
aliased
where
ip's
would
not
reply,
and
this
is
a
quite
interesting
phenomenon.
W
Yes,
that
always
takes
the
time
to
render
so
what's
the
result,
basically,
we
had
these
fifty-five
million
IP
addresses
and
almost
half
of
these
addresses
were
in
alias
prefixes,
and
if
we
do
this
plot,
which
we
did
using
SAS
plot,
which
was
presented
at
last,
never
G
by
Luke,
then
you
find
it's
very
few
prefixes
and
these
make
up
a
lot
of
addresses
in
this
list.
So
this
is,
if
you
think
about
it
as
expected,
but
it's
still
quite
quite
alarming.
W
That
half
of
the
hit
list
is
basically
just
double
counting
one
IP
address
or
one
host.
So
to
us
it
was
quite
surprising
that
is
actually
half
of
these
addresses
that
are
basically
alias.
If
you
look
into
what
are
these
prefixes,
it's
typically
there's
a
wide
range,
but
a
lot
of
them
are
CloudFlare
AWS
or
something
like
that,
and
we
know
they
sometimes
use
tricks
like
having
basically
a
packet
filter
on
random
IP
addresses
that
will
forward
it,
but
still
you
shouldn't
be
scanning
and
double
counting.
These
IP
addresses
good.
Then.
W
The
next
thing
we
did
is:
let's
do
some
validation.
We
have
this
technique
that
claims
to
find
alias
prefixes.
How
can
we
increase
our
confidence
that
this
actually
does
what
it's
supposed
to
do?
So
here
we
did
use
advanced
fingerprinting,
which
we
have
used
earlier
to
detect
cases
where
an
ipv4
and
an
ipv6
address
belong
to
the
same
host.
So
you
can
use
certain
features
to
increase
this
confidence
that
it's
really
the
same
machine
you're
talking
to
on
this
set
of
IP
addresses.
Some
of
them
are
good
in
confirming
this.
W
If you look at what we find: we fingerprinted these twenty thousand prefixes that we considered aliased; here we just do it at the /64 level, so you have a somewhat stable set to look at. The confidence of course depends a lot on the test, as I just said: if it's the same TTL, it doesn't mean a lot, but if it's different, it does. And out of this 20k we find only a handful where we say this looks odd, like it's not the same machine.

So for us this was a confirmation that our approach to finding aliases works reasonably well: there are a few inconsistent ones, but the majority is consistent, and there is a big portion that is actually very strongly consistent, where TCP timestamps give us really high confidence that this method of finding aliases works. So, as a takeaway: if you use an IPv6 hit list, it can contain large clusters of aliased prefixes and aliased IP addresses; in our case, half of the IPs were basically the same machines.
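A sketch of the TCP-timestamp idea (generic; the paper's fingerprinting is considerably more careful): if two addresses are the same host, their TSval clocks tick from the same origin at the same rate, so the clock start time inferred from each address should agree.

```python
# Sketch only: do two addresses share one TCP timestamp clock?
# samples: [(local_unix_time, remote_tsval)] observed per address.
def clock_origin(samples, hz=1000):
    """Estimate when the remote TSval clock was at zero, assuming
    a tick rate of `hz` (1000 Hz is a common Linux setting)."""
    t, tsval = samples[0]
    return t - tsval / hz

def same_host(samples_a, samples_b, tolerance=1.0):
    """Consistent (possibly aliased) if both inferred clock origins
    agree within `tolerance` seconds."""
    return abs(clock_origin(samples_a) - clock_origin(samples_b)) < tolerance

a = [(1000.00, 5_000_000)]    # made-up observations
b = [(1000.25, 5_000_250)]
print(same_host(a, b))        # True: both origins are around -4000 s
```

Note the caveat raised in the Q&A below: with per-connection randomized timestamp offsets (Linux 4.10 and later), this particular check stops working.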
W
If
you
use
multilevel
prefix
detection,
you
can
also,
with
higher
confidence,
cut
out
these
small
pieces
that
are
actually
not
aliased
in
a
space
of
an
alias
prefix.
You
can
use
fingerprinting
or
some
other
technique
to
really
boost
your
confidence
that
what
you
use
to
what
you
did
to
find
these
alias
prefixes
actually
worked
and
yeah.
The
paper
and
plots
are
online
I've,
put
a
link
here
and
there's
also
a
bunch
of
other
stuff
that
I'm
working
on
and
happy
to
discuss
of
line
with
you.
Thank
you.
V
Lorenzo Colitti: ...and we keep hearing about this sort of attack, and with more end-to-end connectivity coming back, we will see even more of it. I was wondering the other day if we could say: look, any field in any packet that is reserved should be random. And I think that is so not forward-compatible, because you can grease stuff, but then, when implementations get updated to actually support the options, what was previously random might actually get interpreted.

I'm wondering if there is any solid way to do that, like requiring that every option has its own checksum or something like that, so we can basically fill everything that is currently unassigned in a packet with random data, all the time, and require that protocols do this, and then somehow make sure that when those option numbers actually get allocated we can validate them on the receiver. That would be a more structural approach to this kind of thing.

We design something, and then people figure out how to understand what is actually behind it, because a lot of our protocols just say "reserved, must be zero", and we faithfully do that. And here, right, you could say: well, I'm just going to send a random TTL on replies, and randomize other stuff, just to...
W
Yes; basically, there are some steps you can take to increase your privacy and protect yourself from the fingerprinting. You could, for example, just rotate the TCP options that you send back. Also, since version 4.10, Linux uses randomized start values for TCP timestamps, so that timestamp technique doesn't work starting from Linux 4.10. So I think there are reasonable steps you can take to avoid that fingerprinting.
Y
To continue a little bit more and explore Erik's question about load-balanced clusters: you can have a load-balanced cluster with a pretty large prefix behind it, serving, right? And it may not have many machines, like under a dozen, but effectively all the probe addresses get spread as they are going to the backend servers. So would your method classify it as several million different machines, or as still under a dozen?
W
[answer inaudible]
Z
I'm Jimmy. I'm just curious whether you identified what kind of host has such a huge number of aliased addresses. I can imagine a /64 being occupied by a single host, but seeing this at the first nibbles of the address seems to be very strange, so I don't know if you have any idea about what it is.
W
[answer inaudible]
Q
Hi, I'm not Ana or Iain, I am Gorry. This is a paper that the three of us put together for TMA, and the full text of the paper is available openly, so you can look at that; the URL is on the final slide. The paper was about measuring the usable maximum packet size across an Internet path, and I've changed this talk to talk about how we can make path MTU discovery work, to try and make it more IETF-focused. Okay. So how do things work?
Q
Really
it's
good
to
send
big
packets
because
the
Internet
can
send
big
packets
and
we
have
something
called
path:
MTU
discovery,
full
standard,
ipv6.
What's
the
ipv4?
It's
just
a
network
layer
mechanism.
You
have
a
certain
size
of
packet,
you
can
send
it
supported
some
places,
not
other
places.
You
get
an
ICMP
message
bike.
You
choose
a
smaller
number,
because
that's
what
works?
Hey
it's
cool
when
it
doesn't
work
and
that's
what
the
talks
about
path
to
big
messages.
Q
Icmp
messages
before
or
v6
are
somewhat
unreliable
and
we
know
ICMP
firewalls
droplets,
some
CPS
drop,
this
ecmp
tunnels
or
the
firewall
processing
and
corporate
domains.
All
drop
ICMP
and,
like
all
the
data,
when
you
try
and
send
a
big
packet,
because
you
never
know
that
the
data
was
too
big,
so
big
packets
don't
work
how
that
and
if
you've
been
the
transporter.
You
probably
know
this:
we
can
do
things
at
the
transport
layer
and
the
transport
layer
can
figure
out
that
package
don't
get
through.
Q
Here's the first fix that came from the community. TCP is the most common transport still; QUIC might emerge soon. With TCP, the server side advertises an MSS, the biggest size of packet it can accept, and it sends this through the network; the client then knows that it can't send packets bigger than this, because the server doesn't want them. So how about you just change that number in the TCP SYN packet, changing the MSS option to a smaller value?

In this case the CPE changes the value down to 1452, to be helpful. It's called MSS clamping, and it's quite widely deployed, as we'll see later.
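A sketch of what such a clamping box does to a SYN, written with scapy to make the rewrite visible (illustration only; real devices rewrite in the forwarding path, and the addresses here are documentation placeholders):

```python
# Sketch only: rewrite the MSS option in a forwarded TCP SYN the
# way an MSS-clamping middlebox would. Addresses are placeholders.
from scapy.all import IP, TCP   # pip install scapy

CLAMP = 1452   # what the path can actually carry, minus headers

def clamp_mss(pkt):
    if TCP in pkt and pkt[TCP].flags & 0x02:          # SYN bit set
        opts = []
        for name, val in pkt[TCP].options:
            if name == "MSS":
                val = min(val, CLAMP)                 # clamp, never raise
            opts.append((name, val))
        pkt[TCP].options = opts
        del pkt[IP].chksum, pkt[TCP].chksum           # force recompute
    return pkt

syn = IP(src="192.0.2.1", dst="198.51.100.2") / TCP(
    flags="S", options=[("MSS", 8960)])
print(clamp_mss(syn)[TCP].options)                    # [('MSS', 1452)]
```

As the measurements below show, deployed boxes also raise, invent, or mangle this option, which is exactly why it is a hack rather than a mechanism.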
Given we know this stuff happens, we did some measurements. The first set of measurements was taken from a set of data centers, going to the top 1 million web servers, and yeah, we tested them; we have 4 million points in this data set.

What we saw: this is the MSS seen in the SYN coming from a web server somewhere in the internet, in the top 1 million, for IPv4. We see 1460 as the common number, which is what you might imagine, and there is this little tail off across the bottom: about 25% or so of locations return something smaller. Maybe there are tunnels in the way; maybe there are some technologies which are different. So the distribution looks like this for v4; for v6, the shape of the curve looks like this.
Q
Well, if you followed the v6ops group, you would know that there was a nice ICMP problem: a few load balancers don't know how to return the Packet Too Big message to you, because they have two potential places it can come from, and life gets difficult. So the first hack that was put in to fix this was simply to get the servers to advertise a smaller MSS, 1220, and now everything works; and Akamai, who did the work and saw this as an interim fix, described how to fix it on their webpage.

So really this is reporting on two different server configurations that are out there. If you work for these people, you might want to talk to us, because you could probably fix this. Okay, so that's a bit odd, but let's go further and look at what happens on those paths advertising an MSS that is smaller than the maximum you might imagine. We do the same test, and now we run a ping with the full size you might expect to go across, and lo and behold...
Q
Let's look at edge infrastructure. All of that was taken from a place that was well connected, so now let's take mobile clients. We used a testbed called MONROE, a number of, hundreds of, mobile nodes across Europe, and we could launch test campaigns from these sites; and RIPE Atlas probes let us look at the wired case as well. So these green measurements are from the edge.

We did our test; we got the results we expected. Then we sent some packets without an MSS option, SYNs without MSS options, and behold: we saw packets arriving with an MSS option set, helpfully, by the network, clamping this to a smaller value. Oh, that really is odd. And perhaps even odder, I have no idea why these numbers were chosen by these particular operators; this is not naming and shaming.
Q
We
have
bigger
data
sets,
and
you
know
mean
each
operators
just
as
chosen
some
number
to
clump,
but
these
are
1,400
or
14
10
14
2088,
hey
apes,
hops.
That's
21
percent
of
our
data
set
added
an
MSS
option.
When
we
didn't
ask
the
one:
okay,
maybe
there's
something
about
not
looking
at
this
MSS
option
in
future.
What
about
the
wired
edge?
Okay
turns
out
mobile
operators
weren't
as
bad
as
you
think.
The
wired
case
varies
wildly
and
we
have
three
thousand
ripe
Atlas
probes
and
we
surveyed
a
number
of
different
places.
Q
4.8%
of
our
probes
arrived
carrying
a
nemesis
option
as
well,
and
some
of
these
were
even
bigger
than
the
maximum
allowed
by
a
1500
by
Ethernet
link,
and
yet
we
know
they're
on
internet
links,
so
yeah
people
are
adding
MSS
options
and
clamping,
but
they're
doing
it
not
to
Sarah
Lee.
In
a
very
obvious
way,
people
are
trying
to
help
always
good
to
have
help
so
stop
looking
at
what
the
network
does,
because
there's
a
lot
of
day
to
night
paper
and
I've
got
the
link
to
the
paper.
Q
At
the
end
of
my
talk,
let's,
instead
now
try
and
use
path,
MTU
discovery.
So
in
this
case
we
use
a
tool
called
scamper.
Many
of
you
know
this.
We
also
use
Naturalizer
and
traceroute,
and
we
set
up
in
Nord
in
a
place
where
we
could
control
it
and
we
artificially
reduced
the
MTU
of
the
link
and
we
saw
whether
the
remote
server
could
choose
the
right
size
of
packet
to
send
to
us.
Q
So
for
the
mobalage
1,500
byte
packet
was
sent
as
a
UDP
probe.
We
set
the
DF
flag
for
ipv4
v6.
It
was
already
set
and
small
dataset,
but
basically
operators
we're
doing
well
and
that's
good,
and
what
about
the
actual
data
would
be
for
it
for
wired
networks.
This
is
camp
reconnection,
so
the
first
line
is
the
MSS
was
reduced
in
our
connection
and
therefore
we
didn't
actually
do
the
tests
or
the
pale
one
needs
to
be
ignored
and
60
or
so
percent
of
paths
actually
use
path.
Mtu
discovery
and
worked.
Q
Was
that
good
news
baby,
not
so
brilliant
for
a
transport
perspective?
If
you
think
that
40
percent
didn't
succeed
in
doing
the
thing
you
thought
he
was
going
to
do
about.
20
percent
failed
because
they
failed
to
get
the
path.
Mtu
discovery
to
work
and
some
spikes
didn't
set
the
DF
bit
when
we
asked
them
to
do
it,
which
is
a
bit
annoying.
So
I
canceled,
the
12%.
Some
networks
cleared
the
DF
when
we
set
it,
which
is
very
helpful
because
that
when
the
packet
got
fragmented,
that's
perhaps
why
v6
doesn't
allow
this.
Q
So
that's
the
kind
of
left
side
of
my
plot.
The
right
side
is
where
we
no
filter
all
the
ICMP
messages,
so
we
black
hole.
The
ICMP
messages
life
gets
little
evil
here,
because
path,
MTU
discovery,
succeed,
8%
of
the
cases
because
their
path,
MTU
discovery,
algorithm,
had
a
way
of
detecting
this
and
dropping
back,
somehow,
probably
called
black
hole
detection.
Q
The paper's got a lot of detail in it, and I'm happy to talk about it in detail, but really you should read the paper and look at these plots if you're interested. What's the takeaway? Path MTU discovery doesn't work reliably. It's a nice thing to have down in the IP stack, in IPv6 and IPv4, and if you can make it work, that's cool; but it really doesn't work reliably.

The obstacles are actually obvious, but the useful thing in the data, perhaps, is that we actually tested them and figured out where it doesn't work and how likely it is that problems occur. Some Packet Too Big messages never get there, and some Packet Too Big messages are just simply wrong: there are CPEs generating Packet Too Big messages with the size just set wrongly, probably copying the wrong bytes of data here and there and sticking them in the packet.
Q
There
are
paths
to
big
messages
that
you
can't
check
where
they
come
from,
because
there
aren't
enough
bytes.
Surprisingly,
common
before
configuration
is
only
return.
Eight
bytes
of
packet
to
header,
even
though
the
hosts
requirements
for
v4
say
return
576,
they
probably
didn't
read
the
host
requirements,
the
v4,
which
is
kind
of
been
out
there
for
a
while
anyway,
a
smaller
MSS
is
commonly
the
way
that
people
have
used
to
control
this
problem.
They
simply
lower
the
MSS
for
path,
MTU
discovery
for
TCP,
which
means
that
many
servers
don't
really
do
the
path.
Q
Mtu
discovery
algorithm
that
anyway,
MSS
clamping
is
common
in
the
network
as
well.
How
can
we
make
path?
Mtu
discovery,
work
because
I
mean
I'm
here
to
try
and
make
the
network
work.
That's
why
I
come
to
the
IDE
yeah?
Well,
first
of
all,
we
have
some
measurements
and
I
love
having
measurements.
To
start
with.
That's
why
I'm
talking
here
we're
going
to
get
more,
we've
already
started,
getting
several
million
more
data
points
to
try
and
really
understand
the
idiosyncrasies.
Q
The issue here is that we really are concerned more about the cases where things go wrong than about the high percentage of cases where things go right, because you really want this discovery to be reliable. You can make it reliable: we invented something called TCP PLPMTUD, Packetization Layer Path MTU Discovery, which is an addition to TCP that does the probing in the TCP stack. In theory this fixes it for TCP; it's RFC 4821, and we'd be asking people to use this in the IETF.

Unfortunately, the implementations of this are slightly broken, maybe because people use MSS clamping and so don't actually exercise the code, so nobody cares enough to fix it; but many of the PLPMTUD implementations we looked at were either not enabled or not really functional. They didn't do harm; they just didn't do anything useful. So TSVWG has a work item, and it's looking at doing PLPMTUD for datagrams. We're starting afresh; we are making something that will work with UDP.
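A toy sketch of the datagram PLPMTUD idea (my simplification; the real state machine in the TSVWG work is much richer, and the echoing peer here is an assumption): probe upward in size with padded datagrams, require end-to-end acknowledgment instead of trusting ICMP, and treat repeated probe loss above a size as a black hole.

```python
# Sketch only: search for the largest probe size confirmed
# end-to-end over UDP; assumes the peer echoes a short ack.
import socket

def probe_pmtu(sock, sizes=(1280, 1400, 1452, 1500), tries=3):
    """sock: a connected UDP socket to a cooperating peer.
    Returns the largest probe size acknowledged end-to-end."""
    sock.settimeout(0.5)
    plpmtu = 1280                       # IPv6 minimum as the safe floor
    for size in sizes:
        acked = False
        for _ in range(tries):          # one loss is not a black hole
            try:
                sock.send(b"\x00" * size)        # padded probe
                if sock.recv(16):                # peer's ack
                    acked = True
                    break
            except (socket.timeout, OSError):    # e.g. local EMSGSIZE
                continue
        if not acked:
            break                       # black hole above this size
        plpmtu = size
    return plpmtu
```

The two properties asked for in the discussion below, never black-holing real traffic and never adding latency, fall out of keeping the probes separate from application data.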
Q
It
will
work
with
various
applications
running
over
UDP,
including
including
quick.
Hopefully
we
have
some
hackathon
activity
to
make
this
work
with
quick
and
also
with
SCTP,
and
we
think
this
might
be
a
good
solution
to
this
problem
space
and
with
the
measurement
data
we
might
be
able
to
build
something.
That's
robust
and
works
reliably
across
the
whole,
the
internet.
Maybe
in
future
we
could
come
back
and
redo
the
TCP
thing
in
a
similar
way
and
try
and
get
tcp
working
in
the
way
that
we
finally
managed
to
get
dirty.
D
[inaudible question]
Q
Okay, so the first answer to the question is: what can we do with the data, and how can we use it usefully? I think, just understanding these things, we can immediately start to talk to people who maintain stacks and say, hey, look, this is just going on, and you might easily be able to fix this; that's cool. I think the real solution is to fix the transport protocols to work with this.

And the tests I skipped over quickly, the expanding ring search, are where we actually found the piece of equipment where the MTU dropped, and whether it dropped more than once on the path. So we have the data for it, and that will help us construct test cases for datagram packetization layer path MTU discovery, and I think we will make those available to other people.
V
Lorenzo Colitti: Thank you for presenting this. Having been involved in some of the v6 load balancing a while ago: this is just, like, ICMP is basically unfixable, and thank you for clearly showing that. I think doing it at the transport layer is the only way that will ever work, because it shares fate with the traffic that you actually want to work. And so I think the key things there are: don't ever black-hole, which means start small and try to grow; and don't ever increase latency, which means parallelize, right?

You basically maybe send the same packet with the different sizes, if you can, because, yeah, those are going to be... MSS rewriting has kind of those properties for TCP. Now, you can't do that with any encrypted protocol, but I think those are the drivers, right? We looked at this when we were doing the v6...

The first v6 launch at Google: you've got these Hurricane Electric tunnels, where the MTU of their network is 1500 but the tunnel is 1280, and they don't know that they have 1280. So you send, and the packet gets all the way there and hits this MTU drop, and you've lost an RTT. We can't do that, because if we impact latency we will never ship this thing at all. And so 1280 is, at the time, at least for a very long time ago...
V
What we did, the outgoing MTU: we never sent anything that was bigger than 1280, no matter what you announced, because we knew it wouldn't work, and even if it did, we'd have this latency impact. So think about that. Good luck, you know; thank you, thank you for taking this on.
Q
Thank you. I'm not sure I'd do it by sending the same data twice at two different sizes; maybe I would send a probe, which is what we're kind of talking about: there's no real content, but we can actually verify that it goes through. As long as it shares fate, you're absolutely right, yeah; and we don't try and raise the actual packet size of the real data until we really know that it works. Jana Iyengar.
AD
Jana Iyengar. If you have real traffic using a mechanism, and you show that it doesn't cause any latency increases but actually allows you to discover larger MTUs if they exist, then that's a valuable mechanism that can immediately go into other things as well. I absolutely agree with not relying on ICMP messages and going in the direction of doing PLPMTUD for TCP, and I think this is a problem we should solve — it needs solving.
Q
And it should be solved. So why is TCP a little harder for us to work with than the UDP and SCTP stacks? The answer is simple: we can't actually send a non-data segment through the network and verify it got to the other end, because TCP doesn't let you do that at a chosen size — there's no padding facility in the TCP header.
Q
That's a known problem, and Mathis's RFC, which I mentioned, has a way of dealing with it. But that way of dealing with it interacts directly with the loss recovery and congestion control mechanisms, whereas if you can send probe messages which are identifiable to you as probe messages and can be echoed, but which to the path just appear as packetization layer messages, then you can separate these two concerns and develop the algorithm quickly and efficiently, without worrying about that other interaction — which can then be addressed by the transport.
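To make the probe idea concrete, here is a minimal sketch of a probe-driven path MTU search for a datagram transport — an illustration of the principle under discussion (start from a safe size, raise only on a confirmed probe, never let real data outgrow a verified size), not the actual TSVWG algorithm. The send_probe/probe_acked hooks are hypothetical placeholders for transport internals.

    # Illustrative sketch only: probe-based path MTU search for a datagram
    # transport. send_probe()/probe_acked() are hypothetical transport hooks,
    # not a real API; the real algorithm is specified by TSVWG's DPLPMTUD work.

    BASE_PLPMTU = 1280          # conservative size assumed to work (IPv6 minimum)
    MAX_PLPMTU = 9000           # upper bound we are willing to try

    def search_plpmtu(send_probe, probe_acked, max_retries=3):
        """Binary-search the largest probe size the path confirms.

        send_probe(size) transmits a padded, content-free probe packet;
        probe_acked(size) returns True once the peer echoes/acks that probe.
        Real traffic keeps using the last *verified* size throughout, so a
        lost probe can never black-hole application data.
        """
        verified = BASE_PLPMTU
        lo, hi = BASE_PLPMTU, MAX_PLPMTU
        while lo < hi:
            candidate = (lo + hi + 1) // 2
            ok = False
            for _ in range(max_retries):        # probes share fate with data,
                send_probe(candidate)           # but losing one costs nothing
                if probe_acked(candidate):
                    ok = True
                    break
            if ok:
                verified = candidate            # only now may data grow
                lo = candidate
            else:
                hi = candidate - 1              # treat as too big, search lower
        return verified

The property that matters for the discussion above is fate-sharing without risk: application packets never exceed the verified size, so a lost probe costs nothing but the probe itself.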
AD
Yes, that interaction needs to be understood — and not just understood: I think it's an interaction that's important to resolve, because without that, again, you'll get zero deployment. We don't want to go through the process of trying to figure this out again and basically not have any deployment. So in that vein, if doing it in QUIC is easier, then that's fine — QUIC has the ability, you can send padded packets and you can do all of that — but the interaction with loss recovery and everything has to be resolved.
Q
On that one — come join us, we're playing with QUIC code. Sorry: come join us from the congestion control side; we are playing with QUIC code, so we're interested in seeing how that works. In QUIC first, yes — and we do intend to go back to TCP PMTUD, if we have cycles in the end, to revise Mathis's draft; Matt is open to looking at that. That sounds good.
P
Erik Nygren. Based upon what you've been looking at, do you see any value in path-based signals such as MSS clamping? Is MSS clamping something that is useful and helps here? And as we start having encrypted protocols such as QUIC, is it worth exploring ways of getting signals from the path, something like MSS clamping, or is it a dead end that we should run far, far away from?
Q
MSS clamping wasn't the right thing: you were taking a side effect of something else, and you can stick any number in it that you like — which is what we saw; some people raising it, some lowering it, some creating it where it wasn't. That is a hack, and thank goodness we can't do that with QUIC. Could we do something else? Yes, possibly.
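For readers who haven't met the hack being dismissed here: an MSS-clamping middlebox rewrites the MSS option inside TCP SYN segments so that endpoints never build segments larger than the path can carry. A minimal sketch of that rewrite, for illustration only — real devices do this in the forwarding path (e.g. a router's TCPMSS/clamp feature) and must also recompute the TCP checksum:

    # A minimal sketch of what an MSS-clamping middlebox does to a TCP SYN.
    # Illustrative only: real clamping happens in kernels or routers, not in
    # Python, and a real device must also fix up the TCP checksum afterward.
    import struct

    def clamp_mss(tcp_header: bytes, new_mss: int) -> bytes:
        """Rewrite the MSS option (kind=2, len=4) in a TCP SYN, if present."""
        data_offset = (tcp_header[12] >> 4) * 4        # header length in bytes
        opts = bytearray(tcp_header[20:data_offset])
        i = 0
        while i < len(opts):
            kind = opts[i]
            if kind == 0:                              # End of option list
                break
            if kind == 1:                              # NOP, single byte
                i += 1
                continue
            length = opts[i + 1]
            if kind == 2 and length == 4:              # the MSS option
                old_mss = struct.unpack('!H', opts[i+2:i+4])[0]
                if old_mss > new_mss:                  # only ever lower it
                    opts[i+2:i+4] = struct.pack('!H', new_mss)
            i += length
        return tcp_header[:20] + bytes(opts) + tcp_header[data_offset:]

And exactly as noted above, nothing stops a box from writing any value it likes into that field; with QUIC the equivalent information sits inside the encrypted transport header, so the trick is simply unavailable.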
Q
Ron Bonica was talking about some truncation method that might work with v6 — sending probe packets through the network that can somehow interact with the routers to figure out what really works. That could have some value; I don't know. My immediate takeaway is that you've got to do this at the transport layer. It's got to be part of QUIC; it's got to be part of TCP and SCTP, something that understands how the UDP stack is operating. It can't just be done in the network layer.
J
Good morning everybody again. This work was done by me with John Heidemann and colleagues at USC/ISI and SIDN Labs — Moritz Müller is also sitting here in the back. It's currently under review, and we call it "When the Dike Breaks: Dissecting DNS Defenses During DDoS".
J
We have seen growth in the number of DDoS attacks lately: they're getting bigger, more frequent, cheaper, and easier to perform. I think the largest number we have is that they reached 1.7 terabits per second in 2018; and in 2016, Dyn had a roughly 1.2-terabit attack on its DNS infrastructure.
J
The Dyn attack used Mirai — the first big botnet used in DDoS attacks that we know of — and you can actually buy DDoS attacks now as a service on the Internet; they're called booters, or stressers, and you can just buy them. We have also seen DNS hosting become a target of DDoS, and in particular there have been these two notable attacks on DNS. A DDoS against DNS potentially breaks everything else: if you cannot resolve domain names, nothing is going to work. But these attacks had very different outcomes.
J
If you look at the left, there was an attack on the root DNS service in November 2015, and this page here is RIPE's DNSMON: everything red shows when there were reachability problems. Even though some of the root letters had problems, there were no reports of errors seen by users, which is kind of striking — good news for the operators.
J
Even though they had problems, nobody really noticed from the user point of view. But the attack on Dyn, a big DNS provider, in 2016 made the news everywhere — the New York Times, The Guardian; Bruce Schneier wrote about it too, in a bunch of other places — because some users could not reach popular websites: Netflix, I think, the New York Times, and a bunch of others. If users cannot connect to the main applications, they notice.
J
So the question we wanted to gauge was this: these were two large DDoS attacks against DNS, and they had very different outcomes from the point of view of the users — why? We wanted to know what factors actually impact user experience: what causes no visible change, and what causes sporadic problems. We know that recursive DNS resolvers have various mechanisms to cope with failures — caching is one of them, and retries as well.
J
We also wanted to know whether operators can improve their services. Now, just a quick recap on DNS. This figure shows how DNS pretty much works. If you're a user, you're here at a stub resolver, and you want to know the IP address for a particular domain. The green boxes are the authoritative servers: they are the ones that have the information, so somehow you have to get there to get an answer — but usually you don't go there directly.
J
You have a bunch of recursive resolvers in the middle: these are the ones that do the job for you — if you're familiar with DNS, you know how that works. What matters is that they may have caches in between, so if somebody asks the same query again they can answer it from the cache much faster; and if there's a problem with the authoritative servers, these resolvers in the middle can retry and switch between servers.
J
DNS records stay alive in a cache at most for as long as the validity specified by the record's TTL — time to live. So the authoritative server will say: hey, you can store this record for up to one hour, or 30 seconds, or whatever; it is set there, and it then propagates to the caches.
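As a toy illustration of that rule (not any resolver's actual implementation): a cached record is usable until its insertion time plus the TTL the authoritative attached to it, after which the resolver must go back upstream.

    import time

    # Toy resolver cache honoring TTLs; illustrative only.
    cache = {}  # qname -> (answer, expires_at)

    def lookup(qname, resolve_upstream):
        """Return a cached answer while its TTL lasts, else go upstream."""
        entry = cache.get(qname)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]                      # cache hit
        answer, ttl = resolve_upstream(qname)    # e.g. ("192.0.2.1", 3600)
        cache[qname] = (answer, now + ttl)       # obey the authoritative's TTL
        return answer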
J
Now, how can we evaluate the resilience that caching builds into DNS? In this paper we broke this down into three parts. First, we evaluate how caching really works under normal circumstances, both in the wild — using RIPE Atlas — and in a controlled environment. Then we move to a production zone: we analyzed this in the roots and in .nl, the registry I work for, but in this presentation I only cover .nl. The last part is the more interesting one.
J
There we emulate DDoS attacks and observe them in the wild. The goal is not only to analyze how the authoritatives behave; we wanted to know how users experience it — because, well, that makes a good paper. All right. So how did we do the measurements for part one? We wanted to know how caching works, so we registered a brand-new domain that had never been registered before — cachetest.nl — and ran two authoritative name servers on EC2 in Frankfurt. We don't analyze anycast here — that would add a lot of complexity — but there's a bunch of work on anycast if you're interested.
J
As vantage points we used RIPE Atlas — about 10,000 probes — and the vantage point in this measurement is not just the probe itself but the combination of the probe and its local recursive resolver. Each probe sends a unique query, identified by its probe ID: when one probe asks a question, we don't want another one to ask the same question, so they can't interfere with each other's caches.
J
And in each answer we encode information — a counter; the details are in the paper — that allows us to identify whether a query was answered from the cache or by the authoritative. We increment it over time, so you can determine this very easily, and we have a bunch of scenarios that play with these details to measure their influence. All right: so what do we control? We control the resolvers at the vantage points — the RIPE Atlas probes — and we control the authoritative servers, the green boxes here.
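A sketch of that classification trick, under simplifying assumptions (the paper's encoding differs in detail): if the authoritative increments a counter embedded in every answer it serves, and the analysis sees all the answers, then any repeated counter value must have been replayed from a cache.

    # Illustrative sketch: detect cache hits by embedding a counter that only
    # the authoritative increments. Repeated counter values => cached answers.
    from itertools import count

    authoritative_counter = count(1)     # lives at the authoritative server

    def authoritative_answer():
        """Every answer served by the authoritative embeds a fresh value."""
        return next(authoritative_counter)

    seen_values = set()                  # lives in the analysis pipeline

    def classify(counter_in_answer):
        """Repeated counter values can only have been replayed from a cache."""
        if counter_in_answer in seen_values:
            return "cache hit"
        seen_values.add(counter_in_answer)
        return "authoritative"           # first sighting: reached the origin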
J
We have no control over the recursive layer in the middle — and that is exactly what we try to measure in this paper: how the recursive layer behaves. OK, let's get to the results: how good is caching in the wild? In this graph, the x-axis shows our different experiments, each labeled with the TTL we used on the authoritative name servers. By the way, the datasets are open; you can check them on the website and in the paper.
J
For each experiment you see the number of queries we observed, and we classify each query into one of four categories. The blue ones correctly went to the authoritatives: they were answered there, and they should have been, so they're fine. The green ones are cache hits: they should have been answered by the cache, and they were, so this is also good.
J
So what we see here is caching working. It works: roughly 70% of the queries that should have been answered from a cache were — and this is considering that ours was not a very popular domain; essentially only we were querying cachetest.nl's A record, so I would expect a popular name to do even better. It's a kind of lower bound. The not-so-good news is the other 30%: cache misses.
J
That's shown by the yellow color here — the 30% — and why is that? There are a bunch of possible reasons: caches have limits; our domain was not that popular; caches may be flushed; and the recursives may themselves forward to other caches. Remember, this is the layer we have no control over, and these are very complex caches.
J
A lot of people now use anycast in the recursive layer — there are a bunch of free public resolver services out there — and some of them may have cache fragmentation. So we went further and analyzed what happened with the queries that were answered by the authoritative servers when they were actually cache misses. It turns out half of them came from public resolvers — you can see the classification we did — and half of the ones using public resolvers were going to Google.
J
Again, this is not a popular domain; I would expect Google to behave better with a popular name. All right. So that was the controlled environment, measured in the wild using RIPE Atlas. The next question you may have is how this works in a production zone. I work for the .nl registry, and we have access to the data that arrives at the authoritatives, so we computed, for each recursive, the time between consecutive queries that reach us from it.
J
We chose a domain that runs on our authoritative name servers, whose records have a TTL of one hour. This is the number of queries we analyzed over a six-hour period and — this is the important part — the number of recursives. What we see, in this CDF, is that roughly 70% of the recursives actually query again around the time the TTL expires. What that means is that our experiments behave like the real zone: we can confirm the same behaviour here in a production zone.
J
We also looked into the root zone, using DITL data — that's also in the paper. All right, here is what we have so far: we know how caching works in the wild, and with that as a baseline we can move to the interesting part: how well do caches work under stress, during a DDoS against the authoritatives, and how do users experience it? OK, so how are we going to emulate a DDoS? Well, we could have just tried to DDoS our own servers at Amazon.
J
I don't recommend that — it would create a lot of problems. So what we do instead is emulate a DDoS on our own servers at Amazon: I just use iptables and say "drop N percent of the incoming traffic" — 10 percent, 50 percent, and so on. That is a reasonable approximation, because a server under stress is going to drop traffic. It's not exactly like a real DDoS: in a real DDoS the attacker can congest links, routers can be having problems as well, and packets get killed in a bunch of other places.
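A sketch of that emulation on a Linux authoritative, assuming iptables with the standard statistic match (the talk doesn't show the exact rules used; this is one common way to drop a random fraction of incoming DNS queries):

    # Illustrative: emulate an N% DDoS-induced loss on a DNS authoritative by
    # randomly dropping that fraction of incoming UDP/53 packets with iptables.
    # Requires root on a Linux host; run the cleanup when the experiment ends.
    import subprocess

    RULE = ["INPUT", "-p", "udp", "--dport", "53",
            "-m", "statistic", "--mode", "random"]

    def start_emulated_ddos(drop_fraction: float) -> None:
        """Install a rule dropping e.g. 0.5 -> 50% of incoming DNS queries."""
        subprocess.run(["iptables", "-A", *RULE,
                        "--probability", str(drop_fraction), "-j", "DROP"],
                       check=True)

    def stop_emulated_ddos(drop_fraction: float) -> None:
        """Remove the matching rule again."""
        subprocess.run(["iptables", "-D", *RULE,
                        "--probability", str(drop_fraction), "-j", "DROP"],
                       check=True)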
J
First scenario: a complete DDoS, meaning the server drops 100 percent of the packets, and the TTL is set to one hour — 60 minutes. This is the doomsday scenario for an operator: all your authoritatives are down, in this case for one hour. It's the worst that can happen, and we wanted to know how much caching actually protects you. In this figure you see an arrow pointing down: that marks the time when we start emulating the DDoS.
J
We let all the probes send one query before we start the emulation, so the caches of the recursives get populated, and after that we actually start dropping all the packets. In the following rounds of queries, the blue bars show people still getting an answer, and you see a continuous drop until here — the time when the caches expire — after which very few people get an answer and everybody else is having problems.
J
If you cannot resolve the domain, that's it. But what we see here is that thirty-five to seventy percent of the clients were still answered — by the cache, that is — even though the authoritatives were down. So this is a very good result. It's also very interesting that even after the caches should have expired, about 0.2 percent of the clients were still getting answers. There is a draft about serving answers that have already expired; it's called serve-stale.
J
It means: if your recursive cannot reach the authoritative, serve the answer you knew beforehand, even though it has expired. That's exactly what's happening here. It's a very small number of people getting that, but I think it's a good idea.
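The serve-stale behaviour being described extends the normal TTL rule with one fallback — a sketch under the obvious assumption (real implementations, and the specification work on stale answers, bound how long and under which failure conditions expired data may be served):

    import time

    cache = {}  # qname -> (answer, expires_at)

    def lookup_with_serve_stale(qname, resolve_upstream):
        """Prefer fresh data; on upstream failure, fall back to expired data."""
        entry = cache.get(qname)
        now = time.monotonic()
        if entry and now < entry[1]:
            return entry[0]                          # normal cache hit
        try:
            answer, ttl = resolve_upstream(qname)    # may time out under DDoS
            cache[qname] = (answer, now + ttl)
            return answer
        except TimeoutError:
            if entry:                                # authoritative unreachable:
                return entry[0]                      # serve the stale answer
            raise                                    # nothing cached: real failure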
J
All right. Now let's change one parameter and carry on with another doomsday scenario: still a 100 percent packet drop at the authoritatives.
J
What we do here is, instead of allowing only one query before the DDoS, we allow about one hour of queries first — pretty much the TTL — so the caches are more or less about to expire when the DDoS starts. And what you see is that the number of people getting answers after we start the attack is far lower, meaning the cache is much less effective: as time moves closer to the attack, cache entries time out, and so do the answers for some of the people here.
J
It really depends on the state of the cache at the particular moment the attack happens: if your records are about to expire, you're on your own. All right. Next we take the previous scenario and reduce the TTL from one hour to 30 minutes, so records stay in a cache not for one hour but for only thirty minutes, and the recursives again see a 100% packet drop.
J
What do we see? The drop is much quicker now, because the TTL is smaller, so records stay in the cache for shorter periods of time; and some people get stale answers after the caches expire. So for DNS operators — well, for anyone who uses DNS: think carefully about how you choose the TTLs of your records, because you may wind up shooting yourself in the foot.
J
That covers complete failure, but there are also cases of what we call a partial DDoS: attacks that hit the authoritative name servers but are not strong enough to bring all of them down at the same time. That's exactly what happened to Dyn, and to the roots in November 2015 — some of the servers go down, some of them don't. So we also wanted to know how users experience the attack in this case. Now it's not doomsday anymore; it's a very realistic scenario.
J
So what do we do in this scenario? We set a TTL of 30 minutes on the records and we drop 50% of all the incoming queries at the authoritative name servers, emulating a DDoS that behaves like 50% packet loss. And it's very interesting: when we start the attack — where the arrow points down here — most users don't even notice; most users get an answer, and I think that's a very good result. It really shows the resilience of DNS.
J
The only thing users would notice is the RTT — the figure below shows the latency — and you can see it there: there's an increase in latency, so it takes longer to resolve the domain, but they are still getting answers. Now let's play a little more and make the case worse: same TTL, but dropping 90% of the traffic. Even with 90% packet loss and a 30-minute TTL, most clients — the blue area between the arrows — get an answer.
J
For me this is an example of good engineering. The people who had the vision to build this into DNS deserve kudos: 90% packet loss, and something like 60% of clients still get an answer — that's fantastic. And don't get me wrong: once somebody gets an answer, it goes into the cache of their recursive, so the next time someone comes back it can fetch the record from there. All these things — all the good practices — come together.
J
Now let's move to another case, in which we set the TTL to one minute. If you set a record's TTL to one minute, almost nobody should get an answer from the cache — very few, only those getting served stale — because once a resolver fetches the record and a client comes back ten minutes later, the record has already expired. So this is what you see with 90% packet loss and a one-minute TTL.
J
Again, there's a different impact on latency — it goes much higher, because resolvers have to do a lot of retries, which we analyze next — but the results are still very good, especially for operators. In the previous graphs, the success of DNS was due to retries: when the recursives don't get an answer from an authoritative, they switch from server to server and try again. We measured a bunch of things around this, and in this graph we show a time series.
J
This is measured at the authoritative side — the servers at Amazon that we control — and we show the number of queries during normal operations and during the 90% packet-drop emulation. We see an increase of up to eight times the normal traffic. And this is not the DDoS itself; this is friendly fire: the recursives are not going crazy, they are doing their job — they're trying to resolve a name, they cannot get an answer, they try again, they switch to another server.
J
They keep trying until they get an answer. So, for an operator: if you're under a DDoS, this means you're also going to get a lot of friendly fire — your own clients trying to resolve your names — so be aware of that.
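A caricature of why that friendly fire appears, under stated assumptions — real resolvers use per-server timeouts, backoff and server selection, while this toy client just cycles through the NS set until an answer survives the loss:

    import random

    def resolve_with_retries(servers, loss_rate, max_tries=12):
        """Caricature of a recursive retrying against lossy authoritatives.

        Returns how many queries were *sent* to obtain one answer; under heavy
        loss this is where the 'friendly fire' multiplier comes from.
        """
        sent = 0
        for attempt in range(max_tries):
            server = servers[attempt % len(servers)]   # rotate through the NS set
            sent += 1
            if random.random() > loss_rate:            # query survived the loss
                return sent
        return sent                                    # gave up (SERVFAIL)

    # With 90% loss, an answer costs about 7 queries on average here (capped
    # at 12 tries), so authoritative traffic inflates severalfold on top of
    # the attack traffic itself.
    queries = [resolve_with_retries(["ns1", "ns2"], 0.9) for _ in range(10_000)]
    print(sum(queries) / len(queries))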
J
And if you over-provision your DNS authoritatives for ten times your normal traffic, be aware that with 90% packet loss, eight out of nine queries hitting you are going to be just this friendly fire. So, implications: caching works, and retries work really well — kudos to the DNS community who built this, especially given that a provider that is partially up still mostly works; that's really great. And caches can last longer than the DDoS itself; for DNS operators, one implication is that a cache can keep you up through an attack.
J
Different implementations behave differently — what happens with one is better than with another — so you may want to look at that. The cool thing about this paper is that you can now explain why the outcomes of those particular attacks, the roots and Dyn, happened the way they did: the root zone has a very long TTL — one or two days, if I'm not mistaken — so it's going to stay in caches.
J
Sorry — two days. And a lot of people cap the TTL at one day, but that doesn't matter if the attack is shorter than that, and this one was. We can explain that now. So there's a clear trade-off between the TTL that an authoritative operator sets on a DNS record and DNS resilience — be careful with that. Many commercial websites especially have short TTLs, because with a short TTL, when you change something in DNS, it's very easy to propagate.
J
It's very quick — much quicker. And that explains the pain of Dyn's customers and the users' perception. Just to give an example: I had a discussion with some folks at Amazon, and if you use their recursives they cap all the answers at 60 seconds, so just be aware of that. On to the conclusions, then. This is the first work to evaluate DNS resilience to DDoS from the user's perspective, trying to figure out how the recursive layer works and how caching and retries behave under stress; we evaluated those design choices of the various vendors using measurements.
J
We actually have very little information about the recursive resolver layer — which software they are running — and we deliberately didn't care: we wanted to see it in the wild; that was the goal. So caching and retries are a very important part of DNS resilience. If you're under attack, people do filtering and scrubbing and a bunch of other things, but DNS already has a lot of good stuff built into it. Our experiments also show when caching and retries work and when they don't.
J
It's consistent with the recent incidents, and the DNS community should be aware of this trade-off. I also think — that's my personal opinion; I'm not sure where the community thinks this should evolve — that we should advocate for serve-stale deployment, because it's your last resort when things are really, really bad. There's a tech report, and here is my email. I would like to thank RIPE NCC for all the measurements;
J
they supported us all this time. And thanks to all these people — Wes, Duane, Warren, Stefan and Marco — for reviewing the paper. The title of the paper is "When the Dike Breaks". That's a dike close to where I live in Holland; the water level is higher than the land here, so if it breaks we're going to be swimming, and I hope that never happens.
L
One thing that I thought was interesting: when you simulated a partial authority failure, I think you did packet loss of either 50 or 90 percent on both authorities — is that correct? Yeah. So one thing that I think might be interesting for future work is a complete failure of a single authority while the other is 100 percent up, because, in theory, the recursives, as they go through the NS list, should penalize the non-responding authority, and I think it might be interesting to see how that actually performs in the real world.
J
I think I have something on that — we actually ran that as well. I can't give you the exact numbers right now, but I can tell you it's going to be just fine. I can't remember the results exactly, but we have an IMC paper from last year that shows that if one server doesn't answer, resolvers move to another one and get an answer. So all of them eventually get an answer.
L
The other question I had is: is there any opportunity for research between the stub and the recursive, if one of the recursives goes down? Because I think there's a lot of variability in what operating systems do when multiple recursives are configured — how they recover, etc. Is that something you've ever done any research on?
AE
From RIPE NCC: somewhere in early 2017 we added two built-in measurements to each probe — if you plug in a probe, it is already measuring the root servers in particular, every 10 minutes or so — with the intention of capturing the picture in case there's another attack against the root servers. I'm not looking forward to it, but if it happens again we will have some baseline data that is close enough to what users probably observed.
AF
Paul Hoffman. So, I love measurement — I love the stuff you do — but I didn't like your conclusion that we should start looking at serve-stale again. I don't think your conclusion actually matches your data, because of what you just said to the last person: if one of the authoritatives is up, you're going to be okay. So an assumption that many people might make is that the way for a zone's authority to do it right is not so much to worry about the TTL as, you know, to have enough authoritative servers. I mean, you definitely measured one knob well, but serve-stale deals with two knobs, and I don't think you can match them here. Again, your data and stuff is wonderful; I just don't like that conclusion about serve-stale.
J
Thanks. I mean, in the paper we don't really advocate strongly for serve-stale; that's my personal view, because I saw it work in our experiments. But I agree we have to look at that carefully — it was just the only hope for the people who didn't get an answer. But yeah, thanks.
Z
J
So, every answer that we get carries a counter that allows us to tell whether it was answered by the authoritative or from the caches, and we built a cache model in the paper: when we see an answer for the first time we start a counter — we start modeling a virtual cache in our analysis — and then we can later compute whether an answer was served by the cache and whether it should have been. So we have all the information we need to check the cache behaviour.
Z
Yeah — and also a very related question: depending on the type of recursive server, the deployment level of a feature is quite different — commercial resolver operators tend to enable a feature like this more readily — so I wonder what kind of resolvers you actually tested.
J
That's a good question. There are many vendors of DNS resolvers, and many different versions, so what we have in the end is a very heterogeneous population of resolvers, and the methodology we applied in this paper doesn't try to profile and identify them — that's too complicated. We look at them as a black box and analyze the behaviour as a whole. I'm not sure if there are any other studies that look into that specifically, in the lab.
J
You can do that in the lab — I can run BIND in a certain version and see what happens; you can actually do that — but we didn't, because our goal was to understand this in the wild, so you can actually generalize the results. There's a trade-off. Our question was from the point of view of an operator, who wants to know how it behaves in the wild. So it's a different thing, but yeah.
AA
J
What I can say is that the damage is going to be limited to the people who connect to those particular resolvers, so I would expect less damage than from reaching the authoritatives. Most of the attacks we've seen target the authoritative side — I mean, those get reported, because that's more where the information is located. So I wouldn't know exactly how to quantify it; I just know the impact will be smaller, limited to those clients — unless it's a big provider, like one of these big public services.
T
All right. I'm Wes Hardaker from the University of Southern California Information Sciences Institute. This is, I think, your fourth DNS talk of the day, so I'm going to be talking about measuring DNSSEC — specifically the KSK roll that's about to happen and some of the issues with it. So, first off, I'll give you a little bit of background about DNSSEC and how it works, just in case you don't know — hopefully you do — and then some problems that are coming up with the KSK rollover.
T
One of the things that we realized a couple of years ago — or I should say that the IETF realized a couple of years ago — was that we needed to know what resolvers were actually using as a trust anchor. If there are millions of DNS resolvers out there, it's very hard to know how many of them are (a) validating, and (b) using your current trust anchor, and how many of them have the new trust anchor deployed.
T
So it's very hard to determine the point at which it's okay to do a safe roll — the point when most of the resolvers out there are using your new key and it's now safe to switch. So RFC 8145 was written: it's called "Signaling Trust Anchor Knowledge in DNS Security Extensions", and it basically adds one extra query to resolvers. Any more recently deployed resolver running newer code should, when it asks for the DNSKEY of the root, also say: hey, here are the trust anchors I actually have configured.
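Concretely, the RFC 8145 key-tag signal is just a specially formed query name: "_ta-" followed by the hex-encoded tags of every configured trust anchor, sorted and dash-separated. A sketch, using the real tags of the 2010 and 2017 root KSKs (19036 and 20326):

    # Build the RFC 8145 "key tag" signal QNAME: "_ta-" followed by the
    # hex-encoded tags of every configured trust anchor, sorted, dash-separated.
    def keytag_signal_qname(key_tags):
        tags = "-".join(f"{tag:04x}" for tag in sorted(key_tags))
        return f"_ta-{tags}."

    print(keytag_signal_qname([19036]))         # _ta-4a5c.  (old 2010 KSK only)
    print(keytag_signal_qname([19036, 20326]))  # _ta-4a5c-4f66.  (old and new)

So a validator that trusts only the old key shows up at the root servers as queries for _ta-4a5c, which is exactly what the graphs that follow count.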
T
So, with respect to the current key roll: the first key used to sign the root was created in 2010, with the expectation that sometime after five years we could start rolling it. The second key, KSK-2017, was generated in October 2016, and it was put into production —
T
excuse me, it was put into publication: it first appeared in the root zone in July of 2017, last year, with the expectation that it would be put into operational use in October of last year. Then, just before that happened — on the order of two weeks before, on September 27th — ICANN wisely decided to stop the rollover, and that decision came from this whole measurement system that I just talked about, because there were some unknowns. So the next plan is that it will roll this October instead,
T
now that more people have had time to analyze the data — such as the data you're about to see. So this is the graph of the measurements of that RFC 8145 signaling that I talked about. If you look very carefully at the dates on the bottom, there's a gigantic uptick in black, which I'll get to in a minute — and black, in this graph, is the bad line.
T
The trend actually existed before that giant uptick as well — it has actually gotten worse since last September. But basically, in this graph, the black line — with the percentages on the right-hand side — is the number of resolvers that trust only the old key; they don't trust the new key. On the right-hand side, which is March of this year, 20% of the resolvers out in the world that were sending RFC 8145 signals did not trust the new key.
T
And the new key had been published at that point for nine months — yet they had still not picked it up. So I ran into this question of why. What's actually going on here? What is it that we can learn, what is it that we can measure? So I did a few things. I also wanted to know why so many new addresses were appearing, because it wasn't just the existing population.
T
The dataset I studied was carried from January to March of this year. I also looked at all the incoming requests to USC/ISI's B-root server, which totaled 2.8 terabytes of data — and that was just for March, so that was a lot of data. The first thing I had to do was reduce the problem space to something I could actually look at, say, over a weekend. So I did a couple of things.
T
One: if you look at the ICANN data, there were 1.2 million resolver addresses sending queries to all of the roots; within that, 500 thousand of them were sending signals for the old key. But what I found interesting is that 310 thousand of that 500 thousand were signaling only once — in other words, they sent only one signal in three months. That's just plain wacky, right? These are things that should be sending a signal, we'd think, once a day or so, if they were really resolvers on the internet
T
that were staying up. Clearly, one signal is just strange. So then I looked into the B-root data. Of the sources that were sending to B in March, there were 309 thousand; 113 thousand of those were sending only the old key, and 16 thousand were sending the old key just once, again.
T
So, to summarize this more clearly: there were 6,702 unique addresses that sent a single "I'm using the old key" query in the first quarter of 2018, where that signal went to B-root during March and the address sent only two to nine other requests. That's just a very strange set of data, so I looked into what would cause it. A quick graph — this is a CDF showing the number of queries sent.
T
Excuse me — the number of addresses sending a given number of queries, which is kind of a strange way to think about it: on the x-axis is a log scale of the number of queries sent, and on the y-axis is the number of addresses that sent that many in March. You'll notice that 63% of the sources sent two or fewer DNS queries total in an entire month. That makes no sense, right? If you like bar graphs better,
T
this is the same sort of graph, showing there's a huge number of hosts sending a very small number of signals — a very long tail, obviously. So I wondered: is there a commonality? Looking at all of these requests, I looked at all the rest of the qnames that were sent to USC/ISI and tried to see if there was a commonality.
T
I looked again at all the ones that sent only two to nine queries total, and what I found was that, of course, the highest number of queries sent were the signal for the old key — that strange, cryptic, hexadecimal _ta-4a5c. That's the query sent when you trust only the old key.
T
The second most popular was the root zone itself, and then the third and fourth most popular were a single domain — to the tune of three thousand queries — followed by four hundred queries for a VPN provider. So clearly I thought I had found, you know, a lead. At this point it was approaching midnight; I very quickly realized something was up and sent a note off to ICANN staff, who actually managed to get me contact information, and I woke up at 5:30 the next morning to immediately write them and say: hey,
T
what's up? So I also examined the VPN provider's software. I downloaded the Android version of it, because they have multiple versions of their software for different platforms, and I searched all of the files in the Android APK for the DS-record SHA-256 hash of the key, and sure enough, it turned up a root key file that contained only the old key and didn't contain the new key — success! It also contained the libunbound DNSSEC-validating resolver library. So I reached out to them, again thanks to ICANN's OCTO
T
staff for finding the contact information, and this vendor said: wow, you're right — it affects ten of our software packages — and they promised to release fixes in the next couple of months. So, a couple of notes here. One: this vendor actually did the right thing. They were using DNSSEC to actively verify that they weren't behind a paywall — to verify that they were actually going to send their VPN connections to the right place.
T
One very important aspect of this: that was one user behind one address. Other resolvers may have 300,000 users behind them, so even though the really dark bad line dropped by a huge amount, that's not necessarily reflective of the number of users I've helped, because here there happened to be a one-to-one mapping between IP addresses and users. That was the really hard part. So I looked into what other people have done.
T
A couple of other people, much more recently: Warren Kumari did a search for the old key in the GitHub interface and found that there were 2,069 references to the old key and only 412 to the new one. I did a Google search with sort of similar results: there are more references to the old key's SHA-256 hash than to the newer one's. And recently Roy Arends and ICANN's
T
OCTO team have looked into how serious some of these are, and found a lot of commonality — GitHub has forks and all that kind of stuff — and into how many of those code bases are new and being used. That analysis is not entirely finished yet, although he gave a presentation at IEPG, I think on Sunday, which I both missed and didn't copy slides from into this deck. So there's lots of stuff going on.
T
So let's talk about lessons learned — what can we learn about this in the first place? First off: flag days are hard. You'd think we would know that within the IETF by now, but within DNSSEC, because of packet size constraints, we can't do double signatures that easily. So ICANN's plan for rolling the key involved directly switching the key on a single day, not double-signing for a while to let old and new users, you know, migrate.
T
That's hard to do. And, more importantly for me: I tracked down only a small fraction of the actual traffic that was out there sending keys. In fact, B-root was still receiving twelve percent of its signaling traffic from addresses signaling only the old key, and tracking down misuse among over a million possible sources sending stuff to you is very, very hard — I only solved a very small piece of the pot, as I mentioned. But why is it that rolling trust anchors for DNSSEC is so hard?
T
Well, a couple of things. The RFC 8145 queries are only soft signals: they are decoupled from the query that looks for the actual key. The resolver might send a query saying "hey, I need your keys", and then, in a totally separate query that's not bound to it, send "I'm using this key" — and those two queries can go to different places, so you can't even correlate them. And there's no indication of intent; it may just be, hey,
T
I'm reporting what I have. So that's really what my recommendation is for the future: when you're thinking about designing internet protocols and things like that, include signaling within them — include this sort of intent mechanism — and then think about how software updates and configuration updates happen over time. It's not easy, and trust anchor keys are rather critical bootstrapping issues, so you really need to design for automatic updates from day one.
T
Unfortunately, the DNSSEC automatic trust anchor update mechanism, which is defined by RFC 5011, came after the rest of the DNSSEC system was designed, and bolting it on afterward is challenging because software updates slowly. If you don't have everything going out on day one, you end up with all these systems deployed that understand the cryptography but don't know how to update their base keys.
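For reference, a simplified sketch of the RFC 5011 idea just mentioned — a newly observed, validly signed key becomes trusted only after surviving a hold-down period (30 days in the RFC; revocation handling and retry timing are omitted here):

    import time

    ADD_HOLD_DOWN = 30 * 24 * 3600   # RFC 5011: new keys wait out 30 days

    class TrustAnchorStore:
        """Simplified RFC 5011 tracking: a newly seen, validly signed key
        becomes trusted only after being observed past the hold-down."""

        def __init__(self, initial_trusted):
            self.trusted = set(initial_trusted)
            self.pending = {}            # key_tag -> first time seen

        def observe_dnskey_rrset(self, key_tags, rrsig_valid):
            if not rrsig_valid:          # ignore unsigned/invalid RRsets
                return
            now = time.time()
            for tag in key_tags:
                if tag in self.trusted:
                    continue
                first_seen = self.pending.setdefault(tag, now)
                if now - first_seen >= ADD_HOLD_DOWN:
                    self.trusted.add(tag)    # hold-down passed: trust new key
                    del self.pending[tag]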
T
Then, also, choose update frequencies wisely. There are really two choices here — and this is sort of a hint for a future presentation to come, where I'm studying a bunch of systems.
T
You have a choice of doing it really frequently, so that everybody gets used to the fact that it's updating — Let's Encrypt, for example, makes you update the keys for your web servers on a three-month basis; you get used to it, and you either automate it or you fail. Or you do it rarely and expect that hard things are going to break, in which case you'd better use really strong, well-protected keys, and you'd better have widely overlapping signatures and things like that.
T
So as security is being used more and more in Internet and IETF protocols, you really need to think about what happens 10 or 20 years down the line, when people need to update their keys. Any questions? We last met in London, so this is a nice picture from London — although, ironically, not from the last IETF there but from the London before that. And Roy walks to the microphone.
AG
You mentioned flag days are hard, which is correct, but in the same context you mentioned doing double signatures, and let me say something: the problem is not re-signing with the new key; the problem is that we stop signing with the old key. Even if you do double signatures, at one point we need to stop signing with the old one. Okay, yes.
AG
You mentioned the GitHub research. There's an enormous amount of cruft — dead bodies — on GitHub; we've seen them. I can give you a nice example, part of the 2,000 that you saw: those are 2,000 files, not 2,000 repositories — both portions are a bit off. There are 1,100 repositories, but in the end there are 300 unique files on GitHub — files that are distinctly different from each other.
AG
Of those 300, in the short tail there is, as an example, a dnssec_chain_validation file which is in a very popular repository: Chromium. However, it was taken out of Chromium about six years ago — but these things get copied and forked and cloned on GitHub, so there's a lot of stuff like that. OpenWrt is another example. OpenWrt is what you see —
AG
you can download it, install it, and get a config overlay from GitHub, and you see a lot of old keys in there. What you also see a lot is that in all of them, DNSSEC is disabled by default. Now, of course, with people who install it and download the configuration, we don't know what they do after that: they might enable DNSSEC in the config, and then they might have a problem. But yeah — good work, and hopefully you find more stuff.
AH
Hello. Thank you for the presentation and for the measurement. I also have a feeling that maybe people are aware that ICANN delayed the KSK rollover but are not aware of the tons of work to be done behind it — so really, thank you for sharing that. And I've come up with two ideas.
AH
The first is: I heard that there is a working group, or research group, called SUIT — I just heard about it yesterday — and that is about updates for IoT devices. I think it's a similar problem: currently there's no update protocol that the software supports, to be compliant with a case like the rollover. So I'm thinking about whether there is any potential work to be done there — has any discussion already happened?
T
IoT devices do have many trust anchor bootstrapping issues, and there's the fact that users don't update firmware — for example, once they buy a little modem. If we don't get automatic updates turned on for those, how are they eventually going to get new HTTPS certificates, or whatever protocols they're making use of — management protocols or anything else that has to be dealt with? It's not just a DNS-specific problem, especially for those widely deployed, very small devices. Okay.
AH
The second idea is that I once proposed a small draft in the KSK rollover team about a comparison and analysis of why the rollover is so difficult, comparing the transition process to the IPv6 transition and the HTTPS transition. And I found that there's no such transition period for the KSK rollover: we just have to choose a flag day and stop signing, and that's the trouble. That, I think, in my mind, is the key difference between these different technology changes.
AH
So I proposed a very intuitive idea — and I think maybe after the KSK rollover is performed this year, in October, there is more challenging work ahead when people think about rolling everything, because the capability to roll the algorithms may be even more difficult than rolling the key. So I'm thinking about whether some dual-stack, or backward-compatible, way of doing key rolls or algorithm rolls can be developed later, and I'd ask if you have any thoughts.
T
That would allow you to have a longer transition where you publish the newer key — which we've already done with this roll. And I would argue that the biggest lesson learned, for me, was that the three-month window last year, from July to October, was too short: we really needed a longer window, both for measurement and analysis and to make sure the validators out in the world really did get updated — some of them are on slow rollout paths. Now it's been a year and a month.
T
I don't know — it would have been nice if we could double-sign for a year, so that the rollout could continue and yet we could use the new key. There is some thinking that needs to be done after the key roll really finishes next year, when the old key is actually removed and signed as revoked. But thanks for your question. Okay.
A
Right — okay, thank you very much. That's the end of the session. I forgot to mention that Magnus was so kind as to scribe today, and there was a little bit more discussion than usual, so it was actually an effort — thank you so much for having the discussion here. That was great, and see you next time.