From YouTube: IRTF Open session
Description
The Internet Research Task Force (IRTF) Open session at IETF 104 will be held from 12:50 to 14:50 UTC on 25 March 2019. The session includes Applied Network Research Prize presentations.
A: We didn't mean to put up "ARTF" — we really are still the Internet Research Task Force — and with any luck I have now updated the slides, so welcome. I am not going to show you the Note Well slide; I know you've memorized it, so we're not going to show it to you, but there will be a test at the end of the session. And just so you know, the IRTF abides by the IETF Note Well, so if you've seen it multiple times, you've seen it for us as well; we just changed it so it says IRTF.
There are normally two Applied Network Research Prize talks, but one of our presenters was unable to attend last time, and these are going to be terrific talks about large-scale, hard problems. So we'll start out with Brandon Schlinker's talk. Brandon is with Facebook and the University of Southern California, and his award paper is called "Engineering Egress with Edge Fabric: Steering Oceans of Content to the World". Take it away, Brandon.
B: Thank you for the introduction. Good afternoon, everyone. My name is Brandon Schlinker, from the University of Southern California. Today I'm going to be talking about Edge Fabric, a system we built at Facebook to deliver traffic to end users around the world. So let's start off with a brief overview of Facebook's network. Facebook has dozens of points of presence around the world and interconnects with thousands of networks.
Next, we use BGP, the Border Gateway Protocol, to exchange reachability information with those networks. So in this example, from the end-user ISP we receive routes to their end users across the interconnection that we've established with them, and we also receive a route from that tier-1 transit provider.
So what are the challenges to using all this rich interconnectivity? Well, our key objective here is to deliver traffic with the best performance possible, but the challenge in doing that is that BGP doesn't consider demand, capacity, or performance in its decision process. So let's take a look at what problems that creates. We have here a simple example: Facebook on the left is trying to deliver five gigabits per second of traffic to the end users in the ISP on the right.
Now, our router is configured to use those short, direct paths that we prefer, and so, as a result, it puts all of that load onto that upper path, and everything's fine — until later on in the day. Now demand has risen, we're at 12 gigabits per second of demand, and again, BGP at that router can't adapt to demand or capacity in real time. It's simply not possible to express that with BGP's policy terms.
Likewise, BGP doesn't consider performance in its decision process. A simple example of that can be seen here: that upper, preferred route now has a circuitous route on it, so it has added 50 milliseconds of latency, and also some piece of equipment downstream is malfunctioning, adding loss. So in this scenario, the second route, through that transit provider, would actually be preferred.
Now, despite all these problems with BGP and how it doesn't account for capacity or performance, it's still fundamental to interconnection, and it's not going away anytime soon. The thousands of networks that Facebook and other large content providers connect with all expect us to use the BGP protocol.
So I've briefly gone over Facebook's network and given an overview of the challenges. Next, I'm going to dive deeper into our connectivity and the challenges, I'm going to talk about how we sidestep BGP's limitations with Edge Fabric, I'll then talk about Edge Fabric's behavior in production, and finally I'll talk about the evolution of Edge Fabric and some ongoing work.
So, back to those points of presence that we have around the world: at each of those, we have three types of connectivity. First, we have transit providers, and transit providers can deliver traffic to the entire internet. At each PoP we typically have two or more of these for redundancy, and we connect with them through a private circuit, sometimes known as a private network interconnection.
Then we have peers, and we separate peers into two different categories; I'm going to go into detail on why we do that a little later. But in general we have private peers, of which there are on the order of tens per PoP, and again we connect with them through circuits; and we have IXP or public peers, which we interconnect with via internet exchange points. Those are on the order of hundreds per PoP, and we interconnect with them through a shared fabric, which means we don't have a direct circuit between our routers and theirs.
So how do we prefer across these different routes — what is our router configured to do? In general, we apply this very simple policy: we prefer routes from private peers over internet exchange point peers over transit providers. We prefer peers over transit because peers provide a short, direct path to end users, and we prefer private peers over internet exchange point peers because those circuits have dedicated capacity between Facebook and the peer.
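As a minimal sketch (not Facebook's actual router configuration), a preference like that can be modeled by mapping each peer type to a BGP local-preference value and picking the route with the highest value; the peer-type names and numeric values below are assumptions for illustration only.

    # Sketch: prefer private peers over IXP peers over transit via local-preference.
    # The values and peer types are illustrative, not a real configuration.
    LOCAL_PREF = {"private_peer": 300, "ixp_peer": 200, "transit": 100}

    def best_route(routes):
        # Highest local-pref wins; ties fall back to shortest AS path,
        # mimicking part of the BGP decision process.
        return max(routes, key=lambda r: (LOCAL_PREF[r["peer_type"]], -len(r["as_path"])))

    routes = [
        {"peer_type": "transit", "as_path": [64496, 64511], "nexthop": "transit-nh"},
        {"peer_type": "private_peer", "as_path": [64511], "nexthop": "private-nh"},
        {"peer_type": "ixp_peer", "as_path": [64511], "nexthop": "ixp-nh"},
    ]
    print(best_route(routes)["nexthop"])  # -> "private-nh"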
Let's take a look at the ratio of a circuit's peak demand to its capacity, for circuits where we predicted that demand was going to be greater than capacity at least once. What I have here on the y-axis is a CDF of circuits where demand exceeded capacity, and on the x-axis is their peak demand relative to their capacity. So a peak demand here of two indicates a circuit had twice as much demand as its actual capacity.
Second, we wanted ease of deployment, which means we wanted to interoperate with our existing infrastructure and tooling. We have BGP routers at the edge of our network, like most network operators do, and we already have existing tooling for interacting with BGP, so we wanted a system that could interact with that existing infrastructure.
On the right-hand side I have the other extreme, which is host-based routing. That's where each host makes a decision on what route each packet is going to take and then uses some signaling method, such as MPLS or GRE, to signal to the routers at the edge of the network how to handle that packet. Edge Fabric's approach is balanced between these two extremes.
We have a controller that overrides BGP decisions at the router, and our hosts provide hints on packet priority but don't precisely specify how the packet should egress our network. So what does this approach look like? Well, first, routers at the edge of our network keep selecting routes like they do today using BGP. We still have all of our BGP sessions with other networks terminated at those routers. So in this case our router, based on all the information it has received, has selected route A.
Edge Fabric also selects ideal routes, but in addition to all that BGP routing information, it also has access to other inputs. In this case, that means advanced policy information — for instance, preferences we configure based on business reasons or reasons provided to us by a peer — as well as per-prefix traffic rates, circuit capacities, and route performance measurements. Edge Fabric takes all that additional input and also makes a decision, and in this case it has decided to use route B.
So Edge Fabric can perform two types of overrides. It can override the BGP decision in order to move traffic for a set of end users: for instance, we can say, on a per-destination basis, override what BGP would typically do — which is perhaps to send that traffic via a peering link — and instead send it via a transit link.
So let's take a look at how all of this comes together to prevent congestion in our network. We're going back to that example I showed earlier, where we have Facebook on the left trying to deliver 12 gigabits per second of traffic to this ISP on the right, and BGP by default is going to put all of that traffic onto that upper link, because we always prefer those short, direct paths from peers. As a result, that link is going to become overloaded.
So what Edge Fabric does is it understands that this 12 gigabits per second of demand is actually composed of two prefixes, and in this case it understands that if it shifts one of these prefixes away, shifting that traffic to an alternate link — in this case the path via the transit provider — it's going to prevent congestion on the peering link without causing congestion anywhere else.
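The shift just described can be sketched as a simple greedy loop: given per-prefix demand on the preferred link, detour prefixes onto an alternate route until the projected load fits under a target utilization. This is only a minimal illustration of the idea, not Edge Fabric's actual algorithm; the 95% target, the largest-first order, and the data structures are assumptions.

    # Sketch: detour just enough prefixes from an overloaded preferred link.
    def plan_overrides(demand_gbps, capacity_gbps, target_util=0.95):
        # demand_gbps: {prefix: Gbps} currently mapped to the preferred link.
        # Returns the prefixes to move to the alternate route.
        limit = capacity_gbps * target_util
        total = sum(demand_gbps.values())
        overrides = []
        # Move the largest contributors first until the remaining load fits.
        for prefix, rate in sorted(demand_gbps.items(), key=lambda kv: -kv[1]):
            if total <= limit:
                break
            overrides.append(prefix)
            total -= rate
        return overrides

    demand = {"203.0.113.0/24": 7.0, "198.51.100.0/24": 5.0}   # illustrative rates
    print(plan_overrides(demand, capacity_gbps=10))  # -> ['203.0.113.0/24']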
So how does this work at the BGP level? Well, we take that transit route that we've selected and we inject it via BGP, and BGP at all of our routers is configured to prefer routes from Edge Fabric. We do that by configuring local-pref on the BGP sessions from Edge Fabric such that the local-pref of its routes is always the highest and therefore most preferred.
So Edge Fabric monitors BGP decisions and overrides them as needed to prevent congestion in our network. Edge Fabric is able to support a variety of traffic engineering policies, because it operates over a variety of inputs and it can perform overrides at a variety of granularities, and, more importantly, it's compatible with our existing BGP infrastructure. What we've truly achieved with Edge Fabric is centralized control over the traditionally distributed BGP decision process.
Going back to those design priorities I introduced earlier: Edge Fabric meets our goal of operational simplicity because we can always fall back to BGP at the routers if Edge Fabric fails. It allows operators to continue to use our existing tools because routes are injected into those routers via BGP, and synchronization is only required between Edge Fabric and the routers.
Likewise, if I move a significant amount of traffic and now I'm at fifty percent utilization, I'm getting poor utilization of those short, direct links and I'm not making good use of my capacity. So in general what we strive for, based on operational experience, is achieving 95 percent utilization, and this allows us to have high utilization with tolerance for bursty traffic. Now, the key question here is: can we maintain that utilization without any packet loss?
What we did here is we measured across our network during that two-day measurement period, and what we found is that when Edge Fabric is shifting traffic away — meaning that it believes a link would be overloaded if it didn't intervene — 99.99 percent of the time there were no packet drops on that link.
Anything to the left means that the utilization is lower; anything to the right means that it's higher and we end up with potential loss during bursts. So what we find here is that the vast majority of the time we're able to keep the utilization of these interfaces, or these circuits, within 2 percent of that threshold.
So I talked earlier about those two extremes of how you can have routing decisions made at the edge of your network, at routers, or you can have routing decisions made at your hosts. When we actually started off with Edge Fabric, we were using the other extreme: routing decisions made at our hosts. That's called host-based routing. In this model, what Edge Fabric would do is inject its decisions directly into our servers, and then our servers would use MPLS, DSCP, or GRE, depending on the generation of Edge Fabric.
This was to signal to the routers at the edge of our network: send this packet through circuit X. Now, a key challenge there is synchronization: you have to keep routing state maintained across all of your hosts, and if, let's say, circuit X disappears, my servers need to know that it's no longer a valid option for them to route traffic via. In comparison, what we do today — this edge-based routing approach I described — has Edge Fabric inject its decisions into routers at the edge of our network, and overrides are enacted by those routers.
Hosts don't signal the precise path that they want their packet to take. Instead, they just signal to the router information about that packet's traffic class, such as: this is a video packet. This means we don't have any host synchronization, which, in our network, drastically reduces the complexity of a system like Edge Fabric. Further, we have flexibility with DSCP signaling, because we can account for different classes of traffic, and we can always fall back to BGP at our edge routers.
The next thing I want to briefly go over is congestion beyond the edge of our network, and for this example I'm going to talk about internet exchange points. Internet exchange points allow networks to interconnect through a shared switch. So in this case Facebook and another content provider may both connect to this big IXP shared switch, and downstream end-user networks may connect as well. Internet exchange points are often seen as removing barriers to interconnection.
I don't have to provision cross-connects between me and all these other networks I want to interconnect with, but they also create a key challenge, and to see why, let's take a look at this example. In this case, both Facebook and this other content provider have hundreds of gigabits per second of capacity to this internet exchange point, and Facebook wants to send 8 gigabits per second of traffic to those end users while the other content provider wants to send 6 gigabits per second. Now, the problem here is that this end-user ISP...
...they only have 10 gigabits per second of capacity. As a result, we end up with the same problem that I illustrated earlier: demand here is greater than the available capacity, so we end up with congestion and packet loss. Now, the key problem here is that these networks on the left — Facebook and this other content provider — have no visibility past their network edge. They have no understanding of what that other network's circuit capacity is downstream, and even if they did, they can't see each other's traffic.
So what can we do to identify congestion beyond the edge of our network? Well, we've looked at a few different signals. Before, we were looking at per-prefix traffic rates, so I could figure out how much of Facebook's traffic is going to go onto a circuit. That doesn't work here, because cross traffic beyond our edge, from other content providers, is being mixed in and we don't know how much traffic they have. Circuit capacities?
Oftentimes you aren't going to know, downstream, how much capacity my transit has with the end-user network — I have no idea. And what that means is you have to instead use route performance measurements: you have to infer congestion from these performance measurements. But that can be particularly challenging, because you can see things such as latency increases and you aren't sure whether that's due to a path change, or a change in client population, or due to actual congestion.
Likewise, you don't know how much traffic to shift. You have to continuously probe for capacity, as downstream a failure may occur, reduce capacity for 20 minutes, and then be resolved, so it requires a trial-and-error discovery process. Likewise, those interactions with other networks also create complexity.
They may also respond to congestion signals and thereby reduce the amount of traffic they're putting on those links, and you may increase your traffic, and you may oscillate together. So it's very difficult to get a signal here as to how much traffic I should put on this link. Even if you know the current status, congested or not, that doesn't mean that five minutes from now it's going to be in that same status. So, stepping back from all of this: what's really new here? These problems in general have been known for quite some time.
In this case, what we actually have at each router is multiple routing instances, and the DSCP-marked packets arrive at each instance based on the DSCP value. So, for instance, DSCP value 50 will arrive at routing instance 50, and we inject routes into each of those instances. If there's no route injected, the router will fall back to the default routing instance. So this allows us to customize, on a per-destination, per-traffic-class basis, whether or not we're going to override the route.
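As a rough illustration of that per-class override mechanism, the sketch below models a router holding one routing table per DSCP value plus a default table, falling back to the default when no override route was injected for a destination. The table contents and DSCP values are assumptions, not the actual router configuration.

    # Sketch: per-DSCP routing instances with fallback to the default instance.
    import ipaddress

    default_table = {"203.0.113.0/24": "peer_link"}
    # Routes injected by the controller, keyed by DSCP value (illustrative).
    override_tables = {50: {"203.0.113.0/24": "transit_link"}}

    def lookup(dscp, dst_ip):
        dst = ipaddress.ip_address(dst_ip)
        for table in (override_tables.get(dscp, {}), default_table):
            for prefix, nexthop in table.items():
                if dst in ipaddress.ip_network(prefix):
                    return nexthop
        return None  # no route at all

    print(lookup(50, "203.0.113.7"))  # overridden -> "transit_link"
    print(lookup(0, "203.0.113.7"))   # no override for DSCP 0 -> "peer_link"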
D: Aaron Falk. So it seems like one of the effects of this mechanism is that it increases the dynamics of the route changes that packets experience, and I'm wondering if you've looked at the impact that this has on individual flows. I mean, my experience with Facebook is that most objects are pretty small, but it's unlikely that both paths are going to have the same latency, and so for a particular flow...
B: So the way the decision process works today is that we're going to continue to select the same routes, or the same destinations, to shift as the volume increases. So let's say I'm 100 megabits per second over my capacity: I'll choose X to shift. Now I'm 200 megabits per second over: I choose X and Y. What that means is that once we've shifted something over, we're likely to continue to shift it. It's not always the case that we will — there is some amount of optimization there.
E: Hi Brandon, I'm Dave Plonka. Neat idea about injecting the BGP prefixes, and I guess the failure mode, then, is that if Edge Fabric doesn't work, it falls back to BGP. I was wondering: you gave an example where you showed two non-adjacent v4 prefixes aggregating to more than the bandwidth on the 10 gig, and you selectively chose one to offload, say, two and a half gigs of traffic or something, right? Where did you get those prefixes from? Do you synthesize them from what you know is downstream, or are they pre-configured?
B: The general aggregation here is: we get samples from IPFIX or sFlow, we aggregate them up to the most specific prefix advertised via BGP, and then we break those prefixes apart again, further. So let's say I have a /20 which is one gigabit per second of traffic: we'll break that /20 up into smaller prefixes, /21 or /22, until we get down to a certain granularity.
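That splitting step can be illustrated with Python's ipaddress module: take a covering prefix and split it into longer prefixes down to a target length. How Edge Fabric apportions the measured traffic rate across the resulting sub-prefixes is not shown here; the prefix and lengths are just examples.

    # Sketch: break a covering prefix into more-specific prefixes of a target length.
    import ipaddress

    def split_prefix(prefix, target_len):
        net = ipaddress.ip_network(prefix)
        if net.prefixlen >= target_len:
            return [net]
        return list(net.subnets(new_prefix=target_len))

    print(split_prefix("198.51.100.0/22", 24))  # four /24s covering the /22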
B: The decision processes are independent. We actually prefer to move v4 before v6, and that's because with v6 we've seen cases where you shift it to a different route, that route is actually black-holing the traffic, and then you end up oscillating, because you shift the prefix away and back each time. This is likely just because of v6 routes being less well groomed than v4.
F: Yes, my name is Stuart Cheshire, from Apple. Throughout the presentation you talked about demand as being a fixed thing, like we have 12 megabits of demand, or 12 gigabits of demand, going into a 10-gigabit pipe. But all the transport protocols I know, like TCP and QUIC, adapt their throughput, and if you send a sustained 12 gigabits into a 10-gig pipe and lose 20%, it's not going to continue losing 20%: the senders are going to slow down their rate.
So I didn't understand why the normal congestion control algorithms that adjust rate did not slow down when they're too fast and, conversely, speed up when they're too slow. If there's excess capacity, TCP will speed up until it uses all the capacity, because there's no such thing, when I'm looking at Facebook, as loading a picture too fast, right? I want it to load as fast as it can, which should be all the capacity that's available.
B: To be clear here, when I say 12 gigabits per second of demand, that's what our controller sees as the demand to that prefix at that moment in time. The reason that it can be greater than that link's capacity is likely because we shifted traffic away from that link on a previous iteration. So let's say I have a single prefix that, if I was to send it all through this link, would congest it — I would have shifted it away on a previous iteration.
G: It's not a question, it's more of a remark. You are struggling with the good old problem of a congested link and information about congestion, and you stopped just one step before re-inventing frame relay and its means of signaling congestion. I hope a result of your ongoing work will be that you propose something like BECN for BGP. Thank you.
H: This might be in the paper — I have read it — but I think you implied that when you are looking for congestion off-net, through exchange points or in remote networks, there is the act of probing to look for congestion conditions. So I was wondering if you'd consider pulling those kinds of insights directly from TCP, where you already kind of have, in passive observation, some indication of whether transport protocols are being throttled even before packet loss exists.
I: The primary and first control that you have for directing your traffic seems not to be mentioned — well, okay, not explained — and the first thing that you, I guess, are doing is the server selection: deciding to which of your server clusters, at which location, you direct the queries of these customers.
I guess, essentially, the predictions of how much traffic will be generated this way from each of the server clusters go into Edge Fabric as the estimation of the required, or of the generated, demand for the traffic volume. But I wonder: is there no feedback that actually feeds back?
B: To be clear, there are two controllers here; I don't talk about the other one. There is a global controller, which decides which point of presence around the world an end user's traffic will be sent to, and then there's this local controller, which at each point of presence decides how we're going to egress that traffic.
Those two systems do have some cohesion between them, and the interactions that you describe do exist. In terms of how we decide what the demand is at each point of presence: that's not based on the global load balancer, that's based on IPFIX or sFlow measurements at that local PoP. So that allows us to get, in near real time, every 30 seconds, exactly how much load there is right now at that location.
J: So we were looking at BGP data that we collected, and if you follow this development, you see a large increase in BGP communities being used over the last eight years: we have seen an increase of more than 296 percent in BGP communities being used, that is, in individual values in BGP communities. And I looked it up yesterday: it has further increased, so last year around five thousand ASes were using BGP communities and now it's up to ten thousand, and we see seventy-four thousand individual values for short communities.
So for me as a researcher, this means I should probably take a look at what's actually happening there. What are we talking about? We are talking about the short BGP communities. As you probably know, they're defined in RFC 1997: they are a 32-bit value, usually split in half, with the first 16 bits being an AS number and the latter 16 bits being a value, where each AS agrees with its peers upon values — what they should mean and what they are being used for. So there are no strict semantics.
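For reference, a classic community is a single 32-bit value conventionally written and interpreted as ASN:value (16 bits each), while a large community is three 32-bit fields. A minimal sketch of that encoding, with illustrative values:

    # Sketch: encode/decode classic (RFC 1997) and large (RFC 8092) communities.
    def decode_classic(value):
        # Split a 32-bit community into the conventional ASN:value halves.
        return value >> 16, value & 0xFFFF

    def encode_classic(asn, value):
        assert asn < 2**16 and value < 2**16  # classic communities cannot carry 4-byte ASNs
        return (asn << 16) | value

    print(decode_classic(encode_classic(64500, 666)))  # -> (64500, 666)

    # A large community is simply three 32-bit fields:
    # (global administrator = 4-byte ASN, local data part 1, local data part 2).
    large = (4200000000, 1, 2)
    print(":".join(str(field) for field in large))     # -> "4200000000:1:2"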
Peers have to agree upon it among themselves, and, as you have noticed, it's only 16 bits, and we now have AS numbers which are larger than 16 bits. Finally, we got large communities, defined in RFC 8092, and they are 12 bytes. You have three fields, each with significant space to use, so 4-byte-ASN ASes can actually use communities now. Here the first four bytes are defined to be the global administrator.
Besides the confusion of the naming — whether it's "long" or "large" communities — we spotted other problems when we tried to do our measurements. Large communities were not really used in 2018; we only found fifty-one global administrators actually using them, so nothing we could actually measure at internet scale. This has become better, and if you're interested in the uptake of large communities, Emile from RIPE has published an article where he looked into the development of large communities and their uptake.
So now we have around 120 global administrators that are using large communities. But how are they being used at all? In general, communities can be split into two groups. We have informational communities that have passive semantics; they are used for location tagging — where was this prefix learned, in which PoP — and RTT tagging we have seen. And on the other side we have action communities that carry active semantics; they are used for triggering blackholing, or actions in other ASes, for example path prepending.
The problem here is that without documentation of these values you cannot see whether the semantics of a community are active or passive, because, as already mentioned, the peers decide themselves what these community values mean, and there is no bit indicating whether it's an informational or an action community. And this leads to several sorts of problems.
Communities are transitive optional attributes, so they should be forwarded to your peers. RFC 7454 says you should scrub communities you are using inside your network, so you cannot be manipulated from outside, but forward other ASes' communities. So it should be expected that they actually propagate through the internet, but still a lot of people do not expect this, and a lot of transit providers don't actually forward them.
We only found 14% of transit providers propagating received communities, and yes, this value seems small, but the internet graph, or the AS graph, is highly connected, so you actually end up with communities traveling quite a lot. Still, many people do not expect them to be propagated widely, and the problem is that this leads to some potential for misuse, as they propagate through the internet and can trigger actions multiple hops away, and there is no way for an operator to find out whether this is intended or not.
This leads to a problem: you cannot say, well, this is traffic management and this is legitimate, or this is an attack. And we asked ourselves whether there are also unintended consequences in this combination of BGP communities being transitive, forwarded, and used for actually changing routing decisions — and our assessment in the end is yes, there is a high risk of attacks, and we already see some attacks as well.
So what we were looking at: of course, we took all of the publicly available BGP data we could find, and in the end we find that 75% of the BGP announcements that we looked at have at least one BGP community set. In 2018 it was five to six thousand ASes, and now it's more than ten thousand ASes, that make use of these short communities. Now, taking a step back and looking at the propagation again: what can we actually measure, and what can we not measure?
Finally it reaches AS4, so AS4 is recording this community in its routing decision, in its RIB. So AS2 has added this informational community. Now, AS2 is also adding a community for signaling, or triggering an action, in AS3, its upstream, and this is also forwarded to AS4. So both of these communities are now present, or visible, in AS4, but AS4 cannot know who actually added these communities — and neither can we, but we needed this for our measurements.
So we had to come up with a solution. We can only infer which AS is adding a specific community by assuming that the AS number present in the community is actually the AS adding the community; that way we get a lower bound on the travel distance, or on the AS hop count.
For the community added by AS2 this gives the correct travel distance of 2 AS hops, and for the other community, 3:123, a wrongly assumed travel distance of 1, because it gets attributed to AS3, which is just one AS hop away, although correctly it would be 2. But for us this lower bound on the distance is sufficient for our work.
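That inference can be sketched as follows: for each community seen at a collector, assume the AS named in its first field attached it, find that AS on the recorded AS path, and count the hops back towards the collector; if the AS is not on the path, the community is "off-path" and no on-path distance exists. This is a simplified re-implementation of the idea, not the authors' code, and the path and community values are illustrative.

    # Sketch: lower-bound AS-hop distance a community travelled, assuming the AS
    # named in its first field is the AS that added it.
    def community_distance(as_path, community):
        # as_path: ASNs as recorded at the collector, nearest AS first.
        # community: "asn:value" string. Returns hop count, or None if off-path.
        asn = int(community.split(":")[0])
        if asn not in as_path:
            return None            # off-path: the named AS is not on the AS path
        return as_path.index(asn)  # hops from the collector-facing AS to that AS

    path = [4, 3, 2, 1]            # collector peers with AS4; origin is AS1
    print(community_distance(path, "2:23"))    # -> 2 (added by AS2: correct distance)
    print(community_distance(path, "3:123"))   # -> 1 (lower bound; actually added by AS2)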
So if we plot these values — which, again, are a lower bound on the travel distance — we end up with this CDF. On the x-axis you see the AS hop count, and we find that 10% of communities have an AS hop count of more than 6, so they traverse more than 6 different ASes from where we assumed them to have been added, and more than 50% of communities still traverse more than 4 ASes.
And if you compare this with the mean length of AS paths that we have observed, which is around 4.5 or 4.7, this actually means they travel almost through the whole internet, and the longest community propagation we have observed was 11 AS hops — so they do propagate through the internet. Now, looking at another very complex AS topology: AS1, again announcing a prefix to AS2 and adding a community 3:123 to inform AS3, or to have it execute path prepending there.
You will notice that this community value is also propagated to AS4 again, and although it's only intended for signaling something towards AS3, AS4 is also receiving an announcement with this community. So we end up with two different AS paths, and in the first case, for our research, we call this community "on-path", because the AS value from the community is present on the AS path that we record in AS3.
In AS4 we call this community "off-path", because the AS number 3 is not present on the AS path. It could also be that the AS being signaled is further hops away, behind AS4, but in both cases this would be called off-path, because the AS number is not present on the recorded AS path. And if we now take the right part of these community values, separate them by on-path and off-path, and plot them, we end up with this distribution.
On the left side, in the off-path communities, you see quite a number of community values that are related to blackholing — remote-triggered blackholing — and on the right-hand side you see very even numbers that look operator-assigned and easy to remember. We think this comes from the fact that ASes that are not implementing blackholing will just forward blackholing communities, compared to ASes that do blackholing, which follow the blackholing RFC and do not further propagate these communities.
Now, coming to the experiments that we did to show that there actually are some problems out there on the internet: all of the experiments were done first in a lab environment and then validated on the internet, with, of course, operator consent. I will show two different scenarios in this talk; there are more in the paper, and the configurations of our routers are publicly available. So first, going back again and giving an intro: how is remote-triggered blackholing supposed to work? So, AS1 is announcing...
...the prefix to its upstream AS2 and then receiving traffic. This is expected behavior. Sometimes you have the problem that you receive more traffic than you actually want to attract; we call this a denial-of-service attack, and one mitigation is AS1 signaling to AS2 that it wants to blackhole a prefix. Usually this is done in band, in the same BGP peering session in which the normal BGP announcements are being sent, but there are also cases where it's a special BGP session, which has other problems, but not the ones mentioned here.
So AS1 announces the prefix P tagged with the blackholing community to signal to AS2 that it should drop traffic. AS2 is, of course, still announcing the prefix P to all its peers, but without the blackholing community. Now what happens is that AS2 is dropping traffic towards P at its border routers, and the link between AS1 and AS2 is relieved of the attack traffic and is usually usable again.
So you sacrifice parts of your network, or parts of the prefix's IP addresses, to still keep all of the other prefixes and servers reachable. This is how it should work. What we noticed is that for this to be used securely, you need to employ some safeguards. Of course, the provider that is offering blackholing has to check whether the customer is actually allowed to blackhole these prefixes.
So whether these prefixes are owned by the customer, or the customer has permission to blackhole them. This leads to the fact that you need different policies for customers and peers, different access control lists, and it leads to a lot of configuration overhead for a secured usage of remote-triggered blackholing. And, of course, when receiving such communities you have to add NO_ADVERTISE or NO_EXPORT to the announcement so you don't propagate it further. We also noticed some providers translating blackholing to the blackholing communities of their other upstreams.
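The safeguards described above can be sketched as an acceptance check: only honor a blackhole request if the announcing customer is authorized for that prefix, and if the announcement is a sufficiently specific route (blackhole announcements are typically host routes up to a /32) covered by one of the customer's allowed prefixes. This is a simplified illustration with assumed data structures, not a real router policy; the "65535:666" value is the well-known BLACKHOLE community from RFC 7999.

    # Sketch: safeguard checks before honoring a remote-triggered blackhole request.
    import ipaddress

    BLACKHOLE_COMMUNITY = "65535:666"           # well-known BLACKHOLE community (RFC 7999)
    customer_prefixes = {                        # prefixes each customer AS may blackhole
        64500: [ipaddress.ip_network("203.0.113.0/24")],
    }

    def accept_blackhole(customer_asn, announced_prefix, communities):
        if BLACKHOLE_COMMUNITY not in communities:
            return False
        net = ipaddress.ip_network(announced_prefix)
        allowed = customer_prefixes.get(customer_asn, [])
        # The blackholed prefix is typically a more-specific, up to a /32,
        # so check coverage rather than an exact match.
        return any(net.subnet_of(parent) for parent in allowed)

    print(accept_blackhole(64500, "203.0.113.7/32", ["65535:666"]))  # True
    print(accept_blackhole(64511, "203.0.113.7/32", ["65535:666"]))  # False: not authorized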
So you were not even able to do selective blackholing, because they were translating it and announcing it to their peers as well, translating the actual value. Now, what should not be possible is depicted here. We have the same topology, but now AS2 is in the role of an attacker, and AS2 should just be a backup path to the prefix of AS1.
But AS2 is able to actually add the blackholing community, although it's not on the best path. So AS2 announces to AS3 that prefix P should be blackholed, and we noticed AS3 is actually doing that, although the best path is through AS1, and AS1, as the origin for P, is not actually requesting any blackholing. And the other problem that we noticed is that this is even possible if AS2 is not involved in any connection to AS1 at all.
So AS2 can just hijack the prefix P and announce the prefix P with the blackholing community set, and we noticed that in some cases we are able to circumvent ACLs and prefix filter lists, because the blackholing community is checked before any prefix filter lists are applied. We were able to confirm this on the internet; it works multi-hop, and it's hard to spot, because community values are usually unmonitored.
Reasons for that that we found: the blackholed prefix is a more-specific, so you need exception rules in your configuration to accept up to a /32 — essentially everything smaller than a /24 — and some providers check the blackholing community before applying any prefix filters. We even found some configuration guides on the internet which had this problem in the example configuration provided. And the problem here is that there is no validation of the origin of the community: every AS on the path can add a blackholing community for the upstream provider.
Now, yesterday Job Snijders gave a talk at the IEPG where he presented a mitigation for this: if you check that the peer that is announcing the prefix with the blackholing community is on the best path, and only then accept the blackholing, that is one possible mitigation for this attack. If you are interested in that, you should check the recordings of that talk. So you only accept a blackholing if the peer announcing the blackholing for a prefix is on your current best path to that prefix.
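A minimal sketch of that mitigation, under assumed data structures: before acting on a blackhole community received from a peer, check that this peer is the neighbor on your current best path toward the prefix being blackholed.

    # Sketch: only honor a blackhole community if the announcing peer is on the
    # current best path toward the covering prefix (mitigation discussed above).
    import ipaddress

    best_paths = {  # covering prefix -> ASN of the neighbor on the current best path
        ipaddress.ip_network("203.0.113.0/24"): 64500,
    }

    def blackhole_allowed(announcing_peer_asn, blackholed_prefix):
        target = ipaddress.ip_network(blackholed_prefix)
        for prefix, best_neighbor in best_paths.items():
            if target.subnet_of(prefix):
                return best_neighbor == announcing_peer_asn
        return False  # no covering route at all

    print(blackhole_allowed(64500, "203.0.113.7/32"))  # True: peer is on the best path
    print(blackhole_allowed(64511, "203.0.113.7/32"))  # False: off-best-path request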
Now AS2 is our attacker, announcing prefix P with a community to do path prepending in AS3, which leads to the longer path over AS4 and AS5 becoming the preferred path for AS6. Why could this attack be interesting? Well, one thing could be that there is a network tap between AS4 and AS5, and even if you identify that AS2 is your attacker and you screen the network of AS2...
...you will not find any network tap there, because they redirected the traffic on purpose towards AS4 and AS5, where the actual network tap is — or it could be that AS2 is being forced to cooperate here. And the other thing is, it could just be a denial-of-service attack, because it's known that the link between AS4 and AS5 is a very thin link, with not as much bandwidth as would be needed, so by redirecting traffic there you could actually fill up that link. And after I gave this presentation at the RIPE meeting...
...we were actually approached by Dyn, and they pointed us to an article where they found that attackers today are actually already using communities to foster the propagation of hijacks. So the attackers found out that by setting specific community values, their hijack would actually be propagated further in the BGP network. So we already see attacks using communities.
The problems we found are in authenticity, transitivity, standards and documentation, and monitoring of community usage. Starting with authenticity: I mentioned several times that every AS that is on the AS path is able to modify, add, or remove community values on announcements in BGP, and there is no attribution possible. That means even if you find out that there was an incident, you cannot find out who is actually responsible for it. We all know RPKI, but it does not cover communities.
On the other hand, we also see that operators rely on the correctness of community values, because they are basing policy decisions on them, for example on where a route has been learned. Large communities are there, but they only partially improve the situation, because all of these points still apply to large communities; they only fix the problem of the first part being an AS number. So the question is: how can we achieve authenticity, or at least attribution, so that after an incident you know who you have to talk to to prevent further problems in the future?
Another thing that could probably lead to a big discussion is transitivity. We know communities are very helpful in debugging, because you know what is happening in the network and why certain networks are forwarding traffic in a certain way, and they are indeed a very easy, low-overhead communication channel and widely used. But we still only see them being used one or two hops away: you usually do not signal blackholing five or six hops away, and you do not usually need to inform peers six or seven hops away.
So, on the other hand, you have a high risk of abuse in communities being transitive. The question is: do we have a higher risk here, or do we have more benefit? We need a discussion weighing benefit against risk of allowing communities to be fully transitive. Monitoring is another field full of misunderstandings.
We do not know what has happened on the path between peers; even if you are able to look into looking glasses manually, it's very hard to spot differences. So inferring modifications between the origin, or the AS setting communities, and the collector is almost impossible, and even if you were able to record all of these changes, you still have the problem that you do not know what these community values actually mean, and there is no general way of attributing changes, or recording who actually changed anything.
During our research we found another very large problem, and this is with documentation. Because all ASes can define their communities themselves, there is no need for documentation and there is no central point of documentation. We found that some ASes are documenting them in whois or IRR databases, or on their websites; some are only providing community documentation in customer portals, or not at all.
So if you see a community, you cannot find out in an easy way what it means, and even if there is documentation, it's often only in natural language, and parsing this is impossible — we tried, we failed. If you have a very limited scope, for example trying to find out which communities are used for geolocation, or for geo-tagging of prefixes, you can of course look for city names, airport codes, things like that, but parsing community documentation for a general purpose...
They are only using string representations instead of community values internally, and these representations are then translated to short and large community values by their auto-configuration. An example would be something like "tag origin country de", where "de" is a parameter for the community definition "tag origin country". So you see it's a hierarchical system: it says, well, this community is a tagging community, it has passive semantics.
It is tagging an origin at the country level, and the country is Germany. Their system allows the definition of parameters for communities, and these parameters, together with the communities, are documented in one system. They have working code and they are using this in production already; right now they have an internal internet-draft-like document, and if you're interested in that, you should probably talk to Whittaker, who is sitting there and laughing.
So I think this is a great way to actually start documenting communities in a sensible way, because you don't have to operate with magic numbers, and you can actually distribute this documentation and talk about policies and filters with your peers, because you can talk in strings and not magic numbers. And even if you have other router configurations, you can still use these string representations, and you know what you're talking about.
we
came
up
with
some
recommendations
for
operators
based
on
our
work,
of
course,
as
the
RFC
already
States,
you
should
filter
all
informational
comment,
community
values
that
you
are
using,
that
carry
your
a
s
number.
So
if
you
are
using
communities
to
check
where
a
prefix
as
we
learned,
we
should
scrub
these
communities
when
you
receive
them
from
your
peers
because
they
are
defined
by
you
internally
and
used
by
you
internally.
J
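A sketch of that recommendation: on routes received from peers, strip any community whose first field is your own AS number, so externally injected values cannot masquerade as your internal tags. The parsing helpers and values here are illustrative.

    # Sketch: scrub inbound communities that carry our own AS number (per the advice above).
    MY_ASN = 64500

    def scrub_own_communities(communities):
        # communities: iterable of "asn:value" strings received from a peer.
        return [c for c in communities if int(c.split(":")[0]) != MY_ASN]

    received = ["64500:100", "64496:20", "64500:666"]
    print(scrub_own_communities(received))   # -> ['64496:20']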
It might be useful to come up with agreements with your downstreams, to define what they are allowed to do with your upstreams — whether they are actually allowed to do path prepending at your upstream for their prefixes. Of course, publicly documenting the communities you are using is key to enabling other ASes to filter action communities towards you. So if I have a customer where I know he might be playing with BGP, I might want to filter things so he cannot trigger things in my upstream — but you need agreements for that.
So, coming back to the general problem: BGP communities are currently the only feasible way to realize signaling between ASes, but the problem is that secure usage requires good operational knowledge and diligence. We do not think that a very over-complex system is really suitable to secure the shortcomings of BGP communities, but we have to be aware that there is a problem, and while everybody in this room is probably able to handle this and do everything correctly all the time, we cannot rely on that on a global scale.
There are a bunch of people out there who do not know what they are doing, and there will never be a world in which everybody is doing everything correctly. So the question is: can we still rely on, and do we still want, protocols that allow people to make mistakes that will break other people's networks, or do we need an evolution in protocols here — protocols that are less fragile and more usable, or with other safeguards...
...to prevent people from shooting themselves and others in the foot. So, wrapping up: communities are widely in use, they are used to realize policies, and they are needed, but they heavily rely on mutual trust between the peers, because there is no authenticity and no security in place, there is no attribution, and attacks are very hard to detect. One takeaway from our experiments: we did some prefix hijacking that was reported on Twitter, but nobody actually spotted our redirection.
I: But in the end you have to see that, yes, you always have bilateral relations that are mapped into BGP neighborship relations, and what is exchanged there should be seen as something that is essentially just bilateral. And yes, if you want to be a responsible actor in the whole system, you have to really control what you are doing with your neighbors. And if you really take that understanding, you can actually start to build stuff that says: well, okay, you and me are peering, I'm a responsible person, I make an agreement on what we are doing on our relation. And for that, if you and me are making a decent effort at controlling, at doing the right policy for implementing our agreement, we actually have a chance of using that as fairly trustworthy. I might even go out and offer you an agreement in which I promise you that I am doing stuff, where I am relating to you, in a controlled manner. The communities that Randy is sending to me — this is something that does not work recursively, however.
L: Very beautiful work, thank you for bringing it. I mean, this is well known, and I've been facing this for like 15 years probably, since we started doing remote blackholing. The mitigation today is mostly basic hygiene: you take care of what you accept and you take care of issues. And there are some telltale consequences in very beautiful stuff — things like bandwidth communities that are very useful in data centers: they were made non-transitive just to avoid those issues, but since most data centers use eBGP, we can't propagate those communities. So there's definitely work needed.
M: ...to bring up to speed those folks who might not be super familiar with QUIC. Perfect. So if you are an expert on QUIC and you know the background material, I apologize in advance; please bear with me as I go through it. All right, so I'm going to be talking about taking a long look at QUIC. This was measurement work that appeared at IMC 2017.
I don't think I need to convince anyone in this room that Internet connectivity is important, but just to set the stage and put things in perspective: in 2015, 3.2 billion people had access to the Internet. Obviously that number has increased over the past four years, but in that same year the number of people with running water was less.
These two numbers next to each other — I find them depressing, for reasons that are out of the scope of this talk — but it emphasizes the importance of Internet connectivity. We use it in our personal life and in our professional life; virtually every business depends on the internet, and their viability is tied to the performance of the networks that they're operating.
Naturally, there's a lot of effort to try to improve these networks and make them more reliable and more performant. The IETF is one of those efforts, and we do a lot of things: we come up with new protocols, we use traffic management techniques to make sure that our networks are utilized in a way such that everyone's demands are met, and we even design our applications to adapt themselves to the underlying network, so we improve the user experience. QUIC is one of those efforts; it's a transport protocol.
It stands for Quick UDP Internet Connections, it started at Google, and it was basically a transport protocol designed with today's needs in mind. QUIC was designed for a bunch of main reasons. The first one was to facilitate rapid deployment. What does that mean? If you think about HTTP: you have HTTP, hopefully you have TLS underneath, and it's running on TCP, which is your transport protocol. As you all know, TCP is implemented in the kernel, so it's in kernel space. What does that mean?
QUIC, by contrast, runs in user space on top of UDP, typically inside the application. So what this means is that whenever you have a new version of QUIC — let's say you're browsing the web and you're using a browser — all the users need to do is update their browser, and then they have the new version of QUIC. Obviously, this also means that a lot of the guarantees that TCP provides, like reliable delivery, have to be provided by QUIC itself on top of UDP.
Another main reason for QUIC, which Google never shied away from pointing out, was to avoid ossification by middleboxes. We all know there are many middleboxes in networks; these could be NATs, or security firewalls, or web caches, and many other appliances. A lot of them do claim that they improve performance — perhaps in some cases they do — but there's also a lot of evidence that they actually do more harm than good.
One of the examples that I find very interesting: this was joint work by Google and T-Mobile a few years ago, presented at the Velocity conference, where they basically looked at YouTube traffic over the T-Mobile network and how it interacts with their web proxies, and this is a summary of the findings from their slides. They basically found that it's better if the YouTube traffic does not go through their proxies, because they were hurting performance — and I don't want to point any fingers at T-Mobile or YouTube.
This is not an issue isolated to them. Another example, taken from a Cloudflare blog post, where they were basically saying: we had TLS 1.3 enabled for a while, but no one was using it, because the browsers were not supporting it, and they were not turning it on because middleboxes were breaking it. To be fair, it wasn't just middleboxes — there were other issues that prevented TLS 1.3 from being deployed at scale — but middleboxes were not helping.
TCP Fast Open is another example that a lot of folks believed never got deployed at scale because of middleboxes, and the list goes on. All of these things can happen because in TCP all of your headers are in the clear, so middleboxes can see them and act upon them: they can modify them, drop them, add headers, or break your connections in two — all the things that you're familiar with. Whereas in QUIC pretty much everything is encrypted, so you take all of that away from middleboxes.
They can't do any optimizations or meddling. And finally, QUIC was proposed to improve performance — just a side note here: performance for HTTP traffic. I should mention that QUIC is eventually going to be a general-purpose transport protocol, but it started with HTTP in mind and that's its biggest use case right now; it's very integrated with HTTP. So throughout this talk, whenever I say QUIC, we are basically going to focus on HTTP over QUIC.
So whenever I say QUIC, I mean HTTP over QUIC. QUIC improves performance through a number of optimizations. The most famous one is zero-RTT connection establishment. If you're familiar with TCP, you have that three-way handshake to establish a connection before you can send any data; if you have TLS on top of TCP, as you should, well, there are going to be more RTTs. QUIC tries to achieve zero-RTT connection establishment, and what that means is that you can start sending data from the very first packet. Obviously that doesn't always work.
You should have contacted the server before and have valid keys for zero-RTT to work; if you don't, it's going to be one or two RTTs, but after that everything else is going to be zero-RTT. QUIC also prevents head-of-line blocking. What is that? If you have an HTTP stream — if it's HTTP/1 — you have a stream, you have to open a TCP connection.
If you have more than one stream, then you have to open more TCP connections, and we all know that all those connections have overhead and compete over bandwidth, so it's not a great use of resources. HTTP/2 solves this by multiplexing HTTP streams into a single TCP connection. This is great — it gets rid of a lot of overhead. However, if any of these streams is blocked for whatever reason, then all of the streams are blocked, and the reason for this is that TCP is agnostic to the HTTP streams.
As far as TCP is concerned, you have a stream of bytes that needs to go from one end to the other end. QUIC solves this by basically mapping HTTP streams onto QUIC streams. Now, having that logic of streams in QUIC, if one of the streams is blocked, the rest of them are not going to be blocked and can proceed normally.
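The difference can be illustrated with a toy delivery model: with a single in-order byte stream (TCP-like), one missing packet holds back data for every multiplexed stream behind it, whereas with per-stream ordering (QUIC-like) only the stream that actually lost data has to wait. This is a conceptual sketch, not either protocol's real reassembly logic.

    # Toy model of head-of-line blocking. Each packet carries one chunk of one stream.
    # (stream, offset_within_stream, payload); the network loses one of stream B's chunks.
    packets = [("A", 0, "a0"), ("B", 0, "b0"), ("A", 1, "a1"), ("B", 1, "b1")]
    lost = {("B", 0)}
    arrived = [p for p in packets if (p[0], p[1]) not in lost]

    def tcp_like(sent, arrived):
        # Single in-order byte stream: nothing after the first missing chunk is
        # delivered to *any* stream, because TCP sees only one sequence space.
        delivered = []
        for pkt in sent:                       # original send order = TCP sequence order
            if pkt not in arrived:
                break                          # gap blocks everything behind it
            delivered.append((pkt[0], pkt[2]))
        return delivered

    def quic_like(arrived):
        # Per-stream sequence spaces: a loss only blocks the stream it hit.
        delivered, expected = [], {}
        for stream, off, data in sorted(arrived, key=lambda p: (p[0], p[1])):
            if off == expected.get(stream, 0):
                delivered.append((stream, data))
                expected[stream] = off + 1
        return delivered

    print(tcp_like(packets, arrived))  # [('A', 'a0')]              -- A blocked by B's loss
    print(quic_like(arrived))          # [('A', 'a0'), ('A', 'a1')] -- only B waits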
QUIC has improved loss recovery: it mitigates the ACK ambiguity problem that TCP has, and it has better RTT and bandwidth estimation. A lot of this good loss recovery comes from the fact that you can easily change the congestion control as well. So, for example, if you have BBR, a new congestion control, you can easily replace your old one with the new one, and that comes back to the first point I talked about: you can easily deploy changes.
A little bit of history: QUIC started in early 2010 at Google, as I said; I think it was in 2013 that it was publicly announced, and Google started using it soon after. There was a spec draft, and towards the end of 2016 the IETF working group started, and the working group has been very active. There are many implementations of QUIC around; Google's QUIC is at version 47 now, and the working group is working fast, and hopefully soon we're going to have a standard version of QUIC and everyone's going to be using that.
So that's why QUIC started, and a little bit of its history. But, as I said, one of the main reasons for QUIC was improved performance, so Google has been reporting on QUIC's performance: they've been using it heavily and they've been putting out reports that it helps with page load time, with YouTube rebuffering — all these great numbers saying that it's perfect and it's very promising.
However, the issue with these is that they're all aggregated statistics, and not really reproducible by anyone else unless you're Google and you have access to that data, and they don't really report any controlled tests — again, everything is aggregated statistics. At the time we started our work there were other evaluations of QUIC in research venues; however, most of them were limited to certain environments and networks, with limited tests, and they used old, untuned versions of QUIC — I will get into what that means in a bit — and the results they provided were not necessarily statistically sound.
Neither did they provide good root-cause analysis for the performance they observed. So we basically wanted to fill those gaps and provide a more comprehensive evaluation of QUIC and how it compares to TCP. As I said, we're going to look at HTTP performance, and we're going to compare QUIC and TCP.
Our servers host a bunch of web pages and objects with different sizes, and pages with different object sizes and different numbers of objects, and we fetch them using QUIC and TCP and compare the performance. I must point out that, even though I'm not going to go into the details, once we get all the results we run a statistical test to make sure any difference that we see is not due to noise, or network variations, or things that are not really differences between the protocols.
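The talk does not spell out which test is used, so as a generic stand-in (not the authors' exact procedure), here is how one could check whether two sets of download-time samples differ significantly, using Welch's t-test from SciPy; the sample values and the 0.05 threshold are illustrative assumptions.

    # Generic illustration only: significance check on two sets of download times.
    from scipy import stats

    quic_times = [1.21, 1.18, 1.25, 1.19, 1.22]   # seconds, illustrative samples
    tcp_times  = [1.40, 1.38, 1.44, 1.41, 1.37]

    t_stat, p_value = stats.ttest_ind(quic_times, tcp_times, equal_var=False)
    if p_value < 0.05:
        print(f"significant difference (p={p_value:.3g})")
    else:
        print("no statistically significant difference")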
So whenever we report a difference between the two protocols, we are confident that this is a difference in performance and not noise or anything else. So the setup is pretty simple, but in 2016, when we were doing these tests, we had this big issue of finding a server that supports QUIC. It's not like TCP — there wasn't a QUIC module for Apache servers, and there weren't many options. Basically, our two real options were either to use...
...Google's servers, because Google at the time had QUIC — basically host our stuff on Google servers and run our tests against Google — or to use a server that comes within the Chromium code base. Well, the first option, Google servers, didn't really work for us, for the first obvious reason that we had no control over it.
It's half a second! So basically one third of our download time is wait time. We did some tests, and we realized this wait kind of exists in Google App Engine. We weren't sure why it's happening; obviously we didn't have any access to the server to investigate this more, and this was not good for us, because if we're checking performance and comparing millisecond times, half a second of wait time is not okay. So we decided to use the server in Chromium.
So the bar on the left is doing the exact same experiment, but with the Chromium server, the server that is part of Chromium. Now you can see that that huge wait time is gone — that's great — but now our download time is much bigger compared to QUIC to Google, sorry, and this is problematic, because these two plots next to each other are telling me that the server in Chromium cannot provide the performance that QUIC is able to provide — we clearly see that Google is doing better.
So we had to try to infer what configuration Google's servers were using and fine-tune our Chromium server to make sure it matched the performance that Google gives. We did that; along the way we found some bugs and fixed them. I'm not going to go into the details, but I'm happy to talk about it offline. After we did that, the bar on the right is basically the same experiment using our Chromium server after adjusting it, and not only is the wait time gone, the download time is now in line with what Google's servers give.
M
So now that we have our setup complete, our test bed complete, we did some tests. Let me start by showing you results from a desktop client. I'm going to show you some simple results where we're downloading objects of different sizes, from five kilobytes to ten megabytes, downloading them at different bottleneck bandwidths, and comparing how QUIC and TCP perform relative to each other.
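(As a rough sketch of how such a controlled download test could be reproduced, the snippet below shapes an emulated bottleneck with Linux tc/netem and times a fetch; the interface name eth0, the URL, and all rates and delays are assumptions rather than the paper's actual test-bed parameters, and the same netem knobs cover the loss and reordering used in the later experiments.)

    # Sketch: emulate a bottleneck link, then time one object download.
    # Assumes root privileges and that "eth0" is the interface to shape.
    import subprocess, time, urllib.request

    def set_link(rate="10mbit", delay="18ms", loss="0%"):
        # One-way delay of 18 ms applied on each direction gives roughly a 36 ms RTT.
        subprocess.run(["tc", "qdisc", "replace", "dev", "eth0", "root",
                        "netem", "rate", rate, "delay", delay, "loss", loss],
                       check=True)

    def timed_fetch(url):
        start = time.monotonic()
        urllib.request.urlopen(url).read()     # plain HTTPS/TCP fetch; the QUIC
        return time.monotonic() - start        # runs used a Chromium-based client

    set_link()
    print("download time:", timed_fetch("https://server.example/10MB.bin"), "s")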
M
So in this case the RTT is 36 milliseconds and the loss is negligible, and those numbers, 45 and 44 percent, mean that when we download that 5 kilobyte object using QUIC and TCP, the download time for QUIC is 45 percent better than for TCP. Now, to avoid bombarding you with a lot of numbers, I'm going to replace that with a heatmap. Just think of it as: red means QUIC is doing better, blue means TCP is doing better, and white means there's no statistically significant difference between the two protocols.
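(To make the reading of these cells concrete: each cell is the relative difference in download time, (TCP − QUIC) / TCP. A minimal plotting sketch is below; the matrix values are random placeholders, not the paper's measurements, and the real heatmap additionally leaves cells white when the difference is not statistically significant.)

    # Sketch: render percentage improvement of QUIC over TCP as a red/blue heatmap.
    import numpy as np
    import matplotlib.pyplot as plt

    sizes = ["5KB", "10KB", "100KB", "1MB", "10MB"]      # object sizes (columns)
    rates = ["5M", "10M", "50M", "100M"]                 # bottleneck bandwidths (rows)

    quic = np.random.uniform(0.1, 1.0, (len(rates), len(sizes)))   # placeholder times
    tcp  = quic * np.random.uniform(0.9, 1.6, quic.shape)

    pct = (tcp - quic) / tcp * 100     # > 0: QUIC faster (red); < 0: TCP faster (blue)
    plt.imshow(pct, cmap="RdBu_r", vmin=-50, vmax=50)
    plt.xticks(range(len(sizes)), sizes)
    plt.yticks(range(len(rates)), rates)
    plt.colorbar(label="% by which QUIC beats TCP")
    plt.show()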
M
So if I complete this plot, you can see that at pretty much every bottleneck bandwidth and for every object size QUIC is doing better than TCP; we're able to download the object faster. So this is great. We then threw some loss into the picture, and we saw that QUIC is still doing better than TCP in pretty much all cases. We increased the RTT and worked with different RTTs; in this example the RTT is 112 milliseconds, and again QUIC was doing way better than TCP.
M
So far everything was great and we were very excited, and then we did this experiment where we added some packet reordering, and as soon as we added packet reordering things started to change. We actually saw cases, that label is covering the plot, but the blue cells on the right side of the plot are the big objects, the last column is a 10 megabyte object, where with packet reordering QUIC is doing worse than TCP. So we wanted to see why this is happening.
M
We looked at QUIC's code, instrumented the code, and looked at TCP to see how it copes with packet reordering, and basically what we found is that TCP has this mechanism: when packets get reordered, it increases its reordering threshold and can cope with that reordering. QUIC didn't have that mechanism in place, and when packets were reordered deeper than its NACK threshold, it basically assumed those packets were lost, so it went into loss recovery, and we all know what that means: performance was going down.
M
Sorry, I skipped ahead. All right, so we looked at the NACK threshold; the default NACK threshold for QUIC was 3. So we wanted to see, and I'm looking at the example where we're downloading a 10 megabyte object, so it's a big object, a sizable transfer, whether QUIC can benefit from the same mechanism as TCP. So we started playing with the NACK threshold, and (there's a big latency between my clicker and the slide) we saw that as we increase the NACK threshold,
M
QUIC's performance actually gets better, and when we let the NACK threshold increase up to 300, which is actually the upper bound that TCP is allowed to increase its threshold to, then QUIC is able to recover: it's able to cope with the packet reordering and actually starts performing better than TCP.
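(A toy model of the loss-detection behaviour being described, where a packet is declared lost once enough later packets have been acknowledged ahead of it; this is not QUIC's or TCP's actual code, just the idea, and the thresholds 3 and 300 mirror the numbers from the talk.)

    # Toy model: packet-threshold loss detection under reordering. Every packet
    # eventually arrives, so anything "declared lost" here is a spurious loss
    # that triggers needless retransmission and congestion-window backoff.
    def spurious_losses(arrival_order, threshold):
        declared_lost = set()
        pending = set(range(len(arrival_order)))   # sent, not yet acknowledged
        highest = -1
        for pkt in arrival_order:                  # packet numbers in arrival order
            pending.discard(pkt)
            highest = max(highest, pkt)
            for p in list(pending):
                if highest - p >= threshold:       # reordered deeper than threshold
                    declared_lost.add(p)
                    pending.discard(p)
        return len(declared_lost)

    # Packet 0 delayed behind the following nine packets:
    order = list(range(1, 10)) + [0]
    print(spurious_losses(order, threshold=3))     # 1 -> misread as loss
    print(spurious_losses(order, threshold=300))   # 0 -> reordering tolerated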
M
All right, so the next thing we wanted to look at was 0-RTT, because that's one of the big advertised improvements in QUIC. So we want to see how much 0-RTT helps, and I'm going to go back to our base example, where there's no loss and we have a 36 millisecond RTT. As I talked about, QUIC is doing much better than TCP; this is QUIC versus TCP.
M
You can really sense the benefit when the object size is small. When your object is big, naturally, because your transfer is longer and your connection setup time is a very small fraction of your transaction, it doesn't have a big effect. That is still great, because if you think about the web, most of the time you're actually requesting very small objects, so 0-RTT can help a lot in those scenarios.
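(A back-of-the-envelope model of why the 0-RTT benefit shrinks with object size: the handshake cost is a fixed number of round trips, so it dominates short transfers and vanishes in long ones. The sketch ignores slow start and server processing, and the RTT and bandwidth values are assumptions.)

    # Rough model: fetch_time ≈ handshake_rtts * RTT + size / bandwidth
    RTT = 0.036                 # 36 ms, as in the base scenario
    BW  = 10e6 / 8              # assumed 10 Mbit/s bottleneck, in bytes/second

    def fetch_time(size_bytes, handshake_rtts):
        return handshake_rtts * RTT + size_bytes / BW

    for size in (5e3, 100e3, 10e6):                      # 5 KB, 100 KB, 10 MB
        full = fetch_time(size, handshake_rtts=2)        # setup before data flows
        zero = fetch_time(size, handshake_rtts=0)        # 0-RTT resumption
        print(f"{size/1e3:8.0f} KB: 0-RTT saves {(full - zero) / full:.0%}")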
M
Sorry. So, comparing these two plots together, as we said, 0-RTT only helps for smaller objects, but we can see that QUIC is doing better for bigger objects as well. So we wanted to see what it is that QUIC does that helps it perform better. I have an experiment here which is a little bit extreme, but I like it because it helps visualize things a little bit better.
M
So basically the takeaway is that QUIC is way more aggressive and better at adapting itself to changes in the available bandwidth, which is great, but it also made us think: if QUIC is so aggressive in adapting itself to the available bandwidth, how is it going to play with fairness to other traffic? Because, as we know, we want different flows to be fair to each other, so that no flow shuts down other flows. So we made TCP and QUIC compete with each other over a bottleneck bandwidth, and we actually found out that QUIC is not fair to TCP.
M
We found out that QUIC is taking more than its fair share of the bandwidth. We repeated that experiment with QUIC competing with multiple TCP flows and we still got the same results, and to make sure this was not our environment, we made QUIC compete with QUIC, and things were fair, and TCP compete with TCP, and everything was fair; but when the two protocols were competing with each other, QUIC was not being fair to TCP. We wanted to dig in a little bit deeper, so here I have the congestion window size for the two protocols.
M
In this example they're both using Cubic, and as you can see, they start from the same congestion window size, but QUIC quickly increases its congestion window and takes an unfair share of the bandwidth, causing TCP to basically slow down. If you zoom in, you can actually see that QUIC is increasing its congestion window much more aggressively.
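(A minimal sketch of how this unfairness can be quantified from measured per-flow throughputs using Jain's fairness index, where 1.0 is a perfectly even split; the throughput numbers are placeholders, not the measured values.)

    # Jain's fairness index over per-flow throughputs.
    def jain_index(throughputs):
        n = len(throughputs)
        return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

    quic_vs_tcp = [7.0, 3.0]    # hypothetical Mbit/s split on a 10 Mbit/s bottleneck
    tcp_vs_tcp  = [5.1, 4.9]
    print("QUIC vs TCP:", round(jain_index(quic_vs_tcp), 3))   # noticeably below 1.0
    print("TCP  vs TCP:", round(jain_index(tcp_vs_tcp), 3))    # essentially 1.0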
M
All right, so I have one last thing to talk about before I run out of time, and that is mobile devices. In everything I talked about so far, the client is a desktop device. Again, going to my base example of no loss and a 36 millisecond RTT, we saw that QUIC is doing better than TCP in most cases. However, we repeated the same exact experiment, but this time the client is a mobile phone, and what we saw is that, well, one, QUIC is still doing at least as well as TCP.
M
You don't see any blue cells in there, but the performance gains of QUIC started to diminish: QUIC is doing better than TCP, but the gap is not as big as for a desktop client. So we wanted to see why this is happening, and what we did was instrument the QUIC code to try to infer a state machine and see what's happening inside QUIC and what state the protocol is in at every point in time.
M
So I'm going to show you this state machine for the case where we're downloading a 10 megabyte object at 50 megabits per second, and it looks something like this: it's a classical state machine. You have different states, the percentage of time that you spend in every state, and the transition probabilities. This is a little bit difficult to read, so I'm going to replace it with a table, and as soon as I do that,
M
hopefully things are going to become clearer. As you can see, when we're using a desktop machine, QUIC is in the application-limited state for only 7% of the time, and that's the state where the client is receiving data faster than it can consume it. But as soon as you go to a mobile device, where resources are more scarce, QUIC is in the application-limited state for 60% of the time, and this is exactly the price that QUIC is paying for being implemented in user
M
space: you're constantly context switching between user space and kernel space, which is fine on a resourceful device, but when you're on a mobile device, things are not that great.
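(As an illustration of how time-in-state fractions such as the 7% versus 60% figures can be computed from an instrumented connection log; the log format and state names below are assumptions about what such instrumentation might emit, not the actual tooling used in the paper.)

    # Sketch: fraction of connection time spent in each sender state,
    # computed from (timestamp_seconds, state) transition events.
    from collections import defaultdict

    events = [(0.00, "SLOW_START"), (0.20, "CONG_AVOIDANCE"),
              (0.45, "APPLICATION_LIMITED"), (0.95, "CONG_AVOIDANCE"),
              (1.60, "APPLICATION_LIMITED"), (2.00, "DONE")]   # hypothetical log

    time_in_state = defaultdict(float)
    for (t0, state), (t1, _next) in zip(events, events[1:]):
        time_in_state[state] += t1 - t0

    total = sum(time_in_state.values())
    for state, t in sorted(time_in_state.items(), key=lambda kv: -kv[1]):
        print(f"{state:<20} {t / total:6.1%}")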
So that's all I had to talk about. To sum it up: we looked at a protocol that was rapidly evolving, and honestly, sometimes it felt like measuring shifting sands.
M
There are a bunch of other tests that I didn't have time to talk about, but I encourage you to read the paper if you're interested. We instrumented the code and extracted a state machine, and that helped us provide some root-cause analysis for the performance that we were seeing. Finally, I just want to point out that this work was done two years ago, so at the time QUIC was at version 36; as I said, Google's QUIC is now at version 47.
M
However, nothing stops us from doing the exact same measurements on the new versions. We actually did that: in the paper we looked at QUIC from version 25 to 36, so we have that evolution of QUIC performance, and we can do the same thing for newer and future versions. And with that, I'm happy to take questions.
N
Thank you. And one follow-up; I believe it was two slides forward, perhaps three, the fairness question.
N
About the fairness: what I wanted to ask about was, you noticed the difference in the fairness, and I assume you mean that QUIC was consuming a higher proportion of the bandwidth. Did you compare that to sort of the expectation, as observed here, that QUIC normally performs better than TCP? Presumably that must mean TCP will leave some of the bandwidth underutilized, or less utilized, and is it the same proportion here, or how different is it?
N
So maybe this is too complicated to ask at the mic, but what I was trying to get at is that we expect QUIC to perform better than TCP based on the prior observations, even when they're not competing, right? Which means that on the same kind of link TCP must be leaving some bandwidth unutilized in order for QUIC to be able to beat it, right? So how?
N
M
It's definitely a fairness issue. I don't know if this answers your question, but we ran this for a very long time, so we let them both get to that equilibrium, and we could see that when there is no competition, TCP is able to utilize the bandwidth fully, or almost fully. Okay, okay, that makes sense, thank you. Yeah.
N
D
That was one of my questions. So I have a clarification question and a larger question. The clarification question is: what's the queueing discipline you were running on your bottleneck link? Are you running BBR? Were you running RED, some AQM, drop tail? Oh, I...
D
The larger question is sort of going back to your very initial remarks. First, really: great work, very interesting, nicely presented, thank you; this is a good paper, thank you for coming here and presenting it. I'm interested in the 200 million users who have internet and no electricity, right? I think there's a lot of attention being paid to QUIC as, you know, higher performance and better utilization of congested resources, but I rarely see performance numbers.
D
M
The reason I didn't put it in here is because the things that I put in here I wanted to be cases that I can isolate and then show where the difference is coming from. But, as I said, we found in 3G networks and in poor networks that QUIC is still doing better than TCP. Most of our experiments, though, were in controlled environments that we set up by hand, right.
M
But it's a good start. Yeah.
O
M
O
P
Montenegro, Microsoft. So thank you very much for this work, and I think I heard you say that there may be some ongoing work, some more research. If that's the case, could I add a suggestion: gQUIC, or Google QUIC, is all fine and good, but the whole focus of the IETF effort is iQUIC, right, the IETF variant of QUIC?
P
If you could include that in your findings, it would potentially be more relevant for the future than gQUIC, because possibly everybody at some point will be on IETF QUIC, so that's one suggestion. The other one is more of a comment: you indicated that QUIC is implemented in userspace, but that's one implementation; ours, for example, runs in kernel or user space, it doesn't matter, so you could run it in the kernel or you could run it in user space. It's not part of the protocol itself, right? So I understand that for the tests you needed to do...
M
A
So, a pitch for the remainder of the year: there are four more great ANRP talks to come. If you want the links and can't find them for some reason, I did put up an agenda slide set that is in the tracker, so you can find that, and also a humorous, somewhat related picture. But in any event, thank you for being here, thanks for the great questions and to our speakers, and that's the end of IRTF Open.