From YouTube: IETF104-IRTFOPEN-20190325-1350
Description: IRTFOPEN meeting session at IETF 104, 2019/03/25 13:50
https://datatracker.ietf.org/meeting/104/proceedings/
A: Okay, is there anybody who is — if you're not here and you're supposed to be, raise your hand, because we're waiting for you. And we didn't mean to put up "ARTF" — we really are still the Internet Research Task Force, and with any luck I have now updated the slides, so welcome. I am not going to give you slides; I know you've memorized the Note Well.

There are normally two talks, but one of our presenters was unable to attend last time, and they're going to be terrific talks about large-scale, hard problems. So we'll start out with Brandon Schlinker's talk. Brandon is with Facebook and the University of Southern California, and his award paper is called "Engineering Egress with Edge Fabric: Steering Oceans of Content to the World." Take it away, Brandon.
B: Thank you for the introduction. Good afternoon, everyone. My name is Brandon Schlinker, from the University of Southern California. Today I'm going to be talking about Edge Fabric, a system we built at Facebook to deliver traffic to end users around the world.

So let's start off with a brief overview of Facebook's network. Facebook has dozens of points of presence around the world and interconnects with thousands of networks.

Next, we use BGP, the Border Gateway Protocol, to exchange reachability information with those networks. So in this example, with the end-user ISP, we receive routes to their end users across the network interconnection we've established with them, and we also receive a route from that tier-1 transit provider.

So let's take a look at what problems that creates. We have here a simple example: Facebook, on the left, is trying to deliver five gigabits per second of traffic to the end users in the ISP on the right. Now, our router is configured to use those short, direct paths that we prefer, and so, as a result, it puts all that load onto that upper path and everything's fine.

Until later on in the day: demand has risen, we're now at 12 gigabits per second of demand, and BGP at that router can't adapt to demand or capacity in real time — it's simply not possible to express that with BGP's policy terms. So, as a result, the router continues to make the same decision, and it ends up overloading that link, causing packet loss and degrading user performance.
B
Likewise,
BGP
doesn't
consider
performance
in
its
decision
process.
The
simple
example
of
that
can
be
seen
here
that
upper
preferred
route
now
has
a
securities
route
on
it.
So
it's
added
50
milliseconds
of
latency.
Also,
some
piece
of
equipment
downstream
is
miss
functioning
or
malfunctioning
a
DeMoss.
So
in
this
scenario,
the
route
through
that
set
that
second
route
through
that
transit
provider
would
actually
be
preferred.
It
offers
better
performance,
but
we
can't
configure
bgp
to
adopt
performance
in
real
time,
and
so
we
end
up
still
poisoning
all
that
traffic
onto
the
West
preferred
poor-performing
route.
B
Now,
despite
all
these
problems
with
bgp
and
how
it
doesn't
account
for
capacity
or
performance,
it's
still
fundamental
to
interconnection,
and
it's
not
going
away
anytime
soon.
The
thousands
of
networks
that
Facebook
and
other
large
content
providers
connect
with
all
expect
for
us
to
use
the
BGP
protocol.
B
So
I've
briefly
gone
over
Facebook's
Network
and
an
overview
over
the
challenges.
Next
I'm
going
to
dive
deeper
into
our
connectivity
and
the
challenges
I'm
going
to
talk
about
how
we
sidestep
BGP
limitations
with
edge
fabric
I'll,
then
talk
about
edge
fabrics,
behavior
in
production
and
finally,
I'll
talk
about
the
evolution
of
edge
fabric
and
some
ongoing
work.
B
So
back
to
those
points
of
presence
that
we
have
around
the
world
at
each
of
those,
we
have
three
types
of
connectivity:
first,
we
have
transit
providers
and
transit
providers
can
deliver
traffic
to
the
entire
Internet
at
each
pop.
We
typically
have
two
or
more
of
these
redundancy
and
we
connect
with
them
through
a
private
circuit
or
sometimes
known
as
a
private
network
interconnection.
B
Then
we
have
peers
and
we
separate
peers
into
two
different
categories-
I'm
going
to
go
into
detail
and
why
we
do
that
a
little
later.
But
in
general
we
have
private
peers
on
which
there
are
on
the
order
of
tens
per
pop.
And
again
we
connect
with
them
through
private
circuits,
and
we
have
IXP
or
public
appears
that
we
interconnect
with
via
internet
exchange
points
and
those
are
on
the
orders
of
hundreds
per
pop
and
we
interconnect
with
them
through
a
shared
fabric,
which
means
we
don't
have
a
direct
circuit
between
our
routers
and
ours.
B
So
how
do
we
prefer
across
these
different
routes?
What
does
what's
our
router
is
configured
to
do
in
general,
we
apply
this
very
simple
policy.
We
prefer
routes
from
private
peers
over
internet
exchange,
point
peers
over
transit
providers.
Now
we
prefer
peers
over
transits,
because
peers
provide
a
short
direct
pass
to
end-users
and
we
prefer
private
over
internet
exchange.
Point
peers
because
we
prefer
circuits
that
are
dedicated
to
dedicated
capacity
between
Facebook
and
the
peer.
Let's take a look at circuits' peak demand relative to their capacity, for circuits where we predicted that demand was going to exceed capacity at least once. What I have here is, on the y-axis, a CDF of circuits where demand exceeded capacity, and on the x-axis their peak demand relative to their capacity. So a peak demand of two here indicates they had twice as much demand as the circuit's actual capacity.
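To make the figure concrete, here is a minimal sketch (not Facebook's code, data invented) of how one might compute the CDF Brandon describes from per-circuit peak demand and capacity numbers:

```python
# Minimal sketch with made-up numbers: CDF of peak demand relative to capacity,
# restricted to circuits whose demand exceeded their capacity at least once.
def overload_cdf(circuits):
    """circuits: iterable of (peak_demand_gbps, capacity_gbps) tuples."""
    ratios = sorted(peak / cap for peak, cap in circuits if peak > cap)
    n = len(ratios)
    # Each point: x = peak demand relative to capacity,
    #             y = fraction of overloaded circuits with ratio <= x.
    return [(r, (i + 1) / n) for i, r in enumerate(ratios)]

if __name__ == "__main__":
    sample = [(12, 10), (20, 10), (9, 10), (15, 10)]  # hypothetical circuits
    for ratio, frac in overload_cdf(sample):
        print(f"peak/capacity = {ratio:.2f} -> CDF = {frac:.2f}")
```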
So, going back again: BGP doesn't consider demand or capacity, and as a result, in these situations where demand exceeds capacity, we're going to end up with packet loss and a degraded user experience. BGP's decision process in general doesn't meet our needs — we don't want a degraded user experience — and that's why we built Edge Fabric.

Second, we wanted ease of deployment, which means we want to interoperate with our existing infrastructure and tooling. We have BGP routers at the edge of our network, like most network operators do, and we already have existing tooling for interacting with BGP, so we wanted a system which could interact with that existing infrastructure.

On the right-hand side I have another extreme, which is host-based routing. That's where each host makes a decision on what the route of a packet is going to be, and then uses some signaling method, such as MPLS or GRE, to signal to the routers at the edge of the network how to handle that packet. Edge Fabric's approach is balanced between these two extremes.
B
So
what
does
this
approach?
Look
like
well,
first
routers,
at
the
edge
of
our
network,
keep
selecting
routes
like
they
do
today
using
BGP.
We
still
have
all
of
our
BGP
sessions
with
other
networks
terminated
at
those
routers,
so
in
this
case
our
router,
based
on
all
the
information
that's
received,
have
selected
route.
A
edge
fabric
also
selects
ideal
routes,
but
in
addition
to
all
that
bgp
routing
information,
it
also
has
access
to
other
inputs.
B
B
So
edge
fabric
can
perform
two
types
of
already:
it
can
override
BGP
decision
in
order
to
move
traffic
for
a
set
of
end-users.
So,
for
instance,
we
can
say
on
a
per
destination
basis,
override
what
BGP
would
typically
do,
which
is
perhaps
send
that
traffic
via
appearing
link
and
instead
send
it
via
a
transit
link.
B
So,
let's
take
a
look
at
how
the
all
of
this
comes
together
to
prevent
congestion
in
our
network
and
we're
going
back
to
that
example.
I
showed
earlier,
where
we
have
facebook
on
the
left
trying
to
deliver
12
gigabits
per
second
of
traffic
to
this
is
P
on
the
right
and
BGP
by
default
is
gonna
put
all
of
that
traffic
onto
that
upper
link,
because
we
always
prefer
their
short
direct
paths
from
peers
and,
as
a
result,
that
link
is
going
to
become
overloaded.
B
So
what
edge
fabric
does.
Is
it
understands
that
this
12
gigabits
per
second
of
demand
is
actually
composed
of
two
prefixes?
And
in
this
case
it
understands
that
if
it
shifts
one
of
these
prefixes
away
and
shifts
that
traffic
to
an
alternate
link,
in
this
case,
the
path
via
the
transit
provider
that
it's
going
to
prevent
congestion
on
the
peering
link
without
causing
congestion
anywhere
else.
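The decision just described — move just enough of the per-prefix demand off the preferred interface to stay under its target utilization — can be sketched as a simple greedy loop. This is an illustration of the idea only, not Edge Fabric's actual algorithm; the numbers and the 95% target are taken from the talk, the rest is assumed.

```python
# Sketch of the idea only (not Edge Fabric's real implementation): while the
# preferred interface is projected to exceed its target utilization, detour the
# largest-demand prefixes that have a usable alternate (e.g. transit) route.
def plan_overrides(prefix_demand_gbps, capacity_gbps, target_util=0.95):
    """Return the set of prefixes to shift onto an alternate route."""
    limit = capacity_gbps * target_util
    load = sum(prefix_demand_gbps.values())
    overrides = set()
    # Shift the biggest contributors first so as few prefixes as possible move.
    for prefix, demand in sorted(prefix_demand_gbps.items(),
                                 key=lambda kv: kv[1], reverse=True):
        if load <= limit:
            break
        overrides.add(prefix)
        load -= demand
    return overrides

# Two hypothetical prefixes making up the 12 Gb/s of demand on a 10 Gb/s link:
print(plan_overrides({"203.0.113.0/24": 7.0, "198.51.100.0/24": 5.0}, 10.0))
# -> only one of the two prefixes needs to be detoured via transit
```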
B
So
how
this
is
work
at
the
bgp
level?
Well,
we
take
that
transit
route
that
we've
selected
we
injected
via
BGP
and
then
BGP
at
all
of
our
routers
is
configured
to
prefer
routes
from
edge
fabric,
and
we
do
that
by
configuring
local
pref
on
the
BGP
sessions
for
edge
fabric,
such
that
the
local
prep
of
its
routes
is
always
the
highest
and
less
preferred.
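As an illustration of this injection step: a controller can speak BGP to the edge router through a software speaker (ExaBGP is one common choice) and announce the chosen detour route with a local-preference the router ranks above every learned route. The speaker choice, command format, and local-pref value below are assumptions for the sketch, not Facebook's configuration.

```python
# Illustrative sketch only: emit an ExaBGP-style text-API command that injects
# the override route with a very high local-preference. Values are made up.
OVERRIDE_LOCAL_PREF = 10000  # configured to always win the router's decision process

def override_command(prefix, transit_next_hop):
    """Return a textual announce command for the detour route."""
    return (f"announce route {prefix} next-hop {transit_next_hop} "
            f"local-preference {OVERRIDE_LOCAL_PREF}")

print(override_command("203.0.113.0/24", "192.0.2.1"))
# -> announce route 203.0.113.0/24 next-hop 192.0.2.1 local-preference 10000
```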
So Edge Fabric monitors BGP decisions and overrides them as needed to prevent congestion in our network. Edge Fabric is able to support a variety of traffic engineering policies because it operates over a variety of inputs and can perform overrides at a variety of granularities, and, more importantly, it's compatible with our existing BGP infrastructure. What we've truly achieved with Edge Fabric is centralized control over the traditionally distributed BGP decision process.

Going back to those design priorities I introduced earlier: Edge Fabric meets our goal of operational simplicity because we can always fall back to BGP at the routers if Edge Fabric fails. It allows operators to continue to use our existing tools, because routes are injected into those routers via BGP, and synchronization is only required between Edge Fabric and the routers.

Likewise, if I move a significant amount of traffic and now I'm at fifty percent utilization, I'm getting poor utilization of those short, direct links and I'm not making good use of my capacity. So in general what we strive for, based on operational experience, is 95% utilization; this allows us to have high utilization with tolerance for bursts in traffic. Now the key question here is: can we maintain that utilization without any packet loss?
B
What
we
did
here
is
we
measured
across
our
network
during
that
two-day
measurement
period,
and
what
we
found
is
when
edge
fabric
is
shifting
traffic
away,
meaning
that
it
believes
that
us
link
would
be
overloaded
if
it
didn't
intervene.
99
percent
0.99
percent
of
the
time
there
was
no
packet
drops
on
that
link.
B
Anything
to
the
left
means
of
the
utilization
is
lower.
Anything
to
the
right
means
that
it's
higher
and
we
end
up
with
potential
loss
during
bursts.
So
what
we
find
here
is
that
the
vast
majority
of
the
time
we're
able
to
keep
the
utilization
of
these
interfaces
or
these
circuits
within
2%
of
that
threshold.
B
So
I
talked
earlier
about
those
two
extremes
of
how
you
can
have
routing
decisions
made
at
the
edge
of
your
network
at
routers,
or
you
can
have
routing
decisions
made
at
your
hosts
and
when
we
actually
started
off
with
that
fabric,
we
were
using
the
other
extreme
routing
decisions
made
at
our
hosts.
That's
called
host
based
routing,
so
in
this
model,
what
edge
fabric
would
do
is
it
would
inject
its
decisions
directly
into
our
servers
and
then
our
servers
would
use
MPLS,
DHCP
or
GRE,
depending
on
the
generation
of
edge
fabric.
B
This
was
to
signal
to
routers,
at
the
edge
of
our
network,
send
this
packet
through
circuit
X.
Now
a
key
challenge.
There
is
maintaining
synchronization,
you
have
to
keep
routing
state
maintained
across
all
of
your
hosts
and
if
what's
a
circuit,
X
disappears.
My
servers
need
to
know
that
now
that's
no
longer
a
valid
option
for
them
to
route
traffic
via
in
comparison.
What
we
did
today,
this
edge
based
routing
approach,
described
as
red
fabric
inject
its
decisions
into
routers
at
the
edge
of
our
network
and
overrides,
are
enacted
by
those
routers
hosts.
B
Don't
signal
the
precise
path
that
they
want
to
track
our
pack
to
take.
Instead,
they
just
signal
to
the
router
information
about
that
packets
traffic
class
such
as
this
is
a
video
packet.
So
this
means
that
we
don't
have
any
hosts
synchronization,
which,
in
our
network,
drastically
reduces
the
complexity
of
the
system
like
edge
fabric.
Further,
we
have
flexibility
with
DHCP
signaling,
because
we
can
account
for
different
classes
of
traffic
and
we
can
always
fall
back
to
BGP
at
our
edge
routers.
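The host side of this signaling — marking a packet's traffic class in the DSCP field so routers can act on it — can be shown with a few lines of socket code. This is a minimal sketch: the class-to-DSCP mapping here is invented for illustration and is not Facebook's mapping.

```python
# Minimal host-side sketch: tag a socket's packets with a DSCP value so that
# downstream routers can map the traffic class to a routing instance.
import socket

TRAFFIC_CLASS_DSCP = {"video": 50, "default": 0}  # hypothetical values

def open_marked_socket(host, port, traffic_class="default"):
    dscp = TRAFFIC_CLASS_DSCP[traffic_class]
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # DSCP occupies the upper six bits of the IP TOS byte.
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    s.connect((host, port))
    return s
```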
B
Next
thing
I
want
to
briefly
go
over
is
about
congestion
beyond
the
edge
of
our
networks
and
for
this
example,
I'm
going
to
talk
about
internet
exchange
points,
so
internet
exchange
points
allow
networks
to
interconnect
through
a
shared
switch.
So
in
this
case
Facebook
and
another
content
provider
may
both
connect
to
this
big
ixp
shared,
switch
and
downstream
end
user
networks
may
connect
as
well
so
internet
exchange
points
are
often
seen
as
removing
barriers
to
interconnection.
B
I
don't
have
to
provision
cross,
connects
between
me
and
all
these
other
networks,
as
I
want
to
interconnect
with
well.
They
also
create
a
key
challenge
and
to
see
why.
Let's
take
a
look
at
this
example
in
this
case,
both
Facebook
and
this
other
content
provider
have
hundreds
of
gigabits
per
second
of
capacity
to
this
internet
exchange
points
which,
in
this
case,
Facebook
wants
to
send
8
gigabits
per
second
of
traffic
to
those
end
users
and
the
other
content
providers
6
gigabits
per
second
now.
The
problem
here
is
that
is
px.
B
They
only
have
10
gigabits
per
second
of
capacity.
As
a
result,
we
end
up
with
the
same
problem
that
Iowa's
traded
earlier
demand
here
is
greater
than
the
available
capacity
or
any
up
with
congestion
and
packet
loss.
Now.
The
key
problem
here
is
that
these
networks
on
the
Left
Facebook
and
this
other
content
provider
have
no
visibility
past
their
network
edge.
They
have
no
understanding
of
what
that
other
networks
circuit
capacity
is
downstream,
and
even
if
they
did,
they
can't
see
each
other's
traffic
from
Facebook's
perspective.
So what can we do to identify congestion beyond the edge of our network? Well, we've looked at a few different signals. Before, we were looking at, for instance, per-prefix traffic rates, so I could figure out how much of Facebook's traffic is going to go on a circuit. That doesn't work here, because cross traffic beyond our edge, from other content providers, is being mixed in, and we don't know how much traffic they have. Circuit capacities — oftentimes you aren't going to know, downstream, how much capacity my transit has with the end-user network; I have no idea. What that means is that you have to instead use route performance measurements: you have to infer congestion from these performance measurements. But that can be particularly challenging, because you can see things such as latency increases and you aren't sure whether that's due to a path change, a change in client population, or actual congestion.

Likewise, you don't know how much traffic to shift; you have to continuously probe for capacity, as downstream a failure may occur, reduce capacity for 20 minutes, and then be resolved — so it requires a trial-and-error discovery process. Likewise, interactions with other networks also create complexity.

They may also respond to congestion signals and thereby reduce the amount of traffic they're putting on those links, and you may increase your traffic, and you may oscillate together. So it's very difficult to get a signal here as to how much traffic I should put on this link. Even if you know the current status — congested or not — that doesn't mean that five minutes from now it's going to be in that same status. So, stepping back from all this: what's really new here? These problems in general have been known for quite some time.
B
In
this
case,
what
we
have
actually
at
each
router
there's
multiple
routing
instances
and
those
routing
instances.
The
DHCP
marked
packets
arrive
at
each
instance
based
on
the
dscp
value,
so,
for
instance,
gscp
value
50
will
arrive
at
Rowdy
instance
50,
and
we
inject
routes
into
each
of
those
instances.
If
there's
no
route
injected,
the
router
will
fall
back
to
the
default
route
instance.
So
this
allows
us
to
customize
on
a
per
destination
per
classic
craft
per
classic
traffic
class.
As
to
whether
or
not
we're
gonna
override
the
route.
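The lookup behaviour just described — per-DSCP instance first, default instance as a fallback — can be sketched as follows. The tables and values are invented for illustration; this is not router code.

```python
# Sketch of the per-DSCP routing-instance lookup with fallback to the default
# instance (plain BGP best path) when no override route was injected.
import ipaddress

def pick_next_hop(dst_ip, dscp, per_dscp_routes, default_routes):
    dst = ipaddress.ip_address(dst_ip)
    for table in (per_dscp_routes.get(dscp, {}), default_routes):
        # Longest-prefix match within the chosen table.
        matches = [p for p in table if dst in ipaddress.ip_network(p)]
        if matches:
            best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
            return table[best]
    return None

default = {"0.0.0.0/0": "peer-link"}
overrides = {50: {"203.0.113.0/24": "transit-link"}}  # injected by the controller
print(pick_next_hop("203.0.113.7", 50, overrides, default))  # -> transit-link
print(pick_next_hop("203.0.113.7", 0, overrides, default))   # -> peer-link
```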
D: Aaron Falk. It seems like one of the effects of this mechanism is that it increases the dynamics of the route changes that packets experience, and I'm wondering if you've looked at the impact this has on individual flows. My experience with Facebook is that most objects are pretty small, but it's unlikely that both paths are going to have the same latency, and so for a particular flow...

B: So the way the decision process works today, it's likely going to continue to select the same routes, or the same destinations to shift, as the load increases. Say I'm a hundred megabits per second over my capacity: I'll choose X to shift. Now I'm 200 megabits per second over: I choose X and Y. What that means is that once we've shifted something over, we're likely to continue to shift it. It's not always that we will — there is some level of optimization there where we can change what we're shifting over time.
E: Hi Brandon, I'm Dave Plonka. Neat idea about injecting the BGP prefixes, and I guess the failure mode, then, is that if Edge Fabric doesn't work, it falls back to BGP. I was wondering: you gave an example where you showed two non-adjacent v4 prefixes aggregating to more than the bandwidth on the 10 gig, and you selectively chose one to offload, say two and a half gigs of traffic or something. Where did you get those prefixes from?

B: So the general aggregation here is: we get samples from IPFIX or sFlow, we aggregate them up to the most specific prefix advertised via BGP, and then we break those prefixes apart again further. Let's say I have a /20 which is one gigabit per second of traffic; we'll break that /20 up into smaller prefixes, /21 or /22, until we get down to a certain granularity.

In this case — I think we discussed in the paper splitting until we get at least 250-megabit-per-second granularity — which means that when we're shifting traffic, we can shift in 250-megabit-per-second buckets. That allows us to keep utilization at that high threshold. Okay.
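The splitting step Brandon describes can be sketched with the standard ipaddress module. This is a minimal illustration with made-up numbers, assuming demand splits evenly across halves (a real system would re-measure demand per subnet).

```python
# Minimal sketch: recursively split a BGP prefix into halves until each piece
# carries at most ~250 Mbit/s of measured demand, so traffic can later be
# shifted in roughly 250 Mbit/s buckets.
import ipaddress

def split_prefix(prefix, demand_mbps, bucket_mbps=250, max_len=24):
    net = ipaddress.ip_network(prefix)
    if demand_mbps <= bucket_mbps or net.prefixlen >= max_len:
        return [(str(net), demand_mbps)]
    pieces = []
    for half in net.subnets(prefixlen_diff=1):
        # Assumption: demand splits evenly between the two halves.
        pieces += split_prefix(str(half), demand_mbps / 2, bucket_mbps, max_len)
    return pieces

# A /20 carrying 1 Gbit/s ends up as four /22s of ~250 Mbit/s each.
print(split_prefix("198.51.96.0/20", 1000))
```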
B: The decision processes are independent. We actually prefer to move v4 over v6, and that's because with v6 we've seen cases where you shift it to a different route, that route is actually black-holing the traffic, and then you end up oscillating, because you shift the prefix away and back each time — and this is likely just because v6 routes are less well groomed than v4.
F: My name's Stuart Cheshire, from Apple. Throughout the presentation you talked about demand as being a fixed thing, like "we have 12 gigabits of demand going into a 10-gigabit pipe." But all the transport protocols I know, like TCP and QUIC, adapt their throughput, and if you send a sustained 12 gigabits into a 10-gig pipe and lose 20%, it's not going to continue losing 20% — the senders are going to slow down their rate.

So I didn't understand why the normal congestion-control algorithms that adjust the rate did not slow down when they're too fast and, conversely, speed up when they're too slow. If there's excess capacity, TCP will speed up until it uses all the capacity, because there's no such thing as Facebook loading a picture too fast, right? I want it to load as fast as it can, which should be all the capacity that's available.

B: So let's say I have a single prefix that, if I were to send it all through this link, would congest it; I would have shifted it away on a previous iteration. Now its utilization has been able to continue to climb, and so now we're actually above what the link's capacity is. In terms of the transport protocols reacting, you're right, but you're still going to end up with a poor user experience, because you're still going to end up with packet loss in order for those transport protocols to react. Also, many of our flows are very short, which means you have a lot of flows constantly going through slow start, which means they're going to end up interacting in a poor way when you have a lot of congestion on the link. Thank you.
G: [Name and affiliation unclear.] It's not a question, it's more of a remark. You are struggling with the good old problem of a congested link and information about congestion, and you stopped just one step before reinventing frame relay and its means of signaling congestion. I hope the result of your ongoing work will be that you propose something like BECN for BGP. Thank you.
H: [Name unclear.] This might be in the paper — I haven't read it — but you implied, I think, when you were looking for congestion off-net, through exchange points or in remote networks, that you were doing active probing to detect those congestion conditions. So I was wondering if you'd considered pulling those kinds of insights directly from TCP, where you already kind of have, through passive observation, some indication of whether transport protocols are being throttled, even before packet loss exists.
I: Volodya, [affiliation unclear]. I wonder — the primary and first control that you have for directing your traffic seems not to be mentioned, or at least not explained. The first thing that you, I guess, are doing is server selection: deciding to which of your server clusters, at which location, you direct the queries of these customers. And I guess, essentially, the predictions of how much traffic will be generated this way from each of the server clusters go into Edge Fabric as the estimate of the generated demand, the traffic volume. But I wonder: is there no feedback that actually feeds back?

B: To be clear, there are two controllers here; I don't talk about the other one. There is a global controller, which decides which point of presence around the world an end user's traffic will be sent to, and then there's this local controller, which at each point of presence decides how we're going to egress that traffic.

Those two systems do have some cohesion between them, and the interactions that you described do exist. In terms of how we decide what the demand is at each point of presence: that's not based on the global load balancer, it's based on IPFIX or sFlow measurements at that local PoP. That allows us to get, in near real time — every 30 seconds — exactly how much load there is right now at that location.
J: Thanks for the introduction, and thanks for having me here. OK, so this talk is on a paper at IMC last year — as already mentioned, "BGP Communities: Even More Worms in the Routing Can" — and, full disclosure, part of this work was presented at the last IETF meeting by Randy in GROW, but now you will get the full take.

So, for me as a researcher, this means I should probably take a look at what's actually happening there. What are we talking about? We are talking about the short BGP communities you probably know, defined in RFC 1997. They are a 32-bit value, usually split in half: the first 16 bits are an AS number and the latter 16 bits are a value, where each AS agrees with its peers on values — what they should mean or what they are being used for. So there are no strict semantics in it.

Peers have to agree upon it themselves, and, as you have noticed, it's only 16 bits, and we now have AS numbers which are larger than 16 bits. Finally, we got large communities, defined in RFC 8092; they are 12 bytes, so now you have three fields, each with significant space to use, and ASes with 4-byte ASNs can actually use communities. Here the first four bytes are defined to be a global administrator.
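A minimal sketch of the two wire formats just described (values here are invented examples): a classic RFC 1997 community is a 32-bit value conventionally written ASN:value with 16 bits each, while an RFC 8092 large community carries three 32-bit fields, the first being the global administrator.

```python
# Sketch of classic (RFC 1997) and large (RFC 8092) community formatting.
def parse_classic(raw32):
    """Split a 32-bit community into (asn, value)."""
    return raw32 >> 16, raw32 & 0xFFFF

def format_classic(asn, value):
    return f"{asn}:{value}"

def format_large(global_admin, data1, data2):
    return f"{global_admin}:{data1}:{data2}"

asn, value = parse_classic(0xFDE80141)   # 65000:321, made-up example
print(format_classic(asn, value))        # -> 65000:321
print(format_large(196608, 1, 42))       # a 4-byte ASN as global administrator
```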
Besides the confusion of the naming — whether they are long or large communities — we spotted other problems when we tried to do our measurements. The large communities were not really used in 2018: we only found fifty-one global administrators actually using them, so nothing we could actually measure at Internet scale. This has since become better, and if you're interested in the uptake of large communities, Emile from RIPE has published an article where he looked into the development of the uptake of large communities.

So now we have around 120 global administrators using large communities. But how are communities being used at all? In general, they can be split into two groups. We have informational communities that have passive semantics; they are used for location tagging — where was this prefix learned, in which PoP — and we have seen RTT tagging. On the other side, we have action communities that carry active semantics; they are used for triggering blackholing or actions in other ASes, for example path prepending.

The problem is that without documentation of these values you cannot see whether this is an action or a passive community — whether the semantics are active or passive — because, as already mentioned, the peers decide among themselves what these community values mean. There is no bit indicating whether it's an informational or an action community, and this leads to lots of problems.
J
Although
we
have
to
RFC's
actually
defining
how
communities
should
propagate
or
should
not
propagate
our
15
1997
states.
Communities
are
transitive
optional
attributes,
so
they
should
be
forwarded
to
your
peers.
An
RFC
74/54
says
you
should
scrub
communities.
You
are
using
inside
your
network,
so
he
cannot
be
manipulated
from
outside,
but
forward
for
any
communities
by
other
users,
so
it
should
be
expected
date
that
they
are
actually
propagating
through
the
internet.
Still,
a
lot
of
people
do
not
expect
this
and
a
lot
of
trends
providers
don't
actually
forward
them.
J
We
only
found
14%
of
transit
providers
propagating
received
communities
and
yes,
this
value
seems
to
be
small,
but
the
Internet
graph
or
the
EAS
graph
is
highly
connected.
So
you
actually
end
up
in
communities
traveling
quite
quite
a
lot,
but
still
many
people
do
not
expect
them
to
propagate
it
widely,
and
the
problem
here
is
that
this
leads
to
some
potential
for
misuse
as
they
are
propagating
through
the
internet
and
can
trigger
actions
multiple
hops
away,
and
there
is
no
way
for
an
operator
to
find
out.
If
this
is
intended
or
not.
J
This
leads
into
a
problem.
You
cannot
say
well,
this
is
traffic
management
and
this
is
legitimate,
or
this
is
an
attack,
and
we
ask
ourselves
the
question
if
there
are
also
unintended
consequences
in
this
combination
of
b2b
communities
being
transitive
and
forwarded
and
used
for
actually
changing
routing
decisions,
and
our
assessment
in
the
end
is
yes,
there
is
a
high
risk
for
attacks,
as
we
already
see
some
attacks
as
well.
J
So
what
we
were
looking
at,
of
course,
we
took
all
the
publicly
available
b2b
data
we
can
find,
and
in
the
end
we
find
that
75%
of
BGP
announcements
that
we
looked
at
heavenly's,
one
BTP
community
set
and
in
2018
it
were
five
to
six
thousand
a
ESS.
Now
it's
more
than
ten
thousand
yeses
that
make
use
of
is
short
communities
now,
taking
a
step
back
and
looking
at
the
propagation
again
what
we
can
actually
measure
or
what
we
cannot
measure.
J
We
have
this
very
complex
topology
of
four
ESS,
where
a
as
one
is
announcing
a
prefix
P,
and
this
is
recorded
in
a
s4
which
could
be
a
collector
or
just
a
simple
peer
with
the
is
path
for
three
two
one.
As
expected,
and
now
a
s2
is
taking
the
prefix
P,
where
the
community,
in
our
case
2,
colon,
3
or
3,
so
2
is
the
a
s
actually
defining
the
meaning
of
this
community,
and
this
will
be
transported
finally
to
a
s
4.
J
So
it's
4
is
recording
this
community
in
its
routing
decision
in
its
rip.
So
a
s2
has
added
this
informational
community.
Now,
a
s2
is
also
adding
a
community
for
signaling
it
or
triggering
an
action
in
a
s3.
It's
upstream.
This
is
also
for
what
is
2
a
is
4,
so
both
of
these
companies
are
now
present
or
visible
in
a
s4,
but
a
is
for
cannot
know
who
actually
has
added
these
communities,
and
so
can
we,
but
we
needed
this
for
our
measurements,
so
we
had
to
come
up
with
a
solution
for
us.
If we plot these values — which, again, is a lower bound on travel distances — we end up with this ECDF: on the x-axis you see the AS hop count, and we find that 10% of communities have an AS hop count of more than six, so they traverse more than six different ASes from where we assume them to have been added, and more than 50% of communities still traverse more than four ASes.

Now, looking at another very complex AS topology: AS1 is again announcing a prefix to AS2, and adding a community 3:123 to inform AS3 or execute path prepending there. You will notice that this community value is also propagated to AS4, and although it's only intended for signaling something towards AS3, AS4 also receives the announcement with this community. So we end up with two different AS paths.
In the first case, for our research, we call this community on-path, because the AS value from the community is present on the AS path that we record. At AS4, via the other path, we call this community off-path, because the AS number 3 is not present on the AS path. It could also be that the AS being signaled is further hops away, behind AS4, but in both cases this would be called off-path.
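A sketch of this classification rule, with invented data: a community is "on-path" when the AS that defines it (the first field) appears in the recorded AS path, otherwise "off-path".

```python
# Sketch of the on-path / off-path classification used in the talk.
def classify(community, as_path):
    """community: 'asn:value' string; as_path: list of AS numbers seen at the collector."""
    asn = int(community.split(":")[0])
    return "on-path" if asn in as_path else "off-path"

print(classify("3:123", [4, 3, 2, 1]))  # -> on-path  (AS3 is on the path)
print(classify("3:123", [4, 2, 1]))     # -> off-path (AS3 is not on the path)
```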
Now, coming to the experiments we did to show that there actually are some problems out there on the Internet: all of the experiments were done first in a lab environment and then validated on the Internet — with, of course, operator consent — and I will show two different scenarios in this talk. There are more in the paper, and the configurations of our routers are publicly available. So first, going back again and giving an intro: how is remote triggered blackholing supposed to work?

Here AS1 is announcing a prefix to its upstream, AS2, and then receiving traffic; this is expected. Sometimes you have the problem that you receive more traffic than you actually want to attract — we call this a denial-of-service attack — and one mitigation is AS1 signaling to AS2 that it wants to blackhole a prefix.

Usually this is done in band, in the same BGP peering session in which the normal BGP announcements are sent; there are also cases where it's a special BGP session, which has other problems, but not the ones mentioned here. So AS1 announces the prefix P tagged with the blackholing community, to signal to AS2 that it should drop the traffic. AS2 is, of course, still announcing the prefix P to all its peers, but without the blackholing community. What happens now is that AS2 drops the traffic towards P at its border routers, and the link between AS1 and AS2 is relieved of the denial-of-service traffic and is usable again. So you sacrifice part of your network, or part of the prefix's IP addresses, to keep all of the other prefixes and servers reachable. This is how it should work.
So you check that these prefixes are owned by the customer, or that the customer has permission to blackhole them, and this leads to the fact that you need different policies for customers and peers, different access control lists, and a lot of configuration overhead for a secure usage of remote triggered blackholing. And of course, when receiving such communities, you have to add NO_ADVERTISE or NO_EXPORT to the announcement so you don't propagate it further. We also noticed some providers translating blackholing requests into the blackholing communities of their other upstreams.

Now we have the same topology, but AS2 is in the role of an attacker, and AS2 should just be a backup path to the prefix of AS1. But AS2 is able to add the blackholing community even though it's not on the best path: AS2 announces to AS3 that prefix P should be blackholed, and we noticed AS3 actually does that, although the best path is through AS1, and AS1, as the origin for P, is not actually requesting it.
J
The
other
problem
that
we
noticed
is
that
this
is
even
possible
if
a
s2
is
not
involved
in
in
any
connection
to
a
here's
one
at
all.
So
as
long
can
just
hijack
the
prefix
P
and
announce
the
prefix
P
with
the
black
hole
in
community
said,
and
we
noticed
that
in
some
cases
we
are
able
to
circumvent
ACLs
and
prefix
filter
lists,
because
the
black
holding
community
is
checked
before
any
prefix
filter
lists
are
applied,
so
we
were
able
to
confirm
this
on
the
Internet.
J
It
works
multi
hub
and
it's
hard
to
spot,
because
the
community
values
are
usually
on.
One
Accord
reasons
for
that
we
found
is
the
black
:.
Prefix
is
more
specific,
so
you
need
exception.
Rules
in
your
configuration
to
accept
I
was
left
32,
so
essentially
everything
that
smaller
than
/
24
and
some
providers
to
check
the
black
:
community
before
applying
any
prefix
filters-
and
we
even
found
some
configuration
guides
on
the
internet
which
had
this
problem,
and
they
were
the
example
configuration
provided
and
the
problem
here.
J
There
is
no
validation
for
the
origin
of
the
community.
Every
is
on
the
path.
Can
add
the
black
hole
in
community
for
the
upstream
provider
now
yesterday,
yup
Snyder's
gave
a
talk
at
the
IPG
where
he
presented
the
mitigation
for
this.
If
you
would
check
that
the
period
is
announcing,
the
black
holding
or
the
prefix
with
the
black
holding
community
is
on
the
best
path
and
only
then
accept
the
black
holding.
This
is
one
possible
mitigation
for
this
attack.
So
if
you
are
interested
in
that,
you
should
check
the
recordings
of
that
talk.
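A sketch of that mitigation as I read it (this is my interpretation, not Job Snijders' exact rules): only honour a blackhole request when the peer asking for it is the one you would use anyway to reach the destination, i.e. it is on the best path. The well-known BLACKHOLE community value is 65535:666 per RFC 7999.

```python
# Sketch only: accept a blackhole request only from the peer on the best path.
BLACKHOLE = "65535:666"  # RFC 7999 well-known community

def accept_blackhole(announcement, best_path_next_as):
    """announcement: dict with 'peer_as', 'prefix', 'communities'."""
    if BLACKHOLE not in announcement["communities"]:
        return False
    # Reject if the peer asking for the blackhole is not the AS we would use
    # anyway to reach that destination.
    return announcement["peer_as"] == best_path_next_as

req = {"peer_as": 2, "prefix": "192.0.2.1/32", "communities": [BLACKHOLE]}
print(accept_blackhole(req, best_path_next_as=1))  # -> False: AS2 is off the best path
print(accept_blackhole(req, best_path_next_as=2))  # -> True
```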
Now AS2 is our attacker, announcing prefix P with a community to do path prepending in AS3, which leads to the longer path over AS4 and AS5 becoming the preferred path for AS6. Why could this attack be interesting? Well, one thing could be: there is a wiretap between AS4 and AS5, and even if you identified that AS2 is your attacker and screened the network of AS2, you would not find any network tap there, because they redirected the traffic on purpose to AS4 and AS5, where the actual network tap is — it could be that AS2 is being forced to cooperate here. The other thing is that it could just be a denial-of-service attack: it's known that the link between AS4 and AS5 is a very thin link, with not as much bandwidth as would be needed, so by redirecting traffic there you could actually fill that link.

After I gave this presentation at the RIPE meeting, we were actually approached by Dyn, and they pointed us to an article where they found that attackers today are already using communities to foster the propagation of hijacks. The attackers found out that by setting specific community values, their hijack will actually be propagated more widely in the BGP network.
J
We
found
prominent
isset
e
transitivity
standards,
documentation
and
monitoring
of
community
usage,
starting
with
authenticity.
I
mentioned
several
times
that
every
hey
s,
that
is
on
das
path,
is
able
to
modify,
add
or
remove
community
values
on
announcements
on
on
in
BGP,
and
there
is
no
attribution
possible.
It
means,
even
if
you
found
out
that
there
is
an
incident,
you
cannot
find
out
who
is
actually
responsible
for
that.
We
all
know
rpki,
but
intentionally.
J
This
is
not
able
to
secure
communities,
because
we
do
want
guesses
on
years,
path,
to
modify
routing
and
to
add
pathway,
pinning,
for
example.
On
the
other
hand,
we
also
see
that
operators
rely
on
the
correctness
of
community
values
because
they
are
basing
policy
decisions
on,
for
example,
where
a
route
has
been
learned
and
large
communities
are
there,
but
they
only
partially
improve
the
situation,
because
all
of
these
points
still
apply
to
large
communities.
They
only
fix
the
first
part
of
being
an
IAS
number.
J
So
the
question
is:
how
can
we
achieve
authenticity,
or
at
least
attribution?
So
after
an
incident?
You
know
who
you
have
to
talk
to
to
prevent
further
problems
in
the
future.
Another
thing
that
could
probably
big
discussion
series
we
now
communities
are
very
helpful
in
debugging,
because
you
know
what
is
happening
in
the
network
and
why
certain
networks
are
forwarding
traffic
in
a
certain
way,
and
they
are
indeed
a
very
easy
law
over
to
communications,
channel
and
widely
news.
We
still
only
see
them
being
used
one
or
two
hops
away.
J
You
usually
do
not
signal
black
holding
five
or
six
hops
away,
or
you
do
not
usually
need
to
inform
peers
six
or
seven
hops
away.
So,
on
the
other
hand,
you
have
a
high
risk
for
abuse
in
communities
being
transitive.
So
the
question
is:
do
we
have
a
high
risk
here,
or
do
we
have
more
benefit?
Do
we
need
a
discussion
between
benefit
and
risk
using
full
transit
or
allowing
community
values
or
communities
to
be
full
transitive?
J
Monitoring
is
another
field
full
of
MIS
misunderstandings.
We
know
there
is
no
global
state
in
BGP,
it's
a
highly
distributed
system,
and
even
if
we
look
at
all
of
the
rod
collectors
that
are
available,
we
will
only
see
the
end
result
recorded
at
these
collections.
We
do
not
know
what
has
happened
on
the
path
between
peers,
even
if
you
are
able
to
look
into
lookingglass
as
many
earlier.
J
It's
very
hard
to
spot
differences,
so
inferring
modifications
between
the
origin
or
das,
setting
communities
and
the
collector
is
almost
impossible,
and
even
if
you
would
be
able
to
record
all
of
these
changes,
you
still
have
to
probably
do
not
know
what
these
community
values
actually
mean,
and
there
is
no
general
way
for
attribution
of
changes
or
recording
who
actually
change
anything.
So
monitoring
community
values
to
detect
abuse
is
extreme,
be
difficult
in
this
in
this
field,
and
we
have
the
other
great
feel
of
standardization
in
the
short
communities.
Doing our research, we found another very large problem, and this is documentation, because all of the ASes can define their communities themselves. There is no need for documentation, and there is no central point of documentation. We found that some ASes document them in whois or IRR databases or on their websites; some only provide community documentation in customer portals, or not at all.

So if you see a community, you cannot find out in an easy way what it means, and even if there is documentation it's often only in natural language, and parsing it is impossible — we tried, we failed. If you have a very limited scope, for example trying to find out which communities are used for geolocation or geo-tagging of prefixes, you can, of course, look for city names, airport codes, things like that, but parsing community documentation for a general-purpose application is not really feasible. So documentation is very limited and fragmented; it's very hard to actually find, or build, a dictionary of community meanings.
There is one nice example: they use only string representations instead of community values internally, and these string representations are then translated to short and large community values for their auto-configuration. An example would be "tag at origin.country.de", where "de" is a parameter for the community definition "tag origin country". So you see it's a hierarchical system; it says: this community is a tagging community, it has passive semantics, it is tagging an origin at the country level, and the country is Germany. Their system allows the definition of parameters for communities, and these parameters, together with the communities, are documented in one system. They have working code and they are already using this in production; right now they have an internal Internet-Draft-like document, and if you are interested in that, you should probably talk to Whittaker, who is sitting there and laughing.
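A minimal sketch of the general idea of symbolic, hierarchical community names (the names and numbers below are invented; this is not the system described in the talk): operators write policies against strings, and a small dictionary translates them to concrete large-community values for router configuration.

```python
# Sketch: translate hierarchical symbolic community names into large communities.
MY_ASN = 65000  # hypothetical global administrator

SYMBOLIC = {
    "tag.origin.country": (1, 0),  # (function id, base value); passive semantics
    "action.prepend": (2, 0),      # active semantics
}
COUNTRY = {"de": 276, "nl": 528}    # ISO numeric codes as the parameter encoding

def to_large_community(name, parameter):
    func, base = SYMBOLIC[name]
    return f"{MY_ASN}:{func}:{base + parameter}"

print(to_large_community("tag.origin.country", COUNTRY["de"]))  # -> 65000:1:276
```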
Then we came up with some recommendations for operators, based on our work. Of course, as the RFC already states, you should filter all informational community values you use that carry your AS number: if you are using communities to check where a prefix was learned, you should scrub those communities when you receive them from your peers, because they are defined by you internally and used by you internally.

It might be useful to come up with agreements with your downstreams, to define what they are allowed to do with your upstreams — whether they are actually allowed to do path prepending at your upstream for their prefixes. Of course, publicly documenting the communities you use is key to enabling other ASes to filter action communities too. So if I have a customer who I know might be playing around with BGP, I might want to filter things so he cannot trigger things in my upstream — but you need agreements for that.
J
So,
coming
back
to
the
general
problem,
PDP
communities
are
currently
the
only
feasible
way
to
realize
signaling
between
a
asses,
but
the
problem
is
that
is
secure.
Usage
requires
good
operational
knowledge
and
diligence.
We
do
not
think
that
a
very
over
complex
system
is
really
suitable
to
secure
the
shortcomings
of
bgp
communities,
but
we
have
to
be
aware
that
there
is
a
problem,
and
while
everybody
in
this
room
is
probably
able
to
handle
this
and
do
everything
correct
all
the
time,
we
cannot
rely
on
that
on
a
global
scale.
J
There
are
a
bunch
of
people
out
there
who
do
not
know
what
they
are
doing
and
there
will
never
be
a
world
in
in
which
everybody
is
doing
everything
correct.
So
the
question
is:
do
we
still
can
rely
on
or
do
we
still
want
protocols
that
allow
people
to
make
mistakes
that
will
break
other
people's
network
or
do
we
need
an
evolution
for
the
calls
here
that
are
less
fragile
and
trot
in
and
more
usable
or
other
or
with
other
to
to
to
prevent
people
shooting
themselves
and
others
into
the
in
the
food?
J
So
wrapping
up
communities
are
widely
news.
They
are
used
to
realize
policies,
they
are
needed,
but
they
heavily
rely
on
mutual
trust
between
the
peers,
because
there
is
no
less
enticing
and
security
in
place.
There
is
no
attribution.
Attacks
are
very
hard
to
detect
and
one
take
away
from
our
experiments.
We
did
some
prefix
hijacking
that
was
reported
on
Twitter,
but
nobody
actually
spotted
our
agree
direction.
Attacks
based
on
communities,
so
we
cannot
be
sure
if
there
are
other
people
already
doing
attacks
using
communities.
If
you
are
interested
in
more
details,
the
paper
is
available.
A: I would start with a question: to what extent have the people who did this research brought forward standards or working-group proposals to make these changes? Has the group who wrote this paper brought any work forward to the routing side of the IETF?
I: The thing is, one has to recognize here that, OK, the community system is offering a range of stuff where people — where operators — can be creative, and in the global system you have to control the creativity and the interaction of the various creators. But in the end you have to see that, yes, you always have bilateral relations that are mapped into BGP neighborship relations, and, yes, what is exchanged there should be seen as something that is essentially just bilateral.

And yes, if you want to be a responsible actor in the whole system, you have to really control what you are doing with your neighbors. And if you really take that understanding, you can actually start to build stuff that says: well, OK, you and me are peering, I'm a responsible person, I make an agreement about what we are doing on our relation, and if you and me make a decent effort at controlling — at doing the right policy for implementing our agreement — we actually have a chance of using that as fairly trustworthy. I might even go out and offer you an agreement in which I promise that I am relating to you in a controlled manner. The communities that somebody else is sending to me are something that does not work recursively over the whole topology, but with that very limited and closed understanding one actually can do stuff. And yes, having better documentation and more tools for doing it will help a lot to do this right, because a lot of the spreading of dubious information that you have observed is related to the fact that many of the operators just did their policy configuration at some point ten years ago.
M: Thank you for bringing this up. It's well known — we've been facing this for like 15 years, probably since people started doing remote blackholing. The mitigation today is mostly basic hygiene: you take care of what you accept and you take care of what you send. And there are some unintended consequences for very useful stuff: things like the bandwidth community, which is very useful in data centers — it was made non-transitive just to avoid these issues, but since most data centers use eBGP, we can't propagate this community. So there's definitely work needed that would allow us to use this stuff.
A: Yes, this is the case we talked about at lunch, where you give the talk without the slides.
N: I want to bring up to speed those folks who might not be super familiar with QUIC. So if you are an expert on QUIC and you know the background material, I apologize in advance; please bear with me as I go through it. All right, so I'm going to be talking about taking a long look at QUIC. This was measurement work that appeared at IMC 2017.

I don't think I need to convince anyone in this room that Internet connectivity is important, but just to set the stage and put things in perspective: in 2015, 3.2 billion people had access to the Internet. Obviously that number has increased over the past four years, but in that same year the number of people with running water was less than that. These two numbers next to each other — I find them depressing, for reasons that are out of the scope of this talk, but it emphasizes the importance of Internet connectivity. We use it in our personal
and our professional lives. Virtually every business depends on the Internet, and their viability is tied to the performance of the network that they operate on. So, naturally, there's a lot of effort to try to improve these networks and make them more reliable and more performant. The IETF is one of those efforts, and we do a lot of things: we come up with new protocols.

We use traffic management techniques to make sure that our networks are utilized in a way such that everyone's demands are met, and we even design our applications to adapt themselves to the underlying network, so we improve the user experience. Well, QUIC is one of those efforts. It's a transport protocol — it stands for Quick UDP Internet Connections — it started at Google, and it was basically a transport protocol designed with today's needs in mind. QUIC was designed for a few main reasons. The first one was to facilitate rapid deployment.
What that means is: whenever you have a big change, let's say for TCP, and you want to deploy it at scale, everyone needs to update their operating system, and we all know that can take a long time — the Windows XP machines that are still around are evidence to that effect. So QUIC solves that problem by being implemented in user space.

What this means is that whenever you have a new version of QUIC — say you're browsing the web and using a browser — all the users need to do is update their browser, and then they have the new version of QUIC. Obviously, this also means that a lot of the guarantees that TCP provides, like reliable delivery, QUIC has to provide itself.

Another main reason for QUIC — which Google never shied away from pointing out — was to avoid ossification by middleboxes. We all know there are many middleboxes in networks; these could be NATs, security firewalls, web caches, and many other appliances. A lot of them claim that they improve performance — perhaps in some cases they do — but there's also a lot of evidence that they actually do more harm than good.

One of the examples that I find very interesting: this was joint work by Google and T-Mobile a few years ago, presented at the Velocity conference, where they looked at YouTube traffic over T-Mobile's network and how it interacts with their web proxies, and this is a summary of the findings from their slides. They basically found that it's better that YouTube traffic does not go through the proxies, because they're hurting performance. And I don't want to point any fingers at T-Mobile or YouTube —
this is not an issue isolated to them. Another example: this is taken from a Cloudflare blog post, where they were basically saying: we had TLS 1.3 enabled for a while, but no one was using it, because the browsers were not supporting it, and they were not turning it on because middleboxes were breaking it. To be fair, it wasn't just middleboxes — there were other issues that prevented TLS 1.3 from being deployed at scale — but middleboxes were not helping.

TCP Fast Open is another example that a lot of folks believe never got deployed at scale because of middleboxes, and the list goes on. All of these things can happen because in TCP all of your headers are in the clear, so middleboxes can see them and act upon them: they can modify them, drop and add headers, or break your connections in two — all the things you're familiar with. Whereas in QUIC pretty much everything is encrypted, so you take all that away from middleboxes; they can't do any ossification or meddling.

And finally, QUIC was proposed to improve performance. Just a side note here: I mean performance for HTTP traffic. I should mention that QUIC is eventually going to be a general-purpose transport protocol, but it started with HTTP in mind and that's its biggest usage right now; it's very integrated with HTTP. So throughout this talk we're basically going to focus on HTTP over QUIC.
So whenever I say QUIC, I mean HTTP over QUIC. QUIC improves performance through a number of optimizations. The most famous one is zero-RTT connection establishment. If you're familiar with TCP, you have that three-way handshake to establish a connection before you can send any data; if you have TLS on top of TCP, as you should, there are going to be more RTTs. QUIC tries to achieve zero-RTT connection establishment, which means you can start sending data from the very first packet. Obviously that doesn't always work:

you need to have contacted the server before and have valid keys for zero-RTT to work. If you don't, then it's going to be one or two RTTs, but after that everything else is going to be zero-RTT. QUIC also removes head-of-line blocking. What is that? If you have an HTTP stream — with HTTP/1, you have a stream, you have to open a TCP connection; if you have more than one stream, then you have to open more TCP connections, and we all know that all those connections have overhead and they're competing over bandwidth.
So it's not a great use of resources. HTTP/2 solves this by multiplexing HTTP streams onto a single TCP connection. This is great — it gets rid of a lot of overhead. However, if any of these streams is blocked, for whatever reason, then all of the streams are blocked, and the reason for this is that TCP is agnostic to the HTTP streams: as far as TCP is concerned, you have a stream of bytes that needs to go from one end to the other. QUIC solves this by mapping HTTP streams onto QUIC streams.

QUIC has improved loss recovery: it helps mitigate the ACK ambiguity problem that TCP has, and it has better RTT and bandwidth estimation. A lot of this good loss recovery comes from the fact that you can easily change the congestion control as well — so, for example, if you have BBR, a new congestion control, you can easily replace your old one with the new one — and that comes from the first point I talked about: everything is implemented in the application layer.
So you can easily update things and deploy them at scale. There are a lot of other optimizations that I'm not going to go into, but basically QUIC tried to learn from decades of transport-protocol evolution, take the good things that worked, and put them into a single protocol. A little bit of history: QUIC started in the early 2010s at Google, as I said. I think it was in 2013 that it was publicly announced, and Google started using it; soon after there was a spec draft, and towards the end of 2016

the IETF working group started, and the working group has been very active. There are many implementations of QUIC around; Google's QUIC is at version 47 now, and the working group is working fast, and hopefully soon we're going to have a standard version of QUIC and everyone is going to be using that. All right, so that's why QUIC started and a little bit of its history. But, as I said,
one of the main reasons for QUIC was improved performance. So Google has been reporting on QUIC's performance — they've been using it heavily and they've been putting out reports that it helps with page load time, with YouTube rebuffering, all these great numbers showing it's perfect and very promising. However, the issue with these is that they're all aggregated statistics and not really reproducible by anyone else, unless you're Google and have access to that data, and they don't really report any controlled tests; again, everything is aggregated statistics.

At the time we started our work, there were other evaluations of QUIC in research venues. However, most of them were limited to emulated networks and limited tests, they used old, untuned versions of QUIC — I will get into what that means in a bit — and the results they provided were not necessarily statistically sound.
So, as I said, we're going to look at HTTP performance and compare QUIC and TCP. We have a very simple setup: we have a client on one end, which could be a desktop client or a mobile client; we have a server on the other end, which supports both QUIC and TCP; and on the network in between we can emulate different conditions and see how the two compare to each other.

Our server hosts a bunch of web pages and objects with different sizes — pages with different object sizes and different numbers of objects — and we fetch them using QUIC and TCP and compare the performance. I must point out that, even though I'm not going to go into the details, once we get all the results we run a statistical test to make sure any difference that we see is not due to noise or network variations or things that are not real differences between the protocols.

So whenever we report a difference between the two protocols, we are confident that this is a difference in performance and not noise or anything else.
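The talk does not name the specific test, so here is a sketch of the kind of check described, using a standard non-parametric two-sample test as a stand-in: a QUIC-vs-TCP difference is only reported when it is statistically significant.

```python
# Sketch: declare a QUIC-vs-TCP difference only when it is significant.
from statistics import mean
from scipy.stats import mannwhitneyu  # non-parametric two-sample test

def compare(quic_times_ms, tcp_times_ms, alpha=0.01):
    stat, p = mannwhitneyu(quic_times_ms, tcp_times_ms, alternative="two-sided")
    if p >= alpha:
        return "no significant difference"
    faster = "QUIC" if mean(quic_times_ms) < mean(tcp_times_ms) else "TCP"
    change = abs(mean(quic_times_ms) - mean(tcp_times_ms)) / mean(tcp_times_ms)
    return f"{faster} better by {change:.0%}"

print(compare([55, 57, 54, 56, 58], [99, 102, 98, 101, 100]))  # toy numbers
```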
So
the
setup
is
pretty
simple,
but
in
2016,
when
we
were
doing
these
tests,
we
had
this
big
issue
of
finally
having
a
server.
This
supports
quick,
it's
not
like
TCP.
There
wasn't
a
quick
module
for
Apache
servers,
the
different
many
options
around.
So
basically,
our
two
real
options
were
either
use.
N
Google
servers
because
Google
at
the
time
had
quick
and
basically
hosts
our
stuff
on
Google
servers
and
run
our
tests
against
Google
or
use
a
server
that
comes
within
the
chromium
code
base.
Well,
the
first
option:
Google
servers
didn't
really
work
for
us
for
the
first
obvious
reason
that
we
had
no
control
over
it.
N

It's half a second! So basically one third of our download time is wait time. We did some tests, and we realized this wait time kind of exists in Google App Engine. We weren't sure why it was happening, and obviously we didn't have any access to the server to investigate it more. This was not good for us, because if we're checking performance and comparing millisecond times, a half-second wait time is not okay. So we decided to use the server in Chromium.
N

However... so the bar on the left is the exact same experiment, but with the Chromium server, the server that is part of Chromium. Now you can see that the huge wait time is gone, which is great, but now our download time is much bigger compared to Google. And this is problematic, because these two plots next to each other are telling me that the server in Chromium cannot provide the performance that QUIC is able to provide, since we clearly see that Google is doing better.
N

So we had to try to infer what configuration the Google servers are using and fine-tune our Chromium server to make sure it matches the performance that Google gives. We did that; along the way we found some bugs, and we fixed them. I'm not going to go into the details, but I'm happy to talk about it offline.
N

After we did that, the bar on the right is basically the same experiment using our Chromium server after adjusting it, and not only do we not have that big wait time, our download time is now similar to what Google provides. And this is obviously not the only test that we ran.
N

We did a bunch of tests like this, and throughout we used Google as our baseline and matched our performance to Google. I spent time on this slide to explain it because there was a bunch of research work before us that did a lot of great work with QUIC, but none of them went through this step of optimizing the server, and pretty much all of them reported poor performance for QUIC, at least in some scenarios, which we know for a fact is because they were using a server that was not performant. All right.
N

So in this case the RTT is 36 milliseconds, the loss is insignificant, and those numbers, the 45 and 44 percent, mean that when we're downloading that five kilobyte object using QUIC and TCP, the download time for QUIC is forty-five percent better than TCP. Now, to avoid bombarding you with a lot of numbers, I'm going to replace that with a heat map: just think of it as red means QUIC is doing better, blue means TCP is doing better, and white means there's no statistically significant difference between the two protocols.
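As a small illustration of that encoding (the numbers, scenarios, and object sizes below are invented, not the paper's data), the masking and coloring could be produced like this:

    import numpy as np
    import matplotlib.pyplot as plt

    # Percent improvement of QUIC over TCP per (scenario, object size);
    # positive means QUIC is faster, negative means TCP is faster.
    improvement = np.array([[45.0, 12.0,   3.0],
                            [30.0, -8.0, -20.0]])
    significant = np.array([[True, True, False],
                            [True, True, True]])

    masked = np.where(significant, improvement, 0.0)      # non-significant cells -> 0
    plt.imshow(masked, cmap="RdBu_r", vmin=-50, vmax=50)   # red = QUIC, blue = TCP, white = 0
    plt.xticks(range(3), ["5 KB", "200 KB", "10 MB"])
    plt.yticks(range(2), ["no loss", "reordering"])
    plt.colorbar(label="% difference in download time")
    plt.show()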
N

So far, everything was great and we were very excited. Then we did this experiment where we added some packet reordering, and as soon as we added packet reordering, things started to change. We actually saw cases, the blue cells on the right side of the plot, which are the big objects (the last column is a 10 megabyte object), where with packet reordering QUIC is doing worse than TCP. So we wanted to see why this is happening.
N

We looked at QUIC's code, instrumented the code, and looked at TCP to see how it copes with packet reordering. Basically, what we found is that TCP has a mechanism where, when packets are reordered, it increases its reordering threshold, so it can cope with that reordering, whereas QUIC didn't have that mechanism in place. When packets were reordered deeper than its threshold, QUIC was basically assuming those packets were lost, so it was going into loss recovery, and we all know what that means: performance was going down.
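A toy model of the difference being described (this is not Chromium's actual loss detector; packet numbers and thresholds are illustrative): a fixed reordering threshold declares late packets lost, while an adaptive one, like the TCP behavior just mentioned, learns to tolerate the observed reordering depth:

    def losses(unacked, highest_acked, threshold):
        # A packet is presumed lost once packets `threshold` or more
        # numbers ahead of it have already been acknowledged.
        return [p for p in unacked if highest_acked - p >= threshold]

    class AdaptiveThreshold:
        def __init__(self, start=3):
            self.value = start
        def on_late_ack(self, pkt, highest_acked):
            # The packet was only reordered, not lost: remember how deep
            # the reordering was and tolerate that much next time.
            depth = highest_acked - pkt
            if depth >= self.value:
                self.value = depth + 1

    # Packets 5..9 are ACKed while packet 4 is still in flight:
    print(losses({4}, highest_acked=9, threshold=3))  # [4]: spurious "loss"
    adaptive = AdaptiveThreshold()
    adaptive.on_late_ack(4, highest_acked=9)           # packet 4 finally arrives
    print(losses({6}, highest_acked=9, threshold=adaptive.value))  # []: now tolerated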
N

All right, so the next thing that we wanted to look at was zero-RTT, because that's one of the big improvements people talk about in QUIC. So we wanted to see how much zero-RTT helps with performance. Going back to our base example, where there's no loss and we have a 36 millisecond RTT, as I talked about, QUIC is doing much better than TCP.

N

You can really sense that when the object size is small. When your object is big, naturally, because your transfer is longer and your connection setup time is a very small fraction of your transaction, it doesn't have a big effect. Which is still great, because if you think about the web, most of the time you're actually requesting very small objects, so 0-RTT can help a lot in those scenarios.
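A back-of-the-envelope sketch of that point, using the 36 ms RTT from the scenario but with an assumed 10 Mbit/s link and illustrative object sizes (and ignoring slow start): the round trips saved by 0-RTT dominate a small fetch but are negligible for a large one.

    RTT = 0.036            # seconds, the 36 ms scenario
    BANDWIDTH = 10e6 / 8   # assume 10 Mbit/s, in bytes per second (hypothetical)

    def total_time(object_bytes, handshake_rtts):
        # Setup round trips plus a simple serialization-only transfer time.
        return handshake_rtts * RTT + object_bytes / BANDWIDTH

    for size in (5_000, 10_000_000):                  # 5 kB vs 10 MB objects
        t_setup = total_time(size, handshake_rtts=2)  # e.g. a TCP+TLS-style setup
        t_0rtt = total_time(size, handshake_rtts=0)   # 0-RTT resumption
        saved = (t_setup - t_0rtt) / t_setup
        print(f"{size} bytes: handshake is {saved:.1%} of the total")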
N

Sorry. So, comparing these two plots together, as we said, 0-RTT only helps for smaller objects, but we can see that QUIC is doing better for bigger objects as well. So we wanted to see what it is that QUIC does that helps it perform better. I have an experiment here which is a little bit extreme, but I like it because it helps visualize things a little bit better.
N

What we have here is a case where the bandwidth is changing between 50 megabits per second and 250 megabits per second, so it's a little bit of an extreme example, and we're downloading a 200 megabyte object, so it's a very long transfer, using TCP and QUIC, back to back. That's what this plot is showing: on the x-axis I have time, and on the y-axis I have throughput. As you can see, QUIC is able to achieve a much higher average throughput compared to TCP, which explains why it's able to get such better performance, especially when there's loss. So basically the takeaway is that QUIC is more aggressive and better at adapting itself to changes in the available bandwidth, which is great, but it also made us think about how QUIC shares the link with other traffic.
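As a sketch of how a link like this could be emulated (assuming Linux, root privileges, tc/netem, a placeholder interface name eth0, and the two rates read from the description above; this is not necessarily the authors' setup), the rate can simply be flipped on a timer while the transfers run:

    import itertools
    import subprocess
    import time

    IFACE = "eth0"                   # placeholder interface name
    RATES = ["50mbit", "250mbit"]    # the two rates the link alternates between
    PERIOD = 5                       # seconds at each rate (illustrative)

    def set_rate(rate: str) -> None:
        # Replace the root qdisc with a netem qdisc limited to `rate`.
        subprocess.run(
            ["tc", "qdisc", "replace", "dev", IFACE, "root", "netem", "rate", rate],
            check=True,
        )

    # Toggle for as long as the download runs (stop with Ctrl-C in this sketch).
    for rate in itertools.cycle(RATES):
        set_rate(rate)
        time.sleep(PERIOD)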
N

If QUIC is so aggressive in adapting itself to the available bandwidth, how is it going to play with fairness to other traffic? Because, as we know, we want different flows to be fair to each other, so that no flow shuts down other flows. So we made TCP and QUIC compete with each other over a bottleneck bandwidth, and we actually found out that QUIC is not fair to TCP: we found that QUIC takes more than its share of the shared bandwidth. We repeated that experiment with QUIC competing with multiple TCP flows, and we still got the same results.
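The talk describes the unfairness qualitatively; one common way to put a number on it (not necessarily what the authors used) is Jain's fairness index over the per-flow average throughputs, where 1.0 means perfectly even sharing and 1/n means one flow takes everything:

    def jain_index(throughputs):
        # Jain's fairness index: (sum x)^2 / (n * sum x^2).
        n = len(throughputs)
        return sum(throughputs) ** 2 / (n * sum(t * t for t in throughputs))

    print(jain_index([50, 50]))   # even split between two flows -> 1.0
    print(jain_index([90, 10]))   # one flow dominates           -> ~0.61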
N

To make sure this was not an artifact of our environment, we wanted to dig in a little bit deeper. So here I have the congestion window size for the two protocols; in this example they are both using CUBIC. As you can see, they start from the same congestion window size, but QUIC quickly increases its congestion window and takes an unfair share of the bandwidth, causing TCP to basically slow down. If you zoom in, you can actually see that QUIC is much more aggressive in increasing its congestion window.
N

All right, so I have one last thing to talk about before I run out of time, and that is mobile devices. Everything I talked about so far, the client was a desktop device. Again, going back to my base example of no loss and a 36 millisecond RTT, we saw that QUIC is doing better than TCP in most cases. However, we redid the exact same experiment, but this time the client is a mobile phone.
N

What we saw is that, while QUIC is still doing at least as well as TCP, you don't see any blue in there, the performance gains of QUIC started to diminish. So QUIC is doing better than TCP, but the gap is not as big as for a desktop client. We wanted to see why this is happening, so we instrumented the QUIC code to try to infer a state machine and see what's happening.
N

The idea is to see which state the protocol is in at every point in time. I'm going to show you the state machine for the case where we're downloading a 10 megabyte object at 50 megabits per second, and it looks something like this: a classical state machine, with different states, the percentage of time spent in every state, and the transition probabilities. This is a little bit difficult to read, so I'm going to replace it with a table.
N

As soon as I do that, hopefully things become clearer. As you can see, when we're using a desktop machine, QUIC is in the application-limited state for only 7% of the time; that's the state where the client is receiving data faster than it can consume it. But as soon as you go to a mobile device, where resources are more scarce, QUIC is in the application-limited state for 60% of the time, and this is exactly the price that QUIC is paying for being implemented in userspace.
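A minimal sketch of how such a table can be derived from instrumentation (the log format and state names here are hypothetical; the talk does not describe the actual output): given timestamped state samples, compute the dwell-time fraction per state and the transition counts.

    from collections import Counter, defaultdict

    samples = [  # (seconds, state): made-up illustrative data
        (0.0, "SLOW_START"), (0.4, "CONG_AVOID"), (1.1, "APP_LIMITED"),
        (1.3, "CONG_AVOID"), (2.0, "APP_LIMITED"), (2.6, "CONG_AVOID"), (3.0, "DONE"),
    ]

    time_in_state = defaultdict(float)
    transitions = Counter()
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        time_in_state[s0] += t1 - t0      # dwell time in the earlier state
        if s1 != s0:
            transitions[(s0, s1)] += 1    # count observed state changes

    total = samples[-1][0] - samples[0][0]
    for state, dwell in time_in_state.items():
        print(f"{state}: {dwell / total:.0%} of the transfer")
    print(transitions)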
We did tests in a variety of networks and environments. There are a bunch of other tests that I didn't have time to talk about, but I encourage you to read the paper if you're interested. We instrumented the code, extracted state machines, and that helped us provide some root-cause analysis for the performance that we were seeing. Finally, I just want to point out that this work was done two years ago, so at the time QUIC was at version 36; as I said, Google's QUIC is now at version 47.
N

However, nothing stops us from doing the exact same measurements on the new versions. We actually did that: in the paper we looked at QUIC from version 25 to 36, so we have the evolution of QUIC's performance, and we can do the same thing for newer and future versions. And with that, I'm happy to take questions.
O

This is very interesting, and thank you for presenting it. I wanted to ask: did you measure the retransmit rate in this bottom scenario?
N
O

What I wanted to ask about was: you noticed the difference in the fairness work, and I assume you mean that QUIC was consuming a higher proportion of the bandwidth. Did you compare that to the expectation, as observed here, that QUIC normally performs better than TCP? That must presumably mean that TCP will leave some of the bandwidth underutilized, or less utilized. Is it the same proportion here, or how different is it?
O

So maybe this is too complicated to ask, but what I was trying to get at is that we expect QUIC to perform better than TCP based on the prior observations, even when they're not competing, right? Which means that on the same kind of link, TCP must be leaving some bandwidth unutilized in order for QUIC to be able to beat it, right? So how...?
D

That was one of my questions. So I have a clarification question and a larger question. The clarification question is: what's the queueing discipline you're running on your bottleneck link? Are you running PBR? Were you running RED, an AQM, drop tail?
D

I'll follow up with the larger question, which sort of goes back to your very initial remarks. First: great work, very interesting, nicely presented, thank you; this is a good paper, thank you for coming here and presenting it. I'm interested in the 200 million users who have internet but no electricity, right? I think there's a lot of attention being paid to QUIC as, you know, higher performance and better utilization of congested resources.
D

But I rarely see performance numbers when you look at things like very heavily multiplexed, long-RTT T1 links that have dial-up at the end, and end users who are desperately trying to load simple web pages. I'd be really interested in seeing some comparison. I mean, we really want to make sure those users don't get screwed if the world migrates from TCP to QUIC. You know, it's not great now, but ideally you wouldn't want it to be worse. So I'm wondering if you spent any time looking at that.
N

I guess the closest to that that we experimented with is that we looked at some 3G mobile networks. We ran some tests there, and our results showed that QUIC is actually doing better than TCP; it's in the paper. The reason I didn't put it in here is that the things I put in here, I wanted them to be cases that I could isolate.
Q

Gabriel Montenegro, Microsoft. Thank you very much for this work. I think I heard you say that there may be some ongoing work, some more research. If that's the case, could I add a suggestion? You know, gQUIC, Google QUIC, is all fine and good, but the whole focus of the IETF effort is iQUIC, right, IETF QUIC.

Q

It would be great, since there are several implementations out there that you could use, if you would use those for the next phase of the testing, because it's really a pretty different protocol by today. And the best part of it is, if you find something egregious or something that might need to change, then there's still time to go back to the working group and actually have an effect on the protocol.
Q

If that's one of your findings, that would potentially be more relevant for the future than gQUIC, since possibly everybody at some point will be on IETF QUIC. So that's one suggestion, and the other one is more of a comment. You indicated that QUIC is implemented in userspace; that's one implementation. Ours, for example, runs in kernel or user space, it doesn't matter, so you could run it in the kernel or you could run it in user space. It's not part of the protocol itself, right? So I understand that, for the test, that's what you needed to do.
N

A

Okay, let's have a hand for Arash, excellent, and our ANRP talks. So, to pitch for the remainder of the year: there are four more great ANRP talks. If you want the links and you can't find them for some reason, I did put up an agenda slide set that is in the tracker, so you can find that, and also a humorous, program-related picture. But in any event, thank you for being here, thanks for the great questions for our folks, and that's the end of IRTF Open.
A

I'm glad you, you know, stuck with that one.