Internet Engineering Task Force 111, 29 Jul 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: IETF111-RTGWG-20210729-2200

Description

RTGWG meeting session at IETF111
2021/07/29 2200

https://datatracker.ietf.org/meeting/111/proceedings/

A

It's three o'clock on its busy day today. So let's start.

A

Can you hear me engine.

B

C

A

So welcome everybody to the second routine working group meeting at itf hundred eleven I'm going to welcome engine again welcome and it's great to have your scotcher, and I would like to thank chris bowers who's been colleague for many years and thank you again for everything you've done and we are going to progress.

A

Not well, please make yourself familiar with idf rules and they think that everything you share becomes property of atf and really really short agenda for today is talking so routing by stuart, unfortunately, addressing uh by tourists and self-healing networking slow label. There is no draft, yet uh authors are working on the draft, so we made exception here. The topic has been discussed in six-man number of venues and actually being tested and deployed. So it's really exciting topic here and stuart.

D

Right so uh asked to share slides I'm going to share the tcr slides uh tcr share right. Thank you. um So, first off there's a a small error in that crept into the agenda. The the draft is drafted, bcx, rtgwg tcr00.

D

So what we? What? What was our motivation for doing this right? New network demands stress the existing data plane protocols, things like collection of telemetry path, guidance, reroute protection, the incoming need for slos things like proof of transit, extensibility, programmability etc, including parameterized functions.

D

So we put together a sort of concept as to how we might design a protocol for for to address these needs, in particular to you know, to support multiple features. Concurrently, it's easy to add one feature. It's more interesting. If you want to add multiples, the need to authenticate metadata at intermediate nodes and and the need for extensibility in the in the design.

D

So the basic idea is to construct a packet out as a lightly structured set of tokens tokens as in computer science, tokens not token ring and cells, as in we make up a packet as a number of small components, in the same way that the human body is made up of cells. We apply longest prefix matching engines to invoke code points tokens.

D

We can combine to create more elaborate and interesting functionality, and we explain some of this in the use cases you can stack um the the tokens for per segment and per node behavior where needed.

D

So it's intrinsic, you should think of a token cell as a lego, brick and what we a lego, brick, that has token that's parameterization of the cells and we complement the the cells with cells that contain metadata and security support and some interesting parallelization stuff.

D

So the key differentiators from uh packets, as we've designed them in the past, are that we deliberately allow structured, non-linear parsing non-linear as in streaming media.

D

So you don't have to go from um one component of the package to the next and you don't have to deduce what you're going to do. The packet will tell you where you need to go next at this. This hop in terms of where you need to go in the packet next for processing, um so we support programmable behavior at multiple levels. In this approach uh we construct a packet from the tokens and then we combine them to get the defined uh behavior, and then we parameterize that behavior.

D

This will become quite clear and obvious in the next. uh The next slide. We have some tokens that specialize in things like security. We have tokens that specialize in scratch pad and another thing that is of interest. Although we have presented this as a packet design in its own right, we, I believe it's also an interesting method of describing advanced functions for extending existing protocols. So it's quite a good ancillary data or metadata design structure.

D

So here's a packet preamble, a few bits of pieces of housekeeping and then it's just a series of token cells as many token cells as it needs to define what is to happen to this packet during its lifetime across the network.

D

One of the things that we we thought was that actually the payload may exist in a token cell, so why would we put the pay? We can clearly put the payload on the end of the packet if we um if we want to, but why would we want to put the packet in a token cell? Well, some. There are some thoughts in the research community that you might want to modify the payload as the packet goes through, for example, for congestion management, for example.

D

Instead of throwing the whole packet away, you just it may be that in certain applications, it's acceptable and desirable to um throw some components of the payload away, for example, think about um elements of a video system, for example, where you you may be able to dispense with some bits, but other bits are important.

D

This is an optional approach. um If and of course we may have packets without payloads anyway about for oam purposes so right this is. This is one of the key slides right. So what does a token cell look like um it? Has a um it has a length because it, unlike, for example, mpls it's uh each cell, will be variable length. um It has the the a pointer to the next token to process at this hop after you've processed this token.

D

So the idea is we chain these tokens together to get the functionality we need, and then we have the match zone. So the match zone is the is the part of the token that we put into a longest match engine, so that consists of two components: we have the the cell type and perhaps sub id, and then we have the prefix, so the prefix of the token cell blob. So this is where the parameters go so think about this.

D

For example, this might be specifying an ipv6 address and then the prefix might be the ipv6 address itself or it could be. The ipv6 address itself concatenated, with a more sophisticated programming parameter in order to do something. A little more sophisticated than uh ipv6 network programming is doing by having a more more sophisticated parameter in there that you look up in the lookup engine. Of course, you you, you may discover that it's adequate to simply look at the token cell type and in all cases, once you've done the lookup.

D

You know they now know the structure of the suffix. So what's happening is the front of the this middle piece of the token here is vector, urine vectoring you into some code like an mpls label would and but, unlike an mpls label, the token carries a set of parameters that assist in the processing of the packet, the forwarding of the packet.

D

So um it would be obvious from that what the lookup engine does so the lookup engine looks up the token match zone retrieves the forwarding parameters, which was what goes on in any uh forwarder and then um vectors to a piece of code which is actually what goes on in a forwarder um but and sucks in the parameters it needs to create the the effect uh the effect may uh result in storing in some information in the pipeline. For the next token, so what sort of token cells might we have?

D

We might have a forwarding one with the addressing type we might have metadata scratch pad. These are writable areas of the packet, uh some security tokens. um We we think we can do some quite interesting, uh uh parallel processing um if the hardware can support it. So we have this concept of a manifest, which is the set of things to process in parallel, and the inverse of the manifest is the rendezvous where we can bring them back together.

D

uh Disposition disposition is what you do when the packet needs to leave this uh part of the the this zone. If you like of um tcr this segment, if you like in um in segment routing and um at the moment what happens is a lot of this is um uh induct in in um deduced, but uh we think we can be a lot more specific um with with some parameters- directives, for example, specifying some latency objectives, conditionals and basically anything else that you want to program into the into the system.

D

So let's look at the parallelization thing, which is um novel. um This looks complicated, um but what we're really doing is exploring the properties of this concept. Whether you would build this or not would would depend on your application and the capabilities of your forwarding hardware, but it's always interesting to look at what the natural consequences of the design are. So a um the the first token um is um um takes you to a manifest manifest says there are three processing streams that you can do in parallel.

D

If you have the capability, so we process one. uh We go to two in parallel with three in parallel with four two completes and takes us to five for three completes and that's the end of the action series and four takes us to another, manifest with three more parallel actions and then a terminator at nine and you're terminated when five, three and nine are finished now, uh I'm sorry and six and six and eight. So this is conceptual right.

D

This is just to show you what you can do once you start putting pointers into packets and building some of the some of the structures, and if you look below I'm not going to go through the detail, but if you look below, we can see how we can chain together. The structure we have here uh by putting pointers from one token to another.

D

um Well, this is the inverse function um where we're rendezvouing, because it's possible that you want to do that. It's quite good to do two and three in parallel, but you can't proceed anymore until they've, both completed, for example, to get to node. Four again, I'm not going to go into the details. The the the slides are fairly straightforward and there's quite a good description. I hope in the in the draft disposition.

D

So this is what a package is to do when it leaves the network and we we know we already have this and we have a need for this. It's the things like the next header in ipv6. That says you know what follows uh the mpls bottom label for things like vpn and pseudo wires and more recently, the network programming ip suffix, which is being used to specify what you do when the packet leaves the the sr domain now.

D

uh The interesting thing here is that we don't have constraints on size and we have the advantage of more and and flexible parameters if we take a a specific disposition approach rather than an implicit disposition approach.

D

So let's look at some some packets. um So here we have um the first token first active token is uh saying forward towards his ipv6 address notice that the the source addresses are optional, because uh we know that, for example, in transport networks you don't often need them and you can consider them a payload parameter.

D

um So um I we have. We have the first first token. Is the ipv6 address that social house we're going to forward, but there is a a pointer to the second uh token, which is telling us to look at some slo parameters. This packet must not arrive before this time. This packet must arrive in a window. This packet must arrive at a precise time.

D

So this is a way of doing latency based forwarding. When the packet arrives at the destination address, then we see we have the disposition token, and that tells us what we're to do next with this packet, how we're to process this payload and dispatch the packet out of the tcr system right? Well, um that's simple and straightforward. Now, let's apply some fast reroute.

D

Well, the first thing you'll notice is that for the fun of it, the first uh here to the right hand, side in light blue is the packet you see above and what we discovered. There was a failure and we needed to push a fast reroute token on the front. That's in dark blue you'll notice that um uh if we, if it's more convenient to us, we can use a different address family.

D

For this I mean I've arbitrarily picked an ipv4 one, but the important important thing here is that we're not casting the address family directly into the packet design. We are allowing the packet designer to consult with the network operator and use the addressing family. That's most convenient for this function.

D

Now, normally we would just push this on and do our do our best with fast reroute. But of course, if we're doing latency based forwarding of some sort, some sort of slo affording. We would really like that. The fast reroute system took advantage of this and knew the the history, and we can do this by uh pointing the uh next token, the rednecks token, to the exactly the same slo characteristics as was used on the main path. So now, we've added um a controlled latency system to fast reroute.

D

So what about segment, routing right so segment routing is um pretty pretty simple right. You just have a set of tokens, one for each segment and then the disposition token.

D

Supposing we want to do, um we want to do latency based forwarding with segment routing. We can do this by having each of the forwarding segment tokens take as a second action.

D

The latency based forwarding parameters, so you'll notice that these first two segments are using exactly the same slo parameters, so you can accumulate the same behavior and the same sort of time keeping across each of the segments, rather than doing it individually equally. Well, if we want to, we can have different slo characteristics for different segments. So these first two here uh point to um this second slo characteristic, and this uh t token n uh points to one of its own.

D

I I I the the the point here is to illustrate the power of the technique, not to suggest that every one of these techniques will be applied to every packet. uh How about some telemetry? Well, I think you'll probably get the picture straight away with telemetry right, so um we can do um hot by hop telemetry.

D

Simply by pointing the the next token pointer to the um telemetry um information um and and similarly we can do the same thing with a manifest, so it could be that you want to do some quite sophisticated and complex uh uh data collection.

D

Well, you can you can do your forwarding look up, which uh in most formulas is a dma action um in parallel with doing your um telemetry um action, and it will be a simple exercise for the reader to see that you can clearly put all of these together um and which we do here.

D

I'm not not going to go through that in detail here, but I think you'll find all the pointers are right and I I would uh ask you to look at the the draft and consultant slides and show see how the pointers allow you to construct these sorts of structures.

D

Authentication and security.

D

What we're what you can do here is, and there's uh some slides that um that I'm going to incorporate into the next version of the draft that show how you can break up the signature domains so that you sign the bits that are associated with a group of tokens or a set of parameters associated with a token and how you don't have to sign the whole of it. You can sign the bits that are relevant for this set of hops and this particular action.

D

So um what we got with tcr is a general purpose network uh data plane protocol with serializable and parallelizable um characteristics. uh It's the serialization parallelization. It's the ability to introduce scratch, pads and metadata scratch. Pads are our information where the packet forwarders write as the packet goes through. The network metadata are additional parameters or ancillary data that is needed to qualify, how you um forward the packet, it's a an extensible approach and it has differentiated security.

D

You can apply multiple levels of programmability to this concept. You can clearly combine token cells into a packet and link them in different ways.

D

You can provide parameters into the token cell and the understanding of those parameters is implicit in understanding um the the cell and being vectored to the code that executes that cell, in the same way that um an mpls label vectors uses some code and, of course we can introduce new token cell types without needing to rewrite and redesign the protocol.

D

So we would like to take this idea further to explore the potential of the concept, and certainly we would be really interested in working with. Anyone who is is interested in exploring where this can take us and that's the end. Thank you.

D

E

So I see a lot of discussions on the right.

D

I'm not looking at the chat yeah there any that that need comment now.

A

Carlos you wanted to talk. You disappeared from the queue.

F

I think the slides I just pressed on the slides, but.

G

So stuart, how are you going to say one thing that is the biggest thing I want to say out of this very large proposal. What sort of network processing device are you expecting to consume? The stream.

D

Well, we think that at medium speeds, an existing one could do this. There's nothing in here that you can't do with the existing hardware depends how many tokens you're going to uh to have. But fundamentally uh this is a sort of a hybrid between what goes on in an existing between mpls and ip in terms of basic forwarding and looking at the uh follow-on tokens is not particularly different from parsing an ach, and everyone is now looking at achs and metadata and ancillary data.

D

So this is kind of a different way of putting that together in a way that.

D

I think is more uh more powerful than um some of the other techniques we're looking at.

G

All right, so, while I agree with you, I'm gonna leave you with uh two comments. uh The first one is this, uh so much resembles uh the work that firewalls have to do with a given firewall program to implement. You know its intent for a given set of instructions, and we know that uh long chains can have negative impacts on forwarding rates, uh and the second comment is much like the firewall problem, you're now introducing a programming path. That now has a lot of exception paths that need to be considered.

G

So when things can't run through the program appropriately figuring out, what that actually means for the data that's being carried becomes a lot more exciting.

D

Yeah I I kind of sort of take this sort of yeah, because I come from the mpls school of design.

D

I kind of assume that um if uh you shouldn't be putting a packet through um an lsp or whatever a token switch path, I suppose, um unless you know it's going to going to cleanly get through there, because you know from the routing system what the capabilities of the path are and so long as you design the thing in the right in the first place- and you know with a lot of experience in mpls, of designing these things right in the first place, um it will work and if you screw up, then um the usual thing is just to dump the packet and increment to counter.

D

So I'm not too terrified of um of this. It's certainly very different from things. We've tried, I think in the past, um but we're sort of getting close to that with you know the ancillary data and the extension headers that we're doing because they have lots of undefined um states and and they're even less defined, because you're doing implicit um ordering, rather than explicit ordering of what you need to do to the packet.

G

D

D

D

It's not quite a tlv based system. Tony, I think we can make this run much faster than other than tlvs.

A

Sure so we are going to publish the comments here on the list as well. It would be great if you could read them and address them.

D

I will I will do my best to uh to to to to answer them, um if, if you can capture them in some way that I won't lose them.

A

You know people are always skeptical with new ideas. Yeah.

D

And this is, this is a concept right. I am. What I'm trying to do is to get people to understand the sort of power you get if you put pointers in a packet rather than rely entirely on uh implicit uh parsing, because no one actually uh just looks at the front of the packet anyway, do they they they they they. They now look inside all sorts of things uh to try and figure out what to do so. This is moving from implicit parameters to explicit parameters.

D

A

Okay, so uh we've got one minute: if there are any short questions comment, please go ahead. Otherwise, we'll move on.

F

G

You working on.

D

F

Queue as well, or only on the chat, queue.

D

I can work on either whichever you want to do. Ask I'm having trouble.

F

Following both of them in parallel, well, I think you know- maybe just one one marketing for you right, given, I guess, fair to say a little bit of the history and adjacency with mpls. This is also discussed in the open, mpls design team, which is meeting regularly. So there is another great chance to you know, join that effort and maybe have the discussion over there. If you haven't been watching uh that space, yet here in routing working group.

D

uh Just to answer robert's question: there is a ttl, I didn't show it it's in the preamble, so there's a bunch of stuff on the left. I always show which has got a a a small, tiny number of parameters that are always pushed onto the front of the packet, and one of them is ttl.

E

Start, maybe you can stop sharing all.

D

Right: okay, I'll stop! Sharing!

D

Thank you for your time.

A

Thank you thank you for presenting. We will take it to the list. It's a short meeting today and we've got a lot of information to.

F

Present all right, so uh this is uh an idea you know um born from from similar intentions. uh What stewart was talking about um with a terrible name so.

F

They have been you know in in in area and in routing working group in the recent past.

F

A couple of you know, discussions and proposals about introducing variable long addresses into you know an extension or you know, an evolution of ipv6 and I wanted to basically look into what might be not well explored benefits because driven by one of them the idea came, came up and then it basically felt that that also was very good to improve on some of the very well understood and explored challenges that we may have, and so the problem space is that of limited domain networks, and I was already presenting the problem space into iot ops in on monday, not the solution.

F

So maybe the document is also getting split up between the problem and the solution, um and you can see from the slide that you know the the core issues, such as um efficiently traffic steering like sr does it should. You know be very simple with this and more efficient than maybe many of the spring crh variations should support equally very flexible programming in the way that srf v6 adopted that with srh and then also by you know, value of having variable long addresses.

F

It would be easily feasible to introduce new semantics that need longer addresses, for example, and, of course, research. But let me start with the you know: new problem space that I think with ipv6 we've mostly ignored, which is the fact that you know the itf network product called iceberg is pretty much.

F

You know ten percent which everybody talks about, which is the internet and then ninety percent of what you know since rfc 8799 is called limited domains or the people called private networks, many of them in the iot space manufacturing, energy, all gas transportation, constraint networks. But equally you know, service providers, uh infrastructure network with uh ipmpls or ipsr is not quote the internet.

F

It just you know, has the internet running on top of it, um and so that that you know I counted all the devices on the planet and 90 of them are not on the internet or connected uh to the internet, but really just within these private networks.

F

And so then the question arises: how good is ipv6 and ipv6 addressing for those type of networks and given the time limits I I won't go into all the problems, but I've kind of summarized a couple of them, starting from pd to pi space ula space, which, especially through the work that I did in enema kind of I learned more than I think I wanted to know about it and then also the retirement of site, local addresses, which was done in general, but not for multicast.

F

So um there there there's really a lot to be said about a lack of ipv6 addressing to better support this so and the example that I wanted to bring up from industrial. But another space as well is that you want to be able to build. You know embedded constrained networks whenever you like, without having to bother about a single global address space, but just about your you know, network local, address space and then compose and interconnect them in an arbitrary hierarchy and topology that you want right.

F

You may want to start with some form of machinery that you're selling, which has an internal ethernet network and a router to the outside. You assemble these uh to form some larger machinery and, ultimately, an assembly line, so multiple hierarchies to even get to a single building block, that's shown here and then you can easily imagine how many of these building blocks here on that picture might need to be interconnected um in a way. That's certainly not the oh. I just need a flat.

F

Everybody can reach anybody else network, but we already, you know, have done good standardization with things like mud and other mechanisms to really ensure the security of these type of networks, and as it turns out you know, what's going to be proposed is just you know going very nicely along with it. I think so. Here is a typical, classical example of these instances that you can even see in uh you know, industry standards, I've worked in transportation with trains and so on.

F

So if you look at some of the standards for for how to build networks in a train car, um you know or any other, you know- example machineries. What you typically do is you're using the wonderful rfc 1918 10 net space, um every instance of a product you're building has you know the device has been given exactly the same 10 net addresses and then you have a gateway with net. Like any. You know, industrial ethernet switch you're.

F

Looking at that and you'll see that um oh well, why does it have this strange form of net well because exactly of this type of use case where, for example, netting the third byte of the address to be a unique on the next layer of land, which is the orange level, and then you have the second byte, where you can do one more level of aggregation with this existing um ipv4 addressing space and voila.

F

That's basically a lot of what you do for security and combination that your constraint to two levels may go up to another ipv6 level on top of that. But if you really compare ula ipv6 with um this stuff in ipv4, you have the same 16 bits really available, but you have additional problems with ipv6 ula because you know supposedly net is evil and you shouldn't use it, and then you run into hash collisions.

F

So here let me go over to the solution to spend most of the time on it. So in the example pictures I'm going to give addresses are going to have length modulo 4 equals 0 and prefix length. Also the same so effectively.

F

Hex string addresses like an ipv6 with any boundaries on 4-bit so that I don't need to cut in a single digit um and then dots are just you know, structural. You know optional things just to visualize, where the structure of an address uh ends. So the address allocation within a single network, that's kind of the one core new thing that we haven't.

F

You know in in normal networks and igp expected, which is that every assigned prefix has to be non-overlapping with any other, so that anybody who owns a prefix uh owns any longer address of that which is basically crucial to allowing this to work and um yeah that might actually, when we get to actual routing uh plane, for this might be an interesting question about the consistency requirements that we want to raise at the distributed, control, plane, yeah, and then we just expect that you know let's say for for now: everything is route in the igp like host addresses right, so everybody can route to anybody else's unique prefix.

F

So here is um then the example. So we have one network network one and a couple of devices interconnected, which is other each one showing its own prefix and then basically somebody else builds a second network, probably following from the same industry standards slightly different, mirror topology in this case, but certainly you know decidedly or unintentionally, overlapping address. And now the question comes okay, how do you start connecting this and allowing to have traffic flow between them?

F

So- and here is basically the the simple summary what you do for the interconnection, which is that you know one network wants to get connected to the other one. So on that network, um which is the blue one network, one you're starting to establish some connection into the other network, which is shown here um with these uh dotted uh boxes right. So you take ra and you connect it into lan, one on the orange network and you do receive a particular prefix from the 45.

F

Likewise with rb right, so you don't even have to have a symmetric um thing. It's good enough that only one network wants to connect to the other and connect as if you know it has just connecting normal nodes to that network.

F

So, and then is, is the very simple address processing that we're doing when passing the traffic on which is- and you know, given the time I I won't be so slow that everybody gets in the first run, um took me a while as well, but you go from the source and the destination address is 2 to 135, and the source address is your own 52., so the first two will route it to ra, which is the address of that node in the blue network.

F

The second two is the function of the the remainder of the address, uh which, in this case would be, you know, put it into a different network and locally the network. Connection number, let's say one, which is the parameter, which is the third part so and what ultimately happens is that ra is knowing okay. I need to put this out on the link into network two, um I'm going to strip this whole prefix and I'm recirculating the network, uh the the packet into network, two uh routing and forwarding.

F

So it's going to be forwarded to number 35, and then you know, the reverse thing is: is happening optionally, uh with the source address where we're basically prepending the return path, so that if, when the packet ultimately arrives at destination 52, it knows how to send return packets back to the 52 in network one.

F

So what is that um it is in the forwarding plane? Exceptionally simple, I think uh when we want to start having this type of interconnection with, you know every network having its own address space independently.

F

uh It's just normal prefix lookup, like we've done it forever, and the only novel things are stripping and prepending address prefixes and when you're doing this function, recirculating you know after that operation. So it's stateless or prefix address rewrite. So if we're looking into any type of address rewrite option, you know to connect networks, it should be the best scalable performant option right and, um as should be not too difficult to see, it would be possible to make that work for an arbitrary topology interconnect between these hierarchical or any type of mesh.

F

Now that was kind of the most complex thing now, when you think about just the logic of an address being, you know a sequence of functions and each function being, let's say either a semantic prefix or a node prefix, followed by a function code and a parameter. You can see that this simple, I get a packet. I do a prefix lookup, depending on the prefix lookup. I have an adjacency that does different things.

F

Can map into the most fundamental functions that we already need in routers and would like to be more flexible right. So, let's say the function. Number zero, um followed by a value for the next protocol is simply the whole stack thing right. So would allow me to eliminate the next protocol field from an ipv6 header, because it's simply in the address number one could simply be steering which basically then is followed by a node prefix, so that would be a replacement to what we're doing as our mpls or srv6. It's all in the address.

F

It's just a sequence of steering instruction node by node and, as you can see, the address space that you allocate for steering hops can obviously be a lot shorter than 128 bits.

F

um So it would be very compact encoding for the steering um number two was the instruction for this internet work where you're stripping in and adding addresses, and, of course you can come up with any other functions and parameters which is, for example, how you could map any of the programmability in srv6 that srh has into the address equally well, but given how it's all variable length, you don't need to waste 64 bits when you don't even need them like in most cases in srh that we have now and if, instead of a node prefix we're saying well, let's just start with a semantic prefix like we have in ipv6 as well, for, for example, multicast addresses you could equally, you know map any future semantics that you want to have into the same address space just make sure it doesn't overlap.

F

We've done it so far with the standards-based prefix allocations, I'm saying in general. This could as well be done through. um You know configuration programming of uh the forwarding plane from the control plane.

A

Let's check two minutes and we are not going to go over time.

F

Yeah, okay, right so the control plane. um I think there is a lot to be said about. uh How do we get? You know know how to get to a destination.

F

I make a lot of cases for many um private networks not really having to have other naming, because within the network these addresses are fixed for lifetime, um so they wouldn't be location dependent within the network and the path to other networks equally can come from pc controllers, but otherwise, there's obviously a lot of interesting extensions for path, routing that we already know how to do with bgp, uh for example.

F

Here is a you know, an example of how simple um a base header could be. um So this would be. You know how to take ipv6 strip everything that we don't uh think we need and arrive at a much simpler header destination address source address. We need length for both of them. I think we need ecn and hop limit. Everything else can go into extension header.

F

Obviously, this would require a newer version, not four and not six, um so you know it could basically be done in a backward compatible way so that it's a superset of v6, but that details haven't been talked about. So you know, um I think you know to come to a funny end here. I think what we're talking about really is. If we want to think more about addressing and not only you know, tag on more and more extension, headers, but think about the basic addressing that we have.

F

We have gone through a long evolution um which started with nat 26. You know ipv4 to ipv6 transition mechanisms most of them. I think people would find you know not very good, but some of them actually could be domesticated and become kind of really useful functionalities, which is, I think, what we're doing here. We also have functional structures on ipv6. You know: scope zones, unicast prefix, multicast, a lot of the things we have done where ad hoc things you know on top of ipv6, not very structural.

F

We have done a lot more structural stuff on the address processing in mpls, decks and source routing and sr. Those things also, you know, are nicely flowing into this, so I think when we take all these things together, this proposal, really, you know, should give us a very nice multi-purpose, functional address, processing architecture and that's it.

A

A

We are going to take one question here if any.

H

uh I have a small question about um the redundancy resiliency, because, if a node and if link for gateway is encoded inside the header is as a part of the address and a particular node particularly would be down, then we need to find some some route. But if it's encoded in the ap address, does it mean that we need to to change ap address or api actually yeah.

F

Yeah exactly, I think, one of the things you would obviously do in this case is also to have anycast addresses, which you assign to what more than one node.

F

Okay, so you're going to a logical, you know, exit node address.

A

We are going to stop here and move to the next video. Thank you very much thanks starless, and I would really encourage you to send your questions and comments to the list.

C

C

A

The slides don't hear you.

C

Can you hear me yeah, it's not great, so uh I'm alexander asimov, uh I'm working for yandex and since we are running out of time, I'll try to do it fast, and so we will try to first discuss the opportunity to enrich tcp with self healing capabilities.

C

I will start my talk with focusing on the data center environment, but, as you will see, it's not only about data centers here. So here is a typical topology of the data center. uh There are top-of-rack switches connected to the first tier splines, which are represented here with literal s.

C

The set of top-of-the-rack switches connected to the same first spines represents a point of data center upward and there are super spines that interconnect different ports, which are represented here with a literal x.

C

Of course, a loud balancing is widely adopted in such topology. Normally, it's equal cost multipath with a hash function using five tuple, so modern data sensors provide multiple paths between most of hosts. The single path between two hosts exists only inside the rack inside one port, the number of bus equal to the number of planes in the data center. In the case of interaction between pods, the number of paths will be the result of multiplication between the number of planes and the number of super spines in each plane to make you feel.

C

The numbers here is an example from one of your next data centers. It has eight planes with 32 super spines in each of them. This makes the number of paths in the case of intra-port interaction equal to eight and, in the case of inter-port interaction equal to 256.

C

Yes, uh one may expect that this number of paths should provide fault tolerance out of the box, but the real outages in the data center are way more complicated.

C

Let's say we have an outage and one at one of our super spines x11 and unfortunately, this outage doesn't affect the control plane, so bgp keeps running, but there is constant packet loss and since the majority of our traffic is tcp based, we have to rely on its lost recovery mechanism.

C

There are two main options in cp for loss, recovery, selected, acknowledging acknowledgement and retransmission triggered by rto timeout, but all these retransmissions will have the same five tuple, so they will all travel the same path resulting in service degradation and as the congestion window will be shrinking, it will increase the chances to meet rto event and continuous events. In is a disaster inside the data center.

C

Itil is calculated as the minimum from measured rtt and r2 amine. The default linux value for r2 mean is 200 milliseconds. If we have thought unfortunate enough to lose a syn packet, the basic timeout is even higher it's one. Second, while real rtt in the data center is about one millisecond to make it even worse, it increases two times after each unsuccessful attempt.

C

uh Network outage in the dc uh of course ends up when the broken device is is shut down, but all the time that it takes to collect data and debug the issue there will be continuous service degradation.

C

The question is: is it possible to enrich tcp with the opportunity to jump from the failing path to answer this question we'll need to take a tour through the linux kernel manic list, it was 2011 uh when rc 6438 was published. uh Where was stated that flow label can be used to balance encapsulated traffic. The idea is simple: since the transport layer may be not be available, we can improve load balancing quality. If we put the hash of the tcp socket in the flow label field in 2014, it was introduced in the linux kernel.

C

Similar changes were applied to other encapsulation methods such as gre or udp, but the progress in the linux kernel hasn't stopped at this point. Instead, it was progressing far beyond iitf documents.

C

uh The next year there was another patch in the linux development uh which added tcp hash recalculation upon negative routing event. What is negative routing event: it's rto timeout that we just discussed and next year this behavior was further strengthened.

C

Since this time the hash is recalculated on both audio and scene outer events and as we discussed outer timeout affects not only flow label value, it also affects all kinds of encapsulations. In the case of gre, it's affects its key field in case of udp encapsulation. It changes the udp source port and a surprise.

C

It has become a default linux, behavior and, as I have already mentioned, there is no corresponding ietf documents, but how has it changed the life of tcp flow in the data center?

C

Once again, let me have a partial outage at x, 1 1. This time we will also add the flow label to our hash function at the top of rack. Switch in the case of selected acknowledge acknowledgement, nothing really changes, but if we have an audio event, the tcp hash at the circuit will be recalculated and so does flow label.

C

So after each other event, we have a 50, probably probability. The traffic will jump to another plane. The more planes we have the higher is the probability that the jump will move tcp to the unaffected part of the network, and it is fully transparent to the application. In addition to the flow label in the hash function of the topofreck switch, we also deployed ebpar for agents at the host to change the outer mean values according to the real rdt in the data center, and here is result of our experiments.

C

We decided not to wait for the next big big outage that will create a service degradation on most of our services, but to create a small, artificial one.

C

Though, on real production services, we took a top-of-rack switch and created a constant packet loss on one of its uplinks. The switch had four uplinks in total on the left. You can see the outcome of the data plane, monitoring it is udp based and as predicted, it shows 25 percent of packet loss on the right side is the traffic volume from service from the host.

C

Behind this top of rex switch, the volume of the traffic has reduced four times now, the same experiment, but with flow label enabled or at the hash function of the top topofrex switch. The udp-based data plane monitoring shows the same result. It's a just udp ping without any re-transmissions, but tcp flows of our services are jumping from the failing path and service traffic is preserved unaffected.

C

For me, it looks great.

C

So what one can learn from in these slides using flow label at the level of top-effect switch and maybe first-tier spines- gives hcp of ipv6 a self-healing capability to make this jump of traffic quick enough. You should also use ebay pf to change the rto and scene, auto values according to your latency, and this goes for free, though with poor documentation, and one may note that an environment with multiple paths is not limited to the data center.

C

uh One may say that the internet itself consists of multiple paths between the majority of its points and all these jumping from failing path to another may improve user experience. In general.

C

Here I gained some real numbers uh from yandex network. Less than 40 percent of prefixes that we are receiving from our peers have only one best path. The average number of best parts is about 4, with a maximum of 44 best parts for the same prefix, and here is an example of a real outage.

C

As you can see, two of the five best parts were not affected by the outage, so it leaves the door for the jumping from defending path. So is everything works uh that great and do we need uh only to properly document implemented in the linux kernel?

C

Unfortunately, there was a trap.

C

And this trap is called anycast in the dc environment. State for any car services are noted that, while in the wild internet, any cast, speed proxies are represent representing significant portion of traffic volume and unfortunately, this kind of services doesn't perform very well with linux flow label.

C

Mechanics, let me show why let's say we have a client that establishes a tcp session with stateful anycast service, for example tcp proxy and everything is fine until rto happens on the client side, for example, if client pushes data to a storage, the rto event is not that uncommon, but ato at the client side also triggers cp, hash, recalculation and thus generation of neutral label.

C

As a result, subsequent packets can be redirected to another instance, which won't have any appropriate state and will drop these packets accordingly, this might be uh following uh outer following auto events that may return the traffic to original instance, but anyway, it won't improve user experience, and this is no theory.

C

We faced this kind of behavior with our users and had to switch overflow label at our border, routers, which partially improved the situation by the way hashing with flow label is juniper default behavior and if you have seen a greater number of tcp connection, timeouts in ipv6 compared to ipv4.

C

Now you should know why the scenario for the scene, article is a bit more complicated.

C

Let we have a scene packet, and this time it is correctly delivered to the proxy instance, but the following scene: arc packet is lost. In this case we will have two sides: the server and the client trying to re-transmit syn packets.

C

So after the default scene, audio timeout, which is normally one second, the client will send another sim packet with new flow label value and it may reach another instance, but not. uh But now we will have one client and two servers trying to establish one connection.

C

Which leads us to a race condition if the scene arc is received from the second proxy. Everything will be fine, but if we receive first the scene arc from the first one, we are meeting a peculiar situation.

C

The client needs to respond with an arc to finish the connection procedure, but it has a flow label that directs packets to the second proxy and the acknowledgement number related to the first proxy. Of course, such scenario will end up with a connection timeout in such race condition is such a risk condition is likely to happen.

C

It's really hard to say at the moment, because it's really hard to check how many such connections uh are broken, but both issues with rto and scene adder may happen only if the hash is recalculated at the client side, and this provides an opportunity for a quick fix. This will still have save the improvements in case of the outage and resolve the issues with tcp session timeouts.

C

So if we just restrict hash free calculation upon auto events at the client side, while keeping current implementation for the server side, it may become a proper default behavior and it should have a knob to enable it on both sides in the controlled environment.

C

Where uh can uh because, in the controller environment, we can assume that we can distinguish? Where is the anycast service? And where is the unicast service? So.

C

In in the ipv6 world, we have an opportunity to enrich tcp transport with self-healing capability. Most of the mechanics are already in place, and today it works in the data center environment, though we need to change the linux implementation to guarantee robot's behavior in general, and this time as a community, we need to properly document it. There are several open questions, for example, should we focus on tcp only or try to provide a general guidance?

C

There is also a number of confusing documents related to the flow label that don't work with linux implementation, uh and these issues should be, of course addressed. So if you think it's worthy and you have a desire to assist in this work, feel free to contact me. That's all for me. Thank you. Any.

A

A

Funny go ahead.

B

Oh I'm just I'm just switching my.

H

B

A substitute for clapping, very good presentation. Thank you.

C

Thank you very much.

B

Hi alex yeah. I agree it was a very good presentation. I had a question: is the insertion of the flow label done by the first hop router, slash, tor switch or in the linux kernel itself?.

C

uh Can you repeat the beginning of the question so uh are you asking about? Where is the hashing is implemented or where is who is changing their flow label? Who's changing.

B

C

Observe your in order to be able to do so. Yes, the flow label is changed at the host.

B

As a host, okay,.

C

Yeah, so it's a it is uh today it is changed uh on the host upon the rto events, both uh for the established sessions and uh upon the scene. Retransmission.

B

A

If you think of the cases where overlay starts on the door or first hub switch, nothing prevents us of copying flow label into overlay, hash or not copying, but at least encouraging.

A

So it would be evenly applicable for evp, analytic.

A

A

So there's no more questions and we are running out of.

A

B

Did you um did alex write a draft on this, or is this just a presentation.

C

uh At the moment, this is just a presentation uh we are still at the beginning of the road and discussing a proper way how to fix it, and then there will be a lot of work must be done in the itf documentation, as I said, because we need to update the documentation for the flow label itself and also write a spec that will describe how the hash recalculation should affect.

C

The corresponding parameters, including flow label,.

B

A

Okay ready, so the plan is to provide routing as a home for this work, and so tom and alexander already working on it, and so more people are more than welcome here. The problem is real. He provides real solution to serious issues and uh I would like to thank alexander for the presentation and again please reach out and contribute.

A

uh My wish is to have something similar to 7938, which is bgp in large dc, which has become the fact cookbook for hundreds and hundreds people building networks.

A

So I'm really looking forward to work here and thanks for doing the presentation doing the work.

A

uh We are completely out of time, so I would like to thank everybody for attendance and I really really hope to see you face to face in madrid. Take care everyone.

E

Yeah I just want to mention this is one of the example that, although you don't have a draft, but you are presenting something useful for the community and we also welcome you.

E

That's it: okay,.

A

Thanks everyone and see you at atf, 112.

E

E

E