Description
Held on 1230-1345 UTC on 23 July 2019, this Technology Deep Dive session at IETF 105 started with a description of how a basic network interface card (NIC) operates and led into NIC feature evolution.
A
Sorry for the slight delay, but now we are also online, that's great. Good morning. This is the deep dive session; it's the second such session we are having at an IETF meeting. We had a deep dive session on router architectures, which people gave us a lot of positive feedback about, so we thought we'd do it again. This time we will look into NICs, and we have some people here who usually, or some of whom usually, come to the IETF; they're from the Linux networking community.
B
Okay, so when we were composing this talk, an hour and a half didn't seem to do justice to the content, so we had to limit the scope; we could have a half-day discussion on just the high-level topics, never mind a tutorial. What is in scope: we will talk about basic NIC support and how a basic NIC works, proceed to medium-range offloads from the host stack to the hardware, and then to slightly more advanced features. We're going to use the Linux kernel as a reference point, though it's not necessarily the only way, nor the only operating system, that does this. What's out of scope: we're not going to talk about kernel bypass, so no DPDK discussions. We're not going to talk about small CPE devices, which use the same APIs on Linux at least, or very large multi-terabit switch ASICs, which may use the same APIs in some vendor ASICs. We're not going to talk about virtualization offload technologies; SR-IOV, VMDq and any newer schemes are out of scope, and storage is also out of scope, so that could be another session.
B
Should this session prove exciting to the attendees, we could have another session in the future. As for the relationship to the IETF: if you're implementing protocols, this is very relevant to you, whether you're running on a host or on middleboxes, which end up using NICs, or on nodes that perform both host-level features and forwarding functions. NICs can accelerate a lot of protocol processing and have a lot of helpers in the hardware for TCP, UDP, QUIC, TLS and IPsec; a lot of the NVO3 encapsulations are mostly commodity offloads at this point in time. You can accelerate layer 2 to layer n forwarding and filtering, and there's a lot of QoS offloading. It's a very condensed session, so what we'll ask is that we only allow clarification questions during the talks; any other questions you may have will come at the end.
B
I'm going to introduce the presenters. You have a very competent set of folks here: on the left is Tom Herbert from Intel, then Andy Gospodarek from Broadcom and Simon Horman from Netronome, and I'd like to acknowledge Boris from Mellanox. These are very competent folks; they know how the implementations work and they understand the hardware very well, so you're in good hands.
C
How's that? Yeah, much better. Okay, so I'm going to present the fundamentals and basic offloads of NICs. A few definitions might be useful. A NIC is a network interface card, sometimes network interface controller; this is the host's physical interface to a physical network. The host stack is the software that processes packets and does protocol processing in the host.
C
Typically this is layer 2, layer 3 and layer 4 processing. A kernel stack is simply a host stack that runs inside a kernel, and as was mentioned, for the most part we'll be referencing Linux for that. Offload is when we do something inside the NIC on behalf of the host; this is work, work that involves networking, that we move essentially from the host to the NIC for some purpose. Acceleration is offload that is done mostly for performance gains. So what is a network interface card?
C
This shows a picture of a card on the left, and most of you should be familiar with these; whoever has had a PC, for instance, knows how to plug these in, so they go into the system bus. I would point out this particular card is very ancient: it has a BNC connector, so this is coax Ethernet and ISA connectivity, but nevertheless it's a NIC, and modern-day NICs obviously look a little bit different but basically perform the same function. So a NIC is the receiver and transmitter of packets to the physical network.
C
It's the device that does that. On the right we have a stack, and you can see that in the protocol stack the NIC is kind of at the bottom. On one side, to the outside world, it connects to the physical media, which could be fiber, Cat5 or radio, and we use some sort of encoding or framing over that media: Ethernet, Wi-Fi, Fibre Channel. On the other side, the NIC connects into the system via the system bus; typically today this is PCIe or USB.
C
In the olden days, like this card, it was ISA. The way this works is that NICs have queues; typically they have a transmit queue and a receive queue, and these queues store, or indicate, the packets for transmit and receive. The queues are composed of a set of descriptors, and the descriptors describe the packet for the NIC. Some of the important things in these descriptors are where the packet is located in host memory, what the length of the packet is, and then some ancillary information that may be involved, for instance whether this was received as a broadcast Ethernet frame, and other information like that. So in order to transmit, the host stack fills out a transmit descriptor and, most importantly, writes into it the information for the packet: where the packet is located in its memory and what the length of the packet is.
C
The NIC processes the transmit queue: it looks at each of the transmit descriptors, figures out where the packet is in host memory, performs a DMA operation (direct memory access) to pull the packet into its local memory, and then the NIC may perform some offload processing, which we'll talk about in a bit, but eventually the packet has to be sent on the network. So there is a PHY and a serializer (SerDes) inside the device that takes the packet from its memory, serializes the data and sends it out to the actual network. Receive is somewhat similar. In the receive path the host sets up a number of packet buffers where packets will be stored in its memory, and it puts these into the receive queue in the receive descriptors. So again, in each descriptor there is a memory location and, in this case, a maximum length for the packet.
C
When a packet arrives, the NIC reads the next receive descriptor to get the host memory location, DMAs the packet into that host memory, sets the length in the receive descriptor, increments the producer pointer, or rather the consumer pointer, in the receive queue, and then it interrupts the host, which is typically an actual system interrupt. The host wakes up and knows there are packets to process in the receive queue, so it reads the queue, gets the packets that have been received, and processes them in the stack.
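As a rough illustration of the descriptor-ring model just described (a simplified sketch, not any particular vendor's descriptor layout; the field names, tx_queue_xmit and nic_write_doorbell are invented for this example), the transmit side might look something like this in C:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transmit descriptor: the driver records where the packet
 * lives in host memory and how long it is; the NIC DMAs it from there. */
struct tx_desc {
    uint64_t buf_addr;   /* DMA address of the packet in host memory */
    uint16_t length;     /* length of the packet in bytes */
    uint16_t flags;      /* e.g. "checksum offload requested", "end of packet" */
};

struct tx_queue {
    struct tx_desc *ring;    /* array of descriptors shared with the NIC */
    uint32_t size;           /* number of descriptors in the ring */
    uint32_t producer;       /* next slot the host will fill */
};

/* Stand-in for the MMIO doorbell write a real driver would perform. */
static void nic_write_doorbell(uint32_t producer_index)
{
    (void)producer_index;
}

/* Host side of transmit: fill out the next descriptor and notify the NIC. */
static int tx_queue_xmit(struct tx_queue *q, uint64_t dma_addr, uint16_t len)
{
    struct tx_desc *d = &q->ring[q->producer % q->size];

    d->buf_addr = dma_addr;
    d->length   = len;
    d->flags    = 0;

    q->producer++;                      /* advance the producer pointer */
    nic_write_doorbell(q->producer);    /* tell the NIC there is work   */
    return 0;
}
```

On receive the flow is mirrored: the host posts empty buffers into a receive ring, and the NIC fills them in, advances its pointer and raises an interrupt, as described above.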
C
What I just described is kind of fundamental; that's the basics of the NIC, and it started in approximately the early 90s. Not long after, some of the basic offloads that I'll talk about in a minute came into being and were developed, and we can track the evolution of NICs since then. In the mid-2000s we got data plane accelerations, which are more advanced features inside the NICs, IPsec offload for instance, and QoS offloads, and more recently there's a general movement to make these devices programmable.
C
We'll talk a lot about offloads today, so I want to give a little bit of motivation. One way you can think of offloads is that these are just advanced features, having to do with packet processing or protocol processing, that happen to be done in the NIC. There are a few rationales for this. One is that we want to free up host CPU cycles for application work; this makes sense if the NIC can do the networking functions in a more efficient way.
C
Since it's specialized hardware, that is often the case; for instance, it can compute a checksum more efficiently than doing it on the host CPU. More generally, one of the motivations is to save host resources, so offloads may save not just CPU but memory, DMA operations, memory movement and the number of interrupts. Scaling performance is very important and offloads help a lot there, particularly for low latency and high throughput.
C
There are also some interesting use cases, particularly in mobile, where we might offload certain operations having to do with protocol processing to a device for the purpose of saving CPU cycles and saving power, in particular on the core CPU. In short, offloads make sense as a cost-benefit trade-off: if the benefits of moving work into the NIC (you can think of it as a co-processor) exceed the cost, then it makes sense. In practice this can be an interesting analysis, and we know that CPUs, for instance, are always increasing their capabilities.
C
On the other hand, the network and the things you want to do are always getting more complex, so there's always a bit of a trade-off between whether to offload or to run on the host CPU, but in general we've found offloads to be pretty useful, and that trend will probably continue in terms of developing offloads and NIC development in general. In the Linux community at least, we've kind of enshrined some of the principles in something called "less is more", and I want to give three components of this.
C
First of all, protocol-agnostic mechanisms are better than protocol-specific ones. This is somewhat of a formalism of trying to prevent protocol ossification, but the idea is that if we can develop an offload that supports, say, all transport protocols equally, versus one that only works with TCP or plain TCP/IP packets, then generally the more general offload is going to be more applicable and better for the user. In a similar vein, common APIs are better than proprietary APIs.
C
We have a lot of OSes and a lot of NICs; the more common the API is across those, the easier it is for users to choose different pieces of hardware. This is particularly important in that we want to avoid vendor lock-in, which is where a vendor, whether purposely or inadvertently, kind of controls the API such that it's really difficult for the user to change the vendors they're using.
C
The third point is that programmability is (generally, in parentheses) good. One of the aspects of programmability is that if we make things completely openly programmable, especially user programmable, and allow users to do whatever they want, users will do whatever they want, and that, as we know, leads to some interesting fracturing of the market and can be precarious. So we always want to make sure that if we're going to create an open programmable environment, we think about how to develop the ecosystem properly and maintain some semblance of sanity and portability across these.
C
So we can turn and look at some of the basic offloads; I'm going to skip that slide. We'll talk about three basic offloads, and these are kind of the oldest ones, very common amongst NICs; most have been around since the 90s at least: checksum offload, segmentation offload and multi-queue. Checksum offload is the offload of the venerable TCP/UDP transport checksum. The idea is that we want to offload the computation of the checksum, since the ones' complement summation in particular is CPU intensive.
C
If we offload that to the NIC, we get a nice performance gain. As I mentioned, checksum offload is particularly ubiquitous; it would probably be pretty hard to find a NIC on the market today that does not support some form of this. An interesting twist that's a little more recent is encapsulation.
C
What we've found is that, say, IP-in-IP encapsulation, or in particular UDP-based encapsulations, can actually carry multiple transport protocols per packet, each containing its own checksum. So conceptually it's possible to have two, three, four, five or six checksums in a single packet: a TCP checksum, a UDP checksum, the GRE checksum, it's all possible. We want to offload all of those checksums, and we found some techniques that can leverage rudimentary checksum offload of one checksum to actually support multiple checksums, even in the same packet.
C
A little bit of detail: transmit checksum offload has two forms, one protocol specific and one protocol agnostic. In the protocol-specific one, the host sends a packet into the device, and the device actually parses the packet, determines if there's a transport header with a checksum, and if there is, it does all the operations needed to set the checksum: it performs the ones' complement checksum over the data, computes the pseudo-header checksum if there is one, and sets the checksum in the appropriate field of the transport layer.
C
The more generic method is for the host to indicate, as instructions, exactly how to do the checksum. It provides two pieces of information to the device: one is where the checksum calculation starts, a starting offset in the packet, and the other is the offset at which to write the checksum, which would typically be the checksum field of TCP, for instance; the start would then be the offset of the TCP header.
C
The device gets this and performs the ones' complement sum from the starting point to the end of the packet, and whatever sum it gets, it basically folds into the existing value in the checksum field and then sets the field. As long as the host set this up and initialized the checksum field correctly (typically with the pseudo-header checksum), the device will set the correct checksum. It has no idea what kind of checksum it is; it doesn't know if it's UDP or TCP, and it doesn't care.
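To make this generic, protocol-agnostic transmit checksum offload concrete, here is a minimal sketch of the calculation the device performs (my own illustration, not a specific NIC's or driver's API; csum_start and csum_offset are used in the same spirit as the parameters described above, with csum_offset taken relative to csum_start):

```c
#include <stdint.h>
#include <stddef.h>

/* Fold a 32-bit accumulator down to a 16-bit ones' complement sum. */
uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Ones' complement sum over a byte range (16-bit words, network order). */
uint32_t csum_partial(const uint8_t *data, size_t len, uint32_t sum)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)                          /* odd trailing byte */
        sum += (uint32_t)(data[len - 1] << 8);
    return sum;
}

/*
 * Generic transmit checksum offload: the host tells the device where to
 * start summing (csum_start) and where the checksum field lives
 * (csum_offset from csum_start).  The host has already put the
 * pseudo-header checksum into that field, so summing the whole range and
 * complementing the result yields the final transport checksum, whatever
 * the protocol is.
 */
void nic_tx_csum_offload(uint8_t *pkt, size_t len,
                         size_t csum_start, size_t csum_offset)
{
    uint32_t sum = csum_partial(pkt + csum_start, len - csum_start, 0);
    uint16_t folded = (uint16_t)~csum_fold(sum);

    pkt[csum_start + csum_offset]     = folded >> 8;
    pkt[csum_start + csum_offset + 1] = folded & 0xff;
}
```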
C
It just knows it's the standard Internet checksum. For receive we have an analogous situation: there is a protocol-generic and a protocol-specific method. The protocol-specific method is called "checksum unnecessary": as packets are received, the NIC parses the packet, determines if there is a transport protocol that contains a checksum, and performs the work to actually verify the checksum.
C
It does the ones' complement checksum, computes the pseudo-header checksum, adds them, and checks whether the result is zero; if it is, the checksum has been verified, and it sets a bit in the receive descriptor to inform the host that it has verified that checksum. Again, this is protocol specific: it only really works with TCP and UDP packets that the device explicitly parses. The more generic method is "checksum complete".
C
In this case, the device performs a ones' complement sum over the whole packet, starting from the IP header through the end of the packet, and simply returns that sum in the receive descriptor to the host. The host can take that and, through simple manipulations of the checksum, use it to verify any number of checksums in the packet. So this is really efficient, very generic, and able, as I said, to verify many checksums in a packet.
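To show what those "simple manipulations" look like, here is a rough sketch of how a host stack could verify one inner transport checksum given only the device's whole-packet sum (my own illustration, reusing the csum_partial and csum_fold helpers from the transmit sketch above; the function name and parameters are invented, and ones' complement corner cases are glossed over):

```c
#include <stdint.h>
#include <stddef.h>

/* csum_partial() and csum_fold() as defined in the transmit sketch above. */
uint32_t csum_partial(const uint8_t *data, size_t len, uint32_t sum);
uint16_t csum_fold(uint32_t sum);

/*
 * device_csum is the folded ones' complement sum the NIC computed over the
 * packet starting at the IP header ("checksum complete").  trans_off is the
 * offset of the transport header from the IP header, and pseudo_sum is the
 * unfolded ones' complement sum of the pseudo-header.  Returns nonzero if
 * the transport checksum verifies.
 */
int rx_verify_inner_csum(const uint8_t *ip_hdr, size_t trans_off,
                         uint16_t device_csum, uint32_t pseudo_sum)
{
    /* Sum of the bytes in front of the transport header ... */
    uint16_t prefix = csum_fold(csum_partial(ip_hdr, trans_off, 0));

    /* ... subtracted (ones' complement arithmetic) from the device's sum
     * leaves the sum over the transport header and payload alone. */
    uint16_t transport = csum_fold((uint32_t)device_csum +
                                   (uint16_t)~prefix);

    /* A valid transport checksum makes pseudo-header + segment sum to 0xffff. */
    return csum_fold((uint32_t)transport + pseudo_sum) == 0xffff;
}
```

The same whole-packet sum can be adjusted this way repeatedly, which is why one returned value is enough to verify several checksums in an encapsulated packet.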
C
Looking at segmentation offload: one of the observations we've made is that networking stacks are more efficient when they process large packets as opposed to small packets; in particular, the per-packet processing overhead in the stack is significant, usually more than the cost of processing the data bytes. So we want to see if we can arrange the system so that we can process large packets instead of small packets.
C
There are two forms of this, one on transmit and one on receive. With transmit segmentation offload, the idea is that the host produces a large packet, say a 64 KB TCP segment, and we want to break this packet up into smaller chunks for sending out into the network, which may have, say, a 1,500-byte MTU. We want to do this as low in the stack as possible. So the idea is that the stack processes the big packet, processes one IP header and one TCP header, and at the lowest point possible, either in software or even in the network device, there's a kind of segmentation or fragmentation: we slice up the data, give each packet its own IP header and its own TCP header, and send each one. There is a software variant and a hardware variant of this: the software variant is called GSO, generic segmentation offload; the hardware variant is LSO, large send offload. You might also see it called TSO, TCP segmentation offload, when it is specific to TCP. Receive segmentation offload is the opposite: when small packets are received, we try to coalesce them into larger segments, larger packets. Again this is a per-flow, similar operation, and there are two variants of it, one software and one hardware: the software one is GRO, generic receive offload; the hardware one is LRO, large receive offload. Of all the basic offloads, this particular one is probably the hardest.
C
It does require the network device to be able to parse the packet and understand a lot of details of the protocol; for instance, the implementations that do this really only understand TCP, and usually some of the encapsulations, but until we have, say, a fully programmable environment, it is hard to generalize this one. One thing I'd like to mention about segmentation offload: it really only works in conjunction with checksum offload, since each resulting segment needs its own correct checksum.
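As a rough software sketch of the transmit-side segmentation just described (a simplified illustration, not the kernel's actual GSO code; the per-segment header fix-ups, which are exactly where checksum offload comes in, are only hinted at in comments, and the struct and function names are invented):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdlib.h>

/* One segment produced by software segmentation. */
struct segment {
    uint8_t *data;   /* headers + payload chunk */
    size_t   len;
};

/*
 * Split a large "super packet" (hdr_len bytes of L3/L4 headers followed by
 * payload) into MSS-sized segments.  Each segment gets a copy of the header
 * template; a real implementation would then fix up per-segment fields such
 * as the IP total length and ID, the TCP sequence number, and the checksums.
 */
size_t gso_segment(const uint8_t *pkt, size_t pkt_len, size_t hdr_len,
                   size_t mss, struct segment *out, size_t max_out)
{
    size_t payload_len = pkt_len - hdr_len;
    size_t off = 0, n = 0;

    while (off < payload_len && n < max_out) {
        size_t chunk = payload_len - off < mss ? payload_len - off : mss;
        uint8_t *seg = malloc(hdr_len + chunk);
        if (!seg)
            break;

        memcpy(seg, pkt, hdr_len);                         /* header template */
        memcpy(seg + hdr_len, pkt + hdr_len + off, chunk); /* payload slice   */
        /* ...fix up IP length/ID, TCP sequence number, checksums here...     */

        out[n].data = seg;
        out[n].len  = hdr_len + chunk;
        n++;
        off += chunk;
    }
    return n;   /* number of segments produced */
}
```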
C
Turning to the third basic offload, multi-queue: one of the interesting properties is that once we have multiple queues, we can assign properties to them. Particularly on transmit, each queue can have its own attributes, so, for instance, we can have high-priority queues and low-priority queues. One of the important aspects when we deal with multi-queue is that we do want to try to keep packets in order; for instance, we don't want to distribute packets of the same flow across different queues, either on transmit or on receive.
C
So there are some techniques in the queueing model to try to enable in-order delivery as much as possible. On transmit, there are essentially two methods to do this. One is the easy method, which is fundamentally that each CPU is assigned to a queue; when an application is sending a packet, the queue chosen is the one associated with the CPU the application is running on. The advantage of this is that we get a sort of siloing and locality.
C
For instance, when a packet is sent on a queue, we have to lock the queue in order to manipulate the queue pointer; if we do this with one CPU per queue, then there's no contention for the lock and no contention for the queue structures. The second method is when the driver selects the queue. As I mentioned, queues can have rich semantics, such as priority, and what we've done there, instead of trying to expose all possible combinations of this, is to let the driver map packet metadata to an appropriate queue.
C
For instance, if we're sending a high-priority packet, where the metadata associated with the packet says it is high priority, when this goes into the driver it looks up the queue that's appropriate for that. So there may be CPU-to-queue affinity, priority, and also other attributes you could apply, like rate limiting. On the receive side, this is normally called packet steering: the idea is that when packets come into the NIC, they need to be distributed amongst the queues.
C
It turns out this is a lot like ECMP, and some of the techniques are very similar to where we're trying to distribute traffic with ECMP across multiple interfaces. On the stateless side, there are two forms of this: one is called receive packet steering (RPS), which is the software variant; RSS, receive side scaling, is the hardware variant. They both work essentially the same way: when packets come in, a hash is performed over the five-tuple of the packet if the transport layer is available, or a three-tuple using the flow label, but the effect is to identify the flow by a hash. We take that hash and map it into one of the queues, and that way we're also consistent: a particular flow always has the same hash, therefore we can always map it to the same queue, in order to facilitate in-order delivery.
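A minimal sketch of that hash-to-queue mapping (my own illustration; real NICs typically use a keyed Toeplitz hash and a configurable indirection table, whereas this just uses a simple placeholder hash):

```c
#include <stdint.h>

/* Five-tuple used to identify a flow. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

/* Placeholder flow hash (FNV-1a style); hardware commonly uses Toeplitz. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t words[3] = { k->src_ip, k->dst_ip,
                          ((uint32_t)k->src_port << 16) | k->dst_port };
    uint32_t h = 2166136261u;
    int i;

    for (i = 0; i < 3; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    h ^= k->protocol;
    h *= 16777619u;
    return h;
}

/* Map a flow onto a receive queue via an indirection table, so the same
 * flow always lands on the same queue and stays in order. */
uint16_t rss_select_queue(const struct flow_key *k,
                          const uint16_t *indirection_table,
                          unsigned table_size)
{
    return indirection_table[flow_hash(k) % table_size];
}
```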
C
An extension of this is something called receive flow steering. In this case the host itself can actually program, for each flow, which queue to use; this is a very powerful mechanism: on a per-flow basis the host can indicate, okay, for this flow use this queue. There are also two variants of this, a software variant and a hardware variant. The advantage of this is that you get really good isolation.
C
Some people use this where they pin an application to a CPU, so that the application only runs on that CPU, and they associate a network queue with that application, and receive flow steering completes the arrangement so that only packets for that application's flows go to that queue. It's very siloed; the application acts like it's the only application on the system, and we get a lot of performance gains that way. With that, I will turn it over to Simon, who will talk about some of the more advanced offloads.
D
Thanks Tom. So far, Tom has taken us through some basic offloads and the basic functionality of the NIC itself. Well, as the use cases and the demands of users evolve, and the hardware evolves at the same time, it only makes sense that more and more processing could be pushed down to the hardware, and so in this section we'll look at examples of that in terms of offloading more of the data plane, the packet processing. But before I get into some examples in that area,
D
I'd just like to quickly cover some of the hardware solutions that might be used in this kind of area. It's important to note that these solutions are a little bit mix-and-match: it depends very much on the use case which choice is appropriate, and some hardware choices match some use cases more naturally than others, but at the same time they're not necessarily mutually exclusive.
D
So far, the NICs we've talked about fall into the first category, where you have a fixed data plane, so this will typically be an ASIC that implements a pipeline in hardware. We can also use more programmable technologies, and these fall into roughly three sub-categories. We have semi-specialized processors, called network flow processors or NPUs (network processing units), which are a little bit similar to a general-purpose processor like a CPU in a server.
D
They differ from a general-purpose CPU in that they are a little bit more specialized, so they might have instructions to do network-related functionality, or they might have much higher thread density, things along those lines, to make them more suited to network processing. And then you have FPGAs, which are probably the most programmable solution possible; here we have gate-level programming.
D
Here we have a diagram that represents roughly how this works: we have applications, then in the kernel we have an implementation of a data path, and then down in the offload NIC we have a data plane which implements some, or maybe all, of the functionality of the data path in the kernel, and so it is able to, for example, forward packets and so on.
D
The first step is that we do some kind of header extraction, so we pull out some fields, for example the five-tuple, but we also have metadata, for example the port that the packet arrived on; other things can also be available. Then, using this data, we typically compute a hash, and the hash is looked up in a hash table.
D
We try to find a match, and if we do find a match, then the match will supply some kind of action that should be executed, or even a list of actions. This could be to forward to a different port, it could be to drop, it could be to move on to another table if you have multiple tables present, or it could be to do some kind of modification of the packet.
D
You can also extract the source and destination IP addresses from L3, and you can also select, for example, the ports at layer 4. So you can create a specific rule, for example to do some kind of special treatment of port 80 traffic, possibly sending it to a separate host; it's fairly flexible in this regard.
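To illustrate the match-action model just described (a deliberately simplified sketch of the idea, not any NIC's or the kernel's actual flow-table API; the types and the key_hash helper are invented for this example), a lookup might look like:

```c
#include <stdint.h>
#include <stddef.h>

/* Extracted headers plus metadata, as produced by the parse step. */
struct match_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
    uint8_t  in_port;      /* metadata: port the packet arrived on */
};

enum action_type { ACT_FORWARD, ACT_DROP, ACT_GOTO_TABLE, ACT_SET_FIELD };

struct action {
    enum action_type type;
    uint32_t arg;          /* output port, next table id, new field value, ... */
};

struct flow_entry {
    struct match_key key;
    struct action    actions[4];
    unsigned         n_actions;
    int              in_use;
};

struct flow_table {
    struct flow_entry *entries;
    unsigned           n_entries;
};

/* Placeholder hash over the key; hardware would use its own function. */
static unsigned key_hash(const struct match_key *k, unsigned buckets)
{
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ ((unsigned)k->dst_port << 1) ^
            k->protocol ^ k->in_port) % buckets;
}

static int key_equal(const struct match_key *a, const struct match_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->protocol == b->protocol && a->in_port == b->in_port;
}

/* Exact-match lookup: returns the matching entry, whose actions are then
 * executed, or NULL, in which case a default action (such as sending the
 * packet to the host) would apply.  Linear probing keeps the sketch short. */
struct flow_entry *flow_lookup(struct flow_table *t, const struct match_key *k)
{
    unsigned start = key_hash(k, t->n_entries);
    unsigned i;

    for (i = 0; i < t->n_entries; i++) {
        struct flow_entry *e = &t->entries[(start + i) % t->n_entries];
        if (e->in_use && key_equal(&e->key, k))
            return e;
    }
    return NULL;
}
```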
D
The interesting thing about this use case, on ingress, is that there's no queue available, so the actions that can be applied are fairly limited: we can police the packet, perhaps by dropping it or marking it, we can filter it, and so on. Egress is a little bit more interesting, or a little bit more complex perhaps is a better way to put it, because we have a queue, so we have the option of doing a much larger number of different things with the packets.
D
In order to, for example, enforce a desired packet rate, we can delay packets, we can of course drop them, and so on. This is an area which has received significant research over the years, and most of that research is applicable here. Of course, a challenge in implementing individual algorithms on an offload NIC, as opposed to on a host, is the usually more limited execution environment, but nonetheless the same principles generally apply.
D
Now, in this diagram we have packets coming into the machine, into the NIC, and also exiting the NIC, so they're being forwarded from one port of the NIC to another; that could be a virtual port or a physical port, and the NIC is applying some kind of QoS as they traverse the NIC. In the next slide we have a slightly different setup.
D
Here we have packets; of course it's bidirectional, but I will only talk about one direction, which is packets originating from an application running on the host and heading out towards the wire, out of a physical port of the NIC, and the NIC is applying some kind of QoS policy to those packets as they traverse the NIC in this particular case.
D
Moving on to the last part of my section, I'll talk about crypto offload a little bit. This is a little bit different to what I've talked about so far with data plane packet processing, in the sense that what we're really focusing on now is offloading from the host a very computationally expensive part of packet forwarding, if you are applying crypto, and crypto itself tends to be quite complex.
D
Essentially, for TLS offload, what the host will do is format a TLS record, but it does not perform the cryptographic operations; so there is space for the authentication hash, but it's not filled in, and the record payload is in plain text. The offload NIC will then receive this record and perform the cryptographic operations: it turns the plain text into cipher text and fills in the hash. On RX, things are reversed.
D
IPsec acceleration follows a similar principle to TLS, in the sense that some parts are offloaded and some parts are not, and at this time we have two models for this. One is the crypto offload, which is very similar to what I described for TLS, in the sense that it is the host's responsibility to add the IPsec headers to the packet, but it does not perform the cryptographic operations, which are left to the card.
D
It's worth noticing at this point that, on the one hand, this indeed combines a number of different offloads which we've already discussed: LSO, the segmentation offload, and the checksum offload. So if one is offloading the crypto, one also needs to offload those operations there. Conversely, with IPsec traffic one cannot use segmentation offload or checksum offload if one does not also offload the cryptographic operations. So there are significant benefits to being able to build this stack, but in a sense it's an evolution;
D
one could not build this particular piece of technology without other pieces that came earlier, that is, the ones that Tom spoke about. The other model we have is a full offload. What we mean here by full offload is that the card is responsible for adding the IPsec headers on transmit and, of course, removing them on receive.
E
Alright, so you've already heard a pretty long discussion about how these things all work. That's good; I appreciate everyone who's still awake and has finished checking all their email. Now I will talk a bit about programmability. What Simon and Tom talked about were really all offload features that were enabled exclusively by hardware providers, hardware vendors who felt this is something useful, probably from feedback from users.
E
Or maybe not, it sort of depends, but we're going to build on that and talk about the next evolution on this path, which is fully programmable NICs. As Tom talked about, those offloads can be good, probably are good, but there are a couple of key features I want to highlight and think about, and why programmability of a NIC would matter. So, right out of the gate,
E
I think one of the really important things is that it facilitates really rapid protocol development. We're kind of in a phase right now where fixed-function offload is so powerful and so useful that if you want to deploy a new protocol, or you think you want to help develop a new protocol and you want to rapidly iterate on it, one of the problems you find yourself getting into is: is our current infrastructure really going to burn more cores processing packets just to support this new protocol?
E
There's this notion, right, that if you run a large-scale or small-scale data center, there is going to be some magic packet that's going to melt your network, and this would give you the opportunity to snuff that out in hardware before it gets too far. So today, in the programmable NIC world, there are really two main types. One is special-purpose hardware, the FPGAs and NPUs that Simon referenced before;
E
this is very specific hardware that we're going to program and write code for. The other one is really a new class of NICs that has appeared in the last couple of years that really just contains a general-purpose processor. This might be an Arm, an x86, MIPS, maybe in the future something like RISC-V, but really just something general-purpose that can run any code. And I think, really,
E
while this might seem today like something that isn't exactly what you might want, if you look at some of the forwarding plane realities slides from the last IETF, I think there's a really interesting quote at the conclusion, at the end of that: what's niche today can be broad tomorrow. I think that's, generally speaking, what we've seen across the board in networking and in NICs: someone will roll out a new feature, someone will think, I don't know, and before long everybody's got it and everybody wants it.
E
There will be cases where a software data path does not exist in the kernel for whatever feature you're adding. Now, that's a little bit different from what we do in the Linux community, where, if there are hardware offload capabilities in your hardware, there's sort of an insistence that a software fallback data path exists; within Linux that's been extremely helpful and I think we're going to continue to push that. But this is a case where that might not hold.
E
You may just have a data path that's completely done in the NIC, with no software fallback, at your own risk I guess. And in fact, that data plane could be expressed in a variety of languages: maybe P4 or eBPF, or maybe just the native instruction set for that NPU; as Simon talked about, many NPUs have special instructions for performing operations. And the key thing we talked about too is that this is dynamically programmed.
E
So for this data path that could exist, you can roll out new code quickly. Or, if you're rapidly developing a new protocol, and you start to say, you know what, maybe I don't need 350 bytes of header to describe this new protocol, maybe we'll make it a little shorter, like 324 or something, who knows, you can do that.
E
The other piece is really the general-purpose one, and this is a little bit of a unique situation, a little different than we've had in the past, but it's becoming pretty popular: this is a case where we're moving the entire host networking stack down onto the NIC. And yes, I said that right. What that actually means is your NIC could actually run another copy of an operating system.
E
Some people shudder at this thought, because maybe it sounds a little more complex, but the fact is, if you have this software already implemented on your server, you could actually move it down to your NIC and free up the server cores from doing that work. In this case the data plane offload is down on this general-purpose processor, as I mentioned, and also the control plane. So now, what if your routing daemon was running on the NIC?
E
Or what if whatever was receiving OpenFlow messages from a controller was running completely on the NIC? Now you find yourself consuming zero host resources: server host, not NIC host; they are actually different CPU complexes. So now you're not consuming any of the resources of your server, and you can free them up for doing useful things, whatever those may be. This control plane offload is also really nice.
E
If you have what some are now calling a bare-metal deployment, where you're setting up servers and you don't know exactly what they're going to be used for, but you're responsible for networking, you can feel pretty confident that there's a good chance your server administrators are not going to ruin whatever network setup you want them to have. Also, in multi-tenant deployments this would be really good.
E
You can make sure that no one person has a chance to destroy too much, and it really brings a lot of the server networking administration back into the purview of the network admin. I think there's sort of a constant struggle between those two groups, somewhat understandably, so this gives networking tentacles to reach a little bit further into the server, if you will. Kind of in the same vein, here's that picture again.
E
This would mean that, obviously, if you have applications running on your server, they're still going to get the data that they need, but you're not spending your time just needlessly moving packets between different applications, whatever those look like. And the reality here, and it doesn't get any more recursive than this, I promise, is that the programmable NICs also have offload-capable devices; these things are all being put on the same die, so you have a control plane, a data plane and a fixed-function device.
E
That's all embedded down there, but, like I said, I promise that the offloaded data path on the fixed-function device doesn't also contain another general-purpose processor, and another one on down; it's just that simple, I appreciate that. The simple fact is, we're building these chips that are pretty large and have both the general-purpose cores, maybe Arm or MIPS cores, on the side along with a fixed-function ASIC, but there are also people building NICs that, in addition to that, have FPGAs or NPUs as well.
E
So I think this is kind of a new world in a lot of ways. There's not a large number of users doing this yet, but I think there is a strong case, especially in a place like this where we're seeing rapid protocol development, that the programmable NIC is an extremely powerful option and extremely interesting going forward. So I think, really, the way to summarize this is to think about the networking trends going forward.
E
That was a joke, but it's not, and I wish it was, but anyway. I think we're seeing more and more that there's an interest in deploying new protocols; we regularly hear requests for things where we wonder how we can make fixed-function hardware support them and how long it will take to maybe support that. So this gives a new option for people who want to do those things quickly, and I think that the NICs are going to work together with host operating systems
E
to make these things happen. We don't see offloads going away; we see offloads becoming more powerful and more flexible, and continuing to be important. I also think that the programmability and this flexibility will really spur innovation that we haven't thought of before; I think that's the magical part about some of these devices that are completely, or not completely but fairly, flexible.
C
To answer the question: I would point out that some of the earlier work actually came out of Windows. For instance, RSS was literally invented there, I think it was NDIS that described it, and I believe they had the early checksum offloads. I think what happened is that as Linux became more popular in open source, we had a lot of developers working on that, and at some point, as the volumes go up, the NIC vendors start to pay attention. That being said, we do know that FreeBSD may use some of these; I know that some of the work we did on packet steering was being applied there, and that's a good thing. Like I said in my talk, we do want common APIs across OSes, but most importantly, I don't think there's anything we're doing in the NIC that would be specific to Linux or any particular OS. In fact, I think some of these techniques could even be applied in something like DPDK or kernel bypass, so again we're just using Linux as a reference.
I
Okay, but let's keep this OS-independent in this sense: you've talked a lot about different features on different cards and all the rest of it. When you're writing the code, you've got to know what the card on the machine your code happens to be running on can do. I don't think we need to explain all about the APIs now, but the question is more about what, from what I've seen, is going on. Essentially, someone writes a page to say, you know, what's the consensus on what people do for offload. Does that need standardization? Is it working at the moment just having it done ad hoc? Would it screw it up if it was standardized? I'm just thinking it all seems to be very ad hoc at the moment, the description of what the hardware is capable of, so that you can write code to know what to use.
E
Well, it does feel a little bit ad hoc, especially I think from the outside; maybe "outside" is the wrong word, say as an observer. It probably might feel ad hoc, because you just see patches show up and support exists, and usually what happens is one vendor will come up with something, another one will say, oh yeah, me too, and then they'll do it and maybe enhance it a little bit more. But I think that's the goal.
C
If you think about, let's look at, large send offload: in the NIC, this is splitting a packet up into individual TCP segments. Each TCP header has its own checksum, so after I do the segmentation I need to actually set the checksum; it has to be done per packet, and this is actually one of the trickier things with something like segmentation offload. The fewer things I have to do per packet, the better. If it were the case that I could just copy all of the headers to each segment, that would be a lot easier, but each time we have to consider things like the IP ID, which is another good example in the IP header; packet lengths are always interesting, and checksums are the hardest one. Any time I have to set something that is unique for that packet, I have to do that in the NIC, and the checksum is definitely one of those.
J
And for receive, you have to do it because you have to check the individual checksums; otherwise you might end up returning a corrupt, bigger packet to the stack. In terms of capabilities, for receive you also need checksum offload to do receive offload, yeah. Okay, the other question I had was: is there somewhere, I mean, the earlier questions essentially said there's a cabal of, you know, ten people in the world who actually know how to do this.
C
So the question was about path MTU and, I suppose, segmentation offload. It does matter, and in fact, when we're doing something like LSO or TSO, we aren't just chunking up packets per the link MTU; we want to abide by the path MTU. The way it works is that the host stack actually tells the NIC what size the outgoing packets should be, so we can abide by path MTU.
C
One of the interesting things we try to do when we're sending with LSO is to keep the packets the same size, except for the last one. That simplifies the problem we just talked about with Lorenzo, where we have to set the length for each packet; the easiest way to do that is to be able to infer what the lengths are. So we tell the NIC:
C
this is the maximum length, make all the packets the same size except for the last one, which could be short, and that way we can accommodate path MTU. In terms of larger MTUs, in the data center we're seeing 9K MTUs with jumbo frames; that's actually a little less pertinent to LRO and LSO, since in that case we're just using the native MTU to accomplish the larger packet size.
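As a tiny worked illustration of that sizing rule (my own example, not from the talk; the numbers are just typical values), segmenting a 64 KB TCP payload with a 1448-byte MSS gives equal-sized segments plus one short tail:

```c
#include <stdio.h>

int main(void)
{
    unsigned payload = 65536;   /* bytes handed to the NIC as one big segment */
    unsigned mss     = 1448;    /* per-segment payload chosen from path MTU   */

    unsigned full = payload / mss;   /* 45 full-sized segments   */
    unsigned tail = payload % mss;   /* 376-byte final segment   */

    printf("%u segments of %u bytes + 1 of %u bytes\n", full, mss, tail);
    return 0;
}
```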
K
I was wondering about the crypto offloading that was covered in the middle of the presentation. That sounds very interesting, but what I'm wondering about is to what extent does that repeat the risks of all of these vulnerabilities, such as padding oracles and all of that, in the NIC implementations? Is there any information about that, are there experiences with that, for all of the stuff that got solved in crypto stacks that run on the normal CPU
K
and now in the NIC? The question is, I mean, there are all these vulnerabilities if you do a crypto implementation, like timing attacks, things like padding oracles specific to symmetric implementations. To what extent, or what is the risk, of these getting repeated in the NIC implementations, and if they are, how do you fix that?
D
Yes. As I understand it, the question is: if we look at crypto, there's a wide variety of attack vectors of varying complexity, and any individual implementation might be suffering from any number of these; so if we push a crypto implementation down to the hardware, what kind of problems might we see there? Yeah, I think that's a good point, and certainly we can't pretend that there are not going to be any problems. I think that as the complexity of what's being offloaded increases,
D
for example if we move from a crypto-only offload towards a full offload, then the surface for these kinds of problems must surely exist. In my mind, I'm not really sure what the best way to move forward on this is; certainly the vendors, or the suppliers of the code, ideally open code, would need to move rapidly, but perhaps we also need to have some kind of mitigations in the system.
D
I didn't quite catch that, but I guess the question is: what would be the mechanism to fix it? I think it would depend on the implementation. I mean, if it's a kind of fixed device and you're receiving firmware from the vendor, then I suppose the main avenue, other than mitigations, would be to get updated firmware.
L
Wait a minute, it's a simple question, okay, quick. What we see, exactly what we see, is that there is a trend towards moving protocol implementations into application space for various reasons, and we see that with QUIC in particular. What I've seen in your presentation is that the interfaces that are shown are to the kernel.
E
I think within the Linux kernel there is a little bit of that. There was actually a presentation done last year in Prague, at the netdev conference, about offloading QUIC and what could be done, what kind of kernel interfaces are needed in order to make that possible. So I think the move to protocol implementations like that in user space may be a result of hardware inflexibility.