From YouTube: IETF114 IRTFOPEN 20220725 1900
B
Yeah, it's a pretty empty room so far.
A
So it's a little awkward doing the chairing when the chair is remote but the speakers are local. I don't know if... I hope the speakers are local.
A
I see some familiar faces, along with the other presenters, in the room.
A
Okay, we can get started in a few minutes. If the first speaker is there, do you want to come up and get ready at the microphone while I do the introductory slides?
A
Okay, let's make a start; hopefully it's not too bad. So this is the IRTF open meeting. The IRTF follows the IETF's intellectual property rights disclosure rules, and a reminder that, by participating in this meeting and by commenting on the presentations, you agree to follow the IRTF processes and procedures, including disclosing any intellectual property relating to the contributions that you make. I'm sure most of you have seen these slides before; the details are in the documents linked. But essentially, if you have IPR on the documents you're talking about, you need to disclose that if you're commenting at the microphone.

In addition, a reminder that the IETF routinely makes recordings of these meetings available, both the online and the in-person meetings, including this one, and this meeting is being streamed live on YouTube as well as via the usual Meetecho system. Equally, if you're participating online and you turn on your camera or your microphone to make a contribution, then that is being recorded and you consent to being recorded. The chat is also being recorded and will be made available in the usual Jabber archives.

All right, as a participant in the IETF, as I say, you acknowledge that recordings of the meeting may be made available, that any personal information you provide will be handled in accordance with the privacy policy, and you also agree to work respectfully with the other participants in the IETF and the IRTF. If you have any issues or concerns about that, speak to me or speak with the ombudsteam; the IETF's code of conduct and anti-harassment procedures are linked on the slide.

For those of you participating in person, please sign in using the Meetecho lite tool. We're running the queue electronically, so if you have questions, we're using the electronic queue that's accessed via the Meetecho tool. Keep the audio and video off if you're using the on-site version, the Meetecho lite tool. Remote participants, please leave your audio and video off unless you're presenting or asking a question, just to avoid feedback.
A
Also, a reminder for those of you who are attending the meeting in person: as a COVID safety measure, the IETF is requiring those of you attending the meeting in person to wear an FFP2 or N95 mask, or its equivalent. The only exception to that is the chairs and the presenters who are actively speaking. In particular, participants who are making comments or asking questions from the floor microphones are expected to wear a mask at all times, including while they're asking those questions.
A
Okay, so as I say, this is the IRTF open meeting. The goals of the IRTF are to complement the standards work being done in the IETF by focusing on some of the longer-term research issues. The IRTF is very much a research organization; it's not a standards development organization, and while it can publish RFCs, and we do publish both Experimental and Informational documents in the RFC series, the primary outputs of the IRTF are research results and understanding, that is, research papers.

The IRTF is organized as a series of research groups; hopefully you can see them on the slide here. The Crypto Forum group and the Privacy Enhancements and Assessments group met earlier today. The other groups, highlighted in dark blue on the slide, are meeting later in the week, so please do look out for those groups this week and try to attend the sessions if you're interested in those topics.
A
A little bit of research group news: I'd like to welcome Curtis Heimerl, who has recently joined as co-chair of the GAIA group, the Global Access to the Internet for All research group. Curtis will be joining Leandro Navarro, who is planning on stepping down from chairing that group after this meeting, and Jane Coffin, who is continuing. So I'd like to welcome Curtis and thank him for his service, and thank Leandro for his many years of service to the group.
A
As I say, the IRTF is primarily a research organization, which is why it has not published many RFCs. We've had one RFC published since the last meeting, from the Information-Centric Networking group, looking at architectural considerations for using an ICN name resolution service. But primarily the IRTF tends not to publish much in the RFC series, and the output is more in the form of research papers.

One thing we do run is the Applied Networking Research Prize, and the goal of this prize is to recognize some of the best recent results in applied networking research: to recognize some interesting new ideas which are potentially relevant to the Internet.
A
And what we're doing today, the goal of this session, is to make some of these awards. So I would like to congratulate Tushar Swamy and Sam Kumar, who will be giving their award talks in this session today. Tushar will be talking first, in a couple of minutes, about his work on data plane architectures for line-rate ML inference, and Sam will be following.

Please do pay attention to those, and again, congratulations to Tushar and to Sam.
A
Going forward, look out for a couple more award talks: [unclear] and Daniel Wagner will be giving their talks at IETF 115, and the nominations for the 2023 awards will be opening in September 2022, so do look out for those; we look forward to the nominations.

Okay, hopefully that's better. As I was saying, look out for the nominations for the 2023 ANRP in September this year, and congratulations to Tushar and to Sam, who will be giving their ANRP talks today.
A
The Applied Networking Research Workshop is taking place tomorrow; it's co-located with the IETF in Philadelphia. So thank you to TJ Chung and Marwan Fayed, who are the chairs this year and who've been organizing that workshop. We've got a program of, I think, four really nice research papers, a keynote, and some innovative talks on novel approaches to protocol specification. As I say, the workshop's happening tomorrow; if you're here in person, then please do consider attending, and if you're attending remotely, then you can register and attend.

Registration is free for anyone who's also registered for the IETF, although we do ask you to register separately, so we know who's attending the workshop. And the ANRW next year will again be co-located with the IETF in July 2023, which is planned to be in San Francisco.
A
And to finish up before we get to the talks, I'd just like to note that we are very pleased to offer a number of travel grants for these meetings, both to support early-career academics and PhD students from underrepresented groups to attend the IRTF research groups, and a number of travel grants for the Applied Networking Research Workshop.

Thank you very much to the travel grant sponsors, to Akamai, Comcast, Cloudflare, and Netflix, for supporting that. Please see the travel grants page linked from the website for the details of that, and if you're interested in sponsoring the travel grants in the future, or if you're interested in applying for a travel grant, see that webpage or contact me for details of the sponsorship opportunities. And again, thank you very much to the sponsors.
A
So that's essentially all I have to say today. The agenda for the remainder of the day: we have the two ANRP award talks. Tushar Swamy will be first, talking about Taurus, a data plane architecture for per-packet machine learning, and that will be followed by Sam Kumar's talk on performant TCP for low-power wireless networks.
A
Okay, I will at this point switch over to Tushar. Can you check the microphone while I get the slides up?

Yes, just one second. If you have a phone, I can pass you control so you can control the slides yourself, if you have the Meetecho lite. If not, then shout when you want to go to the next slide.
A
Okay, so you should have control over that. While Tushar is checking to see if that works, I'd just like to say that, as I said, the first talk is by Tushar Swamy, who'll be talking about Taurus, a data plane architecture for per-packet ML. Tushar is a PhD candidate in the electrical engineering department at Stanford. His research focuses on the intersection of machine learning, networking, and architecture, and he works on the hardware and software stack for data-plane-based machine learning infrastructure and applications.

Tushar is due to graduate this year, and I understand he's on the job market, so if you like this work, then please do talk to him. He'll be around at the IETF all week, and if you find this talk interesting, I believe he's also going to be presenting in the COIN research group session later this week. Tushar, over to you.
G
Awesome, thanks Colin. Cool, so I'm going to be talking about Taurus, which is a project that my colleagues and I have been working on. Taurus is essentially a data plane architecture for per-packet machine learning.
G
So this here is a quote from a 2015 Google blog, and at that time Google was already dealing with over one petabit per second of total bisection bandwidth, and it's only grown larger and harder to scale since. So what we're essentially dealing with here is a situation where networks require more and more complex management at higher and higher performance, and so the time is ripe for finding new solutions here. One of the promising solutions in this area is machine learning. Machine learning can allow us to essentially take in data from the network and make progressively better and better decisions as we train our models. These machine learning algorithms can approximate network functions based on the data they see, and they're also going to customize their operation to the data that they're training on, which in turn means that these machine learning algorithms are actually customizing their models to the network itself. And we're sort of doing elements of this already with handwritten heuristics in the network.

So something like an active queue management algorithm, or hashing and load balancing, all come with operator-tuned parameters. All machine learning is doing here is taking the next step by automating the search for the kinds of parameters that allow these functions to work well within your network.
G
So if we're okay with using machine learning, we now need to examine where exactly in the network it has to happen. I'm sure many of you are already familiar with software-defined networks: essentially, the control plane and the data plane are split, and the control plane is responsible for policy creation, essentially in the form of flow rules, which are installed into the data plane. That's where you're going to find your switches, and they're doing packet forwarding via match-action.

So, right off the bat: on the left here I have a diagram of that same typical software-defined network, but on the right I have a software-defined network with the Taurus worldview. What we've actually done here is split the machine learning operation into training, which is going to happen in the control plane, and inference, which is going to happen in the data plane.
G
So in the control plane, policy creation is going to take the form of flow rules plus ML training, and when installing this information into the data plane, it's going to be sending flow rules as usual, but also the ML model weights. And in the data plane, we're going to be doing our typical match-action packet forwarding, but we're also going to be doing decision making with ML inference.

And that brings me to one of the core tenets of Taurus, and that's essentially that ML inference should happen per packet in the data plane. The intuition here is relatively straightforward: you want to be able to do per-packet operation because that is the finest granularity of traffic, essentially operating at the packet scale.
G
Now, not every application may need per-packet-level operation, but there are applications that need it, and so the platform should be able to support per-packet operation. And the data plane is where the packets are, so if we're going to be making decisions on packets, it should happen in the data plane.

Oh, I think PowerPoint animations don't play well with the PDF; that's okay. What's basically happening here is a rough, off-the-cuff bit of math: say you have traffic at one gigapacket per second moving through your data plane. Now, think about the time it takes you to send a packet digest from the data plane up to the control plane, calculate flow rules, and then install them back into the data plane.
G
In this case, we've assumed half a millisecond for each step, so we've now missed 1.5 million packets in our traffic stream by the time we had flow rules installed into the data plane.

So in the example here we're doing anomaly detection: we're trying to find out if incoming packets are malicious or benign, and maybe, if we find that a packet is malicious, we're going to install some rule to, say, block that IP. So if we've missed 1.5 million packets during this flow rule installation time, by the time we block that IP we've already let a ton of potentially malicious traffic into the network.
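The arithmetic behind those numbers is easy to check. This is just a sanity check of the figures quoted in the talk (one gigapacket per second, an assumed half millisecond per step), not anything from the Taurus implementation itself:

```python
# Back-of-envelope check of the control-plane reaction cost described above.
line_rate_pps = 1e9        # one gigapacket per second through the data plane
step_delay_s = 0.5e-3      # assumed half a millisecond per step
steps = 3                  # digest to controller, rule calculation, install

reaction_time_s = steps * step_delay_s
missed_packets = line_rate_pps * reaction_time_s

print(f"reaction time: {reaction_time_s * 1e3:.1f} ms")           # 1.5 ms
print(f"packets missed before the rule takes effect: {missed_packets:,.0f}")
```

At 1 Gpps, the 1.5 ms round trip means 1.5 million packets pass unexamined, matching the figure in the talk.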
G
So, fundamentally, the conclusion here is that the robustness and performance of your network are going to be determined by the quality of your reaction and the speed of your reaction. In the machine learning worldview, the quality of the reaction is going to be determined by your training data: how much do you have, what kind of cases does it cover, how well is it cleaned? But there's also your speed of reaction: in the case of the anomaly detection, you want to act on a malicious packet as soon as possible.
G
So, zooming in on the control plane, let's talk a little bit about the actual implementation of how you do this. I mentioned before that we're going to split our machine learning into training in the control plane and inference in the data plane, and the key here is that training is off the critical path. If packet forwarding is happening in the data plane, then the control plane is not responsible for making per-packet-level decisions, which means that we can do our machine learning training there.

The trickier part comes in the next step, where now we need to deal with the actual critical path, basically handling packets as they come. Machine learning inference here is going to happen in the data plane, like I mentioned, and the final outstanding question, then, is: if we're okay with doing training in the control plane, we can use whatever existing hardware we want there; so what do we do about the data plane?
G
So, let's jump into the actual hardware and how we enable this kind of machine learning inference at line rate. I have a picture here of a PISA pipeline, a protocol-independent switch architecture. These are the typical programmable structures you'll find in these kinds of switches: some sort of programmable packet parser, match-action tables that allow you to encode your network functions, and then maybe a programmable traffic manager.

We're going to actually keep most of these elements and just make a modification, adding additional hardware that'll allow us to do our machine learning inference. But the natural question is, if we're committing to adding hardware into the switch pipeline, what does that look like?
G
More specifically, what is the abstraction with which we're going to create our programmable machine learning fabric? In Taurus, we use the MapReduce abstraction. MapReduce is really useful for machine learning because it supports a lot of the common linear algebra operations that you need for your ML algorithms. This covers everything from neural networks to support vector machines, k-means, all these different kinds of applications. And just as an example, I have here in the picture an example of a single neuron from a deep neural network.
G
So you can see exactly how map and reduce are applied here. In this case, in the blue box, we are doing an element-wise multiplication (that's our map) with inputs and weights, and then we're applying a reduction, which is essentially going to add all the values together, so you produce a scalar value from your vector of inputs. And then, finally, we're going to apply an activation function. That suffices for a single neuron.

But you can mix and match this pattern ad nauseam to create a full neural network: by stacking extra copies of these blocks in parallel you'll be creating a layer of neurons, and then by stacking them sequentially you'll be creating multiple layers, and that's how you can create, say, a deep neural network.
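The pattern described above can be sketched in a few lines of plain Python. This is purely an illustration of the map/reduce decomposition of a neuron (the sigmoid activation and the function names are my own choices, not the Taurus hardware):

```python
import math

def map_multiply(inputs, weights):
    # "map": element-wise multiply of the input vector with learned weights
    return [x * w for x, w in zip(inputs, weights)]

def reduce_sum(values):
    # "reduce": collapse the mapped vector down to a single scalar
    total = 0.0
    for v in values:
        total += v
    return total

def neuron(inputs, weights, bias=0.0):
    # one neuron: a map, a reduce, then an activation (sigmoid as an example)
    z = reduce_sum(map_multiply(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, neuron_params):
    # neurons placed in parallel form one layer
    return [neuron(inputs, w, b) for w, b in neuron_params]

def network(inputs, all_layers):
    # layers placed sequentially form a deep network
    x = inputs
    for params in all_layers:
        x = layer(x, params)
    return x
```

Stacking `neuron` blocks in parallel gives `layer`, and chaining layers gives `network`, mirroring the parallel/sequential stacking described in the talk.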
G
The other advantage of the MapReduce pattern comes from the kind of performance that it enables, primarily from SIMD parallelism: same instruction, multiple data. We can get a lot of performance out of the parallelism with minimal logic. This is as opposed to what you might find in, say, a typical Tofino pipeline, where they have VLIW pipelines, which give you much more flexibility, but the cost there is that a lot of logic is needed for the communication hardware, and that ends up taking up a lot of the overall on-chip area.

In addition, SIMD parallelism gives us the ability to unroll the loops in our algorithms. On the idea of unrolling here: take the example of, say, a single layer of a neural network, and say you have four neurons in your layer. You can either execute them all in parallel, which needs four neurons' worth of hardware,
while less unrolling means you only need the hardware for one single neuron's worth of operations, but it's going to take you four times as long. So it's less resource-intensive, but it's also much higher latency, and we get this kind of control with the SIMD pattern essentially by adjusting unrolling factors.
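The trade-off can be made concrete with a toy accounting of hardware copies against sequential passes; this is my illustration of the unrolling-factor idea, not code from the paper:

```python
def unroll_tradeoff(num_neurons, unroll_factor):
    # unroll_factor = how many neuron datapaths exist in hardware; the rest
    # of the layer executes as sequential passes over that same hardware
    if num_neurons % unroll_factor != 0:
        raise ValueError("unroll factor must divide the layer size")
    hardware_copies = unroll_factor          # area cost grows with this
    sequential_passes = num_neurons // unroll_factor  # latency grows with this
    return hardware_copies, sequential_passes

# a four-neuron layer, fully unrolled: 4x the hardware, a single pass
full = unroll_tradeoff(4, 4)
# fully rolled: hardware for one neuron, but four passes (4x the latency)
rolled = unroll_tradeoff(4, 1)
```

The product of the two returned values is constant: the unrolling factor only moves cost between area and latency, which is exactly the knob described in the talk.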
G
So we went ahead and essentially augmented the switch pipeline with a MapReduce unit that implements the patterns I just described. We still have our typical programmable elements: a programmable packet parser, match-action tables, and a traffic manager, but you can see in the center we have this MapReduce unit, and that's essentially what's going to do our machine learning inference.

There are a couple of little idiosyncrasies about the arrangement of the pipeline that I want to point out, and that's how we use these different elements in a machine learning context, even though they're typically network elements. So a packet parser is normally for pulling out the headers from your packets and doing whatever you want with your match-action rules.
G
Here, the match-action tables before MapReduce can be doing some sort of cleaning on the features, and then the match-action tables on the output, on the right side of the MapReduce unit, can be doing some sort of interpretation of the results. And when we actually went to design this MapReduce unit, a couple of things came up. It turns out you can't really just stick an accelerator into the switch pipeline, so what we did was establish what the points were that we wanted.
G
It has to meet line rate with a fixed clock. This essentially rules out an FPGA, because an FPGA will give you a variable clock; we want it to be deterministic, and of course line rate is our performance requirement. And then minimal area and power overhead: we don't want to blow up the entire chip area adding in a MapReduce block. It should be something that is small but gives you access to a whole class of applications.
G
And finally, the one little thing to note here that's kind of interesting is that most of these ML accelerators are built to do batch processing in an effort to get high throughput, but in the network pipeline you're actually processing packets as they're coming, which means that you're operating on a batch size of one. That, it turns out, puts a lot of different performance demands on the hardware than a typical accelerator would see.
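To put a number on the batch-size-of-one constraint (an illustrative calculation, using the 1 Gpps line rate quoted elsewhere in the talk):

```python
def per_packet_gap_ns(line_rate_pps):
    # With a batch size of one, the pipeline must admit a new inference every
    # packet inter-arrival interval, rather than amortizing work over a batch.
    # (Each packet's total latency can still be longer than this gap, because
    # the pipeline overlaps the processing of consecutive packets.)
    return 1e9 / line_rate_pps

gap = per_packet_gap_ns(1e9)  # one gigapacket per second -> 1 ns per packet
```

A batch accelerator gets to fill its datapath with many inputs at once; here a new, independent input arrives every nanosecond, which is why the fixed-clock, deeply pipelined design matters.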
G
So you can imagine, say, a packet coming into the switch pipeline, and we want to see essentially whether it's malicious or benign. The packet hits the first stage, and that's where we're going to do our packet parsing. So we're going to read local features, say our IP, whatever information we can extract from the packet itself. The packet is going to move to the second stage, which is the match-action tables, and from there maybe we're going to do some sort of retrieval of out-of-network events. These would be different kinds of elements of metadata that the control plane may have installed into the match-action tables, something like the failed logins per IP.

The packet then moves to the center block, the MapReduce unit. That's where we're going to apply our learned anomaly detection. You can imagine this is maybe a binary neural network, and it gives a score from zero to one on how anomalous the packet is, so one is definitely anomalous and zero is benign. And finally, the packet will move to the post-processing match-action tables, and that's where we do our interpretation.
G
Thank you. So in the paper we actually did a full ASIC analysis of this Taurus hardware; we wanted to show essentially that it has minimal overhead and that it's feasible to build something like this. We based our evaluation platform on a coarse-grained reconfigurable architecture called Plasticine, and we programmed our applications in the Spatial hardware description language. Spatial is just an HDL that lets you use these kinds of parallel patterns, like map and reduce, to program your reconfigurable architectures at the loop level. And the basic architecture of the MapReduce unit here is really just a grid of compute and memory tiles: easily scalable and very, very straightforward.
G
In the compute units we have SIMD lanes that are operating in parallel and a reduction network that allows us to implement the reduce operation, and the memory units are just blocks of banked SRAM. So we're doing deep pipelining within the compute unit, but then we're also doing pipelining one level higher, between the compute and memory units. So the idea here is SIMD parallelism everywhere and then pipeline parallelism everywhere, and that's really how you get your performance.
G
So we went through a set of real-world applications and programmed them onto our ASIC, and we ended up using a 12-by-10 grid to support all of them. We compared it to state-of-the-art switches with four pipelines; our reference was 500 square millimeters, and we found that our grid, which could support these different applications, was only adding a 3.8 percent overhead, or 4.8 square millimeters per pipeline. So, again, earlier I said we want minimal area overhead, and 3.8 percent is pretty low.
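Those two numbers are consistent with each other. Assuming the 4.8 figure is square millimeters per pipeline on the four-pipeline, 500 mm² reference die (my reading of the slide; the paper has the exact figures), the percentage follows directly:

```python
reference_area_mm2 = 500.0        # four-pipeline reference switch die
num_pipelines = 4
overhead_per_pipeline_mm2 = 4.8   # MapReduce grid area added per pipeline

total_overhead_mm2 = num_pipelines * overhead_per_pipeline_mm2
overhead_pct = 100.0 * total_overhead_mm2 / reference_area_mm2
# 4 * 4.8 = 19.2 mm^2, i.e. 3.84% of 500 mm^2, matching the ~3.8% quoted
```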
G
Jumping into one of these applications: I've been using anomaly detection as a recurring example here. We tried out two different types of anomaly detection, with a support vector machine and a deep neural network, and for both models you can see in the throughput that it's one gigapacket per second, which is the line rate for high-end switch pipelines like your Tofinos and Broadcoms. The latency that we added was in the hundreds of nanoseconds or less, so in this case you would choose your application accordingly.

You can see here that the SVM requires 83 nanoseconds, while the DNN requires 221 nanoseconds. So, depending on your SLOs and what kind of requirements you have to meet, you can choose your algorithm to reduce latency. And then, in both cases, the area and power overhead required for the hardware to implement just these applications is single digits or less: a 0.6 percent power overhead and 0.5 percent area overhead for one, or 0.8 and 1.0 percent respectively for the other.
G
Again, if you don't need, say, the full suite of benchmarks, and you only want a reconfigurable fabric that will let you do anomaly detection, you can do it with minimal overhead. And in the paper there are several more applications, if people are interested, such as a congestion control network and a traffic classification network.

So we went through this whole process of doing an ASIC analysis to prove that it could be done.
G
But as far as research goes, we don't really want anyone waiting for some sort of mass-produced Taurus ASICs, so we've put out an open-source FPGA-based testbed, and this is just a rough diagram of what it looks like. At the control plane we're using your typical network OS, like ONOS; we're using a Tofino switch to mimic the PISA pipeline elements, like your programmable packet parsers, match-action tables, and traffic managers; and then we're using an FPGA to mimic the MapReduce unit.

We set it up in this bump-in-the-wire configuration, and because of the limits of an FPGA you're not going to be able to hit the same performance as you would with the ASIC coarse-grained reconfigurable architecture, but it's there to serve as a proof of concept for the functionality.
G
So, just a quick demonstration of this testbed. We did an example, essentially the example I mentioned earlier about anomaly detection, where we're either trying to do detection of anomalous packets in the control plane, or we're trying to use Taurus and do anomaly detection in the data plane. With the testbed that I just showed you, you can do either: in the case of Taurus, we'd be placing our anomaly detection application on the FPGA, while if we're trying to do control-plane-based anomaly detection, we would run it at the controller, on the CPU.
G
Now, this is the F1 score for the model when it was implemented on the baseline, which is the control plane, or on Taurus, which is in the data plane. In software, in TensorFlow, the F1 score is 71.1, and you can see that Taurus, on the far right side of the table, is achieving an F1 score of 71.1 as well.

So it's faithfully recreating the model as it was in software, and we're processing packets as they're coming in. Whereas in the control plane, we actually had to sample packets from the network, run them through the control plane and through an ML framework, and try to install flow rules. And what ends up happening is that you miss so many packets while doing this operation that your effective F1 score drops pretty heavily.
G
So that's just one example of what we did with our FPGA testbed. There are, of course, lots of other things you can do, but it's just to reinforce the point of why you have to operate in the data plane.

Cool, so yes, that's mostly it for me. I have my contact information here, and I have at the bottom the GitLab link for the FPGA testbed; we hope people can try it out. And there's the link to the full paper at this easy-to-memorize URL. So, yeah, I'm happy to take any questions.
A
Okay, thank you very much, yeah, an excellent talk. Since we have some people remote and some in the room, I think if we can manage the queue using the Meetecho queuing tool, that would be helpful. I do see, I guess, Barry at the microphone there.
B
Okay, actually I'm being Dave Oran right now. Dave Oran asks: I assume the class of anomalies you can detect are those that can be detected by header fields within the width of the ALU of the switch; things in the packet data beyond the headers won't be seen. Is that correct?
G
So, in the case of anomaly detection, we used the NSL-KDD dataset, which had a record of different attacks that were calculated from, like you said, either header fields, or you can also actually calculate aggregate fields from across headers. So you can, say, create a histogram, using the match-action tables, across different packets. The packet headers are going to be limited by the packet header vector size that's moving between stages in the switch pipeline, but you don't necessarily need to be limited to features in the header, because the control plane can install different types of metadata into the match-action tables, and you can do your own processing in the match-action tables over time, or whatever other kind of calculations you want to do on your headers. So the headers are just the starting point for the features here.
H
Can I sneak two questions in, is that okay? Yeah. So the first one, and this is the naive attendee question: I suspect the paper is very important for interpreting that last table. It was really quite opaque how to understand the meaning of the columns and their impact on a comparison to the baseline; I think there's a lot of implicit knowledge in your table structure. I'm sure the paper explains it, but the slide was just a bit obscure to a naive reader.

Then, at the start of your talk, and this was the first point you made, you made a case to say that the delay between doing a packet sample, constructing table match rules in the controller, injecting those rules down into the functional plane, and applying them had a huge packet loss and mismatch interval.
H
But it seems to me the delay to perform the ML operation, to tune your ML so you have a model that is representative of the condition you want to model, and then install that, has a similar cost. It's not to say there's no benefit of ML; I think it's huge. But the component that's about the delay cost of doing an instantiation of rules, I don't think, is a basis for doing it.
G
So, you're right about installing the model itself. The idea is that you could be sampling packets from your network and sending different kinds of metadata to the control plane, and essentially be doing your training offline, and you can install model weights or replace model weights as needed. The idea is that whatever is operating in the data plane itself has nothing to do with the installation of model weights.
H
That is very beneficial, but if you consider a new class of attack, you have to understand it and do some form of Bayesian analysis and classification, which is completely unmodeled here. Exactly how you do that training is unknown, as is how long it takes. It's not about the speed of the chipset; it's about your ability to do the good/bad classification a priori, to inform the model and then download it. That's quite a high cost in time.
G
Yeah, so this is always kind of the trouble with security, right? If you want to do an on-the-fly analysis of a brand-new attack, that's not really what we're targeting.
H
But in engineering terms, your case that this is extremely fast, that it's line rate, is well made. I enjoyed listening to it a lot. Thank you. Thank you.
B
[inaudible question about energy consumption]

G
So I think energy consumption needs to be looked at maybe more holistically. While you are increasing, by some small percentage, the energy that you'd be consuming in the switch itself, you can consider that, say, if you're doing anomaly detection, you're removing the cost of running an anomaly detection application in software on a server somewhere else.
A
Okay, thank you. Questions?
A
And I guess I'll ask a question. This is an IRTF meeting which is co-located with the IETF, and obviously this is the kind of case with the IETF — the question is, to what extent have you given any thought to how this might change or affect the type of work the IETF does?
G
Yeah, I think one of the things that Hesham, who asked a question earlier, brought to my attention was: what kind of standardization is needed for packet headers if we're going to be using them as features, or carrying model weights — basically doing these kinds of ML-assist operations?
A
Yeah, that makes a lot of sense. Presumably there's also something in terms of the control plane and a standardized programming model for that, in order to specify the model — is that right?
A
I mean, I'm thinking that your traditional programmable switch uses P4 or something like that as a programming model. Do we need a similar standardized programming model for these types of ML switches?
G
Yeah — so, as kind of a complement to P4, we went with MapReduce, but we're not necessarily married to the idea of using a MapReduce block or anything. The bigger idea here is just doing inference in the data plane.
G
It
could
definitely
help
to
have
some
sort
of
standardization
in
the
way
that
people
works,
but
for
the
the
mapreduce
elements,
so
you
could
even
consider
like
an
extra
control
Block
in
P4
as
mapreduce,
and
we
actually,
we
have
another
paper
intermission
on
what
the
the
language
level
constructs
here.
Look
like
so
yeah
there's.
It's
definitely
an
area
for
standardization
as
well.
A
There we are — it just took a little while. Okay, great. All right, so the second talk today is focusing, I think, on a very different problem domain. In this talk, Sam Kumar will talk about his paper on performant TCP for low-power wireless networks.
A
This was originally presented at the NSDI conference in 2020, if I remember correctly. Sam is a PhD student at UC Berkeley, advised by David Culler and Raluca Ada Popa. He's broadly interested in systems, security, and networking, and his research focuses on rethinking systems design to manage the overhead of using cryptography — and presumably also improving the performance of TCP for low-power wireless networks. So, Sam, over to you.
I
Okay, thanks Colin for the introduction. As you said, I'm Sam, and I'm going to present my research on performant TCP for low-power wireless networks. This is joint work with my collaborators at UC Berkeley and, as you mentioned, it was published in 2020 at NSDI. I'm going to begin with a brief overview of the history of research in low-power wireless personal area networks, or LoWPANs, to put our research in context.
I
So LoWPAN research began in the late 1990s, and at this point in time researchers deliberately cast away the internet architecture, based on the idea that LoWPANs may have to operate in extreme environments too different from regular networks for the internet architecture to directly apply. So many of the early protocols, like S-MAC, DMAC and so on, and the early systems, like TinyOS and Contiki, did not conform to any particular standard or architecture.
I
And what happened here is that people found ways to take the lessons learned in the earlier systems and apply them within an IP-based architecture, and this essentially caught on: within a few years, by about 2012, IP had essentially become the standard in the space. But surprisingly, the adoption of IP did not come with TCP.
I
For example, OpenThread, a LoWPAN network stack developed by Nest and used in the smart home space, didn't even support TCP; instead, the community has come to rely on protocols like CoAP, which are specialized LoWPAN protocols based on UDP. It's also worth pointing out that during this time, LoWPANs had not yet achieved the same kind of pervasive adoption that we've seen with other technologies like Wi-Fi, at least in the context of bringing internet access to devices.
I
Well, one metric is goodput, and that's the amount of bandwidth an application is able to get when operating over a TCP connection. Now, there have been a few prior attempts to use TCP in this space, typically based on a simplified embedded TCP stack like uIP (micro IP) or BLIP, and what we can see in this graph is that our work, TCPlp, achieved significantly higher goodput than prior attempts to use TCP in this space.
I
I'd also like to share an update that's happened since we published this research, which is that OpenThread — the low-power network stack I mentioned that's used in the smart home space — has since adopted TCP directly, based on our research. It uses TCPlp as its TCP implementation, and the research also influenced Thread, the network standard that OpenThread implements. So I'm delighted to have been invited to help spearhead this process, and I am hopeful that the adoption of TCP in this space will help improve the adoption of LoWPANs more broadly in the smart home space.
I
So now I'm going to take a step back and provide some more context as to what exactly LoWPANs are and what some of the challenges are with using them, and I can do that by comparing LoWPANs to other wireless technologies that you might be more familiar with. On the left, Wi-Fi provides a host with internet access via an access point. In the middle, Bluetooth doesn't really provide full internet access; it's more like a cable-replacement channel, a wireless USB of sorts. And then on the right,
I
we have LoWPANs, which aim to provide internet connectivity at the same level as Wi-Fi would, but to embedded devices, while operating within the constraints of low power — for example, having to transmit data over multiple wireless hops to set up an embedded mesh network. So LoWPANs have been used in a variety of applications — for example, scientific applications like environmental monitoring and structural monitoring of a bridge — and they've also been deployed in the indoor environment in a smart grid context. Recently, there's been a push to deploy them in the smart home and IoT space, and the Thread and OpenThread efforts
I
I mentioned earlier are one such attempt. But despite being useful for all these applications, it's difficult to use LoWPANs, because they also come with a set of challenges. The first set of challenges comes from resource constraints — the fact that the embedded hosts have limited CPU and memory resources.
I
The second set of constraints comes from the link layer. A LoWPAN link layer like, for example, IEEE 802.15.4 has a small MTU of only about 100 bytes and has low wireless range, which means that, in order to get connectivity over a large area, you need to transmit data over multiple wireless hops. And finally, energy constraints are also an issue: you typically don't have enough energy to keep your radio on and listening all the time, so you duty-cycle your radio.
I
What that means is that your radio is actually in a low-power sleep state for, say, 99% of the time, and then one percent of the time you can turn on your radio to send or receive packets. And in order to provide an always-on illusion to applications despite doing this to save power, we need some careful scheduling at the link layer, to make sure that data is only sent to a node when its radio is on and ready to receive that data.
I
So, to make this more concrete, I'm going to tell you about the platform we use in our research. It's called Hamilton, and some of the stats of this platform are on the slide. The key point here is that this kind of device is more powerful than the devices we had when LoWPAN research first got started in the early 2000s, but it's still substantially less powerful than even a Raspberry Pi — you cannot run Linux on a device like this.
I
Instead, you have to run a specialized embedded operating system. You can understand our research as tackling the central question of how a device like this should connect to the internet, and the result of our research is that we show that TCP/IP works well. Now, as I mentioned earlier, the adoption of IP in this space did not include TCP, and that was no accident. The reason is that researchers had doubts as to whether TCP would work well, and they expected it not to work well given the challenges of LoWPANs. So here are some quotes.
I
The second one is that certain features of TCP may cause harm — for example, that the connection-oriented aspect of TCP is a poor match for wireless sensor networks, where actual data may only be on the order of a few bytes. And finally, there's the wireless TCP problem: the idea that TCP may use a single packet drop to infer that the network is congested, which can result in extremely poor performance, because wireless links tend to exhibit relatively high packet loss rates.
I
Okay, so the actual reasons are that LoWPANs have a small L2 frame size — basically a small MTU — and this results in very high header overhead. The second problem is that hidden terminals are a serious issue for TCP when operating over multiple wireless hops. And finally, the kind of scheduling at the link layer needed to support a low duty cycle and low energy consumption interacts poorly with TCP.
I
But the issues on the right, it turns out, are fixable within the paradigm of TCP with fairly straightforward techniques. So in our research, we show why the expected reasons don't actually apply, we demonstrate techniques to address the actual issues causing poor TCP performance, and our overall conclusion is that TCP can perform well in LoWPANs after all. So that's an overview of what I'm going to be telling you about, and there's also, by the way, a set of techniques that we propose in order to make TCP over LoWPANs work well, which I'll go over in the course of the talk.
I
Okay, in the next part of the talk I'm going to focus on the expected reasons — why the expected causes of poor performance don't apply. To go back here, I'll be talking about this technique in this part of the talk, and the reason is that this part of the talk is more about our experiments and our observations about the expected reasons.
I
So our methodology is based on the Hamilton platform, as I mentioned earlier; you can see the picture there. This is a Hamilton platform connected to a Raspberry Pi, and the Raspberry Pi is just there as a back channel to collect logs, measurements and so on. The TCP stack was, of course, running on the Hamilton platform directly. Our software stack is OpenThread with RIOT OS, and we used a wireless testbed
I
to collect data, where each of those numbers is one of our Hamilton nodes. The lines connecting them show an example of a topology; in reality, OpenThread is going to generate this dynamically — this is just a snapshot of what it might look like. And we ran TCP where one TCP endpoint is in the wireless mesh, on one of the Hamilton nodes, and the other TCP endpoint is hosted in the cloud, on Amazon EC2.
I
So one of the first things we had to do was to implement TCP. Now, as I mentioned earlier, there have been several prior attempts to use TCP in this space based on simplified embedded TCP stacks, but we wanted to use a full-scale TCP stack in our study. Now, the challenge is that implementing a full-scale TCP stack is hard, and in fact there's an entire RFC devoted to describing all the problems that people were seeing in full-scale TCP stacks back in 1999, even though those TCP stacks had matured for at least a decade by that point. So our approach was not to implement a TCP stack from scratch, since we felt it would be too error-prone to do.
I
So a natural question is: what about the actual buffers used to send and receive data? The TCP buffers need to be the bandwidth-delay product in size in order to be able to send at the full speed of the network, and we empirically determined the bandwidth-delay product to be two to three kilobytes. You can see in the graph here how we did that experimentally: at two to three kilobytes of buffer size, the available goodput over TCP levels off.
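As a rough sketch of the buffer-sizing argument above — the buffers must cover the bandwidth-delay product (BDP) to keep the pipe full — here is the arithmetic in Python. The bandwidth and RTT figures below are illustrative assumptions, not measurements from the talk; they simply land in the two-to-three-kilobyte range mentioned.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-delay product: the number of bytes that can be 'in flight'
    on the path at once. TCP send/receive buffers must hold at least this
    much data to transmit at the network's full speed."""
    return int(bandwidth_bps / 8 * rtt_s)

# Hypothetical LoWPAN-like figures: ~125 kbit/s effective rate, ~150 ms RTT.
buf_size = bdp_bytes(125_000, 0.150)  # falls in the 2-3 KB range
```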
I
So here is TCPlp's memory footprint: including the size of the buffers, it fits comfortably in memory. And in fact there's another conclusion to be drawn here. If you notice, the size of the buffers is actually much bigger than the connection state, which suggests that most of the overhead of TCP doesn't come from the complexity of the protocol — it comes from the buffers — and any performant bulk-transfer protocol would need these buffers in order to transmit at the BDP. So in some sense, the overhead really isn't bottlenecked by TCP's complexity at all.
I
We also introduced a technique here to reduce the memory used for the buffers, and part of this relies on TCP having both a receive buffer and a reassembly buffer — to store in-sequence data and out-of-sequence data for reassembly. Now, full-scale TCP stacks like FreeBSD's use packet queues: there's a separate queue of packets for each of these. But in the embedded setting, we don't want to use dynamically allocated packets, because if we hold on to dynamically allocated packets in a memory-constrained setting, we may cause other memory allocations to fail. So instead we want to use flat arrays, and the naive strategy would be to have a separate flat array for your receive queue and for the reassembly queue. Now, to optimize this,
I
what we observe is that there's an interesting relationship between the advertised window size, the number of bytes we currently have, and the total size of the buffer, which is that the number of received bytes plus the advertised window size is equal to the total size of the receive buffer. The observation we make on top of this is that all of the data we may possibly get for reassembly has to fit within the advertised window — that's the contract of TCP: if you're sending to a recipient, you should not go past their advertised window.
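A minimal sketch of that invariant — received bytes plus advertised window equals the receive buffer's capacity, so out-of-order data can never land outside the buffer — might look like this. This is an illustration of the idea, not the TCPlp code; the class and method names are mine.

```python
class RecvBuffer:
    """Sketch of a single flat array shared between in-sequence data and
    reassembly. Because received + advertised_window == capacity, and a
    correct sender never transmits beyond the advertised window, any
    out-of-order segment must fit in the unused tail of the same array."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.received = 0  # in-sequence bytes waiting for the application

    @property
    def advertised_window(self) -> int:
        # The invariant: window shrinks exactly as received data grows.
        return self.capacity - self.received

    def deliver_in_sequence(self, n: int) -> None:
        assert n <= self.advertised_window, "sender violated the window"
        self.received += n

    def application_read(self, n: int) -> None:
        # Reading frees space, which re-opens the advertised window.
        self.received -= min(n, self.received)
```

For example, after delivering 500 bytes into a 2048-byte buffer, the advertised window drops to 1548, and the invariant still holds.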
I
Okay, next I'm going to talk about the wireless TCP problem, and before we get to that, let me tell you about the number of in-flight segments, since that affects TCP's congestion control. So, as I mentioned, the bandwidth-delay product is two to three kilobytes. Each segment is sized to about 250 to 500 bytes, and this was chosen carefully — it's actually based on a technique I'll tell you about later in the talk, for coping with the small MTU of these networks.
I
So I'll come back and explain this, but for now take it as a given that our segments are 250 to 500 bytes, and what this works out to is that we have 4 to 12 in-flight TCP segments at any one point in time. Now, this is different from other, higher-bandwidth networks.
Here
our
maximum
segment
size
is
462
bytes
and
what's
going
on,
MSA
and
active
segment,
says
I'm
actually
subtracting
the
space
for
TCP
options.
So
this
is
how
much
data
is
sent
in
htcp
packet
and
our
Bama
delay
product
is
filled
by
just
four
kcp
segments.
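The arithmetic behind that four-segment window can be sketched as follows; the helper name is mine, and the numbers just mirror the figures quoted in the talk (a 2-3 KB BDP and 250-500 byte segments giving roughly 4-12 in-flight segments).

```python
import math

def inflight_segments(bdp_bytes: int, mss_bytes: int) -> int:
    """Number of TCP segments that fit in the bandwidth-delay product,
    i.e. the congestion window (in segments) needed to send at line rate."""
    return math.ceil(bdp_bytes / mss_bytes)

# With a 462-byte MSS, roughly four segments fill a ~1.8 KB BDP.
w = inflight_segments(1848, 462)
```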
I
So what ends up happening is that, yes, our losses are very frequent, but because we only need a congestion window of four segments in order to fill up the BDP and send at line rate, TCP's congestion control is actually able to recover from losses extremely quickly, and we spend most of our time sending at a full window,
I
despite the losses in the wireless medium being frequent. On the right, we have a more challenging scenario where we size our MSS to be smaller and we use some active queue management, which induces some more loss events, but we still find that TCP is able to reach a full window and operate there most of the time, despite seeing frequent losses — so, somewhat counter-intuitively.
I
So now I've talked about why the expected reasons don't apply. In the next part of the talk, I'm going to tell you about the actual reasons for poor performance, and going back to our slide with our techniques on it, I'll be telling you about these three techniques. Now, there are a couple I won't get to — the zero-copy send buffer and the link-layer queue management — and that's because I don't have the time in this talk to cover them.
I
So first, dealing with the MTU problem. Here's a graphic showing the size of the MTU in Ethernet, Wi-Fi, and IEEE 802.15.4, which is an example of a LoWPAN link layer. What we can see is that the TCP/IP headers are very small compared to the Ethernet and Wi-Fi MTUs, but they're significant compared to the IEEE 802.15.4 MTU, and this is going to result in large header overhead. Okay, normally we size TCP segments to be as large as the link supports, but no larger — this is standard; this is what's used in Ethernet and Wi-Fi. But in the case of IEEE 802.15.4, that's only 104 bytes, right? Our MTU is small, and our TCP/IP headers can actually take up more than half of that if you include the cost of TCP options, even if you use the standard IP header compression that's part of 6LoWPAN. What that means is that, if you're transmitting data over a TCP connection, more than half of the bytes you're sending out are just these headers, and your goodput is severely affected by that.
I
So, in order to overcome this, we break this conventional wisdom and instead allow TCPlp to have TCP segments that span multiple link-layer frames. What that means is that we're relying on the 6LoWPAN adaptation layer to handle fragmentation and reassembly for us, which adds some overhead, but it means that the overhead of our headers is now amortized over multiple frames, allowing us to get good goodput. Now,
I
there is a trade-off here. If we use too much fragmentation — if we set our TCP segments to be way too large — what's going to end up happening is that we rely on too much fragmentation, and that's bad, because now if one fragment gets lost, we lose the entire packet. So what we want to do is choose TCP segments that are as large as possible to effectively amortize the overhead, without incurring more fragmentation beyond that.
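To see why amortizing the headers over several frames helps, here is a back-of-the-envelope model: the TCP/IP header cost is paid once per segment, while each extra frame only adds a small 6LoWPAN fragmentation header. The header and fragment-header sizes below are rough assumptions for illustration, not the paper's exact values.

```python
def header_efficiency(frames_per_segment: int,
                      frame_payload: int = 104,
                      tcpip_header: int = 54,
                      frag_header: int = 5) -> float:
    """Fraction of transmitted bytes that is application data when one
    TCP segment spans `frames_per_segment` link-layer frames."""
    total_bytes = frames_per_segment * frame_payload
    overhead = tcpip_header + frames_per_segment * frag_header
    return (total_bytes - overhead) / total_bytes

# One frame per segment: headers eat over half the frame.
# Five frames per segment: efficiency improves substantially.
one = header_efficiency(1)
five = header_efficiency(5)
```

With these illustrative numbers, a one-frame segment carries under 45% useful data, while a five-frame segment carries roughly 85% — which matches the qualitative point that the gains level off after a few frames while the loss-amplification risk keeps growing.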
I
Okay, and this graph was an experiment where we measured the maximum segment size and the goodput that results, and we found that the gains essentially level off around three to five frames. So that's what we used for our later experiments, and it shows that there's a good trade-off to be made here, where we can get good goodput despite the header sizes. Now, one thing that we didn't do, but that could potentially help...
I
Okay, now I'll talk about how the link-layer scheduling to support a low duty cycle interacts poorly with TCP. So recall that these devices often don't have enough energy to keep their radios on and listening all the time. We define the duty cycle as the proportion of time that the radio is listening or transmitting — basically, the percent of time the radio is not in a low-power sleep state. And in order to get good energy consumption, we want the duty cycle to be as close to zero as possible.
I
Now, there are several ways to support this in the sensor network literature. OpenThread uses a particular duty-cycling mechanism called a receiver-initiated duty-cycle protocol, which I'll now explain. So in OpenThread, you have two kinds of nodes: battery-powered nodes, where we want to minimize the duty cycle, and wall-powered nodes, which are plugged into a wall outlet and have enough power to keep their radios always on. Now, sending a frame from B to W is easy, because W's radio is always on.
I
So we can just send the frame whenever we like. More challenging is the reverse: getting a frame from W to B. What has to happen is that W has to wait until B's radio is listening — and how does it know when B's radio is listening? Well, this is where the protocol comes in. What B does is that whenever it turns on its radio to listen for a frame, it sends a data request packet to W, informing it that it's now listening.
I
So W has to wait until it gets this data request packet, and once it does, it can go ahead and send the frame to B, and B will listen and receive the frame. Okay, so what's the key point here? The key point I want to emphasize is that B's idle duty cycle is directly related to how frequently it sends data request frames.
I
B can choose to send data request frames very rarely, which allows it to get very good energy consumption, but doing so will cause more of a delay in getting frames to it, since W has to wait for a data request frame in order to send it one of the data frames. Okay, so now let me talk about what this means for TCP operation.
I
The key difference between HTTP and CoAP here is that HTTP requires two round trips, whereas CoAP only requires one. So for the first round trip, you start at a random phase within the thousand-millisecond sleep interval, so you'd expect, on average, a 500-millisecond delay, and CoAP is consistent with that. For HTTP, what happens is that for the first round trip we see 500 milliseconds, but the second round trip starts right at the beginning of the next sleep interval.
I
So the second round trip consistently sees the worst-case latency when transmitting the packet from W to B, and, as a result, HTTP performs more than twice as poorly as CoAP on this workload. Now, I want to point out that there have been some recent extensions to TCP — for example, TCP Fast Open — which you can use to eliminate the second round trip and get performance parity between CoAP and HTTP.
I
But this problem also happens for bulk transfers, where the ACK-clocked nature of TCP causes it to consistently experience the worst-case latency. So this is an important problem to solve regardless, and our approach to solving it is to use an adaptive duty cycle. The idea is that we can use the TCP and HTTP protocol state to vary how often we send data request frames — the idea being that when we expect a packet, we want to send data request frames more frequently.
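A minimal sketch of that adaptive polling idea might look like this. The protocol-state flags and interval values are illustrative assumptions of mine, not the actual OpenThread or TCPlp logic.

```python
def poll_interval_ms(connection_just_accepted: bool,
                     unacked_data_outstanding: bool,
                     fast_ms: int = 100,
                     slow_ms: int = 1000) -> int:
    """Choose how often a battery-powered node sends data-request frames.

    Poll fast while protocol state suggests a packet is imminent (e.g. a
    TCP connection was just accepted, so an HTTP request should follow,
    or unacked data means an ACK is coming); otherwise poll slowly to
    keep the idle duty cycle low."""
    if connection_just_accepted or unacked_data_outstanding:
        return fast_ms
    return slow_ms
```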
I
So, for example, if I'm an HTTP server on one of these battery-powered devices and I've just accepted a TCP connection, I can be pretty sure that I'm soon going to receive an HTTP request on that connection, so I may choose to send data request frames more frequently at that point in time. Doing this nearly entirely eliminates the gap between CoAP and HTTP in terms of performance.
I
So if we zoom out and look at the overall network, this adaptive duty-cycle technique works well for the last hop, going from a wall-powered node to a battery-powered node. But the overall network has to operate over multiple wireless hops just to get to that last hop, and what we observed is that TCP performs poorly over this chain of wall-powered nodes, due to hidden terminals.
I
So let me step back and go over hidden terminals, to provide some background for those who aren't familiar with them. We can understand the wireless range of a node as looking something like this. The unit disk model is a simplification where we consider this to be a perfect circle; in practice, of course, it can be more complex, depending on the exact environment your deployment is in, but the unit disk model is going to be enough for us to capture the phenomenon of interest here,
I
so we'll go with that. So imagine you have four nodes in a line, with their transmission ranges shown here, and we want to transmit data from A to D.
I
Okay, now, in the context of Wi-Fi, we typically overcome this using a protocol based on RTS and CTS frames, which allows us to mitigate the hidden terminal problem in most cases. But in the context of LoWPANs, the small MTU means that RTS/CTS typically has too high an overhead; as a result, most uses of 802.15.4 don't use RTS and CTS packets.
I
So, as a result, we're relying only on CSMA. So CSMA at A can't detect C's transmission, because A is out of range of C, and CSMA at C can't detect A's transmission, because C is out of range of A — but both of the packets end up interfering at B, and the packet gets lost.
I
This also happens because of data packets and ACKs going in opposite directions. So, for example, here what we'll ultimately see is that you get the same problem with B and D both sending at the same time to C, because each of their CSMAs can't hear the other. So, to mitigate this, our approach is to add a new random backoff delay between link-layer retries. Okay.
I
So the idea is: if you transmit a frame and it fails — which you know because you don't get a link-layer acknowledgment for it — then you wait a random amount of time and retry the transmission. This is different from CSMA in two respects. The first is that in CSMA, you do this randomized delay, with exponential backoff, if the channel appears busy; in this case, even if the channel appears clear, if your transmission fails, we still do the backoff.
I
So it's different in terms of what triggers it. And second, it's a much longer delay — because in CSMA you can rely on hearing a concurrent transmission, you can transmit immediately if the channel appears clear. For this new delay that we're adding, this link retry delay, what we're saying is that we want a delay chosen between 0 and 10 times the time to transmit a frame, the idea being that even if there are two concurrent transmissions that can't hear each other, with high probability they won't overlap in time.
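A quick Monte Carlo sketch shows why a random retry delay of up to roughly ten frame times makes overlap unlikely. This is my own toy model — two hidden senders drawing independent uniform delays — not the paper's analysis.

```python
import random

def collision_rate(max_delay_frames: int, frame_time: float = 1.0,
                   trials: int = 10_000, seed: int = 1) -> float:
    """Fraction of retries where two hidden senders still overlap in time.

    Each sender retries after a uniform random delay in
    [0, max_delay_frames * frame_time]; the transmissions overlap (and so
    collide at the shared receiver) when their start times are closer
    than one frame time apart."""
    rng = random.Random(seed)
    hi = max_delay_frames * frame_time
    collisions = 0
    for _ in range(trials):
        a = rng.uniform(0, hi) if hi > 0 else 0.0
        b = rng.uniform(0, hi) if hi > 0 else 0.0
        if abs(a - b) < frame_time:
            collisions += 1
    return collisions / trials

no_backoff = collision_rate(0)    # always collide, as in plain retry
with_backoff = collision_rate(10)  # ~19% in theory: 1 - (9/10)**2
```

With no backoff, the retries always collide again; with a 0-to-10-frame-time delay, the analytic overlap probability is 1 - (9/10)^2 ≈ 0.19, consistent with the talk's intuition that the two retries will usually miss each other.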
I
Okay, so the way this would work is that each of these two nodes would send its data once; on the first attempt they will collide, but then when they retry the transmission a second time, hopefully at different offsets, they won't overlap in time, and the transmission will succeed.
E
I
So we did a measurement study to understand what kind of link delays would be appropriate and would work. What we observe is that there's a huge reduction in packet loss even from a small delay, and as we increase the delay too much, it starts to eat into your goodput, because now you're waiting a long time when transmitting your packets. So we found that there's a sweet spot here at around 40 milliseconds, which is about 10 times the time to transmit a single frame in IEEE 802.15.4.
I
So finally, I'm going to summarize our evaluation and conclusions. First — I previewed this result at the beginning — we're able to achieve significantly higher goodput than prior attempts at using TCP, and we're very close to a reasonable upper bound that we computed based on measurements of how fast the radio can send out packets and the overhead lost to headers and ACKs.
I
We also did a measurement study of the energy efficiency. We used TCP and CoAP for a sense-and-send task and measured the radio duty cycle over a 24-hour period, and you can see the radio duty cycle here. The key point is that TCP is not significantly worse than CoAP; in fact, they perform comparably for the duration of the experiment, at about a two percent duty cycle, and we consider this a success, because TCP is able to perform essentially on par with a protocol over UDP developed specifically for LoWPANs.
I
So now that TCP is a viable option, what does this mean? Well, first, we should reconsider the use of lightweight protocols that emulate part of TCP's functionality — in the sense that if you have a specialized protocol that performs just as well as a general protocol that's more interoperable and more broadly used, we should perhaps prefer the one that's used more broadly and is more interoperable.
I
Second, we think that TCP may influence the design of LoWPAN network systems, in the sense that, for a long time, it's been the case that many smart home devices you buy on the market require a specialized gateway to get internet connectivity, and TCP gives us the opportunity to allow these devices to connect end-to-end to any external services they may depend on. And finally, I just want to mention —
I
So, just to talk a little more about the middle point, about how TCP may influence the design of LoWPAN network systems: when I say "gateway architecture," I mean a setup like this, where you have your devices — these smart home devices you bought on the market — and, in order to allow them to communicate with an application server in a data center somewhere, you have to install some specific gateway in your home that does some protocol translation and application logic in order to bring connectivity to those devices.
I
What this means — and some of you may have experienced this — is that if you go buy smart devices from a new vendor, now all of a sudden you need another gateway for those new devices, or maybe even for newer versions of devices from the same vendor. For example, for a long time it was the case that if you had bulbs from, say, LIFX and bulbs from Philips, you would need separate gateways for both of those devices.
I
So the introduction of IP in this space didn't really change this, in the sense that now your application protocol on the left is implemented over IP, but you still need the application-layer gateway. The missing piece, I think, that would allow an end-to-end connection here would be to have a transport protocol that's supported on both sides — namely TCP — and once you do this, your application-layer gateways become regular border routers, and you could potentially consolidate them together into a single border router.
I
So, in conclusion: we implemented TCPlp, a full-scale TCP stack for LoWPAN devices; we explained why the expected reasons for poor TCP performance don't apply; we showed how to address the actual reasons for poor TCP performance; and we showed that, once the issues are resolved, TCP can perform comparably to LoWPAN-specialized protocols. That's all I have prepared — I'm happy to take any questions now.
A
Excellent talk. I see we have a couple of people in the online queue and a couple of people at the microphone. I guess we'll do the microphone first, so I can see who that is. If you can go ahead and say your name and your question.
F
Hi, I'm Matthias, one of the co-founders of RIOT. Great work — thanks a lot. One remark and two questions; a question first. So, you argued that supporting TCP is important because it's popular; now QUIC is becoming popular. Did you work on any comparison from a systems point of view?
I
So we didn't do a comparison against QUIC, but I'd like to comment on that, because that's a good point — other transports are becoming popular. Many of the issues that we addressed aren't specific to TCP; they apply broadly to TCP and other protocols used for bulk transfer. For example, the main issues — getting it to work with hidden terminals, getting it to play well with link-layer scheduling, and so on — apply broadly to any protocol that's transmitting a lot of data and wants a significant amount of bandwidth.
F
And another question: in your paper, you note that you also have an implementation for GNRC, the default network stack in RIOT. Do you also plan to submit the PR to the upstream implementation?
I
At some point we did have plans for that, but what happened is that RIOT OS had already adopted a different TCP stack, and it seemed a bit redundant to contribute a second one. More recently, what we've done is contribute our code to OpenThread, which now uses it as its default TCP stack. Okay.
F
I'd still highly encourage you to submit the PR. And a final remark: you said that a packet is lost when a fragment is lost. This depends a little bit on the fragmentation scheme, right? If you consider, for example, selective fragment recovery, it doesn't matter too much for the whole packet whether a fragment is lost or not.
I
So my understanding of the basic 6LoWPAN work, at least the way it was implemented in the operating systems we looked at, was indeed that if a fragment is lost, you lose the whole packet. But I do agree that there are protocols you can use to recover a lost fragment without losing the entire packet, and those could also help with the problem, allowing you to make the packet bigger and amortize the TCP/IP headers even better.
K
Hello, Tommy Pauly from Apple. Thank you for doing this talk, very interesting. I'm super happy to see the use of TCP here. I just had a couple of questions from the presentation, way earlier, and you don't have to go back. When you were talking about the memory-saving aspects and the ability to have the flat buffer, you had the diagram there of, essentially, here's kind of what's in flight, and then there are the out-of-order bits, and there are gaps in there as well.
K
When you're doing this, are you able to essentially guarantee 100% of the time that you'll never need to allocate memory? Or is it just most of the time, and then there would be a failover case where you do need to have dynamic allocation?
I
That's a great question: we ensure that you never have to dynamically allocate memory. The way we do it is that you store the data there, and you have a bitmap to keep track of which bytes contain the out-of-order data. But the bitmap can also be sized statically, because it depends only on the array size, which is also static.
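The statically sized out-of-order bitmap described here can be sketched as follows. This is an illustrative reconstruction, not TCPlp's actual code; the buffer size and the function names are assumptions for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Receive buffer sized at compile time: no dynamic allocation, ever.
 * RBUF_SIZE is an assumed value, not TCPlp's actual window size. */
#define RBUF_SIZE 1456u

static uint8_t rbuf[RBUF_SIZE];               /* flat data array          */
static uint8_t rbuf_map[(RBUF_SIZE + 7) / 8]; /* 1 bit per buffered byte  */

/* Record one out-of-order byte at the given offset into the window. */
static void rbuf_put(size_t off, uint8_t byte) {
    if (off >= RBUF_SIZE) return;             /* outside the window: drop */
    rbuf[off] = byte;
    rbuf_map[off / 8] |= (uint8_t)(1u << (off % 8));
}

/* Has the byte at this offset already arrived? */
static bool rbuf_has(size_t off) {
    return off < RBUF_SIZE &&
           (rbuf_map[off / 8] & (1u << (off % 8))) != 0;
}

/* How many bytes starting at the window head are now contiguous
 * (i.e. deliverable to the application)? */
static size_t rbuf_contiguous(void) {
    size_t n = 0;
    while (n < RBUF_SIZE && rbuf_has(n)) n++;
    return n;
}
```

Because the bitmap's size is a pure function of the (static) array size, both live in `.bss` and neither ever requires a heap.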
K
Got it, okay, cool. And then the other question is more about what you're ending with, talking about how you can use this to get to internet hosts end to end, and I believe in your tests you were testing against end-to-end internet connections. Is there something that needs tuning on the internet hosts to make sure that they are friendly to the 6LoWPAN devices? Or can you use completely unmodified internet hosts to talk to?
I
An excellent question, and the short answer is that the hosts on the Linux side were completely unmodified. To say a little bit more about that: the timing that we adjusted, like the randomized delay, was not at the TCP level, it was at the link layer, so as a result, the other side doesn't see any of that. That's also one of the advantages of us using a full-scale TCP stack like the one from FreeBSD, because it's been battle-tested in the real world and it's interoperable with all the major TCP stacks that are out there. And I just want to say that interoperability is actually a problem in the embedded space.
L
So hello, this is Thomas, also from the RIOT community. Thanks again for this work, and thanks for using RIOT. Another encouragement to use GNRC: you have a generic packet buffer there which you could reuse, and that would reduce your memory overhead even further. Just a remark. One question about your multi-hop experiments: you showed us nicely how, by jittering the TCP forwarding, you could avoid the hidden-terminal problem. Was that in a clean environment without cross traffic, with only a single TCP connection?
I
Yeah, so the hidden-terminal problem affects even a single TCP connection in isolation, and we verified that our randomized backoff fixes the problem in that case.
I
And if you have background traffic, this is also why we used randomized delays instead of fixed delays: with a randomized backoff, it doesn't matter whether the interference is coming from the same stream or a different stream. In both cases, you'll back off a random amount and hopefully transmit again without colliding. This is also why we did it without looking at TCP state. There are several protocols you could use that look at TCP state in some way, and having it just be a randomized delay at the link layer gives us some confidence that it would work across TCP streams and regardless of the source of the traffic, whether it's different TCP streams or even something else.
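The randomized, TCP-agnostic link-layer backoff described here can be sketched as follows. The delay window is an assumption for illustration, not the values used in the TCPlp study.

```c
#include <stdint.h>
#include <stdlib.h>

/* Bounds for the randomized delay before a link-layer (re)transmission.
 * These constants are assumptions, not the study's actual parameters. */
#define RETX_DELAY_MIN_MS 10u
#define RETX_DELAY_MAX_MS 50u

/* Pick a uniformly random delay in [MIN, MAX]. Because the wait is
 * random rather than fixed, two hidden terminals that just collided
 * are unlikely to collide again on the retry, and the mechanism works
 * the same whether the competing traffic is the same TCP stream,
 * another stream, or something else entirely: no TCP state consulted. */
static uint32_t retx_delay_ms(void) {
    uint32_t span = RETX_DELAY_MAX_MS - RETX_DELAY_MIN_MS + 1u;
    return RETX_DELAY_MIN_MS + (uint32_t)(rand() % span);
}
```

A forwarding node would sleep for `retx_delay_ms()` milliseconds before relaying or retransmitting a frame, which is the "jittering" of TCP forwarding mentioned above.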
L
In this context, did you also consider experimenting with more flexible MAC layers than just CSMA/CA, for instance the DSME MAC layer, which is also supported by RIOT?
I
No, we didn't experiment with that. We looked at CSMA/CA because that was the most common one, supported across all the operating systems and networking protocols that we tried: TinyOS, RIOT, and OpenThread. So it was the most natural to focus on.
J
Am I unmuted? Finally. So this is following up on the multi-hop case. In these environments, the forwarding devices are in fact also very low-power, low-resource devices.
I
So that's a great question. First, I just want to clarify that the buffers used at the intermediate routers aren't TCP-layer buffers; they're just the general packet buffers used for forwarding, because with an end-to-end TCP connection, there's no TCP state there.
J
Sure, but it may put a different aggregate load on those buffers than, say, CoAP traffic or something that's more of a simple request-response pattern.
I
Yeah, of course that's the case: when you're transmitting at higher bandwidth, you're going to place more stress on the buffers of the intermediate routers, and there are a couple of things that we actually did in our study in order to help mitigate that.
I
The first one is that we added some active queue management functionality to those intermediate routers, where you mark packets as congested using Explicit Congestion Notification and so on, in order to prevent TCP from filling up the entire buffer and to keep your queues short. The primary reason we did this was to improve fairness between different TCP flows that are competing for buffer space at these intermediate routers, and also to reduce the latency of traffic.
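The ECN-based marking policy described here can be sketched as a simple threshold check on enqueue. This is a minimal illustration, not the study's actual AQM; the threshold value is an assumption. The ECN codepoints themselves (the low two bits of the IP traffic class, with `11` meaning Congestion Experienced) are as defined in RFC 3168.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mark once the forwarding queue holds more than this many packets.
 * The threshold is an assumed value for the sketch. */
#define ECN_MARK_THRESHOLD 4u

/* ECN field (low two bits of the IPv6 traffic class), per RFC 3168:
 * 00 = not ECN-capable, 01/10 = ECN-capable, 11 = Congestion Experienced. */
#define ECN_CE 0x03u

/* On enqueue at an intermediate router: if the packet's sender declared
 * an ECN-capable transport and the queue is already long, set CE instead
 * of waiting for the buffer to overflow. The TCP sender then backs off,
 * keeping queues short and improving fairness between competing flows. */
static unsigned aqm_on_enqueue(unsigned traffic_class, size_t queue_len) {
    bool ecn_capable = (traffic_class & 0x03u) != 0;
    if (ecn_capable && queue_len > ECN_MARK_THRESHOLD) {
        return (traffic_class & ~0x03u) | ECN_CE;
    }
    return traffic_class;  /* short queue or non-ECN packet: unchanged */
}
```

Marking rather than dropping is what lets a single bulk TCP flow coexist with request-response traffic in the same tiny forwarding buffers.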
J
Thanks, I was looking for the AQM angle on that.
A
All right, so I have a question. I very much like the idea of having a packet span multiple link-layer frames. Does this put any constraints on the link layer, or does the 6LoWPAN handler handle all of that?
I
That's a great question. Some of these can potentially be handled at the 6LoWPAN layer, but others do indeed have to do with the link layer directly. For example, the randomized delay that we added to avoid hidden terminals is something that would operate at the link layer, because at the 6LoWPAN layer you don't have, or at least you don't naturally have, the same kind of visibility into when your link-layer ACKs are coming in and so on, whereas you would need that to determine that a transmission failed and how much to back off on the retransmission and so on. So some of them do indeed affect the link layer.
A
Yeah
other
requirements
that
the
link
layer
delivers
packet
in
order
to
avoid
damaging
the
headers
or
os6
weapon
handling.
I
In fact, one of the things that I skipped because of the time limit was another set of techniques we have at the level of managing how to deal with concurrent frames: basically, how to schedule frames when some of them are going to wall-powered devices and some to battery-powered devices. In effect, what we do is, if you receive a data request from a battery-powered device, then you prioritize sending frames to it in order to reduce its duty cycle and let it go to sleep as fast as possible. That's one case where we specifically might interrupt another transmission and not send its frames.
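The scheduling policy described here, bump a sleepy child's pending frames to the front of the queue when it polls, can be sketched as follows. This is an illustrative sketch only; the structure fields and function names are assumptions, not the actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* One pending outgoing frame in a border router's transmit queue. */
struct pending_frame {
    bool for_sleepy_child;  /* destined for a battery-powered device?     */
    bool urgent;            /* that child just polled with a data request */
};

/* When a data request arrives from a battery-powered child, flag all of
 * its queued frames as urgent. */
static void on_data_request(struct pending_frame *q, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (q[i].for_sleepy_child) q[i].urgent = true;
    }
}

/* Scheduler: serve an urgent frame first, even if that interrupts an
 * ongoing bulk TCP transfer, so the polling child gets its data and can
 * return to sleep (minimizing its radio duty cycle) as fast as possible. */
static size_t next_frame(const struct pending_frame *q, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (q[i].urgent) return i;
    }
    return 0;  /* no urgent frames: plain FIFO, take the queue head */
}
```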
A
Oh
okay,
yeah,
that
makes
sense
Gabrielle.
A
D
Okay, yeah. Thank you very much for this work; this is great stuff. I did have a comment on the comparison with CoAP. I think the justification for CoAP was not entirely a "we can't use TCP" type of thing; it was more "we can't use HTTP". The justification for it was for folks who wanted to use a RESTful interface at the application layer. Not every application in IoT wishes to do that, but there's certainly a lot of incentive to use REST.
D
So
when,
when
the
restful
folks
started
to
become
interested
in
iot,
the
only
alternative
was
http11,
which
I
completely
agree
is
terrible.
It's
my
Sardar.
It's
textual
based
protocol.
You
cannot
compress
it
it's.
It's
very
verbose,
Etc!
It's
it's
terrible!
D
We
subsequently
had
HTTP
2,
which
became
a
binary
protocol,
and
we
actually
had
a
paper
three
years
ago
in
an
RW
about
you
know
how
to
use
that
over
something
like
six
low
pen,
for
example,
some
just
initial
scratching
the
surface.
But
now
we
have
HTTP
3
and
quick,
and
it's
all
binary
so
and
I
I
understand
you.
You
guys
haven't
had
a
chance
to
go
after
the
excellent
work.
D
You've
done
to
look
at
those
layers,
but
I
would
highly
encourage
you
to
do
that,
because
that
would
that
would
address
a
significant
portion
of
the
of
the
application
layer.
Incentives
to
for
iot
as
well.
I
Yeah,
so
so,
thanks
for
for
clarifying
that
I
do
acknowledge
that
Co-op
has
has
evolved
quite
a
bit
in
a
few
years.
Some
of
those
Evolutions
happen
after
after
we
published
this
work,
but
I
do
want
to
to
clarify
my
position
on
Co-op
a
little
bit
based
on
what
you
said.
It's
that
indeed
I
think
that
Co-op
is
useful
and
it
has
its
uses
and
it's
very
flexible.
I
It's
been
evolving
a
lot
over
the
years
and
that's
great
I
do
I
mean
I
have
noticed
that
Co-op
has
been
evolving
in
some
sense,
more
and
more
towards
the
same
kind
of
abstraction.
That
TCP
provides
right.
In
some
sense,
the
ability
like,
for
example,
with
some
of
the
recent
work
on
on
streaming
on
streaming,
block,
transfers
and
so
on.
I
All
I'm
saying
here
is
that
I
think
that
an
application
that's
built
on
co-app
in
these
kinds
of
networks
with
all
the
latest
features
like,
for
example,
the
ability
to
have
multiple
blocks
in
Flight,
concurrently
and
so
on
would
also
be
wise
to
potentially
consider
using
TCP
directly
itself,
given
that
TCP
is
also
a
viable
option
in
these
Networks.
M
Hi, thank you so much for the presentation; I really appreciate it. I was just wondering: you talked mainly about the applications of this in LANs. Do you see any application to longer-range networks, like mobile ad hoc networks or anything of that sort?
I
I'm not sure I'll be able to tell you any specifics, given that I don't have much experience with those networks, but my first gut feeling would be that there's probably a way to make TCP work well, given that it's been adapted to work in so many different kinds of networks, in all kinds of different environments. But other than that, I'm not sure which of the specific techniques would directly carry over there.
A
All
right,
thank
you
very
much.
A
And
thank
you
again
to
to
both
of
the
speakers.
I
think
that
there
were
two
two
really
great
talks
there.
Both
Sam
and
Tasha
will
be
around
all
week.
I'm
sure
they'll
be
very
happy
to
to
talk
with
people
more
about
their
work.
So
please
do
do
you
find
them
have
a
chat,
chat
about
their
work,
make
them
welcome
to
the
the
ietf
and
to
the
irtf.
A
Congratulations
both
to
Simon
satosha,
for
the
award
of
the
anrp.
This
time,
as
I
said
earlier,
looking
up
for
more
anrp
award
talks
at
the
the
itf15
in
London
in
November,
the
nominations
for
the
2023
nip
Awards
will
be
opening
in
September.
So
if
you
know
any
good
work,
please
think
about
nominating
that
work
and
look
out
for
the
applied
networking
research
Workshop,
which
is
taking
place
co-locating
with
the
ITF
in
Philadelphia
tomorrow.
A
Thank
you
again.
Everybody
hopefully
I
will
see
some
oral
of
you
in
London
or
at.