Description
The Internet Research Task Force (IRTF) Open session, including Applied Networking Research Prize (ANRP) presentations, will be held during IETF 114 at 19:00-21:00 UTC on 25 July 2022.
A: Okay, I could have a few minutes now, so you should probably get started. It's, I don't know; so, you may want to come up and get ready at the microphone.

A: Okay, I will assume that you can hear me in the room. If that's not true, please say so in the chat.

A: So, welcome everybody. This is the IRTF open meeting at IETF 114. I'm Colin Perkins, the IRTF chair. Hopefully you can all see and hear me in the room; I'm remote.

A: So, I would like to begin with the usual note. Well, slides.

A: There's a bit of echo, but hopefully it's not too bad. So, this is the IRTF open meeting. The IRTF follows the IETF's intellectual property rights disclosure rules, and a reminder that, by participating in this meeting and by commenting on the presentations, you agree to follow the IRTF processes and procedures, including disclosing any intellectual property relating to the contributions that you make.

A: I'm sure most of you have seen these slides before; the details are in the documents linked. But essentially, if you have IPR on the documents you're talking about, you need to disclose that if you're commenting at a microphone.

A: In addition, a reminder that the IRTF routinely makes recordings of these meetings available, both the online and the in-person meetings, including this one, and this meeting is being streamed live on YouTube as well as via the usual Meetecho system.

A: If you're participating in person and you are not wearing one of the red "do not photograph" lanyards, then you consent to appear in these recordings. And if you speak at the microphones, then again you're consenting to being recorded, and, as I say, the recording is being made available on YouTube.

A: Equally, if you're participating online and you turn on your camera or your microphone and make a contribution, then that is being recorded and you consent to being recorded. The chat is also being recorded and will be made available in the usual Jabber archives.
A: As a participant in the IRTF, as I say, you acknowledge that recordings of the meeting may be made available, and that any personal information you provide will be handled in accordance with the privacy policy, and you also agree to work respectfully with the other participants in the IETF and the IRTF. If you have any issues or concerns about that, speak to me or speak with the ombudsteam; the IETF's code of conduct and anti-harassment procedures linked on the slide also apply to the IRTF.

A: For those of you participating in person, please sign in using the mobile Meetecho, the Meetecho Lite tool. We're running the queue electronically, so if you have questions, we're using the electronic queue that's accessed via the Meetecho tool, and keep the audio and video off if you're using the on-site version, the Meetecho Lite tool. Remote participants, please leave your audio and video off unless you're presenting or asking a question, just to avoid feedback.

A: Also a reminder, for those of you who are attending the meeting in person, that as a COVID safety measure the IETF is requiring those of you attending the meeting in person to wear an FFP2 or N95 mask, or its equivalent. The only exception to that is the chairs and the presenters who are actively speaking. In particular, participants who are making comments or asking questions from the floor microphones are expected to wear a mask at all times, including while they're asking those questions.
A: Okay, so, as I say, this is the IRTF open meeting. The goals of the IRTF are to complement the standards work being done in the IETF by focusing on some of the longer-term research issues. The IRTF is very much a research organization; it's not a standards development organization, and while it can publish RFCs, and we do publish both Experimental and Informational documents in the RFC series, the primary output of the IRTF is research, is understanding, is research papers.

A: The IRTF is organized as a series of research groups; hopefully you can see them on the slide here. The Crypto Forum group and the Privacy Enhancements and Assessments group met earlier today. The other groups, highlighted in dark blue on the slide, are meeting later in the week, so please do look out for those groups this week and try to attend the sessions if you're interested in those topics.

A: A little bit of research group news: I'd like to welcome Kurtis Heimerl, who has recently joined as co-chair of the GAIA group, the Global Access to the Internet for All research group. Kurtis will be joining Leandro Navarro, who is planning on stepping down from chairing that group after this meeting, and Jane Coffin, who is continuing. So I'd like to welcome Kurtis.

A: And I'd like to thank Leandro for his many years of service to the group; I very much appreciate the efforts Leandro has put into chairing the group, and I look forward to working with Kurtis going forward.

A: As I say, the IRTF is primarily a research organization; we tend not to publish many RFCs. We've had one RFC published since the last meeting, from the Information-Centric Networking group, looking at architectural considerations for using an ICN name resolution service. But primarily the IRTF tends not to publish much in the RFC series, and the output is more in the form of interesting presentations.
A: To support that, we run the Applied Networking Research Prize. The goal of this prize is to recognize some of the best recent results in applied networking research, to recognize some interesting new ideas which are potentially relevant to the Internet standards community going forward, and to recognize up-and-coming people who are likely to have an impact on the Internet standards process and Internet technologies. We're very grateful to the Internet Society, to Comcast and NBCUniversal, for their sponsorship of the ANRP, which allows us to make these awards and bring different people in to give these talks.

A: What we're doing today, the goal of this session, is to make some of these awards. So I'd like to congratulate Tushar Swamy and Sam Kumar, who will be giving their award talks.

A: In this session today, Tushar will be talking first, in a couple of minutes, about his work on data plane architectures for machine learning inference at line rate, and Sam will be following later in the session, talking about TCP for low-power networks.

A: Going forward, look out for more award talks: Gautam Akiwate, Corinne Cath, and Daniel Wagner will be giving their talks at IETF 115, and the nominations for the 2023 awards will be opening in September 2022. So look out for those; look out for the nominations opening online.

A: Okay, hopefully that's better. As I was saying, look out for the nominations for the 2023 ANRP in September this year, and congratulations to Tushar and to Sam, who will be giving their ANRP talks today.
A: In addition to the Applied Networking Research Prize, we also host the Applied Networking Research Workshop, which we organize in conjunction with ACM SIGCOMM.

A: This workshop is taking place tomorrow; it's co-located with the IETF in Philadelphia. Thank you to Tijay Chung and Marwan Fayed, the chairs this year, who have been organizing that workshop. We've got a program of, I think, four really nice research papers, a keynote, and some invited talks on novel approaches to protocol specification.

A: As I say, the workshop's happening tomorrow. If you're there in person, then please do consider attending; if you're attending remotely, then you can register and attend. Registration is free for anyone who's also registered for the IETF, although we do ask you to register separately so we know who's attending the workshop. The ANRW next year will again be co-located with the IETF, in July 2023, which is planned to be in San Francisco.

A: And to finish up before we get to the talks, I'd just like to note that we are very pleased to offer a number of travel grants for these meetings, both to support early-career academics and PhD students from underrepresented groups to attend the IRTF research groups, and a number of travel grants for the Applied Networking Research Workshop.
A: Thank you very much to the travel grant sponsors, to Akamai, Comcast, Cloudflare, and Netflix, for supporting that. Please see the travel grants page linked from the website for details, and if you're interested in sponsoring the travel grants in future, or if you're interested in applying for a travel grant, see that web page or contact me for details of those sponsorship opportunities. Again, thank you very much to the sponsors.

A: So that's essentially all I have to say today. The agenda for the remainder of the session: we have the two ANRP award talks. Tushar Swamy will be first, talking about Taurus, a data plane architecture for per-packet machine learning, and that will be followed by Sam Kumar's talk on performant TCP for low-power wireless networks.
A: Okay, I will at this point switch over to Tushar. Can you check the microphone while I get the slides up?

A: Okay, so I should have control over that. While Tushar is checking to see if that works, I'd just like to say that, as I said, the first talk today: he'll be talking about Taurus, a data plane architecture for per-packet ML. Tushar is a PhD candidate in the Electrical Engineering department at Stanford.

A: His research focuses on the intersection of machine learning, networking, and architecture, and he works on the hardware and software stack for data-plane-based machine learning infrastructure and applications. Tushar is due to graduate this year and, I understand, he's on the job market. So if you like this work, please do talk to him; he'll be around at the IETF all week. And if you find this talk interesting, I believe he's also going to be presenting in the COINRG session later this week.
C: Awesome, thanks Colin. Cool, so I'm going to be talking about Taurus, which is a project that my colleagues and I have been working on. Taurus is essentially a data plane architecture for per-packet machine learning, and I'll go into a little bit of what that means.

C: This here is a quote from a 2015 Google blog, and at that time Google was already dealing with over one petabit per second of total bisection bandwidth, and it's only grown larger and harder to scale since. So what we're essentially dealing with here is a situation where networks require more and more complex management at higher and higher performance, so the time is ripe for finding new solutions here. One of the promising solutions in this area is machine learning. Machine learning can allow us to take in data from the network and make progressively better and better decisions as we train our models, and these machine learning algorithms can approximate network functions based on the data they see. They're also going to customize their operation to the data they're training on, which in turn means that these machine learning algorithms are actually customizing their models to the network itself. And we're sort of doing elements of this already with handwritten heuristics in the network.

C: Something like an active queue management algorithm, or hashing and load balancing, comes with operator-tuned parameters. So all machine learning is doing here is taking the next step by automating the search for the kinds of parameters that allow these functions to work well within your network.

C: So, if we're okay with using machine learning, we now need to examine where exactly in the network it has to happen. I'm sure many of you are already familiar with software-defined networks: essentially, the control plane and the data plane are split, and the control plane is responsible for policy creation, essentially in the form of flow rules, which are installed into the data plane. The data plane is where you're going to find your switches, and they're doing packet forwarding via match-action.
C: On the left here I have a diagram of the same typical software-defined network, but on the right I have a software-defined network with the Taurus worldview. What we've actually done here is split the machine learning operation into training, which is going to happen in the control plane, and inference, which is going to happen in the data plane.

C: So in the control plane, policy creation is going to take the form of flow rules plus ML training, and when installing this information into the data plane, it's going to be sending flow rules as usual, but also the ML model weights. And in the data plane, we're going to be doing our typical match-action packet forwarding, but we're also going to be doing decision making with ML inference.
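To make the training/inference split concrete, here is a minimal, purely illustrative Python sketch of the worldview described above. All names (ControlPlane, DataPlane, install, process) are hypothetical and do not correspond to any real SDN controller or Taurus API, and the "training" step is a placeholder.

```python
# Illustrative sketch of the Taurus worldview: the control plane trains offline
# and installs model weights alongside flow rules; the data plane does
# match-action forwarding plus per-packet inference. Hypothetical names only.

class DataPlane:
    def install(self, flow_rules, weights):
        self.flow_rules, self.weights = flow_rules, weights

    def process(self, packet):
        features = [packet.get("len", 0), packet.get("port", 0)]    # parse / clean features
        score = sum(w * f for w, f in zip(self.weights, features))  # per-packet ML inference
        if score > 1.0:                                             # interpret the result
            return "drop"
        return self.flow_rules.get(packet.get("port"), "forward")   # ordinary match-action

class ControlPlane:
    def __init__(self, data_plane):
        self.data_plane = data_plane

    def update_policy(self, flow_rules, training_data):
        weights = [0.001, 0.01]   # stand-in for offline ML training on collected data
        self.data_plane.install(flow_rules, weights)

dp = DataPlane()
ControlPlane(dp).update_policy({80: "forward"}, training_data=[])
print(dp.process({"len": 1500, "port": 80}))   # "drop" for this toy model
```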
C: And that brings me to one of the core tenets of Taurus, which is essentially that ML inference should happen per packet in the data plane, and the intuition here is relatively straightforward.

C: You want to be able to do per-packet operation because that is the finest granularity of traffic, essentially operating on a packet scale. Now, not every application may need per-packet-level operation, but there are applications that need it, so the platform should be able to support per-packet operation. And the data plane is where the packets are, so if we're going to be making decisions on packets, it should happen in the data plane.

C: Oh, I think PowerPoint animations don't play well with the PDF; that's okay. What's basically happening here is some rough, off-the-cuff math. Say you have traffic at one gigapacket per second moving through your data plane, and think about the time it takes you to send a packet digest from the data plane up to the control plane, calculate flow rules, and then install them back into the data plane.

C: In this case we've assumed half a millisecond for each step, so we've missed 1.5 million packets in our traffic stream by the time we have flow rules installed into the data plane. In the example here we're doing anomaly detection, so we're trying to find out if incoming packets are malicious or benign, and maybe, if we find that one is malicious, we're going to install some rule to, say, block that IP. So we've missed 1.5 million packets during this flow-rule installation time, and by the time we block that IP, we've already let a ton of potentially malicious traffic into the network. The whole takeaway here is really just to show you why we can't let the operation for this level of application happen in the control plane: if we're committing to using machine learning, we can't have inference happen in the control plane.
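As a back-of-the-envelope check of the numbers quoted above (a minimal sketch assuming the talk's illustrative figures of one gigapacket per second and three half-millisecond steps):

```python
# Rough control-plane reaction-time arithmetic from the talk's example numbers.
packets_per_second = 1e9     # 1 gigapacket per second through the data plane
step_delay_s = 0.5e-3        # 0.5 ms assumed per step
steps = 3                    # send digest up, compute flow rules, install rules

reaction_time_s = steps * step_delay_s
missed_packets = packets_per_second * reaction_time_s
print(f"Packets missed while reacting: {missed_packets:,.0f}")   # 1,500,000
```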
C: So, fundamentally, the conclusion here is that the robustness and performance of your network are going to be determined by the quality of your reaction and the speed of your reaction. In the machine learning worldview, the quality of the reaction is going to be determined by your training data: how much do you have, what kinds of cases does it cover, how well is it cleaned. But also your speed of reaction: in the case of anomaly detection, you want to act on a malicious packet immediately.

C: So, zooming in on the control plane, let's talk a little bit about the actual implementation of how you do this. I mentioned before that we're going to split our machine learning into training in the control plane and inference in the data plane, and the key here is that training is off the critical path. If packet forwarding is happening in the data plane, then the control plane is not responsible for making per-packet-level decisions, which means that we can do our machine learning training there at leisure. Essentially, we can put in whatever the latest and greatest ML accelerators are, whatever your favorite ML framework is, install it in a control-plane server, and have it training models offline.

C: The trickier part comes in the next step, where now we need to deal with the actual critical path, basically tackling packets as they come. Machine learning inference here is going to happen in the data plane, like I mentioned, and the final outstanding question then is: if we're okay with doing training in the control plane, we can use whatever existing hardware we want, but what do we do about the data plane?

C: Do we have, say, a switch that can do inference at line rate, with per-packet operation? This is really the crux of Taurus, and that's what it needs to do. Taurus is an architecture for per-packet machine learning inference in the data plane.
C: So let's jump into the actual hardware and how we enable this kind of machine learning inference at line rate. I have a picture here of a PISA pipeline, a protocol-independent switch architecture. These are the typical programmable structures you'll find in these kinds of switches: some sort of programmable packet parser, match-action tables that allow you to encode your network functions, and then maybe a programmable traffic manager.

C: What does that look like, and, more specifically, what is the abstraction with which we're going to create our programmable machine learning fabric? In Taurus we use the map-reduce abstraction. MapReduce is really useful for machine learning because it supports a lot of the common linear algebra operations that you need for your ML algorithms. This covers everything from neural networks to support vector machines to k-means, all these different kinds of applications. And just as an example, I have here in the picture an example of a single neuron from a deep neural network.

C: You can see exactly how map and reduce are applied here. In this case, in the blue box, we are doing an element-wise multiplication: that's our map, with multiplication over inputs and weights. Then we're applying a reduction, which is going to essentially add all the values together, so you're going to produce a scalar value from your vector of inputs. And then, finally, we're going to apply an activation function. That suffices for a single neuron, but you can mix and match this pattern ad nauseam to create a full neural network.

C: By stacking extra copies of these blocks in parallel you'll be creating a layer of neurons, and then by stacking them sequentially you'll be creating multiple layers, and that's how you can create, say, a deep neural network.
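Here is a minimal sketch of the map-reduce view of a neuron described above, written in plain Python/NumPy purely for illustration; the actual Taurus design expresses this pattern in the Spatial HDL on reconfigurable hardware, not in software, and the ReLU activation here is just an example choice.

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """One neuron as map-reduce: map = element-wise multiply, reduce = sum to
    a scalar, then an activation function (ReLU chosen as an example)."""
    mapped = inputs * weights           # map: element-wise multiplication
    reduced = np.sum(mapped) + bias     # reduce: vector -> scalar
    return max(0.0, float(reduced))     # activation

def dense_layer(inputs, weight_matrix, biases):
    """A layer is the same pattern instantiated once per neuron; stacked in
    parallel in hardware, written as a simple loop here."""
    return np.array([neuron(inputs, w, b) for w, b in zip(weight_matrix, biases)])

# Example: a 4-neuron layer over a 3-element feature vector.
x = np.array([0.2, 0.5, 0.1])
W = np.random.rand(4, 3)
print(dense_layer(x, W, np.zeros(4)))
```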
C: The other advantage of the map-reduce pattern comes from the kind of performance that it enables, primarily from the SIMD parallelism: same instruction, multiple data. We can get a lot of performance out of the parallelism with minimal logic. This is as opposed to what you might find in, say, a typical Tofino pipeline, where they have VLIW pipelines, which give you much more flexibility.

C: But the cost there is that there's a lot of logic needed for the communication hardware, and that ends up taking up a lot of the overall on-chip area. In addition, SIMD parallelism gives us the ability to unroll the loops in our algorithms.

C: The idea of unrolling here: if we take the example of, say, a single layer of a neural network, and say you have four neurons in that layer, you can either execute them sequentially, doing one neuron after the other, or, if you have the resources, you can instantiate all four of them in parallel. The trade-off is that more unrolling is going to give you better performance, essentially doing all four of those neurons at once, while less unrolling gives a much higher latency, and we get this kind of control with the SIMD pattern essentially by adjusting unrolling factors.

C: So we went ahead and essentially augmented the switch pipeline with a MapReduce unit that implements the patterns I just described. We still have our typical programmable elements, a programmable packet parser, match-action tables, and a traffic manager, but you can see in the center we have this MapReduce unit, and that's essentially what's going to do our machine learning inference. There are a couple of little idiosyncrasies about the arrangement of the pipeline that I want to point out, and that's how we use these different elements in a machine learning context, even if they're typically network elements. A packet parser is normally for pulling out the headers from your packets and doing whatever you want with your match-action rules.
C
So
the
match
action
tables
before
mapreduce
can
be
doing
some
sort
of
cleaning
on
the
features
and
then
the
match
action
tables
on
the
output
on
the
right
side
of
the
mapreduce
unit
can
be
doing
some
sort
of
interpretation
of
the
results,
and
so
when
we
actually
went
to
design
this
mapreduce
unit,
there's
a
couple
of
things
that
came
up
it
turns
out.
You
can't
really
just
stick
an
accelerator
into
the
switch
pipeline,
so
what
we
did
was
we
kind
of
established?
What
were
the
the
points
that
we
wanted?
C
C
And
it
has
some
meat
line
rate
with
the
fixed
clock,
so
this
essentially
rules
out
an
fpga
makes
an
fpga
will
give
you
a
variable
clock.
We
want
it
to
be
deterministic
and,
of
course,
line
rate
is
our
performance
requirement
and
then
minimal
area
and
power
overhead.
We
don't
want
to
blow
up
the
entire
chip
area,
adding
in
like
a
a
map
reduced
block.
It
should
be
something
that
is
small,
but
gives
you
access
to
a
whole
class
of
applications.
C
And
so
finally,
the
one
little
thing
to
note
here-
that's
kind
of
interesting-
is
that
most
of
these
ml
accelerators
are
built
to
do
batch
processing
in
an
effort
to
get
high
throughput,
but
in
the
network
pipeline
you're,
actually,
processing
packets,
as
they're
coming,
which
means
that
you're
operating
on
a
batch
size
of
one
which
is
turns
out,
puts
a
lot
of
different
performance
demands
on
the
hardware
than
a
typical
accelerator
would
see.
C
C
So
you
can,
you
can
imagine,
say
a
packet
coming
into
the
switch
pipeline
and
we
want
to
see
essentially
whether
it's
malicious
or
benign,
so
the
packet
hits
the
first
stage
and
that's
where
we're
going
to
do
our
packet
parsing.
So
we're
going
to
read
local
features,
say
our
ip.
C
Whatever
information
we
can
extract
from
the
packet
itself,
the
packet
is
going
to
move
to
the
second
stage,
which
are
the
match
action
tables
and
from
there
maybe
we're
going
to
do
some
sort
of
retrieval
of
out-of-network
events,
so
these
would
be
different
kinds
of
elements
of
metadata
that
the
control
plane
may
have
installed
into
the
match
action
tables.
So
something
like
the
failed
logins
per
ip.
C
The
packet
then
moves
to
the
the
center
block
the
mapreduce
unit.
That's
where
we're
going
to
apply
our
learned
anomaly
detection.
So
you
can
imagine
this
is
maybe
a
binary
neural
network
and
it
gives
it
a
a
score
from
zero
to
one
on
how
anomalous
it
is
so
one
is
definitely
anomalous.
Zero
is
benign
and
finally,
the
match
action
or
the
the
packet
will
move
to
the
post-processing
match
action
tables,
and
that's
where
we
do
our
interpretation.
C: So in the paper we actually did a full ASIC analysis of this Taurus hardware; we wanted to show essentially that it has minimal overhead and that it's feasible to build something like this. We based our evaluation platform on a coarse-grained reconfigurable architecture called Plasticine, and we programmed our applications in the Spatial hardware description language. Spatial is just an HDL that lets you use these kinds of parallel patterns, like map and reduce, to program your reconfigurable architectures at the loop level. The basic architecture of the MapReduce unit here is really just a grid of compute and memory tiles, so it's easily scalable and very, very straightforward.

C: In the compute units we have SIMD lanes that are operating in parallel and a reduction network that allows us to implement the reduce operation, and the memory units are just blocks of banked SRAM. We're doing deep pipelining within the compute unit, but then we're also doing pipelining one level higher, between the compute and memory units. So the idea here is SIMD parallelism everywhere and then pipeline parallelism everywhere, and that's how you get your performance, really.

C: So we went through a set of real-world applications and programmed them onto our ASIC, and we ended up using a 12x10 grid to support all of them. We compared it to state-of-the-art switches with four pipelines; our reference was 500 square millimeters, and we found that our grid, which could support these different applications, was only adding a 3.8 percent overhead, or 4.8 square millimeters per pipeline.

C: So again, earlier I said we want minimal area overhead, and 3.8 percent is pretty low given that you're now getting an entire class of machine learning applications.
C: Jumping into one of these applications, I've been using anomaly detection as a recurring example. Here we tried out two different types of anomaly detection, with support vector machines and a deep neural network. For both models, you can see the throughput is one gigapacket per second, which is the line rate for high-end switch pipelines like your Tofinos and Broadcoms. The latency that we added was in the hundreds of nanoseconds or less, so in this case you would choose your application accordingly.

C: You can see here that the SVM requires 83 nanoseconds, while the DNN requires 221 nanoseconds. So, depending on your SLOs and what kind of requirements you have to meet, you can choose your algorithm to reduce latency. And in both cases, the area and power overhead required for the hardware to implement just these applications is in the single digits or less: 0.6 percent power overhead and 0.5 percent area overhead, or 0.8 and 1.0 percent respectively.

C: Again, if you don't need, say, the full suite of benchmarks, and you only want a reconfigurable fabric that will let you do anomaly detection, you can do it with minimal overhead here. And in the paper there are several more applications, if people are interested, such as a congestion control network and a traffic classification network.
C: So we went through this whole process of doing an ASIC analysis to prove that it could be done, but as far as research goes, we don't really want anyone waiting for some sort of mass-produced Taurus ASIC, so we've put out an open-source FPGA-based testbed. This is just a rough diagram of what it looks like. In the control plane we're using your typical network OS, like ONOS; we're using a Tofino switch to mimic the PISA pipeline elements, like your programmable packet parsers, match-action tables, and traffic managers; and then we're using an FPGA to mimic the MapReduce unit. We set it up in this bump-in-the-wire configuration, and because of the limits of an FPGA you're not going to be able to hit the same performance as you would get with the ASIC coarse-grained reconfigurable architecture.

C: Essentially, in the example I mentioned earlier about anomaly detection, either we're trying to do detection of anomalous packets in the control plane, or we're trying to use Taurus and do anomaly detection in the data plane, and with the testbed that I just showed you, you can do either. In the case of Taurus, we'd be placing our anomaly detection application on the FPGA, while if we're trying to do control-plane-based anomaly detection, we would run it at the controller, on the CPU.

C: The F1 score of the model in software is 71.1, and you can see that Taurus, on the far right side of the table, is achieving an F1 score of 71.1 as well. So it's faithfully recreating the model as it was in software, and we're processing packets as they're coming in. Whereas in the control plane, we actually had to sample packets from the network, run them through the control plane, run them through an ML framework, and try to install flow rules.

C: And what ends up happening is that you miss so many packets while doing this operation that your effective F1 score drops pretty heavily. You can see in the far right column that the F1 score for the baseline ranges from 1.5 down to almost zero, so you're effectively throwing away your model because of the added latency.

C: So that's just one example of what we did with our FPGA testbed. There are, of course, lots of other things you can do, but it's just to reinforce the point of why you have to operate in the data plane.

C: Cool, so yeah, that's mostly it for me. I have my contact information here, and at the bottom I have the GitLab link for the FPGA testbed; we hope people can try it out. And there's the link to the full paper at this easy-to-memorize URL.

C: So yeah, I'm happy to take any questions.
A: Okay, thank you very much, an excellent talk. Since we have some people remote and some in the room, I think if we can manage the queue using the Meetecho queueing tool, that would be helpful. I do see, I guess it's Barry at the microphone there.

D: Okay, actually, I'm being Dave Oran right now. Dave Oran asks: I assume the class of anomalies you can detect are those that can be detected by header fields within the width of the ALU of the switch; things in the packet data beyond the headers won't be seen. Is that correct?
C: So, in the case of anomaly detection, we used the NSL-KDD dataset, which has a record of different attacks that were calculated from, like you said, either header fields, or you can also actually calculate aggregate fields across headers. So you can, say, create a histogram using the match-action tables across different packets.

C: The packet headers are going to be limited by the packet header vector size that's moving between stages in the switch pipeline, but you don't necessarily need to be limited to features in the header, because the control plane can install different types of metadata into the match-action tables, and you can do your own processing in the match-action tables over time, or whatever other kinds of calculations you want to do on your headers. So the headers are just the starting point for the features here.
E: So the first one, and this is the naive attendee question: I suspect the paper is very important for interpreting that last table. It was really quite opaque how to understand the meaning of the columns and their impact on a comparison to the baseline. I think there's a lot of implicit knowledge in your table structure. I'm sure the paper explains it; the slide was just a bit opaque to a naive reader.

E: So, at the start of your talk, that was the first point you made: a case to say that the delay between doing a packet sample, constructing table match rules in the controller, injecting those rules down into the functional plane, and applying them had a huge packet loss and mismatch interval. But it seems to me the delay to perform the ML operation, tune your ML, have a model that is representative of the condition you want to model, and then install that, has a similar cost. It's not to say there's no benefit of ML.

E: I think it's huge, but the component that's about the delay cost of doing an instantiation of rules, I don't think, is the basis for doing it. I think you're on stronger ground arguing it's about the ability to do complex matching at line rate than about the static cost of doing the rule installation, right?
C: So you're right about installing the model itself. The idea is that you could be sampling packets from your network and sending different kinds of metadata to the control plane, essentially doing your training offline, and you can install model weights or replace model weights as needed. The idea is that whatever is operating in the data plane itself has nothing to do with the installation of model weights, yeah.
E: I thought that idea, that you could do the model training asynchronously to the sampling, exactly, is very beneficial. But if you consider a new class of attack, you have to understand it and do some form of Bayesian analysis and classification, which is completely unmodelled here. Exactly how you do that training is unknown, and how long that takes; it's not about the speed of the chipset, it's about your ability to do the good/bad classification a priori, to inform the model and then download it. That's quite a high cost in time.
C: Yeah, so this is always kind of the trouble with security, right? If you want to do an on-the-fly analysis of a brand new attack, that's not really what we're targeting.

E: But in engineering terms, your case that this is extremely fast at line rate is well made. I enjoyed listening to it a lot. Thank you.

C: Thank you.
C: Yeah, so I think the energy consumption needs to be looked at maybe more holistically. While you are increasing by some small percentage the energy that you'd be consuming in the switch itself, you can consider that, say, if you're doing anomaly detection, you're removing the cost of running an anomaly detection application in software on a server somewhere else.

A: Okay, thank you. Well, I have questions, for sure.
A
This
is
a
an
irtf
meeting,
which
is
which
is
co-locating
with
the
ietf,
and
obviously
you
know
since
that
this
communicates
with
the
itf.
The
question
is,
then,
you
know
to
what
extent
have
you
given
any
thought
towards
how
that
how
this
might
change
or
affect
the
type
of
work
the
ietf
does?
C: Yeah, I think one of the things that an earlier questioner brought to my attention was what kind of standardization is needed for packet headers if we're going to be using them as features, or carrying model weights, or basically doing these kinds of ML-assist type operations.

A: I mean, I'm thinking that your traditional programmable switch uses P4 or something like that as a programming model. Do we need a similar standardized programming model for these types of ML switches?
C: Yeah, so it's kind of a complement to P4. We went with MapReduce, so we're not necessarily married to the idea of using a MapReduce block or anything; the bigger idea here is just doing inference in the data plane. But yeah, it could definitely help to have some sort of standardization, in the way that P4 works, but for the MapReduce element.

C: You could even consider, say, an extra control block in P4 for MapReduce, and we actually have another paper in submission on what the language-level constructs here look like. So yeah, that's definitely an area for standardization as well.
A: It just took a little while, yeah. Okay, great. All right, so the second talk today is focusing, I think, on a very different problem domain. In this talk, Sam Kumar will talk about his paper on performant TCP for low-power wireless networks. This was originally presented at the NSDI conference in 2020.

A: If I remember correctly, Sam is a PhD student at UC Berkeley, advised by David Culler and Raluca Ada Popa. He's broadly interested in systems, security, and networking, and his research focuses on rethinking systems design to manage the overhead of using cryptography, and presumably also on improving the performance of TCP for low-power wireless networks.

A: So, Sam, over to you.
B: Okay, thanks Colin for the introduction. As you said, I'm Sam, and I'm going to present my research on performant TCP for low-power wireless networks. This is joint work with my collaborators at UC Berkeley and, as you mentioned, it was published in 2020 at NSDI.

B: So LoWPAN research began in the late 1990s, and at that point in time researchers deliberately cast away the Internet architecture, based on the idea that LoWPANs may have to operate in extreme environments too different from regular networks for the Internet architecture to directly apply. So many of the early protocols, like S-MAC, T-MAC, and so on, and the early systems, like TinyOS and Contiki, did not conform to any particular standard or architecture.

B: But, surprisingly, the adoption of IP did not come with TCP. For example, OpenThread, a LoWPAN network stack developed by Nest and used in the smart home space, didn't even support TCP, and instead the community has come to rely on protocols like CoAP, which are specialized LoWPAN protocols based on UDP.
B: It's also worth pointing out that during this time LoWPANs have not yet achieved the same kind of pervasive adoption that we've seen with other protocols, like Wi-Fi, at least in the context of bringing Internet access to devices. So a natural question is whether, to get that kind of pervasive adoption of LoWPANs, we should adopt not only IP but also the broader set of IP-based protocols, including TCP.

B: Now, there have been a few prior attempts to use TCP in this space, typically based on a simplified, embedded TCP stack like uIP or BLIP, and what we can see in this graph is that our work, TCPlp, achieves significantly higher goodput than prior attempts to use TCP in this space.

B: I'd also like to share an update that's happened since we published this research, which is that OpenThread, the low-power network stack that I mentioned that's used in the smart home space, has since adopted TCP directly, based on our research: it uses TCPlp as its TCP implementation. The research also influenced Thread, the network standard that OpenThread implements.
B
So
now
I'm
going
to
take
a
step
back
and
provide
some
more
context
as
to
what
exactly
low
pans
are
and
what
some
of
the
challenges
are
with
using
low
pens,
and
I
can
do
that
by
comparing
low
pans
to
other
wireless
technologies
that
you
might
be
more
familiar
with
so
on.
The
left.
Wi-Fi
provides
a
host
with
internet
access
via
an
access
point
in
the
middle
bluetooth.
It
doesn't
really
provide
full
internet
access,
it's
more
like
a
cable
replacement,
channel,
a
wireless
usb
of
sorts
and
then
on
the
right.
B
B
The
second
set
of
constraints
come
from
the
link
layer,
low,
pass
link,
clear
like,
for
example,
ieee
802.15.4
has
a
small
mtu
of
only
about
100
bytes
and
has
low
wireless
range,
which
means
that,
in
order
to
in
order
to
get
connectivity
over
a
large
area,
you
need
to
transmit
data
over
multiple
wireless
hops
and
finally,
energy
constraints
are
also
an
issue.
B
You
typically
don't
have
enough
energy
to
keep
your
radio
on
and
listening
all
the
time.
So
you
duty
cycle
your
radio.
What
that
means
is
that
your
radio
is
actually
in
a
low
power,
sleep
state
for
say,
99
of
the
time
and
then
one
percent
of
the
time
you
can
turn
on
your
radio
to
send
or
receive
packets
and
in
order
to
provide
an
always-on
allusion
to
applications.
B
So
to
make
this
more
concrete,
I'm
going
to
tell
you
about
the
platform
we
used
in
our
research.
It's
called
hamilton
and
some
of
the
stats
of
this
platform
are
on
the
slide.
The
key
point
here
is
that
this
kind
of
device
is
more
powerful
than
the
devices
we
had
when
low
band
research
first
got
started
in
the
early
2000s,
but
it's
still
substantially
less
powerful
than
even
a
raspberry
pi.
You
cannot
run
linux
in
a
device
like
this.
B
Instead,
you
have
to
run
a
specialized
embedded
operating
system,
and
you
can
understand
our
research
as
tackling
the
central
question
of
how
should
a
device
like
this
connect
to
the
internet
and
the
result
of
our
research
is
that
we
show
that
tcp
ip
works.
Well
now,
as
I
mentioned
earlier,
the
adoption
of
iep
in
this
space
did
not
include
tcp,
and
that
was
no
accident.
B
The
reason
is
that
researchers
had
doubts
as
to
whether
tcp
would
work
well,
and
we
expected
it
to
not
work
well,
given
the
challenges
of
lopens
so
here
are
some
quotes.
I've
taken
from
some
research
papers
to
show
some
of
the
concerns
that
the
community
has
had
about
using
tcp.
The
first
one
is
that
tcp
is
not
lightweight
and
may
not
be
suitable
for
implementation
and
low-cost
sensor
nodes
with
limiting
processing
memory
and
energy
resources.
B
The
second
one
is
that
certain
features
of
tcp
may
cause
harm
like,
for
example,
that
the
connection
oriented
protocol
aspect
of
tcp
is
a
poor
match
for
wireless
sensor
networks,
where
actual
data
may
only
be
in
the
order
of
a
few
bytes
and
finally,
there's
the
wireless
tcp
problem.
The
idea
that
tcp
may
use
a
single
packet
drop
to
infer
that
the
network
is
congested
which
can
result
in
extremely
poor
performance,
because
wireless
links
tend
to
exhibit
relatively
high
packet
loss
rates.
B
So
again,
to
summarize
more
simply,
there's
concern
that
tcp
is
too
heavy,
that
its
features
are
necessary
and
that
it
will
perform
poorly
in
the
presence
of
wireless
loss.
So
central
to
our
research
was
understanding,
tcp's
performance
and
what
we
did
is
we
did
a
study
where
we
actually
ran
tcp
in
a
low
pan,
measured
its
performance
and
try
to
draw
conclusions
about
how
well
tcp
really
does
or
does
not
perform,
and
what
we
found
is
that
out
of
the
box
tcp.
B: The issues on the right, it turns out, are fixable within the paradigm of TCP, with fairly straightforward techniques. So in our research we show why the expected reasons don't actually apply, we demonstrate techniques to address the actual issues causing poor TCP performance, and our overall conclusion is that TCP can perform well in LoWPANs.

B: Okay, in the next part of the talk I'm going to focus on why the expected reasons for poor performance don't apply, and, going back here, I'll be talking about this technique in this part of the talk, the reason being that this part of the talk is more about our experiments and our observations about the expected reasons.

B: The TCP stack was, of course, running on the Hamilton platform directly. Our software stack is using OpenThread with RIOT OS, and we used a wireless testbed to collect data, where each of those numbers is one of our Hamilton nodes. The lines connecting them show an example of a topology; in reality, OpenThread is going to generate this dynamically.
B
So
one
of
the
first
things
we
had
to
do
was
to
implement
tcp.
Now,
as
I
mentioned
earlier,
there
have
been
several
prior
attempts
to
use
tcp
in
this
space
based
on
simplified,
embedded
tcp
stacks,
but
we
wanted
to
use
a
full
scale,
tcp
stack
in
our
study.
Now.
The
challenge
is
that
implementing
a
full
scale.
B
Tcp
stack
is
hard
and
in
fact,
there's
an
entire
rfc
devoted
to
all,
describing
all
the
problems
that
people
were
seeing
in
full
scale,
tcp
stacks
back
in
1999,
even
though
these
tcps
had
matured
for
at
least
a
decade.
By
this
point,
so
our
approach
was
not
to
implement
a
tcp
stack
from
scratch,
since
we
felt
it
would
be
too
error
prone
to
do.
B
B
So
now
that
we
have
our
implementation
of
tcp,
we
can
concretely
answer
the
question
of
what
are
the
resource
requirements
of
running
tcp.
So
what
we
found
is
that
tcp
lp
requires
32
kilobytes
of
code
memory
and
about
half
a
kilobyte
of
data
memory
per
connection
to
store
all
of
the
tcp
connection
state
in
a
full
scale.
B
Tcp
implementation,
while
our
platform
has
substantially
more
code
and
data
memory
than
that
now
as
an
optimization,
we
use
separate
structures
for
active
sockets
that
are
actually
endpoints
of
a
tcp
connection
and
passive
sockets
that
are
just
listening
for
new
connections,
which
also
save
a
bunch
of
memory
as
well.
But
the
point
here
is
that
you
know,
at
least
in
terms
of
connection
state,
we're
well
within
the
bounds
of
the
available
memory.
B
So
natural
question
is
what
about
the
actual
buffers
used
to
send
and
receive
data,
so
the
tcp
buffers
need
to
be
the
bandwidth
delay,
product
and
size
in
order
to
be
able
to
send
at
full
speed
of
the
network,
and
we
empirically
determine
the
bandwidth
delay.
Product
has
two
to
three
kilobytes
and
we
can
see
in
the
graph
here
how
we
experimentally
did
that
you
can
see
two
to
three
kilobytes
of
buffer
size.
The
available
could
put
over
tcp
levels
off.
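As a quick illustration of why the buffer is sized to the bandwidth-delay product: the goodput and RTT figures below are assumptions for illustration only; the talk reports just the measured 2-3 KB BDP.

```python
# Sketch: receive/send buffers sized to the bandwidth-delay product (BDP).
link_goodput_bytes_per_s = 12_500   # ~100 kbit/s of usable goodput (assumed)
round_trip_time_s = 0.2             # ~200 ms multi-hop RTT (assumed)

bdp_bytes = link_goodput_bytes_per_s * round_trip_time_s
print(f"BDP ~ {bdp_bytes:.0f} bytes")   # ~2500 bytes, in the 2-3 KB range
```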
B
So
our
conclusion
here
is
that
tcp,
including
the
size
of
the
buffers,
fits
comfortably
in
memory
and,
in
fact,
there's
another
conclusion
to
be
drawn
here,
which
is
that,
if
you
notice
the
the
size
of
the
buffers
is
actually
much
bigger
than
the
connection
state,
which
suggests
that
most
of
the
overhead
of
tcp
doesn't
come
from.
The
complexity
of
the
protocol
is
from
the
buffers
and
any
performant
bulk
transfer
protocol
would
need
these
buffers
in
order
to
transmit
at
the
bdp.
So
in
some
sense
the
overhead
really
isn't
bottlenecked
by
tcp's
complexity
at
all.
B
There's
also
some.
We
also
introduced
a
technique
here
in
order
to
reduce
the
memory
used
for
the
buffers,
and
part
of
this
has
to
rely
on
tcp
having
both
a
receive
buffer
and
a
reassembly
buffer
to
store
in
sequence,
data
and
auto
sequence.
B
Data
for
reassembly
now
full
scale,
tcp
stacks,
like
freebsd
use
packet
queues,
there's
a
separate
queue
of
packets
for
each
of
these,
but
in
the
embedded
setting
we
don't
want
to
use
dynamically
allocated
packets
because,
if
we
hold
on
to
dynamically
allocated
packets
in
a
memory
constraint
setting,
we
may
cause
other
memory
allocations
to
fail.
So
we
instead,
we
want
to
use
flat
arrays
and
the
naive
strategy
would
be
to
have
a
separate
flat
array
for
your
receive
queue
and
for
your
reassembly
queue
now
to
optimize
this.
B
What
we
observe
is
that
there's
an
interesting
relationship
between
the
advertised
windows
size,
the
number
of
bias
we
currently
have
and
the
total
size
of
the
buffer,
which
is
that
the
number
of
received
bytes
plus
the
advertised
windows
size,
is
equal
to
the
total
size
of
a
receive
buffer.
Now.
The
observation
we
make
on
top
of
this
is
that
all
of
the
data
we
may
possibly
get
for
reassembly
has
to
fit
within
the
advertised
window
size,
that's
the
contract
of
tcp
that,
if
you're
sending
to
a
recipient,
you
do
not
go
past
their
advertised
window.
B
B: Okay, next I'm going to talk about the wireless TCP problem, and before we talk about that, I need to tell you about the number of in-flight segments, because that affects TCP's congestion control. So, as I mentioned, the bandwidth-delay product is two to three kilobytes; each segment is sized at about 250 to 500 bytes, and this was chosen carefully. It's actually based on the technique I'll tell you about later on in the talk for coping with the small MTU of these networks.

B: So we'll come back and explain this, but for now take it as a given that our segments are 250 to 500 bytes, and what this works out to is that we have four to twelve in-flight TCP segments at any one point in time. Now, this is different from other, higher-bandwidth networks.
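The in-flight segment count follows directly from those two numbers; a trivial check:

```python
# 2-3 KB of BDP divided by 250-500 byte segments gives roughly 4-12 segments.
bdp_bytes = (2000, 3000)
segment_bytes = (500, 250)
print(bdp_bytes[0] // segment_bytes[0], "to", bdp_bytes[1] // segment_bytes[1])  # 4 to 12
```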
B: Here our maximum segment size is 462 bytes, and when I say maximum segment size, I'm actually subtracting the space for TCP options, so this is how much data is sent in each TCP packet; and our bandwidth-delay product is filled by just four TCP segments.

B: So what ends up happening is that, yes, our losses are very frequent, but because we only need a congestion window of four segments in order to fill up the BDP and send at line rate, TCP's congestion control is actually able to recover from losses extremely quickly, and we spend most of our time actually sending at a full window.

B: We find that, because our bandwidth in these networks is so small, our bandwidth-delay product is small and, as a result, we can recover to a full BDP quickly after a loss. This means that the wireless TCP problem actually does not affect TCP's performance significantly in these networks, and TCP is much more resilient to wireless losses in a LoWPAN than it is in a higher-bandwidth wireless network.
B
So
now
I've
talked
about
the
expect
why
the
expected
reasons
don't
apply
in
the
next
part
of
the
talk,
I'm
going
to
tell
you
about
the
actual
reasons
for
poor
performance
and
going
back
to
our
slide
with
our
techniques
on
it
I'll
be
telling
you
about
these
three
techniques.
Now
there
are
a
couple
I
didn't
get
to
the
zero
copy
send
buffer
the
link,
clear
queue,
management,
and
that's
because
I
don't
have
the
time
in
this
talk
to
talk
about
it.
B
But
if
you
want
to
chat
about
it
afterwards
I'll
be
around
or
you
can
look
in
the
paper
to
find
some
details
about
those
so
first
dealing
with
the
mtu
problem,
here's
a
graphic
showing
the
size
of
the
mtu
in
ethernet
wi-fi
and
I
triple
eeee
into
the
15.4,
which
is
an
example
of
a
low
pan
link
layer,
and
what
we
can
see
is
that
tcp
ip
headers
are
very
small
compared
to
the
ethernet
and
wi-fi
mtus,
but
they're
significant
compared
to
the
ieee
inner
tutor,
15.4
mtu,
and
this
is
going
to
result
in
large
header,
overhead.
B
B
So,
in
order
to
overcome
this,
we
break
this
conventional
wisdom
and
instead
allow
tcp
lp
to
have
tcp
segments
that
span
multiple
link
layer
frames.
Okay,
what
that
means
is
that
we're
relying
on
the
six
lowpan
adaptation
layer
to
handle
fragmentation
and
reassembly
for
us,
which
adds
some
overhead,
but
it
means
that
the
overhead
of
our
headers
is
now
amortized
over
multiple
frames,
allowing
us
to
get
some
good
good
put
now.
There
is
a
trade-off
here
if
we
use
too
much
fragmentation.
B
If
we
set
our
our
mtu,
I
mean
if
we
set
our
tcp
segments
to
be
way
too
large.
What's
going
to
end
up
happening,
is
that
we
rely
on
too
much
fragmentation
and
that's
bad,
because
now,
if
one
fragment
gets
lost,
we
lose
the
entire
packet.
So
what
we
want
to
do
is
we
want
to
choose
our
tcp
segments
to
be
as
large
as
possible
to
effectively
amortize
the
overhead
without
incurring
more
fragmentation
beyond
that.
B
Okay-
and
this
graph
was
an
experiment
where
we,
where
we
measured
the
maximum
segment
size
and
the
good
put
that
results,
and
we
found
that
the
gains
essentially
level
off
around
three
to
five
frames.
So
that's
what
we
use
for
our
future
experiments
and
it
shows
that
you
know
there's
a
good
trade-off
to
be
made
here
where
we
can
get
good
good
put
despite
the
despite
the
header
sizes.
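A rough sketch of the amortization argument (the byte counts below are assumptions for illustration, not the paper's exact figures): with one TCP segment per 802.15.4 frame the TCP/IP header eats a large share of the roughly 100-byte frame, while spanning a segment over several frames pays that header once plus a small 6LoWPAN fragmentation header per frame.

```python
# Illustrative header-amortization arithmetic; all byte counts are assumptions.
FRAME_PAYLOAD = 100   # usable bytes per 802.15.4 frame (approximate)
TCPIP_HEADER = 40     # per-segment TCP/IP header cost after compression (assumed)
FRAG_HEADER = 5       # 6LoWPAN fragmentation header per frame (assumed)

def data_fraction(frames_per_segment):
    total = frames_per_segment * FRAME_PAYLOAD
    overhead = TCPIP_HEADER + FRAG_HEADER * frames_per_segment
    return (total - overhead) / total

for n in (1, 3, 5, 10):
    print(f"{n} frame(s) per segment: {data_fraction(n):.0%} of bytes carry data")
```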
B: Okay, now I'll talk about how the link-layer scheduling to support a low duty cycle interacts poorly with TCP. So recall that these devices often don't have enough energy to keep their radios on and listening all the time. We define the duty cycle as the proportion of time that the radio is listening or transmitting, basically the percentage of time where the radio is not in a low-power sleep state, and, in order to get good energy consumption, we want the duty cycle to be as close to zero as possible.

B: Now, there are several ways to support this in the literature. OpenThread uses a particular duty-cycling mechanism called a receiver-initiated duty-cycling protocol, which I'll now explain. In OpenThread you have two kinds of nodes: battery-powered nodes, where we want to minimize the duty cycle, and wall-powered nodes, which are plugged into a wall outlet and have enough power to keep their radios always on. Now, sending a frame from B to W is easy, because W's radio is always on.

B: So we can just send the frame whenever we like. More challenging is the reverse: getting a frame from W to B. What has to happen is that W has to wait until B's radio is listening. And how does it know when B's radio is listening? Well, this is where the protocol comes in. What B does is that, whenever it turns on its radio to listen for a frame, it sends a data request packet to W, informing it that it's now listening.

B: So W has to wait until it gets a data request packet, and once it does, it can go ahead and send the frame to B, and B will listen and receive the frame. Okay, so what's the key point here? The key point I want to emphasize is that B's idle duty cycle is directly related to how frequently it sends data request frames.
B: B can choose to send data request frames very rarely, which allows it to get very good energy consumption, but doing so will cause more of a delay in getting frames to it, since W has to wait for the request frame in order to send it one of the data frames.

B: Okay, so now let me talk about what this means for TCP operation, and I'll do this by comparing HTTP over TCP to CoAP. CoAP is a REST-based protocol running on top of UDP, and in our setup we had B send W a data request frame every one second; basically, it listens for packets every one second, and that allows it to get a really low duty cycle.

B: The key difference between HTTP and CoAP here is that HTTP requires two round trips, whereas CoAP only requires one round trip. So, for the first round trip, you start at a random phase within the thousand-millisecond sleep interval, so you'd expect, on average, a 500-millisecond delay, and CoAP is consistent with that. For HTTP, what happens is that for the first round trip we see 500 milliseconds, but the second round trip starts right at the beginning of the next sleep interval.

B: So the second round trip consistently sees the worst-case latency when transmitting the packet from B, and, as a result, HTTP performs more than twice as poorly as CoAP on this workload. Now, I want to point out that there have been some recent extensions to TCP, for example TCP Fast Open, which you can use to eliminate the second round trip and get performance parity between CoAP and HTTP. But this problem also happens for bulk transfers, where the ACK-clocked nature of TCP causes it to consistently experience the worst-case latency, even for bulk transfers.

B: So this is an important problem to solve regardless of that, and our approach to solving it is to use an adaptive duty cycle. The idea is that we can use the TCP and HTTP protocol state to vary how often we send data request frames, the idea being that when we expect a packet, we want to send data request frames more frequently.

B: So, for example, if I'm an HTTP server on one of these battery-powered devices and I've just accepted a TCP connection, I can be pretty sure that I'm soon going to receive an HTTP request on that connection, so I may choose to send data request frames more frequently at that point in time. Doing this nearly entirely eliminates the gap between CoAP and HTTP.
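A minimal sketch of the adaptive duty-cycle idea (my own illustration in Python; the interval values and state names are assumptions, not OpenThread's or TCPlp's actual API):

```python
SLOW_POLL_S = 1.0    # normal data-request interval, for a low idle duty cycle
FAST_POLL_S = 0.05   # aggressive interval while a packet is expected (assumed)

def poll_interval(expecting_packet):
    """Pick how often the battery-powered node sends data-request frames,
    driven by transport/application state (e.g. connection just accepted,
    request sent and response still pending)."""
    return FAST_POLL_S if expecting_packet else SLOW_POLL_S

# Toy timeline: poll fast only while a response is outstanding.
expecting = False
for step in range(5):
    if step == 1:
        expecting = True    # TCP connection accepted: an HTTP request is imminent
    if step == 3:
        expecting = False   # response delivered: fall back to slow polling
    print(f"t={step}s: next data request in {poll_interval(expecting)}s")
```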
B: So let me step back and go over hidden terminals, to provide some background on that for those who aren't familiar with it. We can understand the wireless range of a node as looking something like this. The unit disc model is a simplification, where we consider this to be some sort of perfect circle; in practice, of course, it can be more complex depending on the exact environment your deployment is in, but this model is going to be enough for us to capture the phenomenon of interest here, so we'll go with that.

B: So imagine you have four nodes in a line, with their transmission ranges shown here, and we want to transmit data from A to D. Now, the nature of TCP is that we have multiple segments in flight at the same time for a single connection, and that's why we have segment one being sent from C to D and segment two being sent from A to B.

B: But unfortunately this is bad, because the wireless ranges are going to overlap at B, so the two packets are going to interfere there. Okay, now, in the context of Wi-Fi, we typically overcome this using a protocol based on RTS and CTS frames, which allows us to mitigate the hidden terminal problem in most cases. But in the context of LoWPANs, the small MTU means that RTS/CTS typically has too high an overhead; as a result, most deployments don't use RTS and CTS packets.

B: This also happens because of data packets and ACKs going in opposite directions. So, for example, here what we'll ultimately see is that you get the same problem with B and D both sending at the same time to C, because each of their CSMA checks can't hear the other. So, to mitigate this, our approach is to add a new random backoff delay between link-layer retries.
B
So the idea is, if you transmit a frame and it fails, which you know because you don't get a link-layer acknowledgement for it, then you wait a random amount of time and retry the transmission. This is different from CSMA in two respects. The first is that in CSMA you do the randomized delay with exponential backoff if the channel appears busy; in this case, even if the channel appears clear, if our transmission fails we still do the backoff.
B
So it's different in terms of what triggers the backoff. Second, it's a much longer delay, because in CSMA you can rely on hearing a concurrent transmission: you can transmit immediately if the channel appears clear. For this new delay that we're adding, the link retry delay, we want a delay chosen between 0 and 10 times the time to transmit a frame, the idea being that even if there are two concurrent transmissions that can't hear each other, with high probability they won't overlap in time.
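As a rough illustration, the retry logic described here might look like the following. The frame-transmission time, retry limit, and the radio_send, sleep_ms, and rand_range helpers are assumptions made for this sketch, not the actual implementation from the talk.

```c
/* Sketch of the link retry delay described above: after a failed
 * transmission (no link-layer ACK), wait a uniformly random delay of
 * 0..10 frame-transmission times before retrying. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_TX_TIME_MS 5      /* illustrative time to transmit one frame */
#define MAX_RETRIES      4

extern bool radio_send(const uint8_t *frame, size_t len);  /* true if ACKed (hypothetical) */
extern void sleep_ms(uint32_t ms);                         /* hypothetical platform helper */

static uint32_t rand_range(uint32_t max)    /* uniform in [0, max] */
{
    return (uint32_t)(rand() % (max + 1));
}

bool send_with_link_retry_delay(const uint8_t *frame, size_t len)
{
    for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
        if (radio_send(frame, len))
            return true;                     /* link-layer ACK received */
        /* Unlike CSMA backoff, this delay is taken even if the channel
         * looked clear: the failed transmission itself triggers it. */
        sleep_ms(rand_range(10 * FRAME_TX_TIME_MS));
    }
    return false;                            /* report failure upward */
}
```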
B
The way this would work is that each of these two nodes would send its data once and those transmissions would collide, but then, when they retry, they'll transmit a second time at hopefully different times, so they won't overlap in time and the transmissions will succeed.
B
So we did a measurement study to understand what kind of link retry delays would be appropriate and what would work. What we observe is that there's a huge reduction in packet loss even from a small delay, and if we increase the delay too much, it starts to eat away at your goodput, because now you're waiting a lot when transmitting your packets.
B
So that's what we used in our study, and this reduced the packet loss from six percent to one percent, which we consider a significant improvement.
B
So, finally, I'm going to summarize our evaluation and conclusions. First, I previewed this result at the beginning: we're able to achieve significantly higher goodput than prior attempts at using TCP, and we're very close to a reasonable upper bound that we computed based on measurements of how fast the radio can send out packets and the overhead lost to headers and ACKs.
B
We also did a measurement study of energy efficiency: we used TCP and CoAP for a sense-and-send task and measured the radio duty cycle over a 24-hour period, which you can see here. The key point is that TCP is not significantly worse than CoAP; in fact, they perform comparably for the duration of the experiment, at about a two percent duty cycle. We consider this a success, because TCP is able to perform essentially on par with a protocol over UDP developed specifically for 6LoWPANs.
B
So now that TCP is a viable option, what does this mean? Well, first, we should reconsider the use of lightweight protocols that emulate part of TCP's functionality: if a specialized protocol performs no better than a general protocol that's more interoperable and used more broadly, you should perhaps prefer the one that's used more broadly and is more interoperable.
B
Second, we think that TCP may influence the design of 6LoWPAN network systems. For a long time it's been the case that many smart home devices that you buy on the market require a specialized gateway to get Internet connectivity, and TCP gives us the opportunity to allow these devices to connect end-to-end to any external services that they may depend on.
B
So, just to say a little more about that middle point, about how TCP may influence the design of 6LoWPAN network systems: when I say gateway architecture, I mean a setup like this, where you have your devices, the smart home devices you bought on the market, and in order to allow them to communicate with an application server in a data center somewhere, you have to install some specific gateway in your home that performs protocol translation and application logic to bring connectivity to those devices.
B
What this means, as some of you may have experienced, is that if you go buy smart devices from a new vendor, all of a sudden you need another gateway for those new devices, or maybe even for newer versions of devices from the same vendor. For example, for a long time it was the case that if you had bulbs from, say, LIFX and bulbs from Philips, you would need separate gateways for both of those devices.
B
So the introduction of IP in this space didn't really change this, in the sense that now the application protocol on the left is implemented over IP, but you still need the application-layer gateway. The missing piece, I think, that would allow an end-to-end connection here is a transport protocol that's supported on both sides, namely TCP. Once you do this, your application-layer gateways become regular border routers, and you could potentially consolidate them into a single border router.
B
So, in conclusion: we implemented TCPlp, a full-scale TCP stack for 6LoWPAN devices. We explained why the expected reasons for poor TCP performance don't apply, we showed how to address the actual reasons for poor TCP performance, and we showed that once those issues are resolved, TCP can perform comparably to 6LoWPAN-specialized protocols. That's all I have prepared; I'm happy to take any questions now.
A
Okay, thank you, Sam, for that excellent talk. I see we have a couple of people in the online queue and a couple of people at the microphone.
D
Hi, I'm Matthias, one of the co-founders of RIOT. Great work, thanks a lot. One remark and two questions. First question: you argued that supporting TCP is important because it's popular; now QUIC is becoming popular. Did you work on any comparison from a systems point of view?
B
So we didn't do a comparison against QUIC, but I'd like to comment on that, because it's a good point that other transports are becoming popular. Many of the issues that we addressed aren't specific to TCP; they apply broadly to TCP and other protocols intended for bulk transfer. For example, the main issues, getting it to work with hidden terminals, getting it to play well with link-layer scheduling, and so on, apply broadly to any protocol that transmits a lot of data and wants a significant amount of bandwidth.
D
And another question: in your paper you note that you also have an implementation for GNRC, the default network stack in RIOT. Do you also plan to submit a PR to the upstream implementation?
B
At some point we did have plans for that, but what happened is that RIOT OS had already adopted a different TCP stack, and it seemed a bit redundant to contribute a second one. What we have done recently is contribute our code to OpenThread, which now uses it as its default TCP stack.
D
First, I highly encourage you to submit the PR. And finally, a remark: you said that a packet is lost when a fragment is lost. This depends a little bit on the fragmentation scheme, right? If you consider, for example, selective fragment recovery, it doesn't matter too much for the whole packet whether a fragment is lost or not.
B
Yeah, so my understanding of the basic 6LoWPAN work, or at least the way it was implemented in the operating systems we looked at, was indeed that if a fragment is lost, you lose the whole packet. But I do agree that there are protocols you can use to recover a lost fragment without losing the entire packet, and those could also help with this problem, allowing you to make the packet bigger and amortize the TCP/IP headers even better.
D
All right, hello, Tommy Pauly from Apple. Thank you for doing this talk, very interesting; I'm super happy to see the use of TCP here. I just had a couple of questions from earlier in the presentation, and you don't have to go back: when you were talking about the memory-saving aspects and the ability to have the flat buffer, you had the diagram there of, essentially, here's what's in flight, and then there are the out-of-order bits, and there are gaps in there as well.
D
When you're doing this, are you able to essentially guarantee 100 percent of the time that you'll never need to allocate memory? Or is it just most of the time, and then there would be a failover case where you do need to have dynamic allocation?
B
Great question: we ensure that you never have to dynamically allocate memory. The way we do it is that you store the data in the buffer and keep a bitmap to track which bytes contain the out-of-order data, and the bitmap can also be sized statically, because it depends only on the array size, which is also static.
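As a rough illustration of that idea, a statically allocated buffer plus a statically sized bitmap over it could look like the sketch below. The buffer size and function names are hypothetical and are not taken from the TCPlp code.

```c
/* Sketch of a statically allocated out-of-order receive buffer with a
 * bitmap recording which bytes are filled.  No dynamic allocation. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RECV_BUF_SIZE 1024                      /* bytes of reassembly space */

static uint8_t recv_buf[RECV_BUF_SIZE];         /* segment payload bytes     */
static uint8_t recv_map[RECV_BUF_SIZE / 8];     /* 1 bit per buffered byte   */

static void mark_byte(size_t i)    { recv_map[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool byte_present(size_t i) { return (recv_map[i / 8] >> (i % 8)) & 1u; }

/* Store an out-of-order segment at its offset from the next expected byte.
 * Data that falls outside the fixed window is simply dropped. */
void buffer_out_of_order(size_t offset, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len && offset + i < RECV_BUF_SIZE; i++) {
        recv_buf[offset + i] = data[i];
        mark_byte(offset + i);
    }
}

/* How many in-order bytes are now ready to hand to the application? */
size_t contiguous_ready(void)
{
    size_t n = 0;
    while (n < RECV_BUF_SIZE && byte_present(n))
        n++;
    return n;
}
```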
D
Got it, okay. Is this something that needs tuning on the Internet hosts to make sure that they are friendly to the 6LoWPAN devices, or can you use completely unmodified Internet hosts to talk to them?
B
Yeah, that's an excellent question, and the short answer is that the hosts on the Linux side were completely unmodified. To say a little bit more about that: the timing that we adjusted, like the randomized delay, was not at the TCP level but at the link layer, so the other side doesn't actually see any of it. This is also one of the advantages of using a full-scale TCP stack like the one from FreeBSD: it's been battle-tested in the real world and it's interoperable with all the major TCP stacks that are out there. And I just want to say that interoperability is actually a problem in the embedded space. Many of the other TCP stacks you find have interoperability problems, in pretty subtle ways, with the real TCP stacks that are in use, and that's something we managed to sidestep by using a battle-tested TCP implementation as the basis of our study.
B
Yeah, so the hidden terminal problem affects even a single TCP connection in isolation, and we verified that our randomized backoff fixes the problem in that case.
B
Yeah, so if you have background traffic, this is also why we use randomized delays instead of fixed delays: with a randomized backoff it doesn't matter whether the interference is coming from the same stream or a different stream; in both cases you'll back off a random amount and hopefully transmit again without colliding.
B
This is also why we did it without looking at TCP state: there are several approaches you could use that look at TCP state in some way, but having it just be a randomized delay purely at the link layer gives us some confidence that it would work regardless of the source of traffic, whether it's different TCP streams or even something else.
D
Am I unmuted, finally? So this is following up on the multi-hop case. In these environments the forwarding devices are in fact also very low-power, low-resource devices.
B
So that's a great question. First I just want to clarify that the buffers used at the intermediate routers aren't TCP-layer buffers; they're just the general packet buffers used for forwarding, because with an end-to-end TCP connection there's no TCP state at the intermediate routers.
D
Sure, sure, but it may put a different aggregate load on those buffers than, say, CoAP traffic, or something that's more simple request-response related.
B
Yeah, so of course it's the case that when you're transmitting at higher bandwidth, you're going to place more stress on the buffers of the intermediate routers, and there are a couple of things that we actually did in our study to help mitigate that.
B
The first one is that we added some active queue management functionality to those intermediate routers, where you mark packets when the queue is congested, using explicit congestion notification and so on, in order to prevent TCP from filling up the entire buffer and to keep your queues short. The primary reason we did this was to improve fairness between different TCP flows that are competing for buffer space at these intermediate routers, and also to reduce the latency of traffic.
B
But it also has the side effect of limiting the amount of buffer space being used by a single TCP flow, which addresses some of the concerns that you brought up.
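A minimal sketch of that kind of queue-threshold ECN marking at a forwarding node is shown below. The queue capacity, marking threshold, and the packet structure are simplified assumptions for illustration, not the exact AQM used in the study.

```c
/* Sketch of simple threshold-based ECN marking in a forwarding node's
 * packet queue: once the queue grows past a threshold, ECN-capable
 * packets are marked instead of letting the queue fill completely. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAPACITY  16   /* total packet buffers at the router          */
#define ECN_MARK_THRESH  8   /* start marking once the queue is half full   */

#define ECN_ECT1 0x01        /* ECN-capable transport (1)                   */
#define ECN_ECT0 0x02        /* ECN-capable transport (0)                   */
#define ECN_CE   0x03        /* congestion experienced                      */

struct pkt {
    uint8_t ecn;             /* 2-bit ECN field from the IPv6 traffic class */
    /* ... payload, next-hop information, etc. ... */
};

/* Called when a packet arrives for forwarding.  Returns false if the
 * packet must be dropped because the queue is full. */
bool enqueue_for_forwarding(struct pkt *p, size_t queue_len)
{
    if (queue_len >= QUEUE_CAPACITY)
        return false;                        /* tail drop as a last resort */

    if (queue_len >= ECN_MARK_THRESH &&
        (p->ecn == ECN_ECT0 || p->ecn == ECN_ECT1))
        p->ecn = ECN_CE;                     /* signal congestion instead of dropping */

    return true;                             /* caller appends p to the queue */
}
```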
A
All right. So I have a question. I very much like the idea of the headers spanning multiple link-layer frames. Does this put any constraints on the link layer, or does the 6LoWPAN layer handle all of that?
B
That's a great question. Some of these things can potentially be handled at the 6LoWPAN layer, but others do indeed have to do with the link layer directly. For example, the randomized delay that we added to avoid hidden terminals is something that would operate at the link layer, because at the 6LoWPAN layer you don't have, or at least don't naturally have, the same kind of visibility into when your link-layer ACKs are coming in and so on, and you would need that to determine that a transmission failed and how much to back off before the retransmission. So some of them do indeed affect the link layer.
A
Yeah. Is there a requirement that the link layer delivers packets in order, to avoid damaging the headers, or is 6LoWPAN handling that?
A
Is there a requirement that the link layer delivers packets in order, because of the way you've split the headers across multiple link-layer frames, or is all of the reordering handled by 6LoWPAN?
E
Okay, yeah. Thank you very much for this work, this is great stuff. I did have a comment on the comparison with CoAP. I think the specification for CoAP was not entirely based on "we can't use TCP"; it was more based on "we can't use HTTP", because the justification for it came from folks who wanted to use a RESTful interface at the application layer. Not every application in IoT wishes to do that, but there's certainly a lot of incentive to use REST.
E
So when the RESTful folks started to become interested in IoT, the only alternative was HTTP/1.1, which, I completely agree, is terrible: it's massive, it's a text-based protocol, you cannot compress it, it's very verbose, et cetera. It's terrible.
E
We subsequently had HTTP/2, which became a binary protocol, and we actually had a paper three years ago in ANRW about how to use that over something like 6LoWPAN, for example, just initially scratching the surface. But now we have HTTP/3 and QUIC, and it's all binary. So I understand you haven't had a chance, after the excellent work you've done, to look at those layers, but I would highly encourage you to do that, because that would address a significant portion of the application-layer incentives for IoT as well.
B
Yeah, thanks for clarifying that. I do acknowledge that CoAP has evolved quite a bit in the past few years; some of those evolutions happened after we published this work. But I do want to clarify my position on CoAP a little bit, based on what you said: indeed, I think that CoAP is useful, it has its uses, and it's very flexible. It's been evolving a lot over the years, and that's great. I have noticed that CoAP has been evolving, in some sense, more and more towards the same kind of abstraction that TCP provides, for example with some of the recent work on streaming block transfers and so on.
D
Thank you so much for the presentation, I really appreciate it. I was just wondering: you talked mainly about the applications of this in LANs. Do you see any application for longer-range networks, like mobile ad hoc networks or anything of that sort?
B
That's a great question. All of our experimentation was done using IEEE 802.15.4, which is a personal area network protocol, and that was motivated by the recent interest in adopting some of that technology for the smart home and IoT space. Some of these lessons might carry over to the mobile and ad hoc network space, like LPWANs and so on.
B
I'm not sure I'll be able to tell you any specifics, given that I don't have much experience with those networks, but my first gut feeling would be that there's probably a way to make TCP work well, given that it's been adapted to work on so many different kinds of networks in all kinds of different environments. Other than that, I'm not sure which of the specific techniques would directly carry over.
A
All right, thank you very much.
A
And thank you again to both of the speakers; those were two really great talks. Both Sam and Tasha will be around all week, and I'm sure they'll be very happy to talk with people more about their work, so please do find them, have a chat about their work, and make them welcome to the IETF and to the IRTF.
A
Congratulations to both Sam and Tasha for the award of the ANRP this time. As I said earlier, look out for more ANRP award talks at IETF 115 in London in November. The nominations for the 2023 ANRP awards will be opening in September, so if you know of any good work, please think about nominating it. And look out for the Applied Networking Research Workshop, which is taking place co-located with the IETF in Philadelphia tomorrow.