From YouTube: IETF92 - IRTF Open Meeting and ANRP Awards
Description
Open meeting of the Internet Research Task Force (IRTF)
B
If you're here for the IRTF open meeting, you're in the right room; if you're here for something else or you want to read your email, you're welcome to do that. But if you're here for something else, you're probably in the wrong room. The observant amongst you will notice that I am not Lars Eggert. I am Matt Ford, I'm with the Internet Society, and I'm mostly here to introduce our speaker for this session.
B
That is our Applied Networking Research Prize winner at this IETF, Aaron Gember-Jacobson, who won the award for designing and evaluating an NFV control plane. He's going to tell you a lot more about that, but maybe we could first have a round of applause to congratulate Aaron on winning his ANRP award.
B
I think, Aaron, you're going to present and then we'll take some time afterwards for Q&A. If you want to save up your questions; I guess if you have clarifying questions you can dive in with those, but otherwise we'll save questions for after Aaron's talk, and I'll moderate the discussion. Thanks.
A
If something's not clear, certainly feel free to step up to the mic and interrupt me. So thanks so much for that introduction; I hope you'll find what I'm talking about today interesting. What we've done is some research to take the principles that we have in software-defined networking and extend those principles to network functions, or middleboxes, that are running in our network, in order to allow network operators to better satisfy a number of different goals.
A
For those of you who aren't familiar with network functions or middleboxes, the basic idea behind them is that they perform some sort of sophisticated analysis of traffic or flows as it passes through the device in the network, and typically they take some stateful actions on that traffic. Good examples that commonly exist are things like WAN optimizers, caching proxies, and intrusion prevention systems. We're seeing two shifts in the way these network functions are being deployed today.
A
The first of these is network functions virtualization (NFV). The basic idea is that we want to take the dedicated hardware appliances that are deployed today and replace them with virtual machines that provide the same functionality, which allows us to run the network functions on top of generic compute resources, so we no longer need customized hardware. The benefit is that we can dynamically allocate instances of network functions as we need more capacity in our network, or as we need to introduce new functionality.
A
The other trend that's reshaping the way network functions are deployed is software-defined networking (SDN). SDN gives us the ability to flexibly reroute traffic between these network functions as we create them, or as the needs of our network evolve. Together, what these two trends give us is a way to dynamically reallocate where in our network we're processing certain traffic and what processing is happening to that traffic, and as a result they can enable a variety of interesting service abstractions and capabilities for our middleboxes.
A
One such example is that we could build a system that elastically scales network functions as the demand in our network changes over time. We start off here with a single instance of an intrusion detection system, and we want to make sure that this intrusion detection system is always satisfying some sort of performance SLA.
A
The second thing is that, at some point, the load in our network may go back down, and so just as we scaled out, we want to be able to scale back in. At some point we want to be able to destroy the second instance, because it's no longer needed, and reroute traffic back to the first.
A
The problem here is that, while we're doing this scaling in and scaling out, it's important that we accurately monitor the traffic and have our IDS function as we expect it to, to actually detect malicious attacks in our network. It turns out that in order to do all three of these together, we actually need more than what we can get with just this concept of NFV and this concept of SDN.
A
Today, with only these two abstractions, we can't quite realize scenarios like elastic NF scaling or some sort of high-availability situation. To understand a bit more exactly what we're missing and what else we need, let's take a look at this scenario in a bit more depth. Again we're going to assume that we start off with a single instance of the IDS, and here I'm going to look at traffic at a little bit finer granularity: I'm going to assume that we know specific flows. These could be TCP flows.
A
It could be a set of traffic from a group of hosts, but some notion of flow through this network. As we see traffic from these flows, the intrusion detection system is going to establish some state related to them: things about connection endpoints, potentially information about what we've seen in the payloads so far, a variety of different pieces of information.
A
Now, when we start to hit an overload situation as the rate of these flows increases, we can again launch another instance, but the question becomes: what exact set of traffic are we going to reroute in this particular case? One option is that we could only reroute new flows that are coming into our network, such that if we have some green flow that comes in, we'll send it to this
A
second IDS instance that we just created. It'll establish some state and properly analyze this traffic, and this is great from a cost perspective; we clearly needed this extra instance. But this isn't going to help us satisfy our SLA. We still have all that extra traffic from the red and the blue flows going through our first instance, we're still starting to experience packet loss, so this isn't going to work.
A
The other challenge we face is that there could be information at each of these IDSes that we need to collectively combine in some way. Maybe we're trying to do port scan detection: all of these flows are going to a particular host, and if we don't aggregate information about connection counts between both instances, it's going to take us longer to detect that scan. So it's unclear whether accuracy will be affected in this situation as well. Okay, so we need to get some traffic off of this original instance.
A
So we'll pick one flow, let's say the blue flow, and go ahead and reroute it. The problem is that, while we've rerouted this flow, we've run into a situation where we've left its state behind. The state that we need to continue to analyze this traffic, and to detect any attacks that might be in it, is now only available at our old instance and not available at the new place where this traffic is going, so we're not going to reach our accuracy goal at some point.
A
Eventually this blue flow will die out of the network, the load in our network will go back down, and so from a cost perspective we ideally want to be able to destroy the second instance. The problem is: when do we go about doing that? If we destroy it immediately, we run into the same problem where we get rid of state that we need.
A
So what exactly do we need if we want to meet these three goals? What's missing from just NFV and SDN? Well, one thing is that we need some way to manage the internal state that these network functions are maintaining, and so we need to be able to move it, copy it, and in some cases share it between different instances of a network function. Second of all, as we're transferring this state around, we want to make sure that we're not compromising the accuracy of our network function.
A
There are certain guarantees we need on how the state transfer happens, such that we don't lose updates to this state, we don't have packets that aren't processed, and maybe even, in some cases, we need to make sure we process the packets in a particular order. These same requirements apply not only to the elastic scaling scenario that I talked about, but to other interesting scenarios like transparent failover, or potentially
A
if we want to do something like in-place upgrades. So I hope I've convinced you that we need something new here. For the rest of the talk, I'll talk about the challenges in doing this and meeting those requirements I just talked about, then the architecture that we've developed in order to meet those requirements and address those challenges, and lastly I'll close with some preliminary evaluation results.
A
There are three main challenges we face in meeting the requirements of being able to move state, and to do it in a way that's safe. The first is that there are a lot of different network functions out there, everything from WAN optimizers to caching proxies to, when you start to talk about cellular networks, things in the evolved packet core, and we want to make sure that we're minimizing the number of changes we need to make to these, and that we can accommodate a lot of different network function
A
architectures within this broader system that we're proposing to develop. The second issue is that there's a lot going on in the network: here we're thinking about moving state, there are updates happening to that state, there are packets still flowing through our network, and we want to be making forwarding updates. So how do we avoid problematic race conditions between all of these different things that are going on?
A
Lastly, it's important that whatever we're doing to move state around doesn't have a lot of memory or CPU overhead, and doesn't take a lot of time, especially if we're talking about moving state in scenarios where we're trying to do scaling and we're already in an overloaded situation. We don't want to impose a lot more load onto what's already overloaded.
A
Okay, so what could we use? Well, one thing we could say is: why not use virtual machine snapshots? We already have virtual machines that our network functions are running on, and we know really well how to snapshot virtual machines and clone them efficiently. We can use this to do scale-up: it will give us a copy of the state we need for both the red and blue flows, and we can move the blue flow and we'll have its state.
A
The problem is, when we run into that scale-down scenario, we have no way to recombine two VM images into one, so that's not going to work out. Another solution that exists out there is a system that came out of IBM Research called Split/Merge. The basic idea of Split/Merge is that you use a shared library in order to access and create state internally, so you basically replace all memory allocation calls with calls to their library functions. The problem is that they're targeting a very specific scenario, which is elastic scaling, so it's not clear their solution will work in other scenarios. Also, their system doesn't provide any of the safety guarantees that ensure we don't lose important updates, and that packets aren't reordered in cases where that can affect the accuracy of our network function. So this brings me to our solution, OpenNF. OpenNF's architecture is very similar to what you'll see in SDN.
A
We have a logically centralized OpenNF controller, and on top of it run scenario-specific control applications. One control application may be implementing the elastic NF scaling example that I talked about, and it'll issue operations to move, copy, or share state as it needs to. Underneath this controller we have the network functions themselves, and they conform to a southbound API that we've developed, such that we can accurately export and import state from these different instances. When a control application issues an operation, a module within the controller will translate that into a series of southbound API calls to do our state transfer, and once state has been successfully transferred, we can then communicate with an existing forwarding module to tell it to update the forwarding state in our switch and reroute our traffic. I'm going to talk a little bit about the southbound part first, and then I'll go into how we implement these higher-level functions.
A
So I said there are a lot of different network functions out there, and obviously, depending on what they do, they maintain very different internal state. The state in a caching proxy looks different than the state in an intrusion detection system, which looks different than the state in something like a really simple firewall or network address translator. But it turns out that, despite the fact that they hold different state, the way they go about
A
creating and updating this state is common, in that they think about state in terms of being associated with either an individual flow or multiple different flows. To give you an example, let's take a look at the state for an intrusion detection system, specifically the Bro intrusion detection system, which is an open-source IDS that's existed for many years. Here we have, for every single TCP connection, a couple of different objects, a connection object and protocol-specific analyzer objects, and we're going to organize these in some sort of a hash table.
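As a rough illustration of the kind of per-connection state table described here, a minimal sketch (hypothetical names and structures, not Bro's actual internals):

```python
# Hypothetical sketch of per-connection state in an IDS, keyed by the
# flow 5-tuple, as described above. Not Bro's actual data structures.

class ConnectionState:
    """State an IDS might keep for one TCP connection."""
    def __init__(self, five_tuple):
        self.five_tuple = five_tuple  # (src_ip, src_port, dst_ip, dst_port, proto)
        self.packets_seen = 0
        self.analyzers = {}           # protocol-specific analyzer objects

# The hash table mapping flows to their state objects.
state_table = {}

def lookup_or_create(five_tuple):
    # Create state the first time we see a flow, as the IDS does.
    if five_tuple not in state_table:
        state_table[five_tuple] = ConnectionState(five_tuple)
    return state_table[five_tuple]

flow = ("10.0.0.1", 12345, "10.0.0.2", 80, "tcp")
conn = lookup_or_create(flow)
conn.packets_seen += 1
```

The point is only that state is keyed by flow, which is what makes a flow-based filter a natural handle for exporting it.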
A
A filter specifies what set of flows we're interested in, and we modify the network functions to accommodate this operation: the network function can take its internal state, apply this filter to it, and any state that matches will be sent to the controller. Likewise, if the controller wants to provide some state to be integrated into the middlebox, the middlebox can take the state and integrate it into its existing structures. This relatively simple API means that we don't have to expose or change how the network function organizes its state internally, and it provides an intuitive way for us to reason about what state we're interested in. Now that we have these capabilities from network functions, we can go about using them to realize the operations that our control applications issue.
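The export/import capability just described can be sketched roughly as follows; the function names and the filter format are illustrative assumptions, not the actual OpenNF southbound API:

```python
# Illustrative sketch of a southbound get/put state interface with a
# flow filter; names and filter format are hypothetical, not OpenNF's.

def match(filter_spec, flow_key):
    """A filter is a dict of header fields; a missing field is a wildcard."""
    return all(flow_key.get(k) == v for k, v in filter_spec.items())

class NetworkFunction:
    def __init__(self):
        self.state = {}  # flow key (frozen dict items) -> opaque state

    def get_state(self, filter_spec):
        # Export all state whose flow matches the filter.
        return {k: v for k, v in self.state.items()
                if match(filter_spec, dict(k))}

    def delete_state(self, filter_spec):
        for k in list(self.state):
            if match(filter_spec, dict(k)):
                del self.state[k]

    def put_state(self, chunk):
        # Import state handed to us by the controller, merging it into
        # the existing structures.
        self.state.update(chunk)

nf1, nf2 = NetworkFunction(), NetworkFunction()
nf1.state[frozenset({"proto": "http", "dst_port": 80}.items())] = {"bytes": 10}
nf1.state[frozenset({"proto": "dns", "dst_port": 53}.items())] = {"bytes": 2}

# The controller pulls only HTTP state and pushes it to a second instance.
http_state = nf1.get_state({"proto": "http"})
nf2.put_state(http_state)
```

Note that nothing about the network function's internal organization leaks through this interface; only filtered, exported chunks do.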
A
Consider moving all HTTP flows from one instance to another. The first thing the controller is going to do is ask the middlebox for any state that it has related to HTTP flows, and that state is going to be provided to our controller. Next we'll go ahead and flush this state from our first instance, because we don't need it there anymore, and we'll put the state to our second instance. Now that the state's been moved, we can finally go ahead and update our forwarding such that we can resume analyzing our HTTP traffic at the second instance. We have similar capabilities to be able to copy and share state.
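The move sequence just walked through (get the matching state, flush it at the source, put it at the destination, and only then update forwarding) can be sketched as follows; the instance interface here is a hypothetical stand-in, not OpenNF's actual code:

```python
# Minimal sketch of the controller-side move operation described above.
# All names are hypothetical stand-ins.

class Instance:
    def __init__(self):
        self.state = {}
    def get_state(self, filt):
        return {f: s for f, s in self.state.items() if filt(f)}
    def delete_state(self, filt):
        for f in list(self.state):
            if filt(f):
                del self.state[f]
    def put_state(self, chunk):
        self.state.update(chunk)

forwarding = {}  # flow -> instance, standing in for switch rules

def move(src, dst, filt):
    chunk = src.get_state(filt)   # 1. export matching state
    src.delete_state(filt)        # 2. flush it from the source
    dst.put_state(chunk)          # 3. import it at the destination
    for flow in chunk:            # 4. only now reroute the traffic
        forwarding[flow] = dst
    return len(chunk)

src, dst = Instance(), Instance()
src.state = {"http-1": "s1", "http-2": "s2", "dns-1": "s3"}
moved = move(src, dst, lambda f: f.startswith("http"))
```

Doing the forwarding update last is what lets the destination instance resume analysis with the state already in place; the race conditions discussed next arise because packets can still be in flight during steps 1 through 4.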
A
I won't go into the details of that here, but I'm happy to answer questions about it later on. Okay, so we've addressed this first challenge. Now, how do we deal with all these race conditions and provide important safety guarantees? One problem that can occur in the move operation I just showed is that we can lose packets, or lose updates to state, as a result of packets arriving
A
while we're trying to do this state transfer. I'm going to assume here that we're running the Bro intrusion detection system, and it's running a script that computes a hash of the payloads of all the packets for a given connection and compares that hash against a database of known malware. This is a standard script that comes with this IDS. We'll again have two different flows, a red flow and a blue flow. When a packet comes in, the IDS asks: okay, what's the hash of this packet?
A
It takes the hash of each packet and adds it to a rolling hash that it's computing. Now at some point I say, well, I want to move the red flow, so I'm going to go ahead and do my state transfer like I did before. But before I've had a chance to update my forwarding state, another packet comes in for this red flow. That packet arrives and the intrusion detection system says: I don't have any state for the red flow; this must be a new flow.
A
So it's going to go ahead and establish some new state. Now, at some point, our forwarding update is going to take effect, and our third packet is going to come in. When we try to compute a hash over this third packet, we only have the first and the third packets reflected in the state, so the hash that we compute is going to be incorrect, and we're not actually going to detect that there's some malware in this flow.
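The failure mode just described shows up with any incremental digest. A toy illustration (SHA-256 here stands in for whatever digest the actual script uses):

```python
# Toy illustration of why a packet missed during a state transfer
# corrupts a rolling payload hash, as in the scenario above.
import hashlib

def rolling_hash(payloads):
    h = hashlib.sha256()
    for p in payloads:
        h.update(p)  # fold each packet's payload into the digest
    return h.hexdigest()

packets = [b"part1", b"part2", b"part3"]

full = rolling_hash(packets)                             # all packets seen
missing_middle = rolling_hash([packets[0], packets[2]])  # packet 2 lost in transfer
```

The two digests differ, so a signature computed over the complete payload stream will never match at the new instance.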
A
Split/Merge also provides a limited form of this loss-freeness, but it turns out a key thing they don't deal with is the fact that packets may already be in transit to a network function at the time we start the state transfer. While they can buffer packets at the switch, they're ignoring the fact that packets may have already passed through that switch, so this doesn't quite give us the loss-freeness that we want. So how do we go about doing this?
A
Well, we're going to enhance the capabilities that the network functions provide for us just a little bit. We're going to add an event mechanism, such that when some set of packets comes into the network function, we can ask: do any of these packets match a filter? If they do, we can send an event to the controller that says: hey, I was about to process this packet; it was going to update, or may have been going to update, some state that you're trying to move. We can then tell the network function to either go ahead and process that packet, buffer it for processing later on, or simply throw it away and not process it any further. To add this capability, we just need to modify the main packet-receive function within a middlebox to add a little bit of code that checks whether it should be raising an event or not; a fairly simple change. Okay, so how do we use this to get this loss-free property? Well, the first thing we'll do, before we start transferring any state, is enable events for the flows being moved.
A
So now we make our forwarding update, and when the third packet comes in, it turns out that we've seen all packets for the flow, they're all reflected in the state, we can compute our correct hash, and we can detect the malware. Now, there's another potential problem we run into, which is reordering, and in fact adding this loss-free mechanism can actually introduce reordering that may not be possible otherwise. This could be problematic in the case of a script that comes with Bro
A
that looks for weird activity: things like, did you get a SYN packet after you've already gotten a data packet? So let's go back to the fifth step from the last slide, where we were flushing the packets that were buffered at the controller. We'll flush these, and then we'll go ahead and make our forwarding update. But before that update takes effect, another packet comes in, and that packet goes to our first instance.
A
Our first instance says: I have events enabled, so I'm going to send this packet to the controller. The controller will say: I've already flushed the buffer of events, so I'll just go ahead and pass this directly through the switch to my second instance. But before this packet reaches that second instance, our forwarding update has already taken effect. So it's possible another packet comes into the switch, gets forwarded to the second instance, and arrives before we've gone through this whole sequence of forwarding along the earlier packet.
A
So how do we go about realizing order preservation? How am I doing on time? Okay, I'm actually going to skip through this, because it's kind of complex, and we can come back to it later if people have questions. Okay, so the third challenge is the issue of overhead. How do we make sure that we're not introducing a lot of memory, CPU, and other overhead in actually providing these operations?
A
Well, the thing is that we're giving applications some choices. The first choice we're giving them is: what sort of state do you want to move? If you're only moving HTTP flows, you only need to move state relating to those HTTP flows. If you're trying to create a middlebox that's highly available, so you're snapshotting state, you may say: I only care that, if something fails, a certain set of flows continues to be processed correctly, so you only need to grab that state.
A
The other option is that you can decide whether or not you need these guarantees. The intrusion detection system in the example I was going through was off-path; that's what makes it an IDS versus an intrusion prevention system. Because this IDS is off-path, if packets get dropped on their way to the IDS, there's no way to get them retransmitted; the IDS is getting a copy of the traffic. However, in the case of an IPS, if a packet gets dropped on its way to the IPS, that IPS is in the middle of a connection, which means normal TCP mechanisms will recover from that loss, and the IPS will have another opportunity to see that packet. So in that case we don't need this loss-free property, and by giving control applications the flexibility to choose what they want, they have some control over how much overhead they experience.
A
Okay, so going back to our three goals: we wanted to meet SLAs, we wanted to make sure that we could do it at low cost, and we wanted to make sure that our network functions are operating accurately and analyzing traffic correctly. We've addressed the issue of diversity by making sure that the changes we make to import and export state are simple, and we have a simple events mechanism. We deal with race conditions by adding this events mechanism and by having lockstep forwarding updates.
A
The controller itself is implemented as a module running atop the Floodlight SDN controller, and we've also implemented a communication library that can be linked into network functions in order to communicate between the controller and the network functions themselves. We've modified four different network functions so far to conform to our southbound API and provide events and export state: the Bro intrusion detection system, iptables, the Squid caching proxy, and also PRADS, which is an asset detection and monitoring system that's used in our university network.
A
So how well does OpenNF perform, and does it actually give us the benefits we wanted? We're going to take a situation here where we have a trace of traffic from our campus network that we're replaying at a rate of 10,000 packets per second, and we're going to start with one instance of the Bro intrusion detection system. 180 seconds into the experiment, we say: move all HTTP flows to be processed by a new instance. 180 seconds later, we're going to move any HTTP flows active at that time back to the original instance.
A
Actually doing the transfer of state that we need takes 260 milliseconds, so that's quick; it doesn't take very long. We also looked at: is this accurate? Have we maintained the accuracy of the network function? We compared what happened if we let all of the traffic be analyzed by one IDS, without these moving-back-and-forth operations, versus the output of the IDS when we did the moves.
A
Lastly, there's the issue of cost: how quickly were we able to scale in? We were able to scale in in as long as it took us to move the state back, which again was about 260 milliseconds. If we had instead waited for flows to die out, the flows in this particular trace lasted more than 25 minutes, and so we would have needed to unnecessarily continue to run the second instance of the IDS until those flows had finished; that would have been a lot of extra cost that we would have been paying.
A
So I said this move takes 260 milliseconds. How does what we're doing at the network functions contribute to that? We can look at how long these get and put operations take on our network functions, and we did this for three of the network functions that we modified. It turns out that the cost to serialize and deserialize state is most of the time that we spend in these network functions, so potentially there are some definite improvement opportunities there.
A
Okay, so we have these low-level operations, but how about the high-level operations, and how do the guarantees impact the time that it takes us to do these move operations? Here we're going to assume that we're running the PRADS asset detection system, again using the same trace of traffic at a slightly lower rate, 5,000 packets per second, and we're going to move the state for 500 flows that are active at a given point in time.
A
If we look at how long it takes for this move operation to complete, we can look first of all at what happens if we don't provide any guarantees. Without any guarantees, we're talking about 190 milliseconds to do this operation. We can do some parallelization of the gets and the puts that we're issuing in order to speed things up a little bit, so now we can cut that down, almost in half, not quite half, to about 130 milliseconds. Great.
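The parallelization mentioned here, issuing the per-flow gets and puts concurrently rather than one at a time, can be sketched with a thread pool; the transfer function below is a stand-in, not the actual implementation:

```python
# Sketch of parallelizing per-flow state transfers with a thread pool,
# as mentioned above. transfer_flow stands in for a real get + put.
from concurrent.futures import ThreadPoolExecutor

def transfer_flow(flow_id):
    # Placeholder for: get state for flow_id from the source instance,
    # then put it at the destination instance.
    return flow_id, "moved"

def move_serial(flows):
    return dict(transfer_flow(f) for f in flows)

def move_parallel(flows, workers=8):
    # Issue the gets/puts concurrently; this helps when each transfer
    # is dominated by (de)serialization and round-trip latency.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(transfer_flow, flows))

flows = [f"flow-{i}" for i in range(500)]
result = move_parallel(flows)
```

Overlapping the per-flow round trips is what buys the reported speed-up; the ordering of results doesn't matter here because each flow's state is independent.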
A
The problem here is that we're losing packets as a result of this. Without any guarantees on loss-freeness or order preservation, even in the best case we're losing 462 packets. So we add in our loss-freeness guarantee; now our move operation takes longer, about twice as long, but we're not going to lose any packets.
A
The overall takeaway here is that these operations are reasonably efficient, but the guarantees that we want to offer in some cases do come at a cost, and so it's important for control applications to have the flexibility to decide whether or not they need these guarantees. So where are we going from here? What are the next steps for OpenNF? Well, the first thing is that there's a lot of buffering happening in the loss-freeness case.
A
There's even more buffering happening in the order-preserving case, and so the question is: how can we reduce the amount of buffering that's happening, in an effort to reduce the number of packets that incur extra overhead, and to reduce the memory usage of our system? One thing we can do is, rather than pausing traffic and immediately saying, before this state transfer starts, I want you to start raising events, we can allow the network function to continue to process packets, and then any packets that are processed
A
we essentially reprocess at our second instance to bring its state up to speed. This is similar to what virtual machines do when they're doing migration, where they'll take a snapshot and then replay updates to memory later on, to bring that snapshot up to speed before finishing the final migration. The second thing we can do is try to improve the scalability of this system.
A
Right now, all these packets and all the state are going through the controller, which means there's a limit to how many operations we can handle simultaneously at the controller. But it turns out the controller doesn't have to be involved: we can actually use a peer-to-peer mechanism to transfer state directly between instances of a network function and still get all the same safety guarantees that we want. Lastly, I said we need to modify the network functions, and obviously there are a lot of network functions out there, so how do we make this task easier to do?
A
So, in conclusion, I hope I've convinced you that we need something more than just NFV and SDN in order to be able to realize rich scenarios where we want to dynamically reallocate packet processing. In particular, we need the ability to quickly move, copy, or share network function state, and to do it in a way that's also safe, and we've achieved this with OpenNF. If you want to learn more, or if you want to try out the code for OpenNF, I encourage you to visit our website, opennf.cs.wisc.edu. With that, I'll take questions.
C
Kevin Fall, Carnegie Mellon. So one question is: the things you talked about look like they're all cases of per-flow state to move. If you're doing something, as I think you mentioned at the beginning, there's state that could be not per-flow, and it could be pretty large. So, for example, "is this a chunk of malware that I've seen before?" as opposed to "does this packet match?"
A
So I guess I don't have an exact graph; the best I can put up here is this, which shows how much state you're talking about. In the case of iptables, I can give you an idea that the state for a single flow is less than a kilobyte; in the case of Bro, we're talking about a hundred or two hundred kilobytes of state per flow.
A
So it is reasonably small, that's true. One thing you can do is start to proactively copy some of the state and replay events; that's future work that would enable that. The other thing I want to touch on that you mentioned is this idea that everything I was assuming here was per-flow.
A
A good example of multi-flow state would be objects in a cache, and you may say: well, there are cache sharing protocols out there, or we can just go ahead and not worry about it, because it'll just get re-cached, and that may be a trade-off you're willing to make. In some cases you may say it's not critical that I copy this state, so I'm not going to bother; but in some cases you're actively serving a connection, and you're saying: here's an object I'm serving to this client.
C
Or take the cache hit later, I suppose. I mean, it depends on the semantics you want; it's very dependent. So I guess there's another kind of question related to that, which is probably bigger than just your project: if I have a cascade of three or four of these functions, and one of them modifies the packets in some way, such that reclassification by the prior upstream thing needs to be done, but now you've migrated one to some other place?
A
So we've thought a little bit about the chaining scenario, where you have many of these network functions that you're passing through. We think that in many cases, if you have a chain, you can migrate for one middlebox in the chain at a time, and you end up doing some temporary redirection. In that case, you can certainly do better scheduling if you look at the entire chain at a time.
D
If you have a subscriber which was registered in the same network element, you can't just move him, because he must be aware that he now might register to a different network element. So in some cases only the application itself can move this state, and it can inform other elements that the subscriber's state has moved from this unit to that unit. So for some applications...
A
True. I agree that there's certainly some information you need to know about the network functions in order to know how you're going to go about writing these applications, and that's something that we haven't yet done a good job of capturing. We're hoping that, ideally, some of our program analysis could give you a simplified model of how a network function works, or potentially give you recommendations on which operations your control application should perform, and if you have it do this, you'll get this level of output equivalence.
A
D
A
That's really up to the control applications, how they want to do it. So your control application in the scaling scenario could be monitoring CPU, and it says: I'm going to monitor CPU, and then I'm going to do some sort of measurement of what my elephant flows are, to figure out exactly which flows I want to move from one box to another. So that's completely flexible, and you could implement whatever you wanted there.
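A toy sketch of the kind of scaling policy just described: watch CPU, and when it crosses a threshold, pick the heaviest ("elephant") flows to move off the overloaded box. The threshold, flow names, and function signature are all made up for illustration:

```python
# Hypothetical control-application policy: when an instance's CPU load
# exceeds a threshold, pick its heaviest ("elephant") flows to migrate.

CPU_THRESHOLD = 0.8          # invented utilization threshold (0..1)

def flows_to_move(cpu_load, flow_bytes, max_flows=2):
    """Return the flow IDs to migrate off an overloaded instance.

    cpu_load:   measured CPU utilization of the instance (0..1)
    flow_bytes: dict mapping flow ID -> bytes observed for that flow
    """
    if cpu_load <= CPU_THRESHOLD:
        return []                        # instance is fine; move nothing
    # Rank flows by observed volume, heaviest first, and take the top few.
    ranked = sorted(flow_bytes, key=flow_bytes.get, reverse=True)
    return ranked[:max_flows]

moves = flows_to_move(0.93, {"f1": 10_000, "f2": 900_000, "f3": 50_000})
# moves == ["f2", "f3"]: the two heaviest flows get migrated
```

As the speaker notes, this policy is entirely up to the control application; any other trigger or flow-selection rule could be plugged in instead.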
E
So, I assume that when you move something from one place to another place, it'll also cost you something, like bandwidth or something like that. So initially you want to move because you want to meet the SLA, but when you actually move, you are spending some bandwidth, which, you know, could make the problem even worse. So from the presentation it's not quite clear to me whether this moving thing can really solve the SLA problem.
A
There's some interesting questions there, and that's one of the reasons we also want to look at how we can reduce the amount of state that we're transferring. So some of our program analysis is trying to understand: rather than exporting all of the state that the network function is maintaining, can we figure out what state was updated since the last time, say, we created a snapshot in a failover situation? Or can we figure out:
A
Maybe some state affects the packets that are output by our network function, and other state affects the log. And maybe we say, you know, in something like a caching proxy, we're not really concerned about the accuracy of the log, so we're not going to bother to move that state. And so you may be able to limit what state you move, in exchange for a relaxed notion of the behavior of your network function, that is, of how much it matches what you would have gotten if you didn't move at all.
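A minimal sketch of the first state-reduction idea mentioned, tracking which entries were written since the last snapshot and exporting only those. The class and method names are invented, not from the actual implementation:

```python
# Hypothetical sketch: export only the state updated since the last
# snapshot, instead of the network function's entire state table.

class NFState:
    def __init__(self):
        self.state = {}
        self.dirty = set()       # keys written since the last snapshot

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def snapshot_delta(self):
        """Return only the entries updated since the previous snapshot."""
        delta = {k: self.state[k] for k in self.dirty}
        self.dirty.clear()
        return delta

nf = NFState()
nf.put("flow1", "ESTABLISHED")
nf.put("flow2", "SYN_SENT")
first = nf.snapshot_delta()      # both entries are new at this point
nf.put("flow2", "ESTABLISHED")
second = nf.snapshot_delta()     # only flow2 changed since the snapshot
```

The second relaxation the speaker describes (skipping state that only feeds the log) would shrink the delta further, at the cost of the relaxed equivalence he mentions.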
E
Diego Lopez, Telefonica, again.
First of all, you are continuously mentioning middleboxes, and well, we are working with network functions that are related purely to the data plane, I mean routing functions and forwarding functions in general. How do you see this kind of framework applied in that environment?
A
So it's an excellent question. I haven't really thought about it in terms of control plane devices; I've only really thought about it in terms of data plane devices. I think there's probably a different problem there, and potentially a simpler solution. When you start to talk about things in the control plane, the thing that comes most to mind is work that's being done in the distributed SDN controller case, where your SDN controller is your control plane, and so there you're concerned about moving state.
E
I see a value in this for the data plane functions, but on the other hand there's precisely the penalties that you will incur in the case of applying a formal framework like this. Something that I was thinking is whether, in the case, for example, where we have a project on virtualizing the home routers, we could apply this in some environments. I was just curious.
A
Yes, so I think you could. I think one challenge that you certainly face is: where is this going to move to? Which is sort of a standard NFV challenge. You know, to migrate an NF across the entire continental United States, versus migrating it between locations within a metro area, is going to be a really different situation, and one is probably feasible, the other is...
E
There's also the kind of state to preserve as well, but anyway. Secondly, if I've understood well (and this is just to share my vibe, to see whether you share it as well), I see a similarity between this and something from some time ago: I remember in object-oriented programming there were object persistence frameworks. I think there is a clear connection, right? So this is very much connected with that.
A
We haven't necessarily looked specifically at that body of research, although we have actually started to look at it as we're doing some of this program analysis, because there are all sorts of things to figure out: what objects exist beyond the processing of a single packet, and what objects are only used during the processing of that one packet at this middlebox. And so I think there is definitely a broader body of work there that's worth considering, because...
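As an illustration of the distinction just drawn, here is a toy network function in which one attribute outlives individual packets while another is per-packet scratch space, plus a hypothetical helper that names which attributes would need to be exported on migration. Everything here is invented for illustration:

```python
# Hypothetical illustration: some NF state outlives a single packet
# (and must be migrated or persisted), some is per-packet scratch space.

class ToyMonitor:
    PERSISTENT = {"flow_counts"}     # attributes that outlive a packet

    def __init__(self):
        self.flow_counts = {}        # per-flow packet counters: persistent

    def process(self, packet):
        flow = (packet["src"], packet["dst"])   # scratch: per-packet only
        self.flow_counts[flow] = self.flow_counts.get(flow, 0) + 1

    def exportable_state(self):
        """Only the attributes that outlive a packet need to move."""
        return {name: getattr(self, name) for name in self.PERSISTENT}

mon = ToyMonitor()
mon.process({"src": "10.0.0.1", "dst": "10.0.0.2"})
mon.process({"src": "10.0.0.1", "dst": "10.0.0.2"})
state = mon.exportable_state()
# state == {"flow_counts": {("10.0.0.1", "10.0.0.2"): 2}}
```

In the speaker's setting the PERSISTENT set would come from program analysis rather than a hand-written annotation, which is exactly the connection to object persistence frameworks being discussed.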
E
There are some researchers I have been talking with that are starting to think precisely about a network programming paradigm that is object-oriented, and precisely one of the properties they were thinking about was persistence and this kind of replicability. This is something I was taking note of, because it probably will connect. And finally, it's about what you were mentioning here, a control application, the control plane. This is something that, when you start thinking about it: well, you take the SDN architecture, you have the NFV architecture, and well...
E
We are having some problems in matching them, and then there is a third dimension, because this is an additional dimension, right? So well, we have three axes now; how do we put them in space? I mean, I want an application that is running according to SDN principles, that can apply NFV orchestration, and now we want to replicate it. How do you see the whole thing matching?
A
It's unclear how tightly you can integrate those, because they're each solving a slightly different problem, and so I think there's just going to need to be some interfaces there. For the same reason that, when you're talking about NFV orchestration, you may have an interface into your system that's going to worry about launching the VMs themselves and figuring out where they're going to go, and then a system that's going to worry about, okay...
D
B
A
There's a couple of different things you can do. So, in theory, it's reasonably predictable: you know, on average, how big the state is, and we can predict how long it's going to take to transfer it. But you're right, there's this trade-off: the more state you're transferring, the longer it takes, and the more buffering you need to do.
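The trade-off just stated can be put in back-of-the-envelope form: transfer time grows with state size, and the buffering needed grows with transfer time. The link speed, state size, and packet rate below are invented numbers, not measurements from the talk:

```python
# Rough arithmetic for the stated trade-off: more state -> longer
# transfer -> more packets to buffer during the move. Numbers invented.

def transfer_time_s(state_bytes, link_bps):
    """Seconds to move the exported state over the control link."""
    return state_bytes * 8 / link_bps

def buffered_packets(transfer_s, pkt_rate_pps):
    """Packets that arrive, and so must be buffered, during the transfer."""
    return transfer_s * pkt_rate_pps

t = transfer_time_s(50 * 10**6, 1 * 10**9)   # 50 MB over a 1 Gbps link
n = buffered_packets(t, 100_000)             # at 100k packets/second
# t == 0.4 seconds, n == 40000.0 packets to buffer
```

Halving the exported state (for example, by moving only the snapshot delta discussed earlier) halves both numbers, which is why reducing transferred state matters for the SLA question.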