From YouTube: CNCF TAG Network / Service Mesh WG (Oct 28th, 2021)
Description
Service Mesh WG Topics:
- GetNighthawk: Adaptive Load Control Deep-dive with Jakub Sobon, Google
- SMP: Finalize benchmarking research and publishing plan
- SMP & Meshery: Specification review in contrast w/Prometheus Node Exporter metrics
A: Some of the particulars I'm about to mention are probably neither here nor there, but this meeting represents a few different initiatives, many of which fall under the Service Mesh Working Group, and we use this time to go over those initiatives, which include the things we're going to talk about today. I won't explain it well, but there's a lack of precedent for some of what we're about to do across other CNCF meetings. And I think those other CNCF meetings are lacking this, which is: it's nice to meet you all. We're not going to have 400 people, we're not going to have 100 people on this call, so it's nice to do a quick round of introductions to make sure people are aware of each other, what they're here for, and what they're interested in. And, to that extent, also what your favorite color is. We'll take note.
B: Hello, everyone. I'm Xing from the Intel cloud native team. I'm so happy to join this group, and my work is mainly on service mesh acceleration.
A: Xing, that was perfect, although if you don't state your favorite color, I might just be left to assume that it's, you know, pink or something, purple maybe. Okay, good! I'm glad my sarcasm translates perfectly. Nice to talk to you; it's nice to bump into you on GitHub as well. I'm glad that you're here. Also, thanks for asking for clarification about the meetings and agendas and which one is where; there's been a bit of jostling around, and I think we're in a better place. So let's see how today goes.
D: Yeah, so I'm Rajkoro, a sophomore at the University of London, and I have five years of trading experience. Apart from that, my software engineering skills include full-stack web and mobile development, IoT and electronics engineering, and I'm proficient at system design, computer networks, Linux, and algorithms through my projects. Currently I'm interning at Digital Product School, which is part of UnternehmerTUM, which is part of the Technical University of Munich. It's the biggest incubation center in all of Europe, and I'm participating there as a software engineering intern. These days I'm also exploring open source, and I got to know about CNCF meetings from the last event, KubeCon.
A: Beautiful, nice to meet you. Boy, I realize I'm kind of dragging this out; we have some important, interesting things to go over. Very briefly: most of the rest of you folks have the misfortune of already being familiar with me. Mr. Roberts — Joel, do you want to do a quick intro?
C: Hey, oh sure. Joel Roberts. For the past five to ten years I've been focused on network service providers, large enterprise data centers, and wide area networks; previously a sysadmin and a former C/C++ developer. So I'm looking to get back into Linux compute — abstracting out this stack, if you will, into the CNCF world — predominantly learning and seeing where I can contribute here, getting up to speed on the specific technologies. And the favorite color, Lee, would be blue, all right? So purple is—
A: Sure, blue — ding ding ding, the correct answer right now. Okay, good. Joel, nice — you know, C++ is actually the keyword of the meeting, so there's a bonus round for you somewhere in there. That's actually the first topic of our meeting today. So before I go harassing everyone else, I'll kind of introduce Jakub — and actually, you know what, I won't even introduce Nighthawk; I'll let Jakub do that. Jakub's here to talk about Nighthawk and its capabilities around adaptive load control. I couldn't be more excited. So, Jakub, do you want to educate us for a while?
E
Sounds
good,
would
you
I
guess
I'm
gonna
share
my
screen.
I
don't
know
if
you
need
to
stop
sharing
yours
or
I
actually
cannot
share
my
screen.
If
you
could
send
me
that
permission.
E: Meanwhile, hello, everyone. My name is Jakub. To introduce myself: to relate to me, you would have to like dogs, forests, or programming. I'm not picky on which one, and any combination is a bonus.
E: I'll leave my favorite color as a potential guess for the audience after we finish this presentation. Can you see the slides?
E: We'll be talking about the adaptive load controller in Nighthawk today. I work at Google on a team that focuses on performance testing of load balancing products, and in that capacity we use Nighthawk quite a lot. For those who don't know what Nighthawk is: it is essentially an L7 traffic generator, so it is capable of crafting requests at a prescribed pace, with prescribed content, toward load balancers or web servers, and then provides very detailed statistics about what happened. How many of those requests were successful?
E
What
were
the
latency
breakdowns
and
so
on
adaptive
load?
Is
a
library
built
on
top
or
abstraction
build
on
top
of
nighthawk
that
we
developed
to
solve
some
of
the
problems
we
encountered
while
trying
to
build
what
I
call
real
life
real
performance
tests
using
nighthawk
before
I
jump
into
this.
What
I
would
like
to
do
today
is
I
I
will
talk
about
motivation,
so
I'll
explain
why
we
developed
a
thing
with
such
a
long
name.
E
Then
we
will
talk
about
the
possible
testing
modes
with
nighthawk
and
some
of
those
you
may
already
know
if
you
use
nighthawk
in
the
past
or
if
you
were
here
when
autocad
gave
the
precursor
talk,
which
will
be
essentially
comparison
of
the
open
mode
and
closed
loop,
open
loop
mode
and
closed
loop
mode
when
using
nighthawk,
then
I
will
deep
dive
into
the
architecture
of
the
adaptive
load
controller
itself
and
we
will
look
at
the
available
and
potential
plugins
that
can
be
developed
for
it
and
also
this
configuration
now
that's
my
goal
for
today.
E
Please
feel
free
to
interrupt
me
with
questions
at
any
time.
Discussion
would
be
preferred.
I
do
think
we
have
enough
time
enough
time
to
do
that,
so
feel
free
to
stop
me
or
just
dive
into
chat.
I
I
will
ask
lee
to
let
me
know
if
there
is
a
question
very
good.
So
assuming
you
can
see
the
slides,
but
let
me
know
otherwise
we
can
start
talking
about
the
motivation.
E
So,
as
I
mentioned,
we
are
working
on
load,
testing
or
performance
testing
of
load
balancing
products,
but
this
could
be
any
product
and
our
main
goal
is
to
determine
the
maximum
load
such
system
can
sustain-
and
these
are
two
terms
that
I
feel
like
defining
so
maximum
load
could
be
many
things
depending
on
on
your
product.
Most
of
more
often
than
not.
This
is
the
maximum
rps
or
request
per
second.
The
system
can
handle
with
particular
configuration.
E
It
could,
of
course,
be
some
other
variables
sustain
is,
of
course,
also
a
definition
based
on
on
the
product
and
sustain
could
be.
I
can
handle
this
rps
while
having
end-to-end
latency,
that's
below
certain
number
or
I
can
have.
I
can
self-certain,
I
serve
certain
rps
with
my
cpu
usage
being
below
80
now.
This
is
what
we
want
to
do,
but
we
don't
want
to
do
it
only
once.
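As a small illustration of what a "sustain" criterion boils down to (this is editorial sketch code, not from Nighthawk; the metric names and thresholds are invented for the example), it is simply a predicate over the measured metrics:

```python
# Illustrative only: a "sustain" criterion as a predicate over measured
# metrics. Metric names and thresholds are invented for this sketch.
def sustains(metrics, max_p99_latency_ms=50.0, max_cpu_fraction=0.80):
    """Return True if the system handled the load within our bounds."""
    return (metrics["p99_latency_ms"] <= max_p99_latency_ms
            and metrics["cpu_fraction"] <= max_cpu_fraction)

sample = {"p99_latency_ms": 42.0, "cpu_fraction": 0.71}
print(sustains(sample))  # True: both metrics are within bounds
```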
E: We want to track this maximum load over time, because as software systems develop, the maximum performance they have changes over time — more often than not downwards, because we add new features — and then these load tests can be used as indicators of when it is the right time to run some sort of fix to improve performance.
E
So
this
is
why
we
use
this
sort
of
test
to
capture
large
performance,
regressions
or
even
slow,
gradual
performance,
regressions
now
last
goal,
but
not
least,
is
that
we
want
to
have
very
high
fidelity
with
production
in
the
testing
environment.
We
want
the
testing
environment
to
land
load
on
the
system
under
test
roughly
the
same
way
as
production
delivers
load.
E
We
will
be
getting
back
to
this
throughout
these
slides,
but
one
thing
that
I
want
to
stress
at
the
beginning
is
that
it's
not
going
to
be
a
stress
test,
because
production
really
stresses
our
systems.
We
usually
deploy
horizontally
many
tasks
or
many
jobs
and
spread
the
load
across
them.
So,
in
other
words,
we
want
to
measure
something
and
we
want
to
measure
the
maximum
load
without
stressing
the
components,
because
that's
not
what
production
does.
E
So
those
are
our
goals
and
in
the
next
few
slides
I
will
talk
about
the
basic
testing
modes
that
we
have
within
nighthawk
and
we
will
start
with
the
most
basic
one.
Now
we
will
use
this
simple
diagram
quite
a
lot.
This
is
what
I
call
the
minimum
components
we
need
for
a
realistic
load
tests
or
performance
test.
We
have
the
load
generator,
which
is
nighthawk
or
assuming
it's
nighthawk.
E
For
the
purposes
of
this
presentation,
we
have
the
system
under
test,
which
is
some
load
balancer
or
some
http
server,
or
could
be
some
key
value
store
anything
that
we
want
to
test
now.
Nighthawk
applies
load
to
the
system
under
test
by
sending
requests
and
eventually
after
the
system
after
the
test
concludes,
nighthawk
will
store
results
into
some
database
for
a
long-term
visualization
review
or
anything
we
would
like.
E
Why
is
that
important?
What
we
found
is
that
in
the
real
world,
the
diagram
is
not
an
only
as
simple
as
this
picture.
At
the
very
least,
we
have
some
monitoring
system
in
place
that
monitors
the
system
under
test
and
pulls
additional
metrics
from
it.
The
statistics
from
nighthawk
will
give
you
an
overview
of
what
is
happening
in
the
test.
For
example,
you
will
get
end-to-end
latencies,
but
using
some
monitoring
system
you
can
get
additional
statistics
from
the
system
under
test,
for
example
latencies
per
individual
backend.
E
If
this
is
a
load,
balancer
or
latency
is
a
software
feature
that
you
have
added
and
you
are
evaluating
so
in
our
load
tests,
it's
very
important
to
store
those
metrics
as
well,
and
we
have
learned
that
if
we
put
the
system
under
test
under
stress
these
statistics
get
we
get
holes
in
them.
The
system
either
doesn't
respond
when
it's
under
test
or
responds
slower.
It
doesn't
update
the
statistics,
so
essentially
we
get
incomplete
as
results.
E: The other thing to point out is that we also need to run the system under test in some environment, and most environments today have some sort of health checker to verify the health of the system. If the system is under high stress, it may lag in its responses to the health checker, which may then assume the system is unhealthy — which generally is not good for the test results and for what happens next. Of course, it could be argued that all of these problems can be solved; we could disable the health checker.
E
So
that's
open
loop
mode
and
before
I
proceed
any
questions
on
this
everything
else
we
will
discuss,
will
build
on
this
diagram.
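To make the open-loop/closed-loop distinction concrete (a minimal editorial sketch, not Nighthawk's implementation — the real scheduler and timers are elided, and `send` is a stand-in for issuing one request and returning its latency in seconds): an open-loop generator sends on a fixed schedule regardless of responses, while a closed-loop generator only issues the next request after the previous one completes.

```python
# Minimal sketch of the two pacing models; not Nighthawk code.
def open_loop(send, rps, duration_s):
    """Issue requests on a fixed schedule: the number sent depends only on
    the configured rate, never on how slowly the system replies."""
    return [send() for _ in range(int(rps * duration_s))]

def closed_loop(send, duration_s):
    """Issue the next request only after the previous one completes, so a
    slow system automatically throttles the generator down."""
    elapsed, latencies = 0.0, []
    while elapsed < duration_s:
        latency = send()
        latencies.append(latency)
        elapsed += latency       # the request rate adapts to response time
    return latencies

# A system answering in 250 ms: open loop still fires 100 requests in one
# second at 100 RPS, while closed loop only manages 4.
print(len(open_loop(lambda: 0.25, rps=100, duration_s=1.0)))   # 100
print(len(closed_loop(lambda: 0.25, duration_s=1.0)))          # 4
```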
E: We'd pick up depending on which layer we would start from. So, if we're happy with replaying, that answers it.
E: Now, what this means is that Nighthawk is aware of how the system under test is doing. In a sense, Nighthawk tracks how fast, or whether, the system under test responds to the requests, and if the system under test stops responding because it's very close to its maximum load, Nighthawk pulls back and allows the system to recover in practice.
E
What
we
found
is
that
this
shares
most
of
the
problems
with
the
open
loop
mode,
because
in
order
to
figure
out
that
the
system
is
under
stress,
we
need
to
bring
it
to
stress
first
and
depending
on
how
well
or
how
complicated
the
system
other
test.
Is
it
may
or
may
not
recover
from
that
stress
ever
of
course,
that's
not
the
ideal
setup,
but
it's
the
realistic
world
setup
that
we
found
even
when
nighthawk
recovers.
E
It
will
still
keep
the
system
under
uncomfortably
close
to
stress
levels
as
as
it's
trying
to
stay
within
the
responding,
but
not
just
quite
not
not
just
quite
fast
enough,
so,
in
other
words,
with
closed
loop
mode
from
practical
perspective.
We
also
have
problems
with
large
variants
in
test
results
in
incomplete
test
results,
because
the
system
was
not
responding
to
monitoring
well
and
and
therefore
we
started
looking
for
a
for
another
solution.
E: And this is where the adaptive mode comes in. Adaptive mode is an abstraction built on top of the open loop mode. What adaptive mode does at a high level — we're going to deep-dive into this with diagrams; I just want to give you the overview up front — is break the test down into two stages. We have the adaptive stage, or we can also call it the search stage, and we have the testing stage. Now, in the search stage, we apply some search algorithm to iteratively try various testing configurations. So we iteratively execute Nighthawk with a different testing setup — you can assume different RPS, different amounts of requests per second — and in each of these iterations we use open loop: we don't listen to the feedback from the system under test, we just run under the settings we set, we observe, and based on how the system behaves we decide: are we in stress, are we too low, or are we optimal — until we decide we're optimal.
E: So what does that look like? You're already familiar with the right side of the diagram: we have the load generator, Nighthawk, and we have the system under test. We now have a new component here, which is the adaptive load controller, and, as mentioned, there are two stages to this test. So in the first stage, the adaptive load controller iteratively executes Nighthawk.
E
You
can
think
of
them
as
short
benchmarks
with
various
load
specifications.
After
each
iteration
we
will
collect
the
iteration
results
from
nighthawk
and
any
monitoring
metrics
from
the
system
under
test.
We
want
and
bring
them
back
into
the
controller
so
that
the
controller
can
decide
what
the
next
step
is.
The
next
step
is
the
next
step
could
be
another
iteration
or
the
next
step
could
be
a
decision
that
we
have
converged
and
found
the
optimal
test
settings,
in
which
case
we
move
to
the
testing
stage.
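The two stages described here can be sketched roughly as follows (an illustrative editorial outline, not the actual adaptive load controller code; `run_nighthawk`, `evaluate`, and the toy controller are invented stand-ins for a Nighthawk execution, the metric evaluation, and a step controller):

```python
# Rough outline of the search stage and testing stage; illustrative only.
def adaptive_session(run_nighthawk, evaluate, controller, max_iterations=20):
    # Search stage: short open-loop benchmarks at settings proposed by
    # the step controller, until it reports convergence.
    for _ in range(max_iterations):
        metrics = run_nighthawk(controller.next_spec())
        controller.observe(evaluate(metrics))
        if controller.converged():
            break
    else:
        raise RuntimeError("did not converge within the iteration budget")
    # Testing stage: one final run at the chosen specification.
    return run_nighthawk(controller.next_spec())

# Toy step controller: double the RPS until the evaluator reports stress,
# then settle on the last value that was fine.
class DoublingController:
    def __init__(self, initial):
        self.rps, self.done = initial, False
    def next_spec(self):
        return self.rps
    def observe(self, score):
        if score > 0:
            self.rps *= 2          # still healthy: push harder
        else:
            self.rps //= 2         # stressed: fall back and stop
            self.done = True
    def converged(self):
        return self.done

result = adaptive_session(
    run_nighthawk=lambda rps: {"rps": rps},
    evaluate=lambda m: 1 if m["rps"] < 800 else -1,
    controller=DoublingController(initial=100),
)
print(result)  # {'rps': 400}
```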
E: Once we move to the testing stage, we run one last iteration of Nighthawk with the chosen specification, and at the end of the test we collect results into the database. That said, it is also possible for the adaptive controller to decide that it cannot converge, and fail — which could happen for some impossible settings: for example, if already the initial value that we start the search with places the system under stress, that's a reason to say we will never converge.
A
Here's
a
quick
one-
I
don't
this
is
maybe
in
advance
of
or
orthogonal
to
what
you
might
have
spoken
to
jacob,
but
the
the
adaptive
or
the
definition
of
that
load,
controller
and
kind
of
the
the
optimization
routine
that
it's
you
know
seeking
out
that
it's
executing
that
section
of
code
or
that
is
that
dynamically
insertable,
or
does
that
that,
like
a
specific
type
of
load
control,
does
that
just
need
to
you
need
to
create
a
a
new
build
of
nighthawk
that
that
would
have
that
right.
E
So
that
that
question
is,
I
think
twofold.
Thank
you
lee
one
of
them
is,
is
that
component
pluggable
and
the
answer
is
yes.
The
system
is
built
for
us
to
write
our
own
search
algorithms
to
decide
what
the
iteration
steps
are.
Is
the
step
dynamically
loadable
I'm
going
to
differ
at
the
end
of
the
discussion.
I
have
a
slider
where
I'm
I
wanted
to
talk
about
that.
So,
if
you
don't
mind,
I'm
going
to
postpone
that
question.
E
Thank
you
great.
So
what
I
wanted
to
give
you
is
a
visual
aid
on
how
this
looks
like.
So
this
is
a
fairly
random
graph.
That
shows
a
progression
of
an
adaptive
load
test,
and
you
can
see
the
area
highlighted
in
left,
which
is
execution
of
the
search
algorithm.
You
can
see
the
individual
nighthawk
executions
as
those
bars
that
were
running
at
various
values
of
rps
and
once
the
adaptive
load
controller
converged
on
settings
that
decided
are
optimal.
We
enter
the
green
area,
which
is
the
actual
test,
the
testing
stage.
E
So
what
did
we
achieve?
We
have
learned
that
with
the
adaptive
load
controller,
one
of
the
main
advantages
that
we
have
is
that
we
now
have
automated
adjustments
of
rps.
I'm
gonna
expand
on
that
a
little
bit,
but
imagine
you
are
running
a
team
that
is
maintaining
many
load
tests
over
many
products
within
each
load
test.
E
You
have
many
test
cases
with
various
protocols,
various
requests
for
each
one
of
these
there
will
be
an
rps
value
that
is
the
current
maximum
that
the
system
can
handle,
and
you
want
to
maintain
that
and
make
sure
that
you
don't
regress
below
it
without
the
adaptive
mode.
We
have
to
manually
maintain
these
rps
settings
for
each
of
these
tests
and
that's
very
easy
value
to
rot
over
time
with
the
adaptive
mode.
E
We
don't
have
to
include
the
rps
setting
in
the
test
at
all,
because
on
each
execution
the
adaptive
mode
finds
the
current
maximum
value
and
tracks
it
over
time.
In
a
in
a
database,
we
did
address
the
problems
that
I
mentioned.
We
encountered
with
open
loop,
so
we
found
that
the
tests
are
more
reproducible
and
more
stable
than
we
executed
multiple
times
we
get
roughly
the
same
results.
E
If
we
run
tests
at
fixed
rps
value
and
the
software
miraculously
became
more
efficient,
so
the
maximum
value
that
it
can
handle
and
went
higher.
We
will
never
find
it
out
because
we
are
running
at
a
fixed
rps
if
there
is
a
following
regression
that
happens
to
regress
still
above
our
rps
setting,
we
will
never
find
out
that
we
miraculously
gain
performance
and
now
regressed,
which
is
not
something
developers
like
if
they
miraculously
gain
performance.
They
would
like
to
keep
it
not
lose
it
so
with
the
adaptive
mode.
E
Well,
the
obvious
one
is
that
tests
take
much
longer
to
execute
without
the
adaptive
mode.
You
only
have
the
green
area
where
you
run
nighthawk
at
some
rps
and
collect
the
results
with
the
adaptive
mode
for
every
test
case.
You
are
using
much
more
time
much
more
over
computing
time.
Many
more
resources
configuration
is
more
complex
because
this
beyond
just
configuring
nighthawk,
you
also
have
to
configure
the
adaptive
adaptive
controller,
the
search
algorithms,
all
the
metric
plugins
their
thresholds.
We
will
look
at
that
and
another
thing
we
will
look
at
the
following
slides.
E
We
may
need
to
develop
custom
plugins
to
be
able
to
make
decisions
on
all
the
metrics.
We
are
interested
in
so
potentially
more
development.
E
So,
thank
you
for
listening.
Thus
far.
The
next
section
of
these
slides
will
talk
about
architecture,
so
we
will
deep
dive
a
little
bit
in
how
the
software
component
itself
looks
like
before
we
go
there.
Any
questions
on
what
we
covered
thus
far.
A
I
have
one,
but
I
wanted
to
make
sure
I
wasn't
hogging
the
question
jacob.
You
might
have
said
this
and
I
might
have
missed
it
so
briefly,
you
you
ended
up
talking
about
open
mode,
a
fair
bit
and
how
use
of
open
loop
style
load
generation
has
an
effect
on.
You
know:
adaptive
load,
control
and
the
progression
by
which
you
know
the
sequence
of
tests
occurs.
A
Did
I
did
I
miss
it?
Did
you
contrast
that
against
closed
loop
mode?
You
know
under
the
same
context,.
E
Yes,
there
was
a
fairly
short
slide,
so
it
was.
It
was
easy
to
miss,
but
essentially
the
conclusion
there
was
that,
while
closed
loop
mode
has
a
feedback
from
the
system
under
test,
it
has
to
put
the
system
under
stress
first
to
receive
feedback
that
the
system
is
not
doing
well,
and
what
we
found
is
that
when
we
put
the
system
under
stress
for
a
while,
some
of
them
never
quite
recover.
A
It
makes
sense,
do
you?
Has
it
ever
been
the
process
that
you
would
do
openly
or
open
loop
mode
for
a
bit
try
to
hone
in
on
the
general
sense
of
where
that
stress
begins
and
then
switch
to
closed
loop
mode
as
a
so
follow
on.
E
The
adaptive
mode
does
something
similar.
In
essence,
it
runs
all
these
bars.
Here
I
don't
know
you
can
see
my
cursor
all
these
bars
here
are
essentially
executions
of
open
loop
mode,
just
to
see
how
the
system
is
doing
at
the
various
values
and
then
when
we
are
happy
that
we
found
one
that
does
not
put
the
system
under
stress
but
is
fairly
close
to
its
maximum
capability.
We
run
one
more,
but
also
in
open
loop
mode.
We
did
not
find
the
use
for
closed
loop
mode
in
this
setup.
E: The adaptive controller receives configuration in the form of a protocol buffer — we'll be looking at that one a little bit later — and validates its own configuration for sanity. As for the way users interact with the adaptive load controller: it is a command line tool at the moment, so you execute it as a command with various flags.
E: Maybe I should have mentioned this earlier: most of Nighthawk's code closely mirrors Envoy's code, and it uses Envoy's features. That is why you'll find similar software components being used. For anything marked in yellow here, we have some implementation — or implementations — of that component, but it's fairly easy to write another implementation that does slightly different things and fulfills the same API.
E
The
step
controller
is
the
main
component
of
the
adaptive
load
controller
and
it
is
essentially
the
search
algorithm.
So
this
is
what
decides?
How
will
the
next
iteration
or
the
next
step
look
like
based
on
data
that
it
receives
when
it
decides?
How
will
the
next
iteration
look
like
it
uses
a
variable
setter
to
affect
the
next
configuration
of
nighthawk?
E: The next interesting component is the metric evaluator, and this is the piece that receives metrics from the Nighthawk iteration results and from the monitoring system, and decides whether those metrics are above or below the bounds we set. In other words, going back to our motivation, this is the component that decides: with these test settings, is the system under test sustaining? The way it decides is with a scoring function, which is a pluggable component, and the basic example of a scoring function is a binary score — you're familiar with those.
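As a hedged sketch of what a binary score means here (the exact sign convention below is assumed for illustration, not taken from the Nighthawk source): it collapses a metric into "within threshold" or "not", with no notion of how close the value came to the bound.

```python
# Illustrative binary scoring function. The sign convention (+1 within
# threshold, -1 outside) is assumed for this example, not taken from the
# Nighthawk source.
def binary_score(value, lower_bound):
    """Collapse a metric into within-threshold (+1.0) or not (-1.0)."""
    return 1.0 if value >= lower_bound else -1.0

print(binary_score(0.995, lower_bound=0.99))  # 1.0
print(binary_score(0.970, lower_bound=0.99))  # -1.0
```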
E: The last pluggable components are the metric plugins themselves, and these act as shim layers between the adaptive load controller and the sources of metrics. We can write any metric plugin, so there can be any number of metric sources; what we use in practice are two. One of them is the Nighthawk iteration result, or Nighthawk benchmark result.
A: But the considerations and the potential signals that are included in that function are quite variable, right? It's up to the— yeah, okay.
E: Exactly — I think those are good words. The reason it is all exploded into multiple components — I mean, you could technically write this as one function — is that we wanted that pluggability, so that individuals can write their own metric sources and their own scoring functions. Maybe somebody will want more complex scoring; maybe a yea-or-nay is not a good enough signal; maybe we're interested in some additional metadata saying "nay, but we're getting closer."
E: Great, so that was the architecture. Now let's talk about plugins. I'll give you an overview of what plugins are currently available in the open source repository, and then we'll look at what the configuration looks like, also looking at one example.
E
So
the
available
plugin
implementations
for
metric
sources,
so
sources
of
information
about
how
the
system
under
test
is
doing.
All
the
plugins
that
are
implemented
in
the
open
source
repository
are
based
on
or
read
from
the
nighthawk
benchmark
result.
E
The
plugins
that
exist
will
be
able
to
make
assertions
on
the
rps
that
nighthawk
attempted
the
rps
that
nighthawk
actually
achieved.
There
could
be
a
difference
between
those
two
for
various
reasons,
including
network
congestion
or
something
running
out
of
resources.
E
We
can
make
assertions
on
all
the
latencies
that
are
within
nighthawks
result
on
the
send
rate
and
the
success
rate.
The
send
rate
is
that,
out
of
all
the
requests
that
nighthawk
was
meant
to
send,
based
on
the
rps
setting
and
the
time
that
this
took
how
many
of
those
actually
made
it
out
of
the
box
again,
this
could
be
lower
than
hundred
percent.
E
If
nighthawk
is
having
some
performance,
issues
on
on
its
machine
and
success
rate
is
out
of
all
those
requests
that
were
sent,
how
many
of
those
resulted
in
an
http
200
code,
so
in
other
words
the
system
under
tests
serve
them
and
we're
happy
with
the
result
when
it
comes
to
search
algorithm,
the
step
controller,
there
is
one
search
implemented
the
exponential
search
thus
far,
we
didn't
need
another,
but
it's
fairly
easy
to
implement
other
search.
Algorithms.
Here
you
probably
know
exponential
search
is
just
built
on
top
of
binary
search.
E
It
tends
to
be
a
little
bit
more
effective
on
searching
large
unbound
sorted
lists.
So
what
the
exponential
search
does
is
at
first,
it
makes
large
hops
through
the
space
and
finds
a
range
in
which
the
target
qps
is,
and
then
it
runs,
binary,
search
on
the
range
running,
binary,
search
on
the
large
space
would
take
much
longer
many
more
executions.
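Exponential search as described here can be sketched generically (an editorial illustration of the algorithm, not the step controller's actual code; `sustains(rps)` stands in for "run an iteration at this load and evaluate it"): double a load guess until the system stops sustaining, then binary-search the bracketed range.

```python
# Generic exponential search over an unbounded load range; illustrative.
def exponential_search(sustains, initial=100):
    lo, hi = 0, initial
    while sustains(hi):          # large hops: double until we overshoot
        lo, hi = hi, hi * 2
    while hi - lo > 1:           # then plain binary search on (lo, hi)
        mid = (lo + hi) // 2
        if sustains(mid):
            lo = mid
        else:
            hi = mid
    return lo                    # highest load that still sustained

# Toy system that sustains anything up to 750 RPS.
print(exponential_search(lambda rps: rps <= 750, initial=100))  # 750
```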
E: There is also a linear scoring function implemented, which is able to add a scaling constant to the answer; that scaling constant can affect the answer and then feed back into the search controller to help it make better decisions.
E
But
the
exponential
search
is
not
using
that,
so
that
that
is
mostly
there
to
as
an
example
when
it
comes
to
to
variable
setters,
we
have
one
variable
center
that
sets
the
request
per
seconds
a
second
in
the
nighthawk
output,
which
means
that
out
of
the
box,
we
are
able
to
search
for
various
various
rps
settings.
We
are
able
to
find
the
rps.
The
system
can
sustain.
E
Leave
we
are
approaching
30
minutes
mark.
How
are
we
on
time.
A: We're good — please keep going. I'll interrupt if needed; I'm just looking out for the other agenda items. Certainly the other agenda items are good ones too, but—
E: That is exactly correct. That is why the architecture includes these as pluggable components, and that is what we ended up doing quite extensively for monitoring systems. As I mentioned, for the metric sources we only have plugins that read from the Nighthawk output, but we often develop other plugins that read from the monitoring systems and add additional metrics we can make decisions on. For each one of these, it's fairly trivial to develop your own implementations.
E
Thank
you
for
the
question,
so
let's
look
at
a
configuration
and
I'm
going
to
switch
to
a
different
window
and
take
you
through
a
quick
walk
through
the
protocol
buffer
that
configures
the
adaptive
spec.
If
this
is
isn't
readable,
I
can't
really
see
it
please.
Let
me
know
I
can
I
can
enlarge
the
font,
so
the
configuration
of
the
adaptive
load
controller
comes
in
as
a
one
message
called
adaptive
load
session
spec
and
the
interesting
thing
that
we
configure
here
is
well.
E: At the top, we configure what metrics we monitor. I'll give you an example of this — this is the Envoy plugin system, and on the next slide I'll show an example of how that looks — but it essentially allows you to say what the monitored metrics are and what the thresholds for those metrics are: how do we decide whether we're above or below individual thresholds, or whether the system is sustaining?
E: The remaining portion of these are just various timers you can set: how long each iteration takes; the maximum, or total, deadline by which the system has to converge or bail out; the duration of the green area of the graph, the testing stage; and the duration of the gap between individual iterations — we call this cooldown — so that the system we're applying the load to has time to recover.
E: So, with that, this is the promised example of what the configuration looks like. On the left side of this slide you can see the metric configuration — this is all in the text format of a protocol buffer — and we're saying here that we'll be monitoring the send-rate metric (how many of the requests we were trying to send we actually managed to send), we'll use the binary score on it, and we'll never allow it to be lower than 99%.
E: So you can imagine multiple repetitions of these in your adaptive load controller, allowing you to set the various metrics you monitor and the limits for them. On the right side we have the step controller; we use the one that is there, the exponential search, and we tell it what the initial value is. This is where the configuration for the variable setter would be as well; it's not included, because the variable setter that sets the RPS value is the default. So this performs an exponential search across a set of RPS values, starting at 200.
E: So with that, that concludes the main portion — the main meat — of this talk. I have one more slide where I wanted to discuss some questions that were previously asked, including dynamic loading, but before we get to that: are there any questions on the main content or on the configuration?
F: Jakub — hi, everyone. I have a basic question on the architecture from the client-server perspective, and also the multi-threaded approach, because we should be able to run multiple instances of Nighthawk. In which section are you covering that — are you covering it today? I just wanted to understand that part.
E: Yes, that is one of the questions I'm planning to cover. So let's cover that, and then we can see if we have any other questions.
E: This is what I heard was asked previously: compatibility with horizontal scaling came up a couple of times, and that is a good question. When we go back to this architecture diagram, you can see that the adaptive load controller interacts with Nighthawk over gRPC.
E: Now, horizontal scaling — or at least the way we're planning to develop it — all happens on the right side of that gRPC server. So even when we horizontally scale Nighthawk, there will be one server component, or one command line component, that will manage all the scaled, parallel-running Nighthawks that generate load. In other words, these two features are almost orthogonal.
E: The next question that was asked is the question of dynamically loading plugins. I think you discussed this in one of the previous meetings already, at least based on the notes, and what was discussed then is correct: the plugins are not dynamically loaded — they don't support dynamic loading out of the box. We use the Envoy plugin system, which is actually a list of extensions that need to be statically linked and built into the binary.
E: However, because this is a pluggable system, I don't see any blockers to developing dynamic loading, in multiple possible ways. One of them could be, if you have a need for it, to develop a plugin that dynamically loads other plugins — that would not even require large refactors of the existing code. You can imagine a plugin named something like "dynamic loader" that would use any means we'd like to dynamically load the other plugins it delegates to.
E: I mean, it's probably limited by the resources available to Nighthawk. Maybe Otto is aware — since I saw you on the call — whether there is any limit to the number of plugins we can load, but my guess is it's going to be memory and system limits. Do you know of any, Otto?
H: I'm not aware of any limits involved there, and I don't think anyone has tried to push it to its limit — so, yeah, I think it can load a lot of plugins.
E
Thank you. The last question that came up, and I have to say I did not cover it today because I think we would need a separate session, was how to build and run Nighthawk. The best thing I can do today is give you a pointer. As mentioned before, Nighthawk is built on top of Envoy, so this is the Envoy repository, and because Nighthawk is built on top of Envoy, a lot of the advice that is here, the quick start and the Bazel build guide for developers, will apply to Nighthawk.
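As a rough starting point, the build usually boils down to a couple of Bazel commands. The exact target names below are assumptions based on the Envoy-style Bazel setup, so treat the Nighthawk README as authoritative:

```shell
# Assumed commands; verify targets against the Nighthawk README.
git clone https://github.com/envoyproxy/nighthawk.git
cd nighthawk
bazel build -c opt //:nighthawk_client   # the load-generating CLI
bazel build -c opt //:nighthawk_service  # the gRPC server component
```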
E
Thank you very much. That's everything I prepared for today. Back to you, Lee.
A
Well, of the other couple of agenda items that we have, we should probably postpone those; they are things we can address asynchronously. But since Jakub is here, and Otto is here as well, let's continue down this path. There are a few folks on the call, so, to step back a little bit:
A
There have been some additional builds of Nighthawk that part of the community has been trying to automate, which is one thing. There are individuals on the call who've tried to build Nighthawk on their machines, and they've run into different issues.
A
So the reference that you have is a good one. Part of what we're hopeful for, and Jakub, you're well aware of this, I'm kind of repeating some things just as a refresher, is to bring in some more CI, to help churn out a few more builds, and also to really dig into some of the use cases we have around other questions to be answered with the step controller, with a custom step controller, with custom adaptive load control.
A
So this is really encouraging to see the architecture. Part of what you were saying, and I've got a few clarifying questions here, was "let me clarify a bunch of things": one, the adaptive load controllers themselves, each of the pieces like the scoring function, the step controller, and so on.
E
The purpose of gRPC here is just to talk to a Nighthawk that can be deployed on another machine rather than on the same machine where the controller runs, and for compatibility with things like horizontal scaling. The only communication that happens over gRPC is asking Nighthawk to execute with certain parameters and then getting the results back.
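The execute-then-adjust exchange described above can be sketched as a simple control loop. The gRPC round trip is faked by a local function, and the binary-search step rule plus the toy latency model are assumptions for illustration, not Nighthawk's actual step controller:

```python
# Hypothetical sketch of the controller side: the only thing crossing the
# (here faked) gRPC boundary is "execute with these parameters" and the
# returned measurement. The step rule below is illustrative only.
def execute_remote(rps):
    """Stand-in for the gRPC execution request/response round trip."""
    # Pretend the system under test saturates around 500 RPS: p99 latency
    # (ms) grows linearly past that point.
    return {"rps": rps, "p99_ms": 5 + max(0, rps - 500) * 0.2}

def adaptive_search(lo=0.0, hi=1000.0, slo_ms=20.0, rounds=20):
    """Binary-search the highest load that still meets the latency SLO."""
    best = lo
    for _ in range(rounds):
        mid = (lo + hi) / 2
        result = execute_remote(mid)    # one round trip per step
        if result["p99_ms"] <= slo_ms:  # SLO met: push load higher
            best, lo = mid, mid
        else:                           # SLO violated: back off
            hi = mid
    return best

print(round(adaptive_search()))  # converges near the ~575 RPS knee
```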
A
Can you say more about how the data sinks play into horizontal scaling and the centralization, what do you call it, the way results from different Nighthawks get coalesced back?
H
Yeah, I'm still here. So, aggregating data from multiple Nighthawks: that's part of the horizontal scaling effort, and the plan right now is to have all the Nighthawk instances propagate their results back to a sink, which is a service, and then that sink can do the aggregation for you.
H
So you can query that sink to hand you a single result, and that is helpful because it will do a merge of all the latency histograms.
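A minimal sketch of that merge step, assuming aligned latency buckets (the bucket layout and the sink-side API here are made up for illustration):

```python
# Hypothetical sketch of the sink's merge: latency histograms from several
# Nighthawk instances are combined bucket-by-bucket, and percentiles are
# read from the merged counts. Bucket layout is illustrative.
from collections import Counter

def merge_histograms(histograms):
    """Sum per-bucket counts across workers (bucket bounds must align)."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)
    return merged

def percentile(hist, q):
    """Return the smallest bucket upper bound covering fraction q."""
    total = sum(hist.values())
    running = 0
    for upper_bound in sorted(hist):
        running += hist[upper_bound]
        if running / total >= q:
            return upper_bound
    raise ValueError("empty histogram")

# Bucket key = latency upper bound in ms, value = request count.
worker_a = {5: 900, 10: 80, 50: 20}
worker_b = {5: 850, 10: 120, 50: 30}
merged = merge_histograms([worker_a, worker_b])
print(percentile(merged, 0.99))  # p99 across both workers -> 50
```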
E
Thank you very much. And to build on top of that, this is where you see why I'm saying it's compatible, or orthogonal: the adaptive load controller essentially just receives one result representing those ten rounds and can make decisions about what the next test run should look like.
A
Questions from others? I know I've had conversations with at least half of you about various questions we think we might have, and when I say "we", I mean there are a number of projects that can be represented in this discussion. So, who else has questions?
A
Just as an update, as a conversation piece maybe: one of the tools being represented here is Meshery. It currently embeds Nighthawk and a couple of other load generators.
A
But
there's
been
work
in
the
community
recently
to
pull
it
out
and
have
it
separately,
deployable
sort
of
in
advance
of
or
in
preparation
for
some
horizontal
things,
and
so.
A
Yeah, and there's also been... so Martika is on; I think she's still on.
A
No, she had asked the question earlier. Well, one of the things we want to do is a bunch of benchmarking.
A
We also want to establish a new unit of measure, a MeshMark, and it might be that doing so through the adaptive load controller is part of the way to go, part of how people would assess it. The concept of a MeshMark, of having a scoring system like that, isn't constrained to just latency and throughput, or to what's measured out of the box by Nighthawk today. So an adaptive load controller with the ability to extend to account for a variety of scoring functions, or other considerations that are environmental or potentially monetary, is interesting. Financial, to the extent that something like MeshMark would account for financial considerations: whether or not someone wants to spend that much on their infrastructure, to allow their infrastructure to run at that rate, or whether it is running at that rate but they may need to reschedule some other infrastructure into a different form.
A
Maybe they were initially running it as a serverless thing, and I'm making this up on the fly, but they're running it on expensive compute and want to move it off, or use cases like that.
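A scoring function of the kind described above might fold latency, throughput, and a monetary term into one number. Nothing below is a published MeshMark formula; the weights, targets, and term shapes are all assumptions made up for illustration:

```python
# Hypothetical sketch of a MeshMark-style composite score, so an adaptive
# load controller could optimize against more than raw performance.
# All weights, targets, and term shapes are illustrative assumptions.
def mesh_score(p99_ms, rps, dollars_per_hour,
               weights=(0.4, 0.4, 0.2), targets=(20.0, 1000.0, 10.0)):
    """Normalize each term against a target, then weight. Higher is better."""
    w_lat, w_thr, w_cost = weights
    t_lat, t_thr, t_cost = targets
    latency_term = min(1.0, t_lat / p99_ms)          # faster -> closer to 1
    throughput_term = min(1.0, rps / t_thr)          # more RPS -> closer to 1
    cost_term = min(1.0, t_cost / dollars_per_hour)  # cheaper -> closer to 1
    return w_lat * latency_term + w_thr * throughput_term + w_cost * cost_term

# A run that exactly hits every target scores 1.0.
print(mesh_score(p99_ms=20.0, rps=1000.0, dollars_per_hour=10.0))
```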
So, really, the presentation was really nice, Jakub. The architecture, the way you presented it, looked clean, so I'll just take it on good faith that it is.
E
Thanks. Sorry, one more thing to add: together with Otto, I am one of the maintainers of the Nighthawk repository, and that includes the adaptive load controller. So if you have questions, feel free to reach out. You can contact us by opening issues in the Nighthawk repository, and it will likely be me or Otto responding back to you, so we'll be happy to support your efforts.
F
So how do we raise these issues? Do we need to file a bug in Bugzilla or something like that? How do we track the issues we raise with you?
E
The main way to contact us is to open a GitHub issue in the Nighthawk repository. That is what we primarily use for communication. We are also available on Slack if you just want to ask a non-committal question; we are, I believe, in both the Layer5 and the Envoy Slack workspaces.
E
Are there any plans to split it apart from the Envoy repo? Assuming that this is about Nighthawk: there are currently no plans to split it, because being coupled with Envoy gives us a lot of functionality we would otherwise have to implement ourselves, like the ability to issue requests in the various protocols that Envoy supports, which right now goes from HTTP/1 on up.
A
Fantastic. Jakub, thanks for this; lots of interest in the work. Thanks to all for coming, and to Jakub for sharing and for muscling through Google's legal process just to do it.
A
One last housekeeping item I forgot: with the consolidation of agendas, just a quick reminder that the next time TAG Network meets is, I think, next week, not in two weeks; it's every first and third Thursday. So, happy Halloween everybody! I guess we'll see you post sugar rush.