From YouTube: GMT 2017-10-18 Performance WG Meeting
Description
Agenda and Notes:
https://docs.google.com/a/mesosphere.io/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU/edit?usp=drive_web
A: Dimitri was working on those. It looks like he's not on the call, so I'll just describe what's being done.
So the first thing was that we added support for arenas, and we started making use of them in the install handlers in libprocess where we're receiving incoming messages, so that just provides a bit of an optimization there.
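The arena optimization mentioned here amortizes the many small allocations a protobuf message tree normally makes into one region that is freed in bulk. As a rough standard-library analogue of that idea (not the actual libprocess patch), std::pmr's monotonic buffer does the same kind of batching:

```cpp
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical stand-in for a protobuf message: several small heap
// allocations that an arena can batch into one buffer.
struct FakeMessage {
  std::pmr::string name;
  std::pmr::vector<int> ids;
};

// Build a message whose allocations all come from `mr`; releasing the
// buffer frees everything at once, which is the arena idea.
FakeMessage makeMessage(std::pmr::memory_resource* mr) {
  FakeMessage m{std::pmr::string("agent", mr), std::pmr::vector<int>(mr)};
  for (int i = 0; i < 100; ++i) m.ids.push_back(i);
  return m;
}
```

The payoff is fewer allocator round trips on the hot message-parsing path, at the cost of tying every sub-object's lifetime to the arena's.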
E: I basically put together a benchmark for agent re-registration this week. The results haven't been conclusive, but I think it's a good start. I have a WIP which I can share shortly after, and I guess I can gather some feedback here by just talking about it briefly. So, the benchmark.
It's basically a real master with a bunch of fake agents, because using even the full-fledged mock Slave class would involve a lot of state recovery and complexity there.
So basically what I've done is create a prototype process called a test slave, which does only two things: it sends a ReregisterSlaveMessage to the master, and it installs a handler for the SlaveReregisteredMessage.
But at least this captures the path from that point on through the next couple of asynchronous processing steps, where the master consults the authorizer and the registry, for example, and then handles processing all of the state sent over by the ReregisterSlaveMessage. So it captures all of that. Right now the knobs are the number of agents to register with the master, and then just trying to simulate what the real slave does.
The real agent reports the number of frameworks on this agent, the tasks running from these frameworks, and also the completed frameworks and the task history, including completed tasks. So per host I'm giving five frameworks, and for each of these frameworks I give five tasks; I also give 50 completed frameworks, and five tasks per completed framework. So, just a few simple knobs to simulate the behavior, and so far the results haven't been all that conclusive.
E: So first you have a master, and then I create, you know, thousands or tens of thousands of agents. These are fake agents, which are basically just actors to simulate the slave object, and by simulation I mean that when I create one, it will send a ReregisterSlaveMessage to the master; all of them, without the real backoff and all that, without retries.
E: It's going to mark a future as completed. So for two thousand agents I basically have two thousand futures; I'm doing an await on all 2,000 of them, and I stop the watch once all of the futures are completed. So that's the measurement, and it took a minute to re-register, to fully process all of these, and I cannot see any difference between them.
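The measurement described here, sketched with standard futures (std::future standing in for process::Future; in the real benchmark each fake agent would complete its future on receiving the SlaveReregisteredMessage):

```cpp
#include <chrono>
#include <future>
#include <vector>

// Await every re-registration future and return the elapsed wall time
// in seconds, measured until the last future completes.
double timeAllRegistered(std::vector<std::future<void>>& futures) {
  auto start = std::chrono::steady_clock::now();
  for (auto& f : futures) f.wait();  // the "await on all of them"
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();  // stop the watch at the last completion
}
```

This measures end-to-end processing of all re-registrations rather than per-agent latency, which matches the failover scenario being benchmarked.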
E: I haven't profiled it yet, which I totally should. But as for flame graphs, I told Ben that our environment has some problems: I tried to generate flame graphs from our production systems, but the graph came out with a whole bunch of unknown symbols. We know what the problem is, but right now we cannot fix it. But still.
A flame graph, it seems to me, can only tell you this much, because previously, even with that graph with a whole bunch of unknown symbols, we could see protobuf processing being a very wide bar. So I guess it is a known fact that this copying or constructing is taking a lot of cycles, I guess.
F: A good question: this is the first time I've attended one of these, so I'm probably missing a bunch of context. But what is the context for this study? Was there some behavior you observed? You're registering 2,000 or 20,000 agents: were you seeing slow agent registration, and that's what motivated trying to figure out what's slow? Is that what's going on?
E: So we have already done a bunch of optimizations which improved the failover time drastically. But what motivated this at the very beginning was that when the master fails over, the agents all start reconnecting and re-registering, and it would sometimes take a very long time for them to be fully registered.
E: Right. So in this benchmark, and that's why I'm open for feedback, but in this version each agent just sends one ReregisterSlaveMessage and waits for the response, which is the SlaveReregisteredMessage. It doesn't do backoff and it doesn't do retries, for one because I found it simpler to create a basically new fake slave object which has the logic to do all of this.
E: Okay. I thought that even if the agents are in the same process as the master, at least the protobuf handlers, and the path from the handler to the method that actually handles the message, and from that point on all the subsequent steps, are still all happening, right?
A: What happened in that patch was, instead of using arenas, we construct a message on the stack and we move it into the vector for repeated fields. But I think what we'll do there is just have move support for moving the whole message into the handler, which accomplishes the same thing. So we'll have that at some point, but we don't have it yet.
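The "move the whole message into the handler" idea can be sketched generically; the types and names below are hypothetical, not the actual libprocess install() machinery:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical message type; `payload` stands in for a large repeated
// field we'd rather not copy. `copies` counts copy constructions so
// the difference between passing an lvalue and an rvalue is visible.
struct RegisterMsg {
  std::vector<std::string> payload;
  int copies = 0;
  RegisterMsg() = default;
  RegisterMsg(const RegisterMsg& o)
    : payload(o.payload), copies(o.copies + 1) {}
  RegisterMsg(RegisterMsg&&) = default;
};

// A handler taking the message by value: handing it an rvalue moves
// the whole message in, so the repeated field is never copied.
int handleByValue(RegisterMsg msg) { return msg.copies; }
```

Passing `std::move(msg)` reaches the handler with zero copies, which is what moving the message into the handler would buy over the copy that happens today.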
G: Maybe the better option would be to just push all the messages to the master's queue and then run it along: you construct the messages, push them as message events, inject them into the master's queue, and then, when everything is constructed, you just let it run, with only the one master thread running.
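G's suggestion, roughly: construct and enqueue all message events up front, then let a single thread drain the master's queue, so event construction stays out of the measured processing. A generic sketch with hypothetical stand-in types:

```cpp
#include <functional>
#include <queue>

// Stand-in for the master's event queue. All message events are
// constructed and enqueued during setup; only the drain below would
// be inside the timed section of the benchmark.
using MessageEvent = std::function<void()>;

// Single consumer: process every pre-constructed event in order and
// return how many were handled.
int drainQueue(std::queue<MessageEvent>& q) {
  int handled = 0;
  while (!q.empty()) {
    q.front()();  // process one message event
    q.pop();
    ++handled;
  }
  return handled;
}
```

Separating construction from consumption this way means the benchmark measures the master's handling cost, not the fake agents' send path.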
A: Sure, okay. I think we're almost at the halfway point, so maybe we can switch over to Ilya.
B: What I did: I created two benchmarks whose purpose was to compare the performance of the old libprocess message-passing based scheduler driver and the new v1 scheduler HTTP API. For the implementation of those benchmarks I used the old libprocess-based driver that all framework writers are using right now, and the new HTTP-based v1 scheduler driver. Because those benchmarking tools are themselves using libprocess, and parts of Mesos, the performance of the benchmarking tools affects the benchmarks as well.
B: So, yeah, I used the latest version available, which at this time is 1.4, and I compared the performance in two scenarios. The first scenario is when the master is generating a lot of events. For this benchmark I send one huge reconcile call that contains a lot of non-existent tasks; for all of those tasks the master will respond with an update event, and I measure
the time that passes from sending the call until the last update event that the scheduler receives. The other scenario is ingestion of a lot of scheduler calls by the master. For this test I sent many revive calls; I chose revive because their processing is relatively cheap, so it won't affect the measurement that much. After I sent all the revive calls, I send one reconcile call with one non-existent task.
B: For sending of many events by the master, the performance is roughly the same: reconciling 100,000 tasks takes roughly 6.4 seconds if we are using the old scheduler driver, and around six seconds if we use the new HTTP scheduler driver. So, as you can see, the performance is somehow even a bit better.
B: Yeah, okay. One interesting thing that I found when doing this benchmark is that right now the HTTP scheduler driver calls the received callback in an asynchronous manner, and interestingly, if we switch to calling that callback synchronously, we gain roughly a forty percent improvement: as you can see in this benchmark, the time it took to process the same amount of reconcile calls is around three seconds.
B: So in this scenario I think the new scheduler driver is good. Moving to the second benchmark, which measures the performance of call ingestion: here the results are not that good, because when we are using the new HTTP scheduler driver, its performance is nearly three times worse than with the old libprocess-based driver.
H: Well, I guess, while I'm here, I would just like to figure out what Dimitri wants to do with the defer/dispatch stuff. Essentially, the context I have is just that we want to move the arguments through defer and dispatch. Is there scheduled work or investigation going on, anything else? I just briefly looked at the code.
H: If you have that, it will be useful. Okay, all right, I'll reach out to you on Slack, I guess, Dimitri. Essentially, I think, what we want: first we need to build a one-shot bind utility, right? As you mentioned, std::bind doesn't work for us because it assumes that it can be called multiple times. So we can build the one-shot bind utility in isolation somewhere in stout, and then we can use that in libprocess.
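A minimal sketch of such a one-shot bind (hypothetical, not the eventual stout API): the bound arguments are captured and then moved out at invocation time, which is exactly why the result may only be called once:

```cpp
#include <memory>
#include <tuple>
#include <utility>

// bindOnce: like std::bind, but moves its bound arguments into the
// call. Unlike std::bind, this supports move-only arguments such as
// std::unique_ptr, at the cost of only being safe to invoke once.
template <typename F, typename... Args>
auto bindOnce(F&& f, Args&&... args) {
  return [f = std::forward<F>(f),
          tup = std::make_tuple(std::forward<Args>(args)...)]() mutable {
    // Moving the tuple lets std::apply pass each argument as an rvalue.
    return std::apply(std::move(f), std::move(tup));
  };
}
```

A second invocation would see moved-from arguments, which is the same hazard the discussion above is trying to confine to an explicit "once" utility.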
H: Yes. So what I want to do there first: the CI is pretty unstable at this point, and I have people who don't want me to do the upgrade just yet, because we want to stabilize the CI first; I'm not sure how long that's going to take. But the other thing is that we actually need to update the CI to begin with, to have support for the new compilers, because otherwise it's just going to break everything. So, the CI.
H: I was mainly just thinking about looking at where and how it's called. For example, presumably we, you know, take the result of defer and pass it to then, right? So we should look at the then code to make sure that we actually do only call it once, and that kind of stuff. I almost wonder.
D: Yeah, it is always on us, but I'm just saying, basically, I think a sane way to think about it is: if someone creates a call-once, sort of defer-once function, we will move your arguments into it. Now, I want to be clear: we can't move the Future arguments, and so we can't move your value; that one we still have to pass as an l-value, because there could be multiple people that hold onto the future.
D: We end up holding onto the std::function inside of Future, and we ultimately execute it; it gets implicitly converted, basically, to a std::function in then. Okay, okay, so this would be like: if somebody gives us a, you know, call-me-only-once callback, or whatever it is, then under the covers, when that gets executed, we could move in, we could move through, I guess, any parameters that are passed to it. Yeah, yeah, but.
D: It would be gone if we moved it; that means you couldn't call the second one. So I think there needs to be something completely different if we want to have the concept of: I have basically a future that nobody else has, i.e. I have the r-value of this future, nobody else has it, and there are no other copies of it. Call it, like, an owned future, if you will, a unique future. And therefore, when I pass my.
H: I get what you're saying. I think, yeah, the problem essentially is just that futures are inherently shared, and so we have the same problem as with std::bind, where we have to be able to call it multiple times, essentially calling .get() multiple times on it. Yeah, we can't move it into the then, because you have to be able to call it again.
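The owned-vs-shared distinction being discussed has a standard-library analogue that may serve as a mental model: process::Future behaves like std::shared_future (many holders, so get() cannot give the value away), while the proposed "owned future" would behave like the move-only std::future:

```cpp
#include <future>
#include <memory>
#include <utility>

// An "owned" future (std::future) is move-only: exactly one holder,
// so get() can move the result out exactly once.
std::unique_ptr<int> takeResult(std::future<std::unique_ptr<int>> owned) {
  return owned.get();  // moves the value out; `owned` is now spent
}

// A shared future can have many holders, so get() hands back a const
// reference: each holder may only copy the value, never move it.
int readResult(const std::shared_future<int>& shared) {
  return shared.get();  // callable any number of times
}
```

This is why a defer-once utility could move ordinary arguments but would still have to pass Future arguments as l-values: other copies of the future may observe the same result later.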
H: If we can do that sooner rather than later, I would be happy. But what I also don't want to do is, you know, shake the CI even more.
E: Typical message-handling methods do partial processing, and then at the end of it, more often than not, because we need to invoke another asynchronous component, they jump out of the current actor, go to another, and then later come back. So does that mean that, for a lot of these arguments, we'll just do a move on all of them? If they're not shared, they're just parameters within the scope of the method, we're