Apache Mesos Performance Working Group, 19 Sep 2018

From YouTube: GMT 2018-09-19 Performance WG

Description

Agenda: https://docs.google.com/document/d/12hWGuzbqyNWc2l1ysbPcXwc0pzHEy4bodagrlNGCuQU/edit

B

All right, while we wait to see who joins, why don't you and I try to make a list here of things that have been happening? There's the resources work.

C

Your sorter change, the allocator benchmark base, and your callback work, the offer callback optimization.

B

What was the callback thing you're talking about? Oh, yeah.

B

Mm-hmm, I guess you could call it that.

B

Okay, I guess we have enough people that we could probably start talking through these. We were just making a list, trying to capture all the things that have been happening, and we can go over them.

B

There's some stuff here. What else has been happening?

B

I think these are all of the things we've been doing.

B

Meng and I know most of what's been going on here. For Ian and Jen: do you want us to talk through all of these things, or did you have any specific agenda items you wanted to bring up?

A

So I skimmed through a bunch of these reviews. It seems like it's going to help a lot in terms of master performance. I looked at the copy-on-write one and the sorting one; I guess that's the random sorting, no?

B

We can talk about those things, but there are some patches that haven't landed yet that are up for review now.

A

I bring this up just because I saw a comment yesterday on one of the JIRAs I'm watching, which was to benchmark the call ingestion path on the Mesos master. I think this was grouped under a scheduler API v1 epic, and Greg was asking whether this is still an issue and whether we need to do any more benchmarks at this point.

A

Yes, it's relevant. I personally haven't spent a whole bunch of time investigating, but I know that we certainly are affected, because we're hitting the v1 HTTP API path, so any improvements there would greatly improve things for us.

B

Okay, and you're mostly interested in the scheduler calls, right?

A

Yes.

B

Yeah, we can talk about that. There hasn't been any work, to my knowledge. I guess that's not true: there's been some work just from the fact that we did some moves and some copy elimination that would improve the v1 path as well as the original paths. But I don't think there's been any focus yet on scheduler v1 call ingestion at all.

B

I think specifically.

A

The Sun I think.

B

Ilya might have written a benchmark. I seem to remember there being a benchmark a long time ago, and him saying some things about what he found, but I don't remember if we committed that. I'd have to go and look through.

A

Yeah, so there's one committed, no? I think the one the question looked at was from Anna, so maybe it's a different one.

B

I wonder if that was the reconciliation one.

A

So the JIRA I'm looking at is 6405. Oh, sorry, it's not committed; it's closed due to inactivity. And that patch was for the scheduler library.

A

Yeah, I can probably talk about this last. Okay.

B

Anything else you want to cover?

B

Let's just walk through each one of these. Right now, Meng has some patches out that are almost landed for making the resources wrapper do copy-on-write. That's helpful because a lot of the functionality of the resources wrapper, especially in the allocator, is filtering: say, getting just the unreserved resources out of a resources object. Before the patches, we had to copy them all out, which is rather expensive, and often we don't modify the output. So the approach in this patch is to make the resources wrapper copy-on-write so that filtering is less expensive.

B

Meng, maybe you can correct me, but I think when we ran the allocator benchmarks, it had a large impact on small clusters and a pretty small impact on larger clusters, which is actually why some of these other changes are happening. Anything else to add there, Meng?

C

I think those small clusters are actually not that small. Let me check. I think it's around...

B

Maybe we should just share that, yeah.

C

So data okay, then Mitra fine.

C

Preliminary occasion.

C

Yeah, we probably should show the graph. When we talk about the blog, we will share it anyway. Oh yeah, I guess you've already found it. So this is a graph comparing the allocator performance across different versions, and the last one at the bottom is copy-on-write.

C

So the optimization is more significant when there are fewer frameworks. When there is a large number of frameworks, for example in the last column, where it reaches 1k, the sorting overhead becomes dominant; it's less about the filtering, the resource copying, et cetera. That's why we have been doing some optimization on the sorter as well.

B

Okay, anything else to add about the resource copy-on-write stuff? Does anyone have any questions about it?

C

Yeah, so one note: the code is there, but we don't have a good abstraction to guard it. When you want to mutate the underlying resources, you have to be careful to check that the modifier has exclusive ownership. So right now it's a little bit brittle, but we plan to add a better abstraction to enforce safer access.
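To illustrate the copy-on-write idea and the exclusive-ownership check discussed here, the following is a minimal Python sketch. It is not the Mesos implementation: the `Entry` and `Resources` classes, the explicit `owners` counter, and the `unreserved`, `add`, and `get` methods are all hypothetical names chosen for this example, and the counter is never decremented when a wrapper is discarded.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One resource plus the number of wrappers sharing it (hypothetical)."""
    name: str
    value: float
    reserved: bool = False
    owners: int = 0


class Resources:
    """Hypothetical copy-on-write wrapper, not the Mesos class."""

    def __init__(self, entries=()):
        self._entries = list(entries)
        for entry in self._entries:
            entry.owners += 1  # this wrapper now shares the entry

    def unreserved(self):
        # Filtering shares the underlying entries; nothing is copied.
        return Resources(e for e in self._entries if not e.reserved)

    def add(self, name, amount):
        for i, entry in enumerate(self._entries):
            if entry.name != name:
                continue
            if entry.owners > 1:
                # Not the exclusive owner: detach a private copy first.
                entry.owners -= 1
                entry = Entry(entry.name, entry.value, entry.reserved, owners=1)
                self._entries[i] = entry
            entry.value += amount
            return
        self._entries.append(Entry(name, amount, owners=1))

    def get(self, name):
        return sum(e.value for e in self._entries if e.name == name)


total = Resources([Entry("cpus", 8), Entry("mem", 4096, reserved=True)])
free = total.unreserved()  # shares the "cpus" entry with `total`
free.add("cpus", 2)        # copies "cpus" before writing to it
```

The filter itself does no copying; the cost is paid only by a writer that does not hold the entry exclusively, which is the check the speaker describes as easy to forget without a better abstraction.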

A

So I missed part of this because I was disconnected, but it seems like the more agents there are, the less obvious the difference is.

B

Meng, do you know whether it's the agents, or the combination of there being more agents and more frameworks, or just there being more frameworks?

C

So here the ratio is the same: agents to frameworks is ten to one. But based on the perf data I got, I think it's mostly about frameworks, because when you add agents, I don't think it affects the sorting. I think you have the perf trace; if people are interested, we can take a look. Basically, the sort function in the sorter and the share calculation become dominant.

B

Oops yeah I don't know if people are interested or not, but.

C

Oh yes, the sorting. So this is the perf profile of the last setting in the graph, the rightmost column, when there are, I think, 10k agents and 1k frameworks. You can see that most of the time is spent in the allocate function, and within that it's sort. So that's the place where I think the number of frameworks matters.

B

Yeah, with the patches that I have for the sorter, which we can talk about, I think if you treat 1.7 as the normalized baseline of one, these patches get it down to something like 25% of that. So there's going to be an additional big drop here if we land those patches. That's only been done with DRF; we didn't look at the random sorter for this. Probably the random sorter will be a little bit worse, or much worse, because it's doing a random shuffle every time. We can talk about what I did to the sorter, but I would expect that the DRF sorter is actually more efficient now than the random sorter, mainly because it stays sorted most of the time, whereas the random one has to reshuffle whenever sort is called.

A

What's the motivation? You say it stays sorted most of the time. Does that mean the order is not 100 percent correct 100 percent of the time?

B

I guess let's just talk through these now. I have to make this a bit smaller so I can see my tabs. The next one isn't up for review yet, but this one makes about a 35% reduction, which was to just add a new type here that contains resource scalar quantities.

B

We've wanted to do this for a while, but we haven't yet. I only added it in the sorter for now, because it's a little bit more work to pull it up and use it in the resources wrapper and all of its call sites. It basically just stores a vector of (string, scalar) pairs, and it replaces the hashmap of string to scalar for the totals and allocation totals here. It doesn't actually change any of the logic using these things; all of that stays the same, it just swaps the type. That's the first patch. In my runs, it goes from about five and a half seconds to three and a half seconds. Then I have another patch that I haven't published yet that takes it to under one and a half seconds.
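As a rough illustration of swapping a hashmap of totals for a flat vector of (name, scalar) pairs, here is a small Python sketch. The class name `ResourceQuantities`, its methods, and the choice to keep the pairs sorted by name are assumptions made for this example; the transcript only says the type stores a vector of string and scalar pairs.

```python
from bisect import bisect_left


class ResourceQuantities:
    """Hypothetical sketch: scalar quantities as a name-sorted list of pairs."""

    def __init__(self):
        self._pairs = []  # [(name, scalar)], kept sorted by name (assumption)

    def add(self, name, scalar):
        # (name,) sorts just before any (name, scalar) pair with that name.
        i = bisect_left(self._pairs, (name,))
        if i < len(self._pairs) and self._pairs[i][0] == name:
            self._pairs[i] = (name, self._pairs[i][1] + scalar)
        else:
            self._pairs.insert(i, (name, scalar))

    def get(self, name):
        i = bisect_left(self._pairs, (name,))
        if i < len(self._pairs) and self._pairs[i][0] == name:
            return self._pairs[i][1]
        return 0.0

    def __iter__(self):
        return iter(self._pairs)


totals = ResourceQuantities()
totals.add("mem", 1024)
totals.add("cpus", 2)
totals.add("cpus", 1.5)
```

With only a handful of resource names per cluster, a compact sequence like this tends to be cheap to scan and to copy, which is the kind of saving the speaker attributes to the type swap.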

B

I guess one and a half seconds versus five and a half seconds is, I don't know, 25 percent or a little bit less. Maybe it's around forty percent of the original time from when I started working on this. And I assume that if you add Meng's changes in, it's probably going to be a little bit lower as well; I didn't include his here.

B

But I didn't mention what the next patch is; I guess I should do that. What's the best way to talk about this?

B

When you tell the DRF sorter that there's been an allocation, it finds the node in the tree, updates it, and then walks up from the parent to the root of the tree, updating all the parents. Then we say: okay, the tree is dirty, it's not sorted anymore, which means we're going to re-sort it the next time someone asks for a sorting order.

B

What I did was take out this dirty flag setting. To do that, as we walk up the tree, we shift the nodes left or right until they're in the right place, provided the tree wasn't dirty to begin with. So if the tree is already sorted and we modify the allocation of a node, we shift it within its parent's children into its right position, then walk up the tree and keep doing that until we hit the root.

B

So whenever there's an allocation, the tree stays sorted and we don't have to re-sort. That's the outstanding code that I haven't published yet.
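The shift-into-place approach described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each node carries a single numeric `share` rather than a computed dominant share, and `Node`, `add_child`, `_reposition`, and `allocated` are hypothetical names, not the sorter's actual API.

```python
class Node:
    """Tree node whose children stay sorted by ascending share (sketch)."""

    def __init__(self, name, share=0.0):
        self.name = name
        self.share = share
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        _reposition(self.children, len(self.children) - 1)
        return child


def _reposition(siblings, i):
    """Shift siblings[i] left or right until the list is sorted again."""
    while i > 0 and siblings[i].share < siblings[i - 1].share:
        siblings[i - 1], siblings[i] = siblings[i], siblings[i - 1]
        i -= 1
    while i + 1 < len(siblings) and siblings[i].share > siblings[i + 1].share:
        siblings[i + 1], siblings[i] = siblings[i], siblings[i + 1]
        i += 1


def allocated(node, delta):
    """Record an allocation and keep every level sorted, with no dirty flag."""
    while node is not None:
        node.share += delta
        if node.parent is not None:
            siblings = node.parent.children
            _reposition(siblings, siblings.index(node))
        node = node.parent


root = Node("root")
a = root.add_child(Node("a"))
b = root.add_child(Node("b"))
f1 = a.add_child(Node("a/f1"))
f2 = a.add_child(Node("a/f2"))
allocated(f1, 0.5)
```

When a node moves only a position or two, each update touches a few neighbors per level instead of re-sorting every sibling list on the next sort call.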

B

Does that make sense?

A

Does that always result in improvements? I imagine that during an allocation cycle, you decide which framework to allocate resources to, and then you call allocated, right, to tell the sorter that it's been allocated. So...

A

Sorry, go ahead. No, I'm just trying to think.

B

It's not always beneficial, and it wasn't beneficial in the benchmark that I ran this on originally. That's because, for the framework sorters, we currently update their pool of resources every time the role gets allocated resources, so I had to change that as well. Once you change that, what an allocation cycle looks like is: sort, something gets allocated, sort, something gets allocated, and so on. Each call to sort is what's expensive in this code: you don't know that it's pretty much already sorted and that there are just a couple of things that need to shift a little bit, so we re-sort the whole thing, and that ends up dominating the numbers, as we saw in the flame graph here.

A

Yeah, I think you're right. So it is one sort per allocated call.

B

Not necessarily, but pretty much, yeah. It tends to be sort and allocated; those tend to be the two calls that happen over and over again in an allocation cycle.

A

Right, then it sounds like it should be an improvement.

B

We'd just have to change the way that we do DRF among the frameworks in a role: we have to make the resource pool for that the whole cluster, which is consistent with what we're doing with the hierarchy in the DRF sorter, so we might as well make it consistent. But that may change behavior slightly if the dimensions of the role's allocation differ from the whole cluster. If the resource ratios aren't the same, then the sorting might be a little bit different.

A

And that's done in yet another patch?

B

I haven't published those changes at all, but they'll probably be in different patches, yeah. It'd be interesting to look. I think it'll drop the first two cases a lot as well, but it'll make this last case look a lot better.

D

In the current one, when you shift a bit, what data structure are you using to store the sorted allocations?

B

It's a simple vector, actually.

D

Okay. I was also wondering whether we can look at the distance that nodes are being shifted on average.

B

The expectation is that in most cases, especially in a steady-state cluster, the node doesn't shift a lot, and even if it does, it's a cheaper operation than sorting that tree.

D

I'm thinking about whether we could use an ordered set, like a balanced tree. I'm not sure if that would be much more beneficial, but if a node only shifts by, let's say, one or two positions, then a vector is definitely easier and faster.

B

Yeah, it's a vector of pointers, which means it's pretty compact and probably fits in cache pretty easily.

D

So in the sorter, do we always just get the max or the min dominant share, or do we need to get a full order?

C

D

Allocation right, okay, yeah I'm.

C

Right now it returns the vector. I think eventually, after the refactoring, it doesn't need to return a whole vector; we only need to inform the allocator about who is next.

B

That's actually the next thing that shows up: it's the caller of...

C

May be more, every.

C

Yeah, I think in some systems, for example in the kernel, this is what they do: keep a red-black tree of all the nodes and always return which task is next. Something like that.

D

If it's just the next one, maybe we can just have a priority queue, like a heap. We don't even need a binary tree.

A

We need a tree.

D

Unless we want the whole order.
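To make the priority-queue suggestion concrete, here is a small Python sketch using the standard `heapq` module. The function name `next_clients` and the plain dictionary of shares are assumptions for illustration; nothing here reflects how the sorter is actually structured.

```python
import heapq


def next_clients(shares, k):
    """Return up to k client names in ascending share order (sketch)."""
    # Building the heap is linear; each pop costs O(log n).
    heap = [(share, name) for name, share in shares.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]


shares = {"a": 0.5, "b": 0.1, "c": 0.3}
first = next_clients(shares, 1)
```

If the caller only ever needs the next client, this avoids producing the whole order; asking for every client degenerates into a heap sort, which matches the remark that a full order is a different requirement.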

B

So there are some other patches as well. I don't know if they're going to be interesting to talk about right now; they're just some more minor things, avoiding some vector resizing and some additional map lookups when calculating weights, things like that. We need different benchmarks if we want to show the weight lookup improvement.

B

Right now, the benchmark I've been running is one role with many frameworks in that role. So it's a specific setup that we've been looking at; all of this stuff is that same benchmark, one role and lots of frameworks in that role. So we should probably evaluate the other case, which is one framework per role.

B

But yeah, we're making lots of improvements here. It's going to look a lot better, and we'll definitely blog about it in the 1.8 release.

A

B

Super exciting.

B

I'll send some notes out afterwards to write this down. Also, Meng, maybe you want to talk about this one: there's been an allocator benchmarking structure added recently.

C

Okay, I think I can just talk it over. Let me find the patch and share my screen.

C

Can you guys see my screen? Okay. So this is a test base, an allocator benchmark harness for writing benchmarks. Hopefully it can make writing allocator benchmarks easier. What we did is introduce a new base class, a benchmark test base. It takes a bunch of parameters that you set and initializes a cluster for you. So what are the parameters? One is the framework profile: you can specify the name, roles, number of instances, resources, et cetera. Then there is the agent profile: name again, instances, resources, and used resources. The base class takes all of these parameters and sets up the cluster for you. It will initialize the allocator based on the parameters.

C

You can specify different sorters, things like that. It will add the agents, add the frameworks, and so on, so it basically abstracts all the common things that previously existed in the individual allocator benchmarks. From that point, you can focus on what happens once the cluster is set up. So let me show you: this is the first benchmark I added that utilizes the base class.
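The shape of such a configuration can be sketched in Python as below. The field names mirror the parameters mentioned in the discussion (name, roles, instances, resources, used resources, sorter), but the class names, the `task_resources` field, the `"drf"` default, and the `cluster_capacity` helper are hypothetical and are not taken from the Mesos test code.

```python
from dataclasses import dataclass, field


@dataclass
class FrameworkProfile:
    name: str
    roles: list
    instances: int
    task_resources: dict  # resources of one task (assumed field)


@dataclass
class AgentProfile:
    name: str
    instances: int
    resources: dict
    used_resources: dict = field(default_factory=dict)


@dataclass
class BenchmarkConfig:
    sorter: str = "drf"  # hypothetical default
    framework_profiles: list = field(default_factory=list)
    agent_profiles: list = field(default_factory=list)


def cluster_capacity(config):
    """Total resources across every agent instance in the config."""
    total = {}
    for profile in config.agent_profiles:
        for name, amount in profile.resources.items():
            total[name] = total.get(name, 0) + profile.instances * amount
    return total


config = BenchmarkConfig()
config.agent_profiles.append(
    AgentProfile("agent", 100, {"cpus": 24, "mem": 4096}))
config.framework_profiles.append(
    FrameworkProfile("framework-a", ["role-a"], 4, {"cpus": 1, "mem": 128}))
```

A harness built around a config like this lets each benchmark declare the cluster it wants while the shared base handles the repetitive setup.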

C

What it does is launch a number of non-homogeneous frameworks and try to measure the number of tasks launched within a time frame. In the beginning, you pause the clock and initialize a benchmark config: you create the agent profile and a bunch of framework profiles. Here, for example, we picked Marathon instances, Marathon in the sense that it has a small task profile and owns a large number of tasks, et cetera.

C

Then Jenkins, the Spark dispatcher, and so on. All of these are pushed into the benchmark config, and the config is passed in to initialize the cluster. The config has a bunch of default values; here we are using the DRF sorter, a one-second allocation interval, things like that. Another thing is that while we initialize the cluster, we pause allocation, so from this point on the cluster is in a pristine state. That's an improvement over previous benchmarks, where, while you're adding frameworks and agents, there might be event-driven allocations that pollute the final result.

C

Here, once you start allocation, it's from a pristine state. Then in each iteration we advance the clock, and you get a bunch of offer callbacks, which are put into the offer queue, and you can decide what to do with each offer. Here, each framework gets an offer and then launches a bunch of tasks given several constraints.

C

For example, each framework has a maximum number of tasks per offer, so it tries to launch as many tasks as possible. Launching here really just means not declining the resources. After we have used the offer, we decline the remaining resources, and at the end we print out a bunch of statistics. So this base is really about initializing a cluster for you.

C

There is more future work, especially in the test body, where I think this part can also be automated. Hopefully in the future, when you create the framework profile, we will allow you to specify how the framework treats offers, so all you need to think about is: I'm given an offer, what am I going to do with it? So this part of the test body will be automated as well.

C

In the future, hopefully, writing a benchmark is just about populating the fields in the benchmark config; once that is done, you can just start the allocation and then get the statistics you want. I think that's pretty much it. Right now the base is there and there's only one benchmark currently using it. I'm planning to slowly migrate other benchmarks to benefit from this base, and also to add some other benchmarks. In particular, one I currently have in mind is quota-related benchmarks.

B

Can you show us the statistics that get printed?

C

Right now the test base doesn't have any built-in statistics. All of these are within the test body, so each individual test collects and prints out its own statistics. I do plan to add metrics, especially the allocation latency, to the test base, so that at the end you can just query the metrics snapshot and it will print out at least the allocation latency, the number of allocations, et cetera. But right now we can't do that.

C

That's due to the fact that the metrics currently depend on the clock, the libprocess clock. In the test we are manually controlling the clock, and that will mess up the metrics. Once we decouple the metrics from the libprocess clock and use a stopwatch instead, we will have built-in metrics in the allocator test base. But things like cluster capacity, cluster allocation, and so on, the individual test still needs to collect and report itself.
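The problem with timing against a manually controlled clock can be shown with a small Python sketch. `ManualClock`, `Stopwatch`, and `timed_allocation` are hypothetical stand-ins written for this example; they are not libprocess or Mesos types, and `time.perf_counter` plays the role of a real elapsed-time source.

```python
import time


class ManualClock:
    """Stand-in for a test-controlled clock: time moves only when advanced."""

    def __init__(self):
        self._now = 0.0

    def now(self):
        return self._now

    def advance(self, seconds):
        self._now += seconds


class Stopwatch:
    """Measures real elapsed time, independent of the manual clock."""

    def __init__(self):
        self._start = None
        self._elapsed = 0.0

    def start(self):
        self._start = time.perf_counter()

    def stop(self):
        self._elapsed += time.perf_counter() - self._start
        self._start = None

    def elapsed(self):
        return self._elapsed


def timed_allocation(clock, work):
    """Run `work` and report its latency as seen by each time source."""
    watch = Stopwatch()
    clock_before = clock.now()
    watch.start()
    work()
    watch.stop()
    return clock.now() - clock_before, watch.elapsed()


clock = ManualClock()
by_clock, by_watch = timed_allocation(clock, lambda: sum(range(100000)))
```

While the test holds the clock still, a latency metric derived from it reads zero no matter how long the work took, which is why a stopwatch-based metric is needed before the test base can report allocation latency itself.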

B

Yeah, I was just curious what they look like. Do you have an example output of this benchmark?

C

I think it's here.

B

So I guess here at the bottom it's showing that the target allocation was not reached in 30 seconds?

C

Yeah. So what the test does is: each framework has a target number of task instances, and in the end, after a 30-second timeout, it prints out the CPUs and memory that have been allocated versus the target.

C

So this is the actual allocation, and this is the target allocation, which is the aggregate of the target task instances that all the frameworks want to launch.
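As a worked illustration of how an aggregate target allocation could be computed and compared with the actual allocation, here is a short Python sketch. The dictionary layout, the field names, and the numbers are invented for the example and are not the benchmark's real profiles or output.

```python
def target_allocation(frameworks):
    """Sum the resources needed if every framework reached its task target."""
    total = {}
    for fw in frameworks:
        tasks = fw["instances"] * fw["tasks_per_instance"]
        for name, amount in fw["task_resources"].items():
            total[name] = total.get(name, 0) + tasks * amount
    return total


def target_reached(actual, target):
    """True only if every targeted resource is fully allocated."""
    return all(actual.get(name, 0) >= amount for name, amount in target.items())


# Invented example profiles (assumptions, not values from the meeting).
frameworks = [
    {"instances": 2, "tasks_per_instance": 100,
     "task_resources": {"cpus": 1, "mem": 128}},
    {"instances": 1, "tasks_per_instance": 10,
     "task_resources": {"cpus": 2, "mem": 512}},
]
target = target_allocation(frameworks)
```

Comparing this aggregate against what was actually allocated when the timeout expires gives the "target not reached" line the output shows.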

C

Yeah, I think that's all.

Okay, I think that's all.

B

Alright, thanks, Meng. Let me grab the agenda again.

The next thing was... I think I might have a ticket for this. We tried to capture all of them in this ticket here.

B

When I was auditing the code path on the master side: if you look at the benchmarks that Meng's been talking about, none of them benchmark the actual master side of sending those offers out to frameworks. We don't have a benchmark for that right now, so I don't know how much these changes actually improve things. But I did an audit of the code paths there, and there were a lot of low-hanging expensive things that we were doing. I think at this point I've cleaned up most of them. There's still some more work that could be done, but even now you should see a large improvement.

B

One of these was actually a regression in 1.7: when framework metrics were introduced, the way the event metrics were incremented was by evolving the message to v1 and then devolving it to a non-v1 message, which is copying it twice for every outgoing message to schedulers. I don't remember if that's for every scheduler or just v0 schedulers. Yeah, I think it's for every scheduler.

C

I wonder why we were doing that, why we would devolve.

B

Well, the patch that I had was to instead have specific overloads of incrementEvent that know how to handle all of the old-style messages here. Previously there was only an incrementEvent for one type, which means you have to convert into this type, and it takes the whole message, not just the event type. So this just adds overloads to increment the right thing, so we don't have to copy at all.

B

We just pass it through. A lot of the other stuff was just making sure vectors have the right capacity where we can, and avoiding looking things up multiple times in maps. The main benefit from these patches, I think, is just eliminating copies of protobufs. The one I just showed was two extra copies of every outgoing message.

B

There was this patch, which reduced copying of the resources that we're sending out.

B

There was reducing copying in the evolve helpers by using moves instead of CopyFrom, and there was yet another copy of the offers, because the repeated field wrapper for evolve actually took its argument by copy, by accident, I think. So there was another copy there.

B

At this point, it doesn't have a ton of unnecessary stuff going on. I still haven't had a chance to benchmark evolve, but I suspect that there's room to improve it, because the way it works right now is that it deserializes and serializes the offers.

B

When it gets down to evolving an individual offer, it goes through this path, which is going to deserialize the v0 offer and then serialize it back as a v1 offer. You'll notice that some of the overloads here actually don't do that; they have logic to just copy it over, and the note here says that since this one is common, they wanted to speed up performance.

B

That makes me think that maybe they benchmarked this, but I think it should obviously be faster to not deserialize and then serialize back in this particular case, where you're copying just one string. There's also room in this code to take rvalue references and move things, but it's going to be quite a bit of work, so until we have a benchmark, I probably wouldn't look into that more.

B

So we still need a benchmark for the master side of sending offers out, to make sure we can do that as quickly as possible, and there are lots of things we can still improve there. I tried to write them down in this ticket: there's the deserializing and serializing cost that I just talked about, and we could also use some parallelism to speed up sending out the offers.

B

That's pretty much it for that one. Again, I don't know how much it's going to improve things, but I suspect it's going to be pretty significant, given what we saw when we did copy elimination for the master failover benchmarks.

B

Probably in 1.8 I will try to get a benchmark so that we can show this improvement in the release, but I'll probably do that closer to 1.8, unless someone wants to help out with that.

B

Any questions about that?

B

The last, or the second-to-last, thing: I'm almost done with the 1.7 performance blog post. I'm just waiting on a bunch of benchmarking data from Benno and Alex, and by the time they're done we'll probably be able to show some better master branch numbers for this stuff. Then I think that'll be it, and we'll be able to publish it.

B

And then, Ian, do you want to talk about the scheduler call ingestion performance?

A

It really just occurred to me while we were talking about possible things to discuss, and I'm not fully aware of the state of it. In general, this has been a pain point. I don't have specific numbers, so overall I just have an abstract idea that it needs to be improved. It's not necessarily a specific thing that I have cycles to do or expect us to do next; it's just something on my mind.

B

Yeah, I seem to remember something from Ilya. I don't know if he reported it. I'm not sure how to search for this ticket.

A

Is the scheduler you wanted for money certainly includes.

B

I'll have to do some looking around after this to figure out what's up.

A

I think maybe I will do some research and be better prepared for the next session.

B

Okay, I'm just doing a quick search in my terminal for all the benchmark tests. Okay, that's what I suspected: there is a scheduler reconcile tasks benchmark test. I thought this was what Ilya was using.

A

Yeah, I remember that in one of the previous performance meetings he shared numbers using this.

B

So this benchmark basically doesn't even set up a cluster with agents and stuff; it just sends a big message to the master. Where does it time it? Okay, so it just times how long the send takes, for things to get settled. So we could use this to take a quick look and run perf on it. Back then, I don't think we were asking folks to provide perf stack traces, but it's probably as simple as just running this, getting some perf data, and sharing it. If we see any obvious problems there, we can take care of them.

A

I think I'll probably do that, yeah.

A

This is making me pretty optimistic, because previously, when I saw the random sorter, I was actually going to try to switch to that and see what performance would look like. Ultimately, in our environments, the fact that one framework is just starting out with nothing and another framework is eighty percent of its way to fully launched doesn't necessarily suggest any kind of priority to us. So I was going to see how much the random sorter is going to help.

B

In that case you might want to actually use the random sorter. It sounds like those two frameworks are equal in your scenario. There's room to optimize the random sorter as well, if you do go down that route. It's just going to mean that those two frameworks are treated as equal, instead of the one with zero resources getting a lot more than the one with 80.

A

Yeah, semantically I think the random sorter matches our expectation, but then you were saying the performance of DRF could actually be better?

B

I suspect that they're probably equivalent today, or the random sorter is probably faster at sorting. But after these patches, the random sorter is going to be slower, unless we do the optimization to be able to iterate without having to sort the whole thing. That's the main optimization we'd have to do: let the caller iterate in some way, which basically means that if you only needed the first two results, you'd just take two random samples rather than doing a random sampling of the whole data set and sorting it. So it's definitely optimizable as well; it's just that most people are using the DRF sorter.
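One way to let a caller iterate without shuffling everything up front is an incremental Fisher-Yates shuffle, sketched here in Python. This is an illustration only: `random_order` is a hypothetical helper, it ignores weights, which a real random sorter would have to respect, and it still copies the client list once.

```python
import random
from itertools import islice


def random_order(clients, rng=random):
    """Yield clients in uniformly random order, one sample at a time."""
    pool = list(clients)
    for i in range(len(pool)):
        # Pick a random element from the not-yet-yielded suffix.
        j = rng.randrange(i, len(pool))
        pool[i], pool[j] = pool[j], pool[i]
        yield pool[i]


# Only two random draws happen here, however many clients there are.
first_two = list(islice(random_order(range(1000), random.Random(7)), 2))
```

Each yielded element costs one random draw and one swap, so a caller that stops after two results does two samples of work rather than a full shuffle.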

A

Yeah, you're right. That makes complete sense.

A

Sounds very good.

B

All right, glad to hear it. I think that's it, unless anyone else has anything. Anyone?

B

Nope? Alright, thanks everyone. I'll send out some notes after this.

C

Thank you, Ben.