From YouTube: GMT 2018-05-16 Performance WG
C: ...CPU consumption — so in this one it's the master, and it's pretty heavily loaded, so it's not as interesting. But if I look at, say, an agent, I can see that there are — oh, interesting — these periodic CPU spikes, and I can drill into one of these and it should generate... yeah. So it's just going to give me a flame graph for that specific time window, and I can see that it's mostly — okay, realpath — `os::realpath` is 13%, and then the rest is...
C: So it's doing reads, doing `lstat` for realpath, and `malloc` and `free` and `fopen`, and I think when I had grabbed some other stacks of this...
C: `os::children` — I think it's not here either. `os::children` is super expensive because it goes and touches all the processes in /proc, and I guess the only way to improve that was... well, the only major way to improve all the system call overhead is that new system call you were showing, James, which didn't end up landing.
C: So you run `perf record`, and then this one — I think it flattens all the stacks; it kind of just makes the file a massive collection of stacks. Usually this file is a lot bigger. So you run this — `perf script` — and it massages it into a format that I guess scripts can understand more easily. I'm not quite sure why it's a separate command; maybe it's just that the file is a lot bigger. And then all you do is take this file and drop it into FlameScope.
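The workflow just described might look roughly like this — a sketch, not from the meeting itself; the sampling frequency, duration, and `<master-pid>` placeholder are illustrative:

```shell
# Record on-CPU stacks for one process (frame-pointer call graphs)
perf record -F 99 -g --pid <master-pid> -- sleep 60

# perf.data is binary; perf script flattens it into one text stack
# sample per event, which is the format FlameScope can parse
perf script --header > stacks.perf

# Then copy stacks.perf off the box and load it in the FlameScope UI
```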
C: So in here there was a comment about perf supporting a workaround for missing frame pointers: libunwind, which uses DWARF. This can be enabled with `--call-graph dwarf`. So for the master stacks I actually used that — I passed that in — and I think they were better, but I haven't had time to compare more rigorously. Okay, yeah — there's this bug in FlameScope where sometimes this happens: it throws an exception. I filed an issue with them, but I'm not sure what triggers it exactly.
C: Okay, all right, here it looks like maybe we're getting a little bit more — `ProcessManager::resume` is showing up. This looks like metrics, but I don't know why it's showing up in the await path. I would expect that it's spending its time in the master and allocator and so on, computing gauges, but instead this shows up. So I'm wondering if that's because we don't have frame pointers, or if it's just been optimized and inlined or something — I'm not sure.
A: Our internal build's not very complete, I guess. So, trying to get DWARF support in the right places — with toolchains and symbols and the right kernel packages — there's a bunch of packaging and distribution issues which made the DWARF sort of unreliable or hard to get. So frame pointers are easy to enable, and they basically seemed to work in the cases that I could see.
A: I don't know how mature the DWARF stuff is, especially since there's a ton of DWARF — the DWARF is huge — and I think you might need... I don't know if you want to do the extended DWARF; I think there's `-ggdb3`. So there's a lot of different knobs to twiddle in this area, and it's pretty time-consuming to explore all the possibilities. Okay.
C: Yeah, I also just don't understand what's happening under the covers to make this work. I think Toon was telling me that it might be that the DWARF information includes the stack frame size or something, and that lets perf know how to jump to the previous frames and so on. Yeah.
A: Ooh — we should have... the DWARF should have enough information to let a stack walker do things like figure out that a function got inlined — that you're really in an inlined function — and present that as a stack frame, and it should be able to fill...
C: And the other thing is that this flag is one or the other: it either uses FP or DWARF or LBR. It doesn't seem like it can use both for some reason.
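For reference, these are the mutually exclusive unwinding modes being discussed (`--call-graph` is a real `perf record` option; the `<pid>` placeholder is illustrative):

```shell
perf record --call-graph fp    --pid <pid>  # frame pointers; needs builds that keep them
perf record --call-graph dwarf --pid <pid>  # DWARF unwind from stack snapshots; larger perf.data
perf record --call-graph lbr   --pid <pid>  # hardware last-branch records; limited stack depth
```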
A: Oh, that's good. No — `-g` just generates debug info; it's orthogonal to the optimization level. You can do `-Og`, which is supposed to optimize but still make your code debuggable, but at least my experience of trying to get debuggable builds is that it's not debuggable enough. So if I needed a debugger, I've usually had to go to `-O0`.
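As a sketch of the flag combinations being discussed (GCC/Clang; `app.cpp` is a stand-in for whatever you're building):

```shell
g++ -O2 -g      app.cpp -o app  # -g is orthogonal to -O: optimized, with debug info
g++ -Og -g      app.cpp -o app  # optimize while trying to stay debuggable
g++ -O0 -ggdb3  app.cpp -o app  # no optimization, richest (gdb-extension) debug info
# keeping frame pointers helps frame-pointer profiling even on optimized builds:
g++ -O2 -g -fno-omit-frame-pointer app.cpp -o app
```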
C: Yeah, this is a master example where the master is using almost 300% CPU, and you can see it's spending time allocating. And I noticed there's a big chunk of that which is just computing the available resources on the agent — the subtraction operator is actually very expensive. So we have already seen a lot of interesting stuff come out of this. I actually wanted to... yeah.
C: It looks like we've managed to make the addition — here's that bug again — it looks like we've managed to make the addition very efficient, because I don't see it showing up. But this looks like JSON responses, because in this particular cluster the JSON endpoints are getting hammered, and then the allocator — and this looks like metrics to me — is showing up a little bit weird. You know, that's coming in under "waited", but I...
C: So this is a broken flame graph. I filed a ticket with them; they haven't responded yet, but yeah, I seem to hit this whenever there's — I don't know — a lot of data. I'm not quite sure exactly what's triggering it, so in this graph I can only really look at small little windows, like maybe thirty, forty milliseconds.
C: And yeah, this is always what seems to show up here — "waited". Anyway, it's pretty easy to use, and I'm hoping other folks can too.
C: I'm also hoping we can update the profiler endpoint to make it very easy to just get one of these files back, without having to get onto the box, run these commands, and scp the stuff back. Hopefully you can just hit an endpoint and get the file, and then we can take the file and put it in FlameScope.
C: But yeah, one thing I ran into when I was trying to do this — I think Chin was helping us — there's this flag... by default perf will actually grab all the subprocesses' stacks. So I was running it on an agent and I was getting all this JVM stuff and I was really confused, and then I realized you have to pass `-i` or something — `--no-inherit` — to get just the process that you're passing with the `--pid` parameter.
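Concretely, the invocation being described is something like this (the `<agent-pid>` placeholder and duration are illustrative):

```shell
# --no-inherit (-i) stops perf from following child processes, so only
# the process given via --pid is sampled
perf record -g --no-inherit --pid <agent-pid> -- sleep 60
```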
A: No, I think that looks pretty cool — thanks for digging into that. It's really, really promising. Yeah.
C: ...delayed as well, and it can induce a lot of CPU consumption on the master and allocator actors. I assume that you guys are maybe just polling it from one thing, like a metrics monitoring system or something, but I think in some other cases people are polling it from different sources, and it might be getting polled pretty frequently. So what I'm hoping to do is migrate...
F: Okay, so I don't really have much to show as such, but I'll just talk a bit about what we have been doing in terms of the benchmark and what to expect. Hopefully by the next meeting that we have here, I will be able to present some numbers and actually get more feedback. So anyway, just to give an overview of the problem that we are dealing with:
F: I'll just start with a scenario where we have a bunch of frameworks registered with the master. Among those frameworks there might be a class that is actually resource-hungry — whatever resources are thrown at them, they'll accept, and they'll always have some jobs to launch and so on — and then there is a different class of frameworks which doesn't actually care about many resources.
F: So if they are sent offers, they'll actually accept maybe half of them, or maybe only a fixed number of offers, and they keep declining everything else. The way DRF works is, when it's time for the next allocation round, it puts the frameworks in a priority queue.
F: The highest priority is given to those frameworks which have not used their fair share yet. So what DRF will try to do is say: okay, these are the five agents that I have, for example, and out of these 20 frameworks I have five which have used less than their fair share, so let me send the offers to those guys. Now, when you do that, those five guys will not accept these...
F: ...offers. If they don't accept an offer, they'll decline it, and implicitly — if they don't do anything else — they're setting a decline timeout (the refuse timeout) of five seconds, which means they won't receive any offers for the next five seconds. And that works well: the allocator then goes on to the next set of frameworks to offer these resources, and the next set, and so on.
F: Okay, so on the left, what you're seeing is this matrix — let me explain what's going on here. The first column is the framework ID. In this particular test we launched 25 frameworks and there were, I think, five agents — yeah, so 25 frameworks, five agents — and we actually set the decline timeout to make it synthetic.
F: We set the decline timeout to one second, so every time you decline an offer you're also telling the master that you don't want to receive any offers for the next second. The default value is five seconds, but this makes the test run quicker. And then what we have is some frameworks which accept a fraction of the offers sent to them, and others which don't accept any offers at all. So in this table, the first column is the framework ID, zero through 24.
F: The second column is the total number of offers that the framework received, the third column is the total number of offers it accepted, and the fourth is the total declined. Interestingly, what you see is that the first framework declined its first offer, but as soon as it accepted one of these offers and started using those resources, it never received any more offers. The same thing happened with frameworks number one and two.
F: The reason frameworks one and two saw more offers is that they accepted the initial offer after quite a while — they kept rejecting several offers, and then at some point accepted one. As soon as they accepted that offer, they never got any new offers back. And the frameworks from number three all the way down to number 24 were the ones which were repeatedly offered the resources, and they kept repeatedly declining them.
F: So that's pretty much the offer starvation bug that we are looking at, and Meng is actually working on the patches for addressing it, using the exhaustion testing. So for the benchmarking, what we are planning to do here is define some sort of goal state, and the idea will be: let's run this benchmark and see.
C: It's actually a little bit more complicated than we thought, and it's a little bit unclear how it's going to fit in if we make roles hierarchical. So it's a little concerning for that reason, but I'll talk about the different options we have, and folks can give some feedback there on the email thread.
C: There's a lot of SlaveID copying being done, and it actually shows up quite heavily in the profiles that he got. So as part of just running this benchmark we also found some room for improvement. It probably shows up in some of the other benchmarks too, but I think now we'll probably take a look at those and see if we can fix some of the performance problems in there. The performance is kind of orthogonal to this problem, though — this is going to happen even if the allocator is really fast.
A: I think it'd be pretty cool for me to... like, I've only looked quite briefly at the doc — I haven't studied it significantly — but maybe I could write something which will sample those metrics over a few hours, anonymize them, and produce some kind of table or some kind of parameters that we could then feed into this benchmark.
C: So that's kind of what I ended up doing, but without using the benchmarks or the metrics. We've been running a particular set of scale tests where frameworks behave in particular ways, and so we've encoded one of the worst cases here in this benchmark, where everything is just declining without any large timeout and they don't suppress. There are still issues even if they're trying to be good citizens, but this particular benchmark...
C: So yeah, we're kind of looking at the worst case here. Hopefully that's sufficient, and in the future we could start modeling other setups — metrics would probably help us figure out what those setups look like, based on how they decline, how they suppress, and so on — but I kind of hope that we don't have to.
C: Yeah, I think us making all the allocation decisions — I don't know how sustainable that is. We might have to move to optimistic offers at some point before this gets too out of hand, but for now we're just trying to address the current behavior and make sure it's a little bit better for these cases where things are starving.
E: Yeah, so I think DRF — for us it's basically the same. It does not add a lot of value in most cases, and it actively prevents frameworks from getting offers in other cases. So I think round-robin would definitely be an improvement in, I hope, most cases for us. But I also think that the request-based model would probably be helpful, because in most cases we know exactly how many resources we need — so that would definitely help us in other cases.
C: It might be a little bit higher than it should be, but we can get kind of close to what their demand should be, hopefully. But I'm hoping that before we build too much complexity into the current model, we can consider migrating towards a more optimistic model where — in a non-revocable world at least — we just let it be a free-for-all where we impose constraints: like, you have a limit, or, you know, there's an opt-in attribute here that you can't get unless you've opted in, and things like that.
E: And so the exhaustion approach — it's a bit like, I mean, it is somewhat similar to setting filters, right? Yeah. We have had issues in the past with setting too many filters, where the allocation times just skyrocketed. It's definitely something that we have to be aware of — unless you find a magic solution for that, in which case I would be perfectly fine with it — but we've certainly seen problems with filters in the past, and so we can't use anything like that if it's not super performant. Okay.
C: We have to have a per-agent set of the roles and frameworks that have been allocated to for that agent, and that has to fill up, essentially, and then get cleared. I have seen some performance problems as well with filters where, because they're resource-subset based, you can actually build up a lot of different filters on a single agent, and that starts to get really expensive, because we have to check all those filters and they're all slightly different subsets.
E: I would assume only one, okay, because what can happen... we did not filter based on any resource type or anything else — we just checked whether the offer matched, and if it doesn't, then we just filter it for the next round or so. Okay — however, are there things like reservations involved, or no? No, that was way back — that was, okay, no, maybe two and a half years back. Okay.
C: Okay, so it shouldn't be a lot, but it could still be several, given that we could start sending you some amount, and then demand could grow a little bit but still not be enough, and so on — and each time it would be different. Some of them are subsets of each other, but we don't actually check for that; we store all of them. If you filter for a superset of what we previously had a filter for, we won't get rid of that earlier, subset filter.
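A toy illustration of the accumulation being described — the class and structures here are hypothetical, not the real allocator's filter types; it just shows that storing every filter independently lets subsets pile up, and that checking an offer means scanning them all:

```python
class AgentFilters:
    """Hypothetical per-agent filter store: each filter is the set of
    resource names a framework refused."""

    def __init__(self):
        self.filters = []          # list of frozensets, never deduplicated

    def add(self, resources):
        # NOTE: no subset check -- a later superset filter does not
        # replace earlier subset filters, so filters accumulate
        self.filters.append(frozenset(resources))

    def is_filtered(self, offer):
        # every stored filter must be checked, which is what gets
        # expensive when many slightly different subsets accumulate
        return any(frozenset(offer) <= f for f in self.filters)

agent = AgentFilters()
agent.add({"cpus"})
agent.add({"cpus", "mem"})        # superset of the first filter
assert len(agent.filters) == 2    # the earlier subset filter is still there
assert agent.is_filtered({"cpus"})
```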
C: We would leave it, so we would have to check at that point — so that could be happening. But yeah, maybe for you guys it's a smallish number, I think. Hopefully, now that we have things like FlameScope, if you guys ever run into any high utilization or performance problems, it would be great to just grab a profile and share it with folks, and we can all drop it into the tool and take a look at what's happening.