From YouTube: 2016-09-15 Kubernetes SIG Scaling - Weekly Meeting
A
Great, so this is the public meeting of SIG Scaling, September 15, 2016, and I think we were just about to discuss agenda items. So one item is that, I think toward the end, David wanted to give a demo of the stats dump tool we're working on, and then Tim has an item he wanted to talk about just now, which was splitting performance tests separately from functional tests; perhaps that would be the way to say it.
B
The e2e test suite is basically like a lump of things that kind of has a roll-up of everything, and we did it for expedience in the past. But the question I wanted to raise was whether or not we want to have a separate test suite specifically geared toward these longer-term performance tests, because right now we only have a small subset of things, right? We haven't really added much since over the last couple of release cycles, but we would like to add more, and we don't necessarily think the existing e2e test suite is the place to add them, because we don't want them executed on a per-PR basis. It's kind of like how the node e2e tests broke off into their own thing; it's a similar question, I guess.
C
Can we effectively do that without having, you know, tight synchronization with the test folks at Google? I just feel like it always seems like there's a bit of a limbo in terms of how you make progress and get visibility into a lot of that stuff. So...
D
My thought about it is that, basically, with performance tests, we can run them on a very small cluster, which actually doesn't make much sense, or we can run them on a large cluster, which is way too expensive to run on every single PR. So basically we have, internally, suites that are running performance tests; obviously there are not many of them, and if we had more of them it would be great, and they should be blocking merges soon.
D
In other words, Kubemark, for example, is already running every hour or whatever, something like that, and it gives pretty good results. It's non-blocking, but you can look at the results from it on the merge queue or submit queue page, so I think...
C
Part of it, if we don't do it on every PR, though: doing something such that when it does, you know, go above some threshold, we actually start alerting people, I think, would be useful. Otherwise it's too easy to ignore that stuff. I mean, the nice thing about running on every PR is that if stuff starts breaking, people notice, because they can't get work done. Now, I think Kubemark actually is failing.
E
As an aside, I know we have a SIG Testing, right? What if we delegated responsibility to them to maintain those tests, the testing infrastructure, and also the dashboard, so that we can use what they would probably create?
F
The question also is whether you want to run, say, the end-to-end load test suite; so basically run a number of performance tests on a big cluster. Yes, I think we can just use the [Feature:Performance] tag for it, or something, and that will just work. Yeah.
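For context on how a tag like that works: Kubernetes e2e tests embed tags such as [Feature:Performance] directly in the test names, and the runner selects tests by matching a focus regex against those names. A minimal Python sketch of that selection mechanism (the test names below are invented for illustration; the real runner is Ginkgo's --focus flag, not this code):

```python
import re

# Hypothetical e2e test names; Kubernetes embeds tags like
# [Feature:Performance] directly in the test description string.
tests = [
    "Density should allow starting 30 pods per node [Feature:Performance]",
    "Load capacity should scale to 3000 pods [Feature:Performance]",
    "Pods should be restarted on failure",
]

def select(tests, focus):
    """Return only the tests whose name matches the focus regex,
    mirroring how a focus flag picks out a tagged subset."""
    pattern = re.compile(focus)
    return [t for t in tests if pattern.search(t)]

# Selecting only the performance-tagged tests:
perf = select(tests, re.escape("[Feature:Performance]"))
```

The point of the tag is exactly this: the same suite can be run per-PR with the performance tests filtered out, or on a schedule with only the performance tests focused in.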
C
And okay, Hongchao is adding some stuff, thanks. Let's see, so other issues that we want to talk about... so getting a demo of the stats dump stuff would be nice. I don't see Bob; I don't see David online. "I do, he's here." Oh, okay, I think he's in the room.
H
So this will be a relatively brief demo. You guys can see it, alright.
H
I'm going to start from... well, sorry, the first thing I'm going to do is define my environment. Briefly, one of the things to note about defining your environment is that it's really important that you capture the start and stop time of the run. The reason is, if you query before the start time or past the end time, and your window is too large, it will actually return no data.
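The start/stop-time point can be made concrete: Prometheus's query_range HTTP API takes explicit start and end parameters, and a window that falls outside the period the server actually scraped comes back empty. A small sketch, with hypothetical run timestamps, that clamps a requested window to the recorded run before building the query parameters:

```python
# Hypothetical start/stop times captured when the run was defined
# (Unix seconds); the numbers are illustrative only.
RUN_START = 1473955200   # captured at the start of the run
RUN_END   = 1473958800   # captured at the end of the run

def clamp_window(start, end):
    """Clamp a requested query window to the recorded run, so the
    query can't ask for a period in which no data was scraped."""
    start = max(start, RUN_START)
    end = min(end, RUN_END)
    if start >= end:
        raise ValueError("window lies entirely outside the run")
    return start, end

def range_params(query, start, end, step="15s"):
    """Parameters for Prometheus's GET /api/v1/query_range endpoint."""
    start, end = clamp_window(start, end)
    return {"query": query, "start": start, "end": end, "step": step}

# An over-wide request gets clamped to the run's boundaries:
params = range_params("process_resident_memory_bytes",
                      RUN_START - 3600, RUN_END + 3600)
```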
H
Okay, so you can see it's up, and it's scraping metrics as of a couple of seconds ago. Great. So here I document how to configure the database; I wanted to do that because of earlier. So the next thing we want to do is actually capture the data. First you're going to stop Prometheus; you do this by just sending the signal to the Prometheus container.
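Stopping the collector with a signal works because a clean SIGTERM lets the process shut down and flush its data before exiting. A minimal local sketch of the same idea using a child process as a stand-in (in a cluster you would signal the process inside the container instead; the sleeping child here is purely illustrative):

```python
import os
import signal
import subprocess
import sys

# Stand-in for the container's main process: a child that just
# sleeps; a real collector would flush its data on SIGTERM.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)

# Send SIGTERM, the same signal a graceful container stop delivers.
os.kill(proc.pid, signal.SIGTERM)
returncode = proc.wait()
# On POSIX, a process terminated by a signal reports -signum.
```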
H
Are you sharing your browser? Did you see it?
H
So the next part is: we actually query that database with our queries, and we end up with a JSON record that contains values, which are a list of lists; each one contains a timestamp and a value. In this case we're just querying process resident memory. Okay, so we've saved that in a file.
H
Okay, and then we can open that file and process it using, you know, very short Python notebooks. In this case we're using pandas: we can basically just load that JSON file in one line, and then, using Bokeh, we can plot it in a single line. In this case we don't have a particularly interesting graph; we're rolling over one-hour averages, but it's linear.
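The demo loads that JSON with pandas and plots a rolling mean with Bokeh; the underlying computation is just a windowed average over the [timestamp, value] pairs. A dependency-free sketch of the same rolling one-hour average (the sample record below is invented for illustration):

```python
import json

# A record shaped like the one captured above: "values" is a list of
# [timestamp, value] pairs (sample numbers are made up).
record = json.loads("""
{"metric": {"__name__": "process_resident_memory_bytes"},
 "values": [[0, 100.0], [1800, 110.0], [3600, 120.0],
            [5400, 130.0], [7200, 140.0]]}
""")

def rolling_mean(values, window_seconds=3600):
    """For each sample, average all samples in the trailing window --
    a stdlib analogue of pandas' rolling(...).mean()."""
    out = []
    for ts, _ in values:
        in_window = [v for t, v in values
                     if ts - window_seconds <= t <= ts]
        out.append((ts, sum(in_window) / len(in_window)))
    return out

smoothed = rolling_mean(record["values"])
```

In a notebook the same thing is one pandas line plus one Bokeh line; this just shows what that line computes.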
So that's an ad hoc way to be able to do this analysis. In the future... so there are basically three ways that we viewed this data being used. One is, once it's backed up, to be able to view it using Grafana. Again, there are a couple of caveats you need to know about when trying to use Grafana on a backed-up datastore. One is that the time range has to be specified properly.
H
So you need to remember the time we backed up. Another is that Grafana may not be aware of Prometheus restarting, so you may need to restart Grafana in order for it to recognize the broken connection. Other than that, you should be able to use Grafana to view those real-time dashboards as you normally do. If you want to do something more complicated, we have what I just showed, which is a simple IPython notebook wrapping similar values.
H
Since it's a data frame, you could add additional columns, and then, finally, you could use something like an ELK stack or something else to do more sophisticated searching, which this does not allow. And then, if you want to see the code that actually emits what I just demonstrated, there's a pull request for Kraken, and it contains all the things that I have just shown; the only difference is that it uses [inaudible] instead of [inaudible]. And then, finally, this dashboard, the dashboard that was shown in the past...
H
The dashboard is available in the pull request for Kraken. We're also going to check that into grafana.net; they have a dashboard portal, so you can either get that JSON from the pull request or see it show up on grafana.net. So that's all I had for today. Are there any...
H
So that's the ultimate goal. Right now, the one thing is: I showed the pull requests; once those pull requests are merged (there's another pull request for our CI system, too), once all of the pull requests on our side are merged into Kraken, then every time we run the density test, it will automatically upload this data as part of our test results. I think the hope is that we also upload that for federated testing as well, so that it's ultimately shared as part of the federated tests and incorporated there, I think.
C
Okay, so if you run this pod in kube-system, you're able to actually throw it into any Kubernetes cluster, and it's able to actually discover enough to start collecting interesting data. And then, you know, at the end you tell it where to put that data, and it can be, you know, a cloud storage bucket, or you can have it go into some sort of, like, terminated, waiting-for-download state, where it serves it up on an HTTP endpoint.
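The "terminated, waiting for download" idea is essentially: finish collecting, then sit in a tiny HTTP server until someone fetches the results. A minimal local sketch of that final state using only the standard library (the payload and path are invented; a real collector would serve the tarred-up data directory):

```python
import http.server
import threading
import urllib.request

# Stand-in for the collected results that would normally be tarred
# up from the pod's data directory.
RESULTS = b"timestamp,process_resident_memory_bytes\n0,100\n"

class ResultsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the collected data; a real collector might shut
        # itself down once the download has been served.
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(RESULTS)))
        self.end_headers()
        self.wfile.write(RESULTS)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Port 0 lets the OS pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), ResultsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "download" step a CI system or a human would perform:
url = "http://127.0.0.1:%d/results" % server.server_port
body = urllib.request.urlopen(url).read()
server.shutdown()
```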
C
Even without a persistent volume, I mean, you could do it using an empty directory, and then have it so that you can, you know, tell it to stop, have it tar everything up, and then just wait to serve that. So you can just hit it to download it and then have it quit itself, right? "Sure, we'll consider doing that."
C
I would view it, I mean, I would view it as maybe a job, right? I mean, this is a run-once type of thing. You can run it as a job, where it'll run for a certain amount of time, or until you sort of poke it and say: hey, you're done now; package things up so that I can get it, you know, and do analysis later. Now, we can take it offline; these are just some ideas I'm sort of, you know, throwing out.
B
Is the reason for the munging because we're serving up Prometheus metrics in the Prometheus metrics format, so it does its own time-series database, versus JSON? If we were to serve metrics via JSON, which is totally inefficient but useful for data analysis, would that potentially be more helpful from the long-term analytics perspective? So...
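For context on this question: Prometheus exposes metrics in a plain-text exposition format, not JSON, so anything that wants JSON has to convert. A small sketch of that conversion for simple samples (the excerpt and its values are invented; real exposition parsing handles more cases, like histograms and escaped label values):

```python
import json
import re

# A tiny excerpt in the Prometheus text exposition format
# (sample values invented).
EXPOSITION = """\
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 8.912896e+06
http_requests_total{code="200"} 1027
"""

LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                  r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def to_json(text):
    """Convert simple exposition-format samples to a JSON string,
    the kind of representation the question is asking about."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = LINE.match(line)
        if m:
            samples.append({"name": m.group("name"),
                            "labels": m.group("labels") or "",
                            "value": float(m.group("value"))})
    return json.dumps(samples)

as_json = to_json(EXPOSITION)
```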
B
Yeah, I was thinking about taking the man in the middle out. So instead of querying Prometheus to get the JSON data: what happens if Kubernetes itself were serving up JSON-based metrics that you could just scrape and store directly into something like Elasticsearch or whatnot? That way you can do your analytics directly. Would...
H
...that be helpful? That's the question. I think it'd be helpful. The only thing I worry about is whether we're duplicating effort, not just Prometheus but Heapster. Part of the advantage of using something like Prometheus is that you get metrics that perhaps are unrelated to Kubernetes but still, in the end, pertain to the performance of the system. Yeah.
B
We have, we have this weird... at a system-wide level, there's this whole meeting that's after this meeting, which is about this stuff. We have this whole weird many-metrics problem; it's like MacGyver, actually, if you combine all the ways we're collecting data across the cluster. So I almost want to attend the next one, just to see how that one's shaping up right now, yeah.
C
Well, very cool. Thank you. We have about five minutes left here. Let's see... so, Bob, I know that you are going to be disappearing for a couple of weeks into stuff in Korea, right? "That is correct, and so I'll be, you know... I'll still be around. I may be a few minutes late, though. So if I am, you guys can feel free to start without me; I have to drop the kiddo off and then run back home after doing that, so it's a close thing if the bus is late."
C
I may be late, but, you know, Tim, you've been a constant here, so I'll empower you to get things started until I come around. Okay.
C
So a note on Zoom is that if you log in from the web and then launch Zoom from the web, it sort of forwards your login information to the installed app, right? So you can be logged in on the app using one account, but then be logged in on the web using a different account, and when you launch a meeting from the web, it's the web one that wins. It's very confusing. Does that make sense at all?