From YouTube: 2017-02-23 Kubernetes SIG Scaling - Weekly Meeting
A: To start with that: I mean, there's a lot going on, but I think the thing that's of the most relevant interest is this. There was a lot of discussion at the board meeting, and then also at the TOC meetings, around the use of the CNCF's bare metal cluster, and the fact that utilization on it has been very low. I think the Red Hat folks, notably, have done probably the most with it. There's a bunch of issues with how to provision it.
A: So, things like getting Kubernetes on bare metal, Kubernetes on OpenStack, etc. It seems like there's going to be a lot more energy on that, but right now I'm just trying to send a long email to the Intel folks. I looped in Chris Wright from Red Hat as well, and I think we're going to try to see if we can improve the tooling or the governance around the big bare metal cluster. So if anyone's interested in that, let me know.
A: One of the original goals was to be able to scale test bare metal. However, there is certainly a contingent, and I'd say I'm supportive of this, behind the notion that at least some portions of it should be carved off as CI for various projects, which I think is a sensible compromise. That means you'd have, I don't know, 800 nodes available for big scale testing, and a couple hundred carved off into smaller CI clusters.
A: There are two issues: one is the amount of paperwork, and the other is the amount of paperwork per server. The anecdotal reports were that, because you're essentially asking a manual group to go provision servers for you by hand, you would say "hey, I need X", and then they would run off for days and days to prep it, which is, you know, a little silly for a cloud native group. So, yeah, anyway. I certainly didn't want to take up the whole meeting with that.
E: So there are a couple of checklist items to be taken care of, but it's my understanding that Matt from Google is already running through those checklist items. We do need to go through the feature auditing, that's my bad; they're just checking off all the bits so that the PM group is happy for this release. But for the most part, the checklist items that I'm aware of are pretty much just the manual rollback and roll-forward tests.
C: One thing that, like I said last week: as a starter project, GM was working on a tool to compare environments, that is, to take tests, say a load test, run on different environments, compare the results, and see how similar those environments are to each other. We have that working now, and I'd like to show it a little bit, okay.
G: So this, I don't think, is something that everybody would be interested in, but I did want to raise it to see who would be. As we've done more to automate running OpenShift on GCE, we've had some discussion about what the ideal architecture would be. OpenShift added, out of the box, very early on, XFS project quota for emptyDirs, mainly because it was a point of attack for multi-tenant clusters, and right now there's a couple of proposals in flight.
G: Vish has a proposal open that will probably be worked on in 1.7, from the node SIG, which is about local volume access, and it sort of touched on things like: we want IO isolation on nodes between Docker, kubelets and the operating system on one side, and tenant workloads on the other. The challenge ultimately comes down to the fact that there are several different classes of actual disk workload, and, just like CPU and memory QoS tiering, in practice we have this today, but it's mostly best effort.
G: The question I was trying to get answered, starting from the GCE perspective, was: you have these kind of three classes of workload. You know, special disk, which is PDs you attach to the system, where you're essentially bringing in a whole disk, and you might have a different sort of network bandwidth for them; but you also don't want the operating system choking on, you know, having the same IO queues for those operations or for those disks.
G: We generally recommend it be on a separate disk or device, both on metal and on cloud, but effectively emptyDir for the workloads on the node is treated as best effort from an IO perspective. What we do is set a hard limit via an XFS quota for every unique user ID, and because we force all containers to run as different, unique user IDs, we can leverage that. Vis-a-vis the proposal, there's more work to be done for XFS project quota and all that; I didn't want to dive too much into FSGroup.
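To make the mechanism being described concrete, here is a minimal sketch of putting an XFS project quota on an emptyDir path. It assumes an XFS filesystem mounted with the prjquota option; the mount point, path, project ID and limit are all illustrative, not OpenShift's actual code.

```python
import subprocess

MOUNTPOINT = "/var/lib/kubelet"   # hypothetical XFS mount, mounted with prjquota
EMPTYDIR = "/var/lib/kubelet/pods/abc123/volumes/emptydir"   # illustrative path

def set_project_quota(path: str, project_id: int, hard_limit: str) -> None:
    """Tag `path` as an XFS quota project and give it a hard block limit."""
    # Mark the directory tree as belonging to the project.
    subprocess.run(["xfs_quota", "-x", "-c",
                    f"project -s -p {path} {project_id}", MOUNTPOINT], check=True)
    # Enforce the cap; writes past it fail with EDQUOT instead of filling the disk.
    subprocess.run(["xfs_quota", "-x", "-c",
                    f"limit -p bhard={hard_limit} {project_id}", MOUNTPOINT], check=True)

set_project_quota(EMPTYDIR, 5001, "1g")   # e.g. one project per unique user ID
```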
G: The group ID can be made unique across all namespaces on a cluster today, for instance, so that's another angle when you look at what gets written. I don't think the point of it, and the proposal talks about this, is to be totally isolated; it's essentially slack capacity, or scratch storage, and there might be higher levels of local storage in the future. Vish has a number of proposals. My question has basically boiled down to: while I had GCE experts on, I was trying to figure out what's sane on something like GCE.
G: The interesting thing is, I think this fades a little bit if you're not running a lot of really dense workloads, and I'm kind of teasing around the question of: is there a sweet spot of reasonably sized nodes for reasonable app workloads? Because I would say, you know, the vast majority of workloads people are running are two-to-four-core-type CPU workloads; they might have some tenancy, but they're just not that dense.
E: I think, right now, too, we're running into fundamental limits that we're not tracking. Right now we don't have sensors for IOPS, we don't have sensors for network bandwidth, you know, and we're fundamentally limited on both fronts when you start running at providers on different levels.
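As an aside on the "no sensors for IOPS" point: a crude sensor can be derived from /proc/diskstats by sampling the completed read/write counters over a window. A minimal sketch; the device name is illustrative.

```python
import time

def io_counts(device: str):
    """Cumulative (reads, writes) completed for `device`, from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]), int(fields[7])
    raise ValueError(f"device {device!r} not found")

def sample_iops(device: str, interval: float = 10.0) -> float:
    """Delta of completed IOs over a sampling window: a crude IOPS sensor."""
    r0, w0 = io_counts(device)
    time.sleep(interval)
    r1, w1 = io_counts(device)
    return ((r1 - r0) + (w1 - w0)) / interval

print(sample_iops("sda", interval=1.0))   # illustrative device name
```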
G: And maybe this isn't even a question for this group, but just while I have folks who know a little bit about the GCE stuff: if we did a single PD with logical volumes, can we get any kind of reasonable OS-level IOPS control, at least in terms of queue depths and dispatched IOPS over a window? Or, in a GCE environment, do we really have to rely on PDs going forward to get IO isolation like that?
C: To pull out the low-level question there: I think it's just an abstraction, and it should scale exactly with size. I think the IOPS scale with the size of the PD, so having a larger one should make a multiplicative difference.
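For context on C's point: GCE persistent disk performance is documented to scale with provisioned volume size, up to per-VM caps. A toy calculation follows; the per-GB rates and the cap are placeholders only, since the authoritative numbers live in the GCE PD docs and have changed over time.

```python
# Placeholder per-GB rates and cap; check the GCE persistent disk docs.
PD_SSD_READ_IOPS_PER_GB = 30.0
PD_STANDARD_READ_IOPS_PER_GB = 0.75
PER_VM_READ_IOPS_CAP = 25000.0

def provisioned_read_iops(size_gb: float, per_gb_rate: float) -> float:
    """PD performance scales linearly with volume size, up to a per-VM cap."""
    return min(size_gb * per_gb_rate, PER_VM_READ_IOPS_CAP)

print(provisioned_read_iops(500, PD_SSD_READ_IOPS_PER_GB))        # 15000.0
print(provisioned_read_iops(500, PD_STANDARD_READ_IOPS_PER_GB))   # 375.0
```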
D: I can tell you that when the folks who did the virtualization stuff for GCE made local SSD work, they had to tune the hell out of the paravirtualized IO path, so there are no bottlenecks there, right. So my guess would be, and, you know, I'm talking sort of in guesses here, that it really just comes down to the disk and the limits that are exposed sort of outside of it, and those things are totally blind in terms of reads versus writes inside, yeah.
G: I had kind of suspected that, and to Tim's point, I've kind of gotten the feeling that, of the last, let's say, 10 major production-level performance problems, half of them were stupid CPU fast loops and the other half were IO problems; and with the IO problems the basic story is "well, maybe we can do fewer IO operations", sometimes without really getting any better at it. So we end up in a spot where, other than doing less IO, we don't have a lot of knobs.
E: There are super small IO bandwidth limits, but they're not enforced and not enabled, last time I checked. Well, they had all kinds of issues; I'm trying to recall what the problems were, but we did use these way back in grid days to limit the number of people who could do sort of local storage, when they were trying to do HDFS-type things.
G: The current stance from sig-node, and this is mostly Vish's perspective plus some facts from Dawn, is that the only real isolation is device-level, and everything else ultimately doesn't work. And, oh yeah, block IO is trash; that was the other one.
A: blkio cgroups are trash, or IO cgroups in general are trash. So I do think, then, turning to the scale team, my question would be: how do we better figure this out, to Tim's point?
E: Some other systems have a deeper level of introspection into current usage statistics, and they do things like, for CPU, publishing a load average, right, and you can base scheduling decisions on that. They do similar things for IO and track it, so that you can prevent problems: if somebody is going to be a heavy misuser, that gets taken into account, and one thing they typically do is rank, so they rank by who's using the most and move those apps.
D: So the big thing is that if you're running big machines on GCE, then the limits end up being per device, and you can actually use the device as the limit there. If you're running small machines, you're just going to be IO limited regardless of what you do, and then it essentially becomes: how do you actually divvy that up, right, with, you know, cgroups and IOPS limiting across one disk?
B: Right, so I'm kind of curious why we consider blkio cgroups trash? It seems like a simple thing that would be helpful, yeah.
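For reference, the interface under discussion is the cgroup-v1 blkio throttle knobs, which cap IOPS per block device. A hedged sketch, with the cgroup path and device numbers illustrative; one commonly cited reason these limits disappoint is that, in cgroup v1, buffered writeback is not attributed to the issuing cgroup.

```python
# cgroup-v1 blkio throttle interface; cgroup path and device numbers illustrative.
CGROUP = "/sys/fs/cgroup/blkio/tenant-a"   # hypothetical cgroup
DEVICE = "8:0"                             # major:minor of the block device

def throttle_iops(read_iops: int, write_iops: int) -> None:
    """Hard per-device IOPS caps for every task in the cgroup."""
    with open(f"{CGROUP}/blkio.throttle.read_iops_device", "w") as f:
        f.write(f"{DEVICE} {read_iops}\n")
    with open(f"{CGROUP}/blkio.throttle.write_iops_device", "w") as f:
        f.write(f"{DEVICE} {write_iops}\n")
    # Caveat: in cgroup v1, buffered writeback is charged to flusher threads,
    # so the write cap only really bites for direct/sync IO.

throttle_iops(500, 250)
```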
D: So, you know, at Google, one of the truisms is that when you're doing a network disk system and you have a lot of clients all calling in, essentially there are no sequential writes, right; everything ends up being random. And there are no sequential reads either, because everything ends up being random, right.
D: The best case gets worse, but the average case is more real, more predictable, right. So, the long way around is to say: take all the disks in your cluster, run some sort of distributed storage system on top of that, and then expose that to things using higher-level abstractions that can do statistical averaging across stuff; but we're not going to get anything beyond that.
E: This is a signal problem. If there were one type of signal, at least from the node, that gave me some type of updated rolling average... the kubelet could maintain a load average for what's going on and just publish that as part of its 10-second updates, yeah. Because right now it is writing literally 8K every time, and it is the same data except for the timestamp; so if we add an extra bit that changes along with that 8K, I don't think it's a problem; it's okay.
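The "rolling average published every 10 seconds" idea could be as simple as an exponentially weighted moving average kept on the node. A minimal sketch, with the sampling stubbed out by random numbers:

```python
import random

class RollingLoad:
    """Exponentially weighted moving average: one float of state per node,
    cheap enough to fold into the existing 10-second status update."""
    def __init__(self, window_s: float = 60.0, interval_s: float = 10.0):
        self.alpha = interval_s / window_s   # weight given to each new sample
        self.value = 0.0

    def update(self, sample: float) -> float:
        self.value += self.alpha * (sample - self.value)
        return self.value

load = RollingLoad()
for _ in range(6):                         # pretend: one minute of 10s updates
    sample = random.uniform(100, 500)      # stand-in for a real IOPS reading
    print(round(load.update(sample), 1))   # publish alongside the heartbeat
```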
H: OK, so this tool is pretty much for comparing two tests together that run in very similar environments, for example Kubemark and real clusters. What we do, basically, is take the last few runs from both tests and aggregate the metrics: take the average across all the runs of key metrics, like API call latencies and the pod startup latency. Let me actually start by showing it.
H: So you pretty much have these options for the tool: the comparison scheme that you are going to use; the left job and the right job that we want to compare; and the run selection scheme, which is pretty much whether you want to select all the runs from the last N hours, or the last N runs, with N itself as the value for it. So I will start by running this against...
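The options described might look roughly like the following command-line surface; this is a hypothetical sketch of the interface, not the actual tool's flags.

```python
import argparse

# Hypothetical flags mirroring the options described; not the real tool's CLI.
parser = argparse.ArgumentParser(description="Compare metrics between two test jobs")
parser.add_argument("--comparison-scheme", default="avg-ratio",
                    help="how the two sides of each metric are compared")
parser.add_argument("--left-job", required=True, help="first job to compare")
parser.add_argument("--right-job", required=True, help="second job to compare")
parser.add_argument("--run-selection", choices=["last-n-hours", "last-n-runs"],
                    default="last-n-runs", help="which runs of each job to aggregate")
parser.add_argument("-n", type=int, default=5, help="the value of N")
args = parser.parse_args()
print(args)
```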
H: You get, for a given pair, a horrible dump of the metric comparisons, and sorry for that. Yes, this is a very sad dump; I shall probably format it better, but there's way too much data, yep. So let's take some metric, for example the endpoints one. It's pretty hard to read, but yeah, this is pretty much it.
H
So
that
is
followed
by
a
comment
which
kills,
which
tells
the
number
of
sample
points
we
got
from
the
left
test
and
brightest.
There
means
the
standard
deviations
and
the
maximum
values.
So
we
pretty
much
kind
of
get
a
super
high
value
works
to
mean
which
is
which
is
like
798.
It
is
of
the
order
of
flick
of
seven
kind
of
pretty
close
yeah.
H: Yeah, so, without going too deep: you can see that they're pretty close, so they match. This is still a prototype for now. I ran a few experiments on these, and it turns out that close to sixty percent of the metrics are actually similar when we allow a slack of thirty percent; I changed it to twenty percent now, so it's probably somewhat less. If I remember correctly, it's around fifty percent of the metrics that matched.
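The aggregation and "slack" matching described amounts to something like the sketch below; the sample values and the metric are illustrative, and this is not the tool's actual code.

```python
from statistics import mean, stdev

def summarize(samples):
    """Aggregate one metric across the selected runs of a job."""
    return {"n": len(samples), "mean": mean(samples),
            "stdev": stdev(samples) if len(samples) > 1 else 0.0,
            "max": max(samples)}

def matches(left, right, slack=0.2):
    """Call the two sides 'similar' if their means differ by at most `slack`."""
    lm, rm = mean(left), mean(right)
    return abs(lm - rm) <= slack * max(lm, rm)

left = [798, 812, 775, 801]    # e.g. an API call latency (ms) from Kubemark runs
right = [740, 760, 725, 752]   # the same metric from real-cluster runs
print(summarize(left), summarize(right), matches(left, right))
```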
C: But that's the process, I think, and the point is right: we have got a tool that will allow us to compare environments, and the value we see here, for us, is Kubemark versus normal clusters; but as long as we have results from the identical test, we can compare anything, say, take your load test results. That's sort of answering Clayton's question, if I understood it correctly; maybe I didn't really get it.
C
So
if
anyone
will
just
run
the
test
which
will
automatically
embed
those
days,
all
those
things
you
can
just
compare,
we
can
compare
environments
between
each
other.
I
will
will
be
able
to
compare
environments
with
each
other,
which
is
like
the
first
step
to
actually
I'm
doing.
What
you've
always
wanted
to
do
so
like
see
where
the
difference
is?
I.
C: But if anyone wants to actually make sure that the environments really are equivalent: for any statistical significance you need to run a number of tests. A single test can be anything; it can be really badly skewed and not mean anything, yeah. That's, like, a sad reality of mathematics, sure.
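If one did want a statistical-significance check across runs, rather than a fixed slack, a Welch's t-test over the two sets of runs is a standard choice. A sketch with made-up samples:

```python
from scipy import stats

left = [798, 812, 775, 801, 790]    # made-up samples from environment A
right = [740, 760, 725, 752, 748]   # made-up samples from environment B

# Welch's t-test: is the gap between the environments bigger than run noise?
t, p = stats.ttest_ind(left, right, equal_var=False)
print(f"t={t:.2f} p={p:.4f}")   # small p: unlikely to be noise; more runs, more power
```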
E: I think at this point, given everything, just having one extra metric that we can agree on and start to define and track over time... that was your original proposal, right? Yeah, that's totally legit with me; I think that's the right way to go. The question is what metric you want it to be and how to measure it.
C: So that's what I wanted to actually know; that's one thing. And the second thing is that we want to start simple, but what does that mean exactly? How do we want to describe it: actually just a number, with the definition as we did already, or do we want to change the view we have towards the one that you proposed here?
E: But we did that on density, which is via a replication controller, right. So you could say, like, you could do one for each: you could do startup latency and say "this is what happens when I just try to start up a thousand pods directly", and "this is what happens when I try to start up a thousand pods via a replication controller", which is the...
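A rough way to measure the pod startup latency side of such a density test from the outside is to diff pod timestamps reported by the API. This sketch uses status.startTime as a crude proxy for startup (it is set when the kubelet starts the pod, not when it is Ready), and a hypothetical "name=density" label:

```python
import json, subprocess
from datetime import datetime

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

# Fetch all pods of the (hypothetical) density run by label.
out = subprocess.check_output(
    ["kubectl", "get", "pods", "-l", "name=density", "-o", "json"])

# Submission-to-start latency per pod; startTime is only a rough proxy.
latencies = sorted(
    (parse(p["status"]["startTime"])
     - parse(p["metadata"]["creationTimestamp"])).total_seconds()
    for p in json.loads(out)["items"])

print("p50:", latencies[len(latencies) // 2],
      "p99:", latencies[int(len(latencies) * 0.99)])
```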
E: There have been fixes in 1.6, and Hawkins, I believe, has most of those fixes in place, so it should prevent the proxy iptables thing from going bananas. But I do think, correct me if I'm wrong, J, you had a service load test, right? You could measure the number of services per second if you wanted to.
E: So we used to have this thing; an example test here would be: what happens if you start removing those caps on communication, right, like the QPS limits? Like, you see the curve... let me kick off my screen so you can see. So before, the ramp curve, the slope, would be like this, right; you take off the QPS limits and all of a sudden the slope goes like this, but then it becomes unbounded.
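The QPS caps being removed here are client-side rate limits, typically token buckets. A minimal sketch of the damping mechanism in question; the rate and burst values are illustrative.

```python
import time

class TokenBucket:
    """Client-side QPS cap: `rate` requests/s sustained, bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def acquire(self) -> None:
        """Block until one request's worth of budget is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            time.sleep((1.0 - self.tokens) / self.rate)

limiter = TokenBucket(rate=20, burst=30)   # illustrative values
for _ in range(5):
    limiter.acquire()   # callers can never exceed the configured QPS for long
```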
E: Like, how fast can we get this without breaking everything? Because right now the whole system is very heavily damped; the whole system's response time is like this, when in the beginning it actually was faster. We put all those limits in place because what was happening was that it would be faster, but it would also start losing its mind and breaking. Now, I think, as things have progressed, folks have fixed the caching issues and a number of other performance issues as part of the other scalability fixes.
E: Maybe what would make sense is to start with your original doc and break down some of the pieces; don't make it too heavyweight, and start to break down what end-to-end system throughput means for the different types of resource objects, and which ones we want to work on first, because we're not going to do all of them, right.
E
So,
there's
still
many
things
going
on
that
we
don't
have
all
control
over
we're,
just
measuring
like
here's,
your
impulse
and
here's
your
response
curve.
So
it's
just
like
trying
to
remember
the
class,
but
like
signals
glass
back
in
college
right,
we're
like
you're,
measuring
the
impulse
response
curve
of
the
black
box
and
to
figure
out
what
the
function
is.
A: I think there are people tuning the system, but I think the other thing that we're interested in with this sort of thing is tracking regressions, right: where we don't think anything's changed, but the curve has gone whack, has gone different for some reason, and we want to know why; catch that. Yes.
E: So you're saying, I think: start simple. Let's just start simple and get a simple test in place, and then eventually we can expand upon it. Right now, I think, as a SIG, we should probably be adding more and more tests to measure the different metrics of the system over time; right now we only have a couple of those core metrics and, as we mentioned before, they're kind of filthy lies, right. If you start loading up other things as well, you start seeing abnormal behavior, right.
E: He's joking. So why don't we just reconvene and talk about it next week, and see where we're at? How does that sound? Just add it as an agenda item; we'll follow up there.