From YouTube: Kubernetes SIG Testing 2018-04-17
B
We kind of started with a retrospective of what we had done in the last year, and a lot of the recent performance isolation and, you know, low-level support for performance-sensitive applications work in the kubelet was driven out of that working group, and then there were incubating features that got offloaded to SIGs for code ownership.
B
Most of that stuff went to SIG Node, but stuff like CPU pinning, support for pre-allocated huge pages, and device plugins, like accelerator plugins, all kind of started in conversations in that group, and there are some more proposals that are on the backlog. But we came to a consensus in that forum that we've reached the point where even engineers that have done a lot of performance...
B
So to that end, there are really two goals, right. One is to detect regressions for the different kinds of vertical workloads that, you know, vendors and users care about: certain classes of performance-sensitive workloads like high-performance networking; machine learning training is another one that's come up a lot. There are certain use cases that the users who were pushing for these features are running, so it would be great to get some representation of that in a performance test suite that we could use to detect regressions over time. So, for example, if 1.11 makes this workload run 20% worse, it would be nice if we knew ahead of time. And then the other part, as I said, is to help us prioritize and evaluate the effect of future performance enhancements.
B
But that's really what this document is about. It's not very prescriptive about, like, where the code goes, or, you know, the testing infrastructure or anything like that. The next step coming out of the resource management working group was just to raise this topic within the SIG Testing forum and try to get some feedback, some guidance, and kind of help it along. So with that brain dump, is there any feedback?
B
I've already talked to Tim St. Clair, who's on the line, so I've got his thoughts already, but I'm open to anything anyone else can think of as to what would be a good place to start. Concretely we're thinking about, like, you know, initially, just for ease, maybe adding a subdirectory in test/e2e_node and then having it not run as part of the default set, but letting whoever is interested run it themselves, and then worrying later about integrating it into some...
B
That
saying,
is
implemented
such
that
you
could
add
more
policies
and
so
really
to
get
any
sort
of
signal
about
the
effectiveness
of
like
a
more
autopilot
like
policy.
We
wouldn't
want
to
run
it
against
a
suite
of
you
know,
well-known
workloads,
there's
also
the
question
of
Numa
or
kind
of
like
low-level
device,
topology
aware
resource
allocation
within
the
cubelet.
So
right
now
we
have
a
you
know:
device
manager,
subsystem
that
lets.
B
So
those
two
features
just
by
themselves.
You
know
you
can
use
them
in
isolation,
but
if
you're
on
a
multi
socket
system-
and
you
enable
both
features-
you're-
not
guaranteed
that
you're
going
to
be
pinned
to
a
cpu
on
the
same
Numa
node
as
your
device.
That
was
allocated
for
example.
So
that's
one
thing
that's
kind
of
on
the
horizon.
They
want
to
be
able
to
test
and
see,
like
you
know,
for
different
workloads.
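For concreteness, here is a minimal sketch in Go of the kind of pod where that gap shows up: a Guaranteed-QoS pod asking for exclusive CPUs, huge pages, and a device plugin resource. The image and the accelerator resource name are made-up placeholders, not anything from the meeting.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A Guaranteed-QoS pod: only limits are set, so requests default to the
	// same values. The integer CPU limit means the CPU Manager static policy
	// would pin the container to exclusive cores, and the device plugin
	// resource asks for one accelerator; nothing here (or in the kubelet at
	// the time of this discussion) ties the two to the same NUMA node.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "numa-sensitive-app"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "app",
				Image: "example.com/latency-sensitive:latest", // placeholder image
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						v1.ResourceCPU:            resource.MustParse("4"),
						v1.ResourceMemory:         resource.MustParse("8Gi"),
						"hugepages-2Mi":           resource.MustParse("1Gi"),
						"example.com/accelerator": resource.MustParse("1"), // hypothetical device plugin resource
					},
				},
			}},
		},
	}

	fmt.Printf("limits: %v\n", pod.Spec.Containers[0].Resources.Limits)
}
```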
C
I guess that's where I'm trying to understand whether we're talking about node-level changes impacting cluster-level workloads, or... because to me, just reading through the doc, I'm not sure. When I think of node testing, I think of testing a single node in isolation, as opposed to more of a workload deployed into a cluster. Maybe my terminology is confused. Yeah.
B
So not multi-node, but, you know, on a single node. Maybe you have, like, an application whose performance you care about a lot, and you want to kind of pressure it by adding some aggressors, like something that uses a lot of cache or a lot of memory bandwidth (a STREAM workload, for example). That would be a way to kind of tease out, like, in what situations do these performance optimizations help?
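A minimal sketch of what such a cache/memory-bandwidth aggressor could look like, just to make the idea concrete (the buffer size and duration are arbitrary assumptions):

```go
package main

import (
	"runtime"
	"time"
)

// A crude memory-bandwidth aggressor: each worker repeatedly sweeps a buffer
// much larger than the last-level cache, forcing traffic to main memory and
// polluting shared caches for any neighbouring workload on the node.
func main() {
	const bufSize = 256 << 20 // 256 MiB per worker, assumed to exceed LLC size
	workers := runtime.NumCPU()

	for i := 0; i < workers; i++ {
		go func() {
			buf := make([]byte, bufSize)
			for {
				for j := 0; j < len(buf); j += 64 { // stride of one cache line
					buf[j]++
				}
			}
		}()
	}
	time.Sleep(10 * time.Minute) // run for roughly the duration of the test
}
```

You would schedule something like this next to the latency-sensitive pod and compare its latency with the aggressor on and off, with and without CPU pinning.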
A
[inaudible]

B
That's kind of a tricky one, you know. So for multi-socket, GCP, as far as I know, last time we checked, doesn't report multiple NUMA nodes, so even their biggest machine is just, you know, one logical socket. But, you know, we don't have to get everything in the first step; if we can start somewhere and then kind of iterate, that would be fine. Over time it would be good to maybe extend, you know...
B
A lot of these features are really kind of intended for the bare-metal use case. So if we could get some bare-metal machines in a different testing infrastructure, that would be cool, but, you know, we don't necessarily expect, like, CNCF or anyone to provide all the infrastructure, but there needs to be a place to host the code. It would be kind of nice if it was upstream, but, like I said, I'm open to all sorts of opinions. At this point it's kind of wide open.
D
The CNCF CI: at least Packet is an option, and that's, like, more or less bare metal, right. So there should be some way of wiring something in. I'm not saying... you know, it shouldn't be too disruptive, but, you know, Amazon, first and foremost, would have VMs that support, like, multi-socket. So this would be a path forward.
C
So that all makes sense to me; like, this totally seems like a good idea to do. I guess the other thing that jumps out to me is metrics collection. I would really love for somebody to correct me if I'm wrong, but I don't think we have the greatest support for metrics collection in our test tools today. So one example would be the scalability tests.
C
They measure, like, CPU and memory usage, I think, but that's some, like, hard-coded thing that lives somewhere in the test/e2e package. We don't support something nice, like continuously scraping a given set of Prometheus metrics into some kind of file that could be exported and then imported into something else later, and I feel like that's going to be the larger hurdle that you may have to overcome, but...
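Something along these lines is roughly the missing piece being described: a minimal sketch that polls a Prometheus-format /metrics endpoint and appends the raw samples to a file an e2e job could export. The endpoint and interval here are placeholder assumptions.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// Poll a Prometheus-format /metrics endpoint on a fixed interval and append
// the raw samples, with a timestamp header, to a single artifact file that a
// test job could upload alongside its other results.
func main() {
	const endpoint = "http://localhost:10255/metrics" // placeholder, e.g. a kubelet read-only port

	out, err := os.OpenFile("metrics-over-time.txt", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	for range time.Tick(10 * time.Second) {
		resp, err := http.Get(endpoint)
		if err != nil {
			log.Printf("scrape failed: %v", err)
			continue
		}
		io.WriteString(out, "# scrape "+time.Now().Format(time.RFC3339)+"\n")
		io.Copy(out, resp.Body)
		resp.Body.Close()
	}
}
```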
C
And then the other thing that jumps out to me is storing the test results. Our testing infrastructure doesn't directly store test results in protobuf format in GCS; we store test results in GCS as, like, JSON files or XML files, machine-parsable stuff, and then there's a separate job that goes through and scrapes GCS, converts that into data that gets scraped by something else and put into BigQuery, and something else reads from BigQuery and converts it into protobuf that maybe TestGrid serves out. So all that to say, like, human-readable but machine-parsable data is kind of the expected artifact, and then the rest of the machinery can take it from there. So...
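As an illustration of that kind of artifact, here is a sketch of emitting a junit-style XML results file from Go; the schema follows the usual JUnit conventions, and the test names and values are invented.

```go
package main

import (
	"encoding/xml"
	"log"
	"os"
)

// Minimal junit-style schema: a suite with a list of cases, where a failed
// case carries a <failure> element. Human-readable, but trivially machine
// parsable by whatever scrapes the artifacts out of GCS later.
type testSuite struct {
	XMLName  xml.Name   `xml:"testsuite"`
	Tests    int        `xml:"tests,attr"`
	Failures int        `xml:"failures,attr"`
	Cases    []testCase `xml:"testcase"`
}

type testCase struct {
	Name    string   `xml:"name,attr"`
	Time    float64  `xml:"time,attr"`
	Failure *failure `xml:"failure,omitempty"`
}

type failure struct {
	Message string `xml:"message,attr"`
	Text    string `xml:",chardata"`
}

func main() {
	suite := testSuite{
		Tests:    2,
		Failures: 1,
		Cases: []testCase{
			{Name: "[sig-node] cpu pinning keeps latency under budget", Time: 31.2},
			{Name: "[sig-node] numa-aligned device allocation", Time: 12.7,
				Failure: &failure{Message: "latency regression", Text: "p99 latency 20% above baseline"}},
		},
	}

	f, err := os.Create("junit_perf.xml")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	enc := xml.NewEncoder(f)
	enc.Indent("", "  ")
	if err := enc.Encode(suite); err != nil {
		log.Fatal(err)
	}
}
```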
C
Everyone's posting links to the docs faster than me. I don't have a great overview doc, but I think, looking at maybe the Gubernator README, it talks about the expected format that things should be posted in to GCS, but I don't really have something that concisely describes how things flow from one place to another. Okay.
F
We already chatted; like, he's pretty much... I gave him a couple of different options. The question that I posed to him is: how many people are going to watch this signal, and how is it going to block? Just because we have tests in TestGrid, and we actually even get the data up there, doesn't mean that it blocks things or that people even watch it. So having the signal and having the right watchers, I think, is the important part; the logistics matter.
C
I'm still somewhat on the hook to document this, but it's kind of a crawl, walk, run thing, where first let's make sure we can actually get the data into TestGrid, and then we'll make sure we have humans watching it, and then we'll make sure you have the machinery to enforce all this stuff. But right now we don't have a whole bunch of that documented very concisely, outside of, like, a Google Doc that I have and an issue that is probably marked as stale by fejta-bot.
C
New tests: calling them blocking and making sure they remain so, like, what criteria they have to meet to be blocking, and then how often we should verify that they're still meeting those criteria in terms of test duration and flakiness and ownership and responsiveness of those owners to test failures, and things like that. Well, I mean, that just made me interested in, like, mechanically, how would this work living outside of the repo? Or is it maybe an update from the community, like, where can you see this heading?
B
Yeah
I
think
I
mean
that's.
One
of
the
issues
is
that
it's
meant
to
be
kind
of
shared
responsibility,
although
that's
kind
of
a
bad
term,
nobody
likes
but
yeah,
sure
donorship.
Let's
say
you
know
any
any
sort
of
like
a
feature
owner
for
one
of
these
performance
features
should
be
interested
in
maintaining
the
test
and
there
are
a
bunch
of
different
organizations.
Yeah.
C
I'll tell you, like, the human-oriented process I put together when I was CI signal lead for a release a little while back. Step one, which we do enforce, is: make sure that every job that runs a suite of tests is owned by some SIG, so it should show up on some SIG dashboard somewhere. So, for example, SIG Scalability owns the performance-related jobs, so it's their responsibility to respond if those jobs go from green to red and to see that they stay green. But then all of the individual test cases inside, say, the correctness job that makes sure that Kubernetes is behaving like Kubernetes at five thousand nodes: SIG Scalability doesn't own the code for any of those features. So each of those individual test cases is owned by the SIG that's responsible for that feature, and it's kind of their job to go say: hey, you know, SIG Network or whatever, like, your new proxying feature isn't working at scale. You need to fix it.
C
At scale, like: your test might be working totally fine on AWS, and at small scale on GCE, but it's really broken at high scale. And so that's the shared responsibility: you have, like, a watcher on the wall (there's a SIG to watch the job), but the person responsible for owning the feature itself, or the thing that's actually being tested, will generally own the test case.
C
We don't really have machinery around automating this sort of notification of people. In practice, as a human, when I was applying this to jobs and tests that were blocking the release going out the door, it tended to work, and it's, like, dumb enough that I think it could be converted into something automated.
C
I mean, I kind of have it written up in a doc and I'll work it with the current CI signal person. Oh, like, the convention is basically just: make sure that the SIG responsible for the feature shows up in brackets in the test name. It's not, you know, very strongly enforced, but most of the test cases follow that pattern today; there are very few that don't, and so we can...
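For reference, that bracket convention looks roughly like this in a Ginkgo-style e2e test; the test body and the helper are invented for illustration:

```go
package e2e

import (
	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
)

// The "[sig-node]" tag in the Describe string is what ties the test case back
// to the SIG that owns the feature; dashboards and triage tooling key off it.
var _ = ginkgo.Describe("[sig-node] CPU Manager performance", func() {
	ginkgo.It("should keep p99 latency within budget with exclusive CPUs", func() {
		p99Millis := measureWorkloadLatency() // hypothetical helper, stands in for a real benchmark
		gomega.Expect(p99Millis).To(gomega.BeNumerically("<", 50))
	})
})

// measureWorkloadLatency is a placeholder for whatever measurement the suite runs.
func measureWorkloadLatency() float64 { return 42 }
```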
C
We can start with that for now. Like, I personally feel the amount of time it's going to take you to get reliable test data and a meaningful enough signal to even block anything is probably, like, a 1.12 timeframe, by which time I would hope we would have something more enforceable in place in terms of defining what blocks and what doesn't, right.
C
So, for example, what I have advocated for in the past, and what release teams have never done, is to say, like, we are literally not moving forward in the release schedule until all of the tests on this dashboard are green. Well, you know, what we've always traditionally done is cut builds even though all the tests might be failing, which means alphas and betas go out the door on a predictable schedule, but we literally have no idea how functional they are, and then we start chasing that game when it comes time to actually cut release candidates and go through burndown and whatnot. So what the release team is trying this time, and it appears that it may stick, is to say: if you can get more tests passing, we will freeze later, so that, ideally, if things are stable, you don't have to go into a code freeze in order to stabilize everything.
F
Right, like, we have a bunch of tests that are still non-functional, and I have to go through and take ownership, whether I want to or not, for some signal, because we have product that went out that has problems, right. Because, like, every dot-zero release of Kubernetes (and I know Justin's not looking at the camera, but he can probably nod vigorously) every dot-zero release we've ever had has been a steaming pile of awful, and we fixed the problems afterwards really, really fast.
C
Help the release team, and please advocate that the release team actually have the power to delay the release schedule if the tests aren't passing. Like, I really tried to be very diligent about defining how stable a set of tests should be before we decide that, or, like, how flaky they should get before we decide, you know what, this isn't meaningful signal, and, like, who should be responsible for saying no, no, this feature absolutely must be in a release, and then how does the conversation proceed with, like...
C
Well,
if
you
want
feature
in
the
release,
you
actually
have
to
have
automated
tests
that
are
this
reliable
to
make
sure
that
feature
actually
works.
Otherwise
we're
definitely
shipping
a
broken
feature
and,
like
the
release
team
totally
in
my
opinion,
should
have
the
power
to
do
that.
It
seems
to
be
more
of
a
like
product
level,
conversation
and
I'm,
not
entirely
sure.
G
There is a structural bug in the process, which is that the merge window for the next release opens before the release team is decided, so right from the get-go the release manager can't really push back; right from the get-go they're behind the ball. One thing is that this is no longer true. Okay, can you describe what you mean a little bit there? So I don't think we've... concretely, I don't know that we've filled out the 1.11 team yet entirely, as it branches.
C
Sorry, but the tests weren't passing when we froze. You understand that, right? The tests should be better than just passing; it drives me bananas. So, in my opinion, I would much rather happily ship when it's done, a "we ship when it's ready" schedule, but what seems to be preferred by those who have a more marketing- or product-oriented bent is exactly that: ship on a predictable schedule, every quarter.
F
The question that I have a really difficult time answering is... it's like this imaginary force that no one can see, but it keeps pushing the universe. Maybe it's like dark energy, right: the expansion of the universe continues unabated, but, like, there's no particular person doing that pushing. It's so much become a pathological thing, where we continue down this road because we've done it so many times, even though, like, no one is actually doing the pushing, right.
C
Bonkers and bananas; it drives me up the wall. The fact that we had a very well-attended talk at the last KubeCon about how we should change the release scheduling process, and it has been completely swept under the rug, and each release basically already has a schedule ready to go before we've even done our postmortem. It's bananas. I agree, but I'm trying to suggest maybe this is more a release-oriented topic as opposed to a testing topic, for, like, we...
C
We
provide
signal
and
I
think
our
tools
mostly
do
a
good
job
of
that
now,
whether
they're
not
the
thing
that
generates
the
signal
is
meaningful
or
not,
is
kind
of
not
entirely
our
responsibility
if
I
were
to
for,
as
somebody
who's
been
a
CI
signally,
for
least
before
I
think
one
thing
we
could
do
better
is
make
tester
and
understand
a
little
bit
more
than
it
does
about
test
hierarchy,
because,
like
right
now,
I
can
go.
Look
at
a
summary
dashboard
and
see
perpetually
failing
test
cases
that
aren't.
C
Actually
they
don't
exist
anymore,
but
test
grid
has
some
has
hysteresis
we're
like
test.
Failures
can
show
up,
but
the
overall
test
doesn't
show
up
so
I
know
that
I
shouldn't
actually
look
at
that
job,
because
the
overall
job
is
passing
the
failure
user
left
over
residual
stuff.
So,
like
there's
stuff,
we
could
do
to
to
prescribe
more
of
a
signal
for
those
results,
but
ultimately
it's
the
responsibility
of
people
generating
those
test
results
to
do
stuff
with
them.
C
How do we notify, and how do we make this meaningful? To me, you know, people paid more attention when I, as a human being, came and bugged them, because they trusted that I was a little less noisy than a bot. So until we can figure out a way to make less noisy automated notifications, I still feel like it deserves some tight collaboration with the release team, first and foremost, which I'm happy to help with. And it's true, maybe this opens up the possibility of... right now.
C
I
talked
about
how
a
cig
owns
a
bunch
of
tests,
but,
for
example,
we
have
one
job
that
runs
all
the
tests
inside
of
GCE
and
then
each
individual's
safe,
just
like
Reggie
Jackson's
that
down
to
their
specific
set
of
tests,
and
so
it
could
be.
If
see
it,
you
know.
Sig
knows
tests
fling
can
cause
the
whole
job
to
fail
like
Signet,
where
it
doesn't
necessarily
see
that
and
it
makes
the
signal
go
easier
for
them.
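A tiny sketch of that "regex down" mechanic: one shared result set, filtered per SIG by the bracket tag in the test name. The names here are invented.

```go
package main

import (
	"fmt"
	"regexp"
)

// One job produces results for every test case; each SIG filters the shared
// list down to the cases tagged with its own bracket label.
func main() {
	allResults := []string{
		"[sig-network] Services should proxy traffic at scale",
		"[sig-node] CPU Manager should keep p99 latency within budget",
		"[sig-storage] PersistentVolumes should bind within 30s",
	}

	sigNode := regexp.MustCompile(`\[sig-node\]`)
	for _, name := range allResults {
		if sigNode.MatchString(name) {
			fmt.Println("sig-node dashboard:", name)
		}
	}
}
```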
C
So there was a thought at one point in time of: why don't we, like, regex down just the SIG Network tests and just the SIG Node tests and just the SIG UI tests and so on? But that creates a whole bunch more jobs for us, and my personal feeling is that it would result in us spending a lot more time standing up clusters and a lot less time doing useful stuff with them.