From YouTube: A Tour of CI on The Kubernetes Project
B
Okay, so hello, and welcome to A Tour of CI on Kubernetes with Dan and Rob. What we're going to be doing today is talk a little bit about the work that CI Signal does to report flakes back to the community of developers who are working on a release. The CI Signal team is part of SIG Release, and what SIG Release does is work to shepherd a range of features and fixes into the next upcoming release of Kubernetes.
B
As part of that work, there's a range of cross-cutting concerns and activities across a number of teams, ranging from bug triage to sharpening features through the KEP process. What the CI Signal team does is monitor the signal coming off of end-to-end tests as part of CI in Kubernetes, so what we are primarily focused on is detecting what are called flaky tests.
B
What we're going to do is look at the tooling that we use to monitor the signal; we're going to tour around some of those tools, look at how we report flakes, and then I'm going to hand over to Dan, who's going to show us examples of fixing the flakes and working with the community to get flakes eradicated, because they are a scourge. And in terms of how CI Signal works as a team:
B
One of the awesome things about Kubernetes is that, as a project, it works hard to bring on new contributors, and that's one of the things that really attracted me to it. When I first came to Kubernetes, I hung out with and attended a lot of meetings of the contributor experience group, and it was through them that I learned how the project works and how the project is structured, and through their help...
B
...I ended up finding out that my area of interest was, naturally, CI, so I went over to the release team and worked on CI Signal. I was the team lead for the release just gone by; Dan was the team lead for the 1.19 release, and I shadowed Dan on that. So how does shadowing work? There's an application process whereby you can apply to become a member of the CI Signal team, and of other teams as well.
B
And I just want to talk about that a little bit because, as a team lead, your first task is to go through the applications and figure out who you're going to select. The one thing I want to note for people who apply for shadow roles on teams like CI Signal is that the competition is fierce. That is to say, for the 1.20 release I think I had in excess of 80 people apply to be shadows on the team, and from a management point of view...
B
...we can only supervise and train up a team of about four or five at best. So if you applied and didn't get onto the team, don't be disheartened: be aware that there's a lot of competition, and be aware that it's possible to apply again; there will be more releases.
B
The other thing I would say about applying for roles in the community like that: it's important to remember that there's nothing stopping you from doing some of the work before you arrive and apply. There's nothing stopping you from looking for flakes and reporting flakes, and if you're interested in becoming a shadow on the team, it will stand you in good stead if you follow the instructions that we provide today and just muck in and do the work.
B
So with that in mind, bear in mind that we have a Slack channel, #release-ci-signal, which I've linked to in the docs there. If you have questions about this talk or about participating, you can always ask those questions there. The other important thing to do is to join SIG Release; I have links there about the SIG Release team and how you can participate in its meetings. So we're going to get into Testgrid, and I have a small trigger warning.
B
One of the things that we're aware of in the Kubernetes project is that we're working towards replacing harmful language with neutral language wherever possible, and I'd just like to give a shout out to the Naming working group, which is being set up to look at this across the project. On GitHub, in our Git workflow, we have branch names that bubble up into CI; we're aware of that, and over the course of the coming weeks and months...
B
...that's something that we'll be working on to make changes to. So let's just get into it and have a look. The first tool that we have in the chest in order to look at how CI works on Kubernetes is Testgrid. So Testgrid, from a release...
B
...point of view, presents us with a wide range of jobs that run end-to-end tests across multiple runtimes, and this is our go-to tool for figuring out whether or not a job is flaky and whether or not individual tests are flaky. So this is a summary of the blocking group of jobs, and these are jobs that are considered so important that if they were in a bad state, that would block a release. But we have some flakes here that we can look at.
B
Take the Conformance GA-only job now. How Testgrid works is that it basically provides a job view of all of the tests that have been run for an individual job, and in order to make sense of what we're seeing here, there are a lot of tricks that I use to figure out what's going on. The column headers here give us an indication of the periodicity of this job: this job runs twice a day.
B
We have times logged in Pacific Standard Time there. If you want to flip that over to your local time (I'm coming from Dublin, Ireland, so that's GMT), I can get a sense that it runs at 5:30 in the morning for me and 5:30 in the afternoon, roughly every day. Each cell in Testgrid represents a test result: green means a test passed, and red means that it failed. So in terms of managing information and trying to get a handle on looking at these failed tests...
B
...we have a couple of tricks here, a couple of options that we can make use of. Typically, when I'm hunting for flaky tests, I exclude all tests that have not failed, and that then gives me an easier chunk of information to manage. The other thing that I would often do when I'm looking at this test here:
B
If I mouse over it, the test name is highlighted on the left, and then at the top I can see the date on which the test ran and the time it ran. I have a number here that the tooltip says is the build number, but it's really the prow job number, and then below that is the commit ID from Kubernetes against which this test ran.
B
So the first thing that I do when I'm trying to figure out whether this is a flake: I'll mouse over the test, then mouse over to the right, and I'll note that the Git commit ID has changed. If I mouse over to the left, the Git commit ID has changed as well, so I'm not 100% clear that this is a flake. What I want to do is go back in time.
B
If I go back in time, I can see another incident of this test failure, and if I mouse to the right here and mouse to the left, you'll note that I have the same commit ID for this failure. So this is pretty much the definition of a flake. For the commit ID that ends in 3b90f, here we have a pass...
B
...here we have a fail, and here we're back to passing again. So, by definition, this is producing a non-deterministic test result. For any of the cells, I can click through to get to this view of the job. This view of the job is presented by a tool called Spyglass, and Spyglass allows us to see how the job was run in prow. Straight away, we can see that failing test, and if I just expand on that, we can see that this is the test error message.
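The heuristic described above, where the same commit ID shows both a pass and a fail, can be sketched in a few lines of Go. This is an illustrative sketch, not Testgrid code; the `Result` type and `isFlaky` function are invented for the example.

```go
package main

import "fmt"

// Result is one cell from a Testgrid row: the commit a run was built
// against and whether the run passed.
type Result struct {
	Commit string
	Passed bool
}

// isFlaky reports whether any single commit has both passing and
// failing runs: a non-deterministic result with the code held constant,
// which is the working definition of a flake used in the walkthrough.
func isFlaky(results []Result) bool {
	passed := map[string]bool{}
	failed := map[string]bool{}
	for _, r := range results {
		if r.Passed {
			passed[r.Commit] = true
		} else {
			failed[r.Commit] = true
		}
	}
	for c := range passed {
		if failed[c] {
			return true
		}
	}
	return false
}

func main() {
	// Pass, fail, pass against the same commit, as in the walkthrough.
	runs := []Result{
		{Commit: "3b90f", Passed: true},
		{Commit: "3b90f", Passed: false},
		{Commit: "3b90f", Passed: true},
	}
	fmt.Println(isFlaky(runs)) // prints "true"
}
```

With the pass/fail/pass sequence from the walkthrough, `isFlaky` returns true, because the code under test was held constant while the result changed.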
B
So this looks like a straightforward test failure, in so far as, instinctively, I know that this probably isn't a runtime issue, given the name of the test. The name of the test here: I might actually speak to this, because this is a trap for new contributors. On the left-hand side here (sorry, just going back), we see this test name.
B
If we want to drill down into the test, we could attempt to do a search for it, and we have a tool, Hound, that allows us to search the repos. But if I pop that into Hound, we'll see it comes back with nothing. This is because this is not really the true name of the test.
B
So this is the test that's flaking for us. Dan can speak more to this, and we'll see more of it when Dan does his part: the name of this test is "evicts pods with min tolerations", and it's tagged as disruptive. The rest of that name in Testgrid is formed by traversing down through the hierarchy of the end-to-end test suite that we use to do end-to-end testing. Okay.
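That traversal can be pictured as a simple join: each level of the suite hierarchy contributes one segment of the flat name shown in Testgrid. A minimal sketch, with invented segment strings rather than the suite's real hierarchy:

```go
package main

import (
	"fmt"
	"strings"
)

// fullTestName mimics how a Ginkgo-style suite flattens its nested
// Describe/Context/It blocks into the single long name Testgrid shows:
// one segment per level of the hierarchy, joined with spaces.
func fullTestName(segments ...string) string {
	return strings.Join(segments, " ")
}

func main() {
	// Hypothetical hierarchy, outermost block first.
	name := fullTestName(
		"[sig-example] SomeController",
		"when a pod has a short toleration",
		"evicts pods with min tolerations [Disruptive]",
	)
	fmt.Println(name)
}
```

This is also why searching Hound for the full Testgrid string finds nothing: no one source line contains the whole joined name.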
B
If we were to do that from scratch, we would go to Kubernetes issues and, if I dropped in this test name, in fact, I should find it. We can see that this test has already been logged by ScrapCodes; that's Prashant. He worked on the 1.19...
B
Well, he worked on the 1.20 release formally as a shadow, but he actually started working freelance, as it were, on 1.19, and I showed him how to do this work. This is an example of a logged flaky test, where we go through and describe which jobs are flaking and which test flaked, and then we have basically all of the evidence to back up our flake report.
B
So I'll just show you how that works in terms of filling that out quickly, and then I think that will be me done, and I think we will go on to your stuff.
B
Next, when we log an issue, we have a range of issue types that we can report, including failing test and flaking test. If we just have a quick look at flaking test, we see that we have a GitHub template that allows us to enter in all of the information to describe the flake that we have found. Now, I think I've linked to Prashant's logging of this in the HackMD.
B
That's a good example of how to fill this out. The one thing that I want to just look at here is the triage tool. So in Testgrid we can look at jobs from the point of view of all of the tests that run in a job, okay; but there are other views that are interesting to look at when we're trying to figure out why a test is flaking.
B
So we have a tool called Triage, and what Triage does is look at the output of end-to-end tests from the point of view of errors in those tests, so it'll group test failures by error. So if we were to take this particular test and do a search...
B
...do a search here for this test in Triage, we'd get information that's useful for test maintainers to figure out what's going on here. One of the key things to figure out when you're trying to deflake a test is when and where the test is flaking. So if we look down through (and I might make this a little bit bigger)...
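Conceptually, Triage clusters failures by their error text, so one underlying problem shows up as a single group across many jobs. The sketch below illustrates that idea with invented job names and a deliberately crude normalization step (collapsing numbers) standing in for Triage's real clustering:

```go
package main

import (
	"fmt"
	"regexp"
)

var numberRuns = regexp.MustCompile(`\d+`)

// normalize collapses digit runs so that messages differing only in
// timeouts, ports, or counts land in the same bucket.
func normalize(errMsg string) string {
	return numberRuns.ReplaceAllString(errMsg, "N")
}

// groupByError buckets failing jobs by their normalized error message,
// the way Triage presents failures grouped by error rather than by job.
func groupByError(failures map[string]string) map[string][]string {
	groups := map[string][]string{}
	for job, errMsg := range failures {
		key := normalize(errMsg)
		groups[key] = append(groups[key], job)
	}
	return groups
}

func main() {
	failures := map[string]string{
		"gce-job-a":  "timed out after 300s waiting for pod",
		"gce-job-b":  "timed out after 600s waiting for pod",
		"kind-job-c": "connection refused",
	}
	// The two timeout failures collapse into one group.
	for errKey, jobs := range groupByError(failures) {
		fmt.Println(errKey, jobs)
	}
}
```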
B
We can see that there were 17 failures as of today; I'll speak to that in a second. So this test failure occurred in this way across a number of jobs, and that's useful information for somebody trying to triage the test. And let me just scroll down a little bit further, because, if I recall from preparing the talk...
B
...I think I might have seen this test fail on a Windows job. Yeah: so that means that it failed when it was run in the context of a Windows runtime. I think that's useful information, isn't it, Dan, for people troubleshooting the test. So I think that's pretty much my walkthrough, Dan. I'm just wondering: are you seeing any questions or feedback in chat? I can't see it.
C
So we had one question around whether we have a bot to report flakes, since you demonstrated when we can tell for sure that there's a flake. I mentioned in the general channel on the Discord server, for folks that are watching on YouTube, that fejta-bot will go ahead and record flakes on presubmits, but we definitely could improve that on the periodics, right? Because...
B
Yeah. Over the course of the 1.20 release, I attempted to do a crazy join between the very, very structured data that we get out of Testgrid and the manually logged flakes in GitHub, and although everything is technically possible, the challenge there is that different people log flakes in slightly different ways, so that's difficult to parse.
B
One of the things I will say is awesome about CI on the Kubernetes project is that the data from jobs that are run is collected and put into a BigQuery database. So our CI process is database-backed, which then means that we have tools like Triage that allow us to slice and dice the runtime data in different ways, which helps us get to the bottom of why things are the way they are. At the moment, there's a problem with getting data across to BigQuery, and people...
B
...people are working on that. You can see there that we have data up to December 1st, but that's something that's being actively looked at and that we're working to fix. And, broadly speaking, the CI Signal team...
B
...although it's part of the SIG Release team, we work closely with SIG Testing, because it's their technology that we're using to deliver CI, and we work with Testing Ops as well. Broadly speaking, the infrastructure which runs CI jobs is very reliable, and jobs are run reliably; any time we hit up against infrastructural issues, it could be to do with configuration of jobs rather than jobs not being run. But I suppose I can hand over to you now, Dan.
C
Sure, sounds great, and that was a great overview of the tooling there, Rob. One of the things we see when folks come to the project, and from our own experience as well: it can be a little bit daunting to understand what URL matches what tool and what each tool is used for, and there are some overlapping responsibilities.
C
We also, speaking for myself and probably the rest of the CI Signal members, have an issue of sometimes calling things by general names. You'll hear a lot of things referred to as "prow", for instance; I hardly ever hear anyone say "Spyglass", and so it can be a little tough to parse that out. So I just want to follow up on what you were talking about by saying: please feel free to ask questions. There's no such thing as a dumb question.
C
There are probably tools that we don't even know about, even though we've been doing this for over a year now, so definitely feel free to ask about that. But yeah, thanks for that overview, Rob. I'm going to go ahead and start sharing my screen here. Okay, I'm getting a "host has disabled screen sharing", so I might need...
B
C
Awesome, thanks, Rob. All right, so I'll go ahead and share my screen here; I always have trouble with this Zoom overlay.
C
All right, so we are set to go here. Rob's given us a nice overview of all the tooling, how to interact with it, and that sort of thing, and there are miles more layers to it; a lot of that you experience when you actually try to track down what issue you're seeing.
C
So in my personal experience, as Rob mentioned, I kind of came to Kubernetes through SIG Release and started out in CI Signal, and did a lot of the issue logging and that sort of thing. The more I did it, the more I became familiar with the fixes that were coming in from different SIGs, and I became more involved with that. So you'll start to see, especially if you spend a lot of time on CI Signal, that you'll start actually fixing some of the bugs yourself.
C
So we're going to walk through some of that. Another advantage of walking through it is that you see how your workflow goes through these different tools we've seen. So I'm going to go through a few cherry-picked issues that we've had over the last few months that demonstrate different scenarios you'll run into when trying to troubleshoot.
C
So this should also be useful for folks who are not interested in CI Signal but who work with a SIG or are doing feature work and end up fixing something that they introduced.
C
So Rob already called out that flakes and failures are the two main categories of failures that we're going to see here with tests, and there are three main root causes you can run into when something is failing. It's either a bug in the code, so the actual Kubernetes code base: there's a bug in an implementation there. That's what you generally think of as the purpose of testing, so that could be one situation.
C
Another could be a bug in the test, which means we're testing the wrong thing, and we'll walk through one of those. And then the last one, which can be the hardest to track down, especially if you're not involved with SIG Testing or SIG Release, can be an issue with infrastructure or tooling, and there are a lot of different things that can go wrong there.
C
So I have a few different examples that we'll walk through, but without any further ado, let's go ahead and start with a bug in the code, which is kind of the most straightforward example. This is an issue I opened on November 11th. This was for an informing job that we have on GCE, ubuntu-master-default, and you'll see the issue format that Rob already detailed; I basically just followed the general process. I got a ping that this test was failing.
C
I went ahead and looked at the board (which will be out of date at this point), went to this job, and saw that it was consistently failing at this point, which is why it would be a failing test rather than a flaking one. We had red all across the board here, so I opened it up and gave it the failing-test label. This job is informing, and we wanted to get this fixed up for 1.20.
C
As Rob said, informing jobs do not have to be passing for us to go ahead and release. That being said, it's definitely concerning when something is consistently failing, no matter where it is. So, once again, I have the output here of what we are getting from the job, and the next thing to note is the SIG, right? So how was I able to determine which SIG this went with?
C
So primarily it had to do with a Docker exec liveness probe. Obviously, Docker is running as the CRI implementation on a node, and that would cater to something in SIG Node, so I went ahead and put that SIG label on it. I was also able to determine the PR that introduced this failure.
C
So once again, as Rob was already detailing, when you go through here, we can look at both the kubernetes commit hash as well as the test-infra commit hash. I'll hop over to the test-infra repo in a minute, because that's where a lot of this tooling lives, as well as the configuration for all these dashboards.
C
So when I actually looked at the job that I opened this issue for, there was a distinct change in commit hash which went from green to red, and so I was able to tell it likely was introduced by whatever happened between those two commit hashes. And if you're not familiar, GitHub actually has a pretty useful way to compare commit hashes (that's not what I'm going to do here), and there are some shortcuts to be able to get this open.
C
But here's just an example I have in my search history: you can actually just supply the first commit and the second commit separated by an ellipsis, and you'll see all the commits that happened between those two.
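That compare view is just a URL pattern: two commits separated by three dots. A small helper makes the shape explicit; the owner, repo, and hashes used below are placeholders, not the hashes from this particular failure.

```go
package main

import "fmt"

// compareURL builds a GitHub compare URL. The three-dot form lists the
// commits reachable from "to" but not from "from", i.e. everything that
// landed between the two hashes.
func compareURL(owner, repo, from, to string) string {
	return fmt.Sprintf("https://github.com/%s/%s/compare/%s...%s", owner, repo, from, to)
}

func main() {
	// Placeholder hashes for illustration.
	fmt.Println(compareURL("kubernetes", "kubernetes", "1a2b3c4", "5d6e7f8"))
	// https://github.com/kubernetes/kubernetes/compare/1a2b3c4...5d6e7f8
}
```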
C
So this is actually for a different failure that I was tracking down that I have cached here, but I'll use this as an example. You can see here that all these commits are part of a single PR, so if whatever started failing went from green to red between those commit hashes, and there's only one PR, we can go ahead and determine that that was probably the PR that introduced whatever caused the failure.
C
So, looking at this PR, we can once again see it's kubelet work, so it is in SIG Node; you could also see that by the labels on it. And then Andrew, who makes lots of different contributions to the Kubernetes project and is a very important contributor, had introduced some new functionality to respect exec probe timeouts. If we go back over here, you'll see I went ahead and tagged him on here and kind of gave some context around the issue.
C
The first thing that I wanted to work out is: if this caused a consistent failure, how was it merged, right? When we look at the release dashboards here, we're looking at periodic jobs. So, as Rob said, that means that we're running them on a consistent cadence, which you can see here at the top as the difference between the hours that they're running. We also have presubmits.
C
So the ones running on a cadence are periodics; the ones that run on a PR before it goes in are called presubmits. So let's just go to a recently opened PR.
C
We can see who our winner is: let's look at this etcd version one here. You'll see that there are lots of jobs here that run against the PR before we merge, and those all have to pass if they are merge-blocking. So how did we pass all those tests and get something in that caused a periodic job, one that is fairly critical, with it being on the informing board, to fail?
C
So I'm going to go over to Spyglass here, as Rob showed, and there are a number of different things up here that are helpful: links to artifacts from different job runs. You obviously have a link back to Testgrid; Artifacts is going to show you all of the artifacts from that test run, so you can see the prow job configuration, the start and stop markers, and then the different output from things like the node logs and that sort of thing.
C
So once again, looking back at this, the prow job YAML is what is going to tell us how this job is configured, and I believe, if we search for it here, the important thing on this job was actually that we are running with the kube container runtime set to docker, and if we flip back over here, this obviously had to do with the Docker exec liveness probe.
C
Node end-to-end: so we have pull-kubernetes-node-e2e, which sounds pretty similar (well, not that similar) to the GCE master default job. If we look at the end-to-end tests here, we can see that it's running similar ones, but if we hop back over to the node end-to-end job and take a look at the prow job YAML, you'll see that it's not specifying docker as the container runtime. And I won't go into exactly how we're able to determine that a different container runtime is being used...
C
...but basically this is using containerd under the hood. So the issue here was that we had run tests against one container runtime, but this actually was a change in behavior on a different container runtime: docker versus containerd. We could talk about dockershim, some of the deprecation around that, and what the difference between docker and containerd is, but I'll leave that to a different talk. But essentially, right...
C
My initial inclination was that we were using incorrect version markers, but Andrew did a little more digging and we were able to see that it was because of the container runtime. You'll also notice that this was running with a different operating system, as it's obviously running on Ubuntu, so that was something to bear in mind; the presubmit was running with the Google container-optimized OS, and so that was another thing to investigate. Anyway...
C
I believe we talked a little bit about running this test in the specific job. Yep: so you'll see here (and we'll talk about what this means in a little bit) that during this PR, Andrew temporarily added the NodeConformance tag onto the test, which made it get exercised by the presubmit that was running, and so we got an idea of whether it was passing or not. So that allowed us, using the presubmit, to determine whether this was actually going to fix the problem.
C
So that's kind of a walkthrough of a somewhat typical bug in the code: new functionality was introduced, it broke something, it passed the presubmits, and we caught it in the periodics. Another follow-up to a situation like that is determining whether maybe we should be running this test in the presubmits, right? If it's going to cause breaking changes, maybe we need to be exercising that code path.
C
All right, the next thing that we can look at is a bug in the test. So for any of these jobs, essentially what's happening for most of them (and I'm excluding build jobs and things like that), the ones that are actually running things from our end-to-end test suite: what's happening is they're not rebuilding Kubernetes and running it...
C
...each time, right? They're getting the latest build of the Kubernetes release, depending on the job, and they'll download that and then clone the repo for the tests that are in the k/k repo. Just to make sure that we're defining everything: k/k is a common shorthand for the kubernetes/kubernetes repo, as there are many repos under the kubernetes org.
C
So if we look in test/ here, primarily we're thinking about these end-to-end tests. So what happens is these jobs will go ahead and download Kubernetes, they'll run it, and then they will exercise the tests in the test subdirectory here against the version that they've downloaded. If it's something like sig-release-1.20-blocking, you'll see... and let's look at one here real quick just to demonstrate this.
C
Well, we won't get into fast builds today, but essentially they're building for a single architecture and operating system, rather than the general build, which builds for all. So yeah: based on the version that we're trying to test, we'll download a different version of Kubernetes, clone that branch's tests, and run them against it. All right, so back to this example of a test failing because of a bug in the test.
C
This was one that once again flipped from green to red, but it was just flaky, right? We introduced a new test, and it didn't actually cause things to consistently fail, but we could tell from the scrollback (I just have this from context, but we can probably find one here as well)...
C
One of these might be a good example. So this one is having a pretty consistent flake rate, but if you had something where it was green for 30 columns or something like that, and then it suddenly started turning red every other column, that would be a good indication of when the code was introduced that caused the flake.
C
So in this case I wasn't able to pin it down to a specific PR, so you'll see that I have my emphasized "might be related to" here for this PR. I was able to determine when the flake rate increased and guess at the PR. It looks like... let's see if that was the... oh no, that was an issue I actually tagged. So it looks like Morgan here was actually able just to watch for this, and this is an example of the benefits of tagging.
C
So Morgan was able to follow up and say "oh, I know exactly what's happening here", because Morgan had context that I didn't have, which is really useful, and said "I can do the fix here", went ahead and got assigned to it, provided the fix, and tagged me on it. And this is a really good example of what can cause flakes based on how the test is designed.
C
You'll see what Morgan's doing is relaxing the matcher here. This specific test has to do with the resource metrics API, so we're looking at the metrics that are produced, and initially, when this test was introduced, it was "match all elements": basically saying, I expect the metrics that we see to look exactly like what I'm specifying here, no more, no less. In reality, that wasn't what we needed to test here.
C
We just want to see that the two metrics that we were interested in were present, and if we had, you know, a million other ones, that was all right. So you can see how this could produce flakes, right? Because sometimes there may be other metrics and sometimes there may not, depending on what else is running in the cluster and exposing them. So we can just change this to match elements and ignore the extras here, and that turned us back green. So that was a very quick fix here, and you'll see...
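The strict-versus-relaxed matching that caused this flake can be shown without any framework. The two functions below are invented for the sketch (the real test uses Gomega's gstruct matchers): the strict one fails whenever an unexpected extra metric appears, which is exactly the non-determinism the fix removed.

```go
package main

import "fmt"

// matchAll is the strict behavior: the observed metrics must be exactly
// the expected set, so any extra metric from another workload fails it.
func matchAll(observed map[string]float64, expected []string) bool {
	if len(observed) != len(expected) {
		return false
	}
	for _, name := range expected {
		if _, ok := observed[name]; !ok {
			return false
		}
	}
	return true
}

// matchIgnoringExtras is the relaxed fix: the expected metrics must be
// present, and anything else exposed by the cluster is ignored.
func matchIgnoringExtras(observed map[string]float64, expected []string) bool {
	for _, name := range expected {
		if _, ok := observed[name]; !ok {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical metric names, not the real resource metrics.
	observed := map[string]float64{
		"metric_a":        1.5,
		"metric_b":        2048,
		"unrelated_gauge": 7, // extra metric from something else in the cluster
	}
	expected := []string{"metric_a", "metric_b"}
	fmt.Println(matchAll(observed, expected))            // prints "false"
	fmt.Println(matchIgnoringExtras(observed, expected)) // prints "true"
}
```

Whether the strict version passes depends on what else happens to be running in the cluster, which is why it flaked rather than failing outright.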
C
And one thing I wanted to point out here, which Rob actually brought up when we were going through the preparation for this: you'll see this gstruct package here, and let me go up to the top and see where that's coming from. You'll see this is coming from Gomega, and essentially what it's doing is allowing us to match the contents of structs. But I wanted to point out that we use some external frameworks alongside the end-to-end framework in the end-to-end tests.
C
So Ginkgo is used across a lot of our tests, and you'll see it used to do things like set the context or run things before each test. The particularly important one, which Rob was pointing out earlier when searching with Hound, is Ginkgo's It, which prepends information onto the front of this description here. So you'll frequently see... let me go back to one of these and see... this wasn't a failure; let me grab a failure and see if this has it. So, yeah.
C
This is a good example here. It probably prepended "Kubernetes e2e suite", may have added "sig-storage", probably "CSI mock volume", "CSI Volume expansion", and then we probably just saw "should expand volume by restarting pod if attach on node expansion on", which looks kind of like what we're seeing here with "should report resource usage through the resource metrics API".
C
So when you get more familiar with interacting with some of these tests, when you see a failure, it's a little bit easier to go ahead and pull out what to search for, but it can definitely still be challenging. Also, the tags here, like sig-storage, can help you identify...
C
...what part of the end-to-end testing framework to look in. Rob, did you have something you wanted to bring up?
B
It's almost like a flattened hierarchy of the Ginkgo traversal down to the test. What Ginkgo is trying to do is describe in English what is happening, where it's happening, and how it's happening. That's what you typically find in end-to-end test frameworks, where you almost want to be able to just have an English statement of what the expected behavior of the test is. So I think the ultimate...
B
I think the ultimate next step would be to have Cucumber, and just be able to have end users read what is happening under the hood and extract out those English statements, so that people who aren't into the tests but are interested in the expected behavior can read those off. But that's for another day — that's another day's work; we've plenty to be getting on with as it stands now.
C
Absolutely, and as Rob says, there's lots of areas for improvement. One of the things I want to point out here, since we're talking about finding the actual test failure: obviously a good place to look is, you know, the code where the failure happened, right? So we have line numbers here, because these are Go tests that are being run.
C
Sometimes you have to follow this a little bit to, you know, figure out where it actually happened, because of the way that things bubble up with Ginkgo — like, if you're doing a BeforeEach or something like that, the failure may be on, you know, line 66 because of this failure down here. So that's typical Go testing stuff, and once again, if you have questions on any of these practices, definitely feel free to drop them in the sig-release channel or ci-signal.
C
The next thing is these test args that we pass. You'll see, as we've already shown in a number of different test failures that we've looked at, we have these indicators in the brackets here — basically tags on the different tests — and that allows us to choose different tests that we want to run based on the tags that they have, right? So here in this end-to-end GCE one, we want to skip all tests that have tags that are Slow, Serial, Disruptive, Flaky, or a Feature test.
C
You could also do things like ginkgo focus, which says only run tests with these tags, and then you can combine skip and focus, right, to get a specific subset. A good example of that is one that Rob was actually looking at earlier, I think: conformance, GA only.
C
If we actually went over to k/k here — and let me grab that again; I'm going to pain everyone and use the GitHub search, but it should be pretty easy here — you'll see that there's only really one area, or potentially two areas, where we're running GPU device plugins. So it'd be really easy to troubleshoot an issue with this, right, because it either happened here or here, likely.
C
So there's only really those tests that we're running, and we'll actually talk about this test in a minute as well. So when I pull up this prow job YAML that's configuring how everything is run — where does that come from, right? Well, test-infra is probably where you're going to spend most of your time if you're really interested in this whole CI signal and testing realm. You'll see some of the different names that we already mentioned here, and some that we didn't yet. A lot of these are used either to build images that are used to basically bootstrap the tests, or they're frameworks like kubetest, which allows you to, you know, spin up the Kubernetes cluster in a consistent way and test against it.
C
You'll also see Prow here, and some Testgrid stuff, although Testgrid is a separate repo, and there's lots of different things in here. One of the things — and I think there's an issue open for it — is that some of the tools are outlined here, but it'd be really great if we could get a diagram that showed how all these tools interact with each other, and get an overview of this. I know that's something Rob is passionate about.
B
Well, you want the diagram when you're starting out, and then as you learn as you go you see — oh yeah, it's okay, they just talk to each other. We're having a lot of chatter in the chat about data retention and how CI is data-backed, so Carlos is asking some good questions about how far Testgrid results go back.
B
Broadly speaking, there's two aspects to data storage from running CI jobs in the Kubernetes project. The metadata is, I think, loaded up by Kettle, which is what Dan is visiting there now — that's Kubernetes extract, transform, load; that's the ETL. I don't have great stats on the amount of data going into BigQuery, but I just know instinctively — you know, they're using BigQuery for a reason — it's a lot of data. So metadata pertaining to jobs should be funneled into BigQuery.
B
When it comes to job artifacts, the limiting resource there would be buckets — yes, so it'd be data storage buckets. And like Dan is saying, all of these tools are, for the most part, pretty much in test-infra — test-infra is, I suppose, a monorepo with a lot of tooling in it — and I suppose the most famous one, or the application that I know has moved out recently, has been Testgrid.
B
So the back end of Testgrid has been open-sourced, and the front end is not yet open-sourced. There's a lot of little things that I'd like to do for the CI signal use cases there, but when it comes to getting expert and deep knowledge: if you log issues on the new Testgrid repo, and if you make suggestions for front-end changes that you would like to have happen, and if they are implementable, the team who work on Testgrid will make those changes for you — and they're awesome to work with.
B
I speak to Michelle Shepardson, who works on that project, once every two or three months, and we talk about what we'd like to have happen on Testgrid, and she's awesome — she does great work on that.
C
Awesome, thanks for pointing that out, Rob — and that's a great point. You know, a lot of these issues that I'm going to here, I'm not really able to show you exactly what was happening at that point in time. If we still have that data available we could present it, but I also don't have all the context on what the data retention story is there.
C
And if we look at jobs — and Rob, you may have already said this, but I wanted to point out once again these different tabs here. First of all, the tabs at the top are dashboards, right — they're Testgrid dashboards. Each of the tabs listed under a dashboard is a job, and then each of the line items — well, build-master-fast was not a great one to pick for that; here we go — each of the line items here is a test that is being run. It may just be me, but when I first started out with CI signal it was really difficult for me, because sometimes we'll say, you know, "this test is failing" and we really mean a job.
C
Or, you know, "this job is failing" — and on all of our issues, which is something that I've kind of pushed on in the past, we'll say "Failing test". Sometimes, as in this example, I actually have the test listed there, but sometimes we'll have a job name, if we aren't able to pin it down to a single test or it's a lot of tests failing, and we'll say "Failing test" with a job name — which, yeah, is kind of true, but it's not as exact as I'd like it to be. So anyway, something to keep in mind.
C
It goes dashboards, jobs, tests. And as I mentioned, all of these job configs — and the association of the jobs to the dashboards that they're presented on in Testgrid — all live in test-infra, so we'll primarily be looking at the Kubernetes one here, where you'll see all the different sig ones. Specifically for sig-release, we can look at their release-branch jobs, which are actually auto-generated.
C
If there is one — if it's something like general end-to-end tests, there may not be a single sig that owns it. Go ahead.
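As an aside on how that association works: in my understanding, a prow job is attached to its Testgrid dashboards through annotations on the job's config in test-infra. A hedged sketch, with illustrative dashboard and tab names:

```yaml
# Sketch of the testgrid annotations carried on a prow job definition
# in test-infra (the dashboard and tab names here are illustrative).
annotations:
  testgrid-dashboards: sig-release-master-blocking, sig-release-master-informing
  testgrid-tab-name: gce-cos-master-default
```

Listing several dashboards is what lets one job appear on multiple boards at once.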
B
Yeah, so one of the things that might be worth pointing out here, as we look at this, is that as part of the efforts to corral, manage, and tend to CI jobs on the project, we have organized sort of a program of work in order to maintain these jobs. There was a release where we were getting a lot of noise in our signal pertaining to jobs not stating their resources properly.
B
So if we look at lines 36 down to 42 there, this job is requesting limits for CPU and memory, and requests for CPU and memory — and a lot of jobs didn't have those resources specified. As a result, it made life difficult for the scheduler to schedule those jobs on the infrastructure.
B
So from the point of view of finding work to do — I'll add the project board to the HackMD — the sig-testing team, and Aaron Crickenberger, did great work with Lauri Apple to set out the program of work whereby the sig-testing team and testing ops could get help from the community in order to improve CI configuration and things like that.
C
Yeah, absolutely — and since you bring that up, a good thing to remember with all of these different tests is that they are running in a Kubernetes cluster themselves, so they get scheduled to a node. Essentially, you know, just how pods get scheduled in Kubernetes: these jobs are running in pods, on a GKE cluster actually, and here you'll see that we have the resources for them, so that controls how they get scheduled to the nodes, right?
C
So, for instance, if I find one of the kind jobs here — they have pretty heavy CPU requests.
C
In fact, this actually means that they have to run on a node by themselves, based on the node size that we're running. And if there isn't a node available in the cluster for a job to run, then you'll see it gets FailedScheduling after a timeout of something like five minutes — that's one of the examples I have here in a moment, but that's the general idea. So yeah, all of the jobs live in the test-infra repo.

Another thing I want to mention is that the same job may be on multiple dashboards. So, for instance, this one on 1.17-blocking is a conformance job — so if you go over here and look at conformance-all, we would see, you know, all these different version conformance jobs here, also on this dashboard, right?
C
So a job being on a dashboard does not mean it's only there, and this allows different sigs to, you know, prioritize the different jobs that they want to look at — and there may also be, you know, other sigs or other groups that are interested in their status as well. All right, so many things — it's hard to get them all in there, but we're going to try. Also, I know we are kind of reaching the hour. Rob, as my counterpart in this talk, how are you feeling — you good to keep going? "I'm good to keep going, yeah." All right. Well, Bob, if you need to kick us off, you just let us know, but we've got a little bit more here, and we'll keep pushing until someone tells us we're not allowed to anymore. All right.
C
So this is an especially fun one — let me close out some of my tabs here. And this is actually — I can see my phone blowing up a little bit with some messages about this, because it's not completely resolved yet. But I chose this fail— well, it's really more than just failing tests, but there were quite a lot of failing tests. And this is very recent, so you'll probably see that they're still there — probably on 1.20-blocking we should be able to see some of them.
C
Let's go back a bit — yep, all right. So you'll see that this immediately started failing here, and none of the tests were actually even being run. So if we click on one of these, you'll see the extract step here — which is basically where we download that version of Kubernetes, based on the version marker that we have — was failing, right?
C
So we weren't able to run any tests, because we weren't even able to get a Kubernetes cluster to test against. And if we look back at the commits here: number one, we see that the test-infra commit didn't change. There is a change in commit hash for the other indicators here, but it went to "missing", so this isn't really helpful, right?
C
It was a shorter time for 1.19, and we'll see why in a second. This basically started happening across all the different branch-blocking boards. So the immediate thing I thought was, you know, when's the last time that we released? Well, we just had the 1.20.0 release, and it was about the same time that the jobs on 1.20 started failing — and then the next day we had patch releases for all the different branches, and that's when they started failing.
C
So there was probably something wrong with that release. It turned out that we had some faulty logic in our release tooling, which led to — we can go to this; well, I should have just clicked on "all releases" here.
C
You can see that, for instance, the 1.17.15 and the 1.17.16-rc.0 are on the same commit here, and what we want to do is actually separate those commits, so that we're able to determine they happened at different times. Because they were on the same commit, for all of the builds that were happening — we'll do this for 1.20 — the version marker that we're using is saying it's 1.20.0-1 plus the digest of the hash there, and what it should be saying is rc-whatever, or rc.0-whatever, for 1.20.1, right? We're moving towards the next release, and that wasn't happening because of those commits being on the same hash. And the way that version marker gets there is when this build job runs.
C
Sorry — we have to build these versions for the tests to start using. When they ran, based on the version that we are building, we'll say "publish extra version markers" — so, for instance, here we're saying publish a k8s-beta version marker — and then it also sees that we're on the 1.20 branch here. So if we look at some of the logs here, you'll see that we're publishing extra version markers, and then if we go down and actually look at the copy — let's see if I can find it.
C
Yep, so here you'll see that we're publishing version markers: latest, latest-1, latest-1.20, and k8s-beta. All right — and Bob says we do need to wrap up soon. Yeah, I'll basically say that we were building on the wrong commit, and this is just an example. The reason why I bring this one up is that it's an example of something that had nothing to do with bad code in Kubernetes, bad tests in Kubernetes, or even the infrastructure that the job was running on.
C
There was an issue in repo machinery — there's an issue in our release process — that caused the version that we were using to not match a regex that basically said "this is an appropriate version to download and use to run tests against". Maybe we can follow up with a separate discussion that talks about all the different layers this exposes, traversing, you know, from code in the release repo, to code in test-infra, to code that's in kubernetes/kubernetes, to download and extract that version.
C
But this is an example of how difficult it can be to find why a test has started failing. So definitely always ask for help, right, because there are other people who have context that you may not have. And the last thing I want to point out here is an example of a pod not being able to be scheduled — so this is what Rob was talking about, right, with those requests and limits.
C
It basically says there wasn't an available node, due to either insufficient memory, taints on the nodes, or insufficient CPU. It basically said we weren't even able to attempt to run this job — and that's something, and I think that this one I did here, yeah, that's something that you may want to open on test-infra, to say there's an issue with either our job configuration or our underlying —
B
Infrastructure, yeah. And I think it's worth saying that a massive, massive amount of work and effort goes into keeping everything up and running, from a job-running point of view. It is no small undertaking, and the team who keep things up and running are awesome — you know, they do great work, and when there's problems, you know, they're very, very responsive.
B
So — I think it hasn't failed, yeah. So I think we're gonna maybe finish up there, and just thank you, Dan, for all of that good info. You can see the diff — you can see how far ahead Dan is on the CI signal journey than I am — and, oh, but you're doing that; you do the fun stuff that I want to do next, you know. So I'd just like to thank everyone for attending.
B
What I would say is, if you want to follow up with the team, you can reach out to us on #release-ci-signal on Kubernetes Slack, and if you have any further questions that you want to ask, you can follow up with those questions there. And, broadly speaking, there was a lot of interest in what data goes where and how long we can see data for. Josh Berkus has pointed out a few features that he'd like to see in Testgrid, and I'd say to Josh: log them as issues on the Testgrid repo, and if they're doable, you know, they'll get done in time. And just thanks to everyone for the questions, and thanks to Joyce Kung for providing support during the chat as well.
B
Joyce is going to be the new team lead for CI signal for the 1.21 release, so I'm really looking forward to that. And yeah, I think we can wind it up there, Bobby, and end the meeting. Thanks, Dan.