From YouTube: 2020-09-18 CAPZ Office hours
A
Good morning, everybody. This is the Cluster API Provider Azure office hours meeting for September 17th. We follow the CNCF standard process, so please be nice to each other and raise your hand if you want to ask a question or say something. Now we can get started with today's meeting. Please add your name to the attendees list in the document.
B
Yeah, so, I mean, as you obviously know, but just to let everybody know: we talked briefly yesterday about what's going on with e2e, in terms of, you know, we have several jobs that provision.
We have several tasks that provision clusters, and they're all running in serial, and now we're proposing to add an IPv6 cluster. If that runs in serial, then our tests are going to take a horrendous amount of time, but we all thought this should have been working. What I found out yesterday, and I can show empirically, is that the -focus flag that we were passing in, even if it didn't have an argument, is short-circuiting the sharding and the multiple nodes that Ginkgo does.
So, we weren't actually using the focus flag in the context of the PR testing, so I have a PR out there that just removes the entire flag, and that seems to fix it. This corresponds directly to a comment I remember reading by Onsi, the author of Ginkgo, saying "oh yeah, if there's focus, we don't bother doing the sharding logic and using multiple nodes," because of reasons I've forgotten why.
I can't find this comment now, but I know it exists, and I'm sure this is a thing, but I'm having trouble finding a justification for it, other than that we can see that this happened in the PR. So I think if we merge that... The other thing that's confounding here is, obviously, Nader, you were showing yesterday that promoting our It clauses in the testing up from outside of a Context seems to suddenly unblock the parallelization, right?
C
Yeah, I was just going to ask, it's kind of the same thing: are you removing the focus flag always, or just when it's not set? Because in test-infra we set the focus flag for the CI tests. So, since we have like two suites, we have CAPI and then the Azure one.
B
That's what it looks like to me, and I think that's what happens when you use focus in Ginkgo. Now again, I'm having trouble finding the justification for this, but I know I've read that. I was starting to look at the code for Ginkgo yesterday; I was having trouble finding where that happens.
B
Yeah, so I guess, I mean, maybe I need to pin that down in terms of where that is in the code or documentation, so we can know for sure that focus always destroys the parallelization. But that's what seems to happen.
A
Yeah, it might be. Yeah, maybe there's like another option you have to set or something, but when I was showing you all yesterday and I made the change and removed the Contexts, I think I had a focus at the time, so because I was only running the two tests of Azure, it seems to be running in parallel, so you might have...
B
I mean, I was hoping to make the simplest change possible, and the documentation suggests that, you know, these should be getting parallelized whether or not they're inside of a Context, but we're not all seeing the same behavior. So it's very confusing, yeah.
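(A minimal sketch of the two Ginkgo suite layouts being compared above. The spec file, spec names, and the "ginkgo -nodes 3" invocation in the comment are invented for illustration and are not the actual CAPZ e2e suite; in both layouts each It is its own spec, and the unresolved question in the meeting is why passing -focus seemed to collapse the run back to serial.)

// e2e_layout_sketch_test.go: a made-up Ginkgo (v1) spec file illustrating the
// "It inside a Context" layout versus the "It promoted to the top level"
// layout discussed above. Either way, each It is its own spec; a parallel run
// would be invoked with something like:
//   ginkgo -nodes 3 ./test/e2e
// and the question in the meeting is why adding -focus appeared to force the
// specs to run serially instead.
package e2e_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestE2E(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "capz e2e sketch")
}

// Layout 1: specs nested inside a Context block.
var _ = Describe("workload clusters (nested)", func() {
	Context("creating clusters", func() {
		It("creates a default cluster", func() { Expect(true).To(BeTrue()) })
		It("creates an ipv6 cluster", func() { Expect(true).To(BeTrue()) })
	})
})

// Layout 2: the same specs promoted out of the Context.
var _ = Describe("workload clusters (promoted)", func() {
	It("creates a default cluster", func() { Expect(true).To(BeTrue()) })
	It("creates an ipv6 cluster", func() { Expect(true).To(BeTrue()) })
})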
B
Well, so I guess the impetus for focusing on this yesterday was that we really can't merge the IPv6 change until we figure out the parallelization, because that's just going to add an egregious amount of time. So I still think it has something to do with focus, but obviously I don't have the final answer, so I'll continue looking at it this morning. Cool, okay.
A
You can move on, okay. Thank you, Matt, for all that work. Yeah, sure. Cecile, you mentioned the fast delete PR, yeah?
C
Yeah, sorry, so yeah, I just wanted to give an update. So yesterday we talked a bit about how we could make Azure cluster delete faster.
C
So currently, what happens is: whenever you delete a cluster, that triggers all the machines to be deleted in Cluster API. Cluster API deletes all the children manually and then waits for all the children and their children, so that deletes all the machines, which then triggers the AzureMachines to get deleted, and that happens before the cluster is deleted, which means the AzureCluster waits for the VMs to be deleted before it's deleted. And since in CAPZ we delete every resource by order of dependency, one by one, it can sometimes take a long time.

Whenever you're deleting your entire cluster, and your cluster is in a resource group, and that resource group was created as part of your cluster, which means it's managed, then deleting that resource group is typically enough, because when you delete the resource group, Azure will go and take care of the dependencies for you, and it's usually faster because you don't have to wait for the result of every single resource.
So this is kind of a best-effort attempt to leverage that when we can, so take a look at the PR. I haven't added tests yet, but basically, what it does is: whenever you delete an AzureMachine, it checks if the owning cluster has a deletion timestamp, which indicates that the cluster is being deleted. Then we skip the AzureMachine deletion and we just remove the finalizer right away, and then we go to the AzureCluster, and the AzureCluster takes care of deleting the resource group and cleaning up the whole thing. If those conditions aren't met, so if this is not a cluster deletion, if we're only deleting an AzureMachine, or if the resource group was brought by the user, which means there may be other resources in that resource group that we don't want to delete, then we don't want to delete the entire group.
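(A rough Go sketch of the check just described. The types and function names are simplified stand-ins invented for illustration, not the actual PR code: the idea is that an AzureMachine being reconciled for deletion can just drop its finalizer, leaving cleanup to the resource group deletion, only when the owning cluster is itself being deleted and the resource group is managed by CAPZ.)

// fast_delete_sketch.go: a hypothetical, simplified illustration of the
// decision described above; the real PR works on the actual CAPI/CAPZ API
// types and removes the finalizer in the controller, not on these stand-ins.
package main

import (
	"fmt"
	"time"
)

// Cluster stands in for the CAPI Cluster object; DeletionTimestamp is set
// once the cluster itself has been asked to delete.
type Cluster struct {
	DeletionTimestamp *time.Time
}

// AzureCluster stands in for the CAPZ AzureCluster; ResourceGroupIsManaged is
// true when CAPZ created the resource group as part of the cluster.
type AzureCluster struct {
	ResourceGroupIsManaged bool
}

// skipMachineTeardown reports whether an AzureMachine being deleted can skip
// deleting its VM (and just remove its finalizer), because the whole managed
// resource group will be deleted by the AzureCluster reconciler anyway.
func skipMachineTeardown(c Cluster, ac AzureCluster) bool {
	clusterBeingDeleted := c.DeletionTimestamp != nil
	// A user-provided (unmanaged) resource group may contain unrelated
	// resources, so we must not rely on group deletion in that case.
	return clusterBeingDeleted && ac.ResourceGroupIsManaged
}

func main() {
	now := time.Now()
	fmt.Println(skipMachineTeardown(Cluster{DeletionTimestamp: &now}, AzureCluster{ResourceGroupIsManaged: true}))  // true: drop the finalizer
	fmt.Println(skipMachineTeardown(Cluster{}, AzureCluster{ResourceGroupIsManaged: true}))                         // false: standalone machine delete
	fmt.Println(skipMachineTeardown(Cluster{DeletionTimestamp: &now}, AzureCluster{ResourceGroupIsManaged: false})) // false: user-owned group
}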
A
Okay, I guess the next thing is also Cecile's, since checks are here.
C
Yeah, so I just wanted to remind everyone that there's a doc that we put out there two weeks ago, and if you haven't had a chance to review it yet, please take a look. I kind of brought this up at the CAPI office hours yesterday as well, because the design would require adding a sentinel file from the bootstrap provider perspective, or like in the bootstrap provider contract, but this is really the Azure side of it.
C
Sounds good. I'll check with jargon and one of us will do it.
A
I've run it at least 10 times, and it always works for me, like the upgrades with different versions. I think it's just a timeout issue, because it seems to be mostly failing for timing out in the periodic job.
But in all my local tests, it seems to take about 30 minutes roughly on average for that part of the upgrade, like upgrading the three control plane machines from one version to the other version, and the timeout we have is at 40 minutes. So I don't know how slow that is on the CI. You would think the CI would be faster than my local machine, but I haven't been able to recreate it.
I'm just kind of hesitant to just bump up the time, because 40 minutes seems like a lot already, so I was hoping to wait for the changes with the parallelization and all that end-to-end refactoring. Maybe that will clear up some of those issues. And it's failed like three or four times on the periodic job just since yesterday. So.
D
What kind of setup do you have locally?

A
Nah, it's pretty... it's not very new. I'm looking...
D
And just curious, I've had issues running the tests locally where it would be flaky, mostly on it attempting to clean up after itself; it would just kind of hang.

D
Okay, how many are you dedicating to the Docker VM?
D
How many cores are... okay. On CI, is CI only running with three cores? Like, it's probably a pretty constrained container. That's fair, but then again, why aren't these... are we seeing flakes upstream in CAPI? I saw an email go through about KCP flakes upstream.
A
I think that wasn't the adoption one, and there's a PR to fix some of the tests for it. I've seen a couple of PRs to fix end-to-end stuff, but nothing about the upgrade one. And because we were having trouble with, like, the 1.19.0 and 1.19.1 and all that, I think I've done like 20 or 30 upgrade tests in the last few weeks, and they've never timed out for me. But now I think I have a good laptop or something.
C
Yeah, that's true, they don't have the timeout issue as much. How about we just increase the timeout, just to collect data? Since this is running periodically, and we know how many failures we got over the last two weeks, we just try to do this over a week and see if we're still seeing the same amount of flakes.
B
Since we're here, Cecile, do you want to pair on making new images?
C
Yeah, let's do it. They were released yesterday, right? Yeah, okay, we'll do that. They'll be available by...
A
Okay. Oh, by the way, sorry, a while back I told Cecile and David that I would try to find a better way of doing the release notes, and I looked into it a little bit, and then I realized that CAPI has their own little tool that does that. And the Kubernetes one that we use is really not going to be very suitable for us, so I started trying to take the CAPI one and make some changes to it.
D
Why don't we put them in the PR template, in kind of a comment, like a comment block, so it just prompts the proper usage? Yeah, that's what...
A
Yeah, we can try it for a few more weeks and then decide. Sounds good. Okay, I think that's it. Thank you, everybody, for joining the meeting, and have a good... a good day.