From YouTube: Kubernetes SIG Windows 20210817
A
Let's just get started. We can do some introductions if any folks want, though we haven't had a lot of use for this section of the meeting notes, so in the future we might consider dropping it and just seeing what happens going forward.
A
But I don't see too many new folks on the call, so we can go right into announcements. I don't have any announcements for today. If anybody has any announcements, please raise your hand.
B
So we had a report a while back that when the metrics summary endpoint gets called, there's a context deadline exceeded, and this was reported a few different times. There are a couple of different scenarios where this happens, but we thought we'd fixed it by bumping up the priority of the kubelet, kube-proxy, and some other services like that, and that did seem to eliminate it for the most part.
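As an aside on what that priority bump looks like: a minimal Go sketch, assuming golang.org/x/sys/windows and using the current process as a stand-in for the kubelet service (the real mitigation was applied at the service-configuration level, not via code like this):

```go
//go:build windows

package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

func main() {
	// Raise the current process to ABOVE_NORMAL; a service manager or
	// wrapper would do the equivalent for kubelet/kube-proxy.
	h := windows.CurrentProcess()
	if err := windows.SetPriorityClass(h, windows.ABOVE_NORMAL_PRIORITY_CLASS); err != nil {
		panic(err)
	}
	fmt.Println("priority class raised to ABOVE_NORMAL")
}
```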
B
But we still got a report of it, so I was looking into it a little bit further, added some more metrics, and I was able to reproduce it. It's very hard to reproduce, but essentially what I found was that if there are multiple concurrent calls to the stats endpoint, one of those calls can hang and take a fairly long time to return.
B
In most cases it's only a minute to maybe two minutes, though in some really long cases it can be much longer than that. So I started looking into this further. The metrics-server has a 60-second timeout, and when its call takes over that minute, the metrics-server times out. Then, when you do kubectl top nodes, you don't get those metrics back for that node for that time period. This also manifests if you're using HPA: the HPA can't query the metrics-server, because the metrics-server doesn't have the metrics for those nodes, and then your stuff doesn't scale.
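A minimal sketch of the calling side, assuming the kubelet read-only port at localhost:10255 (an illustrative address, not part of the report) and the same 60-second deadline the metrics-server applies:

```go
// Query the kubelet summary endpoint with a 60s deadline to observe
// "context deadline exceeded" when node-side collection hangs.
// This is a sketch, not the metrics-server code.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Assumed: kubelet read-only port reachable at this address.
	url := "http://localhost:10255/stats/summary"

	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A hung concurrent collection on the node surfaces here as
		// "context deadline exceeded" after roughly 60s.
		fmt.Printf("after %v: %v\n", time.Since(start), err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("got %d bytes in %v\n", len(body), time.Since(start))
}
```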
B
So I started looking into the code a little bit further, and that's when I opened up a few of those other issues. It happens to be in dockershim — and this is actually the case for containerd as well — that we make two separate calls to the HCS stats endpoints for the containers.
B
I think we used to go through Docker and then we went back to going directly to HCS. When that happened, we also collected network stats in a separate part of that call, and so we actually open up the containers twice and then call stats on them twice, which adds quite a bit of CPU load to the system. When I removed that, I saw about a 40% decrease in CPU.
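A sketch of the single-open pattern being described, assuming the legacy hcsshim v1 API (OpenContainer/Statistics); illustrative only, not the actual kubelet change:

```go
//go:build windows

package main

import (
	"fmt"

	"github.com/Microsoft/hcsshim"
)

// collectStats opens the container once and queries statistics once,
// instead of the open/stats pair happening twice per collection.
func collectStats(containerID string) (hcsshim.Statistics, error) {
	c, err := hcsshim.OpenContainer(containerID)
	if err != nil {
		return hcsshim.Statistics{}, fmt.Errorf("open container %s: %w", containerID, err)
	}
	defer c.Close()
	return c.Statistics()
}

func main() {
	// "example-container" is a placeholder ID for illustration.
	stats, err := collectStats("example-container")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("cpu total runtime: %d (100ns units)\n", stats.Processor.TotalRuntime100ns)
}
```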
B
So I went to try to fix that, and I noticed that containerd doesn't have network stats. This is because of how the network stats work for containerd: containerd creates the container with the v2 hcsshim API.
B
But when we query for the network stats, we use the v1 API, and so you don't get any stats coming back for those containers. So what I did was go back to hcsshim, expose the network stats there, and then plumb them into a PR that I have open.
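A sketch of pulling per-endpoint network counters from HNS, assuming hcsshim's HNS helpers; treat the exact names and fields as assumptions rather than the final shape of the PR:

```go
//go:build windows

package main

import (
	"fmt"

	"github.com/Microsoft/hcsshim"
)

func main() {
	// List all HNS endpoints on the node, then ask HNS for the
	// per-endpoint counters (assumed hcsshim helper names).
	endpoints, err := hcsshim.HNSListEndpointRequest()
	if err != nil {
		panic(err)
	}
	for _, ep := range endpoints {
		stats, err := hcsshim.GetHNSEndpointStats(ep.Id)
		if err != nil {
			fmt.Printf("endpoint %s: %v\n", ep.Id, err)
			continue
		}
		fmt.Printf("endpoint %s: rx=%d bytes tx=%d bytes\n",
			ep.Id, stats.BytesReceived, stats.BytesSent)
	}
}
```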
B
It's the work-in-progress one. I'm waiting on a release of hcsshim, which should come out this week; then I'll remove what we had, vendor in the new hcsshim, and rebase the PR. One other thing I did here: when I was testing the metric stats, I ran four to five concurrent calls to the metrics endpoint at the same time. This can actually happen, because the eviction component will call the metrics endpoint, the metrics-server might call it, and then if you have any logging components — in our case it was Azure Container Insights — those also call it. You get three or four of these stacked up on top of each other, and here's what happens in the dockershim case.
B
Dockershim opens up the named pipe for each one, and that causes a write to the file system, which causes higher CPU during that time period, and it goes up exponentially as you make more calls to the stats endpoint or as other things are happening on the system. So if there are other containers starting or stopping and you make a call to this endpoint, it goes up exponentially. So the other thing I did was refactor the metrics endpoint to not make any Docker calls.
B
We weren't collecting any extra information there that we didn't already have. So that's a summary of all the things — it was a bit of a rabbit hole as I went through it. But if you can ever take a look and give some feedback, that would be useful. I don't know if other folks are also seeing this, but I wanted to bring it up here and share what's been going on.
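For the stacked-callers problem described above, a common Go pattern is to collapse duplicate in-flight collections with singleflight; this is only a sketch of the pattern, not the fix that landed:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// collect pretends to be an expensive stats collection.
func collect() (interface{}, error) {
	time.Sleep(2 * time.Second)
	return "summary", nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // four callers stacked up, as in the report
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// All four callers share one underlying collection.
			v, _, shared := group.Do("stats", collect)
			fmt.Printf("caller %d: %v (shared=%v)\n", n, v, shared)
		}(i)
	}
	wg.Wait()
}
```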
C
I haven't seen it, but I haven't done a lot of extensive testing of HPA, to be honest. So it might be something that I'll have a little bit of a play with, just to see what I spot.
A
Yeah, as James mentioned, at least over in Azure we were seeing this quite a bit, and when we did start setting the process priorities for the kubelet, that did help. But we're still seeing this under a lot of system pressure.
B
So it doesn't, because the call to collect stats through the HCS v2 API doesn't return network stats at all, while v1 HCS stats did. We have an internal bug to fix that, and it's going to become important when we go to the CRI-only stats.
B
But in the meantime, by exposing these stats through HNS, we'll get those stats exposed until we're able to cut over, which is probably still a couple of releases away.
E
Yeah, I haven't noticed issues with timeouts, but one point to confirm is the network stats. I noticed this with containerd versus Docker: the summary endpoint doesn't return any network stats. So it would be great when we get this fixed.
A
I'll also mention that I did see the CRI-only stats was added to the SIG Node agenda today to discuss some issues — I think on Linux they're having issues with some file system stats. I'll try to mention this as well: moving to the CRI stats will definitely, or should definitely, help with Windows performance.
A
We have one very basic test that I believe I added a while ago, when we tried to fix the stats timeouts hanging in the 1.17/1.18 time frame. What that test does is create a number of containers — I think it's 10 containers in parallel — then query the metrics endpoint a number of times and check whether the average duration of that call is below a certain threshold.
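A rough sketch of that kind of check; the endpoint address, iteration count, and threshold here are illustrative, not the values from the actual e2e test:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// averageSummaryLatency hits the summary endpoint repeatedly and
// returns the mean request duration.
func averageSummaryLatency(url string, iterations int) (time.Duration, error) {
	var total time.Duration
	for i := 0; i < iterations; i++ {
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		total += time.Since(start)
	}
	return total / time.Duration(iterations), nil
}

func main() {
	const threshold = 10 * time.Second // illustrative threshold
	avg, err := averageSummaryLatency("http://localhost:10255/stats/summary", 5)
	if err != nil {
		panic(err)
	}
	fmt.Printf("average: %v (limit %v)\n", avg, threshold)
	if avg > threshold {
		panic("summary endpoint is too slow")
	}
}
```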
D
Regarding stats collection, there is one use case where they are actually pulled and used: for example, horizontal pod autoscaling, in which, basically, if the CPU usage goes over a threshold, the number of pods should increase or decrease accordingly. But those tests are not conformance tests, so I don't think all the jobs are running them.
A
Yeah, we marked that one as serial because, when we were running it as part of the daily test pass, sometimes there wasn't enough space on one of our test nodes to schedule 10 containers in parallel, and so we were seeing test flakes with it timing out waiting for all the containers to start. But we do exercise it regularly in our CI jobs under the serial/slow test pass.
B
It's a good thought — maybe I'll see if I can come up with some sort of test. Mm-hmm.
C
I was going to say it would be quite cool to at least know we're getting each metric. We could have a test that checks whether we're getting CPU for the pods, whether we're getting memory, whether we're getting disk usage and network — obviously only where those apply — and that way at least we can see that an expected result is returned and not just zeros on them, or something.
B
There is one test that does some sanity checking like that — it just makes sure that there's a value there — which I added for the log collection metrics in 1.20, I think. So maybe we can extend that a little more to also catch the network stats and make sure we don't drop those again.
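A sketch of that kind of sanity check, decoding the summary response and asserting key counters are present and non-zero; the trimmed struct is a stand-in for the real statsapi types, and the address is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// summary mirrors only the fields checked here; the real kubelet
// response is the full statsapi.Summary.
type summary struct {
	Node struct {
		CPU *struct {
			UsageNanoCores *uint64 `json:"usageNanoCores"`
		} `json:"cpu"`
		Network *struct {
			RxBytes *uint64 `json:"rxBytes"`
			TxBytes *uint64 `json:"txBytes"`
		} `json:"network"`
	} `json:"node"`
}

func main() {
	resp, err := http.Get("http://localhost:10255/stats/summary")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s summary
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	if s.Node.CPU == nil || s.Node.CPU.UsageNanoCores == nil {
		fmt.Println("FAIL: node CPU stats missing")
		return
	}
	if s.Node.Network == nil || s.Node.Network.RxBytes == nil || *s.Node.Network.RxBytes == 0 {
		fmt.Println("FAIL: node network stats missing or zero")
		return
	}
	fmt.Println("OK: CPU and network stats present")
}
```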
F
I think it's just deserializing what we already have, but I'm not sure — if anybody wants to dig in, I feel like this might be a great getting-started issue for the e2es for Windows. Basically, it would be investigating our test coverage of metrics. I'll write it up as a good first issue.
D
And I can help with that if needed. Some time ago I did, basically, metrics collection for the CRI integration testing in containerd, including for Windows, so I can basically copy-paste that code if needed. Right now it tests CPU and memory and so on.
A
All right — that's some really good investigation and really good exploratory work here. Is there anything else to discuss? Otherwise we can move on to a couple more of the agenda items.
A
So let's do that. The next one is a test flake that we've been seeing occasionally, and during the last iteration of the backlog refinement we decided to bring it up to the community meeting because we're seeing it fairly consistently, so I wanted to discuss it here. There's an issue with the test "Job pods should delete a collection of pods on Windows", and there's a bunch of comments on it where I tried to investigate this a lot.
D
Yes — basically one idea, from looking through the issue.
D
Basically, I noticed that that test in particular spawns three pods and immediately deletes them, like a matter of milliseconds apart. But the way kubelet works, it goes through with the pod creation, including all the necessary steps — image pulling and so on and so forth.
D
So that got me thinking: what if you actually spawn a pod and you just decide to delete it right afterwards, especially if it's a pod which uses a really huge image? You would still want to cancel that context and that request — you wouldn't want it to continue pulling a few gigabytes of data just to delete it afterwards.
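A sketch of the proposed behavior, tying the pull to a cancellable context so that pod deletion aborts it; pullImage here is a hypothetical stand-in, not actual kubelet or containerd code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pullImage stands in for a long-running image pull that honors
// context cancellation (hypothetical helper).
func pullImage(ctx context.Context, image string) error {
	select {
	case <-time.After(5 * time.Minute): // pretend pull duration
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Pod deletion arrives moments after creation, as in the flaky test.
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // cancel the pull instead of letting it run to completion
	}()

	err := pullImage(ctx, "example.com/huge-image:latest")
	if errors.Is(err, context.Canceled) {
		fmt.Println("pull canceled by pod deletion")
	} else {
		fmt.Println("pull finished:", err)
	}
}
```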
D
Apparently, there have been quite a lot of flakes, from what they saw in Kubernetes, when it comes to racing events between creation and deletion of pods. So this might be something they don't particularly want, but it might be useful for us at the very least, since we tend to have bigger images.
A
Yeah, that makes sense. Let me tag SIG Node on the issue, and then we should also bring this up with SIG Node directly.
A
I don't really remember ever hearing discussions about that behavior before, so I don't know if there's a good reason from the SIG Node folks for not canceling that, but I think that's the next place to take it. Is anybody else seeing issues similar to this in testing, or does anyone have use cases that would exercise this? Or does anybody see a reason why we would not want to cancel the context of starting the container in situations like this?
D
Yeah, no problem. I can only tell from my experience: it happened a couple of times that the wrong image got pulled when I wanted to spawn another container, and I had to wait for it to finish.
A
Yeah, okay. I'll try to get SIG Node involved in this discussion too. All right — I think I saw Evan; she was adding a couple of things here.
G
So David Eads was telling me that we should add a blurb or something in this section — or at least that's what I'm thinking we should do: add a note inside the projected volumes docs stating how it's working today, or not working today, with Windows pods, and then also point out here that there could be an issue with projected volumes on Windows. I'm just wanting to know what you folks think about it — or is there a fix for this on the horizon?
A
Yeah, I think you were out when we discussed this last week. We don't have a fix coming on the horizon, and we did kind of decide that the best thing to do would be to document this. I don't remember if we decided to document it on the kubernetes.io docs only, or also in some of the Microsoft MSDN docs. Brendan, is this something that you could help Irvin with — figuring out where we want to document this issue?
I
So what will happen when you do them is: you just hit the edit link and make whatever changes you want right in the browser, then it'll go into review, and it typically happens pretty fast. Okay, yeah.
G
So what I'll do is write up a PR, maybe in the WMCO OpenShift repo, just so that we have a place to say this is the stuff that I want to add, and then I can work with Brendan to figure out: okay, this part goes into the Microsoft docs, this part goes into the Kubernetes docs. And I'm fine helping write up all of that.
A
Yeah — as one container user, if you add a host mount and then try to access it, and you know where to navigate to where these projected volumes are on the host, I believe that container user can see the projected volumes from other container users. But if you block host mounts, you can't just magically hop over to the other containers' projected volumes; you'd need to search for them. That's my understanding.
G
Yeah,
the
only
the
only
the
only
thing
we
need
to
make
sure
is
because
of
the
way,
because
of
the
fact
that
we're
not
applying
the
correct
permissions
on
on
a
user
basis
for
these
files.
Is
there
something
that
we're
missing
here
that,
even
without
mounting
the
host
path?
Are
we
running
into
some
some
trouble
here.
A
So if you start multiple containers — as the same user, or even as different users — when you try to pop out of the container and access host resources, those all look like the same user to the containers. And that's where this is a little bit complicated, and there's no easy fix: right now we don't have the mechanisms in Windows to say, if you're running in a container as this user...
A
Yeah, we'll have to continue this — we're missing a direct conclusion. All right folks, I need to hop over to SIG Node now, where they're discussing caps and everything, so I'm going to do that. I'll hand it over to Jay and James if you're all doing pairing today. Bye, and thank you, everybody, for participating.