From YouTube: Kubernetes SIG Node 20210323
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: All right, well, welcome everyone to the March 23rd SIG Node meeting. We have a number of items on today's agenda, both past, present, and future, that we can talk through. So with that in mind, maybe Sergey or Elana, do you want to give an update on just where we are with our overall PR health?

B: I think Sergey is not on the call, so I'm not sure we've gotten a chance to run the numbers today. But if you give me co-host, I can share my screen and talk a little bit about the burndown to the test freeze tomorrow.
B: Yeah, I'm looking directly at the milestone. I'll put the link in the notes; I just can't share my screen... here we go.

B: We currently have eight items that are still open: three PRs and, I guess, five issues that are currently in the 1.21 milestone, and I'm not sure how many of these we are going to be keeping.

B: But this is what we've got on the list. In theory, all of this should be merged by end of business day tomorrow. I'll make sure to put the link in the Slack channel so people can take a look at all these various things, but in theory these are all the things that we have committed to for this release, other than, you know, last-minute critical bugs.

B: I haven't actually taken a look at the board in a while, because the only things that will be merging right now must have the milestone on them. I haven't checked whether there's anything critical or urgent that's missing the milestone, but anything that doesn't have the milestone on it won't merge until after we reopen development. So yeah, we're frozen.
A: Yeah, I was looking this morning to see if there were things we missed that should have had the milestone, just as odd outliers. One of the ones I was looking at before coming here was causing me to refresh my memory on how we do flag versus config file parsing and precedence for things related to the read-only port. So that was one I thought might go back in. But thanks for the update here, Elana.

A: Okay. So, Sergey — if you wanted to... I think this week is a special week as we close, so maybe we can just move on to the next topic. Ben, do you want to bring up your topic?
B: Oh, Ben might not be here; I pinged him earlier. I threw this on the agenda after chatting with him. Ben would like to move the pause — I think the image stuff — out of the core k/k repo, possibly into a kubernetes-sigs repo, but we would need to discuss it at SIG Node first. So I don't know if I'm doing a good job of representing it.

D: Yeah, I saw that thread and I think it kind of makes sense — only CRI-O and containerd consume it, and Urvashi was looking into maintaining some things related to the pause image for OpenShift, so she might be interested to help out there.

B: So if we choose to do that, I think the chairs have to file an issue in the kubernetes org asking for the repo creation, or something like that.
A: I guess — I apologize — is this moving the pause container? Would it... or?

A: Okay, and — right.

C: What's the benefit of moving it out? I saw the pause image is already in a special directory, which is basically what the Kubernetes open source community tries to maintain. I know a lot has moved out, but I just want to understand the benefit — like, more patching? If there's a security issue, are they going to auto-patch those things? I'm just wondering what the benefit is.

B: According to Ben's thread, the benefit would be that it decouples from the Kubernetes release cycle, so we wouldn't have to worry about getting milestones added to things when we need to push in critical updates, that kind of thing.

C: That does make sense for pause, because we haven't updated it for years. The reason I ask is just: okay, it sounds like they want to collect all those related images into a single place. Then someone has to be responsible for the patching and the maintenance, all those kinds of things. It's not that the pause image is complicated — it's actually pretty straightforward, yeah.
G: Hi. So the most recent change was to make it rootless, I guess, and the main question we had was: what is the sequence in which we would update the different projects to use the new pause image? Do we want Kubernetes to go first and then containerd and CRI-O, or what's the order? That's the basic confusion. For example, right now I think containerd moved to 3.5 and Kubernetes hasn't — we're still on 3.4.x. Mike,

G: maybe you can correct me if I'm wrong, but my original thinking was that it should soak in Kubernetes first and then...

G: Yeah, we'll have to go back and check. But Dawn, that's the basic problem: anybody can cut an image — a version of the image — but then how do we make sure that it gets soaked and then applied to the rest? That's a bigger problem than actually pulling it out into a separate repo.

G: At any point in time we can bump the pause image in the k/k repository and cut a new version; it's just not going to be used right now. To be able to use it, we'd file another PR, and it would get used only in — sorry — in Docker-based jobs, not the containerd-based jobs.
G: That is the problem right now. So what we would then have to do is switch containerd over, or at least put a PR in containerd to switch to the new image and do a soak there anyway. That's the basic problem. We need to write up how we need to test it, and once that's done, then we can go over and move the code into a separate repo.

A: Yeah, I'm just trying to think through — the burden of a separate repo potentially comes with other things. Typically it's tracked as a separate subproject; I'm thinking about cri-api, for example. I'm just thinking through the governance checkboxes we cross off when we add these repos — often they are a separate subproject, and...
G: It's not that bad. They don't need to be tracked as separate subprojects; we can just add the OWNERS file there under one of the existing subprojects, so that should not be a problem. The hardest part is, you know, somebody willing to go there and do the things that are needed — not just the Dockerfile, right? We need to add a cloudbuild.yaml, those kinds of things, and add some minimal testing as well.

C: For me the concern is really the actual process. Who owns that process? Pause is one small example — we haven't had to change it for the last couple of years. Even in the past we did talk about increasing its scope, but we never got around to doing that, and that compelling reason is mostly gone right now. But this trend will not affect only pause.
C: So, one simple example: even if we fix the problem you just listed — containerd not soaking with this one — someone somewhere has to say, here is the development version of the image in the other repo, and here is the production image that we plan to release. And then there are two soaks you have to do: one against the dev branch of that separate repo, and then for the release you have to map it to the production version.

F: Yeah, exactly, Dawn. And then in the container runtimes we don't want to use :latest; we want to use the one that's proper and appropriate, right. But if that doesn't exist in GCR, we can't test against it. You have the dependency now on containerd and CRI-O to move up when you want to move up. You could do that with dockershim; you can't do that with containerd and CRI-O.
G: I'd just like to add, from the Windows side, that pretty much any image that has to do with Windows has to be updated every six months, because of Windows' host compatibility constraints. Microsoft released 20H2 — I think six months ago now — so 21H1 is coming at some point, and so on and so forth. So roughly every six months, pretty much every image that runs on Windows needs to be updated, and pause is definitely one of those images.

G: Jeremy, I hear you volunteering.

F: Dims, just another potential option: cri-tools also pushes images for testing, and cri-tools might be the right place for this pause image. Just a thought.

G: Yeah, let's see if somebody steps up to write this down before... you know, not me, yeah.
C: At least at this moment, I believe we haven't approved this yet, and I think SIG Testing or SIG Release would need some proposal from whoever picks this up. Then we can fill in the details about ownership — we have a list of the images from the past that we own — so we're going to fill that gap and make sure all of that process is settled and won't interfere with our release process.

A: Just because we haven't done something in two years, I guess it is a good opportunity to ask: is there something we should have been doing with the pause container that we haven't been? So if folks have particular things in mind, it's a good time to bring them up. I'm trying to think whether we even have the PID-sharing feature on by default everywhere or not right now, to think through the behaviors around reaping, but yeah, I can follow up on that afterwards.
J: One short question — listening to all this discussion, I got an idea: should we have, in CRI, some level of agreement between the kubelet and the runtime about which version of those containers is configured or expected, to prevent the wrong skew?

J: We need to communicate that to the runtime, or vice versa — like, if the runtime says "I am configured with pause container version, let's say, 4.0" and the kubelet expects, like, 4.1, then it should say "sorry, I think it's..."

F: Alex, I think it's more a question of what the expected behavior is. We can certainly have some versioning information passed down to the CRI, but I know, for example, CRI-O has gone without a pause container in some demos, so we may not want these pause containers around at all for the implementation of the sandbox.
G: There is one dependency right now, which already exists: the kubelet needs to know what pause image is being used on the containerd side so it doesn't prune it. We have a parameter that we use currently in the kubelet config or command line to specify "this is the pause image — don't kick it out."

A: Yeah, that's the only use case, and honestly, if we expanded that use case, I think it's reasonable that I might want to have the kubelet introspect the set of containers — I'm sorry, the images — to ignore for GC beyond just pause. I know I would have benefited from that in other scenarios.

A: Maybe we should just let the CRI report it back.

A: So, yeah — could we maybe set that as a plan of record for 1.22, then: as part of CRI graduation, that's something we want to track.
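A rough sketch of the dependency described above: the kubelet skips whatever sandbox ("pause") image it has been told about when it garbage-collects images. The helper names below are illustrative, not the actual kubelet code; today the image comes from a flag/config value, and the plan of record above is to have the CRI runtime report it back instead.

```go
package main

import "fmt"

type imageGC struct {
	// sandboxImage is whatever the kubelet was told (flag/config), or, per the
	// discussion above, what the runtime would report back over the CRI.
	sandboxImage string
}

// collect returns the images that would be pruned, always skipping the
// sandbox image so sandbox creation never has to re-pull it.
func (gc *imageGC) collect(images []string) []string {
	var removed []string
	for _, img := range images {
		if img == gc.sandboxImage {
			continue
		}
		removed = append(removed, img)
	}
	return removed
}

func main() {
	gc := &imageGC{sandboxImage: "k8s.gcr.io/pause:3.4.1"}
	fmt.Println(gc.collect([]string{"k8s.gcr.io/pause:3.4.1", "nginx:1.19"}))
}
```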
A: Okay, thanks, Alex, for prompting more conversation — maybe we get to better solutions. Okay, next up, the CRI stats KEP. I think there was one update to share.

K: Hey, yeah, that's me. I'm bringing this up just to refresh everyone about this, and specifically I have a question for some community folks and people in general. So, the stats KEP — I'm still pushing to try to get it into 1.22.
K: It came up internally, when talking about this, that it may behoove us to do the following. Right now cAdvisor supports the kubelet in serving the stats summary API, and cAdvisor itself also serves a cAdvisor metrics endpoint. We've been talking about how to handle the cAdvisor metrics endpoint, because it's used in a lot of Prometheus instrumentation built on top of Kubernetes. Internally it came up that

K: it would totally be possible for the CRI implementation to give off those metrics. So, instead of having the CRI implementation pass them to the kubelet through the CRI, with the kubelet then saying "here's the cAdvisor metrics endpoint," it's currently being discussed to serve that directly from the CRI. I know both containerd and CRI-O already emit some Prometheus metrics, so it'd just be a matter of wiring up most of the cgroup information for each of the containers, as well as the disk stats and the network stats.

K: I just wanted to check in and see what people thought about that, and see if there's any initial reaction against doing something like that.
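A minimal sketch of the direction being described: the CRI runtime itself, rather than cAdvisor, reads cgroup data and serves it on a Prometheus endpoint. The metric value and labels below are placeholders for the cAdvisor-style metrics; this is not the actual containerd or CRI-O code.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var containerWorkingSet = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "container_memory_working_set_bytes",
		Help: "Working set, read from the container's cgroup by the runtime.",
	},
	[]string{"pod", "container"},
)

func main() {
	prometheus.MustRegister(containerWorkingSet)
	// A real runtime would refresh this from the cgroup filesystem for every
	// container it manages; the value here is a placeholder.
	containerWorkingSet.WithLabelValues("mypod", "app").Set(12 * 1024 * 1024)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```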
F: I don't have any disagreement with it. It would be interesting to get Lantao in on the conversation, and maybe Dawn, and have a smaller group — a little working group — put together. The reason it was done this way was because that was the design plan, right? I'm sure you saw that.

F: It was the plan — that's why we did it that way. Changing the plan is probably fine; we just have to dot our i's and cross our t's and make sure that when we change the plan, everybody's in agreement with it.

K: Yeah, and I'm not so sure that this deviates — it kind of does shift the plan, but it doesn't deviate too far from the intention, where the stats summary API fulfills the functional aspects of Kubernetes, like handling eviction and scheduling, and the cAdvisor metrics stuff is just built on top of that. It's just kind of shifting the responsibility for who is actually reporting those metrics from cAdvisor to the CRI. So I think — yeah, I agree.
K: Obviously this is introducing more invasive changes, but it seems like cAdvisor's integration into the kubelet was kind of just the way that it was, and the introduction of the CRI stats was a way to patch on top of that. So this would now be the second iteration of working around that. But someone should correct me if I'm wrong on that — that's just my impression.

C: So, Peter — sorry, I didn't follow everything you described here, but at a high level I don't disagree with you. I need to look at your KEP and follow up with you and David here in more detail, so we are open to all the discussion. But one thing I just want to let you know: you are right. In the past we have actually had several paths for how to get to the final stage. So right now, just like what you see,

C: this is kind of the third one. We try to evolve, so it's not like everything is set in stone and we cannot change it. CRI stats — it goes back a while; I see Michael Crosby is here, and he may remember that time. We did actually have some middle ground — we needed that middle ground — and this is how we tried to help the Kubernetes community. So we do have some destination in mind, but unfortunately I didn't fully understand exactly what you said.
K: Yeah, no problem, and I'm happy to answer any questions offline. Generally, what I was proposing was just moving the source of the metrics from cAdvisor to the CRI, instead of passing it through the kubelet.

A: Peter, just for those who may not keep all of this in memory, I think it's worth remembering the theme we're trying to solve here, which was: we want to get CRI stats out of their present state. That lets us figure out how to eliminate any special privileges given to any runtime — whether that's dockershim or CRI-O or containerd — around stats. And then the major issue here was just the cost of dual scraping, which I think is significant, particularly in more intensive usage environments.
K: Dockershim would already have been deprecated far enough that we don't have to support it. I have not at all thought about supporting this in dockershim, so I have a feeling this will be something that'll be gated; people who are still on dockershim will still go through cAdvisor, and then eventually, once dockershim is gone, we won't have to worry about that interaction and will strictly be going through CRI.

G: Yeah, definitely, let's not worry about dockershim at all here. And just a general news update for everybody: the Mirantis repository has a copy of the dockershim code now, so it's gone — we don't have to worry about it anymore — and let's not touch the code at all if we can avoid it.

A: Yeah, I'm just trying to be sensitive and do the right thing for everybody here, so just trying to make sure we — yeah, don't worry.

C: I heard you, Derek. I think dockershim is just one topic here. If I understand you correctly, you worry that we make this CRI work more complicated — add more complexity to it — because the CRI stats change the scope to include not just dockershim but also the other CRI-related work we talk about, like the API promotion to beta and all those other things around this. So I think this is...
A: Yep, cool. All right, so in the spirit of being more efficient with our resources — Elana, I think you wanted to talk through the latest thoughts around swap? I know there have been a number of discussions on it.

B: I put this on the agenda. I don't know that there is much to discuss, other than that I've started working on drafting a KEP, and hopefully I'll have something up as a PR for discussion.
M: Yep, pretty much. It summarizes the current status quo and updates the community on the current snapshot of the alpha scoping. Basically, the TL;DR is that, from the workload point of view, there won't be any visible performance or behavior changes.

M: This is a combination of both the kubelet behavior and the swap-consumption side. Essentially, before alpha the kubelet wouldn't even be able to start on a node that has swap enabled; after this, the goal is to have that limitation lifted, but we're still going to disallow the workload from consuming any of the swap, so that this gives the workload

M: some predictability, at least for 1.22. We're also going to expose a basically experimental knob, probably through config, which is essentially the master switch that controls how much swap each workload can consume, and the default value can be 0. So basically this is the high level for the current alpha scoping.

M: If there's no further objection to this, the next step is just going to be drafting out the KEP and also starting to dive deeper at the code level for the implementation.
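A small sketch of the kind of configuration split described above, assuming the KEP keeps the existing fail-on-swap behavior and adds an experimental knob gated by a feature gate. The field and value names are illustrative, since the exact shape was still being drafted at this point.

```go
package main

import "fmt"

// kubeletSwapConfig sketches the shape of the knob described above; field
// names are illustrative, not the final KubeletConfiguration fields.
type kubeletSwapConfig struct {
	FailSwapOn   bool   // existing behavior: refuse to start if the node has swap
	SwapBehavior string // experimental "master switch" for workload swap usage
}

func main() {
	// Alpha scoping per the discussion: the kubelet may start on a node with
	// swap enabled, but workloads are still not allowed to consume any swap.
	alpha := kubeletSwapConfig{FailSwapOn: false, SwapBehavior: "LimitedSwap"}
	fmt.Printf("%+v\n", alpha)
}
```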
A: Yeah, so I had a couple of questions. I know we had some internal discussion at Red Hat about this, and I don't know how much of that translated into, maybe, Elana, your conversations, but are we intending to treat this as maybe two discrete features? One I had in mind was the ability — I think what you described here — for a node to run with swap on while the workload can't consume it, so that from that perspective node services are protected by swap, but workloads aren't.

A: That's one feature, and as part of that feature it would mean that workloads still get reliability, but management components don't get, you know, killed prematurely — that type of deal. Some of the elements of that feature I would think through would be a desire to potentially reserve the amount of swap that's not available to Kubernetes, to then later hand out to workloads.

A: But I guess what I'm wondering is: is there any desire among the folks looking at this to split this into two discrete feature gates, similar to what we did with PIDs?
M: Not at this particular moment, but we do have some high-level thoughts on what the swap use cases are going to end up being, and I think there will be a natural divergence to split between a system point of view and a workload point of view. But the gist, I guess, for alpha is that if you want to maintain backward compatibility and predictability, it should be a combination of both a system level and some coarse workload-granularity control.

M: What I meant is — say, if you go to the public doc for the PoC, there was a table basically highlighting the three main cases that we intend to support, though the last one is probably more beta scope, or maybe not even beta but further future. Essentially, in terms of how we're treating swap, there are three main cases. One is completely disabled swap.

M: That's more the status quo case, but it can be implemented either by the kubelet failing to start or by no workload being able to consume any swap. So essentially with this alpha we're just switching the implementation of this "disabled swap" world. Moving forward, there are two ways we can consume swap: one is that we don't take any of the accounting into consideration; in this world...
A: I guess I'm following your table here. All I'm trying to call out is: can we describe swap as two discrete features and then give an evolution of each feature set? So we'd have "support a node running with swap enabled," which would basically be what we can do with swap as a node SRE, right — supporting our management components — and then, as a separate...

M: That's why I think, for alpha, these should at least be scoped together.
A: That's maybe the one thing I wanted to clarify. Dockershim today I viewed as the reference implementation that I expected our CRIs to have followed, and today dockershim prevents any end-user pod from consuming swap by explicitly nulling it out in the cgroup. At least I had intended that it would be a bug if particular CRI implementers were not bounding swap to zero. It's just unfortunate

A: that we didn't pass swap as a separate field in the CRI. I know Mrunal and I discussed maybe CRI-O fixing that to not set it as unbounded. I'm curious about containerd — do we know what they might be setting for swap today? But my perception is, even if a node was started with swap enabled, the runtime...

M: Yeah, that would be the current deliverable for alpha. Essentially, on the "consuming swap" row we're saying no workload can consume swap by default; this limitation is essentially enforced through the CRI by setting the swap field to zero.
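A sketch of what "setting the swap field to zero" could look like on the CRI side, assuming a per-container swap limit field is added to the container resources message. The struct and field names here are illustrative, not the actual CRI protobuf.

```go
package main

import "fmt"

// linuxContainerResources mirrors, in simplified form, the kind of
// per-container resource message the CRI would need so the kubelet can pin
// swap explicitly instead of relying on each runtime's default.
type linuxContainerResources struct {
	MemoryLimitInBytes     int64
	MemorySwapLimitInBytes int64
}

// alphaContainerResources is what the kubelet would send during the alpha:
// whatever memory limit the pod asked for, and zero swap for the workload,
// regardless of how much swap the node itself has enabled.
func alphaContainerResources(memLimit int64) linuxContainerResources {
	return linuxContainerResources{
		MemoryLimitInBytes:     memLimit,
		MemorySwapLimitInBytes: 0,
	}
}

func main() {
	fmt.Printf("%+v\n", alphaContainerResources(256*1024*1024))
}
```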
C: I want to add here: dockershim actually did the right thing — well, it didn't pass the value as zero, but it also didn't pass it at all, basically because back then we clearly said we don't support swap. And containerd actually did pass it, but it's also not configurable — I can quickly share the code in containerd; we did pass it, but it's also not right. So we need to take care of that, and of course we need to take — we...

J: Instead of those hidden assumptions and hidden configuration parameters, maybe it's a lot better to extend the CRI message to explicitly call out how much memory is reserved, how much memory is allowed to be used for swap, and so on. That would completely eliminate the behavior differences between implementations.

B: Because right now that was just, I think, an oversight. So, to answer Derek's original question of how we're splitting this in terms of the control plane or management component use case versus the workload use case: I wouldn't be opposed to having separate feature flags, for example, for the two different cases in, say, beta.

B: The only thing is that for an alpha implementation, I think it makes sense to just have the one, because I can't imagine a world — I think it's not feasible right now, but correct me if I'm wrong — where we would have no swap enabled for management components but have it enabled for workload components.
A: Cool. Mrunal and Adrian, I think you wanted to go through checkpoint/restore, yeah?

D: Yeah, so Adrian and I went over checkpoint/restore, and internally we gathered a few more use cases. We wanted to give an update on the use cases and how we imagine this working, and gather some feedback. I think Adrian has prepared some slides he can share — you can give him co-host.

E: I don't see it — okay, looks great. So, Mrunal and I have been talking about how to move the checkpoint/restore KEP forward, and Mrunal has been asking me a lot of questions, which I could answer, but which I think would make sense to put into the KEP to make it clearer what it's about. Before updating the KEP,
E: I just wanted to present them here so that everyone is aware of them, and maybe, if I missed something or if there's more input, that could become part of the KEP as well. So first, one of the things which has often been mentioned is the pod lifecycle. The idea behind checkpointing, from my point of view, is that checkpointing is basically possible at any time in the pod lifecycle after the init containers have finished.

E: So once the init containers have finished, a pod can be checkpointed at any time the user wants. The checkpointing is, of course, triggered by the kubelet and then by the container engine — I have implemented this for CRI-O — and then the container engine talks to the runtime. runc, and crun, have support for CRIU-based checkpoint/restore, and CRIU then, in the end, writes the actual checkpoint to a directory given to the runtime by the engine and the kubelet.
E: With the checkpoint written by CRIU, the container engine creates an archive which contains everything for the container checkpoint. "Everything" means the checkpoint of all the processes in the container, plus the file-system differences between the OCI image which was used to start the container and the current point in time. The reason this is also part of the checkpoint is that with this checkpoint archive it's now possible to restore the container on any other host.

E: So there's a container checkpoint archive, and then, if you want to checkpoint the pod, the engine can create a pod checkpoint archive, which contains all the container checkpoint archives and some additional pod metadata. One thing about checkpointed containers: a checkpointed container can be restored in any pod you want, as many times as you want. So once you have the checkpoint, you can move it around and restore it. For the possible use cases I'm talking about later, this is necessary.

E: Currently I'm putting the checkpoint archive — the pod checkpoint archive — in /var/lib/kubelet/checkpoints, a directory where the archive is stored. The kubelet also has some additional metadata about the checkpoints it has stored locally, and currently we only want to allow checkpointing using kubectl.
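A minimal sketch of the lowest layer of the flow described above: the engine asks the OCI runtime to checkpoint a running container, and CRIU writes its image files into a directory chosen by the engine. The container ID and paths are placeholders; crun exposes the same commands, which is why the CRI-O implementation works with either runtime.

```go
package main

import (
	"log"
	"os/exec"
)

// checkpointContainer asks the OCI runtime to checkpoint a running container;
// CRIU writes its image files into imageDir. --leave-running keeps the
// container alive after the dump (omit it to stop the container).
func checkpointContainer(containerID, imageDir string) error {
	cmd := exec.Command("runc", "checkpoint", "--leave-running", "--image-path", imageDir, containerID)
	return cmd.Run()
}

// restoreContainer recreates the container from the CRIU image files. A real
// engine also has to put the rootfs (bundle) back in place first.
func restoreContainer(containerID, imageDir string) error {
	cmd := exec.Command("runc", "restore", "--image-path", imageDir, containerID)
	return cmd.Run()
}

func main() {
	if err := checkpointContainer("mycontainer", "/tmp/ckpt"); err != nil {
		log.Fatal(err)
	}
}
```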
E: No problem, yeah. The use cases, from a high level, are already described in the KEP. From my point of view there are three. First, maintenance: I want to install a new kernel. I install the new kernel, I checkpoint all my pods, I reboot, and I restore all my pods without losing their state.

E: If you don't have pods with state, this isn't useful; but if you have pods with state, then checkpointing and restoring helps you not lose that state. That's one of the use cases. Another one is fast startup: I have a complicated application which takes some time to initialize, to read data from disk — I'm talking about the JVM later as one possible use case.

E: So I wait until the application is initialized in the container, then I take a checkpoint of the container, and then you can create copies of the container from that already-initialized point in time, and it starts up really fast without needing the initialization. And the third use case is migration: I want to move a pod from one system to another. Mrunal and I have been talking with a couple of people who have shown interest in checkpoint/restore in Kubernetes.
E: Since the pull requests and the KEP were opened, one of the main uses that came up was fast startup of JVMs. The JVM takes some time to start up, and if you can create a checkpoint after the initialization of all the libraries, after the JIT has run, then you can restore from that point in time and bring up your JVMs really fast from a checkpoint.

E: The other contact I had was from MathWorks. They have already been using checkpoint/restore for a couple of years in production to decrease the startup time of their containers by five minutes by starting from a checkpoint — so users have to wait five minutes less at the beginning, until the system is ready. From what I've been told, they are currently using it with Docker, but they told me they are thinking about moving to Kubernetes, and I think they already have some different implementations.

E: So you would migrate all the data, then checkpoint the containers, unmount the old storage system, mount the new storage system at the same location, restore, and the container would not lose state and would keep running from the new storage system. So those were the use cases we captured.
A: Adrian or Mrunal — I think leading with these use cases first is really helpful. The question I had — maybe, Mrunal, you can answer this — is that when we were chatting about this, we had discussed what a potential Pod API change might look like. That would...

D: Did you capture the one where we said two things, right? One is that we start a pod from a specific checkpoint, and the second is that we are able to take checkpoints at any point. If you have a slide on that we can jump to it, or I can just talk to it. I don't think I have a slide on that — okay, so basically, while chatting with Derek and also talking with Adrian, the way we thought about this feature is like this:

D: there are mainly two things here. One is that you should be able to checkpoint at any point — that allows you to keep checkpointing a pod, which just creates a checkpoint that can be used at a later point, but it shouldn't change the running pod.

D: So that's the overall idea we had. Any thoughts on that?
E: Yeah — so basically it's a checkpoint plus a diff of the root file system; that's what's in the tar archive, and then there is the config.json, so we know how to restart the container. So it's all the information which it's possible to get out of runc.

E: Okay — the root file system diff doesn't come out of runc, that comes out of the engine, but it's all files and directories which are part of the tar archive.
H: Yeah. When I did this in the past, as far as migration goes, one idea that I had was that these archives could just be images. You could use a manifest list and have the different things — like the rootfs diff — as a layer in that image, and then you could use the actual SHA of the OCI image as the pointer to what the actual image is.

H: So you're not using the name of the image, which can change. Then you could have different kinds of metadata added, like the actual OCI runtime spec that was used, and then, if you use an image as the archive, you can push and pull from registries and you have the distribution aspect already solved.
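A rough sketch of the suggestion above: package the checkpoint as an OCI-style image so registries handle distribution, with the base image referenced by digest and each checkpoint artifact carried as a layer. The types and media types below are simplified and hypothetical — not the actual image-spec Go structs or containerd's checkpoint format.

```go
package main

import "fmt"

type descriptor struct {
	MediaType string
	Digest    string // sha256 of the blob; the content-addressed "pointer"
	Size      int64
}

type checkpointManifest struct {
	// Base image referenced by digest, so tag renames don't matter.
	BaseImage descriptor
	// One layer per artifact that has to travel with the checkpoint.
	Layers []descriptor
	// Extra metadata, e.g. the OCI runtime spec used at checkpoint time.
	Annotations map[string]string
}

func main() {
	m := checkpointManifest{
		BaseImage: descriptor{MediaType: "application/vnd.oci.image.manifest.v1+json", Digest: "sha256:..."},
		Layers: []descriptor{
			{MediaType: "application/vnd.example.checkpoint.criu.tar", Digest: "sha256:..."},        // CRIU process dump (hypothetical media type)
			{MediaType: "application/vnd.example.checkpoint.rootfs-diff.tar", Digest: "sha256:..."}, // rootfs diff vs. base image
		},
		Annotations: map[string]string{"org.example/runtime-spec": "config.json"},
	}
	fmt.Printf("%+v\n", m)
}
```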
D: Yeah, I think that's great input, Michael. That's something we were struggling with — how we can move these checkpoints from one node to the other, because you don't want to run this pod on the same node; you want to checkpoint it and run it anywhere. And maybe we can take advantage of some artifact metadata to say that this is a checkpoint, and have a new pod point to it and pick it up from there.

H: Yeah, I could show you some of the work I did in containerd on building an image out of a checkpoint. Also, because you have bind mounts or log files, sometimes — depending on what you checkpoint — the files that are open by the process have to be the exact size, because the process will seek back to where it was. So you could use all those different manifest-list entries to add in bind mounts or log files that get transferred to another system.
C: Yeah, one other note: internally we also dug into this a couple of years ago, and mostly we just relied on remote storage. This is why, last time, I asked to understand the high-level API and to remove the dependency on the local node, because when we first tried this, the local node dependency was the issue — the implementation on the local host was basically manual.

C: I remember Andrew mentioned that it's manual and we could do it, but we need to be careful to remove the local-host dependencies, and then you could rely on remote hosts and remote storage. That's how we played with it, but that was many years ago when we tried it, and back then we failed.
E: So, currently, yes, but I don't think there is anything preventing a VM-based runtime from providing the checkpoint interface runc does. I also did the crun implementation, and I just replicated what runc was doing; once crun had the same interface runc has, it was possible to checkpoint containers with crun. So it should also be possible with any other runtime if they have the same commands as runc for checkpointing.

A: Then, just thinking about other things that might be related to this that we should keep in mind: anything we may do in the future around user namespace remapping, both at a node or pod level — would that impact, or could it be pursued around, checkpoints? It feels like there's a relationship here, in that you'd have to preserve that mapping.
E: I haven't used user namespaces with checkpoint/restore for some time now, but I know that LXC and LXD are also doing checkpoint/restore with user-namespaced containers, and there is support in CRIU for working with user-namespaced containers. It's all there, but it hasn't been used a lot in the past, yeah.

D: The challenge will be when we are moving it from one node to the other: most likely we won't be able to guarantee that the same user namespace ranges can be given back to that pod. So if we can manipulate something in the checkpoint to change the range, or something like that, that may be...
be.
A
Miss
you
yeah,
no
we're
running
up
at
the
end
of
the
hour.
I
think
thank
you
and
adrian
for
continuing
to
explore
this
and
sounds
like
michael
yeah.
Be
awesome
if
you
could
connect
your
experience
with
yeah.
D
Sure
I
think
what
we
can
do
is
like
we
can
kind
of
share
a
doc
like
go
over.
The
use
cases
add
what
we
talked
about
today
and
then
like
michael,
can
michael
shared
a
link,
but
if
he
he
can
chime
in
more
as
well,
it
will
be
helpful.
A
All
right
well,
thank
you
all
who
attended
today
and
shared
with
the
broad
community,
so
look
forward
to
finishing
out
this
release,
and
then
I
guess
we
should
reserve
some
time
in
an
upcoming
sig
meeting
to
figure
out
what
we're
gonna
do
in
122..
So
everyone
have
a
great
afternoon
or
evening
we'll
talk
on
slack.