From YouTube: Kubernetes SIG Node 20200609
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Yeah, okay, let's do it, and I think we'll take the items first to start from last time, because two weeks ago we couldn't resolve the previous ones. So last time we had Victor here to talk about the SIG Node testing, and because we ran out of time we didn't give you enough time. So we thought maybe you want to start today. Yeah.
B
Sure, thank you, Dawn. So I put some stuff on here — SIG Node test enhancement; there's a link for the doc. I just want to start out and say there have been some great efforts from the team to sort of jump in, look at these tests to understand what's going on, and fix some issues. One of the comments — this is directly out of our meeting — says, hey:
B
We made an entire testgrid tab blue with one PR. And when you look at the testgrid tabs, you know, there was a lot of red, so blue is definitely a step in the right direction. So that's good, and we want to give a shout-out to everybody and say, hey, thanks for all the hard work, all the PRs — it's really good. At the same time, what at least I'm seeing is a lot of activity and a lot of review requests.
B
I was like, how can I keep up with all this? So I made a little project board I'm experimenting with, to sort of keep track of these PRs. And just to give everybody an idea of how much work and effort is going on: there are 10 PRs already completed and merged on just the SIG Node testing piece, and these include things like some doc updates — how you run these tests if you just want to run a particular test.
B
Some COS image fixes, a little test config. And one of the big things we had before that was causing us some issues: if there was a test that said, hey, use this image, and the image wasn't there, it was a silent failure. So now, if the image is missing, there's an explicit failure, so that's good. And there are 13 PRs still in progress — so we've got 10 merged, and 13 more that are currently in progress, being reviewed. There are some doc updates, more image cleanup, and then there are some things with the benchmark tests.
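(A minimal Go sketch of the failure mode just described: checking up front that a required test image is present and failing with an explicit, actionable error instead of an obscure one later. The names here — prepulled, requireImage — are illustrative assumptions, not the actual node e2e framework code.)

    package main

    import (
        "fmt"
        "os"
    )

    // prepulled simulates the set of images available on the node.
    var prepulled = map[string]bool{
        "k8s.gcr.io/pause:3.1": true,
    }

    // requireImage returns a descriptive error if the image is missing,
    // so the test fails explicitly rather than silently.
    func requireImage(image string) error {
        if !prepulled[image] {
            return fmt.Errorf("image %q is not prepulled on the node; add it to the prepull list", image)
        }
        return nil
    }

    func main() {
        if err := requireImage("example.com/missing:latest"); err != nil {
            fmt.Fprintln(os.Stderr, "FAIL:", err) // explicit, actionable failure
            os.Exit(1)
        }
    }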
B
So I want to talk just briefly about this benchmark test. What happens is there are some benchmark tests that run, and the benchmark tests were failing with out-of-memory, and so there were a couple of approaches to fix it. The one we sort of agreed on, I think, and that we're pushing forward, is: let's lower the number of pods from 105 to 90. Now, this is really just a temporary workaround to get the test running again.
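(A minimal Go sketch of the workaround described above, assuming a hypothetical density benchmark type: the pod count is a parameter that can be lowered from 105 to 90 as a stopgap while the memory growth is investigated. Illustrative only, not the real e2e_node code.)

    package main

    import "fmt"

    // densityBenchmark is a stand-in for a node density benchmark
    // parameterized by pod count.
    type densityBenchmark struct {
        podsPerNode int // was 105; lowered to 90 as a temporary workaround
    }

    func (b densityBenchmark) run() {
        for i := 0; i < b.podsPerNode; i++ {
            // create pod i, wait for Running, record startup latency...
            _ = i
        }
        fmt.Printf("spun up %d pods on the node\n", b.podsPerNode)
    }

    func main() {
        densityBenchmark{podsPerNode: 90}.run()
    }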
B
The analysis so far looks like containerd was split up into a separate process, I guess, and that one seems to be what's using more of the memory. And so, just to make sure that we didn't lose that, I opened an issue to track it and say, hey, we're going to lower this number of pods from 105 to 90 to get the test going, but we've got this underlying issue. At some point — it seemed like the history was way back, a few years ago — this was a test that says, hey:
B
How many pods can we safely spin up on a node? And Dawn, I think there was a big issue — I think you had put a comment back in there way before. So that was sort of what this test was about, and that's sort of what's going on. So I think, if anybody's interested in the issue, interested in containerd, in helping debug — I know there's been some great help from Roy and Ning and others to help figure this out. So that's all appreciated.
A
Huge progress, actually. We used to have a lot of flaky tests for the node e2e, because not all the images have an image owner, so some of them are orphaned. And so then we have the policy: if an image is tied to a particular failure for, say, one month at a time, we have to retire that image. That's the quick summary; the other image issues have the details.
A
So then we also have the policy that, next, we want to get rid of some images — like we used to have the Debian and Ubuntu ones, and we also had CoreOS, and we also tested others. But I believe those are actually maintained on a different side; the separate one is not really here, because it's a lot of work, and they don't know the e2e tags, so they never get around to it, and we end up packaging it together.
D
I also wanted to draw something else out there for all the testing-related work. As was already mentioned, there is a project board in the kubernetes org and a lot of work in there. The only people who are approvers are the chairs for SIG Node. I was also asking if we could get some periodic time slot where people could walk through those issues and PRs and prioritize them. That would also be really helpful.
A
Yeah, we should fix this — yeah, thanks. And I think there are a couple of PRs already started to add more people to maintain those kinds of things, and we also need to more clearly define those policies. And yeah, another thing: I think Victor already mentioned the benchmark tests. Yeah, we have the benchmark tests, and we even have the performance tests that I think people proposed a couple weeks ago to add to those. About the performance things, like the scaling being tracked using our mechanism:
A
So we built those performance tests for the node — the pod density, or you could say the node scalability — and the mechanism behind it. That whole thing my intern built, three or four years ago, and during the CRI development, the containerd development, and the CRI-O development we actually used it a lot to measure the performance and track the resource usage.
A
So then, each release, we basically based things on that data. And I want to make sure I understand how much we want to set for our benchmark tests. So maybe someone wants to pick up that work and also maintain it — and for that work there's more work. I happen to know my intern also put together a to-do list; it is there, and it needs to evolve, because we talked about it initially when I hired him as the intern.
A
We used to have the issues there, but I think that, because nobody in the community kept up with them, maybe those issues rotted and were being auto-closed by the bot. But we can pick up those things, because I remember the doc I reviewed — he wrote about the performance, a static to-do list of things. We can pick it up; we can start from there to find issues. And we used to have the 'help wanted' label in the Kubernetes issues, with a SIG Node label.
A
Thanks, everyone, for working on this. Because it used to be neglected — not many people looked at the node e2e tests; we were the only ones — the issues didn't bubble up to the community so much, because internally the team is already stretched. So now we can get the right attention, and there is awareness in the community, so I'm glad that more people want to help.
B
Yes, thank you, Dawn. So, you know, we're still meeting once or twice a week, and this is just to go over some KEP enhancements and stuff — sort of a dedicated focus outside of SIG Node, just because there's not enough time to do everything in an hour. So we met today, and there's one KEP that was approved, and this was really just adding a slight improvement to the topology manager stuff, for getting better alignment between GPU devices and improving performance, which is an optional extension — but there's the link.
B
It was approved, and I'll just give one quick update. You know, Kevin — I don't know if he's on the call — Klues from Nvidia, they've done a lot of work there, and it's sort of late in the enhancement process for 1.19, so it was approved. It's not a huge change; it's tied, as I said, to SIG Node, but it's not likely to be implemented until, you know, 1.20.
E
One comment I would add: VG had presented a proposal to add pod-level alignment as a topology manager policy, where today we only have container-level alignment. I know both Kevin and myself are plus-one on that proposal right now, and I think we were waiting to see if Alex from Intel had time to offer additional context. But our current topology alignment policy is a per-node policy, and it seems like any future evolution might become a per-pod policy.
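(A rough Go sketch of the distinction under discussion, using simplified stand-in types: container-level alignment picks a NUMA placement per container, while pod-level alignment must find one placement that satisfies every container in the pod. Illustrative only, not the topology manager's actual hint-merging code.)

    package main

    import "fmt"

    // numaHint is a simplified stand-in for a topology hint: the set of
    // NUMA nodes a resource allocation could come from.
    type numaHint map[int]bool

    // intersect merges two hints by keeping only the common NUMA nodes.
    func intersect(a, b numaHint) numaHint {
        out := numaHint{}
        for n := range a {
            if b[n] {
                out[n] = true
            }
        }
        return out
    }

    func main() {
        containers := []numaHint{
            {0: true, 1: true}, // container 1 could use NUMA 0 or 1
            {1: true},          // container 2 only fits on NUMA 1
        }
        // Container-level: each container is aligned independently.
        for i, h := range containers {
            fmt.Printf("container %d aligned independently to %v\n", i, h)
        }
        // Pod-level: one merged hint must satisfy every container.
        merged := containers[0]
        for _, h := range containers[1:] {
            merged = intersect(merged, h)
        }
        fmt.Printf("pod-level alignment must satisfy all: %v\n", merged) // map[1:true]
    }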
E
But we lacked a clear next action to take. We wanted to bias toward enabling more flexibility on the per-node version of that policy and supporting pod-level alignments. So I wanted to thank VG and the team from Samsung for putting together a very clear description of where and how they would use this policy in a classic 5G deployment, where you might have a set of nodes that are only running UPF functions and other nodes that are doing different work, which might not run with this policy.
E
So one thing I just want to make note of is: it was really, really helpful when evaluating that KEP to understand the overall topology the cluster would be deployed in — to understand the system and whether it would actually be used. It made the abstract very concrete. And so maybe going forward, one of the lessons I would say has come out of the discussions we've had — and Victor,
E
maybe you could second this — is that it's really helpful when people are more concrete about their use case, to drive understanding. Decisions on things being per-pod versus per-node are not as easy otherwise. And just being clear: someone saying, well, a per-node thing makes sense for me because my use case is around dedicated GPU nodes, right —
E
that's really helpful for us as a community to understand, and it biases us toward doing something; whereas, absent a concrete deployment topology, we're kind of biased to just do nothing, and that's not good for anybody. So anyway, if folks want to review VG's proposal on pod-level topology, please make those comments today, because I wanted to unblock that this week.
E
The recording — I need to check if I had uploaded that one; I'm like a week behind on my uploads. But there's also a link — maybe we can put it here — to the slide deck on why and where it would be used. It had a 5G deployment showing where the user plane functions would go and so on, and it was the most insightful, useful set of slides I've seen related to 5G in a while. So I just —
A
You know, can you upload it after review if possible, and at least post it in the Slack back here, on behalf of a lot of us? I heard a lot of us on the west coast cannot attend — it's at six o'clock; it's too early for many of us here — but I think many of us want to know more about those things. And maybe this can even be changed after our call — our next one, the resource management meeting. So one thing, my comment:
A
What I wanted to say is that it's harder for us to find a one-size-fits-all solution. And also — I mean, I did see the details of the proposal and the problem, so I can see clearly where they came from: to support the mobile applications, to support the 5G. But I also have other people and other use cases, and not all of them are these cutting-edge cases. So one thing: I really liked what Alex proposed last time — I think it's more than a month ago.
A
He gave the proposal — what Intel is doing — and that's really aligned with the decision we made three years ago. This is also why we have that container runtime interface and all those pluggable or extensible APIs we defined. We tried to say: here is the identified common core for all the general workloads that run on a node, and then we have the runtime class and node labels and those kinds of things, and admission control — all those kinds of concepts — so you can customize for the node.
A
Customize the different pieces — they also have the scheduler's extensible framework, and you can plug in your scheduler, and you can plug in your container runtime, for different needs. So that's what, over the last many years, we tried to push forward. And Alex's demo, actually — I mean, mainly Alex gave all these kinds of scenarios, different hardware, and how we are going to do it. So that's what I just wanted.
E
I think that exact sentiment will probably be captured in the recording, so there's no disagreement there. And I think the issue is that there's no clear consensus on how to move forward, even with the analysis and POC that he presented.
G
Actually, I thought about the way forward, and we've had some internal discussions within Intel, and I already put in the doc what the future topics for this topology discussion could be. So we have some ideas on how to implement things; we can share all the items we have. It's a question of where we want to go, and I will try to summarize all the current pending proposals and outline a few options for where we can go with things, and then, based on that, we can discuss what the future steps are. Yes.
G
It's a good incremental step. I understand the reasons; I understand what it asks and what it tries to solve. I just want to verify a few minor things, to make sure that future things will not be prevented — that we will not end up with something we would need to support for multiple years and could not change.
A
Thanks for tracking the time, and thanks, David, for reminding us — we just wanted to have the summary about the resource management work group. And now let's move back to our agenda. So, next: the sidecar containers. It's back to the first item. So, sidecar containers.
H
So far, so good. Also, I wanted to mention that we found some edge cases on termination — termination was the biggest concern that was raised here about this sidecar KEP — and we found some corner cases in the details, and we have some ideas on how to improve it. Regarding the —
H
I wasn't sure if we can rely on that field being set, or if it was added just because, when the field was added, it needed to interoperate with older releases where it was not set, and it was just never removed, or something like that. Does anyone here happen to know? Because that would be useful for the kubelet graceful shutdown — or maybe, if it's useful, we can use it for that.
F
I looked at the git blame super quick. It looks like it may have been added to handle pods that are missing the spec entirely, and that may happen when we load pods from a checkpoint or something. But it looks like we wrote the code — or it was added — with the intention that the kubelet could continue running even when we didn't have a pod spec at all.
H
Because the pods are being restored from container labels — like the docker container or container runtime labels — so at that point the pod spec should be there; but yeah, maybe when you restore it from labels it's not set. Yeah, maybe it's something like that. Makes sense. Sorry — but yeah, that was the quick update regarding sidecars.
E
So this effort was originally initiated by myself and Vikas Choudhary, who is no longer with Red Hat. He did a lot of great work to help push it forward, but he's not presently working in this space anymore. So it's great to hear you all had been chatting with folks at Netflix that had an interest, and it's good to see the renewed interest.
E
I'm curious if the unicorn reappears for that, but the challenge we had there was trying to figure out how to handle upgrades safely. And I would bring in Jordan Liggitt, who I remember being a very prominent reviewer on this particular PR as well, since I think at the time it was myself, Vikas, and Jordan that were trying to iterate on this. There were also some areas that were maybe under-defined, and so it'd be good to know — we haven't chatted — if the Netflix use case was a node-wide remapping. Yeah.
J
It is a node-wide remapping, yeah. And I think, yeah, they're interested in a node-wide remapping — the phase one that you were trying to tackle. And I think one of the issues we had here was with docker, where, if you have a privileged pod — I think with CRI-O and containerd you can switch off the user namespace, but docker did not have support for that. I remember that as one of the issues, but I haven't been able to check what we agreed to do.
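(A tiny Go sketch of what a node-wide remapping means mechanically, assuming a single fixed offset for the whole node: every UID inside a container is shifted by the same host-side offset, so container root is not host root. Purely illustrative.)

    package main

    import "fmt"

    // hostUID maps a UID inside a container to a UID on the host under a
    // node-wide remapping with one fixed offset (e.g. 100000), the scheme
    // discussed above. Container root (0) becomes host UID 100000.
    func hostUID(containerUID, offset int) int {
        return offset + containerUID
    }

    func main() {
        const offset = 100000
        for _, uid := range []int{0, 1000} {
            fmt.Printf("container uid %d -> host uid %d\n", uid, hostUID(uid, offset))
        }
    }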
A
We talked about this plan at a SIG Node meeting, and I remember that Lantao — and Mike Brown is here today — we talked about this one. Then they tried to take those issues back to the containerd community, and they continued discussing. I don't remember where we left it, but they actually did take that task, I hope. I'm not sure Mike is here. Yeah.
A
My proposal is something like: Derek, can you invite Jordan and maybe get him to attend a meeting — maybe sometime next week, or maybe even later — so we can have a recap on the status? So far I heard a couple of issues: one is figuring out the upgrade, and we can focus on those things; another one is the containerd piece — if I remember correctly, they took that job back to the containerd community; and also the CRI changes, which we agreed on.
A
So we can lay them out one by one. Obviously user namespaces is a pretty common, popular topic, and many people raised their hands wanting that one. There's also the enhanced credential management and policy, but that is actually more the Kubernetes community leading that effort. So maybe we should focus on those things. Yeah, yeah.
E
No, and I don't know if we ever discussed this in the past, but I'll just caution that this is very difficult to reason through when thinking about all the permutations. So, for sure, I would hope that we could continue working from Vikas's PR, which is now closed. But even then — I'm just pointing out, myself, I'm trying to remember all the challenges that we ran into. What I'm curious about is: is the use case that we were driving for universally shared among folks?
E
Or is anyone exploring this as an attempt to try to get a rootless kubelet? I'm curious if people are exploring this as a means to support better security isolation of end-user pods, or as an indirect mechanism to try to support a rootless kubelet. And if we can maybe align on what the success criteria is, that would be helpful, because I know this has also come up in the past.
E
So, like, Akihiro Suda in the past had raised this as an issue tied to the rootless kubelet. But definitely, for the original KEP, I think it's still solid, and I think the PR that Vikas had was really close, but it still had some issues that we could talk through — and they're just out of my memory, because they're two years old.
A
The first part targets the node, not the rootless kubelet — the rootless kubelet is the latter. For the former, you listed them, and the problem is we never closed on it. This user namespaces work has been going around for a couple of years, and now I would just say that Vikas has already nailed the majority of things down, but there are several open questions. We need closure on those open questions, and to move forward.
E
Some of the CRI pieces, I think — well, I don't even think that as much — it definitely predates the ephemeral containers, yes, and so we need to map these features to each other. And I think it also predates PID namespace sharing, and so it probably deserves an updated KEP. And so if there is a group within the community — ideally a set of folks from multiple interests, whether that's Netflix or you and your team — to just kind of pick up a new KEP, that's probably what I would recommend.
F
We actually talked about that a little bit. So we had to revert his PR because it caused some performance regressions. Normally, when we delete containers, all of that is sort of synchronous through the pod lifecycle event generator; it doesn't seem like sandbox deletion is done the same way, and so it added like ten-ish seconds to pod deletion, which was unacceptable. So he's investigating how to do that. He may have had questions, but I don't know what those are — but that's it.
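(A rough Go sketch of the trade-off just described, with stand-in names: deleting the sandbox inline adds its latency to the user-visible pod deletion path, while queueing it to a background garbage-collection goroutine keeps deletion fast. Not the kubelet's actual code.)

    package main

    import (
        "fmt"
        "time"
    )

    // removeSandbox stands in for the runtime call that turned out to be
    // slow (roughly ten seconds in the regression discussed above).
    func removeSandbox(id string) {
        time.Sleep(10 * time.Millisecond) // pretend this is expensive
        fmt.Println("sandbox removed:", id)
    }

    func main() {
        // Inline deletion: sandbox removal latency lands on pod deletion.
        start := time.Now()
        removeSandbox("pod-a")
        fmt.Println("inline pod deletion took", time.Since(start))

        // Deferred deletion: queue the sandbox for background GC instead.
        gc := make(chan string, 16)
        go func() {
            for id := range gc {
                removeSandbox(id)
            }
        }()
        start = time.Now()
        gc <- "pod-b" // pod deletion returns immediately
        fmt.Println("deferred pod deletion took", time.Since(start))
        time.Sleep(50 * time.Millisecond) // let the GC goroutine finish
    }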
M
Sure, this is a quick update. So the core implementation — I am pretty much close to completion on that. I got some reviews: David looked into it and left some comments, and I addressed them — thank you, David. And I believe the API has been partially reviewed, at least a first pass, and I've addressed the concerns and changed the code, but I don't know if there's a second pass —
M
— that's happened yet. I'm planning to. There are a couple of small items left, which are the resource quota and limit range. I don't think there's much work to do in resource quota; we agreed to use just limits for now. I do want to go in and take a look, now that we have an implementation, at whether there are any real problems with corner cases — there shouldn't be. And then there is the limit range piece, which should be fairly simple:
M
just to ensure that the resource quota update is not blocking resizing, and that limit range keeps the limits within bounds. Then the next big task is to do the e2e tests. So the goals for the alpha release are to have the basic e2e tests, and then e2e tests for multiple containers. So far I've been testing manually, and David, I did check the concurrency — two concurrent updates — with the mutex mechanism that I have there.
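(A minimal Go sketch of the kind of mutex guard mentioned, assuming a simplified resize handler: two concurrent resize updates for the same pod are serialized so they cannot interleave. Illustrative only, not the actual implementation.)

    package main

    import (
        "fmt"
        "sync"
    )

    // resizer serializes resize operations so that two concurrent updates
    // for the same pod cannot interleave.
    type resizer struct {
        mu  sync.Mutex
        cpu map[string]int // pod name -> CPU millicores, stand-in state
    }

    func (r *resizer) resize(pod string, millicores int) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.cpu[pod] = millicores // admit, apply, update status... atomically
    }

    func main() {
        r := &resizer{cpu: map[string]int{}}
        var wg sync.WaitGroup
        for _, m := range []int{500, 1000} { // two concurrent updates
            wg.Add(1)
            go func(m int) {
                defer wg.Done()
                r.resize("pod-a", m)
            }(m)
        }
        wg.Wait()
        fmt.Println("final cpu (one of the two wins):", r.cpu["pod-a"])
    }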
F
We decided either we needed to only ensure that requests don't exceed limits internal to the node, or we need to make sure that that can't ever happen external to the node — meaning that I'm not sure if we decided it's fine for pod-assigned resources to exceed the node capacity, or if we just need to make sure that those are never applied. So I don't remember what we decided, but I think there may be races there still, but I'll take another look. Is this for the resource quota?
N
A very quick description: so, in the past, kubelet, via cAdvisor, would report GPU metrics. That was a change that was made two or three years ago, I think. Since then, we've decided to move towards a different model for reporting metrics. What we've noticed more recently is that, because kubelet actually reports GPU metrics — specifically NVIDIA GPU metrics —
N
it now has an open handle on the NVIDIA driver, and this is a problem, because what that means is that, from a cluster admin perspective, if I want to update the NVIDIA driver or remove the NVIDIA driver, I need to kill kubelet. So I have a PR that is in flight right now — it's still work in progress — where I'm adding a flag to kubelet to disable this reporting of GPU metrics, which is done, by the way, through cAdvisor. And I think the discussion I want to have is, well:
N
Is it possible in the future to enable this by default? Right now the PR is just: if you are a user of kubelet, you can set this flag to disable it. But to collect these metrics there's some dependency on cAdvisor that would need to get removed, and I think generally the question here is: is this something that we think is desirable, and if so, what would be the process? Do we need to just open an issue? Is that something that we want to start advertising?
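(A tiny Go sketch of the shape of such a change, with a made-up flag name rather than the one in the actual PR: GPU metric collection is gated behind a boolean flag so that, when disabled, the driver handle is never opened.)

    package main

    import (
        "flag"
        "fmt"
    )

    // Hypothetical flag: gates whether the process ever initializes GPU
    // metric collection (and therefore ever opens the driver handle).
    var enableGPUMetrics = flag.Bool("enable-gpu-metrics", true,
        "collect accelerator metrics via the GPU driver")

    func startMetrics() {
        if !*enableGPUMetrics {
            fmt.Println("GPU metrics disabled; driver handle never opened")
            return
        }
        fmt.Println("opening GPU driver handle and collecting metrics...")
    }

    func main() {
        flag.Parse()
        startMetrics()
    }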
N
In my experience, people don't really use these metrics, because the metrics that are collected are a bit unreliable — or were very unreliable, especially GPU utilization. So, I mean, I think there are probably people out there that use it, but I definitely agree with the fact that at least we should be trying to remove this.
A
You know, so we've learned from the past, but I want to say that we definitely allowed this one, because we used to have a lot of debate in the SIG — we didn't want this, but did we accept it to move the ball forward at that time? So we definitely supported this one. But still, we don't know — we cannot cover all the use cases here. So let's just move on with your PR, and I —
A
— would disable this one with the flag, and tell people as transparently as possible: make that announcement through the release notes and all the other channels. And then, if there's no concern, no complaining from the community and from all the vendors, that is a win. So our next step, maybe at a follow-up meeting, is that we could just completely make this the default behavior.
F
I think GPU metrics seem to be different in that, one, Renaud has said that they have problems with kubelet opening — with using — the GPU driver, which is specific to them; as well, we have already, as the SIG, agreed on an out-of-tree replacement for these metrics, GPU metrics specifically. I do think that we shouldn't remove this flag and remove GPU metrics entirely without deprecating the summary API, as we have in the past tried to maintain backwards —
F
— compatibility. So I agree with your earlier point, Derek, that we should still tie this to an overall effort to remove any legacy API, as we have that cAdvisor dependency; but I think in this case, adding the flag so that users that have issues with the kubelet using the driver can turn it off is the right solution here.
A
There are comments from Dims and Alex, so I hope you can see them, Alex. And yeah, so to say, about those GPU metrics: I think there was also a SIG decision — we made the conscious decision. It's not like it just happened, because at least I was strongly against adding those into the kubelet summary API back then, but we had to, because we had many customers to support and a new kind of workload at that time. So we need to make the conscious decision now too. So I also agree with Dims.
A
They say, okay, there are too many flags; but to take this one together with deprecating the summary API and all those kinds of things — it's been like three or four years ago that I first asked for that, even before CRI, and we didn't make much progress. So I agree we need to think about the retirement as much as we can: coming up with cleaning up the code base a little bit and removing some of the unneeded code. Cleaning up the code base — that's the good choice.
O
I understand that, Dawn. It's just that it's frustrating to see that things are stuck and we are adding flags just to let somebody disable something for their own use case, rather than actually making a complete effort to clean things up. And, you know, since we are doing this, we have less incentive to go do the actual work that is needed. That's how I see it. Yes.