From YouTube: Kubernetes SIG Architecture 20190509
A: The first thing I wanted to bring up is that the next meeting in two weeks is during KubeCon EU, so we're going to go ahead and cancel that now. If it's not already, I'll go ahead and get it cancelled in the calendar here shortly, so you don't get any meeting reminders or anything while you're at the conference. So that's the first thing. The next thing we have up is an update on the triage process. Brian was going to show us some stuff, which is why today he's ready to screen share and present.
B: These USB-C adapters seem a little bit flaky. So anyway, yeah, I went through the triage process as a dry run a couple of times and sent a summary out to the mailing lists. I'm just going to go over that live, because some others thought that would be useful.
So the general idea of triaging issues is similar to other projects you've probably worked on. What's a little bit different about SIG Architecture is that, you know, we often don't field bug reports like broken tests or features that are broken. We mostly field feature requests, but otherwise the process is the same.
The goal of triage is generally not to fully resolve issues, but to figure out how to move them closer to resolution and reduce the work of other people who may stumble across those issues or try to help with triage in the future. So I sketched out the process I've used in the past — some things that work for me, which I realize won't work for everyone. Also, some things have changed.
B: So we have some folks in the project who are helping out with triage generally and routing issues to SIGs. Also, the bot commands mean reporters in some cases can suggest SIGs when they file issues, and they may not necessarily know which SIGs are the right ones, so they tend to add lots of SIGs, hoping that some of them will be the relevant ones.
So, you know, I was hoping the bots would add needs-sig, needs-kind, and needs-priority labels, but I didn't actually find any issues that needed those, because somebody else had already decided most of those things. needs-priority is also not needed for every issue, so that one doesn't show up. Searching for those needs-* labels turned out not to be useful, so I just started with the sig/architecture list.
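A minimal sketch of the kind of label searches involved, using the public GitHub search API; the exact query strings are illustrative, not the ones used in the meeting:

```python
# Sketch: query open kubernetes/kubernetes issues by triage-related labels.
import requests

QUERIES = [
    "repo:kubernetes/kubernetes is:issue is:open label:sig/architecture",
    "repo:kubernetes/kubernetes is:issue is:open label:needs-sig",
]

for q in QUERIES:
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": q, "per_page": 5},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    data = resp.json()
    print(f"{q}: {data['total_count']} open issues")
    for item in data["items"]:
        print(f"  #{item['number']} {item['title']}")
```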
B: Are the things categorized sig/architecture actually SIG Architecture issues, and are the issues clear? In particular for me, since I receive zillions of emails and GitHub notifications and other things, I have to decide what to look at from a very small amount of information. So if you look at — let's see, let me just look for "big picture", for example — and...
B: So we see here the kind, the type of the issue. Ideally we'd like to be able to decide on the disposition, and even change the disposition, of the issues just from looking at this list — you know, just check a bunch of things and say no, remove sig/architecture from these, or check a bunch of ones and add a kind. The categorization: is it a bug, is it cleanup, or is it a feature?
B: A lot of things in SIG Architecture are cleanup or feature, so the first thing I do is go through this list and look for things to act on. I actually just started at the top with this klog issue, and, you know, if you looked down my more detailed message, I actually saw a pattern here, which is there are a bunch of these klog issues, and I remember we discussed klog at previous architecture meetings.
B: It's logging, so that's SIG Instrumentation. So I actually suggested right here, "hey Dennis, should this be sig/instrumentation?" and they made it so. So actually this should no longer be categorized sig/architecture — is that right? Can everyone on the call see this? Yes.
B: We can just say, you know, /remove-sig architecture, and the bot will make it so. And I actually learned that the behavior has changed: I have the old habit of just changing the labels list directly, without the bot commands, and that actually doesn't even work anymore. The bot actually looks through the history of all the sig adds and removes, and without the sig commands you end up fighting with the bot. So definitely use the commands.
B: That's even if you could, in fact, change the labels directly because you have write access or admin access to the repo; otherwise the bot will undo what you did.
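As a concrete illustration, the triage commands go in an ordinary issue comment rather than on the labels themselves; a minimal sketch of posting one through the GitHub REST API (the issue number and token are placeholders):

```python
# Sketch: post Prow triage commands as an issue comment instead of
# editing labels directly. Issue number and token are placeholders.
import os
import requests

ISSUE = 12345  # hypothetical issue number
COMMENT = "/remove-sig architecture\n/sig instrumentation\n/kind cleanup"

resp = requests.post(
    f"https://api.github.com/repos/kubernetes/kubernetes/issues/{ISSUE}/comments",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={"body": COMMENT},
)
resp.raise_for_status()
print("Posted:", resp.json()["html_url"])
```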
So the other thing is, once you type commands — a particular problem for me is that GitHub will subscribe me to the issue. This used to be true even if I just changed the labels.
B: Yeah, so if you changed the labels in the UI, that used to subscribe you too; they fixed that, so it doesn't anymore. But if you use the bot commands, you're still subscribed. So I now always click unsubscribe if I don't want to follow the issue going forward — which doesn't help me a lot, that's like one in 30,000 things I'm subscribed to, but it's a good habit to get into for people who are triaging; otherwise you'll end up getting flooded.
B: So this one I actually want to be subscribed to, so I commented and subscribed. This one is appropriately sig/architecture. One issue I ran into is that sig/network is also on it, and SIG Network has a bot that labels things triage/unresolved — so thank you to Valerie, I think, for pointing that out.
So I did not remove the triage/unresolved label, even though I triaged it for SIG Architecture. We need to figure out what we want to do about that: multiple different states with multiple different processes. Something else I realized:
B: This issue had already been triaged and categorized to the code organization subproject. It is actually also on the project board — under Projects it says Backlog in the code organization project. The API reviews, code organization, and conformance efforts all have project boards, so the triage process for those subprojects is to start in some kind of input column, like Backlog or To Be Triaged or something like that.
B: Once it's categorized to the right subproject, that subproject will be able to see it and pick it up with their triage process. So we want to move it from the general sig/architecture pool to the correct subproject. Right now there's no good way of looking at this list view and seeing whether it's already on the radar of the subproject — whether it's on their project board. There is work underway to address that; there are a couple of different ways to address it.
B: One is to have labels that correspond to the project boards, or just say that it's in some project. Another way is to have the project boards automatically slurp in issues based on a query. So, for example, if we had an area/code-organization label, we could just know that the issue must be on the code organization project board, because that project board automatically slurps in those issues. So that's another thing that we're working on getting resolved.
So there's another klog issue — a dependency graph for analysis; that's the source dependency graph.
B: There was one about so-called deinit containers, which was actually related — it's yet another deferred-container-like proposal, which is not sig/architecture specific, so I removed sig/architecture. Some issues were labeled rotten or stale by fejta-bot, so for those I decided whether to remove those labels or close them.
B: So in general, you know, it's just going down this list and making sure they're categorized to the correct SIGs, they're categorized to the correct subprojects, they're on the project board if they belong to one of the SIG Architecture subprojects, and they're correctly categorized by kind — cleanup or feature request.
B: Well, it's also in my email: "memory limit set incorrectly in a Helm chart makes the cluster crash", right? So that was the user's specific scenario where they triggered the crash, but reading through it I concluded it was a cascading failure scenario, and I have seen that in other scenarios, so I went tracking down: well, what are the related issues? Also, you know, this was a case where the person just added a bunch of SIGs — sig/node, sig/architecture, sig/scheduling, sig/cluster-lifecycle — and GitHub
now helpfully — it doesn't show it now, but it said, you know, this is the first issue this person has opened in Kubernetes, right? So you can say, well, this is someone new to Kubernetes; I didn't have to go search for that like I might have done in the past. So they weren't really sure how to categorize it, and, you know, cluster lifecycle was not super relevant in this case, so I removed sig/cluster-lifecycle using the command.
B: Initially I tried just doing it with the labels, and I found I was fighting the bots. So, you know, the oldest issue of this flavor was 2529, which I remember, and I happen to know — I happen to remember — that this issue was not resolved. Yeah, it was closed, but not resolved by addressing the underlying problem; it was worked around. So I went chasing down more, and there have actually been a bunch of these filed over time.
B: Let me move this below. I decided to just use this recent one, and it's the umbrella issue for this. So that's what I will talk about.
B: But hopefully, you know, if they're all on some relevant project board, we can decide: are we ever going to do them, or should we close them? Which is better than having the bots just force us to do it. SIG Architecture is actually in good shape compared to many other SIGs, like API Machinery or Node, that have massive backlogs of issues, so we're in relatively good shape here. But yeah, I mean, if it's feasible.
B: Generally I lean on the side of keeping them open, although I know Tim went through and closed some that, realistically, we're never going to do. Some are just obvious from the beginning that we're never going to do them — like there was a recent request to re-architect the entire system from master/slave to peer-to-peer. Like, yes, thank you for your suggestion, but not feasible.
B: So, in a nutshell, that's the general process. You know, I think for people who want to take a stab at it, I would say: don't be afraid. If you're not sure about some specific issue, you can actually ask the author for clarification about what they meant, and that will definitely help anybody else
who tries to take a look at it. If you're not sure — if you have a guess as to what the cause might be, but you're not really sure — you can look it up, or you can ask others in the community on Slack or on the appropriate SIG mailing lists. Right, in this cascading failure one, just as an example, the person said their container got OOMed, and that caused the cascading failure. So, you know, if you're not really sure
what actually happens if a container keeps getting OOM-killed, you could just do a Google search and find out, right? So I actually did that: even though I was pretty confident what the behavior was, I just Google-searched a couple of things, like "kubernetes container exceeds memory limits" — you know, I happen to know that that causes an OOM kill.
B: So if you're aware of it, you could also search for documentation, and I actually just found an example where someone had this exact scenario. It showed, yeah, they started a container, it went to status OOMKilled a few seconds later, and it went into CrashLoopBackOff, right? So then the question is: well, once it's in CrashLoopBackOff, what causes it eventually to get evicted from the node? So we could actually go back and ask the original issue author:
B: Hey, do you happen to know what caused it to get evicted off your node? Are you able to reproduce it? I haven't actually done that, but that might be something we can do.
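Incidentally, the OOMKilled-then-CrashLoopBackOff pattern described above is easy to check for yourself; a minimal sketch with the official Kubernetes Python client, assuming you have a kubeconfig for some cluster:

```python
# Sketch: find containers that were OOM-killed and are now crash-looping.
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to a cluster
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        last = cs.last_state.terminated
        if (waiting and waiting.reason == "CrashLoopBackOff"
                and last and last.reason == "OOMKilled"):
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"container {cs.name} was OOM-killed, now crash-looping "
                  f"(restarts={cs.restart_count})")
```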
I do happen to know that a common case is local disk filling up — it eventually fills up, you know, from log-spewing things, or containers piling up, or all kinds of things. That's not the only failure mode that can happen, but if the disk fills up, the container runtime can't start any containers, and the kubelet will eventually report…
B: …to go change labels or whatnot. Or, I guess, one last point: don't assign anyone to an issue unless they agree to be assigned, because some people won't even notice that they're assigned, and if the issue is assigned, then other people will tend not to work on it or look at it — and that's actually a signal we want to enable, so that not everybody looks at everything. That's a waste of time.
D: …them all in my archive, so that when it inevitably gets assigned to me I have the whole history of it in my mailbox. Yeah, but yeah, I don't see them by default. And just anecdotally, I went through sig/network triage last night and I did about a hundred bugs, many of which were opened three years ago and have activity in the last two weeks — and it was, like, specifically people asking me for input where I'd never seen them. And it's really bad; I don't really know how to break that.
J: Tim, that sort of thing has occurred to me, and it seems like different people have different methods of managing their information overload. Do we have anything written down somewhere that says: if you want to get people's attention, do it in the following ways; this is the standard way of doing it? Because otherwise, you know, Brian says he never reads these things, and somebody else says they never read those things, and somebody else says please assign it to me, and somebody else says never assign it to me because I don't see it. Okay.
B: I don't know about everyone in the whole project, but the recommended approach for SIG Architecture is the one I described, which is: try to get all the issues into some subproject's project board, and then that subproject has its own curation process, moving things through the different stages on their project board. If you think an issue needs to get discussed soon, raise it on the SIG Architecture mailing list, and if it doesn't get resolved there, then schedule it for a meeting. So that's the recommended approach for SIG Architecture.
D: So, just to riff a little bit on what Quinton was saying: I've noodled around, but I've never really come up with something satisfying — some descriptive way for me to say, hey, here's my preferred contact mechanism, and to have one of the bots say, hey, it looks like you mentioned this person; here's their preferred contact mechanism, right? And I worry that all the stuff you're ignoring would just accumulate if you do that. Maybe. I mean, in this case there are really two classes:
D: I just get mentioned on a ton of things, because apparently I talk a lot, and I also have people who are asking specifically, like, can you look at this? And yeah, I have a hope that if somebody got a message that said "here's Hockin's preferred contact mechanism: if you need him specifically to look at this PR, hit him on Slack", right, and it was just automatically posted every time somebody mentions me — that would be great for me. Maybe. Or maybe it would just turn into another snowball.
K: I just wanted to say that Christoph actually proposed something similar on the contributor experience list yesterday that you should check out if that's interesting to you — tying your GitHub status into automation, and whether you get auto-assigned as a reviewer and things like that. Oh, Christoph's on the call — I'm sorry, I spoke for you there. No?
H: But one thing — trying to handle some of the overload, and in particular PR review overload — a feature that we've got that's baking right now (I'm going to send it out to k-dev tomorrow and hopefully turn it on next week) is: if you set a GitHub busy status, the bot will not try to assign new PRs to you for review.
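For reference, the busy flag being described here is exposed in GitHub's GraphQL API as the user status; a minimal sketch of reading it (the login and token are placeholders):

```python
# Sketch: read a user's GitHub status, including the "busy" flag
# (indicatesLimitedAvailability), via the GraphQL API.
import os
import requests

QUERY = """
query($login: String!) {
  user(login: $login) {
    status { message indicatesLimitedAvailability }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    json={"query": QUERY, "variables": {"login": "example-user"}},
)
resp.raise_for_status()
print(resp.json()["data"]["user"]["status"])
# e.g. {'message': 'busy', 'indicatesLimitedAvailability': True}
```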
H: We want to be thoughtful about how we do it, because we also don't want to flip to the other side, where we're spamming issues with individual user statuses. I'm cognizant about not wanting to post more things directly into issues and PRs, because it can get spammy really quickly. But if we're thoughtful, there are other ways that we can tie that GitHub user status into automation, because it is an available thing in the API, and you can say, like, hey, I have limited availability, I'm busy.
I: One thing that came up on a Prow issue is: if we could just add a unique string to every type of notification the bot gave us, that would let us set up email filters. Like, I would just love to have an email filter that deleted all test-results emails, because I really don't care about test results on things that I've reviewed, and it sends multiple bumps to the top of my inbox every time you push. So that would be super useful.
H: We do that very inconsistently. There are a couple of types of notifications where we actually do that, because it allows the bot to identify the same type of comment and be able to edit it or clean it up. But it is a good suggestion that we do that across the board for bot comments: hey, if the bot comments on a thing, it uses this unique string — even just in an HTML comment that's hidden from the UI, but something that is searchable, either through email or…
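A minimal sketch of what such a hidden marker could look like and how to find it; the marker string and issue number are hypothetical, and Prow's actual per-plugin markers differ:

```python
# Sketch: detect bot comments by a hidden HTML-comment marker string.
import requests

MARKER = "<!-- test-results-notification -->"  # hypothetical unique string
ISSUE = 12345  # hypothetical issue number

resp = requests.get(
    f"https://api.github.com/repos/kubernetes/kubernetes/issues/{ISSUE}/comments",
    headers={"Accept": "application/vnd.github+json"},
)
resp.raise_for_status()

bot_comments = [c for c in resp.json() if MARKER in c["body"]]
print(f"{len(bot_comments)} comments carry the marker")
```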
B: I'm going to timebox this and wrap it up, but if someone wants to help document and evolve the triage process itself: CPM previously reached out about helping with project management and curation, SIG Contributor Experience is obviously working on this as well, and other SIGs are developing their own processes. So it'd be great if we had — like, I don't know that the processes need to be 100% consistent, but if we had some known practices that work, communicated across the different SIGs, that could be a recommended starting point.
E: Hi, so I had a question here to the people who are on the call: who wants to do this? So we can try this a couple of times — I mean, every week we can try once. I know Valerie was interested; is anybody else interested? Because I would like to organize a session — a half-hour session — every week where we do this review, so we can write things down and, you know, work out the kinks in the process. So George is interested. George, can you add yourself to the doc?
B: …who are participating. More important is that there is some explicit indication on the issues about the disposition of that issue. So, moving on to reliability specifically: I talked about this at the contributor summit — quality being job one. I mentioned a bunch of issues relating to quality, like correctness and scalability and whatnot; specifically, I want to talk about reliability.
B: Reliability hasn't really been an explicit focus of the project. We don't have tests that are directly targeted towards reliability, so much; we don't have a SIG of the kind SIG Scalability is — horizontal — focusing on reliability across the project. As part of this triage process, after stumbling across this cascading failure, I looked at related cascading-failure issues and also reliability issues generally. So, going through area/reliability, I actually added the reliability area to a few of them, and removed it from some closed ones that were no longer relevant, or asked if they could be closed because they seemed stale. So this is a handful.
It almost certainly does not represent the full set of issues, but I think focusing on cascading failure in particular is worthwhile, especially since it's been reported a couple of times publicly recently. By the way, if you haven't seen it, there's this Kubernetes failure stories collection that someone in the community is putting together; I recommend looking through it.
B: It does list some of the cascading failure ones, like this one by Target, which was actually more about their systems failing than Kubernetes. But for sure, you know, there are a lot of these "million ways to crash your cluster", "100 ways to crash your cluster", etc. kinds of talks that have been recorded. This came up in three different contexts this week: it came up during this triage, and it also came up in conformance, actually.
B: One thing that I observed in conformance is that there were end-to-end tests that were looking at very specific messages reported back as part of the pod's container status, and that seemed to be because we were leaving broken containers — ones that were never, ever going to work — running on the kubelet, to work around the fact that if the kubelet killed them, the controllers would recreate them immediately, in a hostile way, in a tight loop, which defeats the purpose. So we seem to…
B: This has been rediscovered by several different people in several different contexts. It was done with security context — we don't run containers as root, and it was requesting root. It was done with AppArmor. It was actually also done with things like resource quota, where we have a bunch of workarounds for the fact that the controller behavior doesn't degrade gracefully.
B: So the pattern that seems to have evolved is: try not to make the controller think it has to recreate the pod. That's not really helpful, because it hides a useful signal from various things in the cluster. So I think we need a better solution to that, but we should at least have agreement between workloads and node and scheduler — and, you know, I don't know who officially owns it; I think SIG Node owns the node controller as well.
J: Brian, it sounds like there are two sort of distinct areas here. One is the stability of the Kubernetes infrastructure — you know, crashing the whole cluster, and those issues you mentioned. The other one is application stability, some of which is affected by how Kubernetes manages containers. Are you wanting to tackle both of those areas, or one more than the other?
B: I'm referring to the former right now — the stability of the cluster itself, which can be impacted by things like pods being repeatedly killed and recreated, or containers crash-looping and filling up local disks, which causes nodes to fail and, you know, pods getting recreated and killing other nodes.

J: Okay, makes sense, thanks.

B: Yeah, so, you know, there are a couple of different ways we can tackle it. If we don't have critical mass for a bigger effort, we could just have a discussion amongst the relevant SIGs and come up with a plan for what to do.
B: You know, there are different levels of effort and organization we can put into this. Right now there are maybe four SIGs that I know are working on reliability efforts — API Machinery, I know, is working on rate limiting and preventing DoS of the API server, and SIG Node is working on some reliability efforts as well. We could at least come up with a common way of surfacing those things, such as with the reliability label, or, you know, just communicating across the different SIGs what's going on.
L: For pod security context, for instance, right — but I don't think the workload controllers should be involved in that loop. In particular, the scheduler shouldn't place things on nodes where they can't possibly schedule, for things that are going to CrashLoopBackOff. We could probably be more sensitive there — like, if it's continuously crashing, we could find a way to make sure that those controllers implement even more of a backoff than they do today, to try to stop launching pods that we know are never, ever going to work.
L: Especially for the security context one, there's no telemetry from the node to let the scheduler know that it can't even place the pod there right now, right? So it's not aware, and there are lots of issues like that. But if we're just looking to come up with the broad strokes of how we're going to approach the problem, I would suggest that we get the leads from SIG Node and SIG Scheduling and see — well, yeah.
B: Try to categorize what the different sources of problems are, so we can at least all have a common understanding of the current situation. And then, whether more information needs to get surfaced from the nodes, or we need to change the workload controller behavior, or we need to change the scheduler behavior — then we can start to have that discussion.
C: We tried to work on a policy for that — a rule like that. Okay, there was a mini proposal: one proposed a maximum restart count, so we'd have all those kinds of per-job or per-workload knobs. So there's another proposal, but that has to be on SIG Node. So we basically either decide this right here — the scheduler and the node handle the crash and the backoff on the node — but there are some more things on the controller there, and also the things on the scheduler, given the signal. So that's actually — this one is really perfect.
C: It should be SIG Architecture — you could help us, I guess, to have the policy, and then write it out particularly for the different controllers, and also particularly think about the workloads. It's not that we couldn't have the cap, but we should all agree on it as the overall community, so not just SIG Node. So basically, that's why I think, yeah.
B: Yeah, one issue that's been raised is how hard it would be. And again, if people are raising their hand or something in chat, I can't see that, so if someone can moderate and interject, that would be great. But yeah, one of these issues had a specific proposal for a generic rate limiter that even third-party operators and workload controllers could use.
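The transcript doesn't spell that proposal out, but as a rough illustration of the kind of generic rate limiting being discussed — something a workload controller could consult before recreating a crashing pod — here is a minimal token-bucket sketch; it is entirely illustrative, not the proposal from the meeting:

```python
# Sketch: a generic token-bucket rate limiter a controller could consult
# before recreating a crashing pod. Illustrative only.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. allow roughly one pod recreation per 10s, with a burst of 3
recreate_limiter = TokenBucket(rate=0.1, capacity=3)
if recreate_limiter.allow():
    print("recreate pod")
else:
    print("back off: recreating too fast")
```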
J: I was about to suggest that, but this seems like a big enough problem that we want to actually formally get a working group together. It's one of these things that we've started trying to solve so many times; if we actually put up a working group, with people who actually have time, and set a clear set of goals and a timeline, we maybe have a better chance of succeeding.
B
Actually
think
I
don't
know
the
whole
group
if
we
just
had
one
person
to
go
and
just
look
down
like
I,
said
and
write
a
summary
of
what
the
problems
in
the
workarounds
been
I.
Think
now
the
good
starting
points
like
I,
don't
necessarily
think
there
needs
to
be
more
meetings.
We
need
to
categorize
the
set
of
problems.
Then
there
will
need
to
be
a
proposal
to
address
them
and
then
there
will
need
to
be
reviews
about
causal.
L: I think — I'm not sure I'm convinced that this is one issue. I think there are several different issues here that would potentially call for several different solutions. I'm not going to oppose the idea of somebody just cataloging crash looping in general, right? That sounds like a reasonable thing to do: start off with what's wrong, then figure out, you know, what we should do. That sounds like a reasonable first step, and I agree — I don't think we need a whole group of people to do it.
L: So, to be clear, the output that we're looking for here is a categorization of issues related to tight looping — basically a list — and then from there we would go and figure out who should be responsible for coming up with proposals for how to address those, either as a group or individually. Then from there we would figure out ownership specifically.
J: Sorry, I had a slightly different question, which I thought was your original question, Brian, which is: is this area of reliability big and important enough that we actually need something bigger than a single effort to solve this particular problem — be that a SIG, or a sub-project of this SIG, or a working group, or something else? Or do we just leave it as it is at the moment, where kind of all the SIGs collectively try and make stuff reliable and, empirically, don't seem very successful at it?
B: Yes, that was something I was also getting at. I personally think it is big enough. I am concerned about whether we could staff it, but I do think it is big enough. For example, SIG Scalability created Kubemark to do some benchmarking; I could easily see a reliability SIG creating a variety of chaos tests, stress tests, and things like that to help bulletproof the system better — like, just deploy a crash-looping pod spread evenly across all the nodes of a large cluster and see what happens.
B: I don't feel strongly either way. I think it depends on the scale of the effort, and it can change over time: it can start as an informal group and become, you know, a subproject or working group or whatever, and evolve into a SIG. That would be fine. I don't think the problem will go away, so I think a standing body of some sort, rather than a working group, is eventually what we want; I just don't know how quickly we can get critical mass on it, I think.
F: Thanks, Tim. It's been delightfully boring; we've just been chugging away against the PRs that have been coming in. There is one KEP that John had proposed that I think holds a lot of promise to help federate some of the work and give us a way of prioritizing what we think is important in the long term, but I do think it needs a revision. So maybe once it's been revised, we can publicize it for feedback. And last but not least, there's been work on automation to help make our triage process a little bit cleaner.
E: So we had been trying to prune dependencies, and we have a few people looking at different options. We were able to make a few changes already, and some others are in progress. We are trying to switch gears now, trying to see what else we could do — like giving a specific theme to a specific meeting and saying, oh, can we talk about feature branches, or can we talk about, you know…
K: …and I'm going to make it super quick. All I need is your expertise and awesome counsel from this group, because you all are approvers and reviewers. I was thinking about an idea where we could serve important information, congratulations, swag, etc. to people who get put into OWNERS files, just so that we could deal with some of the discoverability issues and things like that that we have, and serve them, like, mentoring opportunities and things like that. So if you have any ideas, feel free to comment, especially ideas…
B: Thanks. And a quick announcement: the meetings during KubeCon are canceled, so if you're at KubeCon, enjoy it; if you're not, enjoy the free time. So that's it — thanks, everybody, very much. Definitely, if you have issues to discuss with regard to triage, the mailing list is the best format for long-form discussion, and if you just have a question, feel free to try Slack. See everybody in, I guess, three or four weeks.