From YouTube: Kubernetes SIG network meeting 2020-01-09
Okay, great. So I just want to bring some attention to a KEP that we have. It's pretty simple: add an app protocol to Services and Endpoints. The idea here is just to make it a little bit easier to work with load balancers, and I'm sure there are other use cases as well. There have been a few issues related to this for years now, and the suggestion was: well, we could just add a protocol to Service and Endpoint ports. This has already gotten a decent amount of discussion, and I wrote the comments specifically saying you should, if you can, use IANA service names, which are called out in an RFC, but you can use a prefix to define your own. So you can have company.com/my-special-thing, or you can say HTTP, which is the IANA-specified protocol name; and actually, I think IANA specifies HTTPS, TLS, and maybe a couple of other names along the same lines.
This has been, shall we say, "solved" with annotations that are frustrating to work with and not handled consistently across clouds. I think all three of the major U.S. clouds have different annotations that mean the same thing. Yeah, yeah, so there is currently no standard way to do the same thing across all of this. Sorry, if somebody was just talking, I couldn't hear you; you're really quiet. Sorry.
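A minimal sketch of what that proposal describes, using the client-go types as the field eventually shipped (appProtocol on ServicePort); at the time of this meeting it was still a KEP, so treat the names here as the later API rather than anything final in the discussion:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "https" is an IANA-registered service name; the prefixed form lets a
	// vendor define its own protocol name, as described in the discussion.
	https := "https"
	custom := "company.com/my-special-thing"

	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: corev1.ServiceSpec{
			Ports: []corev1.ServicePort{
				{Name: "web", Port: 443, Protocol: corev1.ProtocolTCP, AppProtocol: &https},
				{Name: "special", Port: 9443, Protocol: corev1.ProtocolTCP, AppProtocol: &custom},
			},
		},
	}
	fmt.Printf("%s exposes %d ports with app protocols\n", svc.Name, len(svc.Spec.Ports))
}
```

The point is to replace the per-cloud annotations mentioned above with one field that every load-balancer integration can read.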
Can you hear me? Yeah. Yeah, so I brought in this KEP about mixed-protocol support in the Service definition. Maybe you remember, there was an issue in December saying that it should be a KEP, and that was a kind of heads-up. But now there is a first draft, and I tried to investigate the different cloud providers and the different cloud load balancers from a pricing perspective.
What would it mean if an additional protocol is implemented, or added behind the same IP address, or behind the same load balancer instance? So that is now in the KEP. And also, if there is a need for some optional control, there are some proposals in the KEP for that: either as an annotation, or similar to the current MetalLB implementation, or a GCE workaround.
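A minimal sketch of the kind of Service the KEP would enable: TCP and UDP ports behind the same LoadBalancer IP. At the time of this meeting, API validation still rejected mixing protocols on a type=LoadBalancer Service, so this is illustrative only:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// DNS is the classic case: the same port served over both UDP and TCP,
	// ideally behind a single cloud load-balancer IP.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "dns"},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeLoadBalancer,
			Ports: []corev1.ServicePort{
				{Name: "dns-udp", Port: 53, Protocol: corev1.ProtocolUDP},
				{Name: "dns-tcp", Port: 53, Protocol: corev1.ProtocolTCP},
			},
		},
	}
	fmt.Println(svc.Name, "requests", len(svc.Spec.Ports), "ports on one LB IP")
}
```

The pricing question above matters because some clouds bill per forwarding rule or per listener, so a second protocol behind the same IP is not always free.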
We're back in the room? Okay, I've got a couple quick ones. The first one is with EndpointSlice: we realized that we need to split it into two different feature gates to do a seamless upgrade. The rationale right now is that you can't guarantee, during a cluster upgrade process, that all EndpointSlices are going to be created by the time they would be consumed by kube-proxy. I imagine in 99% of use cases...
Yeah, I mean, it's... unfortunately, the EndpointSlice controller is limited by whatever the API rate limit is, and we have no idea what else might be consuming that rate limit. So you can't assume that the EndpointSlice controller is going to get all 20 QPS; it could be any number of things. So in a really big cluster, with a lot of EndpointSlices, it's potentially a problem, and the way the kube-proxy support is written, it's one or the other. Just... yeah, yeah.
Moving on to the next one, then. This has been going on for a bit: it's clarifying Ingress v1, and specifically what we're doing for pathType there. So pathType is a new idea in Ingress. We have ImplementationSpecific, which is basically what we've already done, and that's backwards compatible; but we're also adding path types that are Prefix and Exact, so you can be a little bit more clear, for each path, how you want it to be matched, rather than deferring to whatever is implementing it. There's been a pretty long discussion around how we should default those, because this is adding a new attribute, but we also want it to be required at some point. So I just want to bring attention to that KEP: if you have any strong feelings on how we should approach pathType, definitely add a comment.
I think we've reached a consensus there before; he was actually involved in that KEP. But since this is a little larger audience, take a look. And if you're interested in Ingress v1 and the progress there, we're meeting every other week, just at the opposite time of this meeting; so a week from today it'll be Ingress v1, taking up the work on the Ingress.
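For reference, a minimal sketch of pathType as it eventually shipped in networking.k8s.io/v1; at the time of this meeting the KEP was still settling defaulting, so the spelling here follows the later API, and backendFor is a hypothetical helper:

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// backendFor is a hypothetical helper that points a path at a Service port.
func backendFor(name string) networkingv1.IngressBackend {
	return networkingv1.IngressBackend{
		Service: &networkingv1.IngressServiceBackend{
			Name: name,
			Port: networkingv1.ServiceBackendPort{Number: 8080},
		},
	}
}

func main() {
	exact := networkingv1.PathTypeExact
	prefix := networkingv1.PathTypePrefix

	ing := networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: networkingv1.IngressSpec{
			Rules: []networkingv1.IngressRule{{
				Host: "example.com",
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{
							// Exact: matches "/login" only, not "/login/reset".
							{Path: "/login", PathType: &exact, Backend: backendFor("auth")},
							// Prefix: matches "/api" and everything under it.
							{Path: "/api", PathType: &prefix, Backend: backendFor("api")},
						},
					},
				},
			}},
		},
	}
	fmt.Println(ing.Name)
}
```

ImplementationSpecific remains available for controllers that already defined their own matching, such as regex paths.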
So it's not me, but there's precedent for it, right? As soon as I say, like, if this is spec.backend, which isn't going to exist anymore, it says that it doesn't exist, which kind of makes sense, because we currently pass just the internal type into the Ingress spec validation. But we delete that, because we're... you're renaming it, right?
So you can write the code, in this case, that checks the defaultBackend field and then switches on the API version: for v1beta1 it spells it "backend", and for v1 it calls it "defaultBackend", and you fall back on that for the error message. It's really ugly that it has to be open-coded that way; there's an ongoing discussion about validating on real versioned types, but...
Right, so I'm saying you only switch in the error message that you produce. The logic is going to operate on the internal type, with the internal names, and then, at the last second, if you detected that it failed validation, you do an if clause to see which name to put in the error message. Oh, okay.
I understand. So I don't really need to validate the backend type; it's just, if this thing fires at all: if you're v1, say this; if you're v1beta1... yeah, kind of dirty, but that makes a lot more sense. Yeah, so I should be able to finish that up, you know, tomorrow morning. That's what I was working on today, but I got lost in the types thing. Okay.
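A minimal sketch of the pattern just described, with hypothetical helper names (the real code lives in the apiserver's Ingress validation; validation runs on the internal type, and only the error message switches on the version the client used):

```go
package validation

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// defaultBackendFieldName is a hypothetical helper: v1beta1 spells the field
// "backend" while v1 renames it to "defaultBackend", so the error message has
// to be open-coded per version, as lamented above.
func defaultBackendFieldName(gv schema.GroupVersion) string {
	if gv.Version == "v1beta1" {
		return "backend"
	}
	return "defaultBackend"
}

// backendFieldError renders a validation failure using whichever field name
// the client actually sent.
func backendFieldError(gv schema.GroupVersion) error {
	return fmt.Errorf("spec.%s: Required value", defaultBackendFieldName(gv))
}
```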
Yeah, I put a comment there. I can submit it, and then people can basically reply, and if you're going to be there, we could coordinate. Yeah, let's get it in; no reason not to. Now, they said there may be a lack of rooms, so we should just submit it. Our sessions are always very well attended; let's put it in, and we'll figure out what to do with it. Okay.
Yeah, so maybe I missed something, but it seems to me that Kubernetes is pretty good; in fact, I really like the Kubernetes control plane. It's a distributed system built with a lot of attention to resilience to all sorts of crap that can go wrong. But there does seem to be one blind spot, or maybe I'm just missing something. You know, one of the failure modes that I see is if something goes wrong in the data plane that the CNI plugin establishes.
Nothing really detects that specifically, right? There are health checks on containers, that will restart containers; there are health checks on nodes. But nodes communicate over the host network, not the cluster network, right? So what I'm saying is, it seems like there's kind of an opportunity, or some kind of issue here, where nothing is detecting the fact that the cluster network specifically has problems, and reporting it, per se.
So there's two things that we're working on upstream in CNI on that. The first one was the CNI CHECK command, which got added to CNI sometime, I think, like mid last year, and I believe that there is now support in CRI-O for that command. That's still on a per-container basis, but it explicitly allows whatever CNI plugin is handling the networking there to check its control plane and return errors, at which point the pod will be terminated and then retried. So that's kind of the short-term fix.
That CHECK support needs to get into the dockershim part of Kubernetes, and that's kind of been on a to-do list for somebody on my team for a while, but we haven't quite gotten there yet for dockershim; we did add it to CRI-O. The second part is kind of a much longer-term CNI gRPC API, which would allow events from the plugin to go back to the kubelet asynchronously. But that's also kind of tied in with discussions, still at a pretty early stage, around what the overall networking API should look like for Kubernetes. So...
Part of that is that the CHECK command, for sandboxes specifically, is supposed to answer, essentially: does this container have network access? And if the control plane is not working for the plugin, then you could argue that, you know, that container may or may not actually have network access; it depends on the control plane itself. But it's also intended to do things like, you know: does this container still have an IP address?
I think that's a real challenge: depending on whatever plugin you are, how do I know that you have connectivity? My pod might not actually be doing anything with the network; it might not have any open sockets. It might be blocked by network policy on egress, so I can't go in and poke at it. You have to do it at the driver level, right? Or something on the host, to make sure you're getting something across that pipe.
It would be part of the pod sandbox status, and that would get called every time PodSandboxStatus gets called through CRI. The plugin can do as much or as little as it wants to do, and if it decides that, for some reason, the sandbox is unhealthy network-wise, then it would return an error, and then the expectation is that the kubelet would decide that that pod needs to be torn down and restarted.
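A hedged sketch of how a runtime might drive that CHECK through libcni (CNI spec 0.4.0 and later); the paths and IDs here are illustrative, and a real runtime such as CRI-O wires this into PodSandboxStatus as described:

```go
package main

import (
	"context"
	"log"

	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Illustrative locations; the runtime supplies real ones.
	cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

	list, err := libcni.LoadConfList("/etc/cni/net.d", "mynet")
	if err != nil {
		log.Fatal(err)
	}

	rt := &libcni.RuntimeConf{
		ContainerID: "example-sandbox-id",
		NetNS:       "/var/run/netns/example",
		IfName:      "eth0",
	}

	// CHECK asks the plugin to verify the sandbox still has the networking it
	// set up (interface, IP, routes, plugin control plane). An error signals
	// that the kubelet should tear the pod down and retry.
	if err := cni.CheckNetworkList(context.Background(), list, rt); err != nil {
		log.Printf("sandbox network unhealthy: %v", err)
	}
}
```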
Also, what do we want to get out of it? Like, if we do a service-level health check, what's the mitigation? The mitigations we have are restarting pods or, you know, draining nodes, I mean recreating them. Is there that much more we can check? What are the failures that we'd be checking with that service-level health check that we can't check in these other ways?
Right. I'm thinking of situations... I mean, I've seen two cases, right? One is, something decays on a node, and then all the pods on that node that are using cluster networking become unreachable. And then the other case is, something is wrong in the cluster, and everything is unreachable on cluster networking. Of course, in the latter one there's no recovery, but you might like some kind of report that says: hey, your cluster networking is totally hosed. And in the other case, you know, restarting, or really blacklisting, the pod...
We do have that check right now, and so you can sort of simulate this right now, as long as you use something like a DaemonSet, or something outside of the kubelet, to remove your CNI config file. And then, depending on what CNI driver you use, that may trigger the node to go unhealthy if the config file is gone. So if...
What if, hypothetically, we've lost a route in the cloud provider? So within the node, everything's happy: the gRPC CRI, or the CNI CHECK, whatever, is all running fine; even a local ARP or ping across a veth pair will work fine; all pod health checks will be successful. The kubelet, because it's using the host network, will be reporting to the API server that all the pods are happy and healthy, but the pods themselves actually can't get in or out of the node.
I think, in our style of doing things, right, we would want end-to-end behavior tests, not something that's tied to a particular implementation. I guess I also want to back up again to this interesting point of: what's the recovery, and what are we really looking for? I talked about problems in the node and in the cluster. So for the case of problems in the node, it suffices to actually have specific probe pods deployed for this, so that they're there.
Yes, so this is making me think of something that I always think of when I see the weird network policy rule that the node is always allowed to connect to every pod, so it can do health checks. That's the fact that we don't really specify anywhere whether health checks are supposed to be testing if the pod's server is working, or if the network is working to the pod. But maybe this just makes sense as another kind of health check, where we say explicitly that we want some component to periodically test.
I think, to integrate with a lot of higher-level load balancers, you need an externally reachable object, but we don't require it, so it's been challenging to try to figure out how we make that work without requiring a tie-in. I'm not against the idea of a service-level health check; like, we keep adding service stuff. But adding a service-level health check that defines, behaviorally, what it means for the pods behind the service... do I...
Yes, although I'm still thinking about the node case, you know. And maybe this is good, because it's more compositional, right? If we can test that the... well, yeah, I mean, you can imagine all sorts of things. But take the case of something decaying on a node: if we have something that can detect that, then of course that'll help anything that's running on that node, and also if you're testing the service.
Okay, yes, right, right. If it's not consistent throughout the node, then that's a little harder. That gets back to the problem of: how can we test every pod on the node, rather than a probe pod on the node? It would be every pod that's using cluster networking. And, I guess, back to ping: maybe we want to just be doing something with ping where network policy allows it, or maybe define network policy to allow it, or have some way of inspecting network policy to see where it's allowed.
I mean, maybe an incremental step towards proving this out would be to write that prober, but actually write it as an operator that creates a DaemonSet, runs a probe, tears down the DaemonSet, and then periodically recreates it, so that you're forcing it to allocate and reallocate IP addresses. That way you bypass some of these "it used to work, but now it doesn't" problems. It would have to be rate-limited to avoid thrashing... yes, yes.
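A minimal sketch of that operator idea, assuming client-go, in-cluster credentials, and a hypothetical probe image; a real version would also watch probe results and report node health rather than just cycling the DaemonSet:

```go
package main

import (
	"context"
	"log"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	labels := map[string]string{"app": "net-prober"}
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "net-prober", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{Containers: []corev1.Container{{
					Name:  "probe",
					Image: "example.com/net-probe:latest", // hypothetical probe image
				}}},
			},
		},
	}

	for {
		// Recreating the DaemonSet each cycle forces fresh sandbox setup and
		// IP allocation on every node, catching "used to work" failures.
		if _, err := cs.AppsV1().DaemonSets(ds.Namespace).Create(context.TODO(), ds, metav1.CreateOptions{}); err != nil {
			log.Printf("create: %v", err)
		}
		time.Sleep(5 * time.Minute) // probe window
		if err := cs.AppsV1().DaemonSets(ds.Namespace).Delete(context.TODO(), ds.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("delete: %v", err)
		}
		time.Sleep(30 * time.Minute) // idle between cycles; rate-limited to avoid the thrashing concern above
	}
}
```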
...something at Lyft that we've been fighting with a little bit. Our current idea is maybe defining a custom threshold: a percentage of workloads that can go unhealthy on a node before we decide the node is unhealthy. Making, like, the DaemonSet-as-a-new-resource approach could work, but I worry that there are so many failure edge cases where a new resource might still succeed when there's a problem.
If you want the enterprise-grade version of this, you've got something that runs in your cluster, where each plugin reaches out to the cluster agent. You have two of them, so you're guaranteed one of them is not on a given node, and they both have to reach in. And because the plugin knows all the details, it can say: okay, ping will work fine, or ping won't work fine, but you can do this other thing.
This is an interesting topic, and I think it segues into the Gateway and service health-checking stuff nicely. It's hard to abstract it and say we're going to cover every possible use case. I mean, I don't know if there are some real root problems, Mike, or if maybe we could accumulate from people... have we got something around experience?
I mean, you know, I'm not running production stuff myself, but in my own experience, cluster networking is something that fails. So yeah, I think gathering experience from people who do have it would be a great place to start. So I'll send out an email, start an email thread on the mailing list; maybe we'll gather some experience and think about what sort of... I want to focus really on what really matters. Ultimately, the question is what the remedies are; well, no, remedies can be kicked upstairs, and all sorts of stuff.
Well, I hope that y'all stick with us. Something we should do, that I would love to do as a group, is a once-a-year backlog-burndown edition of one of these meetings: go through not just the triage, but, like, all the things that we have triaged into bugs, and see if there are any that are closeable, or whether we can get some of these new contributors to help out with some of them.
That'd be awesome. Just from my perspective: over the last couple months I've been trying to find issues, to kind of start dipping my toe in the water, and the minute I start trying to look at one, you know, someone will jump in and they've got a PR for it, and away we go, and so on to the next thing. So it'd be helpful for me, you know, to get a little direction from the team, like where some good places to jump in and help are. Sure.
Yeah, if you want to jump in, like, leave a comment on whatever issue or PR or whatever you're thinking about, because lots of people, when they're going through PRs, if they see that somebody else has already claimed responsibility, they'll be like: oh good, I don't need to touch this one. Then it'll be all yours. And the converse...
Okay, that's a great starting place. And people should not be shy about jumping on things; like, don't worry about stepping on toes or those sorts of concerns; there's plenty of work to go around, and not all of it is organized very well. And I'll close with this: if you get stuck, jump on Slack, and we do talk.