From YouTube: Kubernetes SIG Network Meeting for 20220428
A
This is the SIG Network meeting for April 28, 2022. Do we have a triage lined up for today?
A
C
B
Oh no, I mean there were a bunch open, but people have been asking questions on them, so we don't need to discuss them here. I think the only one that I thought was worth discussing here was this one, which Cal opened. Cal, I don't know if you're here.
B
No problem. He's merely asking the question of what the intention of the loadBalancerSourceRanges field is. I thought if he was here we could discuss it, but there's discussion in the bug, so maybe we'll just leave it there.
B
D
B
D
B
I think you're more likely to know. So I think there are two problems. One, Cal thinks it's underspecified, which maybe the text of it is, but it is being validated. And the second part of it is, like, should we be programming iptables with it? I think the answer is yes, and I don't see it as problematic.
B
A
E
People might be traveling, especially if they're participating in, like, pre-event Contributor Summit stuff, etc., but...
A
A
So I'll do that, and then we've got Alexander to discuss the cloud controller manager. Yeah.
G
G
Okay, I hope you can see my screen; if not, tell me so. Essentially, I would like to talk about the cloud controller manager, and specifically at scale. I'm kind of working on a project right now where we have a many-to-many relationship between LBs and cluster nodes, which isn't really all that common in the normal cases.
G
But in this case it's something that is being used in this project, and we are encountering a couple of issues with regards to the time to sync by the cloud controller manager. What we're seeing is that this is mainly happening on large clusters, where nodes transition from not ready to ready, or back and forth. Having looked at the implementation of the cloud controller manager, and especially at the sync loop for the nodes, what I'm seeing is that we sync all load balancers for each node event.
G
So to speak, whenever one transitions. And in a specific corner case this is causing outages for clients which are trying to connect to these LBs. I just wanted to quickly present this. I already filed the PR which kind of fixes it; obviously I'm looking forward to getting your input on that and maybe reaching consensus, but the idea is essentially this.
G
So in this case we can imagine that we have five load balancers, each of which is using externalTrafficPolicy: Local, so they're only pointing to one node, and one of these nodes transitions from ready to not ready, which will cause all of these load balancers to get updated.
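For reference, a minimal sketch (not from the meeting; names and ports are placeholders) of the kind of Service being described, type LoadBalancer with externalTrafficPolicy: Local, built with the k8s.io/api Go types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// With externalTrafficPolicy: Local, the cloud LB should only route to
	// nodes that host a ready endpoint, which is why node readiness flaps
	// end up triggering the LB re-syncs discussed here.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example-lb", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "example"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
	fmt.Println(svc.Name, svc.Spec.ExternalTrafficPolicy)
}
```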
G
So I detailed this a bit better in my PR, and that kind of maybe exemplifies things, so whenever people get a chance to have a look at that, please do. Essentially the idea here is that for externalTrafficPolicy: Local types of services, whenever a node transitions from not ready to ready, if that node isn't actually hosting the endpoint for that service...
G
...it's not really impactful for that service, so there's no real reason in actually updating that load balancer. Like I said, this is in this specific corner case, with a many-to-many relation and with externalTrafficPolicy: Local. Now there's a second problem as well, which is mainly with regards to applications which are high-intensity, as I call it, or low-latency, which kind of reduces the problem to the kubelet.
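A hypothetical sketch of the skip rule described above (illustrative names, not the actual PR), assuming the service controller can look up the service's Endpoints object:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeHostsEndpoint reports whether any ready address in the Endpoints
// object is bound to the given node.
func nodeHostsEndpoint(eps *corev1.Endpoints, nodeName string) bool {
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			if addr.NodeName != nil && *addr.NodeName == nodeName {
				return true
			}
		}
	}
	return false
}

// shouldResyncLB encodes the proposed rule: only skip the LB re-sync for
// externalTrafficPolicy: Local services, and only when the node whose
// readiness changed hosts none of the service's endpoints.
func shouldResyncLB(svc *corev1.Service, eps *corev1.Endpoints, nodeName string) bool {
	if svc.Spec.ExternalTrafficPolicy != corev1.ServiceExternalTrafficPolicyTypeLocal {
		return true // behavior unchanged for all other services
	}
	return nodeHostsEndpoint(eps, nodeName)
}

func main() {
	node := "node-1"
	svc := &corev1.Service{Spec: corev1.ServiceSpec{
		Type:                  corev1.ServiceTypeLoadBalancer,
		ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
	}}
	eps := &corev1.Endpoints{Subsets: []corev1.EndpointSubset{{
		Addresses: []corev1.EndpointAddress{{IP: "10.0.0.5", NodeName: &node}},
	}}}
	fmt.Println(shouldResyncLB(svc, eps, "node-1")) // true: node-1 hosts an endpoint
	fmt.Println(shouldResyncLB(svc, eps, "node-2")) // false: skip, no endpoint here
}
```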
G
An example of this is, for example, if a service, so the pod acting as an endpoint to the service, gets a burst, an increased number of transactions per second, for example. This can increase the CPU and memory consumption on the node, hence leading the node's resources to get a bit starved.
G
An application that follows this type of deployment model doesn't really want to impose resource limits, because that might impact service SLAs, either in terms of the memory that is being used or the CPU at any given moment in time, right? So I have a couple of proposals for this. I just wanted to present them in this meeting to get people's opinion on them, but obviously I guess the greater discussion will maybe be on that PR.
G
There are a lot of reasons for a node transitioning to not ready. Many of those might not have any impact on whether or not the application can actually service requests, networking-wise. And if that is not the case, I'm kind of looking for people's opinion on maybe adding a knob, so, an annotation on the service object, to be able to disable the node readiness impact on the reconfiguration of the load balancer. This would again be useful for applications which are experiencing a high intensity or a high load.
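To make that proposal concrete, a hypothetical sketch; the annotation key here is invented for illustration, since no such annotation exists today:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ignoreReadinessAnnotation is a made-up key for the proposed knob; the
// real name would be settled in the PR/KEP discussion.
const ignoreReadinessAnnotation = "service.kubernetes.io/ignore-node-readiness"

// ignoresNodeReadiness is how a service controller might consult the knob:
// if set, node Ready/NotReady flips would not trigger LB reconfiguration.
func ignoresNodeReadiness(svc *corev1.Service) bool {
	return svc.Annotations[ignoreReadinessAnnotation] == "true"
}

func main() {
	svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{
		Name:        "example-lb",
		Annotations: map[string]string{ignoreReadinessAnnotation: "true"},
	}}
	fmt.Println(ignoresNodeReadiness(svc)) // true
}
```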
G
And then I was kind of thinking about this even further, and I was imagining things a bit the way healthCheckNodePort works currently for ETP Local, which is to say that a probe is configured on the load balancers, and that actually targets a node port endpoint on the node, which returns a status code to it. So why don't...
G
...we do that as well for what concerns the node readiness state? Which is to say that, in this case, the cloud LB would target the same endpoint, but the endpoint wouldn't just return an HTTP 200 if the endpoint is running on the node: it could also interrogate the kubelet's read-only port, because the kubelet is the one that actually sets the readiness state. And then, going even further beyond that...
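For context, a rough sketch of what the existing ETP Local health check amounts to; kube-proxy's real server is more involved, and localEndpoints here stands in for its per-service count of locally hosted endpoints:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Stand-in for kube-proxy's count of ready, locally hosted endpoints
	// for one externalTrafficPolicy: Local service.
	localEndpoints := 0

	// Roughly what the healthCheckNodePort does today: 200 when this node
	// hosts at least one endpoint for the service, 503 otherwise, so the
	// cloud LB's probe steers traffic away from endpoint-less nodes.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if localEndpoints > 0 {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		fmt.Fprintf(w, `{"localEndpoints": %d}`+"\n", localEndpoints)
	})
	http.ListenAndServe(":30000", nil) // a NodePort-range port, for illustration
}
```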
G
G
So for the discussion, for number three: I think obviously we need an enhancement proposal and a sign-off for this, but given that CNI plugins can implement healthCheckNodePort themselves...
G
...I don't really see if there's any possibility to force CNI plugins to use a specified implementation for how it is supposed to work. I just checked before this meeting: Cilium, for example, is using a specific method, or a specific server so to speak, for the healthCheckNodePort, whereas kube-proxy obviously is using the standard implementation, and OVN-Kubernetes (I'm only seeing it because I used to work on it) uses kube-proxy's implementation for that server. But could we align all of them?
G
I'm not sure. And then obviously number four is even more difficult, because that has impact on the cloud provider implementation for all of this, yeah. So this is kind of the problem that we're facing on this project, and some of the ideas I have surrounding it. Feel free to tell me if you think anything's crazy. I see somebody raising their hand.
B
Hey, thanks for that great exploration. I have in the past spent some time thinking about this too, and I have some thoughts. I know Bowei is here too, so he can actually speak in maybe more detail.
B
Some of the things you identified, especially in the first part, I think are bigger than you're even describing, because some implementations have limits to the number of backends that can be behind a single load balancer, right? Like, they're designed for VMs, not for containers, and like the Google load balancers: we have to pick a subset of the available nodes in order to put them behind it. So we keep...
H
B
Right, there we go. There are also things like programming time, right? Like, it takes this propagation delay, like you suggested, so you don't want to be flapping if you can avoid it.
B
B
B
Well, first, can you clarify: when you say don't sync not-ready-to-ready, do you mean don't publish that on the node, or do you mean just don't use that to change the service endpoints?
B
Right, okay, so yeah, this is where I had spent most of my time thinking of how we can do this, because there's a second part of this. On some of the load balancers, if you remove a node from an endpoint set, any open connections will be killed, whereas if you leave it in the set but fail the health check, there will be no new connections, but the existing connections will be left alone, right? And so today we do the worst thing, which is every time a node comes or goes...
B
B
That's not what we want, right? What we really want is to say there are conditions where the node is in the set but not accepting new traffic, and there are conditions where the node is out of the set, right? Very similar to the discussion we've had over the last couple of meetings about graceful endpoint draining at kube-proxy. We have the same problem here at cloud load balancers, and we haven't done a great job making this lifecycle possible. So some of your ideas here are interesting to me. Some of them seem impossible.
B
Like, I don't see how number four could actually work. Number three we could actually do, the "or" and the "and", sorry, inside kube-proxy: like, we could make the healthCheckNodePort not just return true, but actually dynamically check kubelet at the same time, right? That helps kube-proxy. It doesn't help other implementations, but we could at least say: hey, we think this is valuable.
H
G
By the CCM, by the CCM. So the CCM would only update the LB whenever a node gets added or deleted, so that the LB has the full set of nodes, but it wouldn't care anymore about transitioning readiness state, so to speak, because kube-proxy would do that now by interrogating the kubelet port instead. It would be more dynamic, essentially.
B
I
G
B
Even, yeah. We're, like everybody, in the process of converting to a Google-specific cloud controller manager, so that we only link the Google code into it and not all the other cloud providers, and so that we have clear ownership and the ability to, you know, do stupid things when you need to.
C
G
Yeah, for what concerns the service controller part, so this node sync loop, so to speak: that would continue to exist even in the Google implementation? Or maybe not, maybe that's a...
G
Okay, so kind of a corollary question to this entire discussion, at least for point one: given that this is an issue that this project is currently experiencing at scale, could it be considered a bug? If it's considered a bug...
G
D
The other thing that you need to check, Alex, is which cloud providers this touches, because I know there's another one using the out-of-tree cloud controller manager. I think Andrew is not here? No, and [inaudible] is.
G
No, in this case not. I didn't see the necessity for that, because this is only handling ETP Local, so to speak, and so you're just provided the entire set of nodes that exist in your cluster, and as long as that is synced...
G
...the healthCheckNodePort will actually tell the LB where the endpoint is, so there's no reason to watch endpoints or pods and check if they're being scheduled on other nodes and whatnot.
B
G
B
At number one, you're saying: for services that are ETP Local, do not sync if it does not have service endpoints. How do I know, in the service controller, whether you have service endpoints or not? Oh right, yeah, yeah: I use the endpoints lister, so, to get the name it depends on the metadata in the node, whatever the node reference field is.
G
G
B
Way to go for it. So if we were to draw out the sort of state machine of nodes with regards to load balancer traffic, right, we have at least: node doesn't exist; node exists but is unschedulable; node exists but is not ready; and actually the criss-cross of those two, right? So there are four states there, and then, well, like I said, it covers node exists and is schedulable and is ready. And I guess the question is: do we want to change the behavior on certain edges, right? Is that a fair assessment?
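A tiny sketch of that state space, with illustrative names, just to make the four existing-node states and their edges concrete:

```go
package main

import "fmt"

// nodeState is an illustrative criss-cross of the two axes mentioned,
// schedulable x ready, for nodes that exist.
type nodeState struct {
	Schedulable bool
	Ready       bool
}

func main() {
	// The four existing-node states; each transition between any two of
	// them is an "edge" whose LB-sync behavior could, in principle, differ.
	for _, s := range []nodeState{
		{true, true}, {true, false}, {false, true}, {false, false},
	} {
		fmt.Printf("schedulable=%v ready=%v\n", s.Schedulable, s.Ready)
	}
}
```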
G
So in this case, what is kind of shooting us in the foot is nodes transitioning from ready to not ready and back, and back and back again, so to speak. So the schedulable part is not really the big pain point with clusters at scale, especially with these CPU-intensive loads that we're running and all of that.
B
Right, and the problem here is that ready and unready can either mean, hey, the kubelet went out for a long lunch and didn't come back, or it could mean this node has powered off and we don't know yet, but this is our first indicator, right? I guess in that case, if we're relying on health checks, then the health checks will cover it, right? If that's the assertion: yes, definitely.
B
I would need to go back and do some spelunking. We need to make sure that in the non-ETP-Local case, so when you have a regular old service, we don't continue...
G
...to send... That case I didn't touch at all, so the implementation only focuses on ETP Local, yeah. So that's what the change concerns itself with; for everything else it remains the same as it currently stands.
J
Just a naive question: is this implying that there aren't enough existing knobs to protect kubelet from abusive workloads? It's more a general problem.
B
Well, you mean protecting me by having you set limits is fragile, right? We need to set the appropriate request for kubelet, make sure that kubelet is getting enough shares that it can always service what it needs, right? This is the whole, like, allocatable problem, right? If we carve off too much, then we eat a lot of your node that you might not be using, and if we carve off too little, then you have this. So I might need...
B
...thing, because allocatable comes off the top, right? Like, if the node is eight cores and you say allocatable is seven, then kubelet's got one to mess with. It's supposed to, anyway.
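The arithmetic being described, as a trivial sketch (real kubelet configuration expresses this with resource quantities via systemReserved/kubeReserved; the numbers here are just the example from the discussion):

```go
package main

import "fmt"

func main() {
	// Allocatable comes "off the top": what pods may use is capacity minus
	// the slice reserved for system daemons like the kubelet.
	capacityCores := 8.0
	systemReservedCores := 1.0 // e.g. systemReserved/kubeReserved in kubelet config

	allocatableCores := capacityCores - systemReservedCores
	fmt.Printf("allocatable: %.0f cores (kubelet keeps %.0f to mess with)\n",
		allocatableCores, systemReservedCores)
}
```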
B
That's a good point. I'm sure... I'm fairly confident, not sure, fairly confident, that SIG Node has got some of those things documented. I wonder if we can do a better job, or maybe this is a good catalyst to go and, like, re-look at those things and make sure that we have a new thing for them to consider: network abuse.
G
Just the last thing I wanted to discuss was the possibility of doing 2a, which is to say that you have the possibility, as a user, to annotate the service object, which tells the CCM to completely ignore node readiness checks at all, so that the CCM will only think about the addition of a node or the deletion of a node for what concerns the LB. And for the LB associated with this service, if a node transitions to a not-ready state, the CCM doesn't really care about that; it will just keep it configured as it was.
B
How about: that's really my least favorite option. We have so many of these little fiddly knobs that force users to understand our implementation details.
B
I would put that at the bottom of my list of things to prefer. Personally, I'm trying to figure out if we can do something like one, but not even consider whether you have endpoints or not, just change that edge. Like, maybe that edge doesn't make sense. Maybe we should lean more on health checking and just ignore the ready/unready and schedulable/unschedulable differences.
B
B
So what is that, exactly? I didn't fully follow. What if we just did something like what you described in one, but we ignore... we just always ignore ready/not-ready, and we rely on health checking, and we say, like, there needs to be a health check for regular old services and there's a health check for ETP Local services, and those might not be the same health check.
B
G
Though, at least, I maybe did a piss-poor job of writing it down, but that's my idea behind point four, which is to say that we have a healthCheckNodePort probe by the LB, and the cloud provider, or the cloud implementation so to speak, already defines a probe as well on the LB for regular services of type LoadBalancer.
B
G
Right, that's what I was getting at, that's what I wanted to do. So my plus sign, at least on point three here, my plus sign is (and sorry, it wasn't really clear): HTTP 200 if the endpoint is running on the node and kubelet is reporting okay. And the thing doing this, so checking that, is kube-proxy. So kube-proxy would check the endpoint as it currently does, and it would also do, I don't know, a curl of 127.0.0.1 on the kubelet's read-only port.
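A rough sketch of the combined check being proposed, assuming a kubelet health endpoint on localhost; the exact port and path (the read-only port 10255 versus the localhost healthz port 10248) is a placeholder here, not a settled design:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// kubeletHealthURL is a placeholder; which kubelet port/path to probe
// would be part of the actual design.
const kubeletHealthURL = "http://127.0.0.1:10248/healthz"

func kubeletHealthy() bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(kubeletHealthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	localEndpoints := 1 // stand-in for kube-proxy's local endpoint count

	// The proposed healthCheckNodePort answer: 200 only if this node hosts
	// an endpoint AND the kubelet itself reports healthy.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if localEndpoints > 0 && kubeletHealthy() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	fmt.Println("serving on :30000")
	http.ListenAndServe(":30000", nil)
}
```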
B
Okay, I'm gonna have to read your PR and think about that. Anybody else? I've been...
G
G
But that doesn't mean this isn't actually a problem. Yes, that is kind of, yeah, that's more for point two here, which is to say resources are starved, right. Right, but yeah, the time to sync LBs is definitely still a problem, because when you have so many LBs on a cluster, the cloud provider can either limit you, or it just takes a ton of time to sync, and...
B
B
A
Great. Should we move this discussion to the mailing list, or the PR, or both? Where do you guys think is best? The PR, and if we overflow, to the mailing list.
G
B
A
Thanks, Alexander. Antonio, you were next on the agenda.
D
D
I think that Casey, as I understand it, fixed the problem, but I'm not really sure. Well, I wanted to say that for the CNI plugins this was going to be a nightmare, because they wanted to do a blog post, a big announcement saying that everything was going to break, and, okay, I'm not sure (and I was expecting Dan Williams to be here) if this was really a bug for containerd and CRI-O, or a bug in the CNI plugins, because the ones that parse the CNI configurations are the CNI plugins, right?
D
Is that... was that the summary? It's fixed in the CNI library, and the containerd folks thought that it was a problem for them, so everybody in Kubernetes had to agree. But what I'm not sure of is: is this a problem for containerd, because they are doing some kind of kubenet thing for testing, or is it a problem for the CNI plugins?
D
So that's why it's a heads-up for the CNI vendors to check this, because if they are using this 1.1.0 as a library, they may have customers or clients or people failing after the upgrade, or, I don't know, it's a complicated problem. I just got into that yesterday by chance, and thankfully we were able to solve it, because this was going to create a lot of noise, and that wasn't really nice. But, for example, Calico and all those people should check this.
A
Yeah, I had seen this thread but hadn't yet had a chance to read through all of it, so it's definitely on my list.
D
When you use the... after 0.8.0 or something like that, if you don't have a version in the CNI config, it returns a minus one, and before, it assumed that was zero point... no, 0.1.0, and that was the breaking change. Casey fixed that yesterday, and now in 1.1.0 it keeps assuming that it is 0.1.0. And what the containerd people found out is that that was breaking the CI, but I think that is because they are not using a CNI plugin.
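To illustrate the failure mode, a sketch of decoding a CNI network config with and without an explicit cniVersion; this uses plain JSON decoding rather than the actual libcni API, since the regression was about what the library assumes when the field is absent:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

func main() {
	// A config with no cniVersion field: what the library assumes for it
	// (historically 0.1.0) is exactly what the regression changed.
	implicit := []byte(`{"name": "mynet", "type": "bridge"}`)
	// Declaring the version explicitly sidesteps the default entirely.
	explicit := []byte(`{"cniVersion": "0.4.0", "name": "mynet", "type": "bridge"}`)

	for _, raw := range [][]byte{implicit, explicit} {
		var c netConf
		if err := json.Unmarshal(raw, &c); err != nil {
			panic(err)
		}
		fmt.Printf("name=%s type=%s cniVersion=%q\n", c.Name, c.Type, c.CNIVersion)
	}
}
```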
B
D
They... I stopped them, because they were creating a blog, they were creating release notes; I mean, there was a full announcement about how everything was going to fall apart in 1.24 with the CNIs, and I couldn't believe it. That's why I pulled Casey in, in the morning, and he found this regression, and then I went to bed; I didn't think about it more yesterday. So when I woke up, I said: but this is impossible, containerd is not parsing the CNI configuration.
B
B
No, and the Tigera folks are here? Yes.
B
D
A
Oh, thanks, Antonio. I'm just looking at those PRs now; I'm pretty sure we've already updated, but I'll double-check that. Who's next? Sanjiv, if you're next.
J
Okay, how much time do I have? Because it's probably...
A
J
J
K
Go ahead. Sure, mine's, like you said, very simple. I wanted to follow up on that mailing list thread that Tim started back in the day. Specifically, the idea was: hey, can we move this earlier, so it's more accessible to people who want to attend from Europe? I summed up the responses on that thread. The most popular response was either of the suggested times; the suggested times were 9 and 11 a.m. Pacific time, and it seems like 9 a.m. very slightly won out over 11 a.m.
K
I tried to add a table in the agenda just with those numbers, so based on that I would suggest just moving it to 9 a.m., if there are no objections. Does that make sense?
E
I mean, some people, including me, may have an overlap with Helm, which is at 9:30 a.m. on Thursdays, so I'm always gonna be juggling that. But if that's just me, then...
B
E
I heard that we're all going to the smallest possible WebAssemblies, right? That's right: everything on the edge, no more computers, only Raspberry Pi ASICs.
K
Perfect. Well, I'll send out a mailing list update on that thread, with the assumption of moving to 9 a.m. on Thursdays, starting after KubeCon; I'll move the meeting. Okay, and I'll give the time over to Sanjiv. Thanks.
J
Okay, so yeah, I've put the link to this doc in the mailing list, so obviously we won't be able to go through most of it right now; I just wanted to introduce it and have you all look at it later and provide comments. This is something that's relevant both to SIG Network, and of course network policy, and SIG Multicluster in particular; it kind of straddles these areas. So I'm hoping to come to a resolution there and decide how to take forward multi-cluster network policy.
J
Let me do this. Yeah, okay, so I'm not going to go through this, but, you know, there are some requirements that have sort of made sense to me, based on discussions with others, but by no means are we done with being sure of the exact set of requirements.
J
So here's a working list of requirements that, you know, I've put together based on some discussions with some of you offline as well, but we need to keep working through that. But in addition, there are at least a few use cases that are more obvious than others, so we can get going with those use cases, and we can decide whether we want to add, delete, modify. There's a lot of stuff here related to, you know, single-network models, multi-network models and things like that. I'll...
J
J
Should we design this to cover various kinds of multi-cluster deployment models, like what is frequently called single-network and multi-network modes of operation? Should all the features work in all modes, or is it okay to identify that some modes are more important than others and, for example, design some features that only work in single-network mode? Should this feature be limited to the reference architectures defined by the Kubernetes Multi-Cluster Services API?
J
J
I'm presuming many of you are. And then, you know, a lot of these are very analogous to discussions that happen both in SIG Network and SIG Multicluster, as well as in other kinds of multi-cluster networking projects, including, for example, Istio multi-cluster. For example, you can have Istio meshes, which can themselves have a single-network or multi-network deployment model, and then you can have federation across meshes. So the analogous sorts of topologies come into play here as well. And then, a little bit more clarity on requirements on the trust model across clusters.
J
You know, is it, you know, namespaces which are global? And there's a little bit more that needs to be worked out on the requirements. But having said that, you know, one can reasonably put together some reasonable use cases, so let's just go through them real quick, and you can feel free to provide comments later. One is: okay, I want to be able to control... I'm going to be using MCS terminology here, the Multi-Cluster Services API.
J
Forgive me if you're not entirely familiar, but it should be fairly clear from the pictures. So, (a): I want to control which pods in a cluster are allowed access to a service that's been imported from another cluster. Okay, so here I've got cluster A, I've got two pods; I want to be able to flexibly control that pods which fit into category P1 are not allowed to access this remote service, right?
J
So, okay, context here is that there's this remote service running on cluster B; it's a multi-cluster service, and it has been imported across the network, which is an arbitrary network, maybe single-network, maybe multi-network, and since it may be multi-network, it probably requires a gateway, which is what I've shown here. And allowing for all these different models, I still want to decide how some pods are allowed to access imported services and some pods are not allowed to access imported services. So that's use case number one you see here.
J
J
Let's just worry about the selectors. So in this case, something like, okay, just having an additional kind of egress destination, which is basically a service import, or a reference to a service import, can allow you to configure that the pods which are selected by this policy are either allowed or not allowed to egress to this imported service, right? So that would be an example of a sample manifest that satisfies this kind of use case.
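Since the sample manifest itself isn't captured in the transcript, here is a hypothetical sketch of such an egress destination; the serviceImportRef shape and field names are invented for illustration and are not an existing API:

```go
package main

import "fmt"

// serviceImportRef is an invented shape for the idea described: an egress
// destination that points at an MCS ServiceImport instead of a pod/IP peer.
type serviceImportRef struct {
	Namespace string
	Name      string
}

// egressRule sketches a policy rule whose destination is an imported
// service; a real API would live alongside NetworkPolicy's existing peers.
type egressRule struct {
	Allow           bool
	ToServiceImport serviceImportRef
}

func main() {
	rule := egressRule{
		Allow:           false, // e.g. pods in category P1 may NOT reach it
		ToServiceImport: serviceImportRef{Namespace: "prod", Name: "remote-svc"},
	}
	fmt.Printf("%+v\n", rule)
}
```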
J
J
Okay, and again, I would ideally want this to work in all kinds of topologies, including single-network, multi-network, flat network and so on. It would appear that we can do this without changing the current MCS API.
J
With a few assumptions. The assumptions could include the fact that, if you're running this in multi-network mode, the combination of gateway IP and cluster set IP must be unique per export, right? And what this would do is that every cluster that is exporting this service would attach the appropriate labels to allow the importing cluster to decide what to put in its data plane as the destination.
J
We can spend more time, if needed, thinking about it. So there are a few more use cases; I could either pause here for any comments or keep going.
A
We've got two minutes left, so probably a good time to pause and get any comments.
F
F
The pre-previous one... okay, this one right here. No, I think use case (b), I guess, controlling this, yeah, this one.
J
Okay, so here, this was actually suggested to me by Nathan Midler from Google. What he said was that, you know, very often a service would be provided by lots of different clusters.
J
It will be given the same name, but you may have policy regulations which tell you that, okay, this cluster should only access this global service from clusters located in the U.S., for example, and not from clusters located in Asia-Pac, or things like that.
J
J
F
F
...how to do such things, which might not be ideal, but, you know, this is just what comes to my mind.
J
They may have good reasons. You know, we don't want to force them to always create more clusters in order to solve a problem. In some cases, if that's the only option, then yes, but very often it would be that that restriction applies to pods of category P2, while pods of category P1 may want to access a global service across all providers of that service, wherever they are, and they're both located on the same cluster.
J
So yeah, you can solve network segmentation by just creating separate disjoint cluster sets, but sometimes you want to have shared cluster sets, partitioned within the cluster set. But yeah, your input is welcome, and, you know, we should discuss how relevant these use cases are. But there was something...