From YouTube: Kubernetes SIG Network Bi-Weekly Meeting for 20220203
A: There are a whopping four for us to look at today. There were 13 earlier; there's been some triage there.

B: Yes, I noticed many people doing triage, so I just filtered those out. Anybody who went through, pinged theirs, and asked for original-poster feedback, I filtered those out of this list, and I think I closed one that was just long idle.
B: So here are the four that I think are left worth talking about. The first one is pretty clearly a bug in the describer for Ingress. I actually added the help-wanted label and I'm going to take triage off. I left it open here just to throw out that this seems like it should be a relatively easy bug, if people are looking for things they can contribute to.
B: This is maybe a good starter bug: just go look at the describer and figure out if we can maybe delete those lines and update the test, if it was even tested. Hopefully a small one, an easy way for somebody to score a point.
B: I looked around and I can't find anywhere that we actually enforce this. It's in some of the implementations, and that's fine. If the describer is true to the resource that it's reading from the API, then the describer will show the correct thing if the implementation made that choice; but we don't need to pretend the implementation made the choice when it didn't.

A: Okay, fair enough.
B: And this one — seeing Prashant's name on this brings back some memories. That was a long time back; 2015 feels like forever ago. So I will leave this one open; I accepted the triage on it. It should be a good one for someone who wants to jump in, if you've got someone new in your org or in your orbit that wants a good bug to start on. Here's a fun one.
B: Second was a flaky test. This one is clearly a GCP thing; it looks like we leaked a load balancer. Bowei is assigned — I'm not even sure why it didn't close, why it didn't just close the window, but Bowei is assigned, I think it was. I left it open just to ping Bowei, to remind him that he's assigned to this one.
B: Okay, and the last one, then, was "better availability of deployments with replicas equals one." I suggested that it's actually a dupe of the external and internal traffic policy issues, so I will take the action to mark it as a dupe and close it out. That's it.
A: All right, on to the agenda then. Let's see — we first have updates from the kpng APAC working group. Who wants to take that one?
D: Hi everyone, I'm Rajesh Kapoor. This is actually my first time joining the SIG Network meeting. (Welcome, welcome!) Thank you. I work at VMware, and I've contributed to SIG Testing before; of late I've been helping with the kpng working group, so I have a couple of updates on where we are with kpng. I'll share my screen and walk you all through it.
D: Are you all able to see some yellow slides? (Yep, looks good.) Cool, okay. So what we did with kpng was: a bunch of us who happened to be in the APAC time zone got together and formed this working crew, apart from the kpng main group, to work during APAC hours. This happened somewhere around December last year, and the goal was to get CI working for kpng.
D: You know, focus on a couple of backends — iptables, IPVS, userspace — and there are a bunch of us in this group who are new to the stuff. So what we do is meet every Wednesday at around 7 p.m. IST, which happens to be about 5:30 a.m. PST, and we hack on stuff. Most recently we've been trying to fix test cases and test failures on the iptables and IPVS backends caught in the CI, and things like that, over a period of two months or so.
D: This is where we are currently: we have GitHub Actions working for kpng. Thanks to folks like Neha, Friedrich, Anusha, and Douglas — they've been super helpful in getting these hooks around CI to test backends like iptables. This basically runs e2e conformance and SIG Network tests as of now. Apart from that, if someone wants to try out kpng without running e2e tests and things like that, they should probably use this script.
D: Okay. Apart from focusing on the backend side of things, we got the Windows userspace backend in. It's not yet integrated with the CI, but I think we're getting there, from the IPVS and iptables point of view.
D: We caught a bunch of e2e failures, and Neha and Anusha have been working with Vivek and Hanuman on fixing these issues. Of late we have some things around session affinity, and these backends still don't have unit tests. So if anyone wants to start contributing to kpng and see how these backends are written in kpng, I think this is a great opportunity to get started.
D: Apart from that — sorry, these are the old slides, yeah. Apart from that, on the userspace backend front, there's a PR in flight. We have the changes in, but there are a couple of e2e tests failing, and Jay is still trying to fix that; I'm basically helping Jay get userspace into shape. Most recently Pallavi joined us to help out, and Mikael also introduced this DiffStore to us in the nft backend.
D: So if we look at the backends, nft is using sort of a full-state model here, as opposed to how iptables and the others, in the current implementation of kube-proxy, use the ServiceChangeTracker data structures and things like that. nft is doing something different, wherein it gets the entire state of the cluster, with all the services and endpoints, and uses this DiffStore, which basically happens to be a B-tree under the hood.
D: It's trying to get a diff of what has changed and then write the nftables rules. Jay, feel free to pitch in, because I may not get the details right here. So this is something new that has been introduced, and it's super fun code to go through. If anyone out there is looking to contribute to documentation and things around DiffStore, that's also a great opportunity.
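The full-state model described above can be sketched roughly as follows. This is only a toy illustration of the idea, not kpng's actual DiffStore (which is a B-tree-backed store in the kpng repo); all the type and function names here are invented for the sketch. Each sync receives the complete desired state and diffs it against the previous snapshot to produce create/update/delete events, instead of tracking incremental changes the way ServiceChangeTracker does.

```go
package main

import "fmt"

// State maps a key (e.g. a service name) to a serialized value.
type State map[string]string

// Event describes one change needed to move from the previous
// snapshot to the new full state.
type Event struct {
	Op  string // "create", "update", or "delete"
	Key string
}

// Diff compares two full snapshots and emits the minimal set of
// change events; rule writers (e.g. an nftables backend) then only
// touch what actually changed.
func Diff(prev, next State) []Event {
	var events []Event
	for k, v := range next {
		old, ok := prev[k]
		switch {
		case !ok:
			events = append(events, Event{"create", k})
		case old != v:
			events = append(events, Event{"update", k})
		}
	}
	for k := range prev {
		if _, ok := next[k]; !ok {
			events = append(events, Event{"delete", k})
		}
	}
	return events
}

func main() {
	prev := State{"svc/a": "v1", "svc/b": "v1"}
	next := State{"svc/a": "v2", "svc/c": "v1"}
	for _, e := range Diff(prev, next) {
		fmt.Println(e.Op, e.Key)
	}
}
```

The appeal of this design is that the backend never has to reason about event ordering from the API server; it only ever reconciles two complete snapshots.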
D: So yeah, that's all I had. If you want to add anything to this, Jay, feel free to go ahead.
A: All right, let's see. Next up: FQDN support for network policy.
G: What I wanted to get into right now was basically to get a thumbs-up or thumbs-down on whether we want to add FQDN support in-tree, as part of one of our built-in APIs, and potentially get some clarity on which API that might be. We'll have some proposals and we can talk through those.
G: Some discussion of the details of the API spec is obviously inevitable, but I'm hoping that we won't get too ratholed on that; we can maybe iterate on that in the KEP and just get some high-level yes/no concerns that we can dig into deeper. So, a couple of user stories right off the bat. These are things that I think I've seen, and I shared them with the NetPol working group, and people have seen the same things.
G: The first story — we're targeting application developers in both of these examples. The first one is a pretty standard complaint from people who want to write network policies that target external services. They want to say "allow egress to some service," whatever it may be; but managing IP blocks can be tedious, and it's not always obvious which IP block you've configured just by looking at some random IPv4 address. And so the request here is: I want to be able to write down an FQDN instead and have my CNI do the resolution.
G: So I can just point you to wikipedia.org and you'll handle the rest for me; I don't need to worry about resolving that. Another story that we've run into is a bit more of an advanced use case. This one is for application developers who want to be able to use a wildcard syntax when specifying FQDNs.
G: So either they have a cloud provider and they want to say, hey, star-dot-my-cloud-provider, just let me talk to all of my cloud provider's APIs; or they have some sort of on-prem hosted services and they want to batch a lot of those. Just a convenience API that lets you allowlist a bunch of things instead of having to enumerate them one by one. Those are the overall stories that we're trying to target here, and this brings us to the proposal for bringing FQDN in-tree.
G: Right now, the ideal proposal would be to extend NetworkPolicy as it currently stands. There are a couple of benefits to doing this. The first is just from a readability and understandability perspective: users already know what NetworkPolicy is, and they're already configuring it.
G: So it's a convenient place to come and inspect what traffic is allowed to leave your clusters, or allowed to leave your pods, without having to dig too deep. And the second is that the allowlist-with-implicit-deny model actually works really well for FQDNs, because you don't have guarantees on the completeness of DNS records and all that stuff. It's good to be able to just say: here's what we know is a good IP; you're allowed to egress there.
G: We can get away without making guarantees about the completeness of resolving DNS. And this is a sample of what extending the API would look like: adding a top-level field — egressFqdns, as an example — that lets users specify, alongside the ingress and egress blocks today, additional FQDN matches by either name or pattern; and then you have the port specification as well, and we do a cross-product type thing. It's a little bit dense.
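As a rough illustration of the proposal being described, the extension might look something like the following. This is a sketch based only on the discussion above: the `egressFqdns` field name and its schema come from the draft proposal and are not part of the upstream NetworkPolicy API.

```yaml
# Hypothetical: egressFqdns is the proposed (not upstream) field.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-external-egress
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Egress            # defaulted by the API server when egressFqdns is set,
                        # so older CNIs fail closed (deny-all egress)
  egressFqdns:
    - matches:
        - name: wikipedia.org        # exact FQDN
        - pattern: "*.example.com"   # wildcard / prefix-match expansion
      ports:
        - protocol: TCP
          port: 443
```

The matches-by-ports cross product mirrors how the existing `egress` rules combine `to` peers with `ports`.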
G: So it's very similar to the standard NetworkPolicy; you're just swapping out what would have been an ipBlock for this FQDN. And in another version of this, you can still select pods much as you would in NetworkPolicy today, and you can specify either a name itself or a pattern, and have sort of a prefix-match expansion.
G: An important point to talk through here is how we fail closed. This is something that we talked about for a bit, and what I'm proposing right now is to reuse the policyTypes field that we originally added to help extend with egress, I believe — or ingress, rather. The idea is that we would use the defaulting in the API server to add the Egress policy type to any policies that use the egressFqdns field. And so, with any old CNI — if you haven't updated your CNI but you try to use this —
G: — this egressFqdns field, you'll basically get a deny-all egress policy, and so we'll be able to use the existing mechanisms within the API to just fail closed and lock down your workload. It's not great from a user-smoothness perspective: you'll realize that your traffic isn't going out and you'll have to dig into why. But at least it's not a security bug; it's not a hole where we're now egressing a bunch of traffic that we didn't intend to.
G: So that's the idea around how we'd like to extend the API. There are more details in a doc that we have written; I think I've linked it in the meeting notes. As for the alternatives — I guess we can have just a quick look at where this should be added into the API spec.
G: I did some digging on how the API is defined, and there are a few other places — like NetworkPolicyPeer or NetworkPolicyEgressRule — that are contenders for adding FQDNs, either alongside ipBlock and podSelector as a peer, or even alongside the list of "to" rules. But in both of those cases you either end up with undefined behavior or with failing open as part of the API spec, neither of which sounds particularly exciting or fun. So I'm proposing a top-level field addition for integrating this into NetworkPolicy.
G: Yeah, great question. A couple of thoughts that we had. One idea we were kicking around was adding a CoreDNS plugin, which would act as maybe a side channel: once it resolves IPs matching your wildcard, it can send the resolved IPs out to your network policy provider, so that you're also aware of the resolutions as they're happening.
G: Another option, which is more heavyweight, is to just put in an Envoy proxy or something like that that's intercepting your DNS requests, and then update your allowlists based on that.

I: Well, how do you know that it's addressed by DNS? Because the pods you put the policies on can still use IPs directly, right? Yeah.
G: Well, so, since it's allow-only, the thing that we can get away with is that we just won't allow other traffic to egress. I think the problem that you'll run into is that if you don't do a DNS resolution from your workload, you won't update the allowlisted IPs; so if you just tried to hit the IP without doing any DNS lookup, you would not have populated the network policy enforcer.
G: I'm open to other opinions. My thought is that if you want to use only IPs, you have ipBlocks and you can do that. I figured that if you're using FQDNs for ease of use, that extends equally to when you're writing your application — it might be two different persons, right?
I: On the first option — where you have something that looks at the policies and produces the set of addresses based on what the patterns in the policies are — I like that part; I could see that being doable, making that list available.

M: That won't work. We do that in OpenShift; it's terrible. You have race conditions, always, and then time-to-live and all of this.

I: Exactly, oh yeah.
M: But I think that's a feature, not a bug, because — like I said, we have this in OpenShift, where we implement it by looking up the name, checking the TTL, doing a new DNS lookup any time the TTL expires, and trying to deal with all of that — and it never works. Every customer who's ever used it has filed bugs against it. You don't want to do that; you have to intercept DNS, there's no other way.
N: But the thing — my question is: you shut down all other DNS in the cluster and you force the resolution in the pods, because you can force resolv.conf in the pods. You force all the pods to resolve through your CoreDNS, or whatever DNS, and this CoreDNS has a policy, and then you put the rules there.
N: How do they do that? For example, right now, with all these DNS filtering services that you can use in your home — Umbrella and the like — what they do is they have a list of DNS names and they send you to a block IP, because otherwise you run into what that question raised: it's impossible to keep up with the IPs being resolved. I mean, yes.
M: The way that I had thought about this before is that the network policy implementation tells CoreDNS: I need to know if anybody asks for the IP address of github.com. And then, whenever anybody does, the CoreDNS plugin will tell the network policy implementation: okay, somebody asked, and this is the result that I'm about to give them. That can update the network policies, and then the pod gets the DNS results and its traffic goes through.
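The side-channel flow described here can be sketched in a few lines. This is purely illustrative, not a real CoreDNS plugin: all names are invented, and the key property it shows is the ordering — the enforcer is handed the resolved IPs (and can program its allowlist) before the DNS answer is released to the pod.

```go
package main

import (
	"fmt"
	"strings"
)

// Enforcer stands in for the network policy implementation. It registers
// the names/patterns its policies reference, and records allowlisted IPs.
type Enforcer struct {
	watched []string            // e.g. "github.com", "*.example.com"
	allowed map[string][]string // name -> IPs programmed into the dataplane
}

// matches checks a resolved name against exact names and "*." wildcards.
func (e *Enforcer) matches(name string) bool {
	for _, w := range e.watched {
		if w == name {
			return true
		}
		if strings.HasPrefix(w, "*.") && strings.HasSuffix(name, w[1:]) {
			return true
		}
	}
	return false
}

// OnResolve is the side channel: the resolver calls it with each answer
// *before* returning that answer to the client pod.
func (e *Enforcer) OnResolve(name string, ips []string) {
	if e.matches(name) {
		e.allowed[name] = ips // in a real system: program dataplane rules here
	}
}

func main() {
	e := &Enforcer{watched: []string{"github.com"}, allowed: map[string][]string{}}
	e.OnResolve("github.com", []string{"140.82.112.3"})
	e.OnResolve("unrelated.test", []string{"203.0.113.9"})
	fmt.Println(e.allowed)
}
```

Because the allowlist update happens before the answer is released, the pod cannot race ahead of the enforcer, which is exactly the property the TTL-polling approach lacks.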
A: I mean, the other problem is that it doesn't prevent the application inside the container from trying to use its own DNS server somewhere out there. You would also have to make sure that you block any other port-53 traffic, or —
O: Yeah, I think everybody talks with the background of how they think network policy should be implemented; hence a lot of the discussion assumes that the implementation is iptables. So, at the time a packet arrives to connect to github.com, some iptables rule should be there to do the allow.
O: But the thing is, not every network policy implementation uses iptables. Some people have firewalls; there are many, many ways to do it. A lot of the problems — and they are valid problems — and the race conditions described, arise because everybody assumes this is an iptables-style implementation.
O: The way this works on traditional firewalls — think of F5-type firewalls — is that they're packet interceptors. So at packet arrival you know the packet; even if it's TLS, you know where the packets are going, and you can do reverse DNS.
P: I mean, all of this, though, is kind of in reference to implementations. I know people have doubts that it can be implemented, but — I don't know, I'm speaking for the whole group now — isn't this more just: is this an API that the community thinks we could utilize? And then we can tackle the "can this be implemented" question.
Q: To add my two cents: in Antrea we actually implemented a policy similar to this, and the way we do it is, if a policy selects some pod, we intercept any traffic going to that pod from source port 53 and look at what DNS name it's trying to resolve. We basically wait until the rule is realized on the dataplane before we actually pass the packet back to the original requesting pod, so that we know no pod will even get a DNS response if that's the FQDN we're actually filtering.
N: I work with telcos, right, and the telcos have been living off the fact that they can intercept DNS requests and basically sell services based on that — they know where the users are going — and that is going away. So the whole thing with deep packet inspection, when we look at 5G, where almost everything is encrypted, becomes useless. I mean, then you have encrypted bit streams, and you can only look at how they behave to try to identify what it is.
H: And this seems good, what you're describing.
I: Seems good; I have no problem with it. I mean, we refused to do something for China, where they wanted to push basically a root record everywhere so we could intercept all the DNS at the firewalls. But the problem, when you start doing that, is that you're going to become dependent on snooping, and I don't think that is a good idea. For the layer-7 stuff we have URLs, right? There this makes more sense; but for layer 3, layer 4 —
O: But I've seen a lot of clusters — actually the majority of the clusters I've seen — do split DNS, where anything under .cluster.local goes to CoreDNS, and anything that's not under .cluster.local goes to whatever environment DNS exists outside the cluster.
O: I think the API is good. I honestly think the API needs to progress, and we're going to have to see the implementation as a separate thing. Just to set expectations: without intercepting the packet, this can be worked around entirely. I can run a pod first that has dnsmasq, download the records I want, and then run my pod and resolve using that dnsmasq.
O: I'm kind of inclined to separate the API, because it's easier, it's better, it's optional, and a lot of people can say, "not gonna do that."
G: Discoverability and readability, in that the semantics are very much the same as standard NetworkPolicy — it's a list of allowed egresses, in this case — and users know where to go for that today. You write your network policies that target your workloads, and you say what can leave. If you add a new resource, it just adds a layer of friction: if I need to look at why my traffic is leaving, or — yeah.
E: I think that's also debuggability. It's really hard to figure out why network policies aren't working — with an L7 thing it's easy, because you just can't access your thing — so it's already confusing. And Antrea and Cilium and everybody else, and Calico, are already implementing an FQDN policy, I think, right? I know Antrea does, and I think Cilium does — Casey can tell me if I'm wrong — but —
O
B
But
describe
code
assumes
that
I
can
read
it
right,
like
I
doubt
very
much
that
a
change
to
cube
cuddle
describe
to
go
off
and
read
all
the
other
resources
at
the
same
time
would
be
accepted,
and
that
still
assumes
that
I
could
even
read
the
admin
network
policies
which
I
probably
can't
so
like.
I
think
I
forget
it
was
on
this
call
or
on
one
of
the
other
discussions.
But
we
talked
about
something
like
a
like
a
trace
route
resource
where
you
you
create
it
and
have
the
implementation,
tell
you
yes
or
no.
B: I don't think that's the only concern, at least not on my behalf. There's the question of whether we would predicate this on the ability to self-describe capabilities: if we did that, then you might not be looking at integrating this until 28, because it would take a while to get that work done, right? Maybe we could pipeline them together, but still, it'll be some time. The second part of it, though, is just that API growth is a problem all around; NetworkPolicy is not the best-factored API.
B: Is this something that we want to put into that API? Or is it better to say, you know, it's a different concept — put it in a different space, even though it has a lot of overlap?
B: Oh, Bridget raises a great point: we've been talking about this for 25 minutes or something. Do you have a proposal doc that you want to bring to the mailing list? Did you already do that?
G: I have a proposal doc I can send; I can send that out to the mailing list. I haven't sent it out yet.
A: Thanks. All right: Rob, "EndpointSlice controller fails to sync."

R: Hey, yeah. I have a somewhat interesting bug that involves a series of potential race conditions — I'm not quite sure yet — but I wanted to raise it, because I think there's potential to change behavior here, and I want to make sure that if we changed any of the behavior, it wouldn't break other things.
R: So right now, what the EndpointSlice controller does, when it's running through pods and trying to convert them to endpoints, is that it looks up the pod's node, to try to determine what zone that pod is in. Great. If it doesn't find that node, it bails out entirely and triggers another sync: it returns an error, we get into the exponential backoff, it keeps on syncing, and eventually it tails off if we run out of retries.
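The behavior being described can be boiled down to something like the sketch below. This is not the real EndpointSlice controller code — the names and shapes here are invented for illustration — but it captures the failure mode under discussion: a single missing node aborts the conversion of the whole service, so the existing slices go stale until the lookup succeeds or retries are exhausted.

```go
package main

import (
	"errors"
	"fmt"
)

// Node carries the topology info the controller copies into endpoints.
type Node struct{ Zone string }

// Endpoint is a simplified stand-in for an EndpointSlice endpoint.
type Endpoint struct {
	PodName string
	Zone    string
}

// endpointsForPods converts pods to endpoints, looking up each pod's
// node to fill in the zone. Current behavior: any cache miss aborts the
// whole sync with an error, which requeues the service with backoff and
// leaves previously written slices untouched (i.e. stale).
func endpointsForPods(pods []string, nodeOf map[string]string, nodes map[string]Node) ([]Endpoint, error) {
	var eps []Endpoint
	for _, pod := range pods {
		nodeName := nodeOf[pod]
		node, ok := nodes[nodeName]
		if !ok {
			return nil, errors.New("node " + nodeName + " not found for pod " + pod)
		}
		eps = append(eps, Endpoint{PodName: pod, Zone: node.Zone})
	}
	return eps, nil
}

func main() {
	nodes := map[string]Node{"node-a": {Zone: "us-east1-b"}}
	nodeOf := map[string]string{"pod-1": "node-a", "pod-2": "node-gone"}
	_, err := endpointsForPods([]string{"pod-1", "pod-2"}, nodeOf, nodes)
	fmt.Println(err) // one orphaned pod blocks updates for the whole service
}
```

The alternatives discussed later in this section map onto this sketch directly: skip the offending pod (risking dropped endpoints if the node informer is stale) or keep the endpoint and leave `Zone` empty.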
R: What I've observed is that in some cases, especially if you have some level of churn, you can run into a scenario where the node just doesn't exist, and therefore the service is never updated. The endpoint slices for that service just become stale as long as we're in that state, which does not feel great.
R: There are a few real cases where that happens. The most likely one that I can think of is garbage collection: when the node lifecycle controller says, "hey, I don't see this node anymore, I'm going to get rid of it," and there are still pods attached to that node, the node is gone before pod garbage collection even starts. So it is a real thing.
R: The alternatives that I can think of are: either drop endpoints where we can't find a node, because something weird is going on — but then we risk dropping all endpoints if our node informer is crazily out of date or whatever — or, the other option, if we don't have a node, continue on and just don't populate the zone.
R: Right, right, so that's it. If we don't have a zone, maybe someone actually depends on it; or — maybe more likely — if we don't have a node, maybe we are in that state where the pod is about to get cleaned up anyway.
R: I don't know — there's probably not enough time to go in depth on this issue; I just wanted to raise it, if you have ideas or thoughts. Yeah, Andrew, you had a comment?
C: There are dynamic fixes, like, okay, maybe we can make it last a shorter time; but there's also just the case of making it more robust. There could be a bug or something that just causes one node to become invalid, and it just kind of gets stuck there — and should the system be able to get over that, or at least kind of limp along?
R: So there are two potential reasons we don't have a node. One: we legitimately don't have a node. Two: the EndpointSlice controller has lost connectivity — its node watcher failed, or some other connectivity piece failed — and we do have a node; the EndpointSlice controller just doesn't know about it.
R: So in this specific case, what was happening is that the VM itself was gone from the cloud provider; the node lifecycle controller cleans up the node, and then the pod garbage collector sees, "oh, there's no node attached to this pod, let me delete the pod." But each of those steps takes time, and there's a period where the pod is just orphaned.
O: So do we have a history of why the node garbage collector does not just garbage-collect the pods as it deletes the node? Because at that point in time, it knows what those pods are. It seems silly to me — like, "oh, I'm just going to clean this side of the room, where I live; I don't give a damn about the rest of the room, which is where everybody else lives."
O: Yeah, Rob, before we drop off, there's a comment you made that lit a light bulb in my head: failures in the watchers should be discussed outside the context of this problem.
O: Yeah, we can't code defensively against failure of the watchers. Let's say you lost connection to the watcher and you're not getting the updates, and you decide to bail out — bail out and restart everything, that's fine — but you can never really make any assumption about the watchers working or not.
B: Keepalives and all that stuff — but yeah, the problem there is that it's a question every watcher has to answer independently: what happens if I wrongly stop getting updates, right? This is a pattern that our internal systems at Google handle in various ways, like: "I don't believe that 50% of my cluster disappeared at the same time, so I'm going to ignore this update; there's something else wrong."
C: Yeah, and I think, even ignoring the watcher issue, there's this issue of consistency. Let's say the pod just sticks around with a bad node name: you should probably toss it at some point, or just ignore it.
B: Yeah, I agree; that's a great point, Bowei, because nodeName is actually user-settable, right? It's a request for scheduling. So if I want to cause havoc — this would be a good one for the clustered podcast — I create a pod that has a wrong node name and watch all your endpoints disappear.
R: So yeah, that's fair, yeah. I have a pretty detailed description of the bug itself in the issue, so please, if you have time, chime in there with thoughts, perspectives, different approaches. My biggest concern is that I don't want to break something else in the process of trying to fix this, which it seems could likely be the result. So anyway, I know we're at time, so I don't want to take too much more. But yeah, thank you.