From YouTube: Kubernetes SIG Network Bi-Weekly Meeting for 20201029
A: Excuse me — Kubernetes SIG Network meeting for Thursday, October 29th, 2020. First up is issue triage. Thank you, Bridget.
D: So this is something that we — I — recently ran into, creating clusters on systems like CentOS 8+ with nftables enabled and kube-proxy in place. What we tried is what I'm saying in the repro steps: a bigger cluster with plenty of replicas and a lot of services, and we are seeing that nftables kind of chokes.
A: Yeah, what I would say is that you need at least iptables 1.8.4, and I noticed it said CentOS 8.1 up above. I would not recommend that combination; I'd recommend at least 8.2, or even better 8.3, which probably is not out yet for CentOS. There have been a large number of fixes since then.
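As an aside for readers, the "1.8.4 or newer" advice is easy to check mechanically. A minimal sketch follows — the version-string format assumed here is the typical `iptables --version` output, and the helper names are hypothetical, not from any real tooling:

```python
import re

# Minimum recommended for nftables-mode kube-proxy, per the discussion above.
MIN_IPTABLES = (1, 8, 4)

def parse_iptables_version(output):
    """Extract (major, minor, patch) from `iptables --version` output,
    e.g. 'iptables v1.8.4 (nf_tables)'."""
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", output)
    if not m:
        raise ValueError("unrecognized iptables version output: %r" % output)
    return tuple(int(x) for x in m.groups())

def is_supported(output):
    """True when the reported version meets the 1.8.4 floor."""
    return parse_iptables_version(output) >= MIN_IPTABLES

print(is_supported("iptables v1.8.4 (nf_tables)"))  # True
print(is_supported("iptables v1.8.2 (nf_tables)"))  # False: predates many nft fixes
```

Tuple comparison gives the right ordering for dotted versions without any extra dependency.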
G: But so — 20.10 has an nftables-based one, yes? Yeah. It defaults to iptables-nft, all right.
A: It should be somewhat recent, but again, you're going to want at least 1.8.4.
B: Does somebody want this assigned? Yeah.
L: Yeah, I don't know — this might not be a bug. It feels like a bug. I talked to Andrew about it. I don't know, it seems really weird. I don't —
N: — returns an error in the initialization of the controller — of the IPAM controller — and that cascades all the way out to another one, because this is the new controller code.
N: The problem is cordoning the node and then uncordoning the node — you are messing with something that's not user-intended, right? The user intent is for this to function. Usually, users should not get into this situation unless they manually modified the podCIDR, or they changed the podCIDR after it got chunked up and assigned to nodes. The point I'm trying to make here is: there is no way we can recover from this.
N: I'm not — I'm not saying — I don't have a strong opinion; this is not a hill I would like to die on. But the point I'm trying to make here is: if you let this work, and you start assigning pods to this node, you will randomly get traffic disruption to pods and services, right? So don't assign.
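The failure mode being described — a podCIDR edited after assignment, which the controller cannot reconcile — can be sketched as follows. This is a simplified stand-in, not the actual node IPAM controller code; the function and parameter names are hypothetical:

```python
def check_pod_cidrs(node_name, allocator_assignment, node_spec_pod_cidrs):
    """Compare the allocator's record against the Node object's podCIDRs.

    A mismatch means someone edited the field after ranges were chunked up
    and assigned; per the discussion, there is no safe way to recover, so
    fail loudly rather than schedule pods that would then see random
    traffic disruption.
    """
    if list(node_spec_pod_cidrs) != list(allocator_assignment):
        raise RuntimeError(
            f"node {node_name}: podCIDRs {list(node_spec_pod_cidrs)} do not "
            f"match allocator assignment {list(allocator_assignment)}"
        )
    return True

print(check_pod_cidrs("node-1", ["10.244.1.0/24"], ["10.244.1.0/24"]))  # True
```

The hard error mirrors the cascading initialization failure mentioned above: surfacing the mismatch is preferable to silently serving a node whose CIDR no longer matches the allocator.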
B: Tim, Jay — would you like to follow up on this one?
A: Sure, and maybe we'll have time at the end. All right, some questions: who is Jay — is that you, Jay, or somebody else?
L: Oh yeah, that came up somewhere — someone was asking about it, and then I just figured I'd add it to the agenda here. I don't know who here works on that, but I just kind of — I put somebody here.
F: This is Pavitra. I can look at it, but I don't know which KEP we're talking about. We have one with status "implemented", and there's the beta one, which I have a PR for, to change it to "implemented". But I don't know if there's a bigger change.
L: No, I don't remember — I don't remember who asked about that point, so — I can look at it again after the meeting or something; I'll ping you if I'm curious. Oh.
B: I actually don't know who added the "important links" thing, but that was not related to this one — that was a different bullet point. I didn't add it, but I was going to edit it. So whoever it was, please say things about it.
A: Were there any other questions about the important links section, or just the feature spreadsheet?
A: All right, if not — back to Jay for kubelet node controller questions.
K: Well, there's a KEP about improving that situation, but I just filed a PR — which I'll paste in chat — about documenting what all this means.
A: Historically, it's been inconsistently implemented between cloud providers, especially the in-tree ones. I'm not sure if the situation is better now that some of the cloud providers have moved out of tree. Partly it's because the public cloud providers had a more defined — well, had a better definition of what these things were, one that was more consistent between cloud providers. But then, when we started adding other things like vSphere, OpenStack, and other stuff, that got a little bit muddied — not really consistent.
A: What I mean is that at least the cloud providers had some notion that the VM has an internal IP address, which all the other nodes can talk to directly, and an external IP, where that node may or may not be exposed to the public internet — or somewhere publicly outside the cluster. That was, as far as I understand, the traditional meaning of those two things, but that got muddied somewhat when more cloud providers showed up.
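That convention can be sketched with a small helper. The address types `InternalIP`, `ExternalIP`, and `Hostname` come from the Node API's `status.addresses`; the selection function itself is a hypothetical illustration, not kubelet or cloud-provider code:

```python
# Node .status.addresses entries are (type, address) pairs. Per the
# traditional meaning described above: InternalIP is what other nodes
# reach directly; ExternalIP may be routable outside the cluster.
ADDRESSES = [
    {"type": "ExternalIP", "address": "203.0.113.10"},
    {"type": "InternalIP", "address": "10.0.0.5"},
    {"type": "Hostname", "address": "node-1"},
]

def pick_address(addresses, preference=("InternalIP", "ExternalIP")):
    """Return the first address matching the preference order, mimicking
    the internal-first convention for intra-cluster traffic."""
    for wanted in preference:
        for entry in addresses:
            if entry["type"] == wanted:
                return entry["address"]
    return None

print(pick_address(ADDRESSES))  # 10.0.0.5 — internal wins when present
```

The muddiness discussed in the meeting is exactly that different providers disagree on which entries to populate, so any consumer needs a preference order like this.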
A: Okay, all right — and we answered your question about the invalid-range podCIDR. So, next up: Bridget, you have some accolades to hand out.
B: I just want to make sure we all take a moment, because we had so many of these meetings where we talked about the dual-stack PR — of, you know, amazing ginormousness — and I just wanted to have a round of applause for Khaled and everyone on this call who got that merged. It was enormous — go look at the diff on there, and then just think about reviewing that, and think about giving yourselves all a round of applause for all the reviewing you did on that, because that was huge. Yep, for sure.
A: So, on that line, what are the outstanding things now that that PR has merged? We still have the node addresses cleanup thing — I think, Dan, that you had referenced. Do we still have the load balancer —?
A: Well — sorry, that's external IPs! We still have the health-checking KEP stuff outstanding.
N: Yes. So feature-wise — feature-wise, we can start moving this to beta as soon as possible, because nothing is stopping us from actually starting to serve these things. From a feature-completeness perspective, we do need two things — the node IPs, because the node IPs mean that host-network pods will get dual stack correctly.
N: So my next action item is to — I think there are enough people thinking about the preference. I think we need to start thinking about the beta aspects of this as soon as possible — stabilizing anything that needs stabilization. And what's very comforting to me is that we have plenty of tests out there for us to just look at and watch being green or red, and that will give us enough confidence in what we're trying to do.
K: Going back to what Dan was saying about the health-check thing: we agreed to do nothing, leaving open the possibility that we might revisit it later. The KEP was closed — or, well, the KEP was merged, saying we still only health-check the first IP.
N: That means everything — the one part that we really need, and that I think Dan is working on, is the node addresses for on-prem, those techniques, because dual stack on cloud is covered. Again, referencing the same discussion we just had: the cloud providers just have a clearer meaning of what is an IP and what is not an IP.
K: Yeah, so you can — you can manually override it; it won't auto-detect it. And — okay — there's still the problem for clouds that some people think the cloud shouldn't return IPv6 IPs in a single-stack IPv4 cluster, because it might confuse other components to see IPv6 IPs in node addresses. So there's still some figuring out to be done.
N: Oh yeah, I'm not saying that you don't want to do that. I'm just saying — usually, people are very specific, and today you can do that, by the way, by having two services using the same selector with two different families. You get the same results, right? This is just tying them to a single resource — because clouds deal with IP as IP; they don't deal with IPs as dual-stack pairs. Like, you don't go to Azure, for example —
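The workaround just described — two single-stack services sharing one selector to cover both families — can be sketched like this. The pod and selector structures below are simplified stand-ins for illustration, not the real API objects:

```python
# Each pod carries labels plus one address per IP family (dual-stack pod IPs).
PODS = [
    {"labels": {"app": "web"}, "ips": {"IPv4": "10.0.0.5", "IPv6": "fd00::5"}},
    {"labels": {"app": "web"}, "ips": {"IPv4": "10.0.0.6", "IPv6": "fd00::6"}},
    {"labels": {"app": "db"},  "ips": {"IPv4": "10.0.0.7", "IPv6": "fd00::7"}},
]

def endpoints(selector, family, pods=PODS):
    """Endpoints a single-stack service of the given family would select."""
    return [p["ips"][family] for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

# Two single-stack services sharing one selector cover both families:
# the same pods, just addressed per family, as described above.
print(endpoints({"app": "web"}, "IPv4"))  # ['10.0.0.5', '10.0.0.6']
print(endpoints({"app": "web"}, "IPv6"))  # ['fd00::5', 'fd00::6']
```

The dual-stack API ties the same selection into a single Service resource instead of two; the selected pods are identical either way.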
A: All right — who added the part about needing to think about beta for 1.21 ASAP?
A: Yeah, brainstorming there. What is the criteria to get to beta at this point? I know in the past couple of meetings we had talked about a lack of confidence in the API, because nobody's really using it at this point — and obviously I don't expect anybody to have used it in the last couple of days since Khaled's PR merged — but where do we go?
A: Do we just need some more time from the field and people to get comfortable with it and get the other PRs merged, or is there stuff above and beyond that?
K: I mean — or we could come up with a more concrete set of criteria, like Bridget was saying: sure, this many network plugins and this many cloud providers.
N: I am more of: if you have something that has to be there, let's talk about it now. Otherwise, somebody goes and comes up with the list, and we either say "good enough" or "no, it's not good enough", and we move forward. And by the way, as we move forward — if we want to declare this beta and somebody pushes back on it with a good reason, absolutely, yes; it doesn't matter if we have a green checklist, right? The right to veto with a good reason is always there.
A: All right — thanks. Does anybody have anything else that they would like to talk about for the agenda before we go back to —?
N: Just a comment: Antonio found a very easy way to exercise and test the API machinery stuff as part of the testing. Most of our current testing today is in e2e. The problem with e2e is two things. One: you cannot run them locally unless you're, like, a test wiz — and we are not, all right? There are ways, I discovered, while talking with the SIG Testing folks.
N: The second thing is, when you submit, it takes a good half an hour to get results back, which makes them very unusable for an inner loop — test and then come back with results. So, the integration-test part of the PR: there is a huge integration-test battery. He created it; I just modified it a bit, and there are a lot of wins in there. You can run them locally; they are easy to run — all you need is a local etcd — and they give you results almost instantaneously.
N: So my advice to you: if you are working with API changes — anything that has to do with validation, "oh, I need to test that this thing is created and updated correctly, or created, updated, and deleted correctly", and so on — please do look at integration. All right? Please do look at integration. It will save your life, and it will save our lives, as we quickly test these things.
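The kind of create/update validation N is describing is exactly what fast, local, table-driven tests are good at. Here is a sketch with a made-up validator — the rules below are illustrative, not Kubernetes' actual service validation:

```python
def validate_ip_families(families):
    """Stand-in create/update validation: at most two families, no
    duplicates, only known family names."""
    errors = []
    known = {"IPv4", "IPv6"}
    if len(families) > 2:
        errors.append("at most two ipFamilies allowed")
    if len(set(families)) != len(families):
        errors.append("duplicate ipFamilies")
    for f in families:
        if f not in known:
            errors.append(f"unknown family {f!r}")
    return errors

# Table-driven cases: (input, expected error count) — the shape that
# integration-style tests make cheap to run in a local inner loop.
CASES = [
    (["IPv4"], 0),
    (["IPv4", "IPv6"], 0),
    (["IPv4", "IPv4"], 1),
    (["IPv4", "IPv6", "IPv4"], 2),
]
for families, want in CASES:
    got = len(validate_ip_families(families))
    assert got == want, (families, got, want)
print("all validation cases pass")
```

Because nothing here needs a running cluster, the whole table executes in milliseconds — the inner-loop speed the meeting is advocating, versus the half-hour e2e turnaround.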
A: Comment — all right. Jay, you had another comment?
L: Yeah, I was just talking with Andrew about this. Does anybody know if anyone is using either the Windows or the Linux userspace kube-proxy actively at all?
A: No — the proxy posts an event, I think, to say it —
L: But it's not using systemd — yep, that's the way to do it. Okay, cool! I was just curious. I may have some questions — I may ping you in Slack at some point; I'm just going through the code trying to figure out how it works. Yep.
A: I would say that the userspace proxy is not particularly well maintained. We do fix bugs when OpenShift finds issues with it, but most people don't really touch it, and, as far as I know, it's not particularly well used outside of OpenShift.
A: I don't think so. We test it in OpenShift downstream CI, but also we kind of have a hybrid proxy mechanism, where some things — services that can be idled — use the userspace proxy, and those that aren't idled use the... or something like that; I forget the details. Okay — and you don't know.
R: No — who said —?
A: — has been triaged. Well, if we think it is a relevant issue, then we will accept it by applying triage/accepted. Do we think this is a relevant issue?
K: Not necessarily. Okay, maybe it does need to be assigned to someone, then. All right — I haven't taken any... You can assign it to me — Dan Winship.
L: Yes — I don't think he's here today, but I'll ask him to come to the next SIG Network meeting.
L: Yeah, I think Chris put a PR in to improve some of those, and — I don't know — we brought this up last time, about how at some point we want to orthogonalize these tests. I guess this is more of a placeholder until we figure out how to do that.
C: — at it now. I'll do triage/accepted. You know, there was — okay.
N: No, no — it's about, you know... We need to check whether the CNI folks are already tracking it. I would say they have a reference somewhere.
A: Yeah, there's a PR to fix it. It's probably good — I just wanted to make sure. So you can leave this one, and —
L: If this isn't helpful — I don't know if it's useful or not; Antonio mentioned it might just —
J: Okay, yeah, I got it. So the thing is: for flakes, we need to have a threshold. I mean, if it's not failing two or three times in a week, it is not worth the effort, you know, because you are going to spend hours, and maybe you are not going to be able to reproduce it — and maybe it's the CI.
L: Yeah — yeah, yeah.
L: Okay, so when we see flakes that are related to SIG Network tests, we shouldn't file those as issues unless there's, like, a — unless they're —
J: Oh yeah, you can — yeah. Maybe you could just say in there that, you know, "I don't have enough data to really say this is a real bug."
B: — still listed as-is. Is this basically — is this triage/accepted?
L: Yeah, so this is — this is an interesting one. We have this issue where, when we probe certain things from the e2e client, depending on how your firewalls are set up, you can have totally different test results — based on whether or not you can, like, access a NodePort through a localhost port versus... Because we set up a probing client, and depending on the location of the probing client, you get different test results. Which means, for example, in this "you should update NodePort UDP" test —
L: You know, you might run it five times, and three times it'll pass and two times it'll fail, because of where the client comes up — it'll pass, for example, if the client is on a different node than the server. So, to make it consistent, one thing we could do is just always spin the client up on the first node in the cluster, which seems to be a pattern — we do that in other places, where we fix a particular node.
L: That's one solution. The other solution is we could, you know, run the test... I don't know — there are other solutions. I don't know.
K: — here is that we have to make it a requirement of the test cluster that NodePorts work — and that they only work with an unlimited range of ports, or whatever. But, I mean — because, you know, we make other requirements about the cluster that you're running the tests on, right? And I think this just has to be one of those requirements: that you don't have a firewall set up that blocks NodePorts.
L: Yeah, okay. Just, if anyone's interested: the reason we saw this is actually in Cluster API. You know, like, for AWS and stuff, we have UDP pretty locked down in certain cases, and certain customers want that. So that's why that's an important test for us, and having consistent results from it would be useful to us.
A: So what version of Kube were they using here — again, 1.16? Did it say?
A: I don't think this problem is fixed in 1.20 or before, either.
J: So the — the guy said that he's running asymmetric routing, and some of these packets are being declared invalid. So he wants to move the current iptables rules, and what I asked him is whether he can set, in the kernel, conntrack's tcp_be_liberal. So, I mean, I don't feel that we should move all the iptables rules just because of one scenario — and, I mean, I don't know how bad that scenario is.
K: So — yeah, the issue has sat around, but it's not because we're waiting on the reporter, really; it's because we're just not moving on it. So I would say — oh, although Tim added...