Antrea Community, 15 Aug 2022

From YouTube: Antrea Community Meeting 08/15/2022

Description

Antrea Community Meeting, August 15th 2022

A

Hello, everyone, welcome to the Antrea community meeting. Salvatore is away today, so I'll be replacing him for this meeting again. As far as I know, we have two things on the agenda today. The first and major item is from Anlan, who will present the updated Grafana home page, the customized Grafana home page that the team has built for the Theia observability solution. The second item on the agenda is that Quan will give a super quick update.

A

I think it is on the Antrea 1.8 release, which should come out in the next couple of days. So Anlan, if you're ready, just start whenever you want.

B

Yeah, sure. Thanks, Antonin, for the intro. The main purpose of the discussion is to gather more feedback on the customized Grafana home page. Let me share my screen.

B

Okay, I'll first talk about the motivation. As you might already know, one of the major features of Project Theia is network flow visibility. For network flow visibility, we are using Grafana as the visualization tool and ClickHouse as the data storage.

B

So the workflow is like this: the Flow Aggregator exports flow records to the ClickHouse data storage, and Grafana uses SQL queries to query ClickHouse, get the data, and plot it in diagrams to visualize the networking metrics that we are interested in for the cluster.

B

So here is the Grafana dashboard, and here's the default home page. We have already built six pre-built dashboards, not counting the home page. For example, the first one is the flow records dashboard.

B

It includes the count of flow records currently in the cluster, and it also has the details of each flow record, like the content of each field. This is the flow records dashboard, and we also have the pod-to-pod dashboard.

B

It is used to visualize the pod-to-pod traffic, including throughput and cumulative bytes. Similarly, we also have the pod-to-service, node-to-node, and network policy dashboards.

B

So these are all six pre-built dashboards. The issue we found is about the home page. As we can tell here, on top it has "Welcome to Grafana" and some links to the documentation, then it has some instructions about how to use Grafana, here is a dashboard links panel, and on the right-hand side it has the latest from the Grafana blog.

B

So the first issue is that there is nothing relevant to Project Theia; it is just the default Grafana home page. The second issue is in the dashboard panel. We might see some links here to the dashboards, but if you take a closer look, they are under the recently viewed dashboards, which means that if we log in to Grafana without having viewed any dashboard, this panel will be empty, like this. Here is the home page we'll see the first time we log in.

B

So that means, as we listed in the documentation, users need to click on the search dashboards button in order to see all the pre-built dashboards. That is also a little inconvenient for users.

B

So these are the two issues we found with the current default home page.

B

So that is the motivation for us to build the customized home page. Now we come to the design.

B

On the left-hand side, we have some statistic panels, including information about the Kubernetes resources: here we have the number of pods, the number of services, and the number of nodes. We also have some other metrics about the networking conditions, like the number of active connections.

B

We also have the number of stopped connections, the number of to-external connections, the total bytes transmitted in the selected time range, the overall throughput, and the number of network policies.

B

These are kind of a cluster overview. On the right-hand side, we have a text panel with a short description of Project Theia. It includes a documentation link that goes directly to the documentation part of our repo, and I also put the Antrea logo here. Below it, there is a short introduction to each pre-built dashboard, which basically describes what is included in each dashboard and summarizes each of them in one sentence.

B

Below that is a diagram showing the top 10 active source pods. I define "active" by which pods send the most bytes in the selected time range. Here we also have the number of flow records received per minute; it shows that in this minute 54 flow records were received. On the right-hand side, we have the dashboard links.
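As an illustration of the "top 10 active source pods" ranking described above, here is a minimal Python sketch. It is not the actual Grafana or ClickHouse query; the field names `source_pod` and `bytes` are hypothetical, and the records are assumed to be already filtered to the selected time range.

```python
from collections import Counter


def top_active_source_pods(flow_records, n=10):
    """Rank source pods by total bytes sent; return the top n (pod, bytes) pairs."""
    bytes_by_pod = Counter()
    for record in flow_records:
        # "source_pod" and "bytes" are hypothetical field names for this sketch.
        bytes_by_pod[record["source_pod"]] += record["bytes"]
    return bytes_by_pod.most_common(n)


# Hypothetical flow records, already limited to the selected time range.
records = [
    {"source_pod": "web-1", "bytes": 500},
    {"source_pod": "db-0", "bytes": 1200},
    {"source_pod": "web-1", "bytes": 900},
]

print(top_active_source_pods(records, n=2))
```

A pod that appears in several flow records has its bytes summed before ranking, which matches the "most bytes sent in the time range" definition given in the meeting.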

B

We will have all the dashboards here by default, and it also has the recently viewed dashboards. If users want, they can star some of the dashboards and they will be displayed here.

B

So it's kind of a mixture of all of these things, but the overall idea is to provide some background information for first-time users and also an overview of the networking conditions in the Kubernetes cluster.

B

I have already discussed it with the internal flow visibility team, and I am trying to gather more feedback from the broader community.

B

If you feel like there's anything you would like to add to it or remove from it, or you don't like the layout, you can just let me know.

C

Hello, the home page looks good to me. I have a few questions. The first is about the number of stopped connections. What does it mean? Does it mean the cumulative connections the cluster has ever had, or something else?

B

Yes, so I define it by the flowEndReason field: whether the flow, and so the connection, has already ended in the selected time range. For example, here it is the last 30 minutes. Then this count will be the accumulated count of the connections that have already stopped.

C

So if the user changes the time window, it will count the stopped connections over a larger window, like one day?

B

Yes, that's true. Basically, it's counting how many distinct connections have stopped in the selected time range. We define "stopped" by querying the data to find which flow records have a flowEndReason equal to, for example, three. That means the connection has already stopped.
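The counting rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the dashboard's real SQL: the record field names and the use of the 5-tuple as the connection identity are hypothetical, and the value 3 is the example flowEndReason value mentioned in the discussion.

```python
from datetime import datetime

END_OF_FLOW = 3  # example flowEndReason value mentioned in the discussion


def count_stopped_connections(flow_records, start, end):
    """Count distinct connections whose flow ended within [start, end]."""
    stopped = set()
    for record in flow_records:
        ended = record["flow_end_reason"] == END_OF_FLOW
        in_window = start <= record["flow_end_time"] <= end
        if ended and in_window:
            # Hypothetical connection identity: the usual 5-tuple.
            stopped.add((record["src_ip"], record["src_port"],
                         record["dst_ip"], record["dst_port"],
                         record["protocol"]))
    return len(stopped)


def make_record(src_port, reason, end_time):
    """Build a hypothetical flow record for the example below."""
    return {"src_ip": "10.0.0.1", "src_port": src_port,
            "dst_ip": "10.0.0.2", "dst_port": 80, "protocol": "TCP",
            "flow_end_reason": reason, "flow_end_time": end_time}


records = [
    make_record(40000, 3, datetime(2022, 8, 15, 12, 10)),
    make_record(40000, 3, datetime(2022, 8, 15, 12, 11)),  # same connection again
    make_record(40001, 2, datetime(2022, 8, 15, 12, 15)),  # not an end-of-flow record
    make_record(40002, 3, datetime(2022, 8, 15, 9, 0)),    # ended outside the 30 minutes
]

last_30_min = (datetime(2022, 8, 15, 12, 0), datetime(2022, 8, 15, 12, 30))
whole_day = (datetime(2022, 8, 15, 0, 0), datetime(2022, 8, 15, 23, 59))

print(count_stopped_connections(records, *last_30_min))
print(count_stopped_connections(records, *whole_day))
```

Widening the time window can only keep or increase the count, which is the behavior confirmed in the exchange above.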

C

Thank you. I got it. Another question is about do this.

D

Sorry for interrupting. By the way, I just have a minor comment: I feel "terminated" or some other word probably sounds better here, if you mean the connections are closed due to any reason. That's my feeling, because "stopped" sounds like the connection is stopped by something, like a network policy. Just my personal feeling; I'm not sure what others think.

C

I also thought "stopped connections" is a little confusing.

C

Another question is about the data transmitted and the overall throughput. Do you think it would be valuable to add data transmitted and throughput for cluster-to-external connections? I feel people may be more interested in that kind of data, because it can show how this cluster communicates externally, with what bandwidth, and how much data has been transmitted.

B

Got it. Currently, it is the sum of the in-cluster data and the to-external data, so you mean just split it into two different fields, right?

B

C

Split it, or just keep this overall one and have a separate one for traffic between the cluster and external. And by the way, for data transmitted, does it count one direction only? For example, what if one pod on one node talks to another pod on another node?

C

How do we count this data transmitted? Do we count the data twice, once for egress on the source side and another time for ingress on the destination side?

B

Currently, we only calculate the data for a single direction; we do not calculate the reverse side.

C

Okay, so it's only the request direction.

B

From the source to the destination.

B

Yeah, but do you think it would be better if we make it clearer, like saying it is the bytes from source to destination, and have another statistic here showing the reverse bytes?

C

I'm not quite sure, but if we just have one with this title, it may be confusing, because it sounds like it counts all directions. That's why I'm asking whether we have duplicate calculation for the same data.

D

Now, Anlan, by one direction, I think you mean that we compute the bytes from sender to receiver, right? And Quan, you're saying we should also have a metric for how many bytes are received by receivers? Is that what you mean?

C

No, I mean, for example, if one client sends a request to one server and the request is very small, maybe the server responds with a lot of data.

D

Yeah, that's what I think you mean, but I think that should be counted here, right? The bytes are from any sender. The sender can be the source of the connection, but the server side of the connection can also be a sender and send bytes back to the client, for example. Those bytes should also be counted.

D

B

Currently, for such a connection, we only do the summation of the bytes sent from the source to the destination. If we also want to count the bytes received by the source, then we can also add the reverse octet delta count here.

D

No, I think the question is: when you have a client and a server, does this include both the traffic from the client to the server and also the bytes sent out from the server to the client?

B

I don't think we have that, because I think in the flow records we only have a single entry for every connection.

D

B

So, in that case, the server will be the destination and the client will be the source.

B

Okay, so so that is a single entry in the flow records table, but.

D

Yeah, I think that is a little misleading in my mind. If you show "data transmitted", probably we should count both the traffic from the client to the server and the reply from the server to the client.

B

Yeah, I can count both in the query here. I'm just saying that currently, with this query, it only counts the bytes from source to destination, but I can change it.

D

By the way, I don't know which way is better; I think either is fine. Either you have two numbers, one for request and one for reply, or maybe just one called "data transmitted" that includes both.

B

Yeah, got it. I will include both.
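To make the two counting options from this discussion concrete, here is a minimal Python sketch. It assumes each connection has a single flow record carrying a forward byte count and a reverse byte count; the key names `octet_delta_count` and `reverse_octet_delta_count` are illustrative spellings chosen for this sketch, not a confirmed schema.

```python
def data_transmitted(flow_records, include_reverse=True):
    """Sum bytes over flow records, optionally including the reverse direction."""
    total = 0
    for record in flow_records:
        # Bytes sent from the source (client) to the destination (server).
        total += record["octet_delta_count"]
        if include_reverse:
            # Bytes sent back from the destination to the source.
            total += record["reverse_octet_delta_count"]
    return total


# Hypothetical connection: a small request and a large response.
records = [
    {"octet_delta_count": 200, "reverse_octet_delta_count": 50_000},
]

print(data_transmitted(records, include_reverse=False))  # source to destination only
print(data_transmitted(records, include_reverse=True))   # both directions
```

With the small-request, large-response case raised in the meeting, counting only the forward direction would report 200 bytes, while counting both directions reports 50,200.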

B

A

All right, thanks, Anlan. Are there any more questions, guys?

B

Maybe I can put the Confluence page for it in the Zoom chat, if you...

A

B

A

I don't think it's uh publicly accessible.

B

Oh no, I mean, I put two screenshots of the designed home page here.

A

Yeah, but I don't think the Confluence page would be accessible if someone is watching the recording of the meeting and tries to access those pages, because it's an internal VMware service. But if you do have a GitHub issue in the Theia repository, it may be a good idea to share the link and to put the screenshots there.

B

Thank you. Thank you guys.

A

All right, thanks, Anlan. Lan actually has a topic she wants to bring up for discussion, but I think we will have enough time at the end of the meeting, because I think Quan is only going to give a quick update. So Quan, are you ready?

C

Yeah, thanks, Antonin. For Antrea 1.8, currently we have three patches left; they are about the issues found in the last release. The first one has already been uploaded, and I think I can merge it after the tests succeed. There's another one I found at the last minute, after merging the PR for supporting audit logging for Kubernetes network policies.

C

Oh sorry, no, not that one: it was after merging the PR that supports namespaced groups for Antrea network policy. I found that if we use nested groups for Antrea network policy, there's no validation to disallow the parent group from referring to a child group that selects a namespace, and thereby selects pods in other namespaces. So this Antrea network policy, even though it is namespace-scoped, can apply to pods in other namespaces.

C

So it's a security hole, and I think this is a must-fix in this release. I'm already working on a patch to fix this, but the PR will need to be based on the refactoring PR. I should push a PR for review later today, and if everything goes well, we should merge it tomorrow. Another one is minor. This issue has already existed for several releases, I think, but we just found it this month.

C

It's an issue where the agent fails to reconnect to OVS after some timeout events, leading to the agent not working but still showing as running. We had a PR to fix it, but the problem is that during the review I found something confusing. Currently, I'm still not sure where the problem is, but let me show you the code.

C

C

Based on my and Antonin's understanding, I think there is some code in this library which helps clean up groups installed in Open vSwitch, for the situation where only the Antrea agent restarts.

C

But this PR removes this cleanup, so I'm not sure whether it is safe to just merge it, and we are still figuring out how the groups are cleaned up and whether it is safe to make this change. Since this issue has already existed for several releases, and this is the first time we found it, I think merging it now may be risky. We could have a patch release for this fix later, after we feel it is safe to merge the patch. But we must have the other patch in this release to avoid the security hole.

E

I was just trying to mention that for 402.8 I had some minor comments; they're basically just nits. I do apologize, I couldn't find time earlier today to review it, but I think for that one, if all tests have passed, you can basically address my comments in the next PR that you're about to open. That's totally fine by me.

C

Okay, thanks.

A

Yeah, for the first issue, will we need to backport the fix to previous releases, or is this something that's new in this release?

C

It is something new in this release. It was introduced when we added support for namespace-scoped groups.

D

Actually, I want to ask the same question for the race condition one. Do you think that is serious enough that we need to backport it as well?

C

D

C

It has a chance to reproduce, but I only found several failures in the tests, and I never heard a real complaint about the issue or any real issue caused by this. I think it's not very easy to reproduce.

C

Even though I have a unit test to mock the situation, it is really hard to reproduce. I can only reproduce it when I set the count to 500.

D

Okay, but when it happens, it means some policies will not be realized correctly.

C

Yeah, we could also backport it, but since we made a lot of changes to network policy in this release, it may also need some code changes; just cherry-picking the PR may cause many conflicts. But I can still work on it.

C

I think it is good to fix it and backport it.

B

C

And so if everything goes well, we should release 1.8 tomorrow or the day after tomorrow.

D

Okay, your first PR, is it a big one, or is it a relatively simple one?

C

It's simpler than this one; it's about maybe 200 lines of change.

D

Okay, got it. I just mean if it's too big, I personally feel it's okay if we don't fix it in 1.8, right?

C

You mean this one? Yeah, but this one would introduce a security hole.

D

Sure, but that is only for this new feature, right?

C

Yeah yeah yeah.

D

I'm not saying we shouldn't fix it; I just mean if it's hard to get merged. If you're saying it's simple, then yeah, you can just try, and we can review and merge it.

C

Another approach I was thinking of is: if we don't have a safe way to fix it in this release, we could just disallow the group from being a parent group.

D

Yeah, that sounds good to me too. Actually, yeah.

C

That's fine, and I'll see if I can manage to fix it properly.

B

C

C

Okay, that's all about the release. Any questions?

C

If not, I will give the time to Lan.

A

Thank you. Thanks, Quan, for the update and all the work on this release. So Lan, you have about 30 minutes. Take it away when you're ready.

F

Thanks, Antonin. I think it will be a quick update, because the first phase of the gateway work will actually be a simple design. Here is an issue which was created by Jianjun, and we plan to support the multi-cluster gateway.

F

As you know, in 1.7 we have multi-cluster gateway support, which allows cross-cluster traffic to go through the tunnel, so that different clusters can access a remote service. But in the current implementation, we allow the user to annotate one node to become a gateway, or they can annotate multiple nodes, but only the last created one will be the active gateway. If the node fails, we don't take any action, which means that we did not support high availability of the gateway in 1.7.

F

That's why we would like to make our gateway more robust, so that if there is any gateway node failure, we can detect it and use another gateway candidate to support the multi-cluster feature. We had a discussion in this issue, and after some discussion, for the first phase we would like to do a simple one, considering that we have a few other candidates to support the multi-cluster feature.

F

So in the first phase, I would like to support an active-standby gateway, and here is a design which I posted today. I would like to quickly go through it so everyone can understand the current design, and I welcome any comments or suggestions if you feel we have a stronger way to support it. For now, as I just mentioned, we don't detect node failure when the gateway node is, maybe, stopped or just not ready in Kubernetes, and we don't check that information to recreate the gateway. In this first phase, we actually don't change any CRD; we just change a few processes in different controllers.

F

From the user's perspective, it is the same as before: the user needs to annotate the nodes with this annotation, and as long as a node has this annotation, we consider it a gateway candidate. It can become an active gateway, but only when the node is ready. When a node is created in Kubernetes, it may not be ready even when you have already annotated it, so the multi-cluster controller will watch the node readiness and make sure that the first ready node with this annotation will be the gateway. When this node becomes the gateway, the Gateway CR will be created by the controller. The Gateway will be like this one, so there is no difference from before.

The Gateway's name will be the same as the node name, and there are a few refinements for the node controller and the gateway controller. First, we will refine the node controller. It already watches node events, but it will also check the node's readiness information, so that we take different actions based on the node readiness.

F

Suppose there is only one node with the annotation, and the node is also ready. Then the controller will create the Gateway, and you will see the Gateway CR if you use kubectl to list the gateways. If any new node gets the gateway annotation, it will become a gateway candidate.

F

The controller will save this information in memory, which means the controller will keep these candidates and know which nodes can become the gateway if the existing gateway fails.

F

So if a new node becomes a gateway candidate, the controller does not take any action other than adjusting the saved gateway candidates.

F

When we have a Gateway CR here and this gateway node fails and becomes not ready, our controller will delete the existing Gateway CR, and depending on the gateway candidates, it will take different actions. First, if the gateway candidate list is not empty, our controller will check the candidates and their node readiness. It will go in alphabetical order and pick the first ready one to create the new Gateway CR. But if there is no gateway candidate, which means that in the current environment you may have only one node with this annotation, then when this node becomes not ready, after the controller deletes the existing Gateway CR, the controller will take no action, because there are no longer any available nodes to be the gateway.
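The failover selection described above can be sketched as a small Python function. This is an illustration only, not Antrea's controller code: the function name and the use of plain sets of node names for the candidates and the ready nodes are assumptions made for this sketch.

```python
def pick_new_gateway(candidates, ready_nodes):
    """Return the first ready candidate in alphabetical order, or None.

    candidates:  node names carrying the gateway annotation (hypothetical input)
    ready_nodes: node names currently reported as ready (hypothetical input)
    """
    for name in sorted(candidates):
        if name in ready_nodes:
            return name
    # No ready candidate is left, so the controller would take no action.
    return None


# Hypothetical cluster: node-a was the active gateway and just became not ready.
candidates = {"node-c", "node-a", "node-b"}
ready_nodes = {"node-c", "node-b"}

print(pick_new_gateway(candidates, ready_nodes))
print(pick_new_gateway({"node-a"}, ready_nodes))
```

In the first call the failed node is skipped because it is not ready, and the alphabetically first ready candidate is chosen. In the second call the only annotated node is the failed one, so nothing is selected.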

F

As for the gateway controller, this part will be refined and will become simpler, because in this phase-one design we will only allow one gateway in the member cluster. The gateway controller will just watch the Gateway events: when the Gateway is created by the node controller, it will create a ClusterInfo kind of ResourceExport in the leader cluster, which means it will export its own cluster information to the leader.

F

Then, if there is any Gateway update, the controller will just update it correspondingly. For example, if the node's external IP or internal IP has changed, we need to reflect this information in the ResourceExport in the leader cluster. When the Gateway is deleted, the gateway controller will just delete this kind of ResourceExport in the leader cluster.

F

So it will be simpler than the current implementation, because in the current implementation we allow multiple gateways to be created, and we pick the last created one and export it. For the ClusterInfo kind of ResourceExport, we actually don't make any changes, so it will be kept the same as before: it includes the ID, the type information, the namespace, the member cluster ID, and also the gateway information like the gateway IP and the service CIDR. But we do have a new gateway webhook in phase one.

F

So we can make sure that we only have one active gateway. As long as the multi-cluster controller is running, it will check any new Gateway creation events, and if there is an existing Gateway, it will deny the create request, which means the creation will fail.
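The webhook rule just described, deny a new Gateway when one already exists, can be sketched as follows. This is a simplified illustration, not the real admission webhook: the function name, its inputs, and the message text are hypothetical.

```python
def validate_gateway_create(existing_gateways, new_gateway):
    """Return (allowed, message) for a Gateway creation request.

    existing_gateways: names of Gateways already present (hypothetical input)
    new_gateway:       name of the Gateway being created (hypothetical input)
    """
    if existing_gateways:
        current = sorted(existing_gateways)[0]
        return False, f"gateway {current} already exists; cannot create {new_gateway}"
    return True, ""


# First creation is allowed, a second one is denied.
print(validate_gateway_create([], "node-a"))
print(validate_gateway_create(["node-a"], "node-b"))
```

Rejecting at creation time is what keeps the member cluster at no more than one Gateway, which in turn lets the gateway controller stay simple.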

F

So there will be at most one gateway in the member cluster. I think the first phase will be a simple one. So here's the overview; let me know if you have any questions or comments.

F

Okay, no questions. Then thanks, Antonin, you can continue.

A

Thanks. It all makes sense to me. It's great that you are working on this. Thank you.

F

Okay, thank you.

A

All right, so any late questions on Lan's presentation, or any other topics that you guys want to bring up?

A

E

Right, I just have a quick question: is there going to be any traffic breakage when failover actually happens, like when one node dies and we change the gateway to another? How do we handle this?

F

Yes, I think at least in this phase we will have to accept this kind of situation. It will happen, but we will take no action. I don't know if there's any available solution to avoid the breakage, but in phase one we don't handle this case.

E

Yeah, that makes total sense to me. I guess my question was just: after the HA failover takes place and the new gateway comes up, the traffic will automatically resume working in that specific case, is that right?

F

Yeah, that's right.

E

Okay, yeah sounds good.

F

Okay, thank you.

A

All right, thanks, Yang. If that's all, I think we can stop the meeting here. I'd like to thank everyone for joining, and I'd like to thank Anlan, Quan, and Lan for their presentations. I'll see everyone in two weeks.

C

Thank you, bye.

D