From YouTube: Antrea Community Meeting 07/31/2023
Description
Antrea Community Meeting, July 31st 2023
A
Good morning, good afternoon, good evening, thanks for joining the Antrea community meeting. Today is the first meeting of the month of August — of course, if you are in the United States you live in the past, so it's still July for you. In any case, jokes aside, let's go to today's agenda. We have two topics today, I will say two very interesting topics. We will start with the proposal about adding first-N sampling to live-traffic Traceflow — I'm just relaying what I was told as a topic for today, because I'm also very curious to learn what this is about. [The presenter] will be leading this presentation, and I am sorry if I did not get the right pronunciation of your name. Then we'll have a talk by Andrew about improvements to the existing CI pipelines and new functionality for Kind.
C
Yes, so I'm going to give a presentation about introducing first-N sampling to live-traffic Traceflow. Let's first quickly review the milestones of the Traceflow feature. Initially we supported injecting a crafted packet, so we could gather information about Antrea's packet processing, and then we added support for live-traffic Traceflow.
C
That
is,
we
can
capture
the
first
packet
that
satisfy
the
given
conditions,
such
as,
for
example,
we
can
assess
on
conditions
like
the
package
headers
so
button
only
capturing
the
first
package
is
not
enough,
so
we
plan
to
add
the
sampling
feature
to
the
trace
flow.
We
plan
to
add
three
sampling
methods.
The
first
is
the
so-called
first
and
Sample.
That
is
with
we
capture
the
first
and
packets,
and
then
we
have
the
interval
assembly.
We
can
capture
one
out
of
every
n
package
and
the
third
is
the
time
interval
sampling.
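The three sampling strategies described here can be sketched as simple predicates that decide, per matched packet, whether to capture it. This is an illustrative sketch only — the type and method names below are invented for the example and are not Antrea's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// firstN captures only the first n matched packets.
type firstN struct{ n, seen int }

func (s *firstN) Sample() bool {
	s.seen++
	return s.seen <= s.n
}

// oneInN captures one out of every n matched packets.
type oneInN struct{ n, seen int }

func (s *oneInN) Sample() bool {
	s.seen++
	return (s.seen-1)%s.n == 0
}

// timeInterval captures at most one packet per interval (e.g. one second).
type timeInterval struct {
	interval time.Duration
	last     time.Time
}

func (s *timeInterval) Sample() bool {
	now := time.Now()
	if now.Sub(s.last) >= s.interval {
		s.last = now
		return true
	}
	return false
}

func main() {
	f := &firstN{n: 3}
	captured := 0
	for i := 0; i < 10; i++ {
		if f.Sample() {
			captured++
		}
	}
	fmt.Println("firstN captured:", captured) // firstN captured: 3

	o := &oneInN{n: 5}
	captured = 0
	for i := 0; i < 10; i++ {
		if o.Sample() {
			captured++
		}
	}
	fmt.Println("oneInN captured:", captured) // oneInN captured: 2
}
```

As the talk notes, first-N is the simplest of the three, which is why it is being implemented first.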
C
This means capturing packets at a given time interval, such as one second. We plan to implement first-N sampling first because it seems the simplest, but there are some challenges to overcome when implementing it. The first is that we cannot just use the CRD status to store our results, because they are too large. Currently, our solution is to save the results on the disk of the nodes in the pcapng format, and each node passively waits for the user's request to fetch them.
C
The second problem is the possible OVS overhead. Before the change, we only made OVS send the first TCP packet — that is, the packet that starts the connection — so there was no problem; but to support sampling, we need to make OVS send every matched packet.
C
There are some parameters for the sample action. The first is probability: for example, if we set the probability to ten thousand, it means we capture one out of 10,000 packets. We also have a collector set ID.
C
The current status of the development is that we have already developed an alpha version that can sample the first 10 packets and write the data to disk, but there is still a lot of work to be done. For example, we need to determine the final Traceflow CRD design, and we also need to find a way to design a user-friendly API to expose the Traceflow results to our users.
C
Then I will briefly introduce the code changes we made. The first is that we changed the Traceflow CRD: we added a sampling field. The sampling field has two properties, method and num. Currently, the method can only be firstN, and the num field determines the value of N.
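As a rough sketch, the draft CRD change described here could look like the following. The field names sampling, method, and num come from the talk; the apiVersion, the FirstN spelling, and the source fields are assumptions for illustration, since the exact schema is still under discussion.

```yaml
apiVersion: crd.antrea.io/v1alpha1   # assumed; actual group/version TBD
kind: Traceflow
metadata:
  name: tf-sample
spec:
  liveTraffic: true
  sampling:          # draft field discussed in the talk
    method: FirstN   # currently the only supported method
    num: 10          # capture the first 10 matched packets
  source:            # illustrative source selector
    namespace: default
    pod: client
```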
C
But I think this is not very reasonable, because if we set liveTraffic to false and still assign a sampling config, there will be a contradiction. So I think a more reasonable way is to make sampling a property of liveTraffic, like this, and the same for the droppedOnly property, because the sampling and droppedOnly properties are both for the live-traffic mode.
C
Then, if we have captured enough packets, we set the status of the Traceflow to Succeeded.
C
Also, on the Antrea agent side, the OVS pipeline is changed a little. Before the change, we only tracked the first packet: as you can see, we matched the conntrack state to be new and tracked. The new state means that this packet is the first packet of a TCP connection. But for the sampling feature, if sampling mode is on, we disable this conntrack-state new condition.
C
So
we
can
check
every
Target
package,
but
this,
but
this
change
has
also
introduced
some
unintended
side
effects
so
without
temporarily
change
the
priority
of
this
flow
to
be
high.
Previously
this
priority
used
to
be
low,
but
this
change
also
caused
some
some
parts.
So
we
need
to
find
out
the
reason
and
to
fix
the
bug
and
for
the
object
updates.
C
We
can
see
that
before
the
change
we
update
every
time
we
receive
the
packed
in
message,
but
for
the
sampling
mode,
this
will
be
true
will
cause
too
much
resources
because
the
there
will
be
too
much
more
texting
messages.
So
we
cannot
update
every
time
we
received
a
pacting
message,
so
so,
if
we
are
on
the
sampling
mode,
so
if
I
will
assembly
mode,
we
must
add
a
update
rate
limiter,
which
limits
the
rate
of
crd
updates.
C
Also, after we process the packet-in message successfully, we write the packet to a local file in the pcapng format. This is done by introducing a new library, which is the gopacket package. The gopacket package is a library that provides support for the pcap and pcapng formats, and gopacket is maintained officially by Google.
E
Actually, I have a question. You mentioned that the captured packets would be too large to be saved in the CRD itself. So how could users retrieve the pcapng data — how could they retrieve that? I didn't see it mentioned in the slides.
F
I can answer this question. We are planning to reuse methods like the ones from the support bundle. In the older version of the support bundle, we used an HTTP API to retrieve the raw data from Kubernetes. For example, when a user starts a new request to fetch a specific support bundle, we aggregate the files from the local disk, create a tarball, and then use the HTTP API to return the data to the user.
F
So only if the user starts a new request do we give the data to the user; otherwise it's just stored on the local disk. I think the current version of the support bundle has switched to a new mechanism — maybe a file server or something like that — but I think the older support bundle approach still works for our situation. Yeah.
E
I remember, for the support bundle API — I mean not the latest support bundle collection CRD API, I mean the old one — I remember it sends a request to each agent's API.
E
So I guess this may not work when you want to do something with a CRD, but probably it could be done by Antrea's own CLI, antctl. Perhaps it could read some status from the CRD — the CRD status — and start a direct connection with the corresponding agent to get the data. It might be possible, but for people who are not using antctl it may be difficult to get the data, right? Because...
F
Yes, so in the design we actually had a new field in the status field of the Traceflow CRD, something like a packets path: for example, we can generate a new HTTP path and write it to the status field.
F
Yeah, I think there definitely is. The problem is what exact number we should set, so we need to evaluate all the parameters, like the storage size and the packet size. In the pull request we definitely need an upper limit, but for the exact size I think we need more discussion or tests to set a reasonable one. Yeah.
E
Right, yeah. In the previous code, when we only supported live-traffic Traceflow capturing the first packet only — and when it's not in live-traffic mode — I think we have a timeout, I guess it's a limit, for the flows to match the target packet, and I remember it's something like five minutes. For first-N packets, do we stay near that, or do we need to dynamically set the timeout according to N and how long to wait?
F
Yeah, so I think a fairly hard limit still works for all the sampling methods. Even if we add a new sampling method, we can treat the timeout as a limit for all these choices. I think it's reasonable for a user to understand the meaning of the field, and I didn't see any conflict yet between these fields, so I'm not sure what the issue would be.
G
By the way, I joined late, so just looking at the spec I start to think: should we really reuse the Traceflow CRD, or should we have a new one? Because I think this is a little different from the original Traceflow — in a way it's capturing packets, right?
C
I think it's a philosophy problem — everyone has their own preference — but personally I think this function is different from the current one. So I think at least, as mentioned, we need to put the sampling property into liveTraffic, not like the current draft where we just add a new parameter.
G
The current live-traffic Traceflow is different from what you guys are proposing, right? Today we just capture the headers — and a very limited set of headers — we don't capture packets.
F
So if we have a new CRD, we have the advantage of still copying the design from Traceflow. I'm not familiar with it, but I think if it used an external file server, that may be better for the packet storage.
C
I think it's reasonable to expand the current live tracing. So even if we don't use sampling, we could still collect the raw packet data as well. I think that would make the new function and the current function consistent.
G
Yeah, I don't have a strong opinion. Probably let's think a little more here. If you think it is consistent behavior, maybe we can reuse it; if it's very different and you want to add many new parameters, probably we should design a new CRD. That's my opinion.
A
Okay, nice, right. So I don't know if we have any other questions regarding sampling for Traceflow — waiting just 10 seconds.
A
All right, it looks like that's all for this presentation, so thanks a lot to Shiwana for doing this presentation. Then we can move to the next topic regarding the CI improvements, which should be led by Andrew. So please go ahead.
D
Okay, okay. So hi everyone, today I want to share a few improvements to the existing CI pipeline. Currently I am working on these four upcoming improvements. The first is using a stop-all-stale-jobs trigger to kill jobs related to a PR; the second is running CI tests in a dual-stack Kind cluster; the third is running IPv6 CI tests in an IPv6 cluster; and the last one is automated upgrade support for the Jenkins Kind cluster. So let's see them one by one.
D
So basically, the question is why we need this improvement, or why we need this job. Suppose you have created a PR and triggered a few CI tests on it to test your changes, and after four or five minutes you push some new changes to your PR. Now you need to re-trigger your CI tests to test your updated changes, and before re-triggering the tests you need to abort your previous jobs.
D
For that, you would need to use the Jenkins UI to abort all the running or waiting stale jobs for your PR. Instead of using the Jenkins UI, you can use this stop-all-jobs trigger to kill all the previously running tests on your PR. So, as you can see in the workflow: you trigger a CI test on the GitHub PR, then after some time you push some changes to the PR, and then you need to abort the previous jobs before re-triggering the tests.
D
For that, you can trigger this stop-all job. It will abort all the waiting or running jobs on your PR, and then after this you can re-trigger the CI tests to test your latest changes.
D
Basically, in this we have used the Jenkins REST API to get information about the running and waiting jobs for the PR, and we have used a Jenkins token to perform the POST operation against Jenkins. Currently I have enabled this feature only for the CAPV-related jobs.
D
Why have I enabled this feature only for the CAPV-related jobs? Because there are lots of jobs running in Jenkins, so we can't apply it to all the jobs at this time; otherwise it would impact some other important jobs. Once this implementation is approved by the maintainers, and it is working fine for the CAPV jobs, we can have a follow-up PR to enable this feature for other Jenkins jobs. Next, we have an improvement related to Kind.
D
We can run the network policy, conformance, and e2e tests in a dual-stack Kind cluster. Here I have created three Jenkins jobs, so we have three trigger phrases, for e2e, conformance, and network policy, and the workflow is very simple: you can trigger any phrase on the GitHub PR to run the e2e, conformance, or network policy tests, and based on the trigger phrase it will create the dual-stack Kind cluster and then run the CI tests.
D
Whatever you have triggered, after finishing the tests it will delete that particular Kind cluster. Similarly to dual-stack, we can run the IPv6 CI tests in an IPv6 cluster. Here I have also created three jobs, because it is very simple for a developer to trigger any test — e2e, conformance, or network policy — so if they want to run only one test, like e2e, they can use only that one trigger phrase. Its workflow is also similar to the dual-stack one.
D
You can trigger any test phrase on the GitHub PR, and then it will create a new Kind IPv6 cluster, run the CI tests based on the trigger phrase, and after finishing the tests it will delete the cluster. The last one is automated upgrade support for the Jenkins Kind cluster. Basically, we know that Kind clusters are currently running in the Jenkins pipeline, and the setup does not have the capability for automatic upgrades, so on every Kind release
D
We
have
to
perform
a
manual
upgrades
upgrade
in
the
kind
test
fit
so
which
is
INS
of
inefficient
so
to
resolve.
This
I
have
added
a
Jenkins
script,
name
like
a
kind
of
grid.essage
to
support
automatic
upgrade
and
its
algorithm
is
pretty
simple.
So
we
can
like
we
can
fetch
or
we
can
get
the
latest
version
from
the
GitHub
using
called
command,
and
then
we
can.
D
We
can
have
our
existing
kind
of
version
using
using
command,
and
then
we
can
check
if
latest
kind
version
is
greater
than
existing
kind
version,
then
we
can
call
a
upgrade
kind
function
and
in
that
function,
based
on
the
OS,
we
can
upgrade
the
kind
in
the
test
bit
and
after
merging
this,
the
workflow
will
be
like
here
or
you
can
see.
First,
you
would
trigger
the
trigger
the
CI
test
faces
on
the
pr.
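The core of the upgrade check described above is a version comparison between the latest GitHub release tag and the installed `kind version` output. A minimal sketch of that comparison is below; it assumes plain `vMAJOR.MINOR.PATCH` tags and does not handle pre-release suffixes, which Kind releases generally avoid anyway.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// newerThan reports whether version a (e.g. "v0.20.0") is newer than b.
// A minimal numeric comparison of dotted components.
func newerThan(a, b string) bool {
	parse := func(v string) []int {
		parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
		nums := make([]int, len(parts))
		for i, p := range parts {
			nums[i], _ = strconv.Atoi(p)
		}
		return nums
	}
	x, y := parse(a), parse(b)
	for i := 0; i < len(x) && i < len(y); i++ {
		if x[i] != y[i] {
			return x[i] > y[i]
		}
	}
	return len(x) > len(y)
}

func main() {
	latest, installed := "v0.20.0", "v0.19.0"
	if newerThan(latest, installed) {
		// Here the real script would download the new kind binary
		// for the testbed's OS and replace the installed one.
		fmt.Println("upgrade needed")
	}
}
```

The Jenkins script then only runs the upgrade step when this check is true, so testbeds stay current without manual intervention.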
E
Could this be done automatically, without users having to type the instruction? Because I remember in GitHub Actions, if we push a new commit, the previous run will be canceled automatically, and I think it's the same situation for the Jenkins jobs: even if the job finishes and sets its result, the result will not be reported to the new commit anyway, so running it is usually meaningless. So could we just make it automatic?
H
Another goal for this job is to clean up the previous CAPV clusters, since if we just abort the previous job, there could be some resources remaining on the testbed. So we need this trigger phrase to clean up all the redundant clusters on the testbed.
H
Yeah, I think, like you said, we can investigate it for the new trigger phrase. If we can clean up the cluster as a reaction behavior, we can maybe add a cleanup action in this function.
E
In general, to confirm, if I understand correctly: the cleanup jobs triggered by this instruction are for cleaning up the testbeds and jobs associated with this PR only, or is it for...?
E
Yeah, then what's the problem if we just make it automatic, without the user typing the instruction?
I
Yeah, so actually this is a work in progress. Currently this can be done manually, but yeah, it can be automated, because it is based on the PR. So next time, when any changes get pushed, the previously running jobs will be canceled automatically; that will be taken care of once this PR gets merged.
J
I think one thing that comes to mind with the automatic approach is — well, it's not the case for most PRs, but sometimes, when we are working on, for example, the release PR, I don't know if we want the jobs to be aborted automatically. Because when you're working on the release PR and you're editing the release notes, for example, you kind of don't want to have to run those Jenkins jobs again
J
If
you
get
a
passing
status,
if
you
know
what
I
mean
yeah
yeah
I
get
it
I
mean
it's
only
a
minority
of
the
pr
that
or.
E
in this category, yeah. Yes, I agree. But the jobs that will be cleaned up — is it only the other ones, or could it also be the running ones?
E
Okay, okay, then yeah, that scenario — if it's automatic, it might cause some small problems, like Tony said. But my concern with the manual instruction is that perhaps people wouldn't remember to type two commands: I guess it requires you to type stop-all-jobs once, wait for the jobs to be terminated, and then start another round of tests.
J
We can investigate it. I just wanted to say, maybe, like today: I think if someone types /test-e2e, then pushes a change to the PR and types /test-e2e again, we're going to wait like one hour for the previous job to complete before starting a new job for the second command. Maybe in this case — where we're typing a command to run a job again and the previous iteration of that job is not completed yet — I
J
Think
in
that
case
it
would
make
sense
too
aboard
the
first
one
automatically,
because
we're
basically
running
that
job
to
completion
for
nothing.
In
this
case.
H
Yes, and also, if you just abort the job on Jenkins, the previous cluster will still remain on the CAPV testbed, so this trigger phrase can also clean the redundant clusters for this PR.
K
In that specific case, you know, my two cents: for the cloud jobs, obviously, I think we had a downstream trigger — there are two different Jenkins jobs, one setting up the cluster and running the tests, and the other Jenkins job is a cleanup job. So when the test job is finished, the cleanup job is always triggered,
K
No
matter
what
so
you
know,
we
could
probably
do
something
similar
to
make
sure
that
whenever
somebody
kills
a
running
job,
which
is
testing
in
you
know
the
on
the
capv,
and
you
know
it
triggers
the
capv
cleanup,
no
matter
at
which
point
you
know
it,
it
was
killed.
That's
something
that
we
can
also
look
into.
I
guess.
H
Yes, we can investigate whether, if a user aborts a job, a post section can be executed, like a cluster cleanup function.
E
And I have a small comment on the second improvement: the trigger phrase I see uses "ipv6-ds", but for the other dual-stack tests there's no "ds". I have no preference for the phrase itself; I just want all related jobs to use a unified naming style, to avoid people having to remember too many different phrases. Yeah, it could be with "ds" or without, but they should be unified, I think.
A
If
there
are
no
other
questions,
I
guess
so,
do
we
have
any
final
question
going
three
two
one
zero.
So,
yes,
you
can
stop
sharing
now.
A
All right, so those were two very interesting presentations; thanks a lot to both of you. We still have some time allocated for today's meeting, so if there is any other topic you would like to bring up for discussion, please go ahead. I will be waiting, as usual, 30 seconds for topic proposals.