From YouTube: Antrea Community Meeting 01/30/2023
Description
Antrea Community Meeting, January 30th 2023
A
Perfect. Good morning, good afternoon, good evening, and thanks for joining this instance of the Antrea community meeting. Depending on where you are in the world, it's either the 30th of January or the 31st of January, and today we have a presentation related to the Theia project. More specifically, we have Tushar presenting a throughput anomaly detector for Theia.

This will be a presentation that covers the implementation of this detector in the Theia solution, with the ClickHouse backend. So, well, I'd probably better stop talking now and let Tushar do all the talking. Tushar, please go ahead with your presentation.
B
All right. So Theia throughput anomaly detection is, as the name suggests, an anomaly detection technique: you are detecting abnormalities in the network. This is one of the features that we have implemented as part of network traffic analysis in Theia.

Why are there anomalies in the network? They could be caused by minor, simple reasons, or they could be caused by a threat in the network. So it is always better to have that analysis, to be told about these things beforehand, or at least to be able to determine that there is an anomaly, that something is going wrong in the network. This is where throughput anomaly detection comes into the picture.

We have used three algorithms for this. The first one is EWMA, the exponentially weighted moving average. It is used in time series analysis, and we use this model in order to figure out the throughput difference.
B
Basically, if there is a difference between the forecasted throughput and the actual throughput, and the difference is too big, we say that there is an anomaly. How do we detect that? This algorithm uses weighted points in order to compute the forecasted throughput: it gives higher weights to the newer points and lower weights to the older points, and that way it is able to figure it out. There is a whole derivation behind this equation, which we will not go through, because that would be too much. For now, just understand it like this: there is a throughput that should have been seen, and a throughput that we calculate.
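As a reference point, this is the standard EWMA recurrence (the general textbook form, not necessarily the exact parameterization Theia uses):

```latex
% EWMA forecast: S_t is the smoothed (forecast) throughput, x_t the
% observed throughput, and 0 < \alpha \le 1 the smoothing factor.
S_t = \alpha x_t + (1 - \alpha) S_{t-1}, \qquad S_0 = x_0
% Flag an anomaly when the observation deviates too far from the forecast:
\lvert x_t - S_{t-1} \rvert > \text{threshold}
```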
B
If the difference between the two is too big, that is where you flag an anomaly. The second one is ARIMA; I call it "a-rima", I'm not sure how it is pronounced, A-R-I-M-A, sorry if I'm pronouncing it wrong: the autoregressive integrated moving average model. Basically, it also builds on linear regression.
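For reference, the textbook ARIMA(p, d, q) form looks like this (a general sketch; Theia's exact parameter choices are not specified in the talk):

```latex
% y'_t is the throughput series after d-th order differencing;
% \phi_i are the AR coefficients, \theta_j the MA coefficients,
% \varepsilon_t the forecast errors.
y'_t = c + \sum_{i=1}^{p} \phi_i y'_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j} + \varepsilon_t
% A point is anomalous when |y_t - \hat{y}_t| is large relative to the
% model's forecast error.
```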
B
What it will do is figure out a linear pattern in the model of the throughputs, and then, if there is a throughput which is far away from what the regression model predicts, we can say that yeah, there was an anomaly there, because it shouldn't have happened like that. So that is the ARIMA model for you.

And then there is the DBSCAN model: density-based spatial clustering of applications with noise. What it means is this: let's say that you have some throughput values that form a high-density region in one area; there could be multiple such clusters of dense points in any network. If you find any point, or a bunch of points, that are far away from these clusters, then you can obviously say that yeah, there is an anomaly.
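In standard DBSCAN terms, applied here to throughput points:

```latex
% \varepsilon-neighborhood of a throughput point p:
N_\varepsilon(p) = \{\, q \mid \operatorname{dist}(p, q) \le \varepsilon \,\}
% p is a core point iff \lvert N_\varepsilon(p) \rvert \ge \text{minPts}.
% Points density-reachable from core points form clusters;
% everything left over is noise, i.e. an anomaly candidate.
```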
B
So that is the big picture. Now, how did we implement throughput anomaly detection, and what actually happens in the back end? What we do is this: we have created one CRD, a custom resource definition, in the Theia software. A custom resource definition basically allows you to have a custom resource deployed on your cluster.

Let's say that I have made a custom resource for, for example, a throughput anomaly detection Spark job; then you can have a pod that will run specifically throughput anomaly detection Spark jobs. If you want a different custom resource, you can do that as well; CRDs are basically there for you to use any custom resource that you want in the cluster.

So what happens once you have created the CRD and started an instance of that custom resource? You are telling Kubernetes: here is a definition, and if there is a resource that obeys this definition, please accept it, matching it by the name and schema of the custom resource definition. I'm sorry, I'm using so many words over here, but basically it is like that. So once that CRD is deployed in Kubernetes, you can include a new object as a custom resource of that type.
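As a rough sketch, a custom resource of this kind could look like the following (the apiVersion, kind, and field names here are inferred from the talk rather than copied from the Theia repository, so treat them as assumptions):

```shell
# Hypothetical example of creating a TAD custom resource from YAML.
# jobType is the only required spec field, per the talk.
kubectl apply -f - <<'EOF'
apiVersion: crd.theia.antrea.io/v1alpha1
kind: ThroughputAnomalyDetector
metadata:
  name: tad-arima-example
  namespace: flow-visibility
spec:
  jobType: ARIMA
EOF
```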
B
To create it, you can either go through a YAML file or you can use our Theia CLI. For this demo we are going to use the Theia CLI, but you can obviously use YAML files as well, and you can also curl the base API. Basically, whenever you create a custom resource definition, you also create a base API, and all the resources based on that custom resource definition are going to be served under that base API.
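For example, one way to hit that base API directly, without the CLI, is through the Kubernetes API server (the group/version path below is an assumption; `kubectl api-resources` shows the exact group the CRD registers):

```shell
# Illustrative only: list the custom resources served under the CRD's API group.
kubectl get --raw \
  "/apis/crd.theia.antrea.io/v1alpha1/namespaces/flow-visibility/throughputanomalydetectors"
```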
B
It supports different things: you can do reads, you can do lists, you can do status, and a lot of other things under that specific API. So after you have created the CRD, you create a new job. What happens when you create a new job? The place where you use the CLI is the Theia Manager: the Theia Manager receives the CLI request and sends it to a controller.

The controller that is responsible here is the TAD controller, the throughput anomaly detection controller. This controller uses all the arguments that we have passed to the Theia Manager, and it builds a specific command that will help us invoke a Spark job.
B
Now, how will a Spark job be invoked? After the controller, the request goes to the Spark Operator. The Spark Operator is basically the place where you have already made sure that the images for the new job are present. The Spark Operator will take this custom resource and create a new instance; in our case, we'll have the TAD instance. What I mean by TAD instance is a Spark driver pod; that Spark driver pod is then going to create the executor pods. The executor pods are basically responsible for any action that you do in your specific application.
B
So we have an application that is going to do some things in the back end, and those things are going to happen in the Spark executor pod. What happens in the back end is basically this: the Spark executor is going to run the script, and the script is going to take all the arguments that we have passed to it.

It will go and analyze the data that is already present in the ClickHouse tables in our Kubernetes cluster, and from this ClickHouse data it will read the flows table. The flows table has the source IPs, source ports, destination IPs, destination ports, which protocol we are using, and the throughputs that we have seen until now, that is, the actual observed throughputs.
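To peek at those columns yourself, something like this works (the pod name is a placeholder; the exact schema lives in the Theia flow-visibility deployment):

```shell
# Sketch: describe the flows table inside the ClickHouse pod.
# Replace <clickhouse-pod> with the pod listed in the flow-visibility namespace.
kubectl exec -it <clickhouse-pod> -n flow-visibility -- \
  clickhouse client --query "DESCRIBE TABLE flows"
```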
B
That data will be passed to a specific algorithm, the algorithm that you choose as a user, and that algorithm will calculate the throughputs that should have been there. Based on the throughputs that have been calculated and the ones that were actually observed, we are going to see if there is any difference. If the difference is too high, then we can say that yeah, there is an anomaly. If the difference is not too big, then we can say it's just normal variation, so we'll just go ahead with it; it is not an anomaly.

I know this is a lot that I just threw at you here, so to see it in action and to understand it more, let's move towards the demo part of it.
B
So, as I was saying, there should be a CRD, there should be a custom resource definition, and this is the custom resource definition. Theia throughput anomaly detectors is the name of the resource that we are going to create, and the API is going to be served under the crd.theia.antrea.io group. As we know, there are always some fields that are required in any new resource.

For us, the only required field is the spec, and the spec has only one required field, and that is the job type. What the job type is, we will come back to later; for now, there is a job type, there can be a start interval and an end interval, there can be executor instances, driver core requests, and so on. Let me give you a little brief on this.
B
These last five fields, executor instances, driver core request, driver memory, executor core request, and executor memory, are basically the arguments that you send to a Spark job, so that Spark can reserve specific memory and specific cores for the driver as well as for the executor pods. The start interval and end interval are basically from where to where you want to look for anomalies. And the job type basically means which algorithm you are going to use: it could be ARIMA, it could be EWMA, or it could be DBSCAN.

As we can see, we currently have five pods running in our flow-visibility namespace.
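For anyone following along, those pods can be listed with:

```shell
# The Theia flow visibility components run in the flow-visibility namespace.
kubectl get pods -n flow-visibility
```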
B
What we can do now is go through the Theia Manager; our Theia CLI, sorry, is basically a normal CLI like the ones we see everywhere.
B
It has different commands that you can use: clickhouse, completion, help, policy recommendation, support bundle, and throughput anomaly detection, and the one that we are interested in is the last one. So let's go inside it, throughput-anomaly-detection, and let's look at its help. As you go into the help, it says the algorithm should be EWMA, ARIMA, or DBSCAN, and it also shows the aliases: you can use either throughput-anomaly-detection or tad.
B
So,
as
you
just
saw
me,
like
writing
a
whole
thing.
Instead
of
this,
we
can
just
write
that,
like
Tia
Tad
help,
so
that
also
works
now
the
commands
that
are
available
with
us
is
delete
list
retrieve,
run
and
Status.
The
one
that
we
want
to
do
right
now
is
run.
So,
let's
try
to
we'll
we'll
get
back
we'll
get
back
to
all
the
other
commands
as
well,
but
for
now
we
are
going
to
use
the
Run
command,
so
let's
say
Tia
that
run
and
let's
go
into
the
help
of
that.
B
As we go into the help of run, you see we have the different options that we saw in the CRD: driver core request, driver memory, end time, executor core request, executor instances, executor memory, start time, and then the type, which is the type of the algorithm that you are going to use. For this demo we are going to run with the type ARIMA, and let's give it a driver memory of 1GB.
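Reconstructed from the spoken demo, the invocation looks roughly like this (flag spellings are an assumption; `theia tad run --help` is the authoritative reference):

```shell
# Start a throughput anomaly detection Spark job using the ARIMA algorithm
# with 1GB of driver memory. "tad" is the alias mentioned in the talk;
# the algorithm flag is referred to as the "type" in the demo.
theia tad run --algo ARIMA --driver-memory 1G
```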
B
So basically, this is the command that you have to pass on the CLI side. Once you do that, you see "successfully started throughput anomaly detection job", with an ARIMA job name. So let's see which instance it has created. As we can see over here, we have a new Spark driver that has started running. So let's try and check its status; first, to check the status, we need to know the ID that it has.
B
We
can
obviously
fetch
ID
from
here,
but
let's
see
if
we
can
have
it
from
from
our
list
as
well
so
theatad
list
and
if
we
say
so,
it
says
like
there
is
a
name
that
is
this
and
then
there
is
a
status
that
is
running
right.
So
now
I
understand
like
this
could
be
an
ID
but
and
do
not
get
confused
due
to
name
in
this
I'll
get
I'll
show
you
why
I'm
saying
this
as
name.
B
Let
me
actually
show
it
quickly
to
you
guys
like
why
I'm
saying
this
as
a
name.
So,
as
you
remember,
I
told
you
there
will
be
a
base
API
So.
Currently
we
have
the
space
API
under
the
anomaly
detected
CR
entry
Ohio
with
the
resource
name
as
throughput
and
omelette
detector.
So
let's
try
and
get
this
I
guess.
I
do
not
have
the
token
here
here.
B
Okay,
we'll
get
back
to
that
thing
later.
Let's
try
to
do
it
over
here,
because
the
executor
board
is
started
running
so
they
are.
The
driver
has
already
included
an
Executor
port
and
the
executive
Port
is
running
right
now
and
it
is
doing
its
things.
Let's
see
what
is
the
status
of
this
job
so
to
do
that
we
have
status.
I
can
show
you
help
as
well.
So
it
says
like
in
help
you
can
just
show
the
name.
So
name
is
basically
like
you
can
either
just
write
the
ID
of
that.
B
So, for that reason, let's first run list, and the status will be for this specific ID from the list. If we go into it, it says 50% has been completed: that is the status of this job right now. Currently it is running, and it is showing the stages that it has completed so far.
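Approximately, the sequence being demoed is:

```shell
# List TAD jobs (name/ID and state), then query a specific job's progress.
theia tad list
theia tad status <job-name>   # use the name column printed by "list"
```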
B
It should take around two minutes. And as I told you on the slide, once the executor pod has executed all the functions and has done everything, it is going to populate a table inside the ClickHouse database, and the table name will be tadetector. So, on the side, we can go over to our ClickHouse.
B
We
can
go
towards
our
clickhouse
and
once
you
are
inside
clickhouse,
let's
see
if
it
has
the
tables,
and
we
can
see
there
is
a
TA
detector,
NTA
detector,
local.
Just
to
avoid
any
confusion.
These
tables
are
made
as
a
part
of
clickhouse
deployment
and
it
is
not
responsible
because
of
the
crd
that
you
create,
so
there
could
be
multiple
other
tables
that
has
nothing
to
do
with
the
crd
and
still
be
present
inside
the
clickhouse
database.
B
So,
as
we
see
the
driver
is
also
completed,
let's
see
the
stage
she
says
like
the
status
of
this
anomaly.
Detection
job
is
completed
so,
okay,
so,
let's
see
what
it
has
done.
B
Let's do a select all from, and the table name is tadetector, and you can see it has created 21 rows. Now, I'll show you what this data is in a better way later, but just to show you how many actual flows were present, among which we have seen the anomalies in these 21 rows: select all from flows, which is the input table that we use currently.
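The two queries from the demo, roughly (table names as shown on screen):

```shell
# Run inside the ClickHouse client of the flow-visibility deployment.
clickhouse client --query "SELECT count() FROM tadetector"  # result rows (21 in the demo)
clickhouse client --query "SELECT count() FROM flows"       # input rows (3000 in the demo)
```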
B
So,
as
you
can
see,
there
are
3000
rows
in
this
and
we
have
only
stored
the
data
for
the
you
know
in
which
there
is
an
anomaly.
So
we
are,
we
are
not
going
to
use
the
whole
data
store.
We
and
we
are
not
going
to
duplicate
the
storage
in
this.
So
as
like
just
to
keep
the
memory
short,
so
we
are
only
going
to
use
the
we
just
use
the
places
where
there
is
an
anomaly
and
now
to
see
it
in
a
better
shape.
B
Let's go to theia tad and back to its help. We have already seen list, we have already seen status; now we'll do retrieve. Retrieve basically gives you the result of what you have run, so instead of status, I'll just write retrieve over here.
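That is, something along these lines:

```shell
# Fetch the detection results (the anomalous flows) for a job.
theia tad retrieve <job-name>
```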
B
It prints a JSON output. You can see this is the ID, the ID that we were just using; this is the source IP, the source port, the destination IP, the destination port, and the place where we saw this anomaly: the throughput we calculated was a little different from what we observed, and there are two such flows from this port.
B
So that is one, and then you can see the next ones; if I have to guess, actually, not guess, the truth is that these are going to be the 21 rows. We are working on whether to write this data as JSON or whether we want to put it out as a table, just to make it easier for the user to go through it and figure out what is going on.
B
So this is basically the thing, and to show that, let me try to get the command; if I'm not wrong, I just missed the token, so let me try to find the token for now.
B
Now
see
yeah,
so
we
can
see
the
this
crd
has
been
registered
and
we
can
see
the
and
the
the
ID
is
over
here.
So
the
reason
why
we
use
it
as
a
name
is
because
in
the
metadata
we
have
created
it
as
a
name.
So
that's
why
we
keep
it
as
a
name
over
here,
so
that
is
pretty
much
it
and
the
one
only
only
command
that
is
left
is
now
the
delete
one.
B
So
I
can
show
you
guys,
like
Tia,
dad
delete
and
the
same
thing
just
keep
this
ID
and
it
has
been
deleted
now,
if
we
see
inside
we'll,
let's
go
and
try
in
the
detector,
so
the
table
has
been
cleared
if
there
would
have
been
any
driver.
B
That's
due
to
Port
running
that
is
also
cleared,
and
now,
let's
try
and
fetch
this
data
again
and
as
we
can
see
that
there
is
a
kind,
but
there
is
no
matter
there
is
no
items
that
are
present,
because
there
is
no
other
detection
job
going
on
just
to
verify
that
completely.
Let's
do
this
throughput
and
only
detection
list-
and
you
can
see
there
is
nothing
inside
this.
So
this
is
basically
throughput
and
ombre
Direction.
Now
what
we?
What
we
do?
What
do
we
do
actually
with
this
data
is
over
here?
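The cleanup steps shown, approximately:

```shell
# Delete the TAD job; this also clears the tadetector results table and any
# remaining Spark driver/executor pods. Then verify nothing is left.
theia tad delete <job-name>
theia tad list
```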
B
Now, what do we actually do with this data? It is over here. As you can see, we have the present throughput, the one that we already had, and then the calculated EWMA throughput and the ARIMA throughput. We see there are multiple spikes in this data, when we visualize the table that we just got from the flows together with the values that we calculated with the Spark job.
B
This
data
shows
us
that
there
are
multiple
bytes
one,
two
three
four,
five
six
in
particular,
but
we
can
see
that
the
arima
or
the
ewma
reaches
the
first
spikes
like
they're.
Quite
close,
they
are
not
that
far
away.
However,
when
we
move
towards
the
Ender
and
the
two
spikes
they
are
quite
far
so
that
that
is
where
we
see.
We
say
that
yeah
there
is
anomaly,
so
not
every
every
Spike
could
be
an
anomaly
but
yeah
some
spikes
could
be
an
anomaly,
and
this
thing
is
what
we
are
calculating.
B
We are figuring out whether it is an anomaly or not. So this is basically the presentation. I've taken the references from Yongming's paper and also from Subramanian's internship project. That is pretty much it from my side. Thank you, and let me know if you guys have any questions.
E
So, actually, this anomaly detection: what is the purpose of it? I mean, does it work across a single cluster, or can it work across multiple clusters?
B
Yeah, so basically the purpose of having throughput anomaly detection is just to figure out if there is an anomaly. As I said, it could be either because of some threat, or it could be just because something went wrong in the network, or something like that; but we should be aware that there is an anomaly going on in our network, and based on those anomalies we can figure out if it is a threat. So it is basically a part of network traffic analysis.
A
Your
question
is
operates
on
the
data
that
are
stored
in
the
Tia
flow
of
database
and
at
the
moment
these
are
the
data
which
are
sent
from
the
anterior
flow
aggregator,
which
pertain
only
a
single
cluster.
So
we
don't
have.
We
don't
run
the
analysis
across
data
for
multiple
clusters,
yet.
A
That's
right,
that's
right!
If
you
had,
if
we
are,
if
you
had
a
data
pertaining
a
network
flows
for
multiple
clusters,
then
you
know
it
would
also
identify
throughput
anomalies
in
traffic
across
multiple
clusters,
because
at
the
end
of
the
day,
the
throughput
anomaly
detector
is
an
algorithm
that
analyzes
data
in
a
in
a
network
flow,
and
you
know
it
doesn't
care
whether
certain
destinations
are
always
in
in
the
same
cluster
or
not.
E
So is it also database-based data, or is it just the network flows?
A
The
network
flows
are
stored
in
the
clickhouse
database.
That
to
share
was
mentioning.
Basically,
this
is
part
of
the
generic
of
T
architecture,
where
you
have
as
an
Indian
entry
agent.
We
have
the
flow
exporter
that
captures
Network
flows
and
sends
them
to
a
centralized
aggregator.
The
aggregator
does
correlation
between
flows,
to
match
sources
and
destination
and
then
sends
them
either
to
clickhouse
for
the
in-house
installation
or
to
snowflake
when
people
are
using
snowflake
as
a
backend
and
and
yeah,
and
then
that's
once
the
once.
The
data
are
in
the
click
of
database.
A
This
spark
job
analyzes
the
data
in
the
database.
That's
so,
let's
say
that
the
the
part
about
running
the
algorithm
and
analyzing
the
data
is
sort
of
a
different
pipeline
from
supplying
the
data
in
the
database.
A
Okay, good. Yeah, please, any more questions?
F
Is it possible to have it run in the background, instead of triggering jobs manually?
B
Yeah, so about this data: as we know, our cluster does not have too many anomalies, so we just took the throughputs from our cluster and then explicitly injected anomalies into them, just to see if we are able to accurately figure out whether there was an anomaly or not. So this data is basically input data taken from the cluster, with some anomalies injected, and then run through the EWMA and the ARIMA algorithms.
F
Well, if the user thinks that too many events are generated and that the detection is too sensitive, is there a way to tweak some parameters so that fewer events are generated? Basically, some things that may have been considered an anomaly before would not be considered an anomaly anymore with the new parameters, but you would still get events for the most extreme cases.
B
Yeah, I guess Yongming would be better placed to answer this, as this was basically based on his paper. I guess, since it is basically ML, it is about training and getting the resulting data, so that could be possible; but currently we are not working on that. Yongming can provide a better answer on it.
C
Yeah, definitely, it's just related to the arguments you pass to the algorithm. It could be tuned to be more sensitive, but if the user finds too many false positives or too many events, we could pass some arguments to make the algorithm less sensitive, or we could set a higher threshold to generate fewer events.
A
Hello, just a quick question from me regarding the efficiency, sorry, the effectiveness of those algorithms: is there a minimum duration of flows for the algorithms, to make sure that they can make an accurate detection?
B
Currently,
it
is
like
so
for
arima.
We
cannot
take
any
data
if
the
data
difference
between
the
start
and
the
stop
is
less
than
three
seconds.
A
And
whereas
for
the
others.
B
Well,
for
the
others
like
DB
scan,
so
that
is
like
cluster
based
thing,
so
that
does
not
include
any
time
restriction
and
I
I
guess
the
same
is
for
the
ewma,
so
that
does
not
have
any,
but
arima
has
like
three
seconds,
but
that
can
also
be
tuned.
We
just
used
L3
seconds,
but
that
can
also
be
tuned.
C
So I added that at some point; it's not exactly about the three seconds, it's more that a lack of data points will make the algorithm less accurate. For ARIMA at least, we need three data points for each connection, so it will depend on the interval of the Flow Aggregator: if the interval is 60 seconds, that means we need three minutes to have enough data points for each connection. So yeah, I just wanted to correct that.
A
Okay, perfect. It seems that that's all for this topic, so many thanks again to Tushar for this presentation and this demo. We hope to see the code merging into the Theia code base as soon as possible; that will be a great addition to the NTA capabilities of the Theia project. And I think that's all on this topic. Do we have any other topic that you would like to bring up for discussion today? So let's go now to open discussion, anything that you would like to discuss.
A
Okay, five, four, three, two, one, and that's it. So I would like to thank, as usual, everyone for attending. Thanks again to Tushar for your presentation, and we will have our next meeting on February the 14th. So that's all for today, and I wish everyone a good night, a good day, or a good afternoon. Thanks for joining again, and bye.