From YouTube: SIG - Performance and scale 2021-07-01
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.8taxjc2uv4bg
A
Okay, all right. Everyone, welcome to SIG Scale. If you need to add yourself as an attendee, please do so in the meeting doc; the link is in the meeting chat.
A
Okay, so today I wanted to continue with some of the discussions we had last week, take some of the things we've talked about in the past, and talk a little more about the implementation: how we can actually accomplish some of these things. And, like I mentioned last week, we're going to move to weekly.
A
I also said this on the mailing list, and I'm sure you're all aware because you're here: we'll be doing weekly meetings on Thursday. Okay, so the first item on the list. This is the PR that Marcelo has been working on. Marcelo, do you want to talk a little bit about any of the progress you've been able to make on this one since last Thursday?
B
All right. So, unfortunately, I don't have too much of an update. I was busy with some other activities, and I'm updating the PR today. It was actually a little bit messy to update, because there was no way to... I just realized today that the master branch was changed to main, so things were not synchronized well. Anyway, I'm going to update it.
B
The idea is to remove all the parts that are collecting those other metrics and keep only the part that actually creates a bunch of VMIs, measures the time from creation to the running phase, and reports that.
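A minimal sketch of that kind of measurement, assuming a dynamic client-go client and that the VMI reports status.phase; the resource coordinates and the Running phase follow the kubevirt.io/v1 API, everything else is illustrative:

```go
package perf

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var vmiGVR = schema.GroupVersionResource{
	Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances",
}

// timeToRunning watches a single VMI and returns how long it took from
// its creation until status.phase reported Running.
func timeToRunning(ctx context.Context, c dynamic.Interface, ns, name string) (time.Duration, error) {
	w, err := c.Resource(vmiGVR).Namespace(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		vmi, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		phase, _, _ := unstructured.NestedString(vmi.Object, "status", "phase")
		if phase == "Running" {
			return time.Since(vmi.GetCreationTimestamp().Time), nil
		}
	}
	return 0, fmt.Errorf("watch closed before %s/%s reached Running", ns, name)
}
```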
A
Okay, sounds good. All right, so it sounds like we don't need to talk any more about this one; I think it's understood. Okay, so let's go to the second bullet point. This is the issue that I created to measure the different performance metrics. We've already had some progress here: David did a nice job on this, and we already have that merged.
A
I want to talk about some of the remaining bullet points and see if we can discuss how to implement some of these, or whether you want to change, add, or remove any of them. So I copied the bullet points in here. The first one: the work queue length. I started looking at this, and I already started to write some code for it. It seems pretty straightforward.
A
So we have some sort of count, and then we decrement it every time an item completes. The idea is that this will give us, at any given time, the number of items in the queue, or how long the queue is, and we can monitor that in Prometheus.
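For reference, a minimal sketch of what that counter could look like, assuming a Go controller built on client-go's work queue and the Prometheus client library; the metric name and the hook points are illustrative, and the gauge is only approximate because Add deduplicates keys that are already queued:

```go
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// Illustrative gauge tracking how many keys are currently sitting in the queue.
var queueLength = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "kubevirt_vmi_controller_queue_length",
	Help: "Number of keys currently waiting in the controller work queue.",
})

func init() { prometheus.MustRegister(queueLength) }

// enqueue increments the gauge whenever a key is added.
func enqueue(q workqueue.RateLimitingInterface, key string) {
	q.Add(key)
	queueLength.Inc()
}

// processNextItem decrements the gauge once the key has been pulled off.
func processNextItem(q workqueue.RateLimitingInterface, reconcile func(key string) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	queueLength.Dec()
	defer q.Done(item)

	if err := reconcile(item.(string)); err != nil {
		// A re-queue via AddRateLimited would eventually need the same increment.
		q.AddRateLimited(item)
		return true
	}
	q.Forget(item)
	return true
}
```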
A
What do people think about that? Is that the right approach? What do people think?
C
It sounds good; I mean, you'd pretty clearly see how much is queued. Not entirely sure, but if things really go south, you will see it in Prometheus at least.
A
Yeah, so what we've seen from our measurements, because we basically just did some experiments a while ago where we printed this to standard output and scraped the logs for it: in general, what I expect us to see is a count that goes high, stays high for a while, slowly goes down, and then very quickly descends.
A
So I'm curious to see how it shows up and whether we see the same behavior, because I guess what we would expect is a parabola, where it quickly increases and then quickly decreases, or even almost no queue at all, because the input speed equals the output speed. So we'll see, but that will give us some good information.
A
Yeah, so that's what I was thinking for number two, event callbacks and queue time, or maybe we can clarify this. So, Gavin, why don't you talk about what you're thinking?
D
I was thinking the service time of the queue: from when the event is delivered into the work queue to when we're done.
D
Yeah, just the queue: the queue service function being called, I guess, for any event.
A
Okay, so any call whatsoever. Wait, so this would be like when we do a re-queue, something like that. I'm trying to find the diagram.
A
So, for example, when we do, let's say, right here:
A
when we see a status change, we update the status, then we add ourselves back to the queue, and then we go through this loop again after we get picked up.
D
I think at the entry point to the work queue, whatever it's called, execute.
A
Okay, so we start when execute is called and we stop when the key is pulled off. So that's the window we'd be timing.
D
Yes, and then maintain an average across all keys. So time each individual key's execution and keep an average, outliers and all.
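A rough sketch of timing each key's execution as it's pulled off the queue, again assuming client-go's work queue and the Prometheus client; the histogram name and buckets are illustrative, and Prometheus gives the average via the _sum and _count series:

```go
package controller

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// Illustrative histogram of the per-key service time (queue pop to Done).
var workDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kubevirt_vmi_controller_work_duration_seconds",
	Help:    "Time spent servicing a single key popped from the work queue.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
})

func init() { prometheus.MustRegister(workDuration) }

func execute(q workqueue.RateLimitingInterface, reconcile func(key string) error) bool {
	item, shutdown := q.Get()
	if shutdown {
		return false
	}
	defer q.Done(item)

	start := time.Now()
	err := reconcile(item.(string))
	// Record how long this individual key took, outliers included.
	workDuration.Observe(time.Since(start).Seconds())

	if err != nil {
		q.AddRateLimited(item)
	} else {
		q.Forget(item)
	}
	return true
}
```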
A
Maybe I should change the title here, because I'm calling this "callbacks and queue time". So this could be, for example... okay, what this doesn't capture is the time spent in the...
D
As I said, that would be the service time, and then what you're referring to, the queuing time before you're picked to run, is also interesting, because we saw that with the rate limiters, where things were sitting in the queue for a long time because of the rate limiter. So we'd want to measure that too, to pick it up.
C
Not sure what you mean. You mean the error rate limiting or something?
D
The Kubernetes client, the default client, has a rate limiter, and you effectively spend a lot of time in the queue before you come out of it again. So to make that visible, I think you need to measure the time spent in the queue.
C
Yeah, my only concern is that, from that perspective, it's not clear whether you spend that time because of such a rate limiting thing, or whether it's enqueued there for a long time because your processing loop takes a long time.
A
Okay, so how should we tackle this? Because I could see this going a couple of ways. For instance, I could literally record the moment when a key gets added to the queue; that would be, I think, when we do this re-queue call or enqueue call, whatever it is.
A
Maybe we record the timestamp there.
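A back-of-the-envelope version of the "record a timestamp at enqueue, diff it when the key is picked up" idea; everything here is illustrative, and this is roughly what client-go's built-in queue latency metric already does:

```go
package controller

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

var queueWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "kubevirt_vmi_controller_queue_wait_seconds",
	Help: "Time a key spends in the work queue before it is picked up.",
})

var (
	mu         sync.Mutex
	enqueuedAt = map[string]time.Time{}
)

func init() { prometheus.MustRegister(queueWait) }

// enqueue remembers when the key was first added.
func enqueue(q workqueue.RateLimitingInterface, key string) {
	mu.Lock()
	if _, ok := enqueuedAt[key]; !ok {
		enqueuedAt[key] = time.Now()
	}
	mu.Unlock()
	q.Add(key)
}

// dequeue records how long the key sat in the queue before being picked up.
func dequeue(q workqueue.RateLimitingInterface) (string, bool) {
	item, shutdown := q.Get()
	if shutdown {
		return "", false
	}
	key := item.(string)
	mu.Lock()
	if t, ok := enqueuedAt[key]; ok {
		queueWait.Observe(time.Since(t).Seconds())
		delete(enqueuedAt, key)
	}
	mu.Unlock()
	return key, true
}
```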
C
The enqueue-after thing, yeah. If the object changes in the meantime it would still be processed, so it's not a fixed delay; it's more like, at worst, look at it again after this time.
C
Okay, I didn't look too much into the rate limiter, to be honest. I can, but I found it pretty straightforward to measure the actual processing time, and not how long it stays in the queue.
A
Well, that's... we want that, that's kind of what we want, but we also want to know, for an individual event, if we just updated the status of a VMI, how long from the moment we did the update to when it gets processed. That's...
C
Like our event processing... for me, I mean, you showed us some graphs and so on, but it was always a little bit unclear how you were getting this data, what exactly you were measuring, and where exactly the time was spent.
C
So I think, with the PR from David, we now have a much better way of looking at the VM startup times and phase transitions, and I think the direct next step... I mean, a lot of the others here make sense, and the zeroth one on the list makes sense, but at least for me, as a first structured data-gathering step, in addition I would really look into how long the controllers need to process the objects in the controller loop.
C
That's pretty much it, and then you can already narrow it down a lot, to whether the issues are basically in the business logic rather than in the infrastructure. And when you still can't find anything there, I guess the rate limiting and the queue length may become more interesting. But, I mean, you saw an increase in the queue count, for instance, and it was never clear to me whether we actually spend the time in the processing loop or whether things are really stuck in the queue for such a long time for non-obvious reasons.
C
Yeah, all I mean is: let's say you have five threads for your controller.
C
Then you implement this gauge and you measure the time in the queue, and then there are, let's say, I don't know, five VMs which for whatever reason are stuck in the processing loop. Then all the events will just stay there for ages before anything moves, and you will not really see anything with those metrics, whereas with the other approach you would pretty clearly see that suddenly a few objects take an insanely long time until they finally leave the processing part.
A
Yeah, like Kevin said, there's some rate limiting happening, which is fine; we can get around the rate limiter. But more than that, I'd like to understand why it's getting rate limited in the first place, why it has to do that. That's what I want to capture with these, because something is going on with the keys that are in the queue. Makes sense.
D
Yeah, when you create your Kubernetes client (I forget what it's called, the create-something-from-config call), if you just accept the defaults, you get something like 10 ops per second with a burst of 10, or something like that.
D
But you can specify a higher burst and a higher average rate, and when we increased that, things improved substantially.
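A small sketch of what bumping those limits looks like with client-go; the actual numbers are a separate discussion, 100/200 below are placeholders, and the stock defaults are on the order of 5 QPS with a burst of 10:

```go
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newClientset(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// client-go's default client-side rate limiter is quite conservative;
	// requests above the limit queue up locally before they are ever sent.
	// Raising QPS and Burst lets the controller talk to the API server faster.
	cfg.QPS = 100
	cfg.Burst = 200
	return kubernetes.NewForConfig(cfg)
}
```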
A
Yeah, I mean, I think we'll get something from this either way. Even if we end up proving that it's a non-factor, that's fine; we can get rid of it. I think it would be good to know, just to get some insight into what's going on in this area. So, okay.
B
I think the work queue has some metrics that can be exposed, like the queue length and the queue timing. I just sent some links in the chat here in Zoom.
B
So I'm just wondering if we need to... you know, the depth, yeah. This is some metric; the other link actually shows it a little bit better, on this running process. So the queue depth, the queue latency, the work duration: I think those kinds of metrics.
B
Maybe we don't need to re-implement the thing; we can just expose them. And then the processing time of the key itself, as Roman mentioned, that's actually the one that's missing here, it doesn't have it, so it might be very interesting to spend more time on that one and expose those here.
A
Yeah, so we need to expose them... what do we need to do? I haven't looked at this, so do we just need to report them, are they already there, or do we just need to wire them up to whatever KubeVirt exports?
A
Okay, that looks great. So then we can look at wiring these up. I can take that one; I'm already sort of looking at this area anyway, so I can look at wiring them up. That's good. Okay, cool, yeah, thanks!
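"Wiring them up" with client-go typically means registering a workqueue.MetricsProvider before any queue is created. A rough sketch, with illustrative Prometheus metric names (this is not KubeVirt's actual code):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// Helpers that create and register one Prometheus metric per named queue.
func gauge(name, queue string) prometheus.Gauge {
	g := prometheus.NewGauge(prometheus.GaugeOpts{Name: name, ConstLabels: prometheus.Labels{"queue": queue}})
	prometheus.MustRegister(g)
	return g
}

func counter(name, queue string) prometheus.Counter {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: name, ConstLabels: prometheus.Labels{"queue": queue}})
	prometheus.MustRegister(c)
	return c
}

func histogram(name, queue string) prometheus.Histogram {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{Name: name, ConstLabels: prometheus.Labels{"queue": queue}})
	prometheus.MustRegister(h)
	return h
}

// provider maps the work queue's built-in instrumentation points
// (depth, adds, queue latency, work duration, ...) onto Prometheus metrics.
type provider struct{}

func (provider) NewDepthMetric(q string) workqueue.GaugeMetric {
	return gauge("workqueue_depth", q)
}
func (provider) NewAddsMetric(q string) workqueue.CounterMetric {
	return counter("workqueue_adds_total", q)
}
func (provider) NewLatencyMetric(q string) workqueue.HistogramMetric {
	return histogram("workqueue_queue_duration_seconds", q)
}
func (provider) NewWorkDurationMetric(q string) workqueue.HistogramMetric {
	return histogram("workqueue_work_duration_seconds", q)
}
func (provider) NewUnfinishedWorkSecondsMetric(q string) workqueue.SettableGaugeMetric {
	return gauge("workqueue_unfinished_work_seconds", q)
}
func (provider) NewLongestRunningProcessorSecondsMetric(q string) workqueue.SettableGaugeMetric {
	return gauge("workqueue_longest_running_processor_seconds", q)
}
func (provider) NewRetriesMetric(q string) workqueue.CounterMetric {
	return counter("workqueue_retries_total", q)
}

func init() {
	// Must run before any work queue is constructed.
	workqueue.SetProvider(provider{})
}
```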
A
Okay, all right, we'll skip past these work queue ones, since I think that's covered. Let's talk about some others: latency between the virt-launcher pod and the VMI object. Oh, this is for the virt-launcher pod being ready and the VMI object.
C
Just to say, this one we monitor already, I think; let me check, but I believe we have it already.
A
Okay, that's good; less work to do, all right, good. We just need to find it then; it's just information, so we just wire it into a dashboard. Okay, great. So the next one: the latency between the virt-launcher pod being ready and the VMI object. So what about this one?
A
How could we do this? I think, to do this, we need a timestamp for when the virt-launcher pod becomes ready, and then we need to know when the VMI object goes to Running. We might already have this: we have the VMI object going to Running now with Dave's changes, so we just need a diff between the two. I think we must have this.
D
Is that not the Scheduled phase? The launcher pod being ready, is that not when we update the VMI to Scheduled?
A
Yeah, so we go from Scheduling... okay, so it needs to match the VMI Scheduled phase and virt-launcher ready. Yeah, you're right, we need those. So we have this one now with what Dave has changed, and this other one I don't know if we have. Does this get reported on the object?
A
Yeah, well, that's what I want to measure, because there is a gap here, and like I was talking about earlier, it's noticeable: there is actually time between these two, and for none of the reasons we mentioned before.
D
But I'm not sure if this is the one you mean, Ryan. I've seen, even today I saw VMIs where the pod was ready and running, and it took about a minute before the VMI was in the Scheduled status, and it took about a minute before the VMI got updated to the Running phase. And I don't know where that time went. Virt-handler does that update, but it waits for the domain to get going and so on.
D
But I don't know if that can take a minute.
C
Yeah, so once virt-launcher is ready, the VMI goes to Scheduled and, in addition, a label is added to the VMI which makes it visible to virt-handler. From that point onward, virt-handler can see it in its own queue and will do its part of the work and some additional hardware setup.
A
Okay, so there's our diff. We need this timestamp; we have this one from Dave's changes; we need this other one, but we could get it even if we don't have it already. This would be easy to get, because it's the moment we do that action, when the controller does the handoff, we could record it.
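A small sketch of that diff, assuming we read the launcher pod's Ready condition from its status and the VMI-side timestamp (the Scheduled transition from Dave's changes) is handed in; the function names are illustrative:

```go
package perf

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// launcherReadyTime returns when the virt-launcher pod's Ready condition
// last flipped to true.
func launcherReadyTime(pod *corev1.Pod) (metav1.Time, bool) {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime, true
		}
	}
	return metav1.Time{}, false
}

// readyToScheduledLatency is the gap discussed here: how long after the
// launcher pod became ready the VMI reached the Scheduled phase.
func readyToScheduledLatency(podReady, vmiScheduled metav1.Time) time.Duration {
	return vmiScheduled.Time.Sub(podReady.Time)
}
```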
C
You said yourself that when you increase the rate limits, pretty much all of the issues disappear, so I guess I would not focus on too many small details here. It seems to make sense to finally have insight into the queue, to also see whether the processing rates are fast and whether things are fine there, and to try to capture that first.
A
Yeah, I guess so. The point is that if QPS sort of is the solution... I was just kind of curious, because I don't know how that's going to scale; that's the concern. I'm wondering if we can find some insight into why it is that we're getting rate limited.
A
No, no, we're not going to find that with this. The idea with this one is that if, say, for example, a change was introduced that somehow caused an issue between these two things, so that we had some sort of delay, we would know it. We'd know that this change actually slowed the processing of the VMI relative to the virt-launcher. It's just another measurable that we can look at.
A
Yeah, okay, so then I think on this one we just need the virt-launcher ready time, and then we can take the diff from the Scheduled phase, and then we know, and we report it. Okay, I think that one makes sense; it's pretty straightforward. How about this one: latency between volume creation and the virt-launcher pod?
A
Volume creation in this case, what does it mean? So this would be the PVC, and does this actually make sense? So, with the PVC, the difference between the time it takes to create the PVC from your dynamic provisioner and whenever the launcher pod is actually going... but that's not really tied to the VMI; yeah, it's also external, okay.
A
External, yeah, okay, we'll skip this one. Okay, device plugin latency. This also sounds external, unless there's a way we can measure it. Is there a way we can measure how long it would take for something to attach?
C
That would be more in the kubelet, or... it's difficult. I mean, it also depends on which device plugins you're talking about; in KubeVirt we have multiple different kinds of device plugins, let me put it that way.
A
Okay, all right. Maybe that's something that could be implemented on the external plugin side, then, for both of these. Okay: Kubernetes API call latency and count, made by us.
B
David, there is a metric that the REST client already exposes for the Kubernetes calls and their latency.
C
Yeah, that would definitely work for the tests. In addition, David looked into wrapping the Kubernetes client with an instrumenter.
C
Okay, so you can, you know... what is it called, the HTTP... I forgot the name.
A
Yeah, I think it was also mentioned... was it Jaeger, the project that did some analysis here? Does anyone remember?
E
Jaeger itself does visualization, but they have their own protocol, or use OpenTelemetry, for the actual tracing inside the code. Like you said, spans: this is reconciliation, and you see how long reconciliation takes, and you can do another span inside that, saying this is building a template for whatever. It's less about the API calls specifically; you can instrument specific pieces of your code, but it can also transit across process boundaries: it can pass the context on through HTTP to other clients that also implement it.
B
Just to mention: there was, I would say last year, some big discussion in Kubernetes about enabling Jaeger or not, because Jaeger is for RPC. It's not really for asynchronous calls, and things can get nasty in Kubernetes, for example when you have...
B
I think it doesn't follow the same path, for example when PVCs are created externally, and then you don't have the root of the span going, you know, all the way to the creation. I'm just saying that maybe it's available, but we can also face some big challenges with Jaeger.
E
Yeah, the discussion in Kubernetes was: there is a KEP for adding tracing to the API server and the general control plane, and what's happening now, I think, is that it's slowly tracing only synchronous API calls; it will not add tracing to things like operator reconciliation loops. But that is for the Kubernetes part, because the challenge was that they wanted to trace everything that happens to a resource, and to do that they would have to save the trace context somehow on the resource, because asynchronous stuff happens.
E
For our case, though, adding tracing to our reconciliation loops and our API calls would be no problem, I would say, because we just output spans for every reconciliation loop. We can't link it to the API calls themselves, but we can still trace what's happening inside an operator and annotate it with the respective information, like metadata and such, just to see what's going on in the operator. It's not linked to the asynchronous stuff, and that shouldn't be a problem.
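A minimal sketch of what a span per reconciliation pass could look like with OpenTelemetry (an exporter such as Jaeger or OTLP would be configured once at process start, not shown here); the tracer and span names are illustrative:

```go
package controller

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("kubevirt.io/virt-controller")

func reconcileVMI(ctx context.Context, key string) error {
	// One span per reconciliation pass; it is not linked to the asynchronous
	// API machinery, it just shows where the time inside the loop goes.
	ctx, span := tracer.Start(ctx, "vmi-reconcile")
	defer span.End()

	// Child spans can wrap the interesting sub-steps.
	_, tmplSpan := tracer.Start(ctx, "render-launcher-template")
	// ... build the virt-launcher pod template ...
	tmplSpan.End()

	// ... rest of the reconcile logic ...
	return nil
}
```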
A
I wonder if there's some prior art here, because, like you were saying about the queue stuff, I wonder if there's anything about operators already instrumented with Jaeger or something that we can look at.
E
We just added two lines of code for OpenTracing where we were curious about more information, and the exporting was done by, in our case, Google Cloud, but you can do that with Jaeger or anything else. Okay, and...
E
I can look at the tracing part if I have time. I don't know when that will be, but yeah, I wouldn't mind seeing if we can add some tracing to the operator.
E
But I think Roman mentioned David's already looking at the round trip, or at least tracking the API calls, so that's something kind of different, I would say.
C
Yeah, this is the one thing, so that should be easy to extend so that you can also see, okay, we update the node a lot, or we update the clients a lot. You basically just look at the request: you just split the URL, and take it apart from there.
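One way to get that without waiting on anything upstream is to wrap the client's transport, roughly like this sketch; the metric name and labels are illustrative, and in practice you would bucket the URL path by resource rather than keep it raw, to avoid high-cardinality labels:

```go
package client

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/rest"
)

var apiCallDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "kubevirt_rest_client_request_duration_seconds",
	Help: "Latency of Kubernetes API calls made by this component.",
}, []string{"method", "path"})

func init() { prometheus.MustRegister(apiCallDuration) }

type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// instrument wraps the rest.Config transport so every outgoing API call
// is timed and labeled by method and request path.
func instrument(cfg *rest.Config) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
			start := time.Now()
			resp, err := rt.RoundTrip(req)
			apiCallDuration.WithLabelValues(req.Method, req.URL.Path).
				Observe(time.Since(start).Seconds())
			return resp, err
		})
	}
}
```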
C
Ryan, can you go back up to the topic from before? No, I mean up on the Google Doc.
A
Oh, I mean, what I'll do, yeah, they're there. What I'll do is: I want to put this into Grafana on our end, and I'll come back with some baselines for this, right, with our scale, and we can get some ideas. Because eventually, with all of these, like you were asking earlier, we want this to be something where we don't want code changes to ever regress this in the future.
A
We want the integrity to always stay intact, and so we want to have baselines for every single one of these. So eventually that's what we'll get to.
B
So, okay, you just mentioned Grafana. I think there was some discussion some time ago about us having some Grafana dashboards, and it would be nice, because we forget all these metrics, and if we can maintain some Grafana dashboard it will be easy to just check and see the metrics there.
A
I thought there was a public one... wasn't there talk about creating one, a public one, that would measure some, I don't know, periodic job or something?
C
Yes, I think that's what you mean, right?
C
Okay, yeah. Since we talked about that, I can give you an update from Federico, who was preparing that in the background. From his side, from the infra perspective, he has everything ready for that, or almost. If we would now create periodic jobs with what you enabled in kubevirtci, Marcelo, so that we can deploy Prometheus, and we label the jobs accordingly, the metrics would already be collected and show up in the Grafana dashboard.
C
What we do not have yet is direct access to Prometheus, so that developers can play around with the metrics themselves. We have the load balancer already, but the kubevirt.io domain is now owned by the CNCF, and since they own our domain it's a little bit tougher to get a DNS entry than before. So you can't actually reach it, because there's no DNS associated.
A
Okay, cool. So it's something we can leverage. Okay, all right, we have two more to go. So: VMI pod metrics. This is CPU and memory usage, open Go routines, GC times.
A
Okay, what's a good way we can measure this? I think there's some... what's it called, I just lost it... there's a project that Kubernetes uses to leverage for this.
C
Yeah, I think, when...
B
Many of the metrics that cAdvisor exports... when we install the Prometheus operator we can see that cAdvisor, you know, reads the cgroups directly, and I'm not sure if it will give us more, because the thing that it's not showing is the node metrics.
C
There's the node exporter too, but perhaps you meant the node exporter. But I think what we really want to get is the Go runtime metrics from the virt-launcher pods, and this seems to be what you want, Ryan, right?
C
We could talk about collecting the metrics with virt-handler and exposing them together with the others, like we do with the VM metrics.
E
And I think those Go metrics are actually more useful than what we get from cAdvisor, especially because we have this story going on where we want to make sure our VM overhead is small and correct, and I think those metrics help us see more of how we're doing there.
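For the Go runtime side, the Prometheus client already ships collectors for exactly these numbers (goroutine count, GC pause times, heap usage). A sketch of exposing them from a process such as virt-launcher, with the listen address and endpoint as placeholders, whether they end up scraped directly or re-exported through virt-handler:

```go
package monitoring

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serveRuntimeMetrics exposes goroutine counts, GC timings, heap usage and
// basic process stats (CPU, RSS, file descriptors) on /metrics.
func serveRuntimeMetrics(addr string) error {
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectors.NewGoCollector())
	reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	return http.ListenAndServe(addr, mux)
}
```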
A
Yeah, definitely. Okay, that makes sense to me. Okay, what's the next one? Latency for virtual machine instances and virtual...
F
...machines. So this is API calls made to us.
A
So, like, when we're doing a get or a list of virtual machine instances, how long does that take? This is just to get an idea of...
A
Okay then. Okay, if you have a link, Kevin, that'd be great, yeah. Just to at least see what's there, and if there is something there that's just not hooked up, we can hook it up, or if it is hooked up, maybe we just need to...
A
Okay, great. Okay, that covers everything. I actually wanted to add one more thing, just so that everyone is on the same page. When everyone's doing development, with make cluster-up and make cluster-sync, right, and you're enabling Prometheus and all that stuff to do the testing, to work with stuff like this, how do people do it?
A
How are you testing with the dashboard? If you're doing it that way, developing with make cluster-up and cluster-sync, do you have to port-forward and all sorts of stuff to make it work, so you can watch the dashboard or watch Prometheus?
C
So I guess with make cluster-up, Marcelo included a Grafana dashboard there, which you can...
A
Well, let's just say, if I wanted to... say I was looking at developing this, right, and then, using make cluster-up and cluster-sync, I have the dashboard enabled, right? It sounds like everything's there, I have everything there. Is this what people are doing?
C
Yeah, just a minute.
E
Just for information, I added a metric that I just checked; it is, at least in my opinion... but I think it's a Kubernetes metric, so it should work for everybody.
C
It should be scriptable, so Leona can use it in scripts, and there is also a way to use a fixed port, but the disadvantage with the fixed port is that you easily end up with collisions, right.
A
So here's what I'm doing... probably, yes, that's true. So this is what I'm going to do, and then I'll try that. Okay, so that way... oh, and then we're close to something else, yeah.