Description
Configure team fixes a bug related to the metric that counts the number of unique connected agents.
A: We wanted to discuss this because we implemented a Prometheus metric that tracks the number of unique connected agents, but Mikhail recently figured out that there is a bug: when we look at the metric, we see this weird behavior where the number of agents goes up, then drops to zero, then stays close to zero. And I'm also wondering about something that seems wrong too: the number of connected agent pods should be higher than the number of unique connected agents, right? But it is not. It doesn't seem correct.
B: It calls Register, and then it calls Unregister when it needs to unregister. So in Register we put elements into three sets; they are like secondary indices in a database: connected agents by project id, by agent id, and then just connected agents, a set of the ids of the agents, basically. And then we just get the length of that set here to get the number that we graph. And a particular agent id can be deployed as multiple pods.
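A minimal sketch of what that registration could look like, assuming the go-redis client; the Tracker type, key names, and plain-set layout are illustrative, not the actual kas code:

```go
package agenttracker

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

type Tracker struct{ rdb *redis.Client }

// Register indexes one agent connection three ways, like secondary
// indices in a database: by project id, by agent id, and in a global
// set of unique agent ids.
func (t *Tracker) Register(ctx context.Context, projectID, agentID, connID int64) error {
	_, err := t.rdb.Pipelined(ctx, func(p redis.Pipeliner) error {
		p.SAdd(ctx, fmt.Sprintf("connected_agents_by_project:%d", projectID), connID)
		p.SAdd(ctx, fmt.Sprintf("connections_by_agent:%d", agentID), connID)
		// SADD is idempotent: if several pods of the same agent register,
		// the agent id is still a single member of this set.
		p.SAdd(ctx, "connected_agents", agentID)
		return nil
	})
	return err
}

// UniqueConnectedAgents is the number the metric graphs: the
// cardinality of the set of unique agent ids.
func (t *Tracker) UniqueConnectedAgents(ctx context.Context) (int64, error) {
	return t.rdb.SCard(ctx, "connected_agents").Result()
}
```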
B: They all try to put this id into a set in Redis, basically, and that's what we do here. And if multiple pods put the same id, it's still a single member in the set, and it's counted as a single thing. All good here. And when we remove... we don't unset connected agents? We don't have it here. Oh.
B: When an agent disconnects from one kas instance, other pods may still be connected, and the agent should still be counted, because it is still connected. So we don't remove it here, and instead rely on garbage collection to eventually clean up the ids of agents that don't have a single pod connected. So once all pods disconnect, the garbage collector will remove this id from the set.
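Continuing the hypothetical sketch above (same file and imports), the asymmetry being described would look roughly like this: Unregister cleans up the per-connection indices but deliberately leaves the unique set alone:

```go
// Unregister removes the connection from the per-connection indices,
// but must NOT remove the agent id from connected_agents: other pods
// of the same agent may still be connected, possibly to other kas
// instances.
func (t *Tracker) Unregister(ctx context.Context, projectID, agentID, connID int64) error {
	_, err := t.rdb.Pipelined(ctx, func(p redis.Pipeliner) error {
		p.SRem(ctx, fmt.Sprintf("connected_agents_by_project:%d", projectID), connID)
		p.SRem(ctx, fmt.Sprintf("connections_by_agent:%d", agentID), connID)
		// connected_agents is intentionally untouched; garbage collection
		// removes the agent id once no pod is connected anywhere.
		return nil
	})
	return err
}
```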
B: That's what it correlates with. It's more visible on staging, because on staging we apparently get restarts more often than in production; I guess our infrastructure team does things there, and for some reason kas is restarted more often. But why? When kas is restarted, something happens, and eventually those things are garbage collected. And this is the connected agents hash. So how does this hash work? What do we do? To recap:
B: We set and we don't unset, and we also have refresh and garbage collection. Garbage collection goes through all the items in Redis and removes the ones that have expired, and what refresh does is fight garbage collection: if an item should still be there, because this kas instance is still running and wants the item to stay, refresh puts the item back, rewriting it to update the timestamp.
B: This is why we use an expiring value: we track when a key-value pair in the hash expires in the value itself. So what GC does is go through all the key-value pairs, check the expiration of each value, and then, if it has expired, delete it. That's why we have GC.
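A sketch of that expiring-value idea, with made-up names and a JSON encoding for readability (the real encoding and helpers differ): the deadline lives inside the field's value, refresh re-writes it, and GC scans and deletes:

```go
package agenttracker

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// ExpiringValue embeds its own deadline, because an individual Redis
// hash field has no TTL of its own; only the whole key can expire.
type ExpiringValue struct {
	ExpiresAt int64  `json:"expires_at"` // unix seconds
	Data      []byte `json:"data"`
}

// Refresh fights GC: it re-writes a field this instance owns with a
// fresh deadline so the entry survives the next GC pass.
func Refresh(ctx context.Context, rdb *redis.Client, key, field string, data []byte, ttl time.Duration) error {
	raw, err := json.Marshal(ExpiringValue{ExpiresAt: time.Now().Add(ttl).Unix(), Data: data})
	if err != nil {
		return err
	}
	return rdb.HSet(ctx, key, field, raw).Err()
}

// GC walks every field of the hash, checks the deadline embedded in
// each value, and deletes the expired (or unreadable) ones. Any kas
// instance can collect garbage left behind by any other.
func GC(ctx context.Context, rdb *redis.Client, key string) error {
	fields, err := rdb.HGetAll(ctx, key).Result()
	if err != nil {
		return err
	}
	now := time.Now().Unix()
	for field, raw := range fields {
		var v ExpiringValue
		if err := json.Unmarshal([]byte(raw), &v); err != nil || v.ExpiresAt < now {
			if err := rdb.HDel(ctx, key, field).Err(); err != nil {
				return err
			}
		}
	}
	return nil
}
```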
B: If nothing happens in the whole hash, the whole hash will expire, all key-value pairs, but there is no functionality for a particular key-value pair to have its own expiration, unfortunately, so we do it manually. And refresh just bumps these values, but it doesn't do it for all the values in Redis. It only does it for the ones that this kas instance put there, because we only want to refresh what we care about; we don't want to refresh what other kas instances put there.
B: We actually want to delete the things they put there if those kas instances are known to be no longer running, right? If a kas crashes, we want to clean up after it, and that's why we have GC, and we only refresh our own keys. So our own keys are this data here; it's just a map, basically, from key to the... so, the expiring hash. It's not a single hash, it's a hash of hashes: you have a key, and then you have key-value pairs inside of it.
B: This is so we can have, for example, connections by project id: the key is the project id, and then we have connection id to data. So it's a hash of hashes, basically, and the first part, the "by project id", identifies the Redis hash, and then, within that Redis hash, we have the connection id mapping to the value. So that's why it's a hash of hashes. Anyway, this data thing keeps the value until kas restarts, right?
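Structurally, the thing being described could be sketched like this (names hypothetical): the outer key picks the Redis hash, each field within it is a connection id, and an in-memory map mirrors only what this instance wrote:

```go
package agenttracker

import "sync"

// ExpiringHash is a hash of hashes: the outer key (e.g.
// "connections_by_project:<id>") identifies a Redis hash, and within
// it each field (e.g. a connection id) maps to an expiring value.
type ExpiringHash struct {
	mu sync.Mutex
	// data mirrors what THIS kas instance wrote to Redis:
	// redis key -> field -> value. The refresh loop iterates this map,
	// so we only bump our own entries and never keep another
	// instance's stale entries alive.
	data map[string]map[string][]byte
}

// remember records an entry after it has been written to Redis, so the
// refresh loop keeps it from expiring.
func (h *ExpiringHash) remember(key, field string, value []byte) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.data[key] == nil {
		h.data[key] = map[string][]byte{}
	}
	h.data[key][field] = value
}

// forget stops refreshing an entry: it is removed from memory only.
// The copy in Redis is left to expire and be garbage collected.
func (h *ExpiringHash) forget(key, field string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	delete(h.data[key], field)
}
```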
B: Let me go through this again. When we set it, we not only put it in Redis, we also put it into the map, so that we can refresh it, see. We put it into the map in memory; this kas instance has this in memory. And then, every now and then, we call refresh, like this RefreshRegistrations, which is called when a ticker ticks: every refresh period the ticker ticks, we call RefreshRegistrations, and we call refresh on all the hashes.
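Continuing the hypothetical ExpiringHash sketch (same imports as the earlier sketches, plus context and time), the ticker-driven loop just described might look like:

```go
// runRefreshLoop re-writes everything in the in-memory map on every
// tick, bumping the deadlines so GC leaves this instance's entries
// alone for as long as it is alive.
func (h *ExpiringHash) runRefreshLoop(ctx context.Context, period time.Duration,
	refresh func(key, field string, value []byte)) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			h.mu.Lock()
			for key, fields := range h.data {
				for field, value := range fields {
					refresh(key, field, value) // e.g. the Refresh sketch above
				}
			}
			h.mu.Unlock()
		}
	}
}
```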
B: So we kind of keep poking it so that it doesn't expire and doesn't get garbage collected, because it's still in memory. So it only goes away through garbage collection, like when kas restarts, and the kas that has been restarted, or another kas, eventually sees that the value has expired and deletes it. And then that's the drop.
A: It's like, okay, so it's the new kas that comes up that figures out that that hash was expired, by looking at its value in Redis, and then it cleans it up.
B: See, I...

B: Fine, and then they restarted, and then, like, a couple of minutes later, three minutes later... and then how many minutes later?
B: No, that's not correct. What I think is that it's at this high number because on staging we have QA or something running, so there are new agents. That's my guess! Actually, I don't know that. But the only explanation I can come up with is that on staging we have QA or something that creates new agents, and the new agents connect, and that's why the graph is going up.
B: So the bumps should probably correlate with CI jobs. And this graph also grows, right? It was at one, so there's probably an agent on staging that is always connected, and then, you see, a QA test runs and it's two, and then the test finishes and it's... zero, and then the test runs again and we get a bump.
A: Yeah, but it keeps it in Redis, I see. Yes, it keeps incrementing. Okay, then, yeah, but that's still the weird thing. I think this explains a lot, but if you go to the chart from before, the chart that you had before from staging, you can see that it drops... when it drops, on a restart.
B: Don't... okay.
B: Yeah, so we need to test that it removes it from memory but doesn't remove it from Redis.
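A sketch of what that test could assert, assuming the hypothetical ExpiringHash above and the miniredis in-process server; the real test will use the project's own helpers:

```go
package agenttracker

import (
	"context"
	"testing"

	"github.com/alicebob/miniredis/v2"
	"github.com/redis/go-redis/v9"
)

func TestForget_RemovesFromMemoryButNotFromRedis(t *testing.T) {
	s := miniredis.RunT(t)
	rdb := redis.NewClient(&redis.Options{Addr: s.Addr()})
	ctx := context.Background()

	h := &ExpiringHash{data: map[string]map[string][]byte{}}
	// Simulate a set: write to Redis and remember in memory.
	if err := rdb.HSet(ctx, "key", "conn1", "value").Err(); err != nil {
		t.Fatal(err)
	}
	h.remember("key", "conn1", []byte("value"))

	h.forget("key", "conn1")

	if _, ok := h.data["key"]["conn1"]; ok {
		t.Error("entry should be gone from memory")
	}
	// Still in Redis until GC collects it.
	if err := rdb.HGet(ctx, "key", "conn1").Err(); err != nil {
		t.Errorf("entry should still be in Redis, got: %v", err)
	}
}
```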
A: Yeah, it would fail at some point because of this. Okay, so forget, and the same thing here, the agent id: now we just need the agent id, I guess.
B: It ignores it, it ignores the passed interface on line 23, because, yeah, we don't care about this thing, because we have a single hash. That's why whatever we pass, nil or anything else, doesn't matter, so we pass nil just to make it more explicit, kind of.
A: Yeah, I think you can create... you have a preset for tabs, but I've never done this, and you can even restart commands. You can... I've...
B: ...what I wanted to do is see whether it's possible to run these mockgen commands concurrently, to speed everything up, but then I thought some mocks can depend on other mocks, theoretically, so it's probably a bad idea. But maybe... we probably don't have this problem anyway. I didn't spend too much time researching how to make it generate things concurrently.
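For reference, the idea that was considered (and set aside) would look roughly like this with errgroup; the argument sets are made up, and this is only safe if no mock's output is another mock's input:

```go
package main

import (
	"context"
	"os/exec"

	"golang.org/x/sync/errgroup"
)

// generateMocks runs independent mockgen invocations concurrently.
func generateMocks(ctx context.Context, argSets [][]string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, args := range argSets {
		args := args // capture the loop variable (pre-Go 1.22)
		g.Go(func() error {
			return exec.CommandContext(ctx, "mockgen", args...).Run()
		})
	}
	return g.Wait() // the first failure cancels the context for the rest
}
```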
A: But I'd have liked to regenerate just the files that I want, yeah.
A: Yes, the thing is that I wanted to send it to Berlin, and yeah, I only arrived there on the 27th. So I put a note there, and yeah, I haven't gotten it yet, but I should get it in the next month, hopefully.
B: This, you mean? No, this is fine. On line 87 you have gomock InOrder around two mock invocations, and each mock invocation returns an object that identifies that invocation; then we tell the mocking framework that we want these invocations to happen in order, and we do that by passing the values that these mocks return to the InOrder function.
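A generic illustration of that mechanism; the Worker interface and its generated mock are hypothetical, and the import path assumes go.uber.org/mock:

```go
package agenttracker

import (
	"testing"

	"go.uber.org/mock/gomock"
)

func TestCallsHappenInOrder(t *testing.T) {
	ctrl := gomock.NewController(t)
	m := NewMockWorker(ctrl) // assumed generated by mockgen from a Worker interface

	// Each EXPECT() invocation returns a *gomock.Call identifying it...
	first := m.EXPECT().Start()
	second := m.EXPECT().Stop()
	// ...and InOrder tells the framework these must happen in this order.
	gomock.InOrder(first, second)

	m.Start()
	m.Stop()
}
```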
B: ...speed things up. I'm talking about speeding Bazel up by moving the sandbox with its symlinks into a memory volume; it's documented in the CONTRIBUTING file, I think. This will run faster, that's what I'm saying. In the docs directory, yeah, this file; I think that's where it's documented. So you create a memory drive and then tell Bazel to use that for temporary files, basically.
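(For reference: a common way to do this on Linux is to mount a tmpfs, e.g. /dev/shm, and point Bazel's --sandbox_base flag at it, so the per-action sandbox directories, which are mostly symlink forests, live in memory; the exact steps in the repository's docs may differ.)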