From YouTube: Scalability Team Demo 2020-11-04
B
Thanks. I just wanted to give a quick update that it's in review, and to show everybody what it looks like right now, because it's quasi-demo-ish and I think it's interesting. So, it's in review, and we reduced the scope a little bit, because I had a conversation with Andrew where I realized we were making the configuration part more complicated than it had to be.
B
So
what
it
looks
like
now
is
the
configuration
is
an
environment
variable,
it's
just
one
environment
variable
and
the
advantage
of
that
is
that
it
makes
the
merge
request
smaller,
and
this
is
already
supported
in
both
omnibus
and
cloud
native
gitlab.
So
we
don't
have
to
do
any
follow-up
there
to
make
it
possible
to
set
the
environment
variable.
This
should
just
be.
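For illustration, a single-variable bypass could be as small as something like this. A hedged sketch: the variable name GITLAB_THROTTLE_BYPASS_HEADER and the helper are assumptions for illustration, not taken from the actual merge request.

```ruby
# Hypothetical sketch: a single environment variable names the header that
# marks a request as bypassing the rate limiter.
module Gitlab
  module Throttle
    BYPASS_HEADER_ENV = 'GITLAB_THROTTLE_BYPASS_HEADER' # assumed variable name

    # True when the configured bypass header is present and set to "1".
    def self.bypass?(request)
      header = ENV[BYPASS_HEADER_ENV]
      return false if header.nil? || header.empty?

      # Rack exposes an incoming "X-Foo-Bar" header as HTTP_X_FOO_BAR.
      request.get_header('HTTP_' + header.upcase.tr('-', '_')) == '1'
    end
  end
end
```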
B
This should be pure configuration: the configuration management that's already in place should be able to handle it. I actually have no idea how the SREs manage this, because we have... it's nice...
B
...that John is here, because we have deployments on both VMs and Kubernetes, and I don't know where the config is stored, whether it's the same repo, whether it's the same mechanism, and what that looks like. But I'm assuming we can set an environment variable in both of those without it being a big deal. And the environment variable is just for the bypass, which is not the interesting part here.
B
Well, it's essential, but there's not much excitement in deploying the bypass. The more interesting part is turning on the rate limiting and correctly setting up HAProxy to label the things that need to be bypassed. One of the things we now have in the merge request is that every request matching the bypass will be marked in the JSON log, so it should be very easy to see both that we are not labeling all requests with the bypass and that we are labeling the ones we intend to.
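The marking could look roughly like this. A minimal sketch: Rack::Attack's safelist block is the real API, but the header and env-key names here are assumptions, not the actual MR.

```ruby
# Hypothetical sketch: safelist bypassed requests and leave a marker that the
# JSON log formatter can include, so mislabeled traffic is easy to spot.
Rack::Attack.safelist('throttle_bypass') do |req|
  bypassed = req.get_header('HTTP_GITLAB_BYPASS_RATE_LIMITING') == '1' # assumed header

  # Stash the result in the Rack env; the request logger can then emit it
  # as a field on every request line in the JSON log.
  req.set_header('gitlab.rack_attack.bypassed', bypassed)
  bypassed
end
```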
B
So what I'm trying to say here is that I think the observability of this part should be good the way it is now. And... go ahead.
B
Finish off what you're saying... Yes, so I think the observability of that part is good, the mechanism for making the config change is simple, and it has a low cost for us to build and ship. It ended up a bit simpler than I thought it would be at the start, so that is good.
B
We're now having conversations on the issues about how to estimate the limits. That is still a bit nebulous to me, but Craig thinks he can do it just from seeing the state from Rack::Attack, and all we need is to dump that state.
B
So I want to see if I can write a little script to do that, and I want to double-check that it is reasonable to dump the state, because the only way I see to do it is a full scan of the cache instance. When I wrote that, I thought we probably don't want this; I was going to say we could do it, but we probably don't want to. But then Craig said, oh, I did it in two minutes, and that sounds too good to be true.
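A sketch of the kind of dump script being described, assuming Rack::Attack's default key prefix and the redis-rb client; the connection URL is a placeholder:

```ruby
# Hypothetical sketch: dump Rack::Attack throttle counters with a
# non-blocking SCAN instead of KEYS, fetching each batch's values right
# away since the keys expire at the end of their period.
require 'redis'

redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379')) # placeholder

cursor = '0'
loop do
  cursor, keys = redis.scan(cursor, match: 'rack::attack:*', count: 1000)
  unless keys.empty?
    values = redis.mget(*keys) # grab the batch immediately, before keys expire
    keys.zip(values) { |key, count| puts "#{key}\t#{count}" }
  end
  break if cursor == '0'
end
```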
B
Also, just to support Craig. It's not that I don't trust him; I just want to double-check that. And then the work sort of shifts to Craig figuring out the rollout plan, and to supporting him on that. Awesome, yeah. That was actually what my question was: is that sort of scheduled to happen? It's part of the epic, yeah?
B
The idea of the epic is that we stick to checkboxes in the admin UI, three checkboxes actually: unauthenticated rate limiting, web authenticated, and API authenticated. We can only do that if there are sane numbers below the checkboxes, because otherwise this will reject everybody's traffic, and we can only do that if we can allow-list the people we need to allow-list. So that goal of the checkboxes sort of has all this work hanging under it.
C
I think there's a way in Elasticsearch to make a query where you don't ask for the rate at which a certain request was coming in from a certain user. You don't aggregate on the full bucket; you can almost break a bucket up into sub-buckets and then ask for the max of those. The reason that's useful is that, say you're looking over the last seven days, you don't want to ask for minute-by-minute usage.
C
Well, we wouldn't even use the context metadata; we could use the IPs, because we could use the allow lists that we've already got, that are known. So we could filter out those users, and then for what's left we could use this technique. I'll go and give it a try, but basically what you can probably do is ask who hit Rails the most in any one-minute period...
C
...over the last seven days. I'm not sure if the one-minute periods are aligned with clock minutes or if they're rolling one-minute windows, but either way you can get a pretty rough idea.
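The shape being described sounds like a per-user date histogram with a max_bucket pipeline on top. A rough sketch assuming the elasticsearch-ruby client; the index pattern and field names are placeholders:

```ruby
# Hypothetical sketch: per-user peak requests-per-minute over the last 7 days.
require 'elasticsearch'

client = Elasticsearch::Client.new(url: ENV['ES_URL']) # placeholder connection

response = client.search(index: 'rails-logs-*', body: { # assumed index pattern
  size: 0,
  query: { range: { '@timestamp' => { gte: 'now-7d' } } },
  aggs: {
    by_user: {
      terms: { field: 'username.keyword', size: 100 }, # assumed field name
      aggs: {
        # Break each user's traffic into one-minute sub-buckets...
        per_minute: { date_histogram: { field: '@timestamp', fixed_interval: '1m' } },
        # ...then take the max of those sub-bucket counts.
        peak_minute: { max_bucket: { buckets_path: 'per_minute._count' } }
      }
    }
  }
})

response.dig('aggregations', 'by_user', 'buckets').each do |bucket|
  puts "#{bucket['key']}: peak #{bucket.dig('peak_minute', 'value')} req/min"
end
```

Note the caveat Andrew raises next: the terms aggregation orders by total traffic, not by peak, so the final sort still happens client-side.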
B
To go into the Rack::Attack data... that's a good point, because there shouldn't be any Rack::Attack data right now, although...
C
I suspect that what's going to happen is that, however high we have those defaults, there's going to be a bunch of people that get rate limited, sadly, just because of the state of things. But that doesn't matter; I'm not saying we should work around those people.
B
A
hundred
thousand
per
minute-
I
I'm
sure,
we'd,
be
fine
with
a
hundred
thousand
per
minute
and
then
we
could
dump
the
state
out
of
rack
attack
and
the
way
rek
attack
works
is
that
it
creates
a
redis
key
for
each
each
user
that
it's
counting,
but
also
for
each
time
period.
So
it's
it's
like.
It
looks
at
the
clock
and
it
says
for
each
minute
I'm
going
to
make
a
separate
key.
B
No, no. The period is configurable, and the timeout is whatever is left of the period plus one second, so it is a very finely matched timeout. These keys should disappear almost immediately when they're not needed anymore.
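The key-per-period scheme works roughly like this; a sketch of the idea, not Rack::Attack's exact internals:

```ruby
# Hypothetical sketch of a Rack::Attack-style counter: one Redis key per
# (rule, discriminator, period window), expiring just after the window ends.
def increment_counter(redis, rule_name, discriminator, period)
  window = Time.now.to_i / period # e.g. a new integer every 60s for a 60s period
  key = "rack::attack:#{window}:#{rule_name}:#{discriminator}"

  count = redis.incr(key)
  # TTL = remainder of the current window plus one second of slack.
  redis.expire(key, period - (Time.now.to_i % period) + 1)
  count
end
```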
C
One of the things we should just take note of is that you're changing the workload in Redis quite substantially, and you start expiring a lot more keys. Because this is on the persistent Redis as well, isn't it? No, it's kind of... it's the cache one.
C
Sorry, it's the cache one. But if we have a much higher volume of expiries, there are certain things in Redis that slow down a little bit with that. I can't remember what they are offhand, but we've seen them in the past, and we should just keep an eye on those metrics.
C
Redis has background processes that run... I think "frequency" is the name of the config, something like that. We should just make sure that's one of the things we look at as we roll it out.
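One cheap way to keep an eye on that during rollout; a sketch using redis-rb's INFO command, where the counters shown are standard Redis INFO "stats" fields:

```ruby
# Hypothetical sketch: sample Redis expiry-related counters before and
# during the rollout to spot a big change in expiry workload.
require 'redis'

redis = Redis.new(url: ENV['REDIS_CACHE_URL']) # placeholder connection
stats = redis.info('stats')

puts "expired_keys: #{stats['expired_keys']}"  # cumulative keys expired
puts "evicted_keys: #{stats['evicted_keys']}"  # evictions under memory pressure

uptime = redis.info('server')['uptime_in_seconds'].to_i
puts "avg expiries/sec since boot: #{stats['expired_keys'].to_f / uptime}"
```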
B
That's a good point, yeah. I sort of understand your idea of the log analysis, but I'm not sure I could explain it well, so if you could write a comment about that, Andrew, that would be helpful.
B
So you really need to query them as fast as possible and hope that not too many have disappeared, and that works well with SCAN, because you get a batch and you can immediately query that batch. But I want to script that up and see what it looks like. So then we have two approaches for Craig to choose from: either the 100,000 requests per minute, or the log analysis, or both.
B
I think I've shared as much as I can think of here. Are there any other questions on this topic?
A
I'd just like to add: please can you put those comments onto the issue that I've added in the agenda? At least then all of this is collected together, rather than being put on the agenda itself.
B
I'll see if it looks like we can do that, because I'm of the opinion that we probably don't want it, so I think it's up to Craig, because I'm trying to...
A
Yeah, what we can do is open another ticket to talk about the log analysis to find that number.
D
Okay, well, just a couple of things to add. I think the dry run is going to be really important, and I'm hoping that we don't turn this on and inadvertently start limiting something like internal API requests, or, you know, an important customer. So if that's possible, that'll be helpful.
D
Another thing is that there are, or were, a lot of customers interested in the whitelisting feature. I know this isn't what we're going to be delivering or advertising to customers, but maybe there's the potential here that we could, say, inject nginx config to allow some self-managed customers to utilize this. Is that worth pursuing or not?
B
Well, yeah, of course, we could maybe also do it with nginx. We can try that. The end result would be more documentation on how to do it, and somebody sitting down to try out the procedure and make sure those nginx config snippets do the right things.
B
Not even that, because you can inject arbitrary...
D
...text into nginx. The facility is there to inject nginx config, so that could be the first thing.
C
On the back of Craig Miskell's point about doing this in the product: I was thinking about it a little bit afterwards, and you could quickly get yourself into a quagmire with that. If you don't do things in a smart way, if you've got, say, 50 rules and you're just iterating over each of those rules for every request coming in to check your whitelist in the application, you'll slow the application down pretty badly. Obviously, the way nginx and HAProxy do it...
C
...is that they build a kind of tree, where they route the request down to the right part of the tree based on the CIDR blocks. If we built it in the product, we'd need to do it that way, because otherwise we'd just slow everything down really badly. So it's kind of another reason to say maybe we shouldn't do that; maybe we should leave it to the experts.
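To illustrate the difference: a linear scan costs one check per rule per request, while a prefix tree costs at most one step per address bit. A toy IPv4 CIDR trie might look like this; illustrative only, not how nginx or HAProxy implement it:

```ruby
# Hypothetical sketch: match an IPv4 address against many CIDR blocks by
# walking a binary trie of prefix bits, O(32) per lookup instead of O(rules).
require 'ipaddr'

class CidrTrie
  Node = Struct.new(:zero, :one, :terminal)

  def initialize
    @root = Node.new
  end

  def add(cidr)
    ip = IPAddr.new(cidr)
    node = @root
    ip.prefix.times do |i|
      branch = ((ip.to_i >> (31 - i)) & 1).zero? ? :zero : :one
      node = (node[branch] ||= Node.new)
    end
    node.terminal = true
  end

  def include?(addr)
    bits = IPAddr.new(addr).to_i
    node = @root
    32.times do |i|
      return true if node.terminal # a covering block ends here
      node = ((bits >> (31 - i)) & 1).zero? ? node.zero : node.one
      return false if node.nil?
    end
    !!node.terminal
  end
end

trie = CidrTrie.new
trie.add('10.0.0.0/8')
trie.add('192.168.1.0/24')
p trie.include?('10.1.2.3')    # => true
p trie.include?('192.168.2.9') # => false
```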
A
Well, the way that I've been trying to approach the project in general is that we need to make this work for what the SREs need right now, to stop...
A
...having so many incidents about the bad automated traffic that's coming in, while trying to make the change to the product as minimal as possible. But I have gone through and found a whole bunch of issues that seem related either to whitelisting or to rate limiting; there are quite a number that have been raised over the past couple of months. When we finish the work, we can go back, comment on all of those issues, and say: this is the functionality that exists now.
A
This is what's there, and either that unlocks the ability for the stage groups to pick it up and continue, or they ask us to help them with putting it back into the product in a different way. But I was just trying to isolate this project down to what we needed to do, then tell the rest of the stage groups what we've done as a result, and leave the choices to them for how to take this forward a bit more.
B
Yeah, thanks, I agree. I wanted to ask something else about what was said: I wanted to talk about the dry run idea a little bit, because if you don't qualify it, it sounds like an obvious thing...
B
You
want,
but
it's
one
of
those
things
where,
if
you
think
about
how
it
should
work,
it
becomes
complicated
and
that's
why
I'm
now
pushing
against
not
having
it
because
it
looks
like
the
one
we
want
is
complicated
and
I'm
not
sure
if
we're
going
to
use
it.
B
So let me try to quickly explain why I think that. The first dry-run idea I came up with was this: we have these rule definitions in the initializer, so they run at startup, and they call methods on the Rack::Attack class, like on a singleton. So they're basically stuffing rules into Rack::Attack in a global variable, and that's there for the rest of the life of the process. And I thought, well...
B
We
can
make
the
code
that
stuffs
those
rules
in
dynamic
and
check
an
environment
variable
and
if
it's
the
environment
variable
is
set.
We
don't
check
push,
put
a
block
rule
in
or
a
throttle
rule,
but
we
just
put
something
in
that
tracks
it
and
only
logs.
If
something
matches
the
thing.
So
then
you
log
your
your
violations,
but
you
take
no
action,
and
this
is
something
that
requitec
can
natively
do
like.
There
is
a
type
of
rule
called
track
and
it
will
follow
the
same
logic
as
a
throttle.
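A minimal sketch of that startup switch. Rack::Attack.throttle and Rack::Attack.track are the real API; the environment variable name, the rule shown, and the notification payload details are assumptions based on rack-attack 6's instrumentation:

```ruby
# Hypothetical sketch: register every rule as a real throttle or as a
# log-only track, depending on one environment variable read at boot.
dry_run = ENV['GITLAB_THROTTLE_DRY_RUN'] == '1' # assumed variable name

limiter = ->(req) { req.ip unless req.path.start_with?('/-/health') } # example rule

if dry_run
  # Same matching logic as the throttle, but never blocks the request.
  Rack::Attack.track('requests_by_ip', limit: 300, period: 60, &limiter)
else
  Rack::Attack.throttle('requests_by_ip', limit: 300, period: 60, &limiter)
end

# rack-attack publishes matches as ActiveSupport notifications; log the
# would-be violations from there.
ActiveSupport::Notifications.subscribe('track.rack_attack') do |_name, _start, _finish, _id, payload|
  req = payload[:request]
  Rails.logger.info(message: 'rate limit dry-run match', ip: req.ip, path: req.path)
end
```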
B
The problem with designing it like this, with one environment variable that's read at startup, is that either you put everything into dry run or everything is active and kicking in. That would still be useful the first time we roll it out, when we're trying to get all the limits right for what we have now: we could say everything is just a tracking rule, make sure that looks good, and then restart all the fleets with the environment variable removed, so those become real rules.
B
But then what happens the next time you want to turn something on or off? Do you put everything into tracking mode, or do you put things one by one into tracking mode? It gets a little more complicated to have a good interface for turning individual rules into tracking mode or not, and that's why I thought maybe we shouldn't be doing this.
D
You know, log the violations with the whitelisting taken into account, because this will allow us to at least not be surprised by something we didn't think about. For example, maybe there's an internal IP address that we thought we were setting the header for, but we're not; that should be very clear if we're logging every violation, every rate limit that's kicking in. And I really only see this being used the first time we turn it on. So I think that would be helpful.
B
That was my other question. We have to adjust these limits anyway, and if we do the thing where we do a log analysis, or we set it to an insane number per minute and push it down, then we might be able to get the same effect. Say we allow a million requests per minute: if we can effectively dump the state out of Rack::Attack and see which counters are going up, then we get sort of the same kind of capability, and the difference is that you're not building something...
B
That's
you
can
only
use
by
turning
the
whole
system
into
sure
in
driving
mode
yeah.
I
I
I
it's
a
bit
tricky.
I
I
also
I
mean
I
wanna.
I
want
this
to
be
safe.
I
don't
want
this
to
be
something
where
we
knock
ourselves
out,
because
it
has
a
great
potential
for
doing
that,
but
so
okay,
so
I
guess
what
I'm
saying
is
that
we're
still
not
sure
yet
if
the
dry
run
thing
is
what
what
shape
it
should
have
like.
B
If we can do good enough analysis with the million-requests-per-minute approach, good enough introspection, then maybe we don't need it. But we haven't quite established that yet.
B
Yeah, in a way, of course, turning these rules from throttles into tracking is the best way to know whether you're going to violate a limit or not, because Rack::Attack is doing all its own application logic and making its own decisions; it's just that at the end of the decision it still lets the request through. That's better than trying to reconstruct from the outside what you think Rack::Attack is going to do if the numbers are different, by inspecting its internal counters.
B
So another thing I thought about, and I don't know if this is a good idea: we could say that each rule has a name, and we could have an environment variable that is a list of names. Then we need some sort of scheme to serialize the list into a string, but that's something we can figure out. And then, when the app boots, it checks, for each throttle...
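Sketched out, that per-rule variant might look like this; the variable name is hypothetical, and comma separation is just the simplest serialization:

```ruby
# Hypothetical sketch: a comma-separated env variable lists the rules that
# should boot in log-only tracking mode instead of throttling.
DRY_RUN_RULES = ENV.fetch('GITLAB_THROTTLE_DRY_RUN_RULES', '').split(',').map(&:strip)

def register_limit(name, limit:, period:, &discriminator)
  if DRY_RUN_RULES.include?(name)
    Rack::Attack.track(name, limit: limit, period: period, &discriminator)
  else
    Rack::Attack.throttle(name, limit: limit, period: period, &discriminator)
  end
end

# e.g. GITLAB_THROTTLE_DRY_RUN_RULES="throttle_unauthenticated" puts only
# that rule into tracking mode while the others keep enforcing.
register_limit('throttle_unauthenticated', limit: 3600, period: 3600) { |req| req.ip }
```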
B
Not
simple
in
terms
of
what
the
last
thing
I
was
describing,
I
think,
would
be
relatively
simple
to
make
in
the
code,
but
it
would
not
be
simple
to
use.
A
It just doesn't feel like it's a simple thing to use; whether or not it's easy to create is a separate thing. But I think it's about how we make a decision as to whether we're going to build this dry-run mechanism or not, and I think I'd already asked on the issue how much work it is we're talking about.
C
I did see one user that's more shocking, but I won't mention their name; I took that out and just included web. So this morning the best result I got was that in a one-minute period there was one user who made a thousand requests. Then I went and looked up that user in that period, and sure enough, the results match what we're seeing. So I think we can use that analysis.
C
The only problem with it is that you can't sort by the worst offenders, so you have to sort by something like the average amount of traffic over the period. So basically you need lots of results and then kind of sort them on the client side. I'll send you... I'll give...
B
...so the upshot is that, based on the log analysis, you think we can say with confidence that certain numbers are not going to cause a major problem for us.
C
Well, yeah: we can sort of construct that, if we know which IPs we're going to whitelist, and we can filter those out because we've got them in the logs. How many IPs are in those lists? Do you know roughly?
D
Yeah, I don't know right off...
B
I think there's half a dozen, or up to a dozen, different categories of things, and I don't know how many things are in each category, but...
D
I guess my biggest concern is just the initial configuration complexity, and rolling this out across virtual machines and Kubernetes for the different ingresses, like web, API, and git-over-HTTPS, without having a dry-run mode. A dry run would make me feel a lot better, because then we'll have everything turned on the way that we want to turn it on, with the whitelist in place, and then we can look at logs to see, like, okay...
D
...are we actually rate limiting anything that we didn't expect to rate limit? And it will validate that the HAProxy configuration is correct and the environment variable is set correctly everywhere, I think.
B
I'm trying to decide here. We have this issue about having a dry-run mode in the epic, and from my point of view, and I think also Rachel's point of view, we want to decide what the scope of that issue is and whether we're going to do it in this app or not. And I think the problem we want to solve is that the SREs rolling this out need to be able to do it with confidence, sufficient confidence to match the severity of the change.
A
Yeah, and I just wanted to add: it's about making the rollout as safe and as easy and straightforward as possible, and I think taking the input from the SREs is really important, because they're the ones who are going to have to do it.
E
Thinking about this from a sort of product perspective: we've talked about the dry run as the thing we'll be doing once, and this is much more complicated to build, so I don't think we should do it as part of this epic, but it feels like what you really want is for changes to be able to go into a dry-run mode. Like: I have my setup, I make some changes, they go into dry-run mode, then I roll them out.
B
Because you can individually flip environment variables on hosts.
B
The trouble with dry-run mode is that Rack::Attack increments its counters, and once you're inside a throttle rule, you can't say "do increment the counter, but don't block". Then you'd have...
B
...to basically rewrite Rack::Attack. There's another thing that occurred to me, and it has to do a bit with the product perspective. I think until now, or earlier in this project, I've been thinking a bit too much from the point of view of: I don't want to ruin GitLab with the complexity of weird stuff. And now I've said, okay, we just put the config in an environment variable, and I want to get this out the door.
B
So from that point of view, I wonder if maybe we should just have the global dry-run mode, because that is a simple change, and it adds value for this broader project. And yes, that thing will then sit there for who knows how long in the rate limiting code, but maybe I shouldn't worry so much about the long-term cost of having that thing exist in the rate limiting code.
B
That fits with what was said about testing changes: you could put one host into dry-run mode and... wait, no, you can't deploy changes that way, because the settings are in the database, so those are global.
E
How would that work... But yeah, the other thing about the code living around: if we don't document it, we could always remove it once we're done with it on GitLab.com, if we think that's an issue. So we could add it undocumented; if we want to keep it, we document it, and if we don't, we remove it.
C
Just a little bit more background that we can keep in mind: when Wikimedia moved across to GitLab...
C
I don't normally read Hacker News threads too much, but for some reason I got onto a whole thread on there, and one of the big themes on that thread was how big self-managed instances of GitLab have all sorts of the same problems that we have when it comes to abuse and spammers and comment spam, all of these same things. So that kind of made me realize that maybe there would be other people needing things like that.
B
Yeah, the counter-argument to me saying we are only going to use this once is, of course, that we are not the only people who need this. A bunch of other people are also going to use it once, and be very glad that they have it that one time, just like we are going to be very glad to have a dry-run mode the one time we turn everything on.
A
Sorry, just to interrupt there with a question: when this was enabled the first time, and it was quickly turned off, would it have been helpful if there had been a dry-run mode?
D
Yeah, I think for sure it would have been helpful. I don't recall what we set the limits to; I think we set the limits to be fairly generous and we still had issues, but I don't recall the details. And I think having this in the product would also be nice, not just as an environment variable.
C
Yeah, and just to put some context there: the people that are doing the most requests, I'm seeing about 22,000 as a maximum in the last six hours, twenty-two thousand in one minute. That kind of gives you an idea. I'm assuming you'd put the rate limit below that, so they would have hit it pretty quickly.
B
Right, yeah, okay, this was a useful discussion. I'm now leaning towards saying: let's build the simplest possible dry-run mode, because it's not that expensive to build and it is very useful. It's maybe not the ideal thing from a product perspective, but it is still also useful as part of the product, and I think I'm going to worry less about how this fits into a vision of the product.
E
I added a thing to the agenda; it's actually completely not a demo. It's just, while I've got Andrew and Jarv here particularly: I was trying to get some historical Sidekiq stats, and when I say historical, I mean last month, by shard. And I can't, or I can, but only for the catch-all and catch-all-NFS shards; I just get too many "exceeded chunks limit" errors with whatever I'm trying to do, and I can't find a recording rule.
C
Yeah, so that thing that you're looking at is already pretty aggregated, right? So there are a few things that you can do.
E
Oh, sorry, there is one thing I noticed about that. The recording rule that I'm looking at is a "sum without (fqdn, instance, shard)", but that won't include pod and the other Kubernetes labels we get, which is where I think we're getting the chunks from.
C
No, it's not... There are plenty of metrics that don't... If you go into the observability channels, you see people asking similar questions pretty much all the time at the moment. Sorry, a quick little rant: there's nothing that you can get more than two weeks' worth of metrics for at the moment, so it's very rare to be able to find data for things, and I think it's a major problem right now. So Ben has...
C
Well, if you look at that query, there's a big blurb. You know the Prometheus errors are always wonderful, but there's a thing that says "limit 8333", and Ben has an open MR, which he opened on Monday, to increase that number. That limit is about 10 megabytes of data: if you divide 10 megabytes by the chunk size, it comes out at that 8333 number. I think he's going to extend it to 100 megabytes, and hopefully that'll make a difference.
C
Yeah, it's got some limits in place. Let me just see... but there are things that you can do to help. The first, and this is total technical debt that needs to be fixed: you're using the label called "environment", and there's another label that's the same, called "env". The difference between the two is that Thanos will route "env" to specific nodes, whereas "environment" will fan out to all the nodes.
C
So the first thing you can do is change it to "env". The second thing you can do is, in the resolution field, choose something like 3600 to get hourly data, that is, one sample per hour; but that's obviously pretty dangerous.
That's.
C
Yeah,
I
I
you
you
probably
right,
but
it
seems
to.
I
think
I
I
don't
know
the
technical
details,
but
I
do
have
better
luck,
but
obviously
the
thing
there
is
that
you're
using
a
one-minute
rate
and
then
you're
getting
one
like
one
minute
sample
every
hour.
So
your
data
is
going
to
be
super
sketchy
right
because
you
know
you're
taking
one
minute
out
of
an
hour
and
assuming
that
that's
what
the
whole
hour
looks
like,
and
I
mean
it's
a
very
rough,
but
I'm
still
getting
those
errors.
E
Yeah, I tried just using the underlying metric with a rate, but it didn't really help with...
C
Sorry, I'll find this; there's an issue about this. Oh yeah, in fact, I think...
C
So I was talking to Ben about this on Monday, and he created a... because Ben actually wanted to decrease the retention. What I often do is just go to a Prometheus instance instead of Thanos, and I actually get better results than I do from Thanos. But he wanted to reduce the retention on Prometheus down to something like three days, which would mean we'd lose that and we'd have nothing. So if you look on the thread, it's at 11789. I thought you'd actually made the merge request already... yep.
C
Yeah, you can see the problem still, right? Yeah. I think we should just ping him on there and ask if we can get that sorted out. That's good.
C
On that thing: Matthias was trying to look at Ruby memory stats, and you don't know what recording rules you need until you're looking at the data, so you can't retrospectively apply them. It just so happens there that the cardinality is too high and we can't do this, and it's really kind of slowing us down, and I don't really know what the solution is.
B
Yeah. Is part of the problem here that Thanos receives automated or untrusted traffic and needs to have defensive limits per query, and we're exceeding those? Because then maybe we could have two Thanos instances: one where humans can run crazy queries that use a gigabyte of RAM, but they get their query answered, and one that gets the untrusted traffic and has to work within constraints.
C
And possibly with longer than a two-minute timeout as well, where you just kind of bear with it, right? You can't have asynchronous queries, but you could...
E
Yeah, I would be happy waiting. I would be happy with a batch query thing where it comes back in an hour; I'm looking at historical data anyway, I just want the data. Sorry, another question about the recording rule, Andrew. I think for this one it would help... I don't think that's the problem, but I think it would help, because at the moment we have... these are recording rules that we need to tidy up anyway...
E
I
think,
but
they
use
some
without
fqdn
instance,
which
for
prometheus
we've
got
other
labels.
Sorry
for
kubernetes
we've
got
other
labels
like
pod,
which
are
basically
the
same
thing.
C
But
yeah
so
so,
what's
happening
with
that,
there's
been
like
a
lot
of
backwards
and
forwards
on
there's
a
there's,
an
issue
and
I've
been
a
bit
snarky
on
it,
but
basically
what
it
is
is
I
like,
I
think
what
we've
agreed
now
is
that
we're
going
to
get
rid
of
fqd
well,
if
qdn
will
just
kind
of
be
relegated
to
like
legacy
and
on
the
on
the
instance
label
ben
is
supposed
to
be
working
on
this
at
the
moment.
C
Add
node
label
for
kubernetes
discovery,
11504
yeah,
so
so
here's
the
original
issue
on
on
that
sorry
I'll
put
it
in
here
and
then
that
kind
of
morphed
into
another
issue
which
has
got
less
stuff
on
it.
But
I
thought
ben
was
actually
working
on
this
at
the
moment,
but
there's
no
one
assigned,
but
basically
what
I
think
we've
agreed
on
is
that
the
instance
label
will
become
it
won't
be
an
ip
anymore,
it'll
be
like
actual
in
in
the
vms
land.
C
It'll
be
like
the
actual
name
of
the
vm
like
fqdns,
it's
not
a
ip,
because
I
I
just
find
like
if
you're
looking
at
ips,
I
just
glaze
over.
Like
you
know,
I
don't
like
using
that
as
a
way
of
just
distinguishing
things
and
then
in
the
in
the
in
the
kubernetes
world,
it'll
be
the
pod
identifier
and
there's
like
a
little
bit
of
risk
when
it
gets
recycled
that
you
know,
you'll
get
two
pods
that
have
got
the
same
name,
but
it's
better
than
you
know.
C
Any
other
solution
and
ben
asked
around
and
other
people
are
doing
that
as
well,
and
then
that
way,
it's
quite
nice
because
everywhere
in
our
graphs,
we
we
use
fqdn
at
the
moment,
we'll
just
replace
that
with
instance,
and
that's
much
better,
because
instance
is
like
a
standard
for
prometheus,
where
fqdn
is
kind
of
like
our
own
label
and
it'll
have
the
ports
on
the
end.
But
I'm
I'm
not
that
bothered
about
the
port.
But
I
I
don't
want
to
ask
you.
You
know
I
don't
want
10
dot.
C
You
know
that
I'm
not
a
fan
of
that.
So
I
I
the
reason.
I'm
getting
a
bit
like
like
banging
on
about
it
is
like
I
see
I've
been
seeing,
merge
requests
of
people
adding
like
fqdn,
comma
pod,
name
comma-
this
you
know
on
like
piecemeal
on
on
graphs
because
they
they're
struggling
with
this,
and
if
we
just
kind
of
did
it,
you
know,
with
strategically
with
the
with
the
instance
label.
That
would
be
much
better
than
fixing
individual
graphs
one
at
a
time.
C
Yeah
yeah
yeah
yeah,
I
I
I
really
personally,
I
really
dislike
without
unless
it's
outside
of
a
width,
you
know
you
know
exactly
what
you're
removing
exactly
because
of
this,
because
you
added
a
new
label
and
then
suddenly
the
cardinality
of
your
recording
rule
explodes
and
you're,
not
controlling
that.
So
I
I
tend
to
think
it's.
It's
like
a
bad
practice.
E
Okay
well
I'll,
ask
ben
why
I'm
still
getting
a
limit
of
eight
three
three
three,
unless
that
I'll
double
check
the
numbers,
because
maybe
it's
like
there's
an
extra
three
in
there
that
there
wasn't
before,
but
otherwise
I'll
ask
then
like
what's
up
with
that
and
we'll
go
from
there.
Sorry,
like
I
said.
B
I'm not sure if I'd be on it much, and I also don't know where...
C
Do you want to do... well, actually, if you do it with Jupyter notebooks and pandas, it's very easy to do. So, okay...
E
Cool, thanks for that; that lets me know it's not me. But yeah, I see that... I feel the pain, yeah.
A
I do have one quick question, but I'm going to stop the recording, because it's about a customer and I wouldn't want it to be recorded. So I'm just going to stop the recording.