From YouTube: Scalability Team Demo - 2021-06-24
A: I've got the first thing, so I'll start. This is related to what was on the demo the other week, which I missed, about how, when we have low-frequency metric counters in Prometheus, they will effectively start at one as far as Prometheus is concerned. What that means is, when you try and take a rate from them, you need two samples: the counter needs to go up to two before there's an actual rate, because Prometheus doesn't treat missing as zero, for good reasons. So if it's low frequency, you might never get to two from a particular process scrape, and it might just look like the rate is zero when actually it's non-zero; it's just that we didn't start the count at zero each time. I think I'm explaining that right. So let me demonstrate the issue, I guess, and then share that.
A: So this is the current state. I only just started this Prometheus and the background process, so we can see a bit clearer here, and I'm just running the source code workers, just to simulate a Sidekiq shard, even though that's not a real shard. So we see the UpdateAllMirrorsWorker just came from nothing to one, and the TrendingProjectsWorker I scheduled manually also came from nothing to one. So if we take rates of those, you know, the rate is the slope, and there's no slope.

A: Even when the first job happens, there's...

A: So let me go fix that. First of all, I'm going to stop Prometheus, and I'll also stop Sidekiq, and then I'm going to enable this feature flag. I think it's possibly overly cautious to put it behind a feature flag, but I set the feature flag to be enabled by default to get around that, because the feature flag that I've added here will only apply on process start. All it does is, on process start, find the potential values of sidekiq_jobs_completion_seconds_count, or sorry, sidekiq_jobs_completion_seconds, the histogram, for the jobs that this process could run.

A: So it doesn't generate a series for every possible Sidekiq job; it just generates one for the Sidekiq jobs that this process can run. What else do I need to do? I need to do something with Docker. That's it.
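A rough Ruby sketch of the idea described here: on process start, walk the workers this process can run and touch every label combination of the sidekiq_jobs_completion_seconds histogram so the series exist from the first scrape. The helper names, the label set, and the use of the metric's `get` method to register a series are assumptions for illustration, not the actual GitLab implementation.

```ruby
# Illustrative sketch only; not the real GitLab code. Assumes a
# Gitlab::Metrics-style histogram and that touching a label set (via
# `get`) is enough to expose a zero-valued series on the next scrape.
JOB_STATUSES = %w[ok fail].freeze # assumed label values

def initialize_sidekiq_job_metrics(worker_classes)
  histogram = Gitlab::Metrics.histogram(
    :sidekiq_jobs_completion_seconds,
    'Seconds taken to complete Sidekiq jobs'
  )

  worker_classes.each do |worker|
    JOB_STATUSES.each do |status|
      # Register the series for this label combination without
      # observing a value, so rate() sees a starting point of zero.
      histogram.get(worker: worker.name, queue: worker.queue, status: status)
    end
  end
end
```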
A: This is my way of clearing my Prometheus data in the GDK: it's stored in Docker, and I just delete all my Docker volumes. So I've enabled the feature flag, and I'm going to start Prometheus.

A: So I'll see we've got nothing here just for now, and then I'm also going to get ready to schedule a TrendingProjectsWorker, but not do that just now, just to demonstrate that. So we'll need to wait for...

A: Cool, good demo. So here we've got some jobs running... oh no, they're not running, they were dropped as duplicates with the "until executing" strategy, because they've been deduplicated for some reason. There we go, right: so I've got zeros for all the possible combinations of labels that this histogram could have for this process.

A: So now, if I go and schedule a TrendingProjectsWorker and do that same query graph, we'll see that there's a bunch of zeros. Here's the UpdateAllMirrorsWorker, because that's a cron job, so it just runs regularly, and then once the TrendingProjectsWorker runs and gets scraped, that will go from zero to one. So that should solve this issue, but I'm only doing it for one metric, and there's probably a bunch of low-frequency metrics we have. Actually, I found one with the Plan team purely coincidentally.
A: They asked me about something related to service desk emails received, where they emit a metric when Service Desk does something, but Service Desk is again quite low frequency, and we've got a lot of Sidekiq processes that can pick that up, so per Sidekiq process it's very low frequency that that happens. So their rates don't match what they see in Kibana at all, because they're always starting at one. Yeah, there's the TrendingProjectsWorker, I think, so yeah.

A: I don't have a good answer to the general problem there, because to do this registration thing you need to know all the possible combinations of labels that you could have, because each combination of labels is effectively its own series. So, for instance, here we have job status labels on this, and we also have labels for the worker class and the worker queue.

A: So we need all of those to match up correctly for this to do anything, because otherwise we just create two different series, and then that wouldn't help. So I don't know if there's a better way to do this, or if it's just something I think we have to take care of, but yeah.
D: I mean, maybe we could... I don't know if it would work very well, but we have some sort of wrapper class around Prometheus, right, in the Rails codebase. So maybe we could structure the API of that class so that when you create a metric, you must submit the possible label values. But then I guess that assumes that the metrics get defined at boot, or... yes.
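As a hedged sketch of what is being suggested here, with made-up names rather than the existing Gitlab::Metrics API: a wrapper whose constructor takes the possible label values and pre-creates every combination at zero.

```ruby
# Hypothetical sketch of the wrapper idea; the class name, constructor
# signature and the zero-increment trick are assumptions.
class PreRegisteredCounter
  def initialize(name, docstring, label_values:)
    @counter = Gitlab::Metrics.counter(name, docstring)
    @label_names = label_values.keys

    each_combination(label_values) do |labels|
      @counter.increment(labels, 0) # create the series at zero
    end
  end

  def increment(labels, by = 1)
    @counter.increment(labels, by)
  end

  private

  def each_combination(label_values)
    head, *rest = label_values.values
    head.product(*rest).each do |combo|
      yield @label_names.zip(combo).to_h
    end
  end
end

# Usage: all label values must be declared up front.
# PreRegisteredCounter.new(:service_desk_received_emails_total,
#                          'Service Desk emails received',
#                          label_values: { delivered: %w[true false] })
```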
A: Yeah, the only thing I was thinking of was a more heavyweight approach where you create a different type, like we have a registry of metrics that you want to create on boot, and then you use that constructor rather than the other constructor if you want one of those. And then, yeah, I don't know, I haven't really thought it through, but there would still need to be, somewhere, some code that says these are all the label values this metric has, and how...

D: Well, I suppose we could also do something obnoxious where, if you try to define a metric that wasn't defined at boot, then we raise an exception, right. We could have...

E: I mean, if you make it that we encourage people to do it through some sort of class method, kind of like the worker attributes, where it's almost defined on the class, then that encourages it. Obviously people can get around it, but you know, if you make the API sort of like "these are the metrics for this controller" or whatever, then that...
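A sketch of the class-method approach being floated here, combined with the "raise if it wasn't declared at boot" variant. The module, method names and host class are hypothetical; this is not an existing GitLab interface.

```ruby
# Hypothetical sketch only. Metrics are declared on the class, a boot
# pass creates them (and could pre-register label combinations as in
# the earlier sketch), and fetching an undeclared metric raises.
module MetricAttributes
  def declare_counter(name, docstring, labels:)
    declared_counters[name] = { docstring: docstring, labels: labels }
  end

  def declared_counters
    @declared_counters ||= {}
  end

  def fetch_counter(name)
    raise ArgumentError, "#{name} was not declared at boot" unless declared_counters.key?(name)

    Gitlab::Metrics.counter(name, declared_counters[name][:docstring])
  end

  # Run once on boot; pre-initializing each label combination to zero
  # would happen here, as sketched above.
  def initialize_declared_counters!
    declared_counters.each do |name, spec|
      Gitlab::Metrics.counter(name, spec[:docstring])
    end
  end
end

class Projects::IssuesController # illustrative host class
  extend MetricAttributes

  declare_counter :issues_created_total, 'Issues created',
                  labels: { source: %w[web api email] }
end
```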
A: And now we do it for... but we almost do need to do it for that one, right? When are we doing it... because we're... I don't know, some combinations, I mean...

A: Yeah, let me share. Wait.

A: New feature categories: because we did it for every feature category, that caused a big problem, and because we have this product-of-a-product thing, where we have per method, per status, per feature category, it gets real big real quick.

B: And Andrew, for the one with... because we use this for errors, we get around it by recording a zero if it's missing, in the intermediate recording rules.

E: So does it do a division by zero, because the one rate, the error rate, goes up, but the total rate doesn't go up, or... what's the...?

B: Yeah, I don't think we had a division by zero. It was...
E: So just one kind of broader question is: would it be better if we just said we're going to do this properly, we can have all the combinations, we're going to preload them, and we just accept that soon we're going to have to start sharding Prometheus much more aggressively than we're doing at the moment and having lots of Prometheus instances? Then we can do things properly and we don't have to do that sort of slightly strange combination code.

D: Well, what about... I'm curious about this recording rule idea, because it is kind of silly that we create this combinatorial explosion, where the pods and the processes that we really don't care about also push up the cardinality, and the recording rules filter out stuff we don't want. So is there a way we can inject the zeros at the recording rule level and only work from there?
A: That would be nice. That's something that's really frustrated me as well: the metrics generated by the application can have quite high cardinality, and then, when you apply them across our fleet, we get a multiplication again of something that we don't end up caring about a lot of the time once it's...

B: Isn't it that, if I remember correctly, when we wrote some of those rules, if there's an operation rate that is higher than zero, we'll record a zero for the error rate? It's something like that we had, I think, and that was for stuff like the GitLab component aggregating rules.

A: ...value, and then there's no rate at any point, because there's no numeric change.

A: Maybe, but I didn't read the full GitHub issue about this, the closed one about having it assume zero. But as far as I understand, there were good reasons for not doing that, so it would probably give us other problems, yeah. And then the other thing is the solution that seems obvious but that you can't do: if you summed them before taking the rate, then this wouldn't be a problem, because you'd be aggregating across the fleet into one place, but that's also a thing that doesn't work, yeah.
E: Yeah, and the other way to consider it, and this is something we've discussed before, is that maybe the way to get around all these cardinality problems is just to have a really boring metric in the application: SLI successes and SLI total requests, and it's got no other dimension. Well, the only dimension it's got is basically the service level indicator component.

E: Debugging? Well, I mean, from a getting-it-in-there point of view it's fairly... it's not, you know... you'd have to have a middle... It was discussed in that service level monitoring v2 proposal, at the bottom, and there was a sort of thing in it where you could just have a yield block, if you want, and you just said, you know, this is my success criteria, and then, when it comes out of that block, it just counts the counter, and that's it, and it just has no labels.
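A minimal sketch of the yield-block shape being described, assuming made-up metric and helper names (the actual proposal lives in the service level monitoring v2 document): the block returns whether the operation succeeded, and the helper increments a total counter and, on success, a success counter, with the SLI name as the only label.

```ruby
# Minimal sketch; metric names, labels and the helper are assumptions.
def observe_sli(sli_name)
  succeeded = yield

  Gitlab::Metrics.counter(:application_sli_total, 'SLI operations')
    .increment({ sli: sli_name })

  if succeeded
    Gitlab::Metrics.counter(:application_sli_success_total, 'SLI successes')
      .increment({ sli: sli_name })
  end

  succeeded
end

# Usage: the caller states its own success criteria inside the block.
observe_sli(:service_desk_email_ingestion) do
  # `ingest_email` is hypothetical; any boolean success check works.
  ingest_email(email)
end
```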
E: It's really nice for us and for error budgeting, but from an application developer's point of view it's not. You know, they want to go and start looking into the dimensions and understanding, you know, what was it, a 302 or a three or a four, but...

A: I think there's definitely something there. I think for the service desk thing, for anything that we're alerting on, that probably is the way to go. For the Service Desk one it's more arguable, because it could become a metric that they're alerting on, but at the moment they're trying to use it as an informational metric, and they can't because it's so infrequent. So, I mean, maybe the answer is to make that an SLI, but...

D: I think what I'm trying to say, and also what I'm hearing here, is that we have different use cases for Prometheus. One is alerting, one is error budgets, and one is "I'm a developer and I want to see if a random thing happened in production" with a counter. So we can take different strategies for these different use cases, and for error budgets and alerting...
E: It's very easy to say, you know... at the moment there's kind of a little bit of magic where we look at the thing and then we know where to go in the logs. But with this it's just like: this SLI is tanking, and we know that when it's tanking it starts writing logs that have got the exact same labels on them, and then we can search for those logs and help with the diagnosis of the problem.

D: I'm just wondering, like... well, I think we started off with you saying, Andrew, that it's bad for debugging, and now you're saying this is good for debugging.

A: My feeling is... you know, we've got this MR, and I'm going to see how that goes, cardinality-wise and ease-of-use-wise. Then I might speak to the product planning team about their service desk metrics and see, if we do the same thing there, what does that look like? Does this look like a thing that we could reasonably generalize, or does this look like something where we just want to do it differently and have a different approach after all? Sorry.
A: Sorry, the metrics that the product planning team have created for Service Desk, that they asked me about, had the same issue where they were infrequent and the rates didn't match Kibana.

D: Right, okay, so that is... but that's not for alerting purposes, that's for informational purposes, like: how do people use our feature?

D: Right, but I think... okay, what sounds to me like the best idea right now, and maybe that's just the way I read the conversation, but the best idea sounds like: yes, initialize everything at zero, but at the same time do the best we can to push down the cardinality by going to the most Prometheus-friendly model for the case of alerting.

D: Yeah, and so eventually it'd be good if we could just... or maybe it would be good if we decide that we want to get to some point where we just have the lowest possible cardinality. And the other benefit that's in the back of my head is that once we count things in the application, that also means you can put limits in the application, and that's...
D: Yeah, and I think, from my limited view, this has been a pain point already with error budgets: the stage groups say "well, this thing is allowed to be slow because of X", and then we have to say "well, but the recording rule says it's not allowed to be slower than that, so go fix it", which is slightly inflexible, and it doesn't quite feel right.

B: There's an issue for that, on switching the way we do that now, like the whole thing for Apdex, to switch that from a histogram to two counters, which...

E: And if we can do that, we can start doing it with, you know, correlating the logs and the traces soon as well, hopefully. But the one thing I thought you were going to say, but actually you meant a different thing, was: if we start pre-initializing the values, we can also have CI checks to give people limits on the cardinality that we allow on metrics, because we can say, you know, pre-initialize and give us all the values that you expect for this thing, and then, you know, multiply those all out.
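The CI-check idea reduces to simple arithmetic once label values are declared up front: the series count is the product of the value counts per label. A toy sketch, with an arbitrary limit and made-up label values:

```ruby
# Toy sketch of a cardinality budget check; the limit and label values
# are made up for illustration.
MAX_SERIES_PER_METRIC = 500

def series_count(label_values)
  label_values.values.map(&:length).inject(1, :*)
end

declared = {
  method: %w[GET POST PUT DELETE],
  status: %w[1xx 2xx 3xx 4xx 5xx],
  feature_category: %w[source_code_management service_desk issue_tracking]
}

count = series_count(declared) # 4 * 5 * 3 = 60 series
raise "metric declares #{count} series (limit #{MAX_SERIES_PER_METRIC})" if count > MAX_SERIES_PER_METRIC
```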
D: It's an interesting capability, but is this a problem we run into in practice already, that we...?

D: I think what I like generally about this idea is that it just feels like using Prometheus the correct way, by pre-aggregating more and not trying to hold on to your precious data on the Prometheus server. You just need to count things and let go of detail; that's the philosophy of Prometheus.

E: But yeah, I mean, maybe we just skip the whole thing of changing the developer interface for adding metrics so that we can push them to pre-initialize, and just leapfrog that to the point where, on a class, you can pre-initialize your service level indicators for that class. So we kind of go next level.

D: We don't need to straight away force developers to do anything, because we just realized that we are the developers who do the worst cardinality.
E: Okay, so this came out of several different conversations with different people, and I felt, while they were being had, that Scalability should kind of be included in the conversation. So now you are. But it came out of several things. The first one that I think really kicked it off for me was that Craig Gomes opened up an issue about the value of the database peak performance issues, and there was a lot of discussion on that, and kind of...

E: The best thing for infrastructure to do is to do the investigation and get to an endpoint and a team in the stage groups where we can say, "hey, this is a problem". And actually, when you do that, often you'll find that there's already an infradev issue for that thing, and then, instead of having two issues that everyone's looking at, you've got it down to one issue, and you can say there are various different teams in the company who are now reporting...

E: ..."this is a problem", and you can get them to really focus on the prioritization. So that was where that came from. But then what came out of it was: well, it's really, really difficult to do this, and I tend to agree. If you see CPU, then... and this is all tribal knowledge... the next thing you'll do is you go to pg_stat_statements in Thanos, and you'll...
E: ...look for some patterns in there, and then you might go to the slow log, and then you might go and select some stuff from pg_stat_statements with queries to try and find the right query, and there's all of this magic happening along the way, and then you go to other Rails logs, and eventually you get there... and it's very difficult, and it's very difficult for people to grok all these different sources.

E: So that was the first thing. The second thing was that Jerry was saying it would be really nice if there was more correlation between the application and the database and what's going on in the database, which is actually a different side of the same coin. And then the third part, which Andreas was really talking about, is the attribution on the tables and having feature categories for tables, which I think we spoke about in this meeting as well.

E: Last week, maybe, yeah. So there are different sides to it, and so Craig opened up an issue, and because I've been thinking about it for a while, I did this really rushed sketch, and when I wrote it down it reminded me of that meme of the investigator with the board, like, yes, there are 20 different things and the wires between them, because that's what that thing I wrote in there was like. But basically, what I've done... let me share my screen.
E: Maybe... we've got two parts of that already done since yesterday, which is quite cool, I think, and I'll just explain the issue quickly as well.

E: So really, one of the big problems that I see is that we all talk about the Postgres query IDs, because we have those in Thanos, and so people say, "oh, we know that this query ID is problematic", but the only people that can translate those query IDs into SQL statements are DBREs and people who can access the primary database. So it's not a very friendly interface, it's not very democratized, and people can't get there.

E: So what I propose is that we provide people with a way to map from a query ID to a query. I played around with various different things, and then I realized that the easiest way to do this would be with a Fluentd plugin that occasionally polls Postgres and says: what are all your query IDs, and what are the queries for those query IDs? And then also, because we can do it now, we can say what the fingerprints are for those queries.
E: So we use this library called pg_query, and it's got a thing called fingerprint: it looks at the AST of a query and generates a fingerprint, and we're starting to use that as our unique ID for a query. It's not that good, because if you have queries with a different number of question marks in an IN clause they're different queries, but it's good enough. So what that will do is we'll poll the database and then we'll put that into ELK, and you can basically log into ELK and look up a query ID.
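A hedged sketch of the core of that plugin, outside any Fluentd scaffolding: read the query ID and query text from pg_stat_statements and attach the pg_query fingerprint, so the query ID can be looked up later in ELK. Connection handling and output are placeholders, not the merge request's actual code.

```ruby
require 'pg'
require 'pg_query'

# Sketch only: map pg_stat_statements query IDs to raw SQL and a
# pg_query fingerprint. Error handling and the Fluentd plugin wrapper
# are omitted.
def query_id_mappings(conn)
  conn.exec('SELECT queryid, query FROM pg_stat_statements').map do |row|
    fingerprint = begin
      PgQuery.fingerprint(row['query'])
    rescue PgQuery::ParseError
      nil # some statements may not be fingerprintable
    end

    { 'query_id' => row['queryid'], 'query' => row['query'], 'fingerprint' => fingerprint }
  end
end

# e.g. query_id_mappings(PG.connect(dbname: 'gitlabhq_production'))
#   => [{ 'query_id' => '-123...', 'query' => 'SELECT ...', 'fingerprint' => '02da7...' }, ...]
```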
E: You'll be able to get the query for it, and so developers will be able to do this off the bat. I put together a little merge request for that. It's pretty straightforward, it's just a little plugin, and it just polls... I think I've got it to hit every five minutes, because there's 5,000 queries, that's quite big, and I don't want to do it too often, and it doesn't change that often, so once every five minutes I'm sure is fine, but yeah.

E: If anyone wants to go and mock my terrible Ruby, this is a good place to do it, but Stan's already had a go at it, so that's cool. So that will give us a way... so instead of always just talking about query IDs and struggling to map them to actual queries, we've got that. Then there's the second part of it, which I think is going to be really interesting.

E: pg_stat_activity is a view of the currently running things in Postgres, so it says "right now, at this moment", and obviously we can only sample it. Really, it's not any good for constantly understanding things, but it's a useful sampling data source. We've had this thing called the marginalia sampler for quite some time, and effectively what it does is it collects... maybe if I just go back to here, and I go into here, this might explain it a little bit.
E: Deep links in GitLab. So what it does is, the marginalia sampler looks at pg_stat_activity, and because we have a Marginalia comment at the beginning of every query... this is the query here, I can see it.

E: Now we can do really horrible regular expressions which parse the marginalia and then tell us that right now there are 10 queries running that are for the API v4 jobs request endpoint, and there are five queries running at this moment that are for this other endpoint, and of the ones that are for the API v4 jobs request, they're all in lock contention states. So we kind of get the sample, and because it's just a postgres_exporter query, it just runs every 15 seconds, but I've found this view to be pretty useful.

E: This was a time series, and this was obviously during an incident. You can see that the number of active queries for the projects ID endpoint spiked up to 114 active connections on the primary database that were for that data, and then I just break it down here; you can see it was actually 180 at one point, so that's not good, and you can see what they're doing; you can see that a lot of them were idle in transaction.
E: So that's kind of a bad thing, and it's just this sort of view where you can break things down. How that ties into this is: what we can do is take that technique of sampling every 15 seconds, and we can extend it. At the moment we're just using the marginalia, but we will also take the query ID from the query, and we will fingerprint it with the pg_query fingerprinting technique, and then we can say that this query, with this query ID, is called by these endpoints. Or at least we've sampled it and we see that it happens a lot, because we'll record that in the logs as well, so we can say, you know, we saw this query a lot, and 95% of the time it was the API v4 jobs request.
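A rough sketch of that extension, with a simplified comment format and regexes as assumptions: for each sampled pg_stat_activity row, pull the endpoint out of the Marginalia comment and fingerprint the statement with pg_query, so the same fingerprint seen in pg_stat_statements can be tied back to endpoints. GitLab's real Marginalia comments carry more fields, and the real sampler lives in the exporter and logging pipeline.

```ruby
require 'pg_query'

# Sketch only; comment format, regexes and field names are simplified.
MARGINALIA = %r{/\*(?<comment>.*?)\*/}m

def sample_activity_row(row)
  comment  = row['query'][MARGINALIA, :comment].to_s
  endpoint = comment[/endpoint_id:([^,*]+)/, 1]&.strip

  {
    'endpoint_id' => endpoint,
    'state'       => row['state'],            # e.g. "active", "idle in transaction"
    'fingerprint' => (PgQuery.fingerprint(row['query']) rescue nil)
  }
end

# A row whose query starts with
#   /*application:web,endpoint_id:GET /api/:version/jobs/request*/ SELECT ...
# yields that endpoint_id plus the fingerprint of the SELECT, so active
# queries can be counted per endpoint and per fingerprint.
```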
E: And you know, we don't always know that. Where this will come in really, really helpful: for example, there was an issue where someone was saying that we were making too many calls to this query, and no one could decide whose problem it was.

E: It was kind of like a users table, and so people were like, "well, it could be a cookie", and there was a lot of fighting about it, and the theory was that there was one endpoint that was coming in and really smashing that query, but there was no way to do it. So we thought about extending this marginalia sampler, looking for that exact query, and then reporting on it.

E: But that was for slow logs... okay, right, thanks, yeah. It's actually in the same project. So, the difference: slow logs are great, but the problem with slow logs, as you know, is that there's lots of stuff that is very fast but very high traffic, which doesn't come out in the slow logs, and we don't get any attribution on that. So this is kind of a way of saying that this query belongs to this team, and we can...
E: ...we can do that. And then the third bit.

E: Basically, this is just my wish list of all time, which is: if we just enabled distributed tracing in production, so much of this stuff would be easier. And actually, weirdly, Igor's just enabled tracing for Gitaly, but using Google's Stackdriver tracing, which isn't very good, but it's better than nothing, so we're slowly making progress, because we have the request in there. But then the last bit, which Stan has just done, because it wasn't very hard...

E: ...you can go from something in pg_stat_statements to the traces that are generating it, and you can go to endpoints, because you know what the endpoints are, and you've really got full connectivity, or much, much better connectivity, between what's happening in the database and what's happening in the application than we've ever had before. Because we can just say: we know that this query was really bad, we know that this query is equal to this fingerprint, we know that this is the SQL...

E: ...so that's the SQL that was issued to the database, and obviously this is a bit difficult to grok, but we'll be able to add to it. What Stan's doing is he's adding the fingerprint here: he takes this and generates a fingerprint from it, and then we will be able to search traces by fingerprint. So if we have a database problem and we see a major spike in CPU, we should be able to tie that back to traces and endpoints very quickly.
A: I think the exciting part is the last part, about the distributed tracing. I was talking to Igor a bit about this yesterday, and even if it's just the...

E: The thing that's difficult is finding an infrastructure person, but actually, I think, you know, I think we should just do the Google thing, and then people will get the understanding of it and be able to... you know. I think we should just use Stackdriver and accept that we don't have people to focus on these things, and you know, we haven't got a... yeah.

F: It's been on the backlog for three years, as far as I remember, because just as we get to it, something more urgent pops up.

D: It sounds like it would be very worthwhile to somehow get it out there.
E: Yeah, so Igor was busy replacing... he has already started working on replacing LabKit's OpenTracing with Stackdriver. I forgot what it's called now... the Cloud one, OpenCensus, is the name of the API that Stackdriver tracing uses; it was Google's alternative to OpenTracing.

E: The good news is that, whenever it adopts OpenTelemetry, that is a much nicer API. The OpenTracing API is one of the nastiest APIs that I've had the experience of using as a developer.

E: Yeah, but having LabKit means that we can switch to that and drop it in, yeah. So that's the only part, and I asked Igor about this this morning, was whether Ruby has an OpenCensus library, and he said he thinks it does. Otherwise, that was the only thing I wasn't sure about, because obviously, if it doesn't, we lose a lot of the benefit, because there's a big black hole of...
E: I think, I mean, check in with him. You know, Igor's quite good at just competently getting through things and asking when he needs help, but it's worth asking whether he needs some assistance with that. Luckily, it's not that hard to add different endpoint APIs to the LabKit side of things, because it's a very high-level abstraction.

E: So it sort of, you know, instruments an HTTP request, and then the inside... you know, it's not a low-level API like OpenTelemetry or OpenTracing or anything like that, so you can replace them fairly easily.

C: I think it's also one of those things where, as much as we'd like to help with doing this, we do also have an awful lot in progress at the moment, and I'm always nervous about helping with another thing. I think if this was something we were going to contribute to, we'd need more space before we can say yes.
D: Yeah, I'm not suggesting we jump on this right now, but the fact that this has been stuck in development hell for three years, since Cape Town, and it's not happening, is not good. And it feels to me like it's close to what we're also doing with error budgets: it's about helping the organization scale by helping developers make better choices about how their software runs, and making it easier for SREs to understand what is going on.

D: And getting these tools to work right is a lot of work. I mean, just having something you can click on is step one, but then you find out that the data has holes or gaps, you can't correlate things, you can't find things. You get things like these fingerprints that Andrew was talking about. So once you have it, you need to make it good, and you need to keep it good. And right now...

B: This kind of tracing is something that developers have wanted since we started removing the thing that we already had, that I forget the name of now, but yeah.
F: Okay, so just to be clear: our mission is not to make developers happy. Our mission is to scale GitLab.com. If what we do on the path to scaling GitLab.com makes developers happy, that's great, but that's a secondary, tertiary, or whatever concern.

F: Okay, the second point is: we also have the observability team, which is supposed to be driving some of these items. Specifically, tracing was supposed to be one of the items they'd achieve, and last I checked, they also had site reliability engineering in their title. Now, who am I to say... I understand that there are many other, more important things that they need to handle. So what I would suggest here is figuring out with the observability team leadership, and they have leadership now, what their mission is, how...

F: ...how does that align with, you know, the project that you're mentioning here, and whether this is something in their remit. If it is, then great, they're going to have to prioritize it. If it's not, if there is no owner of this, then we can discuss how this fits into our own mission. And now, I understand this sounds like we're not prioritizing globally.
F: But if you check out all the projects that the team has been doing, we've maybe even been over-optimizing globally, right? We have been doing a lot of things that don't really move the dial as much as we want it to move, even though the team has been doing a great job.

F: So I would rather we go through being purposeful about this and making this a discussion between observability and scalability, and then, if we take ownership, we know why we took it, rather than because they didn't manage to get it done in three years. That would be my suggestion.

F: Again, it's absolutely fine; we can then do a combined project, none of that is a problem. It's more that, if we continue going into a habit of taking over items from the SREs because they are drowned by other work, I would rather figure out how Scalability can make the platform better for them, so they are not drowned in other work that is obligation-related.

F: Then... taking on the items that, you know, and I know, there is a dependency on, right? How are we going to do this if we don't know...

F: ...and so on and so on. And this is the part where we can discuss this a bit more; not discuss, but make a purposeful decision about it.
F: I'd be, for example, much happier if this team took on the task of Gitaly Cluster on GitLab.com running at scale. That would be a much more interesting thing to me. Whether it can run at scale or not is a different question, but that's the question I would like answered, rather than the tracing work, right, because this has direct impact on the application, direct impact on our capabilities of scaling, and has impacts all across not only our platform but other scaled platforms as well.

F: It's more focused on the application scaling side of things, right? And I'm not saying we shouldn't discuss this, so, Rachel, maybe you want to take an action item here to figure this out with Ken, but at the same time I don't want us to go in and just take on an item that they've been working on for a while.
B: With all the work that we've been doing... no, no, it's that it won't even take five minutes. It's just that it took longer than I thought it should have to write a line, a column, and that was because there are two kinds of tables in Grafana and Grafonnet doesn't have the new one yet. But since it's just JSON, you can do whatever you want. That made me think that maybe we could upgrade Grafonnet, but I didn't start that yet.

E: Somebody, actually a friend of mine, sent me a link: Grafana had a conference recently and they had a talk on a new way to do declarative things, because I think they're realizing that lots of people are using this now, sort of as a bit of a side project. I'll try and find the talk; I haven't had a chance to look at it yet myself.

B: Yeah, I think the biggest thing for us is going to be to switch over when they... I don't know where it's at right now, but they were going to start doing the generated thing, right, for performance, so it would always be in sync with what Grafana can do.