From YouTube: Loki Community Call 2021-06-03
A: I didn't upload last month's YouTube video yet, so that's less funny unless you were here last month. But there are a few things on the agenda. If anyone wants to talk about anything, just add it to the list. Otherwise, unfortunately, we're missing Danny and Cyril, who did the bulk of the work on two of the things we want to talk about, but we'll do our best to represent their efforts. So Owen is going to start by talking about recording rules.
B: Yeah, I'm going to do what I do best — that is, take credit for other people's work. Danny authored a PR recently, one that has been in the works for a while, that introduces recording rules into Loki for users of Prometheus.
B: You can do this for a bunch of reasons — probably most commonly to pre-process queries that may be particularly expensive, by being able to calculate them continuously. Say, an every-one-minute aggregate of some expensive query across all these logs: turn it into a set of series, write those to a Prometheus-compatible backend, and then you can just use them from within Prometheus from there.
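As a sketch of what that might look like — names, labels, and the selector here are illustrative, and the exact rule-file layout should be checked against the Loki docs for your version — a recording rule file follows the familiar Prometheus shape, with a LogQL metric query in `expr`:

```yaml
groups:
  - name: log-derived-metrics
    interval: 1m                      # evaluate continuously, once a minute
    rules:
      - record: app:error_lines:rate1m
        # an expensive aggregate over logs, pre-computed on a schedule
        expr: sum by (app) (rate({env="prod"} |= "error" [1m]))
```

The ruler evaluates the expression on the interval and remote-writes the resulting series to whatever Prometheus-compatible backend is configured.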
B: So that's one of the aspects, and you can get progressively more complex from there, with a few other, more advanced use cases I'm probably not going to go into on this call. But hopefully we'll be adding some documentation around that soon, and we'd love to get people using it. It's largely built on top of the existing ruler component, and a lot of the internal work was already there, which is really handy, so we feel pretty confident pushing this forward.
A: One would be the duration that the query can run for. I'm going to feel a little stupid here — I'm trying to think of what the syntax looks like for that — but it would be the `for` clause on an alerting rule. There's not one of those for a recording rule, is there?
A: The question would be: if someone says "alert me if I see this in my logs for 30 days," you could do that now, and it's definitely going to break something.
B: Okay, yeah — that one is definitely something we're probably going to look at adding. The overarching sentiment here is that if you were a bad actor, there are a couple of ways you could mess things up, which we'll probably do our best to close holes on.
A: Yeah, so the `for` duration has the problem that instant queries are only evaluated as one query — that's currently not parallelized — so you'd be running a 30-day query every so often, or something like that. That one, I think, is pretty straightforward to limit; it's pretty easy to understand the intent of an alerting rule and work around it.
A: This other one, I think, is more interesting, because someone had a really good use case in an issue I was looking at yesterday. Within the Prometheus world, we put limits on the range part — the part inside the square brackets. Somebody wrote a recording rule or an alerting rule that was looking for a count of, basically, the words "login succeeded" or something like that over a six-hour period, and the lack of that would trigger an alert.
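A hedged sketch of that kind of rule (the stream selector and labels are made up; note also that if no lines match at all, the inner query returns no samples rather than a zero, so an absence alert like this needs care — check the current LogQL docs for the recommended pattern):

```yaml
groups:
  - name: login-activity
    rules:
      - alert: NoRecentLogins
        # intended to fire when fewer than one "login succeeded" line
        # was seen over the last six hours
        expr: sum(count_over_time({app="auth"} |= "login succeeded" [6h])) < 1
        labels:
          severity: warning
        annotations:
          summary: No successful logins observed in the last 6 hours
```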
A: So it's kind of an interesting idea: hey, nobody has logged into my service in six hours — that probably means something is broken, or something bad has happened; it'd be an uncommon case. So I was thinking we would want to limit that, because recording-rule and alerting-rule type queries aren't really intended to be intensive, since they run frequently. I don't have a great answer here; we probably will add limits, it's just a question of what they default to.
B: We do have a couple of limits that are more administrative. You can basically specify a group of rules to be evaluated together, and so there are things like the max number of rules that can be within a group or, I believe, the max rule groups per tenant — and all of these are per-tenant overrides.
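Roughly, those administrative knobs live under the per-tenant limits. The option names below are from memory and may differ by version — verify against the configuration reference:

```yaml
limits_config:
  ruler_max_rules_per_rule_group: 20    # cap on rules evaluated together in one group
  ruler_max_rule_groups_per_tenant: 50  # cap on total groups a tenant may define
```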
B: Basically, you can specify where your Prometheus backend is, and the ruler will consume the remote write API and just write the series upstream for you. I probably don't want to get too much farther into the weeds, unless, you know...
B: Yeah, I think we should probably add a little bit of documentation, both around why you would want this — what does it enable you to do? Why is it attractive? What do we use it for? — as well as, you know... excuse me.
D: Sure — I'll ask Karen to come out of hiding for a moment. Loki has a brand new technical writer devoted entirely to Loki OSS. She's still getting up to speed, but she will be figuring out what the documentation needs to be better. Part of that is looking at the material herself, because she's a very experienced writer; some of it is talking to people; and some of it is listening to the community. Anyway, Karen, feel free to introduce yourself and tell us a little bit about yourself.
C: I've done lots of technical writing, as well as computer science teaching, in my past. I'm now with Grafana Labs; I came from VMware, by way of Pivotal Software.
C: That's my background, if it helps you. I put in my first review of a Loki docs PR yesterday or the day before — a pull request that Cyril put in, and we're going back and forth on it now. I hope to hear all your suggestions for things that you want me to work on, especially your priorities.
A: Thanks — yeah, that's great, Karen! Actually, while you were talking, I put in there a loose order of what I called focus — maybe that's the priorities of what we want to do. To me, that currently starts with the general organization, like a review. We basically took one major pass at the docs, probably a year and a half ago — Robert Fratto did that — and we got through quite a bit of it. But one challenge with open source projects, which is great and tough: we get a lot of contributions to the docs, and it means that over time I feel like the voice becomes a little less consistent as more and more people add things here and there, and I think we've lost some of our organization.
A: And then — maybe I'll put this in here too — what I keep calling day-two operations. Things like using schemas — because that's nuanced in this context — store schemas, operating Loki over a longer period. I know one of the things we get asked about a lot, which is probably not going to be in scope here, is resource sizing and scaling, and guides around operations, just because there are so many ways and places you can run Loki. I'd ultimately like to crowdsource that initiative — have us build a good benchmarking suite that people can use to provide that info. But yeah, that's what we want to do: the goal would be to help people that are new, as well as the secondary stuff. I think our reference docs are okay; it's just hard to find things — a lot of the options. Here's a good one: what permissions do I need on my S3 bucket? That's in our docs, but you'd be hard-pressed to find it, I think. So we're going to work on that.
B: I'm glad — this has been something we've talked about for so long, and it's always been hard to prioritize, and that's definitely on us in a lot of regards.
D: Speaking of which — hey, community, if you have any thoughts you want to share on the docs, now is a great time.
E: Yeah — so I'm doing the whole CloudWatch Logs from ECS Fargate thing. Oh, you can see my dog with the donut of shame in the back — it's a donut, not a cone, yeah. So we're doing the Fluent Bit FireLens shipper from Fargate on ECS, and...
E: And I discovered along the way that Fluent Bit's mode of operation is that whatever log it gets, it doesn't ship it as-is; it wraps it in kind of an envelope, so you get this extraneous log layer on top of your message. In key-value format it doesn't look all that bad, because it's just `log=` followed by a big set of quotes with the rest of your giant log message in there.
E: If you use JSON format and your logs happen to be JSON, it double-JSON-encodes it, which is incredibly obnoxious. And then, when I'm looking at stuff in Loki, I can hit the `json` filter and unwrap it, and then you have to run it through a second one to get your fields out. I was showing that to someone and they were just like, "I'm not going to remember to do that." I said: that's fair.
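For reference, the two-step unwrap being described looks something like this in LogQL — assuming Fluent Bit put the original line in a `log` field (field and job names are illustrative):

```logql
{job="fargate"}
  | json                       # parse the Fluent Bit envelope
  | line_format "{{.log}}"     # replace the line with the embedded JSON string
  | json                       # parse the actual application log
```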
E: What I don't know is if there's some way — the problem is Fluent Bit is going straight to Loki — and I don't know if there's some way to say: drop this crap at the top of the log. I thought maybe I could redirect it to Promtail, but then there's the whole — now I have the Promtail ordering problem, I think.
B: Can you throw a link in there? I'd love to take a look.
E: I thought maybe I could pull it out — if you provide a custom config file to AWS, they'll run that whole thing, so you can do a pipeline in it as well. But that log wrapper gets added on as they ship it, so there's nothing I can do until it's at the recipient. And so, like I said, I don't know if there's something like: throw it at Promtail and have Promtail carve up the log.
E: I'll try to find that again. I think it wasn't specifically in the docs; I think I finally found somebody else asking, and a Fluent employee responded, like, "Oh yeah, that's how it ships."
A: All righty — thanks, Zach. So I want to just briefly mention — unfortunately Cyril's not here to talk about it — I was putting another PR in here: Loki has custom retention now. Well, kind of: it's merged into main; there's no release with it yet. This brings the ability, through config-defined stream selectors, to set the retention for different log streams to different time periods.
A: I'll give everyone a quick recap of how retention works today: Loki, with the exception of the filesystem store, basically leaves retention as an exercise for the storage component. The index is pruned by the table manager after some time period, and the objects themselves need to be set with a TTL or some policy to remove them. What changes here is that Loki will now be responsible for, and capable of, both pruning the index and removing objects from the object store.
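A sketch of the new config surface — the selector and periods are examples, and the option names should be checked against the retention docs once released:

```yaml
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  retention_enabled: true          # compactor now owns index pruning + object deletion

limits_config:
  retention_period: 744h           # global default (~31 days)
  retention_stream:
    - selector: '{namespace="dev"}'
      priority: 1
      period: 24h                  # e.g. dev logs kept only one day
```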
A: So you won't need to configure a TTL or anything on the object store — in fact, you won't want to, to avoid it deleting something that isn't officially deleted yet. The compactor now has a component that, at regular intervals, will mark objects, remove them from the index first, and then come through and delete them from the object store. We've been testing that out, playing around with it, and making sure we can get good enough performance out of it, and so far things have been working really well for us. There's a documentation PR open — the one Karen just mentioned — that we're working through, and we've got to figure out what the communication looks like for how you transition to this from your current config, etc.
A: But, you know, in the near future — I'll pivot to that in just a sec. The other thing I'll mention first: included with this is also support for deletes. You can delete a stream for a targeted time range — you provide stream selectors and a time range. Honestly, this was mostly implemented alongside custom retention for compliance reasons — GDPR specifically, or any other kind of compliance reason.
A
So
if
we
start
keeping
things
longer
than
30
days,
we're
going
to
need
a
way
to
remove
things
on
requests
and
that's
really
the
intent
of
the
delete
support.
As
a
result,
it
basically
deletes
up
to
24
hours
ago
or
put
another
way
if
you
put
into
delete
requests,
that
request
is
basically
held
for
24
hours
and
then
the
delete
is
made.
A: That does give you the option to change your mind if you wish not to delete it. What I'm interested in here — somehow we've got to figure out a way to get community feedback; maybe there's an issue for it — is whether there's enough drive to further extend that delete to be up-to-the-minute accurate. Some of the work for that already existed. The complexity that comes with up-to-the-minute delete support is that, because of the caching layers in Loki, you need both to be able to invalidate caches at several levels and to do query-time filtering — you have to pass that delete filter through all of the different levels. So this implementation is...
A: I would say it's kind of the simplest, least-effort version, and it's a good starting point. But I'd be interested in understanding how important it would be to people to further pipe that through the rest of the components, to have deletes take effect within minutes. I guess I'm basically interested in whether there's a real use case for that — like, somebody uploads or scrapes a password or something like that?
A: I mean, yes, you can go delete it, but does that really help you? You've still got to change it. If you check a password into GitHub, you can go remove it, but you probably should still change it. So I'm not sure. Maybe there are other use cases that people have for deletes that make it more compelling, I guess.
A: It's a little bit up in the air, and I'm trying to avoid what I normally do, which is say we'll do it next week and then a month goes by. So let's say we're probably targeting end of June — so, realistically, early July; maybe kick your July off with a new Loki release. That would be recording rules and custom retention as the major features, as well as some other fixes. There is actually one particularly useful fix that we found.
A: When a query is cancelled, we discovered that the goroutine fetching chunks was not cancelled, so over time you'd end up with a number of goroutines that had pulled a chunk out of your object store with nowhere to send it, accumulating until — in our case — you would eventually, probably, crash your querier.
A: It was, I guess, interesting in that it went relatively unnoticed until we had a very particular situation — a workload someone set up that more or less caused the queriers to crash every 20 or 30 minutes instead of every four or five days or something like that. So there it was significant; in general it may not be hugely impactful — like I said, the normal occurrence of it was actually fairly slow — but it does mean you're sitting on memory that you might want back. The k50 release is probably the one people would want for that. Right, Owen?
A: The internal releases that we use — we do push those images. As we've talked about before, we don't have a great mechanism for how we'd communicate and curate that, but we'll put a link to that image in here. So, if anybody's overly concerned about this, you can run k50.
A: I guess the caveat that comes with that: off the top of my head, I don't know what the scope of differences is. We haven't created the upgrade guide for 2.2 to k50, but there are no breaking changes that I know of. I'm just advising a little caution there — maybe don't jump right to prod with that.
A: Okay — I see your question about log archiving for compliance reasons. One way you could deal with that in the near future is using custom retention: tell Loki to have longer retention on certain log streams. There's one question here that I think a lot of people come to. Typically, what most people do today — and we've talked about whether or not to support this in our agent for this reason — is they use Fluent Bit or Fluentd, which can send a gzipped version of the logs directly to S3 or to a bucket. That's a simple way to meet compliance requirements. It doesn't really help you in that you can't easily query those logs, but it checks that box, which is probably what most people need — if you had a reason to go dig the logs out, they're there.
A: It's not that hard. Loki's storage format itself is already pretty well suited for long-term retention — we store the index and the chunks right alongside each other in an object store, and you could do that for years and years. So I would say take a look at the custom retention stuff, because that's what it's intended to do: offer longer retention periods, effectively for compliance reasons. You'd be able to say: streams that match certain criteria, keep those for 30 days or whatever, but other things you can set to be much longer.
F: So, if I understood correctly, there was a setting for how far back Loki is going to search when you put a term in without a date range. I just want to double-check that my understanding of the documentation is correct: if we keep 10 years' worth of logs in our bucket and someone puts in a term, it's not just going to start searching through 10 years' worth of logs for that term if someone makes a mistake with the time range or something like that. That was my concern.
F: When I was thinking about it: we can leave the objects and the indices, and we can change that lookback — if I remember the property name — and say that it won't look back more than, say, 30 days into the indices. Is my understanding of that documentation correct?
A: That Loki instance would only ever look back 30 days; it would just fail queries that tried to query farther back in time. So changing that config just changes what Loki lets itself do. The only one you really want to watch, in terms of durability, is the actual retention flags — make sure nothing is being deleted on you.
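The setting being described is, if I recall the name correctly, `max_query_lookback` in the per-tenant limits — values here are illustrative:

```yaml
limits_config:
  max_query_lookback: 720h   # queries farther back than ~30 days are refused
  max_query_length: 721h     # also caps how wide any single query's range can be
```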
A: Something we just ran into that I've got to look at — the thing that's tough is query timeouts. Loki has four places you can configure timeouts, and now Grafana has a data source proxy timeout — all 30 seconds, I believe; I don't remember if that one was new, but we just recently changed it in our environment because of what we were seeing. Timeouts are a nightmare, and then, if you throw nginx in the middle, it has proxy timeouts; if you throw anything else in the middle... it's just...
A
I
intended
on
writing
a
blog
post
for
this,
because
I
I
set
up
like
this
kubernetes
cluster
in
my
house,
and
I
have
nginx
sending
to
like
a
traffic
ingress
or
whatever,
and
it
was
like.
You
know.
I
know
it
took
me,
the
better
part
of
a
day
to
figure
out
how
to
get
queries
to
run
for
longer
than
30
seconds
and
in
this
case
nginx,
when
you
set
up
an
upstream
in
the
load
balancer
section
of
it
like
it
defaults
to
having
a
30
second
timeout.
A
But
it
has
this
really
nice
behavior,
where,
if
the
thing
times
out
it,
auto
retries
the
next
one
in
the
pool
for
you
silently.
So
I
can
see
the
query
being
resubmitted
in
grafana,
but
I
had
no
idea
who
or
by
loki
rather-
and
I
have
no
idea
who
was
doing
it.
So
I
don't
know
that's
an
nginx
thing,
but
I'm
out
through
a
nightmare.
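For anyone hitting the same wall, the relevant nginx knobs look roughly like this (the server names and timeout values are examples, not recommendations):

```nginx
upstream loki_queriers {
    server querier-1:3100;
    server querier-2:3100;
}

server {
    location / {
        proxy_pass http://loki_queriers;
        proxy_read_timeout 300s;     # allow long-running queries past the default
        proxy_send_timeout 300s;
        proxy_next_upstream off;     # disable the silent retry-on-timeout behavior
    }
}
```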
A: There's the query engine timeout and then there's the query timeout. So there's...
B: I think timeouts are kind of illustrative of a larger problem, which is coordination of certain configurations. There are also things like: Loki uses gRPC internally between components, and you can specify how large these requests and responses can be — and you can set them to different things on different components, which is probably not what you want. So there are a lot of places where we could probably derive sensible configurations.
A: This just came up in one of our Slack channels too, related to the drift between defaults and how we run our environments, and how we go about changing things that default one way but that we run another way — because that's a bit of a breaking change in some folks' consideration. Like: some flag defaults to false, but we set it to true in every environment.
A: Should we just change the default to true? Given our environments and the way we run things, it seems intuitive to match the defaults to that, but it isn't one-size-fits-all — especially if you're running single-binary configs versus microservices — and there's also just the nature of someone doing an upgrade and all of a sudden things being much different. So, I don't know, there's an interesting debate there. I tend to lean toward steering our defaults to match the way we use Loki, because that typically makes it easier for us to answer questions — most people's operations are similar to ours, so the behaviors are similar. But I don't know if there's a right answer here.
E: Yeah, I've got to find it. I wrote it down there at the bottom — this thought just sort of popped in; I was just curious.
E: Are you maybe going to go that way? The recording rules are very specific to the ruler and to calculating for alerts, but is there some thought of being able to use that in a more generalized fashion — pre-computing what might be a complex or expensive query, especially if you've got a bunch of dashboards running these?
A: That's actually exactly what recording rules provide. Prometheus, as of its most recent version or two, lets you remote-write to it — or you can remote-write to Cortex or Thanos or any other compatible backend. So recording rules let you configure a Prometheus remote write endpoint and generate — basically synthesize — metrics from the logs as they're ingested.
E: Seeing that it's part of the ruler is what just made me immediately — yeah.
B: The ruler itself is a component that started in Cortex — which Loki derives a lot of its architecture and code from — and it originally did recording rules in addition to alerting rules. When we added alerting rules to Loki, Prometheus didn't yet implement the remote write API, so you couldn't write back to Prometheus, and if we had wanted to implement something like recording rules in Loki at that time, we would have had to do something bespoke — just for Cortex, for instance.
B
So
recently
you
know
ie
yes,
four
months
ago
or
something
prometheus
merged.
The
remote
right
api
into
itself.
So
a
prometheus
could
accept
the
remote
right
protocol,
which
it
previously
used
to
send
data
from
a
prometheus
to
you,
know
another
prometheus
or
something
like
cortex,
and
that
basically
enabled
us
to
do
this
in
a
much
more
elegant
and
simple
fashion,
but
like
a
lot
of
the
plumbing
in
the
code,
was
already
there
to
enable
recording
rules.
It's
something
that
we've
wanted
to
add
for
a
long
time.
It
just
kind
of
became
very
feasible
recently.
E: Yeah, I think just because of the way the ruler was originally presented as being for alerts — that's how I read it. So when you said recording rules, I thought, oh, these are just for alerting, not the general case. Okay, yeah.
B
So,
for
what
it's
worth,
reporting
rules
and
alerting
rules
are
like
it's
very,
very
similar
yeah.
E: ...a functional thing of Loki in that sense. No, that's awesome, because that then handles — I was using an nginx-logs-to-Prometheus exporter, parsing nginx logs into Prometheus metrics. This would handle that.
A: There you go; right, yeah. Thanks for asking that question, though — I think it helps us realize we need to make sure we're clear about this when we start promoting it. I put a tagline in there: "Ruler — not just for alerts anymore." That's all we've got on the agenda. Does anybody have anything else? Otherwise we'll chat for a few more minutes.
B: No — I was really interested in talking about defaults. This is something I think about a lot too. A lot of times when we set defaults, especially for new features, it's based on our initial understanding; we'll say, "from our initial understanding of this new feature set..." But at the same time, there tend to be better ways to run Loki, and it's a hard balance to achieve and to figure out which side we should err on.
A: Yes — so, Grafana: I just upgraded some stuff to Grafana 8, and a number of my plugins didn't load, because in Grafana 8 the default no longer allows unsigned plugins; you have to explicitly allow them. And so, you know, I have mixed feelings here: as the maintainer of a project, I love breaking defaults, but as the user of a project, I hate it when people break them.
C: Could you break it into categories? Defaults having to do with timing, say — if it's five minutes as opposed to 10 minutes, do you really consider that a breaking change? As opposed to enabling or disabling some feature.
A: Well, here's a good example — or a couple of good examples; we changed these with 2.0 and when we did boltdb-shipper. Years ago, Loki never had a config called `chunk_target_size`. We added it, and for the longest time it never appeared in the default configs, because it's kind of an operational behavior — it changes memory consumption, changes a few things. It has crept in over time, but there's not a cluster we run that doesn't set that config now, and I suspect there are people out there that maybe don't — I mean, now we have new configs for that. Similarly, we use Snappy as the compressor for all of our clusters; the default is gzip. Gzip is slower but more compression-effective, so it creates smaller compressed objects.
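Concretely, the two settings being contrasted sit in the ingester block — the values shown are in the ballpark of what the docs suggest, not a universal recommendation:

```yaml
ingester:
  chunk_target_size: 1536000   # aim for ~1.5 MB compressed chunks
  chunk_encoding: snappy       # default is gzip: smaller objects, slower to (de)compress
```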
A
So
it's
cheaper
to
operate
loki
at
the
storage
perspective,
but
you
know
for
most
people,
especially
us
storage
is
not
the
expensive
component
of
running
loki
and
getting
query.
Speed
and
performance
is
more
important
to
us,
so
we
use
snappy.
What's
the
right
default
there
you
know,
and
do
we
ever
risk
changing
it
because
it
you
know
it's
a.
A: The schema upgrade we did from v9 to v11 was another example, though that one has the different problem of being harder to roll out. But to answer your question, Karen: there definitely is a difference between some of these things. Changing a timeout, which has almost no impact, seems fine — or likely would be — but something that might change the memory overhead of your running setup, or your CPU-versus-storage costs... those I'm not really sure about. There are enough different use cases for Loki that I'm not sure which one we want to optimize for. In my mind it's the — I don't know — 1-to-10-terabyte-a-day cluster size; that probably covers, you know, 50 — I'm making up numbers now.
A
I
have
no
idea
what
percentage
of
users
that
covers,
like
you
know,
if
you're
in
the
hundreds
of
gigs
a
day
like
I
don't
know
that
this
stuff
adds
up
to
make
a
huge
difference
to
you
if
you're
over
10
terabytes
a
day
like
give
us
a
call,
because
we're
curious
to
see
how
things
are
working
for
you.
You
know
in
that
one
to
10
terabyte
a
day
I
feel
like
is
where
we
tend
to
size
a
lot
of
our
operating
instances,
which
is
where
a
lot
of
our
configs
are
tuned.
E: I think maybe two things. If you've got a configuration set and you're accepting the default, it seems like you're opening that door regardless — the default value may change; you're kind of accepting whatever the software provides for you. And since with most of Loki, in order to run it — Tempo is the same way — you have to override the defaults...
E: Well, you know, I'll go change my config so that we set it explicitly — I don't want the new value — but as long as there's some lead time to it, yeah. And if you're pulling from main and running that, you've got a difference going out there too. Then I think the second part that might help with some of that, still on the subject of defaults: even if you're like, "this works for us, but maybe we don't want to make it the default" — going back to Karen and the documentation, things that would help her.
E
You
said
examples
is
like
hey
here's
kind
of
some
settings
that
you
know
are
you
running
like
a
small
cluster
where
you
know
these
are
your
inputs,
and
this
is
your
query
rate.
Here's
some
good
setting
target
settings
to
start
with
and
then
here's
what
a
medium
size
is,
and
then
you
know
here's
what
grafana
runs
right
would
be
awesome.
A: Yeah — actually, to be honest, there's not a lot that we tune, so it's not a huge number. Here's one — I don't even know what the implication would be, but we talked about gRPC: the default gRPC message size limit is 16 megabytes, I think, and I think we run our clusters at 100 megabytes or something like that. We could make that the default; I have no idea if that has an implication for anyone — I don't think it would — and it would just help people not run into that problem, because you can hit those limits. Ideally, I would like there to be the least amount of config possible to use Loki, especially for the single-binary case: here's a config file with the fewest lines you need to actually run this thing, and then people can discover options as they need to do things outside the entry-level case.
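For reference, the gRPC message size mentioned above is controlled in the server block — exact option names are worth double-checking, and the matching client-side limits between components should be raised to the same value:

```yaml
server:
  grpc_server_max_recv_msg_size: 104857600   # 100 MB, up from the ~16 MB default
  grpc_server_max_send_msg_size: 104857600
```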
A: Yeah — tell your friends, tell your enemies: come check out the Loki community call, all right? We were talking about this — we're starting to get more interest, it's getting more exciting, so we're going to start doing a better job of promoting it, and of coming up with an agenda more than a few minutes ahead of time, and see if there are more things we can do to engage in discussion, because that's actually super fun for us. So, Zach, we appreciate it — your input on this has been really fun — and the couple of folks we've seen regularly — Roger, I know you've been here a few times — love it, thanks. Thanks to everybody else for coming and asking questions, and we'll see you all in a month. Hopefully we'll have another Loki release to talk about.