From YouTube: Loki Community Call 2020-08-06
A: The holdup was a query panic, a race condition we discovered that proved a bit challenging to pin down. But thanks to Robert Fratto for our last-ditch effort, when we just said, "Hey, can anybody help us?" So that's fixed, and that's exciting.
A: The only other thing: we just recently overhauled the boltdb-shipper code, and we're actually running that in our ops cluster, which is our biggest internal cluster; it sees, I don't know, a couple of terabytes of logs a day. We haven't had a chance to really scrutinize it for performance yet, but functionality-wise it seems good.
A: There's a big change in how the code was written from the last iteration, the one in 1.5. Functionally it should be the same, but we refactored it to make it a bit easier to work on. The one significant change I should mention here is that it now requires a 24-hour index period, and this is actually enforced in code: if you try to start with a non-24-hour index period, it will yell at you.
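For illustration, enforcement like that can be a plain validation check at startup. This is a hedged sketch of the idea, not Loki's actual code; `SchemaConfig` and `IndexPeriod` are hypothetical names:

```go
// Hypothetical sketch of startup validation for the index period;
// SchemaConfig and IndexPeriod are illustrative names, not Loki's API.
package main

import (
	"fmt"
	"time"
)

type SchemaConfig struct {
	IndexPeriod time.Duration // size of each index table's time range
}

// Validate rejects any index period that is not exactly 24h, mirroring
// the "it will yell at you" behavior described above.
func (c SchemaConfig) Validate() error {
	if c.IndexPeriod != 24*time.Hour {
		return fmt.Errorf("boltdb-shipper requires a 24h index period, got %s", c.IndexPeriod)
	}
	return nil
}

func main() {
	cfg := SchemaConfig{IndexPeriod: 168 * time.Hour}
	if err := cfg.Validate(); err != nil {
		fmt.Println("startup error:", err) // e.g. weekly tables are rejected
	}
}
```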
A: The reasons for this have to do with a change in how we upload the index files now. Previously it would re-upload and append to the same index file throughout the day, but that sort of mutable state in an object store is a bad deal. It makes it very hard for us to look at things like deletes, and it makes synchronization across components really, really difficult once we start talking about things like deletes or caching. So this now keeps immutable state in the object store.
A: So we upload a new index file every 15 minutes. The big difference is that now a query may have to download as many as (number of ingesters) × (15-minute periods in a day) index files, so there's definitely some optimization we need to do there. Right now that's done serially; it needs to be done in parallel.
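To put numbers on it: with, say, 5 ingesters and 96 fifteen-minute periods per day, a one-day query could touch up to 5 × 96 = 480 index files. A minimal sketch of the parallel version using golang.org/x/sync/errgroup (illustrative only; `downloadFile` is a hypothetical helper, and the counts above are made-up examples):

```go
// Illustrative sketch (not Loki's code): fetch many index files
// concurrently instead of one at a time.
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func downloadFile(ctx context.Context, key string) error {
	// ... fetch key from the object store to local disk ...
	return nil
}

func downloadAll(ctx context.Context, keys []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(16) // cap concurrent downloads so we don't overwhelm the store
	for _, key := range keys {
		key := key // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error { return downloadFile(ctx, key) })
	}
	return g.Wait() // the first error cancels the rest via ctx
}

func main() {
	fmt.Println(downloadAll(context.Background(), []string{"index/a", "index/b"}))
}
```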
A
Actually,
I
was
going
to
see
if
I
could
look
at
that
today
or
tomorrow
to
change
that
at
least
before
the
release,
but
there
isn't
really
an
easy
way
for
me
to
separate
those
changes
out.
I
do
think
functionally
they're
fine.
A
You
know
if
you're
running
on
a
really
big
instance,
the
performance
might
change,
but
I'm
not
totally
sure,
but
that's
why
this
is
experimental
so,
but
the
good
news
is
that
this
is
moving
forward,
a
more
stable
sort
of
better
platform
to
build
on
and
we're
gonna
actually
be
putting
a
lot
more
effort
into
optimizing
this
for
performance
now.
So
by
the
time
the
you
know.
Next
release
rolls
around.
It
should
be
getting
a
lot
closer
to
production,
ready.
A
And
then
I
just
listed
out
what
we
talked
about
internally
for,
like
the
work
that
grafana
labs
is,
is
focusing
on
for
q3.
I
did
mention
that
cyril
was
out
on
leave
for
about
five
or
six
weeks.
So
when
he's
back
he's
going
to
dig
into
the
log
qlv2
work,
we've
been
working
on
the
work
here
that
owen's
been
doing
for
alerting
is
going
to
be.
A
Basically
the
pr
is
up
and
review
it.
We
need
to
merge.
It
start
hammering
on
that
to
see
how
alerting
from
logs
works
and
kind
of
a
next
progression
from
that
is,
is
a
ruler
to
be
able
to
generate
metrics
from
logs
when
they're
ingested,
so
we're
going
to
start
taking
a
look
at
what
that
would
entail,
and
then
relatively
new
to
the
list.
Here
is
this
in
jester
right
ahead
log.
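In its simplest form, deriving metrics from logs at ingest looks something like the sketch below: match each incoming line and bump a Prometheus counter, so alerts can fire off a cheap metric instead of re-running log queries. This is purely illustrative of the idea, not the ruler design being discussed; `observeLine` and the metric name are hypothetical.

```go
// Illustrative sketch of deriving a metric from logs at ingest time
// (not the actual Loki ruler design discussed above).
package main

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

var errorLines = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "log_error_lines_total",
		Help: "Log lines containing 'error', by stream label.",
	},
	[]string{"job"},
)

func init() { prometheus.MustRegister(errorLines) }

// observeLine would be called for every ingested line of a stream.
func observeLine(job, line string) {
	if strings.Contains(line, "error") {
		errorLines.WithLabelValues(job).Inc()
	}
}

func main() {
	observeLine("app", `level=error msg="boom"`) // counter for job="app" -> 1
}
```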
A
This
has
been
spurred
by
a
kind
of
unfortunate
incident
where
some
poorly
some
some
poor
odds,
led
to
a
bunch
of
chunks
synced
at
the
same
time
that
were
very
full,
which
I
think
there's
some
tuning.
We
need
to
do
on
how
the
flush
code
works
too,
but
it
led
to
several
ingestors
out
of
memory
crashing
and
that
just
every
time
that
we
have
ingester
crashes
or
anyone
does
it
sort
of
reminds
us
that
we
should
have
a
right
head
log.
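For context, the general write-ahead-log technique (a generic sketch, not the design Loki later adopted): append every incoming record to an on-disk log and fsync before acknowledging, so data still in memory can be replayed after a crash.

```go
// Minimal write-ahead-log sketch (illustrative only, not Loki's design):
// append length-prefixed records and fsync before acknowledging a write.
package main

import (
	"encoding/binary"
	"os"
)

type WAL struct{ f *os.File }

func Open(path string) (*WAL, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &WAL{f: f}, nil
}

// Append durably records an entry; callers only ack after this returns nil.
func (w *WAL) Append(rec []byte) error {
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(rec)))
	if _, err := w.f.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := w.f.Write(rec); err != nil {
		return err
	}
	return w.f.Sync() // fsync: the record survives a crash after this point
}

func main() {
	w, err := Open("ingester.wal")
	if err != nil {
		panic(err)
	}
	defer w.f.Close()
	_ = w.Append([]byte(`{"stream":"app","line":"hello"}`))
}
```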
A: That's exciting; I'm excited for 1.6. It's been a long time; I bet it's been about three months now, which is really long.
A
Every
time
man
yeah,
on
the
one
hand,
like
the
releases,
are
generally
pretty
significant
for
loki.
On
the
other
hand,
they're
kind
of
too
far
apart,
but
every
every
time.
It's
the
same
story,
though,
like
I
start
getting
ready
to
make
the
release
and
we
either
think
we
find
a
bug
or
find
a
bug
which
is
not.
I
don't
know,
maybe
that's
every
time,
maybe
not,
but
more
likely.
What
happens
is
oh
man.
A
It
would
be
really
nice
to
have
this
in
the
next
release,
because
everyone
really
like
this,
but
you
know
if
we
merge
it
today,
then
we
need
to
at
least
run
it
for
a
while
before
we
cut
a
release
from
it
and
then
like
two
days
later,
we
do
that
again
and
then
two
days
later,
we
do
that
again
and
at
some
point
it's
like.
Oh,
it's
been
like
three
weeks
now.
We
should
probably
have
cut
this
release
anyway.
Can.
B: Can we put in some release cadence, like Cortex or Prometheus? If a feature doesn't make it, it doesn't make it; it just needs to wait six more weeks.
C: I think we could probably just promote some of our internal releases, because we've generally vetted those through all of our environments, right? We have associated branches on the Loki GitHub for them. I'm not sure we need to do one every week, that seems a little too much, but I think it would really help us. We already have a lot of that process and infrastructure in place to actually take those and promote them, right?
A
I
mean
we
already
do
right,
like
those
those
they're
released
as
k,
you
know
27
or
whatever
our
internal
numbering
is.
I
actually
put
in
docker
hub
explicitly
to
tell
people
not
to
use
those
and
the
biggest
reason
for
that
is
like
if,
if
if
it
goes
wrong
for
us-
and
we
tend
to
like
like
which
doesn't
happen-
often
right,
but
if
we
just
scrap
a
release
completely
because
it
had
problems,
there's
no
way
to
communicate
that
to
people.
C: So we could say we want one release every month, or two months, whatever; you can figure that out. Then we just choose the closest internal one that overlaps with that time period, and we don't have to directly use the k-release, k-tag, k-branch, whatever, right?
C: That would make it a lot easier for us to do this more regularly, because it seems like a really small step we're missing to get them out as official releases. I think a lot of the work around this will actually come from the promotional side, like the write-ups, which generally falls disproportionately on Anne's head.
A
Yeah
right,
what
like?
What
do
we
include
for
release,
notes,
and
I
think
we,
our
internal
process,
should
probably
be
doing
a
better
job
of
this.
Anyway,
we
have
some
other
reasons
that
would
drive
having
some
better
understanding
of
what's
been
in
a
release.
A: So if someone's not willing to take a new release every week or month or whatever, the burden on them to jump versions is higher, and to see what's changed is higher. For a lot of people, even a couple-month release cycle was maybe already too aggressive, right? It depends on your...
A: It depends on where you fall on that spectrum. So having more formal releases, with bigger write-ups and release notes, and not having them so often, seems nice to me, but also having a way for people to get access to the interim builds that we're already making.
D: May I share my experience? We have here in OpenShift, for example, a release train, and I think to sustain a release train you at least have to be confident enough that your CI/CD can sustain it. The thing with trains is, there is a deadline; you put things in for the deadline, and if things are not finished by the deadline, then you need a process or tooling or whatever to remove them from your code base. So this brings some overhead.
D: For example, in OpenShift, in any of our operators, even if you are not part of the core product, you cannot merge things until you pass two big gates: you have the review, and then the "are you in the train or out of the train" decision. And if you are out of the train, you will have changes that stay there, and you need to rebase and keep up until you get the gate opened, and this gets far, far worse from the perspective of developer velocity.
D: If you take into account patch releases for older releases, at some point you may say "we support the last two releases of Loki." Whatever comes up, security fixes or whatever, you will need a policy, and then you will have to say: how do I survive these patch releases?
D: So I think everything starts with confidence in the CI/CD we currently have. I'm not experienced enough to tell how far along we are here in Loki, but the more process you want, because you want cadence, the more tooling and the more confidence you should have in the systems before you go out with that cadence.
D
So
I
think
what
I
see
currently
observe
is
we
do
open
the
pr
and
sometimes
apr
is
not
only
a
feature.
It's
also
importing
things
like
updating,
cortex
and
yes,
the
code
compiles
the
test
pass,
but
maybe
the
one
or
the
other
glitch
that
we
import
from
cortex
maybe
bring
a
regression
in
performance
and
stuff
like
that.
So
we
may
start
there
to
see
how
do
we
catch
these
things?
First,
and
then
we
see
how.
How
fast
are
we
defined
in
the
cadence
yeah.
A: Yeah, so the way that works now is we promote through our internal environments, environments with basically increasing traffic load, and we monitor, because there isn't any automated tooling for performance on Loki currently. I don't know, we've had internal discussions around this, and with other projects too.
A
Like
can
you
do
this
or
how
would
you
do
this
like
at
some
point,
especially
writing
go
as
a
programming
language
like,
I
think,
there's
no
substitute
for
actually
having
the
software
run
in
an
environment
where
people
are
using
it
and
you
have
we're
a
monitoring
company.
So
we
have
a
lot
of
tools
for
monitoring.
So
that's
a
direction.
A
This
doesn't
mean
that
can't
process
can't
be
automated
within
using
those
those
environments,
though
right
like
and
we've
actually
talked
a
lot
about
that
too,
like
can
we
automatically
promote
through
environments.
A
I
still
think
that
sort
of
lends
itself
to
having
like
what
you're
saying
owen,
which
is
you
know
more
or
less
like
certified,
builds
that
come
out
every
you
know
regular
interval,
which
is
what
we're
doing
now,
that
we
can
promote
for
people
to
consume
in
a
better
way,
and
then
I'm
I'm,
I'm
not
sure
if
we
want
to
at
that
point
make
those
the
like
what
gotham
saying
like
that's
just
the
build
right,
like
that's
a
release.
Every
couple
weeks
we
have.
A
I
just
worry
that
that's
that
becomes
noise
for
people
that
are
running
loki
in
most
of
the
environments,
I'm
familiar
with
right
like
if
something
has
a
new
version.
Every
two
weeks
like
start
to
stop
paying
attention
to
it
right,
like.
B
I
I
kind
of
agree
with
you
that
maybe
two
weeks
is
too
quick
because
there
will
be.
There
are
some
breaking
changes,
but
if
there
are
no
breaking
changes,
there's
no
reason
not
to
do
monthly
releases
yeah,
but
maybe
shorten
our
three-month
span
to
six
weeks.
A: That's what we have, right? Because we don't have infrastructure right now that says, go upgrade from, you know, version 1.5 to version 1.6 and see what happens. I mean, we have upgrade guides for that, but we don't really have any infrastructure in place to test it.
A
I
think
the
long
and
short
of
it
is-
and
the
interim
I
mean
we
could-
we
could
probably
have
a
certainly
you
know
not
push
out
to
three
months
right
like,
but
I
am
interested
in
seeing
if
we
want
to
make
a
the
beta
test
train
for
loki
users
right,
like
people
that
want
to
have
regular
releases
and
are
you
know,
want
to
stay
more
current
with
the
project
that
we
can
make
the
internal
releases
just
re-tag
them
every
like
once
a
month
or
once
every
couple
weeks.
D
I
mean
it
depends
what
you
put
always
in
the
release
in
general.
The
far
we
get
and
need
to,
let's
say,
put
more
maintenance
work
in
into
the
thing
the
the
faster
people
will
ask
for
releases,
because
there
are
minor
fixes
here
and
there
security
fixes
here
and
there
that
you
need
to
bring
them
out
in
the
fast
candidates.
D
Like
volte
b
shipper
needs
some
maturity
before
we
decide,
or
we
can
always
work
fast
and
say
there
is
something
like
declare.
Things
experimental
declare
things
behind
the
future
gates
so
that
people
can
use
in
production
but
at
their
own
risk.
D
So
it
should
be
explicit
what
to
expect
from
which
piece
of
look
you
want
to
use
yeah.
So
I
I
for
my
sides,
for
example,
I'm
a
user
of
mobile
list
and
volte
bishop,
and
I
know
what
to
expect
it's
full
experimental
and
it's
not
battle
proof.
So
I
am
okay
if
I
take
a
master,
bra
master
release
and
and
try
to
figure
out
what
how
things
work
out,
but
I
can
pay
for
this.
Others
may
be
more
conservative
from
that.
A: Okay, I think you missed the first part, though, Peri, but you're probably interested in the boltdb-shipper: we merged a PR last week which was a pretty sweeping overhaul of the internals.
D: This is a really interesting PR. I'm more of a silent reader of this PR currently, lacking the resources to jump on the train. But since we are here: we now have a small split-out in our team so we can focus on things like that, so me and three others can concentrate more on upstream work here. So if you feel overloaded with bringing boltdb-shipper forward, you can scream; we have more resources than just me currently picking up work here.
A
Yeah,
absolutely
our
we
just
promoted
it
to
our
biggest
internal
cluster.
We
actually
set
up
a
new
cluster
running
it
to
compare
against
the
what's.
Bigtable.
Is
the
index
we're
using
in
our
other
cluster?
So
we
have
a
some
instrumentation
to
compare
queries
between
each
and
and
basically
start
improving
performance.
A
You
know
initially
the
so.
The
biggest
change
with
this
br
was
forcing
or
associate
api's
forcing
a
24-hour
index
and
not
mutating
the
objects
once
they're
uploaded.
The
first
iteration
was
re-uploading
index
file
every
you
know,
15
minutes
with
basically
the
current
index
and
new
additions,
but
mutating
the
state
in
the
object
stores
makes
things
like
caching
and
deletes
extremely
difficult
to
impossible.
So
having
a
you
know,
an
object
store
once
the
objects
are
uploaded
that
we,
you
know
by
principle,
never
change
them
simplifies
some
of
that
stuff.
So
that
was
a
driver
there.
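One simple way to get that immutability, shown here purely as an illustration (the key layout is an assumption, not what the PR actually does), is to give every upload a brand-new object key so nothing is ever overwritten:

```go
// Illustrative only: derive a unique object key per upload so index files
// in the object store are never overwritten or appended to.
package main

import (
	"fmt"
	"time"
)

func indexObjectKey(table, ingesterID string, uploadTime time.Time) string {
	// e.g. "index/loki_18115/ingester-3-1596700800.gz" — hypothetical layout
	return fmt.Sprintf("index/%s/%s-%d.gz", table, ingesterID, uploadTime.Unix())
}

func main() {
	fmt.Println(indexObjectKey("loki_18115", "ingester-3", time.Unix(1596700800, 0)))
}
```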
A: Ultimately, we'll probably need a compactor to go through and smush those back down. Maybe, though, depending on how big they are and whether we can actually shard them and download them in pieces, we might not want one index per day. But one index per day sort of makes more sense for the size of files we want to work with, and for the requirements around how deletes would work, and the tooling we'll eventually have to have for deletes.
D: I think we need to figure this out over time, like those who came before us did; it happened, for example, with TSDB. They started with a specific amount of time for which they cut chunks and, let's say, store them and their index, and then they figured out that two hours is a good setting.
D: Let's say, for cutting these chunks. We will see what people do with their indices in terms of cardinality, and then we can figure out the better path, the better optimum: should we go for 24 hours, or less, or leave it, or give users an option to declare their own time frame if it fits for them. So I think 24 hours, from what I read in the PR, is fine; at least it solves this problem.
D: What if I have a lot of data and I want to restart this thing? It makes that possible again; you don't need to wait until the data gets retired. And the immutability is definitely a win if we later want to do anything like iteration or things like that, since you don't need to care about anyone mutating this beast.
A
Yeah
so
yeah.
I
think
this
is
a
step
in
the
right
direction
and
we're
just
now
running
this
at
a
reasonable
scale
to
see
what
it
looks
like
right
off
the
bat
one
of
the
problems
is
we,
we
download
the
index
files
in
a
single
thread
that
needs
to
be
parallelized,
because
we
basically
wait,
but
that
should
be
relatively
easy.
A
It's
it's
tough
like
this
is
the
problem
with
that
1.6
really
some
of
the
stuff
I'd
like
to
fix
for
that
release,
but
maybe
the
you
know
real
path
here
is
just
ship
it
and
like
well,
it's
still
marked
experimental
and
we'll
do
either
we'll
promote
internal
releases
with
those
fixes
sooner
or
we'll.
Just
do
a
1.7
and
a
faster
clip.
Then.
D: Don't bother; it's in good shape to start, and we can only endorse this. We already have a branch using this for our smoke test currently, to see how things work, and we will go with it in our internal staging cluster when you are out with 1.6. And if it's slow, it's slow; that's okay, it's single-threaded, we know what it is.
A
Yep,
but
that
is
going
to
be
a
big
improvement
source.
We're
going
to
be
our
goal
is
to
really
drive
that
query
performance
to
be
as
good
as
we
can
well,
ideally
good
or
better
than
what
we
can
get
at
a
big
table.
But
that
might
improve
might
take
some
time.
But.
C: So, one of the things I would like to do in the next few months: there's an integration testing framework in Cortex, and I'd like to see some of that in Loki as well. Would you be interested in that? Does it make sense to map some of the smoke testing you're talking about over into it?
D
Yes,
like
that,
obviously,
especially
because
it
becomes
just
a
hand,
crafty
thing
here
and
if
we
can,
I
mean
there
is
one
stream
here
that
I,
since
you're
speaking
this
up,
that
we
were
talking
last
week
in
our
grooming.
D: We would like to bring in some per-component benchmarks: spinning up this beast and starting to write some benchmarks, because testing the whole thing on a cluster for small things and small changes just... you lose a lot of time.
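Go's built-in benchmark support makes per-component benchmarks like that cheap to add. A minimal sketch (`parseLogLine` is a hypothetical stand-in for whatever component, ingester append path, index lookup, and so on, is under test):

```go
// Minimal per-component benchmark sketch using Go's testing framework;
// parseLogLine is a hypothetical unit under test, not real Loki code.
package component

import (
	"strings"
	"testing"
)

func parseLogLine(line string) []string { // hypothetical unit under test
	return strings.Fields(line)
}

func BenchmarkParseLogLine(b *testing.B) {
	line := `level=info msg="GET /api/prom/query" duration=12ms`
	b.ReportAllocs() // allocations are often the interesting number
	for i := 0; i < b.N; i++ {
		parseLogLine(line)
	}
}
```

Run with `go test -bench=. -benchmem` to get ns/op and allocations per operation for just that component, without spinning up a cluster.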
D: Yeah, but there's a lot of curiosity; we don't have anything for this yet in the Loki repo itself, right? It's just hidden in the vendored dependency called Cortex.
C: Yeah, there's a number of small things that I think we could really get benefits from right out of the gate, like things that compile but that you can't actually run. So we should have really, really small smoke tests for running with the example configurations, syncing those with our documentation, so that we know.
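A smoke test like that can stay very small. A hedged sketch (the binary and config paths are assumptions; Loki does expose a /ready endpoint on its HTTP port): start the server with an example config and wait for it to report ready, which at least proves the shipped config boots.

```go
// Sketch of a tiny smoke test; the binary path and config path below are
// illustrative assumptions, not paths from the Loki repo.
package smoke

import (
	"net/http"
	"os/exec"
	"testing"
	"time"
)

func TestExampleConfigBoots(t *testing.T) {
	cmd := exec.Command("./loki", "-config.file=./docs/example-config.yaml")
	if err := cmd.Start(); err != nil {
		t.Fatalf("failed to start: %v", err)
	}
	defer cmd.Process.Kill()

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		resp, err := http.Get("http://localhost:3100/ready")
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			return // server came up with the example config
		}
		time.Sleep(time.Second)
	}
	t.Fatal("server never became ready with the example config")
}
```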
A: Cool, great, all right. Yeah, I think technically we should... I think it's still a separate call, isn't it, the bug scrub?