From YouTube: Loki Community Meeting 2021-12-02
A
Then I was saying that I'll record — but, doing the usual spiel: if you would not like to be recorded, you may mute yourself and turn off video, or leave; kind of whatever is most appropriate for your situation.
A
Okay, so yeah. This is probably relatively light today. After 2.4 we've been kind of turning a lot of the Loki team's efforts in different places, and so there's not a ton of updates here, but I think we can probably start off with Steve, who's been aggregating Loki rules into a single configuration.
C
D
On what I sent? Apologies — no.
D
No, it's fine, I can. I was trying to look up the name of the sidecar piece, but fundamentally, you know — Loki: brilliant. 2.4 has made a massive difference, and I was speaking to Jen about some of the potential improvements. You know, we build Kubernetes as a platform, so we stick Loki in each of our clusters for nice, localized logging.
D
The current design means that we need to know all of our Loki rules when we build the cluster, rather than, you know, like the Prometheus model, where we can add rules dynamically. So the initial question was whether there was any intention of putting a Loki operator in, with a custom CRD, and then the follow-up question was actually: I can see a pattern to use a ConfigMap generator as a sidecar.
D
A
There are rules APIs, yeah — we implement a lot of the same endpoints that Prometheus itself actually does, and you can configure Loki to use a backend which is dynamically reloaded. You can also PUT stuff against it, GET, DELETE — all your normal CRUD methods — and, sorry...
D
It's more whether you can reload Loki. So there's a — I forget the exact name of the sidecar — the kiwigrid sidecar, that watches ConfigMaps of a specific specification and merges them into a single ConfigMap, which would obviously be a known value that could be mounted into the Loki ruler pod. But the output of that action — changing the ConfigMap — is a webhook, and it's whether Loki has a webhook to reload, or whether you need to kill the pod and start it again.
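For reference, the sidecar being described is typically deployed something like this — a minimal sketch assuming the kiwigrid/k8s-sidecar image; as it's usually used, it collects the contents of labeled ConfigMaps into a shared directory rather than literally merging ConfigMap objects. The label, paths, and version are illustrative, not from the meeting:

```yaml
# Sketch: a kiwigrid/k8s-sidecar container gathering rule ConfigMaps
# that carry an agreed-upon label into one folder the ruler can read.
containers:
  - name: rules-sidecar
    image: kiwigrid/k8s-sidecar:1.15.1   # version illustrative
    env:
      - name: LABEL
        value: loki_rule          # hypothetical label selecting rule ConfigMaps
      - name: FOLDER
        value: /etc/loki/rules    # directory shared with the ruler container
      - name: RESOURCE
        value: configmap
    volumeMounts:
      - name: rules
        mountPath: /etc/loki/rules
```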
A
Okay — so this particular model would need to be reloaded; the whole process would need to be reloaded here. Actually, I need to double-check that. Off the top of my head, we might hook into the same code — like, if you change the file on disk, it will...
D
A
For instance, we run a lot with, like, a GCS bucket as a backend, which means that that will be dynamically reloaded — and also dynamically resharded across a number of rulers, so that you can kind of horizontally scale that subservice. For using ConfigMaps specifically, I'm not sure off the top of my head.
A
No — although the idea of reloading regularly, based on the local file system or, like, a mounted ConfigMap, seems completely reasonable to me, right? Since we already regularly reload things like backend object store buckets for rule configurations, we could do the same thing on localhost, and I wouldn't see a problem there.
A
D
Yeah — so if we can confirm it works, I'm happy to put a PR in on the loki-distributed chart, which is the one that we use, to put in an optional sidecar to scrape ConfigMaps together into a single ConfigMap. That means the ruler can operate similarly to Prometheus.
C
Yeah, it should reload. So if you mount the ConfigMap as a file and use it as a local-file rule store, it looks like we do rescan that periodically — there's a timer that does it. I don't think there's a webhook, but there's a timer that runs. I'm not sure if that interval is configurable, but it should reload the rules if the ConfigMap changes. — Awesome, I'll give it a go.
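In config terms, the setup under discussion — pointing the ruler at a local directory where the merged ConfigMap is mounted — looks roughly like this sketch (paths illustrative):

```yaml
# Sketch: ruler reading rules from a locally mounted directory,
# e.g. where the merged ConfigMap is mounted.
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules   # mount the rules ConfigMap here
  rule_path: /tmp/loki/rules-temp  # scratch directory the ruler can write to
```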
A
Tune into whatever that timer is — that would basically become your delay interval. The kubelet has a timer as well, for, like, the maximum ConfigMap resync interval — I can't remember exactly what that's called — but that, plus whatever our internal timer is, would be the maximum lag, yeah.
C
It is configurable — it's called the poll interval. Wonderful. I don't know what the default is; I'm guessing 5 or 10 seconds. Let me see.
C
One minute, actually — that's pretty long. So the default poll interval is one minute for rules, but that could be... I don't see any reason why that couldn't be five or ten seconds, although I guess — I'm sure there's a trade-off there too, with what it does. But yeah, I think that if you load it as a ConfigMap it should auto-reload. Let us know how that works, though, because it seems like a reasonable thing to be able to do if it's not doing it already — it seems like most of the code's already there.
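That rescan timer is the ruler's poll interval; per the discussion it defaults to one minute:

```yaml
# Sketch: how often the ruler re-reads its rule storage
# (default 1m, as confirmed above).
ruler:
  poll_interval: 1m
```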
A
Again, not the specific question that was asked, but this is kind of a nice tie-in — something that I've wanted to add for a while. Oh, I need to note that: we've got this new low-hanging-fruit tag in the Loki repo that I'm trying to use to bootstrap things of limited scope, and we've got some possible to-do work here.
A
The ability to basically do this stuff dynamically and, you know, persist it to, like, an object storage backend — that sort of thing was added, and so there's a flag to enable the API. But the filesystem backend actually does not have an implementation to support it. We could definitely enable the API on top of a filesystem backend — I've gotten this request a number of times, and it'd be totally reasonable.
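Putting the pieces of this discussion together, a sketch of a ruler backed by object storage with the rules API turned on (the bucket name is hypothetical):

```yaml
# Sketch: ruler API over an object-store backend, as described above.
ruler:
  enable_api: true       # exposes the Prometheus-style rules endpoints
  storage:
    type: gcs
    gcs:
      bucket_name: my-loki-rules   # hypothetical bucket
```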
C
The Loki operator — so, I guess I don't want to spoil the... we'll have a better announcement around this once it's sort of in action, but the awesome folks at Red Hat have built a Loki operator that they're using with OpenShift, and that's going to be upstreamed into the Loki project as a general Kubernetes operator. So, you know, there'll be a lot more to come on this over time, in terms of, like, how it evolves and how it gets used.
C
But it's primarily targeting probably the exact use case that you're using, Steve, where they basically want to make it easy to provision — you know, relatively, I would say, sort of small-to-medium-size — Loki cells in an easy fashion. So I guess, stay tuned on that. I was going to say we will be upstreaming it — they're contributing it to the project. So we're excited about that, and we'll see more about that soon.
A
Does that at all answer your questions, Steve?
D
No — that's brilliant! Thank you. Thank you very much. All right.
C
I put this on here just to note a couple of things that we've noticed. I don't know what to do about the first one here: it has been inconvenient for a lot of folks, because we enabled the WAL by default, and that directory wasn't mounted in the —
C
So if your config doesn't specify where the WAL directory is, it just defaults to a directory named "wal", which isn't going to be mounted by anybody and isn't going to be writable on almost any container — like, our Loki containers can't write to that. So that has been a bit frustrating for people; you basically need to specify the directory. And yeah, the newer configs that we've built, that are part of the opinionated configs, do this — but if you have an existing values file for your Helm chart, for example, that has a config in it...
C
It's not going to do this, and you're going to get this error. I think we've merged some PRs to make the error more descriptive, to try to improve this, because I know the error is a bit nasty too — it just says it can't make the WAL, and it doesn't really help you understand what to do. So if you run into that — if you've seen that — it basically means you need to specify in the ingester config where you want the WAL to be, and you have to make sure that you have it mounted as writable in your container.
C
Usually people just specify the same volume where you put, you know, the boltdb index files and perhaps the chunks files — like, /loki is already writable by the container on all new containers, so you can put it there, but —
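In config terms, the fix being described is roughly this sketch — the directory is illustrative; anywhere writable inside the container works:

```yaml
# Sketch: explicitly pointing the WAL at a writable, mounted path
# instead of the relative "wal" default.
ingester:
  wal:
    enabled: true
    dir: /loki/wal   # must be writable inside the container
```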
A
C
I mean, yeah — the reason we broke it, or what broke, I guess, is because we now enable the WAL in code, and in config, if you don't specify the WAL directory... So, like, we could force the WAL directory in code to one that we know works in the Docker image, but I'm not sure if that has a better side effect.
C
It would basically just be changing the default from "wal" to "/loki/wal", and that would fix it. But what's annoying about that is that the Helm chart mounts all of the stuff at a different location than the Docker image does. So the Docker image mounts everything at /loki, but then the Helm chart overrides that to /tmp — or no, /var/local or something like that. It's a mess! I don't know how to unmess it, honestly, but it's just —
C
So
sorry
I
mean
I
wish
I
knew
a
better
way,
but
we
want
the
wall
enabled
by
default,
because
I
think
it's
a
value
to
everybody
like
they
should
have
it.
Unfortunately,
defaulting
the
directory
becomes
really
really
difficult,
because
the
helm
chart
has
different
defaults
than
the
json,
which
has
different
defaults
than
the
image
itself.
And
I
don't
know
how
to
reconcile
that.
C
I
updated
the
helm
chart
to
the
the
default
values.yaml
file
that
comes
in
the
helm.
Chart
now
specifies
the
wall
directory
to
match.
Whatever
was
specified
in
that
particular
helm
chart
because
it's
actually
different
between
the
loki
home
chart
and
the
distributed
home
chart
the
rabbit
hole
gets
deeper,
so
each
of
those
were
updated,
but
typically
most
people
have
a
values,
file
that
they've
curated
and
don't
probably
go
reference
to
see
if
the
one
that
ships
with
the
helm
chart
changed,
and
I
think
that's
what
caused
a
lot
of
the
trouble
is.
D
You hit this pain then, Steve, you're saying? — Yeah. We adopted 2.4 the day it was released, so it took me a couple of hours to get it working.
C
Sorry about that. What I guess is: is there something that we could do to help others at this point still? Or, you know, where did you go look that you didn't find the —
D
So, I mean, first principles: if you ship MinIO in the distributed chart, and ship it with the boltdb-shipper turned on, the config will be closer to what most people should be using, and then you wouldn't need the extra copy in the readme. But I think there's just some missing continuity in the docs — you can read what something does, but the configs obviously are dependent on each other, the WAL being a new concept that was introduced. And yeah.
C
I'm terrible at Helm, and so whenever I'm messing around in the Helm charts, I don't know — I feel like I make them worse rather than better, sometimes. But the readme is a good one, though; I should look at that.
D
B
C
Absolutely, yeah. The Helm charts — we've always been in a bit of a challenge here, because internally we run Loki with jsonnet, and that effectively makes us the only ones in the world that do that... largely; that's not true — I mean, a lot of people use jsonnet — but I know we're a bit unique there. So we don't have a lot of hands-on with the Helm charts, and, like, the loki-distributed chart was very graciously created and donated by a community member, and for the most part the charts are community-maintained.
C
So, absolutely — I would take you up on that, if you'd like to be involved with helping with the Helm charts; and helping me understand how to not break them would be nice too.
C
Or slack.grafana.com — there's a Loki channel — or just, kind of, DM me, and we can figure out how to get you involved. — I would really appreciate that, yeah. That's —
C
Another thing that has come out is that we did miss some configs in 2.4 for getting the most parallelism out of Loki. So — I'm not sure, so — parallelise_shardable_queries, I know for a fact, is still defaulted to false, and that's going to be a big one, because it's going to enable sharding on shardable queries, which allows them to be split basically 16 times further. So that's a huge config.
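For reference, a sketch of that setting against the 2.4-era config layout:

```yaml
# Sketch: turning on query sharding, which 2.4 still defaults to false.
query_range:
  parallelise_shardable_queries: true
```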
C
It probably matters more when you have multiple instances running, because, you know, if you're only running one instance, parallelization is largely limited by your CPU cores and throughput — but you're going to want that on. We will update this in the next release, so these will get fixed — probably in 2.5 — to be basically correct defaults. And I pasted a couple of others here; I should have added more notes to these. Maybe I'll circle back around to the split queries by interval.
C
We actually probably run 30 minutes in some of our clusters, too. You can go less than that, but there are diminishing returns on splitting by time, because you end up duplicating a lot of work. I would be hesitant to run 15 minutes, honestly. Should we just say 30?
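In 2.4-era configs that tunable looks roughly like this sketch; 30 minutes is the value the speakers converge on (in later releases the option moved, so check your version's docs):

```yaml
# Sketch: time-based query splitting; smaller intervals hit diminishing
# returns because the sub-queries re-read the same chunks.
query_range:
  split_queries_by_interval: 30m
```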
A
C
Yeah, the sort of way that works, right, is: think about how much data is in one chunk. If you tend to, you know, have max chunk age at one hour or maybe two hours, and you don't write a lot of data, you don't end up filling chunks. If you split by five minutes, it just means you're going to take that two-hour chunk and break it up into five-minute pieces, and every query is going to be operating on the same chunk — it's not really going to be faster.
C
If you have streams that are writing 10 chunks a second, or something — that would be a huge amount — then splitting those on five-minute intervals would effectively parallelize a lot better. I mean, someday Loki will be smarter at this — you know, we'll know the throughput of a stream and be able to split automatically — but for now it's a bit of a tunable, and 30 minutes is probably a pretty reasonable default.
D
Sorry — my voice. Can I ask a question? — Yeah, please. — So, as I said, we build and ship Loki as a platform, so we've got some consumers, and we're taking on some new internal consumers on this internal platform — and one of our consumers turned their chunk age to 24 hours, because they read the docs that said unfilled chunks are bad.
C
2.4 included a number of those sort of opinionated changes that we've learned from running Loki for reasonable use cases — like, you know, we force the chunk target size now, and the max chunk age. So this is an area that we want to put more time into, that we're trying to get a lot better at, to make it easier to just have Loki work.
C
The
optimization
that
they
would
get
from
the
information
that
they
read
is
going
to
offset
the
sort
of
trouble
you
might
have
like.
It's
not
that
big
a
deal
to
flush,
small
chunks,
but
it's
depends
a
bit
on
the
storage
you're
using,
and
you
know
I'll
talk
about
this
in
a
second.
Maybe
if
we
have
time
at
the
end
about
what
we
just
started
conversations
about,
how
we're
going
to
iterate
on
the
storage
to
make
some
of
these
situations
better.
But
what
ends
up
happening
is
that
same
thing
right?
C
C
Well — that's not totally true, I would say. Like, you know, one or two hours is probably fine, and increasing it more than that is probably not going to gain you a lot in the long run. It's not that big a deal if you flush, you know, six chunks versus two, but it's going to be better for querying if you keep it at one or two hours for chunk age.
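The knobs under discussion, sketched with what I understand the 2.4 defaults to be (worth double-checking against your release):

```yaml
# Sketch: the opinionated 2.4 chunk settings referenced above.
ingester:
  max_chunk_age: 2h            # keep at one or two hours, per the advice above
  chunk_target_size: 1572864   # ~1.5 MB compressed target
```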
D
I think the reason I ask about this is because I think it's related to split queries — but also, yeah, it would be about communicating the defaults: somewhere to look them up, whether that's in the Helm charts or somewhere, to see what the standards are. I think some of the defaults are undocumented for 2.4 — I think they were internal defaults — but if you had a config already where you'd set them, I think what you had before would be kept over the way the new config does it, wouldn't it?
C
This discussion has come up — as, sort of, a next step — which would be a config auditing tool, or something that we would do to answer that question for everybody already running: like, what does the... Because we simplified the config, and in doing so — exactly what you said — we internalized it; we didn't really document it. So I'm sure Karen's ears are burning, if she's still on the call, because she's our tech writer and has been asking us for a long time to help solve some of these problems.
C
Let me take that as a good note here. If nothing else — we actually recently talked about what good reference implementations would look like; we've started some work on this. It gets tricky, because there are so many ways you can run Loki, with so many stores, but some of these configs would be the same regardless, so we could definitely document those.
A
Steve, can you share a little bit about how you're running Loki? Like, are you running single binaries per tenant? Are you using the new simple scalable deployment model? Are you running microservices?
D
So we're running the distributed Helm chart, backed on S3. We haven't got any memcached caches at the moment, because we're running in-cluster — so, you know, we're talking about... most clusters are sort of, I'd say, about 50 nodes. So we're running all of our observability in-cluster, as a control plane, with the potential to then aggregate those in something separate.
D
You
know
maybe
refine
a
cloud
or
equivalent
outside
of
it,
so
not
huge
setups
but
the
the
micro
service
model,
because
when
we
previously
ran
in
single
binary,
we
did
see
the
the
differences
around
the
around
the
read,
write
and
you're
really
happy
with
it.
I
think
you
know
we
used
on
some
clusters
previously
we're
just
now
shipping
to
every
single
one
of
our
clusters
with
a
new
pattern,
and
some
of
them
have
different
use
cases
which
might
you
know
where
we
might
not
been
called
out
before
running
running
what
we
had.
D
A
A
Yeah — I think it's a lot better than the defaults, like we also do... That said, we have another overlay which kind of lives on top of this open-source one as well, which is kind of tailored to some particulars of how Grafana Cloud specifically does it, and there are definitely some improvements to be made by migrating some of those changes up. But I do think this is right, like — it's immediately here. It's by no means a perfect answer, but it is one.
C
Yeah — throw it in there, for sure, just so that people get it. But yeah — I will say it's very present in our minds to continue improving this. Like, I know — I apologize — it's been...
C
That's about everything on 2.4 that I was going to talk about. So — if anybody else has any questions or feedback on their experiences with 2.4: going once, going twice... Nice. Cyril wasn't able to make it, so I'm just going to mention this.
It's not released yet, but we did add support for a UDP receiver for the Graylog Extended Log Format, so that you can ship Graylog GELF — I don't know if it's pronounced "gelf" or "jelf".
C
The grand enterprise — oh my god. So, yeah: the Graylog Extended Log Format will be available over UDP, and it's effectively JSON. So after it's received via UDP, you can use Promtail's pipeline stages to manipulate the JSON, to pull the — maybe they call them tags, I can't remember — into labels, if you want. So: another cool way to get logs into Loki. And this one, Daniel — yeah, this is super exciting.
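As a sketch of what the Promtail side might look like once this ships — the gelf listener shape and the internal label name here are assumptions based on how Promtail's other receivers are configured, so check the released docs:

```yaml
# Sketch (unreleased feature at the time of this meeting): a Promtail
# scrape config receiving GELF over UDP and relabeling a message
# field into a label.
scrape_configs:
  - job_name: gelf
    gelf:
      listen_address: "0.0.0.0:12201"   # conventional GELF UDP port
      use_incoming_timestamp: true
      labels:
        job: gelf
    relabel_configs:
      - source_labels: ["__gelf_message_host"]   # assumed internal label
        target_label: host
```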
F
If you click on the Tempo PR, they show what impact it had on their side. We would probably get larger gains, I would imagine. So yeah — really excited to see this happening.
F
Yeah, yeah. So — oh, and if you go to the first PR — in the first... or, in the — yeah, that one: the PR that's got the "Tail at Scale" paper link in it. There, yeah. So, the idea behind this feature — and this was implemented by Cyril —
F
— so it's another item taking credit for Cyril's work this week, this as-yet-unreleased feature. So: when you are fetching chunks from object storage, if your object store is experiencing some latency, that can really push up the latency of the overall query, right? If you put in a query that's got to fetch 100 chunks, and one of those chunks is a bit slow to retrieve, that increases the latency of the overall request.
F
So what this change introduces is the ability to hedge a request. You'll typically run this in your environment for a while and then figure out what the p99 of your requests to your object store is. Then you can configure this hedge-request feature so that, when a request exceeds that p99 time, it will actually cancel that request and issue a new request. And the idea around —
F
— that is that, if you're using a cloud service, that one request — that one chunk in the object storage — could have hit a node that was a little bit overloaded, or there's some other kind of network congestion; you issue another request, and that usually completes faster. And so there are two pull requests there.
F
The one is implementing the base feature, and the second PR is implementing the rate limiting. And the reason why we need the rate limiting is because, if the object storage service as a whole is slow, then you're effectively going to be hedging every single request, and that's going to push up your request latencies quite — quite massively. So, yeah: those two features together should provide some pretty awesome gains.
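These two PRs were unreleased at the time of the meeting; as a sketch, the knobs look roughly like this (field names and values are my reading of the feature as it later shipped, not verbatim from the meeting):

```yaml
# Sketch: hedged object-store requests, plus the rate limit that keeps
# a globally slow store from hedging everything.
storage_config:
  hedging:
    at: 500ms          # hedge requests slower than this; tune to your store's p99
    up_to: 2           # hedged attempts allowed per request
    max_per_second: 5  # rate limit on hedged requests
```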
F
So we're going to be running it in our cloud environment over the next few weeks, experimenting with that and seeing what kind of benefit it gives us. But it gave Tempo quite a good improvement on their side, so we're expecting to see something like that as well.
F
I think — yeah — they configured their hedge time at 500 milliseconds, if I remember correctly.
A
Okay. So it looks like they brought the... So, p99 is generally your slowest requests, right — your 99th percentile — and this is what we kind of talk about as "the tail": it's the tail of the distribution, meaning the slowest of the requests that you actually sent out. It looks like they reduced theirs from about 10 seconds to about two and a half, so it's a factor of four there, which is really impressive. And when you think —
A
— that this can function at a bunch of different layers: so it can be, like, the one request that you send to Loki, for instance, but it can also be that Loki splits and shards and parallelizes one request into a thousand, and maybe the thing that holds up your whole query from returning is, you know, one of those thousand.
A
The
idea
is
that
we
can
take
this
and
reduce
it
by
a
large
factor,
and
so,
if
you're
spending
a
lot
of
the
time
on
on
your
queries,
just
waiting
for
something
to
happen
waiting
for
the
tail
to
finish.
This
is
another
way
to
mitigate
that.
F
A
C
It's super fun. I mean, if you look at Loki's traces, you'll see a pattern, right, where we limit parallelism to sort of save resources — but that does mean we have to wait, at some point, for every one of, say, the 16 queries that were parallelized at the same time to return. And if one of those is waiting two seconds for one chunk, then your query ends up being the sum of the slowest requests — and so, by hedging...
B
Go for it. So, yeah — I'm starting to change my career into acting. So I got this video going; really excited about it. I —
B
Yeah, yeah — I played myself, actually, I know. So: a lot of the work that I contributed to 2.4 was around the simple scalable model, these two new targets that — I mean, it's not just me; a lot of people worked on these new targets — the read and the write targets. So I put together a video on getting started with those, and then there's also a Helm chart.
B
That's in PR, and I'd love some reviews from the community on that, because I am in the same boat as Ed when it comes to Helm, and not feeling like I know what I'm doing when I'm getting in there. So, yeah — I'd love it if anyone has tried out these two targets; love feedback. And yeah, that's it.
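The two targets are selected with Loki's -target flag; here's a docker-compose-style sketch of the shape (image tag and paths illustrative — both targets need shared object storage and a common ring/memberlist config, omitted here):

```yaml
# Sketch: the simple scalable pattern, one binary deployed twice.
services:
  loki-write:   # write path: distributor + ingester
    image: grafana/loki:2.4.1
    command: "-config.file=/etc/loki/config.yaml -target=write"
  loki-read:    # read path: query-frontend + querier
    image: grafana/loki:2.4.1
    command: "-config.file=/etc/loki/config.yaml -target=read"
```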
B
The problem that we've seen with the single binary is: when there's a problem, it becomes a really big problem. So if it's, like, your queriers, for example — that will actually take down your ingesters, because it's a single binary, so they're on the same process. But we thought that we could maybe get a little bit better.
B
But
then
the
problem
with
microservices
obviously
is
just
it's
complicated
right,
there's
a
lot
of
moving
pieces,
and
so
for
people
who
are
maybe
just
kicking
the
tires
with
loki
or
like
setting
it
up
for
maybe
like
a
smaller
load.
Maybe
it's
for
like
a
hobby
project
or
something
like
that.
We
wanted
to
find
a
middle
ground
between
the
two
where
it
was
simpler
to
deploy
in
terms
of
the
topology,
but
was
a
little
bit
more
rugged
and
robust
than
than
single
binary
and
so
yeah.
B
That's the simple scalable. It sort of encouraged a lot of other work that I think is going to be really valuable, like the query schedulers having a ring — so now, even in single binary mode, the queries can be scheduled — and then we also got a little smart with the compactor, with a ring, so that the ring determines which instance the compactor runs on, making sure it only runs once. So I think it's — it's...
B
Well, we'll see how well we've done until people run it that way — so give it a try, and let us know what works and what doesn't work.
C
Yeah — the only thing that I would say is that the SSD mode is really actually intended for people that want to run gigabytes, or even a couple of terabytes, a day. Like, you can scale it; you're making a trade-off on complexity versus the sort of brutal efficiency that you can get with microservices. So that was the idea, right — like, in a number of cases...
C
Most people don't want to know what an index gateway is, or a query scheduler, or query frontend, or queriers — and even the caching layers — and so we hid all of that inside of these new read and write targets. So it includes in-memory caching, and it includes the query scheduler — and actually the single binary does now, too. So you get that parallelization.
G
Yeah — so, basically, I just wanted to mention here that yesterday we had the 8.3 release, which comes with a long-awaited and much-wanted feature, which is logs volume — or you might know it under the name "full-range logs histogram". This feature is still under a feature toggle, so if you would like to try it out, you need to toggle this feature in your custom.ini file — and let me find out what the name is; maybe I should mention it here. So the name is "full range logs volume". I will maybe...
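If the toggle follows Grafana's usual camel-casing, enabling it would look roughly like this sketch — shown via the environment-variable form of the custom.ini setting; the exact toggle name is inferred from the meeting, so verify it against the 8.3 docs:

```yaml
# Sketch: enabling the feature toggle, equivalent to
#   [feature_toggles]
#   enable = fullRangeLogsVolume
# in custom.ini (toggle name assumed, not confirmed in the meeting).
services:
  grafana:
    image: grafana/grafana:8.3.0
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=fullRangeLogsVolume
```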
G
C
Would love feedback on this. I would say, like — you know, there are two big concerns, or there are two sorts of questions. One is what the UX is like and how people like it; in our experience running it, it's nice to have the full-range histogram.
C
So we've hedged that by limiting it to 10 seconds, which is sort of a way to say: if you're querying a really long time range, it doesn't consume a huge amount of resources. And this is the part that I think is going to probably be a bit contentious — what that experience looks like. So I'm interested to see; we're going to, you know, run it on our cloud environment, and we're going to do sort of a beta version of it.
C
G
Yeah — we can either create, like, one issue, or — I was thinking that if anyone has any kind of issue feedback, just create an issue for that specifically in Grafana. Although I'm wondering if it should be the Grafana repo or the Loki repo, as it's kind of interconnected — but, like, if you do it in any of these repositories, we will get the feedback, and definitely — okay.
C
Yeah, it's a good question. I would say Grafana, though, just because we're largely targeting the sort of user experience, and the front-end part is what we're looking for feedback on. But as an operator — I guess we could have one for Loki too, right? Like, you know: what's your experience like as an operator of a Loki cluster when you turn this feature on? We could maybe do both and reference each other.
G
A
And otherwise, we can probably cut this a little bit short and give everyone back some time.