Cloud Native Computing Foundation PrometheusDay EU 2022, 19 May 2022

From YouTube: Lightning Talk: Troubleshoot Compactor Backlog with Ease - Ben Ye, ByteDance

Description

This talk covers a common problem when running Thanos and Cortex at large scale: compactor backlog. Because the compactor is a core component, it is important to make sure that compactors run smoothly and are well scaled. In this talk, Ben Ye explains why compactor backlog happens and how to prevent it, and walks through ways to identify and troubleshoot it using existing metrics and tools.

A

Hello, I hope everyone is having a great time at PrometheusDay this year. This is Ben Ye, and I'm an SRE at ByteDance. Today's topic is troubleshooting compactor backlog with ease, so let's get started.

A

First, let me introduce what the Thanos compactor is. The Thanos compactor compacts blocks on the object storage in order to improve query performance. Besides that, it also deals with block downsampling and data retention. From the implementation perspective, the compactor is just a cron job. For example, it runs every five minutes, and each run is called an iteration. In each iteration, the compactor performs the three tasks here in order, which means that if there is too much compaction work to finish, it can't start downsampling and retention.
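The ordering described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how the Thanos compactor is implemented: the function name `run_iteration`, the idea of a fixed per-iteration `capacity`, and the work-unit numbers are all hypothetical.

```python
# Toy model (hypothetical): one compactor iteration spends a fixed amount of
# capacity on the three phases in a fixed order.
PHASE_ORDER = ["compaction", "downsampling", "retention"]


def run_iteration(pending, capacity):
    """Spend `capacity` work units on the phases in order.

    `pending` maps phase name -> outstanding work units and is updated in
    place. Returns the work done per phase. Later phases only receive the
    capacity that earlier phases left over.
    """
    done = {}
    for phase in PHASE_ORDER:
        used = min(pending[phase], capacity)
        pending[phase] -= used
        capacity -= used
        done[phase] = used
    return done


pending = {"compaction": 10, "downsampling": 3, "retention": 2}
first = run_iteration(pending, capacity=8)   # compaction uses everything
second = run_iteration(pending, capacity=8)  # later phases finally run
```

In the first call all capacity goes to compaction, so downsampling and retention do no work; they only proceed in the second call, once the compaction work is drained.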

A

So usually the backlog happens in phase one, which is the compaction phase. So why does this happen? We can think of it as a message queue scenario. Here, the Thanos compactor is the message queue consumer, and the producers are the Thanos sidecars, rulers, and receivers, which upload blocks to the object storage. In this case, the object storage is the message queue.

A

So if we scale up the producer side and we don't scale the consumer side, much more data will be uploaded to the object storage. The compactor cannot keep up with the load, it falls behind, and finally backlog happens.
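The queue analogy can be made concrete with a small simulation. This is a minimal sketch with made-up rates; `simulate_backlog` and its parameters are hypothetical names, and real block uploads and compactions are of course not uniform.

```python
def simulate_backlog(upload_rate, compact_rate, steps):
    """Return the queue length after each step.

    `upload_rate` is how many blocks producers add per step and
    `compact_rate` is how many blocks the consumer can process per step.
    """
    backlog = 0
    history = []
    for _ in range(steps):
        # The queue can never go below zero: an idle consumer just waits.
        backlog = max(0, backlog + upload_rate - compact_rate)
        history.append(backlog)
    return history


balanced = simulate_backlog(upload_rate=10, compact_rate=10, steps=3)
overloaded = simulate_backlog(upload_rate=20, compact_rate=10, steps=3)
```

With matched rates the queue stays empty; doubling only the producer side makes the backlog grow by the difference on every step.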

A

So the key thing here is to identify the backlog issue, and there are several ways to do so. First, the compactor itself exposes some very useful metrics. These two metrics tell us the number of iterations and downsamplings performed so far.

A

So if these two counters stay at the same value or increase slowly, then a backlog might be happening. And if you don't see any retention happening for very old blocks, then the compactor might be busy compacting your blocks and cannot start doing the retention. The last point might not be that obvious.
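A check for "the counter is not moving" might look like the sketch below. It assumes you already have a window of counter samples, ordered oldest to newest; the helper names and the `min_increase` threshold are hypothetical, and the reset handling simply mirrors the usual convention that a counter dropping in value means the process restarted.

```python
def counter_increase(samples):
    """Total increase over a window of counter samples (oldest first).

    A drop between two samples is treated as a counter reset, so the new
    value counts as the increase since the restart.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def counter_is_stalled(samples, min_increase=1):
    """True if the counter grew by less than `min_increase` in the window."""
    if len(samples) < 2:
        return False  # not enough data to judge
    return counter_increase(samples) < min_increase


stalled = counter_is_stalled([5, 5, 5])   # no iterations completed
healthy = counter_is_stalled([5, 6, 8])   # iterations keep completing
```

The window length and threshold depend on how often the compactor is expected to finish an iteration in your setup.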

A

But if your compactor has the backlog issue, then the performance of some long-range queries might be degraded.

A

So another way to identify the backlog issue is to use the progress metrics. Since the Thanos v0.24 release, four new metrics have been introduced. They are very good signals to tell whether your compactor has hit a backlog or not, and they represent the compaction progress. Please do give them a try; they are very useful in alerts as well.
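The transcript does not name the four metrics. As a sketch, the snippet below assumes one of them is exposed as `thanos_compact_todo_compactions`, the name I recall from the Thanos compactor documentation; treat that name, the sample values, and the threshold as assumptions and check them against your own compactor's `/metrics` output. It also assumes sample lines carry no trailing timestamp.

```python
# Made-up sample of Prometheus text exposition output (values are invented).
SAMPLE = """\
# TYPE thanos_compact_todo_compactions gauge
thanos_compact_todo_compactions 42
# TYPE thanos_compact_todo_compaction_blocks gauge
thanos_compact_todo_compaction_blocks 130
"""


def sum_metric(exposition_text, name):
    """Sum all samples of metric `name`, across any label sets."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def has_backlog(exposition_text, name, threshold):
    """True if the pending work reported by `name` exceeds `threshold`."""
    return sum_metric(exposition_text, name) > threshold


todo = sum_metric(SAMPLE, "thanos_compact_todo_compactions")
alerting = has_backlog(SAMPLE, "thanos_compact_todo_compactions", threshold=20)
```

In practice the same condition would more likely live in an alerting rule than in a script; the sketch only shows the comparison being made.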

A

So next, let's talk about solutions for the backlog. To solve the backlog problem, we definitely want to scale the compactors, and the easiest way is to simply scale vertically: we can add more compute resources to the compactor instances. Another way is to increase the compaction concurrency.

A

So there are two flags provided by the Thanos compactor: one is the compaction concurrency and the other is the downsampling concurrency. We can tune these flags to make a compactor instance more powerful. Another way is to scale horizontally, and for horizontal scaling there are actually two ways to go. One way is to shard by time.

A

So, for example, we can have two compactors: one compactor takes care of blocks produced in the last week, and another takes care of blocks produced in, say, the last month. In this way we distribute blocks to different compactors by time. Another way is to shard the blocks by their external labels, so that blocks from the same cluster are grouped together on the same compactor. In this way we achieve the same goal and distribute blocks to different compactor instances.
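Both sharding strategies boil down to a deterministic "which compactor owns this block" decision. The sketch below illustrates that logic only; the function names, the cut-point representation, and the choice of MD5 are assumptions for illustration. In Thanos this split is configured on the compactor instances rather than coded by hand.

```python
import bisect
import hashlib


def shard_by_time(block_time, boundaries):
    """Pick a shard index from sorted time cut points.

    With boundaries [t1, t2], blocks before t1 go to shard 0, blocks in
    [t1, t2) go to shard 1, and blocks at or after t2 go to shard 2.
    """
    return bisect.bisect_right(boundaries, block_time)


def shard_by_labels(external_labels, num_shards):
    """Pick a shard index from a stable hash of the external labels.

    Labels are sorted first, so the same label set always maps to the same
    shard regardless of dictionary ordering.
    """
    key = ",".join(f"{k}={v}" for k, v in sorted(external_labels.items()))
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


time_shard = shard_by_time(150, boundaries=[100, 200])
label_shard = shard_by_labels({"cluster": "eu-1", "replica": "a"}, num_shards=4)
```

Hashing the full label set keeps all blocks from one source on the same compactor, which is the grouping by cluster described above.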

A

So I think that's all for today's session, and I hope you enjoyed it. Thank you.