Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2020 - Virtual, 4 Sep 2020

From YouTube: Make Prometheus Use Less Memory and Restart Faster - Ganesh Vernekar, Grafana Labs

Description

These days, the most common reason for a Prometheus server to run out of memory is an excessive number of time series in the so-called head block, the part of the internal TSDB with the freshest data, which has to be kept in memory prior to consolidation into a block on disk. A large head block leads to a long restart time because the head block has to be rebuilt from the write-ahead log (WAL). On large servers, the restart time can be 10 minutes or more. Since restarts happen regularly to upgrade the binary or to change flags, the resulting interruption of sample collection is problematic. Even worse, after an OOM crash the same replay from the WAL has to happen, often causing another OOM crash immediately. Ganesh Vernekar will talk about the work started in late 2019 to persist parts of the head block earlier, thereby reducing both the memory footprint and the restart time.

https://sched.co/Zeih

A

Welcome, everyone, to my talk, "Make Prometheus Use Less Memory and Restart Faster." A little bit about myself: I am Ganesh Vernekar. I am a software engineer at Grafana Labs, I am also a Prometheus team member, and I maintain the TSDB engine of Prometheus. Before we understand how we achieved a reduction in memory and a faster restart, you need to know how a sample goes through the TSDB.

A

Here I am assuming that you already know a little bit about Prometheus and what a memory series, or a series, means. So let's dive into the life cycle of a sample. This is how the TSDB looks at a high level.

A

There is something called the head block, which is the in-memory part of the database. There is a write-ahead log (WAL) on disk, which keeps a record of all the samples and series coming into the head, so that whenever there is a crash we can recover the data from the write-ahead log. And there are persistent blocks on disk, which are just data flushed from the head block onto the disk. In this diagram, time grows from left to right.

A

So the block on the left side is the oldest block. It looks bigger because it is formed by merging two or more blocks. The main focus of this work is the head block and the write-ahead log, because the memory optimizations and the restart time are both linked to this particular section. So let's zoom in a little and understand the life cycle of a sample in depth. In all the discussion going forward, I'll be talking with respect to a single series.

A

I have a horizontal line here. Anything represented above it is in memory; anything below it is on disk.

A

So here we have a sample coming into the head block, and we store samples in something called a chunk. A chunk is a compressed unit of up to 120 samples. Let's say your scrape interval is 15 seconds. That means 120 samples would span up to 30 minutes.
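
The arithmetic above can be checked with a short sketch. The constant and function names are hypothetical and are not taken from the Prometheus code base; only the 120-sample chunk size and the 15-second scrape interval come from the talk.

```python
SAMPLES_PER_CHUNK = 120  # chunk size stated in the talk


def chunk_span_seconds(scrape_interval_seconds: int) -> int:
    """Time covered by one full chunk at the given scrape interval."""
    return SAMPLES_PER_CHUNK * scrape_interval_seconds


span = chunk_span_seconds(15)
print(span / 60)  # minutes covered by one full chunk at a 15 s interval
```

A longer scrape interval stretches the time a single chunk covers, which is why the number of chunks held in the head depends on the scrape interval.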

A

Here I have shown the chunk that is actively being appended to in red. Whenever we write a sample to the chunk, we also write it to the write-ahead log, so that whenever there is a crash we can recreate the same sample. As I said, a chunk is a compressed unit of up to 120 samples. So what happens after you cross 120 samples?

A

We just cut another chunk, so we end the life cycle of that chunk. This one was red before; now it has been cut at 120 samples, and there is a new chunk that will be actively appended to. Once you cut a chunk, the chunk shown in yellow is read-only: it is never appended to, and no sample is deleted from it before we flush it into a block.
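
The append-and-cut behavior described so far can be modeled roughly as follows. This is a simplified sketch under stated assumptions: the `Series` class and its fields are hypothetical, samples are kept as plain tuples rather than compressed, and a Python list stands in for the write-ahead log.

```python
from dataclasses import dataclass, field

SAMPLES_PER_CHUNK = 120  # chunk size stated in the talk


@dataclass
class Series:
    active: list = field(default_factory=list)  # chunk currently being appended to
    full: list = field(default_factory=list)    # cut chunks, read-only from here on
    wal: list = field(default_factory=list)     # stand-in for the write-ahead log

    def append(self, t: int, v: float) -> None:
        # Every sample is also recorded in the log so it can be replayed after a crash.
        self.wal.append((t, v))
        if len(self.active) == SAMPLES_PER_CHUNK:
            # The active chunk is full: freeze it and start a new one.
            self.full.append(tuple(self.active))
            self.active = []
        self.active.append((t, v))


s = Series()
for i in range(250):
    s.append(i * 15, float(i))
print(len(s.full), len(s.active), len(s.wal))
```

Freezing the cut chunk as a tuple mirrors the point that a cut chunk is never modified again, which is what later makes it safe to move off the heap.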

A

Similarly, the chunk just keeps getting data, and a new chunk is cut every 30 minutes if you assume a 15-second scrape interval, since every chunk has 120 samples. Considering that the head block range is two hours, meaning the first block that you cut is two hours in size, the head block will store up to three hours of data.

A

So if you count the chunks here, 30 minutes, 30 minutes, 30 minutes and so on, this spans up to three hours of data. Only the red chunk is mutable, in the sense that samples will be added to it but not deleted, and the rest of the chunks are read-only. Once we come to this stage, where the head block spans three hours, we cut a block: we take the first two hours of data, which is the first four chunks here, leaving aside the fifth and sixth chunks, and this is flushed to disk as a block.
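
To make the chunk counting concrete, here is a small sketch of which chunks go into the block and which stay in the head. The function and constants are hypothetical; it assumes 30-minute chunks (a 15-second scrape interval) and the two-hour block range mentioned in the talk.

```python
CHUNK_MINUTES = 30         # one chunk at a 15 s scrape interval
BLOCK_RANGE_MINUTES = 120  # head block range stated in the talk


def split_for_compaction(chunk_starts):
    """Split chunk start times (in minutes) into (flushed to block, kept in head)."""
    cutoff = chunk_starts[0] + BLOCK_RANGE_MINUTES
    # A chunk goes into the block only if it ends within the first two hours.
    block = [c for c in chunk_starts if c + CHUNK_MINUTES <= cutoff]
    head = [c for c in chunk_starts if c + CHUNK_MINUTES > cutoff]
    return block, head


chunks = [i * CHUNK_MINUTES for i in range(6)]  # six chunks spanning three hours
block, head = split_for_compaction(chunks)
print(block, head)
```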

A

If you observed from the beginning, the write-ahead log keeps growing as data comes in, and once we cut the block, the write-ahead log is truncated. The block consists of its own index and chunks, stored separately. The index is required to search into the chunks. Here in the head block I'm not showing other parts like the in-memory index; I'm just showing the chunks because that's more relevant for this talk, but there is an index for the head block too.

A

So this cycle repeats. Now you see the chunk fills again: it gets more samples, there are more chunks, and the block is cut again. Sweet.

A

So the diagram is changed a little bit to show the active chunk, which is in red here, and the index for the blocks. We create smaller blocks, and as they get older the blocks are merged to form bigger blocks.

A

So now, let's save some memory. As you can already see here, it's done through memory mapping from disk, so let's see how. It was done in these two PRs that I worked on, and this work is in the Prometheus 2.19 release, so you can upgrade to that if you wish to have this optimization. Let's go back to the initial state, where there was one active chunk and samples were being appended to it, and obviously there is a write-ahead log.

A

As I said, once a chunk is cut, like the yellow chunk, it is read-only: it cannot be written to and it cannot be deleted. It's similar to a block, where the chunks in the block are immutable.

A

Hence we memory-map them from disk. Memory mapping is a feature provided by the OS where you need not load the entire file into memory to access it. You can just say, "I want to access this part of the file," and the OS will take care of loading just that part into memory, which is great. It removes a lot of complexity from the code.
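
As a general illustration of memory mapping (not the Prometheus implementation, which is written in Go), the sketch below uses Python's standard `mmap` module to read a small region from the middle of a file without reading the whole file into a buffer. The file name and contents are made up for the example.

```python
import mmap
import os
import tempfile

# Write a file of roughly 1 MiB with a recognizable marker in the middle.
path = os.path.join(tempfile.mkdtemp(), "chunks.bin")
offset = 512 * 1024
with open(path, "wb") as f:
    f.write(b"\x00" * offset)
    f.write(b"chunk-data")
    f.write(b"\x00" * offset)

# Map the file read-only and slice out just the region of interest.
# The OS pages in only the parts of the file that are actually touched.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        part = m[offset:offset + 10]

print(part)
```

The caller only names an offset and a length; deciding what to load and when to evict it is left to the operating system.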

A

So, since we memory-map chunks from disk for the blocks, why can't we do it for the head block too? That's exactly what we do once we cut a new chunk for the head.

A

We flush that to disk. As I already mentioned, this line separates memory from disk. So now the chunk is on disk, and while it is on disk we just store a reference to it in memory. The chunk can grow up to 120 bytes, or 150 bytes, or maybe even 200 bytes, and that is replaced by just 8 bytes of reference in memory.
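
One way an 8-byte reference can identify a chunk on disk is to pack a file number and a byte offset into a single 64-bit integer. The layout below (upper 32 bits for the file index, lower 32 bits for the offset) is an assumption made for this sketch, and the function names are hypothetical; the talk only states that the reference takes 8 bytes.

```python
import struct


def pack_ref(file_index: int, offset: int) -> int:
    """Combine a file index and a byte offset into one 64-bit reference."""
    return (file_index << 32) | offset


def unpack_ref(ref: int):
    """Recover (file_index, offset) from a 64-bit reference."""
    return ref >> 32, ref & 0xFFFFFFFF


ref = pack_ref(3, 4096)           # chunk stored in file 3 at byte offset 4096
encoded = struct.pack(">Q", ref)  # what would be held in memory per chunk
print(len(encoded), unpack_ref(ref))
```

Compared with a chunk of 120 to 200 bytes, holding only this reference in memory is where the per-chunk saving comes from.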

A

Similarly, as and when we cut a chunk here and a second chunk comes in, we immediately flush it to disk. If you look carefully at the write-ahead log, it keeps growing: once a chunk is flushed to disk, the write-ahead log is not truncated. I will explain why in a moment.

A

So similarly, we have one chunk, two chunks, and then all five chunks memory-mapped from disk. When compaction happens, the chunks are taken from here, and the truncation doesn't happen immediately. I am not going deep into how the m-mapped chunks are truncated, because it's not very relevant for this talk.

A

That's how we are saving memory. It feels like you are saving 5/6 of the memory, but it's really not 5/6, because other factors come into play, like the in-memory index, the symbols that are stored, the memory required to load a block, and lots of other things.

A

So, realistically speaking, the memory savings are 30 to 40 percent in the best case, or up to 50 percent. And when there is high churn (by churn I mean the rate at which new series are created or deleted), the savings are smaller.

A

But if we go back to the example of a 15-second scrape interval, we used to store up to six chunks in memory in the worst case. Now, for all scrape intervals, we store at most one chunk in memory, which is sweet. We run something called prombench, which is a heavy benchmark, for every release that we cut.

A

So this is comparing release 2.19 with release 2.18, and you can see there is a 30 to 40 percent reduction in memory usage.

A

There are lots of lines here; don't be confused by the others. It's the yellow line and the green line that are being compared. This is when churn is low, meaning the rate at which new series are created is very low. When there is high churn, there is still some reduction, around 10 to 20 percent. And this next one is from one of the GitLab issues.

A

So when they upgraded to Prometheus 2.19, they saw a reduction of up to 50 percent. If you look at the left side, before memory mapping, there are spikes every two hours because, as you saw in the life cycle of a sample, the head stores up to six chunks if we take a scrape interval of 15 seconds. So the memory for the head block grows with time, and after compaction it comes down. It grows for two hours, and after compaction it comes down. But see what happens after memory mapping, because we store at most one chunk in memory.

A

In the worst case, memory usage stays fairly stable. There are no ups and downs like before, so that's sweet. Now, talking about a slightly faster replay: this is not the ultimate fast replay that I'm going to speak about. It was a good side effect of the memory-mapping work. To explain a little bit about replay: the replay consists of going through the write-ahead log records and replaying each and every sample that we ingested before.

A

So the replay consists of recreating the compressed chunks that we had in memory, or, you could say, the memory-mapped chunks. Recreating those compressed chunks is an additional CPU task and takes a bit of time. Because we already have the entire chunk flushed to disk, whenever we are iterating through the samples we look back and see if there is a memory-mapped chunk already present for this time range, and if so we skip the sample.
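
The skip logic during replay can be sketched as below. This is a simplified model, not the Prometheus code: the `replay` function is hypothetical, log records are plain `(series_id, timestamp, value)` tuples, and the coverage of the memory-mapped chunks is reduced to a single latest timestamp per series.

```python
def replay(wal_samples, mmapped_max_time):
    """Rebuild only the samples that are not already covered by chunks on disk.

    wal_samples: iterable of (series_id, timestamp, value) records
    mmapped_max_time: dict of series_id -> latest timestamp in its m-mapped chunks
    """
    active = {}
    skipped = 0
    for series_id, t, v in wal_samples:
        if t <= mmapped_max_time.get(series_id, float("-inf")):
            # Already present in a chunk on disk: no need to re-compress it.
            skipped += 1
            continue
        active.setdefault(series_id, []).append((t, v))
    return active, skipped


wal = [("a", t, 1.0) for t in range(0, 100, 10)]
wal += [("b", t, 2.0) for t in range(0, 50, 10)]
active, skipped = replay(wal, {"a": 60})
print(skipped, len(active["a"]), len(active["b"]))
```

Every record is still read and decoded; only the work of rebuilding chunks is avoided, which is consistent with this being a partial speedup rather than the full one.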

A

So we don't need to recreate all the chunks that are already on disk, and because we don't have to recreate the compressed chunks, this reduces the restart time by 20 to 30 percent, as we saw in prombench, which is a nice side effect. Now, I mentioned I would talk about why we don't truncate the write-ahead log: the write-ahead log is required during replay to recreate this chunk in memory.

A

But because there are millions of series, we don't know exactly where in the write-ahead log the samples for this chunk are. So if we had to truncate the write-ahead log for these chunks, we would have to go through the entire write-ahead log, remove the irrelevant samples, and rewrite the write-ahead log, which is very inefficient and very expensive in terms of disk.

A

It would just stall a lot of other processes in Prometheus. So we don't truncate the write-ahead log immediately; we truncate it as before, whenever a compaction of the head happens. Also, remote write depends on the write-ahead log to send data to remote storage for long-term retention.

A

So unless remote write starts using the memory-mapped chunks, we cannot truncate the write-ahead log any sooner. That's one of the reasons. There are also some things to keep in mind. This memory saving is not the final thing that you see; the memory can grow because, as I said, memory mapping is an OS feature, and if you access the data it is brought back into memory.

A

That happens when you have a query hitting the memory-mapped chunks. For example, if you have a query that touches all the series for the past three hours of data, then it's going to load all the chunks back into memory. But realistically speaking, that doesn't happen often.

A

So if you don't run very heavy queries over the past two or three hours of data, you should be good in this regard. Also, because we are flushing the chunks immediately to disk, there is an additional disk space requirement: even when the compaction of the head happens, there are still some memory-mapped chunks on disk, so that takes some extra space.

A

So if you are tight on disk space, this is something to keep in mind: adjust your retention or your size-based retention, or just give it more space.

A

Now that we have seen how we are saving memory, let's talk about fast restarts. When you have millions of series, the restart can take 10 minutes or more, which is not ideal in terms of monitoring, because you are using this for alerting and you want it to be up most of the time.

A

I'm saying it's coming soon because it's a work in progress, and it should land sometime soon.

A

So this is how the TSDB, specifically the head block and the write-ahead log, looks after memory mapping. As I briefly mentioned before, we are iterating through the entire write-ahead log just to create this last chunk, because we don't know where the samples for this last chunk are. And we do this even though the old chunks are already flushed to disk.

A

We are still going through the samples for those chunks just to discard them later. What if we knew exactly which samples belong to this chunk, so that we don't need to go through the entire write-ahead log? Yep, you guessed it right: we just flush this last chunk to disk, and we call it a snapshot.

A

So this happens during shutdown. Initially we thought we could take snapshots at regular intervals, but with some calculations we came to the conclusion that this can cause 50 percent or more write amplification, which is not really desired. So instead, we do this whenever we are shutting down Prometheus.

A

We just flush the chunk that is being appended to onto disk, along with the labels and values that are part of the series. I'm not going into detail about this, but label values are part of the memory series.

A

So at this point you know what the old chunks are, and you don't need to go through the write-ahead log, because you already know which samples are in the active chunk, since we just flushed it during shutdown.

A

So the replay with snapshotting just consists of going through the memory-mapped chunks and the snapshot, and you don't need to go to the write-ahead log. I mentioned that recreating the compressed chunks is one of the CPU-intensive tasks. The other CPU-intensive thing is decoding the write-ahead log records from disk, which forms the majority of the write-ahead log replay time. So with snapshotting, there is no need to go through the write-ahead log.
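
The snapshot idea can be sketched as a write on shutdown and a read on startup. Everything here is illustrative: the function names are hypothetical and JSON is used only to keep the example short. The talk does not describe the actual snapshot format, which was still being designed at the time.

```python
import json
import os
import tempfile


def snapshot_on_shutdown(path, active_chunks):
    """Persist each series' labels together with the samples of its active chunk."""
    records = [
        {"labels": labels, "samples": samples}
        for labels, samples in active_chunks.items()
    ]
    with open(path, "w") as f:
        json.dump(records, f)


def restore_on_startup(path):
    """Rebuild the active chunks from the snapshot alone, without reading any log."""
    with open(path) as f:
        records = json.load(f)
    return {r["labels"]: [tuple(s) for s in r["samples"]] for r in records}


head = {
    'up{job="node"}': [(0, 1.0), (15, 1.0)],
    'up{job="api"}': [(0, 0.0)],
}
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
snapshot_on_shutdown(path, head)
restored = restore_on_startup(path)
print(restored == head)
```

Because the snapshot is written only on a graceful shutdown, a crash leaves no snapshot behind and the restart has to fall back to replaying the log.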

A

But when we talk about restart time, it's not just about when Prometheus is coming up. Now that we have snapshotting in the picture when shutting down, we also need to consider the time it takes to create the snapshot. Surprisingly, snapshotting a million series takes two to three seconds, or maybe five seconds at most, which is great. And the entire turnaround time, which is shutting down, replaying on startup, and getting Prometheus back up again, is about 5x faster, which is about an 80 percent reduction in time.

A

To give you some realistic numbers, we ran this using prombench and restarted the Prometheus server, first without snapshotting.

A

With only memory mapping, the replay took about two and a half minutes, and with snapshotting in place, the entire turnaround time was less than 30 seconds. So it can be a little more than 5x faster, because the test was done before the write-ahead log could grow for the entire three hours. It was more like one to two hours of write-ahead log. If Prometheus had been kept running for longer and then restarted, the gains could be more than 5x.
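
The 5x and 80 percent figures follow directly from the two timings quoted. The sketch below uses 150 seconds for "two and a half minutes" and takes 30 seconds as an upper bound for "less than 30 seconds"; the function names are hypothetical.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the new turnaround is."""
    return before_seconds / after_seconds


def reduction_percent(before_seconds: float, after_seconds: float) -> float:
    """Percentage of the original time that was removed."""
    return 100 * (before_seconds - after_seconds) / before_seconds


before = 150  # replay with memory mapping only: two and a half minutes
after = 30    # turnaround with snapshotting: upper bound of "less than 30 seconds"
print(speedup(before, after), reduction_percent(before, after))
```

Since 30 seconds is an upper bound on the measured turnaround, these two numbers are lower bounds on the improvement.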

A

So that is sweet.

A

So, more about snapshotting. This is best effort. As I already said, it happens during graceful shutdown and is not done at regular intervals.

A

So if your Prometheus happens to crash, for whatever reason, then the restart will be as slow as before. And because we are taking a snapshot and it goes to disk, it is going to require a little more disk space, which you need to take into account when you plan your resource requirements or retention period.

A

So this is a work in progress. The majority of the work is done, like the design, lots of iterations on how it should be done, the format, and so on.

A

So I expect it to be done in August, or by September at the latest. Because I'm going to work on this, I'm giving an estimate based on what I think will be done. With snapshotting in place, restarts are about 80 percent faster, and there are more exciting things coming in the TSDB.

A

These two just scratch the surface of what has been done in the TSDB and what's coming up. A blog post will be out soon, or will already be out by the time this talk is up, on the Grafana blog, where I talk more about what's coming up in the TSDB space and what you should be looking forward to.

A

To summarize what we have discussed so far: with memory mapping, we flush the chunks to disk and hence save memory. And in snapshotting, we have just the last chunk in memory that needs to be recreated.

A

We just flush it to disk during shutdown, taking a snapshot, so we don't need to go through the write-ahead log again. The replay just consists of going through the memory-mapped chunks and the snapshot, and that's how we are making the restart faster.

A

Thank you, that's all. By the way, this is my Twitter handle. If you have any questions, or if you are watching this talk after it has aired, feel free to reach out to me. Thank you.