From YouTube: ZFS Caching: How Big Is the ARC? by George Wilson
Description
From the 2020 OpenZFS Developer Summit
slides: https://drive.google.com/file/d/19th2JHeITp1Iefc-JffIDqn_4oy_JfVx/view?ts=5f7b7499
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
Great, well, thanks everyone for joining virtually. This is going to be an interesting thing: this is the first time I've ever done a virtual conference, as is probably the case with many of you.
So what I wanted to talk about today was some things that we encountered at Delphix regarding the ARC. For those that don't know, we have recently switched our product to Linux, so this was a big change for us, and one of the things that we were trying to do was to make sure that we get the same performance as what we're used to seeing on illumos, because that's what our product was based on. And so we were working through this big, giant migration schedule.
One of the things that we noticed was that the ARC size on Linux is quite a bit smaller than what we were expecting and what we were used to on illumos. So we said: okay, no big deal, we know the code, let's go in there and change it.
We had noticed that on Linux the ARC was limited to half of memory, and here you can see a diagram of what we're used to on illumos, which is the ARC taking up nearly all of memory. We rely on that primarily because we're really read-intensive for certain workloads, so we want to have that caching.
A
Well,
no
big
deal,
you
know,
hey
we're
we're
all
developers,
let's
just
go
in
there,
make
a
change,
simple
fix
and
sure
enough.
It
was
a
simple
fix.
We
actually
got
it
changed
and-
and
so
now
our
delfix
platform
kind
of
looks
like
this,
which
matches
some
of
the
other
platforms
that
you
may
be
used
to
so
like
freebsd
and
lumos,
also
kind
of
do
a
similar
type
of
configuration
where
they're
using
most
memory
and
so
great
we
we're
done
and
that's
the
end
of
my
talk.
No
sorry!
It's
not!
So what I'm showing here is a graph of the ARC, in this case arc_c, the target size of the ARC, and what happens to it over a period of time. We noticed this steep drop-off, this steep decrease in the ARC target size, and that started us investigating what was really happening here.
We had made the change to increase the size of the ARC months and months ago, and we were only now starting to see this on real systems, so we had to step back. There were two types of things we looked at. We were familiar with what you see on the left side of your screen, which is the expected workflow: we knew the ARC has these two threads that run.
A
What
we
discovered,
however,
is
that
the
actual
workflow
had
this
other
component
called
the
shrinker,
and
this
was
relatively
new
to
us
because
again,
this
is
a
kind
of
a
linux
specific
component
and
not
something
that
we're
used
to
seeing
in
illumos
and
what
we,
when
we
dug
into
it.
We
noticed
that
the
shrinker
has
kind
of
these
two
modes
where
one
of
them
is
to
just
get
some.
One is just a count: how many objects can the shrinker actually free, or how many are you going to tell the kernel that you can free? The other is to actually go in and do the shrinking of the target size.
In addition to that, we noticed two different ways that the shrinker was being called. There are two different types of memory pressure, and you're probably used to this on any platform: direct memory reclaim and indirect memory reclaim. This graph on the left is something that I wanted to speak to, because the indirect memory reclaim actually happens first.
So kswapd will run until it gets down to this minimum watermark, and at that point in time is when you actually start seeing these direct reclaims. So you'll get the synchronous reclaim, which happens in the context of whatever thread is running: you may be in the middle of doing a VFS read, and all of a sudden you need memory, and it's coming in and calling back into the ARC, and the ARC is having to go do some stuff.
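The two reclaim paths can be modeled with a small Python sketch (this is an illustrative userspace model of the Linux watermark behavior, not kernel code; the names and thresholds are mine):

```python
# Simplified model of Linux memory watermarks: kswapd (indirect reclaim)
# runs in the background once free pages drop below the "low" watermark,
# but if free pages fall below the "min" watermark, the allocating thread
# itself has to do direct (synchronous) reclaim inline.

def reclaim_mode(free_pages, wmark_min, wmark_low):
    """Return which reclaim activity a new allocation would see."""
    if free_pages < wmark_min:
        return "direct"   # allocator blocks and reclaims synchronously
    if free_pages < wmark_low:
        return "kswapd"   # background thread reclaims asynchronously
    return "none"         # plenty of memory, no reclaim needed

def kswapd_target(wmark_high):
    """kswapd keeps reclaiming until free pages recover to the high mark."""
    return wmark_high
```

The point of the graph in the talk is that direct reclaim, the painful synchronous case, only starts once kswapd has already fallen behind.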
The first thing, count_objects, effectively just returned all evictable memory to the kernel. It was saying: hey, if you need memory, here is everything that's evictable, you can have it. And that's all count_objects does: it just returns the number of pages that you think you can give back if the kernel needs them. Scan_objects is really the one that does the real work: when the kernel calls you back, it says, here, go do some work.
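A Linux shrinker is a pair of callbacks, count_objects and scan_objects. As a sketch of the original behavior being described (a userspace Python model, not the actual ZFS code), it looked roughly like this:

```python
# Model of the original ZFS shrinker behavior: count_objects advertises
# everything evictable, and scan_objects just lowers the ARC target size
# without waiting for any eviction to actually complete.

class Arc:
    def __init__(self, target, evictable):
        self.target = target        # arc_c, the target size (in units)
        self.evictable = evictable  # evictable data currently cached

    def count_objects(self):
        # Tell the kernel we could free *all* evictable memory.
        return self.evictable

    def scan_objects(self, nr_to_scan):
        # Old behavior: shrink the target and return immediately; the
        # actual eviction happens later, in a different thread.
        self.target = max(0, self.target - nr_to_scan)
        return 0  # reports no reclaimed objects to the caller
```

Note that scan_objects here reports zero progress, which foreshadows the "kernel was never satisfied" problem described below.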
So as a result, any time the shrinker called back into the ARC, it simply reduced the target size, and away you went on that steep decline. As we started monitoring this, we noticed, here's a sample: over a 20-second period, with just some small memory pressure, we were seeing 198,000 calls to go and shrink the ARC. We were being bombarded with these calls coming in from kswapd, over 9,000 per second. And what was interesting is that, even though we were trying to keep up and do some eviction, it seemed like the kernel was never satisfied that we were doing any work. So that was another mystery for us to go solve.
So let's step back and go back to our original problem. I mentioned we had these iSCSI LUN resets; how does this actually relate, and what was really happening here? If we take the same graph and now map the ARC size onto it, things started to become a little clearer. You can see here that the ARC target size decreased on this steep decline, while the size is not able to keep up and is only slowly going down. These are happening via two different processes.
Before we go into the real crux of the problem, I wanted to make sure that people understand what happens when you're in this ARC-full condition or, if you've looked at the code, the ARC-overflowing condition, where an ARC-overflowing condition is when the size of the ARC is bigger than the target size.
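In code terms the check is just a comparison of the current size against the target (the real arc_is_overflowing() in OpenZFS also tolerates a small slack margin before reporting overflow; the margin parameter here is illustrative):

```python
def arc_is_overflowing(size, target, margin=0):
    # The ARC is "overflowing" when its current size exceeds the target
    # size (arc_c); a small margin avoids flapping right at the boundary.
    return size > target + margin
```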
A
So
when
you're
actually
going
in
and
trying
to
do,
for
example,
a
reed
and
you
need
a
block
in
the
arc,
the
first
thing
you're
going
to
ask
is:
is
the
arc
overflowing?
If
the
answer
is
yes,
then
you're
going
to
be
asked
to
block
giving
the
the
arc
evic
thread
the
ability
to
kind
of
make
some
progress
and
find
a
block
for
you,
so
that
you
can
make
so
you
can
actually
have
it
and
and
do
your
I
o
if
you're
not
overflowing
great,
you
get
a
block.
So with that, we set off to try to improve how we actually do memory pressure detection, and the first place we looked was revamping the way the shrinker logic works. Because we could have lots of evictable memory, we felt it was a little unfair to simply say, hey, here are all these gigabytes of evictable memory, every time the shrinker was called. So instead we said: well, let's lie to the shrinker and give back a certain number of pages.
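The "lie" is simply a cap on what count_objects reports per call. A sketch of that idea (the cap value and names here are illustrative; OpenZFS exposes a tunable for this, but treat the exact default as an assumption):

```python
PAGE_SIZE = 4096
SHRINKER_LIMIT_PAGES = 10_000   # illustrative cap per shrinker call

def count_objects(evictable_bytes, limit_pages=SHRINKER_LIMIT_PAGES):
    # Instead of advertising every evictable byte, report at most a fixed
    # number of pages per call, so kswapd cannot ask the ARC to collapse
    # all the way down in one burst of shrinker callbacks.
    evictable_pages = evictable_bytes // PAGE_SIZE
    return min(evictable_pages, limit_pages)
```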
A
Likewise,
the
work
that
was
actually
happening
with
the
shrinker,
instead
of
just
going
back
and
adjusting
the
target
size,
we
wanted
to
make
sure
that
when
we
adjusted
the
target
size,
we
actually
waited
for
those
evictions
to
happen,
and
then
there
was
one
little
nuance
that
we
uncovered,
which
is
there
is
a
way
to
actually
let
the
colonel
know
that
we're
making
some
progress.
It
just
wasn't
what
we
expected
instead
of
the
kernel
tracking,
like
free
pages,
it
wants
to
know
that
things
are
in
this
reclaimed
state.
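The revised scan path can be sketched like this: shrink the target, wait for the eviction to actually happen, and report the reclaimed count back so the kernel credits the progress. (In the kernel this reporting flows through the task's reclaim-state accounting; the Python below is only a model with illustrative names.)

```python
# Model of the revised scan_objects path: lower arc_c, let eviction
# actually free the bytes, and return the reclaimed page count.

def scan_objects(arc, nr_to_scan, page_size=4096):
    want = nr_to_scan * page_size
    arc["target"] -= want              # lower arc_c by the requested amount
    freed = min(want, arc["size"])     # model: eviction frees the bytes...
    arc["size"] -= freed               # ...and we wait for it to complete
    return freed // page_size          # progress the kernel will credit
```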
A
We
also
introduced
this
lock
step
type
of
mechanism,
so
I
mentioned
that
we
want
to
make
sure
that
we're
making
some
progress
any
time
that
you
shrink
the
that
you're
actually
shrinking
the
arc
size.
So
with
that
in
mind,
we
introduce
this
new
function
called
arc
weight
for
eviction,
and
the
way
it
works
is
that
it's
just
a
list
of
how
many
bytes
have
been
evicted
since
the
system's
booted
and
any
time
you're
requesting
for
some
eviction.
A
You
just
add
yourself
to
this
list
and
you
kind
of
increment,
so
this
this
diagram
is
showing
here
that
there's
four
different
consumers,
they've
added
and
just
been
accumulating
on
here.
The
number
you
see
in
here
is
the
number
of
bytes
that
it's
expecting
for
eviction
to
get
to
before
they're
woken
up,
then,
from
the
archivic
thread
side.
It's
going
to
process
this,
it's
going
to
keep
you
know
if
there's
memory
pressure
and
it
needs
to
do
some
work,
it's
going
to
be
woken
up.
It's going to go through the ARC and say: okay, I need to evict some pages. And as I evict, every single time I complete some eviction, I'm going to look and see who I need to wake up. In the past, the old logic was simply that you waited until the ARC size got below arc_c, and then you woke everybody up. Here we now have an opportunity to wake people up as we're making progress, which allows us to do this lockstep thing.
Also, any consumers out there that are waiting for blocks to be allocated can make some forward progress once we've freed enough memory; they don't have to sit there for long periods of time. So in this example, we can see that we started our eviction count at 10,156 bytes; we freed 384 bytes and, as a result, were able to wake up these two threads.
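The mechanism can be modeled as a monotonically increasing eviction counter plus per-waiter wake thresholds. This Python sketch mirrors the description above (illustrative names, not the actual arc_wait_for_eviction implementation), including the example of a 10,156-byte starting count and a 384-byte eviction waking two waiters:

```python
# Model of the arc_wait_for_eviction list: a cumulative count of bytes
# evicted since boot, plus waiters tagged with the count at which each
# should be woken.

class EvictionWaitlist:
    def __init__(self, evicted_so_far=0):
        self.evict_count = evicted_so_far   # bytes evicted since boot
        self.waiters = []                   # (wake_at_count, name), sorted

    def wait_for_eviction(self, nbytes, name):
        # A blocked allocator asks to be woken once `nbytes` more bytes
        # have been evicted on its behalf.
        self.waiters.append((self.evict_count + nbytes, name))
        self.waiters.sort()

    def note_evicted(self, nbytes):
        # Called by the evict thread each time it completes some eviction;
        # returns the waiters whose thresholds have now been reached.
        self.evict_count += nbytes
        woken = [n for at, n in self.waiters if at <= self.evict_count]
        self.waiters = [(at, n) for at, n in self.waiters
                        if at > self.evict_count]
        return woken
```

Walking the talk's example: with the counter at 10,156 and three waiters needing 128, 256, and 4,096 more bytes, freeing 384 bytes wakes exactly the first two.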
This also changed the way we do the allocation when you're going and reading a block: now, every time the ARC is overflowing, we will put you on this list and you're going to wait for a specific amount of memory to be freed, rather than waiting for the ARC size to get below arc_c. So you can make some forward progress, and we don't have those long delays anymore.
So for our simple test, just running the same workload, filling the ARC and then adding some memory pressure: we now see that the ARC size and arc_c are able to stay in lockstep. In this case we were applying 30 gigabytes' worth of memory pressure, and we saw the ARC slowly come down.
A
You
know
and
and
reduce
itself
by
30
gig,
give
that
memory
to
somebody
else
to
consume
and
were
able
to
drive
off
without
any
long
delays,
so
that's
kind
of
where
we
ended.
But
it's
not
the
end
of
the
story.
There's
still
more
to
be
had
here,
in
particular
one
of
the
things
that
that
we
did
was
we
actually
introduced
kind
of
this
minimum
threshold
of
memory
that
we
should
always
leave
for
the
system
to
run
in
currently
in
upstream.
It's
set
to,
I
think,
132nd
of
of
all
memory.
A
Something
about
that
size
has
a.
It
has
an
interesting
calculation
in
delfix.
However,
we
found
that
having
that
value
set
ended
up,
causing
us
to
actually
see
out
of
memory
conditions,
so
we've
increased
it,
but
we
still
want
to
dig
into
that
further.
Our
main
goal
here
really
is
to
make
it
so
that
the
arc
sizing
is
not
something
that
people
have
to
worry
about.
You
know
we
want
consumers
to
be
able
to
say
I
install
open
cfs
and
the
arc
is
going
to
adjust
to
your
workload
and
to
your
environment.
There's this component called arc_no_grow, and it will detect memory pressure and slow down the ARC's ability to grow once it finds that memory pressure has been incurred. Then, after a period of time, it says: let me check again, let me see if that memory pressure is still there. If so, then I don't grow; but if it's gone, then I'll slowly start allowing the ARC to continue to grow. So it will adjust over the course of time.
A
And
then
alan
asked
is
this
related
to
the
freebsd
tunable's
arc
free
target
which
triggers
arc
reclaim
when
the
kernel
free
pages
get
below
this
target.
Alan
I'm
not
too
familiar
with
arc
free
target.
So
I'd
have
to
look
at
that,
but
it's
possible
that
there's
some
similar
similarities
there
in
like
for
for
linux.
Right now we have arc_sys_free, which is kind of the bottom end of how much memory you leave for the system. That's the piece that determines at what point we won't allow the ARC to grow: we treat available memory as all memory minus that amount, so that we always leave that much around.
Unfortunately, there isn't a good test suite that I'm aware of. I know some people have been looking at trying to figure out ways to verify and validate some of the ARC's algorithms and heuristics, because there are quite a few things here. As I mentioned, with this shrinker logic the arc_reap thread doesn't really do very much now; even when we looked at this, when we first encountered the problem, we had expected the arc_reap thread to be the detector of memory pressure.
When we started tracing it, we found that it never woke up; it just wasn't doing anything. Every single call that was coming through to shrink the ARC was actually coming through kswapd, with maybe rare exceptions. I also mentioned briefly that there was this kind of cumbersome interaction between the shrinker and the way that we would detect memory pressure.
The pressure counter only incremented under certain conditions, and so, as a result, we were too late to the party whenever memory pressure came in. So the arc_reap thread is probably something that we have to go back and revisit, to see if it still has value. Then another question: this new list of waiting allocations, is it in arc_os.c, or is it generic across platforms? It is generic; it's actually used, I think, both for FreeBSD and Linux today.
Did you look at the case where memory requests on Linux, like from QEMU, now easily fail their allocation? We have looked at that, and this is one of the areas that probably needs investigation.
I haven't looked specifically at the QEMU use case, but it's my understanding that that was one of the reasons why Linux had never really gone to a full...
Hey George, yeah, 2.1. You can stop sharing your slides and people will be able to see you a little bit bigger. And the other thing is: I answered somebody's question, but I kind of dismissed it before I should have. The question was: which release of OpenZFS will have the fixes that you're talking about?
Oh, these are all in OpenZFS 2.0, so I think they've already been pulled in.
What is the largest ARC size that you've tested, and how much memory is too much memory? We've had systems that have been using up to a terabyte of memory; that's the largest that we've seen. I don't know if others have gone much larger than that; I'd be very interested to hear about experiences that people may have had with some extremely large systems. We know that larger-memory systems are coming out.
Okay, I have one more minute. Do you expect this to remove the need to limit the ARC size on virtual machine hosts? We would like to get to that. I think, as we start to dig into this more and make it more robust, we're going to be looking for feedback from the community as people start to run their environments. I mentioned the QEMU case; we know that's problematic.
A
We
we
run
on
virtual
machines.
You
know
vmware
all
the
cloud
environments
and
we're
not
seeing
that
to
be
an
issue,
but
you
know,
obviously
everybody's
workload
might
be
a
little
bit
different,
so
we
would
love
to
kind
of
get
a
feel
for
how
people
are
kind
of
seeing
this
in
the
wild,
and
you
know
experiences
you
may
have
had
in
the
past
and
try
to
go
back
and
see
if,
if
these
things
are
now
addressed
with
the
changes
that
we've
that
we've
implemented.