From YouTube: ZFS Performance Troubleshooting Tools by Gaurav Kumar
From the 2020 OpenZFS Developer Summit
Slides: https://drive.google.com/file/d/1YzulcT7p7TvHF50aI-Rxg6CMZMIGnxL_/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
All right, so let's talk about performance. The agenda is: we'll talk about some of the goals that we have, what tools are available when we have to debug a problem, then we'll take an example of a case that we looked at and how all these stats come together to figure out what's going on in the system, and then some key takeaways.
As a goal, we have to identify bottlenecks. More importantly, we need to be able to avoid pitfalls: sometimes we might start looking in the wrong direction, and how do we backtrack from there? And there are times when we make some changes, we do some tuning, and we should be able to really see the impact of it. The impact may not be very evident just by looking at the throughput or the latencies; it might be well hidden beneath the layers.
So please don't just jump in, grab perf or any other tool, and start debugging. You have to first understand the system. Let's talk about the internal mechanisms that we have within ZFS. There are a lot of metrics already built into ZFS: we have logs, and we have counters that can be really helpful to nail down an issue. If that is not sufficient, we have external tools, eBPF and SystemTap, that we can use to get more data out of the system. And oftentimes we underestimate the power of visualization, but as Matt showed with flame graphs in his previous presentation, showcasing all these metrics in some UI is really beneficial, and we'll talk about that in later slides.
Now, this one looks like a functional issue, not a performance issue. One of the easiest things that you can do is enable the code paths that return errors. After doing that, if you look at those logs, there's an interesting entry there: dsl_dir_tempreserve. Basically, we are trying to see if the incoming write can be served on the disk, and we are seeing an ERESTART error here, so looking at the code we can quickly find out what's going on.
You may not always have the leverage to deploy all these tools and make use of them, so your best tools available are the logs and the counters that you already have on the system. dprintf is what we are basically using here. The only issue I have with dprintf is that it's very chatty, and the log can roll over very quickly, so be careful about that.
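As a minimal sketch, assuming a Linux host with the zfs module loaded: the debug log that dprintf feeds is exposed through the dbgmsg kstat, and the zfs_flags module parameter controls what gets logged (the flag bits are version-specific, so check the zfs-module-parameters man page for your release):

    # Read the in-kernel ZFS debug log, where dprintf output lands
    cat /proc/spl/kstat/zfs/dbgmsg

    # Turn on extra debug logging; the bit values vary by version,
    # so verify against your zfs-module-parameters man page first
    echo 1 > /sys/module/zfs/parameters/zfs_flags

    # Clear the log (supported on recent OpenZFS)
    echo 0 > /proc/spl/kstat/zfs/dbgmsg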
It tells you what's going on, which is important. For example, we actually had a performance issue where there was a certain spike in latency for an appreciable amount of time, and when we looked at the pool history we saw that there were dataset deletes happening, a lot of deletes, and then we figured out it was a problem with the trims.
So it could be really useful. A cool tip: use the -i option. It can give you more information about when those events were happening; for example, it will dump the transaction group number. Looking further, we have zpool events, which is important because it can tell you if you're seeing checksum issues on a disk.
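A quick sketch of both commands; tank is just a placeholder pool name:

    # Pool history including internal events (-i); internal entries
    # carry the transaction group (txg) number for each operation
    zpool history -i tank

    # Recent events: checksum errors, degraded vdevs, slow I/Os, etc.
    zpool events
    # -v dumps the full payload of each event
    zpool events -v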
zpool events also tells you if your disk is in a degraded state, and more than that, I like to even see if a particular I/O is taking longer than usual. So I generally tend to tune this parameter at times; last I checked, it was set to a 30 second default.
What it means is that any time an I/O takes more than 30 seconds, it's going to raise an event, and it will be logged. So I generally put a small value here, just to see, out of all the disks I have, if there's any particular disk where I'm seeing higher latencies compared to the other disks. The typical output that you will get here is a delay report from ZFS; it will dump the path of the disk and a bunch of other information.
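The tunable isn't named in the audio; on recent OpenZFS, the 30-second default described here corresponds to the zio_slow_io_ms module parameter (older releases used different deadman-related names, so verify on your version). A sketch:

    # Default is 30000 ms; flag any I/O slower than 1 second instead
    echo 1000 > /sys/module/zfs/parameters/zio_slow_io_ms

    # Follow events as they arrive (-f behaves like tail -f) and look
    # for ereport.fs.zfs.delay entries naming the slow vdev path
    zpool events -f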
The next thing is, you also want to see whether you are hitting the write throttle. If the writes are coming in and you are not getting the performance, how do you look for write throttling? The stats to look at are the dmu_tx stats; it's a bunch of counters.
If you see the last two, the dirty delay and the dirty over max counters, getting incremented, that simply means that you're not able to drain the dirty data as fast as your incoming I/O. And if you see the initial ones, the top four, getting incremented, that simply means that you have memory pressure: your ARC is either shrinking or it's not growing, or there are issues with your memory, so you probably want to take a look at that part.
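These counters live in the dmu_tx kstat; a minimal way to watch the throttle indicators (counter names can differ slightly across versions):

    # All DMU transaction counters
    cat /proc/spl/kstat/zfs/dmu_tx

    # Watch just the write-throttle indicators once a second
    watch -n1 'grep -E "dirty_delay|dirty_over_max" /proc/spl/kstat/zfs/dmu_tx'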
Going one layer further down, we also want to see how the transaction groups are moving along, the most important piece being how they're getting synced to the disk. We do have stats for that: we can dump the txgs kstat and it will dump all the information about how much data we are syncing in each transaction group.
It shows how long the transaction group was open, how much time it took to get quiesced, how much time it waited to be synced, and the syncing time itself. As you can see, transaction group 43950 was waiting to be synced because the previous transaction group was taking that much longer, and hence this one was waiting. So this is a good way to figure out how your underlying infrastructure is behaving with respect to syncing.
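Those per-txg numbers come from a per-pool kstat; assuming a pool named tank:

    # One row per recent txg: dirty bytes plus otime/qtime/wtime/stime,
    # the open, quiesce, wait, and sync durations in nanoseconds
    cat /proc/spl/kstat/zfs/tank/txgs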
Again, the ARC is the heart of the system. We have arcstat, and we have the arcstats proc interface, which also gives a lot of information regarding whether reclaim is happening or not, or whether arc_no_grow is set, for example. All that information is useful.
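A sketch of both interfaces:

    # Raw ARC counters: size, targets (c, c_max, c_min), hit and miss
    # counts, memory_* reclaim counters, arc_no_grow, and more
    cat /proc/spl/kstat/zfs/arcstats

    # The arcstat utility that ships with OpenZFS summarizes the same
    # kstat at an interval, here every second
    arcstat 1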
And finally, we are at the pipeline stage and we want to look at the disk performance, how my disks are behaving. One of the easiest things you can do is just check whether any of your disks is in a degraded state, and then you can run zpool iostat to see how your I/Os are happening, what bandwidth you're getting from the pool, and so on. But you need more information than this; it's very high-level information.
What you really want to know is the time the I/Os spend in the queues versus on the disk, across different latency buckets. And last but not least, you also want to see whether these are small or large I/Os, because that again dictates the throughput.
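zpool iostat can surface all three views; a quick sketch:

    # Is anything unhealthy? -x prints only pools with problems
    zpool status -x

    # Per-vdev bandwidth and IOPS, refreshed every 5 seconds
    zpool iostat -v 5

    # -w: wait (latency) histograms, queue time versus disk time
    zpool iostat -w 5
    # -q: active and pending counts per I/O queue class
    zpool iostat -q 5
    # -r: request size histograms, which also reveal aggregation
    zpool iostat -r 5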
One thing I wanted to highlight is ZFS tuning: every system is different, workloads are different, and there are times when you want to tune something according to your workloads or systems. Here's a link for all the tunables that we have, and the good thing is that this is really very well documented; it's very easy to understand what a parameter means and what it can do for you. So this is a good place to start looking.
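On Linux, those tunables are exposed as zfs module parameters; a minimal sketch of reading and setting one (zfs_dirty_data_max is a real tunable, used here only as an example):

    # List every tunable the loaded module exposes
    ls /sys/module/zfs/parameters/

    # Read one
    cat /sys/module/zfs/parameters/zfs_dirty_data_max

    # Set it at runtime; persist it via /etc/modprobe.d/zfs.conf
    echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max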
Obviously, ZFS is running as a subcomponent in a bigger system where you have CPU, memory, disk, network, and so on, and any of these things can impact it. So it's important to look at the health of the whole system: how much CPU is being spent, whether the CPU is idle or saturated, whether you have memory available, and so on. I'm not going to go and talk about these; these are the basic commands that are available in Linux, vmstat and friends.
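For reference, the standard Linux health checks this slide refers to are along these lines:

    vmstat 1          # run queue, memory, swap, CPU summary
    mpstat -P ALL 1   # per-CPU utilization and saturation
    free -h           # memory available to applications
    iostat -x 1       # per-device utilization and latency
    sar -n DEV 1      # per-interface network throughput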
So let's talk about some of the external tools now. The way I'm going to talk about this is I'm going to take three examples that we really dug into in the past, and how we used different tools to help us figure out what's going on. There was no preference as such for why we chose one tool over another, because all the tools can do the same thing; it was just the choice of tool at the time.
One of the issues that we faced long ago was the ARC shrinking to c_min. Now we might say: okay, if the ARC is shrinking, that means you have memory pressure, right? But there was no load. We warmed up the ARC, we left the system as it is, we didn't do anything on the system, and still the ARC was moving back toward c_min. Then we said, okay, looking at other components, I don't see anything happening: there's no other memory-hungry application running which is consuming memory or leaking memory.
So then we said: okay, one of the things that I know is that there could be allocations with the reclaim flag that can also trigger reclamation. So we said, let's look at the allocations that are happening on the system at that time, and ftrace came to mind; it's the Linux in-kernel tracing tool. What we did was mount the debug filesystem and set a filter, and this filter says that for anything which has the reclaim flag set, and if the order is greater than four, meaning the allocation request is for more than 64 kilobytes, dump the entry into the log. Then we enabled this event, and the output we got was interesting.
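A minimal sketch of that ftrace setup; the mm_page_alloc tracepoint and its order field are standard, but the exact filter expression for the reclaim flag varies by kernel version, so treat this as an approximation:

    # Mount the debug filesystem if it isn't already mounted
    mount -t debugfs none /sys/kernel/debug
    cd /sys/kernel/debug/tracing

    # Log only page allocations of order > 4, i.e. larger than 64 KiB
    # with 4 KiB pages; add a gfp_flags clause for the reclaim flag
    # as appropriate for your kernel
    echo 'order > 4' > events/kmem/mm_page_alloc/filter

    # Enable the event and watch the entries stream in
    echo 1 > events/kmem/mm_page_alloc/enable
    cat trace_pipe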
We saw that there was a request of order nine, which boils down to 2 MB pages, and we could see the GFP flag GFP_TRANS_HUGE being set. Looking at the documentation for what this means, we figured out that it has to do with transparent huge pages, and the way we fixed this problem was to say: hey, we don't need this, let's just disable it.
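The exact commands aren't shown in the talk; on a typical Linux system, disabling transparent huge pages looks like this:

    # Current mode is shown in brackets: [always] madvise never
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # Disable THP allocation and defrag at runtime
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag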
But I'll be honest: one of the things that intrigued us even at that time was the fact that we did see a lot of reclamation happening. In the arcstats we have the direct-reclaim and indirect-reclaim counters, memory_direct_count and memory_indirect_count, which tell how many times the kernel is trying to invoke the shrinker on the ARC as a result of the kswapd daemon or as a result of direct reclamation. And we did see what was mentioned in today's earlier presentation by George: there were so many counts, the counter was incremented thousands of times over a very short period of time. It always intrigued us, but we never invested the time to really dig into that until today.
I'm glad that we had that presentation from George, and I know that Matt has done a lot of work in this area, so thanks for that. At least now we know why; that fix is definitely required and it's a very good improvement.
[Moderator] I just wanted to let you know that you have a little bit over five minutes left.

Okay, thanks. So again, just a tip: balance between the ARC and the page cache. You can tune this parameter so that you don't unnecessarily shrink your ARC at the expense of the page cache.
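The audio doesn't name the tunable on this slide; one plausible candidate on Linux OpenZFS is zfs_arc_pc_percent, which ties ARC reclaim to the page cache size, but confirm the knob against the slide and your version before relying on it:

    # 0 (the default) disables the behavior; a non-zero percentage keeps
    # the ARC from collapsing under page cache scanning pressure
    cat /sys/module/zfs/parameters/zfs_arc_pc_percent
    echo 30 > /sys/module/zfs/parameters/zfs_arc_pc_percent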
We had another issue where we had similar testbeds but the performance was different. So we checked everything, and we said everything looks good from the backend side. So we said, let's start fresh from the top: what is ZFS seeing? We installed a basic SystemTap script and the results were interesting: we saw that one cluster was seeing all 1M writes versus the other one, which was seeing 64k.
It had to do with some client version. I'm not an expert in that, but at least by debugging on the ZFS side we were able to figure out what the issue was, and this is a sample output from the SystemTap script. So here we used SystemTap to our advantage.
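The script itself isn't reproduced in the talk; here is a minimal SystemTap sketch in the same spirit, histogramming the sizes reaching the ZFS write entry point. The function and argument names are assumptions that differ across OpenZFS versions, and debuginfo for the zfs module is required:

    # iosize.stp: histogram of write sizes as seen by ZFS
    global sizes

    probe module("zfs").function("zfs_write") {
        # uio_resid is the byte count of this write (version-dependent)
        sizes <<< $uio->uio_resid
    }

    probe timer.s(10) {
        print(@hist_log(sizes))
        delete sizes
    }

Run it with stap iosize.stp; two clusters showing 1M versus 64k buckets would stand out immediately in the histogram.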
I will skip through this slide and go to the next one. We were looking at one problem where we really wanted to look at what the different I/O sizes are and how frequently we're doing I/O on a disk. We could have used blktrace, but we decided on using eBPF. It was very useful and very easy to install; the BCC toolkit is a front end to eBPF, and there are already a lot of scripts there that can be used, and they're very helpful.
One of the scripts is biosnoop, and with it we just got the entire profile of which thread is doing what I/O, on what sector, what the size of the I/O is, and the latency that we're receiving from the block layer. So this is pretty cool.
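biosnoop ships with BCC and needs no arguments for the basic view:

    # One line per I/O: time, command, PID, disk, read/write, sector,
    # size in bytes, and block-layer latency in milliseconds
    /usr/share/bcc/tools/biosnoop
    # On some distributions the tools carry a -bpfcc suffix instead:
    # biosnoop-bpfcc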
So this brings us to the case study: how do we bring all these stats together? I think one of the goals was: how do we avoid using a lot of these external tools? We can definitely use external tools to figure out what's going on, but there are oftentimes cases where we don't have the leverage to use all these tools in a production environment on the customer side. So let's see how all these stats come into play.
So we had a problem where we were seeing not very good performance on NFS. It's a single-client, single-file write. As you can see here, we're writing a 50 gig file using fio, and the configuration is 12 cores, 32 gig of memory, and a 10 gig network. For us, the record size was 64k and compression is on. We disabled sync because we don't have slog devices, and for a very long time we were thinking that maybe the slog was the problem, maybe the sync I/Os were the problem, so we said: okay, let's disable it and see if we can get the throughput. And we have set the aggregation limit to one meg.
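A hedged reconstruction of that setup; the dataset name, fio job options, and tunable value are assumptions based on what the talk describes:

    # Dataset settings from the slide
    zfs set recordsize=64k tank/fs
    zfs set compression=on tank/fs
    zfs set sync=disabled tank/fs

    # vdev aggregation limit of 1 MiB
    echo 1048576 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit

    # Single-client, single-file 50 GiB sequential write
    fio --name=seqwrite --directory=/tank/fs --rw=write \
        --bs=64k --size=50g --ioengine=psync --numjobs=1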
This is the general I/O flow: we have an NFS client going over the network, the NFS server gets the request and sends it to ZFS, and ZFS writes to the disk. So let's isolate the problem. We said: okay, let's just short-circuit ZFS; we got one gig of throughput, so everything was good from the protocol side. So let's look at ZFS: we did a test directly on ZFS, and we again got one gig.
So everything looks okay from the ZFS side as well; so where is the problem? It's time to dig deeper now. Before we dug further into this, we wanted to set the stage. The question was: do we have enough metrics? So we added a lot of metrics, looking at some of the past problems that we have faced and things that we thought would make sense, like the workload pattern.
So then we looked at the ZFS write latency. It's just a profiling of the zfs_write function, which is the entry point for the write call, and we saw that most of the time the latencies were higher than 64 milliseconds.
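One way to get that kind of per-function latency histogram without adding custom metrics is BCC's funclatency; whether the zfs_write symbol is traceable under this name depends on your kernel and module build, so treat the target as an assumption:

    # Latency histogram of zfs_write in milliseconds, printed every 5 s
    /usr/share/bcc/tools/funclatency -m -i 5 zfs_write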
Now, that was intriguing, because we are writing to RAM, right? We write into the ARC, and I can do better than this even on a physical drive. So this was something intriguing, and the first thought that came to our mind was that it's probably because of the throttling.
So we looked at the throttling stats and we saw that we are indeed delaying. So we said: good, now we know why we are delaying the clients. But then we started looking a little deeper and we said: let's look into how much time we are delaying the client, and when we did the math, we were not able to account for the entire delay that the zfs_write call was showing.
So we said, let's look at the CPU and the ARC. Everything looked okay: the CPU was 50% idle, the ARC was completely full up to c_max, and everything looked okay from that angle. Then we looked at the pool's dirty data syncing, and we said: okay, we are able to sync almost two gig of data in four seconds, giving us a throughput of 500 meg. And then we looked at the I/O stats, because obviously, ultimately, you'll have to look at the disks.
Looking at how the disks were performing: the average was 10 milliseconds, which looked okay, not bad, and the queuing delay on the I/O side was very, very low. So it's kind of saying that the I/Os are not sitting in the queue at all; you just push into the queue and somebody pulls it out and throws it at the disk.
So it means that we are writing to the queue, but we don't have enough I/Os in the queue to aggregate, and that's why we are doing small I/Os. I will not go through this diagram, but it shows how the write pipeline works: you need to have enough I/Os to aggregate, and the zio write issue thread is the one that is pushing to the queue. So we needed to take a look at what this thread is doing.
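To see where the write-issue threads were blocked, sampling their kernel stacks is one approach; the z_wr_iss thread name below follows the standard ZFS taskq naming, but verify it on your system:

    # Snapshot the kernel stack of each write-issue thread right now
    for t in $(pgrep z_wr_iss); do
        echo "== thread $t =="
        cat /proc/$t/stack
    done

    # Or profile blocked (off-CPU) time for 30 s with BCC, then search
    # the output for the write-issue threads
    /usr/share/bcc/tools/offcputime -K 30 | grep -B 20 z_wr_iss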
We were able to see that most of the time they were contending for a lock, and when we looked into it, it was basically a dnode-level lock. Basically, we have a lot of readers who are taking the lock in reader mode, and there's a writer who is blocked because the readers are there, and it's waiting for the readers to give the lock away, and so on. Because of this contention, things were not falling into place.
Fortunately for us, this is already fixed, thanks to all the efforts by Paul; it was fixed as part of this commit. And why did we miss it? Because it was not part of the dot releases: it was not part of the 0.8 release, and it would have been nice if it had been included in 0.8.
So just a quick recap, a before-and-after comparison of the patch: once we took the patch, the performance got a boost from 600 to 1100, almost double, and we can also see the aggregation happening much more efficiently after the patch, because we have reduced the contention point; we are sending more I/Os to the queue and we are able to get the aggregation benefits.