From YouTube: Monitoring ZFS by Richard Elling
Description
From the OpenZFS Developer Summit 2018
Slides: https://drive.google.com/open?id=1Q-I4xD6q_wWkmhCy5oJ0Gyt00EgvfCec
A
Our next speaker is Richard Elling. Richard comes from the Sun Microsystems days, so he has been exposed to ZFS from a long time ago, and he has been playing around with ZFS and making changes for many, many years. So today he is going to talk about observing and monitoring in ZFS, and how to improve in that aspect.
B
Thanks, and good to see everybody here again this year. How many years has it been, six? Wow, very good. Glad to see a lot of familiar faces. Like many of you, I deal with people with issues, and with monitoring systems, especially systems demonstrating misbehaviors.
B
For me, telemetry is very important. It's been a while since I worked on the Space Shuttle program, but STS stands for Space Transportation System, which is more commonly known as the Space Shuttle, and we had a lot of telemetry in there. This is a failure mode, or the result of a failure: STS-107 is Columbia, which burned up on re-entry when the wing burned through, and this is the telemetry, where you can clearly see that that was not normal, right?
B
We have a bunch of normal flights, and then we have the flight that was abnormal, and this happens to our ZFS systems in the world as well. And so the question is: how can we not only capture this and deal with it immediately for operations, but also handle it forensically? How do we go back and understand what happened back in time? It's very important. At NASA we collected all kinds of telemetry and stored that stuff forever, and so they were able to go back through and get that information. So, how?
B
Here's an example (love the font conversion). This is what you actually see in a Linux kstat file: you'll see some metadata in the first line, and then, not quite like this since it came out in a proportional font, you'll see name, type, and data as a tuple. Until recently, all the types were 4, which is unsigned int.
B
Recently
we
started
adding
in
some
strings,
so
you'll
see
type
seven,
so
any
if
anybody's
written
code
that
assumed
that
four
was
always
going
to
be
four
surprised.
It's
not
it's.
It
is,
in
fact
the
data
type
and
so
most
of
within
what
we
see
is
unsigned
in
64
for
K
stats
and
ZFS
on
Linux
that
becomes
important
soon
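For illustration, here's a minimal sketch of reading one of these files in Python, assuming the /proc/spl/kstat/zfs layout just described; the path and the two-line header skip are assumptions based on ZFS on Linux, not any stable API:

```python
# Minimal sketch: parse a Linux ZFS kstat file (e.g. arcstats) into
# {name: (type, value)}. Type 4 is unsigned int, type 7 is a string;
# don't hard-code type 4, decode whatever the type column says.

def read_kstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        lines = f.read().splitlines()
    # Line 0 is kstat metadata, line 1 is the "name type data" header;
    # the name/type/data tuples start after that.
    for line in lines[2:]:
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        name, ktype, data = fields[0], int(fields[1]), fields[2]
        stats[name] = (ktype, data if ktype == 7 else int(data))
    return stats

if __name__ == "__main__":
    for name, (ktype, value) in sorted(read_kstats().items()):
        print(f"{name:32s} type={ktype} {value}")
```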
There are also a few command-line tools that you can run to observe these things. In the ZFS on Linux tree there's zpool iostat, of course, built into the zpool command.
B
There's a couple of commands out there with history that goes way back, and they've been carried along: arcstat and arc_summary tell you a lot of information about the ARC and how it's currently using all your RAM. The dbuf stuff, dbufstat, and the SPL slab data are a little bit more developer-friendly, or maybe developers are more interested in them than, say, operations people, but they're there. And then, of course, in the performance world, you will always hear us say:
B
Can you give us the iostat -x output from that? The reason is that it shows us the device stats, including operations (reads, writes, operation counts) but, most importantly, latency, or at least an average latency. That's the first place where we go to say: hey, your average latency is 1.3 seconds, you might have something wrong with your disk. We see that a lot in the ZFS on Linux world. It also picks up zvols, so for people who were looking for instrumentation for zvols, bandwidth and latency are covered.
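As a rough sketch of the arithmetic behind that average, assuming the documented Linux /proc/diskstats field order (this approximates iostat -x's r_await/w_await, it is not the iostat source):

```python
# Rough sketch: average read/write latency per device from two samples
# of /proc/diskstats, roughly what iostat -x reports as r_await/w_await.
# Fields after (major, minor, name): reads completed, ms spent reading,
# writes completed, ms spent writing at indexes 3, 6, 7, 10.
import time

def snapshot():
    devs = {}
    with open("/proc/diskstats") as f:
        for line in f:
            f_ = line.split()
            devs[f_[2]] = (int(f_[3]), int(f_[6]), int(f_[7]), int(f_[10]))
    return devs

a = snapshot()
time.sleep(5)
b = snapshot()
for name, (r1, rms1, w1, wms1) in b.items():
    if name not in a:
        continue
    r0, rms0, w0, wms0 = a[name]
    r_ops, w_ops = r1 - r0, w1 - w0
    r_await = (rms1 - rms0) / r_ops if r_ops else 0.0
    w_await = (wms1 - wms0) / w_ops if w_ops else 0.0
    if r_ops or w_ops:
        print(f"{name}: r_await={r_await:.2f} ms  w_await={w_await:.2f} ms")
```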
B
It's all right there in iostat -x. Then there are a couple of others that are a little bit more specific: zfetchstat, which just looks at how well your prefetcher is working or not working, and then there's one you may not have heard of (maybe three people have it) that I wrote, called kstat analyzer, which started in the illumos space, and it tries to do a performance-guy analysis of all these kstats.
B
So that's what you can get at the command line, and there are probably a thousand others out there as well; it's pretty easy to write a tool to scrape that stuff. But we really want to use our eyeballs, right? So what I'm going to talk about next is really three stacks, or three major components, of modern databases and monitoring tools.
B
One thing I've seen in a couple of slides (I think Christian showed it) is using Prometheus with Grafana. Grafana is by and large a terrific open source project to help us take telemetry and other data and present it in dashboards; really good stuff, I can't recommend it highly enough. The other two are InfluxData's TICK stack, which is Telegraf, InfluxDB, Chronograf, and Kapacitor, and the other project you'll hear a lot about is Prometheus, which comes... where's that originally from? Yeah.
B
Grafana runs in the browser, in the sense that JavaScript in the browser runs everywhere, and it actually turns out to be quite good, even on small phones. So that's pretty cool. It's got a plug-in architecture and lots of community stuff; check it out, it's good stuff. I then store stuff into a time-series database. In the old days we would use SQL, and you'll see some old people talking about using things like Cassandra,
B
you know, databases which really weren't specifically designed for time series. But nowadays we recognize the need to really focus in on doing real time-series databases, and they're becoming commercially viable, with big open source environments. So, the difference between the two, InfluxDB versus Prometheus: push versus pull, discussed a little bit earlier, and they've both got query languages. To me, the most important thing is data types. You remember earlier today George was talking about the ZIO pipeline, and where are you in the ZIO pipeline when you get an error?
B
Well, you'll get an event with that error, and in that event it will tell you where in the pipeline you were: which pipeline stages you were instructed to go through, and where you are in that pipeline stage, which is really useful information. It's an integer, right? It's an enum; sorry, it is a bit field. And the problem is, if I store that... I certainly won't do it as a boolean, but if I store it as a float in Prometheus, then I have a different version.
B
A different version has a different pipeline set. Suppose we add encryption: now that number doesn't make sense unless I also know which version of the OS it came from. So the reason I like strings as data types is that when I generate that event information, I can decode it right there and give you a string that says: these are the pipeline stages, and this is where it broke. And it comes out nicely and gets stored in my database.
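Here's a minimal sketch of that decode-at-the-source idea; the stage names and bit positions below are illustrative placeholders, not the real ZIO stage assignments, which is exactly the property that makes decoding at the source valuable:

```python
# Illustrative sketch: decode a pipeline bit field into stage names
# where the event is generated, so the telemetry stores a string.
# These names/bit positions are made up for illustration; real ZIO
# stage assignments differ between OpenZFS releases.
STAGES = [
    (1 << 0, "OPEN"),
    (1 << 1, "READ_BP_INIT"),
    (1 << 2, "CHECKSUM_VERIFY"),
    (1 << 3, "VDEV_IO_START"),
    (1 << 4, "VDEV_IO_DONE"),
    (1 << 5, "DONE"),
]

def decode_pipeline(bits):
    names = [name for mask, name in STAGES if bits & mask]
    return "|".join(names) if names else "NONE"

# The string means the same thing no matter which OS version emitted
# it, because it was decoded where the bit field was defined.
print(decode_pipeline(0b101101))  # OPEN|CHECKSUM_VERIFY|VDEV_IO_START|DONE
```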
B
The other thing you'll notice about Prometheus is its float issue. We do high-speed data systems where I work, and so we look at counters. Just for example, bandwidth counters: I'm counting bytes going across my I/O path. For the high-speed systems we have, we will roll over 43 bits of mantissa in just a few days, and in the systems we have on the drawing board, we'll roll over that counter in maybe two days. It's just not going to fit in a float, and so.
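A quick worked example of that rollover, using a hypothetical 100 GB/s path:

```python
# Worked example: float64 carries 53 significand bits, so a byte
# counter stored as a float loses integer exactness past 2**53.
limit = 2 ** 53        # last range where every integer is representable
rate = 100e9           # hypothetical path speed, bytes/second
print(limit / rate / 3600, "hours")   # ~25 hours at 100 GB/s

x = float(2 ** 53)
print(x + 1 == x)      # True: the counter can no longer count by one
```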
B
Collectors, agents, and aggregators: there are others I'm leaving out here, but these are the two that I keep track of, trying to keep the ZFS on Linux data collection in them up to date. The Prometheus project has a node exporter, which puts out the generic set of I/O statistics (network bandwidth, CPU, memory, those kinds of things for a compute node), and there's a ZFS plugin for that which will give you all the ZFS data. And then there's Telegraf, for the InfluxData environment.
B
Telegraf also collects most of those things as well, and there I'm able to convert those enums to strings so they're a little bit more useful. Both of those are there, and my Christmas project was to get them up to date with where we were as of Christmas 2017, so they should be pretty current, and they're out in the released and supported versions, which is pretty cool. So let's take a look, then.
B
So I only have two colors, sorry, but basically there are times when the ARC has been asked to shrink, and then for a while it's not going to try to grow again; that's when arc_no_grow has ticked over to on, and so it goes into a phase where it won't grow. You can actually see that in the bottom graph, where we were growing the ARC, and then we got to a point where we started to reduce the ARC size.
B
This is what we still need to figure out: then we tip over to a no-grow phase, and then after a while there's a timeout for no_grow and it's allowed to grow again; you can see it grow a little bit, then come back down and go back up. This is the kind of thing where, when you look at it this way, it's immediately obvious that, oh, I have a problem here, and we are working on this problem.
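A minimal sketch of watching those two values together, reusing the read_kstats() sketch from earlier; the arcstats field names size and arc_no_grow are the ZFS-on-Linux ones, but treat the exact names as an assumption for your release:

```python
# Minimal sketch: sample ARC size alongside arc_no_grow so the
# shrink / no-grow / regrow phases show up in one telemetry stream.
# Assumes the read_kstats() sketch above and ZFS-on-Linux field names.
import time

while True:
    s = read_kstats("/proc/spl/kstat/zfs/arcstats")
    print(f"{int(time.time())} arc_size={s['size'][1]} "
          f"no_grow={s['arc_no_grow'][1]}")
    time.sleep(10)
```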
B
It's annoying, but it's a new thing, so we know it's something we've introduced. By contrast, imagine I had been in an email conversation with somebody about that: so, send me the arcstats... no, no, not those ones, I need this one; I need to know what arc_no_grow is, and I need to know these various levels for these other parameters. It would take a week to get through that email chain, right? But if they had all the data always collected, and it's all right there, then I can just say: pull up this dashboard, or I will email you a dashboard; you take a look at the timing in question, and then we can dive into it further. And with that, I'll make a shameless plug for tomorrow, for the people at the hackathon.
B
We start to peel this onion when we start to look at this data. When I have all the stats available at my fingertips in the database, going back in time, going back two and a half years in my lab, for example, then there are questions we can ask now of the experiments we did two years ago, things that we didn't understand might have been a problem back then, and we can go back and take a look at that, thanks to the databases. And with new dashboards, we can deliver new analysis as well.
B
Here's another example of the same sort of thing, where we can show clearly hit rates and that kind of thing. And since this was done in the browser, when I hover over a particular time point, you can actually see all the values numerically, and so this is very useful; it becomes very interactive once you start to get into it. So tomorrow, for those that are interested, at the hackathon we can go and do some stupid pet tricks.
B
This is the graphical version of what you would see scroll off as a few pages of text if you ran iostat -x. This is just a transition from ways of doing random writes, and then we started the scrub, and these are hard drives, so you can kind of see it. And with that, I really wanted to point out that we can start to view these things differently: on the bottom graph,
B
I use a model where writes go down, so I use a negative y-axis for the writes, and reads go up. When you do that, it's very obvious, when I look at the graph, whether I am writing versus reading; my brain will tell me. And then, if you look at the total bandwidth when I'm doing both reads and writes, I can get an appreciation of the total bandwidth in the system, all without looking at any numbers, right? Let your eyes do the math for you. So that's an example.
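A minimal sketch of that plotting convention with made-up numbers, assuming matplotlib is available:

```python
# Sketch: reads on a positive y-axis, writes negated, so the eye sees
# read-vs-write mix and total bandwidth without reading any numbers.
import matplotlib.pyplot as plt

t = range(10)
reads = [120, 130, 90, 0, 0, 200, 210, 190, 180, 175]    # MB/s, made up
writes = [80, 85, 90, 300, 310, 20, 25, 20, 30, 25]      # MB/s, made up

plt.fill_between(t, reads, step="mid", alpha=0.6, label="read MB/s")
plt.fill_between(t, [-w for w in writes], step="mid", alpha=0.6,
                 label="write MB/s (down)")
plt.axhline(0, color="black", linewidth=0.5)
plt.ylabel("bandwidth (MB/s), writes negative")
plt.legend()
plt.show()
```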
B
Then we did some random fills, and we did some writes and scrubs. So next I want to talk about the dashboards I build. In the ones I release to people, I put some documentation in at the bottom row of the dashboard, and you can do markdown there, for those who are interested, so you can get a little bit fancy with the documentation. Earlier today we were talking about the ZIO pipeline and briefly mentioned the fact that we break these I/Os out into queues, and so we have five queues.
B
They are sync read, sync write, async read, async write, and the scrub queue, and those all merge back together and actually get sent down to the disk, which also has a queue. And we want to understand: where are we spending all the time? Because we have these tunables in the scheduler, but to date,
B
you kind of need to understand what's in the queue during your experiments, and then also what you want it to look like at the end, right? So: what is it now, and what do I want it to look like? If you look at the full zpool iostat -w output on Linux, you'll actually see a histogram of the latencies in all of these queues. Again, great font translation.
B
But trust me when I say that you will see these things. And we gather this data not only at the top level, which is what I'm showing here (when you run zpool iostat, you get the top-level roll-ups for all of the vdevs underneath), but in fact, in the system, there is a histogram for all of these queues for every vdev, all the way down to the leaf vdevs.
B
In the groups that do dashboards, you'll find heat maps that don't do anything like what you thought heat maps did; these are latency heat maps, made very famous by early work at Joyent a few years ago, and now readily available to you. So this is the same data over time in the queues. End to end, we have the top-level read and write latencies for the pool, and in this dashboard I put reads on the right.
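For the curious, here's a sketch of the binning that produces one column of such a heat map: counts in power-of-two latency buckets, in the spirit of the zpool iostat -w histograms (the bucket layout here is illustrative):

```python
# Sketch: one sampling interval's latencies binned into power-of-two
# buckets, i.e. one column of a latency heat map.
import math

def histogram_column(latencies_ns, buckets=32):
    col = [0] * buckets
    for ns in latencies_ns:
        col[min(int(math.log2(max(ns, 1))), buckets - 1)] += 1
    return col

# Made-up scrub-read queue residency times: a few around 2 us,
# the bulk around 8 ms, echoing the distribution described below.
samples = [2_000, 2_500, 7_500_000, 8_000_000, 9_000_000]
for b, count in enumerate(histogram_column(samples)):
    if count:
        print(f"[2^{b}, 2^{b + 1}) ns: {count}")
```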
B
So in this experiment, which I just ran on my laptop on the plane over, I was doing a fill of a small pool, and then, when it finally filled up, I did a scrub. I've called out that time period: at the top level we see a bunch of writes, and then at the top level the scrub does a bunch of reads. Okay, so that should be pretty readily, hopefully intuitively, obvious.
B
When we go down to the next level, async and sync reads versus writes, then you can really start to see it break out. You can see clearly that we've got synchronous reads going along; the only async read that I'm aware of is a prefetch, so the only time you're going to see async reads is prefetch, and all other reads are basically synchronous. And then we have scrub reads, and scrubs have an intentionally lower priority than the others.
B
So these are the queues waiting to get to the disk, the three columns on the left side, and then the actual disk queues are on the right. What you can clearly see, or what I tried to intentionally show you, is that in the scrub read queue there are a few I/Os, a few residency times, down around the two-microsecond kind of range, and then the bulk of them are way up around, you know, eight milliseconds or so.
B
It's time going up from microseconds to milliseconds to seconds on the y-axis, and we get a distribution across there, and you can see that a whole bunch of these I/Os are queued in the scrub read queue for, you know, an eight-millisecond kind of timeframe. If we look at that versus the other queues, then it becomes interesting: do we need to bump up the priorities for the scrub queue or not? I don't know the answer, but this is where I can kind of get a feeling for that.
B
Similarly, on the bottom row we have writes, and you can see the writes coming through: I filled up the pool, paused for a little while, and then did the scrub, so you can kind of get that. So hopefully, if we can get more and more people to take a look at these, especially when you're off tinkering in the lab with new code and a repeatable workload, and even some ZTS stuff, then we can get an idea: is this really an improvement? Is it doing what I think it is?
B
I have a question coming: how am I going to get this to you guys? I'm going to put together a page that tells you how to build this and get to the point where we can collect the data, and then somewhere on my GitHub site I've got the start of a series of dashboards that are specifically oriented towards my ZFS work. And there are a few dashboards in the public domain, out on the Grafana dashboard lists, that do some ZFS work.
B
Oh, so how much space do I need to store all this stuff? It's surprisingly compact. Most of my data is stored in InfluxDB, and InfluxDB has an extraordinarily compact ability to compress the data: since it knows it's time series and it knows the types, it really compacts it quite well. I've got two and a half years of data, I'm up to about 140 machines right now, and it's using about 200 gigabytes of disk space.
B
So the question is: how do we parse the data and get it up into InfluxDB? The way it actually works is that inside Telegraf, and inside the Prometheus node exporter, there's a ZFS agent that will read the kstat files and then convert them into the format for the telemetry stream that's handed to you. So it's a transformation kind of workload; it's not rocket science or anything. The biggest impediment to us doing more of these is: how do we get the code?
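Here's a sketch of that transformation, reusing the read_kstats() sketch from earlier and emitting InfluxDB line protocol; the measurement and tag names are illustrative, not Telegraf's actual schema:

```python
# Sketch: turn kstat tuples into InfluxDB line protocol, the same
# shape of transformation the Telegraf/node-exporter ZFS agents do.
import socket
import time

def to_line_protocol(stats, host):
    ts = int(time.time() * 1e9)          # line protocol timestamps are ns
    for name, (ktype, value) in stats.items():
        if ktype != 7:                   # numeric kstats -> integer fields
            yield f"zfs_arcstats,host={host} {name}={value}i {ts}"

for line in to_line_protocol(read_kstats(), socket.gethostname()):
    print(line)
```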
B
You know, that was obviously a red-alert kind of situation, and you want to get those as early as can be into the operations sense. At the same time, there's an operations group that usually does capacity planning, and you want to help the capacity planners predict how much more gear they need to buy from us. And so all these things are very useful for us, and we want to do that and to help that, and so we have different consumers for the data.
B
Yeah, I mostly do performance work, and in the performance work I care entirely about latency; all right, not entirely, mostly about latency, because that's where it's most painful. But for an operations guy, maybe they care about, you know, what is my daily increase in load, right? Which is a very different question. The tooling can handle both, and then it's a matter of: how do I communicate a dashboard that makes sense?
B
I'm of the opinion: don't delete anything. So I'm like, oh yeah, I collected all this data, every 10 seconds or 15 seconds or 1 second, depending on what it is, and I don't ever want to delete it, which is why my lab has this database that's been around since I started working there. But that becomes impractical at large scale, and so those guys want to downsample and all that, and there are ways to do that. Fortunately, most of the things we do in ZFS are counters, and they're unsigned int64.
B
They're always going to go up, and so they're easy to downsample: I can just take a sample and throw away an intermediate sample and still retain that growth over time, which is really cool. To developers: gauges are troublesome, averages are appalling, and incrementing counters are good; just take a count, we love it. Any other questions?
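A tiny sketch of why those monotonic counters downsample so gracefully: dropping intermediate samples still preserves the growth between the samples you keep:

```python
# Sketch: monotonic counters survive downsampling; throwing away
# intermediate samples keeps the growth between retained samples.
samples = [(0, 0), (10, 1_000), (20, 2_200), (30, 2_900), (40, 4_100)]

def downsample(series, keep_every=2):
    return series[::keep_every]

def rates(series):
    return [(t1, (v1 - v0) / (t1 - t0))
            for (t0, v0), (t1, v1) in zip(series, series[1:])]

print(rates(samples))               # full-resolution rates
print(rates(downsample(samples)))   # coarser, total growth unchanged
```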