From YouTube: Capacity Usage Calculator by Kody Kantor
From the 2019 OpenZFS Developer Summit
Slides: https://drive.google.com/open?id=1pfg0NxMVHJqiptxeNaTKMI461IJaKxdj
Thank you. So my name is Kody; I work at Joyent, and I've been there for about two and a half years. This talk is about something I wrote that will generate Visio diagrams for you. So, I think a lot of us have had to answer some questions when capacity planning comes around and folks are buying servers and such, and with ZFS the answer to these questions is actually really complicated.
B
Usually
somebody
just
says
like
hey,
like
where's,
all
of
our
storage
going
or
something
and
ask
somebody
who
knows
something
about
ZFS.
You
have
to
like
sort
of
explain
all
these
little
nuances
of
ZFS
and
it
can
be
really
difficult
to
express
these
things
to
folks
that
maybe
aren't
as
interested
in
ZFS
or
just
want
to
answer
their
capacity
planning
calculation.
So,
like
some
of
these
things
like
record
size
stripe
with
compression
like
how
do
they
actually
impact
the
data
that
we
need
to
store
in
production?
There are a few ways we can approach this. Well, I guess the impetus for this talk is really the record size portion of it. At Joyent, both in community interactions and internally, there are a lot of questions about what this record size setting is. We talked about it a little bit already today, but what does it do? We know that it can probably affect performance, but can it also affect the amount of usable space on our pool?
The first, I don't know if you would call it the naive, solution to modeling out capacity usage is just to use production hardware: write production data to it and see how much space gets used. The problem with this is that it's really expensive to get these production-sized machines, hundreds of tebibytes of storage; you can't just take that out of your pocket and start using it. And then you need access to production data:
Is a customer going to be okay with you just copying their data from production into this other machine? You have to do some analysis there. And then, also, it's super boring. I don't know if you've had to do this sort of thing, but filling up a two or three hundred tebibyte pool takes forever, and usually, if you want to make it not super boring, you have to write a bunch of software to do it.
Anyway, I mean, I like writing software, but if I'm going to write software I'd rather do more interesting things. So we came up with this option, where we actually have no time or hardware. This is probably the reality for most of us. What we ended up doing was gathering statistics about our production data and then sending them through a script, and this script you can just run on your laptop.
So this is perfect, because we all have a little laptop or something. There are also problems with this, though, and that's what this talk is really about: all the problems that we ran into when we were doing this modeling. The problems really come down to software; it's really hard to get right, and modeling something like ZFS is really difficult. It's a very complex system.
The other tough part about this is that we have to go into it knowing that we're doing this for capacity planning decisions. Any mistake that we make along the way could be a huge mistake down the line. If you buy too few machines, then what do you do: tell your customers there's no more storage? Or, if you buy too many machines, your cost per gigabyte isn't as attractive as it could have been.
So I guess the initial theory here, the impetus, was the record size setting. The initial theory, my understanding, was that the capacity used for parity data in a raidz2-style pool topology doesn't change as block size changes. My thought was: hey, take example one here. You use a 1 MiB record size and you write a single
1 MiB file. Then it'll just be one megabyte plus whatever parity data raidz2 adds, say two 4K disk sectors, and that's only about 1/128th of the file size in overhead. But if you use 128K records, the RAID-Z overhead would be so much more, because you have more records. It turns out this assumption is incorrect, but I didn't know that at the time.
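Roughly, the back-of-the-envelope math I had in mind looked like this (a sketch, assuming 4K sectors and two raidz2 parity sectors per record, and, as noted above, it turns out to be the wrong model):

    # Assumed parity overhead for a 1 MiB file at two record sizes:
    # two 4K parity sectors per record, ignoring padding and allocation rounding.
    awk 'BEGIN {
      filesize = 1024 * 1024; sector = 4096; nparity = 2
      for (rs = 131072; rs <= 1048576; rs *= 8) {      # 128K and 1M record sizes
        records = filesize / rs
        parity  = records * nparity * sector
        printf "recordsize=%dK parity=%dK overhead=1/%d\n",
               rs / 1024, parity / 1024, filesize / parity
      }
    }'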
That's why it's an assumption. We went along with it anyway, and the first thing we did was try to verify that ZFS was doing the things we thought it was doing. Here we just created a file system, set its record size to 1 MiB, and then wrote a 1 MiB file uncompressed, and we can see that on disk it is 1 MiB of data taking up 1 MiB.
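Roughly, that check looked like this (a sketch; the pool and dataset names are made up, and this assumes a simple test pool where du reports just the file's data):

    # 1M recordsize, compression off, exactly one 1 MiB file from /dev/urandom.
    zfs create -o recordsize=1M -o compression=off tank/rstest
    dd if=/dev/urandom of=/tank/rstest/file bs=1024k count=1
    sync                      # let the transaction group commit before checking
    du -h /tank/rstest/file   # expect roughly 1.0M on disk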
So that's good; that's exactly what we wanted. But then, if we write one more byte to that same uncompressed file on its uncompressed file system, suddenly the size on disk becomes 2 MiB. So even though we're storing 1 MiB plus 1 byte of user data, we're taking 2 MiB of space on disk, and that's not good. That's like 50% efficiency. You went from really good efficiency, storing exactly 1 MiB, to taking 2 MiB to store effectively 1 MiB of data. So that was really scary.
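Continuing the sketch above, appending a single byte shows the effect:

    # With compression off, the new tail is still written as a full 1M record,
    # so allocated space roughly doubles.
    printf 'x' >> /tank/rstest/file
    sync
    du -h /tank/rstest/file   # now roughly 2.0M on disk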
You know, the last record wasn't truncated, and this is when we realized that this isn't a trivial thing anymore. We can't just do easy multiplication; we have to take into account the object size and how it can influence the efficiency of the data that we're storing. So we have to consult the file size distribution of the files in our pool.
The y-axis here is the number of objects in the pool. This is just some random pool that I took from a Jenkins service in our staging environment, and the x-axis is object size, in megabyte buckets. On the far right is 10 MiB, and on the left is between 0 and 1 MiB. You can see that we have a ton of really tiny files, a ton of files less than 1 MiB.
The big problem with this is that pools have millions and millions of files. How are we going to model that? We can try to shove it into a Google Docs spreadsheet, which is totally what I tried to do, but it turns out Google Docs doesn't like ingesting a paste buffer of two and a half gigabytes.
I very quickly learned that Google Chrome just can't handle that sort of thing. So what we did was create this simulator, and initially the simulator was going to be really trivial: just figure out how much user data you have, add the parity, because it's constant, it doesn't change, and then we get this nice little output on the right here, which I'm sure you can shove into a Visio diagram.
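That first trivial pass amounts to something like this (a sketch; GNU find's -printf, the path, and the 9-wide raidz2 factor are all assumptions):

    # Naive first cut: total the user data and scale by a constant parity
    # factor, here an assumed 9-wide raidz2 (7 data disks : 2 parity disks).
    find /tank/rstest -type f -printf '%s\n' |
      awk '{ user += $1 }
           END { printf "user bytes: %d, with parity: %d\n", user, user * 9 / 7 }'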
If you need to, I don't know. And with that data we drew this fancy graph here. On the bottom, the blue section is user data; that's constant. But with the simulator we could dynamically change the record size that we were modeling with. The first run was 128K, just the default, but we could also simulate how much capacity would be used if we used a 1 MiB record size, and the red section is the wasted space.
[Audience question]
That's a really freakin' good question; I don't know why I didn't do that. So it turns out that if you actually use compression, then that last-block problem, where the tail takes up an entire record, isn't a problem anymore. It still uses a little bit more space, it's not exactly just one extra byte of disk used, but it's much, much less than an entire 1 MiB record. So at this point I thought: okay, well, I was totally wrong about a lot of my initial assumptions here.
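A sketch of that experiment, continuing the hypothetical dataset from above (compression=lz4 is just one choice):

    # With compression on, the mostly-empty tail record compresses away, so the
    # 1 MiB + 1 byte file stays close to 1 MiB even though the first megabyte
    # (from /dev/urandom) is incompressible.
    zfs set compression=lz4 tank/rstest
    dd if=/dev/urandom of=/tank/rstest/file2 bs=1024k count=1
    printf 'x' >> /tank/rstest/file2
    sync
    du -h /tank/rstest/file2  # much closer to 1.0M than to 2.0M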
So we went and looked at zdb again, and at the source code for zdb, because otherwise you don't really know how to use zdb, I guess. And we learned that we need to take into account all these other things: the parity complexities; padding, which I think Allan was asking about; how wide your stripe is, since we weren't even taking stripe width into account in the first version of the simulator's capacity calculation; and block pointers, which sometimes take up a non-trivial amount of space.
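The kind of zdb digging involved looks roughly like this (pool and dataset names are made up):

    # Pool-wide block statistics, broken down by object type.
    zdb -bb tank
    # Per-object detail (dnodes, block pointers, indirect blocks) for a dataset.
    zdb -dddd tank/rstest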
Minuscule allocations behave differently than really big allocations, and then of course there's compression, which can be really important as well. So we iterated on the simulator, and now we get a little bit more granular output. This is from one of the later versions of the simulator, where we can see where our storage is actually going, and using this we can make much more accurate graphs. On the top left here, this is similar to that first graph I showed you: the blue is user data.
There's also how many blocks we're allocating, because that can make a difference in the performance of other things in the system, like scrub or resilver, and maybe help you spot compression problems. And then stripe width: if you want to optimize for padding, for whatever reason, you can see how much padding you're going to have in the system as stripe width changes. To generate this data for a pool that has, I don't know, 60 million files in it takes about three minutes.
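For a sense of the padding math (a sketch, not from the slides: RAID-Z rounds each allocation's data-plus-parity sectors up to a multiple of nparity + 1, and the 4K sectors and 9-wide raidz2 here are assumed):

    # One 8K logical block on an assumed 9-wide raidz2 with 4K sectors:
    # data sectors + parity sectors, rounded up to a multiple of (nparity + 1).
    # Small blocks can end up being mostly parity and padding.
    awk 'BEGIN {
      sector = 4096; nparity = 2; width = 9; psize = 8192
      dsect = int((psize + sector - 1) / sector)                        # data sectors
      prows = int((dsect + (width - nparity) - 1) / (width - nparity))  # parity rows
      asect = dsect + prows * nparity
      pad   = (nparity + 1 - asect % (nparity + 1)) % (nparity + 1)
      printf "data=%d parity=%d pad=%d total=%d sectors\n",
             dsect, prows * nparity, pad, asect + pad
    }'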
So, you know, I would much rather do something that takes three minutes than fill up an entire 300-terabyte pool, and I think you probably would too. I guess that's really the simulator in a nutshell. The problem with it is that it is software that's separate from ZFS, so it's only going to be as good as the knowledge of the person who implements it.
So in my case, I know there are still a bunch of problems in the simulator that really should be fixed to give us better insight into capacity usage, into pool usage. This only takes into account the file allocation stuff, but I know there's a lot of other overhead in the pool, like the pool-wide slop space that scales with pool capacity, or ZAPs; if you're using large dnodes and putting a bunch of stuff in there, this doesn't account for that.
So I think that's mostly what I have. I do have some links here if folks are curious; it looks like the simulator link was clicked, which is cool. The simulator is the third thing on the bottom. It's actually just an awk script, like four hundred lines or something, and you know, awk isn't the most performant thing ever, but it's sort of fun to write. So yeah, if you're interested, we can iterate on it there.
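A tiny sketch of the core idea, not the actual ~400-line script: read one file size per line, round each file up to whole records, and charge raidz2 parity per record (the parameters here are assumptions, and padding and metadata are ignored):

    awk -v rs=1048576 -v sector=4096 -v nparity=2 -v width=9 '
      {
        user   += $1
        nrec    = ($1 == 0) ? 0 : int(($1 + rs - 1) / rs)
        alloc  += nrec * rs                  # the tail record is not truncated
        rsect   = rs / sector                # sectors per full record
        parity += nrec * nparity * int((rsect + (width - nparity) - 1) / (width - nparity)) * sector
      }
      END {
        printf "user=%.1fG allocated=%.1fG parity=%.1fG\n",
               user / 2^30, alloc / 2^30, parity / 2^30
      }'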
[Audience question]
I mean, zdb's man page is sort of frustrating, because it just says: add more of this letter for more things. It's really unclear. And also, the fields that zdb outputs: I think they make a lot of sense if you're in the ZFS code a lot, but for somebody who's just casually observing a system, they're not really documented anywhere.
Unless you source dive, and even when you source dive into zdb, you then have to go into the ZFS core code and figure out, in this struct, this is what this field means, and look at where it's used. So I don't know; more documentation, I suppose, or maybe other output formats would be helpful as well, because right now it's sort of a big raw data dump. Well, maybe that's what it's useful for, not sure.
Yeah, absolutely. So the question, Steve asked about what happened when we enabled compression on this file. Previously it took 2 MiB to store 1 MiB and 1 byte; after we enabled compression, even though the data was incompressible, it was from /dev/urandom, we still saw benefits from compression. And yeah, absolutely: just turn compression on, even if you have incompressible data, for cases like this, I think. And our data, for the most part, is mostly incompressible.
I was just thinking: hey, what if we use really, really tiny blocks on a raidz2 pool? And you could see that there was a hundred gig of user data on the pool, but then there was another 50 gig used for padding and another 50 gig for RAID-Z parity data. So yeah, you can model all sorts of stuff like this, even if it's just for fun, and it's really cool to see how that can change things.
[Audience question]

Yeah, you probably could. What I used: I just ran a find on the pool directory, so you could run that, find on the root, and throw that into the simulator and get an output. That's basically what I was doing. You could totally do that on a live production dataset, yeah.
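As a sketch, that pipeline looks something like this (GNU find's -printf, the path, and the script name are assumptions, not from the talk):

    # Collect one file size per line from a live dataset and feed the simulator.
    find /tank/rstest -type f -printf '%s\n' | awk -f capacity-sim.awk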
Yeah, so Allan says that he's also seen some differences between the data I presented and data that he's seen, so it will be interesting to run that through the simulator as well. And yeah, I totally think so, and you know, there could be inaccuracies in the simulator too, just because we don't understand something, or we don't have a model that's perfectly correct. I would say that for us at Joyent, even small percentage gains, even a 1% gain in efficiency, are huge when you take it to a really large scale.
Yeah, so Matt says it's similar with zdb: as many pieces of information as you can fit on a line. Yeah. So the thought is that if we can produce output from the simulator that's similar to the output of zdb's block statistics, then yeah, we could sort of cross-check the algorithms. I agree. Cool. Well, thank you very much.