From YouTube: Log Spacemap Update by Serapheim Dimitropoulos
Description
From the OpenZFS Developer Summit 2018
slides: https://docs.google.com/presentation/d/1qxsbZGt1jCwhz-eHmilZS0ZATxGvvk7VLYBd1JnCCtI/edit?usp=sharing
A: Welcome to day two of the OpenZFS Developer Summit. We have, I think, five shorter talks today, so it's going to be a little bit more casual. We'll give the speakers a little bit of latitude if there's discussion that we want to have, in terms of the time, but I've asked the speakers to plan for about ten to twenty minutes each.
A: So welcome. Hopefully, folks, we're using a different streaming technology to stream it to the Internet today, on YouTube, so hopefully that's working as well. If not, let us know in the YouTube chat; I'll be monitoring that. With that, I'll hand it over to Serapheim in a moment. One more thing: we don't have a microphone for the speaker, so speakers, please try to project your voices, and folks in the audience, kind of in this general area, please try to keep the background noise down so that everyone can hear them.
B: This is a performance update on the log spacemap project. The log spacemap project is something that I presented at last year's summit; part of it was still an open question, and because of that we didn't have any concrete performance results. So that's what this presentation is going to be about. Just a small recap of what the project is about: we started working on this project from a problem that we found at Delphix. Basically, there were some pools that were experiencing degraded performance, and the reason for that was that each txg we were issuing a large number of I/Os for metadata updates.
B
Specifically,
we
were
offending
through
each
at
a
small
space
map
every
Dixie
and
the
reason
for
that
was
that
the
fragmentation
levels
were
very
high
and
the
workload
was
mostly
random
brands.
So
what
we
decided
to
do
about
it
all
basically
keep
all
the
changes
in
memory
and
don't
write
anything
to
this
right.
We
added
two
new
range
threes
per
meter,
slab,
one
for
lots
of
locations
and
one
front
of
fast
freeze
and
basically
how
things
would
go.
We
want.
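The in-memory bookkeeping described here can be sketched roughly as follows. This is a simplified Python model for illustration only; real ZFS keeps C range trees of offset ranges per metaslab, and plain sets of segments stand in for them here.

```python
# Simplified model of the per-metaslab unflushed-change bookkeeping.
# Real ZFS uses range trees of offset ranges; sets of (offset, size)
# segments stand in for them in this sketch.

class Metaslab:
    def __init__(self):
        self.unflushed_allocs = set()  # segments allocated since last flush
        self.unflushed_frees = set()   # segments freed since last flush

    def allocate(self, seg):
        # An allocation cancels a pending free of the same segment;
        # otherwise it is recorded as a new unflushed allocation.
        if seg in self.unflushed_frees:
            self.unflushed_frees.remove(seg)
        else:
            self.unflushed_allocs.add(seg)

    def free(self, seg):
        # Symmetrically, a free cancels a pending allocation.
        if seg in self.unflushed_allocs:
            self.unflushed_allocs.remove(seg)
        else:
            self.unflushed_frees.add(seg)
```

Nothing touches disk in this model; flushing a metaslab would simply drain both sets into its on-disk space map.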
B
Would
move
it
from
that
flash
freeze
to
that
class
dialogues
and,
if
you
want
afraid
you
move
it
from
them
classifications
to
them
fast,
freeze
and
now
the
question
is
okay.
Wouldn't
that
exert
a
lot
of
memory
pressure
of
the
system
and
in
reality
it
doesn't
as
much,
but
even
in
the
cases
that
we
do,
we
set
a
specific
limit
in
the
amount
of
memory
that
we
have
for
at
last
changes.
So
whenever
that
limit
is
exceeded,
we
start
flashing.
B
Some
metal,
slats
and
by
flashing
I
mean
that
this
doing
the
segment's
from
this
to
new
range
trees
that
we've
added
for
these
chains.
We
emptied
them
out
into
the
metal,
slab,
spaceman
and
now.
The
other
question
is
what,
if
we
crash,
we
have
all
these
things
in
memory
right.
What?
If
we
grass?
How
do
we
reconstruct
that
state?
B
And
the
answer
is
that
it's
it's
end
of
appending
to
each
metal
subspace
map
as
we're
doing
before
for
persistence,
we
issue
a
single
I/o
that
writes
all
the
metal
exchanges
in
a
single
pool,
wide
space
map
that
will
refer
to
the
log
space
map.
So
it's
thick
seed.
We
just
keep
all
these
changes
in
this
new
space
map
that
we
have
so
in
case
that
we
crossed
during
import
time.
B
We
just
read
all
the
space
maps
and
reconstruct
them
flat
state
and
now
the
other
question
is
won't
that
make
import
times
after
a
drastic
longer.
You
know
you
do
that
wholly
3d.
Well,
yes,
but
if
we
control
how
many
blocks
this
log
space
maps
have,
we
also
control
the
important
overhead
that
we
exert
at
the
pool.
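That import-time replay can be pictured as follows. This is a hypothetical sketch; the entry format and the function name are illustrative, not the actual ZFS code.

```python
# Hypothetical sketch of crash recovery with log space maps: read every
# surviving pool-wide log space map in txg order and re-apply its entries
# to the in-memory unflushed state of the metaslab each entry belongs to.

def replay_log_spacemaps(logs, metaslabs):
    """logs: {txg: [(metaslab_id, op, segment), ...]}.
    metaslabs: {metaslab_id: (allocs_set, frees_set)}."""
    for txg in sorted(logs):               # replay in txg order
        for ms_id, op, seg in logs[txg]:
            allocs, frees = metaslabs[ms_id]
            if op == "alloc":
                frees.discard(seg)         # alloc cancels a pending free
                allocs.add(seg)
            else:                          # "free"
                allocs.discard(seg)        # free cancels a pending alloc
                frees.add(seg)
    return metaslabs
```

The cost of this replay is proportional to the number of log blocks that survive, which is exactly why bounding the log's size bounds the import overhead.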
B
That
is
that
if
we
flashing
that
order,
all
their
logs
will
start
becoming
obsolete
by
obsolete
I
mean
that
their
entries
would
have
made
it
to
the
meta
subspace
maps,
which
means
that
we
wouldn't
have
to
read
them
during
import
time
and
I'm,
giving
an
example
of
that
just
for
demonstration
purposes.
So
let's
say
we
have
a
tool
like
200
meter,
slabs
right
we
are
currently
at
the
X
T
10.
You
know
you
can
see
metal
slab.
B
One
goes
from
blocks,
let's
say
from
0
to
10
minutes
left
to
from
11
to
20
so
far
and
so
forth.
We
read
into
all
all
of
them.
This
plastics
you
because
we're
in
the
whole
state
and
now
we
are
enabling
the
law
of
space
map
feature
weeks.
For
now.
Let's
say
with
last
one
medicine
practices
for
demonstration
purposes,
so
tipsy
11
comes
by
we're
at
all
our
changes
in
the
these
cool
white
locks
face
lock.
B
We
don't
touch
any
of
the
metal
slams
so
far
right,
so
you
can
see
the
first
two
blocks
are
freeze
from
box
3
2
4
8
9.
That
would
have
gone
to
metal
slap
1.
There
is
like
11
to
12
15
to
18.
We
have
got
2
metal
slab
too
bad.
We
keep
them
all
in
the
log
space
map
and,
as
I
said,
with
lines
in
one
metal
opportunity.
So,
basically,
by
flashing,
it
means
we
are
bending
to
that
metal,
slab
space
map
and
you
can
see
in
gray
over
here.
These
are
the
solid
entries.
B
So
in
the
next
week
she
would
flash
metal
slab
and
we
would
record
all
the
changes
of
the
pool
in
a
new
log
space
map,
and
you
can
see
we
have
gray
entries
in
both
the
old
books.
Based
on
that,
you
look
space
map
because
we've
lost
all
these
things
in
medicine.
So
after
we'll
do
like
a
whole
round
through
all
the
matter,
slabs,
let's
same
200
degrees
from
now,
because
we
have
200
meters
left
the
whole
log
of
2
X
11
is
going
to
build
solid
means.
We
don't
need
it
anymore.
B
There's
one
can
destroy
it.
So
that's
the
idea
behind
like
flashing
in
order,
because
we
can
get
rid
of
old
space
map
blocks.
So
that's
what,
where
we
stopped
last
time
so
now.
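The condition that makes old logs destroyable can be sketched like this. A simplified model, assuming round-robin flushing: a log written at txg T is entirely obsolete once every metaslab has been flushed at some later txg.

```python
# Sketch of why flushing metaslabs in a fixed order lets old logs be
# destroyed: a log space map written at txg T is entirely obsolete once
# every metaslab has been flushed at some txg later than T, because by
# then all of its entries live in the metaslab space maps.

def destroyable_logs(log_txgs, last_flush_txg):
    """log_txgs: txgs at which log space maps were written.
    last_flush_txg: per-metaslab txg of its most recent flush."""
    oldest_unflushed = min(last_flush_txg)  # the laggard metaslab
    return [t for t in log_txgs if t < oldest_unflushed]
```

With one flush per txg and 200 metaslabs, the laggard metaslab trails by 200 txgs, which is why a full round through the pool retires the oldest log.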
B: So now the question is: how many metaslabs should we flush each txg? The trade-offs are the following. If we flush fewer metaslabs, we issue fewer I/Os, but if the incoming rate of log blocks (basically the size of our incoming log space maps) is high, a lot of log blocks accumulate and import times take a hit.
B
If
we've
lost
more,
we
get
rid
of
log
space
maps
faster,
but
we
show
more
iOS
and
we
may
end
up
going
into
this
old
state
of
the
system
where
we're
appending
to
its
meta
stuff
space
map.
So,
as
you
can
see
the
work,
the
problem
is
were
codependent
and
ZFS
needs
to
adapt
to
that.
So
we
need
to
come
up
with
a
heuristic.
B
The
most
knife
here
is
weaken
the
one
that
we
that's
the
basis
for
all.
The
other
cases
that
you
came
up
with
is
the
block
limit.
Heuristic,
specifically,
we
said
like
a
limit
on
the
amount
of
lockbox
that
we
want
to
have
at
any
given
time
in
the
poll,
let's
say
like
a
thousand
lock
box.
So
as
we
have
more
incoming
blocks
and
log
space
maps
whenever
we
exceed
that
limit,
we
start
flashing
meta
slabs
until
we
get
below
that
limit.
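A minimal sketch of that block-limit policy, assuming we know, for each of the oldest logs, how many blocks destroying it frees and how many metaslabs have to be flushed to make it destroyable. Names and numbers are illustrative, not the actual ZFS implementation.

```python
# Naive block-limit heuristic: while the pool holds more log blocks than
# the limit, walk the oldest logs, flushing the metaslabs needed to make
# each one destroyable, until enough blocks have been reclaimed.

def blocks_to_flush(total_log_blocks, limit, oldest_logs):
    """oldest_logs: [(blocks_in_log, metaslabs_to_flush), ...], oldest
    first; flushing that many metaslabs makes that log destroyable."""
    flushed = 0
    for blocks, metaslabs in oldest_logs:
        if total_log_blocks <= limit:
            break                      # back under the limit; stop
        flushed += metaslabs           # pay the flush I/Os
        total_log_blocks -= blocks     # reclaim that log's blocks
    return flushed
```

Because everything happens in the txg where the limit is crossed, a burst of incoming blocks can force a large batch of flushes at once, which is the pathology discussed next.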
B
Basically,
that
limit
ask
access
like
an
upper
bound
in
the
overhead
of
the
important
that
we
accept
the
problem
with
that
heuristic
by
itself
is
that
it's
susceptible
to
lot
of
performance
pathologies.
These
are
some
of
some
sample
results
from
the
same
simulator
that
I
made
where
you
can
see
like
there
were
teeth.
These
are
Whitworth
flossing
basis,
almost
all
the
meta
slabs
in
the
pool
and
the
reason
for
that
will
could
be
many
scenarios.
B
It
could
be,
like
our
incoming
rate,
completely
changed
and
we
were
very
close
to
that
to
our
log,
but
we
had
a
lot
of
incoming
logs
on
that
specificity
and
like
not
a
lot
of
many
other
cases.
Basically,
the
problem
is
that
the
behavior
is
not
consistent
and
not
predictable.
We
can't
reason
about
it
and
because
I
don't
have
time.
This
is
a
lightning
talk.
There
are
a
bunch
of
other
heuristics
I've
open
sourced
my
simulator,
and
you
can
give
them
a
try.
You
can
specify
different
parameters.
B
Try
your
own
things
like
that,
and
I
want
to
talk
about
the
idea
of
heuristic.
The
idea
here
is
to
take
into
the
duration
the
distance
between
the
amount
of
logbooks
that
we
currently
have
and
the
limit
that
we
set.
So
if
we
are
far
away
from
the
limit
you
can
say
like
okay,
we
are
still
good.
We
don't
need
to
flash
as
much
if
we're
close
to
the
limit.
Would
you
say?
Oh
the
system
is
under
pressure.
Maybe
we
want
to
flash
a
little
bit
more.
B
The
second
consideration
is
the
current
incoming
rate,
regardless
of
how
far
you
are
from
the
limit.
If
your
income
in
rate
is
high,
it
means
that
you
can
approach
it
like
very
quickly,
so
you
need
to
take
that
into
consideration.
A
third
thing
is
the
distributions
of
metal
slug
flask
over
a
lot
of
space
map.
B
History
I
will
be
flashing,
a
lot
who
would
be
flashing
little
that's
important,
because,
basically,
our
class
in
history
correlates
to
how
easily
we
can
destroy
old
log
space
maps
and,
finally,
the
distribution
of
log
books
over
our
log
space
monkey.
Sorry,
how
has
the
incoming
rate
beam
in
the
past
few
TX
disease?
B
So
with
this
in
mind,
we
came
up
with
the
running
some
heuristic.
Basically,
the
way
that
this
works
is
that
it's
the
XT
can
take
the
current
incoming
rate
and
we
project
it
in
the
future
that
we
we're
gonna.
Let
make
projections
on
what
we
are
getting
so
far,
man
we're
saying
like
okay,
even
some
take
this.
For
now,
whenever
the
limit
is
exceeded,
how
many
meta
slots
would
we
need
to
flash
to
stay
below
the
limit,
and
then
we
take
the
average
of
these
metal
slabs
over
the
text.
Is
that
we
protected?
B
So
we
can
tell
okay
if
I
start
flashing
from
now
will
I
stay
below
that
limit?
How
many
medisoft
I
need
to
flash
from
now
in
order
to
stay
below
that
limit
on
that
giggety.
So
here's
an
example
that
I
have
you
can
see
that
that
table
is
basically
I'll.
Explain
all
of
this.
These
are
the
history
of
our
log
each
row.
It's
a
log
space
map.
So
the
first
row
is
the
log
space
nugget
of
Dixie
10.
B: So the idea of this whole table and the running sums is, at any point in time, to be able to say: if I wanted to get rid of N blocks, how many logs do I need to destroy, and how many metaslabs do I have to flush in order to destroy those logs? For example, if I just wanted to get rid of the first log, I would flush one metaslab, which would get rid of two blocks; if I flush three metaslabs...
B
It
would
get
rid
of
the
both
these
logs
log,
10
and
11,
and
it
would
last
eight
blocks.
You
get
the
idea
and
that's
why
we
call
them
the
running
songs.
Basically,
so
the
scenario
for
this
example
is
that
we
are
currently
at
txt
16.
You
know,
there's
no
log
over
here
for
60
16.
We
have
24
log
space
maps.
You
can
see
it
on
there
on
exam
right
here
too,
and
we
sent
a
block
limit
of
32
blocks
and
for
now
just
for
demonstration
purposes.
B
We
say
that
we
have
four
incoming
block
space
map
blocks,
so
we
start
running
our
heuristic
in
one.txt.
From
now,
the
limit
is
32
and
we
currently
have
24
blocks
in
our
pool
and
the
incoming
rate
is
4,
so
in
one
take
see
from
now,
we
would
end
up
with
4
blocks
below
the
limit.
The
limit
is
not
exceeded,
there's
nothing
to
do
in
two
ticks
trees.
From
now
who
were
left
with
4
lock
box
below
the
limit,
they,
as
I
said,
we
protect
the
incoming
rate.
B
So
we
get
exactly
to
the
limiting
to
take
this
from
now
now
in
3
take
this
from
now
we,
with
the
same
incoming
rate,
we
exceed
the
limit
right,
so
we
want
to
start
flashing.
So
in
order
to
get
below
the
limit,
we
need
to
free
at
least
4
blocks
right
I'm,
exactly
how
much
we
exceeded
so
the
first
row
in
the
table
with
a
block
running
some.
B
That's
at
least
four
blocks
is
Row
2,
that's
Dixie
11,
so
we
need
to
get
rid
of
Dixie
span
and
the
exist
11
logs,
which
requires
classing
three
meta
slams
and
that
would
release
eight
blocks.
That
would
get
us
below
the
limit.
So
we
need
to
flush
three
metal
slabs
in
three
to
exist.
In
the
future,
so
3
over
3
is
like
1
metal
slab.
Last
vertex
C-
and
you
know
we
can
go
on,
went
below
the
limit
by
4
are
the
incoming
rate.
Okay
was
four,
so
the
limit
was
exceeded.
B
Then
we
exceeded
the
limit
again
and
based
on
our
updated
table
who
say:
okay.
Now
we
need
to
flash
seven
metal
slabs
to
release
six
blocks,
because
we
accept
the
limit
by
four,
so
7
over
5
X
is
gonna,
be
we
need
to
flush
two
metal
stops
by
pixi
in
order
to
stay
below
the
length
and
we
keep
growing
until
we
go
over
our
whole
table
and
by
the
end
we
just
take
that
column.
When
we
take
the
maximum
of
the
meta.
B
Slabs
plastic
see
that
we
calculated
in
the
past,
which
can
pieces
that
we're
gonna
stay
below
the
limit
based
on
the
incoming
rate
of
box.
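Putting the pieces together, the running-sum calculation can be sketched roughly like this. A simplification of what the slides describe, not the actual ZFS code; the running-sum table in the test reuses the numbers from the example above (24 blocks, limit 32, incoming rate 4, logs of 2 and then 8 cumulative blocks).

```python
# Sketch of the running-sum heuristic: project the current incoming rate
# of log blocks over a horizon of future txgs; for each future txg where
# the projection exceeds the block limit, look up in the running-sum
# table how many metaslabs must be flushed to destroy enough old logs,
# average that over the txgs until then, and keep the maximum.

def metaslabs_per_txg(current_blocks, limit, rate, running_sums, horizon):
    """running_sums: cumulative (blocks_freed, metaslabs_flushed) pairs,
    oldest logs first, as in the table described above."""
    needed = 0.0
    for t in range(1, horizon + 1):
        excess = current_blocks + rate * t - limit
        if excess <= 0:
            continue                   # limit not exceeded this far out
        # First running-sum row that frees at least `excess` blocks.
        for blocks_freed, metaslabs in running_sums:
            if blocks_freed >= excess:
                needed = max(needed, metaslabs / t)
                break
    return needed
```

With the example's numbers, the first excess appears three txgs out and requires three metaslab flushes, giving a rate of one metaslab flushed per txg.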
B: So here are some sample simulation results. This is from a hypothetical pool of 300 metaslabs; we set the block limit to be 300 blocks, and the incoming rate was randomly chosen between 10 and 64 blocks.
B
We
want
to
come
up
with
the
same
default,
because
this
limit
indirectly
controls
the
classroom
rate,
so
the
higher
the
limit
is
less
meta
stuff
that
we
need
to
class
the
lower
the
limit,
the
more
pressure
on
the
system.
We
need
to
flash
more
so
the
driving
factor
on
deciding
on
that
or
that
we
want
each
flask
to
count
meaning
we
want.
Every
time
that
we
sent
an
I/o
to
append
to
a
marathon,
we
wanted
block
size
to
be
utilized.
B
The
other
thing
to
take
into
consideration
is
that
meta
stop
space
map
entries
are
more
generally
one
word
while
logs
face.
Biometrics
are
always
two
word
and
the
reason
is
because
we're
talking
about
a
cool
wide
space
map,
we
need
to
log
entries
because
there's
a
field
about
the
video,
so
we
can
be
specific,
which
meta
slab
on
which
we
that
we're
talking
about
so
so
far.
B
We
would
say
that
our
factor
of
log
entries
to
some
medicine
space,
my
matrices,
are
around
four
and
if
that's
not
enough,
well,
there's
also
consolidation,
the
log
space
month,
for
example,
you
may
have
a
block
that
you
allocated
freed
and
then
allocated
again,
but
in
the
meta
subspace
map
it
will
count
as
only
one
entry
as
like
one
block
allocated.
So
the
consolidation
comes
into
play.
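The back-of-the-envelope arithmetic behind that factor of four might look like this. An illustration only: it combines the two-word log entries with an assumed roughly-half share of entries surviving consolidation and obsolescence, which matches the ratio reported later in the talk.

```python
# Illustrative arithmetic for the log-to-metaslab entry factor: log
# entries are twice the size of common metaslab space map entries, and
# roughly half of the log entries are consolidated or obsolete by flush
# time, so one metaslab space map block absorbs about four log blocks.

WORDS_PER_LOG_ENTRY = 2    # pool-wide entries carry a vdev/metaslab field
WORDS_PER_MS_ENTRY = 1     # common one-word metaslab space map entry
SURVIVING_FRACTION = 0.5   # assumed share of log entries still live

factor = (WORDS_PER_LOG_ENTRY / WORDS_PER_MS_ENTRY) / SURVIVING_FRACTION
print(factor)  # 4.0
```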
B
So
now
performance
results.
These
C's
the
number
of
fire
ups
over
five
days,
a
five-day
long
experiment
that
we
ran.
Basically,
what
we
did
is
we
had
two
pools
with
the
same
setup,
but
one
of
them
had
the
log
space.
My
feature
enabled
and
we
started
doing
random,
writes
to
them
until
they
reached
high
fragmentation
and
their
reach,
but
they
also
it's
steady
state
in
terms
of
number
of
ions.
Basically,
there
weren't
any
fluctuation
everything
was
ready.
B
Also,
during
the
course
of
these
experiments,
we
verified
our
assumption
about
the
amount
of
obsolete
entries,
the
log
space
map
at
any
given
time.
So,
basically,
what
you
need
to
emphasize
more
is
this
ratio
over
here,
which
is
around
like
0.5
at
any
given
time,
which
means
that
we
can
verify
our
assumption
about
half
the
entries
being
obsolete
at
any
given
time.
B: Yes; actually, part of these changes was also changing how the number of metaslabs per vdev is decided. We made some changes for each metaslab to be a certain size (I don't remember if it's 8 gigabytes or something like that), but basically yes, with these changes you also have fewer metaslabs.
A: I think in your previous talk, where you talked about the motivation of the whole project, we talked about the fact that if you didn't have this, lowering the number of metaslabs would kind of achieve the same thing, because you're only appending to a smaller number of metaslabs. But that has...
A: ...its own trade-offs: import times were bad, memory usage was bad, the number of I/Os was bad, and yeah, you could trade off between any of those, but you're just going to make another one of them, or two of them, that much worse, and they're already bad. So this lets us get much more flexibility, and the system is much less sensitive to the number of metaslabs, so that trade-off largely goes away.
B: That's exactly it, yes. So basically, what we ended up doing is that for pools that are very small and pools that are very large, we set limits on the number of metaslabs, but in between, we basically have a fixed metaslab size, and we add more metaslabs as your storage capacity grows.
B
That's
a
good
question:
I
haven't
done
like
a
lot
of
testing
in
terms
of
that,
but
so
far
in
because
we
run
with
this
feature
enabled
so
far
since
like
February,
and
we
haven't
found
any
issues
with
it.
The
thing
is
that
if
you're,
if
you
don't
have
a
presentation,
you're
still
dependent
on
taking
coming
right,
like
basically
down
the
flashing
algorithm,
doesn't
work
in
terms
of
fragmentation
right
so
that
ideally
shouldn't
matter.
B: Yes, yes. Actually, that's exactly what made that first block limit heuristic go crazy, and we tried with a lot of different incoming trends. For example, you suddenly have something steady and then you jump, or you go very low; or something that looks like a sine wave, where you kind of have a lot of incoming rate and then it goes down; and also a randomized sine wave.
B: Yeah, it's very interesting, because we spent a lot of time on this and I started reading up on queueing theory, and I couldn't find any model that has this weird correlation between two distinct things, like in our case the metaslabs flushed and the log blocks that we can destroy.
A: Before Serapheim's changes, there's this idea of sync to convergence, where you might be syncing, and then you have to allocate some stuff, and then you have to write out the space maps. You still have that with Serapheim's changes, but because now, in the worst case, one more sync pass means that you're appending to one space map, we actually saw the number of passes required to sync to convergence improve.