Description
From the 2022 OpenZFS Developer Summit https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://drive.google.com/file/d/1bz1IGuzKdEPze4uLm8Cp6jEtogryv_u1/view?usp=sharing
Good morning, everybody. Yes, as was said, my name is Alexander Motin. I am an OS team leader at iXsystems, working on FreeBSD for many years, and on ZFS also for many years, in practice applying ZFS to FreeNAS and now TrueNAS storage appliances on both FreeBSD and Linux. Today I'm going to talk about several aspects of my work earlier this year: faster ZFS scrub, faster pool import, and adaptive speculative prefetch.
Let's start. I began looking at this earlier this year, just from trying to fix a minor bug, but I couldn't stop myself from doing some benchmarks, and I benchmarked scrub in several possible directions. The first direction was the case of a highly fragmented pool. I set up a quite beefy system with a lot of NVMe drives and filled the pool with four-kilobyte blocks. I thrashed it for a couple of hours with random rewrites, which caused pool fragmentation of about 70 percent, and then I ran CPU profiles of the two different stages. Scrub first has its scan stage, where it passes over all metadata and processes it, and second the issue stage, where it actually issues the I/Os.
Here you may see what I first found on the scan stage. You may see here the AVL-tree sorting, where for each vdev the scan process tries to order all requests in offset order to speed up execution, and on the right a lot of CPU time is spent in the B-tree, where ZFS only tries to find sequential chunks; so that's kind of a counterproductive workload.
But what hit me the most: you may see here about a quarter of all CPU time spent in memmove operations inside the B-trees. I dove deep into the trees, how they look, and started optimizing them, only later noticing that memmove for the insert operation takes incomparably more time than it takes for the remove operation, while in this workload inserts and removes are actually quite comparable in number, because the I/Os are getting aggregated.
So those should be comparable. That made me look into the trivial FreeBSD memmove code, where I found that sometimes it's better not to be too clever. In practice it appears that on modern CPUs the memmove operation is not so efficient if you try to move data in the descending direction, and just removing a bunch of code improved memmove in that case by several times. Obviously it's always workload-specific, but that was an unexpected find.
After that I got these results; you may see it got much better. Maybe, had I seen it in that state from the start, I wouldn't even have started the project, but I was already halfway there.
So what the B-tree does: it has a tree of elements, like an AVL tree, but it stores multiple elements within each tree node. The most interesting part is the leaf nodes; that's where most of the activity happens. They are four-kilobyte chunks full of elements, and if the B-tree tries to insert something in the middle of a node, it obviously has to shift all elements after that point to the right; on removal it has to shift elements to the left. So with a full leaf size of four kilobytes and an average fill of 75 percent, it means for each insert or remove operation we need to move one and a half kilobytes of memory on average. Obviously that's not great, but I found it's quite easy to also allow empty elements in front of the list.
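The leaf-shifting idea above can be sketched as follows. This is an illustrative model only, with invented names and a tiny capacity, not the actual OpenZFS zfs_btree code: keeping a gap at the front of the leaf as well as at the back lets an insert shift whichever side is shorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4 KB-style B-tree leaf with free space at BOTH ends. */
#define LEAF_CAP 16	/* tiny capacity for illustration */

typedef struct leaf {
	uint64_t elems[LEAF_CAP];
	size_t first;	/* index of first used slot (gap in front) */
	size_t count;	/* number of used slots */
} leaf_t;

/*
 * Insert val at logical position pos (0..count) so that
 * elems[first..first+count) stays ordered, moving the smaller half.
 */
static void
leaf_insert(leaf_t *l, size_t pos, uint64_t val)
{
	size_t left = pos;		/* elements before the insert point */
	size_t right = l->count - pos;	/* elements after it */

	assert(l->count < LEAF_CAP);
	if (l->first > 0 &&
	    (left <= right || l->first + l->count == LEAF_CAP)) {
		/* Shift the (shorter) left side into the front gap. */
		memmove(&l->elems[l->first - 1], &l->elems[l->first],
		    left * sizeof (uint64_t));
		l->first--;
	} else {
		/* Shift the right side one slot toward the back gap. */
		memmove(&l->elems[l->first + pos + 1],
		    &l->elems[l->first + pos],
		    right * sizeof (uint64_t));
	}
	l->elems[l->first + pos] = val;
	l->count++;
}
```

With a leaf that is 75 percent full, moving the shorter side roughly halves the average memmove size compared to always shifting right.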
The second part I tried to improve is the scrub process itself. Scrub actually has not one B-tree but two B-trees, used for slightly different purposes. Both B-trees were using 24-byte elements, including the start of the segment, the end of the segment, and a field with the number of bytes filled within the segment. For the first B-tree that makes sense; for the second, much less. The second is used only to track candidate segments, so we practically need only the start and some score, which is calculated based on all three values.
For the sorting, I found that the B-tree scrub was using had a quite complicated comparison function, which in practice required two 64-bit divisions for each comparison operation. That obviously is a terrible waste of time even on a modern system, not talking about some older ones where divisions are even worse. So what I was able to do is squeeze those 24 bytes into a single eight-byte value, where I put the score in the upper 8 bits, actually using only five or six of them.
That's enough because I use just an exponential scale; the score is computed once on insertion, and then elements are compared as simple 64-bit values. I squeezed the start into the lower bits, because we know that the shift is always at least 9 bits (offsets are multiples of 512), which gives us the required space. After that, you may see what happened: dramatically better results. memmove usage halved, as expected from the element-size optimization, and the comparison function got about four times cheaper because of the removed divisions. So now the time is practically spent on cache misses.
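A minimal sketch of the kind of key packing described, with an invented name and score formula; the real OpenZFS element layout differs. The score lands in the top 8 bits on a rough log2 scale, and the start offset, always a multiple of 512 since ashift is at least 9, fills the rest after a 9-bit shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch (not the actual OpenZFS code): squeeze a scrub
 * segment's 24-byte sort key into a single uint64_t.  The score goes
 * into the top 8 bits on an exponential (log2) scale, computed once
 * at insertion; the segment start offset fills the remaining bits.
 * Because device offsets are multiples of 512 (ashift >= 9),
 * shifting the offset right by 9 loses nothing.
 */
static inline uint64_t
seg_key(uint64_t start, uint64_t fill)
{
	uint64_t score = 0;

	while (fill > 1) {	/* log2(fill): 5-6 bits are plenty */
		fill >>= 1;
		score++;
	}
	return ((score << 56) | ((start >> 9) & ((1ULL << 56) - 1)));
}
```

Comparing two such keys is a single unsigned 64-bit compare: higher score wins, ties break by start offset.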
That's all, and the AVL tree now consumes a significant chunk of CPU time. It would be great if it could be somehow optimized, but the code was already pretty much optimal, and the biggest problem is just cache misses, of which there are plenty in an AVL tree. With my recent optimization experiments with B-trees, I think maybe we could consider moving some more hot code paths to B-trees, just to reduce the number of cache misses and uncached pointer dereferences, because a B-tree is more cache-dense: it has bigger leaves and may not require so many pointer dereferences. But again, it works best for very small elements, which I was able to achieve here by reducing them to just eight bytes. The next part I took a look at was the issue stage of scrub.
You may see a huge amount of time spent on lock contention, like 72 percent, and a closer look showed there were actually three different contentions. One contention was caused by using a shared ZIO, one pool-wide for all scrub I/Os over the whole pool, all the vdevs.
Obviously each addition and removal of a child ZIO required a lock and unlock, and it was terrible, so I just introduced one intermediate ZIO for each vdev. A very small patch, and it solved the problem immediately.
Second thing: I found that scrub calculates statistics for all blocks in a pool, which were originally accessible only from the illumos debugger; on neither FreeBSD nor Linux is there a way to use them, and they are not used anywhere.
So I did two things. I moved the statistics from the issue stage, where they required a global lock, to the scan stage, which is single-threaded anyway and doesn't require any locks. And second, I disabled them until somebody needs them and has an idea how to export as much data as they collect. They particularly include the number of blocks at each level of indirection and the sizes of blocks; it just collects a lot of data. I just haven't found a good way to represent it nicely in user space.
So if anybody has an idea: it's just disabled with a loader tunable or module parameter or whatever. And the last thing that remained, as we have in other cases, is the spa config enter/exit lock, which was already micro-optimized several times. Probably not much more can be done, except maybe we could replace it with some other primitive that is much more suitable for concurrent accesses, maybe something OS-specific.
But that's where it ended up after optimization: you may see lock contention reduced from 72 to 44 percent, and IOPS tripled at this point. Here are the total results of scrub time; it was reduced in half. In green you may see the scan stage, in red the issue stage, and yellow is actually mixed, where the scan code doesn't currently properly separate the time where it actually scans from the time where it issues. But you may see that all of them got reduced; the issue stage reduced the most, but the scan stage also got better.
That's it for the first part. The second part of my investigation was about large blocks. It's a significantly different problem, because there we don't care about IOPS, but we care about the efficiency of the process. I used the same configuration, just added a few more NVMes and changed it from stripe into mirror and RAIDZ, and I filled the pool with one-megabyte blocks and ran scrub. This is the case of a mirror, and you may see that only 24 percent of CPU time is actually spent on checksumming.
The rest is spent on memory copies, like half of the CPU time, and you may also see lock contention caused by the fact that some of the memory copies are done under a lock and in parallel, so there is a fixed amount of lock contention no matter what you do. Investigating that, I found that scrub was doing a quite weird thing. If we have a two-way mirror, each disk is read into its own buffer; that's predictable, checksums are calculated. But then, for everything that successfully passed the checksums, the data is copied into the parent buffer.
So it may be copied twice, three times, four times, whatever width of mirror you have. And then we have the ditto blocks support, which is also a mirror inside, and that copies the data one more time. So for an N-way mirror we always had N plus one memory copies. I found that's completely unneeded. I was able to share the buffers: the original buffer is shared with the ditto buffer and then with one of the mirror
buffers downstream. I tried to choose the most promising vdev, but if that one fails, returns some failure or checksum mismatch, only then is the data copied from the other vdev. Otherwise this process completely removes the memory copies, and here you may see that checksumming now takes 76 percent of CPU time. Probably not much more can be done past that point; I can only wish.
The results for RAIDZ are not as bad as they were originally for mirrors, but there is still one memory copy actually caused by the same mirror code, because ditto blocks in case of RAIDZ are still a mirror, and that is one memory copy. The second memory copy was used for RAIDZ parity verification: it took the original buffer, allocated a new one, copied the previous parity there, recalculated, compared. I just replaced that copying with a buffer swap: I allocate a new buffer, put it in place, take the other one out for the comparison, then free it. Just some trivial optimization, and here's the result.
We may still see a memory copy; here it is, hiding under a slightly different name, but that memory copy is part of the RAIDZ parity calculation, because all the parity functions are single-argument functions using an accumulator buffer. If we have a three-wide RAIDZ, we first copy the data and then do the parity with the second vdev. If we would introduce a dual-argument parity function, that could be avoided; that's like 15 percent of CPU time. It would be a nice project for somebody to touch; I haven't bothered, at least yet.
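The suggested dual-argument parity primitive can be sketched for simple XOR (RAIDZ1-style) parity; function names are invented here, and the real vectorized OpenZFS routines look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
xor_into(uint8_t *dst, const uint8_t *src, size_t n)	/* dst ^= src */
{
	for (size_t i = 0; i < n; i++)
		dst[i] ^= src[i];
}

static void
xor2_into(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
	for (size_t i = 0; i < n; i++)	/* dst = a ^ b, no copy needed */
		dst[i] = a[i] ^ b[i];
}

/* Accumulator style, as described in the talk: memcpy plus XOR passes. */
static void
parity_copy_then_xor(uint8_t *p, const uint8_t **cols, int ncols, size_t n)
{
	memcpy(p, cols[0], n);
	for (int c = 1; c < ncols; c++)
		xor_into(p, cols[c], n);
}

/* Proposed style: the first pass consumes two columns at once,
 * eliminating the initial memcpy (requires ncols >= 2). */
static void
parity_two_arg(uint8_t *p, const uint8_t **cols, int ncols, size_t n)
{
	xor2_into(p, cols[0], cols[1], n);
	for (int c = 2; c < ncols; c++)
		xor_into(p, cols[c], n);
}
```

Both produce identical parity; the second version simply touches each byte of the first column once instead of twice.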
The results got dramatically improved. You may see that scrub time reduced a lot and CPU usage dropped a lot, especially in the case of mirror. This bandwidth is a total summed over all the devices; we increased it from 20 gigabytes to 30 gigabytes per second. I also measured memory bandwidth from hardware performance counters. You may see that in the case of mirror it even dropped, from 373 to 109 gigabytes per second. That is still a lot, I know; it means we consume several times more memory bandwidth than we have data bandwidth.
We should investigate what actually happens in the case of data reads and writes, how many times we hit memory there, but at least in the case of scrub I was able to reduce it in half.
That should help a lot when we go to faster systems, faster pools, where we may not have as much memory bandwidth as on this system with 12 memory channels; there are a lot of systems with one, two, four, whatever. So that's the result for scrub. The next project I had earlier this year is pool import time. For our TrueNAS appliances we need high-availability solutions, where we need to guarantee failover in case of a controller fault, or just a routine update, preferably within like 30 seconds.
A bit more than 30 seconds is generally acceptable, but we also need to detect that the controller failed, we need to do SCSI reservations, we need to restart services, networking, reconnect everything; so the pool import itself must be as fast as possible.
Obviously what we had was not acceptable even for non-HA, and for HA it's not even close, not anywhere. Even if we take an SSD pool, import still takes five to ten minutes, also not even close, but there we can't even blame the SSDs; it's not a problem of the SSDs. I went investigating and found the biggest problem was the log spacemap replay. I'm not going to dive too deep.
There was a presentation a few years ago about the log spacemap, but to put it briefly, the idea of the log spacemap is to avoid per-metaslab spacemap updates on every transaction group, because each disk has several hundred metaslabs, and so spacemaps. If we have a thousand disks, multiplied out that's 200,000 or 300,000 metaslabs, and if the pool is written randomly, each of them updates, which creates a lot of traffic. So instead ZFS first writes a single sequential pool-wide log, and then flushes the updates out to the distributed spacemaps later, to reduce the number of IOPS.
Originally the growth of the log was handled by limiting the number of blocks in it. It's limited in two ways. First, it's limited to four blocks per metaslab in the pool, a value selected to guarantee acceptable space efficiency for the log records, so that we have enough data to fill whole log blocks rather than writing into the metaslab spacemaps directly. And second, it's limited to 256,000 blocks, which bounds the maximum import time; there the assumption was that we should import within 10 minutes.
That's exactly the same 10 minutes I mentioned for the SSD pool import; it's practically hard-coded into ZFS, into its default tuning. Maybe it was fine for somebody, but not for us. So I went investigating and found two problems. First, log replay is inherently sequential: all the records have to be processed one after another, sorted, put into B-trees in memory, and only one CPU can do that.
Well, that's one side; the log blocks also have to be read sequentially and processed sequentially, so consider the case of hard disks. In the worst case it may happen that 256K blocks mean 256K transaction groups, and each transaction group's log is a separate object, which means the speculative prefetcher of ZFS can't do anything; these are practically objects of a single block, and there is nothing to prefetch ahead. And if we just divide 256K by a disk's seek rate, it will be 40 minutes by itself, just for the reads. So what have I done?
I made the import prefetch the logs of up to 16 transaction groups at once, so that it always becomes CPU-bound, for hard disks the same as for SSD pools. It reduced import time to like 5 to 10-15 minutes, somewhere there, but then it's just single-core bound, and it's not a problem of the hard disks anymore. Here you may see the CPU profile I got. Originally the code could process about 300 blocks per second, in some benchmarks 400; it probably depends on how fragmented the pool is and on the specific workload.
Either way it means like 10 to 15 minutes of processing time, and most of the time is again memmove and the B-trees, old friends. Just after the B-tree optimization from the previous part, you may see memmove reduced dramatically, and there are no big issues left; there is obviously still the comparison, but in this case the comparison function is trivial, so not much can be done. It improved the block rate almost twofold and reduced import to maybe five minutes. Better, but still not great.
We need to reduce the log size. As I said, the original design was to achieve the best possible space efficiency, but if best possible means five-minute pool import times, it's not acceptable, so I tried to reduce it further, to make it not the best possible but an acceptable, efficient value. The most prominent example of why it is needed: consider the pool of a thousand disks which I mentioned. If we start writing to it sequentially at a slow speed, without rewrites, it will fill one metaslab after another.
Well, maybe several at a time, but it's pretty sequential, and most of the metaslabs during this process are no longer modified after they are already full; they are complete and they are done after that point. It would be good to flush those metaslabs and not touch them anymore, ever, forget about them, but the current code doesn't do that.
So that's what I was trying to do. Instead of scaling the limit on the number of log blocks by the total number of metaslabs in the pool, I'm practically using the number of unflushed metaslabs, which scales with how dirty the pool is, how active it is. If we are starting to write a pool from scratch, from empty, there are no dirty metaslabs, and ZFS will actively start flushing them.
It will keep some minimum, like a thousand unflushed metaslabs, just to not overdo the flushing, but as we keep writing more and more, it will start flushing more and more and will keep up over time; the log will not grow. But there is one more scenario where this still may backfire. If after that we delete a lot of objects randomly from the pool, it creates a lot of holes through all the pool; that is actually why the log spacemap was implemented.
With only this limitation, the pool will see that there are a lot of unflushed metaslabs and will start flushing them slowly, one after another, but it will consider it a normal situation and will not try to shrink the log actively. So I introduced another limitation: each metaslab must be flushed at least once every thousand transaction groups.
It means that after a massive deletion we'll get a lot of metaslabs dirty, and then ZFS will immediately start flushing them quite aggressively. If there is no further massive deletion, after like a few hundred or a thousand transaction groups it will get cleaned out, stable and quiet again. If the pool is really fragmented and a lot of random operations go on and on and on, it will still keep all metaslabs dirty or unflushed and will constantly try to flush them.
But again, each of the metaslabs will only be flushed once per thousand transaction groups, so about 500 on average. It's better than it was before the implementation of the log spacemap, and at least it now has some constraint, so that on pool import we would never have to replay more than a thousand transaction groups of log. It will never grow to 256 thousand, so it will stay compact and try to adapt to the workload.
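The two limits described above might be sketched like this; all constants and names here are illustrative, not the actual OpenZFS tunables. The log block budget scales with the number of unflushed metaslabs, i.e. with how dirty the pool is, and every metaslab is force-flushed at least once per thousand transaction groups regardless.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_UNFLUSHED_MS	4	/* space-efficiency factor */
#define MIN_UNFLUSHED_KEPT	1000	/* keep some unflushed, don't overdo */
#define MAX_UNFLUSHED_TXG_AGE	1000	/* force a flush at least this often */

/* Log block budget scales with unflushed metaslabs, not the total count. */
static uint64_t
log_block_limit(uint64_t unflushed_metaslabs)
{
	uint64_t ms = unflushed_metaslabs;

	if (ms < MIN_UNFLUSHED_KEPT)
		ms = MIN_UNFLUSHED_KEPT;
	return (ms * BLOCKS_PER_UNFLUSHED_MS);
}

/* The second limitation: age out metaslabs that stayed unflushed too long. */
static int
metaslab_must_flush(uint64_t current_txg, uint64_t last_flushed_txg)
{
	return (current_txg - last_flushed_txg >= MAX_UNFLUSHED_TXG_AGE);
}
```

An idle or sequentially-written pool keeps the budget near its floor, so replay on import stays short; only a pool that is genuinely dirty everywhere earns a larger log.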
Just an idea, if somebody wishes to find a project to work on: I was thinking maybe in some cases, when we see that some metaslab was previously flushed but within one transaction group it received a lot of updates for some reason, then instead of writing into the pool-wide log we could write directly to the metaslab spacemap itself, so that we would avoid the double copy; right now it is a double copy.
And the second offender I found during pool import is that import tries to scrub the last three transaction groups written before the crash or export or whatever, and that may mean like a dozen gigabytes of traffic or even more, depending on the settings and the activity of the pool. And it may take time, because we have nothing cached at all; the metadata has to be traversed, sometimes sequentially, and it takes a lot of time.
Looking through the code, I found that errors during that scrub of data do not affect the import process; it's just like a regular scrub, where ZFS tries to recover them, but if it fails, it just fails. So what I've done: I've disabled the scrub of data during import, which significantly reduced the amount of data that needs to be scrubbed. The one exception that still remains significant is the case of dedup.
We have to practically scrub all the dedup table, because within three transaction groups it's quite likely all of it has been updated, or a significant part of it, and it's all in four-kilobyte blocks, a huge number of random read operations. That's not great, so I was thinking, maybe one more small project: maybe we could reduce those three transaction groups to something smaller, because right now the number three comes from the number of transaction groups
for which ZFS keeps the previous data before freeing space. But those two values, I don't believe, are related in any reasonable way; it's pretty arbitrary and makes no sense. So it would be good if somebody has ideas why we need to replay, to scrub, more than one transaction group.
That number doesn't mean anything to me, so I think some investigation could be done there. With all those optimizations, we measured up to a 95 percent reduction of pool import time: from the 45-minute worst case we got to like one minute, plus or minus, depending on the situation, which is incomparably better. I've heard some responses from other people; they were happy.
The last topic is the adaptive speculative prefetcher. For many years our prefetcher has analyzed up to eight streams: it tries to detect up to eight sequential read or write streams, and keeps the detected streams for up to two seconds. The problem appears if we try to mix sequential and random workloads into the same object, most prominently for zvols. In the case of zvols we have, say, an iSCSI target on top; one process reads sequential data, while another process reads data randomly through the pool.
In the end we get no prefetch at all. The problem is that all the random accesses immediately fill all the eight streams, and after that the prefetcher is blocked for the next two seconds; nothing goes on. So what I have done: streams that never saw a second hit, that were just a random read at some point, those can be reused by later accesses immediately.
They are reclaimed in order of aging, the oldest unused first; and streams that had some hits are kept for the two seconds, to still benefit from prefetch and not be wiped out by random accesses, and only then can they be reused, in order of arrival.
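The stream-reclaim policy might look roughly like this sketch; the structure and field names are invented for illustration, not the actual dmu_zfetch code. A stream that never saw a second (sequential) hit is just a random read and may be recycled immediately, oldest first; a stream with hits is protected for the two-second lifetime.

```c
#include <assert.h>
#include <stdint.h>

#define NSTREAMS	8
#define STREAM_TTL	2	/* seconds a hit-bearing stream is kept */

typedef struct stream {
	uint64_t last_access;	/* seconds */
	uint64_t hits;		/* sequential hits after creation */
	int in_use;
} stream_t;

/*
 * Return the index of a stream we may recycle for a new access at time
 * `now`, or -1 if every stream is still protected.
 */
static int
stream_reclaim(stream_t *s, int n, uint64_t now)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (!s[i].in_use)
			return (i);	/* free slot: take it */
		if (s[i].hits == 0) {
			/* Never-hit (random) stream: reclaimable now,
			 * prefer the oldest one. */
			if (best == -1 ||
			    s[i].last_access < s[best].last_access)
				best = i;
		}
	}
	if (best != -1)
		return (best);
	for (int i = 0; i < n; i++) {
		/* Streams with hits only expire after the TTL, oldest first. */
		if (now - s[i].last_access >= STREAM_TTL &&
		    (best == -1 || s[i].last_access < s[best].last_access))
			best = i;
	}
	return (best);
}
```

Under a mixed workload this keeps random reads recycling the same few slots while the genuinely sequential streams survive and keep prefetching.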
That just makes me wonder why we have a limit of eight streams; shouldn't it be increased for large zvols? Maybe it's not so much needed for files, but for zvols it could be; it's a very old limit. Maybe the streams could be allocated as an array instead of a list of elements, to reduce the number of pointer dereferences. Another project for somebody to work on.
It's a pretty small and compact chunk of code, not invasive. And here is a simple benchmark I ran: random I/O in a few streams plus a strided workload, reading two megabytes out of each 100-megabyte chunk, a hundred here, a hundred there. You may see that while the random IOPS haven't changed, the strided throughput improved by five times, just because previously there were absolutely no prefetcher hits, and after this change there are. Maybe the algorithm could be improved further; probably it can be, but it's still much better than it was. The second part I addressed is prefetch depth.
Previously we had a default that limited prefetch to up to eight megabytes: prefetch started from the I/O size and doubled on every successful hit, so after like 16 reads prefetch reached the maximum of 8 megabytes and stopped there. Well, it just continued to keep that eight-megabyte advance region. I found from my tests that for an NVMe pool, no matter how wide it is, we barely ever need a prefetch of more than four megabytes, just because NVMes are so fast and low-latency that a single reader thread is unable to consume more data.
Anyway, it's limited by just the memory operations, so we don't need more than four megabytes. But if we use a hard disk pool, even 64 megabytes is not the limit; the more you increase the prefetch depth, the more it improves. I'm thinking that at some point it should be investigated how sequentially we are actually writing the data. I have a strong suspicion that our space allocator on a system with many cores reorders the data in the pool quite a lot; maybe that's why we benefit so much from prefetch on hard disks.
Maybe it should be improved from the other side, but still, that's where we are, and the prefetch does really help for such pools. What I have done is split the growth into two stages. Up to the first four megabytes (one more new tunable) the prefetch distance grows exponentially, the same as before; at that point it stops and grows further only when it's needed: it grows by one eighth every time a prefetch for a new read didn't complete in time.
So if we have a pool which is faster than our consumer, the prefetch distance only grows to the point that is sufficient to satisfy the bandwidth, to cover the latency at that bandwidth, and then it stops growing. This allows avoiding extra prefetches that are not needed in the case of strided access, where the consumer doesn't need the data and we would drop it. It shows good results, and I was able to increase the maximum from eight megabytes to 64 megabytes without dramatically bad consequences in the amount of extra read data.
One downside of this algorithm is that if we have a too-slow pool, for example some USB stick which can't handle more than one request at a time anyway, it will pretty quickly reach the maximum prefetch distance of 64 megabytes. If we could set more, it would reach more; whatever we set, it will reach, but it makes no sense. So, one more project.
Even for 64 megabytes the device must be really slow to reach that point, but maybe we could limit it by latency, I don't know, 100 milliseconds or something, or maybe we could implement some fancier logic: if none of the prefetch requests were sent to disk immediately but were queued, then it makes no sense to increase the prefetch depth further, or something like that. Ideas are welcome.
Obviously there is space for improvement, projects for somebody to play with. Sorry I've been so fast, but the time was constrained, and I'll be open for any questions during the day, to discuss those topics or any others. Thanks.
No, I'm not sure how it would help. Obviously we have the flag, but right now the prefetcher doesn't use it for anything. We may have a non-rotational USB stick, or we may have non-rotational NVMe storage, or...
Yes, the question is what to use as the threshold. It would be easy if we could analyze the I/O and see what's the average, and if we go beyond the average we automatically understand that we are no longer keeping up with the prefetch. That's why I was thinking about another algorithm, like analyzing the queue: if what we put on the queue is immediately executed, then we are benefiting from prefetch. But there appear to be two cases, when we have different types of vdevs.
On measuring memory bandwidth: there are tools, at least for FreeBSD, from Intel, in ports: PCM, intel-pcm, something like that. I bet there should be the same for other platforms. It's just a set of convenient tools; on FreeBSD they're using the same performance counters as are used for profiling. All modern CPUs collect a bunch of microarchitectural things, and Intel nicely wraps that into tools that collect per-channel bandwidth, per-socket QPI bandwidth, all the NUMA effects, power consumption; they collect a lot of different information.