Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
Cool, so yeah, I'm Prakash and I'm going to talk about my ZIL performance improvements. Here's a brief overview: what I plan to discuss is broken up into roughly three parts. First, I'll give some background and discuss what the ZIL is, how it's used, and how it works. Then I'll get into the problem I set out to fix, how I fixed it, and some details on how I did that. And lastly, I'll show off some graphs and results of my work. So with that out of the way, let's get started.
The ZIL is almost always used: whenever any of these logged operations occur, they're inserted into the ZIL's in-memory list of operations. These operations are often called itxs, or intent log transactions. The itxs tracked by the in-memory ZIL are then written to disk when zil_commit is called. The caveat to all of this being that none of it occurs if the dataset is configured with sync=disabled.
If sync=disabled, itxs aren't tracked in memory, nor are they written to disk. Since I often see the ZIL and SLOG terms used incorrectly, I wanted to briefly address this. SLOG stands for separate log device. The ZIL and SLOG are different in that the ZIL is a mechanism for issuing writes to disk, and the SLOG may be the disk that those writes are issued to. With that said, an SLOG is not necessary.
By default, ZIL writes will go to the main pool's disks, but an SLOG can be used to try and improve the latency of ZIL writes if the main pool's vdevs are deemed too slow. So why exactly does the ZIL exist in the first place? Well, writes in ZFS are write-back. What that means is that data is first modified and stored in memory, in the DMU layer, and then later, at some point, the data is written to disk via spa_sync. The problem is that spa_sync can take tens of seconds or more to write out this data, and it's unacceptable for all sync writes to take tens of seconds to complete. Further, writes in ZFS often cause more writes to occur; for example, a single file write modifying a single block of user data will then cause indirect blocks to also be modified and written. The ZIL allows this write amplification effect to be mitigated.
Essentially, the ZIL exists as a performance optimization: it provides synchronous semantics to applications faster than what could be achieved with spa_sync alone. While correctness could be achieved without the ZIL, performance would be unreasonably bad, which makes it a necessity. Before I jump into the next section, I wanted to quickly go over the on-disk format of the ZIL. Each ZFS dataset maintains its own unique ZIL on disk.
Each of these ZILs is a singly linked list of ZIL blocks or, as they're also called, lwbs, which stands for log write block. As one can see here, the uberblock has a pointer to the MOS, which then has pointers to each dataset in the pool, and each of these datasets has a pointer to its own ZIL header. Each ZIL header then points to an lwb, and that block points to the next block in the list.
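To make that chain concrete, here is a minimal C sketch of the relationship just described; the struct names and fields are hypothetical stand-ins, not the actual on-disk definitions from the ZFS headers:

```c
/*
 * Hypothetical, simplified picture of the on-disk chain described above.
 * The dataset's ZIL header holds a block pointer to the first log write
 * block (lwb), and each lwb in turn points at the next one, forming a
 * singly linked list per dataset.
 */
struct blkptr;                        /* stand-in for a ZFS block pointer */

struct zil_header_sketch {
    struct blkptr *first_lwb;         /* head of this dataset's log chain */
};

struct lwb_sketch {
    /* ... packed log records (itxs) fill most of the block ... */
    struct blkptr *next_lwb;          /* pointer to the next ZIL block    */
};
```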
Now let's get into how the ZIL is used. The ZIL is used by the ZFS POSIX Layer, or ZPL for short. The ZPL interacts with the ZIL in two phases: first it uses zil_itx_assign, which causes the ZIL to log the fact that an operation is occurring, and then it uses zil_commit, which tells the ZIL to write those log records out to disk.
Let's look at zfs_write as an example of this. zfs_write will call zfs_log_write. zfs_log_write will then call zil_itx_create, which creates the itx structure in RAM, and then zil_itx_assign, which inserts the itx into the ZIL's in-memory state. Finally, if this is a sync write, zfs_write will then call zil_commit, which causes the itx to get written to disk.
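As a rough sketch of that flow (hypothetical, simplified C; the real entry points named in the talk are zfs_write, zfs_log_write, zil_itx_create, zil_itx_assign, and zil_commit, and the helpers below are only stand-ins for them):

```c
#include <stddef.h>

struct zilog;                                    /* per-dataset ZIL state */
struct itx;                                      /* one log transaction   */

struct itx *itx_create(int txtype, size_t size);      /* create in RAM    */
void itx_assign(struct zilog *zl, struct itx *itx);   /* track in the ZIL */
void commit(struct zilog *zl, unsigned long object);  /* force to disk    */

void example_write(struct zilog *zl, unsigned long object,
    size_t len, int sync)
{
    /* 1. The DMU's in-memory copy of the file data is updated first.     */

    /* 2. The operation is recorded in the ZIL's in-memory itx list.      */
    struct itx *itx = itx_create(/* TX_WRITE */ 0, len);
    itx_assign(zl, itx);

    /* 3. Only a sync write pays for zil_commit(); an async write simply
     *    becomes durable later, when spa_sync() writes out the txg.      */
    if (sync)
        commit(zl, object);
}
```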
Now let's look at zfs_fsync as another example. fsync doesn't create any new modifications; instead, it simply ensures any previous operations are written to disk before it returns. Thus, zfs_fsync doesn't call zil_itx_create, nor does it call zil_itx_assign. Instead, it only calls zil_commit, and calling zil_commit ensures all previous operations are written to disk before fsync returns.
The parameters of zil_commit are such that the caller passes in enough information to uniquely identify an object whose data is to be committed, and the contract zil_commit maintains with the caller is that all operations relevant to the specified object will be persistent on disk by the time zil_commit returns. By relevant, I mean all operations that would modify that object, and by persistent I mean the operations are written to disk and the disks used for those writes are flushed; further, we must issue the disk flush after the writes complete.
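In other words, the interface is roughly this shape (paraphrased as a sketch; treat the prototype as approximate rather than a copy of the real header):

```c
#include <stdint.h>

typedef struct zilog zilog_t;   /* per-dataset in-memory ZIL state */

/*
 * The caller identifies the object it cares about; by the time the call
 * returns, every logged operation that could modify that object has been
 * written to disk AND the disks those writes went to have been flushed
 * (the flush is issued only after the writes complete).
 */
void zil_commit(zilog_t *zilog, uint64_t oid);
```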
Lastly, the interface of zil_commit doesn't allow the caller to specify which operations it cares about. Thus, zil_commit must write all operations for a given object, even if the caller only cares about a subset of those operations. For example, if there are multiple threads writing to the same file but at different offsets, all offsets must be written to disk before zil_commit returns, even if the calling thread only cares about one of those offsets.
So how does the ZIL accomplish this? Well, as I alluded to previously, the ZIL maintains an in-memory list of itxs that have occurred but have not yet been written to disk. This list is maintained via the itxg structure in the ZIL, and each itxg structure contains the following: a single list of all sync operations that have occurred for all objects in the dataset, plus a per-object list of async operations for each object modified.
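A hypothetical, simplified version of that bookkeeping might look like the following; the real structures live in the ZIL implementation headers and differ in detail:

```c
#include <stddef.h>

/*
 * Simplified view of the per-dataset itx bookkeeping described above:
 * one ordered list of sync itxs covering every object in the dataset,
 * plus a per-object list of async itxs.
 */
struct itx_node {
    struct itx_node *next;
    /* ... the logged operation: object, offset, length, data ...       */
};

struct async_itx_list {
    unsigned long    object;        /* object these async itxs modify   */
    struct itx_node *head;          /* async itxs for that object       */
};

struct itxg_sketch {
    struct itx_node       *sync_list;   /* sync itxs, all objects       */
    struct async_itx_list *async;       /* one entry per modified object*/
    size_t                 async_count;
};
```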
Here's what this might look like. In this example, the itxg sync list has two itxs in it, each of which can map to an operation for any object in the dataset, and then there's a list of async operations that occurred for object A and a list of async operations that occurred for object B. So when zil_commit is called, how do these itxs get written out to disk?
Additionally, it's worth pointing out that the point of the commit list is so that we have a list of itxs to write out that will not be modified by any concurrent ZPL activity. As new ZPL operations occur, the sync list may change (for example, operations may be added to it), but the commit list will remain the same.
Now that we have a list of itxs to be written out, it's time to actually issue them to disk. We do this by iterating over all of the itxs in the commit list. For each itx, we attempt to copy it into the currently open ZIL block; if there is insufficient space in the block, then we allocate a new block and issue the old one to disk. Lastly, after all itxs are copied into lwbs, we issue the last open block to disk, allocating the next open block in the process.
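A sketch of that walk, using hypothetical helper names (these are illustrative stand-ins, not the real OpenZFS functions):

```c
/*
 * Commit-list walk described above.  Each itx is copied into the
 * currently open lwb; when one doesn't fit, a new lwb is allocated and
 * the full one is issued to disk.  The last, possibly partially filled,
 * lwb is issued at the end of the walk.
 */
struct itx_node;
struct lwb;

int  lwb_has_room(struct lwb *lwb, struct itx_node *itx);
struct lwb *lwb_alloc(void);
void lwb_issue(struct lwb *lwb, struct lwb *next);    /* write + link    */
void lwb_copy_itx(struct lwb *lwb, struct itx_node *itx);
struct itx_node *itx_next(struct itx_node *itx);

void commit_list_walk(struct itx_node *commit_list, struct lwb *open_lwb)
{
    struct lwb *cur = open_lwb;

    for (struct itx_node *i = commit_list; i != NULL; i = itx_next(i)) {
        if (!lwb_has_room(cur, i)) {
            struct lwb *next = lwb_alloc();  /* allocate the next block  */
            lwb_issue(cur, next);            /* issue the full one, and  */
            cur = next;                      /*   link it to its successor */
        }
        lwb_copy_itx(cur, i);                /* append this itx's record */
    }
    lwb_issue(cur, lwb_alloc());             /* issue the last open lwb  */
}
```

Note the unconditional issue of the last open block at the end of the walk: that is the pre-change behavior, and the timeout discussed later in the talk changes exactly that step.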
So here's what I mean. Given the commit list from before, and lwb 1 as the currently open ZIL block, we select the first itx in the list, so here we select itx S1, and copy it into the block's buffer. This block will remain open, as denoted by the dotted line, since it may also be used for the next itx in the list.
So now we move on to the next itx, itx S2. This one doesn't quite fit in the currently open ZIL block; as you can see, it doesn't fit in lwb 1 here, so we must allocate a new block and issue the current one to disk. Here, lwb 1 now has a solid line to indicate it's been issued to disk, and lwb 2 has a dashed line to indicate it's the new open ZIL block. Further, lwb 1 maintains a pointer to lwb 2 on disk, and that's the singly linked list relationship that I mentioned earlier.
So we issue lwb 2 to disk, allocating the next open block in the process. Here lwbs 1 and 2 have been issued to disk, as we can see since they have solid lines, and lwb 3 has a dashed line to indicate it's now the new open ZIL block; this block will be used when writing out the next batch of itxs. Now, after we've issued all ZIL blocks to disk, we must wait for them to complete.
So now let's dive into the problem with all of this. The main issues can be summarized in the following points: first, itxs are grouped and written in batches, where the commit list constitutes a batch and the batch size is proportional to the sync workload on the entire system; next, threads waiting for zil_commit to complete are only notified when all ZIL blocks in a given batch complete.
Here's an example of what this ends up looking like. This is a timeline of the disk activity of an example pool. What can be seen here is blocks A through E, that first green batch: blocks A through E are all written in the first batch, but the disk activity is slightly uneven. While disks 2, 3, and 4 only receive a single ZIL block to write, disk 1 receives two blocks, as you can see.
B
Thus
disks
2,
3,
&
4,
complete
their
rights
and
then
remain
idle,
while
disk
1
finishes
its
work.
This
idle
time
is
due
to
the
fact
that
only
a
single
batch
can
be
processed
at
a
time
which
leads
to
inefficient
usage
of
the
storage,
for
example,
blocks
F,
G
and
H,
as
shown
in
the
yellow
batch.
Just
after
that,
first
green
one
blocks,
F,
G
and
H
could
have
been
issued
to
disks
2
3,
&
4
filling
this
idle
time,
but
the
batching
mechanism
prevents
us.
B
C
B
[...] it would also have to wait for block E to be written as well, which is also written by disk 1. This unnecessarily increases the latency of zil_commit, in this case potentially doubling it. Yet the solution is somewhat obvious: let's just remove this concept of batches. Rather than waiting for the current batch to complete, we should issue new ZIL blocks to disk immediately, as soon as they can be written out.
Further, rather than waiting for a batch to complete before notifying threads, these threads should be notified immediately when their data is safe on disk. If we did that, then we could go from this diagram, which I showed earlier, to this one, where all disks in the pool are saturated and threads are notified as soon as each individual block completes. As this diagram illustrates, we'd be able to service the same number of ZIL blocks in nearly half the time, potentially doubling our IOPS, and this is without changing a single thing about the workload or the underlying storage characteristics.
The bulk of the changes revolve around three things: changing how ZIL blocks are issued to disk, changing when the flush commands are sent, and changing how we notify waiting threads. Previously, this was a sequential three-step process. Step one would consist of creating the ZIL blocks, issuing them to disk, and then waiting for the I/O for all of the blocks to complete. Next, after the blocks completed, step two would consist of issuing the flush to each vdev and then waiting for those flushes to complete.
And finally, after all those flushes completed, the ZIL CV would be signaled to notify any waiting threads. All threads that called zil_commit would be waiting on this CV, so this was the mechanism to let them know that their data was safe on disk. All three of these steps together constituted a single batch; after one batch completed, another would start, and then another.
Now the process is entirely different and heavily leverages the ZIO infrastructure. Instead of a single root ZIO for an entire batch of blocks, each block now has its own unique root ZIO. Each root will eventually have two children: a write ZIO containing the itx data to be written, and a flush ZIO that is issued after the write completes.
Since these are each child ZIOs, the root cannot complete until both the write and the flush complete. This is enforced by the pre-existing ZIO parent-child semantics, so I didn't have to change any of that; I just got it for free by using ZIOs. Further, the root ZIO of the previous block will also be a child of the next block's root; for example, here lwb 1 is a child of lwb 2.
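Putting those dependencies together, each lwb's ZIO tree looks roughly like this (a sketch with hypothetical helper names; the real code builds the same shape out of the existing ZIO primitives):

```c
/*
 * Per-lwb ZIO tree described above:
 *
 *                  lwb N root zio
 *                 /       |       \
 *        write zio    flush zio    lwb N-1 root zio
 *       (lwb buffer)  (child, issued  (child, so lwb N cannot complete
 *                      after the write) before every earlier lwb has)
 */
struct zio;
struct lwb { void *buf; unsigned long size; struct zio *root_zio; };

struct zio *make_root_zio(struct lwb *lwb);
struct zio *make_write_zio(void *buf, unsigned long size);
void add_child(struct zio *parent, struct zio *child);
void issue_flush(struct zio *root);
void on_write_done(struct zio *write, void (*cb)(struct zio *), struct zio *arg);

void lwb_build_zio_tree(struct lwb *lwb, struct lwb *prev_lwb)
{
    struct zio *root  = make_root_zio(lwb);
    struct zio *write = make_write_zio(lwb->buf, lwb->size);

    add_child(root, write);                  /* root waits on the write  */
    on_write_done(write, issue_flush, root); /* then the flush, also a
                                                child of the same root   */
    if (prev_lwb != NULL)
        add_child(root, prev_lwb->root_zio); /* ordering across lwbs     */

    lwb->root_zio = root;
}
```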
We now have a CV per thread. Then, when the root ZIO for any ZIL block completes, each CV in that block's list is signaled, notifying the waiting threads that their data is safe on disk. Let's walk through an example of what this looks like. First, the lwb structure and the root ZIO are created. Initially the list of CVs will be empty, but as itxs are copied into the block's buffer, it will begin to accumulate a list of threads waiting on it.
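A minimal userland sketch of that waiter mechanism, assuming pthreads in place of the kernel's condition variables (names here are hypothetical, and the real code wraps each waiter in its own small structure):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry per thread blocked in zil_commit(), hung off "its" lwb. */
struct commit_waiter {
    pthread_mutex_t       lock;
    pthread_cond_t        cv;
    bool                  done;     /* set once data + flush are on disk */
    struct commit_waiter *next;
};

/* Called from the lwb's root-zio "done" callback: wake every waiter. */
static void lwb_signal_waiters(struct commit_waiter *list)
{
    for (struct commit_waiter *w = list; w != NULL; w = w->next) {
        pthread_mutex_lock(&w->lock);
        w->done = true;
        pthread_cond_signal(&w->cv);
        pthread_mutex_unlock(&w->lock);
    }
}
```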
This same process will occur for the next block, and the next. At this point, though, it's important to note that the sequence of events doesn't have to occur in this specific order. For example, it's possible for block two to issue its write before the write for block one completes. In this case, the write for block one would have been issued, and then the write for block two, even though block one's write was still being serviced by the disk. Previously, this would have been prevented by the batching mechanism.
The same goes for the write for block three: we can also issue block three to disk before two and one complete. It's even possible for the writes of blocks two and three to complete before the write of block one. If this happens, though, the roots of blocks two and three will still be blocked, waiting for block one to complete. Thus, even if the flushes were issued and completed for blocks two and three, their CVs would not get signaled.
Yet as soon as the write for block one completes, its flush would be issued, and once block one's flush completes, then all three blocks would complete simultaneously and the CVs for all of these blocks would be notified. So while before this process was very sequential, now it's completely driven by the incoming sync workload and the disk completion events.
Before jumping into the performance results, I wanted to quickly talk about how we determine when to issue a ZIL block to disk. If you remember from earlier in the talk, we build up these blocks by iterating over the commit list and copying each itx into one of the lwb buffers. Previously, once we reached the end of the commit list, in essence the end of a batch, we would issue the last lwb to disk.
Now that we're batchless, we don't necessarily want to do that. If we reach the end of the commit list but there's still buffer space available in the current block (for example, we could have a 128K block with only a single 8K write in it), then we actually want to delay issuing that block, just in case new itxs are generated that would still fit in that ZIL block. This way, we can write out more itxs using fewer I/Os.
The problem is, we have no way to predict the future here; we don't know for sure whether more itxs will be generated. Thus, if we wait for future itxs but none are generated, we're adding additional latency to the current lwb for no benefit. But if we don't wait at all and additional itxs are generated, we could end up using more I/Os than we need to, and potentially degrade performance by saturating the disks.
If it's filled within that 250 microseconds, it'll be issued to disk immediately. So, like here, lwb 2 might still have some space in it, so it'll probably wait here; it'll wait for the timeout. If it's not filled, it'll time out after 250 microseconds and be issued to disk partially filled.
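The decision being described is roughly the following (a sketch with hypothetical helpers; the 250 microsecond figure is the timeout quoted in the talk, and the shipped code may compute its timeout differently):

```c
/*
 * Issue-or-wait decision described above.  A full lwb is issued
 * immediately; a partially filled one waits up to the timeout for more
 * itxs and is issued partially filled if the timer fires first.
 */
#define LWB_FILL_TIMEOUT_US 250       /* the value quoted in the talk */

struct lwb;
int  lwb_is_full(struct lwb *lwb);
void lwb_issue_now(struct lwb *lwb);
/* Returns nonzero if the block filled (and was issued) before the timeout. */
int  lwb_wait_for_fill(struct lwb *lwb, unsigned timeout_us);

void lwb_maybe_issue(struct lwb *lwb)
{
    if (lwb_is_full(lwb)) {
        lwb_issue_now(lwb);           /* filled: no reason to wait        */
        return;
    }
    if (!lwb_wait_for_fill(lwb, LWB_FILL_TIMEOUT_US))
        lwb_issue_now(lwb);           /* timed out: issue partially full  */
}
```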
Finally, let's go over the results of the performance tests that were used to verify this. I used two different fio workloads. For the first workload, each fio thread was submitting sync writes as fast as it could, and I measured the total number of IOPS achieved across all fio threads, varying the number of fio threads from 2 to 1024. This graph shows the percentage difference in IOPS between illumos with my changes and illumos without my changes.
The dashed line at the bottom is just a visual aid to highlight where a 0% difference would be, so anything above that line is an improvement, and on average I measured an 83 percent increase in IOPS with my changes. The dotted line in the middle is another visual aid, simply to show where exactly 83% improvement is in relation to the actual measurements taken. Additionally, the zpool used for this graph consisted of traditional spinning drives.
I also ran the same workload on a zpool consisting of eight SSDs. When running on SSDs the improvement isn't as dramatic, but I was still able to measure a 48 percent improvement on average. Again, the same visual aids are here: the dashed line at the bottom, where anything above it is an improvement, and the dotted line, kind of in the middle-ish, is the average, showing the forty-eight percent.
The second workload tested was again using fio, but this time each fio thread would attempt to issue a maximum of 64 sync writes per second. Thus, the number of IOPS was constant with and without my changes, but the latency of each sync write was still improved. Since now we're measuring latency rather than IOPS, any value below the dashed line is an improvement, the dashed line still showing a 0% change. When running this test on my 8-spinning-disk pool, I measured the latency of each sync write to decrease by an average of 27%.
Also worth noting, the IOPS began to diverge at thread counts greater than 64, where my new code started doing more IOPS than the old code, so I removed those data points to keep the comparison fair; there was still improvement, but I didn't think it was fair to show that. And lastly, I ran that same workload, with a constant number of fio sync writes per second, on my 8-SSD system.
I ran testing, but I didn't... oh, the question is: did I run any performance testing with a mixed workload of sync and async writes, or any other mixture of stuff? I ran testing, but I didn't measure the performance difference of that, simply because I wasn't really sure how to get a baseline and how to compare; I really wasn't sure what was important to test, because the test matrix kind of expands exponentially with all of these different variables. So I ran a test for like 24 hours and just let it do whatever the hell.
I just touched how it's issued, trying to remove the batching and the wake-ups and that sort of thing, so I would expect the SLOG case to also have improvements. But, as I showed with my SSD testing, the improvement isn't as great there, and there are still some more things we could improve on to make it better for really low-latency drives. That was kind of the issue I saw with the SSDs: for really low-latency drives the current algorithm is good enough.
I think it will improve with the number of drives, based on the theory, and I did run some tests on Linux with a bunch of drives, because Brian Behlendorf gave me a box to test with, and I saw the same improvements with lots of drives, the same percentage improvement. The graphs were mostly the same, so yeah.
So every time we reached the end of the commit list, it would issue that lwb to disk immediately, and what I saw was that, because of the changes I've made to how the commit list is built up and how there's no batching, the commit list usually had like one or two itxs, or a very small number of itxs, each time it was traversed. So we would get two itxs into an lwb and then we would issue it to disk immediately.
B
So
what
happened
is
I
was
saying
instead
of
using
large
128k
LW
B's
using
the
maximum
size.
I
would
see
a
lot
of
like
16
klw
B's
filled
with
much
smaller
IT
x's.
So
it's
hard
for
me
to
say
how
much
it
helped,
because
I
didn't
try-
and
you
know
do
before
and
after
with
and
without
that
delay
with
my
final
code.
But
early
on
performance
was
was
terrible
until
I
added
that
delay.
B
A
B
You're remembering the correct thing, but I don't know exactly how to present what we found out. Even with the delay, if you don't have enough threads to fill an lwb block... if you have a 128K block, and in my testing I was doing 8K writes, then fifteen 8K writes will fit in 128K. So when I was running with 16 threads, 15 of those 8K writes would wind up in one ZIL block, the 128K one, and then there's one straggler thread.
That straggler, because of the metadata accounting in the block, would wind up in the next one. So instead of having all 16 fit in a single block, we would have to use two blocks. So there was kind of a performance cliff or spike there, and that could be an artifact of my testing, or whatever you want to call it.
So the question is: did I do any performance testing about the impact of flushes, did I try to test with flushing disabled, and could we batch the flushes instead of issuing a flush for each lwb? Is that a good summary? I didn't do any rigorous performance testing with flushing disabled; I just used the default config. I think it would be useful to do that. The reason I didn't is because I think performance can only get better if we turn it off, so I wanted to see what the default is like.
Is it good enough? Do I need to do something more complicated? And since I saw a bunch of improvement with the defaults, flushing for each lwb, I kept it as-is. If we did the testing and found out that this is doing too many flushes and degrading performance, there's no reason we couldn't change it to batch them up, but then that kind of gets us back into the batching mechanism, so we just need to be a little bit careful about how we implement that.
So the question is: what was the motivation for the original batching mechanism, and was it maybe because it helped order the itxs and things? I definitely can't comment on the original motivation, because that was way before my time; maybe Mark or Matt or George or somebody else can comment on the motivation there. It's definitely a little simpler.
Well, I mean, I think this mechanism is pretty simple too, just because it falls out of the ZIO stuff really nicely. As for the ordering: because it's batched, you don't have to worry about the order of wake-ups there; everything just wakes up at the same time, so that is solved automatically.
But, I mean, from my perspective it sort of felt like that was geared more towards throughput, maximizing throughput, because you issue a bunch of itxs, assuming the batch size is really big, and then you wait for them all to complete. So I don't know. I know Oracle's done some work on this, and I think their implementation is completely different, but you guys tried to address this... do you want to...?
So take a single-threaded case. Oh, the question is: with a small number of synchronous writers and an SLOG, can the delay negatively impact performance? Is that a good summary? So I think it can; for a single-threaded case you're basically always going to hit the delay. The way I rationalize that is:
hopefully single-threaded cases aren't the norm. And with a small number of threads writing, with the previous code... like, I saw improvements with two threads. I didn't see improvement with one thread, because I would always hit the delay, but with two threads the old code also wasn't good, because it would almost always use at least two blocks: the first thread would come in and do a zil_commit, it would consume one ZIL block, and once that writer finished it would issue the ZIL block.
Yeah, so the concern is more about small numbers of sync writers; sync writers are important, and we don't want to delay them any more than we need to. We could set it to zero, or we could make the delay optional. We did make it a percentage, so, like, if it's still 5% you could end up adding that latency for no good reason, but yeah.
Yeah, right, the question, or I guess the concern, is that maybe these changes increase the space consumed by ZIL writes. I didn't change any of that, so in theory it should be the same, if not improved; but let's take some measurements, and if you see differently, then let's fix it.
That was another consideration about the 5% timeout: without it, in my testing, I was always seeing these lwb blocks not utilized very well, because new itxs would come in and have to get a new ZIL block. So I saw the space efficiency... the performance issue is kind of a variation on the space utilization: we want to make sure, as best as possible, that we issue these lwbs as soon as possible, and that we also utilize them, so we don't waste space on these small, you know, expensive devices. So, yeah.