OpenZFS 2020 OpenZFS Developer Summit, 12 Oct 2020

From YouTube: Send/Receive Performance Enhancements by Matt Ahrens

Description

From the 2020 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1HuKHawQbuetqpbwp4wmfm6Ozj-WYJpPa6QAxgDxLsgk/edit?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020

A

Awesome, so thanks for coming back. I'm Matt Ahrens, I work at Delphix, and I'm going to be talking to you about the performance of ZFS send and receive and some things I've done to improve it.

A

So if you have been a faithful OpenZFS Developer Summit attendee, you might remember this talk from five years ago by my colleagues Paul and Dan about how send and receive performance is awesome. They were talking in particular about how incremental sends are way better than the competition, because we're able to find all the blocks that we need to send very quickly, without having to look at all the metadata or all the files. So that's cool.

A

And whether it's an incremental or a full send, we can get really good throughput: more than 10 gigabits per second for both send and receive. So why am I even here? Well, it turns out that's only true if you're using large blocks. If you're using smaller block sizes, then you get maybe a third of that performance, so you're nowhere near saturating your network. So why do we care about this?

A

I work at Delphix, and in our product we use ZFS to store databases. Databases typically store their data in records of 8K, so to align with that we set the ZFS recordsize property to 8K. We get really good compression ratios with LZ4, about 2x to 3x, so the compressed block sizes are 3 or 4 kilobytes. You can use our product to do replication between Delphix engines to move your database data around, and you might have a 10 gigabit or faster local network, or you might have a cloud direct connect.

A

We've seen customers have those at up to about five gigabits per second, and we would like to be able to saturate that, but we can't. The other wrinkle here is that a lot of the time the majority of the data is in just one file system, so we can't take advantage of running multiple sends in parallel.

A

Why do you care about this? You might have the same kind of setup, or you might be using zvols, which default to a volblocksize of 8K, or disk images where you want smaller recordsize settings as well.

A

So my goal for this project was: we want to be able to do a zfs send, pipe that over the network, pipe that into a zfs receive on the other system, and ideally saturate either the network or the disk. Some piece of hardware should be going full tilt, and we shouldn't be waiting for the software and the CPU. But unfortunately, waiting on the CPU is exactly what happens today.

A

So the goal that we set at the beginning of the project was to be able to saturate one of these five gigabit per second cloud direct connect networks, and we wanted to be able to do it with an average compressed block size of 4 kilobytes.

A

That means we need to be processing about 160,000 blocks every second, which gives us about six microseconds to process each block. That is not very long.
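
The arithmetic behind that budget can be checked with a short sketch. The link speed and block size come from the talk; treating 4 KB as 4,000 bytes is an assumption made here for round numbers, which is why the result lands near, not exactly on, the quoted 160,000 blocks and six microseconds.

```python
# Back-of-the-envelope budget: how long can we spend on each block?
LINK_BITS_PER_SEC = 5e9   # 5 gigabit-per-second link (from the talk)
BLOCK_BYTES = 4 * 1000    # ~4 KB average compressed block (decimal KB is an assumption)


def per_block_budget(link_bits_per_sec: float, block_bytes: float) -> tuple[float, float]:
    """Return (blocks per second, microseconds available per block)."""
    bytes_per_sec = link_bits_per_sec / 8
    blocks_per_sec = bytes_per_sec / block_bytes
    return blocks_per_sec, 1e6 / blocks_per_sec


blocks_per_sec, usec_per_block = per_block_budget(LINK_BITS_PER_SEC, BLOCK_BYTES)
print(f"{blocks_per_sec:,.0f} blocks/s, {usec_per_block:.1f} us per block")
```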

A

So this is just showing the same thing that I said, but in more detail. The larger block sizes, like 32K and up, get really good performance, but as you get down to the smaller ones, it's not that great. The goal of the project is not necessarily to improve the performance of send and receive at these large block sizes, but to make the graph a little bit smoother and bring up the performance specifically at the 4K compressed block size.

A

So I'm going to be talking a bit about how we did this, and then I'll tell you about a few of the improvements that I made. I had to try a lot of different things, so there was a lot of iteration. When I'm iterating, basically what I'm doing is looking at a flame graph, seeing what's using CPU, and trying to find out: is this actually important?

A

There seems to be a cost in this particular code path. If I could get rid of that cost, how much would that improve send and receive? That's a different question than actually fixing it. To give an example, take what happens every time you instantiate a dbuf.

A

We need to allocate some memory and we need to do some space accounting to say, okay, now there's this much memory that ZFS is consuming for its internal data structures, and there are probably locks associated with that. So maybe I see that the locking is taking a long time. Rather than breaking up the lock or doing something sophisticated, I would just delete or comment out that code and see if we get better performance. If we do, then I can go back and implement a real solution.

A

So I made lots of hacky changes that work for this narrow use case of send and receive, but wouldn't necessarily work for ZFS in general.

A

Then I could come back towards the end of the project and evaluate which of these would give us the most impact on performance for the least amount of effort. The test setup that I used was not the full-blown product with databases and networks in the cloud. I wanted to look at the performance of send and receive separately. Obviously they go together like peanut butter and jelly.

A

You almost can't have one without the other, but under the hood they're doing almost totally different things. Send is getting data from disk; receive is taking the data in and then writing it out. The code paths are very different, so things that improve one don't necessarily improve the other. But I did want to capture the costs associated with the pipe.

A

So I didn't just use zfs send > /dev/null, which is even faster. I wanted to capture the performance of sending out to the pipe, because due to some previous investigations we suspected that there might be problems there, so I piped into dd. Quick side note: there are not a lot of standard Unix utilities that can produce or consume a pipe at more than a gigabyte per second. I tried using pv, the pipe viewer, initially, and that will instantly become the bottleneck of any pipe.

A

So don't do that. For the test setup, I didn't use compression and I didn't use recordsize=8K. Instead, I set the recordsize to match the compressed block size that we typically see, around 4K, and I just have 100 gigabytes of data. When you're doing a zfs send, basically what it's doing is issuing a whole bunch of reads to disk, and then as they complete we generate the send stream.

A

When we have a lot of I/Os, a lot of reads, pending in the ZFS vdev queue layer, ZFS is able to do aggregation. When we're going to issue an I/O, if we see that there are actually several adjacent I/Os that are all also reads, then ZFS will issue one big read to the disk and then copy the data out to wherever it needs to go.
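
As a rough illustration of the idea (not the ZFS vdev queue code), aggregation amounts to merging pending reads whose byte ranges touch. The function name, the `(offset, size)` tuples, and the `max_size` cap below are all hypothetical, chosen only for this sketch.

```python
def aggregate_reads(reads, max_size=128 * 1024):
    """Merge adjacent (offset, size) reads into larger ones, up to max_size bytes.

    Toy model only: a real I/O scheduler also considers gaps, priorities and timing.
    """
    merged = []
    for offset, size in sorted(reads):
        if merged:
            last_off, last_size = merged[-1]
            # Merge only if this read starts exactly where the previous one ends.
            if last_off + last_size == offset and last_size + size <= max_size:
                merged[-1] = (last_off, last_size + size)
                continue
        merged.append((offset, size))
    return merged


pending = [(0, 4096), (4096, 4096), (8192, 4096), (65536, 4096)]
print(aggregate_reads(pending))
```

Three adjacent 4K reads collapse into one 12K read, while the distant one is issued on its own. Whether any reads are adjacent at the moment of issue depends on what happens to be queued, which is the timing sensitivity described next.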

A

The problem with this is that it can lead to really inconsistent performance, because whether you get that I/O aggregation or not depends on the timing of everything that's happening. You might get into a mode where the I/Os are flying out really fast and we're not getting any aggregation, or you might get into a mode where things are getting queued. It also depends on the layout of the data on disk.

A

So I disabled read I/O aggregation for my test so that we get a consistent per-I/O cost: every 4K read is actually sent down through the block device driver to the actual hardware. This also simulates what you would get in kind of a worst-case scenario, where the files are spread totally evenly across your disk and super fragmented.

A

For the test environment I used Ubuntu. The one thing that I needed to do is that, on the later Ubuntu releases, they've changed the default for init_on_alloc.

A

This is a new feature in the Linux kernel 5.0, and in Ubuntu they said that, for security reasons, whenever you allocate a page we're going to zero it out. That leads to pretty poor performance for ZFS when you're at high throughputs, so we've actually disabled this in our product, and you'd want to disable it for high-throughput workloads. All right, so let's get started with our investigation. Let's start with zfs send, starting with the flame graph.

A

So this is a flame graph. You have a bunch of these bricks, and each stack of bricks is telling us a stack trace. So, for example, here zio_execute called zio_vdev_io_done, which called vdev_queue_io-whatever. The width of each brick is how long that function, or its callees, was on CPU.

A

So we know that zio_execute was on CPU for longer than zfs_do_send, for example. But this is kind of a lot to look at. Usually you can view this interactively and zoom in on different parts. I'll summarize this for you: about half of our CPU is idle.

A

So that's interesting. I don't know, maybe we don't have a CPU problem? Why do we have all this idle CPU?

A

Most of the CPU that is being used is by these zio threads, and there's a whole bunch of these zio threads, at least as many as there are CPUs. So, at least in theory, if these threads need to do more work, they should be able to consume the idle CPU, and so these threads shouldn't be the bottleneck. Then let's look over here: we have the main zfs thread. This is the thread that's running from userland; it's outputting data to your pipe, and it's on CPU 73 percent of the time.

A

We also have this other thread, the send prefetch thread, which is also on CPU about three quarters of the time.

A

You'll notice that I said it's on CPU for three quarters of the time, but it's not consuming three quarters of all the CPU.

A

For each of these threads, where there's just one thread, it can be on at most one CPU. So the way that we read this is, for example, we can look at this function here, arc_hdr_alloc. It says it's on CPU 1.7 percent of the time, so we have to multiply that by eight, our number of CPUs, which tells us that this function is on the one CPU that it could be on for about 13 percent of the time.
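
That conversion is a one-line calculation. The helper name below is made up for this sketch; the 1.7 percent and eight CPUs are the figures quoted in the talk.

```python
def single_thread_share(pct_of_all_cpus: float, ncpus: int) -> float:
    """Convert a flame-graph percentage (of all CPUs) into a share of one CPU.

    A function that runs only in a single thread can use at most one CPU,
    so its system-wide percentage is scaled up by the CPU count.
    """
    return pct_of_all_cpus * ncpus


share = single_thread_share(1.7, 8)
print(f"{share:.1f}% of the one CPU this thread can occupy")
```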

A

So that's interesting. Let's try to figure out what this prefetch thread is doing and how it fits into the flow of data with zfs send.

A

So this is a diagram showing each of the threads involved with zfs send, and what data is being passed from one thread to the next.

A

This traverse thread is walking the tree of blocks and figuring out which blocks we need to send. For a full send, that's just all of the blocks that are part of the file system. It's outputting the block pointers that we need to send. It's just the block pointers; we don't actually have the data yet, but it's telling us which blocks we're going to need to retrieve from disk in order to send them.

A

Then we have the prefetch thread, which is what we were looking at previously. What it's doing is issuing an ARC read, so it's telling the ARC: hey, go off and read this data.

A

It's not actually waiting for that to complete; we're just kicking it off. And we aren't actually getting the resulting data; we're just prefetching it so that it'll be in the ARC for a later, real read. Then we have the zio threads, drawn as many little squiggly threads here, which are going to actually issue the reads to the device driver and then notify anybody who's waiting.

A

The prefetch thread passes this exact same list of block pointers that we need to send on to the main thread. The main thread is now going to issue a real read to the ARC, where hopefully the block is already cached and we're just getting the cached data, and then it's going to output the data to the pipe, which is going to be consumed by dd.

A

So there's already a couple of things that we can see might be issues here. One is that we're doing an ARC read here and then we're doing another ARC read here. Maybe we could just do one ARC read and have this thread pass on the actual data.

A

That would probably save us some CPU in the main thread, though it probably wouldn't help with the prefetch thread. Oh, I should also mention: we're doing all these ARC reads and getting all the data into the ARC cache, and eventually that data is going to have to be evicted. So there's another thread, the ARC evict thread, which we also saw on the flame graph over here, which is going to have to remove that data from the ARC to make room for the new blocks that we're reading in. All right, so let's pause for just a sec.

A

What's so special about this workload? We already mentioned that we have small compressed block sizes. The second big thing is that this is effectively single threaded. I know that there are a lot of threads on here, but every block needs to be processed by every one of these single threads, so any one of them can become a bottleneck.

A

If the prefetch thread is taking a long time to do its thing before it can pass the block pointer on to the next thread, then it can essentially become the single thread that is our bottleneck.
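
One way to see why: in a pipeline where each stage is a single thread and every block passes through every stage, sustained throughput is set by the slowest stage alone. The per-block costs below are hypothetical numbers for illustration, not measurements from the talk.

```python
def pipeline_throughput(stage_costs_usec: dict) -> float:
    """Blocks per second for a pipeline of single-threaded stages.

    Each stage handles every block, so the slowest stage sets the pace no matter
    how many CPUs are idle.
    """
    bottleneck = max(stage_costs_usec.values())
    return 1e6 / bottleneck


# Hypothetical per-block CPU costs in microseconds (illustrative only).
stages = {"traverse": 2.0, "prefetch": 9.0, "main": 8.0}
print(f"{pipeline_throughput(stages):,.0f} blocks/s, limited by "
      f"{max(stages, key=stages.get)}")
```

With these made-up costs, speeding up any stage other than the slowest one would leave throughput unchanged, which is why the investigation keeps asking which single thread is busiest.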

A

So, as I said, there are several threads, but each one is just doing one job and each one has to process all of the blocks. We're doing a lot of work with the ARC and the DMU around dbufs.

A

Those subsystems are pretty scalable, but they're relatively heavyweight; there's relatively high latency for each I/O. You can get really good throughput from ZFS if you have a lot of threads hitting it at the same time, but it's not great for single-threaded throughput. This is true not just for send and receive: if you're just doing a single dd to read or to write, you're going to see the same kind of curve relative to the recordsize property.

A

So we're going to see two kinds of solutions throughout this talk. One is "just don't do that": bypassing the ARC and the DMU altogether. The other is batching operations together.

A

Even though we need to issue an I/O for every block, we can recruit a lot of threads to help us with processing those I/Os, so we get good scalability there. But there are other things where we're processing one block at a time that we can improve on by batching those operations together.

A

So this is what we have now; how can we fix it? The change that I made was to turn this prefetch thread into what I now call the reader thread. What it's going to do is issue a zio read, so we're not going to add anything to the ARC; we're just going to go directly to the zio layer, initiate the read of that block, and pass that zio on to the next thread.

A

We are going to check if the block is in the ARC first. It's probably not cached, but just in case you have a workload where it is, we'll check. That's a lot faster than adding something to the ARC, because we don't have to allocate any new data structures, link them into our hash tables, et cetera.

A

Then the main thread doesn't need to issue any I/Os and doesn't need to talk to the ARC at all. It just needs to wait for that zio to complete, and then it can produce the stream data.

A

So this is a great idea; let's see how it turned out in practice. Usually what I would do is look at the flame graph and see: did I reduce the CPU time that I was targeting? In this case we did. Allocating the ABD takes much less time, three percent of the CPU time, than it took to allocate the ARC header, which was 13 percent.

A

Then, checking the zfs send throughput, we did get a really good improvement here: 36 percent better at our 4K target block size. Cool. So let's look at another thing; 36 percent didn't get us quite as far as we want to go, so let's see what else we can do. I wanted to talk a little bit about how I used off-CPU flame graphs.

A

If we take a look at the send prefetch thread that we were seeing before, which we saw was using a bunch of CPU: what was it doing when it was not using CPU? With an off-CPU flame graph we see the same kind of stack of function calls, but this time the width is telling us how long this thread was blocked, not on CPU, waiting in that stack trace. In this case we're waiting on a condition variable in bqueue_enqueue.

A

The thread is trying to produce its output, trying to put it on the output queue, and it isn't able to because the queue is full. It's waiting for the next thread, the main thread, to consume the output.

A

The way that we figure out how important this is: we ignore this percentage here and instead look at the number of samples, which in this case is the number of microseconds. This is telling us that for two and a half seconds we were waiting in this stack. We divide that by our sample time: I was running this data gathering for 10 seconds, so it tells me that we were off CPU for 25 percent of the time.
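
The same calculation as a small sketch; the function name is hypothetical, and the inputs are the 2.5 seconds of blocked time and the 10-second trace quoted above.

```python
def off_cpu_fraction(blocked_usec: float, trace_seconds: float) -> float:
    """Fraction of the trace one thread spent blocked in a given stack."""
    return blocked_usec / (trace_seconds * 1e6)


fraction = off_cpu_fraction(2_500_000, 10)
print(f"{fraction:.0%} of the trace spent blocked in this stack")
```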

A

We can then look over at the main send thread. It's on CPU a bunch of the time, but it's also off CPU. What was it doing when it was off CPU? It spent a bunch of time, 14 percent of its time, waiting for locks and CVs associated with the pipe. We're trying to output to the pipe and we're having to wait, either on the mutex or because the pipe is full.

A

Taking a look at the code, I found that we're calling pipe_write two times for every block, which means that we need to do this every three microseconds to get the performance that we're targeting. That is pretty often to be scheduling and waking up other threads.

A

So how do we address this? This is the same diagram that we saw before with our reader thread. The main thread is passing the stream data, a block at a time, over to dd.

A

What I did was introduce a new thread. The main thread is still sending a block at a time, but now I have a new thread, this writer thread, which is consuming the data.

A

What the writer thread is going to do is batch up the stream into one-meg chunks and write those chunks, one meg at a time, to the output stream, to the pipe. The end result is that we have one 500th of the number of calls to the pipe, and potentially one 500th of the calls to the mutex and of having to go to sleep and wake up the dd thread.
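
A minimal user-space sketch of that batching idea follows. It is not the kernel implementation: `BatchingWriter`, the 312-byte header and the in-memory output are all assumptions made for illustration. It shows where the "one 500th" comes from: 256 blocks of 4 KB, at two writes each, would be 512 calls without batching.

```python
import io


class BatchingWriter:
    """Accumulate small records and write them to `out` in large chunks."""

    def __init__(self, out, chunk_size=1 << 20):
        self.out = out
        self.chunk_size = chunk_size
        self.buf = bytearray()
        self.write_calls = 0  # number of calls made on the underlying output

    def write(self, record: bytes) -> None:
        self.buf += record
        while len(self.buf) >= self.chunk_size:
            self._emit(self.chunk_size)

    def flush(self) -> None:
        if self.buf:
            self._emit(len(self.buf))

    def _emit(self, n: int) -> None:
        self.out.write(bytes(self.buf[:n]))
        del self.buf[:n]
        self.write_calls += 1


out = io.BytesIO()
writer = BatchingWriter(out)
header, payload = b"H" * 312, b"D" * 4096  # hypothetical record header + 4 KB payload
unbatched_calls = 0
for _ in range(256):
    writer.write(header)   # without batching, each of these two writes
    writer.write(payload)  # would be its own call into the pipe
    unbatched_calls += 2
writer.flush()
print(unbatched_calls, "calls unbatched vs", writer.write_calls, "batched")
```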

A

The other cool thing about introducing a new thread is that the main thread can continue working while the writer waits on the consumer. In the real world we aren't just doing dd to /dev/null; we're sending out over the network.

A

The network might have packet loss, or it might have intermittent or bursty throughput, and this allows us to keep the main thread busy while the writer is waiting for that packet loss to be recovered from, or waiting for the network burstiness to go back to full bandwidth.

A

Cool. I did a few more little micro-improvements, like taking locks less often, and got a little bit more from that. Overall I was able to get an 87 percent performance improvement, bringing us up to 840 megabytes per second, which beats our goal of 5 gigabits per second. But all of that is for naught if we're still piping it over to a slow zfs receive, because we can only produce data as fast as it can be consumed.

A

So all we've done so far is reduce the CPU usage on our sending system; we can't actually get the job done any faster. So let's take a look at zfs receive. Again, let's start with the flame graph, summarized here. It looks kind of similar: a bunch of idle time, a bunch of time in the zio threads, and then we have this receive writer thread, which is two thirds on CPU (that's just a single thread), and then the main thread, which is more than half on CPU.

A

So let's drill down on that receive writer thread. Here I'm displaying the on-CPU and off-CPU flame graphs next to each other, with their widths proportional to the amount of time that we were spending on CPU and off CPU. In both the on- and off-CPU cases, we see two big categories. One is dealing with dbufs, which are kind of the heart of the DMU.

A

These are in-memory data structures that represent each block. The other big chunk is dealing with transactions: whenever we're doing a write or making a change, we need to have a DMU transaction that manages that write. Off CPU, we're basically waiting for locks associated with those two data structures. All right, so what are these dbufs?

A

Why are we creating them? What are we doing with them? Let's take a look. This diagram shows the in-memory data structures that are involved in zfs receive, which is very similar to what you see with any write in ZFS.

A

What we really care about is our data. Databases don't make very good pictures, so in this case our data is a JPEG of some pre-pandemic, well-dressed co-workers. We have the ARC buf, which keeps track of that data, and the ARC buf is associated with an ARC header. The main thread is going to get the data from the stream and put it into this ARC buf.

A

Now we need to record that this data belongs at this offset of this file. The dbuf tracks what is at a given offset of a file, so we create this dbuf. Everything in ZFS on disk is a tree of blocks: the object points to blocks that are called indirect blocks, each of which has a bunch of block pointers that point to maybe more indirect blocks, which eventually point to the actual data.

A

So whenever we have the leaf dbuf, the one that actually contains the user data, instantiated in memory, we also have all the indirect dbufs above it and the dnode, which is what tracks this file.

A

Now we mark the dbuf as dirty. When we do that, we create this thing called a dirty record that keeps track of the dirty data, and we do that so that the sync thread can find all the dirty data that it needs to write out.

A

The whole point of all this green and blue stuff is so that the sync thread, which processes syncing out TXGs, can create an I/O that points to our data and write it out. Then, after we write it out, we're going to get a block pointer and we need to stick it into our indirect block, and eventually the write is going to complete.

A

Then we'll get rid of the dirty records, link the ARC header into the ARC cache hash table, and eventually we're going to evict that stuff to make room for more dirty data. I know that's a lot of stuff. If you take anything away from that, you should take away that there are a lot of things here, a lot of things pointing to other things, and each one of these is relatively heavyweight.

A

The dnode, for example, doesn't just have one pointer to one leaf dbuf. It has a whole data structure, an AVL tree, that points to all of them. So every time we're instantiating a new leaf dbuf, we have to associate it with each of these, which means getting a lock associated with it and adding to a data structure, which is probably O(log n). So these are relatively heavyweight, and that's not even the worst of it.

A

Every one indirect dbuf can point to a thousand data blocks, of which I've shown just a few on the slide. So you really have pointers going everywhere, and lots of space accounting to know how much memory we're using.
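
The "thousand" figure is consistent with a 128-byte block pointer packed into a 128 KiB indirect block. Both sizes are assumptions stated here for the sketch, not numbers given in the talk, and the 100 GiB single file is purely hypothetical.

```python
BLKPTR_BYTES = 128                 # assumed size of one block pointer
INDIRECT_BLOCK_BYTES = 128 * 1024  # assumed size of one indirect block
FANOUT = INDIRECT_BLOCK_BYTES // BLKPTR_BYTES


def indirect_levels(num_data_blocks: int, fanout: int = FANOUT) -> int:
    """Number of levels of indirect blocks needed above the data blocks."""
    levels = 0
    n = num_data_blocks
    while n > 1:
        n = -(-n // fanout)  # ceiling division: blocks needed at the next level up
        levels += 1
    return levels


# Hypothetical example: one 100 GiB file stored in 4 KiB records.
data_blocks = 100 * 2**30 // 4096
print(FANOUT, "pointers per indirect block;",
      indirect_levels(data_blocks), "levels for", f"{data_blocks:,}", "data blocks")
```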

A

So what can we do about it?

A

Let's get rid of the stuff that we don't need. So what I've done is I introduced something called a lightweight write which I'll try not to say too many times quickly in a row.

A

So we have this new type of dirty record, a lightweight write dirty record, and it is not associated with any dbuf.

A

Instead, it just says: here's the data that we need to write to this object at this offset. Now, when we're writing out all the data in syncing context, we're going to follow these pointers down to the indirect dbuf's dirty record, which has a list of the dirty records of the leaves, and we're going to be able to create this zio that points to the data. Because of this, we don't need to create the leaf dbuf.

A

We don't need to create the ARC buf. We don't need to link into our cache hash table. We don't need to evict later on. So this is great. The downside is that, while this is really slick and really lightweight, we can't handle reads of data that's lightweight dirty. If you're doing a read, we're going to find the dbuf associated with that data, and if it's not there, we're going to instantiate it. There is no dbuf associated with this data.

A

So if we just try to drive on without making any additional code changes, we wouldn't find this dirty data; we'd instantiate a new dbuf, read it from disk, and then we would see the wrong data.
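
A toy model of that hazard, written only to make the failure mode concrete. `ToyDataset` and everything in it is invented for this sketch and does not mirror the real DMU structures: pending lightweight writes live in their own table, and a read that only consults the buffer cache and the disk never sees them.

```python
class ToyDataset:
    """Toy model: lightweight writes are tracked without a cached buffer."""

    def __init__(self):
        self.on_disk = {}            # (object, offset) -> bytes already synced
        self.dbufs = {}              # cached buffers, keyed the same way
        self.lightweight_dirty = {}  # pending writes with no buffer attached

    def lightweight_write(self, obj, offset, data):
        self.lightweight_dirty[(obj, offset)] = data

    def read(self, obj, offset):
        key = (obj, offset)
        if key not in self.dbufs:
            # No buffer found: instantiate one from disk. The pending
            # lightweight write is never consulted on this path.
            self.dbufs[key] = self.on_disk.get(key, b"")
        return self.dbufs[key]

    def sync(self):
        for key in self.lightweight_dirty:
            self.dbufs.pop(key, None)  # toy choice: drop any stale cached copy
        self.on_disk.update(self.lightweight_dirty)
        self.lightweight_dirty.clear()


ds = ToyDataset()
ds.on_disk[(1, 0)] = b"old"
ds.lightweight_write(1, 0, b"new")
stale = ds.read(1, 0)   # read before sync sees what is on disk
ds.sync()
fresh = ds.read(1, 0)   # read after sync sees the written data
print(stale, fresh)
```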

A

We would see what's on disk, rather than what was written but not yet synced to disk. If you're wondering, this is why I removed the dedup receive code. I know we talked about doing that; I think it was proposed at this conference four or five years ago. I finally got around to doing it because I didn't want to deal with this. Now that it's been removed, zfs receive doesn't ever read from the dirty data.

A

It's a write-only workload until the receive completes, so we don't need to worry about that. The other implication is that the received data doesn't pollute the ARC. Because we're not putting it into the ARC, it is not cached in the ARC.

A

You won't get a cache hit from it, but for this data you're typically not going to get cache hits anyway; you're not typically going to be reading from it right after you receive it. Especially if you're receiving a lot of data, it's going to be flying out of your cache before the receive completes.

A

So how does this impact performance? This is what we had before: 30 percent of our CPU dealing with dbufs. And this is what we have after: down to 13 percent, with a similar improvement in time waiting for locks. So that's great. Running the zfs receive, we got a 54 percent performance improvement. Awesome.

A

What else can we do? Again in the receive writer, we saw that the other big category of CPU usage was dealing with transactions. For every block, in this case every four kilobytes, we're creating a new transaction. That has to look up the dnode, do a bunch of accounting, and make sure you have enough space to do the write, and we're doing that every single four kilobytes.

A

What I noticed is that if you have a big send stream, it's probably mostly big files, and especially if you have a small recordsize, you probably have a lot of records in each file. So we have a lot of write operations happening to the same file, and most of this work, like finding which file we're writing to, is the same each time. Let's batch that up. So what I did is batch up contiguous write operations.

A

When in the send stream we see a write to offset one, a write to offset two, a write to offset three, what I'm doing is just remembering those, not sending them to the DMU until I have a whole megabyte of those write operations, and then sending that to the DMU with one transaction.
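
The grouping logic can be sketched in a few lines. This is an illustration of the batching rule as described, not the receive code itself; the function name, the `(object, offset, length)` tuples and the dictionary layout are assumptions made here.

```python
def batch_contiguous(writes, max_batch=1 << 20):
    """Group (object, offset, length) writes into batches.

    A write joins the current batch only if it targets the same object, starts
    exactly where the batch ends, and keeps the batch within max_batch bytes.
    Each resulting batch would then need only one transaction.
    """
    batches = []
    for obj, offset, length in writes:
        if batches:
            b = batches[-1]
            if (b["object"] == obj
                    and b["offset"] + b["length"] == offset
                    and b["length"] + length <= max_batch):
                b["length"] += length
                b["count"] += 1
                continue
        batches.append({"object": obj, "offset": offset, "length": length, "count": 1})
    return batches


# 512 contiguous 4 KB writes to one file, then one write to a different file.
writes = [(7, i * 4096, 4096) for i in range(512)] + [(8, 0, 4096)]
batches = batch_contiguous(writes)
print(len(writes), "writes ->", len(batches), "transactions")
```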

A

So we had a huge reduction in the amount of CPU used dealing with transactions, down to less than one percent of one CPU, and we got a good additional eleven percent performance improvement on top of what we already had. Cool. So let's take another look at receive CPU usage.

A

This is what we had as a baseline, and this is where we are after we implemented those two improvements. The receive writer is using a lot less CPU, and we totally got rid of this dbuf evict.

A

The other threads are mostly using more CPU, because we're going faster and we haven't done anything to improve them, and the main thread is still sitting here at 57 percent on CPU.

A

So let's take a look at the flame graph for the main thread. 20 percent of it is dealing with pipe mutexes and condition variables. We have the same situation as we did in the send case, where we're making two calls into the pipe for every block.

A

So again, what I did was batch up the pipe read calls so that I'm reading one megabyte at a time, which gave a huge reduction in the amount of CPU used, from twenty percent down to four percent.

A

My implementation did need an additional bcopy that cost ten percent, but we're still better off than we were before at twenty percent, and I got an additional twelve percent performance improvement on zfs receive. But here's the catch with the way that pipes work: one side writes and the other side reads, and the side that's reading only reads. You can't push things back into the pipe.

A

So the problem is that, when we're reading the last chunk of the send stream, we don't know a priori that it's going to be the last chunk. We read that whole megabyte, and if there was some data after the end of the send stream, we're going to read that in as well. There's no way to undo it and push back the part that we didn't actually want.
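
The over-read is easy to reproduce with an ordinary pipe. In this sketch the 1000-byte "stream" and the `NEXT` trailer are stand-ins invented for the demonstration; only the pipe behaviour is real.

```python
import os

r, w = os.pipe()
stream = b"S" * 1000   # stands in for the end of a send stream
trailer = b"NEXT"      # unrelated data that follows it in the same pipe
os.write(w, stream + trailer)
os.close(w)

# Batched read: ask for up to 1 MB, without knowing where the stream ends.
chunk = os.read(r, 1 << 20)
consumed_extra = chunk[len(stream):]

# Whoever reads the pipe next finds the trailer already gone.
leftover = os.read(r, 1 << 20)
os.close(r)
print(consumed_extra, leftover)
```

The reader has swallowed the four trailer bytes along with the stream, and there is no call that puts them back for the next consumer.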

A

So this breaks things where the send stream is followed by some other piece of data. For example, with zfs send -R or -I, we're generating a bundle of a whole bunch of different snapshots. To really address this, we probably need to change the send stream's over-the-wire format. I didn't do that, but this proves out how much performance benefit we could get if we could make this real. All right.

A

So, in conclusion, this is the baseline, where we were before any of these changes, and here is where we improved performance. We actually got a pretty good performance boost even with the large block sizes, but at our target block size we're almost doubling performance, 87 or 90 percent, up to 840 and 740 megabytes per second.

A

So where are we at with these changes? Bypassing the ARC and the write transaction batching are done, and those will be in OpenZFS 2.0. The other changes, the batch output and lightweight write code, are basically done; I'm still working on upstreaming that. And the batch input needs a little bit more work. I see we're running short on time, so I will skip the future work.

A

You can come to the breakout session if you want to hear more about other crazy ideas, but I will go to Q&A. All right, so Jan asked: does ZFS support reading and writing TCP sockets with in-kernel TLS? The way that it works, zfs send and receive are passed a file descriptor from userland, so yes, that file descriptor can represent a socket. I haven't done that in practice; that'd be something really cool to investigate.

A

What we're doing is sending to a pipe, to a userland process which is then doing a whole bunch of other stuff, encryption and spreading it out over a whole bunch of TCP connections, to get really good bandwidth.

A

So ZFS supports it; I don't know how widely used that is. If folks have experience with that, come to my breakout session and let's talk about it.

A

Jan also asked: what's the definition of idle CPU? I'm not sure that I understand that there are different definitions, so maybe you can help educate me on that. I was looking at the flame graph where I'm gathering all the stacks and not excluding the idle ones, and that lined up pretty well with top or iostat -x, something that just gives you a summary like: this percent was CPU system time, this percent was idle.

A

Someone asked what kind of load this was run with. There was no load beyond the send or receive that I was testing. I talked a little bit about the test setup and methodology: when I'm talking about the send performance, it's literally just running zfs send of the snapshot, so we're doing a full send, not an incremental, and piping that into dd with a block size of one megabyte. So dd is going to be consuming it in one-meg chunks.

A

Okay, so Anonymous is asking about the readability of the libzfs_sendrecv.c code file. I think all of these changes that I've talked about here do not touch libzfs; these are all kernel changes. I did do a bunch of cleanup of the kernel send and receive code, in smaller commits that are already in there, but I did not have to touch the userland stuff. I was thankful to not touch it, because you are correct.

A

It is not very easy to understand, so yeah, that's still as it ever was.

A

All right, and I don't see any more questions. Awesome. Thank you.