From YouTube: OpenZFS Developer Summit Part 7
Description
http://www.beginningwithi.com/2013/11/18/openzfs-developer-summit/
Scalability (Kirill Davydychev)
Virtual memory interactions (Brian Behlendorf)
A
There are a couple of talks that a lot of people expressed interest in, and about 45 minutes until dinner arrives. The talks that a lot of people were interested in were scalability issues with Kirill, interactions with the VM subsystem with Brian, multi-tenant ZFS with Rob, and examining the on-disk format with Max.
A
So what do people think: should we give each of these talks 15 minutes and just go a little bit over, or are there some of these talks that are more interesting than others?
A
There you go, all right. So, would you like to go first?
A
Okay, so we'll do them in the order that I mentioned. So Brian, and whoever else is leading the discussion about interaction with the VM subsystems, you'll be next.
D
Right, so I actually don't have a PowerPoint, and unfortunately, or fortunately in the context of time, Adam covered a lot of the stuff I wanted to talk about, because at Nexenta we had a lot of the same issues as they do. So, basically, yeah.
D
We see several scalability things that come up with customers, for a variety of reasons. The, well, not the worst ones, but the most complicated ones are where customers decide to scale up way too high, to the point where it just doesn't make any sense.
D
Memory, number of disks, anything. We have deployments of up to 480 drives in a single system, which may get into a pickle where you end up CPU bound, or you don't even have enough memory for metadata, so you're just stuck on I/O, or you're just not even utilizing your disks to any meaningful level.
D
Yes, yes, we've profiled it, but unfortunately I've been pulled away on some other stuff. I do have an in-house SSD system now where I can play with it. A lot of it is in locks, both the ARC and a few other locks. I'm hopeful that some of the Delphix changes will help that.
D
There's room for improvement for sure, and it's a lot of locking issues, a lot of things that unexpectedly start taking longer, as operations that were slow in the spinning-disk context become really, really fast.
D
Well, so, yes, it's exacerbated essentially by the amount of IOPS that you're pushing, so the smaller the block, the worse it is in terms of the amount of work you have to do per, say, per megabyte.
D
In our testing, on both OpenIndiana and on our current development branch of NexentaStor, which is a few months behind, I was not able to push more than 130,000 write IOPS out of any pool, not even a pool that's supposed to be able to push a million from a raw-disk perspective.
D
So this is where we stand from a write-performance and scalability perspective, and it's especially sad because on pools that should be able to push far more, the bottleneck is the CPU. It's exacerbated by the locking. Which locks? There's a variety of them. I can probably publish the data openly, I'm not really trying to hide it, but I don't remember exactly: there were ARC locks, some ZIO locks, some task queue stuff. Probably a lot of the stuff you've seen already, yeah.
E
Maybe we don't necessarily want to publish the data, but it would be good to start having those conversations and those threads in the OpenZFS community, so we can look at them and try to brainstorm on whether it's a problem that's already been solved or something new that needs investigation.
D
Yes, absolutely. So one thing that's kind of beneficial about our customer and partner model is that a lot of times we have partners that build those in-house systems that they test before shipping to the customer, and they come to us to see how we can tweak or optimize those systems.
D
We are seeing, first of all, that those furious reaps end up deadlocking the system for seconds at a time; deadlocking it so badly that the network drivers cannot allocate memory for packets, which means the box becomes not pingable, which means the cluster heartbeats drop, which doesn't make everybody happy. And those reaps are very much more visible during an abrupt change in workload.
D
So if we were, for example, pushing mostly a small working set size, and we are mostly in the MFU, and suddenly a big backup job runs on some different dataset, and the blocks that the backup job reads are still in the ghost list for the MRU, now we have the MFU shrinking rapidly and the MRU growing rapidly, and as far as I know there's no threshold or limiter that will actually slow down this rapid progression of going from one cache to the other.
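To make the swing being described concrete, here is a minimal C sketch, not the actual OpenZFS code, of ghost-hit-driven adaptation: a hit on the MRU ghost list grows the MRU target at the MFU's expense, and nothing rate-limits how fast that can happen. The struct, the function name, and the step size are all illustrative assumptions.

#include <stddef.h>

/* Illustrative ARC-like state; sizes in bytes, names hypothetical. */
struct arc_like {
    size_t c;                   /* overall cache target                      */
    size_t p;                   /* target size of the MRU side; MFU gets c-p */
    size_t mru_size, mfu_size;
};

/*
 * Called on a hit in one of the ghost lists (headers of blocks we recently
 * evicted).  The side that would have hit gets a bigger share of the cache.
 * There is no limiter here: a burst of MRU-ghost hits, e.g. a backup job
 * re-reading a cold dataset, can move p, and so the caches, very quickly.
 */
static void arc_like_adapt(struct arc_like *a, int hit_mru_ghost)
{
    const size_t step = 128 * 1024;     /* illustrative adjustment step */

    if (hit_mru_ghost)
        a->p = (a->p + step > a->c) ? a->c : a->p + step;
    else
        a->p = (a->p > step) ? a->p - step : 0;
}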
D
The same thing goes for different block-size workloads. When you have a mixed-use system with, let's say, NFS shares at 128K, and you have some zvols at 4K, some zvols at 16K, something like that, then, as the system goes through its daily or weekly or whatever life cycle, you have, let's say, a VDI system plus some file shares plus some databases.
D
So in the morning your users log in and they go to the 8K blocks, and the cache gets populated. In Solaris the ARC memory actually works like this: it goes into buckets, into slabs of a certain size with blocks within them, so each slab is 128K and can contain smaller chunks that are all equal within that slab. So the way the memory scales up and down is that it looks at each size that the system needs.
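As a rough illustration of the bucketing just described, here is a plain C sketch of a kmem-cache style layout: one cache per buffer size, each cache made of fixed 128K slabs carved into equal chunks of that size. All names are made up for the example; this is not the Solaris allocator.

#include <stdlib.h>

#define SLAB_BYTES (128 * 1024)   /* each slab is 128K, per the description */

/* One slab holds chunks of a single size; a cache is a list of such slabs. */
struct slab {
    struct slab *next;
    size_t       chunk_size;      /* all chunks in this slab are this size */
    size_t       chunks_free;
    void        *mem;             /* SLAB_BYTES of backing memory          */
};

/* One cache per buffer size the system uses: 4K, 8K, 16K, 128K, ... */
struct size_cache {
    size_t       chunk_size;
    struct slab *slabs;
};

static struct slab *slab_create(size_t chunk_size)
{
    struct slab *s = malloc(sizeof (*s));

    if (s == NULL)
        return NULL;
    s->next = NULL;
    s->chunk_size  = chunk_size;
    s->chunks_free = SLAB_BYTES / chunk_size;
    s->mem = malloc(SLAB_BYTES);  /* stand-in for a real slab allocation */
    if (s->mem == NULL) {
        free(s);
        return NULL;
    }
    return s;
}

/*
 * Growing one size class means creating more slabs for it; shrinking another
 * class means walking its slabs and freeing the empty ones back.  Dropping
 * hundreds of gigabytes of 4K buffers therefore touches millions of chunks
 * across a huge number of slabs, which is the expensive reap described next.
 */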
D
Let's say we need more 4K, so it tries to reap all the other ones, but as you move between the workloads you can end up again in a situation where you need to drop gigabytes, or tens of gigabytes, or sometimes hundreds of gigabytes of, let's say, 4K buffers, which is a very memory-intensive operation and a very CPU-intensive operation because of cross calls.
D
So again, this can cause a deadlock. Sometimes it actually does cause a complete deadlock. I believe Boris worked on a case like that, where we got it to a point where it's not a complete deadlock, so it slows down, but it doesn't actually freeze the system completely, requiring a kernel panic to get it out.
D
Yes, so there are quite a few, and a lot of those are difficult to troubleshoot, because you don't know if your CPUs are used up by the reaps or not, so you don't know where the root cause of your bottleneck is. For example, you may want to add more CPU because you think you're running out, but adding more CPU of course just means you have more cross calls, so in effect it might slow the system down.
A
So, there's a lot of problems there: where should we, or you, focus your efforts in order to attack the highest-value, lowest-cost things?
D
So right now, write performance and the memory instability are the two key areas where I think we should focus our efforts. Those are the top two things for Nexenta internally right now as well, so we're going to be doing a lot of work there, hopefully.
D
Two years ago, a system with 256 gigs was almost unheard of. Now we have customers deploying 512, and in some cases I think somebody's talking about a terabyte of RAM, yeah.
F
On your stuff, I might just say a couple of things. There's one thing that Kirill mentioned about the ARC and the transition between multiple different workloads. I've looked at the code, and it seems to me that the ARC code there is somewhat simplistic; it chooses a fixed order for going through the lists.
F
And I don't know how much it's going to help, because the other thing is with the multi-modal workloads.
F
My look into the gist of this kind of leads to virtual memory arena fragmentation, and the classic case is when you fill up, or almost fill up, all the arenas with small blocks, then randomly overwrite them, and then start filling up with progressively larger block sizes, which seems like a pathological case, but sometimes it happens: you fill it up with 8K, overwrite randomly, and then the block size starts going up, like 16, 32, 64 and 128, at the same time.
A
I mean, I think creating test cases like that, that can reproduce these performance problems, is going to be super valuable to actually fixing them, because when you only see this on a customer system and it happens once a day, that's really hard to diagnose, or to evaluate a fix for.
D
So yeah, and I've had systems where it happens once a week. That particular one was attributed to a Microsoft SQL backup job, which was the precise change in workload that induced it, without fail, every single weekend, sending the customer box offline for 30 seconds or so.
D
So in my case the biggest one I have is 280 gigs of RAM, and it's sufficient for this sort of stuff. It will not be sufficient once we have eliminated the low-hanging fruit, I'm afraid, but that's something I'm talking about with the guys here, and Alexander, so hopefully we'll be able to scale with our customer base there.
A
Thanks, Kirill. Well, I don't have slides either, so I'll try to keep it quick, but I'm actually encouraged to hear that other people are looking at the memory management stuff too. I had feared for a while that this was something only I was going to get to work on for our port, but knowing that there's work going on on the other platforms is actually a little bit encouraging to me, believe it or not. So, the problem we're suffering with on the Linux port:
A
I'd say probably our biggest stability concern at the moment is the memory management, and it's a problem because we basically took it over and kept it unchanged from the upstream code as best as we could, and the problem with that is that Linux's memory management subsystems are very, very different from FreeBSD's or illumos's. Certain things that you would think would be fine just aren't the way you do it in the Linux kernel. Case in point: large memory allocations.
A
Anything over, I would say, a couple of pages on Linux, you should pretty much think about allocating individual pages for, if you want to do it fast. We have kmalloc and we have vmalloc, but on Linux kmalloc is only fast for a couple of pages, and vmalloc is strongly frowned upon: you should not use it, it is not a good interface, it is not a fast interface, it is not meant to be a fast interface, and it should not be relied upon. So early on in the Linux port, rather than plumbing up all the code to do scatter-gather based page allocations, what we did was put a layer in our SPL to try and make that all work reasonably well on Linux. We implemented our own, basically vmalloc-backed, slab in Linux to try and make that behave pretty well, so we wouldn't have to change any of the ZFS code; it could stay unchanged, and we got it working.
A
It got us a long way, but at the end of the day we just can't really get away with doing large memory allocations in the Linux kernel.
A
So to address that, we've been thinking about how to re-plumb ZFS to use scatter-gather lists for all the large ZIO buffers that currently come off of a slab.
A
If that's workable for everybody, should we just put wrappers around them, so that you guys can continue to use kmem_alloc and kmem_free and your existing slab implementations while we do something more Linux-specific, or is this a bigger problem for everybody else going forward, so that maybe we should all move towards a scatter-gather sort of infrastructure for ZFS in general? I think we know we have to do it for Linux, but I'm curious how much of a problem it really is for the other implementations.
A
So can you kind of describe, maybe I just didn't quite understand, what is the exact problem? Is it that you cannot allocate contiguous virtual address space? No, so there are a couple of problems that are worth explaining, because I think most of us aren't Linux folks here. You can absolutely do it on Linux, but you have to be aware that it's a global address space in the kernel and it's covered by a single lock.
A
So if you do a vmalloc, you take a global spin lock on the system, so everything gets serialized through that lock. It's not that it doesn't work, it's just that it's very, very slow to do, so it's frowned upon. Improvements have gone into the Linux kernel over the last couple of years to speed that up, but fundamentally it's all still serialized on a global lock, so it's expensive. And are you using a slab allocator, so that you aren't going to that global lock for every allocation?
A
Only when you need to get a new slab; that's what we did in the SPL, that's the slab allocator we wrote. We said, okay, well, we know this is what Linux is going to do, so we'll allocate pretty big slabs and then we'll just carve them up internally and avoid taking that global lock. That was the dodge we pulled to get around this being a really bad problem.
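A sketch, in plain C with malloc standing in for vmalloc, of the dodge being described: take the expensive, globally locked allocation only rarely, for a big slab, and satisfy individual object allocations by carving that slab up locally. The names and sizes are illustrative, not the actual SPL code.

#include <stdlib.h>

/* The expensive path (vmalloc on Linux: global lock, serialized).
 * Plain malloc() stands in for it here. */
static void *expensive_big_alloc(size_t bytes)
{
    return malloc(bytes);
}

/* A large slab, obtained once, then carved into many equal objects. */
struct carved_slab {
    char  *base;
    size_t obj_size;
    size_t total, used;
};

static int carved_slab_init(struct carved_slab *cs, size_t obj_size, size_t nobjs)
{
    cs->base = expensive_big_alloc(obj_size * nobjs); /* one global-lock hit */
    if (cs->base == NULL)
        return -1;
    cs->obj_size = obj_size;
    cs->total = nobjs;
    cs->used = 0;
    return 0;
}

/* Object allocations never touch the expensive allocator.  (No free list
 * shown; a real slab would recycle freed objects and grow by adding slabs.) */
static void *carved_slab_alloc(struct carved_slab *cs)
{
    if (cs->used == cs->total)
        return NULL;
    return cs->base + (cs->used++ * cs->obj_size);
}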
A
Basically, there are two types of allocation: there's vmalloc and there's kmalloc. Anything that goes through vmalloc goes through that global lock and will be serialized. Anything that goes through kmalloc is quite fast and is backed by a slab, but there's a size limit on it; you probably shouldn't allocate more than, I would say, two pages, probably.
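For readers who aren't Linux folks, here is a small kernel-module style fragment, not from the ZFS code, illustrating the two interfaces being contrasted: kmalloc for small, physically contiguous allocations served from a slab, and vmalloc, which builds a virtual mapping under a global lock and is therefore avoided on hot paths.

#include <linux/slab.h>      /* kmalloc, kfree */
#include <linux/vmalloc.h>   /* vmalloc, vfree */
#include <linux/errno.h>

static void *small_buf;      /* a couple of pages at most: kmalloc is fast  */
static void *large_buf;      /* big and rarely allocated: vmalloc territory */

static int alloc_example(void)
{
	/* Fast path: physically contiguous, served from a slab. */
	small_buf = kmalloc(8192, GFP_KERNEL);   /* roughly two 4K pages */
	if (!small_buf)
		return -ENOMEM;

	/*
	 * Slow path: builds page-table entries in the kernel's global
	 * vmalloc address space, serialized on a global lock, so it is
	 * frowned upon for anything performance sensitive.
	 */
	large_buf = vmalloc(4 * 1024 * 1024);
	if (!large_buf) {
		kfree(small_buf);
		return -ENOMEM;
	}
	return 0;
}

static void free_example(void)
{
	vfree(large_buf);
	kfree(small_buf);
}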
A
Linux does provide its own slab, but it's got these restrictions, like I think the biggest thing you can actually allocate out of the kmalloc slabs is 128K, and don't count on it being fast, and don't count on it succeeding, because it's totally possible for it to fail and say no, you can't have one; that's a legitimate way for the allocator to behave. So can you describe what you mean by scatter-gather and how that's going to help?
A
Yes, so I was getting there. The solution we're proposing on Linux is basically what file systems on Linux do: they don't do large memory allocations. They allocate individual pages, they get individual pages from user space for the I/O that's going on, and they feed individual pages to the block layer, and they assemble those into scatter-gather lists as the I/O gets passed through. So it's a set of pages, just a list of pointers, and these pages don't need to be physically contiguous in memory.
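A hedged kernel-style sketch of what such a buffer might look like: just an array of struct page pointers, allocated page by page, with no requirement that the pages be contiguous and no kernel virtual mapping created up front. The sg_buf type and function are hypothetical, not an existing API.

#include <linux/gfp.h>       /* alloc_page, GFP_KERNEL  */
#include <linux/mm_types.h>  /* struct page             */
#include <linux/slab.h>      /* kmalloc, kcalloc, kfree */

/* A buffer described only as a list of pages (hypothetical type). */
struct sg_buf {
	int           npages;
	struct page **pages;
};

static struct sg_buf *sg_buf_alloc(int npages)
{
	struct sg_buf *b;
	int i;

	b = kmalloc(sizeof (*b), GFP_KERNEL);
	if (!b)
		return NULL;
	b->npages = npages;
	b->pages = kcalloc(npages, sizeof (struct page *), GFP_KERNEL);
	if (!b->pages) {
		kfree(b);
		return NULL;
	}
	/*
	 * Pages can come from anywhere; nothing here is physically or
	 * virtually contiguous, and nothing is mapped into the kernel
	 * virtual address space yet.  (Per-page error handling elided.)
	 */
	for (i = 0; i < npages; i++)
		b->pages[i] = alloc_page(GFP_KERNEL);
	return b;
}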
A
This is my buffer, and it doesn't need to be physically contiguous, or virtually contiguous for that matter? Right, right, they're not even mapped into the address space, and that's why it's better: on Linux you have this set of pages, and they have physical addresses, but they have no virtual address. But I mean, they need to be mapped in, like, to check something, yes? Is that going to destroy performance?
A
So that's an open question, because normally, to create a mapping you have to go shoot down the TLB on all the other CPUs, and changing the address space is usually not very fast.
C
Right, but it can be done on one CPU at a time, so on i386 you can do a local temporary mapping on a page-by-page basis without shooting down all the other CPUs, and on amd64 there's no address-space pressure, you can use the direct map. I'm guessing Linux has a direct map as well, so it's essentially free, and so there's really almost no cost. Can you pin a thread to that CPU during the calculation? Absolutely.
A
In fact, you have to on Linux. So you take the buffers and you map them to that CPU for the operation you need to perform. There are certain restrictions, like no sleeping: you can take spin locks, but don't ever yield.
C
I'm surprised. I mean, on FreeBSD you just yield, and then when it returns, basically based on priority, you get resumed on the same CPU.
A
That's the way it works on Linux, so that would be, we would be looking at introducing wrappers to do that kind of thing. So instead of doing what we do now, which is allocating the ZIO buffer with kmem_alloc, maybe there can still be a wrapper that looks like that's what happens, except you get this zio_buf back that you can pass around as if it were your buffer. Then on Linux, what you would do instead would be:
A
You would allocate a bunch of pages, not randomly, but you just ask for a bunch of pages and you get them from somewhere in the address space, and then you would have some wrapper functions to copy data into them, or to get data out of them, or to compute a checksum, that kind of thing, but you would never need to map those pages into the virtual address space.
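A hedged sketch of the kind of wrapper being described: walk a list of pages, temporarily map each one onto the current CPU with kmap_atomic (essentially free on 64-bit, where the direct map already covers it), do the work, and unmap. The byte-sum here is a trivial stand-in for a real checksum, and the page-array shape matches the hypothetical sg_buf sketch above.

#include <linux/highmem.h>   /* kmap_atomic, kunmap_atomic */
#include <linux/mm_types.h>  /* struct page                */
#include <linux/types.h>     /* u64                        */

/*
 * Sum the bytes of a buffer described only by an array of pages.  No
 * persistent kernel virtual mapping is ever created; each page is mapped
 * onto this CPU just long enough to read it, and the code must not sleep
 * while the temporary mapping is held.
 */
static u64 pages_simple_sum(struct page **pages, int npages)
{
	u64 sum = 0;
	unsigned long j;
	int i;

	for (i = 0; i < npages; i++) {
		unsigned char *va = kmap_atomic(pages[i]);

		for (j = 0; j < PAGE_SIZE; j++)
			sum += va[j];

		kunmap_atomic(va);
	}
	return sum;
}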
A
Well, I mean, you can, like we were saying, as long as you don't actually need to yield the CPU. Something like compression, you don't need to: you can map it onto that CPU, do the compression, and then you're good. If you need to do something else, like, oh, I don't know, something where you have to take a mutex or something like that... Are you proposing doing this for only user data?
A
You know, there are probably a thousand places in the code where it's, you know, updating an indirect block, or updating a znode, or updating the dnode, or updating a, you know, DSL dataset; all those things are mostly pretty small blocks, but you know...
A
The access to them is much less constrained than user data. Yes, and it's for that exact reason that we went through the hoops we did initially to make our own slab implementation that was faster; we wanted to avoid all of that. But long term, I don't see how we avoid it on the Linux side; we can't really continue to do what we're doing.
A
We need to do something that doesn't require us to map anything into the virtual address space. Can you just, when the system boots up, map all physical pages into the address space and then...
F
It's the same problem, except it's alleviated with quantum caches, right, under the arena levels. They also have a single lock within a vmem arena, and the way they make it work for multiple users is that they have quantum caches for so many of the small sizes; but the other thing is that all the caches are built on top of vmem, and the arenas are sort of hierarchical. So that's how they make it all work well.
A
I think that, you know, I don't have any fundamental problems with doing that. There are a lot of ideas on how maybe the layering could be better, but I think working within what you've got and doing that, you know, that's fine. I don't think it's going to hurt anybody else, I don't think so.
A
I should also mention that, yeah, we kind of brought it up with the upstream maintainers, and this is more a philosophy thing, I would say. On Linux this is not viewed as a bad thing; it's: if you want good performance, you should not go through the virtual address space. It's just kind of baked into the culture, they don't believe they're doing anything wrong, and I tend to agree with them, actually. Allowing this sort of thing is a convenience for the developers, right.
A
A
You
can't
just
do
this
in
the
portability
layers,
so
I
mean
even
the
fact
that
we've
tried
to
hide
it,
but
there's
other
issues
that
come
up
on
linux.
Things
like
you
can't
do
any
kind
of
sleeping
allocation
or
email
in
any
right
path.
Right
that
is
will
deadlock
shall
not
be
done
all
right.
So,
let's
maybe
say
that.
A
Page cache integration: we'd like to bring ZFS into the page cache on Linux, and if we start doing things on a page basis we can actually do that. We don't have this problem of mapping these random buffers that are in a slab somewhere into the page cache; we can now properly map pages in, fault them in, and get rid of the ugly mmap hack, probably, that we have at the moment, where we keep two copies of the data.
A
So I think it's a lot of work, but I think for everybody there would be a lot of benefits that would fall out of it. I don't expect it to be a quick change; it's a big, disruptive change, and I kind of hope to spend the next year poking at it and moving it forward. So I don't want to underestimate the amount of work.
A
Cool, thanks. So let's go ahead and grab dinner and come back here; no beer until you listen to two more topics.