From YouTube: ViennaScientificCluster
A
What I'm going to run through is what we've looked at, and I'll show who we are in just a second. We've looked at the performance of using ZFS as the store for a parallel file system on top, and we use the Fraunhofer parallel file system. Not everyone may be familiar with it, so I'm going to spend a few words on it shortly.
A
So, first, looking at the use cases, there's a big disclaimer. I'll show both some performance data we retrieved and where we're going forward. Now, we're an HPC center, so we do high performance computing. We do that for science users at multiple Austrian universities. We've got two, one can already say three, clusters, because this slide, as you see, is from yesterday; as a matter of fact, the third one is being built as we speak.
A
We mostly have standard HPC codes: codes for weather forecasting, codes for various physics applications, finite element analysis and so on, and they usually run as message passing over MPI on InfiniBand-connected nodes. So that's basically the environment from the point of view of the compute nodes.
A
This
is
this:
is
an
oil
pump
hpc
site
in
europe
very
shortly,
so
this
is
actual
actual
boards
being
being
put
into
oil.
So
this
is.
This
is
something
with
some
novelty
value
now.
A
The
the
use
cases
we
look
at
here
for
for
cfs
is,
is
scoped
to
be
around
archiving
around
user
data,
backup
and
there's
some
priorities
in
in
terms
of
what
we
look
at-
and
this
is
not
necessarily
the
the
the
weighted
list,
but
the
reliability
is
is
really
really
much
on
top
performance
is
not
in
this.
In
this,
with
these
use
case
is
not
really
the
the
top
one
at
all.
So
we
we
look
at.
A
For the parallel file system: it's the Fraunhofer Global File System, FhGFS as it used to be called, and it's now called BeeGFS; there's a bee symbol, apparently, which now comes with it. So we're using a parallel file system coming from the quote-unquote makers of MP3. This is a file system you can think of as separate storage nodes and separate metadata nodes, made to really allow multi-gigabyte throughput for users such as genomics and so on, and it scales to very large node counts: 40, 50, 60, 70, 80.
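A rough way to picture that architecture: file contents are striped in fixed-size chunks across the storage targets, while the metadata servers only track which targets hold which chunks. The sketch below is purely illustrative and is not FhGFS/BeeGFS code; the chunk size and target names are invented for the example.

```python
# Illustrative only: how a parallel file system spreads one file's data
# across several storage targets. Not actual FhGFS/BeeGFS code.
CHUNK_SIZE = 512 * 1024          # assumed stripe chunk size (bytes)
STORAGE_TARGETS = ["stor01", "stor02", "stor03", "stor04"]  # hypothetical targets

def stripe_layout(file_size: int) -> list[tuple[str, int, int]]:
    """Return (target, offset, length) for each chunk of a file,
    assigned round-robin across the storage targets."""
    layout = []
    offset = 0
    chunk_index = 0
    while offset < file_size:
        length = min(CHUNK_SIZE, file_size - offset)
        target = STORAGE_TARGETS[chunk_index % len(STORAGE_TARGETS)]
        layout.append((target, offset, length))
        offset += length
        chunk_index += 1
    return layout

# A 2 MiB file ends up with chunks on all four targets, which is where
# the aggregate multi-gigabyte throughput comes from.
print(stripe_layout(2 * 1024 * 1024))
```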
A
The
curve
is
very,
very,
very,
very
nice.
So
this
is
a
a
very
high
performance
file
system.
So
the
the
questions
we
looked
at
first
and
I
I'll
say
it
very
openly-
we
were
quite
negative
on
it.
So
we're
looking
at
this.
Is
this
stable,
zero?
Six
two
for,
for
our
purposes,
is
the
stack
stable,
is
it
performing
and
so
that
the
null
hypothesis
were
actually
no.
A
So just look at it as exploratory testing. Now, a disclaimer for the data, and you're going to see this: we thoroughly rejected that hypothesis. It turned out to be doing extremely well, actually, for the most part. The disclaimer is, of course, that this is work in progress, not finished; it's just a snapshot, it's mostly synthetic workloads, and only one is what you'd call a real-world workload. And last but not least, it's not optimized, that's for sure.
A
So there's a long list of optimizations which can be tried as we go. Now, three environments, and these are not huge environments for this testing. One is a single, fully equipped server node with the metadata service and the management service collapsed onto it, and two InfiniBand links.
A
We
only
use
one
though
we
have
enterprise
ssds
in
there
for
for
for,
for
the
for
the
cache
and
from
the
l2arkan.com,
then
we
do
have
eight
load,
creating
clients
for
this
environment
and
release
wise
again,
it's
shown
before
second
environment
is
from
from
prior
cluster.
Just
for
for
sizing
sizing
up
the
testing,
it's
for
old
sunfire
nodes,
which
you
see
here
no
surprises
here.
The
next
environment
is,
is
a
server
node
out
of
another
cluster
which
is
amd
here.
A
Yes,
this
is
built
on
raid
for
for
for
process
reasons
and
going
forward,
we'll
we'll
put
we'll
we'll
put
it
on
on
jbod,
of
course,
and,
however,
we
we
we
we
had
this
on
on
the
on
the
on
the
rate,
the
the
the
first
battery
of
tests
were
basically
looking
at
the
com
garage,
combinations
of
native
file
system
we
use
xfs
sort
of
is
the
default
default
native
and
zfs
and
remote,
not
remote,
and
so.
A
Basically, what you see from that is, and I would disregard the small file sizes, we were just seeing early on that the cache of course comes into play there. What matters is that with a large number of files the performance is just bad everywhere; however, it's along the lines of the other options, so ZFS is one of them and the others are similarly not too high. It's not grossly off.
A
So
this
is
a
first
first
string
to
to
see
and
then,
as
we
go
to
the
larger
file
sizes,
the
the
data
becomes
more
interesting.
So
I
would
go
straight
to
here:
there's
a
eight
something:
almost
a
terabyte
file,
ballpark
wise,
where
we're
we're
comparing
compe
we're
getting
into
the
region
of
native
xfs
and
there's
some
some,
some
of
course
local
remote
differences,
but
ballpark
wise.
A
We're
we're
coming
close,
then,
here
here
a
a
another
baseline
test,
where
we
put
the
the
only
the
xfs
is
as
backstore
and
here-
and
this
is-
this
is
a
key
figure
for
us
per
node
this.
This
shows
about
3.6
gigabyte
per
second,
so
this
is
a
again
disclaimed
in
here.
This
is
this
is
one
way
to
measure
it.
Otherwise,
you
get
you
get
different
values,
but
this
is
close
to
native
performance.
What
over
infiniband
qdr
what
you
can
get,
so
this
is
actually
the
three
six.
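As a hedge on "one way to measure it": a streaming-write test of the kind below yields a single GB/s figure, and, as noted, caching can inflate it. The mount point, file size and block size here are assumptions for illustration, not the parameters behind the 3.6 GB/s on the slide.

```python
# Minimal sketch of a streaming-write throughput measurement.
# Target path, total size and block size are assumptions, not the
# actual benchmark parameters behind the figure on the slide.
import os, time

TARGET = "/mnt/fhgfs/throughput_test.bin"   # hypothetical mount point
BLOCK = 4 * 1024 * 1024                     # 4 MiB per write call
TOTAL = 32 * 1024**3                        # 32 GiB total

buf = os.urandom(BLOCK)
start = time.monotonic()
with open(TARGET, "wb") as f:
    written = 0
    while written < TOTAL:
        f.write(buf)
        written += BLOCK
    f.flush()
    os.fsync(f.fileno())                    # force data out before stopping the clock
elapsed = time.monotonic() - start
print(f"{written / elapsed / 1e9:.2f} GB/s over {elapsed:.1f} s")
```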
A
This
might
have
some
cache
effect
in
there.
This
is
this
is
very
high.
The
here
we
have
a
red
z
already
with
four
tanks
with
nine
discs,
and
there
comes
already
the
surprise.
So
again,
we
were
very
negatively
looking
at
this
at
first.
So
how
can
this
and
so
on
in
this
deck?
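For reference, a pool of roughly that shape (four RAID-Z vdevs striped together, with SSDs as L2ARC cache and ZIL log) would be created along these lines; the pool name, device paths and vdev width are assumptions, not the exact layout used in the tests.

```python
# Sketch of building a pool like the one described: four RAID-Z vdevs
# striped together, plus SSD cache (L2ARC) and log (ZIL) devices.
# Device names, vdev width and pool name are assumptions.
import subprocess

disks = [f"/dev/disk/by-id/disk{i:02d}" for i in range(36)]  # 4 x 9 disks, hypothetical
vdevs = []
for i in range(0, 36, 9):
    vdevs += ["raidz"] + disks[i:i + 9]

cmd = (["zpool", "create", "tank"] + vdevs
       + ["cache", "/dev/disk/by-id/ssd-cache0"]   # L2ARC device (assumed name)
       + ["log", "/dev/disk/by-id/ssd-log0"])      # ZIL / SLOG device (assumed name)
print(" ".join(cmd))
subprocess.run(cmd, check=True)
```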
A
But
here
here
we're
reaching-
and
this
is
without
any
optimization
at
all-
are
reaching
about
two
points:
2.7
gig,
2.2
gig
in
this
in
this
configuration,
which
is
pretty
good,
it's
pretty
good,
very
good.
Actually
so
this
is
this
is
this.
Is
this?
Is
a
a
territory
we're
we're
in?
And
this
is
definitely
something
in
in
the
in
the
sort
of
admissible
range.
A
And then here, where the parameters are different, it even goes up to 3.4. If you're looking for the test parameters, I don't have them on the slide.
A
I'm
trying
to
figure
out
what's
the
difference
between
those
two
words,
I
cannot
see
it
on
the
slide,
so
I'm
afraid
I
have
to
go
back
to
the
notes.
So
there
must
have
been
a
different
parameter.
I
I
would
have
to
go
back
to
the
note,
so
there's
some
parameter
different.
Obviously
the
reading
goes
up
to
3.4.
I
would
believe
I
would
believe.
No,
I
couldn't
say
I
have
to
go
back
here
here
then
there's
a
stripe
of
four
four
four
four
times:
eight
again
nested.
A
Nested
and
here
with
different
file
sizes-
and
you
can
see
two
four
up
to
64
clients
leading
to
leading
to
about
on
the
on
the
large
file,
says
again:
2.3
geeks,
which
is
which
is
which
is
excellent,
which
is
excellent
in
here,
just
just
marvelous.
So
this
is
this:
the
real
real
world
fraunhofer
file
system,
layered
on
top
of
the
the
nested
raid
z1
2.3,
with
a
with
a
single
server.
Perfect
here,
is
another
test.
We
used
the
the
benchmark
tool
called
ior.
A
We
like
to
use
this
for
for
mpi,
especially
if
multiple
different
processes
are
active
here
is
a
simple
single
single
case,
but
this
tends
to
lead
to
lower
lower
values
in
comparison
with
with
other,
simpler
tests.
But
this
is
also
a
very
good
figure
in
here.
One
one,
gig
very
good
figure.
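For context, IOR is normally launched through MPI so that many client processes hit the file system at once. An invocation might look roughly like the sketch below; the process count, block and transfer sizes and the output path are illustrative assumptions, not the parameters behind the figures shown here.

```python
# Rough sketch of an MPI-parallel IOR run against the parallel file system.
# Process count, block/transfer sizes and the target path are assumptions.
import subprocess

cmd = [
    "mpirun", "-np", "64",           # 64 client processes (assumed)
    "ior",
    "-a", "POSIX",                   # POSIX I/O backend
    "-w", "-r",                      # write phase, then read phase
    "-F",                            # file-per-process
    "-b", "4g",                      # per-process block size (assumed)
    "-t", "4m",                      # transfer size per I/O call (assumed)
    "-o", "/mnt/fhgfs/ior_testfile"  # hypothetical mount point
]
subprocess.run(cmd, check=True)
```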
A
Here
it
becomes
a
little
bit
more
difficult
now,
so
here's
some
read
write
a
hundred
thousand
files.
The
the
the
the
performance,
as
shown
before,
is
pretty
comparably
bad.
As
for
for
other
file
systems
or
the
layered
file
systems.
However,
the
the
the
let
me
go,
the
in
the
in
the
following.
A
Now,
last
but
not
least,
we
did
some
do
not
show
that
data
here,
because
it's
it's,
it's
almost
un,
it's
too
good
to
be
true,
so
we
have
to
go
back
to
to
to
to
doing
more
measurements,
but
we
did
like
a
72
disc
stripe
with
all
all
out
load
on
on
this
on
this
stripe
and
we're
we're
basically
getting
infiniband
speed
on
it.
So,
despite
the
layer,
despite
despite
with
huge
files,
it's
it's
very
good.
It's
very
good
now
comes
the,
however.
A
The L2ARC and ZIL were, of course, on the enterprise SSDs. Now here's the good part: the compression we're getting is nice, approximately 1.5. With dedup enabled, I should say, not all files had actually been through the dedup yet as of measuring this.
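To tie those numbers to concrete knobs: compression and dedup are per-dataset ZFS properties, and the achieved ratio can be read back afterwards. A minimal sketch, with the dataset name assumed:

```python
# Sketch of enabling compression and dedup on a ZFS dataset and reading
# back the achieved compression ratio. The dataset name is an assumption.
import subprocess

DATASET = "tank/backup"   # hypothetical dataset used for user-data backup

subprocess.run(["zfs", "set", "compression=lz4", DATASET], check=True)
subprocess.run(["zfs", "set", "dedup=on", DATASET], check=True)

# compressratio reports the ratio actually achieved (e.g. roughly 1.5x here)
out = subprocess.run(["zfs", "get", "-H", "-o", "value", "compressratio", DATASET],
                     check=True, capture_output=True, text=True)
print("compressratio:", out.stdout.strip())
```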
A
Side
is
really
really
an
issue,
so
we,
if
you,
if
you
traverse
for
152
million
files
with
a
recursive,
ls
or
a
concretely,
I
think,
a
find,
find
not
minus
x
just
defined
recursively
through
through
the
tree,
looking
at
like
10
hours
or
so,
and
if
you
do
a
find,
and
if
stat
and
so
on,
on
the
files
it
gets
even
worse.
So
this
is.
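A minimal version of that metadata exercise, purely for illustration: walk the tree and optionally stat every file, timing the whole run. The root path is an assumption; on 152 million files this is exactly the kind of traversal that was taking on the order of 10 hours.

```python
# Sketch of the metadata traversal test: recursively walk a tree,
# optionally stat() every entry, and time it. Root path is an assumption.
import os, sys, time

ROOT = sys.argv[1] if len(sys.argv) > 1 else "/mnt/fhgfs"  # hypothetical root
DO_STAT = True   # also stat every file (the slower variant mentioned)

count = 0
start = time.monotonic()
for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        count += 1
        if DO_STAT:
            try:
                os.stat(os.path.join(dirpath, name))
            except OSError:
                pass    # a file may vanish while we walk
elapsed = time.monotonic() - start
print(f"traversed {count} files in {elapsed / 3600:.2f} h")
```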
A
This
is
quite
quite
long
so
here
here,
however,
in
this
test,
with
the
with
the
with
the
long
long
recursive
runs,
this
was
just
cfs
and
not
layered
on
top,
so
presumably
this
would
not
get
better
with
it.
Maybe
so
here
here
there's
room
of
room
for
improvement,
so
this
is
really
difficult,
yeah,
so.
A
This was with... let me think.
C
With the hardware RAID?
A
I
have
to
have
to
look
in
there
what
the
data
is.
I'm
sorry
I'll
follow
up
right,
yeah.
C
Right. But you were using deduplication; is it really giving you any benefit this early?
A
The pool is so large that it takes a long time to copy everything over, so more data is coming, but so far it's not bringing that much of a benefit. So metadata, this is an issue. However, running these things, I mean, think of it: this is 152 million files, this is a double-digit number of terabytes, and this runs out of the box, unoptimized, stacking these things on each other.
A
This
is
this
is
this
is
very
nice,
and
this
is
well
well
be
above
expectation,
so,
even
even
under
many
many
file
sizes
this
this
could
this
could
basically
do
it.
So
we
we
saw
a
very
few
issues
if
at
all-
and
one
is
that
this
is
not
this
is
this
is
a
non-issue.
A
This
is
not
an
issuer
bug
where
we
just
ran
into
this
from
the
very
beginning
that
the
udev,
the
linux,
udev
messes
things
up,
tiny,
tiny,
tiny,
something
we
think
we
saw
a
memory
leak
on
the
on
the
user
data
backup
site.
This
is
now
instrumented,
not
the
d
trace
or
anything
just
looking
at
the
memory
as
it
goes
along
and
see
if
you
can
reproduce
it
and
the
key.
The
key
thing
is
that
one
would
have
expected
layer
this
and
non-si.
No,
no
side
has
really
adapted
anything.
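The instrumentation mentioned can be as simple as periodically sampling free memory and the ZFS ARC size while the backup workload runs, to see whether the suspected leak shows up again. A sketch of that idea; the sampling interval and log file name are assumptions, and /proc/spl/kstat/zfs/arcstats is where ZFS on Linux exposes ARC statistics.

```python
# Simple memory watcher of the kind described: periodically sample MemFree
# and the ZFS ARC size to see whether the suspected leak is reproducible.
# Sampling interval and log file name are assumptions.
import time

def memfree_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1])

def arc_size_bytes():
    # ZFS-on-Linux ARC statistics; the "size" row is the current ARC size.
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "size":
                return int(fields[2])

with open("memwatch.log", "a") as log:
    while True:
        log.write(f"{time.time():.0f} memfree_kb={memfree_kb()} arc_bytes={arc_size_bytes()}\n")
        log.flush()
        time.sleep(60)   # sample once a minute (assumed interval)
```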
A
So
what?
What
would?
What
would
be
of
interest
going
forward
from
from
from
from
our
perspective
of
operating?
These
things
is,
is,
is
checksums
end-to-end,
of
course,
not
computing
them
multiple
times.
So
that's
one
thing,
erasure
codes.
Oh,
we
would
love
this
so
if
he
could
do
away
with
raid
and
have
this
all
in
the
in
the
distributed
file
system,
the
parity
information
and
then
all
scattered
around
the
disks.
This
would
be
very,
very
nice
indeed.
So
this
is
not
a
cfs
issue.
A
Let me see: a comparison without RAID-Z, just non-ZFS, we did do, but without RAID-Z I'd have to go back to the data; I'll come right back to you after the presentation. Here are the acknowledgements. So, again, this was a snapshot of measurements; so far we're very happy, against low expectations, but very happy, and we'll go on testing.
A
We
haven't
done
this
yet,
but
we
will
be
doing,
is
bit
rot
testing,
so
I'm
I'm
writing
a
little
little
little
tool
there
to
to
selectively
flip
bits.
Put
it
up
then,
and
we
need
to
do
more
negative
testing.
So
we
have
done
negative
testing.
It's
not
listed
here
in
the
slides,
but
yes,
we
have
done
it,
pull
the
disc
pull
the
cord
and
because
reliability
is
most
important
first
and
it
always
came
back
always
came
back.
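The bit-flipping tool mentioned can be very small: open a file, flip one bit at a chosen byte offset, and let a later read or scrub show whether the corruption is caught and repaired. A minimal sketch of that idea, with path and offset taken from the command line:

```python
# Minimal sketch of a bit-rot injection tool: flip a single bit at a given
# byte offset in a file. Intended to be pointed at the backing store so a
# subsequent ZFS read or scrub should detect (and repair) the damage.
import sys

def flip_bit(path: str, offset: int, bit: int = 0) -> None:
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset beyond end of file")
        corrupted = bytes([original[0] ^ (1 << bit)])
        f.seek(offset)
        f.write(corrupted)

if __name__ == "__main__":
    # usage: flipbit.py <file> <byte offset> [bit index 0-7]
    flip_bit(sys.argv[1], int(sys.argv[2]),
             int(sys.argv[3]) if len(sys.argv) > 3 else 0)
```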
A
So
the
most
important
is
that
the
feeling
that
you
would
trust
this
to
be
reliable
is
actually
so
far
so
good.
But
again
we
need
to
do
more
much
more
negative
testing
for
that.
We
need
more
test
automation,
so
we've
written
some
scripts,
and
so
but
this
needs
more
of
it.
That's
it.
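On the test-automation side, much of the negative testing reduces to scripted checks like the one sketched below: after pulling a disk or the power cord and letting the system come back, start a scrub, wait for it to finish, and verify the pool reports no errors. The pool name and polling interval are assumptions.

```python
# Sketch of a small automation step after negative tests (pulled disk,
# pulled cord): scrub the pool and check that it comes back healthy.
# Pool name and polling interval are assumptions.
import subprocess, time

POOL = "tank"

subprocess.run(["zpool", "scrub", POOL], check=True)

while True:
    status = subprocess.run(["zpool", "status", POOL],
                            capture_output=True, text=True, check=True).stdout
    if "scrub in progress" not in status:
        break
    time.sleep(300)   # poll every 5 minutes (assumed)

print(status)
if "errors: No known data errors" not in status:
    raise SystemExit("pool reported errors after scrub")
```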
A
I
couldn't
I
couldn't
say
I
I
I
couldn't
say
that
that
the
s
is
status,
but
I
believe
well,
first
of
all,
I
we're
a
commercial
user
of
it,
but
I
believe
the
base
use
of
it
is
free.
I
believe,
and
the
client
side
I
believe,
is
open
source
and
as
a
commercial
user,
we
can
get
the
source
of
the
server
but
which
is
not
open
source.
I
believe,
but
the
base
usage
without
any
support
contractor
so
is
is
free.
As
far
as
I
understand,
yeah.
B
Thanks. So in the scenario you presented, you used hardware RAID, right?
A
Right
this
again
will
not.
We
will
not
use
this
rate
going
forward,
but
for
I
should
say
in
operations,
you
go
step
by
step
and
things
you
know
very
well.
You
know
rate
and.
C
Yeah, I mean, performance-wise it's not necessarily a clear-cut case, because it depends on what hardware is in that RAID. RAID-Z obviously has its own performance issues, and if you have a beefy piece of hardware RAID with some NVRAM that's doing caching, it could actually accelerate performance. It's just that you're spending a lot of money on that, right? So on cost-performance, yeah, you're definitely going to do better with just ZFS on raw disks, but not necessarily on ultimate performance.
A
Think
of
it
also
as
a
question
of
operation
things,
so
we've
got
many
many
disks,
so
we
know
we
have
procedures
when
a
disk
fails
and
with
cfs
all
this
has
to
be
established
first.
So
this
is
how
we
went
yep.
Okay,.