OpenZFS OpenZFS Developer Summit 2016, 10 Oct 2016

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: ZFS Validation & QA by Sydney Vanda & John Salinas

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

So hi everyone, I'm Sydney, banja and I'm also here with John Salinas, my coworker we're going to be talking about CSS validation and quality assurance.

A

So the purpose of behind this presentation is truly to focus on bringing Linux up to speed with regard to CFS validation, both ways, our team's testing efforts, as well as the communities so with this I'm, going to be discussing the CFS test suite and with that the real just the real disk support that our team has been working on and currently added to that, as well as that I'm going to address some standardization areas that I've noticed within the test suite that needs to be addressed.

A

Next I'm going to be talking about the performance regression test suite and with that I'm going to talk about how our team is currently using.

A

This performance digression test we to validate our current features as well as compressed dark features, and I'm going to give another overview on that for anybody, that's unfamiliar and next John Salinas is going to wrap up he's going to dive into the ZFS, has to meet a little bit more and he's going to talk about how we're using that to inject failures and different methods for that, and also how we're validating other features with that and finally, at the end, we would love to get a community response on not only our work, but also how we can all collaborate to make the CFS test suite and the performance regret.

A

The performance regressions has to be to excuse me tools that enhance tools that we can actually use for the better, at least within Lennox.

A

So earlier this year, the FS on Linux landed the DFS test, suite from Solaris that Dell fix was using for a Lumos development. So those for those unfamiliar with this DFS test suite is a series of command line interface, regression tests that are aimed at testing the production software, as opposed to other tools like Z test, which mainly just focus on testing the core algorithms. So there's still a considerable amount of work for integrating this route. The CFS test suite into ZFS on Linux. Currently there are I think about 650 tests.

A

Running was still a good percentage of tests still disabled within the test. Suite I think there were just under a thousand tests in the original dell fixed huh sweet I'm, not sure about the current state of that, but anyway, and a lot of these tests that are disabled. Some of them actually do work.

A

Some of them require some minor linux equivalents fixes and some of them need some more investigation into why they actually do not work within linux and john's going to be discussing some failures a little bit later more detail, so the tests are basically they're, generally small, not too complex, they're, all grouped together with similar similar tests. They test all the positive and negatives, commit or positive negative test cases for a single command, such as all the zpool create tests and within all of these test groups they have independent, clean up and set up scripts.

A

There are some default methods for clean up and set up. However, those are rarely used I'm going to discuss that a little bit later when I discuss the need for standardization within the test suite.

A

So here's just a little overview of the ZFS test suite so right now, there's 103 test groups and a little bit over a quarter of that of these enabled test groups require disk partitioning. So right now the test suite. You basically have three defined devices, and then those devices are partitioned as needed or if you need more devices for certain tests that you're running there's 650 running tests, like I, said previously, and 213 disabled tests.

A

The original port of the ZFS test suite into ZFS on Linux only supported loopback devices. So in order for this test, we to actually be useful for our team, as well as everyone else. Real disks needed to be used, so our team push to get real device support patch landed into master into ZFS on Linux master earlier this month.

A

So basically partitioning was previously done by setting a slice prefix variable to P by default, but in order for it to handle anything else, besides loopback devices, basically, functionality need to be added to determine the device directory and the naming of the device within Linux, and this is all defined within the disks variable that you basically set within the run script.

A

So, and another thing to note is that basically, there will be need to be a push to move to transistor, to excuse me to persistent na'mean, because right now it is non-persistent naming paths such as dev SDA, but this will need to be moved to actual persistent naming for tests such as you know what Don was speaking of earlier, such as the vault management test and so parted.

A

Partitioning is actually used right now for this patch, as opposed to other partition conventions such as K part x, and this might just for simplicity purposes, and this might cause some issues for distributions like you bond to where K part x is enabled by default and that actually might have to be disabled temporarily to actually have the tests we run properly.

A

Another thing to note is that multipath devices are currently held less capability within ZFS, so this was noticed by basically a zpool cannot be created on a multipath partition device, so this caused problems for tests that had size constraints requiring had size constraints that required an entire device to be used and not just a partition for some tests and also for the default cleanup method.

A

So the default clean method currently involves basically creating a new pool on an entire multipad or an entire partition device, and then once that pool is destroyed, so we're all the partitions of the device. So it's like a nice clean up method, but that's not going to work with multipath devices, so they needed to be handled differently and currently they're. Just all the partitions have to be deleted until something better is thought up for that. But this does it works fine for SD devices and and of course, loopback.

A

So, while implementing a change like the real device partitioning that basically incorporated the entire test suite some things were noticed about the some areas of standardization in the suite that can actually be that need to be addressed and brought up to the community. So the youth of are actually lack thereof rather of configuration files within the within different test groups. So there are some tests that of configuration files.

A

However, a great majority of them do not, and the configuration file basically have some important variables that are basically used throughout and there's a lot of repetition, and this is also where the slice prefix variable for partitions and also the device directory gets that and this functionality can actually be abstracted even higher. But right now there wasn't really a good area to actually abstract. This wasn't seen within how the current framework of the test, actually the the test suite actually is.

A

So it's just kind of a awkward area and right now, there's just a lot of repetition and basically it all has to be in within all of the set up scripts, some earlier early failures and asserts for commands not present. So, basically, within the setup script there should be some sort of asserts for commands not added to the system, because right now, basically, you could have the entire test. Suite run.

A

So two hours later, when you have ninety percent failures, you basically find out that you didn't have the parted command or something dumb like that and default clean up and set up scripts. So, like I was mentioned earlier. Basically very few tests actually use the default, the default setup and cleanup methods.

A

This should actually there should be some sort of standard. For that I mean when you actually make a change within the entire, the entire suite. It makes it a little bit way more difficult, actually because you're dealing with different tests that have a different clean up for everything versus when you could just have something when they all call the same function.

A

So there should be some more defaults for that, maybe with some different, whether you include partitions or you don't include partitions or some other constraints, so that can be some more on that can be easily defined there.

A

So the performance regression test suite this was pulled into open, ZFS from Dell fix as a part of the compressed arc feature, and then these tests were actually separated out of this patch and poured it into VFS on Linux by my colleague, Tom Brady a couple months ago. So a little overview of the performance regression test suite it's actually a directory of performance tests.

A

That's used by running the exact same test script as the gfs test script, but all you have to do is is include the dash our option and the performance of Russian run file and all of the tests within the performance directory are currently enabled.

A

However, this patch was not complete and some future work that needs to be first DFS on Linux is that we need some sort of tools to ease easily visualize the difference between different output of runs. So you can see exactly the performance impact that you have as well as there's additional regression tests from Delta X, as well as functionality that is yet to be poured it into ZFS on Linux.

A

So here's an example of how my team is currently using the profession. The sorry excuse me, the performance regression test suite to validate features such as compressed art performance.

A

So the results displayed here are from the log file summary from the end of each run, and it shows the bandwidth on the I ops. So here's an example of synchronous read and write bandwidth, so the baseline runs with st devices and it's compared to the performance results of the compressed dark patch, as well as the arc refactoring and the compressed center see patches. So something to note is that a BD, the arc buffer data is not included in this, but compressed ark itself was a major rewrite.

A

So tests like this can help us easily show that there's no significant performance regression and there's even some small improvement in some areas and something else to note as well as that. These tests currently stop at one mag I, oh, and we're not going to really see the benefit of compressed dark unless this is pushed way beyond that from 2 to 16 bag.

A

So again, here are some initial results for random later right, so these graphs generated these. These graphs again are I'm. Sorry, oh yeah.

A

These graphs are generated separately to quickly see the performance impact and some of these tests might not actually even be appropriate for compression. Now this. This is just more so for to bring up the point that there needs to be some sort of script to consume all of this performance data and display the comparison results easily to actually make this performance regression test. Suite, useful and I think that there might be some sort of capability within that in Belle fix or something else so getting that within ZFS on Linux is going to be key.

A

So again the results showed previously where from blog files, which give the I ops, the read/write I, oh and the bandwidth averages, but there's also a significantly many more statistics generated from iostat vmstat and zpool iostat, as well as MP stat.

A

So there's just here's a couple: few snippets of actually a sequential write, 128 k, io 128 thread tests, so some future improvement. That at least our team is going to need to address is that more pools are in a different pools are going to need to be added, including the deer. A pool into these tests, larger iOS, are going to need to be used extending beyond one Meg, as well as a scene in the iostat I.

A

Don't know if you can see that the results, but the request sizes are actually pretty small and our team is going to need to push this to at least two Meg to be suitable for hpc workloads and so basically to wrap up the first little part of this presentation is that um ZFS on Linux landed the real device support patch. So now the ZFS test suite, as well as the performance regression test suite, can be used with real disks, which is a good thing. Also.

A

The some process of standardization across testing groups within the VFS test suite should be brought to light at some point and addressed, and also the performance regression test suite, which was recently ported by Dawn into ZFS on Linux is rentable on Linux. More work is needed, however, and adding kadon capability for larger iOS and the some sort of ability to see easily see the performance impact and I'm going to hand off to dawn for the rest of.

B

Okay, thank you for having us here to talk about testing. I know it's not the most exciting topic, but as all of you, wonderful people are making new features. We have the opportunity to introduce new instability, so I'm going to talk a little bit about how we've been using ZTS to hopefully and void some of the problems that could come about. We have basically three stages that we're using right now. So, as our developers check in code, it comes in to Garrett.

B

Hopefully, so we can do code, reviews Jenkins goes ahead and builds it and then produces some artifacts. So we have a process that watches over Jenkins artifacts called auto test, and then the second phase here is our CI are continuous integration testing. So in this phase we basically have basic acceptance tests that runs. That's things like the Z config. That I think we still have some of the old Z fault, there's really simple stuff to make sure hey. Can we make a pool? You know?

B

Is the really simple stuff there z loop ones for around 20 minutes, and then we do have a as ETS quick that we do so all the basic tests that are flaky or we don't have in this. This is just a core of tests. It runs in about 30 minutes and we also have some XFS test.

B

So when we run this when a developer checks in code, we want them to know that basically, the basic stuff works, who want to give them feedback very quickly, and we want a hundred percent of this to pass or we or the the testing tools won't give them the ability to merge their code. There's a second phase we have to so we do a weekly build on Thursday night. We drop a build and this is a little more extensive right. So this runs from Thursday night all the way to Monday morning.

B

So we make use of the lustre test. So Lester has a test framework called after that it runs. We run all of the ZFS review one and two on there. So luster is a great way to consume and to use the fs. So we use with other lustre test, so ZTS is really good about testing commands. You know: does this command option work, you know. Does this other command option work, lustre, isn't so great about testing the all the various command options, but it does have a fairly interesting cases in there for well.

B

Will this type of Iowa or if we do this weird use case that some customers brought to us well, will that work? So this gives us a little more coverage. We run Z loop continuously and we run all of these ETS elites, including the flaky tests, and we ran them over and over so I. Think in the course of this each test runs about 20. Sometimes we have fsx that run. Zfs stressed Trinity, that does some fuzz testing of system calls and we just kind of let that stuff brew.

B

In the background, we're working on getting the performance testing stuff running on a real jaybob, some of this stuff runs and vm. Some of this stuff runs on real hardware, so this gives us the way to put a bunch of patches together and say: okay, so this weeks of work, does it actually hold together?

B

Is it stable or not, because it's hard to do that on one check in right, because we don't want the developer after wait three days to know if their stuff is stable, so we have to give them results quickly, so we can use ETS in a quick or in a long context. So what do we do when ZTS fails right, so that never happens so I at least ZFS on Linux? We only have right now, one case it seems to fail all the time.

B

So if you run ZTS on Linux raid 0 to positive, pretty much always fails and it will always continue to fail until we do something about it. So we like solid failures, though, because they're easy to track down. So the problem is when people submit code, it's great, that they submit tests, but when the tests don't pass, that's that is a concern.

B

So in this case, when you go ahead and actually look at why it's failing the code has two options: a dash cap s and a dash t, the dash t is supposed to be seconds.

B

It's set to 300 by default on our test systems, which are fairly modern, has wall system 72, cores, 64, gigs of ram the the 300 seconds takes about 23 to 25 minutes so theres, some timing issues there and actually, if you go, read the comments and the test case, it says, oh by the way this may run a lot longer than expected. So seconds is something that you probably shouldn't take too literally in this test. So in this case it's not actually a test. Failure right!

B

It's just that the the seconds the unit of time measurement isn't accurate right, so we just set it to dash T for T and, while I'll raise you to positive passes all the time now. So this is just a case of a solid failure we find out. Why is it failing? Go ahead and fix it? The harder ones are the intermittent failures right, so I don't have any data I've. Never looked at the other releases button CFS on Linux, it's pretty common that we get a fair amount of intermittent failures.

B

The upgrade tests in particular compress on compress and receive, is another one. These fail fairly regularly right, so you'll submit code. The problem is we don't track these in any way. So if you submit code to ZFS on linux and the builders, tell you hey something, fails it's difficult to say: oh, this failed because of a known intermittent failure. So internally we're tracking all these failures. Now, so we can go back and say: oh okay, dawn broke the code with you know, a new bug and this tester.

B

Oh, no, that's a bug, we've always known about. So it's not a big deal, so we're trying to figure out. Then why do these tests failure and basically we have to enable the debug information. You know. Sometimes it's just a dash X go see why it's failing I'll, get into the the failures and correcting those and a little bit more. But then we decide. Is that worth keeping or not? So, basically, for our quick tests, we don't want any flaky tests because we want all those to pass.

B

So we have to fix those tests before we put him back in back in our quick test. Ok I'm, going to use an example of online offline. So when I started and I was going to get a test going for d raid, I looked at online and offline said: that's something that I want. I want to be able to take a deer. A pool want to have AI o going to it. I want to fail a drive want to bring it back online. I want to fail a different drive so online and offline did.

B

You know, probably eighty percent of what I wanted to do out of the box. The only problem is online offline fails every time you run it on linux, so the current code starts a file truncate for I 0 and then it offline some auto lines. One of the one of the mirrors. The problem is that the truncate option is specific to something that is not Lennox right.

B

So it's never going to work on lenox and even if it did work on Linux, which was the first thing I try to just fix the options is truncate returns really quickly on Linux, so on whatever system this was originally written for apparently truncate was very slow right, so they use truncate to start I, oh and do all these pull operations and then it would go and kill the truncate process. Well, the test was failing on lenox because when it goes to do the kill, it's a bare naked kill.

B

The kill fails because the truncate process has finished long before they tried to kill it. So it's just not a good deal. So here's a really happy example of how we might fix this right so on on the one side this side over here we have the original version where we just do the file. Truncate no dash F option exists on Linux, so that's never going to work, but the kill really bothers me and this right. So this is this is not a good idea.

B

One we're not checking to see if a process is there, but I mean what if we kill it and nothing happens or the kill hangs it's just. It's not not super good, so I've, just as a temporary measure, I've wrapped the kill in something else so that at least it's not going to make the test fail and instead of using file, truncate I've switched to DD as I thought about this more later.

B

Probably this isn't a really good way either, because if we're just doing random I/o for me as a tester I want to know what I've written out is actually valid. So this is really hard to validate too. So that's another another concern we have with this one.

B

So these are temporary solutions. I'm going to talk a little bit more and two slides about why I think they're temporary, but here's an example using Sydney's real dispatch that we've got SD I can't actually read that, but we've got three disc devices up there. You can see. One of them has been offline in the first example, and then the test go ahead and brings it online, and now you can see that we have the online offline. 0123 are now working emilynics, so you've asked ok, that's fine! Why haven't you submitted this code up line?

B

Well, there's another issue that I want to solve and then we'll get to that. Why haven't we up on any of this stuff, yet, as I've been going through these tests? The other thing that's concerning to me is validation right. So it's one thing for me to issue a command. You know zpool, create. Let's see something happen, it's another if I don't actually ever validate what that command did, which is generally the case in the ZFS test suite so the case of ZFS create 03.

B

There is a test to say: okay, yes, the command executed and it didn't return an error, but that doesn't give me assurance as a tester that I trust what that command has done right. So there's no verification in this case it sets a bunch of different block sizes and I want to actually know all those block sizes.

B

Actually, what they are set so on Linux I, don't know if this is portable, it probably isn't, but on Linux in this example, we've gone ahead and we have a second block of code that exists, and basically, we iterate through some some different block sizes, and then we use LS block on that d. Pull that Z I, think it's a Z fault and we get the the block size, that's been created, and we can say yes, okay, so we've done something to validate.

B

We don't want to just run a command and assume because the command didn't you know, horribly fail or the system panic that it worked. We need some way to validate that this actually did. What we wanted it to do test must not assume, because the command worked, that it worked as expected, our goal is not just can we run a command? Our goal is xia. Is the fs doing something saying? Is it doing what is expected?

B

We want to move beyond just as the command run to are we properly validating this in the right way, so future considerations first thing right now, at least on linux. I can only run on three disks very complicated to make any sort of real world set up on three discs right. This is a problem that we have to be able to move beyond.

B

To go back to the question of. Why haven't you up lined these fixes right? This is the kern ondrum that I faced as I thought about this. As I look in the test codes, there's more and more of these huge blocks of, if Lennox, do this bizarre thing, if not linux, and do this other bizarre thing right so I want test codes to be test codes. I, don't want to be worrying about what platform of my own.

B

What am I supposed to be doing, and the test code I want to be worried about testing something that's inside the test. Routines I think we need to move away from these bizarre blocks of if-else platform to different routine names, so, instead of in the first example we showed instead of doing you know if linux DD, if not file truncate, I should just call start I/o load right. That should do something and it inside of that routine. That's abstracted! Anna love above the test case.

B

Everyone can use it and we can make it sane for whatever platforms right, because I don't know what's best for each platform, but you guys can help me with that right. Instead of you know a bear kill, we should have a process of routine that goes out there and in says: okay is the process there. If I try to kill it, is it actually going to die because we have this?

B

Hoffman was ETS right, zpool is hung its unkillable every test case and ZTS will go and try to remove that pull and everyone will fail right. It's not it's not saying we can't do that. So the test case doesn't need to worry about what platform it's on. The test case needs to worry about. Is the fs working right?

B

So that's what we want to move towards, so basically what we're suggesting, instead of my ugly fixes for routines in the hackathon, what we'd like to do is put those into some routines, abstract them in the level above and if you guys think that's a good idea, then we can up line up line that code upstream. That code test. Runner is another one for me, it's very difficult as I'm running it, because I runs ETS a lot to know where I am right, so it's got a million tests.

B

Okay, six hundred and some for us, but I want it to say: I am on test. You know five of five hundred so I know. If I can you know, go get a cup of coffee or if I, you know, sit here and wait for it. The stopping on air is a big one right. So all the time, especially if we've hit an assert or something really nasty, has happened. I know that every test case after that's going to fail right, but everyone is going to try. So we need some way in the test.

B

Runner code to say: hey, I, give up it's not worth trying to run past this point, and if we can dump out some of this stuff and say hey, we need all the tests that have failed. Just go look over here because you get this big long. Output right, you've got to search through, or you know, have a bunch of grant post-processing, let's provide some output, so people can tell what's actually failed and what hasn't so air injection. This is something that someone brought up here earlier.

B

Zts seems like the perfect place for air injection right, so I'm doing an offline online test. That test is doing it in the safest best case way possible, but we want ZFS to work in a worst-case world, not just in the best case world. So am I online fine. I need errors right, I can't just online and offline stuff. That I know is going to work. So Z inject is a great way to do that. There's this grating tool here. Here's an example of injecting a read error in a pool.

B

I can start some reads with DD or something simple check the status again so when I'm offline and online I need air is in there, DM set up for linux, sketchy debug for linux, that's another great way where I can create air regions in a disk where I can say. Ok, if I run over this, I know I'm going to get any I/o errors are 10 check some errors right. It gives us a programmatic way to create errors. I can remove devices all that sort of stuff. So our summary: what do we want to do?

B

We want to improve CTS by correcting issues specifically and the ZFS on linux. We want to correct issues so that we have more tests running more coverage. We want to know that when we're checking in new features were not breaking old ones, we want to improve ZTS by adding verification step. So we don't just want to know, did the command run? We want to verify yes, there's some way that we know that it actually did something useful.

B

We want to improve ZTS by abstracting routines actions out of tests like start at yo kill process, especially things that are platform dependent. We want those to be outside of the test and we want to improve ZTS by adding air injection right. We want you guys to be looking for worst-case scenarios. When you write tests, not best-case scenarios and when you guys check in code right there should be test. It would be weird to create this great new feature and not have any way to test it right.

B

That's that's wrong, of course, I'm a test engineer so I'd say that, but we want. We want ways to validate. Yes, your great new feature works and here's how you can demonstrate it works right and then, as I look at that I can say: oh, but what, if I tweak this? It's not going to work anymore right. So then there's a better back and forth, so we can find more bugs before customers do so.

B

I think that's it so we'll work on this on the hackathon, but if you have any feedback for us or questions for us now, I guess is the time, but we would especially like to know if you're on board with their idea of taking the if-else. You know linux, stuff out of tasks and abstracting that into routines. So we'd love to hear from you on that one go ahead.

C

B

B

B

Okay, yeah, that would be great for those people online. Mr. Kennedy was just saying that he agreed with abstracting routines and he's actually addressed some of these things, and is it delphic sore in Delphic? So we'll look at some of those things tomorrow and go from there any other questions or comments.

B

What's up? Z loop is just a wrapper that folks put around Z test because Z test doesn't run very long before hitting some sort of error. So I think the blondest I've been able to run Z test on. Linux is probably about four days. Maybe five and I filed a number of bugs and no one has ever looked at them, so they've just kind of set there and languished and sadness and sorrow, but to get around that Z loop is a script that some wonderful person wrote.

B

So basically, instead of running one long z-test, you run a short Z test with different options, so you're getting a little more coverage in that you're getting different options. You know all the options you can give this so you're getting a little more coverage. That way. The unfortunate part is you're.

B

Getting less coverage and the length of the run, so we may try to come circle back around to the z-test stability issues on Linux at some point, but basically it fails right and one thread of you of a thousand is hung up somewhere and tracking that down. It is quite tricky so.

C

B

B

Yeah, that's a really good idea, adding a CSE to Z test and having to simplify Z test. So for some of these things we could at least figure out. Why do you do a comment.

B

Yeah, that's just for file system calls so I trinity's, just an external one. So, oh yes, the question was about fuzzing and how we do that and I should have been more explicit. The fuzziness for file system calls we haven't done any fuzz testing just on you know general ZFS options or I/o testing, so yeah Trinity is the name of that tool that does that. For linux there was one question I had for hopefully that someone.

B

That knows, I noticed in the test suite right, so there's options for functional tests, which is where we're kind of living. You know, Don and Sydney have started adding the stress tests. We hope to add more profiles to file for that, but there's also for the performance test, but there's also a stress directory which is empty on Linux are their stress tests that exist and future work? Ok, so maybe Linux can can help with some of that. So lustre has some. You know some things that I thought.

B

Maybe we could pull over for a short, easy solution, but yeah, okay, all right! Thank you. Much.

B