From YouTube: CDS Jewel -- Non-Functional Tests
B
Sure, hi everybody. I guess I'll first quickly introduce myself: I'm a PhD student at UC Santa Cruz and a summer intern at Red Hat, working on the project that I'm going to describe, and I would really appreciate any feedback that you have. I'm going to use a slide deck that I prepared for this. I think I can share.
B
Benchmarking is kind of a combination of the two. In integration testing, you go through all the setup of deploying and configuring, and then test that everything works; in benchmarking, you deploy, configure, and then benchmark. Non-functional testing, the way it's defined by Wikipedia, is a test of a requirement that specifies criteria that can be used to judge the qualities of a system, rather than specific behaviors.
B
Another way to describe it is as a statement of the form "the system shall be", followed by a quality: for example, the system shall be scalable, the system shall be performant, and so on. This type of testing is useful in the sense that it tests a property of the system as a whole, but the main problem is that it's hard to quantify. How do you quantify precisely, in order to determine that your system scales or that your system performs in a certain way?
B
So that's the main problem we are addressing as part of our research at UC Santa Cruz, and we're trying to apply some of the ideas that we've generated. The main goal, or one possible approach we want to use to address this, is what I mentioned before: basically, you merge integration testing with benchmarking by gathering performance metrics, and then, on the output of those performance metrics, you make assertions and validate those assertions on top of that data.
B
You're basically defining tests over the benchmark output data, and those assertions become a test; you can think of them as a regression test, so it will fail or pass depending on what the output of the benchmark looks like. There are some challenges. The first one is that the hardware is non-deterministic, of course, and we want to address that; the second one is that we need a way to specify these tests.
B
That is, regardless of what the underlying machine is. It's not perfect, but it can work for many cases, and that's one of the things we want to try: to see how good cgroups are as a way of bounding the non-determinism. The second challenge is that we need a way to specify tests, and in our project we have a validation language that we came up with. Basically, you start from an output file with performance metrics.
B
You can define these types of assertions over it. For example, if you're measuring scalability and you have the cluster size, the performance of the raw devices, the performance of Ceph, and whether or not the network is saturated, you can specify this type of validation statement. We then have a validation engine that runs it; basically, it's a yes-or-no answer, a boolean function of whether or not the output data complies with that validation statement.
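A minimal sketch of what such a validation engine could look like, in Python (the metric names, sample data, and the 90% threshold are all made up for illustration; the project's actual validation language may differ):

```python
# Minimal sketch of a validation engine: evaluate a boolean assertion
# over a table of benchmark output metrics. Metric names are invented
# for illustration only.

def validate(rows, assertion):
    """Return True iff `assertion` holds for every output row."""
    return all(assertion(row) for row in rows)

# One row per benchmark run: cluster size, raw device throughput,
# observed Ceph throughput, and whether the network saturated.
rows = [
    {"size": 2, "raw_mbps": 210.0, "ceph_mbps": 200.0, "net_saturated": False},
    {"size": 4, "raw_mbps": 420.0, "ceph_mbps": 401.0, "net_saturated": False},
]

# "Unless the network is saturated, Ceph throughput shall be within
# 90% of the raw device throughput."
ok = validate(rows, lambda r: r["net_saturated"] or r["ceph_mbps"] >= 0.9 * r["raw_mbps"])
print(ok)  # True for the sample data above
```

The engine just answers yes or no: the test passes only if every row of benchmark output satisfies the assertion.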
B
So, using these two things, Docker on the one hand and this validation language on the other, we want to bring non-functional testing to Ceph. In particular, the steps we need to go through are: first, deploy Ceph on Docker so we can configure cgroups dynamically; then run benchmarks; and then validate these assertions over the output of the benchmarks.
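Those steps could be sketched roughly as follows (the function names here are placeholders for the real deployment, cgroup, and benchmark tasks, not actual teuthology or CBT APIs):

```python
# Rough sketch of the deploy -> constrain -> benchmark -> validate
# pipeline described above. All steps are toy stand-ins.

def run_pipeline(deploy, configure_cgroups, benchmark, assertions):
    deploy()                      # e.g. bring up Ceph containers
    configure_cgroups()           # e.g. set per-container resource limits
    results = benchmark()         # e.g. run rados bench, collect metrics
    # Every assertion must hold over the benchmark output for the
    # non-functional test to pass.
    return all(check(results) for check in assertions)

passed = run_pipeline(
    deploy=lambda: None,
    configure_cgroups=lambda: None,
    benchmark=lambda: {"throughput_mbps": 95.0},
    assertions=[lambda r: r["throughput_mbps"] > 50.0],
)
print(passed)  # True
```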
B
We're going to focus on RADOS initially, so we can wrap it up in a three-month summer project. As for the particular set of tasks, the list of tasks... oh yeah, sorry. I first looked at how to do this using either teuthology or the Ceph Benchmarking Toolkit, CBT. There are some pros and cons to using each, but I ended up deciding on teuthology, just because there are more people working on it, more eyes looking at it.
B
So our plan is to add a Docker task. It basically leverages an orchestration framework called Maestro, a Python framework that orchestrates the deployment of multi-host Docker systems. Initially we would just pull from the Docker registry, without having to build the images. This Docker task would deploy Ceph and configure what resources each container has available to it. Then there's the rados bench task that's already in the ceph-qa-suite, which, as far as I know, we don't need to rewrite.
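For illustration, Docker already exposes cgroup-backed flags for constraining a container's resources, so a deployment task could build its `docker run` invocations along these lines (the image name and the limits below are hypothetical):

```python
# Sketch of how a Docker task might constrain each container's
# resources via Docker's cgroup-backed flags. This only builds the
# argument list; actually running it would require a Docker host.

def docker_run_args(name, image, cpus=None, memory=None, blkio_weight=None):
    """Build a `docker run` argument list for a resource-limited daemon."""
    args = ["docker", "run", "-d", "--name", name]
    if cpus is not None:
        args += ["--cpuset-cpus", cpus]                 # pin to specific cores
    if memory is not None:
        args += ["--memory", memory]                    # cap RAM
    if blkio_weight is not None:
        args += ["--blkio-weight", str(blkio_weight)]   # relative block-I/O share
    return args + [image]

print(docker_run_args("osd0", "ceph/daemon", cpus="0-1", memory="2g", blkio_weight=500))
```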
B
We'd then also add an Aver task for the Aver framework: basically a wrapper around Aver that points it at the output of rados bench along with a validation statement, and determines whether or not the validations hold. Then the last thing is to write validation statements for all these properties that we would like to observe. That's pretty much all the slides I have. Do you have any questions or comments?
C
B
Yes. So you have a particular version or release, and then, on a set of hardware setups, different clusters, you ascertain that these properties hold on each of those. You run a scalability test and it holds on multiple clusters. So it's like a new layer of testing: you have unit testing, you have integration testing, and now you have this type of testing, which is, I would say, more high-level.
D
B
C
So then one follow-on question would be: what validations have we identified as things you want to do as part of this project? Because I imagine it would be difficult to cover everything with a lot of guarantees. I'm guessing there are some measurements, yeah.
B
D
B
It depends, because you need a benchmark, right? We already have a benchmark, I mean rados bench: if you just repeat the same benchmark on multiple of these configurations, you have the scalability test. For the performance test, rados bench can also be used. But for availability we don't have a corresponding benchmark.
C
B
So, for example, rados bench shows you the throughput over time, and you want to observe that the throughput does not go out of a specific range. You say: OK, the throughput of the system should be within ninety-five percent of the raw performance, for example. The raw performance might be determined by the number of devices that you have available times the capability of each, or something like that, or you can actually run...
B
...maybe a dd task, a distributed dd, that obtains the raw performance. So your validation would be independent of the size; actually, this statement is specifying that. Regardless of the size, I would expect the throughput of Ceph to be within ninety percent, give or take, of the raw performance. Go ahead... I'm sorry, sorry.
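As a sketch of that check (the numbers are illustrative; the raw figure would really come from a measurement such as the distributed dd run mentioned above):

```python
# Illustrative "within N% of raw" validation. Raw performance is
# estimated as device count times per-device capability; in practice
# it could instead be measured directly.

def raw_performance(num_devices, mbps_per_device):
    return num_devices * mbps_per_device

def within_fraction(observed, raw, fraction):
    """True iff observed throughput is at least `fraction` of raw."""
    return observed >= fraction * raw

raw = raw_performance(num_devices=8, mbps_per_device=110.0)    # 880 MB/s
print(within_fraction(observed=810.0, raw=raw, fraction=0.9))  # True: 810 >= 792
print(within_fraction(observed=700.0, raw=raw, fraction=0.9))  # False: 700 < 792
```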
E
B
E
That makes sense. So one way you could test things like availability is: if you have the time-series latency and throughput information from rados bench, you could correlate that with events like taking an OSD down. By the way, teuthology already has machinery for doing all of that, in ceph_manager I think, and it's extensive. Most of our testing involves running a random thrasher in the background that kills OSDs, that sort of thing, so there are utility methods for doing that. That's...
E
That sort of thing. So you could write a task that manually performs a very specific manipulation on the cluster and logs the time at which it happened, and then later you would be able to look at the log before and after that event and verify that the constraints held. Is that sort of where you're going? Yes.
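A toy version of that correlation, assuming you have (timestamp, throughput) samples plus the logged time of the fault (all names and numbers here are illustrative):

```python
# Correlate a throughput time series with a fault event (e.g. an OSD
# killed at a logged timestamp) and check that throughput recovered
# within a window after it.

def recovered_after(samples, event_t, window, min_mbps):
    """True iff some sample within `window` seconds after `event_t`
    is back above `min_mbps`."""
    after = [v for (t, v) in samples if event_t < t <= event_t + window]
    return any(v >= min_mbps for v in after)

# (timestamp, throughput) pairs; the dip at t=12 is the fault.
samples = [(10, 100.0), (11, 98.0), (12, 5.0), (13, 40.0), (14, 95.0)]
print(recovered_after(samples, event_t=12, window=5, min_mbps=90.0))  # True (t=14)
print(recovered_after(samples, event_t=12, window=1, min_mbps=90.0))  # False
```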
E
I've been thinking about this for a while, actually. As you start doing this, you can ask around [inaudible].
D
F
So one thing that may be useful regarding regression testing, if you're going down the performance regression testing road: Ben England wrote a simple script for going back and basically just looking at performance regressions between different sets of imported data. I think it just uses JSON as input, but it may be something you'd be interested in for post-processing your run data with.
F
If you look in the chat window, there's basically just a simple Python script that Ben England wrote. This is what they use for Gluster, actually, for doing their regression testing. It's nothing elaborate, basically just a basic script, but it's what we'll probably be using for CBT for doing regression analysis. I don't know if it's useful to you or not, but it might be something that you at least want to look at. Awesome.
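The core of such a regression check, comparing two JSON result sets, might look something like this (this is my own sketch, not the script mentioned above, and the metric names are made up):

```python
import json

# Minimal regression check between two benchmark runs stored as JSON:
# flag any metric that dropped more than `tolerance` versus baseline.

def regressions(baseline_json, current_json, tolerance=0.1):
    """Return the metrics that regressed beyond `tolerance`."""
    base = json.loads(baseline_json)
    cur = json.loads(current_json)
    return [m for m in base
            if m in cur and cur[m] < (1.0 - tolerance) * base[m]]

base = '{"write_mbps": 100.0, "read_mbps": 200.0}'
cur = '{"write_mbps": 85.0, "read_mbps": 198.0}'
print(regressions(base, cur))  # ['write_mbps']: 85 < 90% of 100
```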
A
E
B
E
You'd have to give it a threshold, or a 99th percentile. Exactly. OK, so I strongly recommend that whatever procedure is used to build those thresholds should also be automated. Basically, we would like to be able to point it at a new set of hardware. I know cgroups is supposed to remove the hardware dependency, but it won't really; it'll just...
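One way to automate threshold generation, sketched under the assumption that you can afford a few calibration runs on the new hardware (the slack margin here is an arbitrary choice):

```python
# Derive a pass/fail threshold automatically from baseline runs on
# new hardware, instead of hand-picking a number per machine.

def make_threshold(baseline_runs, slack=0.15):
    """Lower bound: the worst observed baseline run, minus a slack
    margin to absorb run-to-run noise."""
    return min(baseline_runs) * (1.0 - slack)

runs = [480.0, 455.0, 470.0]        # MB/s from three calibration runs
threshold = make_threshold(runs)    # 455 * 0.85
print(threshold)
print(430.0 >= threshold)           # a later run at 430 MB/s passes
```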
D
E
B
E
F
D
B
I mean, the only thing... well, I assume that there's a SQL driver. This isn't Go; well, it is implemented in Go, so what I'm assuming is that there's a driver for CSV that speaks SQL, which I believe exists, something like that. So yeah, whatever you can plug SQL into will be supported.
E
B
So once you have this information, about cgroups and what your host looks like, you have contextual information for a particular run, and then the output, and you know that that data is valid. Then, when you move to a different setup and something breaks, you would like to find the root cause.
E
So one thing that might be valuable: each teuthology job generates a summary YAML. For this performance testing, you probably also want to dump all the information you possibly can about the hardware and the cgroup configuration, is that what you're getting at, so that later on, when you see the failure, you can get as much information as possible. Yes, you can, yeah.
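A minimal sketch of capturing that run context next to the results (the cgroup limits here are passed in as illustrative inputs rather than read from the system):

```python
import json
import platform

# Capture run context (host details and cgroup settings) alongside
# the benchmark output so failures can be diagnosed later.

def run_context(cgroup_limits):
    return {
        "machine": platform.machine(),   # e.g. x86_64
        "system": platform.system(),     # e.g. Linux
        "cgroups": cgroup_limits,        # limits used for this run
    }

ctx = run_context({"memory_bytes": 2 * 1024**3, "cpuset": "0-1"})
print(json.dumps(ctx, indent=2))
```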
B
E
I kind of just assumed... unless cgroups are simply much better at constraining I/O throughput than I think they are, there is no way we're going to be able to come up with thresholds that are aggressive enough to actually catch regressions while being conservative enough not to trip on different hardware. I'm not sure that's a worthwhile design goal. I think the design goal should be to make sure it's transparent and simple to generate new thresholds for new hardware.
E
Just so you know, in addition to rados bench, there's a tool called smalliobench, I think in ceph-tools, that actually outputs one JSON line per I/O, so you can get exact latency information on every single I/O. I wrote it because rados bench is not very good for that sort of thing. You may want to look into it; it may be less tedious to work with than rados bench.
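Post-processing that one-JSON-line-per-I/O output into latency percentiles could look like this (the field name "latency_ms" is assumed for illustration; the tool's actual schema may differ):

```python
import json

# Parse one-JSON-object-per-line I/O records and compute a
# nearest-rank latency percentile over them.

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, int(round(p / 100.0 * len(ordered))) - 1))
    return ordered[k]

lines = [
    '{"op": "write", "latency_ms": 1.2}',
    '{"op": "write", "latency_ms": 1.9}',
    '{"op": "write", "latency_ms": 40.0}',
    '{"op": "write", "latency_ms": 1.5}',
]
lats = [json.loads(line)["latency_ms"] for line in lines]
print(percentile(lats, 50))  # 1.5
print(percentile(lats, 99))  # 40.0
```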
E
Yeah, well, I mean, rados bench also has other properties that are not attractive: it only writes out full objects and it moves on to new objects, it doesn't ever overwrite objects, so it's not a good proxy for RBD, for example. smalliobench writes out a large pool of objects, of the number and size you specify, and then performs a pattern of configurable-size writes and reads against them. You may find that to be a more flexible tool.
E
I think it's in... well, it's in the Ceph project. I think it's in the ceph-tools package, which is already installed by default in teuthology; it's part of the Ceph bundle of stuff. I don't think there's a rados task for it. Sorry, I mean I don't think there's a ceph-qa-suite task for it, but that's easy to write; it's just a wrapper. I mean, all the rados bench one does is invoke rados bench; there's not much to it. You'll...