Ceph Ceph Tech Talks, 27 Aug 2015

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2015-AUG-27 -- Ceph Tech Talks: Ceph Performance

Description

A look at performance profiling and tuning in Ceph with some recent findings and examples. http://ceph.com/ceph-tech-talks/

A

Alright welcome everybody back to the monthly Seth Tech Talks event here on our blue jeans, video conferencing system for those of you that aren't familiar with the SEF Tech Talks. These are basically a one-hour deep dive on a technical level of something at least tangentially related to SEF. Thankfully, we've had some pretty pretty core discussions. Thus far, we've had you know, rato some, the block device in the gateway, and we looked at the calamari Romana during that split. We've talked about some placement groups and we had a nice examination of CFS in July this month.

A

It's going to be Mark, Nelson who's, our lead performance engineer, who will be looking at the performance tuning world and SEF, and using some recent findings to help share kind of how we arrived at some of our recent decisions in the future. If other people would like to give SF tech talk, anyone from the community is welcome to do so as long as it is relatively technical and set related, we're looking at potentially in September. So next month doing a discussion of someone who's been working on SEF integrations.

A

So looking at kind of the start to finish on how to do a SEF integration with some piece of custom software, but if anyone has any other ideas feel free to contact me. Otherwise, afterwards, you can find the the videos here for replays or linking elsewhere via the SEF dash dash talks page on set calm.

A

So without any other preamble I think we can get started here. Mark you want to want to lay it on us sure.

B

Sure so I will start out by saying that Patrick reminded me yesterday that I was scheduled to give this talk and I scrambled to put slides together. So I, don't know that will will take up the full hour, but will will go for as long as we can, and maybe we'll have time at the end for folks to ask questions if they, if they would like to so.

B

Basically, what I want to talk about today is just have a general overview of how we do a lot of testing at at Red, Hat and and kind of give an example of some of the recent things that we've done so.

B

At Red Hat- and this is mostly kind of carryover from from when we were ink tank. We we used to kind of major pieces of software for doing a lot of the performance testing work. Those are teeth, ology and cbt. There's actually also various software pieces that are consultants have written that they use as well. Although I don't actually know too much about them.

B

So I'm not going to talk about those today, but in terms of what we've been doing kind of an engineering tooth ology is our kind of nightly functional test framework that is used to run almost all of our our nightly regression tests and also some performance tests as well.

B

We actually recently went in and and kind of analyzed what technology was spending most of his time, every night doing and a lot of it as things like running rados tests, with thrashing happening in the background, so marking, osts down and increasing PG counts for different pools and more or less just really trying to stress the cluster while doing various ratos commands or other things. At the same time.

B

So, let's cap technology, DBT was written a couple of years ago, as kind of a lighter-weight performance benchmarking oriented tool than tooth all g technology is really kind of big and does a lot of things like I, more or less deploying nodes and and setting up software and doing a lot of that. That kind of infrastructure related thing cbt doesn't do any of that.

B

Cbt basically assumes that you already have a number of nodes set up that have stuff on them already and then it will build a set cluster optionally for you and run tests on those notes.

B

It doesn't do any kind of like software installation, but it will at least automate the the ceph portion of the set up and running the tests we use CBT quite a bit in the engineering group. It's also used by the reference architecture team inside Red Hat for like testing on partner lab equipment and then also by our QE group for doing like nightly performance regression, tests against red hats, f, storage, there's a lot of different tests that we use CBT for it's everything from looking at cash turn performance under different scenarios.

B

To actually benchmarking some of the functional tests that are run with ecology under different scenarios, so we use it for quite a few different things is an open source tool, its uses, yeah mph or the configuration files, and it can also run a lot of monitoring in the background, so collect L is a tool. That's used quite a bit for looking at, like no statistics and cbt will run that for every single test.

B

It also can use Val grind to profile different demons and also run things like blocked race, in the background or or other kind of monitoring type tools like perf. For you know, no profiling.

B

Let's see so in terms of hardware, a tank tank, or rather at Red Hat, we for a long time. You didn't have any really good high performance gear. I had a box basically sitting in my basement that was kind of designated.

B

The ink tank performance lab back in the day, but it was it was more or less just a single for you, supermicro knowed that I swapped configurations around you know left and right trying to determine what different things made stuff perform well and what kinds of different hardware configuration stuff did not perform well in well. Just recently, in the last couple of months, Intel very generously donated eight really high-performance nodes to the SEF community, which red hat is is hosting in South community lab and those nodes uh have ten spinning disks. Although those aren't.

B

You know very interesting necessarily at this point from a performance perspective, because there's a lot of other nodes out there that have similar configurations, but each one of these nodes has 4 800 gigabyte, nvme SSDs in it, which is, is fantastic, there's a lot of back-end throughput and I ops that that we can actually now. Finally, test Stefan on a regular basis, trying to determine how to improve performance on this kind of a setup it has.

B

These notes have fairly reasonably fast processors and in each one also has a dual 40 gigabit qsfp+, a ethernet adapter in it, and luckily for us, mellanox actually recently provided us with a 12 port for gigabit switch. So we were able to hook all these nodes up together and actually get really pretty good performance with them. They also have 64 gigs of ddr4 memory, which is very nice, pretty reasonable, actually for how fast these things are and and how much disk they have.

B

So we just got those nodes set up right before the first set hackathon took place earlier this month of in Portland, and we are looking for a have initial test case. We had done some chem, very real initial performance tests on it using fio and ratos bench through CBT, but we didn't have any have really good example for this thing. So when we went to the hackathon we're kind of looking to see well, what could we do with these at the hackathon?

B

To kind of you know showcase their usefulness and also provide the community with some interesting results and at the hackathon Zhang Zhang from Intel presented some very interesting test that he had done looking at TC milk and J milk, and this was based on some work that sandisk had done earlier this year.

B

Also looking at both of these different memory, allocators and actually they had found a bug in TC milk, where you couldn't change the thread cash settings in the version of TC milk that was distributed with almost all distributions that are out there right now, abbu to rel, Santos I, think they bein as well, and- and this was a big deal- um they found that the default thread cash values in TC milk actually were causing stuff to not perform well for small IO and Intel's results that they show at the hackathon reiterated those findings and in fact they went to it in a little bit more detail.

B

So when I saw that Intel had posted these numbers, I became really interested in thought. Well, hey we're here at the hackathon we've got these nodes, we're looking for a test case for them. Maybe we can go and use CBT to try to replicate the findings that both sandisk and Intel had seen. So, um basically, within a couple hours, we sat down at the hackathon and created a cbt configuration that let us try to do that. So in cbt, you define in the ammo I'll.

B

Basically the cluster that you want set up and the test that you want to run and I haven't included all of the different things that you put in the the ammo configuration file. Here you can look at camp the examples directory and github or ask me or ask on the mailing list how to do that.

B

But one thing I did want to point out here is that when you're setting up the cluster, you can specify this fo SD command to run and, in this case, I'm actually using a pre, compile our a compiled version of stuff that was from get and as part of the sepo SD command. I'm. Actually, setting an environment variable here in this case to change the TC mouth thread cash settings. You can also potentially use LD preload here to actually set the memory allocator to use for the SEF OSD process.

B

um It I didn't actually do that here, I actually recompiled stuff to just use the different memory allocator for each run, but potentially you could actually use LD preload. Some people might consider this a kind of a lack of of a input, validation, which is very true and CVTs part, but it does let you kind of do useful things like this. So um you know it's it's a little bit of a mixed bag. Hopefully no one would be doing anything really terrible with cbt.

B

In terms of uh you know, passing malicious things and but, uh but you know, it is kind of nice to be able to change environment variables that you're passing into the processes.

B

So in each case for every single test that we ran or every single different memory, allocator configuration that we tested, we rebuilt the cluster using CBT the exact same way. Every single cluster was configured using the exact same steps and that's kind of one of the things that CBP buys. You is that, if you're going to be doing a lot of repeated testing, you can make sure that clusters are set up the same way. You can actually tell cbt to rebuild the cluster before every single test.

B

If you want that to happen as well, although that takes a long time, there's actually some pull request right now that I need to look at from Ben England to try to speed up cluster creation, he's he has gone through and looked at how to parallel lives as much of it as possible, and that does help, but still that's a big time sink in terms of building cluster every single time.

B

So usually you can build it once maybe for a set of tests and maybe rebuild the again for a separate set of tests that you're comparing against the first or at least that's how I I typically do it.

B

So when we ran through these tests for the hackathon on the performance cluster, we ran through a whole ton of different tests. Random writes random, read, sequential, writes, sequential, read mixed I/o tests at various different sizes, four megabytes, 128 kilobytes four kilobytes justjust account a whole battery of tests.

B

I have not included all of us because it'd take hours to go through all of them, but the most important or most interesting ones were the small I/o tests, at least regarding the different memory alligators that we looked at a couple of things that we found out from this testing is that stuff is really hard on memory. Alligators, there's probably a lot of different opportunities for tuning here and there's a couple of different folks that are looking into that.

B

There are huge performance gains and latency drops that are possible and the next slide you'll actually see for 4k rights. How big of a performance game you saw, but definitely there's a lot of potential here.

B

What we also saw in these tests is that small I/o is really still CPU limited in all these tests, even even in the fastest configurations.

B

So that's another thing that will need to spend more time looking at are ways to try to reduce that overhead, and the final thing is in every single test, regardless of the the I/o size or the I/o type jml used higher amounts of memory in many cases provide higher performance, but it always used at least like 200 to 300 megabytes, more RSS memory, OSD process.

B

So that's a big deal, there's a lot of folks that are running large OSD configurations, nodes that have you know, 30 or even 60, Oh SDS on them, and a two hundred three hundred megabyte increase in RSS usage per OSD would push them over kind of the expectations that they had when they built those clusters. So we need to think about how we can try to gain some of the performance that we're seeing in JD Malik, without necessarily increasing memory usage that much those are kind of the goals going forward.

B

So so, let's take a look here in terms of I ups for 4k random writes. If we look at TC Mel 2.1, which is the default, that's basically distributed with. You know most different. Most distributions that are out there right now with a default 32 megabyte thread cache size, which is actually the only guys that you can use with TC milk 2.1, forcing pretty anemic I abscess 20,000, write apps, considering how fast these SSDs are.

B

That's a fraction, a small fraction of what they can do when we switch to the newer version of PC Malik with 128 megabytes of thread cash. It starts out really good we're about four times faster, but it degrades over time. We see it kept trailing off and ending up somewhere around, maybe 64,000 I apps, which is still much better than it is with the the current default TC milk. But uh you know it's kind of concerning. We don't know with very, very long running tests how much it may degrade.

B

It looks like it's maybe leveling off, but you know that is still pretty. Concerning je Melek, on the other hand, is pretty consistently about 4 to 4.1 times faster than TC male 2.1. It's a big increase, and it's it's looking really consistent in ace, we're still CPU limited when we, when we look later on at the CPU results. You'll see that, but potentially if we can reduce CPU usage or if these no's had even faster cities in them, we may be able to even get more out of the SSDs and we're seeing here.

B

So, in addition to hire I ops, we also saw a really nice decrease in in typical latency fio, which is the test that CBT ran in this case, provides a latency distribution for the tests that it runs, and so we were able to look at every single client process.

B

In this case we actually had 16 fio clients and I, I want to say, maybe 512 concurrent iOS going at once and we were able to see a decrease in latency, as reported by fio, from like around a typical 50 millisecond latency, for these, these ops down to around 10, so really really good improvement, still not as low as we'd like to see um again. This is CPU limited. So that's where we think you know we're just backing up or processing as many iOS as possible, and things are waiting.

B

So when we look at read I ops, which are much less CPU limited, we actually saw that around ninety eight percent of the 4k random read iOS were two milliseconds or laughs, so we're at least four reads getting really close to that one millisecond mark we're not we're not quite there yet, but um but really close. So that's really good. It means that I think will will be able to get there for rights as we continue to find ways to improve cpu utilization and and just can't generally optimize the code.

B

So here's what we're seeing that we're CPU limited basically in all configurations as CPU overhead was, was not quite at a forty course, which is technically it's a twenty cores with hyper-threading but we're up around like 34, maybe 35. Now fio was running on these nodes.

B

At the same time as the OSD we didn't, unfortunately, weren't didn't have time to get clients set up on the 40 gig network, so we were actually running fio on the same nodes as the OS DS and fio itself was using about five cores of CPU just for the client I, oh, so so we're pretty close to maxing everything out. You know it's kind of weird in the 64 megabyte thread cash case with TC mouth that we actually saw it to be kind of a little bit down more, like 30 course being used.

B

I, don't know exactly why it. Maybe there was some other reason for that, but um but we're still really quite up there, using almost everything and, in the other cases we're definitely pretty much pegged.

B

So. The downside here to Jay milk is that it uses more memory. It's really fast and it looked really consistent. But this is what you pay for to get that kind of performance you you, you see probably around, like 300 megabytes, more RSS, TC milk. When you increase the thread, cash also uses more memory, but it actually wasn't as high as I expected.

B

It may be that TC, milk, the new version, pc milk and increasing thread cash- is good enough that that it's not quite as high memory usage as as je milk, and it does give you a lot of performance, although that that trend that we saw where over time the performance was decreasing, is troubling so that how whether or not that levels out is probably going to be the determination of whether or not bc malik is really something that's going to be good in high performance configurations going forward or whether or not j milk is just kind of the definite way to go in terms of je malik.

B

The really big question is not how much memory it uses in kind of the typical scenario like this: it's what happens during recovery, and so that's actually what we did after looking at the performance in these different tests, we said: well, okay, is it's not enough to know the J milk is doing well with this can memory usage under typical scenarios?

B

We need to know what's happening when it when an OST is recovered, because that's when the OSD tends to use the most memory and luckily for about a year year and a half now, we've had a capability in cbt to perform recovery tests during benchmarks and so basically in CB team in the the cluster section, the ceph plus perception there's a state machine that will run through and try to walk, basically go through a number of steps to have a cluster mark.

B

Various osts down wait until the cluster is healthy and mark them back up and waiting tablet. Cluster is healthy again as a really simple configuration change. Basically, in the cluster section of the cbt configuration file, you specify that you want a recovery test. You specify the OS DS that you want marked down and mark back up and then there's a couple of other parameters you can adjust.

B

We didn't adjust that those in this case, um basically there's a wait period at the beginning of the test to wait before the OS DS get marked down and then there's a wait period at the end of the test. For how long you want the benchmark to continue to run until the the overall test completes, you can also specify that you want the recovery test to continuously mark o/s these down and back up throughout the duration of the benchmark.

B

So if you specified that you want to run a test for one hour, it will continuously go through and run the recovery test, basically going through the entire state machine over and over and over again until the benchmark completes upon which time then, the recovery test basically just finishes up and exits out.

B

So we we basically went through and ran this for TC Malik and je milk with 4k random, writes again to see what happened and, and what we saw actually is that in all configurations with with all memory alligators, there is a big spike in RSS memory usage. When recovery happens, when when these osts are marked back up and in and the the cluster is dealing with that before it becomes healthy again with DC Mel in a 32 megabytes thread cash configuration and these numbers are actually very similar, fortissimo 2.1.

B

We didn't, we didn't run through those tests here, but basically 2.1 and 2.4 look very similar in our previous testing. So we just looked at 2.4 here. We see that there's a memory spike, but it is not quite as big as it is either with 128 makes of thread cash or with JD milk.

B

J milk again was by far the biggest bike and we're seeing here that actually de milk is using somewhere around, maybe 500 megabytes, more RSS memory than the 32 megabyte thread: cash configuration with TC milk and it's using somewhere around, maybe 200 to 250 megabytes, more RSS memory than TC Mel with 128 megabytes thread cash configuration so unfortunately we're seeing a pretty big spike.

B

There's a lot of memory usage, a lot more memory usage in recovery with Jade milk, especially compared to what we see with the current stuff configuration or the least common configurations with Seth right now with TC Malik 2.1.

B

Now, having said that, if you look at how deep these graphs go in terms of time with je Malik recovery happened over twice as fast as the current kind of default configuration, it was still faster than the TC Malik 128 megabytes of Fred cash, though not quite you know it's not a significantly faster, but that's a that's a good game. That's a impressive gain right! I mean, if recovery happens, that much faster. That means that your cluster is healthy, more often and a larger percentage of the time. So that's really nice.

B

That's a really big benefit. So, unfortunately, there's a trade-off here, more memory or better recovery and better performance for our work going forward now is to figure out. Can we make je Malik or TC Malik better? We reduce J Malik's memory usage, or can we increase PC mouths performance without increasing memory usage? So far we started looking at je milk. We we've upgraded to pour point 0, which just came out about a week ago, and that seems like maybe is helping slightly, but it didn't do much in our task.

B

It was if, if there is an effect, it's really small on the mailing list, there's been some discussion. Other folks have looked at this and I've seen a little bit more of an effect with j email looks of the new version of j milk.

B

Maybe on the order, if I remember correctly, it may be like seven or eight percent better memory of lower memory usage. So it's possible that that maybe there is something there, but we didn't see much.

B

We tried changing a lot of different j emailed parameters. They let you pass in an environment, variable to set them out conf, and you can change different settings that way everything we tried had almost no effect, it's quite possible that we were doing something wrong, so we we need to figure out. If we're not doing this correctly or if something's happening. You can tell gay male to print out statistics on exit, and that will tell you what those settings are. So we tried to go that route.

B

Unfortunately, when we issued a sig term to the OSD, it caused the OSD to seg fault, um since we've done that we've had some feedback on that. Specifically, we need to probably run the OSD in the foreground rather than demon mode and there's also some documentation in the j mouth man page that printing statistics can cause deadlox if there are threads, are trying to allocate memory at the same time, so that could be related.

B

We both try to print statistics through the malloc offsetting that jml provides at exit, and then we also tried to instrument statistics printing directly into the SEF OSD shutdown process. Our suggestion shutdown function, neither work they're, both still causing seg faults, and it's quite possible that maybe just using je milk in general is causing segfaults when sig terminus sent I don't know yet so there's still a lot of work there. That needs to be done to try to figure out exactly what's going on here and and and gets good statistics out of gay milk.

B

It does provide a lot of really useful looking statistics and also profiling data. That's compatible with g / tools, so this is really nice. You know, potentially there's there's a lot of data that we can get out of this to find out more about what's going on in terms of memory allocations and then also whether or not we're were modifying settings correctly.

B

So beyond just trying to tweak memory, Alec hairs and play with you know what things are being used for stuff in parallel. There's an effort going on right now to try to improve the chef's behavior stuff is really really hard head memory. Alligators. I think I think these results pretty much show that and we kind of known that for a while, but you know the improvement in performance that were say here, kind of makes it a little bit more crystal clear.

B

So so various folks are working on this puter Dalek from fujitsu, specifically as one one person, that's been doing, a lot of work in this area and and potentially some of the patches that he's already submitted, are going to improve performance and and maybe even affect memory usage and and there's probably a lot more- that we can do beyond that.

B

As well, we've got a lot of Investigation to do a lot of hard work to do some of the the other folks on the performance side at Red, Hat recommended actually modifying or implementing our own memory, allocators and stuff, which would probably be a fairly major undertaking, but potentially would be something to think about, and- and we also need to look at at some of these other new things are being committed to stuff as well like a sink messenger, where the amount of threads that Seth creates are actually far lower.

B

um It uses a thread cool with polling, and there are other things that are also being implemented like xio, messenger and various other things that may affect all of this. So it's there's, there's a ton of work, that's happening and there's a ton of testing that needs to happen, and a lot of different people are looking at this. Actually, every week we have a weekly performance meeting where folks get together and kind of discuss all of these different things that we're looking at and people present their results.

B

So uh it you know if you're interested certainly feel free to stop by I post every week. The meeting invitation it's a wednesday mornings at 8am time so feel free to stop by if you're, interested and and if you'd like to to participate. You know otherwise. They'll certainly be more of these kinds of things more presentations and newer versions of stuff. Hopefully we will be able to integrate all of this and and really see dramatic performance increases, especially for small I. Oh so that's it! That's all I've got um well I.

B

Think the background in show up here rights, camp, a blue color, but uh yes, that's! That's all I've got so feel free to ask questions. I had no idea how long that good I! Imagine we probably have time. So that's it plenty.

A

Of time yep looks like Joe has to question a little while ago about why not do in application memory management.

B

Yes, so um probably because no one's had time this, I'm guessing the the right answer. You know. I think that probably talking to two sage about that would be the way to go and see what his thoughts are ill. He knows that far. He knows the code far better than I do. um Actually, if daughter is on, he mentioned on the mailing list that he had actually I think done some initial investigation and was seeing just stuff all over the place that need work. But giver are you? Are you around or you have microphone access.

C

Yeah I don't Kenny come here; yes, oh, oh great, so regarding the memory usage and memory behavior of the set. Okay all have followed me in past. It was the at first the memory usage of the greatest benchmark itself and then I realized that there is a lot of memory shopping going back and forth and quickly. I have identified that the the Riza and similar behavior in the entire set code base, including also buffer list. So this is a lot of work that needs to be actually done on this matter.

C

If we actually want to have a lot better and that perform at at all, we need to have this performance and behavior simply fixed. We made lightly replace the technology gmail, but this will be a short-term solution that will work and will increase the memory usage for most users, but it won't fix the root issue so from.

A

C

A

I'm going to work.

C

On to say, any folders of the total three, there will be a lot less memory allocations and the allocations going on.

B

Yeah I agree with each other, I think I think you know what we're seeing here right is a.j milk is sort of a band-aid right. It's doing much much better, but it's pretty clear that that stuff is is really really stressing the the alligator so there's a lot of work that we need to do to to stop that from happening and who knows, maybe once we do that Jay milk and TC Malik will both start looking more similar in terms of their behavior, but at least for right, now kind of as the short-term band-aid fix.

B

If you can handle the increased memory usage, you can definitely get much better performance out of Jay Melek. Let's go so. Let's see, I'll get a whole such of stuff. In here now in.

A

The meantime, people that want to ask questions somewhere getting via the blue jeans chat here, which is fine. If you wanna unmute your microphone, you can ask a voice question or you could use this up IRC channel, so any of those are acceptable. Looks like brian. Has the next question asking have you done any performance testing at rooster with hdds compared to file store with SS cadence journals? Yes,.

B

Yes, we have um so I, don't remember all the results off top my head, but what we have been seeing with new store is that it's faster in almost all cases, except for our BD style, object over rights, though you know, this is actually object over rights in the general case, but our bodies, where you see it happen, a lot if you're doing like a a day, a 512k right into a 4 megabyte objective or has been slower than file store in those cases.

B

Part of it is due to how rocks TB does is write ahead. Logging um excuse me uh basically, there's there's a lot of overhead do to it recreating a log or creating a new log. Every so often I, don't remember what the default values are, but it periodically creates new log files and then gets rid of the old ones.

B

A sage actually in the last couple weeks has implemented kind of a hacky change into rocks DB that allowed it to kind of recreate the log file in place and that I think he said, yielded about a fifteen or twenty percent performance improvement when he did that. But he needs to kind of rework that and present a formal pull request over there. Xd be guys before that will make it in and we'll need to do. You know a lot more testing on new store.

B

After that happens, another alternative might be to take objects in new store and break them into chunks. So maybe like do 512k or one megabyte chunks, so that portions of the object don't actually need to be rewritten if you're doing a partial over right and that might help too um we'll just have to kind of see where we end up after the rocks TV changes and whether or not we want to go through all the work of implementing like object, trunking.

B

So uh that's that's kind of where we've been at at new with new store, I. Think a lot of work recently has just been going into getting all of the underlying code necessary for new store into SEF, so that new store can be can be merged in actually, there's open pull requests for that right now, so it's all happening kind of as we speak.

A

Awesome looks like the next question here is from KP. What kinds of perform is testing to do with catch theory, same set of tools or any background? You can share on your performance testing efforts on that front. Yes,.

B

So we do do testing of cash tearing and cbt can actually go through and take a set of os DS that you have defined in your septic on file and designate those as a cached here. So what it will do is basically modify all the the crusher rules and create a parallel hierarchy for the cached here and then go through and kind of do all of the Annoying commands that you have to run to create that cash to you automatically and let you specify that that should be a cash for some other pool.

B

So actually in Simi tu, you specify profiles. You say that you have say like a base, pool profile and a cash to your profile, and then it uses those profiles to actually make pools during the benchmarks and if you've specified it in kind of the the expected way, then it will go through and automatically create that the cached here in cbt for the the base pool that's being used for the benchmark. So um some of the testing that we've done on that front has been focused on specifically on promotion behavior into the cash tier.

B

What we've seen is that it's really really easy to get to a point where there are excessive promotions and that really drag performance down. The big reason for that is say that you're, using our BD with a cached here and you have a 4k, read miss well that 4k read miss means that you've got a 4 megabyte, or at least by default.

B

The four megabyte rbd object that gets promoted into the cache tier now, assuming that you have default 3x replication and that you're doing SSDs and you've got your journals on the ssds, as well as the the data. That means that that, for megabyte object actually turns into a 24 megabyte right. The 4k read cache miss is actually promoting 24 megabytes of Rights into the SSD cached. Here it's that's really intense. It's really really easy to overload the cash tier with excessive rights.

B

um Some of the testing that we did an ish alee was basically just to sudo randomly reject ninety-nine percent of the promotions say you know we're going to just limit the promotion rate.

B

You know to be very, very low and really really hot objects will eventually make it in because, even though the only one percent of the promotions are making it through they're, so hot that that sooner or later, they're going to make into the cash tier anyway and anything that's cold is is basically and get rejected, and we saw something like I want to say a round a 40 or 50 X performance improvement. When we did that it was, it was huge. It was actually enough that that the cash tier started looking pretty good.

B

It was, it was not, uh it was I. Remember it went from being significantly slower than just using the base pool by itself to being maybe like a head I want said, maybe like forty or fifty percent faster than using the base pool, though that was really really good. That was really important in addition to that right. Proxy support just got merged this week. If I remember right that will hopefully help things quite a bit as well. Read proxy was merged, maybe whole month like six months ago.

B

There's a lot of work going on in this front. There's probably a lot more work that needs to happen to look at the heuristics of when promotion happens when eviction happens, lots and lots of stuff on this front to look at.

A

Cool looks like the next question for eagle eyes: any anybody looking at bringing more deterministic by leveraging, see groups etc, using cpu affinity for OS, game d, payments, constraining ram, etc. Thunder yes,.

B

um So Kyle Bader in the reference architecture came at Red. Hat has done some work on this and also um Ben England from the the right half performance team is planning on and doing a lot more investigation into see groups specifically and looking at you know both cpu affinity and and probably memory affinity under different scenarios.

B

I think there's been kind of a a general interest in kind of like hyper converged solutions, and so a couple of different people are interested in trying to figure out whether or not this can be accomplished without really significantly impacting the OS DS I think it's gonna be really important going forward to figure out. You know if you want, like hi, I ops, how to deal with the fact that and we're we're basically maxing out the CPUs already or for these kinds of workloads.

B

So there there's definitely going to be some contention in terms of resources when you want to do this. You'll have to be careful about how your design, your notes, maybe you can only have a couple of SSDs if you also want VMs on those on those same same nodes, there's probably a lot of kind of hardware, reference architecture, design, work that will need to go into figuring out where the balancing points are, and certainly things like jml can t see.

B

Milk are going to fit into this because we see such a huge shift in CPU usage when when using different alligators, so you know definitely there's a lot of work there. That needs to still happen, but people are looking at it.

A

Cool right, so the next question here look like we're back to Brian sing scrubbing can have its I. Oh nice value set the idol, but has there been any work on allowing you said I, oh nice, values during recovery operations.

B

So I don't know is probably the right answer. I mean there are things in in stuff itself for trying to shift it kind of how much client I oh vs. recovery happens. I, don't know you know specifically about I. Oh nice, I guess I, don't know. What's happened there?

B

What I've seen when changing CEFs options for kind of priority of recovery operations that it it? It basically changes as you have more client, I/o even past the saturation point on the cluster so say you have 700 megabytes of client I owe under some kind of recovery, priority settings that you've set, and now you increase the number of clients, though you've got more clients that are waiting to do. I owe when you do that.

B

What seems to happen is that your client I/o performance might not change you. You might see the same level of performance, but now you all you've done is kind of made recovery. Take longer, it's kind of just made the situation worse in a way. What you optimally in my mind, what you would think would happen is that you said okay I want to vote. Under-Recovery scenario. I want thirty percent of the traffic to comfort to happen to be recovery.

B

Traffic and I want seventy percent to be client, I 0 traffic, it kind of ideally, maybe that's that's what you want or maybe wants four values. I don't know, but you would hope, then, that you would maintain that, regardless of how much client traffic or how much client I oh you have, even if you have more clients trying to do.

B

I oh you'd hope that you'd keep those ratios, but it doesn't seem like that's what's happening right now, though, um you know, I think we need to kind of investigate that whole kind of area of the code and how it works do to make this simpler and nicer for for users. But that's that's just my take on this um kind of kind of what I've seen.

A

Right Gennie's asking if there's any performance hit, we write on the live data, IE naught terms of 4k.

B

Good question um I.

B

Are you talking specifically about rbd or objects.

B

Rbd, okay, um I don't know that I have actually tested unaligned. Our video rights I may have at one point, but if I did I, don't remember the results. So that's a very good question that I do not know the answer to something that we should probably work into our set of standard tests. Nice.

A

Enters saying that using the Swift api they've been seeing some head request, some large a few objects fire flying take longer than three plus minutes. Is there any optimization tips you might recommend in order to bring this number down here should be doing ahead. Requests on each part, which can take an extremely long time, depending on my parts.

B

So I think that if it's taking three plus minutes, there's definitely something wrong unless there's so much, I owe that it. It really is taking that long. You know for the other howling hard work, but I suspect not what I would suggest doing is looking at the different statistics, the different administrative socket statistics that you can get to performance counters both for uh the OS DS and then for also the other, the other demons and trying to figure out where in the pipeline, you're stuck without standing ovations.

B

This is something positive that we should talk about with with just offline out out of discs. It take too long to diagnose, but but there's definitely I think something something wrong going on. There.

A

Giggles asking the regarding cbt: how hard is it to take the same test, yema file and point it to another set of nodes similar to technologies, targets, option.

B

So right now in the animal file, you specify a cluster section that has the nodes kind of hard code or oh, my heart hood, but the nose defined in the ammo file to right now currently you'd need to make a copy of the mo file and change those noes to the new set of nodes.

B

If there's there's also an option where you can target existing clusters. So if it's already running an existing cluster, you could basically just say use existing and then have it go often and target that cluster there's some interesting work going on by mirantis right now to try to kind of integrate like OpenStack provisioning tools into cbt and honestly I actually do not know that much about OpenStack. At this point, I deployed a cluster like three years ago and I'm horribly horribly out of date. So um I can't even begin to explain.

B

You know what it is that it actually does it provisioning in those cases. But um but there is some work going on kind of to try to do that in parallel, um there's kind of a really simple enhancement to cbt. That would bet that a couple people of express interest in that would be really nice. That tooth ology already does, which is to let you define multiple yamo files, so you you wouldn't actually have to recreate the entire file.

B

You just recreate, maybe a separate note targets section and then include that yeah mole or a different target sea animal for the the cluster part of it, and that would be a really easy change would be just basically letting the settings take in multiple llamo files and then adding those to the settings that gets created in Python. So so that's kind of kind of where that right now we don't do any kind of like auto provisioning, though, which you know maybe someday with like this OpenStack thing. So that's that's kind of where we're at.

A

Okay looks like the next questions from the kill on cash tier 3 testing, those reducing daughter from 23 to either 20 or Nike. We've created our baby image, help with performance so.

B

That's a really interesting question, and what we have seen is that there's kind of competing things going on when you decrease the the rbd default object size.

B

There are some situations where that does lower performance, because now you're you're not doing is big of reads or writes you're, not coalescing as much into to one object when you're doing you're doing I, oh, but in terms of the cache tier, it can actually help you because now you're reducing the promotion overhead, though, depending on how quickly you can absorb, writes into the cache tier, though you know whether or not you can really absorb extra io for larger objects versus the benefit that you get by having bigger objects.

B

There's this kind of competition for what makes gives you the best performance so kind of the answer is, it probably depends on your cluster configuration and your network configuration and all of this kind of stuff I.

B

Think, probably the best answer is you're just going to have to try it and see if it if it helps you with the stuff that that that you need I, suspect that for small ios4, like 4k, reads and writes, you would see a benefit by reducing the rbd object size, but for large iOS. You may actually see a decrease in performance that would be Mike, just kind of guess.

A

Okay, that looks like we've exhausted all of the existing questions. Anyone else have a question throughout the real trick. People.

A

It sounds like a lotta, nothing alright! Well, thank you very much mark for kicking the time to give us the lowdown on set performance. This video should be up before the end of the week here. If people want to review it or share it to folks that missed out so other than that thanks everybody for coming we'll see you again next month.

B

Alright thanks everyone thanks.

C