Description
Making Ceph Fast in the Face of Failure - Neha Ojha, Red Hat
Ceph has made a lot of improvements to reduce the impact of recovery and background activities on client I/O. In this talk, we'll discuss the key features that affect this, and how Ceph users can take advantage of them.
About Neha Ojha
Senior Software Engineer, Red Hat
Neha is a Senior Software Engineer at Red Hat. She is the project technical lead for the core team focusing on RADOS. Neha holds a Master's degree in Computer Science from the University of California, Santa Cruz.
So, having said that, I want to understand how many of us in this room actually know how Ceph deals with failures, or how recovery in general happens. All right, almost 50%, let's say 60%. I'll still spend a couple of minutes going over the basics of how Ceph deals with failures. A failure in general is something like an OSD failing or a node failing; what does Ceph do? There are two basic mechanisms that Ceph uses to deal with failures: one is called log-based recovery and the other is called backfill. Each of these mechanisms has its pros and cons, and I will go into that while I explain each of them.
So let's start with log-based recovery. As the name indicates, there is some kind of log involved, and the log that we are interested in here is the PG log. Every OSD maintains a PG log, which is basically a history of the operations that it sees happening on itself. The idea of log-based recovery is that when a node fails, there is obviously some information missing from its PG log. So it consults an up-to-date copy of another OSD's PG log, tries to figure out what it's missing, and then copies the delta of information that it requires to become up-to-date.
Obviously, as I said, the PG log is something that the OSD needs to maintain, and it is an in-memory data structure. That has the advantage that recovery is fast; when I explain backfill, you'll see that backfill definitely takes a longer time than log-based recovery. The disadvantage is that, under certain circumstances, when there's a lot of recovery happening, these PG logs can keep growing out of bounds and end up causing out-of-memory conditions on OSDs.
So, having discussed log-based recovery a little bit, let's see what backfill is.
Inherently, or at least when we started out, there was a basic problem with the way recovery worked in Ceph. What we focused on was: when an OSD died, we wanted to bring it back up to normal as quickly as possible. Obviously that sounds good, but there are other implications: client I/O takes a backseat, and performance during recovery is really poor.
The people dying in this picture are like client operations that were not able to keep up with recovery and backfill, which is like a plague taking over the entire city. This picture translates nicely into this graph, which compares baseline performance versus performance during recovery in Hammer. I don't even want to say it: it's less than 10% of baseline performance. So obviously there was a problem, and we did address it.
A few things that did help were the defaults. Options like osd_max_backfills and osd_recovery_max_active, the kind of parameters that were available for tuning, weren't tuned well by default. So if you manually tweaked them, you could get better performance, but there were still problems. In Infernalis, which was the next release, we made those two knobs default to better values, so that you could get better performance than in Hammer.
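For example, these two knobs can be tightened at runtime; the option names are real Ceph options, but the values below are only illustrative, not the defaults the talk refers to:

    # limit concurrent backfills and active recovery ops per OSD
    ceph tell 'osd.*' injectargs '--osd_max_backfills 1'
    ceph tell 'osd.*' injectargs '--osd_recovery_max_active 1'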
But still, if you look at it, it's not doing very well. So in Luminous, what we did was introduce a way to throttle recovery, and the option that does that is called osd_recovery_sleep. As the name clearly indicates, it's the amount of sleep that you induce between recovery operations, and that way you can control the rate of recovery versus client I/O. This was our immediate solution, and it helped a lot with controlling recovery, especially when you wanted client I/O to take priority over recovery.
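A minimal sketch, using the same runtime-injection style as above; the sleep is expressed in seconds between recovery operations:

    # induce a 0.1 s pause between recovery operations on all OSDs
    ceph tell 'osd.*' injectargs '--osd_recovery_sleep 0.1'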
So we did some experimentation, because we had our past learning that we were not good at coming up with good defaults. We thought that, since we have this option, we should come up with good defaults from day one. So we did a bunch of experiments to see what reasonable defaults for these would be, and here is what I have for you.
What I have for you is a graph: the one on the left is average latency versus the sleep value, meaning the osd_recovery_sleep value, and the one on the right is the total time to recover versus the sleep value. As you can clearly see, increasing the sleep did help us reduce the average latency, but on the right you can also notice that the more sleep you induce, the longer the entire recovery takes to complete.
So we wanted to come up with an optimal value, so that we could take advantage of the reduced average latency without paying too much of a cost in total time to recover. We came up with 0.1 seconds as the default value for hard disks. Just for information's sake, these tests were done on BlueStore, with FIO 4K random writes.
We did similar experiments with SSDs but, as we know, SSDs are very fast; we didn't want recovery to get throttled at any stage, so we just kept 0 as the default. Then we also have the hybrid option, which is one of the most common use cases for Ceph, where your data is on hard disks and metadata is on faster devices. Even here, the trends for average latency and total time to recover were pretty much the same as what we saw for hard disks.
We came up with a sleep value of 0.025 for hybrid setups. While I showed you these graphs, they made sense when we did these experiments; the idea I want you to take away from this is that we have those knobs now. These results may or may not hold tomorrow, or in the next five releases, but you can always re-perform these experiments, figure out what defaults work for you, and Ceph will use that to control the rate of recovery.
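A sketch of the per-device-class knobs behind these defaults; the option names and values match recent Ceph releases as far as I can tell, but verify them on your version (ceph config set itself exists only from Mimic onward; on Luminous you would use injectargs or ceph.conf instead):

    # defaults arrived at in the experiments above
    ceph config set osd osd_recovery_sleep_hdd 0.1
    ceph config set osd osd_recovery_sleep_ssd 0.0
    ceph config set osd osd_recovery_sleep_hybrid 0.025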
There were cases when users wanted to recover some of their data as quickly as possible, so we added an option to force recovery at a PG level, where you could just run a command, ceph pg force-recovery, and give it a list of PGs, or just one PG, and it would recover that PG ahead of any other PGs in your system. And in case you change your mind, or you incorrectly chose the PG, you can cancel it as well.
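For example, with hypothetical PG IDs:

    # jump these PGs to the front of the recovery queue
    ceph pg force-recovery 2.1 2.5
    # changed your mind? undo it
    ceph pg cancel-force-recovery 2.1 2.5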
Okay, then came Mimic. With Mimic, we thought further about improving latency during recovery, and what came to us (something that was available for us to know, but that we had not addressed) is that recovery is a synchronous process, so it blocked writes. So clearly it impacted write latencies and affected availability. We wanted to solve this problem by introducing async recovery, and here is the way we did it.
The idea is that we do not block writes on objects that are only missing on non-acting OSDs. When I say non-acting OSDs: the whole idea is that you have a bunch of OSDs, and as long as you can pick some OSDs while ensuring that you have enough of them to process I/O, you can afford to postpone recovery on the remaining non-acting OSDs. For the purposes of this presentation, they are called async recovery targets.
We will postpone recovery on those objects and not block I/O if the objects that are not up to date exist only on those OSDs. The whole idea is that we use log-based recovery to eventually recover those OSDs; just by postponing, you are getting better performance and not blocking writes immediately.
So when can we perform async recovery? There are a few conditions that need to be met for async recovery to happen, and the process starts with selecting those async recovery targets. How do you select them? As I mentioned, if you have a pool of OSDs, how can you say which of these should potentially become async recovery targets? We use the concept of the difference in length of the PG logs.
So, as I mentioned earlier for log-based recovery, every PG has a log that it maintains of the operations that have happened on it. During recovery, a node that failed does not have an up-to-date PG log, and when it tries to compare it with an up-to-date copy of that PG, it is going to figure out that it is missing some information. This delta between the PG logs can give us a rough idea of how far behind an OSD is from the most up-to-date copy.
The larger the difference, the more time it is going to take to recover. So we try to choose as async recovery targets those OSDs which are potentially going to take longer to recover, and postpone recovery on those. The other thing is that you need a way to control async recovery, in the sense of when you want async recovery to happen, or at what level.
You say what the difference in logs should be before you perform async recovery, and that parameter is called osd_async_recovery_min_pg_log_entries. The default for now, at least in Mimic, is 100. There is no evidence that 100 is the best value to use; we can definitely come up with a better default here as well. But the idea is that this lets you control how much async recovery happens, and if you want, let's say, async recovery to never happen,
you can set this parameter to a very high value; the difference in logs is never going to be that high, so it's never going to choose an OSD for async recovery. Beyond all those things, the most important thing to highlight is that when we are taking OSDs out for async recovery, we need to ensure that we have min_size replicas available. That's the part where I said "as long as you can process I/O": it is important that you have min_size copies available for I/O to proceed.
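A sketch of that knob; the option name is as in Mimic (the talk notes it was later renamed), and the second value is just an illustrative way of making the threshold unreachable:

    # default: an OSD more than 100 PG log entries behind becomes an async target
    ceph config set osd osd_async_recovery_min_pg_log_entries 100
    # effectively disable async recovery
    ceph config set osd osd_async_recovery_min_pg_log_entries 1000000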
These are more details about how it works in general, but the overview is that it works for both replicated and erasure-coded setups. The whole idea is that earlier, what we used to do was see if an object is missing; if so, we want to first recover that object and then process the I/O that has come in on that object.
So now the idea is that if an OSD is an async recovery target, we send only the log entries to that OSD, and not the entire transaction. When log-based recovery happens, the history of that log is present, so the OSD compares it and finds the right objects in an up-to-date copy, just by looking at its PG logs and seeing what its missing set looks like.
We also did some experiments to validate that async recovery worked better than regular recovery, and these results seem to indicate that it does. This is a graph of throughput where we are comparing the baseline performance, where there are no failures (the blue bar), against the throughput when an OSD dies (the red bar). Basically, we are running COSBench and generating RGW workloads: a mixed workload with read, list, write, and delete, and in order to induce recovery we kill an OSD. This shows that when async recovery is happening, throughput is definitely falling, but it's not falling as much; and if you look at the list and delete cases, or even the write cases, the throughput is pretty much comparable.
Okay, this is the average processing time, which shows similar trends. The average processing time is definitely going to increase in a failure scenario, and we see that it has increased; but surprisingly, for delete it hasn't. In fact, it is lower. So in general, the graphs and our validation looked good.
This is a more detailed graph; people who have actually run COSBench should understand it. The whole idea is just to show a time-series graph of how the throughput and latency behaved for every operation. If you just want the overview, look at the bubbles: for example, in the orange one we have 61 to compare with 54.
So you're comparing the no-failure case, which is 61, against 54 during async recovery, which is not bad. Then there's the exciting stuff which is now available in Nautilus. As I mentioned earlier, log-based recovery has, or at least had, the inherent problem that PG logs can grow out of bounds, even though we had a parameter that was supposed to bound them.
So in Nautilus, what we did is implement a hard limit for the PG log length, so that when you say this is the max I want my PG log to grow to, it will just stick to it. We understand that log-based recovery is important and can be faster, so we want you to be able to use it, but you can decide how much of it you want to happen. And by the way, this has been backported to Luminous.
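If I recall correctly, the hard limit is gated behind an OSD map flag that you set once every OSD runs a release that supports it; verify against your release notes:

    # opt in to the PG log hard limit cluster-wide
    ceph osd set pglog_hardlimit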
If you are running Luminous, you can still use it, just like on Mimic and Nautilus. Then came the other feature. I talked about forced recovery and forced backfill, and I also mentioned that they were introduced in Luminous at a PG level. One piece of feedback that we got from a lot of users was that it is not very intuitive or easy to map which pools are mapped to which PGs; you have to go and figure out, okay, this PG 2.1 is the one I want to run forced recovery on.
So, as Sage mentioned this morning, usability has been a theme for us, and even in this scenario, what we have now is that you can run a forced recovery at the pool level. You can just indicate: okay, I have a CephFS metadata pool, which is the highest-priority thing for me right now, and run a force recovery, and of course force backfill, at the pool level. And again, we always let you go back: if you change your mind, you can cancel at the pool level as well.
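A sketch of the pool-level commands described here; the pool name is only an example:

    # prioritize recovery/backfill for everything in one pool
    ceph osd pool force-recovery cephfs_metadata
    ceph osd pool force-backfill cephfs_metadata
    # and the way back
    ceph osd pool cancel-force-recovery cephfs_metadata
    ceph osd pool cancel-force-backfill cephfs_metadata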
The next thing is improved async recovery. Now, why do I say improved? As I discussed earlier, the way we decided which OSDs are going to get selected as async recovery targets was by just looking at the length of the logs. But in Nautilus we improved the accounting of missing objects a lot, so we felt that a good cost parameter would be a combination of this difference in PG log length and the number of missing objects on a particular OSD, and we renamed that cost parameter to osd_async_recovery_min_cost.
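The renamed knob, as a sketch; treat the value as illustrative rather than a recommendation:

    # cost threshold combining PG log delta and missing-object count
    ceph config set osd osd_async_recovery_min_cost 100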
So far, no validation has been done on what a good default value is here; I'm pretty sure it's going to change with different scenarios, but I think this will at least allow you to control async recovery when you use it. Also, I think it's more realistic now that our accounting has been fixed: just looking at the PG log length is not enough, and missing objects can give you a more realistic picture of how much time it is actually going to take to recover.
About backfill: there were improvements around backfill as well. The whole idea was that we calculate the tentative amount of space that is required for a backfill operation to complete, and if we see that there isn't enough space available on an OSD, we do not even accept reservations on that OSD for backfill. We basically deny the reservation and put it back into the queue until space gets freed up.
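Related, though it is an existing cluster-wide knob rather than the new reservation logic itself; the ratio here is illustrative:

    # refuse backfill to OSDs above 90% full
    ceph osd set-backfillfull-ratio 0.90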
Moving on, along the lines of recovery sleep, this is another thing that I think was motivated by some of our user experiences, which is basically introducing an option to throttle deletes of PGs as well. There can be scenarios where PG deletion just hogs all the bandwidth in your cluster and client I/O might not get enough bandwidth, so you can now control that using a similar osd_delete_sleep option. This also has different default values that auto-tune based on the underlying hardware.
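A sketch of the per-device-class variants; the option names exist in recent releases, but the values below are illustrative, so check your version's defaults:

    # pause between PG delete operations, in seconds
    ceph config set osd osd_delete_sleep_hdd 5
    ceph config set osd osd_delete_sleep_ssd 1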
This next piece is the work of a contributor who is, I think, a PhD research candidate, and it is more than a year of her work that we just merged for Nautilus. I'm very proud to say that we have academic research translating into open source software, which I think will be running in production very soon. This has been marked experimental for the right reasons: we haven't done enough validation at our end, under all kinds of scenarios. But if users and our community can give us feedback on it, that will be really useful to us. It has really promising advantages. The whole idea is that you increase the number of OSDs that you reach out to during erasure-coded recovery, so that you are getting a smaller amount of data from each individual OSD, rather than having a smaller number of OSDs and getting a larger amount of data from each of them.
So the granularity at which the copying happens is going to be finer, and you save on the extra copying around that was being done earlier. This merged just a couple of weeks ago, and it is something really exciting; I think a lot of users are going to benefit from it. The other thing that's almost ready, and I think is getting reviewed and tested, is that we had a restriction in erasure-coded pools that we did not perform recovery below min_size; recovery below min_size was not allowed.
So that was a restriction, and we feel that some of the reasons on which we had based this decision do not hold anymore. I think we are confident now that, as long as we can recover the data in an erasure-coded scenario, we can afford to go below min_size. That is the whole idea, and I think it is again going to be really useful in real-world scenarios.
That's going to land in October, so watch out for that. Other than that, I think one more thing, along the same lines of letting you prioritize the right kind of stuff: we also thought of going to the next level, where we prioritize stuff that we think is already important, like the CephFS metadata example earlier.
I think the next thing that we have identified, and that we think is going to be useful, is this: I mentioned osd_recovery_sleep being an option by which you can throttle recovery, but it has a clear downside in that it's a static value. So if you use a one-second value, it's just going to keep a one-second gap between recovery operations, no matter what.
So either you don't have enough client I/O and you could be wasting a lot of time in recovery when you shouldn't, or you have to manually go and change it and remember to put it back to the right defaults, or you just pay the extra cost in total time to recover. So what we have now thought of doing is adaptive recovery throttling.
The whole idea is that we will sense how much client I/O is going on in the cluster, and based on that we are going to raise or lower this sleep option, so that there is no manual intervention: nobody waking up in the night, setting an alarm to go change their osd_recovery_sleep value. You wouldn't have to do that.
So that's something that we hope is going to land in the next release or so. The other one is a full-blown QoS project that we have been working on as a research project, and I think now we have realized it's time for it to get even more focus from us. So we as a team in RADOS are trying to muster up our resources and focus on a full-blown QoS project, with a plan and timelines as to when we want to deliver it and what we want to do.
There is a bunch of work that we have associated with UC Santa Cruz; we have a PhD student contributing on a weekly basis, and we meet with them. But now I think we are also going to take the research side of it, use our expertise in the RADOS team, and get this project to the next level, I would say. That's also something to watch out for. All right, so I think I'm on my last slide, which summarizes the talk, and the summary word is: upgrade.
I think if you're running anything like Jewel or Hammer, you already know that you are in a bad place. There are a lot more exciting things, a lot more feature additions that have happened, that can make Ceph, you know, hands-free, so you don't have to bother with a lot of tuning and hands-on operations. So just upgrade to Luminous or Mimic, or even Nautilus. I mean, Nautilus is the best; as you can see, it's the smartest, and it has the best kind of auto-tuning.
It has better performance in terms of async recovery, and I think the prioritization stuff has been reworked all over again, so recovery priorities are going to work much better in Nautilus than in any older release. And yeah, watch out for new features in Octopus, which should probably be discussed at the next Cephalocon. Yes, that's it. Thank you.
Audience member: Cool, I just wanted to give the feedback that the osd_delete_sleep is actually making a huge impact; it's really, really working. Because if anyone turns on the balancer or PG merging, then the delete sleep is exercised all the time. Before, I think it was 12.2.11 or 12.2.12, this balancing was kind of disruptive and you had to really throttle it back, and when we enabled the sleep, everything became much more transparent.