From YouTube: Ceph Tech Talk 2020-06-25: Solving the Bug of the Year
A: Hello everyone, thank you for joining us for this month's Ceph Tech Talk, and thank you to Dan here for volunteering to provide us content for this month, especially at such short notice. So Dan has a nice talk for us: it's "Solving the Bug of the Year". I'll let Dan go ahead and take it away.
B: All right. Thanks for the chance to speak here at the Ceph Tech Talk. I will be talking about solving "the bug of the year". That's in quotation marks because it wasn't me that called it that, but it was kind of exciting that one of the main Ceph devs called it a candidate for bug of the year. So, I'm Dan, I'm from CERN IT.
B: So, a quick recap of Ceph at CERN. CERN, as you probably know, is the European center for nuclear research, in Geneva.
B: Ceph has been a key part of our IT infrastructure since 2013, notably for block storage and CephFS for OpenStack, but we also have an S3 service for object storage, and then we also have some RADOS clusters just for custom storage services.
B: Just to give some setting: in 2013 we started offering Ceph RBD via OpenStack Cinder block storage for our cloud. Ceph RBD proved to be incredibly reliable over the years; we had very few, short outages, mostly related to network connectivity, and over time more and more use cases moved onto our cloud.
B: Now, as of February this year (just to give some idea of the importance of Ceph at CERN) we had around 500 different shared OpenStack projects using RBD: audiovisual applications, databases, repositories, engineering and physics applications, and more, plus more than 1200 personal OpenStack projects making use of the block storage. In total we had more than 6,000 Cinder volumes, about four petabytes raw used, split into two pools in two rooms.
B: Here's the timeline of what happened on February 20th, this infamous day. At around 10:11, when we were at our coffee break, we got a message that the main RBD cluster suddenly had 25 percent of its OSDs down; the PGs were all inactive and all I/Os were blocked. The whole cloud is down, basically. We started investigating a few minutes later, and we noticed the OSD processes wouldn't restart and their log files were all showing CRC errors in the OSD map.
B: After a couple of hours investigating ourselves, at 12:30 we were checking with the community on IRC, the mailing list and the bug tracker. Around an hour later we understood the problem at a basic level and had a simple workaround, and a few hours after that we brought the subsequent rooms up, and we had the service up and running by the evening.
B: Now, Ceph maintains a small OSD map which describes the state of the cluster, and I've given an example here for a tiny cluster. It has some information about when the cluster was created, various flags, the list of your pools, and also information like the OSDs and when they've been up and running.
B: The OSD map has all of the information needed for clients and servers to perform I/O and recover from failures. So that's OSD state information, but it also contains the CRUSH map, which is a description of the infrastructure (the rooms, racks, hosts, etc.) and the data placement rules. An OSD map is shared by peers across the cluster, and an OSD can't do any I/O unless it has the latest epoch, or version, of the map. Each OSD is persisting (saving to disk) all of the recent epochs, maybe 500 of them.

B: A bit more detail: developers rely on some internal functions to encode and decode an OSD map, between the OSDMap C++ instance and its serialized form.
B: So the serialized form is the one that you share over the network, and then you can have it in a class as well in the code. Advanced Ceph operators are used to using tools like osdmaptool to print and manipulate an OSD map.
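(For illustration, not commands shown in the talk: this is roughly how an operator can pull a map out of a live cluster and inspect it with those standard tools.)

```sh
# Fetch the current OSD map (or a specific epoch) from the monitors
ceph osd getmap -o /tmp/osdmap
# Print it: epoch, flags, pools, OSD up/in state, and so on
osdmaptool --print /tmp/osdmap
```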
So, a bit into our outage now. How did the actual OSDs crash? At 10:11, 350 of the OSDs out of 1300 crashed at the same time, with this back trace.
B: So basically you have handle_osd_map, a function which (actually, looking into the code, you can see what this does) receives an incremental OSD map from a peer and then crashes when decoding it. Making matters worse, the OSDs wouldn't restart, and they have a slightly different back trace when they try to restart.
B: So we knew that the OSD map had been corrupted somehow, and this is what was causing the crashes. We reached out to the Ceph IRC channel, who very quickly pointed us to a related thread and an issue (I'll give the materials after the talk, but you can follow the links). This had been seen before, and there was a quick fix known for how to recover from the actual problem.
B: The fix is to overwrite the corrupted version in each failed OSD's object store. So we used the ceph-objectstore-tool to set the OSD map in the object store; we did this for every failed OSD, across several corrupted epochs, and that brought the cluster back online. This is the process that took a couple of hours for us to script and get right. After six or seven hours of downtime we were back online, but of course questions remained: why did the maps suddenly get corrupted, and is it going to happen again?
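(A hedged sketch of the shape of that scripted recovery; the tool and its set-osdmap operation are standard, but the loop details, paths and numbers here are illustrative rather than CERN's actual script.)

```sh
EPOCH=2900000    # illustrative: a corrupted epoch reported in the OSD log
OSD=42           # illustrative OSD id
# 1. Fetch a known-good copy of that epoch from the monitors
ceph osd getmap $EPOCH -o osdmap.$EPOCH
# 2. With the OSD stopped, overwrite the corrupted on-disk copy
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD \
    --op set-osdmap --file osdmap.$EPOCH
```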
B: To help with that root cause analysis, we started by looking at a diff of the valid and the corrupted copies of one OSD map. If you look through this diff of a good one and a bad one, you see some bit flips: there's a bit flip, there's a second bit flip, there's a third, and there's a fourth somewhere, from a one to a zero. There are four bit flips, you can trust me. So we now had some different theories.
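(To make that concrete, here is one generic way to spot such flips between a good and a bad copy of a map; illustrative commands, not necessarily the ones used during the incident.)

```sh
xxd osdmap.good > good.hex
xxd osdmap.bad  > bad.hex
# Each differing hex pair is a corrupted byte; XOR the two values
# to see exactly which bits flipped.
diff good.hex bad.hex
```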
B: Why are there bits flipping in the OSD maps in this cluster? They can have a few different sources: they can be memory errors (uncorrectable ECC errors), it could be network packet corruption, or it could be software bugs. Let's go through these. Could it be a memory error? This was our first theory.
B: We searched all the servers' IPMI error lists and dmesg output, which would normally print something, but there was no evidence of any ECC errors in memory. And also, if you think about it, it's not obvious how a memory error could affect so many servers simultaneously.
B: So, could it be packet corruption? TCP checksums on the network are notoriously weak (there's a link you can follow to a paper on that). Ceph uses crc32 to strengthen the messaging layer, and it ends up quite reliable, so it would be extremely unlikely for packet corruption to hit all the servers at the same time and corrupt the OSD map in the same way; and it's not clear how a single checksum error could propagate across the cluster.
B: Confusingly, there was a spike of TCP checksum errors throughout this incident in our router and switch logs. But correlation is not always causation.
B: So the evidence pointed to a software issue, but what exactly? We had some different clues. It must be extremely rare, because looking at the tracker and the mailing list there were only two other similar reports across the thousands of Ceph clusters out there in the world. We had some initial clues: we have two types of on-disk formats in Ceph, FileStore and BlueStore, and only our BlueStore OSDs were affected.
B: The other bug reporters had mixed flash and HDD clusters, and only the SSDs were impacted. Sage pointed to a recently found race condition in the OSD map code, where multi-threaded access to shared pointers could cause a corruption, but it wasn't that; anyway, we'll get to that in a minute. There's also a feature called OSD map deduplication that seemed maybe worth looking into, and maybe compression could be it.
B: When we reported this, Sage got in touch pretty quickly and suggested disabling this OSD map deduplication. Just a little bit about that feature, because it plays a role later. Recall that each OSD is caching several hundred decoded OSD map epochs. Most of the time, nothing changes between different versions of an OSD map, so to save memory there's a feature in Ceph to deduplicate the OSD maps in memory.
B: So it just uses pointers to point at the previous version's copy of the data for each member. Suspecting that maybe there was a bug in there corrupting the OSD map, we set it to false (turned it off) very quickly, on the first day.
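(To illustrate the idea only, with made-up types rather than Ceph's actual classes: consecutive cached epochs can share their unchanged members through shared pointers, copy-on-write style.)

```cpp
#include <memory>
#include <vector>

// Stand-in for a big, rarely changing member of a decoded map.
struct CrushMap { /* buckets, placement rules, ... */ };

struct MapEpoch {
  unsigned epoch = 0;
  std::shared_ptr<const CrushMap> crush;               // rarely changes
  std::shared_ptr<const std::vector<int>> osd_states;  // rarely changes
};

// Build epoch N+1 from epoch N, sharing whatever didn't change.
MapEpoch next_epoch(const MapEpoch& prev, bool crush_changed) {
  MapEpoch e = prev;  // share everything by default
  e.epoch = prev.epoch + 1;
  if (crush_changed)
    e.crush = std::make_shared<CrushMap>(*prev.crush);  // private copy
  return e;
}

int main() {
  MapEpoch e1{1, std::make_shared<CrushMap>(),
              std::make_shared<std::vector<int>>(100, 1)};
  MapEpoch e2 = next_epoch(e1, false);
  // Two cached epochs, one CrushMap allocation between them.
  return e1.crush == e2.crush ? 0 : 1;
}
```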
But let's review the rest of the functions. So here's that handle_osd_map function in the OSD. An incremental message arrives and is processed in this handle_osd_map function, and it's pretty simple. It arrives; it checks the CRC of the message. That's one valid CRC.
B: It reads the previous full OSD map from disk and decodes it, which checks the CRC. That's two valid CRCs. Then we apply the incremental changes from the message to the previous full map, and then we encode a new OSD map, check the CRC again, and write it to the disk.
B: So that's three CRC checks. From the back traces we know that we receive an incremental map, apply it and store it with no CRC errors; we validate the CRCs at least three times. But then, for the next incremental map that comes one second later, we try to re-read the map that we just wrote to disk, and now there's a CRC error. So this means that something must be corrupting the OSD map after we encode it, but before the bits are written to the disk.
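(As a toy model of where that leaves the corruption window; stand-in logic using zlib's crc32, nothing from Ceph itself. The point is only where the three checks sit relative to the write.)

```cpp
#include <zlib.h>   // crc32(); link with -lz
#include <cassert>
#include <string>

static uLong crc_of(const std::string& s) {
  return crc32(0L, reinterpret_cast<const Bytef*>(s.data()), uInt(s.size()));
}

int main() {
  std::string full = "full-map-epoch-N";   // previous full map, from disk
  std::string inc  = "+delta-epoch-N+1";   // incoming incremental
  uLong full_crc = crc_of(full), inc_crc = crc_of(inc);

  assert(crc_of(inc)  == inc_crc);   // check 1: incremental message CRC
  assert(crc_of(full) == full_crc);  // check 2: decoding the previous full map
  std::string next = full + inc;     // "apply" the incremental (stand-in)
  uLong next_crc = crc_of(next);
  assert(crc_of(next) == next_crc);  // check 3: the newly encoded map
  // write_to_disk(next) would happen here; in the outage, the bytes that
  // reached the disk no longer matched next_crc.
  return 0;
}
```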
We studied that race condition theory that I mentioned, and it seemed unlikely to be related. We checked the OSD map dedup implementation; it's very simple, and it seemed very unlikely to be buggy. So we started to look further down, at BlueStore itself. Some more deep thoughts.
So let's look again at that hex diff. There was something that bothered me about this hex diff over the weekend after this incident. See if you notice anything strange here... okay, I'll tell you what I found strange: it was this part here.
B: The addresses where the corruptions were happening were at almost exactly a 128-kilobyte boundary in the object. And that reminded me of something, because maybe three months earlier we had enabled compression on the cluster, and we had tweaked some options to set the compression blob size in the OSD to 128 kilobytes. So this will sound familiar. Could it be compression?
B: That's what I was thinking. Here are our settings; could it be related? It still seemed very unlikely.
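(For reference, the configuration in question had roughly this shape. The mode and algorithm values are the ones discussed in the talk; the blob-size option name is the standard BlueStore knob, and its value here is the 128 KiB mentioned above.)

```ini
bluestore_compression_algorithm = lz4
bluestore_compression_mode = aggressive
bluestore_compression_max_blob_size_hdd = 131072   # 128 KiB
```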
But then I got in touch with the other bug reporters. Are you using compression? Yes. Are you using lz4, like us? Yes. Are you using aggressive, like us? Yes. Okay, which OS? CentOS 7, the same as us; and one of the others was using Ubuntu, a slightly old version of Ubuntu, a couple of years old.
B: So now we're sure what the problem is. Here's the story of lz4. lz4 is a lossless data compression algorithm, very famous; probably much of the web is compressed with this thing. At CERN we enabled it in December 2019, primarily because the block storage is highly compressible; we can save around 50 percent of our space. Five days... so, the bug happened on a Thursday; on the Tuesday...
B: So, let's try reproducing it. When we write, as I said a couple of slides ago, BlueStore breaks an object into 128-kilobyte blobs and then compresses and stores each blob individually. So we tried splitting our OSD maps into 128-kilobyte blobs, but we couldn't reproduce any corruption. Going deeper, we started learning (I started learning) about word alignment and fragmented memory.
B: When an OSD map is compressed, the thing in the code that gets passed to lz4 is actually not a nice, contiguous char* array. It's not a contiguous allocation of memory; it's actually jumping all over the place. The Ceph decoders and encoders are very memory efficient, partly because of that deduplication as well, so you can be sure that your OSD map is actually going to be a sort of linked list of random locations in memory, and those are often not word-aligned.
B: So, reproducing the bug: we eventually did it. There is a new unit test in Ceph, and we could reproduce the bug.
B: You need to take a good OSD map from the disk, decode it into an OSDMap class, then encode it back, and then compress that. The action of encoding a new serialized map, with this kind of memory scatter (the bytes scattered across memory), and then compressing it, triggers the bug. So there's a new unit test that reproduces the corruption bit for bit with what we saw in real life. zlib, snappy and zstd all pass this test with no changes, but lz4 fails it. So, compressing unaligned memory is complicated.
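(As a flavor of what such a test does, here is a standalone sketch, not the actual Ceph unit test: drive liblz4's streaming API with one logical buffer carved into scattered, oddly aligned chunks, the shape of input a fragmented encoder produces, then verify the round trip.)

```cpp
#include <lz4.h>   // link with -llz4
#include <algorithm>
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // One logical payload standing in for a serialized OSD map.
  std::string payload(128 * 1024, '\0');
  for (size_t i = 0; i < payload.size(); ++i)
    payload[i] = char((i * 2654435761u) >> 13);

  // Carve it into irregular chunks, each in its own allocation, prefixed
  // with one pad byte so the chunk data starts at an odd address.
  std::vector<std::string> chunks;
  for (size_t off = 0, n = 1; off < payload.size(); off += n, n = n * 3 + 7)
    chunks.push_back('x' + payload.substr(off, std::min(n, payload.size() - off)));

  // Compress chunk by chunk with the streaming API, roughly the access
  // pattern a fragmented bufferlist gives the BlueStore lz4 plugin.
  LZ4_stream_t* cs = LZ4_createStream();
  std::vector<std::string> blocks;
  for (const auto& ch : chunks) {
    const char* src = ch.data() + 1;        // skip the pad byte: unaligned
    int n_in = int(ch.size()) - 1;
    std::string out(LZ4_compressBound(n_in), '\0');
    int n_out = LZ4_compress_fast_continue(cs, src, &out[0], n_in,
                                           int(out.size()), 1);
    assert(n_out > 0);
    out.resize(n_out);
    blocks.push_back(std::move(out));
  }
  LZ4_freeStream(cs);

  // Decompress everything back into one contiguous buffer and compare.
  LZ4_streamDecode_t* ds = LZ4_createStreamDecode();
  std::string result(payload.size(), '\0');
  char* dst = &result[0];
  for (const auto& b : blocks) {
    int n = LZ4_decompress_safe_continue(ds, b.data(), dst, int(b.size()),
                                         int(&result[0] + result.size() - dst));
    assert(n >= 0);
    dst += n;
  }
  LZ4_freeStreamDecode(ds);

  // On lz4 >= 1.8.2 this round-trips; 1.7.5 could corrupt such input.
  std::cout << (result == payload ? "round-trip OK" : "CORRUPTED") << "\n";
  return result == payload ? 0 : 1;
}
```

(The eventual Ceph workaround, covered below, does the opposite of the scatter step in this sketch: it rebuilds the data into one contiguous buffer before lz4 ever sees it.)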
The developer of the lz4 algorithm has a nice blog post from a few years ago explaining how he optimized compression of unaligned memory. So we suspected that maybe there was a bug in that area, and we reached out to him on GitHub (this is Cyan4973): hey, have you seen this kind of corruption before? And he replied: nope, that's a new one to me!
B: So we had a thread going with the developer; we tried some different configurations and different compilation options, but nothing changed the behavior. And then Sage kind of came to the rescue. Sage noticed that a newer version of lz4 on his development box didn't corrupt the OSD map, so he bisected the commits rather quickly and found the exact commit in lz4 that fixes the problem. So in the end this was, somehow, a known issue in lz4 that had shown up in some of the lz4 unit testing. Basically, what the fix amounts to is this: if lz4 is compressing from data scattered in memory, it can corrupt the output; if you consolidate the data into one single contiguous buffer, that works around the problem.
So lz4 version 1.8.2 and newer already includes this fix but, alas, CentOS 7 and Ubuntu 18.04 use 1.7.5, so Ceph needs a workaround, or the OSes also need to upgrade their version.
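(If in doubt, you can check what your distro ships; illustrative package queries.)

```sh
rpm -q lz4                         # RHEL / CentOS
dpkg -s liblz4-1 | grep Version    # Debian / Ubuntu
```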
B: This pull request here changes the lz4 plugin to rebuild the data buffers in contiguous memory. It has been merged to Nautilus and will come in the next release; it's already out in Octopus, but it's not yet backported to Mimic. So if you're using lz4 compression on these releases, just be slightly cautious. Now, some comments on the actual impact of this. It's the combination of the OSD compression mode aggressive and the lz4 algorithm that triggers this bug.
B: If you have OSD compression mode passive, then only the user data in a pool is compressed. When we use aggressive, BlueStore compresses everything, even the OSD maps or other metadata that the OSD might need to store. On top of this: client data was not corrupted. Of course, this was our concern: was client data corrupted?
B: RBD data that comes in from the OpenStack clients is always written from a contiguous buffer, and also we have hundreds of ZFS file systems on top of RBD that are doing their own independent checksums of the data; we would have noticed by now if data had been corrupted, and it hasn't been. The corruption was, anyway, incredibly rare: this cluster iterated through hundreds of thousands of OSD map epochs before it found a corruptible one that triggered the bug.
B: So we learned quite a lot from this. We learned, primarily, that all software libraries and services can fail, even though they seem to have worked since forever. Even the ones that have five nines of reliability can eventually fail anyway.
B: We also learned that too much reliability leads to unrealistic dependencies, which can then lead to a sort of disaster. We had zero major outages in six years of Ceph, and hundreds of CERN apps built dependencies on top of a single Ceph cluster, which was unrealistic. Thinking back to the Google SRE book, there's a story in there about the Chubby distributed lock service that had a similar story.
B: It almost never failed, a lot of Google services were built on top of it, and eventually, when it did fail, it took down much more than it should have. So there are lessons in the Google SRE book on how to avoid those kinds of things. In our case, we plan to introduce block storage availability zones: we're aiming for three or four different clusters, so that our users can spread their applications across the different clusters and be more available in case of any problems in the future.
So, the list of thanks and credits is quite long. We had Teo and Julian working directly with me on the issue on the day and the days following, and then lots of colleagues, whom you can read here, in the office building, that we were bouncing ideas off of to try to understand what could be the root cause.
B: I also want to thank, of course, the Ceph community, in particular dirtwash on IRC who, within 20 minutes of our outage, pointed exactly to the tracker ticket that was the issue; this got us on the right track right away. Troy and Eric were the ones that had seen this on their clusters before, and they helped out with clues and comparing symptoms. And then, of course, Sage and Yann Collet, the lz4 developer, for their expert input and fixes. So that's the end of my talk, and I'll be happy to answer any questions.
B: So, yeah, okay. So, how did I determine that? The mon stores all the versions of the OSD maps, so we knew the epoch. Through increasing the debugging on the OSD we knew which version the OSD was trying to load, so we knew the epoch number that was stored on each OSD that was corrupted, and we therefore knew which one to extract from the mon and inject.
B: I mean, the epoch number is kind of like the serial number of the OSD map, and it should be the same everywhere, on every disk of the cluster and on every mon of the cluster. In the midst of this bug we also wrote some tools to independently verify the CRC of OSD maps, so those helped along the way; but the short story is that we just extracted them from the mon, where we knew it was good.
B: That's a good question. I also wondered that; I don't think that's been measured by anyone. If you look at the fix, it simply checks whether the bufferlist is contiguous and, if it is not contiguous, it rebuilds the buffer.
A: Okay. And going back to the previous question ("how did you determine the uncorrupted version?"), an additional question came in: so you just went one back from the problem OSD map?
B: There it was. So, this is real; these numbers are not made up. This epoch number, 2.9 million something, was the first epoch that was corrupted in the object store of this OSD 666 (which is also real, by the way, though I comically included it there). So epoch 2808 was good on this OSD, and when this OSD 666 tried to start, you would get an error on the epoch it was loading.
B: I think that maybe the week afterwards we also opened a ticket with Red Hat to update it and, as far as I know, they will have 1.8.2 in one of the CentOS 7 releases very soon, if it's not already there. But at the same time, Nautilus didn't have a fix for this yet, right? So...
B: We did both in parallel. And then there's a question about compression in general: in Ceph, is compression something you would only run on SSD OSDs, or is the performance penalty small enough to run on HDDs?
In fact, I think that intuition is backwards. I mean, hard drives, HDDs, are so slow that the CPU can easily compress the data faster than what the hard drive can absorb and write anyway, and the same goes for decompressing the data.
B: So often, when we use compression in other areas, like CephFS or elsewhere, we see a performance increase in bytes written per second.
B: Brian's asking when it's expected to make it to Nautilus. It's been merged to Nautilus, so it'll be there in 14.2.10, yep. And then Victoria's asking: are all the clusters on Nautilus? Nope. Out of 10 clusters, two are still on Luminous. We have a CephFS cluster that we're still waiting to upgrade; we're waiting for some last-minute bug fixes before we upgrade it to Nautilus. And similarly, excuse me, S3: we're waiting for just a few little bugs to be fixed here and there.
C: [question not captured in the transcript]

B: I think, so, for RBD, I think that doing both is a good idea. There are a lot of gotchas along the way, though. First, the reason why I say "for RBD" is because you often have big four-megabyte or eight-megabyte objects that are easy to split into things that are still relatively large, and then you can compress them easily; likewise with CephFS or S3.
B: If you have smaller files, then maybe it doesn't pay off, because you have minimum sizes that you have to fill with zeros to pad out small files or small objects. Now, the other gotcha about compression on RBD is that (I haven't observed this myself, but there are theories) the actual RBD workload is normally four-kilobyte writes, or maybe 64-kilobyte writes, because it's usually small changes to the blocks; you're not in fact writing four-megabyte objects all at once, you're just writing small changes. So the fact that you're making those small writes might have an impact, and I don't know what the implementation does, whether it reads out the whole chunk and then writes it back, or whether it just somehow writes a small extent. But the fact that RBD does small writes anyway might, over time, kill any kind of space savings that you might have thought you were earning.
B: So, a question from Jared, just asking for clarification: it was the storage node's packaged version of lz4 that caused the problem, not the client side? Correct. It's the version of lz4 running on the OSD servers that's important, because in the case of BlueStore compression the data is compressed by the OSD itself. And Brian's asking if we're still using compression, on the object storage clusters too.
B: I think that... so, the mon uses RocksDB, and the mon can compress using RocksDB compression, but I think that's snappy in all Ceph configurations. I might be wrong; I hope somebody can correct me if I'm wrong. And I think it's off anyway by default. And by the way, I think that when the mon gets the OSD map, it also wouldn't use these encode/decode functions like the OSD does; I think it would just get the blob and write it. I'm not sure about that.
A: All right, Dan, thank you for your time and for sharing with the Ceph community for another Ceph Tech Talk. This will of course be recorded and put up on YouTube. So, thanks.