From YouTube: Brad Hubbard -- Troubleshooting Ceph
Description
Hopefully, I can provide some guidance on how to handle problems when they do arise. I've broken the issues into four main areas: performance, hangs, crashes, and unexpected behaviour. "Hangs" is in inverted commas here because quite often a hang is not actually a hang; it might be something different that just appears to be a hang, such as a process that is still running but just not making any progress.
Performance. When dealing with performance, it's important to establish a baseline so that, if performance does degrade, we know exactly how it has degraded. Most of the tools I've listed here can help you establish baselines for the various subsystems. rados bench is built in to rados and is a quick way to do benchmarking of a Ceph cluster.
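As a minimal sketch of what that looks like in practice, assuming a pool named `testpool` exists that you can write to (the pool name and durations are illustrative, not from the talk):

```shell
# Write benchmark: 30 seconds of writes (4 MB objects by default) to "testpool".
# --no-cleanup keeps the objects so read benchmarks can be run afterwards.
rados bench -p testpool 30 write --no-cleanup

# Sequential and random read benchmarks against the objects written above.
rados bench -p testpool 30 seq
rados bench -p testpool 30 rand

# Remove the benchmark objects when finished.
rados -p testpool cleanup
```

Running this periodically while the cluster is healthy gives you the baseline numbers to compare against later.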
ceph tell osd.* bench does a simple write to the storage of each OSD. You can specify the bytes per write and the total bytes to write.
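A sketch against a single OSD; the argument order shown here (total bytes, then bytes per write) and the defaults of 1 GB / 4 MB are how I understand the command, so check `ceph tell osd.0 bench --help` on your version:

```shell
# Ask osd.0 to write 1 GiB to its backing storage in 4 MiB chunks.
ceph tell osd.0 bench 1073741824 4194304
```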
fio now has an RBD I/O engine, which allows it to talk directly to the storage using librbd, and librbd sits on top of librados, which Patrick was talking about before.
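A hypothetical fio job file for the rbd engine might look like the following; the pool name, image name, and client name are placeholders for an existing pool, an existing RBD image, and your CephX user:

```shell
# Create a fio job file that drives an RBD image directly via librbd.
cat > rbd.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimage
rw=randwrite
bs=4k
runtime=60
time_based

[rbd_test]
iodepth=32
EOF

# Then run it with:  fio rbd.fio
```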
iperf can test your network segments, and you need to test all of them: your public network, your cluster network, OSDs to mons, clients to mons. I've listed iperf here.
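A sketch of testing one segment with iperf (hostnames are placeholders; repeat the client run across each network segment, public and cluster):

```shell
# On one end of the segment, start an iperf server:
iperf -s

# From the other end, run a client against it:
iperf -c osd-node1

# Parallel streams can give a better picture of available bandwidth:
iperf -c osd-node1 -P 4
```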
You can use PCP, sysstat, collectl; anything you'd use to collect statistics on what the entire system is doing.
There's also CBT, the Ceph benchmarking tool, which is under development, but it allows you to run everything under the hood. It can even capture statistics from things like valgrind and blktrace and graph those for you, and, given that it's part of the Ceph project, it's directly dedicated to benchmarking Ceph. Just be mindful of caching and the effects that can have on your performance data.
Slow requests. A request is considered slow within the cluster if it takes greater than 30 seconds to complete by default, although that is tunable as of a recent commit. If you see these in the logs, you should check the health of the cluster and check the indicated hosts for problems that may be affecting performance. The historic ops command that you see there down the bottom will show a collection of the worst-performing recent operations, and the perf dump command will list performance counters; both of these can offer hints at where the issue may lie.
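The two commands just mentioned are issued through the daemon's admin socket on the host where the OSD runs; the OSD id here is a placeholder:

```shell
# Worst-performing recent operations on osd.0:
ceph daemon osd.0 dump_historic_ops

# Performance counters for the same daemon:
ceph daemon osd.0 perf dump
```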
Moving on to hangs: is it really a hang? As I mentioned previously, not everything that appears to be a hang actually is one. However, if you are seeing a true hang, you are likely to see processes in prolonged D state in ps output, and probably a lot of iowait on the CPUs. strace might list output from threads that are running, but none from threads that are stuck.
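One quick way to spot D-state processes; the exact field list is just one reasonable choice:

```shell
# Show pid, state, the kernel function each task is waiting in (wchan),
# and the command name; keep the header plus any task whose state contains D.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'
```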
If you do have processes in D state, that's an uninterruptible sleep state within the kernel, so we need to get information from the kernel on what those threads are actually doing and what they might be waiting on. SysRq, invoked with the 't' trigger, outputs a stack trace for all threads executing in kernel space to syslog.
We execute it twice, with an arbitrary 20-second delay in between, to verify the threads in question are in long-term D state and not just in a transient D state, as normal processes are fleetingly in D state while they're doing I/O; that's common.
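That double-trigger procedure can be sketched as follows (root is required, and SysRq may need to be enabled first):

```shell
# Allow SysRq triggers, then dump kernel stacks for all tasks, twice,
# 20 seconds apart, so long-term D state can be distinguished from
# transient D state.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
sleep 20
echo t > /proc/sysrq-trigger

# The stack traces land in the kernel ring buffer / syslog:
dmesg | less
```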
If you're seeing a lot of these and it's difficult to determine what's going on, and it's not obvious from other data that you've collected, you may need to collect a vmcore. Analysing vmcores is probably beyond the scope of this discussion, as it's quite a big area to get into, and if you have to go to that length you may require assistance from your support organization, unless you have those skills in house.
But in that case you probably won't see high CPU utilization. Stack traces need to be interpreted, and this can require a decent understanding of the issue, but they are definitely worth looking at, as they can provide excellent context on what the Ceph daemon's state is. With gdb scripts or SystemTap probes you can gather considerable data on the state of the running program at regular intervals, so you can set up a gdb script similar to running gstack in a loop there.
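A hypothetical sketch of that gstack-in-a-loop idea (the target process, sample count, and interval are all placeholders, not from the talk):

```shell
# Sample the userspace stacks of a running ceph-osd at regular intervals.
PID=$(pidof ceph-osd | awk '{print $1}')
for i in $(seq 1 10); do
    date
    gstack "$PID"    # or: gdb -batch -p "$PID" -ex 'thread apply all bt'
    sleep 20
done > osd-stacks.txt 2>&1
```

Comparing successive samples shows which threads are making progress and which are parked in the same place every time.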
The hash in green represents the git commit that this ceph binary was built from, and the version number is shown there in blue; they can be important, and I'll mention them later on. You can also see this information by running ceph -v. ceph -v will show you the version of Ceph, but it will also show you that SHA-1 hash, which indicates what commit that version of Ceph was based on. We can also see the assert occurred on line 79 of HeartbeatMap.cc.
OK, and down the bottom here we can see the line of code where the assert was called. The code logic has checked the suicide timeout and has determined it has been exceeded; this is considered to be a fatal problem and the process cannot continue, so we programmatically terminate the process. The developer has done this on purpose. They've said: if we exceed that timeout, I'm going to call this assert, and this assert is guaranteed to fail. This assert will always fail. So it's, like it says, committing suicide.
These are usually memory accounting or memory access errors, although they could be indicative of other memory problems, perhaps memory exhaustion, or even problems with the memory hardware itself. I would say these can go straight to a bug report if one doesn't exist already. So have a look at the Ceph tracker, have a look at Bugzilla, and see whether there's already a matching error report. If there isn't, you can go ahead and submit one, definitely. Here's an example.
This is an OSD process that was running. The function in red in frame 9 is the most interesting, as it is the last Ceph function to execute before the program entered the glibc abort code, which is exactly what is executing in the frames above that, frames 8 through 1; they are all the actual abort code after the assert is called, sorry, after the signal is handled.
eu-addr2line gives the source code line that a memory address points to, so that can be very handy; you can run it on the memory address and get sent to the source code. Depending on inlining and the amount of assembly interpretation required, objdump output can be tricky to interpret, and I find the gdb output provides the best information in most cases. You start gdb with the ceph-osd binary and the core; note, it has to be the identical binary that crashed.
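A sketch of both approaches; the address, binary path, and core file name are placeholders you would take from your own crash:

```shell
# Resolve a backtrace address to a source file and line
# (the binary must be the exact one that crashed):
eu-addr2line -e /usr/bin/ceph-osd 0x7f1234567890

# Or open the core file in gdb against the identical binary,
# then at the (gdb) prompt run:  thread apply all bt
gdb /usr/bin/ceph-osd core.12345
```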
In the case of unexpected behavior, it is important to identify what the expected behavior was and what actual behavior occurred. Always try to get a timestamp of when the problematic behavior occurred, as that will make things a hell of a lot easier to trace in the logs. Start at the user end, where the error or behavior is seen, and work backwards towards the Ceph cluster, tracing the process in each log as you go.
OK, these are the debug logging options for OSDs and their recommended values, with the top two or three being the first to try, as they are the most likely to show any issues. Enabling too many of these options is not recommended, as it can flood the logs with data, making them difficult to interpret. So you really don't want to turn all of these on at once; you're going to get a lot of output sent to the logs.
A
If
there's
an
indication
of
a
specific
area
that
that
that
area
is
suspect,
then
you
might
want
to
enable
the
option
that
that
deals
with
that
particular
areas,
such
as
a
file
store
or
the
journal
that
may
be
warranted
if
you've
positively
identified
that
that
particular
area
is,
is
the
problematic
one
same
for
the
moms
and
once
again,
the
top
three
are
probably
the
ones
that
that
you'll
need
top.
One,
obviously
is
for
debugging
normal
monitor
transactions,
second
ones
messaging.
OpenStack issues can be difficult to debug sometimes, due to the number of separate processes involved and the amount of logs. Sometimes the actual error does not get passed up to the higher levels intact, so the error you end up with is not really representative of the issue. A timestamp is pretty much essential in these situations, so you can follow the entire transaction through the logs from end to end. You may need to turn up logging verbosity for all processes involved, or for each in turn.
In order to get to the bottom of the issue, you may find the problem is in OpenStack code rather than it being a Ceph issue. So, in other words, you might start at the Nova level, trace it down to Glance or Cinder, and it stops there: you find the actual error there before it ever gets to the RBD level or the Ceph cluster.
krbd, the kernel module, logs to dmesg and syslog, so it should hopefully be obvious if you are seeing an issue with it. If not, this module may require instrumentation with SystemTap in order to establish what's going on. So this is one area where SystemTap may come into play, definitely with the kernel RBD client.
This invocation would enable debug logging for all OSDs in the cluster, which may not be what you want. You can restrict it to individual OSDs if you want, by specifying their IDs rather than using a wildcard. So if you have osd.10 and you want to turn up debugging just for that OSD, you can specify only its ID in the second parameter. And the command to reset debug logging to the default is the one shown beneath it.
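A sketch of those invocations; the OSD id and debug level are illustrative, and the 1/5 value is what I believe the shipped debug_osd default to be, so verify against your version's defaults:

```shell
# Raise OSD debug logging on every OSD (the wildcard) ...
ceph tell osd.* injectargs '--debug-osd 20'

# ... or on a single OSD only:
ceph tell osd.10 injectargs '--debug-osd 20'

# Reset that OSD to the shipped default afterwards:
ceph tell osd.10 injectargs '--debug-osd 1/5'
```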
Ceph keeps a copy of the most recent log entries in memory, at the verbosity specified in that second field there, and that's what gets dumped into the log in the event of a crash. So if you're logging normally with Ceph and it crashes, it will dump out the stack trace, which I presented in one of the earlier slides, but it also dumps out the recent history of its logs, and those logs are shown at a higher verbosity level than what you would have set by default, which is zero, just normal logging.
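As a sketch of that two-level syntax (the values here are illustrative), a debug option in ceph.conf takes a file verbosity and an in-memory verbosity separated by a slash:

```
[osd]
# file-log verbosity / in-memory verbosity kept for crash dumps
debug osd = 1/5
```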
This gives you the source code as it was at the time of that release, and it should match the source of the binaries you are using. So when you're looking at the source code, always try to match up the source code version to the binary version that you're using; otherwise it can be misleading and pretty confusing.
For downstream source, you can download the source package, install it, and use the rpmbuild command to prep the source and apply the necessary patches. I'm afraid I'm fuzzy on how this works for deb-based distros, but I'm sure there are equivalent commands on those to accomplish the same end. Has anyone actually done this, like downloaded a Debian source package and prepped the source to get the source tree for a Debian binary? Anyone using Ubuntu?
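A sketch of that flow on an RPM-based distro; package and spec names can vary between distros, so treat this as an outline rather than exact commands:

```shell
# Fetch and install the source RPM (unpacks into ~/rpmbuild):
yumdownloader --source ceph
rpm -ivh ceph-*.src.rpm

# Prep stage only: unpack the source and apply the downstream patches.
rpmbuild -bp ~/rpmbuild/SPECS/ceph.spec

# The patched source tree is then under:
ls ~/rpmbuild/BUILD/
```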
It allows you to navigate the code base and jump between files quickly and efficiently. That gtags module is the GNU Global tags module; it's really, really good for jumping around source code in vim. It works for C, C++ and Java; C++ isn't quite as good as C, but it's still very usable and very handy, a lot better than trying to jump around in the files without it. I should mention, I guess, that most of the Ceph code base is C++, so that's what you'll be reading most of the time.
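A sketch of getting started with GNU Global on the ceph tree; the symbol name here is just an example class from the code base:

```shell
# Generate the tag files at the top of the source tree:
cd ceph && gtags

# Look up a symbol from the shell:
global -x OSDService     # where it is defined
global -rx OSDService    # where it is referenced
```

Inside vim, the gtags.vim plugin shipped with GNU Global exposes the same lookups via a :Gtags command.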
Resources. I hope I've given you some ideas for looking into Ceph issues as they arise, but it is important to realize that you're not alone, and there are various sources of help available. Don't be afraid to give a shout-out on the mailing list or the IRC channel. You can also check the Ceph bug tracker and Bugzilla to see if your problem is a known issue.
A
There
are
also
some
excellent
troubleshooting
docks
under
the
SEF
storage
cluster
section
on
Doc's,
f
com,
so
I
haven't
gone
into
those
areas
of
troubleshooting
because
they're
very
well
covered
within
that
documentation,
and
I
highly
recommend
that
you
go
and
have
a
look
at
the
dock.
The
excellent
documents
on
Doc
CF
com-
that's
one
of
the
good
things
about
Seph.
Is
it's
very
well
documented?