From YouTube: July 2019 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development. http://tracker.ceph.com/projects/ceph/wiki/Planning
B: There's example output here that shows this. I set the warning level to one microsecond, so anything over one it's going to report on basically all of my connections. We can see that the back interface had a long ping time, with the longest one shown in the summary, and the same thing for the front interface.
B: It's showing you which connections on which interfaces. Then I changed the algorithm; a possible improvement right now is showing whether it's been going from largest to smallest rather than smallest to largest, meaning it's lately been doing okay. Oh, sorry, I should have prefaced this: we currently do one-minute, five-minute, and fifteen-minute intervals and get the average ping time across those intervals.
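A minimal sketch, as an aside for readers, of how rolling one-, five-, and fifteen-minute ping summaries like these could be maintained from raw heartbeat samples. This is an illustration only, not the actual Ceph OSD code; the sample structure and window handling are assumptions.

```python
import time
from collections import deque

class PingTracker:
    """Illustrative rolling 1/5/15-minute ping-time summary (not the Ceph implementation)."""

    WINDOWS = (60, 300, 900)  # seconds: 1, 5 and 15 minutes

    def __init__(self):
        self.samples = deque()  # (timestamp, ping_time_in_seconds)

    def add_sample(self, ping_time, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, ping_time))
        # Drop samples older than the largest window so memory stays bounded.
        while self.samples and now - self.samples[0][0] > max(self.WINDOWS):
            self.samples.popleft()

    def summary(self, now=None):
        now = time.time() if now is None else now
        out = {}
        for window in self.WINDOWS:
            recent = [p for t, p in self.samples if now - t <= window]
            if recent:
                out[window] = {"avg": sum(recent) / len(recent), "max": max(recent)}
        return out
```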
B: There might be more ways of sorting this such that you could see that, oh, this switch is failing because it's only, say, the rack-to-rack communication that's going slow, but we don't have that right now. If you query an OSD, I changed this, Josh, from what you might have seen before, so that there's consistency with the manager output. So if you query an OSD you'll just see every heartbeat-interval average that it's maintaining to every other OSD. I put a zero here, so it would dump everything out.
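As an illustration of the query being described, something like the following could pull the per-peer averages from a single OSD's admin socket. The dump_osd_network command name and its threshold argument are taken from this discussion and the pull request under review, so treat them as assumptions rather than a settled interface.

```python
import json
import subprocess

def dump_osd_network(osd_id, threshold_ms=0):
    """Ask one OSD for its per-peer heartbeat ping averages.

    A threshold of 0 dumps every connection, as described above.
    The command name and arguments are assumptions from the discussion.
    """
    cmd = ["ceph", "daemon", f"osd.{osd_id}", "dump_osd_network", str(threshold_ms)]
    return json.loads(subprocess.check_output(cmd))
```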
B: So something that we talked about was, instead of averaging over the interval, to get the worst ping during the interval instead. If anybody who knows more about networking hardware and networking failures could suggest whether the average is better, or the maximum, or something else to monitor during these intervals, that would be great to hear.
B: Okay, so would you say that it should be the... So I'm keeping track of one-minute data, and then I average that into the last five minutes, and then the last fifteen. So is the five-minute maximum the maximum across each of those minutes, or is it the average of the five one-minute maximums?
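To make the question concrete, here is a tiny sketch of the two candidate definitions of the five-minute figure, assuming one maximum is kept per minute; the numbers are made up.

```python
# Hypothetical per-minute maxima (milliseconds) for the last five minutes.
minute_maxima = [1.2, 0.9, 14.7, 1.1, 1.0]

five_min_max_of_maxima = max(minute_maxima)                       # 14.7 ms: the single worst ping
five_min_avg_of_maxima = sum(minute_maxima) / len(minute_maxima)  # ~3.8 ms: smooths out the spike
```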
B: You know, fortunately, it's not exactly six seconds; it's a random value between 0.5 and 5.9, I think, based on the six seconds. So it isn't a precise minute either, right, because when the next heartbeat triggers it checks whether a minute has gone by, but at that point obviously it could be, you know, another five seconds in, but...
B: And I did change the config in my pull request to a dev option, because you would completely invalidate this mechanism if you changed the interval to, you know, greater than 30 seconds, or certainly greater than 60 seconds. So I changed it to a dev option and it has a restricted range. Even so, the max is 60, so I guess it would only get one trigger per minute.
B: Yeah, okay.
E: Unless they set the option on, like, a specific monitor, or only in the mon section or something like that. But right, you're not the only one; there are a whole bunch of options that are like that, where the naming is done that way, yeah.
B: Well, yeah, I called it mon warn because the mon produces the warnings; even though it's part of the dump output mechanism, it is for the warning. That's why I did call it mon warn. So from now on we can kind of assume that configs are, for the most part, always the same on all daemon types, and in this case it's nothing critical or anything if it isn't.
B: Okay, and oh, a big item is: if we added load averages to the heartbeats, then we could kind of see that, oh well, this guy is slow to respond to a ping, but hey, we can see his load is high, and so it probably wasn't the network. Now, I don't know what the math is going to be on that, you know, how much load means what; it's probably going to be tough because of how many CPUs there are.
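A rough sketch of the kind of load signal being discussed: read the one-minute load average and normalize it by CPU count so the number is comparable across hosts. The threshold and the placement in the heartbeat are illustrative assumptions, not Ceph code.

```python
import os

def normalized_load():
    """One-minute load average divided by CPU count (Unix-only sketch)."""
    load1, _, _ = os.getloadavg()
    return load1 / os.cpu_count()

# An OSD could piggyback this on its heartbeat reply; e.g. a value above 1.0
# suggests the host is saturated, so a slow ping reply may not be the network.
```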
B: Right, I mean, now that I've been playing with this I'm almost wondering if we even care about having the OSD report its heartbeats. Maybe you could have an option to the manager, one that just says only show me OSD zero. I mean, there's a bit of a delay from it flushing to the manager, but, well...
B: And then obviously we have that information, or multiple pieces of information. Oh, this is why the OSD would want to collect load, because then it could say: this OSD is heavily loaded, I'm not going to fail it after the 20 seconds. We'd have more flexibility there; I'm going to give it extra time.
A: Yeah, it reports it down, and then the monitor can decide whether to actually mark it down or not, and the monitor is already adjusting the grace period based on past reports that have turned out to be false. So there could be more inputs to that, like load average or, I guess as Chad was suggesting, disk utilization.
H: Yeah, from the support point of view, I mean, we have seen so much of this disk utilization. When we try to troubleshoot slow requests or performance issues, we always try to see whether it is because of the network, or whether the OSD is not responding because the disk itself is so busy.
H: So if we get the disk utilization into this OSD and manager output, with one column as the percentage utilization, it will be very clear that, okay, this OSD has the highest disk utilization compared to the others. Oh, because load average would be global to that node, but the disk utilization would be per OSD, yeah.
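A sketch of how a per-OSD disk utilization percentage could be sampled on Linux, similar to iostat's %util column; the device-to-OSD mapping and the sampling interval are assumptions for illustration.

```python
import time

def io_ticks_ms(device):
    """Milliseconds the device spent doing I/O (13th field of /proc/diskstats)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError(f"device {device!r} not found")

def disk_util_percent(device, interval=1.0):
    """Approximate %util over `interval` seconds."""
    before = io_ticks_ms(device)
    time.sleep(interval)
    after = io_ticks_ms(device)
    return min(100.0, (after - before) / (interval * 1000.0) * 100.0)

# e.g. disk_util_percent("sdb") for the device backing osd.2 (mapping assumed known).
```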
D: If we start adjusting the grace based on certain node conditions or resource-utilization conditions, there should be a limit as to how much we're adjusting; at some point we should probably not auto-adjust. There should be a hard stop at some point. Maybe limit it to, say, three iterations, and then report it and let the user decide.
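A minimal sketch of the hard-stop idea: adjust the grace for load, but clamp both the growth factor and the number of automatic adjustments before surfacing it to the user. The constants and function shape are illustrative assumptions.

```python
BASE_GRACE = 20.0         # seconds, the normal heartbeat grace
MAX_FACTOR = 2.0          # never allow more than 2x the base grace
MAX_AUTO_ADJUSTMENTS = 3  # after this, stop adjusting and report instead

def adjusted_grace(load_factor, adjustments_so_far):
    """Return (new_grace, should_warn_user). Purely illustrative."""
    if adjustments_so_far >= MAX_AUTO_ADJUSTMENTS:
        return BASE_GRACE * MAX_FACTOR, True   # hard stop: let the user decide
    factor = min(MAX_FACTOR, max(1.0, load_factor))
    return BASE_GRACE * factor, False
```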
A: Sure. One thought about the naming: these are just tracking ping times, but in the future maybe you want to do some kind of network monitoring among other daemons too. So should we have, like, OSD in the name of these things?
E: Okay, so the idea would be that, you know, a number of alerts come and go. You could additionally have a mute that the monitor will record; the mute is associated with one of those health codes, and it has a time-to-live.
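A sketch of the proposed mute record as described here: keyed by health code, with an optional time-to-live, kept by the monitor. The names and fields are assumptions based on the conversation, not a final interface.

```python
import time

class HealthMutes:
    """Illustrative store of per-health-code mutes with a time-to-live."""

    def __init__(self):
        self._mutes = {}  # code -> expiry timestamp (None means no expiry)

    def mute(self, code, ttl_seconds=None):
        self._mutes[code] = None if ttl_seconds is None else time.time() + ttl_seconds

    def is_muted(self, code):
        if code not in self._mutes:
            return False
        expiry = self._mutes[code]
        if expiry is None or time.time() < expiry:
            return True
        self._mutes.pop(code, None)  # expired: drop the mute
        return False

# mutes.mute("OSD_DOWN", ttl_seconds=3600)  # silence OSD_DOWN for an hour
```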
E: I guess one question is: if you don't specify a time-to-live, does it just mute forever, or do you enforce a maximum time-to-live? It could go either way. You'd specify which alert you're going to mute and how long you'll mute it for, and it'll then go look at the current alert that matches and also track some additional stuff to match that particular instance of that alert.
E: We could have a wildcard that'll mute everything; I'm not sure if that's helpful or not. It might be nice if you're on the CLI, but it's also like, why would you mute everything without knowing exactly what you're muting? So who knows; I don't have a strong opinion on that one. But once you have things muted, everything you look at will also tell you that those things are alerting but muted, so that you are aware of it.
E: So basically the rest of this pad is all about how to deal with that, because each health alert behaves a little differently. But, like, phase zero would just be to match on the description, just to get the rest of the pieces in place. The real challenge is around when you unmute things, and I think we're just going to need a bunch of different modes, because each health alert behaves a little differently. So, for example, OSDs down: it'll say, like, so many OSDs are down.
E: But maybe, if those two OSDs resolved and then a new OSD went down, then maybe you do unmute. So you actually want to look at, like, the set of which OSDs it is, and as long as that set shrinks, then it's fine, and if the set grows, then it unmutes. Other things, like, you know, degraded, have a percentage, a number that you want to decrease versus increase. Maybe we just look at the detail items.
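A sketch of that rule for an alert like OSD_DOWN: stay muted while the set of detail items only shrinks, and unmute as soon as a new member appears. Purely illustrative.

```python
def should_unmute(muted_osds, current_down_osds):
    """Stay muted while the set of down OSDs only shrinks; unmute if a new one appears."""
    return bool(set(current_down_osds) - set(muted_osds))

# should_unmute({2, 5}, {5})    -> False: one muted OSD recovered, stay muted
# should_unmute({2, 5}, {5, 7}) -> True: osd.7 is newly down, resurface the alert
```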
E: Each alert has a bunch of detail items that actually say, like, which OSDs have failed. As long as a detail item only goes away, then it's fine, but if a new one appears then it wouldn't be fine. But even that is still tricky, because there's always a cap on those; there's always, like, a maximum of 30.
E: I think after that it just says plus, like, you know, 700 other OSDs or something like that, and sort of bundles them up, so it's a little bit tricky. But I think initially the idea would be to just have a couple of different modes, probably have one that's the default, and then in the code say, like, this particular health alert uses this other mode, so each can use something that's appropriate. You might need to go back and adjust exactly what the detail strings are.
E: Or make the detail a little bit more structured, you know; have, like, some scalar value that might be used for some health alerts, like, you know, 0.7% degraded or something, or maybe have a set of integers for the OSD one, where you actually have the set of IDs that are down, or whatever it is, or the set of PGs. Maybe something like that.
E: We'd go back and figure out exactly what structured elements might be appropriate so that they can be a little more precise, but I think that depends on whether we can cover most of the health alerts in a pretty decent way with something pretty simple, just, like, a handful of different modes for muting. Then that might be good enough, and we could go that way.
E: Those values, that number, are just constantly changing and presumably slowly ramping down to zero, and so ideally you'd keep it muted as long as it continues to decrease. But maybe there's an easier way to do that; like, maybe we just parse a floating-point value out of the string or something in one of the modes and then use that with a decreasing rule or something. Maybe a quick hack is good enough.
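As a sketch of that quick hack, assuming the detail string carries a percentage such as "0.7% degraded": parse the first number out of it and unmute only when it stops decreasing.

```python
import re

def extract_value(detail):
    """Pull the first floating-point number out of a health detail string."""
    match = re.search(r"[-+]?\d*\.?\d+", detail)
    return float(match.group()) if match else None

def should_unmute_decreasing(value_at_mute, current_detail):
    """Unmute only if the tracked number grew instead of continuing to decrease."""
    current = extract_value(current_detail)
    return current is not None and current > value_at_mute

# should_unmute_decreasing(0.7, "0.4% degraded") -> False: still improving
# should_unmute_decreasing(0.7, "1.9% degraded") -> True: got worse, resurface it
```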
A: I guess you also talked about potentially letting you specify thresholds. That would be things like a percentage that distinguishes between HEALTH_WARN and HEALTH_ERR or HEALTH_OK; maybe, like, 1% degraded is okay in your cluster, but 20% is a warning and 50% is an error, I think. Mm-hmm.
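A sketch of a threshold mapping of that kind, with the 1%/20%/50% cut-offs taken from the example above and obviously meant to be configurable per cluster.

```python
def degraded_health_level(degraded_ratio, warn_at=0.20, error_at=0.50):
    """Map a degraded ratio to a health level; thresholds are illustrative defaults."""
    if degraded_ratio >= error_at:
        return "HEALTH_ERR"
    if degraded_ratio >= warn_at:
        return "HEALTH_WARN"
    return "HEALTH_OK"  # e.g. 1% degraded is acceptable in this cluster
```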
E: Yeah, I think those are, like, per-health-alert additional options to parameterize what the behavior should be. A lot of the alerts already have one or more settings that control whether to alert at all, or what the threshold for the warning state is, and so we could go through one by one and say, like, are there ways that we'd like to modify the behavior of this particular health alert?
E: Okay, well, let's start with that and then play with it and see. I think the really interesting part is going to be during the phase where you look at each health code and what the messages look like, and see how many modes ought to be implemented to provide decent behavior; hopefully not very many, but we'll find out.
A: So, in terms of disks being too busy, we have the suicide timeouts right now, which end up making the OSDs crash, and that can kind of be confusing to users, since they see, you know, OSDs crashing and don't really necessarily understand why. That's one thing we could improve there, just making that have some clear message in the log.
E: That is a to-do, because I wasn't sure how to do it: health alerts for crashes. I think it's basically because I didn't know how to mute them. If you do crash ls, you can see all the crashes, like, historically, forever. So the question is, do you raise a health alert if there's a new crash, and if so, for how long does it warn before it goes away?
E: Yeah, I don't know. Or do you, like, delete it? Do you have an ack, like, I acknowledge that this crashed and I've looked at it, and therefore the health warning goes away, or do you mute it? I wasn't really sure what the right metric was. You could delete the crashes or something, but you kind of don't want to delete them, right; you'd like to have a record of them all, and you want to make sure they all get phoned home, and, oh, I think...
E: Kind of. I mean, you want to tell the operator that something bad happened, and you want to tell the developers that it happened, and I think those are disjoint, right? Just because it's been phoned home, and we'll eventually maybe find out about it, doesn't mean you don't want to tell the operator.
A: Yes, going back to the disks being too busy, or actually failing: we might also capture some other metrics, things like I/O queue depth at various levels, whether at the device level or at the OSD level, and make sure that we're exposing those through performance counters, or potentially reporting those as metadata to the manager or the new interfaces.
E: One thing that bothers me; I can't think of anything else that we'd want to only generate raw data for, but the thing that bothers me is that if an OSD fails because it gets an I/O error and it crashes, that's logged in the crash report, but nothing ever looks at that, and the cluster doesn't know that it failed because the device failed. It just knows that the OSD is down. Yes, that's a good point; that's a case where we kind of crash the OSD and don't report anything, any event.
E: ...that it failed somehow. But even then, it's like, how does that square with the whole failure-prediction thing: what does 'failed' mean? Is it that you got one I/O error once, or is it that you got too many of them? Because with BlueStore, most of the time when you get an I/O error on read we can fix it, like we'll use the replica to repair it, and when that happens we don't do anything.
A: I guess right now we're kind of just asserting, and it could be easier to understand if we didn't just do an assert there, but had a message in the logs. I mean, it could just call, like, exit with an error, because it can be really confusing to pick out of all the logs and backtrace information that it was a timeout and then understand what that means, or even to see the I/O error and not understand that it means the disk is failing.
A: That was about disks; now I'm coming to, like, CPU. You already talked a little bit about trying to track CPU load, and we could just put that in the log periodically, so that we have a record of what's going on if the user doesn't have other monitoring set up. That way, if we do see, you know, an OSD suicide timeout or some other issue...
A: Yeah, so beyond the basic physical resources in the system, I just wanted to add some more specific warnings for things that we do know about, like omap databases getting too large. I think the tricky part here is figuring out what a good threshold is; for FileStore it might have been like 20 gigabytes, but for BlueStore it seems to be in the hundreds of gigabytes range. It's going to vary pretty heavily, I think, based on the workload. Mm-hmm.
E: So we could have it warn at, like, 0.5 of the device size, I guess. I'm just nervous about saying, like, 100 gigs or picking a number, because it's going to vary based on device size and device speed and whatever other things; but at least a ratio accounts for that, whereas a fixed number that isn't an issue on one setup might matter on another.
A: Okay, though it's not so useful if we can't figure out a reasonably useful value that fits most workloads; yeah, I think this...
A: So the last bullet here was about trying to analyze what's happening in the cluster, or letting the human do the analysis based on a bit more data. Michael Kidd has written some scripts for parsing the cluster log and trying to bring it into a kind of spreadsheet format, so we can look at things over time and try to see whether there are events that are correlated with slow requests in the log.
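A rough sketch of that kind of log-to-spreadsheet tooling (not the actual script mentioned), bucketing cluster-log lines by minute and counting slow-request and deep-scrub events so they can be lined up. The log patterns are assumptions and vary by release.

```python
import csv
import re
from collections import defaultdict

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")  # bucket by minute
SLOW = re.compile(r"slow request", re.IGNORECASE)
DEEP_SCRUB = re.compile(r"deep-scrub", re.IGNORECASE)

def summarize(log_path, csv_path):
    buckets = defaultdict(lambda: {"slow_requests": 0, "deep_scrubs": 0})
    with open(log_path) as log:
        for line in log:
            ts = TIMESTAMP.match(line)
            if not ts:
                continue
            bucket = buckets[ts.group(1)]
            if SLOW.search(line):
                bucket["slow_requests"] += 1
            if DEEP_SCRUB.search(line):
                bucket["deep_scrubs"] += 1
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["minute", "slow_requests", "deep_scrubs"])
        for minute in sorted(buckets):
            writer.writerow([minute, buckets[minute]["slow_requests"], buckets[minute]["deep_scrubs"]])
```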
H: So earlier it used to have, like, a lot of information from the I/O point of view: recovery I/O, deep-scrubbing states, whatever slow requests are coming, how much time they are taking, how many OSDs, how much time deep scrub is taking, these kinds of stats. That's found in the cluster logs, and I think Michael Kidd has written a script which takes some interval and just dumps it into a CSV file.
H: Like reporting, if deep scrub is running, what the average deep scrub timing is for each interval, whether it's taking one minute or two minutes; and whether there is some impact from the deep scrubs, an impact showing up as slow requests because of deep scrub, something like that.
H: And this is all to check, if the cluster is reporting slow requests, what the reason behind it is. Is it deep scrub? Is it the disk being busy because of something else? Is it splitting, like earlier it used to be FileStore splitting; now with BlueStore we can think of, like, the DB doing some spillover, or maybe some compaction running which takes a lot of time and keeps the CPU busy, or something.
H: I think, from what I understood from David's explanation, if we combine this network thing with this, there will be a lot better information for troubleshooting compared to now. I mean, the networking plus more information from the slow-request point of view will give us a lot of data for sorting this out.
E: One thing I've wondered is whether there should be a way to, like, capture the log for a specific PG. Whenever there is, like, an annoying bug and something goes wrong, and the OSD is crashing or gets stuck during peering or whatever, we have to turn on the whole log for the OSD, but it's just one PG that we're looking at, yeah.
E: It's like the PG is stuck in a particular state and we can't figure out why from the query, and so we have to go turn on logging so we get that one PG and make it re-peer. I wonder whether it makes sense to be able to set a logging level on a per-PG basis, kind of like the way you can do it for an OSD, but for a PG.
E: You could do that as one option, or, alternatively, all of the log messages out of the peering state machine, the peering state basically, could be retained in memory with a different, higher level of in-memory log retention, so that if you did hit a crash you'd have it, like, up to debug level 10 at least.
E: Think about it, because when people, when QA say they want a tool that's going to go gather the logs for a crash, what you really want is to reproduce it with debug osd = 20 and then go grep that PG out of whatever the appropriate OSD log is. If there was a way to automate that, maybe it'd be really nice.
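A sketch of that grep step: after reproducing with debug osd = 20, filter the OSD log down to the lines mentioning a single PG. The log path and the PG-id formats matched here are assumptions.

```python
def extract_pg_lines(osd_log_path, pgid, out_path):
    """Copy only the lines mentioning `pgid` (e.g. "2.1f") into a smaller file."""
    peering_needle = f"pg[{pgid}"  # peering-state lines; format assumed
    with open(osd_log_path) as src, open(out_path, "w") as dst:
        for line in src:
            if pgid in line or peering_needle in line:
                dst.write(line)

# extract_pg_lines("/var/log/ceph/ceph-osd.3.log", "2.1f", "pg-2.1f.log")
```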
E: That's actually useful. There's, like, the crash part of it, parts where you just have to extract the actual dump, make them separate records, and put them in one particular table or database or whatever; and then the telemetry is similar, where you have to associate them with a particular cluster, and you really only care about the most recent report for a single cluster, and so on.
E: Archiving them, yeah; you just can't, like, query them yet. And incidentally, there's an RFC request I had that I'd love to hear comments on: basically adding the concept of a channel to the telemetry module. There would be the basic channel, which is, like, the basic cluster stats: what version you're running, how many OSDs you have, how much data you're storing, that sort of thing. There'd be the crash channel, which is all the crash dumps, and then eventually we'd add the device channel, which is all the device health metric telemetry.
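A sketch of the channel idea: the report builder includes only the sections whose channels are switched on. The channel names follow the ones mentioned here (basic, crash, device); everything else is an illustrative assumption.

```python
def build_telemetry_report(cluster, channels):
    """Assemble a telemetry report containing only the enabled channels (illustrative)."""
    report = {}
    if channels.get("basic", True):
        report["basic"] = {
            "version": cluster["version"],
            "num_osds": cluster["num_osds"],
            "total_bytes": cluster["total_bytes"],
        }
    if channels.get("crash", False):
        report["crash"] = cluster["crash_dumps"]
    if channels.get("device", False):
        # Device health metrics may be sent to a different endpoint entirely.
        report["device"] = cluster["device_health_metrics"]
    return report
```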
E: Yeah, it's a little bit weird, because the telemetry and crashes are going to go into one database, just for the Ceph clusters, while for the device health metrics the intention is to go into a different data set that's not necessarily just Ceph; it might actually be phoning home to a different location. I was trying to keep it simple, so they're just channels, and you turn them on and off sort of independently of what the mechanism is by which they actually get sent.
D: Maybe when they do the upgrade, or using the dashboard, they'll have to first fill that out: whether you want to turn this module on after the upgrade. And if they say no, you capture the reason; you could have options like A, B, C for why you turned it off, exactly so that they make a choice during the upgrade. That's how most pieces of software these days are made, so yeah.
E: Okay, no strong opinions about that. All right, anything else we should discuss, or are we ready to wrap up?