From YouTube: Ceph Developer Monthly 2022-11-02
Description
Join us the first Wednesday of every month for the Ceph Developer Monthly meeting
https://tracker.ceph.com/projects/ceph/wiki/Planning
A: I added some user-facing guard rails to Crimson, generally for two reasons. One, we want to make it hard to accidentally start Crimson OSDs, since that would really complicate a production cluster. Two, Crimson OSDs don't support everything regular OSDs do, so to the extent that we can, it's probably a good idea to guard off those features so that people don't accidentally use them.
A: So to that end there are two things I did. The first is adding a Crimson experimental feature flag and an allow-crimson osdmap flag. I didn't do anything especially complicated here: you have to have the experimental feature enabled to set the allow-crimson flag, and you can't unset it. Crimson OSDs won't boot at all if they don't see this flag in the osdmap. So that's how that part works.
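A minimal sketch of the boot-time guard described above, assuming hypothetical names (FLAG_ALLOW_CRIMSON, crimson_osd_may_boot) rather than the actual Crimson code:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the real OSDMap types; names are illustrative only.
constexpr uint64_t FLAG_ALLOW_CRIMSON = 1ULL << 0;

struct OSDMap {
  uint64_t flags = 0;
  bool test_flag(uint64_t f) const { return flags & f; }
};

// Sketch of the guard: a Crimson OSD refuses to boot unless the cluster-wide
// allow-crimson flag (settable only with the experimental feature enabled,
// and not unsettable afterwards) is present in the osdmap it sees.
bool crimson_osd_may_boot(const OSDMap& map) {
  if (!map.test_flag(FLAG_ALLOW_CRIMSON)) {
    std::cerr << "osdmap does not have the allow-crimson flag set; "
                 "refusing to boot a Crimson OSD\n";
    return false;
  }
  return true;
}

int main() {
  OSDMap map;                          // flag not set: boot is refused
  std::cout << crimson_osd_may_boot(map) << "\n";
  map.flags |= FLAG_ALLOW_CRIMSON;     // flag set: boot proceeds
  std::cout << crimson_osd_may_boot(map) << "\n";
}
```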
A: The second is the pool flag. You can either set the osd_pool_default_crimson parameter, which will cause all pools to be created as Crimson pools by default, or you can give it on a per-pool basis with a crimson flag. It disallows a couple of features that Crimson can't deal with, like changing the number of PGs or using tiers, and the net upshot is that Crimson OSDs won't locally create PGs from pools that don't support them. That's pretty much it.
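A rough sketch of the pool-level guard being described; the Pool type and validate_pool_change are illustrative stand-ins, not the real monitor code:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <cstdint>

// Illustrative pool model; field and function names are made up for the sketch.
struct Pool {
  bool crimson = false;        // set via osd_pool_default_crimson or a per-pool flag
  uint32_t pg_num = 32;
  std::optional<int> tier_of;  // cache tiering relationship, if any
};

// Reject changes that Crimson pools can't deal with yet: changing pg_num
// (no PG splitting/merging) and attaching tiers.
std::string validate_pool_change(const Pool& p, uint32_t new_pg_num,
                                 std::optional<int> new_tier) {
  if (!p.crimson) {
    return "ok";
  }
  if (new_pg_num != p.pg_num) {
    return "EINVAL: cannot change pg_num on a crimson pool";
  }
  if (new_tier.has_value()) {
    return "EINVAL: cannot add a tier to a crimson pool";
  }
  return "ok";
}

int main() {
  Pool p;
  p.crimson = true;
  std::cout << validate_pool_change(p, 64, std::nullopt) << "\n";  // rejected
  std::cout << validate_pool_change(p, 32, std::nullopt) << "\n";  // ok
}
```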
A: If that makes sense. So in that sense we actually are protocol compatible with classic, after a fashion, but, for instance, we don't support tiering at all. So when we do support tiering, we'll support it the same way classic does, up to the same feature flags, etc. So I'm hoping it won't be a problem, but this is likely something that's going to have to evolve with time.
B: Do we have these features, which are supported or unsupported, documented anywhere? I'm just looking at the upstream documentation.
E: The other thing I noticed, Sam, was that there was a config set global that you were doing. So it's not an OSD kind of thing; it's a global setting.
A: Oh well, the lab will have to work again before we can actually merge this, because I don't want to break the teuthology jobs. But the short answer is, when you're creating the cluster, you set the experimental unrecoverable-data-corrupting-features flag and the set-allow-crimson flag, and the monitor config option that causes pools to be created by default as Crimson pools.
A: It also does one other thing, the crimson pool flag: it disables the autoscaler.
A: That should work, yeah. Crimson can deal with peering just fine; it just doesn't have PG splitting yet.
B: The intent of talking about conditional or intelligent logging is that, I mean, we do have a lot of debug levels in there, but we have to change them manually when we are troubleshooting a situation. With conditional debugging or intelligent logging, I think developer-level changes would need to be done so that whenever there's a situation or a trigger, the debug logs can be enabled automatically for that condition, and then they could be toggled off again.
B: If a mechanism like that could be devised, it could be really useful in troubleshooting and root-causing the problem whenever it happens. In some situations we know that we require the debug logs from the time when the problem happened.
B: Many times we don't have, or can't find, those debug logs because they are not enabled by default. So, I mean, does this sound like a feasible idea?
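A small sketch of that idea, assuming a hypothetical set_debug_level() hook rather than Ceph's real config machinery: raise the subsystem debug level while a trigger condition holds, then drop it back once the condition clears.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-in for a per-subsystem debug level setter.
static int g_debug_osd = 1;
void set_debug_level(const std::string& subsys, int level) {
  g_debug_osd = level;
  std::cout << "debug_" << subsys << " set to " << level << "\n";
}

// Sketch of "intelligent logging": escalate logging while a condition holds,
// then automatically restore the default once it clears.
class ConditionalDebug {
  bool escalated_ = false;
public:
  void observe(bool trigger_condition) {
    if (trigger_condition && !escalated_) {
      set_debug_level("osd", 20);   // aggressive logging while the condition holds
      escalated_ = true;
    } else if (!trigger_condition && escalated_) {
      set_debug_level("osd", 1);    // toggle back off
      escalated_ = false;
    }
  }
};

int main() {
  ConditionalDebug cd;
  cd.observe(false);  // nothing happens
  cd.observe(true);   // e.g. a slow request detected: debug raised
  cd.observe(false);  // condition cleared: debug restored
}
```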
E: So this is definitely an interesting idea, and it resonates with something that Radek brought up in the context of an upgrade. I think, Brad, you are familiar with this one: the bug that we were seeing in our Gibba cluster in the LRC. Every time we were doing an upgrade, randomly one OSD would hit a crash and then just come back up fine, and we were never able to reproduce that kind of issue in teuthology testing or anywhere else.
D: It revolved around the inability to...
E: No, I just wanted to say that the idea was, you know, we knew that restarts triggered the issue, so something like cephadm, which was doing the restarts for the upgrades, could potentially have some kind of hooks to enable and disable logging. But then again, that would be blanket, right? You can imagine, in a thousand-OSD cluster, trying to enable logs for all the OSDs one by one; it was not logistically feasible. Anyway, Sam, go ahead.
A: Yeah, so the challenge is predicting ahead of time where you'd want to put a trace point like that. An obvious example is: a thing happens somewhere in an IO, and you wish you had logging from the beginning of that IO, right? It's easy to create a branch that does whatever custom logic you need to detect the condition you're interested in.
A: It's not clear to me how to allow you to create a trace point like that without going as far as the, I'm forgetting the name of it, the Linux kernel injectable trace point concept, which would be an interesting idea if someone wants to do a prototype with that, but...
D: We have an enhanced log that we store in memory, and if we get a core, if the daemon crashes, that gets dumped, and I believe there's a command that allows you to dump that. That's...
A: No, it's more like: let's say you want to dump some specific debug line, but only for very specific IOs, contingent on some parameter in the IO itself. DTrace would actually allow you to create a trace point with conditional logic that's basically arbitrary with respect to the parameters being passed in; it's shockingly powerful. To do something equivalent to that in Ceph, you'd have to create a branch like...
D: That would have to be an advanced user that would be able to do that, but I suppose we could come up with some SystemTap-type scripts, but...
B: I remember this in RGW as well, with multi-site, in one of the earlier customer situations, when we saw that some objects were not getting replicated from one site to the other, and we thought that maybe we would introduce conditional debugging.
B: This was discussed with, I remember, Matt Benjamin before as well; he's not here, but yeah, I remember thinking about this from the RGW multi-site point of view as well, where we could have conditional debugging, where we could...
E: And this could be thought of case by case. Sometimes there are invariants that are getting broken, and some things can be thought of as unacceptable values. Take the example of the dups issue that we saw: we had a bug where a data structure called the dups in the PG log was just growing infinitely.
E: So you can imagine that if we had some sort of threshold, which could have been a very conservative threshold, we would start to log things: okay, if you go above this threshold, we are going to aggressively start logging things, right? But then it has to be on a very case-by-case basis. And similarly, for things that end up in crashes, we could end up logging things at whatever the in-memory default is.
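A sketch of the conservative-threshold idea using the dups example; the threshold value and the log_aggressively hook are made up for illustration:

```cpp
#include <cstddef>
#include <iostream>

// Hypothetical escalation hook; in practice this might raise a debug level
// or emit a cluster log warning.
void log_aggressively(const char* what, std::size_t size) {
  std::cerr << "WARNING: " << what << " has " << size
            << " entries, above the configured threshold; "
               "enabling verbose logging for this PG\n";
}

// Conservative threshold: well above anything expected in normal operation,
// so the extra logging only kicks in when the invariant is already off.
constexpr std::size_t kDupsWarnThreshold = 100000;  // illustrative value

void check_pg_log_dups(std::size_t dups_size) {
  if (dups_size > kDupsWarnThreshold) {
    log_aggressively("pg_log dups", dups_size);
  }
}

int main() {
  check_pg_log_dups(3000);     // normal: silent
  check_pg_log_dups(2500000);  // runaway growth: escalate
}
```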
E: Right, so sometimes even that is not enough, right? We still are asking for a core dump, or we are looking for further logs and stuff.
B: Yes, I remember a lot of times, when troubleshooting a lot of problems, it was discussed that we need debug logs from the time the problem actually happened. But in many situations, since in a lot of production environments the debug logs were not enabled, we could not find the root cause easily, and then you need to do a lot of further troubleshooting. So that's how the idea came about.
D: This is because we're caught between two worlds. We're caught between a world that says we don't want any debug logging, because we don't want the overhead involved and we don't want to store huge logs; we don't want huge in-memory logs, because it's too much of an overhead; we don't want, you know, the impact on speed from having to log all this data.
A: I would say it's the other way around. We actually have tons of those: it's every assert in the code, right? Every assert says, okay, we claim that if this condition is violated, a very bad thing has happened, Greg. What you're observing is that there are cases where a bad thing happened, and then a long time later a crash or something happens, and we don't know what happened back at the original point. So for a system to notice when it happened and go...
D: ...grab a piece of information that's possibly gatherable, and that's possible to do, but you could probably do that now: just turn on all logs to their highest level. But...
C: If this is something where you have specific things that have occurred that were problematic, we have done things like, say: okay, we've detected a condition like a stuck operation; maybe we should do something like dump out all the operations that exist, or dump out the operations that are stuck and what state they reached.
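A sketch of that kind of mechanism, in the spirit of an op-tracker dump; the TrackedOp type and the age threshold are illustrative only:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Illustrative in-flight operation record.
struct TrackedOp {
  std::string description;
  std::string state;           // last state the op reached
  Clock::time_point started;
};

// When a stuck op is detected, dump every op that has been in flight longer
// than the threshold, along with the state it reached, instead of waiting
// for someone to raise debug levels after the fact.
void dump_stuck_ops(const std::vector<TrackedOp>& ops,
                    std::chrono::seconds threshold) {
  const auto now = Clock::now();
  for (const auto& op : ops) {
    if (now - op.started > threshold) {
      std::cerr << "stuck op: " << op.description
                << " state=" << op.state << "\n";
    }
  }
}

int main() {
  std::vector<TrackedOp> ops = {
    {"osd_op client.4123 write", "waiting for sub ops",
     Clock::now() - std::chrono::seconds(120)},
    {"osd_op client.4124 read", "started", Clock::now()},
  };
  dump_stuck_ops(ops, std::chrono::seconds(30));  // dumps only the first op
}
```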
A: It might be worth keeping notes when you hit conditions like this, and then you could present them as, I don't know, these are general categories of things that are happening, and then we can brainstorm good ways to attack them as a group. Like maybe a bunch of them are all about RGW multi-site; it might suggest a really specific kind of instrumentation that would be helpful in RGW.
B: I think there's a point to consider about feasibility as well: where to do it. A lot of times, I mean, we might have to identify a lot of conditions, or maybe one simple condition, that could help us understand when things went wrong, or maybe, just like you mentioned, Sam, identifying an assert where things went bad.
D: There's also the question of what level you go to. I mean, users aren't going to like it if we fill up their partition automatically under certain circumstances, you know. So it's always a delicate balance between not enough logging and too much logging.
D: You can go from a situation where, you know, a gig for your logs is plenty, to a situation where a gig is not going to last more than half an hour or ten minutes. And so, yeah.
A: Well, so hypothetically we could do more, but that's not really what I'm even getting at. The point is that all of those things are specific things that were created one at a time by a developer who identified an invariant; we didn't write a system that generates asserts, we wrote them individually. So I think Greg's right, most likely.
A: The problem is that whatever system you've been hitting these issues in, you're just hitting an issue with fairly poor visibility in the first place, and the answer is to think carefully about the nature of the problems you're hitting and why they're hard to debug, and then think of an actual subsystem you can add to that component that will make those problems in general easier to debug, things like the heartbeat timers in the OSD or the op tracker.
D: Yeah, there's also the sophistication of the user that's trying to debug a problem. Some issues are pretty transparent to a developer or somebody who has a high level of debugging experience, and for someone who has very little debugging experience they're all pretty opaque.
D: No, we could go to the trouble of developing an automatic framework that, you know, dynamically adjusts log levels and does X and Y, and then find out that 90% of the user base just turns it off and disables it.
A: I don't think 90% of the user base turns off the op tracker. So the fact that people can turn...
A: I will say that there is one possible, sort of triggered, concept. Right now we assert on pretty much every invariant, but some are more fatal than others, and it might make sense to add a utility in the code that allows us to dump the in-memory log lines to the log instead of killing the process. That might be an interesting middle ground.
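A sketch of that proposed middle ground: a non-fatal assert that flushes a ring of recent in-memory log lines to the log and keeps running instead of killing the process. The MemoryLog class and SOFT_ASSERT macro are illustrative, not Ceph's actual log code.

```cpp
#include <deque>
#include <iostream>
#include <string>

// Illustrative ring of recent high-verbosity log lines kept only in memory.
class MemoryLog {
  std::deque<std::string> recent_;
  std::size_t capacity_ = 1000;
public:
  void add(std::string line) {
    recent_.push_back(std::move(line));
    if (recent_.size() > capacity_) recent_.pop_front();
  }
  void dump_to_log() const {
    for (const auto& line : recent_) std::cerr << "[memlog] " << line << "\n";
  }
};

MemoryLog g_memlog;

// Non-fatal assert: on a violated (but survivable) invariant, dump the
// in-memory log lines to the on-disk log and keep running.
#define SOFT_ASSERT(cond)                                              \
  do {                                                                 \
    if (!(cond)) {                                                     \
      std::cerr << "soft assert failed: " #cond " at " << __FILE__     \
                << ":" << __LINE__ << "; dumping in-memory log\n";     \
      g_memlog.dump_to_log();                                          \
    }                                                                  \
  } while (0)

int main() {
  g_memlog.add("debug 20: submitted op 1");
  g_memlog.add("debug 20: submitted op 2");
  SOFT_ASSERT(1 + 1 == 3);  // violated: dumps the recent lines, does not abort
  std::cout << "still running\n";
}
```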
D: The DB was getting corrupted by something. It was never positively identified what the actual issue was; it was somehow corrected and just went away, but we were almost certain that the database was being corrupted underneath us, so the container system or the storage system, something to do with containers or the underlying storage, was corrupting the RocksDB database. We detected the corruption, but by the time we detect the corruption, all we can do is crash; there's nothing we can do about why it got corrupted.
A: So to expand on my point from before: the intervention I would suggest in a scenario like that would be exploring RocksDB's own APIs for dumping checksums associated with its files and things of that nature. Doing that on mount or unmount might give you a way to detect corruption between monitor starts, and that's kind of what I'm getting at. Even in this scenario, where it really wasn't a Ceph thing, there are sometimes visibility things you can add that would give more information.
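A sketch of what that might look like with RocksDB's public API; DB::VerifyChecksum() is a real RocksDB call, but wiring it into a mount-time check like this is only an illustration:

```cpp
#include <iostream>
#include <string>
#include <rocksdb/db.h>

// Open the store read-only and verify checksums of all data, e.g. at mount
// time, so corruption introduced between daemon restarts is caught early
// rather than at some later random read.
bool verify_store(const std::string& path) {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::OpenForReadOnly(options, path, &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << "\n";
    return false;
  }
  s = db->VerifyChecksum();
  if (!s.ok()) {
    std::cerr << "checksum verification failed: " << s.ToString() << "\n";
  }
  delete db;
  return s.ok();
}

int main(int argc, char** argv) {
  if (argc > 1) {
    return verify_store(argv[1]) ? 0 : 1;
  }
  return 0;
}
```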
D: Yeah, I posted a couple of links in the chat, Sam, by the way, about SystemTap and userspace applications, just in case you're interested. They're not terribly comprehensive, but...
D: And also, potentially, you can set variables in functions and that sort of thing as well, you know, make an if statement behave differently, that sort of thing. So, you know, it's fascinating: you could ship a hotfix that doesn't even require the daemon to be restarted.
E: That reminds me: something like Jaeger tracing, which I was working on, would also allow you to do some of what you were describing in the example you gave about something getting stuck or slow in RGW multi-site. If we had the trace points in the first place, you could grab traces and go back, exactly.
A: Yeah, that would be kind of like the op tracker approach, except huge and across the whole cluster, and it has pretty much the same advantages. So that would be a great example of a really comprehensive visibility system that would allow you to gain information on a wide class of problems.