From YouTube: Ceph Developer Monthly 2021-11-30
A: So, I don't know... Omri and Casey, do you want to kick it off?
B: So let's talk about our current work in tracing. We have created a tracer class using the OpenTelemetry client SDK; we moved from OpenTracing to OpenTelemetry.
B: We are able to create a new trace, to add child spans to an existing trace, and we can also enable the tracing at runtime; we want it to be disabled by default. We can also serialize and deserialize trace info in order to unify spans into a single trace.
B: Here is an example in the Jaeger UI. We can see eleven traces here, where each trace is an operation in the RGW, and the trace name consists of the operation name and some unique transaction id.
B: Let's talk about the actual traces that we already have in the RGW: every user request in RGW is being traced.
B: We currently have basic spans in each trace that represents a user request. We have three spans, including one for verifying permissions and one for the execute method of the operation, and we have tags on this trace that can be used to identify it, for example a bucket name or an object id.
B: And the second kind of trace that we have is a trace for the multipart upload operation, which is a trace that unifies several operations, I mean the subsequent operations, into a single trace.
B: We support it with serialization of the trace info; we deserialize it and build an object called a span context.
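As a minimal sketch of that idea (the field names and string encoding here are invented for illustration, not Ceph's actual wire format): the span context carries just enough identifiers to let a later operation re-attach its spans to the same trace.

```python
from dataclasses import dataclass

@dataclass
class SpanContext:
    """Minimal span context: enough to re-attach spans to a trace."""
    trace_id: str   # identifies the whole trace
    span_id: str    # identifies the parent span
    flags: int      # e.g. a sampled bit

def serialize(ctx: SpanContext) -> str:
    """Pack the context into one string, suitable for storing
    alongside an object (e.g. the multipart-upload meta object)."""
    return f"{ctx.trace_id}:{ctx.span_id}:{ctx.flags:02x}"

def deserialize(blob: str) -> SpanContext:
    """Rebuild a SpanContext from the stored string."""
    trace_id, span_id, flags = blob.split(":")
    return SpanContext(trace_id, span_id, int(flags, 16))

ctx = SpanContext("4bf92f3577b34da6", "00f067aa0ba902b7", 1)
assert deserialize(serialize(ctx)) == ctx  # round trip preserves the context
```

The point of the round trip is that the deserialized object is equivalent to the original, so spans created later end up under the same trace id.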
B: We want to compare several approaches. We want to compare tracing compiled out versus tracing compiled in but disabled, which is the default state, and we also want to compare tracing disabled versus tracing enabled, which is when most of the runtime cost happens.
B: On the tracing side, we also want to deploy the Jaeger components along with Ceph. We still need to integrate with cephadm to make it deploy the Jaeger containers if tracing is wanted; I mean, it's not supposed to deploy them always.
B: Only if you pass some flag to it. Jaeger also supports Rook deployment, and they have an operator, and we decided with the Rook maintainers to document the process of installing the Jaeger operator and deploying the Jaeger components, rather than adding code to the Rook operator to make it install Jaeger by itself. They have facilities to configure the Jaeger components and to configure the Ceph cluster to communicate with the Jaeger pods, so adding code to Rook is currently unnecessary.
B: We can benefit from it in multisite tracing, since that is very hard to debug, and we also want to make an integrated trace for the RGW and the OSD, like an end-to-end trace from the start of the operation until the OSD operations.
C: At the start you covered serializing and deserializing the traces. I didn't get that; can you explain a bit about it?
C: That was my second question as well, about how multipart uploading works. I'm not very familiar with multipart upload, so how do you make use of them there?
B: The span context is serialized into the meta object when we create the upload object for the actual upload, and when we want to upload a part of it, we read the upload object and, from it, the meta object.
B: And we take the span id, trace id and flags and convert them into a span context, which is used in the add_span method of the tracer class.
C: Okay, so it's like deconstructing the trace and then using that context to continue the trace.
D: Whenever you read it from RADOS, you get the serialized span. If you find it there, you deserialize it and you continue your trace from there.
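A toy sketch of that flow, following the names used in the discussion (the `Tracer.add_span` signature here is hypothetical, not Ceph's real API): each subsequent operation reads the stored parent context back and creates its spans under the same trace id, so all the multipart parts land in one trace.

```python
import itertools

class Span:
    def __init__(self, trace_id, span_id, parent_id, name):
        self.trace_id, self.span_id = trace_id, span_id
        self.parent_id, self.name = parent_id, name

class Tracer:
    """Toy tracer: add_span() attaches a child span to a parent context."""
    _ids = itertools.count(1)

    def add_span(self, name, parent_ctx):
        # The child inherits the trace id, so the UI shows every
        # multipart part under the same trace.
        return Span(parent_ctx["trace_id"], next(self._ids),
                    parent_ctx["span_id"], name)

tracer = Tracer()
root_ctx = {"trace_id": "abc123", "span_id": 0}   # stored with the meta object
part1 = tracer.add_span("put_part_1", root_ctx)    # read back and deserialized
part2 = tracer.add_span("put_part_2", root_ctx)
assert part1.trace_id == part2.trace_id == "abc123"
assert part1.parent_id == part2.parent_id == 0
```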
C: Okay. In OpenTracing we had extract and inject methods that stored the context and then used that context to continue the span generation at the other end. Is it similar to that?
B: The extract function is RGW-related; it's not related to the common tracer.
D: If you want to have this unified tracing across RGW and OSD, then we would have to serialize the information into somewhere in RADOS and send it to the OSD, and when the OSD gets the RADOS command from the RGW, it would look for this extra piece of information in the payload, and if it finds it there, it would reconstruct the span context.
C: Okay, yeah, that clarifies my doubt. I think even in OpenTracing we had these two methods, which were similar to the deserializing and serializing that we are doing here.
D: OpenTracing had the same functionality as well. I don't think we used it, though, in OpenTracing.
C: Yeah, we don't use it currently, and I am not very familiar with OpenTelemetry, so it's good to know they have that feature. And the multipart traces: has that been worked out, or is it not yet merged and still work in progress?
C: You are using the deserialization and serialization in the multipart tracing, if I understood correctly.
C: Okay, so does it also cross system boundaries, like RGW and OSD?
B: It's supposed to write the span context into a bufferlist. I'm not sure if the OSD uses the same bufferlist, I mean not the same one, but the same data structure to serialize objects.
D: It's the common tracer. We already made this change before the migration to OpenTelemetry, but it's common code, so the new OpenTelemetry code is already running in the OSD with the existing stuff. Regardless of serialization and those things, OpenTelemetry is already in the OSD.
C: And have you tested it on multiple nodes, or is the dev cluster limited to one personal work environment right now? I'm asking from a deployment perspective, because I have not tested on multi-node till now.
D: We didn't; we were just testing with vstart. I think Omri is working on cephadm integration, and once we have that, I guess we can test it manually on multiple nodes. Even now you just need to go and set up things manually, but once we have cephadm integration you can set up the agents and collectors in the back end on whatever setup you have. And this is also true for Rook; for Rook it will again be, well, not exactly manual, but similar.
C: Any difficulty you faced while working on the traces? Because from the past I remember there were cases where having two parent traces was leading to a deadlock condition, and I don't remember the details, but we stuck to having only one parent in the whole scope for an individual trace. Were you able to produce multiple traces in the same scope?
C: Maybe, yeah. I just wanted to ask whether you were able to have multiple traces while dealing with the same scope.
D: They start with the transaction start and they end with the transaction end, and we have the complete multipart trace, which starts with a multipart init, then goes through all the puts of the multipart, and then ends with the multipart complete, and we have them running in parallel. So at the same time you would see a trace for a put, a transactional trace for the put, but this same put...
D: ...operation is also part of a much longer trace, which describes the multipart from the init through all the different puts until the complete. And, correct me if I'm wrong, but I don't think we ran into issues with this, right?
C: Cool, yep. That's all I can think of off the top of my mind. Anybody else have anything?
D: As I said, Omri started to work on the cephadm integration.
D: Yeah, and for Rook there won't be any actual code going in; it will be just documentation inside the Rook manual for deployment with tracing.
C: Just to mention: have you guys thought about the design of the cephadm-based deployment?
C: Last time I was working a bit on it, I was thinking of having the different daemons deployed as sidecars, similarly to how we have the systemd units for other Ceph components. Is there anything specific that you have thought about, or is it still a plan for the future?
B: Yeah, the deployment via cephadm... I mean, cephadm has some options to deploy Prometheus, which is similar to what tracing needs for Jaeger. In cephadm they...
B: They have some services that live outside of the core, like...
B: ...external services, and we will deploy it in the same way, like Prometheus. I mean, it's supposed to be a service that the manager can identify and know whether it is running or not.
F: Hey Andre, I have a question for you: have you thought about making some of these measurements available to the users, for example via the Server-Timing HTTP header? I'm not sure if you are familiar with that, but you can add some headers.
F: So, just so you know, I'm currently exploring that, and that's where I think it's pretty useful, mainly for debugging and tracing purposes. It should probably be disabled the rest of the time, but at least when you want to debug or get some profiling or tracing, it might help.
D: I'm not sure that we need to do that. I mean, this is probably something you can do from the Elasticsearch back end; this is what Jaeger is using, they're using Elasticsearch. So if you write the correct queries there, you should be able to extract the data.
C: I think we would just need to embed the link for the query, and the Jaeger UI generally takes care of these, so we can display some of these metrics consistently in the dashboard as well.
G: I have a question, please: does it make sense to anonymize some of these metrics and have them sent through our telemetry module as well?
D: The main metrics that could be extracted from the traces are the latencies, the time it takes to perform operations; those are the metrics-like information. The other information in there is more debugging information: the tags and the logs are more like getting a very nice long log record that can hop across multiple daemons and across time and things like that. So those are the two functions, but from a latency-calculation perspective I think this could be very valuable.
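As a sketch of that metric extraction (the span fields and operation names here are invented for illustration): an operation's latency is simply its span's end time minus its start time, which can then be aggregated across traces.

```python
# Each span carries start/end timestamps; latency per operation is
# end - start, aggregated here as a mean per operation name.
spans = [
    {"op": "put_obj", "start": 0.00, "end": 0.12},
    {"op": "put_obj", "start": 1.00, "end": 1.08},
    {"op": "get_obj", "start": 2.00, "end": 2.02},
]

def mean_latency(spans, op):
    durations = [s["end"] - s["start"] for s in spans if s["op"] == op]
    return sum(durations) / len(durations)

assert round(mean_latency(spans, "put_obj"), 3) == 0.1
assert round(mean_latency(spans, "get_obj"), 3) == 0.02
```

In practice, as noted below, this kind of query could be run against the Elasticsearch back end that Jaeger stores its spans in.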
D: It's also related to what Ernesto said, so we probably should think about that. It's all in the Elasticsearch back end, so there is probably a way to fetch it from there and store it in the places where they show graphs and so on.
C: So all the details and timestamps and everything would be intact, and we can have them easily portable. So yeah, I really like that; it should come in handy.
H: Thanks, Josh. So there is a pull request that adds the ability to store the SSH key password within the config-key store, and I'm slightly worried that this might not be a good idea.
H: I'm not feeling so good about that possibility, honestly. What can we do about it? I mean, we could encrypt the mon database; that's one idea, with basically a static key. Does it really solve the issue? I don't know. We could try to hook up Vault into the manager. Would that be a good idea? I don't know. We could try to enable the SSH agent in the manager containers.
H: Would that be a good idea? I don't know. I'm looking for input from our group here.
H: So cephadm uses SSH, and right now there is a requirement that the SSH key that is used to establish SSH connections to remote hosts does not have a password. And this particular pull request adds the ability to have password-protected SSH keys, which in itself is a good idea.
A: There is a parallel in where we store the encryption keys for the OSDs: they're also stored in the monitors when using at-rest encryption, and I think we're kind of assuming that if users are concerned about the security of those keys, they're using full disk encryption on the monitors; and some others don't have control over their own block device or anything like that.
H: Then I would be good, right? Can we verify that there is at-rest encryption for the mon database?
H: I mean, if I remember correctly, you were pretty much okay with just storing things in the config database, right?
J: All right, yeah. I mean, the monitor is the root of trust, and so it kind of is what it is; I think at some level the monitor has to have that key. I think the question is making sure that the monitor stores it in a way that doesn't introduce the opportunity for users to accidentally expose it.
J: So I wouldn't, and I'm not the security expert here, but I wouldn't worry about things like encrypting the monitor database, because you have to decrypt it, and the key for that is going to be readable by the root user, and the monitor database is also owned by the root user and not world-readable, so whatever, it's all the same.
J: On the other hand, if you do ceph config-key dump, then it'll dump out a whole bunch of stuff that includes secrets. So we want to make sure that people don't accidentally expose that information, and understand that config-key is potentially sensitive. There is sort of a weird kludge that we added a couple of years ago where, if you have a monitor capability and you do allow r, meaning allow read access, that used to include reading everything in config-key. We changed that, so you can't read config-key...
J: ...unless you have the star permission; you have to have allow *, which only the admin user has. All the other daemons normally only have allow r and then a specific profile that has specific sub-commands or whatever, so maybe you can read specific subsets of it. And so only client.admin generally has access to read all of the keys, and that sort of closes it off.
J: But I think the danger is that config-key doesn't sound sensitive, and so users might not realize that it may contain sensitive information, in the same way that in Kubernetes there's something called a secret store.
F: I remember we had a similar discussion with Neha, you probably remember this, about the logging of the ceph config-key set commands, right? Those were basically printed in the log, so we were also revealing the secrets.
J: One option might be to just create an arbitrarily different namespace, called the secret store, that works exactly like config-key and might even have the same set of permissions, but it's just not called config-key, it's called secrets. You could do a secrets dump and it would dump all the secrets. And then we adjust things to store them under that. It's just an identical, equivalent, but separate namespace.
K: No, I think that was basically a security issue: we were dumping sensitive information in the logs, so we had to remove a whole bunch of things, especially, you know, values of keys. That's where the problem was; we are not dumping values. Yeah, go ahead.
J: As a practical matter, the get and set commands can either take the value as an argument in the command or as a standard-input type payload, and so I think the practical step that we need to make sure of is that whenever we're setting a secret, we pass the secret via standard input, so it doesn't get logged as an argument.
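As a sketch of the difference (the tool name and key name below are illustrative, not Ceph's actual CLI): a secret passed in argv shows up in process listings and in any command logging, while one fed through standard input never appears there.

```python
secret = "hunter2"

# Risky: the secret appears in argv, so it shows up in process
# listings and in any logged command line.
risky = ["mytool", "secret", "set", "ssh_key_pw", secret]

# Safer: argv carries no secret; the value travels on standard input,
# e.g. subprocess.run(safe, input=secret.encode()).
safe = ["mytool", "secret", "set", "ssh_key_pw"]

assert secret in risky       # visible in the logged command
assert secret not in safe    # never hits argv or the logs
```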
M: We could not dump the content of the key and just allow accessing the content of the key: if we get the keys with config dump, then you see a bunch of stars for those sensitive values, but if you get the key in particular, then you get the value, if you're afraid of just dumping everything and exposing sensitive data.
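A sketch of that masking behavior (the key patterns and store layout here are hypothetical): a bulk dump redacts values whose keys look sensitive, while a targeted get of a specific key still returns the real value.

```python
SENSITIVE_MARKERS = ("password", "secret", "key_pw")  # illustrative patterns

def dump(store: dict) -> dict:
    """Bulk dump: mask values of sensitive-looking keys with stars."""
    return {k: ("****" if any(m in k for m in SENSITIVE_MARKERS) else v)
            for k, v in store.items()}

def get(store: dict, key: str) -> str:
    """Targeted get: a deliberate access still returns the real value."""
    return store[key]

store = {"mgr/cephadm/ssh_key_pw": "hunter2",
         "mgr/dashboard/url": "http://x"}
assert dump(store)["mgr/cephadm/ssh_key_pw"] == "****"
assert dump(store)["mgr/dashboard/url"] == "http://x"
assert get(store, "mgr/cephadm/ssh_key_pw") == "hunter2"
```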
I: Instead of getting rid of it, implement another option, like a sanitized config dump: another flag or another parameter in that command, so that there's a dump option but then there's also a filtered dump option. Just an idea.
J: There's also the set command passing the value as an argument as opposed to an input. I guess that's the CLI... well, it shows up in all the logs, because when a manager module, for example, issues a command to set a secret, it triggers a command that shows up in the logs.
K: But yeah, that's a very stopgap solution. It's very easy for somebody in the future to introduce something new and log it somewhere. I mean, currently the code has explicit comments about why this has been removed, but, you know, we cannot rely just on that. So having something more foolproof eventually is probably not a bad idea.
J: The current config store has somewhat of a structure that's carved up hierarchically. So, for example, all the manager module stuff is sort of private within the manager/module-name namespace or whatever, and so if we carved off another namespace with a secret prefix, then it would complicate that hierarchy.
H: We are typically not storing the root key, the SSH key password; so we are storing the SSH key, but we're not storing the SSH key password, and I'm worried that someone is using his private key for it and then accidentally exposing his private key plus its password.
H: It's deceiving: if you're using a password for that key, then you would expect the password to be a bit more secure than the key itself, which is not the case.
H: I mean, I'm going to ask him, yeah, if he's aware that if he stores it there, then it's not a bit more secure than having an SSH key without a password.
J: And if we do merge it, then the documentation that explains that you can do this should be very clear that they're stored in the same place: the key and the password that goes with it.
A: The next one is around detecting inactive or out-of-date communications with monitors. Ernesto, do you want to bring this one up, or maybe Alfonso?
F: Yeah, this is going to be presented by Pere, if you want to go with that, but, I mean, just for context: I think this has been discussed previously. Basically, we got reports from situations where at least two of the monitors of, you know, a three-monitor cluster go down or become unresponsive or whatever, and basically, from the manager, everything will feel like the cluster is going on.
F: It's running, right? Because, and Pere will probably explain this in more detail, we will still receive reports, monitor reports and signals, that are falsely indicating that the cluster is still healthy. So Pere, if you want to go with that, okay.
N: So that's the problem that we try to fix, and yeah. I would think that the monitor should recover from this, obviously, but we went for a quick fix on this, which basically is: when we receive a log message, since we are still getting messages from the monitor, we get the last log message and we store the timestamp from this log.
N: If this timestamp is too far away, we notify the modules and we just stop the dashboard and the Prometheus module, basically. But I think there should be another fix for this, like, for example, a heartbeat from the manager to the monitor, or pinging from the manager to the monitor; maybe also the monitor shouldn't be sending messages to the manager, that's another thing. But yeah, overall I think the monitor should recover, right? And I don't have that much knowledge about the monitor, to be honest.
F: Yeah, I guess so. Eventually, this is quite a corner case, but the thing is that if it happens, all the data that you will get, at least in Prometheus and Grafana, will be completely misleading, because you will see the cluster in one status rather than in error, and probably lots of stats will be misleading as well. So yeah, the log message that Pere uses here is a kind of heartbeat, but the question is whether we can actually send a real heartbeat, or actually detect whether there has been one.
A: All right, I was asking about that because there are at least two components to this. One is not being able to see in the dashboard, when your monitors are out of quorum, ...
A: ...what the status of the monitors is. And the second piece, it sounds like, is that you're still getting updates in Prometheus and Grafana, and therefore in the dashboard, from the monitors that are out of quorum, but they're sending you out-of-date information.
F: You're detecting that by a notify event, yeah. The thing is that the mon map and everything regarding the monitor in the manager is cached, so basically we have a, you know, outdated picture of the cluster. So, probably, I'm not sure whether to stop sending messages or maybe to have a TTL in the maps, so we can detect whether they are outdated or something. There are multiple ways of doing this, but yeah.
F: The thing is that we didn't want to break the existing modules either, because, I mean, I don't think the modules are prepared for this situation, so breaking the whole manager API in case this situation happens might cause a lot of problems, and we don't want to do that. So basically we just wanted to deal with the dashboard and monitor situation, because, well, I don't want to imagine what might happen otherwise.
K: So, just so that I'm understanding this correctly, from the end user's perspective: if we just stop the dashboard module, what do they see? Do they see the last status? From their perspective, what changes?
K: And when the monitors do form quorum again, let us say, it just automatically checks the last time the manager got a message from the mon, and it will start the dashboard again; that's how it works.
J: So I think the simple workaround would just be to have some thread or loop in the dashboard that waits five seconds and then issues a status, then waits five seconds and does it again, and if you ever find that the most recent result you got is more than so many seconds old, then it's stale.
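A sketch of that watchdog (the class name and the 30-second cutoff are invented for illustration): record the timestamp of the last message received from a monitor, and treat the cached cluster view as stale once it is older than the cutoff.

```python
import time

STALE_AFTER = 30.0  # seconds of mon silence before we distrust the view

class MonWatchdog:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = clock()

    def on_mon_message(self):
        """Called whenever a message/log entry arrives from a monitor."""
        self._last_seen = self._clock()

    def is_stale(self) -> bool:
        """True if the cached cluster view should no longer be trusted."""
        return self._clock() - self._last_seen > STALE_AFTER

# Simulate with a fake clock instead of sleeping.
now = [0.0]
wd = MonWatchdog(clock=lambda: now[0])
assert not wd.is_stale()
now[0] = 31.0            # 31s of silence from the mons
assert wd.is_stale()     # e.g. pause dashboard and Prometheus exports
wd.on_mon_message()      # message received again (quorum restored)
assert not wd.is_stale()
```

Using a monotonic clock avoids false staleness when the wall clock jumps.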
A: These days... like, in the past we had to go to the admin socket to figure out the quorum status if the mons weren't in quorum, right?
K: I guess, yeah, if you want to conclude something out of this: the motivation with which this PR has been opened is absolutely fair, and I think we should do something about it. Maybe give us some time to take a look at the manager beacon mechanism to see if there is something we can use there. If not, we can come up with something more minimal like this, what you have, but I'm just not sure how reliable this solution is going to be.
A: Okay, the next topic is around Docker for a development environment.
N: So some of you might know that we have been developing a new development environment. It is basically a Python script that runs a Docker container, and inside this Docker container we use cephadm to deploy other containers. Well, we've had some problems here. The first deployment issue was fixed, so that was nice; I think I still have to talk with Sebastian about it.
N: The problem here is that on my computer it works just fine, but some teammates of mine have been having a lot of trouble with this, because of cgroups v2 and also because we are running this first container with privileges. So yeah, we are having problems with that, and the solution that we think will work is making the container run unprivileged.
N: Then we can maybe use Podman for this, but I'm not sure; I haven't looked at it that much. Also, we wanted to know if it was maybe possible to add loopback devices to the ceph-volume inventory, so they are available there and, basically, in the end, we can run the dashboard CI tests with loopback devices.
N: So yeah, this is basically the current status of the Docker environment. We are still having some trouble with it; on my computer it works, but I would suggest, if some of you want to test it, be careful with it, because I think Pere also crashed his computer with it. Apart from this, I wanted to show you a demo, guys. Okay, so here, this is the folder where we store the Docker development environment.
N: So basically you can check the cluster running with this list command, and, as you can see, there are two kinds of containers: the seed one is basically the one that holds cephadm and all the other containers, and the others are just SSH servers, so we can use them as hosts. And if we run this, we enter the first container, this one, and here we can see that we have cephadm.
A: Nice, it looks quite easy to use. Is the idea that it uses an existing container image? And for dashboard development purposes, how do you make your local changes to the container images?
F: Yeah, this helps with the Python development, basically. We haven't thought yet about the rest; I mean, I guess you would first need to build the container, but there might be a way to mount also other parts, not only the Python ones. These mounts work... I guess we should extend cephadm to mount these other directories as well, but eventually this might also work with the rest of the services and daemons.
J: I think this looks pretty useful; it makes a lot of sense. I think the problem before was just that when the original pull request merged, it broke a bunch of stuff, so we need to make sure we don't break things this time. But you mentioned that there were specific things that you're still blocked on; that was, I guess, loopback devices?
N: In the end, I think you made a pull request adding the no-tmpfs flag to lvm activate, so with that it started working, and I also had to change the storage driver for Docker for it to work; I think that was part of the problem, I don't remember exactly. But my main concern here is that if you run the Docker container with privileges, there might be a lot of unexpected behavior.
F: Yeah, the challenge here is running both a container inside a container and also systemd inside the container; the mixture of these two things is kind of a challenge. I saw some presentations from Red Hat showing that with Podman they are now able to do this also in rootless mode, because, yeah, in the past it also required privileged mode, but it seems like it's working rootless now.
J: I guess it feels to me like this is a road that is going to end at some point: the further down this road you get, the more of these annoying issues you're going to run into. I mean, the overall goal is to have an environment that is like a real cluster and is easy to work on and whatever, and then you can get...
J: Yeah, I mean, there is a cstart command. I don't think anybody really uses it, but it basically just runs cephadm, and it can, and probably should, be modified to have that magic argument that passes your source code directory into the container, so you can make running modifications. That was the intention: to essentially replace, or have an alternative to, vstart that just uses a real cephadm cluster running on your local machine.
J: But that seems like challenge number one. I'd be curious... obviously the multiple-host part is going to be a challenge, although you could probably just start up some VMs on your local machine and then add those to your cephadm cluster too; that might be an option.
J: I think the other half of this is having things that look like real disks and behave like disks but aren't real disks, and I'm not sure that loopback devices are necessarily the best way to do it. They're sort of ignored by ceph-volume for confusing reasons that I don't know exactly, so I would be hesitant to just allow them without understanding why they were excluded in the first place.
J: But the one thing I'll mention is that an alternative to that is the NVMe loopback support in recent Linux kernels: it lets you create an NVMe device, like /dev/nvme0 or whatever it is, backed by a file.
J: And those actually do behave exactly like a normal block device. The only way you can tell that they're not real is that the vendor, instead of being, you know, Seagate or whatever, is Linux; the vendor string looks different, but they look and behave like real devices. So that's what we started using for teuthology, so that we can test all the orchestration features around creating OSDs.
J: They look like real NVMe devices; the catch is that you need a pretty recent kernel, so, for example, Ubuntu 20.04 is not sufficient.
J: You need to use the hardware enablement kernel, which is a little bit newer than that but still supported or whatever, and they're slightly annoying to set up. The tooling... there's some Python tool that somebody wrote that's not available on Ubuntu, and so the teuthology task that sets these up just fiddles with sysfs to do it manually, which is pretty tedious but can be scripted, and has been scripted. I can point you to the teuthology task.
A: I think on the teuthology utility side we're close to being able to run tests: there are scripts that have been written to set up all those teuthology services you need to run on your local machine. So I think the only missing piece there is adding support for adding worker nodes, like a VM or something, to be able to actually run the tests against your local container builds.
A: So we're getting closer to being able to have that all-singing, all-dancing local deployment, where we can actually test, with all the kinds of tests we run, on your local machine.
A: All right, so the last topic is around the autoscaler and pg_num_max. The general idea is around how to work with metadata pools with the scale-down mode. Do you want to talk about this?
O: Sure. I guess the motivation behind this PR is that, if you're familiar with how the autoscaler works, it scales up the PGs on a pool based on the capacity ratio: the higher the capacity ratio, the more it will increase the PG number. But with scale-down...
O: What we do differently is that we take into account the capacity ratios of all the pools and see if there is one pool that is using more than the other pools, and we kind of give... okay, sorry. Initially we maximize the number of PGs; we call it the PG complement, and that's calculated by multiplying the OSD count, the number of OSDs, by mon_target_pg_per_osd, which is normally 100.
O
So it starts by giving a lot of PGs to all the pools, and then, if one pool is using more PGs than the other pools, that pool gets more PGs, and whatever is left you distribute to all the other pools that are not using as much.
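The starting point described above can be sketched roughly like this — a deliberate simplification for illustration; the variable names and the even split of the remainder are assumptions, not the actual mgr code:

```python
# Rough sketch of the scale-down starting point: the total PG budget
# (the "PG complement") is osd_count * mon_target_pg_per_osd, pools
# using more than an even share get PGs proportional to their usage,
# and the remainder is spread across the lighter pools.

def pg_budget(osd_count, mon_target_pg_per_osd=100):
    return osd_count * mon_target_pg_per_osd

def distribute(pools, osd_count):
    """pools: {name: capacity_ratio}; returns {name: pg_share} (pre-rounding)."""
    total = pg_budget(osd_count)
    even = total / len(pools)
    shares = {}
    light_pools = []
    for name, ratio in pools.items():
        want = ratio * total
        if want > even:           # a pool using more than an even share
            shares[name] = want   # gets PGs proportional to its usage
        else:
            light_pools.append(name)
    remaining = total - sum(shares.values())
    for name in light_pools:      # the rest split what is left evenly
        shares[name] = remaining / len(light_pools)
    return shares

# One busy data pool and two nearly empty default pools on 10 OSDs:
shares = distribute({"data": 0.6, ".mgr": 0.0, "meta": 0.0}, osd_count=10)
```

Note how the empty default pools each end up with a large share of the budget here, which is exactly the over-scaling problem described next.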
O
But the problem is that, with the scale-down mode, when you have a default pool such as device_health_metrics or the .mgr pool, you will scale that pool up by a lot because of how the scale-down algorithm works. Then, if you wait a bit and deploy all the other services,
O
the first pool you created, which has the most PGs, will start scaling down, because the autoscaler detects there are all these other pools and needs to adjust to give the correct number of PGs to each pool. That can take a while, and that can be a problem.
O
It can take an hour or more, depending on how many PGs it needs to scale down by. So I think this PR tackles the issue of having too many PGs on one pool.
O
I guess a pg_num_max value would make it so that the pool respects the boundary that is set — it won't go above a certain amount, so it doesn't cause the issue of the time it takes to scale down. Because scaling down PGs is much slower than scaling up: scaling down you do one at a time, but scaling up you can do all at once.
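Applying such a cap amounts to clamping the autoscaler's computed target. A hypothetical sketch — the rounding-to-a-power-of-two detail mirrors the autoscaler's usual preference, but the function names here are illustrative, not the actual mgr code:

```python
# Sketch: bound the autoscaler's raw target with a per-pool pg_num_max,
# after snapping to the nearest power of two (the autoscaler's usual
# preference for pg_num values).

def nearest_power_of_two(n):
    """Power of two closest to n (at least 1); ties round down."""
    n = max(1, int(n))
    lower = 1 << (n.bit_length() - 1)
    upper = lower << 1
    return lower if (n - lower) <= (upper - n) else upper

def target_pg_num(raw_target, pg_num_max=None, pg_num_min=1):
    target = nearest_power_of_two(raw_target)
    if pg_num_max is not None:
        target = min(target, pg_num_max)   # the cap discussed above
    return max(target, pg_num_min)

# A small pool whose raw share works out to ~600 PGs, capped at 32:
print(target_pg_num(600, pg_num_max=32))  # -> 32
```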
O
And if we look at the pull request, there's a discussion about it with Sage and Josh, and I kind of agree on the path we could take: having a small flag for any metadata pools, so that in the scale-down calculation we can prioritize that specific pool first, so that it behaves like scale-up mode — it would only scale based on its capacity ratio.
O
And then you subtract that from the PG complement — the total PGs that we would distribute among the other pools. So I can see how it fits into the scale-down calculation, and I think I agree with Josh's latest comment on the PR about how we'd do it.
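The proposed two-step calculation — size the flagged metadata pools by capacity ratio alone, take that off the top of the budget, then distribute the remainder among the remaining (bulk) pools — could be sketched as follows; the pool names, the small floor for empty pools, and the even split are assumptions for illustration:

```python
# Sketch of the "flag the metadata pools" idea: flagged pools scale only
# by their own capacity ratio (scale-up-like behavior), their PGs are
# subtracted from the budget, and the rest is shared by the bulk pools.

def distribute_with_flags(pools, flagged, budget):
    """pools: {name: capacity_ratio}; flagged: set of metadata pool names."""
    shares = {}
    for name in flagged:
        # Capacity-ratio-only sizing, with a small floor so empty
        # metadata pools still get a handful of PGs rather than zero.
        shares[name] = max(pools[name] * budget, 4)
    remaining = budget - sum(shares.values())
    bulk = [n for n in pools if n not in flagged]
    for name in bulk:
        shares[name] = remaining / len(bulk)   # simplified even split
    return shares

shares = distribute_with_flags(
    {"cephfs.data": 0.5, "cephfs.meta": 0.01, ".mgr": 0.0},
    flagged={"cephfs.meta", ".mgr"},
    budget=1000,
)
```

Compared to the plain scale-down sketch, the flagged pools now start small and never trigger the slow post-deployment scale-down.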
K
So, Junior, for folks who haven't followed the comments, can you summarize the idea? Is it to have different modes — for smaller pools, or pools that we know are not going to consume much capacity, we just use the scale-up mode, and for everything else we use scale-down?
O
Yes, yes — but.
K
I mean, how hard is making that classification? That small flag probably only applies to something like the default pool, right — the .mgr pool?

O
Yes — but we still have to handle all the other services that have metadata pools; we need to somehow incorporate the flag into pool creation, I think, in order for the autoscaler to detect if —
K
You're trying to say, like, when you use a service — let's say an RGW-based cluster — if there is a small RGW pool, then RGW's pool creation would use this small flag and use scale-up mode, and the data pools would scale down? Yes. And in general, these classifications make me a little nervous, because it's almost using two different modes, right?
O
So, in my thoughts, the profile of the autoscaler is kind of global. I didn't want to have some pools that are scaled up and some pools that are scaled down; it should be all the same — all scale-down — but with this small flag, it kind of adds to —
O
It only applies to scale-down mode. The small flag would only help prioritize how you distribute the PGs — the calculation would prioritize the pools that are small, and those would only scale up based on the capacity ratio, but it wouldn't, you know —
O
If you run `ceph osd pool autoscale-status`, it wouldn't say that that pool is in scale-up mode; it would remain in scale-down mode, and all the other pools stay in scale-down mode too. It's just that the small flag helps prioritize the calculation, so that we don't have the problem of scaling up too much under scale-down mode.
K
So why can't we achieve this with pg_num_max? What you're describing as a small pool — if we have an upper bound on how many PGs that particular pool can consume, then that's almost doing the same thing, right? So you might want to —
A
The reason we brought up the idea of the flag, instead of just the max value, was that there are a bunch of other settings that we often apply to metadata pools. We could make this more general than just the pg_num — applying things like the recovery priority that we set by default; the different tools that create RGW or CephFS pools today set a few different values on the metadata pools.
A
We could make those kinds of settings a common set implied by the small flag.
A
The other piece to consider — and it's detailed a little bit in the etherpad — is the upgrade path. I think the idea would be that when you upgrade, you keep all the pools in the current mode, the scale-up mode, and then document that when you're turning on the scale-down mode you would want to set this flag on your metadata pools. We wouldn't have a good way to do that automatically for existing pools.
A
Unfortunately,
then,
but
if
for
new
clusters,
they
could
use
scale
down
out
of
the
box
and
set
the
flag
on
metadata
pools
that
are
being
created
for
the
places
that
we
can
change.
Creation.
J
A couple of things. I wonder, instead of calling it a small flag, whether it should be a scale-up flag — so the flag is named for what it actually controls. Well —
A
If
we're
we're
talking
about
it
being
more
than
just
like
product,
auto
scaling,
but
also
for
like
recovery
priority
or
like
other
things
related
to
metadata,
maybe
we're.
J
Yeah,
okay,
that
was
one
thought
the
other
one
was.
Maybe
it
might
make
sense
to
flip
it
this
around,
and
so,
instead
of
having
to
like
the
special
pools,
be
marked
special,
have
the
the
bulk
data
pools
that
we
actually
want
the
scaled
down,
behavior
mark
them
like
bulk.
A
I guess from an upgrade perspective, if you don't have a bulk setting and we try to apply this profile — prioritizing recovery and the other pieces — we probably wouldn't want to do that to a bulk data pool.
J
When is this scale-down mode intended to be the default? Is it already the default in Pacific, or did we revert that? I can't remember.
K
Yeah, maybe we should take a call on this before Quincy.
J
The worst-case scenario if a pool scales up instead of scaling down is that some data moves as you fill it up — not a big deal compared to these very small pools generating a bajillion PGs and then having to very slowly collapse down on themselves. That Pawsey cluster has been working for days on taking one of the early pools that has a bazillion PGs and slowly ratcheting it down, now that some other file systems have been created — and it's a totally empty pool.
J
And things like RBD users creating their own pools — I think there is an `rbd pool init` command or something like that. I think nobody uses it, but it exists, and maybe we should just be better about documenting that.
A
We
can
also
maybe
piggyback
on
some
of
that
application
enable
like
for
for
rbd,
there's
only
one
type
of
pool
that
they
you
believe,
it's
always
about
data
pool.
K
Talking about that pool creation path: the last time I looked, we had added this flag — "mostly omap" or something — for RGW, which set the bias property for metadata pools, but that wasn't being exercised. So I don't know where the pool creation was happening — was it a different code path? That is also another problem: if we have the implementation, but we're not using that code and use a different code path instead, then there's no point.
A
RGW does its own pool creation, and probably cephadm does at this point too, at least — so we'd have to change each of those places to use these new flags.
A
I
guess
the
hope
would
be
that
if
we
have
this
new
flag
changing
because
that's
the
main
distinction
we
make
is
like
between
these
data
pools
and
large
pools,
we
plan
wouldn't
have
to
do
that
kind
of
change
to
everywhere.
We
create
pools
in
the
future
because
we'd
be
able
to
like
add
new
things
based
on
the
existing
flag.
J
I think that's it, yeah. It's been a real problem on Pawsey — though it's mostly because of that other manager bug where it was making pg_num take really big jumps without bumping pgp_num. I have an open pull request to address that, and I'm deploying it today. But yeah, there are a lot of commands, and having to rush to set flags or whatever is just kind of annoying, right.
A
Yes,
maybe
we
could
be
thinking
about
adding
more
of
those
kinds
of
things
to
those
basic
pool,
create
commander's
options.
Then.
K
I guess, what's the conclusion for Quincy? I mean, do we go with the bulk flag, but with the scale-up mode by default, or —
O
I'm
okay
with
the
the
dash
dashboard
and
we
can
keep
it
scaled
up,
and
so,
instead
of
like
having
the
profile
to
be
like
this
guy,
let's
go
down
his
skill
up
and
then
dash
dashboard
would
just
do
the
same
kind
of
like
calculation
for
only
the
bulk
pools.
I
think
I
don't
know
if
it's
too
much,
for
I
mean
yeah.
A
We
can
test
it
out
on
the
lrc
and-
and
you
do
have
a
larger
scale
test
before
you
can
see
there
as
well.