From YouTube: Ceph Developer Monthly 2021-02-03
A: To start off, the first topic is Ceph mirroring metrics, possibly shared across services: CephFS and RGW.
B: Not an ideal time zone, but anyway. I haven't thought it out much beyond what's written here, and I'm not just reading it for the first time, but the high-level goal here was that in a Kubernetes environment, or really just in general, there's no real good way to have consistent alerting, from an end user's point of view or even from a storage admin's.
B: That's, you know, your RGW multi-site, or your RBD mirroring, or now your CephFS mirroring, not living up to whatever your SLA is requested to be. So how can we, number one, track or categorize SLAs; number two, how can we define them in an engineering fashion; and then, number three...
B
How
can
we
report
against
those
and
gather
metrics
in
a
in
a
scalable
way
so
that
you
know
you
can
build
a
grafana
dashboard
or
whatever
to
talk
about
loads
and
then
and
then
finally,
based
on
those
metrics?
How
could
you
and
basically
metric
and
based
on
the
slas?
How
could
you
alert
so
that's
saying
like
hey
this
image
or
this
bucket
or
whatever
is
falling
outside
your
prescribed.
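
To make that concrete, here is a minimal sketch of SLA-violation checking in that spirit; the names (SlaPolicy, get_replication_lag) are illustrative assumptions, not existing Ceph APIs:

```python
from dataclasses import dataclass

@dataclass
class SlaPolicy:
    name: str               # e.g. "gold"
    max_lag_seconds: float  # how far behind the remote copy may fall

def sla_violations(resources, policies, get_replication_lag):
    """Yield (resource, lag, policy) for each resource breaking its SLA.

    `resources` maps a resource id (an RBD image, an RGW bucket, a CephFS
    directory) to the name of its assigned SLA policy.
    """
    for resource, policy_name in resources.items():
        policy = policies[policy_name]
        lag = get_replication_lag(resource)  # seconds behind the remote site
        if lag > policy.max_lag_seconds:
            yield resource, lag, policy
```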
D: No, yeah, that's... that's...
D: Well, yeah, but I think even...
B: Okay, certainly people have a single Prometheus endpoint, but, yeah, trying to shovel all this data back to the manager just so that it can format a 100-megabyte data dump every 20 seconds, or however often it gets scraped... it's not scalable.
D: Yeah, especially for PVs: they are small, and we have lots of those, and each one is probably just a separate volume, so this will be lots of metrics.
B: Or you might have daemons running, but none of them are in, like, the master mode, in terms of being responsible for pushing or pulling data across. So how do we even just convey that and alert: we expected 10 daemons to be running, there's zero, there's five; or alert: there are 10 daemons running, but no one's doing anything with the data, which must be...
B: ...else needs to come down there, and it still doesn't tell you that. There's a lot of logic in the dashboard to say: well, I don't see any daemons, or I don't see any daemons that are reported as primary, so as a dashboard I'm going to report that as an error or warning, right? And not just a generic service check that says "I expect N daemons", now that we have cephadm or something like that.
E: Well, cephadm will take care of that part. If you have a service spec that says there should be five daemons running and there are only two of them, or whatever, it can raise a warning for that, right? I think the bigger issue would be, like, there aren't any running, although even then, maybe that's just a misconfiguration, like you didn't deploy it, and so it doesn't necessarily need to result in a warning.
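
A minimal sketch of that spec-versus-reality check; ServiceSpec and list_running_daemons are illustrative stand-ins, not the actual cephadm internals:

```python
from dataclasses import dataclass

@dataclass
class ServiceSpec:
    service_name: str  # e.g. "rbd-mirror"
    count: int         # daemons the spec asks for

def daemon_count_warnings(specs, list_running_daemons):
    """Return warnings for services running fewer daemons than specified."""
    warnings = []
    for spec in specs:
        running = len(list_running_daemons(spec.service_name))
        if running < spec.count:
            warnings.append(f"{spec.service_name}: only {running} of "
                            f"{spec.count} daemons running")
    return warnings
```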
B: Yeah, and just to go back a little bit: what brought this whole conversation up was, obviously, that Paul was interested in gathering metrics and alerts specifically for the case of RBD mirroring. And while it does provide a lot of data, that data, from the point of view of "I am Kubernetes and I only read Prometheus exports" or something, is not easily ingestible or available.
B: At tens of thousands of images, it's a painful process.
E: I mean, yeah, I guess, but in general that's usually going to be at a pool granularity: that, like, your remote link is too slow, or it's down, or whatever.
B: Yeah, but without SLAs to say that we expect that you can do X, or you're not allowed to get this far behind, and now suddenly you're, you know, 30 hours behind or whatever you define your SLA as, how do you generically, in theory, do that? Because I know there have been issues with RGW multi-site bucket sync falling behind.
E: There's the sync status, and then there's, like, the per-image, per-bucket, whatever, per-volume status, right; and then there's also, like, feeding this into Prometheus.
E: Raising some health alerts for all the services, and then there's, like, the mirroring status and stuff, yes, and the dashboard, whatever.
A: I'm not the most familiar with Prometheus, but would it make sense to try to have them report "okay" in the normal case, and only report status for specific individual volumes or images or file systems or buckets when there's a particular problem with them?
A: If you're kind of, like, delegating the monitoring to the mirroring daemon or the RGWs, and assuming that, if they're reporting okay, then the images they're handling are okay.
E: That sounds like a question for the Prometheus people, yeah. Well, maybe before that is the question of: should rbd-mirror be feeding its metrics directly to Prometheus?
B: Well, so, yeah, there's no "feed" per se with Prometheus; you stand up a web service, an exporter or whatever, so that they can scrape it. So then you could have something where, via those service daemon key-value pairs, each daemon could report: here's my Prometheus endpoint, here's my IP address and the port you can reach my Prometheus exporter on. And then we have, via the manager REST API, a way to grab...
B: ...you know, gather all the Prometheus endpoints for the system. So there's one place they go to, to say: hey, get me all the metrics endpoints for a given type, be it RBD mirroring versus RBD iSCSI versus RBD NVMe gateways, all these things going forward, so that we don't have to keep feeding stats back to the manager as a single point of failure.
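
A sketch of the per-daemon exporter idea, using the standard prometheus_client library; the metric name and the register_endpoint helper are hypothetical stand-ins for the service-daemon key/value mechanism being discussed:

```python
import time
from prometheus_client import Gauge, start_http_server

REPLICATION_LAG = Gauge(
    "rbd_mirror_replication_lag_seconds",   # illustrative metric name
    "Seconds the local image trails the remote site", ["image"])

def register_endpoint(daemon_id, ip, port):
    # Placeholder: in the discussion this would be a service-daemon
    # key/value pair that the manager REST API can later enumerate.
    print(f"{daemon_id}: metrics at http://{ip}:{port}/metrics")

if __name__ == "__main__":
    start_http_server(9283)                  # expose /metrics for scraping
    register_endpoint("rbd-mirror.a", "10.0.0.5", 9283)
    while True:
        REPLICATION_LAG.labels(image="pool/image1").set(12.5)  # demo value
        time.sleep(15)
```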
B: Yeah, I know the Rook team kind of pushed back in that regard, like: oh, we don't want to read from another Prometheus endpoint. Well, at the end of the day, you also can't just keep shoving all the data through one place, and Kubernetes itself is already reading from a number of Prometheus endpoints. So I don't really... it's just...
B: Right. So if the end solution is to put a Prometheus exporter, or an optionally enabled Prometheus exporter, in each daemon that we care about metrics and alerting from, I think that solves a big chunk of the problems. But then: find a way to discover those endpoints, have a way to turn them on or off, if you care about them or don't. But then, yeah, it still comes down to, at the end of the day: is there a generic way to do the SLAs...
B
Slash
that
we
care
about
and
want
to
alert
about,
or
do
we
pump
that
problem
and
say
it's
up
to
the
end
system
from
you
know,
doing
those
or
injecting
those
metrics
to
say
like
well.
This
is
violating
I've.
It's
got
to
do
some
correlation
saying
you
know
somehow.
This
other
system
said
that
this
rbd
image,
you
know,
should
be
the
gold.
B
You
know
level
sla,
and
thus
I
see
that
it's
you
know
this
far
behind
and
if
you
put
the
problem
versus
you
know,
try
to
bring
it
in
sorry
or
punch
it
temporarily
and
bring
it
in
later.
I
don't
know.
Maybe
the
first
step
is
really
just
getting
a
way
to
export
these
metrics
and
and
getting
a
commitment
from
all
the
all
the
teams
that
we
want
to.
You
know
move
in
this
direction.
That's
you
know,
rbd
mirroring
rgw
multitype,
you
know
is
going
to
be
able
to
provide
these.
E: Yeah, it seems like there's a lot of homework to do, to see what RBD and RGW are currently reporting, and how the dashboard, which I think is really the only consumer right now, is presenting it, how much logic is in there, and to sort of just try...
B: Yeah, I can speak for the RBD mirroring point of view. If it's a high-level metric or status, like a daemon's state, something like that, that's something the dashboard can just easily consume via those key-value pairs. But when it comes to individual images, you know, split-brains, things like that, that's something where the dashboard right now is pulling that data...
B: ...it's not getting pushed to it or whatever; it's pulling that from RADOS, yeah, because we store the status of each individual image and its replication status, locally and remote, in RADOS. But if we have some consistent solution, does that then mean that the dashboard is still doing it one way and Kubernetes is doing it another way? Or is there a way we can standardize it so the dashboard pulls this data from the same source?
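
For reference, a sketch of how that per-image status can be pulled today via the Python rbd binding, which is roughly what the dashboard does; the exact status fields vary by release:

```python
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")   # a pool with mirroring enabled
try:
    for status in rbd.RBD().mirror_image_status_list(ioctx):
        # One entry per image, describing its replication state.
        print(status.get("name"), status.get("state"),
              status.get("description"))
finally:
    ioctx.close()
    cluster.shutdown()
```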
E: Yes, and cephadm does deploy Prometheus, as does Rook. So I mean, we could just say that if you want that level of reporting, it becomes a required component. So that's one option.
B: Does it put its database storage on RBD by default, or does it get a separate pool? I don't think so, no.
A: It sounds like, if we make that more of a required thing, if it's used more heavily, we might need some more monitoring around Prometheus itself and its disk usage and that kind of stuff.
A: All right, next one, then: handling deletion in full clusters.
A: So I guess this came up in particular because, when clusters get very full, we block writes to the cluster at the RADOS level, and we have some flags that bypass that so you can still do higher-level delete operations, but they're not being used everywhere, particularly in the manager modules.
A: So if your cluster is full and you're using the manager's APIs, you may be in a state where you're kind of stuck, and you end up having to increase the full threshold temporarily to be able to delete things, and then, once you have more space, going back to the usual setting.
B: Well, yeah. So I initially did a quick hack-and-slash for the rbd_support manager module so that it would...
B: ...you know, because right now it's unfortunate that the full-try flag, or whatever, is applied globally. So it's weird: the API is attached to the IO context, but it actually applies globally on the inside.
B: So maybe that's something to look at in the future. But just for my hack-and-slash approach, I opened two connections: here's my standard librados connection, and here's my one that has the full-try flag applied. And then, in the rbd_support module, where it's doing the background deletions and other operations, I just had it use the other RADOS connection, the one with that flag applied. But it was still trivial to block it.
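
A minimal sketch of that two-connection hack, assuming the Python rados binding exposes set_osdmap_full_try() the way the C++ librados IoCtx API does:

```python
import rados

def connect():
    c = rados.Rados(conffile="/etc/ceph/ceph.conf")
    c.connect()
    return c

normal = connect()    # standard connection for ordinary operations

deleter = connect()   # second connection reserved for background deletions
ioctx = deleter.open_ioctx("rbd")
ioctx.set_osdmap_full_try()  # assumed binding call mirroring the C++
                             # IoCtx::set_osdmap_full_try(); lets deletes
                             # proceed when the cluster is flagged full
```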
B: That's just because there's a single thread in the manager, the manager command finisher. So you could have a command running in a totally unrelated manager module, let's say the progress module or whatever, that hits that full condition and blocks, and now no manager commands will proceed from that point forward.
B: And that's why I threw my hands up, like: this is a bigger problem than just trying to, you know, tweak this one thing. CephFS had the same issue with, like, deleting some volumes or whatever: it wants to be able to do it in the background, and again it does it in the manager, and if the manager is blocking some totally unrelated commands because the cluster is full... good luck.
A: I guess the bright side is that there aren't that many things in the manager that actually write to RADOS directly. I think it's really only the rbd_support module, the CephFS volumes module, and maybe some things in the dashboard back end.
B: Yeah, so in this particular case it'd have to be something that's coming in on the manager via the manager-command-like path, because that's where it gets executed, in just that one finisher as well. And maybe...
A: And there are some things in, like, RBD where, for example, when you're opening the image, we do a watch, which is a write operation, so that'll get blocked; but that could be something that's, like, full-try all the time, since that's not writing any data. All right.
E: And I wonder: what if each module got its own finisher?
B: Oh, well, yeah. So, to get back to this: even if we do it asynchronously, which we actually do, we use the asynchronous commands, we only execute X number of concurrent commands. Because if you say, like: oh, you're allowed to schedule, like, 5,000 image deletions and have at it, now you've got the manager trying to execute 5,000 concurrent delete operations.
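
A generic sketch of that throttling, in plain Python rather than the actual rbd_support implementation: accept a large backlog but keep only a bounded number of deletions in flight:

```python
import concurrent.futures

MAX_CONCURRENT = 10   # illustrative cap on in-flight delete operations

def delete_image(name):
    print(f"deleting {name}")          # stand-in for the real removal call

scheduled = [f"pool/image{i}" for i in range(5000)]  # user schedules 5,000

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    # The executor queues everything but keeps at most MAX_CONCURRENT running.
    list(pool.map(delete_image, scheduled))
```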
B: When you're not at a large scale, it's a lot easier to become full. Yeah, so these are high-level concerns that people really didn't concern themselves with. And even me, historically, I was like, you know... people said: oh, I got my cluster full. How did you do that? Well, I just have, like, you know, a five-gigabyte cluster. Okay.
A: Another possibility, and maybe this is a terrible idea, might be using full-try everywhere in the manager, so that everything would either succeed or fail rather than hang.
B: Yeah, I'm less concerned about the RBD modules. We have lots of negative testing with unit tests where we receive unexpected errors, like, constantly; we try to have at least one test case per branch that, you know, injects a fault, because you can get, like, an EPERM or, you know, an error at any time. Or, what's the other one, the blocklisted one, EBLOCKLISTED, or any number of things. So we...
A: I think that API was only added for the rbd CLI support, but you could more generically pass the flag with any operation sent down to librados.
A: I think, if we end up, like, calling into the same module, since they're each running in their own interpreter, right, then we run into issues with the global interpreter lock, potentially, if we're not using some kind of threading at the Python level. Yeah, I think you're right, Josh; Python threading kind of sucks, but...
A: It functions, but I'm not sure that works so well with this.
C: On the subject of the full-try and full-force flags, I just remembered an old tracker ticket that I filed when looking into the kernel client, making sure that I basically understood the semantics of those flags, and I think the issue with full-force is still not fixed.
C: I'm not sure if it pertains to the manager discussion anyway, but I'm guessing there might be some operations which we really do want to force.
A: Yeah, I think you're right about the behavior not changing; I think that's still a bug. But we can see how far we can get without using full-force and just rely on full-try.
C: That's currently vulnerable to... it basically doesn't work, because we don't supply any of these flags, precisely because I was looking into it and my conclusion was that things were pretty broken. And yeah, so I just thought I'd bring this up, given that we're talking about full handling, because I do think that, if we're serious about full handling, we do need a way to free up space on images that are already mapped and file systems that are already mounted, because, you know, it's the same cluster.
C: It's not even per pool; it's also per cluster. Even though we got rid of the global full flags in the OSD map, I think some of these fail-safe checks are still global, and so we can still run into this.
C: I don't think so, at least not in the RBD case, or...
C: I think, you know...
C: A case where, for example, due to some complicated cloning operation, or, you know, a complicated chain of stacked clones, maybe; but in that case it's acceptable to fail with no space. But if you just have, like, a plain image, where a discard must punch a hole in the object, or a plain CephFS file, where cloning isn't even an option, right: if you do a punch-hole operation, that's always supposed to free up space.
C: Yeah, like I said, I don't have the current status; I just remembered that ticket, and I do seem to remember having some notes on this subject, but I was trying to find them now and I couldn't. But the tracker ticket came up just like that, and this area clearly needs some attention.
A: Yeah, definitely. We introduced these flags several years back but didn't end up using them everywhere, and so many things probably don't quite handle a full cluster well at this point.
A: We should probably talk about that in the RGW context sometime too; I don't think it's using full-try either.
B: Would the hole punching always work, though? Or, like, what is it, how does the OSD determine that it's not going to increase the space used? Because what if it was an unaligned hole punch compared to the BlueStore block allocation, or what have you?
B: As long as, as you point out, it's a truncate or something like that, it counts, right.
C: Even if it's less than the allocation unit, that's just a no-op currently in BlueStore, or at least it used to be not that long ago, so that should keep those counters the same, and nothing's going to change from the client perspective. But it shouldn't be an error, yeah.
A: It also doesn't account for omap at this point, just data changes, yeah.
A: Yeah, so it sounds like we have some different facets to explore here, and maybe possibly different approaches for the different cases.
A: What do you think we should do, is there anything we should do, with the finishers in the...
B: Manager, yeah. You know, I'm not familiar enough at a low level with how that all executes. I mean, I would rather see the Python threading than a bunch of per-module threads, but if...
F: If that's possible, yeah. Is that code actually called? If I remember, a while ago we decided not to properly shut down the manager but to more or less kill it, because that was not really possible to...
A: Does cephadm do anything that could block for a long time when handling commands?
F: Yeah, but that should no longer block that command.
A: Yep. Maybe we could expand that, so that modules in general have access to the worker pool, like...
A: I think maybe we could make that accessible as a generic feature, for modules to be able to use this pool of worker threads.
F: I'm just worried that already too much stuff is going on on the manager, with the prometheus module, with the dashboard module, with all the modules, that I don't know how many threads cephadm should actually create; right now we are hard-coding it to 10.
A: And perhaps it could even be dynamic, based on whether we need more thread capacity, even allowing more for a larger cluster: if we detect that the manager has a higher queue depth coming in, increase the number of threads.
F: I don't know; I don't have a big enough cluster to actually play with it.
F: I mean, I don't, and I know that the prometheus module was not really able to keep up with the demands on it.
A: Whereas cephadm, I don't think it's doing so much work that it requires resources.
A: Right, right. We talked about maybe moving those into, like, a RADOS pool, right? Yeah, at least some part of it. I think that would address a lot of the concerns around throughput there; it would be the scalable cluster rather than the monitors, which are inherently not scalable for IO.
F: I don't know if that's actually a good idea, to make critical data be stored in RADOS. I mean, if the pool is no longer readable or writeable, then cephadm would not be able to redeploy these.
F: And if we increase the thread pool, then we are making more requests, write requests, to the config store.
E: My guess is that the config-key stuff isn't batching, and that's where the throughput issue is, but it basically remains constant even when you have lots of concurrent writes.
E: Yeah, I mean, I think we can just easily test it locally too, in order to have a better sense of it. I think doing a micro-benchmark like this, to figure out how many concurrent writes it can sustain, would be one thing; then getting a sense of how many writes we generate based on the size of a cluster would be the next thing; and then do some simple extrapolation to see where we would expect problems.
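
A sketch of such a local micro-benchmark; do_write is a placeholder for whatever backend write is being measured (for example a mon config-key update):

```python
import time
import concurrent.futures

def do_write(i):
    time.sleep(0.001)          # placeholder for one small write

def bench(concurrency, total_ops=1000):
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        list(ex.map(do_write, range(total_ops)))
    elapsed = time.monotonic() - start
    return total_ops / elapsed  # writes per second at this concurrency

for c in (1, 4, 16, 64):
    print(f"concurrency {c}: {bench(c):.0f} writes/s")

# Next step from the discussion: estimate writes generated per cluster size
# and extrapolate, e.g. expected_load = writes_per_daemon * daemon_count.
```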
C: So there is a pull request that adds in-transit compression to the messenger, and I wanted to ask if there was any sort of discussion on whether the messenger is the right place to do that, or whether it should perhaps be tailored to just certain messages, or, you know, certain payloads of certain messages, such as the back-end OSD-to-OSD replication communication. The title of the pull request...
C: ...I guess I can paste the link... says that it's geared towards OSD-to-OSD communication, but obviously putting it in the messenger also enables any other connection to be transparently compressed. But, like, the question is: is that really the right place to do it? Because, again, taking 3x replication as the most common setup...
C
The
primary
osd
will
end
up
decompressing
and
compressing
the
same
data.
Basically,
you
know
at
least
twice
redundantly,
because,
unlike
encryption,
that
has
to
move
in
the
messenger
because,
first
of
all,
we
want
it
for
everything
and
not
just
for
primarily
for
back-end
communication.
C
The
the
encryption
key
is
per
session,
and
so
it's
actually
like
a
different
stream
of
bytes,
whereas
the
compression
is
basically
the
same
transformation
applied
each
time.
The
message
is
received
and
then
re-encoded
in
a
slightly
different
form
for
for
for
sending
it
to
to
the
replicas.
C
It
attempts
to
do
the
whole
message:
it
does
it
per
per
perform
your
middle
per
data,
so
per
section
or
whatever
they're
called,
and
there
is
a
it
attempts
to
do
it.
C: It also introduced a new flag in the preamble of the message to say whether it's compressed or not. So it's not, you know, not particularly terrible. But I was just wondering whether it's the right place to do it, because it seems to me that really the benefit is always the OSD-to-OSD...
C
Is
the
communication
in
the
in
the
cases
where
the
cluster
is
stretched
and
so
putting
in
the
messenger
just
adds
additional
spu
overhead
and
given
our
crimson
efforts,
epu
is
really
the
most
sparse
resource
and
going
forward
it
will
be,
it
will
be
our
most
cost
resource.
I
think.
E
Yeah,
I
mean,
I
wonder
if
it
seems
like
the
benefit
is
going
to
be
on
the
data
payload,
primarily
not
the
other
parts,
certainly
for
like
oc
traffic.
So
you
could
imagine
changing
the
message
interface
so
that
instead
of
doing
set
data
and
giving
it
the
data
buffer,
you
could
do
you
could
pass
it
a
compressed
buffer
and
the
message
header
would
have
something
indicating
that
the
data
payload
is
compressed
under
the
particular
codec.
E: And that way you could compress it once and then pass it to both of the messages that go to the two replicas. Or even, if the data comes off the incoming write from the client, if the client compressed the data when it sent it, you could look at its disposition and use that compressed buffer, literally passing it all the way through.
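
A toy illustration of the compress-once idea, with zlib standing in for the real codec; this is not the Ceph messenger, just the shape of the optimization:

```python
import zlib

def make_message(payload, codec=None):
    # The header records whether (and how) the data section is compressed.
    return {"header": {"data_codec": codec}, "data": payload}

client_data = b"client write payload " * 1024

# Compress once at (or before) the primary...
compressed = zlib.compress(client_data)

# ...then attach the same buffer to every replica message, instead of
# compressing independently inside each messenger connection.
replica_msgs = [make_message(compressed, codec="zlib")
                for _ in ("osd.1", "osd.2")]

for msg in replica_msgs:
    assert zlib.decompress(msg["data"]) == client_data
```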
C: Yeah, and going further, you could maybe even, you know, not do it for messages that you know are not going to be compressible, so not even try, not even do that initial step where it tries to compress, and then, yeah.
C: So we should probably disallow that. The reasons that were given are, you know, saving money in the stretch-cluster cases; but, again, someone who enables encryption for everything probably cares a lot more about security, and we just shouldn't be allowing that mode in the first place, to avoid any potential issues there.
A: Yeah, this project was started by Josh Solomon. Okay.
A
That
much,
I
think
your
internship's
over
now
at
this
point
so
want
it,
makes
that
more
changes
to
this
opiate.
Another
project.