From YouTube: 2017-AUG-02 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development.
http://tracker.ceph.com/projects/ceph/wiki/Planning
B
Sure. I don't know, there's not a whole lot — oops — not a whole lot yet. Um, I can give a really quick luminous update: I think we're getting really close to release. Josh and I did a bug scrub this morning and marked the last few RADOS bugs that we consider blockers as immediate — about half of them. Some of them already have fixes ready; there are just a handful of other ones.
B
We want to either diagnose and demote or fix those, so I'm hoping for a luminous release beginning next week. It should be good, and then we can head off towards mimic. I guess that's about it. There'll be a few things people have been working on that we will probably keep backporting to luminous; some of the manager modules are sort of trivially separable and will get backported, like the balancer module that I'm working on, which we'll talk about in a little bit. But otherwise, I'm pretty excited about luminous and looking forward to getting it out there and in people's hands. That's all I have, really. The link to the agenda was just posted, and it has quite a few items, so we'll probably want to keep moving and get through it.
C
So hopefully everyone can see that the idea of this is not to dethrone any of the various Ceph management projects that exist, but to provide a sort of basic, simple one that's baked in — so that people who just need a GUI have one, but also so that people who want to implement interesting functionality have a place to do it. So if you have, say, a cool graph that you want to expose, or the status of your feature, you can add a page for it so you can show it off to people.
C
There are a few things that are fetched remotely, live. So if I was to click the detail button next to the clients — well, I don't know how many clients are about at the moment, but if I did, this would be populated via the manager module doing the equivalent of a `ceph tell`, sending a command message out to the daemon to get the list of clients. So you can do that stuff from the dashboard as well.
C
Some of the stuff here is enabled by new stuff under the hood. Notably, you can now see the status of RGW and RBD daemons. That stuff isn't all in the dashboard yet — I think someone will probably create an RGW page pretty soon — but it used to not be possible at all, because the manager didn't know anything about RGW daemons. There's now a new structure in the manager called the service map; that's accessible to modules, but it's also accessible from the command line.
C
That kind of thing. The other thing that's gone in to enable stuff in the dashboard is listing RBD images, which actually used to be pretty hairy, because you didn't know which pools to look at — you didn't know up front which pools were RBD pools, so you didn't know where to go and run the RBD command to list images. So there's a new concept of application tags that apply to pools. This is pretty transparent most of the time, but you can also drive it manually; there are these new commands for it.
C
Let me see, what else? There are lots of improvements to logging. A few people I see on the list have already noticed that it used to be that if you typed `ceph -w`, which follows the cluster log, you would pretty much just see a continuous view of PG map summaries, which wasn't terribly useful.
C
The other big new thing in the logs is that all of the health statuses that you can have — all of the various things that can show up at the top of your `ceph status` — now generate log messages as well. It used to be that if you had something going unhealthy on your system, it would happen, and if you happened to look at `ceph status` while it was bad, you would know about it; but if you didn't, and it went healthy again, there was no indication of that anywhere in your log.
C
So now all of the possible places where we raise health alerts — they're called health checks now; I have to keep getting that right — have a unique code, and when a health check of that type gets raised, you get a log message saying this condition is now failing, and then, when it clears, you get another message saying this is now clear. That hopefully makes it a lot easier for people to interrogate the log to work out what happened when on their system.
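As a hedged illustration of how those paired messages could be consumed, here is a small parser. The exact log-line wording in the regexes is an assumption modeled on the luminous-style "Health check failed:" / "Health check cleared:" messages, not taken verbatim from this talk.

```python
import re

# Pair up "failed"/"cleared" cluster-log lines by their unique health-check
# code, so we can tell which conditions are still active at the end of a log.
FAILED = re.compile(r"Health check failed: (?P<msg>.+) \((?P<code>[A-Z_]+)\)")
CLEARED = re.compile(r"Health check cleared: (?P<code>[A-Z_]+)")

def active_checks(log_lines):
    """Return the set of health-check codes still failing after the log."""
    active = set()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            active.add(m.group("code"))
            continue
        m = CLEARED.search(line)
        if m:
            active.discard(m.group("code"))
    return active
```

Because every check has a unique code, this kind of interrogation becomes a simple set operation rather than guesswork over free-form text.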
C
Well, I've actually only got one recent line here, and it's the message to the log telling me that a health check failed. So this is a generic thing: every time one of these fails, you get a log message that says "Health check failed" and then the text of it. There's a corresponding blob in the status, and if I ask for health with pretty-printed output...
C
This has changed again, ever so slightly, at the last minute before we release luminous: the message now, rather than just being a string, is an object called summary, which has a message attribute. That's to make it extensible, so that we can add more fields post-luminous without breaking backwards compatibility. Again, if you have tools that need the old format of the health output, there is a setting you can set to make the monitors output the old format alongside the new format, so you just get both sets of fields in the JSON output.
C
So there's, like, a PG degraded, a PG availability, and a PG damaged health check, and the various different PG states get assigned to one of those. So you have a much shorter output in your `ceph status`, and you do still have access, obviously, to all the detail if you just, you know, go and use the existing PG commands to look at it. That also matters during initial cluster setup.
C
There are a bunch of health checks which used to trigger during setup of the cluster, complaining about not enough PGs, or OSDs being down, or whatever, and they've all been sort of massaged so that they don't trigger under those circumstances. So there are checks that will now only complain about having a bad number of PGs if there are actually some objects in the pool, and that kind of thing, so that when your system initially comes up you don't see a bunch of scary warning and error messages.
C
So there's a module that's actually always been there, but it wasn't switched on by default before — hopefully it is switched on by default now; yeah, yeah. This is a module that was mainly written as a demonstrator for the manager to begin with, but it's now switched on by default. It's just called the status module, and it has two commands, `ceph fs status` and — is it just called `ceph osd status`? possibly; yeah, there you go — which give you slightly friendlier, colorized summary views.
C
This isn't meant to blow your mind, that we have this couple of commands. It's meant to make you think: hey, I could add something cool there. So when you have, you know, stuff you would like to monitor for the features that you're working on, it's ridiculously easy to go and add these commands. It's all just Python — like, a few tens of lines of Python in one of the existing modules, whether it's the status module or the dashboard module or wherever, or your own module. You can create this stuff really easily now. Yeah.
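As a hedged sketch of what "a few tens of lines of Python" can look like, here is a minimal manager-module shape. The class, command name, and returned text are invented for illustration, and the stand-in `MgrModule` stub exists only so the sketch runs outside ceph-mgr; the real base class and calling conventions are provided by ceph-mgr itself.

```python
# Illustrative sketch of a tiny ceph-mgr module; names are hypothetical.
try:
    from mgr_module import MgrModule  # real base class inside ceph-mgr
except ImportError:
    class MgrModule(object):  # stand-in stub so the sketch runs standalone
        def get(self, data_name):
            return {"osds": [{"osd": 0}, {"osd": 1}]}

class Module(MgrModule):
    # Commands this module would register with the ceph CLI
    COMMANDS = [
        {"cmd": "hello status", "desc": "Show a tiny summary", "perm": "r"},
    ]

    def handle_command(self, command):
        # self.get() fetches cluster state (e.g. the OSD map) from the manager
        osd_map = self.get("osd_map")
        return 0, "%d OSDs in map" % len(osd_map["osds"]), ""
```

The point is the shape: declare a command, implement one handler, read cluster state through the manager, and return your summary.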
C
So the config options work has changed. You may not even notice this: there used to be a file called config.h that had a whole bunch of preprocessor macro calls in it, and that's gone — or rather, it has been renamed to legacy_config_opts.h — in order to have a nice C++ class definition of all the options, which is now in options.cc. That's not just for fun; the reason for making the change is so that we can add a lot more fields to the options, things like minimum and maximum thresholds and human-readable description strings.
C
So all the options now come with a description string and a long description string, which currently aren't being used to build the documentation but ultimately will be, and that stuff's all available at runtime through a new `config help` command. And now, if I do that — there, a complete help — I get this huge list of all of the possible options in the system and all of the metadata about them.
C
You'll notice most of the descriptions are blank at the moment, though; the infrastructure has gone in, but the work of actually going through and typing all the descriptions in is still ongoing. That should be fairly easy to backport to luminous, even if folks don't get around to writing all their doc strings beforehand.
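To illustrate why structured option metadata (rather than bare preprocessor macros) is useful, here is a hedged, Python-flavored sketch of the idea. The option table and `set_option` helper are invented for illustration and are not Ceph's actual C++ implementation; the bounds shown are made-up examples.

```python
# A structured option table: each entry carries type, default, bounds, and a
# human-readable description, so help output and validation come for free.
OPTIONS = {
    "osd_max_backfills": {
        "type": int, "default": 1, "min": 1, "max": 64,
        "desc": "Maximum concurrent backfills per OSD (illustrative bounds)",
    },
}

def set_option(current, name, value):
    """Validate a new value against the option's metadata before applying it."""
    meta = OPTIONS[name]
    value = meta["type"](value)
    if not meta["min"] <= value <= meta["max"]:
        raise ValueError("%s out of range [%d, %d]"
                         % (name, meta["min"], meta["max"]))
    updated = dict(current)
    updated[name] = value
    return updated
```

With metadata like this in one place, a `config help`-style command is just a dump of the table, and min/max checking happens at set time instead of silently misbehaving later.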
B
I haven't actually used that one, Sage — what does it do? This one, instead of the help, just dumps the config settings that are set and that differ from the defaults, so those are the ones that matter, basically. Previously you'd have to go do `config show`, and it would show every config option, which wasn't terribly helpful. This will just tell you the ones that have been modified, which is usually what you want to know.
C
This is also — as well as adding lots of metadata — a precursor to the move, which hopefully will happen in the reasonably near future, to store all the config options centrally on the monitors and then have all of the daemons consume them from there, rather than each loading them locally from a text file.
C
Okay, so you can imagine this is going to be pretty useful if you're in the middle of an upgrade, or you're dealing with a user who might not be 100% sure about which versions they're running, or might not be completely accurate about which version they're running. This is a very quick way to just interrogate that.
B
There's a `features` command that's similar: it pays attention to the feature bits implemented by the clients that are connected. Knowing what the daemons are is less useful; the main use here is the clients, so you can tell if you have Firefly clients connected, or, you know, Jewel clients or whatever — it tries to infer what release each connection is based on the feature bits it says it supports.
C
Oh, it's probably worth mentioning the OSD destroy and purge stuff. That was done a while ago now, but it fits in this general category of commands that are easier to use and make things easier. So there's faster OSD replacement, and OSD removal, without typing as many commands and without doing as much data movement.
B
The main thing is that when you have a failed disk and you want to replace it, it's usually best to preserve the same OSD ID, so that the CRUSH mapping doesn't shuffle data around, and those commands allow you to do that. So you can do `destroy`, which marks the OSD — it sets a flag that says it's been destroyed, and removes its cephx keys and stuff, but leaves it in the CRUSH map. Then you use the prepare step:
B
You can pass in that old OSD ID to reuse via the OSD ID argument, and it will check whether it's something it can reuse — because it's been previously destroyed — and, if so, it'll let you reprovision it with a new set of cephx keys and so forth. So it's useful both for failed disks and also if you're doing any conversion of FileStore OSDs to BlueStore OSDs; you can use that command for that too. And then there's another one, `purge`, that will just remove all trace of the OSD.
B
There's a set of commands that run all the balancing code. There used to be the reweight-by-utilization command built into the monitor, which generally worked, but it was pretty fragile and you had to sort of trust that it was going to do the right thing, and it's still pretty primitive. We have a bunch of new tools now that do a much better job of optimizing the crush weights and layout and distribution and so on, but they're not easy to use.
B
So the idea with the balancer module is that there'll just be a set of commands that make it much, much simpler to use. Okay — am I sharing my screen, my terminal? I'll make it big. So there's a set of commands; the basic idea is that once the module is enabled, you can set the balancer mode to upmap.
B
That's the only mode currently working right now. There's a status command that tells you whether it's working. Eventually you'll just turn the balancer on and it'll, in the background at all times, check your distribution and make small changes as needed, to make sure you're evenly distributing all your data across all your devices — that's the hands-off approach we'll eventually want. That's not there yet, but in the course of implementing it we built the pieces.
B
So you can break it down into a series of steps and actually tell what it's going to do — make sure it does the right thing. The idea is, you run the optimize command and you give it the name of the plan you're going to create, say "foo"; it'll decide what to do, run the optimizer, take a bunch of steps, and actually show you what the plan wants to do. So in this case, since my mode is upmap, it's using the pg-upmap exception list.
B
You know, after that step it should be completely balanced. Yeah — you'll notice they're at exactly 24 PGs each, so it's done; that's why that plan was empty, there's nothing really to do. Then there's a new eval command — it's only half-implemented right now, but it basically does all the statistical analysis to figure out what the distribution of data is.
B
The basic idea in doing this is that all the infrastructure is being built into the manager module interface, so that you can, in a sandbox, figure out what changes you would want to make and then evaluate what the result would be, and it will score it — so that you can build an optimizer that runs on top of that: for example, one that does a gradient descent on the crush weights to get to a better distribution.
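As a hedged sketch of the "score a candidate distribution" idea — these helpers are invented for illustration, not the balancer module's actual code — a plan could be accepted only if its simulated per-OSD utilization scores better than the current state:

```python
# Lower score = more even data distribution across OSDs.
def score(utilizations):
    """Root-mean-square deviation of per-OSD utilization from the mean."""
    mean = sum(utilizations) / len(utilizations)
    return (sum((u - mean) ** 2 for u in utilizations)
            / len(utilizations)) ** 0.5

def better(current, proposed):
    """Execute a plan only if its simulated distribution scores better."""
    return score(proposed) < score(current)
```

This is the sandbox workflow in miniature: simulate the proposed change, score both states, and only execute if the score improves.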
B
Out of that, it would have a proposed set of new weights as part of the plan; you can see whether that scores better than your current cluster, and, if so, you could execute the plan and it would actually adjust those weights. So it's still a work in progress, but we're getting that infrastructure in place and trying to do it in a way that takes into consideration all the complexities that have previously been ignored.
B
You'll notice this breaks things down by root — which is, essentially, the roots in the crush tree; "default" in this case — and then also by pool, and it does all the analyses for both of those. So, for example, if you use the new crush device class feature, where you have some devices tagged as hard disks and some tagged as SSDs, and you have crush rules that distribute just across those devices, then when it does its analysis of the data distribution it'll take that into consideration.
B
It'll move these PGs around — it knows how big each PG is and how much data is going to move, so it can predict what the usage is going to be afterwards, which is all good. It'll have to be a little bit careful, because the omap stuff isn't totally accounted for.
B
So the idea is that you set it at, like, three percent or something by default, and it'll only make small changes, so that no more than 3% of the PGs are rebalancing and moving data around at any time — you can make it go slow, basically. For the other mode — the one people will actually use, the crush-compat mode, which optimizes crush weights — it'll inherit the ability to ramp crush weights up from zero to whatever the actual weight should be, by starting them at zero.
B
And all the pieces are there to do that, but that mode isn't really implemented yet. Eventually, yes, the idea is that when you have a new OSD, you'll basically say that the target weight for it is the size of the device — you know, like four terabytes or whatever — but the effective weight that it's actually using will start out at zero, and then it'll slowly ramp that up, trying to stay under that three percent threshold or whatever it is, so that it fills slowly over time.
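The gradual-fill behavior described above could look something like this hedged sketch; the function and its per-iteration "budget" are illustrative stand-ins for "no more than ~3% of the data moving at once," not the module's real code.

```python
# Step each under-weighted OSD's effective crush weight toward its target,
# capping the total weight change per iteration at max_step of total weight.
def ramp_weights(effective, target, max_step=0.03):
    total = sum(target.values())
    budget = max_step * total  # how much weight may change this iteration
    new = dict(effective)
    for osd, tgt in target.items():
        cur = new.get(osd, 0.0)
        delta = min(tgt - cur, budget)
        if delta <= 0:
            continue  # already at target, or no budget left
        new[osd] = cur + delta
        budget -= delta
    return new
```

Called repeatedly (e.g. by a background balancer loop), this fills a freshly added OSD over many small steps instead of triggering one huge data migration.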
B
I'm planning on making it so that, for people who have been using the old method — the OSD reweight-by-utilization thing — when they switch to the new balancer module and do the crush weight optimization, it'll back off the old corrections as it starts applying the new ones, so there will be a smooth transition between those two mechanisms.
B
I guess the only thing I'm not sure that would take into consideration is: if you have a cluster with, like, a thousand OSDs and you put in one OSD, that's like 0.1% of the data, and so a three percent threshold basically means I'm going to be hammering on that one OSD. I wonder if we will also want, like, a per-OSD threshold or something. I don't know if it matters that much, really, because all the existing recovery scheduling stuff is still there.
B
Yep. So the current code is basically mirroring the calculations for PG count, for the number of objects, and for the number of bytes. It could be that in some cases — if the pool is full of omap or something — we would just balance objects instead of bytes. The problem is that you can't then equate the two if you're having to mix them, and when you're doing the crush-compat thing you have a single set of crush weights.
B
With a single set of crush weights, you're actually optimizing based on the crush roots instead of on a per-pool basis. So my hope — my plan — is to eventually have it build a model of average object cost on a per-pool basis, where we would basically try to infer what the omap size is by solving for the unknowns: I can see that the OSDs are using up this much metadata space, and I think I can infer from that where it's coming from.
B
That is, which pools that usage maps to, or something like that. And at least, if the obvious model just disagrees with reality — where my model says that this host should be at 2% and really it's at, like, 40% — then I know that I don't understand where the usage is coming from, and then I can stop.
B
So this came up in one of the discussions on the usability call a couple weeks ago, and the question was basically about what we can do given that we can't merge PGs. The motivation is that eventually we want users to not have to think about PG counts at all — the system would just automatically adjust them up or down as needed. The problem with that right now is that we can't merge PGs, so the question was: if we can't do that, and we overshoot the PG count and want to scale it back down,
B
can we just adjust the pgp_num — which is the placement — back down? The PGs are then still separate, but stored next to each other. Does that get us almost as good a result? And I think the answer is: almost. They're still separate PGs, so you still have twice as many peering messages going back and forth, but as far as the data placement goes,
B
it's all the same, so the reliability implications don't change. And if we change the way the OSD allocates memory to PG logs, then the memory footprint won't change that much either, because the per-PG metadata on the OSD is pretty small; it's really the PG log that makes each PG consume a lot of memory.
B
But if we are smart about that, so that it doesn't use so much — it just keeps fewer PG log entries per PG, for example — then we could probably get much closer. So I think that's not ideal, but it's still probably better than forcing all users to think very carefully about PG counts and then getting it wrong. Then the question is: how do you make a set of policies and heuristics that automatically choose a PG value dynamically and adjust it over time, so that users don't have to think about it?
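One starting point for such a heuristic — hedged, and not a formula Ceph ships; the ~100-PGs-per-OSD figure is the common community rule of thumb — could look like:

```python
# Suggest a pool's pg_num from cluster size, replication factor, and the
# pool's expected share of the cluster's data.
def suggest_pg_num(num_osds, replica_count, pool_share, pgs_per_osd=100):
    target = num_osds * pgs_per_osd * pool_share / replica_count
    # Round up to a power of two, with a small floor for tiny pools.
    pg_num = 1
    while pg_num < max(target, 16):
        pg_num *= 2
    return pg_num
```

The `pool_share` input is exactly the kind of user-provided hint discussed below: the expected fraction of the cluster this pool will use.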
B
The proportion of the total PG count that you want — something like that, at least as a starting point. But I think John had some more specific ideas about how he wanted this to look to a user. Am I remembering right that you wanted it to be sort of tied in with an application-level hint?
C
My thinking was that we would want to not just respond to the size of the data in the pool, but get input from the user about how big they expect the pool to be — or at least what percentage of their cluster they expect the pool to use — so that in the relatively simple, probably quite common cases, where someone has one or two applications using their whole cluster, we're not running around adjusting pg_num when we could have just originally been told by the user: hey, this is my RGW, I'm going to use half my cluster for it.
C
"This is going to be X percent of my cluster" — and then have some kind of rule that says how big the various pools should be; like, for CephFS, how big the metadata pool should be as a proportion of the data pool, as an initial guess, and then the automatic adjustment patterns itself on that input. So it would be like: if they said they wanted 100 terabytes of CephFS, and we have a rule that says 1% of that should be allocated to the metadata pool.
B
If you have the user input about what they expect, and then you have what you actually measure in the cluster, those are the two inputs for deciding how to adjust up or down. In the absence of the user input, you could also — if you have enough confidence; say there's enough utilization in the cluster, like you're at 30% capacity or something, and you have a pretty good idea what's happening — make sort of conservative decisions about what to do. Also, yeah.
B
Yeah, well, I think that the split itself is pretty cheap now. I think it flushes some queues, but in BlueStore, at least, the only work you actually do is splitting the PG log, which is just in memory — you know, it's a few thousand key-value pairs or whatever.
C
Yeah — a health warning, or some kind of polite telephone call, where we, you know, ask them — because this should not be particularly frequent. And if they've set up their system and said "I would like to provision, you know, a petabyte of RBD" and they exceed their petabyte, I think it would actually line up with expectations that we would complain — but at the same time offer them the solution, which is: we would like to adjust your pg_nums.
B
Yeah — a pool already has properties like target_max_bytes and target_max_objects that we use for the cache tiering stuff, but if there are, basically, user-input sizes that they set — which are like soft quotas, basically — then whenever that diverges from actual usage, we can just tell them, at least in the one direction.
G
Okay. Oh — some background: I work for Flipkart, here in India. We have deployed Ceph as an RGW — rather, as an S3 object storage cluster. We use it for a couple of business workflows: we have a CDN front-end to show a bunch of images that are stored as objects in the cluster, so that's one of our primary use cases, and we also use it to store, let's say, customer invoices when we do our e-commerce shipments. We have about a petabyte of data right now.
G
We sort of figured out that, okay, maybe we need to add a bunch of throttles on a per-account basis, so that we have some sort of multi-tenancy — some sort of QoS that we can guarantee — or rather, so that we put a limit on the amount of I/O that a particular account could hit the cluster with.
G
So our initial thoughts were mostly to put limits around the amount of data people could write, in terms of PUTs on the cluster, because we found that writing data or deleting data from the cluster was quite heavy compared to, let's say, just fetching the objects with GETs. So we did this, and we ended up adding multiple throttles around all the HTTP REST operations.
G
Okay — maybe thirty GETs per second, and maybe ten object PUTs per second. We count the ops in the RGW process_request path, and whenever the count crosses the limit, we immediately respond with a 503 Slow Down error code. We just sort of lifted that from what AWS S3 does today, primarily so that the clients users are currently using don't really need any changes to handle it — versus, say, the 429 Too Many Requests code, which would be more appropriate.
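The per-account fixed-window throttle described here can be sketched as follows. This is a hedged illustration — the class and method names are invented, not Flipkart's or RGW's actual code — counting ops per account and answering 503 (S3's "Slow Down") once the per-window limit is hit.

```python
# Fixed-window throttle: a timer resets the counts each window (e.g. 1s),
# and requests over the limit inside a window are rejected with 503.
class WindowThrottle:
    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.counts = {}

    def reset(self):
        """Called by a timer at the start of every window."""
        self.counts.clear()

    def admit(self, account):
        """Return an HTTP status: 200 to proceed, 503 ('Slow Down') to reject."""
        n = self.counts.get(account, 0) + 1
        self.counts[account] = n
        return 200 if n <= self.limit else 503
```

As the speaker notes below, the weakness of this scheme is the hard reset: a client can burn the whole window's quota instantly and then sit idle until the timer fires.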
G
So if we give a user, let's say, a limit of 30 GETs per second, we end up dividing it across our RGWs, and we have a load balancer in front that does a pretty decent job of distributing the requests, so we are relying on it not to target a disproportionate number of requests at a single RGW. So far it's working okay. We would probably extend this to put in a global limit — another, global, counter.
G
What period do we use? We let users specify that. For example, when we onboard a user, we discuss their use case with them, and if we see that, okay, some folks want, let's say, 10 PUTs per second — okay, we have a timer that runs every second and resets the counts. Some other accounts are okay with, let's say, a timer that gets reset every minute. I agree it's not really a great way to do that.
G
That's why a leaky bucket might make more sense: with the timer, for example, you could burn all your quota in the first couple of seconds and then just idle, waiting for the limit to be reset at the end of the minute, whereas with a leaky bucket you would slowly get some more capacity — or rather, some more requests — back into your quota. Yeah.
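A hedged sketch of that leaky-/token-bucket alternative (illustrative code, not anyone's production implementation): capacity refills continuously instead of snapping back at window boundaries, so a client that burns its quota early regains requests gradually.

```python
# Token bucket: tokens refill at `rate` per second up to `burst`; each
# admitted request spends one token.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = float(rate)    # tokens added per second
        self.burst = float(burst)  # maximum bucket size
        self.tokens = float(burst)
        self.last = 0.0

    def admit(self, now):
        """Try to spend one token at time `now` (seconds); True = allow."""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Compared with the fixed window, a burst still gets rejected once the bucket is drained, but capacity trickles back immediately rather than all at once at the next reset.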
G
It's quite minimal right now. The changes are pretty much restricted to the RGW process_request function: we invoke a function to actually check the current count against the limit, and then we have another bunch of functions that read the limits from a config file, for now. Ideally, we would like to define the limits through the radosgw-admin command, like we define object quotas.
G
We did this in a bit of a hurry, so we decided to just dump all the limits into a config file and then parse it whenever the RGW starts up. So if we were to take this to completion, yeah, we would change that — we'd add a radosgw-admin command to specify and modify the limits.
I
And as well, it would be wonderful for you to bring it to the RGW standup, if you have an interest in joining our upstream collaboration — we actually have daily standups, you know; we have constant upstream communication where you can discuss things, yeah.
I
It feels to me like, if AWS S3 did this, and it was intended to be programmatically updated, it would end up being a bucket or user policy — or at least I would expect so. But there are also an increasing number of, I guess you'd call them APIs, that are just in the control panel, and it's not always obvious how and where they're materialized.
I
So we've expected that, you know, for new things like this, we go out and try to come up with — if at all possible — a policy-based extension, a grammar extension, that allows us to stay in line as much as possible with the way things are done in AWS, even if we're moving a little bit outside what it does.
B
And this maybe goes without saying, but if you post a pull request with the current code that you're using, that will definitely generate some commentary; we can look at what you've done and what we'd want to change before merging upstream — whether we want to drop in the leaky bucket or not, or whatever. — Sure, we can do that pretty much immediately. — Great, yep. This is great, and this is definitely something I've heard from other service providers — that they hit similar issues.
B
I think the last time it came up, Robin from DreamHost was talking about what they're doing: they're using an HAProxy in front of radosgw, and whenever they identify, like, a single bucket that's getting hammered or something like that, they use it to install exceptions that direct that bucket to a specific gateway, so that it doesn't affect other workloads — or maybe they do some other things too; I don't know all of what they're doing, but yeah.
G
So the way I described the idea in an internal post was that we wanted to implement something that looked like a circuit breaker: okay, if an OSD does go latent, then we would stop targeting I/Os at that OSD and fail our RGW requests much earlier, and maybe once the circuit breaker timer expires, you would check its health again to see whether it was able to serve I/Os normally. So this looks pretty much like the circuit breaker design pattern.
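The circuit-breaker pattern as described can be sketched like this (a hedged illustration with invented names, not the proposed RGW code): after enough failures the breaker opens and requests fail fast, and once the timer expires a probe is allowed through to re-check the OSD's health.

```python
# Per-OSD circuit breaker: closed = send normally; open = fail fast until
# reset_timeout elapses, then let a probe request through.
class CircuitBreaker:
    def __init__(self, failure_threshold, reset_timeout):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None = closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # Open: fail fast until the timer expires, then allow a health probe.
        return now - self.opened_at >= self.reset_timeout

    def record(self, ok, now):
        if ok:
            self.failures = 0
            self.opened_at = None  # probe succeeded: close the breaker
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
```

Failing fast like this keeps RGW worker threads from piling up behind one latent OSD, which is exactly the failure mode discussed later in this conversation.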
B
That said, there are probably still going to be situations in the future where there is a problem with one or a small number of OSDs, and it would be nice to have that not eventually make too many requests pile up on that one OSD and then DoS the RGW as well. But when that happens, it's not necessarily the whole OSD — it's usually a placement group that's actually problematic, and not the whole OSD. Well, I guess it can be both.
B
So it's a little bit tricky. There is a new RADOS backoff mechanism for the PG case: when a PG is blocked — say, peering doesn't complete or something like that — there is feedback in the protocol to the client, in librados, so that it knows not to send requests. It might be possible to take that information and surface it up to radosgw in a way that lets it see that a request is going to be blocked for the foreseeable future, and behave accordingly.
B
So it's in luminous — the backoff code went in, I guess, in, like, February or something — in the librados client. Okay. It's not exposed through the librados API; it's internal to the Objecter code inside of librados, but it knows which PGs not to send requests to, because they're blocked, effectively. Okay.
B
Probably me — you can reach out to the others as well — but I would probably focus your initial efforts on validating and moving to luminous and resharding those large index objects, because that's going to make a whole group of problems go away, not just this one problem. Once you have that resolved, it might be that this is less of a pressing issue, and there are things that are sort of higher-value to spend your time on. Yeah.
I
I think there might be interest — yeah, CC the librados folks, since this could have an impact on the RADOS backoff. As we were just discussing in the back channel, we will be talking about exposing more throttle information; this is related, as part of the work to get a unified event loop for the top half and the RADOS bottom half of RGW.
B
Okay — the thing to keep in mind with the backoff is that it's only used in certain situations by the OSD. By default, the OSD will only send backoffs when, like, peering is stuck. It also has an option to trigger a backoff whenever an object is undergoing recovery, but it mostly does that just so that we get very heavy exercise of that code during QA; I don't think in a real situation you'd ever actually want to do that.
B
B
D
B
I think so. There's a specific root cause in this case, the large RGW index objects, and that just needs to get fixed. But in general, it's possible that something goes wrong with RADOS and you end up stuck. So the scenario you could imagine is, you know, one PG gets stuck peering for some reason, there's a broken something, I don't know, something happened in RADOS so you've lost availability, and RGW will happily keep going.
B
But every, you know, hundredth request will happen to touch that PG and block indefinitely, and eventually those will pile up and consume all of the threads in your worker pool, so eventually RADOS Gateway will get stuck, even though it's only one percent of the data that's unavailable. So having a way to detect that situation and mitigate it is still a good thing, I guess.
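The arithmetic behind that starvation scenario can be put in rough numbers. This is a back-of-the-envelope illustration, not Ceph code:

```python
# Illustrative only: with a fixed pool of worker threads and some fraction of
# requests blocking indefinitely, every thread is eventually consumed even
# though most of the data is fine.

def requests_until_starved(num_threads, blocked_fraction):
    """Expected number of requests served before all threads are stuck,
    assuming each blocked request permanently occupies one worker thread.
    On average 1/blocked_fraction requests pass per thread lost."""
    return round(num_threads / blocked_fraction)

# With 100 worker threads and 1% of the data unavailable, roughly 10,000
# requests are served before the gateway stops responding entirely.
```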
G
B
H
Before I start talking, I just want to say that my English is not very good, so please forgive me if I can't express myself very efficiently. Okay. During the past year, we experienced some kinds of disasters, like man-made operational errors and network problems, so we think that we need a real-time cross-cluster replication mechanism, so that when one cluster goes down we can quickly switch to another one and keep the upper-level systems running smoothly.
H
Now we are using RBD, and we are planning to expand in the near future. And not only do the RBD clients need this replication mechanism; CephFS clients also demand this data replication. So we think maybe we can implement some kind of RADOS-level replication and just do this once and for all, so that the upper-level systems don't need to do the job separately.
H
At our cluster scale, we think that this may be a little difficult, because we can just replicate the repops to another cluster, but in the presence of OSD failures, how do we make sure that this replication still works correctly? And there is another issue.
H
Like with our RBD use, some RBD operations may cross multiple objects. So how do we ensure consistency? For example, one operation involves object A and object B, and we know that RADOS can make sure that operations from the same client targeting the same object are done in the sequence they arrived; but when they come from different clients, or they are targeting different objects, this sequence is not guaranteed. So how do we maintain this consistency for cross-object operations?
B
H
Our approach is like this. For the first one, the OSD failure case, we think that we can make the replication go right in the presence of OSD failure if we can make sure the journal, the OSD journal, is only deleted, removed, or overwritten when the corresponding repops have been replicated to the other cluster. And during the recovery phase, the recovery source does something before it starts pushing a missing object.
H
It makes sure that all repops related to this missing object are replicated first, and then it starts pushing this object. If this can be ensured, we think the first problem can be resolved this way. I don't know if we are going the right way, I think.
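The trimming rule being described, where a journal entry may only be dropped once its op has reached the backup cluster, could be sketched as follows. This is a hypothetical illustration; it is not the actual OSD journal interface, and all names are invented:

```python
# Hypothetical sketch of "trim only after replication": journal entries are
# pinned until the backup cluster has acknowledged the corresponding op.

class ReplicationGatedJournal:
    def __init__(self):
        self.entries = {}          # seq -> op
        self.replicated = set()    # seqs confirmed on the backup cluster

    def append(self, seq, op):
        self.entries[seq] = op

    def ack_replicated(self, seq):
        self.replicated.add(seq)

    def trim(self):
        """Drop only entries whose ops are safely on the other cluster;
        return the seqs still pinned by pending replication."""
        for seq in [s for s in self.entries if s in self.replicated]:
            del self.entries[seq]
        return sorted(self.entries)
```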
B
I think that, at a high level, that makes sense. As long as you only replicate something to the secondary cluster after you're sure that the first cluster has it fully durable, that makes sense, yeah. I would be careful about assuming that it's the same journal that FileStore currently implements, because that changes when you start looking at different OSD back ends.
H
H
H
To talk about the second question again: first, we think that we should forward the object ops within the same object-set operation to the same intermediate node, and the intermediate node then replays these ops to the backup cluster only when two conditions hold. The first is that all object-set operations before this one are replicated to the other cluster; the second is that all repops within the same object-set operation have arrived at the intermediate node. When these two conditions are true, the intermediate node starts forwarding the ops to the backup cluster. We think that if this is ensured, the second question is also resolved.
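That two-condition forwarding rule could be sketched like this. The class and method names are invented for illustration, not any real Ceph interface:

```python
# Hypothetical sketch of the intermediate node: buffer the ops of each
# object-set operation and forward to the backup cluster only when
# (1) all prior operations have already been forwarded (replication order)
# and (2) every op of this operation has arrived.

class Sequencer:
    def __init__(self):
        self.pending = {}          # op_id -> {"expected": n, "ops": [...]}
        self.next_to_forward = 0   # enforces forwarding in operation order

    def receive(self, op_id, expected, op):
        entry = self.pending.setdefault(op_id,
                                        {"expected": expected, "ops": []})
        entry["ops"].append(op)

    def try_forward(self):
        """Forward complete operations, strictly in order."""
        forwarded = []
        while self.next_to_forward in self.pending:
            entry = self.pending[self.next_to_forward]
            if len(entry["ops"]) < entry["expected"]:
                break              # condition 2 not met yet
            forwarded.append((self.next_to_forward, entry["ops"]))
            del self.pending[self.next_to_forward]
            self.next_to_forward += 1   # satisfies condition 1 for the next op
        return forwarded
```

Note how a later, complete operation still waits behind an earlier, incomplete one; that is what preserves cross-object ordering at the backup cluster.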
B
If I'm understanding right, what you're essentially describing is transactions, where the RADOS clients would say "this group of operations needs to all be done together before they get replicated": some sort of transaction concept. Yeah. So going down that path to have some sort of transaction, that's one direction we can go, but a couple of things about it worry me.
B
One: all the current RADOS clients don't operate in terms of transactions, so they would have to be rewritten or modified to do that. The second thing, though, is that it's not really atomicity that usually matters; it's ordering.
B
Say there's some application, let's say it's Oracle, that has two RBD block devices; it will write something to the journal block device, and once that commits, then it'll make some update to its other block device. And it's the ordering that matters: the journal change has to be sent to the remote cluster before the other. They don't have to be sent together; they just have to be applied in the correct order.
B
You know, and I don't know that this mechanism would capture that; it might be able to be modified so that it would. But I want to take a minute, because when we've talked about this in the past we've had a different approach, and we spent a fair bit of time thinking about how this could work. I don't know if we ever wrote this stuff down somewhere, but the basic idea was to have a series of checkpoints in time that are essentially consistency points across the source cluster. Say you had a checkpoint every, you know, 10 seconds or something, just for the sake of argument; then on the remote cluster, at that particular point in time, everything in the source cluster was captured as a point-in-time checkpoint.
B
J
B
J
K
So Ricardo and I had a conversation with Sage not too long ago. Well, it feels like ages now. We have a pad, an Etherpad, describing a sort of similar approach to some extent, in which we have a daemon that will basically act as a proxy and as a sort of sequencer for the operations. I'm sorry.
Sorry.
K
K
Ricardo is on vacation now, and he would likely be the best person to argue the case. But one of the things we considered was having maybe a quorum of these daemons that would take the operations: the OSDs would push the operations into these sequencers, and maybe have a sort of snapshot that would basically be decided by these proxies.
K
H
B
That's the problem; I think it's a complicated topic, and it's hard to cover in a short period of time. Maybe we can just share a couple of links. I pasted one into the chat; that's the output of the released code from the clinic project. This is a summer, or no, year-long project that some grad students did looking at the time-synchronization problem. So I think you can broadly divide this whole piece into two parts.
B
If you have better clocks, if you have, you know, atomic clocks or something like Google does, then great: it takes basically zero time to do that. And if you have really bad clocks, you have a longer stop-the-world. But the idea is that it has to work in any environment. Okay, there was actually a written report that they delivered as well. I don't know, Greg, do you know if that's actually posted somewhere?
E
B
K
B
H
L
L
Here are some updates on the shared read-only cache for RBD. Let me try to give you some background. Initially we were working on the write-back SSD cache for RBD, and at the June CDM, Jason and Sage said that write-back is too difficult with respect to consistency, so we might look at the read-only cache first, which should be much easier. So we switched to the read-only cache, and the design choice was a standalone caching library that can be reused between RBD and RGW.
L
Currently we are focusing on two use cases here. The first one is the librbd shared read-only cache: basically, if you have a parent image and lots of clone images, those clone images can read from the parent image cache before copy-on-write happens. Currently we have some code that generally works, but I'm still trying to make it more robust.
L
The second case is the RADOS Gateway data cache, for which there is an existing pull request, but that pull request is against Jewel and it needs some work to clean up. Okay, so that's the current status. Let me try to give you some details of the design; this is the general architecture.
L
Basically, there will be three parts. The first one is that there will be a cache file library, a common library that does read/write on the SSD. For now we are using a file-based approach, so there will be something like a FileStore design: many small 4 MB objects among those SSDs.
L
L
L
Okay, so this is the overview of the shared read-only cache for RBD. Basically, on each host there will be a shared cache file, which is actually the contents of the protected snapshot, and for each clone image, if no copy-on-write has happened, the reads can actually be serviced from that shared cache file.
L
Yeah, this is the cache metadata design. Currently we are actually using a uint64, 8 bytes, for the metadata: there are two bits indicating whether a block is in the shared image cache or in its own cache, two state bits, and the block ID.
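A packing scheme along those lines could look like the following sketch. The exact field layout (which bits hold what) is an assumption for illustration, not the actual cache code:

```python
# Illustrative packing of per-block cache metadata into a single uint64:
# two location bits, two state bits, and the block ID in the low 60 bits.
# The field positions are assumed, not taken from the real implementation.

LOC_SHIFT, LOC_MASK = 62, 0x3      # shared image cache vs. own cache
STATE_SHIFT, STATE_MASK = 60, 0x3  # per-block state
ID_MASK = (1 << 60) - 1            # block ID

def pack(location, state, block_id):
    return ((location & LOC_MASK) << LOC_SHIFT) \
         | ((state & STATE_MASK) << STATE_SHIFT) \
         | (block_id & ID_MASK)

def unpack(word):
    return ((word >> LOC_SHIFT) & LOC_MASK,
            (word >> STATE_SHIFT) & STATE_MASK,
            word & ID_MASK)
```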
B
L
B
L
Yeah, so all right, let me try to present this page. This is the read flow: basically, for each read there will be a cache lookup first, checking if it's in the cache.
L
L
L
B
L
L
Okay, so here are some initial results. This was tested on one node with two OSDs, and for the baseline it is tested without the SSD cache. By the way, this is an SSD cluster, so the baseline here is not that bad. For the second row we used the read-only caching, and we can see the IOPS increased a lot, and also the tail latency and the average latency reduced a lot.
L
L
M
M
L
L
L
L
B
B
That might make sense. I haven't thought about this too deeply, but the way that I was originally hoping this could be done was that this would effectively be two different caches. So you would have a shared image cache, a shared cache that's on the parent image, and you'd have multiple processes in memory that have their own sort of view of that.
B
That cache is just for the immutable image, and whatever code wrapped that up would be able to be reused in the RADOS Gateway. And then there'd be a separate cache, which probably worked very differently, for the write-back or whatever we eventually do for the per-image copy-on-write data cache, rather than having the two combined.
B
So, for example, you wouldn't have those two state bits; you wouldn't have a single lookup table that would track state in both caches. Instead, the read path would just say: this block doesn't exist in the child image, so I'm going to fall back to reading from the parent image; the parent image has a cache on it, so let me go look it up in the cache and see if it's there, and if that misses, go to the cluster.
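That two-cache read path could be sketched as below; this is a toy illustration of the lookup order, with dictionaries standing in for the real caches and a callback standing in for the RADOS read:

```python
# Toy sketch of the layered read path: child image first (copy-on-write data),
# then the parent image's shared read-only cache, then the cluster.

def read_block(block, child_blocks, parent_cache, fetch_from_cluster):
    if block in child_blocks:          # copy-on-write already happened
        return child_blocks[block]
    if block in parent_cache:          # shared read-only cache hit
        return parent_cache[block]
    data = fetch_from_cluster(block)   # miss: go to RADOS
    parent_cache[block] = data         # populate the immutable cache
    return data
```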
B
L
B
I don't want to... we should probably look at what you've written first, because there are like a million different implementation details that come in as soon as you actually start trying to write this. This is just sort of what we were thinking, I think, what we talked about at some point in the past. So let's look at the implementation and see what you have right now.
B
The way I remember it, I think we would use a socket, so that somebody has to manage the LRU to retire things from the immutable cache; there'd be some socket with minimal coordination just to manage the LRU and evict things, but the getting and putting would be able to just read directly from, like, a filesystem on the SSD. I think that was the original idea. But again, things may change once you start implementing it.
B
You're going to come up with all sorts of reasons to do things differently, so sure, I don't want to tell you how to do this, because you're the one who's actually doing the work. So let's look at it once you have that pull request. I mean, I wouldn't necessarily wait until you have every unit test passing or whatever to publish it; I would just publish it, and we can review the design and approach. That would be good, before we get too far along.
B
L
B
B
B
F
B
N
B
I think, probably, the next step is just to review this. This is adding a new set of functions that includes sub-chunks. Is it also changing the... I think the way that we were hoping to make this transition was to introduce the new set of calls that pass in a list of sub-chunks, and then we implement the old calls in terms of the new calls, so they just pass in a single sub-chunk that covers the whole thing. Yes.
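That transition strategy, implementing the old whole-chunk call on top of the new sub-chunk call, can be sketched generically. These function names are illustrative, not the Ceph erasure-code plugin API:

```python
# Generic sketch of the API transition: the new-style call takes a list of
# (offset, length) sub-chunks, and the old-style whole-chunk call is
# implemented on top of it by passing a single sub-chunk spanning everything.

def decode_subchunks(chunk, subchunks):
    """New-style call: return only the requested (offset, length) ranges."""
    return [chunk[off:off + length] for off, length in subchunks]

def decode(chunk):
    """Old-style call, expressed in terms of the new one."""
    (whole,) = decode_subchunks(chunk, [(0, len(chunk))])
    return whole
```

The design point is that only one code path does the real work; callers of the old API silently go through the new one.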
N
N
B
Okay, I think that's fine. I'm not super familiar with this code, I'll be honest; I think Josh and Mike are more familiar.
B
But I think, yeah, the next step is just to review this, and we will eventually want to rebase it so that it doesn't have the merge commits in there. We'll want to review carefully, get it tested, and merge it, and then I think the next step is that you'll be able to do your follow-on changes, adding the new code on top of it. We've just been busy with the luminous release, so we haven't been paying much attention to these.
B
N
Yeah, we would like to submit it to a particular conference, and we think that it would be nice to have this as part of Ceph at that time. That's one reason we are thinking of pushing it. So if you want us to do any changes, maybe after your luminous release you can talk to us, and we will be glad to do this.
B
O
C
C
So I guess that part's pretty simple and uncontroversial, and I think the big unknown at the moment is whether we want to implement Prometheus endpoints on individual Ceph services, or whether we want to just say to people: if they want to get their stats directly from daemons, then they do it with an agent.
C
So they run something like collectd or Diamond or whatever on their Ceph servers, and that then influences how we expose the service-discovery stuff from ceph-mgr to Prometheus. Because if they're using an agent, we just need to give Prometheus the list of host names; whereas if we're talking to individual Ceph services to get the Prometheus stats, then we need to tell Prometheus about all the individual daemons. So I throw that out to the room.
C
You're muted. Sorry, I always do that. I haven't thought about this too much, because I had kind of optimistically assumed that we could give Prometheus the addresses, usually the three addresses, of wherever the managers are running, and that it would just succeed in talking to whichever one happened to be up. But we should test that theory, because I guess there's no guarantee that's actually how it works; I'm just kind of assuming it does what seems sensible.
O
B
C
C
If you have a manager server that is actually offline, in a situation where you've got, like, two standbys and an active one, it means that if Prometheus tries to connect to a standby, it will get redirected to the active one, which could actually be a bad thing, as we just discussed, if it doesn't like getting duplicate metrics. So maybe we actually wouldn't want to enable that for Prometheus itself.
C
C
So the service-discovery stuff would obviously be getting served from the manager, so something has to have told Prometheus how to talk to the manager to begin with. But if the service discovery is happening via a script on the Prometheus node that is fetching from the manager and then writing out to a local file, it could be that Prometheus itself learns about the managers from that mechanism, and the initial input goes to that script rather than to Prometheus originally.
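That script-based flow could be sketched as below. The manager addresses here are a hard-coded stand-in for whatever fetch step is actually used, while the output follows the JSON target-group format that Prometheus's file-based service discovery (file_sd_configs) reads:

```python
# Sketch of a discovery script: fetch the list of ceph-mgr addresses
# (stubbed out here) and write them in Prometheus file_sd JSON format.
# Prometheus watches the output file and picks up changes automatically.

import json

def write_file_sd(mgr_addrs, path):
    groups = [{"targets": mgr_addrs, "labels": {"job": "ceph-mgr"}}]
    with open(path, "w") as f:
        json.dump(groups, f)
    return groups

# Usage: write_file_sd(["10.0.0.1:9283"], "/etc/prometheus/ceph_sd.json")
```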
B
So the question I had: there's another question you threw out to the room earlier, about whether the Prometheus piece would be scraping from daemon targets, or whether you'd just give it host names and it would talk to an agent. Presumably there's going to be an agent of some form, because you also want to be collecting CPU and disk and host metrics, all those other metrics, also.
O
B
Yeah, I mean, I wonder if, if it's node_exporter, then it probably only knows how to export host information, right? It wouldn't know how to also fetch the daemon information for you. If it were collectd, that's a more general, pluggable thing, so you could get both the CPU information and ask it for the Ceph daemon information, right?
O
Well, there is a textfile exporter that's included in the node exporter, so you can basically just create a text file in a certain directory, and the node exporter will expose that too. Actually, we have a few things set up through that mechanism, for RBD and SMART data, for example, but I would guess that's not too performant in the long run, if you basically run a script and pipe the output into a file.
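The textfile collector mechanism mentioned here works by periodically writing metrics in the Prometheus text exposition format into a .prom file in the collector directory, which node_exporter then serves alongside its own metrics. A minimal rendering helper might look like this; the metric names are invented for illustration:

```python
# Render metrics in the Prometheus text exposition format, suitable for a
# .prom file consumed by node_exporter's textfile collector.

def render_textfile(metrics):
    """metrics: list of (name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join('%s="%s"' % (k, v)
                             for k, v in sorted(labels.items()))
        lines.append("%s{%s} %s" % (name, label_str, value))
    return "\n".join(lines) + "\n"

# A cron job would write render_textfile(...) to something like
# /var/lib/node_exporter/textfile_collector/ceph.prom (path is an example).
```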
B
Yeah, okay, I guess I'm not too worried about scraping metrics directly from Ceph daemons. It seems like that's always going to be annoying, because even if it's something like collectd, you have to go configure collectd and tell it: these are the daemons, these are the admin socket locations, and all that. Right? Not really.
C
C
O
C
If we assume that the configuration of the agents is uniform, then we could make it so that the Prometheus module has an option, a boolean, to say "I'm running an agent", and then an int to say what port the agent listens on. That way, the Prometheus module would be able to take the list of Ceph server host names that it already knows, tag on the port that the user has configured, and then tell Prometheus how to go talk to the agents based on that.
C
B
I mean, I think when Dan was setting the default port for Prometheus, there's some wiki where you just assign your own port so they're all unique, so we have a Ceph port assigned. We could just pick a second port that's the Ceph host agent; if you happen to be running that, you just install the package, and it would just serve up everything it knows about Ceph, yeah.
B
C
I haven't been proposing writing our own little agent, but it's not a crazy idea. I don't know; I think some people get kind of a warmer, fuzzier feeling if we're plugging into something generic. But then, if node_exporter is enough for the generic stuff, we're not throwing anything out if we just write our own agent that does the Ceph stuff as well. Exactly.
B
A
B
That would also direct the service-discovery thing to tell Prometheus to scrape the host-level information from the normal host agent, because I think in most configurations you just have the default Prometheus host thing, but you'd want Ceph to make sure that all its hosts are being scraped, right?
B
C
Yeah, yeah; we need to go and check the service-discovery format and see. I guess you'd say, like, host and then the node exporter port, comma, the Ceph port; presumably you can tell it multiple ports. Wait a second: the only caveat to that is that I haven't checked how good the disk stats are that you get out of node_exporter.
C
P
P
P
P
Yeah, there's only one thing I'd like to add, but it hasn't been merged, and that's the rbd-ggate daemon, because that is actually the thing that got me started porting these daemons to FreeBSD, so that I can run a device that actually has an RBD image and run, like, virtual machines on it. So do you think I should just submit a PR for the release notes and just squeeze it in there?
B
Yeah, I mean, you can definitely submit a PR for the release notes. I'm not sure whether ggate makes sense to merge before we release; I haven't looked at it, so I don't know if it's something new that can break something else, or if it's totally separate.