From YouTube: Ceph Developer Monthly 2020-11-04
A
All right, let's get started. Welcome to the Ceph Developer Monthly for November 2020. Today we've got three topics on the agenda. First, we'll start off with discussing manager scalability, and specifically how to measure and profile the problems that we're seeing, so here's the etherpad link for that.

A
Then I'd like to hear what folks' thoughts are, what kinds of ideas folks have for how we could measure which parts of the manager are causing bottlenecks or being slow.

A
Maybe we should start with some context around some of the problems we've seen.

A
I guess we've seen a few different symptoms. One is simply high CPU usage, with the manager backing up in the finisher queue, where it runs completions or runs calls into the different modules when there's a new map update. That can get backed up with millions and millions of completions and therefore cause a bunch of latency until all of those are processed.

A
Another symptom we've seen is some calls in the manager taking a very long time in a particular module, either in C++ code or in Python code, but either way holding the global interpreter lock or holding some manager lock at the time, which prevents other manager commands from being responsive or making progress.

A
Those are kind of the few symptoms we've seen. I'm sure we'll see more as we continue to see larger-scale deployments, with the manager trying to support ever more OSDs or other daemons that report information to it.
A
We talked about some simple things we could do, listed in the etherpad, like automatically tuning the reporting intervals, so that we scale back how many reports we're processing from OSDs if we're too busy. That's kind of the workaround that we have for now, just increasing those reporting intervals, so at least if we could do that automatically we could avoid some of the issues in the short term.

A
There have also been some improvements to the different modules to make them more efficient, in the way that they get the data they need from the maps more directly, or operate more in C++ instead of going through a conversion to a JSON string first and then converting that into Python objects, which can be kind of expensive if you're doing it many hundreds of thousands of times a second.
C
Yeah, and one of the biggest challenges that we've seen with the manager and manager modules is debuggability. It's a big challenge at this point; it's difficult to say which module is causing a problem, since generally it's just the manager as a whole that's backed up. A couple of items on that etherpad are short-term items that will help us add things like, you know, metrics as to which module is using how much of the queue depth, or other things like that.

C
Currently there's no way to figure it out; it's mostly hit and trial. We try to turn modules off and on and then see which one is problematic. In the past we've had the balancer module causing issues, and recently we also got to know about the progress module. The core of the problem remains the same, but there are some short-term things that we can do to make it easier to debug things and also to control things and act upon it.

C
Like, you know, controlling what stats we are reporting from the manager and the period at which we are doing it; we can always auto-tune those things. So in my mind those are short-term things that we can do and get a lot of benefit out of. Longer term, I think there are other items on that etherpad that have been listed that we can also go through.
A
Yes, you mentioned one that wasn't down there yet, which was trying to categorize, and maybe have perf counters for, the different types of things that are in the finisher queue, or to try to track which module they came from.

A
The flip side of that would be tracking how much time we're spending for a given completion or a given notify into a particular module.

A
Then you can see whether it's a case of particular calls being very expensive, or whether it's just the sheer number of them.
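A rough sketch of that idea follows (illustrative only, not actual ceph-mgr code; the NotifyStats class and module names are made up for the example): wrapping each module's notify call in a timer makes it possible to tell a few very expensive calls apart from a large volume of cheap ones.

    import time
    from collections import defaultdict

    class NotifyStats:
        """Accumulates per-module call counts and latencies (hypothetical helper)."""
        def __init__(self):
            # module name -> [calls, total_seconds, worst_seconds]
            self.stats = defaultdict(lambda: [0, 0.0, 0.0])

        def timed_notify(self, module_name, notify_fn, notify_type, notify_id):
            start = time.monotonic()
            try:
                return notify_fn(notify_type, notify_id)
            finally:
                elapsed = time.monotonic() - start
                entry = self.stats[module_name]
                entry[0] += 1
                entry[1] += elapsed
                entry[2] = max(entry[2], elapsed)

        def dump(self):
            # The kind of summary that could be exposed as perf counters.
            for name, (calls, total, worst) in sorted(self.stats.items()):
                avg_ms = total / calls * 1000.0 if calls else 0.0
                print(f"{name}: {calls} notifies, avg {avg_ms:.2f} ms, "
                      f"worst {worst * 1000.0:.2f} ms")

    if __name__ == "__main__":
        stats = NotifyStats()

        def slow_module_notify(notify_type, notify_id):
            time.sleep(0.002)      # stand-in for an expensive per-map-update handler

        def cheap_module_notify(notify_type, notify_id):
            pass                   # stand-in for a cheap handler

        for i in range(50):
            stats.timed_notify("slow_module", slow_module_notify, "osd_map", str(i))
            stats.timed_notify("cheap_module", cheap_module_notify, "osd_map", str(i))
        stats.dump()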
D
Do we have a particular threshold where, you know, problems like this start to kick in? Is there a recipe, so that we can talk to the user base and say, look, you guys are in the danger zone? They could be the first people to help us, hopefully, understand any kind of changes and their impact.

A
We could probably make these happen artificially as well, by creating more PGs than we would for a much smaller cluster, or by decreasing the reporting intervals so that we have many more updates happening.
B
Yeah, I mean, personally it seems like getting some metrics around things is the first step to knowing where the issue is. And when you talk about serializing data structures like your PG map updates or whatever between C++ and Python, last time I looked I thought those were done pretty trivially, because they just straight up convert everything, you know, not on demand.

B
They just straight up do it, as opposed to building out their own sub data structures that are more like view data structures that know how to get the data on demand, as opposed to right now, where we build it all out first and then we pass it down for everyone to look at. But right.
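To illustrate the distinction being made here (a sketch under assumptions, with made-up names, not the real C++-to-Python binding): an eager converter pays the full conversion cost for every consumer, while a view-style wrapper only fetches the fields a module actually asks for.

    import json

    class EagerMap:
        """Stand-in for converting the whole native map up front (JSON round trip)."""
        def __init__(self, native_map):
            self.data = json.loads(json.dumps(native_map))

        def get(self, key):
            return self.data[key]

    class MapView:
        """Stand-in for a lazy view: fields are fetched only when requested."""
        def __init__(self, fetch_field):
            self._fetch_field = fetch_field   # would be a targeted native accessor in practice
            self._cache = {}

        def get(self, key):
            if key not in self._cache:
                self._cache[key] = self._fetch_field(key)
            return self._cache[key]

    if __name__ == "__main__":
        native_map = {"epoch": 42, "num_osds": 5000, "osds": list(range(5000))}
        eager = EagerMap(native_map)                  # converts all 5000 entries even if unused
        view = MapView(lambda key: native_map[key])   # converts only what is asked for
        print(eager.get("epoch"), view.get("epoch"))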
B
You have to have the metrics to know where things are spending the time; you know, as it launches into the interpreter to invoke a function on something, start collecting time in and time out, calculate latency metrics, and expose them somewhere. It says here, you know, the top tool or whatever, but you could probably expose it even faster right now with some perf stats or something like that, without having to get too fancy with it, because it is kind of a debug-level tool as opposed to something at the user level. If we were doing our job right, a user should never have to care about it, right?

E
It doesn't seem too invasive to me, and it's not like sensitive data, so yeah, maybe we can collect this as well.

E
No, well, we collect them daily for clusters that have opted in, so we receive them daily. So I don't know how, I mean whether, it makes sense to collect this specific data, or which channel we would want to add it to.
D
So, instead of looking at the whole data set, you can zero in on the ones that have that flag set, that have got a problem, experiencing, you know, the issues that we're seeing with the manager. You know what I mean? So there's something specific that we can look at. I don't know whether that exists in the telemetry now, or whether it would just be, you know, not being able to see the wood for the trees in the data.

A
Yeah, we don't have that kind of warning right now, but I guess one of the symptoms that I can see is commands to the manager being slow. So that could maybe fit into, like, a manager slow-ops warning, perhaps.

F
Also, adding the module name that is causing that issue would be useful.

A
Yeah, definitely. If we can look at these paths internally and report kind of what the potential blocker is, that would be fantastic.
B
Well, I mean, personally I think it'd be kind of interesting, at the end of the day, to be able to do an active/active manager, where it's not that a given manager is active for every module; it's more that individual modules can be voted to be active in a given manager. Or maybe the first step is not even voting; it's just saying, you know, instance X, you're responsible for this module, this module, and this module.

B
Because when it comes to the dashboard or Prometheus or something like that, the ability to scale them out... I mean, the dashboard right now is active/passive, but being able to even do that, where you put it behind a load balancer and say I've got three instances running or something like that for HA, would be great.

B
But I guess it all comes down to, at the end of the day, that we treat the manager as something we just feed a fire hose of information, and if you scale the problem out that doesn't necessarily help, because you're still feeding the exact same fire hose: instead of all the OSDs reporting to one manager instance, they're now reporting into X number of manager instances, and each of those X manager instances needs to process and pre-process the exact same amount of data.
A
Yeah, longer term we might need to, you know, scale out the modules themselves, potentially, if some of them could be easily parallelizable, like collecting things for Prometheus all at once, which is embarrassingly parallel, right.

D
So are we still suffering from Prometheus scale issues? Because the last I sort of heard from Patrick was that he'd quote-unquote fixed that.
D
I think the fix was to put the data gathering in a separate thread, and then when the request comes in to do the scrape, it just basically takes it from the cache. So instead of it being an instant thing that has to go and collect and do all the calculations and so on, he's offloaded that: he's just populating the cache at an interval, and then every scrape that comes in just refers to the current contents of the cache.
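A minimal sketch of that caching pattern, assuming nothing about the actual mgr/prometheus code beyond what was just described: collection runs on its own thread at a fixed interval, and each incoming scrape simply returns the latest cached payload.

    import threading
    import time

    class CachedCollector:
        """Gathers metrics in the background; scrapes read the cached result."""
        def __init__(self, collect_fn, interval=15.0):
            self._collect_fn = collect_fn
            self._interval = interval
            self._lock = threading.Lock()
            self._payload = ""
            thread = threading.Thread(target=self._loop, daemon=True)
            thread.start()

        def _loop(self):
            while True:
                payload = self._collect_fn()     # the expensive walk happens off the request path
                with self._lock:
                    self._payload = payload
                time.sleep(self._interval)

        def scrape(self):
            with self._lock:                     # called per scrape; just a cheap copy
                return self._payload

    if __name__ == "__main__":
        def gather_metrics():
            time.sleep(0.5)                      # pretend this walks PG/OSD stats
            return f"fake_metric {time.time()}\n"

        collector = CachedCollector(gather_metrics, interval=1.0)
        time.sleep(1.0)                          # give the first collection time to land
        print(collector.scrape(), end="")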
A
Okay, yeah. If that's enough to get it working well at larger scale, that would be fantastic, because that would be a lot simpler than trying to scale things out further. But that's a good example of offloading things and making things not require the interpreter lock to process.

A
Yeah, it sounds pretty interesting, so I think we'll probably know more once we can look at this telemetry data and do this profiling.

C
Yeah, talking of telemetry data, I'd be really curious to find out which modules, although we have them always on and enable different kinds of modules, which modules are actually being used by users, versus ones we turn always on and people just turn off. I'd be curious to know, and that may be one of the reasons we're not finding out about some issues from upstream users.

E
I think some of this data we already have, but I'll need to double-check that.
E
Yeah, we have a public dashboard that everybody can access; I just pasted it in the chat. And we also have a private dashboard that has some more information, like more drill-downs, where you can see the actual raw reports that we received. Here the data is aggregated in order to have the most privacy for the users.

E
Let me double-check that, Josh. I think we have it, but we haven't had any panels with that, but I'll get back to you on that.

E
Yeah, that's a very good question. So most of the data is anonymized, and there is no strict policy right now about sharing the data.

E
So there is cluster data and also device data, which we mostly want to have in order to build better disk failure prediction models, and that data can be anonymized, so it's easier to share. But about the clusters themselves: even though we do not collect anything which can be considered sensitive information, it's not open in the sense that you can just download it, but you can have access as a developer, of course.
A
Wasn't there a blog post made a while back that tried to analyze some of the things that were in the wider data set?

E
Yes, but the screenshots were taken from... oh, you mean whether he had access to that? Yes, he didn't have direct access to that, but he was working with pandas, and I think eventually he was using the dashboard itself, like screenshots from the dashboard.

A
All right, well, it seems like we have some data to collect here and a good plan for the short term. Anything else on this topic?

F
Yeah, on the dashboard, in the daemons column, I don't see the manager.

E
Yes, we do not yet collect the data about all the daemons. I'm working on fixing that as well; we have several missing.
H
Okay, hello Josh, hello everyone. This is Nisa from Intel. Okay, so today I'm going to talk about the replicated write-back cache. Okay, so let me present the slides first; is that okay?

H
Okay, okay, so today I'm going to talk about the replicated write-back cache in librbd. Okay, go to page one. So this is a write-back cache based on the image; that means it is a log-based, ordered write-back cache, and the cached data is stored on persistent devices, like persistent memory and SSDs.

H
Currently we have milestones: the first phase is to cache the data on persistent memory, and we can also cache data on SSD; that patch is ongoing. So in the first phase we cache the data as a single copy, and the second phase is to replicate the cached data across different devices in different servers; this is to guarantee redundancy. Okay, and in the second phase we will use

H
the PMEM device as a cache device, and to replicate the data we will use the remote PMEM device over an RDMA protocol. Okay, so this is an overview. Let me give more detail; please go to page three. This is the overview of our components.
H
Okay, so on the compute node, inside librbd, we will provide three components. The first component is the write log; this is to manage the cached data in the persistent memory device. The second part is the flusher; this part will flush the cached data to the OSDs.

H
Okay, and that is on the compute node. Meanwhile, we need to replicate the cached data across remote servers. For example, we can start a replica daemon service on a storage node or on other servers. So when librbd starts and enables the write-back cache, at that moment it can allocate the cache data locally, and meanwhile it can replicate the cached data to the remote server.

H
So let's go to, sorry, the page on the data layout on persistent memory. Okay, so for the cached data we have three parts. The first part is the root.

H
Okay, and the third part contains the customer data. So, for example, if a write request comes in, it will allocate space in the third part and copy the write data into that space, then it will insert a log entry in the second part, and after that it will update the tail in the root.
H
Okay, yeah, so this is the data layout on the PMEM device. And then this part is about the replicated write-back cache; before that it was mostly a single copy of the data on the librbd server. Okay, so how do we replicate the data? It includes three kinds of services. The first one is the librbd process.

H
We call it the master librbd. That means on the compute node the application needs to open the RBD image and do the reads and writes. Okay, and the second kind of service is the replica daemon.

H
The replica daemon services manage the PMEM device in their server and provide the cache replication for the master librbd. And the third service is the controller. Here I initially use the Ceph monitor; in fact we can create a controller. This controller will manage the status and the information of the replica daemons, so that later the master librbd can query the controller and ask for information about the replica daemons, so that it can find out where it can replicate its cache to.

H
Okay, so for the replicated write-back cache we will use active/standby mode.
H
Okay. Here I list the main functionalities for the replicated write-back cache. I've split it into three kinds of scenarios. The first is the normal I/O flow; this is the normal case. In this case, the cached data can be replicated across the local PMEM device and the remote PMEM devices.

H
Okay, and the second part is about the handling of failures. If something goes wrong in the master librbd, that means the master librbd may crash. If it crashes, the corresponding replica daemons, which track the status of the master librbd, will find that it has failed.

H
One replica daemon will acquire the exclusive lock first and then start to flush the cached data to the OSDs. Here I want to emphasize that when librbd wants to enable the write-back cache, it first needs to acquire the exclusive lock, and only once it gets the lock successfully does it start to enable the write-back cache.
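As a sketch of that failover rule (hypothetical names only; the actual exclusive lock lives in librbd/RADOS, not in this toy class): the replica daemon may only flush the cached writes to the OSDs after it has won the image's exclusive lock, the same lock the client had to hold to enable the cache.

    class ExclusiveLock:
        """Toy stand-in for the image's exclusive lock."""
        def __init__(self):
            self._owner = None

        def try_acquire(self, owner):
            if self._owner is None:
                self._owner = owner
                return True
            return False

        def release(self, owner):
            if self._owner == owner:
                self._owner = None

    def replica_failover(lock, daemon_id, cached_entries, flush_to_osds):
        """Run by a replica daemon once it decides the master librbd is gone."""
        if not lock.try_acquire(daemon_id):
            return False            # another replica, or a restarted client, owns the image
        try:
            for entry in cached_entries:
                flush_to_osds(entry)   # replay the replicated write log to the OSDs
            return True
        finally:
            lock.release(daemon_id)

    if __name__ == "__main__":
        lock = ExclusiveLock()
        flushed = replica_failover(lock, "replica-daemon-1",
                                   [b"entry-0", b"entry-1"],
                                   lambda entry: print("flushed", entry))
        print("cache flushed:", flushed)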
H
And the next case is failures in the replica daemons. In this scenario the I/O can continue while the master librbd allocates another copy in other replica daemons, recovers the replication, and then continues the I/O flow. I think this behavior can be made configurable in the future.

H
Okay, so let's go to the main techniques. One moment. Okay, so this part is the main discussion point I want to discuss today. To implement the above functionalities, we have three points to discuss. The first part is about the management of the replica daemons, and at first we would like to use the monitor.
H
Okay, so it means that the replica daemons report their status to the Ceph monitor, and the Ceph monitor maintains that information. Then, when a master librbd starts and the write-back cache is enabled, it needs to query the information about the available replica daemons from the monitor, and then it can find out where it can replicate the cached data to.

H
Yeah, okay, so any suggestions about this part?

B
The original idea was that there was going to be a fixed host, like host A in rack A and host B in rack B, and they were just going to be tied together at the hip and call it a day. Now you're talking about a system where librbd and these caches, these daemons, need to understand the entire topology of your network: which hosts have RDMA connections, potentially, between which other hosts, and which hosts have capacity. To me this is a huge creep of, you know, responsibility.
H
Yeah, I see. So do you mean that the Ceph monitor needs to include other information about the replica daemons, like whether they support RDMA connections?

H
Yeah, right, we also have the same worries. In fact, we considered putting the replica daemon on the same nodes as the OSDs; I mean, for the OSDs which can support RDMA connections, maybe we can put a replica daemon on the same node.

H
Further, the monitor needs to contain the information about the replica daemons: the status, the connection information, and the capacity information; that kind of information, right.

B
How do those images that were previously replicating their data to host A, how do they get those other sites, all those other nodes, up to date so that, you know, they can actually fail over? Because if you ever have a reallocation like that, you only have a partial view of the world for the write-back log, right? You can't just pick it up randomly and say go from here; you're missing a chunk of data from pre-failure, pre-reallocation.
H
I mean, for example, at first librbd allocates the cache copies in different daemons, so at first it needs to do initialization, and after initialization the RDMA connections are created. After that it will store such image cache information in its metadata. And once any failover happens, for example one replica copy fails,

H
the master librbd needs to find another replica copy and do recovery. I mean, after the failure it needs to do recovery, right, do the reallocation.
A
Yeah, that definitely needs a lot of the same things that the OSD does, like, you know, failure detection and backfill, that kind of thing. I guess maybe some of the main differences from what the OSD does today would be around the replication path being RDMA-based, avoiding the CPU, and probably doing something a bit different with respect to placement: not spreading the data around across different hosts, but doing something more like mirroring.

H
Oh, sorry, here I forgot to mention one point. Because we use persistent memory over fabrics to do the data replication, that means every cache copy has exactly the same data layout, so we will use RDMA verbs like RDMA read and write to do the data replication. That means the operation doesn't need the involvement of the remote server's CPU,

H
I mean, in most cases, most of the time, yeah.
B
Oh yeah, right, yeah. I understand that. It's all the corner cases, when you then start to say we want to be able to have this arbitrary failover daemon, whereas what this is talking about is the fact that you could use these PMEM I/O replicated offloading functionalities. You know, we're just trying to guard against the one case where you have a single failure, as opposed to now, where you're potentially

A
I get a little confused about the mirroring part. If you're maintaining the same layout, would you be, say, reserving a third of the PMEM on each node to be mirrors of, like, a set of three?

B
I mean, yeah, so they would essentially be 100% mirrors. If you're saying here's my primary host and here's my replica host, so I just have one backup, then yeah, whatever is currently in the log on host A would also be in the log on host B.
H
I mean, one one-gig cache on the remote PMEM device, and the other one-gig local PMEM device, will be mapped into the application, and then the data in the two memories will be exactly the same.

H
For example, we allocate the customer data in the third part, say at LBA 1000, and write the data to that location. In the remote memory it works the same: it stores it at the same LBA 1000, and then, yeah.
B
It's just the management part, you know, bullet point one; that's where I start losing it: the fact that you have to build this entire system, almost, to manage all the corner cases for failure and assignment and things like that, versus when we originally talked about it, it was just like,

B
this VM is allowed to migrate from this host to this host, because its data will be either here or there, so these are the two spots where that VM is allowed to land, because that's where its data is going to be regardless. And then you don't even need a daemon at that point, because it just starts up and its data is there, right? You don't need to have this OSD-like process of, you know,

B
trying to write back the data to the OSDs on failure. But what if the process, your VM, migrates somewhere else, to a host that doesn't have the data? Then you can't even start up until this flusher process finishes flushing, you know, on this third host, right?
B
Like, you know, your data is not corrupt in the RBD image; it's just that you go back in time, right, because it hasn't replayed the log to bring you back up to the last known good state. But to get to the last known good, committed state when it restarts your VM, it either needs to start it on node A, where you were originally located and where the cache is, or needs to start it on node B, where it was configured to replicate the cache.

H
You mean that at that moment, when the VM restarts, it will read data from the RBD image, but at that moment the data is not correct.

B
Then your QEMU process just needs to wait for the flushing process to do its work, mark the cache as clean, and release the exclusive lock, so that the other node can acquire the exclusive lock and say: hey, my cache is clean; oh, I don't have a cache here, so I've got to create a local cache and start all over again, and now somehow I've got to go clean the cache on the other host, the original host

B
that was the backup host, so that it can start fresh. Or maybe, you know, this new monitor process assigns a new location to be the new replica. I would say there are a lot of corner cases there, right.
A
Yeah, but with the management aspects, I mean, when you're talking about replaying all these things that the OSD does, it might almost make sense, if you wanted to go down that path, to make that kind of a new pool type that would do replication in this particular way and have its own kind of recovery semantics, but that would be able to share the existing failure detection, and perhaps

A
lead to a new version of, like, tiering for the purposes of writing back to the other OSDs' slower disks.

A
That's kind of a much larger project, but I think it might make more sense than running a parallel set of the same kinds of services just for this cache daemon.
B
Yeah, I mean, at the end of the day our goal is to get to the point where the OSDs are a lot faster, right? And then our goal should also be to get to the point where the OSDs are able to do some of the offloading of, you know, transferring messages between their peers for replication. Regardless of how we get there, I think we can all agree on that.

B
We want the OSDs to get faster, and if the OSDs can offload some work in terms of sending data from OSD A to its peer OSDs B and C, that would be great, as opposed to having the OSD actually have to read the data, put it on the network itself, and on the other side pull it off the network and process it.

B
You know, I know Sage, years ago, had talked about the whole idea of trying to use NVMe over Fabrics to get the OSDs to actually directly put the data where it needs to go on their peer OSDs, as opposed to having all this message passing of data so that the other OSD can process the data and figure out where it needs to go. So I don't know if that's still the plan, but I know historically that was, yeah.
A
Right, that's the dream, kind of. That's where we've talked more about that for the future of Crimson, basically: once we have that basic fast data path, it makes more sense to start looking into NVMe-oF and perhaps RDMA or other transports to be able to do a lot of this offload.

B
You have a user story of saying: if we're using this write-back cache, we need to make sure that the failure of the host that has the write-back cache on it, the failure of the Optane device in that host, doesn't mean the loss of data. And that's where, historically, we had talked about just this

B
fixed concept, just, you know, for the people that are, number one, opting into using this write-back cache, and, number two, the people saying that their workloads are so sensitive and so important that they need to make sure there's at least, like, n-level redundancy.

B
I think the simpler path, at least to start off with, instead of trying to start off with this whole other scheme, is just to fix it and hard-code it: say, hey, when I'm turning this feature on, send a copy to host B and send a copy to host C and send a copy to host D,
B
whatever your replication factor is. We're just punting on the problem for now. I think that's a way more achievable bite of a project than trying to build another Ceph in parallel to Ceph. And then it might just turn out that, you know, if our dreams come true and Ceph becomes a lot faster,

B
you know, there's work going on, and there are projects I know of going on, in terms of even just having the RBD client be more abstract for block, and having it use RDMA from the hypervisor host directly to an OSD host and things like that. So at the end of the day, how much do we want to invest in this,

B
if its lifespan is, in a perfect world, limited, and it's going to be a lot of code and a lot of untested corner cases? Because the odds are there aren't going to be a lot of people willing to set this up with the hardware you need to throw at it, initially, right? So I'm just trying to limit the scope, get us to the point where it's something that solves the one user-story issue of

B
"I don't want to lose my data", but try not to reinvent the wheel of backfill, recovery, and all these things about data placement. That's a harder problem to solve, and I think it's not a problem that's going to get solved, certainly not for Pacific, and it's questionable for, potentially, the future Q release, because it's going to be a lot of work, right.
H
Yeah, and I want to mention that, compared with the OSD, the replication of the write-back cache will be much simpler, because... oh yeah.

B
You end up with all the same cases of, like, hey, the image is running fine right now on node A, but your mirror host died, your replica is gone. How do I bootstrap up host C with a good copy of the log while node A is still running and that image is still actively attempting to replicate data to node B? Like, what's the failure detection path?

B
You have to wait for the monitor to detect, like 30 seconds later, that the host is dead.
H
And, so currently my suggestion is to, I mean, depend on the exclusive lock to keep the

H
Yeah, so for this case you mean, if the local cache device fails.

H
Right, yeah, you mean the failover time may be a little long, and so the I/O will be affected.

B
Yeah, right, right, until it's stably written to both hosts. But I can't write to host B, or node B, because the node is dead or the Optane device died, or what have you. So now we're getting to the point where, well, this is the equivalent of the OSD heartbeat interval or something like that: it's getting marked dead and it's reallocating PGs,
B
So
that
it
can
tell
the
library,
client,
with
the
persistent
right
back
cache
on
node
a
hey,
never
mind
about
writing
to
node
b
now
you
gotta,
you
know,
write
to
node,
see
that's
your
new
host,
but
also
also
before
you
let
ios
continue.
You
need
to
copy
the
current
state
of
the
of
the
log
on
node,
a
to
node
c
and
then
yeah.
H
H
H
I
mean
so
about
this
case.
Do
you
have
any
suggestions
I
mean
about
the
next
step?
Do
do
you
think
it
is
a
versi
that
we
do
some
tests
to
check
the
time
to
replicate
the
data.
B
Well, yeah, I mean, yes, it's definitely interesting to know the overhead of the replication; it's there somewhere, right, because there's a latency involved.

B
But it would also be interesting to know how quickly and deterministically you can detect a failure in that path, and not a transient failure.

B
You need to know deterministically that node B is dead, and I don't have a good answer for you about what you would do once you do detect it.

B
I mean, do you go down to replication factor one for the log, and then once it comes back online you catch it up? Because at a certain point in time, right, if the whole point of this persistent write log is to speed up I/Os, now that you involve two hosts in the system you're actually increasing the likelihood of a failure and decreasing

B
So yeah, I mean, I think it'd be good to know what the PMEM I/O library has for its replication, to hopefully quickly detect such a case, so that you can then make a determination to say, I'm just going to continue on without replication, and then if it ever comes back up you have to figure out how to backfill it with anything that's been missed.
A
Maybe one thing you could do in that kind of situation is go write-through instead of write-back, because then you don't have the risk of data loss from the cache going away.

B
Flush the existing cache and then continue write-through for all future I/Os; yeah, that's one way. But then it gets down to the question of how fast you can detect a failure, or what's there in the PMEM replication library to handle such a case. But yeah, I'm just trying to
H
Yeah
raj
yeah,
I'm
recording
yeah
yeah,
so
about
the
case.
Just
just
just
talked
so
mean
that
when
the
when
the
red,
when
the
graphic
of
copy
field,
we
can
flash
the
data
in
the
master
libra
bd
and
to
cite
the
cache
as
a
restroom
mode
is.
This
is.
B
H
B
Yeah,
so
if
you
have
this,
if
you
have
this
2x
replica,
you
know
factor
enabled
yeah.
This
is
an
optional
thing
to
enable.
So
let's
say
you
enable
it,
then
I
want
to
back
up
with
my
cop,
you
know
of
my
data
on
and
number
of
hosts.
B
If you get to the point where you detect that your peer host, you can't write to it anymore, what Josh is saying is the thing you could do. Normally, if you don't do anything, you're just frozen, right? You can't do anything.

B
So what you can do is basically freeze temporarily while you replay your entire log to empty it, just flush back everything that's in the log, and then for any future I/Os, including the one that caused the pause, you just do a write-through mode where you write directly. You basically just disable the cache.
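A small sketch of that fallback (an assumption-laden illustration, not librbd code): on replica failure the cache drains its dirty log to the OSDs and then services later writes in write-through mode, so nothing new depends on the unreplicated local cache.

    class WriteBackCache:
        """Toy cache that can degrade from write-back to write-through."""
        def __init__(self, write_to_osds):
            self._write_to_osds = write_to_osds
            self._log = []               # pending dirty entries: (offset, data)
            self.write_through = False

        def handle_replica_failure(self):
            # drain everything currently dirty, then stop caching new writes
            for offset, data in self._log:
                self._write_to_osds(offset, data)
            self._log.clear()
            self.write_through = True

        def write(self, offset, data):
            if self.write_through:
                self._write_to_osds(offset, data)   # no longer buffered locally
            else:
                self._log.append((offset, data))    # normal write-back path

    if __name__ == "__main__":
        cache = WriteBackCache(lambda off, d: print(f"OSD write at {off}: {d!r}"))
        cache.write(0, b"buffered while healthy")
        cache.handle_replica_failure()              # flush, then switch modes
        cache.write(4096, b"written through now")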
B
And this is again for the case of keeping it real simple: when I started my image, I told it exactly what its peer node is. Again, it's trying to constrain the problem down to a simpler thing to start off with, because if the peer node comes back, you can periodically ping it or whatever, through the PMEM I/O replication library, to say, hey, can I talk to you, can I talk to you,

B
you know, whatever health check, and then once it comes back online you can in theory ensure it gets reset back to an empty log and then proceed from there, right.
H
Yeah, yeah. So about the persistent write-back cache: I mean, we don't use PG logs, like the PG logs in the OSD, so if the master librbd, I mean, reallocates a new copy, right...

B
It's not just metadata about what I/Os it's seen, is it; you actually have the full I/O picture and, you know, what's going on.

H
So could you please give us any suggestions about the next steps for the replicated write-back cache? I mean, yeah, I need to consider what the corner cases are and how to handle them, and after that, is it best for me to show such a detailed design in the CDM, or...
H
Yeah, you mean I can list the corner cases, and then we can find out how to handle them, and then we can discuss the cases one by one in the CDM meeting, right? Yeah.

B
CDM, or just bring it back through, like, start a new email thread on the dev@ceph.io mailing list, right? Like: hey, I went back, I investigated points A, B, and C about what we talked about,

B
instead of having to wait for another Developer Monthly to come around again.
B
But I think we should just aim small to start: a minimal viable product of what we need to add to solve this initial user-story corner case of,

B
if I'm using this, I don't lose my data; like, I want to have at least one copy of it somewhere, or something like that. And then once that's in, we can see where to go, because it's not like it would be throwing away work, right?

A
Yeah, I think it would definitely be good to get minimal steps going first and get those merged before considering the large-scale management aspects that we were talking about before.

A
I just said it, and I agree with you: it's best to start with smaller steps and get those reviewed and merged before trying to address the management aspect, since that's much more complex.
B
Fixed configuration pairs; like, you don't need the monitor, or any number of these other processes, to manage assignments of replica daemons, you don't need a replica daemon, and you don't need control paths for distributing all that data via the Ceph manager. So I think it punts a lot of the work when you just say: this process can live here.

B
So we don't have to worry about a daemon to do the write-back, because once QEMU restarts on that host it'll just do it. In theory you could add an rbd CLI command to effectively do it as well, right; there's already, like, the cache invalidate one, and you could do a cache flush one to basically force the flush of the cache on a given host. But I'm just trying to simplify the problem down to: what's the minimum amount of work that you need to bite off and work on to just add support for making a replica of the Optane log?
B
Some way to say that this workload can run on node A or can run on node B, because that's where I've configured the replicated write log to replicate between. Then, when OpenStack restarts the workload after the crash, well, after the node dies, it starts it on node B and the cache is right there, because it's already been replicated. So it basically starts up and says: oh, here's my cache, it's dirty, I've got a lot of entries, I'm going to have to flush them back, and I'm going to keep working on that.
B
I mean, imagine, at least also, maybe, just because I don't want the, I mean, the orchestrator is not going to be starting up QEMU.

B
Maybe the interesting thing would be if you could have, like, a mon config key or something like that to basically define the pairs, or whatever, once, so that it can look that information up in a single place. I think that would be cool, as opposed to having to inject it at image creation time or something like that, because in theory it can move around.

B
You know, A, B, and C, or whatever, and that defines my replication set; just some way to define it and get the data from within librbd, so we don't need yet another API or whatever to inject it in, because OpenStack's not going to change, realistically speaking, or you know...
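For example, something along these lines could work (a sketch only: the config-key name below is invented, and this shells out to the ceph CLI rather than reading the value from inside librbd): the pair assignments live in one mon-managed key that any client on any host can look up.

    import json
    import socket
    import subprocess

    # Hypothetical key; the JSON value might look like:
    #   {"host-a": ["host-b"], "host-b": ["host-a"]}
    REPLICA_PAIRS_KEY = "rbd/pwl_replica_pairs"

    def get_replica_peers(local_host):
        out = subprocess.run(
            ["ceph", "config-key", "get", REPLICA_PAIRS_KEY],
            capture_output=True, text=True, check=True,
        ).stdout
        pairs = json.loads(out)
        return pairs.get(local_host, [])

    if __name__ == "__main__":
        print(get_replica_peers(socket.gethostname()))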
B
Anything we can do to implicitly get that knowledge from within librbd, and not have to force-pass it in, simplifies the problem yet again, because we don't have to modify a higher-level tool to inject that knowledge back down into Ceph. I mean, yeah, you might have Ansible or something like that; as it's setting up OpenStack it labels nodes and then injects the config keys or whatever.
D
Yeah, this is just a simple thing compared to the previous discussion. We've had a lot of conversations on the dashboard side and elsewhere about ways in which we can improve collaboration from a design perspective, because some of us, especially me because I'm time-zone challenged, can't make a lot of the meetings. So what we were looking at doing, what we started doing, was to put up design pull requests, and we started putting them into doc/dev under whatever the component was. And the consensus, talking to the rest of the guys, was just to sort of bring that to, you know, this

D
month's developer session, just to ask: are there any problems with us doing that? Is there a better way to do that, or is that the right approach to take?
A
You were cutting out a little bit, but I think that sounds right.