From YouTube: 2018-JUN-06 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development.
http://tracker.ceph.com/projects/ceph/wiki/Planning
F: My assumption is that they would operate absolutely independently, like they do now. They'd have their own mon clients and everything, except that they would be in the same process. That'll be the first step, but I haven't actually thought through how the messengers are going to get combined — I think that might be the... So, refreshing my memory: the reason why we need to do this is because they're going to share the same DPDK network card, right? That's the reason why I would want to put multiple OSDs in one process.
F: I would, I would. This seems like an orthogonal concern — we can proceed with everything else with just a single OSD in the process, and at some point we'll do the work to have multiple OSDs and figure out the right way to do it. But I'm not sure whether there are any dependencies between that and the rest of the Seastar stuff, or the other way around. Is that right?
E: [partly inaudible] ...and we will eliminate the group of contexts that was mentioned, and multicast, okay. Whether it's compiled in or configured out — we can remove all the logging in the hot path, because we don't want any logging in the Seastar path.
F: So one of the first things I was thinking of, in terms of sharing stuff across cores, was the OSD map cache — all the OSDMaps. The model I had in my head was like an RCU, where you would have a hash table — a lookup table or whatever — mapping epochs to pointers to the OSDMaps. The maps themselves would be entirely constant, and then whenever you refreshed, you'd just regenerate your hash table, and they would do an RCU.
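A minimal sketch (not actual Ceph code; names and container choices are illustrative) of the cache shape described here: an epoch-to-pointer lookup table over immutable OSDMaps, with a freshly built table published atomically on refresh so readers never take a lock or touch a shared refcount.

```cpp
#include <atomic>
#include <map>
#include <memory>

struct OSDMap;  // treated as immutable once inserted into the table

using MapTable = std::map<uint64_t /*epoch*/, const OSDMap*>;

std::atomic<const MapTable*> current_table{nullptr};

// Readers on any core: no locks, no refcount traffic, just a pointer load.
const OSDMap* get_map(uint64_t epoch) {
  const MapTable* t = current_table.load(std::memory_order_acquire);
  if (!t) return nullptr;
  auto it = t->find(epoch);
  return it == t->end() ? nullptr : it->second;
}

// Writer: build a new table off to the side and swap it in. The old table
// (and any maps dropped from it) must only be freed once no reader can still
// hold a pointer — the RCU-ish reclamation part discussed later in the call.
void publish(std::unique_ptr<MapTable> next) {
  current_table.store(next.release(), std::memory_order_release);
  // previous table is intentionally not freed here; real code defers that
}
```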
G: Yes. It seems like the first step that's done so far is turning everything into tracepoints, but not making them into ideal tracepoints. The second step is probably going through each and every dout line and either moving the non-constant pieces so that we can make them tracepoints that just have fixed-size data or the output of a constant table, or changing those log lines entirely.
C: It's working well. I replaced it in BlueStore — I've converted the dout lines into tracepoints — and I've used fio with the objectstore backend to benchmark it. I've seen something like a 10% throughput enhancement compared to dout at a debug level of 10, and for the on-disk size it goes from about a hundred megabytes to 40 megabytes for the run that I did.
C: It's like 60% converted, yeah. And I'm actually using the approach that is used in QEMU for deciding which tracing backend to use, because I'm familiar with it. Basically it's a script: you tell it which backend you want, and then it will generate C code that implements that backend. A few years ago I added LTTng support for it, so I basically just took that tool and put it into Ceph for now. I'll just open up a PR after this call and update the meeting notes with it, but I guess I'm curious — or maybe we should discuss this on ceph-devel — how do we get this moving? What's next?
F: I think sending it to ceph-devel sounds like the next step. I've got one quick question before we get totally sidetracked on this: this is basically taking the existing douts and channeling them — effectively an arbitrary string — through a single, basically free-form LTTng tracepoint, right? Like, it's still just an unstructured log.
F: Okay, so I guess what I was going to say was: it seems like for these tracepoints we should basically never use stringstream. If we have to stringify something, then probably it means that our tracepoint is not a well-structured tracepoint, or is not logging the right thing.
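For illustration, a fixed-field LTTng tracepoint of the kind being argued for here looks roughly like the following. The provider and event names are made up, and the usual TRACEPOINT_PROVIDER / include-guard boilerplate for lttng-ust is omitted; the point is that every argument is a fixed-size value rather than a stringified stream.

```cpp
#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    ceph_bluestore,                 /* provider name (illustrative)  */
    write_done,                     /* event name (illustrative)     */
    TP_ARGS(uint64_t, offset,
            uint64_t, length,
            int,      result),
    TP_FIELDS(
        ctf_integer(uint64_t, offset, offset)   /* fixed-size fields only */
        ctf_integer(uint64_t, length, length)
        ctf_integer(int,      result, result)
    )
)

/* at the call site, instead of a dout stringstream:
 *   tracepoint(ceph_bluestore, write_done, off, len, r);
 */
```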
F: Yep. And if there's a way to have a tracepoint that's annotated such that you know it's a verbose, slow, debug-type tracepoint, then you can still have something that works kind of like dout, where you just spam the log with a huge structure because you're trying to figure out what the hell is going on when you're a developer. As long as that compiles away sanely — does that compile away, or is it evaluated every time and just never actually gets called?
E: It seems the OSDMaps to be used by [inaudible]... they're already used and referenced by [inaudible], and it will offer [inaudible] a message. There's get_map and add_map, and hopefully remove_map — they will be called [inaudible] — and it could update the map in the middle, and also the OSDMap caching policy.
F: Coming back to the OSDMap caching thing for a sec — yeah, so it feels to me like the question is how far to go. In the extreme high-performance case, we would want to avoid the ping-pong between CPU cores in order to look at those maps, and we would also want to avoid OSDMapRef, I think, because that's manipulating an atomic, which is actually synchronizing cache lines or whatever across cores.
F: So one way to do that would be an RCU-like thing, where for the OSDMaps you just end up with a raw pointer to a constant in-memory OSDMap, and then there's some other mechanism that makes sure the reference has gone away by the time that you retire — before you free — that memory.
F: You'd use an atomic update to the hash lookup table or something like that, and then when you want to trim something from the cache, you just have to make sure that there are no cores still holding a reference to that OSDMap. So you wait some period where you make sure that all the work on those cores has completed, such that they won't be referencing that map. I don't know exactly how that would work.
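One way to make that concrete is quiescent-state-based reclamation. Here is a rough, simplified sketch (assumed core count and types, not Ceph code): retired maps are parked with a generation tag, each core periodically reports that it has passed a point where it holds no map pointers, and a map is freed only once every core has reported a generation past its tag.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>
#include <deque>
#include <memory>

constexpr int NUM_CORES = 8;                    // assumption for this sketch

struct OSDMap;                                  // opaque here

struct RetiredMap {
  uint64_t gen;                                 // generation at retirement
  std::unique_ptr<const OSDMap> map;
};

std::atomic<uint64_t> global_gen{1};
std::array<std::atomic<uint64_t>, NUM_CORES> core_gen{};  // last quiesced gen per core
std::deque<RetiredMap> retired;                 // owned by the reclaiming core

// Called by each core at a point where it holds no OSDMap pointers.
void quiescent(int core) {
  core_gen[core].store(global_gen.load(std::memory_order_acquire),
                       std::memory_order_release);
}

// Park a map that was just removed from the lookup table.
void retire(std::unique_ptr<const OSDMap> m) {
  retired.push_back({global_gen.fetch_add(1) + 1, std::move(m)});
}

// Free anything that every core has quiesced past.
void reap() {
  uint64_t min_gen = UINT64_MAX;
  for (auto& g : core_gen)
    min_gen = std::min(min_gen, g.load(std::memory_order_acquire));
  while (!retired.empty() && retired.front().gen <= min_gen)
    retired.pop_front();                        // unique_ptr frees the map
}
```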
F: Yeah, so maybe this warrants its own thread on ceph-devel to kick around a couple of options here and see which one makes sense — because you could notify, or you could basically just pass the message to each core; each core could have its own local cached hash table. I do want to do it, but all of them sound kind of the same, I don't know. Yeah, okay, okay.
F: We would just have all of that code sitting in master in the OSD, and it would just know that those code paths wouldn't get taken until you flip them on or make a config change or something like that, so that the code is sitting side by side, and then eventually we'll just delete all the old stuff. Instead of having heavily modified code in another branch, I'm just having a parallel implementation, because I'll probably be renaming functions and reimplementing stuff anyway. So we can start.
F: Yeah, I think the good news is that there aren't that many changes actually happening in the I/O path, and in Ceph these days most of the changes are in things like peering, which won't be touched for the most part. So hopefully this won't come up too much. Yeah, I think we should probably just start with what the actual pull requests are, and we can decide: if this can go into master, then great; if it can't, then we can figure out what else to do.
L: I've actually been looking at this for several years, but I started picking it up and doing the actual work on it about six months ago. The idea is to change NFS Ganesha — that's what I'm using — to work in an active-active configuration on top of a clustered file system like CephFS. This is the prototype for that, yep.
L: So in any case, I'll go briefly when talking about RADOS — you guys know what that is. This is a presentation I put together with the idea of maybe taking it onto the conference track. But in any case: Ganesha is a userland NFS server, and that's sort of the place where I've been doing this work.
L: We've had a CephFS FSAL for a while now, and we're actually shipping it in RHCS 3.0 — just a traditional sort of cluster setup using Pacemaker. But the key is that active-active configurations on CephFS actually work pretty well. CephFS is pretty good at mediating conflicts, and so we have things like opens, locks, delegations, layouts...
L: All that kind of stuff is pretty well handled at the file system layer, and so if you just stand up a bunch of independent Ganesha servers on top of CephFS, it mostly kind of just works. The catch is that there are problems when a node crashes — if a Ganesha node crashes, or something crashes in Ceph — then we have to do some special handling, and that doesn't exist yet. Now, Ganesha already had some clustering support too.
L: The servers get together and coordinate when they have to, which is typically only during recovery — like when a node is coming up and we need to allow NFS clients to reclaim. And the other good thing here is that Ganesha is pretty amenable to containerization in this configuration: there's no real need for local storage; everything goes into RADOS or into CephFS.
L: So we may need to do some takeover kind of stuff eventually to handle migration — say we want to move a client from one NFS Ganesha server to another in the cluster so that we can, for instance, decommission that Ganesha. We would need to be able to handle that; it's not implemented yet, but it could be done. To do this, we have to step back and look at what happens in a single-node configuration right after a restart.
L: After a restart, NFS servers are sort of a blank slate — they don't have any knowledge of what the clients held before. So typically we bring up an NFS server in a particular way to allow the clients to reclaim their state. They say: hey, I had this file open in this way, I had this lock on these files, etc. We call that period the grace period.
L: During that time, clients are not allowed to establish new state; they can only do reclaims. And to handle those reclaims, there are certain scenarios where, if you have reboots along with a network partition — say a client loses contact with an NFS server, the server reboots a couple of times and has forgotten all about that client — that client could come back in on a subsequent reboot and reclaim state that it really shouldn't have. The way we handle that is...
L: ...we track which clients are allowed to reclaim, to allow servers to prevent that sort of condition. The other catch — and the important bit — is that we have to atomically replace the old client database with the new one just prior to ending the grace period. So up until the point where we end the grace period, whatever database existed prior to that is authoritative; once we lift the grace period, the new database is the authoritative one.
L: A way to think about this — and it's a simplification that'll make sense later — is to consider each reboot a particular epoch. So when an NFS server crashes, it comes back up, it goes into the grace period, we allow recovery, and then we go to normal operations; then we have another grace period, then normal operations, another grace period, normal operations, and so on.
L: So now we have to think about what happens when we have multiple servers. In that case, we have to think about what the whole point of the grace period really is: to prevent conflicting state from being acquired during reclaim. We have to worry about conflicting state being handed out on other servers as well in the clustered setup, so at that point we really have to consider the grace period to be a cluster-wide property.
L: So we need to enforce a grace period until it's no longer needed by any node — and that means that if a node crashes while the grace period is in effect, it's allowed to rejoin it. That also means lifting grace is really a two-part process: we have to indicate that this node doesn't need a grace period anymore, and then we have to indicate when we stop enforcing it.
L: ...and we only fully lift it when no one needs it any longer, so essentially it's a cluster-wide property. And feel free to stop and ask questions if I'm not clear on this — I'm breezing through the slides. In any case, the simplest way I've found to represent all this is to keep a flag per NFS server: each server in the cluster gets a flag that says whether it needs a grace period.
L: When it crashes and comes back up, it says: hey, I need a grace period, I'm going to allow my clients to reclaim. And then we also keep a flag that says whether it's enforcing the grace period. So when the NFS server has finished all of its reclaim, but another node in the cluster is still allowing reclaim — still needs a grace period — then we keep the flag set to indicate that this node is still enforcing.
L: So we have to separate those. In a single-node NFS server we would conflate these two ideas, but in a cluster we have to consider them separately. The idea is that we want very simple logic that allows the NFS servers to make decisions about grace period enforcement based on the state of all the servers in the cluster, and we want that to be decentralized — we don't really want a single node figuring this out.
L: So, in any case, the way I've done this is I'm using a single object in RADOS as a database. I usually call it "grace" within RADOS, and essentially it just tracks a very small amount of data.
L: We have two uint64 values that represent the epochs. First is C, the current epoch, which is where new records should be stored — whatever the current reboot epoch is. Then we have an R value, which is the recovery epoch; during a grace period that will be nonzero, and it indicates which epoch we are allowing recovery from.
L: So if a server comes in after it's been out of communication for a while, it can look at that R value, and if it doesn't have a database for the R value represented there, it won't allow any recovery. Also in that database we track an omap which just holds the two flags — essentially I have a byte in there, and I'm using two bits out of that byte right now, so we have space for other flags later.
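As a summary of what's being described, the grace object's contents might be pictured like this (field and flag names here are illustrative, not the exact Ganesha encoding):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Header stored in the object data: two 64-bit epochs.
struct GraceHeader {
  uint64_t cur;   // C: current epoch, where new client records are stored
  uint64_t rec;   // R: recovery epoch; nonzero while a grace period is active
};

// Per-node flag byte stored as an omap value, one key per NFS server.
enum NodeFlags : uint8_t {
  NODE_NEED_GRACE = 1 << 0,   // N: this node still needs to allow reclaim
  NODE_ENFORCING  = 1 << 1,   // E: this node is enforcing the grace period
  // remaining bits left free for future flags
};

using GraceOmap = std::map<std::string /*node id*/, uint8_t /*NodeFlags*/>;

// Lifting grace is two-part: a node clears its N bit once its clients are done
// reclaiming, and the cluster stops enforcing only once no node has N set
// (i.e. R can go back to zero).
```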
L: At the same time, we also have to allow for parallel per-server recovery databases. In a clustered setup, each server node is going to have a list of its own clients, and it needs to keep track of that list separately from all the others. The key is that when we switch epochs, we need an atomic switch on those databases as well, and that's tricky, because we can't do atomic...
L: ...you know, that level of atomicity in RADOS is only within a single object. So what I've done is embed the current epoch value in the name of the recovery database object in RADOS. They look like this: there's a "rec-" prefix, then the epoch string with the epoch value in it, and then the hostname after that. We may actually change that to something different besides the hostname in a future version, but for now that's what it uses.
L: So it's just an opaque node ID, effectively. Traditional Ganesha has had the ability to store recovery databases for singleton servers for a long time, and we use the same format in these recovery databases. When an epoch changes, we can also go through and clear out any of the ones that we know will never be used for recovery anymore — once we're well beyond the point where we would recover from that database, we can just delete it.
L: ...we don't want those clients coming back in. Sounds good? All right. So in any case, as part of that, we have a recovery backend that implements this for Ganesha now — it just got merged this past week — and there's a command-line tool that goes along with it. It lets an administrator manipulate this database, and it also allows you to do things like add nodes to the cluster.
L: For instance, if you want to grow and scale out a little bit — or be able to remove nodes too. Right now I don't have a way to migrate clients off of an existing node automatically, but that could be added; we'd need that in order to shrink the cluster. Removing a node is basically just deleting the omap key, effectively. That's what this tool allows us to do — add and remove nodes from the cluster.
L: So the way we have to think about this now is that each clustered NFS server has its own lifecycle, and this is a way to think about the states as we walk through them. The first one is the startup state. First we start from zero for the server, and essentially this node will either request a new grace period or join one.
L: ...that's already active. If there's one already active, it'll join it. Effectively, we just ensure that both the Need and Enforcing flags are set, and then we wait for all the other nodes in the cluster to set their enforcing flags. The reason we do that is that we need to ensure we don't kill off state held on the CephFS MDS by a previous instance of this cluster node prematurely.
L: So if a node crashes and comes back — I'll talk about that in a bit — we want the MDS to preserve its state, at least for a little while, until we are ready to start reclaim. We're actually adding some stuff to libcephfs for that; Zheng has some patches for it, and I've been testing them.
L: They work very well. So in any case, once all the other servers are enforcing, we load up the exports table, and then we kill off any state that was held by this particular NFS server before, because at that point we are safe to do so — we know no conflicting state can be acquired, and the MDS is clear to release that state so that we can allow clients to reclaim it.
L: At that point we transition to the recovery state. We start a grace period timer — we don't want to wait forever for these clients to reclaim; we just give them a certain amount of time, usually half a minute or a minute and a half, up to a couple of minutes. During that period the clients are allowed to reconnect and reclaim previous state.
L: ...if they're listed in that recovery DB, just like with a normal singleton NFS server. And that lasts until the grace period times out, or all known clients send a RECLAIM_COMPLETE. One of the reasons I'm focusing on NFSv4.1 for this is that prior versions of NFS did not send a RECLAIM_COMPLETE.
L: You'll get back NFS4ERR_GRACE, essentially, until... you know. And the other thing is that it's not necessarily the case that we are dead in the water during the grace period either. The only thing that's prevented is things that change state — stateful information. So you can still do reads and writes with existing file handles, or with files you already had open; you don't have to wait for those. It's just the state-changing operations.
L: At that point we can clear out the recovery database in memory, after the N flag is cleared, because we're never going to allow any more reclaim. Then we're sort of in limbo: we're not allowing any reclaim, but we're also not allowing new state to be acquired either.
L: So we're mostly just sending back errors at that point for any state-morphing operations that come in. Then once we transition to where R equals 0 — nobody needs the grace period any longer — we can transition to the next state, which is normal. This is where the server will spend the bulk of its time: no recoveries allowed, R is 0, and new state can be acquired.
L: Now we can do opens, closes, renews, locks, and all that kind of good stuff. This is really where the server spends almost all of its time. It either ends when the server shuts down — and we go to the shutdown state — or when another node starts a new grace period. Another node may crash, come back up, and say "hey, I need a grace period," and at that point we go to the re-enforcing state.
L: At this point another server in the cluster has said "hey, I need a grace period," but this particular node still knows what its clients are and doesn't need to do any reclaim itself. So what we do is just enforce the grace period without actually allowing any reclaim: we drain off any in-progress operations that would be disallowed during a grace period — that's the idea — and then we set our Enforcing flag and start enforcing it.
L: We also have to create a new recovery database, because we're changing epochs at that point — C actually changes when a grace period has been declared. We write all our records for the active clients into it; that's all done in a single transaction. At the end of that we go back into our enforcing state.
L: And then we can cycle back around through these states indefinitely. The last state is the shutdown state — this can technically be entered from any other state; we could hit the start state and still shut down. At that point we stop processing RPCs, and...
L: ...we usually will declare a new grace period, or join the existing one, and set our Need and Enforcing flags, because the presumption is that we are going to be coming back up. Then we complete the shutdown, and what we want to do is ensure that the state is left intact in the Ceph MDS. Right now Ganesha doesn't do this right — it's something I'll be working on soon — but when we are shutting down the server...
L: ...we kind of want to leak all of our state. It's kind of nasty, but it's the only way I can see to do this properly. We want to leave it intact so that nobody can acquire state that would conflict with it until we are starting back up again.
All right — at that point, now we have a demo. Let's see if I can get this to work; I may have to change... give me a second.
L: It doesn't seem to want to use my console windows. Yeah, I may not be able to do a demo over this thing. Sorry guys — in that case you'll have to take my word for it. It actually does work: you can run I/O against it, shut down another node in the cluster, see it go into the grace period, and then it comes back out again. So yeah, there's still a lot of work that needs to be done.
L: Yeah, that's the case even during a normal grace period — during the recovery phase, re-enforcing, all of that. Again, the grace period really only affects state-morphing operations. So you can't open files, but you can do I/O to files that you already held open. You can close files, technically — some servers allow this and some don't, but technically you are allowed to close files.
L: Well, you can't do that because you can't open files. Mostly it's to allow for delegations — that's part of it. Really, what it's for is that NFSv4 allows you to do share-level locking, share deny locks: you can open a file and say "I want to set a deny lock," and that will deny others from opening the file. This is a very Windows kind of thing, yeah. So with NFSv4, when they wrote the spec...
L: ...they were really trying to entice the Windows folks to come aboard — which didn't really work — but it means we have some semantics in there that allow for that sort of thing. Most servers will do share deny locking. I may actually make a proposal, in one of the coming versions, that share deny locking could be optional, because we can only [inaudible] — yeah.
L: It doesn't make it worse, no. Yeah, we still have to do that even if we just want to allow delegations, which we do. So in any case, we still have a lot of work to do on this thing. Right now, all of this relies on the MDS holding the state for us for an NFS server that's gone down. So if a Ganesha node crashes and comes back up, what we don't want...
L: ...is for the MDS to kick out all that state and start allowing other clients to acquire it. We have to ensure that it squats on top of that state for a while, until we can come in and give it an explicit "okay, you can release this now" — which we do after everybody is enforcing the grace period. That's not properly done right now; I've started looking at how we fix it, but it's not trivial.
L: Eventually we may want to redo the grace database. Right now the way I do this is with a read-modify-write operation: do a read, modify whatever the content is, and then try to do the write, and if something changed in between, we assert on that and go around again. That's not really that efficient, and we could do this with cls methods instead. In practice it probably doesn't matter a whole lot — most of the time we don't hit the database at all — so it's not really terrible, just a little bit of inefficiency.
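For illustration, the read-modify-write-with-assert pattern described here can be expressed with librados' version assertion roughly like this (the grace-blob helper is hypothetical, and the real Ganesha code differs):

```cpp
#include <rados/librados.hpp>
#include <string>

// Hypothetical helper that bumps C, sets R, updates flags, etc.
librados::bufferlist recompute_grace(const librados::bufferlist& old);

bool update_grace(librados::IoCtx& ioctx, const std::string& oid)
{
  for (int tries = 0; tries < 10; ++tries) {
    librados::bufferlist bl;
    int r = ioctx.read(oid, bl, 4096, 0);        // the grace object is tiny
    if (r < 0)
      return false;
    uint64_t ver = ioctx.get_last_version();     // object version we observed

    librados::bufferlist newbl = recompute_grace(bl);

    librados::ObjectWriteOperation op;
    op.assert_version(ver);                      // abort if someone wrote in between
    op.write_full(newbl);
    if (ioctx.operate(oid, &op) == 0)
      return true;                               // committed atomically
    // assertion failed: another node updated the object, go around again
  }
  return false;
}
```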
J: Can you hear me? Yeah, okay. So you really only need the enforcing state in NFS, or in Ganesha, if the MDSes aren't already doing it for you — which I think they are, unless they've crashed, right? Because the Ganesha server holds the capability it needs for a client to have the delegation or open or whatever.
L: Right, but in order to reacquire the state — so say a server crashes. We have a server that holds a bunch of state for clients; it crashes and comes back. In order for it to reacquire that state, we have to kill off the state that was held by the previous Ganesha instance, and that's a window of time when another server could come in, peek in there, and steal that state from us.
L: If, instead of just killing off the old session, it could just say "hey, give me back all the old state I held before," then you wouldn't have to put anybody else into the grace period, because that state would still be held by that particular Ganesha. So it's...
L: That's been started — some work was done on it before — but it's not trivial. I mean, we'd have to do locks; you could potentially be shoveling a whole lot of state over the wire to the reincarnated server. So there's a lot of work to be done there, and the way I'm planning it is that we may eventually allow that as a potential optimization, but for now what we're going to do is this approach, where we have these...
F: ...to make sure we don't block ourselves out of other options. And just to be clear: the thing we're doing here is that the new Ganesha server instance — which is sort of a fresh CephFS client — has new calls it can use to reassert, to reclaim, state that is in this limbo from the previously dead client.
L: ...works. So right now what we do is set an opaque value on a particular MDS session, and there's state tied to that session — opens, locks, whatever. Then when the server comes back, we declare: hey, we want to kill off that old session — we give it that opaque string, or blob, or whatever, and say, kill off any...
F: Okay, so the reclaim is really just killing off the old session, and then you just reopen things: if the client asserts that it had a file open, the new Ganesha instance just opens it, and it's allowed to do that because there's nothing conflicting — it's a cluster-wide grace period and nobody else is going to grab it. Does that make sense?

L: Exactly.
F: So, big picture: we have customers and users who say they need a REST API to manage the Ceph cluster. They don't want to use the CLI, they don't want to use the GUI ever, or they say something like: everything we can do in the GUI, we need to also be able to do through an API. We have a couple of different options. There's the old thing that just passes the CLI commands over a REST API, which is also present in the new restful module.
F: There's the new restful module, which implements a small set of things, but it's very minimal — minimal documentation, minimal functionality, pretty basic. And then the last option is that the dashboard itself internally has an API that it uses to talk to the manager in order to do everything that it does. The problem with using...
F: ...that is that it's an internal API, and it's always going to do stuff that a normal management REST API wouldn't want to do — like give you exactly the data you need to render some widget or something like that, which is just different — and we don't necessarily want all the versioning and documentation hurdles for something that's internal to the dashboard.
F: So I guess the questions are: one, what's your general view on all this, Lenz, and what requirements are you seeing from your customers? And are we ready to commit to having a separate REST API — and if so, who's invested in that — or are we not ready for that, or what?
N: Yeah, this is a recurring conversation. I hear you, and I can basically start by repeating what I've said before: I think we're not completely against, at some point, declaring the Ceph dashboard REST API as an official API. The thing I'm concerned about is that freezing it down at this point will significantly slow us down, since we are still not feature-complete — we're still adding more functionality to it.
N: Having said that, we actually do have that in mind when working on the backend. I just merged a pull request, for example, that gives you automated documentation of the REST API based on Swagger, which is a very popular tool for that. It replaces an initial implementation of what we called the browsable API, where you could connect your browser to the API endpoint and get a web page.
N: You will then see the documentation on a web page, so the REST API will be self-documented in a way. That also gives us a way of marking particular API parts as internal, or not stable yet — and the other way around: for things that have basically settled down and are stable...
N: ...we can declare that these can be used by external applications. But at this point the backend is still going through an evolution, and we really see the dashboard frontend itself as the primary consumer. As you said, there are of course some queries where you have to return a very specific JSON in order to populate widgets, versus other more operational tasks like creating an RBD or creating a pool or whatever.
F: Swagger sounds awesome. So you're saying they do have the ability to mark individual APIs as internal or dashboard-only or whatever, so they'd be obscured or labeled as such if somebody's browsing the official API? And the authentication mechanism is general enough that it could be... I don't really know how people normally consume these REST APIs in the first place, but it's consumable — it actually fits the set of requirements you would expect?
N: Currently it's just users managed locally in the mon config database, but a work in progress is adding — not directly, but using — OpenID Connect and single sign-on protocols that are widely established. That's going to be the next step going forward; we're kind of moving up the stack here, but that's the current state as it shipped in Mimic.
N: The people working on the dashboard will actually meet in early July, and we will discuss this topic further as well. Downstream representatives from both SUSE and Red Hat have an interest in that, so this is definitely one of the topics we're going to discuss further. But at a high level, those are our intentions and plans.

F: With that — awesome.
N: Well, at some point, of course, once we declare this official, we may have to incorporate build changes, and we need to think about things like API versioning and all the other baggage that comes along with it. But for the time being, this is still very much in flux and evolving, so we don't even bother stabilizing or versioning it at this point, because right now we develop both backend and frontend in parallel.
N: It would help to get a better feeling for what the most popular requirements and customer use cases are. A full-blown API can do so much — but maybe is it just about creating RBDs automatically? Is that the low-hanging fruit to start with? Some more business intelligence or customer insight would be helpful. Yeah, probably, I think.
F: Yeah, all right, thanks. Next up is the SMART prediction of device failures — I want to give an update on that. This is a project that's gone through several iterations now. It started as an Outreachy project with Yaarit, who's on the call, to add some built-in capabilities to monitor SMART statistics, predict device failures, and respond to them.
F: It's evolved somewhat since then. The way I'm thinking about it now is grouping it into three sub-problems. The first is metric collection, and there's a bunch of work that Yaarit did initially with smartmontools so that it will output a JSON dump, instead of the horrible thing you currently have to parse out of smartctl. And then there's also an OSD command that you can use.
F: You can do `ceph tell osd.<id> smart` and it will basically do a smartctl JSON dump on every device that the OSD is using and send that back to you. So it's an in-band, built-in way to collect SMART metrics that doesn't require deploying your own Prometheus thing or other agents or whatever.
F: There's a proof-of-concept module in the manager right now that does that — it goes and scrapes the OSDs and archives the metrics in RADOS. It doesn't actually do any prediction yet; let's call that middle part, the prediction part, a black box for now. And then the last part of it is...
F: ...the device identity: the vendor and the serial number and the other things that uniquely identify a device. There's a bunch of code in the manager that parses that and implements some commands that let you list devices — there's a new `ceph device ls` command that lists all the unique devices in the system.
F: You can query info on a device that tells you which daemons are consuming it, and you can list devices by host and by daemon, to see the dependencies between physical devices and the daemons that consume them. As part of that, there's the ability to tell the cluster to store the predicted lifetime of a device. So assuming that middle black box says this disk is going to fail in four days, you can issue a command...
F: ...that says: set the predicted failure of this device to four days from now, and the cluster will store it. So once this pull request is in, the last piece is the next part that Yaarit's looking at, which is the automation that does something about it. It'll look at all the devices and when their failures are predicted to happen, and it'll say: oh, this disk is supposed to fail in four days, I should do something about it — maybe that's raising a health warning.
F: So that's sort of the lay of the land, and the interesting bit is in that middle part, where you actually have to predict something. At the same time this whole effort of building something into Ceph was going on, I started getting aggressively contacted by a company called ProphetStor, who has a product called DiskProphet. It's an AI-driven, very smart, very accurate tool that will predict when your disk is going to fail. They have a product you deploy on-premise, and they have a SaaS service, and they're...
F: So once all these pieces are there, that middle gap could be filled by a module that queries the DiskProphet SaaS service — or the on-premise deployment or whatever — which would be pretty cool. And because it's broken into pieces, the idea is that the middle piece can also be implemented by something else. So if somebody has an open-source failure prediction model — code, a library, whatever it is — then you could...
F: ...use that in its place too. All the bits that teach the cluster how to respond to a predicted failure don't change, and probably the bits that are actually collecting the data don't change. It might also be that you don't want to use the piece that collects through the OSD — maybe Prometheus is already scraping these things into some time-series database. That's fine.
F: You could pull the data from there instead. But that last end of the pipeline — where you say, I know this device is going to fail in two weeks, and the cluster responds to it — remains the same. So that's the big picture, with a whole bunch of missing details. Any questions or comments on sort of what the...
O: So maybe this was considered and discarded and I just couldn't find it, but — if I ask an OSD for its metadata, it will already show me its current metadata, right? Is the OSD metadata the place that stores a transient, latest version of the SMART data, and then where do we store it?
F: There's the first part, where we have some tools that let you scrape the data — that's generic, and you can use them or not — and then we have the infrastructure to store the prediction and respond to it, which will be totally generic. Everything in between is sort of TBD. In the DiskProphet case...
F: ...their module is going to do all of that. It'll directly scrape the latest metrics and query their API, and on their server side they're going to be storing the history and running AI/machine-learning stuff over it. They'll just give you back a prediction, so Ceph doesn't need to do anything — it just needs to say: here's a device, here's the latest metric, when is it going to fail?
F: The goal for the community at large, though, should be that there's also a completely free and open implementation of that part of the pipeline. The thinking when we were speccing out the RADOS bit was that there should be a module that's in-tree — maybe not very high quality, but a disk failure prediction thing that exists.
F: ...what's in the tree right now is called DiskProphet — that's a commercial name, tied directly to a SaaS service, a commercial money-making service. What I suspect is going to happen is that that module is going to be completely genericized and non-vendor-specific, and the API it uses to query the SaaS service is going to be specified, so that somebody else could implement another service that implements the same API, and on the back end it'll use...
F: Anyway — but yeah, that sounds cool. I mean, the data is literally a timestamp; all I'm storing is a timestamp and the time that we recorded it. So it's not that much, and it's not something I would ever expect telemetry to send. Telemetry should be sending back anonymous aggregate data and not anything specific, so I don't think there's overlap. But yeah, the whole brave new world of GDPR.
F: I'm pretty excited about this feature in general, because — I don't have hard data on this, but your data durability risk is generally related to the probability of having a second failure, or correlated failures, in your system. If you can re-replicate data to create a new replica before the failure happens, I expect we're going to get, hopefully, an order of magnitude increase in overall reliability.
D: In general, contributing to open data is one of the purposes we have in mind, yes.
F: Yep. There was — I don't know if we socialized it very publicly, but after the Vault conference last year this whole conversation started about the idea of building a public data set of SMART failure data. Currently, if you want to get data to train these models, you have to use the Backblaze data set, or you have to sign an NDA with Google or somebody to get some big cloud provider's data set.
F: Let me find the pull request — you can go look at it. Here it is; this is the pull request they've published so far. It needs a bunch of work, because it predates the tracking of devices and the storing-the-prediction stuff, so it's doing a bunch of things that aren't quite right. But this is — yeah, how many lines is it? It's like 3,000 lines of code currently. All right.
F: Crash reports. Okay, so we talked about this — I think it was in a CDM, but I couldn't find it on the agenda of any of the past ones; maybe we were just talking about it in a stand-up. Here's the pad. The idea is that we would modify — and I just made a very specific proposal, purely as a strawman to drive discussion, so tell me if any of this is stupid.
F: The idea would be to modify the current segfault handler — which dumps recent log entries and a stack trace to the log file — to also generate a crash report directory somewhere under /var/lib. I'm thinking /var/lib/ceph/crash, and then some unique identifier for that particular crash, and it would be broken out into several files so it's easily parseable and understandable. One would be that same log dump that we put in the log — the last ten thousand lines of log or whatever it is; one would be the name of the daemon...
F: The thing is that right now, if your daemon crashes, it dumps something in the log file, systemd or whatever restarts it, and you never actually notice — there's no persistent record that that server crashed, unless you happen to go read the log file, which nobody ever does, or you have something monitoring your log file. So this makes a distinct record of that crash. And then we'd have some helper...
F: ...that would look in that directory at all the — we'll call them crash reports — and basically report them back up to the manager. It would ask the manager: I have a crash report with this UUID, have you heard about it already? If not, it would upload whatever information it has about it, and then it would mark it on the local node as one that's already been reported, and repeat that process every time. Probably that helper would just run on every daemon when it starts up.
F: The manager module would store those crash reports in the config-key store, probably — though the logs are going to be big, so maybe those have to go into a RADOS pool, or maybe they wouldn't be stored by the manager at all; I'm not really sure what makes sense. And then I think the real win would be if we take the stack trace portion of that and strip out the code offsets, so it's just the function names, which is still pretty sufficient to uniquely identify a particular crash.
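As a small illustration of that idea, stripping the per-build offsets and addresses out of symbolized backtrace frames — so the same crash signature matches across builds — could look roughly like this (not the actual implementation):

```cpp
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> normalize_backtrace(const std::vector<std::string>& frames)
{
  // Typical backtrace_symbols() frame:
  //   /usr/bin/ceph-osd(_ZN3OSD9handle_opE...+0x1f2) [0x5600c8e21a32]
  // Drop the "+0x..." offset and the "[0x...]" address, keep the function name.
  static const std::regex strip_re(R"(\+0x[0-9a-f]+|\s*\[0x[0-9a-f]+\])");
  std::vector<std::string> out;
  out.reserve(frames.size());
  for (const auto& f : frames)
    out.push_back(std::regex_replace(f, strip_re, ""));
  return out;
}
```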
J: I think my only concern is that it would be nice to capture the same crash across different builds — if you have 20 different builds out in the wild, it would show up as 20 different crashes, because those code offsets would be slightly different.

F: Yeah, that was the reason. You can do both.
L: I'm a little leery of doing that much in a signal handler — I don't know if that was your intent, but you may want to have the signal handler kick off something else to do all the work. Doing stuff like that inside the signal handler, especially in a SIGSEGV — yeah, memory corruption and stuff. So that could be bad.
O: Whether doing it in the signal handler is valid is a fair question — we could also look at analyzing the core after we've been restarted — but I think it's probably easier this way, so it shouldn't be too bad, I think.
F: I don't know that we can get away from the signal handler writing out this information without capturing the core file — that's the thing, and I don't know how to... maybe it's possible, but I don't know how you'd automate it. I guess you could scrape gdb or something; I don't know how you automate analyzing a core file. Obviously it can be done, but...
O: We are actually running gdb, which is a major pain, because it then complains a whole lot about missing things. Anyway, the other thing might be looking at what Mozilla does — maybe that's interesting, because they get crash reports as well. Maybe just for comparison and inspiration, if they have any good ideas.
F: Yep. I guess my worry is that we start introducing a lot of dependencies on how the system is configured. Some systems will have that abrt thing installed system-wide, where all the core files get piped to it, and I don't know how you override that per systemd unit, or if you can. And if we do that, and we pipe it all to our special Ceph crash collector, do we then also feed it to the... I don't know, it gets all messy. Whereas the signal handler we have right now is already doing almost exactly the same thing — it's touching all the same data structures in memory — so this isn't making it any more fragile than it already is, and we haven't had problems yet. The worst case is that you crash trying to generate your crash report, which means you just don't generate a crash report. So it's no loss.
L: While you're writing out files from within a signal handler where you're dealing with potential memory corruption, who knows what you're going to scribble over on the file system. That would be my real concern there, I think — and you may already have that problem. So I don't know — is that exploitable, I guess?
O: Yeah, I've done something like that in a previous life with an HA stack. If you change root in the signal handler and set things up beforehand — or you open the files and everything ahead of time — it's not that bad. And especially if it's containerized or otherwise constrained, you just can't scribble over everything. So it hasn't usually been a concern for us in the past, at least. Okay.
F: A chroot is probably a good widget to help protect ourselves. I think the net-new here is that the current handler writes to a file descriptor that's already open, whereas this will create a directory and write several new files into it — but the actual data being written is roughly the same.
F: It kind of makes me think that it's not actually adding much value, so we might not bother doing it. We could leave the log in the local crash directory — and those get deleted after you have, say, 10 of them, or they're more than 10 days old, some retention policy — but then only store the stack traces in the manager, and only phone home the stack traces.
F: Assuming we phone all that information home, which I think would be nice. With the telemetry module it's a unique ID that's used only for the purposes of telemetry — it's not actually the cluster ID, but it is unique to the cluster, and it identifies all the telemetry from one cluster. And the telemetry user can choose whether or not they identify themselves; they can optionally say "I own this cluster," so we can go contact them or not.
F: If we wanted to get really fancy, we could have a feedback mechanism whereby telemetry could operate in a fully anonymous mode, but if a developer sees an interesting crash and wants more information, they could set a flag on that ID requesting more information; the next time telemetry dials in it'll pass that message back, and it will show up for the user.
H: So that might be something else to consider, if we don't want to do runtime analysis of that — and I haven't played around with it personally — another option for generating core dumps: apparently ptrace recently added an option to dump core for what you're tracing. So one possibility is to have a wrapper process that runs the Ceph daemon traced, but doesn't intercept...
H: ...the system calls; it's just waiting for an exit event or, in particular, a stop created by a fatal signal, and then does a core dump on that. That ptrace command lets you specify a file you want the core dump written to, so we'd still get our core dumps in the right format, and there shouldn't be much or any overhead, because we're not intercepting any real events from the process except when it stops. So that's another possibility for doing it, and it should be pretty simple.
H: Right, yeah. I think you could even run GDB with a certain set of commands to generate the things you want. And then there's also that issue I had written up, where we can have the OSD tell the kernel which memory segments not to dump, for security purposes and also just to keep the core dumps a reasonable size. We could use that in tandem with this, to limit what we dump and make it safer.
F: I'm not sure we know where cached user data is going to end up in memory very easily, but that would make sense if we were going to capture cores. This actually isn't proposing that we capture core files at all — it's just capturing the stack traces. It would just tell us what versions of Ceph are crashing, and where, and that's it.
F: CephFS geo-replication. I've been thinking a lot about this general topic, because I think it's one of the main missing features that is going to be important in the multi-cloud, federated world that's emerging — Kubernetes and deploying applications across multiple clouds and private clouds, and making it all behave. So I tried to lay out what the motivating use cases are, what some other implementations of this look like, and figure out what we should do. I should mention...
F: ...he's been doing CephFS stuff, so he's interested in working on this. So I wanted to lay out what the motivating use cases were and see what I'm missing, then look at other implementations, and then some possible paths we can take. Use cases: one is simple disaster recovery, where you have multiple clusters; if one of them blows up or the data center goes away, you want a usable backup copy at another site.
F: Part of that would probably be failing back — so when that data center comes back, you can resynchronize and continue without re-copying the whole data set. So disaster recovery is one. Active-passive replication would be another — I'm not actually sure how common this would be, but basically you would have multiple clusters...
F: Well, if you tell people that you have a new distributed file system, they start asking whether you can have a globally distributed file system with replicas across the world, maybe have data follow the sun and always be cached locally — all this crazy stuff that's basically impossible to do in a way that's fully consistent and performant. So this is sort of reaching towards that, I guess, and obviously consistency and conflicts are the big issue there. So — am I missing anything?
J: I mean, you're talking about bi-directional active-active, and — wow. This is complex and hard to explain to users, I'd just note. And usually when you have active-active, what you've actually got is something like AFS, I think, where the different sites take over a tree and maintain authority for that tree, and then if someone at a different site wants to access one of those files, they do the remote access and just pay the latency cost on it.
J: ...where you delegate it down through the servers to someone's desktop. I'm not sure that's a thing we're ever going to be good at, but if it is, that's the design — I mean, maybe we could be, since we already have a capability system — but I think that would be the way to do it if we wanted to, not trying to do some weird conflict resolution system. Yeah, no one's ever made those work — they try, and it's just as bad.
O: So our primary use case, the one we see clearly, would be failover — active-passive replication based on snapshots. As long as the target is in a consistent state, we can make that work. If we can optimize it so rsync actually finds changed files fast, that goes a long way towards meeting that goal, and then people could, if they wanted to go down that route, do some manual active-active load balancing where they use different directory trees.
F: I think in reality, for most users, conflicts don't happen. It's not that you have two people editing the same file in two geographic regions — maybe that happens once in a blue moon, but 99.9% of the time that isn't actually the case; it's that you have different applications operating on two completely different parts of the tree. So what I'm wondering is whether there's a more optimistic replication, with some basic conflict resolution, that is going to work.
D
K
F
F
F
K
F
F
So what I wanted to do is catalog a couple of the solutions that are out there right now, just as a point of comparison, because it feels like we haven't thought about this or seriously considered it. Solving the entire problem perfectly is hard, but it turns out that you can solve it imperfectly with easy solutions, and that actually will capture a lot of use cases, and there's nothing preventing us from doing it. So one point of comparison is the GlusterFS geo-replication.
F
This is basically, on the backend, rsync, but it's driven by a changelog. So they're generating a list of files to examine, and the rsync background thing just sits there and rsyncs those specific files. So it's loosely consistent, and it's only useful for DR because the copy on the other side isn't a consistent point in time, and it's one-directional, although I guess you can chain them too.
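A minimal sketch of that changelog-driven pattern, assuming a plain-text changelog of relative paths and an rsync-over-ssh target; the paths and hostnames are placeholders, and the real Gluster implementation differs in detail:

```python
import subprocess

SRC_ROOT = "/mnt/volume/"            # local volume root (placeholder)
DEST = "replica-site:/mnt/volume/"   # rsync-over-ssh target (placeholder)

def replay_changelog(changelog_path: str) -> None:
    """Hand the paths recorded in a changelog straight to rsync, so only the
    touched files get examined instead of crawling the whole tree."""
    with open(changelog_path) as f:
        changed = sorted({line.strip() for line in f if line.strip()})
    if not changed:
        return
    # --files-from reads the list of paths (relative to SRC_ROOT) from stdin.
    subprocess.run(
        ["rsync", "-a", "--files-from=-", SRC_ROOT, DEST],
        input="\n".join(changed).encode(),
        check=True,
    )
```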
F
You can chain them, like an A-to-B and a B-to-C, but yeah, so it's relatively simple to implement, assuming you have a changelog. One that's widely deployed commercially is the NetApp SnapMirror thing, which takes a snapshot every few minutes. Somebody recently mentioned that the shortest interval you can configure is five minutes, because the design doesn't really scale down to smaller timeframes, but you just periodically take a snapshot and then sync that snapshot out to the other side.
F
It basically works as long as your workload is behaving in a correct way, but if you are writing the same files in multiple sites and you get split brain, things go bad. And then the one other thing I wanted to mention is that I think the biggest example of distributed write replication is actually Dropbox, which isn't really at the file system level; it's at the file level, so you have totally disconnected operation.
F
When you reconnect, it just checks in a new version of the file, and the server side keeps multiple revisions of files as part of its existing archiving thing, so if there was a conflict you can just roll back to the other one, or the conflicting one is probably flagged in the history or something like that.
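A toy model of that behavior, not Dropbox's actual protocol: every check-in is kept as a revision, the newest one wins for normal reads, and a check-in that wasn't based on the current head is reported as a conflict instead of being lost.

```python
import time
from collections import defaultdict
from typing import Optional

# path -> list of (timestamp, origin, payload); every check-in is kept, and a
# normal read just sees the newest revision.
history: dict = defaultdict(list)

def check_in(path: str, origin: str, payload: bytes,
             based_on: Optional[int]) -> bool:
    """Check in a new revision.  If the client wasn't based on the current
    head, keep its data anyway, let the newest check-in win, and report the
    conflict so it can be flagged in the file's history."""
    revisions = history[path]
    conflict = based_on is not None and based_on != len(revisions) - 1
    revisions.append((time.time(), origin, payload))
    return conflict
```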
F
But that, to me, is sort of a proof point that conflicts are rare, although not that rare, and as long as you have a way to identify them and resolve them, the user experience for that is okay. It's not that you have to magically resolve a conflict between two programs modifying a database; it's that you have to resolve conflicts between humans editing a Word doc.
F
F
So, right, so there's the rsync piece, which is what (I forget the fellow's name) is working on now, I believe, which basically just makes rsync scan CephFS hierarchies more efficiently. I think there's actually a key piece of work here that we have to do, because the rstats don't propagate synchronously, and so I think, in order for this to actually be reliable, we have to have a way to force a flush of rstats up to the particular point of the tree where you're starting your sync.
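For context, the rstats mentioned here are exposed as virtual xattrs on CephFS directories (for example ceph.dir.rctime, the recursive change time). A minimal sketch of how a scanner might use them to prune unchanged subtrees, with the caveat from the discussion that rstats propagate lazily; the mount path is a placeholder:

```python
import os

def subtree_changed_since(path: str, last_sync: float) -> bool:
    """Read the CephFS recursive change time (an rstat exposed as a virtual
    xattr) and compare it to the last sync time.  Caveat from the discussion:
    rstats propagate lazily, so without a forced flush this can miss very
    recent changes."""
    raw = os.getxattr(path, "ceph.dir.rctime")   # e.g. b"1528300000.012345678"
    seconds = int(raw.split(b".")[0])
    return seconds > last_sync

def dirs_worth_syncing(root: str, last_sync: float):
    """Yield only the top-level subtrees whose rctime shows activity."""
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False) and \
                subtree_changed_since(entry.path, last_sync):
            yield entry.path
```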
H
J
J
F
Okay, because I think if we did that one piece in the MDS, then we could do the snapshot-driven, SnapMirror-type thing, where we just have, it could be like a bash script, something that just takes a snapshot, rsyncs it across, and then takes a snapshot at the remote site. I think we could have that solution, which is roughly equivalent to what NetApp has, and it should be pretty good with almost no deeper surgery or plumbing.
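A minimal sketch of that loop (the discussion suggests a bash script; the same steps are shown in Python here for consistency with the other sketches). CephFS snapshots are taken by creating a directory under the magic .snap directory; the mount points and remote host are placeholders:

```python
import os
import subprocess
import time

SRC = "/mnt/cephfs/data"        # local CephFS mount (placeholder)
DST_HOST = "backup-site"        # remote host (placeholder)
DST = "/mnt/cephfs/data"        # CephFS mount on the remote cluster (placeholder)

def sync_once(tag: str) -> None:
    # 1. CephFS snapshot: created by making a directory under .snap.
    snap = os.path.join(SRC, ".snap", tag)
    os.mkdir(snap)
    # 2. rsync the frozen snapshot contents to the remote site.
    subprocess.run(
        ["rsync", "-a", "--delete", snap + "/", f"{DST_HOST}:{DST}/"],
        check=True,
    )
    # 3. Matching snapshot on the remote side, so the replica also has a
    #    stable point to roll back to.
    subprocess.run(
        ["ssh", DST_HOST, "mkdir", os.path.join(DST, ".snap", tag)],
        check=True,
    )

if __name__ == "__main__":
    sync_once("sync-" + time.strftime("%Y%m%d-%H%M%S"))
```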
F
F
F
A
F
So one idea would be that you just allow writes in multiple locations, and you asynchronously replicate those writes to the other side, and then you would have some sort of automated conflict resolution, so something like last-writer-wins for files and first-rename-wins when renames conflict, something like that. In order to actually implement this, we'd have to have a changelog, I think, to stream those updates across.
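A toy sketch of the last-writer-wins rule for file contents, just to make the idea concrete; the update record and the tie-breaking by origin are assumptions, not a worked-out design:

```python
from dataclasses import dataclass

@dataclass
class Update:
    path: str
    mtime: float    # writer's timestamp
    origin: str     # site that produced the write
    data: bytes

# path -> update currently considered the winner
state = {}

def apply_replicated(update: Update) -> bool:
    """Last-writer-wins: a replicated write only replaces the current content
    if its timestamp is newer; ties are broken by origin so both sites
    converge on the same answer.  Losing writes would be surfaced as old
    revisions rather than silently dropped."""
    current = state.get(update.path)
    if current is None or (update.mtime, update.origin) > (current.mtime, current.origin):
        state[update.path] = update
        return True
    return False
```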
F
J
Yeah, I just, I get where the Dropbox thing appeals to you, but I think you need to keep in mind that they have a lot of stuff that we don't. Because it's a user-facing, sort of social thing, they keep basically every version of a file that ever gets created, and they keep them in an independent place, and the tool sort of transparently resolves conflicts by just having the most recent one win. But the important part is not that.
J
J
O
F
Yeah, I think that, in order for it to be feasible, we'd have to choose something so that if you didn't touch it, it wouldn't be broken, right; it's making it able to safely roll back. So I think Greg's right: having those full revisions is like the key thing that would make it work, yeah.
J
P
O
O
We do have that, you know: if you detect a conflict, create a snapshot.
F
I think the trick is that they would be divergent, right? So you'd have version A on both sides the same, and then you'd have, like, B and C that are different, and then you try to sync. Ideally you would then have both versions on both sides, and they would also have the same winner, but the ordering is wrong, because one side would have B then C and the other C then B.
F
It's weird. But if we did come up with a way to keep sort of a third dimension, a dimension where you have revisions of a file, sort of at the CephFS level, I think that... So there are two reasons why I like it. One, it would enable this bi-directional sync with versions, so that the user has an option to go revert, and we can invent some .version directory or whatever to manipulate those.
J
Yeah, and they're already running Nextcloud, right? So that's the app that runs and does the conflict detection and resolution for the user, and what they put in us is, like, the resolved ones or whatever, and yet they can still expose our stuff, like the snapshots, to the users as different versions. But yeah.
F
But it means that you only get those features through Nextcloud, whereas if you push them into the file system, then you can also leverage that same capability to do active-active across geographically distributed sites, which is cool, and also you can have, like, a regular POSIX file system that your legacy stuff is accessing, and also access those same data sets via Nextcloud and have all the same...
J
J
I mean, when we've talked about things adjacent to this in the past, it's been like, oh, we could have a log-structured file mode, or we could have snapshots be different objects instead of RADOS snapshots, and those all make building in these sorts of features a lot easier and more practical. But I don't really see it with this existing file system layout, just because, you know...
I
J
O
Just one question regarding the snapshots, and maybe just inserting that I like having it and surfacing it. One comment that we hear, and that some users aren't aware of, is: well, we can do snapshots with RBD, with RGW and with CephFS, right, which would be cool, but they then end up running workloads that access those at different points. So at some point we need to take a consistent snapshot of, you know, all the layers: not just the Manila file system, but the Manila file system at the same point as the workload's VMs.
J
O
Wait, so say we do the snapshots of the VMs, backed by RBD. They have snapshots, they can be mirrored, they can be replicated, and that's atomic; essentially they are consistent. But then the files that they're accessing on the file system come along and are consistent at a different point in time, yeah, something like that. So I...
J
J
You need the VMs to actually flush, like quiesce I/O and flush everything, and once you have the VM frozen, you know, and quiescing everything, you can just have a little utility that goes and takes the separate snapshots, while calling sync inside all of those mechanisms, right: take two snapshots instead of one, but they would be with flushed I/O.
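A minimal sketch of that choreography, assuming the guest's data sits on an RBD image plus a CephFS directory and that you can reach the guest over ssh to run fsfreeze; hostnames, image names and paths are placeholders, and in practice the qemu guest agent would usually drive the freeze instead:

```python
import os
import subprocess

def consistent_snapshot(vm_host: str, guest_mount: str,
                        rbd_image: str, cephfs_dir: str, tag: str) -> None:
    """Freeze the guest, then snapshot both layers at (roughly) the same point."""
    # Quiesce: flush and freeze the guest file system so the RBD image is clean.
    subprocess.run(["ssh", vm_host, "fsfreeze", "-f", guest_mount], check=True)
    try:
        # Snapshot of the VM's disk image.
        subprocess.run(["rbd", "snap", "create", f"{rbd_image}@{tag}"], check=True)
        # Snapshot of the CephFS directory the workload is also using.
        os.mkdir(os.path.join(cephfs_dir, ".snap", tag))
    finally:
        # Always thaw the guest, even if a snapshot step failed.
        subprocess.run(["ssh", vm_host, "fsfreeze", "-u", guest_mount], check=True)
```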
O
F
O
F
F
J
O
J
F
F
Okay, I think the really high-level message here is: can and should we be changing our perspective and our thinking around RGW, from an S3-compatible gateway for a RADOS cluster that happens to have multi-site capabilities, to instead being a gateway to a whole topology of storage, whether that's Ceph clusters or public cloud storage or whatever, where the gateway knows how to access your data and gives you an S3 interface, but behind it, maybe it's storing your objects in Ceph, maybe it's replicating them to multiple sites.
F
Maybe your objects are all over the place and it knows how to find them, maybe they're being encrypted and then stored in the public cloud, maybe it's a combination, maybe it's stored in one cloud and in the process of being migrated to another. But if we think about RGW as sort of a gateway that's just mediating...
F
K
There are two main paths that we'll need to take. One is finishing the whole cloud sync feature: currently what we have in Mimic is sync to cloud, and we want to have sync-from-cloud capability also, so that will give us the ability to, you know, look at a cloud as an entity that we can fully sync to and from, and, you know, that provides the whole data mobility functionality. And the second thing is...
K
Which is not necessarily related to that, but the second thing is the ability to have RGW tiers on the cloud, which means that objects could be either on RADOS or on, you know, the cloud provider in the backend. And these are two separate issues: whether you have a whole zone that is backed by the cloud, or a zone that is kind of a hybrid to operate.
K
Where data can stay in one, you know, one policy says it's gonna be in RADOS, another policy says, I mean, you know, it's in that cloud provider, and a third policy is gonna say something else. And one more thing: currently we look at the data at the zone level. We say everything in this zone is gonna replicate to another zone, all within the same zone group, very DR.
K
K
No, there is tiering work that we'd be happy for someone to take on. One more thing to note is, when looking at the gateway, the gateway for external data, there are multiple options on how to do it. One is, you know, the gateway can be a proxy for the external data, and another option is the gateway just redirects: it says, you know, this object is residing at this endpoint.
K
That is, to a URI; maybe it's gonna be a pre-authorized, pre-signed link, so that with the authenticated URL that we redirect to, the users will be able to, you know, get it directly from where it is at that point. So these are the main two options, and then there's a question: do we do it only for read operations, or do we also do it for puts? But then, you know, it really depends on where we are exactly in the stack.
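A small sketch of the redirect option using a pre-signed URL: boto3 is pointed at whatever endpoint actually holds the object, and the endpoint, credentials, bucket and key below are placeholders.

```python
import boto3

# Client pointed at wherever the object actually lives (cloud provider or
# another RGW endpoint); all values here are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example-cloud.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def redirect_location(bucket: str, key: str, ttl_seconds: int = 300) -> str:
    """Return a time-limited pre-signed URL that the gateway could hand back
    in an HTTP redirect instead of proxying the object body itself."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,
    )

# The gateway would then answer the read with something like:
#   307 Temporary Redirect
#   Location: redirect_location("photos", "cat.jpg")
```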
F
Yeah, okay, so as far as next steps, possible next steps: there's the tiering piece, where an individual object would be a redirect, and it's actually stored in this external object provider, cloud or whatever, which, I guess, optionally would have encryption, which would be cool, so nobody can peek on the other end. Then there's the zone- or bucket-granularity replication. I guess, did you mention that? That's one thing sort of in a different direction.
K
No, the one thing about sync to cloud and sync from cloud: the way things work right now, when a zone syncs, it'll always lose data. So if you want to have sync to cloud and sync from cloud right now, the way it works, you need to have two different, completely separate zones, right: the data that keeps syncing from the cloud is in one...
K
F
It feels like there's a set of... I didn't get a chance to read this whole document here, but I think you probably covered a lot of it, yeah: identifying what the specific capabilities are, like, at the bucket layer, a mapping that says this entire bucket is stored over here, or at the per-object level, saying this object is stored over here, yeah, yeah. And then we can talk about how we do implicit capabilities, like migrating a bucket from cloud A to cloud B in a seamless way, so that access is uninterrupted, based on those pieces.
F
F
K
When a bucket is on the cloud, or, you know, at the remote endpoint, it's not necessarily a one-to-one mapping, because, you know, the provider has its own limitations, so we might have multiple RGW buckets residing in a single cloud provider bucket, and there is some name mutation happening and a key mutation happening; that's how it works now with the Mimic cloud sync. So there needs to be, you know, when you do a sync from cloud, you'll need to undo that.
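Purely to illustrate the idea (this is not the actual RGW cloud-sync mapping scheme): several RGW buckets can share one provider-side bucket by folding the source bucket name into the object key, and syncing back from the cloud means inverting that mutation.

```python
TARGET_BUCKET = "rgwx-archive"   # single provider-side bucket (placeholder)

def to_cloud(rgw_bucket: str, key: str) -> tuple:
    """Map an (rgw_bucket, key) pair onto the shared provider bucket by
    folding the source bucket name into the object key."""
    return TARGET_BUCKET, f"{rgw_bucket}/{key}"

def from_cloud(cloud_key: str) -> tuple:
    """Invert the mutation when syncing back from the cloud."""
    rgw_bucket, _, key = cloud_key.partition("/")
    return rgw_bucket, key
```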
P
K
P
K
The way the cloud sync works now, when you configure the sync, you can set up different profiles, and you can say this bucket will use this profile, and another bucket, or buckets whose names start with some prefix, will have a different profile, so it's pretty flexible. And then there is, there was, a way to also define how you map users in RGW to users in the backing cloud.
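A sketch of that per-bucket / per-prefix profile matching; the profile fields here are illustrative and not the actual RGW tier-config schema.

```python
# Profiles are illustrative only.
PROFILES = [
    {"name": "logs",    "match_bucket": "analytics-logs", "target": "cloud-a"},
    {"name": "backups", "match_prefix": "backup-",        "target": "cloud-b"},
    {"name": "default",                                   "target": "cloud-a"},
]

def profile_for(bucket: str) -> dict:
    """Pick the most specific profile: exact bucket match first, then a
    bucket-name prefix match, then the catch-all default."""
    for profile in PROFILES:
        if profile.get("match_bucket") == bucket:
            return profile
    for profile in PROFILES:
        if "match_prefix" in profile and bucket.startswith(profile["match_prefix"]):
            return profile
    return PROFILES[-1]
```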
K
Also, moreover, you don't even need to run against a single cloud: in theory you can have some buckets go to S3 and some buckets go to Azure, not that we support either at this point, or maybe two different AWS regions. So it's pretty flexible; you can mix and match.
I
K
Really? Not really; currently there's no need for it. You have a logical unit that defines how you take the zone and push it to the cloud, but within that logical unit, you know, some of the buckets can go to one provider and others can go to another provider, in theory. Because it's only in theory: currently only S3 is supported.
F
F
K
F
August, yeah, okay, all right; I imagine it will come up then, if not sooner. I guess my general thinking is that the more we can sort of map out the roadmap for these different features, the more we can get other folks interested, like our cloud friends who have been writing so much code recently, or whoever else, because it'd be nice to get the tiering stuff rolling, yeah.