From YouTube: 2017-SEP-06 :: Ceph Developer Monthly
Description
Monthly developer meeting for the coordination of Ceph project development.
http://tracker.ceph.com/projects/ceph/wiki/Planning
A
B
C
Okay, in last month's CDM we talked about RADOS-level replication, and Sage suggested that the main problem of implementing this replication is that we have to maintain a strict causal ordering across the clusters. We did some considerations on the details of maintaining that cross-cluster ordering, and this is the first issue I'm going to talk about. I don't know — can you hear me clearly?
A
C
Okay, okay — thanks.
Secondly, we talked about how, after maintaining the causal ordering, we have to guarantee the correctness of the replication when there are OSD failures; and we did some further considerations, which is the third issue. Let me start with the first. Sage suggested last month that, for maintaining the causal ordering, we can split the whole system runtime into a series of time slices, and all client operations during the same time slice are sent in the same transaction.
C
C
So ordering would not be violated across slice boundaries — if I've understood him correctly, that should be the suggestion he made last month. We did some more considerations on the details, and we think that to implement such a mechanism we are faced with two major problems: the first is how to make all clients pause at exactly the same time at the slice boundaries, and the second is how to make all client operations in the same slice either all replicated or none of them replicated. For the first —
C
For the first problem, we think we can do it this way. First, at every time slice boundary, the monitor can send a timestamp to all clients — say TS_bound — and then all clients set their pause timers according to the following rules. First, before they set the pause timers —
C
They have to make sure that their local system clocks are synchronized with the same NTP server as the monitors and that the clock skew is small, and the pause timer expiration time should be TS_bound + T_slice - T_local. This means that this should be the point in time at which they are going to pause in the future.
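A minimal sketch of that timer rule, assuming the names TS_bound, T_slice, and T_local from the discussion (the helper itself is hypothetical and only illustrates the arithmetic):

```python
import time

def pause_timer_delay(ts_bound: float, t_slice: float) -> float:
    """Seconds to wait before pausing, per the rule TS_bound + T_slice - T_local.

    ts_bound: slice-boundary timestamp announced by the monitor
    t_slice:  length of one time slice
    Assumes the local clock is NTP-synchronized with the monitors.
    """
    t_local = time.time()          # client's current local clock
    return ts_bound + t_slice - t_local
```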
C
Then, when the pause timer expires, and if every client's local system clock still satisfies that condition, the clients should pause for a period of time — say T_pause — and we think that T_pause has to be larger than the time synchronization error bound, just like this picture shows.
C
If we can make T_pause larger than the time synchronization error bound, then even if the clients do not pause at exactly the same time, their pauses still overlap across the time sync error bound, and that should be enough to maintain the ordering.
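As a rough check of that argument (a sketch only; the start times here stand for each client's actual pause start, which may differ by up to the NTP error bound):

```python
def pauses_overlap(pause_starts: list[float], t_pause: float) -> bool:
    """All clients' pause windows [start, start + t_pause] share a common
    instant exactly when t_pause exceeds the spread of their start times."""
    return max(pause_starts) - min(pause_starts) < t_pause
```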
C
C
On the other hand, the OSDs in the master cluster should periodically report to the monitor their latest replicated client op — each client operation is associated with a timestamp. If all the clients have finished the pause that they set according to the TS_bound, and all the OSDs have started replicating client operations whose timestamps are later than the TS_bound, then the slice boundary is confirmed.
C
In the meantime, OSDs in the master cluster keep replicating client operations to the backup cluster. Despite the time slice constraints they just keep replicating; and on the backup side, whenever they receive the confirmed time slice boundaries from the monitors in the master cluster they can apply the operations — before they receive this, they just cache the operations for that slice.
C
Yeah, it's just like this diagram. So, putting it together, we think maybe we can implement the whole replication mechanism just as this picture shows. First, the monitor sends a TS_bound to all the clients. Then, after all the clients have paused, they acknowledge the TS_bound back to the monitor. In the meantime, all OSDs report to the monitor their latest —
C
— right, their latest replicated operations to the backup store, and at this point the replication of the operations in this particular time slice should be finished. But we think, if this is right —
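Putting the steps just described together, a minimal sketch of the boundary-confirmation check (purely illustrative — this is my reading of the flow in the diagram, not Ceph code):

```python
def boundary_confirmed(client_acked: dict, osd_latest_ts: dict, ts_bound: float) -> bool:
    """The monitor confirms a slice boundary once every client has acknowledged
    its pause and every master-cluster OSD reports that it has replicated
    operations with timestamps past ts_bound."""
    return all(client_acked.values()) and all(
        ts > ts_bound for ts in osd_latest_ts.values())
```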
C
C
Say our clients set their timer to expire ten seconds later; but if this ten-second timer can be influenced by the time synchronization service, then they may have only waited for nine seconds when the timer expired, while they thought they had waited for ten seconds, and that will make the replication go wrong. And the second thing is: if one client fails to pause correctly, then this time slice should be merged into the next one.
E
B
E
C
C
On the last issue, we considered this: we think that making all clients pause periodically is not necessary, because maybe only a few of them need the point-in-time consistency. So maybe we can give the clients an option to identify themselves as to whether they need point-in-time consistency, and then only those who need this consistency have to pause. Of course, I don't know whether I'm describing the same problem that you are talking about.
B
E
I think — so, if I understand — I missed the very beginning, but if I understood correctly, this sounds almost exactly like what the clinic team looked at a couple of years ago, and the proposal there was to do the same delay, where basically everybody has that T_pause, I guess, as you're calling it, so that you have that bound or whatever — I don't know what words you're using. But the difference is they were looking at just doing it on the OSDs.
E
So basically all the OSDs would have a delay based on what the NTP error bounds are, so that it would prevent any causal ordering between operations; and then they did some simulations with different network latencies and NTP-like protocols to see how big that delay would have to be. They wrote a report; I'm happy to email it to you.
E
It'll probably explain it in more detail and more clearly than I can on the spot, but I think the biggest difference is just that they're doing it on the OSDs. And I think one of the other kind of nice things about that approach is that the OSDs don't actually have to stop doing work: they can still process the requests, but if they just delay the replies back to the client, then they prevent any causal dependency from some other operation.
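A minimal sketch of that reply-delay idea (names and the `send` callback are illustrative, not from the report mentioned above):

```python
import time

def send_reply_with_causality_delay(reply, completion_time, ntp_error_bound, send):
    """Finish the op immediately, but hold the client reply until the NTP error
    bound has elapsed, so no later operation can causally depend on this one
    inside the clock-uncertainty window."""
    remaining = (completion_time + ntp_error_bound) - time.time()
    if remaining > 0:
        time.sleep(remaining)
    send(reply)
```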
C
E
C
E
C
C
C
So, if the original op journal gets removed only after the corresponding original op is replicated — that's the first condition — then the second condition is this: in the recovery and backfill phase, the recovery source should replicate all journal entries related to the recovering object before pushing the object. If these two conditions are guaranteed, then we think that the replication correctness of a single object should be guaranteed — and I don't know if this is right.
F
C
F
It's possible for us to overwrite data in the OSD data store, right? So it might actually be gone by the time you're trying to read it back out, unless you have it persisted somewhere. I think you basically need to either do a full op data copy — which there's no mechanism for in the OSD, and that would be really expensive — or else we need to have some kind of snapshotting system, orthogonal to what we use right now, that can store all the time slices you care about, yeah.
G
E
B
F
E
E
I mean, in principle all the data-path stuff can do copy-on-write; the key-value stuff doesn't, so that would probably just be — yeah, we'd have to do something different there to keep the idea. But for data it'll kind of mess with the block allocation stuff, because you can't overwrite any blocks; you keep allocating new stuff and maybe make it more fragmented, but —
E
C
E
G
E
I mean, with copy-on-write, I guess — you'd basically make it so that all of the blocks that are referenced at that point in time are effectively frozen. Yeah, you could have a lot of relocation of writes, right, if you do this snapshot approach, as opposed to — yeah.
E
E
It should only — the way BlueStore does it, it only really affects anything below the min_alloc_size, which is like 4K to 16K, somewhere in there. So it doesn't make that big of an impact, I guess, unless you have a random small-write workload, but even that will probably just get hidden in the noise anyway.
I
I
E
J
Okay, so actually, since the last CDM we have some updates on the general architecture, and also we have a new design for the cache consistency. So here's the general architecture. Basically, we have three components. The first one is a cache-store library (libcachefile), which would be a common library that can do read/write on SSD. The second one is a standalone cache daemon, which owns a private policy that controls the promotion and demotion things. Currently —
J
We have a very simple LRU-based policy, but we can actually design fancier things like ARC or 2Q, things like that. And the last one is the hooks: we have a very light hook inside librbd that can call the API of libcachefile, and with that we can redirect the reads and writes to the cache store.
J
So we do cache promotion on reads, and on writes we just do cache invalidation — all of the writes go straight past the cache to RADOS. And there are some missing features in that PR currently: we only have a very happy-path data path, and all of the shared cache — I mean, for the parent image — is promoted on the opening of the first child. So the missing things are that we have to design a promote/demote mechanism to make sure the cache only stores what we need, and also we —
J
Okay, so this is the initial result, and actually we have the data path implemented. You can see from the table here that, with a very small cache size, you can get almost the same performance.
J
J
J
So here's our new cache daemon — we have been thinking about this for a few days. There are actually several key components here. The general idea is that all of the read-only blocks from the protected snapshot are cached in a shared area on the compute nodes. For example, we have an NVMe SSD mounted at some folder, and then we store those shared contents in that folder, and —
J
On each compute node there will be a cache daemon to control the shared cache state. For example, if one librbd instance wants to read from some specific offset, it just queries the cache daemon with an IPC mechanism, and the cache daemon tells it whether the cache is there. If the cache is already there, the librbd instance will try to acquire a reader lock and then just open that file and read the content.
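A rough sketch of that read path (the cache-daemon lookup call, lock choice, and file layout are assumptions for illustration, not the PR's actual interface):

```python
import fcntl
import os

def read_via_shared_cache(cache_daemon, cache_dir, object_name, offset, length):
    cache_file = cache_daemon.lookup(object_name)   # IPC query to the cache daemon
    if cache_file is None:
        return None                                 # miss: caller falls back to RADOS
    fd = os.open(os.path.join(cache_dir, cache_file), os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)              # shared reader lock on the cache file
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```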
J
J
H
Like independent mailboxes between each librbd instance and the cache daemon — and you'd still have to have some way to basically ping and prod each other to say, hey, I put something in the mailbox or whatever, and that would involve a lock, realistically, which could crash while holding it, you know. But at least, I guess, with a private lock, when that instance goes away it comes back and re-establishes brand-new mailboxes and things like that. It's all surmountable, I think; it's essentially just a matter of handling the messages.
E
E
E
H
I think the only messages you'd need would basically be saying, hey, make sure that you're marking this in your policies as a hot object or whatever; and then also, at some point, the daemon says no, I don't have it, or we've got to promote it, here's where it got written to or whatever, go ahead and do it — that type of thing. But even that would just be, I imagine, a straight file.
J
Yeah, I see — I think I can have a try with a local socket to test the latency, okay. So we also have — so currently we're trying to do a stateless cache daemon here, and so if the daemon crashes, we simply need to restart it, and the daemon will try to load the metadata file for the files within the cache; and if the load fails, we just start from an empty cache.
J
J
Okay, so here are some details for the promotion, demotion, and read/write. So, for example, if there's a read request from an RBD volume, it just tries to look up in the cache mapping — basically this will be a copy of the object map, which can tell whether the content to be read lives in the child itself or whether we are safe to read from the parent image cache, yeah.
H
H
By that time you don't even have to read it; the cache doesn't have to care about it. Because if you imagine the images: you have a parent image and they can form a chain of clones, okay. So the read and write requests are issued from, let's say, QEMU against the clone image. The clone image is what does — and when I say image, in memory —
H
I'm talking about the librbd image context. Yeah — if the clone, the lowest level in the chain, determines that, hey, my object map says I don't have this data associated with me, it automatically re-issues that read request to the next parent in the chain, and so on and so forth up the chain until some parent can satisfy that request. So the cache is tied to the parent's image context. So by the time the parent image context — the I/O path of the parent image — gets a read request —
H
H
J
And that's actually step two here: the RBD image will query the cache daemon, and then the cache daemon — like you'll see, there's a private policy to determine whether we should promote or whether we should let librbd read from the backend directly. So if it decides to promote that object, we actually acquire a write lock on the read-only cache and then just promote it from RADOS. And currently —
J
We have a very small optimization here: when doing promotion we can actually configure the promotion unit. For example, we can promote only the 4K for that single request, but for RGW caching we just promote the full object. When the promotion is done, we simply notify the image to read it, maybe via a callback.
J
J
Okay, so here's the I/O flow for reads, and basically the flow is quite similar to the promotion flow: the image looks up in the table first and then tries to query the cache daemon, and if the cache is already in the shared folder, we just acquire a read lock on that cache, check the metadata, and then read from it. If it is already there, then we can read from the shared cache.
J
J
J
We have thought about — the first one is how to do a VM migration and also how to handle VM crashes. So basically, for VM migration, we have to rebuild the cache state when the librbd instance is reopening the volume. If we have the shared cache on that new compute node, we can try to read from that shared cache; if we don't have that shared cache, we just promote on demand, I mean.
H
H
Like, the cache state: it's not something that should be in memory of the QEMU librbd process, right. The cache is the daemon and, you know, the files that exist on the SSD. And you're not — I mean, you're not going to promote the full parent image on startup or anything like that; you're going to wait for the policy to say yeah.
J
J
Yeah, that's right, that's right. And the second case — Jason actually commented on the 16788 PR — so there's a case where the client node will be a separate node, so some RBD instances will be removed from the remote node. So we have a policy thread in the cache daemon that can check the cache: if the cache is very, very cold, it will be removed eventually, so we don't have any orphan cache there. And the third case is a cache daemon crash. I —
H
H
Whatever your cache daemon uses as its policy — and I mean, not that you're updating it on every ping you're getting, but like it was periodically persisting it, so that you can in theory boot back up to a warm state, as opposed to starting from scratch and having to scan the entire file system or whatever to figure out what's out there — or, I don't know how you keep your metadata.
L
L
Okay, for the second point, I'm thinking: since we already have a parent daemon, is it possible, when we want to remove it, that we just need to contact the parent? And since, no matter whether it's the shared cache or the local cache, it is always on the same node in the same folder, we can just remove that data, so we don't need to do the periodic check.
H
H
L
The LRU eviction thing — because those caches are already there but the image is already removed, so in that case I just need to contact the parent and let the parent delete the child, the cloned cache. Another thing we need to do is to remove those low-hit-ratio caches, and that is LRU eviction; so there are different operations in different phases, I mean: one is when the image still exists, another is when the image is already removed.
J
I think the last one is more difficult, because there's a case where there may be some errors on the file system or on the SSD. So currently we don't have any mechanism to check the integrity of those files; we simply rely on the file system tools to check. That means, if some error happens on a read, we can probably ignore that and issue a read from RADOS, and —
H
H
E
And you can write to a temporary file, then sync it and then rename it — that's sort of the usual thing; then you know it's complete. I think the only issue with the reads is you have to be a little bit careful, because you're mmapping the file, and if you have a page fault that gets an I/O error it turns into a signal — I believe that's how it's communicated to the process, a SIGBUS or a segfault or whatever.
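A minimal sketch of that temp-file-then-rename pattern (paths and the helper name are illustrative):

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())       # make sure the data hits the disk
    os.rename(tmp, path)           # atomically replace the old file
    # fsync the directory so the rename itself is durable
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```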
L
E
E
Yeah, then you need to write particular blocks, I think, and then update your database thing, whatever it is — the metadata file. If you're doing an mmap, I think that's going to be the most efficient way as far as avoiding memory copies and so on, because everyone will be sharing the same pages. But if there's a media fault it's also the most awkward: you get a signal and things will just crash, and then —
H
For the demotion/eviction process, though — so I mean you'll have to take a file lock, right; I guess the one that's doing the eviction would have to try to get a writer lock on it, because the readers won't be able to release the reader lock until the bufferlist is finally freed.
L
Do you guys have any suggestion here? Since we already use mmap, is there a better way — like maybe we create and release it, then create new ones, something like that? So we can make a checkpoint thing to make sure that, maybe before we create a new one, those contents are already being stored in the file, and there won't be a problem. Maybe in this way we can make this much more consistent, or more safe.
E
L
— to update some metadata. So if there is a failure, some metadata may be just messed up; and so I was originally thinking maybe I just use msync, so I can make sure the file is already stored on the disk. But I'm now thinking, if the page fails, maybe this whole 4K page is just messed up and the data is useless. So the better way is, as you said, to write to a new file and then swap it in.
L
E
E
E
H
E
J
No, I think you and Cindy covered it pretty well, Jason. So the only thing is basically, you know, on the write — basically a copy-on-write. Did you already mention that we will use the object-map feature, and so will there be an assumption that we will, you know, need object-map?
E
E
E
E
The only other idea I had was doing periodic proportional trimming. The idea here is that, say, every second you're going to trim all the PGs — this is all written down, I can post it in here if you want to read it. So every second you look at all the PGs and say: over the whole OSD I wish I had, you know, 20,000 PG log entries, I have 22,000, so that means I want to trim — what is that — 10% of the PG log entries.
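The arithmetic of that example, as a tiny sketch (helper name is illustrative):

```python
def trim_fraction(current_entries: int, target_entries: int) -> float:
    """Per the example above: with 22,000 entries and a 20,000 target,
    roughly 10% of the PG log entries should be trimmed this round."""
    if current_entries <= target_entries:
        return 0.0
    return (current_entries - target_entries) / current_entries
```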
E
E
That was actually kind of the first thing I thought about too, but what I couldn't figure out was how you decide what to trim up to, because the requests might not be distributed uniformly over time. I guess, again, you could just — I mean, knowing where the oldest one starts —
H
D
E
H
It was like — if everything else is already sorted like that, you're already aligning all your logs, in theory, along those extents, so it seems kind of natural that you would also have — yeah.
F
F
Updating a shared data structure that says, hey, here's my oldest entry, or, I have ten that are older than 60 seconds, or something — it seems like you could design a data structure like that which just gets updated every several seconds, and then people can say, hey — and you can also put in there "I have this many", so that you can look and see when you're adding new entries, or you can pass around flags that say hey —
E
F
Yeah, or — I'm not sure; I mean, there are a couple of different ways to go about it. It might also be like: these are my five oldest entries; or, you know, I just added ten PG log entries and so we collectively need to remove that many, and I did this many at this timestamp; and then during that round the other PGs can be like, well, I trimmed this many then, and I'm —
F
E
Yeah, well, okay, so the simplest thing would be just to take the oldest one: each of them says "this is my oldest record", you take the oldest of all of those, and then you say, that was, you know, twelve minutes ago and I want to trim 20%, so I'm going to tell everyone to trim everything older than ten minutes ago.
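A sketch of that cutoff rule (the "twelve minutes ago, trim 20%, cut around ten minutes" example above, modulo the speaker's rounding):

```python
def trim_cutoff_age(oldest_age_s: float, trim_fraction: float) -> float:
    """Everyone trims entries older than this many seconds before now,
    so roughly `trim_fraction` of the oldest history is dropped."""
    return oldest_age_s * (1.0 - trim_fraction)
```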
G
M
G
Well, I guess I was thinking: why aren't there two modes? One mode is just where, when you're adding stuff to the front, you trim stuff that's beyond the time that we're allowing, and that just always happens; and then you only have to do this extra trimming if there's so much activity within the time window that we're now going to exceed our memory limits, right? I mean, that's pretty much what you're mostly talking about, yeah.
E
I guess that would be sort of a different — so you could basically just estimate what the write request rate is for the OSD and say: that means that if I want 10,000 records, that's, you know, 90 seconds of ops, and then you just say that's the time horizon. That might be even easier. Okay, that makes sense.
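The rate-based variant just mentioned, as a one-liner sketch (names illustrative):

```python
def time_horizon_s(target_entries: int, write_rate_ops_per_s: float) -> float:
    """E.g. a 10,000-entry budget at roughly 110 writes/s comes out to about
    90 seconds of ops, which then becomes the trim horizon."""
    return target_entries / write_rate_ops_per_s
```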
B
G
Yeah, no, that makes sense. I was sort of thinking that you had a set parameter that says we try to keep ten minutes — or I don't know what you were thinking, but some amount — and that's sort of the first thing that you maintain; and then you only need to worry beyond that if you aren't going to have enough room, or if you start using too much memory, right.
E
G
M
E
M
F
E
F
B
E
The motivating idea here was that we would have some amount of memory that we decide we're going to devote to this — we have a budget so that we don't, you know, blow up memory — and then, if we're degraded, we would basically steal memory from the BlueStore cache. So I would say: BlueStore, use less memory for your cache so that I can use it for PG logs, so that when whatever OSD comes back up I don't have to do a backfill.
E
E
In reality, most of the PGs are not going to be degraded, and so the inserts and the deletes are going to be separated by less than a second, in which case they're probably not even going to drop out of the RocksDB WAL write buffer at all, right. That's probably going to just cut out a ton of work that we were throwing at RocksDB. It's really only the degraded PGs that need to end up landing in level 0 and level 1.
E
M
E
But there are going to be N of them, that's the thing. So assume there are ten degraded PGs; they have different write rates, and so we want the time horizon to be whatever it is, like 13 and a half minutes, but that's going to mean different lengths of PG log for those different PGs, right.
G
F
G
E
E
E
All right, the next topic is the dedup. I just want to give everyone an update about this. Jung-hwan at SK has been working on this pretty extensively, off and on, mixed in with a couple of other things — I don't know how long it's been, nine months or a year — and I just wanted to make sure everyone knew what he was working on and what the architecture was. So I tried to summarize the current status in the dedup pad; the short version is this.
E
RADOS has a redirect concept now: in the base tier, in the object_info_t, there's a flag that says there's a manifest on the object info, and there are a couple of different types of manifests. The simplest one that's implemented right now is a redirect, so the object info just has the object name of some other object in some other pool, which basically means that this object is stored over there.
E
The RADOS test model thing is setting these and thrashing these, mixed in with all the other stuff — it's been in there for a while, so that's all there. So the basic idea for applying this to dedup is that, instead of having that manifest just be a simple pointer for the entire object being somewhere else —
E
— it would instead be a chunked manifest that says this beginning part of the object is stored over in this object, and the second chunk is over in that object, for all the different bits that the object is deduped and chunked into; and then those chunks would be stored in a content-addressable pool elsewhere. So that's the basic idea. There's a pull request with some of the chunked-manifest infrastructure that's in flight, so it's all sort of being proposed, and that pull request has the read handling.
E
So if you get a read on a chunked object, it'll break it into lots of little reads that get sent to the backend objects and then reassemble the results; and writes would simply trigger a promote — so pull the object back into the base tier, it turns back into a regular object, and then life would proceed. That's sort of the current bit that he's working on; the next steps beyond that would be, eventually —
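An illustrative sketch of the chunked-manifest idea and the read-path lookup described above (field names are made up; the real object_info_t and manifest types live in the Ceph tree):

```python
from dataclasses import dataclass

@dataclass
class ChunkRef:
    offset: int        # offset of this chunk within the logical object
    length: int
    cas_pool: str      # content-addressable pool holding the chunk
    cas_object: str    # e.g. the fingerprint/hash naming the chunk's content

@dataclass
class ChunkedManifest:
    chunks: list[ChunkRef]   # covers the whole object, in offset order

    def lookup(self, offset: int) -> ChunkRef:
        """Find the chunk containing a logical offset; the read path splits the
        client read into per-chunk reads against the CAS pool."""
        for c in self.chunks:
            if c.offset <= offset < c.offset + c.length:
                return c
        raise KeyError(offset)
```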
E
So there are a couple of basic ideas so far. The first one is that, since we don't have multi-object transactions in RADOS, the ordering of operations would be such that, if there's some intermediate failure at any of these operations, the failure mode would be a leak: you'd leak a reference to an object, as opposed to pointing to something that doesn't exist.
E
That's just a reference count number on the objects. You could have back pointers, but because some chunks are going to have like a bazillion objects that have that same content deduped to them, you can't actually point to all the objects. One of the ideas is to have this sort of fixed-size back-pointer map: if you have a small number of back pointers, then it would actually point to the objects that constitute those references.
E
E
So if you were ever to do a scrub, you could conceivably go enumerate all those objects and try to match them up; but the more popular an object is, the less precise those back pointers are, and correspondingly the less likely it is that you're ever going to delete it anyway, and so the less it matters if you leak a reference. So yeah, that's the idea.
E
Yeah, I like it — I like it more as time goes on. So I think the biggest concern I have with the overall approach — I don't really care about the leak of object references on delete; I think it's never going to matter: if you went and deleted every single object then you'd have some bits left over, but nobody's ever going to do that.
E
So, of course, the only thing that worries me is that we need to make the chunks that get stored in the backend pool big enough that the metadata overhead is manageable — the overhead of it being a distinct object with, you know, the BlueStore metadata associated with it and so on — and also so that, if you have a 4 MB object that's deduped into, you know, 100 pieces, when you read that 4 MB it turns into a hundred different reads.
E
So the bigger those chunks are, the less gross this is — I understand you want fewer reads, so you want chunks as big as possible; but the bigger they are, the harder it is to actually get a good dedup ratio. So it all really comes down to: how are you doing the chunking, and what is the workload — does it actually dedup well?
E
The good news is that they have a bunch of code that does Rabin-fingerprint chunking, so it looks at the content and tries to pick chunking boundaries and so on; and there are other things you can do too. You can do content-based chunking; you can also have bits of code that look at the content of the object and recognize patterns — like it might recognize that it's a tar file and chunk at the file boundaries, for example, which is kind of cool. These are all additional gravy, I guess.
E
I guess you can add that in to make it smarter over time; obviously we'd start out with the simplest things. But when I talked to Jared Floyd, their CTO, about this and went over it, he said that all sounds fine as long as — again, basically his concern was the object size; but he proposed an alternative architecture that's sort of less ambitious but possibly safer, if it actually works. The idea there is you could take their existing content —
E
E
E
— that they'll dedup well against, and that's sort of an open question; and it's not clear to me how that's different than just doing the chunking in the first place and then doing reference counting with that content-addressable pool. But I need to follow up with him and see if he has any bright ideas after his week at Burning Man expanding his mind.
E
E
Well, so he talked to Sean Cohen about this, about OpenStack, and Sean got super excited because this is like a checkbox that they want. But Sean's question was: why not just do dedup on the RBD client side? The problem there is, you know, it's single-user — you dedup within that one image but not between images, which is sort of the whole point.
E
H
E
That's true, that's true — and actually that might make sense for RGW, where the objects are write-once, or immutable, or whatever — they're not modified in place — because you could just chunk as the data is coming in on RGW, and it could write it directly into the CAS pool in whatever size chunks it wants.
H
E
E
G
E
C
E
H
E
That's kind of the idea, right. So it's: how do you construct that algorithm that takes all the content in the volume and tries to steer it to the right place? And even when it goes to a particular OSD, you're not actually picking the OSD — you're picking a position in the, like, hash namespace for that pool — so it's super weird, right.
E
E
B
E
E
E
Are they open-sourcing their library? Yeah — everything we buy, we open-source, so that's all coming upstream. Well, they're mainly focused on getting the kernel driver upstream; the rest is sitting around, so as soon as we decide to use it, we can just open-source it.
E
Yeah, okay. Well, anyway, that's an overview of what is going on. Jung-hwan couldn't make it, but he's continuing to work on this. So next I'll probably follow up with Jared and see if he's had any bright ideas, and I definitely want to hear from the rest of you guys about how you think these two approaches compare — either now, or —
M
M
I'm kind of partial to the first one, just because it is a lot simpler in terms of managing the space usage and not constraining placement; but it kind of worried me in terms of being very similar to the cache tiering stuff — in terms of being difficult to manage, with, like, dup detection or anything else we currently need to do between different tiers that might not be supported in the backing pool. I do —
E
I think the saving grace here, maybe, is that when you're deduping something you're doing it because it's cold, and it feels like there's sort of this implicit understanding that it's going to be slow; whereas cache tiering was supposed to make something faster but never slower, and here we're only making things slower. And so it's going to be less jarring for people — I'm sure people will still complain, but I feel —
H
E
E
You have this high-latency but more efficient tier, right. But my guess is that what we would initially implement would just be: you write it into the base tier like any normal object, and then sometime after the fact some agent or other decides this is a cold object — I don't care how fast it is, I'm going to dedup it — and so it'd be sort of an after-the-fact, offline thing.
E
Jung-hwan has a proposal to sort of do both, and his idea is to have stuff that goes into the base tier go into an LRU, and if it doesn't get touched for some period of time then it gets deduped, but if it does then it stays there — let's say, yeah, this sort of logical-log type concept.
E
E
E
Yep — I mean, the cool thing about their DM driver is it's awesome: every 4K is deduped independently. It has this very fast fingerprinting that tries to figure out quickly whether we might have a copy or not, and it also does compression — it'll take all these 4K blocks and squish them down and compress them out.
E
It's the removed_snaps — removing the removed_snaps list from the OSDMap. I was reminded of this because there was this whole series of patches recently from somebody whose name I'm blanking on at the moment, because they have a bazillion deleted snaps and it's crazy slow — they're just burning CPU processing OSD maps and doing the peering stuff, and so they're optimizing —
E
— the interval_set calculations, which is all great, but it's sort of ignoring the fact that there's no reason to have these huge lists in memory on the OSDs at all times: once the snap is removed, we should mostly be able to forget about it. So I went through and read through all of the code that looks at that data structure, and it falls into a couple of different bins.
E
E
There are cases when you're trimming snaps, or when you're looking at a request coming in that has a snap context and you need to know whether those snaps have since been deleted — since the request was sent — and that's the trickiest one, because you need to do a look-up on each of those snap IDs to see if it's now gone. So I'll talk about that last.
E
But the main thing is that when you delete a snapshot, it gets published in the OSDMap in the PG pool info, and that basically tells the OSD to put it on a queue — the snap trim queue on that PG — that it then chews on to actually delete all those snapshots; and once it's done, it puts that snap ID onto a list in the PG info of purged_snaps, and it stays there pretty much forever as far as I can tell, which is also stupid.
E
The first thought was that, basically, the monitor would do this coalescing and pruning of things that are in the OSDMap that have been totally deleted from everywhere, on a very coarse basis — like every, say, 100 OSDMap epochs or whatever — it would update it in the next increment and say: these ones have all been successfully purged.
E
You can forget about them, and it would record that somewhere — the monitor would have an old-deleted or old-purged-snaps data structure, and then the OSDs could also have a local copy of that structure, so anybody can always look it up in that structure instead of the OSDMap when they need to.
F
I think that mostly makes sense. I don't have in my head all the rules about how snapshot requests from clients are treated if the snapshot doesn't exist, and when they appear and don't — so that's the part I'd be worried about: whether we can actually throw it out on the OSD here, or whether we need to keep that giant list like we already do in the OSD map, or only have the PGs keep all of them for filtering. I don't —
E
F
E
E
E
F
G
G
E
G
E
Yeah, there's just no explicit tracking to make it scale, so they're both unbounded, yeah. So the fact that we enumerate it explicitly — it's kind of surprising that we've gotten away with it as long as we have, I think that's the way to put it. It would be better — I think you're right, it would be better to explicitly enumerate the ones that do exist instead of the ones that don't; that would be better than what we have, but it doesn't really address the problem, I think. Actually, I'm not sure.
E
E
So it feels like the best-case scenario would be that we make it so the OSDMap purged_snaps has enough history that it goes back, you know, a couple of hours or days or something; and if that's long enough, then we would hopefully never see an I/O request with a snap context that has such an old list of snapshots in it that it includes stuff that has been purged, or that you —
E
Right, right — so if we're super lucky, having such a snap context could mean, sort of, you know, we send an EAGAIN or something back to the RBD client, or it just would not do anything bad — as long as it wouldn't cause a bad clone to be created that would never get purged. As long as that didn't happen, then it almost —
F
E
The bad thing would be that it's asking for a snapshot that doesn't exist, and you would — right, yeah — you would either give it the wrong clone, or, no, you'd just say "I don't have it", and —
E
— it could know that: the OSDMap could have the lower bound of things that have been forgotten, the sequence number. So it basically says: if you ever get a sequence number older than, you know, snap ID one million and twelve, then, oh, I'm not going to tell you whether that snap was deleted or not — I don't remember, I forgot, because it wasn't worth it.
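A minimal sketch of that lower-bound idea (names illustrative; the real structures are the OSDMap purged_snaps history and whatever bound it carries):

```python
def was_purged(snap_id: int, purged: set[int], forgotten_below: int):
    """Below `forgotten_below` the map no longer remembers individual
    deletions, so the honest answer is 'unknown' rather than yes/no."""
    if snap_id < forgotten_below:
        return None          # too old -- history was compacted away
    return snap_id in purged
```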
F
F
H
G
N
F
E
F
E
M
Yeah, so another piece for me — I didn't see it addressed explicitly there, but maybe you already thought about it — was: what if we do have an older OSD that still has some snaps that haven't been trimmed?
M
E
E
I don't think that will solve the whole thing there — that's the almost-full problem. And then the last thing I have is really quick. This came up recently with a couple of different clusters where OSDs were failing, and so all the PGs kept getting backfilled to a smaller and smaller number of OSDs, and those OSDs ended up with way too many PGs — more than they should have.
E
E
The problem is that there are two ways that PGs get created. One is the monitor will tell you to create them — that's sort of the easy one to deal with. But more generally, the primary for the PG is somewhere else and it's sending you notifies — peering notifies — and that will instantiate the PG. The idea is you'll just refuse: you won't do anything, and so that peering will basically get stuck, which is kind of what we want — that's better than making that OSD get overloaded.
E
So — I have an old tracker for this floating around somewhere, hopefully. The basic idea I had was that whenever that happens — whenever we refuse to create a PG — we just put it in a set, so we have a list of what those PGs are by their IDs and we know how many there are; and then, after we remove enough PGs that we decide it's worth it, we restart peering on those ones.
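A sketch of that "remember refused PG creates" idea (class and threshold are illustrative, not the tracker's design):

```python
class RefusedPGs:
    def __init__(self):
        self.pgs = set()

    def refuse(self, pgid) -> None:
        self.pgs.add(pgid)            # peering for this PG stays parked

    def maybe_retry(self, pgs_removed_since: int, threshold: int = 10):
        """Once enough local PGs have been deleted, kick peering again for
        everything we previously refused."""
        if pgs_removed_since >= threshold and self.pgs:
            retry, self.pgs = self.pgs, set()
            return retry
        return set()
```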
E
B
E
When I originally looked at this, like forever ago — I think it was two years ago — I thought it was complicated, but I think it's not as complicated as I thought it was. So I mentioned this to Kefu after stand-up, because he's looking for something new and he's chomping at the bit, so he may run with it. I think it would be good, yeah.
E
N
E
F
E
K
E
So all the situations that I've seen where this happened — it's all, like, a skewed curve or whatever, right, where you have a bunch that have the normal amount of PGs, and then as OSDs failed the PGs kept piling up on a smaller and smaller number of nodes; so it's not like they moved around or flipped positions, they all just sort of collapsed, and so this would let them drain back to where they go — it would at least address that case. And I —
E
I guess it's possible, if you got into that situation and then changed your CRUSH map entirely so everything was randomly scattered somewhere else, that you could have PGs that need to go to those over-full nodes and not the other way around, and they would just get stuck, basically, until they'd get drained.
E
E
N
N
N
N
N
N
C
The OSD must be able to find the QoS control from the unique identifier of the arriving request. So the space available for storing QoS-specific information in a Ceph cluster is as follows: first, across the maps, such as the monitor, OSD, PG, and CRUSH maps; and second, a header-type object — for example, the RBD header. Based on this, how will we describe the QoS class structure that we have already developed for each arriving request?
N
The pool has the full information in this case, which is the QoS identifier, and the pool property is stored in the OSD map; thus the desired QoS class control information can be added to the existing properties of each pool. From the OSD's standpoint, since each OSD has the RADOS OSD map, it can know the QoS control of the pool corresponding to a received request, together with the pool and class control information.
N
That is our first implementation: a common table, keyed for example by pool ID, serves as the first set. In the case of RBD there is a slight difference from the per-pool case: each request can deduce the RBD image information through its ID, and the properties for RBD are stored in the RBD header-type object, so the desired QoS control —
N
— information can be added to the properties of each RBD header, but in the case of RBD the QoS control information comes from the RBD header object. I will add the RBD section in the presentation. Thus, the QoS control information will be included in the request itself; as a result, the OSD can avoid asking another OSD again to obtain the QoS control information associated with an arriving RBD request.
N
N
From the user's standpoint, the usage is as follows: first, you create a unique identifier — in short, a QID — and the related QoS control; the QID and related QoS control are stored in a global QoS table, which will be included in the OSD map. After that, the QID is mapped to a specific QoS unit, such as an RBD image, a pool, or a specific object for a given user, and the completed mapping is also stored in that table, on the right.
N
N
Of course, ambiguity can occur in the process of finding the QID — for example, a request could map both to the pool's QID 0 and the RBD image's QID 1, as in the example table. In this case an internal priority applies: the smaller-range unit takes precedence over the larger range, that is, RBD has higher priority than the pool. Once the QID is found, the request will be sent to the OSD, and the OSD, on receiving the request —
N
E
The first part looks right to me — the pool limits, and setting them as a property on the RBD image, the QID. One thing worries me, though: the way it's described here, it looks like the OSD will get a request, the queue ID will be twelve, it doesn't know what that means, and so it has to go look that up — like, query the monitor or something — in order to find out what the parameters are for it.
E
B
E
N
E
Yeah — so the idea, then, is that this QoS table — I mean, in order for it to be stored in the OSD map it has to be pretty small, or reasonably small — so it would be the generic set of policies, like fast, medium, and slow, or something like that, and each of those policies would be applied to lots of images. Is that the idea?
D
E
Because, yeah, I mean, that's sort of what I would guess given the diagram — except that your example says that QID zero is listed as a specific RBD image name, yeah.
E
E
H
N
The table could become bigger in size, so I think the information will be spread across the control spaces — such as the global QoS table first, and second the header-type object. So we can store the QID in the header-type objects, like the RBD header, yeah. Storing everything in the OSD map is not the proper way — it is not the proper way, if you think we can — I don't know.
E
E
The librbd just tells librados to tag my requests with, like, QoS parameters — 500 IOPS — and that gets passed to the OSD, and everything's simple. The place where I think that's going to be a problem is, again: you have this huge OpenStack cluster, you've got 10,000 images and they all have 500 IOPS, and you say, you know what, I think these users should get 600 IOPS — and you don't want to go and touch every single image to change it; you just want to have one switch that says my gold policy is now 600.
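A toy sketch of that shared-policy idea — images carry only a small QoS class id, and the class-to-limits table lives in one place (e.g. the OSDMap), so changing "gold" updates every image at once. The names and numbers below are illustrative:

```python
qos_classes = {
    "gold":   {"iops_limit": 600, "bps_limit": 200 * 1024 * 1024},
    "silver": {"iops_limit": 300, "bps_limit": 100 * 1024 * 1024},
}

def limits_for(image_qos_class: str) -> dict:
    """An OSD receiving a request tagged only with a class name can resolve
    the actual limits from the shared table."""
    return qos_classes[image_qos_class]
```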
H
E
H
H
If it's already attached — it's already attached and you have to detach it and bring it back, right. And then it still wouldn't be RBD controlling those QoS settings; it would actually be Cinder telling RBD what the QoS specs are. So for your OpenStack example, there actually isn't any management that we need to do: we're just going to be told, on every single image, limit your bandwidth to this or limit your IOPS to that, or turn QoS off.
H
E
H
It still — I think, even if you had something like that, it still has to be — because the disconnect is already there: these are volume types that have QoS add-ons to them, and those are Cinder-managed types, and they don't correspond to, you know, low-level storage back-end things; they just seem to get mapped, but, man —
E
H
You would probably — the problem is, writing that Cinder driver, we'd have to do it so that you're trying to keep the Cinder driver and the QoS-based volume types in sync with this RBD QoS data; and, you know, Cinder's not going to — I mean, Cinder is the owner of the data, right, in terms of the data model. Yeah, yeah — so you need —
E
H
— to have some process that is always trying to keep RBD in sync with whatever the volume-type QoS specs are; or, whenever that RBD back-end hook is implemented, if they ever do support dynamic updating, it would just have to iterate through all the images and call, like, an update-QoS method that would ping the header to reread something or whatever.
E
H
E
H
H
E
Well, presumably you could parameterize it — like, it could be that you specify "this is a policy, and this is the size that you divide by" or whatever, something like that. But in any case, that would mean you'd have a slightly different tagging on the request, and then when it hits the OSD it can look up in the OSD map and apply those parameters; but it sort of ends up being the same thing as far as, say, the dmclock scheduler —
E
H
H
E
Yeah, it could even be a separate monitor structure that they subscribe to through the MonClient or whatever, so that they find out when those QoS policies change, if you really wanted to have the live-update thing. Yes, yes — the only thing that this, as proposed, where it's done on the OSD, would provide is that sort of protection against cheating, or a modified client that's just feeding in bogus parameters in order to steal IOPS, except —
H
G
E
N
So, first, what is the issue with the delta and rho parameters? The mclock algorithm used here requires that these deltas and rhos be recorded in order to provide QoS within a specific range. The current implementation situation is as follows: first, the operation types that carry delta and rho parameters are MOSDOp and MOSDOpReply, which are responsible for normal client I/O.
N
In other words, they are currently handled only for those messages. The second point is the insertion of the parameters into the existing structures: they are added directly as fields of the request and reply, MOSDOp and MOSDOpReply. Here is an example — consider the messages exchanged and the corresponding responses.
N
So the issues on this page: the first issue is about the number of shards in the operation queue in one OSD. There will be mclock queues depending on the number of shards, and therefore, as the number of shards increases, the number of mclock queues in a single OSD also increases; a large number of mclock queues can cause problems, as shown at the bottom of the left graph, because a group's operations get divided up.
N
As the number of mclock queues increases, the problem becomes worse — in other words, the tracking of operation tags, which each mclock queue keeps separately, gets split up, as shown in the two pictures on the left.
N
N
Next is the weight control with delta and rho for background I/O. In the mclock queues, the client identifier is a pair of op type and client entity; in other words, client I/O and background I/O are handled in the one queue at the same time, as shown in the bottom picture, so the weight control to ensure fairness between client I/O and background I/O also requires delta and rho.
N
So our first approach is that, for background I/O, we use a normalized delta and rho — for example, a time-averaged value. But with that approach there are some issues, so I thought of an alternative solution, which is here, and I will explain it.
O
O
In addition, we could use a single mclock queue shared across multiple shards. The advantage of this approach is that the QoS tracking is global; the downside is that requests can end up waiting behind one another and the processing path would be longer, because the single mclock queue must be locked.
O
O
O
B
E
I think they would work the same, because once you're serializing on that queue, then the only real advantage to having the sharding stuff is having multiple threads, so that if one PG blocks on a read or something like that it won't slow everything else down — and having multiple threads per shard gets you that.
O
B
O
Requests are sorted in the mclock queue, and normally a request coming in will bypass the queue without any competition when there are no other requests waiting. That is why we added a throttle: the purpose of the throttle is to dispatch only enough requests to the OSD backend to get full performance.
O
The remaining requests stay in the mclock queue, where they can still be reordered by the mclock scheduling. The throttle measures the throughput and also tracks the maximum throughput seen so far, and proportional scheduling still happens inside the queue.
O
It is important to figure out how much the backend can actually absorb; that way, even if more requests arrive than that, the degradation can be minimized when the backend is saturated. To find the saturation point, the throughput is measured every 200 ms or so; if the measured average is greater than the previous maximum throughput, the maximum throughput is increased.
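A sketch of that saturation probe as I understand it (purely illustrative, not the actual patch):

```python
class ThroughputProbe:
    """Re-measure throughput periodically (the talk mentions roughly every
    200 ms) and raise the dispatch ceiling whenever the backend beats its
    previous best; only enough requests to reach that ceiling leave the
    mclock queue, the rest stay there and can still be reordered."""

    def __init__(self):
        self.max_throughput = 0.0

    def observe(self, measured_throughput: float) -> None:
        # A new best measurement means the backend is not saturated yet.
        if measured_throughput > self.max_throughput:
            self.max_throughput = measured_throughput

    def dispatch_budget(self, interval_s: float) -> float:
        # How many ops we are willing to hand to the backend next interval.
        return self.max_throughput * interval_s
```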
G
O
The next page is the test result: we tested five pools. The bottom-left graph is the proportional testing result between the pools with different weights; with the throttle enabled we can get the ratio 5:4:3:2:1, and —
O
E
E
It sounds to me, overall, like we're on the right track. It seems like the next steps are probably to just propose what the specific CLI commands are to configure the first two types — the pool QoS thresholds and the RBD QoS thresholds — then get your pull request that does the rho/delta, whatever it is, stuff in librados, and then, you know, go ahead and implement it.
E
I would wait on the universal QoS stuff until the other parts are implemented and we have a better understanding of how that should behave, and then the I/O throttler, I guess, would be the last piece. But overall this sounds awesome, looks really good. Jeff, any concerns or questions? Jason?
H
No — just, on the IRC I did post — I was looking through Cinder again. It does restrict updating of front-end-based QoS, but on the back end, if the back end doesn't support it, it iterates through all the volumes associated with that back end and just changes them on a per-image basis. So again, from Cinder's perspective it wouldn't be any help — it would really be a hindrance to have this extra layer on top.
H
E
It's like — copied into Swift, but it was always a lot more work; but you were able to implement efficient back-end operations. Presumably the same thing with multi-site copy: you know, by default Cinder will go and literally copy the thing and stream it across, and, like —