From YouTube: Ceph Developer Monthly 2021-06-02
A: All right, let me go ahead and get started. The first topic is around backfill and its particular failure conditions.
B: Okay, so the issue is: when we have a backfill and an unfound object is found during backfilling, we enter the backfill_unfound state. This is per PG.
B: But still there is yet another case when the crash is possible. And another issue is this: that actually, right now...
B: ...we find this object in its listing, so it seems everything is okay, and then the PG enters a clean state.
C: It seems like there are sort of two roads to go down, if I'm understanding this correctly. I mean, clearly master shouldn't crash, right, but you could imagine continuing with the old behavior where, if the OSD restarts, we sort of forget that the object was unreadable and had an EIO, and then, if we try to read it again or we scrub, it appears as unfound again.
B: I don't know yet how we actually want to fix this; probably someone can provide some more input, yeah.
D: Certainly for the older versions, I think the right answer is to shore up the existing behavior: don't crash, but don't try to fix it either, per se; scrub will catch it later and that's good enough, well, sort of. For master, though, we now have the capacity to write down things that are in the missing set, so plausibly we could add a message to the backfill process that forcibly adds things to the missing set on replicas.
D: So when we do a backfill read, if we learn during that process, for an erasure-coded pool, that some replica doesn't have a copy, or that the primary doesn't for a replicated pool, we can add that to the missing set and wait for that to be done before letting backfill advance, which would give us the state required to ensure we don't forget it later.
D: That would also give us the right primitive, I think, for dealing with this in scrub as well. When we find one of these objects isn't where it should be, or is unreadable, we can add it to the missing set on the relevant OSD and continue on. That way it'll give us sort of a baseline piece of state to deal with these things in general.
C: That's not just backfill, right? It's... if there's... it's essentially...
C: ...with the flag in the object_info_t.
E: I maybe missed the initial part of this discussion, being a few minutes late, but, Mykola, what I'm understanding from your latest test results is that, with the patch that you applied on Nautilus, the same test that is failing on master is not failing on Nautilus, so everything seems just fine with Nautilus, even with FileStore, right?
B: We do not lose this state when a non-primary is restarted, anyway; we still lose it when the primary OSD is restarted, so it's kind of a partial fix. But yes, just if it was not clear from the context: one of the issues was that this information about unfound objects was lost, and another was that there were some crashes on master only, not release branches, and I think they are related to the refactoring of the...
E: So yeah, I think in general, Sam, I'd like to hear your opinion too. The idea of maintaining backfills_in_flight for anything but the primary, which Mykola is essentially manually backporting, and which you had in your refactor, seems right to me, and that's probably why that's fixing the problem on Nautilus with the test that Mykola has. And if that is valid, then even the other patch, which he was talking about for the crash on master...
E: ...that is a follow-up of that, right? Because you're expecting that backfills_in_flight and the recovering set are not the same, except the only difference is that the primary can have something which is preventing it from being the same. So those two changes, with this particular patch that he's talking about, are currently the same in master and Nautilus, but there seems to be something extra, or something missing, in master, because of which we see that other crash that he saw while running that test.
B: Yeah, actually, I don't know yet what the actual cause of the crash is. It looks like, probably, well...
E: ...yeah, that's what I was trying to say, and it's maybe even worth looking at why master is crashing versus Nautilus not crashing, because this test seems pretty controlled: this is just one PG, and we should probably look at why the same crash does... yeah, just comparing.
B: And yeah, it's worth mentioning that this is only happening on FileStore. I have no idea why it's not happening on BlueStore. I have not yet confirmed this, but my current assumption is that FileStore is faster when restarting, and it happens only when the OSD is stopped, peering happens, the PG enters the backfilling state, and while it's still in the backfilling state...
E: Yeah, sure, yeah. What I'm just trying to say is: maybe take that same test that is passing on Nautilus, compare it with the same test on master, and see what the difference in code path is, and why master is crashing. That could give us a rough idea about, you know, whether there was something missing, or there's some, you know, invariant that's being broken somewhere.
A: The other aspect of this that I think we wanted to discuss was how we can get better coverage for these kinds of cases, because this kind of thing wasn't being turned up by the existing error injection tests.
D: If we get better implementations of these error cases, we could do something like have the error injection suites deliberately keep a set of objects with at least one broken copy, and poke at the relevant OSDs to keep that count where it should be. The OSDs would then have a really high chance of encountering one of these broken objects and being forced to deal with it.
C: Why not, yeah. I mean, it seems like the error injection that I'm familiar with right now is transient: it's an asok command that has some state, and if you restart, it'll forget. It seems like, if we have a persistent thing where we say "this object on this OSD should get EIO", and then we go and randomly poke objects that way, then we'll be much more likely to hit them when we start moving data around.
A: Right, and if we add that as, like, an admin socket command or something, and then controlled it from teuthology in a test script, we could be sure that we weren't corrupting too many copies.
D: I mean, you can parse the cluster log, and we can make sure that we put notifications or fix events in there, but that should exist anyway; I think it's a good notification for scrub to have.
H: Cool, yeah. So this is a feature that we've been planning; Marcus has done some design work around it, and we've got interest from some developers at Flipkart that want to help out, so we wanted to get Marcus and them on a call. Marcus?
I: Sorry, can you hear me? (Yep, yep.) Let's see... I certainly can talk about it for a bit; I've got an outline and some notes, if that's useful, or... I'm not quite sure how to incorporate an Etherpad into this.
I: Yeah, no, I haven't messed around much with Etherpad, I'm sorry. But let's see: what I can do is share my screen somehow. Let's try that.
I: Apparently this has changed a bit. Okay, can you see something? Yeah? Okay. So perhaps that's not the right thing, but...
I: Okay, so anyway, that was kind of an outline of what I wanted to talk about. There are about four main kinds of AWS RGW S3 encryption. The first kind is client-side encryption, where all of the encryption, decryption, and key management happens on the client side. That's a feature of the AWS SDK, and it doesn't have any implications whatsoever for the server. The second kind of encryption is called SSE-C, and that's where the client provides the key.

I: So there's logic on the server side to handle actually decrypting and encrypting the object, but the key is provided by the client, and that means the client is responsible for managing that key and has access to that key.

I: The third kind of encryption in S3 is SSE-KMS, and in that scheme there's a KMS that's shared between the client and the server. The client is responsible for creating the keys in the KMS, and then the server handles fetching a key and using it to encrypt objects.

I: And then the final kind of encryption is SSE-S3, which works very much like SSE-KMS except that the server is also responsible for creating and deleting keys, and so it's completely transparent to the client: the client doesn't really see much of anything.
I: The right way to think about tokens with vault is that tokens are basically like Kerberos tickets: they have a short lifetime and you want to get a new one every so often. There are at least 20 different ways in vault to get tokens, and that would mean a lot of code if you were going to try to support all of those directly. So the right way, I think, to manage vault tokens...

I: ...is you run an instance of vault agent, and you make it responsible for renewing tokens or getting a new token as necessary. All of that code is in vault agent, and it's easy to use it transparently, so there's no reason inside of Ceph to manage vault tokens except in terms of using the vault agent. The last thing that vault provides, which is very useful for Ceph...
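The agent setup described above can be sketched as a small Vault Agent configuration; the auth method (AppRole) and file paths here are illustrative assumptions, not anything from the talk:

```hcl
# Hypothetical Vault Agent config: the agent logs in (here via AppRole)
# and keeps a fresh token in a sink file that RGW can read, so Ceph
# itself never has to implement token renewal. Paths are examples only.
auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault/role-id"
      secret_id_file_path = "/etc/vault/secret-id"
    }
  }

  sink "file" {
    config = {
      path = "/run/vault/rgw-token"
    }
  }
}
```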
I: ...is the transit engine: the keys don't ever have to come outside of the vault. The way that works is that when you're encrypting a Ceph object, you create a key, ship it off to vault agent, and ask it to encrypt it under a key that's stored inside vault; what you get back is an encrypted string that you can store with the object. That means at that point you have an encrypted object and an encrypted key, and in order to decrypt the object, you have to go back and present the encrypted key to vault.
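A minimal sketch of that envelope flow against Vault's transit endpoints (endpoint shapes per the Vault transit API docs; the key name "rgw-bucket-key" is a hypothetical example), building the request payloads and parsing the version out of the returned ciphertext:

```python
import base64
import json

# The per-object data key never leaves RGW in wrapped-only form: it is sent
# to Vault's transit engine to be wrapped, and only the wrapped string is
# stored with the object. Key name below is an assumption for illustration.
KEY_NAME = "rgw-bucket-key"

def encrypt_request(plaintext_key: bytes):
    """Build the POST path and JSON body to wrap a data key in Vault."""
    path = "v1/transit/encrypt/" + KEY_NAME
    body = json.dumps({"plaintext": base64.b64encode(plaintext_key).decode()})
    return path, body

def decrypt_request(wrapped: str):
    """Build the POST path and JSON body to unwrap a stored ciphertext."""
    path = "v1/transit/decrypt/" + KEY_NAME
    body = json.dumps({"ciphertext": wrapped})
    return path, body

def key_version(wrapped: str) -> int:
    """Transit ciphertexts look like 'vault:v2:...'; the middle field names
    the version of the wrapping key, which rotation and rewrap care about."""
    prefix, version, _ = wrapped.split(":", 2)
    assert prefix == "vault"
    return int(version.lstrip("v"))
```
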
I: You ask it to decrypt that, and then you can use the result to decrypt the actual object. The cool thing about that is that it makes key rotation very simple, because key rotation basically consists of asking vault to take one of these transit secrets and create a new key version. At that point, whenever you're creating a new object, you're going to get it encrypted under the new key, and vault keeps the old key around under a different key version number.

I: You can be as lazy as you want about that. I think, in the context of Ceph, that might actually be a choice about how you would like to do it.
I: So, in terms of the Ceph implementation, there's a couple of things that have to happen. Most of the complexity is going to be taking the existing Ceph code and teaching it that you've got these two different notions of talking to the KMS, and so in the configuration you're going to want a parallel set of configuration options to name the SSE-S3 stuff.

I: Obviously, if Ceph is expected to manage keys, it's going to need to support creating keys, and deleting keys when a bucket's removed. Key rotation is something where you probably want configuration options to control when and how often you rotate keys, and you might like manual control options to be able to administer that as well.

I: That is pretty much most of what I have. I can talk a little bit about the places in the code where I think things need to happen; I don't know if that's of particular interest to people here right now.
I: No, no. The thing that actually controls whether the encryption happens or not is the attributes that are set on the object at the time that you do a put-object. But what you can do with policy is require that those encryption parameters actually be there; I think you can actually set, or require, that particular values be provided for the encryption as well, and that means you can create a bucket that has to have everything inside of it encrypted.
I: That was not it either; I know I've got this somewhere... yeah, yeah, oh.
K: I think I disagree with that. If you look at the AWS S3 API: if a put-bucket-encryption policy is enforced, then for every put request which comes with a header saying that you want server-side encryption, only then does it encrypt with the SSE-S3 created key.

K: However, if that header is missing, then it stores that object without encrypting it.

K: But if you want all the objects in that bucket to be encrypted all the time, then you have to set another policy on the bucket which says to deny unencrypted uploads and to deny, I think, some specific header; there are two such things which have to be specified. So these two policies have to be enforced, and only then, if the server-side-encryption header is missing on an incoming object, does the server throw an error saying that you'd better supply the header.
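The two deny statements K describes follow the standard AWS pattern for requiring SSE on uploads; a sketch of such a bucket policy (bucket name hypothetical) might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mybucket/*",
      "Condition": {
        "StringNotEquals": { "s3:x-amz-server-side-encryption": "AES256" }
      }
    },
    {
      "Sid": "DenyUnencryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mybucket/*",
      "Condition": {
        "Null": { "s3:x-amz-server-side-encryption": "true" }
      }
    }
  ]
}
```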
I: The header, yeah. So these are the notes that I had: apparently there's an encryption parameter that you can specify that will get you some information about how the encryption is working, and there's apparently a put-bucket-encryption call that controls part of this logic.

I: So this may actually be a different string, or... that's not the same as a policy document.
H: Yeah, most of our policy is just xattrs on the bucket metadata.
H: So, Priya, Prasad, about the vault parts specifically: is there anything that Marcus talked about that isn't compatible with the way that you're using it?
K: No, actually. I think Marcus mentioned something about when to use KMIP versus when to use vault, so I was a little confused. What I was thinking, from the SSE-S3 point of view, is that I don't think Ceph will, you know, ever have its own KMS kind of thing, so it will always talk to some vault, be it KMIP, or Barbican, or, you know, the HashiCorp one; the three of them it's already integrated with.
K: So at the back end we have any of these three KMSs, and to the client we say that, yes, we support SSE-S3, and depending upon what type of KMS server is there at the back end, it would, you know, have all those policies and everything, and do the key management with that type of KMS. So that was my understanding.
I: Okay. If you're specifying the KMS, and you're using vault, for instance, one of the parameters for vault is... which one is it... it's the default prefix, and that basically names which part of the vault namespace you're using. For SSE-S3, that needs to be a different string than what you're using for KMS, because you don't want users to be interacting with the secrets that Ceph is managing.
K: That is agreed, that is okay. So the customer has no clue of the key IDs or the namespace; the prefix, I think, is what you said, right?
K: Correct, that's true. So it's a completely different sort of namespace altogether; you know, it's multi-tenant. One of them would be this Ceph one, and maybe other customers could use the others. Okay.
K: So, in some sense, if I understand correctly, we will have, like, you know, two sets of APIs or interfaces. One would do all the key management, so maybe we would need some different type of radosgw-admin key-management subcommands, which would be required to, you know, set up the keys: whether you want to create a pool of keys, or whether you want to create a separate key for every bucket.
I: Key rotations... I don't think there's any necessity to do that. I think that, from the radosgw-admin standpoint, I think that the...

I: I mean, that's what the transit encoding for SSE-KMS does today. I think that's probably the only form of encryption that we want to carry forward, so I'm thinking this is going to look just like the vault transit encoding.

I: My thinking is that if you just want to use bucket encryption, you just turn it on and you don't interact with the radosgw-admin command at all. When you turn it on, the next time Ceph stores an object it goes out and sees whether the key exists in vault, and if it doesn't, it creates a bucket key automatically; and when you delete the bucket, it would just go out and always try to delete the key that should belong to that bucket.
I: There is one other thing that I should mention, about the naming of the key that you put in vault.

I: You don't want to name it after the bucket name, because the bucket name could change if you rename the bucket. The keys should be named based on the bucket's UUID, or some part of the bucket metadata that does not, that cannot, change.
K: ...change, okay. The key name; I'm noting it down in the notepad. So the key name should be based on the bucket UUID; that's a good point. However, you know, we would need some sort of radosgw-admin command for the key rotation and re-wrapping, right? Or do you think you want to do it automatically?
I: At the very least, there should be a radosgw-admin command with which you can force key rotation; that seems like a good basic starting point. I think it would probably be desirable to have the ability to say "rotate keys once a month" or something like that. There's probably a distinction...

I: There's a couple of ways to handle this. When you encrypt an object, you would be storing an attribute that basically contains the encrypted key. It also contains the key version number, and the key version number could easily be parsed out of that.

I: The very simplest way to do the key rotation is to just always encrypt new objects under the new key and not bother to go back and try to re-wrap old keys. But if you want to re-wrap old keys, you need to iterate through; I think probably the simplest way to do that is just to iterate through all the objects in a bucket, get the attributes one by one, and re-wrap the attributes.
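The eager re-wrap pass described here could look like the following sketch. The `rewrap` callable stands in for a POST to Vault's transit rewrap endpoint (which re-encrypts a ciphertext under the newest key version without exposing the plaintext), and the attribute name "wrapped-key" is a hypothetical example:

```python
# Walk a bucket's objects, read the stored wrapped-key attribute, and ask
# the injected `rewrap` function (in practice, Vault's transit rewrap
# endpoint) to re-encrypt it under the newest key version. Objects already
# on the newest version come back unchanged and are skipped.

def rewrap_bucket(objects: dict, rewrap) -> int:
    """Re-wrap every object's stored key attribute; returns count updated."""
    touched = 0
    for name, attrs in objects.items():
        old = attrs["wrapped-key"]
        new = rewrap(old)  # returns ciphertext under the newest key version
        if new != old:
            attrs["wrapped-key"] = new
            touched += 1
    return touched
```
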
K: Marcus, do we plan to store the KEKs somewhere, or will we have to go through all the buckets which have the policy enforced, fetch the KEK, and then issue a key...
I: ...rotation? Like I said, the key rotation is really two parts. To just create a new key, the only thing you have to talk to is vault; you don't have to do anything else. To re-wrap all the keys in a bucket, you have to crawl through the bucket and find all the objects encrypted with that...
K: I was not talking about that. I mean, if we have to do automatic KEK rotation, the...
I: If you derive the key name from the bucket, you don't need to store it anywhere. Okay.
K: No, I mean, okay, so I think if we have to do automatic key rotation... say there's an automatic key rotation after every 60 days; then we go through all the buckets which have the encryption policy enforced, and then, like, you know, create the KEK name based on the UUID and call vault to rotate those keys, right? That's what you're saying.
M: Marcus, do you think an implementation where we bring in support for the three APIs which allow clients to opt in for encryption, you know, the put, get, and delete bucket-encryption APIs...

M: ...along with, you know, support for SSE-S3, which would, you know, say, use the bucket marker, which is unique, as the key ID, and have an option to talk to any vault of choice. The vault of choice would be, you know, something very modular, and we could have multiple implementations, you know, added later, but we could start with one of the implementations, say something like, you know, HashiCorp Vault, which will have, you know, some of the safeguards that you mentioned.

M: Like, you know, a bucket can be a mixed bucket, wherein it could host, you know, objects which are unencrypted and objects which are encrypted using SSE-S3.

M: Then it should not share the key namespace with, you know, the ones that are specified by the customers themselves, and have just one such implementation on top of the existing KMS infra that's already there in the master branch. Does that, you know, sound like...

M: ...that's something that we'd like to kind of have, and, yeah, we are kind of working on the implementation. I'm not sure if there already exists an implementation somewhere in a private branch or a repo that we could look up, or if this is something that has to be created afresh.
I: I've got the bare beginnings, but not very much code.

I: This doesn't look to me like it's going to be a very complicated project, so I don't think it's going to be that hard to do.
I: Yeah, no, it should be sharing a lot of, most of, its code with the KMS stuff.
I: I think the most challenging part is gonna be figuring out how to reorganize the KMS logic enough to basically make it configurable between these, sharing the same code between both the SSE-S3 case and the SSE-KMS case.

I: There's a bunch of places in the KMS code where it's looking for things like rgw_crypt_vault_addr and rgw_crypt_vault_prefix directly, and that's going to need to be configurable, so that it can look for either rgw_crypt_vault_prefix or an SSE-S3-specific vault prefix.
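The parallel option set being described might look like the following fragment; the existing rgw_crypt_vault_* options are the ones Marcus refers to, while the sse-s3 counterparts and their values here are hypothetical names for the new set under discussion:

```ini
# Existing SSE-KMS vault options referenced above:
rgw crypt vault addr = http://127.0.0.1:8200
rgw crypt vault auth = agent
rgw crypt vault prefix = /v1/transit
rgw crypt vault secret engine = transit

# Hypothetical parallel set for SSE-S3, pointing at a separate part of the
# vault namespace so users never interact with the keys Ceph itself manages:
rgw crypt sse s3 vault addr = http://127.0.0.1:8200
rgw crypt sse s3 vault prefix = /v1/transit-rgw-s3
```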
K: I think there's one question, Marcus, which I had. I think the KMS logic which interacts with the three different types of KMSs, that is, KMIP and vault and everything, is there in the master branch. Do you have any plans to backport it to Nautilus?
I: Well, I think a lot of this, the Barbican part, is there. Probably the two things that might not be in Nautilus today are the KMIP stuff and the transit stuff. I think there were plans to backport both of those, but I'm not quite sure where that's at.
K: So on Nautilus you'll not be backporting...
H: If you guys could really use it on Nautilus, then we could revisit that and discuss it. I'll try to find the backport PR.
A: We're trying to close out the final Nautilus release pretty soon; I'm not sure if it makes sense to continue backports there if this is a larger feature.
H: ...to start, the work is just the put-bucket-encryption APIs themselves, and then maybe next week in the refactoring call we can go over the next...
I: Yeah, that sounds fine to me. Probably what I will go do next... the interesting piece to me, I guess, is how to interact with vault to create and delete keys; that's probably the next piece I'll...
A: All right, thanks folks. I think we can move on to the next topics, then, related to the manager.
A: The first one is about sub-interpreters: we're continuing to occasionally run into these bugs, this time due to dependencies of dependencies, or maybe dependencies of dependencies of dependencies; in any case, far down the chain from what we can control directly.
A: Yeah, we talked about some of these in the previous CDM as well, like maybe, as a shorter-term fix, keeping things within the same host with multiple processes. Ernesto had a couple of suggestions on the mailing list as well that might perhaps be short-term workarounds.
C: I mean, I'm hesitant to go down the process route, because it will make the inter-module communication so much more expensive.
C: I think those would... although this is, like, after making a REST call, so maybe an extra hop isn't such a big deal, but I'm also just worried that going down that road is a significant investment in infrastructure.
C: I'm trying to figure out what... well, look, it seems like there's maybe a short-to-midterm fix, and then there's the long-term strategy. And there was one reference at the end of this thread, I think, to some new multiple isolated sub-interpreters work that's experimental but maybe will develop further.
A: ...longer term, possibly. If those caveats are resolved, then it's usable for everything, I think. I'm not sure how it's going to develop, I guess, but we could kind of try to focus on a short-term solution that works for the next year or two and see where that goes. That's kind of what I'm thinking.
C: Because if we just need a short-to-medium-term solution, I wonder if the simplest would be just combining everything back under a single interpreter again. Yep, that is efficient, and it's relatively simple. I think the only real challenge is that we have to make sure we only have one user of CherryPy, but that doesn't seem like it would be that hard to fix.
O: Yeah, I don't know, because I think one of the reasons for these isolated sub-interpreters is performance, and I'm not sure... I mean, running everything from a single interpreter could involve some bottlenecks in terms of performance.
C: I was gonna ask if there was, like, an immediate bug; like, what is the latest symptom of this that we have? Is there, like, a...
A: Might be worth trying it out and just seeing if everything breaks horribly, because if it does, it might take longer to fix, right?
O: So, one of the questions: the offending module is this numpy, right? Okay. And since it's been used by this prediction module, only the local one, okay; and also, you mentioned, Sebastian, the kubernetes client. Is that pulled in from the events, or...?
N: And I don't think that we can avoid it. We could try mocking the websocket library within the kubernetes client, but...
O: Yeah, I was asking because I was checking the prediction module, and it seems like, well, it was just a first impression, right, that the numpy module is only used in the pre-processing stages, so maybe it's not needed in the running module; it's only used during some stages or something. So maybe it can be imported in that function, and we avoid importing it as a global; though that might not be the case.
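The deferred-import idea Ernesto raises can be sketched as below: instead of a module-level `import numpy as np` (which drags the C extension into the mgr at module load time), import it inside the one function that needs it. `statistics` stands in for numpy here so the sketch stays self-contained, and `preprocess` is a hypothetical function name:

```python
# Deferred import pattern: the heavy dependency is only loaded on the
# first call to preprocess(), not when the mgr module itself is imported.

def preprocess(samples):
    # In the real module this would be `import numpy as np`; using a
    # stdlib module here keeps the example runnable anywhere.
    import statistics
    return statistics.mean(samples)
```
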
A: Yeah, yeah, I think we could definitely avoid it in diskprediction. I think it's also used in the models itself, which are kind of, sort of, pickle files, I believe, so it might be a little more complex to get rid of it there, but it's not a strict necessity.
C: Yeah, it might be worth mentioning, although it probably won't change the dependency, that Karen is in the process of refactoring that code so that all the diskprediction stuff is in a standalone Python module that we can then consume. So that code's gonna move out of the manager proper, but it'll still depend on numpy.
A: Yeah, so, about the manager caching.
J: So, well, I've answered the two sides on the PR, but he suggested that in the validation we poll the OSDMap on demand by its epoch, and maybe I need some info about the rate at which this OSDMap changes; so if it changes a lot, this validation strategy might not be that good.
C: I mean, I'm attracted to the idea of just holding on to the last object that we had indefinitely, and reusing it if the epoch hasn't changed. Because there's only one of every type of thing, the memory that we would consume would be bounded, and we could even discard it: if, on the manager side, we see that the epoch updated, we could throw out the cached copy, since we'd throw it out again later anyway. So it would be a short usage of memory.

C: It doesn't seem like the memory usage would be that significant, and then you don't have to worry about correctness on the side of the manager modules: they don't have to concern themselves with the idea that they might get something that is out of date, that isn't all the way up to date, which, in the case of, say, the balancer module, I could imagine being problematic, and the PG autoscaler would possibly get confused if it had stale information.

C: It seems like the main problem with that is that you can't tell if one of the callers, one of the manager modules, went and modified the object, and then you return the same root Python object again to some other caller and they see the modified data; it sticks around. But I was thinking that we could introduce a debug option: it will basically regenerate the object on every call, but in the debug mode it'll compare it to the cached version.
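The scheme sketched in that last point could look like this: hand back the cached object while the epoch is unchanged, and in a debug mode regenerate anyway and compare, to catch modules that mutated the shared object. All names here are illustrative, not the actual mgr API:

```python
# Epoch-keyed cache with an optional mutation check. `generate` stands in
# for whatever produces the Python object (e.g. PyFormatter dumping a map).

class EpochCache:
    def __init__(self, generate, debug=False):
        self._generate = generate
        self._debug = debug
        self._epoch = None
        self._obj = None

    def get(self, epoch):
        if epoch != self._epoch:
            # Epoch moved on: throw out the cached copy and regenerate.
            self._epoch, self._obj = epoch, self._generate(epoch)
        elif self._debug:
            # Debug mode: regenerate anyway and compare, to detect callers
            # that mutated the shared cached object.
            fresh = self._generate(epoch)
            assert fresh == self._obj, "cached object was mutated by a caller"
        return self._obj
```
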
C: There wouldn't be a way to tell if they did it by accident, like if they actually called a pop, or did an array assignment or something, I don't know.
O: Yeah, the issue here is that what we discovered is that the Python deep copy is not efficient; it's even faster to serialize and deserialize from JSON. Well, it's typical, right...
A: That's... that's the test.
O: We were also exploring using immutables. The other thing is that, yeah, Python is just the opposite of a language supporting immutability, so Pereira was trying that, and, well, there is an immutable class; basically, it's a really simple thing to just override the setters in the...

O: ...basic shared objects. So you need to go through all the objects, basically, and that more or less ends up being the same thing as doing a deep copy, right? So then we explored the copy-on-write thing, where you just copy the thing when there is an actual modification; and, well, it's not a necessity. The alternative is just to basically create an immutable by overriding the __setattr__ methods, and then maybe we discovered...
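The "immutable by overriding the setters" alternative can be sketched as a tiny wrapper that freezes its fields after construction, a cheaper stand-in for deep-copying shared state before handing it to manager modules. This is illustrative only, not the implementation being discussed:

```python
# Read-only wrapper: fields are fixed at construction time, and any later
# attribute assignment raises, so callers cannot mutate the shared object.

class Frozen:
    def __init__(self, **fields):
        # Bypass our own __setattr__ to stash the field dict.
        object.__setattr__(self, "_fields", dict(fields))

    def __getattr__(self, name):
        # Only called for names not found normally (i.e. the frozen fields).
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        raise AttributeError("object is read-only: cannot set %r" % name)
```
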
O: I think historically it's been the OSDMap; I think there have been a lot of issues there, right? And also, when the network stats were added, there were some issues with that one too. There's a long list of trackers; I just collected a few, yeah. But I guess those will probably be the hardest ones: the OSDMap and the PG map, yeah, yeah.
O: Yeah, the time taken for PyFormatter to generate the Python object; there might be some contention in the finisher thread when that happens. Something like that was the last tracker that you faced there; it basically includes a description of the queue length of the finisher on some large cluster.
A: Yeah, there are only a few modules that use it, so I think progress and the insights module are the ones to fix there; or are you fixing those? I think the rest are actually more time-based updates already, or are using maps that are very small, like the health map.
A: Yeah, I guess... I think we fixed that; there was kind of a drive to pinpoint and only extract the actual data that the module needs from these maps, so it doesn't need to serialize, like, all the network ping times and everything from all the OSDs when it only wants the up/down status.
C: Yep, it's really easy to add targeted get calls in the C++ manager code: instead of dumping the whole map, just dump the thing you need. If you can narrow down specific things that you're always pulling out of the whole OSDMap, or subsets, that's pretty quick to implement in C++.
E: Yeah, I think we did something similar earlier, for at least the networking-time stuff in the balancer and the progress module; there was a whole cleanup we did so that, of the stuff that we were dumping, we only dump the stuff that we need. But maybe there are other places as well. Ernesto, what are the recent bottlenecks that you're seeing, any particular modules that are being a problem right now? Earlier, I know, the progress and the network stuff was the problem.
F: And I think the problem is what the surface of the public interface we need to tackle is. If it is very large, we probably should leverage the existing PyFormatter, because that's what we already have, using the formatter infrastructure. But if the surface is very small, we could just leverage the Python object, sorry, a simple object, expose a minimal interface to the Python code, and wrap it with Python.
O
Yeah, so you're suggesting then exposing the binary data directly, kind of.
C
Yeah, I don't know. I think, regardless of whether we go down that road or not, still having this basic capability to hold on to the big JSON dump that we generate and reuse it is going to be useful.
F
How about exposing a JSON dump for just the object that the Python code is interested in? To pinpoint it: if, for example, the Python code is interested in a certain property of a CRUSH map, we could just expose an interface like dump, and that method will call into, well, it will use the PyFormatter to dump that very structure on demand.
C
That's kind of what I was suggesting, I think, at least if I'm understanding it properly: extending get to include whatever narrower field, instead of dumping the entire OSDMap. If you just need to know, say, the up and down state for the OSDs, then have a command that just dumps that.
F
But that's the problem with extending the get method: the Python code cannot dump a given OSDMap, a property at a given epoch. What if we have multiple OSDMaps?
C
It
it
well,
it
kind
of,
can
I
mean
you're,
just
passing
a
string,
and
so
the
bunch
of
those
like
take
an
epic
number
as
an
argument,
like
you
say,
osd
map
space
and
then
an
epic
number.
I
mean
it's
a
little
bit
wonky,
but
it's
not
it's
not
that
hard
to
wire
up
additional
things.
A
Another angle would be to try to do a bit of profiling, to see what kinds of fields and areas we're accessing that are the bottlenecks.
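A minimal way to get such a profile on the Python side is the standard cProfile module; the workload function below is a made-up stand-in for a mgr module handler that consumes a dumped map, not real Ceph code:

```python
import cProfile
import io
import pstats

def handle_maps():
    # Stand-in workload: build and walk a large nested structure, the way
    # a module might consume a dumped OSDMap/PGMap.
    data = {"osds": [{"id": i, "stats": list(range(50))} for i in range(1000)]}
    return sum(len(o["stats"]) for o in data["osds"])

profiler = cProfile.Profile()
profiler.enable()
result = handle_maps()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()  # the top entries show where the time goes
```

Sorting by cumulative time makes it easy to see whether the cost is in the dump itself or in the module's use of it.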
N
In one of the tracker issues that was shared, we actually do have a call graph: that one is spending a hundred percent of its time in a PGMap dump, and fifty percent in PyDict_New. So it's not that the JSON generation is expensive.
N
It's that the Python object creation based on that JSON is expensive.
F
We,
I
think
we
we
don't
really
in
most
cases,
we
don't
really
interested
in
the
whole
python
object,
but
we
choose
to
dump
it
all,
no
matter
what
what
part
we
are
interested.
So
I
think
we
need
to
to
to
look
into
the
python
code
which
exactly
part
we
are
interested
and
we
need
to
look
at
and
to
probably
send
it
over
to
permissions.
C
Seems like it's two things: try out this idea of holding on to the reference and then overriding the gets and puts, well, not the puts, whatever the setters are, so we can identify the offending modules; and then, separately, also do an audit of these callers and see what optimizations make sense.
O
Yeah,
maybe
it's
worthy
yeah
to
explore
this
idea
of
instead
of
using
the
pi
formatter
using
the
regular
json
formatter
and
importing
that
directly
in
the
with
the
json
decoder
in
in
python
site.
We
are
passing
up
white,
it's
counterintuitive,
but
it
seems
like
it's
more
performant
with
passing
the
python
update
itself.
So.
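That counterintuitive claim is easy to sanity-check in pure Python. This sketch, with an invented payload shape, compares decoding one JSON string against building the equivalent objects piecemeal, which roughly mirrors PyFormatter constructing PyObjects one by one:

```python
import json
import timeit

# Stand-in for a large map dump handed over by the C++ side as one string.
N = 5000
blob = json.dumps({"osds": [{"id": i, "up": True, "weight": 1.0}
                            for i in range(N)]})

def decode_string():
    # One string crosses the boundary, then a single C-level parse.
    return json.loads(blob)

def build_piecemeal():
    # Stand-in for element-by-element PyDict/PyList construction.
    return {"osds": [{"id": i, "up": True, "weight": 1.0}
                     for i in range(N)]}

t_decode = timeit.timeit(decode_string, number=20)
t_build = timeit.timeit(build_piecemeal, number=20)
# Compare t_decode and t_build: on CPython the single json.loads pass is
# often competitive with building the same objects one at a time.
```

Both paths produce identical objects, so the comparison isolates the construction cost being discussed.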
C
Yeah
yeah,
the
the
pi
formatter
is
a
little
bit
fraught
and
the
sequence
looks
good
because
you
have
to
be
really
careful
about
the
kill.
But
if
we
could
not
do
that,
then
that
might
also
simplify
the
measure
code.
But
I
would
be
interested
in
seeing
a
confirmation,
but
that's
really
the
case-
that
it's
faster.
O
Yeah
well
in
that
period
from
perry
you
see
that
they
took
some
measures,
so
at
least
the
coping
time
for
solo
copy,
shallow
copy,
sorry
to
call
json
and
copy,
so
the
optimal
one
is
pico,
which
is
a
binary
format.
I'm
not
sure
how
feasible
is
that
from
the
c
plus
plus
side
articles.
I
think
v
code
is
yeah,
it's
internal
to
python,
so
it's
not
standard,
but
maybe
there
is
a
code
that
we
can
use
to
realize.
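A quick way to reproduce that kind of measurement on the Python side; the payload shape is invented for illustration, and whether C++ could emit pickle at all is the separate open question raised above, since the format is CPython-internal:

```python
import json
import pickle
import timeit

payload = {"pgs": [{"pgid": "1.%x" % i, "state": "active+clean"}
                   for i in range(2000)]}

def via_json():
    # Text round-trip: encode to a JSON string, then parse it back.
    return json.loads(json.dumps(payload))

def via_pickle():
    # Binary round-trip: pickle avoids re-parsing text, but the wire
    # format is Python-specific rather than a cross-language standard.
    return pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))

t_json = timeit.timeit(via_json, number=20)
t_pickle = timeit.timeit(via_pickle, number=20)
```

Comparing t_json and t_pickle on a representative payload is the kind of measurement the PR mentioned would have produced.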