From YouTube: Ceph RGW Refactoring Meeting 2023-05-31
Description
Join us every Wednesday for the Ceph RGW Refactoring meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute
What is Ceph: https://ceph.io/en/discover/
B: Yeah, I can start; maybe you can fill in the rest. The idea is to try and tackle the S3 attribute names. Attribute names are the strings like "x-amz-something-something", and the point is that those are relatively long strings.
B: They exist in every intermediate structure where we hold this information inside the RADOS Gateway, and we also put them inside the objects. The idea is that, even though those are relatively long strings, we can easily hash them into plain numbers, because there is a very small number of them; I'd be surprised if a setup had more than 100 different attribute names.
B: There could be a couple of levels to this. The basic one would be to do it just inside the RGW: the client would continue to send us the same messages it sent before, and we would do the translation at the entry point and use those numbers from then on.
B: I mean, after a pretty short period of time we'll probably have seen all the different attribute names.
B: We'll probably see all the different attribute names that exist, and we would save on memory allocations and so on from the RADOS Gateway onward. That could be one level of optimization. A second level would be to actually send a translation table to the clients, so that they could use it if they want.
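The translation described above amounts to a simple interning table: the first time an attribute name is seen, assign it the next free small integer, and reuse that integer from then on. A minimal sketch in Python — the class name `AttrInterner` and the uint16 bound are illustrative, not from the Ceph code:

```python
class AttrInterner:
    """Map attribute-name strings to small integer IDs.

    The first occurrence of a name allocates the next free ID;
    later occurrences reuse it. IDs fit in a uint16 as long as
    fewer than 65536 distinct names are ever seen.
    """

    def __init__(self):
        self._ids = {}      # name -> id
        self._names = []    # id -> name (to recover the original string)

    def intern(self, name: str) -> int:
        if name not in self._ids:
            if len(self._names) >= 0xFFFF:
                raise OverflowError("uint16 ID space exhausted")
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_of(self, attr_id: int) -> str:
        return self._names[attr_id]


interner = AttrInterner()
a = interner.intern("x-amz-acl")
b = interner.intern("x-amz-meta-color")
assert interner.intern("x-amz-acl") == a          # stable on repeat
assert interner.name_of(b) == "x-amz-meta-color"  # original name preserved
```

Doing this once at the RGW entry point is what makes the rest of the pipeline cheap: everything downstream compares two-byte integers instead of allocating and comparing strings.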
B: So if we do this translation at the entry of the RGW, then we can optimize memory allocations inside the RGW and optimize that in the RADOS objects. That could be one level of optimization. What I described as the second level would be to actually send that to the S3 client, and we don't really have to change or break the protocol for that.
B: One client that we do control is in the multisite case. When the client is another RGW, we control the implementation of the client, and we may optimize there too. But as Gabby said, even if we put aside this enhancement to S3, which is always a tricky thing, we can still optimize inside the RGW and, in addition, in the RADOS objects as well.
C: On our side, the reason we wanted to try and optimize the attribute names is that the set of attributes in use is pretty small compared to the number of objects that we have.
C: We could have many millions of objects, and if you check how many attributes you have, it's probably a few dozen; let's exaggerate and say 1,000 separate attributes. So you'd see every attribute repeated thousands of times, and each time it's a dynamic allocation of a short string: when the object node (onode) is brought into memory the string needs to be allocated, and when it goes out of memory, deallocated.
C: It makes encoding and decoding more expensive, because encoding and decoding strings is of course expensive, and it also means the attribute-name data is not contiguous with the onode, because inside the onode we have pointers to separate addresses. If we are able to replace them with short integers — we are thinking about uint16, just two bytes — that would allow 64K different attribute names, and it's really unrealistic to ever get anywhere near that.
C: So, two bytes instead of 16 or maybe more — in some cases an attribute name could be as large, I think, as 128 bytes, but I think 16 bytes is more typical. What we'd be saving is not just going from 16 bytes down to two: it's moving away from dynamic allocation, memory fragmentation, and the cost of encoding and decoding.
C: Every time we process an attribute name we search for it, and that's a string compare instead of just an integer compare, which is of course much faster. Internally in the OSD we really don't care about your attributes; from our perspective an attribute just has to match something, and if everywhere we process it we only see two-byte integers, that's a much faster operation on our side. Of course, we need to preserve the original name in case somebody needs it.
B: Gabby, I think what Casey is saying is that the API to RADOS doesn't allow that.
B: So the main question is — it would require changes to the OSD, to the RGW, and to the API — do you see the value here? Do you think this is something we should actually pursue and implement?
A: It might make more sense, instead of changing all of the keys to uint16, to add a separate API that supports the 16-bit keys and start converting things to use that instead. But I still think there's complexity around the mappings between strings and numbers, and also around supporting things like user-specified keys: S3 has user-defined metadata, and RGW maps those directly into attribute names.
B: Metadata strings, sure — but is this really the case? Have we seen thousands of user-defined S3 attributes? I don't think so.
B: The current number of attributes is probably under a hundred, and if we take the user-defined ones into account it could maybe go to a couple of hundred, maybe a thousand. So 64K is way beyond what we should ever need, but even so.
C: Whenever you see a string for the first time, you look it up in the hash table; if it exists, the hash table gives you its index. If it doesn't exist, you allocate an index from a running, incrementing counter and assign it: the key is the string name and the data is the allocated index.
C: I was thinking of using RocksDB for that.
E: My other thought is: how frequent are the user-defined ones, and is it really worth keeping a dynamic table for their sake? Or does it make sense to just have a predefined, statically defined set of mappings for the common ones that are defined for S3, and leave the other ones with string keys, since they're probably a marginal case?
B: So the user-defined ones — they're called user-defined, but in fact they're used by applications, right? It's not like every end user, millions of users of the system, is going to define their own attributes; nobody's doing that. If you have an application using the system, it can add its own attributes to store some extra information it needs, so in a given system the set in use is effectively fixed.
A: The protocol itself, though, lets you give arbitrary metadata names and values. Different applications might use different hard-coded ones, but a given cluster could have all kinds of different applications running at the same time. So I really think we need the flexibility of arbitrary strings without the cost of a globally consistent table.
C: We could do a few things. First, we could just create a very simple hash table and count, just to see how this behaves: load some testing utility and verify the actual number of different attributes used by clients. Another option, if we want effectively unlimited attributes, is to use four bytes — with four bytes the set is virtually unlimited. I cannot imagine somebody having 4 billion separate attributes unless they're doing something extremely crazy.
C: Sorry, why are you worrying about races? If it's something the OSD is going to determine, you just get the answer from the OSD: on the first access you use the string, and then on the reply message the OSD tells you, "from now on, please use this integer instead." That's easier — you never have to worry about persistency; that's something the OSD needs to manage.
C: Persistency, consistency, races — all of these concepts should be internal to the OSD.
C: Because the RGW definitely has to see the full string, there's no way around that. If you have the full string anyway, then go to the hash table, get the integer, and then you can get rid of the string. Otherwise you've got it anyway, and you need to do a string compare on every operation — and you also require the OSD to do the same.
C: The RGW is going to get trusted translations from the OSD.
C: No — the RGW can maintain a local cache, which doesn't need to be replicated, and it's okay to lose it. If you crash, the next time you restart you'd get the full table from the OSD. So the OSD maintains the persistency; the RGW should have a shadow copy, but it's not the owner of the table, it just holds a copy of it.
B: Right, but we said that globally we're going to have maybe a thousand different attribute names. So when an RGW first comes up, or whenever it has a cache miss in the local in-memory hash — which is inexpensive — it's going to fetch the table from the OSD, and the change rate is really small: if we have a thousand attributes, they're essentially never going to change after the initial ramp-up.
F: My understanding is that it's not going to fetch the actual table from the OSD. It's going to say, "I don't know the string," send the string down, and the OSD is going to do a lookup — which involves a synchronization event — and pass back a number, and this is going to happen once for each attribute.
C: No — that was how you described the API as working; okay, let me re-describe it. When the RGW doesn't have the integer, it's going to do the I/O with the string. If the string exists in the OSD's tables, the OSD is going to find it and act on it — there's no synchronization event here — and on the reply to the RGW it's going to piggyback the value and say, "from now on, please use this value." If the string doesn't exist on the OSD, the OSD will need to synchronize.
C: ...to synchronize this table entry, sorry. So that's a synchronization event, and it happens every time an OSD sees a new value. The question is how many unique values exist in the system: if every RGW had a unique set of attributes, then you'd be right, and it's maybe one thousand.
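The exchange being described — the RGW sends the string until an OSD reply piggybacks an integer to use from then on, and the OSD only synchronizes when a string is brand new — could be sketched with a toy model. This is not the actual OSD or librados API; `ToyOSD`, `ToyRGW`, and the reply shape are all invented for illustration:

```python
class ToyOSD:
    """Owns the persistent name->id table; hands out IDs on replies."""
    def __init__(self):
        self.table = {}

    def op(self, attr_key):
        if isinstance(attr_key, int):               # fast path: already an ID
            return {"status": "ok"}
        if attr_key not in self.table:              # new string: the only
            self.table[attr_key] = len(self.table)  # synchronization event
        # Piggyback the ID on the reply: "from now on, use this."
        return {"status": "ok", "use_id": self.table[attr_key]}


class ToyRGW:
    """Keeps only a lossy shadow cache; the OSD owns the table."""
    def __init__(self, osd):
        self.osd = osd
        self.cache = {}

    def set_attr(self, name):
        key = self.cache.get(name, name)   # send the ID if cached, else the string
        reply = self.osd.op(key)
        if "use_id" in reply:
            self.cache[name] = reply["use_id"]
        return reply


osd = ToyOSD()
rgw = ToyRGW(osd)
rgw.set_attr("x-amz-acl")           # first op carries the string
assert rgw.cache["x-amz-acl"] == 0  # subsequent ops carry the integer
```

Note how losing the RGW cache is harmless here: the next operation just falls back to the string and relearns the ID from the reply, which matches the "shadow copy, not owner" point above.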
G: If we did want to do something like this within the RGW, I think we could do an interesting split, potentially, where the OSD does its own version of compression behind the scenes without telling us. Similarly, the OSD could have its own in-memory table of interned symbols, age things out when they get older, and just not bother beyond that.
G: If we want to avoid storing and doing a bunch of string comparisons in the RGW per se, we could simply have an in-memory table of strings that we populate while we're running, just from what we're seeing, and cut out the synchronization between the RGW and the OSD.
G: That way we'd still ingest strings as we see them, and with a steady workload the table should get populated relatively soon after startup, so we might still get the benefit of replacing string comparisons with integer comparisons. And if we did something like that, since it's not a global shared resource, we could simply use an LRU-type system: if we start running out of numbers, we evict the least recently used 1,000 and go on that way.
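The RGW-local variant just described — intern purely in memory, with no RADOS synchronization, and evict a batch of least-recently-used entries when the ID space runs low — could look roughly like this. The capacity, batch size, and class name are illustrative assumptions:

```python
from collections import OrderedDict

class LocalLruInterner:
    """Per-RGW in-memory interning with batched LRU eviction.

    No synchronization with the OSD: the table is rebuilt from the
    workload after a restart. A real version would also recycle the
    IDs freed by eviction to stay inside a uint16; this sketch just
    keeps counting up.
    """

    def __init__(self, capacity=4096, evict_batch=1000):
        self.capacity = capacity
        self.evict_batch = evict_batch
        self._ids = OrderedDict()   # name -> id, kept in LRU order
        self._next = 0

    def intern(self, name: str) -> int:
        if name in self._ids:
            self._ids.move_to_end(name)        # mark as recently used
            return self._ids[name]
        if len(self._ids) >= self.capacity:    # pushing our numbers:
            for _ in range(self.evict_batch):  # drop the LRU batch
                self._ids.popitem(last=False)
        self._ids[name] = self._next
        self._next += 1
        return self._ids[name]


t = LocalLruInterner(capacity=2, evict_batch=1)
t.intern("a"); t.intern("b")
t.intern("a")                # refresh "a"; "b" is now least recently used
t.intern("c")                # evicts "b"
assert "b" not in t._ids and "a" in t._ids
```

Because the table is purely local, two RGWs may assign different IDs to the same name; that is fine here precisely because, as stated below, the string is still what gets written to RADOS.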
G: I would still write the string to RADOS. The idea was simply — since you had mentioned the time spent in comparisons — I was just trying to think whether there was a way to still get that advantage without having to pay the RADOS synchronization costs.
C: The motivation here was that if you guys could do this work, or part of it — and from my perspective the work means the translation, not the persistency of the table — then you could save all the effort needed on the OSD side, because the OSD wouldn't have to deal with strings at all, since you're translating them anyway.
B: I think what Casey said before, regarding the limitations on return values from write operations, is a true problem here. So you understand, Gabby: what we described as the API between the client and the OSD — the reply carrying a couple of values — is a problem if the number of values is too high; there are some strict limitations on write operations to the OSD.
H: Do the folks from Bloomberg on the call know how many user-defined attributes their cluster or clusters have, or have any idea whether the number is low — like in the four digits — or much higher?
I: So the user-defined attributes — those are the x-amz-meta ones? Yeah.
A: This would come up with CephFS also, right? Because as I understand it, they're mapping the POSIX xattrs in the same way, and those can have arbitrary names and values.
B: Well, I guess they could take a similar approach for the non-user-defined ones. If they have attributes that always come with every file, then maybe they can just hard-code them and pass them as numbers to the OSD; the OSD obviously doesn't really have to care about that.
B: For the non-user-defined ones, we're certainly doing lookups, which carry the cost of hashing those strings, and of course we have the memory-allocation calls — and for the non-user-defined ones it's the same string on each and every request.
C: If we decide to go ahead with this project, there will probably be some step in which a background process starts doing the conversion, but that's still in the future. I would start with brand-new installs on a new release — sorry, not just a new release, new installs — and then you can add a scrub process that goes over existing objects and fixes them. It can also be done lazily, every time an object is brought into memory.
B: Oh, there's no tracker; this was just brainstorming the idea. But for sure we can summarize it, maybe send an email to the community, and open a tracker. The information could be aggregated in that tracker, yes.
A: Has this been discussed with the RADOS team at all? I would want to get their feedback on the OSD side. Maybe a Ceph Developer Monthly or a similar meeting would be a good way to raise it.
A: ...to the PR that he opened about it, but I'm still not clear on the motivations there.
J: Kind of following up: we had a discussion, probably two or three weeks ago, about notification retries, where you all were talking about whether we should come up with a way to retry them. During our current notification testing, what we observed was that the Gateway was sending very aggressively. We had persistent notifications configured and our Kafka broker was up, but somehow the Gateway was not able to send to it.
J: It was aggressively sending all the notifications and crashing, and every time we brought the Gateway up it kept on sending. So I was hoping for an opinion here: isn't there a way where the notifications, or the notification-acknowledgment processing, is retried for some time and then gives up, instead of retrying forever? We did discuss this last time, but I'm not sure we reached a conclusion on it.
B: Yeah, but bear in mind this was because of a bug that we fixed, and we shouldn't design software around bugs; we should design it around system behaviors, network behaviors, environment behaviors. So, putting aside the specific bug that was causing the crash: do you want to not see retries of notifications, say, after the Kafka broker is brought up again or the network recovers, or something like that?
B: The current behavior is that eventually you will get pushback: there is a queue of retries, and when this queue fills up, it pushes back. If you define that notifications need to be sent for a specific bucket, then eventually, when the queue fills up — it may take a couple of hours, I don't know — there will be pushback on the client, something like "busy", and the clients will see that they need to slow down.
B: Right. So let's take a scenario where there's a network disconnect or the Kafka broker is down. The current behavior — and I'm not saying this is the best option or the only one we should have — is that they'll be retried indefinitely, because I don't know when to give up, and eventually, when the queue fills up, I push back on the S3 client. It might take some time, but eventually I'll push back on the S3 clients. Now, one design option could be to add, say, a time limit...
B: ...on the notifications, and then eventually I would give up and throw them away; then maybe the queue won't fill up, because I'll be deleting stuff from it. This is one option, but I'm open to other options.
J: Just a couple of questions first: what is the size of the queue — how does it decide when it's full? Is it configurable?
B: Currently it's a single RADOS object, so it's 4 megs. Notifications are usually a couple of hundred bytes each; it depends.
B: Yeah, it depends on the length of the bucket names and so on, but it's usually not more than one kilobyte; it's really not that big.
B: No, no — it's before the put operation. The queue works with a kind of reservation system: whenever I see a put operation that needs a notification, I make a reservation on the queue, and if I cannot make the reservation, then I immediately reply.
B: And if I can make the reservation, then I know that after I do the actual put, when I try to send the notification — which means putting something in the queue — I have a reservation, so I'll have space in the queue. That's how the system works.
B: Sorry, one second. Another option was to have, instead of this message queue, which has a limited size, something like an infinite size — or to use the message FIFO, which concatenates multiple objects, so you can have a much, much larger size; I'm not sure.
J
So
again,
one
way
to
tackle
this
is
multiple
things.
Iphone
is
one
option
to
increase
there,
but
another
option
is
to
delete
after
a
few
replays
yeah.
So
again,
that's
that's
the
reason.
I
kind
of
open
this
I
don't
know
what
appropriate
solution
would
be
to
this,
but
it
would
be
nice
to
brainstorm
on
this.
B: Yeah, I can start an email thread on the users mailing list. I'll just list a couple of options, and it would be great if you could reply — and then maybe other people, other users of bucket notifications, can chime in and say what the best option is for them. We can implement all kinds of things, but I would like to implement something that is most useful for most users.
J: Of course, we could always make it configurable as well; that way you can cater to multiple people. We can discuss this more. So that was one thing, and the follow-up to it was the cleanup question: as you said, if the Kafka broker is down — and if you consider that it's a 4-meg object — that space just seems to sit there. So if the tenant is deleted, should we go and delete the queue?
B: Currently, when you delete the topic from the RADOS Gateway CLI — radosgw-admin — the actual queue is not deleted. But after we fixed the bug, if you're deleting via the REST API, from S3, then the queue should be deleted. This is tested; maybe there's a bug there so it doesn't work, but it is tested.
B: Once you delete the queue, I don't see any more retries or anything in this case. Your question regarding the tenant, I think, is a wider one: do we have cascade deletions of anything when you delete the tenant? I'm not sure.
A: A tenant is not actually a thing, so you can't delete it. You could delete all the users under a tenant, but nothing is keeping track of whether a tenant name is in use or not; they're just arbitrary strings.
J: Isn't that part of the user as well? You remove a user with purge-data, and that will remove all the data, all the buckets associated with that user, correct? So in that case, the purge should also go ahead and delete the queue as well, right?
J: The confusion here, I believe, is that from what I understand there's a one-to-one mapping between a tenant and an S3 user — there could be some users — but then you create a queue, a notification topic, specific to a tenant, and it's associated with a bucket. But if that tenant itself does not exist, that queue doesn't make any sense.
A: The tenant is a thing that originally came from Swift, and it's essentially just a namespace that's isolated from other tenants: you create users under one tenant and they can't view anything outside of it. So you can create several users and several buckets under a tenant.
B: So there's a distinction here. The notifications are tied to a bucket, so they could be said to be tied to a user, and if you delete a bucket, all of its notification configuration is automatically deleted. The topics, though, are not tied to a bucket or to a user at all; they're namespaced by tenant, but they're not tied to a user or a bucket.
B: Yeah, it's just a namespacing thing for the topics; they're not tied to a specific user or a specific bucket. You can have many buckets sending notifications to the same Kafka broker.
B: True, but that's kind of a different thing: you have to have user credentials, because this is how S3 works, but there's no association of the topic with the user.
B: Any user can use this topic. We probably need to add some kind of capability mechanism to make sure that only administrators can create, delete, or modify topics, because this is really a system-wide definition, not a user- or bucket-specific one. It's only because it's done over an S3 interface — or, to be honest, over a kind of SNS-like interface — that you have to have some kind of credentials; but other than that, there's no tying of the created topic to the user.
B: We probably need to add a capability-verification mechanism to make sure the user is allowed to create topics, but other than that it's like an administrator: when an administrator does something, the result isn't tied to the particular user who happens to be the administrator.
A: Thanks a lot. And Shilpa, are you on the call? — Yep, hi. — I was just wondering if you had a quick update on multisite testing for Reef. Do we have results from Mark on that?
I
Do
a
400
million
sync
workload
successfully
on
relief,
but
I
I,
don't
know
if
that
I
think
we
would
want
to
be
able
to
integrate
it
in
some
way
Upstream
but
yeah.
We
do
have
results
for
leave.
I: This is just for the Reef release; it's not specific to anything.