From YouTube: Ceph Tech Talk: Ceph at DigitalOcean
All right, hi everybody. I'm Alex Marangone, I'm a senior engineer at DigitalOcean and I work as part of the storage systems team. Our role in the storage systems team is to handle the entire lifecycle, from deployment to decommission, of all the storage back-ends at DO.

As you may already know, DigitalOcean is a large Ceph consumer. We are also part of the Ceph Foundation, and we haven't really done any public talk about our use case, so that's what this is about. Today it's going to be fairly superficial.
We're only going to brush the surface of what we do, what we serve, how we operate it and whatnot. The goal is that the world will hopefully be in a better place in 2022, and so for the next Cephalocon, if it happens, we'll be able to present some talks with deeper dives on our use cases and how we use Ceph. But today we're going to keep it more as an introduction to how we use Ceph at DO.

In terms of agenda, we're going to start with what DigitalOcean is, in case you don't know the company, and then we're going to dive straight into Ceph at DO, presenting some statistics and the products that use Ceph. Then, when I was building the talk, I was really thinking, hey, what should I talk about? I figured I would talk about operations: how we operate Ceph, the kind of processes we have set up, and the automation we have developed around Ceph.
And then there are a few issues that we've noticed, which I call "Ceph gripes". It's not a great term, and it's not a blame game; it's just some stuff we noticed that you may want to put on your radar if you start having to deploy Ceph at some scale. That should take us around 40 to 45 minutes of me talking, which will leave us plenty of time for any questions you may have.
All right, what's DigitalOcean? Very briefly: DigitalOcean is a cloud provider, and the core concept of DO is simplicity. It's simple to provision cloud resources, the UI is simple, it's simple to sign up (surprisingly simple), and all of that. That's really the core concept of DigitalOcean. The first product that was released is what we call the Droplet, which is our name for a virtual machine, and the main attraction of this product at the time was that every single Droplet came with a local SSD attached.

Even your five-dollar Droplet came with that in 2012, which was extremely attractive, and it's really what put DO on the map as a resource provider at the time. Since then we have expanded our product portfolio, starting with block storage in 2016, and many, many more products have been added since, including Spaces, which is our object, S3-compatible storage platform.

If you go to the website you'll see many more products, such as managed databases, managed Kubernetes, one-click app deployment and a lot of other very cool stuff. In terms of global footprint, we have data centers in eight regions, with each region having multiple data centers as well. And in terms of the lifecycle of the company, one very important event happened about six months ago with our IPO earlier this year.
All right, so let's talk about Ceph at DO. In terms of footprint, we use Ceph for two products: block and object. Block is what we call Volumes; for object, the name of the product is Spaces; and they are both backed by Ceph.

We have a third storage product in our portfolio called image backups, or bring-your-own-image, which today is not stored in Ceph, but that is something we are looking into and we hope to start making it part of Ceph next year, early next year. In terms of number of clusters, we have a total of 38 production clusters. Of those 38 production clusters, 37 are on Ceph Nautilus and one is on Ceph Luminous.
This laggard is simply due to the fact that for the block clusters we decided to repave all the OSDs from Filestore to BlueStore before the upgrade, and so we have one laggard there. That should be finished very soon, so at the end of next month we'll probably have all of our clusters on Nautilus. The entire repave process covered over 3,000 OSDs, so it took a bit of time.

In terms of actual usage, we have a mix of HDDs, plus QLC flash for the newer object storage clusters, and pretty much all of them (except for some specific cases) use erasure coding, with 5+2 and 4+2 k/m values.
So why did we pick Ceph? Ceph was selected way before I joined DO (we started with it in 2016 and I joined in 2018), but the reasons we picked it are pretty much the same reasons you'd see in any marketing document about Ceph. The scalability, especially the ability to add nodes horizontally, which is very straightforward to do and quite safe to do with the right set of processes, so we can expand the size of our clusters very easily.

The most important point in all of this, to me, is reliability, and that comes via the self-healing and the strong consistency that Ceph provides. We've had situations where something went wrong in a cluster and we had an outage, and you could see PGs in terrible states: down, incomplete, stuck peering, and what have you.
But every time we solved the underlying issue that caused the outage, Ceph came back on its own with very little to no intervention, and that is a huge, huge strong point of Ceph: we don't have to intervene a lot. There's obviously also the scrubbing and auto-repair, which is fairly cool to have. In terms of performance, we always want more performance, and if you give us more performance we're not going to complain; if you look at the Ceph user survey every year, asking for more performance is one of the top items. But generally speaking, Ceph performance has been very much acceptable.

We don't have many issues related to Ceph performance. It is what it is, and it is quite good as is; we always want more, of course, but it is acceptable. It also does not scale perfectly linearly, and we cannot expect that to happen, but it scales linearly enough that we can really predict the performance of a cluster from its size, which is a fairly nice property and improves our operations quite a lot.
Obviously, it works with off-the-shelf components, so no appliances, and that is a big win. It does mean that when you deploy new hardware you need to do more testing, and when you upgrade your OS you need to do more testing, but we are not tied to any appliance, which is a big win. And, also very importantly to us, it works with multiple products. As I said, we use it for object and block storage, and the ability to have only one backend greatly simplifies our lives: we only have to have one set of automation, and we only have to deep-dive, knowledge-wise, into one product and not several. That means we can handle more clusters with fewer people and less automation, and that is obviously a huge win for us.
All right, so let's talk about Ceph operations and what we do there. I've selected a few items to talk about; we're only going to brush the surface, but we do some very cool stuff, and I'm going to introduce a couple of open source tools we've developed that may be of interest to the community.

One of the core concepts of our team is that we should not scale the number of people in the team with the number of clusters; that would be pretty bad. So our goal is to automate as much as possible, with the final, perfect goal of never having to SSH into a machine. We're not quite there yet, but maybe someday.
In terms of operations and automation, we use AWX, the open source version of Ansible Tower (which I think is now named Red Hat Tower), and we still have some standalone tooling left here and there that we will want to turn into services in the near future, but for now the standalone tooling works pretty well.

All right, so in terms of deployments, nothing really exceptional going on there: we have our own in-house Ansible playbook to deploy Ceph in containers. We started containerizing Ceph as we upgraded from Luminous to Nautilus.
We don't use cephadm, for the simple reason that on Nautilus it's not available. When we go to Octopus, or rather when we go to Pacific, we may evaluate cephadm, but for now we have no plan to switch to it. So what we do in terms of deployments: we build our own Ceph packages, because we cherry-pick a lot of patches instead of upgrading Ceph directly. This lowers the risk of an upgrade going wrong; more on that in a bit. In terms of daemons, we basically use the ceph/daemon base image, which is just the image with the packages installed, and we have a minimal script to mount the OSDs. That's like 15 lines of bash, fairly straightforward.
In terms of operations, it's very nice to have all your OSD IDs in sequential order when you use Ceph, because if you see an alert that says OSDs 0, 1 and 2 have slow requests, you know that OSDs 0, 1 and 2 are on the same host and you know you have something to investigate on that host. But deploying them that way is super slow; it's going to take eight to ten hours, because you need to deploy all the OSDs one by one, sequentially.

So a neat trick we applied is that, because we know how many OSDs we're going to have per host, instead of deploying them one at a time we just pre-create the crush map before the deployment, and then we can deploy all the OSDs at the same time.
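As a quick illustration of that trick, here is a minimal sketch that pre-creates the CRUSH buckets so OSDs deployed in parallel still land under the right host and rack. This is a sketch of the idea, not our playbook; the host and rack names are made up, and the OSD ID pre-assignment part is not shown.

```go
// Sketch: pre-create CRUSH rack/host buckets before any OSD exists, so that
// OSDs can then be deployed on all hosts in parallel and still end up in the
// right place. The topology below is made up.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func cephCmd(args ...string) {
	out, err := exec.Command("ceph", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph %v failed: %v\n%s", args, err, out)
	}
	fmt.Printf("%s", out)
}

func main() {
	racks := map[string]string{"node01": "rack1", "node02": "rack2", "node03": "rack3"}

	for host, rack := range racks {
		// A real tool would check `ceph osd crush tree` first so this stays idempotent.
		cephCmd("osd", "crush", "add-bucket", rack, "rack")
		cephCmd("osd", "crush", "add-bucket", host, "host")
		cephCmd("osd", "crush", "move", rack, "root=default")
		cephCmd("osd", "crush", "move", host, "rack="+rack)
	}
}
```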
So that's one of the things we do. One of the other reasons we have an in-house Ansible playbook to deploy Ceph, instead of using something like ceph-ansible or something else available in the community, is that we have a tight integration with our secret management platform: the playbook pushes the Ceph configuration, some secrets for monitoring, and the keyrings for third-party services that need to access Ceph, or for the hypervisors, to our secret management platform for later integration.
All of this combined lets us reduce the time to deploy a cluster: it used to be a couple of days with a lot of manual tasks, and now it's maybe half a day to deploy a new cluster. In the future, our main goal for automation is a zero-intervention deployment, where the hardware becomes ready from our infrastructure team and Ceph automatically deploys without us doing anything; we just get a Slack message or something telling us, hey, Ceph is deploying there, and we let it go.

In terms of monitoring, we use ceph health and ceph status in general, and that's the global information about our Ceph clusters, like capacity and all of that. We use ceph_exporter, which is an open source Prometheus exporter that we developed.
We don't use the exporter that is part of the Ceph manager for a couple of reasons. The first one is that when we started with Ceph it was simply not available, and we needed something, so we created ceph_exporter; and because we created it and it's working very well for us, we haven't really thought about moving to the manager one. But we also have some, I believe, justified concerns about the manager's scalability.

We noticed last week, as we upgraded one of our clusters to Nautilus, that the manager is struggling to keep up. We are starting investigations this week, so we'll report back to the community when we know exactly what's happening, but we suspect one of the manager modules is struggling to keep up with the data and is basically overloading the manager. The problem with the manager today is that it doesn't scale, because it's a single process, so the only way to scale it would be to put it on a more powerful machine, which is obviously very expensive.
So we're going to look into that this week and report back upstream to see what can be done. In terms of other monitoring, we have node_exporter running; that's the standard open source Prometheus exporter, so not much to say there. We have also built another exporter, called store_exporter, that is used to gather information from every single daemon in the cluster via the admin socket. Cool stuff we discovered through it, for instance, was bluestore_allocated versus bluestore_stored, which showed us the overhead of the BlueStore minimum allocation size on Luminous; that was 64K for HDDs on Luminous.

That default has since been changed in newer releases, but we saw that overhead, and based on it we decided to do some testing with a 16K minimum allocation size on Luminous. We found that it worked very well, so we actually switched to that. The exporter also exposes other information; most notably, something we noticed and haven't investigated yet is that with Ceph Nautilus it seems like the slow requests that get reported to the monitors might be undercounted.
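To give a sense of that kind of per-daemon collection, below is a minimal sketch that pulls those two BlueStore counters from a single OSD admin socket with ceph daemon. It is an illustration of the mechanism, not our store exporter, and the OSD id is arbitrary.

```go
// Sketch: read bluestore_allocated vs bluestore_stored from one OSD's admin
// socket ("ceph daemon osd.0 perf dump") and print the allocation overhead.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	out, err := exec.Command("ceph", "daemon", "osd.0", "perf", "dump").Output()
	if err != nil {
		log.Fatalf("perf dump failed: %v", err)
	}

	var dump struct {
		Bluestore struct {
			Allocated float64 `json:"bluestore_allocated"`
			Stored    float64 `json:"bluestore_stored"`
		} `json:"bluestore"`
	}
	if err := json.Unmarshal(out, &dump); err != nil {
		log.Fatalf("parse failed: %v", err)
	}

	if dump.Bluestore.Stored > 0 {
		// A ratio well above 1.0 means min_alloc_size padding is eating real capacity.
		fmt.Printf("allocated=%.0f stored=%.0f overhead=%.2fx\n",
			dump.Bluestore.Allocated, dump.Bluestore.Stored,
			dump.Bluestore.Allocated/dump.Bluestore.Stored)
	}
}
```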
We also have a canary process that runs regularly on each cluster to check performance and availability. To take block as an example, it regularly runs read and write workloads with different block sizes against a cluster, and that gives us two kinds of data: either it times out, which means we have a big availability problem on the cluster, or it completes and gives us performance numbers we can track.
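As a rough sketch of that canary idea, the snippet below assumes a dedicated pool and uses rados bench as the workload generator; the pool name, block sizes and timeout are made up. The timeout path is the availability signal, and in reality the bench output would be shipped as a performance sample rather than printed.

```go
// Sketch of a canary probe: small rados bench write runs at a few block sizes
// with a hard timeout. A timeout is treated as an availability problem.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func probe(blockSize string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// 10-second write bench against a hypothetical "canary" pool.
	cmd := exec.CommandContext(ctx, "rados", "bench", "-p", "canary",
		"10", "write", "-b", blockSize, "--no-cleanup")
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("block size %s timed out: availability problem", blockSize)
	}
	if err != nil {
		return fmt.Errorf("block size %s: %v", blockSize, err)
	}
	fmt.Printf("block size %s ok:\n%s\n", blockSize, out) // would go to metrics instead
	return nil
}

func main() {
	for _, bs := range []string{"4096", "65536", "4194304"} {
		if err := probe(bs); err != nil {
			fmt.Println("ALERT:", err)
		}
	}
}
```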
In terms of other Ceph operations: augmenting a cluster is easy, but it's still something where you can shoot yourself in the foot if you're not careful. We have two ways to augment Ceph today, which is not optimal; we want to converge on one, but we haven't gotten to that yet. From the block cluster perspective, it's pretty much the same thing you'll see out there on the mailing list or on IRC when people recommend how to augment safely.

We create all the OSDs in a dummy root tree with a crush weight of zero, then we move those OSDs to the right crush location, in the right racks, and we slowly upweight them to avoid any latency issues. If you just add everything at once, you're going to have latency issues on the cluster that will impact customers.
The only difference from what's usually recommended upstream is that we have developed a tool that automatically does the upweighting for you at a specified interval. By default we upweight by 0.2, I believe, until we reach the weight we want to be at. The tool is open source and available; it's called Archimedes, and it will automatically handle the crush upweighting for you, up to the weight you want it to be.
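The core loop behind that kind of tool looks roughly like the sketch below: step the CRUSH weight up and let backfill settle between steps. This is a minimal sketch of the idea, not Archimedes itself; the OSD id, target weight and the settle check are placeholders.

```go
// Sketch of gradual CRUSH upweighting: raise an OSD's crush weight in 0.2
// steps toward a target, waiting between steps so backfill stays bounded.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
	"time"
)

func run(args ...string) string {
	out, err := exec.Command("ceph", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph %v: %v\n%s", args, err, out)
	}
	return string(out)
}

func waitForBackfill() {
	// Crude settle check: poll `ceph health` until backfill is no longer mentioned.
	for strings.Contains(strings.ToLower(run("health")), "backfill") {
		time.Sleep(30 * time.Second)
	}
}

func main() {
	const osd, target, step = "osd.42", 7.27, 0.2 // hypothetical values

	for w := step; ; w += step {
		if w > target {
			w = target
		}
		run("osd", "crush", "reweight", osd, fmt.Sprintf("%.2f", w))
		waitForBackfill()
		if w >= target {
			break
		}
	}
}
```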
From the object point of view, we have a different process that I think is more interesting process-wise. The way we do augments is: we set nobackfill and norebalance to avoid any data movement, then we create all the OSDs in the right location with their final crush weight. You see a lot of peering happening at this stage, and a bunch of PGs are going to switch to backfill states, but nothing actually moves.

Then we cancel all the backfills with upmap overrides. So if you had a PG that was mapped to [0,1,2], and after adding your OSDs it became [0,1,3], we switch it back to [0,1,2] with an upmap override. That means all your PGs go back to active+clean, as if nothing had happened. To do these upmaps we use a tool called pgremapper; more on that in the next slide.
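The primitive underneath is the pg-upmap-items entry in the OSD map. A hand-rolled version of one such override looks roughly like the sketch below; the PG id and OSD ids are made up, and in practice pgremapper is what computes and applies these pairs for every backfilling PG.

```go
// Sketch: cancel one backfill with an upmap override. If PG 7.1a used to sit
// on OSDs [0,1,2] and a newly added OSD 3 changed its up set to [0,1,3],
// mapping 3 back to 2 returns the PG to active+clean with no data movement.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// ceph osd pg-upmap-items <pgid> <from-osd> <to-osd> [<from> <to> ...]
	out, err := exec.Command("ceph", "osd", "pg-upmap-items", "7.1a", "3", "2").CombinedOutput()
	if err != nil {
		log.Fatalf("pg-upmap-items failed: %v\n%s", err, out)
	}
	log.Printf("override installed:\n%s", out)

	// Later, the augment itself is just removing overrides in small batches:
	//   ceph osd rm-pg-upmap-items 7.1a
	// which lets that one PG backfill onto OSD 3.
}
```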
Because all the PGs are active+clean, we unset nobackfill and norebalance and nothing happens. The way we actually perform the augment is one of two ways. Either we run the balancer in upmap mode, and because the Ceph balancer always tries to undo an upmap before it adds a new one, it basically does our augment for us; but the concurrency of the upmap balancer is pretty low, and we don't want to change it because we use it for the entire cluster, so it's something like 10 PGs at a time. Or we undo the upmap overrides ourselves, in controlled batches.

The reason for batching is this: if you have all your backfills going on and you have flapping someplace else, the PGs from the flapping are going to need to recover, but some of them are going to sit in recovery_wait, blocked by the backfill reservations taken by the ongoing backfills. So they're going to be degraded for a bit, and then you're going to have another flap, which triggers more degradation, and so on and so on, until the augment is finished and the recovery finally goes through.

That means your degradation can be very long-lasting, and that's quite uncomfortable. By only undoing a certain number of upmaps at a time, we lower the time spent degraded. Say we only undo 10 upmaps at a time: that's ten backfilling PGs, and once they're done, whatever recovery needs to happen can happen, and then we start again, and again, and again, limiting the time you're degraded.
So both approaches basically either limit the impact on latency or limit the risk of data loss from long-lasting degradation. Now, pgremapper: pgremapper is a tool we open sourced about three or four months ago, so fairly recently, that is heavily inspired by a set of scripts from CERN, mostly written by Dan van der Ster, I believe. I put both URLs on the slide. pgremapper is in Go and the CERN scripts are in Python. So what can pgremapper do?
It allows you to cancel backfills for augments, like I just described, or to prioritize the recovery of recovery_wait PGs. We also use it on the block side: we still have a bunch of Filestore block clusters (well, it's mixed now), and when we recreate an OSD there, there is a lot of backfill going on, so we can recreate the OSD and cancel its backfills so the cluster goes back to active+clean, and then release them, say, ten at a time, so we don't prevent recovery from happening in another part of the system.

It also lets you undo upmaps, which I already talked about, and it lets you balance a crush bucket, which is similar to the functionality you get with the upmap balancer upstream, except it's localized to a specific part of your system: where the upmap balancer works on a per-pool basis, here you can work on a per-crush-bucket basis. It doesn't have the same complex considerations that the upmap balancer has.
If I recall correctly (I might be wrong), I believe it just looks at the PG distribution, so it tries to equalize the number of PGs you have on your OSDs within the bucket. That works in most cases, but it doesn't take crush weights into account and doesn't take into account whether you have a non-power-of-two pool, and all that. It also allows you to drain an OSD, which is a process we used when we repaved our Filestore clusters to BlueStore.

We drain the OSD, which is like marking it out, but much more controlled, because we don't do one or two hundred PGs at a time; we only do ten or so at a time, which limits the latency impact. It also allows you to remap a PG, which is basically a wrapper around the Ceph CLI to create an upmap exception. And finally, it allows you to export and import mappings, which can be useful in your automation.
That was useful to us for the draining and for faster OSD recreation, because if you have upmap exceptions but you destroy the OSD, the upmap exceptions are going to be removed from the OSD map, rightfully so. So you can export them before you destroy it and then import them back, to avoid having to rewrite them.

So yeah, check it out, digitalocean/pgremapper, and if you have any issues, submit them and we'll try to fix them as fast as possible. In terms of OSD lifecycle, we've developed a single tool, basically, to handle the entire OSD lifecycle from end to end.
We have this tool that allows us to diagnose, recondition, deploy, upgrade firmware, remove from the cluster, and locate OSDs, blinking the drive light on the chassis for a given OSD.

Most of the operations that are Ceph-based, like remove, create and recondition, just wrap around ceph-volume. It also wraps around ceph-disk, because we still have one Filestore cluster; we're going to remove that soon, but not yet. In terms of cool features, I think one of the coolest is really the ability to automatically upgrade drive firmware. It used to be, from an operational point of view: when you replace a drive, it's taken from a pool of spares, you put the spare in the machine, you look at its firmware, you then look at the latest firmware version we have validated, and if it doesn't match, you upgrade the firmware. Now what happens is: we replace the drive in the machine, and when we run the command to deploy the OSD, it automatically, before deploying the OSD, checks the firmware against the latest validated version, runs the firmware upgrade if needed, and then actually does the OSD deploy.
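Below is a very rough sketch of that check-before-deploy flow: read the firmware revision with smartctl, compare it to a validated version, and only then hand the device to ceph-volume. The validated-version map, device path and the vendor-specific update step are placeholders, not our actual tooling.

```go
// Sketch: before deploying an OSD on a drive, verify the firmware revision is
// the validated one; the vendor-specific update itself is not shown.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"regexp"
)

// Illustrative validated firmware revisions per drive model.
var validated = map[string]string{"EXAMPLE-MODEL": "FW123"}

func firstMatch(pattern, s string) string {
	if m := regexp.MustCompile(pattern).FindStringSubmatch(s); len(m) == 2 {
		return m[1]
	}
	return ""
}

func main() {
	dev := "/dev/sdb" // hypothetical device

	out, err := exec.Command("smartctl", "-i", dev).CombinedOutput()
	if err != nil {
		log.Fatalf("smartctl: %v\n%s", err, out)
	}
	model := firstMatch(`(?m)^Device Model:\s+(.+)$`, string(out))
	fw := firstMatch(`(?m)^Firmware Version:\s+(.+)$`, string(out))

	if want := validated[model]; want != "" && want != fw {
		log.Fatalf("%s (%s) is on firmware %s, validated is %s: update first", dev, model, fw, want)
	}

	// Firmware is acceptable: create the OSD on the device.
	out, err = exec.Command("ceph-volume", "lvm", "create", "--data", dev).CombinedOutput()
	if err != nil {
		log.Fatalf("ceph-volume: %v\n%s", err, out)
	}
	fmt.Printf("%s", out)
}
```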
From an operational point of view, that's a huge gain of time. Similarly, when an OSD goes down in Ceph, we need to determine whether the drive needs to be replaced or not, and it used to be a very manual and tedious process where you log into the machine, look at the logs in a couple of different places, and then make a determination; that was just a pain. So we developed diagnostic tooling as part of this tool that looks at SMART data, syslog, maybe some NVMe data, I don't recall, but basically it makes a determination on whether a disk needs to be physically replaced or not. If it does, we just swap the drive and redeploy the OSD fairly quickly.

This diagnostic tooling is probably not as mature as we want it to be, but as we see more different types of failures, we keep feeding more and more data into it.
And finally, in terms of automatic remediation: there are issues in Ceph that are recurring but don't really have an impact. A typical example of that is the Ceph manager running out of memory. It doesn't happen often, but because we have 38 production clusters, it used to happen maybe three or four times a day globally, and a simple restart fixes it. So we haven't really invested any time or resources into looking at why that happens and trying to fix it, because the fix is so simple and it's not impactful; we just have other priorities. So we have an automatic remediation service today that monitors every cluster in our fleet and looks for conditions.

In the case of the Ceph manager, it can look at, hey, how much memory is being consumed by the ceph-mgr process, and if it's over X gigabytes (I don't remember the exact number), we trigger an AWX job, which is just an Ansible playbook, to fix it. So in the case of the manager: the manager is over that amount of memory, run the playbook, which just restarts it. That dramatically reduces the number of alerts we get for this, and the interruptions for our cloud operations team.
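Below is a stripped-down sketch of that check-and-trigger pattern, assuming the mgr memory usage is already available from something like node_exporter and that the restart playbook is exposed as an AWX job template. The URL, token, template id, threshold and metric lookup are all placeholders.

```go
// Sketch of one automatic-remediation rule: if ceph-mgr RSS on a host exceeds
// a threshold, launch an AWX job template that restarts the daemon there.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

const (
	memThresholdBytes = 8 << 30 // e.g. 8 GiB, placeholder
	awxURL            = "https://awx.example.internal/api/v2/job_templates/42/launch/"
	awxToken          = "REPLACE_ME"
)

// mgrRSSBytes would normally query Prometheus/node_exporter; hard-coded here.
func mgrRSSBytes(host string) (uint64, error) { return 9 << 30, nil }

func remediate(host string) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"extra_vars": {"target_host": %q}}`, host))
	req, err := http.NewRequest("POST", awxURL, body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+awxToken)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("AWX launch returned %s", resp.Status)
	}
	return nil
}

func main() {
	host := "ceph-mon-01" // hypothetical
	rss, err := mgrRSSBytes(host)
	if err != nil {
		log.Fatal(err)
	}
	if rss > memThresholdBytes {
		log.Printf("ceph-mgr on %s at %d bytes, triggering restart playbook", host, rss)
		if err := remediate(host); err != nil {
			log.Fatal(err)
		}
	}
}
```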
It used to be that we'd page our cloud operations team for this: they'd log into the machine, restart the daemon, and come back, and doing that three or four times a day is a huge pain. So we have this automatic remediation service in place to handle it, and we plan to add more conditions to it as time goes on.
All right, that's it for Ceph operations; again, just brushing the surface, and I really hope we can deep dive more into that in person at the next Cephalocon. We'll have some fun. Now, in terms of the Ceph gripes: again, it's not blaming, it's stuff we noticed that you may want to be aware of in specific scenarios, and it's also stuff that we are looking at, or plan to look at, very soon.
The first one is that Ceph upgrades are hard, and we learned that with Nautilus. The process is simple, but there are many issues. The process has always been the same; I've been using Ceph since slightly before Argonaut, and it's always been the same process, which is very cool: you upgrade your mons, then you upgrade your managers, your OSDs, and whatever clients you may have.

It's always the same, though sometimes there are small extra steps, like you need to add something to your configuration, and all of that. That part is always very well documented. But you always have some small issues popping up that are not very well documented, and the documentation around upgrades is actually somewhat sparse, and that's a difficult problem to solve. You cannot expect engineers to think, every single time, "oh, I need to document that, and that, and that"; that just never works, and we know it.
To give a couple of examples of sparse documentation: a very simple one was that, from Luminous to Nautilus, the JSON output of ceph pg dump changed, which broke any tooling that relied on it, and that was not documented. It's not a big deal, in the sense that we caught it very early, before we even upgraded, and we fixed all of our tooling, but it does point to the documentation being lacking when a change that big is not documented.
The ceph-osd process will now, by default, disable transparent huge pages, which is a change from the operating system default and which actually caused issues for our Filestore HDDs. We fixed that and changed the configuration, but again, it comes back to documentation.

Upgrades can also be slow. It's just a fact of life: when there are on-disk checks or format changes during an upgrade, it's going to be slow, but it's important to know that when you plan your upgrade. Never plan an upgrade of a cluster with HDDs at 2 p.m., because you're going to be there for a while; prefer the morning.
We've seen that very recently with a message on the mailing list stating that if you created your cluster prior to Jewel and have never used CephFS, you need to stop at a specific version of Octopus before going to Pacific. That happens, and that's okay; I'm not complaining about that specific thing. It's just a statement that the history of the cluster matters a whole lot, and that makes upgrades even more complex, because how can you have proper testing if the history of the cluster matters, when staging systems are not meant to be long-lived? They're meant to be recreated a lot. So that's something we're trying to think about.

Something else we learned the very hard way earlier this month, about two weeks ago, is that a successful upgrade absolutely does not make a trend.
We upgraded 37 clusters recently. I may be rounding up or down, it doesn't matter, but about 30 of them went completely uneventfully: nobody who hadn't been told the cluster was being upgraded could tell we were upgrading. That was fantastic. On some block clusters, though, we saw issues: we hit a bug where all of our PGs went into snaptrim, which was scary because it looked like it was trying to replay the entire snaptrim history of the cluster. The first time it happened, it wasn't a huge impact.

The point is that upgrades, and testing this stuff in general, are extremely difficult, and that is an extremely complicated problem to solve. We're not saying "hey community, fix that please" and then walking away and waiting for it to happen.
We are going to look at that very, very closely as we start to look at Pacific within the next weeks or months. I don't know exactly when we're going to start, but it will be very soon, and as we do, we have to look at how we improve testing for Ceph, how we improve reporting our issues upstream, and maybe some collaboration with upstream on the testing part. We're not sure what that's going to look like yet, but we're definitely going to try to improve it for everybody.

The last gripe I want to mention is around RGW defaults.
If you're using RGW at a certain scale, you may want to be wary of a few things I mention here. Dynamic resharding: the naming I find a bit unfortunate, because it's not clear that dynamic resharding is going to block writes on your bucket, and you can see that on the mailing list, with a lot of people saying "hey, I have resharding going on and I can't do I/O". It works well up to a certain point, because in most cases, if you have small buckets, a dynamic reshard is only going to take a few seconds, or up to two minutes, and blocking writes for a minute is fine; it's not going to cause anything.

If you have a very, very large bucket that you're resharding, though, the reshard can take hours; ten hours, easily, if you have a humongously large bucket. What happens during that time is that the write requests going to that bucket are blocked on the RADOS Gateway, which takes up threads, and they are never timed out on the gateway; they're going to stay there indefinitely.
So if you're using dynamic resharding and you plan on having very, very large buckets, be wary of that. Another consequence of resharding in general is that it makes listing very expensive on the backend, way more I/O, and slower on the frontend, so that's also something to be aware of.

Beast on Nautilus has a couple of issues. The first thing you want to do if you use Beast is set the max connections option; I think Canonical made a patch for that, so you can set it as a configuration option. You want to set it to your somaxconn on the system, or even increase somaxconn on your system and then match it, because otherwise you cannot send a lot of concurrent operations to it with the default settings. One other important thing to know applies to Nautilus; I know Octopus and Pacific have improved this, and I don't know the final status in Pacific, so I'm only talking about Nautilus here.
Even though Beast is asynchronous, most of the librados calls underneath are still synchronous. So if you are in a situation where your cluster has some sort of issue and you have a few slow ops on the cluster, you may exhaust the Beast threads very quickly, because it has a very small number of threads by default, I think 500 or 512, and you can exhaust that and actually prevent any further requests from coming through because of it.

We found in our testing that Beast performs extremely well, impressively well, compared to Civetweb under normal cluster conditions, but when we have these types of issues, not so much. We tried increasing the number of threads in Beast and did some testing there, but increasing the number of threads in Beast does not scale as well as increasing the number of threads in Civetweb. For that reason, we are actually running Civetweb on Nautilus in our clusters today.
As we look towards Pacific, we are definitely going to look at this again and see if things have improved. The last thing I wanted to mention is old-shard removal. That shouldn't be an issue in 99% of cases, but at some point in Luminous a patch was added to automatically remove the old shards: if you run dynamic resharding, or if you run the radosgw-admin reshard command, it used to leave the old shards behind and you had to remove them manually; now it does it automatically.

However, if that happens on a very large bucket, you may DoS your cluster, because the delete operation is just a RADOS delete of the object, and if it has billions of keys, it's going to DoS your cluster, basically preventing the OSDs that hold the three replicas of it from responding to the cluster, and they're going to go down for a bit.
So the way we do it is that we developed a small tool that, using go-ceph, just iterates over the keys and deletes a specific amount at a time. I think we delete something like ten thousand entries at a time, every few seconds, and just keep going until it's done, and then we remove the object itself.
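Below is a minimal go-ceph sketch of that batched-delete idea, assuming the stale shard is a single object in the bucket index pool. The pool name, object name, batch size and pacing are placeholders, and a real tool would handle many shards and retry on errors.

```go
// Sketch (go-ceph): trim a huge stale bucket-index shard by deleting its omap
// keys in batches instead of issuing one giant delete, then remove the object.
package main

import (
	"log"
	"time"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	ioctx, err := conn.OpenIOContext("default.rgw.buckets.index") // hypothetical pool
	if err != nil {
		log.Fatal(err)
	}
	defer ioctx.Destroy()

	const oid = ".dir.stale-shard-example" // hypothetical stale shard object
	const batch = 10000

	for {
		// Keys are removed each round, so we always page from the beginning.
		kvs, err := ioctx.GetOmapValues(oid, "", "", batch)
		if err != nil {
			log.Fatal(err)
		}
		if len(kvs) == 0 {
			break
		}
		keys := make([]string, 0, len(kvs))
		for k := range kvs {
			keys = append(keys, k)
		}
		if err := ioctx.RmOmapKeys(oid, keys); err != nil {
			log.Fatal(err)
		}
		time.Sleep(2 * time.Second) // pace the deletes so the OSDs keep up
	}

	// Once the omap is empty, removing the object itself is cheap.
	if err := ioctx.Delete(oid); err != nil {
		log.Fatal(err)
	}
}
```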
All right, that's all I have; as I mentioned, very superficial. I really hope we can talk more at Cephalocon next year, I really do. Before any questions, I just wanted to mention that we're hiring. We've backfilled a couple of jobs already in the storage systems team, and there are other jobs out there at DO, with Ceph and without Ceph, like Kubernetes and whatnot.

So if you're interested, have a look at the DigitalOcean jobs page. I'm trying to maintain the list of jobs in the Ceph jobs pad as well, and I'll try to do my best on that. If you don't see a job that you like, or if you have any questions after this talk, please do feel free to shoot me an email at my digitalocean.com address. And if you have any questions now, I'm happy to answer them.
So I guess that's it. Thank you very much, everybody, for joining, and have a good day.

Yeah, just a second, Alex, can we get the slides?

Sorry, what was that?

Can we get the slides?

Yes, so I'll send them to Mike; I'm sure he has a platform to share them. I'm not sure, but I'll send them to Mike, of course, and we'll see.

Okay, thank you.