A
So hello, everyone, and welcome to another Ceph Tech Talk — we're here for the month of October. We pushed this one over just a little bit to make time for the speaker, and so right now we're going to be hearing about scale testing with Red Hat's Ceph, with — let's see, how many is that — 10 billion plus objects?
B
All right, so thanks, Mike. Yes, it is insane, and it is actually 10 billion plus objects. So yeah, hey guys, hello, everyone. My name is Karan Singh and I'm a Senior Solution Architect in Red Hat's Cloud Storage and Data Services business unit, and I do lots of stuff in my daily activities.
B
So before I go to this one: this is not yet another round of performance testing that we have done with Ceph. I mean, the team to which I belong, we do lots of performance testing on Ceph and OpenShift Container Storage. So this is not yet another one, in that in this testing we specifically wanted to test Ceph object storage with not one, not two, not even five, but ten billion objects in the Ceph system. So this was the actual vision.
B
We
started:
okay,
let's,
let's
stop
this
testing
once
we
hit
the
10
billion
mark
because
we
wanted
to
go
to
10
billion.
That
was
the
intention
of
this
car
or
this
this
project.
But
that's
that's
why
this
is
not
yet
another
performance
testing
for
us.
It
was
predetermined
that
we're
gonna
build
the
cluster
until
10
billion,
so
this
is
a
very
rare
view
of
the
cluster.
So
all
the
folks
who
are
gonna
view
this
view
this
youtube
or
or
join,
live
in
this
session.
It's
a
stuff
status
output.
B
We
have
10
billion
objects
into
the
surf
system
and
you
can
expect
that
the
pools
the
pools
are
are
near
full.
My
osd's
are
full
blue
store,
is
spilling
lots
of
data
onto
spinners,
and
you
know
a
lot
scrubbing
and
deep
scrubbing
everything
right.
So
this
is
pretty
standard
command,
but
the
output
is
not
standard.
So
that's
the
reason
I
put
it
here.
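For anyone who wants to pull the same view from their own cluster, these are the standard commands behind that screenshot; a minimal sketch — nothing here is specific to this lab:

```bash
# Cluster-wide health, capacity and scrub/recovery activity
ceph status        # same as: ceph -s

# Per-pool and raw capacity usage, to spot near-full pools
ceph df

# Per-OSD fill levels, to spot full or near-full OSDs
ceph osd df
```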
B
So
it's
the
time
when
we
hit
10
billion,
so
it
captured
this
the
screenshot
and
if
you
go
to
the
cef
metrics
dashboard,
it
also
explains
the
same
story
like
my
bucket
redox
gateway
data
pool
is
having
you
know:
10
billion
objects
into
the,
so
we
kept
on
like
fifth
of
the
26th
may
we
started
this
testing
and
then
there
is
a
bump
in
here.
I'm
gonna
explain
this
this
bump
in
here
and
then
you
know
we
kept
on
adding
more
objects
until
we
reach
10
billion.
B
So
this
is
again
a
rare
view
from
one
of
our
cluster
that
we
tested
in
this
lab.
So
one
would
ask
like
hey:
why
did
he
chose
10
billion?
Why
not
7.5
or
something
right?
So
this
year
in
february
2020
we
have
tested
redexa
storage
to
one
billion
objects
and
the
the
results
are
already
published
on
redhat
dot
com,
slash
blog,
slash,
storage,
you
can
you,
can
just
google
that
that
link
right
now
and
so
yeah
we
last
the
last
testing
this
year
was
on
one
billion.
So
we're
like
hey.
B
What
should
we
do
next?
Should
we
do?
Should
we
do
something
more
interesting
with
sef?
So
that's
why
I
decided.
Okay,
let's
go
with
10
this
time
and
one
other
thing
which
came
to
us
like
okay,
you
know
other
object,
storage,
solutions
and
systems.
B
They
aspire
to
scale,
to
billions
of
object,
that
to
one
day
all
right
so
but
seth
can
do
it.
Today.
Seth
is
a
10
year
old,
10
year
old
matured
technology.
It
can
definitely
do
do
10
billion,
but
you
know
somebody
has
to
test
it.
So
that's
that's
why
we
started
with
this
okay,
let's
test
10
billion
and
put
the
numbers
out
in
the
community
and
and
for
the
customers.
B
Another
motivation
was
that
we
could
see
a
lot
of
attractions
on
from
customers
around
data
lake
use
cases
for
object,
storage
and
when
I
say
data
lake,
it's
you
know.
Big
data
workloads,
they're
gonna,
put
they're
willing
to
put
big
data
workloads
on
object,
store,
which
means
they
end
up.
Writing
lots
and
lots
and
lots
of
object.
So
somebody
have
have
to
test
that
and
see.
Does
it
really
perform
good
or
are
there
any
implication
with
when
they're
gonna
write?
B
You
know
lots
of
objects
into
the
subsystem,
so
that
was
also
one
of
our
motivation,
and
this
is
this
is
the
the
motto
of
our
team
in
which
I
work.
So
we
try
to
educate
and
motivate
the
communities
we
work
into
the
customers
we
work
day
in
day
out
and
our
partners,
so
we
want
to
educate
the
field
and
the
community
and
customer
with
the
rich
data
set
and
backed
by
empirical
evidence.
So
that
was
another
one
of
the
reason
we
chose
to
go
with
with
an
even
number
like
10
billion.
B
When
you
say
executive
summary
to
me
so
yeah
it.
It
might
look,
you
know
too
flashy
for
you,
but
actually
it
I'm
not.
You
know,
I'm
gonna
explain
each
and
every
term
but
which
I'm
gonna
call
in
this
executive
summary,
so
red
hat,
storage
or
saf
in
general
has
delivered
a
deterministic
performance
for
both
small
objects
and
large
object
workloads.
So
there
is
no.
You
know
there
is
no
marketing
term
here.
B
We
just
put
it
here
in
front
of
you
that
yeah
we
got
some
numbers
which
are
cool
for
both
kind
of
object,
uses
storage,
use
cases
because
in
production
you
will
have
a
mixed
variety
of
workload.
It
could
be
small,
it
could
be
large
depending
on
the
use
case
you're
using,
but
for
across
the
board.
We
saw
a
deterministic
performance
from
storage
system,
understanding
that
the
scale,
so
what
is
scale
for
us.
B
So
again,
we
have
ingested
10
billion
plus
objects
into
the
system,
and
we
have
we
have
tried
to
retrieve,
if
not
all,
but
most
of
them
are
during
the
read
read
test
so
and
all
of
these
10
billion
objects
were
across
spread
across
a
hundred
thousand
plus
buckets
into
into
the
surf
system,
and
each
of
the
buckets
were,
were
you
know,
configured
to
store
a
hundred
thousand
objects.
B
All
of
this,
the
data
that
we
have
crunched
into
this.
They
it
was
spread
across
318
spinning
devices
backed
by
36
nvme
devices
for
blue
store
metadata
overall,
getting
close
to
five
petabyte
of
raw
capacity
into
the
system,
and
it
took
us.
You
know
several
days
because
you
know
it
will
take
time
to
write
this
many
objects,
but
over
over
close
to
500
unique
test,
runs
to
achieve
this
10
billion
mark
into
the
system.
B
So
that
was
the
scale
for
us
understanding
the
hardware
and
the
software
inventory
in
in
the
lab.
So
I'm
thankful
to
intel
and
seagate
for
providing
us
the
necessary
equipments
for
for
this
testing.
So,
overall
we
had
six
red,
xf
storage
nodes.
B
Each
of
the
nodes
were
equipped
with
53
16
terabyte
spinning
devices
using
seagate
jbods
e4e106,
the
for
the
jbar
that
we've
used
for
this
six
intel,
qlc
39
devices
with
three
7.6
terabyte.
They
were
used
for
blue
store
and
then
intel
xeon,
gold,
processors
and
some
memory,
or
maybe
lots
of
memory,
and
then
some
standard
networking
in
place.
So
this
was
my
setup
with
respect
to
the
clients.
We
had
six
clients,
so
you
can
see
like
one
to
one
mapping
with
clients
and
osg
nodes
from
the
software
part
of
things.
B
We
have
used:
rel,
8.1,
red
hat,
sap,
storage,
4.1
and
all
the
daemons
were
containerized,
the
osd
monitor,
managers
and
android
gateway,
they're,
all
containerized.
We
did
something
special,
I
mean.
We
know
that
from
from
our
last
one
billion
testing
that
if
we
deploy
multiple
redox
gateways
per
osg
node,
it
delivers,
it
tends
to
deliver
better
performance.
So
this
time
we
we
went
with
two
redox
gateway
instances
or
the
containers
on
each
node,
which
means
total
12,
redox
gateway,
end
points
we
had
in
the
system.
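Red Hat Ceph Storage 4 deployments of this era were typically driven by ceph-ansible, which has a knob for running more than one RGW container per host. The talk doesn't show the lab's playbooks, so treat this fragment as an assumption about how such a layout could be expressed, not as the actual lab configuration:

```bash
# Hypothetical ceph-ansible group_vars fragment (illustrative only):
cat >> group_vars/all.yml <<'EOF'
# Two containerized RADOS Gateway instances per RGW host,
# i.e. 12 endpoints across the 6 nodes described in the talk
radosgw_num_instances: 2
EOF
```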
B
All
the
testing
was
based
on
ec
4,
4
plus
2
coding,
and
everything
is
here
you
see
here
is
based
on
s3,
s3,
access
modes
and
again
hundred
thousand
objects.
In
each
bucket
for
workload
generation,
we
chose
cosbench
with
six
drivers
and
12
workers
and
64
for
threads.
So
we
I
didn't.
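For context, this is roughly what setting up an EC 4+2 data pool for the RADOS Gateway looks like; a minimal sketch, with an illustrative profile name and PG count rather than the lab's exact commands:

```bash
# Define a 4+2 erasure coding profile: 4 data chunks + 2 coding chunks
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host

# Create the RGW bucket data pool on that profile (PG count illustrative)
ceph osd pool create default.rgw.buckets.data 4096 4096 erasure ec42
ceph osd pool application enable default.rgw.buckets.data rgw
```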
B
We
came
up
with
this
number
after
some
rounds
of
initial
tests
so
that
we
can,
we
can
know
what
would
be
the
right
workload
that
we're
going
to
apply
to
the
system,
because
once
we
are
once
we
are
set,
we
don't
want
to
change
any
other
thing
into
the
setup.
So
we
will
keep
everything
constant
and
just
measure
the
performance.
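For anyone who hasn't used COSBench: workloads are described in XML files submitted to a controller. Below is a minimal sketch of a small-object fill stage in that format; the endpoint, credentials, and bucket/object counts are illustrative placeholders, not the lab's actual workload files:

```bash
cat > fill-small.xml <<'EOF'
<workload name="fill-64k" description="small-object ingest (illustrative)">
  <storage type="s3"
           config="accesskey=KEY;secretkey=SECRET;endpoint=http://rgw.example.com:8080" />
  <workflow>
    <workstage name="prepare">
      <!-- 64 KB objects, 100k objects per bucket, as in the talk -->
      <work type="prepare" workers="64"
            config="containers=r(1,100);objects=r(1,100000);sizes=c(64)KB" />
    </workstage>
  </workflow>
</workload>
EOF

# Submit the job to the COSBench controller
sh cli.sh submit fill-small.xml
```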
B
So
this
is
how
the
software
inventory
and
hardware
inventory
look
like
this
is
the
the
lab
architecture
so
again,
six
six
f
nodes,
and
then
we
had
a
melon
x
switch
here,
25
2
into
25,
which
is
50
gig
pipe
for
for
the
for
the
cluster
and
the
storage
network
and
standard
10
gig
management
code
for
for
basic
internet
and
management.
So
nothing
complex
here,
pretty
standard
lab
setup.
B
With
respect
to
the
workload
selection,
we
chose
to
go
with
a
64
kilobyte,
which
represents
the
small
objects
and
128
megabyte,
which
represents
the
large
object.
So
again,
coming
back
to
my
executive
summary,
we
have
tested
for
both
small
and
large
object.
Workloads
because
you
never
know
your
customer
or
or
a
user
would
be
running.
You
know
variety
of
different
workloads.
It
could
be
you
know
small
text,
files
or
log
files
or
images
or
videos-
or
it
could
be.
B
From
access
pattern
point
of
view,
we
had
a
pretty
mix,
100
get
100
port
and
then
a
combination
of
get
post
put
list
and
delete
operation
with
respect
to
s3
in
capacity
of
70,
25
and
5.,
so
pretty
mixed
workload.
Here,
we've
also
done
a
degraded
testing
like
try
to
fail
manually
intentionally,
fail
one
device,
one
spinning
device
and
then
success
spinning
device,
and
then
you
know
one
entire
node
failure,
because
we
also
wanted
to
test
at
this
scale.
B
What
will
happen
if
I'm
failing
one
node
or
six
spindles,
or
maybe
an
entire
node
from
a
cluster
which
is
like
you
know,
eighty
percent
filled
up
or
seventy
percent
filter.
So
we
wanted
to
know
how
does
the
performance
change
compared
to
steady
state?
So
that's
also.
We
have
incorporated
in
a
test
plan,
so
this
is
the
first
graph
for
you
guys.
B
This
is
a
small
object
performance
and
the
key
metrics
that
you
should
be
looking
here
is
operations
per
second,
so
I'm
going
to
help
you
understanding
this
graph.
Now
let
me
move
this
thing
here.
So
you'll
see
from
the
blue.
The
blue
bars
here
represents
the
objects
ingested.
The
redox
objects
into
the
subsystem
or
the
stuff
pool,
we
start
from
zero
objects.
Until
we
we
reach
to
the
top
of
the
peak,
which
is
10
billion
or
10
000
millions.
B
Due
to
the
course
we
measured,
both
s3
get
and
s3
put
performance.
I
also
have
numbers,
for
you
know
the
mixed
workload,
but
it
will
make
my
graph
look
too
confusing.
So
that's
why
I
just
have
not
plotted
here.
So
the
red
line
here
represents
the
s3
put
performance,
and
the
golden
here
represents
the
kit
performance.
So
let's
go
over
this
one
by
one.
B
Just bear with me. Overall you'll see, right from zero objects until the very end — until the 10 billionth object — the S3 write performance, the PUT performance, was a pretty straight line, which is, you know, close to deterministic performance. With respect to GET —
B
There
are
some
interesting
things
happening
along
the
way
we
we
wrote
the
objects
into
the
system,
so
we're
going
to
go
into
the
details
of
each
and
every
you
know,
ups
and
downs
in
this
course,
but
yeah
overall
overall,
we
if
I
average
out
this
number
right
from
zero
objects
until
10
billion
10
billion,
I'm
getting
somewhere
around
close
to
17
800
at
s3
put
operations
and
28
800
s3
get
operations,
so
it
is
operations
per
second.
So
it
is.
B
It
is
pretty
decent
number
from
from
object
storage
at
this
scale
right,
if
you
ever
reach
out
and
if
we
normalize
these
numbers
for
spinning
media
or
per
device
right
which
will
help
you
in
some.
You
know
calculations,
I'm
going
to
explain
that
also
later.
So,
if
I
just
divide
this
number
by,
you
know
the
number
of
spindles
I
have
like
318,
I
I'll
come
to
this
number
like
okay,
60
s3,
put
operations
per
second
from
a
single
pinning
device
and
90s3
get
operations
from
a
single
spinning
device.
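The arithmetic behind those per-device figures is just the cluster-wide average divided by the spindle count; a quick sketch:

```bash
# Cluster-wide averages from the talk, normalized over 318 HDDs
awk 'BEGIN {
  printf "PUT: %.0f ops/s per HDD\n", 17800 / 318;   # ~56, i.e. roughly 60
  printf "GET: %.0f ops/s per HDD\n", 28800 / 318;   # ~91, i.e. roughly 90
}'
```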
B
So
again,
this
is
not
the
the
actual
performance
of
your
of
your
spinning,
spinning
media,
which
you
know
the
vendors
will.
Will
call
out
like
hey,
I
can
do
like
you
know
whatever,
like
100
iops
per
second
or
whatever
right,
but
this
is
the
s3
workflow
performance,
so
this
is
completely
different.
What
you
get
on
the
internet,
so
this
is
the
s3
workload
performance
all
right,
so
this
is
the
graph
number
one.
Now,
let's
go
into
the
details
of
these
these
clips
in
this
graph
right.
Let's
try
to
understand
technically
what
happened.
B
So
this
is
the
section
of
the
performance
graph.
So
the
first
thing
which
happened
in
our
cluster,
which
we
observed,
was
at
around
close
to
one
billion
objects.
When
my
system
was
at
one
billion
objects,
I
suddenly
saw
you
know
performance
going
down.
When
I
looked
into
the
system,
I
could
see
a
lot
of
deep
scrubbing
going
on,
which
is
a
standard
safe
operation
staff
will
try
to
protect
your
data
by
using
deep,
scrubbing
and
scrubbing
time
to
time,
but
we
saw
a
lot
of
you
know
deep
scuffing
going
on
into
the
system.
B
So
at
this
point
we
have
chose
not
to
disable
entirely
the
deep
scrubbing
part,
because
deep
scrubbing
is
something
good
right.
So
you
don't
want
to
disable
that
point,
but
what
I
did
is
I
reduced
the
rate
of
deep
scrubbing,
which
is
which
is
not
cheating
right,
I'm
just
you
will
you
will
do
this
in
production
as
well?
You
will
reduce
the
rate
of
your
deep
scrubbing.
So
that's
what
I
did
at
this
point.
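The talk doesn't list the exact knobs used, but Ceph exposes several well-known options for throttling scrubs rather than disabling them; a hedged sketch of the usual candidates:

```bash
# Limit concurrent scrubs per OSD (1 is already the default on most releases)
ceph config set osd osd_max_scrubs 1

# Sleep between scrub chunks to reduce contention with client I/O
ceph config set osd osd_scrub_sleep 0.1

# Don't start new scrubs when the host load is already high
ceph config set osd osd_scrub_load_threshold 0.5

# Optionally confine scrubbing to off-peak hours
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
```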
B
So
the
performance
came
back
up
after
like
one
or
two
tests
on
so
this
explains
the
the
drop
in
here,
which
is
which
is
attributed
to
deep,
scrubbing
effect
into
the
system,
and
then
you
know
again
my
performance
restored
and
kind
of.
I
I
went
ahead
and
the
next
thing
which
we
observed
was
at
around
5
billion
objects
like
50
of
my
cluster
thing
and
what
we
observed
here
is.
You
know
there
was
a
power
outage
in
in
the
lab
where
we
hosted
all
the
equipment.
B
So
it
was,
it
was
a
long
power
outage.
It
was
like
up
to
48
hours
and
then
after
48
hours
we.
So
what
happened
is
what
the
power
outage
right.
So
all
the
ups
powers
were
were
gone,
so
all
the
six
ceph
nodes,
all
the
client
nodes
they
they
abruptly
got
powered
off
like
somebody
pulling
the
the
cables
right
this
power
power
outage.
So
imagine,
like
you,
know,
in
a
safe
cluster
or
a
storage
system.
B
So once the power got restored, we powered on all the nodes in, you know, whatever order — I had six nodes, so we just powered them on one by one — and once the OS came up and all the Ceph services came up — all the containers, not pods, the containers came up — after, like, you know, 30 minutes to an hour, once all the PG peering completed, my cluster was back to normal, just after, you know, a short while.
B
Without
I
doing
some,
you
know,
repairs
or
doing
some
you
know
disc
replacements
or
whatever
right.
So
it
was
a
magical
moment
to
me.
A
self
cluster
with
5
billion
objects
suddenly
lost
power.
We
bring
it
up
and
all
of
a
sudden
everything
was
calm
and
the
storage
was
again
serving.
You
know
the
objects.
So
at
this
point
we
observed
a
performance
drop
in
in
the
get.
The
reason
to
this
is
that
you
know
you
know
guys
know
that
theft
uses.
B
So
all
of
all
of
a
sudden,
my
my
my
heart
caches
in
into
the
memory
got,
you
know,
got
flushed
out
and
then
you
know,
then
you
know
this
is
the
core
reason
for
for
the
outage
here.
Well,
especially
for
the
for
the
get
get
performance.
B
Remember
the
memory
flushing,
because
it's
non-volatile,
so
all
the
pre-prepared
caches
they
just
vanished
away.
But
for
forget
we
haven't
seen
anything
like.
If
you
see
just
compare
this,
there
is
just
a
minor
blip
in
here,
but
then
the
performance
for
the
for
the
get
restored.
Just
after
we
we
did.
The
test,
so
this
was
the
second
event
which
happened
for
us.
The
third
event
in
the
testing
happened
at
around.
You
know
when
we
reached
to
a
very
high
critical
level
of
capacity
spatial
capacity
usage
in
the
system.
B
So
at
this
point
my
system
was
like
you
know,
seven
to
eighty
percent
filled
up.
We
could
have
choose
to
not
write
anything
after
at
around
8
billion.
But
again
our
goal
was
to
hit
the
10
billion
mark
and
see
what
happens.
So
we
kept
on
writing
data
into
the
subsystem,
though
it
is
not
advised
that
in
any
storage
system
you
should
not.
You
know,
fill
fill
the
cluster
to
its
throat
right.
B
You
should
leave
some
ban
some
capacity
available
in
your
storage
systems,
but
we
have
not
followed
it
because
we
want
to
choose
to
go
to
10
billion,
so
we
could
see
a
significant
drop
in
performance
which
is
attributed
to
you
know
a
spatial
capacity,
high
utilization
of
the
spatial
capacity,
as
well
as
a
combinatory
effect
of
filling
over
of
the
blue
store
metadata
from
nvme,
fast
storage
into
spinning
devices.
B
So
the
guys
who
are
familiar
with
ceph
and
how
booster
works
at
certain
level
blue
store
tend
to
move
data
from
flash
onto
the
slower,
slower
tier
available
which
causes
performance
implication.
We
know
that
already.
So
there
was
no.
You
know
no
surprises
here
so,
but
a
good
point
is
that
the
ports,
the
puts
haven't,
got
impacted
by
even
by
by
the
spatial
capacity
as
well
as
data
movement.
However,
we
see
a
significant
drop
in
the
get
performance,
so
these
were
three
major
events
happened.
B
While
we
were
ingesting
data-
and
here
are
some-
you
know
some
graphite
graphs
for
you,
so
that
you
guys
can
you
know-
relate
this,
so
this
was
deep,
scrubbing
effect
going
into
my
system,
and
if
you
see
my
this
is
the
read
and
write
ratio
of
of
each
and
every
spinning
devices
there
are
318
of
them.
You
can
see
these
are
my
test
runs
going
on,
but
all
of
a
sudden
I
started
to
see
lots
of
lots
of
read
because
system
is
doing
deep,
scrubbing
and
describing
is
a
is
a
read
intensive
operation.
B
So
I
could
see
a
lot
of
lots
of
deep
slaving
going
on
and
at
the
same
time
the
performance
goes
goes
down
because
the
discs
were
busy
in
doing
something
else.
So
this
was
a
deep
scrubbing
affirmation
from
grafana
graphs,
and
here
are
some
more
graphs
from
grafana,
which
explains
the
the
blue
store
spilling
effect.
So
you
guys
know
that
blue
store
uses
rocks
to
be
and
rocks
to
be
uses
level
style
compaction.
B
We
have
a
graph,
we
have
a
blog
in
here
which
explains
this
this
thing
in
great
detail,
so
the
rockstar
has
multiple
multiple
layers
so
level
zero
is
in
memory
and
then
goes
on
to
level
one
until
until
until
you
know
too
many
levels,
but
a
portion
of
it,
if
it
can
store
this
data
onto
flash
rocks,
will
be
prefers
to
do
that.
If
you
have,
if
you
have
big
enough,
you
know
db
of
a
blue
store
database
sizes,
so
roxy
we
will
try
to
put
that
on
on
flash.
B
But
if
it's
not
able
to
write
on
flash,
if
you
are
limited
by
the
capacity
of
the
flash,
it
will
go
and
dump
that
on
to
spinning
media
because
it
has
to
put
it
somewhere.
So
in
our
system,
until
l4
caches,
we
were
managed
to
get
or
we
were
managed
to
store
the
data
onto
opt
onto
qlc
intel
qlc
devices.
B
Until
then
the
performance,
what
was
pretty
good
as
the
level
five
hit,
which
is
you
know,
which
was
like
2.56
terabyte
of
data
for
every
osg
device.
It
cannot
fill
in
that
on
the
flash
media
or
the
flash
flash
partition
we
had
for
blue
store
database,
so
it
has
to
move
the
data
on
to
spinning
devices.
So
if
you've
seen
here
so
this
is
the
metric
called
as
f
blue
fs
slow
use,
byte,
which
demand,
which
tells
you
that
how
much
data
is
moving
on
to
the
slow
tier.
B
So
if
you
can
see
this,
the
testing
all
started
at
26
there
was.
There
was
absolutely
no
data
movement
until
this
time
and
soon
after
you
know
after
fifth
of
july
or
june,
I
started
to
see
lots
of
data
moving
going
into
the
these
spinning
devices,
which
explains
that,
yes,
there
was
a
data,
blue
store,
spillover
effect
which
came
into
place
and
caused
cause
cost
of
performance
degradation
which
is
expected
so
no
surprises
as
well.
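If you want to watch for the same spillover on your own cluster, the counter the speaker is describing is exposed both per daemon and via the Prometheus exporter; a quick sketch:

```bash
# Ask one OSD for its BlueFS counters; a non-zero slow_used_bytes means
# RocksDB/BlueFS data has spilled from the fast device onto the slow (HDD) tier
ceph daemon osd.0 perf dump bluefs | grep -E 'slow_used_bytes|db_used_bytes'
```

The same counter is what Prometheus scrapes as `ceph_bluefs_slow_used_bytes`, and recent Ceph releases also surface the condition directly as a `BLUEFS_SPILLOVER` health warning in `ceph status`.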
B
Okay-
and
this
is
the
graph
which
explains
about
the
latency,
so
how
much
time
does
it
took
for
for
the
system
to
you
know
to
respond
so
on
an
average?
If
I,
if
I
do
the
average
right
from
zero
object
until
10
10
billion
like
half
a
second
of
latency
from
the
system
right,
you
could
say
that
you
know
it
is.
It
is
too
much,
but
for
for
an
object,
storage
system
it
and,
depending
on
the
application
that
you're
using
it
depends
what
latency
you
want
to.
B
You
know
build
your
system
with
so
for,
like
you
know,
for
some
some
latency
intensive
system,
you
want
to
bring
it
down
and
there
are
mechanisms
to
to
bring
it
down.
That's
not
a
problem
and
with
respect
to
the
get
get
latency
on
an
average
of
27
milliseconds
of
get
get
latency
that
we
that
we
observe
from
the
system.
B
Yeah, no big surprises. These two peaks that you see in here are attributed, again, to the power outage and the deep-scrubbing effect going on. But overall, again, you know, a pretty straight line across the board, right from zero until 10 billion.
B
The
next
graph
comes
is
the
so
the
last
one
was
small
object.
The
next
one
is
large
object.
What
will
happen
if
I
write
128
megabyte
object
into
the
subsystem?
Definitely
by
by
the
laws
of
mathematics.
We
cannot
ingest
this
given
capacity
system
with
up
to
10
billion
objects
and
each
object
having
128
megawatt,
because
we
don't
have
enough
storage
available
right.
B
So
we
started
again
with
with
you
know:
zero
objects
into
the
system
and
being
just
as
close
to
like
18,
18
or
19
ish
million
objects,
because
the
object
size
were
were
massive
128.
B
during
this
course
of
testing.
We
also
measured
100
percent
get
100
put
numbers.
So
if
you
focus
on
the
on
the
red
line
here
right
from
very
start
of
the
test
until
we
reach
to
the
last
pretty
pretty
straight
line
right,
not
not
supe
super
super
straight,
but
pretty
pretty
deterministic
line
here.
If
you
see
the
the
put
numbers
they
are,
they
are.
You
know
super
super
straight.
If
you
see
these
numbers,
however,
we
actually
we
missed
to
run
several
rounds
of
get
numbers
in
the
text
test
cycle.
B
So
it
was
a.
It
was
a
problem
at
the
test
test
plans
that
we
built
up.
So
we
missed
right.
We
missed
some
of
the
test
cycles
here,
but
since
we
can't,
we
can't
go
back
in
time
and
do
it
redo
it.
So
we
just
you,
know,
started
at
this
point.
I
realized
okay,
oh
oh,
we
are
not
capturing.
One
person
gets
so,
okay,
it's
not
too
late.
Let's
start
it
now,
so
on
average
10.7
gigabytes
per
second
of
s3
put
bandwidth
and
11.6
gigabytes
per
second
for
s3
get
again.
B
This
is
average
out
right
from
zero
object
until
to
the
very
last,
and
if
I
normalize
these
numbers
again
with
my
total
number
of
spinning
media
into
the
system,
I'm
I'm
capturing
close
to
34
megabytes
per
second
of
s3
workload.
It's
again,
it
is
not
the
performance
of
the
bare
performance
of
your
media,
which
could
be
you
know
close
to
120
150
megabytes
per
second,
which
is
advertised
performance,
but
this
is
actual
workload
performance,
so
these
numbers
typically
help
you.
B
So the left one here is the large-object degraded testing. First of all, with no outage in the system — the steady-state performance — we were getting close to 12 gigabytes and 10 gigabytes per second, respectively, for GET and PUT; all good here. Then, in the next round, we intentionally, you know, stopped six OSDs and waited for Ceph to throw those out. So after, you know, 600 seconds, Ceph threw out all six of my failed OSDs — which I had intentionally failed — and at the same time I was running COSBench to measure the performance.
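That 600-second wait is Ceph's default grace period before a down OSD is marked out and recovery begins; a sketch of the relevant knob and of how such a failure can be simulated — the OSD id and unit name below are illustrative and vary by deployment:

```bash
# Default is 600 seconds: how long a down OSD stays 'in' before being marked out
ceph config get mon mon_osd_down_out_interval

# Simulate a drive/OSD failure on one node
systemctl stop ceph-osd@12

# Watch the OSD go down, then out, and the PGs recover
ceph osd tree
ceph -s
```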
B
How
does
cause
bench
reports
the
performance
once
I'm
missing?
Six
tries
from
the
system,
so
you
can
see
this
right.
There
was
like
a
very
minimal
effect
from
from
12
I
moved
to
10
and
then
from
from
10
I
moved
to
9.6
so
which
is
you
know,
which
is
not
not
not
a
a
huge
drop
in
in
performance,
given
that
I'm
I'm
losing
some
storage
as
well.
B
So
if
you
see
this
one,
so
two
percent
of
storage
failure
resulted
in
into
six
and
eight
percent
of
you
know
performance
which
is
expected
because
you
have
low.
You
have
less
number
of
devices
underneath
which
used
to
work
right
so
in
in
the
in
the
third
iteration
of
the
same
test.
What
we
did
is
we
just
pulled
off
one
node,
one
node
containing
53
spinning
devices.
So
at
this
point
we
have
lost.
You
know.
B
17
percent
of
my
my
total
capacity
and
costbunch
is
trying
to
write
the
data
set
or
data,
and
then
I'm
measuring
the
performance.
So
there
was
a
a
decent
performance
drop
of
21
percent
and
25
percent,
which
is
again
expected
because
you
are
running
with
low
low.
You
know
worker
workhorses,
so
yeah
it
is.
It
is
decent
right.
It's
not
it's
not
too
bad,
I'm
not
losing
like
50
percent
of
the
battery,
or
let's
say
you
know,
or
even
even
lower
than
that,
but
it
is.
It
is
expected
with
respect
to
small
object
testing.
B
We
saw
a
similar
test,
a
similar
number
performance.
So
the
first
one
first
block
is
the
steady
state
everything
going
good
and
then
we
failed
six
devices
and
again
there
is.
There
was
no
huge
performance
drops
as
compared
to
you
know,
delivering
the
same
performance
with
with
large
options.
B
We
did
not
have
time
in
the
lab
to
execute
the
third
third
round
of
that,
so
we
don't
have
a
data
for
for
the
last
round,
but
you
know
you
got
an
idea
right,
a
subsystem
which
is
eighty
percent
filled
up
and
then
I'm
pulling
out
one
entire
note
from
the
system
and
trying
to
measure
the
problem
we
are.
Actually
you
know
we're
actually
trying
we
were.
We
tried,
you
know
very
hard
on
on.
B
Here
are
some
based
on
based
on
what
we,
what
you
guys
saw
on
the
performance
here
are
some
guidance
that
you
can
draw
from
from
this
like:
okay,
okay,
this
is
all
good,
but
how
can
this
help
me
designing
my
my
next
awesome
set
cluster,
because
I
have
got
a
requirement
that
I
need
to
build
a
cluster
that
can
deliver
x,
operation
per
second
or
maybe
a
cluster
that
can
deliver
y
gigabytes
of
of
per
second
s3
workload.
How
should
I
size
it?
How
many
nodes
should
I
go
and
buy?
B
How
many
spindles
should
I
would
I
be
requiring
for
this
workload?
So
again,
there
is
no
silver
bullet.
This.
This
mechanism
can
help
you
to
come
up
with
a
ballpark
number
so
again
pointing
it
back
to
my
the
numbers
which
we
so
average
numbers
divided
by
the
total
total
number
of
spinning
devices
that
we
have
so
we
can
get
close
to.
You
know
660,
ops
per
spinning
device
and
34
megabytes
of
put,
so
you
can
do
the
math
from
here.
B
Like
okay,
for
example,
let's
say
if
you
need
to
build
if
you
need
to
build
a
cluster
with
with
you
know,
with
that,
can
deliver
3.4
gigabytes
of
your
per
performance.
So
what
you're
going
to
do
is
you
can
just
go
and
buy?
You
know
100,
100
osds,
so
typically
that
should
that
should
give
you
a
cluster
with
3.4
gigabytes
per
second
of
of
put
output
bandwidth.
So
this
is
how
those
averaged
out
number
can
help
you
in
sizing
your
your
next
big
awesome,
stuff
cluster.
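A tiny calculator for the rule of thumb being described — the per-device constant is the average from this one test, so treat it as a ballpark input, not a guarantee:

```bash
# Ballpark HDD OSD count for a target S3 PUT bandwidth,
# using the ~34 MB/s-per-spindle average observed in this test
target_gbps=3.4
awk -v t="$target_gbps" 'BEGIN {
  per_hdd_mbps = 34;                      # observed average, this lab only
  printf "~%d HDD OSDs for %.1f GB/s of PUT\n", (t * 1024) / per_hdd_mbps, t;
}'
```

For 3.4 GB/s that lands at roughly a hundred spindles, matching the example in the talk.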
B
The
sample
size
was
too
small,
because
this
was
based
on
just
one
performance
testing.
So
don't
take
these
numbers,
as,
as
you
know,
the
the
most
accurate
to
the
most
perfect,
but
this
can
help
you
with
at
least
the
ballpark
number,
and
this
is
all
the
given
is
that
this
is
all
based
on
four
percent
of
flash
capacity
that
you
should
be
using
for
bluestora
per
osd
device.
B
The
next
one
is
some
recommendation
is
that
you
could
go
if
you
want
to
get
some
more
performance
from
subsystem,
just
go
and
use.
You
know
multiple
instances
of
of
ceph
redux
gateway
on
each
fosg
node.
That
can
give
you
the
more
performance,
which
means
it
will
gonna
actually
add
on
mode
loaded
into
the
osd
back-ends.
So
you
can
get
some
more
performance
from
here.
B
The
third
one
is
going
with
a
decent
blue
store,
blues
or
flash
sizing
so
for
what
we
have
observed
and
what
we
are
recommending
to
our
customers
and
others
are
use
four
percent
of
flash
for
bluestora,
which
will
help
you
in
most
of
your
use
cases
like
flog
file,
block
and
object
right.
So
this
is
typically
a
good
good
starting
point.
If
you
don't
know
what
is
your
actual
use
case,
gonna
look
like
right.
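Applying that four-percent rule to the drives in this lab, as a quick sanity check:

```bash
# 4% of a 16 TB HDD as the BlueStore DB/WAL flash allowance
awk 'BEGIN {
  hdd_tb = 16;
  printf "%.0f GB of flash per OSD\n", hdd_tb * 1024 * 0.04;   # ~655 GB
}'
```

That is in the same range as the roughly 750-800 GB of NVMe per OSD these nodes ended up with (six 7.68 TB NVMe devices shared across 53 OSDs per node), which comes up again in the Q&A below.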
B
If
you
know
the
use
case,
then
fantastic,
you
can
probably
lower
it
or
maybe
increase
that,
depending
on
your
on
use
cases,
but
this
is
usually
the
the
idle
recommendation
that
we
do.
B
You
could
also
increase
max
bytes
for
level
base
which
is
default
to
256
megabyte,
which
means,
if
you
are
allocating
four
percent
of
your
of
a
blue
store
metadata
device
for
for
osds
for
each
osd.
B
You
can
actually
bump
up
this
number
slightly
so
that
you
can
get
most
out
of
your
your
flash
capacity,
which
means
you
will,
you
will
add,
and
you
will
write
in
more
data
onto
of
rocks
db
into
until
you
hit
until
you
hit
the
limit
of
rocks
db
compaction,
moving
the
data
from
from
spinning
from
flash
onto
spinning
devices.
So
typically
you
can.
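This knob lives inside BlueStore's RocksDB option string. A hedged sketch of what bumping it could look like — the full default option string varies by release, so in practice you would edit or append to the existing value rather than copy this line verbatim:

```bash
# Inspect the RocksDB options BlueStore currently passes down
ceph config get osd bluestore_rocksdb_options

# Illustrative override: raise max_bytes_for_level_base from 256 MB to 512 MB
# (keep the rest of your release's default option string intact)
ceph config set osd bluestore_rocksdb_options \
  "compression=kNoCompression,max_bytes_for_level_base=536870912"
```

Note the change only takes effect when an OSD (re)opens RocksDB, and, as the speaker says, it trades flash headroom against the point at which spillover kicks in.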
B
You
can
play
around
with
this
with
this
tunable
by
the
way,
all
the
testing
that
I've
done
like
it
was
kind
of
you
know,
based
on
the
default
setting,
except
some
minor
settings
like
this
one
and
some
also
described
in
the
paper
I'm
going
to
show
you
like
you
know,
objector
in
flights,
and
you
know
those
kind
of
things,
but
I
haven't
tuned
the
theft
to
the
to
the
to
the
to
the
last
turn
available,
because
it's
it's
too
too
difficult.
B
The
people
out
there
who
are
still
using
rpm
based
things
or
rpm
based
services
in
in
yourself
in
a
subsystem
you
guys
can-
can
rely
on
ceph
and
and
using
containerized
storage
demons.
Like
all
these
storage
components,
all
the
stuff
components
can
can
run
on
containers
and
they
are
pretty
stable.
It's
been
there
since
last
two
and
a
half
or
three
years
I
would
say
most
of
the
customers
that
we
have.
They
are
using
contrary
storage
demons.
They
are
rock
solid.
B
With
this
testing
we
ingested
10
billion
objects.
We
failed
nodes
that
that
filled
up
at
80,
80
percent
full
filled
ratio.
We
have
not
seen
any
problem
with
respect
to
continuous
storage,
yeah
go
and
go
and
use
a
co-located
csd
contrast
for
siemen.
This
will
also
reduce
your
you
know.
Footprint
like
you,
don't
necessarily
need
dedicated
machines
for
mons
or
dedicated
machines
for
osds
and
managers
and
and
what
not
right
you
can
just
go
and
buy.
All
all
of
them
are
like
same
nodes.
B
You
can
just
go
and
have
six
nodes,
let's
say
and
just
co-locate
everything
you
should
be
good
to
go
if
possible,
go
with
a
decent
size
of
osg
memory
target
by
default.
It
is
six
six
five
or
six
gigabytes,
but
yeah.
If
you
have
availability,
you
can
go
with
some
some
decent
osg
memory
target
for
osd,
which
is
still
not
not
too
too.
I
mean
memory.
Memories
are
cheap,
cheap
these
days,
so
you
can
go
and
get
some
more
dents
into
the
system
and
get
some
more
numbers.
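For reference, this is the option being described; the 8 GiB value below is just an example, not a recommendation from the talk — check your release's default rather than assuming:

```bash
# Check the current per-OSD memory target
ceph config get osd osd_memory_target

# Illustrative bump to 8 GiB per OSD, if the nodes have RAM to spare
ceph config set osd osd_memory_target 8589934592
```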
B
So
these
are
some
sizing
guidance
based
on
our
study-
and
here
are
these
some
here's,
the
summary
so
overall
we
have
achieved
deterministic
performance
at
scale
for
both
small
and
large
object,
workload
sizes
and
before
we
hit
any
any
saturation
limits,
like
you
know,
blow
store
spilling
from
from
nvme
devices
to
spinning
devices,
and
then
we
that
we
hit
that
you
know
utilization
capacity
utilization
problems
because
we
did
not
had
enough
free
capacity
in
the
system
to
you
know
to
keep
it
to
a
decent
field
level.
B
All
my
systems
were
like
at
the
end
of
my
testing.
They
were
like
95ish
percent
filled
up.
I
did
not
add
any
any
capacitor
remaining
so
yeah
until
we
hit
any
resource
saturation,
we
got
some
some
fantastic
numbers
at
scale,
right
from
zero
objects
until
10
billion
and
same
for
my
failure,
mode
scenarios,
pretty
good
numbers-
and
undoubtedly
this
is
not
the
limit.
B
I
I'll
reiterate
this.
This
is
not
the
limit
of
staff.
This
is
what
we
tested
in
our
labs,
so
I
would
I
would
you
know
I
know
people
like,
like
cern
and
other
other
off-our
customers.
They
have
huge
stuff
clusters
which
have
already
have
you
know:
multi-deca
billion
objects
into
the
system,
so
so
yeah.
This
is
not
a
limit.
This
is
just
the
tested
maximum
and
I
hope
that
this
will
help
you
give
some
more
confidence
on
staff
staff
is,
is
really
robust.
B
So yeah, you can download the full report of this performance testing at this URL.
B
So I will now go and see if there are any messages in the chat — or, Mike, if you have anything for me that I can answer. Okay: RADOS objects, or S3? So these are RADOS objects, Anthony.
B
Oh,
thank
you
and
then
is
there.
Everything
has
the
report
has
to
get
the
same
s3cmd.
Can
you
please
tell
me
version
of
a
ccmd
port
okay,
so
we
have
not
used
sdcmd
because
s3
cmd
is
not.
I
mean
to
me
it's
not
designed
for
scale.
I
would
say
at
this
scale,
if
you,
if
you
just
do
s2cmd
ls,
then
your
terminal
gonna
hang
for
so
many
hours,
because
I
I
actually
did
that
on
another
bucket.
B
It
was
taking
so
much
so
much
so
much
to
because
I
had
like
hundred
thousand
bucks
buckets
in
my
well,
not
hundred
thousand
ten
thousand
right
yeah.
I
guess
10
000
buckets
in
my
system.
It
takes
a
lot
of
time
to
to
list
out
so
so
yeah.
We
have
not
used
s3.
We
have
been
using
cause
bench
to
write
the
data
into
the
system
and
cos
bench
uses
aws,
s3,
sdk
official
sdk,
to
write
to
the
system,
which
is
pretty
pretty
fast.
B
With
respect
to
measuring
the
performance
we
have
not
measured
using
s3cmd,
ls
or
all
those
kind
of
fancy
graphenock.
It
was
easier.
We
just
went
to
minus
s
and
we
just
relied
upon
the
stuffed
metrics
coming
out
from
the
grafana
and
prometheus,
and
it
was
the
redox
object
stored
into
the
subsystem.
C
This
is,
this
is
anthony.
What
was
the
min
allocation
size
for
blue
store?
Was
that
still
set
to
64k
because
of
hdds.
B
Tracks
that
no,
we,
I
think
we
we
had,
I
cannot
remember
I
mean
it's
been
long.
I
did
the
testing,
but
I
can
I
mean
exactly
the
numbers
are
there
on
on
the
paper,
but
it
was
exactly
same
what
we
have
in
the
on
the
fix
in
the
upstream
community
because
yeah
I
mean
you
know.
While
we
know
that
you
know,
the
community
has
told
us
that
okay,
we
should
be
adjusting
the
block
sizes
so
that
we
can
get
better
from
the
system.
B
Otherwise
we
will
end
up
losing
more
capacity,
because
you
know
I
guess
that
was
16.
If
I
recall
correctly,
if
any
of
these
f
expert
is
available
here,
I
guess
the
fix
that
we
have
in
upstream
is
like
16
kilobyte
for
both
spinning
and
and
hdds.
I
guess
so,
but
yeah.
That
was
the
number.
I
think
we
we
went
with
16
or
four
four
yeah
I
need
to.
I
don't
dig
that
number
up.
It's
not
at
the
top
of
my
head
right
now.
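For anyone wanting to check this on their own cluster, the options in question are below. Defaults have changed across releases (64 KB for HDDs on older releases, 4 KB on newer ones), and the value is baked in when an OSD is created, so query rather than assume:

```bash
# What newly created OSDs would use:
ceph config get osd bluestore_min_alloc_size_hdd
ceph config get osd bluestore_min_alloc_size_ssd

# What a specific existing OSD was actually built with:
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
```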
C
For
your
index
for
your
index
pool
how
many
shards
did
you
have,
did
you
have
utter
restarting
turned
on.
B
Yes,
we
had
our
auto
recharging
turned
on
we,
we
have
not
tuned
anything
on
that
part
of
things
and
again
we
we've
also
not
created
a
index
pool
on
flash,
because.
B
— no, we don't need that now, since BlueStore. So yeah, auto resharding was turned on in the Ceph system.
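The feature in question is RGW dynamic bucket index resharding, governed by the rgw_dynamic_resharding option (on by default since Luminous) together with rgw_max_objs_per_shard — whose default of 100,000 objects per shard happens to match the per-bucket object count used in this test. A quick sketch of watching it at work:

```bash
# Buckets whose index has exceeded the recommended objects-per-shard limit
radosgw-admin bucket limit check

# Resharding operations queued or in progress
radosgw-admin reshard list
```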
C
Okay,
interesting,
okay
and
you
in
one
of
your
slides,
you
sort
of
implied
that
the
the
size
of
the
partition
that
you
used
for
the
wall
and
db
external
device
was
300
gigabytes.
But
there
was
a
url
to
a
prior
presentation
about
a
1
billion
object
trial
that
described
an
80
gig
partition
size.
So
I
wanted
to
be
clear
about
which
you
actually
used
here.
B
So
in
here
we
had,
I
didn't,
do
the
math
here
real
quick,
because
it
was
like
I
guess
we
had
like
800
gigabytes
per
ost,
yeah
close
to
close
to
internet,
because
we
had
like
pretty
pretty
big
size
of
nvme.
B
So we had, like, six NVMes in the system, and each NVMe was close to eight terabytes.
C
There was a lot of discussion on the list when the upstream documentation started recommending four percent, because of the, you know, the BlueStore level sizes that you mentioned.
B
It's
close
to
750
or
800
gigs,
but
you
know
we
have
not
seen
because
because
we
we
hit
this
okay.
So
let
me
toss
my
screen
on
this
one.
This
is
a
pretty
interesting
point.
I
guess
so.
If
I
go
back
here
actually
you
know
we
have
not
used
a
lot
of
because
most
of
my
I
was
hitting
l4
into
my
flash.
My
l5
was
like
2.56
terabytes
and
I
did
not
have
this
many
this
much
this
much
unique
capacity
available.
B
So
actually,
even
though
I
had
like
800
or
750
gigs
available,
I'm
actually
using
you
know
close
to.
If
you
see
this
graph
like
seven
two,
two
third,
two,
eighty
gigabytes
off
of
dbs.
C
There's
been
some
discussion,
I've
seen
some
some
preliminary
prs
about
sharding
roxdb
so
that
we
don't
have
the
stair
step.
You
know,
you
know
proxy,
be
where
we
can
more
efficiently
use.
You
know
sizes
that
aren't
you
know.
You
know
power
of
10
multiple.
B
You
can
increase
that
yeah,
that's
a
workaround,
you
can
you
can?
There
are
a
lot
of
tunables
that
you
can
tweak
in
that
sizes
that
you're
describing
here.
B
Two
enables
yeah
with
the
rocks
tv
there
are.
There
are
ways
you
can
just
change
that
you
know
multiplier.
You
know
you
can
just
change
that
by
your
own,
but
yeah.
That's
that's.
A
All right, well, thank you for taking the time and sharing this information with us; the recording will be posted shortly today. Let's see — I think the only thing I have in terms of announcements: the Ceph newsletter should be going out today, and Outreachy news as well — we're looking for mentors.
A
We
have
projects
already
set
up,
and
otherwise
that's
it.
Thanks
everybody
for
joining
us
and
have
a
great
day
have
a
great
night
and
we'll
see
you
all
next
time.