Description
Presented by: Danny Abukalam | SoftIron
Benchmarking Ceph has always been a complex task - there are lots of tools but many have drawbacks and are written for more general-purpose use. For Ceph we need to benchmark Librados, RBD, CephFS, and RGW and each of these protocols has unique challenges and typical deployment scenarios. Not only that, Ceph works better at scale and so we need to ensure that we can build a benchmarking system that will also scale and be able to generate an adequate load at large scale.
A
So, yeah, show of hands: how many people have spent time doing benchmarking in their careers? Okay. And how many of those people specifically with Ceph? Cool, okay, most of the room. And how many people loved every minute of it?
A
Three. Three: Mark Nelson and two others. Okay, interesting. So I think benchmarking can be painful for a few reasons. Firstly, you need to run tests many times, and consistently. The tests you run need to be a minimum length of time. There are loads and loads of variables, so you try to change one variable at a time.
A
But then you look at the results and realize that, even though you were only changing one variable, there were actually loads of variables at play, and what you thought was true was not true, and you just give up. So it's kind of painful, because you want good tools and you don't always have them; sometimes your tools get in the way. And for Ceph:
A
Benchmarking is even more complicated. First of all, we have many interfaces: we don't just interact with Ceph one way, we do it many ways: block, file, object.
A
We have massively varying workloads: each of these protocols has different characteristics, so we need to take different approaches and measure them independently. Distributed systems need distributed benchmarks, so we need benchmarking tooling that's going to scale with our distributed system. And also, and this is a big one, workloads can be invisible to operators, because most people deploying and operating Ceph aren't actually using it themselves; they're presenting it to a set of customers, and those customers are using it in whatever way. Who knows?
A
Please don't do that. So predicting the workloads that you need to optimize for can be a challenge if you don't know what those workloads are.
And finally, and this is great as well, Ceph's background work can also get in the way. Ceph does a whole bunch of background work that's semi-visible or sometimes invisible, and this can have a performance impact if you're not aware of it.
A
So you need to be aware of that. Up until now there's been a whole bunch of different tools for benchmarking Ceph, but a big one for a while was COSBench. How many people are still using COSBench today, or have used COSBench in the past? Okay, so: have used in the past, and still using today, two people, kind of one and a half, okay.
A
Well, we used COSBench for a while as well, and so did our customers, and we thought, oh yeah, COSBench, great. But actually COSBench had a bunch of issues. The first one is that the Java Native Interface is pretty expensive, so for anything other than S3 it's not great. Amazon has a pure Java implementation of the S3 client, so it's cheap to run S3 benchmarks from the JVM, but for anything else, like librados or librbd, COSBench has to traverse the JNI, which is expensive.
Also, COSBench is unmaintained; it's been unmaintained for a while. You could maintain it directly, we could pick it up and work on it, but the problem is that the fundamental architecture of the project is quite painful.
A
It uses this thing called OSGi, and the idea with OSGi is that everything is an independent bundle rather than a monolithic application. COSBench was originally targeting lots of different object protocols, and that's probably why they did that, but the result is an incredibly fragile structure that's difficult to navigate and even more difficult to debug when things go wrong. The workflow is also very manual.
A
You submit these XML jobs and pick up the results, and there's no build or install system: the binaries are in a repo, and it doesn't really tell you how to build it from source, so there's no other way to install it. But basically, in spite of all these issues, lots of people were using COSBench.
A
There was value in that, because it was a benchmark in the sense that everyone was using the same thing. Even if it's a bit not-so-good, everyone is having the same not-so-good experience, so there's value in that. That's why we tried to stick with it, and we did a whole bunch of work trying to make it more pleasant. We got it building with Maven.
A
We did a bunch of automation to make the XML jobs easy to submit, but eventually the effort was no longer worth the benefit, so it just made more sense to write our own tool. That's a diplomatic way of saying that my colleague Harry got too irritated with COSBench, vanished for a bit, and resurfaced weeks later with a prototype replacement. So that's where we're at. So what are the goals of sibench?
A
We want a tool that's simple and lightweight: easy to read, easy to run, easy to debug. We want it to be linearly scalable like Ceph is, so the benchmarking tool doesn't get in the way; we don't want to benchmark the tool itself, we want to benchmark Ceph. We want to benchmark all the Ceph protocols. We want something that's designed for Ceph.
A
We don't want performance implications from the tool; ideally, and this was a core goal, each sibench worker thread matches fio in performance. And finally, we'd like a framework that gives us some control over the data we use to run the benchmarks as well. So those are the goals.
A
So what did we do? sibench is written in Go, so it's almost free to call out to Ceph's libraries. It's both a daemon and a CLI tool: when you install it, a sibench driver daemon listens for work on a port, and you can then use the CLI tool to generate work for the daemons, with localhost as the default, but you can provide other workers as well.
A
It's multi-threaded: by default each sibench driver will spin up one thread per CPU core. You can control that, it's configurable, but most of the time it's about right.
A
It does bits or bytes, because networking people like bits and storage people like bytes, and they have arguments about which is the right format, so we just do both.
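The dual formatting is easy to sketch. A minimal illustration of the idea, assuming decimal (SI) prefixes for both renderings, since that is how link speeds are usually quoted; this is not sibench's actual code:

```python
def throughput_strings(bytes_per_sec: float) -> tuple[str, str]:
    """Render one measured rate both ways: bytes/s for storage people
    and bits/s for networking people, with decimal (SI) prefixes."""
    def scale(value: float, unit: str) -> str:
        for prefix in ("", "K", "M", "G", "T"):
            if value < 1000:
                return f"{value:.2f} {prefix}{unit}"
            value /= 1000
        return f"{value:.2f} P{unit}"
    return scale(bytes_per_sec, "B/s"), scale(bytes_per_sec * 8, "bit/s")
```

For example, 125,000,000 bytes/s renders as both "125.00 MB/s" and "1.00 Gbit/s", so each camp can read the number it prefers.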
It also has ramp time, like COSBench does, so you can say: don't actually record the first X seconds or the last X seconds, to try and get a representative result. And finally, it focuses on actual benchmarking and not orchestration, because there's a separate orchestration piece to go with sibench.
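The ramp idea itself is simple to express: keep only the samples that fall inside the steady-state measurement window. A sketch of that trimming step, with a made-up sample layout for illustration, not sibench's implementation:

```python
def trim_ramp(samples, ramp_up_s, ramp_down_s, run_s):
    """Drop samples from the first ramp_up_s and the last ramp_down_s
    of a run_s-second run, keeping only the steady-state window.
    Each sample is a (seconds_since_start, value) pair."""
    lo, hi = ramp_up_s, run_s - ramp_down_s
    return [value for t, value in samples if lo <= t < hi]
```

Statistics computed over the trimmed list then exclude warm-up effects like cold caches and connection setup, which is the point of ramp time.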
A
So here you can see the different protocols that we support with sibench to start with. You can see sibench talks directly to the monitors, and then you have a worker per thread. Basically, for librados you give it a Ceph pool, a Ceph key, and a monitor address, and that's really all it needs.
A
It
needs
same
with
lip
RBD
set,
pool,
saf,
key,
monitor,
stress
and
then
each
worker
thread
will
spin
up
an
image
and
and
read
and
write
to
it.
Libs
ffs.
A
For libcephfs we don't need to provide a pool, so we just provide a key and a monitor address; it mounts the file system, and then each worker thread reads and writes its own files into that directory. Rados gateway works differently: you provide an S3 access key, secret key, and an S3 bucket, and the target doesn't actually even need to be a rados gateway. It could be any S3 HTTP endpoint, so you could point it at your load balancers and benchmark the load balancers, and you could benchmark non-Ceph S3 endpoints as well. Theoretically, I guess that would work.
A
I think you can. And then finally, you have the ability to benchmark native block and file as well, so you can just point it at a block device or at a folder. You'll have to mount or map these manually, but you can then use it to benchmark things like iSCSI or NFS and SMB as well.
A
Some of the other cool things sibench can currently do: it can do bandwidth limiting. We have customers that give us requirements such as, hey, we need 100 milliseconds of response time when doing 30 gigabytes of traffic, or whatever, I just made those numbers up, but it's a good way of getting latency numbers at a certain bandwidth. So you can say: I don't want to max out the bandwidth on every driver.
A
I want to hit a certain bandwidth limit, and then, after that, what I'd like to do is optimize for latency. So that's pretty good; it's a good way to make sure the workers aren't maxing out their pipe every time.
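The effect of a per-worker bandwidth cap can be sketched with a token bucket; this illustrates the general throttling technique, not how sibench actually implements its limiter:

```python
import time

class TokenBucket:
    """Throttle a worker to a target bytes/sec: each send asks the
    bucket for n bytes and sleeps until the budget allows it."""
    def __init__(self, bytes_per_sec: float):
        self.rate = bytes_per_sec
        self.allowance = bytes_per_sec  # start with one second's budget
        self.last = time.monotonic()

    def consume(self, n: int) -> None:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at one second's worth.
        self.allowance = min(self.rate, self.allowance + (now - self.last) * self.rate)
        self.last = now
        if self.allowance < n:
            time.sleep((n - self.allowance) / self.rate)
            self.allowance = 0.0
        else:
            self.allowance -= n
```

A worker calls `consume(len(payload))` before each operation, so its sustained rate converges on the cap while latency is measured at that fixed offered load rather than at saturation.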
It also has a slice generator. By default, if you don't use this, it generates random data, but if you want to measure compression or deduplication, you can't do that with random data.
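The reason random data defeats those measurements is that it has no redundancy for the storage layer to find. A sketch of the idea behind a compressibility-aware generator, with an invented layout (zero-filled slice plus random tail), not sibench's actual slice format:

```python
import os
import zlib

def compressible_buffer(size: int, redundant_fraction: float) -> bytes:
    """Build a test buffer with a controllable amount of redundancy:
    a zero-filled slice (highly compressible) followed by random
    bytes (incompressible). Pure os.urandom data would not compress."""
    zeros = int(size * redundant_fraction)
    return bytes(zeros) + os.urandom(size - zeros)
```

Writing such buffers lets you check that a cluster with compression enabled actually achieves the expected space savings, which an all-random workload would hide completely.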
A
It does read/write mixes. By default it just does reads and then it does writes, but if you give it a read/write mix, you can specify a blend of reads and writes to conduct in parallel.
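Interleaving to a given blend can be done with a weighted coin flip per operation; a minimal sketch with assumed percentage semantics, not taken from sibench's code:

```python
import random

def make_op_picker(read_pct, seed=None):
    """Return a picker that chooses 'read' or 'write' per operation,
    so the two are interleaved and the long-run blend approaches
    read_pct percent reads."""
    rng = random.Random(seed)
    def pick():
        return "read" if rng.random() * 100 < read_pct else "write"
    return pick
```

Because the choice is per operation rather than per phase, reads and writes genuinely contend with each other, which is what distinguishes a mixed workload from running a read phase and a write phase back to back.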
A
Like 50% reads and 50% writes, or 30/70, or whatever. It also has support for writing out all the individual stats from each worker. This can be used for trying to understand what's happening if the numbers don't make sense, or if you just want to debug exactly what happened in a workload, or if you want to run statistical analysis over the data. Obviously the file is huge, because you're getting a JSON file of every single thing that every driver has done, every thread.
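Once you have that per-operation log, rolling it up is straightforward. A sketch of the kind of post-processing you might run over it; the field names here are hypothetical, not sibench's actual JSON schema:

```python
from collections import defaultdict

def summarise(records):
    """Roll a per-operation log (one dict per op) up into per-worker
    totals and an average latency. Field names are illustrative."""
    acc = defaultdict(lambda: {"ops": 0, "bytes": 0, "lat_sum": 0.0})
    for rec in records:
        worker = acc[rec["worker"]]
        worker["ops"] += 1
        worker["bytes"] += rec["bytes"]
        worker["lat_sum"] += rec["latency_ms"]
    return {name: {"ops": w["ops"], "bytes": w["bytes"],
                   "avg_latency_ms": w["lat_sum"] / w["ops"]}
            for name, w in acc.items()}
```

The same pass could just as easily compute percentiles or flag outlier workers, which is exactly the "numbers don't make sense" debugging case described above.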
A
So what's benchmaster? benchmaster is a little Python tool that is a wrapper for both COSBench and sibench, and it's for running a series of benchmarks rather than just a single one. It allows us to provide a set of options to sweep over, so I can say: run a workload for 1K, 4K and 16K, and it'll go and do all three of those. It also writes out all the workload results to Google Sheets.
A
You can spin up a Google sheet, and I'll show you that in a second; it's quite useful. So yeah, it's an orchestrator to run a series of benchmarks and sweep over various variables, such as object size.
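A sweep like that is just a cartesian product of the swept options, with each combination becoming one benchmark run. A sketch of the fan-out with invented option names, not benchmaster's actual code:

```python
from itertools import product

def sweep(object_sizes, read_pcts):
    """Fan one invocation out into one run per (size, mix)
    combination, the way a sweeping wrapper turns a set of swept
    options into a series of individual benchmark runs."""
    return [{"size": size, "read_pct": pct}
            for size, pct in product(object_sizes, read_pcts)]
```

Two sizes and two mixes fan out into four runs, and each result row can then be appended to a spreadsheet as it completes.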
A
Cool, okay. So first of all, the thing I want to show you is the fact that it's both a command line tool and a daemon. It's running as a daemon on this server. For this I'm just using a cluster we had in the lab, which was in a health-error state for weeks before this weekend, and I had to go and try to fix it to use it, so I don't know what results I'm going to get. But I just want to show you functionally how this works, rather than any performance figures or anything like that.
A
So that's the daemon, and then if I run sibench with the help flag, you can see that I have a little command line utility for running S3, rados, CephFS, RBD, block and file benchmarks, with a bunch of options. We've also got a man page, which I'm a huge fan of, I like man pages, so that goes in and gives you some info on various things. And the other thing we have...
A
Wait, let me show you what it looks like to run a benchmark.
A
So here I'm telling it to use bytes; the ramp up is one second, the ramp down is one second, and the runtime of the benchmark is five seconds. I give it the Ceph key, and I give it the monitor as a target. So this is the rados benchmark.
A
You can see it's telling me the list of drivers, which in this case is just the localhost node that I'm on; if I had more nodes, I'd pass them on the command line as the workers. And I can see the object failures and verification failures, it's all fine, and I can see the write and the read, and they're separated.
A
Can I zoom in? Oh yeah, let me see.
A
You can see it's creating a JSON report; there's a write phase, a prepare phase and a read phase, and at the end it tells me, hey, here are the results, and it breaks them out per target and per driver as well. Obviously with rados you're only going to have one monitor target, so it doesn't really matter. For RBD this is going to look very similar.
A
Basically, it's exactly the same command except I'm using RBD instead, and what you see here is the same thing, but what's happening is that it's actually creating loads of RBD images, one from each worker thread. In this case I only have one driver, but I have 32 cores on this node, so it's going to spin up 32 RBD images and write and read to them. That's all configurable; this is just the defaults.
A
So, same thing. The next thing I want to show you is benchmaster. We'll create a new sheet and, let me find my cursor, here it is, I'm going to call it "Ceph Days NYC".
A
You can just generate some Google API credentials and pass them to the tool, and it'll just talk to Google Sheets. This all falls within the free API tier for Google Sheets, so you don't have to pay for it.
A
I'm going to go back to the terminal. This time I'm not going to run a sibench command, I'm going to run a benchmaster command.
A
What
our
bench
is
also
shared
by
sidebench,
so
my
reverse:
it
okay.
There
we
go
so
this
is
a
a
rados
Benchmark
one
with
benchmaster
I'm,
going
to
tell
it
which
sheet.
A
Specify
the
pool
same
again
run
time
up
down
and
then
here
I'm
going
to
give
it
a
couple
of
sizes,
so
I'm
going
to
say
64k
and
one
Meg
and
I'm
also
going
to
give
it
a
couple
of
read:
write
mixes,
so
first
zero,
so
basically
just
default,
which
is
not
Nomex
and
the
second
one
is
a
50
50
mix
and
I'm
going
to
call
it
initial
test
and
again,
the
target
is
the
monitor.
A
For
for
S3
as
well,
you
provide
the
list
of
targets,
so
you,
if
you
don't,
have
load
balancers,
you
can
still
test
a
S3
across
a
whole
number
of
endpoints,
but
but
you
don't
need
to
do
that
with
rados
and
I
haven't
set
up
brightness.
Give
me
so
I
want
to
show
you
this
today.
A
Yep
yeah
it
outputs
to
Json
by
default,
so
it'll
drop
it
somewhere.
I
think
you
can
tell
it
where
to
drop
it
as
well.
A
I,
I,
think,
I,
don't
know
I
I!
Think
benchmaster
will
do
it
for
you
as
well.
A
All
right,
so,
if
we
go
across
we'll
start
to
see
that
we're
starting
to
see
some
of
these
workloads
being
kind
of
reported,
I
can't
actually
see
it
myself.
So
I
hope
you
guys
can
see
it,
but
yeah.
So.
A
And
then,
and
then
there's
some
docs
on
the
website
sidebands.io
and
there's
the
kind
of
GitHub
as
well
as
open
sources
GPL,
so
you're
welcome
to
use
it
and
contribute.
A
This
is
why
everyone
else
used
the
clicker
thing.
I
had
to
be
clever,
okay,
so
the
final
thing
is
ideas
for
the
future.
So
what
we
thought
would
be
cool
is
like
a
workload
generator,
so
you
could
basically
have
this
demon.
This
is
like
a
dream
idea,
pipe
dream
idea,
but,
as
you
can
tell,
we've
made
this
particularly
for
our
own
use
cases,
so
it's
very
much
what
we
needed.
So
it
might
not
be
what
you
need
and
that's
fine,
but
if
it
is
then
that's
great
as
well.
A
So
one
of
the
things
we
quite
like
to
do
is
have
a
workload
generator
where
you
could
kind
of
sit
and
watch
how
safe
cluster
behaves
over
time
and
then
spit
out
a
set
of
side
bench
benchmarks
that
will
give
you
a
representative
kind
of
this
is
how
you've
used
this
cluster
over
the
last
month
or
whatever,
and
here's
a
bunch
of
benchmarks
that
you
could
run
to
help
you
optimize
and
get
to
a
solution.
That's
going
to
make
sense.
A
But basically the idea was that if you want to see how Ceph scales linearly, you can take out the OSDs up until the point where you only have three nodes, run the benchmark, then add in an OSD node and run the benchmark again, add another OSD node, and so on. We actually did this quite successfully for a while; we could see the linear increase in performance, and that made everyone happy.
A
Also, meta operations like snapshots or omaps: being able to create and delete images or do other things in Ceph which aren't necessarily just reads or writes. It would be cool to add Kubernetes support too. We have a lot of customers using our Ceph as CSI persistent volumes, and having something that's friendly for that would be interesting. And yeah, I'm interested in hearing your questions and feedback, or whatever you think might be cool as well. That's basically it.
D
Danny, great, oh sorry, Danny, great talk, thank you. Actually, on that last slide: one of the things that I think would be interesting for me, for us, would be the ability to profile a future event, like what would happen if my load went up 50%, what would happen if I added 50 OSDs, what would happen if I decreased by 50 OSDs? Benchmark against that and see what your performance would be, because there are tuning considerations for those events, to plan ahead.
E
Danny, I'm Mark. So I'm super impressed that you did a live presentation, good job, and I definitely want to talk to you later about all this. But my first question for you is: if I want to just use the individual pieces that you have, for like doing block or doing S3 testing, do you need the daemon to do that? Or can you just run it individually on a client, like you'd run fio?
A
You need the daemon; that's the fundamental architecture of sibench. The way it works is that the daemon listens on a port, and sibench itself is both a daemon and a CLI tool, so you can run it from any of the daemons. You start the server wherever you're benchmarking from, and that listens, and then you talk to that server over the sibench port and say: send this job out, send that job out.
A
You
add
more,
as
you
add
more
servers,
then
you
just
add
them
to
the
list
of
things
that
you're
using
okay.
E
I'm thinking it'd be really interesting to look at this as a validation against fio; you want to see whether or not both tools agree with each other. Well, cool. I'll let other people talk.
F
It looks fantastic. Having used COSBench, and seeing this, it's like night and day; it just looks absolutely fantastic, and I like the fio-style output. I'm just curious about some of the findings. You've got a million-candlepower flashlight now to look at performance with sibench. What were some of the things that surprised you, for different customer deployments?
A
So, I didn't do most of the benchmarking. I've done quite a bit of benchmarking with sibench and with benchmaster, but that was a couple of years ago, or maybe a year and a half ago. We've had a whole bunch of different teams within SoftIron doing benchmarking across different things, and particularly when we got to faster tiers of storage, there was work on the optimization of the underlying sibench driver mechanism.
A
There was a bit of a refactor there to make sure that it was on par with what fio is doing. If the question is: now that you have a benchmarking tool, how does Ceph compare to other things, it's hard to say, because we've only benchmarked Ceph with this.
A
Yeah, we did, for all of the above, depending on which tier of storage: when you go from hard disk to NVMe. And that's super important for us to know what's the next appliance we want to build and what's the right hardware architecture.
A
I've run sibench from a four-core ARM machine, so it's very lightweight, and you can specify the factor of threads it uses: by default it will use one thread per CPU core, but you could do 0.5, or you could have only one thread.
A
It's pretty lightweight, but obviously your results aren't going to be great. Are you thinking of running it in containers, or running it en masse, or something?
B
I'm just thinking about, if I have a really large cluster and I really want to saturate the network, and I really want to figure out what is the next bottleneck I will hit, I have to take into account, from my experience when I did this kind of benchmarking test, what could be the bottleneck of the client.
A
Yes, absolutely, and that's exactly the process that we've gone through: okay, we have this 42U rack, 42 OSDs, the NVMe and this cross-connect, and where is the bottleneck? And doing it over and over again, and then surprising ourselves, like, oh, you didn't think about this, or whatever.
B
For example, if you generate the S3 objects for the test and so on, that's also sometimes a bottleneck, and it even delays the execution of your program as well. I really did this on large clusters with, I think, close to three thousand OSDs, and it needed 20 to 50 clients to saturate the clusters, and it was not possible.
A
You
yeah
we.
We
found
that
that
it
was
scary
how
many
drivers
we
would
need
to
properly
drive
a
large
OSD
cluster.
We
expect
we
almost
needed
as
many
drivers
to
as
we
as
we
needed,
OST
nodes
and
so
you'd
end
up
with
like
two
or
three
racks,
and
then
you
know,
half
of
it
is
drivers
and
half
of
its
OST
nodes.
So
yeah
you
need
a
lot
of
horsepower.
C
Yeah
so
related
to
that,
so
does
it
support
essentially
launching
a
fleet
of
site
bench
running
on
different
servers
and
being
able
to
kind
of
coordinate
that
yeah.
A
That's the whole point. I mean, I ran it in a very caveman way here, but the whole point is to have loads of drivers, and that's why it's a daemon rather than just a command line tool: you install the daemon on, let's say, 100 servers, and then you provide those 100 servers to sibench, the command line tool, and the command line tool isn't doing any of the work itself.
A
It doesn't actually have many dependencies, and I think we've packaged it for Fedora and Debian at the moment. If you go on sibench.io, there are some instructions for a few different distributions where it's packaged, but it's not in upstream package repositories yet, because, you know, who has time for that?
A
It doesn't, and it's something you really have to be aware of, which is why we removed the delete thing: we didn't want to worry about it, and we want to try and remove as many background tasks as possible. Actually, if you look at the website, there's a best practices page on how to run sibench, and I think part of it does tell you some of those things.
A
You
need
to
be
aware
of
before
running
The
Benchmark,
to
avoid
kicking
off
background
tasks
and
stuff,
but
then
sometimes
you
want
those
background
tasks
depending
on
what
you're
doing
so
trying
to
just
forget
about
it
isn't
is
never
going
to
be
an
approach.
That's
going
to
work.
You
want
to
think
about.
Okay
am
I
going
to
end
up
with
a
bunch
of
deletes,
and
that's
probably
what's
going
to
happen
in
real
life
right,
so
it's
yeah
so
trying
to
get
something
representative,
yeah,
yeah,
100,
cool,
okay.
Well,
thank
you
very
much.