From YouTube: Ceph Days NYC 2023: Why We Built a "Message-Driven Telemetry System at Scale" for Our Ceph Clusters
Description
Presented by: Nathan Hoad | Bloomberg
Ceph’s Prometheus module provides performance counter metrics via the ceph-mgr component. While this works well for smaller installations, it can be problematic to put metric workloads into ceph-mgr at scale. Ceph is just one component of our internal S3 product. We also need to gather telemetry data about space, objects per bucket, buckets per tenancy, etc., as well as telemetry from a software-defined distributed quality of service (QoS) system which is not natively supported by Ceph.
https://ceph.io/en/community/events/2023/ceph-days-nyc/
Hi everyone, I'm Nathan Hoad. As was alluded to by Mike, I'm a senior software engineer here at Bloomberg. I am on the distributed storage team, and I've been a part of the company for about seven years now.
Let's move on. So, a quick agenda for today: I'll talk a little bit about who we are as a company, some more about the storage group, and things like that.
Some information about our Ceph clusters, to give context for what I'm talking about. I'll talk a little bit about what's built into Ceph for telemetry, for those who don't know, why that telemetry doesn't work for us, and, as a result, what we wanted out of our telemetry system.
What our solution actually looks like; the results of this, i.e. how well does the solution match up to our requirements; and potential plans for the future. I will be taking questions at the end. So, firstly, Bloomberg: we're a financial technology firm. We were founded in 1981.
If you have not heard of us, our flagship product is called the Bloomberg Terminal. Our users, clients, whatever you'd like to call them, use the Terminal every day for data services, news, analytics, all sorts of things for financial analysis, essentially. We have over 350,000 subscribers all over the world. We process hundreds of billions of pieces of data every single day, and we do this with a force of over 7,000 engineers.
The Storage Engineering Group: we are responsible for designing, building, and maintaining all of the storage used by Bloomberg engineering. We have three main pillars: file, block, and object. We have teams for data protection, you know, like automated backups and things like that, for hardware failure and whatnot. We have storage workflows: say, for example, you have NFS mounts and you want your permissioning to be set up correctly, and things like that. You don't want a human to do that; it should be done by a machine.
Similarly, we have an automation team that works on automating things, not a big surprise there, for things like, you know, you want to increase your quota for your tenancy, or you want to create new tenancies, and things like that. Basically, letting our users onboard themselves and maintain their systems themselves, rather than us having to do it.
The distributed storage engineering team: so, as you might imagine, we have a focus on distributed storage. We are part of the Storage Engineering Group. Our focus is on software-defined storage, with our primary offering being object-level storage, i.e. the S3 API. Internally, we call this Bloomberg Cloud Storage, or BCS, and this shouldn't really surprise
anyone, given why we're here today: it is backed by Ceph. So, some information about our Ceph clusters. Given that we are very heavily object-focused, it shouldn't come as a surprise to anyone that we are heavily RADOS Gateway focused. As our great friend Matt alluded to earlier today, we do over a billion S3 requests every single day; that is billion with a B. And of Ceph clusters, we have a total of four.
So, what telemetry does Ceph offer by default? By default, it integrates with an open source product called Prometheus. This is a popular monitoring system and time series database, I should say. The way that it works is it receives metrics by scraping what it calls an exporter. An exporter is just an HTTP server with a metrics endpoint; that metrics endpoint returns a tabulated plain text list of metric names and their values.
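For illustration, a scrape of such a metrics endpoint returns plain text along these lines (the metric names below are representative of what the ceph-mgr Prometheus module exposes, not an exact dump from any cluster):

```
# HELP ceph_osd_up OSD status up
# TYPE ceph_osd_up gauge
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 1.0
# HELP ceph_cluster_total_used_bytes DF total_used_bytes
# TYPE ceph_cluster_total_used_bytes gauge
ceph_cluster_total_used_bytes 8.09280733184e+12
```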
Given this, you can think of this as a pull-based system rather than push-based, i.e. Prometheus has to be aware of where to get the metrics from, not the other way around. For visualization, Grafana is a really popular choice. You can use Prometheus's built-in visualization, but the common consensus, from what I can gather, is that generally Prometheus's querying and visualizations are used more for debugging rather than actual, like, user-facing visualizations. Grafana is significantly more feature-rich, and it's a fun thing.
If you do have a cephadm-based deployment, you get these things for free, out of the box: Prometheus and Grafana containers will come along. You also get the node exporter container, which does things like CPU usage and things like that. The Grafana container also comes with a bunch of built-in dashboards and everything, so it's extremely plug and play. And then, finally, you have the Ceph dashboard, or Ceph web interface; this also provides you a nice overview of the cluster.
So why doesn't this work for us? It sounds pretty great, but: number one, the documentation already notes that the plugin will be slow when you have over a thousand OSDs. From earlier, I said we have four and a half thousand in each of our clusters, so that ship is well and truly sailed. Next is another scalability issue.
So from this, we can come up with a list of our requirements. Number one: it should be as real-time as possible, for both Ceph-level and RADOS Gateway-level metrics. We want to know things like cluster throughput, PG status, you know, bucket count, garbage collection, all of that sort of thing, with the idea being that the more real-time it is, the more responsive you can be to potentially arising issues.
Scalability is obviously an important thing. We want it to be an intrinsic part of the system that you can scale tasks up and down as necessary. You shouldn't have to put a lot of work into breaking up your tasks so that you get fair distribution over your system, and, of course, that means that it should be a distributed system as well, and fault tolerance should come along for the ride. Thirdly, it should publish into Grafana. We have a major installation of Grafana here at Bloomberg.
Basically, any and all metrics will go into it. This means that, to give our users the best and most expected experience, we should be publishing into Grafana, because that's where they would expect to look. And fourthly, it should be easy to extend. If you've ever run ceph or radosgw-admin help, you'll see there's like a million subcommands, and you can get basically anything you can imagine out of the system.
Thirdly, it's written in Python. This lends itself to being very easy to extend; I'll have some example slides later that show some code demonstrating this. And fourthly, the technology choices that we've made make it really easy for us to scale up and down; more on that later. So, the pub/sub model: you have three major components. You have a publisher, a message broker, and a subscriber. As you might imagine from the name, the publisher is responsible for feeding data into the system.
The message broker is responsible for receiving that, storing it as appropriate, and then sending it out to the appropriate subscribers. And the subscribers are responsible for receiving this information, processing it with whatever application-specific logic you have, and then notifying the message broker that the processing is complete and it can be removed from the queue. This is nice because you can independently scale any of these individual components, depending on what your needs are. So say, for example, your subscribers are scraping websites and you want to be able to scrape more websites at a time.
You can either optimize this case individually, so that you're processing them faster, or you can just run more processes, so it makes it quite easy. Also, in terms of fault tolerance, this is quite nice, because say, for example, this topmost subscriber processing message one crashes. What will happen is the message broker will notice that the connection has been lost and message one has not been successfully processed, and it will redirect it to the next available subscriber.
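That acknowledge-or-redeliver behavior lives in the broker; a minimal sketch of the subscriber side, using the pika client against a hypothetical "work" queue, looks like this:

```python
import pika

def process(body: bytes) -> None:
    ...  # hypothetical application-specific logic

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="work")

def handle(ch, method, properties, body):
    process(body)
    # Only acknowledge after processing succeeds; if the subscriber
    # crashes first, the broker notices the lost connection and
    # redelivers the message to the next available subscriber.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="work", on_message_callback=handle)
channel.start_consuming()
```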
So, that's a high-level overview of pub/sub. So, more specifics on our tech stack. For our message broker, we went with RabbitMQ. It's highly flexible; you can do a lot of different things with it, depending on what your needs are, and we have a dedicated team at Bloomberg providing support for this, so we don't have to go through with the maintenance burden of owning yet another system.
Secondly, we have Celery. If you have not heard of Celery, it is a Python framework for implementing a task scheduler. The way I describe this to people is that it is good for turning code into messages. When you use vanilla RabbitMQ, you have to care about the serialization formats of your messages as you feed them into the broker. This, for the most part, takes that away from you. And thirdly, as I have already alluded to, Grafana. Grafana is important for us, and we have a dedicated team providing support for that as well.
More information about RabbitMQ: it has client libraries in every major programming language. This speaks to its popularity; there's a lot of existing knowledge out there. If you have a question, someone else has probably already asked that question before, and someone's probably already answered it.
The way that it implements pub/sub is: publishers write to an exchange, and subscribers consume from a routing key, and the way that you join these two things together is dependent on the type of exchange. So, for example, the most common would be a direct exchange; that's a one-to-one mapping of one exchange to one routing key. Then you have fanout, which is very similar, except it's more of a broadcast system, so you can do one exchange to many routing keys. And finally, you have topic-based; so, say you had an exchange called news.
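As a rough sketch of those three exchange types in code (again with the pika client; all exchange, queue, and routing-key names here are made up for illustration):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Direct exchange: one-to-one, a message goes to the queue whose
# binding key matches the routing key exactly.
channel.exchange_declare(exchange="tasks", exchange_type="direct")

# Fanout exchange: broadcast, one exchange to many routing keys;
# every bound queue gets a copy of the message.
channel.exchange_declare(exchange="announcements", exchange_type="fanout")

# Topic exchange: routing keys are dot-separated and bindings can use
# wildcards, e.g. "news.*" matches "news.sports" and "news.tech".
channel.exchange_declare(exchange="news", exchange_type="topic")
channel.queue_declare(queue="sports")
channel.queue_bind(queue="sports", exchange="news", routing_key="news.sports")
channel.basic_publish(exchange="news", routing_key="news.sports", body=b"scores")

connection.close()
```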
This is where our scalability comes in. RabbitMQ doesn't really care how many subscribers you have connected, so, as a result, to get vertical scalability you just run more subscribers on the hardware that you have, and to scale horizontally you basically do the same thing, but with adding more hardware: you're just running more subscribers on a different machine. And finally, it's boring technology. What I mean by this is: it has been so battle-tested, and it's been around for so long now, that it's a safe choice. It's no longer exciting.
It's like picking Microsoft Word or Postgres; it's a good thing, like, safe is good. Celery: as I alluded to, tasks are just functions, like they're just literal code. You get to forget about how you serialize messages, which is nice. It also decouples you from your message broker: if you decide, for whatever reason, that RabbitMQ is not for you, you can move to Redis by changing a single line of configuration. And, being a task scheduler, it has all of the regular primitives
you would expect from that sort of system, like you can group functions, delay them, add callbacks, all that sort of stuff. And one thing that's particularly interesting is that you can create different categories of subscribers by specifying which queues they should publish and subscribe to. So say, for example, you wanted to do some graphics processing on a cluster of machines, and you only had a subset of that cluster that had dedicated graphics processing hardware.
You could create subscribers on only those machines that only consume those messages, so that you're making sure messages are routed as appropriate. This is something that you can do with vanilla RabbitMQ, but it's significantly more manual; Celery makes this quite nice. And again, it's boring technology. Boring is safe, and safe is good.
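In Celery, that kind of routing is just configuration; a minimal sketch, with hypothetical task and queue names:

```python
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")

# Route the GPU-bound task to its own queue; everything else keeps
# the default "celery" queue.
app.conf.task_routes = {"tasks.render_frame": {"queue": "gpu"}}

@app.task
def render_frame(frame_id: int) -> None:
    ...  # hypothetical graphics-processing work

# Then, only on the machines with graphics hardware, start workers
# that consume just that queue:
#
#   celery -A tasks worker -Q gpu
```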
So, a piece of code. If we look at this, this is probably the smallest example of a Celery application I could come up with. At the very top here, we have an instantiation of a Celery object.
This acts as both a task registry and a place to configure your application, for like queue names and timeouts and things like that. And further on down, we have two functions, the first of which is not super interesting: it is called bar, it receives an integer, and it logs that value. What makes this into a task is the task decorator.
So, if you've used Python before, you'll know that bar.delay, like, the delay method, doesn't normally exist on a function like that in Python. This is something that the task decorator will add for you.
This is nice because, under the hood, what this is doing is it's serializing a message saying: we would like to asynchronously call the bar function with the argument 55. It will pass that off to RabbitMQ, where it gets consumed by a worker and then processed appropriately.
It's good because it feels quite natural: you don't have to write a lot of plumbing, like manually serializing this thing; it's a nice experience. And then, secondly, we are calling bar as a regular function; they still work as regular functions.
So this is nice for a couple of reasons. One, that means you can progressively enhance the system: as you find you have certain tasks that you would like to break out and make into independent asynchronous tasks, you can do that while still maintaining the current behavior of letting them run synchronously. And then, finally, at the end, we have a small example of using group and signature to kick off a bunch of tasks all at once and wait on the result of all of them.
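The slide itself isn't captured in the transcript, but a minimal Celery application along the lines described might look like this (the broker URL, the second function, and the argument values are illustrative):

```python
import logging

from celery import Celery, group

# Instantiating the Celery object: it acts as both a task registry and
# a place to configure the application (queues, timeouts, broker, ...).
app = Celery("example", broker="amqp://localhost", backend="rpc://")

@app.task
def bar(value: int) -> None:
    # Not super interesting: receive an integer and log it. The task
    # decorator is what turns this plain function into a Celery task.
    logging.info("bar received %d", value)

@app.task
def square(value: int) -> int:
    return value * value

if __name__ == "__main__":
    # .delay doesn't normally exist on a Python function; the task
    # decorator adds it. It serializes "call bar(55) asynchronously"
    # and hands the message to the broker for a worker to consume.
    bar.delay(55)

    # Tasks still work as regular, synchronous functions too.
    bar(55)

    # group + signatures: kick off a batch of tasks all at once and
    # wait on the result of all of them (needs a result backend).
    results = group(square.s(i) for i in range(10))().get()
    logging.info("squares: %s", results)
```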
So how do you actually run this? The Celery library provides a command line tool called celery. You basically use this for configuring the application you want to run, the amount of concurrency, you know, your concurrency model, things like that, logging, all that sort of stuff. So the most useful thing in the top line is the concurrency flag; you can see we have it set to one there. For the default model, that will mean that you are running a single process, so you can process one task at a time.
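The exact invocation isn't captured in the transcript, but a worker command of that shape would look something like this (assuming the module name from the sketch above):

```
celery -A example worker --concurrency=1 --loglevel=INFO
```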
We did not change the queue settings in the previous slide, so they are the default: a queue called celery, which is made up of an exchange called celery, which is a direct exchange, and a routing key also called celery. Fun stuff. And then we have our list of tasks which, as you would expect, matches the list of functions that we had in the previous slide. And Grafana: I'm not going to talk a lot about this; we've been talking about Grafana a lot today. It's a highly personal experience for what you need.
Another nice thing to note is that it impresses non-techie people. Every now and then, my wife will see me working on one of these dashboards and she'll say to me: did you make this? And, I mean, yeah, I did. I'm not a UI guy, but I definitely made the dashboard, right? So that's a huge positive.
So, if we take all of these things and we piece them together, what does it look like? We start in the very top left there, with a task scheduler. This is a very fancy way of saying: a script that we run via cron. This is responsible for generating the list of tasks that we would like to process; so say, for example, ceph status.
It will get sent to RabbitMQ. In this state, I refer to the tasks as being unenriched. What I mean by that is, as I alluded to earlier, we have multiple data centers, and we have multiple zones within those data centers. We want to delegate this task out to each of those data centers and zones as appropriate. Another important factor is that, for the RADOS Gateway tasks that we want to run, we have ownership metadata associated with the tenancies that maps to internal IDs for our systems, for like routing tickets.
So what happens with these unenriched tasks is: they go from RabbitMQ, they are consumed by the scheduler subscribers down the bottom there, they perform the enrichment that I was talking about, and they submit it back to RabbitMQ. And then where they should go is dependent on, like, the parameters, right: the data center, the zone,
and what the actual task is. So, continuing along with ceph status, we can see that that will go down to the Ceph mon subscribers, and then RADOS Gateway-level commands will go to the RADOS Gateway subscribers, which run on the RADOS Gateway nodes. The basic rule here is: if it's a ceph command, it goes to the mon nodes; if it's a radosgw-admin command, it goes to the RADOS Gateway nodes. And then these tasks on these machines are responsible for running the relevant ceph command,
collecting the output, parsing it, transforming it as necessary, and then publishing it into Grafana. A more concrete example of what I'm talking about: so, at the very top here, we can see we are now configuring our list of queues that we want to consume on. We have the scheduler queue, which will be picked up by the scheduler nodes in the previous slide, and then we have extra queues for each data center.
We then have an entry point of start_bucket_stats. So this is a task that will be published to the scheduler queue. It is responsible for collecting the tenants and tenancy information that I mentioned, and then, for each of our data centers, it will call start_user_bucket_stats.apply_async. apply_async is another method that the task decorator will add to your functions. It's basically the same as .delay, with the main difference being that it gives you more control over the actual queuing mechanism that you use.
So you can see I'm using it here to specify which queue it should be published to for the relevant data center. You can also use it for timeouts, retries, all of that sort of thing. And then, if we go down to the actual implementation of start_user_bucket_stats, you can see it's fairly straightforward: we're doing a RADOS Gateway user list, and then, for each user, we're generating a task to collect the bucket stats for that given user, and then, further on down, the implementation of the actual bucket stats collection.
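The slide code isn't reproduced in the transcript, but a sketch of the flow as described, with hypothetical queue names and helpers, might look like:

```python
import json
import subprocess

from celery import Celery
from kombu import Queue

app = Celery("telemetry", broker="amqp://localhost")

# The scheduler queue plus one queue per data center (names made up).
DATA_CENTERS = ["dc1", "dc2"]
app.conf.task_queues = [Queue("scheduler")] + [Queue(dc) for dc in DATA_CENTERS]

def radosgw_admin(*args: str):
    # Run a radosgw-admin subcommand and parse its JSON output.
    return json.loads(subprocess.check_output(("radosgw-admin",) + args))

@app.task
def start_bucket_stats() -> None:
    # Entry point, published to the scheduler queue by cron; fans the
    # work out to each data center's own queue via apply_async.
    for dc in DATA_CENTERS:
        start_user_bucket_stats.apply_async(queue=dc)

@app.task
def start_user_bucket_stats() -> None:
    # Enumerate users, then generate one task per user to collect
    # that user's bucket stats.
    for user in radosgw_admin("user", "list"):
        collect_bucket_stats.apply_async(args=[user])

@app.task
def collect_bucket_stats(user: str) -> None:
    stats = radosgw_admin("bucket", "stats", "--uid=" + user)
    publish_to_grafana(stats)

def publish_to_grafana(stats) -> None:
    ...  # hypothetical: transform and push into the time series backend
```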
This is quite nice, because it's clearly a highly flexible system. You can change this: say, for example, you decided that collecting these metrics at a user level is not fast enough, and you want to collect them at a bucket level. Then you could do a bucket list on your user and then collect the individual bucket stats, and when we have like 50,000 buckets, that would generate 50,000 tasks.
The limit here is not that we're going to run 50,000 tasks at once; that's determined by the concurrency we had earlier. So this is nice, because it gives a sense of the scale, and breaking your tasks up into as small a task as possible will give you the best concurrency and, like, the best responsiveness in the system.
So, the results of this: as I said, it's very responsive. We can collect metrics in under five minutes. This means that users see quota changes effectively immediately, which is obviously nice for them. They can see, you know: oh, I was at 80 percent, and I deleted some stuff, and now I can see I'm down at 50 percent; I'm out of the woods. Load is distributed really well. This is significantly better resource utilization: we pay a lot of money for the hardware that we have; using it adequately for this task is important.
There's a lot of built-in retries and timeouts. This is something that both Celery and RabbitMQ give you. You don't have to write a lot of code yourself, like "for i in range(10): try this thing"; you just say retries equals five and you're good to go. And it scales easily as the system grows: as we saw, you just increase a number on a command line and you're off to the races. It's easy to add new metrics: it's just regular Python code; you're calling a process and parsing the output.
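In Celery terms, that "retries equals five" convenience looks roughly like this (the task body and names are illustrative, not the actual code):

```python
import subprocess

from celery import Celery

app = Celery("telemetry", broker="amqp://localhost")

@app.task(autoretry_for=(subprocess.CalledProcessError,),
          retry_kwargs={"max_retries": 5})
def collect_status() -> bytes:
    # On failure, Celery re-queues the task automatically, up to five
    # times, instead of us writing a manual "for i in range(10)" loop.
    return subprocess.check_output(["ceph", "status", "--format=json"])
```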
So it's quite nice: you don't have to put a lot of work into the metrics once you know what command to run. And this integrates with existing metric-based alerting. As I said, we have a lot of pre-existing systems here at Bloomberg; our users are very accustomed to that user experience. We don't want to write a whole brand new, you know, alerting system, so it's quite nice how this hooks in. Potential plans for the future: we could look at extending these metrics for RBD, the RADOS Block Device.
That sort of depends on our interest for it. We can also collect metrics on the metrics. So what I mean by this is: if you collect metrics on how often and how long it takes you to collect your specific named metrics, then you know where to focus your optimization efforts. So if you have a task you're running every five minutes, but it takes 15 minutes to complete...
Just from today, there's been a lot of talks about Ceph telemetry, so this feels like something people might find useful, but it sort of depends on all of you. And that is everything. We are hiring, so if you are interested in working on Ceph or any of the other cool stuff that we have here, please do come and talk to me. Because, as I said, that's everything; so, does anyone have any questions?
Audience member: [...] the op latencies, everything. But it becomes, like, in the old-school monitoring stack that we have, at a certain point, it's like you can't search, you can't find out. We want to find out which was the slowest disk over the last, you know, 24 hours, things like that. Are you able to, like, filter and get the top of, say, like, the bucket that had the most IOPS over the last day, or things like that?
Nathan: Totally. I mean, like, this sort of comes back to what I was saying about Grafana, right: it's heavily dependent on what you're using for your time series information, and so I can't really speak specifically about the time series database that we're using to back Grafana here. But, you know, I personally feel that, yes, it would be easy to get that information out; like, the querying model lets you do that.
Audience member: [inaudible]

Nathan: Easier to scale, yeah. That's actually a really important note: this is more focused on overall cluster-level metrics and, like, RADOS Gateway-level metrics, rather than doing, like, individual node-level metrics, like what you're talking about, for "what OSD is giving me trouble".
Audience member: I think there's a lot of interesting stuff here, especially the RabbitMQ integration and scaling, but I want to draw your attention to the upstream work going on that I think can possibly converge with some of this. There's a new thing called the Ceph exporter daemon. That's running per node, in containerized environments or deployed otherwise, that's designed to sort of vacuum up all the perf counter information from, let's say, the RGWs or other daemons as it grows. And then there are more types of metrics being added: there are extensions to the perf counters interface being created to make it more hierarchical, and, for example, RGWs are gaining the ability to sum up counters per bucket and per user, and those are flowing through that system.
So those would be interesting types of things to potentially have this thing talk to, and also the Ceph RGW admin interface, which could be widened so you don't have to launch radosgw-admin processes to scrape things. It doesn't look like it, but that's a pretty heavyweight operation that spawns up, essentially, a new node in the cluster, gets all that going, and then spits out your six JSON records. But anyway, very incredible.