Description
Presented by: Yaarit Hatuka | IBM
To increase product observability and robustness, Ceph’s telemetry module allows users to automatically report anonymized data about their clusters. Ceph’s telemetry backend runs tools that analyze this data to help developers understand how Ceph is used and what problems users may be experiencing. In this session, we will overview the various aspects of Ceph’s upstream telemetry and its benefits for users, and explore how telemetry can be deployed independently as a tool for fleet observability.
A: Can you hear me well? Awesome. Hey everyone. Today I will talk about Ceph's telemetry project. We'll have an overview of the project, talk about the motivation behind it and its architecture, have a (sort of) dashboard demo, hear some success stories, and talk briefly about how to deploy your own telemetry service.
A: It's important to emphasize that by default, telemetry reporting is off. Users need to explicitly opt in by agreeing to the CDLA license, and they can do that either with a CLI command, ceph telemetry on, or through the dashboard wizard. Users can also see a preview of a telemetry report with ceph telemetry show-all or, since Quincy, ceph telemetry preview-all.
A: The telemetry report is compiled daily from several channels, each with a different type of information, and once a user is opted in to telemetry, the channels can be turned on or off individually. Currently there are five channels in telemetry. The first one is the basic channel, which is on by default and collects and reports basic information about the cluster, like Ceph and kernel versions, the cluster size, the number of daemons, and so on.
A: We have another channel, the ident channel, which is off by default and allows users to identify themselves. Even though everything is anonymized, some users choose to identify themselves, and this channel allows them to provide their name, email, and organization. In Quincy we added the perf channel, which is also off by default and reports various performance counters from the cluster.
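To make the channel structure concrete, here is a rough sketch of what a compiled report could look like. The field names below are illustrative only and are not the exact telemetry schema.

    # Illustrative only: field names are examples, not the exact telemetry report schema.
    example_report = {
        "report_version": 2,          # bumped when the data model changes
        "report_id": "9a3c...",       # random UUID specific to telemetry, never the fsid
        "channels": ["basic", "crash", "device"],
        "basic": {
            "ceph_version": "17.2.5",
            "kernel_versions": ["5.14.0"],
            "total_capacity_bytes": 1_200_000_000_000_000,
            "daemon_counts": {"mon": 3, "mgr": 2, "osd": 120, "mds": 2},
        },
        "crash": [],   # one entry per crash event (see the crash section later)
        "device": [],  # anonymized drive health metrics
        "ident": {},   # only populated if the user opts in: name, email, organization
    }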
A: It is very important for us to emphasize that we care deeply about the privacy of our users, and we take several steps to ensure it. Whenever we introduce a new data model version, we require users to re-opt in on a Ceph upgrade; otherwise they keep sending the collection version they are currently opted in to.
A: Reports do not contain any sensitive or identifying data like pool or host names, or object names or contents. We also anonymize the report: each cluster is assigned a random UUID which is specific to telemetry. The fsid is not reported in telemetry, we also redact the disk serial numbers, and we never, ever store IP addresses.
A: We also want to learn about the upgrade cadence and the version adoption rate from users, and the device channel helps us learn which hard disk and SSD models users are deploying. In the long run we are building a really great open data set of device health metrics, aimed at research into more accurate drive failure prediction models.
A: With the crash data we can learn about new bugs as soon as possible, and it also helps us prioritize issues by focusing on the most common bugs: we can see how many clusters are experiencing a specific issue. It also allows us to discover trends across versions for specific issues, and once we have a solution for an issue that we see in the wild, we can actually verify that the solution works by identifying regressions; we'll talk about that shortly.
A: For users, the telemetry data can be used to validate their own installations by looking at what is common, and by opting in to the device channel they're contributing to an open data set of drive health metrics, which is pioneering; there's only one company in the market that does that, and we want to create an even better open data set in order to allow research on better failure prediction models, which eventually will lead to preemptive mitigation of failing devices and reduced downtime.
A: Another great win for users is that they don't need to actively report issues or open tickets for each crash. The process is automated: all the crashes that happen in the cluster are simply reported to us, and users can also use the open data set of crashes to better understand an issue. They can view it on our bug tracking system, tracker.ceph.com, and learn whether the issue they're experiencing was fixed in a specific version.
A: Let's briefly talk about the architecture. Every day the Ceph manager daemon collects data from all of the daemons in the cluster (the data collection mechanism is built into Ceph) and then compiles the reports: one report for the cluster and other reports for the device health metrics. It sends them home to our backend at telemetry.ceph.com, where we have an Apache server, a PostgreSQL database, and a couple of Grafana instances, one for the public dashboard and one for the private dashboard.
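As a rough sketch of the "phone home" step described above (the endpoint URL, HTTP method, and helper name are assumptions for illustration, not the module's actual code):

    import json
    import urllib.request

    TELEMETRY_ENDPOINT = "https://telemetry.ceph.com/report"  # illustrative; the module's endpoint is configurable

    def send_daily_report(report: dict, endpoint: str = TELEMETRY_ENDPOINT) -> int:
        """Upload one compiled report as JSON over HTTPS and return the HTTP status."""
        data = json.dumps(report).encode("utf-8")
        req = urllib.request.Request(
            endpoint,
            data=data,
            headers={"Content-Type": "application/json"},
            method="PUT",  # assumption: an idempotent upload; the real module may differ
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status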
A: We're going to take a look at the demo. Sorry about that, it's a screenshot; I did not know that I would not be able to connect my laptop. But you can go to telemetry-public.ceph.com and you'll see the public dashboard.
A: We can also see a breakdown of the versions for the clusters that report. At the top of the page you can select a breakdown by major or minor version; here we have a breakdown for Quincy, so we can see the adoption rate for the versions and how users are migrating to the most recent release, 17.2.5.
A: It's also useful to understand the volume of capacity that's coming from the different clusters, so not just the breakdown by versions and daemon count, but where most of the volume is coming from. Currently, for the clusters that report telemetry, most of it is coming from Pacific.
A: We can also learn about the capacity that the reporting clusters are going to need. This is a dynamic dashboard; it currently exists on the private dashboards. We can look forward to the next 90 days and get a list of all the clusters that will reach an 80 percent capacity threshold, and we can also look forward to the next year.
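The forecast behind that view can be approximated with a simple linear extrapolation; the sketch below only illustrates the idea and is not the query the dashboard actually runs:

    from typing import Optional

    def days_until_threshold(used_bytes: float, total_bytes: float,
                             daily_growth_bytes: float, threshold: float = 0.80) -> Optional[float]:
        """Estimate the days until a cluster crosses the capacity threshold,
        assuming growth stays roughly linear. Returns None if usage is not growing."""
        if daily_growth_bytes <= 0:
            return None
        remaining = threshold * total_bytes - used_bytes
        return max(remaining, 0.0) / daily_growth_bytes

    # Example: 600 TB used out of 1 PB, growing by 1 TB/day -> about 200 days to reach 80%.
    print(days_until_threshold(600e12, 1e15, 1e12))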
A: We have this graph to see the total extra capacity needed over the next year for all the clusters that report, and we can choose the threshold to look at. With the device data we can see the breakdown by the popular vendors that are out there in the wild: currently Seagate is the most popular vendor whose drives users are deploying, and right after that, I believe, HGST and Western Digital.
A: We can also see a breakdown of devices by interface, for either flash or spinning media. NVMe, we can see, is about as popular as SATA for SSDs, and SAS is a lot more popular than SATA for hard drives. Altogether we can see devices by type: mostly hard disks are reported, with SSDs and NVMe drives further down.
A: So that was a breakdown of all devices by type, across all vendors; we can also get a breakdown for a specific vendor by choosing it from the pull-down at the top of the dashboard page. Now I want to talk a little about the crashes. Let's look at what a crash report contains: each raw crash report that we receive contains the crash ID, which is a combination of a timestamp and a random UUID.
A: It also contains the daemon type and name, the Ceph version of that daemon, the backtrace, some information about the OS, and the assert message and condition if applicable. We can also look at data from the basic channel, in case the user is opted in to that channel as well, to see a bigger picture of the cluster that is reporting that specific crash.
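As a rough illustration of the fields just listed (the names below are approximate, not the exact crash report schema):

    # Approximate shape of one anonymized raw crash event; field names are illustrative.
    example_crash = {
        "crash_id": "2023-01-12T03:14:15.926535Z_4b6f0c2e-...",  # timestamp plus a random UUID
        "daemon": "osd",                       # daemon type and (anonymized) name
        "ceph_version": "17.2.5",
        "backtrace": ["ceph_abort()", "OSD::do_op(...)"],
        "os_name": "CentOS Stream",
        "kernel_version": "5.14.0",
        "assert_func": "void OSD::do_op(...)",     # present only for assertion failures
        "assert_condition": "op->may_write()",
    }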
A: We receive a lot of raw crash reports, and we need to find patterns among them.
A: The crash processor supports multiple generations, or recipes, of signatures, which allows for backward compatibility. In case we want to enhance the recipe and create even better grouping (I know that there are some duplicates in the reports right now), we can keep enhancing the recipe and still have backward compatibility. The processor then populates the database, so we have a database of signatures mapped to raw crashes. But just having the data itself is not enough; we want to turn it into action.
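A minimal sketch of the grouping idea: each recipe (generation) turns a raw crash into a stable signature, typically by hashing a sanitized view of the stack, and newer recipes can group more aggressively while older signatures stay queryable. This is an illustration, not the actual crash processor:

    import hashlib
    import re
    from typing import Callable

    def _sanitize(frame: str) -> str:
        # Drop addresses and other volatile tokens so identical stacks hash identically.
        return re.sub(r"0x[0-9a-f]+", "ADDR", frame)

    def signature_v1(crash: dict) -> str:
        frames = [_sanitize(f) for f in crash.get("backtrace", [])]
        return hashlib.sha256("\n".join(frames).encode()).hexdigest()

    def signature_v2(crash: dict) -> str:
        # A later generation: also fold in the assert condition to separate unrelated stacks.
        frames = [_sanitize(f) for f in crash.get("backtrace", [])]
        key = "\n".join(frames + [crash.get("assert_condition", "")])
        return hashlib.sha256(key.encode()).hexdigest()

    # Keeping every generation around is what gives backward compatibility:
    # old raw crashes can always be re-grouped under any recipe.
    RECIPES: dict[int, Callable[[dict], str]] = {1: signature_v1, 2: signature_v2}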
A: This is where the Redmine telemetry bot comes into place. The way it behaves: it queries the PostgreSQL database that we have for the most recent crash signatures, and then it maps each signature to a Redmine issue. The way it does that is it searches Redmine for these signatures and updates an existing issue in case it finds one.
A: Otherwise, it creates a new issue with the data of this signature, which contains the affected versions (the versions that the raw crashes of this signature were reported from), the sanitized and the raw backtrace, and a link to a dynamic dashboard.
A: What's nice about the bot is that it also identifies regressions. Say we have a tracker issue that is mapped to a signature, we solved that bug, and now we receive new reports, new crash events, from a newer Ceph version that basically should not contain this issue. The bot will open a new ticket in Redmine, refer to the original fixed one, and leave a heads-up for developers.
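Roughly, the bot's behaviour can be sketched as below. The redmine client object and its methods are hypothetical placeholders; the real bot talks to PostgreSQL and the Redmine REST API:

    # Hypothetical sketch of the telemetry bot's flow; the redmine client and its
    # methods are placeholders, not a real Redmine API wrapper.
    def process_signature(sig: dict, redmine, fixed_in_version: dict) -> None:
        """sig: {'id': str, 'affected_versions': list[str], 'dashboard_url': str}"""
        description = (
            f"Affected versions: {', '.join(sig['affected_versions'])}\n"
            f"Dashboard: {sig['dashboard_url']}\n"
        )
        issue = redmine.find_issue_for_signature(sig["id"])      # placeholder call
        if issue is None:
            redmine.create_issue(subject=f"crash signature {sig['id'][:12]}",
                                 description=description)
            return

        redmine.update_issue(issue, notes=description)

        # Regression check: the issue was marked fixed in some version, yet new crash
        # events arrive from a newer version, so open a heads-up ticket for developers.
        fixed = fixed_in_version.get(issue.id)                    # e.g. "16.2.0"
        newer = [v for v in sig["affected_versions"] if v > fixed] if fixed else []
        # (naive string comparison; a real implementation would parse the versions)
        if newer:
            redmine.create_issue(
                subject=f"possible regression of #{issue.id}",
                description=f"New crashes from {', '.join(newer)} after the fix in {fixed}; "
                            f"see issue #{issue.id}.")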
A: We can see the number of affected clusters for each signature and learn about the crash status. We can also drill down to cluster information, so you can learn about the capacity of the cluster and the current version that the cluster is running, and we can also see the trends across versions.
A: This is another static demo, excuse me for that. Here we can see that there are a lot of fields we can search by, matching what we just mentioned earlier. In this case we just wanted to see all the crashes for the most recent Quincy version, 17.2.5, and we can see the number of crashes in a time frame. Here the time frame is very big, so it is similar to the total number of occurrences that each signature has, and we can learn how many clusters are seeing this issue.
A: If we click on one of the signatures, we see a signature page which tells us when the issue was first and last reported. We can learn about the count of raw crash events that were reported and how many clusters were affected.
A: Going back to that page, we have some more information on that specific signature page, such as the assert function and condition, and then we have the daily occurrences; we can zoom in and learn some more. We have a list of the affected clusters with lots of information about each of them, and then we have the list of raw crashes.
A: Yeah, you can just see it on the tracker, or if anyone's interested I can share the links. Just a side note: not all crashes are Ceph bugs. There can be hardware and OS issues, environment, resource, or configuration issues, and there can also be issues in other dependencies. So not all crashes are actually bugs.
A: All right, let's talk about some cool stuff we saw. It is essential to monitor crashes of new releases, like we just saw with 17.2.5, and here we have examples of bug fixes for crashes that were reported through telemetry: they were opened in the tracker, picked up by the teams, and there are pull requests and backports that fix those bugs. This is pretty awesome; it is all automated, and the only manual thing here is the fix by our engineers.
A: This is an example; oh, actually, I think I did have another screenshot for that. So that's an example of a tracker issue that was opened by the telemetry bot. You can see that it populates the tracker with all the signatures that are relevant for this issue and all the affected versions, then it has an example of a raw crash report and a link to the dynamic dashboard where the engineers can look and get some more details, and here there's a pull request.
A: That pull request had the issue fixed. Sometimes users report issues manually, and the crash dashboard helps us understand when the issue was first introduced. In this example, a user was reporting an issue that happened in Quincy, but with the telemetry dashboard we could search for that assert condition and function, and we realized that it started as early as Pacific 16.2.5.
A: And sometimes users themselves respond to telemetry crash reports: the telemetry bot will open a ticket, and users chime in and say, hey, I also see this issue, I can provide logs or any other information that you need. That's very helpful, because we cannot always reproduce issues in our own lab. And here is an example of a bug that was fixed, where the latest affected version specified in the tracker was, I think, 16.2.0, if I remember correctly.
A: But then we saw some new crash events from 16.2.5 and 16.2.9, so the telemetry crash bot opened a new issue and gave engineers a heads-up: this issue was fixed, but now we're seeing new reports from a newer Ceph version, so you might want to take a look. And like we mentioned, we have the ident channel.
A: Sometimes users want to identify themselves, so we can contact them when we need more information about crashes, additional logs, or anything like that, and I'm very happy to share that users are more than happy to help with that.
A: We also use telemetry to understand how features are used in the wild. For example, it helped us understand whether we could announce FileStore as deprecated.
A: Did I add this one? Yes, I did, okay. So we had a full dashboard with a lot of analysis of what we see in the wild for the breakdown of FileStore versus BlueStore usage, and we saw that it was pretty safe to announce that FileStore is being deprecated.
A: We also wanted to know whether the Clay erasure code plugin is being used in the field, so we could check that with telemetry as well, and we wanted to know whether there were crashes related to this plugin; thankfully there were not. Again, this increases observability and lets us understand how things are being used.
A: The dashboard team also needed some help with tuning; they wanted to understand the magnitude of RBD images that are deployed in the wild, and this information was essential for them as well. We have a dashboard for that, but I did not have a screenshot, sorry. All right, two more slides. Let's say you want to deploy your own telemetry service; the reason for that would be that sometimes you might have constraints.
A: The cluster can be air-gapped, or you want to have some more observability in your own fleet. You need to do two things: bring up a telemetry server and configure the telemetry module. In order to bring up the server, you'll need a PostgreSQL database, the web server, and Grafana, and we have a detailed install guide in our repo.
A: On the telemetry module side, you'll need to change the endpoints, because right now they phone home to upstream Ceph; you'll need to opt in, because telemetry is never opted in by default; and you'll need to enable the channels that you want. You can do that either with the orchestrator, like cephadm, or you can change the module defaults when you build Ceph, or you can do it at runtime: if you have a running cluster, you can just configure the manager module.
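For the server side, the install guide in the telemetry repo covers the real setup (PostgreSQL, a web server, Grafana). Purely as an illustration of the shape of the receiving end, a toy endpoint that accepts report uploads could look like this; it is not the actual telemetry server:

    # Toy stand-in for a self-hosted telemetry endpoint: accept JSON reports over HTTP
    # and append them to a file. The real server stores reports in PostgreSQL behind a
    # proper web server; see the install guide in the ceph telemetry repo.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ReportHandler(BaseHTTPRequestHandler):
        def do_PUT(self):
            length = int(self.headers.get("Content-Length", 0))
            report = json.loads(self.rfile.read(length))
            with open("reports.jsonl", "a") as f:
                f.write(json.dumps(report) + "\n")
            self.send_response(200)
            self.end_headers()

        do_POST = do_PUT  # accept either method; the real endpoint defines its own contract

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ReportHandler).serve_forever()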
A: Please contribute, please join us, and opt in to telemetry. I know we are over time, but if you have any questions I'd be happy to take them.
B: [question inaudible]

A: Yeah, great question. We do get I/O errors that are reported with the crash metadata, so we just filter those out and do not open tracker issues for them.
C: A follow-up question on that, regarding the device health metrics that are being captured: there is also device failure prediction, which you can set to off or local. How effective has local proven to be? We have tried turning it on on certain clusters, but although there are so many crashes, none of that was actually really helpful; I mean, we haven't seen any alerts from the Ceph side.
C: Is there an effort to grow that local database, to understand more predictive failure scenarios?
A: Yes. Predicting a drive failure is not simple. The different vendors have specific metrics, and there's no one-size-fits-all prediction model. Of course, there are different types of drives, like flash and spinning, and we need many models to cater to all of them.
A: We do not have information about how well the disk failure prediction module is functioning in the wild, but we are working on improving the models; we're collaborating with vendors and working on collecting some more vendor-specific metrics in order to improve that.
A: So first, we're in the process of analyzing the metrics in our back end. And about overhead, you mean in generating and compiling the reports and collecting the data? From our experience it can take a short while to compile the report, but it happens daily and it should not interrupt the cluster's operation, and we're talking about really, really big clusters.
E: Thanks. We had a lot of conversations on the OSD side about how frequently we gathered some of that data and about trying to make sure it wasn't impactful; it was actually a really big concern. So I think, hopefully, we're in good shape, right? Yeah, right.
F: The graphs were really great. I noticed that one of the graphs showed the minor versions and how, say, 17.2.0 slowly shrinks in terms of usage as we go forward in time. Does that include clusters that have stopped reporting, or does it only include clusters that are reporting at that given time period?
F: 17.2.0, 17.2.1, 17.2.2, and we see, as we go forward in time, most clusters are on the latest version of version 17, but there are still some clusters on 17.2.0. So it brings to mind the question: are those clusters the ones that originally reported early on in the 17 life cycle and just stopped reporting because they went away for whatever reason, or are they still reporting that they're using that version?
A: I believe, and I don't have the exact breakdown for this specific version, but we could look at that, I believe that these are the original clusters that were reporting all along. I find it hard to believe that these are new clusters that started reporting to telemetry with this version, but it can happen. So we don't have a differentiation on the dashboard to see when they joined and started to report, or whether they upgraded to a different version.
A: But it's a good question; I'll be happy to take a look at that.
F: Kind of related to that: it's not in the data, but it would be interesting to see cluster retention. Do we get reports from a cluster, how often do we get continual reports, or do they just disappear? I think that has an important impact on things like drive failure prediction; you can't declare that a drive died if the cluster disappears, things like that, right?
A: So we were talking about that as well, having an offline telemetry capability; even with Mark we discussed that at some point, like having a thumb drive for air-gapped clusters, so they can just choose what and when to report. Yes, this is something that we discussed.
G: Mike, I already got authorization for one last question; it's a good one. Backblaze, and now Digitec, a Swiss electronics seller, get a lot of publicity for publishing reliability statistics and warranty rates for products. Ceph could get a lot of publicity if we made some kind of public report about the reliability, and maybe the performance, of different drive models. But I understand this can also be a bit of a delicate thing for us to publish; what are the limits of what we can do there?
A: It's a very good question. I think, for a start, we want to use this open data set in research and see what we could learn from it, because one thing we are missing is the labels: we don't know whether a drive actually failed. We can only say that the host is still active but that drive is no longer reporting, and that's a pretty good heuristic for understanding that something happened to the drive.
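That heuristic is easy to state in code; a sketch, with hypothetical inputs (last-seen timestamps per host and per device):

    from datetime import datetime, timedelta

    def probably_failed(device_last_seen: str, host_last_seen: str,
                        grace: timedelta = timedelta(days=14)) -> bool:
        """Flag a drive as probably failed or replaced when its host kept reporting
        well past the drive's last report. Timestamps are ISO-8601 strings."""
        device = datetime.fromisoformat(device_last_seen)
        host = datetime.fromisoformat(host_last_seen)
        return host - device > grace

    print(probably_failed("2023-01-01T00:00:00", "2023-03-01T00:00:00"))  # True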
A: So we're now in the process of playing with this database and getting some more insights from it. And yes, definitely, when we have some good insights we want to make them even more public, because that was the idea behind it: to have an open drive health metrics data set covering more models and vendors. I guess the next question is: when's lunch? So, yes.