From YouTube: Ceph Tech Talk: Telemetry Dashboard
Description
A presentation on Ceph Telemetry Dashboard, emphasizing crash telemetry work and use cases for Developers and Operators.
Presented by: Yarrit Hatuka
Join us monthly for Ceph Tech Talks: https://ceph.io/en/community/tech-talks/
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: So hey everyone, my name is Yarrit; I've been working on the telemetry project for a couple of years now, maybe even more. Let's see what we have. Today we're going to have an overview of telemetry, we'll get a better understanding of the motivation for it, I'll talk about some architecture, we'll have some dashboard demos, and we'll see some of the success stories that we've had so far.

A: So Ceph telemetry means that clusters phone home to report anonymized, non-identifying data about your installation, configuration and so on.
A: If you want to see a sample telemetry report, you can do that with a CLI command: ceph telemetry show all prior to Quincy, and in Quincy we use ceph telemetry preview all.
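For reference, a minimal sketch of the commands involved (names follow the upstream telemetry module and may differ by release; check "ceph telemetry --help" on your version):

    # Print the report of everything you are opted into:
    ceph telemetry show
    # Quincy: preview the report, including collections you have
    # not (re-)opted into yet:
    ceph telemetry preview-all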
A: The telemetry report is broken down into several channels, each with a different type of information, and once the user is opted in, telemetry channels can be turned on or off.
A: We currently have five channels. The first one is the basic channel, which has information about the Ceph and kernel versions, the cluster size, how many daemons are in the cluster and so on. This channel is on by default, again in case the user is opted into telemetry. Then we have the crash channel, which has information about where in the Ceph code the crash occurred.

A: This one is also on by default. Then the ident channel gives users the option to share their contact details, like their email and what organization they're from; this one is of course off by default and has to be explicitly turned on. And in Quincy we added the perf channel, which has all sorts of perf counters from the cluster, and this one is also off by default.
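A sketch of toggling channels from the CLI (the "enable channel" form is from the Quincy-era telemetry module; on older releases channels are plain config options):

    # List channels and whether they are enabled (Quincy):
    ceph telemetry channel ls
    # Toggle individual channels once opted in:
    ceph telemetry enable channel perf
    ceph telemetry disable channel ident
    # Pre-Quincy equivalent via config options:
    ceph config set mgr mgr/telemetry/channel_ident true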
A: I want to touch on privacy; this is really important to us. In case users are opted into telemetry and we add new data to the reports, we require users to opt in again on a Ceph upgrade. In Quincy we changed this design a little: it allows users to keep sending whatever they're currently opted into, the current data model version, but they need to re-opt in for any new deltas.
A
The
reports
do
not
contain
any
sensitive
or
identifying
data
like
pool
names,
host
names,
object,
names
or
object.
Contents.
We
really
just
care
about.
We
don't
we
don't
care
about
who
owns
the
cluster?
We
just
care
about
the
telemetry
information
in
it.
A: We also redact the disk serial ID; this is relevant for the device channel. IPs are never stored on the back end, and in order to enhance privacy we send two separate telemetry reports, one with the anonymized cluster data and the other with the anonymized device health metrics. These are sent to two different endpoints.
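The two reports can be inspected separately; a sketch (command names from the telemetry module):

    # Cluster report (anonymized cluster data):
    ceph telemetry show
    # Device report (anonymized device health metrics),
    # sent to its own endpoint:
    ceph telemetry show-device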
A: We can also learn about and discover crash trends across versions, and once we find solutions for those bugs, we can verify that they actually work by identifying regressions if they occur; this is all thanks to the crash channel. And for users: users can validate their installations by looking at what is common, what Ceph users usually deploy. They can preemptively mitigate failing devices, and by contributing their SMART data, and this is a bit of a longer-term goal that we have here, once we have more accurate device failure prediction models, we can help the user understand that a device is about to fail. That helps to reduce downtime in the cluster and to shift downtime to a maintenance window rather than have it at a peak hour.
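A sketch of the cluster-side pieces this builds on (the ceph device commands come from the mgr; the prediction command assumes a diskprediction module is enabled):

    # Devices known to the cluster and which daemons use them:
    ceph device ls
    # Raw SMART health metrics scraped for one device:
    ceph device get-health-metrics <devid>
    # Current life-expectancy prediction, if a diskprediction
    # module is enabled:
    ceph device predict-life-expectancy <devid>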
A
Another
big
motivation
for
users
is:
they
don't
need
to
actively
report
issues
or
open
tickets
for
each
crash
that
they
have
in
their
clusters
and
they
can
use
the
open
data
set
of
the
crashes
to
better
understand
an
issue.
So
if
you
see
a
specific
issue
on
your
cluster,
you
can
search
it
on
our
bug
tracking
system
and
see
if
it's
a
real
bug
and
if
it's
a
real
bug,
you
can
learn
what
version
it
is
fixed
in.
A
Now,
let's
see
what
we
have
with
the
cluster
data
so
far,
so
we
can
learn
about
breakdown
by
versions
and
to
see
the
upgrade
cadence
and
version
adopt
of
the
story
reduction
rate
that
I
mentioned
earlier,
and
we
also
have
panels
to
learn
about
capacity
density
in
in
the
clusters
that
are
reporting
in
the
wild.
A: The blue bump here is Quincy. We released it a couple of weeks ago, and you can see that there are already about 70-something clusters reporting in Quincy. Then we can see the actual number of daemons that report Quincy as well: about 3,700 daemons were upgraded to Quincy already. And we can see all sorts of other breakdowns of information, like the total capacity by version.
A
So
this
help
help
us
understand
if
users
in
the
wild
are
adopting
new
versions
and
how
quickly
that
happens.
This
dashboard
lets
you
see
breakdowns
by
major
and
minor
versions.
So,
let's
see
we
want
to
take
a
look,
for
example,
at
a
pacific,
but
we
want
to
see
all
the
breakdown
by
minor
versions
in
pacific,
so
we'll
just
ask
for
a
display
by
minor
and
we'll
ask
specifically
for
pacific.
A
So
here
you
can
see
the
adoption
rate.
This
purple
is
16
to
7
and
it
was
released
about
the
the
end.
At
about
this.
It
was
december
right
of
2021,
so
you
can
see
how
it
is
being
adopted
by
users
in
the
wild.
A
We
have
some
other
panels
here.
I
will
not
get
into
everything
I
I
really
encourage
you
to
to
take
a
look
at
them,
and
here
we
have
a
complete
dashboard
just
for
the
breakdown
for
capacity
density
of
all
the
reporting
clusters.
A
Let's
take
a
look
at
a
cluster
x-ray
page.
This
is
a
page
of
the
giveaway
cluster
that
we
have
in
our
lab,
so
we
can
learn
about
how
old
this
cluster
is.
How
many
hosts
it
has
the
total
and
use
capacity
for
in
a
time
series
manner
and
then
to
learn
about
the
pools
and
pgenum
of
this
cluster
to
learn
about
the
latest
metadata
of
this
cluster.
A
We
can
see
the
reports
like
to.
We
can
see
individual
raw
reports
for
for
each
of
these
clusters,
like
in
the
cluster
x-ray
page.
We
have
latest
pools
information
and
we
can
learn
about
the
recent
crashes
that
happened
in
this
cluster
as
well.
A
B: All right, there was one; Neha answered it, but the question was: are these 1,840 clusters deployed at customer sites? And Neha answered that they're all upstream.
C: So what is the typical case? Is it only for the upstream deployments? If somebody deploys, say, a Red Hat or a Canonical distribution, it will likely not report.
A: They can; it's up to them. As I mentioned at the beginning, every user, every operator has their own choice. So if the customer is not air-gapped, they can opt into telemetry if they want to, but they have to do that explicitly. It does not happen in the background or anything like that.
A
They
explicitly
have
to
opt
in,
so
it
can
either
be
via
the
cli,
with
a
cell
telemetry,
on
command
or
via
the
dashboard
and
with
the
self
telemetry
on
command.
We
do
require
to
enter
the
license,
so
it
has
to.
It
has
to
be
explicitly
manually
done.
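A sketch of the opt-in flow (the license flag is required by the CLI):

    # Explicit opt-in; requires acknowledging the data-sharing license:
    ceph telemetry on --license sharing-1-0
    # Review what is enabled and when the last report went out:
    ceph telemetry status
    # Opting out is just as explicit:
    ceph telemetry off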
C: Just asking, because the comment was that all those clusters in telemetry are upstream clusters. So that means there are no clusters running a version from a distributor.
A: It is mostly upstream. We did see some reports from different distributions, such as Red Hat's and SUSE's, but those could also be, you know, test clusters or anything like that. We don't know these clusters; it's all anonymized, and unless users choose to identify themselves, we have no idea what organization they're from.
D: I guess, to emphasize that: I think the answer is that most likely they're all upstream. It is quite possible that there are one-off cases where a distribution like Red Hat or Canonical Ceph has enabled telemetry, but at the moment we don't differentiate.
A: So you can see that Seagate is easily the most popular device that users are deploying, with 31,000, nearly 32,000, devices. We can take a look at a breakdown by models and learn that there are eight devices that report 20 terabytes each.
A: So, the reason we collect health metrics, which currently is just SMART metrics, though we are working on adding vendor-specific metrics as well: the end goal here is to provide a disk failure prediction service. Everyone knows that in order to have a good model we need a lot of training data, and currently the only open data set out there is from Backblaze, which is really nice of them to provide, but the problem is that it is limited and not diverse enough.
A: So we are opening up the device telemetry data. We have an open data set, which can be downloaded from our website, and we call on researchers to do open research on this data and come up with better models for predicting failures. We also have plans to collaborate with other projects in order to create a larger data set for this.
A
All
right,
the
crash
data
that
we
have
from
the
crash
channel
has
raw
crash
reports
which
contain
each
one
of
them
contain
a
crash
id
which
is
basically
a
timestamp
plus
a
random
uuid.
A
It
has
information
about
the
demon,
type
and
name
the
ceph
version
of
the
daemon.
It
has
its
text
trace
the
vectors
of
of
that
graph
specifically,
and
it
has
information
about
the
distribution
and
the
kernel
version,
and
if
it
was,
if
the
crash
happened
due
to
an
assert
we'll
have
this
information
as
well.
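A sketch of inspecting the same crash data locally on a cluster, via the ceph crash commands:

    # List crashes; IDs are <timestamp>_<random uuid>:
    ceph crash ls
    # Full dump for one crash: daemon type/name, ceph version,
    # backtrace, and assert details if it came from an assert:
    ceph crash info <id>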
A: It makes more sense for developers to take a look at these. The problem is that the same issue can have different backtraces, and this can happen due to different versions: the code has changed, so the backtrace looks a bit different. There can also be differences due to different compiler versions or compiler optimizations.
A
So
in
the
back
end
we
have
a
crash
processor
that
looks
at
all
these
raw
crashes
and
identify
similarities
among
them
by
taking
those
raw
crashes
and
sanitize
their
back
traces,
and
he
does
it
by
removing
the
offsets
and
addresses
from
all
the
frames,
and
then
it
applies
some
search
and
replace
patterns
and
filter
out
patterns
from
some
frames
that
are
just
noise
in
the
back
trace,
and
then
it
adds
the
assert
data
if
it's
there
and
it
calculates
the
signature
using
a
256.,
and
this
processor
supports
multiple
generations
of
recipes
of
signatures
which
allows
for
backward
compatibility.
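A hypothetical sketch of the idea only (the real processor lives in the telemetry back end and its recipes are more involved; the clone/start_thread filter here is just illustrative of "noise" frames): strip addresses and offsets from every frame, drop noise-only frames, then hash what remains into a stable signature:

    # Sanitize one crash's backtrace and derive a signature:
    ceph crash info <id> | jq -r '.backtrace[]' \
      | sed -E 's/0x[0-9a-f]+//g; s/\+[0-9]+//g' \
      | grep -Ev 'clone|start_thread' \
      | sha256sum | awk '{print $1}'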
A
So
it
also
supports
the
version
of
the
crash
signatures
that
we
have
on
the
cluster
side
and
then
it
populates
the
database
creating
all
of
these
signatures
for
the
raw
crashes,
but
just
having
the
data
is
not
enough.
We
need
to
take
action
in
order
to
to
do
something
with
those
crash
reports.
A
Otherwise
it
creates
a
new
issue
and
it
knows
how
to
pick
up
the
the
right
project
in
mind
for
that.
A: Another important thing the bot does is identify regressions. Say there was a crash report that it synced with Redmine, and we found that it was a real bug and we fixed it, but then we receive new crash reports with a newer version than the one that has already been fixed.
A: So we have custom queries for these crashes. The first one would be the crash triage queue; this one specifically is for the entire Ceph project, so we'll have here all the crashes that are new, that were opened by the telemetry bot. If you want to look, for example, at just the BlueStore crashes that were opened by the bot, you can choose those custom queries from the sidebar here.

A: We have both the queue and the triage: the queue has everything which is open, and the triage has just the new ones.
A
The
latest
telemetry
crashes
sink
that
we
had
was
mainly
for
16
to
7
crashes,
and
it's
very
important
to
emphasize
here
that
not
all
crashes
mean
safe,
bugs
could
be
hardware
issues,
it
could
be
environment
or
resource
limitations
or
configuration
issues,
or
it
could
be
issues
with
other
dependencies
as
well.
So
there
might
be
many
signatures
linked
with
red
mine,
but
they
not
all
represent
real
sandbags.
It's
really
important
to
emphasize
that
all
right,
so
we
can
take
a
quick
look
at
the
architecture
on
the
server
side
for
the
crash
telemetry.
A: So here, the telemetry report lands on the REST API; it goes to the database, and then the crash processor sanitizes the backtrace and generates the correct signature. There's a Grafana instance that knows how to query this database.

A: Of course, the crash processor updates the database with all the new signatures, and then the Redmine bot syncs those signatures with Redmine. There's another component that I will not talk about today whose essence is to improve signature creation for the crashes, so that basically we have better deduplication for raw reports.
A: It allows searching by sanitized backtrace frames; by versions, either major or minor; across all revisions of signatures; by assert function and condition; by the number of affected clusters; and by crash status. And it allows a drill-down to cluster information, as I mentioned earlier, if you want to take a look at how big the clusters that experience a certain crash are, what versions they currently run, and so on. All right.
A: In order to access the dashboards, developers need to have access to the Sepia lab and to be members of the Ceph organization on GitHub. Users can search Redmine for their backtraces, for specific frames in their backtraces, and for crash signatures. And I just want to emphasize here: if you manually create a Redmine tracker and you add a crash dump there, please do not remove the stack_sig key. It is not a secret, and it really helps the crash bot sync similar issues that we receive through telemetry.
A: All right, let's take a look at the crashes landing page. Here we have all sorts of panels that give us a bird's-eye view of time-series data of all the crashes and their signatures by day. If we want to look at the new crash signatures, for example in the last 30 days, we can do that, and we have a breakdown by versions here, either major or minor; we can learn how many clusters are experiencing them.
A: All right, so we can learn a lot about all of the crashes that we've seen in telemetry in the last 30 days. For example, we can see that there are six clusters experiencing a certain issue that happens only on certain Quincy versions: we can see it in 17.1.0 and 17.2.0.
A: We can click on that and... yeah, it's too big; this is why it's a bit broken now. We can see that there's a total of 11 raw crashes reported, with a breakdown by versions: just one happened in 17.1.0 and 10 in 17.2.0. We can have a look at the sanitized backtraces, and here we can click on a sanitized frame.

A: This is a Python crash, so we can click on that and see all of the other crashes that have this exact frame in them; so not necessarily the same issue.
A: So we can see, for example, their usage, how big they are, and we can learn about their current and recent versions. You can see that one of them has mixed versions, so not all of the cluster is necessarily upgraded to Quincy. And here we can see the actual raw reports.
A: All right, I want to take a look at a few examples of success stories that we've had with telemetry. As you know, we launched Quincy a couple of weeks ago, and telemetry really helps us monitor crash reports of new releases; we used it for Quincy as well. In the example that we just saw, there were a few crashes that happened in Quincy too.
A: But look at the time frame: we took a very big window here, which is of course too big; then in the versions we chose just 17.2.0, which was released a couple of weeks ago, and here we can take a look. There are currently 28 crash signatures reported so far, and we can see that some of them happened in other versions as well, not necessarily Quincy. And again, as I mentioned, not all of them are real Ceph bugs, but it does help us monitor and better understand. For example, this one has many affected clusters, but it could just be a problem with hardware, or anything that is not related to Ceph.
A
All
right,
let's
take
a
look
at
some
bug,
fixes
that
happened
thanks
to
the
sink
with
the
red
mine,
so
this
one
was
created
by
the
telemetry
bot,
this
tracker,
it
assigned
it
to
the
cfs
project
and
it
filled
up
all
the
relevant
versions
and
the
crash
signatures
that
it
saw
in
the
wild
and
also
one
that
was
created
on
the
back
end
and
then
in
the
description.
A
It
has
a
link
to
to
the
dashboard
and
has
information
about
the
assert
that
happened,
the
sanitized
spectres
and
a
sample
of
of
a
raw
crash
dump,
and
it
was
picked
up
by
the
developers
of
ffs
and
there
is
a
pull
request
that
is
fixing
this
issue
that
was
seen
in
the
wild.
A
This
is
the
page
that
was
linked
from
the
tracker,
so
you
can
see
that
there
are
a
total
of
two
affected
clusters
that
reported
this
issue
with
a
total
of
13
row
reports
and
here's
the
breakdown
by
version
here.
We
can
sorry
click
on
any
of
these
frames
and
see
if
they
happened
in
other
if
they
occurred
in
other
crashes
as
well.
A
But
here
we
see
just
one
example,
which
is
the
one
that
we're
looking
at
and
we
we
can
see
again
the
daily
occurrences
and
if
we're
curious,
what
version
the
clusters
currently
have.
So
we
can
see
that
one
of
them
actually
upgraded
to
quincy.
So
maybe
this
can
help
us
narrow
down.
whether the crash happens just in Pacific and not in Quincy. And here, basically, we would have the contact details for users that identified themselves and experienced those issues.
A
All
right
now
we
have
another
example
for
another
crash
that
was
reported
through
telemetry
and
was
also
picked
up
by
this
time,
rgw
team
and
they
had
even
backwards.
So
this
issue
happened.
It
was
reported
for
1627,
but
they
realized
that
it
actually
went.
Sorry
yeah.
It
also
happened
in
octopus,
so
it
helped
us
discover
an
issue
that,
even
though
it
was
reported
just
for
one
version
needed
to
be
backported
even
further.
A: All right; then I want to talk about this tracker real quick. So this issue was first... oh sorry, is it this one?

A: Yes. So it was opened by the telemetry bot, and when it was found during the bug scrub, we discovered that we need more information to debug a crash like this. The user actually found this tracker by searching for it, and they provided us with some additional information; so users can respond to whatever we see in telemetry through the bug tracking system. And I mentioned earlier that the bot can detect regressions.
A
So
we
can
take
a
look
at
this
tracker
here
that
it
is
resolved
and
the
version
here
is
1528,
but
we
can
see
that
a
new
tracker
was
opened
recently
by
the
telemetry
bot
and
it
says
that
new
crash
events
were
reported
via
telemetry
with
newer
versions
then
encountered
so
far.
This
happened
because
that
tracker
is
related
to
other
trackers.
This
is
why
it
picked
up
16
to
zero,
but
it
linked
it
to
the
previous
issue,
so
might
be
a
regression
might
not
be
a
regression.
A
So,
like
I
mentioned
sometimes
just
the
raw
crash
reports
are
not
enough
and
users
identify
themselves.
So
we
can
contact
them
and
ask
for
more
information
to
better
debug
an
issue,
and
this
issue
was
first
reported
in
a
bugzilla.
A
You
can
see
about
a
year
ago
and
it
was
picked
up
by
by
the
bot,
and
that's
thanks
to
to
the
fact
that
we
had
the
stack
signature
here.
So
there
were
similar
crash
reports
through
telemetry
that
the
the
bot
could
scan
redmine
and
update
an
issue
instead
of
opening
a
new
one
and-
and
we
saw
that
we
have
links
here
to
to
these
in
telemetry.
We
can
see
that
there
are
49
affected
clusters
by
it
and
this
helped
to
prioritize
this
issue.
A
D: Yeah, so I think this one in particular is interesting, because this is an issue that we saw, as Gary mentioned, downstream, but we hadn't seen it upstream, and when you look at the crash it seems very intuitive, like it should have shown up. That was where my curiosity arose, and I checked the dashboard and saw that there are users hitting it. Clearly there was something missing in our integration tests that was not catching it.
D: Junior was assigned this bug, and he did a great job of identifying why we were not catching it. Going into the specifics: there is a way in cephadm to remove daemons, and there were no tests that were actually exercising reducing daemons in a cluster, in this case monitors, which is why we would see this. It also turns out that if you use the regular manual procedure of removing monitors, you wouldn't hit this crash, which explains why the other tests weren't catching it. So essentially that's where one extra data point helped us prioritize this bug, and this is a real issue which we are fixing and now also reproducing in Teuthology.
A: Thanks. I want to take a look again at this tracker and see how, let's say that for some reason the telemetry bot had not synced what we saw in telemetry with Redmine, maybe because it was an older version or we just hadn't synced it yet, we could manually search the telemetry dashboard for it.
A: So, for example, we can see that we have the backtrace included here, and if we scroll all the way down we can see the function name. We can copy even a small part of it, go to the search page, and search just for this specific function. We can leave the five-year window, that's fine, and we can see that there are seven crash fingerprints, or crash signatures, reporting a pretty similar issue. The reason that they are not all grouped together, in case they really are the same, is that the backtraces were different enough and the filtering did not detect that it is indeed the same issue. This is one thing that we're still working on improving.
A
So
so
again,
even
even
if
you
see
any
problem
there
out
into
even
in
tautology
or
in
downstream
or
wherever-
and
you
don't
find
it
in
tracker-
please
use
the
dashboard.
You
can
again
search
just
for
the
assert
function.
There
was
also
an
asserta
condition
in
this
case.
So
if
we
can
well
this
one,
I
guess
it
was
good
enough,
but
sometimes
the
search
condition
is
not
it's
not
very,
not
very
I'll.
Give
us
too
many
details.
A
So
in
this
case,
yeah
still
7
same
thing
or
we
could
just
use
some
frames
in
the
back
trace,
but
here
it
will
not
help
us
to
search
for
frames
that
we
filter
out,
because
we
we
will
search
only
in
the
sanitized
vectorize.
So,
for
example,
if
I
search
for
this
frame
over
here,
it
would
better
it
would
probably
find
better
results
so
yeah.
So
you
can
see
that
there
are
three
crashes
that
were
not
mentioned
here,
probably
a
different
way
of
execution.
A
So-
and
this
is
this
is
an
important
point
here.
If
you
don't
find
it
in
tracker,
please
use
the
dashboard
it
can
it.
It
might
not
be
synced
with
tracker,
yet
so
that's
important
to
to
emphasize
yeah,
and
there
are
some
other
use
cases
that
telemetry
was
very
useful.
A
So,
for
example,
we
wanted
to
know
whether
a
file
store
can
be
deprecated,
so
we
looked
at
the
data
that
we
have
so
far
with
telemetry
and
produced
these
panels
to
see
how
many
files
or
versus
blue
store
osd's
are
out
there,
and
you
can
see
that
we
have
a
breakdown
here
by
major
versions.
A
So,
for
example,
in
pacific
there
are
very,
very
few
demons
that
are
reporting
files
or,
and
if
you
want,
we
can
just
have
a
breakdown
by
just
specific.
For
example,
if
you
want
to
see
the
minor
versions
that
are
reporting.
A
And
I
think
I
think
we
announced
that
it
will
be
deprecated,
and
this
data
point
was
very
helpful,
so
we
we
did
use
the
survey
for
that
as
well
and
probably
mailing
lists,
but
telemetry
give
us
real
real-time
data
and
it
makes
your
voice
heard
as
users.
So
so
it
helps
us
to
better
understand
what's
going
on
in
the
wild,
so
you
can
see
the
ratio
for
blue
store
versus
file
store
in
16
two
to
seven.
A
We
were
also
asked
whether
the
regular
code,
clay
plugin,
is
being
used
in
the
field,
so
we
did
the
same.
We
had
panels
for
that
and
we
saw
that
it
is
being
used
by
real
clusters
in
the
field
and
there
were
no
related
crash
reports
to
that.
So
another
real
data
point
to
make
decisions,
and
I
think
now
we
are
even
developing
it
further.
If
I'm
not
mistaken,
it
was
this.
This
code
was
donated
by
a
researcher
and
but
we
did
not.
A
So
for
all
the
users
out
there,
please
join
us
with
opting
into
telemetry
with
the
staff
dome
trion.
You
can
see
that
it
is
super
super
useful
for
us
and
it
helps
us
make
a
better
product
more
robust
and
have
a
higher
quality.
A
Here,
yes,
so
I
will.
I
will
have
a
link,
a
link
to
that
as
well,
but,
basically,
on
that
on
that
page
you
can
just
click
here
and
see
all
of
the
related
all
the
related
dashboards.
But
I'll
I
will,
I
will
add,
a
link
to
that
in
the
ether
pad
as
well.
That's
a
good
question
and
how
can
developers
access
collected
reports
so,
as
I
I
showed
with
them
faster
in
the
cluster
page.
We
have
all
the
reports
here.
A
We
can
see
raw
reports
here,
but
if,
if
there's
a
need
to
have
them
in
another
format,
we
we
need
to
have
access
to
the
database.
C: And it's understandable; but if the reporting comes mainly from the upstream deployments, it is likely very skewed towards developers and maybe less towards production, and then if you make decisions for future directions based on those reports, maybe they're a little bit biased towards those developer upstream deployments, right? So are there any plans to promote telemetry adoption by the people who use the distributions rather than upstream?
A: Another thing is that there are clusters whose admins really want to report telemetry but cannot, because they're air-gapped, and we're thinking of supplying a solution for that. So these are, again, real deployments, but they cannot contribute their data because of this issue of being air-gapped.
D: Going back to the development versus real cluster question: I think the scale of the cluster tells us a story about whether it's a real cluster or a dev cluster, and the other thing is that most of the development only happens on the master branch, or the main branch, so the version number is an indication of whether it's a development cluster or a real cluster as well.
E: The demo, oh sorry, yeah: the demo you did, just for a brief period, with the bug that I was working on, was really helpful; just showing how we could find the actual crash signature using strings and using assert functions. That was nice, so thank you for doing that.
A: Thanks, I'm very happy to hear that. All of the information is there, just not yet linked with Redmine, because we want to better dedupe the crashes so that we're not overwhelming developers with crashes that could be better deduped. So it's there; we just need to actively search for it. Yeah.
B: Yeah, I just had a quick detail to add about the developments we're doing, or the idea we have, to collect or track unavailable data in clusters. That's not a data point that's currently being collected, but it's planned for Reef.
B: So essentially the idea is that in Ceph clusters there are ways to identify unavailable data, through PG states and when PGs were last active; and there are ways to see that in Ceph clusters right now, through warnings that pop up about data availability and by looking at the PG map.
B: Excuse me; but there aren't ways to track that data over time right now. So we are thinking about ways to take that data, look at the PG states and when they were last active, and calculate some sort of data availability score that can indicate when data was available; so that, say, over the course of a week, your data availability score was 80, something like that.
B: And then the goal is to include that in telemetry, so that we can have reports of data availability tracked over time and collected through opting into telemetry. That is an idea for the next release, Reef.
A: Yes, thanks for mentioning that. This data is going to be collected in the perf channel; we might add some highlight information in the basic channel as well, but yeah, this can really help us better understand deployments. So please, again, like we said: when you want to make your voice heard, please opt in to telemetry. We really just care about the data, and we're very open to feedback.
A
Let
us
know
if
you
have
any
ideas
for
improvement
or
anything
we
can
do
better,
we'll
be
very
happy
to
do
so
and
developers.
Please
use
the
dashboard,
the
crash
dashboard.
B
I
have
just
one
more
question
about
the
the
general
ui
of
the
dashboard,
so
the
the
public
telemetry
link
is
that
providing
more
of
just
an
overview
of
all
of
the
data
collected
versus
the
sepia
link,
which
is
where
developers
can
find
the
crashes.
Is
that
the
difference
there?
Yes.
A: Yes. That stack search page, I will add a link to it in the Etherpad as well. And please remember to check the time frame here; this is really important, because sometimes it's just the last 30 days but you actually want to see some more data, so you need to go back a little bit more, and make sure the fields are right. So if, for example, I now want to see, just in the last 30 days, all the crashes that happened in, let's say, 16.2.7,
A
And
then
I
don't
understand
that.
Okay,
there
are
just
two
crashes,
but
that's
because
I
have
this
search
for
this
string
in
the
batteries.
So
if
I
move
it
I'll,
just
I'll
see
everything
everything
on
the
last
30
days
for
16
to
7..
A
I
can
also
see,
but
let's
say
that
I'm
curious
what
also
occurs
in
quincy
any
version
of
quincy
so
I'll.
Add
the
major
affected
version
here
and
I'll
see
that
there
are
17
crashes,
that
happened
in
1627
and
any
version
of
quincy
and
of
course
they
could
happen
in
other
versions
as
well.
But
this
would
be
the
defaults
and
also
we
can
see
only
a
new
fingerprints
in
this
time
frame.
So,
for
example,
this
signature
here
was
first
seen
in
2019
and
last
occurred
in
2022
like
yesterday.
A
So
let's
say
I
just
care
about
new
fingerprints
in
the
last
30
days,
so
I
will
change
that
to
only
new
fingerprints
and
there
aren't
everything
that
happened.
Everything
that
we
saw
happened
prior
to
the
last
30
days,
so
so
that
that
can
really
help
also
to
narrow
down
any
any
issues
that
we
want
to
to
look
at.
Of
course,
you
can
search
by
demons
here
as
well.
A
You
can
search
by
the
stack
signature,
the
crash
signature,
sorry
either
version
two
or
version
one,
and
if
you
want
to
search
by
more
than
one
string
in
the
back
trace,
you
have
three.
You
have
three
substrings
that
you
can
search
for,
so
this
can
also
help
narrow
down
relevant
issues.
A
Yeah
there
are
some
status
search
as
well.
It's
very
it's
very
extensive,
so.
A
Sure
pleasure,
yeah
and
if
you
have
any
questions,
please
reach
out
and
currently
again,
as
I
mentioned,
there's
work
done
to
improve
the
duplication
of
the
crash
signatures.
So
casey
helped
with
his
feedback
with
the
the
bug
scrub
that
they
did
on
the
most
recent
sync
for
16
to
7..
It's
not
an
easy
problem
and
try
to
apply
some
ai
tools
in
order
to
better
dedupe
that
so
work
in
progress.
A
All
right
thanks,
everyone
and
we'll
see
you
in
the
next
tech
talk
or
any
other.