From YouTube: Scaling Monitoring at Databricks from Prometheus to M3
A: Yeah, so for those of you who haven't heard of Databricks: we were founded in 2013 by the original creators of Apache Spark. We're a data-and-AI unified analytics platform as a service, and we serve over 5,000 customers. We're still a startup, but we've grown pretty big: we have more than 1,500 employees, with more than 400 engineers, and an ARR of $400 million or more.
A: In this talk we'll cover how we monitored Databricks before M3, then I'll talk about how we deployed M3 at Databricks, including architecture and migration, and then Nick will cover some of the lessons we learned in this process, including operational advice, important things to monitor in an M3 cluster, and how we do updates and upgrades.
A: First, I'm going to provide some context about the role that monitoring plays for Databricks engineers. We have two main metric sources. The first is our internal services, which run on Kubernetes clusters that we manage in-house. The second is external services running co-located with the VMs for customer Spark workloads; these run in customer environments.
A: We've been running a Prometheus-based monitoring system to monitor these targets since 2016, and all service teams rely heavily on it. Service owners write and emit their own metrics from their own services and use those metrics for dashboarding. We use Grafana for dashboards, engineers write their own queries (most engineers are PromQL-literate), and engineers maintain their own alerting rules.
A: For those of you who are not too familiar with Prometheus: it's a single-node monitoring solution used for event metrics and alerting. It uses a pull-based model: it scrapes metrics from other services, stores them as time-series data, and can then serve queries and alerts using this data.
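For reference, the pull model just described corresponds to a minimal Prometheus scrape configuration like the sketch below; the job name and target are invented for illustration, not Databricks' actual setup:

```yaml
# prometheus.yml -- minimal sketch of the pull model described above.
# Job name and target are illustrative.
global:
  scrape_interval: 30s        # the talk later mentions a 30-second interval

scrape_configs:
  - job_name: my-service
    static_configs:
      - targets: ['my-service:9090']   # Prometheus pulls /metrics from here
```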
A: So in each region we have two different Proms for the two types of monitoring targets that we have. The first is prometheus-normal: the instance that scrapes all the Kubernetes pods local to the cluster. Then we have a prometheus-proxied instance for proxied metrics from services that live outside of our Kubernetes cluster.

A: For this second Prometheus instance, we maintain a whitelist so we only ingest some metrics, to reduce metric volume, since the metrics from our customer environments are the higher-cardinality workloads. The reason we have two separate Prometheus servers, instead of just one server per region, is scaling limitations: we found that the metrics from both of these sources would not fit on a single Prometheus server.
A: Users interact with this monitoring system in two main ways. The first is alerting: each of these Prometheus servers evaluates alert rules and issues alerts to Alertmanager, and Alertmanager forwards the alerts to response channels like PagerDuty or Slack. The second workflow is querying and dashboarding.
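To make the alerting workflow concrete, here is a hedged sketch of a Prometheus alerting rule of the kind service owners maintain; the metric, threshold, and labels are invented for illustration:

```yaml
# rules.yml -- illustrative alerting rule; metric names and threshold are made up.
groups:
  - name: example-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="example-service",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="example-service"}[5m])) > 0.05
        for: 10m                 # must keep firing for 10 minutes before alerting
        labels:
          severity: page         # Alertmanager routes on labels like this
        annotations:
          summary: "5xx ratio above 5% for 10 minutes"
```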
A: Here are some numbers to give you a picture of the scale we were running this Prometheus monitoring system at. We ran in more than 50 regions across different cloud providers, and we monitored 100+ microservices, with an infrastructure footprint of 4 million+ VMs of Databricks services and customer Apache Spark workers. And because of our architecture, a single Prometheus server handled all the metrics from the customer environments.
A: That server was pretty huge. At its peak it was handling close to a million samples per second; it had a pretty high metric churn rate, with a lot of metrics from short-lived Spark jobs persisting for less than 100 minutes; and the disk usage was pretty high, at 4 terabytes for only 15 days of retention. We were also running on a really big machine, with 64 CPU cores and 2 terabytes of RAM.
A: In terms of the user experience, users had to deal with a sharded view of metrics: they had to be aware of which metric store they wanted to query, whether it was prometheus-normal or prometheus-proxied. Users also had to deal with query slowness: big queries would take a long time, sometimes they would never complete, and sometimes they would even cause the Prometheus server to fall over.
A: Users also had to deal with a shorter retention period: they could only see metrics from the last 15 days instead of the last 90 days, which would have been ideal, and they couldn't really see metrics spanning different release cycles, since our release cycle is only two weeks. Users also had to deal with the metric whitelist, which could only include a small subset of metrics.

A: And occasionally, for us as the operators, when we were dealing with capacity issues, we would even have to actively remove metrics from the whitelist just to keep our Prometheus server running.
A: So, with these scaling bottlenecks and pain points, we really needed to find a more scalable monitoring solution. Some of our requirements were: the system needs to be able to handle high metric volume, cardinality, and churn rate, since Databricks was growing rapidly and we needed a system that could keep up.
A: We also wanted 90-day retention, so that engineers can monitor their service release over release. Also, importantly, we needed it to be PromQL-compatible, since everything is already built on top of Prometheus: we didn't want to migrate workflows like alerting rules, queries, and dashboards to another system and teach users a different language.
A: We also wanted seamless operations: no metric gaps during updates, and less manual work for updates. It would also be nice if the system had been battle-tested in a large-scale production environment, and good if the system was open source, so that we'd have more freedom to manage it on our own and make it more suitable to our metric workload; we also felt there was more transparency into the cost of running an open-source monitoring system. Here are some of the alternatives we considered in mid-2019.
A: We considered some open-source solutions like Cortex and Thanos. We prototyped Thanos in late 2018; it wasn't that mature back then, and we weren't really comfortable using it at our scale in production. We also prototyped Cortex, and we found it wasn't really suited to metric workloads with a pretty high churn rate.
A: We also considered some hosted solutions like Datadog and SignalFx, but they were too expensive. So, given our requirements and the alternatives we considered, why did we pick M3? M3 fulfilled all our hard requirements: it was designed for large-scale workloads, it's horizontally scalable, and it exposes a Prometheus-compatible query API endpoint as well.
A: So now I'm going to cover how we deployed M3 at Databricks: specifically, the different decisions we made along the way and how we arrived at our final architecture.
A: The first setup we considered, remote-writing from our existing Prometheus servers, would result in some improvements. First, since all metrics would be remote-written into the M3DB storage database in the region, we would get rid of the sharded view of metrics within a region: users could have a consolidated view of both prometheus-normal and prometheus-proxied metrics in one region, rather than having to query both separately.
A: Though this architecture was really simple, and would have been the least amount of work to incorporate M3 into our system, we did find some trouble with remote-writing from Prometheus servers. Specifically, we couldn't remote-write at the scale we needed, especially for the prometheus-proxied instance, which was handling all the higher-cardinality metrics from our customer environments.
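For reference, remote write from Prometheus is configured roughly as in the sketch below; the queue parameters are the knobs you would tune when pushing at high volume. The endpoint URL and values are illustrative, not Databricks' settings:

```yaml
# prometheus.yml fragment -- hedged sketch of remote write to an M3 coordinator.
# URL and queue values are illustrative only.
remote_write:
  - url: http://m3coordinator:7201/api/v1/prom/remote/write
    queue_config:
      capacity: 10000            # samples buffered per shard
      max_shards: 200            # parallelism of the send loop
      max_samples_per_send: 500  # batch size per request
```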
A: Unfortunately, M3 doesn't have an out-of-the-box rule evaluation engine; it mainly just serves as the metric storage database for writing to and querying from. This led us to building our own rule engine, which was more in the spirit of designing an architecture with more lightweight and scalable components, each serving a narrower purpose. For this, we ripped out the rule management code from open-source Prometheus and deployed that as our Prometheus-API-friendly rule engine.
A: We pass our original Prometheus monitoring system's alerting rule configurations into it. The rule engine issues alert queries to M3DB; M3 handles the evaluation of the query and returns the query result back to the rule engine; the rule engine does some extra processing, for example checking the `for` duration of the alert and adding any extra external labels to the alerts; and then it issues the alert onward to the user.
A: So far, we've covered the components we set up to interact with M3: the scrape agents, the metric proxy service with the remote writing, and the rule engine.
A: The storage cluster consists of multiple replicas; in our case we use three. Each replica has multiple pods, and each pod has a disk attached to store metrics. To scale up the cluster, we just increase the number of pods in each replica. Then we have the M3 coordinators: these let us interact with the storage cluster to read and write metrics. The coordinators have M3 Query embedded in them, so for write requests, a coordinator receives the request, unpacks it, and writes it into all replicas of the storage cluster; read requests are served similarly through the coordinators.
A: We also run the m3db-operator. This is optional in an M3 system, but we use it because it's really useful for a Kubernetes setup: it automates scaling the cluster up and down, and also automates creating and deleting the storage cluster, so we don't have to manually manage the three different replicas.
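To sketch what the operator manages, here is a hedged example of an m3db-operator cluster spec with three replicas (isolation groups), scaled by changing the instance counts; the image tag, counts, and namespace preset are illustrative, not Databricks' values:

```yaml
# Hedged sketch of an m3db-operator M3DBCluster resource. Values illustrative.
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: example-m3db
spec:
  image: quay.io/m3db/m3dbnode:v1.0.0   # pin a version you have tested
  replicationFactor: 3                  # the three replicas described above
  numberOfShards: 256
  isolationGroups:                      # one group per replica/zone
    - name: group1
      numInstances: 6                   # scale up by raising this count
    - name: group2
      numInstances: 6
    - name: group3
      numInstances: 6
  namespaces:
    - name: metrics
      preset: 10s:2d                    # built-in preset: 10s resolution, 2d retention
```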
A: One issue we did have initially with this basic setup was that the M3 coordinators had a lot of noisy-neighbor issues. For example, if a user submits a heavy query, it might take up all the CPU and memory on the coordinator, which might impact the write path and cause data loss, or impact the rule-evaluation workload from the rule engine and affect the availability of alerts.
A: So we created four different groups of coordinators. We have a group that handles writing: many small replicas of write coordinators to handle incoming requests from our scrape agents and the metrics proxy service. Then we have a group designated for the rule engine; this handles a regular, more predictable workload of querying for rule evaluation,
A: since the rule engine just submits the same set of alert rules at regular intervals to M3. And then we have two groups to handle ad-hoc querying. The first is the regional group, which returns query results based on the metrics in the regional cluster. The second is the global querying group, which reads from M3DB clusters across regions and provides a global view to our users.
A: We wanted to separate the regional and global coordinator groups since the global group's configuration is really different: it requires setting up connectivity across clusters and has different security configurations. But, more importantly, our users mostly use the regional view of metrics, and we knew that if we just exposed the global view, it was unlikely that users would make the extra effort to always add a region label filter to each query.
A: Now that we'd separated the coordinator groups, our two most important workloads were stable: the write path, where stability is important to prevent data loss, and the rule-evaluation path, which is highly critical for us to always have alerts.
A: To monitor M3 itself, we decided to set up a vanilla, lightweight Prometheus server. This Prometheus server only scrapes M3-related components; it has no disk attached, and its retention is only a couple of hours, so it's very easy to maintain, since we consider it to be stateless and restarts happen really quickly.
A: The metric retention period is short, but it's sufficient for us, since we mainly use this Prometheus to alert us if any M3 components are down, which doesn't require looking back at metrics over the past couple of days. This Prometheus server issues alerts straight to Alertmanager, and it's completely independent from the M3 system.
A: We also have a global M3-monitoring Prom for a longer-term view of our M3-related metrics, for example to track disk usage, memory usage, or the number of reads per second. We use the Prometheus federation feature here to federate all metrics from the regional M3-monitoring Proms to be persistently stored in this global Prom.
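Prometheus federation is configured as an ordinary scrape job against the /federate endpoint. Here is a minimal hedged sketch; the job name, matcher, and target are illustrative:

```yaml
# Global Prom scrape config -- hedged sketch of Prometheus federation.
# Matcher and target are illustrative.
scrape_configs:
  - job_name: federate-m3-monitoring
    honor_labels: true                 # keep the original regional labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"m3.*"}'              # pull only M3-related series
    static_configs:
      - targets: ['m3-monitoring-prom.us-west.example:9090']
```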
A: During the migration, we were dual-writing metrics to both Prom and M3 storage, and we were evaluating alerts in both the old Prometheus system and the new M3 system. So we just sent all the alerts from the new M3 system to a blackhole receiver that didn't fire alerts to any real receivers, and we also opened up a querying endpoint, but only internally to the observability team and not to the rest of the eng org, so that we could do some behavior validation.
A: The third step was an incremental rollout of querying traffic and alerts. For ad-hoc querying traffic, we staged it across environments and did a percentage-based rollout of traffic from Prometheus over to M3. For alerts, we did a per-service migration: we replaced alerts emitted by Prom with alerts emitted by M3 for less critical services first.
A: Here's a diagram to illustrate how we did the rollout of the ad-hoc querying traffic. It's a pretty simple setup: we just put an NGINX in front of the querying endpoints of Prometheus and M3, split the query traffic across both, and over time slowly increased the percentage of traffic directed to M3.
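In a Kubernetes setting, one way to express this kind of percentage split is with ingress-nginx canary annotations. This is a sketch of the idea, not necessarily how Databricks configured their NGINX; hostnames and service names are invented, and a primary Ingress for the same host pointing at the Prometheus endpoint is assumed to exist:

```yaml
# Hedged sketch: send 10% of query traffic to the M3 endpoint via
# ingress-nginx canary weights. Names are invented.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: query-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # raise gradually toward 100
spec:
  rules:
    - host: metrics-query.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: m3-query-coordinators   # hypothetical service name
                port:
                  number: 7201
```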
A: In addition to this label, we added a source label to indicate whether the alert was submitted by the old Prometheus system or the new M3 system, and then we made some routing configuration changes in Alertmanager. If we want to roll out M3 alerts for a less critical service first, the M3 alerts from that service go to the real receiver, while the equivalent Prometheus alerts for that service go to the blackhole receiver that doesn't alert anyone.
A: We found this rollout strategy to be really nifty, because it was all controlled in Alertmanager at the routing level, and Alertmanager can hot-reload any new config quickly without a restart. So it was very easy to advance the rollout, but, more importantly, it was easy to roll back if we found any issues, which was good, since alerting is a highly critical service: rollback should be able to happen quickly.
A: The outcome of this one-year migration is that M3 now runs as the sole metrics provider in all environments across clouds. The global querying endpoint is available via M3 for all metrics; it's still in beta, so we're still testing it and rolling it out. And the user experience is largely unchanged: our users still use PromQL for alerting rules and dashboards, and we still use Alertmanager for all our alerts.
A: So now we have much higher confidence that we can continue to scale this system in the upcoming years, since Databricks is continuing to grow rapidly as a company and will keep processing larger and larger workloads. And, most importantly, the observability team doesn't have to deal with a giant Prometheus server anymore that runs on two terabytes of RAM and takes multiple hours to restart. Okay, on to Nick for lessons learned.
B: Thanks, everybody. Cool, so I'm going to cover some of our lessons learned in operating M3 over the past year or so. Some of the things I want to talk about are: the system metrics you should be looking at if you're monitoring M3, some general operational advice, some things we found really helpful to alert on, and a little bit about how we do upgrades and updates.
B: I just want to give a brief overview first, because when you're talking from the trenches, the perspective can sometimes sound negative, since I will be talking about issues we've run into. But I want to emphasize at the start that overall, M3 has been amazingly stable for us. Y talked about how much trouble we had operating Prometheus; it was a constant source of alerts and trouble for us.
B: And we operate more than 50 different deployments of M3, and it's just really stable. We have a few places where we've run into issues, and I'll be talking about those; they are obviously the ones at the highest scale, where you're really pushing against the limits of what we can do.
B: But overall it's been an extremely stable thing for us, and honestly, the biggest problem we have is just that we keep running out of disk space in places because our metric load keeps going up, and that's obviously not M3's fault; that's our fault for needing to be better about how we deal with incoming metrics. So that's the positive side. We have had some problematic things, so I'm going to dive into those, because dealing with problems is obviously an interesting thing to hear about.
B: Before I do that, just a little bit more about how it runs. Like I said, we have a large number of clusters, so we have to automate things. We use a combination of Spinnaker and Jenkins to do templated applies to update things. That's where having the operator is really nice, because it makes it pretty easy to do those updates in our bigger clusters.
B: We process close to a million samples per second and about 200,000 reads, so we are more write-heavy; that's definitely the workload we have at Databricks. Cool, so I wanted to jump into, at the top level, the things we found most important to keep an eye on while you're operating. We look at how much memory is being used; these are on the M3DB pods.
B: We have seen that if you are steadily over 60% memory usage, that can be bad, mostly because certain things happen that can cause memory spikes, and if you're consistently over 60%, those can get you all the way up over 100%, and then boom, you OOM. It's nice that, because it's distributed and highly available, if only one pod OOMs it's not a big deal: it recovers, nobody even notices. We don't even get alerted when that happens for a single one.
B: But if all of your pods are consistently over 60%, you have a good chance of multiple ones OOMing, and then things can be bad. So how can you resolve being steadily over 60%? You can scale up your cluster, or you can reduce incoming metric load; those are the two primary ways we've found. Obviously, if you're in a more read-heavy workload, you may need to do something like reduce the amount of reads that are happening.
B: One thing I'll mention here: the new version of M3 has all these nice limits for reading and writing, and they're a really great additional way to put limits on the memory used. We've already set those, so that's not how we resolve these issues, but if you haven't set the limits, that would be another way to try to reduce the amount of memory.
B: I've included the queries here; you obviously don't need to memorize them, I've just put them in as a reference for the way we look at these things. So we look at this particular metric, filtered for our pods. We also need to alert on disk space. Like I mentioned, this is a problem for us just because, as things grow, the cluster can get bigger and bigger, and your disks can fill up. We use predict_linear to watch how full the disks are getting.
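The slides aren't reproduced in this transcript, so as a stand-in, here is a hedged sketch of alerts in the same spirit; the exact metric names, label filters, mountpoint, and thresholds are illustrative, not Databricks' actual rules:

```yaml
# Hedged sketch of the two alerts described above. Metric names, label
# filters, and thresholds are illustrative.
groups:
  - name: m3db-capacity
    rules:
      - alert: M3DBMemoryHigh
        # Working-set memory over 60% of the container limit, sustained.
        expr: |
          container_memory_working_set_bytes{pod=~"m3db.*"}
            / container_spec_memory_limit_bytes{pod=~"m3db.*"} > 0.60
        for: 30m
      - alert: M3DBDiskFillingUp
        # predict_linear extrapolates the last day of usage two weeks out,
        # alerting early so there is time to scale up.
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/var/lib/m3db"}[1d],
                         14 * 24 * 3600) < 0
```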
B: Disk-space usage is actually very easy to predict, so the prediction is pretty accurate, and we alert very early, mostly just to give ourselves lots of time to react; there are always other things going on.
B: Sometimes it takes a little bit of time to get to it, and we've also found that scaling up can take a significant amount of time in the really big regions, so it's nice to give ourselves enough time to deal with this. And again, ways you can fix running out of disk space: scaling up the cluster, like I said; reducing retention, which obviously frees up some disk space; or reducing the incoming metric load. And again, here's the query.
B: Like I mentioned, cluster scale-up can be slow. There has been a good amount of work on improving this bootstrapping time in the newer versions of M3.
B: We are a little bit behind on the update schedule, so we're hoping to see some improvements in our cluster scale-up time as we upgrade to the newer versions of M3. But I would encourage you, if you're operating M3, to do some testing around how long this cluster scale-up takes, so that you can set your limits, and how far in the future you need to alert for these kinds of things, so that you know how to react to them. Cool.
B: So those were probably the most important things we need to alert on; I'll get to some of the other, smaller things in a little bit. But I also wanted to give a little bit of general advice that we've accrued over our time operating M3. One thing is: try not to add a lot of custom annotations, labels, and configs on top of the deployments.
B: Let the operator just do its thing. Like I mentioned, do observe your query rates and set limits: look at how your cluster is being used in a good state, and try to set limits so that you can prevent a bad state from occurring if a giant query comes in, or things like that.

B: One thing that I think we waited much too long to do at Databricks was to have a really good testing environment. We rolled this out in all of our clusters and it was working pretty well, but a monitoring system is something that everybody relies on all the time, which means that your dev clusters are where other engineers do their development.
B: So for us, our dev clusters are actually quite important to have good monitoring on, because people care about observing how their test clusters are running. So we needed sort of a dev-dev, I guess, an M3 dev cluster, which we now have; it took us too long, but it was really important.
B: I think it's important to have a place where you can quickly iterate on rolling out new versions and testing load. It's important to be able to just throw away the data there, so that if you're testing some stuff out and it doesn't work, you're not scared of ruining your data. And I think it's also important to try to have that at scale, and this is non-trivial, with load generation and so on, because if it's truly dev-dev, you're not going to have a lot of stuff running there naturally.
B: So there is some work to doing that, but I think it's very valuable to have, so that you can be aware of how your production clusters are going to behave without testing it in production, because only testing high load in production is not a recipe for great success. I would also encourage you to have a look at some of the M3 dashboards that are out there and learn what these metrics mean; it can be really helpful.
B: As Rob mentioned, M3 has a ton of features, and as a metrics system it also has a lot of its own internal metrics. The dashboard I've linked here is the one the M3 web page mentions. I would say that this one linked here, from grafana.com, is somewhat developer-focused.
B: I think it's built to help people who are working on M3 understand what's happening and debug issues, and it's useful for that, but I would suggest looking through it, understanding which of the metrics are more useful from an operational perspective, and making a dashboard with your own key metrics. I'm not going to cover all of it; for future reference, this is what one of our internal dashboards looks like. We look at a lot of the more high-level things that show how CPU is doing and how memory is doing, and you can see in that memory panel how it goes up and down a little, but we try to keep it at about the 50 to 60 percent level of what's available. I'm not going to cover all the other stuff that's on here, but basically these are some of the things we have found really useful to monitor for understanding what's happening, from a more operational perspective rather than the really internal perspective of a developer working on M3. Cool; so, a few other things that we alert on that are worth looking at.
B: We do try to filter the incoming metrics and prevent them from coming in late, but I will say that although the Grafana agent has been pretty good for us, one problem it does have is a tendency to sometimes try to write old data, and it's hard to get it to not do that. So you can sometimes see spikes in this from that, and then you need to go kick the agent to make it stop. We look at the rates of both write errors and fetch errors.
B: These are good ones to be watching, mostly because they represent the user perspective. They say: as a user of M3 trying to do something like write a metric in, or trying to do something like issue a query, am I seeing errors? There can be all kinds of other errors happening under the covers, but if these two metrics are steady, then from a user perspective you're kind of okay; you're meeting your SLO. So we monitor those,
B: and they issue some high-priority alerts: if you're getting write errors, something's bad, you're not able to write metrics into the cluster; if you're getting fetch errors, something's bad, you're not able to query the cluster, and your users will be seeing issues.
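A hedged sketch of such user-facing error alerts is below. The metric names here are placeholders, not confirmed M3 metric names; substitute the error counters your coordinators actually expose (check their /metrics endpoint):

```yaml
# Hedged sketch of write/fetch error-rate alerts. Metric names are
# placeholders -- replace with the counters your coordinators expose.
groups:
  - name: m3-user-facing-errors
    rules:
      - alert: M3WriteErrors
        expr: rate(coordinator_write_errors_total[5m]) > 0   # placeholder metric
        for: 5m
        labels:
          severity: page
      - alert: M3FetchErrors
        expr: rate(coordinator_fetch_errors_total[5m]) > 0   # placeholder metric
        for: 5m
        labels:
          severity: page
```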
B: Another one that we monitor, and I wanted to mention this one because for us it can be a really big issue, though it may not be depending on your deployment: we look at how many out-of-order samples are being written. The reason we do this is that if you are double-scraping pods or services, that can cause all kinds of craziness in your metrics, because values can bounce around, and the counter semantics of Prometheus make it think that crazy stuff is happening; and because we operate a largely pull-based architecture,
B: this can cause a lot of false alerts for our customers. It is a little bit of a tricky one to monitor, because some amount of out-of-order arrival is expected and doesn't mean there's a problem. So I put an X here, because you'll probably need to look at how it looks in your cluster.
B: You'll want to inhibit this during node startup, but in normal operation you should be able to get a good baseline for what your out-of-order write rate is, and then if that spikes, it can be indicative that you've somehow messed up the configuration of your scraping. So that's another one that has bitten us. Great.
B: So, talking a little bit about upgrades and updates: I think Rob mentioned that M3 really focuses on having good forward and backward compatibility, and we have experienced that. We are not scared of doing updates to new versions; we have not seen really any issues.
B: I only mention this because it's literally the only issue we've run into: there was a tiny query-evaluation regression, where I think they changed something in the query engine; it was not a big deal and was fixed relatively quickly. We've been running M3 for a while, through a lot of the early pre-1.0 releases, so there was a relatively large number of updates we've gone through, and it's been very smooth for us.
B: So that's been great. We are now slowly moving to 1.0 throughout our clusters, and that has also been very smooth. Rob kind of mentioned that there were some config changes and made it sound like a big deal, but honestly, there weren't that many. We have a pretty complex system for programmatically generating our configs, so it was pretty easy for us; I suppose if you have manual configs it might be more work, but for us it really wasn't a big deal to update to 1.0.
B: One thing to be aware of: there were some API changes. This is probably only relevant if you have built up institutional knowledge around M3 already, but we had to go through a lot of our runbooks and change some of the API paths we tell people to hit for things like changing retention or updating placements. That's sort of advanced usage, so as a normal user of M3 you probably won't run into any of that.
B: I mentioned that we manage all of our upgrades and updates via Spinnaker and Jenkins. The one sort of minor issue we have had here is that, up until now, there has been a lack of fully self-driving updates. We're very bought into being a Kubernetes shop, and we count on our pipelines being able to do updates by just calling kubectl apply on a new template.
B: This was not fully working until recently, but as Rob mentioned, with the 0.13 version of the operator this should now be available. That's a relatively recent release; we have not had time to fully test it, but we do believe we'll be able to move to it in the near future.
B: One thing I do want to mention: if you're not doing fully self-driving updates, where you just apply and then have to do some kind of manual intervention to get the operator to do the update, you have to be vigilant that the configs, which can be updated by just calling kubectl apply, stay in sync with the M3DB version.
B: We've had some issues where our pipeline goes and deploys a new version specification and a new config, but the old version is left running, and then we restart it and it says: I don't understand this new config. So that's one thing to be careful of. And then one suggestion I will make, which is probably generally good, but which we found really important during the upgrade process: have a readiness check for your coordinators.
B: Try to make sure that your coordinators, as they come up, are able to talk to M3DB. I'm not going to cover update strategies for Kubernetes, but if you're familiar with Kubernetes, have a rolling update strategy that doesn't restart too many coordinators at once. What we found is that these coordinators are super lightweight, which is great, and we run a whole bunch of them.
B: But that means that if they all restart quickly, which is what happens if you don't have a readiness check, there are so many services restarting that Kubernetes has a little bit of trouble dealing with the churn, and it can leave a number of the coordinators unable to connect to M3DB, just because it hasn't had time to update all the endpoints and make the various service updates it needs to make for that.
B: And then you have a little bit of downtime, and because the Kubernetes control loop is not super fast, it can actually take a while before it churns through and updates everything. So having this readiness check, with a connect-consistency check on the coordinator, will enforce that the coordinators restart slowly and re-establish their connections before the next one goes down, and you can have zero-downtime updates. Cool.
B: So that's what I wanted to mention about upgrades and updates. Then just a little color about metric spikes: in any high-volume system that you operate, you're going to have to have a way to deal with spikes. An example of something we sometimes have: somebody goes and adds a new label to a metric, and it has an absurdly high cardinality, so suddenly your number of time series is going up by a huge amount. This happens.
B: You don't control all the services that get deployed; they can do crazy things. So a great way to deal with this is to be able to identify where it's coming from: have some metrics that you can look at that cover this. The Grafana agents can expose this, and we have our own metrics inside our other systems that push directly into M3. So always have some metric that you can use to see
B: who is producing all of these metrics, and then also be able to cut that source off easily. We have good ways of quickly blacklisting things, because that is extremely preferable to OOMing your cluster. I'm much happier to be able to go to an internal customer and say, "hey, your service is currently not getting any metrics because you did something silly," rather than having to send an email out to the entire company saying, "hey, our metrics system is down in production right now."
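One common way to implement that kind of blacklisting, if your scrape agents speak Prometheus configuration, is a metric relabel rule that drops the offending series at ingestion time; the job and metric names here are invented for illustration:

```yaml
# Hedged sketch: drop a misbehaving metric at scrape time via relabeling.
# Job name and metric regex are invented.
scrape_configs:
  - job_name: noisy-service
    static_configs:
      - targets: ['noisy-service:9090']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'runaway_metric_with_huge_cardinality.*'
        action: drop              # the series never reaches storage
```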
B: It's never a nice message to send, so that's something I would recommend. Just a little bit about how we do capacity: obviously it's going to depend a little bit on your workload, but we found that about one M3DB replica node per 50,000 incoming time series works pretty well for us. But we are very write-heavy, so that may be different depending on how your cluster looks.
B: I saw that there was a question about how many nodes we use on that big cluster: we're currently running about 18 nodes per replica.
B: You can do the math for how many we have all together (with three replicas, that's roughly 54 nodes), and we run about 50 write coordinators in each of two different deployments, so about 100. Like I said, we're just happy to run lots and lots of these for stability; we probably could get away with fewer, but this gives us the buffer that we need, and that's okay. These are just numbers that we've arrived at organically.
B: So I don't have a formula like "this many nodes can do this many things"; it would be nice if I did, but I think it's partially just testing, because it will depend on your workload. Cool. I did want to talk a little bit about some of the things we want to do in the future. I would say at this point we have fully completed our migration away from Prometheus; we are removing Prometheus everywhere, since we don't need it anymore,
B: aside from the lightweight monitoring ones that Y mentioned. And we're now getting to the point where we can look towards the future and what nice new things we can do. Previously we were in this bad state where Prometheus kept crashing and everything was unhappy, and we've now had a very successful, but somewhat long,
B: migration, so we now have a stable set of monitoring clusters running, and we can start to look forward and say: hey, now that we have this nifty new M3 thing, what can we do? Some of the things we're looking to do in the future are to start downsampling our older metrics. This is something where we expect to see significant savings in disk space, and probably also in query performance over older data.
B: We run with a 30-second scrape interval, and it's a little bit unreasonable to expect that when you query data that's 60 days old, you'll get it at 30-second resolution. We will also be looking at using different namespaces for metrics that have different retention requirements. Like I mentioned, we run a lot of test things in our dev clusters; it's a bit unreasonable to expect that these test clusters will have really long retention on their metrics.
B: A really nice feature that M3 has is that you can put those metrics in a different namespace and then have a shorter retention on that namespace. So your test clusters get five days or whatever of retention, and that's great, and then for things that matter more, you can have longer retention. Again, this is to deal with needing less disk space and putting lower load on the system.
B: And then another thing we're excited about that M3 has enabled: because M3 supports pushing metrics into it, it allows us to enable use cases where you're in a mode that's actually very hard to scrape. There are things like Databricks jobs, which we leverage a lot, where Spark clusters are running and processing lots of data. Our data science teams really want to track those jobs, and historically it's been very difficult, because those jobs are running somewhere else in an isolated cluster, and how would Prometheus reach over there and scrape the metrics out of it? So being able to build a little proxy so that they can push metrics directly through into M3 is really great. Another one our dev tools team is looking at: they want to monitor things from developer laptops, to know, hey,
How
long
is
it
taking
to
do
certain
developer
operations,
so
they
can
improve
the
developer
experience
at
databricks
and
they
want
to
also
be
able
to
put
metrics
in,
because,
obviously,
your
scraping
system
is
not
going
to
be
able
to
to
reach
out
and
scrape
metrics
from
your
lap
from
your
developers
laptops.
So
that's
another.
These
are
another
feature
that
m3
has
that's
kind
of
like
a
new
thing
that
we
can
do.
There
obviously
are
ways
to
push
metrics
into
a
proxy
for
for
prometheus,
but
we
found
them.
We
we
do
operate.
B: but we did operate some of those, and we found them to be less than reliable, just because of the nature of needing to cache the data and then re-scrape it. It's much less reliable to do that than to be able to just directly push the metric in. Caching is just a difficult problem with metrics, so I'll leave it at that; we spent a lot of time trying to fix various metric-caching issues anyway.
B: So, I want to leave a little bit of time for questions, so I'll just conclude and then we'll leave about 10 minutes. I would say, in conclusion, this has been a very successful migration for us; we're very happy with where we've landed. Overall, the community has been extremely helpful. We've worked a lot with the people at Chronosphere, who have been extremely helpful.
B: We've worked with the open community as well, and it's been a great experience. There are a lot of great new things on the horizon for us, and we're really excited to be shifting gears into making the overall metrics experience at Databricks better from a feature perspective,
B: not just from a stability perspective. So, cool, that's what I wanted to say, and I think we can maybe shift to questions. I see a lot of them have been marked as answered, but okay, let's see. I see: why didn't Cortex suit the high churn rate? We didn't dive very deeply into this, other than that we did actually talk to
B: the main person who started Cortex at Grafana, whose name is escaping me right now, and he did kind of identify this problem that Cortex has with ingesting a lot of short-lived metrics. And I see that Y would like to answer this question live, but... oh.
B: Oh, okay, thanks. So that is something we ran into, and we actually did try quite hard to deal with it. Next: how many production issues are we facing daily on over 50 clusters? If I exclude disk-space issues, I would say we average maybe one or two actual production issues a week across more than 50 clusters. So, like I said, it's really very stable at this point.
B: I would say probably the main thing we are focusing on right now is getting a handle on the increasing load. Databricks is scaling very quickly in terms of how many people use the platform, so our metric load just continues to go up, and we get a lot of alerts that say, hey, you're going to be running out of disk space, and then we have to decide: are we going to reduce retention here?
B: Are we going to try to scale up the cluster? Are we going to try to figure out which are the worst offenders and make them reduce their usage? Those are sort of a separate set of issues, mostly because I wouldn't really blame them on M3; that would be an issue no matter what system we were using.
B: So in terms of M3 issues, I would say on the order of one or two a week, and it's usually something like high memory usage. We also end up,
B: you know, with 50 clusters, inherently hitting some underlying infrastructure issues, so a lot of the time it'll be some cluster where your node just can't schedule, because there's some underlying issue with the cloud platform, or something like that. But, like I said, not a ton of issues overall, so hopefully that answers that one. How do we communicate with the community, is it through the Slack channel? Yes. We have not taken up the Chronosphere office hours; we may start doing that.
B: But yes, currently our communication with the community has been through the Slack channel, and through filing GitHub issues. I mentioned the operator self-driving thing; that's an issue we reported, there were some bugs in there, and they've now fixed those, so that's been good. So I would say our primary ways of communicating are the Slack channel, where people are pretty responsive, and GitHub for code issues. Next question: what node size are we using for the large cluster?
B: And, let's see, I saw that somebody else asked if we were planning to open-source our rule-manager engine, and the answer is yes, we do want to at some point in the future. We don't have a roadmap for doing that right at the moment, but as soon as we're able to prioritize it, I think it is something we would definitely like to prioritize and get open-sourced.
B: We are a company that likes to open-source things if possible. So, yeah, let's see if there are any other questions; I don't know if I should go through them all, and I see Y has answered a lot of these questions by typing, so thanks, Y. Are we using StatefulSet disks? Yes, we use a StatefulSet, and we use SSDs under the covers. Like I said, we
B: run across multiple clouds, and one of the nice things about that is that Kubernetes can hide some of those details for us. We just have persistent volume claims that let us get disks for the clusters, and then they magically have disks, and it doesn't matter which cloud you're running in. Cool. Well, I don't know if there are any other questions; otherwise, I think...
C: Thank you both. I'll now hand it back to Gibbs, but yeah, that was, I mean, extreme deep dives and really valuable, I think, for everyone to hear about. So thanks for putting so much work into it; it's really great to see under the covers, for the other folks out there running M3, obviously. So I just wanted to say a big thank-you for all the detailed materials here.