Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2021, 14 May 2021

From YouTube: Better Scalability & More Isolation? The Cortex “Shuffle Sharding” Story - Tom Wilkie, Grafana Labs

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Better Scalability and More Isolation? The Cortex “Shuffle Sharding” Story - Tom Wilkie, Grafana Labs

Cortex is a horizontally-scalable, highly-available and multi-tenant Prometheus-compatible time series database. For many years it has been possible to scale Cortex clusters to hundreds of replicas. The relatively simple Dynamo-style replication relies on quorum consistency for reads and writes. As such, a dual-replica failure can lead to an outage for all tenants. To address this we implemented a technique called “Shuffle Sharding” in Cortex. Shuffle Sharding lets you automatically pick a random “replica set” for each tenant, allowing you to isolate tenants and reduce the chance of an outage. In this talk we’ll show you how shuffle sharding achieves better scalability and more isolation, both in theory and in practice. We’ll walk you through the design on both the read and write path of Cortex. Finally we’ll do a live demo of shuffle sharding and how you can “take out” multiple replicas without affecting all tenants.

A

Hello, everyone, and welcome to “Better Scalability and More Isolation? The Cortex Shuffle Sharding Story”. I hope everyone's having a good KubeCon so far. My name's Tom, and I am the VP of Product at Grafana Labs. In what little spare time I have, I'm one of the Prometheus team, my main contribution being the remote write code there. I also started the Cortex project, which is the horizontally scalable version of Prometheus. That's part of the CNCF sandbox.

A

More recently I started Loki, which is Grafana Labs' log aggregation system. When I'm not looking after these projects, I have a bunch of 3D printers sitting on my desk, and I used to make beer as well, but I haven't actually had a chance to brew for a while. Today we're going to talk about shuffle sharding and how it allows us to build a more scalable version of Cortex with better isolation. But before we do that, I'd like to spend some time introducing you to Cortex: what it does, what it is, and why it's important, before we go on and talk about

A

how we solved this problem before we added shuffle sharding. We'll then talk about shuffle sharding, what it does, and finally how good it is, and whether it really delivers on what we said it would. So, without further ado:

A

Cortex: horizontally scalable Prometheus. Prometheus is an awesome monitoring system. It's incredibly easy to use, and we see a lot of people get started with Prometheus very easily. You deploy it alongside your applications, you maybe instrument your applications or add a few exporters to adapt them to Prometheus, and very quickly you attach your Grafana.

A

You can build some really awesome dashboards, get some great insight into your application's behavior, and start debugging it and responding to any problems it might have. It is a really powerful and flexible system.

A

The challenge we see in Prometheus is really when you start to grow beyond the confines of a single location: beyond a single data center or a single region. Maybe you've got your application deployed in three different locations. At Grafana Labs, we run 15-plus Kubernetes clusters. In the first instance, what we see is users adding data sources to Grafana for each one of these instances.

A

This allows you to build dashboards with little dropdowns where you can select what region you're interested in.

A

I guess the reason Prometheus has to be deployed like this is because Prometheus has to be next to your application. It wants to talk to the local cluster to do service discovery, and it wants to connect directly to the application to collect metrics from it. In the Prometheus world, there is a solution for bringing this all together into a central global view. And to be clear, the problem is that, whilst this approach is fine for getting information about an individual cluster,

A

there's no way in this approach of really getting a global latency number, or finding out what the global error rate is. You just can't do it, because each cluster is being monitored by an independent Prometheus.

A

So in the Prometheus world we recommend people deploy a global federation server. This federation server can scrape the federation endpoint on each of your Prometheuses and bring that data into a single place where you can run these central queries, where you can ask things like: what's my global latency, what's my slowest region, and so on. This isn't that tricky to set up, and it generally works reasonably well.

A

You've got to get the authentication and firewalls all working correctly, you've got to secure the network and so on, but in general this is feasible. Where this starts to break down is when people scale very large and start storing all the raw data in a single Prometheus server.

A

It's very easy to overwhelm that central global federation server, so we recommend, as best practice, that people only federate pre-aggregated data. Commonly this might mean recording rules that have erased away, let's say, the instance label.

A

These are useful for building those dashboards, but it really prevents you from doing drill-down and ad hoc queries against this global federation server. If this federation server points to a problem in a region, it won't be able to point to a problem with a particular instance of a service, because you've erased that label away. So five years ago now we were looking for a different way of doing this,

A

maybe a better way of doing this, and this is where we built Cortex. Cortex replaces the need for that global federation server, and you can have all the edge locations push all their raw samples directly to that Cortex cluster. This is good for two reasons. One:

A

This is a push, not a pull, so in some ways it is more sympathetic towards how a lot of organizations have their networks organized. But also, the Cortex cluster is scalable: as you add more clusters, and as you add more metrics in individual locations, you can scale up that Cortex cluster to take all the load, all the raw data. This makes it really easy to do these ad hoc queries.

A

You've got all the data, so you can drill down just within the central Cortex cluster. And because you've centralized all of this, there's also one natural place to add things like long-term storage and to invest in query performance, and to really make sure your users know there's one place to go to get all their answers.

A

So that's Cortex in a nutshell. Really, it's a time series database. It uses the same storage engine and query engine as Prometheus, and what we've done in Cortex is add the distributed systems glue to turn those from a single-node solution into something that works in a horizontally scalable, clustered fashion.

A

So Cortex is horizontally scalable and it's highly available. We replicate data in Cortex between nodes, which means that when a node fails, you're not going to see gaps in your graphs. We also add more durable long-term storage: in Cortex you can store data in an object store, and effectively store data for as long as you like. And finally, one of the things I think makes Cortex quite different to a lot of systems

A

is that it was natively, from day zero, built to be multi-tenant, to support different isolated tenants on the same cluster. This means that if you're an internal observability team providing a service to the rest of your organization, Cortex is really easy to deploy, and you can add lots and lots of different isolated teams

A

within your organization without having to spin up a separate cluster per team. We joined the CNCF a few years ago, we're part of the incubating phase now, and it's Apache licensed and available on GitHub. A bit of a timeline: as I've mentioned, Julius and I started the project almost five years ago now. We originally stored all the data in Amazon DynamoDB, and then over the next year or two

A

I added support for Bigtable and for Cassandra. One thing I'm particularly proud of with Cortex is that in the early days I think we got the write path right. It was very scalable and very performant from the get-go, and it didn't take us a long time to make that effectively done. And so early on in the life of the project

A

we started focusing on query performance, on accelerating, distributing, and parallelizing massive queries against Prometheus data. I feel like we made some really good strides there with query caching, parallelization, and sharding, and I'm very proud of what we achieved. We joined the CNCF sandbox just about two and a half years ago now, and then the focus really was on ease of use and on the community. We launched a website,

A

we did a 1.0 release, we wrote a load of docs, and generally we put a lot of effort into making Cortex easier to use. And now we're up to date. More recently, in the past year or so, we've been focused on new and exciting features in Cortex. We added a system called block storage. This is basically the same thing Thanos does, where the only dependency that Cortex has now is on an object store, making it a lot easier to deploy and manage.

A

Also, block storage is fantastically cheaper to operate than the previous DynamoDB chunk storage.

A

We added shuffle sharding towards the end of last year, and this is what I'm going to talk about for the rest of the talk, so I won't go into any more detail now. More recently still, we've added things like query federation, relaxing some of those multi-tenancy isolation features so you can query data in multiple different tenants, and per-tenant retention, so different tenants can have their data stored for different lengths of time.

A

So yeah, really exciting progress on Cortex. Today we're going to talk about shuffle sharding, but to tease you a little bit more, first I really have to describe how things worked before shuffle sharding.

A

One of the main goals of Cortex is to be horizontally scalable. What this means is we need to be able to take in data, shard it, and spread it amongst the nodes in a cluster. We do this by hashing the labels within the samples that get written, and this is really how we make Cortex scalable: how we make a cluster, in aggregate, able to cope with more writes and more reads than any single node in that cluster can. This is all automatic.

A

The user doesn't really have to configure anything. As you add new nodes we can scale up, and as you remove nodes we can scale down. It's really quite cool.
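
To make the hashing idea concrete, here is a minimal sketch of sharding series across nodes with a toy hash ring. It is an illustration, not the Cortex implementation: the node names, the `Ring` class, the token count, and the choice to include the tenant ID in the hashed key are all assumptions made for this example.

```python
import bisect
import hashlib


def h32(key: str) -> int:
    """Stable 32-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def series_key(tenant_id: str, labels: dict) -> str:
    """Canonical string for one series: tenant ID plus its sorted labels (assumed format)."""
    return tenant_id + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


class Ring:
    """Toy hash ring: every node owns several pseudo-random tokens."""

    def __init__(self, nodes, tokens_per_node=64):
        self._ring = sorted(
            (h32(f"{node}#{i}"), node) for node in nodes for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, key: str) -> str:
        """Return the node owning the first token after the key's hash, wrapping around."""
        i = bisect.bisect_right(self._tokens, h32(key)) % len(self._ring)
        return self._ring[i][1]


# Hypothetical nine-node cluster.
nodes = [f"node-{i}" for i in range(9)]
ring = Ring(nodes)
key = series_key("cat", {"__name__": "http_requests_total", "instance": "a:80"})
print(ring.owner(key))  # the same series always lands on the same node
```

Because placement depends only on the hash, nothing needs configuring per series, and adding a node only moves the series whose hashes fall next to that node's tokens.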

A

The challenge with this is that a single-node outage can potentially impact all of the tenants on the cluster: the tenants here being the cat, the dolphin, and the fox. It's worth noting that as you add more nodes, the chance of any one of them failing just randomly is higher. To avoid an outage from a single node failure,

A

we replicate the data between nodes. We use a replication factor of three, and quorum reads and writes. What this means is that when you write data, we write to three nodes, but we only wait for a positive response from two of them, from a quorum of them. This means that when there's a node outage, you can continue to write uninterrupted to the cluster, because we'll still be getting that positive response from two of the nodes.
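
The quorum rule described here can be sketched in a few lines. This is a simplified model under stated assumptions (majority quorum, node names invented for the example), not Cortex's actual write path.

```python
def quorum(replication_factor: int) -> int:
    """Majority quorum: 2 out of 3, 3 out of 5, and so on."""
    return replication_factor // 2 + 1


def write_succeeds(replicas, healthy) -> bool:
    """A write is acknowledged once a quorum of its replicas respond positively."""
    acks = sum(1 for node in replicas if node in healthy)
    return acks >= quorum(len(replicas))


replicas = ["n1", "n2", "n3"]  # hypothetical replica set for one series
print(write_succeeds(replicas, {"n1", "n2", "n3"}))  # True: all three respond
print(write_succeeds(replicas, {"n1", "n3"}))        # True: one node down, quorum of two holds
print(write_succeeds(replicas, {"n3"}))              # False: two nodes down, no quorum
```

The last case is the one the rest of the talk is about: a second failure inside the same replica set breaks the quorum.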

A

What you'll see, though, is that if a second node fails, even with replication factor three, we're going to have an outage, because we're not getting that positive response on writes. We don't know that they've succeeded, and therefore you basically have an outage for all of your tenants. What's potentially more worrying is that Cortex clusters are getting bigger and bigger.

A

Five years ago we were running four-, five-, and nine-node clusters, then ten- and twenty-node clusters. Now we're running multi-hundred-node clusters, and the chance of two node failures, just randomly or through user error, is getting higher, so the chance of there being a total outage on the cluster is getting higher. And it's worse than that, because every tenant in effect is writing to every node in the cluster.

A

If there's a bug in Cortex, or a misconfiguration, and a tenant finds a way to exploit that, a poison request or a bad query could take out an entire cluster for all tenants.

A

So these really are the problems we're trying to solve with shuffle sharding. And shuffle sharding, to be clear, is not the only way of solving it. We could do something simple, something we call bulkheads. You can think of this as, instead of having one big nine-node cluster,

A

you just have three smaller three-node clusters, and you map tenants to clusters. This way a poison request by cat would not affect dolphin or fox, a sentence I never thought I'd say. We also see that a two-node outage would have to be in the same shard to impact any tenant. The challenge here is that this mapping is relatively rigid. It's very hard in this world to have a tenant that needs all nine nodes' worth of throughput.

A

It's also hard to scale. If I want to scale up, do I scale all of the shards up? Do I scale one of the shards up? What do I do? And generally you can see how this cellular approach can be a bit of a management burden.

A

So this is where shuffle sharding comes in. Now I'm going to try to explain how shuffle sharding works, and then we'll go on and analyze how we tune it and what its properties are.

A

The first thing worth saying is that we didn't invent shuffle sharding. The first time I became aware of it was through this Amazon article in its Builders' Library about how they improved the isolation in Route 53, their DNS service, using this technique they called shuffle sharding.

A

We read this when it was published. It got passed around internally at Grafana Labs and we thought, yeah, this would be a really interesting piece of work to do on Cortex. We could see its direct benefits. What shuffle sharding does is effectively pick a random sub-cluster of the cluster for each tenant.

A

We pick this subset at random, but it is deterministically random: we hash the tenant ID to select the nodes in the cluster. Then, within the nodes that the tenant is using, we use the normal Cortex replication scheme to distribute writes among those nodes. This gives you a nice property where you can have tenants of different sizes

A

using the same cluster, and you can control the isolation between tenants depending on how many nodes you give each tenant.
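
One way to picture “deterministically random” selection is to seed a random number generator from a hash of the tenant ID. The sketch below does that; it is an illustration only, and it does not claim to match how Cortex walks its ring. The function names and node names are hypothetical, and a production version would also want the selection to stay mostly stable as nodes join and leave, which this toy does not attempt.

```python
import hashlib
import random


def shuffle_shard(tenant_id, nodes, shard_size):
    """Pick a per-tenant subset of nodes, seeded by a hash of the tenant ID."""
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order nodes were listed in.
    return sorted(rng.sample(sorted(nodes), min(shard_size, len(nodes))))


def impacted(shard, failed, tolerated=1):
    """Assumption: a tenant is hurt once more than `tolerated` of its nodes fail."""
    return len(set(shard) & set(failed)) > tolerated


nodes = [f"node-{i}" for i in range(9)]  # hypothetical nine-node cluster
for tenant in ("cat", "dolphin", "fox"):
    print(tenant, shuffle_shard(tenant, nodes, 3))
```

Every component that hashes the same tenant ID arrives at the same subset without any coordination, and changing `shard_size` per tenant is the knob that trades isolation against spread.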

A

To give you an example: if we have a three-node outage in this situation, we can see this only affected one tenant, because both dolphin and fox only had one node impacted by that outage. So this is the basic idea. It gives you much better tolerance to failure, with a partially degraded state. Another example: I talked about poison requests earlier.

A

If cat were to send a poison request, you can see how dolphin and fox are not affected, because they've again only got one node that's been poisoned. So this is the basic idea. We randomly select a subset of the cluster's nodes for each tenant. We then distribute samples from each tenant within that sub-cluster using the normal scheme. And then we want to tune the number of nodes that we give to each tenant to optimize

A

for isolation. So this is where we have to start thinking: how many nodes do we want to give each tenant, how do we optimize isolation, and what are the trade-offs? I'm going to play cards. Imagine we had a 52-node cluster represented by a deck of cards. We're going to shuffle that deck and we're going to deal out four cards.

A

How many different combinations of four cards do you think there are? Well, it turns out there's some maths that can work this out. It's called the “n choose k” problem, and 52 choose 4 is about 270,000. So if I were to pick sets of four nodes from my 52-node cluster, there are about 270,000 different combinations of four nodes. It's a huge number, but that in and of itself is not super useful.
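
For reference, the figure quoted here follows directly from the binomial coefficient; the worked arithmetic is:

```latex
\binom{52}{4} = \frac{52!}{4!\,48!}
             = \frac{52 \cdot 51 \cdot 50 \cdot 49}{4 \cdot 3 \cdot 2 \cdot 1}
             = \frac{6{,}497{,}400}{24}
             = 270{,}725
```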

A

What I really want to know is, of those 270,000 combinations, how many of them share one node in common, and how many share two nodes in common. It turns out that's not difficult to work out either. There's a link at the top to a Stack Overflow article about how to work it out. My maths is not good enough to derive this, but suffice to say almost three quarters of these selections don't share any nodes in common.

A

A quarter of them share one node in common, and only about two and a half percent share two nodes in common. This is an incredibly strong result. It shows that for, say, a 52-node cluster where all the shuffle shards were of size

A

four, a two-node outage would only impact, worst case, two and a half percent of the tenants. But there's more to it than that. When we're picking how many nodes to give each tenant, we need to make a trade-off. Fewer nodes means we're going to have better isolation. If I give each tenant one node,

A

the isolation between tenants is going to be as good as it possibly can be, because the chance of two tenants hitting the same node is just going to be 1 in 52.
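
The overlap figures from the card example (three quarters, a quarter, two and a half percent) can be reproduced with a standard counting argument: fix one hand of four, then count the other hands that share exactly a given number of cards with it. The sketch below uses only Python's `math.comb`; the function name is made up for this example and it is not taken from the Stack Overflow article mentioned in the talk.

```python
from math import comb


def overlap_fraction(n: int, k: int, shared: int) -> float:
    """Fraction of k-node shards that share exactly `shared` nodes with a fixed k-node shard."""
    return comb(k, shared) * comb(n - k, k - shared) / comb(n, k)


for shared in range(5):
    print(shared, f"{overlap_fraction(52, 4, shared):.4f}")
# 0 0.7187
# 1 0.2555
# 2 0.0250
# 3 0.0007
# 4 0.0000
```

The five fractions add up to one, and the single-node case from the talk falls out of the same function: `overlap_fraction(52, 1, 1)` is 1 in 52.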

A

If I give tenants more nodes, though, I'm going to be able to spread that load more evenly. And in a Cortex cluster the tenants aren't all the same size. We have some very large tenants, we have some very small tenants, and we have everything in between, so we really need an algorithm for picking how many nodes, how many shuffle shards, to give each tenant.

A

One thing I would say is that better load balancing isn't just a nice-to-have. Better load balancing can lead to higher utilization of resources and a lower cost of running the cluster, and if you run Cortex as an offering, as a SaaS platform like we do in Grafana Cloud, this is super important.

A

So we proposed a simple algorithm: give each tenant a number of shuffle shards proportional to its number of series. Let's say you've got a million series and we decide that we're going to give you one shuffle shard per hundred thousand series; we'd give you ten shuffle shards. Really what we want to do is find out what the right value for that hundred-thousand number is. What is that constant?
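
The proportional rule is simple enough to write down directly. In this sketch the function and parameter names are hypothetical; rounding up is an assumption, and the floor of three shards reflects the replication minimum mentioned later in the talk.

```python
import math


def shard_count(num_series, series_per_shard=100_000, min_shards=3, max_shards=None):
    """Number of shuffle shards for a tenant, proportional to its series count."""
    count = max(min_shards, math.ceil(num_series / series_per_shard))
    if max_shards is not None:
        count = min(count, max_shards)  # cannot use more nodes than the cluster has
    return count


print(shard_count(1_000_000))               # 10, the example from the talk
print(shard_count(50_000))                  # 3, small tenants hit the replication floor
print(shard_count(9_000_000, max_shards=60))  # 60, capped at the cluster size
```

The interesting question is then the value of `series_per_shard`, which is what the simulation is for.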

A

Again, as I said earlier, my maths is not good enough to derive this from first principles. If anyone in the audience knows how to do this mathematically, I'd be really interested in chatting to you. But I'm a software engineer, so we built a simulator.

A

The simulator simulated a Cortex cluster of a certain size.

A

I think we simulated 60 or 70 nodes, and a set of tenants with roughly the distribution of sizes that we observe in our production clusters. It simulated picking shuffle shard sizes and distributing the samples to each of the nodes, and it measured the variance in node load, based just on the number of series each node has, and the number of tenants that were impacted by two nodes going away.

A

That is, what proportion of tenants would be impacted by two nodes going away; we actually measured any two nodes going away. I think the simulator is open source, so do ask me afterwards if you want a link to the source code. Suffice to say we got a couple of graphs from the simulator. This one shows the load balancing: how well load is distributed within the cluster versus the size of each shuffle shard.

A

So shuffle shard size is along the x-axis and load distribution along the y. What you can see here is that as you increase the size of the shuffle shards, the distribution of load gets worse, as we predicted. Interestingly, you can see the distribution of load starts to tail off. I believe this happens as small tenants start to hit the minimum number of shuffle shards, which is three, for replication.

A

We also see here that at, let's just pick a number (the numbers aren't super relevant, this is just a general rule of thumb), let's say a shuffle shard size of 40,000 series, the maximum number of series on a single node is about one and a half million and the minimum is about 750,000. So there's a factor of two difference here, which gives us some idea.

A

We probably don't want a factor of two difference in the size of our nodes. That is going to make it very hard to optimize utilization.

A

We also see that as you increase the size of the shard, the percentage of tenants affected by a two-node outage, which is how we measure isolation, starts to fall and eventually plateaus. We can see that at, let's say, thirty or forty thousand, way less than one percent of tenants in your cluster are affected by a two-node outage. This was modeled with a thousand tenants, averaging, I think, a hundred thousand series per tenant. One of the key things this simulation took into account

A

is that it also simulated the replication factor.
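
A toy version of this kind of experiment fits in a few dozen lines. To be clear, this is not the Grafana Labs simulator and its numbers should not be compared with the graphs from the talk: the exponential tenant-size distribution, the even spreading of a tenant's replicated series across its shard, and the definition of “impacted” (both failed nodes fall inside the tenant's shard) are all simplifying assumptions made for this sketch.

```python
import math
import random


def simulate(num_nodes=60, num_tenants=1000, mean_series=100_000,
             series_per_shard=20_000, replication_factor=3, seed=0):
    rng = random.Random(seed)
    # Assumption: exponentially distributed tenant sizes (many small, a few large).
    tenant_series = [int(rng.expovariate(1 / mean_series)) + 1 for _ in range(num_tenants)]

    load = [0.0] * num_nodes          # series held per node, counting replicas
    impacted = 0.0                    # expected tenants hit by a random two-node outage
    pairs = math.comb(num_nodes, 2)   # number of distinct two-node outages
    for series in tenant_series:
        size = max(replication_factor, math.ceil(series / series_per_shard))
        size = min(size, num_nodes)
        for node in rng.sample(range(num_nodes), size):
            load[node] += series * replication_factor / size
        # Fraction of all node pairs that lie entirely inside this tenant's shard.
        impacted += math.comb(size, 2) / pairs

    return {
        "min_load": min(load),
        "max_load": max(load),
        "total_load": sum(load),
        "total_series": sum(tenant_series),
        "impacted_fraction": impacted / num_tenants,
    }


for sps in (10_000, 20_000, 40_000, 100_000):
    r = simulate(series_per_shard=sps)
    print(sps, round(r["max_load"] / r["min_load"], 2), round(r["impacted_fraction"], 4))
```

Sweeping `series_per_shard` like this gives the two curves the talk describes: the ratio between the busiest and quietest node on one side, and the share of tenants hit by a two-node outage on the other.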

A

Whilst working on this, we picked some numbers, we debated internally, we found where the two graphs cross, and we came up with this good rule of thumb. At around 20,000 series per shard,

A

we have roughly 20% variance in the series per node and roughly two percent of tenants affected by a two-node outage. I believe the production config that we run on our large Cortex clusters roughly matches this; I think 20,000 to 30,000 series per shard is what we run internally. This is really good, because by reducing the chance of an outage for most tenants when two nodes are suffering problems, we've been able to scale up to even larger Cortex clusters,

A

to hundreds of nodes as opposed to tens. We've also managed to better isolate tenants from each other, so there'll be fewer noisy neighbors and less chance of a poison pill affecting other tenants.

A

We managed to do all of this whilst keeping the variance in load amongst these nodes relatively bounded, and therefore not increasing the cost of running this cluster and not passing on any cost to the customer for this.

A

So I think this is a really positive result. I'm really pleased with the work, and surprised at how effective shuffle sharding is. We talked today about what Cortex is: the horizontally scalable version of Prometheus, which allows you to centralize your observability into a single cluster and act as your own service provider within your organization.

A

We've talked about how we distributed load before we implemented shuffle sharding: how we distributed all tenants to all nodes, and how we use a hashing algorithm and a kind of DHT to do that. Then we've talked about shuffle sharding: how it effectively builds small virtual clusters inside a much larger real cluster, and how these virtual clusters improve the isolation between tenants at not a huge expense in terms of utilization. And that's really the talk. I wanted to say thank you to a few people.

A

I wanted to say thank you to Marco and to Peter, who really did all the work here. They should be the ones giving this talk.

A

What's more, the slides I'm giving here are an evolution of Marco's internal slides from a talk he gave inside Grafana Labs. I also wanted to say thank you to Amazon.

A

They sponsored Grafana Labs to make these changes to Cortex, worked closely with us on the design and on reviewing it, and gave us some great feedback.

A

If you want to hear more about how Grafana Labs and Amazon have worked together to help Amazon launch their Prometheus service, there's a blog post on Amazon's blog and one on Grafana's blog that go into a little detail about how the relationship has worked and what kind of things we've built for Amazon. And with that, I'd like to say thank you and open up the floor to questions.
