youtube image
From YouTube: Better Scalability & More Isolation? The Cortex “Shuffle Sharding” Story - Tom Wilkie, Grafana Labs

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Better Scalability and More Isolation? The Cortex “Shuffle Sharding” Story - Tom Wilkie, Grafana Labs

Cortex is a horizontally-scalable, highly-available and multi-tenant Prometheus-compatible time series database. For many years it has been possible to scale Cortex clusters to hundreds of replicas. The relatively simple Dynamo-style replication relies on quorum consistency for reads and writes. As such, a dual-replica failure can lead to an outage for all tenants. To address this we implemented a technique called “Shuffle Sharding” in Cortex. Shuffle Sharding lets you automatically pick a random “replica set” for each tenant, allowing you to isolate tenants and reduce the chance of an outage. In this talk we’ll show you how shuffle sharding achieves better scalability and more isolation, both in theory and in practice. We’ll walk you through the design on both the read and write path of Cortex. Finally we’ll do a live demo of shuffle sharding and how you can “take out” multiple replicas without affecting all tenants.