From YouTube: OpenShift Commons Gathering 2019, Santa Clara. "Building Operators at Uber" with Paul Schooss and Matt Schallert (Uber), on M3DB.
Matt: Hey everyone. I'm Matt, and I'm Paul. As Diane said, we're from Uber; we're SREs on the observability team, where we work on Uber's open-source metrics platform, and today we're going to share with you some of our experience building an operator for part of it. We work on M3DB, which is Uber's open-source time series database. It's part of M3, which is Uber's open-source metrics stack, and M3DB is the core distributed time series database that sits at the center of M3.
Matt: We also have other components, like our fault-tolerant aggregation tier and our distributed query engine, all of which act as a way to use M3DB either as a general time series database or as a long-term store for Prometheus. To give you a sense of M3DB's usage at Uber: we have a little over 1,100 instances running M3DB.
Matt: All together they ingest, post-aggregation, 33 million writes per second, which comes out to about 55 gigabits per second, and we store a little over 9 billion unique metric IDs.
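As a rough sanity check (our arithmetic, not a figure from the talk), the quoted write rate and bandwidth imply an average on-the-wire size of roughly 200 bytes per write:

```python
# Back-of-envelope check on the scale figures above:
# 33M writes/s at ~55 Gbit/s works out to about 208 bytes per write.
writes_per_sec = 33_000_000
bits_per_sec = 55 * 10**9

bytes_per_write = bits_per_sec / writes_per_sec / 8
print(f"~{bytes_per_write:.0f} bytes per write")  # ~208 bytes
```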
When we first built M3DB in 2016, operating it was pretty simple. We ran M3DB in two data centers, with one cluster per data center, and the clusters shared a static configuration. Our only use case back then was real-time alerting and dashboard introspection.
Matt: So we just stored metrics at 10-second resolution for two days. But fast forward to 2019, and things aren't nearly as simple. We run M3DB in more data centers, there are more clusters per data center, the clusters are larger, and they no longer share a static configuration. We have some clusters that store high-resolution metrics at, say, 1-second resolution for 6 hours, and some clusters that store low-resolution metrics at, say, one-hour resolution for up to 5 years.
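To make the retention trade-off concrete, here is a small illustrative calculation (our numbers, using the resolutions and retentions mentioned above) of how many datapoints a single series accumulates under each policy:

```python
# Datapoints one series accumulates under each retention policy.
def points_per_series(resolution_sec: int, retention_sec: int) -> int:
    return retention_sec // resolution_sec

HOUR, DAY, YEAR = 3600, 86400, 365 * 86400

high_res = points_per_series(1, 6 * HOUR)      # 1s resolution for 6 hours
low_res = points_per_series(HOUR, 5 * YEAR)    # 1h resolution for 5 years
old_policy = points_per_series(10, 2 * DAY)    # 10s for 2 days (the 2016 setup)

print(high_res, low_res, old_policy)  # 21600 43800 17280
```

Interestingly, the high-resolution/short-retention and low-resolution/long-retention policies end up the same order of magnitude per series; the difference is in which workloads they can serve.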
Matt: On top of all this configuration complexity, we also run M3DB in multiple cloud providers where Uber has a presence. When you put this all together: the clusters grew by something like 20 times, the number of engineers stayed relatively the same, and while M3DB itself is pretty easy to manage at the level of a single cluster, we really needed something that would help automate the whole cluster lifecycle. Paul's going to talk a bit about what that looks like. Thanks.
Paul: There's an action plan that gets formulated, and an engineer has to go out and actually enact that plan, and that can take a lot of time away from that engineer actually doing core feature development. So let's talk about why we went the operator route: we wanted to make this experience a lot smoother for our engineers. First, before we even went down the path of thinking about how to automate this, we had to understand the problem space that we were in.
Paul: So let's talk about some of the features of M3DB at its core, to understand what we had to automate. First and foremost, M3DB is a sharded, distributed time series database. Each piece of data, upon ingestion into M3, is physically sharded into a respective partition, to separate the failure domains that we have inside the database.
Paul: Each of these shards is furthermore replicated, to separate out failure domains within the database even further. All shards are replicated by a factor of three under normal settings, and the replicas are synced in real time, at all times. So, with those primitives given, let's talk about some of the features that we set out to automate.
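A minimal sketch of those two primitives: hashing each series onto a fixed set of shards, and replicating every shard across three failure domains. This is our illustration of the idea; M3DB's actual placement logic, shard counts, and node naming are more involved than this:

```python
# Toy model of sharding + replication (illustrative, not M3DB's algorithm).
import hashlib

NUM_SHARDS = 64
REPLICATION_FACTOR = 3
ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical failure domains

def shard_for(series_id: str) -> int:
    # Deterministic hash, so every writer and reader agrees on placement.
    digest = hashlib.sha1(series_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def replicas_for(shard: int) -> list[str]:
    # One replica per zone: losing any single zone leaves 2 of 3 copies.
    return [f"{zone}/node-{shard % 4}" for zone in ZONES]

shard = shard_for("cpu.usage{host=web42}")
print(shard, replicas_for(shard))
```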
Matt: Yeah, so when we were talking through our requirements for a platform, given that we're working on a database, one of the main things that stood out was storage. Whatever system we chose had to have features to support a demanding stateful workload: M3DB requires access to low-latency storage, and we require access to that storage across all of our clusters.
Matt: We have internal workloads at Uber that query large time series ranges, from hours to years, for things like alerting, SLA tracking, and capacity planning, and our stakeholders expect to have access to all of this data in a real-time manner. To give you a sense of what we mean by performant stateful primitives, let's take a look at some of the ways you might store state in a containerized application. One option would be to not store your state at all: on M3DB instances, data would live only as long as its container.
Matt: This might seem nice, as you would have direct access to the host disk and all the speed that comes with that, but otherwise it would be pretty terrible. For one, there's no durability of the data: as containers are upgraded or restarted, even in place, they would have to re-stream all of the data that they're responsible for from their peers, which could be terabytes of data. And in addition to being inefficient, that has dangerous reliability implications.
Matt: When an M3DB instance is coming up, or bootstrapping as we call it, it takes writes but isn't yet available for reads, meaning that while an instance is streaming data from its peers, you're at a reduced read replication factor. Another option for storing your state in a more durable manner might be to use remote storage. This is pretty popular on your neighborhood cloud provider; you can use something like a block store. In this case we still weren't really happy, as there's increased latency.
Matt: Additionally, it seemed pretty inefficient for us to pay someone to replicate data at the block layer that we're already replicating at the application layer, and to do that more slowly, with potentially varying performance characteristics. But honestly, the main thing was that this is less portable for us across cloud providers and our own data centers: different cloud providers' remote block store solutions have different performance characteristics, and we also just weren't comfortable using network storage inside of our data centers.
Matt: The last option was an object store. This is kind of nice, since you're only paying to store one copy of the data, but there's even worse latency introduced, and it still brings back the problem that, as containers are moved around, they could potentially have to pull all of the data that they're responsible for back down from the object store. So this left us searching.
Paul: So you can see the metrics we had right here: durability, performance, and efficiency, and the other approaches that Matt mentioned really didn't check all the boxes that we were looking for. So we went out and did some evaluation, and we settled on Kubernetes, because it eventually checked all of these boxes for us. One of the first features was the local persistent volumes that it provides; that type of abstraction was very portable for us, on-prem as well as in the cloud, and met our needs.
Matt: Additionally, some more Kubernetes features that were really helpful for us were the combination of node affinity and StatefulSets. With StatefulSets we got things like strict ordering of pod operations, whether that's a maintenance rolling restart or a cluster version upgrade, and combining that with node affinity we were able to express our requirements for failure domains.
Matt: To give you an example of what I mean by that: in our internal data centers, we use Uber's provisioning tooling to make sure that the physical hosts that M3DB runs on are provisioned and scheduled across racks and other failure domains in the data center. Once those instances are up, M3DB's own placement algorithms take care of distributing the shards across those failure domains, such that the loss of no single failure domain can cause the cluster to lose quorum.
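That quorum property can be sketched in a few lines, assuming simple majority-quorum semantics (our simplification of M3DB's consistency model):

```python
# With replication factor 3 and one replica per failure domain,
# losing any single domain still leaves a majority of replicas.
REPLICATION_FACTOR = 3
DOMAINS = {"rack-1", "rack-2", "rack-3"}  # hypothetical domain names

def has_quorum(live_domains: set) -> bool:
    majority = REPLICATION_FACTOR // 2 + 1  # 2 of 3
    return len(live_domains) >= majority

assert has_quorum(DOMAINS - {"rack-2"})  # one domain lost: quorum holds
assert not has_quorum({"rack-1"})        # two domains lost: quorum gone
```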
So, using node affinity, we were able to express this really nicely in our cloud environments: we could pin StatefulSets of M3DB to zones in the cloud, and then we'd be guaranteed that the pods were evenly distributed across all of these failure domains. And again, when the M3DB instances come up, they just distribute the shards across all the zones.
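As a sketch of that zone-pinning setup: the field names below follow the Kubernetes node-affinity API, but the zone names are hypothetical and this is our illustration, not the actual spec the M3DB operator emits:

```python
# Build the pod affinity stanza that pins one StatefulSet to one zone.
def statefulset_affinity(zone: str) -> dict:
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        # Well-known Kubernetes zone label.
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": [zone],
                    }]
                }]
            }
        }
    }

# One StatefulSet pinned to each of three zones gives an even spread
# of pods across failure domains.
specs = {z: statefulset_affinity(z)
         for z in ["us-east1-b", "us-east1-c", "us-east1-d"]}
```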
Paul: Let's talk about some of the findings that recurred while we were developing the operator itself. First and foremost, development and CI are going to be at the forefront of your mindset. With the blistering pace of the Kubernetes release cycle, every quarter, there were a lot of breaking changes, so one of the first things we discovered was to have proper end-to-end tests that would install the operator, do some rudimentary checks, and make sure it's happy.
Paul: Everybody loves unit tests, but when you actually see the full end-to-end lifecycle work, it makes you much more confident. Those tests helped us catch a lot of bugs from breaking changes in the Kubernetes APIs. Another thing that we ran into was local development.
Paul: Originally everyone set forth and used Minikube, but Minikube with node affinity is much different from a full-blown Kubernetes cluster like you would have in a cloud or on-premise. So it's very important to know that these environmental differences can shape how your operator is developed, and what is going to change over the course of time. Another thing that we ran into is installation.
Paul: In addition to this, if you go at it right off the bat, it takes away from your core feature development, and you'll start making assumptions that you wouldn't have made otherwise; it kind of leads you down a rabbit hole. Matt will go into that a little bit more.
Matt: What we mean by this is that we built this operator with all the assumptions of how we operate M3DB at Uber, which is that we run highly available clusters that are spread across multiple failure domains. We realized that we ended up baking some of those assumptions about how we operate M3DB into our operator, and they might not be totally portable for everyone.
Matt: We'll talk about that in a second, but one thing we realized is that users are at different phases of what you might call the Kubernetes maturity lifecycle, in the sense that you have some people that are still testing the waters; maybe they're just installing your operator on a local Minikube instance.
Matt: Our operator actually requires that you have a Kubernetes cluster spread across three failure domains if you're creating an M3DB cluster with a replication factor of 3, because that's how we operate them at Uber. But not everyone is really there yet, and you kind of have to meet your users halfway.
Paul: So, let's talk about some of the successes, some of the results that we actually achieved from this. We have some clusters out there, in three different zones. Currently we have the operator running a subset of Uber's metrics, for services that are hosted inside our data centers. One thing we realized right off the bat: we were excited about this, but it presents a different paradigm from how we typically operated clusters.
Paul: We had to shift over to how an operator is used and how that would interface with our engineers. So let's talk about what it really helped out with. Let's go back to those two cycles. We have the reactive cycle, which changed dramatically: now, a node becomes problematic, it signals to our operator that something's going on, the operator triages based on whatever cases it catches, and it tries to bring the cluster back into a healthy state.
Paul: Previously we would have those meetings about capacity planning and how we're going to manage our clusters, and it would take a lot of time out of people's hands, especially when you had to enact the changes that we would set forth. With a stateful service like this, you have to reach certain goal states with your placement on addition or removal of nodes, and that was done manually: an operations engineer would have to go in.
Paul: Ideally, it takes that down to 20 minutes a week; it's no longer this whole big parade of getting everybody together to figure things out. It can be done with a config change in a Git repo of some sort; just the config change that's there. So let's go over some advice for large stateful workloads, as I was mentioning, because this is a tough thing to do. Stateless apps are great, but stateful presents a lot more challenges to engineering.
Paul: So, first and foremost, we didn't just jump to Kubernetes. As Matt mentioned earlier, M3 has been around for quite some time; the whole ecosystem is four years old, and we invested a lot of time and effort to make this platform reliable before it was ever inside an automated operator such as the one on Kubernetes. So first of all, make sure your tooling is there, make sure your platform is there, nice and stable. Don't just do a POC and say, "you know what, we need to put this on Kubernetes."
Paul: The second thing is that we only really approached this, as mentioned before, when we hit scaling challenges. Forty-plus clusters for an on-call engineer is kind of a lot to take on; it's burdensome. So we only started looking at these automated solutions when operational complexity was at such a high scale that we needed to remediate it in a proper way; we only started into this when scaling became an actual challenge for us. With that said, keep in mind this is not a magic bullet.
Paul: It's not going to solve all your problems; it's actually going to add more complexity to your platform. So be very mindful of adding yet another layer of operation on top of your cluster management, because it's yet another thing that you're going to have to think about while you're debugging, or while actually developing the automation that's there.
Matt: Another piece of advice, if you're considering creating an operator for your own stateful workload, is to embrace the declarative approach to infrastructure management that Kubernetes has made so popular. We were already doing this, in the sense that M3DB's cluster topologies are actually modeled as desired states, stored in etcd, that the M3DB instances themselves work to converge on, updating the state when that convergence has occurred.
Matt: So this actually meant that pretty much all our operator had to do was exchange this series of desired states between Kubernetes and M3DB, taking feedback from one to inform decisions about the other; it just took part in this holistic reconciliation loop. Compare this to a database where, if you're initiating a replace, you need to know that you initiated the replace and somehow keep track of that yourself; with M3DB, that intent simply lives in the desired state.
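The idea can be sketched as a toy reconciliation loop (our sketch, not the operator's actual code): instead of tracking in-flight imperative actions like "I started a replace", each pass just diffs the desired state against the actual state and emits the next steps:

```python
# Toy declarative reconciliation: diff desired vs. actual membership.
def reconcile(desired: set, actual: set) -> tuple:
    actions = []
    for node in sorted(desired - actual):
        actions.append(f"add {node}")     # node should exist but doesn't
    for node in sorted(actual - desired):
        actions.append(f"remove {node}")  # node exists but shouldn't
    return set(desired), actions

desired = {"m3db-0", "m3db-1", "m3db-2"}
actual = {"m3db-0", "m3db-1", "m3db-9"}  # m3db-9 is being replaced by m3db-2

new_state, actions = reconcile(desired, actual)
print(actions)  # ['add m3db-2', 'remove m3db-9']
```

Note there is no persistent "replace in progress" flag anywhere: if the loop crashes halfway, the next pass recomputes the same diff and carries on.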
Matt: This meant that even if the Kubernetes API is down or malfunctioning, so long as the pods themselves are still up (which Kubernetes makes great effort to ensure is the case under failure scenarios), we could still make emergency changes to our cluster topologies by just modifying the state that's stored in etcd. So, how are people starting to use this?
Matt: This is probably the part that's been most exciting for us. It's only been a few months that our operator has been out now, and we already have people that are starting to use it and are working on deploying it in production. One thing that's been really cool to see: we built M3DB as, you know, this long-term storage for Prometheus.
Matt: And additionally, we have users that have told us, "oh yeah, I'm using the etcd operator to make my etcd clusters that M3DB talks to, then I'm using the M3DB operator to make these M3DB clusters, and I'm using the Prometheus operator to make these Prometheus clusters that are reading and writing from M3DB." And again, it's this theme that we've heard today: operators everywhere. That has been really cool to see.
Matt: Looking at OpenShift has actually been really valuable for us because, again, talking about those assumptions that we built into our operator: we're running M3DB in these completely walled-off, single-tenant environments, and we hadn't thought as much about things like locking down our pod privileges. The stricter default security policies that come with OpenShift have actually caused us to go back and make sure that our clusters support the security configurations that more mature users might need.
Paul: To wrap up very quickly, we have some future work. More CRDs: right now we're pretty limited, with everything coupled together, both our API/query service that provides the read and write endpoints, and the actual storage node itself. We want to start separating these out and build a richer reliability and scaling story for folks. Beyond that, we want even more CRDs: M3DB is only one part of M3, and M3 is an entire ecosystem. We have a collection tier.
Paul: We have a very reliable aggregation tier and an ingestion tier, so we want to break this out a little bit more and provide people the entire ecosystem that we use inside Uber, in an automated, operator-driven fashion. Lastly, we would like to thank our entire team that's not on stage right now (it's just the two of us) for helping to contribute to this project and make it an actual reality and a possibility. For more information…