Description
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
Sharding Clouddriver with Stormdriver - Michael Graff, OpsMx
OpsMx's Stormdriver allows Clouddriver instances to be sharded in various ways, including using both the standard Java Clouddriver and the Go implementation for Kubernetes. This is a presentation on the challenges, success stories, and general ideas surrounding how to scale Clouddriver per account, cloud, or any other axis.
Hello, my name is Michael Graff, I work at OpsMx, and I would like to talk about a project I've been working on called Stormdriver. A little about me: once again, I work at OpsMx. I have some title there; I'm not so concerned about that. I love to write Go code these days; it's quite nice, quite easy, and the final product is quite small. I am on the Spinnaker Slack, and also on the CDF Slack under my name. Feel free to reach out to me with any questions or concerns, or ideally pull request topics.
OpsMx provides SaaS solutions for Spinnaker and Argo CD, as well as a number of other platforms. We do custom pipeline development and help you maintain it, and so on. We also do on-prem Spinnaker support, for both the open source version and our own internal version, and we are quite happy to discuss anything with anybody who has questions about that as well.
I want to make certain that I talk about Clouddriver first, to set the stage. What is Clouddriver? Clouddriver is the heart of Spinnaker, in that it is the component that actually speaks to the different clouds: it polls for infrastructure, and it also manipulates infrastructure. It also knows about artifact accounts, even though those are fairly unrelated. For example, there is a Git artifact account type, but there isn't a Git cloud account.
They just happen to be in the same service. Clouddriver knows about all the accounts, so currently, without Stormdriver, you can't shard by account. There is a way to shard Clouddriver, but I'll go into a little detail about why that doesn't really scale so well. It can really consume memory, and it is written in Java. It's a standard part of Spinnaker, it's required, and generally speaking it's automatically configured. You can shard Clouddriver based upon the actions it's going to take.
For example, you can have a caching Clouddriver component whose only job is to query the cloud accounts, update its data, and send it to a cache. The read-only instances then query that cache and get the data from there. So the read-only instances don't actually manipulate or look at the cloud; they only look at the cache. This does separate concerns by type of action, but every one of these instances still has to have the full list of accounts.
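As a rough sketch of that standard action-based split, assuming a Halyard-style per-service profile (the exact keys can vary by Spinnaker version), the caching instances keep cache writes enabled while the read-only instances disable them:

```yaml
# clouddriver-caching profile (illustrative):
# polls the clouds and writes results into the shared cache
caching:
  writeEnabled: true

---
# clouddriver-ro profile (illustrative):
# serves reads from the shared cache only, never polls the clouds
caching:
  writeEnabled: false
```

Note that both profiles still need the full account list configured, which is exactly the limitation discussed here.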
In some cases our customers have 5,000 accounts. If you have 5,000 accounts configured into Spinnaker, it's really difficult to do that with a single instance. It takes a lot of memory.
So how do you scale or shard Clouddriver so that you can split up the accounts? Well, you can't, and that's because every Clouddriver also has to know about every account. Large pods in Kubernetes aren't a good idea; it's better to have many smaller pods than one or two large ones. We've actually had customers who had to reduce features in order to make Clouddriver fit on the nodes that they had in Kubernetes.
Clouddriver can take 32 GB, and maybe the node only has 32 GB, so they'd have to somehow limit that memory down to something smaller. It's also a problem with Kubernetes that if you're the biggest pod on a node and that node runs out of memory, it's very likely going to murder that particular pod. So this division of labor by mode only gets you so far.
What Stormdriver does is allow us to shard per account. We still have a situation where the individual accounts have to be sized appropriately so they'll fit on a particular Clouddriver, but you could have a Clouddriver just for one account if it was a really big one. We've also noticed over time that most people now have small accounts, and large numbers of them.
Stormdriver is written in Go. In my testing, I set up one Clouddriver that had 100 AWS accounts, one that had 100 namespaces in Kubernetes, one that had no cloud accounts, just artifact accounts, and a fourth that had a mixture of AWS and Kubernetes accounts but no artifact accounts. During normal operation, across all my testing, Stormdriver took about 200 MB of RAM.
It's written to be fast, so it's written not to interfere. It doesn't query each of the Clouddrivers serially; it queries them in parallel and combines the responses. So the memory footprint is really based upon the total response size and the number of simultaneous queries going through Stormdriver.
It sits in the middle, between Orca, Gate, and Igor on one side and Clouddriver on the other. The Clouddriver source code is not modified, just the configuration. Clouddriver doesn't care who it's talking to, but Orca, Gate, and Igor's configuration is set up so that instead of talking to a Clouddriver, or an HA set of Clouddriver instances, they talk to Stormdriver, and Stormdriver is configured with the URLs for the various Clouddriver instances.
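As a sketch of that re-pointing (service names and keys here are illustrative, not taken from the talk), the consuming services override their Clouddriver base URL, and Stormdriver is given the real endpoints:

```yaml
# orca/gate/igor side (illustrative): point the clouddriver
# client at stormdriver instead of a clouddriver instance
clouddriver:
  baseUrl: http://stormdriver:7002

---
# stormdriver side (hypothetical config shape): the actual
# clouddriver instances to route and fan out to
clouddrivers:
  - name: java-aws
    url: http://clouddriver-aws:7002
  - name: go-kubernetes
    url: http://go-clouddriver:7002
```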
So why do this? Well, we needed to break it up. When we have 5,000 accounts in a single Clouddriver, it takes a long time to start, and even if we turn off the checking it does at startup, it can still take a significant amount of time to start. A configuration mistake, for example, might cause a very large downtime.
We would also prefer to have smaller pods, for all the reasons I covered before, and the standard sharded configuration isn't necessarily the best configuration: it doesn't necessarily add a lot of resilience, but it does add some complexity.
So how does this thing work? Well, it maintains an internal list of credentials and artifact credentials by polling every one of the Clouddriver instances, and it does that every 10 seconds. This is only for internal routing. It polls using a user that is currently anonymous, but configurable. That user has to be configured such that it has full access to read the list of all accounts.
It doesn't have to be able to modify anything, but the credentials response has to return everything, likewise the artifact credentials, and again, this is only for routing. We send that query to all the different Clouddriver instances at once and maintain an internal routing table. That routing table isn't used beyond internal routing, though; it's not returned to the client at any point.
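A minimal sketch of that internal routing table in Go (the names are mine, not Stormdriver's): each poll records which Clouddriver instance reported which account, and lookups consult only this map.

```go
package main

import (
	"fmt"
	"sync"
)

// routingTable maps account names to the base URL of the Clouddriver
// instance that reported them. It is rebuilt by a background poll
// (every 10 seconds in Stormdriver) and used only for routing; it is
// never returned to clients.
type routingTable struct {
	mu       sync.RWMutex
	accounts map[string]string // account name -> clouddriver base URL
}

func newRoutingTable() *routingTable {
	return &routingTable{accounts: map[string]string{}}
}

// update records the accounts one Clouddriver instance returned from
// its credentials endpoints.
func (t *routingTable) update(instanceURL string, accountNames []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, name := range accountNames {
		t.accounts[name] = instanceURL
	}
}

// lookup returns the instance that owns the account, if known.
func (t *routingTable) lookup(account string) (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	url, ok := t.accounts[account]
	return url, ok
}

func main() {
	table := newRoutingTable()
	table.update("http://clouddriver-aws:7002", []string{"aws-prod", "aws-dev"})
	table.update("http://go-clouddriver:7002", []string{"k8s-prod"})
	if url, ok := table.lookup("aws-dev"); ok {
		fmt.Println(url) // http://clouddriver-aws:7002
	}
}
```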
We have POST and PUT requests, which are typically mutating requests, except for the PUT request for the artifact fetch, which is really just a fancy GET, and then we have unknown requests. Similar items are things like credentials and artifact credentials: we send out a query to every Clouddriver that we know about, in parallel, and whenever they respond or time out, we respond with the superset of the data, merged across the returned maps.
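That parallel fan-out and merge can be sketched like this (a simplification with assumed names; the real thing issues HTTP requests with timeouts): query every instance concurrently and return the combined superset.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch retrieves one Clouddriver's answer for a list-style request;
// in Stormdriver this would be an HTTP GET with a timeout, returning
// nil on failure or timeout.
type fetch func() []map[string]any

// scatterGather runs every fetch in parallel and merges the results
// into one combined list: the superset of everyone's data.
func scatterGather(fetches []fetch) []map[string]any {
	var (
		mu       sync.Mutex
		combined []map[string]any
		wg       sync.WaitGroup
	)
	for _, f := range fetches {
		wg.Add(1)
		go func(f fetch) {
			defer wg.Done()
			items := f() // one instance's response
			mu.Lock()
			combined = append(combined, items...)
			mu.Unlock()
		}(f)
	}
	wg.Wait()
	return combined
}

func main() {
	a := func() []map[string]any { return []map[string]any{{"name": "aws-prod"}} }
	b := func() []map[string]any { return []map[string]any{{"name": "k8s-prod"}} }
	fmt.Println(len(scatterGather([]fetch{a, b}))) // 2
}
```

Because the fan-out is concurrent, the caller waits roughly as long as the slowest instance rather than the sum of all of them, which is why the added latency stays small.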
It's the same idea: we scatter-gather, send out the request to everybody, and combine the results. For singletons, we know who to send the request to. This is, for example: in this account, give me the details of this pod; or, in this account, give me the details of this VM instance. Then we know exactly where to send the request, so we can route it directly there. In some cases we don't know, so we ask everybody, and whoever returns interesting data, a non-empty response, for example, is the one we use.
If we get back an empty object, we treat that as a possible response, and if we get back something better, something with actual data, then we return that instead. A common case here is tasks, where we don't track who's operating the task. When we do a POST request, we get back an ID and send that to Orca; Orca will then poll, and we don't know where that ID belongs.
So we just ask everybody. It's fairly quick, so it doesn't really add much latency. Then there's pagination. I'm not going to cover pagination in depth, but the API is a little bit strange, so we just kind of punt on this. What we do is limit the number of responses to 2,000, and if we get more than that, we just stop keeping track of them and send that response back.
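That punt amounts to a hard cap while merging (2,000 here, as in the talk; the helper name is mine). A sketch:

```go
package main

import "fmt"

// maxResults is the cap described in the talk: anything past it is
// dropped rather than paginated.
const maxResults = 2000

// capMerge appends items from each instance's response until the cap
// is reached, then stops keeping track of the rest.
func capMerge(responses [][]string) []string {
	combined := make([]string, 0, maxResults)
	for _, items := range responses {
		for _, item := range items {
			if len(combined) >= maxResults {
				return combined
			}
			combined = append(combined, item)
		}
	}
	return combined
}

func main() {
	big := make([]string, 1500)
	// two instances returning 1500 items each still cap at 2000
	fmt.Println(len(capMerge([][]string{big, big}))) // 2000
}
```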
This works just fine without pagination: we send back 2,000, and if you ask for page two, you get nothing. POSTs and PUTs are handled because we know exactly which artifact account or which cloud account to send the request to, so we send it directly there. And unknown requests, well, we handle those differently.
If it's a GET request and we don't know what the URL is for, we pick one of the instances at random and send it there. That is logged, but past development time I've never seen that log message occur, unless I mistype a URL in a manual curl command; then, of course, I see it. POST and PUT requests, DELETE, and all the other HTTP verbs that can be used to modify state are handled in such a way that we will reject them if we don't know what they're for. This also includes Clouddriver account types that we don't know about right now.
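The unknown-request policy reduces to a check on the HTTP method (a sketch with assumed names; the random pick stands in for Stormdriver's real selection):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
)

// routeUnknown decides what to do with a request whose URL has no
// routing rule: safe GET-style requests go to a random instance (and
// are logged), while anything that could mutate state is rejected.
func routeUnknown(method string, instances []string) (string, error) {
	switch method {
	case http.MethodGet, http.MethodHead:
		target := instances[rand.Intn(len(instances))]
		fmt.Printf("unknown GET routed randomly to %s\n", target)
		return target, nil
	default:
		// POST, PUT, DELETE, PATCH, ...: refuse rather than guess.
		return "", errors.New("unknown mutating request rejected")
	}
}

func main() {
	instances := []string{"http://clouddriver-a:7002", "http://clouddriver-b:7002"}
	if _, err := routeUnknown(http.MethodPost, instances); err != nil {
		fmt.Println(err)
	}
}
```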
Right now we support AWS fully and Kubernetes fully; I believe Azure and GCP work, but I have not tested those fully at this point. Those are the only four that Stormdriver currently supports. It's very easy to add others; I just don't have access to test them.
So, security. Well, the internal routing table that we use is never released: we use it for routing, but we don't use it in any responses. We respect the X-SPINNAKER-USER header, which is set by all the other components when they call us, and we pass that header along when we do the scatter-gather requests or the individual requests. That way the RBAC model is preserved; Stormdriver itself doesn't have to care about the RBAC model.
Another really nice advantage of this is that we can use Go Clouddriver, and there's a little bit of information about it here. I won't go over it in great detail, but it allows us to use a Kubernetes-only Clouddriver, written in Go, that's much leaner than the Java one. It takes much less time to do its tasks, it goes directly to the cluster instead of caching and having a delay, and you can run multiple copies of it.
It shares state through a database, so it's very, very resilient, and overall it's just a much easier Kubernetes-only Clouddriver to work with. But it does have limitations, and therefore we want to mix: we have some things we use the Java side for, and some things we use Go Clouddriver for in production. Stormdriver allows us to mix and match those, and put accounts in whichever place we think is most appropriate.
So, our findings. Well, once again, we haven't really been using the deployment split very much recently, because it doesn't add as much value anymore.
We are able to mix in Go Clouddriver, and possibly other single-purpose Clouddriver implementations in the future, and programming in Go is quite fun. I don't actually have a demo; I have an anti-demo, and the reason I call it an anti-demo is that it's really hard to demo an API, so instead I'm going to show some traces.
This is using Jaeger, and it shows the various responses. The orange responses are the important ones. On the other ones, the durations are valid, but the times are slightly offset for some reason, probably because we're in a different time zone; for whatever reason, the clocks seemed slightly off. You can see that the overall query that Stormdriver issued here, to fetch a list of credentials, was 11.75 milliseconds, and the longest request from Stormdriver to one of the Clouddrivers was 11.42, so the overhead is only about 0.3 milliseconds.
So that's quite good, and that is the internal query that we issue; overall, very little latency is added to the responses. This is a real-life example where Igor is asking us for a list of credentials. We added three milliseconds, and I'm willing to take a three-millisecond hit on a query like this in order to make certain that we combine the responses properly. So where can you get this thing? Well, you can get it from github.com/OpsMx/stormdriver.
It's under an Apache 2.0 license, and I'm happy to take pull requests. Reach out on Slack, send me email, whatever you'd like to do, and I am happy to discuss. Thank you very much.