From YouTube: Challenges in building and operating a Multi-Cloud-Provider Platform - Jörg Schad, ArangoDB
Description
Building a cloud-agnostic platform used to be a challenging task as one had to deal with a large number of different cloud APIs and service offerings.
Website: https://www.arangodb.com/
Organized by @Microsoft @kubermatic7173 @SysEleven
Thanks to our sponsors @CapgeminiGlobal, @gardenio, @sysdig, @SUSE, @anynines, @redhat, nginx, serve-u
So yeah, as you can figure out, I tried to squeeze as many buzzwords into the title as possible. In short, we'll talk about how to manage containers, and especially stateful services, across multiple cloud providers. What we are doing: we are building a managed database service across multiple cloud providers, basically targeting the big ones, AWS, Google, and Azure, and Kubernetes is a great abstraction for that.
So what we decided is that we don't want to do all the work ourselves: we're actually using the managed Kubernetes solutions of the big cloud providers, because we don't have a team for that, and we also want to stay focused on operating database services. But Kubernetes is not exactly Kubernetes; it's challenging, because we still need to differentiate between the different cloud providers. We'll go into more detail about security, authentication, authorization, and networking. That last one is probably the team's favorite.
When we did a quick survey among the team members about their favorite differences, load balancers came out pretty much on top. We'll also talk about storage, handling different Kubernetes versions, container runtimes, logging, cluster management, etc. Let's see how much we can squeeze into half an hour.
So in short, why do we actually care? ArangoDB is an open-source graph database. We also support other data models, such as documents and full-text search, as well as graph analytics using Google's Pregel framework, etc. But I think the interesting part is that we are a distributed system ourselves. Interestingly, our architecture is similar to the Kubernetes architecture itself, and we invested very early on in integrating with it.
This is also how I got to know ArangoDB: in the Mesos ecosystem, when Apache Mesos was still around, and later in the Kubernetes ecosystem. Back then I wasn't at ArangoDB yet; I was still over at Mesosphere, on the partner side. But already back then, one question was a big topic:
How can we actually operate persistent services on top of all those container systems? This is where we have kube-arangodb, our open-sourced operator, and the managed service Oasis, which I'll talk about next, is itself built on Kubernetes and therefore also leverages kube-arangodb.
Why do we care? There's a big trend in the field. I've been writing database systems for the past, what, probably 10 to 12 years, and in the beginning we were really targeting a static setup: some servers in a basement, and those servers were pretty stable. There was a network switch in the middle, and we had uptimes of a year or so for some of the processes running there.
Nowadays it's a much more dynamic infrastructure: we have all those cloud providers, AWS telling us to please relocate somewhere, Kubernetes telling us to move a pod somewhere else. For me, this merging of cloud native and databases is one of the coolest things happening right now, because it really brings together all my passions. But it also means changes internally in the database: we are currently redesigning some core parts of the database to deal with what I would call the dynamic infrastructure aspect. That, however, is not so much what this talk is about; this talk is more about the other side:
How can I then operate it? If we look at the cloud native landscape, most big databases nowadays show up there as cloud native databases. And as I said, most of our customers nowadays, even if they don't use a managed service, are running in some kind of cloud environment: using Docker images, using the Kubernetes operator, or deploying it themselves.
For us, this is where our managed service comes in, for people who don't want to run it themselves. As already mentioned, it is built on the managed Kubernetes solutions of the big cloud providers, allowing us to really focus on what we do well: managing, scaling, and operating database systems. It has open APIs, and the Terraform provider was open sourced just last week, so there's a lot to play with. This talk is now basically about what's happening underneath:
What are our challenges in operating our infrastructure across those big cloud providers while keeping that mostly invisible to the user? Why should you care? I think all of us here know about Kubernetes. Containers already provide a great abstraction: we can move them anywhere, we can deploy them anywhere, and we don't have to care too much about what's underneath. Then we get container orchestration on top, allowing us to actually do that at scale.
That gives us scheduling, resource management, and then also service management on top. The next level of abstraction is how we run those containers across environments, and this is where Kubernetes really comes in, abstracting away the differences between the cloud providers.
So we already see a lot of abstraction happening. Maybe the last layer, and this is a follow-up talk we actually gave at the last KubeCon, is about different Kubernetes clusters. An ex-colleague of mine used to say that Kubernetes clusters are like Pringles: you can't have just one. We actually checked; I think the last time we counted it was 40 or so, including staging. So there are a lot of different clusters going on, and we are also working on the next question:
How can we actually migrate databases across clusters? But that's the follow-up talk. So, for us: we didn't want to manage all this Kubernetes ourselves.
As I said, we have 40-plus clusters, so it would just be ridiculous to manage that ourselves. We decided to use the managed versions from the different cloud providers, which really helps us abstract away a lot of the operational challenges. On the other hand, it also comes at a cost, of course: because it's managed by them, we give up a certain amount of control.
We cannot set certain parameters; we'll see examples of that during this talk. Maybe the last question before we really dive into the challenges: why should we actually do multi-cloud at all? Isn't one cloud provider enough? A quick question to the audience: how many of you are actually using different cloud providers for one product or one service?
Okay, some of you. For most, using one cloud provider is probably sufficient, and it will still simplify your life. For us it was a requirement, because it's a requirement of our customers. Many of our customers basically tell us: we want to run on a certain cloud provider, or we don't want to run on another cloud provider. Such a company policy can be inclusive or exclusive; Amazon might be a competitor for some of them. And then there's also the question of where you keep your data.
Some also don't want a dependency on a specific vendor, and then there is the flexibility aspect. But one last point: for a lot of people, I feel, it's a buzzword. So even though I'll talk about what it means and where the challenges are, always keep in mind: do you really need it? Because running across different cloud providers will add to your operational challenges.
Okay, finally, the actual content. What are the challenges we've seen? We briefly went over them in the beginning, so let's jump right in, starting with the TL;DR, the blowing-off-some-steam slide. On Amazon, on EKS, our biggest challenge, our biggest complaint if we were to write them a letter, is probably resource management: they create a lot of resources on the fly.
If you create a certain instance, they basically create a lot of resources behind it for you. That would be okay, but it gets harder when you have to remove everything, because you have to follow a certain order, and that order is not always intuitive, which makes removal challenging. I think we now have that pretty much under control, but in the beginning, encoding all of that was quite some effort.
Not all resources have tags, which also plays into this, and the error handling is a bit hard: some matching on error strings is needed for us to differentiate between the different error cases.
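As an illustration of what that string matching can look like, here is a minimal Python sketch; the error messages and categories below are hypothetical placeholders for illustration, not verbatim AWS responses.

```python
# Hedged sketch: classifying cloud-API deletion errors by message substrings.
# The substrings and category names are illustrative assumptions.

def classify_deletion_error(message: str) -> str:
    """Map a raw error string to a coarse category we can act on."""
    msg = message.lower()
    if "dependencyviolation" in msg or "has dependencies" in msg:
        return "retry-after-dependents"   # delete dependents first, then retry
    if "notfound" in msg or "does not exist" in msg:
        return "already-gone"             # treat as success (idempotent delete)
    if "throttl" in msg or "rate exceeded" in msg:
        return "retry-with-backoff"
    return "fatal"                        # surface to an operator
```

The point is not the exact substrings but that the caller has to pattern-match on free-form text instead of switching on a structured error code.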
Google is actually better in those aspects.
It also feels newer; I think for a lot of these aspects, if you look at the history of when those services were built, that makes a difference: Google and Azure feel much newer. What most annoys us about Google is probably the aggressive update policy for the Kubernetes clusters. I mean, we run a managed service ourselves, and from that perspective we would love to force people to upgrade from one version to the next as soon as possible, because it simplifies our operational life. But on the other hand, when Google forces us to upgrade a Kubernetes cluster, that doesn't exactly make Oscar's day; there can always be some challenges in there.
For Azure, I think the biggest point is that they've really, greatly improved. You still feel that it's a newer service, but especially over the last two years it has improved a lot, and many of the small issues we were still seeing two years ago are gone. Probably the biggest remaining complaints are the limited resource quotas: for example, I think it's now 30 VM scale sets we can have, and across different environments that is just too little for us. Then there is slow persistent volume attachment, and the cluster autoscaler, which by now is also not that bad anymore.
Okay, first point: resource creation. As already mentioned, we looked at what we are actually creating when we set up one of those data clusters plus an ArangoDB database cluster. On Amazon we set up a VPC, internet gateway, NAT gateway, subnets, routing tables, security groups, etc., and of course also an EKS cluster.
Probably our biggest challenge is that a lot of resources are created on the fly as dependencies, and removing them along with those dependencies wasn't easy in the beginning. We had to learn a lot of things the hard way, when resources weren't removed or when we got errors on removal, also because not everything is properly tagged. On Google and Azure this is a bit simpler: on Google we basically create the VPC, the GKE cluster, and the node pool.
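To make the ordering problem concrete, here is a small sketch that derives a safe deletion order by reversing a topological sort of the dependency graph. The graph below is a simplified illustration, not the full set of resources EKS creates behind the scenes.

```python
# Hedged sketch: AWS resources must be deleted in reverse dependency order.
from graphlib import TopologicalSorter

# node -> resources it depends on (which must therefore outlive it)
depends_on = {
    "eks_cluster":      ["subnets", "security_groups"],
    "nat_gateway":      ["subnets"],
    "subnets":          ["vpc"],
    "security_groups":  ["vpc"],
    "internet_gateway": ["vpc"],
    "vpc":              [],
}

# static_order() yields dependencies first (vpc, then subnets, ...),
# so the deletion order is simply the reverse of the creation order.
creation_order = list(TopologicalSorter(depends_on).static_order())
deletion_order = list(reversed(creation_order))
```

In practice the hard part is discovering the implicitly created resources and their edges at all, especially when they are not tagged; the ordering itself is then mechanical.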
On Azure, as I said, we still sometimes face the challenge that we can't create as many resources as we would like. The number of VM scale sets has already increased over the years; I think it used to be 10 about a year or a year and a half ago, so 30 is already much better, but still quite limited, and we see similar limits on some of the other resources.
Managed Kubernetes: this is probably not so much about the differences between cloud providers, but it's something to keep in mind if you want to use a managed Kubernetes solution. It's great, it actually takes a lot off your plate, but on the other hand it comes with its own challenges. There are forced upgrades.
This is always bad, because you want control over how you move your volumes and how you move your database servers; of course, you do have to stay on a supported version. Then there's the availability of different versions, and access to the Kubernetes API server options: there are command-line options which we would like to set but simply cannot, for example the authorization webhook,
because we're using a managed Kubernetes service and don't have that control. Okay, authentication and authorization: this varies quite a bit across the different cloud providers, and evaluating it is kind of a continuous thing for us. There are a number of open-source and proprietary solutions that claim to do it the same way everywhere; so far, when reviewing a number of those open-source projects, they actually turned out to be insecure
after all, and not really meeting our requirements. Keep in mind again that on managed Kubernetes you can't necessarily set all the options needed to use them. So currently our solution is service accounts plus the Oasis (ArangoDB Cloud) authentication system, interlocking with each other. Next: logging and audit logs.
Again, each cloud provider has their own sink, but with Grafana Loki I think we are now in a pretty good state to capture those things, abstracted away from the different cloud providers. There you just need to find a solution, and there are enough mature open-source solutions out there that can help you with that.
Okay, now we're getting to the more hardcore parts, for example storage. Storage, again, is probably more important for us than if you're running a stateless service with just some front-end containers; for us, storage performance is critical. So in the beginning we did quite extensive studies on the different performance characteristics and on how we can abstract them away for our users, basically telling them:
We have different performance tiers, and we try to keep those performance tiers roughly comparable, for example in IOPS, throughput, etc., across the different cloud providers.
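A sketch of what such a tier abstraction can look like in code. The tier names and the IOPS numbers here are made up for illustration; only the volume-type names (gp3, pd-balanced, pd-ssd, StandardSSD_LRS, Premium_LRS) are the providers' real ones.

```python
# Hedged sketch: resolve a user-facing performance tier to provider-specific
# volume settings. Tier names and numbers are illustrative assumptions.

VOLUME_TIERS = {
    "standard": {
        "aws":   {"type": "gp3", "iops": 3000},
        "gcp":   {"type": "pd-balanced"},
        "azure": {"type": "StandardSSD_LRS"},
    },
    "performance": {
        "aws":   {"type": "gp3", "iops": 8000},
        "gcp":   {"type": "pd-ssd"},
        "azure": {"type": "Premium_LRS"},
    },
}

def volume_config(tier: str, provider: str) -> dict:
    """Look up the provider-specific volume parameters for a tier."""
    return VOLUME_TIERS[tier][provider]
```

The design choice is to benchmark each provider's volume types once, then let users pick a tier rather than a provider-specific volume type.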
This took some playing around, and it really differs across cloud providers, or even within one cloud provider: for example, whether IOPS is configurable at all varies across volume types. At AWS you can configure IOPS for some volume types but not for others, and the same holds for other cloud providers. Also on AWS itself, this is one of those things where, yes:
You still need to be aware of what you're using. For example, gp3 volumes still require their own CSI controller, so we still run one, whereas gp2 volumes are handled by the in-tree driver. We'll see this pattern with other parts as well.
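For example, a gp3 StorageClass served by the out-of-tree EBS CSI driver might look like the following sketch; the IOPS and throughput values are illustrative, not our production settings.

```yaml
# Hedged example: gp3 needs the out-of-tree EBS CSI driver
# (provisioner ebs.csi.aws.com); gp2 also works with the legacy
# in-tree kubernetes.io/aws-ebs provisioner. Values are illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer
```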
This slightly violates the idea of Kubernetes as an abstraction layer; it's a bit like installing special drivers for special hardware, and you have to figure out what you need on which cloud provider. From an operational perspective, it simply means we have to monitor and operate those things differently. External storage, again, differs between the cloud providers.
To state that explicitly: blob storage is just for our backups, which we also write across different regions. So there, again, we actually need a proprietary solution, or rather multiple implementations for the different cloud providers.
Networking: as mentioned, when I did a quick survey with the team, load balancers actually turned out to be the top-ranked, most annoying difference between the different cloud providers. Just in terms of setup: what is supported at all, what do we need for setup, DNS, etc.? What is the performance? What are the different timeouts?
These actually vary quite a lot, first of all between different cloud providers, but then also within one: AWS especially offers multiple choices. I think this is a pattern we've been seeing over and over: AWS is simply, let's say, the most mature one, to frame it like that, and they often offer multiple choices with different characteristics.
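As a small example of those differences: even the common request "give me an internal load balancer" needs a different Service annotation per provider. The sketch below uses commonly documented annotation keys, but the exact keys and values can vary with provider and Kubernetes version, so treat them as assumptions to verify.

```python
# Hedged sketch: per-provider Service annotations for an internal load balancer.
INTERNAL_LB_ANNOTATIONS = {
    "aws":   {"service.beta.kubernetes.io/aws-load-balancer-internal": "true"},
    "gcp":   {"networking.gke.io/load-balancer-type": "Internal"},
    "azure": {"service.beta.kubernetes.io/azure-load-balancer-internal": "true"},
}

def service_manifest(name: str, provider: str) -> dict:
    """Build a minimal Service of type LoadBalancer with the provider's annotation."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "annotations": INTERNAL_LB_ANNOTATIONS[provider]},
        "spec": {
            "type": "LoadBalancer",
            "ports": [{"port": 8529}],   # ArangoDB's default port
            "selector": {"app": name},
        },
    }
```

So the Service object is portable in shape, but the behavior-controlling annotations are not.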
So then you have to choose, again, what you need. For example, some of them actually require a special controller, whereas others are readily available in-tree. Internal networking is also different. By internal network I mean: what happens when we transfer data between different regions? What is supported really varies.
For example, AWS offers Transit Gateways and VPC peering, but that still means we have to be aware of how we transfer data between nodes, for example for multi-region backups, or now that we are working on cluster migration between different regions.
Private endpoints basically mean that if our customers run their applications on AWS as well, they can use a private endpoint, so the data traffic simply stays within the AWS (or Azure, or another provider's) network. There are differences here too: with AWS, those private endpoints only work within one region, whereas with Azure and Google they work across regions. That was also a learning when we initially started working with them.
Networking for multi-tenancy: the Oasis data clusters, where the databases are actually running, are multi-tenant.
On one Kubernetes cluster we might have multiple ArangoDB clusters, and we still need isolation between them, especially because a single VPC is used for the entire data cluster (one Kubernetes cluster only runs within one VPC). So we obviously have a strong need for network separation, and our solution for that is Cilium. I think we covered what we do with Cilium and how we use it in a bit more depth in our KubeCon talk, but basically we have been pretty happy with it.
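As a sketch of what that isolation can look like with Cilium: a CiliumNetworkPolicy that only lets pods of the same tenant talk to the tenant's database pods. The `tenant` label and the namespace name are hypothetical, chosen for illustration.

```yaml
# Hedged sketch: per-tenant isolation on a shared cluster with Cilium.
# Label and namespace names are illustrative assumptions.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-tenant-a
  namespace: tenant-a
spec:
  endpointSelector:
    matchLabels:
      tenant: tenant-a
  ingress:
    - fromEndpoints:
        - matchLabels:
            tenant: tenant-a
```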
A
Initially,
it
was
a
bit.
The
support
for
different
Cloud
providers
was
a
bit
varying.
So,
for
example,
especially
Azure
was
not
as
well
supported,
but
that
has
also
really
improved
over
the
last
year.
So
by
now
we
actually
feel
it
works
pretty
much
the
same
across
all
Cloud
providers.
This
is
just
I
think
the
mess
of
the
message
from
this
slide
should
be.
Okay, what about on-prem? I'll just briefly skip over that, but we looked into different solutions on-prem as well, and even with certified Kubernetes distributions you end up with differences between all those Kubernetes solutions out there. Will it get any better next year? I'm not so sure. One thing I think is interesting there is where we are going to end up with all the different container runtimes.
What does that mean for us, also in terms of security? As I said, for us it's really crucial to have isolation between different deployments and between different containers. So of course those container runtimes are interesting for us, and let's see whether there is going to be support for them on the managed Kubernetes offerings, and how we move forward there.
I was actually a bit quicker than I thought, but thanks for listening. I'm happy to jump back to any of the slides for questions or any feedback. But as we have five minutes, maybe a question for the people who just raised their hands: where do you see the biggest pitfalls between the different cloud providers?